Tracking down log segment corruption

From: Charles Duffy <Charles_Duffy(at)messageone(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Tracking down log segment corruption
Date: 2008-12-22 00:16:28
Message-ID: gimm8t$lh1$1@ger.gmane.org
Lists: pgsql-general

Howdy, all.

I have a log-shipping replication environment (based on PostgreSQL
8.3.4) using pg_lesslog plus LZOP to compress archived segments (kept
around long-term for possible PITR use). The slave recently fell out of
synchronization: it restored a series of segments and then failed with
SIGABRT, doing the same after each automated restart:

[2-1] LOG: starting archive recovery
[3-1] LOG: restore_command = '/opt/extropy/postgres/bin/restore-segment
/exports/pgwal/segments.recoveryq %f %p %r'
[4-1] LOG: restored log file "00000001000000140000000B" from archive
[5-1] LOG: automatic recovery in progress
[6-1] LOG: restored log file "000000010000001400000009" from archive
[7-1] LOG: redo starts at 14/9127270
[8-1] LOG: restored log file "00000001000000140000000A" from archive
[9-1] LOG: restored log file "00000001000000140000000B" from archive
[10-1] LOG: restored log file "00000001000000140000000C" from archive
[11-1] LOG: restored log file "00000001000000140000000D" from archive
[12-1] LOG: restored log file "00000001000000140000000E" from archive
[13-1] LOG: restored log file "00000001000000140000000F" from archive
[14-1] LOG: restored log file "000000010000001400000010" from archive
[15-1] WARNING: specified item offset is too large
[15-2] CONTEXT: xlog redo insert_upper: rel 1663/16384/17763; tid 2960/89
[16-1] PANIC: btree_insert_redo: failed to add item
[16-2] CONTEXT: xlog redo insert_upper: rel 1663/16384/17763; tid 2960/89
[1-1] LOG: startup process (PID 17310) was terminated by signal 6: Aborted
[2-1] LOG: aborting startup due to startup process failure

Replacing only 000000010000001400000010 with a pristine (never processed
with pg_compresslog) copy made no difference, but doing so for all of
the involved segments permitted the slave to pick up where it left off
and continue replaying.

It seems clear, then, that pg_lesslog was responsible in some way for
this corruption, and that there is some germane difference between the
original and { pg_compresslog | pg_decompresslog } version of one of the
involved segments. I've tried using xlogdump
[http://xlogviewer.projects.postgresql.org/] on both versions to look
for differing output, but it segfaults even with the "good" log segments.
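
The "rel 1663/16384/17763" in the CONTEXT lines above is tablespace OID /
database OID / relfilenode. A rough way to resolve those to names (a sketch
using the values from this log; the last query has to be run while connected
to the database the second one identifies):

select spcname from pg_tablespace where oid = 1663;   -- tablespace
select datname from pg_database where oid = 16384;    -- database
-- then, connected to that database:
select relname, relkind from pg_class where relfilenode = 17763;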

Does anyone have suggestions as to how I should go about tracking down
the root cause of this issue?

Thanks!


From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Tracking down log segment corruption
Date: 2010-05-02 16:30:07
Message-ID: 28427922.post@talk.nabble.com
Lists: pgsql-general


I just ran into the same problem. Both servers are running 8.4.3, and
the standby server had been running for 2 days, processing many thousands of
logs successfully. Here's my error:

4158 2010-05-02 11:12:09 EDT [26445]LOG: restored log file
"0000000100003C77000000C3" from archive
4158 2010-05-02 11:12:09 EDT [26446]LOG: restored log file
"0000000100003C77000000C4" from archive
4158 2010-05-02 11:12:09 EDT [26447]WARNING: specified item offset is too
large
4158 2010-05-02 11:12:09 EDT [26448]CONTEXT: xlog redo insert: rel
48777166/22362/48778276; tid 2/2
4158 2010-05-02 11:12:09 EDT [26449]PANIC: btree_insert_redo: failed to
add item
4158 2010-05-02 11:12:09 EDT [26450]CONTEXT: xlog redo insert: rel
48777166/22362/48778276; tid 2/2
4151 2010-05-02 11:12:09 EDT [1]LOG: startup process (PID 4158) was
terminated by signal 6: Aborted
4151 2010-05-02 11:12:09 EDT [2]LOG: terminating any other active server
processes

Any suggestions?

Charles Duffy wrote:
>
>
> [14-1] LOG: restored log file "000000010000001400000010" from archive
> [15-1] WARNING: specified item offset is too large
> [15-2] CONTEXT: xlog redo insert_upper: rel 1663/16384/17763; tid 2960/89
> [16-1] PANIC: btree_insert_redo: failed to add item
> [16-2] CONTEXT: xlog redo insert_upper: rel 1663/16384/17763; tid 2960/89
> [1-1] LOG: startup process (PID 17310) was terminated by signal 6:
> Aborted
> [2-1] LOG: aborting startup due to startup process failure
>
>



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gordon Shannon <gordo169(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Tracking down log segment corruption
Date: 2010-05-02 17:02:03
Message-ID: 13174.1272819723@sss.pgh.pa.us
Lists: pgsql-general

Gordon Shannon <gordo169(at)gmail(dot)com> writes:
> I just ran into the same problem. Both servers are running 8.4.3, and
> the standby server had been running for 2 days, processing many thousands of
> logs successfully. Here's my error:

> 4158 2010-05-02 11:12:09 EDT [26445]LOG: restored log file
> "0000000100003C77000000C3" from archive
> 4158 2010-05-02 11:12:09 EDT [26446]LOG: restored log file
> "0000000100003C77000000C4" from archive
> 4158 2010-05-02 11:12:09 EDT [26447]WARNING: specified item offset is too
> large
> 4158 2010-05-02 11:12:09 EDT [26448]CONTEXT: xlog redo insert: rel
> 48777166/22362/48778276; tid 2/2
> 4158 2010-05-02 11:12:09 EDT [26449]PANIC: btree_insert_redo: failed to
> add item
> 4158 2010-05-02 11:12:09 EDT [26450]CONTEXT: xlog redo insert: rel
> 48777166/22362/48778276; tid 2/2
> 4151 2010-05-02 11:12:09 EDT [1]LOG: startup process (PID 4158) was
> terminated by signal 6: Aborted
> 4151 2010-05-02 11:12:09 EDT [2]LOG: terminating any other active server
> processes

Hmm ... AFAICS the only way to get that message when the incoming TID's
offsetNumber is only 2 is for the index page to be completely empty
(not zeroes, else PageAddItem's sanity check would have triggered,
but valid and empty). What that smells like is a software bug, like
failing to emit a WAL record in a case where it was necessary. Can you
identify which index this was? (Look for relfilenode 48778276 in the
database with OID 22362.) If so, can you give us any hints about
unusual things that might have been done with that index?

> Any suggestions?

As far as recovering goes, there's probably not much you can do except
resync the standby from scratch. But it would be nice to get to the
bottom of the problem, so that we can fix the bug. Have you got an
archive of this xlog segment and the ones before it, and would you be
willing to let a developer look at them?
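
For reference, a resync on this branch is usually bracketed with the
base-backup functions while the data directory is copied at the file level;
a minimal sketch (the label is arbitrary, and the actual copy happens
outside SQL):

-- on the master, before copying the data directory to the standby:
select pg_start_backup('standby_resync');
-- ... file-level copy of the master's data directory runs here ...
-- on the master, once the copy has finished:
select pg_stop_backup();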

regards, tom lane


From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Tracking down log segment corruption
Date: 2010-05-02 17:55:26
Message-ID: s2qb2dd93301005021055w93b80fb6q9ec27b826a550f7d@mail.gmail.com
Lists: pgsql-general

On Sun, May 2, 2010 at 11:02 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

>
>
> Hmm ... AFAICS the only way to get that message when the incoming TID's
> offsetNumber is only 2 is for the index page to be completely empty
> (not zeroes, else PageAddItem's sanity check would have triggered,
> but valid and empty). What that smells like is a software bug, like
> failing to emit a WAL record in a case where it was necessary. Can you
> identify which index this was? (Look for relfilenode 48778276 in the
> database with OID 22362.) If so, can you give us any hints about
> unusual things that might have been done with that index?
>

Interesting. There is no pg_class entry for 22362. There is, however, an
entry for that filenode. It's an index I created Sat AM, about 6AM.

select oid, * from pg_class where relfilenode=48778276;
-[ RECORD 1 ]--+---------------------------
oid | 48777488
relname | cts_20100501_topic_date_nk
relnamespace | 2200
reltype | 0
relowner | 16412
relam | 403
relfilenode | 48778276
reltablespace | 48777166
relpages | 2476
reltuples | 58879
reltoastrelid | 0
reltoastidxid | 0
relhasindex | f
relisshared | f
relistemp | f
relkind | i
relnatts | 2
relchecks | 0
relhasoids | f
relhaspkey | f
relhasrules | f
relhastriggers | f
relhassubclass | f
relfrozenxid | 0
relacl | null
reloptions | null

Possibly relevant facts:

- The WSB (warm standby) server went active on Friday around 3:30 PM.
- On Friday evening, I added about 11 tablespaces. I noted the new files on
the WSB; no problems.
- On Sat morning, I created a partitioned table cts_20100501 (inherits from
another table) and 4 indexes.
- This morning, I was doing some table maintenance on the master and
discovered I had created this table and its indexes in the wrong tablespace.
I wanted the table in ts29, but had it in ts30. Vice versa for the
indexes. So I moved them. This is from my command history:

alter index cts_20100501_natural_uk set tablespace ts30;
alter index cts_20100501_pkey set tablespace ts30;
alter index cts_20100501_topic_date_nk set tablespace ts30;
alter index cts_20100501_updated_nk set tablespace ts30;
alter table cts_20100501 set tablespace ts29;

These commands worked fine on the master, yet this seems suspiciously
relevant.
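
One way to double-check where the table and its indexes actually ended up is
to join pg_class to pg_tablespace (a sketch; a NULL spcname means the
relation sits in the database's default tablespace):

select c.relname, c.relkind, t.spcname
from pg_class c
left join pg_tablespace t on t.oid = c.reltablespace
where c.relname like 'cts_20100501%';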

> > Any suggestions?
>
> As far as recovering goes, there's probably not much you can do except
> resync the standby from scratch. But it would be nice to get to the
> bottom of the problem, so that we can fix the bug. Have you got an
> archive of this xlog segment and the ones before it, and would you be
> willing to let a developer look at them?
>
>
Before I received your reply, I had already started the re-sync and,
unfortunately, had already deleted the WAL segments in question.
If it happens again, I will certainly keep them, and would be happy to share
them.

Regards,
Gordon


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gordon Shannon <gordo169(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Tracking down log segment corruption
Date: 2010-05-02 18:10:01
Message-ID: 14283.1272823801@sss.pgh.pa.us
Lists: pgsql-general

Gordon Shannon <gordo169(at)gmail(dot)com> writes:
> Interesting. There is no pg_class entry for 22362.

No, this would be a pg_database row with that OID. But it looks like
you found the relevant index anyway.
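
For example, the middle number in "rel 48777166/22362/48778276" resolves
with:

select datname from pg_database where oid = 22362;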

> ... There is, however, an
> entry for that filenode. It's an index I created Sat AM, about 6AM.
> ...
> - This morning, I was doing some table maintenance on the master and
> discovered I had created this table and its indexes in the wrong tablespace.
> I wanted the table in ts29, but had it in ts30. Vice versa for the
> indexes. So I moved them. This is from my command history:

> alter index cts_20100501_natural_uk set tablespace ts30;
> alter index cts_20100501_pkey set tablespace ts30;
> alter index cts_20100501_topic_date_nk set tablespace ts30;
> alter index cts_20100501_updated_nk set tablespace ts30;
> alter table cts_20100501 set tablespace ts29;

> These commands worked fine on the master, yet this seems suspiciously
> relevant.

Yeah, perhaps so. What time did the failure on the standby occur (how
long after you did those moves)? Is it reasonable to assume that this
was the first subsequent access to the index?

regards, tom lane


From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Tracking down log segment corruption
Date: 2010-05-02 18:36:58
Message-ID: t2ob2dd93301005021136l1164130u22f16a8d0f94667f@mail.gmail.com
Lists: pgsql-general

On Sun, May 2, 2010 at 12:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> No, this would be a pg_database row with that OID. But it looks like
> you found the relevant index anyway.

Yup, realized that on second reading.

> > These commands worked fine on the master, yet this seems suspiciously
> > relevant.
>
>
> Yeah, perhaps so. What time did the failure on the standby occur (how
> long after you did those moves)? Is it reasonable to assume that this
> was the first subsequent access to the index?
>
>
Bingo. Yes it is reasonable. It was 25 seconds between my altering the
index in question and the server crash.

My local commands (in MDT, plus my machine is 15 sec ahead of the server):

09:10:52> alter index cts_20100501_natural_uk set tablespace ts30;
ALTER INDEX
Time: 787.790 ms
09:11:41> alter index cts_20100501_pkey set tablespace ts30;
ALTER INDEX
Time: 468.526 ms
09:11:51> alter index cts_20100501_topic_date_nk set tablespace ts30;
ALTER INDEX
Time: 385.322 ms
09:11:59> alter index cts_20100501_updated_nk set tablespace ts30;
ALTER INDEX
Time: 963.150 ms
09:12:10> alter table cts_20100501 set tablespace ts29;
ALTER TABLE

And from the wsb log (times in EDT):

4158 2010-05-02 11:12:09 EDT [26446]LOG: restored log file
"0000000100003C77000000C4" from archive
4158 2010-05-02 11:12:09 EDT [26447]WARNING: specified item offset is too
large
4158 2010-05-02 11:12:09 EDT [26448]CONTEXT: xlog redo insert: rel
48777166/22362/48778276; tid 2/2
4158 2010-05-02 11:12:09 EDT [26449]PANIC: btree_insert_redo: failed to
add item


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gordon Shannon <gordo169(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Tracking down log segment corruption
Date: 2010-05-02 18:52:18
Message-ID: 15061.1272826338@sss.pgh.pa.us
Lists: pgsql-general

Gordon Shannon <gordo169(at)gmail(dot)com> writes:
> Bingo. Yes it is reasonable. It was 25 seconds between my altering the
> index in question and the server crash.

Sounds like we have a smoking gun. Could you show all your non-default
postgresql.conf settings on the master? I'm wondering about
full_page_writes in particular, but might as well gather all the
relevant data while we're at it. Easiest way is:

select name,setting from pg_settings where source not in ('default','override');

regards, tom lane


From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Tracking down log segment corruption
Date: 2010-05-02 19:47:02
Message-ID: h2kb2dd93301005021247k3cb308d4v5de10ad77c5b051c@mail.gmail.com
Lists: pgsql-general

On Sun, May 2, 2010 at 12:52 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Gordon Shannon <gordo169(at)gmail(dot)com> writes:
> > Bingo. Yes it is reasonable. It was 25 seconds between my altering the
> > index in question and the server crash.
>
> Sounds like we have a smoking gun. Could you show all your non-default
> postgresql.conf settings on the master? I'm wondering about
> full_page_writes in particular, but might as well gather all the
> relevant data while we're at it. Easiest way is:
>
> select name,setting from pg_settings where source not in
> ('default','override');
>
> regards, tom lane
>

archive_command | cp %p /var/lib/pgsql/wal/%f.wrk; mv /var/lib/pgsql/wal/%f.wrk /var/lib/pgsql/wal/%f
archive_mode | on
authentication_timeout | 20
autovacuum_analyze_scale_factor | 0.05
autovacuum_freeze_max_age | 1500000000
autovacuum_vacuum_cost_delay | -1
autovacuum_vacuum_scale_factor | 0.1
checkpoint_segments | 128
checkpoint_warning | 300
commit_siblings | 1
cpu_index_tuple_cost | 0.001
cpu_operator_cost | 0.0005
cpu_tuple_cost | 0.003
DateStyle | ISO, MDY
deadlock_timeout | 5000
default_text_search_config | pg_catalog.english
effective_cache_size | 6291456
lc_messages | en_US.UTF-8
lc_monetary | en_US.UTF-8
lc_numeric | en_US.UTF-8
lc_time | en_US.UTF-8
listen_addresses | *
log_autovacuum_min_duration | 0
log_destination | stderr
log_directory | /var/log/postgres
log_filename | pg%d.log
log_line_prefix | %p %u %r %t [%l]
log_min_duration_statement | 180000
log_rotation_age | 1440
log_rotation_size | 0
log_temp_files | 0
log_timezone | US/Eastern
log_truncate_on_rotation | on
logging_collector | on
maintenance_work_mem | 8388608
max_connections | 500
max_stack_depth | 2048
port | 5432
search_path | public, archive, _slony_cluster
shared_buffers | 2359296
standard_conforming_strings | on
synchronous_commit | off
temp_tablespaces | ts27,ts28,ts29,ts30,ts31,ts32,ts33,ts34,ts35,ts36,ts37
TimeZone | US/Eastern
timezone_abbreviations | Default
track_functions | pl
vacuum_freeze_min_age | 500000000
vacuum_freeze_table_age | 1300000000
wal_buffers | 1024
work_mem | 204800


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gordon Shannon <gordo169(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Tracking down log segment corruption
Date: 2010-05-02 21:16:11
Message-ID: 17352.1272834971@sss.pgh.pa.us
Lists: pgsql-general

Gordon Shannon <gordo169(at)gmail(dot)com> writes:
> [ corruption on a standby slave after an ALTER SET TABLESPACE operation ]

Found it, I think. ATExecSetTableSpace transfers the copied data to the
slave by means of XLOG_HEAP_NEWPAGE WAL records. The replay function
for this (heap_xlog_newpage) is failing to pay any attention to the
forkNum field of the WAL record. This means it will happily write FSM
and visibility-map pages into the main fork of the relation. So if the
index had any such pages on the master, it would immediately become
corrupted on the slave. Now indexes don't have a visibility-map fork,
but they could have FSM pages. And an FSM page would have the right
header information to look like an empty index page. So dropping an
index FSM page into the main fork of the index would produce the
observed symptom.
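
On 8.4, a quick way to check whether a given index currently has any FSM
pages is the fork-aware form of pg_relation_size (a sketch, using the index
name from earlier in the thread; a nonzero result means an FSM fork exists):

select pg_relation_size('cts_20100501_topic_date_nk', 'fsm');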

I'm not 100% sure that this is what bit you, but it's clearly a bug and
AFAICS it could produce the observed symptoms.

This is a seriously, seriously nasty data corruption bug. The only bit
of good news is that ALTER SET TABLESPACE seems to be the only operation
that can emit XLOG_HEAP_NEWPAGE records with forkNum different from
MAIN_FORKNUM, so that's the only operation that's at risk. But if you
do do that, not only are standby slaves going to get clobbered, but the
master could get corrupted too if you were unlucky enough to have a
crash and replay from WAL shortly after completing the ALTER. And it's
not only indexes that are at risk --- tables could get clobbered the
same way.

My crystal ball says there will be update releases in the very near
future.

regards, tom lane


From: Gordon Shannon <gordo169(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Tracking down log segment corruption
Date: 2010-05-02 21:43:39
Message-ID: x2ub2dd93301005021443t3c13fb57m7fc5b53e7f0f466b@mail.gmail.com
Lists: pgsql-general

Sounds like you're on it. Just wanted to share one additional piece, in
case it helps.

Just before the ALTER INDEX SET TABLESPACE was issued, there were some
writes to the table in question inside a serializable transaction. That
transaction committed at 11:11:58 EDT (11 seconds before the crash) and
consisted of, among a couple thousand writes to sibling tables, 4 writes
between 11:11:56 and 11:11:58 (an unknown mix of inserts and updates) to
cts_20100501, which definitely affected the index in question.

In any case, I will cease and desist from ALTER SET TABLESPACE for a while!
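
For what it's worth, one way to get an index into the intended tablespace
without going through ALTER ... SET TABLESPACE is to build a replacement
there and swap the names; a sketch (the column list below is assumed for
illustration, not taken from the real definition):

create index concurrently cts_20100501_topic_date_nk_new
    on cts_20100501 (topic_id, created_at)   -- assumed columns
    tablespace ts30;
drop index cts_20100501_topic_date_nk;
alter index cts_20100501_topic_date_nk_new rename to cts_20100501_topic_date_nk;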

Thanks!
Gordon

On Sun, May 2, 2010 at 3:16 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Found it, I think. ATExecSetTableSpace transfers the copied data to the
> slave by means of XLOG_HEAP_NEWPAGE WAL records. The replay function
> for this (heap_xlog_newpage) is failing to pay any attention to the
> forkNum field of the WAL record. This means it will happily write FSM
> and visibility-map pages into the main fork of the relation. So if the
> index had any such pages on the master, it would immediately become
> corrupted on the slave. Now indexes don't have a visibility-map fork,
> but they could have FSM pages. And an FSM page would have the right
> header information to look like an empty index page. So dropping an
> index FSM page into the main fork of the index would produce the
> observed symptom.
>
> I'm not 100% sure that this is what bit you, but it's clearly a bug and
> AFAICS it could produce the observed symptoms.
>
> This is a seriously, seriously nasty data corruption bug. The only bit
> of good news is that ALTER SET TABLESPACE seems to be the only operation
> that can emit XLOG_HEAP_NEWPAGE records with forkNum different from
> MAIN_FORKNUM, so that's the only operation that's at risk. But if you
> do do that, not only are standby slaves going to get clobbered, but the
> master could get corrupted too if you were unlucky enough to have a
> crash and replay from WAL shortly after completing the ALTER. And it's
> not only indexes that are at risk --- tables could get clobbered the
> same way.
>
> My crystal ball says there will be update releases in the very near
> future.
>
> regards, tom lane
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gordon Shannon <gordo169(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Tracking down log segment corruption
Date: 2010-05-02 22:29:55
Message-ID: 19434.1272839395@sss.pgh.pa.us
Lists: pgsql-general

Gordon Shannon <gordo169(at)gmail(dot)com> writes:
> In any case, I will cease and desist from ALTER SET TABLESPACE for a while!.

Here's the applied patch, if you are interested in testing it.

regards, tom lane

Attachment: newpage-replay-84.patch (text/x-patch, 1.2 KB)