Greg Farnum [Thu, 25 Jun 2009 19:05:15 +0000 (12:05 -0700)]
messages/MClass[Ack]: Roll back some unification.
version_t last and PaxosServiceMessage::version shouldn't
be the same in these messages. Remove that and add a new
constructor that does set the version (but it's unneeded).
Sage Weil [Thu, 25 Jun 2009 20:01:16 +0000 (13:01 -0700)]
osd: update primary's notion of peer last_update on activate
We are pushing the peer the log to bring it up to date, so
update our peer_info[peer].last_update to match. Otherwise,
we get confused if we get, say, stray content and peer() is
called later, and we have out of date peer stats.
Sage Weil [Thu, 25 Jun 2009 19:45:35 +0000 (12:45 -0700)]
osd: force RMW ordering globally
We can't mix RMW and DELAYED in the same PG without screwing
up the ordering of writes, the pg log, and so forth.
So force RMW throughout. This won't affect the mds log
appends because the client is constant. It will slow down
concurrent writes to the same object by multiple clients, but
we don't have many (any?) of those yet.
Sage Weil [Thu, 25 Jun 2009 03:43:01 +0000 (20:43 -0700)]
osd: store snapset in _snapdir object if head dne
If the _head doesn't logically exist, we can't keep it around just for
the SnapSet or else an 'ls' will have to stat in order to tell if the
head object logically exists and should be included. That's no good,
so:
- put snapset in SS_ATTR on head if it exists
- otherwise, put it SS_ATTR on a _snapdir object
Sage Weil [Wed, 24 Jun 2009 18:17:55 +0000 (11:17 -0700)]
osd: adjust recovery op accounting; explicitly track set of recovering objects
Use a single {start,finish}_recovery_op() func to start and stop
recovery ops, so that there is a single point for counter adjustments
to occur. On reset, simply call into OSD multiple times.
Also maintain a set<sobject_t> in each PG and on the OSD to track
the set of objects that are recovering. This can hopefully be
compiled out once all the bugs are identified.
We are chasing this:
osd/OSD.cc:3465: FAILED assert(recovery_ops_active >= 0)
1: ./cosd(_Z18__ceph_assert_failPKcS0_iS0_+0x3a) [0x7a769b]
2: ./cosd(_ZN3OSD18finish_recovery_opEP2PGib+0x148) [0x696bce]
3: ./cosd(_ZN12ReplicatedPG18finish_recovery_opEv+0x77) [0x6359c5]
4: ./cosd(_ZN12ReplicatedPG17sub_op_push_replyEP14MOSDSubOpReply+0x540) [0x63628a]
5: ./cosd(_ZN12ReplicatedPG15do_sub_op_replyEP14MOSDSubOpReply+0x64) [0x6407fe]
6: ./cosd(_ZN3OSD10dequeue_opEP2PG+0x224) [0x6996ee]
7: ./cosd(_ZN3OSD4OpWQ8_processEP2PG+0x21) [0x70d175]
8: ./cosd(_ZN10ThreadPool9WorkQueueI2PGE13_void_processEPv+0x28) [0x6c9f78]
9: ./cosd(_ZN10ThreadPool6workerEv+0x280) [0x7a825c]
10: ./cosd(_ZN10ThreadPool10WorkThread5entryEv+0x19) [0x70cb9f]
11: ./cosd(_ZN6Thread11_entry_funcEPv+0x20) [0x629d48]
12: /lib/libpthread.so.0 [0x7f2f1e3f33f7]
13: /lib/libc.so.6(clone+0x6d) [0x7f2f1d9c294d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Sage Weil [Wed, 24 Jun 2009 05:06:24 +0000 (22:06 -0700)]
osd: fix merge_log when log and olog share bottom
If log has 6'10 and olog has 7'10, on same object, merge_log
was failing to throw out log's 6'10 entry because the
last_kept iterator was still end(). Use a simple eversion_t
instead, and simplify existing (and otherwise correct)
log.bottom logic, but without the last_kept != end() guard
that threw us off.
09.06.23 16:52:56.032981 1145465168 osd4 485 pg[1.cd( v 469'11021/469'11021 (469'11020,469'11021] n=8151 ec=2 les=476 485/480) r=0 lcod 0'0 mlcod 469'11021 !hml crashed+peering] merge_log log(469'11020,476'11021] from osd0 into log(469'11020,469'11021]
09.06.23 16:52:56.033001 1145465168 osd4 485 pg[1.cd( v 469'11021/469'11021 (469'11020,469'11021] n=8151 ec=2 les=476 485/480) r=0 lcod 0'0 mlcod 469'11021 !hml crashed+peering] merge_log extending top to 476'11021
09.06.23 16:52:56.033033 1145465168 osd4 485 pg[1.cd( v 469'11021/469'11021 (469'11020,469'11021] n=8151 ec=2 les=476 485/480) r=0 lcod 0'0 mlcod 469'11021 !hml crashed+peering] ? 476'11021 (0'0) m 10001641d24.00000000/head by mds0.16:33860 09.06.23 16:50:28.931949
09.06.23 16:52:56.033057 1145465168 osd4 485 pg[1.cd( v 469'11021/469'11021 (469'11020,469'11021] n=8151 ec=2 les=476 485/480) r=0 lcod 0'0 mlcod 469'11021 !hml crashed+peering] merge_log 476'11021 (0'0) m 10001641d24.00000000/head by mds0.16:33860 09.06.23 16:50:28.931949
09.06.23 16:52:56.033090 1145465168 osd4 485 pg[1.cd( v 476'11021/469'11021 (469'11020,476'11021] n=8151 ec=2 les=476 485/480) r=0 lcod 0'0 mlcod 469'11021 !hml crashed+peering m=1 l=1] merge_log result log(469'11020,476'11021] missing(1) changed=1
Sage Weil [Tue, 23 Jun 2009 04:32:09 +0000 (21:32 -0700)]
osd: stop rewinding replica log when we reach log.bottom
We stop rewinding a replica log when we reach our own
log.bottom, because we don't know enough to do so in any
meaningful way, and because we can assume it is not
divergent at that point (barring any complete screwupedness).
Also, if we do change last_update, make sure last_complete is
rewound too.
Sage Weil [Sat, 20 Jun 2009 06:27:04 +0000 (23:27 -0700)]
osd: do NOT include op vector when shipping raw transaction
This just doubles up the data payload. And makes the MOSDSubOp printout
look like garbage, since e.g. the setxattr names are taken from the
portion of the data payload encoding the transaction.
Sage Weil [Fri, 19 Jun 2009 19:45:36 +0000 (12:45 -0700)]
osd: pass updated stats to replica
When we ship the raw transaction to the replica, we need to ship the
new pg_stat_t as well, since that isn't getting updated in parallel by
prepare_transaction().
Sage Weil [Fri, 19 Jun 2009 04:05:59 +0000 (21:05 -0700)]
osd: fix initialization of log.complete_to in PG::activate()
The complete_to should point to the next object to get, which
should be just PAST info.last_complete. That is because we
can trim the log up to and including last_complete (because
that entry is recovered), and we don't want to invalidate
the iterator.
That is
while (log.complete_to->version <= info.last_complete)
log.complete_to++;
and in sub_op_push,
while (...) {
...
if (info.last_complete < log.complete_to->version)
info.last_complete = log.complete_to->version;
log.complete_to++;
}
Sage Weil [Thu, 18 Jun 2009 23:37:03 +0000 (16:37 -0700)]
osd: make add_next_entry behave when we start at backlog split point
Weaken the assertions a bit and just adjust missing appropriately.
Things may not match up perfectly if the split point is a backlog
entry, so just make missing what it should be a worry less about
what it was.
Here is the specific crash:
09.06.18 16:29:15.085353 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 lcod 0'0 stray m=1] my log = log(0'0,5'4]+backlog
3'1 (0'0) m 200.00000000/head by mds0.1:1 09.06.18 16:20:07.524996 indexed
3'2 (0'0) m 2.00000000/head by mds0.1:5 09.06.18 16:20:07.527454 indexed
5'3 (3'1) m 200.00000000/head by mds0.1:23 09.06.18 16:20:25.128842 indexed
5'4 (5'3) m 200.00000000/head by mds0.1:35 09.06.18 16:20:48.623669 indexed
09.06.18 16:29:15.085393 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 lcod 0'0 stray m=1] osd2 log = log(8'68,9'69]+backlog
3'2 (0'0) b 2.00000000/head by mds0.1:5 09.06.18 16:20:07.527454
9'69 (8'68) m 200.00000000/head by mds0.1:1114 09.06.18 16:28:08.837907
09.06.18 16:29:15.085416 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 lcod 0'0 stray m=1] merge_log log(8'68,9'69]+backlog from osd2 into log(0'0,5'4]+backlog
09.06.18 16:29:15.085456 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 (log bound mismatch, actual=[3'2,9'69] len=2) lcod 0'0 stray m=1] merge_log split point is 3'2 (0'0) b 2.00000000/head by mds0.1:5 09.06.18 16:20:07.527454
09.06.18 16:29:15.085472 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 (log bound mismatch, actual=[3'2,9'69] len=2) lcod 0'0 stray m=1] merge_log merging 3'2 (0'0) b 2.00000000/head by mds0.1:5 09.06.18 16:20:07.527454
09.06.18 16:29:15.085493 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 (log bound mismatch, actual=[3'2,9'69] len=2) lcod 0'0 stray m=2] merge_log merging 9'69 (8'68) m 200.00000000/head by mds0.1:1114 09.06.18 16:28:08.837907
osd/PG.h: In function 'void PG::Missing::add_next_event(PG::Log::Entry&)':
osd/PG.h:494: FAILED assert(missing[e.soid].need == e.prior_version)
Sage Weil [Thu, 18 Jun 2009 21:23:29 +0000 (14:23 -0700)]
crush: redefine hash using __u32, for consistency across 32/64 bit
I'm pretty sure this was giving inconsistent results across archs,
because bits would get shifted into the high 32 and then back again
on x86_64 but not x86_32.
Sage Weil [Thu, 18 Jun 2009 18:40:55 +0000 (11:40 -0700)]
osd: trim pg logs on recovery completion
When replica finds itself fully up to date (last_complete ==
last_update) it tells the primary. Primary checks the same.
If the primary find the min_last_complete_ondisk has changed,
it sends out a trim command.
This will let us drop huge pg logs out of memory after a recovery
without waiting for IO and the usual piggybacked trimming logic
to kick in.