Sage Weil [Fri, 26 Jun 2009 17:13:07 +0000 (10:13 -0700)]
buffer: throw exceptions instead of always asserting.
We still assert when the user is doing something wrong. We throw
exceptions for failed memory allocs and for buffer overruns. I
think that will align with usage, especially the encoding/decoding.
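A minimal sketch of that split, using hypothetical names (`end_of_buffer`, `cursor`, `try_decode_u32`) rather than Ceph's actual buffer API. Caller misuse still trips an assert, while reading past the end of the data throws, so a decoder can reject a truncated message instead of aborting the daemon.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical exception type for reads past the end of the data.
struct end_of_buffer : std::runtime_error {
  end_of_buffer() : std::runtime_error("end of buffer") {}
};

// Hypothetical read cursor over a byte vector (not Ceph's bufferlist).
class cursor {
 public:
  explicit cursor(std::vector<char> data) : data_(std::move(data)) {}

  // Copy len bytes into dest and advance the read offset.
  void copy(std::size_t len, char* dest) {
    assert(dest != nullptr);  // caller bug: still an assert
    if (len > data_.size() - off_)
      throw end_of_buffer();  // truncated or corrupt input: throw
    if (len > 0) std::memcpy(dest, data_.data() + off_, len);
    off_ += len;
  }

 private:
  std::vector<char> data_;
  std::size_t off_ = 0;
};

// Decode a raw 4-byte value (host byte order, for brevity). Returns false
// instead of crashing when the input is too short; *out is left untouched.
bool try_decode_u32(const std::vector<char>& bytes, std::uint32_t* out) {
  cursor c(bytes);
  try {
    c.copy(sizeof(*out), reinterpret_cast<char*>(out));
    return true;
  } catch (const end_of_buffer&) {
    return false;
  }
}
```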
Greg Farnum [Thu, 25 Jun 2009 19:05:15 +0000 (12:05 -0700)]
messages/MClass[Ack]: Roll back some unification.
version_t last and PaxosServiceMessage::version shouldn't
be the same in these messages. Remove that aliasing and add a new
constructor that does set the version (though it isn't currently needed).
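A sketch of the resulting constructor pair, using hypothetical minimal stand-ins for the message classes (the real ones carry much more state and their own encoding logic). `last` is now an independent field; only the second constructor touches the Paxos version.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t version_t;

// Hypothetical stand-in for PaxosServiceMessage: it carries a version.
struct PaxosServiceMessage {
  version_t version = 0;
  PaxosServiceMessage() = default;
  explicit PaxosServiceMessage(version_t v) : version(v) {}
};

// Hypothetical MClass: `last` is its own field, no longer aliased to `version`.
struct MClass : PaxosServiceMessage {
  version_t last = 0;

  // Sets only `last`; the Paxos version keeps its default.
  explicit MClass(version_t l) : last(l) {}

  // Extra constructor that also sets the Paxos version.
  MClass(version_t l, version_t v) : PaxosServiceMessage(v), last(l) {}
};
```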
Sage Weil [Thu, 25 Jun 2009 20:01:16 +0000 (13:01 -0700)]
osd: update primary's notion of peer last_update on activate
We are pushing the log to the peer to bring it up to date, so
update our peer_info[peer].last_update to match. Otherwise, if
we get, say, stray content and peer() is called later, we are
working from out-of-date peer stats and get confused.
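Roughly, the activate-side update looks like the sketch below. `pg_info_t` is reduced to a single integer field here, and `push_log_to()` is a hypothetical placeholder for the actual log push.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

typedef std::uint64_t version_t;

// Hypothetical, heavily simplified pg_info_t.
struct pg_info_t {
  version_t last_update = 0;
};

struct PG {
  pg_info_t info;                      // primary's own info
  std::map<int, pg_info_t> peer_info;  // primary's view of each replica

  // Hypothetical placeholder for sending our log to a replica.
  void push_log_to(int /*peer*/) {}

  // After pushing the log that brings `peer` up to date, record that the
  // peer is now at our last_update, so a later peer() call does not work
  // from the stale value.
  void activate_peer(int peer) {
    push_log_to(peer);
    peer_info[peer].last_update = info.last_update;
  }
};
```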
Sage Weil [Thu, 25 Jun 2009 19:45:35 +0000 (12:45 -0700)]
osd: force RMW ordering globally
We can't mix RMW and DELAYED in the same PG without screwing
up the ordering of writes, the pg log, and so forth.
So force RMW throughout. This won't affect the mds log
appends because they always come from the same client. It will slow down
concurrent writes to the same object by multiple clients, but
we don't have many (any?) of those yet.
Sage Weil [Thu, 25 Jun 2009 03:43:01 +0000 (20:43 -0700)]
osd: store snapset in _snapdir object if head dne
If the _head doesn't logically exist, we can't keep it around just for
the SnapSet, or else an 'ls' would have to stat each head object to tell
whether it logically exists and should be included. That's no good,
so:
- put the snapset in SS_ATTR on the head if it exists
- otherwise, put it in SS_ATTR on a _snapdir object
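The placement rule above as a small sketch. The enum and `snapset_object()` helper are hypothetical, and the "/head" and "/snapdir" suffixes only mimic how sobject_t names print in the logs.

```cpp
#include <cassert>
#include <string>

// Hypothetical: which object holds SS_ATTR for a given object name.
enum class SnapSetLocation { HEAD, SNAPDIR };

// Keep the SnapSet on the head only while the head logically exists.
// Otherwise it moves to a _snapdir object, so a head never lingers just to
// hold the SnapSet, and a listing can include every head it finds without
// a stat.
SnapSetLocation snapset_location(bool head_exists) {
  return head_exists ? SnapSetLocation::HEAD : SnapSetLocation::SNAPDIR;
}

// Hypothetical naming of the object that carries SS_ATTR.
std::string snapset_object(const std::string& oid, bool head_exists) {
  return snapset_location(head_exists) == SnapSetLocation::HEAD
             ? oid + "/head"
             : oid + "/snapdir";
}
```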
Sage Weil [Wed, 24 Jun 2009 18:17:55 +0000 (11:17 -0700)]
osd: adjust recovery op accounting; explicitly track set of recovering objects
Use a single pair of {start,finish}_recovery_op() functions to start and
stop recovery ops, so that there is a single point for counter adjustments
to occur. On reset, simply call into the OSD multiple times.
Also maintain a set<sobject_t> in each PG and on the OSD to track
the set of objects that are recovering. This can hopefully be
compiled out once all the bugs are identified.
We are chasing this:
osd/OSD.cc:3465: FAILED assert(recovery_ops_active >= 0)
1: ./cosd(_Z18__ceph_assert_failPKcS0_iS0_+0x3a) [0x7a769b]
2: ./cosd(_ZN3OSD18finish_recovery_opEP2PGib+0x148) [0x696bce]
3: ./cosd(_ZN12ReplicatedPG18finish_recovery_opEv+0x77) [0x6359c5]
4: ./cosd(_ZN12ReplicatedPG17sub_op_push_replyEP14MOSDSubOpReply+0x540) [0x63628a]
5: ./cosd(_ZN12ReplicatedPG15do_sub_op_replyEP14MOSDSubOpReply+0x64) [0x6407fe]
6: ./cosd(_ZN3OSD10dequeue_opEP2PG+0x224) [0x6996ee]
7: ./cosd(_ZN3OSD4OpWQ8_processEP2PG+0x21) [0x70d175]
8: ./cosd(_ZN10ThreadPool9WorkQueueI2PGE13_void_processEPv+0x28) [0x6c9f78]
9: ./cosd(_ZN10ThreadPool6workerEv+0x280) [0x7a825c]
10: ./cosd(_ZN10ThreadPool10WorkThread5entryEv+0x19) [0x70cb9f]
11: ./cosd(_ZN6Thread11_entry_funcEPv+0x20) [0x629d48]
12: /lib/libpthread.so.0 [0x7f2f1e3f33f7]
13: /lib/libc.so.6(clone+0x6d) [0x7f2f1d9c294d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
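A sketch of the accounting scheme, under hypothetical names and types (`sobject_t` is just a string here). All counter changes go through one start/finish pair, and the tracked set turns an unmatched finish into an assert at the offending call instead of a negative recovery_ops_active later on.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-in for sobject_t.
typedef std::string sobject_t;

// OSD-wide accounting: a counter plus the set of objects being recovered.
struct OSDRecovery {
  int recovery_ops_active = 0;
  std::set<sobject_t> recovering;

  void start_recovery_op(const sobject_t& soid) {
    bool inserted = recovering.insert(soid).second;
    assert(inserted);  // starting the same object twice is a bug
    (void)inserted;
    ++recovery_ops_active;
  }

  void finish_recovery_op(const sobject_t& soid) {
    // A finish with no matching start fails here, at the unmatched call,
    // rather than showing up later as a negative counter.
    std::size_t erased = recovering.erase(soid);
    assert(erased == 1);
    (void)erased;
    --recovery_ops_active;
    assert(recovery_ops_active >= 0);
  }
};

// Per-PG view; every change goes through the OSD's single entry points.
struct PGRecovery {
  OSDRecovery* osd;
  std::set<sobject_t> recovering;

  void start(const sobject_t& soid) {
    recovering.insert(soid);
    osd->start_recovery_op(soid);
  }
  void finish(const sobject_t& soid) {
    recovering.erase(soid);
    osd->finish_recovery_op(soid);
  }
  // On reset, call into the OSD once for each op still outstanding.
  void reset() {
    while (!recovering.empty()) {
      sobject_t soid = *recovering.begin();  // copy: finish() erases it
      finish(soid);
    }
  }
};
```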
Sage Weil [Wed, 24 Jun 2009 05:06:24 +0000 (22:06 -0700)]
osd: fix merge_log when log and olog share bottom
If log has 6'10 and olog has 7'10 on the same object, merge_log
was failing to throw out log's 6'10 entry because the
last_kept iterator was still end(). Use a simple eversion_t
instead, and simplify the existing (and otherwise correct)
log.bottom logic, dropping the last_kept != end() guard
that threw us off.
09.06.23 16:52:56.032981 1145465168 osd4 485 pg[1.cd( v 469'11021/469'11021 (469'11020,469'11021] n=8151 ec=2 les=476 485/480) r=0 lcod 0'0 mlcod 469'11021 !hml crashed+peering] merge_log log(469'11020,476'11021] from osd0 into log(469'11020,469'11021]
09.06.23 16:52:56.033001 1145465168 osd4 485 pg[1.cd( v 469'11021/469'11021 (469'11020,469'11021] n=8151 ec=2 les=476 485/480) r=0 lcod 0'0 mlcod 469'11021 !hml crashed+peering] merge_log extending top to 476'11021
09.06.23 16:52:56.033033 1145465168 osd4 485 pg[1.cd( v 469'11021/469'11021 (469'11020,469'11021] n=8151 ec=2 les=476 485/480) r=0 lcod 0'0 mlcod 469'11021 !hml crashed+peering] ? 476'11021 (0'0) m 10001641d24.00000000/head by mds0.16:33860 09.06.23 16:50:28.931949
09.06.23 16:52:56.033057 1145465168 osd4 485 pg[1.cd( v 469'11021/469'11021 (469'11020,469'11021] n=8151 ec=2 les=476 485/480) r=0 lcod 0'0 mlcod 469'11021 !hml crashed+peering] merge_log 476'11021 (0'0) m 10001641d24.00000000/head by mds0.16:33860 09.06.23 16:50:28.931949
09.06.23 16:52:56.033090 1145465168 osd4 485 pg[1.cd( v 476'11021/469'11021 (469'11020,476'11021] n=8151 ec=2 les=476 485/480) r=0 lcod 0'0 mlcod 469'11021 !hml crashed+peering m=1 l=1] merge_log result log(469'11020,476'11021] missing(1) changed=1
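A much-reduced sketch of the fixed logic, with hypothetical simplified types and without the missing-set bookkeeping the real merge_log does. `last_kept` starts at the shared bottom rather than at an end() iterator, so when no entry matches, everything above the bottom is still discarded. The usage mirrors the log excerpt above: 469'11021 is thrown out and 476'11021 is taken from olog.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <string>

// Hypothetical simplified eversion_t: (epoch, version), epoch compared first.
struct eversion_t {
  std::uint32_t epoch = 0;
  std::uint64_t version = 0;
  eversion_t() = default;
  eversion_t(std::uint32_t e, std::uint64_t v) : epoch(e), version(v) {}
};
inline bool operator<(const eversion_t& a, const eversion_t& b) {
  return a.epoch < b.epoch || (a.epoch == b.epoch && a.version < b.version);
}
inline bool operator==(const eversion_t& a, const eversion_t& b) {
  return a.epoch == b.epoch && a.version == b.version;
}

struct Entry {
  eversion_t version;
  std::string soid;
};

struct Log {
  eversion_t bottom;         // exclusive lower bound
  std::list<Entry> entries;  // ascending by version
};

// Assumes both logs share the same bottom.
void merge_log(Log& log, const Log& olog) {
  // Newest entry both logs agree on; defaults to the shared bottom.
  eversion_t last_kept = log.bottom;
  for (const Entry& e : log.entries)
    for (const Entry& o : olog.entries)
      if (o.version == e.version) last_kept = e.version;

  // Drop our divergent entries above last_kept.
  while (!log.entries.empty() && last_kept < log.entries.back().version)
    log.entries.pop_back();

  // Take olog's entries above last_kept.
  for (const Entry& o : olog.entries)
    if (last_kept < o.version) log.entries.push_back(o);
}
```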
Sage Weil [Tue, 23 Jun 2009 04:32:09 +0000 (21:32 -0700)]
osd: stop rewinding replica log when we reach log.bottom
We stop rewinding a replica log when we reach our own
log.bottom, because we don't know enough to rewind further in any
meaningful way, and because we can assume it is not
divergent at that point (barring any complete screwupedness).
Also, if we do change last_update, make sure last_complete is
rewound too.
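A sketch of the rewind rule with hypothetical simplified types (`eversion_t` is a std::pair, which compares epoch first, then version). The replica's newest entries that we lack are dropped, but never at or below our own log.bottom; if last_update moves back, last_complete is clamped to it.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <set>
#include <utility>

// Hypothetical simplified eversion_t: (epoch, version).
typedef std::pair<std::uint32_t, std::uint64_t> eversion_t;

// Hypothetical subset of a replica's pg info.
struct ReplicaInfo {
  eversion_t last_update;
  eversion_t last_complete;
};

void rewind_replica(std::list<eversion_t>& peer_log, ReplicaInfo& peer,
                    const std::set<eversion_t>& our_entries,
                    const eversion_t& our_bottom) {
  bool rewound = false;
  // Drop divergent entries, stopping once we reach our own bottom.
  while (!peer_log.empty() && our_bottom < peer_log.back() &&
         our_entries.count(peer_log.back()) == 0) {
    peer_log.pop_back();
    rewound = true;
  }
  if (!rewound) return;
  peer.last_update = peer_log.empty() ? our_bottom : peer_log.back();
  // last_complete must not stay ahead of the rewound last_update.
  if (peer.last_update < peer.last_complete)
    peer.last_complete = peer.last_update;
}
```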
Sage Weil [Sat, 20 Jun 2009 06:27:04 +0000 (23:27 -0700)]
osd: do NOT include op vector when shipping raw transaction
It just doubles the data payload, and it makes the MOSDSubOp printout
look like garbage, since e.g. the setxattr names are taken from the
portion of the data payload that encodes the transaction.
Sage Weil [Fri, 19 Jun 2009 19:45:36 +0000 (12:45 -0700)]
osd: pass updated stats to replica
When we ship the raw transaction to the replica, we need to ship the
new pg_stat_t as well, since that isn't getting updated in parallel by
prepare_transaction().
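A sketch of the idea with hypothetical types (`SubOp`, `Replica`, and a two-field `pg_stat_t`). The replica applies the raw transaction bytes as-is and never runs prepare_transaction(), so it adopts the stats the primary computed and sent along with it.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical subset of pg_stat_t.
struct pg_stat_t {
  std::uint64_t num_bytes = 0;
  std::uint64_t num_objects = 0;
};

// Hypothetical sub-op carrying a raw encoded transaction plus the
// primary's updated stats.
struct SubOp {
  std::string raw_txn;  // encoded transaction bytes, applied as-is
  pg_stat_t pg_stats;   // stats the primary computed for this update
};

struct Replica {
  pg_stat_t stats;
  std::string applied;  // stand-in for the object store

  // Nothing on this side recomputes the stats from the raw bytes,
  // so take the primary's values.
  void handle_sub_op(const SubOp& op) {
    applied += op.raw_txn;
    stats = op.pg_stats;
  }
};
```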