Sage Weil [Thu, 26 Apr 2012 23:45:56 +0000 (16:45 -0700)]
mon: consider pending_inc in {up,in}_ratio for can_mark_{out,down}()
Consider pending changes when calculating the current up/in ratios. Among
other things, this will make the marking of osds down->out stop once it
hits the min in ratio.
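A minimal sketch of the idea, assuming a simplified model in which the committed map is a set of up OSD ids and the pending incremental is a dict of not-yet-committed state changes. The function and parameter names are hypothetical, not the monitor's actual API:

```python
def up_ratio(committed_up, pending_up, total_osds):
    """Fraction of OSDs that would be up once pending changes are applied.

    committed_up: set of OSD ids that are up in the committed map
    pending_up:   dict of OSD id -> bool, changes queued but not yet committed
    (both structures are illustrative stand-ins, not Ceph types)
    """
    effective = set(committed_up)
    for osd, is_up in pending_up.items():
        if is_up:
            effective.add(osd)
        else:
            effective.discard(osd)
    return len(effective) / total_osds if total_osds else 0.0
```

With ten OSDs committed as up and three already queued to go down, the ratio that matters for the next decision is 0.7, not 1.0. Using only the committed map would let further OSDs be marked down past the minimum.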
Sage Weil [Wed, 25 Apr 2012 16:23:49 +0000 (09:23 -0700)]
mon: 'osd thrash <num epochs>'
Thrash the osdmap for N iterations. Randomly mark OSDs up, down, in, out,
and up_thru in order to generate a difficult osdmap history for peering
to chew through.
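A rough sketch of what such a thrasher might look like, assuming a toy map that tracks only the up and in flags per OSD (up_thru is omitted). The data layout and the `thrash` function are assumptions for illustration, not the monitor's implementation:

```python
import random


def thrash(osds, epochs, seed=0):
    """Flip one random flag on one random OSD per epoch.

    Returns one snapshot of the toy map per epoch, so the result is a
    history that a consumer would have to walk through in order.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    state = {osd: {"up": True, "in": True} for osd in osds}
    history = []
    for _ in range(epochs):
        osd = rng.choice(osds)
        flag = rng.choice(["up", "in"])
        state[osd][flag] = not state[osd][flag]
        # Copy the per-OSD dicts so later flips do not alter old epochs.
        history.append({o: dict(s) for o, s in state.items()})
    return history
```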
Sage Weil [Wed, 25 Apr 2012 00:21:27 +0000 (17:21 -0700)]
mon: add 'mon osd min up ratio' and 'mon osd min in ratio'
Prevent the monitor from marking osds down or out when too many are already
in that state. At this point the cluster is already broken and there is
little point in continuing to mark things down/out.
Setting these to 0 obviously disables the feature (by setting a minimum
of 0).
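As an illustration, the check could be sketched as below. The function name and the 0.3 threshold are assumptions chosen for the example; the comparison used by the real monitor may differ:

```python
def ratio_allows_mark_down(num_up, num_osds, min_up_ratio):
    """Return True if the current up ratio is still at or above the minimum.

    A minimum of 0 can never be undercut, so it disables the check.
    """
    if num_osds == 0:
        return True
    return num_up / num_osds >= min_up_ratio


# Illustrative values only.
print(ratio_allows_mark_down(5, 10, 0.3))  # ratio 0.5 is above the minimum
print(ratio_allows_mark_down(2, 10, 0.3))  # ratio 0.2 is already too low
print(ratio_allows_mark_down(0, 10, 0.0))  # minimum of 0 disables the check
```

The same shape would apply to the in ratio, with the count of in OSDs in place of the count of up OSDs.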
Sage Weil [Wed, 25 Apr 2012 18:15:34 +0000 (11:15 -0700)]
mon: use can_mark_*() helpers
So we can generalize beyond NO* flags. We'll soon be adding other reasons
to not mark things up/down/in/out. This lets us keep all those checks in
one place.
The helper methods will tell us why we can't do the thing (e.g., "NODOWN
flag is set"). The callers will generally tell us exactly what didn't
happen (e.g., "failure report of X ignored").
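The split of responsibilities described here can be sketched roughly as follows. This is a hypothetical simplification: the helper returns a reason string, and the caller adds what did not happen. The names and message formats are assumptions, not the monitor's actual code:

```python
def can_mark_down(flags):
    """Helper: say whether marking down is allowed, and if not, why."""
    if "NODOWN" in flags:
        return False, "NODOWN flag is set"
    return True, ""


def handle_failure_report(osd, flags, log):
    """Caller: report exactly what was skipped, plus the helper's reason."""
    ok, reason = can_mark_down(flags)
    if not ok:
        log.append(f"failure report of osd.{osd} ignored: {reason}")
        return False
    log.append(f"marking osd.{osd} down")
    return True
```

New reasons for refusing (such as a ratio check) would then be added inside the helper only, without touching the callers.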
Sage Weil [Tue, 24 Apr 2012 22:46:49 +0000 (15:46 -0700)]
mon: do not mark osds out if NOOUT flag is set
Do not mark down osds out when NOOUT flag is set. This is more or less
equivalent to setting a very long 'mon osd down out interval', but
reversible and less annoying.
Sage Weil [Tue, 24 Apr 2012 22:45:58 +0000 (15:45 -0700)]
mon: do not mark booting osds in if NOIN flag is set
If the NOIN osdmap flag is set, do not mark booting osds in. Normally
we would for a range of reasons (always, new, auto-marked-out), but block
them all.
Sage Weil [Tue, 24 Apr 2012 22:28:36 +0000 (15:28 -0700)]
mon: always remove booting osds from down_pending_out
The down_pending_out tracks OSDs that are down that we may want to
auto-mark out. If an osd boots, it should be removed from this list
because it is no longer down; it doesn't matter whether it is marked in
or not.
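A small sketch of the intended behavior, assuming down_pending_out is modeled as a dict of OSD id to the time it went down. The `handle_boot` function and its parameters are hypothetical:

```python
def handle_boot(osd, down_pending_out, in_set, mark_in):
    """Process a booting OSD in a toy model of the monitor state."""
    # Always drop the OSD from the pending list: it is no longer down,
    # whether or not we decide to mark it in.
    down_pending_out.pop(osd, None)
    if mark_in:
        in_set.add(osd)
```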
Sage Weil [Tue, 24 Apr 2012 21:22:10 +0000 (14:22 -0700)]
osd: do not attempt to boot if NOUP
If NOUP is set, do not send the boot message.
We already send onetime subscriptions to the osdmap, so we will find out
about osdmap flag changes. If it is cleared later, we'll pass into
start_boot() and _got_boot_version() again and send it then.
Sage Weil [Tue, 24 Apr 2012 16:43:44 +0000 (09:43 -0700)]
librbd: pass errors removing head back to user
In particular, the OSD may return EBUSY if there are still watchers.
Ignore ENOENT, as that may indicate we are cleaning up a previously
aborted removal.
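The error handling described above might look like this in outline. `remove_object` is a hypothetical callable standing in for the actual removal of the head object, and the negative-errno return convention is assumed:

```python
import errno


def remove_head(remove_object):
    """Remove the head object, passing real errors back to the caller.

    remove_object: callable returning 0 on success or a negative errno.
    """
    r = remove_object()
    if r < 0 and r != -errno.ENOENT:
        return r  # e.g. -EBUSY while watchers remain
    return 0  # success, or already gone from an earlier aborted removal
```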
Fixes: #2311
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Tue, 24 Apr 2012 17:55:18 +0000 (10:55 -0700)]
mon: fix pg stats timeout
We clear out the osd entry when an osd goes up or down. Thus, if we find
it missing from an up osd, we should start the timer. Otherwise we get
behavior like this:
2012-04-24 13:22:47.888291 7fa5bc587700 mon.peon5752@0(leader).osd e21633 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 521. Marking down!
2012-04-24 13:22:50.076394 7fa5bcd88700 log [INF] : osd.521 [2607:f298:4:2243::7088]:6806/53217 boot
2012-04-24 13:22:52.903558 7fa5bc587700 mon.peon5752@0(leader).osd e21638 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 521. Marking down!
2012-04-24 13:23:15.144532 7fa5bcd88700 log [INF] : osd.521 [2607:f298:4:2243::7088]:6806/53217 boot
2012-04-24 13:23:17.967118 7fa5bc587700 mon.peon5752@0(leader).osd e21663 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 521. Marking down!
2012-04-24 13:23:22.173778 7fa5bcd88700 log [INF] : osd.521 [2607:f298:4:2243::7088]:6806/53217 boot
2012-04-24 13:23:22.981556 7fa5bc587700 mon.peon5752@0(leader).osd e21668 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 521. Marking down!
2012-04-24 13:23:45.245380 7fa5bcd88700 log [INF] : osd.521 [2607:f298:4:2243::7088]:6806/53217 boot
when the pg stats message doesn't arrive quickly enough.
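One way to picture the fix, as a sketch under assumptions: last_report maps an OSD id to the time of its last stats report, and the entry is erased whenever the OSD changes state. The function name and structures are hypothetical:

```python
def check_osd_timeouts(now, up_osds, last_report, timeout):
    """Return the up OSDs whose pg stats are overdue.

    A missing entry for an up OSD means it has just booted, so the timer
    is started instead of treating the OSD as already timed out.
    """
    to_mark_down = []
    for osd in up_osds:
        if osd not in last_report:
            last_report[osd] = now  # start the timer
        elif now - last_report[osd] > timeout:
            to_mark_down.append(osd)
    return to_mark_down
```

Under this model a freshly booted OSD gets a full timeout period to deliver its first stats message, which avoids the boot and mark-down loop seen in the log.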
Fixes: #2341
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Greg Farnum <gregory.farnum@dreamhost.com>
Sage Weil [Mon, 23 Apr 2012 20:57:25 +0000 (13:57 -0700)]
run_seed_to.sh: rework the script, make it more flexible and broaden the tests.
Allow for '-h' and other options, such as disabling the journal sync tests,
specifying that it is to be run on a btrfs FS, and enabling exit on error
(default is now 'off'). Also allow certain env variables to specify
additional options to each store.
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
Sage Weil [Sun, 22 Apr 2012 03:28:45 +0000 (20:28 -0700)]
Makefile: disable format-security warning
The prt() varargs function generates this warning:
test/rbd/fsx.c: In function ‘prt’:
warning: test/rbd/fsx.c:203:2: format not a string literal and no format arguments [-Wformat-security]
warning: test/rbd/fsx.c:205:3: format not a string literal and no format arguments [-Wformat-security]
Disable that check for the fsx build only.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sat, 21 Apr 2012 00:13:08 +0000 (17:13 -0700)]
librbd: fix ictx_check pointer weirdness by using std::string
I was seeing failures of LibRBD.TestIOToSnapshot where we would fail to
refresh after rollback, even though the snap existed. I assume it is
because the std::string whose c_str() we were pointing to was reallocated.
Sage Weil [Fri, 20 Apr 2012 23:56:57 +0000 (16:56 -0700)]
filestore: fix collection_add journal replay problem
In collection_add we have a two-phase guard set on the linked object via
the old name. During replay, we might see that the dest name is missing
and replay the operation, and in the process overwrite a newer guard with
an older one.
Avoid this by checking the source name too, and skipping the operation
entirely if a new guard exists.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
This code should be in a stand-alone class, instead of being embedded in
a single test, in case someone or something finds it useful somewhere down
the line.
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
Sage Weil [Thu, 19 Apr 2012 18:00:20 +0000 (11:00 -0700)]
librbd: fix zeroing of trailing bits on short reads that span objects
handle_sparse_read() was taking buf_ofs and buf_len, but buf_len was being
interpreted as the total size of the buffer, not the length of the extent
in the buffer starting at buf_ofs. Both callers pass in an extent length, so
fix the zero code to do the right thing.
Specifically, the behavior I saw was:
- read range spanning 2 objects, trailing 20k and leading 50k
- first object didn't exist, zeroed first 20k of buffer
- second object didn't exist, zeroed next 30k (50k-20k) of buffer
- the last 20k of buffer was unzeroed.
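The scenario above can be reproduced with a toy model. The two functions below are illustrative stand-ins for the zeroing logic, not the librbd code: the first treats buf_len as the total buffer size, and the second treats it as the extent length.

```python
K = 1024


def zero_extent_buggy(buf, buf_ofs, buf_len):
    """Treats buf_len as the end of the buffer, so it zeroes [buf_ofs, buf_len)."""
    n = max(0, buf_len - buf_ofs)
    buf[buf_ofs:buf_ofs + n] = bytes(n)


def zero_extent_fixed(buf, buf_ofs, buf_len):
    """Treats buf_len as the extent length, so it zeroes [buf_ofs, buf_ofs + buf_len)."""
    buf[buf_ofs:buf_ofs + buf_len] = bytes(buf_len)


def read_missing(zero_fn):
    """Read 70k spanning two nonexistent objects: a 20k tail and a 50k head."""
    buf = bytearray(b"\xff" * (70 * K))  # 0xff marks bytes never zeroed
    for ofs, length in [(0, 20 * K), (20 * K, 50 * K)]:
        zero_fn(buf, ofs, length)
    return buf


print(read_missing(zero_extent_buggy).count(0xFF) // K)  # 20 (k left unzeroed)
print(read_missing(zero_extent_fixed).count(0xFF) // K)  # 0
```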
Sage Weil [Fri, 20 Apr 2012 23:36:54 +0000 (16:36 -0700)]
log: prefix dump with line numbers
This makes it easier to interpret the dump, and makes it obvious what is
dump (and potentially a dup of something that was already logged) and what
is not.
Sage Weil [Wed, 18 Apr 2012 21:39:18 +0000 (14:39 -0700)]
osd: dump old ops singly rather than all at once.
Fixes #2269. Convert the OpTracker::check_ops_in_flight interface
to take a vector<string> and create a separate warning for each old
Op, and dump those singly to the clog in the OSD.
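The shape of the new interface could be sketched as follows, with a Python list standing in for vector<string>. The tuple layout of the ops and the warning text are assumptions made for the example, not the actual OSD output:

```python
def check_ops_in_flight(ops, now, max_age):
    """Return one warning string per op older than max_age.

    ops: iterable of (description, start_time) pairs (hypothetical layout)
    """
    warnings = []
    for desc, start in ops:
        age = now - start
        if age > max_age:
            warnings.append(f"old request {desc} received {age:.0f}s ago")
    return warnings


def report_old_ops(ops, now, max_age, clog):
    """Caller: emit each warning as its own log entry."""
    for warning in check_ops_in_flight(ops, now, max_age):
        clog.append(warning)
```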
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
pgmap: allow Incrementals to specify [near]full_ratios of 0
This commit isn't entirely safe: old monitors used 0 to mean "no change".
We can revert this (and the PGMonitor.cc portion of 841f2885318d1bcf37aab3f2947b1f40fee772a9) if we don't want to allow
0 as a valid ratio setting, and to maintain perfect backwards
compatibility.
pgmon: remove the PGMonitor update_full_ratios stuff
Making it a config watcher is just a huge mess in terms of consistently
updating it appropriately.
The next commit will add a monitor command for changing it.
Sage Weil [Tue, 17 Apr 2012 17:32:38 +0000 (10:32 -0700)]
mon: only fill in full/nearfull sets if the ratio > 0
This avoids putting all OSDs in both sets when the ratios are 0, as they
are with a fresh cluster and pgmap. This also makes setting the ratio to
0 effectively disable the full/nearfull feature.
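A sketch of the guard, assuming per-OSD usage is given as a fraction between 0 and 1. The function name, the dict layout, and the exact comparisons are assumptions for illustration:

```python
def compute_full_sets(usage, full_ratio, nearfull_ratio):
    """Split OSDs into full and nearfull sets.

    usage: dict of OSD id -> used fraction (hypothetical layout)
    A ratio of 0 means the corresponding set is never populated.
    """
    full, nearfull = set(), set()
    for osd, used in usage.items():
        if full_ratio > 0 and used >= full_ratio:
            full.add(osd)
        elif nearfull_ratio > 0 and used >= nearfull_ratio:
            nearfull.add(osd)
    return full, nearfull
```

Without the `> 0` guards, every OSD would satisfy `used >= 0` on a fresh cluster and be reported as full.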
mon: unconditionally encode PGMap full ratios in the Incremental
This properly spreads the real value to peon monitors -- they weren't
seeing the right values at all before.
Initialize all related values to zero so that it's obvious if they
somehow avoided becoming set properly.
This doesn't require any kind of protocol revision, luckily -- mixing
monitors from before and after this change might result in extra work
recalculating full sets, but it won't spread bad values or anything.
Alex Elder [Tue, 17 Apr 2012 13:33:42 +0000 (08:33 -0500)]
qa: comment out xfstest 232
Test 232 in the xfstests suite produces an XFS error in the log
when run over an RBD device. This is most likely an XFS problem
that will be tracked separately (in tracker 2302).
My original plan with getting this checked in was to have it run a
baseline set of the tests--all known to pass on rbd devices--with
the intention of doing ongoing work to add back missing tests (at
least from the "auto" group) as we understand and fix whatever
makes them produce failures.
So just comment out test 232 so the xfstests script is able to
run to completion without error.