Sage Weil [Fri, 27 Apr 2012 05:12:40 +0000 (22:12 -0700)]
crush: remove parent maps
These were used (poorly) for forcefeeding, but they are useless now. Which
is good, because we allow items to appear in multiple trees, which means
they have no single parent. Good riddance!
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 27 Apr 2012 04:29:53 +0000 (21:29 -0700)]
mon: limit size of MOSDMap message sent as reply
We may send an MOSDMap as a reply to various requests, including
- a failure report
- a boot message
- a pg_temp message
- an up_thru message
In these cases, send a single MOSDMap message, but limit how big it gets.
All recipients here are osds, which are smart enough to request more maps
based on the MOSDMap::newest_map field.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
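The capped reply and the follow-up request can be sketched roughly as below. This is a minimal Python illustration, not the monitor's code: `build_reply`, `next_request`, `max_maps`, and the dictionary layout are hypothetical names chosen for the example. Only the idea of a `newest_map` field comes from the message above.

```python
def build_reply(maps, first, max_maps):
    """Build one capped reply covering epochs starting at `first`.

    `maps` is a dict of epoch -> encoded map (a hypothetical stand-in for
    the monitor's store). Assumes `first` is not newer than the newest
    epoch in `maps`. The reply carries at most `max_maps` epochs, plus the
    newest epoch the sender knows about.
    """
    newest = max(maps)
    last = min(newest, first + max_maps - 1)
    return {
        "maps": {e: maps[e] for e in range(first, last + 1)},
        "newest_map": newest,
    }


def next_request(reply):
    """Return the next epoch the recipient should ask for, or None if caught up."""
    last_received = max(reply["maps"])
    if last_received < reply["newest_map"]:
        return last_received + 1
    return None
```

A recipient would keep calling `next_request()` on each reply until it returns `None`.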
Sage Weil [Sat, 28 Apr 2012 03:54:50 +0000 (20:54 -0700)]
osdmap: fix addr dedup check
Compare *every* address for a match, or else note that it is (or might be)
different. Previously, we falsely took diff==0 to mean that all addrs
were definitely equal, which was not necessarily the case.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
filestore: first lock osd mount point, next detect fs type
Fixes #2353. Problem was that there were (at least) two osd processes
that were racing for the fs detection, which triggered some errors
in the btrfs create/remove snapshot.
Sage Weil [Fri, 27 Apr 2012 04:51:23 +0000 (21:51 -0700)]
config: allow {get,set}_val on subsystem debug levels
This allows you to get and set subsystem debug levels via the
normal config access methods. Among other things, this allows librados
users to set debug levels.
Fixes: #2350
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 26 Apr 2012 18:12:11 +0000 (11:12 -0700)]
osdmap: dedup pg_temp
We only deal with the case where the entire map is identical, since the
individual items are too small to make the pointer overhead worthwhile.
Too bad. An in-memory btree-like structure would work better for this.
Sage Weil [Wed, 25 Apr 2012 23:22:14 +0000 (16:22 -0700)]
osdmap: drop obsolete PG_ROLE_* constants
These are cruft from the old primary/chain/splay replication code. All
current code says <0 is stray, 0 is primary, and >0 is replica. That is,
the role is the acting vector position, or -1 if not in the vector.
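As a small illustration of that convention (a Python sketch; `calc_role` and `describe` are hypothetical helper names for this example, not necessarily the ones in the tree):

```python
def calc_role(osd, acting):
    """Return the position of `osd` in the acting vector, or -1 if absent."""
    try:
        return acting.index(osd)
    except ValueError:
        return -1


def describe(role):
    """Map a role number onto the three cases named above."""
    if role < 0:
        return "stray"
    if role == 0:
        return "primary"
    return "replica"
```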
Sage Weil [Wed, 25 Apr 2012 22:10:34 +0000 (15:10 -0700)]
osdmap: use shared_ptr for addrs, addr vectors
We share a lot of identical addresses between map versions because they
don't tend to change very often. Instead of having a separate copy for
every map, use shared_ptr and share references. Also use a reference for
the entire addr vector(s) in case no addrs differ at all.
Create new encode/decode macros for vector< shared_ptr<T> >.
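The sharing idea can be sketched in Python, where object references play the part of shared_ptr. This is only an illustration under that assumption; `dedup_addrs` is a hypothetical name. Every entry is compared, and the whole vector is shared only when none of them differ.

```python
def dedup_addrs(prev, new):
    """Share entries of `new` with `prev` wherever they compare equal.

    Returns `prev` itself when every entry matches, so the entire vector is
    shared. Otherwise returns a new list whose unchanged entries are the
    very same objects that `prev` holds.
    """
    all_equal = len(prev) == len(new)
    out = []
    for i, addr in enumerate(new):
        if i < len(prev) and prev[i] == addr:
            out.append(prev[i])  # reuse the existing object
        else:
            out.append(addr)     # this entry really is different
            all_equal = False
    return prev if all_equal else out
```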
Sage Weil [Thu, 26 Apr 2012 23:45:56 +0000 (16:45 -0700)]
mon: consider pending_inc in {up,in}_ratio for can_mark_{out,down}()
Consider pending changes when calculating the current up/in ratios. Among
other things, this will make the marking of osds down->out stop once it
hits the min in ratio.
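A rough Python sketch of folding pending changes into the ratio. The names and the dict-based state are hypothetical, and the exact comparison against the minimum is an assumption made for the example.

```python
def effective_in_ratio(in_state, pending):
    """Fraction of osds that are 'in' once pending changes are applied.

    `in_state` maps osd id -> bool for the committed map; `pending` holds
    the not-yet-committed overrides (both are hypothetical stand-ins).
    """
    merged = dict(in_state)
    merged.update(pending)
    if not merged:
        return 0.0
    return sum(1 for is_in in merged.values() if is_in) / len(merged)


def can_mark_out(in_state, pending, min_in_ratio):
    """Refuse further marking once the effective ratio is below the minimum."""
    return effective_in_ratio(in_state, pending) >= min_in_ratio
```

Without the `pending` overrides, every check in the same round would see the same committed ratio and keep answering yes.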
Sage Weil [Wed, 25 Apr 2012 20:07:34 +0000 (13:07 -0700)]
osd: filter osds removed from probe set from peer_info_requested
peer_info_requested should be a strict subset of the probe set. Filter
osds that are dropped from probe from peer_info_requested. We could also
restart peering from scratch here, but this is less expensive, because we
don't have to re-probe everyone.
Once we adjust the probe and peer_info_requested sets, (re)check if we're
done: we may have been blocked on a previous peer_info_requested entry.
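In set terms the filtering amounts to an intersection followed by a completion check. A minimal Python sketch, with `filter_requested` as a hypothetical name and "done" simplified to "no requests outstanding":

```python
def filter_requested(peer_info_requested, probe):
    """Drop requests to osds that are no longer in the probe set.

    Returns the filtered set and whether nothing is outstanding any more,
    which is the condition to re-check after the sets change.
    """
    kept = peer_info_requested & probe
    done = len(kept) == 0
    return kept, done
```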
We don't need the lock in the WorkloadGenerator class. Everything that does
need a lock is handled by TestFileStoreState, and all that remains can be
handled by an atomic_t.
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
workload_generator: Delegate store tracking to TestFileStoreState.
We had a lot of duplicate code between the WorkloadGenerator and the
TestFileStoreState classes, and the last one is far more versatile than
what we initially had in the WorkloadGenerator. Therefore, delegate
everything we can to the TestFileStoreState class.
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
librados: call notification under different thread context
This fixes #2342. We shouldn't call notify on the dispatcher
context. We should also make sure that we don't hold
the client lock while waiting for the responses.
Also, pushed the client_lock locking into the
ctx->notify().
Sage Weil [Wed, 25 Apr 2012 16:23:49 +0000 (09:23 -0700)]
mon: 'osd thrash <num epochs>'
Thrash the osdmap for N iterations. Randomly mark OSDs up, down, in, out,
and up_thru in order to generate a difficult osdmap history for peering
to chew through.
Sage Weil [Wed, 25 Apr 2012 20:07:34 +0000 (13:07 -0700)]
osd: filter osds removed from probe set from peer_info_requested
peer_info_requested should be a strict subset of the probe set. Filter
osds that are dropped from probe from peer_info_requested. We could also
restart peering from scratch here, but this is less expensive, because we
don't have to re-probe everyone.
Once we adjust the probe and peer_info_requested sets, (re)check if we're
done: we may have been blocked on a previous peer_info_requested entry.
Sage Weil [Wed, 25 Apr 2012 00:21:27 +0000 (17:21 -0700)]
mon: add 'mon osd min up ratio' and 'mon osd min in ratio'
Prevent the monitor from marking osds down or out when too many are already
in that state. At this point the cluster is already broken and there is
little point in continuing to mark things down/out.
Setting these to 0 obviously disables the feature (by setting a minimum
of 0).
Sage Weil [Wed, 25 Apr 2012 18:15:34 +0000 (11:15 -0700)]
mon: use can_mark_*() helpers
So we can generalize beyond NO* flags. We'll soon be adding other reasons
to not mark things up/down/in/out. This lets us keep all those checks in
one place.
The helper methods will tell us why we can't do the thing (e.g., "NODOWN
flag is set"). The callers will generally tell us exactly what didn't
happen (e.g., "failure report of X ignored").
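The split between helper and caller might look like this in a Python sketch. The function names, the flag set, and the log list are hypothetical. Only the two quoted message fragments come from the text above.

```python
def can_mark_down(flags):
    """Return (allowed, reason). The reason says why the action is refused."""
    if "NODOWN" in flags:
        return False, "NODOWN flag is set"
    return True, ""


def handle_failure_report(flags, osd, log):
    """Caller side: report what did not happen, reusing the helper's reason."""
    ok, why = can_mark_down(flags)
    if not ok:
        log.append("failure report of osd.%d ignored: %s" % (osd, why))
        return False
    log.append("marking osd.%d down" % osd)
    return True
```

New reasons to refuse, such as a ratio check, would then be added in `can_mark_down()` only.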
TestFileStoreState: distinguish between 'get_coll()' and 'get_coll_at()'
get_coll_at(int pos) should return the collection at the map's position
'pos', but 'pos' was being used as a map key. Therefore, we add a new
function 'get_coll(int key)' to mimic this behavior, and we make
'get_coll_at()' follow its intended behavior.
This patch may affect the test_filestore_idempotent_sequence tester, since
it uses the 'get_coll_at()' function a lot, and we changed this function's
behavior.
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
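The difference between the two lookups, sketched in Python. This assumes the underlying map iterates in key order (as a C++ std::map does) and uses a plain dict with sorted keys as a stand-in; the function bodies are illustrative only.

```python
def get_coll(colls, key):
    """Look a collection up by its map key."""
    return colls.get(key)


def get_coll_at(colls, pos):
    """Return the collection at iteration position `pos`, or None if out of range."""
    keys = sorted(colls)
    if 0 <= pos < len(keys):
        return colls[keys[pos]]
    return None
```

With keys 10, 20, 30, position 1 and key 20 name the same collection, while key 1 does not exist.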
mon: decode old PGMap Incrementals differently from new ones
We need to distinguish between the old 0 (meaning undefined) and
the new 0 (meaning switch to 0 and disable the flags). So rev the
encoding version on PGMap::Incremental, and if you decode an old
version with [near]full_ratio == 0, set the ratio to -1 instead. Then
when applying the Incremental interpret -1 as no change.
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
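A Python sketch of the version-dependent decode and the apply step. The version numbers passed in and the function names are hypothetical, and any negative value is treated as "no change" here, which is a simplification of the -1 sentinel.

```python
def decode_ratio(struct_v, value, new_version):
    """Translate an old-format 0 (undefined) into the -1 sentinel."""
    if struct_v < new_version and value == 0:
        return -1.0
    return value


def apply_ratio(current, inc_ratio):
    """Apply an Incremental's ratio; a negative value leaves `current` alone."""
    if inc_ratio < 0:
        return current
    return inc_ratio
```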
Sage Weil [Tue, 24 Apr 2012 22:46:49 +0000 (15:46 -0700)]
mon: do not mark osds out if NOOUT flag is set
Do not mark down osds out when NOOUT flag is set. This is more or less
equivalent to setting a very long 'mon osd down out interval', but
reversible and less annoying.
Sage Weil [Tue, 24 Apr 2012 22:45:58 +0000 (15:45 -0700)]
mon: do not mark booting osds in if NOIN flag is set
If the NOIN osdmap flag is set, do not mark booting osds in. Normally
we would for a range of reasons (always, new, auto-marked-out), but block
them all.
Sage Weil [Tue, 24 Apr 2012 22:28:36 +0000 (15:28 -0700)]
mon: always remove booting osds from down_pending_out
The down_pending_out tracks OSDs that are down that we may want to
auto-mark out. If an osd boots, it should be removed from this list
because it is no longer down; it doesn't matter whether it is marked in
or not.
Sage Weil [Tue, 24 Apr 2012 21:22:10 +0000 (14:22 -0700)]
osd: do not attempt to boot if NOUP
If NOUP is set, do not send the boot message.
We already send onetime subscriptions to the osdmap, so we will find out
about osdmap flag changes. If it is cleared later, we'll pass into
start_boot() and _got_boot_version() again and send it then.
Sage Weil [Tue, 24 Apr 2012 16:43:44 +0000 (09:43 -0700)]
librbd: pass errors removing head back to user
In particular, the OSD may return EBUSY if there are still watchers.
Ignore ENOENT, as that may indicate we are cleaning up a previously
aborted removal.
Fixes: #2311
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
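The error handling described here follows a common pattern, sketched below in Python. `remove_head` and the `remove_object` callable are hypothetical stand-ins for the example, not the librbd API. Return values are assumed to be 0 or a negative errno.

```python
import errno


def remove_head(remove_object):
    """Remove the head object, passing errors back except ENOENT.

    `remove_object` is a hypothetical callable returning 0 on success or a
    negative errno on failure.
    """
    r = remove_object()
    if r < 0 and r != -errno.ENOENT:
        return r  # e.g. -EBUSY while watchers remain
    return 0      # success, or already gone from an earlier aborted removal
```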
Sage Weil [Tue, 24 Apr 2012 17:55:18 +0000 (10:55 -0700)]
mon: fix pg stats timeout
We clear out the osd entry when an osd goes up or down. Thus, if we find
it missing from an up osd, we should start the timer. Otherwise we get
behavior like this
2012-04-24 13:22:47.888291 7fa5bc587700 mon.peon5752@0(leader).osd e21633 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 521. Marking down!
2012-04-24 13:22:50.076394 7fa5bcd88700 log [INF] : osd.521 [2607:f298:4:2243::7088]:6806/53217 boot
2012-04-24 13:22:52.903558 7fa5bc587700 mon.peon5752@0(leader).osd e21638 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 521. Marking down!
2012-04-24 13:23:15.144532 7fa5bcd88700 log [INF] : osd.521 [2607:f298:4:2243::7088]:6806/53217 boot
2012-04-24 13:23:17.967118 7fa5bc587700 mon.peon5752@0(leader).osd e21663 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 521. Marking down!
2012-04-24 13:23:22.173778 7fa5bcd88700 log [INF] : osd.521 [2607:f298:4:2243::7088]:6806/53217 boot
2012-04-24 13:23:22.981556 7fa5bc587700 mon.peon5752@0(leader).osd e21668 OSDMonitor::handle_osd_timeouts: never got MOSDPGStat info from osd 521. Marking down!
2012-04-24 13:23:45.245380 7fa5bcd88700 log [INF] : osd.521 [2607:f298:4:2243::7088]:6806/53217 boot
when the pg stats message doesn't arrive quickly enough.
Fixes: #2341
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Greg Farnum <gregory.farnum@dreamhost.com>
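The intended behavior can be sketched in Python as below. The function signature and the dict of report times are hypothetical and only illustrate the "start the timer when the entry is missing" rule.

```python
def handle_osd_timeouts(now, up_osds, last_report, timeout):
    """Return the up osds whose pg stats are overdue.

    `last_report` maps osd id -> time of the last stats report. A missing
    entry for an up osd starts the timer instead of counting as a timeout.
    """
    to_mark_down = []
    for osd in up_osds:
        if osd not in last_report:
            last_report[osd] = now  # entry was cleared on up/down: start timing
        elif now - last_report[osd] > timeout:
            to_mark_down.append(osd)
    return to_mark_down
```

A freshly booted osd then gets a full timeout interval to deliver its first report.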
If this happened, we would not re-read the header with the new
snapshot, so the snapshot would not happen at the intended point
in time, but only after we re-read the header again.
* snapid should determine whether our mapped snapshot is gone, not snapname
* snap_set(<nonexistent_snap>) shouldn't reset us to CEPH_NOSNAP
* snapname should be set before using it in the perfcounter name
* snapname and image name don't need to be passed as arguments since an
ImageCtx already contains that info
* ictx_check() doesn't need to check for non-existent snaps - only I/Os care,
so check in check_io() instead
Sage Weil [Mon, 23 Apr 2012 20:57:25 +0000 (13:57 -0700)]
run_seed_to.sh: rework the script, make it more flexible and broaden the tests.
Allow for '-h' and other options such as disabling the journal sync tests,
specifying that it is to be run on a btrfs FS, enabling exit on error (default is
now 'off'), and allow certain env variables to specify additional options
to each store.
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>