Sage Weil [Wed, 2 Dec 2015 19:50:28 +0000 (14:50 -0500)]
osd: call on_new_interval on newly split child PG
We must call on_new_interval() on any interval change *and* on the
creation of the PG. Currently we call it from PG::init() and
PG::start_peering_interval(). However, PG::split_into() did not
do so for the child PG, which meant that the new child feature
bits were not properly initialized and the bitwise/nibblewise
debug bit was not correctly set. That, in turn, could lead to
various misbehaviors, the most obvious of which is scrub errors
due to the sort order mismatch.
Fixes: #13962
Signed-off-by: Sage Weil <sage@redhat.com>
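A minimal sketch of the invariant described above, assuming a heavily
simplified PG type. PgSketch and interval_state_ready are hypothetical
names invented for illustration; only on_new_interval(), init(),
start_peering_interval() and split_into() are named in the commit
message, and their real signatures differ.

```cpp
#include <cassert>

// Hypothetical, heavily simplified stand-in for a PG.
struct PgSketch {
  // Stands in for the per-interval state (feature bits, sort-order
  // debug bit) that on_new_interval() is responsible for setting up.
  bool interval_state_ready = false;

  void on_new_interval() { interval_state_ready = true; }

  // Creation path: initializes per-interval state.
  void init() { on_new_interval(); }

  // Interval-change path: re-initializes per-interval state.
  void start_peering_interval() { on_new_interval(); }

  // Split path: the child is a newly created PG, so it needs the same
  // initialization as any other new PG.
  void split_into(PgSketch& child) {
    // (moving objects and metadata to the child is omitted here)
    child.on_new_interval();
  }
};
```

The point of the sketch is only that every path that creates a PG or
changes its interval ends in the same initialization call.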
Ilya Dryomov [Sun, 29 Nov 2015 20:46:41 +0000 (21:46 +0100)]
rbd: bail if too many arguments provided
The code has a catch clause for that, but it was being rendered useless
by the preceding
  if (command_spec.size() > matching_spec->size())
    positional_options.add(at::POSITIONAL_ARGUMENTS.c_str(), -1);
which names all (both expected and extraneous) positional arguments.
Change it to name only expected arguments, deriving the number of
expected arguments from the length of positional_opts vector, supplied
by each action. This works for all actions except "feature enable" and
"feature disable" which are specified as multitoken, so keep on passing
in -1 for those.
Ilya Dryomov [Mon, 30 Nov 2015 15:36:43 +0000 (16:36 +0100)]
tests: update unmap.t CLI test
Fixup the exit code - the old CLI tried to differentiate between CLI
errors and action errors by returning EXIT_FAILURE in the former case.
Also remove a test that relied on a special case check in the old CLI.
Ilya Dryomov [Mon, 30 Nov 2015 15:29:56 +0000 (16:29 +0100)]
cmake: librbd needs libjournal and libcls_journal_client
Commit 4719696cadd1 ("cmake: updates for refactored librbd IO path")
fixed file lists but missed the link dependency - librbd now needs
libjournal and libcls_journal_client.
Chengyuan Li [Fri, 20 Nov 2015 05:29:39 +0000 (22:29 -0700)]
mon/PGMonitor: MAX AVAIL is 0 if some OSDs' weight is 0
In get_rule_avail(), p->second can be 0 and still be used as a
divisor. The quotient is then infinity, which becomes a negative
value when it is converted to an integer.
So we should check the value of p->second before the calculation.
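A minimal sketch of the guard, assuming simplified inputs. WeightMap,
AvailMap and min_projected_avail() are hypothetical names chosen for
illustration; this is not the body of get_rule_avail().

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-ins: osd id -> weight, osd id -> available bytes.
using WeightMap = std::map<int, float>;
using AvailMap = std::map<int, std::int64_t>;

// Returns the smallest projected capacity over all weighted OSDs,
// or -1 if no OSD contributed a value.
std::int64_t min_projected_avail(const WeightMap& weights,
                                 const AvailMap& avail) {
  std::int64_t min = -1;
  for (const auto& p : weights) {
    auto it = avail.find(p.first);
    if (it == avail.end())
      continue;
    // Guard: a zero weight would make the division below yield
    // infinity, which is not representable as an integer.
    if (p.second <= 0.0f)
      continue;
    auto proj = static_cast<std::int64_t>(
        static_cast<double>(it->second) / p.second);
    if (min < 0 || proj < min)
      min = proj;
  }
  return min;
}
```

With the guard in place, an OSD whose weight is 0 is simply left out of
the minimum instead of poisoning it.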
Josh Durgin [Thu, 26 Nov 2015 05:37:23 +0000 (21:37 -0800)]
pybind: decode empty string in conf_parse_argv() correctly
cretargs is an array of c_char_p, which means ctypes has already
converted it to python byte strings. decode_cstr() would misinterpret
the empty string as a NULL c_char_p(), and convert it to None by
accident, resulting in errors when running commands like
'ceph config-key put foo ""'.
Since this is the only place we use arrays of c_char_p, just decode
it directly in conf_parse_argv(). Tested with python 2 and 3.
Sage Weil [Wed, 25 Nov 2015 21:40:13 +0000 (16:40 -0500)]
mon/OSDMonitor: block 'ceph osd pg-temp ...' if update is pending
The OSD expects its pg_temp update requests to succeed. If it
races with an ill-timed admin request, it can get stuck in
WaitActingChange indefinitely.
This is only a real problem now that the OSD/mon interaction has
been updated with wip-bigbang; previously we would retry (although
it would take a while). Backporting is optional.
Sage Weil [Mon, 16 Nov 2015 16:32:34 +0000 (11:32 -0500)]
osdc/Objecter: call notify completion only once
If we race with a reconnect we could get a second notify message
before the notify linger op is torn down. Ensure we only ever
call the notify completion once to prevent a segfault.
Fixes: #13805
Signed-off-by: Sage Weil <sage@redhat.com>
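A minimal sketch of a call-once completion, assuming the completion is
modelled as a std::function. LingerOpSketch and handle_notify() are
hypothetical names; the Objecter uses its own types for this.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for a notify linger op.
struct LingerOpSketch {
  // Completion to run when the notify finishes; empty once consumed.
  std::function<void(int)> on_notify_finish;

  // Returns true if the completion ran, false if it had already run.
  bool handle_notify(int result) {
    if (!on_notify_finish)
      return false;  // a second message raced in; ignore it
    // Clear the member before invoking, so a re-entrant or repeated
    // delivery cannot reach the same completion again.
    auto cb = std::move(on_notify_finish);
    on_notify_finish = nullptr;
    cb(result);
    return true;
  }
};
```

Clearing the stored completion before calling it is what makes a
duplicate message harmless.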
Sage Weil [Sat, 14 Nov 2015 03:27:14 +0000 (22:27 -0500)]
mon/OSDMonitor: simplify failure reporters vs reports logic
Since each OSD only sends a failure report for a given peer once,
we don't need to count reports vs reporters separately. (This was
probably a bad idea anyway.) Remove this logic and the associated
config option.
Reported-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
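A minimal sketch of reporter-only accounting, assuming reporters are
identified by an integer OSD id. FailureInfoSketch and the
min_reporters parameter are hypothetical names, not the monitor's
actual structures or config options.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical stand-in for the failure state tracked per target OSD.
struct FailureInfoSketch {
  std::set<int> reporters;  // distinct OSDs that reported the failure

  // Records a report; returns true if this reporter is new.
  bool add_report(int reporter) {
    return reporters.insert(reporter).second;
  }

  // One threshold is enough, since each reporter is counted once.
  bool should_mark_down(std::size_t min_reporters) const {
    return reporters.size() >= min_reporters;
  }
};
```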
Sage Weil [Sat, 14 Nov 2015 03:11:17 +0000 (22:11 -0500)]
osd: simplify pg creation
We used to have a complicated pg creation process in which we
would query any previous mappings for the pg before we created the
new 'empty' pg locally. The tracking of the prior mappings was
very simple (and broken), but it didn't really matter because the
mon would resend pg create messages periodically. Now it doesn't,
so that broke.
However, none of this is necessary: the PG peering process does
all of the same things. Namely, it
- enumerates past intervals
- determines which ones may have been rw
- queries OSDs from each one to gather any potential changes
This is a more robust version of what the creation code was (or
should have been) doing. So, let's rip it all out and let
peering handle it. As long as the newly instantiated PG sets
last_epoch_started and _clean to the created epoch we will probe
and consider all of these prior mappings and find any previous
instance of the PG (if one existed).
Sage Weil [Mon, 12 Oct 2015 02:06:33 +0000 (22:06 -0400)]
mon/PGMonitor: avoid useless pg gets when pool is deleted
If the .0 pg no longer exists, we know the entire pool was
deleted, and can avoid querying every other pg. (This is a good
thing because leveldb and rocksdb can be very slow to query
missing keys.)
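A minimal sketch of the shortcut, assuming the store is modelled as a
std::map keyed by (pool, seed). PgId, load_pool_pgs() and the lookup
counter are hypothetical and exist only to show the early exit.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical pg identifier: (pool id, seed within the pool).
using PgId = std::pair<long, unsigned>;

// Collects the pgs of one pool that exist in the store, and counts
// how many lookups were needed to do so.
std::vector<PgId> load_pool_pgs(const std::map<PgId, int>& store,
                                long pool, unsigned pg_num,
                                unsigned* lookups) {
  std::vector<PgId> found;
  *lookups = 0;
  for (unsigned seed = 0; seed < pg_num; ++seed) {
    ++*lookups;
    if (store.count(PgId(pool, seed))) {
      found.push_back(PgId(pool, seed));
    } else if (seed == 0) {
      // The .0 pg is gone, so the whole pool was deleted: stop here
      // instead of issuing a missing-key lookup for every other pg.
      break;
    }
  }
  return found;
}
```

For a deleted pool this performs one lookup rather than pg_num lookups.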
Sage Weil [Thu, 8 Oct 2015 16:13:40 +0000 (12:13 -0400)]
mon/PGMonitor: revamp how pg creates are tracked
Previously we were calculating and managing in-core state that
wasn't committed as part of the pg_map, leading to all sorts of
ugliness that didn't really work. Instead,
* set mapping in all creating pgs in the committed pg_map
* make all pg create message sending be based on committed state
* update mappings for creating pgs every time we consume a new
osdmap, so that we have a reliable/stable epoch to attach to
it.
In particular, having that stable epoch means we have a reference
we can put in the pg create message that will also be used for
the subscription version. That way OSDs get consistent creates
from any mon.
Sage Weil [Wed, 7 Oct 2015 01:39:33 +0000 (21:39 -0400)]
mon/PGMonitor: send pg creates via persistent subscriptions, not spam
Generate and send pg create messages only for those OSDs who have
subscribed on this monitor. This is N times more efficient (where there
are N monitors) than the previous method.
Sage Weil [Thu, 8 Oct 2015 16:14:49 +0000 (12:14 -0400)]
mon/OSDMonitor: do not prime pg_temp for creating pgs
It will be less work for the old primary to ignore the create message
and the new one to query it and find nothing than for the slightly more
complicated peering and removal process to happen. Also, this reduces
bloat in the OSDMap a bit.
Sage Weil [Fri, 2 Oct 2015 13:15:33 +0000 (09:15 -0400)]
mon: disabled rocksdb compression when used as the backend
This significantly reduced CPU utilization on the bigbang scale
testing cluster at CERN. Note that it is already disabled for
leveldb by default (in ceph_mon.cc).
Sage Weil [Fri, 2 Oct 2015 13:06:29 +0000 (09:06 -0400)]
osd: cap adjusted max mon report interval at 2/3 of timeout
This ensures that we don't throttle back mon reports so much that
the mon times out due to no pg stat reports. Since there is
little value in having a lower max anyway, just set this at an
upper bound (relative to the mon's timeout value).
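A minimal sketch of the cap, assuming both values are plain seconds.
capped_max_report_interval() and its parameter names are hypothetical;
the real code reads these values from the OSD and mon configuration.

```cpp
#include <algorithm>
#include <cassert>

// Limits the (possibly backed-off) max report interval to 2/3 of the
// mon's timeout, so a report always arrives before the mon gives up.
double capped_max_report_interval(double adjusted_max,
                                  double mon_timeout) {
  assert(mon_timeout > 0.0);
  return std::min(adjusted_max, mon_timeout * 2.0 / 3.0);
}
```

For example, with a timeout of 900 seconds an adjusted max of 1000
seconds is cut to 600, while an adjusted max of 100 seconds is left
unchanged.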
Sage Weil [Wed, 30 Sep 2015 01:03:53 +0000 (21:03 -0400)]
osd: protect mon reporting with mon_report_lock
We need an exclusive lock over paths that update state related to
mon reports, lest they step on fields like up_thru_*, *stats_ack*,
last_mon_report, and so on. Everybody still needs a read lock
on map_lock too to get a stable OSDMap epoch.
Sage Weil [Wed, 23 Sep 2015 21:58:15 +0000 (17:58 -0400)]
osd: introduce explicit preboot stage
We want to separate the stage where we do a bunch of work
prior to booting (but intend to eventually boot), like when we
get maps and wait to be healthy, from the point after we've sent
the boot message while we are just waiting for a response (so that
we can avoid resending that boot message needlessly).
- start at PREBOOT in start_boot()
- transition to BOOTING in _send_boot()
- only call _preboot() while in PREBOOT state
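The three transitions above can be sketched as a small state machine.
This is a simplified illustration: OsdBootSketch, on_new_map() and the
counters are hypothetical, and the real OSD has more states and does
real work in each step.

```cpp
#include <cassert>

// Hypothetical subset of the OSD states relevant to booting.
enum class State { Initializing, Preboot, Booting, Active };

struct OsdBootSketch {
  State state = State::Initializing;
  int preboot_runs = 0;
  int boot_messages_sent = 0;

  // Entering the boot sequence always starts at PREBOOT.
  void start_boot() { state = State::Preboot; }

  // Pre-boot work (fetching maps, waiting to be healthy) only runs
  // while still in PREBOOT.
  void preboot() {
    if (state != State::Preboot)
      return;
    ++preboot_runs;
  }

  // Sending the boot message moves to BOOTING.
  void send_boot() {
    state = State::Booting;
    ++boot_messages_sent;
  }

  // Called for each new map. Once in BOOTING we are only waiting for
  // a response, so the boot message is not sent again.
  void on_new_map(bool healthy) {
    if (state != State::Preboot)
      return;
    preboot();
    if (healthy)
      send_boot();
  }
};
```

Because BOOTING is a distinct state, later map updates cannot trigger
a second boot message.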