Sage Weil [Sat, 14 Nov 2015 03:27:14 +0000 (22:27 -0500)]
mon/OSDMonitor: simplify failure reporters vs reports logic
Since each OSD only sends a failure report for a given peer once,
we don't need to count reports vs reporters separately. (This was
probably a bad idea anyway.) Remove this logic and the associated
config option.
Reported-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Sat, 14 Nov 2015 03:11:17 +0000 (22:11 -0500)]
osd: simplify pg creation
We used to have a complicated pg creation process in which we
would query any previous mappings for the pg before we created the
new 'empty' pg locally. The tracking of the prior mappings was
very simple (and broken), but it didn't really matter because the
mon would resend pg create messages periodically. Now it doesn't,
so that broke.
However, none of this is necessary: the PG peering process does
all of the same things. Namely, it
- enumerates past intervals
- determines which ones may have been rw
- queries OSDs from each one to gather any potential changes
This is a more robust version of what the creation code was (or
should have been) doing. So, let's rip it all out and let
peering handle it. As long as the newly instantiated PG sets
last_epoch_started and _clean to the created epoch we will probe
and consider all of these prior mappings and find any previous
instance of the PG (if one existed).
Sage Weil [Mon, 12 Oct 2015 02:06:33 +0000 (22:06 -0400)]
mon/PGMonitor: avoid useless pg gets when pool is deleted
If the .0 pg no longer exists, we know the entire pool was
deleted, and can avoid querying every other pg. (This is a good
thing because leveldb and rocksdb can be very slow to query
missing keys.)
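
A minimal sketch of the short-circuit described above, not the PGMonitor code itself. The dict-like `store`, the "pool.seed" key layout, and the function name are assumptions made for illustration.

```python
def load_pool_pgs(store, pool_id, pg_num):
    """Return the stored entries for a pool's PGs, or [] if the pool is gone.

    `store` is a hypothetical dict-like key/value store whose keys look
    like "<pool>.<seed in hex>"; this layout is an assumption.
    """
    # If the .0 PG is missing, treat the whole pool as deleted and skip
    # the remaining pg_num - 1 lookups of (probably missing) keys.
    if f"{pool_id}.0" not in store:
        return []
    found = []
    for seed in range(pg_num):
        key = f"{pool_id}.{seed:x}"
        if key in store:
            found.append(store[key])
    return found
```

With a deleted pool this costs one lookup instead of pg_num lookups.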
Sage Weil [Thu, 8 Oct 2015 16:13:40 +0000 (12:13 -0400)]
mon/PGMonitor: revamp how pg creates are tracked
Previously we were calculating and managing in-core state that
wasn't committed as part of the pg_map, leading to all sorts of
ugliness that didn't really work. Instead,
* set mapping in all creating pgs in the committed pg_map
* make all pg create message sending be based on committed state
* update mappings for creating pgs every time we consume a new
osdmap, so that we have a reliable/stable epoch to attach to
it.
In particular, having that stable epoch means we have a reference
we can put in the pg create message that will also be used for
the subscription version. That way OSDs get consistent creates
from any mon.
Sage Weil [Wed, 7 Oct 2015 01:39:33 +0000 (21:39 -0400)]
mon/PGMonitor: send pg creates via persistent subscriptions, not spam
Generate and send pg create messages only for those OSDs that have
subscribed on this monitor. This is N times more efficient (where there
are N monitors) than the previous method.
Sage Weil [Thu, 8 Oct 2015 16:14:49 +0000 (12:14 -0400)]
mon/OSDMonitor: do not prime pg_temp for creating pgs
It will be less work for the old primary to ignore the create message
and the new one to query it and find nothing than for the slightly more
complicated peering and removal process to happen. Also, this reduces
bloat in the OSDMap a bit.
Sage Weil [Fri, 2 Oct 2015 13:15:33 +0000 (09:15 -0400)]
mon: disabled rocksdb compression when used as the backend
This significantly reduced CPU utilization on the bigbang scale
testing cluster at CERN. Note that it is already disabled for
leveldb by default (in ceph_mon.cc).
Sage Weil [Fri, 2 Oct 2015 13:06:29 +0000 (09:06 -0400)]
osd: cap adjusted max mon report interval at 2/3 of timeout
This ensures that we don't throttle back mon reports so much that
the mon times out due to no pg stat reports. Since there is
little value in having a lower max anyway, just set this at an
upper bound (relative to the mon's timeout value).
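
The cap can be illustrated with a one-line calculation. This is only a sketch: the function and parameter names are hypothetical, and clamping with `min` is an assumption about how the bound is applied.

```python
def capped_max_report_interval(adjusted_max, mon_timeout):
    """Clamp the adjusted max report interval to 2/3 of the mon's timeout.

    Both arguments are in seconds. The names are illustrative, not Ceph
    config options.
    """
    upper_bound = mon_timeout * 2.0 / 3.0
    return min(adjusted_max, upper_bound)
```

For example, with a 900 second mon timeout, an adjusted max of 1000 seconds is cut to 600, while an adjusted max of 100 seconds is left alone.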
Sage Weil [Wed, 30 Sep 2015 01:03:53 +0000 (21:03 -0400)]
osd: protect mon reporting with mon_report_lock
We need an exclusive lock over paths that update state related to
mon reports, lest they step on fields like up_thru_*, *stats_ack*,
last_mon_report, and so on. Everybody still needs a read lock
on map_lock too to get a stable OSDMap epoch.
Sage Weil [Wed, 23 Sep 2015 21:58:15 +0000 (17:58 -0400)]
osd: introduce explicit preboot stage
We want to separate the stage where we do a bunch of work
prior to booting (but intend to eventually boot), like when we
get maps and wait to be healthy, from the point after we've sent
the boot message while we are just waiting for a response (so that
we can avoid resending that boot message needlessly).
- start at PREBOOT in start_boot()
- transition to BOOTING in _send_boot()
- only call _preboot() while in PREBOOT state
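
The three transitions above can be sketched as a tiny state machine. This is a Python illustration rather than the OSD's C++ code: the class, the `on_map_update` hook, and the `healthy` flag are hypothetical, and only `start_boot()`, `_preboot()`, `_send_boot()`, PREBOOT, and BOOTING come from the commit message.

```python
from enum import Enum


class BootState(Enum):
    PREBOOT = 1
    BOOTING = 2


class OSDBootSketch:
    def __init__(self):
        self.state = None
        self.boot_messages_sent = 0

    def start_boot(self):
        # Boot always begins in the PREBOOT stage.
        self.state = BootState.PREBOOT

    def on_map_update(self, healthy):
        # Hypothetical hook: only run preboot work while still in PREBOOT,
        # so a boot message that is already in flight is not resent.
        if self.state is BootState.PREBOOT:
            self._preboot(healthy)

    def _preboot(self, healthy):
        if healthy:
            self._send_boot()

    def _send_boot(self):
        self.boot_messages_sent += 1
        self.state = BootState.BOOTING
```

Once the sketch reaches BOOTING, further map updates leave `boot_messages_sent` unchanged.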
Sage Weil [Tue, 15 Sep 2015 20:08:02 +0000 (16:08 -0400)]
osd: exponential backoff on pg stats ack timeout
If we don't get a timely response to our pg stats update we fail
the mon connection and reconnect to a new mon. If the mons aren't
responding because they are overloaded (for example, because they
are overwhelmed with stats updates) this just makes the problem
worse.
Mitigate the situation by doing an exponential backoff on the
timeout. When we do successfully send an update, slowly decay the
timeout back to the initial value.
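
A rough sketch of the backoff-and-decay idea, assuming a multiplicative backoff, a multiplicative decay, and an upper limit. The class name, the parameter names, and every default value are illustrative and not taken from the Ceph source.

```python
class AckTimeoutSketch:
    def __init__(self, initial=30.0, backoff=2.0, decay=0.9, maximum=3600.0):
        self.initial = initial
        self.backoff = backoff    # multiplier applied after each timeout
        self.decay = decay        # multiplier (< 1) applied after each success
        self.maximum = maximum    # assumed ceiling so the timeout stays bounded
        self.current = initial

    def on_timeout(self):
        # No timely ack: wait longer next time instead of hammering the mons.
        self.current = min(self.current * self.backoff, self.maximum)
        return self.current

    def on_success(self):
        # Successful update: shrink slowly, never below the initial value.
        self.current = max(self.current * self.decay, self.initial)
        return self.current
```

Decaying gradually rather than resetting at once keeps a single lucky ack from returning an overloaded cluster straight to the aggressive timeout.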
Sage Weil [Mon, 14 Sep 2015 21:04:23 +0000 (17:04 -0400)]
mon/PGMonitor: avoid iterating over all pgs to find stale
Instead of iterating over all pgs when an osd goes down, make a
set of all osds that might have gone down, and only check pgs that
they manage. This is more efficient, especially for large clusters
with large numbers of OSDs.
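
A sketch of the narrower scan, assuming an index from each OSD to the PGs it is primary for. The index, the `osd_is_up` callback, and the function name are hypothetical; the real PGMonitor data structures may differ.

```python
def find_stale_pgs(pgs_by_primary, maybe_down_osds, osd_is_up):
    """Return the PGs whose primary OSD is confirmed down.

    pgs_by_primary:  dict mapping osd id -> iterable of pg ids (assumed index)
    maybe_down_osds: set of osd ids that might have gone down
    osd_is_up:       callable(osd id) -> bool
    """
    stale = set()
    # Visit only the candidate OSDs instead of every PG in the cluster.
    for osd in maybe_down_osds:
        if osd_is_up(osd):
            continue
        stale.update(pgs_by_primary.get(osd, ()))
    return stale
```

The work is proportional to the PGs on the candidate OSDs, not to the total PG count.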
David Coles [Wed, 11 Nov 2015 22:06:45 +0000 (14:06 -0800)]
ceph: Make stdout/stderr always output Unicode (UTF-8)
If a stream is not interactive, then under Python 2 the encoding for
stdout/stderr may be None. This means that it's not possible to print Unicode
characters since the encoding will fall back to ASCII.
This explicitly makes sys.stdout/sys.stderr always use UTF-8 encoding for
strings, regardless of the system's locale or whether the console is interactive or
not.
This matches the existing tests that assume that output of non-ASCII pool names
will be UTF-8 encoded.
When outputting raw binary data (such as the CRUSH-map), we must bypass the
codec and write directly to raw streams (since the new stream will only accept
ASCII byte-strings or Unicode strings).
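
The split between the text path and the raw path can be sketched with the standard `codecs` module. The example is written for Python 3 and uses an in-memory `io.BytesIO` as a stand-in for the real stdout; the function names are hypothetical and this is not the code from the `ceph` CLI.

```python
import codecs
import io


def make_utf8_writer(raw_stream):
    """Wrap a byte stream so that text written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(raw_stream)


def demo_output():
    raw = io.BytesIO()             # stand-in for the underlying byte stream
    out = make_utf8_writer(raw)
    out.write("pool-\u9ec4\n")     # text goes through the UTF-8 codec
    raw.write(b"\x00\x01\xff")     # binary data bypasses the codec
    return raw.getvalue()
```

Writing the binary payload to the wrapped stream instead would send arbitrary bytes through the text codec, which is the situation the last paragraph of the commit message avoids.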
David Coles [Tue, 27 Oct 2015 20:32:44 +0000 (13:32 -0700)]
pybind: Add decode_cstr helper function
This function attempts to decode a C-style string into a Python Unicode string.
It accepts an optional "size" parameter for the string length, otherwise it is
assumed that the string is NUL-terminated.
If the pointer is NULL, then this function returns None.
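
One way such a helper could look with `ctypes`, as a sketch under assumptions: the real helper's signature and error handling are not shown in this log, so the name `decode_cstr_sketch`, the use of a raw address, and the UTF-8 default are all illustrative.

```python
import ctypes


def decode_cstr_sketch(addr, size=None, encoding="utf-8"):
    """Decode a C string at address `addr` into a Python str.

    Returns None for a NULL pointer (None or 0). If `size` is omitted,
    the string is read up to its terminating NUL byte.
    """
    if not addr:
        return None
    if size is None:
        raw = ctypes.string_at(addr)        # reads until NUL
    else:
        raw = ctypes.string_at(addr, size)  # reads exactly `size` bytes
    return raw.decode(encoding)
```

Passing `size` matters for buffers that are not NUL-terminated or that contain embedded NUL bytes.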
David Coles [Tue, 20 Oct 2015 02:42:18 +0000 (19:42 -0700)]
pybind: Don't encode str on Python 2
If you attempt to call encode on a non-ASCII string, then a UnicodeDecodeError
will be raised.
Since str on Python 2 is an 8-bit string, it's possible that it's already UTF-8
encoded. As such we should just pass it through to the C API unmodified.
On Python 3 or if the user explicitly uses unicode, then we'll encode it to
UTF-8 for them.
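
The pass-through rule can be sketched in Python 3 terms, where `bytes` plays the role of the Python 2 8-bit `str` and `str` plays the role of `unicode`. The function name is hypothetical and this is not the pybind helper itself.

```python
def to_c_bytes_sketch(value):
    """Return bytes suitable for passing to a C API.

    Byte strings are assumed to be encoded already and pass through
    unmodified; text strings are encoded to UTF-8.
    """
    if isinstance(value, bytes):
        return value
    return value.encode("utf-8")
```

Encoding only text strings avoids the double-encoding attempt that raises UnicodeDecodeError on Python 2 when `encode` is called on a non-ASCII 8-bit string.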