in previous change, MDSMap::min_compat_client was changed from int8_t to
ceph_release_t, i.e. uint8_t, and in
Server::update_required_client_features(), we check the
MDSMap::min_compat_client to see if it is greater than given version, so
a negative "-1" would overflow and be interpreted as a 255, hence will
be always greater than whatever version is compared. so we need to
bump up the encoding version, and
* normalize the number if it is -1
* ignore the number which is way too large.
we have following pains when it comes to ceph release related
programming:
* we use int, uint8_t, uint32_t, unsigned int for representing the ceph
release, i.e., jewel, luminous, nautilus, in different places in our
source tree.
* we always need to add a comment aside of `uint8_t release` to help
the folks to understand that it is CEPH_RELEASE_*.
* also we keep forgetting that "os << release" actually prints the
release as an ASCII.
* and it's painful to remember that we have to translate the release
number using `ceph_release_name()` before print it out in the human
readable format.
* we replicate the n+2 upgrade policy in multiple places
in this change, `ceph_release_t` and some helper functions are
intruduced to alleviate the pains above.
* add a scoped enum for representing ceph releases, so the release
is typed . which means that we can attach different function to
it. and in future, we can even replace `ceph_release_t` with
a class if we need to support more fancy features which cannot be
implemented using free functions.
* add `ostream<<()` operator for `ceph_release_t`, so we can simply
send it to `ostream`
* add `can_upgrade_from()` so we don't need to repeat ourselves.
* move ceph_release_from_name() to ceph_release.{h,cc}, as currently,
ceph_release.cc uses `ceph_release_name()` for implementing
`ostream<<()`, and after this change, `ceph_release_from_name()`
will return `ceph_release_t`, so if we keep `ceph_release_from_name()`
where it was, these two headers will be included by each other,
which is a no-go.
* reimplement `ceph_release_from_name()` using a loop. before this
change, `ceph_release_from_name()` was implemented using a manually
unrolled if-else structure, which is more performant, but the
downside is that, it replicates mapping between release number
and its name. so after this change, a loop is used instead.
as this function is not used in the critical path, so this change
should not have visible impact on the performance.
* always use ceph_release_t::unknown as the default value of the
"release" member variables. before this change, sometimes, we use
"0" and sometimes we use "1", after inspecting the code, i found that
"0" is good enough to cover all the use cases. and since "0" is a
magic number in this context, it is replaced using
`ceph_release_t::unknown`. to facilidate the checking against
`ceph_release_t::unknown`, `operator!()` is added.
* ceph::to_string() and ceph::to_integer<>() are added to help
to remove the asssumption of the underlying type of `ceph_release_t`,
ideally, users of `ceph_release_t` should not use `static_cast<>` to
cast it into integer types, instead, they should use
`ceph::to_integer<>()` to do this job. if, in future, we want to
use a `class` to represent `ceph_release_t`, we can get this done
with minimum change, if `ceph::to_string()` and `ceph::to_string()`
are used. we can not specialize them in `std` naming space. as
it's claimed that it's undefined behavior to do so. see
https://en.cppreference.com/w/cpp/language/extending_std .
Kefu Chai [Thu, 2 May 2019 15:48:56 +0000 (23:48 +0800)]
test/common/test_util: skip it if /etc/os-release does not exist
some GNU/Linux distros do not ship this file, and we should not fail the
test on them.
inspired by
http://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/patches/ceph-skip-collect-sys-info-test.patch?id=48f19e60c4677e392ee2c23f28098cfcaf9d1710
Jason Dillaman [Tue, 30 Apr 2019 16:56:08 +0000 (12:56 -0400)]
librbd: allow AioCompletion objects to be blocked
This will be used when user-provided memory is wrapped into a
ceph::buffer::raw pointer to prevent its release prior to the
drop of its last reference internally.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Tue, 6 Mar 2018 19:23:47 +0000 (14:23 -0500)]
librbd: switch to lock-free queue for event poll IO interface
'perf' shows several percent of CPU being wasted on lock contention
in the event poll interface. The 'fio' RBD engine uses this poll
IO interface by default when available.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Fri, 26 Apr 2019 18:24:15 +0000 (14:24 -0400)]
librbd: avoid using lock within AIO completion where possible
'perf' shows several percent of CPU is being utilized handling the
heavyweight locking semantics of AIO completion. With these changes,
the lock contention disappears.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Mon, 29 Apr 2019 14:13:21 +0000 (10:13 -0400)]
librbd: simplify IO flush handling through AsyncOperation
Allow ImageFlushRequest to directly execute a flush call through
AsyncOperation. This will allow the flush to be directly linked
to its preceeding IOs.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
PeeringState: don't zero backfill target num_bytes on activation
834d3c19a774f1cc93903447d91d182776e12d18 preserves num_bytes
on backfill targets in order to estimate space required to complete
backill. However, from activation until backfill reservation,
info.stats.stats.sum.num_bytes is persisted to disk as 0 messing
up future intervals. Instead, preserve it in the info sent during
recovery and leave it alone in RequestBackfillPrio.
Additionally, it's possible for backfill to be preempted between
last_backfill=MAX being sent to the replica and Backfilled being
queued occuring. In that case, the stats get on reservation
and the replica ends up with invalid stats.
Samuel Just [Wed, 10 Apr 2019 02:29:05 +0000 (19:29 -0700)]
osd/: condense missing mutations for recovery/repair/errors
At a high level, this patch attempts to unify the various
sites at which some combination of
- mark object missing in one or more pg_missing_t
- mark object needs_recovery in missing_loc
- manipulate the locations map based on external information
occur. It seems to me that the pg_missing_t and missing_loc
should be in sync except for the mark_unfound_lost_revert
case and the case where we are about to do a backfill push.
This patch also cleans up repair_object. It sort of worked by accident
for non-ec non-primary bad peers. It didn't update missing_loc, so
needs_recovery() returns the wrong answer. However, is_unfound() does
as well, so ReplicatedBackend is nevertheless happy as the object would
be present on the primary. This patch updates the behavior to be
uniform as in the other force_obejct_missing cases.
Samuel Just [Sun, 21 Apr 2019 01:51:08 +0000 (18:51 -0700)]
test-erasure-eio: first eio may be fixed during recovery
The changes to the way EC/ReplicatedBackend communicate read
t showerrors had a side effect of making first eio on the object in
TEST_rados_get_subread_eio_shard_[01] repair itself depending
on the timing of the killed osd recovering. The test should
be improved to actually test that behavior at some point.
PeeringState was the only remaining caller -- both the interface in
PGBackend and the pass-through in PrimaryLogPG were actually unused.
Further, the WaitLocalRecoveryReserved user does not appear to actually
require the fresh-off-of-the-wire osdmap available from OSDService.
As such, moved the logic into OSDMap itself and simply used the
PG local osdmap.
Note, the above is a change in behavior that probably could use a second
opinion.