Roman Penyaev [Tue, 30 Apr 2019 15:43:01 +0000 (17:43 +0200)]
global/global_init: do first transport connection after setuid()
uverbs kernel module forbids access to a file descriptor after a credentials
change, which leads to -EACCES on each subsequent ibv_*() call.
Why does this matter? The InfiniBand transport stops working after the following
syscalls:
o setuid()
o fork()
Originally the problem was described here [1] and here [2].
This patch targets only setuid() syscall and moves the first transport
initialization after setuid() has been done.
fork() is used to daemonize ceph services (when systemd is not used
for whatever reason); probably the easiest fix there is to rip out the whole
legacy daemonization code, so this patch does not target that problem.
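As a standalone illustration of the ordering constraint (a sketch against the
plain libibverbs API, not the Ceph messenger code): the first uverbs context
must be created only after privileges have been dropped.

    // Standalone sketch, not Ceph code: a uverbs fd opened before setuid()
    // is rejected by the kernel afterwards, so drop privileges first and
    // only then open the first RDMA device context.
    #include <infiniband/verbs.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
      if (setuid(getuid()) != 0) {   // drop privileges before any ibv_*() call
        perror("setuid");
        return 1;
      }

      int num = 0;
      struct ibv_device **devs = ibv_get_device_list(&num);
      if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
      }
      struct ibv_context *ctx = ibv_open_device(devs[0]);
      // ... the messenger's first transport connection can now be set up ...
      if (ctx)
        ibv_close_device(ctx);
      ibv_free_device_list(devs);
      return 0;
    }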
Kefu Chai [Tue, 7 May 2019 12:57:17 +0000 (20:57 +0800)]
seastar: pick up changes for better performance
to be specific, a78fb44c96e2912c6f39b2151f94a0bb2b5796a6 helps to
improve the performance of the future implementation -- with this change
a future can always reference its local state without checking its `_promise`
and dereferencing it.
Kefu Chai [Tue, 7 May 2019 07:06:42 +0000 (15:06 +0800)]
crimson/osd: shutdown services in the right order
we should stop the config service *after* the osd is stopped, as the osd
depends on a working and alive config subsystem while stopping itself. for
instance, the destructor of AuthRegistry unregisters itself from the ObserverMgr,
which is in turn a member variable of ConfigProxy, so if ConfigProxy is
destroyed before we destroy mon::Client, we get a segfault with the
following backtrace:
ObserverMgr<ceph::md_config_obs_impl<ceph::common::ConfigProxy>
>::remove_observer(ceph::md_config_obs_impl<ceph::common::ConfigProxy>*)
at /var/ssd/ceph/build/../src/common/config_obs_mgr.h:78
AuthRegistry::~AuthRegistry() at
/var/ssd/ceph/build/../src/crimson/common/config_proxy.h:101
(inlined by) AuthRegistry::~AuthRegistry() at
/var/ssd/ceph/build/../src/auth/AuthRegistry.cc:28
ceph::mon::Client::~Client() at
/var/ssd/ceph/build/../src/crimson/mon/MonClient.h:44
ceph::mon::Client::~Client() at
/var/ssd/ceph/build/../src/crimson/mon/MonClient.h:44
OSD::~OSD() at /usr/include/c++/9/bits/unique_ptr.h:81
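A rough sketch of the intended teardown order (the service names and stop()
signatures below are illustrative stand-ins, not the exact crimson code):

    #include <seastar/core/future.hh>

    // Illustrative stand-ins; the real crimson types differ.
    struct OSDService {
      seastar::future<> stop() { return seastar::make_ready_future<>(); }
    };
    struct ConfigService {
      seastar::future<> stop() { return seastar::make_ready_future<>(); }
    };

    // Stop the OSD first; only then stop the config service it depends on,
    // otherwise the OSD teardown (mon::Client / AuthRegistry destructors)
    // would touch an already-destroyed ObserverMgr inside ConfigProxy.
    seastar::future<> stop_services(OSDService& osd, ConfigService& config) {
      return osd.stop().then([&config] {
        return config.stop();
      });
    }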
vstart.sh: enable creating multiple OSDs backed by spdk backend
Currently vstart.sh only supports deploying one OSD backed by an NVMe SSD.
The following two cases will cause errors:
1. There are 2 or more NVMe SSDs from the same vendor on the machine
2. Trying to deploy 2 or more OSDs when only 1 pci_id is available
Add support for deploying multiple OSDs on a machine with
multiple NVMe SSDs.
Changcheng Liu [Mon, 6 May 2019 02:29:11 +0000 (10:29 +0800)]
vstart.sh: correct ceph-run path
ceph-run is in the same directory as vstart.sh. vstart.sh is often run
from the build directory; without giving the right directory, the ceph-run
file cannot be found.
Signed-off-by: Changcheng Liu <changcheng.liu@intel.com>
Sage Weil [Thu, 2 May 2019 19:34:53 +0000 (14:34 -0500)]
osd: clean up osdmap sharing
- always use the Session::last_sent_epoch value, both for clients and osds
- get rid of the stl map<> of peer epochs
- consolidate all map sharing into a single maybe_share_map()
- optionally take a lower bound on the peer's epoch, for use when it is
available (e.g., when we are handling a message that specifies what
epoch the peer had when it sent the message)
- use const OSDMapRef& where possible
- drop osd->is_active() check, since we no longer have any dependency on
OSD[Service] state beyond our osdmap
The old callchain was convoluted, partly because it was needlessly
separated into several layers of helpers, and partly because the tracking
for clients and peer OSDs was totally different.
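A condensed sketch of the consolidated logic (the types are stubbed out and
the sender is hypothetical; the real OSD code differs in detail):

    #include <algorithm>
    #include <cstdint>
    #include <memory>

    // Stand-in types; the real Session/OSDMapRef live in the OSD code.
    using epoch_t = uint32_t;
    struct OSDMap {
      epoch_t epoch = 0;
      epoch_t get_epoch() const { return epoch; }
    };
    using OSDMapRef = std::shared_ptr<const OSDMap>;
    struct Session { epoch_t last_sent_epoch = 0; };

    void send_incremental_map(epoch_t /*from*/, Session*, const OSDMapRef&) {
      // stub: the real OSD sends incremental map updates here
    }

    // One helper for both clients and peer OSDs: share the map only if the
    // session has not already seen the current epoch, optionally using a
    // lower bound on the peer's epoch taken from the message being handled.
    void maybe_share_map(Session* session, const OSDMapRef& osdmap,
                         epoch_t peer_epoch_lb = 0) {
      epoch_t last = std::max(session->last_sent_epoch, peer_epoch_lb);
      if (last >= osdmap->get_epoch())
        return;                                 // peer is already up to date
      send_incremental_map(last, session, osdmap);
      session->last_sent_epoch = osdmap->get_epoch();
    }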
Kefu Chai [Thu, 2 May 2019 15:48:56 +0000 (23:48 +0800)]
test/common/test_util: skip it if /etc/os-release does not exist
some GNU/Linux distros do not ship this file, and we should not fail the
test on them.
inspired by
http://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/patches/ceph-skip-collect-sys-info-test.patch?id=48f19e60c4677e392ee2c23f28098cfcaf9d1710
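A minimal sketch of the guard, assuming a gtest-style test body (the actual
test_util code may differ):

    #include <unistd.h>
    #include <iostream>
    #include <gtest/gtest.h>

    TEST(util, collect_sys_info) {
      // skip rather than fail on distros that do not ship /etc/os-release
      if (access("/etc/os-release", R_OK) != 0) {
        std::cout << "/etc/os-release not found, skipping" << std::endl;
        return;  // GTEST_SKIP() is an alternative with newer googletest
      }
      // ... collect and verify the distro/version fields ...
    }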
Jason Dillaman [Tue, 30 Apr 2019 16:56:08 +0000 (12:56 -0400)]
librbd: allow AioCompletion objects to be blocked
This will be used when user-provided memory is wrapped in a
ceph::buffer::raw pointer, to prevent its release before its last
internal reference is dropped.
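A simplified sketch of the idea (illustrative only, not the librbd API): the
completion carries a blocker count, and its final callback, and therefore the
release of any wrapped user buffer, is deferred until that count reaches zero.

    #include <mutex>
    #include <functional>

    class AioCompletionSketch {
      std::mutex lock;
      int blockers = 0;        // outstanding block() calls
      bool finished = false;   // the IO itself has completed
      std::function<void()> on_complete;

    public:
      explicit AioCompletionSketch(std::function<void()> cb)
        : on_complete(std::move(cb)) {}

      void block() { std::lock_guard<std::mutex> l(lock); ++blockers; }

      void unblock() {
        std::unique_lock<std::mutex> l(lock);
        if (--blockers == 0 && finished) {
          l.unlock();
          on_complete();   // safe: nothing references the user buffer anymore
        }
      }

      void complete() {
        std::unique_lock<std::mutex> l(lock);
        finished = true;
        if (blockers == 0) {
          l.unlock();
          on_complete();
        }
      }
    };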
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Tue, 6 Mar 2018 19:23:47 +0000 (14:23 -0500)]
librbd: switch to lock-free queue for event poll IO interface
'perf' shows several percent of CPU being wasted on lock contention
in the event poll interface. The 'fio' RBD engine uses this poll
IO interface by default when available.
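Roughly, the idea looks like the sketch below (using boost::lockfree::queue;
the names and structure are illustrative, not the librbd implementation):

    #include <boost/lockfree/queue.hpp>
    #include <cstddef>

    struct EventEntry { void *completion = nullptr; };

    // completed events are pushed without taking the completion lock
    boost::lockfree::queue<EventEntry*> event_queue{128};

    // producer side: called from the AIO completion path
    void queue_event(EventEntry* e) {
      while (!event_queue.push(e)) {
        // queue full; in practice capacity is sized to the inflight IO depth
      }
    }

    // consumer side: called from the poll IO interface
    size_t poll_events(EventEntry** out, size_t max) {
      size_t n = 0;
      EventEntry* e = nullptr;
      while (n < max && event_queue.pop(e)) {
        out[n++] = e;
      }
      return n;
    }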
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Fri, 26 Apr 2019 18:24:15 +0000 (14:24 -0400)]
librbd: avoid using lock within AIO completion where possible
'perf' shows that several percent of CPU is spent handling the
heavyweight locking semantics of AIO completion. With these changes,
the lock contention disappears.
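A simplified sketch of the kind of change involved (illustrative only):
tracking outstanding sub-requests with an atomic counter so the completion
fast path does not need to take a mutex.

    #include <atomic>
    #include <functional>

    class CompletionCounterSketch {
      std::atomic<int> pending{0};
      std::function<void()> on_finish;

    public:
      explicit CompletionCounterSketch(std::function<void()> cb)
        : on_finish(std::move(cb)) {}

      void add_request() { pending.fetch_add(1, std::memory_order_relaxed); }

      void complete_request() {
        // release/acquire pairing makes the sub-request's writes visible to
        // the thread that observes the counter reaching zero
        if (pending.fetch_sub(1, std::memory_order_acq_rel) == 1) {
          on_finish();
        }
      }
    };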
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Mon, 29 Apr 2019 14:13:21 +0000 (10:13 -0400)]
librbd: simplify IO flush handling through AsyncOperation
Allow ImageFlushRequest to directly execute a flush call through
AsyncOperation. This will allow the flush to be directly linked
to its preceding IOs.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
PeeringState: don't zero backfill target num_bytes on activation
834d3c19a774f1cc93903447d91d182776e12d18 preserves num_bytes
on backfill targets in order to estimate the space required to complete
backfill. However, from activation until backfill reservation,
info.stats.stats.sum.num_bytes is persisted to disk as 0, messing
up future intervals. Instead, preserve it in the info sent during
recovery and leave it alone in RequestBackfillPrio.
Additionally, it's possible for backfill to be preempted between
last_backfill=MAX being sent to the replica and Backfilled being
queued. In that case, the stats get zeroed on the new reservation
and the replica ends up with invalid stats.
Samuel Just [Wed, 10 Apr 2019 02:29:05 +0000 (19:29 -0700)]
osd/: condense missing mutations for recovery/repair/errors
At a high level, this patch attempts to unify the various
sites at which some combination of the following occurs:
- marking an object missing in one or more pg_missing_t
- marking an object needs_recovery in missing_loc
- manipulating the locations map based on external information
It seems to me that pg_missing_t and missing_loc
should be in sync except for the mark_unfound_lost_revert
case and the case where we are about to do a backfill push.
This patch also cleans up repair_object. It sort of worked by accident
for non-ec non-primary bad peers. It didn't update missing_loc, so
needs_recovery() returned the wrong answer. However, is_unfound() did
as well, so ReplicatedBackend was nevertheless happy, as the object would
be present on the primary. This patch updates the behavior to be
uniform with the other force_object_missing cases.
Samuel Just [Sun, 21 Apr 2019 01:51:08 +0000 (18:51 -0700)]
test-erasure-eio: first eio may be fixed during recovery
The changes to the way EC/ReplicatedBackend communicate read
errors had a side effect of making the first eio on the object in
TEST_rados_get_subread_eio_shard_[01] repair itself, depending
on the timing of the recovery of the killed osd. The test should
be improved to actually test that behavior at some point.