Adam Kupczyk [Mon, 22 Mar 2021 10:20:11 +0000 (11:20 +0100)]
os/bluestore: Make Onode::put/get resiliant to split_cache
In
OnodeCacheShard* ocs = c->get_onode_cache();
std::lock_guard l(ocs->lock);
while waiting for lock, split_cache might have changed OnodeCacheShard.
This will result in adding Onode to improper OnodeCacheShard.
Such action is obviously bad, as we will operate in future (at least once) on
different OnodeCacheShard then we got lock for. Particulary sensitive to this
are _trim and split_cache functions, as they iterate over elements.
Kefu Chai [Thu, 25 Mar 2021 09:08:48 +0000 (17:08 +0800)]
run-make-check.sh: let ctest generate XML output
to enable XUnit plugin of jenkins to consume the ctest output and
publish it in the dashboard, we need to
* let ctest generate XML output instead of plain text output
* do not fail the test if any test case fails. this allows the publisher
to do its job by checking the XML output.
* prevent ctest from compressing the output. see
https://issues.jenkins.io/browse/JENKINS-21737
Kamoltat [Mon, 8 Feb 2021 15:45:06 +0000 (15:45 +0000)]
qa/tasks/mgr/test_progress: fix wait_until_equal
Octopus ceph_test_case doesn't have period arg
so remove that in wait_until_equal. Also increase
time to wait for complete events by using RECOVERY_PERIOD
instead of EVENT_CREATION_PERIOD
Not needed in masters because only octopus and nautilus
doesn't have a period argument in qa/tasks/mgr/test_progress.py
wait_until_equals() function
Kefu Chai [Sat, 20 Mar 2021 05:00:01 +0000 (13:00 +0800)]
install-deps.sh: remove existing ceph-libboost of different version
we install different versions of precompiled ceph-libboost packages
for different branches when building and testing them on ubuntu test
nodes. for instance,
- nautilus, octopus: v1.72
- pacific: v1.73
they share the same set of test nodes. and these ceph-libboost packages
conflict with each other, because they install files to the same places.
in order to avoid the confliction, we should uninstall existing packages
before installing a different version of ceph-libboost packages.
ceph-libboost${version}-dev is a package providing the shared headers of
boost library, so, in this change we check if it is installed before
returning or removing the existing packages.
delete the part where _osd_in_out_completed_events_count()
was called in test_osd_cannot_recover() and revert to initial
state of the function since we don't need to use this function
in octopus. Also delete a duplicate of _osd_in_out_events_count().
This must be added by mistake in #39289 as well.
No need to fix for the backport in Nautilus: #38173
since the bugs are occured by adding additional code to
the cherry-pick specifically for Octopus.
Ilya Dryomov [Wed, 17 Mar 2021 10:00:33 +0000 (11:00 +0100)]
qa: krbd_blkroset.t: update for separate hw and user read-only flags
Since kernel 5.12, hardware read-only state and user read-only
policy (BLKROGET/SET ioctls) are tracked separately in the block
layer. As the purpose of our ->set_read_only() method was exactly
that, it was removed.
As a side effect, BLKROSET no longer returns EROFS on an attempt
to make a read-only mapping read-write with "blockdev --setrw".
The policy gets updated, but the device remains read-only as before
because the hardware (== mapping) state is controlled by the driver.
Yuval Lifshitz [Tue, 17 Nov 2020 11:31:59 +0000 (13:31 +0200)]
rgw/notification: trigger notifications on changes from any user
any user authorized to make changes to a bucket may trigger
notifications defined on that bucket.
manual test procedure of the fix is described here:
https://gist.github.com/yuvalif/39c183aa0f74d286ecef7844268817df
Xuehan Xu [Sun, 7 Feb 2021 04:40:36 +0000 (12:40 +0800)]
mgr: relax osd ok-to-stop condition on degraded pgs
Right now, the "ok-to-stop" condition is relatively rigorous, it allows
stopping an osd only if no PG on it is non-active or degraded. But there
are situations in which an OSD is part of a degraded pg and the pg still
still have > min_size complete replicas after the OSD is stopped.
In 9750061d5d4236aaba156d60790e0b8bcd7cfb64, we changed from considering
just acting to using avail_no_missing (OSDs that have no missing objects).
When the projected pg_acting is constructed this way, we can safely compare
to min_size... even for a PG marked degraded.
Ilya Dryomov [Fri, 19 Feb 2021 15:47:17 +0000 (16:47 +0100)]
krbd: make sure the device node is accessible after the mapping
We have always assumed this to be the case and users' scripts and
orchestration tools have grown to depend on this. Let's add some
enforcement, prompted by [1]:
"I am running my Kubernetes worker node inside of an LXC container
which doesn't benefit from the device node created by the kernel, so
I'm using udev to create the /dev/rbd* device nodes inside of the LXC
container."
which, through the unfortunate interaction with ceph-csi rbd plugin,
results in data loss for "volumeMode: Filesystem" PVs because it ends
up recreating the filesystem every time the PV is attached to the pod:
"When deleting the pod and re-creating it, I can see that the RBD
image is indeed being reformatted. This seems to be because when
blkid is being run to check if the image is formatted, the /dev/rbd*
device has not yet been created by udev. By the time the code gets
down to running mkfs, the device is there and the damage is done."
Jason Dillaman [Thu, 11 Feb 2021 20:54:01 +0000 (15:54 -0500)]
librbd/mirror: leave non-primary snapshot images in creating state
The creating state is a special case in rbd-mirror where it will
automatically delete the image since it assumes it's malformed.
A non-primary, snapshot-based mirror image needs to have at least
one non-primary snapshot and the first one is not created until
after replay has started. Now rbd-mirror will update the mirror
image state to the enabled state after creating the first
non-primary snapshot but before attempting the sync.
Fixes: https://tracker.ceph.com/issues/49238 Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 43f2c208fa3042d93e4810d804ffe28e9ca7af77)
Jason Dillaman [Thu, 11 Feb 2021 20:45:01 +0000 (15:45 -0500)]
rbd-mirror: ensure that the last non-primary snapshot cannot be pruned
Tweak the normal pruning behavior to ensure that an incomplete initial
non-primary snapshot is not included in the prune set since we know
it will be complete since otherwise the image would have been deleted
due to not updating the mirror-image-state to enabled. Also ensure
we cannot prune a non-primary mirror snapshot if we don't have a
predecessor.
Jason Dillaman [Thu, 11 Feb 2021 20:23:56 +0000 (15:23 -0500)]
rbd-mirror: update snapshot mirror image state after snapshot creation
The non-primary mirror snapshot is what is used to link the non-primary
to the primary image. If there is an interruption between creating the
non-primary image and the creation of the first non-primary snapshot,
the images will be considerered unlinked.
A future commit will modify librbd to avoid setting the mirror image
state to enabled for non-primary snapshot-based mirroring images.
rbd-mirror will already automatically delete images in the CREATING
state during the bootstrap phase.
Ilya Dryomov [Mon, 8 Feb 2021 16:01:47 +0000 (17:01 +0100)]
librbd: don't hold owner_lock for validate_image_removal()
handle_exclusive_lock() and handle_shut_down_exclusive_lock() call
validate_image_removal() without owner_lock held, so holding it in
shut_down_exclusive_lock() appears to be redundant.
Ilya Dryomov [Sun, 7 Feb 2021 14:09:24 +0000 (15:09 +0100)]
librbd: treat EROFS as expected in handle_acquire_lock()
If the peer refuses to release exclusive lock (e.g. in case automatic
exclusive lock transitions are disabled), EROFS is retured. Suppress
a rather confusing "Read-only file system" error message -- this case
is no different from EBUSY or EAGAIN.
Ilya Dryomov [Sun, 7 Feb 2021 12:46:15 +0000 (13:46 +0100)]
librbd: refuse to release exclusive lock when removing
Commit 25c2ffe145be ("librbd: acquire exclusive lock from peer when
removing") changed PreRemoveRequest to request exclusive lock from the
peer instead of giving up and proceeding without exclusive lock. This
caused one of the test cases that sometimes runs concurrent "rbd rm"
against the same image to fail intermittently, most often on assert
because exclusive lock is now automatically transitioned to another
"rbd rm" on its request.
The root cause is older and probably goes back to when synchronous
librbd::remove() which held owner_lock across all operations including
trim_image() was converted to a set of state machines. Since then, any
peer that requests exclusive lock (instead of trying once and backing
off) is able to mess with image removal.
Install StandardPolicy to disable automatic exclusive lock transitions
during image removal.
It appears that commit 6eb8f30a238 broke the test utility and
its failure was masked by the test case that expected a failure
due to a timeout force-killing the app.
Fixes: https://tracker.ceph.com/issues/49117 Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 8643b046fb4d5b05b4c75b83f16cd8ccc6a8b0a0)
Jason Dillaman [Wed, 16 Dec 2020 15:15:28 +0000 (10:15 -0500)]
librbd/api: avoid retrieving more than max mirror image info records
This could otherwise result in an assertion failure in the API if
it failed to retrieve the status on an image and therefore required
a second iteration through the loop.
Fixes: https://tracker.ceph.com/issues/48522 Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 77bc48bbadd8e8423e9342102475994632441eaa)
Conflicts:
src/librbd/api/Mirror.cc: conflict with new AsioEngine
Jason Dillaman [Fri, 11 Dec 2020 00:31:45 +0000 (19:31 -0500)]
rbd-mirror: validate that remote start snapshot still exists
Perform a basic sanity check to verify that the remote start snapshot
still exists. This was previosly being deleted as part of the unlink
process due to a race condition between the remote side completing
a sync between snapshots 1 and 2 and snapshot 2 being unlinked due
to reaching max snapshots.
Jason Dillaman [Thu, 10 Dec 2020 22:32:16 +0000 (17:32 -0500)]
librbd/mirror: tweak which snapshot is unlinked when at capacity
The rbd-mirror daemon will attempt to sync from the last synced
snapshot to the next mirror snapshot. When the limit is at 3, this
currently can result in a situation where an in-use sync snapshot is
deleted. Instead of unlinking the second oldest snapshot, always
unlink the third oldest.
Fixes: https://tracker.ceph.com/issues/48553 Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit a888bff8d00e3e496ec80e4273e01a47b67da5dc)
Jason Dillaman [Thu, 10 Dec 2020 21:13:23 +0000 (16:13 -0500)]
librbd/mirror: increase debug logging of snapshot state machines
Try to keep debug level 20 for IO state machines so that setting the
debug level to something lower should show the manipulation of
the mirror snapshots.
Jason Dillaman [Thu, 10 Dec 2020 04:17:24 +0000 (23:17 -0500)]
rbd-mirror: do not attempt to unlink from more recent snapshots
The snapshot-based mirroring replayer should only attempt to unlink
from any snapshots that are older than the end remote snapshot id to
prevent the remote side from incorrectly deleted the snapshot.
Fixes: https://tracker.ceph.com/issues/48527 Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 78f8abce2d90d7c9bcf7b4bd4d805c3fe0b39b03)
Jason Dillaman [Thu, 10 Dec 2020 03:30:17 +0000 (22:30 -0500)]
librbd/mirror: unlink peer might recursively loop
If the mirror peer set is (incorrectly) empty, it's not currently
possible for the unlink peer state machine to properly delete the
snapshot. This can result in a recursive loop between the create
primary snapshot state machine and the unlink peer state machine
until the stack depth grows too large.
Fixes: https://tracker.ceph.com/issues/48525 Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 18a45503011a572325e09b56d5ab799a15ee83d4)
If the requested write length does not match the provided bufferlist
length, disable the move optimization and instead fallback to creating
a new sub-bufferlist for the object request.
Fixes: https://tracker.ceph.com/issues/49173 Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 8dbb4a3d971d9a48c171f161f531956dd0030403)
Kefu Chai [Sat, 6 Mar 2021 16:32:42 +0000 (00:32 +0800)]
.github: correct the regex in mileston workflow
also use pull_request_target event so the action is run in the
context of the base of the pull request. this helps us to overcome
the "Resource not accessible by integration" issue where the action
is run in the context of the pull request.
* refs/pull/39906/head:
mgr/volumes: Bump up AuthMetadataManager's version
pybind/ceph_volume_client: Bump up the version and compat_version to 6
pybind/ceph_volume_client: Fix auth-metadata file recovery
pybind/ceph_volume_client: Update the 'volumes' key to 'subvolumes' in auth metadata file
Reviewed-by: Ramana Raja <rraja@redhat.com> Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Kotresh HR [Fri, 19 Feb 2021 11:27:23 +0000 (16:57 +0530)]
mgr/volumes: Bump up AuthMetadataManager's version
With ceph_volume_client and mgr-volumes co-existing
for sometime, the version of both needs to be same.
The ceph_volume_client version <=5 can't decode
'subvolumes' key in auth-metadata file. Hence to
handle version in-compatibility, the version of
ceph_volume_client is bumped up to 6 and the same
needs to be done in mgr-volume's AuthMetadataManager
Kotresh HR [Fri, 19 Feb 2021 11:12:33 +0000 (16:42 +0530)]
pybind/ceph_volume_client: Bump up the version and compat_version to 6
With 'volumes' key updated to 'subvolumes', the version of
ceph_volume_client <= 5 can't decode auth-metadata file. Hence
bumping up ceph_volume_client version and compat_version to 6.
Kotresh HR [Mon, 15 Feb 2021 16:26:51 +0000 (21:56 +0530)]
pybind/ceph_volume_client: Update the 'volumes' key to 'subvolumes' in auth metadata file
The older auth metadata files before nautilus release stores
the authorized subvolumes using the 'volumes' key. As the
notion of 'subvolumes' brought in by mgr/volumes, it makes
sense to use 'subvolumes' key. This patch would be tranparently
update 'volumes' key to 'subvolumes' and newer auth metadata
files would store them with 'subvolumes' key.
Also fails the deauthorize if the auth-id doesn't exist.
Gerald Yang [Wed, 3 Mar 2021 04:37:15 +0000 (04:37 +0000)]
common: reset last_log_sent when clog_to_monitors is updated
When clog_to_monitors is disabled, "last_log" still keeps increasing by
get_next_seq() if OSD writes info to clog
But "last_log_sent" doesn't increase, if we disable clog_to_monitors for
a bit longer and then re-enabling it, the num_unsent could be bigger than
log_queue_size(), it will trigger an assertion in _get_mon_log_message
We need to reset last_log_sent to last_log before updating clog_to_monitors
Signed-off-by: Gerald Yang <gerald.yang@canonical.com>