Leonid Usov [Sat, 11 May 2024 14:00:21 +0000 (17:00 +0300)]
mds: enhance the `lock path` asok command
* when the quiesce lock is taken by this op, don't consider the inode `quiesced`
* drop all locks taken during traversal
* drop all local authpins after the locks are taken
* add --await functionality that will block the command until locks are taken or an error is encountered
* return an RC that represents the operation result: 0 if the operation was scheduled and hasn't failed so far
* add authpin control flags
** --ap-freeze - to auth_pin_freeze the target inode
** --ap-dont-block - to pass auth_pin_nonblocking when acquiring the target inode locks
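For illustration, a hypothetical invocation from a qa-style test. The rank_asok helper and the exact argument shape are assumptions; the flags are the ones listed above:

    def lock_path_blocking(fs, path):
        # Hypothetical sketch: run the `lock path` asok command on rank 0,
        # blocking until the locks are taken (--await) and freezing the
        # target inode via authpins (--ap-freeze).
        result = fs.rank_asok(["lock", "path", path, "--await", "--ap-freeze"])
        # Per the commit, the RC is 0 if the operation was scheduled and
        # hasn't failed so far.
        return result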
Leonid Usov [Thu, 9 May 2024 01:39:12 +0000 (04:39 +0300)]
mds/quiesce: overdrive fragmenting that's still freezing
Quiesce requires revocation of client capabilities,
which does not work for freezing/frozen nodes.
Since fragmenting is best effort, abort an ongoing fragmenting
operation for the sake of a faster quiesce.
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Fixes: https://tracker.ceph.com/issues/65716
Leonid Usov [Sun, 12 May 2024 16:19:34 +0000 (19:19 +0300)]
revert: mds: provide a mechanism to authpin while freezing
This is a functional revert of a9964a7ccc4394f923fb0f1c76eb8fa03fe8733d
git revert was giving too many conflicts, as the code has changed
too much since the original commit.
The bypass freezing mechanism led us into several deadlocks,
and when we found out that a freezing inode defers reclaiming
client caps, we realized that we needed to try a different approach.
This commit removes the bypass freezing related changes to clear way
for a different approach to resolving the conflict between quiesce
and freezing.
jiawd [Fri, 12 Nov 2021 03:48:56 +0000 (03:48 +0000)]
osd: full-object read crc mismatch because truncate modifies oi.size but forgets to clear data_digest
When a write happens before a truncate, its length must be trimmed. If the truncate is to 0:
a write of [0~128k] is trimmed to [0~0] and does nothing; oi.size stays 0 and x1 = set_data_digest(crc32(-1)).
a write of [128k~128k] is trimmed to [128k~0]; oi.size is truncated to offset 128k and x2 = set_data_digest(crc32(x1)).
a write of [256k~128k] is trimmed to [256k~0]; oi.size is truncated to offset 256k and x3 = set_data_digest(crc32(x2)).
...
a write of [4063232~128k] is trimmed to [4063232~0]; oi.size is truncated to offset 4063232 and xn = set_data_digest(crc32(xn-1)).
At this point oi.size is 4063232 while data_digest is still 0xffffffff, because the length of the data fed into the crc is 0 every time.
A full-object read that verifies the crc then replies with EIO (EC pool).
So, when a write performs a truncate, clear data_digest and the DIGEST flag;
and when a write happens before a truncate, trim its length, and when the offset is beyond oi.size, don't truncate oi.size up to the offset.
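A minimal sketch of the digest stall described above, using Python's zlib.crc32 (Ceph uses crc32c, but the empty-input property is the same): a crc over zero-length data never changes the seed, so the stored digest stays 0xffffffff while oi.size grows.

    import zlib

    digest = 0xffffffff   # oi.data_digest seeded with -1, as above
    size = 0              # oi.size

    # Each trimmed write feeds zero bytes into the crc but bumps oi.size.
    for offset in range(0, 4063232 + 1, 128 * 1024):
        digest = zlib.crc32(b"", digest)  # crc over empty data: seed unchanged
        size = offset                     # truncate oi.size to the write offset

    print(hex(digest), size)              # 0xffffffff 4063232

    # A full-object read recomputes the crc over the actual object data,
    # which will not match the stale 0xffffffff digest -> EIO on an EC pool.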
Zac Dover [Sun, 19 May 2024 00:00:29 +0000 (10:00 +1000)]
doc/cephfs: Squid and later - subvolume quiesce
Add a note to the "Subvolume quiesce" section that says that the
information in the section applies only to the Squid and later releases
of Ceph. This is included here so that I don't overwrite the Reef and
Quincy documentation with irrelevant information, and so that I don't
overwrite the Squid information with blank space where the "Subvolume
quiesce" section should be.
Rishabh Dave [Thu, 16 May 2024 07:00:49 +0000 (12:30 +0530)]
qa/cephfs: block buggy tests in test_admin.py
Temporarily block test_idem_unaffected_root_squash and
test_multifs_single_path_rootsquash.
These tests fail due to a known bug. Block them temporarily so that
test_admin.py can run fully and PRs under QA can be tested fully.
Otherwise, these tests fail and that halts test_admin.py, which leaves
the PR partially untested.
The failure is then seen as an unrelated one, which lets the buggy
code get merged. This has happened recently.
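One common way to block a test temporarily in a Python unittest-style suite (a sketch; the mechanism actually used in test_admin.py may differ) is a skip decorator pointing at the known bug:

    import unittest

    class TestAdmin(unittest.TestCase):
        @unittest.skip("temporarily blocked: known rootsquash bug")
        def test_idem_unaffected_root_squash(self):
            ...

        @unittest.skip("temporarily blocked: known rootsquash bug")
        def test_multifs_single_path_rootsquash(self):
            ...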
Rishabh Dave [Thu, 16 May 2024 16:30:01 +0000 (22:00 +0530)]
qa/cephfs: add MDS_CLIENTS_BROKEN_ROOTSQUASH to ignorelist
MDS_CLIENTS_BROKEN_ROOTSQUASH is generated and expected by
test_rootsquash_nofeature, but it hadn't been added to the ignorelist, as
a result of which the QA code marks the job as failed even though all
tests finished running successfully.
Introduced-by: bccc8ceb471c441ec04d7eb2c353630f8c5ce843
Fixes: https://tracker.ceph.com/issues/66075
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Rishabh Dave [Tue, 7 May 2024 14:50:55 +0000 (20:20 +0530)]
qa/cephfs: set joinable on FS before exiting tests in TestFSFail
After running TestFSFail, CephFSTestCase.tearDown() fails attempting
to unmount CephFS. Set joinable on the FS and wait for the MDS to be up
before exiting the test. This ensures that unmounting succeeds in
teardown.
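A sketch of the intended cleanup, assuming the qa Filesystem helpers set_joinable() and wait_for_daemons() (treat the exact calls as illustrative):

    def tearDown(self):
        # Make the FS joinable again and wait for an active MDS so that
        # the unmount performed by CephFSTestCase.tearDown() succeeds.
        self.fs.set_joinable()       # allow standby MDS daemons to join
        self.fs.wait_for_daemons()   # block until the MDS ranks are up
        super().tearDown()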
Fixes: https://tracker.ceph.com/issues/65841
Signed-off-by: Rishabh Dave <ridave@redhat.com>
Since this --flags=locks takes the mds_lock and dumps thousands of ops, it
may take a long time to complete for each individual MDS. The entire quiesce
set may time out (and all quiesce ops be killed) before we finish dumping ops.
Fixes: https://tracker.ceph.com/issues/65823
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Ilya Dryomov [Thu, 16 May 2024 10:40:58 +0000 (12:40 +0200)]
common/options: link to mon_osd_blocklist_default_expire from RBD
"number of seconds to blocklist - set to 0 for OSD default" in the
description of rbd_blocklist_expire_seconds refers to the value that is
controlled by mon_osd_blocklist_default_expire.
We are currently conducting regular ceph-dencoder tests for backward compatibility.
However, we are omitting tests for forward compatibility.
This suite will introduce tests against the ceph-objects-corpus to address forward
compatibility issues that may arise.
The script will install the N-2 version and run it against the latest corpus
objects that we have, then install the N-1 and N versions and check them as well.
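A minimal sketch of such a check, assuming the usual ceph-dencoder invocation over a corpus directory (the paths and directory layout here are illustrative):

    import subprocess
    from pathlib import Path

    # Hypothetical corpus layout: <archive>/<version>/objects/<type>/<file>
    CORPUS = Path("ceph-object-corpus/archive")

    def check_decodable(version):
        """Try to decode every corpus object with the installed ceph-dencoder."""
        for obj in (CORPUS / version / "objects").glob("*/*"):
            ceph_type = obj.parent.name
            result = subprocess.run(
                ["ceph-dencoder", "type", ceph_type,
                 "import", str(obj), "decode", "dump_json"],
                capture_output=True,
            )
            if result.returncode != 0:
                print(f"{ceph_type}: failed to decode {obj.name}")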
Patrick Donnelly [Thu, 16 May 2024 03:01:16 +0000 (23:01 -0400)]
Merge PR #57454 into main
* refs/pull/57454/head:
mds/quiesce-db: optimize peer updates
mds/quiesce-db: track db epoch separately from the membership epoch
mds/quiesce-db: test that a peer on a newer membership epoch can ack a root
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Ramana Raja [Tue, 30 Apr 2024 17:56:12 +0000 (13:56 -0400)]
librbd/image: create rbd_trash object during RBD pool initialization
... and RBD namespace creation.
It was not possible to remove an RBD image when the OSDs were full and the
'rbd_trash' object had not already been created in the image's pool or pool
namespace. The 'rbd_trash' object was created in a pool or namespace
during the first instance of image removal from that pool or namespace.
If no images had ever been removed from an RBD pool or namespace and the
OSDs became full, removal of images using the CLI failed. The failure
occurred when trying to move the images to trash, since the 'rbd_trash'
object was missing in the pool or namespace.
Fix this issue by creating the 'rbd_trash' object in a pool when
initializing the pool as an RBD pool and when creating an RBD namespace.
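For illustration, a hedged sketch using the Python rados/rbd bindings to show the intended effect (the pool name and conffile path are hypothetical; pool_init() and ioctx.stat() are the binding calls this sketch assumes):

    import rados
    import rbd

    with rados.Rados(conffile="/etc/ceph/ceph.conf") as cluster:
        with cluster.open_ioctx("rbd") as ioctx:   # "rbd" pool is illustrative
            rbd.RBD().pool_init(ioctx, True)       # initialize pool for RBD use

            # With this fix, 'rbd_trash' should exist right after pool init,
            # so moving images to trash works even once the OSDs are full.
            size, mtime = ioctx.stat("rbd_trash")
            print("rbd_trash exists, size:", size)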
Fixes: https://tracker.ceph.com/issues/64800
Signed-off-by: Ramana Raja <rraja@redhat.com>
Lucian Petrut [Thu, 12 Jan 2023 10:55:06 +0000 (12:55 +0200)]
qa: add ceph-rbd windows service restart test
We're adding a test that:
* maps a configurable number of images
* runs a specified test - we're reusing the ones from stress_test,
making just a few minor changes to allow running the same test
multiple times
* restarts the ceph-rbd Windows service
* waits for the images to be reconnected and refreshes the mount
information
* reruns the test
* repeats the above workflow for a specified number of times,
reusing the same images
This test ensures that:
* mounted images are still available after a service restart
* drive letters are retained
* the image content is retained
* there are no race conditions when connecting or disconnecting
a large number of images in parallel
* the driver is capable of mapping a specified number of images
simultaneously
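A hedged sketch of the workflow above (helper names like run_stress_test and refresh_mounts are hypothetical stand-ins for the qa utilities; only the ceph-rbd service name comes from this change):

    import subprocess

    def restart_ceph_rbd_service():
        # Restart the ceph-rbd Windows service using stock Windows tooling.
        subprocess.check_call(["sc.exe", "stop", "ceph-rbd"])
        subprocess.check_call(["sc.exe", "start", "ceph-rbd"])

    def service_restart_test(images, run_stress_test, refresh_mounts, iterations=3):
        for _ in range(iterations):
            run_stress_test(images)   # exercise the mapped images
            restart_ceph_rbd_service()
            refresh_mounts(images)    # wait for reconnect, refresh mount info
            run_stress_test(images)   # verify content and drive letters survived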
Lucian Petrut [Tue, 10 Jan 2023 14:50:04 +0000 (16:50 +0200)]
qa: reorganize Windows python test
We're splitting the rbd-wnbd python test into separate files so
that the common code may easily be reused by other tests. This
also makes the code easier to read and maintain.
Nizamudeen A [Fri, 3 May 2024 08:56:19 +0000 (14:26 +0530)]
mgr/k8sevents: update V1Events to CoreV1Events
CentOS 9 only provides kubernetes 26.1.0 as a base dependency, and hence the
k8sevents code needs to be updated accordingly. The API change happened
in the kubernetes client when 19.0.0 was released.
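A minimal sketch of the rename in the kubernetes Python client (the field values here are illustrative):

    from kubernetes import client

    # Before (client < 19): the core/v1 Event model was client.V1Event.
    # After the rename it is client.CoreV1Event.
    event = client.CoreV1Event(
        metadata=client.V1ObjectMeta(name="example-event"),
        involved_object=client.V1ObjectReference(kind="Pod", name="example-pod"),
        message="example message",
        reason="Example",
        type="Normal",
    )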
Fixes: https://tracker.ceph.com/issues/65627
Fixes: https://tracker.ceph.com/issues/64981
Signed-off-by: Nizamudeen A <nia@redhat.com>
Patrick Donnelly [Wed, 15 May 2024 00:19:30 +0000 (20:19 -0400)]
Merge PR #57453 into main
* refs/pull/57453/head:
doc: add status badge for backport creation
.github: use shorter name for backport tracker action
.github: document where runs/output can be examined
Leonid Usov [Mon, 13 May 2024 21:10:04 +0000 (00:10 +0300)]
mds/quiesce-db: track db epoch separately from the membership epoch
Tracking the db epoch separately will make sure that replicas
only follow the leader's epoch choice, even if they are already on
the new membership epoch. This eliminates races due to the
random order of mdsmap updates.
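A toy model of the idea (not the actual quiesce-db code; all names are illustrative): the replica accepts a db update based on the leader's db epoch alone, so it does not matter whether its own membership epoch has already advanced.

    class Replica:
        """Toy model: db epoch tracked separately from membership epoch."""

        def __init__(self):
            self.membership_epoch = 0  # advanced by mdsmap updates, in any order
            self.db_epoch = 0          # advanced only by leader updates

        def on_mdsmap(self, epoch):
            self.membership_epoch = max(self.membership_epoch, epoch)

        def on_leader_update(self, leader_db_epoch, payload):
            # Follow the leader's epoch choice regardless of membership epoch.
            if leader_db_epoch < self.db_epoch:
                return False  # stale update, ignore
            self.db_epoch = leader_db_epoch
            self.apply(payload)
            return True

        def apply(self, payload):
            pass  # placeholder for applying the db delta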
Fixes: https://tracker.ceph.com/issues/65977
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>