Aashish Sharma [Wed, 4 Oct 2023 06:54:13 +0000 (12:24 +0530)]
mgr/dashboard: Consider null values as zero in grafana panels
After upgrading from RHCS4 to RHCS5, some of the Grafana charts broke.
This is because in RHCS5 we do not generate a metric if its value is
zero; the resulting null value from that metric breaks the Grafana
charts and graphs. This PR fixes the above issue.
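A minimal sketch of one common way to make a panel tolerate a missing
series, written as a Python dict standing in for a panel target; the
metric name and the use of "or vector(0)" are illustrative and not
necessarily the exact change made in this PR:

    # Illustrative only: a Grafana panel target expressed as a Python dict.
    # "or vector(0)" makes the PromQL query return 0 instead of "no data"
    # when the metric is absent, so the panel draws a zero line instead of
    # breaking on a null series.
    panel_target = {
        "expr": "sum(rate(ceph_osd_op_w[5m])) or vector(0)",  # example metric
        "format": "time_series",
        "legendFormat": "write ops",
    }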
Patrick Donnelly [Mon, 13 Nov 2023 14:46:16 +0000 (09:46 -0500)]
Merge PR #52852 into pacific
* refs/pull/52852/head:
mds: remove calculating caps after adding revokes back
test/libcephfs: add test case for revoking caps
client: issue a cap release immediately if no cap exists
mds: add the revoking caps back to _revokes list
mds: move confirm_receipt() to Capability.cc
Reviewed-by: Rishabh Dave <ridave@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
There is no need for CreateSnapshotRequests.__del__(), which calls
CreateSnapshotRequests.wait_for_pending().
MirrorSnapshotScheduleHandler.shutdown() already calls
CreateSnapshotRequests.wait_for_pending().
Conflicts:
src/pybind/mgr/rbd_support/mirror_snapshot_schedule.py
- Above conflict was due to commit e4a16e2
("mgr/rbd_support: add type annotation") not in pacific
Ramana Raja [Thu, 26 Oct 2023 17:18:52 +0000 (13:18 -0400)]
mgr/rbd_support: fix recursive locking on CreateSnapshotRequests lock
The MirrorSnapshotScheduleHandler's run thread issues asynchronous
create snapshot requests using a CreateSnapshotRequests instance. When
the thread invokes a CreateSnapshotRequests instance's get_ioctx(),
the instance's class variable lock is acquired. With the class
variable lock held, garbage collection of a CreateSnapshotRequests
instance may kick in on the same thread. The thread would then call
CreateSnapshotRequests.__del__(), which tries to acquire the class
variable lock that the thread already holds. Fix this
recursive deadlock by converting the CreateSnapshotRequests lock from
a class variable to an instance variable. There is no need to share
the lock across CreateSnapshotRequests instances.
Also convert MirrorSnapshotScheduleHandler, PerfHandler and
TrashPurgeScheduleHandler class variables to instance variables
that don't need to be shared across the instances.
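A minimal sketch of the shape of this change, with simplified method
bodies (the real module does more work under the lock):

    import threading

    class CreateSnapshotRequests:
        # Before: a single lock shared by every instance via a class
        # variable.  If __del__() ran on the run thread while get_ioctx()
        # held this non-reentrant lock, wait_for_pending() would try to
        # take it again and deadlock.
        #
        #   lock = threading.Lock()        # class variable (shared)

        def __init__(self):
            # After: each instance owns its own lock, so a finalizer
            # running on the same thread can no longer self-deadlock.
            self.lock = threading.Lock()   # instance variable
            self.pending = set()

        def get_ioctx(self, image_spec):
            with self.lock:
                # open and cache an ioctx for the image's pool (elided)
                pass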
Fixes: https://tracker.ceph.com/issues/62994
Signed-off-by: Ramana Raja <rraja@redhat.com>
Co-Authored-By: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 4452bc22d1c6c8499cf55d6e39090adf7ae1dcbf)
Conflicts:
src/pybind/mgr/rbd_support/mirror_snapshot_schedule.py
src/pybind/mgr/rbd_support/perf.py
src/pybind/mgr/rbd_support/trash_purge_schedule.py
- Above conflicts were due to commit e4a16e2
("mgr/rbd_support: add type annotation") not in pacific
Ramana Raja [Mon, 18 Sep 2023 02:52:56 +0000 (22:52 -0400)]
qa/suites/rbd: add test to check rbd_support module recovery
... on repeated blocklisting of its client.
There were issues with the rbd_support module not being able to recover
from its RADOS client being repeatedly blocklisted. This occurred, for
example, in clusters with OSDs that were slow to process RBD requests
while the module's mirror_snapshot_scheduler was taking mirror snapshots
by requesting exclusive locks on the RBD images and workloads were
running on the snapshotted images via kernel clients.
test/librbd/fsx: wait for resize to propagate in krbd_resize()
With this change, the resize request is no longer blocked until the
resize is completed. Because of this, the fsx test fails, as it assumes
that the request to resize immediately implies a change in the device
size. Hence we have to add a wait in the resize handler of fsx for the
device to actually get resized.
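A minimal sketch of such a wait, written in Python for illustration (the
actual fsx change is in C); the sysfs path and polling parameters are
assumptions:

    import time

    def wait_for_device_size(sysfs_size_path, expected_bytes,
                             timeout=30.0, interval=0.1):
        # sysfs_size_path is e.g. /sys/block/rbd0/size, which reports the
        # device size in 512-byte sectors (path is illustrative).
        deadline = time.time() + timeout
        while time.time() < deadline:
            with open(sysfs_size_path) as f:
                if int(f.read()) * 512 == expected_bytes:
                    return True
            time.sleep(interval)
        return False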
Problem:
-------
Trying to disable any feature on an rbd image mapped with nbd causes
rbd-nbd to get stuck.
rbd-nbd registers a watcher callback to detect image resize in
NBDWatchCtx::handle_notify(). handle_notify() calls the image info
method, which calls refresh_if_required() and gets stuck there.
It gets stuck in ImageState::refresh_if_required() because
DisableFeaturesRequest issues update notifications while still holding
the exclusive lock, with everything that depends on it blocked.
Solution:
--------
Set only a notify flag in NBDWatchCtx::handle_notify() and handle
the resize detection in a separate thread.
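A minimal sketch of this pattern, in Python for illustration only
(rbd-nbd itself is C++); the class and callback names are assumptions,
not the actual rbd-nbd code:

    import threading

    class WatchCtx:
        def __init__(self, check_resize):
            self._notified = threading.Event()
            self._check_resize = check_resize  # hypothetical callback
            self._worker = threading.Thread(target=self._run, daemon=True)
            self._worker.start()

        def handle_notify(self):
            # Never block here: the notifier may still hold the exclusive
            # lock, so just record that something changed.
            self._notified.set()

        def _run(self):
            while True:
                if self._notified.wait(timeout=1.0):
                    self._notified.clear()
                    # Safe to do the potentially blocking work here.
                    self._check_resize()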
When the OSD preboots it sends an MMonGetPurgedSnaps message to
the monitor (`_get_purged_snaps`).
The monitor replies with all the purged snapshots whose purged_epoch_ is in the
range superblock.purged_snaps_last + 1 up to superblock.current_epoch + 1.
When the OSD handles the reply from the mon (`handle_get_purged_snaps_reply`)
it calls `record_purged_snaps` to write those purged snapshots to the
OSD store as well (PSN_ keys).
Once purged_snaps_last is reset, on the following OSD reboot, the snapshots that were marked as
purged (purged_snaps_ keys) in the mon's store will also be marked,
correspondingly, in the OSD store.
That way `scrub_purged_snaps` will be able to re-trim the snapshots that weren't
marked as purged on the OSD side (for some reason).
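A heavily simplified sketch of the flow described above, in Python for
illustration; the data shapes and the argument names are assumptions,
only the PSN_ key prefix comes from the text:

    def record_purged_snaps(osd_store, mon_reply, purged_snaps_last, current_epoch):
        # mon_reply: {epoch: [snap ids purged in that epoch]}
        # Mirror the mon's purged-snap records into the OSD store so a
        # later scrub_purged_snaps pass can re-trim anything missed.
        for epoch, snapids in mon_reply.items():
            if purged_snaps_last < epoch <= current_epoch + 1:
                for snapid in snapids:
                    osd_store["PSN_%d" % snapid] = epoch
        # return the new purged_snaps_last watermark
        return max(mon_reply, default=purged_snaps_last)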
Fixes: https://tracker.ceph.com/issues/62981
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
(cherry picked from commit 120ed0f0e8f65c18bfcd1d649617770c2c5af663)
Manual conflict fixes: 'scrubdebug' command was removed since it's
not part of the original commit.
Commit dc69033763cc116c6ccdf1f97149a74248691042 moves cephfs-shell from
"<CEPH-REPO-ROOT>/src/tools/cephfs/" to
"<CEPH-REPO-ROOT>/src/tools/cephfs/shell" but cephfs-shell's location in
src/vstart.sh and qa/tasks/cephfs/test_cephfs_shell.py was not
updated. This produces a broken vstart_environment.sh and a broken
export command in test_cephfs_shell.py.
Introduced-by: dc69033763cc116c6ccdf1f97149a74248691042
Fixes: https://tracker.ceph.com/issues/58795
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 48ef0444774934dd6d0d3e026142d95e4098bebd)
Conflicts:
qa/tasks/cephfs/test_cephfs_shell.py
- Comment present at the top of file was different in Pacific
compared to main branch.
Adam King [Mon, 21 Aug 2023 17:48:56 +0000 (13:48 -0400)]
cephadm: make custom_configs work for tcmu-runner container
This is intended to be a temporary workaround to allow
custom config files to be mounted into
the tcmu-runner container. The hope is to refactor
cephadm's iscsi handling for squid, but a patch
like this could be useful for iscsi in older
releases where custom config files are currently
unusable for the tcmu-runner container.
What this patch actually does is write the
custom config files to a dir for the tcmu-runner
container so that the rest of the logic works without
change. I thought this would be easier to remove later
than a patch that integrates more with the container
mounts or general deployment.
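A minimal sketch of that idea, assuming cephadm-style custom config
entries with mount_path and content fields; the directory layout and the
helper name are assumptions:

    import os

    def write_tcmu_custom_configs(daemon_data_dir, custom_config_files):
        # Materialize each custom config under a per-container configs
        # dir so the existing mount logic can pick it up unchanged.
        cfg_dir = os.path.join(daemon_data_dir, 'tcmu-runner', 'configs')
        os.makedirs(cfg_dir, exist_ok=True)
        for cf in custom_config_files:
            dst = os.path.join(cfg_dir, os.path.basename(cf['mount_path']))
            with open(dst, 'w') as f:
                f.write(cf['content'])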
Ilya Dryomov [Thu, 12 Oct 2023 17:03:10 +0000 (19:03 +0200)]
pybind/rbd: don't produce info on errors in aio_mirror_image_get_info()
Check the completion return value before attempting to decode c_info.
Otherwise we are guaranteed to access invalid memory in decode_cstr()
while trying to compute the global_id string length when the client is
blocklisted, for example.
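A minimal Python sketch of the callback shape described here; the names
are illustrative and this is not the actual rbd.pyx code:

    def on_mirror_image_get_info(completion, decode_info, user_cb):
        ret = completion.get_return_value()
        if ret < 0:
            # On error (e.g. a blocklisted client) the info buffer is not
            # populated, so don't try to decode it.
            user_cb(completion, None)
            return
        user_cb(completion, decode_info())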
lichaochao [Tue, 28 Mar 2023 03:17:26 +0000 (05:17 +0200)]
rgw: fix rgw cache invalidation after unregister_watch() error
When a metadata OSD fails, an unregister_watch() error may occur,
breaking rgw cache invalidation.
Fix this by adding an unregister_done flag and, on a register_watch()
error, performing the reinit() operation again: after the first reinit()
failure, register_watch() will be attempted again.
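A minimal Python sketch of the retry shape described above; the function
names and the single-retry policy are assumptions:

    def establish_watch(register_watch, reinit):
        # If registering the watch fails, reinitialize and try once more
        # instead of leaving cache invalidation silently broken.
        try:
            return register_watch()
        except OSError:
            reinit()
            return register_watch()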
Fixes: https://tracker.ceph.com/issues/59217
Signed-off-by: lichaochao <lichaochao2_yewu@cmss.chinamobile.com>
(cherry picked from commit f9aae71af3ad8eee5996c31544d98041968dbbec)
We may get -ENOENT looking up cur_bucket here. We look up cur_bucket so
we can avoid purging the 'current' bucket instance, but if that
entrypoint doesn't exist, there is no current instance and that
shouldn't prevent us from purging.
Vedansh Bhartia [Thu, 2 Mar 2023 13:04:53 +0000 (18:34 +0530)]
rgw: use unique_ptr for flat_map emplace in BucketTrimWatcher
When emplacing objects into the trim notify handler of
BucketTrimWatcher, use a unique_ptr for the handler so that it is
destroyed if the emplace fails.
Though the destructor is already called, this behaviour cannot be relied
upon. std::map does not exhibit the same behaviour, and would have
leaked memory had it been used instead.
RadosGW API: incorrect bucket quota in response to HEAD /{bucket}/?usage
When we try to get the bucket usage via various methods, e.g. through curl or by accessing the RGW API endpoint at HEAD /{bucket}/?usage, the endpoint doesn't return the updated information: it was always returning the user quota and not the actual bucket quota that one expects to see when querying it.
Fixes: https://tracker.ceph.com/issues/62737
Signed-off-by: shreyanshjain7174 <ssanchet@redhat.com>
(cherry picked from commit 78cd82b6e9f36a91f47d44ad2cfae89add335d4c)
Conflicts:
- path: src/rgw/rgw_rest_s3.cc
comment: resolve minor conflict
Xiubo Li [Fri, 23 Jun 2023 14:44:23 +0000 (22:44 +0800)]
mds: remove calculating caps after adding revokes back
The calc_issued() makes no sense and will blindly set the 'issued'
to the 'pending', which is incorrect.
In the cap update msg the client passes its 'implemented' caps
to the MDS, and the MDS uses the 'implemented' caps to calculate the
'issued' and 'pending' members and also to adjust the revoke list.
The confirm_receipt() already correctly calculates the 'issued'
and 'pending' members. And after adding the cap back to the revoke list
we should mark it notable, which will move the cap object to the
front of the session list.
Xiubo Li [Tue, 11 Oct 2022 04:53:17 +0000 (12:53 +0800)]
test/libcephfs: add test case for revoking caps
When writing to a file and max_size is approaching, the client
will trigger check_caps() and flush the caps to the MDS.
But in case the MDS is revoking the Fsxrw caps, since the client
keeps writing and holding the Fw cap, it may release only part of
the caps, all except Fw.
Xiubo Li [Tue, 16 May 2023 01:18:15 +0000 (09:18 +0800)]
client: issue a cap release immediately if no cap exists
In this case:

  client: releases the cap and puts the Inode
  mds:    increases cap->seq and sends a revoke request to the client
  mds:    receives the release request but skips removing the cap, then
          evaluates the CInode and issues or revokes caps again
  client: receives and drops the revoke request
  client: receives and drops the caps update or revoke request
  mds:    raises a health warning that the client isn't responding to
          mclientcaps(revoke)
All the IMPORT/REVOKE/GRANT cap ops will increase the session seq
on the MDS side, so the client needs to issue a cap release to
let the MDS remove the corresponding cap and unblock possible
waiters.
Fixes: https://tracker.ceph.com/issues/57244
Fixes: https://tracker.ceph.com/issues/61148
Signed-off-by: Xiubo Li <xiubli@redhat.com>
(cherry picked from commit 7aaf4ba81b978db63b9cb11a90f881196530e5d5)
Xiubo Li [Thu, 2 Mar 2023 14:01:08 +0000 (22:01 +0800)]
mds: add the revoking caps back to _revokes list
When revoking caps from clients, a client may release only some of
the cap references and still send a cap update request back to the
MDS, yet confirm_receipt() will clear the _revokes list anyway, even
though this cap is still kept in the revoking_caps list. Add the
still-revoking caps back to the _revokes list.
At the same time, add one debug log for when the revocation is not
totally finished.