Aashish Sharma [Wed, 4 Oct 2023 06:54:13 +0000 (12:24 +0530)]
mgr/dashboard: Consider null values as zero in grafana panels
After upgrading from RHCS4 to RHCS5, some of the Grafana charts broke.
This is because in RHCS5 we do not generate a metric when its value is
zero; the resulting null value from that metric breaks the Grafana
charts and graphs. This PR fixes the issue by treating null values as
zero in the Grafana panels.
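Purely as an illustration of the idea (not necessarily how the panel change is implemented), a hedged Python sketch that coerces missing (null) datapoints to zero before a series is used; the helper name and data shape are assumptions:

    def nulls_to_zero(series):
        """Replace missing (None) datapoints with 0 so panels don't break.

        `series` is assumed to be a list of (timestamp, value) pairs as
        returned by a metrics query; this helper is purely illustrative.
        """
        return [(ts, 0 if value is None else value) for ts, value in series]

    print(nulls_to_zero([(1, 5.0), (2, None), (3, 7.5)]))
    # [(1, 5.0), (2, 0), (3, 7.5)]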
Patrick Donnelly [Mon, 13 Nov 2023 14:46:16 +0000 (09:46 -0500)]
Merge PR #52852 into pacific
* refs/pull/52852/head:
mds: remove calculating caps after adding revokes back
test/libcephfs: add test case for revoking caps
client: issue a cap release immediately if no cap exists
mds: add the revoking caps back to _revokes list
mds: move confirm_receipt() to Capability.cc
Reviewed-by: Rishabh Dave <ridave@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Venky Shankar [Fri, 11 Aug 2023 08:36:52 +0000 (04:36 -0400)]
mds: blocklist clients with "bloated" session metadata
Buggy clients (or possibly an MDS bug) can cause a huge buildup of
`completed_requests` metadata in a client's session information.
This can cause the MDS to go read-only when it is flushing
session metadata to the journal, since the bloated metadata
causes the OSDOp payload to exceed the maximum write size.
Blocklist such clients so as to allow the MDS to continue
servicing requests.
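A hedged Python sketch of the idea (the real logic lives in the C++ MDS; the session object, callback, and the 16 MiB threshold below are assumptions):

    def maybe_blocklist(session, blocklist, threshold_bytes=16 * 1024 * 1024):
        """Blocklist a client whose encoded session metadata grew too large.

        `session` is a hypothetical object exposing the client id and the
        size of its encoded `completed_requests` metadata; `blocklist` is a
        callback that evicts/blocklists the client.
        """
        if session.metadata_size > threshold_bytes:
            # Dropping the client keeps the journal write below the maximum
            # OSDOp payload size, so the MDS can keep servicing requests.
            blocklist(session.client_id)
            return True
        return False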
In pacific, the client would close the MDS session if it has missing
features - adjust the code accordingly. Also adjust for the `session`
variable in Client::handle_client_session(), which in pacific is of
type MetaSession* rather than MetaSessionRef.
MClientRequest: handle ext_num_retry and ext_num_fwd from ceph_mds_request_head_legacy
When a client is too old and uses struct ceph_mds_request_head_legacy, we must
fill the new ext_num_retry and ext_num_fwd fields from the old num_retry and num_fwd fields.
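A hedged Python sketch of that mapping (the real decode path is C++; the dict layout below is an assumption, only the field names come from the commit message):

    def fill_ext_fields(head):
        """Copy legacy num_retry/num_fwd into ext_num_retry/ext_num_fwd.

        `head` is a plain dict standing in for the decoded request head; a
        legacy client only provides the old, narrower counters, so the
        extended fields are derived from them.
        """
        if "ext_num_retry" not in head:
            head["ext_num_retry"] = head["num_retry"]
        if "ext_num_fwd" not in head:
            head["ext_num_fwd"] = head["num_fwd"]
        return head

    print(fill_ext_fields({"num_retry": 3, "num_fwd": 1}))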
Fixes: https://github.com/ceph/ceph/pull/45669
Fixes: https://tracker.ceph.com/issues/63288
Fixes: commit cbd7e3040208 ("ceph_fs.h: add 32 bits extended num_retry and num_fwd support")
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
(cherry picked from commit 43f32a46aa9095b19525357ba7ca215e842b4f77)
There is no need for CreateSnapshotRequests.__del__(), which calls
CreateSnapshotRequests.wait_for_pending():
MirrorSnapshotScheduleHandler.shutdown() already calls
CreateSnapshotRequests.wait_for_pending().
Conflicts:
src/pybind/mgr/rbd_support/mirror_snapshot_schedule.py
- The above conflict was due to commit e4a16e2
("mgr/rbd_support: add type annotation") not being in pacific
Ramana Raja [Thu, 26 Oct 2023 17:18:52 +0000 (13:18 -0400)]
mgr/rbd_support: fix recursive locking on CreateSnapshotRequests lock
The MirrorSnapshotScheduleHandler's run thread issues asynchronous
create-snapshot requests using a CreateSnapshotRequests instance. When
the thread invokes a CreateSnapshotRequests instance's get_ioctx(),
the instance's class-variable lock is acquired. With the class-variable
lock held, garbage collection of a CreateSnapshotRequests instance may
kick in on that same thread. The thread would then call
CreateSnapshotRequests.__del__(), which tries to acquire the class-variable
lock that the thread already holds. Fix this recursive-locking
deadlock by converting the CreateSnapshotRequests lock from
a class variable to an instance variable; there is no need to share
the lock across CreateSnapshotRequests instances.
Also convert MirrorSnapshotScheduleHandler, PerfHandler and
TrashPurgeScheduleHandler class variables to instance variables
that don't need to be shared across the instances.
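A minimal sketch of the before/after locking change in Python, with simplified class shapes assumed for illustration (the real module lives under src/pybind/mgr/rbd_support):

    import threading

    class CreateSnapshotRequestsBefore:
        # A lock shared by every instance: if __del__() of an instance runs
        # on a thread that already holds it, that thread deadlocks, because
        # threading.Lock is not reentrant.
        lock = threading.Lock()

    class CreateSnapshotRequestsAfter:
        def __init__(self):
            # Per-instance lock: nothing needs to be shared across
            # CreateSnapshotRequests instances, so garbage collection of one
            # instance can no longer deadlock a thread holding another's lock.
            self.lock = threading.Lock()

        def get_ioctx(self, image_spec):
            with self.lock:
                # ... look up / open the ioctx for image_spec ...
                return None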
Fixes: https://tracker.ceph.com/issues/62994
Signed-off-by: Ramana Raja <rraja@redhat.com>
Co-Authored-By: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 4452bc22d1c6c8499cf55d6e39090adf7ae1dcbf)
Conflicts:
src/pybind/mgr/rbd_support/mirror_snapshot_schedule.py
src/pybind/mgr/rbd_support/perf.py
src/pybind/mgr/rbd_support/trash_purge_schedule.py
- The above conflicts were due to commit e4a16e2
("mgr/rbd_support: add type annotation") not being in pacific
shimin [Sun, 8 Oct 2023 11:15:09 +0000 (19:15 +0800)]
mon: fix mds metadata lost in one case.
In most cases, a peon's pending_metadata is inconsistent with the mon's db.
When a peon turns into the leader and, at the same time, an active MDS stops,
the new leader may flush wrong MDS metadata into the db. So we need to
update the MDS metadata from the db on every fsmap change.
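A hedged Python sketch of that idea (the real code is the C++ monitor; the names and db layout below are illustrative only):

    class MDSMetadataSketch:
        """Illustrative stand-in for the monitor's MDS metadata handling."""

        def __init__(self, db):
            self.db = db                      # dict-like stand-in for the mon store
            self.pending_metadata = {}

        def on_fsmap_change(self):
            # Re-sync the in-memory metadata from the db so that a peon that
            # just became leader does not flush stale MDS metadata back into
            # the store when an active MDS stops at the same time.
            self.pending_metadata = dict(self.db.get("mds_metadata", {}))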
This phenomenon can be reproduced like this:
A cluster with 3 mons, 3 MDSs (one active, the other two standby), and 6 OSDs.
step 1. stop the two standby MDSs
step 2. restart all mons (makes pending_metadata consistent with the db)
step 3. start the other two MDSs
step 4. stop the leader mon
step 5. run the "ceph mds metadata" command to check the MDS metadata
step 6. stop the active MDS
step 7. run the "ceph mds metadata" command to check the MDS metadata again
Ramana Raja [Mon, 18 Sep 2023 02:52:56 +0000 (22:52 -0400)]
qa/suites/rbd: add test to check rbd_support module recovery
... on repeated blocklisting of its client.
There were issues with the rbd_support module not being able to recover
from its RADOS client being repeatedly blocklisted. This occurred, for
example, in clusters with OSDs slow to process RBD requests while the
module's mirror_snapshot_scheduler was taking mirror snapshots by
requesting exclusive locks on the RBD images and workloads were running
on the snapshotted images via kernel clients.
test/librbd/fsx: wait for resize to propagate in krbd_resize()
With this change, the resize request will not be blocked until the resize is
completed. Because of this, the fsx test fails, as it assumes that the
request to resize immediately implies a change in the device size.
Hence we have to add a wait in the resize handler of fsx for the device to
actually get resized.
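A hedged sketch of the kind of wait that is needed, polling the block device's size until the kernel has picked up the resize (fsx itself is C; the device path, timeout, and polling method below are illustrative):

    import time

    def wait_for_resize(dev_path, expected_bytes, timeout=30.0, interval=0.1):
        """Poll the block device until its size matches the expected size.

        A resize handler can call something like this after issuing a
        resize, instead of assuming the device size changes immediately.
        """
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            with open(dev_path, "rb") as f:
                f.seek(0, 2)          # seek to end to learn the device size
                if f.tell() == expected_bytes:
                    return True
            time.sleep(interval)
        return False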
Problem:
-------
Trying to disable any feature on an rbd image mapped with nbd leads to
rbd-nbd getting stuck.
rbd-nbd registers a watcher callback, NBDWatchCtx::handle_notify(), to
detect image resizes. handle_notify() calls the image info method, which
calls refresh_if_required(), and it gets stuck there.
It gets stuck in ImageState::refresh_if_required() because
DisableFeaturesRequest issues update notifications while still holding
the exclusive lock, with everything that has to do with it blocked.
Solution:
--------
Only set a notify flag in NBDWatchCtx::handle_notify() and handle
the resize detection in a separate thread.
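A minimal sketch of the pattern in Python, with hypothetical names (rbd-nbd itself is C++): the notify callback only sets an event, and a separate thread performs the potentially blocking size check.

    import threading

    class ResizeWatcher:
        def __init__(self, check_size):
            self._notified = threading.Event()
            self._check_size = check_size      # potentially blocking call, e.g. image info
            self._worker = threading.Thread(target=self._run, daemon=True)
            self._worker.start()

        def handle_notify(self):
            # Never block here: just flag that an update notification arrived.
            self._notified.set()

        def _run(self):
            while True:
                self._notified.wait()
                self._notified.clear()
                # Safe to block here; the notify path is not held up even if
                # the image is mid-way through an exclusive-lock operation.
                self._check_size()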
mgr/dashboard: add 'omit_usage' query param to dashboard api 'get rbd' endpoint
Allows RBD info to be retrieved without getting associated usage info. This
can be useful for large RBDs where the process of gathering such usage info
is sometimes very slow.
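A hedged sketch of how such a query parameter might be honored in a Python handler (only the 'omit_usage' name comes from the commit; the handler shape and image methods are assumptions):

    def get_rbd_info(image, omit_usage=False):
        """Return basic image info, skipping the slow disk-usage calculation
        when the caller passes omit_usage=true."""
        info = {"name": image.name, "size": image.size()}
        if not omit_usage:
            # Walking the objects of a large image to compute usage can be
            # very slow, so only do it when explicitly requested.
            info["disk_usage"] = image.disk_usage()
        return info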
When the OSD preboots it sends an MMonGetPurgedSnaps message to
the monitor (`_get_purged_snaps`).
The monitor replies with all the purged snapshots whose purged_epoch_ is in the
range superblock.purged_snaps_last + 1 up to superblock.current_epoch + 1.
When the OSD handles the reply from the mon (`handle_get_purged_snaps_reply`),
it calls `record_purged_snaps` to write those purged snapshots to the
OSD store as well (PSN_ keys).
Once purged_snaps_last is reset, on the following OSD reboot the snapshots that
were marked as purged (purged_snaps_ keys) in the mon's store will also be
marked, correspondingly, in the OSD store.
That way `scrub_purged_snaps` will be able to re-trim the snapshots that (for
some reason) weren't marked as purged on the OSD side.
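A hedged Python sketch of the epoch-range selection described above (the store layout below is illustrative only):

    def purged_snaps_reply(mon_purged, purged_snaps_last, current_epoch):
        """Return the purged snapshots whose purged_epoch_ falls in
        [purged_snaps_last + 1, current_epoch + 1].

        `mon_purged` maps a purged epoch to the snapshots purged in it.
        """
        lo, hi = purged_snaps_last + 1, current_epoch + 1
        return {e: snaps for e, snaps in mon_purged.items() if lo <= e <= hi}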
Fixes: https://tracker.ceph.com/issues/62981
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
(cherry picked from commit 120ed0f0e8f65c18bfcd1d649617770c2c5af663)
Manual conflict fixes: 'scrubdebug' command was removed since it's
not part of the original commit.
Commit dc69033763cc116c6ccdf1f97149a74248691042 moves cephfs-shell from
"<CEPH-REPO-ROOT>/src/tools/cephfs/" to
"<CEPH-REPO-ROOT>/src/tools/cephfs/shell", but cephfs-shell's location in
src/vstart.sh and qa/tasks/cephfs/test_cephfs_shell.py was left
un-updated. This produces a broken vstart_environment.sh and a broken
export command in test_cephfs_shell.py.
Introduced-by: dc69033763cc116c6ccdf1f97149a74248691042
Fixes: https://tracker.ceph.com/issues/58795
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 48ef0444774934dd6d0d3e026142d95e4098bebd)
Conflicts:
qa/tasks/cephfs/test_cephfs_shell.py
- The comment at the top of the file was different in pacific
compared to the main branch.