Patrick Donnelly [Fri, 15 Jul 2022 20:39:00 +0000 (16:39 -0400)]
mds: ensure next replay is queued on req drop
Since [1], not all client replay requests are queued at once. We require
the next request to be queued when the current one is completed (unsafely) or
during cleanup. Not all code paths seem to handle this [2], so move it to a
generic location, MDCache::request_cleanup. Even so, this doesn't handle all
errors (so we must still be careful), as sometimes we must queue the next
replay request before an MDRequest is constructed [3] during some error
conditions.
Additionally, preserve the behavior of Server::journal_and_reply queueing the
next replay op. Otherwise, we would have to wait for the request to become
durable before moving on to the next one, unnecessarily.
Two specific cases for reproducing the issue are highlighted (thanks to
@Mer1997 on GitHub for locating these):
- The request is killed by a session close / eviction while a replayed request
is queued and waiting for a journal flush (e.g. dirty inest locks).
- The request construction fails because the request is already in
active_requests. In theory, this could happen if a client resends the same
request (same reqid) twice.
The first case is the most probable but very difficult to reproduce for testing
purposes. The replayed op would need to wait on a journal flush (to be
restarted by C_MDS_RetryRequest), and the request would then need to be killed
by a session close.
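The pattern the fix relies on can be illustrated with a small, self-contained Python sketch. This is not the MDS C++ code; the class and method names below are invented for illustration only. The point is that replay ops are admitted one at a time, so the "queue the next op" step has to live in the generic cleanup path as well as in the completion path, otherwise a dropped request stalls replay.
```
from collections import deque

class ReplayQueue:
    """Toy model: during client replay, ops are admitted one at a time and
    the next op is only dispatched when the current one completes or is
    cleaned up. Any error path that drops a request without queueing the
    next one stalls replay."""

    def __init__(self, ops):
        self.pending = deque(ops)
        self.active = None

    def queue_next_replay(self):
        # invented stand-in for queueing the next replayed request
        if self.active is None and self.pending:
            self.active = self.pending.popleft()
        return self.active

    def complete(self, op):
        # normal path: the next replay op is queued as soon as the current
        # one is journaled, without waiting for it to become durable
        assert op is self.active
        self.active = None
        return self.queue_next_replay()

    def request_cleanup(self, op):
        # generic cleanup (request killed by session close/eviction, failed
        # construction, ...): queueing the next op here covers error paths
        # that individual handlers may miss
        if op is self.active:
            self.active = None
        return self.queue_next_replay()

q = ReplayQueue(['op1', 'op2', 'op3'])
q.queue_next_replay()        # 'op1' dispatched
q.request_cleanup('op1')     # 'op1' killed; 'op2' is still queued
```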
Ramana Raja [Mon, 18 Sep 2023 02:52:56 +0000 (22:52 -0400)]
qa/suites/rbd: add test to check rbd_support module recovery
... on repeated blocklisting of its client.
There were issues with the rbd_support module not being able to recover from
its RADOS client being repeatedly blocklisted. This occurred, for example, in
clusters with OSDs that were slow to process RBD requests while the module's
mirror_snapshot_scheduler was taking mirror snapshots by requesting exclusive
locks on the RBD images, and workloads were running on the snapshotted images
via kernel clients.
There is no need for CreateSnapshotRequests.__del__(), which calls
CreateSnapshotRequests.wait_for_pending():
MirrorSnapshotScheduleHandler.shutdown() already calls
CreateSnapshotRequests.wait_for_pending().
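A hedged Python sketch of why the finalizer is redundant. wait_for_pending() and shutdown() are named in the commit; everything else here (the counters, open_request()/finish_request()) is invented for illustration and is not the actual mgr/rbd_support code. The handler's explicit shutdown() already drains in-flight requests, so there is nothing left for a __del__() to wait on.
```
import threading

class CreateSnapshotRequests:
    # simplified: pending work is counted and drained explicitly
    def __init__(self):
        self.lock = threading.Lock()
        self.condition = threading.Condition(self.lock)
        self.pending = 0

    def open_request(self):
        with self.lock:
            self.pending += 1

    def finish_request(self):
        with self.condition:
            self.pending -= 1
            self.condition.notify_all()

    def wait_for_pending(self):
        with self.condition:
            while self.pending > 0:
                self.condition.wait()

class MirrorSnapshotScheduleHandler:
    def __init__(self):
        self.create_snapshot_requests = CreateSnapshotRequests()

    def shutdown(self):
        # explicit teardown already drains in-flight requests, so a
        # __del__() finalizer doing the same work adds nothing
        self.create_snapshot_requests.wait_for_pending()
```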
Ramana Raja [Thu, 26 Oct 2023 17:18:52 +0000 (13:18 -0400)]
mgr/rbd_support: fix recursive locking on CreateSnapshotRequests lock
The MirrorSnapshotScheduleHandler's run thread issues asynchronous
create snapshot requests using a CreateSnapshotRequests instance. When
the thread invokes a CreateSnapshotRequests instance's get_ioctx(),
the instance's class variable lock is acquired. While the class
variable lock is held, garbage collection of a CreateSnapshotRequests
instance may run in that same thread. The thread would then call
CreateSnapshotRequests.__del__(), which tries to acquire the class
variable lock that the thread already holds. Fix this
recursive deadlock by converting the CreateSnapshotRequests lock from
a class variable to an instance variable; there is no need to share
the lock across CreateSnapshotRequests instances.
Also convert the MirrorSnapshotScheduleHandler, PerfHandler and
TrashPurgeScheduleHandler class variables to instance variables,
as they don't need to be shared across instances.
Fixes: https://tracker.ceph.com/issues/62994
Signed-off-by: Ramana Raja <rraja@redhat.com>
Co-Authored-By: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 4452bc22d1c6c8499cf55d6e39090adf7ae1dcbf)
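A minimal Python sketch of the bug and the fix. This is simplified; the real CreateSnapshotRequests does far more than this, and the class names below are placeholders. With the lock as a class variable, every instance shares one non-reentrant lock, so a __del__() triggered by garbage collection on a thread that already holds it deadlocks; moving the lock into __init__ gives each instance its own.
```
import threading

class BuggyRequests:
    # class variable: a single lock shared by every instance
    lock = threading.Lock()

    def get_ioctx(self):
        with BuggyRequests.lock:
            # If garbage collection runs another instance's __del__() on
            # this same thread while the shared lock is held, __del__()
            # blocks on BuggyRequests.lock and the thread deadlocks.
            pass

    def __del__(self):
        with BuggyRequests.lock:
            pass  # e.g. wait for pending requests

class FixedRequests:
    def __init__(self):
        # instance variable: each instance has its own lock, so a collected
        # instance's __del__() never contends with a lock held for a
        # different instance on the same thread
        self.lock = threading.Lock()
```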
Zac Dover [Wed, 1 Nov 2023 01:53:59 +0000 (11:53 +1000)]
doc/cephadm: edit troubleshooting.rst (1 of x)
Edit doc/cephadm/troubleshooting.rst. This commit and the PR of which it
is a part were raised in response to
https://github.com/ceph/ceph/pull/53976. The limits of reStructuredText
are particularly visible here in every instance of a Bash for loop and
in every instance of a command stretched over multiple lines.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit 69472c26af5419faa9ed93c071ed5933d03fa67f)
Zac Dover [Mon, 30 Oct 2023 02:37:39 +0000 (12:37 +1000)]
doc/glossary: improve "BlueStore" entry
Initially this was just s/backend/back end/, but I then added a little more
information about BlueStore's use of RocksDB to map object names to block
locations on disk.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit 8713cca328c9373636efdb92449d743b5bd56584)
Aashish Sharma [Mon, 30 Oct 2023 07:47:37 +0000 (13:17 +0530)]
mgr/dashboard: update rgw multisite import form helper info
Change 'To obtain the token, generate it from your secondary Ceph cluster' to 'To obtain the token, generate it from your primary Ceph cluster' in the rgw multisite import form helper.
Zac Dover [Fri, 27 Oct 2023 06:58:28 +0000 (16:58 +1000)]
doc/rados: remove cache-tiering-related keys
Remove information about cache-tiering-related keys from
doc/rados/operations/pools.rst. Cache tiering is deprecated in Reef.
This PR is suitable for backporting to the Reef release branch, but not
to release branches prior to Reef.
John Mulligan [Wed, 11 Oct 2023 18:05:17 +0000 (14:05 -0400)]
cephadm: add a --dry-run option to cephadm shell
Instead of creating the shell, the --dry-run option prints the container
command that would be used. This can be used as a starting point for
creating custom container commands similar to what cephadm shell would
generate, but with tweaks.
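A rough Python sketch of the idea, not the actual cephadm code: build_shell_command(), the container command it returns, and the image name are invented stand-ins. With --dry-run the assembled container command is printed instead of executed, so it can be copied and adjusted by hand.
```
import shlex
import subprocess

def build_shell_command(image):
    # invented stand-in: the real command cephadm assembles also includes
    # bind mounts, environment variables and entrypoint details
    return ['podman', 'run', '--rm', '-it', '--entrypoint', 'bash', image]

def shell(image='quay.io/ceph/ceph:v18', dry_run=False):
    cmd = build_shell_command(image)
    if dry_run:
        # --dry-run behaviour: print the container command instead of
        # running it, as a starting point for hand-tweaked invocations
        print(shlex.join(cmd))
        return 0
    return subprocess.call(cmd)

if __name__ == '__main__':
    shell(dry_run=True)
```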
Zac Dover [Wed, 25 Oct 2023 23:48:57 +0000 (09:48 +1000)]
doc/rados: remove HitSet-related key information
Remove HitSet-related key information from
doc/rados/operations/pools.rst. HitSet-related keys are relevant only to
releases of Ceph that support cache tiering, which means only Quincy and
earlier releases. Backport this commit from main to Reef, but not to
Quincy or to release branches earlier than Quincy.
```
Direct leak of 8 byte(s) in 1 object(s) allocated from:
#0 0x7f5c76eb6367 in operator new(unsigned long) (/lib64/libasan.so.6+0xb6367)
#1 0x7f5c76a2fb81 in MallocExtension::Register(MallocExtension*) (/lib64/libtcmalloc.so.4+0x2fb81)
SUMMARY: AddressSanitizer: 8 byte(s) leaked in 1 allocation(s)
```
crimson/os/seastore/onode_manager: populate value recorders of onodes to be erased
Otherwise, the following modification sequence within the same transaction
might lead to onode extents' crc inconsistency during journal replay:
1. modify the last mapping in an onode extent;
2. erase the last mapping in that onode extent.
During journal replay, if the first modification is not recorded in the
delta, the onode extent's content would be inconsistent with its content
before the system reboot.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Co-authored-by: Zac Dover <zac.dover@proton.me>
Signed-off-by: Juan Miguel Olmo Martínez <jolmomar@ibm.com>
(cherry picked from commit 9bb63bdc8969f2ecdfeedfc8396890ad59f0d796)
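A toy Python model of the failure mode. This is not the seastore on-disk format; it only assumes, for illustration, that erasing a mapping leaves the old value bytes in place, so the extent's raw bytes (and hence its crc) depend on every delta that was applied. If the 'set' delta is dropped because the onode is about to be erased, replaying only the 'erase' delta reconstructs different extent bytes than were written before the reboot.
```
import zlib

class OnodeExtentModel:
    # toy byte-level model: 'set' overwrites the value bytes in place, while
    # 'erase' only clears a presence flag and leaves the stale bytes behind
    def __init__(self):
        self.present = bytearray(b'\x01')  # presence flag for the mapping
        self.value = bytearray(b'AAAA')    # the last mapping's value bytes

    def apply(self, delta):
        op, payload = delta
        if op == 'set':
            self.value[:] = payload
        else:  # 'erase'
            self.present[:] = b'\x00'

    def crc(self):
        return zlib.crc32(bytes(self.present + self.value))

# one transaction: modify the last mapping, then erase it
recorded  = [('set', b'BBBB'), ('erase', None)]  # value recorder populated
truncated = [('erase', None)]                    # the 'set' delta was dropped

written = OnodeExtentModel()       # extent as written before the reboot
for delta in recorded:
    written.apply(delta)

replayed = OnodeExtentModel()      # extent reconstructed by journal replay
for delta in truncated:
    replayed.apply(delta)

# stale 'AAAA' bytes vs 'BBBB' bytes: the replayed crc no longer matches
print(written.crc() == replayed.crc())   # False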
Mark Nelson [Wed, 27 Apr 2022 15:06:22 +0000 (15:06 +0000)]
crimson: Enable tcmalloc when using seastar
Classic OSDs have always caused significant memory fragmentation
when using the libc memory allocator, due to the way that Ceph
tends to utilize memory. In recent testing, crimson-osd was found
to use 25-27 GB of RAM with the stock 3 GB bluestore cache settings
(osd_memory_target is only used when tcmalloc is available). Upon
further testing, it was found that the classic OSD is even worse,
using between 32 and 33 GB of RAM after a 5-minute 4K sequential
write test when using libc malloc.
The good news is that crimson-osd appears to be able to use
tcmalloc for alienstore without significant modification. Better
still, it drastically reduces memory usage. In the same test that
resulted in 25 GB RSS memory usage for crimson-osd with libc malloc,
a tcmalloc-linked version took around 9 GB (with an 8 GB
osd_memory_target). Since we do not yet (as far as I know) expose classic
OSD debugging in crimson, it is tough to tell why we are still a little
over, but it is clear that for alienstore we are going to need to
use tcmalloc as we do in classic.