Leonid Usov [Tue, 20 Feb 2024 12:43:43 +0000 (14:43 +0200)]
mds: MDSCacheObject wait mask and SimpleLock wait shift
MDSCacheObject waiting interface accepts a wait mask to queue waiters.
The mask is used to prioritize waiters: those queued with lower absolute mask values are invoked first.
SimpleLock used to create a different wait mask per lock type, thus prioritizing locks,
but the approach proved to be unfair and #8965 changed the waiting to be FIFO
per wanted wait bit. This change made the spreading of lock wait masks per lock type
redundant.
Until now this wasn't an issue, but when adding a new lock type we found
that the 64-bit mask space was already exhausted, given that each lock used
up 4 bits from the mask.
Further, there was a magic number `8` added to the lock mask shift offset,
which apparently was meant to accommodate custom high-priority wait bits
of the lock's parent object. This wasn't properly communicated via interfaces,
and it caused an overlap of the first lock's WAIT_RD bit with the fourth
private wait bit, CInode::WAIT_FLOCK.
This change optimizes the usage of the available 64-bit wait tag space,
aiming to maintain backward compatibility with the current implementation.
The approach is to split the 64 bits of the wait tag into a 16-bit ID field
and a 48-bit MASK field. The ID field will be matched literally, while the
MASK field will be matched by intersection.
With this approach we can encode the lock ordinal into the ID field and use
just four bits of the MASK field to distinguish between the different lock
wait flags. Since the ID field occupies the most significant 2 bytes, the
comparison of different lock tags will yield the same results as it does
today, when higher lock ids occupied higher bit offsets.
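A minimal sketch of the described tag layout (the names and constants here
are illustrative, not the actual MDS code):

    #include <cstdint>

    // Top 16 bits: an ID matched literally (e.g. the lock ordinal).
    // Low 48 bits: a MASK matched by intersection (per-lock wait flags).
    constexpr int      WAIT_ID_SHIFT  = 48;
    constexpr uint64_t WAIT_MASK_BITS = (uint64_t(1) << WAIT_ID_SHIFT) - 1;

    constexpr uint64_t make_wait_tag(uint16_t id, uint64_t mask) {
      return (uint64_t(id) << WAIT_ID_SHIFT) | (mask & WAIT_MASK_BITS);
    }

    // A queued waiter matches a wanted tag iff the IDs are equal and
    // the mask bits intersect.
    constexpr bool wait_tag_matches(uint64_t queued, uint64_t wanted) {
      return (queued >> WAIT_ID_SHIFT) == (wanted >> WAIT_ID_SHIFT) &&
             (queued & wanted & WAIT_MASK_BITS) != 0;
    }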
Redouane Kachach [Wed, 21 Feb 2024 07:27:53 +0000 (08:27 +0100)]
mgr/rook: adding empty calls to upgrade_ls and upgrade_status
added empty implementations of upgrade_ls and upgrade_status to avoid
dashboard errors when entering the Cluster > Upgrade view. Empty
implementations are used because we don't support the upgrade functionality
in Rook as we do for normal Ceph deployments; with Rook, the user has to
follow a different process to upgrade Ceph.
Venky Shankar [Tue, 20 Feb 2024 04:58:48 +0000 (10:28 +0530)]
Merge PR #52670 into main
* refs/pull/52670/head:
doc: add the reject the clone when threads are not available feature in the document
qa: add test cases for the support to reject clones feature
mgr/volumes: support to reject CephFS clones if cloner threads are not available
Patrick Donnelly [Thu, 15 Feb 2024 15:28:32 +0000 (10:28 -0500)]
mds: reverse MDSMap encoding of max_xattr_size/bal_rank_mask
Commit e134c890 adds the bal_rank_mask with encoded (ev) version 17. This was
merged into main Oct 2022 and made it into the reef release normally.
Commit 7b8def5c adds the max_xattr_size also with encoded (ev) version 17 but
places it before bal_rank_mask. This is problematic: there were no plans to
backport e134c890 to quincy or pacific, so piggybacking on the ev 17 bump
would not work; it would otherwise have required the backports to be done as
a set to ensure consistency (including with the kernel client).
However, the real issue is that 7b8def5c was not merged until after reef was
already cut. This required 7b8def5c to be backported separately in [1] which
was not merged until after v18.2.1 (current reef HEAD as of this commit).
Ultimately, this means that there are reef versions (v18.2.[01]) in the wild
which expect bal_rank_mask to be encoded at ev17 and not (max_xattr_size,
bal_rank_mask). Adding to the complications, the kernel client has already
merged code [2] expecting max_xattr_size for ev17.
It was decided in a github discussion [3] to move bal_rank_mask to ev18 to
avoid updating the kernel client which was done in the main branch via 36ee8e7e
and update the reef max_xattr_size backport with the same change (d8cebd67).
Unfortunately, this breaks upgrades from v18.2.[01] to newer reef versions or
to main. The reason is that monitors will encode v17 with bal_rank_mask
(max_xattr_size is not merged yet) and send that to upgraded mgrs (which are
upgraded first). The mgr will attempt to decode bal_rank_mask as a uint64_t
(max_xattr_size) but fail because an empty (by default) bal_rank_mask is simply
encoded as a signed 32-bit integer. Consequently, the mgr will fail decoding
with:
failed to decode message of type 45 v1: End of buffer [buffer:2]
Of course, the problem does not stop there: even if the mgr were able to
handle this, the monitors/mds/clients would fail in similar fashion.
So the only choice left is to fix max_xattr_size to be encoded at ev18.
Fortunately, v18.2.2 has not been released nor has any max_xattr_size backport
to quincy/pacific been merged. The main downside will be that kernels will
wrongly decode ev17 (which is already true for ceph clusters running
v18.2.[01]). A follow-up kernel fix will be required.
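For illustration, the resulting field order under Ceph's generic
versioned-encoding macros would look roughly as follows (a simplified
sketch; the real MDSMap code tracks a separate extended version, ev,
rather than relying on DECODE_START):

    void encode(bufferlist& bl) const {
      ENCODE_START(18, 17, bl);     // ev 18, compat 17
      // ... fields present through ev 16 ...
      encode(bal_rank_mask, bl);    // stays at ev 17, as v18.2.[01] expect
      encode(max_xattr_size, bl);   // moved to ev 18 by this change
      ENCODE_FINISH(bl);
    }

    void decode(bufferlist::const_iterator& p) {
      DECODE_START(18, p);
      // ...
      if (struct_v >= 17)
        decode(bal_rank_mask, p);
      if (struct_v >= 18)
        decode(max_xattr_size, p);
      DECODE_FINISH(p);
    }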
Zac Dover [Mon, 19 Feb 2024 08:41:45 +0000 (18:41 +1000)]
doc/cephfs: edit add-remove-mds
Disambiguate a note in doc/cephfs/add-remove-mds.rst to help readers
distinguish between cases in which they might want to use an automated
tool such as cephadm to deploy MDSes and cases in which they might want
to manually deploy MDSes.
See: https://github.com/ceph/ceph/pull/45639
Tracker: https://tracker.ceph.com/issues/54551
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
common/tracer: fix decoding when jaeger tracing is disabled
We aren't currently using jaeger tracing on Windows. The issue is
that Windows hosts (or any other host that doesn't use jaeger)
are experiencing message decoding failures after a recent change [1].
This change updates the tracer encoding so that messages from
non-jaeger hosts may be decoded by services that use jaeger.
[1] https://github.com/ceph/ceph/pull/47457
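A minimal sketch of the shape of such a fix, assuming a HAVE_JAEGER build
guard (the guard name and field here are assumptions, not the exact tracer
code): non-jaeger builds emit a placeholder so the wire format matches what
jaeger-enabled peers expect.

    void encode(bufferlist& bl) const {
      ENCODE_START(1, 1, bl);
    #ifdef HAVE_JAEGER
      encode(trace_context, bl);
    #else
      // placeholder payload so jaeger-enabled services can still
      // decode messages coming from non-jaeger hosts
      encode(std::string(), bl);
    #endif
      ENCODE_FINISH(bl);
    }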
Signed-off-by: Lucian Petrut <lpetrut@cloudbasesolutions.com>
This commit reintroduces 3701ffa6733b001d4278a0b68395c5efe2382f25, which
was reverted due to an implicit dependency on another revert. Please
see https://github.com/ceph/ceph/pull/52114#issuecomment-1950288188.
When doing a PG dump using 'ceph pg dump --format json-pretty',
the output is so big that the command hangs, and the ceph-mgr
also hangs and eventually fails over.
The exact size depends on the number of OSDs in the cluster
and the number of peers for each OSD.
In tests, the network ping times were identified as the largest
component in terms of size; they are now removed from the output
so as to limit the overall size.
Ronen Friedman [Mon, 12 Feb 2024 14:50:22 +0000 (08:50 -0600)]
test/osd: fix test_scrub_sched following scrubber changes
Replacing PgScrubber::determine_scrub_time() with a local copy,
as a stop-gap measure to keep the test running.
The scrub scheduling refactoring will remove the need for
this function, and the test will be updated accordingly.
Adam King [Thu, 15 Feb 2024 14:42:50 +0000 (09:42 -0500)]
Merge pull request #55566 from zdover23/wip-doc-2024-02-14-cephadm-services-nfs
doc/cephadm: correct nfs config pool name
Reviewed-by: Adam King <adking@redhat.com>
Reviewed-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
Reviewed-by: John Mulligan <jmulligan@redhat.com>
Casey Bodley [Wed, 14 Feb 2024 14:43:14 +0000 (09:43 -0500)]
rgw/putobj: RadosWriter uses part head object for multipart parts
the cleanup logic in the RadosWriter destructor was using the wrong
`head_obj` to avoid races between cleanup and part re-uploads: it
pointed at the final location of the multipart upload rather than the
head object of the current part.
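In effect (a hypothetical simplification; these are not the real rgw types):

    #include <set>
    #include <string>

    // Each part writer tracks the tail objects it has written and the
    // head object that guards their cleanup.
    struct PartWriterSketch {
      std::string head_obj;             // must be the *part's* head object
      std::set<std::string> tail_objs;  // chunks written for this part
      bool completed = false;

      ~PartWriterSketch() {
        if (completed)
          return;
        // abort path: drop only this part's chunks. Keying the cleanup
        // off the upload's final head object (shared by all parts) is
        // the bug being fixed: cleanup could then race with, and delete
        // data belonging to, a re-upload of the same part.
        for (const auto& o : tail_objs) {
          (void)o;  // placeholder for the rados delete of o under head_obj
        }
      }
    };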
Ilya Dryomov [Mon, 12 Feb 2024 12:07:22 +0000 (13:07 +0100)]
librbd: refactor merge() for SparseBufferlistExtent
- pass left.length + right.length instead of bl.length()
for consistency and to avoid circumventing the assert in the
SparseBufferlistExtent constructor
- claim_append() takes an lvalue reference, no need to move
- follow the pattern used in split() (see the sketch below)
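Put together, the refactored handler looks roughly like this (a sketch
assuming SparseBufferlistExtent carries state, length, and bl, as in
librbd/io/Types.h):

    SparseBufferlistExtent merge(SparseBufferlistExtent&& left,
                                 SparseBufferlistExtent&& right) const {
      // claim_append() takes an lvalue reference, so right.bl needs no move
      left.bl.claim_append(right.bl);
      // pass the combined logical lengths instead of left.bl.length(), so
      // the constructor's length consistency assert is exercised rather
      // than circumvented
      return SparseBufferlistExtent(left.state,
                                    left.length + right.length,
                                    std::move(left.bl));
    }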
Ilya Dryomov [Mon, 12 Feb 2024 10:00:45 +0000 (11:00 +0100)]
librbd: fix split() for SparseExtent and SparseBufferlistExtent
SparseExtents and SparseBufferlist are typedefs for interval_map. In
both cases the split() handler is broken: for the former the extent isn't
actually split, and for the latter an incorrect bufferlist is attached to
the split extent.
Fortunately, both SnapshotDelta (as produced by ObjectListSnapsRequest)
and the SparseBufferlist used in a couple of places seem to be collections
where only disjoint intervals are inserted and splitting doesn't occur
(at least in the common case). But still, this is a landmine waiting
for someone to step on it.
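For reference, a correct split() handler needs to carve out the matching
sub-range and attach the corresponding data; roughly (same
SparseBufferlistExtent assumptions as the merge sketch above):

    SparseBufferlistExtent split(uint64_t offset, uint64_t length,
                                 SparseBufferlistExtent& sbe) const {
      ceph::bufferlist bl;
      if (sbe.state == SPARSE_EXTENT_STATE_DATA) {
        // attach the matching sub-range of the data, not the original bl
        bl.substr_of(sbe.bl, offset, length);
      }
      return SparseBufferlistExtent(sbe.state, length, std::move(bl));
    }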