Ilya Dryomov [Wed, 12 Feb 2025 10:25:48 +0000 (11:25 +0100)]
qa/workunits/rbd: use create_image_and_enable_mirror() in bootstrap tests
The reason create_image() + enable_mirror() happens to work for
PARENT_POOL is that PARENT_POOL is enabled for mirroring in image mode
unconditionally, unlike POOL, POOL/NS1 or PARENT_POOL/NS1 for which
MIRROR_POOL_MODE setting is respected. This isn't immediately obvious
because it's done in setup_pools() in rbd_mirror_helpers.sh.
Switch to create_image_and_enable_mirror() for clarity.
Ilya Dryomov [Tue, 11 Feb 2025 16:44:51 +0000 (17:44 +0100)]
librbd: fix mirror image status summary in a namespace
For the purposes of the summary with image counts, "rbd mirror pool
status" command is supposed to count each image only once. To this
end, for unidirectional mirroring the status of the receiving site
should be taken while for bidirectional mirroring the statuses should
be combined/reduced. For example, if mirroring is enabled on a single
image and everything is in order, the summary is expected to be
    image health: OK
    images: 1 total
        1 replaying
on both clusters even though on the primary the local status is
MIRROR_IMAGE_STATUS_STATE_STOPPED and only on the secondary it's
MIRROR_IMAGE_STATUS_STATE_REPLAYING.
Currently this isn't the case for custom namespaces. In the same
scenario the primary ends up reporting
Conflicts:
qa/workunits/rbd/rbd_mirror_bootstrap.sh [ commits 3fd8a0388735
("qa/workunits/rbd: merge journal and snapshot test scripts")
and 3fdbc160bb21 ("rbd-mirror: allow mirroring to a different
namespace") not in reef ]
Zac Dover [Mon, 10 Feb 2025 08:12:34 +0000 (18:12 +1000)]
doc/cephadm: improve "Activate Existing OSDs".
Make three minor changes to doc/cephadm/services/osd.rst. These three
changes were suggested by Eugen Block, who reviewed this procedure after
developing it.
Zac Dover [Fri, 7 Feb 2025 01:32:20 +0000 (11:32 +1000)]
doc/cephadm: improve "Activate Existing OSDs"
Improve the section "Activate Existing OSDs".
Supplement the information in the "Activate Existing OSDs" section with
a procedure developed by Eugen Block, here:
https://heiterbiswolkig.blogs.nde.ag/2025/02/06/cephadm-activate-existing-osds/
This procedure explains how to activate OSDs on a host that, for
whatever reason, has had to have its operating system reinstalled.
Ilya Dryomov [Wed, 29 Jan 2025 11:56:34 +0000 (12:56 +0100)]
librbd: stop filtering async request error codes
The roots of this go back to 2015 when snap create was changed to
filter EEXIST in commit 63f6c9bac9a4 ("librbd: fixed snap create race
conditions") and flatten respectively EINVAL in commit ef7e210c3f74
("librbd: better handling for duplicate flatten requests"). From there
this pattern made it to most other operations that can be proxied
including "rbd migration execute".
The motivation was to suppress generation of an "expected" error in
response to a duplicate async request notification for the operation.
However, doing this at the top of the handler (right before returning
to the caller) and for an error as generic as EINVAL is super fragile.
It's trivial for an error that is being filtered to sneak in with
a lower level change completely unnoticed. For example, live migration
recently added NBD stream which is implemented on top of libnbd and it
turns out that some libnbd APIs return EINVAL on various occasions when
the NBD endpoint disappears and an error like ENOTCONN would make more
sense. If this occurs during "rbd migration execute" operation, the
rest of librbd never learns that migration was disrupted and the image
is transitioned to MIGRATION_STATE_EXECUTED, thus handing a partially
imported (read: corrupted) image to the user.
Luckily, with commits 07fbc4b71df4 ("librbd: track complete async
operation requests") and 96bc20445afb ("librbd: track complete async
operation return code"), the scenario which originally prompted error
code filtering isn't an issue anymore. Despite a few shortcomings
(e.g. when an async request notification is acked with result 0, it's
impossible to tell whether a) a new operation was kicked off, b) there
is an operation that is still in progress or c) it's for an operation
that completed earlier but hasn't "expired" yet), even just commit
07fbc4b71df4 by itself prevents a duplicate notification from kicking
off a second operation that could generate an error for something that
actually succeeded. With that in mind, eradicate error code filtering
from Operations class.
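
To make the fragility concrete, here is a minimal sketch rather than the
librbd code. The helper names filtered_execute() and unfiltered_execute()
are hypothetical, and the operation is modeled as a plain callable that
returns 0 or a negative errno value. It only shows how filtering EINVAL
right before returning to the caller turns an unrelated lower-level
EINVAL into apparent success.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Hypothetical stand-in for a proxied operation handler that filters an
// "expected" error code at the very top, right before returning.
int filtered_execute(const std::function<int()>& op) {
  int r = op();
  if (r == -EINVAL) {
    // Meant to hide the error caused by a duplicate request, but it also
    // hides any other EINVAL raised by lower layers.
    return 0;
  }
  return r;
}

// Without filtering, whatever the operation reported reaches the caller.
int unfiltered_execute(const std::function<int()>& op) {
  return op();
}
```

With the filter in place the caller cannot tell a disrupted operation
from a completed one, which is the failure mode described above for
"rbd migration execute".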
edef [Thu, 16 Mar 2023 09:43:58 +0000 (09:43 +0000)]
common: use close_range on Linux
Fix rook/rook#10110, which occurs when _SC_OPEN_MAX/RLIMIT_NOFILE is
set to very large values (2^30), leaving fork_function pegging a core
busylooping.
The glibc wrappers closefrom(3)/close_range(3) are not available before
glibc 2.34, so we invoke the syscall directly. When glibc 2.34 is old
enough to be a reasonable hard minimum dependency, we should switch to
using closefrom.
If we're not running on (recent enough) Linux, we fall back to the
existing approach.
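
A minimal sketch of the approach, assuming a hypothetical helper named
close_fds_from() (the actual change lives in fork_function and may be
structured differently). It invokes the close_range syscall directly
when the kernel headers define SYS_close_range, and otherwise, or if the
call fails, falls back to closing descriptors one by one. The 1024
fallback limit is an assumption for illustration only.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Close every file descriptor numbered `first` or higher.
void close_fds_from(int first) {
#if defined(__linux__) && defined(SYS_close_range)
  // Raw syscall: the glibc wrapper only exists in glibc >= 2.34.
  // Arguments are (first, last, flags); ~0U means "up to the highest fd".
  if (syscall(SYS_close_range, static_cast<unsigned>(first), ~0U, 0U) == 0) {
    return;
  }
  // The call can fail (e.g. ENOSYS on older kernels): fall through.
#endif
  // Fallback: iterate up to the descriptor limit. This is the loop that
  // becomes very slow when the limit is set to a huge value.
  long max_fd = sysconf(_SC_OPEN_MAX);
  if (max_fd < 0) {
    max_fd = 1024;  // assumed default when the limit cannot be queried
  }
  for (long fd = first; fd < max_fd; ++fd) {
    close(static_cast<int>(fd));
  }
}
```

A child process would typically call something like
close_fds_from(STDERR_FILENO + 1) after fork so that only the standard
streams stay open.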
Thrashers that do not inherit from ThrasherGreenlet previously used a
method called do_join, which combined stop and join functionality. To
ensure consistency and clarity, we want all thrashers to use separate
stop, join, and stop_and_join methods.
This commit renames methods and implements missing stop and stop_and_join
methods in thrashers that did not inherit from ThrasherGreenlet.
John Mulligan [Tue, 21 Jan 2025 21:28:42 +0000 (16:28 -0500)]
container: add label ceph=True back
Add a label used by cephadm internally that was always set by
ceph-container [1] back to the new containerfile. This should
prevent issues with cephadm shell command thinking official ceph images
are not official ceph images.
Ilya Dryomov [Thu, 30 Jan 2025 19:30:18 +0000 (20:30 +0100)]
doc/rbd: use https links in live import examples
Even though it's explicitly said that "http" stream can be used to
import via both HTTP and HTTPS, it can still be confusing that "type":
"http" is expected to go with "url": "https://...". Switch example
URLs from HTTP to HTTPS to make it more obvious.
Ilya Dryomov [Mon, 27 Jan 2025 11:29:54 +0000 (12:29 +0100)]
osd/OSDCap: fix misleading grammar comments
The restrictions on pool name and namespace have been independent of
each other for ages. Specifying namespace[=]<namespace> doesn't require
specifying pool[=]<pool>, as is currently suggested -- neither for
regular "allow" grants nor for "profile" grants.
Ilya Dryomov [Fri, 24 Jan 2025 19:47:11 +0000 (20:47 +0100)]
mon/OSDMonitor: relax cap enforcement for unmanaged snapshots
Since commit 4972e054b32c ("mon/OSDMonitor: enforce caps when
creating/deleting unmanaged snapshots"), a) write access to the MON
service, b) write access to the OSD service for a pool or c) permission
for "osd pool op unmanaged-snap" command for a pool is required. For
"profile rbd" we configure read-only access to the MON service and rely
on write access to the OSD service, however the corresponding check in
is_osd_writable() is too strict.
An OSD cap like "profile rbd namespace=myns" or "allow w namespace=myns"
allows write access to myns namespace of any pool, but is_osd_writable()
disallows operations with unmanaged snapshots with such a cap because
its match.pool_namespace.pool_name.empty() is true. This condition
appears to serve as the "doesn't include support for the application
tag" guard, but it should actually be match.pool_tag.is_match_all()
(or match.pool_tag.application.empty() if open-coded) -- no restriction
on the pool name doesn't automatically mean that there is a restriction
on the application tag.
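
The difference between the two guards can be illustrated with a heavily
simplified sketch. The types below are hypothetical stand-ins that only
reuse the field names mentioned above; old_check() and new_check() are
invented for illustration and are not the body of is_osd_writable().
The sketch assumes that an empty pool name means "any pool" and an empty
application means "no application tag restriction".

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-ins for the OSD cap match structures.
struct PoolNamespace {
  std::string pool_name;  // empty: no restriction on the pool name
  std::string nspace;     // empty: no restriction on the namespace
};

struct PoolTag {
  std::string application;  // empty: no application tag restriction
  bool is_match_all() const { return application.empty(); }
};

struct CapMatch {
  PoolNamespace pool_namespace;
  PoolTag pool_tag;
};

// Old behavior as described: an empty pool name is treated as if the cap
// were restricted by application tag, so the cap is rejected.
bool old_check(const CapMatch& m, const std::string& pool) {
  return !m.pool_namespace.pool_name.empty() &&
         m.pool_namespace.pool_name == pool;
}

// Relaxed behavior: only an actual application tag restriction is skipped;
// an empty pool name means the cap applies to any pool.
bool new_check(const CapMatch& m, const std::string& pool) {
  if (!m.pool_tag.is_match_all()) {
    return false;  // application tag matching is not handled here
  }
  return m.pool_namespace.pool_name.empty() ||
         m.pool_namespace.pool_name == pool;
}
```

Under these assumptions a cap such as "allow w namespace=myns" fails
old_check() for every pool but passes new_check().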
Ilya Dryomov [Wed, 22 Jan 2025 19:34:11 +0000 (20:34 +0100)]
librbd: clear ctx before initiating close in Image::{aio_,}close()
Image::aio_close() must clear ctx before initiating close. Otherwise
the provided callback may see a non-NULL ctx and attempt to close the
image again from Image destructor, leading to an invalid memory access
as ImageCtx and ImageState are both freed immediately after the image
is closed (i.e. before AioCompletion is completed and the callback is
executed).
The same adjustment is made to Image::close() just for consistency.
Ilya Dryomov [Sat, 25 Jan 2025 10:11:14 +0000 (11:11 +0100)]
doc/rados: pool and namespace are independent osdcap restrictions
For the "profile {name}" syntax, pool and namespace restrictions are
independent of each other (i.e. specifying namespace doesn't also
require specifying pool, as is currently suggested). A cap can look
like "profile rbd namespace=myns", signifying that the RBD profile is
to be allowed in myns namespace of any pool.
For the "allow {access-spec}" syntax, pool restriction is optional.
A cap can look like "allow r namespace=myns", "allow w object_prefix
myprefix" or "allow rw namespace=myns object_prefix myprefix", for
example.
Zac Dover [Fri, 24 Jan 2025 13:46:19 +0000 (23:46 +1000)]
doc/cephfs: edit disaster-recovery-experts (6 of x)
In doc/cephfs/disaster-recovery-experts.rst, incorporate Anthony's
suggestions in
https://github.com/ceph/ceph/pull/61462#discussion_r1923917812
and
https://github.com/ceph/ceph/pull/61462#discussion_r1923920724
and reword the sentences in the section "Using an alternate metadata
pool for recovery" to be in the imperative mood, which better suits the
ordered list format that was introduced in
https://github.com/ceph/ceph/pull/61493.
Follows https://github.com/ceph/ceph/pull/61493.
https://tracker.ceph.com/issues/69557
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit 5670054bd0533c8f2507d0596797214da8ba489a)
msg: insert PriorityDispatchers in sorted position
avoid calling stable_sort() after every insertion by inserting directly
into the sorted position. use lower_bound() to insert at the head and
upper_bound() to insert at the tail
this generally only happens during startup so isn't a performance
problem, but std::stable_sort() was triggering strange valgrind warnings
for "Mismatched free() / delete / delete []" when it allocates a
temporary buffer
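
a minimal sketch of the insertion technique, assuming a simplified
PriorityDispatcher record and hypothetical helper names
(add_dispatcher_head/add_dispatcher_tail); the real type in msg holds a
dispatcher pointer and differs in detail. lower_bound() finds the first
entry whose priority is not less than the new one, so the new entry
lands ahead of its equal-priority peers; upper_bound() finds the first
entry with a greater priority, so the new entry lands behind them.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified stand-in: the id field only exists to tell entries apart.
struct PriorityDispatcher {
  unsigned priority;
  int id;
};

struct ByPriority {
  bool operator()(const PriorityDispatcher& a,
                  const PriorityDispatcher& b) const {
    return a.priority < b.priority;
  }
};

// Insert ahead of any existing entries with the same priority.
void add_dispatcher_head(std::vector<PriorityDispatcher>& v,
                         PriorityDispatcher d) {
  auto pos = std::lower_bound(v.begin(), v.end(), d, ByPriority{});
  v.insert(pos, d);
}

// Insert behind any existing entries with the same priority.
void add_dispatcher_tail(std::vector<PriorityDispatcher>& v,
                         PriorityDispatcher d) {
  auto pos = std::upper_bound(v.begin(), v.end(), d, ByPriority{});
  v.insert(pos, d);
}
```

the vector stays sorted after every insertion, so no separate sort pass
(and no temporary buffer) is needed.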