Zac Dover [Fri, 7 Feb 2025 01:32:20 +0000 (11:32 +1000)]
doc/cephadm: improve "Activate Existing OSDs"
Improve the section "Activate Existing OSDs".
Supplement the information in the "Activate Existing OSDs" section with
a procedure developed by Eugen Block, here:
https://heiterbiswolkig.blogs.nde.ag/2025/02/06/cephadm-activate-existing-osds/
This procedure explains how to activate OSDs on a host that, for
whatever reason, has had to have its operating system reinstalled.
Ilya Dryomov [Wed, 29 Jan 2025 11:56:34 +0000 (12:56 +0100)]
librbd: stop filtering async request error codes
The roots of this go back to 2015, when snap create was changed to
filter EEXIST in commit 63f6c9bac9a4 ("librbd: fixed snap create race
conditions") and flatten was changed to filter EINVAL in commit
ef7e210c3f74 ("librbd: better handling for duplicate flatten requests").
From there this pattern made it to most other operations that can be
proxied, including "rbd migration execute".
The motivation was to suppress generation of an "expected" error in
response to a duplicate async request notification for the operation.
However, doing this at the top of the handler (right before returning
to the caller) and for an error as generic as EINVAL is super fragile.
It's trivial for an error that is being filtered to sneak in with
a lower level change completely unnoticed. For example, live migration
recently added NBD stream which is implemented on top of libnbd and it
turns out that some libnbd APIs return EINVAL on various occasions when
the NBD endpoint disappears and an error like ENOTCONN would make more
sense. If this occurs during "rbd migration execute" operation, the
rest of librbd never learns that migration was disrupted and the image
is transitioned to MIGRATION_STATE_EXECUTED, thus handing a partially
imported (read: corrupted) image to the user.
Luckily, with commits 07fbc4b71df4 ("librbd: track complete async
operation requests") and 96bc20445afb ("librbd: track complete async
operation return code"), the scenario which originally prompted error
code filtering isn't an issue anymore. Despite a few shortcomings
(e.g. when an async request notification is acked with result 0, it's
impossible to tell whether a) a new operation was kicked off, b) there
is an operation that is still in progress or c) it's for an operation
that completed earlier but hasn't "expired" yet), even just commit
07fbc4b71df4 by itself prevents a duplicate notification from kicking
off a second operation that could generate an error for something that
actually succeeded. With that in mind, eradicate error code filtering
from the Operations class.
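A minimal Python sketch of the failure mode described above, not the
librbd code itself. The function names are hypothetical; the only point
is that filtering a generic code such as EINVAL right before returning
also swallows an unrelated EINVAL raised by a lower layer.

```python
import errno


def lower_layer_read(endpoint_alive):
    """Hypothetical lower layer that reports EINVAL when the endpoint is gone."""
    return 0 if endpoint_alive else -errno.EINVAL


def execute_filtered(endpoint_alive):
    """Old pattern: filter EINVAL right before returning to the caller."""
    r = lower_layer_read(endpoint_alive)
    if r == -errno.EINVAL:
        # Meant to hide the "duplicate request" error, but it cannot tell
        # that error apart from the EINVAL produced by the lower layer.
        r = 0
    return r


def execute_unfiltered(endpoint_alive):
    """New pattern: pass the error code through unchanged."""
    return lower_layer_read(endpoint_alive)
```

With the endpoint gone, execute_filtered() reports success while
execute_unfiltered() surfaces the error to the caller.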
Thrashers that do not inherit from ThrasherGreenlet previously used a
method called do_join, which combined stop and join functionality. To
ensure consistency and clarity, we want all thrashers to use separate
stop, join, and stop_and_join methods.
This commit renames methods and implements missing stop and stop_and_join
methods in thrashers that did not inherit from ThrasherGreenlet.
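The intended method split could look roughly like the sketch below. The
class name and the thread-based loop are assumptions made for
illustration; they are not taken from the actual thrashers.

```python
import threading


class ExampleThrasher:
    """Hypothetical thrasher that does not inherit from ThrasherGreenlet."""

    def __init__(self):
        self._stopping = threading.Event()
        self._thread = threading.Thread(target=self._run)

    def start(self):
        self._thread.start()

    def _run(self):
        # Placeholder for the real thrashing loop.
        while not self._stopping.is_set():
            self._stopping.wait(0.01)

    def stop(self):
        """Ask the thrasher to stop, without waiting for it."""
        self._stopping.set()

    def join(self):
        """Wait for the thrasher to finish."""
        self._thread.join()

    def stop_and_join(self):
        """What do_join used to do in a single call."""
        self.stop()
        self.join()
```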
John Mulligan [Tue, 21 Jan 2025 21:28:42 +0000 (16:28 -0500)]
container: add label ceph=True back
Add a label used by cephadm internally that was always set by
ceph-container [1] back to the new containerfile. This should
prevent issues with the cephadm shell command thinking that official
ceph images are not official ceph images.
Ilya Dryomov [Thu, 30 Jan 2025 19:30:18 +0000 (20:30 +0100)]
doc/rbd: use https links in live import examples
Even though it's explicitly said that "http" stream can be used to
import via both HTTP and HTTPS, it can still be confusing that "type":
"http" is expected to go with "url": "https://...". Switch example
URLs from HTTP to HTTPS to make it more obvious.
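For illustration, a source spec of the kind discussed here can be built
as follows. The URL is a placeholder and the "raw" format is an assumed
choice; only the pairing of "type": "http" with an https URL is the point.

```python
import json

# Sketch of a live import source spec; the URL is a placeholder.
source_spec = {
    "type": "raw",  # assumed image format for this example
    "stream": {
        "type": "http",  # the stream type stays "http" for HTTPS as well
        "url": "https://example.com/image.raw",
    },
}

print(json.dumps(source_spec, indent=2))
```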
Ilya Dryomov [Mon, 27 Jan 2025 11:29:54 +0000 (12:29 +0100)]
osd/OSDCap: fix misleading grammar comments
The restrictions on pool name and namespace have been independent of
each other for ages. Specifying namespace[=]<namespace> doesn't require
specifying pool[=]<pool> as is currently suggested -- neither for
regular "allow" grants nor for "profile" grants.
Ilya Dryomov [Fri, 24 Jan 2025 19:47:11 +0000 (20:47 +0100)]
mon/OSDMonitor: relax cap enforcement for unmanaged snapshots
Since commit 4972e054b32c ("mon/OSDMonitor: enforce caps when
creating/deleting unmanaged snapshots"), a) write access to the MON
service, b) write access to the OSD service for a pool or c) permission
for "osd pool op unmanaged-snap" command for a pool is required. For
"profile rbd" we configure read-only access to the MON service and rely
on write access to the OSD service, however the corresponding check in
is_osd_writable() is too strict.
An OSD cap like "profile rbd namespace=myns" or "allow w namespace=myns"
allows write access to myns namespace of any pool, but is_osd_writable()
disallows operations with unmanaged snapshots with such a cap because
its match.pool_namespace.pool_name.empty() is true. This condition
appears to serve as the "doesn't include support for the application
tag" guard, but it should actually be match.pool_tag.is_match_all()
(or match.pool_tag.application.empty() if open-coded) -- no restriction
on the pool name doesn't automatically mean that there is a restriction
on the application tag.
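The difference between the two guards can be sketched with a heavily
simplified model. The Match class and both functions below are
hypothetical stand-ins, not the real OSDCap or is_osd_writable() code,
and they ignore the access bits and namespace matching entirely.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Simplified, hypothetical stand-in for an OSD cap match."""
    pool_name: str = ""    # empty: no restriction on the pool name
    namespace: str = ""    # empty: no restriction on the namespace
    application: str = ""  # empty: no restriction on the application tag


def writable_old(match, pool):
    # Old guard: an empty pool name rejects the cap outright.
    if match.pool_name == "":
        return False
    return match.pool_name == pool


def writable_new(match, pool):
    # New guard: only caps restricted by application tag are rejected.
    if match.application != "":
        return False
    return match.pool_name in ("", pool)
```

In this model a cap like "allow w namespace=myns", written as
Match(namespace="myns"), is rejected by the old guard and accepted by
the new one for any pool.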
Ilya Dryomov [Wed, 22 Jan 2025 19:34:11 +0000 (20:34 +0100)]
librbd: clear ctx before initiating close in Image::{aio_,}close()
Image::aio_close() must clear ctx before initiating close. Otherwise
the provided callback may see a non-NULL ctx and attempt to close the
image again from Image destructor, leading to an invalid memory access
as ImageCtx and ImageState are both freed immediately after the image
is closed (i.e. before AioCompletion is completed and the callback is
executed).
The same adjustment is made to Image::close() just for consistency.
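The ordering issue can be modeled in a few lines of Python. This is a
toy model with hypothetical names, where destroy() plays the role of the
Image destructor and the callback runs as part of the close.

```python
class Image:
    """Toy model of an image handle; not the librbd class."""

    def __init__(self):
        self.ctx = object()  # stands in for ImageCtx
        self.close_calls = 0

    def _close_ctx(self, ctx, on_complete):
        self.close_calls += 1
        on_complete()

    def aio_close(self, on_complete):
        # Clear ctx first, so the callback never sees a stale handle.
        ctx, self.ctx = self.ctx, None
        self._close_ctx(ctx, on_complete)

    def destroy(self):
        # Models the destructor: close only if ctx is still set.
        if self.ctx is not None:
            ctx, self.ctx = self.ctx, None
            self._close_ctx(ctx, lambda: None)


image = Image()
# The callback destroys the image, as a user-provided callback might.
image.aio_close(image.destroy)
```

Because ctx is cleared before the close starts, the destroy() call made
from the callback finds nothing left to close.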
Ilya Dryomov [Sat, 25 Jan 2025 10:11:14 +0000 (11:11 +0100)]
doc/rados: pool and namespace are independent osdcap restrictions
For the "profile {name}" syntax, pool and namespace restrictions are
independent of each other (i.e. specifying namespace doesn't also
require specifying pool as is currently suggested). A cap can look
like "profile rbd namespace=myns", signifying that the RBD profile is
to be allowed in myns namespace of any pool.
For the "allow {access-spec}" syntax, pool restriction is optional.
A cap can look like "allow r namespace=myns", "allow w object_prefix
myprefix" or "allow rw namespace=myns object_prefix myprefix", for
example.
Zac Dover [Fri, 24 Jan 2025 13:46:19 +0000 (23:46 +1000)]
doc/cephfs: edit disaster-recovery-experts (6 of x)
In doc/cephfs/disaster-recovery-experts.rst, incorporate Anthony's
suggestions in
https://github.com/ceph/ceph/pull/61462#discussion_r1923917812
and
https://github.com/ceph/ceph/pull/61462#discussion_r1923920724
and reword the sentences in the section "Using an alternate metadata
pool for recovery" to be in the imperative mood, which better suits the
ordered list format that was introduced in
https://github.com/ceph/ceph/pull/61493.
Follows https://github.com/ceph/ceph/pull/61493.
https://tracker.ceph.com/issues/69557
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit 5670054bd0533c8f2507d0596797214da8ba489a)
msg: insert PriorityDispatchers in sorted position
Avoid calling stable_sort() after every insertion by inserting directly
into the sorted position. Use lower_bound() to insert at the head and
upper_bound() to insert at the tail.
This generally only happens during startup, so it isn't a performance
problem, but std::stable_sort() was triggering strange valgrind warnings
for "Mismatched free() / delete / delete []" when it allocates a
temporary buffer.
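The same idea can be sketched in Python, where bisect_left and
bisect_right are the analogs of lower_bound() and upper_bound(). The
(priority, name) tuples and the ascending order are assumptions made for
the example, not the actual dispatcher layout.

```python
import bisect


def insert_head(dispatchers, priority, name):
    """Insert before any existing entries of equal priority (lower_bound)."""
    keys = [p for p, _ in dispatchers]
    dispatchers.insert(bisect.bisect_left(keys, priority), (priority, name))


def insert_tail(dispatchers, priority, name):
    """Insert after any existing entries of equal priority (upper_bound)."""
    keys = [p for p, _ in dispatchers]
    dispatchers.insert(bisect.bisect_right(keys, priority), (priority, name))


dispatchers = []
insert_tail(dispatchers, 10, "a")
insert_tail(dispatchers, 10, "b")
insert_head(dispatchers, 10, "c")
insert_tail(dispatchers, 5, "d")
insert_head(dispatchers, 20, "e")
```

The list stays sorted by priority after every insertion, so no separate
sort pass is needed.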
Zac Dover [Thu, 23 Jan 2025 09:49:26 +0000 (19:49 +1000)]
doc/cephfs: edit disaster-recovery-experts (5 of x)
Put the procedure in the section called "Using an alternate metadata
pool for recovery" into an ordered list, so that it is in a proper
procedure format.
This commit is meant only to break the procedure into steps. The English
language in each of these steps could be improved, but that improvement
will be done after this formatting has been merged and backported.
pg-split-merge uses the ceph daemon command to check the merge, but it
doesn't use the asok path, which causes the check not to return the
correct output. Change the command to use the asok path.