We don't backport PRs merged into doc/releases. Therefore, when one browses to an older Ceph release version on docs.ceph.com (e.g., https://docs.ceph.com/en/pacific/), the information is out of date at best.
The doc/releases page is accurate only when browsing https://docs.ceph.com/en/latest/.
So this post_checkout command makes sure doc/releases is checked out from main before building and publishing.
mgr/volumes: Fix subvolume creation on FIPS-enabled systems
The md5 checksum is used in the construction of the legacy
subvolume config filename. It is not used for any security
purpose, so the 'usedforsecurity' flag is set to false to make
it FIPS compliant.
md5 has always been used here, but commit 373a04cf734 caused it
to be exercised in 'open_subvol', which is a prerequisite for
all subvolume operations, and hence subvolume creation failed.
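A minimal Python sketch of the pattern described above; the helper name and filename format are illustrative, not the actual mgr/volumes code. It relies on the `usedforsecurity` keyword available in Python 3.9+:
```python
import hashlib

def legacy_config_filename(subvol_path: str) -> str:
    # The digest only derives a stable filename for the legacy subvolume
    # config; it is not a security primitive, so usedforsecurity=False
    # keeps the call usable on FIPS-enabled systems (Python 3.9+).
    digest = hashlib.md5(subvol_path.encode("utf-8"), usedforsecurity=False)
    return "{0}.meta".format(digest.hexdigest())
```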
Kotresh HR [Fri, 4 Feb 2022 09:58:39 +0000 (15:28 +0530)]
qa: validate subvolume discover on upgrade
Validate subvolume discovery on upgrade from a legacy
subvolume to v1. The handcrafted `.meta` file in the legacy
subvolume root should not be used for any subvolume APIs
such as getpath or authorize.
Kotresh HR [Fri, 4 Feb 2022 09:25:03 +0000 (14:55 +0530)]
mgr/volumes: Fix subvolume discover during upgrade
Fix subvolume discovery so that the correct metadata file is
used after an upgrade from a legacy subvolume to v1. The fix
makes sure the handcrafted metadata file placed in the root
of a legacy subvolume is not used.
Co-authored-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
Co-authored-by: Dan van der Ster <daniel.vanderster@cern.ch>
Co-authored-by: Ramana Raja <rraja@redhat.com>
Signed-off-by: Kotresh HR <khiremat@redhat.com>
(cherry picked from commit 7eba9cab6cfb9a13a84062177d7a0fa228311e13)
ceph-volume: fix fast device allocation size on multiple devices
The size computed by get_physical_fast_allocs() was wrong when the
function had multiple devices to handle.
For instance, with 4 OSDs and 2 fast devices of 10G each while
allocating 2 slots per fast device, each slot used to be 2.5G,
leaving both fast devices half full. Now each slot takes 5G, so
each fast device is fully used.
Fixes: https://tracker.ceph.com/issues/56031
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit d0f9e93914e2b7feac41a634311d74c146c8868b)
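The arithmetic is easy to check with a small Python sketch; the function name is illustrative, not the actual get_physical_fast_allocs() code:
```python
def fast_device_slot_size(device_size_gb: float, slots_per_device: int) -> float:
    # Size each slot against the device it lives on, not against the total
    # number of OSDs being deployed.
    return device_size_gb / slots_per_device

# Example from the commit message: 4 OSDs, 2 fast devices of 10G each,
# 2 slots per fast device. Before the fix each slot was 2.5G (devices half
# full); after the fix each slot is 5G (devices fully used).
assert fast_device_slot_size(10, 2) == 5.0
```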
librbd: bail from schedule_request_lock() if already lock owner
A race condition may be hit if there are multiple pending locks and
pending callbacks for the same image. Abort the exclusive lock
acquisition if this client is already the exclusive lock owner.
Fixes: https://tracker.ceph.com/issues/56549
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit 3527d2c764626c09c5ede80ae844551fd8845756)
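A minimal sketch of the guard, in Python rather than librbd's C++; the class and attribute names are made up for illustration:
```python
import threading

class ImageLockClient:
    def __init__(self):
        self._lock = threading.Lock()
        self.is_lock_owner = False

    def schedule_request_lock(self):
        with self._lock:
            if self.is_lock_owner:
                # Already the exclusive lock owner: a racing pending request
                # or callback must not restart the acquisition path.
                return
            self._acquire_exclusive_lock()

    def _acquire_exclusive_lock(self):
        # Stand-in for the real acquisition path.
        self.is_lock_owner = True
```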
rbd: don't default empty pool name unless namespace is specified
Commit 96f05a7956b3 ("rbd: delay determination of default pool name")
broke "rbd perf image iostat" and "rbd perf image iotop" GLOBAL_POOL_KEY
support (the ability to blend all rbd pools together into a single
view).
This check was added in commit ecd3778a6f9a ("rbd-mirror: ensure that
the last non-primary snapshot cannot be pruned") as an additional
safeguard against pruning an incomplete non-primary snapshot in case
there is no predecessor mirror snapshot. However, it still fires if the
predecessor is there but happens to be a primary demotion snapshot.
A bogus "incomplete local non-primary snapshot" error is reported and
the replayer gets stuck.
Remove completed_non_primary_snapshots_exist tracking as the presence
of the predecessor in the incomplete non-primary snapshot pruning arm
is already ensured by "m_local_snap_id_start > 0" condition.
Incomplete non-primary snapshot handling is bifurcated depending
on whether any data objects have been copied. If no data objects
have been copied, an incomplete non-primary snapshot is assumed to
be malformed and gets pruned; the sync is restarted from scratch.
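A hedged Python sketch of the bifurcation described above, not rbd-mirror's actual C++ state machine:
```python
def handle_incomplete_non_primary_snapshot(objects_copied: int) -> str:
    if objects_copied == 0:
        # Nothing has been synced yet: the incomplete snapshot is assumed
        # malformed, so prune it and restart the sync from scratch.
        return "prune snapshot, restart sync from scratch"
    # Data objects were already copied: keep the snapshot and resume the
    # interrupted delta sync from the preserved predecessor.
    return "resume interrupted sync"
```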
7. Checking the number of PG log entries and dups
```
rzarz@ubulap:~/dev/ceph/build$ for pgid in `cat osd0_pgs.txt`; do echo $pgid; bin/ceph-objectstore-tool --data-path dev/osd0 --op log --pgid $pgid | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'; done
2.7
10020
2500
2.6
10100
3000
2.3
10012
2800
2.1
10049
2900
2.2
10057
2700
2.0
10027
2900
2.5
10077
2700
2.4
10072
2900
1.0
97
0
```
Conflicts:
src/tools/ceph_objectstore_tool.cc -- undetected conflict
with d5445b8f113797718a0dbb05e884a6bffbfed76a. Fixed by
adapting the patch to not require `unique_ptr<T>::get()`.
wanwencong [Fri, 24 Jun 2022 15:54:52 +0000 (23:54 +0800)]
rbd-fuse: librados will filter out -r option from command-line
The -r option is filtered out by librados when executing
"rbd-fuse /mountpoint -p pool_name -r rbd_name", so images other
than the requested one can still be seen under the mount point.
Fixes: https://tracker.ceph.com/issues/56387
Signed-off-by: wanwencong <wanwc@chinatelecom.cn>
(cherry picked from commit e99d64bc8a5c3bbb8a3632f211d4f56751cf499e)
Ilya Dryomov [Sun, 26 Jun 2022 11:05:09 +0000 (13:05 +0200)]
librbd: update progress for non-existent objects on deep-copy
As a side effect of commit e5a21e904142 ("librbd: deep-copy image copy
state machine skips clean objects"), handle_object_copy() stopped being
called for non-existent objects. This broke progress_object_no logic,
which expects to "see" all object numbers so that update_progress()
callback invocations can be ordered. Currently update_progress() based
progress reporting gets stuck after encountering a hole in the image.
To fix, arrange for handle_object_copy() to be called for all object
numbers, even if ObjectCopyRequest isn't created. Defer the extra call
to the image work queue to avoid locking issues.
Conflicts:
src/librbd/deep_copy/ImageCopyRequest.cc [ commit aabfb76e51bf
("librbd: swapped ThreadPool/ContextWQ for AsioEngine") not in
octopus ]
src/test/librbd/deep_copy/test_mock_ImageCopyRequest.cc [
commit 235b27a8f08a ("librbd/deep_copy: skip snap list if
object is known to be clean") not in octopus ]
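A small Python sketch of the ordering idea described above (illustrative only; the real fix lives in the C++ image copy state machine): every object number must be observed, even for non-existent objects, so that update_progress() calls can be emitted in order.
```python
class ProgressTracker:
    def __init__(self, update_progress):
        self._update_progress = update_progress
        self._next_object_no = 0
        self._seen = set()

    def handle_object_copy(self, object_no: int) -> None:
        # Called for every object number, including holes in the image for
        # which no copy request was created.
        self._seen.add(object_no)
        while self._next_object_no in self._seen:
            self._seen.remove(self._next_object_no)
            self._update_progress(self._next_object_no)
            self._next_object_no += 1
```
With this, a hole (skipped object) no longer stalls the chain of progress callbacks.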
Ilya Dryomov [Sat, 18 Jun 2022 13:25:49 +0000 (15:25 +0200)]
rbd-mirror: spell out "remote image is not primary" status correctly
There is a difference: non-primary means NON_PRIMARY promotion state,
while "not primary" can refer to any of NON_PRIMARY, ORPHAN or UNKNOWN
promotion states.
Ilya Dryomov [Sat, 18 Jun 2022 11:00:34 +0000 (13:00 +0200)]
rbd-mirror: fix up PrepareReplayDisconnected test case
It was botched in commit 2bca9ee96c65 ("rbd-mirror: consolidate
prepare local/remote image steps to bootstrap") and went unnoticed
because currently no special handling is needed for disconnected
clients -- is_disconnected() check happens to be the last step
and it doesn't generate an error.
Ilya Dryomov [Mon, 20 Jun 2022 12:19:41 +0000 (14:19 +0200)]
rbd-mirror: generally skip replay/resync if remote image is not primary
Replay and resync should generally be skipped if the remote image is
not primary.
If this is not done for replay, snapshot-based mirroring can run into
a livelock if the primary image is demoted while a mirror snapshot is
being synced. On the demote site, rbd-mirror would pick up the just
demoted image, grab the exclusive lock on it and idle waiting for a new
mirror snapshot to be created. On the (still) non-primary site,
rbd-mirror would eventually finish syncing that mirror snapshot and
attempt to unlink from it on the demote site. These attempts would
fail with EROFS due to exclusive lock being held in the "refuse proxied
maintenance operations" mode, blocking forward progress (syncing of the
demotion snapshot so that the non-primary image can be orderly promoted
to primary, etc).
If this is not done for resync, data loss can ensue as the just demoted
image would be immediately trashed, underneath the non-primary site that
is still syncing.
Currently this is done in PrepareReplayRequest only for journal-based
mirroring. Note that it is conditional: if the local image is linked
to the remote image, proceeding is desirable.
Generalize this check, consolidate it with a related check in
PrepareRemoteImageRequest and move the result to BootstrapRequest to
cover both "local image does not exist" and "local image is unlinked"
cases for both modes.
Ilya Dryomov [Sat, 18 Jun 2022 10:35:51 +0000 (12:35 +0200)]
rbd-mirror: strengthen is_local_primary() and is_linked()
Initialize local_promotion_state and remote_promotion_state to UNKNOWN
instead of counterintuitive PRIMARY and NON_PRIMARY -- half the time the
final values are flipped. Then is_local_primary() and is_linked() can
be strengthened as a non-existent image should stay in UNKNOWN.
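A Python sketch of the strengthened checks; the names and the simplified is_linked() logic are illustrative, not the actual rbd-mirror C++ bootstrap code:
```python
from enum import Enum, auto

class PromotionState(Enum):
    UNKNOWN = auto()
    PRIMARY = auto()
    NON_PRIMARY = auto()
    ORPHAN = auto()

class BootstrapProbe:
    def __init__(self):
        # Start from UNKNOWN: a non-existent image never gets probed, so it
        # stays UNKNOWN instead of defaulting to a misleading state.
        self.local_promotion_state = PromotionState.UNKNOWN
        self.remote_promotion_state = PromotionState.UNKNOWN

    def is_local_primary(self) -> bool:
        return self.local_promotion_state is PromotionState.PRIMARY

    def is_linked(self) -> bool:
        # Simplified: the real check also verifies that the local image is
        # actually linked to the remote peer.
        return self.local_promotion_state is PromotionState.NON_PRIMARY
```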
Ilya Dryomov [Sun, 19 Jun 2022 10:12:01 +0000 (12:12 +0200)]
mgr/rbd_support: always rescan image mirror snapshots on refresh
Establishing a watch on rbd_mirroring object and skipping rescanning
image mirror snapshots on periodic refresh unless rbd_mirroring object
gets notified in the interim is flawed. rbd_mirroring object is
notified when mirroring is enabled or disabled on some image (including
when the image is removed), but it is not notified when images are
promoted or demoted. However, load_pool_images() discards images that
are not primary at the time of the scan. If the image is promoted
later, no snapshots are created even if the schedule is in place. This
happens regardless of whether the schedule is added before or after the
promotion.
This effectively reverts commit 69259c8d3722 ("mgr/rbd_support: make
mirror_snapshot_schedule rescan only updated pools"). An alternative
fix could be to stop discarding non-primary images (i.e. drop
    if not info['primary']:
        continue
check added in commit d39eb283c5ce ("mgr/rbd_support: mirror snapshot
schedule should skip non-primary images")), but that would clutter the
queue and therefore "rbd mirror snapshot schedule status" output with
bogus entries. Performing a rescan roughly every 60 seconds should be
manageable: currently it amounts to a single mirror_image_status_list
request, followed by mirror_image_get, get_snapcontext and snapshot_get
requests for each snapshot-based mirroring enabled image and concluded
by a single dir_list request. Among these, per-image get_snapcontext
and snapshot_get requests are necessary for determining primaryness.
Ilya Dryomov [Fri, 17 Jun 2022 12:03:20 +0000 (14:03 +0200)]
mgr/rbd_support: avoid losing a schedule on load vs add race
If load_schedules() (i.e. periodic refresh) races with add_schedule()
invoked by the user for a fresh image, that image's schedule may get
lost until the next rebuild (not refresh!) of the queue:
1. periodic refresh invokes load_schedules()
2. load_schedules() creates a new Schedules instance and loads
schedules from rbd_mirror_snapshot_schedule object
3. add_schedule() is invoked for a new image (an image that isn't
present in self.images) by the user
4. before load_schedules() can grab self.lock, add_schedule() commits
the new schedule to rbd_mirror_snapshot_schedule object and adds it
to self.schedules
5. load_schedules() grabs self.lock and reassigns self.schedules with
Schedules instance that is now stale
6. periodic refresh invokes load_pool_images() which discovers the new
image; eventually it is added to self.images
7. periodic refresh invokes refresh_queue() which attempts to enqueue()
the new image; this fails because a matching schedule isn't present
The next periodic refresh recovers the discarded schedule from
rbd_mirror_snapshot_schedule object but no attempt to enqueue() that
image is made since it is already "known" at that point. Despite the
schedule being in place, no snapshots are created until the queue is
rebuilt from scratch or rbd_support module is reloaded.
To fix that, extend self.lock critical sections so that add_schedule()
and remove_schedule() can't get stepped on by load_schedules().
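A hedged Python sketch of the locking pattern described above; the omap helpers are hypothetical, not the actual rbd_support module API:
```python
import threading

class MirrorSnapshotScheduleHandler:
    def __init__(self):
        self.lock = threading.Lock()
        self.schedules = {}

    def load_schedules(self):
        # Hold self.lock across the load *and* the reassignment so that a
        # concurrent add_schedule()/remove_schedule() cannot commit a change
        # that the freshly loaded schedules would silently discard.
        with self.lock:
            self.schedules = self._load_from_omap()   # hypothetical helper

    def add_schedule(self, image: str, interval: str):
        with self.lock:
            self._store_to_omap(image, interval)      # hypothetical helper
            self.schedules[image] = interval

    def _load_from_omap(self) -> dict:
        return dict(self.schedules)   # stand-in for reading the omap object

    def _store_to_omap(self, image: str, interval: str) -> None:
        pass                          # stand-in for writing the omap object
```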
Ilya Dryomov [Fri, 17 Jun 2022 08:28:55 +0000 (10:28 +0200)]
mgr/rbd_support: refresh schedule queue immediately after delay elapses
The existing logic often leads to refresh_pools() and refresh_images()
being invoked after a 120 second delay instead of after an intended 60
second delay.
Sage Weil [Wed, 17 Feb 2021 16:28:05 +0000 (10:28 -0600)]
mgr/cephadm: make drain adjust crush weight if not replacing
If we are replacing an OSD, we should mark it out and then back in
again when a new device shows up. However, if we are going to
destroy an OSD, we should just weight it to 0 in crush, so that data
doesn't move again once the OSD is purged.
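A hedged Python sketch of the decision; the CLI commands shown are the usual Ceph equivalents, not cephadm's internal calls:
```python
def drain_commands(osd_id: int, replace: bool) -> list:
    if replace:
        # A replacement device will take over, so mark the OSD out and let
        # it be marked back in when the new device shows up.
        return ["ceph osd out {0}".format(osd_id)]
    # The OSD is being destroyed for good: zero its CRUSH weight so data
    # does not move a second time once the OSD is purged.
    return ["ceph osd crush reweight osd.{0} 0".format(osd_id)]
```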
Although the chunking in off-line `dups` trimming (via COT) seems
fine, the `ceph-objectstore-tool` is a client of `trim()` of
`PGLog::IndexedLog`, which means that a partial revert is not
possible without extensive changes.
The backport ticket is: https://tracker.ceph.com/issues/55990
Revert "osd/PGLog.cc: Trim duplicates by number of entries"
This reverts commit 7cc1b29f2b7b7feee127f1dbbef947799e56f38b,
which is the in-OSD part of the fix for the accumulation of `dup`
entries in a PG log. Brainstorming has raised questions about the
OSD's behaviour during an upgrade if there are tons of dups in
the log. Before bringing it back, we must double-check that the
deletions are chunked properly so as not to cause OOMs or stalls
in, for example, RocksDB.
The backport ticket is: https://tracker.ceph.com/issues/55990
We really want the ability to know how many entries
`PGLog::IndexedLog::dups` contains. The current ways are either
invasive (stopping an OSD) or indirect (examining
`dump_mempools`).
Ilya Dryomov [Sun, 29 May 2022 16:20:34 +0000 (18:20 +0200)]
librbd: unlink newest mirror snapshot when at capacity, bump capacity
CreatePrimaryRequest::unlink_peer() invoked via "rbd mirror image
snapshot" command or via rbd_support mgr module when creating a new
scheduled mirror snapshot at rbd_mirroring_max_mirroring_snapshots
capacity on the primary cluster can race with Replayer::unlink_peer()
invoked by rbd-mirror when finishing syncing an older snapshot on the
secondary cluster. Consider the following:
0. rbd-mirror is syncing snap1..snap2 delta
1. rbd_support creates primary-snap4
2. due to rbd_mirroring_max_mirroring_snapshots == 3, rbd_support picks
primary-snap3 for unlinking
3. rbd-mirror finishes syncing snap1..snap2 delta and marks
non-primary-snap2 complete
[ snap1 (the old base) is no longer needed on either cluster ]
4. rbd-mirror unlinks and removes primary-snap1
5. rbd-mirror removes non-primary-snap1
6. rbd-mirror picks snap2 as the new base
7. rbd-mirror creates non-primary-snap3 and starts syncing snap2..snap3
delta
8. rbd_support unlinks and removes primary-snap3 which is in-use by
rbd-mirror
If snap trimming on the primary cluster kicks in soon enough, the
secondary image becomes corrupted: rbd-mirror would eventually finish
"syncing" non-primary-snap3 and mark it complete in spite of bogus data
in the HEAD -- the primary cluster OSDs would start returning ENOENT
for snap trimmed objects. Luckily, rbd-mirror's attempt to pick snap3
as the new base would wedge the replayer with "split-brain detected:
failed to find matching non-primary snapshot in remote image" error.
Before commit a888bff8d00e ("librbd/mirror: tweak which snapshot is
unlinked when at capacity") this could happen pretty much all the time
as it was the second oldest snapshot that was unlinked. This commit
changed it to be the third oldest snapshot, turning this into a more
narrow but still very much possible to hit race.
Unfortunately this race condition appears to be inherent to the way
snapshot-based mirroring is currently implemented:
a. when mirror snapshots are created on the producer side of the
snapshot queue, they are already linked
b. mirror snapshots can be concurrently unlinked/removed on both
sides of the snapshot queue by non-cooperating clients (local
rbd_mirror_image_create_snapshot() vs remote rbd-mirror)
c. with mirror peer links off the list due to (a), there is no
existing way for rbd-mirror to persistently mark a snapshot as
in-use
As a workaround, bump rbd_mirroring_max_mirroring_snapshots to 5 and
always unlink the newest snapshot (i.e. slot 4) instead of the third
oldest snapshot (i.e. slot 2). Hopefully this gives enough leeway,
as rbd-mirror would need to sync two snapshots (i.e. transition from
syncing 0-1 to 1-2 and then to 2-3) before potentially colliding with
rbd_mirror_image_create_snapshot() on slot 4.
Ilya Dryomov [Sun, 29 May 2022 17:55:04 +0000 (19:55 +0200)]
test/librbd: fix set_val() call in SuccessUnlink* test cases
rbd_mirroring_max_mirroring_snapshots isn't actually set to 3 there
due to the stray conf_ prefix. It didn't matter until now because the
default was also 3.
Ilya Dryomov [Sat, 28 May 2022 18:06:22 +0000 (20:06 +0200)]
rbd-mirror: don't prune non-primary snapshot when restarting delta sync
When restarting interrupted sync (signified by the "end" non-primary
snapshot with last_copied_object_number > 0), preserve the "start"
non-primary snapshot until the sync is completed, like it would have
been done had the sync not been interrupted. This ensures that the
same m_local_snap_id_start is passed to scan_remote_mirror_snapshots()
and ultimately ImageCopyRequest state machine on restart as on initial
start.
This ends up being yet another fixup for 281af0de86b1 ("rbd-mirror:
prune unnecessary non-primary mirror snapshots"), following earlier
7ba9214ea5b7 ("rbd-mirror: don't prune older mirror snapshots when
pruning incomplete snapshot") and ecd3778a6f9a ("rbd-mirror: ensure
that the last non-primary snapshot cannot be pruned").
Zac Dover [Mon, 30 May 2022 13:32:06 +0000 (23:32 +1000)]
doc/start: update "memory" in hardware-recs.rst
This PR corrects some usage errors in the "Memory" section
of the hardware-recommendations.rst file. It also closes some
parentheses that were opened but never closed.