7. Checking the number of pg log entries and dups
```
rzarz@ubulap:~/dev/ceph/build$ for pgid in `cat osd0_pgs.txt`; do echo $pgid; bin/ceph-objectstore-tool --data-path dev/osd0 --op log --pgid $pgid | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'; done
2.7
10020
2500
2.6
10100
3000
2.3
10012
2800
2.1
10049
2900
2.2
10057
2700
2.0
10027
2900
2.5
10077
2700
2.4
10072
2900
1.0
97
0
```
Conflicts:
src/tools/ceph_objectstore_tool.cc -- undetected conflict
with d5445b8f113797718a0dbb05e884a6bffbfed76a. Fixed by
adapting the patch so it does not require `unique_ptr<T>::get()`.
wanwencong [Fri, 24 Jun 2022 15:54:52 +0000 (23:54 +0800)]
rbd-fuse: librados will filter out -r option from command-line
The -r option is filtered out by librados, so when running
"rbd-fuse /mountpoint -p pool_name -r rbd_name",
other rbds can still be seen under the mount point.
Fixes: https://tracker.ceph.com/issues/56387
Signed-off-by: wanwencong <wanwc@chinatelecom.cn>
(cherry picked from commit e99d64bc8a5c3bbb8a3632f211d4f56751cf499e)
Ilya Dryomov [Sun, 26 Jun 2022 11:05:09 +0000 (13:05 +0200)]
librbd: update progress for non-existent objects on deep-copy
As a side effect of commit e5a21e904142 ("librbd: deep-copy image copy
state machine skips clean objects"), handle_object_copy() stopped being
called for non-existent objects. This broke progress_object_no logic,
which expects to "see" all object numbers so that update_progress()
callback invocations can be ordered. Currently update_progress() based
progress reporting gets stuck after encountering a hole in the image.
To fix, arrange for handle_object_copy() to be called for all object
numbers, even if ObjectCopyRequest isn't created. Defer the extra call
to the image work queue to avoid locking issues.
Conflicts:
src/librbd/deep_copy/ImageCopyRequest.cc [ commit aabfb76e51bf
("librbd: swapped ThreadPool/ContextWQ for AsioEngine") not in
octopus ]
src/test/librbd/deep_copy/test_mock_ImageCopyRequest.cc [
commit 235b27a8f08a ("librbd/deep_copy: skip snap list if
object is known to be clean") not in octopus ]
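To illustrate the progress_object_no ordering behind the deep-copy fix above, here is a minimal Python sketch (the actual code is C++ in src/librbd/deep_copy; the class and callback names below are made up for the example): progress for object N can only be reported once every object number below N has been handled, so skipping handle_object_copy() for non-existent objects leaves a permanent hole and stalls reporting.
```
class OrderedProgress:
    def __init__(self, total_objects, update_progress):
        self.total_objects = total_objects
        self.update_progress = update_progress  # callback(objects_done, total)
        self.handled = set()        # object numbers seen but not yet reported
        self.next_to_report = 0     # lowest object number not yet reported

    def handle_object_copy(self, object_no):
        # Record that this object number was handled (copied, clean or
        # non-existent alike) and report any contiguous prefix.
        self.handled.add(object_no)
        while self.next_to_report in self.handled:
            self.handled.remove(self.next_to_report)
            self.next_to_report += 1
            self.update_progress(self.next_to_report, self.total_objects)


if __name__ == "__main__":
    prog = OrderedProgress(4, lambda done, total: print(f"{done}/{total}"))
    # Objects 1 and 3 are holes (non-existent); with the fix they are still
    # "handled", so progress does not get stuck after object 0.
    for object_no in (0, 2, 1, 3):
        prog.handle_object_copy(object_no)
```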
Ilya Dryomov [Sat, 18 Jun 2022 13:25:49 +0000 (15:25 +0200)]
rbd-mirror: spell out "remote image is not primary" status correctly
There is a difference: non-primary means NON_PRIMARY promotion state,
while "not primary" can refer to any of NON_PRIMARY, ORPHAN or UNKNOWN
promotion states.
Ilya Dryomov [Sat, 18 Jun 2022 11:00:34 +0000 (13:00 +0200)]
rbd-mirror: fix up PrepareReplayDisconnected test case
It was botched in commit 2bca9ee96c65 ("rbd-mirror: consolidate
prepare local/remote image steps to bootstrap") and went unnoticed
because currently no special handling is needed for disconnected
clients -- is_disconnected() check happens to be the last step
and it doesn't generate an error.
Ilya Dryomov [Mon, 20 Jun 2022 12:19:41 +0000 (14:19 +0200)]
rbd-mirror: generally skip replay/resync if remote image is not primary
Replay and resync should generally be skipped if the remote image is
not primary.
If this is not done for replay, snapshot-based mirroring can run into
a livelock if the primary image is demoted while a mirror snapshot is
being synced. On the demote site, rbd-mirror would pick up the just
demoted image, grab the exclusive lock on it and idle waiting for a new
mirror snapshot to be created. On the (still) non-primary site,
rbd-mirror would eventually finish syncing that mirror snapshot and
attempt to unlink from it on the demote site. These attempts would
fail with EROFS due to exclusive lock being held in the "refuse proxied
maintenance operations" mode, blocking forward progress (syncing of the
demotion snapshot so that the non-primary image can be orderly promoted
to primary, etc).
If this is not done for resync, data loss can ensue as the just demoted
image would be immediately trashed, underneath the non-primary site that
is still syncing.
Currently this is done in PrepareReplayRequest only for journal-based
mirroring. Note that it is conditional: if the local image is linked
to the remote image, proceeding is desirable.
Generalize this check, consolidate it with a related check in
PrepareRemoteImageRequest and move the result to BootstrapRequest to
cover both "local image does not exist" and "local image is unlinked"
cases for both modes.
Ilya Dryomov [Sat, 18 Jun 2022 10:35:51 +0000 (12:35 +0200)]
rbd-mirror: strengthen is_local_primary() and is_linked()
Initialize local_promotion_state and remote_promotion_state to UNKNOWN
instead of counterintuitive PRIMARY and NON_PRIMARY -- half the time the
final values are flipped. Then is_local_primary() and is_linked() can
be strengthened as a non-existent image should stay in UNKNOWN.
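A minimal sketch of that state logic, in Python for brevity (the real code is C++ in rbd-mirror; the identifiers below only mirror the commit message and are not the exact ones used in the source):
```
from enum import Enum, auto

class PromotionState(Enum):
    UNKNOWN = auto()
    PRIMARY = auto()
    NON_PRIMARY = auto()
    ORPHAN = auto()

class BootstrapState:
    def __init__(self):
        # Start from UNKNOWN instead of PRIMARY/NON_PRIMARY; a non-existent
        # image never gets a real promotion state, so it now satisfies
        # neither predicate below.
        self.local_promotion_state = PromotionState.UNKNOWN
        self.remote_promotion_state = PromotionState.UNKNOWN

    def is_local_primary(self):
        return self.local_promotion_state == PromotionState.PRIMARY

    def is_linked(self):
        # Simplified: the real check also verifies that the local image is
        # actually linked to the remote image.
        return self.local_promotion_state == PromotionState.NON_PRIMARY
```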
Ilya Dryomov [Sun, 19 Jun 2022 10:12:01 +0000 (12:12 +0200)]
mgr/rbd_support: always rescan image mirror snapshots on refresh
Establishing a watch on rbd_mirroring object and skipping rescanning
image mirror snapshots on periodic refresh unless rbd_mirroring object
gets notified in the interim is flawed. rbd_mirroring object is
notified when mirroring is enabled or disabled on some image (including
when the image is removed), but it is not notified when images are
promoted or demoted. However, load_pool_images() discards images that
are not primary at the time of the scan. If the image is promoted
later, no snapshots are created even if the schedule is in place. This
happens regardless of whether the schedule is added before or after the
promotion.
This effectively reverts commit 69259c8d3722 ("mgr/rbd_support: make
mirror_snapshot_schedule rescan only updated pools"). An alternative
fix could be to stop discarding non-primary images (i.e. drop the
`if not info['primary']: continue`
check added in commit d39eb283c5ce ("mgr/rbd_support: mirror snapshot
schedule should skip non-primary images")), but that would clutter the
queue and therefore "rbd mirror snapshot schedule status" output with
bogus entries. Performing a rescan roughly every 60 seconds should be
manageable: currently it amounts to a single mirror_image_status_list
request, followed by mirror_image_get, get_snapcontext and snapshot_get
requests for each snapshot-based mirroring enabled image and concluded
by a single dir_list request. Among these, per-image get_snapcontext
and snapshot_get requests are necessary for determining primaryness.
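As a rough Python sketch of the rescan described above (load_pool_images() and info['primary'] are taken from the commit message; the other key names are assumptions for the example), every periodic refresh rebuilds the set of schedulable images from a fresh scan while keeping the skip-non-primary filter, so an image promoted after the previous scan is picked up within roughly one refresh interval:
```
def load_pool_images(mirror_image_infos):
    """Return {image_id: name} for mirroring-enabled images that are primary."""
    images = {}
    for image_id, info in mirror_image_infos.items():
        if not info['primary']:
            # Non-primary images are skipped here; that is exactly why they
            # have to be re-examined on every refresh -- a later promotion
            # does not notify the rbd_mirroring object.
            continue
        images[image_id] = info['name']
    return images
```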
Ilya Dryomov [Fri, 17 Jun 2022 12:03:20 +0000 (14:03 +0200)]
mgr/rbd_support: avoid losing a schedule on load vs add race
If load_schedules() (i.e. periodic refresh) races with add_schedule()
invoked by the user for a fresh image, that image's schedule may get
lost until the next rebuild (not refresh!) of the queue:
1. periodic refresh invokes load_schedules()
2. load_schedules() creates a new Schedules instance and loads
schedules from rbd_mirror_snapshot_schedule object
3. add_schedule() is invoked for a new image (an image that isn't
present in self.images) by the user
4. before load_schedules() can grab self.lock, add_schedule() commits
the new schedule to rbd_mirror_snapshot_schedule object and adds it
to self.schedules
5. load_schedules() grabs self.lock and reassigns self.schedules with
Schedules instance that is now stale
6. periodic refresh invokes load_pool_images() which discovers the new
image; eventually it is added to self.images
7. periodic refresh invokes refresh_queue() which attempts to enqueue()
the new image; this fails because a matching schedule isn't present
The next periodic refresh recovers the discarded schedule from
rbd_mirror_snapshot_schedule object but no attempt to enqueue() that
image is made since it is already "known" at that point. Despite the
schedule being in place, no snapshots are created until the queue is
rebuilt from scratch or rbd_support module is reloaded.
To fix that, extend self.lock critical sections so that add_schedule()
and remove_schedule() can't get stepped on by load_schedules().
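A minimal sketch of the locking fix, with hypothetical omap helpers standing in for the module's RADOS I/O: load_schedules() builds and installs the new Schedules state entirely under self.lock, so a concurrent add_schedule() either runs before the reload (and its schedule is re-read from the object) or after it (and its schedule lands in the new state), and can no longer be lost.
```
import threading

class MirrorSnapshotScheduleHandler:
    def __init__(self):
        self.lock = threading.Lock()
        self.schedules = {}          # image_id -> schedule

    def load_schedules(self):
        with self.lock:
            # Read from the rbd_mirror_snapshot_schedule object while holding
            # the lock; add_schedule()/remove_schedule() cannot interleave.
            self.schedules = self._read_schedules_from_omap()

    def add_schedule(self, image_id, schedule):
        with self.lock:
            self._write_schedule_to_omap(image_id, schedule)
            self.schedules[image_id] = schedule

    # Placeholders for the RADOS omap I/O the real module performs.
    def _read_schedules_from_omap(self):
        return dict(self.schedules)

    def _write_schedule_to_omap(self, image_id, schedule):
        pass
```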
Ilya Dryomov [Fri, 17 Jun 2022 08:28:55 +0000 (10:28 +0200)]
mgr/rbd_support: refresh schedule queue immediately after delay elapses
The existing logic often leads to refresh_pools() and refresh_images()
being invoked after a 120 second delay instead of after an intended 60
second delay.
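One way the described behaviour can arise, sketched in Python with illustrative names (this is not the module's actual loop): if the worker only checks whether 60 seconds have elapsed at the top of each fixed-length sleep, a refresh that just misses the cutoff waits almost two intervals; capping the sleep by the time remaining until the next refresh keeps the effective delay close to 60 seconds.
```
import threading
import time

REFRESH_DELAY = 60.0  # intended refresh interval, in seconds

def run(refresh_pools_and_images, next_schedule_wait, stop_event):
    """Worker loop: refresh as soon as REFRESH_DELAY has elapsed."""
    last_refresh = time.monotonic()
    while not stop_event.is_set():
        now = time.monotonic()
        if now - last_refresh >= REFRESH_DELAY:
            refresh_pools_and_images()
            last_refresh = now
        # Wake up at the earlier of "next queued snapshot is due" and
        # "next refresh is due", instead of sleeping a fixed interval that
        # can push the refresh out to ~120 seconds.
        timeout = min(next_schedule_wait(), last_refresh + REFRESH_DELAY - now)
        stop_event.wait(max(timeout, 0.0))
```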
Sage Weil [Wed, 17 Feb 2021 16:28:05 +0000 (10:28 -0600)]
mgr/cephadm: make drain adjust crush weight if not replacing
If we are replacing an OSD, we should mark it out and then back in
again when a new device shows up. However, if we are going to
destroy an OSD, we should just weight it to 0 in crush, so that data
doesn't move again once the OSD is purged.
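A minimal Python sketch of that behaviour, with a hypothetical mon_command helper standing in for the cephadm module's command plumbing:
```
def drain_osd(osd_id, replace, mon_command):
    """Start draining an OSD, as described above."""
    if replace:
        # The OSD will be replaced: mark it out so data moves away now and
        # can move back once the replacement device shows up.
        mon_command({'prefix': 'osd out', 'ids': [str(osd_id)]})
    else:
        # The OSD will be destroyed: zero its crush weight so data does not
        # have to move a second time when the OSD is eventually purged.
        mon_command({'prefix': 'osd crush reweight',
                     'name': f'osd.{osd_id}', 'weight': 0.0})
```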
Although the chunking in off-line `dups` trimming (via COT) seems
fine, the `ceph-objectstore-tool` is a client of `trim()` of
`PGLog::IndexedLog` which means that a partial revert is not
possible without extensive changes.
The backport ticket is: https://tracker.ceph.com/issues/55990
Revert "osd/PGLog.cc: Trim duplicates by number of entries"
This reverts commit 7cc1b29f2b7b7feee127f1dbbef947799e56f38b,
which is the in-OSD part of the fix for accumulation of `dup`
entries in a PG Log. Brainstorming it has raised questions
about the OSD's behaviour during an upgrade if there are tons of
dups in the log. What must be double-checked before bringing
it back is ensuring we chunk the deletions properly so as not
to cause OOMs / stalls in, for example, RocksDB.
The backport ticket is: https://tracker.ceph.com/issues/55990
We really want to have the ability to know how many
entries `PGLog::IndexedLog::dups` has inside.
The current ways are either invasive (stopping an OSD)
or indirect (examination of `dump_mempools`).
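The indirect route looks roughly like this (a sketch; the exact JSON layout of dump_mempools may differ between releases, and it only shows the size of the whole pglog mempool, not a per-PG dups count):
```
import json
import subprocess

def pglog_mempool(osd_id):
    """Query the OSD admin socket and return the osd_pglog mempool stats."""
    out = subprocess.check_output(
        ['ceph', 'daemon', f'osd.{osd_id}', 'dump_mempools'])
    pools = json.loads(out).get('mempool', {}).get('by_pool', {})
    return pools.get('osd_pglog', {})

if __name__ == '__main__':
    # Something like {'items': ..., 'bytes': ...}; a steadily growing value
    # hints at dups accumulation but gives no exact entry count.
    print(pglog_mempool(0))
```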
Ilya Dryomov [Sun, 29 May 2022 16:20:34 +0000 (18:20 +0200)]
librbd: unlink newest mirror snapshot when at capacity, bump capacity
CreatePrimaryRequest::unlink_peer() invoked via "rbd mirror image
snapshot" command or via rbd_support mgr module when creating a new
scheduled mirror snapshot at rbd_mirroring_max_mirroring_snapshots
capacity on the primary cluster can race with Replayer::unlink_peer()
invoked by rbd-mirror when finishing syncing an older snapshot on the
secondary cluster. Consider the following:
0. rbd-mirror is syncing snap1..snap2 delta
1. rbd_support creates primary-snap4
2. due to rbd_mirroring_max_mirroring_snapshots == 3, rbd_support picks
primary-snap3 for unlinking
3. rbd-mirror finishes syncing snap1..snap2 delta and marks
non-primary-snap2 complete
[ snap1 (the old base) is no longer needed on either cluster ]
4. rbd-mirror unlinks and removes primary-snap1
5. rbd-mirror removes non-primary-snap1
6. rbd-mirror picks snap2 as the new base
7. rbd-mirror creates non-primary-snap3 and starts syncing snap2..snap3
delta
8. rbd_support unlinks and removes primary-snap3 which is in-use by
rbd-mirror
If snap trimming on the primary cluster kicks in soon enough, the
secondary image becomes corrupted: rbd-mirror would eventually finish
"syncing" non-primary-snap3 and mark it complete in spite of bogus data
in the HEAD -- the primary cluster OSDs would start returning ENOENT
for snap trimmed objects. Luckily, rbd-mirror's attempt to pick snap3
as the new base would wedge the replayer with "split-brain detected:
failed to find matching non-primary snapshot in remote image" error.
Before commit a888bff8d00e ("librbd/mirror: tweak which snapshot is
unlinked when at capacity") this could happen pretty much all the time
as it was the second oldest snapshot that was unlinked. This commit
changed it to be the third oldest snapshot, turning this into a more
narrow but still very much possible to hit race.
Unfortunately this race condition appears to be inherent to the way
snapshot-based mirroring is currently implemented:
a. when mirror snapshots are created on the producer side of the
snapshot queue, they are already linked
b. mirror snapshots can be concurrently unlinked/removed on both
sides of the snapshot queue by non-cooperating clients (local
rbd_mirror_image_create_snapshot() vs remote rbd-mirror)
c. with mirror peer links off the list due to (a), there is no
existing way for rbd-mirror to persistently mark a snapshot as
in-use
As a workaround, bump rbd_mirroring_max_mirroring_snapshots to 5 and
always unlink the newest snapshot (i.e. slot 4) instead of the third
oldest snapshot (i.e. slot 2). Hopefully this gives enough leeway,
as rbd-mirror would need to sync two snapshots (i.e. transition from
syncing 0-1 to 1-2 and then to 2-3) before potentially colliding with
rbd_mirror_image_create_snapshot() on slot 4.
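A small Python illustration of the new unlink policy (the real logic is C++ in CreatePrimaryRequest::unlink_peer(); the helper below is hypothetical):
```
def pick_unlink_candidate(mirror_snap_ids, max_snapshots=5):
    """Return the snapshot id to unlink before creating a new mirror snapshot.

    mirror_snap_ids is assumed to be ordered oldest -> newest.
    """
    if len(mirror_snap_ids) < max_snapshots:
        return None  # still under capacity, nothing to unlink
    # Unlink the newest snapshot (slot max_snapshots - 1): rbd-mirror must
    # sync through two older snapshots before it could possibly be using it,
    # which is the leeway the commit message relies on.
    return mirror_snap_ids[max_snapshots - 1]
```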
Ilya Dryomov [Sun, 29 May 2022 17:55:04 +0000 (19:55 +0200)]
test/librbd: fix set_val() call in SuccessUnlink* test cases
rbd_mirroring_max_mirroring_snapshots isn't actually set to 3 there
due to the stray conf_ prefix. It didn't matter until now because the
default was also 3.
Ilya Dryomov [Sat, 28 May 2022 18:06:22 +0000 (20:06 +0200)]
rbd-mirror: don't prune non-primary snapshot when restarting delta sync
When restarting interrupted sync (signified by the "end" non-primary
snapshot with last_copied_object_number > 0), preserve the "start"
non-primary snapshot until the sync is completed, like it would have
been done had the sync not been interrupted. This ensures that the
same m_local_snap_id_start is passed to scan_remote_mirror_snapshots()
and ultimately ImageCopyRequest state machine on restart as on initial
start.
This ends up being yet another fixup for 281af0de86b1 ("rbd-mirror:
prune unnecessary non-primary mirror snapshots"), following earlier
7ba9214ea5b7 ("rbd-mirror: don't prune older mirror snapshots when
pruning incomplete snapshot") and ecd3778a6f9a ("rbd-mirror: ensure
that the last non-primary snapshot cannot be pruned").
Zac Dover [Mon, 30 May 2022 13:32:06 +0000 (23:32 +1000)]
doc/start: update "memory" in hardware-recs.rst
This PR corrects some usage errors in the "Memory" section
of the hardware-recommendations.rst file. It also closes
some opened but never closed parentheses.
Laura Flores [Mon, 16 May 2022 22:59:42 +0000 (17:59 -0500)]
qa/suites/rados/thrash-erasure-code-big/thrashers: add `osd max backfills` setting to mapgap and pggrow
All `rados/thrash-erasure-code-big` tests that die due to the “wait_for_recovery” timeout have one thing in common: They contain either `thrashers/pggrow` or `thrashers/mapgap`.
The difference between pggrow and mapgap vs. all other non-offending thrashers (default, careful, fastread, and morepggrow) is that they lack an override setting for `osd max backfills`. `osd max backfills` is the max number of backfill operations allowed to/from an OSD. The higher the number, the quicker the recovery. By default, this value is 1. On all of the non-offending thrashers (default, careful, fastread, and morepggrow), the default 1 value gets overridden in their .yaml files with a value > 1. This is not the case for pggrow and mapgap, however, as they lack an `osd max backfills` override setting.
The mclock op scheduler is known to override `osd max backfills` with a high value, but all of the thrash-erasure-code-big thrashers have their op queue set to “debug_random”, which chooses randomly between op queues (the debug_random op queue is set to override the default mclock_scheduler in qa/config/rados.yaml). So, coupled with the “debug_random” op queue, the low `osd max backfills` setting is causing some tests to time out in recovery.
WITHOUT `osd max backfills`, as they are now, “mapgap” and “pggrow” tests die due to timed-out recovery about 17/100 times, as seen here with a pggrow test: http://pulpito.front.sepia.ceph.com/lflores-2022-05-18_14:24:29-rados:thrash-erasure-code-big-master-distro-default-smithi/
WITH `osd max backfills` specified, as I have suggested in this PR, 99/100 tests passed, with one test failing for a different reason:
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_22:40:27-rados:thrash-erasure-code-big-master-distro-default-smithi/
I also scheduled 145 tests WITH `osd max backfills` that are a mix of pggrow and mapgap thrashers. 144/145 tests passed, with one test failing for a different reason. http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_15:27:54-rados:thrash-erasure-code-big-master-distro-default-smithi/
Fixes: https://tracker.ceph.com/issues/51076
Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit 40062676c2ceed49b9fa147127ffa83ba6118e2a)
This PR links to the issue tracker by label, and not
by file. This method was proposed by Kefu Chai
in 40f9e1cee054bb568dfa3267982467c99b4ce5c5
on 05 Sep 2020 and was never before incorporated
into Octopus. This was noticed by Neha Ojha in
May 2020.
Signed-off-by: Zac Dover <zac.dover@gmail.com>
(squash) more link fixes
Signed-off-by: Zac Dover <zac.dover@gmail.com>
(squash) see last commit
Signed-off-by: Zac Dover <zac.dover@gmail.com>
(squash) fix broken link (again)
This corrects the error "Mismatch: both
interpreted text role prefix and reference suffix", which
appeared because I treated the link to an external URL as
though it were a :ref:-style link to a location inside the
Ceph documentation suite.
Signed-off-by: Zac Dover <zac.dover@gmail.com>
(squash) ibid - malformed external link
Prashant D [Fri, 29 Oct 2021 13:09:24 +0000 (09:09 -0400)]
osd/OSD: Log aggregated slow ops detail to cluster logs
Slow requests can overwhelm the cluster log with every slow op in
detail and also fill up the monitor db. Instead, log slow op
details in an aggregated format.
Fixes: https://tracker.ceph.com/issues/52424
Signed-off-by: Prashant D <pdhange@redhat.com>
(cherry picked from commit 9319dc9273b36dc4f4bef1261b3c57690336a8cc)
Conflicts:
src/common/options/osd.yaml.in
- Octopus doesn't have osd.yaml.in; added the option in src/common/options.cc
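For illustration, a hedged Python sketch of the aggregation idea behind the slow-ops change above (the actual change is C++ in osd/OSD.cc; the tuple format below is an assumption for the example):
```
from collections import Counter

def aggregate_slow_ops(slow_ops):
    """slow_ops: iterable of (op_type, age_seconds) tuples for ops past the
    slow-op threshold; returns one summary line instead of one line per op."""
    by_type = Counter(op_type for op_type, _ in slow_ops)
    oldest = max((age for _, age in slow_ops), default=0.0)
    detail = ', '.join(f'{count} x {op_type}'
                       for op_type, count in by_type.most_common())
    return (f'{sum(by_type.values())} slow requests '
            f'(oldest {oldest:.0f} secs): {detail}')
```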
Zac Dover [Wed, 18 May 2022 10:36:53 +0000 (20:36 +1000)]
doc/start: s/3/three/ in intro.rst
I'm changing "3" to "three" for two reasons:
1. It's correct.
2. This allows me to test backports into Octopus, Pacific, and Quincy.
I am particularly interested to see what happens when I attempt
the backport into Octopus, because backports into Octopus have
failed. This will provide me with another unit of data.
Cherry-pick notes:
- src/rgw/rgw_admin.cc: conflicts due to differences in op lists
- src/rgw/rgw_rados.h: conflicts due to changes to unrelated method signatures
- src/rgw/rgw_sal.cc: conflicts due to octopus missing a couple of methods from later releases
- src/rgw/rgw_sal.h: conflicts due to changes to unrelated method signatures
- src/rgw/rgw_rados.cc: use of ldpp_dout vs. ldout
NitzanMordhai [Mon, 21 Mar 2022 11:34:34 +0000 (11:34 +0000)]
osd/PGLog.cc: Trim duplicates by number of entries
PGLog needs to trim duplicates by the number of entries rather than the versions. That way, we prevent unbounded duplicate growth.
Adding duplicate entry trimming to the trim-pg-log operation: we use the existing
PGLog trim function to find out the set of entries/dup entries that we are supposed to
trim. To use it we need to build the PGLog from disk.
Conflicts:
src/osd/PGLog.h -- octopus does not have the commit 877798028386fbd833e8955cb89ce3f1ee47fbeb
which cleans the `std` namespace dependency
in headers.
src/tools/ceph_objectstore_tool.cc -- octopus lacks the commit 7d73fa6a309dca4c5381596c5e92813e2e11ed3b
which puts the buffer's error hierarchy on
`system_error`.
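A rough Python sketch of the trim-by-entries idea above (the real implementation extends PGLog::IndexedLog::trim() in C++; the function below is only an illustration):
```
from collections import deque

def trim_dups_to_max(dups, max_dup_entries):
    """Drop the oldest dup entries until at most max_dup_entries remain.

    dups is assumed to be ordered oldest -> newest; returns (kept, trimmed).
    """
    dups = deque(dups)
    trimmed = []
    while len(dups) > max_dup_entries:
        trimmed.append(dups.popleft())
    return list(dups), trimmed
```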