7. Checking the number of pg log entries and dups
```
rzarz@ubulap:~/dev/ceph/build$ for pgid in `cat osd0_pgs.txt`; do echo $pgid; bin/ceph-objectstore-tool --data-path dev/osd0 --op log --pgid $pgid | jq '(.pg_log_t.log|length),(.pg_log_t.dups|length)'; done
2.7
10020
2500
2.6
10100
3000
2.3
10012
2800
2.1
10049
2900
2.2
10057
2700
2.0
10027
2900
2.5
10077
2700
2.4
10072
2900
1.0
97
0
```
Conflicts:
src/tools/ceph_objectstore_tool.cc -- undetected conflict
with d5445b8f113797718a0dbb05e884a6bffbfed76a. Fixed by
adapting the patch so it does not require `unique_ptr<T>::get()`.
wanwencong [Fri, 24 Jun 2022 15:54:52 +0000 (23:54 +0800)]
rbd-fuse: librados will filter out -r option from command-line
The -r option is filtered out by librados, so when running
"rbd-fuse /mountpoint -p pool_name -r rbd_name",
other rbds can still be seen under the mount point.
Fixes: https://tracker.ceph.com/issues/56387
Signed-off-by: wanwencong <wanwc@chinatelecom.cn>
(cherry picked from commit e99d64bc8a5c3bbb8a3632f211d4f56751cf499e)
Ilya Dryomov [Sun, 26 Jun 2022 11:05:09 +0000 (13:05 +0200)]
librbd: update progress for non-existent objects on deep-copy
As a side effect of commit e5a21e904142 ("librbd: deep-copy image copy
state machine skips clean objects"), handle_object_copy() stopped being
called for non-existent objects. This broke progress_object_no logic,
which expects to "see" all object numbers so that update_progress()
callback invocations can be ordered. Currently update_progress() based
progress reporting gets stuck after encountering a hole in the image.
To fix, arrange for handle_object_copy() to be called for all object
numbers, even if ObjectCopyRequest isn't created. Defer the extra call
to the image work queue to avoid locking issues.
Conflicts:
src/librbd/deep_copy/ImageCopyRequest.cc [ commit aabfb76e51bf
("librbd: swapped ThreadPool/ContextWQ for AsioEngine") not in
octopus ]
src/test/librbd/deep_copy/test_mock_ImageCopyRequest.cc [
commit 235b27a8f08a ("librbd/deep_copy: skip snap list if
object is known to be clean") not in octopus ]
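To illustrate the progress_object_no ordering behind the deep-copy fix above, here is a minimal Python sketch (the actual code is C++ in src/librbd/deep_copy; the class and callback names below are made up for the example): progress for object N can only be reported once every object number below N has been handled, so skipping handle_object_copy() for non-existent objects leaves a permanent hole and stalls reporting.
```
class OrderedProgress:
    def __init__(self, total_objects, update_progress):
        self.total_objects = total_objects
        self.update_progress = update_progress  # callback(objects_done, total)
        self.handled = set()        # object numbers seen but not yet reported
        self.next_to_report = 0     # lowest object number not yet reported

    def handle_object_copy(self, object_no):
        # Record that this object number was handled (copied, clean or
        # non-existent alike) and report any contiguous prefix.
        self.handled.add(object_no)
        while self.next_to_report in self.handled:
            self.handled.remove(self.next_to_report)
            self.next_to_report += 1
            self.update_progress(self.next_to_report, self.total_objects)


if __name__ == "__main__":
    prog = OrderedProgress(4, lambda done, total: print(f"{done}/{total}"))
    # Objects 1 and 3 are holes (non-existent); with the fix they are still
    # "handled", so progress does not get stuck after object 0.
    for object_no in (0, 2, 1, 3):
        prog.handle_object_copy(object_no)
```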
Ilya Dryomov [Sat, 18 Jun 2022 13:25:49 +0000 (15:25 +0200)]
rbd-mirror: spell out "remote image is not primary" status correctly
There is a difference: non-primary means NON_PRIMARY promotion state,
while "not primary" can refer to any of NON_PRIMARY, ORPHAN or UNKNOWN
promotion states.
Ilya Dryomov [Sat, 18 Jun 2022 11:00:34 +0000 (13:00 +0200)]
rbd-mirror: fix up PrepareReplayDisconnected test case
It was botched in commit 2bca9ee96c65 ("rbd-mirror: consolidate
prepare local/remote image steps to bootstrap") and went unnoticed
because currently no special handling is needed for disconnected
clients -- is_disconnected() check happens to be the last step
and it doesn't generate an error.
Ilya Dryomov [Mon, 20 Jun 2022 12:19:41 +0000 (14:19 +0200)]
rbd-mirror: generally skip replay/resync if remote image is not primary
Replay and resync should generally be skipped if the remote image is
not primary.
If this is not done for replay, snapshot-based mirroring can run into
a livelock if the primary image is demoted while a mirror snapshot is
being synced. On the demote site, rbd-mirror would pick up the just
demoted image, grab the exclusive lock on it and idle waiting for a new
mirror snapshot to be created. On the (still) non-primary site,
rbd-mirror would eventually finish syncing that mirror snapshot and
attempt to unlink from it on the demote site. These attempts would
fail with EROFS due to exclusive lock being held in the "refuse proxied
maintenance operations" mode, blocking forward progress (syncing of the
demotion snapshot so that the non-primary image can be orderly promoted
to primary, etc).
If this is not done for resync, data loss can ensue as the just demoted
image would be immediately trashed, underneath the non-primary site that
is still syncing.
Currently this is done in PrepareReplayRequest only for journal-based
mirroring. Note that it is conditional: if the local image is linked
to the remote image, proceeding is desirable.
Generalize this check, consolidate it with a related check in
PrepareRemoteImageRequest and move the result to BootstrapRequest to
cover both "local image does not exist" and "local image is unlinked"
cases for both modes.
Ilya Dryomov [Sat, 18 Jun 2022 10:35:51 +0000 (12:35 +0200)]
rbd-mirror: strengthen is_local_primary() and is_linked()
Initialize local_promotion_state and remote_promotion_state to UNKNOWN
instead of counterintuitive PRIMARY and NON_PRIMARY -- half the time the
final values are flipped. Then is_local_primary() and is_linked() can
be strengthened as a non-existent image should stay in UNKNOWN.
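A minimal sketch of that state logic, in Python for brevity (the real code is C++ in rbd-mirror; the identifiers below only mirror the commit message and are not the exact ones used in the source):
```
from enum import Enum, auto

class PromotionState(Enum):
    UNKNOWN = auto()
    PRIMARY = auto()
    NON_PRIMARY = auto()
    ORPHAN = auto()

class BootstrapState:
    def __init__(self):
        # Start from UNKNOWN instead of PRIMARY/NON_PRIMARY; a non-existent
        # image never gets a real promotion state, so it now satisfies
        # neither predicate below.
        self.local_promotion_state = PromotionState.UNKNOWN
        self.remote_promotion_state = PromotionState.UNKNOWN

    def is_local_primary(self):
        return self.local_promotion_state == PromotionState.PRIMARY

    def is_linked(self):
        # Simplified: the real check also verifies that the local image is
        # actually linked to the remote image.
        return self.local_promotion_state == PromotionState.NON_PRIMARY
```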
Ilya Dryomov [Sun, 19 Jun 2022 10:12:01 +0000 (12:12 +0200)]
mgr/rbd_support: always rescan image mirror snapshots on refresh
Establishing a watch on rbd_mirroring object and skipping rescanning
image mirror snapshots on periodic refresh unless rbd_mirroring object
gets notified in the interim is flawed. rbd_mirroring object is
notified when mirroring is enabled or disabled on some image (including
when the image is removed), but it is not notified when images are
promoted or demoted. However, load_pool_images() discards images that
are not primary at the time of the scan. If the image is promoted
later, no snapshots are created even if the schedule is in place. This
happens regardless of whether the schedule is added before or after the
promotion.
This effectively reverts commit 69259c8d3722 ("mgr/rbd_support: make
mirror_snapshot_schedule rescan only updated pools"). An alternative
fix could be to stop discarding non-primary images (i.e. drop the
`if not info['primary']: continue`
check added in commit d39eb283c5ce ("mgr/rbd_support: mirror snapshot
schedule should skip non-primary images")), but that would clutter the
queue and therefore "rbd mirror snapshot schedule status" output with
bogus entries. Performing a rescan roughly every 60 seconds should be
manageable: currently it amounts to a single mirror_image_status_list
request, followed by mirror_image_get, get_snapcontext and snapshot_get
requests for each snapshot-based mirroring enabled image and concluded
by a single dir_list request. Among these, per-image get_snapcontext
and snapshot_get requests are necessary for determining primaryness.
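As a rough Python sketch of the rescan described above (load_pool_images() and info['primary'] are taken from the commit message; the other key names are assumptions for the example), every periodic refresh rebuilds the set of schedulable images from a fresh scan while keeping the skip-non-primary filter, so an image promoted after the previous scan is picked up within roughly one refresh interval:
```
def load_pool_images(mirror_image_infos):
    """Return {image_id: name} for mirroring-enabled images that are primary."""
    images = {}
    for image_id, info in mirror_image_infos.items():
        if not info['primary']:
            # Non-primary images are skipped here; that is exactly why they
            # have to be re-examined on every refresh -- a later promotion
            # does not notify the rbd_mirroring object.
            continue
        images[image_id] = info['name']
    return images
```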
Ilya Dryomov [Fri, 17 Jun 2022 12:03:20 +0000 (14:03 +0200)]
mgr/rbd_support: avoid losing a schedule on load vs add race
If load_schedules() (i.e. periodic refresh) races with add_schedule()
invoked by the user for a fresh image, that image's schedule may get
lost until the next rebuild (not refresh!) of the queue:
1. periodic refresh invokes load_schedules()
2. load_schedules() creates a new Schedules instance and loads
schedules from rbd_mirror_snapshot_schedule object
3. add_schedule() is invoked for a new image (an image that isn't
present in self.images) by the user
4. before load_schedules() can grab self.lock, add_schedule() commits
the new schedule to rbd_mirror_snapshot_schedule object and adds it
to self.schedules
5. load_schedules() grabs self.lock and reassigns self.schedules with
Schedules instance that is now stale
6. periodic refresh invokes load_pool_images() which discovers the new
image; eventually it is added to self.images
7. periodic refresh invokes refresh_queue() which attempts to enqueue()
the new image; this fails because a matching schedule isn't present
The next periodic refresh recovers the discarded schedule from
rbd_mirror_snapshot_schedule object but no attempt to enqueue() that
image is made since it is already "known" at that point. Despite the
schedule being in place, no snapshots are created until the queue is
rebuilt from scratch or rbd_support module is reloaded.
To fix that, extend self.lock critical sections so that add_schedule()
and remove_schedule() can't get stepped on by load_schedules().
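A minimal sketch of the locking fix, with hypothetical omap helpers standing in for the module's RADOS I/O: load_schedules() builds and installs the new Schedules state entirely under self.lock, so a concurrent add_schedule() either runs before the reload (and its schedule is re-read from the object) or after it (and its schedule lands in the new state), and can no longer be lost.
```
import threading

class MirrorSnapshotScheduleHandler:
    def __init__(self):
        self.lock = threading.Lock()
        self.schedules = {}          # image_id -> schedule

    def load_schedules(self):
        with self.lock:
            # Read from the rbd_mirror_snapshot_schedule object while holding
            # the lock; add_schedule()/remove_schedule() cannot interleave.
            self.schedules = self._read_schedules_from_omap()

    def add_schedule(self, image_id, schedule):
        with self.lock:
            self._write_schedule_to_omap(image_id, schedule)
            self.schedules[image_id] = schedule

    # Placeholders for the RADOS omap I/O the real module performs.
    def _read_schedules_from_omap(self):
        return dict(self.schedules)

    def _write_schedule_to_omap(self, image_id, schedule):
        pass
```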
Ilya Dryomov [Fri, 17 Jun 2022 08:28:55 +0000 (10:28 +0200)]
mgr/rbd_support: refresh schedule queue immediately after delay elapses
The existing logic often leads to refresh_pools() and refresh_images()
being invoked after a 120 second delay instead of after an intended 60
second delay.
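One way the described behaviour can arise, sketched in Python with illustrative names (this is not the module's actual loop): if the worker only checks whether 60 seconds have elapsed at the top of each fixed-length sleep, a refresh that just misses the cutoff waits almost two intervals; capping the sleep by the time remaining until the next refresh keeps the effective delay close to 60 seconds.
```
import threading
import time

REFRESH_DELAY = 60.0  # intended refresh interval, in seconds

def run(refresh_pools_and_images, next_schedule_wait, stop_event):
    """Worker loop: refresh as soon as REFRESH_DELAY has elapsed."""
    last_refresh = time.monotonic()
    while not stop_event.is_set():
        now = time.monotonic()
        if now - last_refresh >= REFRESH_DELAY:
            refresh_pools_and_images()
            last_refresh = now
        # Wake up at the earlier of "next queued snapshot is due" and
        # "next refresh is due", instead of sleeping a fixed interval that
        # can push the refresh out to ~120 seconds.
        timeout = min(next_schedule_wait(), last_refresh + REFRESH_DELAY - now)
        stop_event.wait(max(timeout, 0.0))
```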
Sage Weil [Wed, 17 Feb 2021 16:28:05 +0000 (10:28 -0600)]
mgr/cephadm: make drain adjust crush weight if not replacing
If we are replacing an OSD, we should mark it out and then back in
again when a new device shows up. However, if we are going to
destroy an OSD, we should just weight it to 0 in crush, so that data
doesn't move again once the OSD is purged.
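A minimal Python sketch of that behaviour, with a hypothetical mon_command helper standing in for the cephadm module's command plumbing:
```
def drain_osd(osd_id, replace, mon_command):
    """Start draining an OSD, as described above."""
    if replace:
        # The OSD will be replaced: mark it out so data moves away now and
        # can move back once the replacement device shows up.
        mon_command({'prefix': 'osd out', 'ids': [str(osd_id)]})
    else:
        # The OSD will be destroyed: zero its crush weight so data does not
        # have to move a second time when the OSD is eventually purged.
        mon_command({'prefix': 'osd crush reweight',
                     'name': f'osd.{osd_id}', 'weight': 0.0})
```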
Although the chunking in off-line `dups` trimming (via COT) seems
fine, the `ceph-objectstore-tool` is a client of `trim()` of
`PGLog::IndexedLog` which means that a partial revert is not
possible without extensive changes.
The backport ticket is: https://tracker.ceph.com/issues/55990
Revert "osd/PGLog.cc: Trim duplicates by number of entries"
This reverts commit 7cc1b29f2b7b7feee127f1dbbef947799e56f38b,
which is the in-OSD part of the fix for accumulation of `dup`
entries in a PG Log. Brainstorming it has raised questions
about the OSD's behaviour during an upgrade if there are tons of
dups in the log. What must be double-checked before bringing
it back is ensuring we chunk the deletions properly so as not
to cause OOMs / stalls in, for example, RocksDB.
The backport ticket is: https://tracker.ceph.com/issues/55990
We really want to have the ability to know how many
entries `PGLog::IndexedLog::dups` has inside.
The current ways are either invasive (stopping an OSD)
or indirect (examination of `dump_mempools`).
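The indirect route looks roughly like this (a sketch; the exact JSON layout of dump_mempools may differ between releases, and it only shows the size of the whole pglog mempool, not a per-PG dups count):
```
import json
import subprocess

def pglog_mempool(osd_id):
    """Query the OSD admin socket and return the osd_pglog mempool stats."""
    out = subprocess.check_output(
        ['ceph', 'daemon', f'osd.{osd_id}', 'dump_mempools'])
    pools = json.loads(out).get('mempool', {}).get('by_pool', {})
    return pools.get('osd_pglog', {})

if __name__ == '__main__':
    # Something like {'items': ..., 'bytes': ...}; a steadily growing value
    # hints at dups accumulation but gives no exact entry count.
    print(pglog_mempool(0))
```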
Ilya Dryomov [Sun, 29 May 2022 16:20:34 +0000 (18:20 +0200)]
librbd: unlink newest mirror snapshot when at capacity, bump capacity
CreatePrimaryRequest::unlink_peer() invoked via "rbd mirror image
snapshot" command or via rbd_support mgr module when creating a new
scheduled mirror snapshot at rbd_mirroring_max_mirroring_snapshots
capacity on the primary cluster can race with Replayer::unlink_peer()
invoked by rbd-mirror when finishing syncing an older snapshot on the
secondary cluster. Consider the following:
0. rbd-mirror is syncing snap1..snap2 delta
1. rbd_support creates primary-snap4
2. due to rbd_mirroring_max_mirroring_snapshots == 3, rbd_support picks
primary-snap3 for unlinking
3. rbd-mirror finishes syncing snap1..snap2 delta and marks
non-primary-snap2 complete
[ snap1 (the old base) is no longer needed on either cluster ]
4. rbd-mirror unlinks and removes primary-snap1
5. rbd-mirror removes non-primary-snap1
6. rbd-mirror picks snap2 as the new base
7. rbd-mirror creates non-primary-snap3 and starts syncing snap2..snap3
delta
8. rbd_support unlinks and removes primary-snap3 which is in-use by
rbd-mirror
If snap trimming on the primary cluster kicks in soon enough, the
secondary image becomes corrupted: rbd-mirror would eventually finish
"syncing" non-primary-snap3 and mark it complete in spite of bogus data
in the HEAD -- the primary cluster OSDs would start returning ENOENT
for snap trimmed objects. Luckily, rbd-mirror's attempt to pick snap3
as the new base would wedge the replayer with "split-brain detected:
failed to find matching non-primary snapshot in remote image" error.
Before commit a888bff8d00e ("librbd/mirror: tweak which snapshot is
unlinked when at capacity") this could happen pretty much all the time
as it was the second oldest snapshot that was unlinked. This commit
changed it to be the third oldest snapshot, turning this into a more
narrow but still very much possible to hit race.
Unfortunately this race condition appears to be inherent to the way
snapshot-based mirroring is currently implemented:
a. when mirror snapshots are created on the producer side of the
snapshot queue, they are already linked
b. mirror snapshots can be concurrently unlinked/removed on both
sides of the snapshot queue by non-cooperating clients (local
rbd_mirror_image_create_snapshot() vs remote rbd-mirror)
c. with mirror peer links off the list due to (a), there is no
existing way for rbd-mirror to persistently mark a snapshot as
in-use
As a workaround, bump rbd_mirroring_max_mirroring_snapshots to 5 and
always unlink the newest snapshot (i.e. slot 4) instead of the third
oldest snapshot (i.e. slot 2). Hopefully this gives enough leeway,
as rbd-mirror would need to sync two snapshots (i.e. transition from
syncing 0-1 to 1-2 and then to 2-3) before potentially colliding with
rbd_mirror_image_create_snapshot() on slot 4.
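A small Python illustration of the new unlink policy (the real logic is C++ in CreatePrimaryRequest::unlink_peer(); the helper below is hypothetical):
```
def pick_unlink_candidate(mirror_snap_ids, max_snapshots=5):
    """Return the snapshot id to unlink before creating a new mirror snapshot.

    mirror_snap_ids is assumed to be ordered oldest -> newest.
    """
    if len(mirror_snap_ids) < max_snapshots:
        return None  # still under capacity, nothing to unlink
    # Unlink the newest snapshot (slot max_snapshots - 1): rbd-mirror must
    # sync through two older snapshots before it could possibly be using it,
    # which is the leeway the commit message relies on.
    return mirror_snap_ids[max_snapshots - 1]
```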
Ilya Dryomov [Sun, 29 May 2022 17:55:04 +0000 (19:55 +0200)]
test/librbd: fix set_val() call in SuccessUnlink* test cases
rbd_mirroring_max_mirroring_snapshots isn't actually set to 3 there
due to the stray conf_ prefix. It didn't matter until now because the
default was also 3.
Ilya Dryomov [Sat, 28 May 2022 18:06:22 +0000 (20:06 +0200)]
rbd-mirror: don't prune non-primary snapshot when restarting delta sync
When restarting interrupted sync (signified by the "end" non-primary
snapshot with last_copied_object_number > 0), preserve the "start"
non-primary snapshot until the sync is completed, like it would have
been done had the sync not been interrupted. This ensures that the
same m_local_snap_id_start is passed to scan_remote_mirror_snapshots()
and ultimately ImageCopyRequest state machine on restart as on initial
start.
This ends up being yet another fixup for 281af0de86b1 ("rbd-mirror:
prune unnecessary non-primary mirror snapshots"), following earlier
7ba9214ea5b7 ("rbd-mirror: don't prune older mirror snapshots when
pruning incomplete snapshot") and ecd3778a6f9a ("rbd-mirror: ensure
that the last non-primary snapshot cannot be pruned").
Zac Dover [Mon, 30 May 2022 13:32:06 +0000 (23:32 +1000)]
doc/start: update "memory" in hardware-recs.rst
This PR corrects some usage errors in the "Memory" section
of the hardware-recommendations.rst file. It also closes
some opened but never closed parentheses.
Laura Flores [Mon, 16 May 2022 22:59:42 +0000 (17:59 -0500)]
qa/suites/rados/thrash-erasure-code-big/thrashers: add `osd max backfills` setting to mapgap and pggrow
All `rados/thrash-erasure-code-big` tests that die due to the “wait_for_recovery” timeout have one thing in common: They contain either `thrashers/pggrow` or `thrashers/mapgap`.
The difference between pggrow and mapgap vs. all other non-offending thrashers (default, careful, fastread, and morepggrow) is that they lack an override setting for `osd max backfills`. `osd max backfills` is the max number of backfill operations allowed to/from an OSD. The higher the number, the quicker the recovery. By default, this value is 1. On all of the non-offending thrashers (default, careful, fastread, and morepggrow), the default 1 value gets overridden in their .yaml files with a value > 1. This is not the case for pggrow and mapgap, however, as they lack an `osd max backfills` override setting.
The mclock op scheduler is known to override `osd max backfills` with a high value, but all of the thrash-erasure-code-big thrashers have their op queue set to “debug_random”, which chooses randomly between op queues (the debug_random op queue is set to override the default mclock_scheduler in qa/config/rados.yaml). So, coupled with the “debug_random” op queue, the low `osd max backfills` setting is causing some tests to time out in recovery.
WITHOUT `osd max backfills`, as they are now, “mapgap” and “pggrow” tests die due to timed-out recovery about 17/100 times, as seen here with a pggrow test: http://pulpito.front.sepia.ceph.com/lflores-2022-05-18_14:24:29-rados:thrash-erasure-code-big-master-distro-default-smithi/
WITH `osd max backfills` specified, as I have suggested in this PR, 99/100 tests passed, with one test failing for a different reason:
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_22:40:27-rados:thrash-erasure-code-big-master-distro-default-smithi/
I also scheduled 145 tests WITH `osd max backfills` that are a mix of pggrow and mapgap thrashers. 144/145 tests passed, with one test failing for a different reason. http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_15:27:54-rados:thrash-erasure-code-big-master-distro-default-smithi/
Fixes: https://tracker.ceph.com/issues/51076
Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit 40062676c2ceed49b9fa147127ffa83ba6118e2a)
This PR links to the issue tracker by label, and not
by file. This method was proposed by Kefu Chai
in 40f9e1cee054bb568dfa3267982467c99b4ce5c5
on 05 Sep 2020 and was never before incorporated
into Octopus. This was noticed by Neha Ojha in
May 2020.
Signed-off-by: Zac Dover <zac.dover@gmail.com>
(squash) more link fixes
Signed-off-by: Zac Dover <zac.dover@gmail.com>
(squash) see last commit
Signed-off-by: Zac Dover <zac.dover@gmail.com>
(squash) fix broken link (again)
This corrects the error "Mismatch: both
interpreted text role prefix and reference suffix", which
appeared because I treated the link to an external URL as
though it were a :ref:-style link to a location inside the
Ceph documentation suite.
Signed-off-by: Zac Dover <zac.dover@gmail.com>
(squash) ibid - malformed external link
Prashant D [Fri, 29 Oct 2021 13:09:24 +0000 (09:09 -0400)]
osd/OSD: Log aggregated slow ops detail to cluster logs
Slow requests can overwhelm the cluster log with every slow op in
detail and also fill up the monitor db. Instead, log slow op
details in an aggregated format.
Fixes: https://tracker.ceph.com/issues/52424
Signed-off-by: Prashant D <pdhange@redhat.com>
(cherry picked from commit 9319dc9273b36dc4f4bef1261b3c57690336a8cc)
Conflicts:
src/common/options/osd.yaml.in
- Octopus doesn't have osd.yaml.in; added the option in src/common/options.cc
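For illustration, a hedged Python sketch of the aggregation idea behind the slow-ops change above (the actual change is C++ in osd/OSD.cc; the tuple format below is an assumption for the example):
```
from collections import Counter

def aggregate_slow_ops(slow_ops):
    """slow_ops: iterable of (op_type, age_seconds) tuples for ops past the
    slow-op threshold; returns one summary line instead of one line per op."""
    by_type = Counter(op_type for op_type, _ in slow_ops)
    oldest = max((age for _, age in slow_ops), default=0.0)
    detail = ', '.join(f'{count} x {op_type}'
                       for op_type, count in by_type.most_common())
    return (f'{sum(by_type.values())} slow requests '
            f'(oldest {oldest:.0f} secs): {detail}')
```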
Zac Dover [Wed, 18 May 2022 10:36:53 +0000 (20:36 +1000)]
doc/start: s/3/three/ in intro.rst
I'm changing "3" to "three" for two reasons:
1. It's correct.
2. This allows me to test backports into Octopus, Pacific, and Quincy.
I am particularly interested to see what happens when I attempt
the backport into Octopus, because backports into Octopus have
failed. This will provide me with another unit of data.
Cherry-pick notes:
- src/rgw/rgw_admin.cc: conflicts due to differences in op lists
- src/rgw/rgw_rados.h: conflicts due to changes to unrelated method signatures
- src/rgw/rgw_sal.cc: conflicts due to octopus missing a couple of methods from later releases
- src/rgw/rgw_sal.h: conflicts due to changes to unrelated method signatures
- src/rgw/rgw_rados.cc: use of ldpp_dout vs. ldout
NitzanMordhai [Mon, 21 Mar 2022 11:34:34 +0000 (11:34 +0000)]
osd/PGLog.cc: Trim duplicates by number of entries
PGLog needs to trim duplicates by the number of entries rather than the versions. That way, we prevent unbounded duplicate growth.
Adding duplicate entry trimming to the trim-pg-log operation: we use the existing
PGLog trim function to find out the set of entries/dup entries that we are supposed to
trim. To use it we need to build the PGLog from disk.
Conflicts:
src/osd/PGLog.h -- octopus does not have the commit 877798028386fbd833e8955cb89ce3f1ee47fbeb
which cleans the `std` namespace dependency
in headers.
src/tools/ceph_objectstore_tool.cc -- octopus lacks the commit 7d73fa6a309dca4c5381596c5e92813e2e11ed3b
which puts the buffer's error hierarchy on
`system_error`.
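A rough Python sketch of the trim-by-entries idea above (the real implementation extends PGLog::IndexedLog::trim() in C++; the function below is only an illustration):
```
from collections import deque

def trim_dups_to_max(dups, max_dup_entries):
    """Drop the oldest dup entries until at most max_dup_entries remain.

    dups is assumed to be ordered oldest -> newest; returns (kept, trimmed).
    """
    dups = deque(dups)
    trimmed = []
    while len(dups) > max_dup_entries:
        trimmed.append(dups.popleft())
    return list(dups), trimmed
```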