Xuehan Xu [Fri, 24 May 2024 09:30:41 +0000 (17:30 +0800)]
crimson/osd/osd_operations/client_request_common: `PeeringState::needs_recovery()`
may fail if the object is under backfill
Also set the correct version for backfill objects.
From Classic:
```
if (is_degraded_or_backfilling_object(head)) {
  if (can_backoff && g_conf()->osd_backoff_on_degraded) {
    add_backoff(session, head, head);
    maybe_kick_recovery(head);
  }
```
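A minimal standalone sketch of the idea (stand-in types and names such as
`RecoverySketch`, `ObjectId` and `Version` are illustrative, not the actual
crimson or Classic code): the "does this object need recovery before we can
touch it?" check must consider both the missing set, which is what
`PeeringState::needs_recovery()` consults, and objects that have not yet
reached all backfill targets, analogous to Classic's
`is_degraded_or_backfilling_object()`.
```
#include <optional>
#include <set>
#include <string>

using ObjectId = std::string;  // stand-in for hobject_t
using Version  = unsigned;     // stand-in for eversion_t

struct RecoverySketch {
  std::set<ObjectId> missing;         // objects recorded in a missing set
  std::set<ObjectId> under_backfill;  // objects not yet on all backfill targets
  Version last_update = 0;

  // Version to recover/wait on, or nullopt if the object is fully available.
  std::optional<Version> version_to_recover(const ObjectId& oid) const {
    if (missing.count(oid)) {
      return last_update;  // the needs_recovery()-style case
    }
    if (under_backfill.count(oid)) {
      return last_update;  // the backfill case the old check missed
    }
    return std::nullopt;
  }
};
```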
Some tests currently fail when `lvm2` isn't installed:
```
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestList::test_empty_device_json_zero_exit_status - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestList::test_empty_device_zero_exit_status - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestFullReport::test_no_ceph_lvs - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestFullReport::test_ceph_data_lv_reported - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestFullReport::test_ceph_journal_lv_reported - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestFullReport::test_ceph_wal_lv_reported - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestFullReport::test_physical_2nd_device_gets_reported[journal] - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestFullReport::test_physical_2nd_device_gets_reported[db] - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestFullReport::test_physical_2nd_device_gets_reported[wal] - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestSingleReport::test_not_a_ceph_lv - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestSingleReport::test_report_a_ceph_lv - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestSingleReport::test_report_a_ceph_journal_device - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestSingleReport::test_report_by_osd_id_for_just_block_dev - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestSingleReport::test_report_by_osd_id_for_just_data_dev - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestSingleReport::test_report_by_osd_id_for_just_block_wal_and_db_dev - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestSingleReport::test_report_by_osd_id_for_data_and_journal_dev - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestSingleReport::test_report_by_nonexistent_osd_id - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_listing.py::TestSingleReport::test_report_a_ceph_lv_with_no_matching_devices - FileNotFoundError: [Errno 2] No such file or directory: 'pvs'
FAILED ceph_volume/tests/devices/lvm/test_migrate.py::TestNew::test_newdb_not_target_lvm - FileNotFoundError: [Errno 2] No such file or directory: 'lvs'
FAILED ceph_volume/tests/devices/lvm/test_zap.py::TestEnsureAssociatedLVs::test_nothing_is_found - FileNotFoundError: [Errno 2] No such file or directory: 'lvs'
FAILED ceph_volume/tests/devices/lvm/test_zap.py::TestEnsureAssociatedLVs::test_multiple_journals_are_found - FileNotFoundError: [Errno 2] No such file or directory: 'lvs'
FAILED ceph_volume/tests/devices/lvm/test_zap.py::TestEnsureAssociatedLVs::test_multiple_dbs_are_found - FileNotFoundError: [Errno 2] No such file or directory: 'lvs'
FAILED ceph_volume/tests/devices/lvm/test_zap.py::TestEnsureAssociatedLVs::test_multiple_wals_are_found - FileNotFoundError: [Errno 2] No such file or directory: 'lvs'
FAILED ceph_volume/tests/devices/lvm/test_zap.py::TestEnsureAssociatedLVs::test_multiple_backing_devs_are_found - FileNotFoundError: [Errno 2] No such file or directory: 'lvs'
FAILED ceph_volume/tests/objectstore/test_lvmbluestore.py::TestLvmBlueStore::test_activate_all_osd_is_active - FileNotFoundError: [Errno 2] No such file or directory: 'lvs'
```
All of these `pvs`/`lvs` invocations should be mocked so the tests don't
depend on lvm2 being installed; this commit addresses that.
Venky Shankar [Mon, 17 Jun 2024 09:27:52 +0000 (14:57 +0530)]
Merge PR #57991 into main
* refs/pull/57991/head:
qa: upgrade sub-suite upgraded_client from n-1|n-2 releases
qa: upgrade sub-suite nofs from n-1 and n-2 releases
qa: use supported releases for featureful_client
Ronen Friedman [Tue, 4 Jun 2024 09:02:55 +0000 (04:02 -0500)]
osd/scrub: do not track reserving state at OSD level
As we no longer block the initiation of new scrub sessions for an OSD
for which any of its PGs is in the process of reserving scrub resources,
there is no need to track the reserving state at the OSD level.
Ronen Friedman [Tue, 4 Jun 2024 08:53:04 +0000 (03:53 -0500)]
osd/scrub: allow new scrubs while reserving
Allow a new scrub session to be initiated by an OSD even while one of its
PGs is in the process of reserving scrub resources.
The existing restriction made sense when the replica reservation process
was expected to succeed or fail within a few milliseconds. It makes less
sense now that the reservation process is queue-based (Reserver-based)
and can take unlimited time (hours, days, ...) to complete.
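A minimal sketch of the gate being removed (assumed names, not the actual
OsdScrub code): the OSD-level scheduler used to refuse to start new scrubs
while any reservation was in flight, which is untenable once reservations
can stay queued for hours or days.
```
#include <atomic>

struct OsdScrubSketch {
  // Used to record "some PG on this OSD is currently reserving replica
  // scrub resources"; this flag (and the check below) go away.
  std::atomic_bool reserving_now{false};

  bool can_initiate_new_scrub() const {
    // Old behaviour: refuse to start any new scrub session while a
    // reservation is pending. With Reserver-based reservations that can
    // stay queued indefinitely, this would stall scrubbing OSD-wide.
    return !reserving_now.load();
  }
};
```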
Zac Dover [Sat, 15 Jun 2024 11:55:18 +0000 (21:55 +1000)]
doc/rados: explain replaceable parts of command
Add an explanation that directs the reader to replace the "X" part of
the command "ceph tell mon.X mon_status" with the value specific to the
reader's Ceph cluster (which is probably not "X").
In the future, such replaceable strings in commands may be bounded by
angle brackets ("<" and ">").
This improvement to the documentation was suggested on the [ceph-users]
email list by Joel Davidow. This email, an absolute model of user
engagement with an upstream project, can be reviewed here:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/KF67F5TXFSSTPXV7EKL6JKLA5KZQDLDQ/
John Mulligan [Fri, 14 Jun 2024 14:07:07 +0000 (10:07 -0400)]
script/cpatch.py: add support for multiple valid python versions
Fix running cpatch.py with the latest centos9s-based container images.
Future-proof a little by adding multiple valid, existing Python version
numbers to probe.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
Samuel Just [Tue, 28 May 2024 20:45:26 +0000 (20:45 +0000)]
crimson/.../snaptrim_event: SnapTrimObjSubEvent should enter WaitRepop
Otherwise, it parks on Process until the repop completes, blocking any
other repops, including client IO. Since we don't actually care about
ordering, simply calling handle.complete() would also be viable, but
this is a valid usage of the stage and does provide information to an
operator.
SnapTrimEvent doesn't actually do or block on GetOBC or Process --
remove those stages entirely. Entering Process, in particular, causes
problems unless we immediately leave it, as SnapTrimObjSubEvent needs to
enter and leave it to complete. Entering one of the stages removed in
a prior commit had a side effect of exiting Process -- without that
exit, SnapTrimEvent and SnapTrimObjSubEvent mutually block, preventing
snap trim or client IO from making progress.
This leaves no actual pipeline stages on SnapTrimEvent, which makes
sense as only SnapTrimObjSubEvent actually does IO.
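A simplified model of the ordering concern described above (plain mutexes
standing in for crimson's pipeline stages; none of these are the actual
types): an op that waits for its repop while still holding Process stalls
every later op, whereas moving the wait into a later stage releases Process
first.
```
#include <mutex>

struct PipelineModel {
  std::mutex process;     // admits one operation at a time
  std::mutex wait_repop;  // later stage; waiting here doesn't hold up Process
};

void handle_op_badly(PipelineModel& p, void (*wait_for_repop)()) {
  std::scoped_lock g{p.process};
  wait_for_repop();  // still holding Process: other repops and client IO stall
}

void handle_op_better(PipelineModel& p, void (*wait_for_repop)()) {
  { std::scoped_lock g{p.process}; /* submit the repop */ }
  std::scoped_lock g{p.wait_repop};
  wait_for_repop();  // Process already released; other operations can proceed
}
```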
Samuel Just [Tue, 28 May 2024 16:48:42 +0000 (09:48 -0700)]
crimson/.../snaptrim_event: remove pipeline stages located on event
WaitSubop, WaitTrimTimer, and WaitRepop are pipeline stages local to
the operation. As such they don't actually provide any ordering
guarantees, as only one operation will ever enter them. Rather, the
intent is to hook into the event system to expose information to an
administrator.
This poses a problem for OrderedConcurrentPhase as it is currently
implemented. PipelineHandle::exit() is invoked prior to the op being
destructed. PipelineHandle::exit() does:
```
void exit() {
  barrier.reset();
}
```
For OrderedConcurrentPhase, ~ExitBarrier() invokes ExitBarrier::exit(),
which unlocks phase->mutex asynchronously.
The problem comes in not waiting for that phase->mutex.unlock() to occur.
For SnapTrimEvent, phase is actually in the operation itself. It's
possible for the continuation that performs the unlock to run after the
last finally() in ShardServices::start_operation completes and releases
the final reference to SnapTrimEvent. This is normally harmless provided
that the PG or connection outlives it, but it's a problem for these stages.
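A deliberately simplified, standalone illustration of that lifetime hazard
(plain C++, none of the crimson types): the deferred "exit" work captures a
raw pointer into the operation, and the operation can be destroyed before
that work runs.
```
#include <functional>
#include <memory>
#include <mutex>
#include <vector>

std::vector<std::function<void()>> pending;  // stands in for deferred continuations

struct Phase {      // stands in for a pipeline phase with a mutex
  std::mutex mutex;
};

struct Operation {  // stands in for SnapTrimEvent
  Phase wait_repop; // the phase lives inside the operation itself
};

void exit_barrier(Phase* phase) {
  // The continuation captures a raw pointer into the operation...
  pending.push_back([phase] { phase->mutex.unlock(); });
}

int main() {
  auto op = std::make_unique<Operation>();
  op->wait_repop.mutex.lock();
  exit_barrier(&op->wait_repop);
  op.reset();               // last reference to the operation is dropped here
  for (auto& c : pending) {
    c();                    // use-after-free: unlocks a mutex in freed memory
  }
}
```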
For now, let's just remove these stages. We can reintroduce another
mechanism later to set these event flags without an actual pipeline
stage.
This is likely a bug even with pipelines not embedded in an operation,
but we can fix it later -- https://tracker.ceph.com/issues/64545.
Fixes: https://tracker.ceph.com/issues/63647
Signed-off-by: Samuel Just <sjust@redhat.com>
John Mulligan [Wed, 1 May 2024 14:57:02 +0000 (10:57 -0400)]
mgr/smb: share and cluster create commands only create resources
Prior to this change, the create commands could be used, counter to the
term 'create', as create-or-update commands. IMO this violates the
principle of least surprise, so make them create-only.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
John Mulligan [Wed, 1 May 2024 14:55:27 +0000 (10:55 -0400)]
mgr/smb: add create_only arg for handler apply function
Add a create_only argument to the handler class apply function. This
flag is used to prevent modification of existing resources. This flag
will be used by the 'cluster create' and 'share create' commands to make
them true to their names and not sneaky modify-or-create commands.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
Afreen [Fri, 31 May 2024 07:54:27 +0000 (13:24 +0530)]
mgr/dashboard: Configure NVMe/TCP
Fixes: https://tracker.ceph.com/issues/63686
- creation of Nvme-oF/TCP service
- deletion of Nvme-oF/TCP service
- edit/update Nvme-oF/TCP service
- added unit tests for Nvme-oF/TCP service
- changed Id -> Service Name
- added prefix of service type in service name (similar to <client.> in
fs access)
- service name and pool are required fields for nvmeof
- placement count now takes default value as mentioned in cephadm
- slight refactors
- prepopulate serviceId for each service type via setServiceId()
- if serviceId is the same as the service type, do not create the service
  name in the <servicetype>.<serviceId> format
Ilya Dryomov [Fri, 7 Jun 2024 10:12:29 +0000 (12:12 +0200)]
librbd: add rbd_snap_get_trash_namespace2() API to return full namespace
The existing rbd_snap_get_trash_namespace() API returns only the
original name of the deleted snapshot, omitting its namespace type.
While non-user snapshots have distinctive names, there is nothing
preventing the user from creating user snapshots with identical names
(i.e. starting with ".group" or ".mirror" prefix). After cloning from
non-user snapshots is allowed, it's possible for such user snapshots to
get mixed up with non-user snapshots in the trash, so let's provide
means for disambiguation.
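A hedged usage sketch of the new call. The struct fields and exact
signature below are assumptions based on the description above; verify
them against the installed `librbd.h` before relying on this.
```
#include <cstdint>
#include <cstdio>
#include <rbd/librbd.h>

int print_trash_origin(rbd_image_t image, uint64_t snap_id) {
  // Assumed shape: the "2" variant fills a struct carrying both the
  // original namespace type and the original snapshot name.
  rbd_snap_trash_namespace_t trash = {};
  int r = rbd_snap_get_trash_namespace2(image, snap_id, &trash, sizeof(trash));
  if (r < 0) {
    return r;
  }
  // The namespace type disambiguates a user snapshot named ".group..." or
  // ".mirror..." from a genuine group/mirror snapshot sitting in the trash.
  printf("original namespace type: %d, original name: %s\n",
         (int)trash.original_namespace_type, trash.original_name);
  // NOTE: release the struct as documented in librbd.h (the cleanup helper
  // is omitted here to avoid guessing its name).
  return 0;
}
```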
Ilya Dryomov [Thu, 30 May 2024 14:54:53 +0000 (16:54 +0200)]
qa/workunits/rbd: fix bogus grep -v asserts in test_clone()
The intent of "rbd ls | grep -v clone" was probably to check that an
image with the name "clone" shows up in rbd2 pool and not in rbd pool.
However, it's very far from that -- "grep -v clone" would succeed
regardless because of an image with the name "test1" in rbd pool.
Ilya Dryomov [Fri, 24 May 2024 10:06:09 +0000 (12:06 +0200)]
librbd: add rbd_clone4() API to take parent snapshot by ID
Allow cloning from non-user snapshots -- namely snapshots in group
and mirror namespaces. The motivation is to provide a building block
for cloning new groups from group snapshots ("rbd group snap create").
Otherwise, group snapshots as they are today can be used only for
rolling back the group as a whole, which is very limiting.
While at it, there doesn't seem to be anything wrong with making it
possible to clone from mirror snapshots as well.
Snapshots in a trash namespace can't be cloned from since they are
considered to be deleted.
Cloning from non-user snapshots is limited to clone v2 just because
protecting/unprotecting is limited to snapshots in a user namespace.
This happens to simplify some invariants.
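A hedged usage sketch, assuming `rbd_clone4()` parallels `rbd_clone3()`
with the parent snapshot name replaced by a snapshot ID; the pool contexts
and image names are placeholders, and the exact signature should be checked
in `librbd.h`.
```
#include <cstdint>
#include <rbd/librbd.h>

int clone_from_snap_id(rados_ioctx_t parent_ioctx, rados_ioctx_t child_ioctx,
                       uint64_t parent_snap_id) {
  rbd_image_options_t opts;
  rbd_image_options_create(&opts);
  // Cloning from a non-user (group/mirror) snapshot is clone v2 only, since
  // protect/unprotect applies only to snapshots in the user namespace.
  int r = rbd_clone4(parent_ioctx, "parent_image", parent_snap_id,
                     child_ioctx, "child_image", opts);
  rbd_image_options_destroy(opts);
  return r;
}
```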
librbd: replace assert with error check in clone()
With an error check for p_snap_name, it doesn't make much sense to
crash if the "either p_id or p_name" contract is violated. Replace the
assert with a similar error check.
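A minimal sketch of the pattern (not librbd's actual code; the helper name
and error value are illustrative): a violated "either p_id or p_name" input
contract now yields an error instead of an assert.
```
#include <cerrno>
#include <string>

int validate_parent_spec(const std::string& p_id, const std::string& p_name) {
  // Previously: ceph_assert(p_id.empty() ^ p_name.empty());
  if (p_id.empty() == p_name.empty()) {
    return -EINVAL;  // exactly one of p_id / p_name must be provided
  }
  return 0;
}
```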
Ivo Almeida [Wed, 15 May 2024 08:42:47 +0000 (09:42 +0100)]
mgr/dashboard: carbon initial setup
* replace header and side navigation with carbon components
* added carbon specific style overrides
* added carbon icons
* created custom theme based on current color scheme
Fixes: https://tracker.ceph.com/issues/66217
Signed-off-by: Ivo Almeida <ialmeida@redhat.com>
Afreen Misbah [Wed, 12 Jun 2024 15:50:04 +0000 (21:20 +0530)]
mgr/dashboard: Fix login and notification e2e tests
Fixes: https://tracker.ceph.com/issues/66453
- the `#rbdMirroring` checkbox is not found, which causes both of these tests
  to fail on most of the PRs
- this is due to the pool helper function, which checks for an existing app
  passed as a parameter
- if the app is not found, the mirroring checkbox remains hidden
Mohit Agrawal [Wed, 12 Jun 2024 11:49:18 +0000 (17:19 +0530)]
unittest_osdmap aborted during OSDMapTest.BUG_42485
The test case is aborted during the run of the clean_upmap_tp thread.
The function clean_pg_upmaps() spawns a number of worker threads to
process a PGMapper job. A worker thread fetches a job from the queue,
processes it, and then calls _process_finish() on it. The _process()
function of the PGMapper class destroys the object, so when the worker
thread calls _process_finish() it crashes because the job pointer has
become a dangling pointer.
Solution: to avoid the crash, destroy the object in _process_finish()
instead of in _process().
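A simplified sketch of the ordering the fix relies on (stand-in types, not
the actual Ceph work-queue/PGMapper code): the worker calls _process() and
then _process_finish() on the same item, so the item must not be destroyed
before the second call.
```
struct Job {
  bool done = false;
};

struct QueueSketch {
  void _process(Job* job) {
    // ... do the mapping work ...
    job->done = true;
    // BAD (old behaviour): deleting `job` here leaves _process_finish()
    // with a dangling pointer.
  }
  void _process_finish(Job* job) {
    // ... final bookkeeping that still dereferences `job` ...
    delete job;  // GOOD (the fix): destroy the item at its last use
  }
  void worker_loop(Job* job) {
    _process(job);
    _process_finish(job);  // crashes if _process() already destroyed the job
  }
};
```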