There is no need for a CreateSnapshotRequests.__del__() that calls
CreateSnapshotRequests.wait_for_pending():
MirrorSnapshotScheduleHandler.shutdown() already calls
CreateSnapshotRequests.wait_for_pending().
Conflicts:
src/pybind/mgr/rbd_support/mirror_snapshot_schedule.py
- The above conflict was due to commit e4a16e2
("mgr/rbd_support: add type annotation") not being in pacific
Ramana Raja [Thu, 26 Oct 2023 17:18:52 +0000 (13:18 -0400)]
mgr/rbd_support: fix recursive locking on CreateSnapshotRequests lock
The MirrorSnapshotScheduleHandler's run thread issues asynchronous
create snapshot requests using a CreateSnapshotRequests instance. When
the thread invokes a CreateSnapshotRequests instance's get_ioctx(),
the instance's class variable lock is acquired. With the class
variable lock held, garbage collection of an unreferenced
CreateSnapshotRequests instance may kick in on the same thread. The
thread would then call that instance's __del__(), which tries to
acquire the class variable lock that the thread already holds. Fix this
recursive deadlock by converting the CreateSnapshotRequests lock from
a class variable to an instance variable. There is no need to share
the lock across CreateSnapshotRequests instances.
Also convert the MirrorSnapshotScheduleHandler, PerfHandler and
TrashPurgeScheduleHandler class variables that don't need to be shared
across instances to instance variables.
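A minimal sketch of the conversion (simplified to just the lock; not the
actual module code):
```python
import threading

class CreateSnapshotRequestsBefore:
    # Class variable: a single lock shared by every instance. A __del__()
    # running on a thread that already holds it deadlocks on re-acquisition.
    lock = threading.Lock()

class CreateSnapshotRequestsAfter:
    def __init__(self) -> None:
        # Instance variable: each instance owns its lock, so garbage
        # collection of another instance can never contend with it.
        self.lock = threading.Lock()
```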
Fixes: https://tracker.ceph.com/issues/62994
Signed-off-by: Ramana Raja <rraja@redhat.com>
Co-Authored-By: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 4452bc22d1c6c8499cf55d6e39090adf7ae1dcbf)
Conflicts:
src/pybind/mgr/rbd_support/mirror_snapshot_schedule.py
src/pybind/mgr/rbd_support/perf.py
src/pybind/mgr/rbd_support/trash_purge_schedule.py
- The above conflicts were due to commit e4a16e2
("mgr/rbd_support: add type annotation") not being in pacific
Ramana Raja [Mon, 18 Sep 2023 02:52:56 +0000 (22:52 -0400)]
qa/suites/rbd: add test to check rbd_support module recovery
... on repeated blocklisting of its client.
There were issues with the rbd_support module not being able to recover
from its RADOS client being repeatedly blocklisted. This occurred, for
example, in clusters whose OSDs were slow to process RBD requests while
the module's mirror_snapshot_scheduler was taking mirror snapshots by
requesting exclusive locks on the RBD images, and workloads were running
on the snapshotted images via kernel clients.
test/librbd/fsx: wait for resize to propagate in krbd_resize()
With this change, the resize request is no longer blocked until the
resize is completed. Because of this, the fsx test fails, as it assumes
that a resize request immediately implies a change in the device size.
Hence we have to add a wait in fsx's resize handler for the device to
actually get resized.
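The wait is essentially a poll loop on the kernel-reported device size.
A hypothetical Python sketch of the idea (fsx itself is C; the sysfs
path and timeout here are assumptions):
```python
import time

def wait_for_device_resize(sysfs_size_path: str, expected_bytes: int,
                           timeout: float = 30.0) -> None:
    """Poll the kernel-reported block device size until it matches."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with open(sysfs_size_path) as f:
            sectors = int(f.read().strip())  # sysfs reports 512-byte sectors
        if sectors * 512 == expected_bytes:
            return
        time.sleep(0.1)
    raise TimeoutError("device size did not propagate in time")
```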
Problem:
-------
Trying to disable any feature on an rbd image mapped with nbd leads to
a hang in rbd-nbd.
rbd-nbd registers a watcher callback to detect image resize in
NBDWatchCtx::handle_notify(). handle_notify() calls the image info
method, which calls refresh_if_required() and gets stuck there.
It gets stuck in ImageState::refresh_if_required() because
DisableFeaturesRequest issues update notifications while still holding
the exclusive lock, with everything that depends on it blocked.
Solution:
--------
Set only a notify flag in NBDWatchCtx::handle_notify() and handle the
resize detection part in a separate thread.
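In outline (a Python sketch of the pattern only; the real change is in
rbd-nbd's C++, and these names are illustrative):
```python
import threading

class NBDWatchCtxSketch:
    def __init__(self) -> None:
        self.notified = threading.Event()
        threading.Thread(target=self._resize_loop, daemon=True).start()

    def handle_notify(self) -> None:
        # Never block here: the notifier (e.g. DisableFeaturesRequest) may
        # still be holding the exclusive lock.
        self.notified.set()

    def _resize_loop(self) -> None:
        while True:
            self.notified.wait()
            self.notified.clear()
            self._detect_and_apply_resize()  # safe to block on this thread

    def _detect_and_apply_resize(self) -> None:
        pass  # placeholder: query the image size, resize the nbd device
```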
When the OSD preboots, it sends an MMonGetPurgedSnaps message to
the monitor (`_get_purged_snaps`).
The monitor replies with all the purged snapshots whose purged_epoch_ is
in the range superblock.purged_snaps_last + 1 up to
superblock.current_epoch + 1.
When the OSD handles the reply from the mon
(`handle_get_purged_snaps_reply`), it calls `record_purged_snaps` to
write those purged snapshots to the OSD store as well (PSN_ keys).
Once purged_snaps_last is reset, on the following OSD reboot the
snapshots that were marked as purged (purged_snaps_ keys) in the mon's
store will also be marked, correspondingly, in the OSD store.
That way `scrub_purged_snaps` will be able to re-trim the snapshots that
(for some reason) weren't marked as purged on the OSD side.
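The epoch-range filter, as a tiny illustrative sketch (the names are
assumptions, not the actual monitor code):
```python
def purged_snaps_to_send(purged_snaps, purged_snaps_last, current_epoch):
    """Select purged snapshots whose purge epoch falls in the range."""
    lo, hi = purged_snaps_last + 1, current_epoch + 1
    return [s for s in purged_snaps if lo <= s["purged_epoch"] <= hi]
```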
Fixes: https://tracker.ceph.com/issues/62981
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
(cherry picked from commit 120ed0f0e8f65c18bfcd1d649617770c2c5af663)
Manual conflict fixes: the 'scrubdebug' command was removed since it's
not part of the original commit.
Commit dc69033763cc116c6ccdf1f97149a74248691042 moves cephfs-shell from
"<CEPH-REPO-ROOT>/src/tools/cephfs/" to
"<CEPH-REPO-ROOT>/src/tools/cephfs/shell", but cephfs-shell's location
in src/vstart.sh and qa/tasks/cephfs/test_cephfs_shell.py is not
updated. This produces a broken vstart_environment.sh and a broken
export command in test_cephfs_shell.py.
Introduced-by: dc69033763cc116c6ccdf1f97149a74248691042
Fixes: https://tracker.ceph.com/issues/58795
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 48ef0444774934dd6d0d3e026142d95e4098bebd)
Conflicts:
qa/tasks/cephfs/test_cephfs_shell.py
- The comment at the top of the file differed in Pacific
compared to the main branch.
Adam King [Mon, 21 Aug 2023 17:48:56 +0000 (13:48 -0400)]
cephadm: make custom_configs work for tcmu-runner container
This is intended to be a temporary workaround to make it possible to
mount custom config files into the tcmu-runner container. The hope is
to refactor cephadm's iscsi handling for squid, but a patch like this
could be useful for iscsi in older releases, where custom config files
are currently unusable for the tcmu-runner container.
What this patch actually does is have us write the custom config files
to a dir for the tcmu-runner container so that the rest of the logic
works without change. I thought this would be easier to remove later
than a patch that integrates more with the container mounts or general
deployment.
Ilya Dryomov [Thu, 12 Oct 2023 17:03:10 +0000 (19:03 +0200)]
pybind/rbd: don't produce info on errors in aio_mirror_image_get_info()
Check the completion return value before attempting to decode c_info.
Otherwise we are guaranteed to access invalid memory in decode_cstr()
while trying to compute the global_id string length when, for example,
the client is blocklisted.
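A hedged sketch of the guard (plain Python standing in for the Cython
binding; the helper and parameter names are assumptions):
```python
def on_mirror_image_get_info_complete(completion, decode_info):
    """decode_info: callable that decodes the C info struct (assumed)."""
    ret = completion.get_return_value()
    if ret < 0:
        # e.g. client blocklisted: c_info was never filled in, so decoding
        # it (decode_cstr() on global_id) would read invalid memory
        return ret, None
    return ret, decode_info()  # decode only after a successful completion
```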
PendingReleaseNotes: Note change to 'ceph config dump' pretty-print output.
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
(cherry picked from commit 401b30f19f51e86f1471447a6af788b94e283ff0)
Conflicts:
PendingReleaseNotes
- Removed an unrelated CephFS release note
- Moved the related release note under the new ">=16.2.15" section
mon/ConfigMonitor: Show localized name in "config dump --format json" output
The "ceph config dump" command without the json formatted output shows
the localized option names and their values. An example of a normalized
vs localized option is shown below:
Normalized: mgr/dashboard/ssl_server_port (maintaned within Option struct)
Localized: mgr/dashboard/x/ssl_server_port (maintained in mon store)
But the "ceph config dump --format json*" output showed the normalized
option names which was not consistent with the "config dump" output.
The output of the command along with variations for pretty printing must
show the same content.
This commit introduces a new member within the ConfigMap's MaskedOption
struct called "localized_name". This is initialized to the localized name
as part of ConfigMonitor::load_config() method.
The MaskedOption::dump() used for the json formatting is modified to
display the localized_name instead of the normalized name.
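Illustratively (Python pseudocode for the C++ dump; the field names are
assumptions):
```python
def masked_option_dump(opt: dict) -> dict:
    return {
        # Was the normalized name, e.g. "mgr/dashboard/ssl_server_port";
        # now the localized name, e.g. "mgr/dashboard/x/ssl_server_port",
        # matching the plain "config dump" output.
        "name": opt["localized_name"],
        "value": opt["value"],
    }
```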
Adam King [Tue, 13 Jun 2023 23:54:30 +0000 (19:54 -0400)]
cephadm: run tcmu-runner through script to do restart on failure
Currently, cephadm runs tcmu-runner as a background
process inside the unit file deployed for iscsi
(rbd-target-api is the primary process). This means
if tcmu-runner crashes for whatever reason, systemd
will not attempt to restart it. This commit sets
up a script to serve as the container entrypoint
for the tcmu-runner container that will run
tcmu-runner and also restart it on failure
(unless there are too many failures in a short
period, at which point it gives up).
The hope is to eventually drop use of this script
for a better solution in squid onward, but this
should be helpful on older releases (quincy and
pacific at least) where we won't be able to
bring in that better solution.
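The entrypoint amounts to a supervised restart loop. A minimal sketch
(Python rather than the actual shell script; the failure threshold and
window are assumed values, not what cephadm ships):
```python
import subprocess
import time

MAX_FAILURES = 5   # assumed threshold
WINDOW_SECS = 120  # assumed window

def run_with_restart(cmd):
    failures = []
    while True:
        rc = subprocess.call(cmd)
        if rc == 0:
            return  # clean exit: nothing to restart
        now = time.monotonic()
        failures = [t for t in failures if now - t < WINDOW_SECS] + [now]
        if len(failures) >= MAX_FAILURES:
            raise SystemExit(f"{cmd[0]}: {MAX_FAILURES} failures within "
                             f"{WINDOW_SECS}s; giving up (last rc={rc})")
        # otherwise fall through and restart the process

run_with_restart(["tcmu-runner"])
```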
Adam King [Fri, 2 Jun 2023 00:06:35 +0000 (20:06 -0400)]
cephadm: add tcmu-runner to logrotate config
This process could be used to set up tcmu-runner
to log to a file much like other ceph daemons:
- create the /etc/tcmu directory
- create the /etc/tcmu/tcmu.conf file with default options
- set the log dir to /var/log
- set the log level to 4
- add -v /etc/tcmu:/etc/tcmu to the tcmu-runner container podman line in unit.run
In order to support this (mostly for debugging), we should
add tcmu-runner to the logrotate config.
Ramana Raja [Mon, 2 Oct 2023 16:39:26 +0000 (12:39 -0400)]
librbd/ManagedLock: kickstart ExclusiveLock state machine
... that is stalled waiting for the lock. Do this when trying to
reacquire the lock in the ImageWatcher's rewatch mechanism. This
enables the ExclusiveLock state machine to propagate the blocklist
error to the caller trying to perform an image operation requiring an
exclusive lock.
The previous attempt, e66db763, to fix the hang due to exclusive lock
acquisition (stuck waiting for lock) racing with client blocklisting
did not always work. e66db763 kickstarted the ExclusiveLock state
machine when the ImageWatcher tried to schedule an exclusive lock
request and the blocklisting was detected. However, there is a short
window between a watch getting deregistered and client blocklisting
getting detected as part of rewatching. If hit when trying to schedule
a lock request, the ExclusiveLock state machine wasn't kickstarted,
the blocklist error wasn't propagated, and the hang resurfaced.
A more robust approach is taken to resume an ExclusiveLock state
machine stuck waiting for the lock during client blocklisting. Whenever
a client's ImageWatcher loses its connection to the cluster, as happens
during blocklisting, the ImageWatcher initiates a mechanism to rewatch
the image and tries to reacquire the lock. Piggyback on this rewatch
mechanism that gets triggered during client blocklisting: when trying
to reacquire the lock, kickstart the ExclusiveLock state machine
stalled waiting for the lock (STATE_WAITING_FOR_LOCK).
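Conceptually (a runnable Python sketch with assumed names; the actual
change is in librbd's C++ state machines):
```python
class ExclusiveLockSketch:
    WAITING_FOR_LOCK = "waiting_for_lock"

    def __init__(self) -> None:
        self.state = self.WAITING_FOR_LOCK
        self.waiters = []  # callbacks of stalled lock requests

    def kickstart(self, error: int) -> None:
        # Re-run the stalled acquire path so a blocklist error reaches
        # the waiting callers instead of hanging them forever.
        if self.state == self.WAITING_FOR_LOCK and error != 0:
            for cb in self.waiters:
                cb(error)
            self.waiters.clear()

def on_rewatch(lock: ExclusiveLockSketch, blocklisted: bool) -> None:
    # The rewatch path always runs when the watch is lost, so kicking the
    # lock here covers the window where blocklisting is detected only
    # during rewatching, not at lock-request scheduling time.
    if blocklisted:
        lock.kickstart(error=-1)  # stand-in for the real blocklist errno
```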
rgw: fix radosgw-admin bucket check stat calculation bug
Fixes a regression in radosgw-admin bucket check stat calculation and
bucket reshard stat calculation when there are objects that have
transitioned from unversioned to versioned. The bug was introduced in
152aadb71b61c53a4832a1c8cf82fce3d64b68d1.
rgw: radosgw-admin bucket check should only print index entries with --check-objects flag
Printing all index entries can be very time-consuming for large
buckets, and the inability to switch this behavior off makes the
command cumbersome to use for fixing bucket stats. It was also
preventing the command from outputting recalculated bucket stats when
the --fix flag wasn't specified.
rgw: prevent another leftover bucket index olh entry scenario
If a call to bucket_index_link_olh or bucket_index_unlink_instance
fails, its associated pending xattr may have prevented the olh object
from being removed by another thread. We should make a best-effort
cleanup attempt for this case by calling update_olh before returning
an error to the caller.
Nizamudeen A [Wed, 27 Sep 2023 11:27:32 +0000 (16:57 +0530)]
mgr/dashboard: allow tls 1.2 with a config option
Provide a config option to allow TLS 1.2.
`ceph dashboard set-enable-unsafe-tls-v1-2 True` followed by a mgr
restart will enable TLS 1.2.
With TLS 1.2 enabled:
```
╰─$ nmap -sV --script ssl-enum-ciphers -p 11000 127.0.0.1
Starting Nmap 7.93 ( https://nmap.org ) at 2023-09-27 16:56 IST
Nmap scan report for localhost (127.0.0.1)
Host is up (0.00018s latency).
PORT STATE SERVICE VERSION
11000/tcp open ssl/http CherryPy wsgiserver
|_http-server-header: Ceph-Dashboard
| ssl-enum-ciphers:
| TLSv1.2:
| ciphers:
| TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (ecdh_x25519) - A
| TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 (ecdh_x25519) - A
| TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (ecdh_x25519) - A
| TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 (ecdh_x25519) - A
| TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA (ecdh_x25519) - A
| TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA (ecdh_x25519) - A
| TLS_RSA_WITH_AES_256_GCM_SHA384 (rsa 2048) - A
| TLS_RSA_WITH_AES_256_CCM (rsa 2048) - A
| TLS_RSA_WITH_AES_128_GCM_SHA256 (rsa 2048) - A
| TLS_RSA_WITH_AES_128_CCM (rsa 2048) - A
| TLS_RSA_WITH_AES_256_CBC_SHA256 (rsa 2048) - A
| TLS_RSA_WITH_AES_128_CBC_SHA256 (rsa 2048) - A
| TLS_RSA_WITH_AES_256_CBC_SHA (rsa 2048) - A
| TLS_RSA_WITH_AES_128_CBC_SHA (rsa 2048) - A
| compressors:
| NULL
| cipher preference: server
| TLSv1.3:
| ciphers:
| TLS_AKE_WITH_AES_256_GCM_SHA384 (ecdh_x25519) - A
| TLS_AKE_WITH_CHACHA20_POLY1305_SHA256 (ecdh_x25519) - A
| TLS_AKE_WITH_AES_128_GCM_SHA256 (ecdh_x25519) - A
| TLS_AKE_WITH_AES_128_CCM_SHA256 (ecdh_x25519) - A
| cipher preference: server
|_ least strength: A
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
Nmap done: 1 IP address (1 host up) scanned in 16.55 seconds
```
Without TLS 1.2 enabled (the dashboard defaults to TLS 1.3):
```
╰─$ nmap -sV --script ssl-enum-ciphers -p 11000 127.0.0.1
Starting Nmap 7.93 ( https://nmap.org ) at 2023-09-27 16:54 IST
Nmap scan report for localhost (127.0.0.1)
Host is up (0.000075s latency).
PORT STATE SERVICE VERSION
11000/tcp open ssl/http CherryPy wsgiserver
| ssl-enum-ciphers:
| TLSv1.3:
| ciphers:
| TLS_AKE_WITH_AES_256_GCM_SHA384 (ecdh_x25519) - A
| TLS_AKE_WITH_CHACHA20_POLY1305_SHA256 (ecdh_x25519) - A
| TLS_AKE_WITH_AES_128_GCM_SHA256 (ecdh_x25519) - A
| TLS_AKE_WITH_AES_128_CCM_SHA256 (ecdh_x25519) - A
| cipher preference: server
|_ least strength: A
|_http-server-header: Ceph-Dashboard
```
Joshua Baergen [Wed, 17 May 2023 18:17:09 +0000 (12:17 -0600)]
rgw: Fix bucket validation against POST policies
It's possible that a user could provide a form part in a POST object
upload that uses 'bucket' as a key; in this case, it was overriding
what was being set in the validation env (which is the real bucket
being modified). The result of this is that a user could actually
upload to any bucket accessible by the specified access key by matching
the bucket in the POST policy to the bucket in said POST form part.
Fix this simply by setting the bucket to the correct value after the
POST form parts are processed, ignoring the form part above if
specified.
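The fix boils down to overwriting the key after parsing. A hedged
sketch (assumed names; the real code is in RGW's C++ POST path):
```python
def build_policy_env(form_parts: dict, request_bucket: str) -> dict:
    env = dict(form_parts)          # may carry a user-supplied 'bucket'
    env["bucket"] = request_bucket  # always reset to the real target bucket
    return env
```
For example, build_policy_env({"bucket": "victim"}, "target") validates
the policy against "target", the bucket actually being modified.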
cephfs-journal-tool: disambiguate usage of all keyword (in tool help).
The fs:all description for the --rank option was confusing. It seemed
like the fs was optional, but it is mandatory. This change modifies the
help message to reflect the correct way to use all in the --rank
option, i.e. --rank=<fs_name>:all.
Fixes: https://tracker.ceph.com/issues/61753
Signed-off-by: Manish M Yathnalli <myathnal@redhat.com>
(cherry picked from commit 52c033f85e274c86bd75f0eb902a32d86356094e)
This was buggy right from the start. Start maintaining per-replayer
blocklisted/failed timestamps and use them to check whether a replayer
restart is required.
Adam Kupczyk [Wed, 2 Feb 2022 19:28:14 +0000 (20:28 +0100)]
os/bluestore/bluefs: Make volume selector operations atomic
Make all RocksDBBlueFSVolumeSelector files/extents/size tracking atomic.
It used to be synchronized by the BlueFS global lock.
Now, in the fine-grained locking era, this is necessary to prevent
corruption.
Fixes: https://tracker.ceph.com/issues/53906
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 372bda350966624d5081635e659f7c46947980c2)
Adam Kupczyk [Thu, 20 Jan 2022 12:44:35 +0000 (13:44 +0100)]
os/bluestore/bluefs: Code for volume selector check
Adds the ability to verify that the volume selector properly tracks
disk usage.
Creates the options:
- bluefs_check_volume_selector_on_umount
- bluefs_check_volume_selector_often
that can be used to validate that the vselector does not diverge from
the values it should have.
Matt Benjamin [Mon, 31 Oct 2022 16:40:50 +0000 (12:40 -0400)]
rgwlc: prevent lc for one bucket from exceeding time budget
Fixes: https://tracker.ceph.com/issues/57951
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
(cherry picked from commit 617ffccbca0169ac0f1cd713962d44e8cc74a8af)
Prashant D [Tue, 12 Sep 2023 18:58:23 +0000 (14:58 -0400)]
pacific: osd/scrub: Fix scrub starts messages spamming the cluster log
With the re-introduction of the scrub starts message in commit-id
e0c0b4f, the cluster log is getting spammed by scrub *starts* messages
for the same PG. This is due to replicas rejecting the scrub reserve
requests, resulting in the scrub getting rescheduled for the same PG
continuously.
Instead of logging the scrub *starts* message before the scrub
reservation is done by all acting set OSDs, log the *starts* message
when active scrubbing starts for the PG. The reservation period is
expected to take up to a few milliseconds, and the scrubbing itself
consumes most of the scrub period.
Fixes: https://tracker.ceph.com/issues/62669
Signed-off-by: Prashant D <pdhange@redhat.com>
Tobias Urdin [Mon, 7 Aug 2023 20:34:43 +0000 (20:34 +0000)]
rgw/auth: handle HTTP OPTIONS with v4 auth
This adds code to properly verify the signature for HTTP OPTIONS calls
that are CORS preflight requests, which pass the expected method in the
access-control-request-method header.
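The core idea, as a hedged sketch (assumed helper name; the actual code
lives in rgw's auth engine):
```python
def method_for_sigv4(http_method: str, headers: dict) -> str:
    # For a CORS preflight, verify the signature against the method the
    # client actually signed, forwarded by the browser in the
    # Access-Control-Request-Method header, rather than "OPTIONS" itself.
    if http_method == "OPTIONS":
        return headers.get("access-control-request-method", http_method)
    return http_method
```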
Rishabh Dave [Mon, 11 Sep 2023 09:55:46 +0000 (15:25 +0530)]
doc/cephfs: write cephfs commands fully in docs
We write CephFS commands incompletely in the docs. For example, "ceph
tell mds.a help" is written simply as "tell mds.a help". This might
confuse the reader, and it won't hurt to write the command in full.
Fixes: https://tracker.ceph.com/issues/62791
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit e63b573d3edc272d83ee1b5eb3dace037f762d87)
* refs/pull/51045/head:
qa: Add test for per-module finisher thread
qa: allow check_counter to look at nested keys
qa: allow specifying min for check-counter
mgr: Add one finisher thread per module
Ramana Raja [Mon, 14 Aug 2023 15:27:13 +0000 (11:27 -0400)]
librbd/ImageWatcher: kick-start ExclusiveLock state machine
... that is stalled waiting for the lock, and let it detect client
blocklisting. This propagates the blocklist error to the caller
requesting an operation that needs the exclusive lock.
When a caller requests a librbd operation that requires the exclusive
lock, the librbd client checks whether the exclusive lock is
held by another client. If the lock is held by another client, librbd
stalls its ExclusiveLock state machine and through its ImageWatcher
notifies the lock owner that it wants the exclusive lock. After
receiving the response from the lock owner, the ImageWatcher schedules
another lock request. Meanwhile, if the client gets blocklisted, the
ImageWatcher fails to schedule another lock request and returns. The
ExclusiveLock state machine remains stalled and the blocklist error is
not propagated to the caller. Instead, when scheduling another lock
request, make the ImageWatcher call the ExclusiveLock state machine's
peer notification handler if the client is blocklisted. This allows
the ExclusiveLock state machine to detect that the client has been
blocklisted in its send_acquire_lock() member function and propagate
the blocklist error to the caller.