Jason Dillaman [Wed, 5 Aug 2020 13:12:41 +0000 (09:12 -0400)]
librbd: migration abort should revert data back to the original image
If the migration destination image was modified and then the migration
was aborted, we need to copy the data back to the source image to avoid
losing data. For simplicity we will only revert the HEAD revision state
and will not attempt to copy new snapshots on the destination image
back to the source.
Fixes: https://tracker.ceph.com/issues/41394
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 5bd15da8be09a4e7644d411a0b0c132e5b795393)
We want to prevent the destination image from being used while an
abort is in-progress. Test that the image has no watchers prior to
permitting the abort, switch the migration state to ABORTING, and
treat the image as read-only if the migration state is ABORTING.
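The guard described above can be sketched roughly as follows. This is a minimal Python illustration and not librbd code: `MigrationState`, `Image`, and `begin_abort` are hypothetical names, and only the ABORTING state comes from the message.

```python
from dataclasses import dataclass, field
from enum import Enum


class MigrationState(Enum):  # hypothetical subset of migration states
    EXECUTING = "executing"
    ABORTING = "aborting"


@dataclass
class Image:  # hypothetical stand-in for the destination image
    state: MigrationState = MigrationState.EXECUTING
    watchers: list = field(default_factory=list)

    @property
    def read_only(self) -> bool:
        # An image whose migration is being aborted must not accept writes.
        return self.state is MigrationState.ABORTING


def begin_abort(image: Image) -> None:
    # Refuse to abort while any client still watches the image.
    if image.watchers:
        raise RuntimeError("image has watchers; cannot abort migration")
    image.state = MigrationState.ABORTING
```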
Jason Dillaman [Wed, 5 Feb 2020 20:27:39 +0000 (15:27 -0500)]
librbd: ensure deep-copy snapshot map includes all destination snap ids
When deep-copying from an arbitrary start snapshot id, the snap sequence
will be missing all older snapshots. Additionally, snapshot types that
are not deep-copied still need to be included in the destination snap
map.
Jason Dillaman [Wed, 5 Feb 2020 19:23:53 +0000 (14:23 -0500)]
librbd: deep-copy snapshots from a specified start/end position
Allow the snapshots to be arbitrarily copied from any source image
start/end snapshot ids. If the end snapshot is not a user-snapshot,
it will associate to the destination image HEAD revision.
Conflicts:
src/librbd/deep_copy/SnapshotCopyRequest.cc: different lock types
src/test/librbd/deep_copy/test_mock_SnapshotCopyRequest.cc: no mirror snapshot namespaces
Jason Dillaman [Wed, 5 Feb 2020 15:42:27 +0000 (10:42 -0500)]
librbd: deep-copy should accept a lower-bound for the destination snap_id
For snapshot-based mirroring, we will want to prevent the modification of
snapshots below the last sync snapshot and to prevent the copying of data
below that lower-bound as well. This commit just adds the new parameter and
future commits will update the snapshot and object copy behavior.
Ilya Dryomov [Sat, 29 Aug 2020 10:02:30 +0000 (12:02 +0200)]
msg/async/ProtocolV2: allow rxbuf/txbuf get bigger in testing
We have a kernel client test case that constructs huge auth tickets
to exercise the three related code paths in the kernel. One of the
tickets is bigger than 1000000 bytes, as required for triggering the
third code path.
We haven't bumped into this assert earlier because the kernel client
is still on msgr v1. However, "rbd map" and "rbd unmap" commands
started connecting to the cluster in commit 96f05a7956b3 ("rbd: delay
determination of default pool name") and that happens via msgr v2.
Satoru Takeuchi [Fri, 22 May 2020 01:45:32 +0000 (01:45 +0000)]
ceph-volume: show correct rejected reason in inventory if device type is not acceptable
If device type is not acceptable in `c-v inventory`, its rejected reason
becomes "Insufficient space (<5GB)" by mistake. It's because sys_api is
empty due to skipping devices that are neither `disk` nor `device`. We
should report the target device is not acceptable in this case.
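A minimal sketch of the intended ordering of checks, assuming the device type is validated before the size. `rejected_reasons` is a hypothetical helper, the "not acceptable" wording is illustrative, and reading 5GB as 5 * 1024**3 bytes is an assumption.

```python
ACCEPTED_TYPES = ("disk", "device")  # types named in the message above
MIN_SIZE = 5 * 1024 ** 3             # assumed byte value of the 5GB threshold


def rejected_reasons(device_type, size_bytes):
    """Return the reasons a device is rejected (illustrative wording)."""
    if device_type not in ACCEPTED_TYPES:
        # Report the real cause and stop: no size data exists for
        # device types that were skipped while scanning.
        return ["Device type is not acceptable"]
    reasons = []
    if size_bytes < MIN_SIZE:
        reasons.append("Insufficient space (<5GB)")
    return reasons
```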
Andrew Schoen [Fri, 4 Sep 2020 14:44:49 +0000 (09:44 -0500)]
ceph-volume: simple scan should ignore tmpfs
When simple scan is run against a ceph-volume
OSD, util.encryption.legacy_encrypted returns
tmpfs. We want to avoid creating a Device
object with tmpfs and ignore the OSD as it's
not a ceph-disk created OSD.
mon/OSDMonitor: only take in osd into consideration when trimming osdmaps
We should not take down OSDs into consideration when trimming osdmaps.
In e62269c892, we decreased the upper bound of the range of osdmaps to be
trimmed if the given osd is out, but we should decrease it only if the
osd in question is still *in*.
So, in this change, the min_lec is decreased only if the osd in question
is *in*.
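The rule can be illustrated with a small Python sketch. It is not the OSDMonitor code: `min_last_epoch_clean` and its arguments are hypothetical, and `floor` stands in for the upper bound before per-OSD adjustment.

```python
def min_last_epoch_clean(last_epoch_clean, in_osds, floor):
    """Lower `floor` only for OSDs that are still in.

    last_epoch_clean maps an osd id to its last clean epoch;
    in_osds is the set of osd ids currently marked in.
    """
    min_lec = floor
    for osd, lec in last_epoch_clean.items():
        if osd in in_osds:
            min_lec = min(min_lec, lec)
    return min_lec
```

With `last_epoch_clean = {0: 90, 1: 40}` and only osd 0 in, the stale epoch of osd 1 no longer holds back trimming.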
Adam Kupczyk [Wed, 1 Jul 2020 21:09:17 +0000 (23:09 +0200)]
os/bluestore: Add documentation for large bluefs log recovery
Adds additional paragraph to ceph-bluestore-tool documentation,
describing how to use *special* options --bluefs_replay_recovery
and --bluefs_replay_recovery_disable_compact to recover large
bluefs log.
Fixes: https://tracker.ceph.com/issues/46714
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
Sébastien Han [Tue, 18 Aug 2020 13:41:31 +0000 (15:41 +0200)]
ceph-volume: retry when acquiring lock fails
When preparing the osd device with --mkfs, the ceph-osd binary tries to
acquire an exclusive lock on the device (soon to become an OSD).
Unfortunately, when running in containers, we have seen cases where
there is a race between ceph-osd and systemd-udevd to acquire a lock on
the device. Sometimes systemd-udevd gets the lock and releases it soon
enough for ceph-osd to acquire it; at other times the lock is still held,
and because ceph-osd uses LOCK_NB the command fails.
This commit retries if the lock cannot be acquired, up to 5 times for 5
seconds. This should be more than enough to acquire the lock and
proceed with the OSD mkfs.
Unfortunately, the race is so transient that locking earlier from c-v
would not help.
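A minimal sketch of such a retry loop, assuming 5 attempts spaced one second apart. `run_with_retry` and `run_once` are hypothetical names, and treating an EWOULDBLOCK exit code as "lock busy" is an assumption of this sketch rather than something the message states.

```python
import errno
import time


def run_with_retry(run_once, attempts=5, delay=1.0, sleep=time.sleep):
    """Retry run_once() while it reports that the device lock is busy.

    run_once is a hypothetical callable returning a process exit code.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, attempts + 1):
        returncode = run_once()
        if returncode == 0:
            return attempt
        # Give up on any other error, or once the attempts are used up.
        if returncode != errno.EWOULDBLOCK or attempt == attempts:
            raise RuntimeError("command failed with exit code %d" % returncode)
        sleep(delay)
```

Passing `sleep` as a parameter keeps the loop testable without real delays.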
Patrick Donnelly [Thu, 27 Aug 2020 20:43:01 +0000 (13:43 -0700)]
Merge PR #36833 into nautilus
* refs/pull/36833/head:
nautilus: mgr/volumes: convert uid and gid to integer type
mgr/volumes: Address python breakage in python 2
mgr/volumes: Update doc/cephfs/fs-volumes.rst for nautilus
mgr/volumes: Prevent subvolume recreate if trash is not-empty
mgr/volumes: Disallow subvolume group level snapshots
mgr/volumes: Add test case to ensure subvolume is marked
mgr/volumes: handle idempotent subvolume marks
mgr/volumes: Tests amended and added to ensure subvolume trash functionality
mgr/volumes: Mark subvolume root with the vxattr ceph.dir.subvolume
mgr/volumes: Move incarnations for v2 subvolumes, to subvolume trash
mgr/volumes: maintain per subvolume trash directory
mgr/volumes: make subvolume_v2::_is_retained() object property
mgr/volumes: Use snapshot root directory attrs when creating clone root
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Venky Shankar [Fri, 21 Aug 2020 14:07:37 +0000 (10:07 -0400)]
mgr/volumes: maintain per subvolume trash directory
PR https://github.com/ceph/ceph/pull/36472 introduces changes
that disallow nested snapshots in a subtree (subvolume)
and renames across subvolumes. This affects asynchronous purge
in mgr/volumes as subvolumes are moved to a trash directory for
asynchronous deletion by purge threads.
To work around this, start maintaining a subvolume specific
trash directory. Use the trash directory as an index to the
subvolume specific trash directory entry.
This changes subvolume deletion logic which currently relies
on `--retain-snapshots` flag to decide if the subvolume user
directory should get purged or the subvolume base directory
itself. Deleting a subvolume moves the user facing directory
to its specific trash directory. Purge threads take care of
deleting user facing directories (in trash) and the subvolume
base directory if required (when certain conditions are met).
mgr/volumes: Use snapshot root directory attrs when creating clone root
If a subvolume's mode or uid/gid values are changed after a snapshot is taken,
and a clone of a snapshot prior to the change is initiated, the clone
inherits the current source subvolume's attributes, rather than the
snapshot's attributes.
Fix this by using the snapshot's subvolume root attributes to create
the clone subvolume's root.
The following attributes are picked from the source subvolume snapshot:
- uid, gid, mode, data pool, pool namespace, quota
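As a rough illustration, the selection of attributes could look like the sketch below. `CLONE_ATTRS` and `clone_root_attrs` are hypothetical names and the dictionary keys are illustrative spellings of the attributes listed above; this is not the mgr/volumes implementation.

```python
# Attribute names follow the list in the message; key spellings are assumed.
CLONE_ATTRS = ("uid", "gid", "mode", "data_pool", "pool_namespace", "quota")


def clone_root_attrs(snapshot_attrs):
    """Pick the attributes the clone root takes from the snapshot root."""
    return {
        name: snapshot_attrs[name]
        for name in CLONE_ATTRS
        if name in snapshot_attrs
    }
```

The point is that the input is the snapshot's view of the subvolume root, so later changes to the live subvolume do not leak into the clone.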
Patrick Donnelly [Wed, 26 Aug 2020 23:35:08 +0000 (16:35 -0700)]
Merge PR #36804 into nautilus
* refs/pull/36804/head:
qa/workunits/fs: add test for subvolume
mds: don't move inode with nlink > 1 to global snaprealm if it's in subvolume
mds: disallow hardlink across subvolume
mds: disallow across subvolume rename
mds: disallow creating snapshot on descendent directory of subvolume
mds: add vxattr that marks/clears subvolume flag
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Tatjana Dehler [Tue, 28 Jul 2020 11:18:56 +0000 (13:18 +0200)]
mgr/dashboard: wait longer for health status to be cleared
Because of reasons the cluster needs more time to recover from
HEALTH_WARN while changes are made by `test_pool_update_metadata`.
Let's wait several times for the cluster status to be HEALTH_OK
again.
Fixes: https://tracker.ceph.com/issues/46573
Signed-off-by: Tatjana Dehler <tdehler@suse.com>
(cherry picked from commit 739b365a3f6be9557ccb784819d4ad9ad524880f)
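Waiting several times for HEALTH_OK amounts to a bounded polling loop. The sketch below is a generic illustration, not the dashboard test helper: `wait_for_health_ok`, `get_status`, and the retry count and interval are all assumptions.

```python
import time


def wait_for_health_ok(get_status, retries=10, interval=1.0, sleep=time.sleep):
    """Poll get_status() until it returns HEALTH_OK or retries run out.

    get_status is a hypothetical callable returning the cluster
    health string, for example "HEALTH_WARN" or "HEALTH_OK".
    """
    for _ in range(retries):
        if get_status() == "HEALTH_OK":
            return True
        sleep(interval)
    return False
```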
Kefu Chai [Fri, 7 Aug 2020 16:26:21 +0000 (00:26 +0800)]
rgw: hold reloader using unique_ptr
instead of using optional<> for holding reloader, use unique_ptr<>.
as `RGWRealmReloader` is neither
trivially_copy_{assignable,constructible} nor
trivially_move_{assignable,constructible}, because of the `Cond`
member variable. but Clang++ and libc++ still try to rely on a
delegating copy constructor for constructing the
optional<RGWRealmReloader> instance even when the optional object is
created `in_place`.
in this change, to work around this issue, reloader is instead
constructed using make_unique<>.
Yaarit Hatuka [Thu, 20 Aug 2020 18:21:11 +0000 (18:21 +0000)]
mgr/devicehealth: fix daemon filtering before scraping device
Scraping health metrics of mon devices was introduced in Nautilus, then
disabled (only in Nautilus) since the 'tell' mechanism of mons was not
reliable.
This commit fixes a bug when filtering the daemons on the device to be
scraped (and allows scraping osd devices solely).
When:
$ ceph device scrape-health-metrics seagate_123
Error EAGAIN: device seagate_123 not claimed by any active OSD daemons
But:
$ ceph device ls
DEVICE HOST:DEV DAEMONS LIFE EXPECTANCY
seagate_123 hostname:sdc osd.1
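The filtering step can be pictured with the following sketch, which keeps only OSD daemons from a device's daemon list. `scrapable_daemons` is a hypothetical helper and the `osd.` prefix check is an assumption about how daemon names are matched.

```python
def scrapable_daemons(daemons):
    """Keep only OSD daemons from a device's daemon list."""
    return [name for name in daemons if name.startswith("osd.")]
```

For the device shown above, `scrapable_daemons(["osd.1"])` keeps `osd.1`, so the scrape should proceed instead of returning EAGAIN.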
test/librbd: allow parallel runs of run-rbd-unit-tests
Running all tests sequentially makes it the longest test of
`make check`, with each partial test taking around 500 sec.
Running 6 tests thus takes almost an hour.
Cut this down if ctest runs tests in parallel.
The default behaviour of src/test/run-rbd-unit-tests.sh is kept:
without parameters the tests are run in sequence.
To run unittest_librbd with RBD_FEATURES, use `N` as parameter.
Patrick Donnelly [Wed, 19 Aug 2020 20:49:00 +0000 (13:49 -0700)]
Merge PR #36448 into nautilus
* refs/pull/36448/head:
mgr: Add python-enum34 dependency to package for older distributions
mgr/volumes: Add documentation regarding --retain-snapshots option
mgr/volumes: Avoid trashing retained subvolume on create errors
mgr/volumes: Add subvolume v2 test cases
mgr/volumes: Derive v2 from v1 to leverage common methods
mgr/volumes: Introduce v2 subvolumes
mgr/volumes: Use operation type during subvolume open
Reviewed-by: Venky Shankar <vshankar@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Jason Dillaman [Tue, 28 Jul 2020 01:14:18 +0000 (21:14 -0400)]
librbd: update hidden global config when setting pool config override
The new "dev"-level global config setting will be updated when any
pool-level config override is updated. librbd clients will detect
the new global-level config update and trigger a refresh. This avoids
the need for potentially tens of thousands of librbd clients
registering a watch on the pool metadata object or periodically polling
the pool metadata object for updates.
Fixes: https://tracker.ceph.com/issues/46694
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit f45df9fe786e8057c491c082e840483759d67e9e)
Conflicts:
src/common/options.cc
- "rbd_quiesce_notification_attempts", "rbd_default_snapshot_quiesce_mode", and
"rbd_plugins" options have not been backported to Octopus, yet
Jason Dillaman [Mon, 27 Jul 2020 19:31:09 +0000 (15:31 -0400)]
librbd: initial config watcher implementation
The config watcher will initially observe all "rbd_" configuration
updates received from the MON that have not been locally overridden
at the pool and/or image level.
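One way to read the rule above is as a filter over incoming updates. The sketch is illustrative only: `keys_to_apply` is a hypothetical helper, and modelling local overrides as a set of key names is an assumption.

```python
def keys_to_apply(updates, overridden):
    """Select MON config updates a watcher would act on.

    updates maps config keys to new values; overridden is the set of
    keys already overridden at the pool and/or image level.
    """
    return {
        key: value
        for key, value in updates.items()
        if key.startswith("rbd_") and key not in overridden
    }
```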