Melissa Li [Thu, 26 May 2022 18:07:30 +0000 (14:07 -0400)]
mgr/dashboard: add rbd status endpoint
Show "No RBD pools available" error page when accessing block/rbd if there are no rbd pools.
Add a "button_name" and "button_route" property to `ModuleStatusGuardService` config to customize the button on the error page.
Modify `ModuleStatusGuardService` to execute API calls to `/ui-api/<uiApiPath>/status` which uses the `UIRouter`.
Fixes: https://tracker.ceph.com/issues/42109
Signed-off-by: Melissa Li <melissali@redhat.com>
(cherry picked from commit 6ac9b3cfe171a8902454ea907b3ba37d83eda3dc)
Nizamudeen A [Mon, 6 Jun 2022 05:51:29 +0000 (11:21 +0530)]
mgr/dashboard: configure rbd mirroring
One-click button, in the case of an orch cluster, for configuring
rbd-mirroring when it's not properly set up. This button will create an
rbd-mirror service and also an rbd-labelled pool (replicated: size-3) if they
do not already exist.
Fixes: https://tracker.ceph.com/issues/55646
Signed-off-by: Nizamudeen A <nia@redhat.com>
Conflicts:
src/pybind/mgr/dashboard/frontend/src/app/core/error/error.component.html
src/pybind/mgr/dashboard/frontend/src/app/shared/services/module-status-guard.service.ts
src/pybind/mgr/dashboard/tests/test_rbd_mirroring.py
error component and module-status-guard.service.ts had one minor
conflict which was resolved by getting incoming changes (add missing
logic).
test_rbd_mirroring had an import conflict which was resolved by
accepting incoming changes which had the same imports with new ones.
Pere Diaz Bou [Mon, 30 May 2022 14:10:11 +0000 (16:10 +0200)]
mgr/dashboard: move replaying images to Syncing tab
Images in the 'Replaying' state will be displayed in the Syncing tab. syncTmpl
was removed, as it is unnecessary if the state is provided by the backend.
Replaying images, in contrast to Syncing images, don't have a progress
percentage. Nevertheless, we have an approximation of how much time is left
until the image is fully synced, so we can use seconds_until_synced to
represent the progress.
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
Pere Diaz Bou [Fri, 13 May 2022 15:15:33 +0000 (17:15 +0200)]
mgr/dashboard: snapshot mirroring from dashboard
Enable snapshot mirroring from Pools -> Image.
Also show the mirror-snapshot in the image where snapshot mode is enabled.
When parsing images, if an image has snapshot mode enabled, it will
try to run commands that don't work with that mode. The solution is
to not run those commands for now and to append the mode in the get call.
Fixes: https://tracker.ceph.com/issues/55648
Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
Signed-off-by: Nizamudeen A <nia@redhat.com>
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
(cherry picked from commit 489a385a95d6ffa5dbd4c5f9c53c1f80ea179142)
Ilya Dryomov [Sun, 26 Jun 2022 11:05:09 +0000 (13:05 +0200)]
librbd: update progress for non-existent objects on deep-copy
As a side effect of commit e5a21e904142 ("librbd: deep-copy image copy
state machine skips clean objects"), handle_object_copy() stopped being
called for non-existent objects. This broke progress_object_no logic,
which expects to "see" all object numbers so that update_progress()
callback invocations can be ordered. Currently update_progress() based
progress reporting gets stuck after encountering a hole in the image.
To fix, arrange for handle_object_copy() to be called for all object
numbers, even if ObjectCopyRequest isn't created. Defer the extra call
to the image work queue to avoid locking issues.
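The ordering problem can be illustrated with a minimal Python sketch. It is not the librbd code: `ProgressTracker`, `copy_image` and the `notify_holes` switch are hypothetical names, and the model only captures the idea that progress is reported in object-number order, so an object number that is never reported stalls everything after it.

```python
class ProgressTracker:
    """Reports progress in object-number order (hypothetical model)."""

    def __init__(self):
        self.next_expected = 0   # lowest object number not yet seen
        self.completed = set()   # numbers seen ahead of next_expected
        self.reported = []       # values passed to the progress callback

    def handle_object_copy(self, object_no):
        self.completed.add(object_no)
        # Advance only across a contiguous run of completed numbers.
        while self.next_expected in self.completed:
            self.completed.remove(self.next_expected)
            self.next_expected += 1
            self.reported.append(self.next_expected)


def copy_image(existing_objects, object_count, notify_holes):
    tracker = ProgressTracker()
    for object_no in range(object_count):
        # Before the fix, non-existent objects were never reported.
        if object_no in existing_objects or notify_holes:
            tracker.handle_object_copy(object_no)
    return tracker.reported


# Object 2 is a hole in a 5-object image.
stuck = copy_image({0, 1, 3, 4}, 5, notify_holes=False)
fixed = copy_image({0, 1, 3, 4}, 5, notify_holes=True)
```

In this model `stuck` stops growing at the hole, while `fixed` reaches the full object count because every object number is handed to the tracker.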
Zac Dover [Wed, 29 Jun 2022 12:57:13 +0000 (22:57 +1000)]
doc/index.rst: add link to Dev Guide basic workfl.
This PR adds a link to the "Basic Workflow" section of the
Developer Guide on the landing page of docs.ceph.com.
This PR is meant to improve the documentation for developers
new to Ceph and to guide them to instructions that will allow
them to become full-fledged contributors to the Ceph project
as quickly as possible.
The "Basic Workflow" page of the Developer Guide contains
information that answers almost all of the questions that I had
about contributing to the Ceph project when I was new to it,
and I am finally acting on my long-held conviction that the
"Basic Workflow" page of the Developer Guide should have a more
prominent position in the documentation suite than it has had.
Laura Flores [Thu, 9 Jun 2022 18:55:48 +0000 (18:55 +0000)]
test/librados: modify LibRadosMiscConnectFailure.ConnectFailure to comply with new seconds unit
The unit type for `client_mount_timeout` was changed from "float" to "secs" in 983b10506dc8466a0e47ff0d320d480dd09999ec. To make this test comply with the new
seconds unit change, we need to change the value to an integer, as seconds
does not accept float values.
Fixes: https://tracker.ceph.com/issues/55971
Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit f357459e6b159229ad40491709f756b06a6e87f1)
Signed-off-by: Neeraj Pratap Singh <neesingh@redhat.com>
client: Inode::hold_caps_until is time from monotonic clock now.
Inode::hold_caps_until now stores time from ceph::coarse_mono_clock.
The upstream code of this PR, i.e. the parent PR, contains the file
`src/common/options/mds-client.yaml.in`, which is intended to fix a part
of this PR, but this file doesn't exist in the pacific branch. So those
changes to `src/common/options/mds-client.yaml.in` are incorporated in
the below mentioned files in Conflicts.
Fixes: https://tracker.ceph.com/issues/52982
Signed-off-by: Neeraj Pratap Singh <neesingh@redhat.com>
(cherry picked from commit 983b10506dc8466a0e47ff0d320d480dd09999ec)
Zac Dover [Tue, 21 Jun 2022 14:09:05 +0000 (00:09 +1000)]
doc/dev: add context note to dev guide config
This PR adds a note directing first-time cloners of
their Ceph git forks to make sure to cd into the ceph/
directory before trying to run the "git config" commands.
Fix an endianness issue with WriteLogCacheEntry encoding. Abandon the
use of bit-fields in the union. Instead, apply the '&' operation to the
whole union field (flags) to get the bit information.
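As a rough illustration of the mask-based approach, here is a minimal Python sketch. The flag names and bit positions are hypothetical (the commit message does not list them), and the real code is C++. Bit-field layout is implementation-defined in C and C++, which is presumably why encoding the whole flags value and testing bits with '&' is the more portable choice.

```python
import struct

# Hypothetical flag bits; the real WriteLogCacheEntry flags are not
# given in the commit message.
FLAG_ENTRY_VALID = 1 << 0
FLAG_SYNC_POINT = 1 << 1
FLAG_SEQUENCED = 1 << 2


def encode_flags(flags):
    # A single unsigned byte has no byte order, and the bit positions are
    # fixed by the masks above rather than by a compiler's bit-field layout.
    return struct.pack("B", flags)


def decode_flags(data):
    (flags,) = struct.unpack("B", data)
    return flags


def has_flag(flags, mask):
    # '&' against the whole flags value extracts the bit information.
    return (flags & mask) != 0


raw = encode_flags(FLAG_ENTRY_VALID | FLAG_SEQUENCED)
flags = decode_flags(raw)
```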
Ilya Dryomov [Sat, 18 Jun 2022 13:25:49 +0000 (15:25 +0200)]
rbd-mirror: spell out "remote image is not primary" status correctly
There is a difference: non-primary means NON_PRIMARY promotion state,
while "not primary" can refer to any of NON_PRIMARY, ORPHAN or UNKNOWN
promotion states.
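The distinction can be written down as a small Python sketch. The enum mirrors the four promotion states named in these commit messages, but the class and helper names are illustrative and are not the rbd-mirror API.

```python
from enum import Enum


class PromotionState(Enum):
    UNKNOWN = "unknown"
    PRIMARY = "primary"
    NON_PRIMARY = "non-primary"
    ORPHAN = "orphan"


def is_non_primary(state):
    # "non-primary": exactly the NON_PRIMARY promotion state.
    return state is PromotionState.NON_PRIMARY


def is_not_primary(state):
    # "not primary": any of NON_PRIMARY, ORPHAN or UNKNOWN.
    return state is not PromotionState.PRIMARY
```

For an ORPHAN image, `is_not_primary()` is true while `is_non_primary()` is false, which is the case the corrected status message needs to cover.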
Ilya Dryomov [Sat, 18 Jun 2022 11:00:34 +0000 (13:00 +0200)]
rbd-mirror: fix up PrepareReplayDisconnected test case
It was botched in commit 2bca9ee96c65 ("rbd-mirror: consolidate
prepare local/remote image steps to bootstrap") and went unnoticed
because currently no special handling is needed for disconnected
clients -- is_disconnected() check happens to be the last step
and it doesn't generate an error.
Ilya Dryomov [Mon, 20 Jun 2022 12:19:41 +0000 (14:19 +0200)]
rbd-mirror: generally skip replay/resync if remote image is not primary
Replay and resync should generally be skipped if the remote image is
not primary.
If this is not done for replay, snapshot-based mirroring can run into
a livelock if the primary image is demoted while a mirror snapshot is
being synced. On the demote site, rbd-mirror would pick up the just
demoted image, grab the exclusive lock on it and idle waiting for a new
mirror snapshot to be created. On the (still) non-primary site,
rbd-mirror would eventually finish syncing that mirror snapshot and
attempt to unlink from it on the demote site. These attempts would
fail with EROFS due to exclusive lock being held in the "refuse proxied
maintenance operations" mode, blocking forward progress (syncing of the
demotion snapshot so that the non-primary image can be orderly promoted
to primary, etc).
If this is not done for resync, data loss can ensue as the just demoted
image would be immediately trashed, underneath the non-primary site that
is still syncing.
Currently this is done in PrepareReplayRequest only for journal-based
mirroring. Note that it is conditional: if the local image is linked
to the remote image, proceeding is desirable.
Generalize this check, consolidate it with a related check in
PrepareRemoteImageRequest and move the result to BootstrapRequest to
cover both "local image does not exist" and "local image is unlinked"
cases for both modes.
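A minimal sketch of the generalized check, assuming it can be reduced to three booleans. `should_skip` and its parameters are hypothetical names and do not correspond to the BootstrapRequest interface.

```python
def should_skip(remote_is_primary, local_exists, local_is_linked):
    """Return True if replay/resync should be skipped (hypothetical model)."""
    if remote_is_primary:
        return False
    # The remote image is not primary: proceed only when an existing local
    # image is linked to it. Both "local image does not exist" and
    # "local image is unlinked" lead to skipping.
    return not (local_exists and local_is_linked)
```

The same function is meant to apply to journal-based and snapshot-based mirroring alike, which is the point of moving the check to a common place.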
Ilya Dryomov [Sat, 18 Jun 2022 10:35:51 +0000 (12:35 +0200)]
rbd-mirror: strengthen is_local_primary() and is_linked()
Initialize local_promotion_state and remote_promotion_state to UNKNOWN
instead of counterintuitive PRIMARY and NON_PRIMARY -- half the time the
final values are flipped. Then is_local_primary() and is_linked() can
be strengthened as a non-existent image should stay in UNKNOWN.
Kotresh HR [Fri, 18 Mar 2022 06:43:53 +0000 (12:13 +0530)]
mgr/volumes: Disable quota for mgr libcephfs connection
This is done to give the 'mgr' libcephfs connection the right to bypass
quota. The mgr/volumes plugin maintains configuration files
within the directory where the user has enforced quota. So
when the quota is met, certain mgr/volumes APIs don't work as
intended, e.g., when the subvolumegroup quota is met, the group's
subvolume removal with '--retain-snapshots' fails.
Conflicts:
src/pybind/mgr/volumes/fs/operations/group.py
- Updates in definition of create_groups
src/pybind/mgr/volumes/fs/volume.py
- Added set_group_attrs in import list and split long line
Xiubo Li [Wed, 8 Jun 2022 05:00:20 +0000 (13:00 +0800)]
qa: wait rank 0 to become up:active state before mounting fuse client
When setting the ec pool to the layout, the filesystem may not be
ready yet, so mounting a fuse client will fail. To fix this we
need to wait at least until rank 0 is in the up:active state.
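The waiting step amounts to polling until a state is reached or a timeout expires. The sketch below is a generic Python helper, not the qa framework code: `wait_for_state` and the `get_state` callable are hypothetical, and the iterator stands in for querying the state of rank 0.

```python
import time


def wait_for_state(get_state, wanted="up:active", timeout=30.0, interval=0.01):
    """Poll get_state() until it returns `wanted` or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if get_state() == wanted:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for querying rank 0: it becomes active on the third poll.
_states = iter(["up:creating", "up:creating", "up:active"])
ready = wait_for_state(lambda: next(_states))
```

Only after `ready` is true would the fuse client be mounted.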
Xiubo Li [Fri, 27 May 2022 05:11:44 +0000 (13:11 +0800)]
client: choose auth MDS for getxattr with the Xs caps
If any 'x' caps are issued, we can just choose the auth MDS instead
of a random replica MDS, because the loner client can get the 'x' caps
only when the Locker is in the LOCK_EXEC state. If we send the getattr
request to a replica MDS, it must auth pin and try to rdlock from the
auth MDS, and the auth MDS then needs to transition the Locker state
to LOCK_SYNC. After that, the lock state will change back.
The Locker state transition is costly and usually requires revoking
caps from clients.
For getxattr with the 'Xs' caps we will also choose the auth MDS,
because the MDS-side code is buggy: setxattr doesn't notify the
replica MDSes of the increased xattr_version when the values change,
so a replica MDS will return the old xattr_version value. The
client will then just drop the xattr values since it sees that the
xattr_version hasn't changed.
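The client-side effect of a stale xattr_version can be modeled in a few lines of Python. This is a simplified sketch under the assumption that the client accepts xattr values only when the version moves forward. `ClientInode` and `handle_reply` are hypothetical names, not the Ceph client API.

```python
class ClientInode:
    """Hypothetical client-side cache of one inode's xattrs."""

    def __init__(self):
        self.xattr_version = 0
        self.xattrs = {}

    def handle_reply(self, xattr_version, xattrs):
        # Accept the values only if the version moved forward;
        # otherwise they are dropped.
        if xattr_version > self.xattr_version:
            self.xattr_version = xattr_version
            self.xattrs = dict(xattrs)
            return True
        return False


inode = ClientInode()
inode.handle_reply(1, {"user.a": "old"})

# A replica that was never told about the setxattr still reports version 1,
# so its reply is dropped. The auth MDS reports the bumped version.
accepted_from_replica = inode.handle_reply(1, {"user.a": "from-replica"})
accepted_from_auth = inode.handle_reply(2, {"user.a": "new"})
```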
Xiubo Li [Wed, 16 Mar 2022 09:15:57 +0000 (17:15 +0800)]
mds, client: only send the metrices supported by MDSes
For old ceph clusters, clients won't send any metrics to them by
default unless the clusters have backported this commit, but the
'client_collect_and_send_global_metrics' option can still be used
to enable it manually.
This fixes the crash bug when upgrading from old ceph clusters,
where the MDSes crash once they receive unknown metrics.
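The idea of sending only what the MDS supports can be sketched as a simple filter. This is an illustration in Python, not the client code: `filter_metrics` is a hypothetical helper and the metric names are placeholders, not real Ceph metric identifiers.

```python
def filter_metrics(collected, supported):
    """Keep only the metrics the MDS has advertised support for."""
    return {name: value for name, value in collected.items() if name in supported}


# Placeholder metric names, not the real Ceph metric identifiers.
collected = {"metric_a": 10, "metric_b": 3, "metric_new": 7}
old_mds_supports = {"metric_a", "metric_b"}
payload = filter_metrics(collected, old_mds_supports)
```

With an empty supported set the payload is empty, which corresponds to the default of sending nothing to clusters that have not advertised any supported metrics.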
mgr/cephadm: adding logic to close ports when removing a daemon
Fixes: https://tracker.ceph.com/issues/52906
Signed-off-by: Redouane Kachach <rkachach@redhat.com>
(cherry picked from commit 4deb546ffd67ac8f05d2788150764a26b5671b87)
Redouane Kachach [Tue, 31 May 2022 10:11:03 +0000 (12:11 +0200)]
mgr/cephadm: check if a service exists before trying to restart it
Fixes: https://tracker.ceph.com/issues/55800
Signed-off-by: Redouane Kachach <rkachach@redhat.com>
(cherry picked from commit 6b76753c3cabf9663fa1daa47c7bcb7df110a94c)
Redouane Kachach [Tue, 31 May 2022 10:59:26 +0000 (12:59 +0200)]
mgr/cephadm: capture exception when not able to list upgrade tags
Fixes: https://tracker.ceph.com/issues/55801
Signed-off-by: Redouane Kachach <rkachach@redhat.com>
(cherry picked from commit 0e7a4366c0c1edd74d52acad5ed4dc3df0ef7679)
Ilya Dryomov [Sun, 19 Jun 2022 10:12:01 +0000 (12:12 +0200)]
mgr/rbd_support: always rescan image mirror snapshots on refresh
Establishing a watch on rbd_mirroring object and skipping rescanning
image mirror snapshots on periodic refresh unless rbd_mirroring object
gets notified in the interim is flawed. rbd_mirroring object is
notified when mirroring is enabled or disabled on some image (including
when the image is removed), but it is not notified when images are
promoted or demoted. However, load_pool_images() discards images that
are not primary at the time of the scan. If the image is promoted
later, no snapshots are created even if the schedule is in place. This
happens regardless of whether the schedule is added before or after the
promotion.
This effectively reverts commit 69259c8d3722 ("mgr/rbd_support: make
mirror_snapshot_schedule rescan only updated pools"). An alternative
fix could be to stop discarding non-primary images (i.e. drop
    if not info['primary']:
        continue
check added in commit d39eb283c5ce ("mgr/rbd_support: mirror snapshot
schedule should skip non-primary images")), but that would clutter the
queue and therefore "rbd mirror snapshot schedule status" output with
bogus entries. Performing a rescan roughly every 60 seconds should be
manageable: currently it amounts to a single mirror_image_status_list
request, followed by mirror_image_get, get_snapcontext and snapshot_get
requests for each snapshot-based mirroring enabled image and concluded
by a single dir_list request. Among these, per-image get_snapcontext
and snapshot_get requests are necessary for determining primaryness.
Ilya Dryomov [Fri, 17 Jun 2022 12:03:20 +0000 (14:03 +0200)]
mgr/rbd_support: avoid losing a schedule on load vs add race
If load_schedules() (i.e. periodic refresh) races with add_schedule()
invoked by the user for a fresh image, that image's schedule may get
lost until the next rebuild (not refresh!) of the queue:
1. periodic refresh invokes load_schedules()
2. load_schedules() creates a new Schedules instance and loads
   schedules from rbd_mirror_snapshot_schedule object
3. add_schedule() is invoked for a new image (an image that isn't
   present in self.images) by the user
4. before load_schedules() can grab self.lock, add_schedule() commits
   the new schedule to rbd_mirror_snapshot_schedule object and adds it
   to self.schedules
5. load_schedules() grabs self.lock and reassigns self.schedules with
   Schedules instance that is now stale
6. periodic refresh invokes load_pool_images() which discovers the new
   image; eventually it is added to self.images
7. periodic refresh invokes refresh_queue() which attempts to enqueue()
   the new image; this fails because a matching schedule isn't present
The next periodic refresh recovers the discarded schedule from
rbd_mirror_snapshot_schedule object but no attempt to enqueue() that
image is made since it is already "known" at that point. Despite the
schedule being in place, no snapshots are created until the queue is
rebuilt from scratch or rbd_support module is reloaded.
To fix that, extend self.lock critical sections so that add_schedule()
and remove_schedule() can't get stepped on by load_schedules().
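The race and the fix can be reproduced deterministically in a small Python model. This is a sketch, not the rbd_support module: `ScheduleHandler`, `load_schedules_racy`, `load_schedules_fixed` and the `between` hook are hypothetical, and a plain dict stands in for the rbd_mirror_snapshot_schedule object.

```python
import threading


class ScheduleHandler:
    def __init__(self):
        self.lock = threading.Lock()
        self.stored = {}      # stands in for rbd_mirror_snapshot_schedule
        self.schedules = {}   # in-memory view used when enqueueing images

    def add_schedule(self, image, interval):
        # Commit and in-memory update happen under the lock.
        with self.lock:
            self.stored[image] = interval
            self.schedules[image] = interval

    def load_schedules_racy(self, between=None):
        loaded = dict(self.stored)     # step 2: read outside the lock
        if between:
            between()                  # steps 3-4: add_schedule() sneaks in
        with self.lock:
            self.schedules = loaded    # step 5: the stale view wins

    def load_schedules_fixed(self):
        # Read and reassign inside one critical section.
        with self.lock:
            self.schedules = dict(self.stored)


racy = ScheduleHandler()
racy.load_schedules_racy(between=lambda: racy.add_schedule("pool/img", "1h"))
lost = "pool/img" not in racy.schedules

fixed = ScheduleHandler()
fixed.add_schedule("pool/img", "1h")
fixed.load_schedules_fixed()
kept = "pool/img" in fixed.schedules
```

In the racy variant the schedule is still present in `stored` but missing from `schedules`, matching the description of a schedule that exists on disk yet produces no snapshots.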
Ilya Dryomov [Fri, 17 Jun 2022 08:28:55 +0000 (10:28 +0200)]
mgr/rbd_support: refresh schedule queue immediately after delay elapses
The existing logic often leads to refresh_pools() and refresh_images()
being invoked after a 120 second delay instead of after an intended 60
second delay.