Patrick Donnelly [Tue, 25 Feb 2020 19:04:06 +0000 (11:04 -0800)]
Merge PR #32657 into master
* refs/pull/32657/head:
test: query using mds id, not rank
mgr: re-enable mds `scrub status` info in ceph status
mon: filter out normal ceph entity types when dumping service metadata
mgr: filter out normal ceph services when processing service map
mgr: helper function to check if a service is a normal ceph service
Reviewed-by: Kefu Chai <kchai@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
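The helper mentioned in the last commit of this merge could look roughly like the following sketch. The function name, and the exact set of entity types, are assumptions for illustration, not the real mgr code:

```python
# Hypothetical sketch: core ceph daemon types already shown in `ceph status`,
# which the service map processing should skip over.
NORMAL_CEPH_SERVICES = {"mon", "mgr", "mds", "osd"}

def is_normal_ceph_service(service_name: str) -> bool:
    """Return True for normal ceph services, so callers can filter them
    out when dumping service metadata or processing the service map."""
    return service_name in NORMAL_CEPH_SERVICES
```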
Sage Weil [Tue, 25 Feb 2020 13:05:31 +0000 (07:05 -0600)]
Merge PR #33495 into master
* refs/pull/33495/head:
mgr/cephadm: do not refresh device inventory on mgr restart
mgr/cephadm: make cache invalidate less racy
mgr/cephadm: fix last_device_update persistence
Reviewed-by: Joshua Schmid <jschmid@suse.de>
Reviewed-by: Gabriel Brascher <gabriel@apache.org>
Volker Theile [Mon, 24 Feb 2020 12:03:13 +0000 (13:03 +0100)]
mgr/dashboard: Fix mypy issues and enable it by default
The decorator @no_type_check is used:
* To prevent a mypy error like 'error: INTERNAL ERROR -- Please try using mypy master on Github:'
* Because '# type: ignore' does not work in some places, e.g. on statements broken across multiple lines
Yingxin Cheng [Tue, 25 Feb 2020 08:17:50 +0000 (16:17 +0800)]
crimson/osd: hide add_blocker() and clear_blocker() in Operation
Cleanup. Hide add_blocker() and clear_blocker() as private members
because no one is using them anymore. They are not exception-safe. Also
take the chance to reorder Operation class members.
Sage Weil [Tue, 25 Feb 2020 03:43:51 +0000 (21:43 -0600)]
Merge PR #33138 into master
* refs/pull/33138/head:
common/TextTable: only pad between columns
mgr/status: align with ceph table style
mgr/osd_perf_query: make table match ceph style
mgr: adjust tables to have 2 space column separation
common/TextTable: default to 2 spaces separating columns
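The separator change in this merge amounts to padding only *between* columns, never after the last one, so rows carry no trailing whitespace. A minimal sketch of that rule (a Python stand-in, not the actual C++ TextTable code):

```python
def render_row(cells, widths, sep="  "):
    """Left-justify each cell to its column width, join columns with a
    two-space separator, and leave the last column unpadded so the row
    never ends in trailing spaces."""
    padded = [c.ljust(w) for c, w in zip(cells[:-1], widths[:-1])]
    padded.append(cells[-1])  # last column: no trailing pad
    return sep.join(padded)
```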
Neha [Tue, 25 Feb 2020 03:01:41 +0000 (03:01 +0000)]
osd/PeeringState.h: ignore RemoteBackfillReserved in WaitLocalBackfillReserved
It is possible to dequeue an outstanding RemoteBackfillReserved, though we may have
already released reservations for that backfill target. Currently, if this happens
while we are in WaitLocalBackfillReserved, it can lead to a crash on the primary.
Prevent this by treating this condition as a no-op.
The longer term fix is to add a RELEASE_ACK mechanism, which prevents the primary
from scheduling a backfill retry until all the RELEASE_ACKs have been received.
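The no-op treatment described above can be sketched as a state handler. This is a Python stand-in for the C++ peering state machine; the method name and return value are illustrative:

```python
class WaitLocalBackfillReserved:
    """Stand-in for the peering state; only the stale-event path is shown."""

    def react_remote_backfill_reserved(self, event):
        # A RemoteBackfillReserved dequeued after we already released the
        # reservations for that backfill target is stale: discard it as a
        # no-op instead of crashing the primary.
        return "discard_event"
```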
Patrick Donnelly [Tue, 25 Feb 2020 02:18:19 +0000 (18:18 -0800)]
Merge PR #33413 into master
* refs/pull/33413/head:
test: verify purge queue w/ large number of subvolumes
test: pass timeout argument to mount::wait_for_dir_empty()
mgr/volumes: access volume in lockless mode when fetching async job
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Jason Dillaman [Thu, 20 Feb 2020 18:47:18 +0000 (13:47 -0500)]
rbd-mirror: implement basic status feedback for snapshot mirroring
The feedback includes the newest remote mirror snapshot timestamp,
the newest completely synced local mirror snapshot timestamp,
and optionally the in-progress sync snapshot timestamp and
percent complete.
Fixes: https://tracker.ceph.com/issues/44103
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
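The reported fields could be assembled as in the sketch below. The key names are assumptions for illustration, not the actual rbd-mirror output format:

```python
def snapshot_mirror_status(remote_ts, local_ts, syncing_ts=None, percent=None):
    """Build the status described above: the newest remote mirror snapshot
    timestamp, the newest fully synced local mirror snapshot timestamp,
    and optionally the in-progress sync snapshot plus percent complete."""
    status = {
        "remote_snapshot_timestamp": remote_ts,
        "local_snapshot_timestamp": local_ts,
    }
    if syncing_ts is not None:
        status["syncing_snapshot_timestamp"] = syncing_ts
        status["syncing_percent"] = percent
    return status
```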
Jason Dillaman [Thu, 20 Feb 2020 17:57:26 +0000 (12:57 -0500)]
rbd-mirror: move local-to-remote snapshot lookup to common function
This will be needed by the status formatter to lookup the remote
snapshot timestamp since the associated local snapshot timestamp
will be the time the snapshot was created on the local side.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Sage Weil [Mon, 24 Feb 2020 20:40:19 +0000 (14:40 -0600)]
Merge PR #33496 into master
* refs/pull/33496/head:
mgr/cephadm: combine get_daemons_by_daemon -> get_daemons_by_service
mgr/cephadm: remove apply_mon support
mgr/cephadm: use generics for add_mon
mgr/cephadm: use _apply_service for mgrs
mgr/cephadm: refactor most daemon add methods
mgr/cephadm: refactor _update_service and all apply methods
mgr/cephadm: fix get_unique_name when name in use
Sage Weil [Sun, 23 Feb 2020 19:17:34 +0000 (13:17 -0600)]
mgr/cephadm: use _apply_service for mgrs
Note that we are losing some of the special logic about removing standby
mgrs only. This should be added back *after* we fix up the scheduler
to be more intelligent about choosing hosts that already host daemons,
and make removal pick hosts that aren't selected (by label, or by
scheduler, etc.).
A few bugs to track this:
https://tracker.ceph.com/issues/44167
https://tracker.ceph.com/issues/44252 (prefer standby mgrs *and* mdss)
Sage Weil [Sun, 23 Feb 2020 14:32:05 +0000 (08:32 -0600)]
mgr/cephadm: do not refresh device inventory on mgr restart
The service inventory is more fluid and is faster to gather. We also
make a time-saving assumption that we don't need to persist our cache
updates when making changes because we know a mgr restart will refresh.
Device inventory changes are much less frequent and slower. Let's not
refresh them every restart.
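The two caches get different treatment on mgr restart; a sketch of that policy (the function name and interval are assumptions, not the cephadm code):

```python
import time

DEVICE_REFRESH_INTERVAL = 30 * 60  # illustrative: device scans are slow and rare

def needs_refresh_on_restart(kind, last_update, now=None):
    """Daemons: always re-run `cephadm ls` after a restart, since the
    service inventory is fluid, cheap to gather, and in-memory updates
    were deliberately not persisted.  Devices: keep the persisted
    inventory unless it has gone stale."""
    now = time.time() if now is None else now
    if kind == "daemons":
        return True
    return last_update is None or now - last_update > DEVICE_REFRESH_INTERVAL
```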
Sage Weil [Fri, 21 Feb 2020 21:38:25 +0000 (15:38 -0600)]
mgr/cephadm: make cache invalidate less racy
Consider a cache invalidation that races with an actual update:
- serve() refresh starts
- refresh runs cephadm ls
- add_daemon creates a new daemon
- add_daemon returns and invalidates the list (set last_update=None)
- serve() stores its ls result in the cache
In such a case the add result will get lost.
Fix this by taking a conservative strategy:
- invalidate adds host to a refresh list
- serve() removes an item from the refresh list and then does the ls,
then stores the result.
Any racing update will invalidate *after* it does its work, which means
we will always do a final ls afterwards.
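The conservative strategy above can be modeled as follows. This is a sketch, not the cephadm implementation; `to_refresh` and the method names are illustrative:

```python
import threading

class DaemonCache:
    """invalidate() adds the host to a refresh set; serve() pops the host
    *before* running the slow `ls`, so an invalidate that races in during
    the `ls` re-adds the host and forces one more pass afterwards."""

    def __init__(self):
        self._lock = threading.Lock()
        self.daemons = {}        # host -> last `cephadm ls` result
        self.to_refresh = set()  # hosts needing a fresh `ls`

    def invalidate(self, host):
        # called by updates (e.g. add_daemon) *after* they finish their work
        with self._lock:
            self.to_refresh.add(host)

    def pop_host(self):
        with self._lock:
            return self.to_refresh.pop() if self.to_refresh else None

    def store(self, host, result):
        with self._lock:
            self.daemons[host] = result

def serve_once(cache, run_ls):
    """One refresh pass: remove a host from the refresh set, run the slow
    `ls`, then store its result."""
    host = cache.pop_host()
    if host is not None:
        cache.store(host, run_ls(host))  # a racing invalidate() re-queues host
```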
Sage Weil [Sat, 22 Feb 2020 15:41:30 +0000 (09:41 -0600)]
mgr/cephadm: upgrade: handle stopped daemons
A stopped daemon should have the correct target_name, and we should ensure
that the host has an up-to-date image, so that when it does start it
comes up with the new image. If it has an old image name, we should
redeploy as usual.
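For a stopped daemon the upgrade decision reduces to a check on its recorded image; a sketch of that branch (function and return values are assumptions, not the cephadm code):

```python
def upgrade_step_for_stopped(daemon_image, target_image):
    """A stopped daemon cannot restart itself into the new image: if its
    recorded image is stale, redeploy it with the target image so it
    comes up correctly whenever it is next started; otherwise just make
    sure the host has the target image pulled and ready."""
    if daemon_image != target_image:
        return "redeploy"
    return "pull_image"
```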
Venky Shankar [Wed, 19 Feb 2020 12:31:40 +0000 (07:31 -0500)]
mgr/volumes: access volume in lockless mode when fetching async job
Saw a deadlock when deleting a lot of subvolumes -- purge threads were
stuck in accessing global lock for volume access. This can happen
when there is a concurrent remove (which renames and signals the
purge threads) and a purge thread is just about to scan the trash
directory for entries.
With the fix, purge threads fetch entries by accessing the volume
in lockless mode. This is safe from a functionality point of view, as
rename and directory scan are handled correctly by the filesystem.
In the worst case a purge thread picks up the trash entry on the next
scan, so no stale trash entry is ever left behind.
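The interaction can be modeled with the sketch below. The class and method names are illustrative; the real mgr/volumes code operates on the filesystem rather than an in-memory set:

```python
import threading

class TrashDir:
    """Model of the trash directory: a remover renames a subvolume into
    the trash and signals the purge threads; purge threads scan without
    taking the global volume lock, so removers are never blocked on them."""

    def __init__(self):
        self._entries = set()
        self._has_work = threading.Event()

    def rename_into_trash(self, name):
        # remover side: move the subvolume to trash, then wake purgers
        self._entries.add(name)
        self._has_work.set()

    def scan(self):
        # purge-thread side, lockless w.r.t. the global volume lock: a
        # concurrent rename may or may not be visible on this scan, but
        # either way the entry is seen on this pass or the next, never lost.
        return sorted(self._entries)
```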