Sage Weil [Wed, 24 Feb 2021 21:20:18 +0000 (16:20 -0500)]
mon/KVMonitor: fix 'osd new' cross-service commit
When we converted ConfigKeyService to KVMonitor, we didn't correctly
change this to propose_pending(), which meant that the kv change wasn't
captured in the paxos transaction.
Fixes: bb7ebc41532aeb23cff2241ab07b3f01c2f57ddd
Fixes: https://tracker.ceph.com/issues/49460
Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit 66891b4845fbf119cacb2c77d39180e28c6626d5)
- document the guideline for locking when working with the Python GIL
- add primitives to extract the patterns for acquiring/releasing the
GIL, so they can be reused.
Sage Weil [Tue, 16 Feb 2021 19:47:23 +0000 (14:47 -0500)]
mgr: use new kv subscription for mgr/, device/, config/
Include the config/ prefix (which we weren't loading before).
Before we are active, we collect these changes, and then pass them to
the ActivePyModules ctor. No change in functionality here, except that
when we make a change from a mgr module, we'll (1) put it in our local
cache store, (2) send the mon command to set it, and (3) get a notification
that updates the same value. Since this whole process is synchronous (see
ActivePyModules::set_store()), and the notification will generally arrive
*before* the command ack, there is no change in behavior.
If the mon cluster is not yet pacific, we still need to load
kv values the old way. If it's a mixed-mon cluster (and, e.g., our
current mon has the feature but not all of them do), we'll get this data
both ways, but no harm is done.
Sage Weil [Tue, 16 Feb 2021 19:43:19 +0000 (14:43 -0500)]
mon: allow subscription to kv/config-key data
Allow subscription to config-key/kv data. Initially we'll send a full
dump of the prefix. As changes occur, we'll send incremental diffs,
unless the subscriber is too far behind, in which case we'll send a full
dump again.
There is a new message, MKVData, to support this.
No compat issues since old clients won't subscribe to this stream unless
they know how to handle it.
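The full-dump versus incremental behaviour described above can be sketched roughly as follows. This is an illustration only, not the MKVData wire format or the monitor's code: `needs_full_dump`, `KVSubscriber` and `handle_update` are hypothetical names, and a value of `None` is assumed here to mean "key removed".

```python
from typing import Dict, Optional


def needs_full_dump(subscriber_version: int, oldest_logged_version: int) -> bool:
    """Hypothetical server-side check.

    Incremental diffs are only possible while the next version the
    subscriber needs is still present in the retained change log.
    """
    return subscriber_version == 0 or subscriber_version + 1 < oldest_logged_version


class KVSubscriber:
    """Hypothetical client-side cache for one subscribed prefix."""

    def __init__(self) -> None:
        self.version = 0
        self.data: Dict[str, bytes] = {}

    def handle_update(self, version: int, incremental: bool,
                      changes: Dict[str, Optional[bytes]]) -> None:
        if not incremental:
            # Full dump: discard the old view and start from the new one.
            self.data = {k: v for k, v in changes.items() if v is not None}
        else:
            # Incremental diff: apply sets and removals on top of the cache.
            for key, value in changes.items():
                if value is None:
                    self.data.pop(key, None)
                else:
                    self.data[key] = value
        self.version = version
```

A subscriber that receives a full dump at version 5 followed by a diff at version 6 ends up with the merged view, and one that has fallen behind the retained log is simply reset by the next full dump.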
Sage Weil [Mon, 15 Feb 2021 22:56:58 +0000 (17:56 -0500)]
mon: convert ConfigKeyService -> KVMonitor
Convert this into a normal PaxosService. This gives us (1) code uniformity
and also (2) a changelog in the form of the paxos commits for recently
changed keys that we can use to allow clients to subscribe to changes.
For upgrades this is pretty painless:
- the actual kv data is in the same place
- an old mon will still see updates made by a new monitor
- a new mon will still see updates made by an old monitor, but won't get
a paxosservice version bump.
Kotresh HR [Fri, 5 Feb 2021 18:05:22 +0000 (23:35 +0530)]
qa: Fix a few mgr/volume test cases
Recovering dirty auth metadata file might not retain the order,
fixed the comparison in 'test_recover_auth_metadata_during_authorize'
and 'test_recover_auth_metadata_during_deauthorize'.
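An order-insensitive comparison of this kind could look like the sketch below. It is not the code from the test suite: `same_entries` and the sample records are hypothetical, and the entries are assumed to be JSON-serializable dicts.

```python
import json
from typing import Any, Dict, List


def canonical(entries: List[Dict[str, Any]]) -> List[str]:
    # Serialize each entry with sorted keys, then sort the serialized
    # strings, so neither key order nor list order affects the result.
    return sorted(json.dumps(entry, sort_keys=True) for entry in entries)


def same_entries(expected: List[Dict[str, Any]],
                 actual: List[Dict[str, Any]]) -> bool:
    """Return True if both lists hold the same entries in any order."""
    return canonical(expected) == canonical(actual)


# Hypothetical recovered metadata that comes back in a different order.
expected = [{"name": "group/vol1", "access": "rw"},
            {"name": "group/vol2", "access": "r"}]
recovered = [{"access": "r", "name": "group/vol2"},
             {"access": "rw", "name": "group/vol1"}]
```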
Ilya Dryomov [Mon, 8 Feb 2021 16:01:47 +0000 (17:01 +0100)]
librbd: don't hold owner_lock for validate_image_removal()
handle_exclusive_lock() and handle_shut_down_exclusive_lock() call
validate_image_removal() without owner_lock held, so holding it in
shut_down_exclusive_lock() appears to be redundant.
Ilya Dryomov [Sun, 7 Feb 2021 14:09:24 +0000 (15:09 +0100)]
librbd: treat EROFS as expected in handle_acquire_lock()
If the peer refuses to release exclusive lock (e.g. in case automatic
exclusive lock transitions are disabled), EROFS is returned. Suppress
a rather confusing "Read-only file system" error message -- this case
is no different from EBUSY or EAGAIN.
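The idea can be sketched as a small filter over negative errno return codes. This is a Python illustration rather than the librbd C++ change; `should_log_error` is a hypothetical helper, and the set contains only the three codes named above.

```python
import errno

# Return codes that only mean "the lock was not obtained this time".
EXPECTED_ACQUIRE_ERRORS = {-errno.EBUSY, -errno.EAGAIN, -errno.EROFS}


def should_log_error(r: int) -> bool:
    """Hypothetical helper: report only unexpected failures.

    `r` follows the usual convention of 0 on success and a negative
    errno value on failure.
    """
    return r < 0 and r not in EXPECTED_ACQUIRE_ERRORS
```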
Ilya Dryomov [Sun, 7 Feb 2021 12:46:15 +0000 (13:46 +0100)]
librbd: refuse to release exclusive lock when removing
Commit 25c2ffe145be ("librbd: acquire exclusive lock from peer when
removing") changed PreRemoveRequest to request exclusive lock from the
peer instead of giving up and proceeding without exclusive lock. This
caused one of the test cases that sometimes runs concurrent "rbd rm"
against the same image to fail intermittently, most often on assert
because exclusive lock is now automatically transitioned to another
"rbd rm" on its request.
The root cause is older and probably goes back to when the synchronous
librbd::remove(), which held owner_lock across all operations including
trim_image(), was converted to a set of state machines. Since then, any
peer that requests exclusive lock (instead of trying once and backing
off) is able to mess with image removal.
Install StandardPolicy to disable automatic exclusive lock transitions
during image removal.
Jason Dillaman [Mon, 8 Feb 2021 16:53:28 +0000 (11:53 -0500)]
rbd-mirror: don't prune older mirror snapshots when pruning incomplete snapshot
Since we normally prune in order, we need to ensure that we don't prune older
snapshots when we need to delete an incomplete mirror snapshot since the
older snapshot might be the only remaining mirror snapshot.
Jason Dillaman [Fri, 29 Jan 2021 15:44:38 +0000 (10:44 -0500)]
librbd/deep_copy: skip snap list if object is known to be clean
If the fast-diff indicates that the destination object should exist
and that it hasn't changed, there shouldn't be a need to issue the
snap list operation. Instead, just update the destination object map
to indicate the existence of the object.
Jason Dillaman [Fri, 29 Jan 2021 02:42:09 +0000 (21:42 -0500)]
librbd/deep_copy: object-copy state machine must update object map
If there was no data to copy, the object-copy state machine was bypassing
the object-map update states and prematurely completing. Since the
object-map is default-initialized to all non-existent objects, this results
in incorrect state for OBJECT_EXISTS_CLEAN objects.
Jason Dillaman [Wed, 3 Feb 2021 18:21:34 +0000 (13:21 -0500)]
librbd/io: track object non-existence when computing snapshot deltas
Re-use the existing DNE state to track whether or not the object
already exists when computing snapshot deltas from an arbitrary
set of snapshots. Previously, the non-existence of the object was
only computed for snap id 0 for tracking whiteouts. In a future
commit, the deep-copy object-copy state machine will be able to
properly update the object-map state to indicate exists clean
vs non-existent state.
Jason Dillaman [Wed, 3 Feb 2021 15:13:28 +0000 (10:13 -0500)]
librbd/io: only track initial diff extents if no diffs exists
The purpose of the initial diff extents ({0, 0}) was to help track
whether or not objects exist for read-from-parent / whiteout
tracking. Once we have at least one set of diffs on the object, we
actually have enough information to know about the object state.
Jason Dillaman [Thu, 28 Jan 2021 23:30:16 +0000 (18:30 -0500)]
librbd/object_map: diff state machine should track object existence
The deep-copy snapshot-create state machine initializes the object-map
state to non-existent for all objects. There was an assumption that the
deep-copy object-copy state machine would always update the object map
but that was being skipped for clean objects as an optimization. This
change will support a future commit to run the object-copy state machine
for existing objects.
Nizamudeen A [Tue, 19 Jan 2021 12:35:43 +0000 (18:05 +0530)]
mgr/dashboard: Automatically refresh the crush map metadata table
If the OSD CRUSH map changes (for example, through an osd crush reweight
issued from the CLI), the entire page has to be reloaded before the change
is reflected in the metadata table. This PR refreshes the tree view
automatically instead.
Fixes: https://tracker.ceph.com/issues/48922
Signed-off-by: Nizamudeen A <nia@redhat.com>
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
(cherry picked from commit bc8562ef2a17b78e80bd4e1272d3fd1a512249bb)
Sebastian Wagner [Fri, 29 Jan 2021 10:10:38 +0000 (11:10 +0100)]
mgr/cephadm: Add strings to assert statements
This helps with: https://tracker.ceph.com/issues/48981
Looks like there is an assert somewhere:
```
Error EINVAL: Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 1269, in _handle_command
return self.handle_command(inbuf, cmd)
...snip...
File "/usr/share/ceph/mgr/orchestrator/module.py", line 550, in _list_services
raise_if_exception(completion)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 653, in raise_if_exception
raise e
AssertionError
```
Sebastian Wagner [Fri, 15 Jan 2021 12:13:35 +0000 (13:13 +0100)]
mgr/cephadm: try again calling ceph-volume without --filter-for-batch
Fixes: https://tracker.ceph.com/issues/48870
This deals with a cephadm upgrade issue:
1. user calls `ceph orch upgrade`
2. mgr/cephadm calls `ceph orch config set mgr.x container_image <new-container>`
3. standby mgr gets upgraded
4. mgr failover to new mgr
5. mgr/cephadm calls `_refresh_host_devices`
6. `_refresh_host_devices` calls `ceph orch config get osd container_image`.
But this returns the old image
7. `_refresh_host_devices` calls `ceph-volume ... --filter-for-batch`
with an image that doesn't support `filter-for-batch`
The idea is to simply retry calling ceph-volume inventory without `--filter-for-batch`
(also removed `out` being used without being declared)
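The retry can be sketched as below. This is a minimal illustration, not the cephadm implementation: `run` stands in for whatever actually invokes ceph-volume in the container, and the error text that is matched is an assumption about how an older ceph-volume reports the unknown flag.

```python
from typing import Callable, List, Tuple

# Hypothetical runner: takes ceph-volume arguments, returns (stdout, stderr, exit code).
Runner = Callable[[List[str]], Tuple[str, str, int]]

FLAG = "--filter-for-batch"


def inventory(run: Runner) -> str:
    args = ["inventory", "--format=json", FLAG]
    out, err, code = run(args)
    # Assumed error text; an older image rejects the flag it does not know.
    if code != 0 and f"unrecognized arguments: {FLAG}" in err:
        out, err, code = run([a for a in args if a != FLAG])
    if code != 0:
        raise RuntimeError(f"ceph-volume inventory failed: {err}")
    return out
```

Only the specific "unknown flag" failure triggers the second call, so any other error still surfaces unchanged.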
Sage Weil [Tue, 26 Jan 2021 22:10:13 +0000 (16:10 -0600)]
mgr/cephadm/upgrade: scale down MDS cluster(s) for major version upgrades
For octopus -> pacific, as with other recent releases, we need to scale
down the MDS cluster(s) to a single daemon before upgrading. (This is
because the MDS intra-cluster protocols aren't fully versioned.)
Sage Weil [Wed, 27 Jan 2021 14:54:00 +0000 (08:54 -0600)]
mgr/cephadm/upgrade: match against any repo_digest, not image_id
The image id can vary across hosts and (most notably) docker vs podman.
Instead, use the repo_digest as an image identifier.
Unfortunately, a single image may have multiple digests, even within the
same registry, so keep a list of the digests for the image we are
upgrading to, and ensure that each container has a digest that matches at
least one of them.
This allows upgrade to proceed in mixed docker+podman clusters. However,
it does not yet address a cluster with mixed CPU architectures, because
the container image will have different digest(s) for each architecture
build.
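The matching rule amounts to a set intersection, sketched below. `matches_target` and the digest strings are hypothetical and only show the comparison; how the digests are gathered from each host is not covered.

```python
from typing import Iterable


def matches_target(container_digests: Iterable[str],
                   target_digests: Iterable[str]) -> bool:
    """True if the container reports at least one digest of the target image."""
    return bool(set(container_digests) & set(target_digests))


# Hypothetical digests: one image known under two digests in the same registry.
target = ["example.io/ceph/ceph@sha256:aaaa", "example.io/ceph/ceph@sha256:bbbb"]
upgraded = ["example.io/ceph/ceph@sha256:bbbb"]
not_upgraded = ["example.io/ceph/ceph@sha256:cccc"]
```

Because no image id is compared, a docker host and a podman host running the same image both satisfy the check as long as they share a digest.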
Paul Cuzner [Thu, 14 Jan 2021 22:08:48 +0000 (11:08 +1300)]
cephadm: install doc updated to include cluster-network parameter
Install guide updated to include a description of the --cluster-network
parameter. The text also links to the complete definition for cluster-network
on the rados/configuration/network-config-ref page.