Jason Dillaman [Tue, 10 Mar 2020 17:31:34 +0000 (13:31 -0400)]
librbd: race condition in image watcher notification callback
If a refresh is in-progress when a header update notification is
received, the notification was previously incorrectly dropped.
This prevented rbd-mirror's snapshot-based mirroring replayer from
detecting updates in some cases.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Mon, 9 Mar 2020 22:32:03 +0000 (18:32 -0400)]
librbd: re-use mirror promote state machine when disabling
The promote state machine will handle removing the non-primary
feature bit and will ensure an interrupted disable operation
doesn't leave things in an inconsistent state.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Mon, 9 Mar 2020 21:04:27 +0000 (17:04 -0400)]
librbd: enable/disable implicit non-primary feature bit
When promoted to primary, disable the non-primary feature bit and
when demoted (or created non-primary), enable the non-primary feature
bit. This will prevent all non rbd-mirror RBD clients from modifying
the RBD image.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Mon, 9 Mar 2020 20:49:07 +0000 (16:49 -0400)]
rbd-mirror: permit R/W operations against non-primary image
When the non-primary feature bit is enabled, mask out the read-only
feature bit that will be set in the refresh image state machine if
the image has that feature bit set. This will ensure that only the
rbd-mirror daemon will be able to modify a non-primary image.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Tue, 3 Mar 2020 20:17:52 +0000 (15:17 -0500)]
librbd: track reason why ImageCtx is read-only
This will be utilized by the RefreshRequest state machine to flag the image
as read-only if the new RBD_FEATURE_NON_PRIMARY feature is enabled. Also
allow that flag to be masked out by rbd-mirror daemon to permit IO and
operations against a non-primary image.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
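The flag-and-mask idea in this commit can be sketched as follows. This is an illustration in Python, not the librbd code: the flag names, their values, and the `is_read_only` helper are hypothetical.

```python
# Hypothetical flag names and values, for illustration only.
READ_ONLY_FLAG_USER = 1 << 0         # image was opened read-only by the caller
READ_ONLY_FLAG_NON_PRIMARY = 1 << 1  # non-primary feature bit is enabled


def is_read_only(flags, mask=~0):
    """An image is read-only if any recorded reason survives the mask."""
    return (flags & mask) != 0


flags = READ_ONLY_FLAG_NON_PRIMARY
regular_client = is_read_only(flags)
# A privileged client (rbd-mirror in the commit text) masks out that reason.
mirror_client = is_read_only(flags, mask=~READ_ONLY_FLAG_NON_PRIMARY)
# A user-requested read-only open is not affected by that mask.
mirror_user_ro = is_read_only(flags | READ_ONLY_FLAG_USER,
                              mask=~READ_ONLY_FLAG_NON_PRIMARY)
```

Tracking the reason as separate bits, rather than a single boolean, is what lets one reason be masked without hiding the others.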
Jason Dillaman [Tue, 3 Mar 2020 20:01:35 +0000 (15:01 -0500)]
librbd: new RBD_FEATURE_NON_PRIMARY to prevent R/W IO
When a snapshot-based image is non-primary, we will need to use
this implicit feature to ensure that writes and maintenance
operations cannot be performed against the image.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Sage Weil [Tue, 10 Mar 2020 22:20:48 +0000 (17:20 -0500)]
Merge PR #33771 into octopus
* refs/pull/33771/head:
common/ceph_timer: Pass reference to waited time on stack
common/ceph_timer: Add test
common/ceph_timer: Use unique_function, allowing noncopyable events
common/ceph_timer: Couple cleanups
common/ceph_timer: Fix namespaces
common/ceph_timer: Add missing includes
common/ceph_timer.h: Don't indent contents of a namespace
Sage Weil [Tue, 10 Mar 2020 14:28:57 +0000 (09:28 -0500)]
cephadm: bootstrap: wait for mgr to restart after enabling a module
It was possible to enable a module (mon updates mgrmap) and then
do a mgr command and have that command reach the mgr before it got the
latest mgrmap and restarted.
Fixes: https://tracker.ceph.com/issues/44531
Signed-off-by: Sage Weil <sage@redhat.com>
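A minimal sketch of the kind of wait described in this commit, not the actual cephadm code: after enabling a module, poll until the mgr reports a map epoch at least as new as the one the mon produced. `wait_for_mgr_restart` and `get_mgr_epoch` are hypothetical names, and how the epoch is really obtained is an assumption.

```python
import time


def wait_for_mgr_restart(get_mgr_epoch, target_epoch, timeout=30.0, interval=0.01):
    """Poll until the mgr reports an epoch >= target_epoch.

    get_mgr_epoch is a hypothetical callable returning the epoch the mgr
    currently knows about, or None while the mgr is unreachable (restarting).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        epoch = get_mgr_epoch()
        if epoch is not None and epoch >= target_epoch:
            return epoch
        time.sleep(interval)
    raise TimeoutError('mgr did not reach epoch %d' % target_epoch)


# Simulated mgr: stale, then unreachable during restart, then up to date.
_responses = iter([4, None, None, 5])
reached = wait_for_mgr_restart(lambda: next(_responses), target_epoch=5)
```

Only once this returns would the next mgr command be issued, so it cannot reach a mgr that still has the old map.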
Sage Weil [Mon, 9 Mar 2020 01:38:59 +0000 (20:38 -0500)]
mgr/cephadm: fix upgrade order
Create two variables, CEPH_TYPES and CEPH_UPGRADE_ORDER. In reality they
are both the same, but this way the meaning is clear, and the lists
won't get out of sync (they should always have the same elements).
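One way to keep two such names in sync, shown as a sketch: the entries listed here are assumed for illustration, and deriving one variable from the other is an assumption about the approach rather than a quote of the module. `upgrade_sequence` is a hypothetical helper.

```python
# Daemon types, listed in the order they should be upgraded.
# The concrete entries are assumed for illustration.
CEPH_UPGRADE_ORDER = ['mgr', 'mon', 'osd', 'mds']
# Same elements under a second name, so each use site states its intent.
CEPH_TYPES = set(CEPH_UPGRADE_ORDER)


def upgrade_sequence(daemons):
    """Sort (daemon_type, daemon_id) pairs into upgrade order."""
    return sorted(daemons, key=lambda d: (CEPH_UPGRADE_ORDER.index(d[0]), d[1]))


pending = [('osd', '1'), ('mon', 'a'), ('mgr', 'x'), ('osd', '0')]
ordered = upgrade_sequence(pending)
```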
Sage Weil [Mon, 9 Mar 2020 17:26:06 +0000 (12:26 -0500)]
ceph.in: only shut down rados on clean exit
If we exit due to a timeout, then calling rados shutdown can lead to all
sorts of problems, because we may still have another thread that is
trying to call rados_connect and/or do some work, and rados_connect
and rados_shutdown don't (and can't!) really behave well when racing
against each other.
Note that shutdown here isn't that important--the process is about to
exit anyway. It's only useful to exercise the shutdown code path more
often.
Fixes: https://tracker.ceph.com/issues/44526
Signed-off-by: Sage Weil <sage@redhat.com>
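The pattern can be sketched like this. It is not the ceph.in code: `FakeCluster` is a hypothetical stand-in for a rados handle, and `run_with_timeout` is an illustrative helper.

```python
import threading


class FakeCluster:
    """Hypothetical stand-in for a rados handle; records shutdown calls."""

    def __init__(self):
        self.shutdown_called = False

    def shutdown(self):
        self.shutdown_called = True


def run_with_timeout(cluster, work, timeout):
    """Run work() in a helper thread; return True if it finished in time.

    shutdown() is called only on the clean path. After a timeout the helper
    thread may still be using the cluster handle, so it is left alone and
    the caller is expected to exit the process.
    """
    worker = threading.Thread(target=work, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        return False          # timed out: skip shutdown
    cluster.shutdown()        # clean exit: safe to shut down
    return True


clean = FakeCluster()
clean_ok = run_with_timeout(clean, lambda: None, timeout=1.0)

stuck = FakeCluster()
blocker = threading.Event()
stuck_ok = run_with_timeout(stuck, blocker.wait, timeout=0.05)
blocker.set()  # release the helper thread so the example ends promptly
```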
Adam C. Emerson [Fri, 6 Mar 2020 03:14:47 +0000 (22:14 -0500)]
common/ceph_timer: Pass reference to waited time on stack
std::condition_variable::wait_until takes a const reference to a
time_point. It may access this reference after relinquishing the
mutex, creating a potential use-after-free error if the first event is
shut down.
So, just copy the time onto the stack, so we have a reference that
won't disappear.
https://tracker.ceph.com/issues/44373
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Sage Weil [Mon, 9 Mar 2020 13:28:57 +0000 (08:28 -0500)]
Merge PR #33793 into master
* refs/pull/33793/head:
qa/suites/rados/cephadm/upgrade: new start point
qa/tasks/cephadm: put bootstrap config etc directly in /etc/ceph
cephadm: shell: default to config and keyring in /etc/ceph, if present
Sage Weil [Mon, 9 Mar 2020 13:28:37 +0000 (08:28 -0500)]
Merge PR #33808 into master
* refs/pull/33808/head:
mgr/cephadm: apply: fill in default placement if none is provided
mgr/cephadm: make placement truly optional (default to count=1)
mgr/cephadm: allow count == 0
mgr/cephadm: remove magic labels
Currently, the removal report is updated after entering the main serve()
loop. This tiny window makes clients (like Dashboard) fail to poll the
result. Refresh the report data immediately after scheduling OSD for
removal.
The report structure is changed from Dict to Set because:
- The key is an `OSDRemoval` instance, which makes it hard to use.
- More consistent with orchestrator interfaces: most calls return a list.
Kefu Chai [Mon, 9 Mar 2020 03:48:07 +0000 (11:48 +0800)]
crimson/mgr: close() in background
as per Yingxin,
application code is not required to wait for the `close()` future; it
would be safe to ignore it, because:
- `close()` will shut down its socket synchronously;
- `close()` will create an internal `ConnectionRef` when it's closing;
- `Messenger` will wait for all connections to be closed during `shutdown()`.
Kefu Chai [Sun, 8 Mar 2020 06:00:53 +0000 (14:00 +0800)]
qa/tasks/ceph_manager: use StringIO for capturing COT output
there are a couple of factors we should consider when choosing between
BytesIO and StringIO:
- if the producer is producing binary
- if we are expecting binary
- if the layers in between them are doing the decoding/encoding
automatically.
in our case, the producer is either the ChannelFile instances returned
by paramiko.SSHClient or subprocess.CompletedProcess instances returned
by subprocess.run(). the former are file-like objects opened in "r" mode,
but their contents are decoded with utf-8 when reading if
ChannelFile.FLAG_BINARY is not specified. that's why we always try to
add this flag in orchestra/run.py when collecting the stdout and stderr
from paramiko.SSHClient after executing a command.
back in python2, this worked just fine, as we didn't differentiate bytes
from str back then.
but in python3, we have to make a decision. in the case of
ceph-objectstore-tool (COT for short), it does not produce binary and
we don't check its output with binary, so, if neither Remote.run() nor
LocalRemote.run() decodes/encodes for us, it's fine.
so it boils down to `copy_to_log()`:
i think we should respect the consumer's expectation, and only decode
the output if a StringIO is passed in as stdout or stderr.
as we always log the output with logging we could either set
`ChannelFile.FLAG_BINARY` depending on the type of `capture` or not.
if it's not set, paramiko will return str (bytes) on python2, and str on
python3. if it's set, paramiko will return str (bytes) on python2,
and bytes on python3.
if there is non-ASCII in the output, logging will fail with
`UnicodeDecodeError` exception. and paramiko throws the same exception
when trying to decode for us if `ChannelFile.FLAG_BINARY` is not
specified.
so to ensure that we always have logging messages no matter if the
producer follows the rule of "use StringIO if you only emit text" or
not, we have to use `ChannelFile.FLAG_BINARY`, and force paramiko
to send us the bytes. but we still have the luxury to use StringIO
and do the decode when the caller asks for str explicitly. that'd save
the pain of using `str.decode()` or `six.ensure_str()` everywhere
even if we can assure that the program does not write binary.
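A minimal sketch of the behaviour argued for above, assuming the producer always hands over bytes (as paramiko does with `ChannelFile.FLAG_BINARY`). This is not the teuthology implementation of `copy_to_log()`; the signature and the lenient `errors='replace'` decoding are assumptions made for illustration.

```python
import io
import logging


def copy_to_log(src, logger, capture=None):
    """Copy lines of bytes from src to the logger and, optionally, to capture.

    src is assumed to yield bytes. Log messages are decoded leniently so
    non-ASCII output cannot make logging fail; capture receives str only
    if it is a StringIO, and the raw bytes otherwise.
    """
    for line in src:
        text = line.decode('utf-8', errors='replace')
        logger.info(text.rstrip('\n'))
        if capture is None:
            continue
        if isinstance(capture, io.StringIO):
            capture.write(text)
        else:
            capture.write(line)


log = logging.getLogger('copy_to_log_sketch')
raw = [b'ok\n', b'caf\xc3\xa9\n']

as_text = io.StringIO()
copy_to_log(io.BytesIO(b''.join(raw)), log, as_text)

as_bytes = io.BytesIO()
copy_to_log(io.BytesIO(b''.join(raw)), log, as_bytes)
```

A caller that passes a StringIO gets decoded text without sprinkling `decode()` calls around, while a caller that passes a BytesIO still sees the untouched bytes.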
Chunsong Feng [Thu, 19 Dec 2019 09:32:09 +0000 (17:32 +0800)]
os/bluestore/spdk: Fix the overflow error of parsing spdk coremask
coremask supports up to 256 bits in DPDK19.05, but the use of stoll in
NVMEManager::try_get limits the maximum use to 64 bits. Parse coremask by
hex character from low to high.
Fixes: https://tracker.ceph.com/issues/43044
Signed-off-by: Hu Ye <yehu5@huawei.com>
Signed-off-by: Chunsong Feng <fengchunsong@huawei.com>
Signed-off-by: luo rixin <luorixin@huawei.com>
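The per-character traversal can be illustrated as below. The fix itself is in C++; this Python sketch only shows the low-to-high digit walk, and since Python integers do not overflow it does not reproduce the original bug. `parse_coremask` is a hypothetical helper name.

```python
def parse_coremask(coremask):
    """Return the core indices set in a hex coremask string, lowest first.

    Walks the hex digits from the least significant (rightmost) to the most
    significant, so the mask is never converted to a fixed-width integer.
    """
    digits = coremask.strip()
    if digits.lower().startswith('0x'):
        digits = digits[2:]
    cores = []
    for pos, ch in enumerate(reversed(digits)):
        nibble = int(ch, 16)  # raises ValueError on a non-hex character
        for bit in range(4):
            if nibble & (1 << bit):
                cores.append(pos * 4 + bit)
    return cores


low = parse_coremask('0x5')             # bits 0 and 2
wide = parse_coremask('1' + '0' * 16)   # bit 64, beyond a 64-bit integer
```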
Sage Weil [Mon, 9 Mar 2020 00:57:06 +0000 (19:57 -0500)]
Merge PR #33804 into master
* refs/pull/33804/head:
cephadm: ls: warn if daemon type (version) is not supported
cephadm: report grafana version
cephadm: report prometheus, node-exporter, alertmanager versions
cephadm: use None (not '<no value>') for monitoring daemon version
Sage Weil [Sun, 8 Mar 2020 22:29:00 +0000 (17:29 -0500)]
Merge PR #33792 into master
* refs/pull/33792/head:
doc/cephadm: fix formatting for osd section
doc/cephadm: update 'adding mons' section to suggest/prefer 'apply'
doc/cephadm: fix formatting, typos
mgr/cephadm: implement apply_mon
mgr/cephadm: allow mon creation without explicit ip or addr
mgr/cephadm: allow _apply_service to delete mon daemon's data
mgr/cephadm: remove mon from monmap before removing mon
mgr/cephadm: do not remove mon if it breaks quorum
Sage Weil [Sun, 8 Mar 2020 21:49:38 +0000 (16:49 -0500)]
Merge PR #33802 into master
* refs/pull/33802/head:
mgr/cephadm: sanity check upgrade version
mgr/cephadm: only need to invalidate once here
mgr/cephadm: upgrade requires root mode for now
Sage Weil [Sun, 8 Mar 2020 17:00:45 +0000 (12:00 -0500)]
mgr/cephadm: remove magic labels
Remove the magic label behavior. It makes the code confusing, it
makes the overall behavior hard to explain, and it makes the PlacementSpec
meaning different than what Rook is doing.
Instead, if you want mons on hosts with label 'mon', then say 'label:mon'.
Sage Weil [Fri, 6 Mar 2020 21:26:20 +0000 (15:26 -0600)]
qa/tasks/cephadm: put bootstrap config etc directly in /etc/ceph
This puts the conf and keyring in /etc/ceph earlier rather than later,
making them useful for debugging a live system *during* bootstrap. It's
also less code.
Sage Weil [Sat, 7 Mar 2020 19:45:16 +0000 (13:45 -0600)]
Merge PR #33706 into master
* refs/pull/33706/head:
qa/suites/rados/cephadm/upgrade: adjust starting version
mgr/orch: from_strings -> from_string; do not accept a list
mgr/volumes: pass placement as string, not list
qa/tasks/mgr/test_orchestrator_cli: adjust placement args
qa/tasks/cephadm: pass apply placement as a single arg
mgr/orch: PlacementSpec: allow 'count:123'
mgr/orch: PlacementSpec: make pretty_str() match input
mgr/orch: take single placement argument
mgr/orch: PlacementSpec.from_strings: take a string *or* a list