Adam King [Wed, 15 Mar 2023 17:55:26 +0000 (13:55 -0400)]
cephadm: handle exceptions applying secondary services during bootstrap
Otherwise we risk hitting a mismatch between the cephadm binary version
and the container image version we're bootstrapping with, resulting in
bootstrap failing. An example is in the tracker.
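A minimal sketch of the guard this describes (function and parameter names
are illustrative, not cephadm's actual code):

```python
def apply_secondary_services(specs, apply_fn, logger):
    """Apply each service spec; log failures instead of aborting bootstrap."""
    for spec in specs:
        try:
            apply_fn(spec)
        except Exception:
            # e.g. a cephadm-binary / container-image version mismatch;
            # bootstrap continues and the spec can be re-applied later
            logger.exception('failed to apply service spec %s', spec)
```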
Filestore is no longer supported in cephadm, and both the doc [1] and the
DriveGroupValidation [2] raise an exception if it is used. This patch
removes the legacy code that produced filestore-related ceph-volume
commands.
Adam King [Fri, 3 Mar 2023 20:31:03 +0000 (15:31 -0500)]
mgr/prometheus: remove dependency on cephadm module
https://github.com/ceph/ceph/commit/f967ac061ebee362cdc82c458e955da75a9045e9
introduced an import of something from the cephadm module
into the prometheus module. This seems to break the prometheus
module in some non-cephadm setups. For example, the ceph-ansible
CI hit:
failed: [mgr0 -> mon0] (item=prometheus) => changed=true
ansible_loop_var: item
cmd:
- ceph
- -n
- client.admin
- -k
- /etc/ceph/ceph.client.admin.keyring
- --cluster
- ceph
- mgr
- module
- enable
- prometheus
delta: '0:00:00.389965'
end: '2023-03-03 15:30:07.631308'
item: prometheus
rc: 2
start: '2023-03-03 15:30:07.241343'
stderr: 'Error ENOENT: module ''prometheus'' reports that it cannot run on the active manager daemon: No module named ''cephadm'' (pass --force to force enablement)'
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>
So we need to be a bit more careful with this import and make sure the
prometheus module works fine without cephadm.
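A sketch of the kind of defensive import this implies (the import path and
helper name are hypothetical, not the actual change):

```python
# guard the optional import so the module loads without cephadm installed
try:
    from cephadm import some_helper  # hypothetical; only present with cephadm
except ImportError:
    some_helper = None

def collect_orchestrator_metrics():
    # degrade gracefully instead of failing module load / enablement
    if some_helper is None:
        return []
    return some_helper()
```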
Adam King [Wed, 15 Feb 2023 22:07:09 +0000 (17:07 -0500)]
mgr/cephadm: be aware of host's shortname and FQDN
The idea is to gather the shortname and FQDN as part
of gather-facts. Then, whenever we check whether a certain
host is in our internal inventory by hostname, we can also
check these other known names. This should avoid issues where
we think a hostname specified by FQDN is not in our
inventory because we know the host by its shortname,
or vice versa.
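A sketch of the lookup this enables (the field names are assumptions, not
the actual gather-facts schema):

```python
def host_in_inventory(name, all_host_facts):
    """True if `name` matches any name we know for an inventory host."""
    for facts in all_host_facts:
        if name in (facts['hostname'], facts['shortname'], facts['fqdn']):
            return True
    return False

facts = [{'hostname': 'node1', 'shortname': 'node1',
          'fqdn': 'node1.example.com'}]
assert host_in_inventory('node1.example.com', facts)  # FQDN now matches
assert host_in_inventory('node1', facts)              # and so does shortname
```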
John Mulligan [Mon, 27 Feb 2023 19:38:50 +0000 (14:38 -0500)]
cephadm: fix timeout argument to call function
The timeout argument to the call function, used for executing
subprocesses, did not function; this patch makes the timeout work as
(probably) intended. Use the `process.communicate()` method rather than
the `tee` functions to handle IO collection. Since no logging is done
until after the exit code is known, the tee calls are not necessary. Add
calls to kill the child process when the timeout occurs. This helps
prevent event loop "leaks" that generate Python warnings.
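A minimal sketch of the fixed approach, assuming an asyncio-based call
function (this is not the actual cephadm code):

```python
import asyncio

async def call(cmd, timeout=None):
    """Run cmd, returning (stdout, stderr, returncode); sketch only."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        # communicate() collects all output; no tee helpers needed since
        # nothing is logged until the exit code is known
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()        # kill the child on timeout...
        await proc.wait()  # ...and reap it to avoid event loop "leaks"
        raise
    return out.decode(), err.decode(), proc.returncode

# usage: asyncio.run(call(['echo', 'hi'], timeout=5))
```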
John Mulligan [Thu, 23 Feb 2023 19:51:13 +0000 (14:51 -0500)]
cephadm/tests: add initial test coverage for call function
The call function provides the ability to run subprocesses and log their
output, and it accepts an optional timeout parameter. This timeout
parameter does not appear to function correctly today, so we make use of
pytest.param/pytest.mark.xfail to mark these cases as already known to
fail.
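Roughly the pattern in use, with subprocess.run standing in for cephadm's
call function (an illustrative stand-in, not the real test):

```python
import subprocess
import pytest

@pytest.mark.parametrize(
    'cmd,timeout',
    [
        (['true'], None),
        # timeout cases marked xfail while the bug is outstanding
        pytest.param(['sleep', '10'], 1, marks=pytest.mark.xfail),
    ],
)
def test_call_timeout(cmd, timeout):
    # stand-in for calling cephadm's call(); the xfail param is the point
    subprocess.run(cmd, timeout=timeout, check=True)
```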
John Mulligan [Wed, 22 Feb 2023 18:57:21 +0000 (13:57 -0500)]
cephadm: disable coverage for some compatibility blocks
This change disables reporting of missing coverage for blocks that
contain code copied from newer Python versions and that exist only to
make those functions available on older Python versions.
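Such compatibility blocks typically look like the following generic
example (not the exact cephadm code), with the shim branch excluded from
coverage via a pragma:

```python
import sys

if sys.version_info >= (3, 9):
    from functools import cache
else:  # pragma: no cover
    # copy of the newer stdlib helper for older Pythons; coverage
    # reporting skips this branch since CI runs a single interpreter
    import functools

    def cache(func):
        return functools.lru_cache(maxsize=None)(func)
```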
drive_group: fix limit filter in drive_selection.selector
When multiple OSD service specs with a 'limit' filter are applied, the
current logic makes the second service spec try to pick devices that are
already used by the first service spec.
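In effect, the fix makes selection skip devices another spec has already
claimed before applying the limit; a toy model (not the actual
drive_selection code):

```python
def select_with_limit(devices, claimed, limit):
    free = [d for d in devices if d not in claimed]  # skip already-used disks
    chosen = free[:limit]
    claimed.update(chosen)
    return chosen

claimed = set()
disks = ['/dev/sdb', '/dev/sdc', '/dev/sdd', '/dev/sde']
first = select_with_limit(disks, claimed, 2)   # ['/dev/sdb', '/dev/sdc']
second = select_with_limit(disks, claimed, 2)  # ['/dev/sdd', '/dev/sde']
assert not set(first) & set(second)            # specs no longer overlap
```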
doc/start: edit first 150 lines of documenting-ceph
Edit the first 150 lines of doc/start/documenting-ceph.rst. This is part
of an initiative to harvest the fruits of Cephalocon 2023, at which
documentation proved to be in demand to a surprising degree.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit dd37f94aa4f1de947b1eaf5d82cc529925f5823e)
Conrad Hoffmann [Wed, 22 Mar 2023 22:03:57 +0000 (23:03 +0100)]
doc: account for PG autoscaling being the default
The current documentation tries really hard to convince people to set
both `osd_pool_default_pg_num` and `osd_pool_default_pgp_num` in their
configs, but at least the latter has undesirable side effects on any
Ceph version that has PG autoscaling enabled by default (at least quincy
and beyond).
Assume a cluster with defaults of `64` for `pg_num` and `pgp_num`.
Starting `radosgw` will fail as it tries to create various pools without
providing values for `pg_num` or `pgp_num`. This triggers the following
in `OSDMonitor::prepare_new_pool()`:
- `pg_num` is set to `1`, because autoscaling is enabled
- `pgp_num` is set to `osd pool default pgp_num`, which we set to `64`
- This is an invalid setup, so the pool creation fails
Likewise, `ceph osd pool create mypool` (without providing values for
`pg_num` or `pgp_num`) does not work.
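A toy Python model of the interaction described above (names and defaults
are illustrative, not Ceph's actual implementation):

```python
def prepare_new_pool(pg_num=None, pgp_num=None, autoscale=True,
                     default_pg_num=64, default_pgp_num=64):
    if pg_num is None:
        pg_num = 1 if autoscale else default_pg_num   # autoscaler starts at 1
    if pgp_num is None:
        pgp_num = default_pgp_num                     # config default wins
    if pgp_num > pg_num:
        raise ValueError(f'pgp_num {pgp_num} > pg_num {pg_num}: invalid')
    return pg_num, pgp_num

# with osd_pool_default_pgp_num=64 set, a plain pool create fails:
#   prepare_new_pool()  ->  ValueError: pgp_num 64 > pg_num 1: invalid
```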
Following this rationale:
- Not providing a default value for `pgp_num` will always do the right
thing, unless you use advanced features, in which case you can be
expected to set both values on pool creation
- Setting `osd_pool_default_pgp_num` in your config breaks pool creation
for various cases
This commit:
- Removes `osd_pool_default_pgp_num` from all example configs
- Adds mentions of autoscaling and how it interacts with the default
values in various places
For each file that was touched, the following maintenance was also
performed:
- Change internal spaces to underscores for config values
- Remove mentions of filestore or any of its settings
- Fix minor inconsistencies, like indentation etc.
There is also a ticket which I think is very relevant and fixed by this,
though it only captures part of the broader issue addressed here:
qa/suites/rbd: install qemu-utils in addition to qemu-block-extra on Ubuntu
qemu-utils is usually pre-installed but, due to what appears to be
an Ubuntu packaging bug, it's not upgraded when qemu-block-extra is
installed:
The following NEW packages will be installed:
qemu-block-extra
The following packages will be upgraded:
qemu-system-common qemu-system-data qemu-system-gui qemu-system-x86
However, the version of the block driver must exactly match the version
of the qemu-img tool, so the above leads to:
$ qemu-img convert -f qcow2 -O raw /home/ubuntu/cephtest/qemu/base.client.0.0.qcow2 rbd:rbd/client.0.0
Failed to initialize module: /usr/lib/x86_64-linux-gnu/qemu/block-rbd.so
Note: only modules from the same build can be loaded.
qemu: module block-block-rbd not found, do you want to install qemu-block-extra package?
qemu-img: Unknown protocol 'rbd'
```
error: /var/cache/dnf/baseos-00fe51d07def85f0/packages/kernel-core-4.18.0-483.el8.x86_64.rpm: signature hdr data: BAD, no. of bytes(459772) out of range
```
ceph-volume tests are failing and OSDs never get up and running. For
some reason, updating the OS early in the testing workflow addresses the
issue in the CI.
Remove confusing parentheses from doc/rados/operations/monitoring-osd-pg.rst
and add in their place a clearer hyphen (actually an em-dash, or at least
it is intended to be an em-dash).
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit 0c965c18d0e6ab1461b5fad42d481f25e4207940)
Ilya Dryomov [Tue, 28 Mar 2023 18:03:05 +0000 (20:03 +0200)]
librbd: avoid generating ESHUTDOWN in ManagedLock
EBLOCKLISTED has a very special meaning but happens to be an alias for
ESHUTDOWN. If the client gets blocklisted, we always want to propagate
EBLOCKLISTED error code since it's generated by the OSD.
For the ManagedLock use case of indicating that an operation on the lock
raced with lock shutdown, meaning that a higher-level request can simply
be restarted, ERESTART should do.
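For context, a sketch of the collision (errno values as defined on Linux;
the EBLOCKLISTED alias is described above, the rest is illustrative):

```python
import errno

ESHUTDOWN = errno.ESHUTDOWN   # 108 on Linux
EBLOCKLISTED = ESHUTDOWN      # Ceph's alias: same numeric value
ERESTART = errno.ERESTART     # 85 on Linux; distinct, so "retry" is clear

# a locally generated ESHUTDOWN would be indistinguishable from a genuine
# OSD-generated EBLOCKLISTED, while ERESTART cannot be confused with it
assert EBLOCKLISTED != ERESTART
```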
Ilya Dryomov [Tue, 28 Mar 2023 17:52:42 +0000 (19:52 +0200)]
librbd: fix recursive locking on owner_lock in ImageDispatch
needs_exclusive_lock() calls acquire_lock() with owner_lock held.
If lock acquisition races with lock shutdown, ManagedLock completes the
ImageDispatch context directly and the dispatch is retried immediately on
the same thread (due to DISPATCH_RESULT_RESTART). This results in
recursion into needs_exclusive_lock() and, barring locking issues, can
lead to unbounded stack growth if lock shutdown takes its time.
During send_acquire_lock, there is a case where no watcher handle is
present and the lock request is delayed. If the client is blocklisted,
the delayed request will not continue and the call that requested the
lock will never complete. The lock process will now propagate
-EBLOCKLISTED to the callback instead of delaying indefinitely.
Fixes: https://tracker.ceph.com/issues/59115
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit 6a0aeadc31ab1942c42c6e466183148f1d3752be)
Ilya Dryomov [Thu, 30 Mar 2023 11:58:20 +0000 (13:58 +0200)]
librbd: clear Image::list_watchers() list before populating it
The "append to the passed list" behavior is confusing and not what the
corresponding C API (rbd_watchers_list) or other similar C++ APIs (e.g.
list_lockers) do.
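The fixed semantics, sketched in Python (the Image class and
fetch_watchers are stand-ins, not the librbd bindings):

```python
class Image:
    def fetch_watchers(self):
        return ['client.4151', 'client.4985']  # pretend cluster state

    def list_watchers(self, out):
        out.clear()                    # the fix: drop any stale entries
        out.extend(self.fetch_watchers())

watchers = ['leftover-from-last-call']
Image().list_watchers(watchers)
assert watchers == ['client.4151', 'client.4985']
```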
Dongsheng Yang [Wed, 15 Mar 2023 06:54:39 +0000 (06:54 +0000)]
librbd: fix wrong attribute for rbd_quiesce_complete api
When we used the rbd_quiesce_complete API, we got an error:
/usr/bin/ld: undefined reference to `rbd_quiesce_complete'
Then we found that the symbol for rbd_quiesce_complete in
librbd.so is LOCAL. After some investigation, we found that
the attribute of the rbd_quiesce_complete API is CEPH_RADOS_API
rather than the expected CEPH_RBD_API.
Fixes: https://tracker.ceph.com/issues/59208
Signed-off-by: Dongsheng Yang <dongsheng.yang.linux@gmail.com>
(cherry picked from commit 51a2b707a3074e000b310fc20901d5038b15ea0c)
Edit doc/rados/operations/health-checks.rst (2 of x). PR#50674, the PR
that immediately precedes this one in the series of PRs that line-edit
health-checks.rst, wrongly identified the series as having five
sections. This has been rectified by using the "2 of x" formulation.
Cory Snyder [Mon, 27 Feb 2023 09:45:47 +0000 (04:45 -0500)]
ceph-volume: add test case to reproduce bug in get_physical_fast_allocs
Adds a test case to reproduce a bug in get_physical_fast_allocs for
clusters that have multiple fast-device PVs in a single VG (deployed
prior to v15.2.8). Also fixes other test cases for this function so
that they more accurately represent reality.