Benoît Knecht [Thu, 28 Oct 2021 16:49:07 +0000 (18:49 +0200)]
mgr/ActivePyModules: Add metadata id in dump_server()
The `DaemonStateCollection` used to always contain the daemon name in its
`DaemonKey`, but since #40220 (or more specifically afc33758e076761b8d4ec004c8f9c49b80a48770), the RadosGW registers with its
instance ID instead (`rados.get_instance_id()`).
As a result, the `ceph_rgw_*` metrics returned by `ceph-mgr` through the
`prometheus` module have their `ceph_daemon` label include that ID instead of
the daemon name, e.g.
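for illustration (hypothetical values), a metric that used to be exported with `ceph_daemon="rgw.myhost"` now shows up with something like `ceph_daemon="rgw.4305817"`, i.e. the numeric instance ID rather than the daemon name.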
Rishabh Dave [Mon, 7 Feb 2022 18:44:42 +0000 (00:14 +0530)]
monitoring: mention PyYAML only once in requirements
The following error occurs while running "sudo install-deps.sh":
ERROR: Double requirement given: PyYAML==6.0 (from -r requirements-lint.txt (line 5)) (already in pyyaml (from -r requirements-alerts.txt (line 1)), name='PyYAML')
PyYAML is listed twice as a requirement, once in each of the following files:
monitoring/ceph-mixin/requirements-lint.txt
monitoring/ceph-mixin/requirements-alerts.txt
Ilya Dryomov [Mon, 31 Jan 2022 13:08:26 +0000 (14:08 +0100)]
qa/suites/krbd: add legacy+rxbounce and crc+rxbounce coverage
For basic, rbd and rbd-nomount subsuites, replace legacy and crc
facets with "legacy or legacy+rxbounce" and "crc or crc+rxbounce"
facets (chosen at random).
For fsx, singleton and thrash subsuites, add legacy+rxbounce and
crc+rxbounce facets and drop prefer-crc facet. The expected behaviour
of the latter depends on cluster configuration and should be tested
separately.
Ilya Dryomov [Sat, 29 Jan 2022 14:01:27 +0000 (15:01 +0100)]
qa/suites/rbd: add cram-based mon command API test
With mon command definitions (from the rbd_support mgr module in this case)
generated automatically by the @CLI{Read,Write}Command decorators, it's
very easy to accidentally break the external-facing API.
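As a rough, self-contained sketch (a toy stand-in, not Ceph's actual decorator) of why the generated API is fragile: the externally visible mon command specification is derived directly from the Python signature, so a renamed parameter or a changed default silently changes the public API that a test like this pins down.
```
import inspect
from typing import Optional

def cli_command(prefix: str):
    """Toy stand-in for @CLI{Read,Write}Command: derive a command spec from the signature."""
    def wrap(func):
        args = []
        for name, param in inspect.signature(func).parameters.items():
            if name == 'self':
                continue
            args.append({'name': name,
                         'required': param.default is inspect.Parameter.empty})
        func.command_spec = {'prefix': prefix, 'args': args}
        return func
    return wrap

@cli_command('rbd mirror snapshot schedule list')
def schedule_list(self, level_spec: Optional[str] = None):
    pass

print(schedule_list.command_spec)
# {'prefix': 'rbd mirror snapshot schedule list',
#  'args': [{'name': 'level_spec', 'required': False}]}
```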
Ilya Dryomov [Sat, 29 Jan 2022 14:01:27 +0000 (15:01 +0100)]
mgr/rbd_support: level_spec is optional for schedule list/status
Commit fea6fdff4c74 ("mgr/rbd_support: level_spec passed to some
commands is not optional") is wrong. While it is true that a valid
level_spec is needed to create a LevelSpec instance, an empty string
is very much a valid level spec -- it signifies "all levels".
This wasn't caught because within Ceph these commands are wrapped by the
rbd CLI, which injects an empty string in get_level_spec_args().
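A minimal sketch of the intended behaviour (illustrative names only, not the actual rbd_support code): the argument can simply default to an empty string, which the handler interprets as "all levels".
```
def schedule_status(level_spec: str = '') -> dict:
    """An empty level_spec is valid and selects every pool, namespace and image."""
    if level_spec == '':
        scope = 'all levels'
    else:
        scope = level_spec
    return {'scope': scope}

print(schedule_status())            # {'scope': 'all levels'}
print(schedule_status('rbd/img1'))  # {'scope': 'rbd/img1'}
```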
Ilya Dryomov [Fri, 28 Jan 2022 22:01:08 +0000 (23:01 +0100)]
mgr/rbd_support: "trash remove" takes image_id_spec, not image_spec
Because of @CLIWriteCommand, the parameter name has to adhere to
the mon command API. Commit dcb51b067a49 ("mgr/rbd_support: define
commands using CLICommand") accidentally changed image_id_spec to
image_spec, breaking external users such as go-ceph.
Casey Bodley [Wed, 2 Feb 2022 19:06:20 +0000 (14:06 -0500)]
qa/rgw: tests run against ceph-quincy branch
target the ceph-quincy branch of s3tests, ragweed, and java_s3tests.
this commit targets the quincy branch specifically, rather than merging
to master and backporting
Before the patch, the test case showed unreliable behaviour dependent on
the underlying memory allocator: the bufferlist rebuild can be skipped,
leaving the number of buffers unchanged, if all of them already begin at
aligned addresses.
The commit fixes that by allocating a 4 KiB-aligned buffer and
offsetting it by a small constant (42) to ensure the memory added
to the bufferlist begins at a non-4-KiB-aligned address.
test/bufferlist: assert the rebuild in rebuild_aligned_size_and_memory() actually happens.
For the investigation of failures like the following one:
```
[ RUN ] BufferList.rebuild_aligned_size_and_memory
../src/test/bufferlist.cc:1865: Failure
Expected equality of these values:
bl.get_num_buffers()
Which is: 2
1
[ FAILED ] BufferList.rebuild_aligned_size_and_memory (0 ms)
```
The test case assumes the rebuild before the failed clause **always**
happens while `bufferlist::rebuild_aligned_size_and_memory()` skips it
if buffers are already aligned.
Neha Ojha [Fri, 21 Jan 2022 23:31:01 +0000 (23:31 +0000)]
qa/suites/rados: reduce the number of cephadm tests
Currently, every rados run of ~400 jobs is running ~150 cephadm tests,
which is unnecessary and redundant. With this change, we will run some
basic cephadm tests within the rados suite. The following seems to be
a good start.
John Mulligan [Thu, 20 Jan 2022 19:48:28 +0000 (14:48 -0500)]
cephadm: validate that the constructed YumDnf baseurl is usable
If the inputs to the `cephadm add-repo` command would result in an
invalid URL for repo metadata, fail the command early with a (somewhat)
helpful error.
Fixes: https://tracker.ceph.com/issues/46773
Signed-off-by: John Mulligan <jmulligan@redhat.com>
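A minimal sketch of the idea, assuming a conventional yum/dnf repo layout (the exact check cephadm performs may differ): build the baseurl, try to fetch its repo metadata, and turn any failure into an early, readable error.
```
import urllib.error
import urllib.request

def check_baseurl(baseurl: str) -> None:
    """Fail early if the constructed baseurl cannot serve repo metadata."""
    url = baseurl.rstrip('/') + '/repodata/repomd.xml'
    try:
        with urllib.request.urlopen(url, timeout=10):
            pass
    except (urllib.error.HTTPError, urllib.error.URLError) as exc:
        raise SystemExit(f'cannot fetch {url}: {exc}; '
                         'check the release/version passed to add-repo')
```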
John Mulligan [Thu, 20 Jan 2022 19:46:03 +0000 (14:46 -0500)]
cephadm: add a validate function to Packager
The validate function is for testing the inputs to the Packager
subclasses independently of writing the configuration to disk.
It only raises an exception upon failed validation.
Use it for the existing YumDnf validation exceptions.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
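A rough sketch of the structure (hypothetical class and error names, not the actual cephadm code): validation only inspects the inputs and raises, it never writes a repo file, so it can be invoked early by add-repo and unit-tested in isolation.
```
class RepoError(Exception):
    pass

class Packager:
    def validate(self) -> None:
        """Raise RepoError if the inputs cannot produce a usable repo; otherwise return."""
        pass  # the base class accepts anything

class YumDnf(Packager):
    def __init__(self, distro: str, major_version: str) -> None:
        self.distro = distro
        self.major_version = major_version

    def validate(self) -> None:
        if self.distro not in ('centos', 'rhel', 'fedora'):
            raise RepoError(f'unsupported distro for yum/dnf repos: {self.distro}')
        if not self.major_version.isdigit():
            raise RepoError(f'cannot build a baseurl from version {self.major_version!r}')
```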
Melissa Li [Tue, 11 Jan 2022 23:03:23 +0000 (18:03 -0500)]
mgr/cephadm/iscsi: use `mon_command` in `post_remove` instead of `check_mon_command`
Use `mon_command` instead of `check_mon_command` in `post_remove` to avoid errors: if the iscsi service is removed before the iscsi gateway list is updated, the cluster enters an error state and the iscsi removal gets stuck.
Fixes: https://tracker.ceph.com/issues/53706
Signed-off-by: Melissa Li <melissali@redhat.com>
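A hedged sketch of the difference (the command name and wiring below are assumptions, not necessarily what cephadm runs): `mon_command()` returns a `(retval, stdout, stderr)` tuple and leaves error handling to the caller, whereas `check_mon_command()` raises on a non-zero return code, which in `post_remove` turned a stale gateway list into a stuck removal.
```
from typing import Callable, Dict, Tuple

MonCommand = Callable[[Dict], Tuple[int, str, str]]

def remove_iscsi_gateway(mon_command: MonCommand, gateway: str) -> None:
    """Tolerate a failed gateway removal instead of raising and aborting."""
    ret, out, err = mon_command({
        'prefix': 'dashboard iscsi-gateway-rm',  # assumed command name
        'name': gateway,
    })
    if ret != 0:
        # check_mon_command() would have raised here; logging and moving on
        # lets the rest of the service removal proceed.
        print(f'warning: could not remove iscsi gateway {gateway}: {err}')
```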
John Mulligan [Tue, 18 Jan 2022 18:31:03 +0000 (13:31 -0500)]
mgr/cephadm: add a test for enabling cephfs mirroring module
Add a test that checks that when the cephfs-mirror service is enabled,
the mirroring mgr module gets enabled.
Actually-written-by: Sebastian Wagner <sewagner@redhat.com>
Signed-off-by: John Mulligan <jmulligan@redhat.com>
(cherry picked from commit bcb4fa70f9f739dce3e1c111db0a322804350f9d)
Adam King [Thu, 6 Jan 2022 22:01:34 +0000 (17:01 -0500)]
mgr/cephadm: still check agent deps if it is marked down
Right now, if an agent is down, _check_agent will return
without ever going on to check the deps or
scheduled actions for that agent. This causes a few issues.
For one, if an agent is marked down and then a mgr failover
happens, even if reconfiguring the agent would put it in a working
state (e.g. changing the target IP if the active mgr has moved),
we never try it because _check_agent just returns as soon as it
sees the agent is down. Additionally, if someone purposely tried
to schedule a redeploy of a down agent for whatever reason, we
would never make good on this action.
This change allows us to still carry out the normal checks and
scheduled actions even on down agents.
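A self-contained sketch of the control-flow change (illustrative names, not the actual cephadm code): note that the agent is down, but fall through to the dependency check and any scheduled redeploy instead of returning early.
```
from dataclasses import dataclass
from typing import List

@dataclass
class AgentState:
    host: str
    down: bool
    deps_changed: bool
    redeploy_scheduled: bool

def check_agent(agent: AgentState) -> List[str]:
    actions: List[str] = []
    if agent.down:
        # old behaviour: `return actions` here, so nothing below ever ran
        actions.append(f'agent on {agent.host} is down')
    # new behaviour: dependency changes and scheduled redeploys are still
    # honoured, so reconfiguring (e.g. a new mgr IP after failover) can
    # bring a down agent back.
    if agent.deps_changed or agent.redeploy_scheduled:
        actions.append(f'redeploy agent on {agent.host}')
    return actions
```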
I isolated all the test suites into their respective files
so that it is easier to add more tests in the future.
I also gave priority to the host actions.
The "create OSD" checks are now written so that OSDs
are created only on the intended hosts. This will make
the host draining process easier and less time-consuming.
Also tried to address the flaky force maintenance checks.
Removed some duplicated code.
The service creation part was improved to reduce the time taken
for its completion.
Ilya Dryomov [Sun, 23 Jan 2022 15:32:57 +0000 (16:32 +0100)]
qa/run_xfstests_qemu.sh: disable 251, 260 and 288
All three are skipped with virtio disks:
251 [not run] FITRIM not supported on /dev/vdc
260 [not run] FITRIM not supported on /dev/vdc
288 [not run] FITRIM not supported on /dev/vdc
But 260 and 288 fail with ide disks, where discard defaults to on. The
ancient kernel in our ubuntu-12.04.qcow2 doesn't support virtio discard
anyway so let's just disable them for consistency.
Ilya Dryomov [Fri, 21 Jan 2022 12:41:46 +0000 (13:41 +0100)]
rbd-mirror: fix races in snapshot-based mirroring deletion propagation
When remote image is deleted, rbd-mirror can encounter three cases:
1) no remote image id
2) no remote mirror metadata
3) MIRROR_IMAGE_STATE_DISABLING in remote mirror metadata
Commit d4c66ac5c615 ("rbd-mirror: fix issue with snapshot-based
mirroring deletion propagation") fixed case 1. Cases 2 and 3 remained
broken because for both of them finalize_snapshot_state_builder() would
populate not only remote_mirror_peer_uuid but also remote_image_id,
thus disabling ENOLINK logic in handle_prepare_remote_image() and
handle_bootstrap(). Commit ff60aec2d9ef ("rbd-mirror: fix bootstrap
sequence while the image is removed") touched on case 3, but it made
a difference only for journal-based mirroring.
Stop calling finalize_snapshot_state_builder() on errors. Instead,
align with journal-based mirroring by filling remote_mirror_peer_uuid
together with remote_mirror_uuid.
Make it clear that the local image non-primariness is asserted
independent of the mode; avoid the default implementation being
overridden but still relied on by both modes.
Ilya Dryomov [Wed, 19 Jan 2022 11:54:23 +0000 (12:54 +0100)]
rbd: add missing switch arguments for recognition by get_command_spec()
Currently this
$ rbd --all children img
doesn't work, while this
$ rbd children --all img
or this
$ rbd children img --all
does. The issue is that -a/--all isn't on the list of known switch
arguments. The "rbd children" example may seem contrived but for more
complicated commands such as "rbd device map" mixing switches and
positional arguments occurs naturally:
ee8887f4c0ff4f91117f31b621b95c8d08019130 was intended to add mpath device
support to ceph-volume but missed the lvm batch scenario.
This also fixes the zapping of mpath devices prepared with `ceph-volume raw`.
The recent changes from PR #43536 introduced a regression preventing
ceph-volume from running in a containerized context on Ubuntu 18.04.
The path of the `lvs` binary differs between CentOS 8 and Ubuntu 18.04
(`/usr/sbin/lvs` and `/sbin/lvs` respectively), which means that ceph-volume
running in a CentOS 8 container sees the `lvs` binary at `/usr/sbin/lvs` and
tries to run it with `nsenter` on the host, which is running Ubuntu 18.04.
Prashant D [Fri, 29 Oct 2021 13:09:24 +0000 (09:09 -0400)]
osd/OSD: Log aggregated slow ops detail to cluster logs
Slow requests can overwhelm the cluster log with every slow op in
detail and also fill up the monitor db. Instead, log slow op
details in an aggregated format.
Fixes: https://tracker.ceph.com/issues/52424
Signed-off-by: Prashant D <pdhange@redhat.com>
(cherry picked from commit 9319dc9273b36dc4f4bef1261b3c57690336a8cc)
Adam C. Emerson [Wed, 19 Jan 2022 21:49:05 +0000 (16:49 -0500)]
rgw: Report empty endpoints as error instead of crashing
Fixes: https://tracker.ceph.com/issues/53941
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
(cherry picked from commit 3c4a64ca040d3a0e0ddf762c391575498dc2a77f)
Fixes: https://tracker.ceph.com/issues/53973
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Mykola Golub [Fri, 14 Jan 2022 18:21:29 +0000 (18:21 +0000)]
cls/journal: skip disconnected clients when finding min_commit_position
When a new journal client is registered, all already registered
clients are checked, and the minimum commit position among them is
selected as the starting position for the new client. Thus we may
expect that, starting from the registered position, all journal
entries will be available (not trimmed) for the new client.
But when looking for the min commit position, the client_register
function did not take into account that a registered client might
be in the disconnected state, in which case the journal entries
might already be trimmed for that client.
Kamoltat [Wed, 12 Jan 2022 02:41:01 +0000 (02:41 +0000)]
pybind/mgr/progress: enforced try and except on accessing event dictionary
There is a race condition where an event gets deleted while
the progress module iterates through the ``events`` dictionary;
without a ``try``/``except``, this causes an unhandled exception
and crashes the module.
This commit enforces ``try``/``except`` in every part of the code
that accesses the ``events`` dictionary.
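A minimal sketch of the defensive pattern, assuming a dict-like ``events`` that another thread can mutate (names are illustrative, not the module's actual code):
```
from typing import Dict, List

def summarize_events(events: Dict[str, object]) -> List[str]:
    """Iterate over a snapshot of the keys and tolerate concurrent deletions."""
    lines: List[str] = []
    for ev_id in list(events):      # snapshot, so the loop itself is safe
        try:
            ev = events[ev_id]      # the event may already be gone
        except KeyError:
            continue                # completed/removed concurrently; skip it
        lines.append(f'{ev_id}: {ev}')
    return lines
```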