There was a difference between master and pacific.
The hwe kernel modification for Ubuntu 20.04 should be done
only for cephadm tests. Modifying `qa/distros/all/ubuntu_20.04.yaml` broke
many tests.
The recent changes from PR #43536 introduced a regression that prevents
ceph-volume from running in a containerized context on Ubuntu 18.04.
The path to the `lvs` binary differs between CentOS 8 and Ubuntu 18.04
(`/usr/sbin/lvs` and `/sbin/lvs` respectively). As a result, ceph-volume running
in the CentOS 8 container sees the `lvs` binary at `/usr/sbin/lvs` and tries to
run it with `nsenter` on the host, which is running Ubuntu 18.04.
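One way to avoid carrying a container-side path over to the host is to pass only the bare binary name, so that the lookup happens against the host filesystem once `nsenter` has entered the host mount namespace. The sketch below only builds an argument list. The function name `host_command` and the namespace file path are hypothetical, and this is not necessarily the fix that was merged.

```python
from typing import List

# Hypothetical location of the host's mount namespace as seen from the container.
HOST_MNT_NS = "/rootfs/proc/1/ns/mnt"


def host_command(binary: str, args: List[str]) -> List[str]:
    """Build an nsenter command line that runs `binary` in the host mount namespace.

    Only the bare name is passed on, so the binary is looked up via PATH on the
    host filesystem (/sbin/lvs on Ubuntu 18.04, /usr/sbin/lvs on CentOS 8).
    """
    name = binary.rsplit("/", 1)[-1]  # drop any container-side directory
    return ["nsenter", f"--mount={HOST_MNT_NS}", name, *args]


# The container resolved lvs to /usr/sbin/lvs, but only "lvs" reaches the host.
print(host_command("/usr/sbin/lvs", ["--noheadings"]))
```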
I isolated all the test suites into their respective files
so that it is easier to add more tests in the future.
I also gave priority to the host actions.
Create OSD checks are now written so that OSDs
are created only on the intended hosts. This will make
the host draining process easier and less time consuming.
Also tried to address the flaky force maintenance checks.
Removed some duplicated code.
The service creation part was improved to reduce the time taken
for its completion.
Sage Weil [Mon, 6 Dec 2021 15:19:16 +0000 (10:19 -0500)]
mgr/cephadm: allow activation of OSDs that have previously started
When this code was introduced way back in ea987a0e56db106f7c76d11f86b3e602257f365e,
for some reason I was focused only on freshly created OSDs. The
get_osd_uuid_map() helper is used by deploy_osd_daemons_for_existing_osds()
which is called not only by OSD creation but also by 'ceph cephadm
osd activate', which is meant to instantiate daemons for existing OSD
devices (e.g., devices that were reattached to a new server, or whose
/var/lib/ceph/$fsid/osd.$id directory was lost for some other reason).
However, if we ignore OSDs with up_from > 0, then we can't recreate a
daemon instance for such existing OSDs--arguably the most important ones,
since they may hold real data.
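A minimal sketch of the filtering problem described above, assuming OSD map entries are dicts with `osd`, `uuid` and `up_from` keys. The function `osd_uuid_map` and its `only_new` flag are hypothetical stand-ins, not the actual signature of get_osd_uuid_map().

```python
from typing import Any, Dict, List


def osd_uuid_map(osds: List[Dict[str, Any]], only_new: bool = False) -> Dict[str, str]:
    """Map OSD id -> uuid from a list of OSD map entries.

    With only_new=True, OSDs that have started before (up_from > 0) are
    skipped, which is the overly narrow behavior described above.
    """
    result: Dict[str, str] = {}
    for osd in osds:
        if only_new and osd.get("up_from", 0) > 0:
            continue
        result[str(osd["osd"])] = osd.get("uuid", "")
    return result


osds = [
    {"osd": 0, "uuid": "aaaa", "up_from": 0},   # freshly created, never started
    {"osd": 1, "uuid": "bbbb", "up_from": 42},  # has run before, may hold data
]

print(osd_uuid_map(osds, only_new=True))  # osd.1 is missing, so it cannot be activated
print(osd_uuid_map(osds))                 # both OSDs are available for activation
```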
Paul Cuzner [Fri, 12 Nov 2021 03:16:59 +0000 (16:16 +1300)]
mgr/cephadm: Add snmp-gateway service support
Add a new snmp-gateway service to provide a bridge between
Prometheus and an SNMP management platform. The gateway
service uses https://github.com/maxwo/snmp_notifier to provide
SNMP V2c and SNMP V3 support.
The SNMP V3 support mandates at least authentication, and also
offers authentication and privacy (encryption).
Fixes: https://tracker.ceph.com/issues/52920
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
(cherry picked from commit c2f5e105ca4870b2cb124db662537c20e6daadae)
Ilya Dryomov [Tue, 11 Jan 2022 12:13:01 +0000 (13:13 +0100)]
qa/run_xfstests_qemu.sh: harden against wget failures
If wget fails (e.g. due to a certificate issue), it still creates
an empty file. Then this file is marked executable, ./"${SCRIPT}"
immediately returns 0 and run_xfstests_qemu.sh exits successfully
without running a single xfstest.
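The script itself is shell, so the following Python sketch only illustrates the kind of check that closes this gap: refuse to mark a download executable unless it exists and is non-empty. The helper name `mark_executable_if_nonempty` is hypothetical.

```python
import os
import stat


def mark_executable_if_nonempty(path: str) -> None:
    """Refuse to mark an empty or missing download as executable."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        raise RuntimeError(f"download of {path} failed or produced an empty file")
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)  # add the owner-execute bit only
```

Failing loudly at this point keeps an empty script from being "run" and reported as a passing job.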
This started on Sep 30, 2021 with the expiration of Let's Encrypt
root certificate -- all qemu jobs with "test: qa/run_xfstests_qemu.sh"
just booted the VM for a couple of seconds and reported success.
Ilya Dryomov [Fri, 7 Jan 2022 12:31:08 +0000 (13:31 +0100)]
test/librbd: make diff-iterate clone tests exercise fast-diff mode
The fast-diff feature wasn't propagated to the clone so these tests
were exercising the slow list_snaps path no matter what RBD_FEATURES
value was supplied to ceph_test_librbd.
Ilya Dryomov [Wed, 5 Jan 2022 19:24:40 +0000 (20:24 +0100)]
librbd: restore diff-iterate include_parent functionality in fast-diff mode
Commit 4429ed4f3f4c ("librbd: switch diff iterate API to use new snaps
list dispatch methods") removed the recursive execute() call. The new
list_snaps method does indeed handle parent diffs internally but it is
not used in fast-diff mode. Nothing changed there -- we still need to
load the parent object map, calculate parent object_diff_state, etc.
Ilya Dryomov [Tue, 4 Jan 2022 19:38:35 +0000 (20:38 +0100)]
librbd: diff-iterate reports incorrect offsets in fast-diff mode
If rbd_diff_iterate2() is called on an image offset that doesn't
correspond to an object boundary, the callback is invoked with an
incorrect image offset. For example, assuming a fully allocated
image, a diff request for 806354944~57344 results in offs=807403520,
len=57344, exists=true invocation, which is ahead by 1048576 bytes.
This occurs only in fast-diff mode. For a diff request on an image
with the fast-diff feature disabled, or if the whole_object parameter is
set to false, the invocation is correct.
This bug goes back to the introduction of fast-diff mode in commit 6d5b969d4206 ("librbd: add diff_iterate2 to API").
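The numbers in the example can be checked with a little arithmetic. The sketch below assumes 4 MiB objects, which is not stated above. Under that assumption the reported offset is ahead by exactly the request's offset within its object. This is an observation about the numbers, not a description of the actual code path.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumption: 4 MiB objects; not stated above

requested_off = 806354944
reported_off = 807403520

object_no = requested_off // OBJECT_SIZE        # index of the object the request lands in
offset_in_object = requested_off % OBJECT_SIZE  # distance past that object's boundary
discrepancy = reported_off - requested_off      # how far ahead the callback offset is

print(object_no, offset_in_object, discrepancy)  # prints: 192 1048576 1048576
```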
Adam King [Fri, 19 Nov 2021 00:43:35 +0000 (19:43 -0500)]
mgr/orchestrator: add filtering and count option for orch host ls
Filter orch host ls output for only hosts whose name
contains a certain substring or that have a certain label.
Add a count flag that causes the command to return the number
of hosts found (either overall or matching the substring and/or
label) instead of a list of all the matching hosts.
Fixes: https://tracker.ceph.com/issues/47774
Fixes: https://tracker.ceph.com/issues/53452
Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit edd9bf38c3f07f5fdb6714e7f66515820c736d2e)
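The filtering and counting described in this commit can be sketched as follows. The `Host` class, the `host_ls` function and its parameter names are hypothetical and do not reflect the orchestrator's real interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Host:
    hostname: str
    labels: List[str] = field(default_factory=list)


def host_ls(hosts: List[Host],
            name_contains: Optional[str] = None,
            label: Optional[str] = None,
            count: bool = False) -> Union[int, List[Host]]:
    """Return matching hosts, or only their number when count=True."""
    matches = [
        h for h in hosts
        if (name_contains is None or name_contains in h.hostname)
        and (label is None or label in h.labels)
    ]
    return len(matches) if count else matches


hosts = [
    Host("ceph-node-1", ["_admin", "mon"]),
    Host("ceph-node-2", ["osd"]),
    Host("gateway-1", ["rgw"]),
]
print(host_ls(hosts, name_contains="ceph", count=True))
print([h.hostname for h in host_ls(hosts, label="osd")])
```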
John Mulligan [Fri, 10 Dec 2021 13:16:19 +0000 (08:16 -0500)]
python-common: make count & count-per-host >= 1 checks consistent
The previous version of the validate function had an incorrect error
statement that suggested the count must be >1 when it should have
been >=1. This confusion was possibly due to using "n < 1" on
one line and "n <= 0" on another line. Since both values are supposed
to be integers this change corrects the error message and makes
the comparisons on the lines both use "n < 1" (since I find it easier
to see that the check "n < 1" is the inverse of the error text
asserting "n >= 1").
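A minimal sketch of the consistent form, with both checks written as "n < 1" next to an error text asserting "n >= 1". The function name and the exact message wording are illustrative, not copied from the validate function.

```python
from typing import Optional


def validate_placement_counts(count: Optional[int],
                              count_per_host: Optional[int]) -> None:
    """Reject non-positive values; None means the field was not set."""
    if count is not None and count < 1:
        raise ValueError(f"count must be >= 1 (got {count})")
    if count_per_host is not None and count_per_host < 1:
        raise ValueError(f"count-per-host must be >= 1 (got {count_per_host})")


validate_placement_counts(1, None)  # the boundary value 1 is accepted
try:
    validate_placement_counts(0, None)
except ValueError as e:
    print(e)
```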
John Mulligan [Wed, 8 Dec 2021 20:37:11 +0000 (15:37 -0500)]
python-common: add unit test func for invalid yaml inputs
I didn't find a preexisting test function for this so I added a
new test that is fed yaml snippets and expected error messages.
This verifies some of the recently added validation for
count and count_per_host under the placement spec.
Paul Cuzner [Wed, 3 Nov 2021 02:24:20 +0000 (15:24 +1300)]
mgr/prometheus: Update rule format and enhance SNMP support
Rules now adhere to the format defined by Prometheus.io.
This changes alert naming, and each alert now includes a
summary description to provide a quick one-liner.
In addition to reformatting, some missing alerts for MDS and
cephadm have been added, along with corresponding tests.
The MIB has also been refactored, so it now passes standard
lint tests, and a README is included for devs to understand the
OID schema.
Fixes: https://tracker.ceph.com/issues/53111
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
Paul Cuzner [Tue, 19 Oct 2021 00:07:02 +0000 (13:07 +1300)]
mgr/prometheus: add test cases and validation using tox
Focus all tests inside a tests directory, and use pytest/tox to
perform validation of the overall content. tox tests also use
promtool if available to provide rule checks and unittest runs.
In addition to these checks, a validate_rules script provides
format and content checks against all rules. It is also
called via tox (but can be run independently too).
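As a rough illustration of what a format and content check over alert rules might look like, the sketch below inspects rules already loaded as dicts in the usual Prometheus layout (`alert`, `expr`, `labels`, `annotations`). Which fields are treated as required here is an assumption, and `check_rule` is a hypothetical name, not part of the validate_rules script.

```python
from typing import Any, Dict, List

# Assumption: these annotations are treated as mandatory for every alert.
REQUIRED_ANNOTATIONS = ("summary", "description")


def check_rule(rule: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a single alert rule."""
    problems: List[str] = []
    name = rule.get("alert", "<unnamed>")
    if "alert" not in rule:
        problems.append("rule has no alert name")
    if not rule.get("expr"):
        problems.append(f"{name}: missing expr")
    if "severity" not in rule.get("labels", {}):
        problems.append(f"{name}: missing severity label")
    for key in REQUIRED_ANNOTATIONS:
        if not rule.get("annotations", {}).get(key):
            problems.append(f"{name}: missing {key} annotation")
    return problems


rule = {
    "alert": "ExampleAlert",  # hypothetical alert, for illustration only
    "expr": "up == 0",
    "labels": {"severity": "warning"},
    "annotations": {"description": "An example target is down."},
}
print(check_rule(rule))
```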
Paul Cuzner [Thu, 16 Sep 2021 23:24:29 +0000 (11:24 +1200)]
mgr/prometheus: track individual healthchecks as metrics
This patch creates a health history object maintained in
the module's kvstore. The history and current health
checks are used to create a metric per healthcheck whilst
also providing a history feature. Two new commands are added:
ceph healthcheck history ls
ceph healthcheck history clear
In addition to the new commands, the additional metrics
have been used to update the prometheus alerts.
Fixes: https://tracker.ceph.com/issues/52638
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
(cherry picked from commit e0dfc02063ef40cf6a1dc6e3080d0a856ceff050)
Conflicts:
doc/mgr/prometheus.rst
- Adopting doc with master.
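The idea of a health history feeding one metric per healthcheck can be sketched in memory as follows. The class, its methods and the metric name `example_health_check` are all hypothetical, and persistence to the module's kvstore is left out.

```python
import time
from typing import Dict, Iterable, List, Set


class HealthHistory:
    """Track which health checks have ever fired and which are active now."""

    def __init__(self) -> None:
        self.first_seen: Dict[str, float] = {}
        self.count: Dict[str, int] = {}
        self.active: Set[str] = set()

    def update(self, current: Iterable[str]) -> None:
        """Record newly raised checks and replace the active set."""
        current_set = set(current)
        now = time.time()
        for name in current_set - self.active:  # checks raised since the last update
            self.first_seen.setdefault(name, now)
            self.count[name] = self.count.get(name, 0) + 1
        self.active = current_set

    def metrics(self) -> List[str]:
        """One line per check ever seen: 1 if active now, 0 otherwise."""
        return [
            f'example_health_check{{name="{n}"}} {1 if n in self.active else 0}'
            for n in sorted(self.first_seen)
        ]

    def clear(self) -> None:
        """Forget the history, as a 'history clear' command would."""
        self.first_seen.clear()
        self.count.clear()
        self.active = set()


history = HealthHistory()
history.update(["OSD_DOWN"])
history.update([])  # the check cleared, but it stays in the history
print(history.metrics())
```

Keeping cleared checks in the history with a value of 0 lets an alert rule see a check resolve rather than simply disappear.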
Nizamudeen A [Thu, 30 Dec 2021 08:28:58 +0000 (13:58 +0530)]
mgr/dashboard: stabilizing the cephadm dashboard e2e
Reorder the tests and add some more tests to verify that the cluster is
healthy before proceeding to complex tasks like maintenance and
drain host.
Nizamudeen A [Mon, 20 Dec 2021 09:14:29 +0000 (14:44 +0530)]
mgr/dashboard: fix timeout error in dashboard cephadm e2e job
1. Fix the timeout error happening in the dashboard e2e job
2. Take care of the flaky force maintenance check
Most of the time our test times out while searching for an item
in the table. That is because `.clear().type()` sometimes fails to clear
the content of the search field, so the wrong text ends up
in the search field and the search runs against this
wrong name. To avoid this, I am explicitly clearing the search area
before typing.
I removed the `02-hosts-inventory.e2e` file because it duplicates
one of the tests in the `01-hosts.e2e` file, and fixed the error
from that file.
Also, in the inventory Identify test, we check that an element is not
visible. According to the latest cypress docs, this should be not.exist
instead of not.visible, since the cd-modal will not even be present in
the DOM.