Nizamudeen A [Wed, 1 Jun 2022 07:40:14 +0000 (13:10 +0530)]
mgr/dashboard: fix drain e2e failure
Cypress sometimes fails to register the click, which prevents the
deselect/select from happening properly. Deselecting the row immediately
after performing the action makes the test pass.
Nizamudeen A [Sat, 29 Jan 2022 16:55:28 +0000 (22:25 +0530)]
mgr/dashboard: BDD approach for the dashboard cephadm e2e
Files under the directory cypress/integration/common/* will contain
common specs which can be used on all the .feature files. We can change
the common directory to cypress/integration/* from the package.json, but
if we do that now then we'll need to take care of all the absolute
imports in that path. So for now at least that's not a good choice.
A bug in cypress-browserify-preprocessor prevents it from picking up
our tsconfig.json file, which forced me to go with relative imports
rather than absolute imports. We'll need to wait for this to be fixed
before changing all our tests to BDD.
Zac Dover [Wed, 1 Jun 2022 14:44:02 +0000 (00:44 +1000)]
doc: (squash) adding pacific.rst to toctree
Signed-off-by: Zac Dover <zac.dover@gmail.com>
doc/start: update "memory" in hardware-recs.rst
This PR corrects some usage errors in the "Memory" section
of the hardware-recommendations.rst file. It also closes
some opened but never closed parentheses.
This adds the security/ directory from the main branch.
This is done so that all references in the pacific.rst
file find destinations. This means that Sphinx will
recognize the document as coherent and that Sphinx will
permit it to build.
Signed-off-by: Zac Dover <zac.dover@gmail.com>
doc: (squash) add security/index.rst to toctree
Signed-off-by: Zac Dover <zac.dover@gmail.com>
doc: remove :confvals: from bluestore-config-ref
Signed-off-by: Zac Dover <zac.dover@gmail.com>
doc: (squash) linking to s3-feature-table
Signed-off-by: Zac Dover <zac.dover@gmail.com>
doc: (squash) repair refs to cephfs-top
Signed-off-by: Zac Dover <zac.dover@gmail.com>
doc: (squash) fix link to snap-schedule
Signed-off-by: Zac Dover <zac.dover@gmail.com>
doc: (squash) fix link to ceph-dokan
Signed-off-by: Zac Dover <zac.dover@gmail.com>
doc: (squash) fixing active-releases link
Signed-off-by: Zac Dover <zac.dover@gmail.com>
doc: testing security/index toctree link
afd8be7eac5e996c3bd07656601a4534053e2516 broke it.
It has dropped `block_wal` and `block_db` from
`ceph_volume.devices.raw.activate.activate_bluestore` but
`activate.main.Activate.main` still passes those arguments when
calling `RAWActivate([]).activate()`.
Note that we're making this change because (1) it allows db/wal and (2)
there are no known direct users of 'raw activate'. The only known user
goes through 'ceph-volume activate', and we've fixed that caller in this commit.
Zac Dover [Wed, 1 Jun 2022 14:54:40 +0000 (00:54 +1000)]
doc/rados: update bluestore-config-ref.rst
This PR updates bluestore-config-ref.rst so that
other PRs that refer to material in it can be
backported.
In order to ensure the coherence of this document,
all :confval: declarations have been removed. The
module that interprets those is called ceph_confval
and is available only in Quincy.
mgr/dashboard: introduce memory and cpu usage for daemons
Fixes: https://tracker.ceph.com/issues/55218
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
Co-authored-by: Aashish Sharma <aasharma@redhat.com>
Introducing 2 new columns in Cluster->Host->Daemons table for Memory and CPU usage.
Conflicts:
src/pybind/mgr/cephadm/module.py
- _process_ls_output() doesn't exist in pacific as agent isn't yet backported. So similar changes
need to be done in serve.py instead.
In `master` the milestone step exits and causes remaining tasks not to be run. I previously tried with the `continue-on-error` flag, but it didn't work, so let's try putting that step at the end.
Laura Flores [Mon, 16 May 2022 22:59:42 +0000 (17:59 -0500)]
qa/suites/rados/thrash-erasure-code-big/thrashers: add `osd max backfills` setting to mapgap and pggrow
All `rados/thrash-erasure-code-big` tests that die due to the “wait_for_recovery” timeout have one thing in common: They contain either `thrashers/pggrow` or `thrashers/mapgap`.
The difference between pggrow and mapgap vs. all other non-offending thrashers (default, careful, fastread, and morepggrow) is that they lack an override setting for `osd max backfills`. `osd max backfills` is the max number of backfill operations allowed to/from an OSD. The higher the number, the quicker the recovery. By default, this value is 1. On all of the non-offending thrashers (default, careful, fastread, and morepggrow), the default 1 value gets overridden in their .yaml files with a value > 1. This is not the case for pggrow and mapgap, however, as they lack an `osd max backfills` override setting.
The mclock op scheduler is known to override `osd max backfills` with a high value, but all of the thrash-erasure-code-big thrashers have their op queue set to “debug_random”, which chooses randomly between op queues (the debug_random op queue is set to override the default mclock_scheduler in qa/config/rados.yaml). So, coupled with the “debug_random” op queue, the low `osd max backfills` setting is causing some tests to time out in recovery.
WITHOUT `osd max backfills`, as they are now, “mapgap” and “pggrow” tests die due to timed-out recovery about 17/100 times, as seen here with a pggrow test: http://pulpito.front.sepia.ceph.com/lflores-2022-05-18_14:24:29-rados:thrash-erasure-code-big-master-distro-default-smithi/
WITH `osd max backfills` specified, as I have suggested in this PR, 99/100 tests passed, with one test failing for a different reason:
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_22:40:27-rados:thrash-erasure-code-big-master-distro-default-smithi/
I also scheduled 145 tests WITH `osd max backfills` that are a mix of pggrow and mapgap thrashers. 144/145 tests passed, with one test failing for a different reason. http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_15:27:54-rados:thrash-erasure-code-big-master-distro-default-smithi/
Fixes: https://tracker.ceph.com/issues/51076
Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit 40062676c2ceed49b9fa147127ffa83ba6118e2a)
Adam King [Fri, 1 Apr 2022 12:20:28 +0000 (08:20 -0400)]
mgr/cephadm: make UpgradeState from_json a bit safer
This way, for downgrades to whatever versions
this lands in onward, having added new parameters to
UpgradeState shouldn't break anything. Can't do much
about downgrades to older versions from this one
but this should help in the future.
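One way to make a `from_json` constructor tolerate fields it does not know about is to drop unknown keys before calling the constructor. The sketch below illustrates that idea only; the field names (`target_name`, `progress_id`, `paused`) and the filtering approach are assumptions for illustration, not the actual cephadm code.

```python
import inspect
from typing import Any, Dict, Optional


class UpgradeState:
    """Hypothetical, simplified stand-in for cephadm's UpgradeState."""

    def __init__(self, target_name: str, progress_id: str, paused: bool = False) -> None:
        self.target_name = target_name
        self.progress_id = progress_id
        self.paused = paused

    @classmethod
    def from_json(cls, data: Optional[Dict[str, Any]]) -> Optional["UpgradeState"]:
        if not data:
            return None
        # Keep only the keys that the constructor of *this* version accepts,
        # so state written by a newer version does not raise a TypeError.
        accepted = inspect.signature(cls).parameters
        filtered = {k: v for k, v in data.items() if k in accepted}
        return cls(**filtered)


# State saved by a (hypothetical) newer version with an extra field.
saved = {"target_name": "img", "progress_id": "abc", "future_field": 1}
state = UpgradeState.from_json(saved)
```

With this pattern, adding a parameter to the class later does not break a downgrade to any version that already contains the filtering.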
Adam King [Mon, 28 Mar 2022 16:10:15 +0000 (12:10 -0400)]
mgr/cephadm: split _do_upgrade into sub functions
This function was around 500 lines and difficult to work
with. Splitting it into sub functions should hopefully make
it a bit easier to understand and make changes to.
Volker Theile [Tue, 10 May 2022 13:25:54 +0000 (15:25 +0200)]
cephadm: prometheus: The generatorURL in alerts is only using hostname
Prometheus is currently using only the hostname in the 'generatorURL' of an alert which causes issues when clicking on the URL in the Ceph Dashboard or somewhere else, because in most cases the hostname of the node that is running the Prometheus container is not resolvable.
To fix that, the command line argument '--web.external-url' must be appended in the systemd unit file of the Prometheus container, e.g. '--web.external-url http://foo.bar:9095', where an FQDN is used as the hostname.
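As a rough illustration of how such an argument could be assembled, the sketch below builds the flag from the node's fully qualified name. The helper name `external_url_args` is hypothetical and the default port is taken from the example above; this is not the cephadm implementation.

```python
import socket
from typing import List, Optional


def external_url_args(port: int = 9095, fqdn: Optional[str] = None) -> List[str]:
    """Return the extra Prometheus arguments (hypothetical helper)."""
    # socket.getfqdn() asks the resolver for the fully qualified name of
    # the local host; callers may pass an explicit name instead.
    host = fqdn or socket.getfqdn()
    return ["--web.external-url", f"http://{host}:{port}"]


args = external_url_args(fqdn="foo.bar")
```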
Volker Theile [Mon, 9 May 2022 13:31:15 +0000 (15:31 +0200)]
mgr/dashboard: Creating and editing Prometheus AlertManager silences is buggy
When creating a new monitoring silence, the form is pre-filled with the wrong alert data. The alert data from the very first object in the list of the API response is always used, rather than the alert identified by the 'fingerprint' property.
The same problem applies to editing silences. The selected silence is not the one edited; it is always the first one in the list returned by the API, rather than the one with the specified 'id' property.
The main problem of the original implementation is that the Prometheus Alertmanager API endpoints /api/v1/[alerts/silences] do not support querying. To fix that, filtering is done in the frontend.
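The client-side filtering described above amounts to scanning the full list for the entry whose key matches, instead of taking the first element. The sketch below shows the idea in Python for illustration only (the dashboard frontend itself is written in TypeScript); the helper name `find_by_key` and the sample data are hypothetical.

```python
from typing import Any, Dict, List, Optional


def find_by_key(items: List[Dict[str, Any]], key: str, value: Any) -> Optional[Dict[str, Any]]:
    """Return the first item whose `key` equals `value`, or None."""
    for item in items:
        if item.get(key) == value:
            return item
    return None


# Made-up API response: the wanted alert is NOT the first entry.
alerts = [
    {"fingerprint": "aaa", "labels": {"alertname": "first"}},
    {"fingerprint": "bbb", "labels": {"alertname": "second"}},
]
selected = find_by_key(alerts, "fingerprint", "bbb")
```

The same helper covers silences by matching on 'id' instead of 'fingerprint'.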
Zac Dover [Wed, 18 May 2022 10:36:53 +0000 (20:36 +1000)]
doc/start: s/3/three/ in intro.rst
I'm changing "3" to "three" for two reasons:
1. It's correct.
2. This allows me to test backports into Octopus, Pacific, and Quincy.
I am particularly interested to see what happens when I attempt
the backport into Octopus, because backports into Octopus have
failed. This will provide me with another unit of data.
Adam King [Thu, 18 Nov 2021 20:22:39 +0000 (15:22 -0500)]
mgr/cephadm: re-use old ip when re-adding hosts if necessary
When a host is re-added without an explicit ip we can default to the old
ip we had stored for the host rather than either keeping the loopback
address or throwing an exception. We only want to actually error when
the only options left are to error or to use a resolved loopback address.
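The fallback order described above can be sketched as a small decision function. Everything here is an assumption for illustration (the function name, its parameters, and the exception type); it is not the cephadm code, and it assumes the addresses are already IP literals.

```python
import ipaddress
from typing import Optional


def pick_host_addr(explicit: Optional[str],
                   resolved: Optional[str],
                   stored: Optional[str]) -> str:
    """Choose the address for a re-added host (hypothetical helper)."""

    def usable(addr: Optional[str]) -> bool:
        # An address is usable if it is set and is not a loopback address.
        return bool(addr) and not ipaddress.ip_address(addr).is_loopback

    if explicit:
        return explicit      # the user gave an address: always honor it
    if usable(resolved):
        return resolved      # name resolution produced a real address
    if usable(stored):
        return stored        # fall back to the address stored earlier
    # Only now is the choice between erroring and using a loopback address.
    raise ValueError("host resolves to a loopback address and no stored address exists")


addr = pick_host_addr(None, "127.0.0.1", "10.0.0.5")
```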
Redouane Kachach [Tue, 17 May 2022 15:26:39 +0000 (17:26 +0200)]
mgr/cephadm: stripping out / from the end of the url
Fixes: https://tracker.ceph.com/issues/55638
Signed-off-by: Redouane Kachach <rkachach@redhat.com>
(cherry picked from commit 17032f6be22e9efc3e199d7e35091025bfaae965)
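Stripping a trailing slash from a URL is typically a one-line string operation. The sketch below shows one way to do it; the helper name and the example URL are hypothetical and not taken from the commit.

```python
def strip_trailing_slash(url: str) -> str:
    """Remove any trailing '/' characters from a URL (hypothetical helper)."""
    # str.rstrip removes every trailing occurrence, so "host//" is handled too.
    return url.rstrip("/")


cleaned = strip_trailing_slash("http://host.example:3000/")
```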
mgr/cephadm: do not add _admin label when no-minimize-config is provided
Fixes: https://tracker.ceph.com/issues/52727
Signed-off-by: Redouane Kachach <rkachach@redhat.com>
(cherry picked from commit 01c8999d0354a71a7ef8526aab9b39e30d67c1bb)
Moritz Röhrich [Mon, 21 Mar 2022 16:32:25 +0000 (17:32 +0100)]
cephadm: avoid crashing on expected non-zero exit
- Avoid crashing when a call out to an external program expectedly does
not return exit status zero.
There are programs that communicate other information than error/no
error through exit status. E.g. `systemctl status` will return different
exit codes depending on the actual status of the units in question.
In cases where this is expected, crashing with a RuntimeError exception
is inappropriate and should be avoided.
Fixes: https://tracker.ceph.com/issues/55117
Signed-off-by: Moritz Röhrich <moritz.rohrich@suse.com>
(cherry picked from commit a02be6f22fa18094cd8758700ab74581b6ce1701)
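A common way to express "this non-zero exit status is expected" is to let the caller pass the set of acceptable exit codes and raise only for anything else. The sketch below illustrates that pattern with the standard library; the helper name `run_tolerant` and its `allowed_codes` parameter are assumptions, not cephadm's actual interface.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def run_tolerant(cmd: List[str],
                 allowed_codes: Sequence[int] = (0,)) -> Tuple[str, str, int]:
    """Run cmd; raise only if the exit code is not in allowed_codes."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode not in allowed_codes:
        raise RuntimeError(f"{cmd!r} failed with exit code {proc.returncode}")
    return proc.stdout, proc.stderr, proc.returncode


# Stand-in for a status-style command that exits non-zero without this
# being an error (a Python child process is used so the sketch runs anywhere).
child = [sys.executable, "-c", "import sys; sys.exit(3)"]
out, err, code = run_tolerant(child, allowed_codes=(0, 3))
```

The caller then inspects `code` to learn the reported status, instead of handling an exception.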
Cory Snyder [Tue, 17 May 2022 09:24:53 +0000 (05:24 -0400)]
mgr/ActivePyModules.cc: fix cases where GIL is held while attempting to lock mutex
The mgr process can deadlock if the GIL is held while attempting to lock a mutex.
Relevant regressions were introduced in commit a356bac. This fixes those regressions
and also cleans up some unnecessary yielding of the GIL.
ceph-volume/tests: reject loop devices in lvm.conf
The current task doesn't work (typo?).
Otherwise api/lvm.py can't work properly; functions such as
`get_single_lv()` and many others don't return the expected results.
Indeed, lvm is confused because of the nvme_loop setup.
This adds the support of complex OSD creation with command
`orch daemon add osd`.
Any argument supported by `DriveGroupSpec()` can be passed on the command line.
Cephadm shouldn't try to deploy a disk reported as unavailable by ceph-volume.
The idea here is to check the rejection reason so we can still use DB devices
in case of OSD replacement.
Sage Weil [Thu, 12 Aug 2021 15:12:59 +0000 (11:12 -0400)]
ceph-volume: activate: try simple mode too
This is of dubious value to cephadm since /etc/ceph/osd/* won't be
populated inside of a container. However, it makes sense from a purely
ceph-volume perspective.
Sage Weil [Thu, 5 Aug 2021 16:02:22 +0000 (12:02 -0400)]
ceph-volume: lvm activate: infer bluestore or filestore
No need to require --filestore and/or --bluestore args since we can tell
from the LV tags which one it is.
We can't drop the arguments without breaking existing users, though, so
redefine them to mean *force* bluestore or filestore activation (even
though this will error out if the tags don't match).
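The inference plus "force" semantics described above could look roughly like the sketch below. It assumes, purely for illustration, that a filestore OSD can be recognized by a journal-related LV tag; the tag name `ceph.journal_uuid`, the function name, and the exception type are assumptions and may not match the real ceph-volume code.

```python
from typing import Dict, List, Optional


def infer_objectstore(lv_tags: List[Dict[str, str]],
                      force: Optional[str] = None) -> str:
    """Infer 'bluestore' or 'filestore' from LV tags (hypothetical helper)."""
    # Assumption: only filestore OSDs carry a non-empty journal tag.
    has_journal = any(tags.get("ceph.journal_uuid") for tags in lv_tags)
    inferred = "filestore" if has_journal else "bluestore"
    # A forced type that contradicts the tags is an error, mirroring the
    # redefined meaning of --bluestore / --filestore.
    if force is not None and force != inferred:
        raise RuntimeError(f"forced {force} but LV tags indicate {inferred}")
    return inferred


store = infer_objectstore([{"ceph.osd_id": "0"}])
```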
dparmar18 [Fri, 25 Mar 2022 08:18:54 +0000 (13:48 +0530)]
doc/cephfs/add-remove-mds: added cephadm note, refined "Adding an MDS"
Description: 1) Add a note about using cephadm for setting up the
cluster and mds(s), also mention the use of ceph
orchestrator if one needs to set up mds(s) manually.
2) Changed the term `data point` to `directory` in
point 1 under "Adding an MDS" section for better
clarity.