Greg Farnum [Tue, 30 Nov 2021 18:29:46 +0000 (18:29 +0000)]
osd: Check range_blocklist in is_blocklisted(): we actually blocklist ranges
Carry a parallel map from cidr addresses to a new
range_bits class (stored entirely as ephemeral state) so that we
don't need to re-compute masks and bit mappings too often, and to
separate out the unpleasant ipv6 bit mapping logic. Then check
against those with range_bits::matches() the same way we check
for equality on specific-entity matches. Nice and simple loops!
Greg Farnum [Tue, 2 Nov 2021 00:38:50 +0000 (00:38 +0000)]
osdmap: convert get_blocklist() to provide the entity/IP and range blocklists
Providing a non-range-aware blocklist accessor would just be
asking for trouble, so don't.
The ugly part of this is how the Objecter is currently just
throwing the range blocklist on the end of its own list. The in-tree
callers are okay with this, and I'd like to look at removing the
blocklist events API from librados entirely -- it exposes "OSD-only"
state to clients and, as evidenced by this patch series, is not
particularly stable.
Greg Farnum [Wed, 8 Dec 2021 21:32:58 +0000 (21:32 +0000)]
mon: take blocklist ranges as a subcommand, not implicitly from address format
I discovered in testing with CephFS that this tends to interpret client IPs
(which don't have ports, but do have nonces) as invalid ranges. So give it
a separate input keyword that has to be applied first.
Greg Farnum [Mon, 25 Oct 2021 19:53:04 +0000 (19:53 +0000)]
msg: common: allow entity_addr_t to store a CIDR address range
This required very little change to the existing code. Use with care, because
existing code expects an IP address instead of a range, but it saves on
writing a new parser.
Greg Farnum [Tue, 2 Nov 2021 00:34:34 +0000 (00:34 +0000)]
mds: Server: Simplify apply_blocklist and usage of the OSDMap's blocklist
This previoulsly re-implemented a bunch of the OSDMap::is_blocklisted()
function, and wasn't actually any faster to run -- the list of new blocklists
may be smaller than the full set, but OSDMap::blocklist is an unordered_map
of constant lookup time so it shouldn't slow things down. More importantly,
this is much simpler, less likely to be buggy from duplicate code, and lets
the MDS off the hook for dealing with range blocklisting.
Greg Farnum [Mon, 1 Nov 2021 23:52:53 +0000 (23:52 +0000)]
client: Simplify blocklist tracking and interface
I'm not sure if the blocklist events tracking in Client.cc was ever
the simplest way to track that state, but it definitely isn't now. We
can just hand our addr_vec to the OSDMap and ask it -- it handles
version compatibility issues and, happily, means the Client doesn't
need to learn to deal with ranges directly.
In `master` the milestone step exits and causes remaining tasks not to be run. I previously tried with the `continue-on-error` flag, but it didn't work, so let's try putting that steps at the end.
Laura Flores [Mon, 16 May 2022 22:59:42 +0000 (17:59 -0500)]
qa/suites/rados/thrash-erasure-code-big/thrashers: add `osd max backfills` setting to mapgap and pggrow
All `rados/thrash-erasure-code-big` tests that die due to the “wait_for_recovery” timeout have one thing in common: They contain either `thrashers/pggrow` or `thrashers/mapgap`.
The difference between pggrow and mapgap vs. all other non-offending thrashers (default, careful, fastread, and morepggrow) is that they lack an override setting for `osd max backfills`. `osd max backfills` is the max number of backfill operations allowed to/from an OSD. The higher the number, the quicker the recovery. By default, this value is 1. On all of the non-offending thrashers (default, careful, fastread, and morepggrow), the default 1 value gets overridden in their .yaml files with a value > 1. This is not the case for pggrow and mapgap, however, as they lack an `osd max backfills` override setting.
The mclock op scheduler is known to override `osd max backfills` with a high value, but all of the thrash-erasure-code-big thrashers have their op queue set to “debug_random”, which chooses randomly between op queues (the debug_random op queue is set to override the default mclock_scheduler in qa/config/rados.yaml). So, coupled with the “debug_random” op queue, the low `osd max backfill` setting is causing some tests to time out in recovery.
WITHOUT `osd max backfills`, as they are now, “mapgap” and “pggrow” tests die due to timed-out recovery about 17/100 times, as seen here with a pggrow test: http://pulpito.front.sepia.ceph.com/lflores-2022-05-18_14:24:29-rados:thrash-erasure-code-big-master-distro-default-smithi/
WITH `osd max backfills` specified, as I have suggested in this PR, 99/100 tests passed, with one test failing for a different reason:
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_22:40:27-rados:thrash-erasure-code-big-master-distro-default-smithi/
I also scheduled 145 tests WITH `osd max backfills` that are a mix of pggrow and mapgap thrashers. 144/145 tests passed, with one test failing for a different reason. http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_15:27:54-rados:thrash-erasure-code-big-master-distro-default-smithi/
Fixes: https://tracker.ceph.com/issues/51076 Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit 40062676c2ceed49b9fa147127ffa83ba6118e2a)
Adam King [Fri, 1 Apr 2022 12:20:28 +0000 (08:20 -0400)]
mgr/cephadm: make UpgradeState from_json a bit safer
This way, for downgrades to whatever versions
this lands in onward, having added new parameters to
UpgradeState shouldn't break anything. Can't do much
about downgrades to older versions from this one
but this should help in the future.
Adam King [Mon, 28 Mar 2022 16:10:15 +0000 (12:10 -0400)]
mgr/cephadm: split _do_upgrade into sub functions
This function was around 500 lines and difficult to work
with. Splitting it into sub functions should hopefully make
it a bit easier to understand and make changes to.
Volker Theile [Tue, 10 May 2022 13:25:54 +0000 (15:25 +0200)]
cephadm: prometheus: The generatorURL in alerts is only using hostname
Prometheus is currently using only the hostname in the 'generatorURL' of an alert which causes issues when clicking on the URL in the Ceph Dashboard or somewhere else, because in most cases the hostname of the node that is running the Prometheus container is not resolvable.
To fix that the command line argument '--web.external-url' must be appended in the systemd unit file of the Prometheus container, e.g. '--web.external-url http://foo.bar:9095' whereas a FQDN hostname is used.
Volker Theile [Mon, 9 May 2022 13:31:15 +0000 (15:31 +0200)]
mgr/dashboard: Creating and editing Prometheus AlertManager silences is buggy
When creating a new monitoring silence the form is pre-filled with the wrong alert data. It is always used the alert data from the very first object in the list of the API response but not the specified alert identified by the 'fingerprint' property.
The same problem applies to editing silences. The selected silence is not edited, it's always the first one in the list returned API response but not that with the specified 'id' property.
The main problem of the origin implementation is that the Prometheus Alertmanager API endpoints /api/v1/[alerts/silences] do not support querying. To fix that, filtering is done in the frontend.
Zac Dover [Wed, 18 May 2022 10:36:53 +0000 (20:36 +1000)]
doc/start: s/3/three/ in intro.rst
I'm changing "3" to "three" for two reasons:
1. It's correct.
2. This allows me to test backports into Octopus, Pacific, and Quincy.
I am particularly interested to see what happens when I attempt
the backport into Octopus, because backports into Octopus have
failed. This will provide me with another unit of data.
Adam King [Thu, 18 Nov 2021 20:22:39 +0000 (15:22 -0500)]
mgr/cephadm: re-use old ip when re-adding hosts if necessary
When a host is re-added without an explicit ip we can default to the old
ip we had stored for the host rather than either keeping the loopback
address or throwing an exception. We only want to actually error when
the only options left are error or use a resolved loopback address
Redouane Kachach [Tue, 17 May 2022 15:26:39 +0000 (17:26 +0200)]
mgr/cephadm: stripping out / from the end of the url Fixes: https://tracker.ceph.com/issues/55638 Signed-off-by: Redouane Kachach <rkachach@redhat.com>
(cherry picked from commit 17032f6be22e9efc3e199d7e35091025bfaae965)
mgr/cephadm: do not add _admin label when no-minimize-config is provided Fixes: https://tracker.ceph.com/issues/52727 Signed-off-by: Redouane Kachach <rkachach@redhat.com>
(cherry picked from commit 01c8999d0354a71a7ef8526aab9b39e30d67c1bb)
Moritz Röhrich [Mon, 21 Mar 2022 16:32:25 +0000 (17:32 +0100)]
cephadm: avoid crashing on expected non-zero exit
- Avoid crashing when a call out to an external program expectedly does
not return exit status zero.
There are programs that communicate other information than error/no
error through exit status. E.g. `systemctl status` will return different
exit codes depending on the actual status of the units in question.
In cases where this is expected crashing with a RuntimeError exception
is inappropriate and should be avoided.
Fixes: https://tracker.ceph.com/issues/55117 Signed-off-by: Moritz Röhrich <moritz.rohrich@suse.com>
(cherry picked from commit a02be6f22fa18094cd8758700ab74581b6ce1701)
Cory Snyder [Tue, 17 May 2022 09:24:53 +0000 (05:24 -0400)]
mgr/ActivePyModules.cc: fix cases where GIL is held while attempting to lock mutex
The mgr process can deadlock if the GIL is held while attempting to lock a mutex.
Relevant regressions were introduced in commit a356bac. This fixes those regressions
and also cleans up some unnecessary yielding of the GIL.
ceph-volume/tests: reject loop devices in lvm.conf
The current task doesn't works (typo?).
Otherwise api/lvm.py can't work properly, functions such as
`get_single_lv()` and many other don't return the expected results.
Indeed, lvm is confused because of the nvme_loop setup.
This adds the support of complex OSD creation with command
`orch daemon add osd`.
Any argument supported by `DriveGroupSpec()` can be passed on the command line.
Cephadm shouldn't try to deploy a disk reported as unavailable by ceph-volume.
The idea here is to check the rejection reason so we can still use DB devices
in case of OSD replacement.
Sage Weil [Thu, 12 Aug 2021 15:12:59 +0000 (11:12 -0400)]
ceph-volume: activate: try simple mode too
This is of dubious value to cephadm since /etc/ceph/osd/* won't be
populated inside of a conatiner. However, it makes sense from a purely
ceph-volume perspective.
Sage Weil [Thu, 5 Aug 2021 16:02:22 +0000 (12:02 -0400)]
ceph-volume: lvm activate: infer bluestore or filestore
No need to require --filestore and/or --bluestore args since we can tell
from the LV tags which one it is.
We can't drop the arguments without breaking existing users, though, so
redefine them to mean *force* bluesetore or filestore activation (even
though this will error out if the tags don't match).
dparmar18 [Fri, 25 Mar 2022 08:18:54 +0000 (13:48 +0530)]
doc/cephfs/add-remove-mds: added cephadm note, refined "Adding an MDS"
Description: 1) Add a note about using cephadm for setting up the
cluster and mds(s), also mention the use of ceph
orchestrator if one needs to setup mds(s) manually.
2) Changed the term `data point` to `directory` in
point 1 under "Adding an MDS" section for better
clarity.
Cory Snyder [Tue, 17 May 2022 09:24:53 +0000 (05:24 -0400)]
mgr/ActivePyModules.cc: fix cases where GIL is held while attempting to lock mutex
The mgr process can deadlock if the GIL is held while attempting to lock a mutex.
Relevant regressions were introduced in commit a356bac. This fixes those regressions
and also cleans up some unnecessary yielding of the GIL.
Adam Kupczyk [Fri, 29 Apr 2022 21:32:43 +0000 (23:32 +0200)]
kv/RocksDBStore: Remove feature to make WholeSpaceIterator based on bounded iterator
Iterator-bounding feature is introduced to make RocksDB iterators limited, so they
would less likely traverse over tombstones.
This is used when listing keys in fixed range, for example OMAPS for specific object.
It is problematic when extending this logic to WholeSpaceIterator,
since prefix must be taken into account.
Fixes: https://tracker.ceph.com/issues/55444 Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
Adds a precondition to RocksDBStore::get_cf_handle(string, IteratorBounds)
to avoid duplicating logic of the only caller (RocksDBStore::get_iterator).
Assertions will fail if preconditions are not met.
bluestore: add config option to allow rocksdb iterator bounds to be disabled
Add osd_rocksdb_iterator_bounds_enabled config option to allow rocksdb iterator bounds to be disabled.
Also includes minor refactoring to shorten code associated with IteratorBounds initialization in bluestore.
Signed-off-by: Cory Snyder <csnyder@iland.com>
(cherry picked from commit ca3ccd9)
Conflicts:
src/common/options/osd.yaml.in
Cherry-pick notes:
- Conflicts due to option definition in common/options.cc in Pacific vs. common/options/osd.yaml.in in later releases
bluestore: set upper and lower bounds on rocksdb omap iterators
Limits RocksDB omap Seek operations to the relevant key range of the object's omap.
This prevents RocksDB from unnecessarily iterating over delete range tombstones in
irrelevant omap CF shards. Avoids extreme performance degradation commonly caused
by tombstones generated from RGW bucket resharding cleanup. Also prefer CFIteratorImpl
over ShardMergeIteratorImpl when we can determine that all keys within specified
IteratorBounds must be in a single CF.