Ronen Friedman [Tue, 23 Aug 2022 05:12:18 +0000 (05:12 +0000)]
tests/osd: creating a Teuthology test re missing SnapMapper entries
The test (in the standalone/scrub suite) verifies that the scrubber
detects (and issues a cluster-log error) whenever a mapping entry
("SNA_") is missing in the SnapMapper DB.
Specifically, here the entry is corrupted - shortened as per
https://tracker.ceph.com/issues/56147.
Ronen Friedman [Mon, 1 Aug 2022 10:14:58 +0000 (10:14 +0000)]
osd/scrub: verify SnapMapper consistency
Whenever the scrubber accesses the SnapMapper for the snaps of a specific
clone, the mapper will now verify that the snaps have the required
mapping DB entries (the 'SNA_' keys).
Adam King [Mon, 22 Aug 2022 15:14:12 +0000 (11:14 -0400)]
mgr/cephadm: allow setting prometheus retention time
When we deploy Prometheus server, we don't provide any
ability to define the tsdb retention time - so it defaults to 15d.
This change adds a field that can be passed in a prometheus service
spec that will be passed as an arg to the --storage.tsdb.retention.time
parameter for the prometheus daemon.
Fixes: https://tracker.ceph.com/issues/54308
Signed-off-by: Adam King <adking@redhat.com>
David Galloway [Wed, 31 Aug 2022 18:21:16 +0000 (14:21 -0400)]
.github: Give folks 30 seconds to fill out the checklist
Otherwise GitHub sends an annoying e-mail right away when you file a PR that doesn't have the checklist filled out. It's easier IMO to create the PR, then check the boxes instead of putting Xes in brackets while filling out the PR comment.
Signed-off-by: David Galloway <dgallowa@redhat.com>
RGW - Zipper - Remove a number of casts from rgw_admin
There are still a ton of casts to RadosStore in rgw_admin. Remove the
easy ones. Many of the rest represent actual operations that are
specific to RadosStore, and need to be split out.
Signed-off-by: Daniel Gryniewicz <dang@redhat.com>
Redouane Kachach [Wed, 31 Aug 2022 11:49:37 +0000 (13:49 +0200)]
mgr/cephadm: Fix how we check if a host belongs to public network
Fixes: https://tracker.ceph.com/issues/57060
Signed-off-by: Redouane Kachach <rkachach@redhat.com>
J. Eric Ivancich [Tue, 23 Aug 2022 20:44:24 +0000 (16:44 -0400)]
rgw: remove dout_subsys defs from header files
Each compilation unit should be able to define its own dout_subsys
without generating a redefinition warning. When dout_subsys is defined
in header files, it complicates this matter. This commit removes the
definitions from header files and makes sure definitions are added to
.cc files as needed.
Additionally, at Adam Emerson's suggestion, use "static constexpr"
rather than "#define" to set "dout_subsys" in a few places as a
reminder to ultimately do it more broadly.
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
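The "static constexpr" form mentioned above can be sketched as follows. This is a minimal illustration, not code from the commit: the enum and its values are hypothetical stand-ins for Ceph's subsystem ids, and current_subsys() is a hypothetical helper standing in for a logging macro that reads dout_subsys. The point is only that a static constexpr variable in a .cc file has internal linkage and obeys normal scoping, so each compilation unit can define its own without the redefinition warnings a header-level #define can cause.

```cpp
#include <cassert>

// Hypothetical stand-ins for Ceph's subsystem ids (assumption: the real
// ids come from Ceph's own headers and have different names and values).
enum sketch_subsys_t { sketch_subsys_rgw = 1, sketch_subsys_rgw_sync = 2 };

// In a .cc file (not a header), instead of `#define dout_subsys ...`.
// Internal linkage: every compilation unit may define its own value.
static constexpr int dout_subsys = sketch_subsys_rgw;

// Hypothetical helper standing in for a logging macro that reads dout_subsys.
inline int current_subsys() { return dout_subsys; }

// Usage: the helper sees the value chosen by this compilation unit.
inline void dout_subsys_example() {
  assert(current_subsys() == sketch_subsys_rgw);
}
```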
Ilya Dryomov [Tue, 30 Aug 2022 09:45:44 +0000 (11:45 +0200)]
rbd-mirror: skip setting error code on snapshot replayer shutdown
This is regarding failures in unregister_remote_update_watcher() and
unregister_local_update_watcher(). handle_replay_complete() can't be
called in these cases anymore as it would blindly attempt to unregister
watchers from scratch again. Dropping handle_replay_complete() calls
there means that these failures would only be logged and would not be
surfaced by snapshot replayer. But the only caller ignores them
anyway:
  void ImageReplayer<I>::shut_down(int r) {
    ...
    // close the replayer
    if (m_replayer != nullptr) {
      ctx = new LambdaContext([this, ctx](int r) {
        m_replayer->destroy();
        m_replayer = nullptr;
        ctx->complete(0); <------
      });
      ctx = new LambdaContext([this, ctx](int r) {
        m_replayer->shut_down(ctx);
      });
    }
Paul Cuzner [Mon, 29 Aug 2022 23:54:00 +0000 (11:54 +1200)]
cephadm: Fix disk size calculation
With native 4k sectors, the logical blocksize is set to
4096, which yields a disk size 8x the size of the actual
device. According to kernel source, device size only
uses 512 byte sectors, so the use of logical blocksize
is unnecessary.
Fixes: https://tracker.ceph.com/issues/57335
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
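The arithmetic behind this fix can be sketched as below. This is an illustration only, not the cephadm code (which is Python); the function names and the sample sector count are hypothetical. It assumes, as the commit message states, that the kernel reports device size in 512-byte units regardless of the logical block size.

```cpp
#include <cassert>
#include <cstdint>

// Per the commit message, the kernel reports a block device's size in
// 512-byte units, independent of the device's logical block size.
constexpr std::uint64_t kSectorUnitBytes = 512;

// Hypothetical helper: the corrected calculation.
constexpr std::uint64_t disk_size_bytes(std::uint64_t sectors) {
  return sectors * kSectorUnitBytes;
}

// Hypothetical helper: the old calculation, which multiplied by the
// logical block size and so overstated 4k-native devices.
constexpr std::uint64_t disk_size_bytes_old(std::uint64_t sectors,
                                            std::uint64_t logical_block_size) {
  return sectors * logical_block_size;
}

// Usage: a device reporting 2097152 sectors is 1 GiB; with a 4096-byte
// logical block size the old formula reports 8x that.
inline void disk_size_example() {
  assert(disk_size_bytes(2097152) == 1073741824ULL);
  assert(disk_size_bytes_old(2097152, 4096) == 8 * disk_size_bytes(2097152));
}
```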
Ilya Dryomov [Wed, 24 Aug 2022 10:56:31 +0000 (12:56 +0200)]
rbd-mirror: resume pending shutdown on error in snapshot replayer
If a shutdown is requested, e.g. by update_pool_replayers() because
remote RADOS instance got blocklisted, and Replayer::shut_down() pends
it on completion of current snapshot sync, it gets stuck if replayer
encounters an error in the interim. This is particularly likely in the
blocklist case: a higher layer may detect that client got blocklisted
and request a shutdown first, and then when replayer sees EBLOCKLISTED
in turn, it calls handle_replay_complete() -- which does not resume
a pending shutdown. Because update_pool_replayers() blocks on shutdown
with Mirror::m_lock held, eventually the entire daemon hangs in
perpetuity.
The addition of unselectable prompts to these three files
completes the work begun in PR#47810 (d8064b4), which sought
to bring dashboard.rst into line with the unselectable prompt
standard introduced by Kefu Chai in 2020.
Ronen Friedman [Thu, 18 Aug 2022 15:27:47 +0000 (18:27 +0300)]
common: improving fmtlib handling of ceph::utime_t
1. fixing the output to show local-time instead of UTC format, matching
operator<<() handling (and all the rest of our logs)
2. adding a 'short' mode (as {:s}) for when, e.g. in most scrub logs,
we only need 3 digits for the sub-second, and do not need the
trailing TZ designation.
so we can use the formatter defined for `LogEntry` in fmtlib v9.
in this new version of fmtlib, it is required to define a specialization
for the formatted type even when it comes to the types with an override of
operator<<(). since we already have an override for `LogEntry`, let's define
the specialization for `fmt::formatter<LogEntry>`.
this change should address the FTBFS when building with fmtlib v9.
Kefu Chai [Sat, 27 Aug 2022 02:27:01 +0000 (10:27 +0800)]
common/Journald: include msg/msg_fmt.h
so we can use the formatter defined for `entity_name_t`. in fmtlib v9,
it is required to define a specialization for the formatted type even
if the type has an override of operator<<(). now that we already have a
formatter for `entity_name_t`, let's just use it.
this change should address the FTBFS when building with fmtlib v9.
Ilya Dryomov [Sat, 27 Aug 2022 09:09:00 +0000 (11:09 +0200)]
librbd: use actual monitor addresses when creating a peer bootstrap token
Relying on mon_host config option is fragile, as the user may confuse
v1 and v2 addresses, group them incorrectly, etc. Get mon_host value
only as a fallback.
Kefu Chai [Sat, 27 Aug 2022 15:46:00 +0000 (23:46 +0800)]
mon/MgrMonitor: do not propose again for "mgr fail"
in 23c3f76018b446fb77bbd71fdd33bddfbae9e06d, the change to fail the mgr
is proposed immediately. but `MgrMonitor::prepare_command()` method still
returns `true` in this case. its indirect caller,
`PaxosService::dispatch()`, considers this as a sign that it needs to
propose the change with `propose_pending()`. but the pending change has
already been proposed by `MgrMonitor::prepare_command()`, and
`have_pending` is also cleared by this call. as we don't allow
consecutive paxos proposals, the second `propose_pending()` call is
delayed with a configured latency. but when the timer is fired, this
postponed call would find itself trying to propose nothing. the change
to fail the mgr has been proposed. that's why we have
`ceph_assert(have_pending)` assertion failures.
in this change, the second proposal is skipped if the change has
already been proposed immediately. this should avoid the assertion
failure.
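The double-proposal sequence described above can be modelled with a toy state machine. All names here are hypothetical and do not mirror the real PaxosService or MgrMonitor classes; in particular, the toy counts an empty proposal instead of asserting, and it assumes the fix amounts to telling the caller not to propose again (the commit message does not spell out the mechanism).

```cpp
#include <cassert>

// Toy model only; names and structure are hypothetical.
struct ToyPaxosService {
  bool have_pending = false;  // is there a pending change to propose?
  int proposals = 0;          // proposals that carried a change
  int empty_proposals = 0;    // proposals attempted with nothing pending

  void propose_pending() {
    if (!have_pending) {      // the real code hits ceph_assert(have_pending)
      ++empty_proposals;
      return;
    }
    have_pending = false;
    ++proposals;
  }

  // Returns true when the caller is expected to propose the change.
  bool prepare_command(bool skip_second_proposal) {
    have_pending = true;
    propose_pending();             // "mgr fail" is proposed immediately
    return !skip_second_proposal;  // old behaviour: still returns true
  }

  void dispatch(bool skip_second_proposal) {
    if (prepare_command(skip_second_proposal)) {
      propose_pending();           // the second (delayed) proposal
    }
  }
};

// Usage: the old flow attempts an empty proposal, the new flow does not.
inline void toy_paxos_example() {
  ToyPaxosService before, after;
  before.dispatch(false);
  after.dispatch(true);
  assert(before.empty_proposals == 1);
  assert(after.empty_proposals == 0);
}
```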
Lucian Petrut [Fri, 26 Aug 2022 12:54:10 +0000 (12:54 +0000)]
include: fix IS_ERR on Windows
The "long" type is 32 bits wide on x64 Windows platforms, which means
it's not large enough to store a pointer. intptr_t or uintptr_t
should be used instead.
This change fixes include/err.h, using the right types. There was
a previous patch on this topic but unfortunately it didn't address
all the type casts.
This issue was brought up by the unittest_crush test, which recently
started to fail as the CrushWrapper methods use IS_ERR.
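The type issue can be illustrated with kernel-style error-pointer helpers. This is a sketch under assumptions, not the contents of include/err.h: the *_sketch names and the MAX_ERRNO_SKETCH constant are hypothetical. It shows why intptr_t and uintptr_t are the right types: both are pointer-sized on every platform, whereas long is only 32 bits on x64 Windows, so a cast through long would truncate a 64-bit pointer.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only; names are hypothetical, not taken from include/err.h.
constexpr std::intptr_t MAX_ERRNO_SKETCH = 4095;

// Encode a negative errno value in a pointer. intptr_t is pointer-sized
// everywhere, unlike long on x64 Windows.
inline void* err_ptr_sketch(std::intptr_t error) {
  return reinterpret_cast<void*>(error);
}

// Recover the errno value from an error pointer.
inline std::intptr_t ptr_err_sketch(const void* ptr) {
  return reinterpret_cast<std::intptr_t>(ptr);
}

// A pointer counts as an error if it lies in the top MAX_ERRNO_SKETCH
// addresses of the address space.
inline bool is_err_sketch(const void* ptr) {
  return reinterpret_cast<std::uintptr_t>(ptr) >=
         static_cast<std::uintptr_t>(-MAX_ERRNO_SKETCH);
}

// Usage: an ordinary pointer is not an error; an encoded errno is.
inline void is_err_example() {
  int value = 0;
  assert(!is_err_sketch(&value));
  assert(is_err_sketch(err_ptr_sketch(-22)));
  assert(ptr_err_sketch(err_ptr_sketch(-22)) == -22);
}
```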