Sage Weil [Mon, 15 Mar 2021 18:49:24 +0000 (14:49 -0400)]
Merge PR #39979 into master
* refs/pull/39979/head:
python-common: fix PlacementSpec target size method
python-common: count-per-host must be combined with label or hosts or host_pattern
mgr/cephadm: handle bare 'count-per-host:NNN', fix comments
mgr/cephadm/schedule: remove Scheduler abstraction (for now at least)
mgr/cephadm/schedule: calculate additions/removals in place()
mgr/cephadm/schedule: allow colocation of certain daemon types
mgr/cephadm/schedule: shuffle candidates, not final placements
mgr/cephadm/schedule: pass per-type allow_colo to the scheduler
mgr/cephadm/services/cephadmservice: fix typo
mgr/cephadm/schedule: pass daemons, not get_daemons_func
mgr/cephadm: use local var
mgr/cephadm/schedule: move host filtering into get_candidates()
python-common/ceph/deployment/service_spec: disallow max-per-host + explicit placement
mgr/cephadm/schedule: respect count-per-host
mgr/cephadm: adjust deployment logic to allow multiple daemons per host
python-common: add count-per-host to PlacementSpec
mgr/cephadm: do not worry about even # of monitors
Reviewed-by: Juan Miguel Olmo <jolmomar@redhat.com>
Reviewed-by: Sebastian Wagner <swagner@suse.com>
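To illustrate how count-per-host could feed into a placement's target size, here is a minimal sketch. The helper name `target_count` and its parameters are hypothetical and are not the actual PlacementSpec method; the sketch assumes an explicit count takes precedence and that otherwise every matching host gets `count_per_host` daemons (defaulting to one).

```python
from typing import List, Optional


def target_count(hosts: List[str],
                 count: Optional[int] = None,
                 count_per_host: Optional[int] = None) -> int:
    """Hypothetical helper: how many daemons a placement asks for."""
    if count is not None:
        # Assumption: an explicit total count wins over per-host math.
        return count
    # Otherwise each candidate host receives count_per_host daemons (1 if unset).
    return len(hosts) * (count_per_host or 1)
```

For example, three labelled hosts with count-per-host set to 2 would yield a target of six daemons under these assumptions.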
Sage Weil [Mon, 15 Mar 2021 16:55:36 +0000 (11:55 -0500)]
mgr/cephadm: tolerate failure to update daemon caps
If we're upgrading from 15.2.0, we may fail to update caps. Instead of
failing the upgrade hard, warn to the log and continue. This is less
than ideal, but the caps will get corrected the next time the daemon is
redeployed on the next upgrade, and most likely the previous caps will
continue to work (given they were presumably working before the upgrade).
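The warn-and-continue behaviour described above can be sketched roughly as follows. `CapsUpdateError` and `update_caps_tolerant` are hypothetical names used for illustration only; the real upgrade code may be structured quite differently.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


class CapsUpdateError(Exception):
    """Hypothetical error raised when the caps update is rejected."""


def update_caps_tolerant(update_caps: Callable[[str], None], entity: str) -> bool:
    """Try to update caps; on failure, log a warning and carry on.

    Returns True if the caps were updated, False if the failure was tolerated.
    """
    try:
        update_caps(entity)
        return True
    except CapsUpdateError as e:
        # Do not abort the upgrade: the old caps presumably still work and
        # will be corrected the next time the daemon is redeployed.
        logger.warning("Unable to update caps for %s: %s", entity, e)
        return False
```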
Or Ozeri [Tue, 9 Mar 2021 20:14:49 +0000 (22:14 +0200)]
librbd: crypto format api semantics change
This commit alters the semantics of the encryption format api
to also load the encryption after format completes.
Additionally, several other small changes in librbd crypto are included,
in preparation for supporting clone formatting.
David Zafman [Sat, 13 Mar 2021 05:56:28 +0000 (05:56 +0000)]
test: osd-recovery-scrub.sh: Test fails if no scrubs happened for a recovering pg
Change TEST_recovery_scrub_2 to create more objects and use
osd_recovery_sleep to prevent recovery from finishing before
we start to scrub. Verify that at least 1 scrub was started
while the pg was recovering.
Fixes: https://tracker.ceph.com/issues/49779
Signed-off-by: David Zafman <dzafman@redhat.com>
Kefu Chai [Fri, 12 Mar 2021 06:51:30 +0000 (14:51 +0800)]
osd: do not handle pre-octopus messages
MOSDPGQuery and MOSDPGInfo messages are sent by
pre-octopus OSDs, so in quincy and up clusters, we do not need
to handle them anymore, as we can only upgrade from octopus and
up to quincy.
we can drop MOSDPGNotify after Q + 2, though, after we stop sending
MOSDPGNotify in Q release.
Kefu Chai [Fri, 12 Mar 2021 07:09:12 +0000 (15:09 +0800)]
osd: send MOSDPGNotify2 instead of MOSDPGNotify
as we prefer sending MOSDPGNotify2 over MOSDPGNotify in PeeringState
post octopus, to be more consistent and have one less thing to
worry about, let's just use MOSDPGNotify2 in OSD.cc as well.
Kefu Chai [Fri, 12 Mar 2021 06:40:47 +0000 (14:40 +0800)]
osd: drop OSD::create_context()
OSD::create_context() was used for creating PeeringCtx from OSD's
require_osd_release. but since the check against require_osd_release
is not required anymore, let's drop this helper.
Kefu Chai [Fri, 12 Mar 2021 06:23:50 +0000 (14:23 +0800)]
osd/PeeringState: do not check for require_osd_release
before this change, we always checked require_osd_release when
creating MOSDPGNotify2 or MOSDPGNotify: if require_osd_release is
greater than or equal to octopus, MOSDPGNotify2 is created.
since we are in a post-quincy era, and we only need to upgrade from
octopus and up to quincy, there is no need to be compatible with
OSDs whose version is lower than octopus.
in this change, the check in `BufferedRecoveryMessages::send_notify()`
is dropped.
The test needs to scrub while recovery is in progress, so catching
recovery from the logs after the fact isn't the proper setup.
We can use the osd_recovery_sleep config option instead.
Sage Weil [Sat, 13 Mar 2021 16:34:43 +0000 (11:34 -0500)]
osd: propagate base pool application_metadata to tiers
If there is application metadata on the base pool, it should be mirrored
to any other tiers in the set. This aligns with the fact that the
'ceph osd pool application ...' commands refuse to operate on a non-base
pool.
This fixes problems with accessing tiers (e.g., cache tiers) when the
cephx cap is written in terms of application metadata.
Fixes: https://tracker.ceph.com/issues/49788
Signed-off-by: Sage Weil <sage@newdream.net>
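The mirroring of base-pool application metadata onto its tiers can be pictured with a small model. This is only a sketch: pools are represented as plain dicts with hypothetical `tier_of` and `application_metadata` keys, which is not how the OSDMap stores them.

```python
import copy
from typing import Dict, Optional


def propagate_application_metadata(pools: Dict[str, dict]) -> None:
    """Copy each base pool's application metadata onto its tiers, in place."""
    for pool in pools.values():
        base_name: Optional[str] = pool.get("tier_of")
        if base_name is None:
            continue  # this is a base pool (or an untiered pool)
        # Deep copy so later edits to the base do not alias the tier's dict.
        pool["application_metadata"] = copy.deepcopy(
            pools[base_name]["application_metadata"])
```

With a base pool tagged `rbd` and a cache tier pointing at it, the tier would end up carrying the same `rbd` tag, so a cephx cap written in terms of application metadata would match both.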
Kefu Chai [Fri, 12 Mar 2021 11:39:28 +0000 (19:39 +0800)]
cmake: do not build lockdep for Release build
lockdep creates large data structures in .bss and on the heap for tracking
the locks and their dependencies. but we don't need to pay for this
if lockdep is not enabled.
lockdep helps us track lock-dependency issues in Debug builds. in
Release builds, this feature hurts performance and, more importantly,
lockdep only kicks in when using mutex_debug and friends, which are
not used in Release builds at all.
so, after this change, lockdep is not built in Release builds, and
the static variables defined in lockdep.cc are no longer allocated
in Release builds.
Greg Farnum [Fri, 12 Mar 2021 22:41:03 +0000 (22:41 +0000)]
osd: PeeringState: implement an acting_set_writeable() function
Use it instead of direct checks against min_size and stretch_set_can_peer()
when deciding whether to go STATE_ACTIVE/STATE_PEERED or do updates
to things like last_epoch_started.
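A rough model of the combined check, assuming it amounts to "enough replicas for min_size, and the stretch rules are satisfied". The Python function below reuses the commit's name for readability but is a hypothetical stand-in, not the C++ member function; `state_after_peering` is likewise an illustrative name.

```python
def acting_set_writeable(acting_size: int, min_size: int,
                         stretch_can_peer: bool) -> bool:
    """Hypothetical stand-in: can the acting set accept writes?"""
    return acting_size >= min_size and stretch_can_peer


def state_after_peering(acting_size: int, min_size: int,
                        stretch_can_peer: bool) -> str:
    """Peering always finishes; only a writeable acting set goes active."""
    if acting_set_writeable(acting_size, min_size, stretch_can_peer):
        return "active"
    return "peered"
```

Outside stretch mode, `stretch_can_peer` would simply be true, so the decision reduces to the min_size check.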
Greg Farnum [Fri, 12 Mar 2021 20:13:38 +0000 (20:13 +0000)]
osd: PeeringState: fix stretch peering so PGs can go peered but not active
I misunderstood and there was a pretty serious error here: to prevent
accidents, choose_acting() was preventing PGs from *finishing* peering
if they didn't satisfy the stretch cluster rules. What we actually want
to do is to finish peering, but not go active.
Happily, this is easy to fix -- we just add a call to stretch_set_can_peer()
alongside existing min_size checks when we choose whether to go PG_STATE_ACTIVE
or PG_STATE_PEERED!
Greg Farnum [Thu, 11 Mar 2021 22:19:10 +0000 (22:19 +0000)]
osd: PeeringState: don't add acting-set OSDs to candidate set in stretch mode
We were adding them once from the acting set, and then once from the all_infos
set, and that hit an assert later on. (I think it was otherwise harmless, but
I don't want to weaken the assert!)
There was a major error here! get_ancestor() was type-deduced to return
a bucket_candidates_t -- a *copy* of what was in the map, not the reference
to it we wanted to actually amend!
Fix this by returning a pointer instead. There's a way to coerce things
to return a reference instead but the syntax seems clumsier to me
and I'm not familiar with it anyway -- this works just fine.
Greg Farnum [Thu, 11 Mar 2021 07:40:52 +0000 (07:40 +0000)]
osd: PeeringState: respect stretch peering constraints for async recovery
Happily this is pretty simple: we just need to check that the resulting
wanted set can peer, which we have a function for. Run it before actually
swapping the want and candidate_want sets.
If we're not in stretch mode, this is a cheap function call that
will always return true, so it's equivalent to what we already have for them.
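In sketch form, the guard looks like this. `pick_want` and the `can_peer` predicate are hypothetical names; the predicate stands in for the stretch peering check mentioned above.

```python
from typing import Callable, Set


def pick_want(want: Set[int], candidate_want: Set[int],
              can_peer: Callable[[Set[int]], bool]) -> Set[int]:
    """Adopt the async-recovery candidate set only if it can still peer."""
    if can_peer(candidate_want):
        return candidate_want
    return want
```

When not in stretch mode, `can_peer` would be something like `lambda s: True`, and the candidate set is always adopted, matching the previous behaviour.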
Greg Farnum [Thu, 11 Mar 2021 07:19:26 +0000 (07:19 +0000)]
osd: PeeringState: add a comment about using size as a proxy for activateable
When reviewing, I mistakenly thought we needed to skip a size check in
choose_acting() in case of mismatches between size and bucket counts, but that
is not accurate!
Sage Weil [Fri, 12 Mar 2021 20:21:49 +0000 (15:21 -0500)]
mgr: wait for ~3 beacons on startup if mons are pre-pacific
If we are going active and the mons are pre-pacific, they may have the
bug https://tracker.ceph.com/issues/49778 which prevents our modules'
metadata (including options) from being updated (until the next beacon).
Wait a bit (6s by default, 3x the 2s mgr_tick_period) to let this
happen.
This allows us to upgrade from broken pre-pacific mons using cephadm,
which may (if the original cluster is <15.2.5) immediately do a cephadm
migration that relies on the mgr/cephadm/migration_current config
option being present in the mon's mgrmap.
Workaround for https://tracker.ceph.com/issues/49778
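The arithmetic behind the default wait is simply three beacon intervals. A small sketch, where `startup_wait_seconds` and its parameters are hypothetical names and only the 2s tick period and the factor of 3 come from the message above:

```python
def startup_wait_seconds(mgr_tick_period: float = 2.0,
                         beacons: int = 3,
                         mons_pre_pacific: bool = True) -> float:
    """Hypothetical helper: how long to wait before going active."""
    if not mons_pre_pacific:
        return 0.0  # newer mons do not need the workaround
    return beacons * mgr_tick_period
```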
Sage Weil [Fri, 12 Mar 2021 20:00:49 +0000 (15:00 -0500)]
mon/MgrMonitor: populate available_modules from promote_standby()
This was done in the beacon path, where there is no active mgr and we
get a new entrant, but not for this case where an existing standby is
promoted to active.
This fixes a problem during upgrade where a new (standby) mgr's modules
have a new module option but it is not reflected immediately (not until
the next beacon).
Fixes: https://tracker.ceph.com/issues/49778
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Fri, 12 Mar 2021 16:15:35 +0000 (10:15 -0600)]
mgr/cephadm: fix get_keyring_with_caps
1- Pass caps to 'auth get-or-create'
2- Only try 'auth caps' if the get-or-create failed
Note that the 'auth caps' step can fail if upgrading from 15.2.0 since
'profile mgr' didn't include 'auth caps' until 15.2.1. We're not
addressing that for now...
Fixes: 7c0d532f3a4839f4199a13773fb5fa8b6fb3f183
Signed-off-by: Sage Weil <sage@newdream.net>
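The two-step flow described above could look roughly like the following. This is a sketch under assumptions: `mon_command` is a caller-supplied callable returning `(retcode, stdout, stderr)`, the command dict layout is illustrative, and the final 'auth get' to fetch the keyring after fixing the caps is an assumption not stated in the commit message.

```python
from typing import Callable, Dict, List, Tuple

MonCommand = Callable[[Dict[str, object]], Tuple[int, str, str]]


def get_keyring_with_caps(mon_command: MonCommand,
                          entity: str, caps: List[str]) -> str:
    # Step 1: pass the desired caps straight to 'auth get-or-create'.
    ret, keyring, err = mon_command(
        {"prefix": "auth get-or-create", "entity": entity, "caps": caps})
    if ret == 0:
        return keyring

    # Step 2: only if that failed (e.g. the entity exists with other caps),
    # try to update the caps explicitly.
    ret, _, err = mon_command(
        {"prefix": "auth caps", "entity": entity, "caps": caps})
    if ret != 0:
        raise RuntimeError(f"unable to update caps for {entity}: {err}")

    # Assumption: fetch the existing keyring once the caps are corrected.
    ret, keyring, err = mon_command({"prefix": "auth get", "entity": entity})
    if ret != 0:
        raise RuntimeError(f"unable to fetch keyring for {entity}: {err}")
    return keyring
```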
Kefu Chai [Fri, 12 Mar 2021 11:32:16 +0000 (19:32 +0800)]
cmake: do not build mutex_debug.cc if !WITH_CEPH_DEBUG_MUTEX
there is no need to build shared_mutex_debug.cc and
mutex_debug.cc, if they are not used at all. in Release build
we just use the mutex primitives offered by C++ standard library and
the POSIX API offered by libc.
Kefu Chai [Fri, 12 Mar 2021 04:02:22 +0000 (12:02 +0800)]
ceph.spec: build with system libpmem on fedora and el8
* build with WITH_SYSTEM_PMDK=ON on fedora, as f32 and f33 ship
libpmem1.8 and libpmem1.9 respectively. and we need libpmem v1.7
* build with WITH_SYSTEM_PMDK=ON on el8, as el8 and CentOS8 AppStream
ships libpmem v1.6,
quote from nvml.spec:
> By design, PMDK does not support any 32-bit architecture.
> Due to dependency on some inline assembly, PMDK can be compiled only
> on these architectures:
> - x86_64
> - ppc64le (experimental)
> - aarch64 (unmaintained, supporting hardware doesn't exist?)
so far, only x86_64 and ppc64le packages are built.
see also,
https://src.fedoraproject.org/rpms/nvml/blob/rawhide/f/nvml.spec
Kefu Chai [Fri, 12 Mar 2021 11:29:54 +0000 (19:29 +0800)]
cmake: make "WITH_CEPH_DEBUG_MUTEX" depend on CMAKE_BUILD_TYPE
this option is available only if CMAKE_BUILD_TYPE is Debug.
this change helps us to unify the checks for WITH_CEPH_DEBUG_MUTEX,
without this change, we always have to check both WITH_CEPH_DEBUG_MUTEX
*and* CMAKE_BUILD_TYPE.
after this change, we only respect WITH_CEPH_DEBUG_MUTEX.
Sebastian Wagner [Fri, 12 Mar 2021 11:04:54 +0000 (12:04 +0100)]
Merge pull request #39857 from adk3798/dup-labels
mgr/cephadm: remove duplicate labels when adding a host
Reviewed-by: Juan Miguel Olmo Martínez <jolmomar@redhat.com>
Reviewed-by: Michael Fritch <mfritch@suse.com>
Reviewed-by: Sebastian Wagner <sebastian.wagner@suse.com>
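One common way to drop duplicate labels while keeping their original order is shown below. `dedupe_labels` is a hypothetical helper for illustration; the actual change in the PR may use a different approach.

```python
from typing import List


def dedupe_labels(labels: List[str]) -> List[str]:
    """Remove repeated labels, preserving first-seen order."""
    # dict preserves insertion order (Python 3.7+), so keys act as an ordered set.
    return list(dict.fromkeys(labels))
```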
Sage Weil [Thu, 11 Mar 2021 21:56:52 +0000 (16:56 -0500)]
cephadm: use image id, not name, when inspecting for RepoDigests
The name is ambiguous, but the image_id is not! This fixes problems
during upgrade where upgrade thinks the container is upgraded (due to
an incorrect digest) when in fact it is not.
Fixes: 0826c45e0cb5d60fcf8cd71cd14edd34a6997cd4
Signed-off-by: Sage Weil <sage@newdream.net>
rgw::sal::RGWUser::list_buckets() uses the RGWBucketEnt constructor.
RGWBucketEnt::creation_time is initialized, but get_creation_time()
returns the uninitialized info.creation_time