From a5ebf417c6824803faad41d47d22a6f1c30ece39 Mon Sep 17 00:00:00 2001
From: Samuel Just
Date: Tue, 25 Oct 2022 21:46:24 -0700
Subject: [PATCH] doc/dev/osd_internals: add past_intervals.rst

Add explanation of past_intervals.

Signed-off-by: Samuel Just
Signed-off-by: Matan Breizman
(cherry picked from commit cd4c031e5e5f5b0318347a7957310cb7358380f6)
---
 doc/dev/osd_internals/past_intervals.rst | 92 ++++++++++++++++++++++++
 1 file changed, 92 insertions(+)
 create mode 100644 doc/dev/osd_internals/past_intervals.rst

diff --git a/doc/dev/osd_internals/past_intervals.rst b/doc/dev/osd_internals/past_intervals.rst
new file mode 100644
index 0000000000000..4672485932693
--- /dev/null
+++ b/doc/dev/osd_internals/past_intervals.rst
@@ -0,0 +1,92 @@
+=============
+PastIntervals
+=============
+
+Purpose
+-------
+
+There are two situations where we need to consider the set of all acting-set
+OSDs for a PG back to some epoch ``e``:
+
+ * During peering, we need to consider the acting set for every epoch back to
+   ``last_epoch_started``, the last epoch in which the PG completed peering and
+   became active.
+   (see :doc:`/dev/osd_internals/last_epoch_started` for a detailed explanation)
+ * During recovery, we need to consider the acting set for every epoch back to
+   ``last_epoch_clean``, the last epoch at which all of the OSDs in the acting
+   set were fully recovered, and the acting set was full.
+
+For either of these purposes, we could build such a set by iterating backwards
+from the current OSDMap to the relevant epoch. Instead, we maintain a
+PastIntervals structure for each PG.
+
+An ``interval`` is a contiguous sequence of OSDMap epochs where the PG mapping
+didn't change. The mapping includes the acting set, the up set, the
+primary, and several other parameters fully spelled out in
+PastIntervals::check_new_interval.
+
+Maintenance and Trimming
+------------------------
+
+The PastIntervals structure stores a record for each ``interval`` back to
+last_epoch_clean. On each new ``interval`` (see AdvMap reactions,
+PeeringState::should_restart_peering, and PeeringState::start_peering_interval)
+each OSD with the PG will add the new ``interval`` to its local PastIntervals.
+Activation messages to OSDs which do not already have the PG contain the
+sender's PastIntervals so that the recipient needn't rebuild it. (See
+needs_past_intervals in PeeringState::activate.)
+
+PastIntervals are trimmed in two places. First, when the primary marks the
+PG clean, it clears its past_intervals instance
+(PeeringState::try_mark_clean()). The replicas will do the same thing when
+they receive the info (see PeeringState::update_history).
+
+The second, more complex, case is in PeeringState::start_peering_interval. In
+the event of a "map gap", we assume that the PG actually has gone clean, but we
+haven't received a pg_info_t with the updated ``last_epoch_clean`` value yet.
+To explain this behavior, we need to discuss OSDMap trimming.
+
+OSDMap Trimming
+---------------
+
+OSDMaps are created by the Monitor quorum and gossiped out to the OSDs. The
+Monitor cluster also determines when OSDs (and the Monitors) are allowed to
+trim old OSDMap epochs. For the reasons explained above in this document, the
+primary constraint is that we must retain all OSDMaps back to some epoch such
+that all PGs have been clean at that or a later epoch (min_last_epoch_clean).
+(See OSDMonitor::get_trim_to).
+
+The Monitor quorum determines min_last_epoch_clean through MOSDBeacon messages
+sent periodically by each OSD. Each message contains a set of PGs for which
+the OSD is primary at that moment as well as the min_last_epoch_clean across
+that set. The Monitors track these values in OSDMonitor::last_epoch_clean.
+
+There is a subtlety in the min_last_epoch_clean value used by the OSD to
+populate the MOSDBeacon. OSD::collect_pg_stats invokes PG::with_pg_stats to
+obtain the lec value, which actually uses
+pg_stat_t::get_effective_last_epoch_clean() rather than
+info.history.last_epoch_clean. If the PG is currently clean,
+pg_stat_t::get_effective_last_epoch_clean() is the current epoch rather than
+last_epoch_clean -- this works because the PG is clean at that epoch and it
+allows OSDMaps to be trimmed during periods where OSDMaps are being created
+(due to snapshot activity, perhaps), but no PGs are undergoing ``interval``
+changes.
+
+Back to PastIntervals
+---------------------
+
+We can now understand our second trimming case above. If OSDMaps have been
+trimmed up to epoch ``e``, we know that the PG must have been clean at some epoch
+>= ``e`` (indeed, **all** PGs must have been), so we can drop our PastIntervals.
+
+This dependency also pops up in PeeringState::check_past_interval_bounds().
+PeeringState::get_required_past_interval_bounds takes as a parameter
+oldest_epoch, which comes from OSDSuperblock::max_oldest_map. We use
+max_oldest_map rather than a specific OSD's oldest_map because we don't
+necessarily trim all the way up to MOSDMap::oldest_map. To avoid doing too
+much work at once, we limit the number of osdmaps trimmed using
+``osd_target_transaction_size`` in OSD::trim_maps().
+For this reason, a specific OSD's oldest_map can lag OSDSuperblock::max_oldest_map
+for a while.
+
+See https://tracker.ceph.com/issues/49689 for an example.
-- 
2.39.5