From: Patrick Donnelly
Date: Fri, 4 Nov 2022 12:56:49 +0000 (-0400)
Subject: doc: add MDS treatise on segments
X-Git-Tag: v19.0.0~760^2
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=bc75366b9c336b94022be6466954993f0e281fee;p=ceph-ci.git

doc: add MDS treatise on segments

Signed-off-by: Patrick Donnelly
---

diff --git a/doc/cephfs/mds-config-ref.rst b/doc/cephfs/mds-config-ref.rst
index 5b68053a05e..a0080ec2acb 100644
--- a/doc/cephfs/mds-config-ref.rst
+++ b/doc/cephfs/mds-config-ref.rst
@@ -17,8 +17,6 @@
 .. confval:: mds_early_reply
 .. confval:: mds_default_dir_hash
 .. confval:: mds_log_skip_corrupt_events
-.. confval:: mds_log_max_events
-.. confval:: mds_log_max_segments
 .. confval:: mds_bal_sample_interval
 .. confval:: mds_bal_replicate_threshold
 .. confval:: mds_bal_unreplicate_threshold
diff --git a/doc/cephfs/mds-journaling.rst b/doc/cephfs/mds-journaling.rst
index 92b32bb456d..b6ccf27c8c0 100644
--- a/doc/cephfs/mds-journaling.rst
+++ b/doc/cephfs/mds-journaling.rst
@@ -88,3 +88,71 @@ Following are various event types that are journaled by the MDS.
 #. `EVENT_TABLESERVER`: Log transition states of MDSs view of server tables (snap/anchor).
 
 #. `EVENT_UPDATE`: Log file operations on an inode.
+
+#. `EVENT_SEGMENT`: Log a new journal segment boundary.
+
+#. `EVENT_LID`: Mark the beginning of a journal without a logical subtree map.
+
+Journal Segments
+----------------
+
+The MDS journal is composed of logical segments, called LogSegments in the
+code. These segments are used to collect metadata updates by multiple events
+into one logical unit for the purposes of trimming. Whenever the journal tries
+to commit metadata operations (e.g. flush a file create out as an omap update
+to a dirfrag object), it does so in a replayable batch of updates from the
+LogSegment. The updates must be replayable in case the MDS fails during the
+series of updates to various metadata objects. The reason the updates are
+performed in batch is to group updates to the same metadata object (a dirfrag)
+where multiple omap entries are probably updated in the same time period.
+
+Once a segment is trimmed, it is considered "expired". An expired segment is
+eligible for deletion by the journaler as all of its updates are flushed to the
+backing RADOS objects. This is done by updating the "expire position" of the
+journaler to advance past the end of the expired segment. Some expired segments
+may be kept in the journal to improve cache locality when the MDS restarts.
+
+For most of CephFS's history (up to 2023), the journal segments were delineated
+by subtree maps, the ``ESubtreeMap`` event. The major reason for this is that
+journal recovery must start with a copy of the subtree map before replaying any
+other events.
+
+Now, log segments can be delineated by events which are a ``SegmentBoundary``.
+These include ``ESubtreeMap``, ``EResetJournal``, ``ESegment`` (2023), or
+``ELid`` (2023). For ``ESegment``, this light-weight segment boundary allows
+the MDS to journal the subtree map less frequently while also keeping the
+journal segments small to keep trimming events short. In order to maintain the
+constraint that the first event journal replay sees is the ``ESubtreeMap``,
+those segments beginning with that event are considered "major segments" and a
+new constraint was added to the deletion of expired segments: the first segment
+of the journal must always be a major segment.
+
+The ``ELid`` event exists to mark the MDS journal as "new" where a logical
+``LogSegment`` and log sequence number is required for other operations to
+proceed, in particular the MDSTable operations. The MDS uses this event when
+creating a rank or shutting it down. No subtree map is required when replaying
+the rank from this initial state.
+
+
+Configurations
+--------------
+
+The targeted size of a log segment in terms of number of events is controlled by:
+
+.. confval:: mds_log_events_per_segment
+
+The frequency of major segments (noted by the journaling of the latest ``ESubtreeMap``) is controlled by:
+
+.. confval:: mds_log_major_segment_event_ratio
+
+When ``mds_log_events_per_segment * mds_log_major_segment_event_ratio``
+non-``ESubtreeMap`` events are logged, the MDS will journal a new
+``ESubtreeMap``. This is necessary to allow the journal to shrink in size
+during the trimming of expired segments.
+
+The target maximum number of segments is controlled by:
+
+.. confval:: mds_log_max_segments
+
+The MDS will often sit a little above this number due to non-major segments
+awaiting trimming up to the next major segment.
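
Reviewer's sketch (not part of the patch): the relationship between ``mds_log_events_per_segment`` and ``mds_log_major_segment_event_ratio`` described in the added text can be illustrated with a small Python model. This is a simplified sketch under stated assumptions, not the MDS implementation: the function name ``plan_boundaries`` is hypothetical, the values passed in are illustrative rather than Ceph defaults, and the model assumes a boundary is written exactly when a counter reaches its target, whereas the documentation only describes the segment size as a target.

```python
from typing import List, Tuple


def plan_boundaries(total_events: int,
                    events_per_segment: int,
                    major_segment_event_ratio: int) -> List[Tuple[int, str]]:
    """Return (event_index, boundary_type) pairs for a simplified journal.

    Hypothetical model: the journal starts with an ESubtreeMap (a major
    segment), a light-weight ESegment closes each segment of
    events_per_segment events, and a new ESubtreeMap is journaled once
    events_per_segment * major_segment_event_ratio non-ESubtreeMap events
    have been logged since the last one.
    """
    major_threshold = events_per_segment * major_segment_event_ratio
    boundaries = [(0, "ESubtreeMap")]  # replay must see a subtree map first
    since_major = 0      # events logged since the last ESubtreeMap
    since_boundary = 0   # events logged since the last boundary of any kind
    for n in range(1, total_events + 1):
        since_major += 1
        since_boundary += 1
        if since_major >= major_threshold:
            boundaries.append((n, "ESubtreeMap"))  # start a major segment
            since_major = 0
            since_boundary = 0
        elif since_boundary >= events_per_segment:
            boundaries.append((n, "ESegment"))     # start a minor segment
            since_boundary = 0
    return boundaries


if __name__ == "__main__":
    # Illustrative values only: 4 events per segment, ratio of 3.
    for index, kind in plan_boundaries(24, 4, 3):
        print(index, kind)
```

In this model, a larger ratio means more minor segments sit between consecutive major segments. That is consistent with the note that the MDS often holds slightly more than ``mds_log_max_segments`` segments, since the first segment of the journal must remain a major segment.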