From b0d540e82f5383b5c101fb730da638d0f33d52f3 Mon Sep 17 00:00:00 2001
From: Venky Shankar
Date: Mon, 28 Sep 2020 17:11:59 +0530
Subject: [PATCH] doc: cephfs mirroring design document

Signed-off-by: Venky Shankar
---
 doc/dev/cephfs-mirroring.rst | 272 ++++++++++++++++++++++++++++++++++-
 1 file changed, 265 insertions(+), 7 deletions(-)

diff --git a/doc/dev/cephfs-mirroring.rst b/doc/dev/cephfs-mirroring.rst
index 66681fd31007..d6e2acd7a8cf 100644
--- a/doc/dev/cephfs-mirroring.rst
+++ b/doc/dev/cephfs-mirroring.rst
@@ -3,7 +3,7 @@ CephFS Mirroring
 ================
 
 .. note:: CephFS mirroring feature is currently under development. Interfaces detailed in
-          this document might change during development cycle.
+          this document might change during the development cycle.
 
 CephFS will support asynchronous replication of snapshots to a remote CephFS filesystem
 via `cephfs-mirror` tool. Snapshots are synchronized by mirroring snapshot data followed by
@@ -13,11 +13,84 @@ the snapshot being synchronized.
 Key Idea
 --------
 
-For a given snapshot in a directory, `cephfs-mirror` daemon uses recursive stats to identify
-changes in a directory tree. For a given directory, local rctime being larger than its
-corresponding remote timestamp implies changes exist somewhere in the subtree. The subtree is
-scanned further using the same rctime comparison principle to collect a set of changes for
-propagation to remote filesystem.
+For a given snapshot pair in a directory, the `cephfs-mirror` daemon uses readdir diff to
+identify changes in a directory tree. The diffs are applied to the directory in the remote
+filesystem, thereby synchronizing only the files that have changed between the two snapshots.
+
+This feature is tracked here: https://tracker.ceph.com/issues/47034
+
+.. note:: Synchronizing hardlinks is not supported -- hardlinked files get synchronized
+          as separate files.
+
+Mirroring Design
+----------------
+
+CephFS will support asynchronous replication of snapshots to a remote CephFS filesystem
+via the `cephfs-mirror` tool. For a given directory, snapshots are synchronized by
+transferring snapshot data to the remote filesystem and creating a snapshot with the same
+name as the snapshot being synchronized.
+
+Change Detection
+----------------
+
+The first snapshot to get synchronized involves a bulk transfer of data to the remote
+filesystem. Subsequent snapshot synchronizations transfer only changed/modified files.
+
+Identifying changed files between two snapshots (one of which is already synchronized)
+involves (efficiently) walking the two (snapshot) trees. Since one of the snapshots is
+already synchronized, scanning the two snapshots locally suffices. For the most part,
+identifying changed files between two snapshots does not involve querying the remote
+filesystem. However, there are cases where it is essential to query the remote
+filesystem, as described in the following sections.
+
+Failed Synchronization
+----------------------
+
+Failure to transfer snapshot data (which could be due to a variety of reasons) leaves
+the remote filesystem with incomplete data. Although retrying synchronization most
+likely overcomes transient issues, non-recoverable errors need to be handled. When a
+snapshot synchronization hits a non-recoverable error, the snapshot is noted as having
+failed synchronization and gets skipped. This is done either manually by the operator
+or when the number of consecutive retries (post-failure) hits a configured limit.
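+
+As an illustration, this retry/skip bookkeeping could look like the following sketch
+(Python; the names used here -- `TransientError`, `NonRecoverableError`, `sync_fn`,
+`mark_failed`, `MAX_CONSECUTIVE_FAILURES` -- are hypothetical, not the actual
+`cephfs-mirror` implementation)::
+
+    class TransientError(Exception): ...
+    class NonRecoverableError(Exception): ...
+
+    MAX_CONSECUTIVE_FAILURES = 3  # the "configured limit" mentioned above
+
+    def try_synchronize(snapshot, sync_fn, mark_failed):
+        failures = 0
+        while failures < MAX_CONSECUTIVE_FAILURES:
+            try:
+                sync_fn(snapshot)      # transfer snapshot data
+                return True
+            except TransientError:
+                failures += 1          # retrying will most likely succeed
+            except NonRecoverableError:
+                break                  # retrying will not help
+        mark_failed(snapshot)          # note the snapshot as failed; it gets skipped
+        return False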
+
+Skipping a "failed" snapshot leaves incomplete data on the remote filesystem. When the
+next snapshot is chosen for synchronization, local scanning of (two) snapshots is not
+feasible, since certain files (which were part of the "failed" snapshot) may have been
+transferred to the remote filesystem but do not exist in the "next" chosen snapshot.
+Scanning the two snapshots "locally" for changes would therefore miss the incomplete
+data left behind on the remote filesystem; in such a case, identifying changes needs to
+compare files between the snapshot and the remote filesystem. Alternatively, the remote
+filesystem data could be purged, followed by a bulk transfer of data from the next
+chosen snapshot. Subsequent snapshots can then rely on scanning local snapshots to
+identify the changes to be transferred.
+
+Deleted Snapshots
+-----------------
+
+Snapshots on the primary filesystem can be deleted even during synchronization.
+This needs to be handled by the mirror daemon and poses the same issues as mentioned
+above. Moreover, mirroring can be disabled in the midst of a snapshot being synchronized
+and re-enabled after snapshots (including the one under synchronization) have been
+deleted, which poses the same set of issues.
+
+Snapshot Synchronization Order
+------------------------------
+
+Although the order in which snapshots get chosen for synchronization does not matter,
+it may be beneficial to pick snapshots based on creation order (using snap-id).
+
+Snapshot Incarnation
+--------------------
+
+A snapshot may be deleted and recreated (with the same name) with different contents.
+An "old" snapshot could have been synchronized (earlier) and the recreation of the
+snapshot could have been done when mirroring was disabled. Using snapshot names to
+infer the point-of-continuation would result in the "new" snapshot (incarnation)
+never getting picked up for synchronization.
+
+Each snapshot on the secondary filesystem stores the snap-id of the snapshot it was
+synchronized from. This metadata is stored in the `SnapInfo` structure.
 
 Interfaces
 ----------
@@ -27,7 +100,8 @@ mirroring. Manager interfaces are (mostly) wrappers around monitor commands for
 filesystem mirroring and is the recommended control interface. See `Internal Interfaces`
 (monitor commands) are detailed below.
 
-.. note:: `Mirroring` module is work in progress and will be detailed when ready for use.
+.. note:: The `Mirroring` module is a work in progress; its interface is detailed in
+          the `Mirroring Module and Interface` section.
 
 Internal Interfaces
 -------------------
@@ -66,6 +140,190 @@ to `FSMap` and gets notified on mirror status and/or peer updates. Peer changes
 handled by starting or stopping mirroring to when a new peer is added or an existing
 peer is removed.
 
+Mirroring Module and Interface
+------------------------------
+
+The mirroring module provides an interface for managing directory snapshot mirroring.
+The module is implemented as a Ceph Manager plugin. The mirroring module does not manage
+spawning (and terminating) the mirror daemons; right now, the preferred way is to
+start/stop mirror daemons via `systemctl(1)`. Going forward, deploying mirror daemons
+will be managed by `cephadm` (Tracker: http://tracker.ceph.com/issues/47261).
+
+The manager module is responsible for assigning directories to mirror daemons for
+synchronization. Multiple mirror daemons can be spawned to achieve concurrency in
+directory snapshot synchronization (one possible assignment scheme is sketched below).
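+
+For illustration, directories could be spread across mirror daemons as in the following
+sketch (hypothetical Python; the module's actual assignment policy may differ)::
+
+    def assign_directories(directories, daemon_ids):
+        """Round-robin tracked directories over the known mirror daemons."""
+        if not daemon_ids:
+            return {}  # no daemons running: directories stay unassigned (stalled)
+        mapping = {}
+        for i, directory in enumerate(sorted(directories)):
+            mapping[directory] = daemon_ids[i % len(daemon_ids)]
+        return mapping
+
+    # recomputed whenever the set of mirror daemons changes
+    print(assign_directories(["/d0/d1/d2", "/d0/d3"], ["404148"]))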
+
+When mirror daemons are spawned (or terminated), the mirroring module discovers the
+modified set of mirror daemons and rebalances the directory assignment amongst the new
+set, thus providing high-availability.
+
+.. note:: Running multiple mirror daemons is currently untested. Only a single mirror
+          daemon is recommended.
+
+The mirroring module is disabled by default. To enable mirroring use::
+
+  $ ceph mgr module enable mirroring
+
+The mirroring module provides a family of commands to control mirroring of directory
+snapshots. To add or remove directories, mirroring needs to be enabled for a given
+filesystem. To enable mirroring use::
+
+  $ ceph fs snapshot mirror enable <fs_name>
+
+.. note:: Mirroring module commands use the `fs snapshot mirror` prefix as compared to
+          the monitor commands, which use the `fs mirror` prefix. Make sure to use
+          module commands.
+
+To disable mirroring, use::
+
+  $ ceph fs snapshot mirror disable <fs_name>
+
+Once mirroring is enabled, add a peer to which directory snapshots are to be mirrored.
+Peers follow the `<client>@<cluster>` specification and get assigned a unique-id (UUID)
+when added. See the `Creating Users` section on how to create Ceph users for mirroring.
+
+To add a peer use::
+
+  $ ceph fs snapshot mirror peer_add <fs_name> <remote_cluster_spec> [<remote_fs_name>]
+
+`<remote_fs_name>` is optional, and defaults to `<fs_name>` (on the remote cluster).
+
+.. note:: Only a single peer is supported right now.
+
+To remove a peer use::
+
+  $ ceph fs snapshot mirror peer_remove <fs_name> <peer_uuid>
+
+.. note:: See the `Mirror Daemon Status` section on how to figure out the peer UUID.
+
+To configure a directory for mirroring, use::
+
+  $ ceph fs snapshot mirror add <fs_name> <path>
+
+To stop mirroring directory snapshots, use::
+
+  $ ceph fs snapshot mirror remove <fs_name> <path>
+
+Only absolute directory paths are allowed. Paths are normalized by the mirroring
+module, therefore, `/a/b/../b` is equivalent to `/a/b`. Also, the directory should
+exist before it can be configured for mirroring::
+
+  $ ceph fs snapshot mirror add cephfs /d0/d1/d2
+  Error ENOENT: /d0/d1/d2 does not exist
+  $ mkdir -p /d0/d1/d2
+  $ ceph fs snapshot mirror add cephfs /d0/d1/d2
+  {}
+  $ ceph fs snapshot mirror add cephfs /d0/d1/../d1/d2
+  Error EEXIST: directory /d0/d1/d2 is already tracked
+
+Once a directory is added for mirroring, neither its subdirectories nor its ancestor
+directories are allowed to be added for mirroring::
+
+  $ ceph fs snapshot mirror add cephfs /d0/d1
+  Error EINVAL: /d0/d1 is an ancestor of tracked path /d0/d1/d2
+  $ ceph fs snapshot mirror add cephfs /d0/d1/d2/d3
+  Error EINVAL: /d0/d1/d2/d3 is a subtree of tracked path /d0/d1/d2
+
+Commands to check directory mapping (to mirror daemons) and directory distribution are
+detailed in the `Mirror Daemon Status` section.
+
+Mirror Daemon Status
+--------------------
+
+Mirror daemons get asynchronously notified about changes in filesystem mirroring status
+and/or peer updates. CephFS mirror daemons provide admin socket commands for querying
+mirror status. To check the available commands for mirror status, use::
+
+  $ ceph --admin-daemon /path/to/mirror/daemon/admin/socket help
+  {
+      ....
+      ....
+      "fs mirror status cephfs@360": "get filesystem mirror status",
+      ....
+      ....
+  }
+
+Commands with the `fs mirror status` prefix provide mirror status for mirror-enabled
+filesystems. Note that `cephfs@360` is of the format `filesystem-name@filesystem-id`.
+This format is required since mirror daemons get asynchronously notified regarding
+filesystem mirror status (a filesystem can be deleted and recreated with the same
+name).
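+
+The status command can also be driven from a script. Below is a minimal sketch in
+Python (it assumes the admin socket path and the `cephfs@360` filesystem id used in
+the examples here)::
+
+    import json
+    import subprocess
+
+    def mirror_status(asok, fs_spec):
+        """Query a mirror daemon's admin socket and decode the JSON reply."""
+        out = subprocess.check_output(
+            ["ceph", "--admin-daemon", asok, "fs", "mirror", "status", fs_spec])
+        return json.loads(out)
+
+    status = mirror_status("/var/run/ceph/cephfs-mirror.asok", "cephfs@360")
+    print(status["snap_dirs"]["dir_count"])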
+
+Right now, the command provides minimal information regarding mirror status::
+
+  $ ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@360
+  {
+    "rados_inst": "192.168.0.5:0/1476644347",
+    "peers": {
+        "a2dc7784-e7a1-4723-b103-03ee8d8768f8": {
+            "remote": {
+                "client_name": "client.mirror_remote",
+                "cluster_name": "site-a",
+                "fs_name": "backup_fs"
+            }
+        }
+    },
+    "snap_dirs": {
+        "dir_count": 1
+    }
+  }
+
+The `peers` section in the command output above shows the peer information, such as the
+unique peer-id (UUID) and specification. The peer-id is required to remove an existing
+peer, as mentioned in the `Mirroring Module and Interface` section.
+
+When mirroring is disabled, the respective `fs mirror status` command for the filesystem
+will not show up in the command help.
+
+The mirroring module provides a couple of commands to display directory mapping and
+distribution information. To check which mirror daemon a directory has been mapped to,
+use::
+
+  $ ceph fs snapshot mirror dirmap cephfs /d0/d1/d2
+  {
+    "instance_id": "404148",
+    "last_shuffled": 1601284516.10986,
+    "state": "mapped"
+  }
+
+.. note:: `instance_id` is the RADOS instance-id associated with a mirror daemon. This
+          will include the RADOS instance address to help map to a mirror daemon.
+
+Other fields, such as `state` and `last_shuffled`, are interesting when running
+multiple mirror daemons.
+
+When no mirror daemons are running, the above command shows::
+
+  $ ceph fs snapshot mirror dirmap cephfs /d0/d1/d2
+  {
+    "reason": "no mirror daemons running",
+    "state": "stalled"
+  }
+
+This signifies that no mirror daemons are running and mirroring is stalled.
+
+Creating Users
+--------------
+
+Start by creating a user (on the primary/local cluster) for the mirror daemon. This user
+has restrictive capabilities on the MDS and the OSD::
+
+  $ ceph auth get-or-create client.mirror mon 'allow r' mds 'allow r' osd 'allow rw object_prefix cephfs_mirror' mgr 'allow r'
+
+Create a user for each filesystem peer (on the secondary/remote cluster). This user needs
+to have full capabilities on the MDS (to take snapshots) and the OSDs::
+
+  $ ceph auth get-or-create client.mirror_remote mon 'allow r' mds 'allow rwps' osd 'allow rw' mgr 'allow r'
+
+This user should be used (as part of the peer specification) when adding a peer.
+
+Starting Mirror Daemon
+----------------------
+
+Right now, mirror daemons need to be spawned manually. This will be replaced by a
+`systemctl(1)` command to start/stop mirror daemons. For now, use::
+
+  $ cephfs-mirror --id mirror --cluster site-a -f
+
+.. note:: The user used here is `mirror`, as created in the `Creating Users` section.
+
 Feature Status
 --------------
-- 
2.47.3