Important Data Structures
-------------------------
* SnapRealm: A `SnapRealm` is created whenever you create a snapshot at a new
- point in the hierarchy (or, when a snapshotted inode is moved outside of its
- parent snapshot). SnapRealms contain an `sr_t srnode`, links to `past_parents`
- and `past_children`, and all `inodes_with_caps` that are part of the snapshot.
- Clients also have a SnapRealm concept that maintains less data but is used to
- associate a `SnapContext` with each open file for writing.
+ point in the hierarchy (or, when a snapshotted inode is moved outside of its
+ parent snapshot). SnapRealms contain an `sr_t srnode` and the `inodes_with_caps`
+ that are part of the snapshot. Clients also have a SnapRealm concept that
+ maintains less data but is used to associate a `SnapContext` with each open
+ file for writing.
* sr_t: An `sr_t` is the on-disk snapshot metadata. It is part of the containing
directory and contains sequence counters, timestamps, the list of associated
- snapshot IDs, and `past_parents`.
-* snaplink_t: `past_parents` et al are stored on-disk as a `snaplink_t`, holding
- the inode number and first `snapid` of the inode/snapshot referenced.
+ snapshot IDs, and `past_parent_snaps`.
+* SnapServer: SnapServer manages snapshot ID allocation and snapshot deletion,
+ and tracks the list of effective snapshots in the filesystem. A filesystem
+ has only one SnapServer instance.
+* SnapClient: SnapClient is used to communicate with the SnapServer. Each MDS
+ rank has its own SnapClient instance. SnapClient also caches effective
+ snapshots locally.
Creating a snapshot
-------------------
-Because CephFS snapshot currently is an experimental feature, we are supposed
-to enable it explicitly by the command below before testing.
+The CephFS snapshot feature is enabled by default on new filesystems. To enable
+it on existing filesystems, use the command below.
.. code::
- $ ceph fs set <fs_name> allow_new_snaps true --yes-i-really-mean-it
+ $ ceph fs set <fs_name> allow_new_snaps true
-To make a snapshot on directory "/1/2/3/foo", the client invokes "mkdir" on
-"/1/2/3/foo/.snap" directory. This is transmitted to the MDS Server as a
+To make a snapshot on directory "/1/2/3/", the client invokes "mkdir" on the
+"/1/2/3/.snap" directory. This is transmitted to the MDS Server as a
CEPH_MDS_OP_MKSNAP-tagged `MClientRequest`, and initially handled in
Server::handle_client_mksnap(). It allocates a `snapid` from the `SnapServer`,
projects a new inode with the new SnapRealm, and commits it to the MDLog as
usual. When committed, it invokes
-`MDCache::do_realm_invalidate_and_update_notify()`, which triggers most of the
-real work of the snapshot.
+`MDCache::do_realm_invalidate_and_update_notify()`, which notifies all clients
+with caps on files under "/1/2/3/" about the new SnapRealm. When clients get
+the notifications, they update the client-side SnapRealm hierarchy, link files
+under "/1/2/3/" to the new SnapRealm, and generate a `SnapContext` for the
+new SnapRealm.
-If there were already snapshots above directory "foo" (rooted at "/1", say),
-the new SnapRealm adds its most immediate ancestor as a `past_parent` on
-creation. After committing to the MDLog, all clients with caps on files in
-"/1/2/3/foo/" are notified (MDCache::send_snaps()) of the new SnapRealm, and
-update the `SnapContext` they are using with that data. Note that this
-*is not* a synchronous part of the snapshot creation!
+Note that notifying clients *is not* a synchronous part of snapshot creation!
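
From the client's point of view, the request described above is an ordinary "mkdir" issued inside the special ".snap" directory. The following is a minimal sketch only: the mount point `/mnt/cephfs`, the snapshot name, and the helper names `snapshot_path` and `create_snapshot` are illustrative assumptions, not part of Ceph.

```python
import os
import posixpath


def snapshot_path(directory: str, snap_name: str) -> str:
    """Return the path whose creation asks the MDS to snapshot `directory`.

    Hypothetical helper: it only joins the directory, the special ".snap"
    component, and a caller-chosen snapshot name.
    """
    return posixpath.join(directory, ".snap", snap_name)


def create_snapshot(directory: str, snap_name: str) -> str:
    """Issue the mkdir that triggers snapshot creation.

    Assumes `directory` lives on a mounted CephFS with snapshots enabled;
    on any other filesystem this mkdir simply fails.
    """
    path = snapshot_path(directory, snap_name)
    os.mkdir(path)
    return path
```

For example, `create_snapshot("/mnt/cephfs/1/2/3", "before-upgrade")` would attempt `mkdir /mnt/cephfs/1/2/3/.snap/before-upgrade`, assuming the filesystem is mounted at the hypothetical path `/mnt/cephfs`.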
Updating a snapshot
-------------------
-If you delete a snapshot, or move data out of the parent snapshot's hierarchy,
-a similar process is followed. Extra code paths check to see if we can break
-the `past_parent` links between SnapRealms, or eliminate them entirely.
+If you delete a snapshot, a similar process is followed. If you move an inode
+out of its parent SnapRealm, the rename code creates a new SnapRealm for the
+renamed inode (if a SnapRealm does not already exist), saves the IDs of
+snapshots that are effective on the original parent SnapRealm into the
+`past_parent_snaps` of the new SnapRealm, then follows a process similar to
+creating a snapshot.
Generating a SnapContext
------------------------
A RADOS `SnapContext` consists of a snapshot sequence ID (`snapid`) and all
the snapshot IDs that an object is already part of. To generate that list, we
-generate a list of all `snapids` associated with the SnapRealm and all its
-`past_parents`.
+combine `snapids` associated with the SnapRealm and all valid `snapids` in
+`past_parent_snaps`. Stale `snapids` are filtered out using SnapClient's cached
+effective snapshots.
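
The combination step can be sketched as follows. This is an illustrative model, not the MDS implementation: the `SnapRealmSketch` class and `build_snap_context` function are hypothetical names, the filter is applied only to `past_parent_snaps`, and the IDs are returned newest first on the assumption that RADOS expects them in descending order.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class SnapRealmSketch:
    """Hypothetical stand-in for the fields of a SnapRealm used here."""
    seq: int                                  # snapshot sequence ID
    snaps: Set[int] = field(default_factory=set)              # realm's own snapids
    past_parent_snaps: Set[int] = field(default_factory=set)  # inherited snapids


def build_snap_context(realm: SnapRealmSketch,
                       effective_snaps: Set[int]) -> Tuple[int, List[int]]:
    """Combine the realm's snapids with the still-valid past parent snapids.

    `effective_snaps` models SnapClient's cached set of effective snapshots;
    any past parent snapid missing from it is treated as stale and dropped.
    """
    valid_past = realm.past_parent_snaps & effective_snaps
    snap_ids = sorted(realm.snaps | valid_past, reverse=True)
    return realm.seq, snap_ids
```

With `realm = SnapRealmSketch(seq=12, snaps={10, 12}, past_parent_snaps={4, 7})` and an effective set of `{7, 10, 12}`, the stale ID 4 is dropped and the result is `(12, [12, 10, 7])`.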
Storing snapshot data
---------------------
Hard links
----------
-Hard links do not interact well with snapshots. A file is snapshotted when its
-primary link is part of a SnapRealm; other links *will not* preserve data.
-Generally the location where a file was first created will be its primary link,
-but if the original link has been deleted it is not easy (nor always
-determnistic) to find which link is now the primary.
+An inode with multiple hard links is moved to a dummy global SnapRealm. The
+dummy SnapRealm covers all snapshots in the filesystem. The inode's data
+will be preserved for any new snapshot. This preserved data will cover
+snapshots on any linkage of the inode.
Multi-FS
---------