2. Rados operations for manipulating manifests.
3. How those operations interact with other features like snapshots.
+
+Status and Future Work
+======================
+
+At the moment, initial versions of a manifest data structure along
+with IO path support and rados control operations exist. This section
+is meant to outline next steps.
+
+Cleanups
+--------
+
+There are some rough edges we may want to improve:
+
+* object_manifest_t: appears to duplicate offset for chunks between the
+ map and the chunk_info_t with different widths (uint64_t in map vs
+ uint32_t in chunk_info_t).
+* SET_REDIRECT: Should perhaps create the object if it doesn't exist.
+* SET_CHUNK:
+
+ * Appears to trigger a new clone as user_modify gets set in
+ do_osd_ops. This probably isn't desirable, see Snapshots section
+ below for some options on how generally to mix these operations
+ with snapshots.
+ * Appears to assume that the corresponding section of the object
+ does not exist (sets FLAG_MISSING) but does not check whether the
+ corresponding extent exists already in the object.
+ * Appears to clear the manifest unconditionally if not chunked,
+ that's probably wrong. We should return an error if it's a
+ REDIRECT.
+
+* TIER_PROMOTE:
+
+ * Document control flow in this file. The CHUNKED case is
+ particularly unintuitive and could use some explanation.
+ * SET_REDIRECT clears the contents of the object. PROMOTE appears
+ to copy them back in, but does not unset the redirect or clear the
+ reference. Does this not violate the invariant? In particular, as
+ long as the redirect is set, it appears that all operations
+ will be proxied even after the promote defeating the purpose.
+ * For a chunked manifest, we appear to flush prior to promoting,
+ it that actually desirable?
+
+Cache/Tiering
+-------------
+
+There already exists a cache/tiering mechanism based on whiteouts.
+One goal here should ultimately be for this manifest machinery to
+provide a complete replacement.
+
+See cache-pool.rst
+
+The manifest machinery already shares some code paths with the
+existing cache/tiering code, mainly stat_flush.
+
+In no particular order, here's in incomplete list of things that need
+to be wired up to provide feature parity:
+
+* Online object access information: The osd already has pool configs
+ for maintaining bloom filters which provide estimates of access
+ recency for objects. We probably need to modify this to permit
+ hitset maintenance for a normal pool -- there are already
+ CEPH_OSD_OP_PG_HITSET* interfaces for querying them.
+* Tiering agent: The osd already has a background tiering agent which
+ would need to be modified to instead flush and evict using
+ manifests.
+
+Snapshots
+---------
+
+Fundamentally, I think we need to be able to manipulate the manifest
+status of clones because we want to be able to dynamically promote,
+flush (if the state was dirty when the clone was created), and evict
+clones.
+
+As such, I think we want to make a new op type which permits writes on
+clones.
+
+See snaps.rst for a rundown of the librados snapshot system and osd
+support details. I'd like to call out one particular data structure
+we may want to exploit.
+
+We maintain a clone_overlap mapping which gives between two adjacent
+clones byte ranges which are identical. It's used during recovery and
+cache/tiering promotion to ensure that the ObjectStore ends up with
+the right byte range sharing via clone. We probably want to ensure
+that adjacent clones agree on chunk mappings for shared regions.
+
Data Structures
===============
std::map<uint64_t, chunk_info_t> chunk_map;
}
-TODO: check the following
-
The type enum reflects three possible states an object can be in:
1. TYPE_NONE: normal rados object
rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
<target_object>
+ Returns ENOENT if the object does not exist (TODO: why?)
+ Returns EINVAL if the object already is a redirect.
+
+ Takes a reference to target as part of operation, can possibly leak a ref
+ if the acting set resets and the client dies between taking the ref and
+ recording the redirect.
+
+ Truncates object, clears omap, and clears xattrs as a side effect.
+
+ At the top of do_osd_ops, does not set user_modify.
+
+ TODO: No user_modify means does not this trigger a new clone?
+
* set-chunk
set the chunk-offset in a source_object to make a link between it and a
rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
<caspool> <target_object> <taget-offset>
+ Returns ENOENT if the object does not exist (TODO: why?)
+ Returns EINVAL if the object already is a redirect.
+ Returns EINVAL if on ill-formed parameter buffer.
+ Returns ENOTSUPP if existing mapped chunks overlap with new chunk mapping.
+
+ Takes references to targets as part of operation, can possibly leak refs
+ if the acting set resets and the client dies between taking the ref and
+ recording the redirect.
+
+ Truncates object, clears omap, and clears xattrs as a side effect.
+
+ TODO: Because do_osd_ops does not list SET_CHUNK in the non-user_modify list,xi
+ it would appear that make_writeable will trigger a new clone.
+
+ TODO: Because user_modify apparently gets set, it'll also trigger a watch.
+
+ TODO: Does not appear to clear corresponding portion of object. Should it?
+
+ TODO: SET_CHUNK appears to clear the manifest unconditionally if it's not chunked.
+ That seems wrong. ::
+
+ if (!oi.manifest.is_chunked()) {
+ oi.manifest.clear();
+ }
+
+
* tier-promote
promote the object (including chunks). ::
rados -p base_pool tier-promote <obj-name>
+ Returns ENOENT if the object does not exist
+ Returns EINVAL if the object already is a redirect.
+
+ For a chunked manifest, copies all chunks to head.
+
+ For a redirect manifest, copies data to head.
+
+ TODO: For a redirect manifest, I don't see where it unsets the redirect.
+ Is it supposed to?
+
+ Does not clear the manifest.
+
+ Note: For a chunked manifest, calls start_copy on itself and uses the
+ existing read proxy machinery to proxy the reads.
+
+ TODO: maybe_handle_manifest_detail appears to flush the data prior to
+ promoting it, seems unnecessary?
+
+ At the top of do_osd_ops, does not set user_modify.
+
+ TODO: No user_modify means does not this trigger a new clone unlike SET_CHUNK.
+
* unset-manifest
unset the manifest info in the object that has manifest. ::
rados -p base_pool unset-manifest <obj-name>
+ Clears manifest chunks or redirect. Lazily releases references, may
+ leak.
+
+ do_osd_ops seems not to include it in the user_modify=false whitelist,
+ and so will trigger a snapshot. Note, this will be true even for a
+ redirect though SET_REDIRECT does not flip user_modify.
+
* tier-flush
flush the object which has chunks to the chunk pool. ::
rados -p base_pool tier-flush <obj-name>
+ Included in the user_modify=false whitelist, does not trigger a clone.
+
Dedup tool
==========
ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL
-
-=====================
-Status and Future Work
-======================
-
-At the moment, the above interfaces exist in rados, but have unclear
-interactions with snapshots.
-
-Snapshots
----------
-
-Here are some design questions we'll need to tackle:
-
-1. set-redirect
-
- * What happens if set on a clone?
- *