From: Samuel Just
Date: Wed, 1 Apr 2020 02:13:25 +0000 (-0700)
Subject: doc: more information on status and future work for manifest tiering
X-Git-Tag: wip-pdonnell-testing-20200918.022351~1226^2~2
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=ac7af27ffc209c37cff02a0ede29601ba8eaff67;p=ceph-ci.git

doc: more information on status and future work for manifest tiering

Signed-off-by: Samuel Just
Signed-off-by: Myoungwon Oh
---

diff --git a/doc/dev/osd_internals/manifest.rst b/doc/dev/osd_internals/manifest.rst
index bafc12a1802..feaa5fbeae7 100644
--- a/doc/dev/osd_internals/manifest.rst
+++ b/doc/dev/osd_internals/manifest.rst
@@ -23,6 +23,93 @@ This document exists to detail:
 2. Rados operations for manipulating manifests.
 3. How those operations interact with other features like snapshots.
 
+
+Status and Future Work
+======================
+
+At the moment, initial versions of a manifest data structure along
+with IO path support and rados control operations exist. This section
+is meant to outline next steps.
+
+Cleanups
+--------
+
+There are some rough edges we may want to improve:
+
+* object_manifest_t: appears to duplicate the chunk offset between the
+  chunk_map key and the chunk_info_t, with different widths (uint64_t
+  in the map key vs uint32_t in chunk_info_t).
+* SET_REDIRECT: should perhaps create the object if it doesn't exist.
+* SET_CHUNK:
+
+  * Appears to trigger a new clone as user_modify gets set in
+    do_osd_ops. This probably isn't desirable; see the Snapshots
+    section below for some options on how generally to mix these
+    operations with snapshots.
+  * Appears to assume that the corresponding section of the object
+    does not exist (sets FLAG_MISSING) but does not check whether the
+    corresponding extent already exists in the object.
+  * Appears to clear the manifest unconditionally if not chunked;
+    that's probably wrong. We should return an error if it's a
+    REDIRECT.
+
+* TIER_PROMOTE:
+
+  * Document control flow in this file. The CHUNKED case is
+    particularly unintuitive and could use some explanation.
+  * SET_REDIRECT clears the contents of the object. PROMOTE appears
+    to copy them back in, but does not unset the redirect or clear the
+    reference. Does this not violate the invariant? In particular, as
+    long as the redirect is set, it appears that all operations will
+    be proxied even after the promote, defeating the purpose.
+  * For a chunked manifest, we appear to flush prior to promoting;
+    is that actually desirable?
+
+Cache/Tiering
+-------------
+
+There already exists a cache/tiering mechanism based on whiteouts.
+One goal here should ultimately be for this manifest machinery to
+provide a complete replacement.
+
+See cache-pool.rst.
+
+The manifest machinery already shares some code paths with the
+existing cache/tiering code, mainly stat_flush.
+
+In no particular order, here's an incomplete list of things that need
+to be wired up to provide feature parity:
+
+* Online object access information: The osd already has pool configs
+  for maintaining bloom filters which provide estimates of access
+  recency for objects. We probably need to modify this to permit
+  hitset maintenance for a normal pool -- there are already
+  CEPH_OSD_OP_PG_HITSET* interfaces for querying them. (See the
+  sketch after this list.)
+* Tiering agent: The osd already has a background tiering agent which
+  would need to be modified to instead flush and evict using
+  manifests.
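+
+As a sketch of the first item above: the hitset machinery is already
+exposed as pool options (today they are only meaningful for cache
+tier pools), and permitting them on a normal base pool would give the
+tiering agent access recency estimates for manifest objects. The pool
+name and values below are illustrative only. ::
+
+  # hit_set_* are existing pool options, currently only honored for
+  # cache tier pools; base_pool is a hypothetical replicated pool
+  ceph osd pool set base_pool hit_set_type bloom
+  ceph osd pool set base_pool hit_set_count 8
+  ceph osd pool set base_pool hit_set_period 3600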
+
+Snapshots
+---------
+
+Fundamentally, I think we need to be able to manipulate the manifest
+status of clones because we want to be able to dynamically promote,
+flush (if the state was dirty when the clone was created), and evict
+clones.
+
+As such, I think we want to make a new op type which permits writes on
+clones.
+
+See snaps.rst for a rundown of the librados snapshot system and osd
+support details. I'd like to call out one particular data structure
+we may want to exploit.
+
+We maintain a clone_overlap mapping which gives, between two adjacent
+clones, the byte ranges which are identical. It's used during recovery
+and cache/tiering promotion to ensure that the ObjectStore ends up
+with the right byte range sharing via clone. We probably want to
+ensure that adjacent clones agree on chunk mappings for shared
+regions.
+
 Data Structures
 ===============
@@ -42,8 +129,6 @@ object_info_t (see osd_types.h):
   std::map<uint64_t, chunk_info_t> chunk_map;
 }
 
-TODO: check the following
-
 The type enum reflects three possible states an object can be in:
 
 1. TYPE_NONE: normal rados object
@@ -126,6 +211,19 @@ Operations:
 
   rados -p base_pool set-redirect <base_object> --target-pool <target_pool> <target_object>
 
+  Returns ENOENT if the object does not exist (TODO: why?)
+  Returns EINVAL if the object already is a redirect.
+
+  Takes a reference to the target as part of the operation; can
+  possibly leak a ref if the acting set resets and the client dies
+  between taking the ref and recording the redirect.
+
+  Truncates the object, clears omap, and clears xattrs as a side effect.
+
+  At the top of do_osd_ops, does not set user_modify.
+
+  TODO: No user_modify means this does not trigger a new clone?
+
 * set-chunk
 
   set the chunk-offset in a source_object to make a link between it and a
@@ -137,6 +235,32 @@ Operations:
 
   rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool <target_pool> <target_object> <target_offset>
 
+  Returns ENOENT if the object does not exist (TODO: why?)
+  Returns EINVAL if the object already is a redirect.
+  Returns EINVAL on an ill-formed parameter buffer.
+  Returns ENOTSUPP if existing mapped chunks overlap with the new
+  chunk mapping.
+
+  Takes references to the targets as part of the operation; can
+  possibly leak refs if the acting set resets and the client dies
+  between taking the refs and recording the chunks.
+
+  Truncates the object, clears omap, and clears xattrs as a side effect.
+
+  TODO: Because do_osd_ops does not list SET_CHUNK in the
+  non-user_modify list, it would appear that make_writeable will
+  trigger a new clone.
+
+  TODO: Because user_modify apparently gets set, it'll also trigger a
+  watch.
+
+  TODO: Does not appear to clear the corresponding portion of the
+  object. Should it?
+
+  TODO: SET_CHUNK appears to clear the manifest unconditionally if
+  it's not chunked. That seems wrong. ::
+
+    if (!oi.manifest.is_chunked()) {
+      oi.manifest.clear();
+    }
+
 * tier-promote
 
   promote the object (including chunks). ::
@@ -145,6 +269,28 @@ Operations:
 
   rados -p base_pool tier-promote <obj_name>
 
+  Returns ENOENT if the object does not exist.
+  Returns EINVAL if the object already is a redirect.
+
+  For a chunked manifest, copies all chunks to head.
+
+  For a redirect manifest, copies the data to head.
+
+  TODO: For a redirect manifest, I don't see where it unsets the
+  redirect. Is it supposed to?
+
+  Does not clear the manifest.
+
+  Note: For a chunked manifest, calls start_copy on itself and uses
+  the existing read proxy machinery to proxy the reads.
+
+  TODO: maybe_handle_manifest_detail appears to flush the data prior
+  to promoting it; that seems unnecessary?
+
+  At the top of do_osd_ops, does not set user_modify.
+
+  TODO: No user_modify means this does not trigger a new clone, unlike
+  SET_CHUNK?
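+
+  As a minimal illustration of the promote path (object and pool
+  names here are hypothetical, and the redirect target is assumed to
+  exist), note that per the TODO above the redirect may still be in
+  effect after the promote. ::
+
+    rados -p base_pool set-redirect foo --target-pool cold_pool foo-cold
+    rados -p base_pool tier-promote foo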
+
 * unset-manifest
 
   unset the manifest info in the object that has manifest. ::
@@ -153,6 +299,13 @@ Operations:
 
   rados -p base_pool unset-manifest <obj_name>
 
+  Clears manifest chunks or redirect. Lazily releases references;
+  may leak.
+
+  do_osd_ops seems not to include it in the user_modify=false
+  whitelist, and so will trigger a snapshot. Note that this will be
+  true even for a redirect, even though SET_REDIRECT does not flip
+  user_modify.
+
 * tier-flush
 
   flush the object which has chunks to the chunk pool. ::
@@ -161,6 +314,8 @@ Operations:
 
   rados -p base_pool tier-flush <obj_name>
 
+  Included in the user_modify=false whitelist; does not trigger a clone.
+
 Dedup tool
 ==========
@@ -244,20 +399,3 @@ and fixing the reference count (see ./refcount.rst).
 
   ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL
 
-
-=====================
-Status and Future Work
-======================
-
-At the moment, the above interfaces exist in rados, but have unclear
-interactions with snapshots.
-
-Snapshots
----------
-
-Here are some design questions we'll need to tackle:
-
-1. set-redirect
-
-  * What happens if set on a clone?
-  *
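+
+As a minimal end-to-end sketch tying the manifest operations to the
+scrub above (pool, object, and chunk names here are hypothetical, and
+both pools are assumed to already exist)::
+
+  # map the first 4K of foo onto foo-chunk in the chunk pool
+  rados -p base_pool set-chunk foo 0 4096 --target-pool chunk_pool foo-chunk 0
+
+  # write the chunked object's data back to the chunk pool
+  rados -p base_pool tier-flush foo
+
+  # scrub chunk references (see ./refcount.rst)
+  ceph-dedup-tool --op chunk_scrub --chunk_pool chunk_pool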