From: Connor Fawcett
Date: Mon, 29 Sep 2025 01:21:16 +0000 (+0100)
Subject: docs/dev: Add design document with information on proposed design for pool migration

Signed-off-by: Connor Fawcett
---
diff --git a/doc/dev/pool-migration-design.rst b/doc/dev/pool-migration-design.rst
new file mode 100644
--- /dev/null
+++ b/doc/dev/pool-migration-design.rst

=========================
 Design of Pool Migration
=========================


Background
==========

The objective for pool migration is to be able to migrate all RADOS objects
from one pool to another within the same cluster non-disruptively. This
functionality is planned for the Umbrella release.

Use cases for pool migration:

* Changing the erasure code profile (and in particular the choice of K and M)
  non-disruptively. Implementing this as a non-disruptive migration between
  pools is simpler and no less efficient than trying to perform this type of
  transformation in place.
* Converting between replica and erasure coded pools. Changes being made to add
  OMAP and class support to EC pools will remove the need to have a separate
  replica pool for metadata when using RBD, CephFS or RGW; this should make
  these migrations viable in conjunction with this work.
* The general use case of wanting to migrate data between two pools.

By non-disruptive we mean that there will be no time where I/O or applications
need to switch from using the old pool to the new pool, not even a very short
outage at the start or end of the migration (as is required, for example, by
RBD live migration). The migration will however need to read and write every
object in the pool, so there will be a performance impact during the migration
- similar to that when PGs are backfilling. The same techniques and controls
that are used when splitting/merging PGs and backfilling individual PGs will be
used by pool migration to manage this impact.

For the first release we will require that the target pool is empty to begin
with (this means we don't need to worry about objects with the same name). See
the section on avoiding name collisions for more details. Supporting merging of
pools (either where constraints prevent object name collisions or where these
collisions are resolved automatically during the migration) is a possible
future enhancement, but it's not clear what use cases this solves.

During a pool migration restrictions are placed on the source pool: it is not
permitted to modify the number of PGs (to cause splits or merges). See the
sections on stopping changes to the number of PGs during migration and on the
CLI and UI for more details. Deletion of the source pool will not be permitted
during a migration. Other actions such as rebalancing are permitted but perhaps
should be discouraged as the data is being moved anyway.

During a pool migration restrictions are placed on the target pool: it is not
permitted to migrate this pool to another (i.e. no daisy-chained or cyclical
migrations). Splits, merges and rebalancing are permitted. Deletion of the
target pool will not be permitted during a migration. Once a pool has finished
migrating it is permitted to start a new migration of the target pool of a
previous migration.

For the first release there is no option to cancel, suspend or reverse a pool
migration once it has started.

For the first release there is no plan to have the clients update their
references to the pool once the migration has completed; they will continue to
reference the old pool and the objecter (``librados``) will reroute the request
to the new pool. The ``OSDMap`` will retain stub information for the old pool
redirecting to the new pool.

The feature requires changes to client code; all clients and daemons will need
to be upgraded before a pool migration is permitted. The two main clients, the
objecter (in ``librados``) and the kernel client, will be updated. Updates to
the kernel client are likely to lag the Umbrella release. Where the clients are
integrated into other products (e.g. ``ODF``) these products will need to
incorporate the new clients before the feature can be used.

For the first release there is no plan to support pool migration between Ceph
clusters. Theoretically this could be added later, building upon the first
release code, but it would require substantial extra effort. It would require
clients to be able to update references to the cluster and pool once the
migration had completed and for clients to be able to redirect I/O to a
different cluster. There would be extra authentication challenges as all OSDs
and clients in the source cluster would need to be able to submit requests to
the target cluster.


Design
======


Reuse of Existing Design
------------------------

Let's start by looking at existing code or features that we can copy / reuse
/ refactor / take inspiration from; we don't want to reinvent the wheel or
repeat past mistakes.


Backfill
~~~~~~~~

Backfill is a process run by a PG to recover objects on an OSD that has either
just been added to the PG (starts with no objects) or has been absent from the
PG for a while (has some objects that are up to date, some that are stale, is
probably missing new objects and may have objects that are no longer needed
because they were deleted while it was absent).

Backfill takes a long time, so I/O must be permitted to continue while the
backfill happens. It uses the fact that all objects have a hash and that it is
possible to list objects in hash order. This means that backfill can recover
objects in hash order and can simply keep a watermark hash value to track what
progress has been made. I/Os to objects with a hash below the watermark are to
objects that have been recovered and need to update all OSDs including the
backfilling OSD. I/Os to objects with a hash above the watermark can ignore the
backfilling OSDs as the backfill process will recover these objects later. The
object(s) currently being recovered by the backfill process are locked to
prevent I/O for the short time it takes to backfill an object.

Another property of backfill is that the process is idempotent: while there are
performance benefits to preserving the watermark, there are no correctness
issues if the watermark is reset to the start and the backfill process starts
again, as repeating the process will determine that objects have already been
recovered. This simplifies the design because the watermark doesn't have to be
rigorously replicated and checkpointed, although for backfill it is part of
``pg_info_t`` so progress is checkpointed fairly frequently.

Relevance to pool migration:

* Pool migration can list objects in hash order and migrate them to the new
  pool.
* A watermark can be used to keep track of which objects have been migrated.
  There is no need for the watermark to be persistent.
* Clients can cache a copy of the watermark to help direct I/Os to the correct
  pool and PG.
* The client's cached copy can become stale; if I/Os are misdirected they will
  be failed, providing an up-to-date watermark so the I/O can be retried.
* Backfill recovers all parts of a RADOS object - attributes, data and
  OMAP. Large objects are recovered in phases (something like 2MB at a time)
  and utilize a temporary object which is renamed at the end of the recovery
  for atomicity. If a peering cycle interrupts the process, then the
  temporary object is discarded. If pool migration uses this technique it needs
  to be aware that a peering cycle might disrupt the target pool but not the
  source pool and therefore may need to restart the migration of the object if
  the target discards the object.
* Backfill recovers a RADOS head object and its associated snapshots at the
  same time and uses locking (e.g.
  ``PrimaryLogPG::is_degraded_or_backfilling_object`` and
  ``PrimaryLogPG::is_unreadable_object``) to ensure that none of these can be
  accessed while they are being recovered because of dependencies between them.
  Pool migration needs to migrate the head object and the snapshots at the same
  time and needs to ensure we don't process I/O to the object halfway through
  this process.
* Backfill is meant to preserve the space-efficiency of snapshots when
  recovering them, using the information in the snapset attribute to work out
  which regions of the snapshots are clones - see ``calc_clone_subsets`` /
  ``calc_head_subsets``. This hasn't been implemented for EC pools yet and
  `tracker 72753 <https://tracker.ceph.com/issues/72753>`_ shows it currently
  isn't working for replica pools either. We will want to use this (and the way
  a ``PushOp`` re-establishes cloned regions) for migration.

Unlike backfill, for migration we want the clients to know the watermark so
they can route I/Os to the old/new pool. We don't care if clients have a
stale watermark - this will just cause a few I/Os to be incorrectly routed to
the old pool, which can fail them back to the client and communicate a new
watermark so the I/O can be resubmitted to the new pool.

We deliberately make updating the client's copy of the watermark lazy - there
could be hundreds or thousands of clients, so updating them all the time would
be expensive. Putting the watermark into the ``OSDMap`` and issuing new epochs
to distribute it to all the clients would be even more expensive. In contrast
we are thinking about recording which PGs are migrating/have finished migrating
in the ``OSDMap`` - a rule of thumb would be to try and only update the
``OSDMap`` once a second during a migration.

For migration to be able to support direct reads we do need all the OSDs in the
PG to know where the watermark is and for this to be updated as each object is
migrated. Migrating an object involves reading it from the source pool, writing
it to the target pool and then deleting it from the source pool. Other OSDs can
update migration progress as they process the delete request. There will be
some complexity regarding direct reads and migrating an object + its snapshots.
There is already some code that fails direct reads with ``EAGAIN`` (to redirect
these to the primary) when an object and its snapshots have not all been
recovered; we may need to use this when midway through migrating an object and
its snapshots, and then have the primary stall the I/O until the object and
snapshots have all been migrated before failing the I/O again for redirection
to the new pool.
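
To make this concrete, here is a minimal sketch of how the client-side routing
and retry described above might look. All of the types and names below
(``MigrationState``, ``choose_pool``, ``on_misdirected``) are hypothetical and
are only intended to illustrate the watermark logic, not to propose an
implementation:

.. code-block:: cpp

   // Hypothetical sketch of client-side routing during a pool migration.
   // None of these types or names exist in Ceph today.

   #include <cstdint>
   #include <optional>

   struct MigrationState {
     int64_t source_pool;
     int64_t target_pool;
     // Cached watermark for one migrating PG; objects whose hash is below
     // the watermark have already been migrated.  The cache may be stale.
     std::optional<uint32_t> cached_watermark;
   };

   // Decide which pool an op for an object with hash `obj_hash` should be
   // sent to.  A stale (or missing) watermark just means we may guess the
   // old pool and be corrected by the OSD.
   int64_t choose_pool(const MigrationState& m, uint32_t obj_hash) {
     if (m.cached_watermark && obj_hash < *m.cached_watermark) {
       return m.target_pool;   // already migrated
     }
     return m.source_pool;     // not migrated yet, or no watermark cached
   }

   // Called when the OSD rejects a misdirected op and returns the current
   // watermark; the op is then resubmitted via choose_pool() again.
   void on_misdirected(MigrationState& m, uint32_t watermark_from_osd) {
     m.cached_watermark = watermark_from_osd;
   }

The important property is that a stale cache only costs an extra round trip; it
cannot cause an I/O to be applied in the wrong pool, because the OSD fails
anything that is routed against its authoritative watermark.
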

The watermark doesn't necessarily need to be checkpointed to disk; it is cheap
to find the object with the lowest hash in a PG, so we could recalculate the
watermark whenever peering starts the migration.


Scheduling Backfill / Recovery
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Deciding how to prioritize backfill/recovery and how fast to run this process
versus processing I/O from clients is a complex problem. Firstly, a decision
is made as to which PGs should be backfilling/recovering, and which should
wait. This involves messages between OSDs and considers whether I/O is blocked
and how much redundancy the PG has left (for example a replica-3 pool with 2
failures is prioritized over a replica-3 pool with 1 failure). Secondly, once a
PG has been selected to backfill/recover, the scheduler has to decide how
frequently to perform backfill/recovery versus process client I/O. This happens
within the primary OSD using weighted costs.

Relevance to pool migration:

* Pool migration is less critical than backfill or recovery. It needs to fit
  into the same process to determine when a PG should start migrating.
* Once a PG is permitted to start migration the OSD scheduler needs to pace the
  work. The overheads for migrating an object are similar to the overheads for
  backfilling an object, so hopefully we can just copy the backfill scheduling
  for migration.

The objective is to reuse as much of the scheduler (e.g. ``mclock``) as
possible, just teaching it that migration has a lower priority than backfill or
async recovery but a higher priority than deep scrub.

``mclock`` works by assigning a weighting to each backfill/recovery op and
each client I/O request; it also benchmarks OSDs at startup to get some idea of
what the maximum performance of the OSD is. This information is then used to
work out when to schedule background work. The same concepts should work for
migration requests. We will need to assign a weighting to migration work;
this should be similar/identical to the weighting for backfills.

We will take a similar approach for supporting clusters running with
``WeightedPriorityQueue`` scheduling.

The expectation is that there should be no need for new tuneable settings
for migration; the existing tuneable settings for backfill/recovery should be
sufficient, and we don't want to further complicate this part of the UI.


Statistics
~~~~~~~~~~

I believe there are a few statistics collected about the performance of
backfill/recovery. We should supplement these with similar statistics
about the process of migrations.

We need to consider OSD stats that are gathered by ``Prometheus`` and any
progress summary that is presented via ``HealthCheck`` and/or the UI.


CopyFrom
~~~~~~~~

``CopyFrom`` is a RADOS op that can copy the contents of an object into a new
object. It is sent to the OSD and PG that will store the new object. The OSD
is responsible for reading the source object, which involves sending messages
to another OSD and PG, and then writing the data it reads to the new object.
If +the object being copied is large, then the copy operation is broken up into +multiple stages and this is made atomic by using a temporary object to store +the new data until the last data has been copied at which point the temporary +object can be renamed to become the new object. + +Relevance to pool migration: + +* Pool migration needs to copy objects from the old pool to a new pool - this + will involve one OSD and PG reading the object and another OSD and PG writing + the object. +* Pool migration will want to drive the copy operation from the source side, + so we probably need a ``CopyTo`` type operation. +* The way messages are sent between OSDs, the way a large object copy is staged + and the use of a temporary object name when staging are all concepts that can be + reused. + +Alternatively, pool migration might want to copy the recover object +implementation in ``ECBackend`` which is used to recover an object being +recovered or backfilled. This also stages the recovery of large objects +using a temporary object and uses ``PushOp`` messages to send data to the OSDs +being backfilled. It might be possible to use most of the recover object +process without changes, just changing the ``PushOp`` messages to be sent to a +different PG and sending the messages for all shards as the entire object is +being migrated. + +Lets consider the differences between the backend recovery op and CopyFrom: + +* ``CopyFrom`` is a process that runs in ``PrimaryLogPG`` above either the + replica or ``ECBackend`` that copies an object from a primary OSD for one PG + to the primary OSD for another PG. In the case of EC the primary OSD may need + to issue ``SubOp`` commands to other OSDs to read/write the data. +* ``run_recovery_op`` implemented by replica and EC pools runs on the primary + OSD and reads data (in the case of EC issuing ``SubOp`` commands to other + OSDs) but then issues ``PushOp`` commands to write the recovered data to the + destination OSDs. +* ``CopyFrom`` working at the ``PrimaryLogPG`` level ensures that the copied + object is included in the PG stats and gets its own PG log entry so the + update can be rolled forward/backwards and can be recovered by async + recovery. +* ``run_recovery_op`` is implemented at the ``PGBackend`` level and assumes the + PG already has stats and a PG log entry for the object, it is just + responsible for bringing other shards in the PG up to date. +* CopyFrom ends up issuing read and write ops to the PGBackend, it doesn't + provide techniques for copying a snapshot and preserving its + space-efficiency. +* ``run_recovery_op`` is meant to preserve space-efficiency of clones (not + implemented yet for EC pools and replica pools have bugs) – the ``PushOp`` + message includes a way of describing which parts of an object should be + clones. + +For pool migration we probably want a hybrid implementation. We can probably +re-use a lot of the ``run_recovery_op`` code to read the object that we want to +migrate, and ideally handle the space-efficiency in snaps. Instead of issuing +PushOps we probably want to issue a new ``COPY_PUT`` type op to the priamry PG +of the target pool, but passing the same kind of information as a PushOp so we +can keep track of what needs to be cloned. The target pool can then submit a +mixture of write and clone ops to the PGBackend layer to create the object as +well as updating the PG stats and creating a PG log entry. + + +Splitting PGs +~~~~~~~~~~~~~ + +Normally a pool has a number of PGs that is a power of 2. 

Splitting PGs
~~~~~~~~~~~~~

Normally a pool has a number of PGs that is a power of 2. This is because we
want each PG to hold roughly the same number of objects, and we use the most
significant N bits of the object hash to select which PG to use. However,
doubling the number of PGs that a pool has causes approximately half the
objects in the pool to need to be moved to a new PG. We don't want all this
migration to happen at once; we want it to be paced over time to have less
impact. To deal with this the MGR controls the increase in the number of PGs:
it has a target for how many PGs the pool should have and slowly increases the
number of PGs, waiting for PGs to finish recovery before doing further splits.

When a pool has a non-power of 2 number of PGs, not all PGs are
the same size. For example, if there are 5 PGs then PGs 0 and 4 will be half
the size of PGs 1 to 3 because the choice between PG 0 and 4 is based on one
extra bit of the object hash. While this is not desirable as a long-term state
it is fine during the splitting process.

Relevance to pool migration:

* Pool migration needs to migrate all the objects in all the PGs in the old
  pool to the new pool. Just like splitting, we don't want to overwhelm the
  system while performing the migration.
* Pool migration should therefore migrate one (or a small number) of PGs at
  a time.
* A process needs to monitor the progress of migrations, notice when PGs finish
  migrating and start the next PG. This could either be in the MON (in which
  case it would need to be event driven, with OSDs telling the MON when a PG
  has finished migrating - somewhat similar to how PG merges work) or it could
  be implemented in the MGR (in which case the MGR can poll the state of the
  PGs and then tell the MON via a CLI command to start the next PG migration).


Direct I/O / Balanced Reads
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The EC direct I/O feature is making changes to the client to decide which OSD
to send client I/O requests to; it is building on top of the balanced reads
flag for replica pools, which tells the client to distribute read I/Os evenly
across all the OSDs in a replica PG rather than sending them all to the
primary.

Relevance to pool migration:

* It's changing code in the client at a similar place to where we want the
  client to implement pool migration, deciding which pool (and hence PG and
  OSD) to send I/O to.
* Direct I/O / balanced reads are permitted to be failed by the OSD that
  receives the request with ``EAGAIN`` to deal with corner cases where the OSD
  is unable to process the I/O. In this case the client retries the I/O but
  sends it to the primary OSD. A similar retry mechanism is going to be
  required when a client issues an I/O to the wrong pool because an object has
  been recently migrated. When I/Os are retried, we need to worry about
  ordering as this generates opportunities for I/Os to overtake or be
  reordered. See the Read/Write Ordering section below.
* Direct I/O is adding extra information to the ``pg_pool_t`` structure that is
  part of the ``OSDMap`` that gets sent to every Ceph daemon and client by the
  monitor. This extra information is being used to determine that direct I/O is
  supported and to help work out where to route the I/O request. Pool migration
  will similarly need to add details to the ``pg_pool_t`` structure so that
  clients are aware that a migration is happening, as sketched below.
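
Purely to make the previous point concrete, the following hypothetical sketch
shows the kind of fields that might be added to ``pg_pool_t``. The member names
and types are invented for illustration and are not a proposal for the actual
encoding (which would also need feature-bit and compatibility handling):

.. code-block:: cpp

   // Hypothetical additions to pg_pool_t for pool migration (illustrative
   // only; these members do not exist today).

   #include <cstdint>
   #include <set>

   struct pg_pool_migration_sketch_t {
     int64_t migration_target = -1;   // pool id we are migrating to, or -1
     int64_t migration_source = -1;   // set on the target pool, or -1
     // PGs currently migrating.  Per the walkthrough later in this document,
     // PGs above the smallest member of this set (and not in the set) have
     // finished migrating; PGs below it have not started.
     std::set<uint32_t> migrating_pgs;
     bool migration_complete = false; // pool is now only a redirect stub
   };

Because ``pg_pool_t`` is distributed to every daemon and client via the
``OSDMap``, anything recorded here should change infrequently - hence the rule
of thumb above of updating the ``OSDMap`` no more than about once a second
during a migration.
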

Read/Write Ordering
-------------------

Ceph has some fairly strict read / write ordering rules. Once a write has
completed to the client, any read must return the new data. Prior to the write
completing, a read is expected to return all old data or all new data (a
mixture is not permitted). If writes A and B are issued back-to-back to the
same object (without waiting for the first to complete) then write A is
expected to be applied before write B - ordering of the writes is expected to
be preserved through the client, messenger and the OSD. If write A and read B
are issued concurrently then there is scope for read B to overtake write A.
There is a flag ``RWORDERED`` that can be set that prevents this overtaking
from happening.

There are no ordering guarantees when reads or writes are issued to different
objects - these objects are almost certainly stored on different OSDs, and even
if they are on the same OSD they will be processed by different threads with
different locks so can easily be reordered.

There do not appear to be many uses of the ``RWORDERED`` flag: RBD and RGW do
not use the flag; CephFS uses the flag in the MDS ``RecoveryQueue`` (which
calls ``filer.probe``, implemented in ``osdc/Filer.cc``), which I think is
only used in some recovery scenarios.

These rules make it tricky to implement the watermark in the client and use
this to decide which pool to route I/O requests to without using something
equivalent to a new epoch to advance the watermark. The problem is that if the
watermark is advanced without quiescing I/O it is possible that this causes
requests to be reordered.

For example:

* Write A issued to old pool.
* Write B issued to old pool.
* Write A fails with updated watermark and is retried to new pool.
* Read B with ``RWORDERED`` issued to new pool.
* Write B fails and needs to be retried to new pool.

In this example read B has overtaken write B.

Perhaps more concerning is that the rules would also be broken if instead of
read B we issued another write to B.

The simplest way to prevent reordering violations is to not advance the
watermark while there are outstanding writes (or reads with the ``RWORDERED``
flag set) in flight. This isn't ideal as it may result in quite a number of
I/Os being failed for retry before the watermark can be updated.

A more sophisticated implementation stalls issuing new writes to objects
with a hash between the old and new watermark while there are other writes
in flight to objects with a hash between the old and new watermark.


Other Pool Migration Issues
---------------------------

Other topics that we need to think about for pool migration.


Avoiding Name Collisions
~~~~~~~~~~~~~~~~~~~~~~~~

For the first release we will require that the target pool is empty when the
migration starts (by having an interface that only starts a migration when a
new pool is being created). We can also protect against objects being written
to the target pool during the migration by adding client code to reject
attempts to initiate requests to the target pool (the client code itself is
still permitted to redirect requests from the source pool to the target pool).
Because we will require a minimum client version to use pool migration, this
will ensure that all clients include this extra policing. OSDs cannot
themselves implement the policing so there is no protection against a rogue
client - we should probably have the migration halt rather than crash if a name
collision is found.
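
A minimal sketch of what this client-side policing might look like is shown
below. The helper and field names are hypothetical and do not reflect existing
code; internal redirects from the source pool set a flag so that they are still
allowed through:

.. code-block:: cpp

   // Hypothetical client-side check rejecting application I/O that is aimed
   // directly at a pool which is currently the target of a migration.
   // Illustrative only.

   #include <cerrno>
   #include <cstdint>

   struct op_sketch_t {
     int64_t target_pool;
     bool redirected_from_source = false;  // set by the migration redirect path
   };

   struct pool_info_sketch_t {
     bool is_migration_target = false;
   };

   // Returns 0 if the op may be submitted, or a negative errno to fail it
   // back to the application.
   int check_migration_target_policing(const op_sketch_t& op,
                                       const pool_info_sketch_t& pool) {
     if (pool.is_migration_target && !op.redirected_from_source) {
       return -EPERM;  // applications must not use the target pool directly
     }
     return 0;
   }

Since OSDs cannot enforce this, the check is only advisory: it protects
well-behaved clients, and the migration itself should halt cleanly if a
collision is detected anyway.
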

Post first release, if there is a use case for merging pools, it is
theoretically possible to deal with name collisions by additionally using the
pool which the client is accessing the object from to uniquify the name. This
would require extra information in the request from the client to the OSDs.


Stopping Changes to the Number of PGs During Migration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

During a migration we don't really want to be changing the number of PGs in
the source pool. There are three reasons why:

#. We don't really want to be moving objects around in the source pool when we
   are about to migrate them - we are probably better off getting on with the
   migration than trying to fix any imbalance in the source pool.
#. Splitting/merging PGs in the source pool makes it harder to schedule the
   migration. Scheduling is done at two levels - we say how many source PGs are
   migrating at a time and then control the rate of migration within a source
   PG. If we split/merge the source pool this makes selecting which PGs to
   migrate more difficult.
#. If we block splits and merges and migrate the PGs in reverse order (starting
   with the highest numbered PG in the pool) then we can reduce the number of
   PGs in the source pool as PGs finish migrating. This helps keep the overall
   number of PGs more manageable.

In contrast we don't really care so much about the target pool - we can easily
cope with splits/merges while the migration is in progress. From a performance
perspective we do however want to avoid migrating objects to the target pool
and then having splits/merges occur that copy the objects a second time. That
means that normally we would want to set the number of target pool PGs to be
the same as the source pool at the start of the migration.

We might also want to disable the auto-scaler by default for the target pool
during the migration, as we don't want it seeing a nearly empty target pool
with loads of PGs and concluding that it should reduce the number of PGs.


CLI and UI
~~~~~~~~~~

Pool migration will need a new CLI to start the migration; there will also need
to be a way of monitoring which PGs are migrating and the progress of the
migration. The CLI to start a migration will need to be implemented by the MON
(``OSDMonitor.cc`` already implements most of the pool CLI commands) because
the migration will need to update the ``pg_pool_t`` structures in the
``OSDMap`` to record details of the migration.

The new map will then be distributed to clients and OSDs so that they know that
the migration has started. PGs that have been scheduled to start migration will
need to determine at the end of the peering process that they don't need to
recover or backfill and that they should attempt to schedule a migration (this
will need new PG states ``MIGRATION_WAIT`` and ``MIGRATING``).

We will need to work with the dashboard team to add support for pool migration
to the dashboard and to provide a REST API for starting a migration.

We will want to block some CLIs while a pool migration is taking place:

* We don't want to be able to split/merge PGs in the source pool while it is
  being migrated (see above).
* We don't want the target pool to become the source of another migration
  (no chaining migrations).

Some of these CLIs are issued by the MGR; in this case we will probably need to
change the MGR code to either cope with the failures and/or to detect that the
pool is migrating and avoid issuing the CLIs.
We will probably need both: although checking whether the pool is migrating
before issuing a CLI is more efficient, it is exposed to a race hazard where
the migration may start between the check and the CLI being issued.

We need to look at how the progress of things like backfill and recovery is
reported in the UI (possibly by ``HealthCheck``?) and think about how to report
the progress of a pool migration. We need to think about what the right units
for reporting progress are (e.g. number of objects out of total objects, number
of PGs out of total PGs or just a percentage).


Backwards Compatibility / Software Upgrade
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pool migration requires code changes in the Ceph daemons (MON, OSD and possibly
MGR) and to the Ceph clients that issue I/O. We can't allow a pool migration to
happen while any of these are running old code because the old code won't
understand that a pool migration is happening. Old clients won't have any way
of directing I/O to the correct pool, PG and OSD, and having OSDs forward all
these requests to the correct OSD would be far too expensive.

Ceph daemons and clients have a set of feature bits indicating what features
they support, and there are mechanisms for setting a minimum set of feature
bits that are required by daemons and, separately, for clients. Once set this
prevents down-level daemons and clients connecting to the cluster. There are
also mechanisms to ensure that once a minimum level has been set it cannot be
reversed.

Pool migration will need to define a new feature bit and use the existing
mechanisms for setting minimum required levels for daemons and clients. The new
pool migration CLIs will need to fail an attempt to start a migration unless
the minimum levels have been set.


End of Migration
~~~~~~~~~~~~~~~~

When a migration completes, we will have moved all objects from pool A to pool
B; however, clients (e.g. RBD, CephFS, RGW, ...) will still have pool A
embedded in their own data structures. We don't want to force all the clients
to update their data structures to point at the new pool, so instead we will
retain stub information about pool A saying that it has been migrated and that
all I/O should now be submitted to pool B.

Retaining a stub ``pg_pool_t`` structure in the ``OSDMap`` is cheap - there
won't be thousands of pools and there isn't that much data stored for the pool.
We will want to ensure that the old pool has no PGs associated with it; we can
do this by reducing the number of PGs it has to 0 and letting the same code
that runs when PGs are merged clean up and delete the old PGs.

We need to think about the consequences of this for the UI. While in
the code we start with pool A and create and migrate objects to pool B, from
the perspective of the UI we probably want to show this as a transformation of
pool A and hide the existence of pool B from the user.

An alternative implementation would just show the pool redirection in the UI,
so users would see that an RBD image uses pool A but would then find that pool
A has been migrated to pool B. This alternative implementation might be better
if we plan to support merging of pools (migration to a non-empty target pool)
in the future.
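
To illustrate what the retained stub implies for clients, a hypothetical lookup
might resolve the redirect in a single hop (the names below are invented; the
real logic would live wherever the client maps a pool id to a ``pg_pool_t``):

.. code-block:: cpp

   // Hypothetical resolution of a migrated-pool stub to the live pool.
   // Because the MON is expected to flatten chains of migrations to a single
   // level (see the MON walkthrough below), one hop is sufficient here.

   #include <cstdint>
   #include <map>

   struct pool_stub_sketch_t {
     bool migrated = false;     // true once the pool is only a redirect stub
     int64_t redirect_to = -1;  // pool that now holds the objects
   };

   int64_t resolve_pool(const std::map<int64_t, pool_stub_sketch_t>& pools,
                        int64_t pool_id) {
     auto it = pools.find(pool_id);
     if (it != pools.end() && it->second.migrated) {
       return it->second.redirect_to;  // submit I/O to the new pool instead
     }
     return pool_id;                   // pool has not been migrated
   }
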

Walkthrough of how Pool Migration Might Work
--------------------------------------------


Initiating the Pool Migration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. The user creates a new pool, perhaps using a new flag ``--migratefrom`` to
   say they want to start a pool migration.
#. Starting the migration as part of pool creation means we know the pool is
   initially empty.
#. Unless the user specifies a number of PGs we can ensure that the newly
   created pool has the same number of PGs as the source pool. There is no
   requirement that the number of PGs is the same; it just avoids having to
   perform a migration and then perform a second copy of data as the number of
   PGs is adjusted to cater for the eventual number of objects in the pool.
#. The CLI command sets up the ``pg_pool_t`` structures in the ``OSDMap`` to
   indicate that a pool migration is starting. We record that pool A is being
   migrated to pool B, and record which PG(s) we are going to start migrating.
   If we are going to migrate more than one PG at a time, we probably want to
   specify a set of PGs (e.g. 0,1,2,3) that are being migrated. Any PG in the
   set is migrating. Any PG not in the set that is higher than the lowest value
   in the set is assumed to have completed migration; any PG not in the set
   that is lower than the lowest value in the set is assumed to have not
   started migration.
#. We migrate PGs in reverse order - so for example if a pool has PGs 0-15
   then we will start by migrating PG 15.
#. The MON publishes the new ``OSDMap`` as a new epoch.


Client
~~~~~~

#. Clients use the ``pg_pool_t`` structure in the ``OSDMap`` to work out that a
   migration is in progress.
#. From the range of PGs being migrated they can work out which PGs have been
   migrated, which have not started migrating and which are in the process of
   migrating (see the sketch after this section).

   a. If an I/O is submitted to a PG that has been migrated, the object hash
      and new pool are used to determine which PG and OSD to route the I/O
      request to.
   b. If an I/O is submitted to a PG that has not started migration, the object
      hash and old pool are used to determine which PG and OSD to route the I/O
      request to.
   c. If an I/O is submitted to a PG that is marked as being migrated, the
      client checks if it has a cached watermark for this PG. If it does, then
      it uses this to decide whether to route the request to the old or new
      pool. If it has no cached watermark, it guesses and sends the I/O to the
      old pool.

#. If an I/O is misrouted to the wrong pool the OSD will fail the request,
   providing an update to the watermark. The client needs to update its cached
   copy of the watermark and resubmit the I/O.
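
The classification rule described above (and in step 4 of the initiation
sequence) is small enough to sketch directly. The function and enum below are
hypothetical names used purely for illustration:

.. code-block:: cpp

   // Hypothetical helper deciding what a client should assume about a PG in
   // the source pool, given the set of currently migrating PGs published in
   // pg_pool_t.  Migration proceeds from the highest-numbered PG downwards.

   #include <cstdint>
   #include <set>

   enum class pg_migration_state { not_started, migrating, done };

   pg_migration_state classify_pg(uint32_t pg,
                                  const std::set<uint32_t>& migrating_pgs) {
     if (migrating_pgs.empty()) {
       // An empty set needs the overall pool state (migration not started,
       // or already complete), which is tracked separately.
       return pg_migration_state::not_started;
     }
     if (migrating_pgs.count(pg)) {
       return pg_migration_state::migrating;
     }
     uint32_t lowest = *migrating_pgs.begin();  // std::set is ordered
     return pg < lowest ? pg_migration_state::not_started
                        : pg_migration_state::done;
   }
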

OSD
~~~

#. OSDs use the ``pg_pool_t`` structure in the ``OSDMap`` to work out if a PG
   needs migrating.
#. At the end of peering, if the PG needs migrating and is not performing
   backfill or recovery, it sets the PG state to ``MIGRATION_WAIT`` and checks
   with other OSDs whether they have the resources and free capacity to start
   the migration.
#. If everything is good the PG changes its state to ``MIGRATING``, sets the
   watermark to 0, and the scheduler is instructed to start scheduling
   migration work.
#. Migration starts by scanning the next range of objects to be migrated,
   creating a list of object OIDs.
#. Each object is then migrated, with the watermark being updated after the
   object has been migrated.

   a. The primary reads the object and sends it to the primary of the target
      PG, which then writes the object.
   b. If the object is large this is done in stages, with the target using a
      temporary object name that is renamed when the last data is written.
   c. Once an object has been migrated it is deleted from the source pool.

#. The OSD checks the object hash of incoming client I/O against the watermark.
   If the I/O is below the watermark it is failed for retry to the new pool,
   providing the current watermark for the client to cache.
#. If a PG completes a migration, then it sends a message to the MON telling
   it that the migration has completed.


MON
~~~

#. When the MON gets a message from an OSD saying that a migration has
   completed, it updates the set in the ``pg_pool_t`` to record that the PG has
   finished migration and that the next PG is starting migration. A new
   ``OSDMap`` is published as a new epoch.
#. Because migrations are scheduled in reverse order and objects are deleted
   as the migration happens, as PG migrations complete we should have empty PGs
   that can be deleted by simply reducing the number of PGs that the source
   pool has. PG migrations might not complete in the order in which they are
   started, so we might have a few empty PGs hanging around that cannot be
   deleted until another PG migration completes.
#. At the end of the migration there are no more PGs to start migrating, so
   the set of migrating PGs diminishes. When the set becomes empty we should
   have also reduced the number of PGs for the source pool to zero, and at this
   point the migration is complete. The MON can make final updates to the
   ``pg_pool_t`` state to indicate the migration has finished. The
   ``pg_pool_t`` structure needs to be kept so that clients know to direct all
   I/O requests for this pool to the new pool instead.
#. Pools can be migrated more than once; this can result in multiple stub
   ``pg_pool_t`` structures being kept. We do not want to have to recurse
   through these stubs when I/Os are submitted, so at the end of a migration
   the MON should attempt to reduce these redirects to a single level.


Testing and Test Tools
----------------------

The objectives of testing pool migration are:

#. Validate that all the objects in the source pool are migrated to the target
   pool and that their contents (data, attributes and OMAP) are retained.
#. Validate that during a migration objects can be read (data, attributes,
   OMAP) for objects that haven't yet been migrated, objects that have been
   migrated and objects in the middle of being migrated.
#. Validate that during a migration objects can be updated (create, delete,
   write, update attributes, update OMAPs) for objects that haven't yet been
   migrated, objects that have been migrated and objects in the middle of being
   migrated.
#. Validate pool migration under error scenarios, including resetting and
   failing OSDs.
#. Validate that snapshots and clones are migrated and can be used during a
   pool migration.
#. Validate the UI for pool migration, including restrictions placed on the UI
   during the migration.
#. Validate migrating multiple different pools in parallel, and migrating a
   single pool multiple times in series.
#. Validate pool migration with unreadable objects (excessive medium
   errors plus possibly other failures that defeat the redundancy of the
   replica/EC pool without taking it offline).
#. Validate software upgrade / compatibility for both daemons (OSD, MON,
   MGR) and clients.
#. Validate the performance impact during a migration.

Pool migration makes changes to client code, so all modified clients will need
testing.

Existing tools such as ``ceph_test_rados`` are good for creating and exercising
a set of objects and performing some consistency checking of objects.

A simple script is probably better for creating a large number of objects and
then validating the contents of the objects. Writing a script is probably
better for being able to test attributes and OMAPs as well. If the script has
two phases (create objects and validate objects) then these phases can be run
at different times (before, during and after pool migration) to test different
aspects. The script could use a command line tool such as ``rados`` to create
and validate objects, using pseudo-random numbers to generate data patterns,
attributes and OMAP data that could then be validated. The script would need to
run many ``rados`` commands in parallel to generate a decent I/O workload.
There may be scripts that already exist that can do this; it may be possible to
adapt ``ceph_test_rados`` to do this.

Tools such as ``VDBench`` can test the data integrity of block volumes, either
by creating a data set and then validating it, or by continuously creating and
updating data while keeping a journal so it can be validated at any point.
However, block volume tools can only test object data, not attributes or OMAPs.

A tool such as ``FIO`` is best suited for doing performance measurements.

The I/O sequence tool ``ceph_test_rados_io_sequence`` is probably not useful
for testing pool migration - it specializes in testing a very small number of
objects and focuses on boundary conditions within an object (e.g. EC chunk
size, stripe size) and data integrity.

The objective should be to use teuthology to perform most of the testing for
pool migration (at a minimum items 1 to 5 in the list above). It should be
possible to add pool migration as an option to existing tests in the RADOS
suite, extending the ``thrashOSD`` class to include the option of starting a
migration.
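
To make the two-phase create/validate idea above concrete, a small
librados-based sketch (rather than shelling out to ``rados``) might look like
the following. The pool name, object naming scheme and object count are
arbitrary, and error handling is omitted:

.. code-block:: cpp

   // Sketch of a two-phase create/validate workload using librados.  Object
   // contents are derived from a deterministic per-object seed so that the
   // validate phase can recompute the expected data at any time (before,
   // during or after a migration).  Illustrative only.

   #include <rados/librados.hpp>
   #include <cassert>
   #include <random>
   #include <string>

   static librados::bufferlist pattern_for(int i) {
     librados::bufferlist bl;
     std::mt19937 rng(i);                 // deterministic per-object seed
     for (int j = 0; j < 4096; ++j) {
       bl.append((char)(rng() & 0xff));
     }
     return bl;
   }

   void create_phase(librados::IoCtx& ioctx, int count) {
     for (int i = 0; i < count; ++i) {
       librados::bufferlist bl = pattern_for(i);
       ioctx.write_full("obj." + std::to_string(i), bl);
       // Attributes and OMAP entries could be generated the same way and
       // written with setxattr() and omap_set().
     }
   }

   void validate_phase(librados::IoCtx& ioctx, int count) {
     for (int i = 0; i < count; ++i) {
       librados::bufferlist expected = pattern_for(i);
       librados::bufferlist actual;
       ioctx.read("obj." + std::to_string(i), actual, expected.length(), 0);
       assert(actual.contents_equal(expected));
     }
   }

   int main() {
     librados::Rados cluster;
     cluster.init(nullptr);               // default client id
     cluster.conf_read_file(nullptr);     // read the default ceph.conf
     cluster.connect();
     librados::IoCtx ioctx;
     cluster.ioctx_create("migration-test-pool", ioctx);  // hypothetical pool
     create_phase(ioctx, 1000);
     // ... start the pool migration here, then run validate_phase() before,
     // during and after it completes ...
     validate_phase(ioctx, 1000);
     cluster.shutdown();
     return 0;
   }

Many such processes would need to run in parallel to generate a decent I/O
workload, as noted above.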