docs/dev: Add design document with information on proposed design for pool migration

author Connor Fawcett <connorfa@uk.ibm.com>

Mon, 29 Sep 2025 01:21:16 +0000 (02:21 +0100)

committer Connor Fawcett <connorfa@uk.ibm.com>

Mon, 6 Oct 2025 12:52:23 +0000 (13:52 +0100)
author Connor Fawcett <connorfa@uk.ibm.com>
Mon, 29 Sep 2025 01:21:16 +0000 (02:21 +0100)
committer Connor Fawcett <connorfa@uk.ibm.com>
Mon, 6 Oct 2025 12:52:23 +0000 (13:52 +0100)
diff --git a/doc/dev/pool-migration-design.rst b/doc/dev/pool-migration-design.rst

new file mode 100644 (file)

index 0000000..72d91e0
--- /dev/null
+++ b/doc/dev/pool-migration-design.rst
@@ -0,0 +1,744 @@
+=========================
+ Design of Pool Migration
+=========================
+
+
+Background
+==========
+
+The objective for pool migration is to be able to migrate all RADOS objects
+from one pool to another within the same cluster non-disruptively. This
+functionality is planned for the Umbrella release.
+
+Use cases for pool migration:
+
+* This provides the ability to change the erasure code profile (and in
+  particular the choice of K and M) non-disruptively. Implementing this as a
+  non-disruptive migration between pools is simpler and no less efficient than
+  trying to perform this type of transformation in place.
+* Converting between replica and erasure coded pools. Changes being made to add
+  OMAP and class support to EC pools will remove the need to have a separate
+  replica pool for metadata when using RBD, CephFS or RGW, this should make
+  these migrations viable in conjunction with this work.
+* The general use case of wanting to migrate data between two pools.
+
+By non-disruptive we mean that there will be no time where I/O or applications
+need to switch from using the old pool to the new pool, not even a very short
+outage at the start or end of the migration (as for example is required by RBD
+live migration). The migration will however need to read and write every object
+in the pool so there will be a performance impact during the migration -
+similar to that when PGs are backfilling. The same techniques and controls that
+are used when splitting/merging PGs and backfilling individual PGs will be used
+by pool migration to manage this impact.
+
+For the first release we will require that the target pool is empty to begin
+with (this means we don't need to worry about objects with the same name).
+See the section on avoiding name collisions for more details. Supporting merging of
+pools (either where constraints prevent object name collisions or where these
+collisions are resolved automatically during the migration) is a possible
+future enhancement, but it's not clear what use cases this solves.
+
+During a pool migration restrictions are placed on the source pool, it is not
+permitted to modify the number of PGs (to cause splits or merges). See sections
+on stopping changes to the number of PGs during migration and on the CLI and UI
+for more details. Deletion of the source pool will not be permitted during a
+migration. Other actions such as rebalancing are permitted but perhaps should
+be discouraged as the data is being moved anyway.
+
+During a pool migration restrictions are placed on the target pool, it is not
+permitted to migrate this pool to another (i.e. no daisy chained or cyclical
+migrations). Splits, merges and rebalancing are permitted. Deletion of the
+target pool will not be permitted during a migration. Once a pool has finished
+migrating it is permited to start a new migration of the target pool of a
+previous migration.
+
+For the first release there is no option to cancel, suspend or reverse a pool
+migration once it has started.
+
+For the first release there is no plan to have the clients update their
+references to the pool once the migration has completed, they will continue
+to reference the old pool and objector (``librados``) will reroute the request
+to the new pool. The ``OSDMap`` will retain stub information for the old pool
+redirecting to the new pool.
+
+The feature requires changes to client code; all clients and daemons will
+need to be upgraded before a pool migration is permitted. The two main clients,
+objector (in ``librados``) and the kernel client will be updated. Updates to
+the kernel client are likely to lag the Umbrella release. Where the clients are
+integrated into other products (e.g. ``ODF``) these products will need to
+incorporate the new clients before the feature can be used.
+
+For the first release there is no plan to support pool migration between Ceph
+clusters. Theoretically this could be added later building upon the first
+release code but would require substantial extra effort. It would require
+clients to be able to update references to the cluster and pool once the
+migration had completed and for clients to be able to redirect I/O to a
+different cluster. There would be extra authentication challenges as all OSDs
+and clients in the source cluster would need to be able to submit requests to
+the target cluster.
+
+
+Design
+======
+
+
+Reuse of Existing Design
+------------------------
+
+Let's start by looking at existing code or features that we can copy / reuse
+/ refactor / take inspiration from, we don't want to reinvent the wheel or
+repeat past mistakes.
+
+
+Backfill
+~~~~~~~~
+
+Backfill is a process run by a PG to recover objects on an OSD that has either
+just been added to the PG (starts with no objects) or has been absent from the
+PG for a while (has some objects that are up to date, some that are stale, is
+probably missing new objects and may have objects that are no longer needed
+because they were deleted while it was absent).
+
+Backfill takes a long time so I/O must be permitted to continue while the
+backfill happens. It uses the fact that all objects have a hash and that it is
+possible to list objects in hash order. This means that backfill can recover
+objects in hash order and can simply keep a watermark hash value to track what
+progress has been made. I/Os to objects with a hash below the watermark are to
+an object that has been recovered and need to update all OSDs including the
+backfilling OSD. I/Os to objects with a hash above the watermark can ignore the
+backfilling OSDs as the backfill process will recover this object later. The
+object(s) currently being recovered by the backfill process are locked to
+prevent I/O for the short time it takes to backfill an object.
+
+Another property of backfill is that the process is idempotent, while there are
+performance benefits to preserving the watermark there is no correctness issues
+if the watermark is reset to the start and the backfill process starts again as
+repeating the process will determine that objects have already been recovered.
+This simplifies the design because the watermark doesn't have to be rigorously
+replicated and checkpointed, although for backfill it is part of ``pg_info_t``
+so progress is checkpointed fairly frequently.
+
+Relevance to pool migration:
+
+* Pool migration can list objects in hash order and migrate them to the new pool.
+* A watermark can be used to keep track of which objects have been migrated.
+  There is no need for the watermark to be persistent.
+* Clients can cache a copy of the watermark to help direct I/Os to the correct
+  pool and PG.
+* The client's cached copy can become stale, if I/Os are misdirected they will
+  be failed providing an up-to-date watermark so the I/O can be retried.
+* Backfill recovers all parts of a RADOS object - attributes, data and
+  OMAP. Large objects are recovered in phases (something like 2MB at a time)
+  and utililize a temporary object which is renamed at the end of the recovery
+  for atomicity. If a peering cycle interrupts the process, then the
+  temporary object is discarded. If pool migration uses this technique it needs
+  to be aware that a peering cycle might disrupt the target pool but not the
+  source pool and therefore may need to restart the migration of the object if
+  the target discards the object.
+* Backfill recovers a RADOS head object and its associated snapshots at the
+  same time and uses locking (e.g.
+  ``PrimaryLogPG::is_degraded_or_backfilling_object`` and
+  ``PrimaryLogPG::is_unreadable_object``) to ensure that none of these can be
+  accessed while they are being recovered because of dependencies between them.
+  Pool migration needs to migrate the head object and the snapshots at the same
+  time and needs to ensure we don't process I/O to the object halfway through
+  this process.
+* Backfill is meant to preserve the space-efficiency of snapshots when
+  recovering them using the information in the snapset attribute to work out
+  which regions of the snapshots are clones - see ``calc_clone_subsets`` /
+  ``calc_head_subsets``. This hasn't been implemented for EC pools yet and
+  `tracker 72753 <https://tracker.ceph.com/issues/72753>`_ shows it currently
+  isn't working for replica pools either. We will want to use this (and the way
+  a ``PushOp`` re-establishes cloned regions) for migration.
+
+Unlike backfill, for migration we want the clients to know the watermark so
+they can route I/Os to the old/new pool. We don't care if clients have a
+stale watermark - this will just cause a few I/Os to be incorrectly routed to
+the old pool which can fail them back to the client and communicate a new
+watermark so the I/O can be resubmitted to the new pool.
+
+We deliberately make updating the client's copy of the watermark lazy - there
+could be hundreds or thousands of clients so updating them all the time would
+be expensive. Putting the watermark into the ``OSDMap`` and issuing new epochs
+to distribute it to all the clients would be even more expensive. In contrast
+we are thinking about recording which PGs are migrating/have finished migrating
+in the ``OSDMap`` - a rule of thumb would be to try and only update the
+``OSDMap`` once a second during a migration.
+
+For migration to be able to support direct reads we do need all the OSDs in the
+PG to know where the watermark is and for this to be updated as each object is
+migrated. Migrating an object involves reading it from the source pool, writing
+it to the target pool and then deleting it from the source pool. Other OSDs can
+update migration progress as they process the delete request. There will be
+some complexity regarding direct reads and migrating an object + its snapshots.
+There is already some code that fails direct reads with ``EAGAIN`` (to redirect
+these to the primary) when an object + its snapshots have not all been
+recovered, we may need to use this when midway through migrating an object
++ snapshots and then have the primary stall the I/O until the object +
+snapshots have all been migrated before failing the I/O again for redirection
+to the new pool.
+
+The watermark doesn't necessarily need to be checkpointed to disk, it is cheap
+to find the object with the lowest hash in a PG so we could do this to
+recalculate the watermark whenever peering starts migration.
+
+
+Scheduling Backfill / Recovery
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Deciding how to prioritize backfill/recovery and how fast to run this process
+versus processing I/O from clients is a complex problem. Firstly, a decision
+is made as to which PGs should be backfilling/recovering, and which should
+wait. This involves messages between OSDs and considers whether I/O is blocked
+and how much redundancy the PG has left (for example a replica-3 pool with 2
+failures is prioritized over a replica-3 pool with 1 failure). Secondly once a
+PG has been selected to backfill/recover the schedule has to decide how
+frequently to perform backfill/recovery versus process client I/O. This happens
+within the primary OSD using weighted costs.
+
+Relevance to pool migration:
+
+* Pool migration is less critical than backfill or recovery. It needs to fit
+  into the same process to determine when a PG should start migrating.
+* Once a PG is permitted to start migration the OSD scheduler needs to pace the
+  work. The overheads for migrating an object are like the overheads for
+  backfilling an object so hopefully we can just copy the backfill scheduling
+  for migration.
+
+The objective is to reuse as much of the scheduler (e.g. ``mclock``) as
+possible, just teaching it that migration has a lower priority than backfill or
+async recovery but higher priority than deep scrub.
+
+``Mclock`` works by assigning a weighting to each backfill / recovery op and
+each client I/O request, it also benchmarks OSDs at startup to get some idea
+what the maximum performance of the OSD is. This information is then used to
+work out when to schedule background work. The same concepts should work for
+migration requests. We will need to assign a weighting to migration work;
+this should be similar/identical to the weighting for backfills.
+
+We will take a similar approach for supporting clusers running with
+``WeightedPriorityQueue`` scheduling.
+
+The expectation is that there should be no need for new tuneable settings
+for migration, the existing tuneable settings for backfill/recovery should be
+sufficient, we don't want to further complicate this part of the UI.
+
+
+Statistics
+~~~~~~~~~~
+
+I believe there are a few statistics collected about the performance of
+backfill/recovery. We should supplement these with similar statistics
+about the process of migrations.
+
+We need to consider OSD stats that are gathered by ``Prometheus`` and any
+progress summary that is presented via ``HealthCheck`` and/or the UI.
+
+
+CopyFrom
+~~~~~~~~
+
+``CopyFrom`` is a RADOS op that can copy the contents of an object into a new
+object. It is sent to the OSD and PG that will store the new object. The OSD
+is responsible for reading the source object which involves sending messages
+to another OSD and PG and then writing the data it reads to the new object. If
+the object being copied is large, then the copy operation is broken up into
+multiple stages and this is made atomic by using a temporary object to store
+the new data until the last data has been copied at which point the temporary
+object can be renamed to become the new object.
+
+Relevance to pool migration:
+
+* Pool migration needs to copy objects from the old pool to a new pool - this
+  will involve one OSD and PG reading the object and another OSD and PG writing
+  the object.
+* Pool migration will want to drive the copy operation from the source side,
+  so we probably need a ``CopyTo`` type operation.
+* The way messages are sent between OSDs, the way a large object copy is staged
+  and the use of a temporary object name when staging are all concepts that can be
+  reused.
+
+Alternatively, pool migration might want to copy the recover object
+implementation in ``ECBackend`` which is used to recover an object being
+recovered or backfilled. This also stages the recovery of large objects
+using a temporary object and uses ``PushOp`` messages to send data to the OSDs
+being backfilled. It might be possible to use most of the recover object
+process without changes, just changing the ``PushOp`` messages to be sent to a
+different PG and sending the messages for all shards as the entire object is
+being migrated.
+
+Lets consider the differences between the backend recovery op and CopyFrom:
+
+* ``CopyFrom`` is a process that runs in ``PrimaryLogPG`` above either the
+  replica or ``ECBackend`` that copies an object from a primary OSD for one PG
+  to the primary OSD for another PG. In the case of EC the primary OSD may need
+  to issue ``SubOp`` commands to other OSDs to read/write the data.
+* ``run_recovery_op`` implemented by replica and EC pools runs on the primary
+  OSD and reads data (in the case of EC issuing ``SubOp`` commands to other
+  OSDs) but then issues ``PushOp`` commands to write the recovered data to the
+  destination OSDs.
+* ``CopyFrom`` working at the ``PrimaryLogPG`` level ensures that the copied
+  object is included in the PG stats and gets its own PG log entry so the
+  update can be rolled forward/backwards and can be recovered by async
+  recovery.
+* ``run_recovery_op`` is implemented at the ``PGBackend`` level and assumes the
+  PG already has stats and a PG log entry for the object, it is just
+  responsible for bringing other shards in the PG up to date.
+* CopyFrom ends up issuing read and write ops to the PGBackend, it doesn't
+  provide techniques for copying a snapshot and preserving its
+  space-efficiency.
+* ``run_recovery_op`` is meant to preserve space-efficiency of clones (not
+  implemented yet for EC pools and replica pools have bugs) – the ``PushOp``
+  message includes a way of describing which parts of an object should be
+  clones.
+
+For pool migration we probably want a hybrid implementation. We can probably
+re-use a lot of the ``run_recovery_op`` code to read the object that we want to
+migrate, and ideally handle the space-efficiency in snaps. Instead of issuing
+PushOps we probably want to issue a new ``COPY_PUT`` type op to the priamry PG
+of the target pool, but passing the same kind of information as a PushOp so we
+can keep track of what needs to be cloned. The target pool can then submit a
+mixture of write and clone ops to the PGBackend layer to create the object as
+well as updating the PG stats and creating a PG log entry.
+
+
+Splitting PGs
+~~~~~~~~~~~~~
+
+Normally a pool has a number of PGs that is a power of 2. This is because we
+want each PG to hold roughly the same number of objects, and we use the most
+significant N bits of the object hash to select which PG to use. However, when
+doubling the number of PGs that a pool has this causes approximately half the
+objects in the pool to need to be moved to a new PG. We don't want all this
+migration to happen at once; we want it to be paced over time to have less
+impact. To deal with this the MGR controls the increase in the number of PGs,
+it has a target for how many PGs the pool should have and slowly increases the
+number of PGs waiting for PGs to finish recovery before doing further splits.
+
+When a pool has a non-power of 2 number of PGs this means that not all PGs are
+the same size. For example, if there are 5 PGs then PGs 0 and 4 will be half
+the size of PGs 1 to 3 because the choice between PG 0 and 4 is based on one
+extra bit of the object hash. While this is not desirable as a long-term state
+it is fine during the splitting process.
+
+Relevance to pool migration:
+
+* Pool migration needs to migrate all the objects in all the PGs in the old
+  pool to the new pool. Just like splitting we don't want to overwhelm the
+  system while performing the migration.
+* Pool migration should therefore migrate one (or a small number) of PGs at
+  a time.
+* A process needs to monitor the progress of migrations, notice when PGs finish
+  migrating and start the next PG. This could either be in the MON (in which case
+  it would need to be event driven with OSDs telling the MON when a PG has
+  finished migrating - somewhat similar to how PG merges work) or it could be
+  implemented in the MGR (in which case the MGR can poll the state of the PGs and
+  then tell the MON via a CLI command to start the next PG migration).
+
+
+Direct I/O / Balanced Reads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The EC direct I/O feature is making changes to the client to decide which OSD
+to send client I/O requests to, it is building on top of the balanced reads
+flag for replica pools which tells the client to distribute read I/Os evenly
+across all the OSDs in a replica PG rather than sending them all to the
+primary.
+
+Relevance to pool migration:
+
+* It's changing code in the client at a similar place to where we want the
+  client to implement pool migration deciding which pool (and hence PG and OSD)
+  to send I/O to.
+* Direct I/O / balanced reads are permitted to be failed by the OSD that
+  receives the request with ``EAGAIN`` to deal with corner cases where the OSD
+  is unable to process the I/O. In this case the client retries the I/O but
+  sends it to the primary OSD. A similar retry mechanism is going to be required
+  when a client issue an I/O to the wrong pool because an object has been
+  recently migrated. When I/Os are retried, we need to worry about ordering as
+  this generates opportunities for I/Os to overtake or be reordered. See section
+  Read/Write ordering below.
+* Direct I/O is adding extra information to the pg_pool_t structure that is
+  part of the ``OSDMap`` that gets sent to every Ceph daemon and client by the
+  monitor. This extra information is being used to determine that direct I/O is
+  supported and to help work out where to route the I/O request. Pool migration
+  will similarly need to add details to ``pg_pool_t`` structure so that clients
+  are aware that a migration is happening.
+
+
+Read/Write Ordering
+-------------------
+
+Ceph has some fairly strict read / write ordering rules. Once a write has
+completed to the client any read must return the new data. Prior to the write
+completing a read is expected to return all old data or all new data (a mixture
+is not permitted). If writes A and B are issued concurrently one after another
+to the same object then write A is expected to be applied before write B –
+ordering of the writes is expected to be preserved through the client,
+messenger and the OSD. If write A and read B are issued concurrently then there
+is scope for read B to overtake write A. There is a flag ``RWORDERED`` that can
+be set that prevents this overtaking from happening.
+
+There are no ordering guarantees when reads or writes are issued to different
+objects - these objects are almost certainly stored on different OSDs and even
+if they are on the same OSD will be processed by different threads with
+different locks so can easily be reordered.
+
+There do not appear to be many uses of the ``RWORDERED`` flag, RBD and RGW do
+not use the flag, CephFS uses the flag in MDS ``RecoveryQueue`` (calls
+``filer.probe`` which is implemented in ``osdc/Filer.cc``) which I think is
+only used in some recovery scenarios.
+
+These rules make it tricky to implement the watermark in the client and use
+this to decide which pool to route I/O requests to without using something
+equivalent to a new epoch to advance the watermark. The problem is that if the
+watermark is advanced without quiescing I/O it is possible that this causes
+requests to be reordered.
+
+For example:
+
+* Write A issued to old pool.
+* Write B issued to old pool.
+* Write A fails with updated watermark and is retried to new pool.
+* Read B with ``RWORDERING`` issued to new pool.
+* Write B fails and needs to be retried to new pool.
+
+In this example read B has overtaken write B.
+
+Perhaps more concerning is that the rules would also be broken if instead of
+Read B we issued another write to B.
+
+The simplest way to prevent reordering violations is to not advance the
+watermark while there are outstanding writes (or reads with ``RWORDERING`` flag
+set) in flight. This isn't idea as it may result it quite a number of I/Os
+being failed for retry before the watermark can be updated.
+
+A more sophisticated implementation stalls issuing new writes to objects
+with a hash between the old and new watermark while there are other writes
+in flight to objects with a hash between the old and new watermark.
+
+
+Other Pool Migration Issues
+---------------------------
+
+Other topics that we need to think about for pool migration.
+
+
+Avoiding Name Collisions
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+For the first release we will require that the target pool is empty when the
+migration starts (by having a UI interface that only starts a migration while
+a new pool is being created). We can also protect against objects being written
+to the target pool during the migration by adding client code to reject
+attempts to initiate requests to the target pool (the client code itself is
+still permitted to redirect requests from the source pool to the target pool).
+Because we will require a minimum client version to use pool migration this
+will ensure that all clients include this extra policing. OSDs cannot
+themselves implement the policing so there is no protection against a rouge
+client – we probably should have migration halt rather that crash if a name
+collision is found.
+
+Post first release if there is a use case for merging pools then it is
+theoretically possible to deal with name collisions by additionally using the
+pool which the client is accessing the object from to uniquify the name. This
+would require extra information in the request from the client to the OSDs.
+
+
+Stopping Changes to the Number of PGs During Migration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+During a migration we don't really want to be changing the number of PGs in
+the source pool. There are three reasons why:
+
+#. We don't really want to be moving objects around in the source pool when we
+   are about to migrate them - we are probably better off getting on with the
+   migration than trying to fix any imbalance in the source pool.
+#. Splitting/merging PGs in the source pool makes it harder to schedule the
+   migration. Scheduling is done at two levels - we say how many source PGs are
+   migrating at a time and then control the rate of migration within a source PG.
+   If we split/merge the source pool this makes selecting which PGs to migrate
+   more difficult.
+#. If we block splits and merges and migrate the PGs in reverse order (starting
+   with the highest numbered PG in the pool) then we can reduce the number of PGs
+   in the source pool as PGs finish migrating. This helps keeps the overall number
+   of PGs more manageable.
+
+In contrast we don't really care so much about the target pool - we can easily
+cope with splits/merges while the migration is in progress. From a performance
+perspective we do however want to avoid migrating objects to the target pool
+and then having splits/merges occur that copy the objects a second time. That
+means that normally we would want to set the number of target pool PGs to be
+the same as the source pool at the start of the migrate.
+
+We might also want to default to disable the auto-scaler for the target pool
+during the migration as we don't want it seeing a nearly empty target pool
+with loads of PGs and thinking that it should reduce the number of PGs.
+
+
+CLI and UI
+~~~~~~~~~~
+
+Pool migration will need a new CLI to start the migration, there will also need
+to be a way of monitoring PGs that a migrating and the progress of the
+migration. The CLI to start a migration will need to be implemented by the MON
+(``OSDMonitor.cc`` already implements most of the pool CLI commands) because
+the migration will need to update the ``pg_pool_t`` structures in the ``OSDMap``
+to record details of the migration.
+
+The new map will then be distributed to clients and OSDs so that they know that
+the migration has started. PGs that have been scheduled to start migration will
+need to determine at the end of the peering process that they don't need to
+recovery or backfill and that they should attempt to schedule a migration (will
+need new PG states ``MIGRATION_WAIT`` and ``MIGRATING``).
+
+We will need to work with the dashboard team to add support for pool migration
+to the dashboard and to provide a REST API for starting a migration.
+
+We will want to block some CLIs while a pool migration is taking place:
+
+* We don't want to be able to split/merge PGs in the source pool while it is
+  being migrated (see above).
+* We don't want the target pool to become the source of another migration
+  (no chaining migrations).
+
+Some of these CLIs are issued by MGR, in this case we probably will need to
+change the MGR code to either cope with the failures and/or to detect that the
+pool is migrating and avoid issuing the CLIs. We probably will need both as
+although checking if the pool is migrating before issuing a CLI is probably
+more efficient, it is exposed to a race hazard where the migrate may start
+between the check and CLI being issued.
+
+We need to look at how the progress of things like backfill and recovery are
+reported in the UI (possibly by ``HealthCheck``?) and think about how to report
+the progress of a pool migration. We need to think what are the right units for
+reporting progress (e.g. number of objects out of total objects, number of PGs
+out of total PGs or just a percentage).
+
+
+Backwards Compatibility / Software Upgrade
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Pool migration requires code changes in the Ceph daemons (MON, OSD and possibly
+MGR) and to the Ceph clients that issue I/O. We can't allow a pool migration to
+happen while any of these are running old code because the old code won't
+understand that a pool migration is happening. Old clients won't have any way
+of directing I/O to the correct pool, PG and OSD and having OSDs forward all
+these requests to the correct OSD would be far too expensive.
+
+Ceph daemons and clients have a set of feature bits indicating what features
+they support and there are mechanisms for setting a minimum set of feature
+bits that are required by daemons and separately for clients. Once set this
+prevents down-level daemons and clients connecting to the cluster. There are
+also mechanisms to ensure that once a minimum level has been set that this
+cannot be reversed.
+
+Pool migration will need to define a new feature bit and use the existing
+mechanisms for setting minimum required levels for daemons and clients. The new
+pool migration CLIs will need to fail an attempt to start a migration unless
+the minimum levels have been set.
+
+
+End of Migration
+~~~~~~~~~~~~~~~~
+
+When a migration completes, we will have moved all objects from pool A to pool
+B, however clients (e.g. RBD, CephFs, RGW, ...) will still have pool A embedded
+in their own data structures. We don't want to force all the clients to update
+their data structures to point at the new pool, so instead we will retain stub
+information about pool A saying that it has been migrated and that all I/O
+should now be submitted to pool B.
+
+Retaining a stub ``pg_pool_t`` structure in the ``OSDMap`` is cheap - there
+won't be thousands of pools and there isn't that much data stored for the pool.
+We will want to ensure that the old pool has no PGs associated with it, we can
+do this by reducing the number of PGs it has to 0 and letting the same code
+that runs when PGs are merged clean up and delete the old PGs.
+
+We need to think about the consequences of this on the UI interface. While in
+the code we start with pool A and create and migrate objects to pool B, from
+the perspective of the UI we probably want to show this as a transformation of
+pool A and hide the existence of pool B from the user.
+
+An alternative implementation would just show the pool redirection in the UI,
+so users would see an RBD image used pool A but would then find that pool A has
+been migrated to pool B. This alternative implementation might be better if we
+plan to support merging of pools (migration to a non-empty target pool) in the
+future.
+
+
+Walkthrough of how Pool Migration Might Work
+--------------------------------------------
+
+
+Initiating the Pool Migration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+#. User creates a new pool, perhaps they use a new flag ``--migratefrom`` to
+   say they want to start a pool migration.
+#. Starting the migration as part of pool creation means we know the pool is
+   initially empty.
+#. Unless the user specifies a number of PGs we can ensure that the newly
+   created pool has the same number of PGs as the source pool. There is no
+   requirement that the number of PGs is the same, it just avoids having to
+   perform a migration and then perform a second copy of data as the number of
+   PGs is adjusted to cater for the eventual number of objects in the pool.
+#. The CLI command sets up the ``pg_pool_t`` structures in the ``OSDMap`` to
+   indicate that a pool migration is starting. We record that pool A is being
+   migrated to pool B, and record which PG(s) we are going to start migrating.
+   If we are going to migrate more than one PG at a time, we probably want to
+   specify a set of PGs (e.g. 0,1,2,3) that are being migrated. Any PG in the
+   set is migrating. Any PG not in the set that is higher than the lowest value
+   in the set is assumed to have completed migration, any PG not in the set
+   that is lower than the lowest value in the set is assumed to have not
+   started migration.
+#. We migrate PGs in reverse order - so for example if a pool has PGs 0-15
+   then we will start by migrating PG 15.
+#. MON publishes new ``OSDMap`` as a new epoch.
+
+
+Client
+~~~~~~
+
+#. Clients use the ``pg_pool_t`` structure in the ``OSDMap`` to work out a
+   migration is in progress.
+#. From the range of PGs being migrated they can work out which PGs have been
+   migrated, which have not started migrating and which are in the process of
+   migrating.
+
+   a. If an I/O is submitted to a PG that has been migrated the object hash and
+      new pool is used to determine which PG and OSD to route the I/O request
+      to.
+   b. If an I/O is submitted to a PG that has not started migration the object
+      hash and old pool is used to determine which PG and OSD to route the I/O
+      request to.
+   c. If an I/O is submitted to a PG that is marked as being migrated the client
+      checks if it has a cached watermark for this PG. If it does, then it uses
+      this to decide whether to route the request to the old or new pool. If it
+      has no cached watermark, it guesses and sends the I/O to the old pool.
+
+#. If an I/O is misrouted to the wrong pool the OSD will fail the request
+   providing an update to the watermark. The client needs to update its cached
+   copy of the watermark and resubmit the I/O.
+
+
+OSD
+~~~
+
+#. OSDs use the ``pg_pool_t`` structure in the ``OSDMap`` to work out if a PG
+   needs migrating.
+#. At the end of peering if the PG needs migrating and is not performing
+   backfill or recovery it sets the PG state to ``MIGRATION_WAIT`` and checks
+   with other OSDs whether they have the resources and free capacity to start
+   the migration.
+#. If everything is good the PG state changes to ``MIGRATING``, sets the
+   watermark to 0 and the scheduler is instructed to start scheduling migration
+   work.
+#. Migration starts by scanning the next range of objects to be migrated
+   creating a list of object OIDs.
+#. Each object is then migrated, with the watermark being updated after the
+   object has been migrated.
+
+   a. The primary reads the object and sends it to the primary of the target
+      PG which then writes the object.
+   b. If the object is large this is done in stages with the target using a
+      temporary object name which is renamed when the last data is written.
+   c. Once an object has been migrated it is deleted from the source pool.
+
+#. Client I/O checks the object hash of the client I/O with the watermark. If
+   the I/O is below the watermark it is failed for retry to the new pool,
+   providing the current watermark for the client to cache.
+#. If a PG completes a migration, then it sends a message to the MON telling
+   it that the migration has completed.
+
+
+MON
+~~~
+
+#. When MON gets a message from an OSD saying that a migration has completed
+   it updates the set in the ``pg_pool_t`` to record that the PG has finished
+   migration and that the next PG is starting migration. A new ``OSDMap`` is
+   published as a new epoch.
+#. Because migrations are scheduled in reverse order and objects are deleted
+   as the migration happens, this means that as PG migrations complete that we
+   should have empty PGs that can be deleted by simply reducing the number of
+   PGs that the source pool has. PG migrations might not complete in the order
+   which they are started so we might have a few empty PGs hanging around that
+   cannot be deleted until another PG migration completes.
+#. At the end of the migration there are no more PGs to start migrating, so
+   the set of migrating PGs diminishes. When the set becomes empty we should
+   have also reduced the number of PGs for the source pool to zero and at this
+   point the migration is complete. The MON can make final updates to the
+   ``pg_pool_t`` state to indicate the migration has finished. The
+   ``pg_pool_t`` structure needs to be kept so that clients know to direct all
+   I/O requests to this pool to the new pool instead.
+#. Pools can be migrated more than once, this can result in multiple stub
+   ``pg_pool_t`` structures being kept. We do not want to have to recurse
+   through these stubs when I/Os are submitted, so at the end of a migration
+   the MON should attempt to reduce these redirects to a single level.
+
+
+Testing and Test Tools
+----------------------
+
+The objectives of testing pool migration are:
+
+#. Validate that all the objects in the source pool are migrated to the target
+   pool and that their contents (data, attributes and OMAP) are retained.
+#. Validate that during a migration object can be read (data, attributes, OMAP)
+   for objects that haven't yet been migrated, objects that have been migrated
+   and objects in the middle of being migrated.
+#. Validate that during a migration objects can be updated (create, delete,
+   write, update attributes, update OMAPs) for objects that haven't yet been
+   migrated, objects that have been migrated and objects in the middle of being
+   migrated.
+#. Validating pool migration under error scenarios, including resetting and
+   failing OSDs.
+#. Validate that snapshots, clones are migrated and can be used during a pool
+   migration.
+#. Validate the UI for pool migration, including restrictions placed on the UI
+   during the migration.
+#. Validation of migrating multiple different pools in parallel. Validation of
+   migration a single pool multiple times in series.
+#. Validation of pool migration with unreadable objects (excessive medium
+   errors plus possibly other failures that defeat the redundancy of the
+   replica/EC pool without taking it offline).
+#. Validation of software upgrade / compatibility for both daemons (OSD, MON,
+   MGR) and clients.
+#. Validation of performance impact during a migration.
+
+Pool migration makes changes to client code, so all modified clients will need
+testing.
+
+Existing tools such as ``ceph_test_rados`` are good for creating and exercising
+a set of objects and performing some consistency checking of objects.
+
+A simple script is probably better for creating a large number of objects and
+then validating the contents of the objects. Writing a script is probably
+better for being able to test attributes and OMAPs as well. If the script has
+two phases (create objects and validate objects) then these phases can be run
+at different times (before, during, after pool migration) to test different
+aspects. The script could use a command line tool such as rados to create and
+validate objects, using pseudo random numbers to generate data patterns,
+attributes and OMAP data that could then be validated. The script would need to
+run many rados commands in parallel to generate a decent I/O workload. There
+may be scripts that already exist that can do this, it may be possible to adapt
+ceph_test_rados to do this.
+
+Tools such as ``VDBench`` can test data integrity of block volumes, either
+creating a data set and then validating it, or can continuously create and
+update data keeping a journal so it can be validated at any point. However
+block volume tools can only test object data, not attributes or OMAPs.
+
+A tool such as ``FIO`` is best suited for doing performance measurements.
+
+The I/O sequence tool ``ceph_test_rados_io_sequence`` is probably not useful
+for testing pool migration - it specializes it testing a very small number of
+objects and focuses on boundary conditions within an object (e.g. EC chunk
+size, strip size) and data integrity.
+
+The objective should be to use teuthology to perform most of the testing for
+pool migration (at a minimum 1 to 5 in the list above). It should be possible
+to add pool migration as an option to existing tests in the RADOS suite,
+extending the ``thrashOSD`` class to include the option of starting a
+migration.
author	Connor Fawcett <connorfa@uk.ibm.com>
	Mon, 29 Sep 2025 01:21:16 +0000 (02:21 +0100)
committer	Connor Fawcett <connorfa@uk.ibm.com>
	Mon, 6 Oct 2025 12:52:23 +0000 (13:52 +0100)