From 37d167af770213ca92c571e2e96d07434f21827c Mon Sep 17 00:00:00 2001 From: Zac Dover Date: Sun, 12 Mar 2023 11:17:03 +1000 Subject: [PATCH] doc/rados: edit operations/bs-migration (2 of x) Disambiguate and improve the English language in doc/rados/operations/bluestore-migration.rst up to but not including the section called "Whole Host Replacement". Co-authored-by: Cole Mitchell Signed-off-by: Zac Dover (cherry picked from commit ca803a24c64059023733e21d755edb9c6c973ecf) --- doc/rados/operations/bluestore-migration.rst | 252 ++++++++++--------- 1 file changed, 129 insertions(+), 123 deletions(-) diff --git a/doc/rados/operations/bluestore-migration.rst b/doc/rados/operations/bluestore-migration.rst index 0a6c2c585e0ec..3b67b2e9da702 100644 --- a/doc/rados/operations/bluestore-migration.rst +++ b/doc/rados/operations/bluestore-migration.rst @@ -2,68 +2,67 @@ BlueStore Migration ===================== -Each OSD can run either BlueStore or Filestore, and a single Ceph -cluster can contain a mix of both. Users who have previously deployed -Filestore OSDs should transition to BlueStore in order to -take advantage of the improved performance and robustness. Moreover, -Ceph releases beginning with Reef do not support Filestore. There are -several strategies for making such a transition. - -An individual OSD cannot be converted in place; -BlueStore and Filestore are simply too different for that to be -feasible. The conversion process uses either the cluster's normal -replication and healing support or tools and strategies that copy OSD -content from an old (Filestore) device to a new (BlueStore) one. - - -Deploy new OSDs with BlueStore -============================== - -New OSDs (e.g., when the cluster is expanded) should be deployed -using BlueStore. This is the default behavior so no specific change -is needed. - -Similarly, any OSDs that are reprovisioned after replacing a failed drive -should use BlueStore. - -Convert existing OSDs -===================== - -Mark out and replace --------------------- - -The simplest approach is to ensure that the cluster is healthy, -then mark ``out`` each device in turn, wait for -data to replicate across the cluster, reprovision the OSD, and mark -it back ``in`` again. Proceed to the next OSD when recovery is complete. -This is easy to automate but results in more data migration than -is strictly necessary, which in turn presents additional wear to SSDs and takes -longer to complete. +Each OSD must be formatted as either a Filestore OSD or a BlueStore OSD. +However, an individual Ceph cluster can operate with a mixture of both +Filestore OSDs and BlueStore OSDs. Because BlueStore is superior to Filestore +in performance and robustness, and because Filestore is not supported by Ceph +releases beginning with Reef, users deploying Filestore OSDs should transition +to BlueStore. There are several strategies for making the transition to +BlueStore. + +BlueStore is so different from Filestore that an individual OSD cannot +be converted in place. Instead, the conversion process must use either +(1) the cluster's normal replication and healing support, or (2) tools +and strategies that copy OSD content from an old (Filestore) device to +a new (BlueStore) one. + +Deploying new OSDs with BlueStore +================================= + +Use BlueStore when deploying new OSDs (for example, when the cluster is +expanded). Because this is the default behavior, no specific change is +needed. 
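+
+As a concrete sketch (``$DEVICE`` here is a placeholder for the data device,
+for example ``/dev/sdX``), a new OSD can be provisioned explicitly as a
+BlueStore OSD with ``ceph-volume``, using the same command that appears in
+the migration steps below:
+
+.. prompt:: bash $
+
+   ceph-volume lvm create --bluestore --data $DEVICE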
+
+Similarly, use BlueStore for any OSDs that have been reprovisioned after
+a failed drive was replaced.
+
+Converting existing OSDs
+========================
+
+Mark-``out`` replacement
+------------------------
+
+The simplest approach is to verify that the cluster is healthy and
+then follow these steps for each Filestore OSD in succession: mark the OSD
+``out``, wait for the data to replicate across the cluster, reprovision the OSD,
+mark the OSD back ``in``, and wait for recovery to complete before proceeding
+to the next OSD. This approach is easy to automate, but it entails unnecessary
+data migration that carries costs in time and SSD wear.

 #. Identify a Filestore OSD to replace::

       ID=<osd-id-number>
       DEVICE=<disk-device>

-   You can tell whether a given OSD is Filestore or BlueStore with:
+   #. Determine whether a given OSD is Filestore or BlueStore:

-   .. prompt:: bash $
+      .. prompt:: bash $

-      ceph osd metadata $ID | grep osd_objectstore
+         ceph osd metadata $ID | grep osd_objectstore

-   You can get a current count of Filestore and BlueStore OSDs with:
+   #. Get a current count of Filestore and BlueStore OSDs:

-   .. prompt:: bash $
+      .. prompt:: bash $

-      ceph osd count-metadata osd_objectstore
+         ceph osd count-metadata osd_objectstore

-#. Mark the Filestore OSD ``out``:
+#. Mark a Filestore OSD ``out``:

    .. prompt:: bash $

       ceph osd out $ID

-#. Wait for the data to migrate off the OSD in question:
+#. Wait for the data to migrate off this OSD:

    .. prompt:: bash $

@@ -75,7 +74,7 @@ longer to complete.

       systemctl kill ceph-osd@$ID

-#. Note which device this OSD is using:
+#. Note which device the OSD is using:

    .. prompt:: bash $

@@ -87,25 +86,27 @@ longer to complete.

       umount /var/lib/ceph/osd/ceph-$ID

-#. Destroy the OSD data. Be *EXTREMELY CAREFUL* as this will destroy
-   the contents of the device; be certain the data on the device is
-   not needed (i.e., that the cluster is healthy) before proceeding:
+#. Destroy the OSD's data. Be *EXTREMELY CAREFUL*! This command will destroy
+   the contents of the device; you must be certain that the data on the device is
+   not needed (in other words, that the cluster is healthy) before proceeding:

    .. prompt:: bash $

       ceph-volume lvm zap $DEVICE

-#. Tell the cluster the OSD has been destroyed (and a new OSD can be
-   reprovisioned with the same ID):
+#. Tell the cluster that the OSD has been destroyed (and that a new OSD can be
+   reprovisioned with the same OSD ID):

    .. prompt:: bash $

       ceph osd destroy $ID --yes-i-really-mean-it

-#. Provision a BlueStore OSD in its place with the same OSD ID.
-   This requires you do identify which device to wipe based on what you saw
-   mounted above. BE CAREFUL! Also note that hybrid OSDs may require
-   adjustments to these commands:
+#. Provision a BlueStore OSD in place by using the same OSD ID. This requires
+   you to identify which device to wipe, and to make certain that you target
+   the correct and intended device, using the information that was retrieved
+   when we directed you to "[N]ote which device the OSD is using". BE CAREFUL!
+   Note that you may need to modify these commands when dealing with hybrid
+   OSDs:

    .. prompt:: bash $

@@ -113,15 +114,15 @@ longer to complete.

 #. Repeat.

-You can allow balancing of the replacement OSD to happen
-concurrently with the draining of the next OSD, or follow the same
-procedure for multiple OSDs in parallel, as long as you ensure the
-cluster is fully clean (all data has all replicas) before destroying
-any OSDs. If you reprovision multiple OSDs in parallel, be **very** careful to
-only zap / destroy OSDs within a single CRUSH failure domain, e.g. ``host`` or
-``rack``. Failure to do so will reduce the redundancy and availability of
-your data and increase the risk of (or even cause) data loss.
-
+You may opt to (1) have the balancing of the replacement BlueStore OSD take
+place concurrently with the draining of the next Filestore OSD, or instead
+(2) follow the same procedure for multiple OSDs in parallel. In either case,
+however, you must ensure that the cluster is fully clean (in other words, that
+all data has all replicas) before destroying any OSDs. If you opt to reprovision
+multiple OSDs in parallel, be **very** careful to destroy OSDs only within a
+single CRUSH failure domain (for example, ``host`` or ``rack``). Failure to
+satisfy this requirement will reduce the redundancy and availability of your
+data and increase the risk of data loss (or even guarantee data loss).
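+
+As a sketch of one way to verify this requirement before each destroy step
+(assuming the ``ceph osd safe-to-destroy`` check is available in your
+release), poll the cluster until it reports that the OSD can be destroyed
+without reducing data safety:
+
+.. prompt:: bash $
+
+   while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done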

 Advantages:

@@ -131,29 +132,29 @@ Advantages:

 Disadvantages:

-* Data is copied over the network twice: once to some other OSD in the
-  cluster (to maintain the desired number of replicas), and then again
+* Data is copied over the network twice: once to another OSD in the
+  cluster (to maintain the specified number of replicas), and again
   back to the reprovisioned BlueStore OSD.

-
 Whole host replacement
 ----------------------

-If you have a spare host in the cluster, or have sufficient free space
-to evacuate an entire host in order to use it as a spare, then the
-conversion can be done on a host-by-host basis with each stored copy of
-the data migrating only once.
+If you have a spare host in the cluster, or sufficient free space to evacuate
+an entire host for use as a spare, then the conversion can be done on a
+host-by-host basis so that each stored copy of the data is migrated only once.

-First, you need an empty host that has no OSDs provisioned. There are two
-ways to do this: either by starting with a new, empty host that isn't yet
-part of the cluster, or by offloading data from an existing host in the cluster.
+To use this approach, you need an empty host that has no OSDs provisioned.
+There are two ways to do this: either by using a new, empty host that is not
+yet part of the cluster, or by offloading data from an existing host that is
+already part of the cluster.

 Use a new, empty host
 ^^^^^^^^^^^^^^^^^^^^^

-Ideally the host should have roughly the
-same capacity as other hosts you will be converting.
-Add the host to the CRUSH hierarchy, but do not attach it to the root:
+Ideally the host will have roughly the same capacity as each of the other hosts
+you will be converting. Add the host to the CRUSH hierarchy, but do not attach
+it to the root:
+

 .. prompt:: bash $

@@ -166,19 +167,18 @@ Use an existing host
 ^^^^^^^^^^^^^^^^^^^^

 If you would like to use an existing host
-that is already part of the cluster, and there is sufficient free
+that is already part of the cluster, and if there is sufficient free
 space on that host so that all of its data can be migrated off to
-other cluster hosts, you can instead do::
+other cluster hosts, you can do the following (instead of using a new, empty host):

+.. prompt:: bash $

-.. prompt:: bash $
-
    OLDHOST=<existing-host-to-offload>
    ceph osd crush unlink $OLDHOST default

 where "default" is the immediate ancestor in the CRUSH map. (For
 smaller clusters with unmodified configurations this will normally
-be "default", but it might also be a rack name.) You should now
+be "default", but it might instead be a rack name.) You should now
 see the host at the top of the OSD tree output with no parent:

 .. prompt:: bash $

@@ -199,15 +199,18 @@ see the host at the top of the OSD tree output with no parent:

    2 ssd 1.00000          osd.2   up  1.00000 1.00000
 ...

-If everything looks good, jump directly to the "Wait for data
-migration to complete" step below and proceed from there to clean up
-the old OSDs.
+If everything looks good, jump directly to the :ref:`Wait for data migration to
+complete <bluestore_data_migration_step>` step below and proceed from there to
+clean up the old OSDs.
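+
+While the old host drains, one convenient way to watch progress (a sketch;
+any equivalent monitoring works) is to check per-OSD utilization and PG
+counts, which should fall toward zero for the OSDs on ``$OLDHOST``:
+
+.. prompt:: bash $
+
+   ceph osd df tree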

 Migration process
 ^^^^^^^^^^^^^^^^^

-If you're using a new host, start at step #1. For an existing host,
-jump to step #5 below.
+If you're using a new host, start at :ref:`the first step
+<bluestore_migration_process_first_step>`. If you're using an existing host,
+jump to :ref:`this step <bluestore_data_migration_step>`.
+
+.. _bluestore_migration_process_first_step:

 #. Provision new BlueStore OSDs for all devices:

@@ -215,14 +218,14 @@ jump to step #5 below.

       ceph-volume lvm create --bluestore --data /dev/$DEVICE

-#. Verify OSDs join the cluster with:
+#. Verify that the new OSDs have joined the cluster:

    .. prompt:: bash $

       ceph osd tree

    You should see the new host ``$NEWHOST`` with all of the OSDs beneath
-   it, but the host should *not* be nested beneath any other node in
+   it, but the host should *not* be nested beneath any other node in
    the hierarchy (like ``root default``). For example, if ``newhost`` is
    the empty host, you might see something like::

@@ -251,13 +254,16 @@ jump to step #5 below.

       ceph osd crush swap-bucket $NEWHOST $OLDHOST

-   At this point all data on ``$OLDHOST`` will start migrating to OSDs
-   on ``$NEWHOST``. If there is a difference in the total capacity of
-   the old and new hosts you may also see some data migrate to or from
-   other nodes in the cluster, but as long as the hosts are similarly
-   sized this will be a relatively small amount of data.
+   At this point all data on ``$OLDHOST`` will begin migrating to the OSDs on
+   ``$NEWHOST``. If there is a difference between the total capacity of the
+   old host and the total capacity of the new host, you may also see some
+   data migrate to or from other nodes in the cluster. Provided that the hosts
+   are similarly sized, however, this will be a relatively small amount of
+   data.
+
+.. _bluestore_data_migration_step:

-#. Wait for data migration to complete:
+#. Wait for the data migration to complete:

    .. prompt:: bash $

@@ -295,53 +301,53 @@ jump to step #5 below.

 Advantages:

 * Data is copied over the network only once.
-* Converts an entire host's OSDs at once.
-* Can parallelize to converting multiple hosts at a time.
-* No spare devices are required on each host.
+* An entire host's OSDs are converted at once.
+* Can be parallelized, making it possible to convert multiple hosts at the same time.
+* No host involved in this process needs to have a spare device.

 Disadvantages:

 * A spare host is required.
-* An entire host's worth of OSDs will be migrating data at a time. This
+* An entire host's worth of OSDs will be migrating data at a time. This
   is likely to impact overall cluster performance.
 * All migrated data still makes one full hop over the network.

-
 Per-OSD device copy
 -------------------

 A single logical OSD can be converted by using the ``copy`` function
-of ``ceph-objectstore-tool``. This requires that the host have a free
-device (or devices) to provision a new, empty BlueStore OSD. For
-example, if each host in your cluster has twelve OSDs, then you'd need a
-thirteenth unused device so that each OSD can be converted in turn before the
-old device is reclaimed to convert the next OSD.
+included in ``ceph-objectstore-tool``. This requires that the host have one or more free
+devices to provision a new, empty BlueStore OSD. For
+example, if each host in your cluster has twelve OSDs, then you need a
+thirteenth unused device so that each OSD can be converted before the
+previous device is reclaimed to convert the next OSD.
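+
+As a rough sketch only: the ``copy`` function mentioned above appears, in
+recent releases, as the tool's ``dup`` operation (an assumption; verify
+against the contributed script linked below). With the source OSD stopped,
+and with ``/var/lib/ceph/osd/ceph-$ID.new`` standing in as a hypothetical
+mount point for the new, empty BlueStore OSD, the core copy step looks like
+this:
+
+.. prompt:: bash $
+
+   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$ID \
+     --target-data-path /var/lib/ceph/osd/ceph-$ID.new --op dup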

 Caveats:

-* This strategy requires that an empty BlueStore OSD be prepared
-  without allocating a new OSD ID, something that the ``ceph-volume``
-  tool doesn't support. More importantly, the setup of *dmcrypt* is
-  closely tied to the OSD identity, which means that this approach
-  does not work with encrypted OSDs.
+* This approach requires that we prepare an empty BlueStore OSD but that we do not allocate
+  a new OSD ID to it. The ``ceph-volume`` tool does not support such an operation. **IMPORTANT:**
+  because the setup of *dmcrypt* is closely tied to the identity of the OSD, this approach does not
+  work with encrypted OSDs.

 * The device must be manually partitioned.

-* Tooling not implemented!
-
-* Not documented!
+* An unsupported user-contributed script that demonstrates this process may be found here:
+  https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash

 Advantages:

-* Little or no data migrates over the network during the conversion.
+* Provided that the ``noout`` or the ``norecover``/``norebalance`` flags are set on the OSD or the
+  cluster while the conversion process is underway, little or no data migrates over the
+  network during the conversion.

 Disadvantages:

-* Tooling not fully implemented.
-* Process not documented.
-* Each host must have a spare or empty device.
-* The OSD is offline during the conversion, which means new writes will
-  be written to only a subset of the OSDs. This increases the risk of data
-  loss due to a subsequent failure. (However, if there is a failure before
-  conversion is complete, the original FileStore OSD can be started to provide
-  access to its original data.)
+* Tooling is not fully implemented, supported, or documented.
+
+* Each host must have an appropriate spare or empty device for staging.
+
+* The OSD is offline during the conversion, which means new writes to PGs
+  with the OSD in their acting set may not be ideally redundant until the
+  subject OSD comes up and recovers. This increases the risk of data
+  loss due to an overlapping failure. However, if another OSD fails before
+  conversion and startup have completed, the original Filestore OSD can be
+  started to provide access to its original data.

-- 
2.39.5