From 32375cb789d7e7098cdb87a84df3b442f2474c30 Mon Sep 17 00:00:00 2001
From: Anthony D'Atri
Date: Wed, 7 Oct 2020 15:21:28 -0700
Subject: [PATCH] doc: misc clarity and capitalization

Signed-off-by: Anthony D'Atri
---
 SubmittingPatches.rst | 9 ++-
 doc/radosgw/dynamicresharding.rst | 10 +--
 doc/radosgw/layout.rst | 20 ++---
 doc/rbd/rbd-live-migration.rst | 6 +-
 doc/rbd/rbd-mirroring.rst | 51 +++++++-----
 doc/rbd/rbd-openstack.rst | 45 +++++------
 doc/rbd/rbd-persistent-cache.rst | 32 ++++----
 doc/rbd/rbd-replay.rst | 2 +-
 doc/rbd/rbd-snapshot.rst | 56 ++++++-------
 doc/start/documenting-ceph.rst | 8 +-
 doc/start/hardware-recommendations.rst | 105 ++++++++++++++-----------
 doc/start/os-recommendations.rst | 19 ++---
 12 files changed, 202 insertions(+), 161 deletions(-)

diff --git a/SubmittingPatches.rst b/SubmittingPatches.rst
index d74529cd3ba07..7ac948e17faa8 100644
--- a/SubmittingPatches.rst
+++ b/SubmittingPatches.rst
@@ -66,7 +66,7 @@ then you just add a line saying ::

 Signed-off-by: Random J Developer

-using your real name (sorry, no pseudonyms or anonymous contributions.)
+using your real name (sorry, no pseudonyms or anonymous contributions).

 Git can sign off on your behalf
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -201,12 +201,17 @@ PR title

 If your PR has only one commit, the PR title can be the same as the commit title
 (and GitHub will suggest this). If the PR has multiple commits, do not accept
-the title GitHub suggest. Either use the title of the most relevant commit, or
+the title GitHub suggests. Either use the title of the most relevant commit, or
 write your own title. In the latter case, use the same "subsystem: short
 description" convention described in `Commit title`_ for the PR title, with the
 following difference: the PR title describes the entire set of changes, while
 the `Commit title`_ describes only the changes in a particular commit.

+If GitHub suggests a PR title based on a very long commit message it will split
+the result with an ellipsis (...)
and fold the remainder into the PR description.
+In such a case, please edit the title to be more concise and the description to
+remove the ellipsis.
+
 Keep in mind that the PR titles feed directly into the script that generates
 release notes and it is tedious to clean up non-conformant PR titles at release
 time. This document places no limit on the length of PR titles, but be aware
diff --git a/doc/radosgw/dynamicresharding.rst b/doc/radosgw/dynamicresharding.rst
index efced98a9066a..a7698e7bbc932 100644
--- a/doc/radosgw/dynamicresharding.rst
+++ b/doc/radosgw/dynamicresharding.rst
@@ -14,17 +14,17 @@ online bucket resharding.

 Each bucket index shard can handle its entries efficiently up until
 reaching a certain threshold number of entries. If this threshold is
-exceeded the system can encounter performance issues. The dynamic
+exceeded the system can suffer from performance issues. The dynamic
 resharding feature detects this situation and automatically increases
-the number of shards used by the bucket index, resulting in the
+the number of shards used by the bucket index, resulting in a
 reduction of the number of entries in each bucket index shard. This
 process is transparent to the user.

 By default dynamic bucket index resharding can only increase the
-number of bucket index shards to 1999, although the upper-bound is a
-configuration parameter (see Configuration below). Furthermore, when
+number of bucket index shards to 1999, although this upper-bound is a
+configuration parameter (see Configuration below). When
 possible, the process chooses a prime number of bucket index shards to
-help spread the number of bucket index entries across the bucket index
+spread the number of bucket index entries across the bucket index
 shards more evenly.
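The shard-count selection described in the hunk above (round the required number of shards up to a prime, capped at 1999) can be sketched as follows. This is an illustrative Python model, not RGW's actual C++ implementation; the per-shard threshold default shown here mirrors the commonly cited ``rgw_max_objs_per_shard`` value and is an assumption for the example:

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test; fine for small shard counts."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def choose_shard_count(num_objects: int,
                       objs_per_shard: int = 100_000,
                       max_shards: int = 1999) -> int:
    """Round the required shard count up to the next prime, capped at
    max_shards (1999 by default, per the documentation above)."""
    needed = -(-num_objects // objs_per_shard)  # ceiling division
    while needed < max_shards and not is_prime(needed):
        needed += 1
    return min(needed, max_shards)
```

For example, a bucket with one million objects needs 10 shards at the assumed threshold, which rounds up to the prime 11; note that the 1999 cap is itself prime.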
The detection process runs in a background process that periodically diff --git a/doc/radosgw/layout.rst b/doc/radosgw/layout.rst index a48a995dde3a5..6ee0bc2a8ed6c 100644 --- a/doc/radosgw/layout.rst +++ b/doc/radosgw/layout.rst @@ -8,8 +8,8 @@ new developers to get up to speed with the implementation details. Introduction ------------ -Swift offers something called a container, that we use interchangeably with -the term bucket. One may say that RGW's buckets implement Swift containers. +Swift offers something called a *container*, which we use interchangeably with +the term *bucket*, so we say that RGW's buckets implement Swift containers. This document does not consider how RGW operates on these structures, e.g. the use of encode() and decode() methods for serialization and so on. @@ -42,18 +42,18 @@ Some variables have been used in above commands, they are: - bucket: Holds a mapping between bucket name and bucket instance id - bucket.instance: Holds bucket instance information[2] -Every metadata entry is kept on a single rados object. See below for implementation details. +Every metadata entry is kept on a single RADOS object. See below for implementation details. Note that the metadata is not indexed. When listing a metadata section we do a -rados pgls operation on the containing pool. +RADOS ``pgls`` operation on the containing pool. Bucket Index ^^^^^^^^^^^^ It's a different kind of metadata, and kept separately. The bucket index holds -a key-value map in rados objects. By default it is a single rados object per -bucket, but it is possible since Hammer to shard that map over multiple rados -objects. The map itself is kept in omap, associated with each rados object. +a key-value map in RADOS objects. By default it is a single RADOS object per +bucket, but it is possible since Hammer to shard that map over multiple RADOS +objects. The map itself is kept in omap, associated with each RADOS object. 
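The sharded bucket index described above relies on every client agreeing on which shard's omap holds a given object name. A minimal sketch of that idea, using CRC32 purely as a stand-in for RGW's internal string hash (which is different):

```python
import zlib


def shard_for_key(object_name: str, num_shards: int) -> int:
    """Map an object name to a bucket index shard with a stable hash.
    RGW uses its own hash function; crc32 here only illustrates the
    principle of deterministic key-to-shard placement."""
    return zlib.crc32(object_name.encode("utf-8")) % num_shards


# Every client computing the same hash agrees on which shard's omap
# holds a given key, so bucket listings can merge per-shard results.
```

The mapping is deterministic, so reads, writes, and listings all locate the same shard without coordination.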
The key of each omap is the name of the objects, and the value holds some basic
 metadata of that object -- metadata that shows up when listing the bucket.
 Also, each omap holds a header, and we keep some bucket accounting metadata
@@ -66,7 +66,7 @@ objects there is more information that we keep on other keys.
 Data
 ^^^^

-Objects data is kept in one or more rados objects for each rgw object.
+Object data is kept in one or more RADOS objects for each RGW object.

 Object Lookup Path
 ------------------
@@ -96,7 +96,7 @@ causes no ambiguity. For the same reason, slashes are permitted in object
 names (keys).

 It is also possible to create multiple data pools and make it so that
-different users buckets will be created in different rados pools by default,
+different users' buckets will be created in different RADOS pools by default,
 thus providing the necessary scaling. The layout and naming of these pools
 is controlled by a 'policy' setting.[3]
@@ -187,7 +187,7 @@ Known pools:

 namespace: users.keys
 47UA98JSTJZ9YAN3OS3O

- This allows radosgw to look up users by their access keys during authentication.
+ This allows ``radosgw`` to look up users by their access keys during authentication.

 namespace: users.swift
 test:tester
diff --git a/doc/rbd/rbd-live-migration.rst b/doc/rbd/rbd-live-migration.rst
index 813deaae3fb57..10a80280ff374 100644
--- a/doc/rbd/rbd-live-migration.rst
+++ b/doc/rbd/rbd-live-migration.rst
@@ -7,16 +7,16 @@
 RBD images can be live-migrated between different pools within the same cluster
 or between different image formats and layouts. When started, the source image
 will be deep-copied to the destination image, pulling all snapshot history and
-optionally keeping any link to the source image's parent to help preserve
+optionally preserving any link to the source image's parent to preserve
 sparseness. This copy process can safely run in the background while the new
 target image is
-in-use.
There is currently a requirement to temporarily stop using the source +in use. There is currently a requirement to temporarily stop using the source image before preparing a migration. This helps to ensure that the client using the image is updated to point to the new target image. .. note:: - Image live-migration requires the Ceph Nautilus release or later. The krbd + Image live-migration requires the Ceph Nautilus release or later. The ``krbd`` kernel module does not support live-migration at this time. diff --git a/doc/rbd/rbd-mirroring.rst b/doc/rbd/rbd-mirroring.rst index 8315a89132f25..300cb98d4c3db 100644 --- a/doc/rbd/rbd-mirroring.rst +++ b/doc/rbd/rbd-mirroring.rst @@ -13,14 +13,14 @@ capability is available in two modes: actual image. The remote cluster will read from this associated journal and replay the updates to its local copy of the image. Since each write to the RBD image will result in two writes to the Ceph cluster, expect write - latencies to nearly double when using the RBD journaling image feature. + latencies to nearly double while using the RBD journaling image feature. * **Snapshot-based**: This mode uses periodically scheduled or manually created RBD image mirror-snapshots to replicate crash-consistent RBD images between clusters. The remote cluster will determine any data or metadata updates between two mirror-snapshots and copy the deltas to its local copy of - the image. With the help of the RBD fast-diff image feature, updated data - blocks can be quickly computed without the need to scan the full RBD image. + the image. With the help of the RBD ``fast-diff`` image feature, updated data + blocks can be quickly determined without the need to scan the full RBD image. Since this mode is not as fine-grained as journaling, the complete delta between two snapshots will need to be synced prior to use during a failover scenario. 
Any partially applied set of deltas will be rolled back at moment @@ -30,10 +30,10 @@ capability is available in two modes: snapshot-based mirroring requires the Ceph Octopus release or later. Mirroring is configured on a per-pool basis within peer clusters and can be -configured on a specific subset of images within the pool or configured to -automatically mirror all images within a pool when using journal-based -mirroring only. Mirroring is configured using the ``rbd`` command. The -``rbd-mirror`` daemon is responsible for pulling image updates from the remote, +configured on a specific subset of images within the pool. You can also mirror +all images within a given pool when using journal-based +mirroring. Mirroring is configured using the ``rbd`` command. The +``rbd-mirror`` daemon is responsible for pulling image updates from the remote peer cluster and applying them to the image within the local cluster. Depending on the desired needs for replication, RBD mirroring can be configured @@ -57,30 +57,35 @@ Pool Configuration The following procedures demonstrate how to perform the basic administrative tasks to configure mirroring using the ``rbd`` command. Mirroring is -configured on a per-pool basis within the Ceph clusters. +configured on a per-pool basis. -The pool configuration steps should be performed on both peer clusters. These -procedures assume two clusters, named "site-a" and "site-b", are accessible from -a single host for clarity. +These pool configuration steps should be performed on both peer clusters. These +procedures assume that both clusters, named "site-a" and "site-b", are accessible +from a single host for clarity. See the `rbd`_ manpage for additional details of how to connect to different Ceph clusters. .. note:: The cluster name in the following examples corresponds to a Ceph configuration file of the same name (e.g. /etc/ceph/site-b.conf). See the - `ceph-conf`_ documentation for how to configure multiple clusters. 
+ `ceph-conf`_ documentation for how to configure multiple clusters. Note
+ that ``rbd-mirror`` does **not** require the source and destination clusters
+ to have unique internal names; both can and should call themselves ``ceph``.
+ The config `files` that ``rbd-mirror`` needs for local and remote clusters
+ can be named arbitrarily, and containerizing the daemon is one strategy
+ for maintaining them outside of ``/etc/ceph`` to avoid confusion.

Enable Mirroring
----------------

-To enable mirroring on a pool with ``rbd``, specify the ``mirror pool enable``
-command, the pool name, and the mirroring mode::
+To enable mirroring on a pool with ``rbd``, issue the ``mirror pool enable``
+subcommand with the pool name and the mirroring mode::

 rbd mirror pool enable {pool-name} {mode}

 The mirroring mode can either be ``image`` or ``pool``:

-* **image**: When configured in ``image`` mode, mirroring needs to be
+* **image**: When configured in ``image`` mode, mirroring must be
 `explicitly enabled`_ on each image.
 * **pool** (default): When configured in ``pool`` mode, all images in the pool
 with the journaling feature enabled are mirrored.
@@ -111,13 +116,13 @@ Bootstrap Peers
---------------

 In order for the ``rbd-mirror`` daemon to discover its peer cluster, the peer
-needs to be registered to the pool and a user account needs to be created.
+must be registered and a user account must be created.
 This process can be automated with ``rbd`` and the
 ``mirror pool peer bootstrap create`` and ``mirror pool peer bootstrap import``
 commands.
-To manually create a new bootstrap token with ``rbd``, specify the
-``mirror pool peer bootstrap create`` command, a pool name, along with an
+To manually create a new bootstrap token with ``rbd``, issue the
+``mirror pool peer bootstrap create`` subcommand, a pool name, and an
 optional friendly site name to describe the local cluster::

 rbd mirror pool peer bootstrap create [--site-name {local-site-name}] {pool-name}
@@ -289,6 +294,16 @@ For example::
 .. tip:: You can enable journaling on all new images by default by adding
 ``rbd default features = 125`` to your Ceph configuration file.

+.. tip:: ``rbd-mirror`` tunables are set by default to values suitable for
+ mirroring an entire pool. When using ``rbd-mirror`` to migrate single
+ volumes between clusters you may achieve substantial performance gains
+ by setting ``rbd_mirror_journal_max_fetch_bytes=33554432`` and
+ ``rbd_journal_max_payload_bytes=8388608`` within the ``[client]`` config
+ section of the local or centralized configuration. Note that these
+ settings may allow ``rbd-mirror`` to present a substantial write workload
+ to the destination cluster: monitor cluster performance closely during
+ migrations and test carefully before running multiple migrations in parallel.
+
 Create Image Mirror-Snapshots
 -----------------------------
diff --git a/doc/rbd/rbd-openstack.rst b/doc/rbd/rbd-openstack.rst
index 9d02b51460089..3f1b85f30f6d5 100644
--- a/doc/rbd/rbd-openstack.rst
+++ b/doc/rbd/rbd-openstack.rst
@@ -4,10 +4,10 @@

 .. index:: Ceph Block Device; OpenStack

-You may use Ceph Block Device images with OpenStack through ``libvirt``, which
-configures the QEMU interface to ``librbd``. Ceph stripes block device images as
-objects across the cluster, which means that large Ceph Block Device images have
-better performance than a standalone server!
+You can attach Ceph Block Device images to OpenStack instances through ``libvirt``,
+which configures the QEMU interface to ``librbd``.
Ceph stripes block volumes +across multiple OSDs within the cluster, which means that large volumes can +realize better performance than local drives on a standalone server! To use Ceph Block Devices with OpenStack, you must install QEMU, ``libvirt``, and OpenStack first. We recommend using a separate physical node for your @@ -56,13 +56,13 @@ Three parts of OpenStack integrate with Ceph's block devices: every virtual machine inside Ceph directly without using Cinder, which is advantageous because it allows you to perform maintenance operations easily with the live-migration process. Additionally, if your hypervisor dies it is - also convenient to trigger ``nova evacuate`` and run the virtual machine + also convenient to trigger ``nova evacuate`` and reinstate the virtual machine elsewhere almost seamlessly. In doing so, :ref:`exclusive locks ` prevent multiple compute nodes from concurrently accessing the guest disk. -You can use OpenStack Glance to store images in a Ceph Block Device, and you +You can use OpenStack Glance to store images as Ceph Block Devices, and you can use Cinder to boot a VM using a copy-on-write clone of an image. The instructions below detail the setup for Glance, Cinder and Nova, although @@ -78,9 +78,9 @@ while running VMs using a local disk, or vice versa. Create a Pool ============= -By default, Ceph block devices use the ``rbd`` pool. You may use any available -pool. We recommend creating a pool for Cinder and a pool for Glance. Ensure -your Ceph cluster is running, then create the pools. :: +By default, Ceph block devices live within the ``rbd`` pool. You may use any +suitable pool by specifying it explicitly. We recommend creating a pool for +Cinder and a pool for Glance. Ensure your Ceph cluster is running, then create the pools. :: ceph osd pool create volumes ceph osd pool create images @@ -309,25 +309,26 @@ authenticating with the Ceph cluster. 
::

 rbd_user = cinder
 rbd_secret_uuid = 457eb676-33da-42ec-9a8c-9293d545c337

-These two flags are also used by the Nova ephemeral backend.
+These two flags are also used by the Nova ephemeral back end.

Configuring Nova
----------------

-In order to boot all the virtual machines directly into Ceph, you must
+In order to boot virtual machines directly from Ceph volumes, you must
 configure the ephemeral backend for Nova.

-It is recommended to enable the RBD cache in your Ceph configuration file
-(enabled by default since Giant). Moreover, enabling the admin socket
-brings a lot of benefits while troubleshooting. Having one socket
-per virtual machine using a Ceph block device will help investigating performance and/or wrong behaviors.
+It is recommended to enable the RBD cache in your Ceph configuration file; this
+has been enabled by default since the Giant release. Moreover, enabling the
+client admin socket allows the collection of metrics and can be invaluable
+for troubleshooting.

-This socket can be accessed like this::
+This socket can be accessed on the hypervisor (Nova compute) node::

 ceph daemon /var/run/ceph/ceph-client.cinder.19195.32310016.asok help

-Now on every compute nodes edit your Ceph configuration file::
+To enable RBD cache and admin sockets, ensure that each hypervisor's
+``ceph.conf`` contains::

 [client]
 rbd cache = true
@@ -336,7 +337,7 @@ Now on every compute nodes edit your Ceph configuration file::
 log file = /var/log/qemu/qemu-guest-$pid.log
 rbd concurrent management ops = 20

-Configure the permissions of these paths::
+Configure permissions for these directories::

 mkdir -p /var/run/ceph/guests/ /var/log/qemu/
 chown qemu:libvirtd /var/run/ceph/guests /var/log/qemu/
@@ -344,15 +345,15 @@ Configure the permissions of these paths::
 Note that user ``qemu`` and group ``libvirtd`` can vary depending on your
 system. The provided example works for RedHat based systems.

-..
tip:: If your virtual machine is already running you can simply restart it to get the socket +.. tip:: If your virtual machine is already running you can simply restart it to enable the admin socket Restart OpenStack ================= To activate the Ceph block device driver and load the block device pool name -into the configuration, you must restart OpenStack. Thus, for Debian based -systems execute these commands on the appropriate nodes:: +into the configuration, you must restart the related OpenStack services. +For Debian based systems execute these commands on the appropriate nodes:: sudo glance-control api restart sudo service nova-compute restart @@ -383,7 +384,7 @@ You can use `qemu-img`_ to convert from one format to another. For example:: qemu-img convert -f qcow2 -O raw precise-cloudimg.img precise-cloudimg.raw When Glance and Cinder are both using Ceph block devices, the image is a -copy-on-write clone, so it can create a new volume quickly. In the OpenStack +copy-on-write clone, so new volumes are created quickly. In the OpenStack dashboard, you can boot from that volume by performing the following steps: #. Launch a new instance. diff --git a/doc/rbd/rbd-persistent-cache.rst b/doc/rbd/rbd-persistent-cache.rst index 6c5bb92f60195..7ce637c57b0c4 100644 --- a/doc/rbd/rbd-persistent-cache.rst +++ b/doc/rbd/rbd-persistent-cache.rst @@ -7,14 +7,14 @@ Shared, Read-only Parent Image Cache ==================================== -`Cloned RBD images`_ from a parent usually only modify a small portion of -the image. For example, in a VDI workload, the VMs are cloned from the same -base image and initially only differ by hostname and IP address. During the -booting stage, all of these VMs would re-read portions of duplicate parent -image data from the RADOS cluster. If we have a local cache of the parent -image, this will help to speed up the read process on one host, as well as -to save the client to cluster network traffic. 
-RBD shared read-only parent image cache requires explicitly enabling in
+`Cloned RBD images`_ usually modify only a small fraction of the parent
+image. For example, in a VDI use-case, VMs are cloned from the same
+base image and initially differ only by hostname and IP address. During
+booting, all of these VMs read portions of the same parent
+image data. If we have a local cache of the parent
+image, this speeds up reads on the caching host. We also achieve
+reduction of client-to-cluster network traffic.
+The shared read-only parent image cache must be explicitly enabled in
 ``ceph.conf``. The ``ceph-immutable-object-cache`` daemon is responsible for
 caching the parent content on the local disk, and future reads on that data
 will be serviced from the local cache.
@@ -64,14 +64,14 @@ The key components of the daemon are:
 RADOS cluster and stored in the local caching directory.

 On opening each cloned rbd image, ``librbd`` will try to connect to the
-cache daemon over its domain socket. If it's successfully connected,
-``librbd`` will automatically check with the daemon on the subsequent reads.
+cache daemon through its Unix domain socket. Once successfully connected,
+``librbd`` will coordinate with the daemon on the subsequent reads.
 If there's a read that's not cached, the daemon will promote the RADOS object
 to local caching directory, so the next read on that object will be serviced
-from local file. The daemon also maintains simple LRU statistics so if there's
-not enough capacity it will delete some cold cache files.
+from cache. The daemon also maintains simple LRU statistics so that under
+capacity pressure it will evict cold cache files as needed.
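The promote-on-miss and LRU-eviction behavior described in the hunk above can be modeled in a few lines. This is a toy Python sketch (the real daemon is C++, tracks bytes on disk rather than entry counts, and uses the ``immutable_object_cache_*`` settings discussed below); class and parameter names here are illustrative only:

```python
from collections import OrderedDict


class ImmutableObjectCache:
    """Toy model of the cache daemon's policy: promote on miss,
    evict least-recently-used entries past the watermark."""

    def __init__(self, max_size: int, watermark: float = 0.9):
        self.max_size = max_size
        self.watermark = watermark
        self.entries: "OrderedDict[str, bytes]" = OrderedDict()

    def read(self, name: str, fetch) -> bytes:
        if name in self.entries:              # hit: refresh LRU position
            self.entries.move_to_end(name)
            return self.entries[name]
        data = fetch(name)                    # miss: promote from the cluster
        self.entries[name] = data
        self._evict()
        return data

    def _evict(self) -> None:
        limit = int(self.max_size * self.watermark)
        while len(self.entries) > limit:      # drop coldest entries first
            self.entries.popitem(last=False)
```

A subsequent ``read`` of a promoted object is served from ``entries`` without calling ``fetch`` again, which is the whole point of the cache.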
-Here are some important cache options correspond to the following settings: +Here are some important cache configuration settings: - ``immutable_object_cache_sock`` The path to the domain socket used for communication between librbd clients and the ceph-immutable-object-cache @@ -81,9 +81,9 @@ Here are some important cache options correspond to the following settings: - ``immutable_object_cache_max_size`` The max size for immutable cache. -- ``immutable_object_cache_watermark`` The watermark for the cache. If the - capacity reaches to this watermark, the daemon will delete cold cache based - on the LRU statistics. +- ``immutable_object_cache_watermark`` The high-water mark for the cache. If the + capacity reaches this threshold the daemon will delete cold cache based + on LRU statistics. The ``ceph-immutable-object-cache`` daemon is available within the optional ``ceph-immutable-object-cache`` distribution package. diff --git a/doc/rbd/rbd-replay.rst b/doc/rbd/rbd-replay.rst index e1c96b21720c1..b1fc4973fb920 100644 --- a/doc/rbd/rbd-replay.rst +++ b/doc/rbd/rbd-replay.rst @@ -4,7 +4,7 @@ .. index:: Ceph Block Device; RBD Replay -RBD Replay is a set of tools for capturing and replaying Rados Block Device +RBD Replay is a set of tools for capturing and replaying RADOS Block Device (RBD) workloads. To capture an RBD workload, ``lttng-tools`` must be installed on the client, and ``librbd`` on the client must be the v0.87 (Giant) release or later. To replay an RBD workload, ``librbd`` on the client must be the Giant diff --git a/doc/rbd/rbd-snapshot.rst b/doc/rbd/rbd-snapshot.rst index c38090253b46a..701acec2f3ad1 100644 --- a/doc/rbd/rbd-snapshot.rst +++ b/doc/rbd/rbd-snapshot.rst @@ -4,22 +4,24 @@ .. index:: Ceph Block Device; snapshots -A snapshot is a read-only copy of the state of an image at a particular point in -time. One of the advanced features of Ceph block devices is that you can create -snapshots of the images to retain a history of an image's state. 
Ceph also
-supports snapshot layering, which allows you to clone images (e.g., a VM image)
-quickly and easily. Ceph supports block device snapshots using the ``rbd``
-command and many higher level interfaces, including `QEMU`_, `libvirt`_,
-`OpenStack`_ and `CloudStack`_.
+A snapshot is a read-only logical copy of an image at a particular point in
+time: a checkpoint. One of the advanced features of Ceph block devices is
+that you can create snapshots of images to retain point-in-time state history.
+Ceph also supports snapshot layering, which allows you to clone images (e.g., a
+VM image) quickly and easily. Ceph block device snapshots are managed using the
+``rbd`` command and multiple higher level interfaces, including `QEMU`_,
+`libvirt`_, `OpenStack`_ and `CloudStack`_.

 .. important:: To use RBD snapshots, you must have a running Ceph cluster.

-.. note:: Because RBD does not know about the file system, snapshots are
- `crash-consistent` if they are not coordinated with the mounting
- computer. So, we recommend you stop `I/O` before taking a snapshot of
- an image. If the image contains a file system, the file system must be
- in a consistent state before taking a snapshot or you may have to run
- `fsck`. To stop `I/O` you can use `fsfreeze` command. See
+.. note:: Because RBD does not know about any filesystem within an image
+ (volume), snapshots are not `crash-consistent` unless they are
+ coordinated within the mounting (attaching) operating system.
+ We therefore recommend that you pause or stop I/O before taking a snapshot.
+ If the volume contains a filesystem, it must be in an internally
+ consistent state before taking a snapshot. Snapshots taken at
+ inconsistent points may need an `fsck` pass before subsequent
+ mounting. To stop `I/O` you can use the `fsfreeze` command. See
 `fsfreeze(8)` man page for more details. For virtual machines,
 `qemu-guest-agent` can be used to automatically freeze file systems when
 creating a snapshot.
@@ -37,10 +39,10 @@ command and many higher level interfaces, including `QEMU`_, `libvirt`_, Cephx Notes =========== -When `cephx`_ is enabled (it is by default), you must specify a user name or ID -and a path to the keyring containing the corresponding key for the user. See -:ref:`User Management ` for details. You may also add the ``CEPH_ARGS`` environment -variable to avoid re-entry of the following parameters. :: +When `cephx`_ authentication is enabled (it is by default), you must specify a +user name or ID and a path to the keyring containing the corresponding key. See +:ref:`User Management ` for details. You may also set the +``CEPH_ARGS`` environment variable to avoid re-entry of these parameters. :: rbd --id {user-ID} --keyring=/path/to/secret [commands] rbd --name {username} --keyring=/path/to/secret [commands] @@ -58,12 +60,12 @@ Snapshot Basics =============== The following procedures demonstrate how to create, list, and remove -snapshots using the ``rbd`` command on the command line. +snapshots using the ``rbd`` command. Create Snapshot --------------- -To create a snapshot with ``rbd``, specify the ``snap create`` option, the pool +To create a snapshot with ``rbd``, specify the ``snap create`` option, the pool name and the image name. :: rbd snap create {pool-name}/{image-name}@{snap-name} @@ -102,14 +104,14 @@ For example:: the current version of the image with data from a snapshot. The time it takes to execute a rollback increases with the size of the image. It is **faster to clone** from a snapshot **than to rollback** - an image to a snapshot, and it is the preferred method of returning + an image to a snapshot, and is the preferred method of returning to a pre-existing state. Delete a Snapshot ----------------- -To delete a snapshot with ``rbd``, specify the ``snap rm`` option, the pool +To delete a snapshot with ``rbd``, specify the ``snap rm`` subcommand, the pool name, the image name and the snap name. 
:: rbd snap rm {pool-name}/{image-name}@{snap-name} @@ -120,13 +122,13 @@ For example:: .. note:: Ceph OSDs delete data asynchronously, so deleting a snapshot - doesn't free up the disk space immediately. + doesn't immediately free up the underlying OSDs' capacity. Purge Snapshots --------------- To delete all snapshots for an image with ``rbd``, specify the ``snap purge`` -option and the image name. :: +subcommand and the image name. :: rbd snap purge {pool-name}/{image-name} @@ -161,7 +163,7 @@ clones rapidly. Parent Child -.. note:: The terms "parent" and "child" mean a Ceph block device snapshot (parent), +.. note:: The terms "parent" and "child" refer to a Ceph block device snapshot (parent), and the corresponding image cloned from the snapshot (child). These terms are important for the command line usage below. @@ -171,12 +173,12 @@ the cloned image to open the parent snapshot and read it. A COW clone of a snapshot behaves exactly like any other Ceph block device image. You can read to, write from, clone, and resize cloned images. There are no special restrictions with cloned images. However, the copy-on-write clone of -a snapshot refers to the snapshot, so you **MUST** protect the snapshot before +a snapshot depends on the snapshot, so you **MUST** protect the snapshot before you clone it. The following diagram depicts the process. -.. note:: Ceph only supports cloning for format 2 images (i.e., created with +.. note:: Ceph only supports cloning of RBD format 2 images (i.e., created with ``rbd create --image-format 2``). The kernel client supports cloned images - since kernel 3.10. + beginning with the 3.10 release. 
Getting Started with Layering ----------------------------- diff --git a/doc/start/documenting-ceph.rst b/doc/start/documenting-ceph.rst index 27db8453c28b1..f345302f8af48 100644 --- a/doc/start/documenting-ceph.rst +++ b/doc/start/documenting-ceph.rst @@ -151,13 +151,13 @@ If it doesn't exist, create your branch:: Make a Change ------------- -Modifying a document involves opening a restructuredText file, changing +Modifying a document involves opening a reStructuredText file, changing its contents, and saving the changes. See `Documentation Style Guide`_ for details on syntax requirements. -Adding a document involves creating a new restructuredText file under the -``doc`` directory or its subdirectories and saving the file with a ``*.rst`` -file extension. You must also include a reference to the document: a hyperlink +Adding a document involves creating a new reStructuredText file within the +``doc`` directory tree with a ``*.rst`` +extension. You must also include a reference to the document: a hyperlink or a table of contents entry. The ``index.rst`` file of a top-level directory usually contains a TOC, where you can add the new file name. All documents must have a title. See `Headings`_ for details. diff --git a/doc/start/hardware-recommendations.rst b/doc/start/hardware-recommendations.rst index aa6c2b71f6944..d11447f04cae5 100644 --- a/doc/start/hardware-recommendations.rst +++ b/doc/start/hardware-recommendations.rst @@ -21,33 +21,44 @@ data cluster (e.g., OpenStack, CloudStack, etc). CPU === -Ceph metadata servers dynamically redistribute their load, which is CPU -intensive. So your metadata servers should have significant processing power -(e.g., quad core or better CPUs). Ceph OSDs run the :term:`RADOS` service, calculate +CephFS metadata servers are CPU intensive, so they should have significant +processing power (e.g., quad core or better CPUs) and benefit from higher clock +rate (frequency in GHz). 
Ceph OSDs run the :term:`RADOS` service, calculate data placement with :term:`CRUSH`, replicate data, and maintain their own copy of the -cluster map. Therefore, OSDs should have a reasonable amount of processing power -(e.g., dual core processors). Monitors simply maintain a master copy of the -cluster map, so they are not CPU intensive. You must also consider whether the +cluster map. Therefore, OSD nodes should have a reasonable amount of processing +power. Requirements vary by use case; a starting point might be one core per +OSD for light / archival usage, and two cores per OSD for heavy workloads such +as RBD volumes attached to VMs. Monitor / manager nodes do not have heavy CPU +demands, so a modest processor can be chosen for them. Also consider whether the host machine will run CPU-intensive processes in addition to Ceph daemons. For example, if your hosts will run computing VMs (e.g., OpenStack Nova), you will need to ensure that these other processes leave sufficient processing power for Ceph daemons. We recommend running additional CPU-intensive processes on -separate hosts. +separate hosts to avoid resource contention. RAM === -Generally, more RAM is better. +Generally, more RAM is better. Monitor / manager nodes for a modest cluster +might do fine with 64GB; for a larger cluster with hundreds of OSDs, 128GB +is a reasonable target. There is a memory target for BlueStore OSDs that +defaults to 4GB. Factor in a prudent margin for the operating system and +administrative tasks (like monitoring and metrics) as well as increased +consumption during recovery: provisioning ~8GB per BlueStore OSD +is advised. Monitors and managers (ceph-mon and ceph-mgr) --------------------------------------------- Monitor and manager daemon memory usage generally scales with the size of the
You may also want -to consider tuning settings like ``mon_osd_cache_size`` or -``rocksdb_cache_size``. +cluster. Note that at boot-time and during topology changes and recovery these +daemons will need more RAM than they do during steady-state operation, so plan +for peak usage. For very small clusters, 32 GB suffices. For +clusters of up to, say, 300 OSDs, go with 64GB. For clusters built with (or +which will grow to) even more OSDs, you should provision +128GB. You may also want to consider tuning settings like ``mon_osd_cache_size`` +or ``rocksdb_cache_size`` after careful research. Metadata servers (ceph-mds) --------------------------- @@ -108,8 +119,8 @@ performance tradeoffs to consider when planning for data storage. Simultaneous OS operations, and simultaneous request for read and write operations from multiple daemons against a single drive can slow performance considerably. -.. important:: Since Ceph has to write all data to the journal before it can - send an ACK (for XFS at least), having the journal and OSD +.. important:: Since Ceph has to write all data to the journal (or WAL+DB) + before it can ACK writes, having this metadata and OSD performance in balance is really important! @@ -127,23 +138,25 @@ at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the 1 terabyte disks would generally increase the cost per gigabyte by 40%--rendering your cluster substantially less cost efficient. -.. tip:: Running multiple OSDs on a single disk--irrespective of partitions--is - **NOT** a good idea. +.. tip:: Running multiple OSDs on a single SAS / SATA drive + is **NOT** a good idea. NVMe drives, however, can achieve + improved performance by being split into two or more OSDs. .. tip:: Running an OSD and a monitor or a metadata server on a single - disk--irrespective of partitions--is **NOT** a good idea either. + drive is also **NOT** a good idea.
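The BlueStore memory target discussed in the RAM section above is tunable per daemon class via ``osd_memory_target`` (a real option, expressed in bytes). A minimal sketch of raising it from the 4 GiB default to 8 GiB; the 8 GiB figure is illustrative, not a recommendation:

```shell
# osd_memory_target is specified in bytes; compute 8 GiB:
TARGET=$(( 8 * 1024 * 1024 * 1024 ))
echo "$TARGET"

# On a live cluster you would then apply it, e.g.:
# ceph config set osd osd_memory_target "$TARGET"
```

Remember that this target governs steady-state caching only; recovery and compaction can push an OSD's resident size above it, which is why the text advises ~8GB of provisioned RAM per BlueStore OSD.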
Storage drives are subject to limitations on seek time, access time, read and write times, as well as total throughput. These physical limitations affect overall system performance--especially during recovery. We recommend using a -dedicated drive for the operating system and software, and one drive for each -Ceph OSD Daemon you run on the host. Most "slow OSD" issues arise due to running +dedicated (ideally mirrored) drive for the operating system and software, and +one drive for each Ceph OSD Daemon you run on the host (modulo NVMe above). +Many "slow OSD" issues not attributable to hardware failure arise from running an operating system, multiple OSDs, and/or multiple journals on the same drive. Since the cost of troubleshooting performance issues on a small cluster likely exceeds the cost of the extra disk drives, you can optimize your cluster design planning by avoiding the temptation to overtax the OSD storage drives. -You may run multiple Ceph OSD Daemons per hard disk drive, but this will likely +You may run multiple Ceph OSD Daemons per SAS / SATA drive, but this will likely lead to resource contention and diminish the overall throughput. You may store a journal and object data on the same drive, but this may increase the time it takes to journal a write and ACK to the client. Ceph must write to the journal @@ -196,12 +209,9 @@ are a few important performance considerations for journals and SSDs: proper partition alignment with SSDs, which can cause SSDs to transfer data much more slowly. Ensure that SSD partitions are properly aligned. -While SSDs are cost prohibitive for object storage, OSDs may see a significant -performance improvement by storing an OSD's journal on an SSD and the OSD's -object data on a separate hard disk drive. The ``osd journal`` configuration -setting defaults to ``/var/lib/ceph/osd/$cluster-$id/journal``. You can mount -this path to an SSD or to an SSD partition so that it is not merely a file on -the same disk as the object data. 
+SSDs have historically been cost prohibitive for object storage, though +emerging QLC drives are closing the gap. HDD OSDs may see a significant +performance improvement by offloading WAL+DB onto an SSD. One way Ceph accelerates CephFS file system performance is to segregate the storage of CephFS metadata from the storage of the CephFS file contents. Ceph @@ -214,9 +224,12 @@ your CephFS metadata pool that points only to a host's SSD storage media. See Controllers ----------- -Disk controllers also have a significant impact on write throughput. Carefully, -consider your selection of disk controllers to ensure that they do not create -a performance bottleneck. +Disk controllers (HBAs) can have a significant impact on write throughput. +Carefully consider your selection to ensure that they do not create +a performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher +latency than simpler "JBOD" (IT) mode HBAs, and the RAID SoC, write cache, +and battery backup can substantially increase hardware and maintenance +costs. Some RAID HBAs can be configured with an IT-mode "personality". .. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write @@ -226,8 +239,8 @@ a performance bottleneck. Additional Considerations ------------------------- -You may run multiple OSDs per host, but you should ensure that the sum of the -total throughput of your OSD hard disks doesn't exceed the network bandwidth +You typically will run multiple OSDs per host, but you should ensure that the +aggregate throughput of your OSD drives doesn't exceed the network bandwidth required to service a client's need to read or write data. You should also consider what percentage of the overall data the cluster stores on each host. If the percentage on a particular host is large and the host fails, it can lead to @@ -243,10 +256,10 @@ multiple OSDs per host.
Networks ======== -Consider starting with a 10Gbps+ network in your racks. Replicating 1TB of data +Provision at least 10Gbps networking in your racks. Replicating 1TB of data across a 1Gbps network takes 3 hours, and 10TBs takes 30 hours! By contrast, -with a 10Gbps network, the replication times would be 20 minutes and 1 hour -respectively. In a petabyte-scale cluster, failure of an OSD disk should be an +with a 10Gbps network, the replication times would be 20 minutes and 1 hour +respectively. In a petabyte-scale cluster, failure of an OSD drive is an expectation, not an exception. System administrators will appreciate PGs recovering from a ``degraded`` state to an ``active + clean`` state as rapidly as possible, with price / performance tradeoffs taken into consideration. @@ -255,12 +268,16 @@ cabling more manageable. VLANs using 802.1q protocol require VLAN-capable NICs and Switches. The added hardware expense may be offset by the operational cost savings for network setup and maintenance. When using VLANs to handle VM traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack, -etc.), it is also worth considering using 10G Ethernet. Top-of-rack routers for -each network also need to be able to communicate with spine routers that have -even faster throughput--e.g., 40Gbps to 100Gbps. +etc.), there is additional value in using 10G Ethernet or better; 40Gb/s or +25/50/100 Gb/s networking is common for production clusters as of 2020. + +Top-of-rack routers for each network also need to be able to communicate with +spine routers that have even faster throughput, often 40Gb/s or more. + Your server hardware should have a Baseboard Management Controller (BMC). -Administration and deployment tools may also use BMCs extensively, so consider +Administration and deployment tools may also use BMCs extensively, especially +via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band network for administration.
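The replication-time figures in the Networks section follow from simple line-rate arithmetic: seconds = bytes × 8 / (bits per second). A quick back-of-envelope sketch; these are raw wire-rate numbers ignoring protocol overhead, which is why the text's 3-hour figure for 1TB at 1Gbps is somewhat higher:

```shell
# Time to move 1 TB (10^12 bytes) across 1 Gb/s and 10 Gb/s links:
BYTES=1000000000000
T_1G=$(( 8 * BYTES / 1000000000 ))     # seconds at 1 Gb/s
T_10G=$(( 8 * BYTES / 10000000000 ))   # seconds at 10 Gb/s
echo "1Gb/s: ${T_1G}s, 10Gb/s: ${T_10G}s"
```

This yields roughly 8000 seconds (about 2.2 hours) at 1Gb/s versus 800 seconds at 10Gb/s, which scales linearly with data volume: recovery of a multi-terabyte OSD is where the faster fabric pays for itself.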
Hypervisor SSH access, VM image uploads, OS image installs, management sockets, etc. can impose significant loads on a network. Running three networks may seem @@ -273,7 +290,7 @@ Failure Domains =============== A failure domain is any failure that prevents access to one or more OSDs. That -could be a stopped daemon on a host; a hard disk failure, an OS crash, a +could be a stopped daemon on a host; a hard disk failure, an OS crash, a malfunctioning NIC, a failed power supply, a network outage, a power outage, and so forth. When planning out your hardware needs, you must balance the temptation to reduce costs by placing too many responsibilities into too few @@ -301,7 +318,7 @@ and development clusters can run successfully with modest hardware. | | | * ARM processors specifically may | | | | require additional cores. | | | | * Actual performance depends on many | -| | | factors including disk, network, and | +| | | factors including drives, net, and | | | | client throughput and latency. | | | | Benchmarking is highly recommended. | | +----------------+-----------------------------------------+ @@ -315,15 +332,15 @@ and development clusters can run successfully with modest hardware. 
| +----------------+-----------------------------------------+ | | Network | 1x 1GbE+ NICs (10GbE+ recommended) | +--------------+----------------+-----------------------------------------+ -| ``ceph-mon`` | Processor | - 1 core minimum | +| ``ceph-mon`` | Processor | - 2 cores minimum | | +----------------+-----------------------------------------+ -| | RAM | 2GB+ per daemon | +| | RAM | 24GB+ per daemon | | +----------------+-----------------------------------------+ -| | Disk Space | 10 GB per daemon | +| | Disk Space | 60 GB per daemon | | +----------------+-----------------------------------------+ | | Network | 1x 1GbE+ NICs | +--------------+----------------+-----------------------------------------+ -| ``ceph-mds`` | Processor | - 1 core minimum | +| ``ceph-mds`` | Processor | - 2 cores minimum | | +----------------+-----------------------------------------+ | | RAM | 2GB+ per daemon | | +----------------+-----------------------------------------+ diff --git a/doc/start/os-recommendations.rst b/doc/start/os-recommendations.rst index 275cd425844b6..cef6b366fd2bd 100644 --- a/doc/start/os-recommendations.rst +++ b/doc/start/os-recommendations.rst @@ -19,10 +19,11 @@ Linux Kernel your Linux distribution on any client hosts. For RBD, if you choose to *track* long-term kernels, we currently recommend - 4.x-based "longterm maintenance" kernel series: + 4.x-based "longterm maintenance" kernel series or later: - 4.19.z - 4.14.z + - 5.x For CephFS, see the section about `Mounting CephFS using Kernel Driver`_ for kernel version guidance. @@ -111,30 +112,30 @@ Luminous (12.2.z) Notes ----- -- **1**: The default kernel has an older version of ``btrfs`` that we do not - recommend for ``ceph-osd`` storage nodes. We recommend using ``bluestore`` - starting from Mimic, and ``XFS`` for previous releases with ``filestore``. +- **1**: The default kernel has an older version of ``Btrfs`` that we do not + recommend for ``ceph-osd`` storage nodes. 
We recommend using ``BlueStore`` + starting with Luminous, and ``XFS`` for previous releases with ``Filestore``. - **2**: The default kernel has an old Ceph client that we do not recommend for kernel client (kernel RBD or the Ceph file system). Upgrade to a recommended kernel. -- **3**: The default kernel regularly fails in QA when the ``btrfs`` - file system is used. We recommend using ``bluestore`` starting from - Mimic, and ``XFS`` for previous releases with ``filestore``. +- **3**: The default kernel regularly fails in QA when the ``Btrfs`` + file system is used. We recommend using ``BlueStore`` starting from + Luminous, and ``XFS`` for previous releases with ``Filestore``. - **4**: ``btrfs`` is no longer tested on this release. We recommend using ``bluestore``. - **5**: Some additional features related to dashboard are not available. -- **6**: Building packages are built regularly, but not distributed by Ceph. +- **6**: Packages are built regularly, but not distributed by upstream Ceph. Testing ------- - **B**: We build release packages for this platform. For some of these - platforms, we may also continuously build all ceph branches and exercise + platforms, we may also continuously build all Ceph branches and perform basic unit tests. - **I**: We do basic installation and functionality tests of releases on this -- 2.39.5