replacing the release name in the url (for example, ``reef`` in
`https://docs.ceph.com/en/reef/ <https://docs.ceph.com/en/reef>`_) with the
branch name you prefer (for example, ``quincy``, to create a URL that reads
-`https://docs.ceph.com/en/pacific/ <https://docs.ceph.com/en/quincy/>`_).
+`https://docs.ceph.com/en/quincy/ <https://docs.ceph.com/en/quincy/>`_).
.. _making_contributions:
.. prompt:: bash $
- sudo apt-get install git
+ sudo apt-get install git
In Fedora, run the following command:
.. prompt:: bash $
- sudo yum install git
+ sudo yum install git
In CentOS/RHEL, run the following command:
.. prompt:: bash $
- sudo yum install git
+ sudo yum install git
#. Make sure that your ``.gitconfig`` file has been configured to include your
name and email address:
.. code-block:: ini
- [user]
- email = {your-email-address}
- name = {your-name}
+ [user]
+ email = {your-email-address}
+ name = {your-name}
For example:
.. prompt:: bash $
- git config --global user.name "John Doe"
- git config --global user.email johndoe@example.com
+ git config --global user.name "John Doe"
+ git config --global user.email johndoe@example.com
-#. Create a `github`_ account (if you don't have one).
+#. Create a `GitHub`_ account (if you don't have one).
#. Fork the Ceph project. See https://github.com/ceph/ceph.
- **Installation (Quick):** Quick start documentation is in the
``doc/start`` directory.
-- **Installation (Manual):** Documentaton concerning the manual installation of
+- **Installation (Manual):** Documentation concerning the manual installation of
Ceph is in the ``doc/install`` directory.
- **Manpage:** Manpage source is in the ``doc/man`` directory.
.. prompt:: bash $
- git checkout main
+ git checkout main
When you make changes to documentation that affect an upcoming release, use
the ``next`` branch. ``next`` is the second most commonly used branch.
.. prompt:: bash $
- git checkout next
+ git checkout next
When you are making substantial contributions such as new features that are not
yet in the current release; if your contribution is related to an issue with a
.. prompt:: bash $
- git branch -a | grep wip-doc-{your-branch-name}
+ git branch -a | grep wip-doc-{your-branch-name}
If it doesn't exist, create your branch:
.. prompt:: bash $
- git checkout -b wip-doc-{your-branch-name}
+ git checkout -b wip-doc-{your-branch-name}
Make a Change
.. prompt:: bash $
- git add doc/rados/example.rst
+ git add doc/rados/example.rst
Deleting a document involves removing it from the repository with ``git rm
{path-to-filename}``. For example:
.. prompt:: bash $
- git rm doc/rados/example.rst
+ git rm doc/rados/example.rst
You must also remove any reference to a deleted document from other documents.
.. prompt:: bash $
- cd ceph
+ cd ceph
.. note::
The directory that contains ``build-doc`` and ``serve-doc`` must be included
.. prompt:: bash $
- admin/build-doc
+ admin/build-doc
To scan for the reachability of external links, execute:
.. prompt:: bash $
- admin/build-doc linkcheck
+ admin/build-doc linkcheck
Running ``admin/build-doc`` creates a ``build-doc`` directory under
``ceph``. You may need to create a directory under ``ceph/build-doc`` for
.. prompt:: bash $
- mkdir -p output/html/api/libcephfs-java/javadoc
+ mkdir -p output/html/api/libcephfs-java/javadoc
The build script ``build-doc`` produces output partially consisting of errors
and warnings. You MUST fix errors in documents you modified before committing
.. prompt:: bash $
- admin/serve-doc
+ admin/serve-doc
You can also navigate to ``build-doc/output`` to inspect the built documents.
Within ``build-doc/output`` is an ``html`` directory and a ``man`` directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ceph uses Python Sphinx, which is distribution agnostic. The first time you
-build the Ceph documentation, a doxygen XML tree is generated. This can be time
--consuming.
+build the Ceph documentation, a doxygen XML tree is generated. This can
+be time-consuming.
Some of Python Sphinx's dependencies vary across distributions. The first time
you build the documentation, the build script notifies you of uninstalled
-dependencies.
+dependencies.
To run Sphinx and build documentation successfully, the following packages are
required:
.. raw:: html
- <style type="text/css">div.body h3{margin:5px 0px 0px 0px;}</style>
- <table cellpadding="10"><colgroup><col width="30%"><col width="30%"><col width="30%"></colgroup><tbody valign="top"><tr><td><h3>Debian/Ubuntu</h3>
+ <style type="text/css">div.body h3{margin:5px 0px 0px 0px;}</style>
+ <table cellpadding="10"><colgroup><col width="30%"><col width="30%"><col width="30%"></colgroup><tbody valign="top"><tr><td><h3>Debian/Ubuntu</h3>
- gcc
- python3-dev
.. raw:: html
- </td><td><h3>Fedora</h3>
+ </td><td><h3>Fedora</h3>
- gcc
- python-devel
.. raw:: html
- </td><td><h3>CentOS/RHEL</h3>
+ </td><td><h3>CentOS/RHEL</h3>
- gcc
- python-devel
.. raw:: html
- </td></tr></tbody></table>
+ </td></tr></tbody></table>
Install each dependency that is not installed on your host. For Debian/Ubuntu
.. prompt:: bash $
- sudo apt-get install gcc python-dev python3-pip libxml2-dev libxslt-dev doxygen graphviz ant ditaa
- sudo apt-get install python3-sphinx python3-venv cython3
+ sudo apt-get install gcc python-dev python3-pip libxml2-dev libxslt-dev doxygen graphviz ant ditaa
+ sudo apt-get install python3-sphinx python3-venv cython3
For Fedora distributions, run the following commands:
.. prompt:: bash $
- wget http://rpmfind.net/linux/centos/7/os/x86_64/Packages/python-jinja2-2.7.2-2.el7.noarch.rpm
- sudo yum install python-jinja2-2.7.2-2.el7.noarch.rpm
- wget http://rpmfind.net/linux/centos/7/os/x86_64/Packages/python-pygments-1.4-9.el7.noarch.rpm
- sudo yum install python-pygments-1.4-9.el7.noarch.rpm
- wget http://rpmfind.net/linux/centos/7/os/x86_64/Packages/python-docutils-0.11-0.2.20130715svn7687.el7.noarch.rpm
- sudo yum install python-docutils-0.11-0.2.20130715svn7687.el7.noarch.rpm
- wget http://rpmfind.net/linux/centos/7/os/x86_64/Packages/python-sphinx-1.1.3-11.el7.noarch.rpm
- sudo yum install python-sphinx-1.1.3-11.el7.noarch.rpm
+ wget http://rpmfind.net/linux/centos/7/os/x86_64/Packages/python-jinja2-2.7.2-2.el7.noarch.rpm
+ sudo yum install python-jinja2-2.7.2-2.el7.noarch.rpm
+ wget http://rpmfind.net/linux/centos/7/os/x86_64/Packages/python-pygments-1.4-9.el7.noarch.rpm
+ sudo yum install python-pygments-1.4-9.el7.noarch.rpm
+ wget http://rpmfind.net/linux/centos/7/os/x86_64/Packages/python-docutils-0.11-0.2.20130715svn7687.el7.noarch.rpm
+ sudo yum install python-docutils-0.11-0.2.20130715svn7687.el7.noarch.rpm
+ wget http://rpmfind.net/linux/centos/7/os/x86_64/Packages/python-sphinx-1.1.3-11.el7.noarch.rpm
+ sudo yum install python-sphinx-1.1.3-11.el7.noarch.rpm
Ceph documentation makes extensive use of `ditaa`_, which is not built for
CentOS/RHEL. If you make changes to ``ditaa`` diagrams, you must install
.. prompt:: bash $
- wget http://rpmfind.net/linux/fedora/linux/releases/22/Everything/x86_64/os/Packages/j/jericho-html-3.3-4.fc22.noarch.rpm
- sudo yum install jericho-html-3.3-4.fc22.noarch.rpm
- wget http://rpmfind.net/linux/centos/7/os/x86_64/Packages/jai-imageio-core-1.2-0.14.20100217cvs.el7.noarch.rpm
- sudo yum install jai-imageio-core-1.2-0.14.20100217cvs.el7.noarch.rpm
- wget http://rpmfind.net/linux/centos/7/os/x86_64/Packages/batik-1.8-0.12.svn1230816.el7.noarch.rpm
- sudo yum install batik-1.8-0.12.svn1230816.el7.noarch.rpm
- wget http://rpmfind.net/linux/fedora/linux/releases/22/Everything/x86_64/os/Packages/d/ditaa-0.9-13.r74.fc21.noarch.rpm
- sudo yum install ditaa-0.9-13.r74.fc21.noarch.rpm
+ wget http://rpmfind.net/linux/fedora/linux/releases/22/Everything/x86_64/os/Packages/j/jericho-html-3.3-4.fc22.noarch.rpm
+ sudo yum install jericho-html-3.3-4.fc22.noarch.rpm
+ wget http://rpmfind.net/linux/centos/7/os/x86_64/Packages/jai-imageio-core-1.2-0.14.20100217cvs.el7.noarch.rpm
+ sudo yum install jai-imageio-core-1.2-0.14.20100217cvs.el7.noarch.rpm
+ wget http://rpmfind.net/linux/centos/7/os/x86_64/Packages/batik-1.8-0.12.svn1230816.el7.noarch.rpm
+ sudo yum install batik-1.8-0.12.svn1230816.el7.noarch.rpm
+ wget http://rpmfind.net/linux/fedora/linux/releases/22/Everything/x86_64/os/Packages/d/ditaa-0.9-13.r74.fc21.noarch.rpm
+ sudo yum install ditaa-0.9-13.r74.fc21.noarch.rpm
After you have installed these packages, build the documentation by following
the steps in `Build the Source`_.
The following is a common commit comment (preferred)::
- doc: Fixes a spelling error and a broken hyperlink.
+ doc: Fixes a spelling error and a broken hyperlink.
- Signed-off-by: John Doe <john.doe@gmail.com>
+ Signed-off-by: John Doe <john.doe@gmail.com>
The following comment includes a reference to a bug. ::
- doc: Fixes a spelling error and a broken hyperlink.
+ doc: Fixes a spelling error and a broken hyperlink.
- Fixes: https://tracker.ceph.com/issues/1234
+ Fixes: https://tracker.ceph.com/issues/1234
- Signed-off-by: John Doe <john.doe@gmail.com>
+ Signed-off-by: John Doe <john.doe@gmail.com>
The following comment includes a terse sentence following the comment summary.
There is a blank line between the summary line and the description::
- doc: Added mon setting to monitor config reference
+ doc: Added mon setting to monitor config reference
- Describes 'mon setting', which is a new setting added
- to config_opts.h.
+ Describes 'mon setting', which is a new setting added
+ to config_opts.h.
- Signed-off-by: John Doe <john.doe@gmail.com>
+ Signed-off-by: John Doe <john.doe@gmail.com>
To commit changes, execute the following:
.. prompt:: bash $
- git commit -a
+ git commit -a
An easy way to manage your documentation commits is to use visual tools for
.. prompt:: bash $
- sudo apt-get install gitk git-gui
+ sudo apt-get install gitk git-gui
For Fedora/CentOS/RHEL, execute:
.. prompt:: bash $
- sudo yum install gitk git-gui
+ sudo yum install gitk git-gui
Then, execute:
.. prompt:: bash $
- cd {git-ceph-repo-path}
- gitk
+ cd {git-ceph-repo-path}
+ gitk
Finally, select **File->Start git gui** to activate the graphical user interface.
---------------
Once you have one or more commits, you must push them from the local copy of the
-repository to ``github``. A graphical tool like ``git-gui`` provides a user
+repository to ``GitHub``. A graphical tool like ``git-gui`` provides a user
interface for pushing to the repository. If you created a branch previously:
.. prompt:: bash $
- git push origin wip-doc-{your-branch-name}
+ git push origin wip-doc-{your-branch-name}
Otherwise:
.. prompt:: bash $
- git push
+ git push
Make a Pull Request
.. prompt:: bash $
- less doc/architecture.rst
+ less doc/architecture.rst
Review the following style guides to maintain this consistency.
underline with a leading and trailing space on the title text line.
See `Document Title`_ for details.
-#. **Section Titles:** Section tiles use the ``=`` character underline with no
+#. **Section Titles:** Section titles use the ``=`` character underline with no
leading or trailing spaces for text. Two carriage returns should precede a
section title (unless an inline reference precedes it). See `Sections`_ for
details.
you'll spend all day wondering what went wrong without realizing that you
omitted that underscore. Also, pay special attention to the space between the
substitution text (in this case, "here") and the less-than bracket that sets
-the explicit link apart from the substition text. The link will not render
+the explicit link apart from the substitution text. The link will not render
properly without this space.
Linking Customs
.. _Python Sphinx: https://www.sphinx-doc.org
.. _restructuredText: http://docutils.sourceforge.net/rst.html
.. _Fork and Pull: https://help.github.com/articles/using-pull-requests
-.. _github: http://github.com
+.. _GitHub: http://github.com
.. _ditaa: http://ditaa.sourceforge.net/
.. _Document Title: http://docutils.sourceforge.net/docs/user/rst/quickstart.html#document-title-subtitle
.. _Sections: http://docutils.sourceforge.net/docs/user/rst/quickstart.html#sections
another, but below are some general guidelines.
.. important:: Note that as of December 2025, ARM architecture containers
- provide a limited set of daemons. SMB service, for example, is not
- yet supported.
+ provide a limited set of daemons. SMB service, for example, is
+ not yet supported.
-.. tip:: check out the `ceph blog`_ too.
+.. tip:: Check out the `Ceph blog`_ too.
CPU
===
CRUSH, to replicate data, and to maintain their own copies of the cluster map.
With earlier releases of Ceph, we would make hardware recommendations based on
-the number of cores per OSD, but this cores-per-osd metric is no longer as
+the number of cores per OSD, but this cores-per-OSD metric is no longer as
useful as the number of cycles per IOP and the number of IOPS per OSD.
For example, with NVMe OSD drives, Ceph can easily utilize five or six cores on real
clusters and up to about fourteen cores on single OSDs in isolation. So cores
hardware, select for IOPS per core.
.. tip:: When we speak of CPU *cores*, we mean *threads* when hyperthreading
- is enabled. Hyperthreading is usually beneficial for Ceph servers.
+ is enabled. Hyperthreading is usually beneficial for Ceph servers.
Monitor nodes and Manager nodes do not have heavy CPU demands and require only
-modest processors. if your hosts will run CPU-intensive processes in
+modest processors. If your hosts will run CPU-intensive processes in
addition to Ceph daemons, make sure that you have enough processing power to
run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is
one example of a CPU-intensive process.) We recommend that you run
non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are
not your Monitor and Manager nodes) in order to avoid resource contention.
-If your cluster deployes the Ceph Object Gateway, RGW daemons may co-reside
-with your Mon and Manager services if the nodes have sufficient resources.
+If your cluster deploys the Ceph Object Gateway, RGW daemons may co-reside
+with your Monitor and Manager services if the nodes have sufficient resources.
RAM
===
Generally, more RAM is better. Monitor / Manager nodes for a modest cluster
-might do fine with 64GB; for a larger cluster with hundreds of OSDs 128GB
+might do fine with 64 GB; for a larger cluster with hundreds of OSDs, 128 GB
is advised.
-.. tip:: when we speak of RAM and storage requirements, we often describe
- the needs of a single daemon of a given type. A given server as
- a whole will thus need at least the sum of the needs of the
- daemons that it hosts as well as resources for logs and other operating
- system components. Keep in mind that a server's need for RAM
- and storage will be greater at startup and when components
- fail or are added and the cluster rebalances. In other words,
- allow headroom past what you might see used during a calm period
- on a small initial cluster footprint.
+.. tip:: When we speak of RAM and storage requirements, we often describe
+ the needs of a single daemon of a given type. A given server as
+ a whole will thus need at least the sum of the needs of the
+ daemons that it hosts as well as resources for logs and other operating
+ system components. Keep in mind that a server's need for RAM
+ and storage will be greater at startup and when components
+ fail or are added and the cluster rebalances. In other words,
+ allow headroom past what you might see used during a calm period
+ on a small initial cluster footprint.
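The sizing rule in the tip above can be sketched as a small calculation. This is an illustration only: the function name ``host_ram_gib``, the operating-system allowance, and the headroom fraction are assumptions chosen for the example, not values defined by Ceph.

```python
def host_ram_gib(daemon_needs_gib, os_allowance_gib=16.0, headroom_fraction=0.25):
    """Estimate the RAM to provision for one host, in GiB.

    daemon_needs_gib:  steady-state need of each daemon on the host.
    os_allowance_gib:  assumed allowance for the OS, logs, and other components.
    headroom_fraction: assumed margin for startup, failures, and rebalancing.
    """
    if headroom_fraction < 0:
        raise ValueError("headroom_fraction must not be negative")
    # The host needs at least the sum of its daemons' needs plus the OS share.
    steady_state = sum(daemon_needs_gib) + os_allowance_gib
    # Add headroom beyond what a calm period would show.
    return steady_state * (1.0 + headroom_fraction)


# Hypothetical host running 12 OSDs that each need 4 GiB.
print(host_ram_gib([4.0] * 12))
```

With these assumed inputs the estimate is 80 GiB: 48 GiB for the daemons, 16 GiB for the operating system, and 25% headroom on top.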
There is an :confval:`osd_memory_target` setting for BlueStore OSDs that
defaults to 4 GiB. Factor in a prudent margin for the operating system and
cluster. Note that at boot-time and during topology changes and recovery these
daemons will need more RAM than they do during steady-state operation, so plan
for peak usage. For very small clusters, 32 GB suffices. For clusters of up to,
-say, 300 OSDs go with 64GB. For clusters built with (or which will grow to)
-even more OSDs you should provision 128GB. You may also want to consider
+say, 300 OSDs go with 64 GB. For clusters built with (or which will grow to)
+even more OSDs you should provision 128 GB. You may also want to consider
tuning the following settings:
* :confval:`mon_osd_cache_size`
======
BlueStore uses its own memory to cache data rather than relying on the
-operating system's page cache. When using the BlueStore OSD back end you can adjust the amount of memory
-that the OSD attempts to consume by changing the :confval:`osd_memory_target`
-configuration option.
+operating system's page cache. When using the BlueStore OSD back end you can
+adjust the amount of memory that the OSD attempts to consume by changing
+the :confval:`osd_memory_target` configuration option.
-- Setting the :confval:`osd_memory_target` below 2GB is not
- recommended. Ceph may fail to keep the memory consumption under 2GB and
+- Setting the :confval:`osd_memory_target` below 2 GB is not
+ recommended. Ceph may fail to keep the memory consumption under 2 GB and
extremely slow performance is likely.
-- Setting the memory target between 2GB and 4GB typically works but may result
- in degraded performance: metadata may need to be read from disk during IO
- unless the active data set is relatively small.
+- Setting the memory target between 2 GB and 4 GB typically works but may
+ result in degraded performance: metadata may need to be read from disk
+ during IO unless the active data set is relatively small.
-- 4GB is the current default value for :confval:`osd_memory_target` This default
- was chosen for typical use cases, and is intended to balance RAM cost and
- OSD performance.
+- 4 GB is the current default value for :confval:`osd_memory_target`. This
+ default was chosen for typical use cases, and is intended to balance RAM
+ cost and OSD performance.
-- Setting the :confval:`osd_memory_target` higher than 4GB can improve
- performance when there many (small) objects or when large (256GB/OSD
+- Setting the :confval:`osd_memory_target` higher than 4 GB can improve
+ performance when there are many (small) objects or when large (256 GB/OSD
or more) data sets are processed. This is especially true with fast
NVMe OSDs.
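The guidance in the list above can be expressed as a small lookup. This is a sketch, not part of Ceph: the function name is hypothetical, and it assumes that the thresholds are binary gigabytes (GiB) and that the value is given in bytes.

```python
GIB = 1024 ** 3  # assumption: the thresholds above are treated as GiB


def describe_osd_memory_target(target_bytes):
    """Map an osd_memory_target value in bytes to the guidance bands above."""
    if target_bytes < 2 * GIB:
        return "not recommended"
    if target_bytes < 4 * GIB:
        return "works, but performance may be degraded"
    if target_bytes == 4 * GIB:
        return "default"
    return "may improve performance for many small objects or large data sets"


for gib in (1, 3, 4, 8):
    print(gib, "GiB:", describe_osd_memory_target(gib * GIB))
```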
needed, depending on the exact configuration of the system.
.. tip:: Configuring the operating system with swap to provide additional
- virtual memory for daemons is not advised for modern systems. Doing
- so may result in lower performance, and your Ceph cluster may well be
- happier with a daemon that crashes vs one that slows to a crawl.
+ virtual memory for daemons is not advised for modern systems. Doing
+ so may result in lower performance, and your Ceph cluster may well be
+ happier with a daemon that crashes vs one that slows to a crawl.
When using the legacy Filestore back end, the OS page cache was used for caching
data, so tuning was not normally needed. OSD memory consumption is related
use a significant fraction of their capacity for metadata, and drives smaller
than 100 GiB will not be effective at all.
-It is *strongly* suggested that (enterprise-class) SSDs are provisioned for, at a
-minimum, hosts that run or may run Ceph Monitor and Ceph Manager daemons.
-CephFS Metadata Server metadata pools and Ceph Object Gateway (RGW) index and log pools
-also require SSDs to be effective at enterprise scale, even if HDDs are to
-be provisioned for bulk OSD data. RGW deployments notably, if using HDDs for
-bulk object bucket data, should provision all other pools on SSDs.
+It is *strongly* suggested that (enterprise-class) SSDs are provisioned for, at
+a minimum, hosts that run or may run Ceph Monitor and Ceph Manager daemons.
+CephFS Metadata Server metadata pools and Ceph Object Gateway (RGW) index and
+log pools also require SSDs to be effective at enterprise scale, even if HDDs
+are to be provisioned for bulk OSD data. Notably, RGW deployments that use HDDs
+for bulk object bucket data should provision all other pools on SSDs.
To get the best performance out of Ceph, provision the following on separate
drives:
with SAS / SATA ports connect multiple drives via a backplane, which
itself can be a bottleneck. This is especially true with dense chassis,
where 24, 36, or even 100 drives may contend for resources. Chassis that
- can house more than 8 SAS / SATA drives typically do so by means of _expanders_.
- In the past these were conventional AIC cards with a bunch of cables; today
- expanders are embedded into the drive backplanes and are less visible. Notably
- these expanders can be performance bottlenecks.
+ can house more than 8 SAS / SATA drives typically do so by means
+ of *expanders*. In the past these were conventional AIC cards with a bunch
+ of cables; today expanders are embedded into the drive backplanes and are
+ less visible. Notably these expanders can be performance bottlenecks.
.. tip:: When considering HDDs for your cluster, plan ahead.
SAS and SATA SSDs are disappearing from manufacturers' product roadmaps, and
- adding SSDs to today's SAS/SATA chassis will become increasingly difficult in
- the years to come. One can purchase "universal" chassis that will accept all
- three, but these are more expensive and often require an expensive and fussy
- tri-mode HBA. Moreover, a chassis built for LFF (3.5") drives is rather
- space-inefficient when SFF (2.5") drives are emplaced via adapters.
+ adding SSDs to today's SAS/SATA chassis will become increasingly difficult
+ in the years to come. One can purchase "universal" chassis that will accept
+ all three, but these are more expensive and often require an expensive and
+ fussy tri-mode HBA. Moreover, a chassis built for LFF (3.5") drives is
+ rather space-inefficient when SFF (2.5") drives are emplaced via adapters.
See also the `Storage Networking
Industry Association's Total Cost of Ownership calculator`_.
Many "slow OSD" issues (when they are not attributable to hardware failure)
arise from running an operating system and multiple OSDs on the same drive.
Also be aware that today's 32 TB HDD uses the same SATA interface that was
-already a bottleneck for a 3 TB HDD from 2014: more than ten times the data to squeeze
-through the same interface. An analogy is to consider a three story building
-with one elevator, then a thirty-two story building with the same single
-elevator.
+already a bottleneck for a 3 TB HDD from 2014: more than ten times the data to
+squeeze through the same interface. An analogy is to consider a three-story
+building with one elevator, then a thirty-two-story building with the same
+single elevator.
For this reason, when using HDDs for
OSDs, drives larger than 8 TB may be best suited for storage of large
to improved client performance, they substantially improve the speed and
client impact of cluster changes including rebalancing when OSDs or Monitors
are added, removed, or fail. More subtly, the very slow recovery of an
-HDD cluster can result in a lengthy period of enhanced risk when a component fails.
+HDD cluster can result in a lengthy period of enhanced risk when a
+component fails.
SSDs do not have moving mechanical parts, so they are not subject
to many of the limitations of HDDs. SSDs do have significant
Relatively inexpensive SSDs may appeal to your sense of economy. Use caution.
Acceptable IOPS is not the only factor to consider when selecting SSDs for
-use with Ceph. Bargain client-class or off-brand SSDs are a false economy: they may experience
-"cliffing", which means that after an initial burst, sustained performance
-once a limited cache is filled declines considerably. Consider also durability:
-a drive rated for 0.3 Drive Writes Per Day (DWPD or equivalent) may be fine for
-OSDs dedicated to certain types of sequentially-written read-mostly data, but
-are not a good choice for an RBD pool serving hundreds of VMs. Enterprise-class SSDs are best
-for Ceph: they feature power loss protection (PLP) and do
-not suffer the dramatic cliffing that client (desktop) models may experience.
-
-When provisioning a single (or mirrored pair) SSD for both operating system boot
-and Ceph Monitor / Manager purposes, a minimum capacity of 256 GB is advised
-and at least 960 GB is recommended. A drive model rated at 1+ DWPD or the
-equivalent in TBW (TeraBytes Written) is suggested. However, for a given write
-workload, a larger SSD than technically required will provide more endurance
-because it effectively has greater overprovisioning. We stress that
+use with Ceph. Bargain client-class or off-brand SSDs are a false economy: they
+may experience "cliffing", which means that after an initial burst, sustained
+performance once a limited cache is filled declines considerably. Consider
+also durability: a drive rated for 0.3 Drive Writes Per Day (DWPD or
+equivalent) may be fine for OSDs dedicated to certain types of
+sequentially-written read-mostly data, but is not a good choice for an RBD
+pool serving hundreds of VMs. Enterprise-class SSDs are best for Ceph:
+they feature power loss protection (PLP) and do not suffer the dramatic
+cliffing that client (desktop) models may experience.
+
+When provisioning a single (or mirrored pair) SSD for both operating system
+boot and Ceph Monitor / Manager purposes, a minimum capacity of 256 GB is
+advised and at least 960 GB is recommended. A drive model rated at 1+ DWPD or
+the equivalent in TBW (TeraBytes Written) is suggested. However, for a given
+write workload, a larger SSD than technically required will provide more
+endurance because it effectively has greater overprovisioning. We stress that
enterprise-class drives are best for production use, as they feature power
loss protection and increased durability compared to client (desktop) SKUs
that are intended for much lighter and intermittent duty cycles. And we cannot
-stress enough that Monitor databases, CephFS metadata pools, and RGW log/index pools
-all but require SSDs for acceptable performance and stability.
+stress enough that Monitor databases, CephFS metadata pools, and RGW log/index
+pools all but require SSDs for acceptable performance and stability.
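The relationship between a DWPD rating and its TBW equivalent can be sketched as follows. The function name is hypothetical, and the five-year rating period is an assumption; use the period stated on the drive's datasheet.

```python
def dwpd_to_tbw(capacity_tb, dwpd, warranty_years=5):
    """Convert a Drive Writes Per Day rating to TeraBytes Written.

    warranty_years is an assumed rating period, not a value from this document.
    """
    if capacity_tb <= 0 or dwpd < 0 or warranty_years <= 0:
        raise ValueError("capacity and period must be positive, dwpd non-negative")
    # One drive write per day means writing the full capacity once each day.
    return capacity_tb * dwpd * 365 * warranty_years


# A 960 GB drive and a 1.92 TB drive, both rated at 1 DWPD.
print(dwpd_to_tbw(0.96, 1.0))
print(dwpd_to_tbw(1.92, 1.0))
```

Under these assumptions a 960 GB drive at 1 DWPD corresponds to roughly 1752 TBW, and a drive of twice the capacity at the same rating to roughly twice that figure.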
-SSDs have historically been considered cost prohibitive for object storage, but
-QLC SSDs are closing the gap, offering greater density with lower power
+SSDs have historically been considered cost prohibitive for object storage,
+but QLC SSDs are closing the gap, offering greater density with lower power
consumption and less power spent on cooling. Moreover, HDD OSDs may see a
significant write latency improvement by offloading WAL+DB onto an SSD.
Most Ceph OSD deployments do not require an SSD with greater endurance than
1 DWPD (aka "read-optimized"). "Mixed-use" SSDs in the 3 DWPD class are
-often overkill for this purpose and cost signficantly more.
+often overkill for this purpose and cost significantly more.
To get a better sense of the factors that determine the total cost of storage,
you might use the `Storage Networking Industry Association's Total Cost of
Partition Alignment
~~~~~~~~~~~~~~~~~~~
-When using SSDs with Ceph, make sure that your partitions (if any) are properly aligned.
-Improperly aligned partitions can result in reduced performance and endurance.
-For more information about proper partition
-alignment and example commands that show how to align partitions properly, see
-`Werner Fischer's blog post on partition alignment`_.
+When using SSDs with Ceph, make sure that your partitions (if any) are properly
+aligned. Improperly aligned partitions can result in reduced performance and
+endurance. For more information about proper partition alignment and example
+commands that show how to align partitions properly,
+see `Werner Fischer's blog post on partition alignment`_.
CephFS Metadata Segregation
~~~~~~~~~~~~~~~~~~~~~~~~~~~
One way that Ceph accelerates CephFS file system performance is by separating
the storage of CephFS metadata from the storage of the CephFS file contents.
Ceph provides a default ``metadata`` pool for CephFS metadata. You will never
-have to manually create a pool for CephFS metadata, but you should create a CRUSH map
-hierarchy for your CephFS metadata pool that includes only SSD storage media.
+have to manually create a pool for CephFS metadata, but you should create a
+CRUSH map hierarchy for your CephFS metadata pool that includes only SSD
+storage media.
See :ref:`CRUSH Device Class<crush-map-device-class>` for details.
additionally reduces the HDD vs SSD cost gap when the system as a whole is
considered. The initial cost of a fancy RAID HBA plus onboard cache plus
battery backup (BBU or supercapacitor) can easily exceed 1000 US
-dollars even after discounts, a sum that goes a long way toward SSD cost parity.
-An HBA-free system may also cost hundreds of US dollars less every year if one
-purchases an annual maintenance contract or extended warranty.
+dollars even after discounts, a sum that goes a long way toward SSD cost
+parity. An HBA-free system may also cost hundreds of US dollars less every year
+if one purchases an annual maintenance contract or extended warranty.
.. tip:: The `Ceph blog`_ is often an excellent source of information on Ceph
performance issues. See `Ceph Write Throughput 1`_ and `Ceph Write
BlueStore opens storage devices with ``O_DIRECT`` and issues ``fsync()``
frequently to ensure that data is safely persisted to media. You can evaluate a
-drive's low-level write performance using ``fio``. For example, 4 KiB random write
-performance is measured as follows:
+drive's low-level write performance using ``fio``. For example, 4 KiB random
+write performance is measured as follows:
.. code-block:: console
(volatile) cache. When the volatile cache is enabled, Linux uses a device in
"write back" mode, and when disabled, it uses "write through".
-The default configuration for HDDs (usually: caching is enabled) may not be optimal, and
-OSD performance may be dramatically increased in terms of increased IOPS and
-decreased commit latency by disabling this write cache.
+The default configuration for HDDs (usually: caching is enabled) may not be
+optimal, and OSD performance may be dramatically increased in terms of
+increased IOPS and decreased commit latency by disabling this write cache.
Users are therefore encouraged to benchmark their devices with ``fio`` as
described earlier and persist the optimal cache configuration for their
/dev/sda:
write-caching = 0 (off)
-.. tip:: This udev rule (tested on CentOS 8) will set all SATA/SAS device cache_types to "write
- through":
+.. tip:: This udev rule (tested on CentOS 8) will set all SATA/SAS device
+ cache_types to "write through":
.. code-block:: console
# cat /etc/udev/rules.d/99-ceph-write-through.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"
-.. tip:: This udev rule (tested on CentOS 7) will set all SATA/SAS device cache_types to "write
- through":
+.. tip:: This udev rule (tested on CentOS 7) will set all SATA/SAS device
+ cache_types to "write through":
.. code-block:: console
less likelihood of overwhelming network interfaces.
Consider each host's percentage of the cluster's overall
capacity. If the percentage supplied by a particular host is large and the host
-fails, the cluster often experiences problems such as recovery causing OSDs to exceed the
-``full ratio``, which in turn causes Ceph to halt operations to prevent data
-loss.
+fails, the cluster often experiences problems such as recovery causing OSDs to
+exceed the ``full ratio``, which in turn causes Ceph to halt operations to
+prevent data loss.
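The effect of one host's share of capacity can be estimated with a rough calculation. This sketch makes simplifying assumptions: data is spread evenly, all data from the failed host is recovered onto the surviving hosts, and the threshold of 0.95 is only an example value (check your cluster's setting). The function name is hypothetical. It reports average utilization only; individual OSDs can fill earlier.

```python
FULL_RATIO = 0.95  # assumed threshold for this example


def utilization_after_host_loss(used_tib, host_capacities_tib, failed_host):
    """Return average cluster utilization after a failed host's data is recovered."""
    remaining = sum(host_capacities_tib) - host_capacities_tib[failed_host]
    if remaining <= 0:
        raise ValueError("no surviving capacity")
    return used_tib / remaining


# 300 TiB of data on four large hosts versus ten smaller hosts.
few_large = utilization_after_host_loss(300.0, [100.0] * 4, failed_host=0)
many_small = utilization_after_host_loss(300.0, [40.0] * 10, failed_host=0)
print(few_large > FULL_RATIO, many_small > FULL_RATIO)
```

With four 100 TiB hosts, losing one host pushes the survivors to 100% of capacity. With ten 40 TiB hosts holding the same data, the survivors end at about 83%.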
When you run multiple OSDs per host, you also need to ensure that the kernel
-is up to date. See `OS Recommendations`_ for notes on ``glibc`` and
+is up to date. See :ref:`os-recommendations` for notes on ``glibc`` and
``syncfs(2)`` to ensure that your hardware performs as expected when running
multiple OSDs per host.
-----
It takes three hours to replicate 1 TiB of data across a 1 Gb/s network and it
-takes thirty hours to replicate 10 TiB across a 1 Gb/s network. But it takes only
-twenty minutes to replicate 1 TiB across a 10 Gb/s network, and only
+takes thirty hours to replicate 10 TiB across a 1 Gb/s network. But it takes
+only twenty minutes to replicate 1 TiB across a 10 Gb/s network, and only
three hours to replicate 10 TiB across a 10 Gb/s network.
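
The figures above can be sanity-checked with a small Python sketch. It computes
only the ideal line-rate time (payload bits divided by link speed), so its
results are lower bounds; the rounded figures in the text presumably leave
room for protocol overhead and other real-world inefficiencies. The function
name is illustrative and not part of any Ceph tool.

```python
TIB = 2**40  # bytes in one tebibyte


def ideal_transfer_seconds(size_bytes, link_gbps):
    """Lower bound on transfer time: payload bits divided by the link rate."""
    return size_bytes * 8 / (link_gbps * 1e9)


for tib, gbps in [(1, 1), (10, 1), (1, 10), (10, 10)]:
    hours = ideal_transfer_seconds(tib * TIB, gbps) / 3600
    print(f"{tib:>2} TiB over {gbps:>2} Gb/s: at least {hours:.2f} h")
```

The ideal time for 1 TiB over a 1 Gb/s link works out to about 2.4 hours, and
to about 15 minutes over a 10 Gb/s link, which is consistent with the
rounded-up values quoted above.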
Note that a 40 Gb/s network link is effectively four 10 Gb/s channels in
-parallel, and that a 100Gb/s network link is effectively four 25 Gb/s channels
+parallel, and that a 100 Gb/s network link is effectively four 25 Gb/s channels
in parallel. Thus, and perhaps somewhat counterintuitively, an individual
packet on a 25 Gb/s network has slightly lower latency compared to a 40 Gb/s
network.
switches. The added expense of this hardware may be offset by the operational
cost savings on network setup and maintenance. When using VLANs to handle VM
traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack,
-etc.), there is additional value in using 10 Gb/s Ethernet or better; 40 Gb/s or
-increasingly 25/50/100 Gb/s networking as of 2022 is common for production clusters.
+etc.), there is additional value in using 10 Gb/s Ethernet or better; 40 Gb/s
+or increasingly 25/50/100 Gb/s networking as of 2022 is common for
+production clusters.
Top-of-rack (TOR) switches also need fast and redundant uplinks to
core / spine network switches or routers, often at least 40 Gb/s.
Well-known examples are iDRAC (Dell), CIMC (Cisco UCS), and iLO (HPE).
Administration and deployment tools may also use BMCs extensively, especially
via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band
-network for security and administration. Hypervisor SSH access, VM image uploads,
-OS image installs, management sockets, etc. can impose significant loads on a network.
-Running multiple networks may seem like overkill, but each traffic path represents
-a potential capacity, throughput and/or performance bottleneck that you should
-carefully consider before deploying a large scale data cluster.
+network for security and administration. Hypervisor SSH access, VM image
+uploads, OS image installs, management sockets, etc. can impose significant
+loads on a network. Running multiple networks may seem like overkill, but each
+traffic path represents a potential capacity, throughput and/or performance
+bottleneck that you should carefully consider before deploying a large scale
+data cluster.
-Additionally, BMCs as of 2025 rarely offer network connections faster than 1 Gb/s,
-so dedicated and inexpensive 1 Gb/s switches for BMC administrative traffic
-may reduce costs by wasting fewer expensive ports on faster host switches.
+Additionally, BMCs as of 2025 rarely offer network connections faster than
+1 Gb/s, so dedicated and inexpensive 1 Gb/s switches for BMC administrative
+traffic may reduce costs by wasting fewer expensive ports on faster
+host switches.
Failure Domains
===============
-A failure domain can be thought of as any component loss that prevents access to
-one or more OSDs or other Ceph daemons. These could be a stopped daemon on a host;
-a storage drive failure, an OS crash, a malfunctioning NIC, a failed power supply,
-a network outage, a power outage, and so forth. When planning your hardware
-deployment, you must balance the risk of reducing costs by placing too many
-responsibilities into too few failure domains against the added costs of
-isolating every potential failure domain.
+A failure domain can be thought of as any component loss that prevents access
+to one or more OSDs or other Ceph daemons. Examples include a stopped daemon
+on a host, a storage drive failure, an OS crash, a malfunctioning NIC, a
+failed power supply, a network outage, a power outage, and so forth. When
+planning your hardware deployment, you must balance the risk of reducing costs
+by placing too many responsibilities into too few failure domains against the
+added costs of isolating every potential failure domain.
Minimum Hardware Recommendations
There are many factors that influence resource choices. The
minimum resources that suffice for one purpose will not necessarily suffice for
-another. A sandbox cluster with one OSD built on a laptop with VirtualBox or on
-a trio of Raspberry PIs will get by with fewer resources than a production
-deployment with a thousand OSDs serving five thousand of RBD clients. The
+another. A sandbox cluster with one OSD built on a laptop with VirtualBox or
+on a trio of Raspberry Pis will get by with fewer resources than a production
+deployment with a thousand OSDs serving five thousand RBD clients. The
classic Fisher Price PXL 2000 captures video, as does an IMAX or RED camera.
One would not expect the former to do the job of the latter. We especially
cannot stress enough the criticality of using enterprise-quality storage
Additional insights into resource planning for production clusters are
found above and elsewhere within this documentation.
-+--------------+----------------+-----------------------------------------+
-| Process      | Criteria       | Bare Minimum and Recommended            |
-+==============+================+=========================================+
-| ``ceph-osd`` | Processor      | - 1 min, 3 recommended threads per HDD  |
-|              |                |   OSD. 4, 6 respectively for NVMe SSD   |
-|              |                |   OSDs.                                 |
-|              |                |                                         |
-|              |                | * Results are before replication.       |
-|              |                | * Results may vary across CPU and drive |
-|              |                |   models and Ceph configuration:        |
-|              |                |   (erasure coding, compression, etc)    |
-|              |                | * ARM processors specifically may       |
-|              |                |   require more cores for performance.   |
-|              |                | * SSD OSDs, especially NVMe, will       |
-|              |                |   benefit from additional cores per OSD.|
-|              |                | * Actual performance depends on many    |
-|              |                |   factors including drives, net, and    |
-|              |                |   client throughput and latency.        |
-|              |                |   Benchmarking is highly recommended.   |
-|              +----------------+-----------------------------------------+
-|              | RAM            | - 4GB+ per daemon (more is better)      |
-|              |                | - 2-4GB may function but will be slow   |
-|              |                | - Less than 2GB is not recommended      |
-|              +----------------+-----------------------------------------+
-|              | Storage Drives | 1x storage drive per OSD in most cases. |
-|              |                | PCIe Gen 4+ SSDs larger than 30 TB may  |
-|              |                | benefit from being split into two or    |
-|              |                | more OSDs.                              |
-|              +----------------+-----------------------------------------+
-|              | DB/WAL offload | 1x SSD partition per HDD OSD            |
-|              | (optional)     | 4-5x HDD OSDs per DB/WAL SATA SSD       |
-|              |                | <= 15 HDD OSDs per DB/WAL NVMe SSD      |
-|              +----------------+-----------------------------------------+
-|              | Network        | 1x 1Gb/s (bonded 25+ Gb/s recommended)  |
-+--------------+----------------+-----------------------------------------+
-| ``ceph-mon`` | Processor      | - 2 cores minimum                       |
-|              +----------------+-----------------------------------------+
-|              | RAM            | 5GB+ per daemon (large / production     |
-|              |                | clusters need more)                     |
-|              +----------------+-----------------------------------------+
-|              | Storage        | 100 GB per daemon, SSD strongly urged   |
-|              +----------------+-----------------------------------------+
-|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
-+--------------+----------------+-----------------------------------------+
-| ``ceph-mds`` | Processor      | - 2 cores minimum, higher freq is       |
-|              |                |   better than more cores                |
-|              +----------------+-----------------------------------------+
-|              | RAM            | 8+ GiB per daemon                       |
-|              +----------------+-----------------------------------------+
-|              | Network        | 1x 1Gb/s (10+ Gb/s recommended)         |
-+--------------+----------------+-----------------------------------------+
++--------------+----------------+------------------------------------------+
+| Process      | Criteria       | Bare Minimum and Recommended             |
++==============+================+==========================================+
+| ``ceph-osd`` | Processor      | - 1 min, 3 recommended threads per HDD   |
+|              |                |   OSD. 4, 6 respectively for NVMe SSD    |
+|              |                |   OSDs.                                  |
+|              |                |                                          |
+|              |                | * Results are before replication.        |
+|              |                | * Results may vary across CPU and drive  |
+|              |                |   models and Ceph configuration:         |
+|              |                |   (erasure coding, compression, etc)     |
+|              |                | * ARM processors specifically may        |
+|              |                |   require more cores for performance.    |
+|              |                | * SSD OSDs, especially NVMe, will        |
+|              |                |   benefit from additional cores per OSD. |
+|              |                | * Actual performance depends on many     |
+|              |                |   factors including drives, net, and     |
+|              |                |   client throughput and latency.         |
+|              |                |   Benchmarking is highly recommended.    |
+|              +----------------+------------------------------------------+
+|              | RAM            | - >= 4 GB per daemon (more is better)    |
+|              |                | - 2-4 GB may function but will be slow   |
+|              |                | - Less than 2 GB is not recommended      |
+|              +----------------+------------------------------------------+
+|              | Storage Drives | 1x storage drive per OSD in most cases.  |
+|              |                | PCIe Gen 4+ SSDs larger than 30 TB may   |
+|              |                | benefit from being split into two or     |
+|              |                | more OSDs.                               |
+|              +----------------+------------------------------------------+
+|              | DB/WAL offload | 1x SSD partition per HDD OSD             |
+|              | (optional)     | 4-5x HDD OSDs per DB/WAL SATA SSD        |
+|              |                | <= 15 HDD OSDs per DB/WAL NVMe SSD       |
+|              +----------------+------------------------------------------+
+|              | Network        | 1x 1 Gb/s (bonded 25+ Gb/s recommended)  |
++--------------+----------------+------------------------------------------+
+| ``ceph-mon`` | Processor      | - 2 cores minimum                        |
+|              +----------------+------------------------------------------+
+|              | RAM            | >= 5 GB per daemon (large / production   |
+|              |                | clusters need more)                      |
+|              +----------------+------------------------------------------+
+|              | Storage        | 100 GB per daemon, SSD strongly urged    |
+|              +----------------+------------------------------------------+
+|              | Network        | 1x 1 Gb/s (10+ Gb/s recommended)         |
++--------------+----------------+------------------------------------------+
+| ``ceph-mds`` | Processor      | - 2 cores minimum, higher freq is        |
+|              |                |   better than more cores                 |
+|              +----------------+------------------------------------------+
+|              | RAM            | >= 8 GiB per daemon                      |
+|              +----------------+------------------------------------------+
+|              | Network        | 1x 1 Gb/s (10+ Gb/s recommended)         |
++--------------+----------------+------------------------------------------+
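
To show how the ``ceph-osd`` rows of the table might be applied to a single
host, here is a minimal Python sketch. The per-OSD thread and RAM figures are
copied from the table; the function name, dictionary layout, and example host
are hypothetical. The sketch ignores co-located daemons, the operating system,
and any workload-specific tuning, so treat the result as a starting point for
benchmarking rather than a sizing rule.

```python
# (minimum, recommended) CPU threads per OSD, taken from the table above.
THREADS_PER_OSD = {"hdd": (1, 3), "nvme": (4, 6)}
RAM_GB_PER_OSD = 4  # ">= 4 GB per daemon" from the table


def osd_host_budget(osd_counts):
    """Sum the table's per-OSD figures for a host with the given OSD counts."""
    return {
        "min_threads": sum(THREADS_PER_OSD[k][0] * n for k, n in osd_counts.items()),
        "recommended_threads": sum(
            THREADS_PER_OSD[k][1] * n for k, n in osd_counts.items()
        ),
        "ram_gb": RAM_GB_PER_OSD * sum(osd_counts.values()),
    }


# Hypothetical host with twelve HDD OSDs and two NVMe OSDs.
print(osd_host_budget({"hdd": 12, "nvme": 2}))
```

For the hypothetical host shown, the table's figures add up to 20 threads at
minimum, 48 threads recommended, and 56 GB of RAM for the OSD daemons alone.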
.. tip:: When running an OSD node with a single storage drive, create a
partition for your OSD that is separate from the partition
.. _Ceph blog: https://ceph.io/en/news/blog/
.. _Ceph Write Throughput 1: https://ceph.io/en/news/blog/2013/ceph-performance-part-1-disk-controller-write-throughput/
.. _Ceph Write Throughput 2: https://ceph.io/en/news/blog/2013/ceph-performance-part-2-write-throughput-without-ssd-journals/
-.. _Mapping Pools to Different Types of OSDs: ../../rados/operations/crush-map#placing-different-pools-on-different-osds
-.. _OS Recommendations: ../os-recommendations
.. _Storage Networking Industry Association's Total Cost of Ownership calculator: https://www.snia.org/forums/cmsi/programs/TCOcalc
.. _Werner Fischer's blog post on partition alignment: https://www.thomas-krenn.com/en/wiki/Partition_Alignment_detailed_explanation