^^^^^^^^^
SQLite allows configuring the page size prior to creating a new database. It is
-advisable to increase this config to 65536 (64K) when using RADOS backed
+advisable to increase this config to 65536 (64K) when using RADOS-backed
databases to reduce the number of OSD reads/writes and thereby improve
throughput and latency.
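As a minimal sketch of the page-size setting, the following uses Python's
standard ``sqlite3`` module against a local temporary file rather than a
RADOS-backed database. That substitution is an assumption made so that the
example is self-contained, and the helper name ``create_db_with_page_size`` is
hypothetical. The point it illustrates is that the page size has to be set
before the first table is created.

```python
import os
import sqlite3
import tempfile


def create_db_with_page_size(path, page_size=65536):
    """Create a new SQLite database at `path` and return its effective page size."""
    conn = sqlite3.connect(path)
    try:
        # The page size only takes effect if it is set before the database
        # file is first written (that is, before any table exists).
        conn.execute(f"PRAGMA page_size = {int(page_size)}")
        conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, payload BLOB)")
        conn.commit()
        # Read the value back to confirm what SQLite actually applied.
        return conn.execute("PRAGMA page_size").fetchone()[0]
    finally:
        conn.close()


# Local stand-in for a new database; a temporary directory keeps it disposable.
with tempfile.TemporaryDirectory() as tmp:
    effective = create_db_with_page_size(os.path.join(tmp, "demo.db"))
    print(effective)
```

Reading the value back with ``PRAGMA page_size`` is a cheap way to confirm that
the setting was applied to the new database.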
Export or Extract Database out of RADOS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The database is striped on RADOS and can be extracted using the RADOS cli toolset.
+The database is striped on RADOS and can be extracted using the RADOS CLI toolset.
.. code:: sh
Getting librados for PHP
--------------------------
+------------------------
To install the ``librados`` extension for PHP, you need to execute the following procedure:
------------
Java requires you to specify the user ID (``admin``) or user name
-(``client.admin``), and uses the ``ceph`` cluster name by default . The Java
+(``client.admin``), and uses the ``ceph`` cluster name by default. The Java
binding converts C++-based errors into exceptions.
.. code-block:: java
Compile the source; then, run it. If you have copied the JAR to
-``/usr/share/java`` and sym linked from your ``ext`` directory, you won't need
+``/usr/share/java`` and symlinked from your ``ext`` directory, you won't need
to specify the classpath. For example:
.. prompt:: bash $
===============
You can create your own Ceph client using Python. The following tutorial will
-show you how to import the Ceph Python module, connect to a Ceph cluster, and
+show you how to import the Ceph Python module, connect to a Ceph cluster, and
perform object operations as a ``client.admin`` user.
.. note:: To use the Ceph Python bindings, you must have access to a
Reading from and writing to the Ceph Storage Cluster requires an input/output
context (ioctx). You can create an ioctx with the ``open_ioctx()`` or
``open_ioctx2()`` method of the ``Rados`` class. The ``ioctx_name`` parameter
-is the name of the pool and ``pool_id`` is the ID of the pool you wish to use.
+is the name of the pool and ``pool_id`` is the ID of the pool you wish to use.
.. code-block:: python
:linenos:
ioctx.remove_object("hw")
-Writing and Reading XATTRS
+Writing and Reading XATTRs
--------------------------
Once you create an object, you can write extended attributes (XATTRs) to
When CephX is enabled, Ceph will look for the keyring in the default search
path: this path includes ``/etc/ceph/$cluster.$name.keyring``. It is possible
-to override this search-path location by adding a ``keyring`` option in the
+to override this search path location by adding a ``keyring`` option in the
``[global]`` section of your :ref:`Ceph configuration <configuring-ceph>`
file, but this is not recommended.
.. note:: The monitor keyring (that is, ``mon.``) contains a key but no
capabilities, and this keyring is not part of the cluster ``auth`` database.
-The daemon's data-directory locations default to directories of the form::
+The daemon's data directory locations default to directories of the form::
/var/lib/ceph/$type/$cluster-$id
control. You can enable or disable signatures for service messages between
clients and Ceph, and for messages between Ceph daemons.
-Note that even when signatures are enabled data is not encrypted in flight.
+Note that even when signatures are enabled, data is not encrypted in flight.
``cephx_require_signatures``
ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>
-.. note:: The option ``--data`` can take as its argument any of the the
+.. note:: The option ``--data`` can take as its argument any of the
following devices: logical volumes specified using *vg/lv* notation,
existing logical volumes, and GPT partitions.
ceph-volume lvm create --bluestore --data /dev/sda
If the devices to be used for a BlueStore OSD are pre-created logical volumes,
-then the :ref:`ceph-volume-lvm` call for an logical volume named
+then the :ref:`ceph-volume-lvm` call for a logical volume named
``ceph-vg/block-lv`` is as follows:
.. prompt:: bash $
recommend placing ``block.db`` on the faster device while ``block`` (that is,
the data) is stored on the slower device (that is, the rotational drive).
-You must create these volume groups and these logical volumes manually. as The
+You must create these volume groups and these logical volumes manually. The
``ceph-volume`` tool is currently unable to create them automatically.
The following procedure illustrates the manual creation of volume groups and
certain conditions are met: TCMalloc must be configured as the memory allocator
and the ``bluestore_cache_autotune`` configuration option must be enabled (note
that it is currently enabled by default). When automatic cache sizing is in
-effect, BlueStore attempts to keep OSD heap-memory usage under a certain target
+effect, BlueStore attempts to keep OSD heap memory usage under a certain target
size (as determined by ``osd_memory_target``). This approach makes use of a
best-effort algorithm and caches do not shrink smaller than the size defined by
the value of ``osd_memory_cache_min``. Cache ratios are selected in accordance
For more information about the *compressible* and *incompressible* I/O hints,
see :c:func:`rados_set_alloc_hint`.
-Note that data in Bluestore will be compressed only if the data chunk will be
+Note that data in BlueStore will be compressed only if the data chunk will be
sufficiently reduced in size (as determined by the ``bluestore compression
required ratio`` setting). No matter which compression modes have been used, if
the data chunk is too big, then it will be discarded and the original
:confval:`bluestore_min_alloc_size_ssd`, depending on the OSD's ``rotational``
attribute. Thus if an OSD is created on an HDD, BlueStore is initialized with
the current value of :confval:`bluestore_min_alloc_size_hdd`; but with SSD OSDs
-(including NVMe devices), Bluestore is initialized with the current value of
+(including NVMe devices), BlueStore is initialized with the current value of
:confval:`bluestore_min_alloc_size_ssd`.
In Mimic and earlier releases, the default values were 64KB for rotational
media (HDD) and 16KB for non-rotational media (SSD). The Octopus release
-changed the the default value for non-rotational media (SSD) to 4KB, and the
+changed the default value for non-rotational media (SSD) to 4KB, and the
Pacific release changed the default value for rotational media (HDD) to 4KB.
These changes were driven by space amplification that was experienced by Ceph
-RADOS GateWay (RGW) deployments that hosted large numbers of small files
+RADOS Gateway (RGW) deployments that hosted large numbers of small files
(S3/Swift objects).
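The space amplification behind these changes can be illustrated with simple
arithmetic. The sketch below assumes that each stored object occupies a whole
number of allocation units on a single OSD, and it ignores replication, erasure
coding, and metadata overhead. The function name ``allocated_bytes`` is
hypothetical.

```python
def allocated_bytes(object_size, min_alloc_size):
    """Round object_size up to a whole number of allocation units."""
    if object_size <= 0:
        return 0
    units = -(-object_size // min_alloc_size)  # ceiling division
    return units * min_alloc_size


KB = 1024
object_size = 1 * KB
# Compare the older defaults (64KB, 16KB) with the newer 4KB default.
for min_alloc in (64 * KB, 16 * KB, 4 * KB):
    used = allocated_bytes(object_size, min_alloc)
    print(
        f"min_alloc_size={min_alloc // KB}KB: 1KB object occupies "
        f"{used // KB}KB ({used // object_size}x)"
    )
```

Under these assumptions, the smaller the allocation unit, the less space a
small object wastes.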
For example, when an RGW client stores a 1 KB S3 object, that object is written
- :term:`Ceph OSD Daemon` (``ceph-osd``)
A Ceph Storage Cluster that deploys the :term:`Ceph File System` also runs
-at least one :term:`Ceph Metadata Server` (``ceph-mds``). A Cluster that
+at least one :term:`Ceph Metadata Server` (``ceph-mds``). A cluster that
deploys :term:`Ceph Object Storage` runs Ceph RADOS Gateway daemons
(``radosgw``).
Each Ceph daemon and client pulls configuration option values from one or more
of the sources listed below. Option values found via sources later in the list
-will override any found in sources ealier in the list. In other words,
+will override any found in sources earlier in the list. In other words,
the last value wins.
- The compiled-in default value
Bootstrap options enable each Ceph daemon
to contact the Monitors, to authenticate, and to retrieve central
-configuration values. For this reason, these options are ususally stored locally
+configuration values. For this reason, these options are usually stored locally
on each node in a local configuration file. These options
include the following:
Here ``$cluster`` is the cluster's name (default: ``ceph``).
The Ceph configuration file uses an ``ini`` style syntax. One may add comment
-text after a pound sign (#) or a semi-colon semicolon (;). For example:
+text after a pound sign (#) or a semicolon (;). For example:
.. code-block:: ini
# <--A number sign (#) precedes a comment.
; A comment may be anything.
- # Comments always follow a semi-colon semicolon (;) or a pound sign (#) on each line.
+ # Comments always follow a semicolon (;) or a pound sign (#) on each line.
# The end of the line terminates a comment.
# We recommend that you provide comments in your configuration file(s).
A 64-bit signed integer. Some SI suffixes are supported, including ``K``, ``M``,
``G``, ``T``, ``P``, and ``E``. These represent, respectively, 10\ :sup:`3`, 10\ :sup:`6`,
- 10\ :sup:`9`, etc.). ``B`` (bytes)is the only supported unit string. Thus ``1K``, ``1M``,
+ 10\ :sup:`9`, etc. ``B`` (bytes) is the only supported unit string. Thus ``1K``, ``1M``,
``128B`` and ``-1`` are all valid option values. When a negative value is
assigned to an option that defines a threshold or limit, this often indicates that the value is
"unlimited" -- that is, no threshold or limit will be enforced. Options that
source precedence.
In addition, options may have a *mask* associated with them to further restrict to
-which daemons or clients the option's value applies.. Masks take two forms:
+which daemons or clients the option's value applies. Masks take two forms:
#. ``type:location`` where ``type`` is a CRUSH bucket type, for example ``rack`` or
``host``, and ``location`` is a value for that property. For example,
``host:foo`` would limit the option only to daemons or clients
running on a host named ``foo``. Recent Ceph releases provide functionality
that obviates most situations that formerly required host-specific configuration
- values. Examples include OSD device classses, the ``osd_memory_target`` autotuner,
+ values. Examples include OSD device classes, the ``osd_memory_target`` autotuner,
and options with values that are specific to certain media. Examples
of the latter include ``osd_recovery_sleep_ssd`` and ``osd_recovery_max_active_hdd``.
You can show all settings for a daemon that is currently running by connecting
to the admin socket on the host where it runs. For example, to dump all
current settings for ``osd.1701``, run the following command on the host
-where ``osd.1701`` runs. The host whre a daemon runs can be determined with
+where ``osd.1701`` runs. The host where a daemon runs can be determined with
the ``ceph osd find`` or ``ceph orch ps`` command.
.. prompt:: bash #
The `Hardware Recommendations`_ section provides some hardware guidelines for
configuring a Ceph Storage Cluster. It is possible for a single :term:`Ceph
Node` to run multiple daemons. For example, a single node with multiple drives
-ususally runs one ``ceph-osd`` for each drive. Ideally, each node will be
+usually runs one ``ceph-osd`` for each drive. Ideally, each node will be
assigned to a particular type of process. For example, some nodes might run
``ceph-osd`` daemons, other nodes might run ``ceph-mds`` daemons, and still
other nodes might run ``ceph-mon`` daemons.
.. confval:: tmp_dir
The ``$TMPDIR`` environment variable is used to initialize the config, if
-present, but may be overriden on the command-line. A default may also
+present, but may be overridden on the command line. A default may also
be set for the cluster using the usual ``ceph config`` API.
The template for the temporary files created by daemons is controlled
Extended Attributes
===================
-Extended Attributes (XATTRs) are important for Filestore OSDs. However, Certain
+Extended Attributes (XATTRs) are important for Filestore OSDs. However, certain
disadvantages can occur when the underlying file system is used for the storage
of XATTRs: some file systems have limits on the number of bytes that can be
stored in XATTRs, and your file system might in some cases therefore run slower
``filestore_queue_max_ops``
-:Description: Defines the maximum number of in-progress operations that Filestore accepts before it blocks the queueing of any new operations.
+:Description: Defines the maximum number of in-progress operations that Filestore accepts before it blocks the queuing of any new operations.
:Type: Integer
:Required: No. Minimal impact on performance.
:Default: ``50``
Filestore is preferred for new deployments.
- **Speed:** The journal enables the Ceph OSD Daemon to commit small writes
- quickly. Ceph writes small, random i/o to the journal sequentially, which
+ quickly. Ceph writes small, random I/O to the journal sequentially, which
tends to speed up bursty workloads by allowing the backing file system more
time to coalesce writes. The Ceph OSD Daemon's journal, however, can lead
to spiky performance with short spurts of high-speed writes followed by
The *balanced* profile is the default mClock profile. This profile allocates
equal reservation/priority to client operations and background recovery
operations. Background best-effort ops are given lower reservation and therefore
-take a longer time to complete when are are competing operations. This profile
+take a longer time to complete when there are competing operations. This profile
helps meet the normal/steady-state requirements of the cluster. This is the
case when external client performance requirement is not critical and there are
other background operations that still need attention within the OSD.
This profile optimizes client performance over background activities by
allocating more reservation and limit to client operations as compared to
background operations in the OSD. This profile, for example, may be enabled
-to provide the needed performance for I/O intensive applications for a
+to provide the needed performance for I/O-intensive applications for a
sustained period of time at the cost of slower recoveries. The table shows
the resource control parameters set by the profile:
^^^^^^^^^^^^^^^^^
This profile optimizes background recovery performance as compared to external
clients and other background operations within the OSD. This profile, for
-example, may be enabled by an administrator temporarily to speed-up background
+example, may be enabled by an administrator temporarily to speed up background
recoveries during non-peak hours. The table shows the resource control
parameters set by the profile:
mClock Config Options
---------------------
.. important:: These defaults cannot be changed using any of the config
- subsytem commands like *config set* or via the *config daemon* or *config
+ subsystem commands like *config set* or via the *config daemon* or *config
tell* interfaces. Although the above command(s) report success, the mclock
QoS parameters are reverted to their respective built-in profile defaults.
override the ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option as described
in the `Set or Override Max IOPS Capacity of an OSD`_ section.
-This step is highly recommended until an alternate mechansim is worked upon.
+This step is highly recommended until an alternate mechanism is worked upon.
Steps to Manually Benchmark an OSD (Optional)
=============================================
Looking up Monitors through DNS
===============================
-Since Ceph version 11.0.0 (Kraken), RADOS has supported looking up monitors
+Since Ceph version 11.0.0 (Kraken), RADOS has supported looking up Monitors
through DNS.
-The addition of the ability to look up monitors through DNS means that daemons
+The addition of the ability to look up Monitors through DNS means that daemons
and clients do not require a *mon host* configuration directive in their
``ceph.conf`` configuration file.
With a DNS update, clients and daemons can be made aware of changes
-in the monitor topology. To be more precise and technical, clients look up the
-monitors by using ``DNS SRV TCP`` records.
+in the Monitor topology. To be more precise and technical, clients look up the
+Monitors by using ``DNS SRV TCP`` records.
By default, clients and daemons look for the TCP service called *ceph-mon*,
which is configured by the *mon_dns_srv_name* configuration directive.
Example
-------
-When the DNS search domain is set to *example.com* a DNS zone file might contain the following elements.
+When the DNS search domain is set to *example.com*, a DNS zone file might contain the following elements.
First, create records for the Monitors, either IPv4 (A) or IPv6 (AAAA).
Now all Monitors are running on port *6789*, with priorities 10, 10, 20 and weights 20, 30, 50 respectively.
-Monitor clients choose monitor by referencing the SRV records. If a cluster has multiple Monitor SRV records
+Monitor clients choose a Monitor by referencing the SRV records. If a cluster has multiple Monitor SRV records
with the same priority value, clients and daemons will load balance the connections to Monitors in proportion
to the values of the SRV weight fields.
-For the above example, this will result in approximate 40% of the clients and daemons connecting to mon1,
+For the above example, this will result in approximately 40% of the clients and daemons connecting to mon1
and approximately 60% of them connecting to mon2. However, if neither of them is reachable, then mon3 will be considered as a fallback.
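The percentages above follow from the SRV weights of the records that share the
lowest priority value. A small sketch of that arithmetic, using the priorities
and weights from this example (the function name and the record layout are
illustrative and are not part of Ceph):

```python
def srv_connection_shares(records):
    """Return {name: share} for the records with the lowest priority value.

    `records` maps a name to a (priority, weight) tuple. Records with a
    higher priority value are fallbacks and receive no share here.
    """
    best = min(priority for priority, _ in records.values())
    preferred = {n: w for n, (p, w) in records.items() if p == best}
    total = sum(preferred.values())
    if total == 0:
        # All weights are zero: treat the preferred records as equals.
        return {n: 1 / len(preferred) for n in preferred}
    return {n: w / total for n, w in preferred.items()}


shares = srv_connection_shares({
    "mon1": (10, 20),
    "mon2": (10, 30),
    "mon3": (20, 50),
})
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Here mon3 does not appear in the result because its priority value is higher,
so it is used only when the preferred Monitors are unreachable.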
See also :ref:`msgr2`.
Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons at random
intervals less than every 6 seconds. If a neighboring Ceph OSD Daemon doesn't
-show a heartbeat within a 20 second grace period, the Ceph OSD Daemon may
+show a heartbeat within a 20-second grace period, the Ceph OSD Daemon may
consider the neighboring Ceph OSD Daemon ``down`` and report it back to a Ceph
Monitor, which will update the Ceph Cluster Map. You may change this grace
period by adding an ``osd heartbeat grace`` setting under the ``[mon]``
By default, two Ceph OSD Daemons from different hosts must report to the Ceph
Monitors that another Ceph OSD Daemon is ``down`` before the Ceph Monitors
-acknowledge that the reported Ceph OSD Daemon is ``down``. But there is chance
+acknowledge that the reported Ceph OSD Daemon is ``down``. But there is a chance
that all the OSDs reporting the failure are hosted in a rack with a bad switch
which has trouble connecting to another OSD. To avoid this sort of false alarm,
we consider the peers reporting a failure a proxy for a potential "subcluster"
all cases, but will sometimes help us localize the grace correction to a subset
of the system that is unhappy. ``mon osd reporter subtree level`` is used to
group the peers into the "subcluster" by their common ancestor type in CRUSH
-map. By default, only two reports from different subtree are required to report
+map. By default, only two reports from different subtrees are required to report
another Ceph OSD Daemon ``down``. You can change the number of reporters from
unique subtrees and the common ancestor type required to report a Ceph OSD
Daemon ``down`` to a Ceph Monitor by adding the ``mon osd min down reporters``
-and ``mon osd reporter subtree level`` settings under the ``[mon]`` section of
+and ``mon osd reporter subtree level`` settings under the ``[mon]`` section of
your Ceph configuration file, or by setting the value at runtime.
OSDs Report Their Status
========================
-If an Ceph OSD Daemon doesn't report to a Ceph Monitor, the Ceph Monitor will
-consider the Ceph OSD Daemon ``down`` after the ``mon osd report timeout``
+If a Ceph OSD Daemon doesn't report to a Ceph Monitor, the Ceph Monitor will
+consider the Ceph OSD Daemon ``down`` after the ``mon osd report timeout``
elapses. A Ceph OSD Daemon sends a report to a Ceph Monitor within 5 seconds of
a reportable event such as a failure, a change in placement group stats, a
change in ``up_thru``, or a boot. You can change the Ceph OSD
What is it
----------
-The messenger v2 protocol, or msgr2, is the second major revision on
+The messenger v2 protocol, or msgr2, is the second major revision of
Ceph's on-wire protocol. It brings with it several key features:
* A *secure* mode that encrypts all data passing over the network
port speaking the legacy v1 protocol. Any address that was
previously shown with any prefix is now shown as a ``v1:`` address.
* **TYPE_ANY** ``any:1.2.3.4:578/89012`` identifies a client that can
- speak either version of the protocol. Prior to nautilus, clients would appear as
+ speak either version of the protocol. Prior to Nautilus, clients would appear as
``1.2.3.4:0/123456``, where the port of 0 indicates they are clients
and do not accept incoming connections. Starting with Nautilus,
these clients are now internally represented by a **TYPE_ANY**
The bracketed list or vector of addresses means that the same daemon can be
reached on multiple ports (and protocols). Any client or other daemon
connecting to that daemon will use the v2 protocol (listed first) if
-possible; otherwise it will back to the legacy v1 protocol. Legacy
+possible; otherwise it will fall back to the legacy v1 protocol. Legacy
clients will only see the v1 addresses and will continue to connect as
they did before, with the v1 protocol.
Prior to Nautilus, a CLI user or daemon will normally discover the
monitors via the ``mon_host`` option in ``/etc/ceph/ceph.conf``. The
syntax for this option has expanded starting with Nautilus to allow
-support the new bracketed list format. For example, an old line
+support for the new bracketed list format. For example, an old line
like::
mon_host = 10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789
=================================
Careful network infrastructure and configuration is critical for building a
-resilient and high performance :term:`Ceph Storage Cluster`. The Ceph Storage
-Cluster does not perform request routing or dispatching on behalf of
+resilient and high-performance :term:`Ceph Storage Cluster`. The Ceph Storage
+Cluster does not perform request routing or dispatching on behalf of
the :term:`Ceph Client`. Instead, Ceph clients make requests directly to Ceph
OSD Daemons. Ceph OSDs perform data replication on behalf of Ceph clients,
which imposes additional load on Ceph networks.
We recommend that for resilience and capacity network interfaces are bonded
and connect to redundant switches. Bonding should be active/active,
-or implement a layer 3 multipath strategy with FRR or similar technlogy. When
+or implement a layer 3 multipath strategy with FRR or similar technology. When
using LACP bonding it is important to consult your organization's network team
-to determine the proper transmit hash policy Usually this is 2+3 or 3+4. The
+to determine the proper transmit hash policy. Usually this is 2+3 or 3+4. The
wrong choice can result in imbalanced network link utilization with a fraction
of the available throughput. Network observability tools including ``bmon``
and ``iftop`` and ``netstat`` are invaluable when ensuring that bond member
sudo iptables -L
Some Linux distributions include rules that reject all inbound requests
-except SSH from all network interfaces. For example::
+except SSH from all network interfaces. For example::
REJECT all -- anywhere anywhere reject-with icmp-host-prohibited
You will need to delete these rules on both your public and cluster networks
initially, and replace them with appropriate rules when you are ready to
-harden the ports on your Ceph Nodes.
+harden the ports on your Ceph nodes.
.. note:: Docker and Podman containers may experience disruption when rules
are adjusted or reloaded. You may find it best to update rules on
network. When you add the rule using the example below, make sure you
replace ``{iface}`` with the public network interface (e.g., ``eth0``,
``eth1``, etc.), ``{ip-address}`` with the IP address of the public
-network and ``{netmask}`` with the netmask for the public network. :
+network and ``{netmask}`` with the netmask for the public network.
.. prompt:: bash $
Public Network
--------------
-The public network configuration allows you specifically define IP addresses
+The public network configuration allows you to specifically define IP addresses
and subnets for the public network. You may specifically assign static IP
addresses or override ``public_network`` settings using the ``public_addr``
setting for a specific daemon.
The cluster network configuration allows you to declare a cluster network, and
specifically define IP addresses and subnets for the cluster network. You may
-specifically assign static IP addresses or override ``cluster_network``
+specifically assign static IP addresses or override ``cluster_network``
settings using the ``cluster_addr`` setting for specific OSD daemons.
-.. confval:: cluster_network_interface
+.. confval:: cluster_network_interface
.. confval:: cluster_network
.. confval:: cluster_addr
In a configuration file, you may specify settings for all Ceph OSD Daemons in
the cluster by adding configuration settings to the ``[osd]`` section of your
configuration file. To add settings directly to a specific Ceph OSD Daemon
-(e.g., ``host``), enter it in an OSD-specific section of your configuration
+(e.g., ``host``), enter it in an OSD-specific section of your configuration
file. For example:
.. code-block:: ini
================
This section applies only to the older Filestore OSD back end. Since Luminous
-BlueStore has been default and preferred.
+BlueStore has been the default and preferred back end.
By default, Ceph expects that you will provision a Ceph OSD Daemon's journal at
the following path, which is usually a symlink to a device or partition::
Core Concepts
`````````````
-Ceph's QoS support is implemented using a queueing scheduler
+Ceph's QoS support is implemented using a queuing scheduler
based on `the dmClock algorithm`_. This algorithm allocates the I/O
resources of the Ceph cluster in proportion to weights, and enforces
the constraints of minimum reservation and maximum limitation, so that
the services can compete for the resources fairly. Currently the
*mclock_scheduler* operation queue divides Ceph services involving I/O
-resources into following buckets:
+resources into the following buckets:
- client op: the iops issued by client
- osd subop: the iops issued by primary OSD
- pg recovery: the recovery related requests
- pg scrub: the scrub related requests
-And the resources are partitioned using following three sets of tags. In other
+And the resources are partitioned using the following three sets of tags. In other
words, the share of each type of service is controlled by three tags:
#. reservation: the minimum IOPS allocated for the service.
In Ceph, operations are graded with "cost". And the resources allocated
for serving various services are consumed by these "costs". So, for
-example, the more reservation a services has, the more resource it is
+example, the more reservation a service has, the more resource it is
guaranteed to possess, as long as it requires. Assuming there are 2
services: recovery and client ops:
Various organizations and individuals are currently experimenting with
mClock as it exists in this code base along with their modifications
-to the code base. We hope you'll share you're experiences with your
+to the code base. We hope you'll share your experiences with your
mClock and dmClock experiments on the ``ceph-devel`` mailing list.
.. confval:: osd_async_recovery_min_cost
intensive.
To maintain operational performance, Ceph performs recovery with limitations on
-the number recovery requests, threads and object chunk sizes which allows Ceph
-perform well in a degraded state.
+the number of recovery requests, threads and object chunk sizes which allows Ceph
+to perform well in a degraded state.
.. note:: Some of these settings are automatically reset if the `mClock`_
scheduler is active, see `mClock backfill`_.
osd_pool_default_min_size = 2 # Accept an I/O operation to a PG that has two copies of an object.
# Note: by default, PG autoscaling is enabled and this value is used only
- # in specific circumstances. It is however still recommend to set it.
+ # in specific circumstances. It is however still recommended to set it.
# Ensure you have a realistic number of placement groups. We recommend
# approximately 100 per OSD. E.g., total number of OSDs multiplied by 100
# divided by the number of replicas (i.e., 'osd_pool_default_size'). So for
Storage Clusters consist of several types of daemons:
1. a :term:`Ceph OSD Daemon` (OSD) stores data as objects on a storage node
- 2. a :term:`Ceph Monitor` (MON) maintains a master copy of the cluster map.
- 3. a :term:`Ceph Manager` manager daemon
+ 2. a :term:`Ceph Monitor` (MON) maintains a master copy of the cluster map
+ 3. a :term:`Ceph Manager` daemon
A Ceph Storage Cluster might contain thousands of storage nodes. A
minimal system has at least one Ceph Monitor and two Ceph OSD
<h3>APIs</h3>
Most Ceph deployments use `Ceph Block Devices`_, `Ceph Object Storage`_ and/or the
- `Ceph File System`_. You may also develop applications that talk directly to
+ `Ceph File System`_. You may also develop applications that talk directly to
the Ceph Storage Cluster.
.. toctree::
and it moves that bucket underneath any other buckets that you have
specified. **Important:** If you specify only the root bucket, the command
will attach the OSD directly to the root, but CRUSH rules expect OSDs to be
- inside of hosts. If the OSDs are not inside hosts, the OSDS will likely not
+ inside of hosts. If the OSDs are not inside hosts, the OSDs will likely not
receive any data.
.. prompt:: bash $
ceph config set mgr target_max_misplaced_ratio .03 # 3%
-A larger value may increase the speed of cluster balancing / convergence
+A larger value may increase the speed of cluster balancing/convergence
at the potential cost of greater impact on client operations.
There is a separate setting ``upmap_max_deviation`` for how uniform the
below the cluster's average, it will be considered sufficiently balanced.
-This value of PG replicas / shards (as distinct from logical PGs) is reported
+This value of PG replicas/shards (as distinct from logical PGs) is reported
by the ``ceph osd df`` command under the ``PGS`` column and the variance
above or below the average under the ``VAR`` column. It may seem desirable
to specify a perfect or nearly perfect distribution by setting a very low
low value for this setting may result in the balancer shuffling data
forever as it endeavors to meet an impossible expectation.
-That said, clusters with multiple CRUSH device classes and / or OSDs that
+That said, clusters with multiple CRUSH device classes and/or OSDs that
differ in capacity will benefit from a smaller value. In this situation
run a command of the following form:
.. prompt:: bash $
- ceph config set mgr mgr/balancer/upmap_max_deviation 1
+ ceph config set mgr mgr/balancer/upmap_max_deviation 1
This value is reasonable and safe for most clusters. Note that this is
an absolute integer number of PGs, not a percentage.
placement calculation. These ``pg-upmap-primary`` entries provide fine-grained
control over primary PG mappings. This mode optimizes the placement of individual
primary PGs in order to achieve balanced reads, or primary PGs, in a cluster.
- In ``read`` mode, upmap behavior is not excercised, so this mode is best for
- uses cases in which only read balancing is desired.
+ In ``read`` mode, upmap behavior is not exercised, so this mode is best for
+ use cases in which only read balancing is desired.
To use ``pg-upmap-primary``, all clients must be Reef or newer. For more
details about client compatibility, see :ref:`read_balancer`.
tiers. The upstream Ceph community also recommends migrating from legacy
deployments.
-A cache tier provides Ceph Clients with better I/O performance for a subset of
+A cache tier provides Ceph clients with better I/O performance for a subset of
the data stored in a backing storage tier. Cache tiering involves creating a
pool of relatively fast/expensive storage devices (e.g., solid state drives)
configured to act as a cache tier, and a backing pool of either erasure-coded
agent will begin flushing or evicting when either threshold is triggered.
.. note:: All client requests will be blocked only when ``target_max_bytes`` or
- ``target_max_objects`` reached
+ ``target_max_objects`` is reached.
Relative Sizing
~~~~~~~~~~~~~~~
ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}
For example, setting the value to ``0.6`` will begin aggressively flushing dirty
-objects when they reach 60% of the cache pool's capacity. obviously, we'd
+objects when they reach 60% of the cache pool's capacity. Obviously, we'd
better set the value between dirty_ratio and full_ratio:
.. prompt:: bash $
.. prompt:: bash $
- ceph osd pool {cache-tier} cache_min_evict_age {#seconds}
+ ceph osd pool set {cache-tier} cache_min_evict_age {#seconds}
For example, to evict objects after 30 minutes, execute the following:
.. prompt:: bash $
- $ ceph mon set election_strategy {classic|disallow|connectivity}
+ ceph mon set election_strategy {classic|disallow|connectivity}
Choosing a mode
===============
ceph pg dump [--format {format}]
-Here the valid formats are ``plain`` (default), ``json`` ``json-pretty``,
+Here the valid formats are ``plain`` (default), ``json``, ``json-pretty``,
``xml``, and ``xml-pretty``. When implementing monitoring tools and other
tools, it is best to use the ``json`` format. JSON parsing is more
deterministic than the ``plain`` format (which is more human readable), and the
.. note:: Any assigned override reweight value will conflict with the balancer.
This means that when the balancer is in use, all override reweight values
- must be be reset to ``1.0000`` in order to avoid unbalanced usage and
+ must be reset to ``1.0000`` in order to avoid unbalanced usage and
full OSDs. Most clusters with no clients older than the Luminous release
should use the pg-upmap balancer instead of legacy reweighting.
*class* of device. By default, Ceph OSDs automatically classify themselves as
either ``hdd`` or ``ssd`` in accordance with the underlying type of device
being used. These device classes can be customized. One might set the ``device
-class`` of OSDs to ``nvme`` to distinguish the from SATA SSDs, or one might set
+class`` of OSDs to ``nvme`` to distinguish them from SATA SSDs, or one might set
them to something arbitrary like ``ssd-testing`` or ``ssd-ethel`` so that rules
and pools may be flexibly constrained to use (or avoid using) specific subsets
of OSDs based on specific requirements.
steps when an out OSD is encountered and rely on CHOOSELEAF steps to
permit moving OSDs to new hosts. However, CHOOSELEAF rules don't
support more than a single OSD per failure domain. MSR rules, new in
-squid, support multiple OSDs per failure domain by retrying all prior
+Squid, support multiple OSDs per failure domain by retrying all prior
steps when an out OSD is encountered. Using MSR rules requires that
-OSDs and clients be required to support the CRUSH_MSR feature bit
-(squid or newer).
+OSDs and clients support the CRUSH_MSR feature bit
+(Squid or newer).
Deleting rules
* For large clusters, a small percentage of PGs might map to fewer than the
desired number of OSDs. This is known to happen when there are multiple
- hierarchy layers in use (for example,, ``row``, ``rack``, ``host``,
+ hierarchy layers in use (for example, ``row``, ``rack``, ``host``,
``osd``).
* When one or more OSDs are marked ``out``, data tends to be redistributed
Ceph stores, replicates, and rebalances data objects across a RADOS cluster
dynamically. Because different users store objects in different pools for
different purposes on many OSDs, Ceph operations require a certain amount of
-data- placement planning. The main data-placement planning concepts in Ceph
+data-placement planning. The main data-placement planning concepts in Ceph
include:
- **Pools:** Ceph stores data within pools, which are logical groups used for
ceph device ls-by-daemon <daemon>
ceph device ls-by-host <host>
-To see information about the location of an specific device and about how the
+To see information about the location of a specific device and about how the
device is being consumed, run a command of the following form:
.. prompt:: bash $
By default, the clay code plugin picks *d=k+m-1* as it provides the greatest savings in terms
of network bandwidth and disk IO. In the case of the *clay* plugin configured with
-*k=8*, *m=4* and *d=11* when a single OSD fails, d=11 osds are contacted and
+*k=8*, *m=4* and *d=11* when a single OSD fails, d=11 OSDs are contacted and
250MiB is downloaded from each of them, resulting in a total download of 11 X 250MiB = 2.75GiB
amount of information. More general parameters are provided below. The benefits are substantial
when the repair is carried out for a rack that stores information on the order of
+-----------------+----------------------------------+----------------------------------+
-where ``S`` is the amount of data stored of single OSD being recovered.
+where ``S`` is the amount of data stored on a single OSD being recovered.
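
The repair arithmetic above can be checked with a short sketch. It assumes
that each of the *d* helper OSDs sends ``S / (d - k + 1)`` during a
single-OSD repair, that *d* must satisfy ``k+1 <= d <= k+m-1``, and that a
Reed-Solomon style plugin reads ``k * S`` for the same repair. The function
names and the choice of ``S = 1000`` MiB (picked to reproduce the 250MiB
figure) are illustrative assumptions, not part of Ceph.

.. code-block:: python

    def clay_repair_download(s, k, m, d=None):
        """Return (helpers, per_helper, total) for a single-OSD clay repair.

        s is the amount of data stored on the OSD being recovered, in any
        unit; the results are expressed in the same unit.
        """
        if d is None:
            d = k + m - 1  # default: contact the largest allowed helper set
        if not (k + 1 <= d <= k + m - 1):
            raise ValueError("d must satisfy k+1 <= d <= k+m-1")
        per_helper = s / (d - k + 1)
        return d, per_helper, d * per_helper


    def rs_repair_download(s, k):
        """Total download for the same repair when k full chunks are read."""
        return k * s


    # k=8, m=4, d=11 with S assumed to be 1000 MiB
    helpers, per_helper, total = clay_repair_download(1000, k=8, m=4, d=11)
    print(helpers, per_helper, total)
    print(rs_repair_download(1000, k=8))

Under these assumptions the clay repair contacts 11 OSDs for 250 MiB each
(2750 MiB in total), compared with 8000 MiB when eight full chunks are read.
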
``technique={reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion}``
-:Description: The more flexible technique is *reed_sol_van* : it is
+:Description: The more flexible technique is *reed_sol_van*\: it is
enough to set *k* and *m*. The *cauchy_good* technique
- can be faster but you need to chose the *packetsize*
+ can be faster but you need to choose the *packetsize*
carefully. All of *reed_sol_r6_op*, *liberation*,
*blaum_roth*, *liber8tion* are *RAID6* equivalents in
the sense that they can only be configured with *m=2*.
Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
-observed.:
+observed:
.. prompt:: bash $
---------------------------------------
In Firefly the bandwidth reduction will only be observed if the primary
-OSD is in the same rack as the lost chunk.:
+OSD is in the same rack as the lost chunk:
.. prompt:: bash $
It is strictly equivalent to using a *K=2* *M=1* erasure code profile. The *DD*
implies *K=2*, the *c* implies *M=1* and the *isa* plugin is used
-by default.:
+by default:
.. prompt:: bash $
level configuration. The second argument in layers='[ [ "DDc", "" ] ]'
is actually an erasure code profile to be used for this level. The
example below specifies the Jerasure backend with the cauchy technique to
-be used in the lrcpool.:
+be used in the lrcpool:
.. prompt:: bash $
The *shec* plugin encapsulates the `multiple SHEC
<http://tracker.ceph.com/projects/ceph/wiki/Shingled_Erasure_Code_(SHEC)>`_
-library. It allows ceph to recover data more efficiently than Reed Solomon codes.
+library. It allows Ceph to recover data more efficiently than Reed-Solomon codes.
Create an SHEC profile
======================
Space Efficiency
----------------
-Space efficiency is a ratio of data chunks to all ones in a object and
+Space efficiency is the ratio of data chunks to all chunks in an object and is
represented as k/(k+m).
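
As a quick illustration of the ratio, the following sketch evaluates
``k/(k+m)`` for a few profiles. The helper name ``space_efficiency`` and the
sample values of *k* and *m* are hypothetical and chosen only for the example.

.. code-block:: python

    from fractions import Fraction


    def space_efficiency(k, m):
        """Return the fraction of stored chunks that are data chunks."""
        if k < 1 or m < 1:
            raise ValueError("k and m must be positive integers")
        return Fraction(k, k + m)


    # Compare a baseline profile with a larger k and with a larger m.
    for k, m in [(4, 2), (8, 2), (4, 3)]:
        eff = space_efficiency(k, m)
        print(f"k={k} m={m} efficiency={eff} ({float(eff):.3f})")

Raising *k* from 4 to 8 at *m=2* moves the ratio from 2/3 to 4/5, while
raising *m* from 2 to 3 at *k=4* lowers it to 4/7.
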
In order to improve space efficiency, you should increase k or decrease m:
.. note:: CephFS and RGW deployments with a significant proportion
of very small user files/objects may wish to plan carefully as
erasure-coded data pools can result in considerable additional space
- ampliificaton. Both CephFS and RGW support multiple data pools
+ amplification. Both CephFS and RGW support multiple data pools
with different media, performance, and data protection strategies,
which can enable efficient and effective deployments. An RGW
deployment might for example provision a modest complement of
Monitor database is too large (see ``MON_DISK_BIG`` below). Another common
scenario is that Ceph logging subsystem levels have been raised for
troubleshooting purposes without subsequent return to default levels. Ongoing
-verbose logging can easily fill up the files system containing ``/var/log``. If
+verbose logging can easily fill up the file system containing ``/var/log``. If
you trim logs that are currently open, remember to restart or instruct your
syslog or other daemon to re-open the log file. Another common dynamic is
that users or processes have written a large amount of data to ``/tmp``
Low-level cluster operations consist of starting, stopping, and restarting a
particular daemon within a cluster; changing the settings of a particular
-daemon or subsystem; and, adding a daemon to the cluster or removing a daemon
+daemon or subsystem; and adding a daemon to the cluster or removing a daemon
from the cluster. The most common use cases for low-level operations include
growing or shrinking the Ceph cluster and replacing legacy or failed hardware
with new hardware.
reachable (``down``).
If an OSD is ``up``, it may be either ``in`` service (clients can read and
-write data) or it is ``out`` of service. If the OSD was ``in`` but then due to a failure or a manual action was set to the ``out`` state, Ceph will migrate placement groups to the other OSDs to maintin the configured redundancy.
+write data) or it is ``out`` of service. If the OSD was ``in`` but then due to a failure or a manual action was set to the ``out`` state, Ceph will migrate placement groups to the other OSDs to maintain the configured redundancy.
If an OSD is ``out`` of service, CRUSH will not assign placement groups to it.
If an OSD is ``down``, it will also be ``out``.
health statuses yet, because the PGs are in the process of being created and
the OSDs are in the process of peering.
#. You have just added or removed an OSD.
-#. You have just have modified your cluster map.
+#. You have just modified your cluster map.
Checking to see if OSDs are ``up`` and running is an important aspect of monitoring them:
whenever the cluster is up and running, every OSD that is ``in`` the cluster should also
network switch, a NIC failure, or a layer 1 failure.
By default, a heartbeat time that exceeds 1 second (1000 milliseconds) raises a
-health check (a ``HEALTH_WARN``. For example:
+health check (a ``HEALTH_WARN``). For example:
::
When checking a cluster's status (e.g., running ``ceph -w`` or ``ceph -s``),
Ceph will report on the status of the placement groups. A placement group has
one or more states. The optimum state for placement groups in the placement group
-map is ``active + clean``.
+map is ``active+clean``.
*creating*
Ceph is still creating the placement group.
Ceph has not replicated some objects in the placement group the correct number of times yet.
*inconsistent*
- Ceph detects inconsistencies in the one or more replicas of an object in the placement group
+ Ceph detects inconsistencies in one or more replicas of an object in the placement group
(e.g. objects are the wrong size, objects are missing from one replica *after* recovery finished, etc.).
*peering*
- The placement group is undergoing the peering process
+ The placement group is undergoing the peering process.
*repair*
Ceph is checking the placement group and repairing any inconsistencies it finds (if possible).
Ceph is migrating/synchronizing objects and their replicas.
*forced_recovery*
- High recovery priority of that PG is enforced by user.
+ High recovery priority of the PG is enforced by the user.
*recovery_wait*
- The placement group is waiting in line to start recover.
+ The placement group is waiting in line to start recovery.
*recovery_toofull*
A recovery operation is waiting because the destination OSD is over its
recent operations. Backfill is a special case of recovery.
*forced_backfill*
- High backfill priority of that PG is enforced by user.
+ High backfill priority of the PG is enforced by the user.
*backfill_wait*
The placement group is waiting in line to start backfill.
Factors Relevant To Specifying pg_num
=====================================
-Performance and and even data distribution across
+Performance and even data distribution across
OSDs weigh in favor of a higher number of PGs. Conserving CPU resources and
minimizing memory usage weigh in favor of a lower number of PGs.
The latter was more of a concern before Filestore OSDs were deprecated, so
.. describe:: [expected-num-objects]
- The expected number of RADOS objects for this pool. By setting this value and
+ The expected number of RADOS objects for this pool. By setting this value,
you arrange for PG splitting to occur at the time of pool creation and
- avoid the latency impact that accompanies runtime folder splitting.
+ avoid the latency impact that accompanies runtime PG splitting.
:Type: Integer
:Required: No.
.. describe:: allow_ec_optimizations
- :Description: Enables performance and capacity optimizations for an erasure-coded pool. These optimizations were designed for CephFS and RBD workloads; RGW workloads with signficant numbers of small objects or with small random access reads of objects will also benefit. RGW workloads with large sequential read and writes will see little benefit. For more details, see :ref:`rados_ops_erasure_coding_optimizations`:
+ :Description: Enables performance and capacity optimizations for an erasure-coded pool. These optimizations were designed for CephFS and RBD workloads; RGW workloads with significant numbers of small objects or with small random access reads of objects will also benefit. RGW workloads with large sequential reads and writes will see little benefit. For more details, see :ref:`rados_ops_erasure_coding_optimizations`.
:Type: Boolean
.. versionadded:: 20.2.0
ceph osd pool stretch set {pool-name} {peering_crush_bucket_count} {peering_crush_bucket_target} {peering_crush_bucket_barrier} {crush_rule} {size} {min_size} [--yes-i-really-mean-it]
-Here are the break downs of the arguments:
+Here are the breakdowns of the arguments:
.. describe:: {pool-name}
.. describe:: {peering_crush_bucket_count}
- This value is used along with ``peering_crush_bucket_barrier`` to determined whether the set of
+ This value is used along with ``peering_crush_bucket_barrier`` to determine whether the set of
OSDs in the chosen acting set can peer with each other, based on the number of distinct
buckets there are in the acting set.
ceph osd pool stretch show {pool-name}
-Here are the break downs of the argument:
+Here is the breakdown of the argument:
.. describe:: {pool-name}
ceph balancer on
ceph balancer mode <read|upmap-read>
-Both ``read`` and ``upmap-read`` mode make use of ``pg-upmap-primary``. In order
+Both the ``read`` and ``upmap-read`` modes make use of ``pg-upmap-primary``. In order
to use ``pg-upmap-primary``, the cluster cannot have any pre-Reef clients.
If you want to use a different balancer or if you want to make your
If you are working in a vstart cluster, you may pass the ``--vstart`` parameter
as shown above so the CLI commands are formatted with the `./bin/` prefix.
- Note that any time the number of pgs changes (for instance, if the pg autoscaler [:ref:`pg-autoscaler`]
+ Note that any time the number of PGs changes (for instance, if the PG autoscaler [:ref:`pg-autoscaler`]
kicks in), you should consider rechecking the scores and rerunning the balancer if needed.
To see some details about what the tool is doing, you can pass
.. prompt:: bash $
- osdmaptool om --upmap out.txt [--upmap-pool <pool>] \
- [--upmap-max <max-optimizations>] \
- [--upmap-deviation <max-deviation>] \
+ osdmaptool om --upmap out.txt [--upmap-pool <pool>] \
+ [--upmap-max <max-optimizations>] \
+ [--upmap-deviation <max-deviation>] \
[--upmap-active]
It is highly recommended that optimization be done for each pool
:ref:`rados_pools`. Ceph users must have access to a given pool in order to read and
write data, and Ceph users must have execute permissions in order to use Ceph's
administrative commands. The following concepts will help you understand
-Ceph['s] user management.
+Ceph's user management.
.. _rados-ops-user:
Cluster, its pools, and the data within those pools.
Ceph has the concept of a ``type`` of user. For purposes of user management,
-the type will always be ``client``. Ceph identifies users in a "period-
-delimited form" that consists of the user type and the user ID: for example,
+the type will always be ``client``. Ceph identifies users in a "period-delimited
+form" that consists of the user type and the user ID: for example,
``TYPE.ID``, ``client.admin``, or ``client.user1``. The reason for user typing
is that the Cephx protocol is used not only by clients but also non-clients,
such as Ceph Monitors, OSDs, and Metadata Servers. Distinguishing the user type
``class-read``
-:Descriptions: Gives the user the capability to call class read methods.
- Subset of ``x``.
+:Description: Gives the user the capability to call class read methods.
+ Subset of ``x``.
``class-write``
Initializing oprofile
=====================
-``oprofile`` must be initalized the first time it is used. Locate the
+``oprofile`` must be initialized the first time it is used. Locate the
``vmlinux`` image that corresponds to the kernel you are running:
.. prompt:: bash $
.. prompt:: bash
- ceph tell {daemon-type}{daemon-id} heap release
+ ceph tell {daemon-type}.{daemon-id} heap release
For example:
:ref:`failures-osd-peering`).
- Stuck ``unclean`` placement groups usually indicate that something is
preventing recovery from completing, possibly unfound objects (see
- :ref:`failures-osd-unfound`);
+ :ref:`failures-osd-unfound`).
.. _failures-osd-peering: