Each CephFS file system requires at least one MDS. The cluster operator will
generally use their automated deployment tool to launch required MDS servers as
-needed. Rook and ansible (via the ceph-ansible playbooks) are recommended
+needed. Rook and Ansible (via the ceph-ansible playbooks) are recommended
tools for doing this. For clarity, we also show the systemd commands here which
may be run by the deployment technology if executed on bare-metal.
miscellaneous upkeep threads working in tandem.
Even so, it is recommended that an MDS server be well provisioned with an
-advanced CPU with sufficient cores. Development is on-going to make better use
+advanced CPU with sufficient cores. Development is ongoing to make better use
of available CPU cores in the MDS; it is expected in future versions of Ceph
that the MDS server will improve performance by taking advantage of more cores.
$ sudo systemctl stop ceph-mds@${id}
- The MDS will automatically notify the Ceph monitors that it is going down.
- This enables the monitors to perform instantaneous failover to an available
+ The MDS will automatically notify the Ceph Monitors that it is going down.
+ This enables the Monitors to perform instantaneous failover to an available
standby, if one exists. It is unnecessary to use administrative commands to
effect this failover, e.g. through the use of ``ceph mds fail mds.${id}``.
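As a hedged bare-metal sketch (hypothetical MDS id ``a``), the full stop-and-verify sequence looks like this:

```shell
# Confirm a standby daemon exists before taking the MDS down
ceph fs status
# Stop the daemon; the Monitors are notified automatically
sudo systemctl stop ceph-mds@a
# Verify that a standby has been promoted in its place
ceph fs status
```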
-file system and do not affect other file systems. Confirmation flag is only
-needed for changing ``max_mds`` when cluster is unhealthy.
+file system and do not affect other file systems. The confirmation flag is only
+needed for changing ``max_mds`` when the cluster is unhealthy.
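For illustration, with a hypothetical file system named ``cephfs``:

```shell
# On a healthy cluster this succeeds without confirmation
ceph fs set cephfs max_mds 2
# On an unhealthy cluster the confirmation flag is mandatory
ceph fs set cephfs max_mds 2 --yes-i-really-mean-it
```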
-.. note:: It is mandatory to pass confirmation flag (--yes--i-really-mean-it)
+.. note:: It is mandatory to pass confirmation flag (--yes-i-really-mean-it)
for modifying FS setting variable ``max_mds`` when cluster is unhealthy.
-It has been added a precaution to tell users that modifying ``max_mds``
+It has been added as a precaution to tell users that modifying ``max_mds``
during troubleshooting or recovery might not help. Instead, it might
Rename a Ceph file system. This also changes the application tags on the data
pools and metadata pool of the file system to the new file system name.
The CephX IDs authorized to the old file system name need to be reauthorized
-to the new name. Any on-going operations of the clients using these IDs may be
+to the new name. Any ongoing operations of the clients using these IDs may be
disrupted. Mirroring is expected to be disabled on the file system.
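A sketch of the workflow, assuming a file system renamed from ``oldfs`` to ``newfs`` and a client ID ``client.foo`` (both names hypothetical):

```shell
# Rename the file system; application tags on its pools are updated
ceph fs rename oldfs newfs --yes-i-really-mean-it
# Reauthorize the existing CephX ID against the new name
ceph fs authorize newfs client.foo / rw
```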
::
fs swap <fs1-name> <fs1_id> <fs2-name> <fs2_id> [--swap-fscids=yes|no] [--yes-i-really-mean-it]
-Swaps names of two Ceph file sytems and updates the application tags on all
+Swaps names of two Ceph file systems and updates the application tags on all
pools of both FSs accordingly. Certain tools that track FSCIDs of the file
systems, besides the FS names, might get confused due to this operation. For
this reason, mandatory option ``--swap-fscids`` has been provided that must be
Before the swap, mirroring should be disabled on both the CephFSs
(because the cephfs-mirror daemon uses the fscid internally and changing it
-while the daemon is running could result in undefined behaviour), both the
+while the daemon is running could result in undefined behavior), both the
CephFSs should be offline and the file system flag ``refuse_client_sessions``
-must be set for both the CephFS.
+must be set for both CephFSs.
After the swap, CephX credentials may need to be reauthorized if the existing
mounts should "follow" the old file system to its new name. Generally, for
-disaster recovery, its desirable for the existing mounts to continue using
+disaster recovery, it is desirable for the existing mounts to continue using
-the same file system name. Any active file system mounts for either CephFSs
-must remount. Existing unflushed operations will be lost. When it is judged
+the same file system name. Any active mounts of either CephFS must be
+remounted. Existing unflushed operations will be lost. When it is judged
that one of the swapped file systems is ready for clients, run::
This command creates a file system with a specific **fscid** (file system cluster ID).
You may want to do this when an application expects the file system's ID to be
-stable after it has been recovered, e.g., after monitor databases are lost and
+stable after it has been recovered, e.g., after Monitor databases are lost and
rebuilt. Consequently, file system IDs don't always keep increasing with newer
file systems.
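For example (pool names and fscid hypothetical; ``--force`` may also be needed when reusing existing pools):

```shell
# Recreate a file system pinned to its previous fscid
ceph fs new cephfs cephfs_metadata cephfs_data --fscid 17 --force
```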
===================
-Libcephfs (JavaDoc)
+Libcephfs (Javadoc)
===================
.. warning::
are encouraged to contribute!
..
- The admin/build-docs script runs Ant to build the JavaDoc files, and
+ The admin/build-docs script runs Ant to build the Javadoc files, and
copies them to api/libcephfs-java/javadoc/.
-View the auto-generated `JavaDoc pages for the CephFS Java bindings <javadoc/>`_.
+View the auto-generated `Javadoc pages for the CephFS Java bindings <javadoc/>`_.
.. highlight:: python
-The `cephfs` python module provides access to CephFS service.
+The `cephfs` Python module provides access to the CephFS service.
API calls
=========
benefit from knowing about.
The following sections describe some areas where distributed file systems
-may have noticeably different performance behaviours compared with
+may have noticeably different performance behaviors compared with
local file systems.
less efficient than splitting your files into more modest-sized directories.
Even standard userspace tools can become quite slow when operating on very
-large directories. For example, the default behaviour of ``ls``
+large directories. For example, the default behavior of ``ls``
is to give an alphabetically ordered result, but ``readdir`` system
calls do not give an ordered result (this is true in general, not just
with CephFS). So when you ``ls`` on a million file directory, it is
items become unpinned and eligible to be dropped. The MDS can only drop cache
state when no clients refer to the metadata to be dropped. Also described below
is how to configure the MDS recall settings for your workload's needs. This is
-necessary if the internal throttles on the MDS recall can not keep up with the
+necessary if the internal throttles on the MDS recall cannot keep up with the
client workload.
.. _cephfs_cache_configuration_mds_cache_memory_limit:
.. confval:: mds_recall_global_max_decay_threshold
-its decay rate is the same as ``mds_recall_max_decay_rate``. Any recalled
+Its decay rate is the same as ``mds_recall_max_decay_rate``. Any recalled
capability for any session also increments this counter.
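This threshold is an ordinary MDS setting and can be raised at runtime; the value below is illustrative only:

```shell
ceph config set mds mds_recall_global_max_decay_threshold 131072
```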
If clients are slow to release state, the warning "failing to respond to cache
--------------------------------------------------------------------
Every second (or every interval set by the ``mds_cache_trim_interval``
-configuration paramater), the MDS runs the "cache trim" procedure. One of the
+configuration parameter), the MDS runs the "cache trim" procedure. One of the
steps of this procedure is "recall client state". During this step, the MDS
checks every client (session) to determine whether it needs to recall caps.
If any of the following are true, then the MDS needs to recall caps:
When a client wants to operate on an inode, it will query the MDS in various
ways, which will then grant the client a set of **capabilities**. This
grants the client permissions to operate on the inode in various ways. One
-of the major differences from other network file systems (e.g NFS or SMB) is
+of the major differences from other network file systems (e.g. NFS or SMB) is
that the capabilities granted are quite granular, and it's possible that
multiple clients can hold different capabilities on the same inodes.
state.
In each state the MDS Locker will always try to issue all the capabilities to the
-clients allowed, even some capabilities are not needed or wanted by the clients,
+clients allowed, even if some capabilities are not needed or wanted by the clients,
as pre-issuing capabilities could reduce latency in some cases.
If there is only one client, usually it will be the loner client for all the inodes.
-each inode depending on the capabilities the clients (needed | wanted), but usually
+each inode depending on the capabilities the clients needed or wanted, but usually
it will fail. The loner client will always get all the capabilities.
-The filelock will control files' partial metadatas' and the file contents' access
-permissions. The metadatas include **mtime**, **atime**, **size**, etc.
+The filelock will control files' partial metadata and the file contents' access
+permissions. The metadata includes **mtime**, **atime**, **size**, etc.
* **Fs**: Once a client has it, all other clients are denied **Fw**.
MDSes. The **Fcb** capabilities won't be granted to all the clients and the
clients will do sync read/write.
-* **Fw**: If there is no loner client and once a client have this capability, the
+* **Fw**: If there is no loner client and once a client has this capability, the
**Fsxcb** capabilities won't be granted to other clients.
If multiple clients read from and write to the same file, then the lock state
forcing clients to drop dirty buffers, for example on a simple file size extension
or truncating use case.
-* **Fl**: This capability means the clients could perform lazy io. LazyIO relaxes
+* **Fl**: This capability means the clients could perform lazy IO. LazyIO relaxes
POSIX semantics. Buffered reads/writes are allowed even when a file is opened by
multiple applications on multiple clients. Applications are responsible for managing
cache coherency themselves.
ceph-dokan --name client.foo -l x
Unmounting file systems
------------------=-----
+-----------------------
The mount can be removed by either issuing ctrl-c or using the unmap command,
like so::
* ``--type <type string>`` only include events of this type
* ``--frag <ino>[.frag id]`` only include events referring to this directory fragment
* ``--dname <string>`` only include events referring to this named dentry within a directory
- fragment (may only be used in conjunction with ``--frag``
+ fragment (may only be used in conjunction with ``--frag``)
* ``--client <int>`` only include events from this client session ID
Filters may be combined on an AND basis (i.e. only the intersection of events from each filter).
* ``binary``: write each event as a binary file, within a folder whose name is controlled by ``--path``
* ``json``: write all events to a single file specified by ``--path``, as a JSON serialized list of objects.
-* ``summary``: write a human readable summary of the events read to standard out
-* ``list``: write a human readable terse listing of the type of each event, and
+* ``summary``: write a human-readable summary of the events read to stdout
+* ``list``: write a human-readable terse listing of the type of each event, and
which file paths the event affects.
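As a hedged sketch (rank and client session ID hypothetical):

```shell
# Human-readable summary of all journal events for rank 0
cephfs-journal-tool --rank=cephfs:0 event get summary
# Only events from one client session, as a terse listing
cephfs-journal-tool --rank=cephfs:0 event get --client=4242 list
```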
Requirements
------------
-The primary (local) and secondary (remote) Ceph clusters version should be Pacific or later.
+The primary (local) and secondary (remote) Ceph cluster versions should be Pacific or later.
.. _cephfs_mirroring_creating_users:
To avoid having to maintain the remote cluster configuration file and remote
ceph user keyring in the primary cluster, users can bootstrap a peer (which
-stores the relevant remote cluster details in the monitor config store on the
+stores the relevant remote cluster details in the Monitor config store on the
primary cluster). See the :ref:`Bootstrap
Peers<cephfs_mirroring_bootstrap_peers>` section.
-The ``peer_add`` command supports passing the remote cluster monitor address
+The ``peer_add`` command supports passing the remote cluster Monitor address
and the user key. However, bootstrapping a peer is the recommended way to add a
peer.
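As a sketch (hypothetical file system ``cephfs``, user ``client.mirror_remote``, site name ``remote-site``):

```shell
# On the peer (secondary) cluster: create a bootstrap token
ceph fs snapshot mirror peer_bootstrap create cephfs client.mirror_remote remote-site
# On the primary cluster: import the token printed above
ceph fs snapshot mirror peer_bootstrap import cephfs <token>
```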
---------------
Adding a peer via the ``peer_add`` subcommand requires the peer cluster configuration and
-user keyring to be available in the primary cluster (manager host and hosts running the
+user keyring to be available in the primary cluster (Manager host and hosts running the
mirror daemon). This can be avoided by bootstrapping and importing a peer token. Peer
bootstrap involves creating a bootstrap token on the peer cluster via::
An entry per mirror daemon instance is displayed along with information such as configured
peers and basic stats. The peer information includes the remote file system name (``fs_name``),
-cluster's monitor addresses (``mon_host``) and cluster FSID (``fsid``). For more detailed
+cluster's Monitor addresses (``mon_host``) and cluster FSID (``fsid``). For more detailed
stats, use the admin socket interface as detailed below.
CephFS mirror daemons provide admin socket commands for querying mirror status. To check
When the snapshot or the directory is removed from the remote filesystem, the mirror daemon will
clear the failed state upon successful synchronization of the pending snapshots, if any.
-.. note:: Setting snap-schedule on the remote flle system for directories that are being mirrored will
+.. note:: Setting snap-schedule on the remote file system for directories that are being mirrored will
cause the mirror daemon to report errors like ``invalid metadata``.
.. note:: Treat the remote filesystem as read-only. Nothing is inherently enforced by CephFS.
CephFS Top Utility
==================
-CephFS provides `top(1)` like utility to display various Ceph Filesystem metrics
-in realtime. `cephfs-top` is a curses based python script which makes use of `stats`
+CephFS provides a `top(1)`\-like utility to display various Ceph Filesystem metrics
+in real time. `cephfs-top` is a curses-based Python script which makes use of the `stats`
plugin in Ceph Manager to fetch (and display) metrics.
Manager Plugin
==============
Ceph Filesystem clients periodically forward various metrics to Ceph Metadata Servers (MDS)
-which in turn get forwarded to Ceph Manager by MDS rank zero. Each active MDS forward its
+which in turn get forwarded to Ceph Manager by MDS rank zero. Each active MDS forwards its
respective set of metrics to MDS rank zero. Metrics are aggregated and forwarded to Ceph
Manager.
============
-`cephfs-top` utility relies on `stats` plugin to fetch performance metrics and display in
+The `cephfs-top` utility relies on the `stats` plugin to fetch performance metrics and display them in
-`top(1)` like format. `cephfs-top` is available as part of `cephfs-top` package.
+`top(1)`\-like format. `cephfs-top` is available as part of the `cephfs-top` package.
By default, `cephfs-top` uses `client.fstop` user to connect to a Ceph cluster::
.. image:: cephfs-top.png
-.. note:: Minimum compatible python version for cephfs-top is 3.6.0. cephfs-top is supported on distros RHEL 8, Ubuntu 18.04, CentOS 8 and above.
+.. note:: Minimum compatible Python version for cephfs-top is 3.6.0. cephfs-top is supported on distributions RHEL 8, Ubuntu 18.04, CentOS 8 and above.
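A typical (hedged) setup for the default ``client.fstop`` user is:

```shell
# Create the default client.fstop user, then launch the utility
ceph auth get-or-create client.fstop mon 'allow r' mds 'allow r' osd 'allow r' mgr 'allow r'
cephfs-top
```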
the configuration so long as the preconditions apply: it is empty
and not part of an existing snapshot.
-.. warning:: The charmap is not applied to snapshot names. Snapshots names are always case-sensitive and not normalized.
+.. warning:: The charmap is not applied to snapshot names. Snapshot names are always case-sensitive and not normalized.
Normalization
-------------
# file: foo/
ceph.dir.normalization="nfd"
-To remove normlization on a directory, you must remove the ``ceph.dir.charmap``
+To remove normalization on a directory, you must remove the ``ceph.dir.charmap``
configuration.
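The effect of a normalization form such as ``nfd`` can be illustrated with Python's standard ``unicodedata`` module (this mirrors what normalization does to names; it is not CephFS code):

```python
import unicodedata

# "é" can be encoded as one precomposed code point (NFC form)...
precomposed = "\u00e9"
# ...or as "e" followed by a combining acute accent (NFD form)
decomposed = "e\u0301"

# The two spellings differ byte-for-byte but render identically
assert precomposed != decomposed

# Under a single normalization form, both map to the same name
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
```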
.. note:: The MDS maintains an ``alternate_name`` metadata (also used for
The ``ceph.dir.casesensitive`` attribute accepts a boolean value. By default,
names are case-sensitive (as normal in a POSIX file system). Setting this value
-to false will make the named entries in the directory (and its descendent
+to false will make the named entries in the directory (and its descendant
directories) case-insensitive.
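For example (hedged; manipulating xattrs usually requires the ``attr`` package):

```shell
# Make names under dir/ case-insensitive
setfattr -n ceph.dir.casesensitive -v 0 dir
getfattr -n ceph.dir.casesensitive dir
```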
Case folding requires that names are also normalized. By default, after setting
Permissions
-----------
-As with other CephFS virtual extended atributes, a client may only set the
+As with other CephFS virtual extended attributes, a client may only set the
``charmap`` configuration on a directory with the **p** MDS auth cap. Viewing
the configuration does not require this cap.
caps: [osd] allow rw tag cephfs data=cephfs_a
To completely restrict the client to the ``bar`` directory, omit the
-root directory :
+root directory:
.. prompt:: bash #
caps: [mon] allow r network 10.0.0.0/8
caps: [osd] allow rw tag cephfs data=cephfs_a network 10.0.0.0/8
-The optional ``{network/prefix}`` is a standard network-name-and-prefix length
-in CIDR notation (for example, ``10.3.0.0/16``). If ``{network/prefix}}`` is
+The optional ``{network/prefix}`` is a standard network number and prefix length
+in CIDR notation (for example, ``10.3.0.0/16``). If ``{network/prefix}`` is
present, the use of this capability is restricted to clients connecting from
this network.
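Such caps can be applied with ``ceph auth caps``; a sketch matching the example output above:

```shell
ceph auth caps client.foo \
  mon 'allow r network 10.0.0.0/8' \
  mds 'allow rw network 10.0.0.0/8' \
  osd 'allow rw tag cephfs data=cephfs_a network 10.0.0.0/8'
```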
Updating Client Configuration
-----------------------------
-Certain client configurations can be applied at runtime. To check if a configuration option can be applied (taken into affect by a client) at runtime, use the `config help` command::
+Certain client configurations can be applied at runtime. To check whether a configuration option can be applied (i.e., take effect on a client) at runtime, use the `config help` command::
ceph config help debug_client
debug_client - Debug level for client
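If the output reports that the option can be updated at runtime, it can be changed on live clients, for example:

```shell
ceph config set client debug_client 20/20
```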
If the ``--fscid`` option is provided then this creates a file system with a
specific fscid. This can be used when an application expects the file system's ID
-to be stable after it has been recovered, e.g., after monitor databases are
+to be stable after it has been recovered, e.g., after Monitor databases are
lost and rebuilt. Consequently, file system IDs don't always keep increasing
with newer file systems.
.. note:: The Ceph file system must be offline before metadata repair tools can
be used on it. The tools will complain if they are invoked when the file
system is online. If any of the recovery steps do not complete successfully,
- DO NOT proceeed to run any more recovery steps. If any recovery step fails,
- seek help from experts via mailing lists and IRC channels and Slack
- channels.
+ DO NOT proceed to run any more recovery steps. If any recovery step fails,
+ seek help from experts via mailing lists or IRC/Slack channels.
Journal export
--------------
journal data has been extracted by other means such as ``recover_dentries``.
Resetting the journal is likely to leave orphaned objects in the data pool
and could result in the re-allocation of already-written inodes resulting in
- faulty behaviour of the file system (bugs, etc..).
+ faulty behavior of the file system (bugs, etc.).
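A hedged sketch of the usual order of operations (back up first, reset last; newer releases may additionally require a confirmation flag for ``journal reset``):

```shell
# Always export the journal before destructive recovery steps
cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
# Recover what can be recovered from the journal
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
# Only then, if still required, reset the journal
cephfs-journal-tool --rank=cephfs:0 journal reset
```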
MDS table wipes
---------------
ceph -s
The data scan tools (``scan_extents``, ``scan_inodes``, and ``scan_links``)
-will automatically report their progress to the Ceph manager if the ``cli_api``
+will automatically report their progress to the Ceph Manager if the ``cli_api``
module is enabled. Progress updates include:
- Number of objects processed and total objects
to function. If progress updates are not appearing in ``ceph -s``, verify
that:
- - The ``cli_api`` manager module is enabled
+ - The ``cli_api`` Manager module is enabled
- The ``ceph`` command is available in your PATH
- Your ``CEPH_CONF`` environment variable (if set) points to a valid
configuration file
Progress updates will be automatically disabled if the system cannot
-communicate with the Ceph manager or if the required module is not available.
+communicate with the Ceph Manager or if the required module is not available.
Console output will continue to show local progress information even if
-manager updates are disabled.
+Manager updates are disabled.
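For example (the module name is taken from the text above; ``<data pool>`` is a placeholder):

```shell
# Enable the module so scan progress appears in `ceph -s`
ceph mgr module enable cli_api
# Run a scan and watch progress in another terminal
cephfs-data-scan scan_extents <data pool>
ceph -s
```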
If the root inode or MDS directory (``~mdsdir``) is missing or corrupt, run the following command:
#. Create a recovery file system. This recovery file system will be used to
recover the data in the damaged pool. First, the filesystem will have a data
- pool deployed for it. Then you will attacha new metadata pool to the new
+ pool deployed for it. Then you will attach a new metadata pool to the new
data pool. Then you will set the new metadata pool to be backed by the old
data pool.
The ``--recover`` flag prevents any MDS daemon from joining the new file
system.
-#. Create the intial metadata for the file system:
+#. Create the initial metadata for the file system:
.. prompt:: bash #
-This results in lack of vertical scaling and wastage of non-busy resources/MDSs.
+This results in a lack of vertical scaling and a waste of non-busy MDS resources.
This led to the adoption of a more dynamic way of handling
-metadata: Dynamic Subtree Partitioning, where load intensive portions
-of the directory hierarchy from busy MDSs are migrated to non busy MDSs.
+metadata: Dynamic Subtree Partitioning, where load-intensive portions
+of the directory hierarchy from busy MDSs are migrated to non-busy MDSs.
This strategy ensures that activity hotspots are relieved as they
appear and so leads to vertical scaling of the metadata workload in
---------------------------------------
Once the exporter verifies that the subtree is permissible to be exported
-(Non degraded cluster, non-frozen subtree root), the subtree root
+(non-degraded cluster, non-frozen subtree root), the subtree root
directory is temporarily auth pinned, the subtree freeze is initiated,
and the exporter is committed to the subtree migration, barring an
intervening failure of the importer or itself.
Some of these features are closer to being done than others, though. We
describe each of them with an approximation of how risky they are and briefly
describe what is required to enable them. Note that doing so will
-*irrevocably* flag maps in the monitor as having once enabled this flag to
+*irrevocably* flag maps in the Monitor as having once enabled this flag to
improve debugging and support processes.
Inline data
feature enables small files (generally <2KB) to be stored in the inode
and served out of the MDS. This may improve small-file performance but increases
-load on the MDS. It is not sufficiently tested to support at this time, although
+load on the MDS. It is not sufficiently tested to be supported at this time, although
-failures within it are unlikely to make non-inlined data inaccessible
+failures within it are unlikely to make non-inlined data inaccessible.
Inline data has always been off by default and requires setting
the ``inline_data`` flag.
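Enabling it looks like this (hedged; confirmation is required because enabling the flag is recorded irrevocably):

```shell
ceph fs set cephfs inline_data true --yes-i-really-mean-it
```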
.. tip::
- Your linux distribution may not ship with commands for manipulating xattrs by default,
+ Your Linux distribution may not ship with commands for manipulating xattrs by default;
the required package is usually called ``attr``.
Layout fields
# file: dir
ceph.dir.layout="stripe_unit=4194304 stripe_count=2 object_size=4194304 pool=cephfs_data"
-Getting the layout in json format. If there's no specific layout set for the
+Getting the layout in JSON format. If there's no specific layout set for the
particular inode, the system traverses the directory path backwards and finds
-the closest ancestor directory with a layout and returns it in json format.
-A file layout also can be retrieved in json format using ``ceph.file.layout.json`` vxattr.
+the closest ancestor directory with a layout and returns it in JSON format.
+A file layout can also be retrieved in JSON format using the ``ceph.file.layout.json`` vxattr.
-A virtual field named ``inheritance`` is added to the json output to show the status of layout.
+A virtual field named ``inheritance`` is added to the JSON output to show the status of the layout.
The ``inheritance`` field can have the following values:
``@default`` implies the system default layout
$ setfattr -n ceph.file.layout.stripe_count -v 4 file1
setfattr: file1: Directory not empty
-File and Directory layouts can also be set using the json format.
+File and Directory layouts can also be set using the JSON format.
The ``inheritance`` field is ignored when setting the layout.
-Also, if both, ``pool_name`` and ``pool_id`` fields are specified, then the
+Also, if both ``pool_name`` and ``pool_id`` fields are specified, then the
``pool_name`` is given preference for better disambiguation.
.. code-block:: bash
ceph.file.layout="stripe_unit=4194304 stripe_count=4 object_size=4194304 pool=cephfs_data"
-Files created as descendents of the directory also inherit the layout, if the intermediate
+Files created as descendants of the directory also inherit the layout if the intermediate
directories do not have layouts set:
.. code-block:: bash
* ``used``: The amount of storage consumed in bytes
* ``name``: Name of the pool
-* ``mon_addrs``: List of Ceph monitor addresses
+* ``mon_addrs``: List of Ceph Monitor addresses
* ``used_size``: Current used size of the CephFS volume in bytes
* ``pending_subvolume_deletions``: Number of subvolumes pending deletion
is created with octal file mode ``755``, UID ``0``, GID ``0`` and the data pool
layout of its parent directory.
-You can also specify an Unicode normalization form using the ``--normalization``
+You can also specify a Unicode normalization form using the ``--normalization``
option. This will be used to internally mangle file names so that Unicode
characters that can be represented by different Unicode code point sequences
are all mapped to the same representation, which means that they will all
To learn more about Unicode normalization forms see https://unicode.org/reports/tr15
-It's also possible to configure a subvolume group for case insensitive access
+It's also possible to configure a subvolume group for case-insensitive access
when the ``--casesensitive=0`` option is used. When this option is added, file
names that only differ in the case of its characters will be mapped to the same
file. The case of the file name used when the file was created is preserved.
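Combining both options might look like this (volume and group names hypothetical; option spellings follow the text above):

```shell
ceph fs subvolumegroup create vol1 group1 --normalization nfd --casesensitive 0
```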
* ``uid``: UID of the subvolume group path
* ``gid``: GID of the subvolume group path
* ``mode``: mode of the subvolume group path
-* ``mon_addrs``: list of monitor addresses
+* ``mon_addrs``: list of Monitor addresses
* ``bytes_pcent``: quota used in percentage if quota is set, else displays "undefined"
* ``bytes_quota``: quota size in bytes if quota is set, else displays "infinite"
* ``bytes_used``: current used size of the subvolume group in bytes
be aware that user permissions and ACLs associated with the previous scope might still apply. Ensure that
any necessary permissions are updated as needed to maintain proper access control.
-When creating a subvolume you can also specify an Unicode normalization form by
+When creating a subvolume you can also specify a Unicode normalization form by
using the ``--normalization`` option. This will be used to internally mangle
file names so that Unicode characters that can be represented by different
-Unicode code point sequences are all mapped to the representation, which means
+Unicode code point sequences are all mapped to the same representation, which means
To learn more about Unicode normalization forms see https://unicode.org/reports/tr15
-It's also possible to configure a subvolume for case insensitive access when
+It's also possible to configure a subvolume for case-insensitive access when
the ``--casesensitive=0`` option is used. When this option is added, file
names that only differ in the case of its characters will be mapped to the same
file. The case of the file name used when the file was created is preserved.
* ``uid``: UID of the subvolume path
* ``gid``: GID of the subvolume path
* ``mode``: mode of the subvolume path
-* ``mon_addrs``: list of monitor addresses
+* ``mon_addrs``: list of Monitor addresses
* ``bytes_pcent``: quota used in percentage if quota is set; else displays
``undefined``
* ``bytes_quota``: quota size in bytes if quota is set; else displays
Listing the Snapshots of a Subvolume
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Use a command of the following from to list the snapshots of a subvolume:
+Use a command of the following form to list the snapshots of a subvolume:
.. prompt:: bash #
Listing Custom Metadata That Has Been Set on a Snapshot
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Use a command of the following from to list custom metadata (key-value pairs)
+Use a command of the following form to list custom metadata (key-value pairs)
set on the snapshot:
.. prompt:: bash #
A progress report is also printed in the output when clone is ``in-progress``.
Here the progress is reported only for the specific clone. For collective
progress made by all ongoing clones, a progress bar is printed at the bottom
-in ouput of ``ceph status`` command::
+in the output of the ``ceph status`` command::
progress:
3 ongoing clones - average progress is 47.569% (10s)
}
}
-.. note:: Delete the canceled cloned by supplying the ``--force`` option to the
+.. note:: Delete the canceled clone by supplying the ``--force`` option to the
``fs subvolume rm`` command.
Configurables
ceph config set client client_respect_subvolume_snapshot_visibility <true|false>
-.. note:: The MGR daemon operates as a privileged CephFS client and therefore
+.. note:: The Manager daemon operates as a privileged CephFS client and therefore
bypasses snapshot visibility restrictions. This behavior is required
to ensure the reliable execution of operations such as snap-schedule
and snapshot cloning. As a result, modifying the
``client_respect_subvolume_snapshot_visibility`` configuration option
- has no effect on the CephFS instance running within the MGR daemon.
+ has no effect on the CephFS instance running within the Manager daemon.
How to Disable Snapshot Visibility on a Subvolume?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Normalization and Case Sensitivity
----------------------------------
-The subvolumegroup and subvolume interefaces have a porcelain layer API to
+The subvolumegroup and subvolume interfaces have a porcelain layer API to
manipulate the ``ceph.dir.charmap`` configurations (see also :ref:`charmap`).
.. prompt:: bash #
- ceph fs subvolume charmap set <vol_name> <subvol> <--group_name=name> <setting> <value>
+ ceph fs subvolume charmap set <vol_name> <subvol> --group_name=<name> <setting> <value>
For example:
.. prompt:: bash #
- ceph fs subvolume charmap get <vol_name> <subvol> <--group_name=name> <setting>
+ ceph fs subvolume charmap get <vol_name> <subvol> --group_name=<name> <setting>
For example:
.. prompt:: bash #
- ceph fs subvolume charmap get <vol_name> <subvol> <--group_name=name>
+ ceph fs subvolume charmap get <vol_name> <subvol> --group_name=<name>
For example:
.. prompt:: bash #
- ceph fs subvolumegroup charmap rm <vol_name> <group_name
+ ceph fs subvolumegroup charmap rm <vol_name> <group_name>
Or for a subvolume:
.. prompt:: bash #
- ceph fs subvolume charmap rm <vol_name> <subvol> <--group_name=name>
+ ceph fs subvolume charmap rm <vol_name> <subvol> --group_name=<name>
For example:
* **cancel all** active sets in case an immediate resume of IO is required.
The operations listed above are non-blocking: they attempt the intended modification
-and return with an up to date version of the target set, whether the operation was successful or not.
+and return with an up-to-date version of the target set, whether the operation was successful or not.
The set may change states as a result of the modification, and the version that's returned in the response
-is guaranteed to be in a state consistent with this and potentialy other successful operations from
+is guaranteed to be in a state consistent with this and potentially other successful operations from
the same control loop batch.
Some set states are `awaitable`. We will discuss those below, but for now it's important to mention that
There are two types of await: `quiesce await` and `release await`. The former is the default,
and the latter can only be achieved with ``--release`` present in the argument list.
-To avoid confision, it is not permitted to issue a `quiesce await` when the set is not `QUIESCING`.
+To avoid confusion, it is not permitted to issue a `quiesce await` when the set is not `QUIESCING`.
Trying to ``--release`` a set that is not `QUIESCED` is an ``EPERM`` error as well, regardless
of whether await is requested alongside. However, it's not an error to `release await`
an already released set, or to `quiesce await` a `QUIESCED` one - those are successful no-ops.
}
Error EPERM:
-Although ``--cancel`` will succeed syncrhonously for a set in an active state, awaiting a canceled
+Although ``--cancel`` will succeed synchronously for a set in an active state, awaiting a canceled
set is not permitted, hence this call will result in an ``EPERM``. This is deliberately different from
returning a ``EINVAL`` error, denoting an error on the user's side, to simplify the system's behavior
when ``--await`` is requested. As a result, it's also a simpler model for the user to work with.
------------------------
By default the volumes plugin is enabled and set to ``always on``. However, in
certain cases it might be appropriate to disable it. For example, when a CephFS
-is in a degraded state, the volumes plugin commands may accumulate in MGR
+is in a degraded state, the volumes plugin commands may accumulate in the Manager
-instead of getting served. Which eventually causes policy throttles to kick in
+instead of being served, which eventually causes policy throttles to kick in
-and the MGR becomes unresponsive.
+and the Manager becomes unresponsive.
In this event, the volumes plugin can be disabled even though it is an
-``always on`` module in MGR. To do so, run ``ceph mgr module disable volumes
+``always on`` module in the Manager. To do so, run ``ceph mgr module disable volumes
--yes-i-really-mean-it``. Do note that this command will disable operations
and remove commands of the volumes plugin since it will disable all CephFS
services on the Ceph cluster accessed through this plugin.
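For example, the plugin can be toggled with the module commands described above (re-enabling uses the standard ``ceph mgr module enable`` form):

```shell
# Disable the always-on volumes plugin; the confirmation flag is required
ceph mgr module disable volumes --yes-i-really-mean-it

# Re-enable the plugin once the file system has recovered
ceph mgr module enable volumes
```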
-Finally, Client 3, is using an older version of CephFS client and does not have
+Finally, Client 3 is using an older version of the CephFS client and does not have
fscrypt feature present. In this mode, users have the same view as before, but
are able to do some data operations to encrypted files. This mode is not
-recommend and not supported.
+recommended and not supported.
.. figure:: cephfs_fscrypt_multiclient.svg
Hammer and later
----------------
-Since the hammer release, a full file system will lead to ENOSPC
+Since the Hammer release, a full file system will lead to ENOSPC
results from:
* Data writes on the client
subsequent access to the data objects by the MDS or other clients. For
more on the epoch barrier mechanism, see :ref:`background_blocklisting_and_osd_epoch_barrier`.
-Legacy (pre-hammer) behavior
+Legacy (pre-Hammer) behavior
----------------------------
-In versions of Ceph earlier than hammer, the MDS would ignore
+In versions of Ceph earlier than Hammer, the MDS would ignore
the full status of the RADOS cluster, and any data writes from
clients would stall until the cluster ceased to be full.
-There are two dangerous conditions to watch for with this behaviour:
+There are two dangerous conditions to watch for with this behavior:
* If a client had pending writes to a file, then it was not possible
for the client to release the file to the MDS for deletion: this could
Cluster health checks
=====================
-The Ceph monitor daemons will generate health messages in response
+The Ceph Monitor daemons will generate health messages in response
to certain states of the file system map structure (and the enclosed MDS maps).
Message: mds rank(s) *ranks* have failed
Message: mds *names* are laggy
Description: The named MDS daemons have failed to send beacon messages
-to the monitor for at least ``mds_beacon_grace`` (default 15s), while
+to the Monitor for at least ``mds_beacon_grace`` (default 15s), while
they are supposed to send beacon messages every ``mds_beacon_interval``
-(default 4s). The daemons may have crashed. The Ceph monitor will
+(default 4s). The daemons may have crashed. The Ceph Monitor will
automatically replace laggy daemons with standbys if any are available.
Message: insufficient standby daemons available
Description: One or more file systems are configured to have a certain number
of standby daemons available (including daemons in standby-replay) but the
cluster does not have enough standby daemons. The standby daemons not in replay
-count towards any file system (i.e. they may overlap). This warning can
+count towards any file system (i.e. they may overlap). This warning can be
configured by setting ``ceph fs set <fs> standby_count_wanted <count>``. Use
zero for ``count`` to disable.
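For example, using a hypothetical file system named ``cephfs``:

```shell
# Warn unless at least two standby daemons are available
ceph fs set cephfs standby_count_wanted 2

# Set the count to zero to disable this health warning
ceph fs set cephfs standby_count_wanted 0
```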
MDS daemons can identify a variety of unwanted conditions, and
indicate these to the operator in the output of ``ceph status``.
-These conditions have human readable messages, and additionally
+These conditions have human-readable messages, and additionally
a unique code starting with ``MDS_``.
.. highlight:: console
``MDS_ESTIMATED_REPLAY_TIME``
-----------------------------
Message
- "HEALTH_WARN Replay: x% complete. Estimated time remaining *x* seconds
+ HEALTH_WARN Replay: x% complete. Estimated time remaining *x* seconds
Description
When an MDS journal replay takes more than 30 seconds, this message indicates the estimated time to completion.
Troubleshooting <troubleshooting>
Disaster recovery <disaster-recovery>
cephfs-journal-tool <cephfs-journal-tool>
- Recovering file system after monitor store loss <recover-fs-after-mon-store-loss>
+ Recovering file system after Monitor store loss <recover-fs-after-mon-store-loss>
.. raw:: html
Supported Features of the Kernel Driver
========================================
-The kernel driver is developed separately from the core ceph code, and as
+The kernel driver is developed separately from the core Ceph code, and as
such it sometimes differs from the FUSE driver in feature implementation.
The following details the implementation status of various CephFS features
in the kernel driver.
Quotas
------
-Quota was first introduced by the hammer release. Quota disk format got renewed
+Quota was first introduced in the Hammer release. The quota disk format was renewed
by the Mimic release. Linux kernel clients >= 4.17 can support the new format
-quota. At present, no Linux kernel client support the old format quota.
+quota. At present, no Linux kernel clients support the old format quota.
See `Quotas`_ for more information.
file is opened by multiple applications on multiple clients. Applications are
responsible for managing cache coherency themselves.
-Libcephfs supports LazyIO since nautilus release.
+Libcephfs has supported LazyIO since the Nautilus release.
Enable LazyIO
=============
-LazyIO can be enabled by following ways.
+LazyIO can be enabled in the following ways.
- ``client_force_lazyio`` option enables LAZY_IO globally for libcephfs and
ceph-fuse mount.
int fda = ceph_open(ca, "shared_file.txt", O_CREAT|O_RDWR, 0644);
/* Enable LazyIO for fda */
- ceph_lazyio(ca, fda, 1));
+ ceph_lazyio(ca, fda, 1);
for(i = 0; i < num_iters; i++) {
char out_buf[] = "fooooooooo";
Since the cache is distributed, the MDS must take great care to ensure
that no client holds capabilities that may conflict with other clients'
-capabilities, or operations that it does itself. This allows cephfs
+capabilities, or operations that it does itself. This allows CephFS
clients to rely on much greater cache coherence than a filesystem like
NFS, where the client may cache data and metadata beyond the point where
it has changed on the server.
#. Consistency: On an MDS failover, the journal events can be replayed to reach a
consistent file system state. Also, metadata operations that require multiple
updates to the backing store need to be journaled for crash consistency (along
- with other consistency mechanisms such as locking, etc..).
+ with other consistency mechanisms such as locking, etc.).
#. Performance: Journal updates are (mostly) sequential, hence updates to journals
-are fast. Furthermore, updates can be batched into single write, thereby saving
+are fast. Furthermore, updates can be batched into a single write, thereby saving
Apart from journaling file system metadata updates, CephFS journals various other events
such as client session info and directory import/export state to name a few. These events
-are used by the metadata sever to reestablish correct state as required, e.g., Ceph MDS
+are used by the metadata server to reestablish correct state as required, e.g., Ceph MDS
tries to reconnect clients on restart when journal events get replayed and a specific
event type in the journal specifies that a client entity type has a session with the MDS
before it was restarted.
Configurations
--------------
-The targetted size of a log segment in terms of number of events is controlled by:
+The targeted size of a log segment in terms of number of events is controlled by:
.. confval:: mds_log_events_per_segment
up:standby
The MDS is available to takeover for a failed rank (see also :ref:`mds-standby`).
-The monitor will automatically assign an MDS in this state to a failed rank
+The Monitor will automatically assign an MDS in this state to a failed rank
once available.
up:boot
-This state is broadcast to the Ceph monitors during startup. This state is
+This state is broadcast to the Ceph Monitors during startup. This state is
-never visible as the Monitor immediately assign the MDS to an available rank or
+never visible as the Monitor immediately assigns the MDS to an available rank or
commands the MDS to operate as a standby. The state is documented here for
completeness.
up:stopping
-When a rank is stopped, the monitors command an active MDS to enter the
+When a rank is stopped, the Monitors command an active MDS to enter the
``up:stopping`` state. In this state, the MDS accepts no new client
-connections, migrates all subtrees to other ranks in the file system, flush its
-metadata journal, and, if the last rank (0), evict all clients and shutdown
+connections, migrates all subtrees to other ranks in the file system, flushes its
+metadata journal, and, if it is the last rank (0), evicts all clients and shuts down
- Total number of Inodes
* - total_read_ops
- Gauge
- - Total number of read operations generated by all process
+ - Total number of read operations generated by all processes
* - total_read_size
- Gauge
- - Number of bytes read in input/output operations generated by all process
+ - Number of bytes read in input/output operations generated by all processes
* - total_write_ops
- Gauge
- - Total number of write operations generated by all process
+ - Total number of write operations generated by all processes
* - total_write_size
- Gauge
- Number of bytes written in input/output operations generated by all processes
The FUSE client is the most accessible and the easiest to upgrade to the
version of Ceph used by the storage cluster, while the kernel client will
-always gives better performance.
+always give better performance.
When encountering bugs or performance issues, it is often instructive to
try using the other client, in order to find out whether the bug was
client-specific or not (and then to let the developers know).
-General Pre-requisite for Mounting CephFS
+General Prerequisites for Mounting CephFS
-----------------------------------------
Before mounting CephFS, ensure that the client host (where CephFS has to be
mounted and used) has a copy of the Ceph configuration file (i.e.
``ceph.conf``) and a keyring of the CephX user that has permission to access
the MDS. Both of these files must already be present on the host where the
-Ceph MON resides.
+Ceph Monitor resides.
#. Generate a minimal conf file for the client host and place it at a
standard location::
ssh {user}@{mon-host} "sudo ceph fs authorize cephfs client.foo / rw" | sudo tee /etc/ceph/ceph.client.foo.keyring
- In above command, replace ``cephfs`` with the name of your CephFS, ``foo``
+ In the above command, replace ``cephfs`` with the name of your CephFS, ``foo``
-by the name you want for your CephX user and ``/`` by the path within your
-CephFS for which you want to allow access to the client host and ``rw``
-stands for both read and write permissions.
+with the name you want for your CephX user, and ``/`` with the path within
+your CephFS to which you want to allow the client host access; ``rw``
+stands for read and write permissions.
- Ceph keyring from the MON host to client host at ``/etc/ceph`` but creating
+ Ceph keyring from the Monitor host to the client host at ``/etc/ceph``, but creating
a keyring specific to the client host is better. While creating a CephX
- keyring/client, using same client name across multiple machines is perfectly
+ keyring/client, using the same client name across multiple machines is perfectly
fine.
-.. note:: If you get 2 prompts for password while running above any of 2
+.. note:: If you get 2 password prompts while running any of the above 2
ceph-fuse --id foo -k /path/to/keyring /mnt/mycephfs
-You may pass a Monitor's address and port on the commandline, although this is not mandatory::
+You may pass a Monitor's address and port on the command line, although this is not mandatory::
ceph-fuse --id foo -m 192.168.0.1:6789 /mnt/mycephfs
``ceph-fuse@.service`` and ``ceph-fuse.target`` systemd units are available.
As usual, these unit files declare the default dependencies and recommended
execution context for ``ceph-fuse``. After making the fstab entry shown above,
-run following commands::
+run the following commands::
systemctl start ceph-fuse@/mnt/mycephfs.service
systemctl enable ceph-fuse.target
Is mount helper present?
------------------------
The ``mount.ceph`` helper is installed by Ceph packages. The helper passes the
-monitor address(es) and CephX user keyrings, saving the Ceph admin the effort
+Monitor address(es) and CephX user keyrings, saving the Ceph admin the effort
of passing these details explicitly while mounting CephFS. If the helper is not
present on the client machine, CephFS can still be mounted using the kernel
driver but only by passing these details explicitly to the ``mount`` command.
to be years behind the latest upstream Linux kernel where Ceph development
takes place (including bug fixes).
-As a rough guide, as of Ceph 10.x (Jewel), you should be using a least a 4.x
+As a rough guide, as of Ceph 10.x (Jewel), you should be using at least a 4.x
kernel. If you absolutely have to use an older kernel, you should use the
-fuse client instead of the kernel client.
+FUSE client instead of the kernel client.
This advice does not apply if you are using a Linux distribution that includes
CephFS support. In that case, the distributor is responsible for backporting
#. ``name`` is the username of the CephX user we are using to mount CephFS.
#. ``fsid`` is the FSID of the Ceph cluster, which can be found using the
``ceph fsid`` command. ``fs_name`` is the file system to mount. The kernel
- driver requires a ceph Monitor's address and the secret key of the CephX
+ driver requires a Ceph Monitor's address and the secret key of the CephX
user. For example:
.. prompt:: bash #
mount -t ceph cephuser@b3acfc0d-575f-41d3-9c91-0e7ed3dbb3fa.cephfs=/ /mnt/mycephfs -o mon_addr=192.168.0.1:6789,secret=AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==
-When using the mount helper, monitor hosts and FSID are optional. The
+When using the mount helper, Monitor hosts and FSID are optional. The
``mount.ceph`` helper discovers these details by finding and reading the ceph
conf file. For example:
mount -t ceph cephuser@.cephfs=/ -o secret=AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==
-.. note:: Note that the dot (``.`` in the string ``cephuser@.cephfs``) must be
+.. note:: The dot (``.`` in the string ``cephuser@.cephfs``) must be
a part of the device string.
A weakness of this method is that it will leave the secret key in your shell's
Ensure that the permissions on the secret key file are appropriate (preferably,
``600``).
-Multiple monitor hosts can be passed by separating addresses with a ``/``:
+Multiple Monitor hosts can be passed by separating addresses with a ``/``:
.. prompt:: bash #
---------------
The ``fs authorize`` command allows configuring the client's access to a
-particular file system. See also in :ref:`fs-authorize-multifs`. The client will
+particular file system. See also :ref:`fs-authorize-multifs`. The client will
only have visibility of authorized file systems and the Monitors/MDS will
reject access to clients without authorization.
bug.
If an MDS daemon crashes or is killed while in the ``up:stopping`` state, a
-standby will take over and the cluster monitors will against try to stop
+standby will take over and the cluster Monitors will again try to stop
the daemon.
When a daemon finishes stopping, it will respawn itself and go back to being a
Setting subtree partitioning policies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-It is also possible to setup **automatic** static partitioning of subtrees via
+It is also possible to set up **automatic** static partitioning of subtrees via
a set of **policies**. In CephFS, this automatic static partitioning is
referred to as **ephemeral pinning**. Any directory (inode) which is
ephemerally pinned will be automatically assigned to a particular rank
The CephFS file system provides the ``bal_rank_mask`` option to enable the
balancer to dynamically rebalance subtrees within particular active MDS ranks.
This allows administrators to employ both the dynamic subtree partitioning and
-static pining schemes in different active MDS ranks so that metadata loads are
+static pinning schemes in different active MDS ranks so that metadata loads are
optimized based on user demand. For instance, in realistic cloud storage
environments, where a lot of subvolumes are allotted to multiple computing
nodes (e.g., VMs and containers), some subvolumes that require high performance
server`_. This document provides information on configuring NFS-Ganesha
clusters manually. The simplest and preferred way of managing NFS-Ganesha
clusters and CephFS exports is using ``ceph nfs ...`` commands. See
-:doc:`/mgr/nfs` for more details. As the deployment is done using cephadm or
-rook.
+:ref:`mgr-nfs` for more details, as the deployment is done using cephadm or
+Rook.
Requirements
============
.. note::
-It is recommended to use 3.5 or later stable version of NFS-Ganesha
+It is recommended to use a 3.5 or later stable version of NFS-Ganesha
- packages with pacific (16.2.x) or later stable version of Ceph packages.
+ packages with a Pacific (16.2.x) or later stable version of Ceph packages.
Configuring NFS-Ganesha to export CephFS
========================================
``ceph.conf`` for libcephfs clients includes a ``[client]`` section with
``mon_host`` option set to let the clients connect to the Ceph cluster's
-monitors, usually generated via ``ceph config generate-minimal-conf``.
+Monitors, usually generated via ``ceph config generate-minimal-conf``.
For example::
[client]
- CephFS does not currently maintain the ``atime`` field. Most applications
do not care, though this impacts some backup and data tiering
applications that can move unused data to a secondary storage system.
- You may be able to workaround this for some use cases, as CephFS does
+ You may be able to work around this for some use cases, as CephFS does
support setting ``atime`` via the ``setattr`` operation.
Perspective
system implementations do not strictly adhere to the spec, including
local Linux file systems like ext4 and XFS. For example, for
performance reasons, the atomicity requirements for reads are relaxed:
-processing reading from a file that is also being written may see torn
+processes reading from a file that is also being written may see torn
results.
Similarly, NFS has extremely weak consistency semantics when multiple
for managing and executing the parallel deletion of files.
There is a purge queue for every MDS rank. Purge queues consist of purge items
which contain nominal information from the inodes such as size and the layout
-(i.e. all other un-needed metadata information is discarded making it
+(i.e. all other unneeded metadata information is discarded, making it
independent of all metadata structures).
Deletion process
purge queue can process then the data pool usage might increase
substantially over time. In extreme scenarios, the purge queue
-backlog can become so huge that it can slacken the capacity reclaim
+backlog can become so large that it slows down capacity reclamation
- and the linux ``du`` command for CephFS might report inconsistent
- data compared to the CephFS Data pool.
+ and the Linux ``du`` command for CephFS might report inconsistent
+ data compared to the CephFS data pool.
There are a few tunable configs that MDS uses internally to throttle purge
queue processing:
reached. They will inevitably be allowed to write some amount of
data over the configured limit. How far over the quota they are
able to go depends primarily on the amount of time, not the amount
- of data. Generally speaking writers will be stopped within 10s of
+ of data. Generally speaking writers will be stopped within tens of
seconds of crossing the configured limit.
#. *Quotas are implemented in the kernel client 4.17 and higher.*
(e.g., ``/home/user``) based on the MDS capability, and a quota is
configured on an ancestor directory they do not have access to
(e.g., ``/home``), the client will not enforce it. When using
- path-based access restrictions be sure to configure the quota on
- the directory the client is restricted too (e.g., ``/home/user``)
+ path-based access restrictions, be sure to configure the quota on
+ the directory the client is restricted to (e.g., ``/home/user``)
or something nested beneath it.
-In case of a kernel client, it needs to have access to the parent
+In the case of a kernel client, it needs to have access to the parent
(e.g., ``/home/volumes/group``), the kclient needs to have access
to the parent (e.g., ``/home/volumes``).
- An example command to create such an user is as below::
+ An example command to create such a user::
$ ceph auth get-or-create client.guest mds 'allow r path=/home/volumes, allow rw path=/home/volumes/group' mgr 'allow rw' osd 'allow rw tag cephfs metadata=*' mon 'allow r'
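Following the placement advice above, a quota is set via extended attributes on the directory the client is restricted to (the mount point below is illustrative):

```shell
# Cap /mnt/cephfs/home/user at 10 GiB and 10,000 files
setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/cephfs/home/user
setfattr -n ceph.quota.max_files -v 10000 /mnt/cephfs/home/user

# Verify the configured limit
getfattr -n ceph.quota.max_bytes /mnt/cephfs/home/user
```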
Recovering the file system after catastrophic Monitor store loss
================================================================
-During rare occasions, all the monitor stores of a cluster may get corrupted
+During rare occasions, all the Monitor stores of a cluster may get corrupted
or lost. To recover the cluster in such a scenario, you need to rebuild the
-monitor stores using the OSDs (see :ref:`mon-store-recovery-using-osds`),
-and get back the pools intact (active+clean state). However, the rebuilt monitor
+Monitor stores using the OSDs (see :ref:`mon-store-recovery-using-osds`),
+and get back the pools intact (``active+clean`` state). However, the rebuilt Monitor
stores don't restore the file system maps ("FSMap"). Additional steps are required
to bring back the file system. The steps to recover a multiple active MDS file
system or multiple file systems are yet to be identified. Currently, only the steps
ceph tell mds.<fsname>:0 scrub start <path> [scrubopts] [tag]
-where ``scrubopts`` is a comma delimited list of ``recursive``, ``force``, or
+where ``scrubopts`` is a comma-delimited list of ``recursive``, ``force``, or
``repair`` and ``tag`` is an optional custom string tag (the default is a generated
UUID). An example command is::
}
``status`` shows the number of inodes that are scheduled to be scrubbed at any point in time.
-Hence, it can change on subsequent ``scrub status`` invocations. Also, a high level summary of
+Hence, it can change on subsequent ``scrub status`` invocations. Also, a high-level summary of
scrub operation (which includes the operation state and paths on which scrub is triggered)
gets displayed in ``ceph status``::
When a timestamp is passed (the `start` argument in the `add`, `remove`,
`activate` and `deactivate` subcommands) the ISO format `%Y-%m-%dT%H:%M:%S` will
-always be accepted. When either python3.7 or newer is used or
+always be accepted. When either Python 3.7 or newer is used or
https://github.com/movermeyer/backports.datetime_fromisoformat is installed, any
-valid ISO timestamp that is parsed by python's `datetime.fromisoformat` is valid.
+valid ISO timestamp that is parsed by Python's `datetime.fromisoformat` is valid.
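To illustrate the difference, a short sketch in Python (the wider ISO parsing is only available on Python 3.7+ or with the backport installed):

```python
from datetime import datetime

# The ISO format %Y-%m-%dT%H:%M:%S is always accepted
ts = datetime.strptime("2024-01-31T14:30:00", "%Y-%m-%dT%H:%M:%S")
print(ts)  # 2024-01-31 14:30:00

# With Python 3.7+ (or backports.datetime_fromisoformat), any timestamp
# accepted by datetime.fromisoformat works, e.g. fractional seconds
ts2 = datetime.fromisoformat("2024-01-31T14:30:00.500")
print(ts2.microsecond)  # 500000
```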
When no subcommand is supplied a synopsis is printed::
--------------------------
The module offers two subcommands to inspect existing schedules: `list` and
-`status`. Bother offer plain and json output via the optional `format` argument.
+`status`. Both offer plain and JSON output via the optional `format` argument.
The default is plain.
The `list` sub-command will list all schedules on a path in a short single line
format. It offers a `recursive` argument to list all schedules in the specified
point path prefix. Paths to snap-schedule should start at the appropriate
CephFS file system root and not at the host file system root.
e.g. if the Ceph File System is mounted at ``/mnt`` and the path under which
- snapshots need to be taken is ``/mnt/some/path`` then the acutal path required
+ snapshots need to be taken is ``/mnt/some/path`` then the actual path required
by snap-schedule is only ``/some/path``.
-.. note:: It should be noted that the "created" field in the snap-schedule status
+.. note:: The "created" field in the snap-schedule status
Limitations
-----------
-Snapshots are scheduled using python Timers. Under normal circumstances
+Snapshots are scheduled using Python Timers. Under normal circumstances
specifying 1h as the schedule will result in snapshots 1 hour apart fairly
-precisely. If the mgr daemon is under heavy load however, the Timer threads
+precisely. If the Manager daemon is under heavy load, however, the Timer threads
might not get scheduled right away, resulting in a slightly delayed snapshot. If
-this happens, the next snapshot will be schedule as if the previous one was not
+this happens, the next snapshot will be scheduled as if the previous one was not
delayed, i.e. one or more delayed snapshots will not cause drift in the overall
schedule.
-----------
A Ceph cluster may have zero or more CephFS *file systems*. Each CephFS has
-a human readable name (set at creation time with ``fs new``) and an integer
+a human-readable name (set at creation time with ``fs new``) and an integer
ID. The ID is called the file system cluster ID, or *FSCID*.
Each CephFS file system has a number of *ranks*, numbered beginning with zero.
metadata shard. Management of ranks is described in :doc:`/cephfs/multimds` .
Each CephFS ``ceph-mds`` daemon starts without a rank. It may be assigned one
-by the cluster's monitors. A daemon may only hold one rank at a time, and only
+by the cluster's Monitors. A daemon may only hold one rank at a time, and only
give up a rank when the ``ceph-mds`` process stops.
If a rank is not associated with any daemon, that rank is considered ``failed``.
Managing failover
-----------------
-If an MDS daemon stops communicating with the cluster's monitors, the monitors
+If an MDS daemon stops communicating with the cluster's Monitors, the Monitors
will wait ``mds_beacon_grace`` seconds (default 15) before marking the daemon as
-*laggy*. If a standby MDS is available, the monitor will immediately replace the
+*laggy*. If a standby MDS is available, the Monitor will immediately replace the
laggy daemon.
Each file system may specify a minimum number of standby daemons in order to be
considered healthy. This number includes daemons in the ``standby-replay`` state
-waiting for a ``rank`` to fail. (Note, the monitors will not assign a
+waiting for a ``rank`` to fail. (Note that the Monitors will not assign a
``standby-replay`` daemon to take over a failure for another ``rank`` or a
failure in a different CephFS file system). The pool of standby daemons not in
``replay`` counts towards any file system count. Each file system may set the
ceph fs set <fs name> allow_standby_replay <bool>
-Once set, the monitors will assign available standby daemons to follow the
+Once set, the Monitors will assign available standby daemons to follow the
active MDSs in that file system.
Once an MDS has entered the ``standby-replay`` state, it will only be used as a
CephFS provides a configuration option for MDS called ``mds_join_fs`` which
enforces this affinity.
-When failing over MDS daemons, a cluster's monitors will prefer standby daemons with
+When failing over MDS daemons, a cluster's Monitors will prefer standby daemons with
``mds_join_fs`` equal to the file system ``name`` with the failed ``rank``. If no
standby exists with ``mds_join_fs`` equal to the file system ``name``, it will
choose an unqualified standby (no setting for ``mds_join_fs``) for the replacement.
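As a sketch, affinity for a hypothetical daemon ``mds.a`` and file system ``cephfs`` is configured via the config subsystem:

```shell
# Prefer mds.a when filling failed ranks of file system "cephfs"
ceph config set mds.a mds_join_fs cephfs

# Remove the affinity, making mds.a an unqualified standby again
ceph config rm mds.a mds_join_fs
```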
Note, configuring MDS file system affinity does not change the behavior that
``standby-replay`` daemons are always selected before other standbys.
-Even further, the monitors will regularly examine the CephFS file systems even when
+Furthermore, the Monitors will regularly examine the CephFS file systems even when
stable to check if a standby with stronger affinity is available to replace an
MDS with lower affinity. This process is also done for ``standby-replay`` daemons:
if a regular standby has stronger affinity than the ``standby-replay`` MDS, it will
ceph config set mds mds_heartbeat_grace 3600
- .. note:: This causes the MDS to continue to send beacons to the monitors
+ .. note:: This causes the MDS to continue to send beacons to the Monitors
even when its internal "heartbeat" mechanism has not been reset (it has
not beaten) in one hour. In the past, this was achieved with the
- ``mds_beacon_grace`` monitor setting.
+ ``mds_beacon_grace`` Monitor setting.
* **Disable open-file-table prefetch.** Under normal circumstances, the MDS
prefetches directory contents during recovery as a way of heating up its
ceph daemon -d mds.<name> dump_ops_in_flight --debug-client=20 --debug-ms=1
-If you suspect a potential monitor issue, enable monitor debugging as well
+If you suspect a potential Monitor issue, enable Monitor debugging as well
(``--debug-monc=20``) by running a command of the following form:
.. prompt:: bash #
-------------
Unfortunately, the kernel client does not provide an admin socket. However,
-the the kernel on the client has `debugfs
+if the kernel on the client has `debugfs
<https://docs.kernel.org/filesystems/debugfs.html>`_ enabled, interfaces
similar to the admin socket are available.
* ``mdsc``: Dumps current requests to the MDS
* ``mdsmap``: Dumps the current MDSMap epoch and MDSes
* ``mds_sessions``: Dumps the current sessions to MDSes
-* ``monc``: Dumps the current maps from the monitor, and any "subscriptions" held
-* ``monmap``: Dumps the current monitor map epoch and monitors
+* ``monc``: Dumps the current maps from the Monitor, and any "subscriptions" held
+* ``monmap``: Dumps the current Monitor map epoch and Monitors
-* ``osdc``: Dumps the current ops in-flight to OSDs (ie, file data IO)
+* ``osdc``: Dumps the current ops in-flight to OSDs (i.e., file data IO)
* ``osdmap``: Dumps the current OSDMap epoch, pools, and OSDs
Extraordinary events include the following:
* Client Eviction
-* Missed Beacon ACK from the monitors
+* Missed Beacon ACK from the Monitors
* Missed Internal Heartbeats
In-memory log dump is disabled by default. This prevents production
.. note:: When higher log levels are set (``log_level`` greater than or equal
to ``10``) there is no reason to dump the in-memory logs. A lower gather
- level (``gather_level`` less than ``10``) is insufficient to gather in-
- memory logs. This means that a log level of greater than or equal to ``10``
+ level (``gather_level`` less than ``10``) is insufficient to gather
+ in-memory logs. This means that a log level of greater than or equal to ``10``
or a gather level of less than ``10`` in ``debug_mds`` prevents enabling
in-memory-log dumping. In such cases, if there is a failure, you must reset
the value of ``mds_extraordinary_events_dump_interval`` to ``0`` before
Disabling the Volumes Plugin
============================
In certain scenarios, the Volumes plugin may need to be disabled to prevent
-compromise for rest of the Ceph cluster. For details see:
+compromise for the rest of the Ceph cluster. For details see:
:ref:`disabling-volumes-plugin`
Reporting Issues
information as possible. Especially important information:
* Ceph versions installed on client and server
-* Whether you are using the kernel or fuse client
+* Whether you are using the kernel or FUSE client
* If you are using the kernel client, what kernel version?
* How many clients are in play, doing what kind of workload?
* If a system is 'stuck', is that affecting all clients or just one?
This only needs to be run once, and it is not necessary to
stop any other services while it runs. The command may take some
-time to execute, as it iterates overall objects in your metadata
+time to execute, as it iterates over all objects in your metadata
pool. It is safe to continue using your file system as normal while
it executes. If the command aborts for any reason, it is safe
to simply run it again.