<script src="{{ pathto('_static/js/html5shiv.min.js', 1) }}"></script>
<![endif]-->
{%- if not embedded %}
- {# XXX Sphinx 1.8.0 made this an external js-file, quick fix until we refactor the template to inherert more blocks directly from sphinx #}
+ {# XXX Sphinx 1.8.0 made this an external js-file, quick fix until we refactor the template to inherit more blocks directly from sphinx #}
{% if sphinx_version >= "1.8.0" %}
<script type="text/javascript" id="documentation_options" data-url_root="{{ url_root }}" src="{{ pathto('_static/documentation_options.js', 1) }}"></script>
{%- for scriptfile in script_files %}
rather refer to them as *Primary*, *Secondary*, and so forth. By convention,
the *Primary* is the first OSD in the *Acting Set*, and is responsible for
coordinating the peering process for each placement group where it acts as
-the *Primary*, and is the **ONLY** OSD that that will accept client-initiated
+the *Primary*, and is the **ONLY** OSD that will accept client-initiated
writes to objects for a given placement group where it acts as the *Primary*.
When a series of OSDs are responsible for a placement group, that series of
Implicit sizing
---------------
-Scenarios in which either devices are under-comitted or not all data devices are
+In scenarios in which either devices are under-committed or not all data devices are
currently ready for use (due to a broken disk for example), one can still rely
on `ceph-volume` automatic sizing.
Users can provide hints to `ceph-volume` as to how many data devices should have
Each host within the cluster is expected to operate within the same Linux
Security Module (LSM) state. For example, if the majority of the hosts are
running with SELINUX in enforcing mode, any host not running in this mode is
-flagged as an anomaly and a healtcheck (WARNING) state raised.
+flagged as an anomaly and a healthcheck (WARNING) state is raised.
CEPHADM_CHECK_SUBSCRIPTION
~~~~~~~~~~~~~~~~~~~~~~~~~~
ceph orch apply alertmanager
#. Deploy Prometheus. A single Prometheus instance is sufficient, but
- for high availablility (HA) you might want to deploy two:
+ for high availability (HA) you might want to deploy two:
.. prompt:: bash #
Sizes don't have to be specified exclusively in Gigabytes(G).
-Other units of size are supported: Megabyte(M), Gigabyte(G) and Terrabyte(T).
+Other units of size are supported: Megabyte(M), Gigabyte(G) and Terabyte(T).
Appending the (B) for byte is also supported: ``MB``, ``GB``, ``TB``.
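For illustration, a sketch of an OSD service specification using such a size filter (the service id, host pattern, and the ``1TB:4TB`` range are placeholders, not values taken from this document)::

    service_type: osd
    service_id: example_size_filter
    placement:
      host_pattern: '*'        # placeholder: apply to all hosts
    spec:
      data_devices:
        size: '1TB:4TB'        # only use data devices between 1 TB and 4 TB

Open-ended ranges such as ``size: '2TB:'`` ("2 TB or larger") follow the same unit rules.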
this with a timeout: if the asynchronous read does not come back within half
the balancing tick interval the operation is cancelled and a Connection Timeout
error is returned. By default, the balancing tick interval is 10 seconds, so
-Mantle will use a 5 second second timeout. This design allows Mantle to
+Mantle will use a 5 second timeout. This design allows Mantle to
immediately return an error if anything RADOS-related goes wrong.
We use this implementation because we do not want to do a blocking OSD read
lttng enable-event --userspace osd:*
lttng start
-Perform some ceph operatin::
+Perform some Ceph operation::
rados bench -p ec 5 write
Auditing
Auditing takes the results from both *authentication and authorization* and
- records them into an audit log. The audit log records records all actions
+ records them into an audit log. The audit log records all actions
taking by/during the authentication and authorization for later review by
the administrators. While authentication and authorization are preventive
systems (in which unauthorized access is prevented), auditing is a reactive
Given that the *keytab client file* is/should already be copied and available at the
- Kerberos client (Ceph cluster node), we should be able to athenticate using it before
- going forward: ::
+ Kerberos client (Ceph cluster node), we should be able to authenticate using it before
+ continuing: ::
# kdestroy -A && kinit -k -t /etc/gss_client_mon1.ktab -f 'ceph/ceph-mon1@MYDOMAIN.COM' && klist -f
Ticket cache: KEYRING:persistent:0:0
6. Name Resolution
- As mentioned earlier, Kerberos *relies heavly on name resolution*. Most of
+ As mentioned earlier, Kerberos *relies heavily on name resolution*. Most of
the Kerberos issues are usually related to name resolution, since Kerberos
is *very picky* on both *systems names* and *host lookups*.
an almost "real" environment.
- Safe and isolated. Does not depend of the things you have installed in
your machine. And the vms are isolated from your environment.
- - Easy to work "dev" environment. For "not compilated" software pieces,
+ - An easy "dev" environment to work in. For "not compiled" software pieces,
for example any mgr module. It is an environment that allow you to test your
changes interactively.
but we suggest to use the container image approach.
So things to do:
- - 1. Review `requeriments <https://kcli.readthedocs.io/en/latest/#libvirt-hypervisor-requisites>`_
+ - 1. Review `requirements <https://kcli.readthedocs.io/en/latest/#libvirt-hypervisor-requisites>`_
and install/configure whatever is needed to meet them.
- 2. get the kcli image and create one alias for executing the kcli command
::
create loopback devices capable of holding osds.
.. note:: Each osd will require 5GiB of space.
-After bootstraping the cluster you can go inside the seed box in which you'll be
-able to run cehpadm commands::
+After bootstrapping the cluster you can go inside the seed box in which you'll be
+able to run cephadm commands::
box -v cluster sh
[root@8d52a7860245] cephadm --help
By using the --check option first, the Admin can choose whether to proceed. This
workflow is obviously optional for the CLI user, but could be integrated into the
-UI workflow to help less experienced Administators manage the cluster.
+UI workflow to help less experienced administrators manage the cluster.
By adopting this two-phase approach, a UI based workflow would look something
like this.
The message C sends to A in phase I is build in ``CephxClientHandler::build_request()`` (in
``auth/cephx/CephxClientHandler.cc``). This routine is used for more than one purpose.
In this case, we first call ``validate_tickets()`` (from routine
-``CephXTicektManager::validate_tickets()`` which lives in ``auth/cephx/CephxProtocol.h``).
+``CephXTicketManager::validate_tickets()`` which lives in ``auth/cephx/CephxProtocol.h``).
This code runs through the list of possible tickets to determine what we need, setting values
in the ``need`` flag as necessary. Then we call ``ticket.get_handler()``. This routine
(in ``CephxProtocol.h``) finds a ticket of the specified type (a ticket to perform
To ensure that prebuilt packages are available by the jenkins agents, we need to
upload them to either ``apt-mirror.front.sepia.ceph.com`` or `chacra`_. To upload
-packages to the former would require the help our our lab administrator, so if we
+packages to the former would require the help of our lab administrator, so if we
want to maintain the package repositories on regular basis, a better choice would be
to manage them using `chacractl`_. `chacra`_ represents packages repositories using
a resource hierarchy, like::
ref
a unique id of a given version of a set packages. This id is used to reference
the set packages under the ``<project>/<branch>``. It is a good practice to
- version the packaging recipes, like the ``debian`` directory for building deb
- packages and the ``spec`` for building rpm packages, and use the sha1 of the
- packaging receipe for the ``ref``. But you could also use a random string for
+ version the packaging recipes, like the ``debian`` directory for building DEB
+ packages and the ``spec`` for building RPM packages, and use the SHA1 of the
+ packaging recipe for the ``ref``. But you could also use a random string for
``ref``, like the tag name of the built source tree.
distro
std::cout << "oops, the optimistic path generates a new error!";
return crimson::ct_error::input_output_error::make();
},
- // we have a special handler to delegate the handling up. For conveience,
+ // we have a special handler to delegate the handling up. For convenience,
// the same behaviour is available as single argument-taking variant of
// `safe_then()`.
ertr::pass_further{});
.. describe:: waiting_for_healthy
If an OSD daemon is able to connected to its heartbeat peers, and its own
- internal hearbeat does not fail, it is considered healthy. Otherwise, it
+ internal heartbeat does not fail, it is considered healthy. Otherwise, it
puts itself in the state of `waiting_for_healthy`, and check its own
reachability and internal heartbeat periodically.
* **PoseidonStore uses hybrid update strategies for different data size, similar to BlueStore.**
As we discussed, both in-place and out-of-place update strategies have their pros and cons.
- Since CPU is only bottlenecked under small I/O workloads, we chose update-in-place for small I/Os to mininize CPU consumption
+ Since CPU is only bottlenecked under small I/O workloads, we chose update-in-place for small I/Os to minimize CPU consumption
while choosing update-out-of-place for large I/O to avoid double write. Double write for small data may be better than host-GC overhead
in terms of CPU consumption in the long run. Although it leaves GC entirely up to SSDs,
#. Crash occurs right after writing Data blocks
- Data partition --> | Data blocks |
- - We don't need to care this case. Data is not alloacted yet in reality. The blocks will be reused.
+ - We don't need to care about this case. Data is not allocated yet. The blocks will be reused.
#. Crash occurs right after WAL
- Data partition --> | Data blocks |
Usage Patterns
==============
-The different Ceph interface layers present potentially different oportunities
-and costs for deduplication and tiering in general.
+Each Ceph interface layer presents unique opportunities and costs for
+deduplication and tiering in general.
RadosGW
-------
+++ /dev/null
-=================================
- Deploying a development cluster
-=================================
-
-In order to develop on ceph, a Ceph utility,
-*vstart.sh*, allows you to deploy fake local cluster for development purpose.
-
-Usage
-=====
-
-It allows to deploy a fake local cluster on your machine for development purpose. It starts rgw, mon, osd and/or mds, or all of them if not specified.
-
-To start your development cluster, type the following::
-
- vstart.sh [OPTIONS]...
-
-In order to stop the cluster, you can type::
-
- ./stop.sh
-
-Options
-=======
-
-.. option:: -b, --bluestore
-
- Use bluestore as the objectstore backend for osds.
-
-.. option:: --cache <pool>
-
- Set a cache-tier for the specified pool.
-
-.. option:: -d, --debug
-
- Launch in debug mode.
-
-.. option:: -e
-
- Create an erasure pool.
-
-.. option:: -f, --filestore
-
- Use filestore as the osd objectstore backend.
-
-.. option:: --hitset <pool> <hit_set_type>
-
- Enable hitset tracking.
-
-.. option:: -i ip_address
-
- Bind to the specified *ip_address* instead of guessing and resolve from hostname.
-
-.. option:: -k
-
- Keep old configuration files instead of overwriting these.
-
-.. option:: -K, --kstore
-
- Use kstore as the osd objectstore backend.
-
-.. option:: -l, --localhost
-
- Use localhost instead of hostname.
-
-.. option:: -m ip[:port]
-
- Specifies monitor *ip* address and *port*.
-
-.. option:: --memstore
-
- Use memstore as the objectstore backend for osds
-
-.. option:: --multimds <count>
-
- Allow multimds with maximum active count.
-
-.. option:: -n, --new
-
- Create a new cluster.
-
-.. option:: -N, --not-new
-
- Reuse existing cluster config (default).
-
-.. option:: --nodaemon
-
- Use ceph-run as wrapper for mon/osd/mds.
-
-.. option:: --nolockdep
-
- Disable lockdep
-
-.. option:: -o <config>
-
- Add *config* to all sections in the ceph configuration.
-
-.. option:: --rgw_port <port>
-
- Specify ceph rgw http listen port.
-
-.. option:: --rgw_frontend <frontend>
-
- Specify the rgw frontend configuration (default is civetweb).
-
-.. option:: --rgw_compression <compression_type>
-
- Specify the rgw compression plugin (default is disabled).
-
-.. option:: --smallmds
-
- Configure mds with small limit cache size.
-
-.. option:: --short
-
- Short object names only; necessary for ext4 dev
-
-.. option:: --valgrind[_{osd,mds,mon}] 'valgrind_toolname [args...]'
-
- Launch the osd/mds/mon/all the ceph binaries using valgrind with the specified tool and arguments.
-
-.. option:: --without-dashboard
-
- Do not run using mgr dashboard.
-
-.. option:: -x
-
- Enable cephx (on by default).
-
-.. option:: -X
-
- Disable cephx.
-
-
-Environment variables
-=====================
-
-{OSD,MDS,MON,RGW}
-
-These environment variables will contains the number of instances of the desired ceph process you want to start.
-
-Example: ::
-
- OSD=3 MON=3 RGW=1 vstart.sh
-
-
-============================================================
- Deploying multiple development clusters on the same machine
-============================================================
-
-In order to bring up multiple ceph clusters on the same machine, *mstart.sh* a
-small wrapper around the above *vstart* can help.
-
-Usage
-=====
-
-To start multiple clusters, you would run mstart for each cluster you would want
-to deploy, and it will start monitors, rgws for each cluster on different ports
-allowing you to run multiple mons, rgws etc. on the same cluster. Invoke it in
-the following way::
-
- mstart.sh <cluster-name> <vstart options>
-
-For eg::
-
- ./mstart.sh cluster1 -n
-
-
-For stopping the cluster, you do::
-
- ./mstop.sh <cluster-name>
--- /dev/null
+=================================
+ Deploying a development cluster
+=================================
+
+In order to develop on Ceph, you can use the *vstart.sh* utility to deploy a
+fake local cluster for development purposes.
+
+Usage
+=====
+
+It allows you to deploy a fake local cluster on your machine for development purposes. It starts rgw, mon, osd and/or mds, or all of them if not specified.
+
+To start your development cluster, type the following::
+
+ vstart.sh [OPTIONS]...
+
+In order to stop the cluster, you can type::
+
+ ./stop.sh
+
+Options
+=======
+
+.. option:: -b, --bluestore
+
+ Use bluestore as the objectstore backend for osds.
+
+.. option:: --cache <pool>
+
+ Set a cache-tier for the specified pool.
+
+.. option:: -d, --debug
+
+ Launch in debug mode.
+
+.. option:: -e
+
+ Create an erasure pool.
+
+.. option:: -f, --filestore
+
+ Use filestore as the osd objectstore backend.
+
+.. option:: --hitset <pool> <hit_set_type>
+
+ Enable hitset tracking.
+
+.. option:: -i ip_address
+
+ Bind to the specified *ip_address* instead of guessing and resolving it from the hostname.
+
+.. option:: -k
+
+ Keep old configuration files instead of overwriting them.
+
+.. option:: -K, --kstore
+
+ Use kstore as the osd objectstore backend.
+
+.. option:: -l, --localhost
+
+ Use localhost instead of hostname.
+
+.. option:: -m ip[:port]
+
+ Specifies monitor *ip* address and *port*.
+
+.. option:: --memstore
+
+ Use memstore as the objectstore backend for osds
+
+.. option:: --multimds <count>
+
+ Allow multimds with maximum active count.
+
+.. option:: -n, --new
+
+ Create a new cluster.
+
+.. option:: -N, --not-new
+
+ Reuse existing cluster config (default).
+
+.. option:: --nodaemon
+
+ Use ceph-run as wrapper for mon/osd/mds.
+
+.. option:: --nolockdep
+
+ Disable lockdep
+
+.. option:: -o <config>
+
+ Add *config* to all sections in the ceph configuration.
+
+.. option:: --rgw_port <port>
+
+ Specify ceph rgw http listen port.
+
+.. option:: --rgw_frontend <frontend>
+
+ Specify the rgw frontend configuration (default is civetweb).
+
+.. option:: --rgw_compression <compression_type>
+
+ Specify the rgw compression plugin (default is disabled).
+
+.. option:: --smallmds
+
+ Configure mds with a small cache size limit.
+
+.. option:: --short
+
+ Short object names only; necessary for ext4 dev
+
+.. option:: --valgrind[_{osd,mds,mon}] 'valgrind_toolname [args...]'
+
+ Launch the osd/mds/mon/all the ceph binaries using valgrind with the specified tool and arguments.
+
+.. option:: --without-dashboard
+
+ Do not run using mgr dashboard.
+
+.. option:: -x
+
+ Enable cephx (on by default).
+
+.. option:: -X
+
+ Disable cephx.
+
+
+Environment variables
+=====================
+
+{OSD,MDS,MON,RGW}
+
+These environment variables will contain the number of instances of the desired ceph process you want to start.
+
+Example: ::
+
+ OSD=3 MON=3 RGW=1 vstart.sh
+
+
+============================================================
+ Deploying multiple development clusters on the same machine
+============================================================
+
+In order to bring up multiple ceph clusters on the same machine, *mstart.sh*, a
+small wrapper around the above *vstart*, can help.
+
+Usage
+=====
+
+To start multiple clusters, run mstart for each cluster you want to deploy; it
+will start monitors and rgws for each cluster on different ports, allowing you
+to run multiple mons, rgws, etc. on the same machine. Invoke it in
+the following way::
+
+ mstart.sh <cluster-name> <vstart options>
+
+For example::
+
+ ./mstart.sh cluster1 -n
+
+
+For stopping the cluster, you do::
+
+ ./mstop.sh <cluster-name>
have to create a Redmine tracker issue: the case of minor documentation changes.
Simple documentation cleanup does not require a corresponding tracker issue.
-Major documenatation changes do require a tracker issue. Major documentation
+Major documentation changes do require a tracker issue. Major documentation
changes include adding new documentation chapters or files, and making
substantial changes to the structure or content of the documentation.
The second command (git checkout -b fix_1) creates a "bugfix branch" called
"fix_1" in your local working copy of the repository. The changes that you make
-in order to fix the bug will be commited to this branch.
+in order to fix the bug will be committed to this branch.
The third command (git push -u origin fix_1) pushes the bugfix branch from
your local working repository to your fork of the upstream repository.
Using a browser extension to auto-fill the merge message
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-If you use a browser for merging Github PRs, the easiest way to fill in
-the merge message is with the `"Ceph Github Helper Extension"
+If you use a browser for merging GitHub PRs, the easiest way to fill in
+the merge message is with the `"Ceph GitHub Helper Extension"
<https://github.com/tspmelo/ceph-github-helper>`_ (available for `Chrome
<https://chrome.google.com/webstore/detail/ceph-github-helper/ikpfebikkeabmdnccbimlomheocpgkmn>`_
and `Firefox <https://addons.mozilla.org/en-US/firefox/addon/ceph-github-helper/>`_).
-After enabling this extension, if you go to a Github PR page, a vertical helper
+After enabling this extension, if you go to a GitHub PR page, a vertical helper
will be displayed at the top-right corner. If you click on the user silhouette button
the merge message input will be automatically populated.
.. note::
- Every ``vstart`` environment needs Ceph `to be compiled`_ from its Github
+ Every ``vstart`` environment needs Ceph `to be compiled`_ from its GitHub
repository, though Docker environments simplify that step by providing a
shell script that contains those instructions.
Additional information for developers can also be found in the `Developer
Guide`_.
-.. _Deploying a development cluster: https://docs.ceph.com/docs/master/dev/dev_cluster_deployement/
+.. _Deploying a development cluster: https://docs.ceph.com/docs/master/dev/dev_cluster_deployment/
.. _Developer Guide: https://docs.ceph.com/docs/master/dev/quick_guide/
Host-based vs Docker-based Development Environments
`ceph-dev`_ is an exception to this rule as one of the options it provides
is `build-free`_. This is accomplished through a Ceph installation using
- RPM system packages. You will still be able to work with a local Github
+ RPM system packages. You will still be able to work with a local GitHub
repository like you are used to.
local dashboardSchema(title, uid, time_from, refresh, schemaVersion, tags,timezone, timepicker)
-To add a graph panel we can spcify the graph schema in a local function such as -
+To add a graph panel we can specify the graph schema in a local function such as -
::
Ceph API and therefore:
#. The versioned OpenAPI specification should be updated explicitly: ``tox -e openapi-fix``.
-#. The team @ceph/api will be requested for reviews (this is automated via Github CODEOWNERS), in order to asses the impact of changes.
+#. The team @ceph/api will be requested for reviews (this is automated via GitHub CODEOWNERS), in order to assess the impact of changes.
Additionally, Sphinx documentation can be generated from the OpenAPI
specification with ``tox -e openapi-doc``.
.. note::
It is recommended to compile without any optimizations (``-O0`` gcc flag)
- in order to avoid elimintaion of intermediate values.
+ in order to avoid elimination of intermediate values.
Stopping for breakpoints while debugging may cause timeouts, so the following
configuration options are suggested::
--interactive drops a Python shell when a test fails
--log-ps-output logs ps output; might be useful while debugging
--teardown tears Ceph cluster down after test(s) has finished
- runnng
+ running
--kclient use the kernel cephfs client instead of FUSE
--brxnet=<net/mask> specify a new net/mask for the mount clients' network
namespace container (Default: 192.168.0.0/16)
Viewing Test Results
--------------------
-When a teuthology run has been completed successfully, use `pulpito`_ dasboard
+When a teuthology run has been completed successfully, use the `pulpito`_ dashboard
to view the results::
http://pulpito.front.sepia.ceph.com/<job-name>/<job-id>/
documentation and better understanding of integration tests.
Tests can be documented by embedding ``meta:`` annotations in the yaml files
-used to define the tests. The results can be seen in the `teuthology-desribe
+used to define the tests. The results can be seen in the `teuthology-describe
usecases`_
Since this is a new feature, many yaml files have yet to be annotated.
.. _Sepia Lab: https://wiki.sepia.ceph.com/doku.php
.. _teuthology repository: https://github.com/ceph/teuthology
.. _teuthology framework: https://github.com/ceph/teuthology
-.. _teuthology-desribe usecases: https://gist.github.com/jdurgin/09711d5923b583f60afc
+.. _teuthology-describe usecases: https://gist.github.com/jdurgin/09711d5923b583f60afc
.. _ceph-deploy man page: ../../../../man/8/ceph-deploy
#. To ensure that the build process has been initiated, confirm that the branch
name has appeared in the list of "Latest Builds Available" at `Shaman`_.
- Soon after you start the build process, the testing infrastructrure adds
+ Soon after you start the build process, the testing infrastructure adds
other, similarly-named builds to the list of "Latest Builds Available".
The names of these new builds will contain the names of various Linux
distributions of Linux and will be used to test your build against those
.. _teuthology_testing_qa_changes:
-Testing QA changes (without re-building binaires)
+Testing QA changes (without re-building binaries)
*************************************************
If you are making changes only in the ``qa/`` directory, you do not have to
``$yourname`` is replaced with your name. Identifying your branch with your
name makes your branch easily findable on Shaman and Pulpito.
-If you are using one of the stable branches (for example, nautilis, mimic,
-etc.), include the name of that stable branch in your ceph-ci branch name.
+If you are using one of the stable branches (`quincy`, `pacific`, etc.), include
+the name of that stable branch in your ceph-ci branch name.
For example, the ``feature-x`` PR branch should be named
``wip-feature-x-nautilus``. *This is not just a convention. This ensures that your branch is built in the correct environment.*
*
* Detailed description when necessary
*
- * preconditons, postconditions, warnings, bugs or other notes
+ * preconditions, postconditions, warnings, bugs or other notes
*
* parameter reference
* return value (if non-void)
--------
You can use Inkscape to generate scalable vector graphics.
-https://inkscape.org/en/ for restructedText documents.
+https://inkscape.org/en/ for reStructuredText documents.
If you generate diagrams with Inkscape, you should
commit both the Scalable Vector Graphics (SVG) file and export a
The ``DECODE_START`` macro takes an argument specifying the most recent
message version that the code can handle. This is compared with the
compat_version encoded in the message, and if the message is too new then
-an exception will be thrown. Because changes to compat_verison are rare,
+an exception will be thrown. Because changes to compat_version are rare,
this isn't usually something to worry about when adding fields.
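For reference, a minimal sketch of the usual encode/decode pattern (the ``Foo`` type and its ``bar``/``new_field`` members are hypothetical)::

    void Foo::encode(bufferlist& bl) const {
      ENCODE_START(2, 1, bl);   // struct version 2, compatible back to version 1
      encode(bar, bl);
      encode(new_field, bl);    // hypothetical field appended in version 2
      ENCODE_FINISH(bl);
    }

    void Foo::decode(bufferlist::const_iterator& p) {
      DECODE_START(2, p);       // throws if the encoded compat_version is newer than 2
      decode(bar, p);
      if (struct_v >= 2)        // struct_v is provided by DECODE_START
        decode(new_field, p);
      DECODE_FINISH(p);
    }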
In practice, changes to encoding usually involve simply adding the desired fields
But it's noteworthy that ``MgrStatMonitor`` does *not* prepare the reports by itself,
it just stores whatever the health reports received from mgr!
-ceph-mgr -- A Delegate Aggegator
---------------------------------
+ceph-mgr -- A Delegate Aggregator
+---------------------------------
In Ceph, mgr is created to share the burden of monitor, which is used to establish
the consensus of information which is critical to keep the cluster function.
Apparently, osdmap, mdsmap and monmap fall into this category. But what about the
aggregated statistics of the cluster? They are crucial for the administrator to
understand the status of the cluster, but they might not be that important to keep
-the cluster running. To address this scability issue, we offloaded the work of
+the cluster running. To address this scalability issue, we offloaded the work of
collecting and aggregating the metrics to mgr.
Now, mgr is responsible for receiving and processing the ``MPGStats`` messages from
The state transition entries are of type `sm_state_t` from `src/mds/locks.h` source. TODO: Describe these in detail.
-We reach a point where the MDS fills in `LockOpVec` and invokes `Locker::acquire_locks()`, which according to the lock type and the mode (`rdlock`, etc..) tries to acquire that particular lock. Starting state for the lock is `LOCK_SYNC` (this may not always be the case, but consider this for simplicity). To acquire `xlock` for `iauth`, the MDS refers to the state transition table. If the current state allows the lock to be acquired, the MDS grabs the lock (which is just incrementing a counter). The current state (`LOCK_SYNC`) does not allow `xlock` to be acquired (column `x` in `LOCK_SYNC` state), thereby requiring a lock state switch. At this point, the MDS switches to an intermediate state `LOCK_SYNC_LOCK` - signifying transitioning from `LOCK_SYNC` to `LOCK_LOCK` state. The intermediate state has a couple of purposes - a. The intermediate state defines what caps are allowed to be held by cilents thereby revoking caps that are not allowed be held in this state, and b. preventing new locks to be acquired. At this point the MDS sends cap revoke messages to clients::
+We reach a point where the MDS fills in `LockOpVec` and invokes
+`Locker::acquire_locks()`, which according to the lock type and the mode
+(`rdlock`, etc.) tries to acquire that particular lock. The starting state for
+the lock is `LOCK_SYNC` (this may not always be the case, but consider this
+for simplicity). To acquire `xlock` for `iauth`, the MDS refers to the state
+transition table. If the current state allows the lock to be acquired, the MDS
+grabs the lock (which is just incrementing a counter). The current state
+(`LOCK_SYNC`) does not allow `xlock` to be acquired (column `x` in `LOCK_SYNC`
+state), thereby requiring a lock state switch. At this point, the MDS switches
+to an intermediate state `LOCK_SYNC_LOCK` - signifying transitioning from
+`LOCK_SYNC` to `LOCK_LOCK` state. The intermediate state has a couple of
+purposes: a. the intermediate state defines what caps are allowed to be held
+by clients, thereby revoking caps that are not allowed to be held in this state,
+and b. it prevents new locks from being acquired. At this point the MDS sends cap
+revoke messages to clients::
2021-11-22T07:18:20.040-0500 7fa66a3bd700 7 mds.0.locker: issue_caps allowed=pLsXsFscrl, xlocker allowed=pLsXsFscrl on [inode 0x10000000003 [2,head] /testfile auth v142 ap=1 DIRTYPARENT s=0 n(v0 rc2021-11-22T06:21:45.015746-0500 1=1+0) (iauth sync->lock) (iversion lock) caps={94134=pAsLsXsFscr/-@1,94138=pLsXsFscr/-@1} | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 authpin=1 0x5633ffdac000]
2021-11-22T07:18:20.040-0500 7fa66a3bd700 20 mds.0.locker: client.94134 pending pAsLsXsFscr allowed pLsXsFscrl wanted -
2021-11-22T07:18:20.040-0500 7fa66a3bd700 7 mds.0.locker: sending MClientCaps to client.94134 seq 2 new pending pLsXsFscr was pAsLsXsFscr
-As seen above, `client.94134` has `As` caps, which is getting revoked by the MDS. After the caps have been revoked, the MDS can continue to transition to further states: `LOCK_SYNC_LOCK` to `LOCK_LOCK`. Since the goal is to acquire `xlock`, the state transition conitnues (as per the lock transition state machine)::
+As seen above, `client.94134` has `As` caps, which are getting revoked by the
+MDS. After the caps have been revoked, the MDS can continue to transition to
+further states: `LOCK_SYNC_LOCK` to `LOCK_LOCK`. Since the goal is to acquire
+`xlock`, the state transition continues (as per the lock transition state
+machine)::
LOCK_LOCK -> LOCK_LOCK_XLOCK
LOCK_LOCK_XLOCK -> LOCK_PREXLOCK
bool is_compress
uint32_t method
- - the server determines whether compression is possible according to its' configuration.
- - if it is possible, it will pick its' most prioritizied compression method that is also supprorted by the client.
+ - the server determines whether compression is possible according to the configuration.
+ - if it is possible, it will pick the most prioritized compression method that is also supported by the client.
- if none exists, it will determine that session between the peers will be handled without compression.
.. ditaa::
- the target addr is who the client is trying to connect *to*, so
that the server side can close the connection if the client is
talking to the wrong daemon.
- - type.gid (entity_name_t) is set here, by combinging the type shared in the hello
+ - type.gid (entity_name_t) is set here, by combining the type shared in the hello
frame with the gid here. this means we don't need it
in the header of every message. it also means that we can't send
messages "from" other entity_name_t's. the current
shouldn't break any existing functionality. implementation will
likely want to mask this against what the authenticated credential
allows.
- - cookie is the client coookie used to identify a session, and can be used
+ - cookie is the client cookie used to identify a session, and can be used
to reconnect to an existing session.
- we've dropped the 'protocol_version' field from msgr1
u32le front_crc; // Checksums of the various sections.
u32le middle_crc; //
u32le data_crc; //
- u64le sig; // Crypographic signature.
+ u64le sig; // Cryptographic signature.
u8 flags;
}
--------------------------------------------------
Some writes will be small enough to not require updating all of the
-shards holding data blocks. For write amplification minization
+shards holding data blocks. For write amplification minimization
reasons, it would be best to avoid writing to those shards at all,
and delay even sending the log entries until the next write which
actually hits that shard.
Thus, we introduce a simple convention: consecutive clones which
share a reference at the same offset share the same refcount. This
means that a write that invokes ``make_writeable`` may decrease refcounts,
-but not increase them. This has some conquences for removing clones.
+but not increase them. This has some consequences for removing clones.
Consider the following sequence ::
write foo [0, 1024)
while any has exceeded the hard limit.
Tighter soft limits will cause writeback to happen more quickly,
-but may cause the OSD to miss oportunities for write coalescing.
+but may cause the OSD to miss opportunities for write coalescing.
Tighter hard limits may cause a reduction in latency variance by
reducing time spent flushing the journal, but may reduce writeback
parallelism.
Setting these properly should help to smooth out op latencies by
mostly avoiding the hard limit.
-See FileStore::throttle_ops and FileSTore::thottle_bytes.
+See FileStore::throttle_ops and FileStore::throttle_bytes.
journal usage throttle
----------------------
inc / 6 / - / 10 /
* encode front hb addr
* osdmap::incremental ext version bumped to 10
- * osdmap's ext versiont bumped to 10
+ * osdmap's ext version bumped to 10
* because we're adding osd_addrs->hb_front_addr to map
// below we have the change to ENCODE_START() for osdmap and others
Introduction
============
-Dedupliation, as described in ../deduplication.rst, needs a way to
+Deduplication, as described in ../deduplication.rst, needs a way to
maintain a target pool of deduplicated chunks with atomic ref
refcounting. To that end, there exists an osd object class
refcount responsible for using the object class machinery to
}
},
-This represents the 2d histogram, consisting of 9 history entrires and 32 value groups per each history entry.
+This represents the 2D histogram, consisting of 9 history entries and 32 value groups per history entry.
"Ranges" element denote value bounds for each of value groups. "Buckets" denote amount of value groups ("buckets"),
-"Min" is a minimum accepted valaue, "quant_size" is quantization unit and "scale_type" is either "log2" (logarhitmic
+"Min" is a minimum accepted value, "quant_size" is quantization unit and "scale_type" is either "log2" (logarithmic
scale) or "linear" (linear scale).
You can use histogram_dump.py tool (see src/tools/histogram_dump.py) for quick visualisation of existing histogram
data.
--------------------------------
Ceph contains a script called ``vstart.sh`` (see also
-:doc:`/dev/dev_cluster_deployement`) which allows developers to quickly test
+:doc:`/dev/dev_cluster_deployment`) which allows developers to quickly test
their code using a simple deployment on your development system. Once the build
finishes successfully, start the ceph deployment using the following command:
This project aims to enable Ceph to work on zoned storage drives and at the same
time explore research problems related to adopting this new interface. The
-first target is to enable non-ovewrite workloads (e.g. RGW) on host-managed SMR
+first target is to enable non-overwrite workloads (e.g. RGW) on host-managed SMR
(HM-SMR) drives and explore cleaning (garbage collection) policies. HM-SMR
drives are high capacity hard drives with the ZBC/ZAC interface. The longer
term goal is to support ZNS SSDs, as they become available, as well as overwrite
* `grnet <https://grnet.gr/>`_
* `Monash University <http://www.monash.edu/>`_
* `NRF SARAO <http://www.ska.ac.za/about/sarao/>`_
-* `Science & Technology Facilities Councel (STFC) <https://stfc.ukri.org/>`_
+* `Science & Technology Facilities Council (STFC) <https://stfc.ukri.org/>`_
* `University of Michigan <http://www.osris.org/>`_
* `SWITCH <https://switch.ch/>`_
* ceph-ansible is widely deployed.
* ceph-ansible is not integrated with the new orchestrator APIs,
- introduced in Nautlius and Octopus, which means that newer
+ introduced in Nautilus and Octopus, which means that newer
management features and dashboard integration are not available.
=========
You can easily mirror Ceph yourself using a Bash script and rsync. An easy-to-use
-script can be found at `Github`_.
+script can be found at `GitHub`_.
When mirroring Ceph, please keep the following guidelines in mind:
become a official mirror.
To make sure all mirrors meet the same standards some requirements have been
-set for all mirrors. These can be found on `Github`_.
+set for all mirrors. These can be found on `GitHub`_.
If you want to apply for an official mirror, please contact the ceph-users mailinglist.
-.. _Github: https://github.com/ceph/ceph/tree/master/mirroring
+.. _GitHub: https://github.com/ceph/ceph/tree/master/mirroring
In order to mount Ceph filesystems, ``ceph-dokan`` requires Dokany to be
installed. You may fetch the installer as well as the source code from the
-Dokany Github repository: https://github.com/dokan-dev/dokany/releases
+Dokany GitHub repository: https://github.com/dokan-dev/dokany/releases
The minimum supported Dokany version is 1.3.1. At the time of the writing,
Dokany 2.0 is in Beta stage and is unsupported.
handle efficiently. Knowing that the lines are sorted allows this to
be done efficiently with minimal memory overhead.
-The sorting of each file needs to be done lexcially. Most POSIX
+The sorting of each file needs to be done lexically. Most POSIX
systems use the *LANG* environment variable to determine the `sort`
tool's sorting order. To sort lexically we would need something such
as:
* set-inc-osdmap
* mark-complete
* reset-last-complete
-* apply-layour-settings
+* apply-layout-settings
* update-mon-db
* dump-export
* trim-pg-log
Availability
============
-**ceph-objectstore-tool** is part of Ceph, a massively scalable, open-source, distributed storage system. **ceph-objectstore-tool** is provided by the package `ceph-osd`. Refer to the Ceph documentation at htpp://ceph.com/docs for more information.
+**ceph-objectstore-tool** is part of Ceph, a massively scalable, open-source, distributed storage system. **ceph-objectstore-tool** is provided by the package `ceph-osd`. Refer to the Ceph documentation at http://ceph.com/docs for more information.
ceph config rm <who> <option>
Subcommand ``log`` to show recent history of config changes. If `count` option
-is omitted it defeaults to 10.
+is omitted it defaults to 10.
Usage::
Arguments:
* [--fsid FSID] cluster FSID
-* [--config-json CONFIG_JSON] JSON file with config and (client.bootrap-osd) key
+* [--config-json CONFIG_JSON] JSON file with config and (client.bootstrap-osd) key
* [--config CONFIG, -c CONFIG] ceph conf file
* [--keyring KEYRING, -k KEYRING] ceph.keyring to pass through to the container
* the weight of this OSD.
#. Looking at the number of placement groups held by 3 OSDs. We have
- * avarge, stddev, stddev/average, expected stddev, expected stddev / average
+ * average, stddev, stddev/average, expected stddev, expected stddev / average
* min and max
#. The number of placement groups mapping to n OSDs. In this case, all 8 placement
groups are mapping to 3 different OSDs.
It is recommended to enable the cache with a 10 seconds TTL when there are 500+
osds or 10k+ pgs as internal structures might increase in size, and cause latency
issues when requesting large structures. As an example, an OSDMap with 1000 osds
-has a aproximate size of 4MiB. With heavy load, on a 3000 osd cluster there has
+has an approximate size of 4MiB. With heavy load, on a 3000 osd cluster there has
been a 1.5x improvement enabling the cache.
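For example, the cache can be turned on with a 10 second TTL like this (a sketch; it assumes the ``mgr_ttl_cache_expire_seconds`` option, expressed in seconds)::

    ceph config set mgr mgr_ttl_cache_expire_seconds 10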
Furthermore, you can run ``ceph daemon mgr.${MGRNAME} perf dump`` to retrieve perf
If you are using a self-signed certificate for Grafana,
disable certificate verification in the dashboard to avoid refused connections,
which can be a result of certificates signed by an unknown CA or that do not
-matchn the host name::
+match the host name::
$ ceph dashboard set-grafana-api-ssl-verify False
found by searching for keywords, such as *500 Internal Server Error*,
followed by ``traceback``. The end of a traceback contains more details about
what exact error occurred.
-#. Check your web browser's Javascript Console for any errors.
+#. Check your web browser's JavaScript Console for any errors.
Ceph Dashboard Logs
*diskprediction_local* requires at least six datasets of device health metrics to
-make prediction of the devices' life expentancy. And these health metrics are
+predict the devices' life expectancy. These health metrics are
collected only if health monitoring is :ref:`enabled <enabling-monitoring>`.
Run the following command to retrieve the life expectancy of given device.
RGW Module
============
-The rgw module helps with bootstraping and configuring RGW realm
+The rgw module helps with bootstrapping and configuring RGW realm
and the different related entities.
Enabling
underlying drive space is allocated. This means that roughly
(4KB - 1KB) == 3KB is allocated but never used, which corresponds to 300%
overhead or 25% efficiency. Similarly, a 5KB user object will be stored
-as one 4KB and one 1KB RADOS object, again stranding 4KB of device capcity,
+as one 4KB and one 1KB RADOS object, again stranding 4KB of device capacity,
though in this case the overhead is a much smaller percentage. Think of this
in terms of the remainder from a modulus operation. The overhead *percentage*
thus decreases rapidly as user object size increases.
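The arithmetic above can be summarized as follows (a sketch, using the 4KB default)::

    stranded = (min_alloc_size - (user_object_size % min_alloc_size)) % min_alloc_size

    # 1KB object: 4KB - 1KB = 3KB stranded           -> 1KB/4KB = 25% efficiency
    # 5KB object: stored as 4KB + 1KB, 3KB stranded  -> 5KB/8KB = 62.5% efficiency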
When an RGW bucket pool contains many relatively large user objects, the effect
of this phenomenon is often negligible, but should be considered for deployments
-that expect a signficiant fraction of relatively small objects.
+that expect a significant fraction of relatively small objects.
The 4KB default value aligns well with conventional HDD and SSD devices. Some
new coarse-IU (Indirection Unit) QLC SSDs however perform and wear best
.. confval:: mon_osd_min_down_reporters
.. confval:: mon_osd_reporter_subtree_level
-.. index:: OSD hearbeat
+.. index:: OSD heartbeat
OSD Settings
------------
The disallow Mode
=================
-This mode lets you mark monitors as disallowd, in which case they will
+This mode lets you mark monitors as disallowed, in which case they will
participate in the quorum and serve clients, but cannot be elected leader. You
may wish to use this if you have some monitors which are known to be far away
from clients.
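A sketch of the commands involved (``{name}`` is a placeholder for the monitor's name; this assumes the election-strategy interface of recent releases)::

    ceph mon set election_strategy disallow
    ceph mon add disallowed_leader {name}
    ceph mon rm disallowed_leader {name}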
when some OSDs are being evacuated or slowly brought into service.
Deployments utilizing Nautilus (or later revisions of Luminous and Mimic)
-that have no pre-Luminous cients may instead wish to instead enable the
+that have no pre-Luminous clients may wish to instead enable the
`balancer`` module for ``ceph-mgr``.
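For reference, a minimal sketch of enabling the balancer in ``upmap`` mode (which itself requires that all clients are at least Luminous)::

    ceph osd set-require-min-compat-client luminous
    ceph balancer mode upmap
    ceph balancer on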
Add/remove an IP address or CIDR range to/from the blocklist.
layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
$ ceph osd pool create lrcpool erasure LRCprofile
-You could also use a different erasure code profile for for each
+You could also use a different erasure code profile for each
layer.::
$ ceph osd erasure-code-profile set LRCprofile \
ceph health detail
-Clients global_id reclaim rehavior can also seen in the
+Clients' global_id reclaim behavior can also be seen in the
``global_id_status`` field in the dump of clients connected to an
individual monitor (``reclaim_insecure`` means the client is
unpatched and is contributing to this health alert)::
Access may be restricted to specific pools as defined by their application
metadata. The ``*`` wildcard may be used for the ``key`` argument, the
-``value`` argument, or both. ``all`` is a synony for ``*``.
+``value`` argument, or both. ``all`` is a synonym for ``*``.
Namespace
---------
- Each RGW instance has its own private and ephemeral ``RGW`` Lua table that is lost when the daemon restarts. Note that ``background`` context scripts will run on every instance.
- The maximum number of entries in the table is 100,000. Each entry has a string key a value with a combined length of no more than 1KB.
A Lua script will abort with an error if the number of entries or entry size exceeds these limits.
-- The ``RGW`` Lua table uses string indeces and can store values of type: string, integer, double and boolean
+- The ``RGW`` Lua table uses string indices and can store values of type: string, integer, double and boolean
Increment/Decrement Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
resharding
~~~~~~~~~~
-Allows buckets to be resharded in a multisite configuration without interrupting the replication of their objects. When ``rgw_dynamic_resharding`` is enabled, it runs on each zone independently, and zones may choose different shard counts for the same bucket. When buckets are resharded manunally with ``radosgw-admin bucket reshard``, only that zone's bucket is modified. A zone feature should only be marked as supported after all of its radosgws and osds have upgraded.
+Allows buckets to be resharded in a multisite configuration without interrupting the replication of their objects. When ``rgw_dynamic_resharding`` is enabled, it runs on each zone independently, and zones may choose different shard counts for the same bucket. When buckets are resharded manually with ``radosgw-admin bucket reshard``, only that zone's bucket is modified. A zone feature should only be marked as supported after all of its radosgws and osds have upgraded.
Commands
In this mode, the operation is acked only after the notification is sent to the topic's configured endpoint, which means that the
round trip time of the notification is added to the latency of the operation itself.
-.. note:: The original triggering operation will still be considered as successful even if the notification fail with an error, cannot be deliverd or times out
+.. note:: The original triggering operation will still be considered as successful even if the notification fails with an error, cannot be delivered or times out
Notifications may also be sent asynchronously. They will be committed into persistent storage and then asynchronously sent to the topic's configured endpoint.
In this case, the only latency added to the original operation is of committing the notification to persistent storage.
-.. note:: If the notification fail with an error, cannot be deliverd or times out, it will be retried until successfully acked
+.. note:: If the notification fails with an error, cannot be delivered or times out, it will be retried until successfully acked
.. tip:: To minimize the added latency in case of asynchronous notifications, it is recommended to place the "log" pool on fast media
# radosgw-admin subscription rm --subscription={topic-name} [--tenant={tenant}]
-To fetch all of the events stored in a subcription, use:
+To fetch all of the events stored in a subscription, use:
::
.. note::
The section name of the QAT configuration files must be ``CEPH`` since
- the section name is set as "CEPH" in Ceph cropto source code.
+ the section name is set as "CEPH" in Ceph crypto source code.
Then, edit the Ceph configuration file to make use of QAT based crypto plugin::
The feature introduces 2 new APIs: Auth and Cache.
- NOTE: The `D3N RGW Data Cache`_ is an alternative data caching mechanism implemented natively in the Rados Gatewey.
+ NOTE: The `D3N RGW Data Cache`_ is an alternative data caching mechanism implemented natively in the Rados Gateway.
New APIs
-------------------------
.. note::
- The ``s3:ObjectSynced:Create`` event is sent when an object successfully syncs to a zone. It must be explicitely set for each zone.
+ The ``s3:ObjectSynced:Create`` event is sent when an object successfully syncs to a zone. It must be explicitly set for each zone.
Topic Configuration
-------------------
| Upon an error being detected, RGW returns 400-Bad-Request and a specific error message sends back to the client.
| Currently, there are 2 main types of error.
|
- | **Syntax error**: the s3selecet parser rejects user requests that are not aligned with parser syntax definitions, as
+ | **Syntax error**: the s3select parser rejects user requests that are not aligned with parser syntax definitions, as
| described in this documentation.
| Upon Syntax Error, the engine creates an error message that points to the location of the error.
| RGW sends back the error message in a specific error response.
~~~~
| NULL is a legit value in ceph-s3select systems similar to other DB systems, i.e. systems needs to handle the case where a value is NULL.
| The definition of NULL in our context, is missing/unknown, in that sense **NULL can not produce a value on ANY arithmetic operations** ( a + NULL will produce NULL value).
-| The Same is with arithmetic comaprision, **any comparison to NULL is NULL**, i.e. unknown.
+| The same applies to arithmetic comparisons: **any comparison to NULL is NULL**, i.e. unknown.
| Below is a truth table contains the NULL use-case.
+---------------------------------+-----------------------------+
#. On the “Connect To Target” window, select the “Enable multi-path” option, and
click the “Advanced” button.
-#. Under the "Connet using" section, select a “Target portal IP” . Select the
+#. Under the "Connect using" section, select a “Target portal IP”. Select the
“Enable CHAP login on” and enter the "Name" and "Target secret" values from the
Ceph iSCSI Ansible client credentials section, and click OK.
cluster_name = ceph
# Place a copy of the ceph cluster's admin keyring in the gateway's /etc/ceph
- # drectory and reference the filename here
+ # directory and reference the filename here
gateway_keyring = ceph.client.admin.keyring
.. _Python Sphinx: http://sphinx-doc.org
-.. _resturcturedText: http://docutils.sourceforge.net/rst.html
+.. _reStructuredText: http://docutils.sourceforge.net/rst.html
.. _Fork and Pull: https://help.github.com/articles/using-pull-requests
.. _github: http://github.com
.. _ditaa: http://ditaa.sourceforge.net/
The larger the Ceph cluster, the more common OSD failures will be.
The faster that a placement group (PG) can recover from a ``degraded`` state to
an ``active + clean`` state, the better. Notably, fast recovery minimizes
-the liklihood of multiple, overlapping failures that can cause data to become
+the likelihood of multiple, overlapping failures that can cause data to become
temporarily unavailable or even lost. Of course, when provisioning your
network, you will have to balance price against performance.
expiration of the replicas is required to allow previously replicated
objects from eventually being trimmed from the cache as well.
-Each metdata object has a authority bit that indicates whether it is
+Each metadata object has an authority bit that indicates whether it is
authoritative or a replica.
The exporter walks the subtree hierarchy and packages up an MExport
message containing all metadata and important state (\eg, information
-about metadata replicas). At the same time, the expoter's metadata
+about metadata replicas). At the same time, the exporter's metadata
objects are flagged as non-authoritative. The MExport message sends
the actual subtree metadata to the importer. Upon receipt, the
importer inserts the data into its cache, marks all objects as
mds_kill_export_at:
1: After moving to STATE_EXPORTING
2: After sending MExportDirDiscover
-3: After receiving MExportDirDiscoverAck and auth_unpin'ing.
+3: After receiving MExportDirDiscoverAck and auth_unpinning.
4: After sending MExportDirPrep
5: After receiving MExportDirPrepAck
6: After sending out MExportDirNotify to all replicas
## Bucket Index Log Resharding
-The bucket replication logs for multisite are stored in the same bucket index shards as the keys that they modify. However, we can't reshard these log entries like we do with with normal keys, because other zones need to track their position in the logs. If we shuffle the log entries around between shards, other zones no longer have a way to associate their old shard marker positions with the new shards, and their only recourse would be to restart a full sync. So when resharding buckets, we need to preserve the old bucket index logs so that other zones can finish processing their log entries, while any new events are recorded in the new bucket index logs.
+The bucket replication logs for multisite are stored in the same bucket index
+shards as the keys that they modify. However, we can't reshard these log
+entries like we do with normal keys, because other zones need to track their
+position in the logs. If we shuffle the log entries around between shards,
+other zones no longer have a way to associate their old shard marker positions
+with the new shards, and their only recourse would be to restart a full sync.
+So when resharding buckets, we need to preserve the old bucket index logs so
+that other zones can finish processing their log entries, while any new events
+are recorded in the new bucket index logs.
An additional goal is to move replication logs out of omap (so out of the bucket index) into separate rados objects. To enable this, the bucket instance metadata should be able to describe a bucket whose *index layout* is different from its *log layout*. For existing buckets, the two layouts would be identical and share the bucket index objects. Alternate log layouts are otherwise out of scope for this design.