Preventing Stale Reads
======================

We write synchronously to all replicas before sending an ACK to the
client, which limits the potential for inconsistency
in the write path. However, by default we serve reads from just
one replica (the lead/primary OSD for each PG), and the
client will use whatever OSDMap it has to select the OSD from which to read.
In most cases, this is fine: either the client map is correct,
or the OSD that we think is the primary for the object knows that it
is not the primary anymore, and can feed the client an updated map
that indicates a newer primary.

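To make the mechanism concrete, here is an illustrative sketch (not the
actual OSD read path; the class and field names are invented for the
example) of an OSD deciding whether it may serve a read::

  # Illustrative sketch only -- not the real OSD code.

  class OSDMap:
      def __init__(self, epoch, primary_by_pg):
          self.epoch = epoch
          self.primary_by_pg = primary_by_pg      # {pg_id: osd_id}

      def primary_for(self, pg_id):
          return self.primary_by_pg[pg_id]


  class OSD:
      def __init__(self, osd_id, osdmap, objects):
          self.osd_id = osd_id
          self.osdmap = osdmap                    # this OSD's view of the map
          self.objects = objects                  # {object_name: data}

      def handle_read(self, pg_id, object_name, client_epoch):
          # If our (possibly newer) map says we are no longer the primary
          # for this PG, do not serve the read: hand back the newer map so
          # the client can retry against the correct primary.
          if self.osdmap.primary_for(pg_id) != self.osd_id:
              return ("retry", None, self.osdmap)
          # We still believe we are the primary, so serve the read.  If the
          # client's map is older than ours, share the newer map as well.
          newer_map = self.osdmap if client_epoch < self.osdmap.epoch else None
          return ("data", self.objects.get(object_name), newer_map)
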
The key is to ensure that this is *always* true. In particular, we
need to ensure that an OSD that is fenced off from its peers and has

Autoscaling placement groups
============================

Placement groups (PGs) are an internal implementation detail of how
Ceph distributes data. You may enable *pg-autoscaling* to allow the cluster to
make recommendations or automatically adjust the numbers of PGs (``pgp_num``)
for each pool based on expected cluster and pool utilization.

Each pool has a ``pg_autoscale_mode`` property that can be set to ``off``, ``on``, or ``warn``.

* ``off``: Disable autoscaling for this pool. It is up to the administrator to choose an appropriate ``pgp_num`` for each pool. Please refer to :ref:`choosing-number-of-placement-groups` for more information.
* ``on``: Enable automated adjustments of the PG count for the given pool.
* ``warn``: Raise health alerts when the PG count should be adjusted.

To set the autoscaling mode for an existing pool::

  ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example, to enable autoscaling on pool ``foo``::

  ceph osd pool set foo pg_autoscale_mode on

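The same setting can also be applied programmatically. The sketch below
assumes the Python ``rados`` bindings with an admin ``ceph.conf`` and keyring
at the default locations, and issues what should be the mon command
equivalent of the CLI call above (the JSON fields mirror the CLI arguments)::

  import json
  import rados

  # Connect using the local ceph.conf and the default admin keyring
  # (adjust paths and credentials for your environment).
  cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
  cluster.connect()
  try:
      # Intended to be equivalent to: ceph osd pool set foo pg_autoscale_mode on
      cmd = json.dumps({
          'prefix': 'osd pool set',
          'pool': 'foo',
          'var': 'pg_autoscale_mode',
          'val': 'on',
      })
      ret, outbuf, outs = cluster.mon_command(cmd, b'')
      if ret != 0:
          raise RuntimeError(outs)
  finally:
      cluster.shutdown()
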
You can also configure the default ``pg_autoscale_mode`` that is
set on any pools that are subsequently created::

  ceph config set global osd_pool_default_pg_autoscale_mode <mode>

**EFFECTIVE RATIO** is the target ratio after adjusting in two ways:

1. Subtracting any capacity expected to be used by pools with target size set
2. Normalizing the target ratios among pools with target ratio set so
   they collectively target the rest of the space. For example, 4
   pools with target_ratio 1.0 would have an effective ratio of 0.25
   (see the worked example below).

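As a worked example of that adjustment (the figures below are assumptions
for illustration, not output from a real cluster), suppose the cluster
provides 100 TiB, one pool reserves 20 TiB via ``target_size_bytes``, and
four pools each set ``target_size_ratio`` to 1.0::

  # Illustration of how EFFECTIVE RATIO is derived; all figures are
  # made up for the example.
  capacity = 100.0                       # total capacity, in TiB
  target_size_pools = {'images': 20.0}   # pools with target_size_bytes, in TiB
  target_ratio_pools = {'a': 1.0, 'b': 1.0, 'c': 1.0, 'd': 1.0}

  # 1. Subtract the capacity already claimed by target-size pools.
  remaining = capacity - sum(target_size_pools.values())     # 80 TiB

  # 2. Normalize the raw ratios so these pools share the remainder.
  ratio_sum = sum(target_ratio_pools.values())                # 4.0
  for pool, ratio in target_ratio_pools.items():
      effective_ratio = ratio / ratio_sum                      # 0.25 each
      expected_use = effective_ratio * remaining               # 20 TiB each
      print(pool, effective_ratio, expected_use)
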
Automated scaling
-----------------

Allowing the cluster to automatically scale ``pgp_num`` based on usage is the
simplest approach. Ceph will look at the total available storage and
target number of PGs for the whole system, look at how much data is
stored in each pool, and try to apportion PGs accordingly. The
system is relatively conservative with its approach, only making
changes to a pool when the current number of PGs (``pg_num``) is more
than a factor of 3 off from what it thinks it should be.

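In outline, the apportioning works roughly like the following sketch. It is
a simplification of the behavior described above, not the autoscaler
module's actual code; in particular the power-of-two rounding and the
handling of the replica count are assumptions of the sketch::

  import math

  def suggested_pg_num(pool_bytes, total_bytes, num_osds, pool_size,
                       current_pg_num, target_pg_per_osd=100, threshold=3.0):
      # Share of the cluster's data stored in this pool.
      usage_fraction = pool_bytes / total_bytes if total_bytes else 0.0

      # Cluster-wide PG budget, apportioned by usage.  Dividing by the
      # pool's replica count (pool_size) reflects that each PG places
      # that many copies -- an assumption of this sketch.
      target = usage_fraction * target_pg_per_osd * num_osds / pool_size

      # Choose pg_num as a power of two.
      pg_num = max(1, 2 ** round(math.log2(max(target, 1.0))))

      # Be conservative: only suggest a change when the current value is
      # off by more than the threshold factor (3x).
      ratio = max(pg_num, current_pg_num) / min(pg_num, current_pg_num)
      return pg_num if ratio > threshold else current_pg_num

  # e.g. a pool holding 10% of the data on a 50-OSD cluster with 3x
  # replication: suggested_pg_num(10, 100, 50, 3, current_pg_num=32) -> 128
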
The target number of PGs per OSD is based on the
``mon_target_pg_per_osd`` configurable (default: 100), which can be
adjusted. The expected size of a pool can be specified either in
terms of the absolute size of the pool (i.e., bytes), or as a weight
relative to other pools with a ``target_size_ratio`` set.

For example::

  ceph osd pool set mypool target_size_bytes 100T

will tell the system that `mypool` is expected to consume 100 TiB of
space. Alternatively::

  ceph osd pool set mypool target_size_ratio 1.0

Alternatively, ``pg_num`` can be explicitly provided. However,
whether you specify a ``pg_num`` value or not does not affect whether
the value is automatically tuned by the cluster after the fact. To
enable or disable auto-tuning::

  ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)