Clusters running recent Ceph releases support using the tunable values
in the CRUSH maps. However, older clients and daemons will not correctly interact
with clusters using the "tuned" CRUSH maps. To detect this situation,
-there is now a feature bit ``CRUSH_TUNABLES`` (value 0x40000) to
+there are now feature bits ``CRUSH_TUNABLES`` (value 0x40000) and ``CRUSH_TUNABLES2`` to
reflect support for tunables.
If the OSDMap currently used by the ``ceph-mon`` or ``ceph-osd``
-daemon has non-legacy values, it will require the ``CRUSH_TUNABLES``
-feature bit from clients and daemons who connect to it. This means
+daemon has non-legacy values, it will require the ``CRUSH_TUNABLES`` or ``CRUSH_TUNABLES2``
+feature bits from clients and daemons that connect to it. This means
that old clients will not be able to connect.
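+
+As an illustration of what "requiring a feature bit" means at the wire
+level, the check reduces to a bitmask test. The sketch below is
+illustrative, not the actual Ceph code: the ``CRUSH_TUNABLES`` value
+comes from this document, while the ``CRUSH_TUNABLES2`` value and the
+helper function are assumptions::
+
+  #include <cstdint>
+
+  // 0x40000 (1<<18) is the value documented above; the TUNABLES2
+  // value here is an assumed placeholder, not the real definition.
+  const std::uint64_t FEATURE_CRUSH_TUNABLES  = 0x40000;
+  const std::uint64_t FEATURE_CRUSH_TUNABLES2 = 0x2000000;  // assumption
+
+  // A peer can interoperate only if it advertises every feature bit
+  // that the current OSDMap requires.
+  bool peer_acceptable(std::uint64_t peer_features, std::uint64_t required) {
+    return (peer_features & required) == required;
+  }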
At some future point in time, newly created clusters will have
* When some OSDs are marked out, the data tends to get redistributed
to nearby OSDs instead of across the entire hierarchy.
-Which client versions support tunables
---------------------------------------
+CRUSH_TUNABLES
+--------------
+
+ * ``choose_local_tries``: Number of local retries. Legacy value is
+ 2, optimal value is 0.
+
+ * ``choose_local_fallback_tries``: Legacy value is 5, optimal value
+ is 0.
+
+ * ``choose_total_tries``: Total number of attempts to choose an item.
+   Legacy value is 19, but subsequent testing indicates that a value of
+   50 is more appropriate for typical clusters. For extremely large
+   clusters, a larger value might be necessary.
+
+CRUSH_TUNABLES2
+---------------
+
+ * ``chooseleaf_descend_once``: Whether a recursive chooseleaf attempt
+ will retry, or only try once and allow the original placement to
+ retry. Legacy default is 0, optimal value is 1.
+
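+These tunables are recorded in the CRUSH map itself. As a sketch, the
+top of a map decompiled with ``crushtool -d`` might look like the
+following once the optimal values are applied (exact output can vary by
+version)::
+
+ # begin crush map
+ tunable choose_local_tries 0
+ tunable choose_local_fallback_tries 0
+ tunable choose_total_tries 50
+ tunable chooseleaf_descend_once 1
+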
+
+Which client versions support CRUSH_TUNABLES
+--------------------------------------------
* argonaut series, v0.48.1 or later
* v0.49 or later
* Linux kernel version v3.5 or later (for the file system and RBD kernel clients)
+
+Which client versions support CRUSH_TUNABLES2
+---------------------------------------------
+
+ * v0.55 or later, including bobtail series (v0.56.x)
+ * Linux kernel version v3.9 or later (for the file system and RBD kernel clients)
+
A few important points
----------------------
storage nodes. If the Ceph cluster is already storing a lot of
data, be prepared for some fraction of the data to move.
* The ``ceph-osd`` and ``ceph-mon`` daemons will start requiring the
- ``CRUSH_TUNABLES`` feature of new connections as soon as they get
+ relevant feature bits from new connections as soon as they get
the updated map. However, already-connected clients are
effectively grandfathered in, and will misbehave if they do not
support the new feature.
changed back to the default values, ``ceph-osd`` daemons will not be
required to support the feature. However, the OSD peering process
requires examining and understanding old maps. Therefore, you
- should not run old (pre-v0.48) versions of the ``ceph-osd`` daemon
+ should not run old versions of the ``ceph-osd`` daemon
if the cluster has previously used non-legacy CRUSH values, even if
the latest version of the map has been switched back to using the
legacy defaults.
Tuning CRUSH
------------
+The simplest way to adjust the CRUSH tunables is to switch to a known
+profile. They are:
+
+ * ``legacy``: the legacy behavior from argonaut and earlier
+ * ``argonaut``: the legacy values supported by the original argonaut release
+ * ``bobtail``: the values supported by the bobtail release
+ * ``optimal``: the current best values
+ * ``default``: the current default values for a new cluster
+
+Currently, ``legacy``, ``default``, and ``argonaut`` are the same, and
+``bobtail`` and ``optimal`` include ``CRUSH_TUNABLES`` and ``CRUSH_TUNABLES2``.
+
+You can select a profile on a running cluster with the command::
+
+ ceph osd crush tunables {PROFILE}
+
+Note that this may result in some data movement.
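+
+For example, to move a cluster whose clients all support the new
+tunables to the current best values::
+
+ ceph osd crush tunables optimal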
+
+
+Tuning CRUSH, the hard way
+--------------------------
+
If you can ensure that all clients are running recent code, you can
adjust the tunables by extracting the CRUSH map, modifying the values,
and reinjecting it into the cluster.
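
+A sketch of that workflow, assuming the ``crushtool`` ``--set-*``
+options for these tunables are available in your release (check
+``crushtool --help``); the file names are arbitrary::
+
+ # grab the current, compiled CRUSH map
+ ceph osd getcrushmap -o /tmp/crush
+
+ # rewrite the tunables to the optimal values described above
+ crushtool -i /tmp/crush --set-choose-local-tries 0 \
+   --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 \
+   --set-chooseleaf-descend-once 1 -o /tmp/crush.new
+
+ # inject the adjusted map back into the cluster
+ ceph osd setcrushmap -i /tmp/crush.new
+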
}
// tunables
+ void set_tunables_legacy() {
+ crush->choose_local_tries = 2;
+ crush->choose_local_fallback_tries = 5;
+ crush->choose_total_tries = 19;
+ crush->chooseleaf_descend_once = 0;
+ }
+ void set_tunables_optimal() {
+ crush->choose_local_tries = 0;
+ crush->choose_local_fallback_tries = 0;
+ crush->choose_total_tries = 50;
+ crush->chooseleaf_descend_once = 1;
+ }
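+ // Named profile aliases: argonaut matches the legacy values,
+ // bobtail matches optimal, and default currently matches legacy.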
+ void set_tunables_argonaut() {
+ set_tunables_legacy();
+ }
+ void set_tunables_bobtail() {
+ set_tunables_optimal();
+ }
+ void set_tunables_default() {
+ set_tunables_legacy();
+ }
+
int get_choose_local_tries() const {
return crush->choose_local_tries;
}
}
} while (false);
}
+ else if (m->cmd.size() == 4 && m->cmd[1] == "crush" && m->cmd[2] == "tunables") {
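+ // Start from the pending crush map if one is already queued,
+ // otherwise from the crush map in the current OSDMap.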
+ bufferlist bl;
+ if (pending_inc.crush.length())
+ bl = pending_inc.crush;
+ else
+ osdmap.crush->encode(bl);
+
+ CrushWrapper newcrush;
+ bufferlist::iterator p = bl.begin();
+ newcrush.decode(p);
+
+ err = 0;
+ if (m->cmd[3] == "legacy" || m->cmd[3] == "argonaut") {
+ newcrush.set_tunables_legacy();
+ } else if (m->cmd[3] == "bobtail") {
+ newcrush.set_tunables_bobtail();
+ } else if (m->cmd[3] == "optimal") {
+ newcrush.set_tunables_optimal();
+ } else if (m->cmd[3] == "default") {
+ newcrush.set_tunables_default();
+ } else {
+ err = -EINVAL;
+ ss << "unknown tunables profile '" << m->cmd[3] << "'; allowed values are argonaut, bobtail, optimal, or default";
+ }
+ if (err == 0) {
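+ // Stage the re-encoded crush map in the pending incremental; it
+ // takes effect when the proposal commits.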
+ pending_inc.crush.clear();
+ newcrush.encode(pending_inc.crush);
+ ss << "adjusted tunables profile to " << m->cmd[3];
+ getline(ss, rs);
+ paxos->wait_for_commit(new Monitor::C_Command(mon, m, 0, rs, paxos->get_version()));
+ return true;
+ }
+ }
else if (m->cmd[1] == "setmaxosd" && m->cmd.size() > 2) {
int newmax = parse_pos_long(m->cmd[2].c_str(), &ss);
if (newmax < 0) {