From: Conrad Hoffmann
Date: Wed, 22 Mar 2023 22:03:57 +0000 (+0100)
Subject: doc: account for PG autoscaling being the default
X-Git-Tag: v18.1.0~169^2
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=27769809976c81f8ae8247ebe8301a3273721166;p=ceph-ci.git

doc: account for PG autoscaling being the default

The current documentation tries really hard to convince people to set
both `osd_pool_default_pg_num` and `osd_pool_default_pgp_num` in their
configs, but at least the latter has undesirable side effects on any
Ceph version that has PG autoscaling enabled by default (Quincy and
later).

Assume a cluster with defaults of `64` for `pg_num` and `pgp_num`.
Starting `radosgw` will fail, because it tries to create various pools
without providing values for `pg_num` or `pgp_num`. This triggers the
following in `OSDMonitor::prepare_new_pool()`:

- `pg_num` is set to `1`, because autoscaling is enabled
- `pgp_num` is set to `osd pool default pgp_num`, which we set to `64`
- This is an invalid combination, so the pool creation fails

Likewise, `ceph osd pool create mypool` (without explicit values for
`pg_num` or `pgp_num`) does not work.

Following this rationale:

- Not providing a default value for `pgp_num` will always do the right
  thing, unless you use advanced features, in which case you can be
  expected to set both values on pool creation
- Setting `osd_pool_default_pgp_num` in your config breaks pool
  creation in several cases

This commit:

- Removes `osd_pool_default_pgp_num` from all example configs
- Adds mentions of autoscaling and how it interacts with the default
  values in various places

For each file that was touched, the following maintenance was also
performed:

- Change internal spaces to underscores in config option names
- Remove mentions of Filestore and its settings
- Fix minor inconsistencies, such as indentation

There is also a ticket that this change fixes, although it captures
only part of the broader issue addressed here:

Fixes: https://tracker.ceph.com/issues/47176

Signed-off-by: Conrad Hoffmann
(cherry picked from commit 402d2eacbc67f7a6d47d8f90d9ed757fc20931a6)
---

diff --git a/doc/install/manual-deployment.rst b/doc/install/manual-deployment.rst
index 95232fce2aa..6716ecb5beb 100644
--- a/doc/install/manual-deployment.rst
+++ b/doc/install/manual-deployment.rst
@@ -132,24 +132,24 @@ The procedure is as follows:
 
 #. Add the initial monitor(s) to your Ceph configuration file. ::
 
-     mon initial members = {hostname}[,{hostname}]
+     mon_initial_members = {hostname}[,{hostname}]
 
    For example::
 
-     mon initial members = mon-node1
+     mon_initial_members = mon-node1
 
 #. Add the IP address(es) of the initial monitor(s) to your Ceph configuration
    file and save the file. ::
 
-     mon host = {ip-address}[,{ip-address}]
+     mon_host = {ip-address}[,{ip-address}]
 
    For example::
 
-     mon host = 192.168.0.1
+     mon_host = 192.168.0.1
 
    **Note:** You may use IPv6 addresses instead of IPv4 addresses, but
-   you must set ``ms bind ipv6`` to ``true``. See `Network Configuration
+   you must set ``ms_bind_ipv6`` to ``true``. See `Network Configuration
    Reference`_ for details about network configuration.
 
 #. Create a keyring for your cluster and generate a monitor secret key. ::
 
@@ -210,37 +210,33 @@ The procedure is as follows:
 
      [global]
      fsid = {cluster-id}
-     mon initial members = {hostname}[, {hostname}]
-     mon host = {ip-address}[, {ip-address}]
-     public network = {network}[, {network}]
-     cluster network = {network}[, {network}]
-     auth cluster required = cephx
-     auth service required = cephx
-     auth client required = cephx
-     osd journal size = {n}
-     osd pool default size = {n}  # Write an object n times.
-     osd pool default min size = {n} # Allow writing n copies in a degraded state.
-     osd pool default pg num = {n}
-     osd pool default pgp num = {n}
-     osd crush chooseleaf type = {n}
+     mon_initial_members = {hostname}[, {hostname}]
+     mon_host = {ip-address}[, {ip-address}]
+     public_network = {network}[, {network}]
+     cluster_network = {network}[, {network}]
+     auth_cluster_required = cephx
+     auth_service_required = cephx
+     auth_client_required = cephx
+     osd_pool_default_size = {n}  # Write an object n times.
+     osd_pool_default_min_size = {n} # Allow writing n copies in a degraded state.
+     osd_pool_default_pg_num = {n}
+     osd_crush_chooseleaf_type = {n}
 
    In the foregoing example, the ``[global]`` section of the configuration might
    look like this::
 
      [global]
      fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993
-     mon initial members = mon-node1
-     mon host = 192.168.0.1
-     public network = 192.168.0.0/24
-     auth cluster required = cephx
-     auth service required = cephx
-     auth client required = cephx
-     osd journal size = 1024
-     osd pool default size = 3
-     osd pool default min size = 2
-     osd pool default pg num = 333
-     osd pool default pgp num = 333
-     osd crush chooseleaf type = 1
+     mon_initial_members = mon-node1
+     mon_host = 192.168.0.1
+     public_network = 192.168.0.0/24
+     auth_cluster_required = cephx
+     auth_service_required = cephx
+     auth_client_required = cephx
+     osd_pool_default_size = 3
+     osd_pool_default_min_size = 2
+     osd_pool_default_pg_num = 333
+     osd_crush_chooseleaf_type = 1
 
 #. Start the monitor(s).
 
@@ -295,7 +291,7 @@ Adding OSDs
 
 Once you have your initial monitor(s) running, you should add OSDs. Your cluster
 cannot reach an ``active + clean`` state until you have enough OSDs to handle the
-number of copies of an object (e.g., ``osd pool default size = 2`` requires at
+number of copies of an object (e.g., ``osd_pool_default_size = 2`` requires at
 least two OSDs). After bootstrapping your monitor, your cluster has a default
 CRUSH map; however, the CRUSH map doesn't have any Ceph OSD Daemons mapped to
 a Ceph Node.
 
@@ -311,8 +307,6 @@ CRUSH map under the host for you. Execute ``ceph-volume -h`` for CLI details.
 The ``ceph-volume`` utility automates the steps of the `Long Form`_ below. To
 create the first two OSDs with the short form procedure, execute the following
 for each OSD:
-bluestore
-^^^^^^^^^
 #. Create the OSD. ::
 
     copy /var/lib/ceph/bootstrap-osd/ceph.keyring from monitor node (mon-node1) to /var/lib/ceph/bootstrap-osd/ceph.keyring on osd node (osd-node1)
 
@@ -353,45 +347,6 @@ activate):
 
     sudo ceph-volume lvm activate 0 a7f64266-0894-4f1e-a635-d0aeaca0e993
 
-filestore
-^^^^^^^^^
-#. Create the OSD. ::
-
-    ssh {osd node}
-    sudo ceph-volume lvm create --filestore --data {data-path} --journal {journal-path}
-
-   For example::
-
-    ssh osd-node1
-    sudo ceph-volume lvm create --filestore --data /dev/hdd1 --journal /dev/hdd2
-
-Alternatively, the creation process can be split in two phases (prepare, and
-activate):
-
-#. Prepare the OSD. ::
-
-    ssh {node-name}
-    sudo ceph-volume lvm prepare --filestore --data {data-path} --journal {journal-path}
-
-   For example::
-
-    ssh osd-node1
-    sudo ceph-volume lvm prepare --filestore --data /dev/hdd1 --journal /dev/hdd2
-
-   Once prepared, the ``ID`` and ``FSID`` of the prepared OSD are required for
-   activation. These can be obtained by listing OSDs in the current server::
-
-    sudo ceph-volume lvm list
-
-#. Activate the OSD::
-
-    sudo ceph-volume lvm activate --filestore {ID} {FSID}
-
-   For example::
-
-    sudo ceph-volume lvm activate --filestore 0 a7f64266-0894-4f1e-a635-d0aeaca0e993
-
-
 Long Form
 ---------
 
diff --git a/doc/rados/configuration/demo-ceph.conf b/doc/rados/configuration/demo-ceph.conf
index 2537dc45c0c..8ba285a42a5 100644
--- a/doc/rados/configuration/demo-ceph.conf
+++ b/doc/rados/configuration/demo-ceph.conf
@@ -15,13 +15,13 @@ auth_cluster_required = cephx
 auth_service_required = cephx
 auth_client_required = cephx
 
-#Choose reasonable numbers for journals, number of replicas
-#and placement groups.
+#Choose a reasonable number of replicas and placement groups.
 osd_journal_size = {n}
 osd_pool_default_size = {n}  # Write an object n times.
-osd_pool_default_min_size = {n} # Allow writing n copy in a degraded state.
+osd_pool_default_min_size = {n} # Allow writing n copies in a degraded state.
+osd_pool_default_pg_autoscale_mode = {mode} # on, off, or warn
+# Only used if autoscaling is off or warn:
 osd_pool_default_pg_num = {n}
-osd_pool_default_pgp_num = {n}
 
 #Choose a reasonable crush leaf type.
 #0 for a 1-node cluster.
diff --git a/doc/rados/configuration/pool-pg.conf b/doc/rados/configuration/pool-pg.conf
index 28252e860aa..6765d37dfad 100644
--- a/doc/rados/configuration/pool-pg.conf
+++ b/doc/rados/configuration/pool-pg.conf
@@ -1,21 +1,21 @@
 [global]
 
-    # By default, Ceph makes 3 replicas of RADOS objects. If you want to maintain four
-    # copies of an object the default value--a primary copy and three replica
-    # copies--reset the default values as shown in 'osd_pool_default_size'.
-    # If you want to allow Ceph to accept an I/O operation to a degraded PG,
-    # set 'osd_pool_default_min_size' to a number less than the
-    # 'osd_pool_default_size' value.
+    # By default, Ceph makes three replicas of RADOS objects. If you want to
+    # maintain four copies of an object--a primary copy and three replica
+    # copies--reset the default value as shown in 'osd_pool_default_size'.
+    # If you want to allow Ceph to accept an I/O operation to a degraded PG,
+    # set 'osd_pool_default_min_size' to a number less than the
+    # 'osd_pool_default_size' value.
 
-    osd_pool_default_size = 3  # Write an object 3 times.
+    osd_pool_default_size = 3  # Write an object three times.
     osd_pool_default_min_size = 2 # Accept an I/O operation to a PG that has two copies of an object.
+    # Note: by default, PG autoscaling is enabled and 'osd_pool_default_pg_num'
+    # is used only in specific circumstances. It is, however, still recommended to set it.
 
     # Ensure you have a realistic number of placement groups. We recommend
     # approximately 100 per OSD. E.g., total number of OSDs multiplied by 100
-    # divided by the number of replicas (i.e., osd pool default size). So for
-    # 10 OSDs and osd pool default size = 4, we'd recommend approximately
+    # divided by the number of replicas (i.e., 'osd_pool_default_size'). So for
+    # 10 OSDs and 'osd_pool_default_size' = 4, we'd recommend approximately
     # (100 * 10) / 4 = 250.
-    # always use the nearest power of 2
-
+    # Always use the nearest power of two.
     osd_pool_default_pg_num = 256
-    osd_pool_default_pgp_num = 256
diff --git a/doc/rados/operations/pools.rst b/doc/rados/operations/pools.rst
index 12f1c1b1bb1..b4ab3b83ea8 100644
--- a/doc/rados/operations/pools.rst
+++ b/doc/rados/operations/pools.rst
@@ -55,10 +55,14 @@ To list your cluster's pools, execute:
 
 Create a Pool
 =============
 
-Before creating pools, refer to the `Pool, PG and CRUSH Config Reference`_.
-Ideally, you should override the default value for the number of placement
-groups in your Ceph configuration file, as the default is NOT ideal.
-For details on placement group numbers refer to `setting the number of placement groups`_
+If you are not using the PG autoscaler, you may wish to explicitly set a value
+for :confval:`osd_pool_default_pg_num`, as the default is small and not ideal
+for many production-scale deployments. Refer to the `Pool, PG and CRUSH Config
+Reference`_. Be careful, though, not to set a very high value: auto-deployed
+pools, notably certain RGW pools, will not hold much data and thus should not
+have an excessive number of PGs. When the PG autoscaler is not actively
+managing placement group numbers, best practice is to explicitly provide
+``pg_num`` and ``pgp_num`` when creating each pool.
 
 .. note:: Starting with Luminous, all pools need to be associated to the
    application using the pool. See `Associate Pool to Application`_ below for
@@ -66,10 +70,11 @@ For details on placement group numbers refer to `setting the number of placement
 
 For example:
 
-.. prompt:: bash $
+.. code-block:: ini
 
+   [global]
+   osd_pool_default_pg_autoscale_mode = off
    osd_pool_default_pg_num = 128
-   osd_pool_default_pgp_num = 128
 
 To create a pool, execute:
 
@@ -91,13 +96,16 @@ Where:
 
 .. describe:: {pg-num}
 
-   The total number of placement groups for the pool. See :ref:`placement groups`
-   for details on calculating a suitable number. The
-   default value ``8`` is NOT suitable for most systems.
+   The total number of placement groups for the pool. See :ref:`placement
+   groups` for details on calculating a suitable number. The default value of
+   :confval:`osd_pool_default_pg_num` is likely too small for production pools
+   used for bulk data, including RBD pools and RGW data and bucket pools.
 
    :Type: Integer
-   :Required: Yes.
-   :Default: 8
+   :Required: No. Set to ``1`` if autoscaling is enabled; otherwise the Ceph
+     configuration value :confval:`osd_pool_default_pg_num` is used.
+   :Default: Value of :confval:`osd_pool_default_pg_num`
 
 .. describe:: {pgp-num}
 
@@ -106,8 +114,9 @@ Where:
    The total number of placement groups for placement purposes. This
    should be equal to the total number of placement groups, except
    for placement group splitting scenarios.
 
   :Type: Integer
-   :Required: Yes. Picks up default or Ceph configuration value if not specified.
-   :Default: 8
+   :Required: No. Picks up the Ceph configuration value :confval:`osd_pool_default_pgp_num`
+     if not specified. If that is not set, defaults to the value of ``pg-num``.
+   :Default: Value of ``pg-num``
 
 .. describe:: {replicated|erasure}
 
diff --git a/src/common/options/global.yaml.in b/src/common/options/global.yaml.in
index fe08916d495..bdb415c5de5 100644
--- a/src/common/options/global.yaml.in
+++ b/src/common/options/global.yaml.in
@@ -2509,14 +2509,17 @@ options:
   type: uint
   level: advanced
   desc: number of PGs for placement purposes (0 to match pg_num)
-  fmt_desc: The default number of placement groups for placement for a pool.
+  fmt_desc: |
+    The default number of placement groups for placement for a pool.
     The default value is the same as ``pgp_num`` with ``mkpool``.
-    PG and PGP should be equal (for now).
+    PG and PGP should be equal (for now). Note: should not be set unless
+    autoscaling is disabled.
   default: 0
   services:
   - mon
   see_also:
   - osd_pool_default_pg_num
+  - osd_pool_default_pg_autoscale_mode
   flags:
   - runtime
 - name: osd_pool_default_type
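
A quick way to exercise the behaviour described above on a throwaway test
cluster (a sketch only: `mypool` and `64` are just the example values used in
this message, and the exact error text differs between releases):

    # Reintroduce the problematic default while the autoscaler is active.
    ceph config set global osd_pool_default_pgp_num 64

    # Expected to fail: the monitor derives pg_num=1 from the autoscaler but
    # pgp_num=64 from the config default, which is an invalid combination.
    ceph osd pool create mypool

    # Drop the default again and retry; creation should now succeed and the
    # autoscaler should pick up the new pool.
    ceph config rm global osd_pool_default_pgp_num
    ceph osd pool create mypool
    ceph osd pool autoscale-status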
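
For the opposite case described by the updated pools.rst text, where the
autoscaler is turned off and PG counts are managed explicitly, a minimal
sketch (the numbers are illustrative, not recommendations):

    # Either disable autoscaling for all new pools and set a sensible default...
    ceph config set global osd_pool_default_pg_autoscale_mode off
    ceph config set global osd_pool_default_pg_num 128

    # ...or size a single pool explicitly and turn its autoscaler off.
    ceph osd pool create mypool 128 128
    ceph osd pool set mypool pg_autoscale_mode off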