git-server-git.apps.pok.os.sepia.ceph.com Git - ceph-ansible.git/commit
restart_osd_daemon.sh.j2 - consider active+clean+* pgs as OK
author    Matthew Vernon <mv3@sanger.ac.uk>
          Wed, 19 Sep 2018 12:26:26 +0000 (13:26 +0100)
committer mergify[bot] <mergify[bot]@users.noreply.github.com>
          Mon, 24 Sep 2018 12:49:21 +0000 (12:49 +0000)
commit    2585d0c2ad0b07ebbc15fc7ea1c8410cc561ce53
tree      44b7c7d74bdfe80f84fa81d1effd952bdd37a27f
parent    142eccc6fd3391670c22841e2c3c4751a2254dfe
restart_osd_daemon.sh.j2 - consider active+clean+* pgs as OK

After restarting each OSD, restart_osd_daemon.sh checks that the
cluster is in a good state before moving on to the next one. One of
these checks is that the number of pgs in the state "active+clean"
equals the total number of pgs in the cluster.

On large clusters (e.g. we have 173,696 pgs), it is likely that at
least one pg will be scrubbing and/or deep-scrubbing at any one
time. These pgs are in state "active+clean+scrubbing" or
"active+clean+scrubbing+deep", so the script was erroneously not
including them in the "good" count. Similar concerns apply to
"active+clean+snaptrim" and "active+clean+snaptrim_wait".

Fix this by counting as good any pg whose state contains
"active+clean", and compare that count, as an integer, to num_pgs in
the pgmap.
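The intended check can be sketched as follows. This is not the exact
template code; it is a minimal illustration of the counting logic,
run against a hypothetical, hand-written sample of pgmap JSON (field
names and counts are assumptions for the example; the real script
queries the cluster).

```shell
#!/bin/sh
# Hypothetical pgmap JSON, standing in for live cluster output.
# 120 + 5 + 2 + 1 = 128 pgs, all in some active+clean* state.
PGMAP_JSON='{"num_pgs":128,"pgs_by_state":[
  {"state_name":"active+clean","count":120},
  {"state_name":"active+clean+scrubbing","count":5},
  {"state_name":"active+clean+scrubbing+deep","count":2},
  {"state_name":"active+clean+snaptrim","count":1}]}'

# Count every pg whose state CONTAINS "active+clean", so scrubbing,
# deep-scrubbing and snaptrim variants are all treated as good.
good_pgs=$(printf '%s' "$PGMAP_JSON" | python3 -c '
import json, sys
pgmap = json.load(sys.stdin)
print(sum(s["count"] for s in pgmap["pgs_by_state"]
          if "active+clean" in s["state_name"]))
')

num_pgs=$(printf '%s' "$PGMAP_JSON" | python3 -c '
import json, sys
print(json.load(sys.stdin)["num_pgs"])
')

# Integer comparison against num_pgs, as described above.
if [ "$good_pgs" -eq "$num_pgs" ]; then
    echo "cluster OK: $good_pgs/$num_pgs pgs active+clean*"
else
    echo "cluster not ready: $good_pgs/$num_pgs pgs active+clean*"
fi
```

With the old exact-match check, the sample above would report only
120 good pgs out of 128 and the restart loop would stall; the
substring match counts all 128.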

(could this be backported to at least stable-3.0 please?)

Closes: #2008
Signed-off-by: Matthew Vernon <mv3@sanger.ac.uk>
(cherry picked from commit 04f4991648568e079f19f8e531a11a5fddd45c87)
roles/ceph-defaults/templates/restart_osd_daemon.sh.j2