From b371771383446e8ff61004e3b556f47154fcf8b2 Mon Sep 17 00:00:00 2001
From: Sage Weil
Date: Mon, 24 Sep 2018 11:12:00 -0500
Subject: [PATCH] doc/rados/troubleshooting-mon: update mondb recovery script

- some cleanup (e.g., use $ms throughout)
- behave if the local host is in the $hosts list (use $ms.remote)
- be clear about updating all mons
- mon.0 -> mon.foo

Signed-off-by: Sage Weil
---
 .../troubleshooting/troubleshooting-mon.rst | 32 +++++++++++--------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/doc/rados/troubleshooting/troubleshooting-mon.rst b/doc/rados/troubleshooting/troubleshooting-mon.rst
index dd4c8fdeaa4..58d5bc1470e 100644
--- a/doc/rados/troubleshooting/troubleshooting-mon.rst
+++ b/doc/rados/troubleshooting/troubleshooting-mon.rst
@@ -397,7 +397,7 @@ might be found in the monitor log::

 or::

-   Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.0/store.db/1234567.ldb
+   Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb

 Recovery using healthy monitor(s)
 ---------------------------------
@@ -410,45 +410,51 @@ Recovery using OSDs
 -------------------

 But what if all monitors fail at the same time? Since users are encouraged to
-deploy at least three monitors in a Ceph cluster, the chance of simultaneous
+deploy at least three (and preferably five) monitors in a Ceph cluster, the chance of simultaneous
 failure is rare. But unplanned power-downs in a data center with improperly
 configured disk/fs settings could fail the underlying filesystem, and hence
 kill all the monitors. In this case, we can recover the monitor store with the
 information stored in OSDs.::

-  ms=/tmp/mon-store
+  ms=/root/mon-store
   mkdir $ms
+  # collect the cluster map from OSDs
   for host in $hosts; do
-    rsync -avz $ms user@host:$ms
+    rsync -avz $ms/. user@host:$ms.remote
     rm -rf $ms
     ssh user@host <
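
For orientation, here is a minimal sketch of how the collection loop reads once this patch is applied. The patch above is truncated at ``ssh user@host <``, so the heredoc body and the final rsync back are assumptions based on the ceph-objectstore-tool ``update-mon-db`` step this recovery procedure relies on; ``user@host``, ``$hosts`` and ``$ms`` are placeholders carried over from the patch itself, not a definitive version of the documented script::

  ms=/root/mon-store
  mkdir $ms

  # collect the cluster map from the OSDs on each host
  for host in $hosts; do
    # push whatever has been collected so far into a staging dir on the
    # remote host, then clear the local copy so the rsync back starts clean
    rsync -avz $ms/. user@host:$ms.remote
    rm -rf $ms
    ssh user@host <<EOF
      # assumption: walk every OSD data dir on the host and fold its copy of
      # the cluster map into the staging store
      for osd in /var/lib/ceph/osd/ceph-*; do
        ceph-objectstore-tool --data-path \$osd --op update-mon-db --mon-store-path $ms.remote
      done
EOF
    # assumption: pull the updated staging store back for the next iteration
    rsync -avz user@host:$ms.remote/. $ms
  done

Running the loop this way keeps the accumulating store in ``$ms`` on the driving node and only ever stages it under ``$ms.remote`` on the target hosts, which is what lets the procedure behave when the local host is itself one of the entries in ``$hosts``.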