From: Joao Eduardo Luis
Date: Mon, 13 Jul 2015 11:35:13 +0000 (+0100)
Subject: tools: ceph_monstore_tool: describe behavior of rewrite command
X-Git-Tag: v9.1.0~527^2~1
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=a881f9385feb0f5a61fa22357984d6f291c08177;p=ceph.git

tools: ceph_monstore_tool: describe behavior of rewrite command

Signed-off-by: Joao Eduardo Luis
---

diff --git a/src/tools/ceph_monstore_tool.cc b/src/tools/ceph_monstore_tool.cc
index 969340dd97b3..eec3924ccc83 100644
--- a/src/tools/ceph_monstore_tool.cc
+++ b/src/tools/ceph_monstore_tool.cc
@@ -337,6 +337,34 @@ int rewrite_transaction(MonitorDBStore& store, int version,
 
   // add a new osdmap epoch to store, so monitors will update their current osdmap
   // in addition to the ones stored in epochs.
+  //
+  // This is needed due to the way the monitor updates from paxos and the
+  // facilities we are leveraging to push this update to the rest of the
+  // quorum.
+  //
+  // In a nutshell, we are generating a good version of the osdmap, with a
+  // proper crush, and building a transaction that will replace the bad
+  // osdmaps with good osdmaps. But this transaction needs to be applied on
+  // all nodes, so that the monitors will have good osdmaps to share with
+  // clients. We thus leverage Paxos, specifically the recovery mechanism, by
+  // creating a pending value that will be committed once the monitors form an
+  // initial quorum after being brought back to life.
+  //
+  // However, the way the monitor works has the paxos services, including the
+  // OSDMonitor, updating their state from disk *prior* to the recovery phase
+  // begins (so they have an up to date state in memory). This means the
+  // OSDMonitor will see the old, broken map, before the new paxos version is
+  // applied to disk, and the old version is cached. Even though we have the
+  // good map now, and we share the good map with clients, we will still be
+  // working on the old broken map. Instead of mucking around the monitor to
+  // make this work, we instead opt for adding the same osdmap but with a
+  // newer version, so that the OSDMonitor picks up on it when it updates from
+  // paxos after the proposal has been committed. This is not elegant, but
+  // avoids further unpleasantness that would arise from kludging around the
+  // current behavior. Also, has the added benefit of making sure the clients
+  // get an updated version of the map (because last_committed+1 >
+  // last_committed) :)
+  //
   cout << "adding a new epoch #" << last_committed+1 << std::endl;
   r = update_osdmap(store, last_committed++, true, crush, t);
   if (r)