From c8f021cb85b4375faf91af1d580dc8e2e8669395 Mon Sep 17 00:00:00 2001 From: Sage Weil Date: Mon, 22 Feb 2021 18:48:14 -0500 Subject: [PATCH] mon/OSDMonitor: do not mark newly created OSDs OUT Currently, new OSDs are marked OUT. This behavior appears to date back all the way to the 'osd new' command in c9e6cac1cfae73cc4bce4a995ed2bba5dc85ae4b, and for 'osd create' from 118f081b3c64bc36a8b9825e3373587a334c672b. The first commit has no real explanation, but it presumably inherited it from the second. That second commit, though, says if we are creating an osd which has the same id as a previously removed 'in' osd, we should not mark this newly created osd as 'in' This isn't actually a good idea, however. If we are creating (or reusing) a new OSD id, the OSD that starts up will have no data. So no matter what there will be a data migration from the before state to the final state. If we mark the osd OUT when the osd id is allocated but before the OSD starts up, we'll create a middle state where PGs are mapped to the id (by virtue of the CRUSH weight) and then remapped away (due to out), creating a middle state where a bunch of PGs will repeer and maybe data will move. Instead, we have two cases: 1) If we are reusing a DESTROYED osd id, we should leave the in/out state the way it was. This way we still go straight from the before state to the after state (the osd will mark itself in when it starts up). 2) If we are allocating a new id in do_osd_create(), we want the OSD to be IN, so there is no middle state. Unfortunately, we have to work around apply_incremental() being obnoxious here: it's sloppy implementation will implicitly set EXISTS by virtue of new_osd_weight (the mark IN part) before applying the osd_state XOR, so be careful! (This behavior is mirrored by the Linux kernel implementation too, thankfully.) Signed-off-by: Sage Weil --- src/mon/OSDMonitor.cc | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc index 3d2528ec0022e..b429376876a3e 100644 --- a/src/mon/OSDMonitor.cc +++ b/src/mon/OSDMonitor.cc @@ -9045,7 +9045,6 @@ void OSDMonitor::do_osd_create( if (existing_id >= 0) { ceph_assert(existing_id < osdmap.get_max_osd()); ceph_assert(allocated_id < 0); - pending_inc.new_weight[existing_id] = CEPH_OSD_OUT; *new_id = existing_id; } else if (allocated_id >= 0) { ceph_assert(existing_id < 0); @@ -9094,7 +9093,10 @@ out: pending_inc.new_max_osd = *new_id + 1; } - pending_inc.new_state[*new_id] |= CEPH_OSD_EXISTS | CEPH_OSD_NEW; + pending_inc.new_weight[*new_id] = CEPH_OSD_IN; + // do not set EXISTS; OSDMap::set_weight, called by apply_incremental, will + // set it for us. (ugh.) + pending_inc.new_state[*new_id] |= CEPH_OSD_NEW; if (!uuid.is_zero()) pending_inc.new_uuid[*new_id] = uuid; } @@ -9425,7 +9427,6 @@ int OSDMonitor::prepare_command_osd_new( if (is_recreate_destroyed) { ceph_assert(id >= 0); ceph_assert(osdmap.is_destroyed(id)); - pending_inc.new_weight[id] = CEPH_OSD_OUT; pending_inc.new_state[id] |= CEPH_OSD_DESTROYED; if ((osdmap.get_state(id) & CEPH_OSD_NEW) == 0) { pending_inc.new_state[id] |= CEPH_OSD_NEW; -- 2.39.5