From: Adam King
Date: Mon, 26 Sep 2022 18:02:19 +0000 (-0400)
Subject: mgr/cephadm: fix handling of mgr upgrades with 3 or more mgrs
X-Git-Tag: v18.1.0~681^2
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=fa0ab94b40a12ffcef2e889eecbd2757f24d0811;p=ceph.git

mgr/cephadm: fix handling of mgr upgrades with 3 or more mgrs

Fixes: https://tracker.ceph.com/issues/57675

When daemons are upgraded by cephadm, two criteria are taken into account
for a daemon to be considered fully upgraded. The first is the container
image the daemon is currently running. The second is the container image
of the mgr that deployed the daemon. I'll refer to these as a daemon having
the "correct version" and the "correct deployed-by version". For reference,
the deployed-by version needs to be tracked because cephadm may change
something about the unit files it generates between versions, and not
making sure daemons are deployed by the current version of cephadm risks
some obscure bugs.

The function _detect_need_upgrade takes a list of daemons and returns two
new lists. The first contains all daemons from the input list that are on
the wrong version. The second contains all daemons that are on the right
version but were deployed by the wrong version. Additionally, it returns a
bool saying whether the current active mgr must be upgraded (i.e. whether
it would belong in either of the two returned lists).

Prior to this change, the second list (daemons that are on the right
version but have the wrong deployed-by version) was simply added to the
first list if the active mgr did not need to be upgraded. The idea is that
if you are upgrading from image X to image Y, we can only really "fix" the
deployed-by version of a daemon if the active mgr is on version Y, as it
will be the one deploying the daemon. So if the active mgr is not upgraded,
we can just ignore the daemons that only have the wrong deployed-by version
in the current iteration.

All of this really only matters while the mgr daemons themselves are being
upgraded. After all the mgrs are upgraded, any future daemon upgrades will
be done by a mgr on the new version, so the deployed-by version will always
be fixed along with the version of the daemon itself.
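As a rough illustration of the classification described above, here is a
hypothetical, simplified sketch; the real _detect_need_upgrade works on
daemon descriptions and container image digests rather than plain version
strings, and all names below are made up for illustration:

    from typing import List, NamedTuple, Tuple

    class Daemon(NamedTuple):
        name: str
        version: str      # container image the daemon is currently running
        deployed_by: str  # version of the mgr that deployed the daemon

    def detect_need_upgrade_sketch(daemons: List[Daemon], target: str,
                                   active_mgr: Daemon) -> Tuple[List[Daemon], List[Daemon], bool]:
        # illustrative only, not cephadm's actual implementation
        # daemons running the wrong image entirely
        need_upgrade = [d for d in daemons if d.version != target]
        # daemons on the right image but deployed by a mgr on the wrong image
        need_upgrade_deployer = [d for d in daemons
                                 if d.version == target and d.deployed_by != target]
        # whether the active mgr itself would land in either of the two lists
        need_upgrade_self = (active_mgr.version != target
                             or active_mgr.deployed_by != target)
        return need_upgrade, need_upgrade_deployer, need_upgrade_self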
This system also works fine for the typical 2 mgr setup. Imagine mgr A and
mgr B, both on version X and deployed by version X, being upgraded to
version Y, with A as the active mgr. First A deploys B with version Y, so B
now has version Y and deployed-by version X. A then fails over to B as it
sees it needs to be upgraded. B then upgrades A, so A now has version Y and
deployed-by version Y. B then fails over to A as it sees it needs to be
upgraded, since its deployed-by version is still X. Finally, A redeploys B,
both mgrs are fully upgraded, and everything is fine.

However, things can get trickier with 3 or more mgrs because cephadm does
not control which other mgr takes over after a failover. Imagine a similar
scenario but with mgr A, B, and C. First A upgrades B and C to Y, so they
are both on version Y with deployed-by version X. A then fails over since
it needs to be upgraded, and let's say B takes over as active. B then
upgrades A, so A now has version Y and deployed-by version Y. However, B
will not redeploy C even though it should: since B sees that it needs to be
upgraded itself (its deployed-by version is wrong), it doesn't touch any
daemon that only needs its deployed-by version fixed. It then fails over,
and let's say C takes over. Since C still has the wrong deployed-by version
and therefore thinks it needs to be upgraded, it won't touch B, which only
needs its deployed-by version fixed. It sees that it needs to be upgraded
itself, however, so it fails over. Let's say B takes over again. You can
see how we end up in a loop here where B and C each say they need to be
upgraded but never upgrade each other. From what I've seen, which mgr is
picked after a failover isn't totally random, so this type of scenario can
actually happen, and the upgrade can get stuck there until the user takes
some action.

The change here is, instead of skipping the daemons that need their
deployed-by version fixed whenever the active mgr needs an upgrade, to skip
them only if the active mgr is on the wrong version. In our example
scenario, B would then have redeployed C the first time around, as it would
see it is on the correct version Y and can therefore fix the deployed-by
version for C. This is what the check always should have been, but since
most testing is done with 2 mgr daemons, and even with more it's only by
chance that you end up in the loop, this issue wasn't seen. I'll add that
it is also possible to end up in this loop with only 2 mgr daemons if some
amount of manual upgrading of the mgr daemons is done.

Signed-off-by: Adam King
---

diff --git a/src/pybind/mgr/cephadm/upgrade.py b/src/pybind/mgr/cephadm/upgrade.py
index b7ad4a8b66eb3..d40be61a3da57 100644
--- a/src/pybind/mgr/cephadm/upgrade.py
+++ b/src/pybind/mgr/cephadm/upgrade.py
@@ -1156,7 +1156,7 @@ class CephadmUpgrade:
         # no point in trying to redeploy with new version if active mgr is not on the new version
         need_upgrade_deployer = []
-        if not need_upgrade_self:
+        if any(d in target_digests for d in self.mgr.get_active_mgr_digests()):
             # only after the mgr itself is upgraded can we expect daemons to have
             # deployed_by == target_digests
             need_upgrade += need_upgrade_deployer
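To make the effect of the one-line change above concrete, here is a
hypothetical before/after sketch of the merge decision. Plain version
strings stand in for the image digest checks in the real code, and the
function names are made up for illustration:

    from typing import List

    TARGET = "Y"

    def merge_deployer_list_old(need_upgrade: List[str],
                                need_upgrade_deployer: List[str],
                                need_upgrade_self: bool) -> List[str]:
        # old behaviour: skip the deployed-by fixes whenever the active mgr
        # needs *any* upgrade, even if it is already running the target image
        if not need_upgrade_self:
            return need_upgrade + need_upgrade_deployer
        return need_upgrade

    def merge_deployer_list_new(need_upgrade: List[str],
                                need_upgrade_deployer: List[str],
                                active_mgr_version: str) -> List[str]:
        # new behaviour: merging only requires the active mgr to be running
        # the target image, since it is the one that will redeploy the daemons
        if active_mgr_version == TARGET:
            return need_upgrade + need_upgrade_deployer
        return need_upgrade

    # mgr B is active: on version Y but deployed by X, so need_upgrade_self is True.
    # mgr C is also on version Y but deployed by X, so it sits in need_upgrade_deployer.
    print(merge_deployer_list_old([], ["mgr.c"], need_upgrade_self=True))   # [] -> C never redeployed
    print(merge_deployer_list_new([], ["mgr.c"], active_mgr_version="Y"))   # ['mgr.c'] -> loop breaks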