From: Adam King
Date: Mon, 26 Sep 2022 18:02:19 +0000 (-0400)
Subject: mgr/cephadm: fix handling of mgr upgrades with 3 or more mgrs
X-Git-Tag: v18.1.0~681^2
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=fa0ab94b40a12ffcef2e889eecbd2757f24d0811;p=ceph.git

mgr/cephadm: fix handling of mgr upgrades with 3 or more mgrs

Fixes: https://tracker.ceph.com/issues/57675

When daemons are upgraded by cephadm, two criteria are taken into account
for a daemon to be considered fully upgraded. The first is the container
image the daemon is currently running. The second is the container image
of the mgr that deployed the daemon. I'll refer to these as a daemon having
the "correct version" and the "correct deployed-by version". For reference,
the deployed-by version needs to be tracked because cephadm may change
something about the unit files it generates between versions, and not
making sure daemons are deployed by the current version of cephadm risks
some obscure bugs.

The function _detect_need_upgrade takes a list of daemons and returns two
new lists. The first contains all daemons from the input list that are on
the wrong version. The second contains all daemons that are on the right
version but were deployed by the wrong version. Additionally, it returns a
bool saying whether the current active mgr must be upgraded (i.e. whether
it would belong in either of the two returned lists).

Prior to this change, the second list (daemons that are on the right
version but have the wrong deployed-by version) was simply added to the
first list if the active mgr did not need to be upgraded. The idea is that
if you are upgrading from image X to image Y, we can only really "fix" the
deployed-by version of a daemon if the active mgr is on version Y, as it
will be the one deploying the daemon. So if the active mgr is not upgraded,
we can just ignore the daemons that only have the wrong deployed-by version
in the current iteration.

All of this really only matters while the mgr daemons themselves are being
upgraded. After all the mgrs are upgraded, any future daemon upgrades will
be done by a mgr on the new version, so the deployed-by version will always
be fixed along with the version of the daemon itself.
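As a rough illustration of the classification described above, here is a
hypothetical, simplified sketch; the real _detect_need_upgrade works on
daemon descriptions and container image digests rather than plain version
strings, and all names below are made up for illustration:

    from typing import List, NamedTuple, Tuple

    class Daemon(NamedTuple):
        name: str
        version: str      # container image the daemon is currently running
        deployed_by: str  # version of the mgr that deployed the daemon

    def detect_need_upgrade_sketch(daemons: List[Daemon], target: str,
                                   active_mgr: Daemon) -> Tuple[List[Daemon], List[Daemon], bool]:
        # illustrative only, not cephadm's actual implementation
        # daemons running the wrong image entirely
        need_upgrade = [d for d in daemons if d.version != target]
        # daemons on the right image but deployed by a mgr on the wrong image
        need_upgrade_deployer = [d for d in daemons
                                 if d.version == target and d.deployed_by != target]
        # whether the active mgr itself would land in either of the two lists
        need_upgrade_self = (active_mgr.version != target
                             or active_mgr.deployed_by != target)
        return need_upgrade, need_upgrade_deployer, need_upgrade_self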
This system also works fine for the typical 2 mgr setup. Imagine mgr A and
mgr B, both on version X and deployed by version X, being upgraded to
version Y, with A as the active mgr. First A deploys B with version Y, so B
now has version Y and deployed-by version X. A then fails over to B as it
sees it needs to be upgraded. B then upgrades A, so A now has version Y and
deployed-by version Y. B then fails over to A as it sees it needs to be
upgraded, since its deployed-by version is still X. Finally, A redeploys B,
both mgrs are fully upgraded, and everything is fine.

However, things can get trickier with 3 or more mgrs because cephadm does
not control which other mgr takes over after a failover. Imagine a similar
scenario but with mgr A, B, and C. First A upgrades B and C to Y, so they
are both on version Y with deployed-by version X. A then fails over since
it needs to be upgraded, and let's say B takes over as active. B then
upgrades A, so A now has version Y and deployed-by version Y. However, B
will not redeploy C even though it should: since B sees that it needs to be
upgraded itself (its deployed-by version is wrong), it doesn't touch any
daemon that only needs its deployed-by version fixed. It then fails over,
and let's say C takes over. Since C still has the wrong deployed-by version
and therefore thinks it needs to be upgraded, it won't touch B, which only
needs its deployed-by version fixed. It sees that it needs to be upgraded
itself, however, so it fails over. Let's say B takes over again. You can
see how we end up in a loop here where B and C each say they need to be
upgraded but never upgrade each other. From what I've seen, which mgr is
picked after a failover isn't totally random, so this type of scenario can
actually happen, and the upgrade can get stuck there until the user takes
some action.

The change here is, instead of skipping the daemons that need their
deployed-by version fixed whenever the active mgr needs an upgrade, to skip
them only if the active mgr is on the wrong version. In our example
scenario, B would then have redeployed C the first time around, as it would
see it is on the correct version Y and can therefore fix the deployed-by
version for C. This is what the check always should have been, but since
most testing is done with 2 mgr daemons, and even with more it's only by
chance that you end up in the loop, this issue wasn't seen. I'll add that
it is also possible to end up in this loop with only 2 mgr daemons if some
amount of manual upgrading of the mgr daemons is done.

Signed-off-by: Adam King
---

diff --git a/src/pybind/mgr/cephadm/upgrade.py b/src/pybind/mgr/cephadm/upgrade.py
index b7ad4a8b66eb3..d40be61a3da57 100644
--- a/src/pybind/mgr/cephadm/upgrade.py
+++ b/src/pybind/mgr/cephadm/upgrade.py
@@ -1156,7 +1156,7 @@ class CephadmUpgrade:
         # no point in trying to redeploy with new version if active mgr is not on the new version
         need_upgrade_deployer = []
-        if not need_upgrade_self:
+        if any(d in target_digests for d in self.mgr.get_active_mgr_digests()):
             # only after the mgr itself is upgraded can we expect daemons to have
             # deployed_by == target_digests
             need_upgrade += need_upgrade_deployer
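To make the effect of the one-line change above concrete, here is a
hypothetical before/after sketch of the merge decision. Plain version
strings stand in for the image digest checks in the real code, and the
function names are made up for illustration:

    from typing import List

    TARGET = "Y"

    def merge_deployer_list_old(need_upgrade: List[str],
                                need_upgrade_deployer: List[str],
                                need_upgrade_self: bool) -> List[str]:
        # old behaviour: skip the deployed-by fixes whenever the active mgr
        # needs *any* upgrade, even if it is already running the target image
        if not need_upgrade_self:
            return need_upgrade + need_upgrade_deployer
        return need_upgrade

    def merge_deployer_list_new(need_upgrade: List[str],
                                need_upgrade_deployer: List[str],
                                active_mgr_version: str) -> List[str]:
        # new behaviour: merging only requires the active mgr to be running
        # the target image, since it is the one that will redeploy the daemons
        if active_mgr_version == TARGET:
            return need_upgrade + need_upgrade_deployer
        return need_upgrade

    # mgr B is active: on version Y but deployed by X, so need_upgrade_self is True.
    # mgr C is also on version Y but deployed by X, so it sits in need_upgrade_deployer.
    print(merge_deployer_list_old([], ["mgr.c"], need_upgrade_self=True))   # [] -> C never redeployed
    print(merge_deployer_list_new([], ["mgr.c"], active_mgr_version="Y"))   # ['mgr.c'] -> loop breaks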