Adam King [Mon, 26 Sep 2022 18:02:19 +0000 (14:02 -0400)]
mgr/cephadm: fix handling of mgr upgrades with 3 or more mgrs
Fixes: https://tracker.ceph.com/issues/57675
When daemons are upgraded by cephadm, there are two criteria taken into
account for a daemon to be considered totally upgraded. The first is the
container image the daemon actually has currently. The second is the container
image of the mgr that deployed the daemon. I'll refer to these as a daemon
having the "correct version" and "correct deployed by". For reference,
the correct deployed by needs to be tracked as cephadm may change
something about the unit files it generates between versions and not
making sure daemons are deployed by the current version of cephadm
risks some obscure bugs.
The function _detect_need_upgrade takes a list of daemons and returns
two new lists. The first is all daemons from the input list that
are on the wrong version. The second is all daemons that are on the
right version but deployed by the wrong version. Additionally it returns
a bool to say whether the current active mgr must be upgraded (i.e. it
would belong in either of the two returned lists). Prior to this change,
how it would work is the second list (list of daemons that are on the right
version but have the wrong deployed by version) would simply be added to
the first list if the active mgr does not need to be upgraded. The idea
is that if you are upgrading from X image to Y image, we can only
really "fix" the deployed by version of the daemon if the active mgr
is on the Y version as it will be the one deploying the daemon. So if
the active mgr is not upgraded we can just ignore the daemons that just
have the wrong deployed by version in the current iteration. All of this is
really only important when the mgr daemons are being upgraded. After all the
mgrs are upgraded any future upgrades of daemons will be done by a mgr on
the new version so deployed by version will always get completed
along with the version of the daemon itself. This system also works fine
for the typical 2 mgr setup.
Imagine mgr A and B on version X deployed by version X being upgraded to
version Y with A as active. First A deploys B with version Y. Now B
has version Y and deployed by version X. A then fails over to B as it
sees it needs to be upgraded. B then upgrades A so A now has version Y
and deployed by version Y. B then fails over to A as it sees it needs
to be upgraded as its deployed by version is still X. Finally, A
redeploys B and both mgrs are fully upgraded and everything is fine.
However, things can get trickier with 3 or more mgrs due to the
fact that cephadm does not control which other mgr takes over after
a failover. Imagine a similar scenario but now you have mgr
A, B, and C. First A will upgrade B and C to Y so they now
are both on version Y with deployed by version X. It then fails
over since it needs to be upgraded and let's say B takes over as
active. B then upgrades A so it now has version Y and deployed by
version Y. However, B will not redeploy C even though it should:
because B sees that it needs to be upgraded itself (its deployed by
version is wrong), it doesn't touch any daemon that just needs its
deployed by version fixed. B then fails over and let's say C takes
over. Since C still has the wrong deployed by version and therefore
thinks that it needs to be upgraded, it won't touch B since that
only needs its deployed by version fixed. C sees that it needs
to be upgraded, however, so it fails over. Let's say B takes over again.
You can see how we can end up in a loop here where B and C say they
need to be upgraded but never upgrade each other. It seems from what
I've seen that which mgr is picked after a failover isn't totally
random so this type of scenario can actually happen and it can get
stuck here until the user takes some action. The change here is
to skip the daemons that only need their deployed by version fixed
only if the active mgr is on the wrong version, rather than whenever
the active mgr needs any kind of upgrade. So in our example scenario
B would still have upgraded C the first time around as it would
see it is on the correct version Y and can therefore fix the deployed
by version for C. This is what the check always should have been
but since most of the testing is with 2 mgr daemons, and even with
more it's only by chance that you end up in the loop, this issue
wasn't seen.
Will add that it is also possible to end up in this loop with
only 2 mgr daemons if some amount of manual upgrading of the mgr
daemons is done.
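The difference between the old and new check can be sketched in a few
lines of Python. This is a simplified model, not the cephadm code: the
Daemon class, the to_redeploy helper and the "fixed" flag are
hypothetical names made up for illustration, and it assumes the active
mgr never redeploys itself (it fails over instead).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Daemon:
    """Hypothetical stand-in for a daemon as cephadm sees it."""
    name: str
    image: str        # container image the daemon currently runs
    deployed_by: str  # image of the mgr that deployed the daemon


def detect_need_upgrade(daemons: List[Daemon], active: Daemon,
                        target: str) -> Tuple[bool, List[Daemon], List[Daemon]]:
    """Split daemons into 'wrong version' and 'wrong deployed by only'."""
    wrong_version = [d for d in daemons if d.image != target]
    wrong_deployed_by = [d for d in daemons
                         if d.image == target and d.deployed_by != target]
    need_upgrade_self = active in wrong_version or active in wrong_deployed_by
    return need_upgrade_self, wrong_version, wrong_deployed_by


def to_redeploy(daemons: List[Daemon], active: Daemon, target: str,
                fixed: bool = True) -> List[str]:
    """Names of the daemons the active mgr would redeploy this iteration."""
    need_self, wrong_version, wrong_deployed_by = detect_need_upgrade(
        daemons, active, target)
    if fixed:
        # New check: skip the second list only if the active mgr itself
        # runs the wrong image.
        skip_deployed_by = active.image != target
    else:
        # Old check: skip it whenever the active mgr needs any upgrade.
        skip_deployed_by = need_self
    todo = wrong_version + ([] if skip_deployed_by else wrong_deployed_by)
    # Assumption: the active mgr never redeploys itself.
    return [d.name for d in todo if d != active]


# State from the 3-mgr example after A was upgraded and B became active.
a = Daemon("A", image="Y", deployed_by="Y")
b = Daemon("B", image="Y", deployed_by="X")
c = Daemon("C", image="Y", deployed_by="X")
print(to_redeploy([a, b, c], active=b, target="Y", fixed=False))  # []
print(to_redeploy([a, b, c], active=b, target="Y", fixed=True))   # ['C']
```

With the old check B leaves C alone and the loop can start; with the new
check B redeploys C before failing over.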
Ilya Dryomov [Thu, 19 Jan 2023 12:21:40 +0000 (13:21 +0100)]
doc/rbd/rbd-exclusive-locks: warn about automatic lock transitions
A lot of people aren't aware of automatic lock transitions and
wrongfully assume that exclusive lock means that the image remains
locked for as long as the client is running. Redo the explanation
and add a warning.
Avan Thakkar [Mon, 16 Jan 2023 12:41:06 +0000 (18:11 +0530)]
mgr/prometheus: export zero valued pg state metrics
Fixes: https://tracker.ceph.com/issues/58471
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
As per the Prometheus documentation, omitting zero metrics is not a best practice. The metric value for all PG_STATES should be initialized to zero.
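A minimal sketch of the idea, assuming a hypothetical pg_state_counts helper and an illustrative subset of PG states (neither is taken from the module itself): every known state starts at zero, so a state with no PGs is still exported instead of being omitted.

```python
from typing import Dict, List

# Illustrative subset only; the real list of PG states is longer.
PG_STATES: List[str] = ["active", "clean", "degraded", "peering"]


def pg_state_counts(reported: Dict[str, int]) -> Dict[str, int]:
    """Return a count for every known PG state, defaulting to zero."""
    counts = {state: 0 for state in PG_STATES}
    for state, count in reported.items():
        if state in counts:
            counts[state] = count
    return counts


print(pg_state_counts({"active": 128, "clean": 128}))
# {'active': 128, 'clean': 128, 'degraded': 0, 'peering': 0}
```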
Zac Dover [Thu, 19 Jan 2023 01:50:17 +0000 (11:50 +1000)]
doc/install: link to "cephadm installing ceph"
Link to "Installing Ceph" in the cephadm documentation instead of (as
was the case before this commit) to the cephadm overview page. Anyone
who clicks on the "cephadm" link in the context of the
doc/install/index.rst page is more likely to expect installation
instructions than to expect an explanation of what cephadm is.
Zac Dover [Wed, 11 Jan 2023 15:12:24 +0000 (01:12 +1000)]
doc/cephadm: s/osd/OSD/ where appropriate
Capitalize the initialism "OSD" where it occurs in natural language
in cephadm/host-management.rst. This PR answers a request made by
Anthony D'Atri and seconded by Cole Mitchell in https://github.com/ceph/ceph/pull/49699#discussion_r1066171002.
Zac Dover [Tue, 10 Jan 2023 15:55:55 +0000 (01:55 +1000)]
doc/css: add "span" padding to custom.css
Add "scroll-top-bar: 2em;" for the "span" html element in custom.css so
that the top bar doesn't get in the way of headings bounded by the
"span" element.
Zac Dover [Mon, 9 Jan 2023 18:09:20 +0000 (04:09 +1000)]
doc/rados: link to cephadm replacing osd section
Direct readers to the "Replacing an OSD" section in the cephadm
documentation, for cases in which the instructions in "Replacing an OSD"
in the RADOS documentation don't work.
Zac Dover [Sun, 8 Jan 2023 08:04:43 +0000 (18:04 +1000)]
doc/glossary: Clean up "Ceph Object Storage"
Remove redundant material under the "Ceph Object Storage" headword and
add a "See 'Ceph Object Store'" link. A future PR will provide a couple
of sentences that explain how object storage is what's really supporting
both CephFS and RBD.
Zac Dover [Fri, 6 Jan 2023 16:24:39 +0000 (02:24 +1000)]
doc/css: Add scroll-margin-top to h2 html element
Add "scroll-margin-top: 4em;" to the h2 html element's definition in
custom.css. This moves the text under all h2 html elements out of the
way of the sticky-header-style top bar, which previously obscured the
text.
Zac Dover [Fri, 6 Jan 2023 12:51:47 +0000 (22:51 +1000)]
doc/man: define --num-rep, --min-rep and --max-rep
Explain the "--num-rep", "--min-rep", and "--max-rep" options, which are
required when running "crushtool" commands with the "--show-mappings"
flag. Originally reported by Brad Fitzpatrick.
Zac Dover [Thu, 5 Jan 2023 12:25:43 +0000 (22:25 +1000)]
doc/css: add scroll-margin-top to dt elements
Add "scroll-margin-top: 3em;" to custom.css so that the header bar
doesn't obscure the text of headwords in glossary.rst. Note that this
applies only to elements in the documentation that are rendered into
HTML with the dt (which stands for "description term") tag. Other
modifications will be necessary in order to ensure
that the anchor points of non-dt elements are not obscured by the header
bar.
Zac Dover [Sun, 1 Jan 2023 12:06:54 +0000 (22:06 +1000)]
doc/start: add link-related metadocumentation
Add two kinds of link-related metadocumentation (documentation about how
to write documentation) to the "Documenting Ceph" section of the "Intro
to Ceph" document: 1. metadocumentation about external links, and 2.
metadocumentation about internal links.
Zac Dover [Sat, 31 Dec 2022 04:22:26 +0000 (14:22 +1000)]
doc/glossary: capitalize "DAS" correctly
Correctly capitalize "Direct-Attached Storage" in the glossary. (And
test the "Quincy" branch, which seems lately not to have picked up any
docs backports.)
Zac Dover [Fri, 30 Dec 2022 01:32:31 +0000 (11:32 +1000)]
doc/glossary: collate "releases" entries
Collect the "Releases"-related entries together under the "Releases"
headword, in order to give readers a sense at a glance of how the
different kinds of releases relate to one another.
Xiubo Li [Wed, 23 Nov 2022 05:24:38 +0000 (13:24 +0800)]
qa: switch to https protocol for repos' server
git:// is no longer reachable, so the repos' server URLs have been
switched to https://.
git archive does not support the https protocol, so we can no longer
use git archive to retrieve the tarball. Split this into 3 steps:
1, clone the whole ceph repo
2, checkout the commit/tag/branch
3, then change directory to qa/workunits/.
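The three steps can be sketched as plain git invocations. This is only
an illustration, not the qa code: the helper names, the destination
directory and the example URL are placeholders, and the commands are
built as lists rather than executed.

```python
import posixpath
from typing import List


def fetch_commands(repo_url: str, ref: str, dest: str = "ceph") -> List[List[str]]:
    """Commands for steps 1 and 2: full clone, then checkout of the ref."""
    return [
        ["git", "clone", repo_url, dest],
        ["git", "-C", dest, "checkout", ref],
    ]


def workunits_dir(dest: str = "ceph") -> str:
    """Directory to change into for step 3."""
    return posixpath.join(dest, "qa", "workunits")


# Placeholder URL and ref; substitute the real repo server and commit.
for cmd in fetch_commands("https://example.com/ceph.git", "main"):
    print(" ".join(cmd))
print("cd " + workunits_dir())
```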
Signed-off-by: Xiubo Li <xiubli@redhat.com>
(cherry picked from commit 89177d65988c56324916de8394089b6e4b38aab7)
Conflicts:
- qa/workunits/fs/snaps/snaptest-git-ceph.sh: minor conflicts
- qa/machine_types/schedule_subset.sh: no need to fix this
- qa/tasks/cephfs/xfstests_dev.py: minor conflicts
Kamoltat [Wed, 14 Dec 2022 19:54:00 +0000 (19:54 +0000)]
mon/Monitor.cc: notify_new_monmap() skips removal of non-exist rank
Problem:
In RHCS the user can choose to manually remove a monitor rank
before shutting the monitor down, causing inconsistency in the monmap.
For example, if we remove mon.a from the monmap, there is a short period
where mon.a is still operational and will try to remove itself from
the monmap, but we will run into an assertion in
ConnectionTracker::notify_ranks_removed().
Solution:
In Monitor::notify_new_monmap() we prevent the function
from removing our own rank, or
ranks that don't exist in the monmap.
FYI: this is an RHCS problem only, in ODF,
we never remove a monitor from monmap
before shutting it down.
--mon-initial-members does nothing but cause the monmap
to populate ``removed_ranks``, because the way we start
monitors in standalone tests uses ``run_mon $dir $id ..``
on each mon. Regardless of --mon-initial-members=a,b,c, if
we set --mon-host=$MONA,$MONB,$MONC (which we do in every single test),
every time we run a monitor (e.g., run mon.b) it will pre-build
our monmap with
Now, with --mon-initial-members=a,b,c we are letting the
monmap know that the initial members should be named
a,b,c, of which only `b` matches. So what
``MonMap::set_initial_members`` does is
remove noname-a and noname-c, which
populates `removed_ranks`.
Solution:
remove all instances of --mon-initial-members
in the standalone test as it has no impact on
the nature of the tests themselves.
When upgrading the monitors (including booting up),
we check if `peer_tracker` is dirty or not. If
so, we clear it. Added some functions in `Elector` and
`ConnectionTracker` class to
check for clean `peer_tracker`.
Moreover, there could be some cases where due
to startup weirdness or abnormal circumstances,
we might get a report from our own rank. Therefore,
it doesn't hurt to add a sanity check in
`ConnectionTracker::report_live_connection` and
`ConnectionTracker::report_dead_connection`.
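The sanity check amounts to dropping any report whose peer rank equals
our own. A minimal Python sketch of that guard follows; the real code is
C++ and keeps connection scores, and the class name, the alive_units
dict and the boolean return value here are made up for illustration.

```python
from typing import Dict


class ConnectionTrackerSketch:
    """Toy model that only demonstrates the own-rank guard."""

    def __init__(self, rank: int) -> None:
        self.rank = rank
        self.alive_units: Dict[int, float] = {}

    def report_live_connection(self, peer_rank: int, units_alive: float) -> bool:
        # Sanity check: a report about our own rank is dropped.
        if peer_rank == self.rank:
            return False
        self.alive_units[peer_rank] = (
            self.alive_units.get(peer_rank, 0.0) + units_alive)
        return True


tracker = ConnectionTrackerSketch(rank=1)
print(tracker.report_live_connection(1, 5.0))  # False: own rank, ignored
print(tracker.report_live_connection(0, 5.0))  # True: recorded
```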
In `notify_clear_peer_state()` we add another
mechanism for resetting our `peer_tracker.rank`
to match our own monitor.rank.
This is added so there is a way for us
to recover from a scenario where `peer_tracker.rank`
is messed up from adjusting the ranks or removing
ranks.
`notify_clear_peer_state()` can be triggered
by using the command:
`ceph connection scores reset`
Also in `clear_peer_reports`, besides
reassigning my_reports to an empty object,
we also have to set the rank in `my_reports` to `rank`
from `peer_tracker`, so that we don't get
-1 as a rank in my_reports.
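Why the rank has to be restored can be seen in a small Python model.
It assumes, as the text above suggests, that a freshly constructed
report object carries rank -1; the Report and PeerTrackerSketch classes
are hypothetical and only mirror the names used in this message.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Report:
    rank: int = -1  # assumed default for an empty report
    history: Dict[int, float] = field(default_factory=dict)


class PeerTrackerSketch:
    def __init__(self, rank: int) -> None:
        self.rank = rank
        self.my_reports = Report(rank=rank)
        self.peer_reports: Dict[int, Report] = {}

    def clear_peer_reports(self) -> None:
        self.peer_reports.clear()
        self.my_reports = Report()        # empty object, rank is -1 here
        self.my_reports.rank = self.rank  # restore our own rank


tracker = PeerTrackerSketch(rank=2)
tracker.peer_reports[0] = Report(rank=0)
tracker.clear_peer_reports()
print(tracker.my_reports.rank)  # 2
```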
Kamoltat [Wed, 2 Nov 2022 01:59:52 +0000 (01:59 +0000)]
mon: change how we handle removed_ranks
When a new monitor joins, there is a chance that
it will receive a monmap that recently removed
a monitor and ``removed_ranks`` will have some
content in it. A new monitor that joins
should never remove ranks in peer_tracker but
rather call ``notify_clear_peer_state()``
to reset the `peer_report`.
In the case where it is a monitor that
has joined quorum before and is only 1
epoch behind the newest monmap provided
by the probe_replied monitor, we can
actually remove and adjust ranks in `peer_report`,
since we are sure that if there is any content in
removed_ranks, then it has to be because in the
next epoch we are removing a rank, since on every
update of an epoch we always clear the removed_ranks.
There is no point in keeping the content
of ``removed_ranks`` after monmap gets updated
to the epoch.
Therefore, clear ``removed_ranks`` every update.
When there is a discontinuity between
monmaps of more than 1 epoch, or the new monitor never joined quorum before,
we always reset `peer_tracker`.
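The rule described above fits in one small function. This is a
sketch of the decision only, not the monitor code: the function name,
its parameters and the returned strings are hypothetical.

```python
def monmap_update_action(has_ever_joined: bool, old_epoch: int,
                         new_epoch: int) -> str:
    """Pick how peer_tracker is handled when a newer monmap arrives."""
    if has_ever_joined and new_epoch - old_epoch == 1:
        # Exactly one epoch behind: removed_ranks describes this step only,
        # so ranks can be removed and adjusted.
        return "adjust_ranks"
    # New monitor, or a gap of more than one epoch: start clean.
    return "reset_peer_tracker"


print(monmap_update_action(True, 7, 8))   # adjust_ranks
print(monmap_update_action(True, 5, 8))   # reset_peer_tracker
print(monmap_update_action(False, 7, 8))  # reset_peer_tracker
```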
Moreover, it is beneficial for the monitor log to also record
which ranks have been removed at the current time
of the monmap. So add removed_ranks to `print_summary`
and `dump` in MonMap.cc.
In `ConnectionTracker::receive_peer_report`
we loop through ranks, which is bad when
`notify_rank_removed` was called before this and
the ranks are not adjusted yet. When we rely
on the rank in certain scenarios, we end up
with an extra peer_report copy, which we don't
want.
SOLUTION:
In `ConnectionTracker::receive_peer_report`,
instead of passing `report.rank` to the function
`ConnectionTracker::reports`, we pass `i.first`
so that we trim old ranks properly.
We also added an assert in notify_rank_removed(),
comparing the expected rank provided by the monmap
against the rank that we adjust ourselves to, as
a sanity check.
We edited test/mon/test_election.cc
to reflect the changes made in notify_rank_removed().