Ilya Dryomov [Fri, 1 Dec 2023 17:54:19 +0000 (18:54 +0100)]
test/librbd: drop TestLibRBD.SnapDiff
This was added as an integration test [1], separate from the fix which
went in only with unit test adjustments. It's duplicated by several
cases in DiffIterateTest.DiffIterateDeterministic now. Specifically,
the issue could be reproduced by any of:
(3) snap2 -> HEAD
(4) snap3 -> HEAD
(7) snap2 -> snap3
scribble()-based DiffIterate tests are too weak: at least two
regressions that should have been caught by DiffIterate.DiffIterate or
DiffIterate.DiffIterateStress were missed [1][2]. Aside from the
randomness which can be both a good and a bad thing, asserts there
ensure only that the returned diff covers all changes that were made.
If the returned diff is excessive or otherwise bogus, this isn't
detected [3].
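The gap between the two kinds of assertion can be illustrated with a small sketch. This is a toy model, not the librbd test code: extents are plain (offset, length) pairs and the helper names are hypothetical.

```python
# Toy model: an extent is an (offset, length) pair; all names are hypothetical.

def to_bytes(extents):
    """Expand a list of (offset, length) extents into a set of byte offsets."""
    covered = set()
    for offset, length in extents:
        covered.update(range(offset, offset + length))
    return covered

def covers_all_changes(diff, changes):
    """Weak check: every changed byte appears somewhere in the diff."""
    return to_bytes(changes) <= to_bytes(diff)

def matches_exactly(diff, expected):
    """Strict check: the diff is exactly the expected extent list."""
    return sorted(diff) == sorted(expected)

changes = [(0, 512), (4096, 512)]
excessive_diff = [(0, 8192)]  # reports far more than was actually written

weak_ok = covers_all_changes(excessive_diff, changes)   # passes
strict_ok = matches_exactly(excessive_diff, changes)    # fails
```

An excessive diff satisfies the coverage check but not the exact comparison, which is the kind of regression a deterministic test can catch.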
Add a deterministic test to systematically cover the most common cases
that don't involve discards. A similar test for discards will be added
with the fix for [4].
Comment out debug log in vector_iterate_cb() like it's done in
iterate_cb().
Ilya Dryomov [Mon, 27 Nov 2023 10:59:26 +0000 (11:59 +0100)]
librbd: fix read_whole_object handling in ObjectListSnapsRequest
Originally, in commit 2be4840afd4f ("librados/snap_set_diff: don't
assert on empty snapset"), exists was set to true. This didn't make it into
ObjectListSnapsRequest, causing the following deep-copy tests to fail
when run against calc_snap_set_diff() rigged to return "whole object"
as described in [1]:
This is a regression introduced in commit cc87a8bd697e ("librbd:
deep-copy object utilizes image-extent IO methods") by way of commit
11923e234efc ("librbd: generic object list snapshot request").
Ilya Dryomov [Mon, 27 Nov 2023 09:11:52 +0000 (10:11 +0100)]
librbd: fix LIST_SNAPS_FLAG_WHOLE_OBJECT behavior
Bundling read_whole_object and LIST_SNAPS_FLAG_WHOLE_OBJECT cases
together is wrong:
- In read_whole_object case, calc_snap_set_diff() sets just
read_whole_object. Everything else is zeroed out and may require
resetting to fit with the rest of ObjectListSnapsRequest logic.
- In LIST_SNAPS_FLAG_WHOLE_OBJECT case, only the diff should be
expanded. Everything else is set by calc_snap_set_diff() and should
be used as is. This goes for end_size in particular -- if it's reset
to object size, bogus zero extents may be returned as the object
would appear to have grown.
This is a regression introduced in commit 4429ed4f3f4c ("librbd: switch
diff iterate API to use new snaps list dispatch methods") by way of
commit 66dd53d9c4d9 ("librbd: optionally return full object extent for
any snapshot deltas").
Ilya Dryomov [Sun, 19 Nov 2023 21:44:28 +0000 (22:44 +0100)]
test/librbd: make ListSnapsWholeObject actually test stuff
Despite being added in commit 66dd53d9c4d9 ("librbd: optionally return
full object extent for any snapshot deltas") ostensibly to test the new
LIST_SNAPS_FLAG_WHOLE_OBJECT code, it surely doesn't do that because
the flag isn't even passed to MockObjectListSnapsRequest::create().
I can only guess, but it looks like snap ID 3 was intended to be
a starting point. Otherwise, with 0 and CEPH_NOSNAP passed as snap
IDs, the overlap that is set up for the clone wouldn't affect the
computation in any way.
Use snap ID 3 as a starting point and run both with and without
LIST_SNAPS_FLAG_WHOLE_OBJECT on the same snapset to pinpoint the
difference.
Ilya Dryomov [Sat, 11 Nov 2023 13:15:49 +0000 (14:15 +0100)]
librados/snap_set_diff: set end_size only if end object exists
Since commit 73f50a13109f ("rbd-mirror: use generalized deep copy for
image sync"), the only user of calc_snap_set_diff() immediately unsets
end_size otherwise.
calc_snap_set_diff() semantics are clearer if end_size is set together
with end_exists and clone_end_snap_id.
Zac Dover [Sat, 9 Dec 2023 03:46:00 +0000 (04:46 +0100)]
doc/radosgw: format POST statements
Format the POST methods so that they appear in the rendered text as
examples of POST API calls and not as plain old unformatted text, which
is how they looked before this commit. The content of these API calls
remains to be tested and confirmed to work, but this is a first step.
Zac Dover [Sat, 2 Dec 2023 05:32:26 +0000 (06:32 +0100)]
doc/radosgw: add gateway starting command
Add a command that properly starts (or restarts) the RADOS gateway after
RGW settings have been changed. This commit has been added in response
to an issue reported anonymously on
https://pad.ceph.com/p/Report_Documentation_Bugs.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit ec7c515490c2ade44d886e423a6601c7ef0cf5e8)
Zac Dover [Tue, 5 Dec 2023 19:46:26 +0000 (20:46 +0100)]
doc/radosgw: update link in rgw-cache.rst
Update link in doc/radosgw/rgw-cache.rst. The link updated here is a
link to all the Nginx configuration files. The old link was broken. This
update comes to us from an anonymous report on
https://pad.ceph.com/p/Report_Documentation_Bugs.
Zac Dover [Sun, 3 Dec 2023 12:17:46 +0000 (13:17 +0100)]
doc/rados: repair stretch-mode.rst
Remove a section of doc/rados/operations/stretch-mode.rst that I wrongly
re-included after its removal. The request for this (re)-removal is
here: https://github.com/ceph/ceph/pull/54689#discussion_r1413007655.
Zac Dover [Sat, 2 Dec 2023 05:38:28 +0000 (06:38 +0100)]
doc/radosgw: fix formatting
Repair the formatting of a string that had a string inside backticks
that itself was inside double asterisks. The presence of the asterisks
around the entire string caused the backticks to appear in the rendered
documentation.
fs suite relies on these debugfs entries to gather mount information
(client-id, addr/inst) which are required by some tests. In fs suite,
the distro kernel gets overridden by the testing kernel and therefore
even if Ubuntu 20.04 is chosen as the distro, the testing kernel is
installed. However, with smoke suite, the distro kernel is used and
the missing patches cause certain essential information gathering to
fail early on (client-id, etc.), causing the test to not even start
execution. PR #54515 fixes a bug in the client-id fetching path but
isn't complete due to the missing patches - details here:
https://tracker.ceph.com/issues/63488#note-8
But it's essential to have the smoke tests running since those tests
have lately uncovered bugs in the MDS (w/ distro kernels). In order
to benefit from those tests, this change ignores failures when
gathering mount information (which isn't used by the fs relevant
smoke tests). The tests (in fs suite) that rely on this piece of
information would fail when run with 20.04 distro kernel (but the
fs suite overrides it with the testing kernel).
Venky Shankar [Mon, 27 Nov 2023 05:12:02 +0000 (10:42 +0530)]
qa: add centos_latest (9.stream) and ubuntu_20.04 yamls to supported-all-distro
A bug in Ceph MDS (MDS crash!) is seen with distros using a not-so-recent kernel
(5.4ish). This crash was first seen in a quincy smoke run and the problematic
backport change was reverted. The smoke suite chooses a random distro for each
job, so to hit this bug, the appropriate distro needs to (randomly) get chosen.
This change points the smoke suite to run against all supported distros.
This affects suites that point to supported-all-distro (powercycle) since it
bloats up the number of jobs. E.g., currently, without --subset, powercycle:osd
INFO:teuthology.suite.run:0/336 jobs were filtered out.
vs
(with this change)
Unable to schedule 560 jobs, too many jobs, when maximum 500 jobs allowed.
For smoke suite
INFO:teuthology.suite.run:Scheduled 24 jobs in total.
vs
(with this change)
INFO:teuthology.suite.run:Scheduled 120 jobs in total.
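As a quick check on the figures quoted above, the growth factors and the scheduling limit can be worked out directly. This is only arithmetic on the numbers in this message; reading "0/336 jobs were filtered out" as 336 scheduled jobs is an assumption.

```python
from fractions import Fraction

MAX_JOBS = 500  # limit quoted in the scheduler message above

smoke_before, smoke_after = 24, 120
powercycle_before, powercycle_after = 336, 560  # 336 assumed to be the job total

# Growth factor of each suite after the change.
smoke_growth = Fraction(smoke_after, smoke_before)
powercycle_growth = Fraction(powercycle_after, powercycle_before)

# powercycle:osd no longer fits under the limit without --subset.
over_limit = powercycle_after > MAX_JOBS
```

The smoke suite grows fivefold, while powercycle:osd grows by a factor of 5/3 and crosses the 500-job limit.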
Eventually, with PR #46882, the testing kernel will no longer override the
distro kernel in fs suite, so we should get good coverage then.
Zac Dover [Tue, 28 Nov 2023 05:08:48 +0000 (06:08 +0100)]
doc/rados: improve "Ceph Subsystems"
Improve the English in the subsection "Ceph Subsystems" in the section
"Subsystem, Log and Debug Settings" [sic] in
doc/rados/troubleshooting/log-and-debug.rst.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit 7bca5f57cc2c11bdd76dd0edb43c716a1d5ad355)
Lucian Petrut [Wed, 15 Mar 2023 09:04:40 +0000 (09:04 +0000)]
test/libcephfs: skip flaky timestamp assertion on Windows
There's a new libcephfs test that creates a snapshot and
compares ctime/mtime. The issue is that one of the assertions
fails on Windows, potentially due to reduced timestamp
precision.
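A minimal sketch of how reduced precision could break a strict timestamp comparison follows. It is purely illustrative: the 100 ns granularity is the unit of Windows FILETIME values, but whether that is the actual cause here is not established, and the timestamps and helper name are made up.

```python
# Hypothetical illustration; only the 100 ns FILETIME unit is a known fact.
FILETIME_TICK_NS = 100

def truncate_to_tick(ts_ns, tick_ns=FILETIME_TICK_NS):
    """Drop any precision finer than one tick."""
    return ts_ns - (ts_ns % tick_ns)

before_snap_ns = 1_700_000_000_000_000_020
after_snap_ns = 1_700_000_000_000_000_070  # 50 ns later

# With full nanosecond precision the later timestamp compares greater.
strictly_later_full = after_snap_ns > before_snap_ns

# After truncation both values land on the same tick and compare equal.
strictly_later_truncated = (
    truncate_to_tick(after_snap_ns) > truncate_to_tick(before_snap_ns)
)
```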
Zac Dover [Fri, 17 Nov 2023 09:24:14 +0000 (19:24 +1000)]
doc/start: explain "OSD"
Explain the initialism "OSD" and link to its definition in the glossary.
This PR is raised in response to an anonymous documentation bug that reads
"Paragraph 2 uses the acronym OSD without any explanation.
This makes it very difficult to understand this part of
the documentation as there is no indication of what this
acronym is until much further into the documentation. Replace
first occurence of OSD with Object Storage Daemon (OSD) or
link it to the glossary."
-- https://pad.ceph.com/p/Report_Documentation_Bugs
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit a78fe85470c2471574aceb723cd304498cde1afb)
Aashish Sharma [Wed, 4 Oct 2023 06:54:13 +0000 (12:24 +0530)]
mgr/dashboard: Consider null values as zero in grafana panels
After upgrading from RHCS4 to RHCS5, some of the grafana charts broke.
This is because in RHCS5 we do not generate a metric if its value is
zero; as a result, the null value from that metric breaks the grafana
charts or graphs. This PR fixes that issue.
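The idea of treating a missing sample as zero can be sketched as follows. This is a language-neutral illustration written in Python, not the actual panel change; the function name and sample data are hypothetical.

```python
def nulls_as_zero(samples):
    """Return a copy of samples with missing (None) values replaced by 0."""
    return [0 if value is None else value for value in samples]

series = [3, None, 5, None]   # None stands for a metric that was not emitted
filled = nulls_as_zero(series)
total = sum(filled)           # aggregation now works on every point
```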
'ceph-volume raw list' is broken for a specific use case (rook).
rook copies devices from /dev/ to /mnt for specific/internal needs.
When 'ceph-volume raw list' is passed a device from /mnt,
ceph-volume ignores it and returns an empty dict.
That prevents rook from creating OSDs properly.
Zac Dover [Tue, 14 Nov 2023 13:40:42 +0000 (23:40 +1000)]
doc/glossary: add "Quorum" to glossary
Add the term "Quorum" to the glossary and link to the part of
architecture.rst concerning Monitors. The sticky header at the top of
the docs.ceph.com website gets in the way of the location linked to in
this commit, but fatigue and disgust prevent me from spending time today
trial-and-erroring my way through the hostile and ill-documented
wilderness of scroll-margin so that the link goes where it should.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit c2f6a770bf0e12296c334d99ac86ff4732ec29b7)
Zac Dover [Mon, 13 Nov 2023 10:57:07 +0000 (20:57 +1000)]
doc/rados: format "initial troubleshooting"
Format the steps in the "Initial Troubleshooting" section of
doc/rados/troubleshooting/troubleshooting-mon.rst. A near-future PR (not
this one) will add context to this section and explain that the steps
described here are the first steps that you should undertake when you
determine that you have an unresponsive or down Monitor. This PR is
merely for formatting.
shimin [Sun, 8 Oct 2023 11:15:09 +0000 (19:15 +0800)]
mon: fix mds metadata lost in one case.
In most cases, peon's pending_metadata is inconsistent with mon's db.
When a peon turns into leader, and at the same time an active mds stops,
the new leader may flush wrong mds metadata into db. So we need to
update mds metadata from db at every fsmap change.
This phenomenon can be reproduced like this:
A Cluster with 3 mon and 3 mds (one active, other two standby), 6 osd.
step 1. stop two standby mds;
step 2. restart all mon; (make pending_metadata consistent with db)
step 3. start other two mds
step 4. stop leader mon
step 5. run "ceph mds metadata" command to check mds metadata
step 6. stop active mds
step 7. run "ceph mds metadata" command to check mds metadata again
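The failure mode and the remedy described above can be modelled with a toy sketch. It is not the real monitor code: the class, method names, and metadata values are hypothetical, and the flush is simplified to overwriting the db with the in-memory copy.

```python
class ToyMon:
    """Hypothetical stand-in for a monitor; not the real MDSMonitor code."""

    def __init__(self, db):
        self.db = db                 # shared, authoritative store
        self.pending_metadata = {}   # in-memory copy, may be stale on a peon

    def load_from_db(self):
        self.pending_metadata = dict(self.db)

    def on_mds_stopped(self, name, refresh_first):
        if refresh_first:
            self.load_from_db()      # the remedy: reload on fsmap change
        self.pending_metadata.pop(name, None)
        self.db.clear()
        self.db.update(self.pending_metadata)  # flush pending state to db


def run(refresh_first):
    db = {"mds.a": "meta-a", "mds.b": "meta-b", "mds.c": "meta-c"}
    new_leader = ToyMon(db)
    new_leader.pending_metadata = {"mds.a": "meta-a"}  # stale peon view
    new_leader.on_mds_stopped("mds.a", refresh_first)
    return sorted(db)

without_refresh = run(refresh_first=False)  # metadata of mds.b and mds.c is lost
with_refresh = run(refresh_first=True)      # only the stopped mds is removed
```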
Zac Dover [Sun, 12 Nov 2023 10:21:41 +0000 (20:21 +1000)]
doc/config: edit "ceph-conf.rst"
Edit the first section of doc/rados/configuration/ceph-conf.rst.
Initially I just wanted to change "series" to "set", but once I got my
hands dirty I ended up simplifying some sentences.
Zac Dover [Sun, 12 Nov 2023 10:52:09 +0000 (20:52 +1000)]
doc/rados: parallelize t-mon headings
Give parallel structure to the questions in the Q&A section of the "The
Cluster Has Quorum But At Least One Monitor Is Down" subsection of the
"Most Common Monitor Issues" section of
doc/rados/troubleshooting/troubleshooting-mon.rst.