Zac Dover [Mon, 30 May 2022 13:32:06 +0000 (23:32 +1000)]
doc/start: update "memory" in hardware-recs.rst
This PR corrects some usage errors in the "Memory" section
of the hardware-recommendations.rst file. It also closes
some opened but never closed parentheses.
Laura Flores [Mon, 16 May 2022 22:59:42 +0000 (17:59 -0500)]
qa/suites/rados/thrash-erasure-code-big/thrashers: add `osd max backfills` setting to mapgap and pggrow
All `rados/thrash-erasure-code-big` tests that die due to the “wait_for_recovery” timeout have one thing in common: They contain either `thrashers/pggrow` or `thrashers/mapgap`.
The difference between pggrow and mapgap vs. all other non-offending thrashers (default, careful, fastread, and morepggrow) is that they lack an override setting for `osd max backfills`. `osd max backfills` is the max number of backfill operations allowed to/from an OSD. The higher the number, the quicker the recovery. By default, this value is 1. On all of the non-offending thrashers (default, careful, fastread, and morepggrow), the default 1 value gets overridden in their .yaml files with a value > 1. This is not the case for pggrow and mapgap, however, as they lack an `osd max backfills` override setting.
The mclock op scheduler is known to override `osd max backfills` with a high value, but all of the thrash-erasure-code-big thrashers have their op queue set to “debug_random”, which chooses randomly between op queues (the debug_random op queue is set to override the default mclock_scheduler in qa/config/rados.yaml). So, coupled with the “debug_random” op queue, the low `osd max backfill` setting is causing some tests to time out in recovery.
WITHOUT `osd max backfills`, as they are now, “mapgap” and “pggrow” tests die due to timed-out recovery about 17/100 times, as seen here with a pggrow test: http://pulpito.front.sepia.ceph.com/lflores-2022-05-18_14:24:29-rados:thrash-erasure-code-big-master-distro-default-smithi/
WITH `osd max backfills` specified, as I have suggested in this PR, 99/100 tests passed, with one test failing for a different reason:
http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_22:40:27-rados:thrash-erasure-code-big-master-distro-default-smithi/
I also scheduled 145 tests WITH `osd max backfills` that are a mix of pggrow and mapgap thrashers. 144/145 tests passed, with one test failing for a different reason. http://pulpito.front.sepia.ceph.com/lflores-2022-05-17_15:27:54-rados:thrash-erasure-code-big-master-distro-default-smithi/
Fixes: https://tracker.ceph.com/issues/51076 Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit 40062676c2ceed49b9fa147127ffa83ba6118e2a)
This PR links issue tracker by label, and not
by file. This method was proposed by Kefu Chai
in 40f9e1cee054bb568dfa3267982467c99b4ce5c5
on 05 Sep 2020 and was never before incorporated
into Octopus. This was noticed by Neha Ojha in
May 2020.
Signed-off-by: Zac Dover <zac.dover@gmail.com>
(squash) more link fixes
Signed-off-by: Zac Dover <zac.dover@gmail.com>
(squash) see last commit
Signed-off-by: Zac Dover <zac.dover@gmail.com>
(squash) fix broken link (again)
This corrects the error "Mismatch: both
interpreted text role prefix and reference suffix", which
presented because I treated the link to an external URL as
though it were a :ref:-style link to a location inside the
Ceph documentation suite.
Signed-off-by: Zac Dover <zac.dover@gmail.com>
(squash) ibid - malformed external link
Zac Dover [Wed, 18 May 2022 10:36:53 +0000 (20:36 +1000)]
doc/start: s/3/three/ in intro.rst
I'm changing "3" to "three" for two reasons:
1. It's correct.
2. This allows me to test backports into Octopus, Pacific, and Quincy.
I am particularly interested to see what happens when I attempt
the backport into Octopus, because backports into Octopus have
failed. This will provide me with another unit of data.
Cherry-pick notes:
- src/rgw/rgw_admin.cc: conflicts due to differences in op lists
- src/rgw/rgw_rados.h: conflicts due to changes to unrelated method signatures
- src/rgw/rgw_sal.cc: conflicts due to octopus missing a couple of methods from later releases
- src/rgw/rgw_sal.h: conflicts due to changes to unrelated method seignatures
- src/rgw/rgw_rados.cc: use of ldpp_dout vs. ldout
NitzanMordhai [Mon, 21 Mar 2022 11:34:34 +0000 (11:34 +0000)]
osd/PGLog.cc: Trim duplicates by number of entries
PGLog needs to trim duplicates by the number of entries rather than the versions. That way, we prevent unbounded duplicate growth.
Adding duplicate entries trimming to trim-pg-log opertion, we will use the exist
PGLog trim function to find out the set of entries\dup entries that we suppose to
trim. to use it we need to build the PGLog from disk.
Conflicts:
src/osd/PGLog.h -- octopus does not have the commit 877798028386fbd833e8955cb89ce3f1ee47fbeb
which cleans the `std` namespace depedency
in headers.
src/tools/ceph_objectstore_tool.cc -- octopus lacks the commit 7d73fa6a309dca4c5381596c5e92813e2e11ed3b
which puts the buffer's error hierarchy on
`system_error`.
Kefu Chai [Mon, 21 Dec 2020 17:07:37 +0000 (01:07 +0800)]
include/denc: use pair<const K,V> in range-based for loop
map<K,V>::value_type is pair<const K, V>, so if we use range-based for
loop when iterating through a map, we should use pair<const K,V> instead
of pair<K,V>, the latter also compiles, but it might create a temporary
object of pair<K,V> from pair<const K,V>. GCC-11 complains at seeing
this:
../src/include/denc.h:1002:21: warning: loop variable ‘e’ of type ‘const T&’ {aka ‘const std::pair<OSDPerfMetricQuery, OSDPerfMetricReport>&’} binds to a tem\
porary constructed from type ‘const std::pair<const OSDPerfMetricQuery, OSDPerfMetricReport>’ [-Wrange-loop-constru
ct]
1002 | for (const T& e : s) {
| ^
this change
* use the value_type of container in `maplike_details<Container>`,
so we can avoid the overhead of creating temporay objects when
encoding a map
* define denc_traits for std::pair<const A, B> as well, so the elements
of a map can be encoded using denc facility
Kefu Chai [Fri, 8 Jan 2021 05:42:18 +0000 (13:42 +0800)]
test/test_rbd_replay: move operator<<(..rbd_loc& name) to rbd_replay
so gtest can print out rbd_loc when printing out diagnostic information
when test fails. after moving operator<<(ostream&, const rbd_loc&) to
the `rbd_replay` namespace, ADL is able to find it. for more details on
the lookup rules, see https://en.cppreference.com/w/cpp/language/adl
Kefu Chai [Thu, 7 Jan 2021 03:56:25 +0000 (11:56 +0800)]
common/ceph_time: add operator<< for signedspan
* templatize operator<<(ostream&, duration<>), so it works for more
duration<> classes with minimal efforts -- we just need to explicitly
instantiate these template operators
* explicitly instantiate operator<< for timespan, signedspan, seconds
and milliseconds. they are most likely to be used in Ceph. we can add
more of them when necessary.
Kefu Chai [Thu, 7 Jan 2021 07:17:45 +0000 (15:17 +0800)]
common/ceph_time: move operator<<(ostream&, timespan&) into std namespace
otherwise compiler is not able to find it as the "timespan" here is
actually a class defined in std namespace, even it has an alias defined
in ceph namespace like:
test/rbd_mirror: grab timer lock before calling add_event_after()
add_event_after() expects an externally provided mutex to be held
for the call. This was missed in commit 8965a0f2a6f7 ("rbd-mirror:
synchronize with in-flight stop in ImageReplayer::stop()").
Ilya Dryomov [Sun, 20 Feb 2022 16:33:08 +0000 (17:33 +0100)]
rbd-mirror: synchronize with in-flight stop in ImageReplayer::stop()
Complete on_finish right away only if the replayer is stopped (meaning
that it is legible to be restarted immediately, possibly from on_finish
itself). This is the behaviour pretty much anyone would assume and
also what ImageReplayer::restart() relies on.
Ilya Dryomov [Sun, 20 Feb 2022 12:11:02 +0000 (13:11 +0100)]
rbd-mirror: manual stop should take precedence over regular stop
Somewhat similar to commit 0a3794e56256 ("rbd-mirror: make stop
properly cancel restart"), make it so that a) if a manual stop is
joined to regular stop, the stop becomes manual and b) if a regular
stop is joined to a manual stop, the stop stays manual.
Ilya Dryomov [Sat, 19 Feb 2022 15:43:04 +0000 (16:43 +0100)]
rbd-mirror: straighten ImageReplayer::stop() a bit
- don't default on_finish parameter
- m_restart_requested is set in ImageReplayer::restart() which is the
only restart=true call site, so setting m_restart_requested here is
redundant
- is_stopped_() can't be true in is_running_() branch
- on_finish->complete(0) in the end is unreachable
Kefu Chai [Wed, 6 Jan 2021 08:18:17 +0000 (16:18 +0800)]
googletest submodule: pick up change to silence error=maybe-uninitialized warning
to include the fix of https://github.com/google/googletest/pull/3024
otherwise GCC-11 fails to compile the tests with following warning:
In file included from ../src/googletest/googletest/src/gtest-all.cc:42:
../src/googletest/googletest/src/gtest-death-test.cc: In function ‘bool testing::internal::StackGrowsDown()’:
../src/googletest/googletest/src/gtest-death-test.cc:1301:24: error: ‘dummy’ may be used uninitialized [-Werror=maybe-uninitialized]
1301 | StackLowerThanAddress(&dummy, &result);
| ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~
../src/googletest/googletest/src/gtest-death-test.cc:1290:13: note: by argument 1 of type ‘const void*’ to ‘void testing::internal::StackLowerThanAddress(const void*, bool*)’ declared here
1290 | static void StackLowerThanAddress(const void* ptr, bool* result) {
| ^~~~~~~~~~~~~~~~~~~~~
../src/googletest/googletest/src/gtest-death-test.cc:1299:7: note: ‘dummy’ declared here
1299 | int dummy;
| ^~~~~
cc1plus: all warnings being treated as errors
Before the patch the test case was showing an unreliable behaviour
dependent on the underlying memory allocator. It was because
the bufferlist rebuild can be skipped, resulting in unchanged number
of buffers, if all of them begin at aligned addresses.
The commit fixes that by allocating a 4 KiB-aligned buffer and
offsetting it by a small constant (42) to ensure the memory added
to the bufferlist begins at non-4 KiB address.
test/bufferlist: assert the rebuild in rebuild_aligned_size_and_memory() actually happens.
For the investigation of failures like the following one:
```
[ RUN ] BufferList.rebuild_aligned_size_and_memory
../src/test/bufferlist.cc:1865: Failure
Expected equality of these values:
bl.get_num_buffers()
Which is: 2
1
[ FAILED ] BufferList.rebuild_aligned_size_and_memory (0 ms)
```
The test case assumes the rebuild before the failed clause **always**
happens while `bufferlist::rebuild_aligned_size_and_memory()` skips it
if buffers are already aligned.
Or Friedmann [Tue, 19 Apr 2022 12:00:28 +0000 (12:00 +0000)]
rgw: RGWCoroutine::set_sleeping() checks for null stack
users of the RGWOmapAppend coroutine don't manage the lifetime of its
underlying coroutine stack, so end up making calls on RGWOmapAppend
after its stack goes away. this null check is a band-aid, and there are
still several other calls in RGWCoroutine that don't check for null
stack
Fixes: https://tracker.ceph.com/issues/49302 Signed-off-by: Or Friedmann <ofriedma@redhat.com> Signed-off-by: Casey Bodley <cbodley@redhat.com>
(cherry picked from commit 3f0f831d66c7d43c9872f5de2aceb68aef4004d8)
Kefu Chai [Mon, 22 Jun 2020 01:34:53 +0000 (09:34 +0800)]
doc/conf.py: s/add_javascript/add_js_file/
to address following warning:
jenkins-build/build/workspace/ceph-pr-docs/doc/conf.py:102: RemovedInSphinx40Warning: The app.add_javascript() is deprecated. Please use app.add_js_file() instead.
Kefu Chai [Sun, 6 Mar 2022 06:27:50 +0000 (14:27 +0800)]
doc/conf.py: silence warnings from breathe
breathe calls doxygen for extracting/generating docs from code.
while doxygen complains at seeing undocumented fields/func. these
warnings could fail the sphinx-build command, if it takes warnings
as errors.
Kefu Chai [Sun, 6 Mar 2022 06:23:42 +0000 (14:23 +0800)]
mgr/cephadm: add empty line after param list in docstring
this helps to silence the warning from sphinx, like
src/pybind/mgr/orchestrator/_interface.py:docstring of orchestrator._interface.Orchestrator.remove_osds:9: WARNING: Field list ends without a blank line; unexpected unindent.