Rishabh Dave [Mon, 25 Mar 2024 12:05:38 +0000 (17:35 +0530)]
qa/cephfs: add tests failing MDS and FS when MDS is unhealthy
Add tests to verify that the confirmation flag is mandatory for running
commands "ceph mds fail" and "ceph fs fail" when MDS has one of the two
health warnings: MDS_CACHE_OVERSIZE or MDS_TRIM.
Also, add MDS_CACHE_OVERSIZE and MDS_TRIM to ignorelist for
test_admin.py so that QA jobs knows this an expected failure.
Rishabh Dave [Mon, 25 Mar 2024 12:01:01 +0000 (17:31 +0530)]
qa/cephfs: pass confirmation flag to fs fail in tear down code
Since "ceph fs fail" command now requires the confirmation flag when
Ceph cluster has either health warning MDS_TRIM or MDS_CACHE_OVERSIZE,
update tear down in QA code. During the teardown, the CephFS should be
failed, regardless of whether or not Ceph cluster has health warnings,
since it is teardown.
Rishabh Dave [Wed, 13 Mar 2024 09:31:02 +0000 (15:01 +0530)]
cephfs,mon: require confirmation to fail unhealthy FS
Confirmation flag must be passed when running the command "ceph fs fail"
when the MDS for this FS has either of the two health warnings: MDS_TRIM
or MDS_CACHE_OVERSIZED. Else, the command will fail and print an
appropriate error message.
Restarting an MDS with these health warnings is not recommened since it
will have a slow recovery during restart which will create new problems.
Fixes: https://tracker.ceph.com/issues/61866 Signed-off-by: Rishabh Dave <ridave@redhat.com>
Since the command "ceph mds fail" now may require confirmation flag
("--yes-i-really-mean-it"), update this method to allow/disallow adding
this flag to the command arguments.
Rishabh Dave [Fri, 19 Apr 2024 11:28:30 +0000 (16:58 +0530)]
doc/cephfs: mention need of confirmation for "ceph mds fail"
Update docs since command "ceph mds fail" will now fail if MDS has either
health warning MDS_TRIM or MDS_CACHE_OVERSIZED and if confirmation flag
is not passed.
Rishabh Dave [Fri, 8 Mar 2024 15:39:18 +0000 (21:09 +0530)]
cephfs,mon: require confirmation to fail unhealthy MDS
When running the command "ceph mds fail" for an MDS that is unhealthy
due to, MDS_CACHE_OVERSIZED or MDS_TRIM, user must pass confirmation
flag. Else, the command will fail and print an appropriate error
message.
Restarting an MDS with such health warnings is not recommended since it
will have a slow reocvery during restart which will create new problems.
Fixes: https://tracker.ceph.com/issues/61866 Signed-off-by: Rishabh Dave <ridave@redhat.com>
Rishabh Dave [Thu, 18 Apr 2024 08:59:15 +0000 (14:29 +0530)]
qa/vstart_runner: increase timeout for vstart.sh command
Since the timeout bug was fixed (https://tracker.ceph.com/issues/65533)
"Ceph API tests" sometimes fails because vstart.sh command had to be
aborted due to timeout.
Currently, "timeout" is set to 300 seconds which sometimes is not enough
for vstart.sh to run successfully for "Ceph API tests" CI job. 180
seconds usually suffices for vstart.sh to run successfully when used for
CephFS.
Increase value of "timeout" to avoid such failures on "Ceph API tests" CI.
Fixes: https://tracker.ceph.com/issues/65565 Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit f779b428689ea245aa0c978732c468860520c609)
luo rixin [Tue, 16 Apr 2024 07:18:06 +0000 (15:18 +0800)]
install-deps: save and restore user's XDG_CACHE_HOME
Since ccache 4.0, ccache use $XDG_CACHE_HOME/ccache to keep compile cache
if XDG_CACHE_HOME is set. In this case $XDG_CACHE_HOME is overwrite,
ccache will use $XDG_CACHE_HOME/ccache(ccache will create the dir if not exsit) to
store compile cache, but $XDG_CACHE_HOME will be removed next round running,
leading to ccache contests are always removed. So save and restore user's XDG_CACHE_HOME.
Fixes: https://tracker.ceph.com/issues/65175 Signed-off-by: luo rixin <luorixin@huawei.com>
Patrick Donnelly [Wed, 17 Apr 2024 20:01:39 +0000 (16:01 -0400)]
Merge PR #56755 into main
* refs/pull/56755/head:
mds/quiesce: don't take mirrored cap-related locks on the replica
mds/quiesce: xlock the file to let clients keep their buffered writes
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Leonid Usov [Wed, 17 Apr 2024 11:49:34 +0000 (14:49 +0300)]
mds/quiesce: agent: avoid a race condition with rapid db updates
When new roots begin processing but don't yet make it into the
currently tracked set, there is a window for the next update
with the same roots to treat them as new.
We fix it by simplifying the agent model, getting rid of
the intermediate `working` set. Since we never remove or add
items into the current roots collection, it's safe to update the
current set directly from the pending set.
The race was due to the fact that `db_update()` relied on the `current`
to deduce new roots into `pending`, while the same new root
could have already been seen and posted into the `working` set.
This would lead to submitting the same new root twice.
Without the `working` set such race isn't possible.
Fixes: https://tracker.ceph.com/issues/65545 Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Rishabh Dave [Wed, 3 Apr 2024 05:41:29 +0000 (11:11 +0530)]
qa/vstart_runner: don't let command run after timeout
Parameter "timeout" is accepted by LocalRemote.run() but the method
doesn't do anything about it besides accepting it. Thus, this parameter
has no effect.
In LocalRemote.run(), pass this parameter to LocalRemoteProcess.wait()
and from this method pass it to subprocess.Popen.communicate(). Thus,
command will be terminated by subprocess module at seconds specified by
"timeout" parameter. IOW, "timeout" parameter will have an effect.
Fixes: https://tracker.ceph.com/issues/65533 Signed-off-by: Rishabh Dave <ridave@redhat.com>
As the thrash tests were introduced, some options were disabled
until the tests are stabilized.
Re-enable chance_down option (default is 0.4) to detect bugs on restart.
Since it will probably take few iterations before thrash and recovery tests ('default.yaml')
will pass successfully, add anoter 'simple.yaml' which should remain stable.
crimson/common/operation: fix and move exit() after entering the next phase
If exit/unlock the barrier before entering the next phase, it is
possible for the next request to exit the barrier at the same time, and
enters the next phase first, causing reorder issues.
Leonid Usov [Sat, 13 Apr 2024 12:37:03 +0000 (15:37 +0300)]
mds/quiesce: don't take mirrored cap-related locks on the replica
For every mirrored lock, the auth will message the replica to ensure
the replicated lock state. When we take x/rdlock on the auth, it will
ensure the LOCK_LOCK state on the replica, which has the file caps we
want for quiesce: CACHE and BUFFER.
It should be sufficient to only hold the quiesce local lock
on the replica side.
Leonid Usov [Mon, 8 Apr 2024 11:35:02 +0000 (14:35 +0300)]
mds/quiesce: xlock the file to let clients keep their buffered writes
With the quiesce protocol taking a `rdlock` on the file,
it also revokes the `Fb` capability, which the clients can't release
until they are done flushing, and that may take up arbitrarily long,
evidently, more than 10 minutes.
We went for the rdlock to avoid affecting readonly clients,
but given the evidence above we should not optimize for those.
Ideally, we’d like to have a QUIESCE file lock mode where both rd
and buffer are allowed, but as of now it seems like our best
available option is to `xlock` the file which will let the writing
clients keep their buffers for the duration of the quiesce.
We can only afford this change for a `splitauth` config,
i.e. where we drop the lock immediately after all `Fw`s are revoked
Instruct readers to use "mkdir /mnt/cephfs1" to create a mountpoint
before using "ceph-fuse" to mount a filesystem, if "/mnt/cephfs1"
doesn't already exist. cf.
https://github.com/ceph/ceph/pull/56831#discussion_r1561102227
The function is typically invoked on client errors like NoSuchBucket. Logging these errors with level 1 may initially suggest a significant issue, when in fact it's just a client error. Consider raising the logging level to 20 for better clarity.
- Moves "features" section in rbd image create form to "Advanced" section.
- makes rbd configuration section to be expanded by default rather than
being collapsed as it has only single section. This will improve user experience as it will not
require two clicks.
- updates e2e test
Rongqi Sun [Fri, 12 Apr 2024 06:51:34 +0000 (06:51 +0000)]
bluestore/bluestore_types: check 'it' valid before using
When sanitizer is enabled, unittest_bluestore_types fails as following
```
[ RUN ] sb_info_space_efficient_map_t.basic
=================================================================
==143714==ERROR: AddressSanitizer: heap-buffer-overflow on address 0xffff99f8b7f4 at pc 0xaaaab50bde18 bp 0xffffebefcdb0 sp 0xffffebefcda8
READ of size 8 at 0xffff99f8b7f4 thread T0
#0 0xaaaab50bde14 in sb_info_t::get_sbid() const /root/ceph/src/os/bluestore/bluestore_types.h:1337:30
#1 0xaaaab50a5908 in sb_info_space_efficient_map_t::find(unsigned long) /root/ceph/src/os/bluestore/bluestore_types.h:1385:10
#2 0xaaaab50bd638 in sb_info_space_efficient_map_t::_add(long) /root/ceph/src/os/bluestore/bluestore_types.h:1424:15
#3 0xaaaab50a52bc in sb_info_space_efficient_map_t::add_maybe_stray(unsigned long) /root/ceph/src/os/bluestore/bluestore_types.h:1358:12
#4 0xaaaab4fec03c in sb_info_space_efficient_map_t_basic_Test::TestBody() /root/ceph/src/test/objectstore/test_bluestore_types.cc:113:11
#5 0xaaaab51e9a40 in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/ceph/src/googletest/googletest/src/gtest.cc:2605:10
#6 0xaaaab5197040 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/ceph/src/googletest/googletest/src/gtest.cc:2641:14
#7 0xaaaab51488a4 in testing::Test::Run() /root/ceph/src/googletest/googletest/src/gtest.cc:2680:5
#8 0xaaaab514a7e8 in testing::TestInfo::Run() /root/ceph/src/googletest/googletest/src/gtest.cc:2858:11
#9 0xaaaab514bde8 in testing::TestSuite::Run() /root/ceph/src/googletest/googletest/src/gtest.cc:3012:28
#10 0xaaaab5167bac in testing::internal::UnitTestImpl::RunAllTests() /root/ceph/src/googletest/googletest/src/gtest.cc:5723:44
#11 0xaaaab51f3940 in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/ceph/src/googletest/googletest/src/gtest.cc:2605:10
#12 0xaaaab519e5d8 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/ceph/src/googletest/googletest/src/gtest.cc:2641:14
#13 0xaaaab5167024 in testing::UnitTest::Run() /root/ceph/src/googletest/googletest/src/gtest.cc:5306:10
#14 0xaaaab50b4d6c in RUN_ALL_TESTS() /root/ceph/src/googletest/googletest/include/gtest/gtest.h:2486:46
#15 0xaaaab50a1080 in main /root/ceph/src/test/objectstore/test_bluestore_types.cc:2847:10
#16 0xffff9d6c73f8 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#17 0xffff9d6c74c8 in __libc_start_main csu/../csu/libc-start.c:392:3
#18 0xaaaab4f3812c in _start (/root/ceph/build/bin/unittest_bluestore_types+0xe4812c) (BuildId: cb75399658026f83a4e89012de8fb02f08f6d239)
0xffff99f8b7f4 is located 0 bytes to the right of 20-byte region [0xffff99f8b7e0,0xffff99f8b7f4)
allocated by thread T0 here:
#0 0xaaaab4fe636c in operator new[](unsigned long) (/root/ceph/build/bin/unittest_bluestore_types+0xef636c) (BuildId: cb75399658026f83a4e89012de8fb02f08f6d239)
#1 0xaaaab50c0d2c in mempool::pool_allocator<(mempool::pool_index_t)11, sb_info_t>::allocate(unsigned long, void*) /root/ceph/src/include/mempool.h:375:33
#2 0xaaaab50c0c0c in std::allocator_traits<mempool::pool_allocator<(mempool::pool_index_t)11, sb_info_t> >::allocate(mempool::pool_allocator<(mempool::pool_index_t)11, sb_info_t>&, unsigned long) /usr/bin/../lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/alloc_traits.h:318:20
#3 0xaaaab50c044c in std::_Vector_base<sb_info_t, mempool::pool_allocator<(mempool::pool_index_t)11, sb_info_t> >::_M_allocate(unsigned long) /usr/bin/../lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/stl_vector.h:346:20
#4 0xaaaab50bf954 in void std::vector<sb_info_t, mempool::pool_allocator<(mempool::pool_index_t)11, sb_info_t> >::_M_realloc_insert<long&>(__gnu_cxx::__normal_iterator<sb_info_t*, std::vector<sb_info_t, mempool::pool_allocator<(mempool::pool_index_t)11, sb_info_t> > >, long&) /usr/bin/../lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/vector.tcc:440:33
#5 0xaaaab50be0d8 in sb_info_t& std::vector<sb_info_t, mempool::pool_allocator<(mempool::pool_index_t)11, sb_info_t> >::emplace_back<long&>(long&) /usr/bin/../lib/gcc/aarch64-linux-gnu/11/../../../../include/c++/11/bits/vector.tcc:121:4
#6 0xaaaab50bd760 in sb_info_space_efficient_map_t::_add(long) /root/ceph/src/os/bluestore/bluestore_types.h:1429:24
#7 0xaaaab50a5e78 in sb_info_space_efficient_map_t::add_or_adopt(unsigned long) /root/ceph/src/os/bluestore/bluestore_types.h:1361:15
#8 0xaaaab4feb07c in sb_info_space_efficient_map_t_basic_Test::TestBody() /root/ceph/src/test/objectstore/test_bluestore_types.cc:103:11
#9 0xaaaab51e9a40 in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/ceph/src/googletest/googletest/src/gtest.cc:2605:10
#10 0xaaaab5197040 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /root/ceph/src/googletest/googletest/src/gtest.cc:2641:14
#11 0xaaaab51488a4 in testing::Test::Run() /root/ceph/src/googletest/googletest/src/gtest.cc:2680:5
#12 0xaaaab514a7e8 in testing::TestInfo::Run() /root/ceph/src/googletest/googletest/src/gtest.cc:2858:11
#13 0xaaaab514bde8 in testing::TestSuite::Run() /root/ceph/src/googletest/googletest/src/gtest.cc:3012:28
#14 0xaaaab5167bac in testing::internal::UnitTestImpl::RunAllTests() /root/ceph/src/googletest/googletest/src/gtest.cc:5723:44
#15 0xaaaab51f3940 in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/ceph/src/googletest/googletest/src/gtest.cc:2605:10
#16 0xaaaab519e5d8 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) /root/ceph/src/googletest/googletest/src/gtest.cc:2641:14
#17 0xaaaab5167024 in testing::UnitTest::Run() /root/ceph/src/googletest/googletest/src/gtest.cc:5306:10
#18 0xaaaab50b4d6c in RUN_ALL_TESTS() /root/ceph/src/googletest/googletest/include/gtest/gtest.h:2486:46
#19 0xaaaab50a1080 in main /root/ceph/src/test/objectstore/test_bluestore_types.cc:2847:10
#20 0xffff9d6c73f8 in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#21 0xffff9d6c74c8 in __libc_start_main csu/../csu/libc-start.c:392:3
#22 0xaaaab4f3812c in _start (/root/ceph/build/bin/unittest_bluestore_types+0xe4812c) (BuildId: cb75399658026f83a4e89012de8fb02f08f6d239)
SUMMARY: AddressSanitizer: heap-buffer-overflow /root/ceph/src/os/bluestore/bluestore_types.h:1337:30 in sb_info_t::get_sbid() const
Shadow bytes around the buggy address:
0x200ff33f16a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x200ff33f16b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x200ff33f16c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x200ff33f16d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x200ff33f16e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x200ff33f16f0: fa fa fa fa fa fa fa fa fa fa fa fa 00 00[04]fa
0x200ff33f1700: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x200ff33f1710: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x200ff33f1720: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x200ff33f1730: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x200ff33f1740: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==143714==ABORTING
```
'it' might be invalid, so before using 'it', need to figure validity out
Patrick Donnelly [Thu, 11 Apr 2024 23:00:53 +0000 (19:00 -0400)]
Merge PR #56839 into main
* refs/pull/56839/head:
script/ptl-tool: push branch to shaman for qa runs
script/ptl-tool: add release name to branch with switch
script/ptl-tool: alphabetize arguments
script/ptl-tool: avoid repo specific remotes entirely
script/ptl-tool: improve help text for envvars
Adds move constructor and move assignment operator to the librbd::Image.
Also marks copy ctor/assign op as deleted, and makes them public for better compiler diagnostics.
Patrick Donnelly [Thu, 11 Apr 2024 17:10:59 +0000 (13:10 -0400)]
Merge PR #56835 into main
* refs/pull/56835/head:
script/ptl-tool: create qa trackers for test branches
script/ptl-tool: add switch for debugging
script/ptl-tool: add --stop-at-built flag