orchestrator and cephadm relied on CLICommandMeta to bypass the global
behavior of CLICommand. That is no longer a problem, so replace
CLICommandMeta with OrchestratorCLICommandBase to preserve the magic
error wrapping.
Jon Bailey [Mon, 22 Sep 2025 11:59:47 +0000 (12:59 +0100)]
Fix for "data digests are inconsistent"
Multiple bugs could cause "data digests are inconsistent" to be logged at incorrect times. This commit reorganises some of the deep scrubbing code and fixes them. The root causes were:
* We were comparing crc buffers beyond the end of the crcs
* There was a double call to logical_to_ondisk_size when creating the crcs for zero buffers, causing them to be mis-sized
* The code was not padding smaller shards, although shards must be the same size when used for parity comparison.
All of the above are fixed in this commit.
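The padding requirement in the last bullet can be illustrated with a minimal sketch. It is not the Ceph scrub code: plain XOR parity stands in for the real erasure code, and the names pad_shards and xor_parity are hypothetical.

```python
from functools import reduce


def pad_shards(shards: list[bytes]) -> list[bytes]:
    """Zero-pad every shard to the length of the longest one."""
    size = max((len(s) for s in shards), default=0)
    return [s.ljust(size, b"\x00") for s in shards]


def xor_parity(shards: list[bytes]) -> bytes:
    """XOR the shards together byte by byte (assumed stand-in for parity)."""
    padded = pad_shards(shards)
    if not padded:
        return b""
    # zip() would silently stop at the shortest shard, so padding comes first.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*padded))
```

Without the padding step, the comparison would cover only the bytes that every shard happens to have.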
Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
Bill Scales [Sun, 17 Aug 2025 15:42:11 +0000 (16:42 +0100)]
qa: test_pool_min_size should kill osds first then mark them down
The objective of test_pool_min_size is to inject up to M failures
in a K+M pool to prove that it has enough redundancy to stay
active.
It was selecting OSDs and then killing and marking them out
one at a time. Testing with wide erasure codes (high values of
K and M - for example 8+4) found that this test sometimes
failed with a PG becoming dead. Debugging showed that
once one OSD had been killed and marked out, rebalancing
and async recovery could start, which further reduced
the redundancy of the PG. When the remaining error
injects happened, the PG correctly became dead.
In practice OSDs are not normally killed and marked out
one after another in quick succession. The more common
scenario is that one or more OSDs fail at about the
same time (let's say over a couple of minutes) and then
after mon_osd_down_out_interval (10 mins) the mon
will mark them out. Killing the OSDs first and then
marking them out prevents additional async recovery
from starting.
If OSDs do fail over a long period of time such that
the mon marks each OSD out then hopefully there is
enough time for async recovery to run between the
failures.
This commit changes the error inject to kill all the
selected OSDs first and then to mark them out.
Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
Laura Flores [Mon, 24 Nov 2025 17:31:05 +0000 (11:31 -0600)]
qa/suites/upgrade: add "OBJECT_UNFOUND" to ignorelists
The thrashing in the upgrade tests has been configured to be very aggressive;
the tests are permitted to stop up to 4 of the 8 OSDs, so these kinds of
health warnings are expected.
This commit also cleans up some expected filesystem and pg peering warnings
in the upgrade tests.
Fixes: https://tracker.ceph.com/issues/72424
Signed-off-by: Laura Flores <lflores@ibm.com>
Kefu Chai [Fri, 9 Jan 2026 23:53:29 +0000 (07:53 +0800)]
rgw: fix memory leak in RGWHTTPManager thread cleanup
Fix memory leak detected by AddressSanitizer in unittest_http_manager.
The test was failing with ASan enabled due to rgw_http_req_data objects
not being properly cleaned up when the HTTP manager thread exits.
ASan reported the following leaks:
Direct leak of 17152 byte(s) in 32 object(s) allocated from:
#0 operator new(unsigned long)
#1 RGWHTTPManager::add_request(RGWHTTPClient*)
/ceph/src/rgw/rgw_http_client.cc:946:33
#2 HTTPManager_SignalThread_Test::TestBody()
/ceph/src/test/rgw/test_http_manager.cc:132:10
Indirect leak of 768 byte(s) in 32 object(s) allocated from:
#0 operator new(unsigned long)
#1 rgw_http_req_data::rgw_http_req_data()
/ceph/src/rgw/rgw_http_client.cc:52:22
#2 RGWHTTPManager::add_request(RGWHTTPClient*)
/ceph/src/rgw/rgw_http_client.cc:946:37
SUMMARY: AddressSanitizer: 17920 byte(s) leaked in 64 allocation(s).
Root cause: The rgw_http_req_data class uses reference counting
(inherits from RefCountedObject). When a request is unregistered,
unregister_request() calls get() to increment the refcount, expecting
a corresponding put() to be called later.
In manage_pending_requests(), unregistered requests are properly
handled with both _unlink_request() and put(). However, in the thread
cleanup code (reqs_thread_entry exit path), only _unlink_request() was
called without the matching put(), causing a reference count leak.
The fix adds the missing put() call in the thread cleanup code to match
the reference counting pattern used in manage_pending_requests().
Test results:
- Before: 17,920 bytes leaked in 64 allocations
- After: 0 leaks, unittest_http_manager passes with ASan
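The get()/put() pairing described above can be modelled with a toy sketch. It is not the RGW code: the RefCounted class and the unregister and drain helpers are hypothetical stand-ins, and the release flag exists only to contrast the two cleanup paths.

```python
class RefCounted:
    """Toy stand-in for a reference-counted request object."""

    def __init__(self):
        self.refs = 1        # the creator holds the first reference
        self.freed = False

    def get(self):
        self.refs += 1
        return self

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.freed = True


def unregister(req, pending):
    """Queue a request for unlinking; the queue takes its own reference."""
    pending.append(req.get())


def drain(pending, release):
    """Unlink every queued request; 'release' models whether put() is called."""
    while pending:
        req = pending.pop()
        # The unlink step would happen here.
        if release:
            req.put()        # drops the reference taken in unregister()
```

With release=False the queue's reference is never dropped, so the object is never freed, which mirrors the leak in the thread exit path.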
ceph-volume: update argparse help output assertions for python compatibility
This commit updates test assertions to be more flexible about
argparse help output format. It replaces checks for 'optional
arguments' and 'positional arguments' with checks for 'positional'
and help flags ('-h' or '--help'), which works across different
python versions where argparse output format may differ.
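A minimal sketch of this style of assertion, assuming a throwaway parser rather than the real ceph-volume one (the "example" program name, the "device" argument and the help_looks_sane helper are hypothetical). Newer Python releases title the flags section "options" instead of "optional arguments", so the check avoids matching that heading.

```python
import argparse


def build_parser():
    """Build a small parser with one positional argument."""
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("device", help="device to operate on")
    return parser


def help_looks_sane(help_text):
    """Check for markers that do not depend on the exact section headings."""
    has_positional = "positional" in help_text
    has_help_flag = "-h" in help_text or "--help" in help_text
    return has_positional and has_help_flag


help_text = build_parser().format_help()
```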
ceph-volume: avoid Device() instantiation in lvm OSD filtering
Replace Device() instantiation with direct LVM API calls to reduce
subprocess overhead.
Use a sysfs check first, then only query LVM for actual LVs.
Cache the LVM mappers list to avoid repeated sysfs reads.
Ville Ojamo [Tue, 3 Feb 2026 06:28:12 +0000 (13:28 +0700)]
doc: unpin pip in admin/doc-read-the-docs.txt
7dd00ca introduced a proper fix for pip 25.3/PEP517 compatibility by
adding pyproject.toml files and the workaround in a65c46c is no longer
necessary. RTD builds with pip 25.3 and later work with the proper fix.
Remove the pinned pip in admin/doc-read-the-docs.txt and let RTD use the
default pip version.
Signed-off-by: Ville Ojamo <14869000+bluikko@users.noreply.github.com>
Shraddha Agrawal [Thu, 29 Jan 2026 04:28:00 +0000 (09:58 +0530)]
ceph-volume: support crimson osd binary
Prior to this commit, ceph-volume was using a hardcoded OSD binary
to issue commands (e.g. to perform mkfs). This commit enables
ceph-volume to start supporting crimson OSDs.
A new argument, --osd-type, is introduced with the default value
classic. When this parameter is set to 'crimson', the ceph-osd-crimson
binary will be used to execute OSD commands.
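The selection can be sketched as follows. This is not the ceph-volume implementation: the OSD_BINARIES table, the osd_binary helper and the use of argparse choices are assumptions made for illustration.

```python
import argparse

# Assumed mapping from OSD type to the binary that runs OSD commands.
OSD_BINARIES = {"classic": "ceph-osd", "crimson": "ceph-osd-crimson"}


def osd_binary(argv):
    """Return the OSD binary name selected by --osd-type."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--osd-type",
        choices=sorted(OSD_BINARIES),
        default="classic",
    )
    args = parser.parse_args(argv)
    return OSD_BINARIES[args.osd_type]
```

Calling osd_binary([]) falls back to the classic binary, while passing ["--osd-type", "crimson"] selects the crimson one.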
This commit enables us to deploy both classic and crimson
type OSDs using cephadm. To support this, a new feature,
osd_type, is added to DriverGroupSpec. It defaults to
classic, but can also be set to crimson.
When this value is read by cephadm, the entrypoint is
changed from /usr/bin/ceph-osd to /usr/bin/ceph-osd-crimson.
- updates tearsheet component css to match the carbon component
- adds loading state to submit button
- adds support for step validation when Angular components are used for steps rather than plain html templates
- adds step one of nvmeof
Ilya Dryomov [Fri, 30 Jan 2026 15:32:35 +0000 (16:32 +0100)]
qa/tasks/rbd_mirror_thrash: don't use random.randrange() on floats
This stopped working in Python 3.12:
Changed in version 3.12: Automatic conversion of non-integer types
is no longer supported. Calls such as randrange(10.0) and
randrange(Fraction(10, 1)) now raise a TypeError.
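Two common ways to avoid passing floats to random.randrange() are sketched below. The helper names are hypothetical, and this is not necessarily the change made in the thrasher task.

```python
import random


def random_delay(min_seconds: float, max_seconds: float) -> float:
    """Pick a fractional delay; uniform() accepts floats on every version."""
    return random.uniform(min_seconds, max_seconds)


def random_whole_delay(min_seconds: float, max_seconds: float) -> int:
    """Pick a whole-second delay by converting the bounds to int explicitly."""
    return random.randrange(int(min_seconds), int(max_seconds) + 1)
```

The first keeps sub-second granularity, while the second keeps randrange() but makes the integer conversion explicit.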
Ilya Dryomov [Tue, 11 Nov 2025 15:33:16 +0000 (16:33 +0100)]
qa/tasks/qemu: install genisoimage package
genisoimage is expected to be included in our base images but currently
isn't on Rocky 10. Since it's quite a niche thing, let's install the
package explicitly.
Ilya Dryomov [Thu, 29 Jan 2026 20:41:03 +0000 (21:41 +0100)]
qa/workunits/rbd: reduce randomized sleeps in live import tests
These tests were tuned for slower hardware than what we have now.
Currently "rbd migration execute" always finishes (successfully) before
the NBD server is killed.
Ilya Dryomov [Tue, 11 Nov 2025 20:39:58 +0000 (21:39 +0100)]
qa/valgrind.supp: make gcm_cipher_internal suppression more resilient
gcm_cipher_internal() and ossl_gcm_stream_final() make it to the stack
trace only on CentOS Stream 9. On Ubuntu 22.04 and Rocky 10, it looks
as follows:
Thread 4 msgr-worker-1:
Conditional jump or move depends on uninitialised value(s)
at 0x70A36D4: ??? (in /usr/lib64/libcrypto.so.3.2.2)
by 0x70A39A1: ??? (in /usr/lib64/libcrypto.so.3.2.2)
by 0x6F8A09C: EVP_DecryptFinal_ex (in /usr/lib64/libcrypto.so.3.2.2)
by 0xB498C1F: ceph::crypto::onwire::AES128GCM_OnWireRxHandler::authenticated_decrypt_update_final(ceph::buffer::v15_2_0::list&) (crypto_onwire.cc:271)
by 0xB4992D7: ceph::msgr::v2::FrameAssembler::disassemble_preamble(ceph::buffer::v15_2_0::list&) (frames_v2.cc:281)
by 0xB482D98: ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int) (ProtocolV2.cc:1149)
by 0xB475318: ProtocolV2::run_continuation(Ct<ProtocolV2>&) (ProtocolV2.cc:54)
by 0xB457012: AsyncConnection::process() (AsyncConnection.cc:495)
by 0xB49E61A: EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) (Event.cc:492)
by 0xB49EA9D: UnknownInlinedFun (Stack.cc:50)
by 0xB49EA9D: UnknownInlinedFun (invoke.h:61)
by 0xB49EA9D: UnknownInlinedFun (invoke.h:111)
by 0xB49EA9D: std::_Function_handler<void (), NetworkStack::add_thread(Worker*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (std_function.h:290)
by 0xBB11063: ??? (in /usr/lib64/libstdc++.so.6.0.33)
by 0x4F17119: start_thread (in /usr/lib64/libc.so.6)
The proposal to amend the existing suppression so that it's tied to the
specific callsite rather than libcrypto internals [1] received a thumbs
up from Radoslaw.
Roland Sommer [Fri, 30 Jan 2026 07:54:49 +0000 (08:54 +0100)]
debian: package mgr/smb in ceph-mgr-modules-core
The `BaseController` auto-imports the packaged `mgr/dashboard/controllers/smb.py`
file, which in turn wants to import `smb.enums` etc. which is part of the `smb`
package which is missing from `debian/ceph-mgr-modules-core.install`, thus
missing in the package. The missing module causes an exception
`ModuleNotFoundError: No module named 'smb'` on mgr instances when running a
ceph tentacle cluster installed from debian packages.
See: https://tracker.ceph.com/issues/74268
Signed-off-by: Roland Sommer <rol@ndsommer.de>
Afreen Misbah [Wed, 28 Jan 2026 09:59:08 +0000 (15:29 +0530)]
mgr/dashboard: fetch all namespaces in a gateway group
- adds a new API /api/gateway_group/{group}/namespace
- updates tests
- needed for UI flows and in general to fetch all namespaces; the existing API could not be changed because backward compatibility must be maintained
- server side pagination will be added in a followup PR
Ville Ojamo [Fri, 30 Jan 2026 04:47:40 +0000 (11:47 +0700)]
doc/dev: add sequence diagrams back to health-reports.rst
The sequence diagrams were removed in ce96ddd because they were causing
issues. Add them back as SVG images. Include as comments the source code
used to generate the diagrams.
Signed-off-by: Ville Ojamo <14869000+bluikko@users.noreply.github.com>
John Mulligan [Thu, 29 Jan 2026 23:28:44 +0000 (18:28 -0500)]
Merge pull request #65632 from phlogistonjohn/jjm-smb-hosts-allow
smb: support shares equivalent for hosts allow
Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>
Reviewed-by: Anoop C S <anoopcs@cryptolab.net>
Reviewed-by: Shwetha Acharya <sacharya@redhat.com>
Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Adam King <adking@redhat.com>
John Mulligan [Fri, 9 Jan 2026 16:25:43 +0000 (11:25 -0500)]
qa/workunits/smb: make the runner script easier to use manually
When testing the tests it can help speed things up to avoid
recreating the virtualenv. Allow an env var SMB_REUSE_VENV=<path>
to supply a specific virtual env dir to (re)use.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
John Mulligan [Thu, 8 Jan 2026 18:42:14 +0000 (13:42 -0500)]
qa/suites/orch/cephadm: enable hosts_access tests
Enable the hosts_access tests when running deploy_smb_mgr_basic.yaml,
deploy_smb_mgr_domain.yaml, deploy_smb_mgr_res_basic.yaml, or
deploy_smb_mgr_res_dom.yaml.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
John Mulligan [Thu, 8 Jan 2026 18:45:43 +0000 (13:45 -0500)]
qa/workunits/smb: add tests for hosts_access field
The recently added hosts_access field allows a share to be configured
to allow or deny hosts by IP or network. The new module reconfigures
a share to attempt a small set of access scenarios with the hosts_access
field.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
John Mulligan [Wed, 19 Nov 2025 22:26:27 +0000 (17:26 -0500)]
qa/workunits/smb: add utility module for cephadm shell commands
Add a helper module that makes it a bit cleaner and easier to
find and interact with the cluster's 'admin node', the node where
we can run `cephadm shell` and commands within that shell.
This will allow us to make modifications to smb resources via
the ceph command and JSON in order to test various features.
Signed-off-by: John Mulligan <jmulligan@redhat.com>
John Mulligan [Fri, 9 Jan 2026 14:32:56 +0000 (09:32 -0500)]
qa/workunits/smb: make the smb_cfg fixture module scoped
This means the file will only be read when pytest changes modules.
This also allows this fixture to be used with other fixtures at
module scope or any scope "higher" than the function scope.
John Mulligan [Fri, 9 Jan 2026 16:12:46 +0000 (11:12 -0500)]
qa/tasks: add client node info to smb workunit config dump
When generating the big ball of config JSON that helps define
parameters for the smb tests in the workunit, add client "node"
info as well.
Add a function to avoid repeating the logic of getting node
info from the teuthology remote object.
Signed-off-by: John Mulligan <jmulligan@redhat.com>