Kefu Chai [Sat, 18 Oct 2025 14:23:56 +0000 (22:23 +0800)]
debian/rules: enable WITH_CRIMSON when pkg.ceph.crimson profile is set
Since commit 9b1d524839 ("debian: mark "crimson" specific deps with
"pkg.ceph.crimson""), crimson-specific build dependencies have been
gated by the Build-Profiles: <pkg.ceph.crimson> tag. However,
debian/rules was never updated to pass -DWITH_CRIMSON=ON when this
build profile is active.
This causes builds with the crimson profile enabled to fail during
dh_install, as the crimson-osd binary is never built but the install
file tries to package it:
Failed to copy 'usr/bin/crimson-osd': No such file or directory
dh_install: error: debian/ceph-crimson-osd.install returned exit code 127
Fix this by checking for pkg.ceph.crimson in DEB_BUILD_PROFILES and
enabling the CMake option accordingly, following the same pattern used
for pkg.ceph.arrow.
Kefu Chai [Fri, 17 Oct 2025 14:09:26 +0000 (22:09 +0800)]
qa: install ceph-osd-classic and ceph-osd-crimson
- qa/packages/packages.yaml: add ceph-osd and ceph-osd-classic to
packages/packages.yaml so that the "install" task installs
ceph-osd-classic by default. This preserves the existing behavior.
- qa/suites/crimson-rados: install ceph-osd-crimson instead of
ceph-osd-classic. Adding these packages to exclude_packages and
extra_packages in task.install lets us customize the packages to be
installed when performing the "install" task.
- qa/suites/crimson-rados-experimental: likewise.
debian,ceph.spec: split ceph-osd into shared base and implementation packages
Previously, ceph-osd packaging had two mutually exclusive flavors that
could only be built one at a time: one with classic OSD and another
with crimson OSD. Both provided /usr/bin/ceph-osd, making them
impossible to coexist and confusing from a user perspective.
This commit restructures the packaging to enable both implementations
to coexist on the same system:
- ceph-osd: Contains shared components (systemd units, sysctl configs,
common executables like ceph-erasure-code-tool) and depends on exactly
one OSD implementation
- ceph-osd-classic: Contains the classic OSD implementation binary and
classic-specific tools
- ceph-osd-crimson: Contains the crimson OSD implementation binary and
crimson-specific tools
The two implementation packages install different sets of files, so they
don't conflict with each other anymore, and both depend on ceph-osd for
shared resources.
Changes:
Debian packaging:
- Revert e5f00d2f
- Add ceph-osd-crimson package
- Add Recommends: ceph-osd-classic to prefer classic on upgrades
- Add Replaces/Breaks for smooth upgrades from old monolithic package
- Create separate .install files for crimson and classic osd packages
- Enforce exact version matching using ${binary:Version}
RPM packaging:
- Use rich dependencies for OR requirement (classic or crimson)
- Add Recommends: ceph-osd-classic for upgrade preference
Upgrade behavior:
Users upgrading from older versions will automatically get
ceph-osd-classic due to the Recommends directive, maintaining
backward compatibility. Users can explicitly choose crimson by
installing ceph-osd-crimson, which will coexist with classic.
Switching between implementations is supported via standard package
operations, with the alternatives system ensuring /usr/bin/ceph-osd
always points to the active implementation.
Nizamudeen A [Thu, 6 Nov 2025 04:53:47 +0000 (10:23 +0530)]
mgr/dashboard: start node virtual-env after starting ceph cluster
In the frontend e2e.sh file, we don't need to start the node venv early,
before the ceph cluster is started; we only need it for the `npm` or
`npx` commands. Starting the node virtual env and then starting ceph
causes the ceph cluster to assume the node-env python as its python
environment, which breaks the cryptotools call.
So move the node-env venv start to after the ceph cluster is created.
Fixes: https://tracker.ceph.com/issues/73804
Signed-off-by: Nizamudeen A <nia@redhat.com>
Samuel Just [Tue, 11 Nov 2025 02:52:22 +0000 (02:52 +0000)]
qa/suites: remove centos restriction from valgrind yaml
http://tracker.ceph.com/issues/20360 and
http://tracker.ceph.com/issues/18126 were quite some time
ago. The restriction is causing trouble now because it only overrides
the os_type bit and leaves os_version alone, causing teuthology
to look for centos 10 (centos + rocky 10).
Kefu Chai [Mon, 10 Nov 2025 15:01:32 +0000 (23:01 +0800)]
mgr/dashboard: fix Physical Disks identify test race condition
Fix a regression in the Physical Disks identify device e2e test that
causes intermittent timeouts when attempting to click the "Identify"
button.
Problem:
The test was timing out after 120 seconds while attempting to click
the "Identify" button, which remained in a disabled state. This
manifested as a race condition where the test would try to click the
button before it became enabled.
Root Cause (Regression Analysis):
This regression was introduced in commit 94418d90d2b ("mgr/dashboard:
fix UI modal issues", Sept 9, 2024) which aimed to fix the Physical
Disks Identify modal not opening (tracker.ceph.com/issues/67547).
While that commit successfully:
- Migrated from cd-modal to cds-modal (Carbon Design System)
- Changed button selector to use data-testid="primary-action"
- Added the e2e test to prevent future regressions
It inadvertently introduced a timing issue by not adding proper wait
logic for the button to become enabled. The commit also modified the
table-actions component to conditionally render the primary action
button based on tableActions.length > 0, which can cause the button
to be disabled while table actions are still loading.
Solution:
Add .should('not.be.disabled') before .click() to ensure Cypress waits
for the button to become enabled before attempting to interact with it.
This follows the established pattern used elsewhere in the codebase
(see page-helper.po.ts:319).
Impact:
- Fixes Jenkins build failures in ceph-dashboard-cephadm-e2e job
- Observed in build #18956 as "Regression - Failing for 1 build"
- Jenkins metrics show MTTF of ~2 hours, indicating this race
condition occurs frequently enough to cause CI instability
Ilya Dryomov [Mon, 10 Nov 2025 19:43:59 +0000 (20:43 +0100)]
qa/suites/rbd/valgrind: don't hardcode os_type in memcheck.yaml
The entire subsuite is pinned by centos_latest.yaml symlink, so the
stanza in memcheck.yaml is redundant. Removing it allows experimenting
with other distros just by varying the symlink target.
common: ModeCollector: locating the value of the mode
The ModeCollector class is used to collect values
of some type 'key', each associated with some object
identified by an 'ID'. The collector reports the 'mode'
value - the value associated with the largest number
of distinct IDs.
The results structure returned by the collector specifies
one of three possible mode_status_t values:
- no_mode_value - No clear victory for any value
- mode_value - we have a winner, but it has less than half of the
samples
- authorative_value - more than half of the samples are of the same
value
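The classification described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the C++ ModeCollector class: the function name `find_mode`, the `ModeStatus` enum and the dict-based input are hypothetical, and two behaviors are assumptions not stated in the commit message: a tie for first place is reported as no_mode_value, and a winner with exactly half of the samples is reported as mode_value.

```python
from collections import Counter
from enum import Enum


class ModeStatus(Enum):
    # Names mirror the three mode_status_t values listed in the commit message.
    NO_MODE_VALUE = "no_mode_value"
    MODE_VALUE = "mode_value"
    AUTHORATIVE_VALUE = "authorative_value"


def find_mode(samples):
    """Classify the mode of `samples`, a dict mapping ID -> key.

    Using a dict guarantees that each ID contributes exactly one sample.
    Returns a (status, key) tuple; key is None when there is no mode.
    """
    if not samples:
        return ModeStatus.NO_MODE_VALUE, None

    counts = Counter(samples.values())
    ranked = counts.most_common(2)
    top_key, top_count = ranked[0]

    # Assumption: a tie for first place means "no clear victory".
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return ModeStatus.NO_MODE_VALUE, None

    # Strictly more than half of all samples share the winning key.
    if 2 * top_count > len(samples):
        return ModeStatus.AUTHORATIVE_VALUE, top_key

    # A unique winner, but without a majority.
    return ModeStatus.MODE_VALUE, top_key
```

For example, three IDs reporting the keys 1, 1 and 2 yield an authoritative result for key 1, while two IDs reporting 1 and 2 yield no mode at all.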
Yuval Lifshitz [Mon, 3 Nov 2025 11:20:07 +0000 (11:20 +0000)]
rgw/logging: deleting the object holding the temp object name on cleanup
* in case of prefix per source, this prevents leaking this object
* in case of a shared prefix, it prevents data loss when other source
buckets try to commit an already committed temporary object
* when updating the "last committed" attribute, the object must exist.
this is so that commit without rollover (in case of cleanup) won't
recreate the deleted object
* some refactoring of try-catch code to have less nesting
galsalomon66 [Mon, 27 Oct 2025 17:25:58 +0000 (17:25 +0000)]
initialize enable_progress, length_before_processing and length_post_processing on construction.
These variables are initialized in the s3select/CSV flow, and no local valgrind run has discovered any issue related to them.
However, valgrind reports produced by teuthology sometimes point to run_s3select_on_csv as containing an UninitCondition warning.
Matan Breizman [Wed, 22 Oct 2025 13:48:08 +0000 (13:48 +0000)]
qa/suites/crimson-rados-experimental: Test recovery thrash test
Now that https://tracker.ceph.com/issues/67446 is merged, we should be able
to start testing Seastore, similar to the `crimson-rados/thrash` suite, which
uses ceph_test_rados and rados bench.
crimson-rados-experimental is a copy of crimson-rados thrash with only
objectstore changes.
Once the experimental suite is ready, we could add seastore to
crimson-rados/thrash and remove crimson-rados/thrash_seastore_* variants.
Matan Breizman [Thu, 16 Oct 2025 08:50:30 +0000 (08:50 +0000)]
qa/tasks/admin_socket: fix ceph-ci/main usage
When scheduling jobs with --sha1 instead of -c, the ceph-ci
branch used is 'main'. However, ceph-ci doesn't actually have
a main branch. Use the ceph.git main branch instead.
```
Command failed on smithi116 with status 8: "wget -q -O
/home/ubuntu/cephtest/admin_socket_client.0/objecter_requests --
'http://git.ceph.com/?p=ceph-ci.git;a=blob_plain;f=src/test/admin_socket/objecter_requests;hb=main' && chmod u=rx -- /home/ubuntu/cephtest/admin_socket_client.0/objecter_requests"
```
qa/suites/crimson-rados: restructure objectstore dir
This is a preparation step for adding new backend options
testing (e.g. caching type)
* move crimson's objectstore yamls from qa/config to
qa/objectstore/crimson
* Use the entire qa/objectstore/crimson where possible
instead of symlinking each backend definition
* Introduce qa/clusters/crimson
4 deployment clusters (1/2/3/4 nodes) options same as classic.
* Symlink all cluster dirs to the common dir above
For now keep using only 1/2, we could add 3/4 later on.
* Move to "crimson cpu num" instead of specifying
"crimson cpu set".
- We expect users to mostly use this option for deploying
clusters, so use this as testing default.
* remove "crimson bluestore cpu set" which is responsible for
cpu pinning exclusiveness in seastar/alien cores.
* ignore "for optimal performance" cluster warning now that we
no longer pin cpus for testing.
Expose rbd_default_clone_format option which has a fairly comprehensive
description (much more verbose than most other options, anyway). This
should help with understanding the difference between clone v1 and v2.
Ville Ojamo [Fri, 17 Oct 2025 06:42:57 +0000 (13:42 +0700)]
src/common: Fix formatting in options/rgw.yaml.in
Fix single backticks to double backticks to properly end the inline
preformatted formatting. Fixes the formatting overflowing until the next
occurrence of double backticks seen in rendered docs, URL:
https://docs.ceph.com/en/latest/radosgw/config-ref/#confval-rgw_scheduler_type
Add full stops that seemed to be missing in desc attribute.
Use singular word "value" in desc attribute when there's only one
possible other value.
Remove unnecessary "the".
Signed-off-by: Ville Ojamo <14869000+bluikko@users.noreply.github.com>
Laura Flores [Tue, 4 Nov 2025 21:10:42 +0000 (15:10 -0600)]
qa/tasks: make the cephadm and vstart_runner tasks aware of watchdog
The WatchedProcesses class was added in https://github.com/ceph/ceph/pull/64889/commits
to help the DaemonWatchdog monitor processes. In https://github.com/ceph/ceph/pull/64889/commits/7ee026be4e7ef07502507dfb7975c74bc8c85fc5,
an attribute 'watched_processes' was added to the cluster context to track
a list of processes. This was added to the ceph task (in ceph.py), but for
tests that use the cephadm task instead (cephadm.py), we need to add it there too.
This applies to `thrash-old-clients` and `upgrade` tests in particular.
To be on the safe side, we should also initialize 'watched_processes' for vstart_runner
in case someone opts into the watchdog there in the future.
This commit also unifies the quotation marks for the 'watched_processes' attribute in the
ceph task with the other attributes. No major logic is changed here; it is only for convention.
Fixes: https://tracker.ceph.com/issues/73682
Signed-off-by: Laura Flores <lflores@ibm.com>
test/librbd: Remove crimson skip from TestDeepCopy
The TestDeepCopy.Stress and TestDeepCopy.Stress_SmallerDstObjSize tests
were previously skipped for the crimson store. This commit removes the
SKIP_IF_CRIMSON() calls, indicating that the tests should now pass with
the crimson osd.
chungfengz [Thu, 6 Nov 2025 09:46:51 +0000 (09:46 +0000)]
bluestore/BlueFS: fix bytes_written_slow counter with aio_write
The bytes_written_slow performance counter was incorrectly reporting
0 when using async I/O.
When aio_write() is called with a bufferlist, it uses claim_append()
to transfer ownership of the buffer to the aio structure, leaving the
source bufferlist empty. Using t.length() after aio_write() returns 0
instead of the actual bytes written.
Fix by using the pre-calculated x_len value which contains the actual
write size and is not affected by the buffer ownership transfer.
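The effect of the ownership transfer can be illustrated with a toy Python model. This is a sketch under stated assumptions, not the Ceph bufferlist or BlueFS code: the `BufferList` class and the `write_and_count` helper are hypothetical stand-ins, and only the names `claim_append`, `aio_write`, `length`, `t` and `x_len` are taken from the commit message.

```python
class BufferList:
    """Toy stand-in for a buffer list; not the real Ceph bufferlist API."""

    def __init__(self, data=b""):
        self._chunks = [data] if data else []

    def length(self):
        return sum(len(chunk) for chunk in self._chunks)

    def claim_append(self, other):
        # Take ownership of the other list's chunks, leaving it empty.
        self._chunks.extend(other._chunks)
        other._chunks = []


def aio_write(aio_buffer, source):
    # Hypothetical model of aio_write(): the aio structure claims the data.
    aio_buffer.claim_append(source)


def write_and_count(payload):
    t = BufferList(payload)
    x_len = t.length()        # size captured before the write is issued
    aio = BufferList()
    aio_write(aio, t)
    buggy_count = t.length()  # source is empty after the transfer
    fixed_count = x_len       # unaffected by the transfer
    return buggy_count, fixed_count
```

In this model, `write_and_count(b"x" * 4096)` returns a count of 0 for the length read after the write and 4096 for the length captured beforehand, which mirrors the counter problem described above.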
The with_obc() function acquires a lock before invoking the
lambda it wraps. Previously, the lambda itself called send_to_osd(),
which returns a future to with_obc(). If that future is not resolved
immediately, a response can arrive and trigger
handle_pull_response(), which attempts to acquire an exclusive lock.
Because the future has not yet been returned to with_obc(), the original
lock is still held by with_obc(), so handle_pull_response() hits
an assertion failure and the OSD crashes.
Solution: Move the send_to_osd() call outside the with_obc() lambda so that
the lock is released before handle_pull_response() is triggered.
test/client: When testing large io, consider fscrypt
When testing large io sizes and clamping that io, consider
fscrypt max io size. This max io size should be a multiple
of 4K (fscrypt block size) and must not exceed INT_MAX.
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
client: Use nearest fscrypt block when clamping max io size
A max io size can currently be up to INT_MAX. If it is greater,
the size is clamped to INT_MAX. This conflicts with fscrypt io
operations: an fscrypt op needs to read a whole fscrypt block.
The fscrypt block size is 4K, and INT_MAX % 4K is not equal
to 0. Therefore, get the nearest multiple of 4K to INT_MAX that
does not go over. In the fscrypt case, this value will be used
for clamping max io size.
Fixes: https://tracker.ceph.com/issues/73346
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
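The clamping arithmetic from the commit above can be worked through in Python. This is a minimal sketch, assuming a 32-bit signed INT_MAX; the function name `clamp_max_io_size` and its `fscrypt_enabled` parameter are hypothetical and do not reflect the actual client code.

```python
INT_MAX = 2**31 - 1          # assumes a 32-bit signed int
FSCRYPT_BLOCK_SIZE = 4096    # 4K fscrypt block size, per the commit message


def clamp_max_io_size(size, fscrypt_enabled):
    """Clamp an io size to INT_MAX, or to the largest multiple of the
    fscrypt block size that does not exceed INT_MAX when fscrypt is on."""
    limit = INT_MAX
    if fscrypt_enabled:
        # Round INT_MAX down to a whole number of fscrypt blocks.
        limit = INT_MAX - (INT_MAX % FSCRYPT_BLOCK_SIZE)
    return min(size, limit)
```

Since INT_MAX % 4096 is 4095, the fscrypt limit works out to 2147483647 - 4095 = 2147479552, which is exactly 524287 blocks of 4K.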
client: Do not expose ceph_fscrypt_key_identifier in api
The libcephfs API call add_fscrypt_key exposes an internal fscrypt
data structure. This is because a hash keyid (of the master key) is used
for calls such as remove_fscrypt_key. Instead of using this structure,
use a char array to obtain keyid.
Fixes: https://tracker.ceph.com/issues/63293
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
Fix warnings/errors in ceph API tests that are present in various files
that were introduced by fscrypt feature
src/client/FSCrypt.cc:90:6: error: variable 'olen' set but not used [-Werror,-Wunused-but-set-variable]
90 | int olen = 0;
| ^
src/client/FSCrypt.cc:91:6: error: variable 'line' set but not used [-Werror,-Wunused-but-set-variable]
91 | int line = 0;
| ^
src/client/FSCrypt.cc:945:2: error: is this the way to do it? [-Werror,-W#warnings]
945 | #warning is this the way to do it?
src/client/Client.cc:11850:2: error: read holes [-Werror,-W#warnings]
11850 | #warning read holes
| ^
src/client/Client.cc:11855:2: error: implement file read here [-Werror,-W#warnings]
11855 | #warning implement file read here
| ^
src/client/Inode.cc:847:2: error: need to make sure that we do not skip entire subtree somehow [-Werror,-W#warnings]
847 | #warning need to make sure that we do not skip entire subtree somehow
| ^
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
Fix warnings/errors in ceph API tests that are present in FSCrypt.cc
src/client/FSCrypt.cc:90:6: error: variable 'olen' set but not used [-Werror,-Wunused-but-set-variable]
90 | int olen = 0;
| ^
src/client/FSCrypt.cc:91:6: error: variable 'line' set but not used [-Werror,-Wunused-but-set-variable]
91 | int line = 0;
| ^
src/client/FSCrypt.cc:945:2: error: is this the way to do it? [-Werror,-W#warnings]
945 | #warning is this the way to do it?
Signed-off-by: Christopher Hoffman <choffman@redhat.com>