Kefu Chai [Mon, 10 Nov 2025 15:01:32 +0000 (23:01 +0800)]
mgr/dashboard: fix Physical Disks identify test race condition
Fix a regression in the Physical Disks identify device e2e test that
causes intermittent timeouts when attempting to click the "Identify"
button.
Problem:
The test was timing out after 120 seconds while attempting to click
the "Identify" button, which remained in a disabled state. This
manifested as a race condition where the test would try to click the
button before it became enabled.
Root Cause (Regression Analysis):
This regression was introduced in commit 94418d90d2b ("mgr/dashboard:
fix UI modal issues", Sept 9, 2024) which aimed to fix the Physical
Disks Identify modal not opening (tracker.ceph.com/issues/67547).
While that commit successfully:
- Migrated from cd-modal to cds-modal (Carbon Design System)
- Changed button selector to use data-testid="primary-action"
- Added the e2e test to prevent future regressions
It inadvertently introduced a timing issue by not adding proper wait
logic for the button to become enabled. The commit also modified the
table-actions component to conditionally render the primary action
button based on tableActions.length > 0, which can cause the button
to be disabled while table actions are still loading.
Solution:
Add .should('not.be.disabled') before .click() to ensure Cypress waits
for the button to become enabled before attempting to interact with it.
This follows the established pattern used elsewhere in the codebase
(see page-helper.po.ts:319).
Impact:
- Fixes Jenkins build failures in ceph-dashboard-cephadm-e2e job
- Observed in build #18956 as "Regression - Failing for 1 build"
- Jenkins metrics show MTTF of ~2 hours, indicating this race
condition occurs frequently enough to cause CI instability
Ilya Dryomov [Mon, 10 Nov 2025 19:43:59 +0000 (20:43 +0100)]
qa/suites/rbd/valgrind: don't hardcode os_type in memcheck.yaml
The entire subsuite is pinned by centos_latest.yaml symlink, so the
stanza in memcheck.yaml is redundant. Removing it allows experimenting
with other distros just by varying the symlink target.
common: ModeCollector: locating the value of the mode
The ModeCollector class is used to collect values
of some type 'key', each associated with some object
identified by an 'ID'. The collector reports the 'mode'
value - the value associated with the largest number
of distinct IDs.
The results structure returned by the collector specifies
one of three possible mode_status_t values:
- no_mode_value - No clear victory for any value
- mode_value - we have a winner, but it has less than half of the
samples
- authorative_value - more than half of the samples are of the same
value
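For illustration only, a minimal Python sketch of the mode computation
described above. It is not the actual ModeCollector code: find_mode(),
ModeStatus and the "a tie means no clear victory" rule are assumptions
made for the example.

```python
from collections import defaultdict
from enum import Enum, auto


class ModeStatus(Enum):
    # Names mirror the three mode_status_t values listed in the commit message.
    NO_MODE_VALUE = auto()
    MODE_VALUE = auto()
    AUTHORATIVE_VALUE = auto()


def find_mode(samples):
    """samples maps an object ID to its value (hypothetical input shape).

    Returns (status, value, number_of_ids_holding_that_value).
    """
    ids_per_value = defaultdict(set)
    for obj_id, value in samples.items():
        ids_per_value[value].add(obj_id)
    if not ids_per_value:
        return ModeStatus.NO_MODE_VALUE, None, 0

    # Rank values by how many distinct IDs reported them.
    ranked = sorted(ids_per_value.items(), key=lambda kv: len(kv[1]), reverse=True)
    top_value, top_ids = ranked[0]
    top_count = len(top_ids)

    # Assumption: a tie for first place counts as "no clear victory".
    if len(ranked) > 1 and len(ranked[1][1]) == top_count:
        return ModeStatus.NO_MODE_VALUE, None, top_count

    # More than half of all samples share the winning value.
    if top_count * 2 > len(samples):
        return ModeStatus.AUTHORATIVE_VALUE, top_value, top_count
    return ModeStatus.MODE_VALUE, top_value, top_count
```

For example, find_mode({1: "a", 2: "a", 3: "b"}) reports "a" as the
authoritative value, since two of the three IDs agree on it.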
Yuval Lifshitz [Mon, 3 Nov 2025 11:20:07 +0000 (11:20 +0000)]
rgw/logging: deleting the object holding the temp object name on cleanup
* in case of prefix per source this would prevent leaking this object
* in case of shared prefix, it would prevent data loss when other source
buckets try to commit an already committed temporary object
* when updating the "last committed" attribute, the object must exist.
this is so that commit without rollover (in case of cleanup) won't
recreate the deleted object
* some refactoring of try-catch code to have less nesting
galsalomon66 [Mon, 27 Oct 2025 17:25:58 +0000 (17:25 +0000)]
initialize enable_progress, length_before_processing and length_post_processing on construction.
These variables are initialized in the s3select/CSV flow, and no local valgrind run has found any issue related to them.
However, valgrind reports produced by teuthology occasionally flag an UninitCondition warning in run_s3select_on_csv.
Matan Breizman [Wed, 22 Oct 2025 13:48:08 +0000 (13:48 +0000)]
qa/suites/crimson-rados-experimental: Test recovery thrash test
Now that https://tracker.ceph.com/issues/67446 is merged, we should be able
to start testing Seastore similarly to the `crimson-rados/thrash` suite, which
uses ceph_test_rados and rados bench.
crimson-rados-experimental is a copy of crimson-rados thrash with only
objectstore changes.
Once the experimental suite is ready, we could add seastore to
crimson-rados/thrash and remove crimson-rados/thrash_seastore_* variants.
Matan Breizman [Thu, 16 Oct 2025 08:50:30 +0000 (08:50 +0000)]
qa/tasks/admin_socket: fix ceph-ci/main usage
When scheduling jobs with --sha1 instead of -c, the ceph-ci
branch used is 'main'. However, ceph-ci doesn't actually have
a main branch. Use the ceph.git main branch instead.
```
Command failed on smithi116 with status 8: "wget -q -O
/home/ubuntu/cephtest/admin_socket_client.0/objecter_requests --
'http://git.ceph.com/?p=ceph-ci.git;a=blob_plain;f=src/test/admin_socket/objecter_requests;hb=main' && chmod u=rx -- /home/ubuntu/cephtest/admin_socket_client.0/objecter_requests"
```
qa/suites/crimson-rados: restructure objectstore dir
This is a preparation step for adding new backend options
testing (e.g. caching type)
* move crimson's objectstore yamls from qa/config to
qa/objectstore/crimson
* Use the entire qa/objectstore/crimson where possible
instead of symlinking each backend definition
* Introduce qa/clusters/crimson
4 deployment clusters (1/2/3/4 nodes) options same as classic.
* Symlink all cluster dirs to the common dir above
For now keep using only 1/2, we could add 3/4 later on.
* Move to "crimson cpu num" instead of specifying
"crimson cpu set" set.
- We expect users to mostly use this option for deploying
clusters, so use this as testing default.
* remove "crimson bluestore cpu set" which is responsible for
cpu pinning exclusiveness in seastar/alien cores.
* ignore "for optimal performance" cluster warning now that we
no longer pin cpus for testing.
Expose rbd_default_clone_format option which has a fairly comprehensive
description (much more verbose than most other options, anyway). This
should help with understanding the difference between clone v1 and v2.
Ville Ojamo [Fri, 17 Oct 2025 06:42:57 +0000 (13:42 +0700)]
src/common: Fix formatting in options/rgw.yaml.in
Fix single backticks to double backticks to properly end the inline
preformatted formatting. Fixes the formatting overflowing until the next
occurrence of double backticks seen in rendered docs, URL:
https://docs.ceph.com/en/latest/radosgw/config-ref/#confval-rgw_scheduler_type
Add full stops that seemed to be missing in desc attribute.
Use singular word "value" in desc attribute when there's only one
possible other value.
Remove unnecessary "the".
Signed-off-by: Ville Ojamo <14869000+bluikko@users.noreply.github.com>
Laura Flores [Tue, 4 Nov 2025 21:10:42 +0000 (15:10 -0600)]
qa/tasks: make the cephadm and vstart_runner tasks aware of watchdog
The WatchedProcesses class was added in https://github.com/ceph/ceph/pull/64889/commits
to help the DaemonWatchdog monitor processes. In https://github.com/ceph/ceph/pull/64889/commits/7ee026be4e7ef07502507dfb7975c74bc8c85fc5,
an attribute 'watched_processes' was added to the cluster context to track
a list of processes. This was added to the ceph task (in ceph.py), but for
tests that use the cephadm task instead (cephadm.py), we need to add it there too.
This applies to `thrash-old-clients` and `upgrade` tests in particular.
To be on the safe side, we should also initialize 'watched_processes' for vstart_runner
in case someone opts into the watchdog there in the future.
This commit also unifies the quotation marks for the 'watched_processes' attribute in the
ceph task with the other attributes. No major logic is changed here; it is only for convention.
Fixes: https://tracker.ceph.com/issues/73682
Signed-off-by: Laura Flores <lflores@ibm.com>
test/librbd: Remove crimson skip from TestDeepCopy
The TestDeepCopy.Stress and TestDeepCopy.Stress_SmallerDstObjSize tests
were previously skipped for the crimson store. This commit removes the
SKIP_IF_CRIMSON() calls, indicating that the tests should now pass with
the crimson osd.
chungfengz [Thu, 6 Nov 2025 09:46:51 +0000 (09:46 +0000)]
bluestore/BlueFS: fix bytes_written_slow counter with aio_write
The bytes_written_slow performance counter was incorrectly reporting
0 when using async I/O.
When aio_write() is called with a bufferlist, it uses claim_append()
to transfer ownership of the buffer to the aio structure, leaving the
source bufferlist empty. Using t.length() after aio_write() returns 0
instead of the actual bytes written.
Fix by using the pre-calculated x_len value which contains the actual
write size and is not affected by the buffer ownership transfer.
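For illustration only, a minimal Python sketch of why reading the
length after the write reports 0. BufferList, aio_write() and
write_and_count() here are simplified stand-ins written for this
example, not the real Ceph bufferlist or BlueFS code.

```python
class BufferList:
    """Toy stand-in for a bufferlist: a list of byte chunks."""

    def __init__(self, data=b""):
        self._chunks = [data] if data else []

    def length(self):
        return sum(len(chunk) for chunk in self._chunks)

    def claim_append(self, other):
        # Take ownership of the other list's chunks and leave it empty.
        self._chunks.extend(other._chunks)
        other._chunks = []


def aio_write(aio_queue, bl):
    # The aio structure claims the caller's buffers.
    owned = BufferList()
    owned.claim_append(bl)
    aio_queue.append(owned)


def write_and_count(aio_queue, payload):
    t = BufferList(payload)
    x_len = t.length()            # length computed before the write
    aio_write(aio_queue, t)
    buggy_count = t.length()      # source was emptied by claim_append()
    fixed_count = x_len           # unaffected by the ownership transfer
    return buggy_count, fixed_count
```

With a 4096-byte payload, write_and_count() returns (0, 4096): the
counter based on the post-write length sees nothing, while the
pre-calculated length is correct.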
The with_obc() function acquires a lock before invoking the
lambda it wraps. Previously the lambda itself called send_to_osd(),
which returns a future to with_obc(). If that future is not resolved
immediately, a response can arrive and trigger
handle_pull_response(), which attempts to acquire an exclusive lock.
Because the future has not yet been returned to with_obc(), the original
lock is still held by with_obc(), so handle_pull_response() hits
an assertion failure and the OSD crashes.
Solution: Move send_to_osd() call outside with_obc lambda so that
the lock is released before handle_pull_response() is triggered.
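For illustration only, a rough Python analogue of the ordering problem.
The real code uses Seastar futures rather than threads; obc_lock and
the function bodies below are hypothetical stand-ins, and the response
is simulated as arriving immediately.

```python
import threading

# Hypothetical stand-in for the lock taken by with_obc(); non-reentrant.
obc_lock = threading.Lock()


def with_obc(fn):
    # Hold the lock for the whole duration of the wrapped callable.
    with obc_lock:
        return fn()


def handle_pull_response():
    # Needs the lock exclusively; report whether it could be taken.
    acquired = obc_lock.acquire(blocking=False)
    if acquired:
        obc_lock.release()
    return acquired


def send_to_osd():
    # Simulate the response arriving right after the request is sent.
    return handle_pull_response()


def old_flow():
    # send_to_osd() runs inside the wrapped callable, so the lock is held.
    return with_obc(send_to_osd)


def new_flow():
    # The wrapped callable finishes first; the send happens after release.
    with_obc(lambda: None)
    return send_to_osd()
```

In this sketch old_flow() returns False (the handler cannot take the
lock) and new_flow() returns True.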
test/client: When testing large io, consider fscrypt
When testing large io sizes and clamping that io, consider
fscrypt max io size. This max io size should be a multiple
of 4K (fscrypt block size), but not to exceed INT_MAX.
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
client: Use nearest fscrypt block when clamping max io size
A max io size can currently be up to INT_MAX. If it is greater,
then clamp the size to INT_MAX. This conflicts with fscrypt io
operations. An fscrypt op needs to read a whole fscrypt block.
The fscrypt block size is 4K, and INT_MAX % 4K is not equal
to 0. Therefore, get the nearest multiple of 4K to INT_MAX that
does not go over. In the fscrypt case, this value will be used
for clamping max io size.
Fixes: https://tracker.ceph.com/issues/73346
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
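For illustration only, the arithmetic behind the clamp as a small Python
sketch. It assumes a 32-bit int (INT_MAX = 2^31 - 1); the function and
constant names are made up for the example and are not the client's
identifiers.

```python
INT_MAX = 2**31 - 1          # assumes a 32-bit int
FSCRYPT_BLOCK_SIZE = 4096    # 4K fscrypt block size, per the commit message


def fscrypt_max_io_size():
    # Largest multiple of the fscrypt block size that does not exceed INT_MAX.
    return INT_MAX - (INT_MAX % FSCRYPT_BLOCK_SIZE)


def clamp_io_size(size, fscrypt_enabled):
    # Pick the block-aligned limit only when fscrypt is in use.
    limit = fscrypt_max_io_size() if fscrypt_enabled else INT_MAX
    return min(size, limit)
```

Here INT_MAX % 4096 is 4095, so the fscrypt limit works out to
2147479552, which is 2^31 - 4096.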
client: Do not expose ceph_fscrypt_key_identifier in api
The libcephfs API call add_fscrypt_key exposes an internal fscrypt
data structure. This is because a hash keyid (of the master key) is used
for calls such as remove_fscrypt_key. Instead of using this structure,
use a char array to obtain keyid.
Fixes: https://tracker.ceph.com/issues/63293
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
Fix warnings/errors in ceph API tests that are present in various files
that were introduced by fscrypt feature
src/client/FSCrypt.cc:90:6: error: variable 'olen' set but not used [-Werror,-Wunused-but-set-variable]
90 | int olen = 0;
| ^
src/client/FSCrypt.cc:91:6: error: variable 'line' set but not used [-Werror,-Wunused-but-set-variable]
91 | int line = 0;
| ^
src/client/FSCrypt.cc:945:2: error: is this the way to do it? [-Werror,-W#warnings]
945 | #warning is this the way to do it?
src/client/Client.cc:11850:2: error: read holes [-Werror,-W#warnings]
11850 | #warning read holes
| ^
src/client/Client.cc:11855:2: error: implement file read here [-Werror,-W#warnings]
11855 | #warning implement file read here
| ^
src/client/Inode.cc:847:2: error: need to make sure that we do not skip entire subtree somehow [-Werror,-W#warnings]
847 | #warning need to make sure that we do not skip entire subtree somehow
| ^
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
Fix warnings/errors in ceph API tests that are present in FSCrypt.cc
src/client/FSCrypt.cc:90:6: error: variable 'olen' set but not used [-Werror,-Wunused-but-set-variable]
90 | int olen = 0;
| ^
src/client/FSCrypt.cc:91:6: error: variable 'line' set but not used [-Werror,-Wunused-but-set-variable]
91 | int line = 0;
| ^
src/client/FSCrypt.cc:945:2: error: is this the way to do it? [-Werror,-W#warnings]
945 | #warning is this the way to do it?
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
Add fscrypt dummy encryption to client. This allows
mounting a cephfs volume without providing any fscrypt
information, which makes setup more straightforward
for development and test suites.
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
Marcus Watts [Sat, 28 Jun 2025 00:56:05 +0000 (20:56 -0400)]
libcephfs: ll_set_fscrypt_policy_v2 - use in->dirstat
Better check for empty directory.
It turns out in->dirstat contains a count of files and subdirectories
from a directory, so all we have to do is make sure that's valid.
Rishabh Dave [Wed, 16 Jul 2025 16:04:18 +0000 (21:34 +0530)]
client: in fcopyfile(), update len to read only leftover fragment
fcopyfile() reads 1 MiB of data every time, but when a fragment smaller
than 1 MiB is left, it still reads 1 MiB of data, so the condition
"off == size" is never met. This leads to an infinite loop which
continues to write until CephFS becomes full.
Resolves: rhbz#2379716
Fixes: https://tracker.ceph.com/issues/72238
Signed-off-by: Rishabh Dave <ridave@redhat.com>
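For illustration only, a minimal Python sketch of a chunked copy loop
that shrinks the last read to the leftover fragment. copy_stream() and
its parameters are hypothetical and operate on in-memory streams; this
is not the fcopyfile() implementation.

```python
import io

CHUNK = 1024 * 1024  # 1 MiB, as in the commit message


def copy_stream(src, dst, size, chunk=CHUNK):
    """Copy exactly `size` bytes from src to dst; return the final offset."""
    off = 0
    while off < size:
        # Read only what is left when less than a full chunk remains.
        length = min(chunk, size - off)
        data = src.read(length)
        if not data:
            break  # source ended early; avoid spinning forever
        dst.write(data)
        off += len(data)
    return off
```

For a source of 2.5 MiB, the loop performs two full reads and one
512 KiB read, and the returned offset equals the size.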