Shraddha Agrawal [Thu, 22 May 2025 10:26:41 +0000 (15:56 +0530)]
mon/MgrStatMonitor: ignore duration for which feature is off
When the availability tracking feature is disabled, we should not
be updating the score. We should start recalculating the score
when the user enables the feature again. Essentially, for the
purpose of calculating the score, we need to ignore the duration
for which the feature was turned off.
The score is calculated from the uptime and downtime durations
recorded in the `pool_availability` object. These durations are updated
in `calc_pool_availability` by adding the diff between last_uptime/
last_downtime and now.
To discard the duration for which the feature was turned off, we
need to offset the uptime/downtime by this duration. A simple way
to do this is to update the last_uptime and last_downtime to the
timestamp when the feature is toggled on again. To implement the
same, we record the time at which the feature is toggled from off
to on. When `calc_pool_availability` is invoked, if a reset is
required, it resets last_uptime and last_downtime before proceeding
with availability calculations.
We only care about the state when the feature is toggled from off to
on. All other toggle states for the config option will not have any
effect on the score.
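The reset described above can be illustrated with a minimal Python sketch. This is not the MgrStatMonitor code: the `PoolAvailability` and `AvailabilityTracker` classes, the `pending_reset_at` field and the `set_enabled` method are hypothetical names, and the sketch assumes a single pool with plain float timestamps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolAvailability:
    """Hypothetical stand-in for a per-pool availability record."""
    uptime: float = 0.0
    downtime: float = 0.0
    last_uptime: float = 0.0
    last_downtime: float = 0.0


class AvailabilityTracker:
    """Assumed model: one pool, timestamps in seconds."""

    def __init__(self, now: float) -> None:
        self.enabled = True
        # Time of an off -> on toggle that has not been applied yet.
        self.pending_reset_at: Optional[float] = None
        self.pool = PoolAvailability(last_uptime=now, last_downtime=now)

    def set_enabled(self, enabled: bool, now: float) -> None:
        # Only the off -> on transition is recorded; every other
        # toggle leaves the score inputs untouched.
        if enabled and not self.enabled:
            self.pending_reset_at = now
        self.enabled = enabled

    def calc_pool_availability(self, now: float, is_available: bool = True) -> None:
        if not self.enabled:
            return  # feature is off: keep the score as it was
        pool = self.pool
        if self.pending_reset_at is not None:
            # Discard the disabled interval by moving both reference
            # timestamps to the moment the feature was re-enabled.
            pool.last_uptime = pool.last_downtime = self.pending_reset_at
            self.pending_reset_at = None
        if is_available:
            pool.uptime += now - pool.last_uptime
        else:
            pool.downtime += now - pool.last_downtime
        pool.last_uptime = pool.last_downtime = now
```

In this sketch, a pool that is up for 10 seconds, then untracked for 50 seconds, then up for another 10 seconds accumulates 20 seconds of uptime rather than 70.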
Shraddha Agrawal [Thu, 22 May 2025 09:16:50 +0000 (14:46 +0530)]
MgrStatMonitor: add config observer
This commit adds a config observer to MgrStatMonitor so we
can track when a user enables/disables the enable_availability_tracking
config option. The time difference between disabling and then
enabling the config option will be used to offset the uptime
and/or downtime from the availability score feature.
Shraddha Agrawal [Thu, 22 May 2025 08:20:57 +0000 (13:50 +0530)]
mon/MgrStatMonitor.cc: do not update score when disabled
This commit adds changes to ensure the availability score
tracking is not updated when the feature is disabled. We
will preserve the score calculated before the feature is
turned off and start updating it again when the feature
is enabled.
src/common/options: add config option for availability score
This commit modifies src/common/options/mon.yaml.in to add a
new config option to enable/disable tracking availability
score. This config option can be modified dynamically at
runtime as well.
To enable tracking availability score, we can run the
following command:
ceph config set mon enable_availability_tracking true
By default, tracking availability score is enabled.
To disable tracking availability score:
ceph config set mon enable_availability_tracking false
When the feature is turned off, invoking the
`availability-status` command will display an error, prompting
the user to turn on the feature using the config option.
Zac Dover [Wed, 11 Jun 2025 12:44:32 +0000 (22:44 +1000)]
doc/rados/ops: edit cache-tiering.rst
Add material to doc/rados/operations/cache-tiering.rst, as suggested by
Anthony D'Atri in
https://github.com/ceph/ceph/pull/63745#discussion_r2127887785.
Zac Dover [Wed, 11 Jun 2025 12:39:50 +0000 (22:39 +1000)]
doc/radosgw: edit cloud-transition.rst
Add a link to the "Versioned Objects" section from a place in the docs
where that section is referred to. This change was requested by Anthony
D'Atri in
https://github.com/ceph/ceph/pull/63447#discussion_r2104492552.
Jaya Prakash [Wed, 5 Mar 2025 21:56:37 +0000 (21:56 +0000)]
os/bluestore: Implemented create-bdev-label
Introduces a helper function create_bdev_label() and a new command create-bdev-label
to write essential OSD metadata (e.g., fsid, whoami) directly into the device label
at offset 0, for use on devices where support_bdev_label == false.
Zac Dover [Tue, 10 Jun 2025 10:38:54 +0000 (20:38 +1000)]
doc/rbd: add mirroring troubleshooting info
Add a note to doc/rbd/rbd-mirroring.rst that directs the reader to set
both "site-a" and "site-b" to have the same pool names in the event that
rbd throws the error message "failed to import peer bootstrap token".
This information was reported to the Ceph upstream by Petr Tlapa in June
of 2025, and credit for its development goes to Petr.
Zac Dover [Tue, 10 Jun 2025 10:58:22 +0000 (20:58 +1000)]
doc/rados: enhance "pools.rst"
Add a link to the instructions for modifying a user's caps for a given
pool. Place the link where the reader would naturally want to have it.
Kefu Chai [Tue, 10 Jun 2025 09:59:28 +0000 (17:59 +0800)]
test/erasure-code: fix memory leak in ErasureCodePlugin.parity_delta_write
Fix 4KB memory leak in ErasureCodePlugin_parity_delta_write_Test caused by
unmanaged raw buffer allocation. The test was allocating a 4096-byte raw
buffer to replace shard 4 for delta encoding validation, but the buffer::ptr
constructed from the raw pointer did not manage the buffer's lifecycle.
Detected by AddressSanitizer:
```
Direct leak of 4096 byte(s) in 1 object(s) allocated from:
#0 0x7fb73a720e15 in malloc /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:67
#1 0x5562f4062ccc in ErasureCodePlugin_parity_delta_write_Test::TestBody() /home/kefu/dev/ceph/src/test/erasure-code/TestErasureCodePluginJerasure.cc:122
#2 0x5562f41081a1 in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/kefu/dev/ceph/src/googletest/googletest/src/gtest.cc:2653
#3 0x5562f40f3004 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /home/kefu/dev/ceph/src/googletest/googletest/src/gtest.cc:2689
#4 0x5562f409cbba in testing::Test::Run() /home/kefu/dev/ceph/src/googletest/googletest/src/gtest.cc:2728
```
In this change, we replace raw pointer allocation with
create_bufferptr() to ensure proper memory management by buffer::ptr.
Zac Dover [Tue, 10 Jun 2025 02:54:18 +0000 (12:54 +1000)]
doc/mgr: edit telemetry.rst (lines 300-400)
Edit doc/mgr/telemetry.rst (lines 300-400).
Follow up on the suggestions made by Anthony D'Atri in
https://github.com/ceph/ceph/pull/63741 (except for the one about
including Lovecraftian lore in the dummy user data in this file).
Kefu Chai [Mon, 9 Jun 2025 10:26:21 +0000 (18:26 +0800)]
rgw/rgw_lua_utils: fix memory leak in luaL_error() formatting
Previously, error messages passed to luaL_error() were formatted using
std::string concatenation. Since luaL_error() never returns (it throws
a Lua exception via longjmp), the allocated std::string memory was
leaked, as detected by AddressSanitizer.
This change replaces std::string formatting with a stack-allocated buffer
and std::to_chars() to eliminate the memory leak.
Note: We cannot format int64_t directly through luaL_error() because
lua_pushfstring() does not support long long or int64_t format specifiers,
even in Lua 5.4 (see https://www.lua.org/manual/5.4/manual.html#lua_pushfstring).
Since libstdc++ uses int64_t for std::chrono::milliseconds::rep, we use
std::to_chars() for safe, efficient conversion without heap allocation.
The maximum runtime limit was a configuration introduced by 3e3cb156.
Naveen Naidu [Mon, 9 Jun 2025 08:02:44 +0000 (13:32 +0530)]
.github/workflows/diff-ceph-config.yml: use --ref-commit-sha and --cmp-commit-sha
Update config_diff.py to use `--ref-commit-sha` and
`--cmp-commit-sha` to replicate the three-dot diff [1] that GitHub uses
for showing its diff. This way we only output the configuration changes
that have been made in the PR.
Naveen Naidu [Sun, 8 Jun 2025 13:55:24 +0000 (19:25 +0530)]
src/script/config_diff.py: add support for `ref-commit-sha` and `cmp-commit-sha` arguments
Introduced `ref-commit-sha` and `cmp-commit-sha` arguments to the
`diff-branch-remote-repo` mode, enabling comparison of remote
branches against specific commits.
This enhancement is crucial for comparing configuration changes
between a pull request (PR) and the Ceph upstream main branch. It
allows for precise comparison by focusing on files changed in the
PR, rather than simply comparing the PR's head with its latest
commit.
The approach mirrors GitHub's three-dot diff [1], where the PR is
compared against its common ancestor with the Ceph upstream repository,
i.e., the point where the PR was forked.
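The difference between a two-dot and a three-dot comparison can be sketched on a toy history. This is only an illustration and not how config_diff.py is implemented: the commit names, the `PARENTS` graph, the `OPTIONS` snapshots and all helper functions are hypothetical, and the merge-base search is a simplification of what `git merge-base` does.

```python
from collections import deque


def ancestors(parents, commit):
    """Return the set holding `commit` and every commit reachable from it."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def merge_base(parents, ref, cmp):
    """Walk back from `cmp` breadth-first until a commit reachable from `ref` is found."""
    reachable_from_ref = ancestors(parents, ref)
    queue = deque([cmp])
    visited = set()
    while queue:
        current = queue.popleft()
        if current in reachable_from_ref:
            return current
        if current in visited:
            continue
        visited.add(current)
        queue.extend(parents.get(current, ()))
    return None


def diff_options(old, new):
    """Map each changed option to its (old, new) pair."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Toy history: main is A <- B <- C, the PR forks at B and adds P1 <- P2.
PARENTS = {"A": [], "B": ["A"], "C": ["B"], "P1": ["B"], "P2": ["P1"]}
# Hypothetical option defaults as seen at selected commits.
OPTIONS = {
    "B": {"x": 1, "y": 1},
    "C": {"x": 2, "y": 1},   # main changed x after the fork
    "P2": {"x": 1, "y": 2},  # the PR changed y
}


def two_dot(ref, cmp):
    return diff_options(OPTIONS[ref], OPTIONS[cmp])


def three_dot(ref, cmp):
    base = merge_base(PARENTS, ref, cmp)
    return diff_options(OPTIONS[base], OPTIONS[cmp])
```

Here `two_dot("C", "P2")` reports both `x` and `y`, while `three_dot("C", "P2")` reports only `y`, the option the PR itself touched.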
Naveen Naidu [Mon, 9 Jun 2025 07:36:00 +0000 (13:06 +0530)]
.github/workflows/config-diff-post-comment.js: improve handling of GH comment
1. When no configuration changes are detected, delete the outdated
configuration diff GitHub comment. This ensures that the PR does not
have any misleading information about configuration changes.
2. Because configuration changes might differ with every push event,
update the old configuration diff comment with the new configDiff that
was calculated in the present run.
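The two rules above amount to a small decision table. The following Python sketch only models that decision; the actual workflow script is JavaScript, and the function name `plan_comment_action`, its parameters and the action labels are hypothetical. The "create" and "noop" branches are assumptions added to make the table complete.

```python
from typing import Optional, Tuple


def plan_comment_action(
    config_diff: str, existing_comment_id: Optional[int]
) -> Tuple[str, Optional[int]]:
    """Decide what to do with the PR's configuration diff comment.

    Returns (action, comment_id), where action is one of
    "delete", "update", "create" or "noop".
    """
    has_changes = bool(config_diff.strip())
    if not has_changes:
        # Rule 1: an old comment would now be misleading, so remove it.
        if existing_comment_id is not None:
            return ("delete", existing_comment_id)
        return ("noop", None)
    # Rule 2: refresh the old comment with the diff from this run.
    if existing_comment_id is not None:
        return ("update", existing_comment_id)
    # Assumed: with no earlier comment, a new one is posted.
    return ("create", None)
```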
Naveen Naidu [Sun, 8 Jun 2025 06:37:11 +0000 (12:07 +0530)]
src/scripts/config-diff.py: simplify sparse_branch_checkout_* functions and add files names to POSIX diff
Refactored `sparse_branch_checkout_skip_clone` and
`sparse_branch_checkout_remote_repo_skip_clone` to accept and use
branch/tag names directly instead of constructing `ref_sha` strings
throughout the code.
Also include in the POSIX diff the names of the files that the
configuration values come from. This helps identify the config options
faster in case of discrepancies.
Kefu Chai [Wed, 4 Jun 2025 03:05:38 +0000 (11:05 +0800)]
cmake: enable out-of-source build of breakpad
Previously, Breakpad was built in its source tree instead of the
user-specified build directory, inconsistent with other external
projects and potentially causing source tree pollution.
Include path fix:
- Add ${INSTALL_DIR}/include/breakpad to include directories to fix
FTBFS on Jammy builders
Build system improvements:
- Replace dedicated LSS submodule symlink target with PATCH_COMMAND to
simplify the build process
- Use user-specified make command instead of hardcoded "make"
- Skip building unused process library and tools
- Link against breakpad with PRIVATE visibility unless required
Compiler flag cleanups:
- Remove -Wno-array-bounds from CFLAGS (Breakpad uses C++/CXXFLAGS)
- Remove compile-time flags incorrectly placed in LDFLAGS
- Remove '-fPIC' from CFLAGS, as it is already included by breakpad
when building on linux hosts.
- Replace the individual -Wno-* flags with -Wno-error to cancel
-Werror option specified by breakpad. This is more future-proof.
CMake target modernization:
- Rename libbreakpad_client to Breakpad::client following modern conventions
- Add Breakpad::breakpad header-only target to minimize dependencies
- Install library to enable proper include path prefixes
(breakpad/client/... vs client/...)
Header dependency optimization:
- Remove Breakpad includes from popular headers, use forward declarations
- Include Breakpad headers before internal headers for better readability
Ronen Friedman [Wed, 4 Jun 2025 17:44:16 +0000 (12:44 -0500)]
osd/scrub: make m_session_started_at at Session state ctor
ScrubMachine::get_time_scrubbing() must access the Session object
to compute the scrub duration. But the State data is not externally
accessible before its ctor has completed.
Because we always try to access that data from inside the ctor,
this always results in a warning log message.
Here we move m_session_started_at into the outer state, simplifying
the logic required to access it.
Zac Dover [Wed, 4 Jun 2025 23:39:33 +0000 (09:39 +1000)]
doc/glossary: s/OMAP/omap/
Change "OMAP" to "omap" to match the capitalization established by
Eleanor Cawthon in her 2012 omap paper, here:
https://ceph.io/assets/pdfs/CawthonKeyValueStore.pdf.
Samuel Just [Wed, 4 Jun 2025 20:55:21 +0000 (20:55 +0000)]
.gitmodules: remove shallow=true config from nvmeof/gateway
https://github.com/ceph/ceph/pull/61264 reintroduced
https://tracker.ceph.com/issues/67640 fixed by 383091e89.
Setting shallow=true for the nvmeof/gateway submodule
is problematic because the ceph.git submodule sha1
is only very rarely the head sha1 of the default
branch.
Fixes: https://tracker.ceph.com/issues/71568
Signed-off-by: Samuel Just <sjust@redhat.com>
mgr/dashboard: fix KeyError exception in HardwareService.get_summary()
Typical error:
```
[dashboard ERROR exception] Internal Server Error
Traceback (most recent call last):
File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 48, in dashboard_exception_handler
return handler(*args, **kwargs)
File "/lib/python3.9/site-packages/cherrypy/_cpdispatch.py", line 54, in __call__
return self.callable(*self.args, **self.kwargs)
File "/usr/share/ceph/mgr/dashboard/controllers/_base_controller.py", line 263, in inner
ret = func(*args, **kwargs)
File "/usr/share/ceph/mgr/dashboard/controllers/_rest_controller.py", line 193, in wrapper
return func(*vpath, **params)
File "/usr/share/ceph/mgr/dashboard/controllers/hardware.py", line 21, in summary
return HardwareService.get_summary(categories, hostname)
File "/usr/share/ceph/mgr/dashboard/services/hardware.py", line 33, in get_summary
'ok': sum(item['status']['health'] == 'OK' for items in data.values()
File "/usr/share/ceph/mgr/dashboard/services/hardware.py", line 33, in <genexpr>
'ok': sum(item['status']['health'] == 'OK' for items in data.values()
KeyError: 'status'
```
The recent change from commit `fbcdf571ca1` introduced this regression.
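One defensive way to avoid this class of KeyError is to treat a missing `status` entry as "not OK" instead of indexing into it directly. The sketch below is not the fix applied in this commit: the function name `summarize_health`, the shape of `data` (host name to component name to component dict) and the returned keys are all assumptions made for illustration.

```python
def summarize_health(data):
    """Count components per health state without assuming 'status' exists.

    `data` is assumed to look like:
        {hostname: {component_id: {"status": {"health": "OK"}, ...}}}
    """
    total = 0
    ok = 0
    for components in data.values():
        for item in components.values():
            total += 1
            # A missing or empty 'status' yields None instead of raising KeyError.
            health = (item.get("status") or {}).get("health")
            if health == "OK":
                ok += 1
    return {"total": total, "ok": ok, "error": total - ok}
```

Whether a component without a `status` entry should count as an error or be skipped is a design decision; this sketch counts it as an error so that the problem remains visible in the summary.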
* refs/pull/62865/head:
test/libcephfs: copy DT_NEEDED entries from input libraries
test/fs: only add libcephfs as library dependency
test/client: do not depend on libcephfs
Anoop C S [Wed, 4 Jun 2025 08:02:01 +0000 (13:32 +0530)]
libcephfs: Bump API major version
We recently had ABI changes[1] with respect to APIs from chown() family
which calls for a change in major version. Native users of the library
may not have to change their code, but expected sizes differ when the
data types of parameters are changed. However go-ceph, Go bindings for ceph,
couldn't build[2] unless the ABI change is made visible to the consumers
of the API. Following the Semantic Versioning guidelines[3] we reset
minor and patch (extra) versions to 0.
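The Semantic Versioning rule applied here can be written as a one-line function. This is a generic sketch; the function name `bump_major` is hypothetical and the version numbers in the usage note are made up, not the real libcephfs version.

```python
from typing import Tuple

Version = Tuple[int, int, int]  # (major, minor, patch)


def bump_major(version: Version) -> Version:
    """Increment the major version and reset minor and patch to 0."""
    major, _minor, _patch = version
    return (major + 1, 0, 0)
```

For example, a made-up version `(4, 7, 2)` becomes `(5, 0, 0)` after an incompatible ABI change.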