Kefu Chai [Sat, 13 Jun 2026 01:50:09 +0000 (09:50 +0800)]
python-common/cryptotools: stop using the removed X509Req API
pyOpenSSL deprecated OpenSSL.crypto.X509Req in 24.2.0 (2024-07-20) and
removed it in 26.3.0 (2026-06-12). as we don't pin pyopenssl, CI picked
up the new release, and create_self_signed_cert() started failing with:
AttributeError: module 'OpenSSL.crypto' has no attribute 'X509Req'
this took down run-tox-mgr, run-tox-mgr-dashboard-py3 and the mypy check.
we only used X509Req to build a subject name and then copied it into the
X509 cert. so drop it, and set the subject on the cert directly. the
resulting cert stays the same: subject from dname, issuer set to the same
subject, self-signed.
Ashwin M. Joshi [Tue, 10 Feb 2026 06:29:49 +0000 (11:59 +0530)]
mgr/cephadm: Control cephadm.log messages based on a new mgr logging level flag
Introduces a new 'cephadm_binary_logging_level' config option to control
the verbosity of cephadm logging to persistent destinations (cephadm.log, syslog).
- Adds --logging-level CLI flag (info, debug, error, warning)
- Adds mgr/cephadm/cephadm_binary_logging_level config option
- Applies logging level to file and syslog handlers
- Console handlers maintain their defaults for terminal UX
Fixes: https://tracker.ceph.com/issues/74872
Signed-off-by: Ashwin M. Joshi <ashjosh1@in.ibm.com>
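The split between persistent and console handlers described above can be sketched with Python's standard logging module. This is an illustration only, not the cephadm code: build_logger, the logger name, the in-memory streams standing in for cephadm.log/syslog, and the INFO console default are all assumptions.

```python
import io
import logging


def build_logger(persistent_level: str, persistent_stream, console_stream) -> logging.Logger:
    """Build a logger whose persistent handler honours a configurable level."""
    logger = logging.getLogger("cephadm-sketch")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    logger.propagate = False

    # Stand-in for the file/syslog handlers: the level comes from the option.
    persistent = logging.StreamHandler(persistent_stream)
    persistent.setLevel(getattr(logging, persistent_level.upper()))
    logger.addHandler(persistent)

    # The console handler keeps a fixed default (assumed to be INFO here).
    console = logging.StreamHandler(console_stream)
    console.setLevel(logging.INFO)
    logger.addHandler(console)
    return logger


persistent_buf, console_buf = io.StringIO(), io.StringIO()
log = build_logger("error", persistent_buf, console_buf)
log.info("deploying daemon")
log.error("deploy failed")
```

With the persistent level set to "error", only the second message reaches the persistent stream, while the console stream still receives both.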
doc/rados/configuration: Remove wpq recommendation warning for EC clusters
Remove the warning that recommends using wpq scheduler as a fallback for EC
clusters. This issue is addressed by considering EC recovery reads as
background, assigning an accurate cost for those reads and tuning the QoS
parameters associated with best-effort class of operations.
mclock_common: adjust mClock profile parameters to prevent backfill starvation
Adjust the 'background_best_effort' queue parameters across the
three standard mClock profiles (high_client_ops, balanced, and
high_recovery_ops) to ensure best effort ops are not starved.
Previously, the 'background_best_effort' queue carried a default allocation
of 0% (MIN) reservation and a weight of 1 under these profiles. When
concurrent client traffic is dense, the zero reservation can, for example,
completely starve backfill sub-ops (MSG_OSD_EC_READ) on pools with
'allow_ec_optimizations' set to false. This starvation forces the Primary OSD
to hold internal BlueStore transactions and PG object locks for extended
windows, causing severe inflation of the client median (50th percentile) latency.
To prevent background starvation and resolve the effects of the primary lock
retention, the profile configurations are tuned as listed below. These
changes force low-cost sub-ops to clear out of peer queues rapidly so that
primary locks are dropped sooner, which helps improve the client completion
latency and the tail latency (95th, 99th and 99.5th percentiles):
1. high_client_ops profile:
- Grant 'background_best_effort' a safe 5% minimum reservation.
- Scale the queue weight to 4.
2. balanced profile:
- Grant 'background_best_effort' a 5% minimum reservation.
- Set the queue weight to 2.
3. high_recovery_ops profile:
- Grant 'background_best_effort' a 5% minimum reservation.
- Set the queue weight to 2.
4. Modify the mClock config reference documentation to reflect the tuning
changes to the best-effort QoS parameters across the profiles.
Note on Proportional Scaling Compatibility:
Configuring these changes shifts total reservations to 105% (e.g., 50%
client + 50% recovery + 5% best-effort under the Balanced profile). Under
heavy concurrent saturation, mClock's internal controls resolve this
gracefully via proportional down-scaling, preserving the underlying
device bandwidth limits for different classes of clients. For example, instead
of the client being allocated 50% bandwidth, a slightly lower reservation is
allocated while shifting the remaining bandwidth to the best-effort queue.
This minor scaling shift is virtually unnoticeable to the client application,
but it prevents the internal queue deadlocks.
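The arithmetic behind the 105% example can be sketched as follows. This is a simplified model of proportional down-scaling, not mClock's actual tag calculation; scale_reservations is a hypothetical helper and the percentages are the Balanced profile figures quoted above.

```python
from typing import Dict


def scale_reservations(reservations: Dict[str, float]) -> Dict[str, float]:
    """Scale reservations down proportionally when they exceed 100%."""
    total = sum(reservations.values())
    if total <= 100.0:
        return dict(reservations)
    factor = 100.0 / total
    return {name: pct * factor for name, pct in reservations.items()}


# Balanced profile figures from the note above (sum is 105%).
balanced = {
    "client": 50.0,
    "background_recovery": 50.0,
    "background_best_effort": 5.0,
}
scaled = scale_reservations(balanced)
for name, pct in scaled.items():
    print(f"{name}: {pct:.2f}%")
```

Under this model the client share drops from 50% to roughly 47.6%, and the best-effort class ends up with roughly 4.8%.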
mclock_common, mClockScheduler: Add perf counters for scheduler ops
Add perf counters to show the status pertaining to the number of ops,
dynamic queue lengths, queue latency and bytes read for the following
ops handled in the high queues and in the scheduler queues:
- peering
- client
- ec reads/writes
- ec recovery reads
Additional counters can be added in the future based on the requirement.
src/messages, osd: Calculate and set cost for subOpReads for mClock scheduler
Previously, sub-op reads returned a hardcoded cost of 0, bypassing
mClock's background bandwidth and tag calculation mechanisms. This
allowed backfill operations to proceed un-metered, occasionally causing
backend resource contention and driving up client tail latencies.
Cost is calculated based on whether the complete chunk/shard or a subchunk
needs to be read. The possible cases are:
1. Read the complete chunk aligned length:
- Cost is set to the length of the chunk aligned extent size.
2. Fragmented reads:
- Consider the subchunk length and count to calculate the cost.
- compute_cost evaluates the exact layout of fragmented shard bytes on
disk by summing up the active subchunk allocations exactly once
(`fragmented_shard_bytes += k.second * subchunk_size`).
- Linear Extent Scaling: Scale the baseline footprint cleanly by
multiplying it against the true count of read extents (`tl.size()`),
achieving a highly efficient O(N) time complexity.
This linear cost model is compatible with pools running with
'allow_ec_optimizations' set to true. Under the FastEC optimized
pipeline, most operations are unified and bypass fragment slicing,
meaning requests will primarily match the Case 1 chunk-aligned path.
In Case 2 where applicable, the O(N) loop ensures that cost will
scale proportionally according to the layout.
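A rough Python sketch of the two cost cases, written only from the description above. The real logic lives in C++ (ECSubRead::cost()). The function name, the argument shapes, and the test used here to decide whether a read covers the whole chunk are assumptions.

```python
from typing import List, Tuple


def sub_read_cost(extents: List[Tuple[int, int]],
                  subchunks: List[Tuple[int, int]],
                  subchunk_size: int,
                  subchunks_per_chunk: int) -> int:
    """Estimate the cost in bytes of one shard read.

    extents   -- (offset, length) pairs to read from the shard
    subchunks -- (first_subchunk, count) pairs selected within each chunk
    """
    # Assumption: a single entry covering every subchunk means "whole chunk".
    whole_chunk = subchunks == [(0, subchunks_per_chunk)]
    if whole_chunk:
        # Case 1: cost is the chunk-aligned length being read.
        return sum(length for _offset, length in extents)

    # Case 2: sum the selected subchunk bytes exactly once ...
    fragmented_shard_bytes = 0
    for _first, count in subchunks:
        fragmented_shard_bytes += count * subchunk_size
    # ... then scale linearly by the number of read extents.
    return fragmented_shard_bytes * len(extents)


full_cost = sub_read_cost([(0, 4096)], [(0, 4)], 1024, 4)
frag_cost = sub_read_cost([(0, 4096), (8192, 4096), (16384, 4096)],
                          [(0, 1), (2, 1)], 1024, 4)
```

In the fragmented call, two subchunks of 1024 bytes are selected and three extents are read, so the sketch yields 2 * 1024 * 3 bytes.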
It is important to note that the amount of data to read was set to an upper
bound defined by osd_recovery_max_chunk (8 MiB) and was rounded up to the
stripe width. The reason for setting a higher than actual upper bound is that
there may be cases where the object doesn't have the xattrs yet to determine
its size. Therefore, the amount to read was ultimately set to ~(8 MiB / k)
where k is the number of data shards. This can cause mClock to prolong
the recovery times as items stay longer in the queue. To address this, the
amount to read is set to the remaining length of the object to recover
if the object size is known. Otherwise, the amount to read is set to the
recovery chunk size as before. Therefore, in some cases, only the first
recovery read could be costly if the object context is not known.
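The choice of read length described above might look like the sketch below. It assumes a hypothetical helper, recovery_read_length, that returns the logical amount to read before it is divided across the k data shards. The stripe width of 12288 bytes is an arbitrary example value.

```python
from typing import Optional

MIB = 1024 * 1024


def recovery_read_length(recovered_so_far: int,
                         object_size: Optional[int],
                         max_chunk: int = 8 * MIB,
                         stripe_width: int = 4096) -> int:
    """Pick how many bytes the next recovery read should cover."""
    if object_size is not None:
        # Size is known: read only what is left of the object.
        return max(object_size - recovered_so_far, 0)
    # Size unknown: fall back to the recovery chunk size, rounded up
    # to a whole number of stripes.
    stripes = -(-max_chunk // stripe_width)  # ceiling division
    return stripes * stripe_width


known = recovery_read_length(0, 1 * MIB, stripe_width=12288)
unknown = recovery_read_length(0, None, stripe_width=12288)
```

For a 1 MiB object of known size, the sketch asks for 1 MiB rather than the full rounded-up 8 MiB bound, which is the cost reduction the paragraph describes.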
The MOSDECSubOpRead class introduces the following:
- cost member. This necessitates an increment to the HEAD_VERSION and
appropriate handling within the encode and decode methods.
- compute_cost() that is called when creating the message by
ECCommonL::ReadPipeline::do_read_op(). This calls into ECSubRead::cost()
that performs the actual calculations to set the cost based on the cases
mentioned above.
- The same sequence applies to the EC optimized path in
ECCommon::ReadPipeline::do_read_op().
osd/scheduler: Classify EC subOp reads according to op priority for mClock
The change brings MSG_OSD_EC_READ into the fold of mClock scheduler. This
improves the scheduling of client and other classes of operation as they
are no longer unnecessarily preempted by the 'immediate' queue.
EC SubOps are now handled as follows:
- EC SubOp reads generated during recovery will either go into the
'background_recovery' or 'background_best_effort' class based on
the recovery priority set for the op. EC SubOp reads generated due
to client will continue to be classified as 'immediate'.
- EC SubOp writes generated as a result of client operations will
continue to be classified as 'immediate'.
- EC SubOp replies are considered high priority and therefore
continue to be classed as 'immediate'.
Aashish Sharma [Thu, 11 Jun 2026 09:19:04 +0000 (14:49 +0530)]
mgr/dashboard: fix zone creation in rgw service creation form
The zone creation request from the rgw service creation form was missing
the tier_type, sync_from and sync_from_all properties. As a result, zone
creation was failing. This PR fixes the issue.
doc/mgr/ceph_secrets: add documentation for the ceph_secrets module
Document CLI commands (set/get/get-value/ls/rm), the Python API via
CephSecretsClient, secret URI embedding and resolution, and epoch-based
change detection.
Redouane Kachach [Thu, 11 Jun 2026 08:55:43 +0000 (10:55 +0200)]
mgr/ceph_secrets: add unit tests for all modules
Add pytest coverage for the full stack: secret types and URI/path
parsing, storage backend contract, Mon KV store (CRUD, epoch,
serialization, corruption handling), SecretMgr (scan/resolve),
module RPC surface and CLI handlers, and the CephSecretsClient
wrapper. Gate test imports on the UNITTEST env var following the
SMB module pattern.
Redouane Kachach [Mon, 26 Jan 2026 13:51:44 +0000 (14:51 +0100)]
mgr: add CephSecretsClient wrapper for ceph_secrets RPC
Add a thin typed client around mgr.remote() for consuming the
ceph_secrets module. Exposes get/set/rm, epoch and version queries,
batch version fetch, scan and resolve helpers. Lives alongside
ceph_secrets_types.py so any mgr module can import it without
depending on the ceph_secrets package directly.
Redouane Kachach [Mon, 26 Jan 2026 14:14:35 +0000 (15:14 +0100)]
mgr/ceph_secrets: add 'ceph secret' CLI commands and input parsing
This commit has the following changes:
1) Add the ceph_secrets mgr module entrypoint and wire it to
SecretMgr. Implement the core RPC surface consumed by other mgr
modules (secret_ls/get/set/rm, secret_get_value, secret_get_version)
and keep the implementation focused on the internal API.
2) Add user-facing CLI commands (ceph secret ls/get/set/rm) using
parse_secret_path. Secret data is accepted via -i (inbuf) only for
script-friendly usage. Add secret get-value for plain-string output
without a JSON envelope. Ensure consistent JSON output and error
mapping to EINVAL/ENOENT, while preserving safe non-reveal defaults
unless explicitly requested.
3) Add the scanning and resolution helpers (scan_refs,
scan_unresolved_refs, resolve_object) through the ceph_secrets module
RPC API. This lets consumers reliably detect secret:/... references and
resolve them inside nested objects without duplicating logic. The
behavior is delegated to SecretMgr to keep parsing/resolution
consistent across the stack.
Redouane Kachach [Mon, 26 Jan 2026 13:46:39 +0000 (14:46 +0100)]
mgr/ceph_secrets: add SecretMgr for secrets handling and resolution
Introduce SecretMgr to encapsulate higher-level behavior on top of
SecretStoreMon: listing helpers, scan_refs, scan_unresolved_refs, and
resolve_object (walk nested dict/list structures). This keeps
parsing/substitution logic out of the mgr module entrypoint and makes
consumer behavior consistent. The module can now resolve secret://…
references deterministically and provide structured scan output.
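The nested walk performed by scan_refs and resolve_object can be illustrated with a minimal sketch. The function names come from the commit message, but the bodies below are hypothetical: the reference prefix, the plain dict standing in for the secret store, and the KeyError on unresolved references are assumptions, not the module's actual behavior.

```python
from typing import Any, Dict, List

# Assumed prefix for illustration; the real URI grammar is not shown here.
SECRET_PREFIX = "secret:/"


def scan_refs(obj: Any) -> List[str]:
    """Collect every secret reference found in nested dicts and lists."""
    if isinstance(obj, str):
        return [obj] if obj.startswith(SECRET_PREFIX) else []
    if isinstance(obj, dict):
        return [ref for value in obj.values() for ref in scan_refs(value)]
    if isinstance(obj, list):
        return [ref for value in obj for ref in scan_refs(value)]
    return []


def resolve_object(obj: Any, secrets: Dict[str, str]) -> Any:
    """Return a copy of obj with each reference replaced by its value."""
    if isinstance(obj, str) and obj.startswith(SECRET_PREFIX):
        return secrets[obj]  # raises KeyError for an unresolved reference
    if isinstance(obj, dict):
        return {key: resolve_object(value, secrets) for key, value in obj.items()}
    if isinstance(obj, list):
        return [resolve_object(value, secrets) for value in obj]
    return obj


spec = {"user": "admin", "auth": {"password": "secret:/demo/db"}, "ports": [80]}
store = {"secret:/demo/db": "s3cr3t"}
refs = scan_refs(spec)
resolved = resolve_object(spec, store)
```

The walk returns a new structure and leaves the input untouched, so a caller can keep the unresolved spec for storage and use the resolved copy only at the point of use.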
Redouane Kachach [Mon, 23 Feb 2026 09:44:22 +0000 (10:44 +0100)]
mgr/ceph_secrets: add SecretStoreMon mon-kv store implementation
Introduce SecretRecord (data, metadata, versioning, timestamps) and
the canonical KV prefix (secret_store/v1/…). Add JSON serialization
helpers (to_json) including the ability to omit secret data unless
explicitly requested. This commit defines the “what we store and how
it looks” without wiring any mgr interactions yet.
Add SecretStoreMon implementing the backend using mgr’s KV
store (get_store, set_store, prefix listing). Implement set/get/rm
semantics, version increments, and list-by-prefix queries for
namespace/scope/target. This isolates persistence logic from CLI/RPC
concerns and provides deterministic record behavior for later layers.
Redouane Kachach [Mon, 26 Jan 2026 13:40:43 +0000 (14:40 +0100)]
mgr/ceph_secrets: add storage backend protocol for mgr KV secrets
Define a minimal backend protocol for secret persistence
operations (get/set/rm/list), keeping the module implementation
decoupled from the backing store details. For now we will start with
monstore-db as secure KV store but the idea is to extend this to other
backends such as Vault.
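A backend protocol of this shape could be expressed with typing.Protocol, as in the sketch below. The class names, method signatures and the in-memory implementation are illustrative assumptions. The version counter mirrors the version increments mentioned for SecretStoreMon.

```python
from typing import Dict, List, Optional, Protocol, Tuple


class SecretBackend(Protocol):
    """Hypothetical minimal contract for a secret store."""

    def get(self, key: str) -> Optional[Tuple[int, str]]: ...
    def set(self, key: str, data: str) -> int: ...
    def rm(self, key: str) -> bool: ...
    def list(self, prefix: str) -> List[str]: ...


class InMemoryBackend:
    """Toy backend that satisfies the protocol; not persistent."""

    def __init__(self) -> None:
        self._records: Dict[str, Tuple[int, str]] = {}

    def get(self, key: str) -> Optional[Tuple[int, str]]:
        return self._records.get(key)

    def set(self, key: str, data: str) -> int:
        # Bump the version on every write, starting at 1.
        version = self._records.get(key, (0, ""))[0] + 1
        self._records[key] = (version, data)
        return version

    def rm(self, key: str) -> bool:
        return self._records.pop(key, None) is not None

    def list(self, prefix: str) -> List[str]:
        return sorted(k for k in self._records if k.startswith(prefix))


backend: SecretBackend = InMemoryBackend()
first = backend.set("ns/a", "one")
second = backend.set("ns/a", "two")
backend.set("other/b", "three")
```

Because the protocol is structural, a different store (for example one backed by Vault) would only need to provide the same four methods.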
Redouane Kachach [Mon, 23 Feb 2026 09:39:49 +0000 (10:39 +0100)]
mgr/ceph_secrets: add secret reference types and parsing helpers
Introduce the shared types and parsing logic used across the secrets
module: secret scopes, secret references, and the exception hierarchy.
Includes validation for all supported addressing forms and clear
error messages on malformed input.
Kefu Chai [Thu, 11 Jun 2026 03:32:24 +0000 (11:32 +0800)]
ceph.spec.in: require update-alternatives for the osd scriptlets
ceph-osd-crimson and ceph-osd-classic call update-alternatives in their
%posttrans and %preun scriptlets but don't depend on it. declare it as a
scriptlet dependency so the binary is there when they run.
Kefu Chai [Thu, 11 Jun 2026 03:29:05 +0000 (11:29 +0800)]
ceph.spec.in: use %{_sbindir} instead of ${_sbindir} in osd %preun
a37b5b5bde8c added %preun scriptlets that use ${_sbindir}, which is
shell syntax rather than an rpm macro, so it expands to empty at run
time and the scriptlet runs "/update-alternatives", failing on
uninstall/upgrade with:
/var/tmp/rpm-tmp.K1fvm3: line 2: /update-alternatives: No such file or directory
error: %preun(ceph-osd-crimson-2:20.3.0-5054.g33c1d671.el9.x86_64) scriptlet failed, exit status 127
Error in PREUN scriptlet in rpm package ceph-osd-crimson.
use %{_sbindir}, like the %posttrans --install lines already do, so it
expands to /usr/sbin/update-alternatives at build time.
crimson/osd: reject Seastore PG merges across shards
Seastore cannot merge collections between reactor shards currently.
On cross-shard detection, tell the monitor the source PG is not ready
(via MOSDPGReadyToMerge{ ready=false }) so the unsafe pg_num decrement
is never proposed, then send MOSDPGStopMerge to clamp pg_num_target and
permanently disable further shrink for the pool.
crimson/osd: implement PG merge detection and orchestration in PGAdvanceMap
Integrate PG merge handling into the map advancement pipeline.
When pg_num shrinks between epochs, check_for_merges() returns a
merge_result_t describing whether this PG is a merge source, target, or
not involved. start() stops advancing through later epochs once a merge
is detected, then either finish_merge_advance() or the normal activate
path runs so complete_rctx() always happens in one place.
- check_for_merges(): detect pg_num shrink and dispatch merge_pg().
- merge_pg(): merge-only work — Seastore eligibility, source handoff
setup, target rendezvous collection and PG::merge_from().
- finish_merge_advance(): commit rctx and complete the role-specific
steps (source: complete_rctx, stop, register_merge_source; target:
handle_advance_map, handle_activate_map, complete_rctx).
Add PG::merge_from to execute the merge of source PGs into a target PG.
This function builds a transaction to remove source-specific metadata
objects and merge source collections into the target collection.
crimson/osd: per-PG rendezvous for cross-shard merge source handoff
Add infrastructure so source PGs can be extracted from their birth
shard, moved to the target shard, and collected by the target PG before
merge proceeds.
Cross-shard safety: PGs are tied to their birth_shard for destruction.
register_merge_source() uses extract_pg() to detach the source,
seastar::foreign_ptr to hop cores, and
crimson::local_shared_foreign_ptr on the target so release routes
destruction back to the birth shard.
Synchronization: replace the per-shard ShardServices merge_info_t
registry (shared_promise waiters, ready_pgs staging, and cleanup hooks)
with merge state on the target PG itself. Source-side
register_merge_source() delivers PGs via PG::add_merge_source(); the
target waits in PG::collect_merge_sources(n) on a per-PG semaphore.
Duplicate source registrations are ignored. PG::stop() breaks the
semaphore so shutdown does not hang.
ShardServices::register_merge_source() and extract_pg() live in
shard_services; rendezvous types and methods live on PG.
This function ensures that when a PG that is being removed or merged
calls stop(), it clears primary state and notifies the Monitor to clear
any pending merge flags.
It will also call client_request_orderer.clear_and_cancel() ensuring
all remaining client requests are properly completed. This is needed
for merging in particular since on_change() is never called for the merge epoch
(handle_advance_map is skipped after merge detection), so
clear_and_cancel() is never invoked on the source PG's orderer.
crimson/osd/shard_services: inherit from peering_sharded_service
Update ShardServices to inherit from seastar::peering_sharded_service.
This allows the service to access its own sharded container directly
via container() rather than manually storing a reference to it.
crimson/osd: Add functions to notify mon when PGs are ready to merge
When a PG is in the pending merge state it is >= pg_num_pending and <
pg_num. When this happens, IO is paused and once the PG peers we notify
the mon that we are idle and safe to merge.
Use Gated for merge notify callbacks.
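The pending-merge range check stated above can be written as a one-line predicate. This sketch assumes the value being compared is the PG's placement seed; the function and parameter names are hypothetical.

```python
def is_pending_merge_source(pg_seed: int, pg_num_pending: int, pg_num: int) -> bool:
    """True when the PG falls in the range [pg_num_pending, pg_num)."""
    return pg_num_pending <= pg_seed < pg_num


# Example: the pool is shrinking from 8 PGs to 6.
pending = [seed for seed in range(8) if is_pending_merge_source(seed, 6, 8)]
```

In this example only seeds 6 and 7 fall in the pending range.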
Patrick Donnelly [Wed, 10 Jun 2026 18:30:59 +0000 (14:30 -0400)]
Merge PR #66726 into main
* refs/pull/66726/head:
doc: Update documentation to reflect new functionality
test: Add integration tests for EC Omap operations and recovery
osd: Hook up omap operations in EC pools
osd: Allow for recovery of OMAP header and entries in EC pools
doc: Write design document to explain the reasoning behind implementing this feature
osd: Introduce functions required for EC OMAP support
osd: Add ECOmapJournal class and relocate OmapUpdateType enum class
Reviewed-by: Bill Scales <bill_scales@uk.ibm.com>
Reviewed-by: Alex Ainscow <aainscow@uk.ibm.com>
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
ceph-volume: fix raw activate when device path is stale
This changes unlink_bs_symlinks to use os.path.lexists instead
of os.path.exists. Devices can get renumbered. When that happens,
the OSD symlink still exists but its target device is gone, so
os.path.exists returns False, the symlink is never cleaned up,
and ceph-volume activate can fail later.
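The difference between the two calls is easy to reproduce with a dangling symlink in a temporary directory. The sketch below is a standalone illustration, not the ceph-volume code; unlink_if_present and the file names are hypothetical.

```python
import os
import tempfile


def unlink_if_present(path: str) -> bool:
    """Remove path if the link itself exists, even when its target is gone."""
    if os.path.lexists(path):
        os.unlink(path)
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "sdb")   # stands in for the device node
    link = os.path.join(tmp, "block")   # stands in for the OSD symlink
    open(target, "w").close()
    os.symlink(target, link)
    os.remove(target)                   # the "device" disappears

    seen_by_exists = os.path.exists(link)    # follows the link
    seen_by_lexists = os.path.lexists(link)  # checks the link itself
    removed = unlink_if_present(link)
    still_there = os.path.lexists(link)
```

os.path.exists follows the symlink and reports the missing target, while os.path.lexists reports on the link itself, which is why only the latter lets the stale link be cleaned up.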
Ronen Friedman [Wed, 3 Jun 2026 05:40:25 +0000 (05:40 +0000)]
crimson/osd: move get_max_object_size() to store level
is_offset_and_length_valid() called get_sharded_store() locally to
obtain the store-specific max_object_size. On alien cores (where
smp::count > store_shard_nums), the local store is inactive and the
call hits assert(shard_store.get_status() == true).
As the max object size is a store-specific property and not a
store-shard one, there is no reason to acquire the store shard
to obtain it. Instead, a get_max_object_size() method is added
to the Store interface.
Sun Yuechi [Wed, 10 Jun 2026 00:13:53 +0000 (08:13 +0800)]
cmake: disable Catch2 tests when Catch2 is unavailable
debhelper on noble passes -DFETCHCONTENT_FULLY_DISCONNECTED=ON, so CPM
cannot fetch Catch2 and silently skips it, leaving no Catch2 targets
behind and breaking the generate step. Fall back to WITH_CATCH2=OFF
with a warning instead.
dheart [Tue, 9 Jun 2026 13:27:14 +0000 (21:27 +0800)]
os/bluestore: prevent reallocation and corruption when shared_blob key is missing/undecodable
When the shared_blob key is missing or fails to decode,
it is necessary to scan the blob's pextents directly as the sole authoritative source
to verify allocated blocks and prevent double-allocation.
Emmanuel Ameh [Tue, 9 Jun 2026 12:40:03 +0000 (13:40 +0100)]
doc/man: Remove stale EOL release names from deprecation notices
ceph.rst: "osd create" deprecation notice cited "the Luminous release"
(2017, EOL 2020). Update to a plain deprecation statement directing
users to the replacement command (osd new).
rbd.rst: cephx_require_signatures option deprecation cited "the Bobtail
release" (2013, EOL 2015) as context for why the option is deprecated.
Remove the EOL release name; retain the deprecation warning. Fix the
companion nocephx_require_signatures notice for consistency ("in a
future release" instead of "in the future").