git-server-git.apps.pok.os.sepia.ceph.com Git - ceph.git/log
2 weeks ago mr/dashboard: remove rgw_servers filter from radosgw-sync-overview grafana dashboard 68604/head
Aashish Sharma [Thu, 23 Apr 2026 16:17:41 +0000 (21:47 +0530)]
mr/dashboard: remove rgw_servers filter from radosgw-sync-overview grafana dashboard

Fixes: https://tracker.ceph.com/issues/76239
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
(cherry picked from commit b0601df2a55e3fc56370687b27693779ad356be5)

2 weeks ago Merge PR #68316 into tentacle
Patrick Donnelly [Thu, 23 Apr 2026 20:33:33 +0000 (16:33 -0400)]
Merge PR #68316 into tentacle

* refs/pull/68316/head:
qa: allow multiple mgr sessions during eviction test

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
2 weeks ago Merge PR #68315 into tentacle
Patrick Donnelly [Thu, 23 Apr 2026 20:33:04 +0000 (16:33 -0400)]
Merge PR #68315 into tentacle

* refs/pull/68315/head:
qa: resolve py3.12 regression for random.sample

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
2 weeks ago Merge PR #68314 into tentacle
Patrick Donnelly [Thu, 23 Apr 2026 20:32:44 +0000 (16:32 -0400)]
Merge PR #68314 into tentacle

* refs/pull/68314/head:
qa/cephfs: do not validate error string in "fs authorize" tests
mon/AuthMonitor: add osd w cap for superuser client

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
2 weeks ago Merge PR #68313 into tentacle
Patrick Donnelly [Thu, 23 Apr 2026 20:31:57 +0000 (16:31 -0400)]
Merge PR #68313 into tentacle

* refs/pull/68313/head:
mds: use SimpleLock::WAIT_ALL for wait mask

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
2 weeks ago Merge PR #67808 into tentacle
Patrick Donnelly [Thu, 23 Apr 2026 20:30:16 +0000 (16:30 -0400)]
Merge PR #67808 into tentacle

* refs/pull/67808/head:
mds: scrub pins more inodes than the mds_cache_memory_limit

Reviewed-by: Kotresh Hiremath Ravishankar <khiremat@redhat.com>
Reviewed-by: Igor Fedotov <ifedotov@suse.com>
2 weeks ago Merge PR #68169 into tentacle
Patrick Donnelly [Thu, 23 Apr 2026 20:29:37 +0000 (16:29 -0400)]
Merge PR #68169 into tentacle

* refs/pull/68169/head:
qa/workunits: Add updated kernel archive URL

Reviewed-by: Kotresh Hiremath Ravishankar <khiremat@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
2 weeks ago Merge PR #68439 into tentacle
Patrick Donnelly [Thu, 23 Apr 2026 20:29:12 +0000 (16:29 -0400)]
Merge PR #68439 into tentacle

* refs/pull/68439/head:
mds: add ref counting to LogSegment

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
2 weeks ago Merge PR #67341 into tentacle
Patrick Donnelly [Thu, 23 Apr 2026 19:38:57 +0000 (15:38 -0400)]
Merge PR #67341 into tentacle

* refs/pull/67341/head:
ceph-volume: skip redundant NVMe mkfs discards

Reviewed-by: Teoman Onay <tonay@ibm.com>
2 weeks ago ceph-volume: skip redundant NVMe mkfs discards 67341/head
Ujjawal Anand [Fri, 6 Feb 2026 11:26:58 +0000 (16:56 +0530)]
ceph-volume: skip redundant NVMe mkfs discards

- Avoid redundant discard during mkfs when discard is disabled
- Reduces mkfs time on large NVMe devices by skipping long running discard operations

Fixes: https://tracker.ceph.com/issues/74908
Signed-off-by: Ujjawal Anand <ujjawal.anand@ibm.com>
(cherry picked from commit daebcfb8944789a47da20ffec3e03a5b2737f711)

2 weeks ago Merge PR #68379 into tentacle
Patrick Donnelly [Wed, 22 Apr 2026 17:29:45 +0000 (13:29 -0400)]
Merge PR #68379 into tentacle

* refs/pull/68379/head:
cephadm: wait for latest osd map after ceph-volume before OSD deploy

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
2 weeks ago Merge PR #68286 into tentacle
Patrick Donnelly [Wed, 22 Apr 2026 17:28:04 +0000 (13:28 -0400)]
Merge PR #68286 into tentacle

* refs/pull/68286/head:
ceph-volume: skip mkfs discard for LVM NVMe OSDs

Reviewed-by: Teoman Onay <tonay@ibm.com>
2 weeks ago Merge PR #68121 into tentacle
Patrick Donnelly [Wed, 22 Apr 2026 17:27:25 +0000 (13:27 -0400)]
Merge PR #68121 into tentacle

* refs/pull/68121/head:
orch/cephadm: fix osd.default creation

Reviewed-by: Teoman Onay <tonay@ibm.com>
2 weeks ago Merge PR #68108 into tentacle
Patrick Donnelly [Wed, 22 Apr 2026 17:24:53 +0000 (13:24 -0400)]
Merge PR #68108 into tentacle

* refs/pull/68108/head:
ceph-volume: skip virtual cdrom devices in inventory

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
2 weeks ago Merge PR #67989 into tentacle
Patrick Donnelly [Wed, 22 Apr 2026 17:24:27 +0000 (13:24 -0400)]
Merge PR #67989 into tentacle

* refs/pull/67989/head:
ceph-volume: include LVM mapper devices in get_devices()

Reviewed-by: Adam King <adking@redhat.com>
2 weeks ago Merge PR #67916 into tentacle
Patrick Donnelly [Wed, 22 Apr 2026 17:23:51 +0000 (13:23 -0400)]
Merge PR #67916 into tentacle

* refs/pull/67916/head:
src/ceph-volume: fast device unavailable as error

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
2 weeks ago Merge PR #67418 into tentacle
Patrick Donnelly [Wed, 22 Apr 2026 17:22:51 +0000 (13:22 -0400)]
Merge PR #67418 into tentacle

* refs/pull/67418/head:
mgr/cephadm: validate hostname in NodeProxyCache
node-proxy: improve HTTP error logging in client
node-proxy: get serial number instead of SKU
node-proxy: allow multiple sources per component
node-proxy: re-auth and retry once on 401
node-proxy: fix flake8 E721 in _dict_diff
node-proxy: make the update loop interval configurable
mgr/node-proxy: fix "ceph orch hardware status --category criticals"
node-proxy: normalize storage data per member
node-proxy: encapsulate send logic in dedicated method
node-proxy: log actual data delta in reporter
node-proxy: add periodic heartbeats in main and reporter loops
node-proxy: adjust log levels
node-proxy: add unit tests
node-proxy: add tox config for mypy, flake8, isort, black
node-proxy: black and isort formatting pass
node-proxy: fix mypy errors
node-proxy: handle nested Redfish paths for components
node-proxy: split out config, bootstrap and redfish logic
node-proxy: refactor config loading
node-proxy: add 'vendor based' redfish system selection
node-proxy: introduce component spec registry and overrides for updates
mgr/cephadm: safe status/health access in node-proxy agent and inventory
node-proxy: narrow build_data exception handling and re-raise
node-proxy: refactor Endpoint/EndpointMgr and fix chassis paths
node-proxy: use safe field access in storage update
node-proxy: reduce log verbosity for missing optional fields

Reviewed-by: Teoman Onay <tonay@ibm.com>
2 weeks ago Merge PR #67343 into tentacle
Patrick Donnelly [Wed, 22 Apr 2026 17:21:13 +0000 (13:21 -0400)]
Merge PR #67343 into tentacle

* refs/pull/67343/head:
ceph-volume: fix test_reject_readonly_device unit test
ceph-volume: single lvs call to speed up exclude_lvm_osd_devices
ceph-volume: avoid Device() instantiation in lvm OSD filtering
ceph-volume: avoid RuntimeError on ceph-volume raw list with non-existent loop devices

Reviewed-by: Teoman Onay <tonay@ibm.com>
2 weeks ago Merge PR #68344 into tentacle
Patrick Donnelly [Wed, 22 Apr 2026 17:04:32 +0000 (13:04 -0400)]
Merge PR #68344 into tentacle

* refs/pull/68344/head:
test/rgw: remove depracated boto dependency

Reviewed-by: Casey Bodley <cbodley@redhat.com>
2 weeks ago Merge PR #67850 into tentacle
Patrick Donnelly [Wed, 22 Apr 2026 14:38:38 +0000 (10:38 -0400)]
Merge PR #67850 into tentacle

* refs/pull/67850/head:
mgr, qa: clarify module checks in DaemonServer
mgr, qa: add `pending_modules` to asock command
mgr, common, qa, doc: issue health error after max expiration is exceeded
mgr: ensure that all modules have started before advertising active mgr

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
2 weeks ago mgr, qa: clarify module checks in DaemonServer 67850/head
Laura Flores [Fri, 12 Sep 2025 20:14:30 +0000 (20:14 +0000)]
mgr, qa: clarify module checks in DaemonServer

The current check groups modules not being
enabled with failing to initialize. In this commit,
we reorder the checks:

1. Screen for a module being enabled. If it's not,
   issue an EOPNOTSUPP with instructions on how
   to enable it.

2. Screen for whether a module is active. If a module
   is enabled, then the cluster expects it to
   be active to support commands. If the module
   took too long to initialize though, we will
   catch this and issue an ETIMEDOUT error with
   a link for troubleshooting.

Now, these two separate issues are not grouped
together, and they are checked in the right order.
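
The reordered checks can be sketched in Python as follows. This is only an
illustration of the ordering described above, not the DaemonServer code: the
function name `check_module_command`, the two sets, and the message text are
assumptions made for the example.

```python
import errno


def check_module_command(module, enabled, active):
    """Return (retcode, message) for a command that requires `module`.

    `enabled` and `active` are hypothetical sets of module names.
    """
    # Check 1: the module is not enabled at all.
    if module not in enabled:
        return (-errno.EOPNOTSUPP,
                f"Module '{module}' is not enabled: "
                f"use `ceph mgr module enable {module}` to enable it")
    # Check 2: the module is enabled but has not become active in time.
    if module not in active:
        return (-errno.ETIMEDOUT,
                f"Module '{module}' is enabled but did not finish initializing")
    # Both checks passed: the command can be dispatched to the module.
    return (0, "")
```

Checking "enabled" first means an operator who never enabled the module is
told how to enable it, while a slow module produces a distinct timeout error.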

Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
(cherry picked from commit fdc072f15da7ec4c918a1ebff439f6ce4922f33f)

2 weeks ago mgr, qa: add `pending_modules` to asock command
Laura Flores [Thu, 11 Sep 2025 22:13:51 +0000 (22:13 +0000)]
mgr, qa: add `pending_modules` to asock command

Now, the command `ceph tell mgr mgr_status` will show a
"pending_modules" field. This is another way for Ceph operators
to check which modules haven't been initialized yet (in addition
to the health error).

This command was also added to testing scenarios in the workunit.
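
As a rough sketch of how an operator script might read that field, the
snippet below parses a JSON document and lists the pending modules. The
sample document and the assumption that "pending_modules" is a JSON list of
module names are hypothetical; the real output of `ceph tell mgr mgr_status`
may be shaped differently.

```python
import json

# Hypothetical sample; the real command output may contain other fields.
SAMPLE_STATUS = '{"pending_modules": ["balancer", "telemetry"]}'


def pending_modules(mgr_status_json):
    """Return the sorted names of modules that are still pending."""
    status = json.loads(mgr_status_json)
    # Treat a missing field as "nothing pending".
    return sorted(status.get("pending_modules", []))


if __name__ == "__main__":
    print(pending_modules(SAMPLE_STATUS))
```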

Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
(cherry picked from commit 68221661b00f8a6bef0fbd7b5401aa49eb5118d0)

2 weeks ago mgr, common, qa, doc: issue health error after max expiration is exceeded
Laura Flores [Tue, 29 Jul 2025 22:46:46 +0000 (22:46 +0000)]
mgr, common, qa, doc: issue health error after max expiration is exceeded

----------------- Enhancement to the Original Fix -----------------

During a mgr failover, the active mgr is marked available if:
  1. The mon has chosen a standby to be active
  2. The chosen active mgr has all of its modules initialized

Now that we've improved the criteria for sending the "active" beacon
by forcing it to retry initializing mgr modules, we need to account
for extreme cases in which the modules are stuck loading for a very long
time, or even indefinitely. In these extreme cases where the modules might
never initialize, we don't want to delay sending the "active" beacon for
too long. This can result in blocking other important mgr functionality,
such as reporting PG availability in the health status. We want
to avoid sending warnings about PGs being unknown in the health status when
that's not ultimately the problem.

To account for an exceptionally long module loading time, I added a new
configurable `mgr_module_load_expiration`. If we exceed this maximum amount
of time (in ms) allotted for the active mgr to load the mgr modules before declaring
availability, the standby will then proceed to mark itself "available" and
send the "active" beacon to the mon and unblock other critical mgr functionality.

If this happens, a health error will be issued at this time, indicating
which mgr modules got stuck initializing (See src/mgr/PyModuleRegistry.cc). The
idea is to unblock the rest of the mgr's critical functionality while making it
clear to Ceph operators that some modules are unusable.
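
The decision described above can be sketched as a small pure function. This
is a simplified model, not the mgr implementation: `decide_availability` and
its parameters are hypothetical names, and `expiration_ms` stands in for the
value of `mgr_module_load_expiration`.

```python
def decide_availability(pending, elapsed_ms, expiration_ms):
    """Return (available, stuck_modules).

    pending       -- names of modules that have not finished initializing
    elapsed_ms    -- time spent waiting for modules so far
    expiration_ms -- maximum time to wait before declaring availability
    """
    if not pending:
        # Everything initialized: available, nothing to report.
        return True, []
    if elapsed_ms > expiration_ms:
        # Waited too long: become available anyway and report the
        # stuck modules so a health error can name them.
        return True, sorted(pending)
    # Still inside the allowed window: keep waiting.
    return False, []
```

Returning the stuck modules alongside the decision keeps the two outcomes
(unblock the mgr, warn the operator) tied to the same check.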

--------------------- Integration Testing --------------------

The workunit was rewritten so it tests for these scenarios:

1. Normal module loading behavior (no health error should be issued)
2. Acceptable delay in module loading behavior (no health error should be
   issued)
3. Unacceptable delay in module loading behavior (a health error should be
   issued)

--------------------- Documentation --------------------

This documentation explains the "Module failed to initialize"
cluster error.

Users are advised to try failing over
the mgr to reboot the module initialization process,
then if the error persists, file a bug report. I decided
to write it this way instead of providing more complex
debugging tips such as advising to disable some mgr modules
since every case will be different depending on which modules
failed to initialize.

In the bug report, developers can ask for the health detail
output to narrow down which module is causing a bottleneck,
and then ask the user to try disabling certain modules until
the mgr is able to fully initialize.

Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
(cherry picked from commit bf25a08cc58c8e872806e92dd36d9ff91e5523d9)

2 weeks ago mgr: ensure that all modules have started before advertising active mgr
Laura Flores [Fri, 25 Apr 2025 22:11:19 +0000 (22:11 +0000)]
mgr: ensure that all modules have started before advertising active mgr

----------------- Explanation of Problem ----------------

When the mgr is restarted or failed over via `ceph mgr fail` or during an
upgrade, mgr modules sometimes take longer to start up (this includes
loading their class, commands, and module options, and being removed
from the `pending_modules` map structure). This startup delay can happen
due to a cluster's specific hardware or if a code bottleneck is triggered in
a module’s `serve()` function (each mgr module has a `serve()` function that
performs initialization tasks right when the module is loaded).

When this startup delay occurs, any mgr module command issued against the
cluster around the same time fails with an error saying that the command is not
supported:
```
$ ceph mgr fail; ceph fs volume ls
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'volumes' is not enabled/loaded (required by command 'fs volume ls'): use `ceph mgr module enable volumes` to enable it
```

We should try to lighten any bottlenecks in the mgr module `serve()`
functions wherever possible, but the root cause of this failure is that the
mgr sends a beacon to the mon too early, indicating that it is active before
the module loading has completed. Specifically, some of the mgr modules
have loaded their class but have not yet been deleted from the `pending_modules`
structure, indicating that they have not finished starting up.

--------------------- Explanation of Fix  --------------------

This commit improves the criteria for sending the “active” beacon to the mon so
the mgr does not signal that it’s active too early. We do this through the following additions:

1. A new context `ActivePyModules::recheck_modules_start` that will be set if not all modules
   have finished startup.

2. A new function `ActivePyModules::check_all_modules_started()` that checks if modules are
   still pending startup; if all have started up (`pending_modules` is empty), then we send
   the beacon right away. But if some are still pending, we pass the beacon task on to the new
   recheck context `ActivePyModules::recheck_modules_start` so we know to send the beacon later.

3. Logic in ActivePyModules::start_one() that only gets triggered if the modules did not all finish
   startup the first time we checked. We know this is the case if the new recheck context
   `recheck_modules_start` was set from `nullptr`. The beacon is only sent once `pending_modules` is
   confirmed to be empty, which means that all the modules have started up and are ready to support commands.

4. Adjustment of when the booleans `initializing` and `initialized` are set. These booleans come into play in
   MgrStandby::send_beacon() when we check that the active mgr has been initialized (thus, it is available).
   We only send the beacon when this boolean is set. Currently, we set these booleans at the end of Mgr::init(),
   which means that it gets set early before `pending_modules` is clear. With this adjustment, the bools are set
   only after we check that all modules have started up. The send_beacon code is triggered on mgr failover AND on
   every Mgr::tick(), which occurs by default every two seconds. If we don’t adjust when these bools are set, we
   only fix the mgr failover part, but the mgr still sends the beacon too early via Mgr::tick(). Below is the relevant
   code from MgrStandby::send_beacon(), which is triggered in Mgr::background_init() AND in Mgr::tick():
```
  // Whether I think I am available (request MgrMonitor to set me
  // as available in the map)
  bool available = active_mgr != nullptr && active_mgr->is_initialized();

  auto addrs = available ? active_mgr->get_server_addrs() : entity_addrvec_t();
  dout(10) << "sending beacon as gid " << monc.get_global_id() << dendl;

```
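
The interplay of items 1 to 4 can be modeled in a few lines of Python. This
is a simplified sketch and not the C++ code: the class `ModuleStartTracker`
and the `send_beacon` callback are hypothetical, while the member names
mirror `pending_modules`, `recheck_modules_start`,
`check_all_modules_started()` and `start_one()` from the description above.

```python
class ModuleStartTracker:
    """Toy model of deferring the "active" beacon until modules start."""

    def __init__(self, modules, send_beacon):
        self.pending_modules = set(modules)
        self.send_beacon = send_beacon
        self.recheck_modules_start = None  # set only if we must recheck
        self.initialized = False

    def check_all_modules_started(self):
        if not self.pending_modules:
            # Nothing pending: mark initialized and send the beacon now.
            self._finish()
        else:
            # Hand the beacon task to the recheck context for later.
            self.recheck_modules_start = self._finish

    def start_one(self, name):
        # A module finished startup, so it leaves the pending set.
        self.pending_modules.discard(name)
        # Fire the deferred task only if a recheck was requested and
        # the pending set is now empty.
        if self.recheck_modules_start is not None and not self.pending_modules:
            task, self.recheck_modules_start = self.recheck_modules_start, None
            task()

    def _finish(self):
        self.initialized = True
        self.send_beacon()
```

In this model a periodic tick that sends a beacon would report "available"
only once `initialized` is true, which is why the flag is set in the same
step that confirms `pending_modules` is empty.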

--------------------- Reproducing the Bug ----------------------

At face value, this issue reproduces only non-deterministically, since it
can depend on environmental factors or specific cluster workloads.
However, I was able to deterministically reproduce it by injecting a
bottleneck into the balancer module:
```
diff --git a/src/pybind/mgr/balancer/module.py b/src/pybind/mgr/balancer/module.py
index d12d69f..91c83fa8023 100644
--- a/src/pybind/mgr/balancer/module.py
+++ b/src/pybind/mgr/balancer/module.py
@@ -772,10 +772,10 @@ class Module(MgrModule):
                     self.update_pg_upmap_activity(plan)  # update pg activity in `balancer status detail`
                 self.optimizing = False
+                # causing a bottleneck
+                for i in range(0, 1000):
+                    for j in range (0, 1000):
+                        x = i + j
+                        self.log.debug("hitting the bottleneck in the balancer module")
             self.log.debug('Sleeping for %d', sleep_interval)
             self.event.wait(sleep_interval)
             self.event.clear()
```

Then, the error reproduces every time by running:
```
$ ./bin/ceph mgr fail; ./bin/ceph telemetry show
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'telemetry' is not enabled/loaded (required by command 'telemetry show'): use `ceph mgr module enable telemetry` to enable it
```

With this fix, the active mgr is marked as "initialized" only after all
the modules have started up, and this error goes away. The command may
take a bit longer to execute depending on the extent of the delay.

---------------------- Integration Testing ---------------------

This commit adds a dev-only config that can inject a longer
loading time into the mgr module loading sequence so we can
simulate this scenario in a test.

The config is 0 ms by default since we do not add any delay
outside of testing scenarios. The config can be adjusted
with the following command:
  `ceph config set mgr mgr_module_load_delay <ms>`

A second dev-only config also allows you to specify which
module you want to be delayed in loading time. You may change
this with the following command:
  `ceph config set mgr mgr_module_load_delay_name <module name>`

The workunit added here tests a simulated slow loading module
scenario to ensure that this case is properly handled.

--------------------- Documentation --------------------

The new documentation describes the three existing mgr states so Ceph
operators can better interpret their Ceph status output.

Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
(cherry picked from commit cbd1726fde08692848b5c0e42c06e4f089ebcd5c)

3 weeks ago Merge PR #67511 into tentacle
Patrick Donnelly [Fri, 17 Apr 2026 21:47:14 +0000 (17:47 -0400)]
Merge PR #67511 into tentacle

* refs/pull/67511/head:
doc/_ext: fix ceph_commands.py for new decorator-based command system
pybind/mgr: add per-module CLICommand instances to remaining modules
pybind/mgr/dashboard: create DBCLICommand, use throughout
pybind/mgr/tests/test_object_format: update DecoDemo to use fresh CLICommand
pybind/mgr/smb: adapt SMBCommand to use CLICommandBase
pybind/orchestrator,cephadm: replace CLICommandMeta
pybind/mgr: mechanically fix simple users to not import CLI*Command
pybind/mgr/mgr_module: support per-module CLICommand instances and globals
pybind/.../dashboard: misc automatic linter fixes

Reviewed-by: John Mulligan <jmulligan@redhat.com>
3 weeks ago doc/_ext: fix ceph_commands.py for new decorator-based command system 67511/head
Kefu Chai [Sun, 8 Feb 2026 12:34:15 +0000 (20:34 +0800)]
doc/_ext: fix ceph_commands.py for new decorator-based command system

After commit 4aa9e246f, mgr modules migrated from using a class-level
COMMANDS list to decorator-based command registration using per-module
CLICommand instances (e.g., @BalancerCLICommand.Read('balancer status')).

This broke the ceph_commands.py Sphinx extension which was hardcoded to
expect m.COMMANDS to be a list, causing documentation builds to fail.

But not all modules are using this per-module CLICommand. Some modules are
fully migrated (balancer, hello, etc.) and use decorators, while others
are partially migrated (volumes, progress, stats, influx, k8sevents,
osd_perf_query, osd_support) - they have CLICommand defined but still
use the old COMMANDS list.

This fix updates _collect_module_commands() to handle three scenarios:

1. Fully migrated modules: Check CLICommand.dump_cmd_list() and use it
   if it returns commands
2. Partially migrated modules: Fall back to the old COMMANDS list if
   dump_cmd_list() returns empty
3. Legacy modules: Use COMMANDS list if CLICommand doesn't exist

This ensures the Sphinx extension works with modules in any migration
state, maintaining backwards compatibility while supporting the new
decorator pattern.
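
The three-way fallback can be sketched as below. This is an illustration
under assumptions, not the code in ceph_commands.py: the helper name
`collect_commands` is hypothetical, and the only interface taken from the
description is that a module may expose `CLICommand.dump_cmd_list()` and/or
a `COMMANDS` list.

```python
def collect_commands(module):
    """Return the command list for a module in any migration state."""
    cli = getattr(module, "CLICommand", None)
    if cli is not None:
        cmds = cli.dump_cmd_list()
        if cmds:
            # 1. Fully migrated: the decorator registry has the commands.
            return list(cmds)
    # 2. Partially migrated (registry exists but is empty), or
    # 3. legacy (no registry at all): fall back to the COMMANDS list.
    return list(getattr(module, "COMMANDS", []))
```

Because cases 2 and 3 share the same fallback, a module that later finishes
its migration needs no change here.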

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
(cherry picked from commit 77efb41aec4a3ccece0bbca94e7c5b3fea154298)

3 weeks ago mds: add ref counting to LogSegment 68439/head
Milind Changire [Tue, 10 Jun 2025 14:40:30 +0000 (20:10 +0530)]
mds: add ref counting to LogSegment

Fixes: https://tracker.ceph.com/issues/70723
Signed-off-by: Milind Changire <mchangir@redhat.com>
(cherry picked from commit fe3b4159b82f65ef3ba7df1b123f126d50954a78)

3 weeks ago pybind/mgr: add per-module CLICommand instances to remaining modules
Samuel Just [Wed, 26 Nov 2025 22:51:54 +0000 (22:51 +0000)]
pybind/mgr: add per-module CLICommand instances to remaining modules

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit 4aa9e246f05663ec334f67d8e7f1ce817c1cbf2d)

Conflicts:
  - src/pybind/mgr/prometheus/module.py
      CLIReadCommand/CLIWriteCommand removed from imports, not yet in tentacle - kept tentacle version
  - src/pybind/mgr/orchestrator/module.py
      _cert_store_key_ls, _cert_store_cert_ls - signature changed in main, kept tentacle version
      _cert_store_entity_ls - function renamed in main, kept tentacle name
      _cert_store_entity_ls - function name changed in main but not in tentacle
  - src/pybind/mgr/smb/cli.py
      typing imports differ between main and tentacle - manually reconciled
      error_wrapper not backported to tentacle - removed from cherry-pick
  - src/pybind/mgr/mirroring/module.py
      import differs between main and tentacle - manually kept tentacle imports
  - src/pybind/mgr/pg_autoscaler/module.py
      get_scaling_threshold not backported to tentacle - removed that change

3 weeks ago pybind/mgr/dashboard: create DBCLICommand, use throughout
Samuel Just [Mon, 24 Nov 2025 17:37:41 +0000 (09:37 -0800)]
pybind/mgr/dashboard: create DBCLICommand, use throughout

Also moves Command from mgr_module to DBCommand in dashboard/cli.py
since dashboard is the only user and this way it can directly use
DBCLICommand.

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit 9fe5a643f433127e0e59b6ba04685aa60de5588b)

3 weeks ago pybind/mgr/tests/test_object_format: update DecoDemo to use fresh CLICommand
Samuel Just [Mon, 24 Nov 2025 17:36:50 +0000 (09:36 -0800)]
pybind/mgr/tests/test_object_format: update DecoDemo to use fresh CLICommand

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit d8acddcc0b6af05bed4348a36a1752b679cdd7a0)

3 weeks ago pybind/mgr/smb: adapt SMBCommand to use CLICommandBase
Samuel Just [Mon, 24 Nov 2025 17:36:15 +0000 (09:36 -0800)]
pybind/mgr/smb: adapt SMBCommand to use CLICommandBase

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit a58d20cca388d2339c9b999f6279a1439d31ccbe)

3 weeks ago pybind/orchestrator,cephadm: replace CLICommandMeta
Samuel Just [Mon, 24 Nov 2025 17:31:47 +0000 (09:31 -0800)]
pybind/orchestrator,cephadm: replace CLICommandMeta

orchestrator and cephadm relied on CLICommandMeta to bypass the global
behavior of CLICommand.  That is no longer a problem, so replace
CLICommandMeta with OrchestratorCLICommandBase to preserve the magic
error wrapping.

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit 776abe4f87cdad368cccc4c3994ec8e4b5cdaf13)

3 weeks ago pybind/mgr: mechanically fix simple users to not import CLI*Command
Samuel Just [Mon, 24 Nov 2025 17:29:55 +0000 (09:29 -0800)]
pybind/mgr: mechanically fix simple users to not import CLI*Command

The next commit will introduce module specific CLICommand classes.

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit 9099c682c2596612df7ab698e5ac3cfa578eb6d3)

3 weeks ago pybind/mgr/mgr_module: support per-module CLICommand instances and globals
Samuel Just [Mon, 24 Nov 2025 17:27:24 +0000 (09:27 -0800)]
pybind/mgr/mgr_module: support per-module CLICommand instances and globals

Otherwise, the class members on MgrModule and CLICommand are global to all
modules in the same interpreter.

The following commits will introduce per-module CLICommand types for each
module.
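
The difference between interpreter-wide and per-module state can be shown
with a toy registry. This sketch does not reproduce mgr_module.py: the class
`CommandRegistry` and its `Read` method are hypothetical stand-ins, chosen
only to show why one registry instance per module keeps command tables
separate.

```python
class CommandRegistry:
    """Toy command registry: each instance owns its own command table."""

    def __init__(self):
        # Instance attribute, so it is not shared between registries.
        self.commands = {}

    def Read(self, prefix):
        """Decorator that registers a read-only command handler."""
        def decorator(func):
            self.commands[prefix] = func
            return func
        return decorator


# One registry per module instead of one shared class-level table.
BalancerCLICommand = CommandRegistry()
HelloCLICommand = CommandRegistry()


@BalancerCLICommand.Read('balancer status')
def balancer_status():
    return "balancer is idle"


@HelloCLICommand.Read('hello')
def hello():
    return "Hello, world"
```

Had `commands` been a class attribute, both modules would have written into
the same dictionary, which is the sharing problem described above.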

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit 2d79ae64795fcd348b2f2c54d58105c941446b50)

3 weeks ago pybind/.../dashboard: misc automatic linter fixes
Samuel Just [Tue, 2 Dec 2025 00:57:22 +0000 (16:57 -0800)]
pybind/.../dashboard: misc automatic linter fixes

Signed-off-by: Samuel Just <sjust@redhat.com>
(cherry picked from commit ff8639818447b8079ceacf3b5d8cbe6ce8cee007)

3 weeks ago Merge PR #65861 into tentacle
Patrick Donnelly [Thu, 16 Apr 2026 16:21:05 +0000 (12:21 -0400)]
Merge PR #65861 into tentacle

* refs/pull/65861/head:
mgr/smb: fix error handling for fundamental resource parsing

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: John Mulligan <jmulligan@redhat.com>
3 weeks ago cephadm: wait for latest osd map after ceph-volume before OSD deploy 68379/head
Guillaume Abrioux [Fri, 10 Apr 2026 14:37:58 +0000 (16:37 +0200)]
cephadm: wait for latest osd map after ceph-volume before OSD deploy

After ceph-volume creates an OSD, the cached osd map of the mgr can
lag behind the monitors. get_osd_uuid_map() then misses the new osd
id and deploy_osd_daemons_for_existing_osds() skips deploying the
cephadm daemon, which reports a misleading "Created no osd(s)" while
the osd exists.

This behavior is often seen with raw devices (lvm list returns more quickly).

This also fixes get_osd_uuid_map(only_up=True) as the previous branch
never populated the map when 'only_up' was true.
Now we only include osds with 'up==1' so a new OSD created (but still down)
is not treated as already present.
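
The intended `only_up` behavior can be sketched as follows. This is not the
cephadm function: the input shape (a list of dicts with 'osd', 'uuid' and
'up' keys) is an assumption made for the example, and only the filtering
rule comes from the description above.

```python
def get_osd_uuid_map(osds, only_up=False):
    """Map OSD id (as a string) to its uuid.

    `osds` is a hypothetical list of dicts such as
    {"osd": 3, "uuid": "abc", "up": 1}.
    """
    result = {}
    for osd in osds:
        # With only_up set, skip OSDs that are not up (up != 1).
        if only_up and osd.get("up") != 1:
            continue
        result[str(osd["osd"])] = osd["uuid"]
    return result
```

With `only_up=True`, a freshly created OSD that is still down is left out of
the map, so it is not mistaken for one that is already present.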

Fixes: https://tracker.ceph.com/issues/75965
Signed-off-by: Guillaume Abrioux <gabrioux@ibm.com>
(cherry picked from commit 56123af6477a93c62df74e23b6cf3b4fdf6b19e9)

3 weeks ago Merge PR #68057 into tentacle
Patrick Donnelly [Tue, 14 Apr 2026 15:56:52 +0000 (11:56 -0400)]
Merge PR #68057 into tentacle

* refs/pull/68057/head:
qa/rgw/upgrade: symlinks are explicit about distro versions

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
3 weeks ago Merge PR #68142 into tentacle
Patrick Donnelly [Tue, 14 Apr 2026 15:55:31 +0000 (11:55 -0400)]
Merge PR #68142 into tentacle

* refs/pull/68142/head:
test/rgw/notification: do not use netstat in the code

Reviewed-by: Casey Bodley <cbodley@redhat.com>
3 weeks ago Merge PR #68254 into tentacle
Patrick Donnelly [Tue, 14 Apr 2026 15:53:19 +0000 (11:53 -0400)]
Merge PR #68254 into tentacle

* refs/pull/68254/head:
This change introduces the shared memory communication (SMC-D) for the cluster network.

Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
3 weeks ago Merge PR #67757 into tentacle
Patrick Donnelly [Tue, 14 Apr 2026 15:52:23 +0000 (11:52 -0400)]
Merge PR #67757 into tentacle

* refs/pull/67757/head:
qa/tasks/barbican: add kek for simple_crypto_plugin
qa/suites/rgw: use 'member' instead of 'Member' for roles in barbican
qa/suites/rgw: bump keystone to stable/2025.2

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Casey Bodley <cbodley@redhat.com>
3 weeks ago Merge PR #67468 into tentacle
Patrick Donnelly [Tue, 14 Apr 2026 15:51:13 +0000 (11:51 -0400)]
Merge PR #67468 into tentacle

* refs/pull/67468/head:
test/rgw/lua: ignore hours for zero mtime

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Yuval Lifshitz <ylifshit@redhat.com>
3 weeks ago Merge PR #67061 into tentacle
Patrick Donnelly [Tue, 14 Apr 2026 15:50:51 +0000 (11:50 -0400)]
Merge PR #67061 into tentacle

* refs/pull/67061/head:
qa/tasks/dnsmasq: preserve nameserver for future use
qa/suites/rgw/website: enable centos_latest

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
3 weeks ago Merge PR #68350 into tentacle
Patrick Donnelly [Tue, 14 Apr 2026 15:37:54 +0000 (11:37 -0400)]
Merge PR #68350 into tentacle

* refs/pull/68350/head:
mgr/DaemonServer: Limit search for OSDs to upgrade within the crush bucket.

Reviewed-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
3 weeks ago Merge PR #67888 into tentacle
Patrick Donnelly [Tue, 14 Apr 2026 15:34:05 +0000 (11:34 -0400)]
Merge PR #67888 into tentacle

* refs/pull/67888/head:
os/bluestore: track compression_*blob_size* parameters for online update.

Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
3 weeks ago Merge PR #68276 into tentacle
Patrick Donnelly [Mon, 13 Apr 2026 17:27:41 +0000 (13:27 -0400)]
Merge PR #68276 into tentacle

* refs/pull/68276/head:
debian: remove stale distutils override from py3dist-overrides

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Casey Bodley <cbodley@redhat.com>
3 weeks ago Merge PR #68118 into tentacle
Patrick Donnelly [Mon, 13 Apr 2026 17:23:48 +0000 (13:23 -0400)]
Merge PR #68118 into tentacle

* refs/pull/68118/head:
qa/tasks/backfill_toofull.py: Fix assert failures with & without compression

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Reviewed-by: Bill Scales <bill_scales@uk.ibm.com>
3 weeks ago os/bluestore: track compression_*blob_size* parameters for online update. 67888/head
Igor Fedotov [Thu, 19 Feb 2026 17:39:56 +0000 (20:39 +0300)]
os/bluestore: track compression_*blob_size* parameters for online update.

Fixes: https://tracker.ceph.com/issues/75032
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit b14012f3f608a279d8b650793a2f2d48d9c0e40a)

3 weeks ago This change introduces the shared memory communication (SMC-D) for the cluster network. 68254/head
Aliaksei Makarau [Tue, 31 Mar 2026 06:40:04 +0000 (08:40 +0200)]
This change introduces the shared memory communication (SMC-D) for the cluster network.
SMC-D is faster than ethernet in IBM Z LPARs and/or VMs (zVM or KVM).

Fixes: https://tracker.ceph.com/issues/66702
Signed-off-by: Aliaksei Makarau <aliaksei.makarau@ibm.com>
(cherry picked from commit e65af75a67b445bf7014842e9d9b3cfbae1e464b)

3 weeks ago mgr/DaemonServer: Limit search for OSDs to upgrade within the crush bucket. 68350/head
Sridhar Seshasayee [Wed, 25 Mar 2026 08:49:03 +0000 (14:19 +0530)]
mgr/DaemonServer: Limit search for OSDs to upgrade within the crush bucket.

The behavior of the 'ok-to-upgrade' command is now more deterministic with
respect to the parameters passed.

To achieve the above, the commit implements the following changes:

1. The 'ok-to-upgrade' command is modified to operate strictly on the OSDs
   within the CRUSH bucket and, if possible, meet the '--max' criteria when
   specified. When --max <num> is provided, the command returns up to <num>
   OSD IDs from the specified CRUSH bucket that can be safely stopped for
   simultaneous upgrade. This is useful when only a subset of OSDs within
   the bucket needs to be upgraded for performance or other reasons.

2. Modifies the standalone tests to reflect the above change.

3. Modifies the relevant documentation to reflect the change in behavior.
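
The effect of `--max` can be sketched with a greedy selection. This is a
simplified model, not the DaemonServer logic: `pick_osds_to_upgrade` and the
`is_safe_to_stop` predicate are hypothetical, and the real safety check is
not reproduced here.

```python
def pick_osds_to_upgrade(bucket_osds, is_safe_to_stop, max_osds=None):
    """Pick OSD ids from one CRUSH bucket that can be stopped together.

    bucket_osds     -- OSD ids that belong to the chosen bucket
    is_safe_to_stop -- hypothetical predicate over a list of OSD ids
    max_osds        -- optional cap, playing the role of --max
    """
    chosen = []
    # Sorted iteration keeps the result deterministic for given inputs.
    for osd in sorted(bucket_osds):
        if max_osds is not None and len(chosen) >= max_osds:
            break
        # Keep the OSD only if the whole set stays safe to stop.
        if is_safe_to_stop(chosen + [osd]):
            chosen.append(osd)
    return chosen
```

Only OSDs from the given bucket are ever considered, and the cap bounds the
result size without requiring that the cap be reached.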

Fixes: https://tracker.ceph.com/issues/75681
Signed-off-by: Sridhar Seshasayee <sridhar.seshasayee@ibm.com>
(cherry picked from commit f18093fc09bfedbb02cbe7967fc85b2dea9ff71f)

3 weeks ago test/rgw: remove deprecated boto dependency 68344/head
Yuval Lifshitz [Wed, 18 Feb 2026 09:34:41 +0000 (09:34 +0000)]
test/rgw: remove deprecated boto dependency

Fixes: https://tracker.ceph.com/issues/75154
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
Co-authored-by: Bob-Shell <bob-shell@ai-assistant>
(cherry picked from commit 8e9c77160d5cae2f6bd6d97a2ecdabb9ae9530bd)

4 weeks ago Merge PR #67982 into tentacle
Patrick Donnelly [Fri, 10 Apr 2026 23:58:33 +0000 (19:58 -0400)]
Merge PR #67982 into tentacle

* refs/pull/67982/head:
rgw/tests: add os-specific java 1.7 install commands to keycloak task

Reviewed-by: Casey Bodley <cbodley@redhat.com>
4 weeks ago rgw/tests: add os-specific java 1.7 install commands to keycloak task 67982/head
J. Eric Ivancich [Fri, 27 Feb 2026 20:56:29 +0000 (15:56 -0500)]
rgw/tests: add os-specific java 1.7 install commands to keycloak task

Add commands to the keycloak task specific to rocky, rhel, centos, and
ubuntu. Also, clean up the installed package(s) after the test is run.

This is necessary as rocky can't install the same package(s) that the
other os types currently can.

Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
(cherry picked from commit ee710390d277784ddac3d70c9e11e427f46f363d)

4 weeks ago Merge PR #67942 into tentacle
Patrick Donnelly [Fri, 10 Apr 2026 19:12:17 +0000 (15:12 -0400)]
Merge PR #67942 into tentacle

* refs/pull/67942/head:
nvmeofgw: fix issue of delete all gws from the pool/group

Reviewed-by: Aviv Caro <Aviv.Caro@ibm.com>
Reviewed-by: Alexander Indenbaum <aindenba@redhat.com>
4 weeks ago Merge PR #67790 into tentacle
Patrick Donnelly [Fri, 10 Apr 2026 19:08:22 +0000 (15:08 -0400)]
Merge PR #67790 into tentacle

* refs/pull/67790/head:
mon [stretch-mode]: Allow a max bucket weight diff threshold

Reviewed-by: Shraddha Agrawal <shraddhaag@ibm.com>
4 weeks ago qa: allow multiple mgr sessions during eviction test 68316/head
Patrick Donnelly [Thu, 26 Feb 2026 19:49:56 +0000 (14:49 -0500)]
qa: allow multiple mgr sessions during eviction test

Fixes: https://tracker.ceph.com/issues/70580
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit 8db0c8a4cc4bd394151533f254c7e7cfa9dbc1ca)

4 weeks ago qa: resolve py3.12 regression for random.sample 68315/head
Patrick Donnelly [Thu, 26 Feb 2026 20:03:56 +0000 (15:03 -0500)]
qa: resolve py3.12 regression for random.sample

Fixes: https://tracker.ceph.com/issues/75089
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit a6b13f6ab0fd53fa63b86075c3951534d1860910)

4 weeks ago qa/cephfs: do not validate error string in "fs authorize" tests 68314/head
Venky Shankar [Mon, 16 Mar 2026 04:48:22 +0000 (10:18 +0530)]
qa/cephfs: do not validate error string in "fs authorize" tests

Error string validation is prone to failures when the error string
changes. errno (retval) validation suffices for tests.
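
The idea can be sketched as follows. This is only an illustration: the
`CommandResult` type and the `assert_failed_with` helper are hypothetical
names, not part of the qa framework, and the sketch assumes the command
reports its errno as a positive return value.

```python
import errno
from dataclasses import dataclass


@dataclass
class CommandResult:
    """Hypothetical stand-in for the result of a failed CLI command."""
    retval: int
    stderr: str


def assert_failed_with(result: CommandResult, expected_errno: int) -> None:
    # Only the numeric return value is checked; the wording of the
    # error message is deliberately ignored so it can change freely.
    assert result.retval == expected_errno, (
        f"expected errno {expected_errno}, got {result.retval}"
    )


# Two differently worded messages pass the same check.
old_wording = CommandResult(errno.EINVAL, "Error EINVAL: invalid path")
new_wording = CommandResult(errno.EINVAL, "Error EINVAL: path is not valid")
assert_failed_with(old_wording, errno.EINVAL)
assert_failed_with(new_wording, errno.EINVAL)
```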

Signed-off-by: Venky Shankar <vshankar@redhat.com>
(cherry picked from commit d499c7f30f6d381bd3251108ff8dda9fd3c80f0f)

4 weeks ago mon/AuthMonitor: add osd w cap for superuser client
Patrick Donnelly [Wed, 18 Feb 2026 20:27:30 +0000 (15:27 -0500)]
mon/AuthMonitor: add osd w cap for superuser client

Right now only a client with "rw" permissions on an MDS gets "rw" on an
OSD.

[@vshankar: fixed malformed OSD cap when authorizing multiple paths]

Reported-by: John Mulligan <jmulligan@redhat.com>
Fixes: https://tracker.ceph.com/issues/75013
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
(cherry picked from commit 6e6e570f80480b9a310d10a96ad1e5682785c775)

4 weeks ago mds: use SimpleLock::WAIT_ALL for wait mask 68313/head
Patrick Donnelly [Tue, 24 Feb 2026 19:30:24 +0000 (14:30 -0500)]
mds: use SimpleLock::WAIT_ALL for wait mask

The Locker uses has_any_waiter for a particular lock to evaluate whether
to nudge the log. For the squid, tentacle, and main branches, this
larger bit mask (all 64 bits) will cause this to wrongly return true for
other locks which have waiters. The side-effect of waking requests
spuriously is undesirable but should not affect performance
significantly.

For reef and older releases, using std::numeric_limits<uint64_t>::max()
in has_any_waiter() causes a bitwise overflow that sets the wait-queue
search bound impossibly high, resulting in the method always incorrectly
returning false. This results in nudge_log never nudging the log!
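
A toy model of the squid/tentacle/main case described above, not the MDS
code: it assumes each lock owns a small group of wait bits placed at a
per-lock shift, and that has_any_waiter() tests a mask against the bits of
queued waiters. WAIT_BITS, lock_mask() and the list-based wait queue are
hypothetical; the reef overflow case is not modelled.

```python
WAIT_BITS = 4                    # hypothetical number of wait bits per lock
WAIT_ALL = (1 << WAIT_BITS) - 1  # every wait bit belonging to one lock
ALL_64_BITS = (1 << 64) - 1      # std::numeric_limits<uint64_t>::max()


def lock_mask(wait_shift: int, bits: int = WAIT_ALL) -> int:
    """Wait bits of one lock, moved to that lock's position."""
    return bits << wait_shift


def has_any_waiter(waiting: list, mask: int) -> bool:
    """True if any queued waiter overlaps the given mask."""
    return any(w & mask for w in waiting)


# One waiter is queued on a *different* lock (shift 8).
waiting = [lock_mask(8, bits=0b1)]

# Asking about the lock at shift 0 with an all-ones mask matches it anyway.
print(has_any_waiter(waiting, ALL_64_BITS))   # True (spurious)
# Restricting the mask to this lock's own bits does not.
print(has_any_waiter(waiting, lock_mask(0)))  # False
```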

Fixes: db5c9dc2e6cc95a8d112c2131e4cac5340ca9dd0
Fixes: https://tracker.ceph.com/issues/75141
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit 0be3a7896be1d5e9d6b04a6b943c2b5535b8523f)

4 weeks ago Merge PR #68226 into tentacle
Patrick Donnelly [Thu, 9 Apr 2026 16:56:25 +0000 (12:56 -0400)]
Merge PR #68226 into tentacle

* refs/pull/68226/head:
rgw: enhanced java s3-tests change setting of JAVA_HOME
rgw: java s3-tests change setting of JAVA_HOME

Reviewed-by: Casey Bodley <cbodley@redhat.com>
4 weeks ago Merge PR #67533 into tentacle
Patrick Donnelly [Thu, 9 Apr 2026 16:53:15 +0000 (12:53 -0400)]
Merge PR #67533 into tentacle

* refs/pull/67533/head:
qa/cephadm: ensure host has been fully saved before considering bootstrap complete
mgr/cephadm: add __getstate__ so OSD class can be pickled

Reviewed-by: Adam King <adking@redhat.com>
Reviewed-by: Guillaume Abrioux <gabrioux@redhat.com>
4 weeks ago ceph-volume: skip mkfs discard for LVM NVMe OSDs 68286/head
Ujjawal Anand [Mon, 6 Apr 2026 06:26:20 +0000 (11:56 +0530)]
ceph-volume: skip mkfs discard for LVM NVMe OSDs

Run NVMe preformat in the LVM bluestore prepare path and
set skip_mkfs_discard so ceph-osd --mkfs gets --bdev-enable-discard false,
matching the raw OSD behavior.

Fixes: https://tracker.ceph.com/issues/75873

Signed-off-by: Ujjawal Anand <ujjawal.anand@ibm.com>
(cherry picked from commit bb6d4d7457d3daff87811a00ec2a61ab31e19d7d)

4 weeks ago debian: remove stale distutils override from py3dist-overrides 68276/head
Kefu Chai [Wed, 8 Apr 2026 07:29:09 +0000 (15:29 +0800)]
debian: remove stale distutils override from py3dist-overrides

distutils was deprecated in Python 3.10 (PEP 632) and removed in
Python 3.12. The `python3-distutils` package no longer exists in
Debian Trixie (Python 3.13) or Ubuntu 24.04+ (Python 3.12).

The only runtime reference was in `debian/ceph-mgr.requires`, already
cleaned up by 3fb3f892aa3. This override is now dead code: no
installed file declares a runtime dependency on `distutils`, so
`dh_python3` never resolves it. Removing it prevents a latent
uninstallable-dependency bug if `distutils` were accidentally
reintroduced in a `.requires` file.

Fixes: https://tracker.ceph.com/issues/75901
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Signed-off-by: Max R. Carrara <m.carrara@proxmox.com>
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
(cherry picked from commit d1d07a0542228b7c40238a9a78d138ad07130240)

4 weeks ago Merge PR #67907 into tentacle
Patrick Donnelly [Tue, 7 Apr 2026 17:38:24 +0000 (13:38 -0400)]
Merge PR #67907 into tentacle

* refs/pull/67907/head:
doc/start: Add ARM support note to hardware-recommendations.rst
doc: Improve start/hardware-recommendations.rst
doc: Update the old ceph.com/community/ links to ceph.io/en/news/blog/
doc/start: Improve hardware-recommendations.rst
doc/start: fix wording in swap tip
doc: Use ref instead of full URLs for intra-docs links
doc: Use existing labels and ref for hyperlinks in architecture.rst

Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Anthony D Atri <anthony.datri@gmail.com>
4 weeks ago Merge PR #66948 into tentacle
Patrick Donnelly [Tue, 7 Apr 2026 12:41:48 +0000 (08:41 -0400)]
Merge PR #66948 into tentacle

* refs/pull/66948/head:
mgr/DaemonServer: Re-order OSDs in crush bucket to maximize OSDs for upgrade
mgr/DaemonServer: Implement ok-to-upgrade command
mgr/DaemonServer: Modify offline_pg_report to handle set or vector types

Reviewed-by: Nitzan Mordechai <nmordech@redhat.com>
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
4 weeks ago rgw: enhanced java s3-tests change setting of JAVA_HOME 68226/head
J. Eric Ivancich [Tue, 7 Apr 2026 00:53:34 +0000 (20:53 -0400)]
rgw: enhanced java s3-tests change setting of JAVA_HOME

Under CentOS 9 the Java 8 version is recognized by the substring
"java-1.8" rather than "java-8". So the grep has been modified to
accept either.
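
A minimal sketch of the lookup, assuming JVMs are installed as directories
under /usr/lib/jvm/ and that the first match in sorted order is acceptable.
The function name and the regular expression are illustrative; the actual
task uses a grep.

```python
import re
from pathlib import Path
from typing import Optional

# Accept both naming schemes: "java-8..." and "java-1.8...".
JAVA8_PATTERN = re.compile(r"java-(1\.8|8)")


def find_java8_home(jvm_root: str = "/usr/lib/jvm") -> Optional[str]:
    """Return the first Java 8 directory under jvm_root, or None."""
    root = Path(jvm_root)
    if not root.is_dir():
        return None
    for entry in sorted(root.iterdir()):
        if entry.is_dir() and JAVA8_PATTERN.search(entry.name):
            return str(entry)
    return None
```

The returned path would then be exported as JAVA_HOME; a None result means
no Java 8 JVM was found and the caller has to decide how to fail.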

Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
4 weeks ago Merge PR #67993 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 19:55:54 +0000 (15:55 -0400)]
Merge PR #67993 into tentacle

* refs/pull/67993/head:
test/rgw/kafka: fix kafka release to more recent one

Reviewed-by: J. Eric Ivancich <ivancich@redhat.com>
4 weeks ago Merge PR #67797 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 19:40:39 +0000 (15:40 -0400)]
Merge PR #67797 into tentacle

* refs/pull/67797/head:
qa/workunits/rbd: fix unbound variable in status()
qa/workunits/rbd: short-circuit status() if "ceph -s" fails
qa: rbd_mirror_fsx_compare.sh doesn't error out as expected

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
4 weeks ago Merge PR #67795 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 19:40:07 +0000 (15:40 -0400)]
Merge PR #67795 into tentacle

* refs/pull/67795/head:
qa/tasks: make rbd_mirror_thrash inherit from ThrasherGreenlet

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
4 weeks ago Merge PR #67705 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 19:39:48 +0000 (15:39 -0400)]
Merge PR #67705 into tentacle

* refs/pull/67705/head:
librbd/cache/pwl: WriteLogOperationSet::cell can be garbage

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
4 weeks ago Merge PR #66837 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 18:25:25 +0000 (14:25 -0400)]
Merge PR #66837 into tentacle

* refs/pull/66837/head:
os/bluestore: rename row names in RocksDBBlueFSVolumeSelector.
test/bluestore: add volume selector tests
os/bluestore:fix bluestore_volume_selection_reserved_factor usage

Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
4 weeks ago Merge PR #67354 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 18:24:11 +0000 (14:24 -0400)]
Merge PR #67354 into tentacle

* refs/pull/67354/head:
debian: remove invoke-rc.d calls from postrm scripts

Reviewed-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Reviewed-by: Casey Bodley <cbodley@redhat.com>
4 weeks ago Merge PR #67407 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 18:22:32 +0000 (14:22 -0400)]
Merge PR #67407 into tentacle

* refs/pull/67407/head:
osd: add pg-upmap-primary to clean_pg_upmaps

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
4 weeks ago Merge PR #66482 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 18:21:40 +0000 (14:21 -0400)]
Merge PR #66482 into tentacle

* refs/pull/66482/head:
mgr/prometheus/test_module: Adding unit-test for new classes
mgr/prometheus: metrics header for standby module
mgr/prometheus: Use RLock to fix deadlock in HealthHistory
mgr/TTLCache: fix PyObject* lifetime management and cleanup logic
mgr/prometheus: prune stale health checks, compress output

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
4 weeks ago Merge PR #65913 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 18:17:08 +0000 (14:17 -0400)]
Merge PR #65913 into tentacle

* refs/pull/65913/head:
client: signal waitfor_commit waiters for write delegation enabled inode
test/libcephfs: add test for fsync on a write delegated inode
client: adjust `Fb` cap ref count check during synchronous fsync()

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
Reviewed-by: Venky Shankar <vshankar@redhat.com>
4 weeks ago Merge PR #65957 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 18:16:33 +0000 (14:16 -0400)]
Merge PR #65957 into tentacle

* refs/pull/65957/head:
client: crash caused by invalid iterator in _readdir_cache_cb

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
4 weeks ago Merge PR #66469 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 18:16:13 +0000 (14:16 -0400)]
Merge PR #66469 into tentacle

* refs/pull/66469/head:
mds: MDCache: check validity of mdr requests before dispatching
mds: MDCache request cleanup handles potential null mdr

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
4 weeks ago Merge PR #67617 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 18:14:09 +0000 (14:14 -0400)]
Merge PR #67617 into tentacle

* refs/pull/67617/head:
qa: fix TypeError in delay

Reviewed-by: Venky Shankar <vshankar@redhat.com>
4 weeks ago Merge PR #67455 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 18:10:14 +0000 (14:10 -0400)]
Merge PR #67455 into tentacle

* refs/pull/67455/head:
qa: krbd_rxbounce.sh: do more reads to generate more errors

Reviewed-by: Ramana Raja <rraja@redhat.com>
4 weeks ago Merge PR #67581 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 18:09:55 +0000 (14:09 -0400)]
Merge PR #67581 into tentacle

* refs/pull/67581/head:
librbd: don't complete ImageUpdateWatchers::shut_down() prematurely

Reviewed-by: Ramana Raja <rraja@redhat.com>
4 weeks ago Merge PR #67583 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 18:09:34 +0000 (14:09 -0400)]
Merge PR #67583 into tentacle

* refs/pull/67583/head:
librbd/mirror: detect trashed snapshots in UnlinkPeerRequest

Reviewed-by: Ramana Raja <rraja@redhat.com>
4 weeks ago Merge PR #67031 into tentacle
Patrick Donnelly [Mon, 6 Apr 2026 18:03:06 +0000 (14:03 -0400)]
Merge PR #67031 into tentacle

* refs/pull/67031/head:
doc/ceph.rst: scrub-related 'tell pgid' commands
osd/scrub: operator abort: (not) handling in-the-mail scrubs
osd/scrub: added the scrub-abort command
osd/scrub: support an operator-abort command
osd/scrub: removing the unused PgScrubber::m_scrub_reg_stamp

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
4 weeks ago rgw: java s3-tests change setting of JAVA_HOME
J. Eric Ivancich [Wed, 1 Apr 2026 16:29:01 +0000 (12:29 -0400)]
rgw: java s3-tests change setting of JAVA_HOME

Previously s3tests_java.py set JAVA_HOME using the `alternatives`
command. That had issues in that `alternatives` is not present on all
Ubuntu systems, and some installations of Java don't update
alternatives. So instead we look for a "java-8" jvm in /usr/lib/jvm/
and set JAVA_HOME to the first one we find.

Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
(cherry picked from commit b8e2796270f4558b406411682a9b916109d0c530)

5 weeks ago Merge PR #68185 into tentacle
Patrick Donnelly [Fri, 3 Apr 2026 16:55:31 +0000 (12:55 -0400)]
Merge PR #68185 into tentacle

* refs/pull/68185/head:
20.2.1

Reviewed-by: Patrick Donnelly <pdonnell@ibm.com>
5 weeks ago 20.2.1 tentacle-release 68185/head v20.2.1
Ceph Release Team [Thu, 2 Apr 2026 14:15:15 +0000 (14:15 +0000)]
20.2.1

Signed-off-by: Ceph Release Team <ceph-maintainers@ceph.io>
5 weeks ago qa/workunits: Add updated kernel archive URL 68169/head
Brad Hubbard [Thu, 26 Mar 2026 02:29:54 +0000 (12:29 +1000)]
qa/workunits: Add updated kernel archive URL

Fixes: https://tracker.ceph.com/issues/75391
Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
(cherry picked from commit 14fc52e0974e623ac39d323e18de8c04bff4ed97)
Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
5 weeks ago test/rgw/notification: do not use netstat in the code 68142/head
Yuval Lifshitz [Fri, 20 Feb 2026 15:41:14 +0000 (15:41 +0000)]
test/rgw/notification: do not use netstat in the code

* net-tools are deprecated in fedora and ubuntu
* using netstat -p (used to verify that the http server is listening on
  a port) requires root privileges, which may fail in some test environments

Fixes: https://tracker.ceph.com/issues/75820
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
(cherry picked from commit 5e204e17684ec6d2ab5b44e114be6cc4dfcf10b9)

5 weeks ago orch/cephadm: fix osd.default creation 68121/head
Guillaume Abrioux [Mon, 23 Mar 2026 12:55:10 +0000 (13:55 +0100)]
orch/cephadm: fix osd.default creation

The 'ceph orch daemon add osd' path mishandles osd.default: it can
save an incomplete osd.default spec (wrong store key, incomplete
spec, and apply ordering) and can skip work or reject valid LV paths.

This commit validates before saving, persists the full device
selections, matches the spec store by service name, and forces OSD
creation so each command is actually executed.

Fixes: https://tracker.ceph.com/issues/75670
Signed-off-by: Guillaume Abrioux <gabrioux@ibm.com>
(cherry picked from commit 1f9230654562b56bbfb0d0491286d2748df69949)

5 weeks ago qa/tasks/backfill_toofull.py: Fix assert failures with & without compression 68118/head
Sridhar Seshasayee [Mon, 9 Mar 2026 09:31:54 +0000 (15:01 +0530)]
qa/tasks/backfill_toofull.py: Fix assert failures with & without compression

The following issues with the test are addressed:

1. The test was encountering an assertion failure (assert backfillfull < 0.9) with
   compression enabled. This was because the condition was not factoring in the
   compression ratio. Without it the backfillfull ratio can easily exceed 1. By
   factoring in the compression ratio, the backfillfull ratio will be in the
   range (0 - n), where n can vary depending on the type of compression used.

2. The main contributing factor for (1) above is the amount of data written to
   the pool. The writes were time-bound earlier leading to excess data and
   eventually the assertion failure. By limiting the data written to the OSDs
   to 50% of the OSD capacity in the first phase and only 20% in the re-write
   phase, the outcome of the test is more deterministic regardless of
   compression being enabled or not.

3. A potential false cluster error is avoided by swapping the setting of
   the nearfull-ratio and backfill-ratio after the re-write phase.

4. Fix a couple of typos - s/tartget/target.

Fixes: https://tracker.ceph.com/issues/71005
Signed-off-by: Sridhar Seshasayee <sridhar.seshasayee@ibm.com>
(cherry picked from commit 91de6a0b7b8b8c2531446555c25bf53e23635982)

5 weeks ago ceph-volume: skip virtual cdrom devices in inventory 68108/head
Ujjawal Anand [Tue, 3 Mar 2026 05:38:29 +0000 (11:08 +0530)]
ceph-volume: skip virtual cdrom devices in inventory

- Some hosts expose IPMI/BMC virtual media as /dev/sr0. These
  devices are reported as SCSI type 5 (CD/DVD) via sysfs and
  appear in inventory/orchestrator output.

- These are not real disks and cannot be used as OSD targets.

- Filter out devices with SCSI type 5 to avoid listing
  virtual cdrom devices as valid OSD candidates.
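
A rough sketch of such a filter, assuming the SCSI type is read from the
sysfs attribute /sys/block/<name>/device/type. The helper names are
hypothetical and this is not the ceph-volume implementation.

```python
from pathlib import Path

SCSI_TYPE_CDROM = 5  # SCSI peripheral device type for CD/DVD devices


def is_cdrom(device_name: str, sys_block: str = "/sys/block") -> bool:
    """True if sysfs reports the block device as SCSI type 5."""
    type_file = Path(sys_block) / device_name / "device" / "type"
    try:
        return int(type_file.read_text().strip()) == SCSI_TYPE_CDROM
    except (OSError, ValueError):
        # No type attribute (e.g. virtio disks) or unreadable content:
        # treat the device as a regular disk.
        return False


def filter_osd_candidates(device_names, sys_block: str = "/sys/block"):
    """Drop CD/DVD devices from a list of block device names."""
    return [n for n in device_names if not is_cdrom(n, sys_block)]
```

The sys_block argument exists only so the function can be pointed at a
fake sysfs tree in tests.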

Fixes: https://tracker.ceph.com/issues/75281

Signed-off-by: Ujjawal Anand <ujjawal.anand@ibm.com>
(cherry picked from commit 3706288bc050bb20d64fc9ac6a643aaf84ca041d)

6 weeks ago qa/rgw/upgrade: symlinks are explicit about distro versions 68057/head
Casey Bodley [Wed, 25 Mar 2026 16:38:59 +0000 (12:38 -0400)]
qa/rgw/upgrade: symlinks are explicit about distro versions

Avoid relying on "ubuntu_latest" and "rpm_latest" symlinks, which change
over time on main. Be explicit about the distro versions supported by
the initial release.

Signed-off-by: Casey Bodley <cbodley@redhat.com>
(cherry picked from commit 73b1d1e708725a34aa4bcdfaa7fff396a643d7fc)

Conflicts: removed tentacle, added reef
qa/suites/rgw/upgrade/1-install/reef/distro$/ubuntu_latest.yaml
qa/suites/rgw/upgrade/1-install/squid/distro$/centos_9.stream.yaml
qa/suites/rgw/upgrade/1-install/squid/distro$/ubuntu_22.04.yaml
qa/suites/rgw/upgrade/1-install/squid/distro$/ubuntu_latest.yaml

6 weeks ago test/rgw/kafka: fix kafka release to more recent one 67993/head
Yuval Lifshitz [Wed, 4 Mar 2026 14:53:13 +0000 (14:53 +0000)]
test/rgw/kafka: fix kafka release to more recent one

Fixes: https://tracker.ceph.com/issues/75323
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
(cherry picked from commit dc412a7e519d037acbcac8a92c7ecf2dbde9875a)

6 weeks ago ceph-volume: include LVM mapper devices in get_devices() 67989/head
Guillaume Abrioux [Thu, 5 Feb 2026 15:34:26 +0000 (15:34 +0000)]
ceph-volume: include LVM mapper devices in get_devices()

`ceph-volume inventory --list-all` doesn't include LV devices:

```
NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
vda           253:0    0   61G  0 disk
└─vda1        253:1    0   61G  0 part /etc/ceph/ceph.keyring
                                       /etc/ceph/ceph.conf
                                       /run/podman-init
                                       /rootfs/var/lib/containers/storage/overlay
                                       /rootfs
vdb           253:16   0  200G  0 disk
vdc           253:32   0  200G  0 disk
vdd           253:48   0  200G  0 disk
vde           253:64   0  200G  0 disk
vdf           253:80   0  200G  0 disk
vdg           253:96   0  200G  0 disk
vdh           253:112  0  200G  0 disk
vdi           253:128  0  200G  0 disk
vdj           253:144  0  200G  0 disk
vdk           253:160  0  200G  0 disk
└─vg_test-lv1 252:0    0  200G  0 lvm

Device Path               Size         Device nodes    rotates available
Model name
/dev/vdb                  200.00 GB    vdb             True    True
/dev/vdc                  200.00 GB    vdc             True    True
/dev/vdd                  200.00 GB    vdd             True    True
/dev/vde                  200.00 GB    vde             True    True
/dev/vdf                  200.00 GB    vdf             True    True
/dev/vdg                  200.00 GB    vdg             True    True
/dev/vdh                  200.00 GB    vdh             True    True
/dev/vdi                  200.00 GB    vdi             True    True
/dev/vdj                  200.00 GB    vdj             True    True
/dev/vda                  61.00 GB     vda             True    False
/dev/vda1                 61.00 GB     vda             False   False
/dev/vdk                  200.00 GB    vdk             True    False
```

This commit removes the UdevData(diskname).is_lvm check so LV devices
are no longer skipped when listing devices.

```

Device Path               Size         Device nodes    rotates available
Model name
/dev/vdb                  200.00 GB    vdb             True    True
/dev/vdc                  200.00 GB    vdc             True    True
/dev/vdd                  200.00 GB    vdd             True    True
/dev/vde                  200.00 GB    vde             True    True
/dev/vdf                  200.00 GB    vdf             True    True
/dev/vdg                  200.00 GB    vdg             True    True
/dev/vdh                  200.00 GB    vdh             True    True
/dev/vdi                  200.00 GB    vdi             True    True
/dev/vdj                  200.00 GB    vdj             True    True
/dev/vda                  61.00 GB     vda             True    False
/dev/vda1                 61.00 GB     vda             False   False
/dev/vdk                  200.00 GB    vdk             True    False
/dev/vg_test/lv1          200.00 GB    vdk             True    False
```

Fixes: https://tracker.ceph.com/issues/74775
Signed-off-by: Guillaume Abrioux <gabrioux@ibm.com>
(cherry picked from commit c06bee965f14607c3a792ecd670e7c336ddca217)

6 weeks ago nvmeofgw: fix issue of delete all gws from the pool/group 67942/head
Leonid Chernin [Tue, 26 Aug 2025 12:34:32 +0000 (15:34 +0300)]
nvmeofgw: fix issue of delete all gws from the pool/group
          when gws not removed from the map

Signed-off-by: Leonid Chernin <leonidc@il.ibm.com>
(cherry picked from commit 29174099ac46d19f6dd5dd9375a2a8c606dccd17)

7 weeks ago src/ceph-volume: fast device unavailable as error 67916/head
Timothy Q Nguyen [Wed, 11 Mar 2026 18:45:38 +0000 (11:45 -0700)]
src/ceph-volume: fast device unavailable as error

Normally, when fast devices are passed to the batch command but
no fast allocations can be found, the batch command does
nothing and returns an empty plan. This leads to issues,
however, because the empty return makes the problem silent,
which makes it hard to debug in certain scenarios. I propose
to change this to raise an error, and have made changes in osd.py
to better log the errors and process the exceptions. This
shouldn't affect existing processes much, and the change in
osd.py ensures the raised errors will not interrupt the returned
output. I've also changed the unit tests to account for the
change.
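
The pattern can be sketched as below. All names here (FastAllocationError,
build_plan, run_batch) are hypothetical and only illustrate raising instead
of returning an empty plan, with the caller logging the error and still
producing output; they are not the ceph-volume or osd.py code.

```python
import logging

logger = logging.getLogger(__name__)


class FastAllocationError(RuntimeError):
    """Fast devices were requested but nothing could be allocated."""


def build_plan(fast_devices, fast_allocations):
    # Previously this situation would have produced an empty plan.
    if fast_devices and not fast_allocations:
        raise FastAllocationError(
            f"no fast allocations available on {', '.join(fast_devices)}"
        )
    return list(fast_allocations)


def run_batch(fast_devices, fast_allocations):
    """Caller side: log the error but still return structured output."""
    try:
        plan = build_plan(fast_devices, fast_allocations)
        return {"plan": plan, "errors": []}
    except FastAllocationError as exc:
        logger.error("batch failed: %s", exc)
        return {"plan": [], "errors": [str(exc)]}
```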

Signed-off-by: Timothy Q Nguyen <timqn22@gmail.com>
(cherry picked from commit 262175b107a86a0a330629645b4bc7a00a4fe047)

7 weeks ago doc/start: Add ARM support note to hardware-recommendations.rst 67907/head
Anthony D'Atri [Mon, 8 Dec 2025 19:48:58 +0000 (14:48 -0500)]
doc/start: Add ARM support note to hardware-recommendations.rst

Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
(cherry picked from commit 33ecb7912f15495668a99cc64d3aa86fe93d20df)

7 weeks ago doc: Improve start/hardware-recommendations.rst
Anthony D'Atri [Mon, 17 Nov 2025 17:57:29 +0000 (12:57 -0500)]
doc: Improve start/hardware-recommendations.rst

Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
(cherry picked from commit b4fa87d24fc363ecbc2dafbe5feaf15273e18128)

7 weeks ago doc: Update the old ceph.com/community/ links to ceph.io/en/news/blog/
mrVectorz [Mon, 3 Nov 2025 04:24:59 +0000 (23:24 -0500)]
doc: Update the old ceph.com/community/ links to ceph.io/en/news/blog/

Signed-off-by: Marc Methot <mb.methot@gmail.com>
(cherry picked from commit c32027ba9192b10a10acbbe7683933290dc964b5)

7 weeks ago doc/start: Improve hardware-recommendations.rst
Anthony D'Atri [Thu, 23 Oct 2025 19:29:19 +0000 (15:29 -0400)]
doc/start: Improve hardware-recommendations.rst

Signed-off-by: Anthony D'Atri <anthonyeleven@users.noreply.github.com>
(cherry picked from commit d24c3ac173c09018cd45d8932cde264d36cee257)