git-server-git.apps.pok.os.sepia.ceph.com Git - ceph.git/log
ceph.git
9 days agomon/AuthMonitor: add osd w cap for superuser client
Patrick Donnelly [Wed, 18 Feb 2026 20:27:30 +0000 (15:27 -0500)]
mon/AuthMonitor: add osd w cap for superuser client

Right now only a client with "rw" permissions on an MDS gets "rw" on an
OSD.

[@vshankar: fixed malformed OSD cap when authorizing multiple paths]

Reported-by: John Mulligan <jmulligan@redhat.com>
Fixes: https://tracker.ceph.com/issues/75013
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Signed-off-by: Venky Shankar <vshankar@redhat.com>
11 days agoMerge pull request #67782 from rkachach/fix_issue_75492
Redouane Kachach [Sat, 14 Mar 2026 09:46:02 +0000 (10:46 +0100)]
Merge pull request #67782 from rkachach/fix_issue_75492

mgr/nvmeof: Adding missing CLICommand field to nvmeof mgr module

Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: John Mulligan <jmulligan@redhat.com>
Reviewed-by: Adam King <adking@redhat.com>
11 days agoqa: ignore NVMEOF_GATEWAY_DOWN in nvmeof_scalability.yaml 67804/head
Vallari Agrawal [Fri, 13 Mar 2026 08:47:46 +0000 (14:17 +0530)]
qa: ignore NVMEOF_GATEWAY_DOWN in nvmeof_scalability.yaml

Sometimes during scale-up/scale-down, a gateway goes into the
UNAVAILABLE state (which triggers the NVMEOF_GATEWAY_DOWN warning)
for a couple of seconds and then self-recovers.
In this case, none of the scale test asserts fail.

So NVMEOF_GATEWAY_DOWN can be added to the ignorelist, because the scale
test asserts on the expected gw count and checks that all expected gws are
AVAILABLE between each iteration of scale-up/scale-down.

Fixes: https://tracker.ceph.com/issues/75179
Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
11 days agoqa/tasks/nvmeof.py: retry do_check if gw in CREATED
Vallari Agrawal [Fri, 13 Mar 2026 08:32:06 +0000 (14:02 +0530)]
qa/tasks/nvmeof.py: retry do_check if gw in CREATED

In do_check(), ensure all the namespaces and listeners are
added to the gateway (i.e. the gateway is not in the CREATED state)
after the gateway is restarted. This prevents going into the
next iteration of thrashing while gateways are still being
updated.
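
A minimal sketch of the retry idea, not the code in qa/tasks/nvmeof.py.
`wait_until_configured` and the `get_states` callable are hypothetical
names; the only detail taken from the message is that a gateway still in
the CREATED state is not ready yet.

```python
import time
from typing import Callable, Iterable


def wait_until_configured(get_states: Callable[[], Iterable[str]],
                          retries: int = 10,
                          delay: float = 5.0) -> bool:
    """Poll until no gateway reports the CREATED state.

    get_states is a hypothetical callable that returns one state string
    per gateway. Returns True once every gateway has left CREATED, or
    False if the retries run out first.
    """
    for _ in range(retries):
        if all(state != "CREATED" for state in get_states()):
            return True
        time.sleep(delay)
    return False
```

A caller would only move on to the next thrashing iteration when this
returns True.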

Fixes: https://tracker.ceph.com/issues/75382
Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
12 days agorgw/test/multisite: revise test_period_update_commit zone selection for clarity
Oguzhan Ozmen [Fri, 13 Mar 2026 22:35:19 +0000 (22:35 +0000)]
rgw/test/multisite: revise test_period_update_commit zone selection for clarity

Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
12 days agorgw/test/multisite: revise test_period_update_commit testcase client wkld settings
Oguzhan Ozmen [Fri, 13 Mar 2026 22:33:45 +0000 (22:33 +0000)]
rgw/test/multisite: revise test_period_update_commit testcase client wkld settings

- set wkld concurrency level to default urllib pool size

    Set wkld_concurrency to 10 which is the default urllib pool size
    to avoid the event:

    WARNING:urllib3.connectionpool:Connection pool is full,
    discarding connection: ... Connection pool size: 10

- make the client wkld less aggressive
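
A rough illustration of why matching the concurrency to the pool size
avoids the warning: if no more than 10 requests are in flight at once,
no more than 10 connections are ever needed. `run_workload` and `send`
are hypothetical names, not the test suite's helpers.

```python
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 10  # connection pool size quoted in the warning above


def run_workload(requests, send, concurrency=POOL_SIZE):
    """Call send(request) for every request, with at most `concurrency`
    calls in flight at any time. Results keep the input order."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send, requests))
```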

Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
12 days agorgw/test/multisite: get_oldest_incremental_change_not_applied_epoch - handle sync...
Oguzhan Ozmen [Fri, 13 Mar 2026 22:30:50 +0000 (22:30 +0000)]
rgw/test/multisite: get_oldest_incremental_change_not_applied_epoch - handle sync-status failure gracefully

Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
12 days agorgw/test/multisite: run sync status on the intended zone
Oguzhan Ozmen [Fri, 13 Mar 2026 22:30:11 +0000 (22:30 +0000)]
rgw/test/multisite: run sync status on the intended zone

Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
12 days agorgw/test/multisite: test_period_update_commit use a custom retry setting
Oguzhan Ozmen [Fri, 13 Mar 2026 22:36:22 +0000 (22:36 +0000)]
rgw/test/multisite: test_period_update_commit use a custom retry setting

Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
12 days agorgw/test/multisite: use config's retry settings
Oguzhan Ozmen [Fri, 13 Mar 2026 22:31:40 +0000 (22:31 +0000)]
rgw/test/multisite: use config's retry settings

Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
12 days agorgw/test/multisite: add a mechanism to use custom config temporarily
Oguzhan Ozmen [Fri, 13 Mar 2026 22:29:05 +0000 (22:29 +0000)]
rgw/test/multisite: add a mechanism to use custom config temporarily
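
One common way to build such a mechanism is a context manager that
restores the previous values on exit. This is only a sketch under that
assumption; `temporary_config` and the attribute names in the usage note
are hypothetical and do not describe the actual test code.

```python
from contextlib import contextmanager


@contextmanager
def temporary_config(config, **overrides):
    """Apply attribute overrides to `config` and restore the original
    values on exit, even if the body raises."""
    saved = {name: getattr(config, name) for name in overrides}
    try:
        for name, value in overrides.items():
            setattr(config, name, value)
        yield config
    finally:
        for name, value in saved.items():
            setattr(config, name, value)
```

A test could then wrap a block in
`with temporary_config(cfg, retries=3):` and rely on the old value
coming back afterwards.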

Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
12 days agoMerge pull request #66580 from kamoltat/wip-ksirivad-fix-72994
Kamoltat (Junior) Sirivadhna [Fri, 13 Mar 2026 21:34:34 +0000 (17:34 -0400)]
Merge pull request #66580 from kamoltat/wip-ksirivad-fix-72994

mon [stretch-mode]: Allow a max bucket weight diff threshold
Reviewed-by: Ronen Friedman <rfriedma@ibm.com>
12 days agoMerge PR #67780 into main
Patrick Donnelly [Fri, 13 Mar 2026 19:54:51 +0000 (01:24 +0530)]
Merge PR #67780 into main

* refs/pull/67780/head:
Revert "Merge PR #67630 into main"

Reviewed-by: Shraddha Agrawal <shraddhaag@ibm.com>
12 days agomgr/nvmeof: Adding missing CLICommand file to nvmeof mgr module 67782/head
Redouane Kachach [Fri, 13 Mar 2026 15:40:01 +0000 (16:40 +0100)]
mgr/nvmeof: Adding missing CLICommand file to nvmeof mgr module

Fixes: https://tracker.ceph.com/issues/75492
Signed-off-by: Redouane Kachach <rkachach@ibm.com>
12 days agoscript/build-with-container: add CONFIGURE_ARGS env var to configure step 67783/head
John Mulligan [Fri, 13 Mar 2026 17:42:09 +0000 (13:42 -0400)]
script/build-with-container: add CONFIGURE_ARGS env var to configure step

Add a new optional CONFIGURE_ARGS environment variable to the configure
step so that there's a mechanism to pass custom cmake options that
aren't handled elsewhere in the run-make.sh script.

Because configure is a rather fundamental build step it's probably
preferable to set this via an env file so that it persists across
rebuilds. Using an environment var here also avoids needing either to
change run-make.sh or to add another CLI option to BWC, which already has
too many.
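
For illustration only, the idea of an optional env var that appends
extra options to the configure step can be sketched in Python (the real
script is not Python, and how it splits and passes the value is an
assumption here). `configure_command` and `-DWITH_FOO` are hypothetical.

```python
import os
import shlex


def configure_command(base, environ=None):
    """Return the configure command with any options found in the
    optional CONFIGURE_ARGS variable appended. An unset or empty
    variable leaves the command unchanged."""
    environ = os.environ if environ is None else environ
    extra = shlex.split(environ.get("CONFIGURE_ARGS", ""))
    return list(base) + extra
```

With `CONFIGURE_ARGS="-DWITH_FOO=ON"` in an env file, the extra option
would be carried into every rebuild without touching the command line.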

Signed-off-by: John Mulligan <jmulligan@redhat.com>
12 days agoRevert "Merge PR #67630 into main" 67780/head
Patrick Donnelly [Fri, 13 Mar 2026 14:18:06 +0000 (19:48 +0530)]
Revert "Merge PR #67630 into main"

This reverts commit 3a5e4524aa56de4c26400ccf994baa6ba8e16d9e, reversing
changes made to d334ff531c563bb7d0e37777f606322ec91b7453.

To everyone's surprise, skipping a workflow does not make it less
required. Well done Github!

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
12 days agoMerge pull request #67275 from ifed01/wip-ifed-fix-bluefs-expand-test
Igor Fedotov [Fri, 13 Mar 2026 14:05:44 +0000 (17:05 +0300)]
Merge pull request #67275 from ifed01/wip-ifed-fix-bluefs-expand-test

qa/standalone: fix/improve bluefs tests

Reviewed-by: Adam Kupczyk <akupczyk@ibm.com>
12 days agoMerge pull request #67609 from ifed01/wip-ifed-bluefs-stats-reset
Igor Fedotov [Fri, 13 Mar 2026 14:01:39 +0000 (17:01 +0300)]
Merge pull request #67609 from ifed01/wip-ifed-bluefs-stats-reset

os/bluestore: add 'bluefs stats reset' admin socket command.

Reviewed-by: Adam Kupczyk <akupczyk@ibm.com>
12 days agoMerge pull request #67770 from bluikko/wip-doc-cephadm-spelling
bluikko [Fri, 13 Mar 2026 12:40:57 +0000 (19:40 +0700)]
Merge pull request #67770 from bluikko/wip-doc-cephadm-spelling

doc/cephadm: Fix spelling errors

12 days agoMerge pull request #67718 from rhcs-dashboard/fix-subsystem-create-layout-issue
Afreen Misbah [Fri, 13 Mar 2026 09:52:18 +0000 (15:22 +0530)]
Merge pull request #67718 from rhcs-dashboard/fix-subsystem-create-layout-issue

mgr/dashboard: Footer actions shift upward instead of staying pinned at modal bottom in NVMe/TCP subsystem create wizard

Reviewed-by: Afreen Misbah <afreen@ibm.com>
Reviewed-by: Devika Babrekar <devika.babrekar@ibm.com>
12 days agoqa/tasks/nvmeof.py: Fix thrasher daemon_rm revival
Vallari Agrawal [Fri, 13 Mar 2026 08:24:31 +0000 (13:54 +0530)]
qa/tasks/nvmeof.py: Fix thrasher daemon_rm revival

Instead of running "ceph orch daemon restart",
wait for the daemon to come back up on its own
during revival.
Also improve the do_check retry logic
and some logging in the nvmeof.thrasher task.

Fixes: https://tracker.ceph.com/issues/75383
Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
12 days agodoc/cephadm: Fix spelling errors 67770/head
Ville Ojamo [Fri, 13 Mar 2026 06:51:38 +0000 (13:51 +0700)]
doc/cephadm: Fix spelling errors

Signed-off-by: Ville Ojamo <git2233+ceph@ojamo.eu>
12 days agoMerge pull request #65405 from rhcs-dashboard/notification-store-events
Afreen Misbah [Fri, 13 Mar 2026 07:31:27 +0000 (13:01 +0530)]
Merge pull request #65405 from rhcs-dashboard/notification-store-events

mgr/dashboard: Add restore events in notification screen

Reviewed-by: Afreen Misbah <afreen@ibm.com>
Reviewed-by: Dnyaneshwari Talwekar <dtalweka@redhat.com>
12 days agomgr/dashboard: Breadcrumb should allow going back to subsystem tab 67643/head
pujaoshahu [Wed, 4 Mar 2026 08:32:54 +0000 (14:02 +0530)]
mgr/dashboard: Breadcrumb should allow going back to subsystem tab

Fixes: https://tracker.ceph.com/issues/75288
Signed-off-by: pujaoshahu <pshahu@redhat.com>
12 days agoMerge pull request #67760 from gbregman/main
Gil Bregman [Fri, 13 Mar 2026 07:15:08 +0000 (09:15 +0200)]
Merge pull request #67760 from gbregman/main

mgr/dashboard: Add secure and verify-host-name to "listener add" on NVMeoF CLI

12 days agoMerge pull request #67647 from rhcs-dashboard/fix-75317-main
Aashish Sharma [Fri, 13 Mar 2026 06:59:45 +0000 (12:29 +0530)]
Merge pull request #67647 from rhcs-dashboard/fix-75317-main

mgr/dashboard: update onboarding screen as per design

Reviewed-by: Afreen Misbah <afreen@ibm.com>
12 days agoMerge pull request #67713 from rhcs-dashboard/fix-nvmeof-initiator-add-visibility
Afreen Misbah [Fri, 13 Mar 2026 06:30:03 +0000 (12:00 +0530)]
Merge pull request #67713 from rhcs-dashboard/fix-nvmeof-initiator-add-visibility

mgr/dashboard: Initiator add shows success but host is not added/displayed in Subsystem Initiators table

Reviewed-by: Afreen Misbah <afreen@ibm.com>
Reviewed-by: pujaoshahu <pshahu@redhat.com>
12 days agomgr/dashboard: fix-nvmeof-subsystem-create-firefox-next 67769/head
Sagar Gopale [Fri, 13 Mar 2026 05:57:39 +0000 (11:27 +0530)]
mgr/dashboard: fix-nvmeof-subsystem-create-firefox-next

Fixes: https://tracker.ceph.com/issues/75434
Signed-off-by: Sagar Gopale <sagar.gopale@ibm.com>
12 days agomgr/dashboard: rename expand-cluster to add-storage 67647/head
Aashish Sharma [Thu, 5 Mar 2026 06:33:00 +0000 (12:03 +0530)]
mgr/dashboard: rename expand-cluster to add-storage

Signed-off-by: Aashish Sharma <aasharma@redhat.com>
12 days agomgr/dashboard: update onboarding screen as per design
Aashish Sharma [Wed, 4 Mar 2026 09:58:17 +0000 (15:28 +0530)]
mgr/dashboard: update onboarding screen as per design

Fixes: https://tracker.ceph.com/issues/75317
Signed-off-by: Aashish Sharma <aasharma@redhat.com>
13 days agoqa/suites/crimson-rados: add fio test case for osd shard number changes upon restart... 64975/head
Chunmei Liu [Thu, 27 Nov 2025 07:47:37 +0000 (07:47 +0000)]
qa/suites/crimson-rados: add fio test case for osd shard number changes upon restart for 3 osd

Signed-off-by: Chunmei Liu <chunmei.liu@ibm.com>
13 days agodoc/dev/seastore.rst: add design implementation for osd shards change
chunmei liu [Tue, 3 Feb 2026 23:04:40 +0000 (15:04 -0800)]
doc/dev/seastore.rst: add design implementation for osd shards change

Signed-off-by: chunmei liu <chunmei.liu@ibm.com>
13 days agocrimson/common/options: add seastore_require_partition_count_match_reactor_count...
chunmei liu [Thu, 19 Feb 2026 23:19:32 +0000 (15:19 -0800)]
crimson/common/options: add seastore_require_partition_count_match_reactor_count in crimson.yaml.in

Signed-off-by: chunmei liu <chunmei.liu@ibm.com>
13 days agocrimson/osd/osd_admin: add osd command to dump store shards info
chunmei liu [Mon, 9 Mar 2026 22:51:46 +0000 (15:51 -0700)]
crimson/osd/osd_admin: add osd command to dump store shards info

Signed-off-by: chunmei liu <chunmei.liu@ibm.com>
13 days agocrimson/os/seastore: support other devices
Chunmei Liu [Sat, 18 Oct 2025 00:17:44 +0000 (00:17 +0000)]
crimson/os/seastore: support other devices

Signed-off-by: Chunmei Liu <chunmei.liu@ibm.com>
13 days agotest/crimson/seastore: using store_index = 0 for the tests
Chunmei Liu [Thu, 21 Aug 2025 01:10:52 +0000 (01:10 +0000)]
test/crimson/seastore: using store_index = 0 for the tests

Signed-off-by: Chunmei Liu <chunmei.liu@ibm.com>
13 days agocrimson/tools: fixing tools according to osd shards number change modification
Chunmei Liu [Wed, 1 Oct 2025 22:58:23 +0000 (22:58 +0000)]
crimson/tools: fixing tools according to osd shards number change modification

Signed-off-by: Chunmei Liu <chunmei.liu@ibm.com>
13 days agocrimson/os/seastore: make register_metrics works for
chunmei liu [Thu, 12 Mar 2026 19:09:07 +0000 (12:09 -0700)]
crimson/os/seastore: make register_metrics work for multiple store shards on one reactor

Signed-off-by: chunmei liu <chunmei.liu@ibm.com>
13 days agocrimson/osd: replace store call by with_store call in case need remote store calling.
chunmei liu [Tue, 3 Feb 2026 22:40:56 +0000 (14:40 -0800)]
crimson/osd: replace store call by with_store call in case need remote store calling.

Signed-off-by: chunmei liu <chunmei.liu@ibm.com>
13 days agocrimson/osd/shard_services: get multiple store shards for per local state, and use...
chunmei liu [Tue, 3 Feb 2026 22:29:26 +0000 (14:29 -0800)]
crimson/osd/shard_services: get multiple store shards for per local state, and use store index to create pg mapping

Signed-off-by: chunmei liu <chunmei.liu@ibm.com>
13 days agocrimson/osd/pg_map: add pg mapping policy for osd shards number is different with...
chunmei liu [Wed, 16 Jul 2025 03:34:08 +0000 (20:34 -0700)]
crimson/osd/pg_map: add pg mapping policy for osd shards number is different with store shards number

Signed-off-by: chunmei liu <chunmei.liu@ibm.com>
13 days agocrimson/os/futurized_store: support cross core store calling
chunmei liu [Wed, 16 Jul 2025 03:32:21 +0000 (20:32 -0700)]
crimson/os/futurized_store: support cross core store calling

Signed-off-by: chunmei liu <chunmei.liu@ibm.com>
13 days agocrimson/os/alienstore: support multiple store shards on each reactor
Chunmei Liu [Wed, 1 Oct 2025 22:33:12 +0000 (22:33 +0000)]
crimson/os/alienstore: support multiple store shards on each reactor

Signed-off-by: Chunmei Liu <chunmei.liu@ibm.com>
13 days agocrimson/os/cyanstore: create multiple store shards on each reactor
chunmei liu [Tue, 15 Jul 2025 10:27:16 +0000 (03:27 -0700)]
crimson/os/cyanstore: create multiple store shards on each reactor

note: src/stop.sh should wait long enough before killing the crimson-osd;
otherwise cyanstore may not be able to write its metadata to disk.

Signed-off-by: chunmei liu <chunmei.liu@ibm.com>
13 days agocrimson/os/seastore: create multiple device shards and store shards on each reactor.
Chunmei Liu [Fri, 17 Oct 2025 23:15:40 +0000 (23:15 +0000)]
crimson/os/seastore: create multiple device shards and store shards on each reactor.

Signed-off-by: Chunmei Liu <chunmei.liu@ibm.com>
13 days agoMerge pull request #67396 from Rotemrs/lua-background-vm-fix
Yuval Lifshitz [Thu, 12 Mar 2026 17:52:13 +0000 (19:52 +0200)]
Merge pull request #67396 from Rotemrs/lua-background-vm-fix

rgw/lua: create fresh VM for each background script execution

13 days agomgr/dashboard: Add secure and verify-host-name to "listener add" on NVMeoF CLI. 67760/head
Gil Bregman [Thu, 12 Mar 2026 14:23:49 +0000 (16:23 +0200)]
mgr/dashboard: Add secure and verify-host-name to "listener add" on NVMeoF CLI.
Also add missing "manual" field in "listener list".

Fixes: https://tracker.ceph.com/issues/75447
Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
13 days agoMerge pull request #67638 from nbalacha/wip-nbalacha-75306
Yuval Lifshitz [Thu, 12 Mar 2026 15:46:17 +0000 (17:46 +0200)]
Merge pull request #67638 from nbalacha/wip-nbalacha-75306

rgw/lua: fix a crash when D4N is enabled

13 days agoMerge pull request #67660 from kshtsk/wip-keystone-2025.2
kyr [Thu, 12 Mar 2026 11:05:28 +0000 (12:05 +0100)]
Merge pull request #67660 from kshtsk/wip-keystone-2025.2

qa/tasks/keystone: upgrade keystone to 2025.2

13 days agoinclude/ceph_features: note more kernel versions 67754/head
Ilya Dryomov [Thu, 12 Mar 2026 10:30:24 +0000 (11:30 +0100)]
include/ceph_features: note more kernel versions

Despite both MONNAMES and MONENC being pre-argonaut feature bits and
the kernel client implicitly assuming argonaut since 5.0, its monmap
decoding routine didn't handle MONNAMES and MONENC until 5.11 (when it
became necessary as part of msgr2 support).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
13 days agoMerge pull request #67712 from afreen23/landing-page-fixes
Afreen Misbah [Thu, 12 Mar 2026 10:10:22 +0000 (15:40 +0530)]
Merge pull request #67712 from afreen23/landing-page-fixes

mgr/dashboard: Fix scrubbing state

Reviewed-by: Devika Babrekar <devika.babrekar@ibm.com>
13 days agoMerge pull request #67714 from afreen23/overview-breaking
Afreen Misbah [Thu, 12 Mar 2026 10:10:03 +0000 (15:40 +0530)]
Merge pull request #67714 from afreen23/overview-breaking

mgr/dashboard: Fix breaking overview page

Reviewed-by: Devika Babrekar <devika.babrekar@ibm.com>
13 days agoMerge pull request #66245 from athanatos/wip-sjust-seastore-conflict
Matan Breizman [Thu, 12 Mar 2026 08:11:26 +0000 (10:11 +0200)]
Merge pull request #66245 from athanatos/wip-sjust-seastore-conflict

crimson/seastore: rework lba_manager to use LBACursor rather than LBAMapping

Reviewed-by: Matan Breizman <mbreizma@redhat.com>
Reviewed-by: Xuehan Xu <xuxuehan@qianxin.com>
13 days agoMerge pull request #67739 from dang/wip-dang-posix-readme
Daniel Gryniewicz [Thu, 12 Mar 2026 04:52:48 +0000 (10:22 +0530)]
Merge pull request #67739 from dang/wip-dang-posix-readme

Update the POSIXDriver readme to current state

13 days agodoc: Update the POSIXDriver readme to current state 67739/head
Daniel Gryniewicz [Wed, 11 Mar 2026 04:47:06 +0000 (10:17 +0530)]
doc: Update the POSIXDriver readme to current state

Signed-off-by: Daniel Gryniewicz <dang@redhat.com>
2 weeks agoMerge pull request #67451 from Ericmzhang/wip-mon-colocate
SrinivasaBharathKanta [Thu, 12 Mar 2026 01:04:58 +0000 (06:34 +0530)]
Merge pull request #67451 from Ericmzhang/wip-mon-colocate

mon: Health warning for colocated monitors

2 weeks agorgw: set 2_min minimum on rgw_mp_lock_max_time
Casey Bodley [Wed, 11 Mar 2026 22:27:54 +0000 (18:27 -0400)]
rgw: set 2_min minimum on rgw_mp_lock_max_time

Because the lock renewal request is sent every half interval, don't
allow the lock duration to get small enough that rados request latency
becomes significant.
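
The arithmetic behind this can be sketched as follows. `renewal_slack`
is a hypothetical helper; the only fact taken from the message is that
renewal is sent at half the lock duration.

```python
def renewal_slack(lock_seconds: float) -> float:
    """Seconds left for a renewal request to complete before the lock
    expires, given that renewal is sent at half the lock duration."""
    renew_at = lock_seconds / 2
    return lock_seconds - renew_at
```

With the 2-minute minimum, a renewal has 60 seconds to land. A
2-second lock would leave only 1 second, which is the range where
request latency starts to matter.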

Signed-off-by: Casey Bodley <cbodley@redhat.com>
2 weeks agoMerge pull request #67641 from Hezko/revive-nvme-module
Hezko [Wed, 11 Mar 2026 22:03:55 +0000 (00:03 +0200)]
Merge pull request #67641 from Hezko/revive-nvme-module

introduce nvme module again

2 weeks agoMerge pull request #65626 from samarahu/wip-d4n-remove-bucket
Samarah Uriarte [Wed, 11 Mar 2026 21:27:03 +0000 (16:27 -0500)]
Merge pull request #65626 from samarahu/wip-d4n-remove-bucket

rgw/d4n: Implement bucket check_empty and remove methods

Reviewed-by: Pritha Srivastava <prsrivas@redhat.com>
2 weeks agoqa: Add "auto_pool_create" to nvmeof_initiator 67641/head
Vallari Agrawal [Wed, 4 Mar 2026 06:21:00 +0000 (11:51 +0530)]
qa: Add "auto_pool_create" to nvmeof_initiator

While deploying gateways with "ceph orch apply nvmeof",
--pool can be optional now. If not passed, a pool with
name ".nvmeof" would automatically be created.

In nvmeof task, "auto_pool_create: True" would skip --pool
in "ceph orch apply nvmeof".
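
A sketch of the option's effect on the command line, assuming the task
builds the command as a list. `apply_nvmeof_command` is a hypothetical
name, and every argument other than --pool is omitted here.

```python
def apply_nvmeof_command(pool_name, auto_pool_create=False):
    """Build the "ceph orch apply nvmeof" command, leaving out --pool
    when the pool is expected to be created automatically."""
    cmd = ["ceph", "orch", "apply", "nvmeof"]
    if not auto_pool_create:
        cmd += ["--pool", pool_name]
    return cmd
```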

Signed-off-by: Vallari Agrawal <vallari.agrawal@ibm.com>
2 weeks agomgr/nvmeof: add missing CLICommand to the module
Avan Thakkar [Mon, 2 Mar 2026 13:00:48 +0000 (18:30 +0530)]
mgr/nvmeof: add missing CLICommand to the module

Fixed AttributeError: type object 'NVMeoF' has no attribute 'CLICommand'

Signed-off-by: Avan Thakkar <athakkar@redhat.com>
2 weeks agoMerge pull request #67659 from kamoltat/wip-ksirivad-fix-70320
Kamoltat (Junior) Sirivadhna [Wed, 11 Mar 2026 18:48:21 +0000 (14:48 -0400)]
Merge pull request #67659 from kamoltat/wip-ksirivad-fix-70320

qa: make test_progress atomically capture OSD marked in/out events
Reviewed-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
2 weeks agomgr, qa: clarify module checks in DaemonServer
Laura Flores [Fri, 12 Sep 2025 20:14:30 +0000 (20:14 +0000)]
mgr, qa: clarify module checks in DaemonServer

The current check groups modules not being
enabled with failing to initialize. In this commit,
we reorder the checks:

1. Screen for a module being enabled. If it's not,
   issue an EOPNOTSUPP with instructions on how
   to enable it.

2. Screen for if a module is active. If a module
   is enabled, then the cluster expects it to
   be active to support commands. If the module
   took too long to initialize though, we will
   catch this and issue an ETIMEDOUT error with
   a link for troubleshooting.

Now, these two separate issues are not grouped
together, and they are checked in the right order.
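
The ordering can be sketched in Python (the real check lives in the
mgr's DaemonServer, which is not Python). `check_module` and its
arguments are hypothetical, and the message wording is illustrative
only.

```python
import errno


def check_module(name, enabled, active):
    """Return (code, message) for a command that needs mgr module `name`.

    The checks run in the order described above: enabled first, then
    active. A code of 0 means the command may proceed.
    """
    if name not in enabled:
        return (errno.EOPNOTSUPP,
                f"Module '{name}' is not enabled: "
                f"use `ceph mgr module enable {name}` to enable it")
    if name not in active:
        return (errno.ETIMEDOUT,
                f"Module '{name}' is enabled but has not finished initializing")
    return (0, "")
```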

Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
2 weeks agomgr, qa: add `pending_modules` to asock command
Laura Flores [Thu, 11 Sep 2025 22:13:51 +0000 (22:13 +0000)]
mgr, qa: add `pending_modules` to asock command

Now, the command `ceph tell mgr mgr_status` will show a
"pending_modules" field. This is another way for Ceph operators
to check which modules haven't been initialized yet (in addition
to the health error).

This command was also added to testing scenarios in the workunit.

Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
2 weeks agomgr, common, qa, doc: issue health error after max expiration is exceeded
Laura Flores [Tue, 29 Jul 2025 22:46:46 +0000 (22:46 +0000)]
mgr, common, qa, doc: issue health error after max expiration is exceeded

----------------- Enhancement to the Original Fix -----------------

During a mgr failover, the active mgr is marked available if:
  1. The mon has chosen a standby to be active
  2. The chosen active mgr has all of its modules initialized

Now that we've improved the criteria for sending the "active" beacon
by enforcing it to retry initializing mgr modules, we need to account
for extreme cases in which the modules are stuck loading for a very long
time, or even indefinitely. In these extreme cases where the modules might
never initialize, we don't want to delay sending the "active" beacon for
too long. This can result in blocking other important mgr functionality,
such as reporting PG availability in the health status. We want
to avoid sending warnings about PGs being unknown in the health status when
that's not ultimately the problem.

To account for an exceptionally long module loading time, I added a new
configurable `mgr_module_load_expiration`. If we exceed this maximum amount
of time (in ms) allotted for the active mgr to load the mgr modules before declaring
availability, the standby will then proceed to mark itself "available" and
send the "active" beacon to the mon and unblock other critical mgr functionality.

If this happens, a health error will be issued at this time, indicating
which mgr modules got stuck initializing (See src/mgr/PyModuleRegistry.cc). The
idea is to unblock the rest of the mgr's critical functionality while making it
clear to Ceph operators that some modules are unusable.
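
As a simplified model of that decision (not the C++ in the mgr),
`beacon_decision` below is a hypothetical function that takes the set
of still-pending modules, the elapsed time, and the configured
expiration, all in ms.

```python
def beacon_decision(pending_modules, elapsed_ms, expiration_ms):
    """Return (available, stuck_modules).

    available: whether the mgr should declare itself available now.
    stuck_modules: modules to name in the health error, if any.
    """
    if not pending_modules:
        return True, []                       # normal case, no error
    if elapsed_ms > expiration_ms:
        return True, sorted(pending_modules)  # unblock, raise health error
    return False, []                          # acceptable delay, keep waiting
```

The three branches line up with the three scenarios the workunit tests.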

--------------------- Integration Testing --------------------

The workunit was rewritten so it tests for these scenarios:

1. Normal module loading behavior (no health error should be issued)
2. Acceptable delay in module loading behavior (no health error should be
   issued)
3. Unacceptable delay in module loading behavior (a health error should be
   issued)

--------------------- Documentation --------------------

This documentation explains the "Module failed to initialize"
cluster error.

Users are advised to try failing over
the mgr to reboot the module initialization process,
then if the error persists, file a bug report. I decided
to write it this way instead of providing more complex
debugging tips such as advising to disable some mgr modules
since every case will be different depending on which modules
failed to initialize.

In the bug report, developers can ask for the health detail
output to narrow down which module is causing a bottleneck,
and then ask the user to try disabling certain modules until
the mgr is able to fully initialize.

Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
2 weeks agomgr: ensure that all modules have started before advertising active mgr
Laura Flores [Fri, 25 Apr 2025 22:11:19 +0000 (22:11 +0000)]
mgr: ensure that all modules have started before advertising active mgr

----------------- Explanation of Problem ----------------

When the mgr is restarted or failed over via `ceph mgr fail` or during an
upgrade, mgr modules sometimes take longer to start up (this includes
loading their class, commands, and module options, and being removed
from the `pending_modules` map structure). This startup delay can happen
due to a cluster's specific hardware or if a code bottleneck is triggered in
a module’s `serve()` function (each mgr module has a `serve()` function that
performs initialization tasks right when the module is loaded).

When this startup delay occurs, any mgr module command issued against the
cluster around the same time fails with an error saying that the command is not
supported:
```
$ ceph mgr fail; ceph fs volume ls
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'volumes' is not enabled/loaded (required by command 'fs volume ls'): use `ceph mgr module enable volumes` to enable it
```

We should try to lighten any bottlenecks in the mgr module `serve()`
functions wherever possible, but the root cause of this failure is that the
mgr sends a beacon to the mon too early, indicating that it is active before
the module loading has completed. Specifically, some of the mgr modules
have loaded their class but have not yet been deleted from the `pending_modules`
structure, indicating that they have not finished starting up.

--------------------- Explanation of Fix  --------------------

This commit improves the criteria for sending the “active” beacon to the mon so
the mgr does not signal that it’s active too early. We do this through the following additions:

1. A new context `ActivePyModules::recheck_modules_start` that will be set if not all modules
   have finished startup.

2. A new function `ActivePyModules::check_all_modules_started()` that checks if modules are
   still pending startup; if all have started up (`pending_modules` is empty), then we send
   the beacon right away. But if some are still pending, we pass the beacon task on to the new
   recheck context `ActivePyModules::recheck_modules_start` so we know to send the beacon later.

3. Logic in ActivePyModules::start_one() that only gets triggered if the modules did not all finish
   startup the first time we checked. We know this is the case if the new recheck context
   `recheck_modules_start` was set from `nullptr`. The beacon is only sent once `pending_modules` is
   confirmed to be empty, which means that all the modules have started up and are ready to support commands.

4. Adjustment of when the booleans `initializing` and `initialized` are set. These booleans come into play in
   MgrStandby::send_beacon() when we check that the active mgr has been initialized (thus, it is available).
   We only send the beacon when this boolean is set. Currently, we set these booleans at the end of Mgr::init(),
   which means that it gets set early before `pending_modules` is clear. With this adjustment, the bools are set
   only after we check that all modules have started up. The send_beacon code is triggered on mgr failover AND on
   every Mgr::tick(), which occurs by default every two seconds. If we don’t adjust when these bools are set, we
   only fix the mgr failover part, but the mgr still sends the beacon too early via Mgr::tick(). Below is the relevant
   code from MgrStandby::send_beacon(), which is triggered in Mgr::background_init() AND in Mgr::tick():
```
  // Whether I think I am available (request MgrMonitor to set me
  // as available in the map)
  bool available = active_mgr != nullptr && active_mgr->is_initialized();

  auto addrs = available ? active_mgr->get_server_addrs() : entity_addrvec_t();
  dout(10) << "sending beacon as gid " << monc.get_global_id() << dendl;

```
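
The deferred-beacon flow in items 1 to 4 can be modelled in a few lines
of Python. This is a simplified sketch, not the C++ implementation:
`ModuleStartTracker` is a hypothetical class, and only the names
`pending_modules`, `recheck_modules_start`, `check_all_modules_started`,
and `start_one` come from the description above.

```python
class ModuleStartTracker:
    def __init__(self, modules, send_beacon):
        self.pending_modules = set(modules)
        self.send_beacon = send_beacon
        self.recheck_modules_start = None  # deferred beacon task, if any
        self.initialized = False

    def check_all_modules_started(self):
        """Send the beacon now if nothing is pending, else defer it."""
        if not self.pending_modules:
            self._mark_initialized()
        else:
            self.recheck_modules_start = self._mark_initialized

    def start_one(self, name):
        """Record that one module finished startup; fire the deferred
        beacon task once the last pending module is done."""
        self.pending_modules.discard(name)
        if self.recheck_modules_start is not None and not self.pending_modules:
            task, self.recheck_modules_start = self.recheck_modules_start, None
            task()

    def _mark_initialized(self):
        # Set the flag only after every module has started (item 4).
        self.initialized = True
        self.send_beacon()
```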

--------------------- Reproducing the Bug ----------------------

At face value, this issue is not deterministically reproducible since it
can depend on environmental factors or specific cluster workloads.
However, I was able to deterministically reproduce it by injecting a
bottleneck into the balancer module:
```
diff --git a/src/pybind/mgr/balancer/module.py b/src/pybind/mgr/balancer/module.py
index d12d69f..91c83fa8023 100644
--- a/src/pybind/mgr/balancer/module.py
+++ b/src/pybind/mgr/balancer/module.py
@@ -772,10 +772,10 @@ class Module(MgrModule):
                     self.update_pg_upmap_activity(plan)  # update pg activity in `balancer status detail`
                 self.optimizing = False
+                # causing a bottleneck
+                for i in range(0, 1000):
+                    for j in range (0, 1000):
+                        x = i + j
+                        self.log.debug("hitting the bottleneck in the balancer module")
             self.log.debug('Sleeping for %d', sleep_interval)
             self.event.wait(sleep_interval)
             self.event.clear()
```

Then, the error reproduces every time by running:
```
$ ./bin/ceph mgr fail; ./bin/ceph telemetry show
Error ENOTSUP: Warning: due to ceph-mgr restart, some PG states may not be up to date
Module 'telemetry' is not enabled/loaded (required by command 'telemetry show'): use `ceph mgr module enable telemetry` to enable it
```

With this fix, the active mgr is marked as "initialized" only after all
the modules have started up, and this error goes away. The command may
take a bit longer to execute depending on the extent of the delay.

---------------------- Integration Testing ---------------------

This commit adds a dev-only config that can inject a longer
loading time into the mgr module loading sequence so we can
simulate this scenario in a test.

The config is 0 ms by default since we do not add any delay
outside of testing scenarios. The config can be adjusted
with the following command:
  `ceph config set mgr mgr_module_load_delay <ms>`

A second dev-only config also allows you to specify which
module you want to be delayed in loading time. You may change
this with the following command:
  `ceph config set mgr mgr_module_load_delay_name <module name>`

The workunit added here tests a simulated slow loading module
scenario to ensure that this case is properly handled.

--------------------- Documentation --------------------

The new documentation describes the three existing mgr states so Ceph
operators can better interpret their Ceph status output.

Fixes: https://tracker.ceph.com/issues/71631
Signed-off-by: Laura Flores <lflores@ibm.com>
2 weeks agoMerge pull request #67736 from gbregman/main
Gil Bregman [Wed, 11 Mar 2026 17:46:14 +0000 (19:46 +0200)]
Merge pull request #67736 from gbregman/main

mgr/dashboard: Remove the clear-alerts parameter from NVMeoF CLI

2 weeks agoMerge pull request #67431 from adk3798/cephadm-test-iscsi-ignorelist-pg-degraded
Redouane Kachach [Wed, 11 Mar 2026 15:58:50 +0000 (16:58 +0100)]
Merge pull request #67431 from adk3798/cephadm-test-iscsi-ignorelist-pg-degraded

qa/rbd/iscsi/cluster: ignore PG_DEGRADED warning

Reviewed-by: Redouane Kachach <rkachach@redhat.com>
2 weeks agoMerge pull request #67428 from adk3798/test-cephadm-timeout-ignore-timeout
Redouane Kachach [Wed, 11 Mar 2026 15:57:55 +0000 (16:57 +0100)]
Merge pull request #67428 from adk3798/test-cephadm-timeout-ignore-timeout

qa/cephadm: ignore CEPHADM_HOST_TIMEOUT_ERROR in timeout test

Reviewed-by: Redouane Kachach <rkachach@redhat.com>
2 weeks ago Merge pull request #67393 from adk3798/cephadm-grafana-sample-fixup
Redouane Kachach [Wed, 11 Mar 2026 15:56:23 +0000 (16:56 +0100)]
Merge pull request #67393 from adk3798/cephadm-grafana-sample-fixup

cephadm/samples: don't specify localhost as grafana addr

Reviewed-by: John Mulligan <jmulligan@redhat.com>
2 weeks ago mgr/dashboard: Remove the clear-alerts parameter from NVMeoF CLI 67736/head
Gil Bregman [Tue, 10 Mar 2026 16:37:12 +0000 (18:37 +0200)]
mgr/dashboard: Remove the clear-alerts parameter from NVMeoF CLI

Fixes: https://tracker.ceph.com/issues/74969
Signed-off-by: Gil Bregman <gbregman@il.ibm.com>
2 weeks ago mgr/dashboard: Initiator add shows success but host is not added/displayed in Subsyst... 67713/head
Sagar Gopale [Mon, 9 Mar 2026 10:42:50 +0000 (16:12 +0530)]
mgr/dashboard: Initiator add shows success but host is not added/displayed in Subsystem Initiators table

Fixes: https://tracker.ceph.com/issues/75402
Signed-off-by: Sagar Gopale <sagar.gopale@ibm.com>
2 weeks ago mgr/dashboard: Footer actions shift upward instead of staying pinned at modal bottom... 67718/head
Sagar Gopale [Mon, 9 Mar 2026 13:41:46 +0000 (19:11 +0530)]
mgr/dashboard: Footer actions shift upward instead of staying pinned at modal bottom in NVMe/TCP subsystem create wizard

Fixes: https://tracker.ceph.com/issues/75409
Signed-off-by: Sagar Gopale <sagar.gopale@ibm.com>
2 weeks ago mgr/dashboard: Add restore events in notification screen 65405/head
pujashahu [Fri, 5 Sep 2025 08:01:23 +0000 (13:31 +0530)]
mgr/dashboard: Add restore events in notification screen

Fixes: https://tracker.ceph.com/issues/72887
Signed-off-by: pujashahu <pshahu@redhat.com>
Signed-off-by: pujaoshahu <pshahu@redhat.com>
2 weeks ago mgr/nvmeof: add nvmeof module introduction to pending release notes
Tomer Haskalovitch [Wed, 25 Feb 2026 18:48:32 +0000 (20:48 +0200)]
mgr/nvmeof: add nvmeof module introduction to pending release notes

Fixes: https://tracker.ceph.com/issues/74702
Signed-off-by: Tomer Haskalovitch <tomer.haska@ibm.com>
(cherry picked from commit 166fb04c1251bc2df6aa68cbd4e303005f8f08e7)

2 weeks ago mgr/nvmeof: add unittests
Tomer Haskalovitch [Tue, 24 Feb 2026 11:38:36 +0000 (13:38 +0200)]
mgr/nvmeof: add unittests

Fixes: https://tracker.ceph.com/issues/74702
Signed-off-by: Tomer Haskalovitch <tomer.haska@ibm.com>
(cherry picked from commit eecbff76fa6401edaf2abbee9d86e08162f752eb)

2 weeks ago mgr/nvmeof: use nvmeof module during orch nvmeof apply
Tomer Haskalovitch [Tue, 24 Feb 2026 11:38:00 +0000 (13:38 +0200)]
mgr/nvmeof: use nvmeof module during orch nvmeof apply

Add a call to create_pool_if_not_exists during execution of the `ceph orch apply nvmeof` command.

Fixes: https://tracker.ceph.com/issues/74702
Signed-off-by: Tomer Haskalovitch <tomer.haska@ibm.com>
(cherry picked from commit f5734cf41b18add5e54efa13c4519359705dae57)

2 weeks ago mgr/nvmeof: set nvmeof module to be enabled by default
Tomer Haskalovitch [Tue, 24 Feb 2026 11:36:17 +0000 (13:36 +0200)]
mgr/nvmeof: set nvmeof module to be enabled by default

Fixes: https://tracker.ceph.com/issues/74702
Signed-off-by: Tomer Haskalovitch <tomer.haska@ibm.com>
(cherry picked from commit eccffe57c5a0cf8a762351fe26e6f631108fb849)

2 weeks ago mgr/nvmeof: integrate module into build and debian pkg
Tomer Haskalovitch [Tue, 24 Feb 2026 11:35:39 +0000 (13:35 +0200)]
mgr/nvmeof: integrate module into build and debian pkg

Fixes: https://tracker.ceph.com/issues/74702
Signed-off-by: Tomer Haskalovitch <tomer.haska@ibm.com>
(cherry picked from commit 901ec98b4146b9e2f2d2b4ab257a2d1a5b903d9f)

2 weeks ago mgr/nvmeof: introduce the new nvmeof module
Tomer Haskalovitch [Tue, 24 Feb 2026 11:22:11 +0000 (13:22 +0200)]
mgr/nvmeof: introduce the new nvmeof module

Introduce a new NVMe-oF mgr module which creates the pool used for
storing NVMe-related metadata during the `ceph orch apply nvmeof` command.
This removes the need for users to manually create and configure the
metadata pool before using the NVMe-oF functionality, simplifying
setup and reducing the chance of misconfiguration.

Fixes: https://tracker.ceph.com/issues/74702
Signed-off-by: Tomer Haskalovitch <tomer.haska@ibm.com>
(cherry picked from commit 15fcbb5e3eac2153c51d16b96e32d86038eb0569)

2 weeks ago rgw/lua: create fresh VM for each background script execution 67396/head
Rotem Shapira [Wed, 18 Feb 2026 13:51:45 +0000 (13:51 +0000)]
rgw/lua: create fresh VM for each background script execution

Previously, the background thread reused the same Lua VM across
iterations, causing stale state to persist. This made operations
like 'pairs(RGW)' fail to iterate properly.

Now we create a fresh VM on each iteration, which:
- Fixes the iteration bug
- Simplifies the code (no need to update limits on existing VM)
- Ensures clean state for each script execution

Verified with unit tests:
- TableIterateBackground
- TableIterateBackgroundBreak
- TableIterateStepByStep
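
The difference between reusing and recreating the execution environment
can be illustrated with a small Python analogy. This is not the RGW Lua
code: a plain dict passed to `exec` stands in for a Lua VM, and
`run_reused` / `run_fresh` are hypothetical helpers. It only shows how
state left behind by one run leaks into the next when the environment
is reused.

```python
# A script whose result depends on whatever state already exists.
SCRIPT = "counter = globals().get('counter', 0) + 1"


def run_reused(script, iterations):
    """Run the script repeatedly in one long-lived environment."""
    env = {}  # created once, like a VM kept across iterations
    results = []
    for _ in range(iterations):
        exec(script, env)
        results.append(env["counter"])
    return results


def run_fresh(script, iterations):
    """Run the script in a brand-new environment each iteration."""
    results = []
    for _ in range(iterations):
        env = {}  # created per iteration, so no stale state survives
        exec(script, env)
        results.append(env["counter"])
    return results


print(run_reused(SCRIPT, 3))  # [1, 2, 3]: state persists between runs
print(run_fresh(SCRIPT, 3))   # [1, 1, 1]: each run starts clean
```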

Fixes: https://tracker.ceph.com/issues/74839
Signed-off-by: Rotem Shapira <rotem.rs@gmail.com>
2 weeks ago test/crimson/seastore/test_seastore: add clone removal test 66245/head
Samuel Just [Fri, 13 Feb 2026 23:50:03 +0000 (15:50 -0800)]
test/crimson/seastore/test_seastore: add clone removal test

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago test/crimson/.../test_object_data_handler: add multiple clone/overwrite test case
Samuel Just [Mon, 8 Dec 2025 19:22:48 +0000 (11:22 -0800)]
test/crimson/.../test_object_data_handler: add multiple clone/overwrite test case

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago test/crimson/.../test_object_data_handler.cc: add support for clones
Samuel Just [Fri, 5 Dec 2025 00:23:48 +0000 (16:23 -0800)]
test/crimson/.../test_object_data_handler.cc: add support for clones

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../transaction_manager: add logging to remap_mappings
Samuel Just [Mon, 8 Dec 2025 18:10:51 +0000 (10:10 -0800)]
crimson/.../transaction_manager: add logging to remap_mappings

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../lba_manager: add formatter for remap_entry_t
Samuel Just [Mon, 8 Dec 2025 18:10:28 +0000 (10:10 -0800)]
crimson/.../lba_manager: add formatter for remap_entry_t

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../object_data_handler: fix LOG_PREFIX for do_clone
Samuel Just [Mon, 8 Dec 2025 17:21:59 +0000 (09:21 -0800)]
crimson/.../object_data_handler: fix LOG_PREFIX for do_clone

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../lba_manager: convert remap_mappings to use cursors
Samuel Just [Tue, 21 Oct 2025 21:59:58 +0000 (21:59 +0000)]
crimson/.../lba_manager: convert remap_mappings to use cursors

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../lba_manager: convert reserve_region to use cursor
Samuel Just [Mon, 20 Oct 2025 23:55:00 +0000 (23:55 +0000)]
crimson/.../lba_manager: convert reserve_region to use cursor

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../btree_lba_manager: simplify _update_mapping_ret, remove update_mapping_re...
Samuel Just [Sat, 18 Oct 2025 00:54:08 +0000 (17:54 -0700)]
crimson/.../btree_lba_manager: simplify _update_mapping_ret, remove update_mapping_ret_bare_t

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../btree_lba_manager: convert _update_mapping to coroutine
Samuel Just [Sat, 18 Oct 2025 00:34:57 +0000 (17:34 -0700)]
crimson/.../btree_lba_manager: convert _update_mapping to coroutine

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../btree_lba_manager: remove update_refcount, simplify _update_mapping retur...
Samuel Just [Fri, 17 Oct 2025 23:06:34 +0000 (23:06 +0000)]
crimson/.../btree_lba_manager: remove update_refcount, simplify _update_mapping return value

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../transaction_manager: convert remap_mappings to coroutine
Samuel Just [Fri, 17 Oct 2025 22:07:57 +0000 (22:07 +0000)]
crimson/.../transaction_manager: convert remap_mappings to coroutine

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../btree_lba_manager: convert remap_mappings to coroutine
Samuel Just [Thu, 16 Oct 2025 01:45:33 +0000 (18:45 -0700)]
crimson/.../btree_lba_manager: convert remap_mappings to coroutine

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../lba_manager: convert clone_mapping to use cursors
Samuel Just [Thu, 16 Oct 2025 01:26:52 +0000 (01:26 +0000)]
crimson/.../lba_manager: convert clone_mapping to use cursors

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../transaction_manager: convert clone_pin to coroutine
Samuel Just [Thu, 16 Oct 2025 00:54:26 +0000 (00:54 +0000)]
crimson/.../transaction_manager: convert clone_pin to coroutine

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../btree_lba_manager: convert clone_mapping to coroutine
Samuel Just [Wed, 15 Oct 2025 22:47:30 +0000 (22:47 +0000)]
crimson/.../btree_lba_manager: convert clone_mapping to coroutine

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../btree_lba_manager: convert get_end_mapping to return LBACursorRef
Samuel Just [Wed, 15 Oct 2025 21:58:34 +0000 (21:58 +0000)]
crimson/.../btree_lba_manager: convert get_end_mapping to return LBACursorRef

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../btree_lba_manager: convert get_end_mapping to coroutine
Samuel Just [Wed, 15 Oct 2025 21:53:20 +0000 (21:53 +0000)]
crimson/.../btree_lba_manager: convert get_end_mapping to coroutine

Signed-off-by: Samuel Just <sjust@redhat.com>
2 weeks ago crimson/.../transaction_manager: remove LBAMapping update_mapping variant
Samuel Just [Tue, 14 Oct 2025 00:03:46 +0000 (00:03 +0000)]
crimson/.../transaction_manager: remove LBAMapping update_mapping variant

Signed-off-by: Samuel Just <sjust@redhat.com>