git.apps.os.sepia.ceph.com Git

mgr/dashboard: add multiple ceph users deletion

Fixes: https://tracker.ceph.com/issues/72752
Signed-off-by: Pedro Gonzalez Gomez <pegonzal@ibm.com>
(cherry picked from commit 14ca16576d16de49c07725fb4b0feb112c8a1a43)

mgr/dashboard: fix SMB custom DNS button and linked_to_cluster col

- The button 'add custom DNS' in smb cluster form should only appear for active directory where is relevant.

- The linked_to_cluster column data is missing from smb standalone

- Some refactoring to remove magic strings and use FormControl for publicAddrs field

Fixes: https://tracker.ceph.com/issues/73096
Signed-off-by: Pedro Gonzalez Gomez <pegonzal@ibm.com>
(cherry picked from commit 9ce943e21558d17b3a214840b39bb57eab0cbd85)

cephadm: add ubuntu 24.04 container build test for completeness

Signed-off-by: John Mulligan <jmulligan@redhat.com>
(cherry picked from commit e915b3963720a8424e7718fac09bf6954c9e8400)

cephadm: enable test case for centos10 cephadm rpm build

Now that the build script is updated we can enable the test for
centos 10 based rpm sourced cephadm builds.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
(cherry picked from commit 16530fde5e4a84c5690da760f4be5dd91131a69b)

cephadm: support cephadm rpm based builds without top_level.txt

Signed-off-by: John Mulligan <jmulligan@redhat.com>
(cherry picked from commit 26a499a8da339d870af193ea964368afbc84c694)

cephadm: add centos 10 container images for cephadm build tests

Signed-off-by: John Mulligan <jmulligan@redhat.com>
(cherry picked from commit 32e98c484ac1ee518d8a479f17ccf4c5b7a7264b)

cephadm: remove centos 8 from the cephadm build suite containers

Signed-off-by: John Mulligan <jmulligan@redhat.com>
(cherry picked from commit 4d3e7b6bcb85f703823b6c414f414a9ff4f379aa)

cephadm: fix some issues running existing cephadm build tests

As time has marched on and people changed things our tests no longer
match the expected inputs.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
(cherry picked from commit 31c8010faa417ca53614bd30379a9b9c0c9199de)

mgr/dashboard: Form retains old data when switching from edit to create mode

Fixes: https://tracker.ceph.com/issues/72989
Signed-off-by: pujashahu <pshahu@redhat.com>
(cherry picked from commit 918dff407d912b3a5ac068e0050467396668163c)

client: check if inode ref is dir before proceeding with lookup

Lookup on non-dir inode is incorrect. Client::_lookup did check for it
but the order needed to be correct i.e. check for non-dir inode before
anything.

Fixes: https://tracker.ceph.com/issues/70553
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
(cherry picked from commit aa6ffa5159f6c9e14000a7501a394b1185b63078)

client: do not open dir for a non-dir inode

adds an extra ref to the root which leads to ref leaks:
2025-03-26T15:37:55.162+0530 7fadb8d41640 1 client.4293 dump_inode: DISCONNECTED inode 0x1 #0x1 ref 2 0x1.head(faked_ino=0 nref=2 ll_ref=0 cap_refs={1024=0} open={} mode=41777 size=0/0 nlink=1 btime=2025-03-26T15:37:39.967146+0530 mtime=2025-03-26T15:37:45.113229+0530 ctime=2025-03-26T15:37:45.136266+0530 change_attr=3 caps=pAsLsXs(0=pAsLsXs) COMPLETE has_dir_layout 0x7fad78006970)

which leads to:
2025-03-26T15:37:55.162+0530 7fadb8d41640 2 client.4293 cache still has 1+1 items, waiting (for caps to release?)

Fixes: https://tracker.ceph.com/issues/70553
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
(cherry picked from commit 23b6b435835a095ceb10de92c3153ec1351d259c)

qa: test unmount hang using high/low level APIs

Fixes: https://tracker.ceph.com/issues/70553
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
(cherry picked from commit cbf1c5bb39b146f2457fe9c30c91122bcc7d442f)

test/libcephfs: use more entries to reproduce snapdiff fragmentation
issue.

Snapdiff listing fragments have different boundaries in Reef and Squid+
releases hence original reproducer (made for Reef) doesn't work properly
in S+ releases. This patch fixes that at cost of longer execution.
This might be redundant/senseless when backporting to Reef.

Related-to: https://tracker.ceph.com/issues/72518
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 23397d32607fc307359d63cd651df3c83ada3a7f)

mds: rollback the snapdiff fragment entries with the same name if needed.

This is required when more entries with the same name don't fit into the
fragment. With the existing means for fragment offset specification such a splitting to be
prohibited.

Fixes: https://tracker.ceph.com/issues/72518
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 24955e66f4826f8623d2bec1dbfc580f0e4c39ae)

test/libcephfs: Polisihing SnapdiffDeletionRecreation case

Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit daf3350621cfafa383cd9deea81b60b775a53093)

Test failure: LibCephFS.SnapdiffDeletionRecreation
Reproduces: https://tracker.ceph.com/issues/72518
Signed-off-by: Md Mahamudur Rahaman Sajib <mahamudur.sajib@croit.io>
(cherry picked from commit 4ff71386ac1529dc1f7c2640511f509bd6842862)
(cherry picked from commit 48f5a5d04fb2cef52c5e4a3daf452ccf988666d2)

mon/FSCommands: avoid unreachable code triggering compiler warning

    In file included from /home/pdonnell/ceph/src/mds/FSMap.h:31,
                     from /home/pdonnell/ceph/src/mon/PaxosFSMap.h:20,
                     from /home/pdonnell/ceph/src/mon/MDSMonitor.h:26,
                     from /home/pdonnell/ceph/src/mon/FSCommands.cc:17:
    /home/pdonnell/ceph/src/mds/MDSMap.h: In member function ‘int FileSystemCommandHandler::set_val(Monitor*, FSMap&, MonOpRequestRef, const cmdmap_t&, std::ostream&, FileSystemCommandHandler::fs_or_fscid, std::string, std::string)’:
    /home/pdonnell/ceph/src/mds/MDSMap.h:223:40: warning: ‘fsp’ may be used uninitialized in this function [-Wmaybe-uninitialized]
      223 |   bool test_flag(int f) const { return flags & f; }
          |                                        ^~~~~
    /home/pdonnell/ceph/src/mon/FSCommands.cc:417:21: note: ‘fsp’ was declared here
      417 |   const Filesystem* fsp;
          |                     ^~~

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit 24acaaf766336466021caadba1facd2901775435)

pybind/mgr: pin cheroot version in requirements-required.txt

With python 3.10 (didn't seem to happen with python 3.12) the
pybind/mgr/cephadm/tests/test_node_proxy.py test times out.
This appears to be related to a new release of the cheroot
package and a github issues describing the same problem
we're seeing has been opened by another user
https://github.com/cherrypy/cheroot/issues/769

It is worth noting that the workaround described in that
issue does also work for us. If you add

```
import cheroot
cheroot.server.HTTPServer._serve_unservicable = lambda: None
```

after the existing imports in test_node_proxy.py the
test hanging issue also disappears. Also worth noting the
particular pin of

cheroot~=10.0

was chosen as it matches the existing pin being used
in pybind/mgr/dashboard/constraints.txt

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit 6231955b5d00ae6b3630ee94e85b2449092ef0fe)

build-with-container: add argument groups to organize options

Use the argparse add_argument_group feature to organize the mass of
arguments into more sensible categories. Hopefully, someone reading
over the `--help` output can now more easily see options that
are useful rather than being overwhelmed by a wall of text.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
(cherry picked from commit 71a1be4dd0aea004da56c2f518ee70a281a3f7d3)

radosgw-admin: Pass max_entries for bucket list

The changes in [1] did not take into account that
radosgw-admin code calls `RGWBucketAdminOp::info`
directly and passes a `RGWBucketAdminOpState`
struct where max_entries is not initialized so
we should not assume that it's zero.

This in turn broke the `bucket list --uid foo` and
`bucket stats --uid foo` commands as the output was
changed and thus not keeping backward compatibility.

This change makes sure that we populate max_entries
in `RGWBucketAdminOpState` if `--max-entries` argument
was specified otherwise we set it to zero to keep the
backward compatibility in the output format.

[1] https://github.com/ceph/ceph/pull/62777

Fixes: https://tracker.ceph.com/issues/72049
Signed-off-by: Tobias Urdin <tobias.urdin@binero.com>
(cherry picked from commit 3909c6554cdfcf1b05b5e32297b2e65e9c67af2b)

rgw/qa: Move admin pagination tests

Move the tests into qa directory and add it to the
rgw/verify suite so that we can run it in teuthology.

Signed-off-by: Tobias Urdin <tobias.urdin@binero.com>
(cherry picked from commit 57cbc9b6599be1e84c5bc209936080c4a04bb891)

rgw/doc: Add doc for admin bucket list pagination

This adds the documentation for the admin bucket list
pagination change.

Signed-off-by: Tobias Urdin <tobias.urdin@binero.com>
(cherry picked from commit 7fa025e08a21b03ce91556ffb936b9f26ffdc00f)

rgw/admin: Add max-entries and marker to bucket list

This adds pagination to the /admin/bucket endpoint for the
Admin API.

If a user has a lot of buckets the /admin/bucket endpoint
that is listing buckets can be so long that the HTTP request
gets a timeout.

This adds the ``max-entries`` and ``marker`` query parameters
to the API to support pagination. If ``max-entries`` is given
we introduce a new format for the HTTP response body the same
way that metadata API does, if it's not given we return the
response with the same body as before and thus retaining the
backward compatibility of the API.

This adds a Python3 based test suite that tests all of this
functionality to verify the behaviour and the HTTP response
body itself.

This fixes the pagination mentioned in tracker [1] and thus
fixes (or atleast partially fixes) that.

[1] https://tracker.ceph.com/issues/22168

Fixes: https://tracker.ceph.com/issues/22168
Signed-off-by: Tobias Urdin <tobias.urdin@binero.com>
(cherry picked from commit 1d5523ec0bec916e0a87fdcb8d27b67753e477b6)

RGW: multi object delete op; skip olh update for all deletes but the last one

Fixes: https://tracker.ceph.com/issues/72375
Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
(cherry picked from commit 9bb170104446bfea0ad87b34244f3a3d47962fcc)

Conflicts:
    src/rgw/rgw_op.cc
    src/rgw/rgw_op.h
- RGWDeleteMultiObj kept the vector of objects to be deleted as "rgw_obj_key"
  rather than "RGWMultiDelObject".
- RGWDeleteMultiObj::execute didn't factor out the object deletions into
  "handle_objects" helper method.
- There was no check whether RGWDeleteMultiObj::execute is already running in
  a coroutine or not before handling objects.

mgr/cephadm: don't mark nvmeof daemons without pool and group in name as stray

Cephadm's naming of these daemons always includes the pool and
group name associated with the nvmeof service. Nvmeof recently
has started to register with the cluster using names that
don't include that, resulting in warnings likes

```
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
    stray daemon nvmeof.vm-01.hwwhfc on host vm-01 not managed by cephadm
```

where cephadm knew that nvmeof daemon as

```
[ceph: root@vm-00 /]# ceph orch ps --daemon-type nvmeof
NAME                            HOST   PORTS                   STATUS   REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID
nvmeof.foo.group1.vm-01.hwwhfc  vm-01  *:5500,4420,8009,10008  stopped     5m ago  25m        -        -  <unknown>  <unknown>
```

Signed-off-by: Adam King <adking@redhat.com>
(cherry picked from commit 695680876eb8af0891e3776888b6361dc8728c86)

doc/mgr/smb: document the 'provider' option for smb share

Signed-off-by: Sachin Prabhu <sp@spui.uk>
(cherry picked from commit 742659b18a21cd8ccc36a0f0a53bea265a13a541)
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>

mds: fix test that directory has no snaps

To look if the directory's first is beyond the last snap. This matches the behavior of lssnaps.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Fixes: https://tracker.ceph.com/issues/71462
(cherry picked from commit c22db4e683cf2e6b0decc937e9ab92ba15d46487)

qa: test for child dir with first beyond parent snaps

If the parent directory has snapshots but the child was created after, then we
should be able to modify its charmap.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Fixes: https://tracker.ceph.com/issues/71462
(cherry picked from commit 659e4262d042dc50a381846c25640c76a06bdec2)

qa: remove extraneous directory from test

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Fixes: https://tracker.ceph.com/issues/71462
(cherry picked from commit 7678dbfd8830141ece420fde66bbb1687c616206)

qa: correct test description

This test is checking for failure conditions.

Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Fixes: https://tracker.ceph.com/issues/71462
(cherry picked from commit c428149b9cb12c9e9b90d305131b669211a56b4b)

mds/MDSDaemon: unlock `mds_lock` while shutting down Beacon and others

This fixes a deadlock bug during MDS shutdown:

- the "signal_handler" thread receives the shutdown signal and invokes
  MDSDaemon::suicide() while holding `mds_lock`

- MDSDaemon::suicide() invokes Beacon::send_and_wait() while still
  holding `mds_lock`

- meanwhile, all "ms_dispatch" threads get stuck waiting for
  `mds_lock`, for example in MDCache::upkeep_main() or
  MDSDaemon::ms_dispatch2()

- Beacon::send_and_wait() waits for a `MSG_MDS_BEACON` packet to be
  dispatched (via `cvar` with a timeout)

At this point, even if a `MSG_MDS_BEACON` packet is received by one of
the worker threads, they will put it in the `DispatchQueue`, but no
dispatcher thread will be able to handle it because they are all
stuck.  The cvar.wait_for() call in Beacon::send_and_wait() will
therefore time out and the `MSG_MDS_BEACON` will never be processed.

The proper solution is to unlock `mds_lock` to avoid the dispatchers
from getting stuck.  And in general, we should be holding a lock
strictly only when it is needed and never do blocking calls while
holding a lock.

Fixes: https://tracker.ceph.com/issues/68760
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
(cherry picked from commit c0fedb52fa61cd4a847f5be7eb66629f9b7b6ab7)

qa/suites/upgrade: update ignorelist with cephfs specific warnings (under stress-split)

The warnings are expected as the MDSs are upgraded and restarted.

Fixes: http://tracker.ceph.com/issues/71615
Signed-off-by: Venky Shankar <vshankar@redhat.com>
(cherry picked from commit aeefebd81b98332494d8a8104c43e2831bf6a869)

qa/suites/upgrade: add "Replacing daemon mds" to ignorelist

Since this warning is expected when monitors replace an active
MDS daemon due to an higher affinity standby.

Fixes: http://tracker.ceph.com/issues/50279
Signed-off-by: Venky Shankar <vshankar@redhat.com>
(cherry picked from commit 6dffa503a2b4d45bac2d303fda193e3b19084a8f)

mds: skip charmap handler check for MDS requests

The MDS uses a rename request to move the primary link to a remote link. For
these requests, there will be no session.

Fixes: https://tracker.ceph.com/issues/72349
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit ca35c0de42861352bee4cfaeda6e83a7aa0bd094)

qa: test for charmap handling on reintegration

The MDS uses a rename request to move the primary link to a remote link. For
these requests, there will be no session.

Fixes: https://tracker.ceph.com/issues/72349
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
(cherry picked from commit 5d983a46b010144b71f8c5d092d30a4942d8e6e7)

tasks/cephfs: Use different errmsg for invalid dir

During test_df_for_invalid_directory, path_walk is now called.
Use a more general error message as more errnos can be returned
and this will be a better catch all.

Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit 45776aa32e27f14fb249cf3b52fa441b3a22a290)

client: bring client_lock out of statfs helper method

Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit dd7bdd32381ec81104f8399904c770f1ba0485b7)

client: move mref_reader check in statfs out of helper

Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit 9ab1209a68d450eca9bf915a605200ab92e53926)

test: Add test for libcephfs statfs

Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit 61c13a9d097353a924436f9e6c2b95984d12485d)

client: get quota root based off of provided inode in statfs

In statfs, get quota_root for inode provided. Check if a quota
is directly applied to inode. If not, reverse tree walk up and
maybe find a quota set higher up the tree.

Fixes: https://tracker.ceph.com/issues/72355
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit 85f9fc29d202e1b050e50ad8bae13d7751ef28db)

client: use path supplied in statfs

If path provided, use in statfs. Replumb internal statfs
for internal only to allow for use in ll_statfs and statfs

Fixes: https://tracker.ceph.com/issues/72355
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit 0c6a8add81c61960b733ce3ec38f7bbb3b5e93e9)

mdstypes: Dump export_ephemeral_random_pin as double

Fixes: https://tracker.ceph.com/issues/71944
Signed-off-by: Enrico Bocchi <enrico.bocchi@cern.ch>
(cherry picked from commit 12b9d735006b5ee8b47e49d9aa6b693f7ce7b952)

release note: add a note that "subvolume info" cmd output can also...

contain "source field" in it.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 37244a7972b9842e3cb8f8ddbd4a1d9ca875bbe3)

doc/cephfs: update docs since "subvolume info" cmd output can also...

contain source field in it.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 0a029149df824ef0f026af912925d809b31862ef)

qa/cephfs: add test to check clone source info's present in...

.meta file of a cloned subvolume after cloning is finished and in the
output of "ceph fs subvolume info" command.

Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 44cacaf1e46b46f96ac3641cad02f27c4c753532)

mgr/vol: show clone source info in "subvolume info" cmd output

Include clone source information in output of "ceph fs subvolume info"
command so that users can access this information conveniently.

Fixes: https://tracker.ceph.com/issues/71266
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 0ef6da69d993ca58270010e0b458bad0dff29034)

mgr/vol: keep clone source info even after cloning is finished

Instead of removing the information regarding source of a cloned
subvolume from the .meta file after the cloning has finished, keep it as
it is as the user may find it useful.

Fixes: https://tracker.ceph.com/issues/71266
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit bbacfdae1a2e7eb60f91b75852bcd1096b6e3c84)

qa: Add test for subvolume_ls on osd full

Fixes: https://tracker.ceph.com/issues/72260
Signed-off-by: Kotresh HR <khiremat@redhat.com>
(cherry picked from commit 8547e57ebc4022ca6750149f49b68599a8af712e)

mds: Fix readdir when osd is full.

Problem:
The readdir wouldn't list all the entries in the directory
when the osd is full with rstats enabled.

Cause:
The issue happens only in multi-mds cephfs cluster. If rstats
is enabled, the readdir would request 'Fa' cap on every dentry,
basically to fetch the size of the directories. Note that 'Fa' is
CEPH_CAP_GWREXTEND which maps to CEPH_CAP_FILE_WREXTEND and is
used by CEPH_STAT_RSTAT.

The request for the cap is a getattr call and it need not go to
the auth mds. If rstats is enabled, the getattr would go with
the mask CEPH_STAT_RSTAT which mandates the requirement for
auth-mds in 'handle_client_getattr', so that the request gets
forwarded to auth mds if it's not the auth. But if the osd is full,
the indode is fetched in the 'dispatch_client_request' before
calling the handler function of respective op, to check the
FULL cap access for certain metadata write operations. If the inode
doesn't exist, ESTALE is returned. This is wrong for the operations
like getattr, where the inode might not be in memory on the non-auth
mds and returning ESTALE is confusing and client wouldn't retry. This
is introduced by the commit 6db81d8479b539d which fixes subvolume
deletion when osd is full.

Fix:
Fetch the inode required for the FULL cap access check for the
relevant operations in osd full scenario. This makes sense because
all the operations would mostly be preceded with lookup and load
the inode in memory or they would handle ESTALE gracefully.

Fixes: https://tracker.ceph.com/issues/72260
Introduced-by: 6db81d8479b539d3ca6b98dc244c525e71a36437
Signed-off-by: Kotresh HR <khiremat@redhat.com>
(cherry picked from commit 1ca8f334f944ff78ba12894f385ffb8c1932901c)

qa/cephfs: fix test_subvolume_group_charmap_inheritance test

Signed-off-by: Venky Shankar <vshankar@redhat.com>
(cherry picked from commit fe3d6417bfaef571a9bb4093b5a8dfdb3cc3e59d)

doc: add name mangling documentation for subvolume group creation

Signed-off-by: Xavi Hernandez <xhernandez@gmail.com>
(cherry picked from commit b47bbf8afdfcb81ee8aed7ef9c27b45dd8d5a589)

qa: add tests for name mangling in subvolume group creation

Signed-off-by: Xavi Hernandez <xhernandez@gmail.com>
(cherry picked from commit ea2d8e9fc04f249d576e4799c3bdc44302cf1226)

pybind/mgr: add name mangling options to subvolume group creation

Signed-off-by: Xavi Hernandez <xhernandez@gmail.com>
(cherry picked from commit f98990ac1bbdf4ca0f05ea0336289cb32001159f)

mgr/volumes: Fix json.loads for test on mon caps

Signed-off-by: Enrico Bocchi <enrico.bocchi@cern.ch>
(cherry picked from commit b008ef9eb690618608f902c67f8df1fb8a587e33)

mgr/volumes: Add test for mon caps if auth key has remaining mds/osd caps

Signed-off-by: Enrico Bocchi <enrico.bocchi@cern.ch>
(cherry picked from commit 403d5411364e2fddd70d98a6f120b26e416c1d99)

mgr/volumes: Keep mon caps if auth key has remaining mds/osd caps

Signed-off-by: Enrico Bocchi <enrico.bocchi@cern.ch>
(cherry picked from commit 0882bbe8a4470f82993d87b7c02b19aa7fe7fbcc)

workunits/rados: remove cache tier test

Fixes: https://tracker.ceph.com/issues/71930
Signed-off-by: Nitzan Mordechai <nmordec@ibm.com>
(cherry picked from commit 2b57f435a1de1a99dd7bcb47478938965587713b)

qa/suites/rados/thrash-old-clients: Add OSD warnings to ignore list

Add these to ignorelist:
- OSD_HOST_DOWN
- OSD_ROOT_DOWN

These warnings cause Teuthology tests to fail during OSD setup. They
are unrelated to RADOS testing and occur due to the nature of
thrashing tests, where OSDs temporarily mark themselves as down.

If an OSD is the only one in the cluster at the time, the cluster
may incorrectly detect the host as down, even though other OSDs
are still starting up.

Fixes: https://tracker.ceph.com/issues/70972
Signed-off-by: Naveen Naidu <naveen.naidu@ibm.com>
(cherry picked from commit 36b96ad56f0d364930b8841ebe36335756e489a9)

blk/kernel: improve DiscardThread life cycle.

This will eliminate a potential race between thread startup and its
removal.

Relates-to: https://tracker.ceph.com/issues/71800
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
(cherry picked from commit 69369b151b96ca74bffb9d72f4c249f48fde2845)

mgr/dashboard: fix missing schedule interval in rbd API

Fetching the rbd image schedule interval through the rbd_support module
schedule list command

GET /api/rbd will have the following field per image
```
"schedule_info": {
                    "image": "rbd/rbd_1",
                    "schedule_time": "2025-09-11 03:00:00",
                    "schedule_interval": [
                        {
                            "interval": "5d",
                            "start_time": null
                        },
                        {
                            "interval": "3h",
                            "start_time": null
                        }
                    ]
                },
```

Also fixes the UI where schedule interval was missing in the form and
also disable editing the schedule_interval.

Extended the same thing to the `GET /api/pool` endpoint.

Fixes: https://tracker.ceph.com/issues/72977
Signed-off-by: Nizamudeen A <nia@redhat.com>
(cherry picked from commit 72cebf0126bd07f7d42b0ae7b68646c527044942)

options/mon: disable availability tracking by default

Signed-off-by: Shraddha Agrawal <shraddhaag@ibm.com>
(cherry picked from commit ef7effaa33bd6b936d7433e668d36f80ed7bee65)

osd: Optimized EC incorrectly rolled backwards write

A bug in choose_acting in this scenario:

* Current primary shard has been absent so has missed the latest few writes
* All the recent writes are partial writes that have not updated shard X
* All the recent writes have completed

The authorative shard is chosen from the set of primary-capable shards
that have the highest last epoch started, these have all got log entries
for the recent writes.

The get log shard is chosen from the set of shards that have the highest
last epoch started, this chooses shard X because its furthest behind

The primary shard last update is not less than get log shard last
update so this if statement decides that it has a good enough log:

if ((repeat_getlog != nullptr) &&
    get_log_shard != all_info.end() &&
    (info.last_update < get_log_shard->second.last_update) &&
    pool.info.is_nonprimary_shard(get_log_shard->first.shard)) {

We then proceed through peering using the primary log and the
log from shard X. Neither have details about the recent writes
which are then incorrectly rolled back.

The if statement should be looking at last_update for the
authorative shard rather than the get_log_shard, the code
would then realize that it needs to get the log from the
authorative shard first and then have a second pass
where it gets the log from the get log shard.

Peering would then have information about the partial writes
(obtained from the authorative shards log) and could correctly
roll these writes forward by deducing that the get_log_shard
didn't have these log entries because they were partial writes.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit ac4e0926bbac4ee4d8e33110b8a434495d730770)

osd: Clear zero_for_decode for shards where read failed on recovery

Not clearing this can lead to a failed decode, which panics, rather than
a recovery or IO failure.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 6365803275b1b6a142200cc2db9735d48c86ae03)

osd: Reduce buffer-printing debug strings to debug level 30

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
# Conflicts:
# src/osd/ECBackend.cc
(cherry picked from commit b4ab3b1dcef59a19c67bb3b9e3f90dfa09c4f30b)

osd: Fix segfault in EC debug string

The old debug_string implementation was potentially reading up to 3
bytes off the end of an array. It was also doing lots of unnecessary
bufferlist reconstructs. This refactor of this function fixes both
issues.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit da3ccdf4d03e40b747f8876449199102e53e00ce)

osd: Optimized EC backfill interval has wrong versions

Bug in the optimized EC code creating the backfill
interval on the primary. It is creating a map with
the object version for each backfilling shard. When
there are multiple backfill targets the code was
overwriting oi.version with the version
for a shard that has had partial writes which
can result in the object not being backfilled.

Can manifest as a data integirty issue, scrub
error or snapshot corruption.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit acca514f9a3d0995b7329f4577f6881ba093a429)

osd: Optimized EC choose_acting needs to use best primary shard

There have been a couple of corner case bugs with choose_acting
with optimized EC pools in the scenario where a new primary
with no existing log is choosen and find_best_info selects
a non-primary shard as the authorative shard.

Non-primary shards don't have a full log so in this scenario
we need to get the log from a shard that does have a complete
log first (so our log is ahead or eqivalent to authorative shard)
and then repeat the get log for the authorative shard.

Problems arise if we make different decisions about the acting
set and backfill/recovery based on these two different shards.
In one bug we osicillated between two different primaries
because one primary used one shard to making peering decisions
and the other primary used the other shard, resulting in
looping flip/flop changes to the acting_set.

In another bug we used one shard to decide that we could do
async recovery but then tried to get the log from another
shard and asserted because we didn't have enough history in
the log to do recovery and should have choosen to do a backfill.

This change makes optimized EC pools always choose the
best !non_primary shard when making decisions about peering
(irrespective of whether the primary has a full log or not).
The best overall shard is now only used for get log when
deciding how far to rollback the log.

It also sets repeat_getlog to false if peering fails because
the PG is incomplete to avoid looping forever trying to get
the log.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit f3f45c2ef3e3dd7c7f556b286be21bd5a7620ef7)

osd: Do not sent PDWs if read count > k

The main point of PDW (as currently implemented) is to reduce the amount
of reading performed by the primary when preparing for a read-modify-write (RMW).

It was making the assumption that if any recovery was required by a
conventional RMW, then a PDW is always better. This was an incorrect assumption
as a conventional RMW performs at most K reads for any plugin which
supports PDW. As such, we tweak this logic to perform a conventional RMW
if the PDW is going to read k or more shards.

This should improve performance in some minor areas.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit cffd10f3cc82e0aef29209e6e823b92bdb0291ce)

osd: Fix decode for some extent cache reads.

The extent cache in EC can cause the backend to perform some surprising reads. Some
of the patterns were discovered in test that caused the decode to attempt to
decode more data than was anticipated during the read planning, leading to an
assert. This simple fix reduces the scope of the decode to the minimum.

Signed-off-by: Alex Ainscow <aainscow@uk.ibm.com>
(cherry picked from commit 2ab45a22397112916bbcdb82adb85f99599e03c0)

osd: Optimized EC calculate_maxles_and_minlua needs to use ...
exclude_nonprimary_shards

When an optimized EC pool is searching for the best shard that
isn't a non-primary shard then the calculation for maxles and
minlua needs to exclude nonprimary-shards

This bug was seen in a test run where activating a PG was
interrupted by a new epoch and only a couple of non-primary
shards became active and updated les. In the next epoch
a new primary (without log) failed to find a shard that
wasn't non-primary with the latest les. The les of
non-primary shards should be ignored when looking for
an appropriate shard to get the full log from.

This is safe because an epoch cannot start I/O without
at least K shards that have updated les, and there
are always K-1 non-primary shards. If I/O has started
then we will find the latest les even if we skip
non-primary shards. If I/O has not started then the
latest les ignoring non-primary shards is the
last epoch in which I/O was started and has a good
enough log+missing list.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 72d55eec85afa4c00fac8dc18a1fb49751e61985)

osd: Optimized EC choose_async_recovery_ec must use auth_shard

Optimized EC pools modify how GetLog and choose_acting work,
if the auth_shard is a non-primary shard and the (new) primary
is behind the auth_shard then we cannot just get the log from
the non-primary shard because it will be missing entries for
partial writes. Instead we need to get the log from a shard
that has the full log first and then repeat GetLog to get
the log from the auth_shard.

choose_acting was modifying auth_shard in the case where
we need to get the log from another shard first. This is
wrong - the remainder of the logic in choose_acting and
in particular choose_async_recovery_ec needs to use the
auth_shard to calculate what the acting set will be.
Using a different shard occasional can cause a
different acting set to be selected (because of
thresholds about the number of log entries behind
a shard needs to be to perform async recovery) and
this can lead to two shards flip/flopping with
different opinions about what the acting set should be.

Fix is to separate out which shard will be returned
to GetLog from the auth_shard which will be used
for acting set calculations.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 3c2161ee7350a05e0d81a23ce24cd0712dfef5fb)

osd: Optimized EC don't try to trim past crt

If there is an exceptionally long sequence of partial writes
that did not update a shard that is followed by a full write
then it is possible that the log trim point is ahead of the
previous write to the shard (and hence crt). We cannot trim
beyond crt. In this scenario its fine to limit the trim to crt
because the shard doesn't have any of the log entries for the
partial writes so there is nothing more to trim.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 645cdf9f61e79764eca019f58a4d9c6b51768c81)

osd: Optimized EC missing call to apply_pwlc after updating pwlc

update_peer_info was updating pwlc with a newer version received
from another shard, but failed to update the peer_info's to
reflect the new pwlc by calling apply_pwlc.

Scenario was primary receiving an update from shard X which had
newer information about shard Y. The code was calling apply_pwlc
for shard X but not for shard Y.

The fix simplifies the logic in update_peer_info - if we are
the primary update all peer_info's that have pwlc. If we
are a non-primary and there is pwlc then update info.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit d19f3a3bcbb848e530e4d31cbfe195973fa9a144)

osd: Optimized EC don't apply pwlc for divergent writes

Split pwlc epoch into a separate variable so that we
can use epoch and version number when comparing if
last_update is within a pwlc range. This ensures that
pwlc is not applied to a shard that has a divergent
write, but still tracks the most recent update of pwlc.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit d634f824f229677aa6df7dded57352f7a59f3597)

osd: Optimized EC present_shards no longer needed

present_shards is no longer needed in the PG log entry, this has been
replaced with code in proc_master_log that calculates which shards were
in the last epoch started and are still present.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 880a17e39626d99a0b6cc8259523daa83c72802c)

osd: Optimized EC proc_master_log fix roll-forward logic when shard is absent

Fix bug in optimized EC code where proc_master_log incorrectly did not
roll forward a write if one of the written shards is missing in the current
epoch and there is a stray version of that shard that did not receive the
write.

As long as the currently present shards that participated in les and were
updated by a write have the update then the write should be rolled-forward.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit e0e8117769a8b30b2856f940ab9fc00ad1e04f63)

osd: Refactor find_best_info and choose_acting

Refactor find_best_info to have separate function to calculate
maxles and minlua. The refactor makes history_les_bound
optional, tidy up the choose_acting interface removing this
where it is not used.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit f1826fdbf136dc7c96756f0fb8a047c9d9dda82a)

osd: EC Optimizations proc_master_log boundary case bug fixes

Fix a couple of bugs in proc_master_log for optimized EC
pools dealing with boundary conditions such as an empty
log and merging two logs that diverge from the very first
entry.

Refactor the code to handle the boundary conditions and
neaten up the code.

Predicate the code block with if (pool.info.allows_ecoptimizations())
to make it clear this code path is only for optimized EC pools.

Signed-off-by: Bill Scales <bill_scales@uk.ibm.com>
(cherry picked from commit 1b44fd9991f5f46b969911440363563ddfad94ad)

osd: Invalidate stats during peering if we are rolling a shard forwards.

This change will mean we always recalculate stats upon rolling stats forwards. This prevent the situation where we end up with incorrect statistics due to where we always take the stats of the oldest shard during peering; causing outdated pg stats being applied for cases where the oldest shards are shards that don't see partial writes where num_bytes has changed on other places after that point on that shard.

Signed-off-by: Jon Bailey <jonathan.bailey1@ibm.com>
(cherry picked from commit b178ce476f4a5b2bb0743e36d78f3a6e23ad5506)

osd: ECTransaction.h includes OSDMap.h

Needed for crimson.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 6dd393e37f6afb9063c4bed3e573557bd0efb6bd)

osd: bypass messenger for local EC reads

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit b07d1f67625c8b621b2ebf5a7f744c588cae99d3)

osd: fix buildability after get_write_plan() shuffling

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 7f4cb19251345849736e83bd0c7cc15ccdcdf48b)

osd: just shuffle get_write_plan() from ECBackend to ECCommon

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 9d5bf623537b8ee29e000504d752ace1c05964d7)

osd: prepare get_write_plan() for moving from ECBackend to ECCommon

For the sake of sharing with crimson.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit dc5b0910a500363b62cfda8be44b4bed634f9cd6)

osd: separate producing EC's WritePlan out into a dedicated method

For the sake of sharing with crimson in next commits.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit e06c0c6dd08fd6d2418a189532171553d63a9deb)

osd: fix unused variable warning in ClientReadCompleter

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit eb3a3bb3a70e6674f6e23a88dd1b2b86551efda2)

osd: shuffle ECCommon::RecoveryBackend from ECBackend.cc to ECCommon.cc

It's just code movement; there is no changes apart that.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit ef644c9d29b8adaef228a20fc96830724d1fc3f5)

osd: drop junky `#if 1` in recovery backend

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit d43bded4a02532c4612d53fc4418db8e4e829c3f)

osd: move ECCommon::RecoveryBackend from ECBackend.cc to ECCommon.cc

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit debd035a650768ead64e0707028bb862f4767bef)

osd: replace get_obc() with maybe_load_obc() in EC recovery

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 266773625f997ff6a1fda82b201e023948a5c081)

osd: abstract sending MOSDPGPush during EC recovery

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 1d54eaff41ec8d880bcf9149e4c71114e0ffdc09)

osd: prepare ECCommon::RecoveryBackend for shuffling to ECCommon.cc

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit e3ade5167d3671524eb372a028157f2a46e7a219)

osd: squeeze RecoveryHandle out of ECCommon::RecoveryBackend

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 1e0feb73a4b91bd8b7b3ecc164d28fe005b97ed1)

osd: just shuffle RecoveryMessages to ECCommon.h

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit bc28c16a9a83b0f12d3d6463eaeacbab40b0890b)

osd: prepare RecoveryMessages for shuffling to ECCommon.h

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 0581926035113b1a9cb38f76233242d6b32a7dc6)

osd: ECCommon::RecoveryBackend doesn't depend on ECBackend anymore

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 6ead960b23a95211847250d90e3d2945c6254345)

osd: fix buildability after the RecoveryBackend shuffling

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit c9d18cf3024e5ba681bed5dc315f70527e99b3f1)

osd: just shuffle RecoveryBackend from ECBackend.h to ECCommon.h

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit 98480d2f75a7b99aa72562a6a6daa5f39db3d425)

doc: add config option and usage docs

This commit adds docs for the new config option introduced as
well as updates the feature docs on how to use this config
option.

Fixes: https://tracker.ceph.com/issues/72619
Signed-off-by: Shraddha Agrawal <shraddhaag@ibm.com>
(cherry picked from commit 1cbe41bde12eb1d0437b746164edb689393cc5ad)

mon/MgrStatMonitor: update availability score after configured interval

This commit ensures that data availability status is only updated
after the configured interval has elapsed. By default this interval
is setup to be 1s.

Fixes: https://tracker.ceph.com/issues/72619
Signed-off-by: Shraddha Agrawal <shraddhaag@ibm.com>
(cherry picked from commit e9e3d90210922f336950d02bedca2f09d4463dfe)