Alex Ainscow [Mon, 14 Jul 2025 15:55:40 +0000 (16:55 +0100)]
osd: Optimised EC avoids ever reading more than K shards (if plugin supports it).
Plugins which support partial reads should never need more than k shards
to read the data, even if some shards have failed. However, rebalancing commonly
requests k + m shards, as very frequently all shards are moved. If this occurs
and all k + m shards are online, the read will be satisfied by reading ALL shards
rather than just k shards. This commit fixes that issue.
The problem is that we do not want to change the API used by the legacy EC, so we
cannot update the plugin behaviour here. Instead, the EC code itself will reduce
the number of shards it tells minimum_to_decode about.
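The reduction can be pictured with a small sketch, assuming a std::set-based
shard set and a hypothetical helper name (this is not the actual Ceph EC
interface):
```
#include <set>

using shard_id_t = int;   // stand-in for the real shard id type

// Trim the set of shards we ask the plugin about so that at most k shards
// are ever passed to minimum_to_decode(), preferring shards that are
// actually available.
std::set<shard_id_t> trim_want_to_k(const std::set<shard_id_t>& want,
                                    const std::set<shard_id_t>& available,
                                    unsigned k)
{
  std::set<shard_id_t> trimmed;
  for (shard_id_t s : want) {
    if (trimmed.size() >= k) {
      break;                     // never ask for more than k shards
    }
    if (available.count(s)) {
      trimmed.insert(s);
    }
  }
  return trimmed;
}
```
The trimmed set is what then gets handed to minimum_to_decode, so the plugin
never sees more than k candidate shards.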
In a comment we note that bitset_set performance could be improved using _pdep_u64.
This would require fiddly platform-specific code and would likely not show
any performance improvement for most applications. The majority of calls to
this function will be with a bitset that has <=n set bits and will never enter this
if statement. When there are >n bits set we are going to save one or more read I/Os,
so the cost of the for loop is insignificant compared with that saving. I have left
the comment in as a hint to future users of this function.
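For illustration, a sketch of the kind of loop being discussed, using
std::bitset and a hypothetical helper name rather than the real bitset_set
code:
```
// Clear any set bits beyond the first n. _pdep_u64 could select the low n
// set bits in fewer instructions, but this branch is only reached when more
// than n bits are set, which is rare and already saves whole read I/Os.
#include <bitset>
#include <cstddef>

template <std::size_t N>
void keep_first_n_set_bits(std::bitset<N>& bits, std::size_t n)
{
  std::size_t seen = 0;
  for (std::size_t i = 0; i < N; ++i) {
    if (!bits.test(i)) {
      continue;
    }
    if (++seen > n) {
      bits.reset(i);   // each cleared bit is a shard we no longer read
    }
  }
}
```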
Further notes were made in a review comment that are worth recording:
- If performance is limited by the drives, then fewer read I/Os is a clear advantage.
- If performance is limited by the network, then fewer remote read I/Os is a clear advantage.
- If performance is limited by the CPU, then the CPU cost of M unnecessary remote
read I/Os (messenger+bluestore) is almost certainly more than the cost of doing an
extra encode operation to calculate the coding parities.
- If performance is limited by system memory bandwidth, the encode+crc generation
has less overhead than the read + bluestore crc check + messenger overheads.
Longer term this logic should probably be pushed into the plugins, in particular
to give LRC the opportunity to optimize for locality of the shards. The reason for
not doing this now is that it would be messy, because the legacy EC code cannot
support this optimization and LRC isn't yet optimizing for locality.
Alex Ainscow [Wed, 11 Jun 2025 15:24:12 +0000 (16:24 +0100)]
osd: Remove all references to hinfo from optimized EC
Legacy EC used hinfo to store two things:
1. Shard size
2. CRCs of the shards
However:
* Optimized EC stores different object sizes on each shard
* Optimized EC scrub calculates the correct sizes of shards and checks them, so shard size checks are not needed in hinfo.
* Bluestore checks the CRC.
* Seastore checks the CRC.
As such, the hinfo object is redundant, so we remove it in
optimized EC:
1. Remove all references/upgrades to hinfo.
2. Delete hinfo attribute if found on recovery/backfill.
3. Redirect all scrub references for hinfo to legacy EC.
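For illustration only, a sketch of step 2, assuming a plain string-keyed
attribute map and the "hinfo_key" attribute name (both stand-ins, not the
actual backend types):
```
#include <map>
#include <string>

// When optimized EC recovers or backfills a shard, drop any legacy hinfo
// attribute so it is not carried forward. Legacy EC stored shard size and
// CRCs here; optimized EC checks sizes during scrub and relies on
// BlueStore/SeaStore CRCs instead.
void drop_legacy_hinfo(std::map<std::string, std::string>& attrs)
{
  attrs.erase("hinfo_key");
}
```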
Kefu Chai [Fri, 27 Jun 2025 13:56:23 +0000 (21:56 +0800)]
erasure-code/jerasure: fix memory leak in Galois field operations
Fix a memory leak where Galois field acceleration functions created by
ErasureCodeJerasure::prepare() were never freed. ASan detected this as
a one-time leak when the plugin was unloaded.
Add jerasure_finish() destructor function in jerasure_init.cc to free
the allocated Galois field operations. Since jerasure_init.cc and
galois.c are built into the same object library, jerasure_finish() can
access and clean up the global static acceleration functions defined
in galois.c.
The destructor function is automatically called when the shared library
(plugin) is unloaded, ensuring proper cleanup without requiring explicit
calls from client code.
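A sketch of the cleanup pattern described above; the globals and the free call
are illustrative stand-ins for the state that galois.c actually keeps:
```
#include <cstdlib>

// Stand-in for the Galois-field acceleration tables that the plugin
// allocates lazily.
static void* gfp_array[33] = {nullptr};

// Runs automatically when the shared library is unloaded (dlclose) or the
// process exits, so no explicit call from client code is required.
__attribute__((destructor))
static void jerasure_finish()
{
  for (auto& slot : gfp_array) {
    std::free(slot);   // the real code uses the matching GF-complete teardown
    slot = nullptr;
  }
}
```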
qa/standalone/scrub: fix "scrubbed in 0ms" in osd-scrub-test.sh
The specific test looks for a 'last scrub duration' higher than
0 as a sign that the scrub actually ran. Previous code fixes
guaranteed that even a scrub duration as low as 1ms would be
reported as "1" (1s). However, none of the 15 objects created
in this test were designated for the tested PG, which remained
empty. As a result, the scrub duration was reported as "0".
The fix is to create a large enough number of objects so that
at least one of them is mapped to the tested PG.
Warning: Encountered an error when evaluating display name ${{ format('python3 ceph/src/script/redmine-upkeep.py --github-action --git-dir=./ceph/ {0} {1}
', github.event_name == 'workflow_dispatch' && (inputs.debug && '--debug' || '') || '', github.event_name == 'workflow_dispatch' && format('--limit={}', inputs.limit) || '') }}. The template is not valid. .github/workflows/redmine-upkeep.yml (Line: 54, Col: 14): The following format string is invalid: --limit={}
Error: The template is not valid. .github/workflows/redmine-upkeep.yml (Line: 54, Col: 14): The following format string is invalid: --limit={}
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
This script's functionality has been incorporated into the redmine-upkeep.py script:
* A backport which is merged is moved to the Resolved state by
_transform_backport_resolved
* A non-backport issue with all backports resolved/rejected will transition
from 'Pending Backport' to 'Resolved'.
The only thing not done is changing the "Target Version", which is not
particularly useful anymore as hotfixes frequently invalidate it. Rely on
"Released In" instead.
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Patrick Donnelly [Thu, 26 Jun 2025 02:03:13 +0000 (22:03 -0400)]
script: add redmine-upkeep script
This script intends to do all Redmine upkeep-related tasks.
It's organized so that a series of transformations are run on any issue under
examination. These include things like updating the "Merge Commit" and "Fixed
In" fields when the associated PR is merged. The "Released In" field is also
set when the associated merge commit is part of some non-RC release.
Transformations can be run in one of three ways:
* An explicit --issue switch specifies a comma separated list of issues to
operate on.
* An explicit --revision-range switch specifies a Git revision range to look up
first-parent merge commits. For each of those merge commits, try to find
the associated issue with "Merge Commit" equal to that merge commit. If found,
operate on that issue. This is intended to allow updating the "Released In"
field whenever a tag is pushed to ceph.git.
Future work:
* Incorporate backport-create-issue into this script.
* Also update "Target version" if straightforward. This is protected by redmine
whenever a version is locked/closed.
* Anything else we desire to automate!
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>
Merge pull request #62713 from soumyakoduri/wip-skoduri-restore-glacier
rgw/cloud-restore [PART2] : Add Restore support from Glacier/Tape cloud endpoints
Reviewed-by: Adam Emerson <aemerson@redhat.com> Reviewed-by: Jiffin Tony Thottan <thottanjiffin@gmail.com> Reviewed-by: Matt Benjamin <mbenjamin@redhat.com> Reviewed-by: Daniel Gryniewicz <dang@redhat.com>
This should fix the chmod 777 /var/log/ceph failures.
We were missing the install task, which resulted in there being no /var/log/ceph:
```
2025-07-07T08:55:44.586 INFO:teuthology.run_tasks:Running task ceph...
2025-07-07T08:55:44.679 INFO:tasks.ceph:Making ceph log dir writeable by non-root...
2025-07-07T08:55:44.679 DEBUG:teuthology.orchestra.run.smithi144:> sudo chmod 777 /var/log/ceph
2025-07-07T08:55:44.711 INFO:teuthology.orchestra.run.smithi144.stderr:chmod: cannot access '/var/log/ceph': No such file or directory
```
Shachar Sharon [Thu, 26 Jun 2025 06:43:01 +0000 (09:43 +0300)]
mgr/smb: Enable per-share profile counters
Samba's commit 9f8d272 ("vfs_ceph_new: use per-share profile macros")
enables per-share profile counters for the VFS ceph bridge. Enable those by
default for each smb-ceph share.
Matan Breizman [Sun, 8 Jun 2025 10:20:25 +0000 (10:20 +0000)]
crimson/CMakeLists: simplify crimson-common deps
Instead of appending conditional dependencies to crimson-common via
crimson_common_deps and crimson_common_public_deps, use
target_link_libraries directly.
Connor Fawcett [Tue, 24 Jun 2025 11:45:06 +0000 (12:45 +0100)]
Adds a new command-line utility which can check the consistency of objects within an erasure-coded pool.
A new test-only inject tells the EC backend to return both data and parity shards to the client so that they can
be checked for consistency by the new tool.
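A rough sketch of the consistency check such a tool can perform with those
extra shards; the encode callback and shard representation are assumptions,
not the tool's real interface:
```
#include <functional>
#include <string>
#include <vector>

// Re-encode the k data shards and compare the result with the m parity
// shards the OSD returned via the test-only inject. Any mismatch indicates
// an inconsistent object.
bool shards_consistent(
    const std::vector<std::string>& data_shards,    // k shards as read
    const std::vector<std::string>& parity_shards,  // m shards as read
    const std::function<std::vector<std::string>(
        const std::vector<std::string>&)>& encode)  // plugin encode step
{
  return encode(data_shards) == parity_shards;
}
```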
Soumya Koduri [Fri, 23 May 2025 20:25:30 +0000 (01:55 +0530)]
rgw/cloud-restore: Handle failure with adding restore entry
In case adding a restore entry to the FIFO fails, reset the `restore_status`
of that object to "RestoreFailed" so that the restore can be
retried by the end S3 user.
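A minimal sketch of that fallback, with hypothetical stand-ins for the FIFO
push and the attribute update (not the RGW SAL API):
```
#include <iostream>
#include <string>

enum class RestoreStatus { None, InProgress, CloudRestored, RestoreFailed };

// Stand-in for pushing the async restore entry onto the FIFO.
static bool push_restore_entry_to_fifo(const std::string& /*entry*/)
{
  return false;   // pretend the FIFO write failed, to exercise the fallback
}

// Stand-in for persisting the object's restore status attribute.
static void set_object_restore_status(const std::string& obj, RestoreStatus)
{
  std::cout << obj << " -> RestoreFailed\n";
}

void queue_restore(const std::string& obj, const std::string& entry)
{
  if (!push_restore_entry_to_fifo(entry)) {
    // Could not persist the async request; mark the object RestoreFailed so
    // the end S3 user can issue the restore request again.
    set_object_restore_status(obj, RestoreStatus::RestoreFailed);
  }
}
```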
Reviewed-by: Adam Emerson <aemerson@redhat.com> Reviewed-by: Jiffin Tony Thottan <thottanjiffin@gmail.com> Signed-off-by: Soumya Koduri <skoduri@redhat.com>
rgw/cloud-restore: Support restoration of objects transitioned to Glacier/Tape endpoint
Restoration of objects from certain cloud services (like Glacier/Tape) could
take a significant amount of time (even days). Hence we store the state of such
restore requests and periodically process them.
Brief summary of changes:
* Refactored existing restore code to consolidate and move all restore processing into the rgw_restore* file/class
* RGWRestore class is defined to manage the restoration of objects.
* Lastly, for SAL_RADOS, FIFO is used to store and read restore entries.
Currently, this PR handles storing the state of restore requests sent to the cloud-glacier tier type, which need async processing.
The changes are tested with AWS Glacier Flexible Retrieval with tier_type Expedited and Standard.
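The periodic processing can be pictured roughly as below; the entry layout and
the completion check are assumptions, not the RGWRestore implementation:
```
#include <cstddef>
#include <deque>
#include <string>

struct RestoreEntry {
  std::string obj;   // object whose cloud restore is still pending
};

// Stand-in for polling the cloud tier (e.g. checking the object's restore
// state on the remote endpoint).
static bool cloud_restore_complete(const RestoreEntry&) { return false; }

// Re-examine stored restore entries on each pass until the cloud tier
// reports the object as restored; unfinished entries are requeued.
void process_pending_restores(std::deque<RestoreEntry>& fifo)
{
  for (std::size_t i = 0, n = fifo.size(); i < n; ++i) {
    RestoreEntry e = fifo.front();
    fifo.pop_front();
    if (!cloud_restore_complete(e)) {
      fifo.push_back(e);   // not ready yet; retry on the next pass
    }
    // else: finalize the restore and drop the entry
  }
}
```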
Reviewed-by: Matt Benjamin <mbenjamin@redhat.com> Reviewed-by: Adam Emerson <aemerson@redhat.com> Reviewed-by: Jiffin Tony Thottan <thottanjiffin@gmail.com> Reviewed-by: Daniel Gryniewicz <dang@redhat.com> Signed-off-by: Soumya Koduri <skoduri@redhat.com>
Dnyaneshwari [Thu, 22 May 2025 07:08:25 +0000 (12:38 +0530)]
mgr/dashboard: Local Storage Class - create and list
Fixes: https://tracker.ceph.com/issues/71460
Signed-off-by: Dnyaneshwari Talwekar <dtalwekar@redhat.com>
Shraddha Agrawal [Thu, 26 Jun 2025 12:27:45 +0000 (17:57 +0530)]
mon/MgrStatMonitor.cc: cleanup handle_conf_change
Prior to this change, we were using a flag value,
`reset_availability_last_uptime_downtime_val`, to record the
timestamp to which last_uptime and last_downtime should be
updated. This was originally done to avoid the values
being overwritten by a paxos update.
Now, instead of using an intermediate value, we immediately
clear the last_uptime and last_downtime values in the
pending_pool_availability object. Since we are updating the values
in the pending object, we will not lose this information due to
an incoming paxos update.
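A minimal sketch of the simplification, with struct and member names inferred
from the text rather than taken from MgrStatMonitor:
```
#include <chrono>
#include <cstdint>
#include <map>

using time_point = std::chrono::system_clock::time_point;

struct PoolAvailability {
  time_point last_uptime{};
  time_point last_downtime{};
};

// Stand-in for the pending state that the next paxos proposal will commit.
struct PendingState {
  std::map<int64_t, PoolAvailability> pool_availability;
};

// Clear the timestamps directly on the pending object inside the config
// change handler; because the reset lives in the pending state, an in-flight
// paxos update cannot overwrite it.
void handle_conf_change(PendingState& pending)
{
  for (auto& [pool_id, avail] : pending.pool_availability) {
    (void)pool_id;
    avail.last_uptime = time_point{};
    avail.last_downtime = time_point{};
  }
}
```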