Adam King [Thu, 25 Sep 2025 20:13:18 +0000 (16:13 -0400)]
mgr/cephadm: split host cache entries if they exceed max mon store entry size
If the json blob we attempt to store for a host entry
exceeds the max mon store entry size, we become unable
to continue to store that host's information in the
config key store. This means we only ever have the
information from the last time the json blob was
under the size limit each time the mgr fails over,
resulting in a number of stray host/daemon warnings
being generated and very outdated information being
reported by `ceph orch ps` and `ceph orch ls` around
the time of the failover.
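As an illustration of the splitting idea described above, here is a minimal Python sketch. It is not the cephadm implementation: the function names, the numbered key layout ("<key>.0", "<key>.1", ...) and the size limit passed in are assumptions made for this example.

```python
import json


def split_entry(key, payload, max_size):
    """Return a {store_key: chunk} mapping whose chunks are each <= max_size bytes."""
    # json.dumps escapes non-ASCII by default, so one character is one byte.
    blob = json.dumps(payload)
    if len(blob) <= max_size:
        return {key: blob}
    chunks = [blob[i:i + max_size] for i in range(0, len(blob), max_size)]
    # Hypothetical key layout: "<key>.0", "<key>.1", ...
    return {f"{key}.{n}": chunk for n, chunk in enumerate(chunks)}


def join_entry(key, store):
    """Rebuild the payload from either a single entry or numbered chunks."""
    if key in store:
        return json.loads(store[key])
    parts = []
    n = 0
    while f"{key}.{n}" in store:
        parts.append(store[f"{key}.{n}"])
        n += 1
    return json.loads("".join(parts))
```

On load, the reader first looks for the unsplit key and only then for numbered chunks, so entries written before the split logic existed would still be readable under this scheme.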
Igor Fedotov [Thu, 21 Aug 2025 10:42:54 +0000 (13:42 +0300)]
test/libcephfs: use more entries to reproduce snapdiff fragmentation issue
Snapdiff listing fragments have different boundaries in Reef and Squid+
releases, hence the original reproducer (made for Reef) doesn't work properly
in S+ releases. This patch fixes that at the cost of longer execution.
This might be redundant/senseless when backporting to Reef.
Igor Fedotov [Tue, 12 Aug 2025 13:17:49 +0000 (16:17 +0300)]
mds: rollback the snapdiff fragment entries with the same name if needed.
This is required when more entries with the same name don't fit into the
fragment. With the existing means of specifying a fragment offset, such splitting is
prohibited.
Signed-off-by: Dnyaneshwari <dtalweka@redhat.com>
mgr/dashboard: handle creation of new pool
Commit includes:
1) Provide link to create a new pool
2) Refactored validation on ACL mapping, removed required validator as default
3) Fixed a runtime error on the console caused by ACL length, which prevented the details section from opening
4) Used rxjs operators to make API calls and mark the form ready once all data is available, fixing the form patch issues
5) Refactored some parts of the code to improve performance
6) Added zone and pool information in details section for local storage class
Fixes: https://tracker.ceph.com/issues/72569 Signed-off-by: Naman Munet <naman.munet@ibm.com>
(cherry picked from commit 2d0e71c845643a26d4425ddac8ee0ff30153eff2)
Problem:
The readdir wouldn't list all the entries in the directory
when the osd is full with rstats enabled.
Cause:
The issue happens only in multi-mds cephfs cluster. If rstats
is enabled, the readdir would request 'Fa' cap on every dentry,
basically to fetch the size of the directories. Note that 'Fa' is
CEPH_CAP_GWREXTEND which maps to CEPH_CAP_FILE_WREXTEND and is
used by CEPH_STAT_RSTAT.
The request for the cap is a getattr call and it need not go to
the auth mds. If rstats is enabled, the getattr would go with
the mask CEPH_STAT_RSTAT which mandates the requirement for
auth-mds in 'handle_client_getattr', so that the request gets
forwarded to auth mds if it's not the auth. But if the osd is full,
the inode is fetched in 'dispatch_client_request' before
calling the handler function of the respective op, to check the
FULL cap access for certain metadata write operations. If the inode
doesn't exist, ESTALE is returned. This is wrong for operations
like getattr, where the inode might not be in memory on the non-auth
mds; returning ESTALE is confusing and the client wouldn't retry. This
was introduced by commit 6db81d8479b539d, which fixes subvolume
deletion when the osd is full.
Fix:
Fetch the inode required for the FULL cap access check for the
relevant operations in the osd full scenario. This makes sense because
all the operations would mostly be preceded by a lookup that loads
the inode in memory, or they would handle ESTALE gracefully.
Venky Shankar [Fri, 29 Aug 2025 07:15:09 +0000 (07:15 +0000)]
qa/cephfs: use fuse mount for volumes/subvolume tests
Using the kernel client is a) not really required for the existing
volume/subvolume tests and b) per-subvolume metrics are only
supported by the user-space client library.
Igor Golikov [Thu, 10 Jul 2025 10:18:57 +0000 (10:18 +0000)]
mds: aggregate and expose subvolume metrics
rank0 periodically receives subvolume metrics from other MDS instances
and aggregates them using a sliding window.
The MetricsAggregator exposes PerfCounters and PerfQueries for these
metrics.
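A sliding-window aggregation of the kind mentioned above can be sketched in Python as follows. This is only an illustration of the general technique, not the MetricsAggregator code: the class name, the window length and the (timestamp, value) sample shape are assumptions.

```python
from collections import deque


class SlidingWindow:
    """Keep samples from the last window_sec seconds and sum them on demand."""

    def __init__(self, window_sec):
        self.window_sec = window_sec
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        self._expire(timestamp)

    def _expire(self, now):
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= now - self.window_sec:
            self.samples.popleft()

    def total(self, now):
        self._expire(now)
        return sum(value for _, value in self.samples)
```

An aggregator could keep one such window per subvolume path and feed it with the values reported by each MDS rank.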
Igor Golikov [Thu, 10 Jul 2025 10:17:36 +0000 (10:17 +0000)]
client,mds: add support for subvolume level metrics
Add support for client side metrics collection using the SimpleIOMetric
struct and aggregation using the AggregatedIOMetrics struct.
The client holds a SimpleIOMetrics vector for each subvolume it recognizes
(via caps/metadata messages), aggregates them into the
AggregatedIOMetric struct, and sends it periodically to the MDS, along
with regular client metrics.
The MDS holds a map of subvolume_path -> vector<AggregatedIOMetrics> and sends
it periodically to rank0, for further aggregation and exposure.
Yaarit Hatuka [Mon, 21 Oct 2024 20:35:31 +0000 (16:35 -0400)]
mgr/callhome: persist operations between mgr restarts
Currently the operations dictionary is only kept in memory. It is lost
when the mgr restarts, and this can cause the module to handle upload
requests which were already processed and registered in the operations
dictionary. To prevent that, we write the operations to the db, and load
them when the module starts.
mgr/callhome: management of diagnostic upload requests (#78)
Call Home stores diagnostic upload requests for 10 days
Call Home does not process operations sent repeatedly by IBM Call Home mesh
Call Home is able to repeat level 1 operations after 5 minutes
Call Home is able to repeat level 2 (and upper) operations after 1 hour
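The retention and repeat rules listed above can be sketched in Python as below. This is a hedged illustration only: the dictionary layout, the "created" field and the function names are hypothetical, and it assumes the repeat interval is measured from the time the previous request was recorded, which the commit message does not state.

```python
RETENTION_SEC = 10 * 24 * 3600  # requests are kept for 10 days
LEVEL1_REPEAT_SEC = 5 * 60      # level 1 may repeat after 5 minutes
UPPER_REPEAT_SEC = 60 * 60      # level 2 and above may repeat after 1 hour


def prune(operations, now):
    """Drop stored requests older than the retention period."""
    return {k: v for k, v in operations.items()
            if now - v["created"] < RETENTION_SEC}


def should_process(operations, op_key, level, now):
    """Return True if a request is new or its repeat interval has elapsed."""
    previous = operations.get(op_key)
    if previous is None:
        return True
    wait = LEVEL1_REPEAT_SEC if level == 1 else UPPER_REPEAT_SEC
    return now - previous["created"] >= wait
```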
This is a combination of 18 commits to ease maintenance.
Signed-off-by: Yaarit Hatuka <yhatuka@ibm.com> Signed-off-by: Juan Miguel Olmo Martínez <jolmomar@ibm.com>
(cherry picked from commit c9deac16f75e174e66ecde453cc8e71c936b3981)
A new line was missing between the block of "%files node-proxy"
and that of "%files mgr-callhome".
Please note that the changes in de6cbfbde53c64877941751d2ef5f8198ae5dccc
to src/cephadm/cephadm.py were reset in this commit, since they were
extracted and cherry-picked to a separate call-home-cephadm branch.
pybind/mgr: add call_home_agent to CMakeLists.txt and tox.ini
This commit can be safely squashed along with: 209527a8e087c916fadd0e395e3619a89cf1c3a6
mgr/callhome: Add hardware status to inventory reports
in future releases
Signed-off-by: Yaarit Hatuka <yhatuka@ibm.com>
mgr/ccha: Remove jti error message when no credentials (#61)
Avoid the annoying error message if no credentials are present
Fix error if registry credentials are set using ceph cephadm registry-reg_credentials
Changed default regex for registry urls
ECuRep requires Transfer ID credentials (user ID and password). In this fix we
are adding the option to load them from the encrypted keys file instead of
asking the user to populate them. The keys from the files are the default. As a
workaround, we are leaving the option to manually populate the module options,
in case we ever need it.
BF-2271537: mgr/callhome: pick up SI event ID (#65)
Storage Insights event ID was not picked up correctly which prevented Ceph from
listening to SI triggered requests, and thus not fulfilling them and updating
on their status.
Status for operations updated to match SI expectations
complete_multipart_upload: the spec requires that the client
provide the same values for sse-c as were used to initiate
the upload. Verify the required parameters exist and match.
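The check described above could look roughly like this in Python. It is a sketch and not the RGW code: the function name and the plain dictionaries of lower-cased headers are assumptions, and comparing raw header values is a simplification of whatever the server actually keeps from the initiate request.

```python
# Standard S3 SSE-C request headers, lower-cased for lookup.
SSE_C_HEADERS = (
    "x-amz-server-side-encryption-customer-algorithm",
    "x-amz-server-side-encryption-customer-key",
    "x-amz-server-side-encryption-customer-key-md5",
)


def check_sse_c(initiate_headers, complete_headers):
    """Return None if the SSE-C values match, else a short error string."""
    if not any(h in initiate_headers for h in SSE_C_HEADERS):
        return None  # upload was not started with SSE-C; nothing to verify
    for name in SSE_C_HEADERS:
        if name not in complete_headers:
            return f"missing {name}"
        if complete_headers[name] != initiate_headers.get(name):
            return f"mismatched {name}"
    return None
```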
rgw/multisite: reset RGW_ATTR_OBJ_REPLICATION_TRACE during object attr changes.
otherwise, if a zone receives any s3 object api request like PutObjectAcl, PutObjectTagging etc. and this zone
was originally the source zone for the object put request, then subsequent sync ops will fail. this is because the
zone id was added to the replication trace to ensure that we don't sync the object back to it,
for example in a put/delete race during full sync (https://tracker.ceph.com/issues/58911).
so, if the same zone ever becomes the destination for subsequent sync requests on the same object, we compare this zone as
the destination zone against the zone entries in the replication trace and, because its entry is already present in the trace,
the sync operation returns -ERR_NOT_MODIFIED.
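To make the failure mode above concrete, here is a small Python sketch of the trace check and of the reset. The attribute key "replication_trace", the function names and the numeric value of ERR_NOT_MODIFIED are placeholders chosen for this example, not the RGW definitions.

```python
ERR_NOT_MODIFIED = 304  # placeholder value for illustration only


def sync_object(dest_zone_id, trace):
    """Refuse to sync an object to a zone already present in its trace."""
    if dest_zone_id in trace:
        return -ERR_NOT_MODIFIED
    return 0


def on_attr_change(attrs):
    """Return a copy of attrs with the trace reset after an attribute change."""
    updated = dict(attrs)
    updated.pop("replication_trace", None)  # hypothetical attribute key
    return updated
```

With the trace left in place, a later sync toward the original source zone is rejected; once the trace is reset on the attribute change, the same sync is allowed to proceed.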
rgw/logging: add error message when log_record fails
when log_record fails in journal mode due to issues in the target
bucket, the result code that the client gets will be confusing, since
there is no indication that the issue is with the target bucket and not
the source bucket on which the client was operating.
the HTTP error message will be used to convey this information.
rgw/restore: Mark the restore entry status as `None` first time
While adding the restore entry to the FIFO, mark its status as `None`
so that restore thread knows that the entry is being processed for
the first time. In case the restore is still in progress and the entry
needs to be re-added to the queue, its status then will be marked
`InProgress`.
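The status handling described above can be sketched in Python as follows. The list used as a FIFO, the entry layout and the restore_done callback are assumptions for illustration; only the two status names come from the commit message.

```python
from enum import Enum


class RestoreStatus(Enum):
    NONE = "None"
    IN_PROGRESS = "InProgress"


def enqueue(fifo, obj_key):
    """Add a new restore entry; None marks it as not yet processed."""
    fifo.append({"key": obj_key, "status": RestoreStatus.NONE})


def process_one(fifo, restore_done):
    """Process the oldest entry; return True if it was seen for the first time."""
    entry = fifo.pop(0)
    first_time = entry["status"] is RestoreStatus.NONE
    if not restore_done(entry["key"]):
        # Still restoring: re-add the entry, now marked InProgress.
        entry["status"] = RestoreStatus.IN_PROGRESS
        fifo.append(entry)
    return first_time
```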
Soumya Koduri [Sun, 10 Aug 2025 12:13:11 +0000 (17:43 +0530)]
rgw/restore: Persistently store the restore state for cloud-s3 tier
In order to resume IN_PROGRESS restore operations post RGW service
restarts, store the entries of the objects being restored from `cloud-s3`
tier persistently. This is already being done for `cloud-s3-glacier`
tier and now the same will be applied to `cloud-s3` tier too.
With this change, when `restore-object` is performed on any object,
it will be marked RESTORE_ALREADY_IN_PROGRESS and added to a restore FIFO queue.
This queue is later processed by Restore worker thread which will try to
fetch the objects from Cloud or Glacier/Tape S3 services. Hence all the
restore operations are now handled asynchronously (for both `cloud-s3`,
`cloud-s3-glacier` tiers).
Matt Benjamin [Thu, 11 Sep 2025 20:42:03 +0000 (16:42 -0400)]
rgw_cksum: return ChecksumAlgorithm and ChecksumType in ListParts
An uncompleted multipart upload's checksum algorithm and type can
be deduced from the upload object. Also the ChecksumType element
was being omitted in the completed case.
rgw/restore: Update expiry-date of restored copies
As per AWS spec (https://docs.aws.amazon.com/AmazonS3/latest/API/API_RestoreObject.html),
if a `restore-object` request is re-issued on an already restored copy, the server needs to
update the restoration period relative to the current time. These changes handle that.
Note: this applies only to temporarily restored copies.
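A minimal Python sketch of the expiry update described above, assuming a hypothetical object record with "restore_type" and "expiry" fields and a restore period given in days; it does not model any rounding the real service may apply.

```python
from datetime import datetime, timedelta, timezone


def reissue_restore(obj, days, now):
    """Update the expiry of an already restored temporary copy."""
    if obj.get("restore_type") != "temporary":
        return obj  # non-temporary copies are left untouched
    updated = dict(obj)
    # The new period is counted from the current time, not the old expiry.
    updated["expiry"] = now + timedelta(days=days)
    return updated
```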
cloud restore : add None type for cloud-s3-glacier
AWS supports various glacier conf options such as Standard, Expedited
to restore an object within a time period. These options may not be supported in
other S3 servers. So introducing option NoTier, so other vendors can be supported.
Signed-off-by: Harsimran Singh <hsthukral51@gmail.com>
(cherry picked from commit b588fd05c7d82b52fc8fa3742976a9a45c3755b4) Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
Signed-off-by: Ali Masarwa <ali.saed.masarwa@gmail.com> Signed-off-by: Ali Masarwa <amasarwa@redhat.com>
(cherry picked from commit 47166556c5bbcf1f26621bf24cf04221b65af366)
Signed-off-by: Oguzhan Ozmen <oozmen@bloomberg.net>
(cherry picked from commit 9bb170104446bfea0ad87b34244f3a3d47962fcc) Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
Co-authored-by: Yuval Lifshitz <yuvalif@yahoo.com> Signed-off-by: Mark Kogan <31659604+mkogan1@users.noreply.github.com>
(cherry picked from commit 965eda7a45b12c9ccd78f230076002043f7df65c) Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
Marcus Watts [Wed, 28 Aug 2024 21:21:13 +0000 (17:21 -0400)]
rgw/storage class. Don't inherit storage class for copy object.
When an object is copied, the storage class should be determined
only by data in the request, and if it is not specified, it
should default to 'STANDARD'. In radosgw,
this means that this is another attribute (similar to encryption)
that should not be merged from the source object.
Fixes: https://tracker.ceph.com/issues/67787 Signed-off-by: Marcus Watts <mwatts@redhat.com>
(cherry picked from commit a0e60bda70d4af93aa545a3fdea46eb9e68088c4)
Marcus Watts [Wed, 28 Aug 2024 15:42:05 +0000 (11:42 -0400)]
rgw/storage class: don't store/report STANDARD storage class.
While 'STANDARD' is a valid storage class, it is not supposed
to ever be returned when fetching an object. This change suppresses
storing 'STANDARD' as the attribute value, so that objects
explicitly created with 'STANDARD' will in fact be indistinguishable
from those where it was implicitly set.
Fixes: https://tracker.ceph.com/issues/67786 Signed-off-by: Marcus Watts <mwatts@redhat.com>
(cherry picked from commit b95e743ab9374cd3463a29c5f719ffce1c9fb28a)
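The two storage class rules above (copy takes the class from the request only, and 'STANDARD' is never stored) can be illustrated with a short Python sketch. The function names are hypothetical and the code only mirrors the behavior stated in the commit messages.

```python
DEFAULT_STORAGE_CLASS = "STANDARD"


def copy_storage_class(request_storage_class, source_storage_class):
    """Pick the destination class from the request only; ignore the source."""
    return request_storage_class or DEFAULT_STORAGE_CLASS


def stored_storage_class(storage_class):
    """Return the value to persist, or None when the attribute is omitted."""
    if not storage_class or storage_class == DEFAULT_STORAGE_CLASS:
        return None
    return storage_class
```

Under these rules an object created with an explicit 'STANDARD' and one created with no storage class at all end up with the same stored state.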
Marcus Watts [Sat, 25 May 2024 03:45:14 +0000 (23:45 -0400)]
Fix lifecycle transition of encrypted multipart objects.
Lifecycle transition can copy objects to a different storage tier.
When this happens, since the object is repacked, the original
manifest is invalidated. It is necessary to store a special
"parts_len" attribute to fix this. There was code in PutObj
to handle this, but that was only used for multisite replication;
it is not used by the lifecycle transition code. This fix
adds similar logic to the lifecycle transition code path to make the
same thing happen.
Fixes: https://tracker.ceph.com/issues/23264 Signed-off-by: Marcus Watts <mwatts@redhat.com>
(cherry picked from commit 60ddd17d2753b769ba2f5ebde60eb7753649d73f)
Marcus Watts [Fri, 14 Apr 2023 09:19:59 +0000 (05:19 -0400)]
copy object encryption fixes
This contains code to allow copyobject to copy encrypted objects.
It includes additional data paths to communicate data from the
rest layer down to the sal layer to handle decrypting
objects. The data paths include logic to use filter chains
from get and put that process encryption and compression.
There are several hacks to deal with quirks of the filter chains.
The "get" path has to propagate flushes around the chain,
because a flush isn't guaranteed to propagate through it.
Also the "get" and "put" chains have conflicting uses of the
buffer list logic, so the buffer list has to be copied so that
they don't step on each other's toes.
Fixes: https://tracker.ceph.com/issues/23264 Signed-off-by: Marcus Watts <mwatts@redhat.com>
(cherry picked from commit bcaaf55f4182da0a980c87c1dbd7e1d3c868626c)
Marcus Watts [Tue, 16 Jul 2024 21:16:10 +0000 (17:16 -0400)]
rgw/compression antibug check
If another bug tells the compression filter to decompress more
data than is actually present, an "end_of_buffer"
error is thrown. The thrown exception unwinds the stack,
including a completion that is pending. The resulting core dump
indicates a failure with this completion rather than the end of buffer
exception, which is misleading and not useful.
With this change, radosgw does not abort, and instead logs
a somewhat useful message before returning an "unknown" error
to the client.
Fixes: https://tracker.ceph.com/issues/23264 Signed-off-by: Marcus Watts <mwatts@redhat.com>
(cherry picked from commit 8c7b0fac53107c5fdfcd1b9d5c5d6933b7ace39f)
rgw_beast_enable_async=0 can be used to run process_request() without a
coroutine context, which can make stack traces easier to view and debug.
however, the frontend's reads/writes through ClientIO were still using
the yield_context to suspend/resume, so after ClientIO, the stack traces
came from the coroutine resume instead of process_request().
the beast frontend's ClientIO now issues synchronous reads/writes when
rgw_beast_enable_async is disabled.
matt benjamin [Fri, 16 May 2025 16:02:20 +0000 (12:02 -0400)]
rgw: defensive fix for crash attempting part-copy of '%' versioned obj
The proximate cause of the issue actually appears to be in recognizing
the key.name of the object, only failing in rgw_rados due to an assert
on key.name being non-empty.
Signed-off-by: matt benjamin <mbenjamin@redhat.com>
(cherry picked from commit 5111b625a174aa2eaeb4be943dec9fe4b9d948af) Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
Matt Benjamin [Mon, 30 Jun 2025 14:26:25 +0000 (10:26 -0400)]
rgwlc: fix removal of delete markers (SAL)
S3 delete markers do not have head objects, and SAL's Object::load_obj_state()
returns -ENOENT in this case. Handle this case in LC's remove_expired_obj().
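The handling described above can be sketched in Python as below. The callbacks stand in for the SAL calls and are hypothetical; the sketch only shows the control flow of treating -ENOENT as expected for a delete marker.

```python
import errno


def remove_expired_obj(load_obj_state, delete_obj, key, is_delete_marker):
    """Remove an expired object, tolerating a missing head for delete markers."""
    ret = load_obj_state(key)
    if ret == -errno.ENOENT and is_delete_marker:
        ret = 0  # delete markers have no head object; this is expected
    if ret < 0:
        return ret
    return delete_obj(key)
```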
Matt Benjamin [Sun, 10 Aug 2025 18:05:43 +0000 (14:05 -0400)]
rgw:chksum: pull up aws-sdk-java-v2 and fix S3Builder invocation
This commit pulls up aws-sdk-java-v2 to 2.32.2, which has trailing header
formatting previously seen with golang v2 sdk--for which the upstream
*Reef* logic is not present (see prior commit by Yixin Jin).
And it fixes the construction of S3Client to accept endpoint self-signed
certificates--logic which is present in the main function example code
in jcksum.java, but somehow not in putobjects.java (anymore?).
Yixin Jin [Sun, 10 Aug 2025 15:59:18 +0000 (11:59 -0400)]
rgw:cksum: fix two checksum-trailer related signing issues
1. return error code on signature mismatch (should be 400,
XAmzContentSHA256Mismatch)
2. reorder final chunk extraction and signing to better address
what we were handling as a special case of a few trailing bytes--
this is arising because the implementer was working against Reef,
which I guess doesn't have the extra extraction logic (c.f.,
ceph/main and its upstream backport)
(A change to catch rgw::io::Exception at rgw_process_authenticated
has been removed, as it is already handled in the only applicable
path.)
Matt Benjamin [Sun, 18 May 2025 01:02:34 +0000 (21:02 -0400)]
rgw: aws-chunked need not supply any content-length
The updated logic for aws chunked handling (2024) appears sufficient
to handle the cases produced by aws-sdk-go-v2.
Note that https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-streaming.html
states that "For all requests, you must include the
x-amz-decoded-content-length header, specifying the size of the object in
bytes." (accessed 5/17/2025) (but now we do not enforce it).