In rgw_sync_pipe_params, the mode can be either system or user.
When in system mode, no user is involved, but the current
implementation holds an empty rgw_user, which can cause confusion
in pipe_rules::find_basic_info_without_tags().
With this change, rgw_user is now optional, ensuring that when no
user is involved, it is explicitly nullopt rather than an empty object.
Seena Fallah [Fri, 28 Mar 2025 20:55:20 +0000 (21:55 +0100)]
rgw: remote copy obj pass rgwx-perm-check-uid for perm evaluation
When copying object from remote source (bucket from another zonegroup)
the perms of the source is not evaluated resulting in reading from
unauthorized buckets.
passing `rgwx-perm-check-uid` will let the source zone evaluates the
perm and close this bug.
Seena Fallah [Fri, 28 Mar 2025 20:52:47 +0000 (21:52 +0100)]
rgw: RGWRadosPutObj evals source bucket perm for backward compatibility
As of a3f40b4 we no longer evaluate perms locally for source bucket,
this could cause broken permission evaluation dusring upgrade as one
zone is not respecting the perm evaluation based on the `rgwx-perm-check-uid`
arg.
Seena Fallah [Fri, 28 Mar 2025 20:48:34 +0000 (21:48 +0100)]
rgw: give hint via header for perm evaluation in GetObj
Return `Rgwx-Perm-Checked` header as a hint for the destination zone
to know whether the perms where considered or not.
This is just a backward compatibility for upgrade and can be dropped
in T+2 release.
Seena Fallah [Thu, 27 Feb 2025 10:53:44 +0000 (11:53 +0100)]
rgw: take account GetObject(Version)Tagging when replicating
In case the uid has no permission to read tagging, the tags should
not be replicated.
Ref. https://docs.aws.amazon.com/AmazonS3/latest/userguide/setting-repl-config-perm-overview.html
Seena Fallah [Mon, 24 Feb 2025 22:41:13 +0000 (23:41 +0100)]
rgw: check source object replication by replication actions
Check for permissions of `s3:GetObjectVersionForReplication` in
addition to `s3:GetObject` and `s3:GetObjectVersion` when fetching
the object for multisite.
Seena Fallah [Mon, 24 Feb 2025 22:33:45 +0000 (23:33 +0100)]
rgw: only allow system override if identity is not impersonating
Since multisite now delegates permission checks for source objects
to the source zone (a3f40b4), we need to avoid allowing system-level
overrides when the request is impersonating another identity.
SysReqApplier should only grant override permission if the request
is truly system-authenticated and not acting on behalf of another
user or role (i.e., no rgwx-perm-check-uid or rgwx-perm-check-role
in the request).
rgw: SysReqApplier overrides is_admin_of based on impersonation
SysReqApplier now returns true for is_admin_of() when the requester
was a system user and was not impersonating any user/role using
rgwx-perm-check-uid or rgwx-perm-check-role.
rgw/qa: added test case to assume a role after role creation
syncs, and then creating a bucket on both primary and secondary.
The test name is test_assume_role_after_sync.
rgw/sts: by-passing authentication using temp creds
in case the request is forwarded from secondary in
a multi-site setup. authenticating with the system
user creds of which are used to sign the request.
Permissions are still derived from the role.
Patrick Donnelly [Fri, 25 Apr 2025 19:00:39 +0000 (15:00 -0400)]
Merge PR #62833 into main
* refs/pull/62833/head:
qa: test charmap changes with dir and snaps
mds: check for snapshots on parent snaprealms
mds: use strict_strtobool for parsing bools
common: take string_view for strict_tobool
Ville Ojamo [Fri, 25 Apr 2025 07:16:52 +0000 (14:16 +0700)]
doc/radosgw: Fix section header levels in multisite-sync-policy.rst
The section header levels are reversed so the hierarchy in the TOC is
incorrect. Switch around the section header levels to make the TOC
hierarchy correct, for example individual examples are children of the
"Examples" section.
Signed-off-by: Ville Ojamo <14869000+bluikko@users.noreply.github.com>
src/mon/PGMap.cc: check unfound obejcts in `get_unavailable_pg_in_pool_map`
If a pool has any PG with unfound objects, we should consider
it unavailable for the availability score. If a PG has unfound
objects, it will be recorded in PGMap.
In `get_unavailable_pg_in_map`, if a PG has unfound obejcts,
we add it to `pool_pg_unavailable_map`.
When we update the `pg_stat` we don't
check whether the pg state is in `stale`.
Therefore, the attribute `last_unstale`
will always get updated even if the pg
state actually contains `stale`.
Solution:
Place a condition to only update
the attribute `last_unstale` when
we the pg truly doesn't have `stale`
in its state.
Kamoltat [Thu, 26 Oct 2023 19:08:37 +0000 (19:08 +0000)]
src/mon/PGMap.cc: init pool_availability
Added PoolAvailability Struct
Modified PGMap.cc to include a k,v map:
`pool_availability`.
The key being the `poolid` and value
is `PoolAvailability`
Init the function:
`PGMap::get_unavailable_pg_in_pool_map()`
to identify and aggregate all the PGs we
mark as `unavailable` as well as the pool
that associates with the unavailable PG.
Also, included `pool_availability`
to `PGMapDigest::dump()`.
Adam Kupczyk [Mon, 10 Jun 2024 16:03:24 +0000 (16:03 +0000)]
os/bluestore: Add admin socket commands to inspect onode metadata
Add admin socket commands:
1) bluestore collections
Lists collections.
2) bluestore list <coll> [start object] [max count]
Lists collection coll starting from object (optional). Default 100 entries. 0 = unlimited.
3) bluestore onode metadata <object>
Prints onode metadata as seen by BlueStore.
It might happen (usually in tests) that 2 BlueStore instances are created at the same time.
Since admin commands are unique, it fails to register.
Use first register to detect whether we can register at all.
Adam Kupczyk [Tue, 8 Apr 2025 08:36:21 +0000 (08:36 +0000)]
os/bluestore: Add do_write_v2_compressed()
Modify do_write_v2() to branch into do_write_v2_compressed().
Segmented and regular cases are recognized and handled properly.
New do_write_v2_compressed() oversees compression / recompression.
Make one Estimator per Collection.
It makes possible for estimator to learn in collection specific compressibility.
In write_v2_compressed use compressor already selected in choose_write_options.
Make Collection create Estimator on first use.
Adam Kupczyk [Tue, 8 Apr 2025 11:03:22 +0000 (11:03 +0000)]
os/bluestore/compression: Main part of recompression feature
Add feature of recompression scanner that looks around write region to see how much
would be gained, if we read some more around and wrote more.
Added Compression.h / Compression.cc.
Added debug_bluestore_compression dout.
Created Scanner class.
Provides write_lookaround() for scanning loaded extents.
Adam Kupczyk [Wed, 9 Apr 2025 16:03:52 +0000 (16:03 +0000)]
os/bluestore/compression: Estimator class
Add CMake rules to compile.
Add bluestore_compression dout subsys.
Created Estimator class.
It is used by Scanner to decide if specific extent is to be recompressed.
Prepare for future machine learning / adaptive algorithm for estimation.
So far logic of Estimator is relatively simple.
It learns expected recompression values and uses them in next iterations to predict.
Max Kellermann [Thu, 24 Apr 2025 05:17:48 +0000 (07:17 +0200)]
mds/Locker: use ceph_abort_msg() instead of ceph_assert()
This ceph_assert() always fails, but depending on the configuration
value `ceph_assert_supresssions`, execution may continue, but the
`dir` variable is left uninitialized. This leads to a compiler
warning:
/home/jenkins-build/build/workspace/ceph-api/src/mds/Locker.cc:451:22: error: variable 'dir' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
clang then suggests to nullptr-initialize the variable:
/home/jenkins-build/build/workspace/ceph-api/src/mds/Locker.cc:447:11: note: initialize the variable 'dir' to silence this warning
447 | CDir *dir;
| ^
| = nullptr
This, however, is a very bad idea because all this does is suppress
the warning; it still crashes the process.
Since there's no recovery from this problem, let's switch to
ceph_abort_msg() which is [[noreturn]] and the compiler can deduce
that `dir` is always initialized when it's used.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Merge pull request #62693 from ronen-fr/wip-rf-iocnt
osd/scrub: performance counters for I/O performed by the scrubber
Reviewed-by: Alex Ainscow <aainscow@uk.ibm.com> Reviewed-by: Bill Scales <bill_scales@uk.ibm.com> Reviewed-by: Samuel Just <sjust@redhat.com> Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
N Balachandran [Mon, 21 Apr 2025 11:34:08 +0000 (17:04 +0530)]
rbd: display correct mirror state when creating
The mirror image state is set to MIRROR_IMAGE_STATE_CREATING
when the image is first created on the secondary, but was displayed
as "unknown" by the rbd info command. This has been fixed.
Fixes: https://tracker.ceph.com/issues/70963 Signed-off-by: N Balachandran <nithya.balachandran@ibm.com>
HealthMonitor: Add topology-aware netsplit detection and warning
Problem:
Currently, Ceph cannot detect and report network partitions (netsplits)
between monitors in different topology locations in a consolidated way.
While stretch mode can handle partitions through monitor elections,
users lack visibility into the topology-level view of network
disconnections, making troubleshooting difficult.
Solution:
This implementation adds a hierarchical netsplit detection mechanism that:
- Uses DirectedGraph structure for netsplit detection
- Maps monitor disconnections to relevant CRUSH topology levels
- Aggregates individual disconnections into location-level reports when appropriate
- Detects complete location-level netsplits when ALL monitors between locations
cannot communicate
- Reports specific topology locations experiencing complete communication failures
- Falls back to individual monitor-level reporting for partial disconnections
- Handles monitors with missing location data gracefully
- Leverages HealthMonitor::check_for_mon_down to receive a set of down monitors,
efficiently avoiding false netsplit reports for monitors already known to be down
- Implements smart filtering that correctly excludes down monitors from location-based
analysis, ensuring accurate netsplit reporting at both individual and topology levels
The implementation produces user-friendly health warnings:
1. For complete location netsplits: "Netsplit detected between dc1 and dc2"
2. For individual monitor disconnections: "Netsplit detected between mon.a and mon.d"
Performance considerations:
- Time complexity: O(m²) where m is the number of monitors
- Space complexity: O(m²) for connection tracking
- Practical impact is minimal as monitor count is typically small (3-7)