Casey Bodley [Wed, 14 Feb 2024 14:43:14 +0000 (09:43 -0500)]
rgw/putobj: RadosWriter uses part head object for multipart parts
the cleanup logic in the RadosWriter destructor, which avoids races
between cleanup and part re-uploads, was using the wrong `head_obj`:
it pointed at the final location of the multipart upload, rather than
the head object of the current part
Tobias Urdin [Tue, 6 Feb 2024 07:50:55 +0000 (07:50 +0000)]
rgw/auth: ignoring signatures for HTTP OPTIONS calls
Before [1] we always sent all HTTP OPTIONS requests to
the S3AnonymousEngine and ignored any provided AWSv4
credentials sent in the request.
That PR changed this so that, if we got credentials in the
request, we instead sent it through the authentication
code in order to get HTTP OPTIONS requests for tenanted
users working (because we need to resolve the tenant, also
called the bucket tenant in the code, and we can't rely on
the bucket name alone since it will not be found).
We solved this by modifying the canonical HTTP method used
when calculating the AWSv4 signature to instead use the
access-control-request-method header, which worked well.
This change did not take into account that when you generate
a presigned URL for a put_object request you can also pass in
extra parameters, like a canned ACL [2], via the Params argument
of, for example, boto3's generate_presigned_url().
Doing that causes the client to add the x-amz-acl header
to x-amz-signedheaders and also use it in its signature
calculation.
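As a rough illustration (not part of the patch), this is how such a presigned
URL with a canned ACL might be generated with boto3; the bucket, key and ACL
values here are placeholders:
```
import boto3

# Hypothetical bucket/key names. Passing ACL in Params makes the canned ACL
# part of the SigV4 signing, so the client must later send a matching
# x-amz-acl header with the actual PUT request.
s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    ClientMethod="put_object",
    Params={"Bucket": "example-bucket", "Key": "example-object", "ACL": "public-read"},
    ExpiresIn=3600,
)
```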
When the browser does an HTTP OPTIONS call for CORS on that
presigned URL, it will never send an x-amz-acl header with the
correct data, since that is something the actual PUT request
will include later. That HTTP OPTIONS call should therefore pass
even though the signature can never be calculated correctly
server-side, as verified against AWS S3 in tracker [3].
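Continuing the sketch above, the browser's CORS preflight for that URL can be
approximated with the requests library; the origin and header names here are
made-up examples:
```
import requests

# Approximation of the preflight a browser sends before the actual PUT.
# The x-amz-acl value itself is absent, so the server cannot recompute the
# SigV4 signature for this OPTIONS request.
resp = requests.options(
    url,  # the presigned URL from the previous sketch
    headers={
        "Origin": "https://app.example.com",
        "Access-Control-Request-Method": "PUT",
        "Access-Control-Request-Headers": "x-amz-acl",
    },
)
print(resp.status_code, resp.headers.get("Access-Control-Allow-Origin"))
```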
As a result, this patch skips the signature calculation when doing
EC2 auth using the LocalEngine, but we still need to pass the request
through in order to look up the user and support buckets in a tenant.
For Keystone EC2 auth we're out of luck: Keystone's API itself
requires us to send the AWSv4 signature in the request along with the
access_key in order to obtain a token, so we cannot leave the
signature out. We also cannot spoof the signature from
rgw -> keystone, since we don't have access to the secret_key if it's
not in our cache.
Instead, we pass on to get_access_token() that, if this is an
HTTP OPTIONS request and we find the access_key in the cache, we pull that,
skip verifying the signature, and pass it on for validation. This means
that the cache must be warm if using Keystone auth and adding extra
params to a presigned URL.
This makes some of the commits in [1] partly redundant for EC2
LocalEngine auth, but we still need them for tenanted bucket support.
Yuri Weinstein [Thu, 1 Feb 2024 23:34:35 +0000 (15:34 -0800)]
qa/suites/upgrade/pacific-p2p: run librbd python API tests from pacific tip
This job installs librbd from v16.2.7, upgrades librbd to latest
pacific and then runs librbd python API tests from v16.2.7 against the
upgraded librbd. This isn't expected to work in the general case
and is currently failing in the TestImage.test_diff_iterate test because
it was adjusted to do the right thing in commit f8ced6d1fe66
("test/pybind/rbd: don't ignore from_snapshot in check_diff()").
CMake allows us to customize `CMAKE_CXX_FLAGS` by setting the CXXFLAGS
environment variable, and Debian's debhelper also sets CXXFLAGS
when it builds CMake projects to customize the build flags.
But we fail to pass this setting down when building external
projects. This matters for projects that are critical to
performance, and RocksDB is one of them.
In this change, we pass `CMAKE_CXX_FLAGS` down in
`BuildRocksDB.cmake` so that its `CMAKE_CXX_FLAGS` contains
the same set of flags used by its parent project.
This should help with performance in BlueStore, where
RocksDB is used.
ceph-volume: fix partitions support in disk.get_devices()
The following:
```
is_part = get_file_contents(os.path.join(_sys_dev_block_path, item, 'partition')) == "1"
```
assumes any `/sys/dev/block/x:y/partition` contains '1', which is wrong.
This file actually contains the corresponding partition number.
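A minimal sketch of the corrected idea (not the exact patch): treat the mere
presence of the `partition` attribute, whose content is the partition number,
as the indicator that the device is a partition.
```
import os

_sys_dev_block_path = '/sys/dev/block'

def is_partition(item):
    # /sys/dev/block/<maj:min>/partition exists only for partitions and
    # holds the partition number (e.g. '2'), not necessarily '1'.
    return os.path.exists(os.path.join(_sys_dev_block_path, item, 'partition'))
```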
Wei Wang [Mon, 29 Jan 2024 08:26:24 +0000 (08:26 +0000)]
mon: fix health store size growing infinitely
The `check_mutes` function wrongly sets `changed` to true, which triggers `propose_pending` and blocks the following `maybe_trim` logic (`have_pending` will always be false); as a result, the health store is never trimmed.
Kefu Chai [Wed, 24 Feb 2021 04:34:26 +0000 (12:34 +0800)]
async/dpdk: do not use worker id when creating worker
so we can drop the `i` parameter in a later change.
also restructure DPDKStack::spawn_worker() to capture variables by value
instead of by reference; we cannot assume that variables allocated on the
stack are still available when the function is scheduled on another
stack and core.
Kefu Chai [Wed, 24 Feb 2021 04:06:45 +0000 (12:06 +0800)]
async/rdma: initialize worker in RDMAStack::create_worker()
in ff65c800b3e1f3f7e3989223b9bde4cbbaf5c076, we moved the create_worker()
call out of the constructor to avoid calling virtual functions in the
constructor. but this created a regression where RDMAStack's constructor
tries to reference workers that have not yet been created.
in this change, the workers are initialized right after they are
created.
Zac Dover [Fri, 2 Feb 2024 01:53:45 +0000 (11:53 +1000)]
doc/rados: update config for autoscaler
Update doc/rados/configuration/pool-pg-config-ref.rst to account for the
behavior of autoscaler.
Previously, this file was last meaningfully altered in 2013, prior to
the invention of autoscaler. A recent confusion was brought to my
attention on the Ceph Slack whereby a user attempted to alter the
default values of a Quincy cluster, as suggested in this documentation.
That alteration caused Ceph to throw the error "Error ERANGE: 'pgp_num'
must be greater than 0 and lower or equal than 'pg_num', which in this
case is one" and a related "rgw_init_ioctx ERROR" reading in part
"Numerical result out of range". The user removed the
"osd_pool_default_pgp_num" configuration line from ceph.conf and the
cluster worked as expected. I presume that this is because the removal
of this configuration line allowed autoscaler to work as intended.
Fixes: https://tracker.ceph.com/issues/64259
Co-authored-by: David Orman <ormandj@corenode.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit 4dc12092be584da44baca14e31ca33231164235f)
Without cluster_log_to_file we have nothing to grep for errors:
2023-10-27T16:06:59.111 DEBUG:teuthology.orchestra.run.smithi150:> sudo egrep '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/38cc7fce-74d9-11ee-8db9-212e2dc638e7/ceph.log | egrep -v '\(MDS_ALL_DOWN\)' | egrep -v '\(MDS_UP_LESS_THAN_MAX\)' | head -n 1
2023-10-27T16:06:59.141 INFO:teuthology.orchestra.run.smithi150.stderr:grep: /var/log/ceph/38cc7fce-74d9-11ee-8db9-212e2dc638e7/ceph.log: No such file or directory
Set mon_cluster_log_to_file = true.
Fixes: https://tracker.ceph.com/issues/63425
Signed-off-by: Dan van der Ster <dan.vanderster@clyso.com>
(cherry picked from commit 822e6b01f909dff79b336bb2fc029de4663b428a)
NitzanMordhai [Thu, 23 Nov 2023 12:01:03 +0000 (12:01 +0000)]
mgr/BaseMgrModule: Optimize CPython Call in Finish Function
Remove the CPython overhead of packing a tuple in the 'finish' function to
improve memory consumption when we deal with long string outputs.
When modules like Restful return large amounts of output, using
PyObject_CallFunction without creating an intermediate PyObject reduces the
time the memory is held by the mgr.
This makes ceph-volume report partitions in inventory.
A partition is a valid device for `ceph-volume lvm prepare`,
so we should report partitions in inventory (when using the `--list-all`
parameter).
Nizamudeen A [Tue, 16 Jan 2024 05:21:56 +0000 (10:51 +0530)]
admin/doc-requirements: bump Sphinx to 5.0.2
```
Running Sphinx v4.5.0
Sphinx version error:
The sphinxcontrib.applehelp extension used by this project needs at least Sphinx v5.0; it therefore cannot be built with this version.
```
Ilya Dryomov [Sat, 6 Jan 2024 16:08:04 +0000 (17:08 +0100)]
librbd: try to preserve object map for diff-iterate in fast-diff mode
As an optimization, try to ensure that the object map for the end
version is preloaded through the acquisition of exclusive lock and
as a consequence remains around until exclusive lock is released.
If it's not around, DiffRequest would (re)load it on each call.
Ilya Dryomov [Sat, 6 Jan 2024 16:05:39 +0000 (17:05 +0100)]
librbd/object_map: potentially use in-memory object map in DiffRequest
If the object map for the end version is around (already loaded in
memory, either due to the end version being a snapshot or due to
exclusive lock being held), use it to run diff-iterate against the
beginning of time. Since it's the only object map needed in that
case, such calls would be satisfied locally.
Conflicts:
src/test/librbd/mock/MockObjectMap.h [ commit 87459c23aa05
("librbd,rbd_mirror: do not include RWLock.h unless it is
used") not in pacific ]
Ilya Dryomov [Fri, 5 Jan 2024 12:15:54 +0000 (13:15 +0100)]
librbd/object_map: decouple object map processing in DiffRequest
In preparation for potentially using in-memory object map, decouple
object map processing from loading object maps and place the logic in
prepare_for_object_map() and process_object_map().
Ilya Dryomov [Fri, 5 Jan 2024 11:23:24 +0000 (12:23 +0100)]
common/bit_vector: fix iterator vs reference constness confusion
T (ConstIterator or Iterator) is confused with const T here:
IteratorImpl dereference operator is wrongly overloaded on const
and returns Reference instead of ConstReference for ConstIterator.
This then fails inside bufferlist bowels because Reference is
incompatible with bufferlist::const_iterator.
Ilya Dryomov [Thu, 4 Jan 2024 10:39:20 +0000 (11:39 +0100)]
librbd/object_map: don't resize object map in handle_load_object_map()
Currently it's done in two cases:
- if the loaded object map is larger than expected based on byte size,
it's truncated to expected number of objects
- in case of deep-copy, if the loaded object map is smaller than diff
state, it's expanded to get "track the largest of all versions in the
set" semantics
Both of these cases can be easily dealt with without modifying the
object map. Being able to process a const object map is needed for
working on in-memory object map which is external to DiffRequest.
It's totally broken: instead of returning the current position and
moving to the next position, it returns the next position and doesn't
move anywhere. Luckily it hasn't been used until now.
Ilya Dryomov [Thu, 28 Dec 2023 09:14:18 +0000 (10:14 +0100)]
librbd: propagate diff-iterate range to parent in fast-diff mode
When getting parent diff, pass the overlap-reduced image extent instead
of the entire 0..overlap range to avoid a similar quadratic slowdown on
cloned images.
Ilya Dryomov [Wed, 27 Dec 2023 17:07:05 +0000 (18:07 +0100)]
librbd/object_map: add support for ranged diff-iterate
Currently diff-iterate in fast-diff mode is performed on the entire
image no matter what image extent is passed to the API. Then, unused
diff just gets discarded as DiffIterate ends up querying only objects
that the passed image extent maps to. This hasn't been an issue for
internal consumers ("rbd du", "rbd diff", etc) because they work on the
entire image, but it turns out to lead to quadratic slowdown in some QEMU
use cases.
The 0..UINT64_MAX range is carved out for deep-copy, which is unranged by
definition. To get an effectively unranged diff-iterate, the 0..UINT64_MAX-1
range can be used.
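For context, a hedged sketch of a ranged diff-iterate call through the rbd
Python binding (image name, snapshot and range are placeholders; the exact
signature may vary between releases):
```
import rbd

def count_changed_bytes(ioctx, image_name, from_snap, offset, length):
    # Sum the bytes reported as changed within [offset, offset + length).
    changed = [0]

    def cb(off, ext_len, exists):
        # Invoked once per changed extent inside the requested range.
        if exists:
            changed[0] += ext_len

    with rbd.Image(ioctx, image_name, read_only=True) as image:
        # With the ranged support described above, fast-diff only needs to
        # consider the objects this extent maps to, not the whole image.
        image.diff_iterate(offset, length, from_snap, cb)
    return changed[0]
```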
Ilya Dryomov [Sat, 23 Dec 2023 14:19:09 +0000 (15:19 +0100)]
test/librbd: expand TestMockObjectMapDiffRequest edge case coverage
For each covered edge case or error, run through the following
scenarios:
- where the edge case concerns snap_id_start
- where the edge case concerns snap_id_end
- where the edge case concerns intermediate snapshot and
snap_id_start == 0 (diff against the beginning of time)
- where the edge case concerns intermediate snapshot and
snap_id_start != 0 (diff from snapshot)
Ilya Dryomov [Sat, 23 Dec 2023 13:47:54 +0000 (14:47 +0100)]
librbd/object_map: allow intermediate snaps to be skipped on diff-iterate
In case of diff-iterate against the beginning of time, the result
depends only on the end version. Loading and processing object maps
of intermediate snapshots is redundant and can be skipped.
This optimization is made possible by commit be507aaed15f ("librbd:
diff-iterate shouldn't ever report "new hole" against a hole") and, to
a lesser extent, the previous commit.
Getting FastDiffInvalid, LoadObjectMapError and ObjectMapTooSmall to
pass required tweaking not just expectations, but also start/end snap
ids and thus also the meaning of these tests. This is addressed in the
next commit.
Ilya Dryomov [Fri, 22 Dec 2023 17:50:20 +0000 (18:50 +0100)]
librbd/object_map: resurrect diff-iterate behavior when image is shrunk
The new "track the largest of all versions in the set, diff state is
only ever grown" semantics introduced in commit 330f2a7bb94f ("librbd:
helper state machine for computing diffs between object-maps") don't
make sense for diff-iterate. It's a waste because DiffIterate won't
query beyond the end version size -- this is baked into the API.
Limit this behavior to deep-copy and resurrect the original behavior
from 2015 for diff-iterate.
Ilya Dryomov [Fri, 22 Dec 2023 15:10:12 +0000 (16:10 +0100)]
librbd/object_map: fix diff from snapshot when image is grown
Commit 399a45e11332 ("librbd/object_map: rbd diff between two
snapshots lists entire image content") fixed most of the damage caused
by commit b81cd2460de7 ("librbd/object_map: diff state machine should
track object existence"), but the case of a "resize diff" when diffing
from snapshot was missed. An area that was freshly allocated in image
resize is the same in principle as a freshly created image and objects
marked OBJECT_EXISTS_CLEAN are no exception. Diff for such objects in
such an area should be set to DIFF_STATE_DATA_UPDATED; however,
currently when diffing from snapshot, it's set to DIFF_STATE_DATA.
Ilya Dryomov [Wed, 20 Dec 2023 11:22:17 +0000 (12:22 +0100)]
librbd/object_map: drop bogus if in handle_load_object_map()
It became redundant with commit b81cd2460de7 ("librbd/object_map: diff
state machine should track object existence") -- it != end_it condition
in the loop is sufficient.
In preparation for multiple similarly configured MockTestImageCtx
objects being used in a single test, centralize their creation and add
a couple of helpers for setting expectations from a callback.
rgw/sts: code to check the IAM policy and return an
appropriate error in case the Resource specified in the
IAM policy is incorrect and is discarded. The IAM
policy can be a resource policy or an identity policy.
This is for policies that have already been set.
rgw/sts: code for returning an error when an IAM policy
resource belongs to someone else's tenant.
While parsing the policy it discards the resource element,
but then when an operation is evaluated, since the resource element
is empty, it doesn't evaluate the resource at all and the policy
ends up erroneously allowing actions on resources in other tenants.
Eliminate m_max_recent and set the capacity of the m_recent ring buffer
when log_max_recent changes. In order to call set_capacity(),
ConcreteEntry needed its move constructor marked noexcept.
I haven't followed the boost code all the way down but I suspect that
setting the ring buffer capacity to anything less than 1 entry will
probably cause problems, so restrict log_max_recent to >=1.
Also fix a wrong variable used for printing the max new entries during
"log dump".
Initialize standalone test for stretched clusters,
testing uneven weight warnings and != 2 buckets
warnings.
Added a `wait_for_health_gone()` function in ceph-helpers.sh;
this function allows us to wait for a health condition to
disappear when doing standalone tests.
mon/Monitor: during shutdown don't accept new authentication and create new sessions
During shutdown, the monitor is designed not to accept new authentication requests
or create new sessions. However, a problem arises when the monitor marks its status
as "shutdown" but still accepts new authentication requests and creates new sessions.
This issue causes the monitor to fail when checking the session list.
To fix this problem, an update is implemented. With this fix,
a monitor in the "shutdown" state will correctly reject new authentication requests
and prevent the creation of new sessions.
This ensures that the monitor operates as intended during the shutdown process.
NitzanMordhai [Thu, 16 Nov 2023 07:09:29 +0000 (07:09 +0000)]
Tools/rados: Improve Error Messaging for Object Name Resolution
The current implementation of 'rados clearomap' exhibits a behavior where
an error message is generated without the associated object name or,
in the case of a non-existent object name, may result in a segmentation fault.
The proposed fix addresses this issue by enhancing the error message.
After applying the fix, error messages will consistently display the correct
object name, providing users with more accurate and actionable information.