osd: Retain overridden mClock recovery settings across osd restarts
Fix an issue where an overridden mClock recovery setting (set prior to
an osd restart) could be lost after an osd restart.
For example, consider that prior to an osd restart the option
'osd_max_backfills' was successfully set to a value different from the
mClock default. If the osd was then restarted for some reason, the
boot-up sequence incorrectly reset the backfill value to the mClock
default within the async local/remote reservers. This fix ensures that
no change is made if the current overridden value differs from the
mClock default.
Modify an existing standalone test to verify that the local and remote
async reservers are updated to the desired number of backfills under
normal conditions and also across osd restarts.
osd: Set default max active recovery and backfill limits for mClock
Client ops are sensitive to recovery load, so the recovery and backfill
limits must be set carefully for osds whose underlying device is an HDD.
Tests revealed that recoveries with osd_max_backfills = 10 and
osd_recovery_max_active_hdd = 5 were still aggressive and overwhelmed
client ops. The built-in defaults for mClock have therefore been lowered
to more conservative values.
Previously, setting default configs from the configured profile was
split across:
- enable_mclock_profile_settings
- set_mclock_profile - sets mclock_profile class member
- set_*_allocations - updates client_allocs class member
- set_profile_config - sets profile based on client_allocs class member
This made tracing the effect of changing the profile quite challenging
due to state being passed through class member variables.
Instead, define a simple profile_t with three constexpr values
corresponding to the three profiles and handle it all in a single
set_config_defaults_from_profile() method.
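A minimal sketch of the shape of this change; the field names and numeric
allocations below are placeholders, not the actual Ceph definitions:

struct profile_t {
  // per-class allocations (reservation, weight, limit); placeholder fields
  double client_res, client_wgt, client_lim;
  double recovery_res, recovery_wgt, recovery_lim;
};

// One constexpr value per built-in mClock profile (values are illustrative).
constexpr profile_t high_client_ops{0.6, 2.0, 0.0, 0.2, 1.0, 0.5};
constexpr profile_t balanced{0.5, 1.0, 0.0, 0.5, 1.0, 0.0};
constexpr profile_t high_recovery_ops{0.3, 1.0, 0.8, 0.6, 2.0, 0.0};

// The old enable/set_mclock_profile/set_*_allocations/set_profile_config
// plumbing collapses into a single method that copies the selected profile's
// values into the scheduler's config defaults.
void set_config_defaults_from_profile(const profile_t& p) {
  // apply p.client_res, p.client_wgt, p.client_lim, ... to the config options
}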
osd: Modify mClock scheduler's cost model to represent cost in bytes
The mClock scheduler's cost model for HDDs/SSDs is modified and now
represents the cost of an IO in terms of bytes.
The cost parameters osd_mclock_cost_per_io_usec_[hdd|ssd]
and osd_mclock_cost_per_byte_usec_[hdd|ssd], which represented the cost
of an IO in terms of time, proved to be inaccurate and have been removed.
The new model considers the following aspects of an osd to calculate
the cost of an IO:
- osd_mclock_max_capacity_iops_[hdd|ssd] (existing option)
The measured random write IOPS at a 4 KiB block size. This is
measured during OSD boot-up using the OSD bench tool.
- osd_mclock_max_sequential_bandwidth_[hdd|ssd] (new config option)
The maximum sequential bandwidth of the underlying device.
The cost calculation uses 150 MiB/s for HDDs and 750 MiB/s for SSDs.
The following important changes are made to arrive at the overall
cost of an IO:
1. Represent the QoS reservation and limit config parameters as proportions:
The reservation and limit parameters are now set as a proportion of the
OSD's max IOPS capacity. The earlier representation was in terms of IOPS
per OSD shard, which required the user to perform calculations before
setting the parameter. Representing the reservation and limit as
proportions is much more intuitive and simpler for a user.
2. Cost per IO calculation:
Using the above config options, osd_bandwidth_cost_per_io for the osd is
calculated and set. It is the ratio of the max sequential bandwidth to
the max random write IOPS of the osd. It is a constant and represents the
base cost of an IO in terms of bytes. This is added to the actual size of
the IO (in bytes) to arrive at the overall cost of the IO operation. See
mClockScheduler::calc_scaled_cost() and the sketch after this list.
3. Cost calculation in bytes:
The reservation and limit settings, expressed as a fraction of the OSD's
maximum IOPS capacity, are converted to bytes/sec before updating the
mClock server's ClientInfo structure. This is done for each OSD op shard
using osd_bandwidth_capacity_per_shard (see the sketch after this list).
The converted values are stored in the mClock server's ClientInfo
structure for the different op_scheduler_class operations. See
mClockScheduler::ClientRegistry::update_from_config().
The overall cost of an IO operation (in secs) is finally determined
during the tag calculations performed in the mClock server. See
crimson::dmclock::RequestTag::tag_calc() for more details.
4. Profile Allocations:
Optimize mClock profile allocations due to the change in the cost model
and lower recovery cost.
5. Modify standalone tests to reflect the change in the QoS config
parameter representation of reservation and limit options.
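The sketch below pulls together the calculations from items 2 and 3. It is
only illustrative: the option names mirror those mentioned above, but the
exact clamping, rounding and shard handling in mClockScheduler may differ.

#include <cstdint>

// Illustrative sketch of the cost model described above; not the actual
// mClockScheduler code.
struct osd_capacity_params {
  double max_iops;           // osd_mclock_max_capacity_iops_[hdd|ssd]
  double max_seq_bandwidth;  // osd_mclock_max_sequential_bandwidth_[hdd|ssd], bytes/sec
  unsigned num_op_shards;    // number of OSD op shards
};

// Base cost of any IO in bytes: ratio of sequential bandwidth to random IOPS.
double bandwidth_cost_per_io(const osd_capacity_params& c) {
  return c.max_seq_bandwidth / c.max_iops;
}

// Overall cost of an IO in bytes: base cost plus the IO size, mirroring the
// intent of mClockScheduler::calc_scaled_cost().
double scaled_cost(const osd_capacity_params& c, uint64_t io_size_bytes) {
  return bandwidth_cost_per_io(c) + static_cast<double>(io_size_bytes);
}

// Per-shard bandwidth capacity used to turn the reservation/limit fractions
// into bytes/sec for the mClock ClientInfo structure (assumed here to be the
// max sequential bandwidth divided across the op shards).
double bandwidth_capacity_per_shard(const osd_capacity_params& c) {
  return c.max_seq_bandwidth / c.num_op_shards;
}

double allocation_bytes_per_sec(const osd_capacity_params& c, double fraction) {
  return fraction * bandwidth_capacity_per_shard(c);
}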
Fixes: https://tracker.ceph.com/issues/58529
Fixes: https://tracker.ceph.com/issues/59080
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
osd: update PGRecovery queue item cost to reflect object size
Previously, we used a static value of osd_recovery_cost (20M
by default) for PGRecovery. For pools with relatively small
objects, this causes mclock to backfill very slowly, as
20M massively overestimates the amount of IO each recovery
queue operation requires. Instead, add a cost_per_object
parameter to OSDService::awaiting_throttle and set it to the
average object size in the PG being queued.
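A hypothetical helper showing the idea (the actual change derives this value
from the PG's stats when queueing into OSDService::awaiting_throttle):

#include <algorithm>
#include <cstdint>

// Hypothetical helper: estimate the per-object recovery cost from the PG's
// own byte and object counts instead of the fixed osd_recovery_cost.
uint64_t average_object_size(uint64_t pg_num_bytes, uint64_t pg_num_objects) {
  if (pg_num_objects == 0) {
    return 1;  // avoid a zero cost for an empty PG
  }
  return std::max<uint64_t>(1, pg_num_bytes / pg_num_objects);
}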
Fixes: https://tracker.ceph.com/issues/58606
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
osd: update OSDService::queue_recovery_context to specify cost
Previously, we always queued this with cost osd_recovery_cost which
defaults to 20M. With mclock, this caused these items to be delayed
heavily. Instead, base the cost on the operation queued.
Fixes: https://tracker.ceph.com/issues/58606
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
osd/osd_types: use appropriate cost value for PullOp
See included comments -- previous values did not account for object
size. This causes problems for mclock which is much more strict
in how it interprets costs.
Fixes: https://tracker.ceph.com/issues/58607
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
osd/osd_types: use appropriate cost value for PushReplyOp
See included comments -- previous values did not account for object
size. This causes problems for mclock which is much more strict
in how it interprets costs.
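A hedged sketch of the common idea behind the PullOp and PushReplyOp changes
(the real estimates in osd_types.cc are derived from the recovery-progress
state and differ in detail): base the cost on the object data the op will
move, capped at the recovery chunk size.

#include <algorithm>
#include <cstdint>

// Hypothetical estimate: account for the object's size instead of a flat
// constant, clamped to the configured per-op recovery chunk
// (osd_recovery_max_chunk is an existing config option).
uint64_t estimate_recovery_op_cost(uint64_t object_size_bytes,
                                   uint64_t osd_recovery_max_chunk) {
  return std::clamp<uint64_t>(object_size_bytes, 1, osd_recovery_max_chunk);
}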
Fixes: https://tracker.ceph.com/issues/58529
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
doc/start: edit first 50 lines of documenting-ceph
Edit the first 150 lines of doc/start/documenting-ceph.rst. This is part
of an initiative to harvest the fruits of Cephalocon 2023, at which
documentation proved to be in demand to a surprising degree.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit dd37f94aa4f1de947b1eaf5d82cc529925f5823e)
Line-edit doc/rados/user-management.rst (2 of x). Some internal
references had to be removed, but these will be repaired when the next
part of this file is updated in a future PR.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit e3575bb72f307a27d49fedf3692ca661e3d613a5)
Conrad Hoffmann [Wed, 22 Mar 2023 22:03:57 +0000 (23:03 +0100)]
doc: account for PG autoscaling being the default
The current documentation tries really hard to convince people to set
both `osd_pool_default_pg_num` and `osd_pool_default_pgp_num` in their
configs, but at least the latter has undesirable side effects on any
Ceph version that has PG autoscaling enabled by default (at least quincy
and beyond).
Assume a cluster with defaults of `64` for `pg_num` and `pgp_num`.
Starting `radosgw` will fail as it tries to create various pools without
providing values for `pg_num` or `pgp_num`. This triggers the following
in `OSDMonitor::prepare_new_pool()`:
- `pg_num` is set to `1`, because autoscaling is enabled
- `pgp_num` is set to `osd pool default pgp_num`, which we set to `64`
- This is an invalid setup, so the pool creation fails
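A minimal sketch of why that combination is rejected (not the actual
OSDMonitor code): pgp_num may not exceed pg_num.

#include <cstdint>
#include <iostream>

// Not the actual OSDMonitor check, just its shape.
bool valid_pool_pg_settings(uint32_t pg_num, uint32_t pgp_num) {
  return pgp_num <= pg_num;
}

int main() {
  // Autoscaler picks pg_num = 1; osd_pool_default_pgp_num forces pgp_num = 64.
  std::cout << std::boolalpha << valid_pool_pg_settings(1, 64) << '\n';  // prints false
}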
Likewise, `ceph osd pool create mypool` (without providing values for
`pg_num` or `pgp_num`) does not work.
Following this rationale:
- Not providing a default value for `pgp_num` will always do the right
thing, unless you use advanced features, in which case you can be
expected to set both values on pool creation
- Setting `osd_pool_default_pgp_num` in your config breaks pool creation
for various cases
This commit:
- Removes `osd_pool_default_pgp_num` from all example configs
- Adds mentions of the autoscaling and how it interacts with the default
values in various places
For each file that was touched, the following maintenance was also
performed:
- Change internal spaces to underscores for config values
- Remove mentions of filestore or any of its settings
- Fix minor inconsistencies, like indentation etc.
There is also a ticket that I think is very relevant and fixed by this
change, though it only captures part of the broader issue addressed here.
qa/suites/rbd: install qemu-utils in addition to qemu-block-extra on Ubuntu
qemu-utils is usually pre-installed but, due to what appears to be
an Ubuntu packaging bug, it is not upgraded when qemu-block-extra is
installed:
The following NEW packages will be installed:
qemu-block-extra
The following packages will be upgraded:
qemu-system-common qemu-system-data qemu-system-gui qemu-system-x86
However, the version of the block driver must match exactly the version
of the qemu-img tool, so the above leads to:
$ qemu-img convert -f qcow2 -O raw /home/ubuntu/cephtest/qemu/base.client.0.0.qcow2 rbd:rbd/client.0.0
Failed to initialize module: /usr/lib/x86_64-linux-gnu/qemu/block-rbd.so
Note: only modules from the same build can be loaded.
qemu: module block-block-rbd not found, do you want to install qemu-block-extra package?
qemu-img: Unknown protocol 'rbd'
Matt Benjamin [Thu, 15 Dec 2022 19:55:16 +0000 (14:55 -0500)]
rgw/notifications: fetch object state to get size, in rgw_lc.cc
Failure to call get_obj_state() leaves object size and other members
uninitialized, and appears to result in in lc delete notifications
with 0 for object size.
Fixes: https://tracker.ceph.com/issues/58287
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
(cherry picked from commit b20a66767f782c06258fb0a5551ee45d6dccb91c)
Vedansh Bhartia [Thu, 2 Mar 2023 13:04:53 +0000 (18:34 +0530)]
rgw: use unique_ptr for flat_map emplace in BucketTrimWatcher
When emplacing objects into the trim notify handler of
BucketTrimWatcher, use a unique_ptr for the handler so that it is
destroyed if the emplace fails.
Though the destructor happens to be called in this case, that behaviour
cannot be relied upon. std::map does not exhibit the same behaviour and
would have leaked memory had it been used instead.
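A small illustration of the pattern (the key and handler types below are
placeholders, not the actual BucketTrimWatcher members): owning the handler
through a std::unique_ptr before calling emplace means it is destroyed even
when the emplace does not insert it.

#include <boost/container/flat_map.hpp>
#include <memory>

struct TrimNotifyHandler {  // placeholder for the real handler type
  virtual ~TrimNotifyHandler() = default;
};

int main() {
  boost::container::flat_map<int, std::unique_ptr<TrimNotifyHandler>> handlers;

  // Risky: if the emplace never constructs the mapped value (for example the
  // key already exists), the raw pointer argument is leaked.
  // handlers.emplace(0, new TrimNotifyHandler());

  // Safer: the handler is owned before emplace is called, so it is freed
  // whether or not the insertion takes place.
  auto handler = std::make_unique<TrimNotifyHandler>();
  handlers.emplace(0, std::move(handler));
}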
Matt Benjamin [Sat, 11 Mar 2023 19:58:54 +0000 (14:58 -0500)]
Do not duplicate query-string in ops-log
Fixes: https://tracker.ceph.com/issues/59059
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
(cherry picked from commit 3f2313f0e67c444407139c80dff596c5d5b5903e)
Yuval Lifshitz [Sun, 26 Mar 2023 10:02:17 +0000 (10:02 +0000)]
rgw/notifications: support bucket notification with bucket policy
The following policy should be used to allow any user to get, put, and delete
bucket notifications on a bucket called "my-bucket":
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Statement",
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["s3:GetBucketNotification", "s3:PutBucketNotification"],
      "Resource": "arn:aws:s3:::my-bucket"
    }
  ]
}
note that notification deletion uses the "PUT" permission.
J. Eric Ivancich [Sat, 18 Mar 2023 18:35:39 +0000 (14:35 -0400)]
rgw: add unordered listing to reindex to force stats update
By including an unordered listing in the script, we will complete
placing objects in the bucket index and allow stats to be updated
rather than waiting for this to happen organically at a user's
request. Unordered is preferred as it can run more efficiently.
J. Eric Ivancich [Wed, 15 Mar 2023 13:26:07 +0000 (09:26 -0400)]
rgw: install rgw scripts with common files rather than radosgw files
Update ceph.spec.in and debian install files so
rgw-restore-bucket-index, rgw-orphan-list, rgw-gap-list,
rgw-gap-list-comparator are installed with common files.
Tongliang Deng [Tue, 13 Dec 2022 06:42:34 +0000 (06:42 +0000)]
rgw/sse-s3: fix bucket encryption of multipart upload
Multipart upload missing encryption when we have bucket encryption
policy. Fix it by fetching bucket encryption policy and resolving
defaults at multipart init op.
Fixes: https://tracker.ceph.com/issues/59218
Signed-off-by: Tongliang Deng <dengtongliang@gmail.com>
(cherry picked from commit 6d9e4f7924c6149d23919ef82bc09406e1290164)
Invalid URL concatenation prevents some OpenID Connect providers from working
with RGW and the AssumeRoleWithWebIdentity API; the invalid URLs contain a
double slash `//`. This fix ensures that the ISS (issuer URL) is properly
joined to the .well-known path.
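A hedged sketch of the kind of join the fix performs (the helper name is
illustrative, not the RGW function): strip any trailing slash from the issuer
before appending the well-known path.

#include <iostream>
#include <string>

// Illustrative only; the real RGW code assembles the URL differently.
std::string join_well_known(std::string iss) {
  while (!iss.empty() && iss.back() == '/') {
    iss.pop_back();  // avoids "https://issuer//.well-known/..."
  }
  return iss + "/.well-known/openid-configuration";
}

int main() {
  std::cout << join_well_known("https://accounts.example.com/") << '\n';
}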
Marcus Watts [Fri, 8 May 2020 05:41:35 +0000 (01:41 -0400)]
rgw/civetweb: handle old clients with transfer-encoding: chunked.
s3 clients *should* provide an x-amz-decoded-content-length field
when they use transfer-encoding: chunked. Some clients do not.
With swift we already allow chunked uploads that do not specify the
content length in advance. This commit adds similar support
for s3. Known client affected by this: boto2.
lichaochao [Tue, 28 Mar 2023 03:17:26 +0000 (05:17 +0200)]
rgw: fix rgw cache invalidation after unregister_watch() error
When a metadata osd fails, an unregister_watch() error may occur,
resulting in an rgw cache invalidation.
By adding an unregister_done flag and when a register_watch() error ,
performing a reinit() operation again,
After the first reinit() failure, the register_watch() will be performed again
Fixes: https://tracker.ceph.com/issues/59217
Signed-off-by: lichaochao <lichaochao2_yewu@cmss.chinamobile.com>
(cherry picked from commit f9aae71af3ad8eee5996c31544d98041968dbbec)
Casey Bodley [Thu, 23 Mar 2023 19:02:51 +0000 (15:02 -0400)]
rgw: RGWCopyObj loads src_bucket in init_processing()
If `RGWCopyObj::verify_permissions()` returns an error, it may leave
some zipper objects uninitialized. When the user has admin or system
privileges, we ignore that error and call `execute()` anyway. This
moves the initialization into `RGWCopyObj::init_processing()` instead.