Samuel Just [Thu, 6 Apr 2023 05:57:42 +0000 (22:57 -0700)]
osd/: differentiate scheduler class for undersized/degraded vs data movement
Recovery operations on pgs/objects that have fewer than the configured
number of copies should be treated more urgently than operations on
pgs/objects that simply need to be moved to a new location.
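Conceptually (a minimal sketch; the enum values mirror the OSD scheduler classes, but the helper is hypothetical):
```
#include <cstdint>

// Scheduler classes, mirroring the OSD's op_scheduler_class values.
enum class op_scheduler_class : uint8_t {
  background_best_effort,  // data movement to a new location (e.g. balancing)
  background_recovery,     // undersized/degraded: fewer copies than configured
  client,
  immediate,
};

// Hypothetical helper: pick the more urgent class only when copies are missing.
inline op_scheduler_class recovery_scheduler_class(bool undersized_or_degraded) {
  return undersized_or_degraded
    ? op_scheduler_class::background_recovery
    : op_scheduler_class::background_best_effort;
}
```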
Samuel Just [Tue, 4 Apr 2023 23:34:17 +0000 (23:34 +0000)]
osd/scheduler: simplify qos specific params in OpSchedulerItem
is_qos_item() was only used in operator<< for OpSchedulerItem. However,
it's actually useful to see the priority for mclock items, since priority
affects whether an item goes into the immediate queues and, for some
types, its class. Unconditionally display both class_id and priority.
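For illustration only, with a stand-in type (not the real OpSchedulerItem):
```
#include <iostream>

// Minimal stand-in for OpSchedulerItem, just enough to show the idea.
struct Item {
  int class_id;    // scheduler class (client, background_recovery, ...)
  unsigned prio;   // priority; high values may route to the immediate queues
  unsigned cost;
};

// Always print both class_id and priority instead of gating on is_qos_item().
std::ostream& operator<<(std::ostream& out, const Item& item) {
  return out << "Item(class_id " << item.class_id
             << " prio " << item.prio
             << " cost " << item.cost << ")";
}

int main() {
  std::cout << Item{1, 127, 4096} << "\n";  // example output
}
```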
osd: Retain overridden mClock recovery settings across osd restarts
Fix an issue where an overridden mClock recovery setting (set prior to
an osd restart) could be lost after an osd restart.
For example, consider that prior to an osd restart, the option
'osd_max_backfills' was successfully set to a value different from the
mClock default. If the osd was restarted for some reason, the
boot-up sequence was incorrectly resetting the backfill value to the
mClock default within the async local/remote reservers. This fix
ensures that no change is made if the current overridden value is
different from the mClock default.
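A minimal sketch of the guard, with hypothetical names:
```
#include <cstdint>

// Sketch only: apply the mClock default to the async reservers at boot
// without clobbering an admin override.
struct AsyncReserver {
  uint64_t max_allowed = 0;
  void set_max(uint64_t m) { max_allowed = m; }
};

inline void apply_backfill_default(AsyncReserver& local, AsyncReserver& remote,
                                   uint64_t current_conf_value,
                                   uint64_t mclock_default) {
  if (current_conf_value != mclock_default) {
    // osd_max_backfills was overridden before the restart; keep it.
    return;
  }
  local.set_max(mclock_default);
  remote.set_max(mclock_default);
}
```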
Modify an existing standalone test to verify that the local and remote
async reservers are updated to the desired number of backfills under
normal conditions and also across osd restarts.
osd: Set default max active recovery and backfill limits for mClock
Client ops are sensitive to recovery load, so the recovery and backfill
limits must be set carefully for osds whose underlying device is an HDD.
Tests revealed that recoveries with osd_max_backfills = 10 and
osd_recovery_max_active_hdd = 5 were still too aggressive and overwhelmed
client ops. The built-in defaults for mClock are therefore set to more
conservative values.
Previously, setting default configs from the configured profile was
split across:
- enable_mclock_profile_settings
- set_mclock_profile - sets mclock_profile class member
- set_*_allocations - updates client_allocs class member
- set_profile_config - sets profile based on client_allocs class member
This made tracing the effect of changing the profile challenging,
due to passing state through class member variables.
Instead, define a simple profile_t with three constexpr values
corresponding to the three profiles and handle it all in a single
set_config_defaults_from_profile() method.
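A condensed sketch of that shape (field names and numeric values are illustrative placeholders, not the shipped profiles):
```
#include <cstdint>

// Sketch only; field names and values are illustrative.
struct dmclock_params_t {
  double reservation;   // fraction of the OSD's IOPS capacity
  uint64_t weight;
  double limit;         // fraction of the OSD's IOPS capacity (0 == unlimited)
};

struct profile_t {
  dmclock_params_t client;
  dmclock_params_t background_recovery;
  dmclock_params_t background_best_effort;
};

// Three built-in profiles as plain constants instead of state threaded
// through several class members.
constexpr profile_t high_client_ops{{0.6, 2, 0.0}, {0.3, 1, 0.6}, {0.0, 1, 0.2}};
constexpr profile_t high_recovery_ops{{0.3, 1, 0.6}, {0.6, 2, 0.0}, {0.0, 1, 0.2}};
constexpr profile_t balanced{{0.5, 1, 0.0}, {0.5, 1, 0.0}, {0.0, 1, 0.2}};

// A single method then applies whichever profile is configured, e.g.:
//   void mClockScheduler::set_config_defaults_from_profile();
```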
osd: Modify mClock scheduler's cost model to represent cost in bytes
The mClock scheduler's cost model for HDDs/SSDs is modified and now
represents the cost of an IO in terms of bytes.
The cost parameters osd_mclock_cost_per_io_usec_[hdd|ssd] and
osd_mclock_cost_per_byte_usec_[hdd|ssd], which expressed the cost of an
IO in terms of time, proved inaccurate and are therefore removed.
The new model considers the following aspects of an osd to calculate
the cost of an IO:
- osd_mclock_max_capacity_iops_[hdd|ssd] (existing option)
The measured random write IOPS at 4 KiB block size. This is
measured during OSD boot-up using the OSD bench tool.
- osd_mclock_max_sequential_bandwidth_[hdd|ssd] (new config option)
The maximum sequential bandwidth of the underlying device.
The cost calculation assumes 150 MiB/s for HDDs and 750 MiB/s
for SSDs.
The following important changes are made to arrive at the overall
cost of an IO:
1. Represent QoS reservation and limit config parameter as proportion:
The reservation and limit parameters are now set in terms of a
proportion of the OSD's max IOPS capacity. The earlier representation
was in terms of IOPS per OSD shard which required the user to perform
calculations before setting the parameter. Representing the
reservation and limit in terms of proportions is much more intuitive
and simpler for a user.
2. Cost per IO Calculation:
Using the above config options, osd_bandwidth_cost_per_io for the osd is
calculated and set. It is the ratio of the max sequential bandwidth to the
max random write IOPS of the osd. It is a constant and represents the
base cost of an IO in terms of bytes. This is added to the actual size of
the IO (in bytes) to represent the overall cost of the IO operation. See
mClockScheduler::calc_scaled_cost() and the sketch after this list.
3. Cost calculation in Bytes:
The reservation and limit settings, expressed as a fraction of the OSD's
maximum IOPS capacity, are converted to bytes/sec before updating the
mClock server's ClientInfo structure. This is done for each OSD op shard
using osd_bandwidth_capacity_per_shard (sketched after this list as well).
The resulting values are set in the mClock server's ClientInfo structure
for the different op_scheduler_class operations. See
mClockScheduler::ClientRegistry::update_from_config().
The overall cost of an IO operation (in secs) is finally determined
during the tag calculations performed in the mClock server. See
crimson::dmclock::RequestTag::tag_calc() for more details.
4. Profile Allocations:
Optimize mClock profile allocations due to the change in the cost model
and lower recovery cost.
5. Modify standalone tests to reflect the change in the QoS config
parameter representation of reservation and limit options.
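Roughly, items 2 and 3 amount to the following (a minimal sketch; the names, shard math, and arithmetic here are assumptions, not the exact Ceph code):
```
#include <algorithm>
#include <cstdint>

// Illustrative cost model; all names are assumptions.
struct OsdCaps {
  double max_sequential_bandwidth;  // bytes/sec (osd_mclock_max_sequential_bandwidth_[hdd|ssd])
  double max_random_write_iops;     // osd_mclock_max_capacity_iops_[hdd|ssd]
  uint32_t num_shards;              // OSD op shards
};

// Item 2: constant base cost of an IO, in bytes.
inline double bandwidth_cost_per_io(const OsdCaps& caps) {
  return caps.max_sequential_bandwidth / caps.max_random_write_iops;
}

// Item 2: overall cost of an IO = base cost + actual IO size (bytes).
inline double scaled_cost(const OsdCaps& caps, uint64_t io_size_bytes) {
  return bandwidth_cost_per_io(caps) + static_cast<double>(io_size_bytes);
}

// Item 3: convert a reservation/limit fraction into bytes/sec for the
// mClock ClientInfo, per OSD op shard.
inline double qos_fraction_to_bytes_per_sec(const OsdCaps& caps, double fraction) {
  const double capacity_per_shard =
      caps.max_sequential_bandwidth / std::max<uint32_t>(1u, caps.num_shards);
  return fraction * capacity_per_shard;
}
```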
Fixes: https://tracker.ceph.com/issues/58529
Fixes: https://tracker.ceph.com/issues/59080
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
osd: update PGRecovery queue item cost to reflect object size
Previously, we used a static value of osd_recovery_cost (20M
by default) for PGRecovery. For pools with relatively small
objects, this caused mclock to backfill very slowly, since
20M massively overestimates the amount of IO each recovery
queue operation requires. Instead, add a cost_per_object
parameter to OSDService::awaiting_throttle and set it to the
average object size in the PG being queued.
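For illustration, the per-object cost could be derived like this (a sketch with assumed field names, not the exact change):
```
#include <algorithm>
#include <cstdint>

// Sketch only: derive a per-object recovery cost from the PG's stats
// instead of the fixed osd_recovery_cost (20M by default).
inline uint64_t average_object_size(uint64_t pg_num_bytes,
                                    uint64_t pg_num_objects) {
  return pg_num_bytes / std::max<uint64_t>(1, pg_num_objects);
}
// The PGRecovery item is then queued with a cost based on this value
// rather than the flat 20M default.
```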
Fixes: https://tracker.ceph.com/issues/58606
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
osd: update OSDService::queue_recovery_context to specify cost
Previously, we always queued this with cost osd_recovery_cost, which
defaults to 20M. With mclock, this caused these items to be delayed
heavily. Instead, base the cost on the operation being queued.
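Illustrative only (hypothetical types and signature):
```
#include <cstdint>

// Sketch only: queue a recovery context with a cost derived from the
// operation being queued rather than a fixed osd_recovery_cost default.
struct RecoveryCtx {};

struct RecoveryQueue {
  void enqueue(RecoveryCtx* /*ctx*/, uint64_t /*cost_bytes*/) {
    // stand-in: the real queue hands the item and its cost to the scheduler
  }
};

inline void queue_recovery_context(RecoveryQueue& q, RecoveryCtx* ctx,
                                   uint64_t op_size_bytes) {
  // Cost reflects the actual operation instead of a flat 20M, so mclock
  // no longer delays these items excessively.
  q.enqueue(ctx, op_size_bytes);
}
```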
Fixes: https://tracker.ceph.com/issues/58606
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
osd/osd_types: use appropriate cost value for PullOp
See included comments -- previous values did not account for object
size. This causes problems for mclock which is much more strict
in how it interprets costs.
Fixes: https://tracker.ceph.com/issues/58607
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
osd/osd_types: use appropriate cost value for PushReplyOp
See included comments -- previous values did not account for object
size. This causes problems for mclock which is much more strict
in how it interprets costs.
Fixes: https://tracker.ceph.com/issues/58529
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Zac Dover [Fri, 5 May 2023 06:35:28 +0000 (16:35 +1000)]
doc/cephfs: repairing inaccessible FSes
Add a procedure to doc/cephfs/troubleshooting.rst that explains how to
restore access to file systems that became inaccessible after
post-Nautilus upgrades. The procedure included here was written by Harry
G Coin and only lightly edited by me. I include him here as a
"co-author", but it should be noted that he did the heavy lifting on
this.
See the email thread here for more context:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/HS5FD3QFR77NAKJ43M2T5ZC25UYXFLNW/
Co-authored-by: Harry G Coin <hgcoin@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
Venky Shankar [Thu, 4 May 2023 12:53:20 +0000 (18:23 +0530)]
Merge PR #51224 into main
* refs/pull/51224/head:
doc: add a note for minimum compatible python version and supported distros
tools/cephfs/top/CMakeList.txt: check the minimum compatible python version for cephfs-top
Matan Breizman [Wed, 3 May 2023 09:39:14 +0000 (09:39 +0000)]
crimson/osd/objclass: Compilation warning
```
In copy constructor ‘ceph::buffer::v15_2_0::list::list(const ceph::buffer::v15_2_0::list&)’,
inlined from ‘OSDOp::OSDOp(const OSDOp&)’ at ../src/osd/osd_types.h:4081:8,
inlined from ‘int cls_cxx_snap_revert(cls_method_context_t, snapid_t)’ at ../src/crimson/osd/objclass.cc:279:37:
../src/include/buffer.h:945:20: warning: ‘op.OSDOp::indata.ceph::buffer::v15_2_0::list::_len’ is used uninitialized [-Wuninitialized]
945 | _len(other._len),
| ~~~~~~^~~~
../src/crimson/osd/objclass.cc: In function ‘int cls_cxx_snap_revert(cls_method_context_t, snapid_t)’:
../src/crimson/osd/objclass.cc:279:9: note: ‘op’ declared here
279 | OSDOp op{op = CEPH_OSD_OP_ROLLBACK};
|
```
Matan Breizman [Tue, 2 May 2023 09:07:00 +0000 (09:07 +0000)]
crimson/osd/ops_executer: Fix usage of Message's connection
See #50835.
In crimson, the conn is maintained independently, outside of the Message.
Therefore, when trying to use the message's connection for
`get_orig_source_inst()`, we won't be able to get the peer address.
Venky Shankar [Tue, 2 May 2023 05:58:41 +0000 (11:28 +0530)]
Merge PR #51005 into main
* refs/pull/51005/head:
qa: fix test_nfs_export_creation_at_symlink
qa: update test cases to check for ENOTDIR instead of EINVAL
qa: fix test_nfs_export_with_invalid_path
mgr/nfs: handle exceptions for cephfs_path_is_dir()
mgr/nfs/utils: changes to helper func to check cephfs path
Adam King [Wed, 5 Apr 2023 00:45:23 +0000 (20:45 -0400)]
mgr/cephadm: prefer same hosts as related service daemons when picking arbitrary hosts
For now, this applies just to linking ingress services and
their backend services. The idea is that if one, or both,
of the ingress service and the backend service uses a
count, we try to get them to deploy their daemons
on the same host(s). If the services have explicit
placements (not using count), we still stick to
those placements regardless.
This should enable something like specifying a host
for the backend service, leaving the ingress
placement as just "count: 1", and having the ingress
service land on the same host as the backend service
daemon. This is particularly useful for the keepalive-only
(VIP but no haproxy) over NFS setup, where the keepalive
must share a host with the NFS daemon to function, but it
will also be useful for other VIP-only setups we may do
in the future.
Laura Flores [Mon, 1 May 2023 16:28:54 +0000 (16:28 +0000)]
mgr: add urllib3==1.26.15 to mgr/requirements.txt
We do not depend on any particular version of
urllib3, but as a workaround for the incompatible
urllib3 constraints between kubernetes and
requests, we need to pin it temporarily to
a version both are happy with.
Fixes: https://tracker.ceph.com/issues/59591
Signed-off-by: Laura Flores <lflores@redhat.com>
rgw/cloud-transition: New attrs to detect cloudtiered objects
Add new attrs "RGW_ATTR_CLOUD_TIER_TYPE" and "RGW_ATTR_CLOUD_TIER_CONFIG"
to store details about cloud-tiered objects so that they get synced accordingly
in a multisite environment.
rgw/cloudtransition: Allow multisite zones to sync cloudtiered objects
In a multisite configuration, zones should be able to fetch and sync
cloud-transitioned objects as well. To allow this, a new header
'x-rgwx-sync-cloudtiered' is added, to be used by the sync client to GET
such objects.
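For illustration, the sync client's request would carry the header along these lines (the types and the header value are assumptions):
```
#include <map>
#include <string>

// Sketch only: the sync client tags its GET with the new header so the
// source zone serves cloud-tiered objects.
using HeaderMap = std::map<std::string, std::string>;

inline void add_cloudtier_sync_header(HeaderMap& headers) {
  headers["x-rgwx-sync-cloudtiered"] = "true";  // assumed value
}
```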