Zac Dover [Mon, 22 May 2023 21:41:09 +0000 (07:41 +1000)]
doc/glossary: update bluestore entry
Update the BlueStore entry in the glossary, explaining that as of Reef,
BlueStore and only BlueStore (and not FileStore) is the storage backend
for Ceph.
This topic has been discussed many times; most recently at the Dev
Summit of Cephalocon 2023.
This commit is the minimal version of the work, contained entirely
within the `doc` directory. However, it will likely be expanded, as
there were ideas such as adding cache tiering back to the
experimental-feature list (Sam) to warn users when deploying a new cluster.
doc: Add missing `ceph` command in documentation section `REPLACING AN OSD`
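For context, the "Replacing an OSD" procedure is built around `ceph` CLI
invocations of this form (a hedged sketch; this log does not show the exact
command that was added):

    ceph osd destroy {id} --yes-i-really-mean-it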
Signed-off-by: Alexander Proschek <alexander.proschek@protonmail.com>
(cherry picked from commit 0557d5e465556adba6d25db62a40ba55a5dd2400)
Zac Dover [Thu, 18 May 2023 21:07:02 +0000 (07:07 +1000)]
doc/radosgw: explain multisite dynamic resharding
Add a note to doc/radosgw/dynamicresharding.rst and a note to
doc/radosgw/multisite.rst explaining that dynamic resharding is not
supported in multisite configurations in releases prior to Reef.
This commit is made in response to a request from Mathias Chapelain.
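For context, prior to Reef a multisite bucket has to be resharded manually;
a hedged sketch of the usual command (bucket name and shard count are
placeholders):

    radosgw-admin bucket reshard --bucket={bucket-name} --num-shards={num-shards}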
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit d4ed4223d914328361528990f89f1ee4acd30e79)
Zac Dover [Wed, 17 May 2023 12:25:38 +0000 (22:25 +1000)]
doc/cephfs: line-edit "Mirroring Module"
Line-edit the "Mirroring Module" section of
doc/cephfs/cephfs-mirroring.rst. Add prompts and formatting where they
contribute to clear, well-formed sentences.
This commit is a follow-up to https://github.com/ceph/ceph/pull/51505.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit dd8855d9a934bcdd6a026f1308ba7410b1e143e3)
Aashish Sharma [Mon, 8 May 2023 07:19:13 +0000 (12:49 +0530)]
mgr/dashboard: fix regression caused by CephPGImbalance alert
An earlier fix introduced a regression that prevented alerts from being
displayed in the Active Alerts tab. This commit fixes that issue.
Venky Shankar [Tue, 16 May 2023 05:25:34 +0000 (10:55 +0530)]
doc: explain cephfs mirroring `peer_add` step in detail
@zdover23 reached out regarding the missing explanation of the `peer_add`
step in the cephfs mirroring documentation. Add some explanation and an
example to make the step clear.
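For illustration, a hedged sketch of the documented command form (the
filesystem and peer names are examples, not taken from this commit):

    ceph fs snapshot mirror peer_add cephfs client.mirror_remote@ceph_remote cephfs_remote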
Matan Breizman [Wed, 1 Feb 2023 08:49:19 +0000 (08:49 +0000)]
messages/MOSDMap: Rename oldest_map to cluster_osdmap_trim_lower_bound
Previously, MOSDMap messages sent to other OSDs were populated with the
superblock's oldest_map. We should, instead, use the superblock's
cluster_osdmap_trim_lower_bound, because oldest_map is merely a marker
for each OSD's trimming progress.
As specified in the docs:
***
We use cluster_osdmap_trim_lower_bound rather than a specific osd's oldest_map
because we don't necessarily trim all MOSDMap::oldest_map. In order to avoid
doing too much work at once we limit the amount of osdmaps trimmed using
``osd_target_transaction_size`` in OSD::trim_maps().
For this reason, a specific OSD's oldest_map can lag behind
OSDSuperblock::cluster_osdmap_trim_lower_bound for a while.
***
Matan Breizman [Thu, 3 Nov 2022 08:59:11 +0000 (08:59 +0000)]
osd: Remove oldest_stored_osdmap()
The only usage was for identifying map gaps on new intervals.
We should use max_oldest_stored_osdmap() instead, since a specific
osd's oldest_map may lag behind.
Matan Breizman [Wed, 2 Nov 2022 10:40:03 +0000 (10:40 +0000)]
osd: Fix check_past_interval_bounds()
When getting the required past interval bounds we use
oldest_map or the current pg info (lec/ec).
Before this change we set the oldest_map epoch using the
osd's superblock.oldest_map.
The fix uses the max_oldest_map received from other peers
instead, since a specific osd's oldest_map can lag behind
for a while (trimming is throttled to avoid large workloads).
Samuel Just [Wed, 26 Oct 2022 04:46:24 +0000 (21:46 -0700)]
doc/dev/osd_internals: add past_intervals.rst
Add an explanation of past_intervals.
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Matan Breizman <mbreizma@redhat.com>
(cherry picked from commit cd4c031e5e5f5b0318347a7957310cb7358380f6)
rgw/multisite: set truncated flag to true after fetching remote mdlogs.
This is in addition to 2ed1c3e. Ensures that we continue syncing logs
that have already been fetched and don't end up cloning more logs prematurely.
rgw: read incremental metalog from master cluster based on truncate variable
When the log entry in the meta.log object of the secondary cluster is empty,
the value of max_marker is also empty, which cannot satisfy the requirement
that mdlog_marker <= max_marker. As a result, the secondary cluster cannot
fetch new log entries from the master cluster and loops indefinitely; in the
end, the secondary cluster's metadata cannot catch up with the master cluster.
When the truncated flag is false, it means that the secondary cluster's
meta.log is empty, and we can read more from the master cluster.
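For context, the metadata log and sync state on the secondary zone can be
inspected with standard commands (a sketch; these are not commands added by
this commit):

    radosgw-admin mdlog list
    radosgw-admin metadata sync status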
Zac Dover [Fri, 12 May 2023 10:35:25 +0000 (20:35 +1000)]
doc/cephfs: rectify prompts in fs-volumes.rst
Make sure all prompts are unselectable. This PR is meant to be
backported to Reef, Quincy, and Pacific, to get all of the prompts into
a fit state so that a line-edit can be performed on the English language
in this file.
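For reference, an unselectable prompt in the Ceph docs is rendered with the
sphinx prompt directive; a minimal sketch (the command shown is only an
example):

    .. prompt:: bash #

       ceph fs volume ls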
qa/: Override mClock profile to 'high_recovery_ops' for qa tests
The qa tests are not client I/O centric and mostly focus on triggering
recovery/backfills and monitoring them for completion within a finite amount
of time. The same holds true for scrub operations.
Therefore, an mClock profile that optimizes background operations is a
better fit for qa tests. The osd_mclock_profile is therefore globally
overridden to the 'high_recovery_ops' profile for the Rados suite, as it
fits this requirement.
Also, many standalone tests expect recovery and scrub operations to
complete within a finite time. To ensure this, the osd_mclock_profile
option is set to 'high_recovery_ops' as part of the run_osd() function
in ceph-helpers.sh.
A subset of standalone tests explicitly used the 'high_recovery_ops' profile.
Since the profile is now set as part of run_osd(), the earlier overrides
are redundant and have been removed from those tests.
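For reference, the equivalent override through the ceph CLI looks like this
(a sketch; the qa suites apply it through test configuration rather than
interactively):

    ceph config set osd osd_mclock_profile high_recovery_ops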
doc/: Modify mClock configuration documentation to reflect profile changes
Modify the relevant documentation to reflect:
- change in the default mClock profile to 'balanced'
- new allocations for ops across mClock profiles
- change in the osd_max_backfills limit
- miscellaneous changes related to warnings.
common/options/osd.yaml.in: Change mclock max sequential bandwidth for SSDs
The osd_mclock_max_sequential_bandwidth_ssd option is changed to 1200 MiB/s as
a reasonable middle ground considering the broad range of SSD capabilities.
This allows mClock's cost model to better reflect an SSD's capability
depending on the cost of the IO being performed.
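A hedged example of adjusting the option (the value is in bytes;
1258291200 bytes = 1200 MiB):

    ceph config set osd osd_mclock_max_sequential_bandwidth_ssd 1258291200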
osd/: Retain the default osd_max_backfills limit to 1 for mClock
The earlier limit of 3 was still aggressive enough to have an impact on
client and other competing operations. Retain the current default of 1
for mClock. This can be modified if necessary after setting the
osd_mclock_override_recovery_settings option.
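A sketch of the workflow this enables (the backfill value is illustrative):

    ceph config set osd osd_mclock_override_recovery_settings true
    ceph config set osd osd_max_backfills 3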
Samuel Just [Tue, 11 Apr 2023 15:10:04 +0000 (08:10 -0700)]
osd/scheduler/mClockScheduler: avoid limits for recovery
Now that recovery operations are split between background_recovery and
background_best_effort, rebalance qos params to avoid penalizing
background_recovery while idle.
Samuel Just [Thu, 6 Apr 2023 05:57:42 +0000 (22:57 -0700)]
osd/: differentiate scheduler class for undersized/degraded vs data movement
Recovery operations on pgs/objects that have fewer than the configured
number of copies should be treated more urgently than operations on
pgs/objects that simply need to be moved to a new location.
Samuel Just [Tue, 4 Apr 2023 23:34:17 +0000 (23:34 +0000)]
osd/scheduler: simplify qos specific params in OpSchedulerItem
is_qos_item() was only used in operator<< for OpSchedulerItem. However,
it's actually useful to see the priority for mclock items, since it affects
whether an item goes into the immediate queues and, for some types, the
class. Unconditionally display both class_id and priority.
osd: Retain overridden mClock recovery settings across osd restarts
Fix an issue where an overridden mClock recovery setting (set prior to
an osd restart) could be lost after the restart.
For example, consider that prior to an osd restart, the option
'osd_max_backfills' was successfully set to a value different from the
mClock default. If the osd was restarted for some reason, the
boot-up sequence incorrectly reset the backfill value to the
mClock default within the async local/remote reservers. This fix
ensures that no change is made if the current overridden value is
different from the mClock default.
Modify an existing standalone test to verify that the local and remote
async reservers are updated to the desired number of backfills under
normal conditions and also across osd restarts.
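A hedged way to check that the override survives a restart (the osd id is
illustrative):

    ceph config show osd.0 osd_max_backfills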
osd: Set default max active recovery and backfill limits for mClock
Client ops are sensitive to the recovery load, and the limits must be
carefully set for osds whose underlying device is an HDD. Tests revealed that
recoveries with osd_max_backfills = 10 and osd_recovery_max_active_hdd = 5
were still aggressive and overwhelmed client ops. The built-in defaults
for mClock are now set to:
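(The following values are reconstructed: the backfill limit follows from the
"Retain the default osd_max_backfills limit to 1" commit above; the per-device
recovery values are assumed, as this log does not show them.)

    osd_max_backfills           = 1
    osd_recovery_max_active_hdd = 3
    osd_recovery_max_active_ssd = 10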
Previously, setting default configs from the configured profile was
split across:
- enable_mclock_profile_settings
- set_mclock_profile - sets mclock_profile class member
- set_*_allocations - updates client_allocs class member
- set_profile_config - sets profile based on client_allocs class member
This made tracing the effect of changing the profile pretty challenging
due to passing state through class member variables.
Instead, define a simple profile_t with three constexpr values
corresponding to the three profiles and handle it all in a single
set_config_defaults_from_profile() method.
osd: Modify mClock scheduler's cost model to represent cost in bytes
The mClock scheduler's cost model for HDDs/SSDs is modified and now
represents the cost of an IO in terms of bytes.
The cost parameters osd_mclock_cost_per_io_usec_[hdd|ssd]
and osd_mclock_cost_per_byte_usec_[hdd|ssd], which represent the cost
of an IO in seconds, are inaccurate and therefore removed.
The new model considers the following aspects of an osd to calculate
the cost of an IO:
- osd_mclock_max_capacity_iops_[hdd|ssd] (existing option)
The measured random write IOPS at a 4 KiB block size. This is
measured during OSD boot-up using the OSD bench tool.
- osd_mclock_max_sequential_bandwidth_[hdd|ssd] (new config option)
The maximum sequential bandwidth of the underlying device.
For HDDs, 150 MiB/s is considered, and for SSDs 750 MiB/s is
considered in the cost calculation.
The following important changes are made to arrive at the overall
cost of an IO:
1. Represent QoS reservation and limit config parameters as proportions:
The reservation and limit parameters are now set in terms of a
proportion of the OSD's max IOPS capacity. The earlier representation
was in terms of IOPS per OSD shard, which required the user to perform
calculations before setting the parameter. Representing the
reservation and limit in terms of proportions is much more intuitive
and simpler for a user (see the example after this list).
2. Cost per IO calculation:
Using the above config options, osd_bandwidth_cost_per_io for the osd is
calculated and set. It is the ratio of the max sequential bandwidth to
the max random write IOPS of the osd. It is a constant and represents the
base cost of an IO in terms of bytes. This is added to the actual size of
the IO (in bytes) to represent the overall cost of the IO operation. See
mClockScheduler::calc_scaled_cost().
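Expressed as a formula (restating the ratio described above):

    osd_bandwidth_cost_per_io =
        osd_mclock_max_sequential_bandwidth_[hdd|ssd] /
        osd_mclock_max_capacity_iops_[hdd|ssd]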
3. Cost calculation in Bytes:
The settings for reservation and limit, expressed as a fraction of the
OSD's maximum IOPS capacity, are converted to Bytes/sec before updating the
mClock server's ClientInfo structure. This is done for each OSD op shard
using osd_bandwidth_capacity_per_shard, shown below:
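(Reconstructed sketch; the divisor is assumed to be the number of OSD op
shards, which this log does not show.)

    osd_bandwidth_capacity_per_shard =
        osd_mclock_max_sequential_bandwidth_[hdd|ssd] / <number of op shards>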
The above result is updated within the mClock server's ClientInfo
structure for different op_scheduler_class operations. See
mClockScheduler::ClientRegistry::update_from_config().
The overall cost of an IO operation (in secs) is finally determined
during the tag calculations performed in the mClock server. See
crimson::dmclock::RequestTag::tag_calc() for more details.
4. Profile Allocations:
Optimize mClock profile allocations due to the change in the cost model
and lower recovery cost.
5. Modify standalone tests to reflect the change in the QoS config
parameter representation of reservation and limit options.
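As an example of the proportion-based representation from item 1 (a sketch;
the values are illustrative, and these options take effect with the 'custom'
profile):

    ceph config set osd osd_mclock_scheduler_client_res 0.4
    ceph config set osd osd_mclock_scheduler_client_lim 1.0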
Fixes: https://tracker.ceph.com/issues/58529
Fixes: https://tracker.ceph.com/issues/59080
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
osd: update PGRecovery queue item cost to reflect object size
Previously, we used a static value of osd_recovery_cost (20M
by default) for PGRecovery. For pools with relatively small
objects, this causes mclock to backfill very slowly, as
20M massively overestimates the amount of IO each recovery
queue operation requires. Instead, add a cost_per_object
parameter to OSDService::awaiting_throttle and set it to the
average object size in the PG being queued.
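A sketch of the intended cost derivation (assuming the average object size is
taken from the PG's stats; the names are illustrative, not the actual code):

    cost_per_object = pg_num_bytes / pg_num_objects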
Fixes: https://tracker.ceph.com/issues/58606
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
osd: update OSDService::queue_recovery_context to specify cost
Previously, we always queued this with cost osd_recovery_cost, which
defaults to 20M. With mclock, this caused these items to be delayed
heavily. Instead, base the cost on the operation being queued.
Fixes: https://tracker.ceph.com/issues/58606
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
osd/osd_types: use appropriate cost value for PullOp
See the included comments -- the previous values did not account for object
size. This caused problems for mclock, which is much more strict
in how it interprets costs.
Fixes: https://tracker.ceph.com/issues/58607
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
osd/osd_types: use appropriate cost value for PushReplyOp
See the included comments -- the previous values did not account for object
size. This caused problems for mclock, which is much more strict
in how it interprets costs.
Fixes: https://tracker.ceph.com/issues/58529
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>