mon/MonCap: Update osd profile to allow cmd to set iops capacity on mon db
The default mon caps for OSDs are set to "allow profile osd", which grants
only the "rw" capability. OSDs with the mclock scheduler enabled store their
max IOPS capacity in the mon config store by executing the "config set"
command. However, since OSDs by default do not have the execute ("x")
permission, the command fails with a "Permission denied" error. Therefore,
modify the default osd profile to allow running the "config set" command,
restricted to keys whose names match either (regex)
"osd_mclock_max_capacity_iops_hdd" or "osd_mclock_max_capacity_iops_ssd",
so that an OSD has permission to update the mon config store with the
desired information.
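As a rough sketch of the grant (the exact construction in src/mon/MonCap.cc
may differ; the MonCapGrant/StringConstraint usage below just mirrors the
existing profile grants):

  // hedged sketch: extend the "osd" profile with a "config set" command
  // grant whose "name" argument is constrained by regex to the two keys
  profile_grants.push_back(MonCapGrant("config set", "name",
      StringConstraint(StringConstraint::MATCH_TYPE_REGEX,
          "osd_mclock_max_capacity_iops_(hdd|ssd)")));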
doc: Update mclock-config-ref doc steps to override osd max iops capacity.
Update the steps in the mclock config reference document for manually
overriding an OSD's max IOPS capacity. Provide information on the
alternative ways to override the osd_mclock_max_capacity_iops_[hdd,ssd]
options for an OSD (e.g. via "ceph config set osd.N
osd_mclock_max_capacity_iops_hdd <value>").
osd: Add config option to skip running the OSD benchmark on start-up.
Introduce a new dev config option "osd_mclock_skip_benchmark" that, when
set, skips running the OSD benchmark on start-up. By default this option
is disabled. This is useful in the following scenarios:
- Dev/CI testing,
- Configurations that don't need QoS.
If the option is enabled, the default OSD iops capacity is read from
osd_mclock_max_capacity_iops_[hdd,ssd].
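A minimal sketch of the fallback (assuming the OSD already knows whether
its store is rotational; the actual code path in OSD.cc may differ):

  // hedged sketch: with the benchmark skipped, take the configured
  // default capacity for the device type instead of measuring it
  if (cct->_conf.get_val<bool>("osd_mclock_skip_benchmark")) {
    double iops = cct->_conf.get_val<double>(
        store_is_rotational ? "osd_mclock_max_capacity_iops_hdd"
                            : "osd_mclock_max_capacity_iops_ssd");
    // ... use 'iops' as the max OSD capacity for mclock
  }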
Conflicts:
src/common/options/osd.yaml.in
- Removed non-existent file: src/common/options/osd.yaml.in since the
switch to yaml for config options is not available in pacific yet.
osd: Add a new config option to forcibly run OSD benchmark on init
The new config option "osd_mclock_force_run_benchmark_on_init" is
introduced to allow a user to force a run of the OSD benchmark test on
every OSD boot-up, even if historical data about the OSD's iops capacity
is available in the MON config store. The 'force_run_benchmark' flag is
set to the value indicated by the new config option.
By default this new config option is set to false.
The utility of this option is to help refresh the OSD iops capacity
when the underlying device's performance characteristics have changed
significantly. In such cases, the OSD can be restarted with this option
enabled temporarily. Once the new iops capacity is updated to the MON
store, this option can be removed from the OSD's start-up config.
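In code terms this amounts to seeding the flag from the option (a sketch;
the surrounding init code is elided):

  // hedged sketch: honor the new option when deciding whether to benchmark
  force_run_benchmark =
    cct->_conf.get_val<bool>("osd_mclock_force_run_benchmark_on_init");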
Conflicts:
src/common/options/osd.yaml.in
- Removed non-existent file: src/common/options/osd.yaml.in since the
switch to yaml for config options is not available in pacific yet.
- Added the new config option "osd_mclock_force_run_benchmark_on_init"
to options.cc.
osd: Add mechanism to avoid running OSD benchmark on every OSD boot-up
Use "mon_cmd_set_config()" to store the OSD's max iops capacity to
the MON store during the first bring-up. Don't run the OSD benchmark
test on subsequent boot-ups if a previously persisted iops capacity is
available on the MON store and is different from the default iops
capacity.
Add the 'force_run_benchmark' flag to force a run of the benchmark
in case the default iops capacity cannot be determined.
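Putting the pieces together, the boot-up decision described above looks
roughly like this (a sketch under assumed names; mon_cmd_set_config() and
get_val_default() are described in the commits that follow):

  std::string key = store_is_rotational
      ? "osd_mclock_max_capacity_iops_hdd"
      : "osd_mclock_max_capacity_iops_ssd";
  double cur = cct->_conf.get_val<double>(key);
  // the option's compiled-in default (see get_val_default() below)
  double def = std::stod(*cct->_conf.get_val_default(key));
  if (force_run_benchmark || cur == def) {
    double iops = run_osd_bench_test();             // assumed helper
    mon_cmd_set_config(key, std::to_string(iops));  // persist to MON store
  }
  // else: a previously persisted, non-default capacity is already in effect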
common/config: Add methods to return the default value of a config option
Add a wrapper method "get_val_default()" to the ConfigProxy class that
takes the config option key to search for. This method in turn calls
another method of the same name added to the md_config_t class, which does
the actual work of searching for the config option. If the option is
valid, _get_val_default() is used to get the default value; otherwise, the
wrapper method returns std::nullopt.
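A condensed sketch of the pair (signatures assumed from the description;
the actual code in src/common/config.h may differ):

  // ConfigProxy: thin forwarding wrapper
  std::optional<std::string> get_val_default(std::string_view key) {
    return config.get_val_default(key);
  }

  // md_config_t: does the actual lookup
  std::optional<std::string>
  md_config_t::get_val_default(std::string_view key) {
    const Option *opt = find_option(std::string{key});
    if (opt) {
      return Option::to_str(_get_val_default(*opt));  // stringified default
    }
    return std::nullopt;  // unknown option
  }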
osd: Add method to store config option key/value on the MON store
Add a method mon_cmd_set_config() to save a config option key and value
to the MON store. The ConfigMonitor command 'config set' is used to
achieve this.
A corresponding get method is unnecessary since any config option
found on the MON store is loaded during OSD boot-up and set using
the md_config_t::set_mon_vals() method. Therefore, the existing
versions of ConfigProxy::get_val() method are sufficient to get
the latest value for the config option.
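Roughly, the method boils down to issuing a 'config set' mon command (a
sketch; error handling and the exact shape of the helper in OSD.cc are
assumed):

  void OSD::mon_cmd_set_config(const std::string &key, const std::string &val) {
    std::string cmd =
      "{\"prefix\": \"config set\""
      ", \"who\": \"osd." + std::to_string(whoami) + "\""
      ", \"name\": \"" + key + "\""
      ", \"value\": \"" + val + "\"}";
    ceph::buffer::list inbl;
    monc->start_mon_command({cmd}, inbl, nullptr, nullptr,
        new LambdaContext([key](int r) {
          if (r < 0) {
            // not fatal: the value just won't be persisted this boot
          }
        }));
  }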
qa/tasks: Enhance wait_until_true() to check & retry recovery progress
With the mclock scheduler enabled, the recovery throughput is throttled
based on factors like the type of mclock profile enabled and the OSD
capacity, among others. Due to this, recovery times may vary, and the
existing timeout of 120 secs may not be sufficient.
To address the above, a new method called _is_inprogress_or_complete() is
introduced in the TestProgress Class that checks if the event with the
specified 'id' is in progress by checking the 'progress' key of the
progress command response. This method also handles the corner case where
the event completes just before it's called.
The existing wait_until_true() method in the CephTestCase class is
modified to accept another function argument called "check_fn". In the
"test_turn_off_module" test, which has been observed to fail for the
reasons described above, this is set to the _is_inprogress_or_complete()
function described earlier. A retry mechanism with a maximum of 5 attempts
is introduced after the first timeout is hit. This means the wait can
extend up to a maximum of 600 secs (120 secs * 5), as long as recovery
progress is reported by the 'ceph progress' command result.
osd: Disable heartbeat timeout until a non-future workitem can be processed
There could be rare instances when employing the mclock scheduler where a
worker thread for a shard does not get an immediate work item to process.
Such items are designated as future work items. In these cases, the
_process() loop waits until the time indicated by the scheduler before
attempting to dequeue from the scheduler queue again. If there are
multiple threads per shard, a thread may not get an immediate item for a
long time. This time could exceed the heartbeat timeout for the thread and
result in heartbeat timeouts being reported for the OSD in question. To
prevent this, the heartbeat timeout for the thread is disabled before
waiting for an item and re-enabled once the wait period is over.
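The pattern, in sketch form (assuming HeartbeatMap's
clear_timeout()/reset_timeout() helpers; the surrounding shard code is
elided):

  // suspend the heartbeat while waiting on a future work item
  osd->cct->get_heartbeat_map()->clear_timeout(hb);
  sdata->sdata_cond.wait_until(wait_lock, *when_ready);  // sleep until the
                                                         // scheduled time
  // re-arm the normal work-queue timeouts before processing resumes
  osd->cct->get_heartbeat_map()->reset_timeout(
      hb, timeout_interval, suicide_interval);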
osd: Run osd bench test to override default max osd capacity for mclock
If the mclock scheduler is enabled, run the osd bench test as part of the
osd initialization sequence in order to determine the max osd capacity.
The iops determined by the test are used to override the default
osd_mclock_max_capacity_iops_[hdd,ssd] option depending on the underlying
device type.
The test performs random writes of 100 objects of 4MiB size using
4KiB blocksize. The existing test which was a part of asok_command() is
factored out into a separate method called run_osd_bench_test() so that it
can be used for both purposes. If the test fails, the default values
for the above mentioned options are used.
A new method called update_configuration() is introduced in the
OpScheduler base class to facilitate propagation of changes to a config
option that are not user initiated. This method helps apply changes and
update any internal variable associated with a config option, as long as
the option is tracked. In this case, the change to the max osd capacity is
propagated to each op shard using this method, as sketched below. In the
future, the method can be used to propagate changes to advanced config
option(s) that the user is not expected to modify.
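The interface addition in sketch form (the mclock implementation is
assumed to re-read the tracked option and refresh its internal state):

  class OpScheduler {
  public:
    // ... existing enqueue/dequeue interface ...

    // apply a non-user-initiated change to a tracked config option
    // (here, the max osd capacity) to the scheduler's internal state
    virtual void update_configuration() = 0;

    virtual ~OpScheduler() {}
  };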
osd: Remove the generic "osd_mclock_max_capacity_iops" option.
Remove the generic "osd_mclock_max_capacity_iops" option and use the
"osd_mclock_max_capacity_iops_[hdd,ssd]" options. It is better to have a
clear indication about the type of underlying device. This helps in
avoiding confusion when trying to read or override the options.
Conflicts:
src/common/options/osd.yaml.in
- Removed non-existent file: src/common/options/osd.yaml.in since the
switch to yaml for config options is not available in pacific yet.
- Removed config option "osd_mclock_max_capacity_iops" from options.cc.
Since inject_facts_as_vars is set to false in the ansible.cfg file, we
have to update the references to use ansible_facts[<thing>] instead of
ansible_<thing>.
We already install the dependency from the ceph-ansible requirements.txt,
and to avoid false positives (like after rebooting a node) we can retry
failing tests.
Without loading the ansible.cfg file from the ceph-ansible project,
pipelining is not enabled, which would otherwise result in a significant
performance improvement.
This removes the ANSIBLE_ACTION_PLUGINS, ANSIBLE_RETRY_FILES_ENABLED and
ANSIBLE_SSH_RETRIES environment variables, as they are already covered by
the ansible.cfg file.
ceph-volume/tests: update ansible ssh_args env var
The ansible ssh_args parameter is usually defined in the ansible.cfg
file. Currently this variable is overridden in tox to manage the vagrant
ssh file, but this loses all the default values.
rpm: drop use of $FIRST_ARG in ceph-immutable-object-cache
The use of $FIRST_ARG was probably required because the SUSE-specific
%service_* rpm macros were playing tricks on the shell positional
parameters. This is bad practice and error-prone, so let's assume that no
macro does that anymore, and hence that it's safe to assume the positional
parameters remain unchanged after any rpm macro call.
Kefu Chai [Sat, 1 May 2021 15:30:18 +0000 (23:30 +0800)]
common/pick_address: define in_addr_t if it is not defined
mingw does not have in_addr_t defined (see
https://docs.microsoft.com/en-us/windows/win32/api/winsock2/ns-winsock2-in_addr),
so define it if it is not defined.
Conflicts:
src/common/options/global.yaml.in: global.yaml.in was introduced in
master only, so in this change src/common/options.cc is updated instead.
fmt is packaged in EPEL, while librados is packaged in RHEL, so we cannot
have fmt as a runtime dependency of librados. To address this issue, we
should compile librados with either the static library or the header-only
library of fmt. But because the Fedora packaging guidelines do not
encourage packaging static libraries, and it would be complicated to
package both static and dynamic libraries for fmt, the simpler solution is
to compile Ceph with the header-only version of fmt. In this change, we
compile Ceph with the header-only version of fmt on RHEL to address the
runtime dependency issue.
* an interface library named "fmt-header-only" is introduced. it brings in
support for the header-only fmt library.
* fmt::fmt is renamed to fmt
* an option named "WITH_FMT_HEADER_ONLY" is introduced
* fmt::fmt is an alias of "fmt-header-only" if "WITH_FMT_HEADER_ONLY"
is "ON", and an alias of "fmt" otherwise.
fmt is packaged in EPEL, while librados is packaged in RHEL, so we cannot
have fmt as a runtime dependency of librados. To address this issue an
option "WITH_FMT_HEADER_ONLY" is introduced, so that we can enable it when
building Ceph with the header-only version of fmt; the built packages then
won't have a runtime dependency on fmt.
cephfs-mirror: record directory path cancel in DirRegistry
When removing a directory path from mirroring, cephfs-mirror records this
state in thread-local storage. The replayer thread backs off in the midst
of mirroring the directory snapshots for this directory path. However, the
state (the canceled state) is never cleared, causing the thread to
incorrectly assume that other directory paths (which are picked up by this
thread) need backing off, hence marking these directory paths as failed
(to synchronize snapshots).
The fix is to store this state in the directory-specific store which is
allocated when a thread picks up a directory path for synchronization.
cephfs-mirror: complete context when a mirror instance is not failed or blocklisted
Without this, the updater thread can start processing other queued
contexts when a mirror instance is failed or blocklisted, resulting in
unexpected behavior.
qa/standalone: fixing the timings when waiting for deep-scrub to start
initiate_and_fetch_state() initiates a scrub, then polls the published
PG state looking for 'scrubbing'. Calling flush_pg_stats() as part of
the polling process might cause the scrub and the following recovery to
be missed altogether.
Note: this polling mechanism is definitely not robust. Will be
redesigned in the future.
Ronen Friedman [Sat, 15 May 2021 19:14:38 +0000 (22:14 +0300)]
test: recovery_scrub: do not display 'repair' status on auto-repair deep-scrub
A new test: auto_repair_bluestore_tag.
Based on auto_repair_bluestore_basic. Sets auto-repair, starts a periodic
deep-scrub, then verifies that the PG state, while scrubbing, is 'scrubbing+deep'
and not 'scrubbing+deep+repair'.
Ronen Friedman [Mon, 10 May 2021 13:15:16 +0000 (16:15 +0300)]
osd/scrub: separate PG state flags from internal scrubber operation
Modify the scrubber to rely on internal flags for 'should we repair' and
'is this a deep scrub', instead of using PG_STATE_REPAIR & PG_STATE_DEEP_SCRUB.
This enables us to implement the 'fix-as-you-go deepscrub' functionality
of 'osd_scrub_auto_repair', without displaying REPAIR status to the user.
Adam Kupczyk [Mon, 24 May 2021 12:27:05 +0000 (14:27 +0200)]
os/bluestore/bluefs: Add test that detects bluefs inconsistency
Add a test that detects a possible scenario that will cause BlueFS to
have a file containing data that has never been written. This is done by
tricking the replay log into already accepting the file metadata (size,
allocations) while the actual data stored in those allocations is not yet
synced to disk.
Scenario:
1) write to file h1 on SLOW device
2) flush h1 (and trigger h1 mark to be added to bluefs replay log)
3) write to file h2
4) fsync h2 (forces replay log to be written)
The result is:
- the bluefs log now has a stable state for h1
- the SLOW device is not yet flushed (no fdatasync())
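In test terms, the scenario looks roughly like this (BlueFS unit-test
style; directory/file names and buffer handling are assumed):

  char buf[4096] = {0};
  BlueFS::FileWriter *h1, *h2;
  fs.open_for_write("dir.slow", "file1", &h1, false);
  h1->append(buf, sizeof(buf));
  fs.flush(h1);  // h1's new size/allocations become dirty log metadata
  fs.open_for_write("dir.db", "file2", &h2, false);
  h2->append(buf, sizeof(buf));
  fs.fsync(h2);  // forces the replay log out: h1's metadata is now
                 // "stable", but h1's data on SLOW was never fdatasync'ed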
Adam Kupczyk [Mon, 24 May 2021 12:49:51 +0000 (14:49 +0200)]
os/bluestore/bluefs: Remove possibility of bluefs replay log containing files without data
It had been possible for the bluefs replay log to serialize file metadata
(size, allocations) while the actual data stored in those allocations was
not yet synced to disk.
This could happen if _flush_range(h1) allocated space for file h1 on a
device (like SLOW) that will not be used when flushing the future replay
log. Such a thing can happen when we have an h2 that wrote to WAL and our
replay log is on DB. After fsync(h2) we write the replay log and wait for
fdatasync on WAL and DB. There is no waiting on SLOW, but h1 was dirty and
had been serialized to the replay log.
The solution is to delay notifying the replay log that it has to include
h1 until after fdatasync has finished.
Fixes: https://tracker.ceph.com/issues/50965
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 03ac53f7d4c83e56f664ad371ffe3bc2d40e1837)
Kefu Chai [Thu, 10 Jun 2021 12:19:09 +0000 (20:19 +0800)]
tasks/ceph_manager: ignore EACCES when waiting for quorum
mon_tick_interval is 5 seconds by default. Monitors update their rotating
keys every mon_tick_interval. Before the monitors form a quorum, the auth
requests from clients are put into the wait list. These requests are
re-enqueued once the monitors form a quorum, but there is a small window
of mon_tick_interval before the monitors are able to serve auth requests,
even after they claim to be able to serve requests. If these re-enqueued
requests happen to be served in this window, and if cephx is enabled, they
will be greeted with errors like
handle_auth_bad_method server allowed_methods [2] but i only support [2]
In the case of the ceph CLI, the error would look like:
[errno 13] RADOS permission denied (error connecting to the cluster)
So, to address this issue, the EACCES error is ignored when waiting for a
quorum.
ceph-monstore-tool: use a large enough paxos/{first,last}_committed
so the rebuilt paxos transaction won't be overwritten by the ones created
before recovery completes.
When the quorum is recovering, the leader will collect the paxos
transactions from peons. If the quorum accepts the proposal for setting
the fingerprint, the peon will update the monitor with a paxos transaction
whose "last_committed" is newer than the one created using update_paxos()
in ceph_monstore_tool.cc; the latter "last_committed" is always 0.
So, to avoid this extra paxos proposal obsoleting the "rebuilding" paxos
transaction, we use a large enough number for {first,last}_committed.
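A sketch of the idea (the concrete constant chosen in
ceph_monstore_tool.cc is assumed here; 't' is the rebuild transaction):

  // pick a version far beyond anything the recovering quorum could
  // propose, so the rebuilt transaction stays authoritative
  constexpr version_t PAXOS_COMMITTED = 1000000000;
  t->put("paxos", "first_committed", PAXOS_COMMITTED);
  t->put("paxos", "last_committed",  PAXOS_COMMITTED);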
Adam C. Emerson [Wed, 14 Jul 2021 15:02:21 +0000 (11:02 -0400)]
rgw: Robust notify invalidates on cache timeout
This avoids a potential race condition in which updates are delayed.
Fixes: https://tracker.ceph.com/issues/51674
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
(cherry picked from commit 76247990ff38049ee32dd47d31482b9648353673)
Conflicts:
src/rgw/services/svc_notify.cc
- Skip the renaming, since this is a backport and that's mostly a
matter of futureproofing.
Backport: https://tracker.ceph.com/issues/51679
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Adam C. Emerson [Wed, 7 Jul 2021 22:47:00 +0000 (18:47 -0400)]
rgw: distribute() takes RGWCacheNotifyInfo
So we don't have to parse the bufferlist back out to find which object to
throw out of the cache.
Fixes: https://tracker.ceph.com/issues/51674
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
(cherry picked from commit 7f952ad80114096322f202ba58279aaa4a002313)
Backport: https://tracker.ceph.com/issues/51679
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Adam C. Emerson [Tue, 13 Jul 2021 20:05:47 +0000 (16:05 -0400)]
rgw: Don't segfault on datalog trim
Synchronous (or yielded, basically other-than-AioCompletion) trim would
try to dereference the past-the-end iterator if we were trimming to a
point in the most recent generation.
https://tracker.ceph.com/issues/51661
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
(cherry picked from commit 97305f03c16db1cfaceef04a74ee510bc1fc1e80)
https://tracker.ceph.com/issues/51675
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
pacific: qa: FileNotFoundError: [Errno 2] No such file or directory: '/sys/kernel/debug/ceph/3fab6bea-f243-47a4-a956-8c03a62b61b5.client4721/mds_sessions'
Cherry-pick notes:
- handle differences due to renaming of rgw::sal::RGWObject to rgw::sal::Object
- handle differences due to move of test_ps_s3_metadata_on_master test from tests_ps.py to test_bn.py
Moving the attrs into s->bucket_attrs before setting them results in
setting empty attrs on the bucket. This means that reading them back
later gets empty attrs, which can cause a segfault.
mgr/dashboard: remove usage of 'rgw_frontend_ssl_key'
Fixes: https://tracker.ceph.com/issues/51643
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
Removing the usage of rgw_frontend_ssl_key from the rgw service form.
monitoring: remove instance label from ceph-cluster.json completely
The `instance` label is only useful if
- the exporter returns only data about its own node or instance, or
- the exporter provides an instance label and may then return data about
other nodes.
In this case, it's about the Prometheus mgr module, which is a single
exporter providing data about a whole cluster, not only data related to
the node (or instance) the mgr module is running on. It is completely
irrelevant which node the exporter runs on; the data provided doesn't
change. The exporter also doesn't provide `instance` labels (which
Prometheus wouldn't change due to our configuration; see the
"honor_labels" setting).
(Actually there's one exception where `instance` labels are provided by
the Ceph mgr module, but that doesn't affect the Ceph Cluster
dashboard.)
Note that keeping the instance label on this particular dashboard would
enable the user to switch between the data collected from a previously
failed mgr instance and the data from the currently running mgr instance
(on which the Prometheus mgr module runs). That would split the data,
which I don't think is a useful feature; it rather looks broken.
Fixes: https://tracker.ceph.com/issues/51212
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
(cherry picked from commit 037410713f032c0a2a25243e411ae67dffcc1d1a)