mon/MonCap: Update osd profile to allow cmd to set iops capacity on mon db
The default mon caps for OSDs are set to "allow profile osd", which grants
only the "rw" capability. OSDs with the mclock scheduler enabled store their
max IOPS capacity in the mon config store by executing the "config set"
command. However, since OSDs by default do not have the execute ("x")
permission, the command fails with a "Permission denied" error. Therefore,
modify the default osd profile to allow running the "config set" command,
restricted to keys whose names match either (regex)
"osd_mclock_max_capacity_iops_hdd" or "osd_mclock_max_capacity_iops_ssd",
so that an OSD has permission to update the mon config store with the
desired information.
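As a rough sketch of the grant (the exact construction in src/mon/MonCap.cc
may differ; the MonCapGrant/StringConstraint usage below just mirrors the
existing profile grants):

  // hedged sketch: extend the "osd" profile with a "config set" command
  // grant whose "name" argument is constrained by regex to the two keys
  profile_grants.push_back(MonCapGrant("config set", "name",
      StringConstraint(StringConstraint::MATCH_TYPE_REGEX,
          "osd_mclock_max_capacity_iops_(hdd|ssd)")));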
doc: Update mclock-config-ref doc steps to override osd max iops capacity.
Update the steps in the mclock config reference document for manually
overriding an OSD's max IOPS capacity. Provide information on the
alternative ways to override the osd_mclock_max_capacity_iops_[hdd,ssd]
options for an OSD (e.g. via "ceph config set osd.N
osd_mclock_max_capacity_iops_hdd <value>").
osd: Add config option to skip running the OSD benchmark on start-up.
Introduce a new dev config option "osd_mclock_skip_benchmark" that, when
set, skips running the OSD benchmark on start-up. By default this option
is disabled. This is useful in the following scenarios:
- Dev/CI testing,
- Configurations that don't need QoS.
If the option is enabled, the default OSD iops capacity is read from
osd_mclock_max_capacity_iops_[hdd,ssd].
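A minimal sketch of the fallback (assuming the OSD already knows whether
its store is rotational; the actual code path in OSD.cc may differ):

  // hedged sketch: with the benchmark skipped, take the configured
  // default capacity for the device type instead of measuring it
  if (cct->_conf.get_val<bool>("osd_mclock_skip_benchmark")) {
    double iops = cct->_conf.get_val<double>(
        store_is_rotational ? "osd_mclock_max_capacity_iops_hdd"
                            : "osd_mclock_max_capacity_iops_ssd");
    // ... use 'iops' as the max OSD capacity for mclock
  }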
Conflicts:
src/common/options/osd.yaml.in
- Removed non-existent file: src/common/options/osd.yaml.in since the
switch to yaml for config options is not available in pacific yet.
osd: Add a new config option to forcibly run OSD benchmark on init
The new config option "osd_mclock_force_run_benchmark_on_init" is
introduced to allow a user to force a run of the OSD benchmark test on
every OSD boot-up, even if historical data about the OSD's iops capacity
is available in the MON config store. The 'force_run_benchmark' flag is
set to the value indicated by the new config option.
By default this new config option is set to false.
The utility of this option is to help refresh the OSD iops capacity
when the underlying device's performance characteristics have changed
significantly. In such cases, the OSD can be restarted with this option
enabled temporarily. Once the new iops capacity is updated to the MON
store, this option can be removed from the OSD's start-up config.
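In code terms this amounts to seeding the flag from the option (a sketch;
the surrounding init code is elided):

  // hedged sketch: honor the new option when deciding whether to benchmark
  force_run_benchmark =
    cct->_conf.get_val<bool>("osd_mclock_force_run_benchmark_on_init");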
Conflicts:
src/common/options/osd.yaml.in
- Removed non-existent file: src/common/options/osd.yaml.in since the
switch to yaml for config options is not available in pacific yet.
- Added the new config option "osd_mclock_force_run_benchmark_on_init"
to options.cc.
osd: Add mechanism to avoid running OSD benchmark on every OSD boot-up
Use "mon_cmd_set_config()" to store the OSD's max iops capacity to
the MON store during the first bring-up. Don't run the OSD benchmark
test on subsequent boot-ups if a previously persisted iops capacity is
available on the MON store and is different from the default iops
capacity.
Add the 'force_run_benchmark' flag to force a run of the benchmark
in case the default iops capacity cannot be determined.
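Putting the pieces together, the boot-up decision described above looks
roughly like this (a sketch under assumed names; mon_cmd_set_config() and
get_val_default() are described in the commits that follow):

  std::string key = store_is_rotational
      ? "osd_mclock_max_capacity_iops_hdd"
      : "osd_mclock_max_capacity_iops_ssd";
  double cur = cct->_conf.get_val<double>(key);
  // the option's compiled-in default (see get_val_default() below)
  double def = std::stod(*cct->_conf.get_val_default(key));
  if (force_run_benchmark || cur == def) {
    double iops = run_osd_bench_test();             // assumed helper
    mon_cmd_set_config(key, std::to_string(iops));  // persist to MON store
  }
  // else: a previously persisted, non-default capacity is already in effect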
common/config: Add methods to return the default value of a config option
Add a wrapper method "get_val_default()" to the ConfigProxy class that
takes the config option key to search for. This method in turn calls
another method of the same name added to the md_config_t class, which does
the actual work of searching for the config option. If the option is
valid, _get_val_default() is used to get the default value; otherwise, the
wrapper method returns std::nullopt.
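A condensed sketch of the pair (signatures assumed from the description;
the actual code in src/common/config.h may differ):

  // ConfigProxy: thin forwarding wrapper
  std::optional<std::string> get_val_default(std::string_view key) {
    return config.get_val_default(key);
  }

  // md_config_t: does the actual lookup
  std::optional<std::string>
  md_config_t::get_val_default(std::string_view key) {
    const Option *opt = find_option(std::string{key});
    if (opt) {
      return Option::to_str(_get_val_default(*opt));  // stringified default
    }
    return std::nullopt;  // unknown option
  }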
osd: Add method to store config option key/value on the MON store
Add a method mon_cmd_set_config() to save a config option key and value
to the MON store. The ConfigMonitor command 'config set' is used to
achieve this.
A corresponding get method is unnecessary since any config option
found on the MON store is loaded during OSD boot-up and set using
the md_config_t::set_mon_vals() method. Therefore, the existing
versions of ConfigProxy::get_val() method are sufficient to get
the latest value for the config option.
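Roughly, the method boils down to issuing a 'config set' mon command (a
sketch; error handling and the exact shape of the helper in OSD.cc are
assumed):

  void OSD::mon_cmd_set_config(const std::string &key, const std::string &val) {
    std::string cmd =
      "{\"prefix\": \"config set\""
      ", \"who\": \"osd." + std::to_string(whoami) + "\""
      ", \"name\": \"" + key + "\""
      ", \"value\": \"" + val + "\"}";
    ceph::buffer::list inbl;
    monc->start_mon_command({cmd}, inbl, nullptr, nullptr,
        new LambdaContext([key](int r) {
          if (r < 0) {
            // not fatal: the value just won't be persisted this boot
          }
        }));
  }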
qa/tasks: Enhance wait_until_true() to check & retry recovery progress
With the mclock scheduler enabled, the recovery throughput is throttled
based on factors like the type of mclock profile enabled and the OSD
capacity, among others. Due to this, recovery times may vary, and the
existing timeout of 120 secs may not be sufficient.
To address the above, a new method called _is_inprogress_or_complete() is
introduced in the TestProgress Class that checks if the event with the
specified 'id' is in progress by checking the 'progress' key of the
progress command response. This method also handles the corner case where
the event completes just before it's called.
The existing wait_until_true() method in the CephTestCase class is
modified to accept another function argument called "check_fn". In the
"test_turn_off_module" test, which has been observed to fail for the
reasons described above, this is set to the _is_inprogress_or_complete()
function described earlier. A retry mechanism with a maximum of 5 attempts
is introduced after the first timeout is hit. This means the wait can
extend up to a maximum of 600 secs (120 secs * 5), as long as recovery
progress is reported by the 'ceph progress' command result.
osd: Disable heartbeat timeout until a non-future workitem can be processed
There could be rare instances when employing the mclock scheduler where a
worker thread for a shard does not get an immediate work item to process.
Such items are designated as future work items. In these cases, the
_process() loop waits until the time indicated by the scheduler before
attempting to dequeue from the scheduler queue again. If there are
multiple threads per shard, a thread may not get an immediate item for a
long time. This time could exceed the heartbeat timeout for the thread and
result in heartbeat timeouts being reported for the OSD in question. To
prevent this, the heartbeat timeout for the thread is disabled before
waiting for an item and re-enabled once the wait period is over.
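The pattern, in sketch form (assuming HeartbeatMap's
clear_timeout()/reset_timeout() helpers; the surrounding shard code is
elided):

  // suspend the heartbeat while waiting on a future work item
  osd->cct->get_heartbeat_map()->clear_timeout(hb);
  sdata->sdata_cond.wait_until(wait_lock, *when_ready);  // sleep until the
                                                         // scheduled time
  // re-arm the normal work-queue timeouts before processing resumes
  osd->cct->get_heartbeat_map()->reset_timeout(
      hb, timeout_interval, suicide_interval);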
osd: Run osd bench test to override default max osd capacity for mclock
If the mclock scheduler is enabled, run the osd bench test as part of the
osd initialization sequence in order to determine the max osd capacity.
The iops determined by the test are used to override the default
osd_mclock_max_capacity_iops_[hdd,ssd] option depending on the underlying
device type.
The test performs random writes of 100 objects of 4MiB size using
4KiB blocksize. The existing test which was a part of asok_command() is
factored out into a separate method called run_osd_bench_test() so that it
can be used for both purposes. If the test fails, the default values
for the above mentioned options are used.
A new method called update_configuration() is introduced in the
OpScheduler base class to facilitate propagation of changes to a config
option that are not user initiated. This method helps apply changes and
update any internal variable associated with a config option, as long as
the option is tracked. In this case, the change to the max osd capacity is
propagated to each op shard using this method, as sketched below. In the
future, the method can be used to propagate changes to advanced config
option(s) that the user is not expected to modify.
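The interface addition in sketch form (the mclock implementation is
assumed to re-read the tracked option and refresh its internal state):

  class OpScheduler {
  public:
    // ... existing enqueue/dequeue interface ...

    // apply a non-user-initiated change to a tracked config option
    // (here, the max osd capacity) to the scheduler's internal state
    virtual void update_configuration() = 0;

    virtual ~OpScheduler() {}
  };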
osd: Remove the generic "osd_mclock_max_capacity_iops" option.
Remove the generic "osd_mclock_max_capacity_iops" option and use the
"osd_mclock_max_capacity_iops_[hdd,ssd]" options. It is better to have a
clear indication about the type of underlying device. This helps in
avoiding confusion when trying to read or override the options.
Conflicts:
src/common/options/osd.yaml.in
- Removed non-existent file: src/common/options/osd.yaml.in since the
switch to yaml for config options is not available in pacific yet.
- Removed config option "osd_mclock_max_capacity_iops" from options.cc.
Since inject_facts_as_vars is set to false in the ansible.cfg file, we
have to update the references to use ansible_facts[<thing>] instead of
ansible_<thing>.
We already install the dependency from the ceph-ansible requirements.txt,
and to avoid false positives (like after rebooting a node) we can retry
failing tests.
Without loading the ansible.cfg file from the ceph-ansible project,
pipelining is not enabled, which would otherwise result in a significant
performance improvement.
This removes the ANSIBLE_ACTION_PLUGINS, ANSIBLE_RETRY_FILES_ENABLED and
ANSIBLE_SSH_RETRIES environment variables, as they are already covered by
the ansible.cfg file.
ceph-volume/tests: update ansible ssh_args env var
The ansible ssh_args parameter is usually defined in the ansible.cfg
file. Currently this variable is overridden in tox to manage the vagrant
ssh file, but this loses all the default values.
rpm: drop use of $FIRST_ARG in ceph-immutable-object-cache
The use of $FIRST_ARG was probably required because the SUSE-specific
%service_* rpm macros were playing tricks on the shell positional
parameters. This is bad practice and error-prone, so let's assume that no
macro does that anymore, and hence that it's safe to assume the positional
parameters remain unchanged after any rpm macro call.
Kefu Chai [Sat, 1 May 2021 15:30:18 +0000 (23:30 +0800)]
common/pick_address: define in_addr_t if it is not defined
mingw does not have in_addr_t defined (see
https://docs.microsoft.com/en-us/windows/win32/api/winsock2/ns-winsock2-in_addr),
so define it if it is not defined.
Conflicts:
src/common/options/global.yaml.in: global.yaml.in was introduced in
master only, so in this change src/common/options.cc is updated instead.
fmt is packaged in EPEL, while librados is packaged in RHEL, so we cannot
have fmt as a runtime dependency of librados. To address this issue, we
should compile librados with either the static library or the header-only
library of fmt. But because the Fedora packaging guidelines do not
encourage packaging static libraries, and it would be complicated to
package both static and dynamic libraries for fmt, the simpler solution is
to compile Ceph with the header-only version of fmt. In this change, we
compile Ceph with the header-only version of fmt on RHEL to address the
runtime dependency issue.
* an interface library named "fmt-header-only" is introduced. it brings in
support for the header-only fmt library.
* fmt::fmt is renamed to fmt
* an option named "WITH_FMT_HEADER_ONLY" is introduced
* fmt::fmt is an alias of "fmt-header-only" if "WITH_FMT_HEADER_ONLY"
is "ON", and an alias of "fmt" otherwise.
fmt is packaged in EPEL, while librados is packaged in RHEL, so we cannot
have fmt as a runtime dependency of librados. To address this issue an
option "WITH_FMT_HEADER_ONLY" is introduced, so that we can enable it when
building Ceph with the header-only version of fmt; the built packages then
won't have a runtime dependency on fmt.
cephfs-mirror: record directory path cancel in DirRegistry
When removing a directory path from mirroring, cephfs-mirror records this
state in thread-local storage. The replayer thread backs off in the midst
of mirroring the directory snapshots for this directory path. However, the
state (the canceled state) is never cleared, causing the thread to
incorrectly assume that other directory paths (which are picked up by this
thread) need backing off, hence marking these directory paths as failed
(to synchronize snapshots).
The fix is to store this state in the directory-specific store which is
allocated when a thread picks up a directory path for synchronization.
cephfs-mirror: complete context when a mirror instance is not failed or blocklisted
Without this, the updater thread can start processing other queued
contexts when a mirror instance is failed or blocklisted, resulting in
unexpected behavior.
qa/standalone: fixing the timings when waiting for deep-scrub to start
initiate_and_fetch_state() initiates a scrub, then polls the published
PG state looking for 'scrubbing'. Calling flush_pg_stats() as part of
the polling process might cause the scrub and the following recovery to
be missed altogether.
Note: this polling mechanism is definitely not robust. Will be
redesigned in the future.
Ronen Friedman [Sat, 15 May 2021 19:14:38 +0000 (22:14 +0300)]
test: recovery_scrub: do not display 'repair' status on auto-repair deep-scrub
A new test: auto_repair_bluestore_tag.
Based on auto_repair_bluestore_basic. Sets auto-repair, starts a periodic
deep-scrub, then verifies that the PG state, while scrubbing, is 'scrubbing+deep'
and not 'scrubbing+deep+repair'.
Ronen Friedman [Mon, 10 May 2021 13:15:16 +0000 (16:15 +0300)]
osd/scrub: separate PG state flags from internal scrubber operation
Modify the scrubber to rely on internal flags for 'should we repair' and
'is this a deep scrub', instead of using PG_STATE_REPAIR & PG_STATE_DEEP_SCRUB.
This enables us to implement the 'fix-as-you-go deepscrub' functionality
of 'osd_scrub_auto_repair', without displaying REPAIR status to the user.
Adam Kupczyk [Mon, 24 May 2021 12:27:05 +0000 (14:27 +0200)]
os/bluestore/bluefs: Add test that detects bluefs inconsistency
Add a test that detects a possible scenario that will cause BlueFS to
have a file containing data that has never been written. This is done by
tricking the replay log into already accepting the file metadata (size,
allocations) while the actual data stored in those allocations is not yet
synced to disk.
Scenario:
1) write to file h1 on SLOW device
2) flush h1 (and trigger h1 mark to be added to bluefs replay log)
3) write to file h2
4) fsync h2 (forces replay log to be written)
The result is:
- the bluefs log now has a stable state for h1
- the SLOW device is not yet flushed (no fdatasync())
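In test terms, the scenario looks roughly like this (BlueFS unit-test
style; directory/file names and buffer handling are assumed):

  char buf[4096] = {0};
  BlueFS::FileWriter *h1, *h2;
  fs.open_for_write("dir.slow", "file1", &h1, false);
  h1->append(buf, sizeof(buf));
  fs.flush(h1);  // h1's new size/allocations become dirty log metadata
  fs.open_for_write("dir.db", "file2", &h2, false);
  h2->append(buf, sizeof(buf));
  fs.fsync(h2);  // forces the replay log out: h1's metadata is now
                 // "stable", but h1's data on SLOW was never fdatasync'ed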
Adam Kupczyk [Mon, 24 May 2021 12:49:51 +0000 (14:49 +0200)]
os/bluestore/bluefs: Remove possibility of bluefs replay log containing files without data
It had been possible for the bluefs replay log to serialize file metadata
(size, allocations) while the actual data stored in those allocations was
not yet synced to disk.
This could happen if _flush_range(h1) allocated space for file h1 on a
device (like SLOW) that will not be used when flushing the future replay
log. Such a thing can happen when we have an h2 that wrote to WAL and our
replay log is on DB. After fsync(h2) we write the replay log and wait for
fdatasync on WAL and DB. There is no waiting on SLOW, but h1 was dirty and
had been serialized to the replay log.
The solution is to delay notifying the replay log that it has to include
h1 until after fdatasync has finished.
Fixes: https://tracker.ceph.com/issues/50965
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 03ac53f7d4c83e56f664ad371ffe3bc2d40e1837)
Kefu Chai [Thu, 10 Jun 2021 12:19:09 +0000 (20:19 +0800)]
tasks/ceph_manager: ignore EACCES when waiting for quorum
mon_tick_interval is 5 seconds by default. Monitors update their rotating
keys every mon_tick_interval. Before the monitors form a quorum, the auth
requests from clients are put into the wait list. These requests are
re-enqueued once the monitors form a quorum, but there is a small window
of mon_tick_interval before the monitors are able to serve auth requests,
even after they claim to be able to serve requests. If these re-enqueued
requests happen to be served in this window, and if cephx is enabled, they
will be greeted with errors like
handle_auth_bad_method server allowed_methods [2] but i only support [2]
In the case of the ceph CLI, the error would look like:
[errno 13] RADOS permission denied (error connecting to the cluster)
So, to address this issue, the EACCES error is ignored when waiting for a
quorum.
ceph-monstore-tool: use a large enough paxos/{first,last}_committed
so the rebuilt paxos transaction won't be overwritten by the ones created
before recovery completes.
When the quorum is recovering, the leader will collect the paxos
transactions from peons. If the quorum accepts the proposal for setting
the fingerprint, the peon will update the monitor with a paxos transaction
whose "last_committed" is newer than the one created using update_paxos()
in ceph_monstore_tool.cc; the latter "last_committed" is always 0.
So, to avoid this extra paxos proposal obsoleting the "rebuilding" paxos
transaction, we use a large enough number for {first,last}_committed.
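A sketch of the idea (the concrete constant chosen in
ceph_monstore_tool.cc is assumed here; 't' is the rebuild transaction):

  // pick a version far beyond anything the recovering quorum could
  // propose, so the rebuilt transaction stays authoritative
  constexpr version_t PAXOS_COMMITTED = 1000000000;
  t->put("paxos", "first_committed", PAXOS_COMMITTED);
  t->put("paxos", "last_committed",  PAXOS_COMMITTED);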
Adam C. Emerson [Wed, 14 Jul 2021 15:02:21 +0000 (11:02 -0400)]
rgw: Robust notify invalidates on cache timeout
This avoids a potential race condition in which updates are delayed.
Fixes: https://tracker.ceph.com/issues/51674
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
(cherry picked from commit 76247990ff38049ee32dd47d31482b9648353673)
Conflicts:
src/rgw/services/svc_notify.cc
- Skip the renaming, since this is a backport and that's mostly a
matter of futureproofing.
Backport: https://tracker.ceph.com/issues/51679
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Adam C. Emerson [Wed, 7 Jul 2021 22:47:00 +0000 (18:47 -0400)]
rgw: distribute() takes RGWCacheNotifyInfo
So we don't have to parse the bufferlist back out to find which object to
throw out of the cache.
Fixes: https://tracker.ceph.com/issues/51674
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
(cherry picked from commit 7f952ad80114096322f202ba58279aaa4a002313)
Backport: https://tracker.ceph.com/issues/51679
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
Adam C. Emerson [Tue, 13 Jul 2021 20:05:47 +0000 (16:05 -0400)]
rgw: Don't segfault on datalog trim
Synchronous (or yielded, basically other-than-AioCompletion) trim would
try to dereference the past-the-end iterator if we were trimming to a
point in the most recent generation.
https://tracker.ceph.com/issues/51661
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
(cherry picked from commit 97305f03c16db1cfaceef04a74ee510bc1fc1e80)
https://tracker.ceph.com/issues/51675
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
pacific: qa: FileNotFoundError: [Errno 2] No such file or directory: '/sys/kernel/debug/ceph/3fab6bea-f243-47a4-a956-8c03a62b61b5.client4721/mds_sessions'
Cherry-pick notes:
- handle differences due to renaming of rgw::sal::RGWObject to rgw::sal::Object
- handle differences due to move of test_ps_s3_metadata_on_master test from tests_ps.py to test_bn.py
Moving the attrs into s->bucket_attrs before setting them results in
setting empty attrs on the bucket. This means that reading them back
later gets empty attrs, which can cause a segfault.
mgr/dashboard: remove usage of 'rgw_frontend_ssl_key'
Fixes: https://tracker.ceph.com/issues/51643
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
Removing the usage of rgw_frontend_ssl_key from the rgw service form.
monitoring: remove instance label from ceph-cluster.json completely
The `instance` label is only useful if
- the exporter returns only data about its own node or instance, or
- the exporter provides an instance label and may then return data about
other nodes.
In this case, it's about the Prometheus mgr module, which is a single
exporter providing data about a whole cluster, not only data related to
the node (or instance) the mgr module is running on. It is completely
irrelevant which node the exporter runs on; the data provided doesn't
change. The exporter also doesn't provide `instance` labels (which
Prometheus wouldn't change due to our configuration; see the
"honor_labels" setting).
(Actually there's one exception where `instance` labels are provided by
the Ceph mgr module, but that doesn't affect the Ceph Cluster
dashboard.)
Note that keeping the instance label on this particular dashboard would
enable the user to switch between the data collected from a previously
failed mgr instance and the data from the currently running mgr instance
(on which the Prometheus mgr module runs). That would split the data,
which I don't think is a useful feature; it rather looks broken.
Fixes: https://tracker.ceph.com/issues/51212
Signed-off-by: Patrick Seidensal <pseidensal@suse.com>
(cherry picked from commit 037410713f032c0a2a25243e411ae67dffcc1d1a)