git.apps.os.sepia.ceph.com Git

qa/suites/krbd: rename rxbounce subsuite

A new job that doesn't want ms_mode to be set underneath it is about to
be added. Rename rxbounce to ms_modeless to make this purpose obvious.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 7f391c5688105e55f7799a9d45721ec49531747d)

rbd: support pool and image level overrides for rbd_default_map_options

Fixes: https://tracker.ceph.com/issues/52850
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit 9afc9712824a92fd6bdb2574c5880ab835236ed1)

Merge pull request #45079 from aclamk/wip-54318-quincy

quincy: os/bluestore/bluefs: Fix improper vselector tracking in _flush_special()

Reviewed-by: Igor Fedotov <ifedotov@suse.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>

Merge pull request #45108 from rhcs-dashboard/wip-dashboard-quincy-backports3

quincy: mgr/dashboard: 3rd (and hopefully last) backport batch

Reviewed-by: Nizamudeen A <nia@redhat.com>

Merge pull request #45092 from ljflores/wip-54326-quincy

quincy: mgr/telemetry: handle empty device report when "send" is triggered

Merge pull request #45074 from cbodley/wip-54162

quincy: rgw: fix segfault in OpsLogRados::log when realm is reloaded

Reviewed-by: Cory Snyder <csnyder@iland.com>

Merge pull request #45061 from soumyakoduri/quincy

quincy: rgw/dbstore: Add dbstore-tests to `make check`

Reviewed-by: Casey Bodley <cbodley@redhat.com>

mgr/dashboard: add validation for snmp v3 engine id

Fixes: https://tracker.ceph.com/issues/54270
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
(cherry picked from commit 2866db1eac7d726201f5bb34abdb32981c783f0e)

mgr/dashboard: change privacy protocol field from required to optional

Fixes: https://tracker.ceph.com/issues/54270
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
Privacy protocol field shouldn't be a required field.

(cherry picked from commit 2d8f2b8195a0f0c7a21d4ec5061b1b51a3aade2c)

mgr/dashboard: Contact Info should be visible only when Ident channel is checked

Fixes: https://tracker.ceph.com/issues/54133
Signed-off-by: Sarthak0702 <sarthak.0702@gmail.com>
(cherry picked from commit 15211a6378a6fee9316f79ba0b27821891527c38)

mgr/dashboard: dashboard turns telemetry off when configuring report

Signed-off-by: Sarthak0702 <sarthak.0702@gmail.com>
(cherry picked from commit 97c57adf8565756dbf24f3c46ed3916303903fb7)

mgr/dashboard: "Please expand your cluster first" shouldn't be shown if cluster is already meaningfully running

This PR will assume that a cluster is already up and fully running. If this should not be the expected behaviour, deployment tools have to set 'INSTALLED' explicitly. Without this assumption it might happen that upgraded and fully running clusters, e.g. Octopus -> Pacific, will show the 'Expand Cluster' on first log in.

cephadm will take care that the bootstrap phase will write the necessary key to show the 'Expand cluster' page.

Fixes: https://tracker.ceph.com/issues/54215
Signed-off-by: Volker Theile <vtheile@suse.com>
(cherry picked from commit 48fff60b63785ec07f71d3e59394b0c08357247c)

mgr/telemetry: handle empty device report when "send" is triggered

On certain environments, such as the "ceph-dev-docker" environment
(https://github.com/ricardoasmarques/ceph-dev-docker), the mgr
module is unable to fetch device metrics. As a result, the device
report generated by "gather_device_report()" returns an empty dict.
This causes an AssertionError when the "send" function is triggered
(i.e. by running `ceph telemetry status` or `ceph telemetry send`),
and the module crashes.

The fix in this commit checks that the generated device report
contains metrics before trying to send it. If the device report
does not contain metrics (it returns an empty dict), the module
will log an appropriate message in the mgr log and not send the
device report.

If this scenario happens when running the `ceph telemetry send` command,
the user will additionally see this message:
```
Ceph report sent to https://telemetry.ceph.com/report
Unable to send device report: channel is on, but generated report was empty.
```

I also added a few more debug messages in gather_device_report() to make
future debugging easier.
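
The check described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: `send_reports`, its parameters, the logger name, and the "Device report sent" line are hypothetical; only the two quoted status messages come from this commit message.

```python
import logging
from typing import Any, Dict, List

log = logging.getLogger("telemetry_sketch")  # hypothetical logger name


def send_reports(device_report: Dict[str, Any], device_channel_on: bool) -> List[str]:
    """Return the status lines a `send` call would show (nothing is sent here)."""
    messages = ["Ceph report sent to https://telemetry.ceph.com/report"]
    if not device_channel_on:
        return messages
    if not device_report:
        # An empty dict means no device metrics were gathered: log and skip
        # the device report instead of tripping an assertion.
        msg = "Unable to send device report: channel is on, but generated report was empty."
        log.debug(msg)
        messages.append(msg)
        return messages
    messages.append("Device report sent")  # hypothetical wording
    return messages


empty_case = send_reports({}, device_channel_on=True)
normal_case = send_reports({"dev-1": {"metric": 1}}, device_channel_on=True)
```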

Fixes: https://tracker.ceph.com/issues/54250
Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit 54e0e58f1b3f431281df0e2dd2b258f85cbade19)

Merge pull request #45058 from idryomov/wip-rbd-quincy-batch-3

quincy: rbd backports (batch 3)

Reviewed-by: Deepika Upadhyay <dupadhya@redhat.com>

os/bluestore/bluefs: Fix vselector

Fix bluefs volume selector in device_migrate_to_existing.
Fix bluefs volume selector in _rewrite_log_and_layout_sync_LNF_LD.

Fixes: https://tracker.ceph.com/issues/54248
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 3813416e6a8d296312271598e823f876a09b2504)

os/bluestore/bluefs: Fix improper vselector tracking in _flush_special()

Moves vselector size tracking outside _flush_special().
Function _compact_log_async...() updated sizes twice.
The problem could not be solved by turning the second size modification into a plain update,
as that could disrupt the vselector consistency check (_vselector_check()).
The vselector consistency tracking relies on either log.lock or nodes.lock
being held when the check is performed, which is not true for _compact_log_async...().

Now _flush_special() does not update vselector sizes itself but leaves the update to
the caller.

Fixes: https://tracker.ceph.com/issues/54248
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 4bc0f61d23299724fad2d8e6f2858734f1db6e5a)

rgw: fix segfault in OpsLogRados::log when realm is reloaded

We weren't previously handling the deallocation of the store when
a realm was reloaded. Now passing a const reference to the pointer.

Fixes: https://tracker.ceph.com/issues/54130
Signed-off-by: Cory Snyder <csnyder@iland.com>
(cherry picked from commit 0713f65355586b2f6ceeb6bbce8763158847e5ed)

Merge pull request #45038 from guits/bkp-quincy-cephadm-ingress-fix

quincy: cephadm/ingress: make frontend stat bind on localhost

Reviewed-by: Adam King <adking@redhat.com>

Merge pull request #45043 from kamoltat/wip-ksirivad-backport-quincy-44588

quincy: pybind/mgr/progress: disable pg recovery event by default

Reviewed-by: Neha Ojha <nojha@redhat.com>

Merge pull request #45030 from ljflores/wip-quincy-basic-channel-additions

quincy: mgr/telemetry: add basic_pool_usage and basic_usage_by_class collections to the telemetry module

Reviewed-by: Yaarit Hatuka <yaarit@redhat.com>

Merge pull request #45029 from ljflores/wip-54274-quincy

quincy: mgr/telemetry: collect what we can from histograms, mempools, and heap stats

Reviewed-by: Yaarit Hatuka <yaarit@redhat.com>

Merge pull request #44982 from batrick/i54234

quincy: qa: use cephadm to provision cephfs for fs:workloads

Reviewed-by: Adam King <adking@redhat.com>

Merge pull request #44952 from aclamk/wip-54209-quincy

quincy: [BlueStore] Fix problem with volume selector

Reviewed-by: Igor Fedotov <ifedotov@suse.com>

rgw/dbstore: Add dbstore-tests to `make check`

Include and run dbstore-tests as part of `make check` target

Signed-off-by: Soumya Koduri <skoduri@redhat.com>

qa/suites/rbd: make sure block-rbd.so is installed

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 8f0fd0af3da8581c47dc916303615264714a0489)

qa/tasks/qemu: make sure block-rbd.so is installed

Fixes: https://tracker.ceph.com/issues/54286
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 525ff61cfc8516b4d7bed6f819b00a0b6cb7be0a)

cls/rbd: GroupSnapshotNamespace comparator violates ordering rules

For

  GroupSnapshotNamespace a(1, "group-1", "snap-2");
  GroupSnapshotNamespace b(1, "group-2", "snap-1");

both a < b and b < a evaluate to true.  This violates STL strict weak
ordering requirements which is a problem because GroupSnapshotNamespace
is used as a key in std::map (ictx->snap_ids at least), etc.
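
The problem can be modelled in a few lines. The sketch below is in Python for brevity and is only an illustration: `broken_less` is an assumed shape of a field-by-field comparison that falls through to later fields without checking that earlier ones are equal (the commit message does not show the original code), and `lexicographic_less` is one conventional way to get a strict weak ordering.

```python
from typing import NamedTuple


class GroupSnapshotNamespace(NamedTuple):
    group_pool: int
    group_id: str
    group_snapshot_id: str


def broken_less(x: GroupSnapshotNamespace, y: GroupSnapshotNamespace) -> bool:
    # Assumed faulty pattern: later fields are consulted even when an earlier
    # field already compares greater.
    if x.group_pool < y.group_pool:
        return True
    if x.group_id < y.group_id:
        return True
    return x.group_snapshot_id < y.group_snapshot_id


def lexicographic_less(x: GroupSnapshotNamespace, y: GroupSnapshotNamespace) -> bool:
    # Tuple comparison only looks at a later field when all earlier ones are equal.
    return tuple(x) < tuple(y)


a = GroupSnapshotNamespace(1, "group-1", "snap-2")
b = GroupSnapshotNamespace(1, "group-2", "snap-1")
both_true = broken_less(a, b) and broken_less(b, a)
```

With the two values from the commit message, `both_true` is `True` for the faulty pattern, while the lexicographic version orders `a` before `b` only.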

Fixes: https://tracker.ceph.com/issues/49792
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 830e72ab9d66c8f5703ea27da5249b02dd16ccd0)

rbd: mark optional positional arguments as such in help output

Currently at least five commands have optional positional arguments.

Overloading po::value<std::string>()->default_value("") for this
is a bit sneaky but nothing better fits into the existing Shell.cc
framework.

Note that strictly speaking "[<interval>] [<start-time>]" should be
"[<interval> [<start-time>]]" but we aren't doing that here because
"ceph" command doesn't do it either.

Fixes: https://tracker.ceph.com/issues/54191
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit cb0df397aae552adc80713ca0d59ed1ebfd3b1be)

Merge pull request #44909 from batrick/i54160

quincy: mon/MDSMonitor: sanity assert when inline data turned on in MDSMap from v16.2.4 -> v16.2.[567]

Reviewed-by: Venky Shankar <vshankar@redhat.com>

Merge pull request #44875 from kotreshhr/wip-54123-quincy

quincy: mgr/volumes: Fix subvolume snapshot clone failure

Reviewed-by: Venky Shankar <vshankar@redhat.com>

pybind/mgr/progress: disable pg recovery event by default

The progress module now disables the pg recovery event by default,
since the event is expensive and has interrupted other services
when OSDs are being marked in/out of the cluster.

To turn the event on manually:

ceph config set mgr mgr/progress/allow_pg_recovery_event true

Updated qa/tasks/mgr/test_progress.py to enable
the pg recovery event when testing the progress module.

Signed-off-by: Kamoltat <ksirivad@redhat.com>
(cherry picked from commit f06da20dff141dc239900f944001d55fb8296014)

Merge pull request #44899 from rhcs-dashboard/wip-dashboard-quincy-backports2

quincy: dashboard: 2nd backport batch

Reviewed-by: Nizamudeen A <nia@redhat.com>

cephadm/ingress: make frontend stat bind on localhost

The current configuration of keepalived makes it do
a curl on localhost:9999 in order to check the endpoint is alive.
Given the endpoint only binds on the vip addr, that doesn't work.

Fixes: https://tracker.ceph.com/issues/53807
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit ff482da6cb3a62b14f3a06e2d558876eabebfe65)

mgr/telemetry: separate device class usage statistics into their own collection

The new collection is called `basic_usage_by_class`. This info should be separate
from `basic_pool_usage` since it doesn't involve pool statistics.

Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit f69cec5b708ce71083d16d9976cf7e6b20f090d2)

mgr/telemetry: update `basic_pool_usage` collection desc

- Added the word "default" since we are only collecting
default pool applications

- Removed the word "data" since we are actually collecting
usage *statistics*

Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit c71a54ec1ab804de8408bdf39fe8727192d23492)

doc/mgr: update telemetry doc to reflect `basic_pool_usage` collection

Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit 4a2b54c1f2f2b58784d118011c3bd407281123ff)

mgr/telemetry: fix perf channel to screen out non-default pool applications

Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit 7467ed59aceb696e6682081dd03bbe8e9cccf789)

mgr/telemetry: add `stats_by_class` to the `basic_pool_usage` collection

Any device classes that are not default ('hdd', 'ssd', 'nvme') are screened out.
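
A minimal sketch of this kind of screening, assuming the per-class statistics arrive as a dict keyed by device class; the function name and the sample data are hypothetical, only the three default class names come from this commit message.

```python
from typing import Any, Dict

DEFAULT_DEVICE_CLASSES = ("hdd", "ssd", "nvme")


def screen_stats_by_class(stats_by_class: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only entries whose device class is one of the defaults."""
    return {
        cls: stats
        for cls, stats in stats_by_class.items()
        if cls in DEFAULT_DEVICE_CLASSES
    }


# Hypothetical input: a custom class name is dropped from the report.
screened = screen_stats_by_class({"hdd": {"osds": 3}, "my-custom-class": {"osds": 1}})
```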

Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit 285d14457c157a3e4dfd12363e0ba02b8add57fa)

mgr/telemetry: add df stats to the `basic_pool_usage` collection

The `df` stats under `pools` indicate data usage for each pool.
The `kb_bytes` field is screened out since it is redundant.

Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit ee63d624ba395dacd9f9c0ff59a989589448eab8)

mgr/telemetry: create `basic_pool_usage` collection

Here, I define the `basic_pool_usage` collection and add
pool application under the basic channel. I screen out
any applications that are not default.

Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit 1f571cd4251422f03f40401c3f9163d716d2b6e4)

mgr/telemetry: compare len(values) to len(categories)

This format will allow us to safely add or remove
categories as needed in the future.

Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit 5ac1dd6866287d9e5bc895a4028836d09c836069)

mgr/telemetry: collect what we can from heap stats, mempools, and osd histograms

If we run into a problem collecting heap stats, mempools,
or osd histograms from a particular osd (i.e. the osd is down),
we should continue to collect what we can from other osds rather
than exiting and returning an empty JSON object.

Some log messages are also refined.
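
The "collect what we can" behaviour can be sketched as a loop that skips failing OSDs instead of aborting. All names here (`collect_per_osd`, `fetch`, `flaky_fetch`) are hypothetical and stand in for whatever the module uses to query each OSD.

```python
import logging
from typing import Any, Callable, Dict, Iterable

log = logging.getLogger("telemetry_sketch")  # hypothetical logger name


def collect_per_osd(
    osd_ids: Iterable[int],
    fetch: Callable[[int], Dict[str, Any]],
) -> Dict[str, Dict[str, Any]]:
    """Gather stats from every OSD that answers; skip the ones that fail."""
    collected: Dict[str, Dict[str, Any]] = {}
    for osd_id in osd_ids:
        try:
            collected[f"osd.{osd_id}"] = fetch(osd_id)
        except Exception as e:
            # One unreachable OSD should not empty the whole report.
            log.debug("skipping osd.%d: %s", osd_id, e)
    return collected


def flaky_fetch(osd_id: int) -> Dict[str, Any]:
    # Hypothetical stand-in: pretend osd.1 is down.
    if osd_id == 1:
        raise RuntimeError("osd is down")
    return {"heap_bytes": 100 * (osd_id + 1)}


partial = collect_per_osd([0, 1, 2], flaky_fetch)
```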

Fixes: https://tracker.ceph.com/issues/53985
Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit c617b78f7bb589314b3c377496a9bb3914cbb2ba)

Merge pull request #44972 from guits/wip-54243-quincy

quincy: ceph-volume: honour osd_dmcrypt_key_size option

ceph-volume/activate: load the config from lv tag

When `ceph-volume lvm trigger` is called with an OSD where the tag
`ceph.cluster_name` is not 'ceph', it fails.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 5ac1ec65cb2a582b2ae550202cc9911f993943f2)

ceph-volume/tests: use centos/stream8 images

Since recent move from CentOS 8 to CentOS Stream 8, let's do the same here.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 2b793952bbac7973b97d245c282165daadeabb51)

ceph-volume/tests: add tests in util/encryption.py

This adds some unit tests to cover `luks_format()` and `luks_open()`
in `util/encryption.py`.

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit db48850745f218e08cf53ae2d8edf3428f2b4010)

ceph-volume: honour osd_dmcrypt_key_size option

ceph-volume doesn't honour osd_dmcrypt_key_size.
It means the default size is always applied.

This commit also changes the default value in `get_key_size_from_conf()`.

From cryptsetup manpage:

> For XTS mode you can optionally set a key size of 512 bits with the -s option.

Using more than 512 bits results in the following error message:

```
Key size in XTS mode must be 256 or 512 bits.
```
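
The constraint quoted above can be expressed as a small validation helper. This is only a sketch: `validate_xts_key_size` is a hypothetical name and is not part of ceph-volume; the error text is the cryptsetup message quoted above.

```python
XTS_KEY_SIZES = (256, 512)  # bits, per the error message quoted above


def validate_xts_key_size(key_size: int) -> int:
    """Return key_size if it is acceptable for XTS mode, else raise ValueError."""
    if key_size not in XTS_KEY_SIZES:
        raise ValueError("Key size in XTS mode must be 256 or 512 bits.")
    return key_size


accepted = validate_xts_key_size(512)
try:
    validate_xts_key_size(1024)
    rejected = False
except ValueError:
    rejected = True
```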

Fixes: https://tracker.ceph.com/issues/54006
Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 47c33179f9a15ae95cc1579a421be89378602656)

Merge pull request #44918 from benhanokh/wip-54175-quincy

quincy: BlueStore:NCB:Bug-Fix for recovery code with shared blobs

Reviewed-by: Neha Ojha <nojha@redhat.com>

mgr/dashboard:Directories Menu Can't Use on Ceph File System Dashboard

Added exception handling to opendir() in cephfs.py for directories with no execute permission.
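
A rough sketch of the pattern, under stated assumptions: `opendir` is passed in as a callable so the example stays self-contained, the builtin `PermissionError` stands in for whatever exception the cephfs binding raises, and returning an empty listing is an illustrative choice rather than the dashboard's confirmed behaviour.

```python
import logging
from typing import Callable, Iterable, List

log = logging.getLogger("cephfs_sketch")  # hypothetical logger name


def list_dir_entries(path: str, opendir: Callable[[str], Iterable[str]]) -> List[str]:
    """List entries under path, treating an unreadable directory as empty."""
    try:
        handle = opendir(path)
    except PermissionError as e:
        # A directory without execute permission cannot be opened; report it
        # in the log and carry on instead of failing the whole listing.
        log.warning("cannot open %s: %s", path, e)
        return []
    return list(handle)


def denied_opendir(path: str) -> Iterable[str]:
    # Hypothetical stand-in for an opendir() call on a restricted directory.
    raise PermissionError(13, "Permission denied", path)


restricted = list_dir_entries("/restricted", denied_opendir)
readable = list_dir_entries("/data", lambda path: ["a", "b"])
```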

Fixes: https://tracker.ceph.com/issues/51611
Signed-off-by: Sarthak0702 <sarthak.0702@gmail.com>
(cherry picked from commit ea1af5438d380eb2160de635ffc7b08a69baf04c)

Merge pull request #44906 from trociny/wip-54143-quincy

quincy: rgw: check bucket shard init status in RGWRadosBILogTrimCR

Reviewed-by: Casey Bodley <cbodley@redhat.com>

Merge pull request #44852 from k0ste/wip-54073-quincy

quincy: rgw: fix bucket index list minor calculation bug

Reviewed-by: Casey Bodley <cbodley@redhat.com>

qa: update rhel kclient to setup container tools

To fix [1,2].

[1] https://github.com/ceph/ceph/pull/42000#issuecomment-905628920
[2] https://github.com/ceph/ceph/pull/42000#issuecomment-906276775

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 0fcf8922dcedba0a9f36a59a44005389b5130702)

qa: stop overriding distro for k-testing

This is a continuation of the previous commit

qa: only use RHEL for workload testing

We don't want to test fs:workload with centos/ubuntu to avoid packaging
issues and to reduce the matrix of distros we're running workloads on.
Also, the testing kernel should install fine on the distros we test with
"supported" random distros.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit fb75ed6d391960f0826ac810b942afd2f0a662ea)

qa: only use RHEL for workload testing

It's not useful testing workloads with different distributions; it just
adds to the maintenance burden of this qa suite as distro upgrades often
break compilation of various tests.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 017ccd21e6ce58bd772ebcdcbac0e6ac1412f409)

qa: convert fs:workload to use cephadm

Note: it's important to keep the install task which supplies packages
needed for some workloads.

Fixes: https://tracker.ceph.com/issues/51333
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 405bb2e48e5914e4b849bca5cb32660f13c4a00a)

qa: split fs begin task

To allow switching to cephadm task.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 50c39dc007615902e1fb040c03965c0bb3edc142)

Conflicts:
qa/cephfs/begin/0-install.yaml

qa/tasks/cephadm: setup CephManager when OSDs are provisioned

The Filesystem object may use this when configuring EC data pools at
file system creation (via a FuseMount).

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 27c1110129bd5c8eb7b58e7051e9c1ac2446328c)

qa/tasks/cephadm: setup file system if MDS are provisioned

This is the same behavior/code as what the ceph task does.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 2436405c5d49cb3fae852763366c0541db175cb4)

mgr/dashboard: add snmp-gateway service e2e tests

Fixes: https://tracker.ceph.com/issues/54034
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
(cherry picked from commit 76dcf6a881f343bb3d93259701e57ccb572f94e8)

mgr/dashboard: add snmp destination validation

Signed-off-by: Avan Thakkar <athakkar@redhat.com>
(cherry picked from commit 81c93a21ff64e6aeeeba2c88db4f61b13352f565)

mgr/dashboard: support snmp-gateway service creation from UI

Fixes: https://tracker.ceph.com/issues/54034
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
(cherry picked from commit ad6fcfc05625b3fd8a088b8a2b5c3d5fbbf2c53a)

mgr/dashboard: change the readFile to readFileSync

Apparently the readFile I added in #44934 is async, which is not what
we want, so change it to the synchronous call, readFileSync.

Fixes: https://tracker.ceph.com/issues/54190
Signed-off-by: Nizamudeen A <nia@redhat.com>
(cherry picked from commit cbfdd551d9c1e67c2757056ac1119c058f4aa704)

mgr/dashboard: set appropriate baseline branch for applitools

All the dashboard PRs are checked against a baseline branch called
'default' in the visual regression testing. This causes issues when
testing PRs in different branches. For example, our master and
pacific branches currently have to save two different screenshots
since they differ slightly.

Also disable the applitools logs because they are too noisy.

Fixes: https://tracker.ceph.com/issues/54190
Signed-off-by: Nizamudeen A <nia@redhat.com>
(cherry picked from commit 40c902ac59b758a314f6a123d71cb59342523dac)

mgr/dashboard: fix for cephadm e2e failing because of rgw commands getting stuck

Delaying the rgw service creation in the tests until the cluster is
healthy

also changing the node_ip_offset to 110 because in the jenkins I saw

Fixes: https://tracker.ceph.com/issues/54030
Signed-off-by: Nizamudeen A <nia@redhat.com>
(cherry picked from commit 347fb2e8fe26020a4693d3bbd94ca007c7e3535a)

cephadm: change shared_folder directory for prometheus and grafana

After https://github.com/ceph/ceph/pull/44059 the monitoring/prometheus
and monitoring/grafana/dashboards directories are changed to
monitoring/ceph-mixins. That broke the shared_folders in the cephadm
bootstrap script.

Changed all the instances of monitoring/prometheus and
monitoring/grafana/dashboards to monitoring/ceph-mixins

Also, renaming all the instances of prometheus_alerts.yaml to
prometheus_alerts.yml.

Fixes: https://tracker.ceph.com/issues/54176
Signed-off-by: Nizamudeen A <nia@redhat.com>
(cherry picked from commit 27592b75618706194e668c40056d9bfc58c5a3c6)

mgr/dashboard: cephadm e2e job: display info on error & other improvements

- Fix: ensure that on_error trap is called (display more info on error).
- Set static IPs to VMs.
- Remove domain in cluster definition to avoid side effects of potential dns misconfiguration.
- Minor improvements.

Fixes: https://tracker.ceph.com/issues/53991
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
(cherry picked from commit 39af61efb24dac6f41ba0752944882d35ad287db)

doc: update dashboard kcli test env documentation

Fixes: https://tracker.ceph.com/issues/54105
Signed-off-by: Nizamudeen A <nia@redhat.com>
(cherry picked from commit 8feb2b8fe03f0c06a0ab09328ca8df0dfe8c0de9)

cephadm/box: fix remove image tar error and cleanups

Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
(cherry picked from commit 71c493528eb17f4280b50df67bfd437e054cb6aa)

mgr/dashboard: perform daemon actions on cluster->services

Signed-off-by: Pere Diaz Bou <pdiazbou@redhat.com>
Fixes: https://tracker.ceph.com/issues/50322
(cherry picked from commit 239f884c31976f3e716d6e33224a0efb6220288e)

monitoring: build jsonnet/jb only for testing

Build jsonnet and jb in the test so that we can build ceph without
internet access and still be able to run the tests needed for monitoring
using jsonnet tools.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit 8ff1e6b39976ea5e857b6575934d1a42302c6a0f)

spec: debian: monitoring: build jsonnet from source to use 0.18.0

As this new version was released recently, it is not yet in every distro
we use. We now build jsonnet from source so that we can use this new
version. This commit could be reverted later on, once the new
version is available everywhere.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit ecaf9070aed955c5a7ec7818cd9e2c45ddacc545)

mgr/dashboard: monitoring: refactor into ceph-mixin

Mixin is a way to bundle dashboards, prometheus rules and alerts into
a jsonnet package. Shifting to mixin will allow easier integration with
monitoring automation that some users may use.

This commit moves `/monitoring/grafana/dashboards` and
`/monitoring/prometheus` to `/monitoring/ceph-mixin`. Prometheus alerts
were also converted to Jsonnet in an automated way (from yaml to json
to jsonnet). This commit minimises changes to the generated files
and should change neither the dashboards nor the Prometheus alerts.

In the future some configuration will also be added to jsonnet to add
more functionality to the dashboards or alerts (e.g. multi cluster).

Fixes: https://tracker.ceph.com/issues/53374
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit 98236e3a1d2855c95d86640645c2984efa83791f)

spec: debian: add golang as build dependency

Add golang as a build dependency to build golang project in the test
for monitoring/ceph-mixin.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit e102620394a5d889e42616278de73dfb3b01f625)

monitoring/grafana: Add tests for radosgw panels

Some of the expressions modified in c40290390d7 were not covered by any tests,
especially those in the `radosgw-detail.json` dashboard.

This commit fills in those gaps.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 2daaa052ea82ff806a529402e802adbbbe9b4554)

monitoring/grafana: Update radosgw dashboards

With the `ceph_daemon` label now replaced by `instance_id` on all `ceph_rgw_*`
metrics, we need to update the Grafana dashboards to get that label back from
`ceph_rgw_metadata` using this type of construct:

```
ceph_rgw_req * on (instance_id) group_left(ceph_daemon) ceph_rgw_metadata
```

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit adc36dea7fc586c4d882462fbd3ab52006402b8a)

pybind/mgr/prometheus: Add instance_id metadata for rgw

In order to get the `ceph_daemon` label for `rgw` metrics corresponding to the
value before #40220, we need to add the `instance_id` label to the
`ceph_rgw_metadata` metric.

This way, the old `ceph_daemon` label can be added to any `ceph_rgw_*` metric
using the following PromQL query, for instance:

```
ceph_rgw_req * on (instance_id) group_left(ceph_daemon) ceph_rgw_metadata
```
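
For readers less familiar with PromQL vector matching, the effect of the query above can be mimicked in a few lines of Python. This is a rough model, not how Prometheus evaluates it: samples are plain dicts, the metadata value is assumed to be 1 so the multiplication leaves the request count unchanged, and the label values are made up for illustration.

```python
from typing import Any, Dict, List

Sample = Dict[str, Any]


def join_ceph_daemon(rgw_samples: List[Sample], metadata: List[Sample]) -> List[Sample]:
    """Copy ceph_daemon from metadata onto samples with the same instance_id."""
    daemon_by_instance = {m["instance_id"]: m["ceph_daemon"] for m in metadata}
    joined = []
    for sample in rgw_samples:
        daemon = daemon_by_instance.get(sample["instance_id"])
        if daemon is None:
            # No matching metadata series: the sample drops out of the result.
            continue
        joined.append({**sample, "ceph_daemon": daemon})
    return joined


# Hypothetical label values, for illustration only.
requests = [
    {"instance_id": "127202", "value": 42},
    {"instance_id": "999999", "value": 7},
]
metadata = [{"instance_id": "127202", "ceph_daemon": "rgw.my-hostname.rgw0", "value": 1}]
joined = join_ceph_daemon(requests, metadata)
```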

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 01b42c1c51a1b4142adcde0c2c673b60e61e4697)

pybind/mgr/mgr_module.py: Set instance_id label for rgw

Now that the RadosGW returns its instance ID instead of its daemon name,
replace the `ceph_daemon` label with an `instance_id` label on the `rgw`
metrics.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 9f6573bbb1eeed8fab117149a117fdffd56bdf64)

mgr/ActivePyModules: Add metadata id in dump_server()

The `DaemonStateCollection` used to always contain the daemon name in its
`DaemonKey`, but since #40220 (or more specifically
afc33758e076761b8d4ec004c8f9c49b80a48770), the RadosGW registers with its
instance ID instead (`rados.get_instance_id()`).

As a result, the `ceph_rgw_*` metrics returned by `ceph-mgr` through the
`prometheus` module have their `ceph_daemon` label include that ID instead of
the daemon name, e.g.

```
ceph_rgw_req{ceph_daemon="rgw.127202"}
```

instead of

```
ceph_rgw_req{ceph_daemon="rgw.my-hostname.rgw0"}
```

This commit adds the daemon name from `state->metadata["id"]` if available, as
`service.name` in the JSON document returned by `dump_server()`.

Signed-off-by: Benoît Knecht <bknecht@protonmail.ch>
(cherry picked from commit 2db1aaabe5f4627bb7b177ab3441593f08aa7cbe)

Merge pull request #44943 from ljflores/wip-54203-quincy

quincy: monitoring: mention PyYAML only once in requirements

os/bluestore/bluefs: Make volume selector operations atomic

Make all RocksDBBlueFSVolumeSelector files/extents/size tracking atomic.
It used to be synchronized by BlueFS global lock.
Now, in Fine Grain Locking era, it is necessary to prevent corruption.

Fixes: https://tracker.ceph.com/issues/53906
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 372bda350966624d5081635e659f7c46947980c2)

os/bluestore/bluefs: Code for volume selector check

Adds ability to verify that volume selector properly tracks disk usage.
Creates options:
- bluefs_check_volume_selector_on_umount
- bluefs_check_volume_selector_often
that can be used to validate that vselector does not diverge from
values it should have.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit d233e3b1d23c135f0ec8d808c0961ddce8526bc8)

monitoring: mention PyYAML only once in requirements

Following error occurs while running "sudo install-deps.sh" -
ERROR: Double requirement given: PyYAML==6.0 (from -r requirements-lint.txt (line 5)) (already in pyyaml (from -r requirements-alerts.txt (line 1)), name='PyYAML')

PyYAML is mentioned twice as a requirement. It is mentioned once in both
the following files -
monitoring/ceph-mixin/requirements-lint.txt
monitoring/ceph-mixin/requirements-alerts.txt

These requirements were added in commits
44d3e4c264506154373ffaeb13d6c924c580e6b5 and
4750ac0d7766a8a089adf073415af0ac0d3f81d9.

Fixes: https://tracker.ceph.com/issues/54185
Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit a6f5efb620c429f81ea13992c2f77b4ca55458bc)

Merge pull request #44914 from idryomov/wip-rbd-quincy-batch-2

quincy: rbd backports (batch 2)

Reviewed-by: Deepika Upadhyay <dupadhya@redhat.com>

krbd: return error when no initial monitor address found

Since we filter monitor addresses based on ms_mode, check that at
least one address was found.

Otherwise, we mismatch arguments when calling sysfs/add_single_major
which emits a misleading error message to dmesg:

libceph: resolve 'name=user1' (ret=-3): failed
libceph: parse_ips bad ip 'name=user1,key=client.user1'
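
The added check amounts to failing early when the filter leaves nothing behind. The sketch below is in Python for brevity and is not the krbd code: the function name, the `(protocol, address)` representation, the mapping of `legacy` to v1 addresses and every other mode to v2, and the use of `LookupError` are all assumptions made for illustration.

```python
from typing import List, Tuple

MonAddr = Tuple[str, str]  # (protocol, "ip:port"); hypothetical representation


def select_initial_mon_addrs(addrs: List[MonAddr], ms_mode: str) -> List[str]:
    """Filter monitor addresses by ms_mode and insist that at least one remains."""
    wanted = "v1" if ms_mode == "legacy" else "v2"  # assumed mapping
    selected = [addr for proto, addr in addrs if proto == wanted]
    if not selected:
        # Fail here with a clear error rather than passing a malformed
        # argument string further down.
        raise LookupError(f"no initial monitor address found for ms_mode={ms_mode}")
    return selected


v1_only = [("v1", "192.0.2.1:6789")]
legacy_addrs = select_initial_mon_addrs(v1_only, "legacy")
try:
    select_initial_mon_addrs(v1_only, "crc")
    failed_early = False
except LookupError:
    failed_early = True
```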

Fixes: https://tracker.ceph.com/issues/54128
Signed-off-by: Burt Holzman <burt@fnal.gov>
(cherry picked from commit 0076ffc86e043af7aedc127df8661eaf87fc1c58)

Merge pull request #44902 from neha-ojha/wip-44868-quincy

quincy: qa/distros: remove centos8

Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Adam King <adking@redhat.com>
Reviewed-by: Yuri Weinstein <yweinste@redhat.com>

Merge pull request #44848 from cbodley/wip-54088

quincy: qa: remove centos8 from supported distros

Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>

qa: Add tests for snapshot clone failure with quota

Fixes: https://tracker.ceph.com/issues/53848
Signed-off-by: Kotresh HR <khiremat@redhat.com>
(cherry picked from commit 7c0d31e52cea90e65152996024cabfa8a8fd299f)

os/BlueStore: NCB fixes recovery code with shared blobs

Replaces the BitmapAllocator used by NCB Recovery code with a dedicated SimpleBitmap.
The SimpleBitmap allows for bits to be set multiple times without any adverse effect.
This is needed because shared-blobs will report the same allocation multiple times.
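
The property that matters here is that setting a bit is idempotent. A toy model in Python (the class and method names are hypothetical and unrelated to the C++ SimpleBitmap interface):

```python
class SimpleBitmapSketch:
    """Toy bitmap in which setting an already-set bit is a harmless no-op."""

    def __init__(self, num_bits: int) -> None:
        self._num_bits = num_bits
        self._bits = bytearray((num_bits + 7) // 8)

    def set_range(self, offset: int, length: int) -> None:
        if offset < 0 or length < 0 or offset + length > self._num_bits:
            raise IndexError("range outside bitmap")
        for bit in range(offset, offset + length):
            # OR-ing is idempotent, so repeated reports of the same
            # allocation leave the bitmap unchanged.
            self._bits[bit >> 3] |= 1 << (bit & 7)

    def count_set(self) -> int:
        return sum(bin(byte).count("1") for byte in self._bits)


bitmap = SimpleBitmapSketch(16)
bitmap.set_range(0, 4)
bitmap.set_range(2, 4)   # overlaps the first range
bitmap.set_range(0, 4)   # exact repeat, as a shared blob might report
set_bits = bitmap.count_set()
```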

Fixes: https://tracker.ceph.com/issues/53678
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
(cherry picked from commit 8868894491c5e4df6d77fb78ed22702a493fe4f8)

qa/workunits/rbd: improve schedule add/remove cli test

This patch adds a few tests to cover schedule add/remove with invalid
inputs.

Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
(cherry picked from commit a9312d4777a82d8f2d8766a011f10952f84d3f27)

mgr/rbd_support: fix schedule remove

Issue:

If we provide a random string in the schedule remove
command the entire schedule at specified level gets
removed.

Fixes: https://tracker.ceph.com/issues/53250
Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
(cherry picked from commit 1b62447071a900b9fa7d856617cb7db9e030f91e)

qa/suites/krbd: add legacy+rxbounce and crc+rxbounce coverage

For basic, rbd and rbd-nomount subsuites, replace legacy and crc
facets with "legacy or legacy+rxbounce" and "crc or crc+rxbounce"
facets (chosen at random).

For fsx, singleton and thrash subsuites, add legacy+rxbounce and
crc+rxbounce facets and drop prefer-crc facet. The expected behaviour
of the latter depends on cluster configuration and should be tested
separately.

The total number of jobs remains the same.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit fbf8c1d68be60ab294719113edbd7f459a755c15)

qa: krbd rxbounce test

Lives in its own directory since ms_mode doesn't need to be permuted
here.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 95d30b534ef65207168397dd25ca7213c8290568)

rbd: recognize rxbounce map option

Fixes: https://tracker.ceph.com/issues/54063
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 8d2a456d7055cfb64e6bb9927187e2240b8c4d2a)

qa/suites/rbd: add cram-based mon command API test

With mon (rbd_support mgr module in this case) command definitions
generated automatically by @CLI{Read,Write}Command decorator, it's
very easy to accidentally break the external facing API.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 4ed1e74d83e8bf99b77d794d2d3bd0b22fe0997a)

mgr/rbd_support: level_spec is optional for schedule list/status

Commit fea6fdff4c74 ("mgr/rbd_support: level_spec passed to some
commands is not optional") is wrong. While it is true that a valid
level_spec is needed to create a LevelSpec instance, an empty string
is very much a valid level spec -- it signifies "all levels".

This wasn't caught because within Ceph these commands are wrapped by
rbd CLI which injects an empty string in get_level_spec_args().

Fixes: https://tracker.ceph.com/issues/54058
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit a5eef01e9248b09c187fcb8c6d122fd08dc54c88)

mgr/rbd_support: "trash remove" takes image_id_spec, not image_spec

Because of @CLIWriteCommand, the parameter name has to adhere to
the mon command API. Commit dcb51b067a49 ("mgr/rbd_support: define
commands using CLICommand") accidentally changed image_id_spec to
image_spec, breaking external users such as go-ceph.

Fixes: https://tracker.ceph.com/issues/54057
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 2f5faabf4258ec37984f871f46fee73e630c8a33)

rgw: fix bucket index list minor calculation bug

When "bucket index list" traverses the different regions in the bucket
index assembling the output, it miscalculates how many entries to ask
for at one point. This fixes that.

This fixes previous "rgw: bucket index list can produce I/O errors".

Credit for finding this bug goes to Soumya Koduri <skoduri@redhat.com>.

Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
(cherry picked from commit aa7605151f0a5f467d50f13f27c7aef42a40cc39)

mds: add inline feature to MDS bootstrap incompat

File systems that had inline data enabled at some point would have this
bit in the CompatSet "incompat" set. This would conflict during upgrade
with the default v16.2.4 CompatSet assigned to existing (16.2.4-) MDS.
Subsequently, this would cause an assertion in FSMap::sanity during
pending map creation.

This bit will get added anyway during the upgrade process so might as
well add it to the MDS CompatSet during bootstrap.

Fixes: https://tracker.ceph.com/issues/54081
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit d341c5b7734b45568cd986f05d41e91e4cf4a4f7)

mds: throw some feature definitions in static memory

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit e2c461e02a8cc03a99cf6eb7bfb85fa3483efcea)

qa: test inline compat set on older MDSMap

Reproduced here:

/ceph/teuthology-archive/pdonnell-2022-01-31_19:13:02-fs:upgrade-master-distro-default-smithi/6651572/teuthology.log

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 445cdd4120eea26432f8692ebf3db8a0a9f0f9cf)