Xiubo Li [Wed, 26 Jul 2023 06:34:01 +0000 (14:34 +0800)]
mds: defer trim() until after the last cache_rejoin ack being received
If the entire subtree, together with the inode that the subtree root
belongs to, is trimmed just before the last cache_rejoin ack is
received, the isolated_inodes list cannot be correctly erased. Defer
calling trim() until the last cache_rejoin ack has been received.
Zac Dover [Sat, 15 Jun 2024 11:55:18 +0000 (21:55 +1000)]
doc/rados: explain replaceable parts of command
Add an explanation that directs the reader to replace the "X" part of
the command "ceph tell mon.X mon_status" with the value specific to the
reader's Ceph cluster (which is (probably) not "X").
In the future, such replaceable strings in commands may be bounded by
angle brackets ("<" and ">").
This improvement to the documentation was suggested on the [ceph-users]
email list by Joel Davidow. This email, an absolute model of user
engagement with an upstream project, can be reviewed here:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/KF67F5TXFSSTPXV7EKL6JKLA5KZQDLDQ/
Nizamudeen A [Fri, 3 May 2024 08:56:19 +0000 (14:26 +0530)]
mgr/k8sevents: update V1Events to CoreV1Events
centos9 only provides kubernetes 26.1.0 as base dep and hence the
k8sevents code needs to be updated accordingly. The API changes happened
in kubernetes when 19.0.0 was released.
Fixes: https://tracker.ceph.com/issues/65627 Fixes: https://tracker.ceph.com/issues/64981 Signed-off-by: Nizamudeen A <nia@redhat.com>
(cherry picked from commit 6af964719217d720e6c2fd1ba2a607f6255d2604)
Rishabh Dave [Tue, 7 May 2024 14:50:55 +0000 (20:20 +0530)]
qa/cephfs: set joinable on FS before exiting tests in TestFSFail
After running TestFSFail, CephFSTestCase.tearDown() fails attempting
to unmount CephFS. Set joinable on FS and wait for the MDS to be up
before exiting the test. This will ensure that unmounting is
successful in teardown.
Fixes: https://tracker.ceph.com/issues/65841 Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit faa30e03f31551a71ebb8330dbbe7005d9ddd559)
Rishabh Dave [Wed, 8 May 2024 13:59:11 +0000 (19:29 +0530)]
qa/cephfs: pass MDS name, not FS name, to "ceph mds fail" cmd
This issue was not caught in the original QA run because "ceph mds fail"
returns 0 even when the MDS name passed as its argument does not exist.
This is done for the sake of idempotency, but it allowed this bug to go
uncaught.
Fixes: https://tracker.ceph.com/issues/65864 Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit ab643f7a501797634a366fd29bf4acef6a8f0cf2)
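The effect described above can be illustrated with a small sketch. It is a toy model only, not the QA code: `fake_mds_fail` and `fail_mds_checked` are hypothetical names, and the set of MDS names stands in for the real cluster state. It shows why a zero return code alone cannot tell a test that it passed an FS name where an MDS name was expected.

```python
def fake_mds_fail(name, known_mds):
    """Stand-in for an idempotent command: returns 0 even for unknown names."""
    known_mds.discard(name)  # no-op when the name is not a known MDS
    return 0


def fail_mds_checked(name, known_mds):
    """Hypothetical test helper: reject names that are not known MDS daemons."""
    if name not in known_mds:
        raise ValueError(f"{name!r} is not an MDS name")
    return fake_mds_fail(name, known_mds)


mds_names = {"a", "b"}
# Passing the FS name by mistake still "succeeds" with the idempotent command.
print(fake_mds_fail("cephfs", mds_names))
```

With the explicit membership check, the same mistake raises an error instead of passing silently.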
Rishabh Dave [Mon, 25 Mar 2024 12:05:38 +0000 (17:35 +0530)]
qa/cephfs: add tests failing MDS and FS when MDS is unhealthy
Add tests to verify that the confirmation flag is mandatory for running
commands "ceph mds fail" and "ceph fs fail" when MDS has one of the two
health warnings: MDS_CACHE_OVERSIZED or MDS_TRIM.
Also, add MDS_CACHE_OVERSIZED and MDS_TRIM to the ignorelist for
test_admin.py so that QA jobs know this is an expected failure.
Rishabh Dave [Mon, 25 Mar 2024 12:01:01 +0000 (17:31 +0530)]
qa/cephfs: pass confirmation flag to fs fail in tear down code
Since the "ceph fs fail" command now requires the confirmation flag when
the Ceph cluster has either health warning MDS_TRIM or
MDS_CACHE_OVERSIZED, update the teardown in QA code. During teardown,
the CephFS should be failed regardless of whether the Ceph cluster has
health warnings.
Rishabh Dave [Wed, 13 Mar 2024 09:31:02 +0000 (15:01 +0530)]
cephfs,mon: require confirmation to fail unhealthy FS
The confirmation flag must be passed when running the command "ceph fs
fail" if the MDS for this FS has either of the two health warnings:
MDS_TRIM or MDS_CACHE_OVERSIZED. Otherwise, the command will fail and
print an appropriate error message.
Restarting an MDS with these health warnings is not recommended since it
will have a slow recovery during restart which will create new problems.
Fixes: https://tracker.ceph.com/issues/61866 Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit b901616494a8359e59f7ec2cd661077c4aced01c)
Conflicts:
- src/mon/FSCommands.cc
- lines surrounding the patch are different in reef compared to main.
the reef code was still accessing "mds_map" directly instead of
accessing it using "get_mds_map()".
- return value of get_filesystem() is different in main.
Add an explanation of leader-peon conditions that obtain when the
cluster is in the "HEALTH_OK" state. Previously, the text discussed
these two monitor states only in the context of a health detail entry.
This improvement to the documentation was suggested on the [ceph-users]
email list by Joel Davidow. This email, an absolute model of user
engagement with an upstream project, can be reviewed here: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/KF67F5TXFSSTPXV7EKL6JKLA5KZQDLDQ/
I will list Joel Davidow here as the co-author for the sake of more
expediently getting this change into the documentation, but though he is
listed as the co-author, he is the true author.
Co-authored-by: Joel Davidow <jdavidow@nso.edu> Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit 6fb9a5ef817eda5184d51ebcb425a6091ca82299)
Zac Dover [Wed, 5 Jun 2024 16:43:15 +0000 (02:43 +1000)]
doc/start: s/intro.rst/index.rst/
Change the filename "doc/start/intro.rst" to "doc/start/index.rst" so
that Sphinx finds the root filename for the "/start" directory in the
default location.
Zac Dover [Tue, 4 Jun 2024 13:37:27 +0000 (23:37 +1000)]
doc/start: s/http/https/ in links
Replace "http" with "https" in doc/start/get-involved.rst.
This commit is, in a way, a repeat of
https://github.com/ceph/ceph/pull/57213/
(1c5383b91bd7dbfa9670c6485fcc5ff28b79f40d), which targeted the Reef
branch instead of the main branch. When this commit has been merged and
backported, I will close https://github.com/ceph/ceph/pull/57213/.
I am listing Casey Cain here as the co-author, but he is in fact the
true author of this change.
Since the command "ceph mds fail" may now require the confirmation flag
("--yes-i-really-mean-it"), update this method to allow/disallow adding
this flag to the command arguments.
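A minimal sketch of such a method, assuming a helper that only builds the argument list. The name `mds_fail_args` and its `confirm` parameter are hypothetical and are not taken from the QA code; only the flag string comes from the message above.

```python
def mds_fail_args(mds_name, confirm=False):
    """Hypothetical helper: build the argument list for "ceph mds fail"."""
    args = ["mds", "fail", mds_name]
    if confirm:
        # Flag named in the commit message above.
        args.append("--yes-i-really-mean-it")
    return args


print(mds_fail_args("a"))
print(mds_fail_args("a", confirm=True))
```

Keeping the flag opt-in lets tests exercise both the rejected and the confirmed path with the same helper.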
Rishabh Dave [Fri, 19 Apr 2024 11:28:30 +0000 (16:58 +0530)]
doc/cephfs: mention need of confirmation for "ceph mds fail"
Update docs since the command "ceph mds fail" will now fail if the MDS
has either health warning MDS_TRIM or MDS_CACHE_OVERSIZED and the
confirmation flag is not passed.
Rishabh Dave [Fri, 8 Mar 2024 15:39:18 +0000 (21:09 +0530)]
cephfs,mon: require confirmation to fail unhealthy MDS
When running the command "ceph mds fail" for an MDS that is unhealthy
due to MDS_CACHE_OVERSIZED or MDS_TRIM, the user must pass the
confirmation flag. Otherwise, the command will fail and print an
appropriate error message.
Restarting an MDS with such health warnings is not recommended since it
will have a slow recovery during restart which will create new problems.
Fixes: https://tracker.ceph.com/issues/61866 Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit eeda00eea5043d3ba806695a207b732cb53b35c4)
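The rule described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the monitor code: `check_mds_fail`, its return shape, and the error text are hypothetical; only the two health warning names and the need for confirmation come from the commit message.

```python
# Health warnings named in the commit message above.
RISKY_WARNINGS = frozenset({"MDS_TRIM", "MDS_CACHE_OVERSIZED"})


def check_mds_fail(health_warnings, confirmed):
    """Hypothetical check: return (allowed, message) for failing an MDS."""
    risky = sorted(RISKY_WARNINGS & set(health_warnings))
    if risky and not confirmed:
        # Illustrative wording only; the real error message may differ.
        return False, "MDS has health warning(s): " + ", ".join(risky)
    return True, ""


print(check_mds_fail(["MDS_TRIM"], confirmed=False))
print(check_mds_fail(["MDS_TRIM"], confirmed=True))
```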
Rishabh Dave [Fri, 8 Mar 2024 15:31:51 +0000 (21:01 +0530)]
mds: add no counters in warning for standby-replay MDS
Don't include inode and stray counters in the health warnings printed
for standby-replay MDSs. These counters are present in the health
warnings only due to replay and can confuse users.
Fixes: https://tracker.ceph.com/issues/63514 Signed-off-by: Rishabh Dave <ridave@redhat.com>
(cherry picked from commit 03dcdc1329e471aa4aa403519ea5131db2f99b23)
Ilya Dryomov [Mon, 6 May 2024 06:16:01 +0000 (08:16 +0200)]
qa/workunits/rbd: wait for replaying status in bootstrap tests
wait_for_replay_complete() doesn't wait for image status to get
updated. This didn't matter previously because these tests are run on
two different pools and nothing else was following.
mon: validate SERVER_REEF on set-require-min-compat-client
Unit testing
-------------
```
[rzarzynski@o06 build]$ bin/unittest_features
...
[ RUN ] features.release_features
1 argonaut features 0x40000 looks like argonaut
2 bobtail features 0x40000 looks like argonaut
3 cuttlefish features 0x40000 looks like argonaut
4 dumpling features 0x42040000 looks like dumpling
5 emperor features 0x42040000 looks like dumpling
6 firefly features 0x20842040000 looks like firefly
7 giant features 0x20842040000 looks like firefly
8 hammer features 0x1020842040000 looks like hammer
9 infernalis features 0x1020842040000 looks like hammer
10 jewel features 0x401020842040000 looks like jewel
11 kraken features 0xc01020842040000 looks like kraken
12 luminous features 0xe01020842240000 looks like luminous
13 mimic features 0xe01020842240000 looks like luminous
14 nautilus features 0xe01020842240000 looks like luminous
15 octopus features 0xe01020842240000 looks like luminous
16 pacific features 0xe01020842240000 looks like luminous
17 quincy features 0xe01020842240000 looks like luminous
18 reef features 0xe010208d2240000 looks like reef
19 squid features 0xe010208d2240000 looks like reef
[ OK ] features.release_features (0 ms)
```
Manual testing
--------------
### `quincy` client connected to `main` cluster
There was `ceph -w` from `quincy` running in the background.
```
[rzarzynski@o06 build]$ bin/ceph osd set-require-min-compat-client reef
Error EPERM: cannot set require_min_compat_client to reef: 1 connected client(s) look like luminous (missing 0x80000000); add --yes-i-really-mean-it to do it anyway
```
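The "missing 0x80000000" part of the error can be reproduced with plain bit arithmetic. The sketch below is not the monitor's implementation: `missing_features` is a hypothetical name, the required bit is taken from the error message, and the two masks are copied from the unit test output above.

```python
# Bit reported as missing in the error message above.
REQUIRED_BIT = 0x80000000

# Feature masks copied from the unittest_features output above.
LUMINOUS_MASK = 0xE01020842240000
REEF_MASK = 0xE010208D2240000


def missing_features(required, client_features):
    """Return the required feature bits that the client does not advertise."""
    return required & ~client_features


print(hex(missing_features(REQUIRED_BIT, LUMINOUS_MASK)))
print(hex(missing_features(REQUIRED_BIT, REEF_MASK)))
```

A non-zero result means at least one connected client lacks a required bit, which is the condition under which the command above is refused without the override flag.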
mon, osd, *: expose upmap-primary in OSDMap::get_features()
This is a minimal fix to ensure only peers understanding
`pg-upmap-primary` are able to connect, and thus to exclude
the possibility of running into the `pg_upmap_primaries.empty()`
assertion in encoders.
Fixes for other problems will follow up.
The intention is to ship this patch in the very next minor
release of reef.
Manual testing
--------------
### start using upmap-primary in presence of `quincy` client
NOTE: incompatible clients aren't disconnected but this is
known and expected as we lack the machinery.
### `main` client is still able to connect
```
[rzarzynski@o06 build]$ bin/ceph -w
  cluster:
    id:     d570a7cd-84ca-4fd0-aafb-80138762c6af
    health: HEALTH_WARN
            11 mgr modules have failed dependencies
            1 pool(s) do not have an application enabled
  services:
    mon: 1 daemons, quorum a (age 64m)
    mgr: x(active, since 64m)
    osd: 3 osds: 3 up (since 64m), 3 in (since 64m)
```
### `quincy` client may connect again
```
[rzarzynski@o06 build-quincy]$ bin/ceph -s -c /home/rzarzynski/ceph2/build/ceph.conf
  cluster:
    id:     d570a7cd-84ca-4fd0-aafb-80138762c6af
    health: HEALTH_WARN
            11 mgr modules have failed dependencies
            1 pool(s) do not have an application enabled
  services:
    mon: 1 daemons, quorum a (age 77m)
    mgr: x(active, since 77m)
    osd: 3 osds: 3 up (since 76m), 3 in (since 76m)
```
Jos Collin [Mon, 6 May 2024 12:47:29 +0000 (18:17 +0530)]
pybind/mgr/mirroring: Fix KeyError: 'directory_count' in daemon status
The 'directory_count' key is intermittently missing from the JSON returned by
self.mgr.get_daemon_status(). This happens when m_listener.handle_mirroring_enabled()
is delayed in updating directory_count, so that ServiceDaemon::update_status()
creates JSON without the 'directory_count' key/value. However, mgr/mirroring ->
daemon_status() always expects the 'directory_count' key to be present in the JSON
returned by self.mgr.get_daemon_status().
The issue occurs intermittently when mirroring is enabled/disabled and the 'daemon status' is checked in between.
This patch fixes the issue by setting a default value of 0 for 'directory_count' in daemon_status().
Fixes: https://tracker.ceph.com/issues/65795 Signed-off-by: Jos Collin <jcollin@redhat.com>
(cherry picked from commit b78baa23e562742b8bdc5a75f82e3b6fbf55a8a5)
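The default-value approach described in this commit can be sketched as follows. This is a minimal illustration, not the actual patch: the function name `summarize_daemon_status` and the shape of the status dictionary are assumptions made for the example.

```python
def summarize_daemon_status(status):
    """Hypothetical helper: read 'directory_count' with a default of 0."""
    return {
        # dict.get() avoids a KeyError when the key has not been set yet.
        "directory_count": int(status.get("directory_count", 0)),
    }


print(summarize_daemon_status({}))
print(summarize_daemon_status({"directory_count": "3"}))
```

Indexing the dictionary directly (`status["directory_count"]`) would raise the KeyError named in the commit title whenever the key is absent.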