Ilya Dryomov [Tue, 30 Aug 2022 09:45:44 +0000 (11:45 +0200)]
rbd-mirror: skip setting error code on snapshot replayer shutdown
This is regarding failures in unregister_remote_update_watcher() and
unregister_local_update_watcher(). handle_replay_complete() can't be
called in these cases anymore as it would blindly attempt to unregister
watchers from scratch again. Dropping handle_replay_complete() calls
there means that these failures would only be logged and would not be
surfaced by snapshot replayer. But the only caller ignores them
anyway:
  void ImageReplayer<I>::shut_down(int r) {
    ...
    // close the replayer
    if (m_replayer != nullptr) {
      ctx = new LambdaContext([this, ctx](int r) {
        m_replayer->destroy();
        m_replayer = nullptr;
        ctx->complete(0);  <------
      });
      ctx = new LambdaContext([this, ctx](int r) {
        m_replayer->shut_down(ctx);
      });
    }
Ilya Dryomov [Wed, 24 Aug 2022 10:56:31 +0000 (12:56 +0200)]
rbd-mirror: resume pending shutdown on error in snapshot replayer
If a shutdown is requested, e.g. by update_pool_replayers() because
remote RADOS instance got blocklisted, and Replayer::shut_down() pends
it on completion of current snapshot sync, it gets stuck if replayer
encounters an error in the interim. This is particularly likely in the
blocklist case: a higher layer may detect that client got blocklisted
and request a shutdown first, and then when replayer sees EBLOCKLISTED
in turn, it calls handle_replay_complete() -- which does not resume
a pending shutdown. Because update_pool_replayers() blocks on shutdown
with Mirror::m_lock held, eventually the entire daemon hangs in
perpetuity.
Adam King [Thu, 25 Aug 2022 16:09:49 +0000 (12:09 -0400)]
mgr/orchestrator/tests: don't match exact whitespace in table output
It seems that the exact spacing may differ a bit between
python versions. Currently seeing py3 (which corresponds to python 3.6
on my system) passing these tests and py37 (which is python 3.7,
obviously) failing. I think verifying against the exact whitespace
is unnecessary anyhow. As long as it isn't egregious, we don't
really need to worry about exactly what the spacing is.
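One way to compare table output without depending on column padding is to
collapse every run of whitespace before asserting. This is only a sketch of
the idea; the helper name and the sample table are made up for illustration
and are not taken from the orchestrator tests.

```python
def collapse_whitespace(text):
    """Reduce each run of spaces/tabs to one space and drop blank lines."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical table output rendered with two different paddings.
expected = "HOST   ADDR      LABELS\nnode1  10.0.0.1  mon"
actual = "HOST    ADDR       LABELS  \nnode1   10.0.0.1   mon\n"

assert collapse_whitespace(actual) == collapse_whitespace(expected)
```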
Adam King [Wed, 24 Aug 2022 14:36:53 +0000 (10:36 -0400)]
doc/cephadm: fix example for specifying networks for rgw
count_per_host must be written with underscores rather
than dashes to work, service_id must be passed rather than
service_name, and the option for the port is called
rgw_frontend_port, not just "port".
Zac Dover [Tue, 23 Aug 2022 06:59:04 +0000 (16:59 +1000)]
doc/mgr: edit orchestrator.rst
This PR improves the English language in the "Orchestrator CLI"
section of the MGR documentation. It adds a couple of section
headers in order to signpost the information in the document
a bit more than had already been done, but it makes no major
structural changes to the presentation of the information here.
This PR was motivated by feedback from the 2022 Ceph User Survey
in which one of the respondents wrote "better ceph orch
documentation".
The final section on this page, "Current Implementation Status",
must be verified by someone who is familiar with the current state
of "ceph orch" and a date stamp should be applied to the top of
the section so that the word "current" has a meaningful referent.
Xiubo Li [Wed, 6 Apr 2022 00:12:26 +0000 (08:12 +0800)]
ceph-fuse: add dedicated snap stag map for each directory
This fixes the fino collision bug, which occurs when the snapid is
greater than 0xffff.
From the MDS option 'mds_max_snaps_per_dir' we can see that the
maximum number of snapshots per directory is 4K, while ceph-fuse has
around 64K stags (0xffff - 2) per directory that can be used to make
the fake fuse inode numbers.
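The idea can be illustrated with a small model. This is a sketch under
stated assumptions, not the ceph-fuse code: the 48-bit split between ino and
stag, the two reserved stag values, and all names are guesses made for
illustration. It shows why a per-directory snapid-to-stag map avoids
collisions even when snapids exceed 0xffff.

```python
from collections import defaultdict

INO_BITS = 48            # assumption: the low 48 bits carry the real ino
FIRST_STAG = 2           # assumption: two stag values are reserved
LAST_STAG = 0xFFFF - 1   # leaves 0xffff - 2 usable stags per directory


class DirStagMaps:
    """Hypothetical per-directory snapid -> stag allocator."""

    def __init__(self):
        # One independent map per directory, keyed by the directory's ino.
        self._snap_to_stag = defaultdict(dict)

    def stag_for(self, dir_ino, snapid):
        stags = self._snap_to_stag[dir_ino]
        if snapid not in stags:
            stag = FIRST_STAG + len(stags)
            if stag > LAST_STAG:
                raise OverflowError("no free stag left for this directory")
            stags[snapid] = stag
        return stags[snapid]

    def make_fake_ino(self, dir_ino, ino, snapid):
        # The small stag, not the raw snapid, goes into the high bits.
        return (self.stag_for(dir_ino, snapid) << INO_BITS) | ino
```

With this layout, snapids 0xffff and 0x1ffff in the same directory get
distinct stags, whereas truncating the snapid to 16 bits would map both to
the same value.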
Xiubo Li [Thu, 24 Mar 2022 02:01:57 +0000 (10:01 +0800)]
ceph-fuse: return EINVAL if get invalid fino instead of assert
All the snap ids of the finos returned to libfuse from libcephfs
are recorded in the 'stag_snap_map' map and are never erased
before unmounting. So if libfuse passes an invalid fino,
ceph-fuse should return EINVAL instead of crashing.
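The change in behaviour amounts to turning a failed lookup into an error
return rather than an assertion failure. A minimal sketch, assuming a plain
dictionary stands in for 'stag_snap_map' and that errors are reported as
negative errno values; the function name is hypothetical.

```python
import errno


def snapid_for_stag(stag_snap_map, stag):
    """Return (0, snapid) on success or (-EINVAL, None) for an unknown stag."""
    try:
        return 0, stag_snap_map[stag]
    except KeyError:
        # Previously this situation would have tripped an assert.
        return -errno.EINVAL, None
```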
Xiubo Li [Wed, 23 Mar 2022 02:05:32 +0000 (10:05 +0800)]
mds-client: make the fake inos option unchangeable in runtime
If the flags are empty, can_update_at_runtime() in option.h
returns true. That means this option could be changed at
runtime, which is buggy: if it is false, ceph-fuse uses
its own fake inos instead of libcephfs'. If it is changed
at runtime, we will hit asserts about inos that don't exist.
John Mulligan [Tue, 2 Aug 2022 13:45:59 +0000 (09:45 -0400)]
mgr/volumes: drop pre-python 3.2 version checks
Based on other conversations we believe that there is no need to support
python versions lower than Python 3.6 for pacific and later. This means
it is safe to drop the remaining version checks for python
3.2.
John Mulligan [Mon, 11 Jul 2022 20:44:00 +0000 (16:44 -0400)]
mgr/volumes: a lock to guard against races reading/writing config
Fixes: https://tracker.ceph.com/issues/55583
Use a python threading lock to avoid race conditions where the
config file is being both read and written to at the same time.
Before this change, the content of the config file being parsed could be
'corrupted' by the MetadataManager racing with itself. Along with the
previous two patches, additional logging was added to the mgr code to
produce the simplified version of the mgr log below:
```
[volumes INFO volumes.fs.operations.versions.metadata_manager] READ: b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/c9f773af-5221-49c6-846c-d65c0920ae3f\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
[volumes INFO volumes.fs.operations.versions.metadata_manager] READ: b''
[volumes INFO volumes.fs.operations.versions.metadata_manager] READ: b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/c9f773af-5221-49c6-846c-d65c0920ae3f\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
[volumes INFO volumes.fs.operations.versions.metadata_manager] wrote 203 bytes to config b'/volumes/Park/babydino2/.meta'
[volumes INFO volumes.fs.operations.versions.metadata_manager] READ: b'a0\n\n'
[volumes INFO volumes.fs.operations.versions.metadata_manager] READ: b''
[volumes ERROR volumes.module] Failed _cmd_fs_clone_cancel(clone_name:babydino2, format:json, group_name:Park, prefix:fs clone cancel, vol_name:cephfs) < "":
Traceback (most recent call last):
...
File "/usr/lib64/python3.6/configparser.py", line 1111, in _read
raise e
configparser.ParsingError: Source contains parsing errors: b'/volumes/Park/babydino2/.meta'
[line 13]: 'a0\n'
```
Looking at the above you can see that the log indicates a write to the
config file (of 203 bytes). This happens before the read of the file has
finished, and thus instead of getting an empty string indicating EOF, the
reader gets the last four bytes of the new content of the file. The lock
prevents the MetadataManager from both reading and writing the config
file at the same time.
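The locking pattern can be sketched as follows. This is an illustrative
model only: the class, its in-memory content, and the chunk size are
assumptions and do not mirror the MetadataManager code. The point is that
the whole multi-chunk read holds the same lock as the write.

```python
import threading


class GuardedConfig:
    """Hypothetical stand-in for a config file that is read in chunks."""

    def __init__(self, content=b""):
        self._lock = threading.Lock()
        self._content = content

    def read_all(self, chunk_size=4):
        # Hold the lock across every chunk so a writer cannot replace the
        # content between two reads.
        with self._lock:
            chunks = []
            offset = 0
            while True:
                chunk = self._content[offset:offset + chunk_size]
                if not chunk:  # an empty read signals EOF
                    break
                chunks.append(chunk)
                offset += len(chunk)
            return b"".join(chunks)

    def write_all(self, content):
        with self._lock:
            self._content = content
```

Without the lock, a write landing between two chunked reads could yield a
mix of old and new content, which matches the 'a0\n\n' fragment in the log
above.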
John Mulligan [Tue, 12 Jul 2022 22:33:07 +0000 (18:33 -0400)]
mgr/volumes: write volume metadata with shim class
Add a class that works a bit like a python file object so that we
can simplify the flush function. Providing a file-like object to
the ConfigParser's write function avoids unnecessary copies to
a StringIO object and makes the code easier to read.
With no more uses of StringIO, the StringIO imports are removed.
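A file-like shim only needs a write() method for ConfigParser.write() to
accept it. The sketch below is an assumption about the general shape, not
the class added by this commit: it collects encoded bytes in memory, whereas
the real shim presumably writes to the cephfs file.

```python
import configparser


class ByteSink:
    """Hypothetical file-like shim: accepts str writes, stores bytes."""

    def __init__(self):
        self._chunks = []

    def write(self, text):
        # ConfigParser.write() passes str; encode it once, with no
        # intermediate StringIO copy.
        data = text.encode("utf-8")
        self._chunks.append(data)
        return len(data)

    def getvalue(self):
        return b"".join(self._chunks)


parser = configparser.ConfigParser()
parser["GLOBAL"] = {"version": "2", "type": "clone"}
sink = ByteSink()
parser.write(sink)
```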
John Mulligan [Tue, 12 Jul 2022 22:32:54 +0000 (18:32 -0400)]
mgr/volumes: read volume metadata file using read_string
The read_string method, available since Python 3.2 (we assume Python 3.6 as
our current minimum python version), supports parsing a provided string
for ini-style configuration parameters. Refactoring the reading of the
config file from cephfs into a simple iterator function and then
providing its output to the ConfigParser as a single string allows us to
avoid using StringIO and simplifies the refresh function.
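A sketch of that approach, under the assumption that the file is exposed as
a read(size) callable returning bytes. The helper names are hypothetical,
and io.BytesIO is used here only as a stand-in for the cephfs file handle.

```python
import configparser
import io


def iter_chunks(read, chunk_size=4096):
    """Yield chunks from a read(size) callable until it returns b''."""
    while True:
        chunk = read(chunk_size)
        if not chunk:
            break
        yield chunk


def load_config(read):
    # Join all chunks, decode once, and hand the text to read_string().
    text = b"".join(iter_chunks(read)).decode("utf-8")
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


fake_file = io.BytesIO(b"[GLOBAL]\nversion = 2\ntype = clone\n")
config = load_config(fake_file.read)
```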
Nizamudeen A [Tue, 26 Apr 2022 10:19:09 +0000 (15:49 +0530)]
mgr/dashboard: prometheus rules internal server error
After we increase or decrease the count of the node-exporter, we get a
500 Internal Server Error from the api/prometheus/rules endpoint. On
further debugging, it is caused by the JSON decoder, presumably because
the input to json.loads() is not JSON formatted. To fix the issue I can
either add error handling around the json.loads() call or move the
json.loads() call into the already existing try block. I went for
the second approach here.
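The second approach looks roughly like this. It is a generic sketch: the
function name and the (ok, value) return convention are hypothetical and do
not reflect how the dashboard actually reports the error.

```python
import json


def decode_rules(raw_body):
    """Return (True, parsed) or (False, error message)."""
    try:
        # Decoding lives inside the try block, so malformed input is
        # handled together with the other failures instead of escaping
        # as an unhandled exception.
        return True, json.loads(raw_body)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        return False, "invalid JSON from upstream: {}".format(exc)
```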
qa: filter internal directories in 'subvolumegroup ls' command
Internal directories: '_nogroup', '_index', '_legacy', '_deleting'
1. Internal directories should be filtered in 'subvolumegroup ls' command.
2. Internal directories should not be accepted as a group name.
mgr/volumes: filter internal directories in 'subvolumegroup ls' command
Internal directories: '_nogroup', '_index', '_legacy', '_deleting'
1. Internal directories should be filtered in 'subvolumegroup ls' command.
2. Internal directories should not be accepted as a group name.
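Both rules can be expressed with one shared set of reserved names. The
sketch below uses the four directory names listed above; the function names
and the choice of ValueError are assumptions for illustration, not the
mgr/volumes implementation.

```python
INTERNAL_DIRS = frozenset({"_nogroup", "_index", "_legacy", "_deleting"})


def visible_groups(entries):
    """Rule 1: hide internal directories from the listing."""
    return [name for name in entries if name not in INTERNAL_DIRS]


def check_group_name(name):
    """Rule 2: refuse an internal directory as a group name."""
    if name in INTERNAL_DIRS:
        raise ValueError("'{}' is reserved for internal use".format(name))
    return name
```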
Used the https://www.npmjs.com/package/@grafana/e2e npm package and
followed
https://github.com/grafana/grafana/blob/main/contribute/style-guides/e2e.md
to understand the style of the grafana e2e testing.
This PR introduces tests for Hosts Overall Performance
and also for RGW per Daemon and Overall Performance.
To be able to recreate and test pg log duplicate entries, a new option
is added to the COT: --op pg-log-inject-dups. A --file argument must also
be provided with a JSON array of dups; it can contain as many dups as need
to be injected. The JSON for each dup has the following format:
{"reqid": "client.n.n:n", "version": "n'n", "user_version": n, "return_code": n}
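A file in that format could be produced with a short script like the one
below. This is a sketch: the reading of the "n" placeholders as client id,
incarnation and tid for reqid, and as epoch and version for "version", is an
assumption, and the concrete values are made up.

```python
import json


def make_dup(client_id, incarnation, tid, epoch, version,
             user_version, return_code=0):
    """Build one dup entry in the documented shape (field meanings assumed)."""
    return {
        "reqid": "client.{}.{}:{}".format(client_id, incarnation, tid),
        "version": "{}'{}".format(epoch, version),
        "user_version": user_version,
        "return_code": return_code,
    }


# Three illustrative entries; write `payload` to the file passed via --file.
dups = [make_dup(4177, 0, tid, 10, tid, tid) for tid in range(1, 4)]
payload = json.dumps(dups)
```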
Xiubo Li [Mon, 7 Mar 2022 07:42:42 +0000 (15:42 +0800)]
mds: flush mdlog if locked and still has wanted caps not satisfied
In _do_cap_update(), if a client is releasing the Fw caps, the
relevant client range will be erased and new_max will then be 0.
The MDS will skip flushing the mdlog after submitting a journal log
entry, and so keeps holding the wrlock for the filelock.
So when a new client tries to open the file for reading, file_eval()
may be unable to change the lock state because the wrlock is held on
the filelock. At the same time, if the filelock is in a stable state,
such as EXCL or MIX, the MDS may skip flushing the mdlog in the
open-related code too.
We need to flush the mdlog if, after file_eval(), there are still
wanted caps that could not be satisfied and a lock is still held on
the filelock.
Kefu Chai [Mon, 8 Aug 2022 14:41:17 +0000 (22:41 +0800)]
pybind/mgr/dashboard: do not use distutils.version.StrictVersion
replace `distutils.version.StrictVersion` with
`pkg_resources.parse_version()`
as the former is deprecated, see https://peps.python.org/pep-0632/.
let's use `pkg_resources` instead. this change also addresses
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1010894.
we have this issue when testing with an ubuntu jammy test node.
see https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1967139
Kefu Chai [Fri, 12 Aug 2022 05:06:25 +0000 (13:06 +0800)]
mgr/dashboard: bump up more-itertools
before this change, more-itertools tries to import Sequence from
collections, this leads us to failures like:
```
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File
"/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/__init__.py",
line 9, in <module>
import cherrypy
File
"/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/.tox/openapi-check/lib/python3.10/site-packages/cherrypy/__init__.py",
line 76, in <module>
from . import _cprequest, _cpserver, _cptree, _cplogging, _cpconfig
File
"/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/.tox/openapi-check/lib/python3.10/site-packages/cherrypy/_cprequest.py",
line 11, in <module>
from cherrypy import _cpreqbody
File
"/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/.tox/openapi-check/lib/python3.10/site-packages/cherrypy/_cpreqbody.py",
line 135, in <module>
import cheroot.server
File
"/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/.tox/openapi-check/lib/python3.10/site-packages/cheroot/server.py",
line 96, in <module>
from .workers import threadpool
File
"/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/.tox/openapi-check/lib/python3.10/site-packages/cheroot/workers/threadpool.py",
line 20, in <module>
from jaraco.functools import pass_none
File
"/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/.tox/openapi-check/lib/python3.10/site-packages/jaraco/functools.py",
line 8, in <module>
import more_itertools
File
"/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/.tox/openapi-check/lib/python3.10/site-packages/more_itertools/__init__.py",
line 1, in <module>
from more_itertools.more import * # noqa
File
"/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/.tox/openapi-check/lib/python3.10/site-packages/more_itertools/more.py",
line 3, in <module>
from collections import Counter, defaultdict, deque, Sequence
ImportError: cannot import name 'Sequence' from 'collections'
(/usr/lib/python3.10/collections/__init__.py)
ERROR: InvocationError for command
/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/.tox/openapi-check/bin/python3
-m dashboard.controllers.docs
/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/dashboard/.tox/openapi-check/tmp/openapi.yaml
(exited with code 1)
```
after this change, more-itertools is pinned at the latest stable
release at the time of writing, which includes fixes such as
https://github.com/more-itertools/more-itertools/commit/30a861bc5a4f53a9ba73923c9048a3632a0f9d18
.
please note, more-itertools dropped python3.3 support. but we do not
support this python version either, so we should be safe.
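The underlying incompatibility is that the abstract base classes live in
collections.abc, and the old aliases in collections were removed in Python
3.10. A small sketch showing the supported import; the helper function is
hypothetical and exists only to probe the old import path.

```python
import sys
from collections.abc import Sequence  # the supported import location


def old_import_works():
    """Report whether `from collections import Sequence` still succeeds."""
    try:
        from collections import Sequence as _OldSequence  # noqa: F401
        return True
    except ImportError:
        return False


assert isinstance([1, 2, 3], Sequence)
```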