cephadm: check if file exists when passing `--apply_spec`
When the spec file passed via --apply_spec doesn't exist, cephadm currently deploys the cluster, fails while applying the spec, and then rolls everything back. We can make the CLI fail early instead (see the traceback below and the sketch that follows it).
```
... omitted output ...
Applying ../host-spec.yaml to cluster
FileNotFoundError: [Errno 2] No such file or directory: '../host-spec.yaml'
***************
Cephadm hit an issue during cluster installation. Current cluster files will be deleted automatically.
To disable this behaviour you can pass the --no-cleanup-on-failure flag. In case of any previous
broken installation, users must use the following command to completely delete the broken cluster:
for more information please refer to https://docs.ceph.com/en/latest/cephadm/operations/#purging-a-cluster
***************
Deleting cluster with fsid: 6e6a2dbe-f73a-11ee-8262-98be948800fd
Traceback (most recent call last):
  File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/tmp/tmpive4g9gs.cephadm.build/app/__main__.py", line 5615, in <module>
  File "/tmp/tmpive4g9gs.cephadm.build/app/__main__.py", line 5603, in main
  File "/tmp/tmpive4g9gs.cephadm.build/app/__main__.py", line 2693, in _rollback
  File "/tmp/tmpive4g9gs.cephadm.build/app/__main__.py", line 445, in _default_image
  File "/tmp/tmpive4g9gs.cephadm.build/app/__main__.py", line 2958, in command_bootstrap
FileNotFoundError: [Errno 2] No such file or directory: '../host-spec.yaml'
```
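A minimal sketch of the kind of early check this change implies, assuming the spec path is validated right after argument parsing; the function and argument names here are illustrative, not cephadm's actual code:
```
import os
import sys

def validate_spec_file(path: str) -> None:
    """Exit before bootstrap starts if the spec file passed on the CLI is missing."""
    if not os.path.isfile(path):
        sys.exit(f"ERROR: spec file '{path}' does not exist")

# e.g. called early in bootstrap, before any cluster state is created:
# validate_spec_file(args.apply_spec)
```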
Remove information about the installation of the Zabbix module and link
to a discussion of the reasoning behind Ceph's refusal to support
Zabbix.
John Jasen developed a procedure for installing "Zabbix 2".
This commit removes outdated procedures and explains why those
procedures were removed. Immediately following that explanation, the
text includes John Jasen's procedure for installing "Zabbix 2".
This may happen when a test fails and does not clean up the topics
it created. Other tests that verify the number of topics may then fail
because of that.
All tests that verify the number of topics should delete all topics at
the start of the test, as sketched below.
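A rough sketch of such a cleanup step, assuming the tests use a boto3 SNS-compatible client pointed at the RGW endpoint (the client construction and credentials handling are illustrative):
```
import boto3

def delete_all_topics(endpoint: str, access_key: str, secret_key: str) -> None:
    """Remove any topics left behind by previously failed tests."""
    client = boto3.client('sns', endpoint_url=endpoint,
                          aws_access_key_id=access_key,
                          aws_secret_access_key=secret_key,
                          region_name='default')
    for topic in client.list_topics().get('Topics', []):
        client.delete_topic(TopicArn=topic['TopicArn'])
```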
doc: reorder "releases" entries for reef to fix the diagram
The Gantt diagram currently shows "reef (latest 18.2.0)" instead of
18.2.2. This is because ReleasesGantt expects the releases array to be
sorted in reverse (newest-first) order, as in this snippet:
```
for code_name, info in releases.items():
    last_release = info['releases'][0]
    first_release = info['releases'][-1]
```
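For illustration only (the version entries below are assumed, not copied from releases.yml): with the reef entries sorted newest-first, index 0 yields the latest point release that the diagram labels.
```
releases = {
    'reef': {'releases': [
        {'version': '18.2.2'},   # newest first
        {'version': '18.2.1'},
        {'version': '18.2.0'},   # oldest last
    ]},
}

for code_name, info in releases.items():
    last_release = info['releases'][0]    # -> 18.2.2, shown as "latest"
    first_release = info['releases'][-1]  # -> 18.2.0, start of the bar
```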
Nizamudeen A [Wed, 27 Mar 2024 16:29:55 +0000 (21:59 +0530)]
install-deps: enable copr ceph/grpc
In the dashboard, this is needed to generate the nvmeof APIs on el8,
so that the python3-grpcio packages can be downloaded.
https://copr.fedorainfracloud.org/coprs/ceph/grpc/
Fixes: https://tracker.ceph.com/issues/65184
Signed-off-by: Nizamudeen A <nia@redhat.com>
Adam King [Wed, 3 Apr 2024 17:11:08 +0000 (13:11 -0400)]
mgr/cephadm: pass daemon's current image when reconfiguring
Important to note here is that a reconfig will rewrite
the config files of the daemon and restart it, but will not
rewrite the unit files. This led to a bug where the
grafana images used between the quincy and squid releases
had different UIDs internally, which caused us to rewrite
the config files as owned by a UID that worked with the
new image but did not work with the image still specified
in the unit.run file. This meant the grafana daemon was down
from a bit after the mgr upgrade until the end
of the upgrade, when we redeploy all the monitoring images.
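A hedged sketch of the idea, not the actual mgr/cephadm code (names such as `container_image_name` are assumptions): on reconfig, reuse the image the daemon is already running so the rewritten config files get ownership matching that image's internal UID/GID.
```
def image_for_deploy(daemon, target_image: str, reconfig: bool) -> str:
    """Pick the container image used when (re)generating a daemon's files."""
    if reconfig and getattr(daemon, 'container_image_name', None):
        # keep the currently running image; the unit.run file is not rewritten
        return daemon.container_image_name
    # full redeploys move to the target image and also rewrite unit files
    return target_image
```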
Fixes: https://tracker.ceph.com/issues/65234
Signed-off-by: Adam King <adking@redhat.com>
rgw/notification: Load bucket attrs before calling publish_reserve.
As part of PR# 55657, publish_reserve would reload the bucket to ensure bucket_attrs were loaded. However, for lc events, where the bucket attrs were already loaded, the reloading was causing crashes with no obvious root cause. To avoid the crashes, remove the reloading of the bucket in publish_reserve and put the onus on callers to load the bucket before calling publish_reserve.
Adam King [Thu, 4 Apr 2024 18:11:11 +0000 (14:11 -0400)]
mgr/cephadm: make client-keyring deploying ceph.conf optional
There are cases where users would like to manage their own
ceph.conf but still have cephadm deploy the client keyrings,
so this is being added to facilitate that.
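A minimal sketch of the intent, with assumed names (`include_ceph_conf` and the file paths are illustrative, not necessarily the real spec fields): a per-keyring flag controls whether cephadm also distributes ceph.conf.
```
from dataclasses import dataclass

@dataclass
class ClientKeyringSpec:
    entity: str
    include_ceph_conf: bool = True  # users managing their own ceph.conf set this to False

def client_files(spec: ClientKeyringSpec, keyring: str, ceph_conf: str) -> dict:
    """Map of path -> contents that cephadm would place on matching hosts."""
    files = {f'/etc/ceph/ceph.{spec.entity}.keyring': keyring}
    if spec.include_ceph_conf:
        files['/etc/ceph/ceph.conf'] = ceph_conf
    return files
```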
Fixes: https://tracker.ceph.com/issues/65335
Signed-off-by: Adam King <adking@redhat.com>
Adam King [Wed, 3 Apr 2024 18:34:08 +0000 (14:34 -0400)]
mgr/cephadm: handle setting required osd release with no OSDs during upgrade
A change to the `ceph osd require-osd-release` command made it so that
it fails if no OSDs are up, unless --yes-i-really-mean-it is passed.
For real clusters this is likely not an issue, but it can be an
annoyance when trying upgrades on test clusters that may not have
OSDs deployed. This patch simply passes the flag in cases where we
have no OSDs rather than failing the upgrade.
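A hedged sketch of that behavior (the command and flag names come from the text above; the surrounding helper is illustrative, not the actual upgrade code):
```
def require_osd_release_args(release: str, num_up_osds: int) -> list:
    """Build the CLI arguments for setting the required OSD release during upgrade."""
    args = ['osd', 'require-osd-release', release]
    if num_up_osds == 0:
        # without any up OSDs the command now refuses to run unless forced
        args.append('--yes-i-really-mean-it')
    return args
```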
rgw: don't map to EIO in rgw_http_error_to_errno()
The HTTP client uses EIO specifically to detect connection errors. If we
map normal HTTP errors to EIO, we incorrectly mark their endpoint as
failed and route requests to other endpoints (if any exist).
Default to ERR_INTERNAL_ERROR (500 InternalError) instead.
Rishabh Dave [Thu, 22 Feb 2024 12:31:59 +0000 (18:01 +0530)]
cephfs-shell: don't use pkg_resources since it's deprecated
Currently, cephfs-shell prints warnings, hangs, and aborts when launched.
This occurs because the Python module "pkg_resources" has been deprecated.
We use that module only for checking the versions of other Python modules
used by cephfs-shell. Use "Version" from "packaging.version" instead.
Fixes: https://tracker.ceph.com/issues/64538
Signed-off-by: Rishabh Dave <ridave@redhat.com>
rgw: update options yaml file so LDAP uri isn't an invalid example
RGW attempts an LDAP bind to the URI configuration option when it
starts. The default value is an example used to show the form of the
URI and is not itself valid. The default value is used unless
overridden, and can cause delays in start-up in some situations. The
example is now provided in the description and the default is the
empty string.
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
Log at a lower level (5 instead of 7) while the MDS is stopping, to
help track down issues like the MDS becoming laggy while stopping.
With the recent code added to handle connection errors
(commit#e200499bb3c5703862b92a4d7fb534d98601f1bf), RGWRESTStreamS3PutObj
initialization could fail at times if there were any failed requests to the
cloud endpoint within the CONN_STATUS_EXPIRE_SECS period.
This fix handles such errors and aborts the transition/sync
requests, which can be retried later by the LC/sync worker threads.
osd/scrub: uniform handling of reservation requests
We now allow, on the replica side, reservation requests
regardless of the ReplicaActive sub-state, i.e. we will
honor such requests even when handling a chunk request
(in ReplicaActiveOp).
Note that the current primary code would never send such
a request, but future primary code might.
Leonid Usov [Thu, 28 Mar 2024 05:32:26 +0000 (01:32 -0400)]
mds/quiesce: prevent an overflow of the wait duration
QuiesceTimeInterval::max() may overflow inside a call to
std::condition_variable::wait_for and result in a busy loop,
making the call time out immediately.
The solution is to cap the wait duration to a value that can
certainly fit in whichever clock the standard library is using internally.
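A conceptual sketch of the capping idea in Python (the actual fix is in the C++ quiesce code; the constant below is an arbitrary, safely representable bound):
```
import threading

# Arbitrary cap well within what any clock/timeout representation can hold.
MAX_WAIT_SECONDS = 60 * 60 * 24 * 365

def capped_wait(cond: threading.Condition, requested_seconds: float) -> bool:
    """Wait on the condition, but never pass an 'infinite' duration downstream."""
    with cond:
        return cond.wait(timeout=min(requested_seconds, MAX_WAIT_SECONDS))
```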
Fixes: https://tracker.ceph.com/issues/65276
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>