Zac Dover [Wed, 7 Feb 2024 13:18:35 +0000 (23:18 +1000)]
doc/radosgw: add confval directives
Add confval directives to the documentation of "quota cache" options.
This addresses a request made by Anthony D'Atri in https://github.com/ceph/ceph/pull/55075/files#r1444006246.
Zac Dover [Sun, 4 Feb 2024 15:36:10 +0000 (01:36 +1000)]
doc/rados: update PG guidance
Update the "Creating a Pool" section of doc/rados/operations/pools.rst
so that the documentation no longer insists that the user change the
values of "osd_pool_default_pg_num" and "osd_pool_default_pgp_num".
See also: https://github.com/ceph/ceph/pull/55419
Tracker: https://tracker.ceph.com/issues/64259
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit 5ad241442d2c141ba508faba61f39d70f3f09679)
This commit introduces a major refactor of the main
entrypoint.
- subclass threading.Thread:
  - Introduce a new class `BaseThread()`, a
    `threading.Thread()` abstraction class used
    to monitor the different threads.
  - `BaseSystem()` inherits from `BaseThread()`.
- Handle the `SIGTERM` signal in order to gracefully shut down
  node-proxy (make threads exit gracefully, log out from the RedFish API, etc.)
Additionally, this:
- drops the class `Logger()` from util.py which
was not adding value. It is now replaced with a simple `get_logger()`
function.
- changes the node-proxy API port from 8080 to 9456
(8080 being widely used for frontend apps...)
- changes the container entrypoint in order to use the
`ceph-node-proxy` binary from the packaging
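The thread abstraction and `SIGTERM` handling described above can be sketched roughly as follows. This is a minimal illustration, not the node-proxy code: only the class name `BaseThread` comes from the commit message, while `stop_event`, `loop()`, `shutdown()`, `make_stop_handler()` and `install_sigterm_handler()` are hypothetical names chosen for the example.

```python
import signal
import threading


class BaseThread(threading.Thread):
    """Thread with a stop flag so that a supervisor can ask it to exit."""

    def __init__(self) -> None:
        super().__init__(daemon=True)
        self.stop_event = threading.Event()  # hypothetical attribute name

    def run(self) -> None:
        # Loop until asked to stop; wait() doubles as an interruptible sleep.
        while not self.stop_event.is_set():
            self.loop()
            self.stop_event.wait(0.1)

    def loop(self) -> None:
        """One unit of work; subclasses are expected to override this."""

    def shutdown(self) -> None:
        # Ask the run() loop to exit at its next iteration.
        self.stop_event.set()


def make_stop_handler(threads):
    """Build a signal handler that asks every thread in `threads` to stop."""
    def handler(signum, frame):
        for thread in threads:
            thread.shutdown()
    return handler


def install_sigterm_handler(threads) -> None:
    # signal.signal() must be called from the main thread.
    signal.signal(signal.SIGTERM, make_stop_handler(threads))
```

A real implementation would also close the RedFish session before exiting; that step is omitted here because its API is not described in the message.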
Zac Dover [Fri, 2 Feb 2024 01:53:45 +0000 (11:53 +1000)]
doc/rados: update config for autoscaler
Update doc/rados/configuration/pool-pg-config-ref.rst to account for the
behavior of autoscaler.
Previously, this file was last meaningfully altered in 2013, prior to
the invention of autoscaler. A recent confusion was brought to my
attention on the Ceph Slack whereby a user attempted to alter the
default values of a Quincy cluster, as suggested in this documentation.
That alteration caused Ceph to throw the error "Error ERANGE: 'pgp_num'
must be greater than 0 and lower or equal than 'pg_num', which in this
case is one" and a related "rgw_init_ioctx ERROR" reading in part
"Numerical result out of range". The user removed the
"osd_pool_default_pgp_num" configuration line from ceph.conf and the
cluster worked as expected. I presume that this is because the removal
of this configuration line allowed autoscaler to work as intended.
Fixes: https://tracker.ceph.com/issues/64259
Co-authored-by: David Orman <ormandj@corenode.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit 4dc12092be584da44baca14e31ca33231164235f)
backport mgr/rook: always recreate kvm default network + fix groups refresh
Fixes: https://tracker.ceph.com/issues/64079
This change also includes:
- adding ~/.local/bin to path so behave binary can be found
- adding requirements.txt file for testing dependencies
- increasing timeout used to wait for tools deployment to 90s
- increasing timeout used to wait for kvm network to 20s
mgr/cephadm: add a new config option 'oob_default_addr'
Its default value (169.254.1.1) is the default address of
the 'OS to iDrac pass-through' interface.
Given that node-proxy will reach the RedFish API through this interface,
users no longer need to pass that address when providing the host spec
at bootstrap time.
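The fallback behaviour can be pictured with the sketch below. It is an assumption-laden illustration: the `oob` / `addr` layout of the host spec and the function name `resolve_oob_addr()` are hypothetical; only the default value 169.254.1.1 comes from the message above.

```python
DEFAULT_OOB_ADDR = "169.254.1.1"  # default value quoted in the commit message


def resolve_oob_addr(host_spec: dict, default: str = DEFAULT_OOB_ADDR) -> str:
    """Return the out-of-band address from a host spec, or the default.

    The 'oob' -> 'addr' nesting is a hypothetical layout used for this example.
    """
    oob = host_spec.get("oob") or {}
    return oob.get("addr") or default
```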
Afreen [Mon, 29 Jan 2024 10:12:10 +0000 (15:42 +0530)]
mgr/dashboard: Create subvol of same name in different group
Fixes https://tracker.ceph.com/issues/64112
Issue:
Currently, we are unable to create a subvolume with the same name in a
different subvolume group.
Fix:
We were validating only the filesystem name of the subvolume,
which prevented the creation of a subvolume with the same name.
Added more granularity by also validating the subvolumegroup name.
Laura Flores [Mon, 29 Jan 2024 00:58:25 +0000 (00:58 +0000)]
mgr: pin pytest to version 7.4.4
On 2024-01-27, pytest updated to 8.0.0,
which broke run-tox-mgr.
https://docs.pytest.org/en/stable/changelog.html
==================================== ERRORS ====================================
_____________________ ERROR collecting alerts/__init__.py ______________________
alerts/__init__.py:2: in <module>
from .module import Alerts
alerts/module.py:6: in <module>
from mgr_module import CLIReadCommand, HandleCommandResult, MgrModule, Option
mgr_module.py:1: in <module>
import ceph_module # noqa
E ModuleNotFoundError: No module named 'ceph_module'
______________________ ERROR collecting alerts/module.py _______________________
alerts/module.py:6: in <module>
from mgr_module import CLIReadCommand, HandleCommandResult, MgrModule, Option
mgr_module.py:1: in <module>
import ceph_module # noqa
E ModuleNotFoundError: No module named 'ceph_module'
____________________ ERROR collecting balancer/__init__.py _____________________
balancer/__init__.py:2: in <module>
from .module import Module
balancer/module.py:12: in <module>
from mgr_module import CLIReadCommand, CLICommand, CommandResult, MgrModule, Option, OSDMap, CephReleases
mgr_module.py:1: in <module>
import ceph_module # noqa
E ModuleNotFoundError: No module named 'ceph_module'
_____________________ ERROR collecting balancer/module.py ______________________
balancer/module.py:12: in <module>
from mgr_module import CLIReadCommand, CLICommand, CommandResult, MgrModule, Option, OSDMap, CephReleases
mgr_module.py:1: in <module>
import ceph_module # noqa
E ModuleNotFoundError: No module named 'ceph_module'
ceph-volume: fix partitions support in disk.get_devices()
The following:
```
is_part = get_file_contents(os.path.join(_sys_dev_block_path, item, 'partition')) == "1"
```
assumes any `/sys/dev/block/x:y/partition` contains '1' which is wrong.
This file actually contains the corresponding partition number.
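One way to express the corrected check is sketched below. This is not necessarily the change made in ceph-volume: the helper name `is_partition()` is hypothetical, and treating a missing `partition` file as "not a partition" is an assumption of this example.

```python
import os


def is_partition(sys_dev_block_path: str, item: str) -> bool:
    """Return True if /sys/dev/block/<item>/partition holds a partition number.

    The file contains the partition number (1, 2, 3, ...), so any positive
    integer counts, not only "1".
    """
    path = os.path.join(sys_dev_block_path, item, "partition")
    try:
        with open(path) as f:
            content = f.read().strip()
    except OSError:
        # Assumption: whole disks have no 'partition' file.
        return False
    return content.isdigit() and int(content) > 0
```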
ceph-volume: use 'no workqueue' options with dmcrypt
CloudFlare engineers did some testing and found that using
workqueues with encryption on flash devices hurts performance.
See [1] for details.
With this patch, ceph-volume calls cryptsetup with the
`--perf-no_read_workqueue` and `--perf-no_write_workqueue` options
when the device is not rotational.
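The option selection can be sketched as follows. The two `--perf-*` flags are the ones named above; the function names and the exact `cryptsetup open` invocation are illustrative assumptions and do not reproduce the command line that ceph-volume builds.

```python
from typing import List


def dmcrypt_perf_flags(rotational: bool) -> List[str]:
    """Return the extra cryptsetup options for a device.

    Workqueues are bypassed only for non-rotational (flash) devices.
    """
    if rotational:
        return []
    return ["--perf-no_read_workqueue", "--perf-no_write_workqueue"]


def build_open_command(device: str, mapping: str, rotational: bool) -> List[str]:
    # Hypothetical helper: a bare 'cryptsetup open' call plus the perf flags.
    return ["cryptsetup", "open", device, mapping] + dmcrypt_perf_flags(rotational)
```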
The following messages get logged very often, although this
information is not very useful in a normal situation:
```
2024-01-12 09:09:40,604 - reporter - INFO - data ready to be sent to the mgr.
2024-01-12 09:09:40,604 - reporter - INFO - no diff, not sending data to the mgr.
2024-01-12 09:10:15,022 - reporter - INFO - data ready to be sent to the mgr.
2024-01-12 09:10:15,022 - reporter - INFO - no diff, not sending data to the mgr.
...
```
This `sleep(5)` should be initiated *after* the lock is released.
Otherwise, it can cause troubles with the reporter loop which can
never acquire the lock.
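The ordering issue can be illustrated with a small sketch. The function name `poll_once()` and its parameters are hypothetical; the point is only that the sleep happens after the `with` block has released the lock, so another thread (such as the reporter loop) gets a chance to acquire it.

```python
import threading
import time


def poll_once(lock: threading.Lock, update, interval: float = 5.0,
              sleep=time.sleep) -> None:
    """Run `update()` under the lock, then sleep with the lock released."""
    with lock:
        update()
    # The lock is released here; sleeping inside the 'with' block would keep
    # other threads from acquiring it for the whole interval.
    sleep(interval)
```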
The current implementation requires the inclusion of all the recent
modifications in the cephadm binary, which won't be backported.
Since we need the node-proxy code backported to reef, let's move the
code and make it a separate daemon.
Signed-off-by: Guillaume Abrioux <gabrioux@ibm.com>
Co-authored-by: Adam King <adking@redhat.com>
(cherry picked from commit 7e6bc179ae7e0d633bd63086775002182c861d3f)
This renames the mgr's NodeProxyCache attribute from
`self.node_proxy` to `self.node_proxy_cache` and the
class `NodeProxy` in agent.py to `NodeProxyEndpoint`
to make the naming clearer and avoid confusion.
node-proxy: enhance debug log messages for locking operations
This commit updates the debug log messages in the BaseRedfishSystem
and Reporter classes. The adjustments made enhance the clarity and
precision of the messages by specifically identifying acquired
and released locks, detailing their context, thereby improving the
understanding of the control flow during locking operations
in these components.
node-proxy requires this dependency, so it needs to be added as a
dependency for tox testing.
Typical failure:
```
ImportError while importing test module '/root/ceph/src/cephadm/tests/test_agent.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib64/python3.9/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
tests/test_agent.py:10: in <module>
_cephadm = import_cephadm()
tests/fixtures.py:14: in import_cephadm
import cephadm as _cephadm
cephadm.py:32: in <module>
from cephadmlib.node_proxy.main import NodeProxy
cephadmlib/node_proxy/main.py:2: in <module>
from .redfishdellsystem import RedfishDellSystem
cephadmlib/node_proxy/redfishdellsystem.py:2: in <module>
from .baseredfishsystem import BaseRedfishSystem
cephadmlib/node_proxy/baseredfishsystem.py:2: in <module>
from .basesystem import BaseSystem
cephadmlib/node_proxy/basesystem.py:2: in <module>
from .util import Config
cephadmlib/node_proxy/util.py:2: in <module>
import yaml
E ModuleNotFoundError: No module named 'yaml'
```
node-proxy: send oob management requests to the MgrListener()
Note that this won't be true out-of-band management.
If the host hangs, this won't work. The oob
management should be reached directly, but most of the time
the oob network is isolated. The idea is to send queries to
the tcp server exposed by the cephadm agent (MgrListener) so that it
can itself send queries to the redfish API using the IP address
exposed on the OS.
node-proxy: move the output formatting logic to orchestrator
Implementing this in the cephadm module doesn't follow the general idea
of the orchestrator interface. This is where the output formatting should
be done so let's move the logic to the orchestrator module.
node-proxy: code change for hdd blinkenlight pre-requisites
This is mainly for anticipating the case where hdd blinkenlight via RedFish
works (testing has to be done). This introduces the required changes so the
endpoint `/led` can support blinkenlight for both chassis and disks.
Ramana Raja [Tue, 23 Jan 2024 21:07:04 +0000 (16:07 -0500)]
rbd-nbd: log errors during netlink_resize() using derr
When using rbd CLI to map the images to NBD devices via netlink,
any errors that arose during image resizing in netlink_resize()
were not logged. Switching the error logging from using cerr to
derr helps log the errors from netlink_resize().
Ramana Raja [Mon, 22 Jan 2024 22:06:58 +0000 (17:06 -0500)]
rbd_nbd: fix resize of images mapped using netlink
Include device identifier or cookie in the message sent to the kernel
to resize images mapped to NBD devices using netlink. Otherwise,
netlink_resize() fails and the size of the device isn't updated.
cephadm: gracefully shutdown the agent prior to removing
When the agent is removed, the daemon is abruptly stopped.
Since the node-proxy logic runs from within the cephadm agent,
it leaves an active RedFish session. The idea is to gracefully
shut down the agent so node-proxy can catch that event and make sure
it closes the current active RedFish session prior to shutting down.
node-proxy: update the data structure for summary report
This extends the current data structure for the 'summary' report.
It adds `sn` (serial number information) and the `firmwares` dict
to the current data structure.
This was intended to address the case where the Ceph
manager can't talk directly to the oob management tool because
of network restrictions (subnets not interconnected, etc.).
If for any reason the host is stuck or unreachable, that local API won't
be helpful anyway; as a result, any action the Ceph mgr is asked
to perform on the node would fail.