mgr/cephadm: add a new config option 'oob_default_addr'
This adds a default value (169.254.1.1), which is the default
address of the 'OS to iDRAC pass-through' interface.
Given that node-proxy will reach the RedFish API through this interface,
users can avoid passing that address when providing the host spec
at bootstrap time.
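For illustration, here is a hedged sketch of how such an option is typically
declared in a Ceph mgr module; the exact declaration in the actual change may
differ:
```
from mgr_module import Option

MODULE_OPTIONS = [
    Option(
        'oob_default_addr',
        type='str',
        default='169.254.1.1',
        desc='default address for the OS to iDRAC pass-through interface',
    ),
]
```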
The following messages get logged quite a lot, although they are not
very useful information in a normal situation:
```
2024-01-12 09:09:40,604 - reporter - INFO - data ready to be sent to the mgr.
2024-01-12 09:09:40,604 - reporter - INFO - no diff, not sending data to the mgr.
2024-01-12 09:10:15,022 - reporter - INFO - data ready to be sent to the mgr.
2024-01-12 09:10:15,022 - reporter - INFO - no diff, not sending data to the mgr.
...
```
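A minimal sketch of the kind of change involved, assuming the messages come
from calls like `self.log.info(...)` in the reporter loop:
```
# before: logged at INFO level on every iteration
# self.log.info('data ready to be sent to the mgr.')

# after: demoted to DEBUG so it only shows up when debugging is enabled
self.log.debug('data ready to be sent to the mgr.')
```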
This `sleep(5)` should be initiated *after* the lock is released.
Otherwise, it can cause trouble with the reporter loop, which may
never get a chance to acquire the lock.
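A minimal sketch of the pattern, using `threading.Lock`; all names are
illustrative:
```
import threading
import time

lock = threading.Lock()

def collect_loop(system):
    while True:
        with lock:
            # hold the lock only while updating the shared data
            system.data = system.refresh()
        # sleep *after* the lock is released so the reporter loop
        # gets a chance to acquire it
        time.sleep(5)
```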
This renames the mgr's NodeProxyCache attribute from
`self.node_proxy` to `self.node_proxy_cache` and renames the
class `NodeProxy` in agent.py to `NodeProxyEndpoint`
to make things clearer and avoid confusion.
node-proxy: enhance debug log messages for locking operations
This commit updates the debug log messages in the BaseRedfishSystem
and Reporter classes. The updated messages identify which locks are
acquired and released and in what context, which makes the control
flow during locking operations in these components easier to follow.
node-proxy requires this dependency, so it needs to be added as a
dependency for tox testing.
Typical failure:
```
ImportError while importing test module '/root/ceph/src/cephadm/tests/test_agent.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib64/python3.9/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
tests/test_agent.py:10: in <module>
_cephadm = import_cephadm()
tests/fixtures.py:14: in import_cephadm
import cephadm as _cephadm
cephadm.py:32: in <module>
from cephadmlib.node_proxy.main import NodeProxy
cephadmlib/node_proxy/main.py:2: in <module>
from .redfishdellsystem import RedfishDellSystem
cephadmlib/node_proxy/redfishdellsystem.py:2: in <module>
from .baseredfishsystem import BaseRedfishSystem
cephadmlib/node_proxy/baseredfishsystem.py:2: in <module>
from .basesystem import BaseSystem
cephadmlib/node_proxy/basesystem.py:2: in <module>
from .util import Config
cephadmlib/node_proxy/util.py:2: in <module>
import yaml
E ModuleNotFoundError: No module named 'yaml'
```
node-proxy: send oob management requests to the MgrListener()
Note that this won't be true out-of-band management.
In the case where the host hangs, this won't work. The oob
management interface should ideally be reached directly, but most of
the time the oob network is isolated. The idea is to send queries to
the tcp server exposed by the cephadm agent (MgrListener) so it
can itself send queries to the redfish API using the IP address
exposed on the OS.
node-proxy: move the output formatting logic to orchestrator
Implementing this in the cephadm module doesn't follow the general idea
of the orchestrator interface. The orchestrator module is where the output
formatting should be done, so let's move the logic there.
node-proxy: code change for hdd blinkenlight pre-requisites
This is mainly in anticipation of the case where hdd blinkenlight via RedFish
works (testing still has to be done). It introduces the required changes so the
endpoint `/led` can support blinkenlight for both chassis and disks.
cephadm: gracefully shut down the agent prior to removal
When the agent is removed, the daemon is abruptly stopped.
Since the node-proxy logic runs from within the cephadm agent,
it leaves an active RedFish session behind. The idea is to gracefully
shut down the agent so node-proxy can catch that event and make sure
it closes the currently active RedFish session prior to shutting down.
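A hedged sketch of the intent, where the SIGTERM handler and the `logout()`
call are illustrative stand-ins for whatever the real agent and client use:
```
import signal
import sys

def install_graceful_shutdown(redfish_client) -> None:
    # `redfish_client.logout()` is a hypothetical call standing in for
    # however the real client ends its RedFish session.
    def _handle_sigterm(signum, frame):
        redfish_client.logout()
        sys.exit(0)

    signal.signal(signal.SIGTERM, _handle_sigterm)
```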
node-proxy: update the data structure for summary report
This extends the current data structure for the 'summary' report.
It adds `sn` (serial number information) and the `firmwares` dict
to the current data structure.
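For illustration, the resulting report might be shaped roughly like this;
only `sn` and `firmwares` come from this change, every value shown is made up:
```
summary = {
    'host': 'node-01',                # hypothetical
    'sn': 'ABC1234',                  # serial number, added by this change
    'status': {'health': 'OK'},       # hypothetical
    'firmwares': {                    # firmware inventory, added by this change
        'BIOS': {'version': '2.10.2'},
        'iDRAC': {'version': '6.10.00.00'},
    },
}
```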
This was intended to address the case where the Ceph
manager can't talk directly to the oob management tool because
of network restrictions (subnets not interconnected, etc.).
If for any reason the host is stuck or unreachable, that local API won't
be helpful anyway; as a result, any actions the Ceph mgr would be asked
to perform on the node would fail.
node-proxy: fetch idrac details from NodeProxyCache()
The class `NodeProxyCache()` is intended for that; it already
has this information, so there's no need to call `get_store()`
each time we want to access idrac details.
This is the first 'act on node' feature implementation.
This adds the endpoint `/led`:
a GET request to this endpoint returns the current status
of the enclosure LED, and a PATCH request allows setting the
enclosure LED status.
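A hedged usage sketch with `requests`; the address, port, payload keys and the
use of a self-signed certificate are assumptions, not the actual API contract:
```
import requests

base = 'https://node-01:8080/led'   # hypothetical address/port

# read the current enclosure LED status
status = requests.get(base, verify=False).json()

# turn the enclosure LED on (payload keys are illustrative)
requests.patch(base, json={'state': 'on'}, verify=False)
```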
- subclass `cherrypy._cpserver.Server`
- drop the `cherrypy.quickstart()` call
- drop the nested classes approach
- make it run over https
- print tracebacks when an exception is raised (see the sketch below)
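A minimal sketch of the resulting structure, assuming a self-signed
certificate; class names, paths and ports are illustrative:
```
import cherrypy

class API(cherrypy._cpserver.Server):
    def __init__(self, addr: str = '0.0.0.0', port: int = 8080) -> None:
        super().__init__()
        self.socket_host = addr
        self.socket_port = port
        # run over https
        self.ssl_module = 'builtin'
        self.ssl_certificate = '/path/to/cert.pem'
        self.ssl_private_key = '/path/to/key.pem'
        self.subscribe()

class Root:
    @cherrypy.expose
    @cherrypy.tools.json_out()
    def index(self):
        return {'ok': True}

if __name__ == '__main__':
    # print tracebacks when an exception is raised
    cherrypy.config.update({'request.show_tracebacks': True})
    # no cherrypy.quickstart(): mount the app and drive the engine directly
    cherrypy.server.unsubscribe()   # don't start the default server
    API()
    cherrypy.tree.mount(Root(), '/')
    cherrypy.engine.start()
    cherrypy.engine.block()
```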
This makes all the required changes in order to support
collecting, pushing and exposing data regarding firmware
status and versions for all the underlying hardware.
This also refactors the corresponding redfish dell logic:
having so many nested/inherited classes seems unnecessary.
- fix duplicate logging caused by a new handler being
added each time `Logger` is called (as sketched below)
- do not propagate to the parent (root) logger, as that makes it
log the messages too
- implement a new method `is_logged()` in `RedFishClient`
- refactor the logic around caught errors in `RedFishClient`
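A minimal sketch of the handler fix; the logger name and format are
illustrative:
```
import logging

def get_logger(name: str = 'node-proxy') -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    # don't let the root logger duplicate our messages
    logger.propagate = False
    # only attach a handler the first time this is called
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
        logger.addHandler(handler)
    return logger
```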
Given that the 'node-proxy' terminology is internal, let's change
the few node-proxy related alert names to something more
user-friendly, as they are intended to be seen by users.
These changes are required in order to be able to re-use
the existing agent endpoint. The current code doesn't make it easy
to add a new application. The idea here is to add a new class for
handling the '/node-proxy' endpoint.
Given that this library isn't packaged for both
upstream and downstream, and we can achieve what it was used for
directly with a lib such as `urllib` (basically just auth), let's
drop this dependency.
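A hedged sketch of doing basic auth with only the standard library; the URL
and credentials are placeholders:
```
import base64
import json
from urllib.request import Request, urlopen

def get_with_basic_auth(url: str, username: str, password: str) -> dict:
    token = base64.b64encode(f'{username}:{password}'.encode()).decode()
    req = Request(url, headers={'Authorization': f'Basic {token}'})
    with urlopen(req) as resp:
        return json.loads(resp.read().decode())
```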
The node-proxy entrypoint (`server.main()`) now takes two parameters
(addr / port) so the reporter agent knows how to reach
the http agent endpoint hosted in the mgr daemon.
This adds 2 endpoints to the existing http agent endpoint (usage sketched below):
- '/node_proxy/idrac': supports POST requests only, even though this endpoint
is intended for fetching the idrac credentials of a given node. As we pass
sensitive details (ceph secret), I didn't want to pass them as a query parameter
in the url. Passing them in an HTTP header is perhaps a better approach, but we
already do something similar for the '/data' (agent) endpoint, so for consistency
I stuck to that.
- '/node_proxy/data': supports GET and POST requests. A GET will return the
aggregated data for all nodes within the cluster. node-proxy will use a POST
request to that endpoint to push its collected data.
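A hedged usage sketch; the mgr address, port and payload keys are assumptions,
not the actual wire format:
```
import requests

mgr = 'https://mgr-host:7150'   # hypothetical agent endpoint address
cephx = {'name': 'node-proxy.node-01', 'secret': 'xxx'}  # illustrative

# node-proxy pushes its collected data
requests.post(f'{mgr}/node_proxy/data',
              json={'cephx': cephx, 'data': {'summary': '...'}},
              verify=False)

# fetch the idrac credentials of a given node (POST keeps the secret
# out of the url)
creds = requests.post(f'{mgr}/node_proxy/idrac',
                      json={'cephx': cephx},
                      verify=False).json()

# fetch the aggregated data for all nodes in the cluster
all_data = requests.get(f'{mgr}/node_proxy/data', verify=False).json()
```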
This encapsulates the existing code in a new method
`query_endpoint()`.
The idea is to avoid duplicating code if we need to make multiple
calls to the agent endpoint from the `run()` method.
This adds new parameters to the current spec 'HostSpec'.
The idea is to make it possible to pass idrac credentials so
the node-proxy agent can consume them in order
to communicate with the redfish API.
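A hedged illustration of what this enables; the `oob` field name, its keys
and the values are assumptions about the new parameters, not necessarily the
final spec:
```
from ceph.deployment.hostspec import HostSpec

host = HostSpec(
    hostname='node-01',
    addr='10.10.10.11',
    # hypothetical out-of-band management details consumed by node-proxy
    oob={
        'addr': '169.254.1.1',
        'username': 'root',
        'password': 'xxxx',
    },
)
```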
This moves the existing files to the new directory 'cephadmlib' so
we can make the existing code for node-proxy run within the cephadm
agent. Indeed, we can leverage the existing code for the cephadm agent
given that both daemons would achieve the same thing.
node-proxy: try to acquire lock early in reporter's loop
The lock should be acquired early in this loop.
If the lock gets acquired by another call after we enter that condition *and*
before Reporter.loop() actually acquires it, it can lead to issues if, during
this short amount of time, the value of `data_ready` gets modified.
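A minimal sketch of the intended ordering; all names are illustrative:
```
import time

def reporter_loop(system, send, interval: float = 5.0) -> None:
    while True:
        # take the lock *before* checking data_ready so its value cannot
        # change between the check and the moment the data is read
        with system.lock:
            if system.data_ready:
                send(system.get_system())
        time.sleep(interval)
```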
This implements a BaseClient class and makes RedfishClient inherit from it.
Same logic as BaseSystem / RedfishSystem, given that any other backend might
need to implement a new client for collecting the data.
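A minimal sketch of the hierarchy; the method names are illustrative:
```
class BaseClient:
    def __init__(self, host: str, username: str, password: str) -> None:
        self.host = host
        self.username = username
        self.password = password

    def login(self) -> None:
        raise NotImplementedError

    def get_path(self, path: str) -> dict:
        raise NotImplementedError


class RedFishClient(BaseClient):
    def login(self) -> None:
        ...  # RedFish-specific session handling goes here

    def get_path(self, path: str) -> dict:
        ...  # issue a GET against the RedFish API, return the parsed JSON
```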
If this call is stuck for any reason, the reporter will block
the whole daemon given that at this point it has acquired a lock.
We need to make sure this call won't block the daemon for a long time,
so let's add a timeout.
This catches the requests.exceptions.RequestException
exception in the reporter agent so we can better handle the
case where it can't reach the endpoint when trying to send the
collected data.
Before this change, if for some reason the refreshed data couldn't be
sent to the endpoint, it wouldn't have been retried because
`self.system.previous_data` was overwritten anyway.
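A hedged sketch of the resulting behaviour; apart from
`requests.exceptions.RequestException` and `previous_data`, which come from
the commit message, the names are illustrative:
```
import logging
import requests

log = logging.getLogger('reporter')

def push_data(endpoint: str, system, data: dict) -> None:
    try:
        requests.post(endpoint, json=data, timeout=5, verify=False)
    except requests.exceptions.RequestException as e:
        # the endpoint couldn't be reached: log the error and leave
        # previous_data untouched so the same payload is retried later
        log.error("couldn't reach the endpoint %s: %s", endpoint, e)
    else:
        # only record the data as sent once the POST actually succeeded
        system.previous_data = data
```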