From a0c69e70b164ffb3b3b433d8d2fad692e0fbc1a6 Mon Sep 17 00:00:00 2001
From: Paul Cuzner
Date: Fri, 9 Oct 2020 13:20:28 +1300
Subject: [PATCH] doc/dev/cephadm: Doc defining the design for host maintenance

Initial PR to define/agree the scope and goals of providing a host maintenance feature.

Signed-off-by: Paul Cuzner
(cherry picked from commit 8f3ed063a7c7a58d430e0c9a2ac43a7214fe86e4)
---
 doc/dev/cephadm/host-maintenance.rst | 100 +++++++++++++++++++++++++++
 doc/dev/cephadm/index.rst            |  11 +++
 2 files changed, 111 insertions(+)
 create mode 100644 doc/dev/cephadm/host-maintenance.rst
 create mode 100644 doc/dev/cephadm/index.rst

diff --git a/doc/dev/cephadm/host-maintenance.rst b/doc/dev/cephadm/host-maintenance.rst
new file mode 100644
index 0000000000000..464c1421c9c9a
--- /dev/null
+++ b/doc/dev/cephadm/host-maintenance.rst
@@ -0,0 +1,100 @@
+================
+Host Maintenance
+================
+
+All hosts that support Ceph daemons need to support maintenance activity, whether the host
+is physical or virtual (VM or cloud). This means that management workflows should provide
+a simple and consistent way to support this operational requirement. This document defines
+the maintenance strategy that could be implemented in cephadm and mgr/cephadm.
+
+
+High Level Design
+=================
+Placing a host into maintenance adopts the following workflow:
+
+#. Confirm that the removal of the host does not impact data availability (the following
+   steps will assume it is safe to proceed)
+
+   * orch host ok-to-stop would be used here
+
+#. If the host has osd daemons, apply noout to the host subtree to prevent data migration
+   from triggering during the planned maintenance slot.
+#. Stop the ceph target (all daemons stop)
+#. Disable the ceph target on that host (to prevent a reboot from automatically starting
+   ceph services again)
+
+
+Exiting maintenance is basically the reverse of the above sequence.
+
+Admin Interaction
+=================
+The ceph orch command will be extended to support maintenance.
+
+.. code-block::
+
+    ceph orch host enter-maintenance [ --check ]
+    ceph orch host exit-maintenance
+
+.. note:: In addition, the host's status should be updated to reflect whether it
+   is in maintenance or not.
+
+The 'check' Option
+__________________
+The orch host ok-to-stop command focuses on ceph daemons (mon, osd, mds), which
+provides the first check. However, a ceph cluster also uses other types of daemons
+for monitoring, management and non-native protocol support, which means the
+logic will need to consider service impact too. The 'check' option provides
+this additional layer to alert the user of service impact to *secondary*
+daemons.
+
+The list below shows some of these additional daemons.
+
+* mgr (not included in ok-to-stop checks)
+* prometheus, grafana, alertmanager
+* rgw
+* haproxy
+* iscsi gateways
+* ganesha gateways
+
+By using the --check option first, the Admin can choose whether to proceed. This
+workflow is optional for the CLI user, but could be integrated into the
+UI workflow to help less experienced Administrators manage the cluster.
+
+By adopting this two-phase approach, a UI based workflow would look something
+like this.
+
+#. User selects a host to place into maintenance
+
+   * orchestrator checks for data **and** service impact
+#. If potential impact is shown, the next steps depend on the impact type
+
+   * **data availability** : maintenance is denied, informing the user of the issue
+   * **service availability** : user is provided a list of affected services and
+     asked to confirm
+
+
+Components Impacted
+===================
+Implementing this capability will require changes to the following:
+
+* cephadm
+
+  * Add maintenance subcommand with the following 'verbs': enter, exit, check
+
+* mgr/cephadm
+
+  * add methods to CephadmOrchestrator for enter/exit and check
+  * data gathering would be skipped for hosts in a maintenance state
+
+* mgr/orchestrator
+
+  * add CLI commands to OrchestratorCli which expose the enter/exit and check interaction
+
+
+Ideas for Future Work
+=====================
+#. When a host is placed into maintenance, the time of the event could be persisted. This
+   would allow the orchestrator layer to establish a maintenance window for the task and
+   alert if the maintenance window has been exceeded.
+
+
diff --git a/doc/dev/cephadm/index.rst b/doc/dev/cephadm/index.rst
new file mode 100644
index 0000000000000..731d5a31c0675
--- /dev/null
+++ b/doc/dev/cephadm/index.rst
@@ -0,0 +1,11 @@
+===================================
+cephadm developer documentation
+===================================
+
+.. rubric:: Contents
+
+.. toctree::
+   :maxdepth: 1
+
+
+   host-maintenance
-- 
2.39.5