From: Paul Cuzner Date: Sun, 11 Oct 2020 22:59:32 +0000 (+1300) Subject: doc: moved design doc under doc/dev X-Git-Tag: v16.1.0~334^2~2 X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=7168cbf0521aa29701f0133b4e38b858888e46b4;p=ceph.git doc: moved design doc under doc/dev Signed-off-by: Paul Cuzner --- diff --git a/doc/dev/cephadm/compliance-check.rst b/doc/dev/cephadm/compliance-check.rst new file mode 100644 index 000000000000..8fadd2511610 --- /dev/null +++ b/doc/dev/cephadm/compliance-check.rst @@ -0,0 +1,119 @@ +================ +Compliance Check +================ + +The stability and reliability of a Ceph cluster is not just dependent upon the Ceph daemons, but +also the OS and hardware that Ceph is installed on. This document is intended to promote a design +discussion for providing a "compliance" feature within mgr/cephadm, which would be responsible for +identifying common platform related issues that could impact Ceph stability and operation. + +The ultimate goal of the checks is to identify issues early and raise a healthcheck WARN +event, to alert the Administrator to the issue. + +Prerequisites +============= +In order to effectively analyse the hosts that Ceph is deployed to, this feature requires a cache +of host related metadata. The metadata is already available from cephadm's HostFacts class and the +gather-facts cephadm command. For the purposes of this document, we will assume that this +data is available within the mgr/cephadm "cache" structure. + +Some checks will require that the host status is also populated e.g. ONLINE, OFFLINE, MAINTENANCE + +Administrator Interaction +========================= +Not all users will require this feature, and must be able to 'opt out'. For this reason, +mgr/cephadm must provide controls, such as the following; + +.. code-block:: + + ceph cephadm compliance enable | disable | status [--format json] + ceph cephadm compliance ls [--format json] + ceph cephadm compliance enable-check + ceph cephadm compliance disable-check + ceph cephadm compliance set-check-interval + ceph cephadm compliance get-check-interval + +The status option would show the enabled/disabled state of the feature, along with the +check-interval. + +The ls subcommand would show all checks in the following format; + +check-name status description + +Proposed Integration +==================== +The compliance checks are not required to run all the time, but instead should run at discrete +intervals. The interval would be configurable under via the :code:`set-check-interval` +subcommand (default would be every 12 hours) + + +mgr/cephadm current executes an event driven (time based) serve loop to act on deploy/remove and +reconcile activity. In order to execute the compliance checks, the compliance check code would be +called from this main serve loop - when the :code:`set-check-interval` is met. + + +Proposed Checks +=============== +All checks would push any errors to a list, so multiple issues can be escalated to the Admin at +the same time. The list below provides a description of each check, with the text following the +name indicating a shortname version *(the shortname is the reference for command Interaction +when enabling or disabling a check)* + + +OS Consistency (OS) +___________________ +* all hosts must use same vendor +* all hosts must be on the same major release + +*src: gather-facts output* + +Linux Kernel Security Mode (LSM) +________________________________ +* All hosts should have a consistent SELINUX/AppArmor configuration + +*src: gather-facts output* + +Services Check (SERVICES) +_________________________ +Hosts that are in an ONLINE state should adhere to the following; + +* all daemons (systemd units) should be enabled +* all daemons should be running (not dead) + +*src: list_daemons output* + +Support Status (SUPPORT) +________________________ +If support status has been detected, it should be consistent across all hosts. At this point +support status us only available for Red Hat machines. + +*src: gather-facts output* + +Network : MTU (MTU) +________________________________ +All network interfaces on the same Ceph network (public/cluster) should have the same MTU + +*src: gather-facts output* + +Network : LinkSpeed (LINKSPEED) +____________________________________________ +All network interfaces on the same Ceph network (public/cluster) should have the same Linkspeed + +*src: gather-facts output* + +Network : Consistency (INTERFACE) +______________________________________________ +All hosts with OSDs should have consistent network configuration - eg. if some hosts do +not separate cluster/public traffic but others do, that is an anomaly that would generate a +compliance check warning. + +*src: gather-facts output* + +Notification Strategy +===================== +If any of the checks fail mgr/cephadm would raise a WARN level alert + +Futures +======= +The checks highlighted here serve as a start point only, and we should expect to expand +on the checks over time. diff --git a/src/pybind/mgr/cephadm/designs/compliance-check.rst b/src/pybind/mgr/cephadm/designs/compliance-check.rst deleted file mode 100644 index 8fadd2511610..000000000000 --- a/src/pybind/mgr/cephadm/designs/compliance-check.rst +++ /dev/null @@ -1,119 +0,0 @@ -================ -Compliance Check -================ - -The stability and reliability of a Ceph cluster is not just dependent upon the Ceph daemons, but -also the OS and hardware that Ceph is installed on. This document is intended to promote a design -discussion for providing a "compliance" feature within mgr/cephadm, which would be responsible for -identifying common platform related issues that could impact Ceph stability and operation. - -The ultimate goal of the checks is to identify issues early and raise a healthcheck WARN -event, to alert the Administrator to the issue. - -Prerequisites -============= -In order to effectively analyse the hosts that Ceph is deployed to, this feature requires a cache -of host related metadata. The metadata is already available from cephadm's HostFacts class and the -gather-facts cephadm command. For the purposes of this document, we will assume that this -data is available within the mgr/cephadm "cache" structure. - -Some checks will require that the host status is also populated e.g. ONLINE, OFFLINE, MAINTENANCE - -Administrator Interaction -========================= -Not all users will require this feature, and must be able to 'opt out'. For this reason, -mgr/cephadm must provide controls, such as the following; - -.. code-block:: - - ceph cephadm compliance enable | disable | status [--format json] - ceph cephadm compliance ls [--format json] - ceph cephadm compliance enable-check - ceph cephadm compliance disable-check - ceph cephadm compliance set-check-interval - ceph cephadm compliance get-check-interval - -The status option would show the enabled/disabled state of the feature, along with the -check-interval. - -The ls subcommand would show all checks in the following format; - -check-name status description - -Proposed Integration -==================== -The compliance checks are not required to run all the time, but instead should run at discrete -intervals. The interval would be configurable under via the :code:`set-check-interval` -subcommand (default would be every 12 hours) - - -mgr/cephadm current executes an event driven (time based) serve loop to act on deploy/remove and -reconcile activity. In order to execute the compliance checks, the compliance check code would be -called from this main serve loop - when the :code:`set-check-interval` is met. - - -Proposed Checks -=============== -All checks would push any errors to a list, so multiple issues can be escalated to the Admin at -the same time. The list below provides a description of each check, with the text following the -name indicating a shortname version *(the shortname is the reference for command Interaction -when enabling or disabling a check)* - - -OS Consistency (OS) -___________________ -* all hosts must use same vendor -* all hosts must be on the same major release - -*src: gather-facts output* - -Linux Kernel Security Mode (LSM) -________________________________ -* All hosts should have a consistent SELINUX/AppArmor configuration - -*src: gather-facts output* - -Services Check (SERVICES) -_________________________ -Hosts that are in an ONLINE state should adhere to the following; - -* all daemons (systemd units) should be enabled -* all daemons should be running (not dead) - -*src: list_daemons output* - -Support Status (SUPPORT) -________________________ -If support status has been detected, it should be consistent across all hosts. At this point -support status us only available for Red Hat machines. - -*src: gather-facts output* - -Network : MTU (MTU) -________________________________ -All network interfaces on the same Ceph network (public/cluster) should have the same MTU - -*src: gather-facts output* - -Network : LinkSpeed (LINKSPEED) -____________________________________________ -All network interfaces on the same Ceph network (public/cluster) should have the same Linkspeed - -*src: gather-facts output* - -Network : Consistency (INTERFACE) -______________________________________________ -All hosts with OSDs should have consistent network configuration - eg. if some hosts do -not separate cluster/public traffic but others do, that is an anomaly that would generate a -compliance check warning. - -*src: gather-facts output* - -Notification Strategy -===================== -If any of the checks fail mgr/cephadm would raise a WARN level alert - -Futures -======= -The checks highlighted here serve as a start point only, and we should expect to expand -on the checks over time.