mon: add monitor RocksDB backup and restore

author Matthew N. Heler <matthew.heler@hotmail.com>

Mon, 18 May 2026 01:57:01 +0000 (20:57 -0500)

committer Matthew N. Heler <matthew.heler@hotmail.com>

Thu, 18 Jun 2026 00:24:52 +0000 (19:24 -0500)
diff --git a/doc/rados/configuration/mon-config-ref.rst b/doc/rados/configuration/mon-config-ref.rst
index 18006d0611b89b22ecbb99a798748858c78305a9..85c003f16b5d7fb3debbd84d3763f87bf5adbbe2 100644
--- a/doc/rados/configuration/mon-config-ref.rst
+++ b/doc/rados/configuration/mon-config-ref.rst
@@ -600,6 +600,361 @@ is far outweighed by the number of accidental pool (and thus data) deletions it
  
  For more information about the pool flags see :ref:`Pool values <setpoolvalues>`.
  
+Monitor backup
+==============
+
+In normal operation, Monitor backups are not required: surviving members
+of the Monitor quorum re-sync new or replaced Monitors automatically,
+and the Monitor store can be largely rebuilt from the OSDs after total
+quorum loss (see :ref:`mon-store-recovery-using-osds`).
+
+However, ``mon-store-recovery-using-osds`` only recovers state that the
+OSDs can report: osdmap history, auth keys associated with running
+OSDs, and similar. Some Monitor state has no copy outside the Monitor
+store and is unrecoverable if all Monitors are lost:
+
+* **Encryption keys for dm-crypt OSDs** are stored only in the
+  Monitor's ``config-key`` store under
+  ``dm-crypt/osd/<osd-uuid>/luks``. Without these keys the underlying
+  block devices cannot be unlocked, even when the OSD daemons and data
+  are physically intact.
+* **Cephadm orchestrator state** under ``mgr/cephadm/*`` in the
+  ``config-key`` store, including host inventory, daemon placement
+  specs, and service definitions.
+* **Config-key entries** populated by users or third-party tooling via
+  ``ceph config-key set``.
+* **Dashboard and manager module state** persisted to the
+  ``config-key`` store.
+
+A Monitor backup lets an operator restore this state if all running
+Monitors are lost. It is most valuable for clusters that use
+dm-crypt-encrypted OSDs or that depend heavily on cephadm-managed
+deployment state, where loss of the Monitor store would be a
+protracted data-availability incident rather than a recoverable inconvenience.
+
+Monitor backups complement, but do not replace, the existing Monitor
+recovery procedures. They are not a means of "undoing" cluster-level
+operations such as pool deletion or CRUSH changes: once OSDs have
+observed and acted on a newer osdmap, restoring an older Monitor store
+does not roll back the OSD-side effects.
+
+The Ceph Monitor uses the native RocksDB ``BackupEngine`` to create
+consistent snapshots of its store, which can be copied elsewhere
+without downtime.
+
+When :confval:`mon_backup_interval` is set, a backup is triggered every
+N seconds. Pair it with :confval:`mon_backup_cleanup_interval`; if only
+the backup interval is set, backups accumulate indefinitely because
+retention is only applied during cleanup.
+
+Backups share table files (``.sst``) within the backup directory: each
+new backup only copies SSTables that the running database has produced
+since the previous backup. Restoring any individual backup version is
+independent of the others, but the on-disk files for every version
+live in a single shared tree under ``mon_backup_path``.
+
+Layout of the backup directory
+------------------------------
+
+A backup path managed by the RocksDB ``BackupEngine`` contains three
+top-level directories plus a per-version copy of the Monitor keyring::
+
+   /path/to/backups/
+   ├── meta/
+   │   ├── 1
+   │   ├── 2
+   │   └── 3
+   ├── private/
+   │   ├── 1/
+   │   ├── 2/
+   │   └── 3/
+   ├── shared_checksum/
+   │   ├── 000007.sst
+   │   ├── 000010.sst
+   │   └── ...
+   ├── keyring.1
+   ├── keyring.2
+   └── keyring.3
+
+* ``meta/<N>`` is a metadata file describing logical backup version
+  ``N``.
+* ``private/<N>/`` contains files unique to backup ``N`` (RocksDB
+  descriptors and other per-version state).
+* ``shared_checksum/`` contains SSTables shared across backup
+  versions. A single SSTable in this directory may belong to several
+  versions; the BackupEngine deletes a file from here only when no
+  remaining backup references it.
+* ``keyring.<N>`` is a copy of ``$mon_data/keyring`` taken at the time
+  of backup ``N``. The Monitor needs this key to authenticate at
+  startup, and it is not stored inside the RocksDB database; restoring
+  version ``N`` copies the matching ``keyring.<N>`` back into
+  ``$mon_data`` so an older snapshot is paired with the keyring of its
+  vintage. Cleanup removes ``keyring.<N>`` files whose backup version
+  has been pruned.
+
+  Stashing the keyring is **best effort**: if the copy fails (for
+  example, permission denied on the backup directory or out of space
+  after the RocksDB snapshot completed), the backup is still recorded
+  as successful and the RocksDB data remains usable. On restore, a
+  missing ``keyring.<N>`` is silently skipped, and the operator must
+  supply the Monitor keyring out-of-band before starting the daemon.
+
+.. warning::
+
+   The backup directory contains the ``[mon.]`` private key. Treat
+   it with the same access controls as ``$mon_data`` itself; a
+   complete backup is sufficient material to impersonate a Monitor
+   in the cluster.
+
+Do not delete a ``private/<N>/`` directory or files under
+``shared_checksum/`` by hand: removing a referenced shared file
+corrupts every backup that points to it. Use ``backup_cleanup`` or
+the configured retention parameters to remove old versions so the
+BackupEngine can release shared files safely.
+
+If the Monitor cluster fails and you need to copy a backup elsewhere,
+copy the entire ``/path/to/backups/`` directory. Copying only
+``private/<N>/`` is not sufficient; the version's SSTables live in
+``shared_checksum/``.
+
+
+The :confval:`mon_backup_cleanup_interval` option sets how often backup
+cleanup runs. The cleanup algorithm first keeps the newest
+``mon_backup_keep_last`` backups. It then retains up to
+``mon_backup_keep_hourly`` hourly and ``mon_backup_keep_daily`` daily
+versions, keeping the newest backup in each time window.
+
+You can trigger ``backup`` and ``backup_cleanup`` through any running
+Monitor's admin socket.
+
+.. prompt:: bash #
+
+   ceph --admin-daemon .../mon.asok backup
+
+.. prompt:: bash #
+
+   ceph --admin-daemon .../mon.asok backup_cleanup
+
+The following metrics related to the monitor backup process are
+tracked by ``ceph-mon``.
+
+.. list-table:: Ceph Monitor Backup Metrics
+   :widths: 30 12 58
+   :header-rows: 1
+
+   * - Name
+     - Type
+     - Description
+   * - ``backup_running``
+     - Gauge
+     - ``1`` while a backup is in progress, ``0`` otherwise
+   * - ``backup_started``
+     - Counter
+     - Backup attempts (includes attempts rejected at the :confval:`mon_backup_min_avail` pre-flight check)
+   * - ``backup_success``
+     - Counter
+     - Backups completed by the ``BackupEngine``
+   * - ``backup_failed``
+     - Counter
+     - Failed backup attempts (pre-flight or ``BackupEngine``)
+   * - ``backup_duration``
+     - Average
+     - Backup wall-clock time
+   * - ``backup_last_success``
+     - Gauge
+     - UTC timestamp of the most recent successful backup, or ``0`` if none
+   * - ``backup_last_success_id``
+     - Gauge
+     - ``BackupEngine`` version ID of the most recent successful backup (the value passed to ``--restore-backup --backup-version``)
+   * - ``backup_last_failed``
+     - Gauge
+     - UTC timestamp of the most recent failed attempt, or ``0`` if none
+   * - ``backup_last_size``
+     - Gauge
+     - Payload size in bytes of the most recent attempt (may be partial on failure)
+   * - ``backup_last_files``
+     - Gauge
+     - File count of the most recent attempt (may be partial on failure)
+   * - ``backup_cleanup_started``
+     - Counter
+     - Cleanup invocations
+   * - ``backup_cleanup_running``
+     - Gauge
+     - ``1`` while a cleanup is in progress, ``0`` otherwise
+   * - ``backup_cleanup_success``
+     - Counter
+     - Cleanup passes completed without error
+   * - ``backup_cleanup_failed``
+     - Counter
+     - Failed cleanup passes
+   * - ``backup_cleanup_duration``
+     - Average
+     - Cleanup wall-clock time
+   * - ``backup_cleanup_kept``
+     - Gauge
+     - Backups retained by the most recent cleanup pass
+   * - ``backup_cleanup_deleted``
+     - Gauge
+     - Backups removed by the most recent cleanup pass
+   * - ``backup_cleanup_freed``
+     - Gauge
+     - Bytes released by the most recent cleanup pass (overstated when shared backups are in use, because the ``BackupEngine`` payload sum ignores file sharing)
+   * - ``backup_cleanup_size``
+     - Gauge
+     - Bytes retained by the most recent cleanup pass (same sharing caveat as ``backup_cleanup_freed``)
+
+The ``backup_started``/``backup_success``/``backup_failed`` and
+``backup_cleanup_started``/``backup_cleanup_success``/``backup_cleanup_failed``
+counters are monotonic and accumulate for the lifetime of the
+``ceph-mon`` process. The ``backup_last_*`` fields and the four
+cleanup result gauges (``backup_cleanup_kept``, ``backup_cleanup_deleted``,
+``backup_cleanup_freed``, ``backup_cleanup_size``) describe only the
+most recent invocation and are overwritten on every pass.
+
+To retrieve backup metrics from a running monitor's admin socket:
+
+.. prompt:: bash #
+
+   ceph --admin-daemon .../mon.asok perf dump | jq '.["mon"] | with_entries(select(.key | startswith("backup_")))'
+
+.. code-block:: json
+
+   {
+     "backup_running": 0,
+     "backup_started": 2,
+     "backup_success": 2,
+     "backup_failed": 0,
+     "backup_duration": {
+       "avgcount": 2,
+       "sum": 0.149076498,
+       "avgtime": 0.074538249
+     },
+     "backup_last_success": 1722001989.849262,
+     "backup_last_success_id": 3,
+     "backup_last_failed": 0,
+     "backup_last_size": 3924677,
+     "backup_last_files": 6,
+     "backup_cleanup_started": 1,
+     "backup_cleanup_running": 0,
+     "backup_cleanup_success": 1,
+     "backup_cleanup_failed": 0,
+     "backup_cleanup_size": 86144,
+     "backup_cleanup_kept": 1,
+     "backup_cleanup_duration": {
+       "avgcount": 1,
+       "sum": 0.002031246,
+       "avgtime": 0.002031246
+     },
+     "backup_cleanup_freed": 0,
+     "backup_cleanup_deleted": 0
+   }
+
+Monitor Backup Metric Usage Examples
+------------------------------------
+
+The following examples show how to use the monitor backup performance
+counters. PromQL examples assume that the ``ceph-exporter`` is being
+scraped by Prometheus. Admin-socket examples use ``ceph daemon``
+directly against a specific ``ceph-mon`` daemon.
+
+``backup_last_success`` (gauge, Unix epoch seconds)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The timestamp when the most recent successful backup completed. ``0``
+means no successful backup has occurred since the mon started.
+
+* Age of the most recent successful backup, per mon:
+
+  .. code-block:: promql
+
+      time() - mon_backup_last_success
+
+* Detect a stalled backup schedule (no success in over 2 hours; adjust
+  the threshold to roughly 2× your :confval:`mon_backup_interval`):
+
+  .. code-block:: promql
+
+      time() - mon_backup_last_success > 7200 and mon_backup_last_success > 0
+
+``backup_failed`` (counter)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Cumulative count of failed backup attempts (pre-flight or
+``BackupEngine``) since the mon started.
+
+* Backup failures per mon in the last hour:
+
+  .. code-block:: promql
+
+      increase(mon_backup_failed[1h])
+
+* Alert when any mon has logged a backup failure recently:
+
+  .. code-block:: promql
+
+      increase(mon_backup_failed[15m]) > 0
+
+``backup_last_size`` (gauge, bytes)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Payload size in bytes of the most recent backup attempt.
+
+* Backup size trend across all mons:
+
+  .. code-block:: promql
+
+      mon_backup_last_size
+
+* Live size check for a single mon via admin socket:
+
+  .. code-block:: bash
+
+      ceph daemon mon.<id> perf dump | jq '.mon.backup_last_size'
+
+``backup_cleanup_kept`` (gauge)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Number of backups retained by the most recent cleanup pass.
+
+* Mons whose retention has grown beyond an expected ceiling:
+
+  .. code-block:: promql
+
+      mon_backup_cleanup_kept > 100
+
+``backup_cleanup_freed`` (gauge, bytes)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Bytes released by the most recent cleanup pass. Overstated when shared
+backups are in use.
+
+* Bytes reclaimed by the latest cleanup, per mon:
+
+  .. code-block:: promql
+
+      mon_backup_cleanup_freed
+
+* Total bytes released by the most recent cleanup across all mons:
+
+  .. code-block:: promql
+
+      sum(mon_backup_cleanup_freed)
+
+Monitor Backup Configuration Options
+------------------------------------
+
+The following options control monitor backup behavior. All are
+runtime-tunable.
+
+.. confval:: mon_backup_path
+.. confval:: mon_backup_min_avail
+.. confval:: mon_backup_keep_last
+.. confval:: mon_backup_keep_hourly
+.. confval:: mon_backup_keep_daily
+.. confval:: mon_backup_interval
+.. confval:: mon_backup_cleanup_interval
+
+
  Miscellaneous
  =============
  
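Note on the retention rule documented above (keep the newest ``mon_backup_keep_last`` backups, then the newest backup per hourly and per daily window): the rule can be illustrated with a minimal sketch. This is not the code from this commit. ``BackupInfo``, ``keep_newest_per_window`` and ``select_backups_to_keep`` are hypothetical names, and the sketch assumes that windows are fixed buckets of the Unix timestamp and that backups already kept by an earlier rule still count toward the hourly and daily limits.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical stand-in for a backup record; not the Ceph type.
struct BackupInfo {
  uint32_t id;
  int64_t timestamp;  // Unix epoch seconds
};

// Walk from newest to oldest and keep the newest backup in each of the
// first `count` distinct windows of `window_secs` seconds.
static void keep_newest_per_window(const std::vector<BackupInfo>& newest_first,
                                   int64_t window_secs, uint64_t count,
                                   std::set<uint32_t>& keep) {
  std::set<int64_t> seen_windows;
  for (const auto& b : newest_first) {
    if (seen_windows.size() >= count) {
      break;
    }
    // insert().second is true only for the first (newest) backup in a window
    if (seen_windows.insert(b.timestamp / window_secs).second) {
      keep.insert(b.id);
    }
  }
}

// Return the IDs that survive cleanup under the assumed retention rule.
std::set<uint32_t> select_backups_to_keep(std::vector<BackupInfo> backups,
                                          uint64_t keep_last,
                                          uint64_t keep_hourly,
                                          uint64_t keep_daily) {
  std::sort(backups.begin(), backups.end(),
            [](const BackupInfo& a, const BackupInfo& b) {
              return a.timestamp > b.timestamp;  // newest first
            });
  std::set<uint32_t> keep;
  for (std::size_t i = 0; i < backups.size() && i < keep_last; ++i) {
    keep.insert(backups[i].id);
  }
  keep_newest_per_window(backups, 3600, keep_hourly, keep);
  keep_newest_per_window(backups, 86400, keep_daily, keep);
  return keep;
}
```

Every ID missing from the returned set would be a candidate for deletion. The real cleanup may define its windows differently, so treat the bucket arithmetic as an assumption.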
diff --git a/doc/rados/troubleshooting/troubleshooting-mon.rst b/doc/rados/troubleshooting/troubleshooting-mon.rst
index 10e49dc8b2b1cbfa7021886e7e176195ec2a163f..791b8a7b53a50826b721ce345681f041ebace3f4 100644
--- a/doc/rados/troubleshooting/troubleshooting-mon.rst
+++ b/doc/rados/troubleshooting/troubleshooting-mon.rst
@@ -518,6 +518,97 @@ or::
  
    Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb
  
+Recovery Using Mon Backup
+-------------------------
+
+If Monitor backups are enabled, backups can be found in the configured
+``mon_backup_path``. To list the available backup versions, run:
+
+.. code-block:: bash
+
+   ceph-mon -i <id> --list-backups /path/to/backups
+
+In containerized deployments, run this from inside the Monitor
+container (``cephadm shell --name mon.<id>``), or install
+``ceph-common`` on the host.
+
+This invokes the RocksDB ``BackupEngine`` to enumerate the logical
+backup versions at the path. Output looks like::
+
+   ID:     Time:                           Size:
+   1       Mon May 18 03:00:01 2026         4 MiB
+   2       Mon May 18 04:00:02 2026         12 KiB
+   3       Mon May 18 05:00:01 2026         16 KiB
+
+The ``ID`` column is the value to pass as ``--backup-version`` when
+restoring. A plain ``ls`` of the backup path shows the BackupEngine's
+internal ``meta/``, ``private/``, and ``shared_checksum/`` directories,
+which are not directly usable for restore; always use ``--list-backups``
+to obtain the IDs.
+
+To restore a backup, stop the monitor and run:
+
+.. code-block:: bash
+
+   ceph-mon -i <id> --restore-backup /path/to/backups --backup-version <version> --yes-i-really-mean-it
+
+The ``--yes-i-really-mean-it`` flag is required because restore overwrites the existing monitor store.
+If the ``--backup-version`` argument is omitted, the latest version will be restored.
+The restored store contains everything that was in the mon at the time the backup was taken,
+including auth records; any changes (auth, pools, CRUSH, etc.) made after that point are lost.
+OSDs will reconcile their state with the restored osdmap as the cluster comes back up.
+
+If ``ceph-mon --restore-backup`` is invoked as ``root`` (typical when running
+from a service shell), the restored ``kv_backend`` file and the rehydrated
+``keyring`` will be owned by ``root``. Before starting the monitor daemon,
+run ``chown -R ceph:ceph <mon_data>`` so that the unprivileged ``ceph`` user
+can read them.
+
+Restoring a multi-monitor cluster
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Each monitor has its own Paxos state (rank, accepted proposal numbers,
+``last_committed``), so backups are per-monitor: a backup taken from
+``mon.a`` should be restored into ``mon.a``'s ``mon_data``, not copied
+onto ``mon.b`` or ``mon.c``.
+
+To recover a cluster from monitor backups:
+
+#. Stop all monitors.
+#. Restore each monitor you have a backup for from its own backup path
+   using the ``--restore-backup`` command above.
+#. Start the restored monitors. Quorum forms once a majority of the
+   monmap members are running. Restoring and starting a majority is the
+   simplest path: the recovered monmap is the original one, and the
+   monitors elect among themselves as normal.
+
+If you cannot bring back a majority of monitors (because backups are
+missing or storage is lost on the other hosts), a single restored
+monitor (the "seed") will not form quorum on its own. The monmap
+recovered from the backup still lists every monitor in the original
+cluster, and Paxos requires a majority to elect. The seed will sit in
+``probing`` / ``electing`` indefinitely.
+
+To reduce the monmap on the offline seed so it can elect itself:
+
+#. Stop all monitors.
+#. Restore the seed monitor from its backup as above.
+#. Edit the monmap directly in the offline mon store::
+
+     # extract from the offline mon store
+     ceph-mon -i <seed-id> --extract-monmap /tmp/monmap
+     # drop monitors that will not come back
+     monmaptool /tmp/monmap --rm <other-id-1> --rm <other-id-2>
+     # write it back into the store
+     ceph-mon -i <seed-id> --inject-monmap /tmp/monmap
+
+#. Start the seed monitor. With the reduced monmap it can elect itself
+   and form a single-member quorum.
+#. Re-add any remaining monitors as new sync members per
+   :ref:`adding-and-removing-monitors`; they will synchronize from the
+   recovered seed rather than reusing their old backups.
+
+
  Recovery Using Healthy Monitor(s)
  ---------------------------------
  
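Note on the multi-monitor restore procedure above: the majority requirement explains why a lone restored monitor stalls until its monmap is reduced. A minimal sketch of the arithmetic follows. ``quorum_size`` and ``can_form_quorum`` are hypothetical helper names, not Ceph functions, and the sketch assumes a plain strict-majority rule.

```cpp
#include <cstddef>

// Smallest number of monitors that is a strict majority of a monmap
// with `monmap_size` members.
constexpr std::size_t quorum_size(std::size_t monmap_size) {
  return monmap_size / 2 + 1;
}

// True if `running` monitors out of `monmap_size` can form quorum.
constexpr bool can_form_quorum(std::size_t running, std::size_t monmap_size) {
  return monmap_size > 0 && running >= quorum_size(monmap_size);
}
```

With the original three-member monmap, ``can_form_quorum(1, 3)`` is false, which matches the seed sitting in ``probing`` / ``electing``. After the other two monitors are removed with ``monmaptool --rm``, the monmap has one member and ``can_form_quorum(1, 1)`` is true.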
diff --git a/src/ceph_mon.cc b/src/ceph_mon.cc
index e1475368d012a02dda04715441ceba92860818bb..de928f0e4cdaeb35cb8f11e68ae28371fc9253e5 100644
--- a/src/ceph_mon.cc
+++ b/src/ceph_mon.cc
@@ -19,6 +19,7 @@
  #include <fcntl.h>
  
  #include <iostream>
+#include <optional>
  #include <sstream>
  #include <string>
  
@@ -42,6 +43,7 @@
  #include "common/Throttle.h"
  #include "common/Timer.h"
  #include "common/errno.h"
+#include "common/strtol.h"
  #include "common/Preforker.h"
  
  #include "global/global_init.h"
@@ -213,7 +215,7 @@ static void usage()
         << "  --force-sync\n"
         << "        force a sync from another mon by wiping local data (BE CAREFUL)\n"
         << "  --yes-i-really-mean-it\n"
-       << "        mandatory safeguard for --force-sync\n"
+       << "        mandatory safeguard for --force-sync and --restore-backup\n"
         << "  --compact\n"
         << "        compact the monitor store\n"
         << "  --osdmap <filename>\n"
@@ -224,8 +226,14 @@ static void usage()
         << "        extract the monmap from the local monitor store and exit\n"
         << "  --mon-data <directory>\n"
         << "        where the mon store and keyring are located\n"
-       << "  --set-crush-location <bucket>=<foo>"
-       << "        sets monitor's crush bucket location (only for stretch mode)"
+       << "  --set-crush-location <bucket>=<foo>\n"
+       << "        sets monitor's crush bucket location (only for stretch mode)\n"
+       << "  --restore-backup <directory>\n"
+       << "        restore the backup from location and exit (requires --yes-i-really-mean-it)\n"
+       << "  --backup-version <version>\n"
+       << "        BackupEngine version ID (uint32); defaults to the latest backup when omitted\n"
+       << "  --list-backups <directory>\n"
+       << "        list available backups\n"
         << std::endl;
    generic_server_usage();
  }
@@ -265,6 +273,8 @@ int main(int argc, const char **argv)
    bool force_sync = false;
    bool yes_really = false;
    std::string osdmapfn, inject_monmap, extract_monmap, crush_loc;
+  std::string restore_backup_location, list_backup_location;
+  std::optional<uint32_t> restore_backup_version;
  
    auto args = argv_to_vec(argc, argv);
    if (args.empty()) {
@@ -337,6 +347,18 @@ int main(int argc, const char **argv)
        extract_monmap = val;
      } else if (ceph_argparse_witharg(args, i, &val, "--set-crush-location", (char*)NULL)) {
        crush_loc = val;
+    } else if (ceph_argparse_witharg(args, i, &val, "--list-backups", (char*)NULL)) {
+      list_backup_location = val;
+    } else if (ceph_argparse_witharg(args, i, &val, "--restore-backup", (char*)NULL)) {
+      restore_backup_location = val;
+    } else if (ceph_argparse_witharg(args, i, &val, "--backup-version", (char*)NULL)) {
+      std::string parse_err;
+      long long v = strict_strtoll(val.c_str(), 10, &parse_err);
+      if (!parse_err.empty() || v < 0 || v > UINT32_MAX) {
+        cerr << "invalid --backup-version '" << val << "'" << std::endl;
+        exit(1);
+      }
+      restore_backup_version = static_cast<uint32_t>(v);
      } else {
        ++i;
      }
@@ -362,6 +384,51 @@ int main(int argc, const char **argv)
      exit(1);
    }
  
+  // -- list backups --
+  if (!list_backup_location.empty()) {
+    cout << "listing backups at location '" << list_backup_location << "'" << std::endl << std::endl;
+    auto backup_infos = MonitorDBStore::list_backups(
+      cct.get(), g_conf()->mon_data, list_backup_location);
+    if (!backup_infos) {
+      cerr << "failed to enumerate backups at '" << list_backup_location
+           << "' (see log for details)" << std::endl;
+      exit(1);
+    }
+    if (backup_infos->empty()) {
+      cout << "no backups found at '" << list_backup_location << "'" << std::endl;
+      exit(0);
+    }
+    cout << "ID:\tTime:\t\t\t\tSize:" << std::endl;
+    for (const KeyValueDB::BackupStats& bi : *backup_infos) {
+      cout << bi.id << "\t";
+      bi.timestamp.asctime(cout);
+      cout << "\t" << byte_u_t(bi.size) << std::endl;
+    }
+    exit(0);
+  }
+
+  // -- restore backup --
+  if (!restore_backup_location.empty()) {
+    if (!yes_really) {
+      cerr << "restoring will overwrite the monitor store at '" << g_conf()->mon_data
+           << "'. Pass --yes-i-really-mean-it to proceed." << std::endl;
+      exit(1);
+    }
+    cerr << "restoring backup from location '" << restore_backup_location << "' to '"
+         << g_conf()->mon_data << "'" << std::endl;
+    if (MonitorDBStore::restore_backup(cct.get(), g_conf()->mon_data, restore_backup_location, restore_backup_version)) {
+      cout << "successfully restored backup. Start ceph-mon normally" << std::endl;
+      exit(0);
+    }
+    cerr << "restore failed. Check the backup path and version (use --list-backups to enumerate)." << std::endl;
+    exit(1);
+  }
+
+  if (restore_backup_version.has_value()) {
+    cerr << "--backup-version requires --restore-backup" << std::endl;
+    exit(1);
+  }
+
    MonitorDBStore store(g_conf()->mon_data);
  
    // -- mkfs --
@@ -537,7 +604,7 @@ int main(int argc, const char **argv)
        exit(1);
      }
      store.close();
-    dout(0) << argv[0] << ": created monfs at " << g_conf()->mon_data 
+    dout(0) << argv[0] << ": created monfs at " << g_conf()->mon_data
             << " for " << g_conf()->name << dendl;
      return 0;
    }
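Note on the ``--backup-version`` handling in this file: the patch parses the value with ``strict_strtoll`` and then range-checks it against ``UINT32_MAX``. The same validation can be sketched standalone with ``std::from_chars``, which is swapped in here so the example builds without Ceph headers. ``parse_backup_version`` is a hypothetical helper, and edge cases such as a leading ``+`` or whitespace may be treated differently than by ``strict_strtoll``.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: parse a decimal BackupEngine version ID.
// Returns std::nullopt for empty input, non-digit input, trailing
// characters, a leading minus sign, or values that do not fit in uint32_t.
std::optional<uint32_t> parse_backup_version(std::string_view s) {
  uint32_t value = 0;
  const char* first = s.data();
  const char* last = s.data() + s.size();
  auto [ptr, ec] = std::from_chars(first, last, value, 10);
  if (ec != std::errc{} || ptr != last) {
    return std::nullopt;
  }
  return value;
}
```

Returning ``std::optional<uint32_t>`` lines up with the ``restore_backup_version`` variable in ``main()``, where an empty optional means "restore the latest backup".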
diff --git a/src/common/options/mon.yaml.in b/src/common/options/mon.yaml.in
index 9c81dbac7345914b3a70f76cbdf5a175ef9171b7..ef875ef80584b173ccd49be928c6f4b15caaaf74 100644
--- a/src/common/options/mon.yaml.in
+++ b/src/common/options/mon.yaml.in
@@ -1176,6 +1176,74 @@ options:
    flags:
    - no_mon_update
    with_legacy: true
+- name: mon_backup_path
+  type: str
+  level: advanced
+  desc: Path to Monitor database backups
+  fmt_desc: The Monitor's backup location.
+  default: /var/backups/ceph/mon/$cluster-$id
+  services:
+  - mon
+  flags:
+  - runtime
+- name: mon_backup_min_avail
+  type: int
+  level: advanced
+  desc: Only capture backups if at least this percentage of the target filesystem is free
+  default: 10
+  min: 0
+  max: 100
+  services:
+  - mon
+  flags:
+  - runtime
+- name: mon_backup_keep_last
+  type: uint
+  level: advanced
+  desc: Keep the last N backups
+  fmt_desc: Keep the last N backups of the Monitor database.
+  default: 6
+  services:
+  - mon
+  flags:
+  - runtime
+- name: mon_backup_keep_hourly
+  type: uint
+  level: advanced
+  desc: Number of hourly backups
+  default: 5
+  services:
+  - mon
+  flags:
+  - runtime
+- name: mon_backup_keep_daily
+  type: uint
+  level: advanced
+  desc: Number of daily backups
+  fmt_desc: Keep one backup per day, for the specified number of days.
+  default: 7
+  services:
+  - mon
+  flags:
+  - runtime
+- name: mon_backup_interval
+  type: secs
+  level: advanced
+  desc: Automatic backups every N seconds (0 disables)
+  default: 0
+  services:
+  - mon
+  flags:
+  - runtime
+- name: mon_backup_cleanup_interval
+  type: secs
+  level: advanced
+  desc: Trigger backup cleanup every N seconds (0 disables)
+  default: 0
+  services:
+  - mon
+  flags:
+  - runtime
  - name: mon_rocksdb_options
    type: str
    level: advanced
diff --git a/src/kv/KeyValueDB.cc b/src/kv/KeyValueDB.cc
index 41eb30873b89c165886a7a9a57851ce3634abe60..c7700fd6f3f2a3dbde6bcc9ad8b93edf8ea7a219 100644
--- a/src/kv/KeyValueDB.cc
+++ b/src/kv/KeyValueDB.cc
@@ -1,6 +1,10 @@
  // -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:nil -*-
  // vim: ts=8 sw=2 sts=2 expandtab
  
+#include <filesystem>
+#include <memory>
+#include <sstream>
+
  #include "KeyValueDB.h"
  #include "RocksDBStore.h"
  
@@ -18,6 +22,52 @@ KeyValueDB *KeyValueDB::create(CephContext *cct, const string& type,
    return NULL;
  }
  
+bool KeyValueDB::restore_backup(CephContext *cct,
+                                const std::string &type,
+                                const std::string &path,
+                                const std::string &backup_location,
+                                const std::optional<uint32_t> &version)
+{
+  if (std::filesystem::exists(path) &&
+      !std::filesystem::is_empty(path)) {
+    std::unique_ptr<KeyValueDB> probe(KeyValueDB::create(cct, type, path));
+    if (!probe) {
+      lderr(cct) << __func__ << " unsupported kv backend: " << type << dendl;
+      return false;
+    }
+    std::ostringstream err;
+    if (probe->open(err) < 0) {
+      // Heuristic: rocksdb's PosixEnv lock path surfaces "lock" in the
+      // error text. Any other open failure (corruption, I/O) is precisely
+      // why a restore is being run -- warn and proceed.
+      const std::string msg = err.str();
+      if (msg.find("lock") != std::string::npos) {
+        lderr(cct) << __func__ << " another monitor is using this data dir: "
+                   << path << ": " << msg << dendl;
+        return false;
+      }
+      lderr(cct) << __func__ << " existing store at " << path
+                 << " is unreadable, proceeding with restore: " << msg << dendl;
+    } else {
+      probe->close();
+    }
+  }
+  if (type == "rocksdb") {
+    return RocksDBStore::restore_backup(cct, path, backup_location, version);
+  }
+  return false;
+}
+
+std::optional<std::vector<KeyValueDB::BackupStats>> KeyValueDB::list_backups(
+  CephContext *cct, const std::string &type, const std::string &backup_location)
+{
+  if (type == "rocksdb") {
+    return RocksDBStore::list_backups(cct, backup_location);
+  }
+  lderr(cct) << __func__ << " unsupported kv backend: " << type << dendl;
+  return std::nullopt;
+}
+
  int KeyValueDB::test_init(const string& type, const string& dir)
  {
    if (type == "rocksdb") {
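Note on the pre-restore probe in ``KeyValueDB::restore_backup`` above: it treats any open error whose text contains ``lock`` as "another monitor is using this data dir" and refuses to restore, while every other open failure is logged and the restore proceeds. The decision can be isolated in a small sketch. ``OpenFailure`` and ``classify_open_failure`` are hypothetical names used only for illustration.

```cpp
#include <string>

// Hypothetical result type for the probe decision.
enum class OpenFailure {
  LockHeld,  // refuse to restore: the store appears to be in use
  Other      // e.g. corruption or I/O error: proceed with the restore
};

// Mirrors the substring heuristic from the patch: a case-sensitive
// search for "lock" anywhere in the error text.
OpenFailure classify_open_failure(const std::string& msg) {
  return msg.find("lock") != std::string::npos ? OpenFailure::LockHeld
                                               : OpenFailure::Other;
}
```

Because the match is a plain substring search, an unrelated error whose text happens to contain ``lock`` (for example a path component such as ``block0``) would also be classified as ``LockHeld``. That errs on the side of refusing the restore.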
diff --git a/src/kv/KeyValueDB.h b/src/kv/KeyValueDB.h
index 2d861eda82d1d0b60f50c90a4be650b7bdb3ae7b..381492fa1d7ba7d90807bc98b362ccd0e1b17ac8 100644
--- a/src/kv/KeyValueDB.h
+++ b/src/kv/KeyValueDB.h
@@ -13,6 +13,7 @@
  #include <string_view>
  #include <boost/scoped_ptr.hpp>
  #include "include/encoding.h"
+#include "include/utime.h"
  #include "common/Formatter.h"
  #include "common/perf_counters.h"
  #include "common/PriorityCache.h"
@@ -24,6 +25,25 @@
   */
  class KeyValueDB {
  public:
+  struct BackupCleanupStats {
+    bool error{false};
+    utime_t timestamp;
+    uint32_t corrupted{0};
+    uint32_t deleted{0};
+    uint32_t kept{0};
+    uint64_t size{0};
+    uint64_t freed{0};
+  };
+
+  struct BackupStats {
+    bool error{false};
+    uint64_t id{0};
+    utime_t timestamp;
+    std::string msg;
+    uint64_t size{0};
+    uint64_t number_files{0};
+  };
+
    class TransactionImpl {
    public:
      // amount of ops included
@@ -112,8 +132,8 @@ public:
      }
  
      /// Remove Single Key which exists and was not overwritten.
-    /// This API is only related to performance optimization, and should only be 
-    /// re-implemented by log-insert-merge tree based keyvalue stores(such as RocksDB). 
+    /// This API is only related to performance optimization, and should only be
+    /// re-implemented by log-insert-merge tree based keyvalue stores(such as RocksDB).
      /// If a key is overwritten (by calling set multiple times), then the result
      /// of calling rm_single_key on this key is undefined.
      virtual void rm_single_key(
@@ -418,6 +438,29 @@ public:
        return 0;
    };
  
+  /// Create a kv database backup in directory path.
+  virtual BackupStats backup(const std::string &path) {
+    return {.error = true, .msg = "backup not supported by this backend"};
+  }
+
+  /// Remove old backups in directory path according to retention.
+  virtual BackupCleanupStats backup_cleanup(const std::string &path,
+                                            uint64_t keep_last,
+                                            uint64_t keep_hourly,
+                                            uint64_t keep_daily) {
+    return {.error = true};
+  }
+
+  /// Restore the given backup version (the latest if unset) from backup_location into path.
+  static bool restore_backup(CephContext *cct, const std::string &type,
+                             const std::string &path,
+                             const std::string &backup_location,
+                             const std::optional<uint32_t> &version);
+
+  static std::optional<std::vector<BackupStats>> list_backups(
+    CephContext *cct, const std::string &type,
+    const std::string &backup_location);
+
    /// compact the underlying store
    virtual void compact() {}
  
diff --git a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc
index ae308fafdf6979be1d3a0605378b48ed0628cfbc..6d9470306a147bfb68f80303e06abcebd7e6f6fa 100644
--- a/src/kv/RocksDBStore.cc
+++ b/src/kv/RocksDBStore.cc
@@ -7,6 +7,7 @@
  #include <set>
  #include <string>
  #include <string_view>
+#include <tuple>
  #include <errno.h>
  #include <unistd.h>
  #include <sys/types.h>
@@ -18,10 +19,14 @@
  #include "rocksdb/slice.h"
  #include "rocksdb/cache.h"
  #include "rocksdb/filter_policy.h"
+#include "rocksdb/utilities/backup_engine.h"
  #include "rocksdb/utilities/convenience.h"
  #include "rocksdb/utilities/table_properties_collectors.h"
  #include "rocksdb/merge_operator.h"
  
+#include "common/version.h"
+#include "rocksdb/util/stderr_logger.h"
+
  #include "common/Clock.h" // for ceph_clock_now()
  #include "common/perf_counters.h"
  #include "common/PriorityCache.h"
@@ -70,6 +75,7 @@ static const char* sharding_def_file = "sharding/def";
  static const char* sharding_recreate = "sharding/recreate_columns";
  static const char* resharding_column_lock = "reshardingXcommencingXlocked";
  
+
  static bufferlist to_bufferlist(rocksdb::Slice in) {
    bufferlist bl;
    bl.append(bufferptr(in.data(), in.size()));
@@ -1143,6 +1149,7 @@ int RocksDBStore::do_open(ostream &out,
    if (create_if_missing) {
      status = rocksdb::DB::Open(opt, path, &db);
      if (!status.ok()) {
+      out << status.ToString();
        derr << status.ToString() << dendl;
        return -EINVAL;
      }
@@ -1191,6 +1198,7 @@ int RocksDBStore::do_open(ostream &out,
          status = rocksdb::DB::Open(opt, path, &db);
        }
        if (!status.ok()) {
+       out << status.ToString();
         derr << status.ToString() << dendl;
         return -EINVAL;
        }
@@ -1206,6 +1214,7 @@ int RocksDBStore::do_open(ostream &out,
                                    path, existing_cfs, &handles, &db);
        }
        if (!status.ok()) {
+       out << status.ToString();
         derr << status.ToString() << dendl;
         return -EINVAL;
        }
@@ -2076,6 +2085,278 @@ int RocksDBStore::split_key(rocksdb::Slice in, string_view *prefix, string_view
    return 0;
  }
  
+KeyValueDB::BackupStats RocksDBStore::backup(const std::string &path)
+{
+  ldout(cct, 20) << __func__ << " start backup action" << dendl;
+  std::lock_guard backup_locker{backup_lock};
+  // stamp timestamp up front so every return path (including early Open
+  // failures) carries a real time the scheduler can gate retries on.
+  KeyValueDB::BackupStats rv;
+  rv.timestamp = ceph_clock_now();
+
+  rocksdb::BackupEngine* engine_ptr = nullptr;
+  rocksdb::BackupEngineOptions engine_options = rocksdb::BackupEngineOptions(path);
+  // BackupEngineOptions must be stable across opens to the same directory,
+  // and share_files_with_checksum=false is deprecated by rocksdb.
+  engine_options.share_table_files = true;
+  engine_options.share_files_with_checksum = true;
+  engine_options.sync = true;
+
+  rocksdb::Status s = rocksdb::BackupEngine::Open(
+    engine_options,
+    rocksdb::Env::Default(),
+    &engine_ptr);
+  std::unique_ptr<rocksdb::BackupEngine> backup_engine{engine_ptr};
+
+  if (!backup_engine || !s.ok()) {
+    ldout(cct, 0) << __func__ << " can't create backup_engine: " << s.ToString() << dendl;
+    rv.msg = s.ToString();
+    rv.error = true;
+    return rv;
+  }
+
+  // remove corrupted backups first so the new backup never shares files with a broken one
+  remove_corrupted_backups(backup_engine.get(), nullptr);
+
+  rocksdb::BackupID new_backup;
+  rocksdb::BackupInfo new_backup_info;
+  rocksdb::CreateBackupOptions new_backup_options = rocksdb::CreateBackupOptions();
+  new_backup_options.flush_before_backup = true;
+
+  std::string app_metadata = std::string("ceph_version=") + ceph_version_to_str();
+  s = backup_engine->CreateNewBackupWithMetadata(new_backup_options, db, app_metadata, &new_backup);
+
+  rv.timestamp = ceph_clock_now();
+  rv.msg = s.ToString();
+
+  if (!s.ok()) {
+    ldout(cct, 0) << __func__ << " can't create backup: " << s.ToString() << dendl;
+    rv.error = true;
+    remove_corrupted_backups(backup_engine.get(), nullptr);
+    return rv;
+  } else {
+    ldout(cct, 10) << __func__ << " created backup successfully: " << s.ToString() << dendl;
+    rv.msg = s.ToString();
+  }
+  s = backup_engine->GetBackupInfo(new_backup, &new_backup_info);
+  if (!s.ok()) {
+    ldout(cct, 0) << __func__ << " can't get backup info: " << s.ToString() << dendl;
+    rv.error = true;
+    rv.msg = s.ToString();
+    return rv;
+  }
+  rv.id = new_backup_info.backup_id;
+  rv.size = new_backup_info.size;
+  rv.number_files = new_backup_info.number_files;
+
+  return rv;
+}
+
+bool RocksDBStore::restore_backup(CephContext *cct, const std::string &path,
+                                  const std::string &backup_location,
+                                  const std::optional<uint32_t> &version)
+{
+  rocksdb::BackupEngineReadOnly* engine_ptr = nullptr;
+  rocksdb::StderrLogger logger = rocksdb::StderrLogger();
+  rocksdb::BackupEngineOptions engine_options = rocksdb::BackupEngineOptions(backup_location);
+  engine_options.info_log = &logger;
+
+  rocksdb::Status s = rocksdb::BackupEngineReadOnly::Open(
+    engine_options,
+    rocksdb::Env::Default(),
+    &engine_ptr);
+  std::unique_ptr<rocksdb::BackupEngineReadOnly> backup_engine{engine_ptr};
+  const rocksdb::RestoreOptions options = rocksdb::RestoreOptions();
+  if (!s.ok()) {
+    derr << __func__ << " can't open backup folder: " << s.ToString() << dendl;
+    return false;
+  }
+  if (!version) {
+    derr << __func__ << " restore last valid backup" << dendl;
+    s = backup_engine->RestoreDBFromLatestBackup(options, path, path);
+  } else {
+    s = backup_engine->RestoreDBFromBackup(
+        options, static_cast<rocksdb::BackupID>(*version), path, path);
+  }
+  if (!s.ok()) {
+    derr << "Error when restoring backup: " << s.ToString() << dendl;
+  }
+  return s.ok();
+}
+
+namespace {
+
+bool compare_backupinfo_by_timestamp(const rocksdb::BackupInfo& a, const rocksdb::BackupInfo& b)
+{
+  // newest first; tie-break on backup_id so cleanup never keeps an older entry
+  // in preference to a newer one with the same second-resolution timestamp.
+  return std::tie(a.timestamp, a.backup_id) > std::tie(b.timestamp, b.backup_id);
+}
+
+struct TimeBucket {
+  utime_t start;
+  utime_t end;
+  rocksdb::BackupID backup_id;
+
+  TimeBucket(utime_t start, utime_t end) :
+              start(start), end(end), backup_id(0) {}
+};
+
+} // namespace
+
+void RocksDBStore::remove_corrupted_backups(rocksdb::BackupEngine *backup_engine, KeyValueDB::BackupCleanupStats *rv) {
+  std::vector<rocksdb::BackupID> corrupt_backup_ids;
+  backup_engine->GetCorruptedBackups(&corrupt_backup_ids);
+  for (rocksdb::BackupID backup_id : corrupt_backup_ids) {
+    ldout(cct, 1) << __func__ << " delete corrupted backup: " << backup_id << dendl;
+    rocksdb::Status s = backup_engine->DeleteBackup(backup_id);
+    if (!s.ok()) {
+      lderr(cct) << __func__ << " failed to delete corrupted backup "
+                 << backup_id << ": " << s.ToString() << dendl;
+      if (rv) {
+        rv->error = true;
+      }
+      continue;
+    }
+    if (rv) {
+      rv->corrupted++;
+    }
+  }
+}
+
+KeyValueDB::BackupCleanupStats RocksDBStore::backup_cleanup(const std::string &path,
+                                                            uint64_t keep_last,
+                                                            uint64_t keep_hourly,
+                                                            uint64_t keep_daily)
+{
+  ldout(cct, 20) << __func__ << " start backup cleanup" << dendl;
+  std::lock_guard backup_locker{backup_lock};
+  // stamp timestamp up front so every return path (including early Open
+  // failures and empty result) carries a real time for the retry gate.
+  BackupCleanupStats rv;
+  rv.timestamp = ceph_clock_now();
+
+  rocksdb::BackupEngine* engine_ptr = nullptr;
+  rocksdb::Status s = rocksdb::BackupEngine::Open(
+    rocksdb::BackupEngineOptions(path),
+    rocksdb::Env::Default(),
+    &engine_ptr);
+  std::unique_ptr<rocksdb::BackupEngine> backup_engine{engine_ptr};
+  if (!backup_engine || !s.ok()) {
+    // failing to clean backups because the folder is unavailable is a minor problem
+    ldout(cct, 10) << __func__ << " can't clean backups: " << s.ToString() << dendl;
+    rv.error = true;
+    return rv;
+  }
+  std::set<rocksdb::BackupID> keep_backups;
+
+  // remove corrupted backups first
+  remove_corrupted_backups(backup_engine.get(), &rv);
+  ldout(cct, 20) << __func__ << " collect garbage" << dendl;
+
+  std::vector<rocksdb::BackupInfo> backup_infos;
+  backup_engine->GetBackupInfo(&backup_infos);
+
+  if (backup_infos.empty()) {
+    ldout(cct, 15) << __func__ << " no backup infos" << dendl;
+    return rv;
+  }
+  // sort all backups with newest backup first
+  std::stable_sort(backup_infos.begin(), backup_infos.end(), compare_backupinfo_by_timestamp);
+  // always retain the newest backup, regardless of retention settings, so
+  // cleanup can never leave zero backups for a subsequent failed backup.
+  keep_backups.insert(backup_infos.front().backup_id);
+  utime_t now = ceph_clock_now();
+
+  std::vector<TimeBucket> buckets;
+  // half-open intervals [start, end) so adjacent buckets meet without gaps
+  utime_t start = now.round_to_hour();
+  for (uint64_t i = 0; i < keep_hourly; i++) {
+    buckets.push_back(TimeBucket(start, start + utime_t(3600, 0)));
+    start -= 3600.0;
+  }
+  start = now.round_to_day();
+  for (uint64_t i = 0; i < keep_daily; i++) {
+    buckets.push_back(TimeBucket(start, start + utime_t(86400, 0)));
+    start -= 86400.0;
+  }
+
+  size_t i = 0;
+  for (const rocksdb::BackupInfo& bi : backup_infos) {
+    if (i++ < keep_last) {
+      keep_backups.insert(bi.backup_id);
+    }
+    utime_t ts = utime_t(bi.timestamp, 0);
+    for (TimeBucket& bucket : buckets) {
+      if (ts >= bucket.start && ts < bucket.end) {
+        if (bucket.backup_id == 0) {
+          bucket.backup_id = bi.backup_id;
+        }
+      }
+    }
+  }
+  // add the backup chosen for each bucket to the keep set
+  for (const TimeBucket& bucket : buckets) {
+    if (bucket.backup_id) {
+      keep_backups.insert(bucket.backup_id);
+    }
+  }
+
+  rv.kept = keep_backups.size();
+
+  for (const rocksdb::BackupInfo& bi : backup_infos) {
+    if (keep_backups.count(bi.backup_id)) {
+      // payload-sum across kept backups (does not account for file sharing,
+      // so it overstates the on-disk footprint).
+      rv.size += bi.size;
+      continue;
+    }
+    ldout(cct, 10) << __func__ << " delete old backup: " << bi.backup_id << dendl;
+    rocksdb::Status s = backup_engine->DeleteBackup(bi.backup_id);
+    if (!s.ok()) {
+      lderr(cct) << __func__ << " failed to delete backup " << bi.backup_id
+                 << ": " << s.ToString() << dendl;
+      rv.error = true;
+      continue;
+    }
+    rv.freed += bi.size;
+    rv.deleted++;
+  }
+  rv.timestamp = ceph_clock_now();
+  return rv;
+}
+
+std::optional<std::vector<KeyValueDB::BackupStats>>
+RocksDBStore::list_backups(CephContext *cct, const std::string &backup_location) {
+  rocksdb::BackupEngineReadOnly* engine_ptr = nullptr;
+  rocksdb::Status s = rocksdb::BackupEngineReadOnly::Open(
+    rocksdb::BackupEngineOptions(backup_location),
+    rocksdb::Env::Default(),
+    &engine_ptr);
+  std::unique_ptr<rocksdb::BackupEngineReadOnly> backup_engine{engine_ptr};
+
+  if (!backup_engine || !s.ok()) {
+    lderr(cct) << __func__ << " can't open backup location " << backup_location
+               << ": " << s.ToString() << dendl;
+    return std::nullopt;
+  }
+
+  std::vector<rocksdb::BackupInfo> backup_infos;
+  backup_engine->GetBackupInfo(&backup_infos);
+  std::stable_sort(backup_infos.begin(), backup_infos.end(), compare_backupinfo_by_timestamp);
+  std::vector<KeyValueDB::BackupStats> rv;
+  for (const rocksdb::BackupInfo& bi : backup_infos) {
+    KeyValueDB::BackupStats br;
+    br.id = bi.backup_id;
+    br.timestamp = utime_t(bi.timestamp, 0);
+    br.size = bi.size;
+    br.number_files = bi.number_files;
+    rv.push_back(br);
+  }
+  return rv;
+}
+
+
  void RocksDBStore::compact()
  {
    dout(2) << __func__ << " starting" << dendl;
@@ -3386,7 +3667,7 @@ int RocksDBStore::prepare_for_reshard(const std::string& new_sharding,
            << full_name << dendl;
        return -EINVAL;
      }
-    dout(10) << "created column " << full_name << " handle = " << (void*)cf << dendl; 
+    dout(10) << "created column " << full_name << " handle = " << (void*)cf << dendl;
      existing_columns.push_back(full_name);
      handles.push_back(cf);
    }
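The retention rule applied by backup_cleanup() above can be modelled without RocksDB. The following is a minimal sketch under stated assumptions, not the code in this commit: BackupEntry, Bucket and select_backups_to_keep are hypothetical names, timestamps are plain epoch seconds, and hour and day boundaries are computed with UTC modulo arithmetic instead of utime_t::round_to_hour() and round_to_day().

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <tuple>
#include <vector>

// Hypothetical stand-in for rocksdb::BackupInfo: only the fields that the
// retention logic reads.
struct BackupEntry {
  uint32_t id;        // backup id; 0 is treated as "no backup"
  int64_t timestamp;  // seconds since the epoch
};

// Half-open time window [start, end) that retains at most one backup.
struct Bucket {
  int64_t start;
  int64_t end;
  uint32_t id;  // 0 until a backup has been chosen for this window
};

// Return the ids retained by a keep_last / keep_hourly / keep_daily policy.
// `now` is a parameter so that the result is deterministic.
std::set<uint32_t> select_backups_to_keep(std::vector<BackupEntry> backups,
                                          int64_t now,
                                          uint64_t keep_last,
                                          uint64_t keep_hourly,
                                          uint64_t keep_daily) {
  assert(now >= 0);
  std::set<uint32_t> keep;
  if (backups.empty()) {
    return keep;
  }
  // Newest first; equal timestamps are ordered by the higher id.
  std::sort(backups.begin(), backups.end(),
            [](const BackupEntry& a, const BackupEntry& b) {
              return std::tie(a.timestamp, a.id) > std::tie(b.timestamp, b.id);
            });
  keep.insert(backups.front().id);  // the newest backup always survives

  std::vector<Bucket> buckets;
  int64_t start = now - now % 3600;  // start of the current hour
  for (uint64_t i = 0; i < keep_hourly; ++i, start -= 3600) {
    buckets.push_back(Bucket{start, start + 3600, 0});
  }
  start = now - now % 86400;  // start of the current day
  for (uint64_t i = 0; i < keep_daily; ++i, start -= 86400) {
    buckets.push_back(Bucket{start, start + 86400, 0});
  }

  uint64_t rank = 0;
  for (const BackupEntry& b : backups) {
    if (rank++ < keep_last) {
      keep.insert(b.id);
    }
    for (Bucket& bucket : buckets) {
      // Backups arrive newest first, so the first hit wins the window.
      if (bucket.id == 0 && b.timestamp >= bucket.start &&
          b.timestamp < bucket.end) {
        bucket.id = b.id;
      }
    }
  }
  for (const Bucket& bucket : buckets) {
    if (bucket.id != 0) {
      keep.insert(bucket.id);
    }
  }
  return keep;
}
```

A backup can satisfy several rules at once (for example the newest backup usually also wins the current hourly and daily windows), so the number of retained backups is at most, not exactly, the sum of the three settings.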
diff --git a/src/kv/RocksDBStore.h b/src/kv/RocksDBStore.h
index e4dd5ca54c556ddd9a71d1f3240043e778d7aca9..2bd310aa44fcce614ab354b4f29bdc6d64df2e59 100644
--- a/src/kv/RocksDBStore.h
+++ b/src/kv/RocksDBStore.h
@@ -21,6 +21,7 @@
  #include "rocksdb/statistics.h"
  #include "rocksdb/table.h"
  #include "rocksdb/db.h"
+#include "rocksdb/utilities/backup_engine.h"
  #include "kv/rocksdb_cache/BinnedLRUCache.h"
  #include <errno.h>
  #include "common/errno.h"
@@ -115,11 +116,13 @@ public:
                  uint32_t hash_l, uint32_t hash_h)
        : name(name), shard_cnt(shard_cnt), options(options), hash_l(hash_l), hash_h(hash_h) {}
    };
+
  private:
    friend std::ostream& operator<<(std::ostream& out, const ColumnFamily& cf);
  
    bool must_close_default_cf = false;
    rocksdb::ColumnFamilyHandle *default_cf = nullptr;
+  ceph::mutex backup_lock = ceph::make_mutex("RocksDBStore::Backup");
  
    /// column families in use, name->handles
    struct prefix_shards {
@@ -147,6 +150,7 @@ private:
    int do_open(std::ostream &out, bool create_if_missing, bool open_readonly,
               const std::string& cfs="");
    int load_rocksdb_options(bool create_if_missing, rocksdb::Options& opt);
+  void remove_corrupted_backups(rocksdb::BackupEngine *engine, KeyValueDB::BackupCleanupStats *result);
  public:
    static bool parse_sharding_def(const std::string_view text_def,
                                 std::vector<ColumnFamily>& sharding_def,
@@ -212,6 +216,23 @@ public:
      return cct->_conf.get_val<uint64_t>("rocksdb_delete_range_threshold");
    }
  
+  KeyValueDB::BackupStats backup(const std::string &path) override;
+  KeyValueDB::BackupCleanupStats backup_cleanup(const std::string &path,
+                                                uint64_t keep_last,
+                                                uint64_t keep_hourly,
+                                                uint64_t keep_daily) override;
+
+  /// Restore a backup into @p path. @p version is the rocksdb backup id, or
+  /// nullopt for the most recent. Must be called on a closed store.
+  static bool restore_backup(CephContext *cct, const std::string &path,
+                             const std::string &backup_location,
+                             const std::optional<uint32_t> &version);
+
+  /// List backups at @p backup_location, newest first.
+  /// Returns nullopt if the BackupEngine could not be opened.
+  static std::optional<std::vector<BackupStats>> list_backups(
+    CephContext *cct, const std::string &backup_location);
+
    void compact() override;
  
    void compact_async() override {
diff --git a/src/mon/CMakeLists.txt b/src/mon/CMakeLists.txt
index 34f48e55424fd2e9b488f5f98ab01a48e1e506cf..94944d792c171d9abbec154619ef63171445e00b 100644
--- a/src/mon/CMakeLists.txt
+++ b/src/mon/CMakeLists.txt
@@ -10,6 +10,7 @@ set(lib_mon_srcs
    MgrMonitor.cc
    MgrStatMonitor.cc
    Monitor.cc
+  MonitorBackup.cc
    MonmapMonitor.cc
    LogMonitor.cc
    AuthMonitor.cc
diff --git a/src/mon/Monitor.cc b/src/mon/Monitor.cc
index 205fb23c7ceaeb8739f63131aa369e736ab00e77..8ef4bdbbd3d82fccb93146e0e83451fbccdbb49a 100644
--- a/src/mon/Monitor.cc
+++ b/src/mon/Monitor.cc
@@ -41,6 +41,7 @@
  #include "MonitorDBStore.h"
  #include "MonMap.h"
  #include "Paxos.h"
+#include "MonitorBackup.h"
  
  #include "messages/PaxosServiceMessage.h"
  #include "messages/MMonCommand.h"
@@ -549,7 +550,11 @@ will start to track new ops received afterwards.";
             << duration << " seconds" << dendl;
      out << "compacted " << g_conf().get_val<std::string>("mon_keyvaluedb")
         << " in " << duration << " seconds";
- } else {
+  } else if (command == "backup") {
+    r = perform_backup();
+  } else if (command == "backup_cleanup") {
+    r = cleanup_backup();
+  } else {
      ceph_abort_msg("bad AdminSocket command binding");
    }
    (read_only ? audit_clog->debug() : audit_clog->info())
@@ -568,6 +573,40 @@ abort:
    return r;
  }
  
+int Monitor::perform_backup()
+{
+  std::string backup_path = g_conf().get_val<string>("mon_backup_path");
+  dout(1) << "triggering backup" << dendl;
+  if (backup_path.empty()) {
+    derr << "backup failed: mon_backup_path is empty" << dendl;
+    return -ENOTDIR;
+  }
+  if (!backup_manager) {
+    derr << "backup failed: monitor still initializing" << dendl;
+    return -EAGAIN;
+  }
+  uint64_t jobid = backup_manager->backup();
+  dout(1) << "queued backup job id " << jobid << dendl;
+  return 0;
+}
+
+int Monitor::cleanup_backup()
+{
+  std::string backup_path = g_conf().get_val<string>("mon_backup_path");
+  dout(1) << "triggering backup_cleanup" << dendl;
+  if (backup_path.empty()) {
+    derr << "backup_cleanup failed: mon_backup_path is empty" << dendl;
+    return -ENOTDIR;
+  }
+  if (!backup_manager) {
+    derr << "backup_cleanup failed: monitor still initializing" << dendl;
+    return -EAGAIN;
+  }
+  uint64_t jobid = backup_manager->cleanup();
+  dout(1) << "queued backup cleanup job id " << jobid << dendl;
+  return 0;
+}
+
  void Monitor::handle_signal(int signum)
  {
    derr << "*** Got Signal " << sig_str(signum) << " ***" << dendl;
@@ -825,6 +864,44 @@ int Monitor::preinit()
          "ewon", PerfCountersBuilder::PRIO_INTERESTING);
      pcb.add_u64_counter(l_mon_election_lose, "election_lose", "Elections lost",
          "elst", PerfCountersBuilder::PRIO_INTERESTING);
+    pcb.add_u64(l_mon_backup_running, "backup_running", "Mon backup process is running",
+        nullptr, PerfCountersBuilder::PRIO_USEFUL);
+    pcb.add_u64_counter(l_mon_backup_started, "backup_started", "Mon backups started",
+        nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+    pcb.add_u64_counter(l_mon_backup_success, "backup_success", "Mon backups finished successfully",
+        nullptr, PerfCountersBuilder::PRIO_USEFUL);
+    pcb.add_u64_counter(l_mon_backup_failed, "backup_failed", "Mon backups failed",
+        nullptr, PerfCountersBuilder::PRIO_USEFUL);
+    pcb.add_time_avg(l_mon_backup_duration, "backup_duration", "Mon backup duration",
+        nullptr, PerfCountersBuilder::PRIO_USEFUL);
+    pcb.add_time(l_mon_backup_last_success, "backup_last_success", "Last successful mon backup",
+        nullptr, PerfCountersBuilder::PRIO_USEFUL);
+    pcb.add_u64(l_mon_backup_last_success_id, "backup_last_success_id", "Last successful mon backup id",
+        nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+    pcb.add_time(l_mon_backup_last_failed, "backup_last_failed", "Last failed mon backup",
+        nullptr, PerfCountersBuilder::PRIO_USEFUL);
+    pcb.add_u64(l_mon_backup_last_size, "backup_last_size", "Last backup size",
+        nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+    pcb.add_u64(l_mon_backup_last_files, "backup_last_files", "Number of files in last backup",
+        nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+    pcb.add_u64_counter(l_mon_backup_cleanup_started, "backup_cleanup_started", "Mon backup cleanup started",
+        nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+    pcb.add_u64(l_mon_backup_cleanup_running, "backup_cleanup_running", "Mon backup cleanup is running",
+        nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+    pcb.add_u64_counter(l_mon_backup_cleanup_success, "backup_cleanup_success", "Mon backup cleanup finished successfully",
+        nullptr, PerfCountersBuilder::PRIO_USEFUL);
+    pcb.add_u64_counter(l_mon_backup_cleanup_failed, "backup_cleanup_failed", "Mon backup cleanup failed",
+        nullptr, PerfCountersBuilder::PRIO_USEFUL);
+    pcb.add_u64(l_mon_backup_cleanup_size, "backup_cleanup_size", "Size of backups removed",
+        nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+    pcb.add_u64(l_mon_backup_cleanup_kept, "backup_cleanup_kept", "Number of backups kept after cleanup",
+        nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+    pcb.add_time_avg(l_mon_backup_cleanup_duration, "backup_cleanup_duration", "Mon backup cleanup duration",
+        nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+    pcb.add_u64(l_mon_backup_cleanup_freed, "backup_cleanup_freed", "Mon backup cleanup freed size in bytes",
+        nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+    pcb.add_u64(l_mon_backup_cleanup_deleted, "backup_cleanup_deleted", "Mon backup cleanup deleted backups",
+        nullptr, PerfCountersBuilder::PRIO_INTERESTING);
      logger = pcb.create_perf_counters();
      cct->get_perfcounters_collection()->add(logger);
    }
@@ -983,6 +1060,13 @@ int Monitor::preinit()
                                        command.helpstring);
      ceph_assert(r == 0);
    }
+  r = admin_socket->register_command("backup", admin_hook,
+                                    "create a backup of the mon database");
+  ceph_assert(r == 0);
+  r = admin_socket->register_command(
+    "backup_cleanup", admin_hook,
+    "delete old mon database backups according to retention config");
+  ceph_assert(r == 0);
    l.lock();
  
    // add ourselves as a conf observer
@@ -1031,6 +1115,9 @@ int Monitor::init()
  
    // add features of myself into feature_map
    session_map.feature_map.add_mon(con_self->get_features());
+
+  backup_manager = std::make_unique<MonitorBackupManager>(cct, this);
+
    return 0;
  }
  
@@ -1126,6 +1213,9 @@ void Monitor::shutdown()
      delete admin_hook;
      admin_hook = NULL;
    }
+  if (backup_manager) {
+    backup_manager->stop();
+  }
  
    elector.shutdown();
  
@@ -6119,6 +6209,7 @@ void Monitor::tick()
      prepare_new_fingerprint(t);
      paxos->trigger_propose();
    }
+  backup_manager->tick();
  
    mgr_client.update_daemon_health(get_health_metrics());
    new_tick();
diff --git a/src/mon/Monitor.h b/src/mon/Monitor.h
index cc7d7b12e0278dd6ee27da1064bd123b0fb0dbe5..a914969ed35374eb34459ab504f3f36cadf48588 100644
--- a/src/mon/Monitor.h
+++ b/src/mon/Monitor.h
@@ -50,6 +50,7 @@
  #include "include/CompatSet.h"
  #include "mon/MonitorDBStore.h"
  #include "mon/mon_types.h" // for Metadata, PAXOS_*, ScrubResult
+#include "mon/MonitorBackup.h"
  #include "mgr/MgrClient.h"
  #include <boost/smart_ptr/atomic_shared_ptr.hpp>
  #include <boost/smart_ptr/shared_ptr.hpp>
@@ -100,6 +101,25 @@ enum {
    l_mon_election_call,
    l_mon_election_win,
    l_mon_election_lose,
+  l_mon_backup_running,
+  l_mon_backup_started,
+  l_mon_backup_success,
+  l_mon_backup_failed,
+  l_mon_backup_duration,
+  l_mon_backup_last_success,
+  l_mon_backup_last_success_id,
+  l_mon_backup_last_failed,
+  l_mon_backup_last_size,
+  l_mon_backup_last_files,
+  l_mon_backup_cleanup_started,
+  l_mon_backup_cleanup_running,
+  l_mon_backup_cleanup_success,
+  l_mon_backup_cleanup_failed,
+  l_mon_backup_cleanup_size,
+  l_mon_backup_cleanup_kept,
+  l_mon_backup_cleanup_duration,
+  l_mon_backup_cleanup_freed,
+  l_mon_backup_cleanup_deleted,
    l_mon_last,
  };
  
@@ -1001,6 +1021,8 @@ private:
  
    OpTracker op_tracker;
  
+  std::unique_ptr<MonitorBackupManager> backup_manager;
+
   public:
    Monitor(CephContext *cct_, std::string nm, MonitorDBStore *s,
           Messenger *m, Messenger *mgr_m, MonMap *map);
@@ -1046,6 +1068,10 @@ private:
                        std::ostream& err,
                        std::ostream& out);
  
+  // Queue a mon database backup or a backup cleanup on the backup manager
+  int perform_backup();
+  int cleanup_backup();
+
  private:
    // don't allow copying
    Monitor(const Monitor& rhs);
diff --git a/src/mon/MonitorBackup.cc b/src/mon/MonitorBackup.cc
new file mode 100644
index 0000000..aff1beb
--- /dev/null
+++ b/src/mon/MonitorBackup.cc
@@ -0,0 +1,214 @@
+// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
+// vim: ts=8 sw=2 smarttab
+/*
+* Ceph - scalable distributed file system
+*
+* Copyright (C) 2021 B1-Systems GmbH
+*
+* This is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License version 2.1, as published by the Free Software
+* Foundation. See file COPYING.
+*/
+
+#include <chrono>
+#include <filesystem>
+
+#include "include/util.h"
+#include "mon/MonitorBackup.h"
+#include "mon/Monitor.h"
+
+#define dout_subsys ceph_subsys_mon
+#undef dout_context
+#define dout_context cct
+
+namespace fs = std::filesystem;
+
+/***
+ * Thread which runs monitor backup operations
+ */
+void *MonitorBackupManager::entry() {
+  std::unique_lock lock{mutex};
+  auto wakeup_predicate = [this] {
+    return should_stop || should_backup || should_cleanup || wakeup_pending;
+  };
+  while (true) {
+    // Wait for any signal (tick, request, or stop) before doing
+    // scheduled work. Predicate-based so notifications delivered
+    // before the worker entered wait() are not lost, and so we don't
+    // fire scheduled work while init() is still wiring the mon up.
+    work_cond.wait(lock, wakeup_predicate);
+    if (should_stop) {
+      return nullptr;
+    }
+    wakeup_pending = false;
+
+    auto now = ceph_clock_now();
+    std::string backup_path = cct->_conf.get_val<std::string>("mon_backup_path");
+    auto interval = cct->_conf.get_val<std::chrono::seconds>("mon_backup_interval");
+    auto cleanup_interval = cct->_conf.get_val<std::chrono::seconds>("mon_backup_cleanup_interval");
+    bool path_ok = !backup_path.empty();
+
+    bool timer_backup = false;
+    bool timer_cleanup = false;
+
+    if (path_ok && interval.count() > 0 &&
+        (mon->is_leader() || mon->is_peon())) {
+      if (!last_backup) {
+        dout(10) << "trigger first timed backup" << dendl;
+        timer_backup = true;
+      } else if ((now - last_backup->timestamp) >= utime_t(interval.count(), 0)) {
+        dout(10) << "trigger timed backup" << dendl;
+        timer_backup = true;
+      }
+    }
+    if (path_ok && cleanup_interval.count() > 0) {
+      if (!last_cleanup) {
+        dout(10) << "trigger first timed backup cleanup" << dendl;
+        timer_cleanup = true;
+      } else if ((now - last_cleanup->timestamp) >= utime_t(cleanup_interval.count(), 0)) {
+        dout(10) << "trigger timed backup cleanup" << dendl;
+        timer_cleanup = true;
+      }
+    }
+
+    bool run_cleanup = should_cleanup || timer_cleanup;
+    bool run_backup = should_backup || timer_backup;
+
+    should_backup = false;
+    should_cleanup = false;
+
+    if (!run_backup && !run_cleanup) {
+      continue;
+    }
+
+    lock.unlock();
+    if (run_cleanup) {
+      do_cleanup();
+    }
+    if (run_backup) {
+      do_backup();
+    }
+    lock.lock();
+  }
+}
+
+void MonitorBackupManager::stop() {
+  {
+    std::lock_guard guard{mutex};
+    if (should_stop) {
+      return;
+    }
+    should_stop = true;
+    work_cond.notify_one();
+  }
+  join();
+}
+
+void MonitorBackupManager::do_cleanup() {
+  dout(5) << "start backup cleanup" << dendl;
+  mon->logger->inc(l_mon_backup_cleanup_started);
+  mon->logger->set(l_mon_backup_cleanup_running, 1);
+  auto start = ceph_clock_now();
+  KeyValueDB::BackupCleanupStats stats = mon->store->backup_cleanup();
+  mon->logger->set(l_mon_backup_cleanup_size, stats.size);
+  mon->logger->set(l_mon_backup_cleanup_kept, stats.kept);
+  mon->logger->set(l_mon_backup_cleanup_freed, stats.freed);
+  mon->logger->set(l_mon_backup_cleanup_deleted, stats.deleted);
+  if (stats.error) {
+    mon->logger->inc(l_mon_backup_cleanup_failed);
+  } else {
+    mon->logger->inc(l_mon_backup_cleanup_success);
+  }
+  auto ptr = std::make_shared<KeyValueDB::BackupCleanupStats>(stats);
+  last_cleanup.swap(ptr);
+  auto end = ceph_clock_now();
+  utime_t duration = end - start;
+  mon->logger->tinc(l_mon_backup_cleanup_duration, duration);
+  mon->logger->set(l_mon_backup_cleanup_running, 0);
+}
+
+void MonitorBackupManager::record_last_backup(std::shared_ptr<KeyValueDB::BackupStats> stats) {
+  if (stats->error && last_backup) {
+    stats->id = last_backup->id;
+  }
+  last_backup.swap(stats);
+}
+
+// Returns true if there is enough free space on the backup volume.
+bool MonitorBackupManager::check_free_space() {
+  auto backup_path = cct->_conf.get_val<std::string>("mon_backup_path");
+
+  std::error_code ec;
+  if (!fs::exists(backup_path, ec)) {
+    if (!fs::create_directories(backup_path, ec)) {
+      dout(1) << "failed to create monitor backup directory '"
+              << backup_path << "': " << ec.message() << dendl;
+      return false;
+    }
+    fs::permissions(backup_path, fs::perms::owner_all,
+                    fs::perm_options::replace, ec);
+    if (ec) {
+      dout(1) << "failed to set permissions on monitor backup directory '"
+              << backup_path << "': " << ec.message() << dendl;
+      return false;
+    }
+    dout(5) << "created monitor backup directory '" << backup_path
+            << "'" << dendl;
+  }
+
+  ceph_data_stats_t stats;
+  int err = get_fs_stats(stats, backup_path.c_str());
+  if (err < 0) {
+    dout(1) << "error checking monitor backup directory: " << cpp_strerror(err)
+            << dendl;
+    return false;
+  }
+
+  if (stats.avail_percent <= cct->_conf.get_val<int64_t>("mon_backup_min_avail")) {
+    dout(1) << "ERROR: not enough disk space to start backup: " << "(available: "
+            << stats.avail_percent << "% " << byte_u_t(stats.byte_avail) << ")\n"
+            << "run backup_cleanup regularly or decrease mon_backup_min_avail" << dendl;
+    return false;
+  }
+  return true;
+}
+
+void MonitorBackupManager::do_backup() {
+  dout(1) << "start backup" << dendl;
+  mon->logger->inc(l_mon_backup_started);
+  mon->logger->set(l_mon_backup_running, 1);
+  auto start = ceph_clock_now();
+
+  std::shared_ptr<KeyValueDB::BackupStats> result;
+
+  if (!check_free_space()) {
+    mon->logger->inc(l_mon_backup_failed);
+    mon->logger->tset(l_mon_backup_last_failed, start);
+    result = std::make_shared<KeyValueDB::BackupStats>();
+    result->error = true;
+    result->timestamp = start;
+    result->msg = "insufficient free space";
+  } else {
+    KeyValueDB::BackupStats stats = mon->store->backup();
+    utime_t duration = ceph_clock_now() - start;
+    mon->logger->tinc(l_mon_backup_duration, duration);
+    mon->logger->set(l_mon_backup_last_size, stats.size);
+    mon->logger->set(l_mon_backup_last_files, stats.number_files);
+    if (stats.error) {
+      mon->logger->inc(l_mon_backup_failed);
+      mon->logger->tset(l_mon_backup_last_failed, stats.timestamp);
+      dout(1) << "failed backup in " << utimespan_str(duration) << dendl;
+    } else {
+      mon->logger->inc(l_mon_backup_success);
+      mon->logger->tset(l_mon_backup_last_success, stats.timestamp);
+      mon->logger->set(l_mon_backup_last_success_id, stats.id);
+      dout(1) << "finished backup in " << utimespan_str(duration) << dendl;
+    }
+    result = std::make_shared<KeyValueDB::BackupStats>(stats);
+  }
+
+  record_last_backup(result);
+  mon->logger->set(l_mon_backup_running, 0);
+}
+
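The free-space gate in check_free_space() above can be approximated with the standard library alone. This is a minimal sketch, assuming that the available percentage is available bytes divided by capacity; has_enough_free_space is a hypothetical helper and std::filesystem::space stands in for Ceph's get_fs_stats(), so the exact numbers may differ from what the monitor computes.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <system_error>

// Returns true when the volume that holds `dir` has strictly more than
// `min_avail_percent` percent of its capacity available. Any error while
// querying the filesystem (for example a missing directory) counts as
// "not enough space", so the caller skips the backup.
bool has_enough_free_space(const std::filesystem::path& dir,
                           int64_t min_avail_percent) {
  std::error_code ec;
  const std::filesystem::space_info si = std::filesystem::space(dir, ec);
  if (ec || si.capacity == 0) {
    return false;
  }
  // Floating-point division avoids overflow on very large volumes.
  const double avail_percent =
      100.0 * static_cast<double>(si.available) /
      static_cast<double>(si.capacity);
  assert(avail_percent >= 0.0);
  return avail_percent > static_cast<double>(min_avail_percent);
}
```

As in the patch, the comparison is strict: a volume sitting exactly at the threshold is rejected.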
diff --git a/src/mon/MonitorBackup.h b/src/mon/MonitorBackup.h
new file mode 100644
index 0000000..74d5fc8
--- /dev/null
+++ b/src/mon/MonitorBackup.h
@@ -0,0 +1,101 @@
+// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
+// vim: ts=8 sw=2 smarttab
+/*
+* Ceph - scalable distributed file system
+*
+* Copyright (C) 2021 B1-Systems GmbH
+*
+* This is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License version 2.1, as published by the Free Software
+* Foundation. See file COPYING.
+*/
+
+
+#ifndef CEPH_MONITOR_BACKUP_H
+#define CEPH_MONITOR_BACKUP_H
+
+#include <cstdint>
+#include <memory>
+#include <mutex>
+#include <string>
+
+
+#include "common/Thread.h"
+#include "common/ceph_context.h"
+#include "common/ceph_mutex.h"
+#include "kv/KeyValueDB.h"
+#include "mon/MonitorDBStore.h"
+
+class Monitor;
+
+class MonitorBackupManager : public Thread {
+  CephContext *cct;
+  Monitor *mon;
+  ceph::mutex mutex;
+  ceph::condition_variable work_cond;
+  bool should_stop{false};
+  // set by tick(); a sticky flag so a notification delivered before the
+  // worker enters wait() is not lost. cleared each time the worker
+  // re-evaluates timer triggers.
+  bool wakeup_pending{false};
+
+  bool should_backup{false};
+  bool should_cleanup{false};
+  uint64_t last_job_id{0};
+  std::shared_ptr<KeyValueDB::BackupCleanupStats> last_cleanup;
+  std::shared_ptr<KeyValueDB::BackupStats> last_backup;
+
+  void do_backup();
+  void do_cleanup();
+  bool check_free_space();
+  void record_last_backup(std::shared_ptr<KeyValueDB::BackupStats> stats);
+protected:
+  void *entry() override;
+
+public:
+  explicit MonitorBackupManager(CephContext *cct, Monitor *monitor) :
+    cct(cct),
+    mon(monitor),
+    mutex(ceph::make_mutex("mon::BackupManager::mutex")) {
+      create("mon::backups");
+  }
+
+  void tick() {
+    std::lock_guard guard{mutex};
+    if (should_stop) {
+      return;
+    }
+    wakeup_pending = true;
+    work_cond.notify_one();
+  }
+
+  /**
+   * Stop the backup manager thread. Safe to call more than once.
+   **/
+  void stop();
+  /**
+   * Queue a new backup; the worker thread picks it up asynchronously.
+   * @returns {uint64_t} new job id
+   **/
+  uint64_t backup() {
+    std::lock_guard guard{mutex};
+    should_backup = true;
+    uint64_t rv = ++last_job_id;
+    work_cond.notify_one();
+    return rv;
+  }
+
+  /// Queue a cleanup pass. Returns the new job id.
+  uint64_t cleanup() {
+    std::lock_guard guard{mutex};
+    should_cleanup = true;
+    uint64_t rv = ++last_job_id;
+    work_cond.notify_one();
+    return rv;
+  }
+
+};
+
+#endif
+
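The comment on `wakeup_pending` describes a sticky flag that keeps a notification from being lost when it arrives before the worker starts waiting. A minimal sketch of that pattern follows, using the standard library instead of `ceph::mutex` and `ceph::condition_variable`. The class `StickyWakeup` and its method `wait_for_wakeup()` are hypothetical names for illustration; the real worker loop lives in `MonitorBackupManager::entry()`, which is not shown in this header, and may clear the flag at a different point.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the signalling state in MonitorBackupManager.
class StickyWakeup {
  std::mutex mutex;
  std::condition_variable cond;
  bool wakeup_pending = false;
  bool should_stop = false;

public:
  // Producer side, mirroring tick(): latch the flag, then notify.
  void tick() {
    std::lock_guard<std::mutex> guard(mutex);
    if (should_stop) {
      return;
    }
    wakeup_pending = true;
    cond.notify_one();
  }

  // Ask the consumer to give up; later ticks are ignored.
  void stop() {
    std::lock_guard<std::mutex> guard(mutex);
    should_stop = true;
    cond.notify_one();
  }

  // Consumer side (assumed shape): returns true if a wakeup was latched,
  // false on timeout or stop. The predicate is checked before blocking,
  // so a tick() that happened earlier is still observed.
  bool wait_for_wakeup(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex);
    cond.wait_for(lock, timeout,
                  [this] { return wakeup_pending || should_stop; });
    if (should_stop || !wakeup_pending) {
      return false;
    }
    wakeup_pending = false;  // consume the latched wakeup
    return true;
  }
};
```

With a bare `notify_one()` and no flag, a notification sent while the worker is busy would be dropped, because condition variables do not queue notifications. Latching the request in a boolean under the same mutex lets the predicate see it on the next wait.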
diff --git a/src/mon/MonitorDBStore.h b/src/mon/MonitorDBStore.h

index a645bb7e4ae7b1a43be0e86542419dcfc1f92069..c316d23768b69a1f4cac560d6462ccbdb4f8108a 100644 (file)
--- a/src/mon/MonitorDBStore.h
+++ b/src/mon/MonitorDBStore.h
@@ -14,6 +14,8 @@
  #ifndef CEPH_MONITOR_DB_STORE_H
  #define CEPH_MONITOR_DB_STORE_H
  
+#include <algorithm>
+#include <filesystem>
  #include <set>
  #include <map>
  #include <string>
@@ -32,6 +34,7 @@
  #include "common/Clock.h"
  #include "common/debug.h"
  #include "common/safe_io.h"
+#include "common/strtol.h"
  #include "common/blkdev.h"
  #include "common/PriorityCache.h"
  #include "common/version.h"
@@ -65,6 +68,11 @@ class MonitorDBStore
      return path;
    }
  
+  // returns the database store path
+  static std::string get_store_path(const std::string& path) {
+    return (std::filesystem::path(path) / "store.db").string();
+  }
+
    std::shared_ptr<PriorityCache::PriCache> get_priority_cache() const {
      return db->get_priority_cache();
    }
@@ -631,14 +639,7 @@ class MonitorDBStore
    }
  
    void _open(const std::string& kv_type) {
-    int pos = 0;
-    for (auto rit = path.rbegin(); rit != path.rend(); ++rit, ++pos) {
-      if (*rit != '/')
-       break;
-    }
-    std::ostringstream os;
-    os << path.substr(0, path.size() - pos) << "/store.db";
-    std::string full_path = os.str();
+    std::string full_path = get_store_path(path);
  
      KeyValueDB *db_ptr = KeyValueDB::create(g_ceph_context,
                                             kv_type,
@@ -691,7 +692,7 @@ class MonitorDBStore
      if (r < 0)
        return r;
  
-    // Monitors are few in number, so the resource cost of exposing 
+    // Monitors are few in number, so the resource cost of exposing
      // very detailed stats is low: ramp up the priority of all the
      // KV store's perf counters.  Do this after open, because backend may
      // not have constructed PerfCounters earlier.
@@ -743,6 +744,200 @@ class MonitorDBStore
      db.reset(NULL);
    }
  
+  /// @brief Creates a backup of the database under mon_backup_path.
+  /// @return stats describing the created backup
+  KeyValueDB::BackupStats backup() {
+    auto backup_path = g_conf().get_val<std::string>("mon_backup_path");
+    auto stats = db->backup(backup_path);
+    if (!stats.error) {
+      // Stash the mon keyring alongside the rocksdb backup, keyed by
+      // backup id, so a restore of an older version is paired with the
+      // keyring of that vintage. The [mon.] secret can rotate; using a
+      // single fixed filename would break authentication after restore.
+      std::error_code ec;
+      auto dest = backup_path + "/keyring." + std::to_string(stats.id);
+      std::filesystem::copy_file(
+        path + "/keyring", dest,
+        std::filesystem::copy_options::overwrite_existing, ec);
+      if (!ec) {
+        std::filesystem::permissions(dest,
+          std::filesystem::perms::owner_read | std::filesystem::perms::owner_write,
+          std::filesystem::perm_options::replace, ec);
+      }
+      if (ec) {
+        // Best-effort: a rocksdb backup without a stashed keyring is still
+        // valid; the operator can supply a keyring out-of-band on restore.
+        derr << __func__ << " failed to stash keyring at "
+             << dest << ": " << ec.message() << dendl;
+      }
+    }
+    return stats;
+  }
+
+  /// @brief Remove old backups in mon_backup_path according to the retention config.
+  /// @return stats describing what was kept, deleted, and freed
+  KeyValueDB::BackupCleanupStats backup_cleanup() {
+    auto backup_path = g_conf().get_val<std::string>("mon_backup_path");
+    auto stats = db->backup_cleanup(
+      backup_path,
+      g_conf().get_val<uint64_t>("mon_backup_keep_last"),
+      g_conf().get_val<uint64_t>("mon_backup_keep_hourly"),
+      g_conf().get_val<uint64_t>("mon_backup_keep_daily"));
+    if (stats.error) {
+      return stats;
+    }
+    // Remove keyring.<id> files for backup ids the kv layer just dropped.
+    std::string kv_type;
+    if (read_meta("kv_backend", &kv_type) < 0 || kv_type.empty()) {
+      kv_type = "rocksdb";
+    }
+    std::set<uint64_t> surviving;
+    auto remaining = KeyValueDB::list_backups(g_ceph_context, kv_type, backup_path);
+    if (!remaining) {
+      return stats;
+    }
+    for (const auto& b : *remaining) {
+      surviving.insert(b.id);
+    }
+    std::error_code ec;
+    for (auto it = std::filesystem::directory_iterator(backup_path, ec);
+         it != std::filesystem::directory_iterator();
+         it.increment(ec)) {
+      auto name = it->path().filename().string();
+      if (name.compare(0, 8, "keyring.") != 0) {
+        continue;
+      }
+      std::string idstr = name.substr(8);
+      std::string parse_err;
+      long long id = strict_strtoll(idstr.c_str(), 10, &parse_err);
+      if (!parse_err.empty() || id < 0) {
+        continue;
+      }
+      if (surviving.count(static_cast<uint64_t>(id))) {
+        continue;
+      }
+      std::error_code rm_ec;
+      std::filesystem::remove(it->path(), rm_ec);
+    }
+    return stats;
+  }
+
+  /// @brief List all backup versions at backup_path.
+  /// @param cct ceph context
+  /// @param path path to the local mon data dir (used to discover the kv backend)
+  /// @param backup_path path to the backup location
+  /// @return list of BackupStats, one per backup, or std::nullopt if the backups could not be listed
+  static std::optional<std::vector<KeyValueDB::BackupStats>> list_backups(
+    CephContext *cct, const std::string &path, const std::string &backup_path) {
+    std::string kv_type;
+    int r = read_meta_path("kv_backend", &kv_type, path);
+    if (r < 0 || kv_type.empty()) {
+      // Disaster recovery: mon_data may be empty or absent. We only ship
+      // a rocksdb kv backend today, so default to it for enumeration.
+      kv_type = "rocksdb";
+    }
+    return KeyValueDB::list_backups(cct, kv_type, backup_path);
+  }
+
+
+  /// @brief Restore the backup with the given version from backup_path into path.
+  /// @param cct ceph context
+  /// @param path path to the local mon data dir to restore into
+  /// @param backup_path path to the backup location
+  /// @param version version of the backup to restore (nullopt for latest)
+  /// @return true on success
+  static bool restore_backup(CephContext *cct, const std::string &path,
+                             const std::string &backup_path,
+                             const std::optional<uint32_t> &version) {
+    std::string kv_type;
+    int r = read_meta_path("kv_backend", &kv_type, path);
+    if (r < 0 || kv_type.empty()) {
+      // Disaster recovery: mon_data is empty or freshly initialized, so
+      // there is no kv_backend marker. Default to rocksdb and stamp the
+      // file back so the subsequent open() finds it.
+      kv_type = "rocksdb";
+      std::error_code ec;
+      std::filesystem::create_directories(path, ec);
+      if (ec) {
+        lderr(cct) << __func__ << " failed to create " << path
+                   << ": " << ec.message() << dendl;
+        return false;
+      }
+      std::filesystem::permissions(path,
+        std::filesystem::perms::owner_all,
+        std::filesystem::perm_options::replace, ec);
+      const std::string v = kv_type + "\n";
+      if (safe_write_file(path.c_str(), "kv_backend",
+                          v.c_str(), v.length(), 0600) < 0) {
+        lderr(cct) << __func__ << " failed to write kv_backend in "
+                   << path << dendl;
+        return false;
+      }
+    }
+    std::string store_path = get_store_path(path);
+
+    // Resolve "latest" up front so we know which versioned keyring to
+    // rehydrate alongside the rocksdb restore. Pick by BackupEngine id
+    // (monotonic per rocksdb) rather than timestamp, so a clock skew
+    // between backups cannot make the default restore pick a stale one.
+    uint32_t resolved_version;
+    if (version) {
+      resolved_version = *version;
+    } else {
+      auto backups = KeyValueDB::list_backups(cct, kv_type, backup_path);
+      if (!backups || backups->empty()) {
+        lderr(cct) << __func__ << " no backups found at " << backup_path << dendl;
+        return false;
+      }
+      resolved_version = std::max_element(
+        backups->begin(), backups->end(),
+        [](const auto& a, const auto& b) { return a.id < b.id; })->id;
+    }
+
+    if (!KeyValueDB::restore_backup(cct, kv_type, store_path, backup_path,
+                                    resolved_version)) {
+      return false;
+    }
+
+    // Rehydrate the matching keyring (skipped silently if the operator
+    // keeps the keyring out-of-band).
+    std::error_code ec;
+    auto keyring_src = backup_path + "/keyring." + std::to_string(resolved_version);
+    if (std::filesystem::exists(keyring_src, ec)) {
+      std::filesystem::copy_file(
+        keyring_src,
+        path + "/keyring",
+        std::filesystem::copy_options::overwrite_existing,
+        ec);
+      if (ec) {
+        lderr(cct) << __func__ << " failed to restore keyring from "
+                   << keyring_src << ": " << ec.message() << dendl;
+        return false;
+      }
+    }
+
+    // The mon store holds auth, config-key and dm-crypt secrets;
+    // tighten everything we just restored to owner-only.
+    std::filesystem::permissions(path,
+      std::filesystem::perms::owner_all,
+      std::filesystem::perm_options::replace, ec);
+    for (auto it = std::filesystem::recursive_directory_iterator(path, ec);
+         it != std::filesystem::recursive_directory_iterator();
+         it.increment(ec)) {
+      std::error_code ec_chmod;
+      auto perms = it->is_directory(ec_chmod)
+        ? std::filesystem::perms::owner_all
+        : (std::filesystem::perms::owner_read | std::filesystem::perms::owner_write);
+      std::filesystem::permissions(it->path(), perms,
+        std::filesystem::perm_options::replace, ec_chmod);
+      if (ec_chmod) {
+        lderr(cct) << __func__ << " failed to chmod " << it->path()
+                   << ": " << ec_chmod.message() << dendl;
+      }
+    }
+    return true;
+  }
+
    void compact() {
      db->compact();
    }
@@ -788,7 +983,7 @@ class MonitorDBStore
    /**
     * read_meta - read a simple configuration key out-of-band
     *
-   * Read a simple key value to an unopened/mounted store.
+   * Read a simple key value from an unopened/unmounted store.
     *
     * Trailing whitespace is stripped off.
     *
@@ -798,6 +993,24 @@ class MonitorDBStore
     */
    int read_meta(const std::string& key,
                 std::string *value) const {
+    return read_meta_path(key, value, path);
+  }
+
+  /**
+   * read_meta_path - read a simple configuration key out-of-band
+   *
+   * Read a simple key value from the store directory at the given path.
+   *
+   * Trailing whitespace is stripped off.
+   *
+   * @param key key name
+   * @param value pointer to value string
+   * @param path path to directory
+   * @returns 0 for success, or an error code
+   */
+  static int read_meta_path(const std::string& key,
+                            std::string *value,
+                            const std::string& path) {
      char buf[4096];
      int r = safe_read_file(path.c_str(), key.c_str(),
                            buf, sizeof(buf));
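The keyring pruning in `backup_cleanup()` above comes down to two decisions per directory entry: does the name look like `keyring.<id>`, and is that id still among the surviving backups? The sketch below isolates that logic as pure functions so it can be checked without a filesystem. The names `parse_keyring_id()` and `is_orphaned_keyring()` are hypothetical, and the digit parser is a simple stand-in for Ceph's `strict_strtoll()`, so edge cases such as signs or whitespace may be treated differently from the real code.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <set>
#include <string>

// Parse "keyring.<id>". Returns nullopt for unrelated names, an empty
// id, non-digit characters, or an id that would overflow uint64_t.
std::optional<uint64_t> parse_keyring_id(const std::string& name) {
  static const std::string prefix = "keyring.";
  if (name.compare(0, prefix.size(), prefix) != 0) {
    return std::nullopt;
  }
  const std::string digits = name.substr(prefix.size());
  if (digits.empty()) {
    return std::nullopt;
  }
  uint64_t id = 0;
  for (char c : digits) {
    if (c < '0' || c > '9') {
      return std::nullopt;
    }
    const uint64_t d = static_cast<uint64_t>(c - '0');
    if (id > (std::numeric_limits<uint64_t>::max() - d) / 10) {
      return std::nullopt;  // id * 10 + d would overflow
    }
    id = id * 10 + d;
  }
  return id;
}

// A stashed keyring is orphaned when its id parses cleanly and no
// surviving backup carries that id. Unparseable names are left alone.
bool is_orphaned_keyring(const std::string& name,
                         const std::set<uint64_t>& surviving) {
  const auto id = parse_keyring_id(name);
  return id.has_value() && surviving.count(*id) == 0;
}
```

Skipping names that fail to parse, rather than deleting them, matches the cautious behaviour of the loop in the patch: only files that are positively identified as keyrings of dropped backups are removed.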
diff --git a/src/vstart.sh b/src/vstart.sh

index b478c00b090513268fd90e31a0af872b01aeada8..7f5421bffced235c7afc69348c6c3aabde45c806 100755 (executable)
--- a/src/vstart.sh
+++ b/src/vstart.sh
@@ -1157,6 +1157,7 @@ start_mon() {
  [mon.$f]
          host = $HOSTNAME
          mon data = $CEPH_DEV_DIR/mon.$f
+        mon backup path = $CEPH_DEV_DIR/mon.$f-backup
  EOF
              count=$(($count + 2))
          done
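Returning to the `get_store_path()` helper added in src/mon/MonitorDBStore.h: it replaces the hand-written trailing-slash loop in `_open()` with `std::filesystem::path::operator/`. The sketch below places the two side by side to show where they agree. Both function names and the sample directory `/var/lib/ceph/mon/a` are assumptions for illustration, and the comparison assumes a POSIX platform.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <string>

// Hypothetical free-function copy of the helper's one-line body.
std::string store_path_for(const std::string& mon_data) {
  return (std::filesystem::path(mon_data) / "store.db").string();
}

// The loop removed from _open(), kept here only for comparison:
// strip every trailing '/', then append "/store.db".
std::string legacy_store_path_for(const std::string& mon_data) {
  std::size_t pos = 0;
  for (auto rit = mon_data.rbegin(); rit != mon_data.rend(); ++rit, ++pos) {
    if (*rit != '/') {
      break;
    }
  }
  return mon_data.substr(0, mon_data.size() - pos) + "/store.db";
}
```

For a data directory with zero or one trailing slash, both versions produce the same string. They differ for an empty input: the old loop yields `/store.db`, while `operator/` yields the relative path `store.db`. Repeated trailing slashes also appear to be kept rather than collapsed by `operator/`, which should be harmless on POSIX but may matter if the resulting string is compared textually.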