From: Matthew N. Heler Date: Mon, 18 May 2026 01:57:01 +0000 (-0500) Subject: mon: add monitor RocksDB backup and restore X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=3a9ae41e2a8fd614d67e3dac39d28ddf5dd6ca4a;p=ceph.git mon: add monitor RocksDB backup and restore Implements an opt-in backup mechanism for the monitor using rocksdb::BackupEngine. Backups run on a schedule when mon_backup_interval is set, or are triggered manually via `ceph tell mon.* backup`. Cleanup keeps the last N, hourly, and daily snapshots, with a free-space guard. Off by default. Restore is offline: stop the mon and run ceph-mon --restore-backup --yes-i-really-mean-it optionally with --backup-version (BackupEngine logical version, as shown by --list-backups). The mon keyring is stashed alongside the RocksDB backup so a wiped mon_data is recovered end-to-end, and kv_backend is stamped back when missing. Co-authored-by: Daniel Poelzleithner Signed-off-by: Matthew N. Heler --- diff --git a/doc/rados/configuration/mon-config-ref.rst b/doc/rados/configuration/mon-config-ref.rst index 18006d0611b..85c003f16b5 100644 --- a/doc/rados/configuration/mon-config-ref.rst +++ b/doc/rados/configuration/mon-config-ref.rst @@ -600,6 +600,361 @@ is far outweighed by the number of accidental pool (and thus data) deletions it For more information about the pool flags see :ref:`Pool values `. +Monitor backup +============== + +In normal operation, Monitor backups are not required: surviving members +of the Monitor quorum re-sync new or replaced Monitors automatically, +and the Monitor store can be largely rebuilt from the OSDs after total +quorum loss (see :ref:`mon-store-recovery-using-osds`). + +However, ``mon-store-recovery-using-osds`` only recovers state that the +OSDs can report: osdmap history, auth keys associated with running +OSDs, and similar. 
Some Monitor state has no copy outside the Monitor +store and is unrecoverable if all Monitors are lost: + +* **Encryption keys for dm-crypt OSDs** are stored only in the + Monitor's ``config-key`` store under + ``dm-crypt/osd//luks``. Without these keys the underlying + block devices cannot be unlocked, even when the OSD daemons and data + are physically intact. +* **Cephadm orchestrator state** under ``mgr/cephadm/*`` in the + ``config-key`` store, including host inventory, daemon placement + specs, and service definitions. +* **Config-key entries** populated by users or third-party tooling via + ``ceph config-key set``. +* **Dashboard and manager module state** persisted to the + ``config-key`` store. + +A Monitor backup lets an operator restore this state if all running +Monitors are lost. It is most valuable for clusters that use +dm-crypt-encrypted OSDs or that depend heavily on cephadm-managed +deployment state, where loss of the Monitor store would be a +protracted data-availability incident rather than a recoverable inconvenience. + +Monitor backups complement, but do not replace, the existing Monitor +recovery procedures. They are not a means of "undoing" cluster-level +operations such as pool deletion or CRUSH changes: once OSDs have +observed and acted on a newer osdmap, restoring an older Monitor store +does not roll back the OSD-side effects. + +The Ceph Monitor uses the native RocksDB ``BackupEngine`` to create +consistent snapshots of its store, which can be copied elsewhere +without downtime. + +When :confval:`mon_backup_interval` is set, a backup is triggered every +N seconds. Pair it with :confval:`mon_backup_cleanup_interval`; if only +the backup interval is set, backups accumulate indefinitely because +retention is only applied during cleanup. + +Backups share table files (``.sst``) within the backup directory: each +new backup only copies SSTables that the running database has produced +since the previous backup. 
Restoring any individual backup version is +independent of the others, but the on-disk files for every version +live in a single shared tree under ``mon_backup_path``. + +Layout of the backup directory +------------------------------ + +A backup path managed by the RocksDB ``BackupEngine`` contains three +top-level directories plus a per-version copy of the Monitor keyring:: + + /path/to/backups/ + ├── meta/ + │ ├── 1 + │ ├── 2 + │ └── 3 + ├── private/ + │ ├── 1/ + │ ├── 2/ + │ └── 3/ + ├── shared_checksum/ + │ ├── 000007.sst + │ ├── 000010.sst + │ └── ... + ├── keyring.1 + ├── keyring.2 + └── keyring.3 + +* ``meta/<N>`` is a metadata file describing logical backup version + ``N``. +* ``private/<N>/`` contains files unique to backup ``N`` (RocksDB + descriptors and other per-version state). +* ``shared_checksum/`` contains SSTables shared across backup + versions. A single SSTable in this directory may belong to several + versions; the BackupEngine deletes a file from here only when no + remaining backup references it. +* ``keyring.<N>`` is a copy of ``$mon_data/keyring`` taken at the time + of backup ``N``. The Monitor needs this key to authenticate at + startup, and it is not stored inside the RocksDB database; restoring + version ``N`` copies the matching ``keyring.<N>`` back into + ``$mon_data`` so an older snapshot is paired with the keyring of its + vintage. Cleanup removes ``keyring.<N>`` files whose backup version + has been pruned. + + Stashing the keyring is **best effort**: if the copy fails (for + example, permission denied on the backup directory or out of space + after the RocksDB snapshot completed), the backup is still recorded + as successful and the RocksDB data remains usable. On restore, a + missing ``keyring.<N>`` is silently skipped, and the operator must + supply the Monitor keyring out-of-band before starting the daemon. + +.. warning:: + + The backup directory contains the ``[mon.]`` private key. 
Treat + it with the same access controls as ``$mon_data`` itself; a + complete backup is sufficient material to impersonate a Monitor + in the cluster. + +Do not delete a ``private/<N>/`` directory or files under +``shared_checksum/`` by hand: removing a referenced shared file +corrupts every backup that points to it. Use ``backup_cleanup`` or +the configured retention parameters to remove old versions so the +BackupEngine can release shared files safely. + +If the Monitor cluster fails and you need to copy a backup elsewhere, +copy the entire ``/path/to/backups/`` directory. Copying only +``private/<N>/`` is not sufficient; the version's SSTables live in +``shared_checksum/``. + + +The :confval:`mon_backup_cleanup_interval` option sets the interval for +backup cleanup. The cleanup algorithm keeps the last +``mon_backup_keep_last`` backups. It then keeps up to +``mon_backup_keep_hourly`` hourly and ``mon_backup_keep_daily`` daily +versions, retaining the newest backup in each time window. + +You can trigger ``backup`` and ``backup_cleanup`` through any running +Monitor's admin socket. + +.. prompt:: bash # + + ceph --admin-daemon .../mon.asok backup + +.. prompt:: bash # + + ceph --admin-daemon .../mon.asok backup_cleanup + +The following metrics related to the monitor backup process are +tracked by ``ceph-mon``. + +.. 
list-table:: Ceph Monitor Backup Metrics + :widths: 30 12 58 + :header-rows: 1 + + * - Name + - Type + - Description + * - ``backup_running`` + - Gauge + - ``1`` while a backup is in progress, ``0`` otherwise + * - ``backup_started`` + - Counter + - Backup attempts (includes attempts rejected at the :confval:`mon_backup_min_avail` pre-flight check) + * - ``backup_success`` + - Counter + - Backups completed by the ``BackupEngine`` + * - ``backup_failed`` + - Counter + - Failed backup attempts (pre-flight or ``BackupEngine``) + * - ``backup_duration`` + - Average + - Backup wall-clock time + * - ``backup_last_success`` + - Gauge + - UTC timestamp of the most recent successful backup, or ``0`` if none + * - ``backup_last_success_id`` + - Gauge + - ``BackupEngine`` version ID of the most recent successful backup (the value passed to ``--restore-backup --backup-version``) + * - ``backup_last_failed`` + - Gauge + - UTC timestamp of the most recent failed attempt, or ``0`` if none + * - ``backup_last_size`` + - Gauge + - Payload size in bytes of the most recent attempt (may be partial on failure) + * - ``backup_last_files`` + - Gauge + - File count of the most recent attempt (may be partial on failure) + * - ``backup_cleanup_started`` + - Counter + - Cleanup invocations + * - ``backup_cleanup_running`` + - Gauge + - ``1`` while a cleanup is in progress, ``0`` otherwise + * - ``backup_cleanup_success`` + - Counter + - Cleanup passes completed without error + * - ``backup_cleanup_failed`` + - Counter + - Failed cleanup passes + * - ``backup_cleanup_duration`` + - Average + - Cleanup wall-clock time + * - ``backup_cleanup_kept`` + - Gauge + - Backups retained by the most recent cleanup pass + * - ``backup_cleanup_deleted`` + - Gauge + - Backups removed by the most recent cleanup pass + * - ``backup_cleanup_freed`` + - Gauge + - Bytes released by the most recent cleanup pass (overstated when shared backups are in use, because the ``BackupEngine`` payload sum ignores file 
sharing) + * - ``backup_cleanup_size`` + - Gauge + - Bytes retained by the most recent cleanup pass (same sharing caveat as ``backup_cleanup_freed``) + +The ``backup_started``/``backup_success``/``backup_failed`` and +``backup_cleanup_started``/``backup_cleanup_success``/``backup_cleanup_failed`` +counters are monotonic and accumulate for the lifetime of the +``ceph-mon`` process. The ``backup_last_*`` fields and the four +cleanup result gauges (``backup_cleanup_kept``, ``backup_cleanup_deleted``, +``backup_cleanup_freed``, ``backup_cleanup_size``) describe only the +most recent invocation and are overwritten on every pass. + +To retrieve backup metrics from a running monitor's admin socket: + +.. prompt:: bash # + + ceph --admin-daemon .../mon.asok perf dump | jq '.["mon"] | with_entries(select(.key | startswith("backup_")))' + +.. code-block:: json + + { + "backup_running": 0, + "backup_started": 2, + "backup_success": 2, + "backup_failed": 0, + "backup_duration": { + "avgcount": 2, + "sum": 0.149076498, + "avgtime": 0.074538249 + }, + "backup_last_success": 1722001989.849262, + "backup_last_success_id": 3, + "backup_last_failed": 0, + "backup_last_size": 3924677, + "backup_last_files": 6, + "backup_cleanup_started": 1, + "backup_cleanup_running": 0, + "backup_cleanup_success": 1, + "backup_cleanup_failed": 0, + "backup_cleanup_size": 86144, + "backup_cleanup_kept": 1, + "backup_cleanup_duration": { + "avgcount": 1, + "sum": 0.002031246, + "avgtime": 0.002031246 + }, + "backup_cleanup_freed": 0, + "backup_cleanup_deleted": 0 + } + +Monitor Backup Metric Usage Examples +------------------------------------ + +The following examples show how to use the monitor backup performance +counters. PromQL examples assume that the ``ceph-exporter`` is being +scraped by Prometheus. Admin-socket examples use ``ceph daemon`` +directly against a specific ``ceph-mon`` daemon. 
+ +``backup_last_success`` (gauge, Unix epoch seconds) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The timestamp when the most recent successful backup completed. ``0`` +means no successful backup has occurred since the mon started. + +* Age of the most recent successful backup, per mon: + + .. code-block:: promql + + time() - mon_backup_last_success + +* Detect a stalled backup schedule (no success in over 2 hours; adjust + the threshold to roughly 2× your :confval:`mon_backup_interval`): + + .. code-block:: promql + + time() - mon_backup_last_success > 7200 and mon_backup_last_success > 0 + +``backup_failed`` (counter) +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Cumulative count of failed backup attempts (pre-flight or +``BackupEngine``) since the mon started. + +* Backup failures per mon in the last hour: + + .. code-block:: promql + + increase(mon_backup_failed[1h]) + +* Alert when any mon has logged a backup failure recently: + + .. code-block:: promql + + increase(mon_backup_failed[15m]) > 0 + +``backup_last_size`` (gauge, bytes) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Payload size in bytes of the most recent backup attempt. + +* Backup size trend across all mons: + + .. code-block:: promql + + mon_backup_last_size + +* Live size check for a single mon via admin socket: + + .. code-block:: bash + + ceph daemon mon. perf dump | jq '.mon.backup_last_size' + +``backup_cleanup_kept`` (gauge) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Number of backups retained by the most recent cleanup pass. + +* Mons whose retention has grown beyond an expected ceiling: + + .. code-block:: promql + + mon_backup_cleanup_kept > 100 + +``backup_cleanup_freed`` (gauge, bytes) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Bytes released by the most recent cleanup pass. Overstated when shared +backups are in use. + +* Bytes reclaimed by the latest cleanup, per mon: + + .. 
code-block:: promql + + mon_backup_cleanup_freed + +* Total bytes released by the most recent cleanup across all mons: + + .. code-block:: promql + + sum(mon_backup_cleanup_freed) + +Monitor Backup Configuration Options +------------------------------------ + +The following options control monitor backup behavior. All are +runtime-tunable. + +.. confval:: mon_backup_path +.. confval:: mon_backup_min_avail +.. confval:: mon_backup_keep_last +.. confval:: mon_backup_keep_hourly +.. confval:: mon_backup_keep_daily +.. confval:: mon_backup_interval +.. confval:: mon_backup_cleanup_interval + + Miscellaneous ============= diff --git a/doc/rados/troubleshooting/troubleshooting-mon.rst b/doc/rados/troubleshooting/troubleshooting-mon.rst index 10e49dc8b2b..791b8a7b53a 100644 --- a/doc/rados/troubleshooting/troubleshooting-mon.rst +++ b/doc/rados/troubleshooting/troubleshooting-mon.rst @@ -518,6 +518,97 @@ or:: Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb +Recovery Using Mon Backup +------------------------- + +If Monitor backups are enabled, backups can be found in the configured +``mon_backup_path``. To list the available backup versions, run: + +.. code-block:: bash + + ceph-mon -i [num] --list-backups /path/to/backups + +In containerized deployments, run this from inside the Monitor +container (``cephadm shell --name mon.``), or install +``ceph-common`` on the host. + +This invokes the RocksDB ``BackupEngine`` to enumerate the logical +backup versions at the path. Output looks like:: + + ID: Time: Size: + 1 Sun May 18 03:00:01 2026 4 MiB + 2 Sun May 18 04:00:02 2026 12 KiB + 3 Sun May 18 05:00:01 2026 16 KiB + +The ``ID`` column is the value to pass as ``--backup-version`` when +restoring. A plain ``ls`` of the backup path shows the BackupEngine's +internal ``meta/``, ``private/``, and ``shared_checksum/`` directories, +which are not directly usable for restore; always use ``--list-backups`` +to obtain the IDs. 
+ +To restore a backup, stop the monitor and run: + +.. code-block:: bash + + ceph-mon -i [num] --restore-backup /path/to/backups --backup-version --yes-i-really-mean-it + +The ``--yes-i-really-mean-it`` flag is required because restore overwrites the existing monitor store. +If the ``--backup-version`` argument is omitted, the latest version will be restored. +The restored store contains everything that was in the mon at the time the backup was taken, +including auth records; any changes (auth, pools, CRUSH, etc.) made after that point are lost. +OSDs will reconcile their state with the restored osdmap as the cluster comes back up. + +If ``ceph-mon --restore-backup`` is invoked as ``root`` (typical when running +from a service shell), the restored ``kv_backend`` file and the rehydrated +``keyring`` will be owned by ``root``. Before starting the monitor daemon, +``chown -R ceph:ceph `` so the unprivileged ``ceph`` user can read +them. + +Restoring a multi-monitor cluster +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Each monitor has its own Paxos state (rank, accepted proposal numbers, +``last_committed``), so backups are per-monitor: a backup taken from +``mon.a`` should be restored into ``mon.a``'s ``mon_data``, not copied +onto ``mon.b`` or ``mon.c``. + +To recover a cluster from monitor backups: + +#. Stop all monitors. +#. Restore each monitor you have a backup for from its own backup path + using the ``--restore-backup`` command above. +#. Start the restored monitors. Quorum forms once a majority of the + monmap members are running. Restoring and starting a majority is the + simplest path: the recovered monmap is the original one, and the + monitors elect among themselves as normal. + +If you cannot bring back a majority of monitors (because backups are +missing or storage is lost on the other hosts), the restored seed will +not form quorum on its own. 
The monmap recovered from the backup still +lists every monitor in the original cluster, and Paxos requires a +majority to elect. The seed will sit in ``probing`` / ``electing`` +indefinitely. + +To reduce the monmap on the offline seed so it can elect itself: + +#. Stop all monitors. +#. Restore the seed monitor from its backup as above. +#. Edit the monmap directly in the offline mon store:: + + # extract from the offline mon store + ceph-mon -i --extract-monmap /tmp/monmap + # drop monitors that will not come back + monmaptool /tmp/monmap --rm --rm + # write it back into the store + ceph-mon -i --inject-monmap /tmp/monmap + +#. Start the seed monitor. With the reduced monmap it can elect itself + and form a single-member quorum. +#. Re-add any remaining monitors as new sync members per + :ref:`adding-and-removing-monitors`; they will synchronize from the + recovered seed rather than reusing their old backups. + + Recovery Using Healthy Monitor(s) --------------------------------- diff --git a/src/ceph_mon.cc b/src/ceph_mon.cc index e1475368d01..de928f0e4cd 100644 --- a/src/ceph_mon.cc +++ b/src/ceph_mon.cc @@ -19,6 +19,7 @@ #include #include +#include #include #include @@ -42,6 +43,7 @@ #include "common/Throttle.h" #include "common/Timer.h" #include "common/errno.h" +#include "common/strtol.h" #include "common/Preforker.h" #include "global/global_init.h" @@ -213,7 +215,7 @@ static void usage() << " --force-sync\n" << " force a sync from another mon by wiping local data (BE CAREFUL)\n" << " --yes-i-really-mean-it\n" - << " mandatory safeguard for --force-sync\n" + << " mandatory safeguard for --force-sync and --restore-backup\n" << " --compact\n" << " compact the monitor store\n" << " --osdmap \n" @@ -224,8 +226,14 @@ static void usage() << " extract the monmap from the local monitor store and exit\n" << " --mon-data \n" << " where the mon store and keyring are located\n" - << " --set-crush-location =" - << " sets monitor's crush bucket location (only for stretch 
mode)" + << " --set-crush-location =\n" + << " sets monitor's crush bucket location (only for stretch mode)\n" + << " --restore-backup \n" + << " restore the backup from location and exit (requires --yes-i-really-mean-it)\n" + << " --backup-version \n" + << " BackupEngine version ID (uint32); defaults to the latest backup when omitted\n" + << " --list-backups \n" + << " list available backups\n" << std::endl; generic_server_usage(); } @@ -265,6 +273,8 @@ int main(int argc, const char **argv) bool force_sync = false; bool yes_really = false; std::string osdmapfn, inject_monmap, extract_monmap, crush_loc; + std::string restore_backup_location, list_backup_location; + std::optional restore_backup_version; auto args = argv_to_vec(argc, argv); if (args.empty()) { @@ -337,6 +347,18 @@ int main(int argc, const char **argv) extract_monmap = val; } else if (ceph_argparse_witharg(args, i, &val, "--set-crush-location", (char*)NULL)) { crush_loc = val; + } else if (ceph_argparse_witharg(args, i, &val, "--list-backups", (char*)NULL)) { + list_backup_location = val; + } else if (ceph_argparse_witharg(args, i, &val, "--restore-backup", (char*)NULL)) { + restore_backup_location = val; + } else if (ceph_argparse_witharg(args, i, &val, "--backup-version", (char*)NULL)) { + std::string parse_err; + long long v = strict_strtoll(val.c_str(), 10, &parse_err); + if (!parse_err.empty() || v < 0 || v > UINT32_MAX) { + cerr << "invalid --backup-version '" << val << "'" << std::endl; + exit(1); + } + restore_backup_version = static_cast(v); } else { ++i; } @@ -362,6 +384,51 @@ int main(int argc, const char **argv) exit(1); } + // -- list backups -- + if (!list_backup_location.empty()) { + cout << "list backup from location '" << list_backup_location << "'" << std::endl << std::endl; + auto backup_infos = MonitorDBStore::list_backups( + cct.get(), g_conf()->mon_data, list_backup_location); + if (!backup_infos) { + cerr << "failed to enumerate backups at '" << list_backup_location + << "' (see 
log for details)" << std::endl; + exit(1); + } + if (backup_infos->empty()) { + cout << "no backups found at '" << list_backup_location << "'" << std::endl; + exit(0); + } + cout << "ID:\tTime:\t\t\t\tSize:" << std::endl; + for (const KeyValueDB::BackupStats& bi : *backup_infos) { + cout << bi.id << "\t"; + bi.timestamp.asctime(cout); + cout << "\t" << byte_u_t(bi.size) << std::endl; + } + exit(0); + } + + // -- restore backup -- + if (!restore_backup_location.empty()) { + if (!yes_really) { + cerr << "restoring will overwrite the monitor store at '" << g_conf()->mon_data + << "'. Pass --yes-i-really-mean-it to proceed." << std::endl; + exit(1); + } + cerr << "restoring backup from location '" << restore_backup_location << "' to '" + << g_conf()->mon_data << "'" << std::endl; + if (MonitorDBStore::restore_backup(cct.get(), g_conf()->mon_data, restore_backup_location, restore_backup_version)) { + cout << "successfully restored backup. Start ceph-mon normally" << std::endl; + exit(0); + } + cerr << "restore failed. Check the backup path and version (use --list-backups to enumerate)." << std::endl; + exit(1); + } + + if (restore_backup_version.has_value()) { + cerr << "--backup-version requires --restore-backup" << std::endl; + exit(1); + } + MonitorDBStore store(g_conf()->mon_data); // -- mkfs -- @@ -537,7 +604,7 @@ int main(int argc, const char **argv) exit(1); } store.close(); - dout(0) << argv[0] << ": created monfs at " << g_conf()->mon_data + dout(0) << argv[0] << ": created monfs at " << g_conf()->mon_data << " for " << g_conf()->name << dendl; return 0; } diff --git a/src/common/options/mon.yaml.in b/src/common/options/mon.yaml.in index 9c81dbac734..ef875ef8058 100644 --- a/src/common/options/mon.yaml.in +++ b/src/common/options/mon.yaml.in @@ -1176,6 +1176,74 @@ options: flags: - no_mon_update with_legacy: true +- name: mon_backup_path + type: str + level: advanced + desc: Path to Monitor database backups + fmt_desc: The Monitor's backup location. 
+ default: /var/backups/ceph/mon/$cluster-$id + services: + - mon + flags: + - runtime +- name: mon_backup_min_avail + type: int + level: advanced + desc: Only capture backups if at least this percentage of the target filesystem is free + default: 10 + min: 0 + max: 100 + services: + - mon + flags: + - runtime +- name: mon_backup_keep_last + type: uint + level: advanced + desc: Keep the last N backups + fmt_desc: Keep the last N backups of the Monitor database. + default: 6 + services: + - mon + flags: + - runtime +- name: mon_backup_keep_hourly + type: uint + level: advanced + desc: Number of hourly backups + default: 5 + services: + - mon + flags: + - runtime +- name: mon_backup_keep_daily + type: uint + level: advanced + desc: Number of daily backups + fmt_desc: Keep one backup per day, for the specified number of days. + default: 7 + services: + - mon + flags: + - runtime +- name: mon_backup_interval + type: secs + level: advanced + desc: Automatic backups every N seconds (0 disables) + default: 0 + services: + - mon + flags: + - runtime +- name: mon_backup_cleanup_interval + type: secs + level: advanced + desc: Trigger backup cleanup every N seconds (0 disables) + default: 0 + services: + - mon + flags: + - runtime - name: mon_rocksdb_options type: str level: advanced diff --git a/src/kv/KeyValueDB.cc b/src/kv/KeyValueDB.cc index 41eb30873b8..c7700fd6f3f 100644 --- a/src/kv/KeyValueDB.cc +++ b/src/kv/KeyValueDB.cc @@ -1,6 +1,10 @@ // -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:nil -*- // vim: ts=8 sw=2 sts=2 expandtab +#include +#include +#include + #include "KeyValueDB.h" #include "RocksDBStore.h" @@ -18,6 +22,52 @@ KeyValueDB *KeyValueDB::create(CephContext *cct, const string& type, return NULL; } +bool KeyValueDB::restore_backup(CephContext *cct, + const std::string &type, + const std::string &path, + const std::string &backup_location, + const std::optional &version) +{ + if (std::filesystem::exists(path) && + 
!std::filesystem::is_empty(path)) { + std::unique_ptr<KeyValueDB> probe(KeyValueDB::create(cct, type, path)); + if (!probe) { + lderr(cct) << __func__ << " unsupported kv backend: " << type << dendl; + return false; + } + std::ostringstream err; + if (probe->open(err) < 0) { + // Heuristic: rocksdb's PosixEnv lock path surfaces "lock" in the + // error text. Any other open failure (corruption, I/O) is precisely + // why a restore is being run -- warn and proceed. + const std::string msg = err.str(); + if (msg.find("lock") != std::string::npos) { + lderr(cct) << __func__ << " another monitor is using this data dir: " + << path << ": " << msg << dendl; + return false; + } + lderr(cct) << __func__ << " existing store at " << path + << " is unreadable, proceeding with restore: " << msg << dendl; + } else { + probe->close(); + } + } + if (type == "rocksdb") { + return RocksDBStore::restore_backup(cct, path, backup_location, version); + } + return false; +} + +std::optional<std::vector<KeyValueDB::BackupStats>> KeyValueDB::list_backups( + CephContext *cct, const std::string &type, const std::string &backup_location) +{ + if (type == "rocksdb") { + return RocksDBStore::list_backups(cct, backup_location); + } + lderr(cct) << __func__ << " unsupported kv backend: " << type << dendl; + return std::nullopt; +} + int KeyValueDB::test_init(const string& type, const string& dir) { if (type == "rocksdb") { diff --git a/src/kv/KeyValueDB.h b/src/kv/KeyValueDB.h index 2d861eda82d..381492fa1d7 100644 --- a/src/kv/KeyValueDB.h +++ b/src/kv/KeyValueDB.h @@ -13,6 +13,7 @@ #include #include #include "include/encoding.h" +#include "include/utime.h" #include "common/Formatter.h" #include "common/perf_counters.h" #include "common/PriorityCache.h" @@ -24,6 +25,25 @@ */ class KeyValueDB { public: + struct BackupCleanupStats { + bool error{false}; + utime_t timestamp; + uint32_t corrupted{0}; + uint32_t deleted{0}; + uint32_t kept{0}; + uint64_t size{0}; + uint64_t freed{0}; + }; + + struct BackupStats { + bool error{false}; + uint64_t 
id{0}; + utime_t timestamp; + std::string msg; + uint64_t size{0}; + uint64_t number_files{0}; + }; + class TransactionImpl { public: // amount of ops included @@ -112,8 +132,8 @@ public: } /// Remove Single Key which exists and was not overwritten. - /// This API is only related to performance optimization, and should only be - /// re-implemented by log-insert-merge tree based keyvalue stores(such as RocksDB). + /// This API is only related to performance optimization, and should only be + /// re-implemented by log-insert-merge tree based keyvalue stores(such as RocksDB). /// If a key is overwritten (by calling set multiple times), then the result /// of calling rm_single_key on this key is undefined. virtual void rm_single_key( @@ -418,6 +438,29 @@ public: return 0; }; + /// Create a kv database backup in directory path. + virtual BackupStats backup(const std::string &path) { + return {.error = true, .msg = "backup not supported by this backend"}; + } + + /// Remove old backups in directory path according to retention. 
+ virtual BackupCleanupStats backup_cleanup(const std::string &path, + uint64_t keep_last, + uint64_t keep_hourly, + uint64_t keep_daily) { + return {.error = true}; + } + + /// restore from backup the specified backup version + static bool restore_backup(CephContext *cct, const std::string &type, + const std::string &path, + const std::string &backup_location, + const std::optional &version); + + static std::optional> list_backups( + CephContext *cct, const std::string &type, + const std::string &backup_location); + /// compact the underlying store virtual void compact() {} diff --git a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc index ae308fafdf6..6d9470306a1 100644 --- a/src/kv/RocksDBStore.cc +++ b/src/kv/RocksDBStore.cc @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include @@ -18,10 +19,14 @@ #include "rocksdb/slice.h" #include "rocksdb/cache.h" #include "rocksdb/filter_policy.h" +#include "rocksdb/utilities/backup_engine.h" #include "rocksdb/utilities/convenience.h" #include "rocksdb/utilities/table_properties_collectors.h" #include "rocksdb/merge_operator.h" +#include "common/version.h" +#include "rocksdb/util/stderr_logger.h" + #include "common/Clock.h" // for ceph_clock_now() #include "common/perf_counters.h" #include "common/PriorityCache.h" @@ -70,6 +75,7 @@ static const char* sharding_def_file = "sharding/def"; static const char* sharding_recreate = "sharding/recreate_columns"; static const char* resharding_column_lock = "reshardingXcommencingXlocked"; + static bufferlist to_bufferlist(rocksdb::Slice in) { bufferlist bl; bl.append(bufferptr(in.data(), in.size())); @@ -1143,6 +1149,7 @@ int RocksDBStore::do_open(ostream &out, if (create_if_missing) { status = rocksdb::DB::Open(opt, path, &db); if (!status.ok()) { + out << status.ToString(); derr << status.ToString() << dendl; return -EINVAL; } @@ -1191,6 +1198,7 @@ int RocksDBStore::do_open(ostream &out, status = rocksdb::DB::Open(opt, path, &db); } if (!status.ok()) { + out 
<< status.ToString(); derr << status.ToString() << dendl; return -EINVAL; } @@ -1206,6 +1214,7 @@ int RocksDBStore::do_open(ostream &out, path, existing_cfs, &handles, &db); } if (!status.ok()) { + out << status.ToString(); derr << status.ToString() << dendl; return -EINVAL; } @@ -2076,6 +2085,278 @@ int RocksDBStore::split_key(rocksdb::Slice in, string_view *prefix, string_view return 0; } +KeyValueDB::BackupStats RocksDBStore::backup(const std::string &path) +{ + ldout(cct, 20) << __func__ << " start backup action" << dendl; + std::lock_guard backup_locker{backup_lock}; + // stamp timestamp up front so every return path (including early Open + // failures) carries a real time the scheduler can gate retries on. + KeyValueDB::BackupStats rv; + rv.timestamp = ceph_clock_now(); + + rocksdb::BackupEngine* engine_ptr = nullptr; + rocksdb::BackupEngineOptions engine_options = rocksdb::BackupEngineOptions(path); + // BackupEngineOptions must be stable across opens to the same directory, + // and share_files_with_checksum=false is deprecated by rocksdb. 
+ engine_options.share_table_files = true; + engine_options.share_files_with_checksum = true; + engine_options.sync = true; + + rocksdb::Status s = rocksdb::BackupEngine::Open( + engine_options, + rocksdb::Env::Default(), + &engine_ptr); + std::unique_ptr backup_engine{engine_ptr}; + + if (!backup_engine || !s.ok()) { + ldout(cct, 0) << __func__ << " can't create backup_engine: " << s.ToString() << dendl; + rv.msg = s.ToString(); + rv.error = true; + return rv; + } + + // we remove corrupted backups first to not link to broken ones + remove_corrupted_backups(backup_engine.get(), nullptr); + + rocksdb::BackupID new_backup; + rocksdb::BackupInfo new_backup_info; + rocksdb::CreateBackupOptions new_backup_options = rocksdb::CreateBackupOptions(); + new_backup_options.flush_before_backup = true; + + std::string app_metadata = std::string("ceph_version=") + ceph_version_to_str(); + s = backup_engine->CreateNewBackupWithMetadata(new_backup_options, db, app_metadata, &new_backup); + + rv.timestamp = ceph_clock_now(); + rv.msg = s.ToString(); + + if (!s.ok()) { + ldout(cct, 0) << __func__ << " can't create backup: " << s.ToString() << dendl; + rv.error = true; + remove_corrupted_backups(backup_engine.get(), nullptr); + return rv; + } else { + ldout(cct, 10) << __func__ << " created backup successfully: " << s.ToString() << dendl; + rv.msg = s.ToString(); + } + s = backup_engine->GetBackupInfo(new_backup, &new_backup_info); + if (!s.ok()) { + ldout(cct, 0) << __func__ << " can't get backup info: " << s.ToString() << dendl; + rv.error = true; + rv.msg = s.ToString(); + return rv; + } + rv.id = new_backup_info.backup_id; + rv.size = new_backup_info.size; + rv.number_files = new_backup_info.number_files; + + return rv; +} + +bool RocksDBStore::restore_backup(CephContext *cct, const std::string &path, + const std::string &backup_location, + const std::optional &version) +{ + rocksdb::BackupEngineReadOnly* engine_ptr = nullptr; + rocksdb::StderrLogger logger = 
rocksdb::StderrLogger();
+  rocksdb::BackupEngineOptions engine_options = rocksdb::BackupEngineOptions(backup_location);
+  engine_options.info_log = &logger;
+
+  rocksdb::Status s = rocksdb::BackupEngineReadOnly::Open(
+    rocksdb::Env::Default(),
+    engine_options,
+    &engine_ptr);
+  std::unique_ptr<rocksdb::BackupEngineReadOnly> backup_engine{engine_ptr};
+  const rocksdb::RestoreOptions options = rocksdb::RestoreOptions();
+  if (!s.ok()) {
+    derr << __func__ << " can't open backup folder: " << s.ToString() << dendl;
+    return false;
+  }
+  if (!version) {
+    derr << __func__ << " restore last valid backup" << dendl;
+    s = backup_engine->RestoreDBFromLatestBackup(options, path, path);
+  } else {
+    s = backup_engine->RestoreDBFromBackup(
+      options, static_cast<rocksdb::BackupID>(*version), path, path);
+  }
+  if (!s.ok()) {
+    derr << "Error when restoring backup: " << s.ToString() << dendl;
+  }
+  return s.ok();
+}
+
+namespace {
+
+bool compare_backupinfo_by_timestamp(const rocksdb::BackupInfo& a, const rocksdb::BackupInfo& b)
+{
+  // newest first; tie-break on backup_id so cleanup never keeps an older entry
+  // in preference to a newer one with the same second-resolution timestamp.
+  return std::tie(a.timestamp, a.backup_id) > std::tie(b.timestamp, b.backup_id);
+}
+
+struct TimeBucket {
+  utime_t start;
+  utime_t end;
+  rocksdb::BackupID backup_id;
+
+  TimeBucket(utime_t start, utime_t end) :
+    start(start), end(end), backup_id(0) {}
+};
+
+} // namespace
+
+void RocksDBStore::remove_corrupted_backups(rocksdb::BackupEngine *backup_engine, KeyValueDB::BackupCleanupStats *rv) {
+  std::vector<rocksdb::BackupID> corrupt_backup_ids;
+  backup_engine->GetCorruptedBackups(&corrupt_backup_ids);
+  for (rocksdb::BackupID backup_id : corrupt_backup_ids) {
+    ldout(cct, 1) << __func__ << " delete corrupted backup: " << backup_id << dendl;
+    rocksdb::Status s = backup_engine->DeleteBackup(backup_id);
+    if (!s.ok()) {
+      lderr(cct) << __func__ << " failed to delete corrupted backup "
+                 << backup_id << ": " << s.ToString() << dendl;
+      if (rv) {
+        rv->error = true;
+      }
+      continue;
+    }
+    if (rv) {
+      rv->corrupted++;
+    }
+  }
+}
+
+KeyValueDB::BackupCleanupStats RocksDBStore::backup_cleanup(const std::string &path,
+                                                            uint64_t keep_last,
+                                                            uint64_t keep_hourly,
+                                                            uint64_t keep_daily)
+{
+  ldout(cct, 20) << __func__ << " start backup cleanup" << dendl;
+  std::lock_guard backup_locker{backup_lock};
+  // stamp timestamp up front so every return path (including early Open
+  // failures and empty result) carries a real time for the retry gate.
+  BackupCleanupStats rv;
+  rv.timestamp = ceph_clock_now();
+
+  rocksdb::BackupEngine* engine_ptr = nullptr;
+  rocksdb::Status s = rocksdb::BackupEngine::Open(
+    rocksdb::BackupEngineOptions(path),
+    rocksdb::Env::Default(),
+    &engine_ptr);
+  std::unique_ptr<rocksdb::BackupEngine> backup_engine{engine_ptr};
+  if (!backup_engine || !s.ok()) {
+    // cleaning backups when folder is not available is minor problem
+    ldout(cct, 10) << __func__ << " can't clean backups: " << s.ToString() << dendl;
+    rv.error = true;
+    return rv;
+  }
+  // remove corrupted backups first
+  std::set<rocksdb::BackupID> keep_backups;
+
+  remove_corrupted_backups(backup_engine.get(), &rv);
+  ldout(cct, 20) << __func__ << " collect garbage" << dendl;
+
+  std::vector<rocksdb::BackupInfo> backup_infos;
+  backup_engine->GetBackupInfo(&backup_infos);
+
+  if (backup_infos.empty()) {
+    ldout(cct, 15) << __func__ << " no backup infos" << dendl;
+    return rv;
+  }
+  // sort all backups with newest backup first
+  std::stable_sort(backup_infos.begin(), backup_infos.end(), compare_backupinfo_by_timestamp);
+  // always retain the newest backup, regardless of retention settings, so
+  // cleanup can never leave zero backups for a subsequent failed backup.
+  keep_backups.insert(backup_infos.front().backup_id);
+  utime_t now = ceph_clock_now();
+
+  std::vector<TimeBucket> buckets;
+  // half-open intervals [start, end) so adjacent buckets meet without gaps
+  utime_t start = now.round_to_hour();
+  for (uint64_t i = 0; i < keep_hourly; i++) {
+    buckets.push_back(TimeBucket(start, start + utime_t(3600, 0)));
+    start -= 3600.0;
+  }
+  start = now.round_to_day();
+  for (uint64_t i = 0; i < keep_daily; i++) {
+    buckets.push_back(TimeBucket(start, start + utime_t(86400, 0)));
+    start -= 86400.0;
+  }
+
+  size_t i = 0;
+  for (const rocksdb::BackupInfo& bi : backup_infos) {
+    if (i++ < keep_last) {
+      keep_backups.insert(bi.backup_id);
+    }
+    utime_t ts = utime_t(bi.timestamp, 0);
+    for (TimeBucket& bucket : buckets) {
+      if (ts >= bucket.start && ts < bucket.end) {
+        if (bucket.backup_id == 0) {
+          bucket.backup_id = bi.backup_id;
+        }
+      }
+    }
+  }
+  // push the winners into the list
+  for (const TimeBucket& bucket : buckets) {
+    if (bucket.backup_id) {
+      keep_backups.insert(bucket.backup_id);
+    }
+  }
+
+  rv.kept = keep_backups.size();
+
+  for (const rocksdb::BackupInfo& bi : backup_infos) {
+    if (keep_backups.count(bi.backup_id)) {
+      // payload-sum across kept backups (does not account for file sharing,
+      // so it overstates the on-disk footprint).
+      rv.size += bi.size;
+      continue;
+    }
+    ldout(cct, 10) << __func__ << " delete old backup: " << bi.backup_id << dendl;
+    rocksdb::Status s = backup_engine->DeleteBackup(bi.backup_id);
+    if (!s.ok()) {
+      lderr(cct) << __func__ << " failed to delete backup " << bi.backup_id
+                 << ": " << s.ToString() << dendl;
+      rv.error = true;
+      continue;
+    }
+    rv.freed += bi.size;
+    rv.deleted++;
+  }
+  rv.timestamp = ceph_clock_now();
+  return rv;
+}
+
+std::optional<std::vector<KeyValueDB::BackupStats>>
+RocksDBStore::list_backups(CephContext *cct, const std::string &backup_location) {
+  rocksdb::BackupEngineReadOnly* engine_ptr = nullptr;
+  rocksdb::Status s = rocksdb::BackupEngineReadOnly::Open(
+    rocksdb::BackupEngineOptions(backup_location),
+    rocksdb::Env::Default(),
+    &engine_ptr);
+  std::unique_ptr<rocksdb::BackupEngineReadOnly> backup_engine{engine_ptr};
+
+  if (!backup_engine || !s.ok()) {
+    lderr(cct) << __func__ << " can't open backup location " << backup_location
+               << ": " << s.ToString() << dendl;
+    return std::nullopt;
+  }
+
+  std::vector<rocksdb::BackupInfo> backup_infos;
+  backup_engine->GetBackupInfo(&backup_infos);
+  std::stable_sort(backup_infos.begin(), backup_infos.end(), compare_backupinfo_by_timestamp);
+  std::vector<KeyValueDB::BackupStats> rv;
+  for (const rocksdb::BackupInfo& bi : backup_infos) {
+    KeyValueDB::BackupStats br;
+    br.id = bi.backup_id;
+    br.timestamp = utime_t(bi.timestamp, 0);
+    br.size = bi.size;
+    br.number_files = bi.number_files;
+    rv.push_back(br);
+  }
+  return rv;
+}
+
+
 void RocksDBStore::compact()
 {
   dout(2) << __func__ << " starting" << dendl;
@@ -3386,7 +3667,7 @@ int RocksDBStore::prepare_for_reshard(const std::string& new_sharding,
            << full_name << dendl;
       return -EINVAL;
     }
-    dout(10) << "created column " << full_name << " handle = " << (void*)cf << dendl;
+    dout(10) << "created column " << full_name << " handle = " << (void*)cf << dendl;
     existing_columns.push_back(full_name);
     handles.push_back(cf);
   }
diff --git a/src/kv/RocksDBStore.h b/src/kv/RocksDBStore.h
index e4dd5ca54c5..2bd310aa44f 100644
--- a/src/kv/RocksDBStore.h
+++
b/src/kv/RocksDBStore.h
@@ -21,6 +21,7 @@
 #include "rocksdb/statistics.h"
 #include "rocksdb/table.h"
 #include "rocksdb/db.h"
+#include "rocksdb/utilities/backup_engine.h"
 #include "kv/rocksdb_cache/BinnedLRUCache.h"
 #include
 #include "common/errno.h"
@@ -115,11 +116,13 @@ public:
                  uint32_t hash_l, uint32_t hash_h)
       : name(name), shard_cnt(shard_cnt), options(options), hash_l(hash_l), hash_h(hash_h) {}
   };
+
 private:
   friend std::ostream& operator<<(std::ostream& out, const ColumnFamily& cf);
   bool must_close_default_cf = false;
   rocksdb::ColumnFamilyHandle *default_cf = nullptr;
+  ceph::mutex backup_lock = ceph::make_mutex("RocksDBStore::Backup");
   /// column families in use, name->handles
   struct prefix_shards {
@@ -147,6 +150,7 @@ private:
   int do_open(std::ostream &out, bool create_if_missing, bool open_readonly,
               const std::string& cfs="");
   int load_rocksdb_options(bool create_if_missing, rocksdb::Options& opt);
+  void remove_corrupted_backups(rocksdb::BackupEngine *engine, KeyValueDB::BackupCleanupStats *result);
 public:
   static bool parse_sharding_def(const std::string_view text_def,
                                  std::vector<ColumnFamily>& sharding_def,
@@ -212,6 +216,23 @@ public:
     return cct->_conf.get_val<uint64_t>("rocksdb_delete_range_threshold");
   }
+  KeyValueDB::BackupStats backup(const std::string &path) override;
+  KeyValueDB::BackupCleanupStats backup_cleanup(const std::string &path,
+                                                uint64_t keep_last,
+                                                uint64_t keep_hourly,
+                                                uint64_t keep_daily) override;
+
+  /// Restore a backup into @p path. @p version is the rocksdb backup id, or
+  /// nullopt for the most recent. Must be called on a closed store.
+  static bool restore_backup(CephContext *cct, const std::string &path,
+                             const std::string &backup_location,
+                             const std::optional &version);
+
+  /// List backups at @p backup_location, newest first.
+  /// Returns nullopt if the BackupEngine could not be opened.
+  static std::optional<std::vector<KeyValueDB::BackupStats>> list_backups(
+    CephContext *cct, const std::string &backup_location);
+
   void compact() override;
   void compact_async() override {
diff --git a/src/mon/CMakeLists.txt b/src/mon/CMakeLists.txt
index 34f48e55424..94944d792c1 100644
--- a/src/mon/CMakeLists.txt
+++ b/src/mon/CMakeLists.txt
@@ -10,6 +10,7 @@ set(lib_mon_srcs
   MgrMonitor.cc
   MgrStatMonitor.cc
   Monitor.cc
+  MonitorBackup.cc
   MonmapMonitor.cc
   LogMonitor.cc
   AuthMonitor.cc
diff --git a/src/mon/Monitor.cc b/src/mon/Monitor.cc
index 205fb23c7ce..8ef4bdbbd3d 100644
--- a/src/mon/Monitor.cc
+++ b/src/mon/Monitor.cc
@@ -41,6 +41,7 @@
 #include "MonitorDBStore.h"
 #include "MonMap.h"
 #include "Paxos.h"
+#include "MonitorBackup.h"
 #include "messages/PaxosServiceMessage.h"
 #include "messages/MMonCommand.h"
@@ -549,7 +550,11 @@ will start to track new ops received afterwards.";
             << duration << " seconds" << dendl;
     out << "compacted " << g_conf().get_val<std::string>("mon_keyvaluedb")
         << " in " << duration << " seconds";
-  } else {
+  } else if (command == "backup") {
+    r = perform_backup();
+  } else if (command == "backup_cleanup") {
+    r = cleanup_backup();
+  } else {
     ceph_abort_msg("bad AdminSocket command binding");
   }
   (read_only ?
audit_clog->debug() : audit_clog->info())
@@ -568,6 +573,40 @@ abort:
   return r;
 }
+int Monitor::perform_backup()
+{
+  std::string backup_path = g_conf().get_val<std::string>("mon_backup_path");
+  dout(1) << "triggering backup" << dendl;
+  if (backup_path.empty()) {
+    derr << "backup failed: mon_backup_path is empty" << dendl;
+    return -ENOTDIR;
+  }
+  if (!backup_manager) {
+    derr << "backup failed: monitor still initializing" << dendl;
+    return -EAGAIN;
+  }
+  uint64_t jobid = backup_manager->backup();
+  dout(1) << "queued backup job id " << jobid << dendl;
+  return 0;
+}
+
+int Monitor::cleanup_backup()
+{
+  std::string backup_path = g_conf().get_val<std::string>("mon_backup_path");
+  dout(1) << "triggering backup_cleanup" << dendl;
+  if (backup_path.empty()) {
+    derr << "backup_cleanup failed: mon_backup_path is empty" << dendl;
+    return -ENOTDIR;
+  }
+  if (!backup_manager) {
+    derr << "backup_cleanup failed: monitor still initializing" << dendl;
+    return -EAGAIN;
+  }
+  uint64_t jobid = backup_manager->cleanup();
+  dout(1) << "queued backup cleanup job id " << jobid << dendl;
+  return 0;
+}
+
 void Monitor::handle_signal(int signum)
 {
   derr << "*** Got Signal " << sig_str(signum) << " ***" << dendl;
@@ -825,6 +864,44 @@ int Monitor::preinit()
         "ewon", PerfCountersBuilder::PRIO_INTERESTING);
     pcb.add_u64_counter(l_mon_election_lose, "election_lose", "Elections lost",
         "elst", PerfCountersBuilder::PRIO_INTERESTING);
+    pcb.add_u64(l_mon_backup_running, "backup_running", "Mon backup process is running",
+        nullptr, PerfCountersBuilder::PRIO_USEFUL);
+    pcb.add_u64_counter(l_mon_backup_started, "backup_started", "Mon backups started",
+        nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+    pcb.add_u64_counter(l_mon_backup_success, "backup_success", "Mon backups finished successfully",
+        nullptr, PerfCountersBuilder::PRIO_USEFUL);
+    pcb.add_u64_counter(l_mon_backup_failed, "backup_failed", "Mon backups failed",
+        nullptr, PerfCountersBuilder::PRIO_USEFUL);
+    pcb.add_time_avg(l_mon_backup_duration,
"backup_duration", "Mon backup duration", + nullptr, PerfCountersBuilder::PRIO_USEFUL); + pcb.add_time(l_mon_backup_last_success, "backup_last_success", "Last successful mon backup", + nullptr, PerfCountersBuilder::PRIO_USEFUL); + pcb.add_u64(l_mon_backup_last_success_id, "backup_last_success_id", "Last successful mon backup id", + nullptr, PerfCountersBuilder::PRIO_INTERESTING); + pcb.add_time(l_mon_backup_last_failed, "backup_last_failed", "Last failed mon backup", + nullptr, PerfCountersBuilder::PRIO_USEFUL); + pcb.add_u64(l_mon_backup_last_size, "backup_last_size", "Last backup size", + nullptr, PerfCountersBuilder::PRIO_INTERESTING); + pcb.add_u64(l_mon_backup_last_files, "backup_last_files", "Last backup file numbers", + nullptr, PerfCountersBuilder::PRIO_INTERESTING); + pcb.add_u64_counter(l_mon_backup_cleanup_started, "backup_cleanup_started", "Mon backup cleanup started", + nullptr, PerfCountersBuilder::PRIO_INTERESTING); + pcb.add_u64(l_mon_backup_cleanup_running, "backup_cleanup_running", "Mon backup cleanup is running", + nullptr, PerfCountersBuilder::PRIO_INTERESTING); + pcb.add_u64_counter(l_mon_backup_cleanup_success, "backup_cleanup_success", "Mon backup cleanup finished successfully", + nullptr, PerfCountersBuilder::PRIO_USEFUL); + pcb.add_u64_counter(l_mon_backup_cleanup_failed, "backup_cleanup_failed", "Mon backup cleanup failed", + nullptr, PerfCountersBuilder::PRIO_USEFUL); + pcb.add_u64(l_mon_backup_cleanup_size, "backup_cleanup_size", "Size of backups removed", + nullptr, PerfCountersBuilder::PRIO_INTERESTING); + pcb.add_u64(l_mon_backup_cleanup_kept, "backup_cleanup_kept", "Number of backups kept after cleanup", + nullptr, PerfCountersBuilder::PRIO_INTERESTING); + pcb.add_time_avg(l_mon_backup_cleanup_duration, "backup_cleanup_duration", "Mon backup cleanup duration", + nullptr, PerfCountersBuilder::PRIO_INTERESTING); + pcb.add_u64(l_mon_backup_cleanup_freed, "backup_cleanup_freed", "Mon backup cleanup freed size in bytes", + nullptr, 
PerfCountersBuilder::PRIO_INTERESTING);
+    pcb.add_u64(l_mon_backup_cleanup_deleted, "backup_cleanup_deleted", "Mon backup cleanup deleted backups",
+        nullptr, PerfCountersBuilder::PRIO_INTERESTING);
     logger = pcb.create_perf_counters();
     cct->get_perfcounters_collection()->add(logger);
   }
@@ -983,6 +1060,13 @@ int Monitor::preinit()
                                        command.helpstring);
     ceph_assert(r == 0);
   }
+  r = admin_socket->register_command("backup", admin_hook,
+                                     "create a backup of the mon database");
+  ceph_assert(r == 0);
+  r = admin_socket->register_command(
+    "backup_cleanup", admin_hook,
+    "delete old mon database backups according to retention config");
+  ceph_assert(r == 0);
   l.lock();
   // add ourselves as a conf observer
@@ -1031,6 +1115,9 @@ int Monitor::init()
   // add features of myself into feature_map
   session_map.feature_map.add_mon(con_self->get_features());
+
+  backup_manager = std::make_unique<MonitorBackupManager>(cct, this);
+
   return 0;
 }
@@ -1126,6 +1213,9 @@ void Monitor::shutdown()
     delete admin_hook;
     admin_hook = NULL;
   }
+  if (backup_manager) {
+    backup_manager->stop();
+  }
   elector.shutdown();
@@ -6119,6 +6209,7 @@ void Monitor::tick()
     prepare_new_fingerprint(t);
     paxos->trigger_propose();
   }
+  backup_manager->tick();
   mgr_client.update_daemon_health(get_health_metrics());
   new_tick();
diff --git a/src/mon/Monitor.h b/src/mon/Monitor.h
index cc7d7b12e02..a914969ed35 100644
--- a/src/mon/Monitor.h
+++ b/src/mon/Monitor.h
@@ -50,6 +50,7 @@
 #include "include/CompatSet.h"
 #include "mon/MonitorDBStore.h"
 #include "mon/mon_types.h" // for Metadata, PAXOS_*, ScrubResult
+#include "mon/MonitorBackup.h"
 #include "mgr/MgrClient.h"
 #include
 #include
@@ -100,6 +101,25 @@ enum {
   l_mon_election_call,
   l_mon_election_win,
   l_mon_election_lose,
+  l_mon_backup_running,
+  l_mon_backup_started,
+  l_mon_backup_success,
+  l_mon_backup_failed,
+  l_mon_backup_duration,
+  l_mon_backup_last_success,
+  l_mon_backup_last_success_id,
+  l_mon_backup_last_failed,
+  l_mon_backup_last_size,
+  l_mon_backup_last_files,
+
l_mon_backup_cleanup_started,
+  l_mon_backup_cleanup_running,
+  l_mon_backup_cleanup_success,
+  l_mon_backup_cleanup_failed,
+  l_mon_backup_cleanup_size,
+  l_mon_backup_cleanup_kept,
+  l_mon_backup_cleanup_duration,
+  l_mon_backup_cleanup_freed,
+  l_mon_backup_cleanup_deleted,
   l_mon_last,
 };
@@ -1001,6 +1021,8 @@ private:
   OpTracker op_tracker;
+  std::unique_ptr<MonitorBackupManager> backup_manager;
+
 public:
   Monitor(CephContext *cct_, std::string nm, MonitorDBStore *s,
           Messenger *m, Messenger *mgr_m, MonMap *map);
@@ -1046,6 +1068,10 @@ private:
             std::ostream& err,
             std::ostream& out);
+  // Execute mon database backup
+  int perform_backup();
+  int cleanup_backup();
+
 private:
   // don't allow copying
   Monitor(const Monitor& rhs);
diff --git a/src/mon/MonitorBackup.cc b/src/mon/MonitorBackup.cc
new file mode 100644
index 00000000000..aff1bebdaf8
--- /dev/null
+++ b/src/mon/MonitorBackup.cc
@@ -0,0 +1,214 @@
+// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
+// vim: ts=8 sw=2 smarttab
+/*
+* Ceph - scalable distributed file system
+*
+* Copyright (C) 2021 B1-Systems GmbH
+*
+* This is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License version 2.1, as published by the Free Software
+* Foundation. See file COPYING.
+*/
+
+#include
+#include
+
+#include "include/util.h"
+#include "mon/MonitorBackup.h"
+#include "mon/Monitor.h"
+
+#define dout_subsys ceph_subsys_mon
+#undef dout_context
+#define dout_context cct
+
+namespace fs = std::filesystem;
+
+/***
+ * Thread which runs monitor backup operations
+ */
+void *MonitorBackupManager::entry() {
+  std::unique_lock lock{mutex};
+  auto wakeup_predicate = [this] {
+    return should_stop || should_backup || should_cleanup || wakeup_pending;
+  };
+  while (true) {
+    // Wait for any signal (tick, request, or stop) before doing
+    // scheduled work.
Predicate-based so notifications delivered
+    // before the worker entered wait() are not lost, and so we don't
+    // fire scheduled work while init() is still wiring the mon up.
+    work_cond.wait(lock, wakeup_predicate);
+    if (should_stop) {
+      return nullptr;
+    }
+    wakeup_pending = false;
+
+    auto now = ceph_clock_now();
+    std::string backup_path = cct->_conf.get_val<std::string>("mon_backup_path");
+    auto interval = cct->_conf.get_val<std::chrono::seconds>("mon_backup_interval");
+    auto cleanup_interval = cct->_conf.get_val<std::chrono::seconds>("mon_backup_cleanup_interval");
+    bool path_ok = !backup_path.empty();
+
+    bool timer_backup = false;
+    bool timer_cleanup = false;
+
+    if (path_ok && interval.count() > 0 &&
+        (mon->is_leader() || mon->is_peon())) {
+      if (!last_backup) {
+        dout(10) << "trigger first timed backup" << dendl;
+        timer_backup = true;
+      } else if ((now - last_backup->timestamp) >= utime_t(interval.count(), 0)) {
+        dout(10) << "trigger timed backup" << dendl;
+        timer_backup = true;
+      }
+    }
+    if (path_ok && cleanup_interval.count() > 0) {
+      if (!last_cleanup) {
+        dout(10) << "trigger first timed backup cleanup" << dendl;
+        timer_cleanup = true;
+      } else if ((now - last_cleanup->timestamp) >= utime_t(cleanup_interval.count(), 0)) {
+        dout(10) << "trigger timed backup cleanup" << dendl;
+        timer_cleanup = true;
+      }
+    }
+
+    bool run_cleanup = should_cleanup || timer_cleanup;
+    bool run_backup = should_backup || timer_backup;
+
+    should_backup = false;
+    should_cleanup = false;
+
+    if (!run_backup && !run_cleanup) {
+      continue;
+    }
+
+    lock.unlock();
+    if (run_cleanup) {
+      do_cleanup();
+    }
+    if (run_backup) {
+      do_backup();
+    }
+    lock.lock();
+  }
+}
+
+void MonitorBackupManager::stop() {
+  {
+    std::lock_guard guard{mutex};
+    if (should_stop) {
+      return;
+    }
+    should_stop = true;
+    work_cond.notify_one();
+  }
+  join();
+}
+
+void MonitorBackupManager::do_cleanup() {
+  dout(5) << "start backup cleanup" << dendl;
+  mon->logger->inc(l_mon_backup_cleanup_started);
+  mon->logger->set(l_mon_backup_cleanup_running,
1);
+  auto start = ceph_clock_now();
+  KeyValueDB::BackupCleanupStats stats = mon->store->backup_cleanup();
+  mon->logger->set(l_mon_backup_cleanup_size, stats.size);
+  mon->logger->set(l_mon_backup_cleanup_kept, stats.kept);
+  mon->logger->set(l_mon_backup_cleanup_freed, stats.freed);
+  mon->logger->set(l_mon_backup_cleanup_deleted, stats.deleted);
+  if (stats.error) {
+    mon->logger->inc(l_mon_backup_cleanup_failed);
+  } else {
+    mon->logger->inc(l_mon_backup_cleanup_success);
+  }
+  auto ptr = std::make_shared<KeyValueDB::BackupCleanupStats>(stats);
+  last_cleanup.swap(ptr);
+  auto end = ceph_clock_now();
+  utime_t duration = end - start;
+  mon->logger->tinc(l_mon_backup_cleanup_duration, duration);
+  mon->logger->set(l_mon_backup_cleanup_running, 0);
+}
+
+void MonitorBackupManager::record_last_backup(std::shared_ptr<KeyValueDB::BackupStats> stats) {
+  if (stats->error && last_backup) {
+    stats->id = last_backup->id;
+  }
+  last_backup.swap(stats);
+}
+
+// Returns true if there is enough free space on the backup volume.
+bool MonitorBackupManager::check_free_space() {
+  auto backup_path = cct->_conf.get_val<std::string>("mon_backup_path");
+
+  std::error_code ec;
+  if (!fs::exists(backup_path, ec)) {
+    if (!fs::create_directories(backup_path, ec)) {
+      dout(1) << "failed to create monitor backup directory '"
+              << backup_path << "': " << ec.message() << dendl;
+      return false;
+    }
+    fs::permissions(backup_path, fs::perms::owner_all,
+                    fs::perm_options::replace, ec);
+    if (ec) {
+      dout(1) << "failed to set permissions on monitor backup directory '"
+              << backup_path << "': " << ec.message() << dendl;
+      return false;
+    }
+    dout(5) << "created monitor backup directory '" << backup_path
+            << "'" << dendl;
+  }
+
+  ceph_data_stats_t stats;
+  int err = get_fs_stats(stats, backup_path.c_str());
+  if (err < 0) {
+    dout(1) << "error checking monitor backup directory: " << cpp_strerror(err)
+            << dendl;
+    return false;
+  }
+
+  if (stats.avail_percent <= cct->_conf.get_val("mon_backup_min_avail")) {
+    dout(1) << "ERROR: not enough disk space to
start backup: " << "(available: "
+            << stats.avail_percent << "% " << byte_u_t(stats.byte_avail) << ")\n"
+            << "run backup_cleanup regularly or decrease mon_backup_min_avail" << dendl;
+    return false;
+  }
+  return true;
+}
+
+void MonitorBackupManager::do_backup() {
+  dout(1) << "start backup" << dendl;
+  mon->logger->inc(l_mon_backup_started);
+  mon->logger->set(l_mon_backup_running, 1);
+  auto start = ceph_clock_now();
+
+  std::shared_ptr<KeyValueDB::BackupStats> result;
+
+  if (!check_free_space()) {
+    mon->logger->inc(l_mon_backup_failed);
+    mon->logger->tset(l_mon_backup_last_failed, start);
+    result = std::make_shared<KeyValueDB::BackupStats>();
+    result->error = true;
+    result->timestamp = start;
+    result->msg = "insufficient free space";
+  } else {
+    KeyValueDB::BackupStats stats = mon->store->backup();
+    utime_t duration = ceph_clock_now() - start;
+    mon->logger->tinc(l_mon_backup_duration, duration);
+    mon->logger->set(l_mon_backup_last_size, stats.size);
+    mon->logger->set(l_mon_backup_last_files, stats.number_files);
+    if (stats.error) {
+      mon->logger->inc(l_mon_backup_failed);
+      mon->logger->tset(l_mon_backup_last_failed, stats.timestamp);
+      dout(1) << "failed backup in " << utimespan_str(duration) << dendl;
+    } else {
+      mon->logger->inc(l_mon_backup_success);
+      mon->logger->tset(l_mon_backup_last_success, stats.timestamp);
+      mon->logger->set(l_mon_backup_last_success_id, stats.id);
+      dout(1) << "finished backup in " << utimespan_str(duration) << dendl;
+    }
+    result = std::make_shared<KeyValueDB::BackupStats>(stats);
+  }
+
+  record_last_backup(result);
+  mon->logger->set(l_mon_backup_running, 0);
+}
+
diff --git a/src/mon/MonitorBackup.h b/src/mon/MonitorBackup.h
new file mode 100644
index 00000000000..74d5fc879b5
--- /dev/null
+++ b/src/mon/MonitorBackup.h
@@ -0,0 +1,101 @@
+// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
+// vim: ts=8 sw=2 smarttab
+/*
+* Ceph - scalable distributed file system
+*
+* Copyright (C) 2021 B1-Systems GmbH
+*
+* This is free software; you can redistribute it and/or
+*
modify it under the terms of the GNU Lesser General Public
+* License version 2.1, as published by the Free Software
+* Foundation. See file COPYING.
+*/
+
+
+#ifndef CEPH_MONITOR_BACKUP_H
+#define CEPH_MONITOR_BACKUP_H
+
+#include
+#include
+#include
+#include
+
+
+#include "common/Thread.h"
+#include "common/ceph_context.h"
+#include "common/ceph_mutex.h"
+#include "kv/KeyValueDB.h"
+#include "mon/MonitorDBStore.h"
+
+class Monitor;
+
+class MonitorBackupManager : public Thread {
+  CephContext *cct;
+  Monitor *mon;
+  ceph::mutex mutex;
+  ceph::condition_variable work_cond;
+  bool should_stop{false};
+  // set by tick(); a sticky flag so a notification delivered before the
+  // worker enters wait() is not lost. cleared each time the worker
+  // re-evaluates timer triggers.
+  bool wakeup_pending{false};
+
+  bool should_backup{false};
+  bool should_cleanup{false};
+  uint64_t last_job_id{0};
+  std::shared_ptr<KeyValueDB::BackupCleanupStats> last_cleanup;
+  std::shared_ptr<KeyValueDB::BackupStats> last_backup;
+
+  void do_backup();
+  void do_cleanup();
+  bool check_free_space();
+  void record_last_backup(std::shared_ptr<KeyValueDB::BackupStats> stats);
+protected:
+  void *entry() override;
+
+public:
+  explicit MonitorBackupManager(CephContext *cct, Monitor *monitor) :
+    cct(cct),
+    mon(monitor),
+    mutex(ceph::make_mutex("mon::BackupManager::mutex")) {
+    create("mon::backups");
+  }
+
+  void tick() {
+    std::lock_guard guard{mutex};
+    if (should_stop) {
+      return;
+    }
+    wakeup_pending = true;
+    work_cond.notify_one();
+  }
+
+  /**
+   * Stop the backup manager thread. Safe to call more than once.
+   **/
+  void stop();
+  /**
+   * Start a new backup.
+   * @returns {uint64_t} new job id
+   **/
+  uint64_t backup() {
+    std::lock_guard guard{mutex};
+    should_backup = true;
+    uint64_t rv = ++last_job_id;
+    work_cond.notify_one();
+    return rv;
+  }
+
+  /// Queue a cleanup pass.
+  uint64_t cleanup() {
+    std::lock_guard guard{mutex};
+    should_cleanup = true;
+    uint64_t rv = ++last_job_id;
+    work_cond.notify_one();
+    return rv;
+  }
+
+};
+
+#endif
+
diff --git a/src/mon/MonitorDBStore.h b/src/mon/MonitorDBStore.h
index a645bb7e4ae..c316d23768b 100644
--- a/src/mon/MonitorDBStore.h
+++ b/src/mon/MonitorDBStore.h
@@ -14,6 +14,8 @@
 #ifndef CEPH_MONITOR_DB_STORE_H
 #define CEPH_MONITOR_DB_STORE_H
+#include
+#include
 #include
 #include
 #include
@@ -32,6 +34,7 @@
 #include "common/Clock.h"
 #include "common/debug.h"
 #include "common/safe_io.h"
+#include "common/strtol.h"
 #include "common/blkdev.h"
 #include "common/PriorityCache.h"
 #include "common/version.h"
@@ -65,6 +68,11 @@ class MonitorDBStore
     return path;
   }
+  // returns the database store path
+  static std::string get_store_path(const std::string& path) {
+    return (std::filesystem::path(path) / "store.db").string();
+  }
+
   std::shared_ptr<PriorityCache::PriCache> get_priority_cache() const {
     return db->get_priority_cache();
   }
@@ -631,14 +639,7 @@ class MonitorDBStore
   }
   void _open(const std::string& kv_type) {
-    int pos = 0;
-    for (auto rit = path.rbegin(); rit != path.rend(); ++rit, ++pos) {
-      if (*rit != '/')
-        break;
-    }
-    std::ostringstream os;
-    os << path.substr(0, path.size() - pos) << "/store.db";
-    std::string full_path = os.str();
+    std::string full_path = get_store_path(path);
     KeyValueDB *db_ptr = KeyValueDB::create(g_ceph_context,
                                             kv_type,
@@ -691,7 +692,7 @@ class MonitorDBStore
     if (r < 0)
       return r;
-    // Monitors are few in number, so the resource cost of exposing
+    // Monitors are few in number, so the resource cost of exposing
     // very detailed stats is low: ramp up the priority of all the
     // KV store's perf counters. Do this after open, because backend may
     // not have constructed PerfCounters earlier.
@@ -743,6 +744,200 @@ class MonitorDBStore
     db.reset(NULL);
   }
+  /// @brief Creates a backup of the database under mon_backup_path.
+  /// @return stats describing the created backup
+  KeyValueDB::BackupStats backup() {
+    auto backup_path = g_conf().get_val<std::string>("mon_backup_path");
+    auto stats = db->backup(backup_path);
+    if (!stats.error) {
+      // Stash the mon keyring alongside the rocksdb backup, keyed by
+      // backup id, so a restore of an older version is paired with the
+      // keyring of that vintage. The [mon.] secret can rotate; using a
+      // single fixed filename would break authentication after restore.
+      std::error_code ec;
+      auto dest = backup_path + "/keyring." + std::to_string(stats.id);
+      std::filesystem::copy_file(
+        path + "/keyring", dest,
+        std::filesystem::copy_options::overwrite_existing, ec);
+      if (!ec) {
+        std::filesystem::permissions(dest,
+          std::filesystem::perms::owner_read | std::filesystem::perms::owner_write,
+          std::filesystem::perm_options::replace, ec);
+      }
+      if (ec) {
+        // Best-effort: a rocksdb backup without a stashed keyring is still
+        // valid; the operator can supply a keyring out-of-band on restore.
+        derr << __func__ << " failed to stash keyring at "
+             << dest << ": " << ec.message() << dendl;
+      }
+    }
+    return stats;
+  }
+
+  /// @brief Remove old backups in mon_backup_path according to the retention config.
+  /// @return stats describing what was kept, deleted, and freed
+  KeyValueDB::BackupCleanupStats backup_cleanup() {
+    auto backup_path = g_conf().get_val<std::string>("mon_backup_path");
+    auto stats = db->backup_cleanup(
+      backup_path,
+      g_conf().get_val<uint64_t>("mon_backup_keep_last"),
+      g_conf().get_val<uint64_t>("mon_backup_keep_hourly"),
+      g_conf().get_val<uint64_t>("mon_backup_keep_daily"));
+    if (stats.error) {
+      return stats;
+    }
+    // Remove keyring.<id> files for backup ids the kv layer just dropped.
+    std::string kv_type;
+    if (read_meta("kv_backend", &kv_type) < 0 || kv_type.empty()) {
+      kv_type = "rocksdb";
+    }
+    std::set surviving;
+    auto remaining = KeyValueDB::list_backups(g_ceph_context, kv_type, backup_path);
+    if (!remaining) {
+      return stats;
+    }
+    for (const auto& b : *remaining) {
+      surviving.insert(b.id);
+    }
+    std::error_code ec;
+    for (auto it = std::filesystem::directory_iterator(backup_path, ec);
+         it != std::filesystem::directory_iterator();
+         it.increment(ec)) {
+      auto name = it->path().filename().string();
+      if (name.compare(0, 8, "keyring.") != 0) {
+        continue;
+      }
+      std::string idstr = name.substr(8);
+      std::string parse_err;
+      long long id = strict_strtoll(idstr.c_str(), 10, &parse_err);
+      if (!parse_err.empty() || id < 0) {
+        continue;
+      }
+      if (surviving.count(static_cast(id))) {
+        continue;
+      }
+      std::error_code rm_ec;
+      std::filesystem::remove(it->path(), rm_ec);
+    }
+    return stats;
+  }
+
+  /// @brief List all backup versions at backup_path.
+  /// @param cct ceph context
+  /// @param path path to the local mon data dir (used to discover the kv backend)
+  /// @param backup_path path to the backup location
+  /// @return list of BackupStats, one per backup
+  static std::optional<std::vector<KeyValueDB::BackupStats>> list_backups(
+    CephContext *cct, const std::string &path, const std::string &backup_path) {
+    std::string kv_type;
+    int r = read_meta_path("kv_backend", &kv_type, path);
+    if (r < 0 || kv_type.empty()) {
+      // Disaster recovery: mon_data may be empty or absent. We only ship
+      // a rocksdb kv backend today, so default to it for enumeration.
+      kv_type = "rocksdb";
+    }
+    return KeyValueDB::list_backups(cct, kv_type, backup_path);
+  }
+
+
+  /// @brief Restore the backup with the given version from backup_path into path.
+  /// @param cct ceph context
+  /// @param path path to the local mon data dir to restore into
+  /// @param backup_path path to the backup location
+  /// @param version version of the backup to restore (nullopt for latest)
+  /// @return true on success
+  static bool restore_backup(CephContext *cct, const std::string &path,
+                             const std::string &backup_path,
+                             const std::optional &version) {
+    std::string kv_type;
+    int r = read_meta_path("kv_backend", &kv_type, path);
+    if (r < 0 || kv_type.empty()) {
+      // Disaster recovery: mon_data is empty or freshly initialized, so
+      // there is no kv_backend marker. Default to rocksdb and stamp the
+      // file back so the subsequent open() finds it.
+      kv_type = "rocksdb";
+      std::error_code ec;
+      std::filesystem::create_directories(path, ec);
+      if (ec) {
+        lderr(cct) << __func__ << " failed to create " << path
+                   << ": " << ec.message() << dendl;
+        return false;
+      }
+      std::filesystem::permissions(path,
+        std::filesystem::perms::owner_all,
+        std::filesystem::perm_options::replace, ec);
+      const std::string v = kv_type + "\n";
+      if (safe_write_file(path.c_str(), "kv_backend",
+                          v.c_str(), v.length(), 0600) < 0) {
+        lderr(cct) << __func__ << " failed to write kv_backend in "
+                   << path << dendl;
+        return false;
+      }
+    }
+    std::string store_path = get_store_path(path);
+
+    // Resolve "latest" up front so we know which versioned keyring to
+    // rehydrate alongside the rocksdb restore. Pick by BackupEngine id
+    // (monotonic per rocksdb) rather than timestamp, so a clock skew
+    // between backups cannot make the default restore pick a stale one.
+    uint32_t resolved_version;
+    if (version) {
+      resolved_version = *version;
+    } else {
+      auto backups = KeyValueDB::list_backups(cct, kv_type, backup_path);
+      if (!backups || backups->empty()) {
+        lderr(cct) << __func__ << " no backups found at " << backup_path << dendl;
+        return false;
+      }
+      resolved_version = std::max_element(
+          backups->begin(), backups->end(),
+          [](const auto& a, const auto& b) { return a.id < b.id; })->id;
+    }
+
+    if (!KeyValueDB::restore_backup(cct, kv_type, store_path, backup_path,
+                                    resolved_version)) {
+      return false;
+    }
+
+    // Rehydrate the matching keyring (skipped silently if the operator
+    // keeps the keyring out-of-band).
+    std::error_code ec;
+    auto keyring_src = backup_path + "/keyring." + std::to_string(resolved_version);
+    if (std::filesystem::exists(keyring_src, ec)) {
+      std::filesystem::copy_file(
+          keyring_src,
+          path + "/keyring",
+          std::filesystem::copy_options::overwrite_existing,
+          ec);
+      if (ec) {
+        lderr(cct) << __func__ << " failed to restore keyring from "
+                   << keyring_src << ": " << ec.message() << dendl;
+        return false;
+      }
+    }
+
+    // The mon store holds auth, config-key and dm-crypt secrets;
+    // tighten everything we just restored to owner-only.
+    std::filesystem::permissions(path,
+                                 std::filesystem::perms::owner_all,
+                                 std::filesystem::perm_options::replace, ec);
+    for (auto it = std::filesystem::recursive_directory_iterator(path, ec);
+         it != std::filesystem::recursive_directory_iterator();
+         it.increment(ec)) {
+      std::error_code ec_chmod;
+      auto perms = it->is_directory(ec_chmod)
+        ? std::filesystem::perms::owner_all
+        : (std::filesystem::perms::owner_read | std::filesystem::perms::owner_write);
+      std::filesystem::permissions(it->path(), perms,
+                                   std::filesystem::perm_options::replace, ec_chmod);
+      if (ec_chmod) {
+        lderr(cct) << __func__ << " failed to chmod " << it->path()
+                   << ": " << ec_chmod.message() << dendl;
+      }
+    }
+    return true;
+  }
+
   void compact() {
     db->compact();
   }
@@ -788,7 +983,7 @@ class MonitorDBStore
   /**
    * read_meta - read a simple configuration key out-of-band
    *
-   * Read a simple key value to an unopened/mounted store.
+   * Read a simple key value from an unopened/unmounted store.
    *
    * Trailing whitespace is stripped off.
    *
@@ -798,6 +993,24 @@ class MonitorDBStore
    */
   int read_meta(const std::string& key,
                 std::string *value) const {
+    return read_meta_path(key, value, path);
+  }
+
+  /**
+   * read_meta_path - read a simple configuration key out-of-band
+   *
+   * Read a simple key value from a specified path store.
+   *
+   * Trailing whitespace is stripped off.
+   *
+   * @param key key name
+   * @param value pointer to value string
+   * @param path path to directory
+   * @returns 0 for success, or an error code
+   */
+  static int read_meta_path(const std::string& key,
+                            std::string *value,
+                            const std::string& path) {
     char buf[4096];
     int r = safe_read_file(path.c_str(), key.c_str(),
                            buf, sizeof(buf));
diff --git a/src/vstart.sh b/src/vstart.sh
index b478c00b090..7f5421bffce 100755
--- a/src/vstart.sh
+++ b/src/vstart.sh
@@ -1157,6 +1157,7 @@ start_mon() {
 [mon.$f]
         host = $HOSTNAME
         mon data = $CEPH_DEV_DIR/mon.$f
+        mon backup path = $CEPH_DEV_DIR/mon.$f-backup
 EOF
         count=$(($count + 2))
     done