For more information about the pool flags see :ref:`Pool values <setpoolvalues>`.
+Monitor Backup
+==============
+
+In normal operation, Monitor backups are not required: surviving members
+of the Monitor quorum re-sync new or replaced Monitors automatically,
+and the Monitor store can be largely rebuilt from the OSDs after total
+quorum loss (see :ref:`mon-store-recovery-using-osds`).
+
+However, recovery from OSDs restores only the state that the OSDs can
+report, such as osdmap history and the auth keys associated with
+running OSDs. Some Monitor state has no copy outside the Monitor store
+and is unrecoverable if all Monitors are lost:
+
+* **Encryption keys for dm-crypt OSDs** are stored only in the
+ Monitor's ``config-key`` store under
+ ``dm-crypt/osd/<osd-uuid>/luks``. Without these keys the underlying
+ block devices cannot be unlocked, even when the OSD daemons and data
+ are physically intact.
+* **Cephadm orchestrator state** under ``mgr/cephadm/*`` in the
+ ``config-key`` store, including host inventory, daemon placement
+ specs, and service definitions.
+* **Config-key entries** populated by users or third-party tooling via
+ ``ceph config-key set``.
+* **Dashboard and manager module state** persisted to the
+ ``config-key`` store.
+
+A Monitor backup lets an operator restore this state if all running
+Monitors are lost. It is most valuable for clusters that use
+dm-crypt-encrypted OSDs or that depend heavily on cephadm-managed
+deployment state, where loss of the Monitor store would be a
+protracted data-availability incident rather than a recoverable inconvenience.
+
+Monitor backups complement, but do not replace, the existing Monitor
+recovery procedures. They are not a means of "undoing" cluster-level
+operations such as pool deletion or CRUSH changes: once OSDs have
+observed and acted on a newer osdmap, restoring an older Monitor store
+does not roll back the OSD-side effects.
+
+The Ceph Monitor uses the native RocksDB ``BackupEngine`` to create
+consistent snapshots of its store, which can be copied elsewhere
+without downtime.
+
+When :confval:`mon_backup_interval` is set to a nonzero value N, the
+Monitor triggers a backup every N seconds. Pair it with
+:confval:`mon_backup_cleanup_interval`: if only the backup interval is
+set, backups accumulate indefinitely because retention is applied only
+during cleanup.
+
+Backups share table files (``.sst``) within the backup directory: each
+new backup only copies SSTables that the running database has produced
+since the previous backup. Restoring any individual backup version is
+independent of the others, but the on-disk files for every version
+live in a single shared tree under ``mon_backup_path``.
+
+Layout of the Backup Directory
+------------------------------
+
+A backup path managed by the RocksDB ``BackupEngine`` contains three
+top-level directories plus a per-version copy of the Monitor keyring::
+
+ /path/to/backups/
+ ├── meta/
+ │ ├── 1
+ │ ├── 2
+ │ └── 3
+ ├── private/
+ │ ├── 1/
+ │ ├── 2/
+ │ └── 3/
+ ├── shared_checksum/
+ │ ├── 000007.sst
+ │ ├── 000010.sst
+ │ └── ...
+ ├── keyring.1
+ ├── keyring.2
+ └── keyring.3
+
+* ``meta/<N>`` is a metadata file describing logical backup version
+ ``N``.
+* ``private/<N>/`` contains files unique to backup ``N`` (RocksDB
+ descriptors and other per-version state).
+* ``shared_checksum/`` contains SSTables shared across backup
+ versions. A single SSTable in this directory may belong to several
+ versions; the BackupEngine deletes a file from here only when no
+ remaining backup references it.
+* ``keyring.<N>`` is a copy of ``$mon_data/keyring`` taken at the time
+ of backup ``N``. The Monitor needs this key to authenticate at
+ startup, and it is not stored inside the RocksDB database; restoring
+ version ``N`` copies the matching ``keyring.<N>`` back into
+ ``$mon_data`` so an older snapshot is paired with the keyring of its
+ vintage. Cleanup removes ``keyring.<N>`` files whose backup version
+ has been pruned.
+
+ Stashing the keyring is **best effort**: if the copy fails (for
+ example, permission denied on the backup directory or out of space
+ after the RocksDB snapshot completed), the backup is still recorded
+ as successful and the RocksDB data remains usable. On restore, a
+ missing ``keyring.<N>`` is silently skipped, and the operator must
+ supply the Monitor keyring out-of-band before starting the daemon.
+
+.. warning::
+
+ The backup directory contains the ``[mon.]`` private key. Treat
+ it with the same access controls as ``$mon_data`` itself; a
+ complete backup is sufficient material to impersonate a Monitor
+ in the cluster.
+
+Do not delete a ``private/<N>/`` directory or files under
+``shared_checksum/`` by hand: removing a referenced shared file
+corrupts every backup that points to it. Use ``backup_cleanup`` or
+the configured retention parameters to remove old versions so the
+BackupEngine can release shared files safely.
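
The release rule described above can be illustrated with a small model. This is a minimal sketch, not the BackupEngine's code: the ``references`` mapping (backup ID to the set of shared file names it uses) and the function name are hypothetical, introduced only to show why a shared file can be freed only after every backup that references it has been pruned.

```python
def releasable_shared_files(references, pruned_ids):
    """Return the shared files that become unreferenced after pruning.

    references: hypothetical mapping of backup ID -> set of file names
                under shared_checksum/ that the backup points to.
    pruned_ids: set of backup IDs being removed.
    """
    still_needed = set()
    candidates = set()
    for backup_id, files in references.items():
        if backup_id in pruned_ids:
            candidates |= files      # files the pruned backups used
        else:
            still_needed |= files    # files a surviving backup still uses
    # A file may be released only if no surviving backup references it.
    return candidates - still_needed
```

In this model, pruning backup ``1`` frees nothing while backup ``2`` still uses the same SSTable, which is why deleting files under ``shared_checksum/`` by hand is unsafe.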
+
+If the Monitor cluster fails and you need to copy a backup elsewhere,
+copy the entire ``/path/to/backups/`` directory. Copying only
+``private/<N>/`` is not sufficient; the version's SSTables live in
+``shared_checksum/``.
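
A whole-tree copy with a basic layout check might look like the sketch below. The helper names are hypothetical, and the check only confirms that the three top-level directories named in the layout exist; it does not validate the backups themselves.

```python
import shutil
from pathlib import Path

# Top-level directories that the layout section says a backup path contains.
REQUIRED_DIRS = ("meta", "private", "shared_checksum")


def missing_layout_dirs(backup_root):
    """Return the required top-level directories absent from backup_root."""
    root = Path(backup_root)
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]


def copy_backup_tree(src, dst):
    """Copy the entire backup tree; refuse if the source looks incomplete."""
    missing = missing_layout_dirs(src)
    if missing:
        raise ValueError(f"incomplete backup directory, missing: {missing}")
    # copytree copies every subdirectory, so the shared SSTables travel
    # together with the per-version metadata that references them.
    shutil.copytree(src, dst)
    return missing_layout_dirs(dst) == []
```

Because the copy includes the ``keyring.<N>`` files, the destination needs the same access controls as the source.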
+
+
+The :confval:`mon_backup_cleanup_interval` option sets how often backup
+cleanup runs. Cleanup always retains the newest backup and the last
+:confval:`mon_backup_keep_last` backups. It then retains the newest
+backup in each of the most recent :confval:`mon_backup_keep_hourly`
+hourly windows and :confval:`mon_backup_keep_daily` daily windows.
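
The retention rule can be sketched as follows. This is an illustration under stated assumptions, not the Monitor's implementation: backups are modeled as hypothetical ``(backup_id, unix_timestamp)`` pairs, windows are assumed to be aligned to whole hours and whole days counted back from ``now``, and each window is half-open (``start <= ts < end``).

```python
def select_backups_to_keep(backups, now, keep_last, keep_hourly, keep_daily):
    """Return the set of backup IDs a cleanup pass would retain.

    backups: iterable of (backup_id, unix_timestamp) pairs (assumed shape).
    now:     current time as integer Unix seconds.
    """
    # Newest first; equal timestamps are ordered by the higher backup ID.
    ordered = sorted(backups, key=lambda b: (b[1], b[0]), reverse=True)
    keep = set()
    if not ordered:
        return keep
    # The newest backup is always retained, whatever the settings say.
    keep.add(ordered[0][0])
    # Keep the last N backups.
    keep.update(backup_id for backup_id, _ in ordered[:keep_last])
    # Keep the newest backup in each hourly window, then each daily window.
    for width, count in ((3600, keep_hourly), (86400, keep_daily)):
        current_start = now - now % width  # start of the current window
        for i in range(count):
            start = current_start - i * width
            end = start + width
            for backup_id, ts in ordered:  # newest first
                if start <= ts < end:
                    keep.add(backup_id)
                    break
    return keep
```

Everything not in the returned set would be a candidate for deletion; a single backup can satisfy several rules at once, so the retained count is often lower than the sum of the three settings.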
+
+You can trigger ``backup`` and ``backup_cleanup`` through any running
+Monitor's admin socket.
+
+.. prompt:: bash #
+
+ ceph --admin-daemon .../mon.asok backup
+
+.. prompt:: bash #
+
+ ceph --admin-daemon .../mon.asok backup_cleanup
+
+``ceph-mon`` tracks the following performance counters for the Monitor
+backup process.
+
+.. list-table:: Ceph Monitor Backup Metrics
+ :widths: 30 12 58
+ :header-rows: 1
+
+ * - Name
+ - Type
+ - Description
+ * - ``backup_running``
+ - Gauge
+ - ``1`` while a backup is in progress, ``0`` otherwise
+ * - ``backup_started``
+ - Counter
+ - Backup attempts (includes attempts rejected at the :confval:`mon_backup_min_avail` pre-flight check)
+ * - ``backup_success``
+ - Counter
+ - Backups completed by the ``BackupEngine``
+ * - ``backup_failed``
+ - Counter
+ - Failed backup attempts (pre-flight or ``BackupEngine``)
+ * - ``backup_duration``
+ - Average
+ - Backup wall-clock time
+ * - ``backup_last_success``
+ - Gauge
+ - Unix timestamp (epoch seconds) of the most recent successful backup, or ``0`` if none
+ * - ``backup_last_success_id``
+ - Gauge
+ - ``BackupEngine`` version ID of the most recent successful backup (the value passed to ``--restore-backup --backup-version``)
+ * - ``backup_last_failed``
+ - Gauge
+ - Unix timestamp (epoch seconds) of the most recent failed attempt, or ``0`` if none
+ * - ``backup_last_size``
+ - Gauge
+ - Payload size in bytes of the most recent attempt (may be partial on failure)
+ * - ``backup_last_files``
+ - Gauge
+ - File count of the most recent attempt (may be partial on failure)
+ * - ``backup_cleanup_started``
+ - Counter
+ - Cleanup invocations
+ * - ``backup_cleanup_running``
+ - Gauge
+ - ``1`` while a cleanup is in progress, ``0`` otherwise
+ * - ``backup_cleanup_success``
+ - Counter
+ - Cleanup passes completed without error
+ * - ``backup_cleanup_failed``
+ - Counter
+ - Failed cleanup passes
+ * - ``backup_cleanup_duration``
+ - Average
+ - Cleanup wall-clock time
+ * - ``backup_cleanup_kept``
+ - Gauge
+ - Backups retained by the most recent cleanup pass
+ * - ``backup_cleanup_deleted``
+ - Gauge
+ - Backups removed by the most recent cleanup pass
+ * - ``backup_cleanup_freed``
+ - Gauge
+ - Bytes released by the most recent cleanup pass (overstated when shared backups are in use, because the ``BackupEngine`` payload sum ignores file sharing)
+ * - ``backup_cleanup_size``
+ - Gauge
+ - Bytes retained by the most recent cleanup pass (same sharing caveat as ``backup_cleanup_freed``)
+
+The ``backup_started``/``backup_success``/``backup_failed`` and
+``backup_cleanup_started``/``backup_cleanup_success``/``backup_cleanup_failed``
+counters are monotonic and accumulate for the lifetime of the
+``ceph-mon`` process. The ``backup_last_*`` fields and the four
+cleanup result gauges (``backup_cleanup_kept``, ``backup_cleanup_deleted``,
+``backup_cleanup_freed``, ``backup_cleanup_size``) describe only the
+most recent invocation and are overwritten on every pass.
+
+To retrieve backup metrics from a running monitor's admin socket:
+
+.. prompt:: bash #
+
+ ceph --admin-daemon .../mon.asok perf dump | jq '.["mon"] | with_entries(select(.key | startswith("backup_")))'
+
+.. code-block:: json
+
+ {
+ "backup_running": 0,
+ "backup_started": 2,
+ "backup_success": 2,
+ "backup_failed": 0,
+ "backup_duration": {
+ "avgcount": 2,
+ "sum": 0.149076498,
+ "avgtime": 0.074538249
+ },
+ "backup_last_success": 1722001989.849262,
+ "backup_last_success_id": 3,
+ "backup_last_failed": 0,
+ "backup_last_size": 3924677,
+ "backup_last_files": 6,
+ "backup_cleanup_started": 1,
+ "backup_cleanup_running": 0,
+ "backup_cleanup_success": 1,
+ "backup_cleanup_failed": 0,
+ "backup_cleanup_size": 86144,
+ "backup_cleanup_kept": 1,
+ "backup_cleanup_duration": {
+ "avgcount": 1,
+ "sum": 0.002031246,
+ "avgtime": 0.002031246
+ },
+ "backup_cleanup_freed": 0,
+ "backup_cleanup_deleted": 0
+ }
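
For scripted checks, the same counters can be read from the full ``perf dump`` JSON. The sketch below assumes the document has a top-level ``mon`` section containing the ``backup_*`` fields shown above; the function name and the shape of the returned summary are illustrative only.

```python
import json


def summarize_backup_counters(perf_dump_json, now):
    """Summarize the backup_* counters from a `perf dump` JSON document.

    perf_dump_json: the raw JSON text (assumed to contain a "mon" section).
    now:            current time as Unix seconds.
    """
    mon = json.loads(perf_dump_json)["mon"]
    duration = mon["backup_duration"]
    runs = duration["avgcount"]
    last_success = mon["backup_last_success"]
    return {
        "attempts": mon["backup_started"],
        "failures": mon["backup_failed"],
        # Mean wall-clock seconds per backup; None until one has been timed.
        "mean_duration_s": duration["sum"] / runs if runs else None,
        # 0 means no successful backup since the Monitor started.
        "age_s": now - last_success if last_success else None,
        "last_id": mon["backup_last_success_id"],
    }
```

The ``last_id`` value is the one that would be passed as ``--backup-version`` when restoring the most recent successful backup.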
+
+Monitor Backup Metric Usage Examples
+------------------------------------
+
+The following examples show how to use the monitor backup performance
+counters. PromQL examples assume that the ``ceph-exporter`` is being
+scraped by Prometheus. Admin-socket examples use ``ceph daemon``
+directly against a specific ``ceph-mon`` daemon.
+
+``backup_last_success`` (gauge, Unix epoch seconds)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The timestamp when the most recent successful backup completed. ``0``
+means no successful backup has occurred since the mon started.
+
+* Age of the most recent successful backup, per mon:
+
+ .. code-block:: promql
+
+ time() - mon_backup_last_success
+
+* Detect a stalled backup schedule (no success in over 2 hours; adjust
+ the threshold to roughly 2× your :confval:`mon_backup_interval`):
+
+ .. code-block:: promql
+
+ time() - mon_backup_last_success > 7200 and mon_backup_last_success > 0
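
The same stall condition can be evaluated outside Prometheus, for example in a health-check script. This is a minimal sketch; the function name and the ``factor`` parameter are hypothetical, with ``factor=2.0`` mirroring the "roughly 2x the interval" guidance above.

```python
def backup_is_stalled(last_success, now, backup_interval, factor=2.0):
    """Mirror the PromQL stall expression for one Monitor.

    last_success:    value of backup_last_success (Unix seconds, 0 if none).
    now:             current time as Unix seconds.
    backup_interval: configured backup interval in seconds.
    """
    # A value of 0 means "never succeeded since start"; like the
    # `and ... > 0` clause, that case is not reported as a stall.
    if last_success <= 0:
        return False
    return now - last_success > factor * backup_interval
```

With a one-hour interval the threshold is 7200 seconds, matching the PromQL example.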
+
+``backup_failed`` (counter)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Cumulative count of failed backup attempts (pre-flight or
+``BackupEngine``) since the mon started.
+
+* Backup failures per mon in the last hour:
+
+ .. code-block:: promql
+
+ increase(mon_backup_failed[1h])
+
+* Alert when any mon has logged a backup failure recently:
+
+ .. code-block:: promql
+
+ increase(mon_backup_failed[15m]) > 0
+
+``backup_last_size`` (gauge, bytes)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Payload size in bytes of the most recent backup attempt.
+
+* Backup size trend across all mons:
+
+ .. code-block:: promql
+
+ mon_backup_last_size
+
+* Live size check for a single mon via admin socket:
+
+ .. code-block:: bash
+
+ ceph daemon mon.<id> perf dump | jq '.mon.backup_last_size'
+
+``backup_cleanup_kept`` (gauge)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Number of backups retained by the most recent cleanup pass.
+
+* Mons whose retention has grown beyond an expected ceiling:
+
+ .. code-block:: promql
+
+ mon_backup_cleanup_kept > 100
+
+``backup_cleanup_freed`` (gauge, bytes)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Bytes released by the most recent cleanup pass. Overstated when shared
+backups are in use.
+
+* Bytes reclaimed by the latest cleanup, per mon:
+
+ .. code-block:: promql
+
+ mon_backup_cleanup_freed
+
+* Total bytes released by the most recent cleanup across all mons:
+
+ .. code-block:: promql
+
+ sum(mon_backup_cleanup_freed)
+
+Monitor Backup Configuration Options
+------------------------------------
+
+The following options control monitor backup behavior. All are
+runtime-tunable.
+
+.. confval:: mon_backup_path
+.. confval:: mon_backup_min_avail
+.. confval:: mon_backup_keep_last
+.. confval:: mon_backup_keep_hourly
+.. confval:: mon_backup_keep_daily
+.. confval:: mon_backup_interval
+.. confval:: mon_backup_cleanup_interval
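
The ``mon_backup_min_avail`` pre-flight check is described as "only capture backups if at least this percentage of the target filesystem is free". A rough operator-side equivalent is sketched below. How the Monitor computes the percentage internally is not shown in this change, so the use of ``shutil.disk_usage`` and both helper names are assumptions.

```python
import shutil


def free_percent(total_bytes, free_bytes):
    """Percentage of the filesystem that is free (0.0 for an empty total)."""
    return 100.0 * free_bytes / total_bytes if total_bytes else 0.0


def preflight_allows_backup(backup_path, min_avail_percent):
    """Return True if the filesystem holding backup_path has enough free space.

    min_avail_percent plays the role of mon_backup_min_avail (0 to 100).
    """
    usage = shutil.disk_usage(backup_path)
    return free_percent(usage.total, usage.free) >= min_avail_percent
```

Running such a check before triggering ``backup`` by hand can show why an attempt would be counted in ``backup_failed`` at the pre-flight stage.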
+
+
Miscellaneous
=============
Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb
+Recovery Using Mon Backup
+-------------------------
+
+If Monitor backups are enabled, backups can be found in the configured
+``mon_backup_path``. To list the available backup versions, run:
+
+.. code-block:: bash
+
+ ceph-mon -i [num] --list-backups /path/to/backups
+
+In containerized deployments, run this from inside the Monitor
+container (``cephadm shell --name mon.<id>``), or install the
+``ceph-mon`` package, which provides the ``ceph-mon`` binary, on the host.
+
+This invokes the RocksDB ``BackupEngine`` to enumerate the logical
+backup versions at the path. Output looks like::
+
+ ID: Time: Size:
+ 1 Mon May 18 03:00:01 2026 4 MiB
+ 2 Mon May 18 04:00:02 2026 12 KiB
+ 3 Mon May 18 05:00:01 2026 16 KiB
+
+The ``ID`` column is the value to pass as ``--backup-version`` when
+restoring. A plain ``ls`` of the backup path shows the BackupEngine's
+internal ``meta/``, ``private/``, and ``shared_checksum/`` directories,
+which are not directly usable for restore; always use ``--list-backups``
+to obtain the IDs.
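
When scripting a restore, the IDs can be pulled out of the listing text. The sketch below assumes the output shape shown above, where each data row begins with a numeric ID and the header or any banner lines do not; the function name is hypothetical.

```python
def parse_backup_ids(listing):
    """Extract backup IDs from the text printed by --list-backups."""
    ids = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID; header and banner lines
        # start with a word, so they are skipped.
        if fields and fields[0].isdecimal():
            ids.append(int(fields[0]))
    return ids
```

The largest returned ID is a reasonable default for ``--backup-version`` only if IDs increase with time, which this sketch does not verify.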
+
+To restore a backup, stop the monitor and run:
+
+.. code-block:: bash
+
+ ceph-mon -i [num] --restore-backup /path/to/backups --backup-version <version> --yes-i-really-mean-it
+
+The ``--yes-i-really-mean-it`` flag is required because restore
+overwrites the existing monitor store. If the ``--backup-version``
+argument is omitted, the latest version is restored. The restored
+store contains everything that was in the mon at the time the backup
+was taken, including auth records; any changes (auth, pools, CRUSH,
+etc.) made after that point are lost. OSDs will reconcile their state
+with the restored osdmap as the cluster comes back up.
+
+If ``ceph-mon --restore-backup`` is invoked as ``root`` (typical when running
+from a service shell), the restored ``kv_backend`` file and the rehydrated
+``keyring`` will be owned by ``root``. Before starting the monitor daemon,
+run ``chown -R ceph:ceph <mon_data>`` so that the unprivileged ``ceph`` user
+can read them.
+
+Restoring a Multi-Monitor Cluster
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Each monitor has its own Paxos state (rank, accepted proposal numbers,
+``last_committed``), so backups are per-monitor: a backup taken from
+``mon.a`` should be restored into ``mon.a``'s ``mon_data``, not copied
+onto ``mon.b`` or ``mon.c``.
+
+To recover a cluster from monitor backups:
+
+#. Stop all monitors.
+#. Restore each monitor you have a backup for from its own backup path
+ using the ``--restore-backup`` command above.
+#. Start the restored monitors. Quorum forms once a majority of the
+ monmap members are running. Restoring and starting a majority is the
+ simplest path: the recovered monmap is the original one, and the
+ monitors elect among themselves as normal.
+
+If you cannot bring back a majority of monitors (because backups are
+missing or storage is lost on the other hosts), a single restored
+monitor (the "seed") will not form quorum on its own. The monmap
+recovered from the backup still lists every monitor in the original
+cluster, and Paxos requires a majority to elect. The seed will sit in
+``probing`` / ``electing`` indefinitely.
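
The majority requirement is simple arithmetic, sketched here for clarity. Both function names are illustrative and not part of any Ceph tool.

```python
def quorum_size(monmap_members):
    """Smallest number of monitors that is a strict majority of the monmap."""
    return monmap_members // 2 + 1


def can_form_quorum(monmap_members, running):
    """True if the running monitors are enough to elect a leader."""
    return running >= quorum_size(monmap_members)
```

For a three-member monmap the quorum size is two, so one restored monitor cannot elect; once the monmap is reduced to a single member, ``can_form_quorum(1, 1)`` holds.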
+
+To reduce the monmap on the offline seed so it can elect itself:
+
+#. Stop all monitors.
+#. Restore the seed monitor from its backup as above.
+#. Edit the monmap directly in the offline mon store::
+
+ # extract from the offline mon store
+ ceph-mon -i <seed-id> --extract-monmap /tmp/monmap
+ # drop monitors that will not come back
+ monmaptool /tmp/monmap --rm <other-id-1> --rm <other-id-2>
+ # write it back into the store
+ ceph-mon -i <seed-id> --inject-monmap /tmp/monmap
+
+#. Start the seed monitor. With the reduced monmap it can elect itself
+ and form a single-member quorum.
+#. Re-add any remaining monitors as new sync members per
+ :ref:`adding-and-removing-monitors`; they will synchronize from the
+ recovered seed rather than reusing their old backups.
+
+
Recovery Using Healthy Monitor(s)
---------------------------------
#include <fcntl.h>
#include <iostream>
+#include <optional>
#include <sstream>
#include <string>
#include "common/Throttle.h"
#include "common/Timer.h"
#include "common/errno.h"
+#include "common/strtol.h"
#include "common/Preforker.h"
#include "global/global_init.h"
<< " --force-sync\n"
<< " force a sync from another mon by wiping local data (BE CAREFUL)\n"
<< " --yes-i-really-mean-it\n"
- << " mandatory safeguard for --force-sync\n"
+ << " mandatory safeguard for --force-sync and --restore-backup\n"
<< " --compact\n"
<< " compact the monitor store\n"
<< " --osdmap <filename>\n"
<< " extract the monmap from the local monitor store and exit\n"
<< " --mon-data <directory>\n"
<< " where the mon store and keyring are located\n"
- << " --set-crush-location <bucket>=<foo>"
- << " sets monitor's crush bucket location (only for stretch mode)"
+ << " --set-crush-location <bucket>=<foo>\n"
+ << " sets monitor's crush bucket location (only for stretch mode)\n"
+ << " --restore-backup <directory>\n"
+ << " restore the backup from location and exit (requires --yes-i-really-mean-it)\n"
+ << " --backup-version <version>\n"
+ << " BackupEngine version ID (uint32); defaults to the latest backup when omitted\n"
+ << " --list-backups <directory>\n"
+ << " list available backups\n"
<< std::endl;
generic_server_usage();
}
bool force_sync = false;
bool yes_really = false;
std::string osdmapfn, inject_monmap, extract_monmap, crush_loc;
+ std::string restore_backup_location, list_backup_location;
+ std::optional<uint32_t> restore_backup_version;
auto args = argv_to_vec(argc, argv);
if (args.empty()) {
extract_monmap = val;
} else if (ceph_argparse_witharg(args, i, &val, "--set-crush-location", (char*)NULL)) {
crush_loc = val;
+ } else if (ceph_argparse_witharg(args, i, &val, "--list-backups", (char*)NULL)) {
+ list_backup_location = val;
+ } else if (ceph_argparse_witharg(args, i, &val, "--restore-backup", (char*)NULL)) {
+ restore_backup_location = val;
+ } else if (ceph_argparse_witharg(args, i, &val, "--backup-version", (char*)NULL)) {
+ std::string parse_err;
+ long long v = strict_strtoll(val.c_str(), 10, &parse_err);
+ if (!parse_err.empty() || v < 0 || v > UINT32_MAX) {
+ cerr << "invalid --backup-version '" << val << "'" << std::endl;
+ exit(1);
+ }
+ restore_backup_version = static_cast<uint32_t>(v);
} else {
++i;
}
exit(1);
}
+ // -- list backups --
+ if (!list_backup_location.empty()) {
+ cout << "list backup from location '" << list_backup_location << "'" << std::endl << std::endl;
+ auto backup_infos = MonitorDBStore::list_backups(
+ cct.get(), g_conf()->mon_data, list_backup_location);
+ if (!backup_infos) {
+ cerr << "failed to enumerate backups at '" << list_backup_location
+ << "' (see log for details)" << std::endl;
+ exit(1);
+ }
+ if (backup_infos->empty()) {
+ cout << "no backups found at '" << list_backup_location << "'" << std::endl;
+ exit(0);
+ }
+ cout << "ID:\tTime:\t\t\t\tSize:" << std::endl;
+ for (const KeyValueDB::BackupStats& bi : *backup_infos) {
+ cout << bi.id << "\t";
+ bi.timestamp.asctime(cout);
+ cout << "\t" << byte_u_t(bi.size) << std::endl;
+ }
+ exit(0);
+ }
+
+ // -- restore backup --
+ if (!restore_backup_location.empty()) {
+ if (!yes_really) {
+ cerr << "restoring will overwrite the monitor store at '" << g_conf()->mon_data
+ << "'. Pass --yes-i-really-mean-it to proceed." << std::endl;
+ exit(1);
+ }
+ cerr << "restoring backup from location '" << restore_backup_location << "' to '"
+ << g_conf()->mon_data << "'" << std::endl;
+ if (MonitorDBStore::restore_backup(cct.get(), g_conf()->mon_data, restore_backup_location, restore_backup_version)) {
+ cout << "successfully restored backup. Start ceph-mon normally" << std::endl;
+ exit(0);
+ }
+ cerr << "restore failed. Check the backup path and version (use --list-backups to enumerate)." << std::endl;
+ exit(1);
+ }
+
+ if (restore_backup_version.has_value()) {
+ cerr << "--backup-version requires --restore-backup" << std::endl;
+ exit(1);
+ }
+
MonitorDBStore store(g_conf()->mon_data);
// -- mkfs --
exit(1);
}
store.close();
- dout(0) << argv[0] << ": created monfs at " << g_conf()->mon_data
+ dout(0) << argv[0] << ": created monfs at " << g_conf()->mon_data
<< " for " << g_conf()->name << dendl;
return 0;
}
flags:
- no_mon_update
with_legacy: true
+- name: mon_backup_path
+ type: str
+ level: advanced
+ desc: Path to Monitor database backups
+ fmt_desc: The Monitor's backup location.
+ default: /var/backups/ceph/mon/$cluster-$id
+ services:
+ - mon
+ flags:
+ - runtime
+- name: mon_backup_min_avail
+ type: int
+ level: advanced
+ desc: Only capture backups if at least this percentage of the target filesystem is free
+ default: 10
+ min: 0
+ max: 100
+ services:
+ - mon
+ flags:
+ - runtime
+- name: mon_backup_keep_last
+ type: uint
+ level: advanced
+ desc: Keep the last N backups
+ fmt_desc: Keep the last N backups of the Monitor database.
+ default: 6
+ services:
+ - mon
+ flags:
+ - runtime
+- name: mon_backup_keep_hourly
+ type: uint
+ level: advanced
+ desc: Number of hourly backups
+ fmt_desc: Keep one backup per hour, for the specified number of hours.
+ default: 5
+ services:
+ - mon
+ flags:
+ - runtime
+- name: mon_backup_keep_daily
+ type: uint
+ level: advanced
+ desc: Number of daily backups
+ fmt_desc: Keep one backup per day, for the specified number of days.
+ default: 7
+ services:
+ - mon
+ flags:
+ - runtime
+- name: mon_backup_interval
+ type: secs
+ level: advanced
+ desc: Automatic backups every N seconds (0 disables)
+ default: 0
+ services:
+ - mon
+ flags:
+ - runtime
+- name: mon_backup_cleanup_interval
+ type: secs
+ level: advanced
+ desc: Trigger backup cleanup every N seconds (0 disables)
+ default: 0
+ services:
+ - mon
+ flags:
+ - runtime
- name: mon_rocksdb_options
type: str
level: advanced
// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:nil -*-
// vim: ts=8 sw=2 sts=2 expandtab
+#include <filesystem>
+#include <memory>
+#include <sstream>
+
#include "KeyValueDB.h"
#include "RocksDBStore.h"
return NULL;
}
+bool KeyValueDB::restore_backup(CephContext *cct,
+ const std::string &type,
+ const std::string &path,
+ const std::string &backup_location,
+ const std::optional<uint32_t> &version)
+{
+ if (std::filesystem::exists(path) &&
+ !std::filesystem::is_empty(path)) {
+ std::unique_ptr<KeyValueDB> probe(KeyValueDB::create(cct, type, path));
+ if (!probe) {
+ lderr(cct) << __func__ << " unsupported kv backend: " << type << dendl;
+ return false;
+ }
+ std::ostringstream err;
+ if (probe->open(err) < 0) {
+ // Heuristic: rocksdb's PosixEnv lock path surfaces "lock" in the
+ // error text. Any other open failure (corruption, I/O) is precisely
+ // why a restore is being run -- warn and proceed.
+ const std::string msg = err.str();
+ if (msg.find("lock") != std::string::npos) {
+ lderr(cct) << __func__ << " another monitor is using this data dir: "
+ << path << ": " << msg << dendl;
+ return false;
+ }
+ lderr(cct) << __func__ << " existing store at " << path
+ << " is unreadable, proceeding with restore: " << msg << dendl;
+ } else {
+ probe->close();
+ }
+ }
+ if (type == "rocksdb") {
+ return RocksDBStore::restore_backup(cct, path, backup_location, version);
+ }
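+ // Mirror list_backups(): report why nothing was restored.
+ lderr(cct) << __func__ << " unsupported kv backend: " << type << dendl;
+ return false;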
+}
+
+std::optional<std::vector<KeyValueDB::BackupStats>> KeyValueDB::list_backups(
+ CephContext *cct, const std::string &type, const std::string &backup_location)
+{
+ if (type == "rocksdb") {
+ return RocksDBStore::list_backups(cct, backup_location);
+ }
+ lderr(cct) << __func__ << " unsupported kv backend: " << type << dendl;
+ return std::nullopt;
+}
+
int KeyValueDB::test_init(const string& type, const string& dir)
{
if (type == "rocksdb") {
#include <string_view>
#include <boost/scoped_ptr.hpp>
#include "include/encoding.h"
+#include "include/utime.h"
#include "common/Formatter.h"
#include "common/perf_counters.h"
#include "common/PriorityCache.h"
*/
class KeyValueDB {
public:
+ struct BackupCleanupStats {
+ bool error{false};
+ utime_t timestamp;
+ uint32_t corrupted{0};
+ uint32_t deleted{0};
+ uint32_t kept{0};
+ uint64_t size{0};
+ uint64_t freed{0};
+ };
+
+ struct BackupStats {
+ bool error{false};
+ uint64_t id{0};
+ utime_t timestamp;
+ std::string msg;
+ uint64_t size{0};
+ uint64_t number_files{0};
+ };
+
class TransactionImpl {
public:
// amount of ops included
}
/// Remove Single Key which exists and was not overwritten.
- /// This API is only related to performance optimization, and should only be
- /// re-implemented by log-insert-merge tree based keyvalue stores(such as RocksDB).
+ /// This API is only related to performance optimization, and should only be
+ /// re-implemented by log-insert-merge tree based keyvalue stores(such as RocksDB).
/// If a key is overwritten (by calling set multiple times), then the result
/// of calling rm_single_key on this key is undefined.
virtual void rm_single_key(
return 0;
};
+ /// Create a kv database backup in directory path.
+ virtual BackupStats backup(const std::string &path) {
+ return {.error = true, .msg = "backup not supported by this backend"};
+ }
+
+ /// Remove old backups in directory path according to retention.
+ virtual BackupCleanupStats backup_cleanup(const std::string &path,
+ uint64_t keep_last,
+ uint64_t keep_hourly,
+ uint64_t keep_daily) {
+ return {.error = true};
+ }
+
+ /// Restore the given backup version from backup_location into path;
+ /// when version is unset, restore the latest backup.
+ static bool restore_backup(CephContext *cct, const std::string &type,
+ const std::string &path,
+ const std::string &backup_location,
+ const std::optional<uint32_t> &version);
+
+ static std::optional<std::vector<BackupStats>> list_backups(
+ CephContext *cct, const std::string &type,
+ const std::string &backup_location);
+
/// compact the underlying store
virtual void compact() {}
#include <set>
#include <string>
#include <string_view>
+#include <tuple>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include "rocksdb/slice.h"
#include "rocksdb/cache.h"
#include "rocksdb/filter_policy.h"
+#include "rocksdb/utilities/backup_engine.h"
#include "rocksdb/utilities/convenience.h"
#include "rocksdb/utilities/table_properties_collectors.h"
#include "rocksdb/merge_operator.h"
+#include "common/version.h"
+#include "rocksdb/util/stderr_logger.h"
+
#include "common/Clock.h" // for ceph_clock_now()
#include "common/perf_counters.h"
#include "common/PriorityCache.h"
static const char* sharding_recreate = "sharding/recreate_columns";
static const char* resharding_column_lock = "reshardingXcommencingXlocked";
+
static bufferlist to_bufferlist(rocksdb::Slice in) {
bufferlist bl;
bl.append(bufferptr(in.data(), in.size()));
if (create_if_missing) {
status = rocksdb::DB::Open(opt, path, &db);
if (!status.ok()) {
+ out << status.ToString();
derr << status.ToString() << dendl;
return -EINVAL;
}
status = rocksdb::DB::Open(opt, path, &db);
}
if (!status.ok()) {
+ out << status.ToString();
derr << status.ToString() << dendl;
return -EINVAL;
}
path, existing_cfs, &handles, &db);
}
if (!status.ok()) {
+ out << status.ToString();
derr << status.ToString() << dendl;
return -EINVAL;
}
return 0;
}
+KeyValueDB::BackupStats RocksDBStore::backup(const std::string &path)
+{
+ ldout(cct, 20) << __func__ << " start backup action" << dendl;
+ std::lock_guard backup_locker{backup_lock};
+ // stamp timestamp up front so every return path (including early Open
+ // failures) carries a real time the scheduler can gate retries on.
+ KeyValueDB::BackupStats rv;
+ rv.timestamp = ceph_clock_now();
+
+ rocksdb::BackupEngine* engine_ptr = nullptr;
+ rocksdb::BackupEngineOptions engine_options = rocksdb::BackupEngineOptions(path);
+ // BackupEngineOptions must be stable across opens to the same directory,
+ // and share_files_with_checksum=false is deprecated by rocksdb.
+ engine_options.share_table_files = true;
+ engine_options.share_files_with_checksum = true;
+ engine_options.sync = true;
+
+ rocksdb::Status s = rocksdb::BackupEngine::Open(
+ engine_options,
+ rocksdb::Env::Default(),
+ &engine_ptr);
+ std::unique_ptr<rocksdb::BackupEngine> backup_engine{engine_ptr};
+
+ if (!backup_engine || !s.ok()) {
+ ldout(cct, 0) << __func__ << " can't create backup_engine: " << s.ToString() << dendl;
+ rv.msg = s.ToString();
+ rv.error = true;
+ return rv;
+ }
+
+ // we remove corrupted backups first to not link to broken ones
+ remove_corrupted_backups(backup_engine.get(), nullptr);
+
+ rocksdb::BackupID new_backup;
+ rocksdb::BackupInfo new_backup_info;
+ rocksdb::CreateBackupOptions new_backup_options = rocksdb::CreateBackupOptions();
+ new_backup_options.flush_before_backup = true;
+
+ std::string app_metadata = std::string("ceph_version=") + ceph_version_to_str();
+ s = backup_engine->CreateNewBackupWithMetadata(new_backup_options, db, app_metadata, &new_backup);
+
+ rv.timestamp = ceph_clock_now();
+ rv.msg = s.ToString();
+
+ if (!s.ok()) {
+ ldout(cct, 0) << __func__ << " can't create backup: " << s.ToString() << dendl;
+ rv.error = true;
+ remove_corrupted_backups(backup_engine.get(), nullptr);
+ return rv;
+ }
+ // rv.msg already holds s.ToString() from above
+ ldout(cct, 10) << __func__ << " created backup successfully: " << s.ToString() << dendl;
+ s = backup_engine->GetBackupInfo(new_backup, &new_backup_info);
+ if (!s.ok()) {
+ ldout(cct, 0) << __func__ << " can't get backup info: " << s.ToString() << dendl;
+ rv.error = true;
+ rv.msg = s.ToString();
+ return rv;
+ }
+ rv.id = new_backup_info.backup_id;
+ rv.size = new_backup_info.size;
+ rv.number_files = new_backup_info.number_files;
+
+ return rv;
+}
+
+bool RocksDBStore::restore_backup(CephContext *cct, const std::string &path,
+ const std::string &backup_location,
+ const std::optional<uint32_t> &version)
+{
+ rocksdb::BackupEngineReadOnly* engine_ptr = nullptr;
+ rocksdb::StderrLogger logger = rocksdb::StderrLogger();
+ rocksdb::BackupEngineOptions engine_options = rocksdb::BackupEngineOptions(backup_location);
+ engine_options.info_log = &logger;
+
+ rocksdb::Status s = rocksdb::BackupEngineReadOnly::Open(
+ rocksdb::Env::Default(),
+ engine_options,
+ &engine_ptr);
+ std::unique_ptr<rocksdb::BackupEngineReadOnly> backup_engine{engine_ptr};
+ const rocksdb::RestoreOptions options = rocksdb::RestoreOptions();
+ if (!s.ok()) {
+ derr << __func__ << " can't open backup folder: " << s.ToString() << dendl;
+ return false;
+ }
+ if (!version) {
+ derr << __func__ << " restore last valid backup" << dendl;
+ s = backup_engine->RestoreDBFromLatestBackup(options, path, path);
+ } else {
+ s = backup_engine->RestoreDBFromBackup(
+ options, static_cast<rocksdb::BackupID>(*version), path, path);
+ }
+ if (!s.ok()) {
+ derr << "Error when restoring backup: " << s.ToString() << dendl;
+ }
+ return s.ok();
+}
+
+namespace {
+
+bool compare_backupinfo_by_timestamp(const rocksdb::BackupInfo& a, const rocksdb::BackupInfo& b)
+{
+ // newest first; tie-break on backup_id so cleanup never keeps an older entry
+ // in preference to a newer one with the same second-resolution timestamp.
+ return std::tie(a.timestamp, a.backup_id) > std::tie(b.timestamp, b.backup_id);
+}
+
+struct TimeBucket {
+ utime_t start;
+ utime_t end;
+ rocksdb::BackupID backup_id;
+
+ TimeBucket(utime_t start, utime_t end) :
+ start(start), end(end), backup_id(0) {}
+};
+
+} // namespace
+
+void RocksDBStore::remove_corrupted_backups(rocksdb::BackupEngine *backup_engine, KeyValueDB::BackupCleanupStats *rv) {
+ std::vector<rocksdb::BackupID> corrupt_backup_ids;
+ backup_engine->GetCorruptedBackups(&corrupt_backup_ids);
+ for (rocksdb::BackupID backup_id : corrupt_backup_ids) {
+ ldout(cct, 1) << __func__ << " delete corrupted backup: " << backup_id << dendl;
+ rocksdb::Status s = backup_engine->DeleteBackup(backup_id);
+ if (!s.ok()) {
+ lderr(cct) << __func__ << " failed to delete corrupted backup "
+ << backup_id << ": " << s.ToString() << dendl;
+ if (rv) {
+ rv->error = true;
+ }
+ continue;
+ }
+ if (rv) {
+ rv->corrupted++;
+ }
+ }
+}
+
+KeyValueDB::BackupCleanupStats RocksDBStore::backup_cleanup(const std::string &path,
+ uint64_t keep_last,
+ uint64_t keep_hourly,
+ uint64_t keep_daily)
+{
+ ldout(cct, 20) << __func__ << " start backup cleanup" << dendl;
+ std::lock_guard backup_locker{backup_lock};
+ // stamp timestamp up front so every return path (including early Open
+ // failures and empty result) carries a real time for the retry gate.
+ BackupCleanupStats rv;
+ rv.timestamp = ceph_clock_now();
+
+ rocksdb::BackupEngine* engine_ptr = nullptr;
+ rocksdb::Status s = rocksdb::BackupEngine::Open(
+ rocksdb::BackupEngineOptions(path),
+ rocksdb::Env::Default(),
+ &engine_ptr);
+ std::unique_ptr<rocksdb::BackupEngine> backup_engine{engine_ptr};
+ if (!backup_engine || !s.ok()) {
+ // failing to clean backups because the folder is unavailable is a minor problem
+ ldout(cct, 10) << __func__ << " can't clean backups: " << s.ToString() << dendl;
+ rv.error = true;
+ return rv;
+ }
+ std::set<rocksdb::BackupID> keep_backups;
+
+ // remove corrupted backups first
+ remove_corrupted_backups(backup_engine.get(), &rv);
+ ldout(cct, 20) << __func__ << " collect garbage" << dendl;
+
+ std::vector<rocksdb::BackupInfo> backup_infos;
+ backup_engine->GetBackupInfo(&backup_infos);
+
+ if (backup_infos.empty()) {
+ ldout(cct, 15) << __func__ << " no backup infos" << dendl;
+ return rv;
+ }
+ // sort all backups with newest backup first
+ std::stable_sort(backup_infos.begin(), backup_infos.end(), compare_backupinfo_by_timestamp);
+ // always retain the newest backup, regardless of retention settings, so
+ // cleanup can never leave zero backups for a subsequent failed backup.
+ keep_backups.insert(backup_infos.front().backup_id);
+ utime_t now = ceph_clock_now();
+
+ std::vector<TimeBucket> buckets;
+ // half-open intervals [start, end) so adjacent buckets meet without gaps
+ utime_t start = now.round_to_hour();
+ for (uint64_t i = 0; i < keep_hourly; i++) {
+ buckets.push_back(TimeBucket(start, start + utime_t(3600, 0)));
+ start -= 3600.0;
+ }
+ start = now.round_to_day();
+ for (uint64_t i = 0; i < keep_daily; i++) {
+ buckets.push_back(TimeBucket(start, start + utime_t(86400, 0)));
+ start -= 86400.0;
+ }
+
+ size_t i = 0;
+ for (const rocksdb::BackupInfo& bi : backup_infos) {
+ if (i++ < keep_last) {
+ keep_backups.insert(bi.backup_id);
+ }
+ utime_t ts = utime_t(bi.timestamp, 0);
+ for (TimeBucket& bucket : buckets) {
+ if (ts >= bucket.start && ts < bucket.end) {
+ if (bucket.backup_id == 0) {
+ bucket.backup_id = bi.backup_id;
+ }
+ }
+ }
+ }
+ // add each bucket's winner to the keep set
+ for (const TimeBucket& bucket : buckets) {
+ if (bucket.backup_id) {
+ keep_backups.insert(bucket.backup_id);
+ }
+ }
+
+ rv.kept = keep_backups.size();
+
+ for (const rocksdb::BackupInfo& bi : backup_infos) {
+ if (keep_backups.count(bi.backup_id)) {
+ // payload-sum across kept backups (does not account for file sharing,
+ // so it overstates the on-disk footprint).
+ rv.size += bi.size;
+ continue;
+ }
+ ldout(cct, 10) << __func__ << " delete old backup: " << bi.backup_id << dendl;
+ rocksdb::Status s = backup_engine->DeleteBackup(bi.backup_id);
+ if (!s.ok()) {
+ lderr(cct) << __func__ << " failed to delete backup " << bi.backup_id
+ << ": " << s.ToString() << dendl;
+ rv.error = true;
+ continue;
+ }
+ rv.freed += bi.size;
+ rv.deleted++;
+ }
+ rv.timestamp = ceph_clock_now();
+ return rv;
+}
+
+std::optional<std::vector<KeyValueDB::BackupStats>>
+RocksDBStore::list_backups(CephContext *cct, const std::string &backup_location) {
+ rocksdb::BackupEngineReadOnly* engine_ptr = nullptr;
+ rocksdb::Status s = rocksdb::BackupEngineReadOnly::Open(
+ rocksdb::BackupEngineOptions(backup_location),
+ rocksdb::Env::Default(),
+ &engine_ptr);
+ std::unique_ptr<rocksdb::BackupEngineReadOnly> backup_engine{engine_ptr};
+
+ if (!backup_engine || !s.ok()) {
+ lderr(cct) << __func__ << " can't open backup location " << backup_location
+ << ": " << s.ToString() << dendl;
+ return std::nullopt;
+ }
+
+ std::vector<rocksdb::BackupInfo> backup_infos;
+ backup_engine->GetBackupInfo(&backup_infos);
+ std::stable_sort(backup_infos.begin(), backup_infos.end(), compare_backupinfo_by_timestamp);
+ std::vector<KeyValueDB::BackupStats> rv;
+ for (const rocksdb::BackupInfo& bi : backup_infos) {
+ KeyValueDB::BackupStats br;
+ br.id = bi.backup_id;
+ br.timestamp = utime_t(bi.timestamp, 0);
+ br.size = bi.size;
+ br.number_files = bi.number_files;
+ rv.push_back(br);
+ }
+ return rv;
+}
+
+
void RocksDBStore::compact()
{
dout(2) << __func__ << " starting" << dendl;
<< full_name << dendl;
return -EINVAL;
}
- dout(10) << "created column " << full_name << " handle = " << (void*)cf << dendl;
+ dout(10) << "created column " << full_name << " handle = " << (void*)cf << dendl;
existing_columns.push_back(full_name);
handles.push_back(cf);
}
#include "rocksdb/statistics.h"
#include "rocksdb/table.h"
#include "rocksdb/db.h"
+#include "rocksdb/utilities/backup_engine.h"
#include "kv/rocksdb_cache/BinnedLRUCache.h"
#include <errno.h>
#include "common/errno.h"
uint32_t hash_l, uint32_t hash_h)
: name(name), shard_cnt(shard_cnt), options(options), hash_l(hash_l), hash_h(hash_h) {}
};
+
private:
friend std::ostream& operator<<(std::ostream& out, const ColumnFamily& cf);
bool must_close_default_cf = false;
rocksdb::ColumnFamilyHandle *default_cf = nullptr;
+ ceph::mutex backup_lock = ceph::make_mutex("RocksDBStore::Backup");
/// column families in use, name->handles
struct prefix_shards {
int do_open(std::ostream &out, bool create_if_missing, bool open_readonly,
const std::string& cfs="");
int load_rocksdb_options(bool create_if_missing, rocksdb::Options& opt);
+ void remove_corrupted_backups(rocksdb::BackupEngine *engine, KeyValueDB::BackupCleanupStats *result);
public:
static bool parse_sharding_def(const std::string_view text_def,
std::vector<ColumnFamily>& sharding_def,
return cct->_conf.get_val<uint64_t>("rocksdb_delete_range_threshold");
}
+ KeyValueDB::BackupStats backup(const std::string &path) override;
+ KeyValueDB::BackupCleanupStats backup_cleanup(const std::string &path,
+ uint64_t keep_last,
+ uint64_t keep_hourly,
+ uint64_t keep_daily) override;
+
+ /// Restore a backup into @p path. @p version is the rocksdb backup id, or
+ /// nullopt for the most recent. Must be called on a closed store.
+ static bool restore_backup(CephContext *cct, const std::string &path,
+ const std::string &backup_location,
+ const std::optional<uint32_t> &version);
+
+ /// List backups at @p backup_location, newest first.
+ /// Returns nullopt if the BackupEngine could not be opened.
+ static std::optional<std::vector<BackupStats>> list_backups(
+ CephContext *cct, const std::string &backup_location);
+
void compact() override;
void compact_async() override {
MgrMonitor.cc
MgrStatMonitor.cc
Monitor.cc
+ MonitorBackup.cc
MonmapMonitor.cc
LogMonitor.cc
AuthMonitor.cc
#include "MonitorDBStore.h"
#include "MonMap.h"
#include "Paxos.h"
+#include "MonitorBackup.h"
#include "messages/PaxosServiceMessage.h"
#include "messages/MMonCommand.h"
<< duration << " seconds" << dendl;
out << "compacted " << g_conf().get_val<std::string>("mon_keyvaluedb")
<< " in " << duration << " seconds";
- } else {
+ } else if (command == "backup") {
+ r = perform_backup();
+ } else if (command == "backup_cleanup") {
+ r = cleanup_backup();
+ } else {
ceph_abort_msg("bad AdminSocket command binding");
}
(read_only ? audit_clog->debug() : audit_clog->info())
return r;
}
+int Monitor::perform_backup()
+{
+ std::string backup_path = g_conf().get_val<string>("mon_backup_path");
+ dout(1) << "triggering backup" << dendl;
+ if (backup_path.empty()) {
+ derr << "backup failed: mon_backup_path is empty" << dendl;
+ return -ENOTDIR;
+ }
+ if (!backup_manager) {
+ derr << "backup failed: monitor still initializing" << dendl;
+ return -EAGAIN;
+ }
+ uint64_t jobid = backup_manager->backup();
+ dout(1) << "queued backup job id " << jobid << dendl;
+ return 0;
+}
+
+int Monitor::cleanup_backup()
+{
+ std::string backup_path = g_conf().get_val<string>("mon_backup_path");
+ dout(1) << "triggering backup_cleanup" << dendl;
+ if (backup_path.empty()) {
+ derr << "backup_cleanup failed: mon_backup_path is empty" << dendl;
+ return -ENOTDIR;
+ }
+ if (!backup_manager) {
+ derr << "backup_cleanup failed: monitor still initializing" << dendl;
+ return -EAGAIN;
+ }
+ uint64_t jobid = backup_manager->cleanup();
+ dout(1) << "queued backup cleanup job id " << jobid << dendl;
+ return 0;
+}
+
void Monitor::handle_signal(int signum)
{
derr << "*** Got Signal " << sig_str(signum) << " ***" << dendl;
"ewon", PerfCountersBuilder::PRIO_INTERESTING);
pcb.add_u64_counter(l_mon_election_lose, "election_lose", "Elections lost",
"elst", PerfCountersBuilder::PRIO_INTERESTING);
+ pcb.add_u64(l_mon_backup_running, "backup_running", "Mon backup process is running",
+ nullptr, PerfCountersBuilder::PRIO_USEFUL);
+ pcb.add_u64_counter(l_mon_backup_started, "backup_started", "Mon backups started",
+ nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+ pcb.add_u64_counter(l_mon_backup_success, "backup_success", "Mon backups finished successfully",
+ nullptr, PerfCountersBuilder::PRIO_USEFUL);
+ pcb.add_u64_counter(l_mon_backup_failed, "backup_failed", "Mon backups failed",
+ nullptr, PerfCountersBuilder::PRIO_USEFUL);
+ pcb.add_time_avg(l_mon_backup_duration, "backup_duration", "Mon backup duration",
+ nullptr, PerfCountersBuilder::PRIO_USEFUL);
+ pcb.add_time(l_mon_backup_last_success, "backup_last_success", "Last successful mon backup",
+ nullptr, PerfCountersBuilder::PRIO_USEFUL);
+ pcb.add_u64(l_mon_backup_last_success_id, "backup_last_success_id", "Last successful mon backup id",
+ nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+ pcb.add_time(l_mon_backup_last_failed, "backup_last_failed", "Last failed mon backup",
+ nullptr, PerfCountersBuilder::PRIO_USEFUL);
+ pcb.add_u64(l_mon_backup_last_size, "backup_last_size", "Last backup size",
+ nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+ pcb.add_u64(l_mon_backup_last_files, "backup_last_files", "Number of files in last backup",
+ nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+ pcb.add_u64_counter(l_mon_backup_cleanup_started, "backup_cleanup_started", "Mon backup cleanup started",
+ nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+ pcb.add_u64(l_mon_backup_cleanup_running, "backup_cleanup_running", "Mon backup cleanup is running",
+ nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+ pcb.add_u64_counter(l_mon_backup_cleanup_success, "backup_cleanup_success", "Mon backup cleanup finished successfully",
+ nullptr, PerfCountersBuilder::PRIO_USEFUL);
+ pcb.add_u64_counter(l_mon_backup_cleanup_failed, "backup_cleanup_failed", "Mon backup cleanup failed",
+ nullptr, PerfCountersBuilder::PRIO_USEFUL);
+ pcb.add_u64(l_mon_backup_cleanup_size, "backup_cleanup_size", "Total size of backups kept after cleanup",
+ nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+ pcb.add_u64(l_mon_backup_cleanup_kept, "backup_cleanup_kept", "Number of backups kept after cleanup",
+ nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+ pcb.add_time_avg(l_mon_backup_cleanup_duration, "backup_cleanup_duration", "Mon backup cleanup duration",
+ nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+ pcb.add_u64(l_mon_backup_cleanup_freed, "backup_cleanup_freed", "Mon backup cleanup freed size in bytes",
+ nullptr, PerfCountersBuilder::PRIO_INTERESTING);
+ pcb.add_u64(l_mon_backup_cleanup_deleted, "backup_cleanup_deleted", "Mon backup cleanup deleted backups",
+ nullptr, PerfCountersBuilder::PRIO_INTERESTING);
logger = pcb.create_perf_counters();
cct->get_perfcounters_collection()->add(logger);
}
command.helpstring);
ceph_assert(r == 0);
}
+ r = admin_socket->register_command("backup", admin_hook,
+ "create a backup of the mon database");
+ ceph_assert(r == 0);
+ r = admin_socket->register_command(
+ "backup_cleanup", admin_hook,
+ "delete old mon database backups according to retention config");
+ ceph_assert(r == 0);
l.lock();
// add ourselves as a conf observer
// add features of myself into feature_map
session_map.feature_map.add_mon(con_self->get_features());
+
+ backup_manager = std::make_unique<MonitorBackupManager>(cct, this);
+
return 0;
}
delete admin_hook;
admin_hook = NULL;
}
+ if (backup_manager) {
+ backup_manager->stop();
+ }
elector.shutdown();
prepare_new_fingerprint(t);
paxos->trigger_propose();
}
+ if (backup_manager) {
+ backup_manager->tick();
+ }
mgr_client.update_daemon_health(get_health_metrics());
new_tick();
#include "include/CompatSet.h"
#include "mon/MonitorDBStore.h"
#include "mon/mon_types.h" // for Metadata, PAXOS_*, ScrubResult
+#include "mon/MonitorBackup.h"
#include "mgr/MgrClient.h"
#include <boost/smart_ptr/atomic_shared_ptr.hpp>
#include <boost/smart_ptr/shared_ptr.hpp>
l_mon_election_call,
l_mon_election_win,
l_mon_election_lose,
+ l_mon_backup_running,
+ l_mon_backup_started,
+ l_mon_backup_success,
+ l_mon_backup_failed,
+ l_mon_backup_duration,
+ l_mon_backup_last_success,
+ l_mon_backup_last_success_id,
+ l_mon_backup_last_failed,
+ l_mon_backup_last_size,
+ l_mon_backup_last_files,
+ l_mon_backup_cleanup_started,
+ l_mon_backup_cleanup_running,
+ l_mon_backup_cleanup_success,
+ l_mon_backup_cleanup_failed,
+ l_mon_backup_cleanup_size,
+ l_mon_backup_cleanup_kept,
+ l_mon_backup_cleanup_duration,
+ l_mon_backup_cleanup_freed,
+ l_mon_backup_cleanup_deleted,
l_mon_last,
};
OpTracker op_tracker;
+ std::unique_ptr<MonitorBackupManager> backup_manager;
+
public:
Monitor(CephContext *cct_, std::string nm, MonitorDBStore *s,
Messenger *m, Messenger *mgr_m, MonMap *map);
std::ostream& err,
std::ostream& out);
+ // Queue a mon database backup or backup cleanup on the backup manager
+ int perform_backup();
+ int cleanup_backup();
+
private:
// don't allow copying
Monitor(const Monitor& rhs);
--- /dev/null
+// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
+// vim: ts=8 sw=2 smarttab
+/*
+* Ceph - scalable distributed file system
+*
+* Copyright (C) 2021 B1-Systems GmbH
+*
+* This is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License version 2.1, as published by the Free Software
+* Foundation. See file COPYING.
+*/
+
+#include <chrono>
+#include <filesystem>
+
+#include "include/util.h"
+#include "mon/MonitorBackup.h"
+#include "mon/Monitor.h"
+
+#define dout_subsys ceph_subsys_mon
+#undef dout_context
+#define dout_context cct
+
+namespace fs = std::filesystem;
+
+/***
+ * Thread which runs monitor backup operations
+ */
+void *MonitorBackupManager::entry() {
+ std::unique_lock lock{mutex};
+ auto wakeup_predicate = [this] {
+ return should_stop || should_backup || should_cleanup || wakeup_pending;
+ };
+ while (true) {
+ // Wait for any signal (tick, request, or stop) before doing
+ // scheduled work. Predicate-based so notifications delivered
+ // before the worker entered wait() are not lost, and so we don't
+ // fire scheduled work while init() is still wiring the mon up.
+ work_cond.wait(lock, wakeup_predicate);
+ if (should_stop) {
+ return nullptr;
+ }
+ wakeup_pending = false;
+
+ auto now = ceph_clock_now();
+ std::string backup_path = cct->_conf.get_val<std::string>("mon_backup_path");
+ auto interval = cct->_conf.get_val<std::chrono::seconds>("mon_backup_interval");
+ auto cleanup_interval = cct->_conf.get_val<std::chrono::seconds>("mon_backup_cleanup_interval");
+ bool path_ok = !backup_path.empty();
+
+ bool timer_backup = false;
+ bool timer_cleanup = false;
+
+ if (path_ok && interval.count() > 0 &&
+ (mon->is_leader() || mon->is_peon())) {
+ if (!last_backup) {
+ dout(10) << "trigger first timed backup" << dendl;
+ timer_backup = true;
+ } else if ((now - last_backup->timestamp) >= utime_t(interval.count(), 0)) {
+ dout(10) << "trigger timed backup" << dendl;
+ timer_backup = true;
+ }
+ }
+ if (path_ok && cleanup_interval.count() > 0) {
+ if (!last_cleanup) {
+ dout(10) << "trigger first timed backup cleanup" << dendl;
+ timer_cleanup = true;
+ } else if ((now - last_cleanup->timestamp) >= utime_t(cleanup_interval.count(), 0)) {
+ dout(10) << "trigger timed backup cleanup" << dendl;
+ timer_cleanup = true;
+ }
+ }
+
+ bool run_cleanup = should_cleanup || timer_cleanup;
+ bool run_backup = should_backup || timer_backup;
+
+ should_backup = false;
+ should_cleanup = false;
+
+ if (!run_backup && !run_cleanup) {
+ continue;
+ }
+
+ lock.unlock();
+ if (run_cleanup) {
+ do_cleanup();
+ }
+ if (run_backup) {
+ do_backup();
+ }
+ lock.lock();
+ }
+}
+
+void MonitorBackupManager::stop() {
+ {
+ std::lock_guard guard{mutex};
+ if (should_stop) {
+ return;
+ }
+ should_stop = true;
+ work_cond.notify_one();
+ }
+ join();
+}
+
+void MonitorBackupManager::do_cleanup() {
+ dout(5) << "start backup cleanup" << dendl;
+ mon->logger->inc(l_mon_backup_cleanup_started);
+ mon->logger->set(l_mon_backup_cleanup_running, 1);
+ auto start = ceph_clock_now();
+ KeyValueDB::BackupCleanupStats stats = mon->store->backup_cleanup();
+ mon->logger->set(l_mon_backup_cleanup_size, stats.size);
+ mon->logger->set(l_mon_backup_cleanup_kept, stats.kept);
+ mon->logger->set(l_mon_backup_cleanup_freed, stats.freed);
+ mon->logger->set(l_mon_backup_cleanup_deleted, stats.deleted);
+ if (stats.error) {
+ mon->logger->inc(l_mon_backup_cleanup_failed);
+ } else {
+ mon->logger->inc(l_mon_backup_cleanup_success);
+ }
+ auto ptr = std::make_shared<KeyValueDB::BackupCleanupStats>(stats);
+ last_cleanup.swap(ptr);
+ auto end = ceph_clock_now();
+ utime_t duration = end - start;
+ mon->logger->tinc(l_mon_backup_cleanup_duration, duration);
+ mon->logger->set(l_mon_backup_cleanup_running, 0);
+}
+
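+void MonitorBackupManager::record_last_backup(std::shared_ptr<KeyValueDB::BackupStats> stats) {
+ // a failed run has no backup id of its own; carry forward the previously
+ // recorded id so last_backup->id is not reset by the failure.
+ if (stats->error && last_backup) {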
+ stats->id = last_backup->id;
+ }
+ last_backup.swap(stats);
+}
+
+// Creates the backup directory if it is missing, then returns true if there
+// is enough free space on the backup volume.
+bool MonitorBackupManager::check_free_space() {
+ auto backup_path = cct->_conf.get_val<std::string>("mon_backup_path");
+
+ std::error_code ec;
+ if (!fs::exists(backup_path, ec)) {
+ if (!fs::create_directories(backup_path, ec)) {
+ dout(1) << "failed to create monitor backup directory '"
+ << backup_path << "': " << ec.message() << dendl;
+ return false;
+ }
+ fs::permissions(backup_path, fs::perms::owner_all,
+ fs::perm_options::replace, ec);
+ if (ec) {
+ dout(1) << "failed to set permissions on monitor backup directory '"
+ << backup_path << "': " << ec.message() << dendl;
+ return false;
+ }
+ dout(5) << "created monitor backup directory '" << backup_path
+ << "'" << dendl;
+ }
+
+ ceph_data_stats_t stats;
+ int err = get_fs_stats(stats, backup_path.c_str());
+ if (err < 0) {
+ dout(1) << "error checking monitor backup directory: " << cpp_strerror(err)
+ << dendl;
+ return false;
+ }
+
+ if (stats.avail_percent <= cct->_conf.get_val<int64_t>("mon_backup_min_avail")) {
+ dout(1) << "ERROR: not enough disk space to start backup (available: "
+ << stats.avail_percent << "%, " << byte_u_t(stats.byte_avail) << "); "
+ << "run backup_cleanup regularly or decrease mon_backup_min_avail" << dendl;
+ return false;
+ }
+ return true;
+}
+
+void MonitorBackupManager::do_backup() {
+ dout(1) << "start backup" << dendl;
+ mon->logger->inc(l_mon_backup_started);
+ mon->logger->set(l_mon_backup_running, 1);
+ auto start = ceph_clock_now();
+
+ std::shared_ptr<KeyValueDB::BackupStats> result;
+
+ if (!check_free_space()) {
+ mon->logger->inc(l_mon_backup_failed);
+ mon->logger->tset(l_mon_backup_last_failed, start);
+ result = std::make_shared<KeyValueDB::BackupStats>();
+ result->error = true;
+ result->timestamp = start;
+ result->msg = "insufficient free space";
+ } else {
+ KeyValueDB::BackupStats stats = mon->store->backup();
+ utime_t duration = ceph_clock_now() - start;
+ mon->logger->tinc(l_mon_backup_duration, duration);
+ mon->logger->set(l_mon_backup_last_size, stats.size);
+ mon->logger->set(l_mon_backup_last_files, stats.number_files);
+ if (stats.error) {
+ mon->logger->inc(l_mon_backup_failed);
+ mon->logger->tset(l_mon_backup_last_failed, stats.timestamp);
+ dout(1) << "failed backup in " << utimespan_str(duration) << dendl;
+ } else {
+ mon->logger->inc(l_mon_backup_success);
+ mon->logger->tset(l_mon_backup_last_success, stats.timestamp);
+ mon->logger->set(l_mon_backup_last_success_id, stats.id);
+ dout(1) << "finished backup in " << utimespan_str(duration) << dendl;
+ }
+ result = std::make_shared<KeyValueDB::BackupStats>(stats);
+ }
+
+ record_last_backup(result);
+ mon->logger->set(l_mon_backup_running, 0);
+}
+
--- /dev/null
+// -*- mode:C++; tab-width:8; c-basic-offset:2; indent-tabs-mode:t -*-
+// vim: ts=8 sw=2 smarttab
+/*
+* Ceph - scalable distributed file system
+*
+* Copyright (C) 2021 B1-Systems GmbH
+*
+* This is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License version 2.1, as published by the Free Software
+* Foundation. See file COPYING.
+*/
+
+
+#ifndef CEPH_MONITOR_BACKUP_H
+#define CEPH_MONITOR_BACKUP_H
+
+#include <cstdint>
+#include <memory>
+#include <mutex>
+#include <string>
+
+
+#include "common/Thread.h"
+#include "common/ceph_context.h"
+#include "common/ceph_mutex.h"
+#include "kv/KeyValueDB.h"
+#include "mon/MonitorDBStore.h"
+
+class Monitor;
+
+class MonitorBackupManager : public Thread {
+ CephContext *cct;
+ Monitor *mon;
+ ceph::mutex mutex;
+ ceph::condition_variable work_cond;
+ bool should_stop{false};
+ // set by tick(); a sticky flag so a notification delivered before the
+ // worker enters wait() is not lost. cleared each time the worker
+ // re-evaluates timer triggers.
+ bool wakeup_pending{false};
+
+ bool should_backup{false};
+ bool should_cleanup{false};
+ uint64_t last_job_id{0};
+ std::shared_ptr<KeyValueDB::BackupCleanupStats> last_cleanup;
+ std::shared_ptr<KeyValueDB::BackupStats> last_backup;
+
+ void do_backup();
+ void do_cleanup();
+ bool check_free_space();
+ void record_last_backup(std::shared_ptr<KeyValueDB::BackupStats> stats);
+protected:
+ void *entry() override;
+
+public:
+ explicit MonitorBackupManager(CephContext *cct, Monitor *monitor) :
+ cct(cct),
+ mon(monitor),
+ mutex(ceph::make_mutex("mon::BackupManager::mutex")) {
+ create("mon::backups");
+ }
+
+ void tick() {
+ std::lock_guard guard{mutex};
+ if (should_stop) {
+ return;
+ }
+ wakeup_pending = true;
+ work_cond.notify_one();
+ }
+
+ /**
+ * Stop the backup manager thread. Safe to call more than once.
+ **/
+ void stop();
+ /**
+ * Queue a new backup; the worker thread runs it asynchronously.
+ * @return new job id
+ **/
+ uint64_t backup() {
+ std::lock_guard guard{mutex};
+ should_backup = true;
+ uint64_t rv = ++last_job_id;
+ work_cond.notify_one();
+ return rv;
+ }
+
+ /// Queue a cleanup pass. Returns the new job id.
+ uint64_t cleanup() {
+ std::lock_guard guard{mutex};
+ should_cleanup = true;
+ uint64_t rv = ++last_job_id;
+ work_cond.notify_one();
+ return rv;
+ }
+
+};
+
+#endif
+
#ifndef CEPH_MONITOR_DB_STORE_H
#define CEPH_MONITOR_DB_STORE_H
+#include <algorithm>
+#include <filesystem>
+#include <optional>
#include <set>
#include <map>
#include <string>
#include "common/Clock.h"
#include "common/debug.h"
#include "common/safe_io.h"
+#include "common/strtol.h"
#include "common/blkdev.h"
#include "common/PriorityCache.h"
#include "common/version.h"
return path;
}
+ // returns the database store path
+ static std::string get_store_path(const std::string& path) {
+ return (std::filesystem::path(path) / "store.db").string();
+ }
+
std::shared_ptr<PriorityCache::PriCache> get_priority_cache() const {
return db->get_priority_cache();
}
}
void _open(const std::string& kv_type) {
- int pos = 0;
- for (auto rit = path.rbegin(); rit != path.rend(); ++rit, ++pos) {
- if (*rit != '/')
- break;
- }
- std::ostringstream os;
- os << path.substr(0, path.size() - pos) << "/store.db";
- std::string full_path = os.str();
+ std::string full_path = get_store_path(path);
KeyValueDB *db_ptr = KeyValueDB::create(g_ceph_context,
kv_type,
if (r < 0)
return r;
- // Monitors are few in number, so the resource cost of exposing
+ // Monitors are few in number, so the resource cost of exposing
// very detailed stats is low: ramp up the priority of all the
// KV store's perf counters. Do this after open, because backend may
// not have constructed PerfCounters earlier.
db.reset(NULL);
}
+ /// @brief Creates a backup of the database under mon_backup_path.
+ /// @return stats describing the created backup
+ KeyValueDB::BackupStats backup() {
+ auto backup_path = g_conf().get_val<std::string>("mon_backup_path");
+ auto stats = db->backup(backup_path);
+ if (!stats.error) {
+ // Stash the mon keyring alongside the rocksdb backup, keyed by
+ // backup id, so a restore of an older version is paired with the
+ // keyring of that vintage. The [mon.] secret can rotate; using a
+ // single fixed filename would break authentication after restore.
+ std::error_code ec;
+ auto dest = backup_path + "/keyring." + std::to_string(stats.id);
+ std::filesystem::copy_file(
+ path + "/keyring", dest,
+ std::filesystem::copy_options::overwrite_existing, ec);
+ if (!ec) {
+ std::filesystem::permissions(dest,
+ std::filesystem::perms::owner_read | std::filesystem::perms::owner_write,
+ std::filesystem::perm_options::replace, ec);
+ }
+ if (ec) {
+ // Best-effort: a rocksdb backup without a stashed keyring is still
+ // valid; the operator can supply a keyring out-of-band on restore.
+ derr << __func__ << " failed to stash keyring at "
+ << dest << ": " << ec.message() << dendl;
+ }
+ }
+ return stats;
+ }
+
+ /// @brief Remove old backups in mon_backup_path according to the retention config.
+ /// @return stats describing what was kept, deleted, and freed
+ KeyValueDB::BackupCleanupStats backup_cleanup() {
+ auto backup_path = g_conf().get_val<std::string>("mon_backup_path");
+ auto stats = db->backup_cleanup(
+ backup_path,
+ g_conf().get_val<uint64_t>("mon_backup_keep_last"),
+ g_conf().get_val<uint64_t>("mon_backup_keep_hourly"),
+ g_conf().get_val<uint64_t>("mon_backup_keep_daily"));
+ if (stats.error) {
+ return stats;
+ }
+ // Remove keyring.<id> files for backup ids the kv layer just dropped.
+ std::string kv_type;
+ if (read_meta("kv_backend", &kv_type) < 0 || kv_type.empty()) {
+ kv_type = "rocksdb";
+ }
+ std::set<uint64_t> surviving;
+ auto remaining = KeyValueDB::list_backups(g_ceph_context, kv_type, backup_path);
+ if (!remaining) {
+ return stats;
+ }
+ for (const auto& b : *remaining) {
+ surviving.insert(b.id);
+ }
+ std::error_code ec;
+ for (auto it = std::filesystem::directory_iterator(backup_path, ec);
+ it != std::filesystem::directory_iterator();
+ it.increment(ec)) {
+ auto name = it->path().filename().string();
+ if (name.compare(0, 8, "keyring.") != 0) {
+ continue;
+ }
+ std::string idstr = name.substr(8);
+ std::string parse_err;
+ long long id = strict_strtoll(idstr.c_str(), 10, &parse_err);
+ if (!parse_err.empty() || id < 0) {
+ continue;
+ }
+ if (surviving.count(static_cast<uint64_t>(id))) {
+ continue;
+ }
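+ std::error_code rm_ec;
+ std::filesystem::remove(it->path(), rm_ec);
+ if (rm_ec) {
+ // best-effort: a leftover keyring file does not invalidate the cleanup
+ derr << __func__ << " failed to remove stale keyring " << it->path()
+ << ": " << rm_ec.message() << dendl;
+ }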
+ }
+ return stats;
+ }
+
+ /// @brief List all backup versions at backup_path.
+ /// @param cct ceph context
+ /// @param path path to the local mon data dir (used to discover the kv backend)
+ /// @param backup_path path to the backup location
+ /// @return list of BackupStats, one per backup
+ static std::optional<std::vector<KeyValueDB::BackupStats>> list_backups(
+ CephContext *cct, const std::string &path, const std::string &backup_path) {
+ std::string kv_type;
+ int r = read_meta_path("kv_backend", &kv_type, path);
+ if (r < 0 || kv_type.empty()) {
+ // Disaster recovery: mon_data may be empty or absent. We only ship
+ // a rocksdb kv backend today, so default to it for enumeration.
+ kv_type = "rocksdb";
+ }
+ return KeyValueDB::list_backups(cct, kv_type, backup_path);
+ }
+
+
+ /// @brief Restore the backup with the given version from backup_path into path.
+ /// @param cct ceph context
+ /// @param path path to the local mon data dir to restore into
+ /// @param backup_path path to the backup location
+ /// @param version version of the backup to restore (nullopt for latest)
+ /// @return true on success
+ static bool restore_backup(CephContext *cct, const std::string &path,
+ const std::string &backup_path,
+ const std::optional<uint32_t> &version) {
+ std::string kv_type;
+ int r = read_meta_path("kv_backend", &kv_type, path);
+ if (r < 0 || kv_type.empty()) {
+ // Disaster recovery: mon_data is empty or freshly initialized, so
+ // there is no kv_backend marker. Default to rocksdb and stamp the
+ // file back so the subsequent open() finds it.
+ kv_type = "rocksdb";
+ std::error_code ec;
+ std::filesystem::create_directories(path, ec);
+ if (ec) {
+ lderr(cct) << __func__ << " failed to create " << path
+ << ": " << ec.message() << dendl;
+ return false;
+ }
+ std::filesystem::permissions(path,
+ std::filesystem::perms::owner_all,
+ std::filesystem::perm_options::replace, ec);
+ const std::string v = kv_type + "\n";
+ if (safe_write_file(path.c_str(), "kv_backend",
+ v.c_str(), v.length(), 0600) < 0) {
+ lderr(cct) << __func__ << " failed to write kv_backend in "
+ << path << dendl;
+ return false;
+ }
+ }
+ std::string store_path = get_store_path(path);
+
+ // Resolve "latest" up front so we know which versioned keyring to
+ // rehydrate alongside the rocksdb restore. Pick by BackupEngine id
+ // (monotonic per rocksdb) rather than timestamp, so a clock skew
+ // between backups cannot make the default restore pick a stale one.
+ uint32_t resolved_version;
+ if (version) {
+ resolved_version = *version;
+ } else {
+ auto backups = KeyValueDB::list_backups(cct, kv_type, backup_path);
+ if (!backups || backups->empty()) {
+ lderr(cct) << __func__ << " no backups found at " << backup_path << dendl;
+ return false;
+ }
+ resolved_version = std::max_element(
+ backups->begin(), backups->end(),
+ [](const auto& a, const auto& b) { return a.id < b.id; })->id;
+ }
+
+ if (!KeyValueDB::restore_backup(cct, kv_type, store_path, backup_path,
+ resolved_version)) {
+ return false;
+ }
+
+ // Rehydrate the matching keyring (skipped silently if the operator
+ // keeps the keyring out-of-band).
+ std::error_code ec;
+ auto keyring_src = backup_path + "/keyring." + std::to_string(resolved_version);
+ if (std::filesystem::exists(keyring_src, ec)) {
+ std::filesystem::copy_file(
+ keyring_src,
+ path + "/keyring",
+ std::filesystem::copy_options::overwrite_existing,
+ ec);
+ if (ec) {
+ lderr(cct) << __func__ << " failed to restore keyring from "
+ << keyring_src << ": " << ec.message() << dendl;
+ return false;
+ }
+ }
+
+ // The mon store holds auth, config-key and dm-crypt secrets;
+ // tighten everything we just restored to owner-only.
+ std::filesystem::permissions(path,
+ std::filesystem::perms::owner_all,
+ std::filesystem::perm_options::replace, ec);
+ for (auto it = std::filesystem::recursive_directory_iterator(path, ec);
+ it != std::filesystem::recursive_directory_iterator();
+ it.increment(ec)) {
+ std::error_code ec_chmod;
+ auto perms = it->is_directory(ec_chmod)
+ ? std::filesystem::perms::owner_all
+ : (std::filesystem::perms::owner_read | std::filesystem::perms::owner_write);
+ std::filesystem::permissions(it->path(), perms,
+ std::filesystem::perm_options::replace, ec_chmod);
+ if (ec_chmod) {
+ lderr(cct) << __func__ << " failed to chmod " << it->path()
+ << ": " << ec_chmod.message() << dendl;
+ }
+ }
+ return true;
+ }
+
void compact() {
db->compact();
}
/**
* read_meta - read a simple configuration key out-of-band
*
- * Read a simple key value to an unopened/mounted store.
+ * Read a simple key value from an unopened/unmounted store.
*
* Trailing whitespace is stripped off.
*
*/
int read_meta(const std::string& key,
std::string *value) const {
+ return read_meta_path(key, value, path);
+ }
+
+ /**
+ * read_meta_path - read a simple configuration key out-of-band
+ *
+ * Read a simple key value from the store directory at the given path.
+ *
+ * Trailing whitespace is stripped off.
+ *
+ * @param key key name
+ * @param value pointer to value string
+ * @param path path to directory
+ * @returns 0 for success, or an error code
+ */
+ static int read_meta_path(const std::string& key,
+ std::string *value,
+ const std::string& path) {
char buf[4096];
int r = safe_read_file(path.c_str(), key.c_str(),
buf, sizeof(buf));
[mon.$f]
host = $HOSTNAME
mon data = $CEPH_DEV_DIR/mon.$f
+ mon backup path = $CEPH_DEV_DIR/mon.$f-backup
EOF
count=$(($count + 2))
done