From 5cc9252784731a138c7a5b1ec51b63547232bdd1 Mon Sep 17 00:00:00 2001
From: Neha Ojha
Date: Mon, 26 Mar 2018 15:08:12 -0700
Subject: [PATCH] doc: dev description of async recovery

Signed-off-by: Josh Durgin
Signed-off-by: Neha Ojha
---
 doc/dev/osd_internals/async_recovery.rst | 48 ++++++++++++++++++++++++
 1 file changed, 48 insertions(+)
 create mode 100644 doc/dev/osd_internals/async_recovery.rst

diff --git a/doc/dev/osd_internals/async_recovery.rst b/doc/dev/osd_internals/async_recovery.rst
new file mode 100644
index 000000000000..e31f8dc70637
--- /dev/null
+++ b/doc/dev/osd_internals/async_recovery.rst
@@ -0,0 +1,48 @@
+=====================
+Asynchronous Recovery
+=====================
+
+PGs in Ceph maintain a log of writes to allow speedy recovery of data.
+Instead of scanning all of the objects to see what is missing on each
+osd, we can examine the pg log to see which objects we need to
+recover. See `Log Based PG`_ for more detail on this process.
+
+.. _`Log Based PG`:
+  log_based_pg.rst
+
+Until now, this recovery process was synchronous - it blocked writes
+to an object until it was recovered. In contrast, backfill could allow
+writes to proceed (assuming enough up-to-date copies of the data were
+available) by temporarily assigning a different acting set, and
+backfilling an OSD outside of the acting set. In some circumstances,
+this ends up being significantly better for availability, e.g. if the
+pg log contains 3000 writes to different objects. Recovering several
+megabytes of an object (or even worse, several megabytes of omap keys,
+like rgw bucket indexes) can drastically increase latency for a small
+update, and combined with requests spread across many degraded objects
+it is a recipe for slow requests.
+
+To avoid this, we can perform recovery in the background on an OSD out
+of the acting set, similar to backfill, but still using the PG log to
+determine what needs recovery. This is known as asynchronous recovery.
+
+Exactly when we perform asynchronous recovery instead of synchronous
+recovery is not a clear-cut threshold. There are a few criteria which
+need to be met for asynchronous recovery:
+
+* try to keep min_size replicas available
+* use the approximate magnitude of the difference in length of
+  logs as the cost of recovery
+* use the parameter osd_async_recovery_min_pg_log_entries to determine
+  when asynchronous recovery is appropriate
+
+With the existing peering process, when we choose the acting set we
+have not fetched the pg log from each peer; we have only the bounds of
+it and other metadata from their pg_info_t. It would be more expensive
+to fetch and examine every log at this point, so we only consider an
+approximate check for log length for now.
+
+While async recovery is occurring, writes on members of the acting set
+may proceed, but we need to send their log entries to the async
+recovery targets (just like we do for backfill osds) so that they
+can completely catch up.
-- 
2.47.3