From: Sage Weil
Date: Fri, 23 Feb 2018 19:18:53 +0000 (-0600)
Subject: osd: improve documentation for event queue ordering and requeueing rules
X-Git-Tag: v13.1.0~390^2~39
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=b4d96be92dc09fc65ea09059c294110cdccf7837;p=ceph.git

osd: improve documentation for event queue ordering and requeueing rules

Signed-off-by: Sage Weil
---

diff --git a/src/osd/OSD.h b/src/osd/OSD.h
index 658e8a317390..dcb5c87780b2 100644
--- a/src/osd/OSD.h
+++ b/src/osd/OSD.h
@@ -1029,6 +1029,73 @@ enum class io_queue {
   mclock_client,
 };
 
+
+/*
+
+  Each PG slot includes queues for events that are processing and/or waiting
+  for a PG to be materialized in the slot.
+
+  These are the constraints:
+
+  - client ops must remain ordered by client, regardless of map epoch
+  - peering messages/events from peers must remain ordered by peer
+  - peering messages and client ops need not be ordered relative to each
+    other
+
+  - some peering events can create a pg (e.g., notify)
+  - the query peering event can proceed when a PG doesn't exist
+
+  Implementation notes:
+
+  - everybody waits for split.  If the OSD has the parent PG it will
+    instantiate the PGSlot early and mark it waiting_for_split.  Everything
+    will wait until the parent is able to commit the split operation and the
+    child PGs are materialized in the child slots.
+
+  - every event has an epoch property and will wait for the OSDShard to catch
+    up to that epoch.  For example, if we get a peering event from a future
+    epoch, the event will wait in the slot until the local OSD has caught up.
+    (We should be judicious in specifying the required epoch [by, e.g.,
+    setting it to the same_interval_since epoch] so that we don't wait for
+    epochs that don't affect the given PG.)
+
+  - we maintain two separate wait lists, *waiting* and *waiting_peering*.  The
+    OpQueueItem has an is_peering() bool to determine which we use.  Waiting
+    peering events are queued up by the epoch they require.
+
+  - when we wake a PG slot (e.g., we finished split, or got a newer osdmap, or
+    materialized the PG), we wake *all* waiting items.  (This could be
+    optimized, probably, but we don't bother.)  We always requeue peering
+    items ahead of client ops.
+
+  - some peering events are marked !peering_requires_pg (PGQuery).  If we do
+    not have a PG, these are processed immediately (under the shard lock).
+
+  - if we do not have a PG present, we check whether the slot maps to the
+    current host.  If so, we either queue the item and wait for the PG to
+    materialize, or (if the event is a pg-creating event like PGNotify) we
+    materialize the PG.
+
+  - when we advance the osdmap on the OSDShard, we scan pg slots and
+    discard any slots with no pg (and not waiting_for_split) that no
+    longer map to the current host.
+
+  Some notes:
+
+  - there is a theoretical race between a query (which can proceed when the pg
+    doesn't exist) and a split (which may materialize a PG in a different
+    shard):
+    - osd has epoch E
+    - shard 0 processes a notify on P from epoch E-1
+    - shard 0 identifies that P splits to P+C in epoch E
+    - shard 1 receives a query for P (epoch E), returns DNE
+    - shard 1 installs C in shard 0 with waiting_for_split
+
+    This can't really be fixed without ordering queries over all shards.  In
+    practice, it is very unlikely to occur, since only the primary sends a
+    notify (or other pg-creating event) and only the primary sends a query.
+    Even if it does happen, the instantiated P is empty, so reporting DNE vs
+    an empty C is minimal harm.
+
+ */
+
 struct OSDShardPGSlot {
   PGRef pg;                      ///< pg reference
   deque<OpQueueItem> to_process; ///< order items for this slot
diff --git a/src/osd/OpQueueItem.h b/src/osd/OpQueueItem.h
index faebf4a634a8..a678560f1417 100644
--- a/src/osd/OpQueueItem.h
+++ b/src/osd/OpQueueItem.h
@@ -12,38 +12,6 @@
  *
  */
 
-
-/*
-
-  Ordering notes:
-
-  - everybody waits for split.
-
-  - client ops must remained ordered by client, regardless of map epoch
-  - client ops wait for pg to exist (or are discarded if we confirm the pg
-    no longer should).
-  - client ops must wait for the min epoch.
-    -> this happens under the PG itself, not as part of the queue.
-       currently in PrimaryLogPG::do_request()
-    -> the pg waiting queue is ordered by client, so other clients do not
-       have to wait
-
-  - peering messages must wait for the required_map
-    - currently in do_peering_event(), PG::peering_waiters
-  - peering messages must remain ordered (globally or by peer?)
-  - some peering messages create the pg
-  - query does not need a pg.
-  - q: do any peering messages need to wait for the pg to exist?
-    pretty sure no!
-
-  ---
-
-  bool waiting_for_split -- everyone waits.
-
-  waiting -- client/mon ops
-  waiting_peering -- peering ops
-
- */
-
 #pragma once
 
 #include
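
As a rough illustration of the requeue rules the new OSD.h comment describes
(two wait lists per slot, peering events gated on the epoch they require, and
peering items always requeued ahead of client ops when the slot wakes), here
is a minimal, self-contained C++ sketch.  Item, Slot, and wake() are
simplified stand-ins invented for this example, not the real OSDShardPGSlot
machinery:

#include <deque>
#include <iostream>
#include <map>
#include <string>

struct Item {
  std::string desc;
  unsigned epoch;   // epoch the shard must reach before this item can run
};

struct Slot {
  std::deque<Item> to_process;     // ordered items ready to run
  std::deque<Item> waiting;        // client ops, in arrival order
  std::multimap<unsigned, Item> waiting_peering;  // peering events, keyed by
                                                  // the epoch they require

  // Wake the slot (split finished, newer osdmap, or PG materialized):
  // requeue the peering events whose epoch we have reached first, then all
  // waiting client ops, preserving each list's internal order.
  void wake(unsigned osdmap_epoch) {
    auto last = waiting_peering.upper_bound(osdmap_epoch);
    for (auto p = waiting_peering.begin(); p != last; ++p) {
      to_process.push_back(p->second);
    }
    waiting_peering.erase(waiting_peering.begin(), last);
    for (auto& op : waiting) {
      to_process.push_back(op);
    }
    waiting.clear();
  }
};

int main() {
  Slot slot;
  slot.waiting.push_back({"client write", 10});
  slot.waiting_peering.insert({10, {"notify from peer", 10}});
  slot.waiting_peering.insert({12, {"peering event from a future epoch", 12}});

  slot.wake(11);  // the local OSD has caught up to epoch 11

  for (auto& i : slot.to_process) {
    std::cout << i.desc << "\n";
  }
  // Prints the epoch-10 notify, then the client write.  The epoch-12 peering
  // event stays in waiting_peering until the OSD reaches epoch 12.
}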
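
Similarly, a small sketch of the dispatch decision for an event arriving at a
slot with no PG present: !peering_requires_pg events (e.g., PGQuery) are
processed immediately, pg-creating events (e.g., PGNotify) materialize the PG,
and everything else either waits or is discarded when the slot no longer maps
to this host.  Event, Action, and dispatch_without_pg() are hypothetical names
for illustration only:

#include <cassert>

enum class Action { ProcessNow, MaterializePG, Wait, Discard };

struct Event {
  bool is_peering;
  bool peering_requires_pg;  // false for e.g. PGQuery
  bool creates_pg;           // true for e.g. PGNotify
};

Action dispatch_without_pg(const Event& e, bool slot_maps_to_this_host) {
  if (e.is_peering && !e.peering_requires_pg) {
    // e.g. PGQuery: process immediately (under the shard lock); no PG needed.
    return Action::ProcessNow;
  }
  if (!slot_maps_to_this_host) {
    // When the osdmap advances, a slot with no pg (and not waiting_for_split)
    // that no longer maps to the current host is discarded.
    return Action::Discard;
  }
  if (e.creates_pg) {
    // e.g. PGNotify: materialize the PG in this slot.
    return Action::MaterializePG;
  }
  // Otherwise queue the item and wait for the PG to materialize.
  return Action::Wait;
}

int main() {
  assert(dispatch_without_pg({true, false, false}, true) == Action::ProcessNow);
  assert(dispatch_without_pg({true, true, true}, true) == Action::MaterializePG);
  assert(dispatch_without_pg({false, false, false}, true) == Action::Wait);
  assert(dispatch_without_pg({false, false, false}, false) == Action::Discard);
}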