osd: improve documentation for event queue ordering and requeueing rules

author Sage Weil <sage@redhat.com>

Fri, 23 Feb 2018 19:18:53 +0000 (13:18 -0600)

committer Sage Weil <sage@redhat.com>

Wed, 4 Apr 2018 13:26:57 +0000 (08:26 -0500)
diff --git a/src/osd/OSD.h b/src/osd/OSD.h

index 658e8a317390bd890d4ceaa65a6a7714e3bd12e8..dcb5c87780b2d9a2af967c35b8634453c5d38d5d 100644
--- a/src/osd/OSD.h
+++ b/src/osd/OSD.h
@@ -1029,6 +1029,73 @@ enum class io_queue {
    mclock_client,
  };
  
+
+/*
+
+  Each PG slot includes queues for events that are processing and/or waiting
+  for a PG to be materialized in the slot.
+
+  These are the constraints:
+
+  - client ops must remain ordered by client, regardless of map epoch
+  - peering messages/events from peers must remain ordered by peer
+  - peering messages and client ops need not be ordered relative to each other
+
+  - some peering events can create a pg (e.g., notify)
+  - the query peering event can proceed when a PG doesn't exist
+
+  Implementation notes:
+
+  - everybody waits for split.  If the OSD has the parent PG it will instantiate
+    the PGSlot early and mark it waiting_for_split.  Everything will wait until
+    the parent is able to commit the split operation and the child PGs are
+    materialized in the child slots.
+
+  - every event has an epoch property and will wait for the OSDShard to catch
+    up to that epoch.  For example, if we get a peering event from a future
+    epoch, the event will wait in the slot until the local OSD has caught up.
+    (We should be judicious in specifying the required epoch [by, e.g., setting
+    it to the same_interval_since epoch] so that we don't wait for epochs that
+    don't affect the given PG.)
+
+  - we maintain two separate wait lists, *waiting* and *waiting_peering*. The
+    OpQueueItem has an is_peering() bool to determine which we use.  Waiting
+    peering events are queued up by epoch required.
+
+  - when we wake a PG slot (e.g., we finished split, or got a newer osdmap, or
+    materialized the PG), we wake *all* waiting items.  (This could be optimized,
+    probably, but we don't bother.)  We always requeue peering items ahead of
+    client ops.
+
+  - some peering events are marked !peering_requires_pg (PGQuery).  If we do
+    not have a PG these are processed immediately (under the shard lock).
+
+  - if we do not have a PG present, we check if the slot maps to the current
+    host.  If so, we either queue the item and wait for the PG to materialize,
+    or (if the event is a PG-creating event like PGNotify) we materialize it.
+
+  - when we advance the osdmap on the OSDShard, we scan pg slots and
+    discard any slots with no pg (and not waiting_for_split) that no
+    longer map to the current host.
+
+  Some notes:
+
+  - There is a theoretical race between query (which can proceed if the pg doesn't
+    exist) and split (which may materialize a PG in a different shard):
+      - osd has epoch E
+      - shard 0 processes notify on P from epoch E-1
+      - shard 0 identifies P splits to P+C in epoch E
+      - shard 1 receives query for P (epoch E), returns DNE
+      - shard 1 installs C in shard 0 with waiting_for_split
+
+    This can't really be fixed without ordering queries over all shards.  In
+    practice, it is very unlikely to occur, since only the primary sends a
+    notify (or other creating event) and only the primary sends a query.
+    Even if it does happen, the instantiated P is empty, so reporting DNE vs
+    empty C does minimal harm.
+
+  */
+
  struct OSDShardPGSlot {
    PGRef pg;                      ///< pg reference
    deque<OpQueueItem> to_process; ///< order items for this slot
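The wait-list and wake rules in the new comment above can be illustrated with a small standalone sketch. This is not the Ceph implementation: `Item`, `Slot`, `park()` and `wake()` are hypothetical names, the item carries only the fields the rules mention, and releasing peering items in ascending epoch order is an assumption read from "queued up by epoch required".

```cpp
#include <cstdint>
#include <deque>
#include <map>
#include <string>
#include <utility>

using epoch_t = uint32_t;

// Hypothetical stand-in for OpQueueItem; only the fields the sketch needs.
struct Item {
  std::string name;
  epoch_t epoch;   // map epoch the item requires
  bool peering;    // plays the role of OpQueueItem::is_peering()
};

// Hypothetical, simplified slot; the real OSDShardPGSlot holds more state.
struct Slot {
  std::deque<Item> waiting;                             // client ops, arrival order
  std::map<epoch_t, std::deque<Item>> waiting_peering;  // peering items by epoch

  // Park an item on the wait list selected by its peering flag.
  void park(Item i) {
    if (i.peering) {
      waiting_peering[i.epoch].push_back(std::move(i));
    } else {
      waiting.push_back(std::move(i));
    }
  }

  // Wake *all* waiting items and return them in requeue order:
  // peering items first (ascending epoch, an assumption), then client ops.
  std::deque<Item> wake() {
    std::deque<Item> out;
    for (auto& kv : waiting_peering) {
      for (auto& i : kv.second) {
        out.push_back(std::move(i));
      }
    }
    for (auto& i : waiting) {
      out.push_back(std::move(i));
    }
    waiting_peering.clear();
    waiting.clear();
    return out;
  }
};
```

Parking two client ops and two peering events and then calling `wake()` yields both peering events before either client op, while each list keeps its own internal order. Where the woken items land relative to work already queued is left out of the sketch, since the comment does not specify it.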
diff --git a/src/osd/OpQueueItem.h b/src/osd/OpQueueItem.h

index faebf4a634a8ad479c0a34140e5ee4aa4e100bdf..a678560f141744fbeb6a0b59fe7dc944b57bfaf1 100644
--- a/src/osd/OpQueueItem.h
+++ b/src/osd/OpQueueItem.h
@@ -12,38 +12,6 @@
   *
   */
  
-
-/*
-
-  Ordering notes:
-
-  - everybody waits for split.
-
-  - client ops must remained ordered by client, regardless of map epoch
-  - client ops wait for pg to exist (or are discarded if we confirm the pg
-  no longer should).
-  - client ops must wait for the min epoch.
-    -> this happens under the PG itself, not as part of the queue.
-       currently in PrimaryLogPG::do_request()
-    -> the pg waiting queue is ordered by client, so other clients do not have to wait
-
-  - peering messages must wait for the required_map
-    - currently in do_peering_event(), PG::peering_waiters
-  - peering messages must remain ordered (globally or by peer?)
-  - some peering messages create the pg
-  - query does not need a pg.
-    - q: do any peering messages need to wait for the pg to exist?
-        pretty sure no!
-
-    ---
-
-    bool waiting_for_split -- everyone waits.
-
-    waiting -- client/mon ops
-    waiting_peering -- peering ops
-
-  */
-
  #pragma once
  
  #include <ostream>
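For readers comparing the removed ordering notes with the replacement comment in OSD.h, the handling of an item that reaches a slot with no PG could be sketched roughly as below. Everything here is hypothetical and for illustration only: `Event`, `SlotState`, `Action` and `decide()` are invented names, the order of the checks is an assumption, and dropping an item whose slot no longer maps to this OSD is inferred from the "discarded" and "discard any slots" remarks rather than stated for every event type.

```cpp
#include <cstdint>

// Possible outcomes for an item arriving at a slot that has no PG.
enum class Action {
  WaitForSplit,      // everybody waits for split
  WaitForMap,        // event epoch is ahead of the shard's osdmap
  ProcessWithoutPG,  // e.g. query (!peering_requires_pg)
  CreatePG,          // e.g. notify materializes the PG
  WaitForPG,         // queue the item until the PG materializes
  Drop,              // assumption: slot no longer maps here
};

// Hypothetical view of the properties the rules depend on.
struct Event {
  uint32_t epoch;     // epoch the event requires
  bool requires_pg;   // false for query-like events
  bool creates_pg;    // true for PG-creating events such as notify
};

struct SlotState {
  bool waiting_for_split;
  bool maps_to_this_osd;
};

// Sketch of the decision; the precedence of the checks is an assumption.
Action decide(const Event& e, const SlotState& s, uint32_t shard_epoch) {
  if (s.waiting_for_split) return Action::WaitForSplit;
  if (e.epoch > shard_epoch) return Action::WaitForMap;
  if (!e.requires_pg) return Action::ProcessWithoutPG;
  if (!s.maps_to_this_osd) return Action::Drop;
  return e.creates_pg ? Action::CreatePG : Action::WaitForPG;
}
```

Writing the rules as a single pure function makes the precedence explicit, which is the part the prose leaves open: for instance, whether a query for a slot marked waiting_for_split waits or proceeds depends on which check runs first.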