From 828da5b672e5941c8b4f30e85e29709bc6eac7b3 Mon Sep 17 00:00:00 2001
From: Sage Weil <sage@redhat.com>
Date: Wed, 15 Jan 2020 17:09:24 -0600
Subject: [PATCH] os/bluestore: avoid race between split_cache and get/put
 pin/unpin

The Onode get() and put() methods may call pin() and unpin() when we
transition between 1 and >1 ref.  To do this they make use of the
OSDShard *s pointer without taking any additional locks.  This runs afoul
of Collection::split_cache(), which moves an onode between shards.

It would be very complicated to address this race head-on: we ultimately
need to be under the protction of the OSDShard lock to do the pin/unpin,
but if OSDShard *s is changing, we don't know which lock to take.  And if
it is null, what do we do?  It might be null when we test but then get
set by split_cache.  And what if there is a put() followed by a get(),
and they managed to acquire the appropriate lock(s), but the get() thread
gets it first?  And so on.

We can avoid this whole mess by preventing a put() or get() from making
this transition (and looking at OSDShard *s) at all.

The only reason nref was *ever* < 2 is because the sequence was
 - remove from old collection onode_map
 - move onode to new shard
 - add to new collection onode_map

The fix is to simply
 - remove from old colleciton onode_map
 - add to new collection onode_map
 - adjust onode shard

That ensures that the onode's nref is >= 2 at all times.

At the same time, improve this code so that we don't _rm and _add when
the src and dest shard are the same.

Fixes: https://tracker.ceph.com/issues/43147
Fixes: https://tracker.ceph.com/issues/43131
Signed-off-by: Sage Weil <sage@redhat.com>
---
 src/os/bluestore/BlueStore.cc | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc
index fecba3ad59860..6dfe103e89ba9 100644
--- a/src/os/bluestore/BlueStore.cc
+++ b/src/os/bluestore/BlueStore.cc
@@ -3731,13 +3731,24 @@ void BlueStore::Collection::split_cache(
       ldout(store->cct, 20) << __func__ << " moving " << o << " " << o->oid
 			    << dendl;
 
-      onode_map.cache->_rm(p->second);
+      // move the onode to the new map before futzing with the cache
+      // shard, ensuring that nref is always >= 2, and no racing
+      // thread can trigger a pin or unpin (which does *not* behave
+      // well when we are clearing and resetting the 's' shard
+      // pointer!).
       p = onode_map.onode_map.erase(p);
-
-      o->c = dest;
-      dest->onode_map.cache->_add(o, 1);
       dest->onode_map.onode_map[o->oid] = o;
 
+      if (onode_map.cache != dest->onode_map.cache) {
+	// move onode to a different cache shard
+	onode_map.cache->_rm(o);
+	o->c = dest;
+	dest->onode_map.cache->_add(o, 1);
+      } else {
+	// the onode is in the same cache shard, making our move simpler.
+	o->c = dest;
+      }
+
       // move over shared blobs and buffers.  cover shared blobs from
       // both extent map and spanning blob map (the full extent map
       // may not be faulted in)
-- 
2.39.5