From 828da5b672e5941c8b4f30e85e29709bc6eac7b3 Mon Sep 17 00:00:00 2001 From: Sage Weil Date: Wed, 15 Jan 2020 17:09:24 -0600 Subject: [PATCH] os/bluestore: avoid race between split_cache and get/put pin/unpin The Onode get() and put() methods may call pin() and unpin() when we transition between 1 and >1 ref. To do this they make use of the OSDShard *s pointer without taking any additional locks. This runs afoul of Collection::split_cache(), which moves an onode between shards. It would be very complicated to address this race head-on: we ultimately need to be under the protction of the OSDShard lock to do the pin/unpin, but if OSDShard *s is changing, we don't know which lock to take. And if it is null, what do we do? It might be null when we test but then get set by split_cache. And what if there is a put() followed by a get(), and they managed to acquire the appropriate lock(s), but the get() thread gets it first? And so on. We can avoid this whole mess by preventing a put() or get() from making this transition (and looking at OSDShard *s) at all. The only reason nref was *ever* < 2 is because the sequence was - remove from old collection onode_map - move onode to new shard - add to new collection onode_map The fix is to simply - remove from old colleciton onode_map - add to new collection onode_map - adjust onode shard That ensures that the onode's nref is >= 2 at all times. At the same time, improve this code so that we don't _rm and _add when the src and dest shard are the same. Fixes: https://tracker.ceph.com/issues/43147 Fixes: https://tracker.ceph.com/issues/43131 Signed-off-by: Sage Weil --- src/os/bluestore/BlueStore.cc | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc index fecba3ad59860..6dfe103e89ba9 100644 --- a/src/os/bluestore/BlueStore.cc +++ b/src/os/bluestore/BlueStore.cc @@ -3731,13 +3731,24 @@ void BlueStore::Collection::split_cache( ldout(store->cct, 20) << __func__ << " moving " << o << " " << o->oid << dendl; - onode_map.cache->_rm(p->second); + // move the onode to the new map before futzing with the cache + // shard, ensuring that nref is always >= 2, and no racing + // thread can trigger a pin or unpin (which does *not* behave + // well when we are clearing and resetting the 's' shard + // pointer!). p = onode_map.onode_map.erase(p); - - o->c = dest; - dest->onode_map.cache->_add(o, 1); dest->onode_map.onode_map[o->oid] = o; + if (onode_map.cache != dest->onode_map.cache) { + // move onode to a different cache shard + onode_map.cache->_rm(o); + o->c = dest; + dest->onode_map.cache->_add(o, 1); + } else { + // the onode is in the same cache shard, making our move simpler. + o->c = dest; + } + // move over shared blobs and buffers. cover shared blobs from // both extent map and spanning blob map (the full extent map // may not be faulted in) -- 2.39.5