From: Sage Weil
Date: Wed, 25 Apr 2018 20:32:38 +0000 (-0500)
Subject: osd/PrimaryLogPG: avoid infinite loop when flush collides with write lock
X-Git-Tag: v13.1.0~78^2
X-Git-Url: http://git.apps.os.sepia.ceph.com/?a=commitdiff_plain;h=refs%2Fpull%2F21653%2Fhead;p=ceph.git

osd/PrimaryLogPG: avoid infinite loop when flush collides with write lock

We try to take a write lock with fop->op.  If we fail, fop->op is put on
the lock's waiting list.  Requeuing it again will simply kick off
processing of another instance of the same op, which will again fail to
take the lock, leading to an infinite loop that can't terminate because
requeue_op is doing a push_front and preventing other PG messages that
might release the lock.

Do the same write lock attempt on any dup_ops so that they too will end up
on the wait list.

It looks like this broke waaay back in commit
d700d99f76e0a29bfb419bc85d19ef1950b62a9a, a 2014 refactor of the OpContext
behavior.

Fixes: https://tracker.ceph.com/issues/23664
Signed-off-by: Sage Weil
---

diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
index a5c260740f8e6..3b18934b4eb1a 100644
--- a/src/osd/PrimaryLogPG.cc
+++ b/src/osd/PrimaryLogPG.cc
@@ -10009,10 +10009,18 @@ int PrimaryLogPG::try_flush_mark_clean(FlushOpRef fop)
                                        fop->op)) {
     dout(20) << __func__ << " took write lock" << dendl;
   } else if (fop->op) {
-    dout(10) << __func__ << " waiting on write lock" << dendl;
+    dout(10) << __func__ << " waiting on write lock " << fop->op << " "
+             << fop->dup_ops << dendl;
     close_op_ctx(ctx.release());
-    requeue_op(fop->op);
-    requeue_ops(fop->dup_ops);
+    // fop->op is now waiting on the lock; get fop->dup_ops to wait too.
+    for (auto op : fop->dup_ops) {
+      bool locked = ctx->lock_manager.get_lock_type(
+        ObjectContext::RWState::RWWRITE,
+        oid,
+        obc,
+        op);
+      assert(!locked);
+    }
     return -EAGAIN;    // will retry
   } else {
     dout(10) << __func__ << " failed write lock, no op; failing" << dendl;
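
The requeue loop described in the commit message can be illustrated with a small standalone toy model. This is only a sketch and not Ceph code: `ToyPG`, `step`, `run`, and the string-valued queue items are hypothetical names invented for illustration, and the model assumes a single work queue processed strictly from the front, plus one write lock that only a later `"release"` item can drop. A failed lock attempt always parks the op on the waiter list; the old behaviour additionally pushes the op back onto the front of the queue.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Toy model (hypothetical, not Ceph code): one PG work queue and one
// object write lock that is held by someone else at the start.
struct ToyPG {
  std::deque<std::string> queue;     // pending work, processed from the front
  std::vector<std::string> waiters;  // ops parked on the write lock
  bool write_locked = true;

  // Process one queue item. "release" drops the lock and wakes the waiters;
  // any other item is an op that needs the write lock.
  void step(bool requeue_on_failure) {
    assert(!queue.empty());
    std::string item = queue.front();
    queue.pop_front();
    if (item == "release") {
      write_locked = false;
      // Woken ops go back to the front of the queue in their original order.
      for (auto it = waiters.rbegin(); it != waiters.rend(); ++it)
        queue.push_front(*it);
      waiters.clear();
    } else if (write_locked) {
      waiters.push_back(item);   // a failed lock attempt parks the op
      if (requeue_on_failure)
        queue.push_front(item);  // old behaviour: also retry immediately
    }
    // Otherwise the lock is free and the op simply completes.
  }
};

// Returns the number of steps needed to drain the queue, or -1 if work is
// still pending after max_steps (treated here as a livelock).
int run(bool requeue_on_failure, int max_steps = 1000) {
  ToyPG pg;
  pg.queue = {"op", "dup_op", "release"};
  for (int steps = 0; steps < max_steps; ++steps) {
    if (pg.queue.empty())
      return steps;
    pg.step(requeue_on_failure);
  }
  return pg.queue.empty() ? max_steps : -1;
}
```

In this model, `run(true)` never reaches the `"release"` item because the failed op is pushed back in front of it on every step, while `run(false)` lets both ops wait on the lock and drains the queue once the lock is released.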