From: Josh Durgin <jdurgin@redhat.com>
Date: Fri, 6 Apr 2018 06:43:13 +0000 (-0400)
Subject: osd/ECBackend: only check required shards when finishing recovery reads
X-Git-Tag: v12.2.6~140^2~3
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=e9e4c42bf098ab9ae2b5d60e7d3a8f688d70e3db;p=ceph.git

osd/ECBackend: only check required shards when finishing recovery reads

1235810c2ad08ccb7ef5946686eb2b85798f5bca allowed recovery to use
multiple passes of reads to handle EIO, but the end condition for
checking whether we finished reading requires the full data to be
decodable (this is what get_want_to_read_shards returns).

Normally this is just a loss of efficiency: when there is only one
object, the subsequent read works and grabs all the necessary data.
The crash comes from having multiple objects in the same ReadOp - in
that case the sequence of events is:

- start recovery of two objects (osd_recovery_max_single_start > 1)
- read object a shard 3
- read object b shard 3
- fail minimum_to_decode because shard 3 can't reconstruct all of object a
- re-read all of object a, marking more reads in progress
- fail minimum_to_decode because shard 3 can't reconstruct all of object b
- skip re-reading object b because there are now reads in progress
- finish reading k shards of object a
- still fail minimum_to_decode for object b, so no extra data was read
- send_all_remaining_reads tries to look up object b in the ReadOp object
- crash dereferencing to_read[object b], since this was cleared after
  handling the original object b read reply

This patch fixes the immediate inefficiency and crash by only checking
for the missing shards that were requested, rather than the entire
object, for recovery reads.

Fixes: http://tracker.ceph.com/issues/23195 (first crash)
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
(cherry picked from commit 468ad4b41010488c8d48ef65ccbebfdb4270690f)

Conflicts:
	src/osd/ECBackend.cc (trivial)
---

diff --git a/qa/standalone/erasure-code/test-erasure-eio.sh b/qa/standalone/erasure-code/test-erasure-eio.sh
index 4bfc898c8e4c..62ac49dbba46 100755
--- a/qa/standalone/erasure-code/test-erasure-eio.sh
+++ b/qa/standalone/erasure-code/test-erasure-eio.sh
@@ -379,6 +379,65 @@ function TEST_ec_recovery_errors() {
     delete_erasure_coded_pool $poolname
 }
 
+# Test recovery when there's only one shard to recover, but multiple
+# objects recovering in one RecoveryOp
+function TEST_ec_recovery_multiple_objects() {
+    local dir=$1
+    local objname=myobject
+
+    export CEPH_ARGS
+    CEPH_ARGS+=' --osd-recovery-max-single-start 3 --osd-recovery-max-active 3 '
+    setup_osds 7 || return 1
+
+    local poolname=pool-jerasure
+    create_erasure_coded_pool $poolname 3 2 || return 1
+
+    rados_put $dir $poolname test1
+    rados_put $dir $poolname test2
+    rados_put $dir $poolname test3
+
+    ceph osd out 0 || return 1
+
+    # Cluster should recover these objects all at once
+    wait_for_clean || return 1
+
+    rados_get $dir $poolname test1
+    rados_get $dir $poolname test2
+    rados_get $dir $poolname test3
+
+    delete_erasure_coded_pool $poolname
+}
+
+# test multi-object recovery when the one missing shard gets EIO
+function TEST_ec_recovery_multiple_objects_eio() {
+    local dir=$1
+    local objname=myobject
+
+    export CEPH_ARGS
+    CEPH_ARGS+=' --osd-recovery-max-single-start 3 --osd-recovery-max-active 3 '
+    setup_osds 7 || return 1
+
+    local poolname=pool-jerasure
+    create_erasure_coded_pool $poolname 3 2 || return 1
+
+    rados_put $dir $poolname test1
+    rados_put $dir $poolname test2
+    rados_put $dir $poolname test3
+
+    # can't read from this shard anymore
+    inject_eio ec data $poolname $objname $dir 0 || return 1
+    ceph osd out 0 || return 1
+
+    # Cluster should recover these objects all at once
+    wait_for_clean || return 1
+
+    rados_get $dir $poolname test1
+    rados_get $dir $poolname test2
+    rados_get $dir $poolname test3
+
+    delete_erasure_coded_pool $poolname
+}
+
 # Test backfill with unfound object
 function TEST_ec_backfill_unfound() {
     local dir=$1
diff --git a/src/osd/ECBackend.cc b/src/osd/ECBackend.cc
index a0fecedda32a..6bdc4e351e4a 100644
--- a/src/osd/ECBackend.cc
+++ b/src/osd/ECBackend.cc
@@ -1208,10 +1208,9 @@ void ECBackend::handle_sub_read_reply(
       have.insert(j->first.shard);
       dout(20) << __func__ << " have shard=" << j->first.shard << dendl;
     }
-    set<int> want_to_read, dummy_minimum;
-    get_want_to_read_shards(&want_to_read);
+    set<int> dummy_minimum;
     int err;
-    if ((err = ec_impl->minimum_to_decode(want_to_read, have, &dummy_minimum)) < 0) {
+    if ((err = ec_impl->minimum_to_decode(rop.want_to_read[iter->first], have, &dummy_minimum)) < 0) {
       dout(20) << __func__ << " minimum_to_decode failed" << dendl;
       if (rop.in_progress.empty()) {
         // If we don't have enough copies and we haven't sent reads for all shards
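
Note (illustration only, not part of the patch): a minimal standalone
C++ sketch of why the end condition matters. can_decode() is a toy
stand-in for ec_impl->minimum_to_decode(), and k=3 mirrors the
jerasure 3+2 pool used by the tests above; the helper name and the
shard values are hypothetical.

  #include <algorithm>
  #include <iostream>
  #include <set>

  // Toy stand-in for minimum_to_decode(): the wanted shards are
  // recoverable if we already hold all of them, or if we hold any
  // k shards and can rebuild the rest.
  constexpr unsigned k = 3;

  bool can_decode(const std::set<int>& want, const std::set<int>& have) {
    if (std::includes(have.begin(), have.end(), want.begin(), want.end()))
      return true;            // every wanted shard was read directly
    return have.size() >= k;  // otherwise any k shards suffice
  }

  int main() {
    std::set<int> have = {3};       // the recovery read returned shard 3
    std::set<int> requested = {3};  // what this ReadOp actually asked for
    std::set<int> whole_object = {0, 1, 2};  // full decode set (old check)

    // Old end condition: "is the whole object decodable?" - fails with
    // only shard 3 in hand, forcing another read pass (and, with several
    // objects in one ReadOp, the crash described above).
    std::cout << "old check: " << can_decode(whole_object, have) << "\n"; // 0

    // New end condition: "are the requested shards decodable?" - succeeds,
    // so the recovery read can complete without another pass.
    std::cout << "new check: " << can_decode(requested, have) << "\n";    // 1
  }

This is the distinction the patch draws by passing
rop.want_to_read[iter->first] (the shards requested for that object) to
minimum_to_decode() instead of the full set from get_want_to_read_shards().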