git.apps.os.sepia.ceph.com Git - ceph.git/log
10 years agoMerge pull request #3169 from ceph/wip-8797-firefly
Loic Dachary [Fri, 9 Jan 2015 00:30:59 +0000 (01:30 +0100)]
Merge pull request #3169 from ceph/wip-8797-firefly

Wip 8797 firefly

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Loic Dachary <ldachary@redhat.com>
10 years agoMerge pull request #3179 from dachary/wip-9998-crush-underfloat-firefly
Loic Dachary [Fri, 9 Jan 2015 00:28:49 +0000 (01:28 +0100)]
Merge pull request #3179 from dachary/wip-9998-crush-underfloat-firefly

crush: fix weight underflow issue (firefly)

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Loic Dachary <ldachary@redhat.com>
10 years agoMerge pull request #3220 from ceph/wip-mon-backports.firefly
Loic Dachary [Fri, 9 Jan 2015 00:28:11 +0000 (01:28 +0100)]
Merge pull request #3220 from ceph/wip-mon-backports.firefly

mon: backports for #9987 against firefly

Reviewed-by: Joao Eduardo Luis <joao@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Loic Dachary <ldachary@redhat.com>
10 years agoMerge pull request #3258 from ceph/wip-10372-firefly
Loic Dachary [Fri, 9 Jan 2015 00:26:30 +0000 (01:26 +0100)]
Merge pull request #3258 from ceph/wip-10372-firefly

osd: fix librados pool deletion race on firefly

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
Reviewed-by: Loic Dachary <ldachary@redhat.com>
10 years agoIf trusty, use older version of qemu
Warren Usui [Fri, 19 Dec 2014 04:00:28 +0000 (20:00 -0800)]
If trusty, use older version of qemu

Fixes #10319
Signed-off-by: Warren Usui <warren.usui@inktank.com>
(cherry-picked from 46a1a4cb670d30397979cd89808a2e420cef2c11)

10 years agoMerge pull request #3264 from dachary/wip-jerasure-firefly
Sage Weil [Mon, 29 Dec 2014 22:31:23 +0000 (14:31 -0800)]
Merge pull request #3264 from dachary/wip-jerasure-firefly

erasure-code: update links to jerasure upstream

10 years agoMerge pull request #3268 from ceph/firefly-10415
Sage Weil [Mon, 29 Dec 2014 18:55:39 +0000 (10:55 -0800)]
Merge pull request #3268 from ceph/firefly-10415

libcephfs/test.cc: close fd before umount

10 years agoerasure-code: update links to jerasure upstream 3264/head
Loic Dachary [Sun, 28 Dec 2014 09:29:54 +0000 (10:29 +0100)]
erasure-code: update links to jerasure upstream

It moved from bitbucket to jerasure.org

Signed-off-by: Loic Dachary <ldachary@redhat.com>
(cherry picked from commit 8e86f901939f16cc9c8ad7a4108ac4bcf3916d2c)

10 years agolibcephfs/test.cc: close fd before umount 3268/head
Yan, Zheng [Tue, 23 Dec 2014 02:22:00 +0000 (10:22 +0800)]
libcephfs/test.cc: close fd before umount

Fixes: #10415
Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit d3fb563cee4c4cf08ff4ee01782e52a100462429)

10 years agoosdc/Objecter: handle reply race with pool deletion 3258/head
Sage Weil [Tue, 23 Dec 2014 23:49:26 +0000 (15:49 -0800)]
osdc/Objecter: handle reply race with pool deletion

We need to handle this scenario:

 - send request in epoch X
 - osd replies
 - pool is deleted in epoch X+1
 - client gets map X+1, sends a map check
 - client handles reply
   -> asserts that no map checks are in flight

This isn't the best solution.  We could infer that a map check isn't needed
since the pool existed earlier and doesn't now.  But this is firefly and
the fix is no more expensive than the old assert.

Fixes: #10372
Signed-off-by: Sage Weil <sage@redhat.com>
10 years agomon/PGMap and PGMonitor: update last_epoch_clean cache from new osd keys 3220/head
Sage Weil [Sun, 2 Nov 2014 16:50:59 +0000 (08:50 -0800)]
mon/PGMap and PGMonitor: update last_epoch_clean cache from new osd keys

We were only invalidating the cached value from apply_incremental, which
is no longer called on modern clusters.

Fix this by storing the update epoch in the key as well (it is not part
of osd_stat_t).

Backport: giant, firefly, dumpling(?)
Fixes: #9987
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 093c5f0cabeb552b90d944da2c50de48fcf6f564)

10 years agomon/PGMap: invalidate cached min_last_epoch_clean from new-style pg keys
Sage Weil [Sun, 2 Nov 2014 16:49:48 +0000 (08:49 -0800)]
mon/PGMap: invalidate cached min_last_epoch_clean from new-style pg keys

We were only invalidating the cache from the legacy apply_incremental(),
which is no longer called on modern clusters.

Fixes: #9987
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 3fb731b722c50672a5a9de0c86a621f5f50f2d06)

10 years agocrush/CrushWrapper: fix create_or_move_item when name exists but item does not 3179/head
Sage Weil [Mon, 24 Nov 2014 02:50:51 +0000 (18:50 -0800)]
crush/CrushWrapper: fix create_or_move_item when name exists but item does not

We were using item_exists(), which simply checks if we have a name defined
for the item.  Instead, use _search_item_exists(), which looks for an
instance of the item somewhere in the hierarchy.  This matches what
get_item_weightf() is doing, which ensures we get a non-negative weight
that converts properly to floating point.

Backport: giant, firefly
Fixes: #9998
Reported-by: Pawel Sadowski <ceph@sadziu.pl>
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 9902383c690dca9ed5ba667800413daa8332157e)
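
The difference between the two checks described above can be illustrated with a toy model. This is a minimal sketch, not the CrushWrapper code: the `ToyCrush` class, its fields and its method names are hypothetical stand-ins, and it assumes only that a name can be registered for an item id without the item being linked into any bucket.

```python
class ToyCrush:
    """Hypothetical stand-in for a CRUSH map: item names plus bucket contents."""

    def __init__(self):
        self.names = {}    # item id -> name
        self.buckets = {}  # bucket id -> list of child item ids

    def item_exists(self, item):
        # Only checks that a name is defined for the item.
        return item in self.names

    def search_item_exists(self, item):
        # Looks for an actual instance of the item inside some bucket.
        return any(item in children for children in self.buckets.values())


crush = ToyCrush()
crush.names[-1] = "default"
crush.buckets[-1] = []
crush.names[0] = "osd.0"  # name defined, but the item is not in any bucket yet
```

Here `crush.item_exists(0)` is true while `crush.search_item_exists(0)` is false, which is the "name exists but item does not" case from the commit title.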

10 years agocrush/builder: prevent bucket weight underflow on item removal
Sage Weil [Sat, 22 Nov 2014 01:47:56 +0000 (17:47 -0800)]
crush/builder: prevent bucket weight underflow on item removal

It is possible to set a bucket weight that is not the sum of the item
weights if you manually modify/build the CRUSH map.  Protect against any
underflow on the bucket weight when removing items.

Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 8c87e9502142d5b4a282b94f929ae776a49be1dc)
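
The underflow being guarded against can be sketched as follows. This is an illustration under stated assumptions, not the crush/builder code: it assumes weights are unsigned 32-bit values (emulated here with a mask), and both function names are hypothetical.

```python
U32_MASK = 0xFFFFFFFF


def remove_item_weight_unchecked(bucket_weight, item_weight):
    # Emulates unsigned 32-bit subtraction, which wraps around on underflow.
    return (bucket_weight - item_weight) & U32_MASK


def remove_item_weight_checked(bucket_weight, item_weight):
    # Clamp at zero instead of wrapping when the bucket weight is smaller
    # than the weight of the item being removed.
    if item_weight > bucket_weight:
        return 0
    return bucket_weight - item_weight
```

With a manually edited bucket weight of 0x10000 and an item weight of 0x20000, the unchecked version wraps to a very large weight, while the checked version yields 0.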

10 years agocrush/CrushWrapper: fix _search_item_exists
Sage Weil [Sat, 22 Nov 2014 01:37:03 +0000 (17:37 -0800)]
crush/CrushWrapper: fix _search_item_exists

Reported-by: Pawel Sadowski <ceph@sadziu.pl>
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit eeadd60714d908a3a033aeb7fd542c511e63122b)

10 years agoMerge pull request #3124 from ceph/wip-10194-firefly
Sage Weil [Fri, 12 Dec 2014 14:19:50 +0000 (06:19 -0800)]
Merge pull request #3124 from ceph/wip-10194-firefly

rgw: optionally call FCGX_Free() on the fcgi connection

Reviewed-by: Sage Weil <sage@redhat.com>
10 years agoCall Rados.shutdown() explicitly before exit 3169/head
Dan Mick [Wed, 10 Dec 2014 21:19:53 +0000 (13:19 -0800)]
Call Rados.shutdown() explicitly before exit

This is mostly a demonstration of good behavior, as the resources will
be reclaimed on exit anyway.

Signed-off-by: Dan Mick <dan.mick@redhat.com>
(cherry picked from commit b038e8fbf9103cc42a4cde734b3ee601af6019ea)

10 years agorados.py: remove Rados.__del__(); it just causes problems
Dan Mick [Wed, 10 Dec 2014 21:19:16 +0000 (13:19 -0800)]
rados.py: remove Rados.__del__(); it just causes problems

Recent versions of Python contain a change to thread shutdown that
causes ceph to hang on exit; see http://bugs.python.org/issue21963.
As it turns out, this is relatively easy to avoid by not spawning
threads on exit, as Rados.__del__() will certainly do by calling
shutdown(); I suspect, but haven't proven, that the problem is
that shutdown() tries to start() a threading.Thread() that never
makes it all the way back to signal start().

Also add a PendingReleaseNote and extra doc comments to clarify.

Fixes: #8797
Signed-off-by: Dan Mick <dan.mick@redhat.com>
(cherry picked from commit 5ba9b8f21f8010c59dd84a0ef2acfec99e4b048f)

Conflicts:
PendingReleaseNotes
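
The calling pattern these two commits point toward (explicit shutdown instead of relying on `__del__`) might look like the sketch below. `FakeCluster` is a hypothetical stand-in so the example runs without a cluster; it is not the rados.py class, and only the `shutdown()` name is taken from the commit messages.

```python
class FakeCluster:
    """Hypothetical stand-in for a cluster handle with connect()/shutdown()."""

    def __init__(self):
        self.state = "configuring"

    def connect(self):
        self.state = "connected"

    def shutdown(self):
        # Safe to call more than once.
        if self.state != "shutdown":
            self.state = "shutdown"


def run(cluster):
    cluster.connect()
    try:
        return "work done while " + cluster.state
    finally:
        # Explicit teardown while the interpreter is still fully alive,
        # rather than relying on __del__ at interpreter exit.
        cluster.shutdown()


cluster = FakeCluster()
result = run(cluster)
```

The `try`/`finally` keeps the teardown on the normal control path, so no thread has to be started during interpreter exit.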

10 years agorgw: optionally call FCGX_Free() on the fcgi connection 3124/head
Yehuda Sadeh [Wed, 26 Nov 2014 23:18:07 +0000 (15:18 -0800)]
rgw: optionally call FCGX_Free() on the fcgi connection

Fixes: #10194
A new configurable controls this behavior. This forces disconnection of
the fcgi connection when done with the request.

Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
10 years agoMerge pull request #3109 from ceph/firefly-10263
Gregory Farnum [Mon, 8 Dec 2014 23:02:52 +0000 (15:02 -0800)]
Merge pull request #3109 from ceph/firefly-10263

mds: store backtrace for straydir

Reviewed-by: Greg Farnum <gfarnum@redhat.com>
10 years agoMerge pull request #3009 from dachary/wip-10018-primary-erasure-code-hinfo-firefly
Samuel Just [Mon, 8 Dec 2014 21:19:44 +0000 (13:19 -0800)]
Merge pull request #3009 from dachary/wip-10018-primary-erasure-code-hinfo-firefly

osd: deep scrub must not abort if hinfo is missing (firefly)

Reviewed-by: Samuel Just <sjust@redhat.com>
10 years agomds: store backtrace for straydir 3109/head
Yan, Zheng [Fri, 7 Nov 2014 03:38:37 +0000 (11:38 +0800)]
mds: store backtrace for straydir

Backport: giant, firefly, emperor, dumpling
Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit 0d89db5d3e5ae5d552d4058a88a4e186748ab1d2)

10 years agoMerge pull request #3089 from dachary/wip-10063-hobject-shard-firefly
Sage Weil [Sat, 6 Dec 2014 19:06:02 +0000 (11:06 -0800)]
Merge pull request #3089 from dachary/wip-10063-hobject-shard-firefly

common: do not omit shard when ghobject NO_GEN is set (firefly)

10 years agoMerge pull request #2480 from dachary/wip-9420-erasure-code-non-regression-firefly
Sage Weil [Sat, 6 Dec 2014 18:59:44 +0000 (10:59 -0800)]
Merge pull request #2480 from dachary/wip-9420-erasure-code-non-regression-firefly

erasure-code: store and compare encoded contents (firefly)

10 years agoMerge pull request #3096 from dachary/wip-9785-dmcrypt-keys-permissions-firefly
Sage Weil [Sat, 6 Dec 2014 01:33:05 +0000 (17:33 -0800)]
Merge pull request #3096 from dachary/wip-9785-dmcrypt-keys-permissions-firefly

ceph-disk: dmcrypt file permissions (firefly)

10 years agoceph-disk: dmcrypt file permissions 3096/head
Loic Dachary [Thu, 4 Dec 2014 21:21:32 +0000 (22:21 +0100)]
ceph-disk: dmcrypt file permissions

The directory in which key files are stored for dmcrypt must be 700 and
the file 600.

http://tracker.ceph.com/issues/9785 Fixes: #9785

Signed-off-by: Loic Dachary <ldachary@redhat.com>
(cherry picked from commit 58682d1776ab1fd4daddd887d921ca9cc312bf50)
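
As an illustration of the permission rule above (directory 700, key file 600), here is a minimal sketch. It is not the ceph-disk implementation: `write_key`, the directory name and the file name are hypothetical, and the example writes into a temporary directory.

```python
import os
import stat
import tempfile


def write_key(key_dir, name, key_bytes):
    """Create key_dir with mode 700 and write the key file with mode 600."""
    os.makedirs(key_dir, exist_ok=True)
    os.chmod(key_dir, stat.S_IRWXU)  # 0o700, independent of the umask
    path = os.path.join(key_dir, name)
    # Create with mode 0o600 so the file is never world-readable, even briefly.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(key_bytes)
    # Enforce 0o600 even if the file already existed with other permissions.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return path


base = tempfile.mkdtemp()
key_path = write_key(os.path.join(base, "dmcrypt-keys"), "example-uuid", b"\x00" * 16)
```

Calling `os.chmod` explicitly after creation means the result does not depend on the process umask or on a pre-existing file.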

10 years agoMerge pull request #3086 from dachary/wip-10125-radosgw-init-firefly
Sage Weil [Fri, 5 Dec 2014 17:04:00 +0000 (09:04 -0800)]
Merge pull request #3086 from dachary/wip-10125-radosgw-init-firefly

rgw: run radosgw as apache with systemd (firefly)

10 years agocommon: do not omit shard when ghobject NO_GEN is set 3089/head
Loic Dachary [Fri, 14 Nov 2014 00:16:10 +0000 (01:16 +0100)]
common: do not omit shard when ghobject NO_GEN is set

Do not silence the display of shard_id when generation is NO_GEN.
The JSON representation of erasure coded objects used by
ceph_objectstore_tool needs the shard_id to find the file containing the chunk.

Minimal testing is added to ceph_objectstore_tool.py

http://tracker.ceph.com/issues/10063 Fixes: #10063

Signed-off-by: Loic Dachary <ldachary@redhat.com>
(cherry picked from commit dcf09aed121f566221f539106d10283a09f15cf5)

Conflicts:
src/test/ceph_objectstore_tool.py

10 years agorgw: run radosgw as apache with systemd 3086/head
Loic Dachary [Tue, 2 Dec 2014 17:10:48 +0000 (18:10 +0100)]
rgw: run radosgw as apache with systemd

Same as sysv.

http://tracker.ceph.com/issues/10125 Fixes: #10125

Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 7b621f4abf63456272dec3449aa108c89504a7a5)

Conflicts:
src/init-radosgw.sysv

10 years agoMerge pull request #3078 from ceph/wip-10030-firefly
Josh Durgin [Thu, 4 Dec 2014 19:32:18 +0000 (11:32 -0800)]
Merge pull request #3078 from ceph/wip-10030-firefly

librbd: don't close an already closed parent image upon failure

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
10 years agoMerge pull request #3063 from ceph/wip-10123-firefly
Sage Weil [Thu, 4 Dec 2014 07:01:44 +0000 (23:01 -0800)]
Merge pull request #3063 from ceph/wip-10123-firefly

librbd: protect list_children from invalid child pool IoCtxs

Reviewed-by: Sage Weil <sage@redhat.com>
10 years agoReplicatedPG: don't move on to the next snap immediately
Samuel Just [Tue, 23 Sep 2014 22:52:08 +0000 (15:52 -0700)]
ReplicatedPG: don't move on to the next snap immediately

If we have a bunch of trimmed snaps for which we have no
objects, we'll spin for a long time.  Instead, requeue.

Fixes: #9487
Backport: dumpling, firefly, giant
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit c17ac03a50da523f250eb6394c89cc7e93cb4659)

10 years agoosd: initialize purged_snap on backfill start; restart backfill if change
Sage Weil [Tue, 23 Sep 2014 23:21:33 +0000 (16:21 -0700)]
osd: initialize purged_snap on backfill start; restart backfill if change

If we backfill a PG to a new OSD, we currently neglect to initialize
purged_snaps.  As a result, the first time the snaptrimmer runs it has to
churn through every deleted snap for all time, and to make matters worse
does so in one go with the PG lock held.  This leads to badness on any
cluster with a significant number of removed snaps that experiences
backfill.

Resolve this by initializing purged_snaps when we finish backfill.  The
backfill itself will clear out any stray snaps and ensure the object set
is in sync with purged_snaps.  Note that purged_snaps on the primary
that is driving backfill will not change during this period as the
snaptrimmer is not scheduled unless the PG is clean (which it won't be
during backfill).

If we by chance interrupt backfill, go clean with other OSDs,
purge snaps, and then let this OSD rejoin, we will either restart
backfill (non-contiguous log) or the log will include the result of
the snap trim (the events that remove the trimmed snap).

Fixes: #9487
Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 255b430a87201c7d0cf8f10a3c1e62cbe8dd2d93)

10 years agolibrbd: don't close an already closed parent image upon failure 3078/head
Jason Dillaman [Thu, 6 Nov 2014 10:01:38 +0000 (05:01 -0500)]
librbd: don't close an already closed parent image upon failure

If librbd is not able to open a child's parent image, it will
incorrectly close the parent image twice, resulting in a crash.

Fixes: #10030
Backport: firefly, giant
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 61ebfebd59b61ffdc203dfeca01ee1a02315133e)

10 years agolibrbd: protect list_children from invalid child pool IoCtxs 3063/head
Jason Dillaman [Tue, 18 Nov 2014 02:49:26 +0000 (21:49 -0500)]
librbd: protect list_children from invalid child pool IoCtxs

While listing child images, don't ignore error codes returned
from librados when creating an IoCtx. This will prevent seg
faults from occurring when an invalid IoCtx is used.

Fixes: #10123
Backport: giant, firefly, dumpling
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 0d350b6817d7905908a4e432cd359ca1d36bab50)

10 years agoMerge pull request #3014 from dachary/wip-9665-ceph-disk-partprobe-firefly
Sage Weil [Sun, 30 Nov 2014 18:12:04 +0000 (10:12 -0800)]
Merge pull request #3014 from dachary/wip-9665-ceph-disk-partprobe-firefly

ceph disk zap must call partprobe

10 years agoceph-disk: use update_partition in prepare_dev and main_prepare 3014/head
Loic Dachary [Fri, 10 Oct 2014 08:26:31 +0000 (10:26 +0200)]
ceph-disk: use update_partition in prepare_dev and main_prepare

In the case of prepare_dev the partx alternative was missing and is now
added, because update_partition does it.

http://tracker.ceph.com/issues/9721 Fixes: #9721

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
(cherry picked from commit 23e71b1ee816c0ec8bd65891998657c46e364fbe)

Conflicts:
src/ceph-disk

10 years agoceph-disk: run partprobe after zap
Loic Dachary [Thu, 9 Oct 2014 16:52:17 +0000 (18:52 +0200)]
ceph-disk: run partprobe after zap

Not running partprobe after zapping a device can lead to the following:

* ceph-disk prepare /dev/loop2
* links are created in /dev/disk/by-partuuid
* ceph-disk zap /dev/loop2
* links are not removed from /dev/disk/by-partuuid
* ceph-disk prepare /dev/loop2
* some links are not created in /dev/disk/by-partuuid

This is assuming there is a bug in the way udev events are handled by
the operating system.

http://tracker.ceph.com/issues/9665 Fixes: #9665

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
(cherry picked from commit fed3b06c47a5ef22cb3514c7647544120086d1e7)

10 years agoceph-disk: encapsulate partprobe / partx calls
Loic Dachary [Fri, 10 Oct 2014 08:23:34 +0000 (10:23 +0200)]
ceph-disk: encapsulate partprobe / partx calls

Add the update_partition function to reduce code duplication.
The action is made an argument, although it is currently always -a,
because it will be -d when deleting a partition.

Use the update_partition function in prepare_journal_dev

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
(cherry picked from commit 922a15ea6865ef915bbdec2597433da6792c1cb2)

Conflicts:
src/ceph-disk
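
The idea of wrapping the partprobe / partx calls in one helper with an action argument can be sketched as below. This is not the ceph-disk function: `update_partition_command` and its `use_partx` parameter are hypothetical, and the sketch only builds the argument list instead of running anything. It assumes the usual `partx -a` (add) and `partx -d` (delete) options.

```python
def update_partition_command(action, dev, use_partx):
    """Return the argv that would refresh the kernel's view of dev.

    action is '-a' when adding a partition and '-d' when deleting one;
    it only matters for partx, since partprobe takes no such flag.
    """
    if action not in ("-a", "-d"):
        raise ValueError("action must be '-a' or '-d'")
    if use_partx:
        return ["partx", action, dev]
    return ["partprobe", dev]
```

A caller would pass the returned list to whatever command runner it already uses. How the real code chooses between the two tools is not stated in the commit message.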

10 years agoosd: deep scrub must not abort if hinfo is missing 3009/head
Loic Dachary [Thu, 6 Nov 2014 16:11:20 +0000 (17:11 +0100)]
osd: deep scrub must not abort if hinfo is missing

Instead it should set read_error.

http://tracker.ceph.com/issues/10018 Fixes: #10018

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
(cherry picked from commit 9d84d2e8309d26e39ca849a75166d2d7f2dec9ea)

10 years agoerasure-code: add corpus verification to make check 2480/head
Loic Dachary [Tue, 23 Sep 2014 09:38:09 +0000 (11:38 +0200)]
erasure-code: add corpus verification to make check

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
10 years agoerasure-code: workunit to check for encoding regression
Loic Dachary [Sat, 13 Sep 2014 11:36:09 +0000 (13:36 +0200)]
erasure-code: workunit to check for encoding regression

Clone the archive of encoded objects and decode all archived objects, up
to and including the current ceph version.

http://tracker.ceph.com/issues/9420 Refs: #9420

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
10 years agoerasure-code: store and compare encoded contents
Loic Dachary [Sat, 13 Sep 2014 08:16:31 +0000 (10:16 +0200)]
erasure-code: store and compare encoded contents

Introduce ceph_erasure_code_non_regression to check and compare how an
erasure code plugin encodes and decodes content with a given set of
parameters. For instance:

./ceph_erasure_code_non_regression \
      --plugin jerasure \
      --parameter technique=reed_sol_van \
      --parameter k=2 \
      --parameter m=2 \
      --stripe-width 3181 \
      --create \
      --check

Will create an encoded object (--create) and store it into a directory
along with the chunks, one chunk per file. The directory name is derived
from the parameters. The content of the object is a random pattern of 31
bytes repeated to fill the object size specified with --stripe-width.

The check function (--check) reads the object back from the file,
encodes it and compares the result with the content of the chunks read
from the files. It also attempts to recover from one or two erasures.

Chunks encoded by a given version of Ceph are expected to be encoded
exactly in the same way by all Ceph versions going forward.

http://tracker.ceph.com/issues/9420 Refs: #9420

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
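
The object content described above (a random 31-byte pattern repeated to fill the size given by --stripe-width) can be sketched like this. It is an illustration only: `build_payload` is a hypothetical name, and truncating the final repetition to fit the size is an assumption, since the commit message does not say how the tail is handled.

```python
import os


def build_payload(stripe_width, pattern_len=31):
    """Repeat a random pattern_len-byte pattern to fill stripe_width bytes."""
    pattern = os.urandom(pattern_len)
    repeats = -(-stripe_width // pattern_len)  # ceiling division
    return pattern, (pattern * repeats)[:stripe_width]


pattern, payload = build_payload(3181)
```

With a stripe width of 3181, the pattern repeats 102 full times and the last 19 bytes are the start of one more repetition.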
10 years agoMerge pull request #2961 from ceph/wip-10114-firefly
Loic Dachary [Wed, 19 Nov 2014 01:45:26 +0000 (02:45 +0100)]
Merge pull request #2961 from ceph/wip-10114-firefly

Add annotation to all assembly files to turn off stack-execute bit

Reviewed-by: Loic Dachary <ldachary@redhat.com>
10 years agoAdd annotation to all assembly files to turn off stack-execute bit 2961/head
Dan Mick [Sat, 15 Nov 2014 01:59:57 +0000 (17:59 -0800)]
Add annotation to all assembly files to turn off stack-execute bit

See discussion in http://tracker.ceph.com/issues/10114

Building with these changes allows output from readelf like this:

 $ readelf -lW src/.libs/librados.so.2 | grep GNU_STACK
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000
0x000000 RW  0x8

(note the absence of 'X' in 'RW')

Fixes: #10114
Signed-off-by: Dan Mick <dan.mick@redhat.com>
(cherry picked from commit 06a245a9845c0c126fb3106b41b2fd2bc4bc4df3)
(not-yet-present-in-firefly files in isa-l manually removed)
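
A check like the one done by eye above could be scripted. The sketch below parses a single GNU_STACK line as printed by `readelf -lW`, assuming the column order Type, Offset, VirtAddr, PhysAddr, FileSiz, MemSiz, Flg, Align. The function name is hypothetical. readelf normally prints the execute flag as `E`; the sketch also accepts `X`, since that is how the note above refers to it.

```python
def stack_is_executable(program_header_line):
    """Return True if a GNU_STACK program header line has the execute flag."""
    fields = program_header_line.split()
    if not fields or fields[0] != "GNU_STACK":
        raise ValueError("not a GNU_STACK program header line")
    # The flags column may be split into several tokens (for example "R E"),
    # so join everything between MemSiz (index 5) and the trailing Align.
    flags = "".join(fields[6:-1])
    return "E" in flags or "X" in flags


sample = ("GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 "
          "0x000000 0x000000 RW  0x8")
```

For the sample line from the commit message, `stack_is_executable(sample)` is false.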

10 years agoMerge pull request #2760 from ceph/wip-9835-firefly
Samuel Just [Thu, 13 Nov 2014 18:36:12 +0000 (10:36 -0800)]
Merge pull request #2760 from ceph/wip-9835-firefly

osd: fix erasure hung op bug (9835)

Reviewed-by: Samuel Just <sjust@redhat.com>
10 years agoosd: use OSDMap helper to tell if ops are misdirected 2760/head
Samuel Just [Wed, 5 Nov 2014 20:12:14 +0000 (12:12 -0800)]
osd: use OSDMap helper to tell if ops are misdirected

calc_pg_role doesn't actually take into account primary affinity.

Fixes: #9835
Signed-off-by: Samuel Just <sam.just@inktank.com>
10 years agoosd: discard rank > 0 ops on erasure pools
Sage Weil [Mon, 20 Oct 2014 20:55:33 +0000 (13:55 -0700)]
osd: discard rank > 0 ops on erasure pools

Erasure pools do not support read from replica, so we should drop
any rank > 0 requests.

This fixes a bug where an erasure pool maps to [1,2,3], temporarily maps
to [-1,2,3], sends a request to osd.2, and then remaps back to [1,2,3].
Because the 0 shard never appears on osd.2, the request sits in the
waiting_for_pg map indefinitely and causes slow request warnings.
This problem does not come up on replicated pools because all instances of
the PG are created equal.

Fix by only considering role == 0 for erasure pools as a correct mapping.

Fixes: #9835
Signed-off-by: Sage Weil <sage@redhat.com>
10 years agoosd/OSDMap: add osd_is_valid_op_target()
Sage Weil [Thu, 13 Nov 2014 01:04:35 +0000 (17:04 -0800)]
osd/OSDMap: add osd_is_valid_op_target()

Helper to check whether an osd is a given op target for a pg.  This
assumes that for EC we always send ops to the primary, while for
replicated we may target any replica.

Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 89c02637914ac7332e9dbdbfefc2049b2b6c127d)
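
The rule stated above (EC ops go only to the primary, replicated ops may go to any replica) can be written down as a small predicate. This is a sketch of the rule as described, not the OSDMap method: the function name and its parameters are hypothetical, and the primary is passed in explicitly rather than derived from the acting set.

```python
def is_valid_op_target(osd, primary, acting, erasure_coded):
    """Return True if osd may receive client ops for a PG."""
    if osd == primary:
        return True
    if erasure_coded:
        # Erasure coded pools only accept ops on the primary.
        return False
    # Replicated pools may target any member of the acting set.
    return osd in acting
```

For an acting set of [1, 2, 3] with primary 1, osd 2 is a valid target on a replicated pool but not on an erasure coded pool.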

10 years agoqa: allow small allocation diffs for exported rbds
Josh Durgin [Wed, 12 Nov 2014 02:16:02 +0000 (18:16 -0800)]
qa: allow small allocation diffs for exported rbds

The local filesystem may behave slightly differently. This isn't
foolproof, but seems to be reliable enough on rhel7 rootfs, where
exact comparison was failing.

Fixes: #10002
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
(cherry picked from commit e94d3c11edb9c9cbcf108463fdff8404df79be33)

10 years agoosd: fix map advance limit to handle map gaps
Sage Weil [Sun, 25 May 2014 15:38:38 +0000 (08:38 -0700)]
osd: fix map advance limit to handle map gaps

The recent change in cf25bdf6b0090379903981fe8cee5ea75efd7ba0 would stop
advancing after some number of epochs, but did not take into consideration
the possibility that there are missing maps.  In that case, it is impossible
to advance past the gap.

Fix this by increasing the max epoch as we go so that we can always get
beyond the gap.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 1e0a82fd55dede473c0af32924f4bcb5bb697a2b)

10 years agoMerge pull request #2880 from ceph/wip-10025-firefly
Gregory Farnum [Fri, 7 Nov 2014 22:10:18 +0000 (14:10 -0800)]
Merge pull request #2880 from ceph/wip-10025-firefly

#10025/firefly -- tools: fix MDS journal import

Reviewed-by: Greg Farnum <greg@inktank.com>
10 years agotools: fix MDS journal import 2880/head
John Spray [Fri, 7 Nov 2014 11:34:43 +0000 (11:34 +0000)]
tools: fix MDS journal import

Previously it only worked on fresh filesystems which
hadn't been trimmed yet, and resulted in an invalid
trimmed_pos when expire_pos wasn't on an object
boundary.

Fixes: #10025
Signed-off-by: John Spray <john.spray@redhat.com>
(cherry picked from commit fb29e71f9a97c12354045ad2e128156e503be696)

10 years agoMerge remote-tracking branch 'origin/wip-sam-firefly-backports' into firefly
Samuel Just [Thu, 6 Nov 2014 18:37:42 +0000 (10:37 -0800)]
Merge remote-tracking branch 'origin/wip-sam-firefly-backports' into firefly

10 years agoMerge pull request #2737 from ceph/wip-9629-firefly
Samuel Just [Thu, 6 Nov 2014 18:30:20 +0000 (10:30 -0800)]
Merge pull request #2737 from ceph/wip-9629-firefly

osd: do not clone/preserve snapdir on cache-evict (firefly backport)

Reviewed-by: Samuel Just <sjust@redhat.com>
10 years agoMerge pull request #2657 from ceph/wip-9053-9301-firefly
Samuel Just [Thu, 6 Nov 2014 18:26:21 +0000 (10:26 -0800)]
Merge pull request #2657 from ceph/wip-9053-9301-firefly

mon: backport two paxos fixes to firefly

Reviewed-by: Joao Luis <joao@redhat.com>
10 years agoMerge pull request #2656 from ceph/wip-9502-firefly
Samuel Just [Thu, 6 Nov 2014 18:21:12 +0000 (10:21 -0800)]
Merge pull request #2656 from ceph/wip-9502-firefly

mon: backport mon disk full check to firefly

Reviewed-by: Samuel Just <sjust@redhat.com>
10 years agoMerge pull request #2764 from ceph/wip-9851
Samuel Just [Thu, 6 Nov 2014 18:18:27 +0000 (10:18 -0800)]
Merge pull request #2764 from ceph/wip-9851

osd: bring FileJournal in sync with giant

Reviewed-by: Samuel Just <sjust@redhat.com>
10 years agoMerge pull request #2776 from ceph/wip-9675.firefly
Samuel Just [Thu, 6 Nov 2014 18:12:17 +0000 (10:12 -0800)]
Merge pull request #2776 from ceph/wip-9675.firefly

CrushWrapper: pick a ruleset same as rule_id

Reviewed-by: Samuel Just <sam.just@inktank.com>
10 years agoceph-disk: mount xfs with inode64 by default
Sage Weil [Mon, 15 Sep 2014 22:29:08 +0000 (15:29 -0700)]
ceph-disk: mount xfs with inode64 by default

We did this forever ago with mkcephfs, but ceph-disk didn't.  Note that for
modern XFS this option is obsolete, but for older kernels it was not the
default.

Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 11496399ef318498c11e551f139d96db52d3309c)

10 years agorgw: set length for keystone token validation request
Yehuda Sadeh [Thu, 9 Oct 2014 17:20:27 +0000 (10:20 -0700)]
rgw: set length for keystone token validation request

Fixes: #7796
Backport: giant, firefly
We need to set the content length on this request, as the server might not
handle a chunked request (even though we don't send anything).

Tested-by: Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
(cherry picked from commit 3dd4ccad7fe97fc16a3ee4130549b48600bc485c)

10 years agorgw: subuser creation fixes
Yehuda Sadeh [Tue, 19 Aug 2014 20:15:46 +0000 (13:15 -0700)]
rgw: subuser creation fixes

Fixes: #8587
There were a couple of issues. First, when trying to identify whether a
swift user exists, we weren't using the correct swift id. Second, we
relied on the gen_access flag in the swift case, where it doesn't really
need to apply.

Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
(cherry picked from commit 1441ffe8103f03c6b2f625f37adbb2e1cfec66bb)

10 years agoMerge pull request #2847 from dachary/wip-9752-past-intervals-firefly
Sage Weil [Fri, 31 Oct 2014 15:35:50 +0000 (08:35 -0700)]
Merge pull request #2847 from dachary/wip-9752-past-intervals-firefly

osd: past_interval display bug on acting

10 years agoosd: past_interval display bug on acting 2847/head
Loic Dachary [Thu, 30 Oct 2014 23:49:21 +0000 (00:49 +0100)]
osd: past_interval display bug on acting

The acting array was incorrectly including the primary and up_primary.

http://tracker.ceph.com/issues/9752 Fixes: #9752

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
(cherry picked from commit c5f8d6eded52da451fdd1d807bd4700221e4c41c)

10 years agoMerge pull request #2840 from ceph/firefly-9869
Yan, Zheng [Fri, 31 Oct 2014 00:01:24 +0000 (17:01 -0700)]
Merge pull request #2840 from ceph/firefly-9869

Backport "client: cast m->get_client_tid() to compare to 16-bit Inode::flushing_cap_tid"

10 years agoclient: cast m->get_client_tid() to compare to 16-bit Inode::flushing_cap_tid 2840/head
Greg Farnum [Thu, 23 Oct 2014 00:16:31 +0000 (17:16 -0700)]
client: cast m->get_client_tid() to compare to 16-bit Inode::flushing_cap_tid

m->get_client_tid() is 64 bits (as it should be), but Inode::flushing_cap_tid
is only 16 bits. 16 bits should be plenty to let the cap flush updates
pipeline appropriately, but we need to cast in the proper direction when
comparing these differently-sized versions. So downcast the 64-bit one
to 16 bits.

Fixes: #9869
Backport: giant, firefly, dumpling

Signed-off-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit a5184cf46a6e867287e24aeb731634828467cd98)
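
The direction of the cast matters because widening the 16-bit value never matches once the 64-bit tid has grown past 16 bits. The sketch below emulates the fixed-width types with a mask. The function name is hypothetical and this is not the client code.

```python
MASK16 = 0xFFFF


def flush_tid_matches(client_tid, flushing_cap_tid):
    """Compare a 64-bit tid with a 16-bit one by truncating the wider value."""
    return (client_tid & MASK16) == flushing_cap_tid


tid64 = 0x12345         # does not fit in 16 bits
tid16 = tid64 & MASK16  # what a 16-bit field would hold: 0x2345
```

A plain comparison `tid64 == tid16` is false here, while `flush_tid_matches(tid64, tid16)` is true.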

10 years agoReplicatedPG: cancel cb on blacklisted watcher
Samuel Just [Thu, 11 Sep 2014 20:46:51 +0000 (13:46 -0700)]
ReplicatedPG: cancel cb on blacklisted watcher

Fixes: #8315
Backport: firefly
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 16bd45777166c29c433af3b59254a7169e512d98)

10 years agoReplicatedPG::on_removal: clear rollback info
Samuel Just [Sun, 21 Sep 2014 17:19:43 +0000 (10:19 -0700)]
ReplicatedPG::on_removal: clear rollback info

Fixes: #9293
Backport: firefly
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 544b8c7ffb4af01765b87239f2d7ab88479ee779)

10 years agoMerge remote-tracking branch 'origin/wip-9574' into wip-sam-firefly-backports
Samuel Just [Thu, 30 Oct 2014 20:48:18 +0000 (13:48 -0700)]
Merge remote-tracking branch 'origin/wip-9574' into wip-sam-firefly-backports

10 years agoPG: release backfill reservations if a backfill peer rejects
Samuel Just [Mon, 29 Sep 2014 22:01:25 +0000 (15:01 -0700)]
PG: release backfill reservations if a backfill peer rejects

Also, the full peer will wait until the rejection from the primary
to do a state transition.

Fixes: #9626
Backport: giant, firefly, dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 624aaf2a4ea9950153a89ff921e2adce683a6f51)

10 years agoMerge remote-tracking branch 'origin/wip-9113' into wip-sam-firefly-backports
Samuel Just [Thu, 30 Oct 2014 20:47:22 +0000 (13:47 -0700)]
Merge remote-tracking branch 'origin/wip-9113' into wip-sam-firefly-backports

10 years agoosd/osd_types: consider CRUSH_ITEM_NONE in check_new_interval() min_size check
Sage Weil [Sun, 12 Oct 2014 17:05:51 +0000 (10:05 -0700)]
osd/osd_types: consider CRUSH_ITEM_NONE in check_new_interval() min_size check

Fixes: #9718
Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit d947050c82a511f91c98e1c76e48ffa9e187eee7)

Conflicts:
src/osd/osd_types.cc

10 years agoPG:: reset_interval_flush and in set_last_peering_reset
Samuel Just [Mon, 20 Oct 2014 21:10:58 +0000 (14:10 -0700)]
PG:: reset_interval_flush and in set_last_peering_reset

If we have a change in the prior set, but not in the up/acting set, we go back
through Reset in order to reset peering state.  Previously, we would reset
last_peering_reset in the Reset constructor.  This did not, however, reset the
flush_interval, which caused the eventual flush event to be ignored and the
peering messages to not be sent.

Instead, we will always reset_interval_flush if we are actually changing the
last_peering_reset value.

Fixes: #9821
Backport: firefly
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit d9ff3a6b789c5b9c77aefa3751bd808f5d7b8ca7)

10 years agoReplicatedPG: writeout hit_set object with correct prior_version
Samuel Just [Thu, 23 Oct 2014 16:11:28 +0000 (09:11 -0700)]
ReplicatedPG: writeout hit_set object with correct prior_version

Fixes: #9875
Backport: giant, firefly
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 1a3ad307f1a4c0a956d6fd31d13f01ffe411a09d)

10 years agoMerge pull request #2717 from dachary/wip-9747-ceph-spec-firefly
Sage Weil [Mon, 27 Oct 2014 03:37:52 +0000 (20:37 -0700)]
Merge pull request #2717 from dachary/wip-9747-ceph-spec-firefly

rpm: 95-ceph-osd-alt.rules is not needed for centos7 / rhel7 (firefly)

10 years agoCrushWrapper: pick a ruleset same as rule_id 2776/head
Xiaoxi Chen [Wed, 20 Aug 2014 07:35:44 +0000 (15:35 +0800)]
CrushWrapper: pick a ruleset same as rule_id

Originally, in the add_simple_ruleset function, the ruleset_id
is not reused but the rule_id is. So after some adds/removes
against rules, a newly created rule is likely to have
ruleset != rule_id.

We don't want this to happen because we are trying to hold the constraint
that ruleset == rule_id.

Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
(cherry picked from commit 78e84f34da83abf5a62ae97bb84ab70774b164a6)

Conflicts:
src/test/erasure-code/TestErasureCodeIsa.cc

Fixes: #9675
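
The constraint described above (ruleset == rule_id) can be illustrated with a toy model. This is only a sketch under stated assumptions: RuleSlot, ruleset_in_use and pick_rule_and_ruleset_id are hypothetical names invented for illustration, not the CrushWrapper API. The idea is to choose the lowest id that is free both as a rule slot and as a ruleset id, so the new rule can use the same number for both.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for one slot of a rule table (hypothetical, not CrushWrapper).
// The index of the slot in the vector plays the role of the rule_id.
struct RuleSlot {
  bool used;    // whether a rule currently occupies this slot
  int ruleset;  // ruleset id of that rule (meaningful only when used)
};

// True if any existing rule already carries this ruleset id.
static bool ruleset_in_use(const std::vector<RuleSlot>& rules, int id) {
  for (const RuleSlot& r : rules)
    if (r.used && r.ruleset == id) return true;
  return false;
}

// Return the lowest id that is free both as a rule_id and as a ruleset id.
// The loop terminates because only finitely many ids are taken.
int pick_rule_and_ruleset_id(const std::vector<RuleSlot>& rules) {
  for (int id = 0;; ++id) {
    const bool slot_free =
        static_cast<std::size_t>(id) >= rules.size() || !rules[id].used;
    if (slot_free && !ruleset_in_use(rules, id)) return id;
  }
}
```

Reusing a free slot only when its number is also unused as a ruleset id is what keeps the two ids equal after a series of adds and removes.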
10 years agoos/FileJournal: When dump journal, using correctly seq avoid misjudging journal corrupt. 2764/head
Ma Jianpeng [Mon, 21 Jul 2014 07:08:55 +0000 (15:08 +0800)]
os/FileJournal: When dump journal, using correctly seq avoid misjudging journal corrupt.

FileJournal::dump always used seq=0 as the last seq, which could cause it
to misjudge the journal as corrupt.

Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
(cherry picked from commit 5f65b4db6d1dad7c2c5a09eab42af63a82ea9e9b)

10 years agoos: io_event.res is the size written
Loic Dachary [Thu, 25 Sep 2014 23:15:53 +0000 (01:15 +0200)]
os: io_event.res is the size written

And not an error code to be converted with cpp_strerror()

Signed-off-by: Loic Dachary <loic-201408@dachary.org>
(cherry picked from commit 7827e0035e3350ad2d9230f27a1629545f53af5c)

10 years agoos/FileJournal: For journal-aio-mode, don't use aio when closing journal.
Ma Jianpeng [Thu, 21 Aug 2014 07:10:46 +0000 (15:10 +0800)]
os/FileJournal: For journal-aio-mode, don't use aio when closing journal.

In journal aio mode, write_finish_thread_entry may exit before
write_thread_entry when the journal is being closed. As a result, nothing
waits for the last aios to complete.
On some platforms, this leaves the journal header corrupted.
To avoid this, we don't use aio when closing the journal.

Fixes: #9073
Reported-by: Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
Tested-by: Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
(cherry picked from commit e870fd09ce846e5642db268c33bbe8e2e17ffef2)

10 years agoos/FileJournal: Only using aio then alloc the related resources.
Ma Jianpeng [Thu, 21 Aug 2014 13:07:51 +0000 (21:07 +0800)]
os/FileJournal: Only using aio then alloc the related resources.

If HAVE_LIBAIO is defined, the aio-related resources are allocated without
checking whether aio mode is in use. Allocate them only when aio is actually used.

Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
(cherry picked from commit a66a4931d5be9ee26c0983b3154fdbe37261a51c)

10 years agoos/FileJournal: Tune the judge logic for read_header.
Ma Jianpeng [Thu, 21 Aug 2014 07:49:44 +0000 (15:49 +0800)]
os/FileJournal: Tune the judge logic for read_header.

When reading the journal header, first check the result of
pread and only then do the decode operation.

Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
(cherry picked from commit c8e2b89cf6bc36a0ff29887b9e76cbbeceef9f8f)
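
The ordering described above (check the pread result, then decode) can be sketched as follows. This is a minimal illustration, not the FileJournal code: HeaderSketch, read_header_sketch and read_header_from_path are hypothetical names, the one-field header layout is invented, and treating a short read as -EIO is an assumption made to keep the example small.

```cpp
#include <cerrno>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical fixed-size header; the real journal header has more fields.
struct HeaderSketch {
  uint64_t start_seq;
};

// Read a header from offset 0 of fd. Returns 0 on success or a negative
// errno value. *out is written only after the read has been validated.
int read_header_sketch(int fd, HeaderSketch* out) {
  unsigned char buf[sizeof(HeaderSketch)];
  const ssize_t r = ::pread(fd, buf, sizeof(buf), 0);
  if (r < 0) return -errno;                              // pread itself failed
  if (static_cast<size_t>(r) < sizeof(buf)) return -EIO; // short read: nothing to decode
  std::memcpy(out, buf, sizeof(buf));                    // "decode" only validated bytes
  return 0;
}

// Convenience wrapper: open the path read-only, read the header, close.
int read_header_from_path(const char* path, HeaderSketch* out) {
  const int fd = ::open(path, O_RDONLY);
  if (fd < 0) return -errno;
  const int r = read_header_sketch(fd, out);
  ::close(fd);
  return r;
}
```

Because decoding happens last, a failed or short read leaves the caller's header untouched instead of filling it from an uninitialized buffer.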

10 years agoos/FileJournal: signal aio_cond even if seq is 0
Sage Weil [Wed, 20 Aug 2014 03:50:13 +0000 (20:50 -0700)]
os/FileJournal: signal aio_cond even if seq is 0

This can happen if we write a journal but no events.

Reported-by: Somnath Roy <somnath.roy@sandisk.com>
Reported-by: Ma, Jianpeng <jianpeng.ma@intel.com>
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 57778e2c577c1e1bbf9525232720a2994fa36abc)

10 years agoos/FileJournal: Update the journal header when closing journal
Ma Jianpeng [Wed, 23 Jul 2014 17:10:38 +0000 (10:10 -0700)]
os/FileJournal: Update the journal header when closing journal

When closing the journal, check must_write_header and update the
journal header if must_write_header is already set.
This reduces needless journal replay after restarting the osd.

Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
Reviewed-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 5bf472aefb7360a1fe17601b42e551df120badfb)

10 years agoRevert "os/FileJournal: stop aio completion thread *after* writer thread"
Sage Weil [Tue, 21 Oct 2014 13:53:36 +0000 (06:53 -0700)]
Revert "os/FileJournal: stop aio completion thread *after* writer thread"

This reverts commit 334631ae4641824b3df49245f36a8fd4b143bf3f.

10 years agoMerge pull request #2716 from ceph/wip-firefly-9419
Samuel Just [Fri, 17 Oct 2014 17:47:22 +0000 (10:47 -0700)]
Merge pull request #2716 from ceph/wip-firefly-9419

Backport fix for bug #9419

10 years agoMerge pull request #2724 from dachary/wip-9073-journal-aio-mode-firefly
Samuel Just [Fri, 17 Oct 2014 17:44:30 +0000 (10:44 -0700)]
Merge pull request #2724 from dachary/wip-9073-journal-aio-mode-firefly

os/FileJournal: stop aio completion thread *after* writer thread

10 years agoMerge pull request #2742 from ceph/firefly-unknown-locktype
Sage Weil [Fri, 17 Oct 2014 15:20:53 +0000 (08:20 -0700)]
Merge pull request #2742 from ceph/firefly-unknown-locktype

mds: reply -EOPNOTSUPP for unknown lock type

10 years agomds: reply -EOPNOTSUPP for unknown lock type 2742/head
Yan, Zheng [Tue, 14 Oct 2014 14:02:41 +0000 (22:02 +0800)]
mds: reply -EOPNOTSUPP for unknown lock type

Signed-off-by: Yan, Zheng <zyan@redhat.com>
(cherry picked from commit 675392335c53ff7879031fb9184e4f35bcc90fe2)

10 years agoosd/ReplicatedPG: do not clone or preserve snapdir on cache_evict 2737/head
Sage Weil [Sun, 21 Sep 2014 22:56:18 +0000 (15:56 -0700)]
osd/ReplicatedPG: do not clone or preserve snapdir on cache_evict

If we cache_evict a head in a cache pool, we need to prevent
make_writeable() from cloning the head and finish_ctx() from
preserving the snapdir object.

Fixes: #8629
Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit ce8eefca13008a9cce3aedd67b11537145e1fd77)

10 years agoceph_test_rados_api_tier: add EvictSnap2 test case
Sage Weil [Sun, 21 Sep 2014 22:54:15 +0000 (15:54 -0700)]
ceph_test_rados_api_tier: add EvictSnap2 test case

Verify an evict doesn't create a snapdir object.  Reproduces #8629

Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 398c74eacb1ce4e573aef0d24718a5925d90272b)

10 years agoMerge pull request #2734 from ceph/wip-firefly-undump
Sage Weil [Thu, 16 Oct 2014 13:09:51 +0000 (06:09 -0700)]
Merge pull request #2734 from ceph/wip-firefly-undump

mds: fix --undump-journal

Reviewed-by: Sage Weil <sage@redhat.com>
10 years agomds: fix --undump-journal 2734/head
John Spray [Thu, 16 Oct 2014 10:17:40 +0000 (11:17 +0100)]
mds: fix --undump-journal

This hadn't worked for a long time.  This is a fix
for firefly only, as this code was refactored in giant.

Signed-off-by: John Spray <john.spray@redhat.com>
10 years agoceph-mon: check fs stats just before preforking 2656/head
Joao Eduardo Luis [Tue, 23 Sep 2014 13:02:55 +0000 (14:02 +0100)]
ceph-mon: check fs stats just before preforking

Otherwise statfs may fail if mkfs hasn't been run yet or if the monitor
data directory does not exist.  There are checks to account for the mon
data dir not existing and we should wait for them to clear before we go
ahead and check the fs stats.

Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
(cherry picked from commit 7f71c11666b25e91dd612c58b4eda9ac0d4752f8)

Conflicts:
src/ceph_mon.cc

10 years agoceph_mon: check available storage space for mon data dir on start
Joao Eduardo Luis [Thu, 18 Sep 2014 15:53:43 +0000 (16:53 +0100)]
ceph_mon: check available storage space for mon data dir on start

error out if available storage space is below 'mon data avail crit'

Fixes: #9502
Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
(cherry picked from commit 2da1a2914ac7df18ce842b0aac728fffb5bed2b6)

Conflicts:
src/ceph_mon.cc

10 years agomon: DataHealthService: use get_fs_stats() instead
Joao Eduardo Luis [Thu, 18 Sep 2014 15:52:34 +0000 (16:52 +0100)]
mon: DataHealthService: use get_fs_stats() instead

and relieve the DataStats struct from clutter by using
ceph_data_stats_t instead of multiple fields.

Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
(cherry picked from commit 9996d446988768658db751a7843b13cf3d194213)

Conflicts:
src/mon/DataHealthService.cc

10 years agocommon: util: add get_fs_stats() function
Joao Eduardo Luis [Thu, 18 Sep 2014 15:32:20 +0000 (16:32 +0100)]
common: util: add get_fs_stats() function

simplifies the task of obtaining available/used disk space, as well as
the used/available percentage.

Signed-off-by: Joao Eduardo Luis <joao@redhat.com>
(cherry picked from commit 3d74230d1c0fbfa15487e2a90ac60b883476e840)
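
A helper of the kind described above might look like the sketch below. It is an assumption-laden illustration rather than the function added by this commit: FsStatsSketch and fs_stats_sketch are hypothetical names, the field names are invented, and POSIX statvfs() is used here as one plausible way to obtain the numbers.

```cpp
#include <cerrno>
#include <cstdint>
#include <sys/statvfs.h>

// Hypothetical result struct; field names are illustrative only.
struct FsStatsSketch {
  uint64_t byte_total;  // size of the filesystem in bytes
  uint64_t byte_used;   // bytes in use (total minus free)
  uint64_t byte_avail;  // bytes available to unprivileged users
  int avail_percent;    // byte_avail as a percentage of byte_total
};

// Fill *out for the filesystem containing path.
// Returns 0 on success or a negative errno value.
int fs_stats_sketch(const char* path, FsStatsSketch* out) {
  struct statvfs st;
  if (::statvfs(path, &st) != 0) return -errno;

  const uint64_t frsize = st.f_frsize;  // fragment size: unit of the block counts
  const uint64_t blocks = st.f_blocks;
  const uint64_t bfree = st.f_bfree;
  const uint64_t bavail = st.f_bavail;

  out->byte_total = blocks * frsize;
  out->byte_used = (blocks - bfree) * frsize;
  out->byte_avail = bavail * frsize;
  // Compute the percentage on block counts and guard against empty filesystems.
  out->avail_percent = blocks == 0 ? 0 : static_cast<int>(bavail * 100 / blocks);
  return 0;
}
```

A caller could then compare avail_percent against a configured threshold in one place instead of repeating the arithmetic at every call site.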

10 years agoinclude/util.h: prevent multiple inclusion of header
Joao Eduardo Luis [Thu, 18 Sep 2014 15:25:44 +0000 (16:25 +0100)]
include/util.h: prevent multiple inclusion of header

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 76eff9503493312cb97e4a2f9236f4dbcbf931df)

10 years agomon: re-bootstrap if we get probed by a mon that is way ahead 2657/head
Sage Weil [Thu, 18 Sep 2014 21:23:36 +0000 (14:23 -0700)]
mon: re-bootstrap if we get probed by a mon that is way ahead

During bootstrap we verify that our paxos commits overlap with the other
mons we will form a quorum with.  If they do not, we do a sync.

However, it is possible we pass those checks, then fail to join a quorum
before the quorum moves ahead in time such that we no longer overlap.
Currently nothing kicks us back into a probing state to discover we need
to sync... we will just keep trying to call or join an election instead.

Fix this by jumping back to bootstrap if we get a probe that is ahead of
us.  Only do this from non probe or sync states as these will be common;
it is only the active and electing states that matter (and probably just
electing!).

Fixes: #9301
Backport: giant, firefly
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit c421b55e8e15ef04ca8aeb47f7d090375eaa8573)

10 years agomon/Paxos: fix off-by-one in last_ vs first_committed check
Sage Weil [Thu, 18 Sep 2014 21:11:24 +0000 (14:11 -0700)]
mon/Paxos: fix off-by-one in last_ vs first_committed check

peon last_committed + 1 == leader first_committed is okay.  Note that the
other check (where I clean up whitespace) gets this correct.

Fixes: #9301 (partly)
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit d81cd7f86695185dce31df76c33c9a02123f0e4a)
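
The boundary condition stated above can be written out as a small predicate. This is a sketch only; peon_can_catch_up is a hypothetical name and the function is not taken from mon/Paxos. It encodes the rule that a peon whose last_committed is exactly one below the leader's first_committed is still contiguous, because the leader can supply the very next version.

```cpp
#include <cstdint>

// True if the leader still holds the version that follows the peon's
// last commit, i.e. the peon can be caught up without a full sync.
// peon_last_committed + 1 == leader_first_committed is the okay edge case.
bool peon_can_catch_up(uint64_t peon_last_committed,
                       uint64_t leader_first_committed) {
  return peon_last_committed + 1 >= leader_first_committed;
}
```

An off-by-one version of this test would compare peon_last_committed directly with leader_first_committed and wrongly reject the adjacent case.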

10 years agomon/Paxos: share state and verify contiguity early in collect phase
Sage Weil [Wed, 13 Aug 2014 23:17:02 +0000 (16:17 -0700)]
mon/Paxos: share state and verify contiguity early in collect phase

We verify peons are contiguous and share new paxos states to catch peons
up at the end of the round.  Do this each time we (potentially) get new
states via a collect message.  This will allow peons to be pulled forward
and remain contiguous when they otherwise would not have been able to.
For example, if

  mon.0 (leader)  20..30
  mon.1 (peon)    15..25
  mon.2 (peon)    28..40

If we got mon.1 first and then mon.2 second, we would store the new txns
and then boot mon.1 out at the end because 15..25 is not contiguous with
28..40.  However, with this change, we share 26..30 to mon.1 when we get
the collect, and then 31..40 when we get mon.2's collect, pulling them
both into the final quorum.

It also breaks the 'catch-up' work into smaller pieces, which ought to
smooth out latency a bit.

Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit c54f1e4d66b22bad715ac17e9baa72ab93e48c46)
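
The mon.0/mon.1/mon.2 example above can be replayed with a toy model. This is a sketch under stated assumptions: Range, Share and LeaderSketch are hypothetical types, versions are plain integers, and trimming, proposals and message passing are left out entirely. On each collect reply the leader first learns any newer contiguous commits, then shares whatever each known peon is missing.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Range { uint64_t first; uint64_t last; };  // committed versions [first, last]
struct Share { std::string to; uint64_t first; uint64_t last; };  // versions sent to a peon

class LeaderSketch {
 public:
  explicit LeaderSketch(Range own) : own_(own) {}

  // Process one collect reply and return the shares sent as a result.
  std::vector<Share> handle_collect(const std::string& peon, Range r) {
    peon_last_[peon] = r.last;
    // Learn newer commits from the peon if they are contiguous with ours.
    if (r.last > own_.last && r.first <= own_.last + 1) own_.last = r.last;

    std::vector<Share> out;
    for (auto& p : peon_last_) {
      uint64_t& last = p.second;
      const bool behind = last < own_.last;
      const bool contiguous = last + 1 >= own_.first;  // we still hold last + 1
      if (behind && contiguous) {
        out.push_back(Share{p.first, last + 1, own_.last});
        last = own_.last;  // the peon is now caught up to us
      }
    }
    return out;
  }

  uint64_t last_committed() const { return own_.last; }

 private:
  Range own_;
  std::map<std::string, uint64_t> peon_last_;  // last version each peon holds
};
```

With a leader at 20..30, a collect from mon.1 at 15..25 yields a share of 26..30 to mon.1, and a later collect from mon.2 at 28..40 yields a share of 31..40 to mon.1, which matches the sequence in the commit message.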