Danny Al-Gaaf [Sun, 28 Jul 2013 21:25:58 +0000 (23:25 +0200)]
ceph_authtool.cc: update help/usage text
Added implemented but not listed commands to the help/usage text:
* -g shortcut for --gen-key
* -a shortcut for --add-key
* -u/--set-uid to set auid
* --gen-print-key
* --import-keyring
When posting an object it is possible to provide a key
name that refers to the original filename, however we
need to verify that in the end we don't end up with an
empty object name.
Gary Lowell [Thu, 22 Aug 2013 18:07:16 +0000 (11:07 -0700)]
ceph.spec.in: Don't invoke debug_package macro on centos.
If the redhat-rpm-config package is installed, the debuginfo rpms will
be built by default. The build will fail when the package is installed
and the specfile also invokes the macro.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Sage Weil [Sun, 28 Jul 2013 15:59:21 +0000 (08:59 -0700)]
osd: get initial full map after a map gap
If there is a gap in our map history, get the full range of maps that
the mon has. Make sure the first one is a full map.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit a6cd9fea50a4bd7048a222617a2bfe0680f7a969)
Sage Weil [Sun, 28 Jul 2013 15:55:38 +0000 (08:55 -0700)]
osd: fix off-by-one in map gap logic
If we have map 250, and monitor's first is 251, but sends 260, we can
request the intervening range.
Fixes: #5784 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit e24b50225c841a650d9303041bbe811e04bdd668)
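The boundary case above can be sketched as follows. This is a minimal illustration, not the OSD code: maps_to_request() and its arguments are hypothetical names, and epochs are modelled as plain integers.

```python
def maps_to_request(newest_have, mon_first, mon_sent):
    """Return (start, end, need_full) for the maps to ask the monitor for.

    Hypothetical helper: newest_have is our newest stored epoch, mon_first
    the oldest epoch the monitor still has, mon_sent the epoch it just sent.
    """
    if mon_sent <= newest_have + 1:
        return None  # no gap: the received map directly follows ours
    if mon_first <= newest_have + 1:
        # The monitor still holds every epoch after ours (this includes the
        # boundary case mon_first == newest_have + 1), so we can simply
        # request the intervening range.
        return (newest_have + 1, mon_sent - 1, False)
    # Real gap in our history: start at the monitor's oldest map and
    # require that first map to be a full map.
    return (mon_first, mon_sent - 1, True)
```

With the numbers from the message, maps_to_request(250, 251, 260) asks for 251 through 259 without needing a full map.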
Samuel Just [Mon, 22 Jul 2013 23:00:07 +0000 (16:00 -0700)]
OSD: tolerate holes in stored maps
We may have holes in stored maps during init_splits_between
and advance_pg. In either case, we should simply skip the
missing maps.
Fixes: #5677 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6951d2345a5d837c3b14103bd4d8f5ee4407c937)
Sage Weil [Wed, 21 Aug 2013 05:39:09 +0000 (22:39 -0700)]
ceph-disk: partprobe after creating journal partition
At least one user reports that a partprobe is needed after creating the
journal partition. It is not clear why sgdisk is not doing it, but this
fixes ceph-disk for them, and should be harmless for other users.
Sage Weil [Fri, 16 Aug 2013 04:48:06 +0000 (21:48 -0700)]
osdc/ObjectCacher: do not merge rx buffers
We do not try to merge rx buffers currently. Make that explicit and
documented in the code that it is not supported. (Otherwise the
last_read_tid values will get lost and read results won't get applied
to the cache properly.)
Sage Weil [Fri, 16 Aug 2013 04:47:18 +0000 (21:47 -0700)]
osdc/ObjectCacher: match reads with their original rx buffers
Consider a sequence like:
1- start read on 100~200
100~200 state rx
2- truncate to 200
100~100 state rx
3- start read on 200~200
100~100 state rx
200~200 state rx
4- get 100~200 read result
This currently fails when processing the second 200~200 bufferhead (it is too big). The
larger issue, though, is that we should not be looking at this data at
all; it has been truncated away.
Fix this by marking each rx buffer with the read request that is sent to
fill it, and only fill it from that read request. Then the first reply
will fill the first 100~100 extent but not touch the other extent; the
second read will do that.
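A minimal sketch of the matching rule, assuming a simplified buffer record. RxBuffer and apply_read_reply() are illustrative names, not the ObjectCacher API; only last_read_tid is a name taken from this series.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RxBuffer:
    start: int                    # offset of the extent
    length: int                   # length of the extent
    last_read_tid: int            # id of the read that was sent to fill it
    data: Optional[bytes] = None  # None while still in state rx


def apply_read_reply(buffers, tid, start, payload):
    """Fill only the rx buffers that were issued by read `tid`."""
    for bh in buffers:
        if bh.last_read_tid != tid or bh.data is not None:
            continue  # belongs to another read, or is already filled
        off = bh.start - start
        if off < 0 or off >= len(payload):
            continue  # this reply does not cover the buffer
        bh.data = payload[off:off + bh.length]
```

In the sequence above, the reply to read 1 (100~200) fills the 100~100 buffer and leaves the 200~200 buffer for read 2.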
Sage Weil [Thu, 22 Aug 2013 22:54:48 +0000 (15:54 -0700)]
mon/Paxos: fix another uncommitted value corner case
It is possible that we begin the paxos recovery with an uncommitted
value for, say, commit 100. During last/collect we discover 100 has been
committed already. But also, another node provides an uncommitted value
for 101 with the same pn. Currently, we refuse to learn it, because the
pn is not strictly greater than our current uncommitted pn... even though it is
the next last_committed+1 value that we need.
There are two possible fixes here:
- make this a >= as we can accept newer values from the same pn.
- discard our uncommitted value metadata when we commit the value.
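Both options can be sketched on a toy model of the recovery state. This is an illustration under assumptions: PaxosRecovery and its methods are hypothetical, and the sketch applies both options together although either alone would accept the value in this scenario.

```python
class PaxosRecovery:
    """Toy model of the collect-phase state (not Ceph's Paxos class)."""

    def __init__(self, last_committed, uncommitted_v, uncommitted_pn):
        self.last_committed = last_committed
        self.uncommitted_v = uncommitted_v
        self.uncommitted_pn = uncommitted_pn
        self.uncommitted_value = None

    def commit(self, version):
        self.last_committed = version
        if self.uncommitted_v <= version:
            # Second option: forget the metadata of a value once committed.
            self.uncommitted_v = 0
            self.uncommitted_pn = 0
            self.uncommitted_value = None

    def learn_uncommitted(self, version, pn, value):
        if version != self.last_committed + 1:
            return False  # only the next value is interesting
        # First option: >= rather than >, so a newer value proposed under
        # the same pn is still accepted.
        if pn >= self.uncommitted_pn:
            self.uncommitted_v = version
            self.uncommitted_pn = pn
            self.uncommitted_value = value
            return True
        return False
```

Starting with an uncommitted value for 100 under pn 500 (an arbitrary number), committing 100 and then offering 101 under pn 500 is accepted.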
Sandon Van Ness [Fri, 23 Aug 2013 02:44:40 +0000 (19:44 -0700)]
QA: Compile fsstress if missing on machine.
Some distros lack ltp-kernel packages, and all we need is
fsstress. This just modifies the shell script to download/compile
fsstress from source and copy it to the right location if it doesn't
currently exist where it is expected. It is a very small/quick
compile, and currently only SLES and Debian do not have it already.
Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Sandon Van Ness <sandon@inktank.com>
Sage Weil [Fri, 9 Aug 2013 19:40:34 +0000 (12:40 -0700)]
json_spirit: remove unused typedef
In file included from json_spirit/json_spirit_writer.cpp:7:0:
json_spirit/json_spirit_writer_template.h: In function 'String_type json_spirit::non_printable_to_string(unsigned int)':
json_spirit/json_spirit_writer_template.h:37:50: warning: typedef 'Char_type' locally defined but not used [-Wunused-local-typedefs]
typedef typename String_type::value_type Char_type;
Josh Durgin [Wed, 21 Aug 2013 21:28:49 +0000 (14:28 -0700)]
objecter: resend unfinished lingers when osdmap is no longer paused
Plain Ops that haven't finished yet need to be resent if the osdmap
transitions from full or paused to unpaused. If these Ops are
triggered by LingerOps, they will be cancelled instead (since
should_resend = false), but the LingerOps that triggered them will not
be resent.
Fix this by checking the registered flag for all linger ops, and
resending any of them that aren't paused anymore.
Fixes: #6070 Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage.weil@inktank.com>
(cherry picked from commit 38a0ca66a79af4b541e6322467ae3a8a4483cc72)
Samuel Just [Thu, 8 Aug 2013 22:12:46 +0000 (15:12 -0700)]
RadosClient: shutdown monclient after dropping lock
Otherwise, the monclient shutdown may deadlock waiting
on a context trying to take the RadosClient lock.
Fixes: #5897 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0aacd10e2557c55021b5be72ddf39b9cea916be4)
Sage Weil [Fri, 16 Aug 2013 17:52:02 +0000 (10:52 -0700)]
mon/OSDMonitor: make 'osd pool rmsnap ...' not racy/crashy
NOTE: This is a manual backport of d90683fdeda15b726dcf0a7cab7006c31e99f14.
Due to all kinds of collateral changes in the mon the original patch
doesn't cleanly apply.
Ensure that the snap does in fact exist before we try to remove it. This
avoids a crash where we get two dup rmsnap requests (due to thrashing, or
a reconnect, or something), the committed (p) value does have the snap, but
the uncommitted (pp) does not. This fails the old test such that we try
to remove it from pp again, and assert.
Restructure the flow so that it is easier to distinguish the committed
short return from the uncommitted return (which must still wait for the
commit).
0> 2013-07-16 14:21:27.189060 7fdf301e9700 -1 osd/osd_types.cc: In function 'void pg_pool_t::remove_snap(snapid_t)' thread 7fdf301e9700 time 2013-07-16 14:21:27.187095
osd/osd_types.cc: 662: FAILED assert(snaps.count(s))
ceph version 0.66-602-gcd39d8a (cd39d8a6727d81b889869e98f5869e4227b50720)
1: (pg_pool_t::remove_snap(snapid_t)+0x6d) [0x7ad6dd]
2: (OSDMonitor::prepare_command(MMonCommand*)+0x6407) [0x5c1517]
3: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x1fb) [0x5c41ab]
4: (PaxosService::dispatch(PaxosServiceMessage*)+0x937) [0x598c87]
5: (Monitor::handle_command(MMonCommand*)+0xe56) [0x56ec36]
6: (Monitor::_ms_dispatch(Message*)+0xd1d) [0x5719ad]
7: (Monitor::handle_forward(MForward*)+0x821) [0x572831]
8: (Monitor::_ms_dispatch(Message*)+0xe44) [0x571ad4]
9: (Monitor::ms_dispatch(Message*)+0x32) [0x588c52]
10: (DispatchQueue::entry()+0x549) [0x7cf1d9]
11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7060fd]
12: (()+0x7e9a) [0x7fdf35165e9a]
13: (clone()+0x6d) [0x7fdf334fcccd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
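The restructured flow can be sketched roughly like this. It is a simplified model, not the OSDMonitor code: prepare_rmsnap() and its return strings are hypothetical, and the committed (p) and uncommitted (pp) pool states are reduced to sets of snap names.

```python
def prepare_rmsnap(committed_snaps, pending_snaps, name):
    """Decide what to do with an 'osd pool rmsnap' request.

    committed_snaps models the committed (p) value, pending_snaps the
    uncommitted (pp) value.
    """
    if name not in committed_snaps:
        return "done"      # committed short return: nothing left to remove
    if name not in pending_snaps:
        # Duplicate request: the removal is already pending. Do not remove
        # again (that is what tripped the assert); wait for the commit.
        return "wait"
    pending_snaps.remove(name)
    return "proposed"
```

A second, duplicate request now takes the "wait" path instead of calling remove on a snap that pp no longer has.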
Sage Weil [Tue, 13 Aug 2013 19:52:41 +0000 (12:52 -0700)]
librados: fix async aio completion wakeup
For aio flush, we register a wait on the most recent write. The write
completion code, however, was *only* waking the waiter if they were waiting
on that write, without regard to previous writes (completed or not).
For example, we might have 6 and 7 outstanding and wait on 7. If they
finish in order all is well, but if 7 finishes first we do the flush
completion early. Similarly, if we
Josh Durgin [Tue, 13 Aug 2013 02:17:09 +0000 (19:17 -0700)]
librados: fix locking for AioCompletionImpl refcounting
Add an already-locked helper so that C_Aio{Safe,Complete} can
increment the reference count when their caller holds the
lock. C_AioCompleteAndSafe's caller is not holding the lock, so call
regular get() to ensure no racing updates can occur.
This eliminates all direct manipulations of AioCompletionImpl->ref,
and makes the necessary locking clear.
The only place C_AioCompleteAndSafe is used is in handling
aio_flush_async(). This could cause a missing completion.
Refs: #5919 Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Tested-by: Oliver Francke <Oliver.Francke@filoo.de>
(cherry picked from commit 7a52e2ff5025754f3040eff3fc52d4893cafc389)
Sage Weil [Tue, 25 Jun 2013 20:16:45 +0000 (13:16 -0700)]
osd: fix race when queuing recovery ops
Previously we would sample how many ops to start under the lock, drop it,
and start that many. This is racy because multiple threads can jump in
and we start too many ops. Instead, claim as many slots as we can and
release them back later if we do not end up using them.
Take care to re-wake the work-queue since we are releasing more resources
for wq use.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 01d3e094823d716be0b39e15323c2506c6f0cc3b)
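A sketch of the claim-then-release pattern, assuming a simple counter guarded by a lock. RecoverySlots is an illustrative class, not the OSD's data structure.

```python
import threading


class RecoverySlots:
    """Toy model of the recovery op budget."""

    def __init__(self, max_active):
        self.lock = threading.Lock()
        self.max_active = max_active
        self.active = 0

    def claim(self):
        """Reserve every free slot while holding the lock."""
        with self.lock:
            n = max(0, self.max_active - self.active)
            self.active += n  # claimed before the lock is dropped
            return n

    def release(self, n):
        """Give back slots that were claimed but not used."""
        with self.lock:
            self.active -= n
        # A real caller would re-wake the work queue here, since slots
        # have become available again.
```

Because the slots are counted as active before the lock is dropped, a second thread calling claim() concurrently gets 0 rather than the same budget again.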
Sage Weil [Mon, 24 Jun 2013 23:37:29 +0000 (16:37 -0700)]
osd: tolerate racing threads starting recovery ops
We sample the (max - active) recovery ops to know how many to start, but
do not hold the lock over the full duration, such that it is possible to
start too many ops. This isn't problematic except that our condition
checks for being == max but not beyond it, and we will continue to start
recovery ops when we shouldn't. Fix this by adjusting the conditional
to be <=.
Reported-by: Stefan Priebe <s.priebe@profihost.ag> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 3791a1e55828ba541f9d3e8e3df0da8e79c375f9)
Sage Weil [Sat, 10 Aug 2013 01:02:32 +0000 (18:02 -0700)]
ceph-disk: fix mount options passed to move_mount
Commit 6cbe0f021f62b3ebd5f68fcc01a12fde6f08cff5 added a mount_options but
in certain cases it may be blank. Fill in with the defaults, just as we
do in mount().
Backport: cuttlefish Reviewed-by: Dan Mick <dan.mick@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit cb50b5a7f1ab2d4e7fdad623a0e7769000755a70)
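The fallback amounts to something like the following. The table and helper name here are assumptions for illustration; the actual defaults live in ceph-disk and may differ.

```python
# Illustrative defaults only; not copied from ceph-disk.
DEFAULT_MOUNT_OPTIONS = {
    "xfs": "noatime",
    "ext4": "noatime,user_xattr",
}


def effective_mount_options(fstype, options):
    """Return the given options, or the per-filesystem default if blank."""
    if not options:
        return DEFAULT_MOUNT_OPTIONS.get(fstype, "")
    return options
```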
Yehuda Sadeh [Mon, 12 Aug 2013 17:05:44 +0000 (10:05 -0700)]
rgw: fix multi delete
Fixes: #5931
Backport: bobtail, cuttlefish
Fix a bad check, where we compare the wrong field. Instead of
comparing the ret code to 0, we compare the string value to 0
which generates implicit casting, hence the crash.
Danny Al-Gaaf [Tue, 23 Jul 2013 19:56:09 +0000 (21:56 +0200)]
ceph.spec.in: obsolete ceph-libs only on the affected distro
The ceph-libs package existed only on Red Hat based distros,
there was e.g. never such a package on SUSE. Therefore: make
sure the 'Obsoletes' is only set on these affected distros.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
common: pick_addresses: fix bug with observer class that triggered #5205
The Observer class we defined to observe conf changes and thus avoid
triggering #5205 (as fixed by eb86eebe1ba42f04b46f7c3e3419b83eb6fe7f9a),
was always returning the same const static array, which would lead us to
always populate the observer's list with an observer for 'public_addr'.
This would of course become a problem when trying to obtain the observer
for 'cluster_addr' during md_config_t::set_val() -- thus triggering the
same assert as initially reported on #5205.
Backport: cuttlefish Fixes: #5205 Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 7ed6de9dd7aed59f3c5dd93e012cf080bcc36d8a)
Samuel Just [Fri, 2 Aug 2013 18:58:52 +0000 (11:58 -0700)]
PG: set !flushed in Reset()
Otherwise, we might serve a pull before we start_flush in the
ReplicaActive constructor.
Fixes: #5799 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9e7d6d547e0e8a6db6ba611882afa9bf74ea0195)
Sage Weil [Fri, 26 Jul 2013 20:58:46 +0000 (13:58 -0700)]
osd: load all classes on startup
This avoids creating a wide window between when ceph-osd is started and
when a request arrives needing a class and it is loaded. In particular,
upgrading the packages in that window may cause linkage errors (if the
class API has changed, for example).
Fixes: #5752 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit c24e652d8c5e693498814ebe38c6adbec079ea36)
Fixes: #5766
Backport: cuttlefish Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c562b72e703f671127d0ea2173f6a6907c825cd1)
Sage Weil [Thu, 25 Jul 2013 18:10:53 +0000 (11:10 -0700)]
mon/Paxos: share uncommitted value when leader is/was behind
If the leader has an older lc than we do, and we are sharing states to
bring them up to date, we still want to also share our uncommitted value.
This particular case was broken by b26b7f6e, which was only contemplating
the case where the leader was ahead of us or at the same point as us, but
not the case where the leader was behind. Note that the call to
share_state() a few lines up will bring them fully up to date, so
after they receive and store_state() for this message they will be at the
same lc as we are.
Fixes: #5750
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 05b6c7e8645081f405c616735238ae89602d3cc6)
Samuel Just [Wed, 24 Jul 2013 01:04:40 +0000 (18:04 -0700)]
HashIndex: reset attr upon split or merge completion
A replay of an in progress merge or split might make
our counts unreliable.
Fixes: #5723 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0dc3efdd885377a07987d868af5bb7a38245c90b)
Samuel Just [Wed, 24 Jul 2013 00:34:25 +0000 (17:34 -0700)]
test/filestore/store_test: add test for 5723
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 37a4c4af54879512429bb114285bcb4c7c3488d5)
Samuel Just [Tue, 23 Jul 2013 20:51:26 +0000 (13:51 -0700)]
FileStore::_collection_rename: fix global replay guard
If the operation is being replayed, we might have already
performed the rename; skip it. Also, we must set the
collection replay guard only after we have done the
rename.
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 870c474c5348831fcb13797d164f49682918fb30)
Samuel Just [Mon, 22 Jul 2013 20:46:10 +0000 (13:46 -0700)]
PGLog::rewind_divergent_log: unindex only works from tail, index() instead
Fixes: #5714 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6957dbc75cc2577652b542aa3eae69f03060cb63)
The original patch covered the same code in PGLog.cc.
Sage Weil [Tue, 16 Jul 2013 20:13:46 +0000 (13:13 -0700)]
msg/Pipe: hold pipe_lock during important parts of accept()
Previously we did not bother with locking for accept() because we were
not visible to any other threads. However, we need to close accepting
Pipes from mark_down_all(), which means we need to handle interference.
Fix up the locking so that we hold pipe_lock when looking at Pipe state
and verify that we are still in the ACCEPTING state any time we retake
the lock.
Sage Weil [Tue, 16 Jul 2013 23:25:28 +0000 (16:25 -0700)]
msgr: adjust nonce on rebind()
We can have a situation where:
- we have a pipe to a peer
- pipe goes to standby (on peer)
- we rebind to a new port
- ....
- we rebind again to the same old port
- we connect to peer
and get reattached to the ancient pipe from two instances back. Avoid that
by picking a new nonce each time we rebind.
Add 1,000,000 each time so that the port is still legible in the printed
output.
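The nonce bump is simple arithmetic. A sketch, with next_nonce() as a hypothetical helper name:

```python
NONCE_STEP = 1_000_000  # the increment named in the message


def next_nonce(nonce):
    """Pick a fresh nonce for a rebind.

    Adding a round step keeps the low six decimal digits of the original
    value readable in printed addresses.
    """
    return nonce + NONCE_STEP
```

Two rebinds starting from a nonce of 4242 give 1004242 and then 2004242, so the original value is still visible in the output while each instance stays distinct.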
Sage Weil [Tue, 16 Jul 2013 17:09:02 +0000 (10:09 -0700)]
msg/Pipe: avoid creating empty out_q entry
We need to maintain the invariant that all sub queues in out_q are never
empty. Fix discard_requeued_up_to() to avoid creating an entry unless we
know it is already present.
This bug leads to an incorrect reconnect attempt when
- we accept a pipe (lossless peer)
- they send some stuff, maybe
- fault
- we initiate reconnect, even though we have nothing queued
In particular, we shouldn't reconnect because we aren't checking for
resets, and the fact that our out_seq is 0 while the peer's might be
something else entirely will trigger asserts later.
This fixes at least one source of #5626, and possibly #5517.
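The same pitfall exists with Python's defaultdict, which (like operator[] on a std::map) creates an entry on lookup. A sketch of the guarded version follows; messages are modelled as bare sequence numbers, and removing a queue once it drains is a choice made here to keep the invariant, not something taken from the patch.

```python
from collections import defaultdict, deque


def discard_requeued_up_to(out_q, priority, seq):
    """Drop requeued messages with sequence number <= seq.

    out_q maps priority -> deque of sequence numbers and must never
    contain an empty deque.
    """
    if priority not in out_q:
        return  # do not index out_q here: that would create an empty entry
    q = out_q[priority]
    while q and q[0] <= seq:
        q.popleft()
    if not q:
        del out_q[priority]  # keep the "no empty sub queue" invariant
```

Calling it on an out_q with nothing queued leaves out_q empty, so a later "do we have anything to send?" check does not see a phantom queue.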
Sage Weil [Fri, 12 Jul 2013 23:21:24 +0000 (16:21 -0700)]
msg/Pipe: fix RECONNECT_SEQ behavior
Calling handle_ack() here has no effect because we have already
spliced sent messages back into our out queue. Instead, pull them out
of there and discard. Add a few assertions along the way.
Sage Weil [Mon, 17 Jun 2013 21:14:02 +0000 (14:14 -0700)]
msg/Pipe: goto fail_unlocked on early failures in accept()
Instead of duplicating an incomplete cleanup sequence (that does not
clear_pipe()), goto fail_unlocked and do the cleanup in a generic way.
s/rc/r/ while we are here.
Sage Weil [Mon, 17 Jun 2013 19:47:11 +0000 (12:47 -0700)]
msgr: clear_pipe inside pipe_lock on mark_down_all
Observed a segfault in rebind -> mark_down_all -> clear_pipe -> put that
may have been due to a racing thread clearing the connection_state pointer.
Do the clear_pipe() call under the protection of pipe_lock, as we do in
all other contexts.
Samuel Just [Fri, 19 Jul 2013 02:26:02 +0000 (19:26 -0700)]
ReplicatedPG: track temp collection contents, clear during on_change
We also assert in on_flushed() that the temp collection is actually
empty.
Fixes: #5670 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 47516d9c4b7f023f3a16e166749fa7b1c7b3b24c)
Samuel Just [Fri, 19 Jul 2013 02:25:14 +0000 (19:25 -0700)]
PG, ReplicatedPG: pass a transaction down to ReplicatedPG::on_change
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9f56a7b8bfcb63cb4fbbc0c9b8ff01de9e518c57)
Samuel Just [Wed, 17 Jul 2013 22:04:10 +0000 (15:04 -0700)]
PG: start flush on primary only after we process the master log
Once we start serving reads, stray objects must have already
been removed. Therefore, we have to flush all operations
up to the transaction writing out the authoritative log.
On replicas, we flush in Stray() if we will not eventually
be activated and in ReplicaActive if we are in the acting
set. This way a replica won't serve a replica read until
the store is consistent.
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b41f1ba48563d1d3fd17c2f62d10103b5d63f305)
Samuel Just [Wed, 17 Jul 2013 19:51:19 +0000 (12:51 -0700)]
ReplicatedPG: replace clean_up_local with a debug check
Stray objects should have been cleaned up in the merge_log
transactions. Only on the primary have those operations
necessarily been flushed at activate().
Fixes: #5084 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 278c7b59228f614addf830cb0afff4988c9bc8cb)
Samuel Just [Thu, 18 Jul 2013 17:12:17 +0000 (10:12 -0700)]
FileStore: add global replay guard for split, collection_rename
In the event of a split or collection rename, we need to ensure that
we don't replay any operations on objects within those collections
prior to that point. Thus, we mark a global replay guard on the
collection after doing a syncfs and make sure to check that in
_check_replay_guard() for all object operations.
Fixes: #5154 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f3f92fe21061e21c8b259df5ef283a61782a44db)
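The guard check can be pictured as below. This is a deliberately simplified model: should_replay() is a hypothetical function, guards are plain journal sequence numbers, and the real _check_replay_guard() has more states than a boolean.

```python
def should_replay(op_seq, object_guard=None, collection_guard=None):
    """Return False if a journaled op predates a replay guard.

    op_seq is the journal sequence number of the op being replayed;
    a guard of None means no guard has been set.
    """
    for guard in (collection_guard, object_guard):
        if guard is not None and op_seq < guard:
            return False  # the op happened before the guarded point
    return True
```

With a collection guard at sequence 10, an object op journaled at sequence 5 is skipped on replay even if the object itself carries no guard.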
Samuel Just [Mon, 15 Jul 2013 20:44:20 +0000 (13:44 -0700)]
OSD: add config option for peering_wq batch size
Large peering_wq batch sizes may excessively delay
peering messages resulting in unreasonably long
peering. This may speed up peering.
Backport: cuttlefish
Related: #5084 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 39e5a2a406b77fa82e9a78c267b679d49927e3c3)
Sage Weil [Tue, 18 Jun 2013 03:54:15 +0000 (20:54 -0700)]
ceph-disk: make is_held() smarter about full disks
Handle the case where the device is a full disk. Make the partition
check a bit more robust (don't make assumptions about naming aside from
the device being a prefix of the partition).
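The prefix rule could be sketched as follows. partitions_of() is a hypothetical helper, and the candidate names are assumed to come from the disk's own listing, so the prefix test only has to separate the disk from its partitions.

```python
def partitions_of(disk, block_names):
    """Return the names that look like partitions of `disk`.

    The only naming assumption is that the disk name is a prefix of the
    partition name (sda -> sda1, cciss!c0d0 -> cciss!c0d0p1).
    """
    return sorted(
        name for name in block_names
        if name != disk and name.startswith(disk)
    )
```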
Sage Weil [Wed, 24 Jul 2013 18:55:42 +0000 (11:55 -0700)]
mon/OSDMonitor: search for latest full osdmap if record version is missing
In 97462a3213e5e15812c79afc0f54d697b6c498b1 we tried to search for a
recent full osdmap but were looking at the wrong key. If full_0 was
present we could record that the latest full map was last_committed even
though it wasn't present. This is fixed in 76cd7ac1c, but we need to
compensate for when get_version_latest_full() gives us a bad version
number by repeating the search.
Fixes: #5737 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit c2131d4047156aa2964581c9dbd93846382a07e7)
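The compensating search can be sketched against a dict standing in for the monitor store. The function name and the "full_<n>" key layout are assumptions for illustration.

```python
def latest_full_version(store, first_committed, last_committed, recorded=None):
    """Find the newest full osdmap version that is really in the store.

    store is a dict whose keys look like 'full_<version>'; recorded is
    the version the store claims is the latest full map, if any.
    """
    if recorded is not None and f"full_{recorded}" in store:
        return recorded  # the recorded version is trustworthy
    # The recorded version is missing or bad: repeat the search, newest first.
    for v in range(last_committed, first_committed - 1, -1):
        if f"full_{v}" in store:
            return v
    return 0  # no full map stored at all
```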
test: test_store_tool: global init before using LevelDBStore
Fixes a segfault
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a7a7d3fc8a2ba4a30ef136a32f2903d157b3e19a)
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 76cd7ac1c2094b34ad36bea89b2246fa90eb2f6d)
Sage Weil [Sun, 21 Jul 2013 15:48:18 +0000 (08:48 -0700)]
mon/Paxos: fix pn for uncommitted value during collect/last phase
During the collect/last exchange, peers share any uncommitted values
with the leader. They are supposed to also share the pn under which
that value was accepted, but were instead using the just-accepted pn
value. This effectively meant that we *always* took the uncommitted
value; if there were multiples, which one we accepted depended on what
order the LAST messages arrived, not which pn the values were generated
under.
The specific failure sequence I observed:
- collect
- learned uncommitted value for 262 from myself
- send collect with pn 901
- got last with pn 901 (incorrect) for 200 (old) from peer
- discard our own value, remember the other
- finish collect phase
- ignore old uncommitted value
Fix this by storing a pending_v and pending_pn value whenever we accept
a value. Use this to send an appropriate pn value in the LAST reply
so that the leader can make its decision about which uncommitted value
to accept based on accurate information. Also use it when we learn
the uncommitted value from ourselves.
We could probably be more clever about storing less information here,
for example by omitting pending_v and clearing pending_pn at the
appropriate point, but that would be more fragile. Similarly, we could
store a pn for *every* commit if we wanted to lay some groundwork for
having multiple uncommitted proposals in flight, but I don't want to
speculate about what is necessary or sufficient for a correct solution
there.
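A toy model of the fixed exchange, under assumptions: Peer and choose_uncommitted() are illustrative names, only pending_v and pending_pn come from the message, and the pn numbers other than 901 are made up.

```python
class Peer:
    """Toy peer that remembers the pn under which it accepted a value."""

    def __init__(self):
        self.pending_v = 0
        self.pending_pn = 0
        self.pending_value = None

    def accept(self, version, pn, value):
        self.pending_v, self.pending_pn, self.pending_value = version, pn, value

    def handle_collect(self, collect_pn):
        # The LAST reply carries pending_pn, not the just-accepted collect_pn.
        return (self.pending_v, self.pending_pn, self.pending_value)


def choose_uncommitted(replies):
    """Leader side: keep the uncommitted value accepted under the highest pn."""
    best = None
    for reply in replies:
        if reply[2] is None:
            continue  # this peer has no uncommitted value
        if best is None or reply[1] > best[1]:
            best = reply
    return best
```

With our own value for 262 accepted under pn 801 and a peer's old value for 200 accepted under pn 701, the leader keeps 262 regardless of the order in which the LAST messages arrive.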
Sage Weil [Mon, 22 Jul 2013 21:13:23 +0000 (14:13 -0700)]
mon/Paxos: only share uncommitted value if it is next
We may have an uncommitted value from our perspective (it is our lc + 1)
when the collector has a much larger lc (because we have been out for
the last few rounds). Only share an uncommitted value if it is in fact
the next value.
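On the peer side the rule reduces to one comparison. A sketch, with value_to_share() and its parameters as hypothetical names:

```python
def value_to_share(uncommitted_v, uncommitted_value, collector_last_committed):
    """Return our uncommitted value only if it is the collector's next one."""
    if uncommitted_value is None:
        return None
    if uncommitted_v != collector_last_committed + 1:
        return None  # stale from the collector's point of view
    return uncommitted_value
```

A peer that was out for several rounds may hold an uncommitted value for version 101 while the collector is already at lc 150; that value is not shared.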
Sage Weil [Tue, 23 Jul 2013 20:32:12 +0000 (13:32 -0700)]
mon/OSDMonitor: fix base case for 7fb3804fb workaround
After cluster creation, we have no full map stored and first_committed ==
1. In that case, there is no need for a full map, since we can get there
from OSDMap() and the incrementals.
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao@inktank.com>
(cherry picked from commit e807770784175b05130bba938113fdbf874f152e)
mon: OSDMonitor: work around a full version bug introduced in 7fb3804fb
In 7fb3804fb860dcd0340dd3f7c39eec4315f8e4b6 we moved the full version
stashing logic to the encode_trim_extra() function. However, we forgot
to update the osdmap's 'latest_full' key that should always point to
the latest osdmap full version. This eventually degenerated into a missing
full version after a trim. This patch works around this bug by looking
for the latest available full osdmap version in the store and updating
'latest_full' to its proper value.
Related-to: #5704
Backport: cuttlefish
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 97462a3213e5e15812c79afc0f54d697b6c498b1)
mon: OSDMonitor: update the osdmap's latest_full with the new full version
We used to do this on encode_full(), but since [1] we no longer rely on
PaxosService to manage the full maps for us. And we forgot to write down
the latest_full version to the store, leaving it in a truly outdated state.
Sage Weil [Thu, 18 Jul 2013 21:35:19 +0000 (14:35 -0700)]
mon: decline to scrub when paxos is not active
In f1ce8d7c955a2443111bf7d9e16b4c563d445712 we close a race between scrub
and paxos commit completion on the leader. The fix is nontrivial to
backport and probably not worthwhile; just avoid scrubbing at that time
for now.
Sage Weil [Tue, 16 Jul 2013 20:14:50 +0000 (13:14 -0700)]
ceph-disk: rely on /dev/disk/by-partuuid instead of special-casing journal symlinks
This was necessary when ceph-disk-udev didn't create the by-partuuid (and
other) symlinks for us, but now it is fragile and error-prone. (It also
appears to be broken on a certain customer RHEL VM.) See d7f7d613512fe39ec883e11d201793c75ee05db1.
Instead, just use the by-partuuid symlinks that we spent all that ugly
effort generating.
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 64379e701b3ed862c05f156539506d3382f77aa8)
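Resolving a journal by partition UUID then becomes a plain path join. A sketch; partuuid_path() is a hypothetical helper, and lowercasing the UUID is an assumption about how the symlinks are named.

```python
import posixpath

BY_PARTUUID_DIR = "/dev/disk/by-partuuid"


def partuuid_path(partuuid):
    """Return the by-partuuid symlink path for a partition UUID."""
    return posixpath.join(BY_PARTUUID_DIR, partuuid.lower())
```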
mon: Monitor: StoreConverter: clearer debug message on 'needs_conversion()'
The previous debug message outputted the function's name, as often our
functions do. This was however a source of bewilderment, as users would
see those in logs and think their stores would need conversion. Changing
this message is trivial enough and it will make ceph users happier log
readers.
Backport: cuttlefish Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ad1392f68170b391d11df0ce5523c2d1fb57f60e)
mon: Monitor: do not reopen MonitorDBStore during conversion
We already open the store in ceph_mon.cc, before we start the conversion.
Given we are unable to reproduce this every time a conversion is triggered,
we are led to believe that this causes a race in leveldb that will lead
to 'store.db/LOCK' being locked upon the open this patch removes.
Regardless, reopening the db here is pointless as we already did it when
we reach Monitor::StoreConverter::convert().
Fixes: #5640
Backport: cuttlefish
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 036e6739a4e873863bae3d7d00f310c015dfcdb3)