git.apps.os.sepia.ceph.com Git

mds: fix locking, use-after-free/race in handle_accept

We need to hold mds_lock here.

Normally the con also holds a reference, but an ill-timed connection reset
could drop it.

Fixes: #5883
Backport: dumpling, cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a0929955cb84fb8cfdeb551d6863e4955b8e2a71)

.gitignore: ignore test-driver

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit edf2c3449ec96d91d3d7ad01c50f7a79b7b2f7cc)

Conflicts:

.gitignore

fuse: fix warning when compiled against old fuse versions

client/fuse_ll.cc: In function 'void invalidate_cb(void*, vinodeno_t, int64_t, int64_t)':
warning: client/fuse_ll.cc:540: unused variable 'fino'

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9833e9dabe010e538cb98c51d79b6df58ce28f9e)

json_spirit: remove unused typedef

In file included from json_spirit/json_spirit_writer.cpp:7:0:
json_spirit/json_spirit_writer_template.h: In function 'String_type json_spirit::non_printable_to_string(unsigned int)':
json_spirit/json_spirit_writer_template.h:37:50: warning: typedef 'Char_type' locally defined but not used [-Wunused-local-typedefs]
typedef typename String_type::value_type Char_type;

(Also, ha ha, this file uses \r\n.)

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6abae35a3952e5b513895267711fea63ff3bad09)

gtest: add build-aux/test-driver to .gitignore

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c9cdd19d1cd88b84e8a867f5ab85cb51fdc6f8e4)

objecter: resend unfinished lingers when osdmap is no longer paused

Plain Ops that haven't finished yet need to be resent if the osdmap
transitions from full or paused to unpaused. If these Ops are
triggered by LingerOps, they will be cancelled instead (since
should_resend = false), but the LingerOps that triggered them will not
be resent.

Fix this by checking the registered flag for all linger ops, and
resending any of them that aren't paused anymore.

Fixes: #6070
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage.weil@inktank.com>
(cherry picked from commit 38a0ca66a79af4b541e6322467ae3a8a4483cc72)

v0.61.8

RadosClient: shutdown monclient after dropping lock

Otherwise, the monclient shutdown may deadlock waiting
on a context trying to take the RadosClient lock.

Fixes: #5897
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0aacd10e2557c55021b5be72ddf39b9cea916be4)

mon/OSDMonitor: make 'osd pool mksnap ...' not expose uncommitted state

[This is a backport of d1501938f5d07c067d908501fc5cfe3c857d7281]

We were returning success without waiting if the pending pool state had
the snap.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>

mon/OSDMonitor: make 'osd pool rmsnap ...' not racy/crashy

NOTE: This is a manual backport of d90683fdeda15b726dcf0a7cab7006c31e99f14.
Due to all kinds of collateral changes in the mon the original patch
doesn't cleanly apply.

Ensure that the snap does in fact exist before we try to remove it.  This
avoids a crash where a we get two dup rmsnap requests (due to thrashing, or
a reconnect, or something), the committed (p) value does have the snap, but
the uncommitted (pp) does not.  This fails the old test such that we try
to remove it from pp again, and assert.

Restructure the flow so that it is easier to distinguish the committed
short return from the uncommitted return (which must still wait for the
commit).

         0> 2013-07-16 14:21:27.189060 7fdf301e9700 -1 osd/osd_types.cc: In function 'void pg_pool_t::remove_snap(snapid_t)' thread 7fdf301e9700 time 2013-07-16 14:21:27.187095
    osd/osd_types.cc: 662: FAILED assert(snaps.count(s))

     ceph version 0.66-602-gcd39d8a (cd39d8a6727d81b889869e98f5869e4227b50720)
     1: (pg_pool_t::remove_snap(snapid_t)+0x6d) [0x7ad6dd]
     2: (OSDMonitor::prepare_command(MMonCommand*)+0x6407) [0x5c1517]
     3: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x1fb) [0x5c41ab]
     4: (PaxosService::dispatch(PaxosServiceMessage*)+0x937) [0x598c87]
     5: (Monitor::handle_command(MMonCommand*)+0xe56) [0x56ec36]
     6: (Monitor::_ms_dispatch(Message*)+0xd1d) [0x5719ad]
     7: (Monitor::handle_forward(MForward*)+0x821) [0x572831]
     8: (Monitor::_ms_dispatch(Message*)+0xe44) [0x571ad4]
     9: (Monitor::ms_dispatch(Message*)+0x32) [0x588c52]
     10: (DispatchQueue::entry()+0x549) [0x7cf1d9]
     11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7060fd]
     12: (()+0x7e9a) [0x7fdf35165e9a]
     13: (clone()+0x6d) [0x7fdf334fcccd]
     NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>

librados: fix async aio completion wakeup

For aio flush, we register a wait on the most recent write.  The write
completion code, however, was *only* waking the waiter if they were waiting
on that write, without regard to previous writes (completed or not).
For example, we might have 6 and 7 outstanding and wait on 7.  If they
finish in order all is well, but if 7 finishes first we do the flush
completion early.  Similarly, if we

- start 6
- start 7
- finish 7
- flush; wait on 7
- finish 6

we can hang forever.

Fix by doing any completions that are prior to the oldest pending write in
the aio write completion handler.

Refs: #5919

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Tested-by: Oliver Francke <Oliver.Francke@filoo.de>
(cherry picked from commit 16ed0b9af8bc08c7dabead1c1a7c1a22b1fb02fb)

librados: fix locking for AioCompletionImpl refcounting

Add an already-locked helper so that C_Aio{Safe,Complete} can
increment the reference count when their caller holds the
lock. C_AioCompleteAndSafe's caller is not holding the lock, so call
regular get() to ensure no racing updates can occur.

This eliminates all direct manipulations of AioCompletionImpl->ref,
and makes the necessary locking clear.

The only place C_AioCompleteAndSafe is used is in handling
aio_flush_async(). This could cause a missing completion.

Refs: #5919
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Tested-by: Oliver Francke <Oliver.Francke@filoo.de>
(cherry picked from commit 7a52e2ff5025754f3040eff3fc52d4893cafc389)

mon/Paxos: bootstrap peon too if monmap updates

If we get a monmap update, the leader bootstraps. Peons should do the
same.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit efe5b67bb700ef6218d9579abf43cc9ecf25ef52)

osd: fix race when queuing recovery ops

Previously we would sample how many ops to start under the lock, drop it,
and start that many. This is racy because multiple threads can jump in
and we start too many ops. Instead, claim as many slots as we can and
release them back later if we do not end up using them.

Take care to re-wake the work-queue since we are releasing more resources
for wq use.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 01d3e094823d716be0b39e15323c2506c6f0cc3b)

osd: tolerate racing threads starting recovery ops

We sample the (max - active) recovery ops to know how many to start, but
do not hold the lock over the full duration, such that it is possible to
start too many ops. This isn't problematic except that our condition
checks for being == max but not beyond it, and we will continue to start
recovery ops when we shouldn't. Fix this by adjusting the conditional
to be <=.

Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 3791a1e55828ba541f9d3e8e3df0da8e79c375f9)

ceph-disk: fix mount options passed to move_mount

Commit 6cbe0f021f62b3ebd5f68fcc01a12fde6f08cff5 added a mount_options but
in certain cases it may be blank. Fill in with the defaults, just as we
do in mount().

Backport: cuttlefish
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit cb50b5a7f1ab2d4e7fdad623a0e7769000755a70)

rgw: fix multi delete

Fixes: #5931
Backport: bobtail, cuttlefish

Fix a bad check, where we compare the wrong field. Instead of
comparing the ret code to 0, we compare the string value to 0
which generates implicit casting, hence the crash.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit f9f1c48ad799da2b4be0077bf9d61ae116da33d7)

Conflicts:
src/rgw/rgw_rest_s3.cc

ceph.spec.in: obsolete ceph-libs only on the affected distro

The ceph-libs package existed only on Redhat based distro,
there was e.g. never such a package on SUSE. Therefore: make
sure the 'Obsoletes' is only set on these affected distros.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>

ceph.spec.in: Obsolete ceph-libs

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

common: pick_addresses: fix bug with observer class that triggered #5205

The Observer class we defined to observe conf changes and thus avoid
triggering #5205 (as fixed by eb86eebe1ba42f04b46f7c3e3419b83eb6fe7f9a),
was returning always the same const static array, which would lead us to
always populate the observer's list with an observer for 'public_addr'.

This would of course become a problem when trying to obtain the observer
for 'cluster_add' during md_config_t::set_val() -- thus triggering the
same assert as initially reported on #5205.

Backport: cuttlefish
Fixes: #5205
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 7ed6de9dd7aed59f3c5dd93e012cf080bcc36d8a)

make sure we are using the mount options

Signed-off-by: Alfredo Deza <alfredo@deza.pe>
(cherry picked from commit 34831d0989d4bcec4920068b6ee09ab6b3234c91)

PG: set !flushed in Reset()

Otherwise, we might serve a pull before we start_flush in the
ReplicaActive constructor.

Fixes: #5799
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9e7d6d547e0e8a6db6ba611882afa9bf74ea0195)

osd: make open classes on start optional

This is cuttlefish; default to the old behavior!

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6f996223fb34650771772b88355046746f238cf2)

osd: load all classes on startup

This avoid creating a wide window between when ceph-osd is started and
when a request arrives needing a class and it is loaded. In particular,
upgrading the packages in that window may cause linkage errors (if the
class API has changed, for example).

Fixes: #5752
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit c24e652d8c5e693498814ebe38c6adbec079ea36)

Conflicts:
src/osd/ClassHandler.cc

ceph_test_rados: print version banner on startup

It is helpful when looking at qa run logs to see what version of the
tester is running.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 12c1f1157c7b9513a3d9f716a8ec62fce00d28f5)

ceph_test_rados: add --pool <name> arg

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit bcfbd0a3ffae6947464d930f636c8b35d1331e9d)

upstart: stop ceph-create-keys when the monitor stops

This avoids lingering ceph-create-keys tasks.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a90a2b42db8de134b8ea5d81cab7825fb9ec50b4)

FileStore: fix fd leak in _check_global_replay_guard

Bug introduced in f3f92fe21061e21c8b259df5ef283a61782a44db.

Fixes: #5766
Backport: cuttlefish
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c562b72e703f671127d0ea2173f6a6907c825cd1)

mon/Paxos: share uncommitted value when leader is/was behind

If the leader has and older lc than we do, and we are sharing states to
bring them up to date, we still want to also share our uncommitted value.
This particular case was broken by b26b7f6e, which was only contemplating
the case where the leader was ahead of us or at the same point as us, but
not the case where the leader was behind. Note that the call to
share_state() a few lines up will bring them fully up to date, so
after they receive and store_state() for this message they will be at the
same lc as we are.

Fixes: #5750
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 05b6c7e8645081f405c616735238ae89602d3cc6)

Merge remote-tracking branch 'gh/cuttlefish-next' into cuttlefish

HashIndex: reset attr upon split or merge completion

A replay of an in progress merge or split might make
our counts unreliable.

Fixes: #5723
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0dc3efdd885377a07987d868af5bb7a38245c90b)

test/filestore/store_test: add test for 5723

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 37a4c4af54879512429bb114285bcb4c7c3488d5)

Conflicts:
src/os/LFNIndex.cc
src/test/filestore/store_test.cc

FileStore::_collection_rename: fix global replay guard

If the replay is being replayed, we might have already
performed the rename, skip it. Also, we must set the
collection replay guard only after we have done the
rename.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 870c474c5348831fcb13797d164f49682918fb30)

v0.61.7

PGLog::rewind_divergent_log: unindex only works from tail, index() instead

Fixes: #5714
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6957dbc75cc2577652b542aa3eae69f03060cb63)

The original patch covered the same code in PGLog.cc.

Conflicts:

src/osd/PGLog.cc
src/osd/PG.cc

msg/Pipe: do not hold pipe_lock for verify_authorizer()

We shouldn't hold the pipe_lock while doing the ms_verify_authorizer
upcalls.

Fix by unlocking a bit earlier, and verifying our state is still correct
in the failure path.

This regression was introduced by ecab4bb9513385bd765cca23e4e2fadb7ac4bac2.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 723d691f7a1f53888618dfc311868d1988f61f56)

Conflicts:

src/msg/Pipe.cc

msg/Pipe: a bit of additional debug output

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 16568d9e1fb8ac0c06ebaa1e1dc1d6a432a5e4d4)

msg/Pipe: hold pipe_lock during important parts of accept()

Previously we did not bother with locking for accept() because we were
not visible to any other threads. However, we need to close accepting
Pipes from mark_down_all(), which means we need to handle interference.

Fix up the locking so that we hold pipe_lock when looking at Pipe state
and verify that we are still in the ACCEPTING state any time we retake
the lock.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ecab4bb9513385bd765cca23e4e2fadb7ac4bac2)

msgr: fix a typo/goto-cross from dd4addef2d

We didn't build or review carefully enough!

Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 1a84411209b13084b3edb87897d5d678937e3299)

msgr: close accepting_pipes from mark_down_all()

We need to catch these pipes too, particularly when doing a rebind(),
to avoid them leaking through.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 687fe888b32ac9d41595348dfc82111c8dbf2fcb)

msgr: maintain list of accepting pipes

New pipes exist in a sort of limbo before we know who the peer is and
add them to rank_pipe. Keep a list of them in accepting_pipes for that
period.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit dd4addef2d5b457cc9a58782fe42af6b13c68b81)

msgr: adjust nonce on rebind()

We can have a situation where:

- we have a pipe to a peer
- pipe goes to standby (on peer)
- we rebind to a new port
- ....
- we rebind again to the same old port
- we connect to peer

and get reattached to the ancient pipe from two instances back. Avoid that
by picking a new nonce each time we rebind.

Add 1,000,000 each time so that the port is still legible in the printed
output.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 994e2bf224ab7b7d5b832485ee14de05354d2ddf)

Conflicts:

src/msg/Accepter.cc

msgr: mark_down_all() after, not before, rebind

If we are shutting down all old connections and binding to new ports,
we want to avoid a sequence like:

- close all prevoius connections
- new connection comes in on old port
- rebind to new ports
-> connection from old port leaks through

As a first step, close all connections after we shut down the old
accepter and before we start the new one.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 07a0860a1899c7353bb506e33de72fdd22b857dd)

Conflicts:

src/msg/SimpleMessenger.cc

msg/Pipe: unlock msgr->lock earlier in accept()

Small cleanup. Nothing needs msgr->lock for the previously larger
window.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ad548e72fd94b4a16717abd3b3f1d1be4a3476cf)

msg/Pipe: avoid creating empty out_q entry

We need to maintain the invariant that all sub queues in out_q are never
empty. Fix discard_requeued_up_to() to avoid creating an entry unless we
know it is already present.

This bug leads to an incorrect reconnect attempt when

- we accept a pipe (lossless peer)
- they send some stuff, maybe
- fault
- we initiate reconnect, even tho we have nothing queued

In particular, we shouldn't reconnect because we aren't checking for
resets, and the fact that our out_seq is 0 while the peer's might be
something else entirely will trigger asserts later.

This fixes at least one source of #5626, and possibly #5517.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9f1c27261811733f40acf759a72958c3689c8516)

msg/Pipe: assert lock is held in various helpers

These all require that we hold pipe_lock.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 579d858aabbe5df88543d096ef4dbddcfc023cca)

msg/Pipe: be a bit more explicit about encoding outgoing messages

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4282971d47b90484e681ff1a71ae29569dbd1d32)

msg/Pipe: fix RECONNECT_SEQ behavior

Calling handle_ack() here has no effect because we have already
spliced sent messages back into our out queue. Instead, pull them out
of there and discard. Add a few assertions along the way.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 495ee108dbb39d63e44cd3d4938a6ec7d11b12e3)

msgr: reaper: make sure pipe has been cleared (under pipe_lock)

All paths to pipe shutdown should have cleared the con->pipe reference
already. Assert as much.

Also, do it under pipe_lock!

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9586305a2317c7d6bbf31c9cf5b67dc93ccab50d)

msg/Pipe: goto fail_unlocked on early failures in accept()

Instead of duplicating an incomplete cleanup sequence (that does not
clear_pipe()), goto fail_unlocked and do the cleanup in a generic way.
s/rc/r/ while we are here.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ec612a5bda119cea52bbac9b2a49ecf1e83b08e5)

msgr: clear con->pipe inside pipe_lock on mark_down

We need to do this under protection of the pipe_lock.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit afafb87e8402242d3897069f4b94ba46ffe0c413)

msgr: clear_pipe inside pipe_lock on mark_down_all

Observed a segfault in rebind -> mark_down_all -> clear_pipe -> put that
may have been due to a racing thread clearing the connection_state pointer.
Do the clear_pipe() call under the protection of pipe_lock, as we do in
all other contexts.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5fc1dabfb3b2cbffdee3214d24d7769d6e440e45)

Conflicts:

src/msg/SimpleMessenger.cc

ReplicatedPG: track temp collection contents, clear during on_change

We also assert in on_flushed() that the temp collection is actually
empty.

Fixes: #5670
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 47516d9c4b7f023f3a16e166749fa7b1c7b3b24c)

Conflicts:

src/osd/ReplicatedPG.cc

PG, ReplicatedPG: pass a transaction down to ReplicatedPG::on_change

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9f56a7b8bfcb63cb4fbbc0c9b8ff01de9e518c57)

PG: start flush on primary only after we process the master log

Once we start serving reads, stray objects must have already
been removed. Therefore, we have to flush all operations
up to the transaction writing out the authoritative log.
On replicas, we flush in Stray() if we will not eventually
be activated and in ReplicaActive if we are in the acting
set. This way a replica won't serve a replica read until
the store is consistent.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b41f1ba48563d1d3fd17c2f62d10103b5d63f305)

ReplicatedPG: replace clean_up_local with a debug check

Stray objects should have been cleaned up in the merge_log
transactions. Only on the primary have those operations
necessarily been flushed at activate().

Fixes: 5084
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 278c7b59228f614addf830cb0afff4988c9bc8cb)

FileStore: add global replay guard for split, collection_rename

In the event of a split or collection rename, we need to ensure that
we don't replay any operations on objects within those collections
prior to that point. Thus, we mark a global replay guard on the
collection after doing a syncfs and make sure to check that in
_check_replay_guard() for all object operations.

Fixes: #5154
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f3f92fe21061e21c8b259df5ef283a61782a44db)

Conflicts:

src/os/FileStore.cc

OSD: add config option for peering_wq batch size

Large peering_wq batch sizes may excessively delay
peering messages resulting in unreasonably long
peering. This may speed up peering.

Backport: cuttlefish
Related: #5084
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 39e5a2a406b77fa82e9a78c267b679d49927e3c3)

ceph-disk: use new get_dev_path helper for list

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Tested-by: Olivier Bonvalet <ob.ceph@daevel.fr>
(cherry picked from commit fd1fd664d6102a2a96b27e8ca9933b54ac626ecb)

ceph-disk: use /sys/block to determine partition device names

Not all devices are basename + number; some have intervening character(s),
like /dev/cciss/c0d1p2.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 2ea8fac441141d64ee0d26c5dd2b441f9782d840)

ceph-disk: reimplement is_partition() using /sys/block

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5b031e100b40f597752b4917cdbeebb366eb98d7)

ceph-disk: use get_dev_name() helper throughout

This is more robust than the broken split trick.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3359aaedde838c98d1155611e157fd2da9e8b9f5)

ceph-disk: refactor list_[all_]partitions

Make these methods work in terms of device *names*, not paths, and fix up
the only direct list_partitions() caller to do the same.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 35d3f2d84808efda3d2ac868afe03e6959d51c03)

ceph-disk: add get_dev_name, path helpers

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e0401591e352ea9653e3276d66aebeb41801eeb3)

ceph-disk: handle /dev/foo/bar devices throughout

Assume the last component is the unique device name, even if it appears
under a subdir of /dev.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit cb97338b1186939deecb78e9d949c38c3ef59026)

ceph-disk: make is_held() smarter about full disks

Handle the case where the device is a full disk. Make the partition
check a bit more robust (don't make assumptions about naming aside from
the device being a prefix of the partition).

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e082f1247fb6ddfb36c4223cbfdf500d6b45c978)

mon/OSDMonitor: search for latest full osdmap if record version is missing

In 97462a3213e5e15812c79afc0f54d697b6c498b1 we tried to search for a
recent full osdmap but were looking at the wrong key. If full_0 was
present we could record that the latest full map was last_committed even
though it wasn't present. This is fixed in 76cd7ac1c, but we need to
compensate for when get_version_latest_full() gives us a back version
number by repeating the search.

Fixes: #5737
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit c2131d4047156aa2964581c9dbd93846382a07e7)

test: test_store_tool: global init before using LevelDBStore

Fixes a segfault

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a7a7d3fc8a2ba4a30ef136a32f2903d157b3e19a)

mon: OSDMonitor: fix a bug introduced on 97462a32

Fixes: #5737
Backport: cuttlefish

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 76cd7ac1c2094b34ad36bea89b2246fa90eb2f6d)

mon/Paxos: fix pn for uncommitted value during collect/last phase

During the collect/last exchange, peers share any uncommitted values
with the leader.  They are supposed to also share the pn under which
that value was accepted, but were instead using the just-accepted pn
value.  This effectively meant that we *always* took the uncommitted
value; if there were multiples, which one we accepted depended on what
order the LAST messages arrived, not which pn the values were generated
under.

The specific failure sequence I observed:

- collect
  - learned uncommitted value for 262 from myself
  - send collect with pn 901
- got last with pn 901 (incorrect) for 200 (old) from peer
  - discard our own value, remember the other
- finish collect phase
  - ignore old uncommitted value

Fix this by storing a pending_v and pending_pn value whenever we accept
a value.  Use this to send an appropriate pn value in the LAST reply
so that the leader can make it's decision about which uncommitted value
to accept based on accurate information.  Also use it when we learn
the uncommitted value from ourselves.

We could probably be more clever about storing less information here,
for example by omitting pending_v and clearing pending_pn at the
appropriate point, but that would be more fragile.  Similarly, we could
store a pn for *every* commit if we wanted to lay some groundwork for
having multiple uncommitted proposals in flight, but I don't want to
speculate about what is necessary or sufficient for a correct solution
there.

Fixes: #5698
Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 20baf662112dd5f560bc3a2d2114b469444c3de8)

mon/Paxos: debug ignored uncommitted values

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 19b29788966eb80ed847630090a16a3d1b810969)

mon/Paxos: only learn uncommitted value if it is in the future

If an older peer sends an uncommitted value, make sure we only take it
if it is in the future, and at least as new as any current uncommitted
value.

(Prior to the previous patch, peers could send values from long-past
rounds. The pn values are also bogus.)

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b3253a453c057914753846c77499f98d3845c58e)

mon/Paxos: only share uncommitted value if it is next

We may have an uncommitted value from our perspective (it is our lc + 1)
when the collector has a much larger lc (because we have been out for
the last few rounds). Only share an uncommitted value if it is in fact
the next value.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b26b7f6e5e02ac6beb66e3e34e177e6448cf91cf)

v0.61.6

mon/OSDMonitor: fix base case for 7fb3804fb workaround

After cluster creation, we have no full map stored and first_committed ==
1. In that case, there is no need for a full map, since we can get there
from OSDMap() and the incrementals.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao@inktank.com>
(cherry picked from commit e807770784175b05130bba938113fdbf874f152e)

mon: OSDMonitor: work around a full version bug introduced in 7fb3804fb

In 7fb3804fb860dcd0340dd3f7c39eec4315f8e4b6 we moved the full version
stashing logic to the encode_trim_extra() function.  However, we forgot
to update the osdmap's 'latest_full' key that should always point to
the latest osdmap full version.  This eventually degenerated in a missing
full version after a trim.  This patch works around this bug by looking
for the latest available full osdmap version in the store and updating
'latest_full' to its proper value.

Related-to: #5704
Backport: cuttlefish

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 97462a3213e5e15812c79afc0f54d697b6c498b1)

mon: OSDMonitor: update the osdmap's latest_full with the new full version

We used to do this on encode_full(), but since [1] we no longer rely on
PaxosService to manage the full maps for us. And we forgot to write down
the latest_full version to the store, leaving it in a truly outdated state.

[1] - 7fb3804fb860dcd0340dd3f7c39eec4315f8e4b6

Fixes: #5704
Backport: cuttlefish
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a815547ed3e5ffdbbb96c8c0c1b8d6dd8c62bfba)

mon: decline to scrub when paxos is not active

In f1ce8d7c955a2443111bf7d9e16b4c563d445712 we close a race between scrub
and paxos commit completion on the leader. The fix is nontrivial to
backport and probably not worthwhile; just avoid scrubbing at that time
for now.

Note that the actual fix for this is in commit
f1ce8d7c955a2443111bf7d9e16b4c563d445712.

Signed-off-by: Sage Weil <sage@inktank.com>

v0.61.5

ceph-disk: rely on /dev/disk/by-partuuid instead of special-casing journal symlinks

This was necessary when ceph-disk-udev didn't create the by-partuuid (and
other) symlinks for us, but now it is fragile and error-prone. (It also
appears to be broken on a certain customer RHEL VM.) See
d7f7d613512fe39ec883e11d201793c75ee05db1.

Instead, just use the by-partuuid symlinks that we spent all that ugly
effort generating.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 64379e701b3ed862c05f156539506d3382f77aa8)

mon: Monitor: StoreConverter: clearer debug message on 'needs_conversion()'

The previous debug message outputted the function's name, as often our
functions do. This was however a source of bewilderment, as users would
see those in logs and think their stores would need conversion. Changing
this message is trivial enough and it will make ceph users happier log
readers.

Backport: cuttlefish
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ad1392f68170b391d11df0ce5523c2d1fb57f60e)

mon: Monitor: do not reopen MonitorDBStore during conversion

We already open the store on ceph_mon.cc, before we start the conversion.
Given we are unable to reproduce this every time a conversion is triggered,
we are led to believe that this causes a race in leveldb that will lead
to 'store.db/LOCK' being locked upon the open this patch removes.

Regardless, reopening the db here is pointless as we already did it when
we reach Monitor::StoreConverter::convert().

Fixes: #5640
Backport: cuttlefish

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 036e6739a4e873863bae3d7d00f310c015dfcdb3)

messages/MClientReconnect: clear data when encoding

The MClientReconnect puts everything in the data payload portion of
the message and nothing in the front portion. That means that if the
message is resent (socket failure or something), the messenger thinks it
hasn't been encoded yet (front empty) and reencodes, which means
everything gets added (again) to the data portion.

Decoding keep decoding until it runs out of data, so the second copy
means we decode garbage snap realms, leading to the crash in bug

Clearing data each time around resolves the problem, although it does
mean we do the encoding work multiple times. We could alternatively
(or also) stick some data in the front portion of the payload
(ignored), but that changes the wire protocol and I would rather not
do that.

Fixes: #4565
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 314cf046b0b787ca69665e8751eab6fe7adb4037)

mon: once sync full is chosen, make sure we don't change our mind

It is possible for a sequence like:

- probe
- first probe reply has paxos trim that indicates a full sync is
needed
- start sync
- clear store
- something happens that makes us abort and bootstrap (e.g., the
provider mon restarts
- probe
- first probe reply has older paxos trim bound and we call an election
- on election completion, we crash because we have no data.

Non-determinism of the probe decision aside, we need to ensure that
the info we share during probe (fc, lc) is accurate, and that once we
clear the store we know we *must* do a full sync.

This is a backport of aa60f940ec1994a61624345586dc70d261688456.

Fixes: #5621
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

mon: do not scrub if scrub is in progress

This prevents an assert from unexpected scrub results from the previous
scrub on the leader.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 00ae543b3e32f89d906a0e934792cc5309f57696)

messages/MPGStats: do not set paxos version to osdmap epoch

The PaxosServiceMessage version field is meant for client-coordinated
ordering of messages when switching between monitors (and is rarely
used). Do not fill it with the osdmap epoch lest it be compared to a
pgmap version, which may cause the mon to (near) indefinitely put it on
a wait queue until the pgmap version catches up.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit b36338be43f43b6dd4ee87c97f2eaa23b467c386)

osd/OSDmap: fix OSDMap::Incremental::dump() for new pool names

The name is always present when pools are created, but not when they are
modified. Also, a name may be present with a new_pools entry if the pool
is just renamed. Separate it out completely in the dump.

Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3e4a29111e89588385e63f8d92ce3d67739dd679)

mon/PaxosService: prevent reads until initial service commit is done

Do not process reads (or, by PaxosService::dispatch() implication, writes)
until we have committed the initial service state. This avoids things like
EPERM due to missing keys when we race with mon creation, triggered by
teuthology tests doing their health check after startup.

Fixes: #5515
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit d08b6d6df7dba06dad73bdec2c945f24afc02717)

client: send all request put's through put_request()

Make sure all MetaRequest reference put's go through the same path that
releases inode references, including all of the error paths.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 87217e1e3cb2785b79d0dec49bd3f23a827551f5)

client: fix remaining Inode::put() caller, and make method psuedo-private

Not sure I can make this actually private and make Client::put_inode() a
friend method (making all of Client a friend would defeat the purpose).
This works well enough, though!

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9af3b86b25574e4d2cdfd43e61028cffa19bdeb1)

client: use put_inode on MetaRequest inode refs

When we drop the request inode refs, we need to use put_inode() to ensure
they get cleaned up properly (removed from inode_map, caps released, etc.).
Do this explicitly here (as we do with all other inode put() paths that
matter).

Fixes: #5381
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 81bee6487fb1ce9e090b030d61bda128a3cf4982)

mon: be smarter about calculating last_epoch_clean lower bound

We need to take PGs whose mapping has not changed in a long time into
account. For them, the pg state will indicate it was clean at the time of
the report, in which case we can use that as a lower-bound on their actual
latest epoch clean. If they are not currently clean (at report time), use
the last_epoch_clean value.

Fixes: #5519
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit cc0006deee3153e06ddd220bf8a40358ba830135)

osd: report pg stats to mon at least every N (=500) epochs

The mon needs a moderately accurate last_epoch_clean value in order to trim
old osdmaps.  To prevent a PG that hasn't peered or received IO in forever
from preventing this, send pg stats at some minimum frequency.  This will
increase the pg stat report workload for the mon over an idle pool, but
should be no worse that a cluster that is getting actual IO and sees these
updates from normal stat updates.

This makes the reported update a bit more aggressive/useful in that the epoch
is the last map epoch processed by this PG and not just one that is >= the
currenting interval.  Note that the semantics of this field are pretty useless
at this point.

See #5519

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit da81228cc73c95737f26c630e5c3eccf6ae1aaec)

osd: fix warning

From 653e04a79430317e275dd77a46c2b17c788b860b

Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit bc291d3fc3fc1cac838565cbe0f25f71d855a6e3)

Merge remote-tracking branch 'gh/wip-mon-sync-2' into cuttlefish

Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

Get device-by-path by looking for it instead of assuming 3rd entry.

On some systems (virtual machines so far) the device-by-path entry
from udevadm is not always in the same spot so instead actually
look for the right output instead of blindy assuming that its a
specific field in the output.

Signed-off-by: Sandon Van Ness <sandon@inktank.com>
Reviewed-by: Gary Lowell <gary.lowell@inktank.com>

Merge remote-tracking branch 'gh/cuttlefish' into wip-mon-sync-2

osd: limit number of inc osdmaps send to peers, clients

We should not send an unbounded number of inc maps to our peers or clients.
In particular, if a peer is not contacted for a while, we may think they
have a very old map (say, 10000 epochs ago) and send thousands of inc maps
when the distribution shifts and we need to peer.

Note that if we do not send enough maps, the peers will make do by
requesting the map from somewhere else (currently the mon). Regardless
of the source, however, we must limit the amount that we speculatively
share as it usually is not needed.

Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 653e04a79430317e275dd77a46c2b17c788b860b)

rgw: Fix return value for swift user not found

http://tracker.ceph.com/issues/1779 fixes #1779

Adjust the return value from rgw_get_user_info_by_swift call
in RGW_SWIFT_Auth_Get::execute() to have the correct
return code in response.
(cherry picked from commit 4089001de1f22d6acd0b9f09996b71c716235551)

mon/OSDMonitor: make 'osd crush rm ...' slightly more idempotent

This is a manual backport of 18a624fd8b90d9959de51f07622cf0839e6bd9aa.
Do not return immediately if we are looking at uncommitted state.t

Signed-off-by: Sage Weil <sage@inktank.com>