
upstart: stop ceph-create-keys when the monitor stops

This avoids lingering ceph-create-keys tasks.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a90a2b42db8de134b8ea5d81cab7825fb9ec50b4)

FileStore: fix fd leak in _check_global_replay_guard

Bug introduced in f3f92fe21061e21c8b259df5ef283a61782a44db.

Fixes: #5766
Backport: cuttlefish
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c562b72e703f671127d0ea2173f6a6907c825cd1)

mon/Paxos: share uncommitted value when leader is/was behind

If the leader has an older lc than we do, and we are sharing states to
bring them up to date, we still want to also share our uncommitted value.
This particular case was broken by b26b7f6e, which was only contemplating
the case where the leader was ahead of us or at the same point as us, but
not the case where the leader was behind. Note that the call to
share_state() a few lines up will bring them fully up to date, so
after they receive and store_state() for this message they will be at the
same lc as we are.

Fixes: #5750
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 05b6c7e8645081f405c616735238ae89602d3cc6)
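
The sharing rule described above can be sketched as follows. This is an illustration in Python, not the monitor's C++ code; the function name and its arguments are hypothetical, and the uncommitted value is assumed to be for version my_lc + 1.

```python
def should_share_uncommitted(leader_lc, my_lc, have_uncommitted):
    """Decide whether a peer attaches its uncommitted value to its reply.

    leader_lc and my_lc are last-committed versions (hypothetical names).
    The uncommitted value, if any, is assumed to be for version my_lc + 1.
    """
    if not have_uncommitted:
        return False
    # Leader ahead of us: our "next" value is stale from its point of view.
    if leader_lc > my_lc:
        return False
    # Leader at the same point, or behind. In the latter case share_state()
    # first brings it up to my_lc, so my_lc + 1 is its next value as well.
    return True
```

The case leader_lc < my_lc is the one this commit restores.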

Merge remote-tracking branch 'gh/cuttlefish-next' into cuttlefish

HashIndex: reset attr upon split or merge completion

A replay of an in progress merge or split might make
our counts unreliable.

Fixes: #5723
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0dc3efdd885377a07987d868af5bb7a38245c90b)

test/filestore/store_test: add test for 5723

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 37a4c4af54879512429bb114285bcb4c7c3488d5)

Conflicts:
src/os/LFNIndex.cc
src/test/filestore/store_test.cc

FileStore::_collection_rename: fix global replay guard

If the transaction is being replayed, we might have already
performed the rename; skip it. Also, we must set the
collection replay guard only after we have done the
rename.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 870c474c5348831fcb13797d164f49682918fb30)

v0.61.7

PGLog::rewind_divergent_log: unindex only works from tail, index() instead

Fixes: #5714
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6957dbc75cc2577652b542aa3eae69f03060cb63)

The original patch covered the same code in PGLog.cc.

Conflicts:

src/osd/PGLog.cc
src/osd/PG.cc

msg/Pipe: do not hold pipe_lock for verify_authorizer()

We shouldn't hold the pipe_lock while doing the ms_verify_authorizer
upcalls.

Fix by unlocking a bit earlier, and verifying our state is still correct
in the failure path.

This regression was introduced by ecab4bb9513385bd765cca23e4e2fadb7ac4bac2.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 723d691f7a1f53888618dfc311868d1988f61f56)

Conflicts:

src/msg/Pipe.cc

msg/Pipe: a bit of additional debug output

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 16568d9e1fb8ac0c06ebaa1e1dc1d6a432a5e4d4)

msg/Pipe: hold pipe_lock during important parts of accept()

Previously we did not bother with locking for accept() because we were
not visible to any other threads. However, we need to close accepting
Pipes from mark_down_all(), which means we need to handle interference.

Fix up the locking so that we hold pipe_lock when looking at Pipe state
and verify that we are still in the ACCEPTING state any time we retake
the lock.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ecab4bb9513385bd765cca23e4e2fadb7ac4bac2)

msgr: fix a typo/goto-cross from dd4addef2d

We didn't build or review carefully enough!

Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 1a84411209b13084b3edb87897d5d678937e3299)

msgr: close accepting_pipes from mark_down_all()

We need to catch these pipes too, particularly when doing a rebind(),
to avoid them leaking through.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 687fe888b32ac9d41595348dfc82111c8dbf2fcb)

msgr: maintain list of accepting pipes

New pipes exist in a sort of limbo before we know who the peer is and
add them to rank_pipe. Keep a list of them in accepting_pipes for that
period.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit dd4addef2d5b457cc9a58782fe42af6b13c68b81)

msgr: adjust nonce on rebind()

We can have a situation where:

- we have a pipe to a peer
- pipe goes to standby (on peer)
- we rebind to a new port
- ....
- we rebind again to the same old port
- we connect to peer

and get reattached to the ancient pipe from two instances back. Avoid that
by picking a new nonce each time we rebind.

Add 1,000,000 each time so that the port is still legible in the printed
output.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 994e2bf224ab7b7d5b832485ee14de05354d2ddf)

Conflicts:

src/msg/Accepter.cc
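
A small sketch of the nonce adjustment, in Python for brevity. The helper names are hypothetical, and the sketch assumes the initial nonce is below 1,000,000 so that its digits survive in the low six places.

```python
NONCE_STEP = 1000000  # increment named in the commit message


def next_nonce(nonce):
    """Return the nonce to use after one more rebind()."""
    return nonce + NONCE_STEP


def legible_part(nonce):
    """Low six decimal digits, unchanged by any number of rebinds."""
    return nonce % NONCE_STEP
```

Each rebind yields a distinct nonce, so a peer cannot match the new instance to a pipe left over from an earlier one, while the original value stays readable in printed addresses.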

msgr: mark_down_all() after, not before, rebind

If we are shutting down all old connections and binding to new ports,
we want to avoid a sequence like:

- close all previous connections
- new connection comes in on old port
- rebind to new ports
-> connection from old port leaks through

As a first step, close all connections after we shut down the old
accepter and before we start the new one.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 07a0860a1899c7353bb506e33de72fdd22b857dd)

Conflicts:

src/msg/SimpleMessenger.cc

msg/Pipe: unlock msgr->lock earlier in accept()

Small cleanup. Nothing needs msgr->lock for the previously larger
window.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ad548e72fd94b4a16717abd3b3f1d1be4a3476cf)

msg/Pipe: avoid creating empty out_q entry

We need to maintain the invariant that all sub queues in out_q are never
empty. Fix discard_requeued_up_to() to avoid creating an entry unless we
know it is already present.

This bug leads to an incorrect reconnect attempt when

- we accept a pipe (lossless peer)
- they send some stuff, maybe
- fault
- we initiate reconnect, even though we have nothing queued

In particular, we shouldn't reconnect because we aren't checking for
resets, and the fact that our out_seq is 0 while the peer's might be
something else entirely will trigger asserts later.

This fixes at least one source of #5626, and possibly #5517.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9f1c27261811733f40acf759a72958c3689c8516)
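
The invariant can be illustrated with a Python dict standing in for out_q. This is a sketch only: the data layout (priority mapped to a list of (seq, payload) pairs) and the has_pending_output() helper are assumptions made for the example, not the Pipe implementation.

```python
def discard_requeued_up_to(out_q, priority, seq):
    """Drop requeued messages with sequence number <= seq at one priority.

    out_q maps priority -> non-empty list of (seq, payload) pairs.
    """
    if priority not in out_q:
        # Nothing queued. Looking the key up with setdefault() or a
        # defaultdict would create an empty sub-queue here.
        return
    queue = out_q[priority]
    while queue and queue[0][0] <= seq:
        queue.pop(0)
    if not queue:
        del out_q[priority]  # keep the invariant: no empty sub-queues


def has_pending_output(out_q):
    """With the invariant in place, a non-empty map means queued data."""
    return bool(out_q)
```

If an empty entry were left behind, a check like has_pending_output() would report queued data where there is none, which is the kind of false positive that leads to the unwanted reconnect.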

msg/Pipe: assert lock is held in various helpers

These all require that we hold pipe_lock.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 579d858aabbe5df88543d096ef4dbddcfc023cca)

msg/Pipe: be a bit more explicit about encoding outgoing messages

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4282971d47b90484e681ff1a71ae29569dbd1d32)

msg/Pipe: fix RECONNECT_SEQ behavior

Calling handle_ack() here has no effect because we have already
spliced sent messages back into our out queue. Instead, pull them out
of there and discard. Add a few assertions along the way.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 495ee108dbb39d63e44cd3d4938a6ec7d11b12e3)

msgr: reaper: make sure pipe has been cleared (under pipe_lock)

All paths to pipe shutdown should have cleared the con->pipe reference
already. Assert as much.

Also, do it under pipe_lock!

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9586305a2317c7d6bbf31c9cf5b67dc93ccab50d)

msg/Pipe: goto fail_unlocked on early failures in accept()

Instead of duplicating an incomplete cleanup sequence (that does not
clear_pipe()), goto fail_unlocked and do the cleanup in a generic way.
s/rc/r/ while we are here.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ec612a5bda119cea52bbac9b2a49ecf1e83b08e5)

msgr: clear con->pipe inside pipe_lock on mark_down

We need to do this under protection of the pipe_lock.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit afafb87e8402242d3897069f4b94ba46ffe0c413)

msgr: clear_pipe inside pipe_lock on mark_down_all

Observed a segfault in rebind -> mark_down_all -> clear_pipe -> put that
may have been due to a racing thread clearing the connection_state pointer.
Do the clear_pipe() call under the protection of pipe_lock, as we do in
all other contexts.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5fc1dabfb3b2cbffdee3214d24d7769d6e440e45)

Conflicts:

src/msg/SimpleMessenger.cc

ReplicatedPG: track temp collection contents, clear during on_change

We also assert in on_flushed() that the temp collection is actually
empty.

Fixes: #5670
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 47516d9c4b7f023f3a16e166749fa7b1c7b3b24c)

Conflicts:

src/osd/ReplicatedPG.cc

PG, ReplicatedPG: pass a transaction down to ReplicatedPG::on_change

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9f56a7b8bfcb63cb4fbbc0c9b8ff01de9e518c57)

PG: start flush on primary only after we process the master log

Once we start serving reads, stray objects must have already
been removed. Therefore, we have to flush all operations
up to the transaction writing out the authoritative log.
On replicas, we flush in Stray() if we will not eventually
be activated and in ReplicaActive if we are in the acting
set. This way a replica won't serve a replica read until
the store is consistent.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b41f1ba48563d1d3fd17c2f62d10103b5d63f305)

ReplicatedPG: replace clean_up_local with a debug check

Stray objects should have been cleaned up in the merge_log
transactions. Only on the primary have those operations
necessarily been flushed at activate().

Fixes: #5084
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 278c7b59228f614addf830cb0afff4988c9bc8cb)

FileStore: add global replay guard for split, collection_rename

In the event of a split or collection rename, we need to ensure that
we don't replay any operations on objects within those collections
prior to that point. Thus, we mark a global replay guard on the
collection after doing a syncfs and make sure to check that in
_check_replay_guard() for all object operations.

Fixes: #5154
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f3f92fe21061e21c8b259df5ef283a61782a44db)

Conflicts:

src/os/FileStore.cc

OSD: add config option for peering_wq batch size

Large peering_wq batch sizes may excessively delay
peering messages, resulting in unreasonably long
peering. Making the batch size configurable may speed up peering.

Backport: cuttlefish
Related: #5084
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 39e5a2a406b77fa82e9a78c267b679d49927e3c3)

ceph-disk: use new get_dev_path helper for list

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Tested-by: Olivier Bonvalet <ob.ceph@daevel.fr>
(cherry picked from commit fd1fd664d6102a2a96b27e8ca9933b54ac626ecb)

ceph-disk: use /sys/block to determine partition device names

Not all devices are basename + number; some have intervening character(s),
like /dev/cciss/c0d1p2.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 2ea8fac441141d64ee0d26c5dd2b441f9782d840)
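
A minimal sketch of finding partitions through a /sys/block style tree. The helper bodies are illustrative, not the ceph-disk source; the sysfs_block parameter is added here so the function can be pointed at a test directory. The '!' mapping reflects how Linux sysfs names devices that live in a subdirectory of /dev.

```python
import os


def get_dev_name(path):
    """'/dev/cciss/c0d1' -> 'cciss!c0d1' (sysfs uses '!' in place of '/')."""
    assert path.startswith('/dev/')
    return path[len('/dev/'):].replace('/', '!')


def get_dev_path(name):
    """Inverse of get_dev_name()."""
    return '/dev/' + name.replace('!', '/')


def list_partitions(name, sysfs_block='/sys/block'):
    """Return the partition names of one disk.

    Partitions are taken to be the sub-directories of the disk's sysfs
    directory whose names start with the disk's own name.
    """
    disk_dir = os.path.join(sysfs_block, name)
    return sorted(
        entry for entry in os.listdir(disk_dir)
        if entry.startswith(name)
        and os.path.isdir(os.path.join(disk_dir, entry))
    )
```

Because the partition names come from the directory listing, nothing depends on the name being "basename + number".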

ceph-disk: reimplement is_partition() using /sys/block

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5b031e100b40f597752b4917cdbeebb366eb98d7)

ceph-disk: use get_dev_name() helper throughout

This is more robust than the broken split trick.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3359aaedde838c98d1155611e157fd2da9e8b9f5)

ceph-disk: refactor list_[all_]partitions

Make these methods work in terms of device *names*, not paths, and fix up
the only direct list_partitions() caller to do the same.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 35d3f2d84808efda3d2ac868afe03e6959d51c03)

ceph-disk: add get_dev_name, path helpers

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e0401591e352ea9653e3276d66aebeb41801eeb3)

ceph-disk: handle /dev/foo/bar devices throughout

Assume the last component is the unique device name, even if it appears
under a subdir of /dev.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit cb97338b1186939deecb78e9d949c38c3ef59026)

ceph-disk: make is_held() smarter about full disks

Handle the case where the device is a full disk. Make the partition
check a bit more robust (don't make assumptions about naming aside from
the device being a prefix of the partition).

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e082f1247fb6ddfb36c4223cbfdf500d6b45c978)

mon/OSDMonitor: search for latest full osdmap if record version is missing

In 97462a3213e5e15812c79afc0f54d697b6c498b1 we tried to search for a
recent full osdmap but were looking at the wrong key. If full_0 was
present we could record that the latest full map was last_committed even
though it wasn't present. This is fixed in 76cd7ac1c, but we need to
compensate for when get_version_latest_full() gives us a back version
number by repeating the search.

Fixes: #5737
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit c2131d4047156aa2964581c9dbd93846382a07e7)

test: test_store_tool: global init before using LevelDBStore

Fixes a segfault

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a7a7d3fc8a2ba4a30ef136a32f2903d157b3e19a)

mon: OSDMonitor: fix a bug introduced on 97462a32

Fixes: #5737
Backport: cuttlefish

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 76cd7ac1c2094b34ad36bea89b2246fa90eb2f6d)

mon/Paxos: fix pn for uncommitted value during collect/last phase

During the collect/last exchange, peers share any uncommitted values
with the leader.  They are supposed to also share the pn under which
that value was accepted, but were instead using the just-accepted pn
value.  This effectively meant that we *always* took the uncommitted
value; if there were multiples, which one we accepted depended on what
order the LAST messages arrived, not which pn the values were generated
under.

The specific failure sequence I observed:

- collect
  - learned uncommitted value for 262 from myself
  - send collect with pn 901
- got last with pn 901 (incorrect) for 200 (old) from peer
  - discard our own value, remember the other
- finish collect phase
  - ignore old uncommitted value

Fix this by storing a pending_v and pending_pn value whenever we accept
a value.  Use this to send an appropriate pn value in the LAST reply
so that the leader can make its decision about which uncommitted value
to accept based on accurate information.  Also use it when we learn
the uncommitted value from ourselves.

We could probably be more clever about storing less information here,
for example by omitting pending_v and clearing pending_pn at the
appropriate point, but that would be more fragile.  Similarly, we could
store a pn for *every* commit if we wanted to lay some groundwork for
having multiple uncommitted proposals in flight, but I don't want to
speculate about what is necessary or sufficient for a correct solution
there.

Fixes: #5698
Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 20baf662112dd5f560bc3a2d2114b469444c3de8)
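
The effect of reporting the right pn can be shown with a toy collector in Python. The class and its fields are hypothetical and model only the pn comparison, not the full set of checks in Paxos.cc; the pn values 800 and 300 are made up for the example.

```python
class Collector:
    """Leader-side bookkeeping during the collect/last exchange (toy model)."""

    def __init__(self, last_committed):
        self.last_committed = last_committed
        self.uncommitted_v = 0        # version of the best candidate so far
        self.uncommitted_pn = 0       # pn under which that candidate was accepted
        self.uncommitted_value = None

    def learn(self, version, pn, value):
        """Consider one uncommitted value (our own or from a LAST reply)."""
        # Keep the candidate accepted under the highest pn, so the outcome
        # does not depend on the order in which replies arrive.
        if pn > self.uncommitted_pn:
            self.uncommitted_v = version
            self.uncommitted_pn = pn
            self.uncommitted_value = value
            return True
        return False
```

With accurate pns, a leader that learned its own value for 262 under pn 800 ignores a peer's old value for 200 accepted under pn 300. If the peer instead reports the just-accepted pn 901, the old value wins, which is the failure sequence listed above.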

mon/Paxos: debug ignored uncommitted values

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 19b29788966eb80ed847630090a16a3d1b810969)

mon/Paxos: only learn uncommitted value if it is in the future

If an older peer sends an uncommitted value, make sure we only take it
if it is in the future, and at least as new as any current uncommitted
value.

(Prior to the previous patch, peers could send values from long-past
rounds. The pn values are also bogus.)

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b3253a453c057914753846c77499f98d3845c58e)

mon/Paxos: only share uncommitted value if it is next

We may have an uncommitted value from our perspective (it is our lc + 1)
when the collector has a much larger lc (because we have been out for
the last few rounds). Only share an uncommitted value if it is in fact
the next value.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b26b7f6e5e02ac6beb66e3e34e177e6448cf91cf)

v0.61.6

mon/OSDMonitor: fix base case for 7fb3804fb workaround

After cluster creation, we have no full map stored and first_committed ==
1. In that case, there is no need for a full map, since we can get there
from OSDMap() and the incrementals.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao@inktank.com>
(cherry picked from commit e807770784175b05130bba938113fdbf874f152e)

mon: OSDMonitor: work around a full version bug introduced in 7fb3804fb

In 7fb3804fb860dcd0340dd3f7c39eec4315f8e4b6 we moved the full version
stashing logic to the encode_trim_extra() function.  However, we forgot
to update the osdmap's 'latest_full' key that should always point to
the latest osdmap full version.  This eventually degenerated into a missing
full version after a trim.  This patch works around this bug by looking
for the latest available full osdmap version in the store and updating
'latest_full' to its proper value.

Related-to: #5704
Backport: cuttlefish

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 97462a3213e5e15812c79afc0f54d697b6c498b1)

mon: OSDMonitor: update the osdmap's latest_full with the new full version

We used to do this on encode_full(), but since [1] we no longer rely on
PaxosService to manage the full maps for us. And we forgot to write down
the latest_full version to the store, leaving it in a truly outdated state.

[1] - 7fb3804fb860dcd0340dd3f7c39eec4315f8e4b6

Fixes: #5704
Backport: cuttlefish
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a815547ed3e5ffdbbb96c8c0c1b8d6dd8c62bfba)

mon: decline to scrub when paxos is not active

In f1ce8d7c955a2443111bf7d9e16b4c563d445712 we close a race between scrub
and paxos commit completion on the leader. The fix is nontrivial to
backport and probably not worthwhile; just avoid scrubbing at that time
for now.

Note that the actual fix for this is in commit
f1ce8d7c955a2443111bf7d9e16b4c563d445712.

Signed-off-by: Sage Weil <sage@inktank.com>

v0.61.5

ceph-disk: rely on /dev/disk/by-partuuid instead of special-casing journal symlinks

This was necessary when ceph-disk-udev didn't create the by-partuuid (and
other) symlinks for us, but now it is fragile and error-prone. (It also
appears to be broken on a certain customer RHEL VM.) See
d7f7d613512fe39ec883e11d201793c75ee05db1.

Instead, just use the by-partuuid symlinks that we spent all that ugly
effort generating.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 64379e701b3ed862c05f156539506d3382f77aa8)

mon: Monitor: StoreConverter: clearer debug message on 'needs_conversion()'

The previous debug message outputted the function's name, as often our
functions do. This was however a source of bewilderment, as users would
see those in logs and think their stores would need conversion. Changing
this message is trivial enough and it will make ceph users happier log
readers.

Backport: cuttlefish
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ad1392f68170b391d11df0ce5523c2d1fb57f60e)

mon: Monitor: do not reopen MonitorDBStore during conversion

We already open the store on ceph_mon.cc, before we start the conversion.
Given we are unable to reproduce this every time a conversion is triggered,
we are led to believe that this causes a race in leveldb that will lead
to 'store.db/LOCK' being locked upon the open this patch removes.

Regardless, reopening the db here is pointless as we already did it when
we reach Monitor::StoreConverter::convert().

Fixes: #5640
Backport: cuttlefish

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 036e6739a4e873863bae3d7d00f310c015dfcdb3)

messages/MClientReconnect: clear data when encoding

The MClientReconnect puts everything in the data payload portion of
the message and nothing in the front portion. That means that if the
message is resent (socket failure or something), the messenger thinks it
hasn't been encoded yet (front empty) and reencodes, which means
everything gets added (again) to the data portion.

Decoding keeps going until it runs out of data, so the second copy
means we decode garbage snap realms, leading to the crash in bug

Clearing data each time around resolves the problem, although it does
mean we do the encoding work multiple times. We could alternatively
(or also) stick some data in the front portion of the payload
(ignored), but that changes the wire protocol and I would rather not
do that.

Fixes: #4565
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 314cf046b0b787ca69665e8751eab6fe7adb4037)
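
A toy model of the re-encode problem, in Python. The class, the send() stand-in and the newline-separated payload are all assumptions made for the example; only the "front empty means not yet encoded" behaviour and the clear-before-encode fix come from the commit message.

```python
class ReconnectMessage:
    """Toy message whose payload lives entirely in `data`, none in `front`."""

    def __init__(self, realms):
        self.realms = list(realms)
        self.front = b""
        self.data = b""

    def encode_payload(self):
        self.data = b""  # the fix: start from an empty data section
        for realm in self.realms:
            self.data += realm.encode("ascii") + b"\n"


def send(msg):
    """Messenger stand-in: an empty front is taken to mean 'not yet encoded'."""
    if not msg.front:
        msg.encode_payload()
    return msg.data
```

Without the reset in encode_payload(), a second send() would append a second copy of every realm to data.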

mon: once sync full is chosen, make sure we don't change our mind

It is possible for a sequence like:

- probe
- first probe reply has paxos trim that indicates a full sync is
needed
- start sync
- clear store
- something happens that makes us abort and bootstrap (e.g., the
provider mon restarts)
- probe
- first probe reply has older paxos trim bound and we call an election
- on election completion, we crash because we have no data.

Non-determinism of the probe decision aside, we need to ensure that
the info we share during probe (fc, lc) is accurate, and that once we
clear the store we know we *must* do a full sync.

This is a backport of aa60f940ec1994a61624345586dc70d261688456.

Fixes: #5621
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

mon: do not scrub if scrub is in progress

This prevents an assert from unexpected scrub results from the previous
scrub on the leader.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 00ae543b3e32f89d906a0e934792cc5309f57696)

messages/MPGStats: do not set paxos version to osdmap epoch

The PaxosServiceMessage version field is meant for client-coordinated
ordering of messages when switching between monitors (and is rarely
used). Do not fill it with the osdmap epoch lest it be compared to a
pgmap version, which may cause the mon to (near) indefinitely put it on
a wait queue until the pgmap version catches up.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit b36338be43f43b6dd4ee87c97f2eaa23b467c386)

osd/OSDmap: fix OSDMap::Incremental::dump() for new pool names

The name is always present when pools are created, but not when they are
modified. Also, a name may be present with a new_pools entry if the pool
is just renamed. Separate it out completely in the dump.

Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3e4a29111e89588385e63f8d92ce3d67739dd679)

mon/PaxosService: prevent reads until initial service commit is done

Do not process reads (or, by PaxosService::dispatch() implication, writes)
until we have committed the initial service state. This avoids things like
EPERM due to missing keys when we race with mon creation, triggered by
teuthology tests doing their health check after startup.

Fixes: #5515
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit d08b6d6df7dba06dad73bdec2c945f24afc02717)

client: send all request put's through put_request()

Make sure all MetaRequest reference put's go through the same path that
releases inode references, including all of the error paths.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 87217e1e3cb2785b79d0dec49bd3f23a827551f5)

client: fix remaining Inode::put() caller, and make method pseudo-private

Not sure I can make this actually private and make Client::put_inode() a
friend method (making all of Client a friend would defeat the purpose).
This works well enough, though!

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9af3b86b25574e4d2cdfd43e61028cffa19bdeb1)

client: use put_inode on MetaRequest inode refs

When we drop the request inode refs, we need to use put_inode() to ensure
they get cleaned up properly (removed from inode_map, caps released, etc.).
Do this explicitly here (as we do with all other inode put() paths that
matter).

Fixes: #5381
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 81bee6487fb1ce9e090b030d61bda128a3cf4982)

mon: be smarter about calculating last_epoch_clean lower bound

We need to take PGs whose mapping has not changed in a long time into
account. For them, the pg state will indicate it was clean at the time of
the report, in which case we can use that as a lower-bound on their actual
latest epoch clean. If they are not currently clean (at report time), use
the last_epoch_clean value.

Fixes: #5519
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit cc0006deee3153e06ddd220bf8a40358ba830135)
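
The lower-bound rule can be sketched in Python as below. The dict keys and the function name are hypothetical, and returning None for an empty input is a choice made for the example.

```python
def last_epoch_clean_floor(pg_stats):
    """Return the cluster-wide lower bound on last_epoch_clean.

    pg_stats: iterable of dicts with 'clean' (bool), 'reported_epoch'
    and 'last_epoch_clean' (ints). Returns None if there are no PGs.
    """
    bounds = []
    for st in pg_stats:
        if st["clean"]:
            # Clean at report time, so it was clean at least up to then.
            bounds.append(st["reported_epoch"])
        else:
            bounds.append(st["last_epoch_clean"])
    return min(bounds) if bounds else None
```

A PG that has been clean and unchanged for a long time then no longer drags the bound down to its ancient last_epoch_clean value.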

osd: report pg stats to mon at least every N (=500) epochs

The mon needs a moderately accurate last_epoch_clean value in order to trim
old osdmaps.  To prevent a PG that hasn't peered or received IO in forever
from preventing this, send pg stats at some minimum frequency.  This will
increase the pg stat report workload for the mon over an idle pool, but
should be no worse than a cluster that is getting actual IO and sees these
updates from normal stat updates.

This makes the reported update a bit more aggressive/useful in that the epoch
is the last map epoch processed by this PG and not just one that is >= the
current interval.  Note that the semantics of this field are pretty useless
at this point.

See #5519

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit da81228cc73c95737f26c630e5c3eccf6ae1aaec)

osd: fix warning

From 653e04a79430317e275dd77a46c2b17c788b860b

Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit bc291d3fc3fc1cac838565cbe0f25f71d855a6e3)

Merge remote-tracking branch 'gh/wip-mon-sync-2' into cuttlefish

Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>

Get device-by-path by looking for it instead of assuming 3rd entry.

On some systems (virtual machines so far) the device-by-path entry
from udevadm is not always in the same spot, so actually look for
the right entry instead of blindly assuming that it is a specific
field in the output.

Signed-off-by: Sandon Van Ness <sandon@inktank.com>
Reviewed-by: Gary Lowell <gary.lowell@inktank.com>

Merge remote-tracking branch 'gh/cuttlefish' into wip-mon-sync-2

osd: limit number of inc osdmaps sent to peers, clients

We should not send an unbounded number of inc maps to our peers or clients.
In particular, if a peer is not contacted for a while, we may think they
have a very old map (say, 10000 epochs ago) and send thousands of inc maps
when the distribution shifts and we need to peer.

Note that if we do not send enough maps, the peers will make do by
requesting the map from somewhere else (currently the mon). Regardless
of the source, however, we must limit the amount that we speculatively
share as it usually is not needed.

Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 653e04a79430317e275dd77a46c2b17c788b860b)
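
One way to bound the number of incrementals, sketched in Python. The function and parameter names are hypothetical, and the choice to keep the newest epochs (rather than the oldest) is an assumption of this sketch; the commit message only says the amount must be limited.

```python
def inc_map_range(peer_epoch, current_epoch, max_maps):
    """Return (first, last) epochs of the incrementals to send, or None."""
    if peer_epoch >= current_epoch:
        return None  # peer is already up to date
    first = peer_epoch + 1
    if current_epoch - peer_epoch > max_maps:
        # Assumption: keep only the newest max_maps incrementals.
        first = current_epoch - max_maps + 1
    return (first, current_epoch)
```

A peer that is thought to be 10000 epochs behind then receives at most max_maps incrementals and fetches anything else it needs from another source.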

rgw: Fix return value for swift user not found

http://tracker.ceph.com/issues/1779 fixes #1779

Adjust the return value from rgw_get_user_info_by_swift call
in RGW_SWIFT_Auth_Get::execute() to have the correct
return code in response.
(cherry picked from commit 4089001de1f22d6acd0b9f09996b71c716235551)

mon/OSDMonitor: make 'osd crush rm ...' slightly more idempotent

This is a manual backport of 18a624fd8b90d9959de51f07622cf0839e6bd9aa.
Do not return immediately if we are looking at uncommitted state.

Signed-off-by: Sage Weil <sage@inktank.com>

mon/OSDMonitor: fix base case for loading full osdmap

Right after cluster creation, first_committed is 1 and latest stashed is 0,
but we don't have the initial full map yet. Thereafter, we do (because we
write it with trim). Fixes afd6c7d8247075003e5be439ad59976c3d123218.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 43fa7aabf1f7e5deb844c1f52d451bab9e7d1006)

mon: fix osdmap stash, trim to retain complete history of full maps

The current interaction between sync and stashing full osdmaps only on
active mons means that a sync can result in an incomplete osdmap_full
history:

- mon.c starts a full sync
- during sync, active osdmap service should_stash_full() is true and
   includes a full in the txn
- mon.c sync finishes
- mon.c update_from_paxos gets "latest" stashed that it got from the
   paxos txn
- mon.c does *not* walk to previous inc maps to complete its collection
   of full maps.

To fix this, we disable the periodic/random stash of full maps by the
osdmap service.

This introduces a new problem: we must have at least one full map (the first
one) in order for a mon that just synced to build its full collection.
Extend the encode_trim() process to allow the osdmap service to include
the oldest full map with the trim txn.  This is more complex than just
writing the full maps in the txn, but cheaper--we only write the full
map at trim time.

This *might* be related to previous bugs where the full osdmap was
missing, or cases where leveldb keys seemed to 'disappear'.

Fixes: #5512
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit afd6c7d8247075003e5be439ad59976c3d123218)

mon: implement simple 'scrub' command

Compare all keys within the sync'ed prefixes across members of the quorum
and compare the key counts and CRC for inconsistencies.

Currently this is a one-shot inefficient hammer. We'll want to make this
work in chunks before it is usable in production environments.

Protect with a feature bit to avoid sending MMonScrub to mons who can't
decode it.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit a9906641a1dce150203b72682da05651e4d68ff5)

Conflicts:

src/mon/MonCommands.h
src/mon/Monitor.cc
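
The comparison can be sketched in Python with a dict standing in for each monitor's store. The store layout, the function names and the use of zlib's CRC-32 are assumptions made for the example, not the MMonScrub implementation.

```python
import zlib


def scrub_summary(store, prefixes):
    """Summarize a store as {prefix: (key_count, crc)}.

    store: dict mapping (prefix, key) -> bytes value.
    """
    summary = {}
    for prefix in prefixes:
        count, crc = 0, 0
        for (p, key) in sorted(store):
            if p != prefix:
                continue
            count += 1
            # Fold both the key and its value into the running CRC.
            crc = zlib.crc32(key.encode("utf-8"), crc)
            crc = zlib.crc32(store[(p, key)], crc)
        summary[prefix] = (count, crc)
    return summary


def inconsistent_prefixes(summaries):
    """summaries: {mon_name: summary}. Prefixes on which members disagree."""
    names = sorted(summaries)
    reference = summaries[names[0]]
    return sorted(
        prefix for prefix in reference
        if any(summaries[n].get(prefix) != reference[prefix] for n in names[1:])
    )
```

Walking every key in one pass is what makes this a one-shot hammer; a chunked variant would summarize and compare bounded key ranges instead.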

Elector.h: features are 64 bit

Fixes: #5497
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Luis <joao.luis@inktank.com>
(cherry picked from commit 3564e304e3f50642e4d9ff25e529d5fc60629093)

ceph_features.h: declare all features as ULL

Otherwise, the first 32 get |'d together as ints. Then, the result
((int)-1) is sign extended to ((long long int)-1) before being |'d
with the 1LL entries. This results in ~((uint64_t)0).
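
The effect can be reproduced in isolation. The sketch below does not use the
real ceph_features.h definitions. It assumes a 32-bit int and hypothetical
feature bits 0 through 32, and ORs them together once with int-typed low bits
and once with every bit declared as ULL.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Assumption for this sketch: int is 32 bits wide.
static_assert(sizeof(int) * CHAR_BIT == 32, "sketch assumes 32-bit int");

// Bits 0..31 are combined as int, then OR'd with a long long bit 32.
uint64_t or_features_as_int() {
  int low = INT_MIN;                  // bit 31 as an int is the sign bit
  for (int bit = 0; bit < 31; ++bit)
    low |= (1 << bit);                // bits 0..30
  // low is now (int)-1. The | below converts it to long long first, which
  // sign-extends it to (long long)-1, so every high bit ends up set.
  return static_cast<uint64_t>(low | (1LL << 32));
}

// Every bit is declared unsigned long long, so nothing is sign-extended.
uint64_t or_features_as_ull() {
  uint64_t all = 0;
  for (int bit = 0; bit <= 32; ++bit)
    all |= (1ULL << bit);
  return all;
}

// Usage sketch: the int version claims all 64 features, the ULL one only 33.
void demo() {
  assert(or_features_as_int() == ~uint64_t(0));
  assert(or_features_as_ull() == 0x1FFFFFFFFULL);
}
```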

Fixes: #5497
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Luis <joao.luis@inktank.com>
(cherry picked from commit 4255b5c2fb54ae40c53284b3ab700fdfc7e61748)

Pipe: use uint64_t not unsigned when setting features

Fixes: #5497
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Luis <joao.luis@inktank.com>
(cherry picked from commit bc3e2f09f8860555d8b3b49b2eea164b4118d817)

client: remove O_LAZY

The once-upon-a-time unique O_LAZY value I chose forever ago is now
O_NOATIME, which means that some clients are choosing relaxed
consistency without meaning to.

It is highly unlikely that a real O_LAZY will ever exist, and we can
select it in the ceph case with the ioctl or libcephfs call, so drop
any support for doing this via open(2) flags.

Update doc/lazy_posix.txt file re: lazy io.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 94afedf02d07ad4678222aa66289a74b87768810)

osd/osd_types: fix pg_stat_t::dump for last_epoch_clean

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 69a55445439fce0dd6a3d32ff4bf436da42f1b11)

mon: remove bad assert about monmap version

It is possible to start a sync when our newest monmap is 0. Usually we see
e0 from probe, but that isn't always published as part of the very first
paxos transaction due to the way PaxosService::_active generates its
first commit.

In any case, having e0 here is harmless.

Fixes: #5509
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 85a1d6cc5d3852c94d1287b566656c5b5024fa13)

mon/Paxos: fix sync restart

If we have a sync going, and an election intervenes, the client will
try to continue by sending a new start_chunks request. In order to
ensure that we get all of the paxos commits from our original starting
point (and thus properly update the keys from which they started),
only pay attention if they *also* send their current last_committed
version. Otherwise, start them at the beginning.

Signed-off-by: Sage Weil <sage@inktank.com>

mon: uninline _trim_enable and Paxos::trim_{enable,disable} so we can debug them

Signed-off-by: Sage Weil <sage@inktank.com>

mon/Paxos: increase paxos max join drift

A value of 10 is too aggressive for large, long-running syncs. 100 is
about 2 minutes of activity at most, which should be a more forgiving
buffer.

Signed-off-by: Sage Weil <sage@inktank.com>

mon/Paxos: configure minimum paxos txns separately

We were using paxos_max_join_drift to control the minimum number of
paxos transactions to keep around. Instead, make this explicit, and
separate from the join drift.

Signed-off-by: Sage Weil <sage@inktank.com>

mon: include any new paxos commits in each sync CHUNK message

We already take note of the paxos version when we begin the sync.  As
sync progresses and there are new paxos commits/txns, include those
and update last_committed, so that when sync completes we will have
a full view of everything that happened during sync.

Note that this does not introduce any compatibility change.  This change
*only* affects the provider.  The key difference is that at the end
of the sync, the provider will set version to the latest version, and
not the version from the start of the sync (as was done previously).

Signed-off-by: Sage Weil <sage@inktank.com>

mon/MonitorDBStore: expose get_chunk_tx()

Allow users to get the transaction unencoded.

Signed-off-by: Sage Weil <sage@inktank.com>

mon: enable leveldb cache by default

256 MB is not as large as the upstream 512 MB, but will help significantly and
be less disruptive for existing cuttlefish clusters.

Sort-of backport of e93730b7ffa48b53c8da2f439a60cb6805facf5a.

Signed-off-by: Sage Weil <sage@inktank.com>

mon/Paxos: make 'paxos trim disabled max versions' much much larger

108000 is about 30 hours if paxos is going full-bore (1 proposal/second).
That ought to be pretty safe. Otherwise, we start trimming too soon and a
slow sync will just have to restart when it finishes.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 71ebfe7e1abe4795b46cf00dfe1b03d1893368b0)

Conflicts:

src/common/config_opts.h

mon: do not reopen MonitorDBStore during startup

leveldb doesn't seem to like this when it races with an internal compaction
attempt (see below).  Instead, let the store get opened by the ceph_mon
caller, and pull a bit of the logic into the caller to make the flow a
little easier to follow.

    -2> 2013-06-25 17:49:25.184490 7f4d439f8780 10 needs_conversion
    -1> 2013-06-25 17:49:25.184495 7f4d4065c700  5 asok(0x13b1460) entry start
     0> 2013-06-25 17:49:25.316908 7f4d3fe5b700 -1 *** Caught signal (Segmentation fault) **
in thread 7f4d3fe5b700

ceph version 0.64-667-g089cba8 (089cba8fc0e8ae8aef9a3111cba7342ecd0f8314)
1: ceph-mon() [0x649f0a]
2: (()+0xfcb0) [0x7f4d435dccb0]
3: (leveldb::Table::BlockReader(void*, leveldb::ReadOptions const&, leveldb::Slice const&)+0x154) [0x806e54]
4: ceph-mon() [0x808840]
5: ceph-mon() [0x808b39]
6: ceph-mon() [0x806540]
7: (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0xdd) [0x7f363d]
8: (leveldb::DBImpl::BackgroundCompaction()+0x2c0) [0x7f4210]
9: (leveldb::DBImpl::BackgroundCall()+0x68) [0x7f4cc8]
10: ceph-mon() [0x80b3af]
11: (()+0x7e9a) [0x7f4d435d4e9a]
12: (clone()+0x6d) [0x7f4d4196bccd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ea1f316e5de21487ae034a1aa929068ba23ac525)

sysvinit, upstart: handle symlinks to dirs in /var/lib/ceph/*

Match a symlink to a dir, not just dirs. This fixes the osd case where,
e.g., an osd is created in /data/osd$id and ceph-disk makes a symlink to it
from /var/lib/ceph/osd/ceph-$id.
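
The init scripts themselves are shell, but the distinction is the same one
that stat(2) and lstat(2) draw. As a minimal sketch with hypothetical helper
names, the first check below follows symlinks and so accepts a link to a
directory, while the second inspects the link itself and rejects it.

```cpp
#include <cassert>
#include <stdlib.h>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// stat() follows symlinks: true for a directory or a symlink to one.
bool is_dir_following_links(const char* path) {
  struct stat st;
  return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

// lstat() looks at the link itself: true only for a real directory.
bool is_plain_dir(const char* path) {
  struct stat st;
  return lstat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

// Usage sketch: build a scratch directory holding "real" and a symlink "link"
// that points at it, then compare the two checks. Paths are hypothetical.
bool demo() {
  char tmpl[] = "/tmp/symlink_sketch_XXXXXX";
  if (!mkdtemp(tmpl))
    return false;
  std::string base(tmpl);
  std::string real = base + "/real";
  std::string link = base + "/link";
  if (mkdir(real.c_str(), 0700) != 0 ||
      symlink(real.c_str(), link.c_str()) != 0)
    return false;
  bool ok = is_plain_dir(real.c_str()) &&
            is_dir_following_links(link.c_str()) &&
            !is_plain_dir(link.c_str());
  unlink(link.c_str());
  rmdir(real.c_str());
  rmdir(base.c_str());
  return ok;
}
```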

Fix proposed by Matt Thompson <matt.thompson@mandiant.com>; extended to
include the upstart users too.

Fixes: #5490
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 87c98e92d1375c8bc76196bbbf06f677bef95e64)

rgw: add RGWFormatter_Plain allocation to sidestep cranky strlen()

Valgrind complains about an invalid read when we don't pad the allocation,
and because it is inlined we can't whitelist it for valgrind. Work around
the warning by just padding our allocations a bit.

Fixes: #5346
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 49ff63b1750789070a8c6fef830c9526ae0f6d9f)

mds: warn on unconnected snap realms

When there is more than one active MDS, restarting an MDS quite often
triggers the assertion "reconnected_snaprealms.empty()". If there is no
snapshot in the FS, the items left in reconnected_snaprealms should be
other MDS' mdsdir. I think it's harmless.

If there are snapshots in the FS, the assertion probably can catch
real bugs. But at present the snapshot feature is broken and fixing it
is non-trivial, so replace the assertion with a warning.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
(cherry picked from commit 26effc0e583b0a3dade6ec81ef26dec1c94ac8b2)

mon/PGMonitor: use post_paxos_update, not init, to refresh from osdmap

We do two things here:
- make init a one-time unconditional init method, which is what the
   health service expects/needs.
- switch PGMonitor::init to be post_paxos_update() which is called after
   the other services update, which is what PGMonitor really needs.

This is a new version of the fix originally in commit
a2fe0137946541e7b3b537698e1865fbce974ca6 (and those around it).  That is,
this re-fixes a problem where osds do not see pg creates from their
subscribe due to map_pg_creates() not getting called.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e635c47851d185eda557e36bdc4bf3775f7b87a2)

Conflicts:
src/mon/PGMonitor.cc
src/mon/PGMonitor.h

mon/PaxosService: add post_paxos_update() hook

Some services need to update internal state based on other services'
state, and thus need to be run after everyone has pulled their info out of
paxos.
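
A minimal sketch of the two-pass idea, assuming hypothetical stand-in classes
rather than the real PaxosService hierarchy: every service first pulls its own
state out of paxos, and only then do the services that depend on their peers
run their post-update hook.

```cpp
#include <cassert>
#include <vector>

// Hypothetical base class; only the call order is the point here.
struct Service {
  virtual ~Service() {}
  virtual void update_from_paxos() = 0;
  virtual void post_paxos_update() {}  // optional hook, no-op by default
};

// Stand-in for a service that owns some committed state.
struct OsdService : Service {
  int committed_epoch = 0;  // what "paxos" holds
  int epoch = 0;            // what the service has loaded
  void update_from_paxos() override { epoch = committed_epoch; }
};

// Stand-in for a service that needs the other service's fresh state.
struct PgService : Service {
  const OsdService& osd;
  int seen_osd_epoch = 0;
  explicit PgService(const OsdService& o) : osd(o) {}
  void update_from_paxos() override {}
  void post_paxos_update() override { seen_osd_epoch = osd.epoch; }
};

// Pass 1: everyone refreshes from paxos. Pass 2: cross-service updates.
void refresh_all(const std::vector<Service*>& services) {
  for (Service* s : services)
    s->update_from_paxos();
  for (Service* s : services)
    s->post_paxos_update();
}

// Usage sketch: the dependent service is listed first, yet still sees the
// new epoch because its hook runs after every service has refreshed.
void demo() {
  OsdService osd;
  osd.committed_epoch = 7;
  PgService pg(osd);
  std::vector<Service*> services = {&pg, &osd};
  refresh_all(services);
  assert(pg.seen_osd_epoch == 7);
}
```

With a single pass, the result would depend on the order in which services
are iterated, which is the dependency the hook removes.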

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 131686980f0a930d5de7cbce8234fead5bd438b6)

ceph-disk: s/else if/elif/

Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Joao Luis <joao.luis@inktank.com>
(cherry picked from commit bd8255a750de08c1b8ee5e9c9a0a1b9b16171462)
(cherry picked from commit 9e604ee6943fdb131978afbec51321050faddfc6)

rgw: fix radosgw-admin buckets list

Fixes: #5455
Backport: cuttlefish
This commit fixes a regression, where radosgw-admin buckets list
operation wasn't returning any data.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e1f9fe58d2860fcbb18c92d3eb3946236b49a6ce)

ceph-disk: use unix lock instead of lockfile class

The lockfile class relies on file system trickery to get safe mutual
exclusion. However, the unix syscalls do this for us. More
importantly, the unix locks go away when the owning process dies, which
is behavior that we want here.
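
ceph-disk is a Python tool, and this message does not say which locking call
it uses. The sketch below shows the same property with flock(2) as an assumed
example of a kernel-managed lock. The helper names and the lock path are
hypothetical.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

// Open `path` and try to take an exclusive advisory lock without blocking.
// Returns the locked descriptor, or -1 if the lock is held elsewhere.
int try_lock_file(const char* path) {
  int fd = open(path, O_RDWR | O_CREAT, 0600);
  if (fd < 0)
    return -1;
  if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
    close(fd);
    return -1;
  }
  return fd;
}

// Closing the descriptor drops the lock. The kernel closes descriptors when
// a process exits, so a crashed holder cannot leave a stale lock behind.
void release_lock(int fd) { close(fd); }

// Usage sketch: a second attempt fails while the first lock is held and
// succeeds once it has been released.
void demo() {
  const char* path = "/tmp/flock_sketch.lock";  // hypothetical lock path
  int first = try_lock_file(path);
  assert(first >= 0);
  assert(try_lock_file(path) == -1);
  release_lock(first);
  int again = try_lock_file(path);
  assert(again >= 0);
  release_lock(again);
  unlink(path);
}
```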

Fixes: #5387
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 2a4953b697a3464862fd3913336edfd7eede2487)