git.apps.os.sepia.ceph.com Git

osd: leave osd_lock locked in shutdown()

No callers expect the lock to be dropped.

Fixes: #3816
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 98a763123240803741ac9f67846b8f405f1b005b)

msg: fix entity_addr_t::is_same_host() for IPv6

We weren't checking the memcmp return value properly! Aie...

Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c8dd2b67b39a8c70e48441ecd1a5cc3c6200ae97)

osd: requeue pg waiters at the front of the finished queue

We could have a sequence like:

- op1
- notify
- op2

in the finished queue. Op1 gets put on waiting_for_pg, the notify
creates the pg and requeues op1 (at the end), op2 is handled, and
finally op1 is handled. That breaks ordering; see #2947.

Instead, when we wake up a pg, queue the waiting messages at the front
of the dispatch queue.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 56c5a07708d52de1699585c9560cff8b4e993d0a)
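
The ordering problem above can be reproduced with a small queue simulation. This is only a sketch: the function name `handle_order` and the string items are made up for illustration and do not correspond to the OSD code.

```python
from collections import deque


def handle_order(requeue_at_front):
    """Return the order in which ops are handled for the op1/notify/op2 case."""
    finished = deque(["op1", "notify", "op2"])
    waiting_for_pg = []  # ops parked until the pg exists
    pg_exists = False
    handled = []
    while finished:
        item = finished.popleft()
        if item == "notify":
            pg_exists = True
            # Waking the pg requeues the parked ops, at the front or the back.
            if requeue_at_front:
                finished.extendleft(reversed(waiting_for_pg))
            else:
                finished.extend(waiting_for_pg)
            waiting_for_pg.clear()
        elif not pg_exists:
            waiting_for_pg.append(item)
        else:
            handled.append(item)
    return handled
```

Requeueing at the back handles op2 before op1; requeueing at the front keeps the original order.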

osd: pull requeued requests off one at a time

Pull items off the finished queue one at a time.  In certain cases, an
event may result in new items getting added to the finished queue that
will be put at the *front* instead of the back.  See latest incarnation
of #2947.

Note that this is a significant change in behavior in that we can
theoretically starve if an event keeps resulting in new events getting
generated.  Beware!

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit f1841e4189fce70ef5722d508289e516faa9af6a)
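
A sketch of why the drain loop matters once items can be requeued at the front. The names `drain` and `make_dispatch` are hypothetical; the batch mode stands in for taking the whole queue in one go, which is an assumption about the earlier behavior.

```python
from collections import deque


def make_dispatch():
    """Build a dispatcher that parks ops until a notify creates the pg."""
    parked = []
    state = {"pg_exists": False}

    def dispatch(item, finished, handled):
        if item == "notify":
            state["pg_exists"] = True
            # Requeue parked ops at the *front* of the shared queue.
            finished.extendleft(reversed(parked))
            parked.clear()
        elif not state["pg_exists"]:
            parked.append(item)
        else:
            handled.append(item)

    return dispatch


def drain(finished, dispatch, one_at_a_time):
    """Drain the finished queue, either item by item or in whole batches."""
    handled = []
    while finished:
        if one_at_a_time:
            batch = [finished.popleft()]
        else:
            batch = list(finished)
            finished.clear()
        for item in batch:
            dispatch(item, finished, handled)
    return handled
```

In batch mode the items requeued at the front are only seen after the rest of the batch has run, so op2 overtakes op1. Note that the one-at-a-time loop keeps running for as long as dispatching keeps adding items, which is the starvation risk mentioned above.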

mds: open mydir after replay

In certain cases, we may replay the journal and not end up with the
dirfrag for mydir open. This is fine--we just need to open it up and
fetch it below.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e51299fbce6bdc3d6ec736e949ba8643afc965ec)

mds: use inode_t::layout for dir layout policy

Remove the default_file_layout struct, which was just a ceph_file_layout,
and store it in the inode_t. Rip out all the annoying code that put this
on the heap.

To aid in this usage, add a clear_layout() function to inode_t.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

mds: parse ceph.*.layout vxattr key/value content

Use qi to parse a strictly formatted set of key/value pairs. Be picky
about whitespace. Any subset of recognized keys is allowed. Parse the
same set of keys as the ceph.*.layout.* vxattrs.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5551aa5b3b5c2e9e7006476b9cd8cc181d2c9a04)
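
For illustration only, a strict key/value parser in the same spirit. The MDS uses qi (Boost.Spirit); this Python sketch does not. The key names and the single-space separator are assumptions modeled on the ceph.*.layout.* vxattr names, and `parse_layout` is a made-up helper.

```python
import re

# Assumed key set, mirroring the ceph.*.layout.* vxattr names.
NUMERIC_KEYS = ("stripe_unit", "stripe_count", "object_size")
STRING_KEYS = ("pool",)

_PAIR = re.compile(r"([a-z_]+)=([^\s=]+)")


def parse_layout(text):
    """Parse 'key=value key=value ...' strictly; raise ValueError otherwise."""
    layout = {}
    # Splitting on a single space turns doubled, leading, or trailing
    # whitespace into an empty token, which fails the match below.
    for token in text.split(" "):
        m = _PAIR.fullmatch(token)
        if m is None:
            raise ValueError(f"malformed pair: {token!r}")
        key, value = m.groups()
        if key in layout:
            raise ValueError(f"duplicate key: {key}")
        if key in NUMERIC_KEYS:
            if not re.fullmatch(r"[0-9]+", value):
                raise ValueError(f"{key} needs an unsigned integer")
            layout[key] = int(value)
        elif key in STRING_KEYS:
            layout[key] = value
        else:
            raise ValueError(f"unrecognized key: {key}")
    return layout
```

Any subset of the known keys parses; extra whitespace, unknown keys, and non-numeric sizes are rejected.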

rgw: fix multipart uploads listing

Fixes: #4177
Backport: bobtail
Listing multipart uploads had a typo, and was requiring the
wrong resource (uploadId instead of uploads).

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit db99fb4417b87301a69cb37b00c35c838b77197e)

rgw: don't copy object when it's copied into itself

Fixes: #4150
Backport: bobtail

When an object is copied into itself, it is not fully copied: the tail
reference count stays the same and the head part is rewritten.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 34f885be536d0ac89c10fd29b1518751d2ffc547)

PG: remove weirdness log for last_complete < log.tail

In the case of a divergent object prior to log.tail,
last_complete may end up before log.tail.

Backport: bobtail
Fixes: #4174
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit dbadb3e2921297882c5836c67ca32bb8ecdc75db)

Conflicts:

src/osd/PG.cc

Strip any trailing whitespace from rbd showmapped

More recent versions of ceph append a bit of whitespace to the line
after the name of the /dev/rbdX device; this causes the monitor check
to fail as it can't find the device name due to the whitespace.

This fix excludes any characters after the /dev/rbdN match.
(cherry picked from commit ad84ea07cac5096de38b51b8fc452c99f016b8d8)
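
The idea of matching only the device name can be sketched with a regular expression. This is not the script that was patched; `mapped_device` is a hypothetical helper, and the sample line in the usage note is illustrative rather than captured output.

```python
import re

# Match the device path and nothing after it.
_DEVICE = re.compile(r"/dev/rbd\d+")


def mapped_device(line):
    """Return the /dev/rbdN path found in a showmapped line, or None."""
    m = _DEVICE.search(line)
    return m.group(0) if m else None
```

For a line such as `"0 rbd image1 - /dev/rbd0   "` this returns `/dev/rbd0` without the trailing whitespace.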

Merge pull request #64 from dalgaaf/wip-bobtail-memleaks

cherry-pick some memleak fixes from master to bobtail

rgw/rgw_rest.cc: fix 4K memory leak

Fix 4K memory leak in case RGWClientIO::read() fails in
read_all_chunked_input().

Error from cppcheck was:
Checking src/rgw/rgw_rest.cc...
[src/rgw/rgw_rest.cc:688]: (error) Memory leak: data

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
(cherry picked from commit 89df090e04ef9fc5aae29122df106b0347786fab)

SyntheticClient.cc: fix some memory leaks in the error handling

Fix some memory leaks in case of error handling due to failed
client->open() calls.

Error from cppcheck was:
[src/client/SyntheticClient.cc:1980]: (error) Memory leak: buf
[src/client/SyntheticClient.cc:2040]: (error) Memory leak: buf
[src/client/SyntheticClient.cc:2090]: (error) Memory leak: buf
(cherry picked from commit f0ba80756d1c3c313014ad7be18191981fb545be)

rgw/rgw_xml.cc: fix realloc memory leak in error case

Fix error from cppcheck:

[src/rgw/rgw_xml.cc:212]: (error) Common realloc mistake: 'buf'
nulled but not freed upon failure

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
(cherry picked from commit d48cc789ea075ba2745754035640ada4131b2119)

os/FileStore.cc: fix realloc memory leak in error case

Fix error from cppcheck:

[src/os/FileStore.cc:512]: (error) Common realloc mistake: 'fiemap'
nulled but not freed upon failure

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
(cherry picked from commit c92a0f552587a232f66620170660d6b2ab6fb3a5)

common/fiemap.cc: fix realloc memory leak

Fix error from cppcheck:

[src/common/fiemap.cc:73]: (error) Common realloc mistake: 'fiemap'
nulled but not freed upon failure

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
(cherry picked from commit f26f1470e7af36fa1eb8dc59c8a7c62c3c3a22ba)

osd/OSDCap: add unit test for parsing pools/objects with _ and -

Hunting #4122, where a user saw

2013-02-13 19:39:25.467916 7f766fdb4700 10 osd.0 10 session 0x2c8cc60 client.libvirt has caps osdcap[grant(object_prefix rbd^@children class-read),grant(pool libvirt^@pool^@test rwx)] 'allow class-read object_prefix rbd_children, allow pool libvirt-pool-test rwx'

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 2ce28ef1d7f95e71e1043912dfa269ea3b0d1599)
(cherry picked from commit a6534bc8a0247418d5263b765772d5266f99229c)

osd/OSDCap: tweak unquoted_word parsing in osd caps

Newer versions of spirit (1.49.0-3.1ubuntu1.1 in quantal, in particular)
dislike the construct with alnum and replace the - and _ with '\0' in the
resulting string.

Fixes: #4122
Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 6c504d96c1e4fbb67578fba0666ca453b939c218)

v0.56.3

rgw: change json formatting for swift list container

Fixes: #4048
There is some difference in the way swift formats the
xml output and the json output for list container. In
xml the entity is named 'name' and in json it is named
'subdir'.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 3e4d79fe42dfc3ca70dc4d5d2aff5223f62eb34b)

librbd: unprotect any non-unprotected snapshot

Include snapshots in the UNPROTECTING state as well, which can occur
after an unprotect is interrupted.

Fixes: #4100
Backport: bobtail
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit fe283813b44a7c45def6768ea0788a3a0635957e)

java: make CephMountTest use user.* xattr names

Changes to the xattr code in Ceph require
a few tweaks to existing test cases.
Specifically, there is now a ceph.file.layout
xattr by default and user defined xattrs
are prepended with "user."

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joe Buck <jbbuck@gmail.com>
Reviewed-by: Noah Watkins <noahwatkins@gmail.com>

mon: fix typo in C_Stats

Broken by previous commit.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3cf3710be0b4cccc8de152a97be50d983c35116d)

mon: retry PGStats message on EAGAIN

If we get EAGAIN from a paxos restart/election/whatever, we should
restart the message instead of just blindly acking it.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Luis <joao.luis@inktank.com>
(cherry picked from commit 4837063d447afb45554f55bb6fde1c97559acd4b)

mon: handle -EAGAIN in completion contexts

We can get ECANCELED, EAGAIN, or success out of the completion contexts,
but in the EAGAIN case (meaning there was an election) we were sending
a success to the client. This resulted in client hangs and all-around
confusion when the monitor cluster was thrashing.

Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Luis <joao.luis@inktank.com>
(cherry picked from commit 17827769f1fe6d7c4838253fcec3b3a4ad288f41)
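
A minimal sketch of the three-way handling described above. The function `finish_request` and its callbacks are hypothetical, and silently dropping the request on ECANCELED is an assumption; the commit message only states that EAGAIN must not be reported as success.

```python
import errno


def finish_request(rc, send_reply, retry):
    """Act on a completion result rc (0 on success, negative errno otherwise)."""
    if rc == -errno.EAGAIN:
        # An election intervened: run the request again instead of acking it.
        retry()
        return "retried"
    if rc == -errno.ECANCELED:
        # Assumed behavior: the request was cancelled, so send nothing.
        return "dropped"
    send_reply(rc)
    return "replied"
```

Only a real result reaches the client; the EAGAIN path never produces an ack.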

osd: only share maps on hb connection if OSD_HBMSGS feature is set

Back in 1bc419a7affb056540ba8f9b332b6ff9380b37af we started sharing maps
with dead osds via the heartbeat connection, but old code will crash on an
unexpected message. Only do this if the OSD_HBMSGS feature is present.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 302b26ff70ee5539da3dcb2e5614e2b7e83b9dcd)

osd: tolerate unexpected messages on the heartbeat interface

We should note but not crash on unexpected messages. Announce this awesome
new "capability" via a feature bit.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit afda30aeaae0a65f83c6886658354ad2b57c4c43)

Conflicts:

src/include/ceph_features.h

Merge remote-tracking branch 'gh/wip-bobtail-osd-msgr' into bobtail

test_libcephfs: fix xattr test

Ignore the ceph.*.layout xattrs.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b0d4dd21c7be86eb47728a4702a3c67ca44424ac)

radosgw-admin: fix cli test

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 1b05b0edbac09d1d7cf0da2e536829df05e48573)

Merge remote-tracking branch 'gh/wip-bobtail-vxattrs' into bobtail

mon: enforce reweight be between 0..1

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Luis <joao.luis@inktank.com>
(cherry picked from commit 4e29c95d6f61daa838888840cef0cceedc0fcfdd)

PG: dirty_info on handle_activate_map

We need to make sure the pg epoch is persisted during
activate_map.

Backport: bobtail
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit dbce1d0dc919e221523bd44e1d0834711da1577d)

osd: flush peering queue (consume maps) prior to boot

If the osd itself is behind on many maps during boot, it will get more and
(as part of that) flush the peering wq to ensure the pgs consume them.
However, it is possible for the OSD to have the latest/recent maps, but pgs to be
behind, and to jump directly to boot and join. The OSD is then laggy and
unresponsive because the peering wq is way behind.

To avoid this, call consume_map() (kick the peering wq) at the end of
init and flush it to ensure we are *internally* all caught up before we
consider joining the cluster.

I'm pretty sure this is the root cause of #3905 and possibly #3995.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit af95d934b039d65d3667fc022e2ecaebba107b01)

rgw: a tool to fix clobbered bucket info in user's bucket list

This fixes bad entries in user's bucket list that may have occurred
due to issue #4039. Syntax:

$ radosgw-admin user check --uid=<uid> [--fix]

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 9cb6c33f0e2281b66cc690a28e08459f2e62ca13)

Conflicts:
src/rgw/rgw_admin.cc

rgw: bucket recreation should not clobber bucket info

Fixes: #4039
User's list of buckets is getting modified even if bucket already
exists. This fix removes the newly created directory object, and
makes sure that user info's data points at the correct bucket.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 9d006ec40ced9d97b590ee07ca9171f0c9bec6e9)

Conflicts:
src/rgw/rgw_op.cc
src/rgw/rgw_rados.cc

rgw: a tool to fix buckets with leaked multipart references

Checks specified bucket for the #4011 symptoms, optionally fix
the issue.

Syntax:
radosgw-admin bucket check --bucket=<bucket> [--fix]

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 2d8faf8e5f15e833e6b556b0f3c4ac92e4a4151e)

Conflicts:
src/rgw/rgw_admin.cc
src/rgw/rgw_rados.h

rgw: radosgw-admin object unlink

Add a radosgw-admin option to remove object from bucket index

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 16235a7acb9543d60470170bb2a09956364626cd)

Conflicts:
src/rgw/rgw_admin.cc
src/rgw/rgw_rados.h
src/test/cli/radosgw-admin/help.t

osd: kill unused addr-based send_map()

Not used, old API, bad.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e359a862199c8a94cb238f7271ba1b0edcc0863c)

osd: share incoming maps via Connection*, not addrs

Kill a set of parallel methods that are using the old addr/inst-based
msgr APIs, and instead use Connection handles. This is much safer and gets
us closer to killing the old msgr API.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5e2fab54a4fdf2f59e2b635cbddef8a5909acb7c)

osd: pass new maps to dead osds via existing Connection

Previously we were sending these maps to dead osds via their old addrs
using a new outgoing connection and setting the flags so that the msgr
would clean up. That mechanism is possibly buggy and fragile, and we can
avoid it entirely if we just reuse the existing heartbeat Connection.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 1bc419a7affb056540ba8f9b332b6ff9380b37af)

osd: requeue osdmaps on heartbeat connections for cluster connection

If we receive an OSDMap on the cluster connection, requeue it for the
cluster messenger, and process it there where we normally do. This avoids
any concerns about locking and ordering rules.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 76705ace2e9767939aa9acf5d9257c800f838854)

msgr: add get_loopback_connection() method

Return the Connection* for ourselves, so we can queue messages for
ourselves.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a7059eb3f3922cf08c1e5bb5958acc2d45952482)

qa: add layout_vxattrs.sh test script

Test virtual xattrs for file and directory layouts.

TODO: create a data pool, add it to the fs, and make sure we can use it.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 61fbe27a52d12ecd98ddeb5fc0965c4f8ee7841a)

mds: allow dir layout/policy to be removed via removexattr on ceph.dir.layout

This lets a user remove a policy that was previously set on a dir.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit db31a1f9f27416e4d531fda716e32d42a275e84f)

mds: handle ceph.*.layout.* setxattr

Allow individual fields of file or dir layouts to be set via setxattr.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ebebf72f0993d028e795c78a986e1aee542ca5e0)

mdsmap: backported is_data_pool()

This roughly corresponds to mainline commit 99d9e1d.

Signed-off-by: Sage Weil <sage@inktank.com>

mds: fix client view of dir layout when layout is removed

We weren't handling the case where the projected node has NULL for the
layout properly. Fixes the client's view when we remove the dir layout.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 09f28541e374ffac198e4d48082b064aae93cb2c)

client: note presence of dir layout in inode operator<<

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 84751489ca208964e617516e04556722008ddf67)

client: list only aggregate xattr, but allow setting subfield xattrs

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ba32ea9454d36072ec5ea3e6483dc3daf9199903)

client: implement ceph.file.* and ceph.dir.* vxattrs

Display ceph.file.* vxattrs on any regular file, and ceph.dir.* vxattrs
on any directory that has a policy set.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3f82912a891536dd7e930f98e28d9a8c18fab756)

client: move xattr namespace enforcement into internal method

This captures libcephfs users now too.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit febb96509559084357bfaabf7e4d28e494c274aa)

client: allow ceph.* xattrs

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ad7ebad70bf810fde45067f78f316f130a243b9c)

rgw_rest: Make fallback uri configurable.

Some HTTP servers, notably lighttpd, do not set SCRIPT_URI, so make the fallback
string configurable.

Signed-off-by: caleb miles <caleb.miles@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit b3a2e7e955547a863d29566aab62bcc480e27a65)

Conflicts:
src/rgw/rgw_rest.cc

rgw: fix setting of NULL to string

Fixes: #3777
s->env->get() returns char * and not string and can return NULL.
Also, remove some old unused code.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 9019fbbe8f84f530b6a8700dfe99dfeb03e0ed3d)

OSD: check for empty command in do_command

Fixes: #3878
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 8cf79f252a1bcea5713065390180a36f31d66dfd)

PGMap: fix -Wsign-compare warning

Fix -Wsign-compare compiler warning:

mon/PGMap.cc: In member function 'void PGMap::apply_incremental
(CephContext*, const PGMap::Incremental&)':
mon/PGMap.cc:247:30: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
(cherry picked from commit b571f8ee2d22a3894120204bc5f119ff37e1de53)

mon: smooth pg stat rates over last N pgmaps

This smooths the recovery and throughput stats over the last N pgmaps,
defaulting to 2.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a7d15afb529615db56bae038b18b66e60d827a96)
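
One way to picture smoothing over the last N pgmaps is a sliding window of per-map deltas. This is a sketch under assumptions: `RateSmoother` and its fields are invented for illustration, and the monitor's real bookkeeping may differ.

```python
from collections import deque


class RateSmoother:
    """Average a rate over the deltas of the last `window` updates."""

    def __init__(self, window=2):
        # A bounded deque drops the oldest delta automatically.
        self.deltas = deque(maxlen=window)

    def add(self, amount_delta, seconds_delta):
        """Record the change in a counter and the time it covers."""
        self.deltas.append((amount_delta, seconds_delta))

    def rate(self):
        """Return total amount over total time for the current window."""
        total_amount = sum(a for a, _ in self.deltas)
        total_seconds = sum(s for _, s in self.deltas)
        return total_amount / total_seconds if total_seconds > 0 else 0.0
```

With a window of 2, a single unusually short or long interval is averaged against its neighbor instead of producing a spike.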

mon/PGMap: report IO rates

This does not appear to be very accurate; probably the stat values we're
displaying are not being calculated correctly.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3f6837e022176ec4b530219043cf12e009d1ed6e)

mon/PGMap: report recovery rates

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 208b02a748d97378f312beaa5110d8630c853ced)

mon/PGMap: include timestamp

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 76e9fe5f06411eb0e96753dcd708dd6e43ab2c02)

osd: track recovery ops in stats

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a2495f658c6d17f56ea0a2ab1043299a59a7115b)

osd_types: add recovery counts to object_sum_stats_t

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4aea19ee60fbe1106bdd71de2d172aa2941e8aab)

v0.56.2

cls_rbd, cls_rgw: use PRI*64 when printing/logging 64-bit values

Printing 64-bit values without these macros caused segfaults in 32-bit builds.

Fixes: #3961
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e253830abac76af03c63239302691f7fac1af381)

init-ceph: make ulimit -n be part of daemon command

ulimit -n from 'max open files' was being set only on the machine
running /etc/init.d/ceph. It needs to be added to the commands to
start the daemons, and run both locally and remotely.

Verified by examining /proc/<pid>/limits on local and remote hosts

Fixes: #3900
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Loïc Dachary <loic@dachary.org>
Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
(cherry picked from commit 84a024b647c0ac2ee5a91bacdd4b8c966e44175c)

mon: OSDMonitor: only share osdmap with up OSDs

Try to share the map with a randomly picked OSD; if the picked OSD is
not 'up', then try to find the nearest 'up' OSD in the map by doing a
backward and a forward linear search on the map -- this would be O(n) in
the worst case scenario, as we only do a single iteration starting on the
picked position, incrementing and decrementing two different iterators
until we find an appropriate OSD or we exhaust the map.

Fixes: #3629
Backport: bobtail

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3610e72e4f9117af712f34a2e12c5e9537a5746f)

utime: fix narrowing conversion compiler warning in sleep()

Fix compiler warning:
./include/utime.h: In member function 'void utime_t::sleep()':
./include/utime.h:139:50: warning: narrowing conversion of
'((utime_t*)this)->utime_t::tv.utime_t::<anonymous struct>::tv_sec' from
'__u32 {aka unsigned int}' to '__time_t {aka long int}' inside { } is
ill-formed in C++11 [-Wnarrowing]
./include/utime.h:139:50: warning: narrowing conversion of
'((utime_t*)this)->utime_t::tv.utime_t::<anonymous struct>::tv_nsec' from
'__u32 {aka unsigned int}' to 'long int' inside { } is
ill-formed in C++11 [-Wnarrowing]

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
(cherry picked from commit 014fc6d6c1c68e2e3ad0117d08c4e46e4030d49e)

rgw: fix crash when missing content-type in POST object

Fixes: #3941
This fixes a crash when handling S3 POST request and content type
is not provided.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit f41010c44b3a4489525d25cd35084a168dc5f537)

ReplicatedPG: make_snap_collection when moving snap link in snap_trimmer

Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 88956e3186798058a1170803f8abfc0f3cf77a07)

ReplicatedPG: correctly handle new snap collections on replica

Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9e44fca13bf1ba39dbcad29111b29f46c49d59f7)

mon: Elector: reset the acked leader when the election finishes and we lost

Failure to do so will mean that we will always ack the same leader during
an election started by another monitor. This had been working so far
because we were still acking the existing leader if he was supposed to
still be the leader; or we were acking a new potential leader; or we
would eventually fall behind on an election and start a new election
ourselves, thus resetting the previously acked leader. While this wasn't
something that mattered much until now, the timechecks code stumbled into
this tiny issue and was failing hard at completing a round because there
wouldn't be a reset before the election started -- timechecks are bound
to election epochs.

Fixes: #3854
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit c54781618569680898e77e151dd7364f22ac4aa1)

rbd: fix bench-write infinite loop

I/O was continuously submitted as long as there were few enough ops in
flight. If the number of 'threads' was high, or caching was turned on,
there would never be that many ops in flight, so the loop would continue
indefinitely. Instead, submit at most io_threads ops per offset.

Fixes: #3413
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage.weil@inktank.com>
(cherry picked from commit d81ac8418f9e6bbc9adcc69b2e7cb98dd4db6abb)

rbd: Don't call ProgressContext's finish() if there's an error.

do_copy was different from the others; call pc.fail() on error and
do not call pc.finish().

Fixes: #3729
Signed-off-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 0978dc4963fe441fb67afecb074bc7b01798d59d)

librbd: establish watch before reading header

This eliminates a window in which a race could occur when we have an
image open but no watch established. The previous fix (using
assert_version) did not work well with resend operations.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit c4370ff03f8ab655a009cfd9ba3a0827d8c58b11)

Revert "librbd: ensure header is up to date after initial read"

Using assert version for linger ops doesn't work with retries,
since the version will change after the first send.
This reverts commit e1776809031c6dad441cfb2b9fac9612720b9083.

Conflicts:

qa/workunits/rbd/watch_correct_version.sh
(cherry picked from commit e0858fa89903cf4055889c405f17515504e917a0)

os/FileStore: only adjust up op queue for btrfs

We only need to adjust up the op queue limits during commit for btrfs,
because the snapshot initiation (async create) is currently
high-latency and the op queue is quiesced during that period.

This lets us revert 44dca5c, which disabled the extra allowance because
it is generally bad for non-btrfs writeahead mode.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 38871e27eca5a34de78db23aa3663f6cb045d461)

common/HeartbeatMap: fix uninitialized variable

Introduced by me in 132045ce085e8584a3e177af552ee7a5205b13d8. Thank you,
valgrind!

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 00cfe1d3af286ffab7660933415684f18449720c)

sharedptr_registry: remove extraneous Mutex::Locker declaration

For some reason, the lookup() retry loop (for when we happened to
race with a removal and grab an invalid WeakPtr) locked
the lock again. This causes the #3836 crash since the lock
is already locked. It's rare since it requires a lookup between
invalidation of the WeakPtr and removal of the WeakPtr entry.

Fixes: #3836
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 037900dc7a051ce2293a4ef9d0e71911b29ec159)

FileStore: ping TPHandle after each operation in _do_transactions

Each completed operation in the transaction proves thread
liveness; a stuck thread should still trigger the timeouts.

Fixes: #3928
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 0c1cc687b6a40d3c6a26671f0652e1b51c3fd1af)
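
The ping-per-operation idea can be sketched with a deadline that is pushed out after every completed step. `HeartbeatHandle` and `run_ops` are hypothetical names, not the TPHandle API, and the clock is injected so the behavior can be checked without real sleeping.

```python
class HeartbeatHandle:
    """Track a liveness deadline that each ping pushes `grace` seconds out."""

    def __init__(self, grace, clock):
        self.grace = grace
        self.clock = clock
        self.deadline = clock() + grace

    def ping(self):
        """Record progress by resetting the deadline."""
        self.deadline = self.clock() + self.grace

    def is_healthy(self):
        """Report whether the deadline has not yet passed."""
        return self.clock() < self.deadline


def run_ops(ops, handle):
    """Run each operation and ping the handle once it completes."""
    for op in ops:
        op()
        handle.ping()
```

A long transaction made of many short operations stays healthy, while a single operation that runs past the grace period still trips the timeout because no ping happens until it returns.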

OSD: use TPHandle in peering_wq

Implement _process overload with TPHandle argument and use
that to ping the hb map between pgs and between map epochs
when advancing a pg. The thread will still time out if
genuinely stuck at any point.

Fixes: #3905
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit e0511f4f4773766d04e845af2d079f82f3177cb6)

WorkQueue: add TPHandle to allow _process to ping the hb map

Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 4f653d23999b24fc8c65a59f14905db6630be5b5)

ReplicatedPG: handle omap > max_recovery_chunk

span_of fails if len == 0.

Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 8a97eef1f7004988449bd7ace4c69d5796495139)

ReplicatedPG: correctly handle omap key larger than max chunk

Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit c3dec3e30a85ecad0090c75a38f28cb83e36232e)

ReplicatedPG: start scanning omap at omap_recovered_to

Previously, we started scanning omap after omap_recovered_to.
This is a problem since the break in the loop implies that
omap_recovered_to is the first key not recovered.

Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 09c71f2f5ee9929ac4574f4c35fb8c0211aad097)
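
The off-by-one can be illustrated with a sorted key list, where an inclusive start corresponds to a lower-bound search and the old behavior to an upper-bound search. `next_chunk` is a made-up helper, and a sorted Python list stands in for the omap iterator.

```python
from bisect import bisect_left, bisect_right


def next_chunk(sorted_keys, recovered_to, max_keys, inclusive=True):
    """Return (chunk, new_recovered_to) for one recovery pass.

    new_recovered_to is the first key NOT included in the chunk, or None
    when the scan reached the end of the keys.
    """
    find = bisect_left if inclusive else bisect_right
    start = find(sorted_keys, recovered_to)
    chunk = sorted_keys[start:start + max_keys]
    end = start + len(chunk)
    new_recovered_to = sorted_keys[end] if end < len(sorted_keys) else None
    return chunk, new_recovered_to
```

Because the returned marker is the first key not yet recovered, resuming strictly after it (`inclusive=False`) would skip that key.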

ReplicatedPG: don't finish_recovery_op until the transaction completes

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 62a4b96831c1726043699db86a664dc6a0af8637)

ReplicatedPG: ack push only after transaction has completed

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 20278c4f77b890d5b2b95d2ccbeb4fbe106667ac)

ObjectStore: add queue_transactions with oncomplete

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 4d6ba06309b80fb21de7bb5d12d5482e71de5f16)

common/HeartbeatMap: inject unhealthy heartbeat for N seconds

This lets us test code that is triggered by an unhealthy heartbeat in a
generic way.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 132045ce085e8584a3e177af552ee7a5205b13d8)

os/FileStore: add stall injection into filestore op queue

Allow admin to artificially induce a stall in the op queue.  Forces the
thread(s) to sleep for N seconds.  We pause for 1 second increments and
recheck the value so that a previously stalled thread can be unwedged by
reinjecting a lower value (or 0).  To stall indefinitely, just inject a
very large number.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 657df852e9c89bfacdbce25ea014f7830d61e6aa)
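
A sketch of the one-second recheck loop. `StallInjector` and `stall_seconds` are hypothetical names rather than the FileStore option, and the sleep function is injectable so the loop can be exercised without waiting.

```python
import time


class StallInjector:
    """Sleep in 1-second steps, rechecking the requested stall each time."""

    def __init__(self, sleep=time.sleep):
        self.stall_seconds = 0  # an admin may change this while we stall
        self._sleep = sleep

    def maybe_stall(self):
        """Stall until the waited time reaches stall_seconds; return it."""
        waited = 0
        # The limit is re-read on every iteration, so lowering it (or
        # setting it to 0) releases a thread that is already stalled.
        while waited < self.stall_seconds:
            self._sleep(1)
            waited += 1
        return waited
```

Setting `stall_seconds` to a very large value stalls effectively forever, and setting it back to 0 lets the thread continue after at most one more second.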

osd: do not join cluster if not healthy

If our internal heartbeats are failing, do not send a boot message and try
to join the cluster.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a4e78652cdd1698e8dd72dda51599348d013e5e0)

osd: hold lock while calling start_boot on startup

This probably doesn't strictly matter because start_boot doesn't need the
lock (currently) and few other threads should be running, but it is
better to be consistent.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c406476c0309792c43df512dddb2fe0f19835e71)

osd: do not reply to ping if internal heartbeat is not healthy

If we find that our internal threads are stalled, do not reply to ping
requests. If we do this long enough, peers will mark us down. If we are
only transiently unhealthy, we will reply to the next ping and they will
be satisfied. If we are unhealthy and marked down, and eventually recover,
we will mark ourselves back up.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ad6b231127a6bfcbed600a7493ca3b66c68484d2)

osd: reduce op thread heartbeat default 30 -> 15 seconds

If the thread stalls for 15 seconds, let our internal heartbeat fail.
This will let us internally respond more quickly to a stalled or failing
disk.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 61eafffc3242357d9add48be9308222085536898)

osd: improve sub_op flag points

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 73a969366c8bbd105579611320c43e2334907fef)

osd: refactor ReplicatedPG::do_sub_op

PULL is the only case where we don't wait for active.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 23c02bce90c9725ccaf4295de3177e8146157723)

osd: make last state for slow requests more informative

Report on the last event string, and pass in important context for the
op event list, including:

- which peers were sent sub ops and we are waiting for
- which pg queue we are delayed by

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a1137eb3e168c2d00f93789e4d565c1584790df0)

osd: dump op priority queue state via admin socket

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 24d0d7eb0165c8b8f923f2d8896b156bfb5e0e60)

osd: simplify asok to single callback

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 33efe32151e04beaafd9435d7f86dc2eb046214d)