git.apps.os.sepia.ceph.com Git

mds: Locker: fix a NULL deref in _update_cap_fields

The MClientCaps* is allowed to be NULL, so we can't deref it unless
the dirty param is non-zero. So don't do the ahead-of-time lookup;
just call it explicitly in the if block.

Signed-off-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 3cd8a7fb9683577a7d6e934f18c29b7e84415be6)
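
A minimal sketch of the pattern this fix describes, not the actual Locker
code: the message pointer is dereferenced only inside the branch guarded by
the dirty mask. FakeCaps, get_ctime() and update_fields() are hypothetical
stand-ins introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for MClientCaps; only the shape matters here.
struct FakeCaps {
  uint64_t ctime;
  uint64_t get_ctime() const { return ctime; }
};

// Returns the ctime to store. `m` may be null when `dirty` is zero,
// so it is dereferenced only inside the guarded branch.
uint64_t update_fields(uint64_t current_ctime, int dirty, const FakeCaps *m) {
  if (dirty) {
    assert(m != nullptr);             // callers must pass a message when dirty
    uint64_t ctime = m->get_ctime();  // lookup happens here, not ahead of time
    if (ctime > current_ctime)
      return ctime;
  }
  return current_ctime;
}
```

Calling update_fields(5, 0, nullptr) is safe in this sketch because the
pointer is never touched when nothing is dirty.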

Merge remote-tracking branch 'gh/wip-sharedptr-registry-backport' into dumpling

rbd: ObjectCacher reads can hang when reading sparse files

The pending read list was not properly flushed when empty objects
were read from a sparse file.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit cdb7675a21c9107e3596c90c2b1598def3c6899f)

Enforce cache size on read requests

In-flight cache reads were not previously counted against
new cache read requests, which could result in very large
cache usage. This effect is most noticeable when writing
small chunks to a cloned image since each write requires
a full object read from the parent.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 4fc9fffc494abedac0a9b1ce44706343f18466f1)

Conflicts:
src/osdc/ObjectCacher.h

Locker: accept ctime updates from clients without dirty write caps

The ctime changes any time the inode does. That can happen even without
the file itself having changed, so we'd better accept the update whenever
the auth caps have dirtied, without worrying about the file caps!

Fixes: #9514
Backport: firefly

Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: John Spray <john.spray@redhat.com>
(cherry picked from commit 0ea20a668cf859881c49b33d1b6db4e636eda18a)

libcephfs: get osd location on -1 should return EINVAL

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b88af07ef5a3c9a484255b54149a6e6a635845dc)

c70331db13bdbf1f967ece48cdbec28b97c3d19c backported some
CRUSH changes which necessitate backporting this as well, or
we get failures like successfully looking up an OSD with ID -1.
Reviewed-by: Greg Farnum <greg@inktank.com>

Merge pull request #2611 from dachary/wip-9570-buffer-alignment-dumpling

common: buffer alignment (dumpling)

Reviewed-by: Sage Weil <sage@redhat.com>

common/buffer.cc: fix rebuild_page_aligned typo

Introduced: 66a9fbe2c7ba59b7cd034c17865adce3432cd2cb
Fixes: #6003
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit cd30e5fbb1d2d29b4fc0c2b13fa3b9b4493c24f0)

common: rebuild_page_aligned sometimes rebuilds unaligned

rebuild_page_aligned relies on rebuild to create memory that is aligned
according to list::is_page_aligned(). However, when the bufferlist contains
only a single ptr whose size is not list::is_n_page_size(), rebuild will
not create the expected aligned bufferlist.

The allocation of the ptr is moved out of rebuild, which is now given the
ptr as an argument. The rebuild_page_aligned function always requires an
aligned ptr with buffer::create_page_aligned(_len) for consistency.

The test

    bufferlist bl;
    bufferptr ptr(buffer::create_page_aligned(2));
    ptr.set_offset(1);
    ptr.set_length(1);
    bl.append(ptr);
    EXPECT_FALSE(bl.is_page_aligned());
    bl.rebuild_page_aligned();
    EXPECT_FALSE(bl.is_page_aligned());

demonstrated the problem. It was assumed to be a feature but should have
been identified as a bug. The last line is replaced with

    EXPECT_TRUE(bl.is_page_aligned());

Most tests related to is_page_aligned() wrongfully assumed that

    bufferptr ptr(2);

is never page aligned. Most of the time it is not, but sometimes it is
when the pointer address is by chance on a CEPH_PAGE_SIZE boundary,
which triggered #6614. Non-aligned ptrs are created as follows instead:

    bufferptr ptr(buffer::create_page_aligned(2));
    ptr.set_offset(1);
    ptr.set_length(1);

http://tracker.ceph.com/issues/6614 fixes: #6614

Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 66a9fbe2c7ba59b7cd034c17865adce3432cd2cb)
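
The reasoning about test pointers can be illustrated without the Ceph buffer
classes. The sketch below assumes a POSIX system and a 4096-byte page
(kPageSize is an assumption; Ceph uses CEPH_PAGE_SIZE), and its two helpers
are hypothetical stand-ins for the bufferlist methods. An address one byte
past a page-aligned allocation is never page aligned, whereas a plain
allocation may land on a page boundary by chance.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>

// Assumed page size for illustration only.
static const std::size_t kPageSize = 4096;

// True when the address falls on a page boundary.
bool is_page_aligned(const void *p) {
  return reinterpret_cast<std::uintptr_t>(p) % kPageSize == 0;
}

// Allocates at least `len` bytes, rounded up to whole pages and page
// aligned. Release the result with free(). Returns nullptr on failure.
void *create_page_aligned(std::size_t len) {
  std::size_t rounded = (len + kPageSize - 1) / kPageSize * kPageSize;
  void *p = nullptr;
  if (posix_memalign(&p, kPageSize, rounded) != 0)
    return nullptr;
  return p;
}
```

Taking create_page_aligned(2) and adding an offset of 1 gives a
deterministic unaligned address, which is what the updated tests rely on.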

osdc/Objecter: only post_rx_buffer if no op timeout

If we post an rx buffer and there is a timeout, the revocation can happen
while the reader has consumed the buffers but before it has decoded and
constructed the message. In particular, we calculate a crc32c over the
data portion of the message after we've taken the buffers and dropped the
lock.

Instead of fixing this race (for example, by reverifying rx_buffers under
the lock while calculating the crc.. bleh), just skip the rx buffer
optimization entirely when a timeout is present.

Note that this doesn't cover the op_cancel() paths, but none of those users
provide static buffers to read into.

Fixes: #9582
Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@redhat.com>
backport of 126d0b30e990519b8f845f99ba893fdcd56de447

osd/ReplicatedPG: respect RWORDERED rados flag

If this flag is set, we need to order reads as writes. In particular, this
means that reads will wait for degraded object recovery even if there is a
local copy. And subsequently will be ordered after a preceding write that
is waiting for the same thing.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9322305c80e995e1c4a964edff0fc094329d951b)

Merge pull request #2583 from ceph/wip-7648

crush: backport newer get_full_location

Reviewed-by: Loic Dachary <loic-201408@dachary.org>

crush: fix get_full_location_ordered

This should return -ENOENT when an id is not present. Broken by
746069ee62c74ecf04ed45988029d5c3382a38d2.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d4f07cd90b4037964a18e5c05a3cc5ddd73fb393)

crush/CrushWrapper: simplify get_full_location_ordered()

Just ascend the hierarchy; it is much less complicated.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 746069ee62c74ecf04ed45988029d5c3382a38d2)
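
As a rough illustration of "just ascend the hierarchy", combined with the
-ENOENT behaviour required by the follow-up fix, the sketch below walks
parent links in a flattened map. Item and the map layout are hypothetical and
are not how CrushWrapper stores buckets. The hierarchy is assumed to be
acyclic.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical flattened view: each item knows its type name, its own
// name, and its parent id (has_parent is false for the root).
struct Item {
  std::string type;
  std::string name;
  bool has_parent;
  int parent;
};

// Walks from the parent of `id` up to the root, appending (type, name)
// pairs in that order. Returns 0 on success, -ENOENT if `id` is unknown.
int get_full_location_ordered(
    const std::map<int, Item> &items, int id,
    std::vector<std::pair<std::string, std::string>> &path) {
  auto it = items.find(id);
  if (it == items.end())
    return -ENOENT;
  while (it->second.has_parent) {
    it = items.find(it->second.parent);
    if (it == items.end())
      return -ENOENT;  // dangling parent; treated as missing in this sketch
    path.push_back(std::make_pair(it->second.type, it->second.name));
  }
  return 0;
}
```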

0.67.11

cephfs-java: build against older jni headers

Older versions of the JNI interface expected non-const parameters
to their memory move functions. Doing a const_cast to satisfy those
older headers is unpleasant, but it won't actually change the memory in
question. (And even if it *did* modify the memory, that would be okay
given our single user.)

Signed-off-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 4d4b77e5b6b923507ec4a0ad9d5c7018e4542a3c)

sharedptr_registry.hpp: removed ptrs need to not blast contents

See the included unit test update. Consider:
1) x = lookup_or_create(1, 1)
2) remove(1)
3) y = lookup_or_create(1, 2)
4) x.reset()
5) z = lookup(1)

The bug is that z will be null since x.reset() caused the
cleanup callback to remove y's key value from contents.

To fix this, contents also records the pointer value for
the weak_ptr. The removal callback only removes the
key from contents if it matches the ptr in contents.

This should work since the pointer passed to the removal
callback must be unique up to that point since it has
not yet been deleted.

This allowed a pg removal -> pg recreation -> pg removal
sequence to cause the second pg removal entry to be
erroneously cleared by the first pg removal's destructor
as it finally made its way through the removal queue.

Fixes: #5951
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 28e4271267976ab8405a24d015f2fb50a2f82c49)
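
The five-step scenario above can be reproduced with a much smaller registry.
This is a simplified sketch built on std::shared_ptr and std::mutex, not the
Ceph SharedPtrRegistry: the class name Registry and its layout are
assumptions, and it omits the waiting logic of the real class. It shows only
the idea of recording the raw pointer next to the weak_ptr so that a stale
deleter cannot erase a newer entry.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <utility>

// Simplified, hypothetical registry for illustration.
template <class K, class V>
class Registry {
  std::mutex lock;
  // key -> (weak reference, raw pointer of the object owning the entry)
  std::map<K, std::pair<std::weak_ptr<V>, V *>> contents;

public:
  std::shared_ptr<V> lookup(const K &key) {
    std::lock_guard<std::mutex> l(lock);
    auto i = contents.find(key);
    if (i == contents.end())
      return std::shared_ptr<V>();
    return i->second.first.lock();
  }

  std::shared_ptr<V> lookup_or_create(const K &key, const V &value) {
    std::lock_guard<std::mutex> l(lock);
    auto i = contents.find(key);
    if (i != contents.end()) {
      std::shared_ptr<V> live = i->second.first.lock();
      if (live)
        return live;
    }
    V *raw = new V(value);
    std::shared_ptr<V> created(raw, [this, key](V *dead) {
      {
        std::lock_guard<std::mutex> g(lock);
        auto j = contents.find(key);
        // Erase only if the entry still belongs to this very object.
        if (j != contents.end() && j->second.second == dead)
          contents.erase(j);
      }
      delete dead;
    });
    contents[key] = std::make_pair(std::weak_ptr<V>(created), raw);
    return created;
  }

  void remove(const K &key) {
    std::lock_guard<std::mutex> l(lock);
    contents.erase(key);
  }
};
```

The registry must outlive every pointer it hands out, since the deleter
captures it. With the pointer comparison in place, x.reset() in step 4
leaves y's entry alone and the lookup in step 5 returns y's value.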

SharedPtrRegistry: get_next must not delete while holding the lock

    bool get_next(const K &key, pair<K, VPtr> *next)

may indirectly delete the object pointed to by next->second when
doing:

    *next = make_pair(i->first, next_val);

and it will deadlock (EDEADLK) when

    void operator()(V *to_remove) {
      {
        Mutex::Locker l(parent->lock);

tries to acquire the lock because it is already held. The
Mutex::Locker is isolated in a block and the *next* parameter is set
outside of the block.

A test case demonstrating the problem is added to test_sharedptr_registry.cc

http://tracker.ceph.com/issues/6117 fixes #6117

Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit ea2fc85e091683ced062594ad25fa569e5c1bbd7)

sharedptr_registry: add a variant of get_next() and the empty() method

The SharedPtrRegistry::get_next() method with a value of type VPtr
instead of V is added because it is sometimes more convenient to not
copy the value when walking the registry. The
SharedPtrRegistry::empty() predicate method is added.

Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit be04918d4446a7e4ab997e255db6448db749c2a5)

replace in_method_t with a counter

A single counter ( waiting ) accurately reflects the number of
waiters, regardless of the method waiting. It is enough to allow
unit tests to synthesize all situations, including:

T1: x = lookup_or_create(0)
T1: release x part 1 (weak_ptrs now fail to lock)
T2: y = lookup_or_create(0)
T2: block in lookup_or_create (waiting == 1)
T1: z = lookup_or_create(1) (does not block because the key is different);
    while holding the lock it does waiting++ (waiting == 2),
    and before returning it does waiting-- (waiting is back to 1)
T1: complete release x
T2: complete lookup_or_create(0) (waiting == 0)

The unit tests are modified to add a lookup on an unrelated key to
demonstrate that it does not reset waiting counter.

http://tracker.ceph.com/issues/5527 refs #5527

Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 2ec480b1ba00ff02f99a43963a321efc8edf247e)

unit tests for sharedptr_registry

Covers 100% of the LOC and all the expected behavior, including thread
safety.

The sharedptr_registry is made friend of the test class so that it can
synthesize race conditions. The lookup and lookup_or_create methods
set the new in_method data member before calling cond.Wait() so that
the caller knows it is waiting.

http://tracker.ceph.com/issues/5527 refs #5527

Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 6b16cd1aaaf818db0a6063f3a3ebb02eeefa3056)

librbd: fix crash using clone of flattened image

The crash occurs due to ImageCtx->parent->parent being uninitialized,
since the initial open_parent() -> open_image(parent) ->
ictx_refresh(parent) occurs before ImageCtx->parent->snap_id is set,
so refresh_parent() is not called to open an ImageCtx for the parent
of the parent. This leaves the ImageCtx->parent->parent NULL, but the
rest of ImageCtx->parent updated to point at the correct parent snapshot.

Setting the parent->snap_id earlier has some unintended side effects
currently, so for now just call refresh_parent() during
open_parent(). This is the easily backportable version of the
fix. Further patches can clean up this whole initialization process.

Fixes: #8845
Backport: firefly, dumpling
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 2545e80d274b23b6715f4d8b1f4c6b96182996fb)

test/cli-integration/rbd: fix trailing space

Newer versions of json.tool remove the trailing ' ' after the comma. Add
it back in with sed so that the .t works on both old and new versions, and
so that we don't have to remove the trailing spaces from all of the test
cases.

Fixes: #8920
Backports commit 605064dc685aa25cc7d58ec18b6449a3ce476d01

Signed-off-by: Sage Weil <sage@redhat.com>

os/FileStore: fix mount/remount force_sync race

Consider:

- mount
- sync_entry is doing some work
- umount
   - set force_sync = true
   - set done = true
- sync_entry exits (due to done)
   - ..but does not set force_sync = false
- mount
- journal replay starts
- sync_entry sees force_sync and does a commit while op_seq == 0
...crash...

Fixes: #9144
Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit dd11042f969b94f7a461d02e1475794031c79f61)

Conflicts:
src/os/FileStore.cc

osdc/Objecter: revoke rx_buffer on op_cancel

If we cancel a read, revoke the rx buffers to avoid a use-after-free and/or
other undefined badness by using user buffers that may no longer be
present.

Fixes: #9362
Backport: firefly, dumpling
Reported-by: Matthias Kiefer <matthias.kiefer@1und1.de>
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 2305b2897acba38384358c33ca3bbfcae6f1c74e)

(adjusted for op->con instead of s->con)

mon/Paxos: don't spam log with is_readable at dout level 1

Backport: firefly, dumpling
Reported-by: Aanchal Agrawal <Aanchal.Agrawal@sandisk.com>
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 62ca27d0b119b597ebad40dde64c4d86599e466d)

doc: add note on soft JS dependency for navigating docs

Signed-off-by: Alfredo Deza <alfredo.deza@inktank.com>
(cherry picked from commit 657be818375bea2d8b5998ea1e5505eedc2f294d)

doc: fix missing bracket

Signed-off-by: Alfredo Deza <alfredo.deza@inktank.com>
(cherry picked from commit 69638dfaeb0dcd96dac4b5f5c00ed08042432487)

doc: attempt to get the ayni JS into all head tags

Signed-off-by: Alfredo Deza <alfredo.deza@inktank.com>
(cherry picked from commit 35663fa55ac1579a3b0c8b67028a3a8dfea87b48)

qa/workunits/rbd/qemu-iotests: touch common.env

This seems to be necessary on trusty.

Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 055be68cf8e1b84287ab3631a02e89a9f3ae6cca)

mon: fix divide by zero when pg_num adjusted and no osds

Fixes: #9052
Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@redhat.com>
Manual backport of 239401db7b51541a57c59a261b89e0f05347c32d

common/LogClient: fix sending dup log items

We need to skip even the most recently sent item in order to get to the
ones we haven't sent yet.

Fixes: #9080
Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 057c6808be5bc61c3f1ac2b956c1522f18411245)
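
The off-by-one nature of this kind of bug is easy to see in isolation. The
sketch below is hypothetical and is not the LogClient code; Entry and
unsent() are illustrative names. It assumes each queued entry carries a
monotonically increasing sequence number and that last_sent is the sequence
number of the most recently sent entry.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

struct Entry {
  uint64_t seq;
};

// Returns the sequence numbers still to be sent. Comparing with `<=`
// skips the most recently sent entry as well; using `<` here would
// send that entry a second time.
std::vector<uint64_t> unsent(const std::deque<Entry> &queue,
                             uint64_t last_sent) {
  std::vector<uint64_t> out;
  for (const Entry &e : queue) {
    if (e.seq <= last_sent)
      continue;
    out.push_back(e.seq);
  }
  return out;
}
```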

librbd: fix error path cleanup for opening an image

If the image doesn't exist and caching is enabled, the ObjectCacher
was not being shutdown, and the ImageCtx was leaked. The IoCtx could
later be closed while the ObjectCacher was still running, resulting in
a segfault. Simply use the usual cleanup path in open_image(), which
works fine here.

Fixes: #8912
Backport: dumpling, firefly
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 3dfa72d5b9a1f54934dc8289592556d30430959d)

osd: allow io priority to be set for the disk_tp

The disk_tp covers scrubbing, pg deletion, and snap trimming

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 84b3003119eeb8acfb3faacf357e6c6a452950e3)

Conflicts:
src/osd/OSD.cc

(cherry picked from commit 987ad133415aa988061c95259f9412b05ce8ac7e)

0.67.10

Add rbdcache max dirty object option

Librbd calculates the maximum number of dirty objects from
rbd_cache_max_size, which is not suitable for every case. If the user sets
image order 24, the calculated value is too small in practice, which
increases the overhead of the trim call made on every read/write op.

Make it a tunable option; by default the value is still calculated.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
(cherry picked from commit 3c7229a2fea98b30627878c86b1410c8eef2b5d7)

librbd/internal.cc: check earlier for null pointer

Fix a potential null pointer deref: move the check for 'order != NULL'
to the beginning of the function to a) prevent a deref in the ldout() call
and b) leave the function as early as possible if the check fails.

[src/librbd/internal.cc:843] -> [src/librbd/internal.cc:865]: (warning)
Possible null pointer dereference: order - otherwise it is redundant
to check it against null.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
(cherry picked from commit 3ee3e66a9520a5fcafa7d8c632586642f7bdbd29)

librbd: add an interface to invalidate cached data

This is useful for qemu to guarantee live migration with caching is
safe, by invalidating the cache on the destination before starting it.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 5d340d26dd70192eb0e4f3f240e3433fb9a24154)

librbd: check return code and error out if invalidate_cache fails

This will only happen when shrinking or rolling back an image is done
while other I/O is in flight to the same ImageCtx. This is unsafe, so
return an error before performing the resize or rollback.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit e08b8b66c77be3a3d7f79d91c20b1619571149ee)

Avoid extra check for clean object

We need not check clean objects via buffer state; just skip them.

Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
(cherry picked from commit f51e33bd9c5a8e1cfc7065b30785696dc45918bc)

rbd.cc: yes, cover formatted output as well. sigh.

Fixes: #7577
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit bd6e35c1b171e46cc3e026c59b076b73440a8502)

rbd.cc: tolerate lack of NUL-termination on block_name_prefix

Fixes: #7577
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit fd76fec589be13a4a6362ef388929d3e3d1d21f6)
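
One common way to tolerate a fixed-size field that may lack a terminating
NUL is to bound the read by the field size. The helper below is a generic
sketch with a hypothetical name (bounded_string); it is not taken from
rbd.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Builds a std::string from a fixed-size field that may or may not
// contain a terminating NUL. Never reads past `size` bytes.
std::string bounded_string(const char *buf, std::size_t size) {
  const void *nul = std::memchr(buf, '\0', size);
  std::size_t len =
      nul ? static_cast<std::size_t>(static_cast<const char *>(nul) - buf)
          : size;
  return std::string(buf, len);
}
```

A field filled to its last byte yields all of its bytes, and a shorter,
NUL-terminated value is cut at the NUL as usual.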

rbd: don't forget to call close_image() if remove_child() fails

close_image() among other things unregisters a watcher that's been
registered by open_image(). Even though it'll time out in 30 or so
seconds, it's not nice now that we check for watchers before starting
the removal process.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
(cherry picked from commit 4ebc32f37a4860bdc676491bf8b042c18fd619cf)

os/FileStore: dump open fds before asserting

Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 4e8de1792731cf30f2744ab0659d036adc0565a3)

rgw: return error if accessing object in non-existent bucket

Fixes: #7064
Instead of trying to access the object, which is impossible as we don't
even have a proper bucket info. Up until now we ended up creating an
empty pool and eventually returning ENOENT, this fix catches the issue
earlier in the process.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 3ed68eb9fac9b3d0bf9111867d609f9ea08fb202)

os/FileStore: force any new xattr into omap on E2BIG

If we have a huge xattr (or many little ones), the _fgetattrs() for the
inline_set will fail with E2BIG.  The conditions later where we decide
whether to clean up the old xattr will then also fail.  We *will* put
the xattr in omap, but the non-omap version isn't cleaned up.

Fix this by setting a flag if we get E2BIG that the inline_set is known
to be incomplete.  In that case, take the conservative step of assuming
the xattr might be present and chain_fremovexattr().  Ignore any error
because it might not be there.

This is clearly harmless in the general case because it won't be there.  If
it is, we will hopefully remove enough xattrs that the E2BIG condition
will go away (usually by removing some really big chained xattr).

See bug #7779.

This is a backport of 26750fcfe8d766874513e57981565adde2e6d8c7.

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Signed-off-by: Sage Weil <sage@redhat.com>

rgw: calc md5 and compare if user provided appropriate header

Fixes: #8436
Backport: firefly

This was broken in ddc2e1a8e39a5c6b9b224c3eebd1c0e762ca5782. The fix
resurrects an old check that was dropped.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 9c56c86bdac6bcb8e76c3f04e7d393e4eaadd721)

rgw: calculate user manifest

Fixes: #8169
Backport: firefly
We didn't calculate the user manifest's object etag at all. The etag
needs to be the md5 of the concatenation of all the parts' etags.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit ddc2e1a8e39a5c6b9b224c3eebd1c0e762ca5782)

Conflicts:
src/rgw/rgw_op.cc

rgw: fix crash in swift CORS preflight request

Fixes: #8586
This fixes error handling, in accordance with commit 6af5a537 that fixed
the same issue for the S3 case.

Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
(cherry picked from commit 18ea2a869791b4894f93fdafde140285f2e4fb65)

cls_rgw: fix object name of objects removed on object creation

Fixes: #8972
Backport: firefly, dumpling

Reported-by: Patrycja Szabłowska <szablowska.patrycja@gmail.com>
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
(cherry picked from commit 0f8929a68aed9bc3e50cf15765143a9c55826cd2)

qa/workunits/rest/test.py: make osd create test idempotent

Avoid the possibility that we create multiple OSDs due to retries by passing in
the optional uuid arg. (A stray osd id will make the osd tell tests a
few lines down fail.)

Fixes: #8728
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit bb3e1c92b6682ed39968dc5085b69c117f43cbb0)

mon: Monitor: suicide on start if mon has been removed from monmap

If the monitor has been marked as having been part of an existing quorum
and is no longer in the monmap, then it is safe to assume the monitor
was removed from the monmap. In that event, do not allow the monitor
to start, as it will try to find its way into the quorum again (and
someone clearly stated they don't really want them there), unless
'mon force quorum join' is specified.

Fixes: 6789
Backport: dumpling, emperor

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 86b85947a2148c2e73886c1a7edd09093966ada0)

Conflicts:
src/common/config_opts.h

utf8: export encode_utf8() and decode_utf8()

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 49fc68cf8c3122c878ea9503c9c74d7046bc9c6f)

rgw: dump prefix unconditionally

As part of issue #8858, and to be more in line with S3, dump the Prefix
field when listing bucket even if bucket is empty.

Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
(cherry picked from commit d7209c11251d42227608bc54cc69232ef62ffe80)

Conflicts:
src/rgw/rgw_rest_s3.cc

rgw: list extra objects to set truncation flag correctly

Otherwise we end up returning the wrong truncated value, and no data on the
next iteration.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit dc417e477d4ad262885c6b5f5987cf06d63b159d)
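
The idea of looking at one entry beyond the requested maximum can be
sketched as follows. This is an illustration under simplifying assumptions,
not the rgw listing code: a sorted in-memory vector stands in for the bucket
index, and Page and list_page() are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Page {
  std::vector<std::string> keys;
  bool truncated;
};

// Returns at most `max` keys strictly after `marker`. One extra entry is
// examined so that `truncated` is true only when more data really follows.
Page list_page(const std::vector<std::string> &sorted_keys,
               const std::string &marker, std::size_t max) {
  Page page;
  page.truncated = false;
  for (const std::string &k : sorted_keys) {
    if (k <= marker)
      continue;
    if (page.keys.size() == max) {  // this is the (max + 1)-th match
      page.truncated = true;
      break;
    }
    page.keys.push_back(k);
  }
  return page;
}
```

When a page happens to end exactly on the last key, truncated stays false,
so the caller does not issue a follow-up request that returns no data.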

rgw: account common prefixes for MaxKeys in bucket listing

To be more in line with the S3 api. Beforehand we didn't count the
common prefixes towards the MaxKeys (a single common prefix counts as a
single key). Also need to adjust the marker now if it is pointing at a
common prefix.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 82d2d612e700f94a0bb2d9fb7555abf327be379b)
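
A small sketch of the accounting rule, assuming a sorted in-memory key list
and a single-character delimiter. Listing and list_delimited() are
hypothetical names, and marker handling is left out. Each distinct common
prefix consumes one slot of max_keys, exactly like an object key.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Listing {
  std::vector<std::string> objects;
  std::vector<std::string> common_prefixes;
};

Listing list_delimited(const std::vector<std::string> &sorted_keys,
                       char delim, std::size_t max_keys) {
  Listing out;
  for (const std::string &key : sorted_keys) {
    std::size_t pos = key.find(delim);
    if (pos != std::string::npos) {
      std::string cp = key.substr(0, pos + 1);
      // Keys sharing a prefix are adjacent in sorted order, so comparing
      // with the last recorded prefix is enough to count it only once.
      if (!out.common_prefixes.empty() && out.common_prefixes.back() == cp)
        continue;
      if (out.objects.size() + out.common_prefixes.size() >= max_keys)
        break;
      out.common_prefixes.push_back(cp);
    } else {
      if (out.objects.size() + out.common_prefixes.size() >= max_keys)
        break;
      out.objects.push_back(key);
    }
  }
  return out;
}
```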

rgw: add NextMarker param for bucket listing

Partially fixes #8858.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 924686f0b6593deffcd1d4e80ab06b1e7af00dcb)

rgw: improve delimited listing of bucket

If we find a prefix, calculate a string greater than it so that the next
request can skip to that string. This is still not the most efficient way to
do it. It'll be better to push it down to the objclass, but that'll
require a much bigger change.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit e6cf618c257f26f97f60a4c1df1d23a14496cab0)
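
One possible way to compute "a string greater than the prefix" is to
increment its last byte. This is an assumption about the general technique,
not a copy of the rgw implementation, and after_prefix() is a hypothetical
name. The result sorts after every key that starts with the prefix, so it
can serve as an inclusive lower bound for the next request.

```cpp
#include <cassert>
#include <string>

// Returns the smallest string that sorts after every key starting with
// `prefix`. Returns an empty string when no such string exists, i.e. the
// prefix is empty or consists only of 0xff bytes.
std::string after_prefix(std::string prefix) {
  while (!prefix.empty()) {
    unsigned char last = static_cast<unsigned char>(prefix.back());
    if (last != 0xff) {
      prefix.back() = static_cast<char>(last + 1);
      return prefix;
    }
    prefix.pop_back();  // 0xff cannot be incremented; carry to the left
  }
  return prefix;
}
```

For the prefix "a/" this yields "a0", because '0' is the byte that follows
'/' in ASCII.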

rgw: don't try to wait for pending if list is empty

Fixes: #8846
Backport: firefly, dumpling

This was broken at ea68b9372319fd0bab40856db26528d36359102e. We ended
up calling wait_pending_front() when pending list was empty.
This commit also moves the need_to_wait check to a different place,
where we actually throttle (and not just drain completed IOs).

Reported-by: Sylvain Munaut <s.munaut@whatever-company.com>
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit f9f2417d7db01ecf2425039539997901615816a9)

Use new git mirror for qemu-iotests

Fixes: 8191
Signed-off-by: Warren Usui <warren.usui@inktank.com>
(cherry picked from commit ddf37d903f826f3e153d8009c716780453b68b05)

Support latest qemu iotest code

Modified qemu-iotests workunit script to check for versions
that use the latest qemu (currently only Trusty). Limit the
tests to those that are applicable to rbd.

Fixes: 7882
Signed-off-by: Warren Usui <warren.usui@inktank.com>
(cherry picked from commit 606e725eb5204e76e602d26ffd113e40af2ee812)

librbd: skip zeroes when copying an image

This is the simple coarse-grained solution, but it works well in
common cases like a small base image resized with a bunch of empty
space at the end. Finer-grained sparseness can be copied by using rbd
{export,import}-diff.

Fixes: #6257
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 824da2029613a6f4b380b6b2f16a0bd0903f7e3c)
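
The coarse-grained approach can be pictured as a per-chunk zero check. The
sketch below works on in-memory chunks and uses hypothetical helper names;
the real copy goes through librbd I/O, which is not modelled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// True when every byte in the chunk is zero.
bool is_zero(const std::vector<char> &chunk) {
  return std::all_of(chunk.begin(), chunk.end(),
                     [](char c) { return c == 0; });
}

// Copies `src` chunk by chunk and returns how many chunks were written.
// All-zero chunks are skipped, so the destination stays sparse there.
std::size_t copy_skipping_zeroes(const std::vector<std::vector<char>> &src,
                                 std::vector<std::vector<char>> &dst) {
  std::size_t written = 0;
  dst.assign(src.size(), std::vector<char>());  // empty = never written
  for (std::size_t i = 0; i < src.size(); ++i) {
    if (is_zero(src[i]))
      continue;
    dst[i] = src[i];
    ++written;
  }
  return written;
}
```

A chunk containing even one non-zero byte is copied whole, which is why
finer-grained sparseness needs a different tool, as the message notes.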

Revert "qa/workunits/suites/fsx.sh: don't use zero range"

This reverts commit 583e6e3ef7f28bf34fe038e8a2391f9325a69adf.

We're using a different fsx source, which doesn't support the
same options as our git-based one does.

Signed-off-by: Greg Farnum <greg@inktank.com>

qa/workunits/suites/fsx.sh: don't use zero range

Zero range is not supported by cephfs.

Fixes: #8542
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 2dec8a810060f65d022c06e82090b4aa5ccec0cb)

Merge pull request #2014 from ceph/wip-scrub-dumpling

osd: scrub priority updates for dumpling

Reviewed-by: Loic Dachary <loic@dachary.org>

rgw: allocate enough space for bucket instance id

Fixes: #8608
Backport: dumpling, firefly
Bucket instance id is a concatenation of zone name, rados instance id,
and a running counter. We need to allocate enough space to account for the
zone name length.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit d2e86a66ca55685e04ffbfaa58452af59f381277)
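
To illustrate the sizing concern, the sketch below derives the buffer size
from the zone name length instead of using a fixed-size buffer. The
"<zone>.<instance>.<counter>" layout and the function name are assumptions
made for this example; the exact format used by rgw may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Formats "<zone>.<instance>.<counter>" into a buffer sized from the
// actual zone name length.
std::string make_instance_id(const std::string &zone, uint64_t instance,
                             uint64_t counter) {
  // Two 64-bit values need at most 20 digits each, plus two dots and a NUL.
  std::vector<char> buf(zone.size() + 2 * 20 + 3);
  std::snprintf(buf.data(), buf.size(), "%s.%llu.%llu", zone.c_str(),
                static_cast<unsigned long long>(instance),
                static_cast<unsigned long long>(counter));
  return std::string(buf.data());
}
```

With this sizing a long zone name cannot truncate the counter at the end of
the id.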

ceph-disk: partprobe before settle when preparing dev

Two users have reported this fixes a problem with using --dmcrypt.

Fixes: #6966
Tested-by: Eric Eastman <eric0e@aol.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0f196265f049d432e399197a3af3f90d2e916275)

osd: fix filestore perf stats update

Update the struct we are about to send, not the (unlocked!) one we will
send the next time around.

Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4afffb4a10a0bbf7f2018ef3ed6b167c7921e46b)

common/WorkQueue: allow io priority to be set for wq

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5e4b3b1f1cb870f39fc7cfb3adeae93e078d9057)

Conflicts:
src/common/WorkQueue.cc

common/Thread: allow io priority to be set for a Thread

Ideally, set this before starting the thread. If you set it after, we
could potentially race with create() itself.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 01533183e7455b713640e001962339907fb6f980)

common/io_priority: wrap ioprio_set() and gettid()

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 705713564bebd84ad31cc91698311cf2fbd51a48)

Conflicts:
src/common/Makefile.am

osd: introduce simple sleep during scrub

This option is similar to osd_snap_trim_sleep: simply inject an optional
sleep in the thread that is doing scrub work. This is a very kludgey and
coarse knob for limiting the impact of scrub on the cluster, but can help
until we have a more robust and elegant solution.

Only sleep if we are in the NEW_CHUNK state to avoid delaying processing of
an in-progress chunk. In this state nothing is blocked on anything.
Conveniently, chunky_scrub() requeues itself for each new chunk.

Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c4e8451cc5b4ec5ed07e09c08fb13221e31a7ac6)

Merge pull request #1963 from dachary/wip-8599-ruleset-dumpling

mon: pool set <pool> crush_ruleset must not use rule_exists (dumpling)

Reviewed-by: Sage Weil <sage@inktank.com>

mon: pool set <pool> crush_ruleset must not use rule_exists

Implement CrushWrapper::ruleset_exists that iterates over the existing
rulesets to find the one matching the ruleset argument.

ceph osd pool set <pool> crush_ruleset must not use
CrushWrapper::rule_exists, which checks for a *rule* existing, whereas
the value being set is a *ruleset*. (cherry picked from commit
fb504baed98d57dca8ec141bcc3fd021f99d82b0)

A test via ceph osd pool set data crush_ruleset verifies the ruleset
argument is accepted.

http://tracker.ceph.com/issues/8599 fixes: #8599

Backport: firefly, emperor, dumpling
Signed-off-by: John Spray <john.spray@inktank.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit d02d46e25080d5f7bb8ddd4874d9019a078b816b)

Conflicts:
src/mon/OSDMonitor.cc
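
The difference between the two checks can be shown with a toy rule table.
This sketch is hypothetical and does not mirror CrushWrapper internals:
rules are kept in a map keyed by rule id, and each rule records the ruleset
it belongs to.

```cpp
#include <cassert>
#include <map>

struct Rule {
  int ruleset;
};

// Checks whether a rule with this *rule id* exists.
bool rule_exists(const std::map<int, Rule> &rules, int rule_id) {
  return rules.count(rule_id) > 0;
}

// Checks whether any rule belongs to this *ruleset*, by iterating over
// all existing rules.
bool ruleset_exists(const std::map<int, Rule> &rules, int ruleset) {
  for (const auto &entry : rules) {
    if (entry.second.ruleset == ruleset)
      return true;
  }
  return false;
}
```

With a single rule whose id is 0 and whose ruleset is 3, validating the
value 3 with rule_exists() would wrongly reject it, while ruleset_exists()
accepts it.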

osd: 'status' admin socket command

Basic stuff, like what state is the OSD in, and what osdmap epoch are
we on.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 09099c9e4c7d2aa31eb8a0b7c18e43272fae7ce2)

init-ceph: continue after failure doing osd data mount

If we are starting many daemons and hit an error, we normally note it and
move on. Do the same when doing the pre-mount step.

Fixes: #8554
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6a7e20147cc39ed4689809ca7d674d3d408f2a17)

rgw: cut short object read if a chunk returns error

Fixes: #8289
Backport: firefly, dumpling
When reading an object, if we hit an error when trying to read one of
the rados objects then we should just stop. Otherwise we're just going
to continue reading the rest of the object, and since it can't be sent
back to the client (as we have a hole in the middle), we end up
accumulating everything in memory.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 03b0d1cfb7bd30a77fedcf75eb06476b21b14e95)

Merge pull request #1931 from ceph/wip-7068-dumpling

Wip 7068 dumpling

Reviewed-by: Sage Weil <sage@inktank.com>

Merge remote-tracking branch 'origin/wip-8269-dumpling' into dumpling

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

doc: Added requiretty comment to preflight checklist.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

doc: Added Disable requiretty to quick start.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>

ReplicatedPG: lock snapdir obc during write

Otherwise, we won't block properly in prep_push_backfill_object.

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit b87bc2311aa4da065477f402a869e2edc1558e2f)

Conflicts:
src/osd/ReplicatedPG.h

0.67.9

msg: Fix inconsistent message sequence negotiation during connection reset

Backport: firefly, emperor, dumpling

Signed-off-by: Guang Yang (yguang@yahoo-inc.com)
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit bdee119076dd0eb65334840d141ccdf06091e3c9)

OSD::handle_pg_query: on dne pg, send lb=hobject_t() if deleting

We will set lb=hobject_t() if we resurrect the pg. In that case,
we need to have sent that to the primary beforehand. If we
finish the removal before the pg is recreated, we'll just end
up backfilling it, which is ok since the pg doesn't exist anyway.

Fixes: #7740
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 04de781765dd5ac0e28dd1a43cfe85020c0854f8)

Conflicts:

src/osd/OSD.cc

mon/MonClient: remove stray _finish_hunting() calls

Calling _finish_hunting() clears out the bool hunting flag, which means we
don't retry by connecting to another mon periodically.  Instead, we send
keepalives every 10s.  But, since we aren't yet in state HAVE_SESSION, we
don't check that the keepalives are getting responses.  This means that an
ill-timed connection reset (say, after we get a MonMap, but before we
finish authenticating) can drop the monc into a black hole that does not
retry.

Instead, we should *only* call _finish_hunting() when we complete the
authentication handshake.

Fixes: #8278
Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 77a6f0aefebebf057f02bfb95c088a30ed93c53f)

Merge pull request #1826 from ceph/wip-8162-dumpling

Wip 8162 dumpling

Reviewed-by: Sage Weil <sage@inktank.com>

OSD: fix an osdmap_subscribe interface misuse

When calling osdmap_subscribe, you have to pass an epoch newer than the
current map's. _maybe_boot() was not doing this correctly -- we would
fail a check for being *in* the monitor's existing map range, and then
pass along the map prior to the monitor's range. But if we were exactly
one behind, that value would be our current epoch, and the request would
get dropped. So instead, make sure we are not *in contact* with the monitor's
existing map range.

Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 290ac818696414758978b78517b137c226110bb4)
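
The off-by-one described above can be illustrated with a minimal sketch. It is not the actual OSD code: the helper names and the exact form of the old comparison are assumptions made for illustration, and the only rule taken from the message is that the requested epoch must be newer than the current map's.

```cpp
#include <cstdint>

using epoch_t = std::uint32_t;

// Hypothetical reading of the old behavior: only request current + 1 when we
// are strictly inside the monitor's range, otherwise ask for the map just
// before the range. Assumes oldest >= 1.
inline epoch_t buggy_subscribe_epoch(epoch_t current, epoch_t oldest) {
  return current > oldest ? current + 1 : oldest - 1;
}

// Hypothetical fixed variant: also treat "touching" the range
// (current + 1 >= oldest) as close enough to request current + 1.
inline epoch_t fixed_subscribe_epoch(epoch_t current, epoch_t oldest) {
  return current + 1 >= oldest ? current + 1 : oldest - 1;
}
```

With current = 9 and oldest = 10, the first helper returns 9, which equals the current epoch and so would be dropped, while the second returns 10.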

Merge pull request #1827 from ceph/wip-6565-dumpling

Wip 6565 dumpling

Reviewed-by: Sage Weil <sage@inktank.com>

OSD: check for splitting when processing recover/backfill reservations

Fixes: #6565
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 15ec5332ba4154930a0447e2bcf1acec02691e97)

ReplicatedPG::recover_backfill: do not update last_backfill prematurely

Previously, we would update last_backfill on the backfill peer to

backfills_in_flight.empty() ? backfill_pos :
  backfills_in_flight.begin()->first

which is actually the next backfill to complete.  We want to update
last_backfill to the largest completed backfill instead.

We use the pending_backfill_updates mapping to identify the most
recently completed backfill.  Due to the previous patch, deletes
will also be included in that mapping.

Related sha1s from master:
4139e75d63b0503dbb7fea8036044eda5e8b7cf1
7a06a71e0f2023f66d003dfb0168f4fe51eaa058

We don't really want to backport those due to the changes in:
9ec35d5ccf6a86c380865c7fc96017a1f502560a

This patch does essentially the same thing, but using backfill_pos.

Fixes: #8162
Signed-off-by: Samuel Just <sam.just@inktank.com>
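
A simplified sketch of the idea, not the ReplicatedPG code itself: objects are modeled as plain ints, backfills_in_flight as a std::set, and the struct and function names are hypothetical. It advances last_backfill only across entries in pending_backfill_updates that sort before the next backfill still to complete.

```cpp
#include <map>
#include <set>

// Hypothetical, simplified state; real objects are hobject_t, not int.
struct BackfillState {
  int last_backfill;
  int backfill_pos;
  std::set<int> backfills_in_flight;
  std::map<int, int> pending_backfill_updates;  // object -> stat delta
};

// Move last_backfill to the largest completed backfill and drop the
// pending entries that have been accounted for.
inline int advance_last_backfill(BackfillState& s) {
  const int next_to_complete = s.backfills_in_flight.empty()
                                   ? s.backfill_pos
                                   : *s.backfills_in_flight.begin();
  auto it = s.pending_backfill_updates.begin();
  while (it != s.pending_backfill_updates.end() &&
         it->first < next_to_complete) {
    s.last_backfill = it->first;  // this object has finished backfilling
    it = s.pending_backfill_updates.erase(it);
  }
  return s.last_backfill;
}
```

If nothing has completed since the last call, last_backfill is left unchanged rather than jumping ahead to an object that is still in flight.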

ReplicatedPG: add empty stat when we remove an object in recover_backfill

Subsequent updates to that object need to have their stats added
to the backfill info stats atomically with the last_backfill
update.

Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit ecddd12b01be120fba87f5ac60539f98f2c69a28)

rgw: don't error out on empty owner when setting acls

Fixes: #6892
Backport: dumpling, emperor
s3cmd specifies empty owner field when trying to set acls on object
/ bucket. We errored out as it didn't match the current owner name, but
with this change we ignore it.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 14cf4caff58cc2c535101d48c53afd54d8632104)

rgw: send user manifest header field

Fixes: #8170
Backport: firefly
If user manifest header exists (swift) send it as part of the object
header data.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 5cc5686039a882ad345681133c9c5a4a2c2fd86b)

client: add asok command to kick sessions that were remote reset

Fixes: #8021
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
(cherry picked from commit 09a1bc5a4601d356b9cc69be8541e6515d763861)

osd: throttle snap trimming with simple delay

This is not particularly smart, but it is *a* knob that lets you make
the snap trimmer slow down. It's a float and a simple delay, so it is
adjustable at runtime. Default is 0 (no change in behavior).

Partial solution for #6278.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4413670d784efc2392359f0f22bca7c9056188f4)
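
A minimal sketch of a delay-based throttle of this kind, assuming a hypothetical helper and parameter names (trim_snaps_throttled, sleep_seconds); it does not reproduce the OSD's actual trimmer or its option name.

```cpp
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: trims each snap id in turn and pauses after each one.
// sleep_seconds <= 0 disables the pause, mirroring a default of 0.
inline int trim_snaps_throttled(const std::vector<int>& snaps,
                                double sleep_seconds,
                                const std::function<void(int)>& trim_one) {
  int trimmed = 0;
  for (int snap : snaps) {
    trim_one(snap);
    ++trimmed;
    if (sleep_seconds > 0) {
      std::this_thread::sleep_for(
          std::chrono::duration<double>(sleep_seconds));
    }
  }
  return trimmed;
}
```

Because the delay is read on every iteration, a caller that passes the current value of a runtime setting gets the adjustable behavior the message describes.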

PG: only complete replicas should count toward min_size

Backport: emperor,dumpling,cuttlefish
Fixes: #7805
Signed-off-by: Samuel Just <sam.just@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0d5d3d1a30685e7c47173b974caa12076c43a9c4)

rgw: don't allow multiple writers to same multiobject part

Fixes: #8269
A client might need to retry a multipart part write. The original thread
might race with the new one, trying to clean up after it, clobbering the
part's data.
The fix is to detect whether an original part already existed, and if so
use a different part name for it.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
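
The naming idea can be sketched as follows. The function name and the ".retry" suffix scheme are hypothetical and are not taken from the rgw code; the sketch also ignores that the existence check and the write would need to be atomic in a real system.

```cpp
#include <set>
#include <string>

// Hypothetical helper: return `base` if no such part exists yet, otherwise
// pick the first unused alternative name so a retried write does not share
// an object with the original writer.
inline std::string choose_part_name(const std::set<std::string>& existing,
                                    const std::string& base) {
  if (existing.count(base) == 0) {
    return base;
  }
  for (int attempt = 1;; ++attempt) {
    std::string candidate = base + ".retry" + std::to_string(attempt);
    if (existing.count(candidate) == 0) {
      return candidate;
    }
  }
}
```

With this scheme, cleanup by the original writer touches only the original name, so it cannot clobber data written under the alternative name.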

mon/PGMonitor: set tid on no-op PGStatsAck

The OSD needs to know the tid. Both generally, and specifically because
the flush_pg_stats may be blocking on it.

Fixes: #8280
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 5a6ae2a978dcaf96ef89de3aaa74fe951a64def6)

0.67.8