Loic Dachary [Wed, 18 Mar 2015 23:32:39 +0000 (00:32 +0100)]
doc,tests: force checkout of submodules
When updating submodules, always checkout even if the HEAD is the
desired commit hash (update --force) to avoid the following:
* a directory gmock exists in hammer
* a submodule gmock replaces the directory gmock in master
* checkout master + submodule update : gmock/.git is created
* checkout hammer : the gmock directory still contains the .git from
master, because that .git did not exist in hammer and checkout won't
remove untracked directories
* checkout master + submodule update : git rev-parse HEAD is
at the desired commit although the content of the gmock directory
is from hammer
Greg Farnum [Fri, 6 Feb 2015 05:12:17 +0000 (21:12 -0800)]
fsync-tester: print info about PATH and locations of lsof lookup
We're seeing the lsof invocation fail (as not found) in testing and nobody can
identify why. Since attempting to reproduce the issue has not worked, this
patch will gather data from a genuinely in-vitro location.
Jason Dillaman [Wed, 4 Feb 2015 10:07:32 +0000 (05:07 -0500)]
osdc: Constrain max number of in-flight read requests
Constrain the number of in-flight RADOS read requests to the
cache size. This reduces the chance of the cache memory
ballooning during certain scenarios like copy-up which can
invoke many concurrent read requests.
Fixes: #9854
Backport: giant, firefly, dumpling
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
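Conceptually the constraint is a byte throttle on reads. A minimal sketch of the idea follows (not the actual ObjectCacher code; ReadThrottle and its methods are made-up names): a new read waits until the bytes already in flight plus its own length fit under the configured cache size.

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    // Hypothetical throttle: block a new read until it fits under max_bytes.
    // (A real implementation would also clamp 'len' to max_bytes so a single
    // oversized read cannot wait forever.)
    class ReadThrottle {
      std::mutex m;
      std::condition_variable cv;
      const uint64_t max_bytes;   // e.g. the configured cache size
      uint64_t in_flight = 0;     // bytes currently being read

    public:
      explicit ReadThrottle(uint64_t max) : max_bytes(max) {}

      void start_read(uint64_t len) {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return in_flight + len <= max_bytes; });
        in_flight += len;
      }

      void finish_read(uint64_t len) {
        std::lock_guard<std::mutex> l(m);
        in_flight -= len;
        cv.notify_all();
      }
    };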
Loic Dachary [Fri, 30 Jan 2015 17:49:16 +0000 (18:49 +0100)]
osd: handle no-op write with snapshot case
If we have a transaction that does something to the object but it !exists
both before and after, we will continue through the write path. If the
snapdir object already exists, and we try to create it again, we will
leak a snapdir obc.
Fix is to not recreate the snapdir if it already exists.
Fixes: #10262
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Loic Dachary <ldachary@redhat.com>
(cherry picked from commit 02fae9fc54c10b5a932102bac43f32199d4cb612)
Sage Weil [Sun, 2 Nov 2014 22:27:12 +0000 (14:27 -0800)]
JournalingObjectStore: journal->committed_thru after replay
It's possible that the osd stopped between when the filestore
op_seq file was updated and when the journal was trimmed. In
that case, it's possible that on boot the journal might be
full, and yet not be trimmed because commit_start assumes
there is no work to do. Calling committed_thru on the journal
ensures that the journal matches committed_seq.
Backport: giant firefly emperor dumpling
Fixes: #6756
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 8924158df8580bbb462130b5bb7f753b13513a23)
Jason Dillaman [Mon, 19 Jan 2015 15:28:56 +0000 (10:28 -0500)]
librbd: gracefully handle deleted/renamed pools
snap_unprotect and list_children both attempt to scan all
pools. If a pool is deleted or renamed during the scan,
the methods would previously return -ENOENT. Both methods
have been modified to more gracefully handle this condition.
Fixes: #10270, #10122
Backport: giant, firefly
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 436923c68b77c900b7774fbef918c0d6e1614a36)
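The pattern can be sketched with the public librados C++ API (this is an illustration, not the librbd code; scan_pools is a hypothetical helper): a pool that vanishes between listing and ioctx_create() is skipped rather than failing the whole scan.

    #include <rados/librados.hpp>
    #include <cerrno>
    #include <list>
    #include <string>

    // Hypothetical helper: walk all pools, tolerating pools that are deleted
    // or renamed while the scan is in progress.
    int scan_pools(librados::Rados& rados) {
      std::list<std::string> pools;
      int r = rados.pool_list(pools);
      if (r < 0)
        return r;
      for (const auto& name : pools) {
        librados::IoCtx ioctx;
        r = rados.ioctx_create(name.c_str(), ioctx);
        if (r == -ENOENT)
          continue;        // pool disappeared mid-scan: skip it, don't fail
        if (r < 0)
          return r;        // any other error remains fatal
        // ... inspect the pool (e.g. look for child images) ...
      }
      return 0;
    }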
Jason Dillaman [Tue, 18 Nov 2014 02:49:26 +0000 (21:49 -0500)]
librbd: protect list_children from invalid child pool IoCtxs
While listing child images, don't ignore error codes returned
from librados when creating an IoCtx. This will prevent seg
faults from occurring when an invalid IoCtx is used.
Fixes: #10123
Backport: giant, firefly, dumpling
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 0d350b6817d7905908a4e432cd359ca1d36bab50)
Loic Dachary [Thu, 9 Oct 2014 16:52:17 +0000 (18:52 +0200)]
ceph-disk: run partprobe after zap
Not running partprobe after zapping a device can lead to the following:
* ceph-disk prepare /dev/loop2
* links are created in /dev/disk/by-partuuid
* ceph-disk zap /dev/loop2
* links are not removed from /dev/disk/by-partuuid
* ceph-disk prepare /dev/loop2
* some links are not created in /dev/disk/by-partuuid
The underlying assumption is that there is a bug in the way udev events
are handled by the operating system.
Loic Dachary [Fri, 10 Oct 2014 08:23:34 +0000 (10:23 +0200)]
ceph-disk: encapsulate partprobe / partx calls
Add the update_partition function to reduce code duplication.
The action is made an argument, although for now it is always -a, because
it will be -d when deleting a partition.
Use the update_partition function in prepare_journal_dev
Josh Durgin [Wed, 12 Nov 2014 02:16:02 +0000 (18:16 -0800)]
qa: allow small allocation diffs for exported rbds
The local filesystem may behave slightly differently. This isn't
foolproof, but seems to be reliable enough on rhel7 rootfs, where
exact comparison was failing.
Samuel Just [Tue, 23 Sep 2014 22:52:08 +0000 (15:52 -0700)]
ReplicatedPG: don't move on to the next snap immediately
If we have a bunch of trimmed snaps for which we have no
objects, we'll spin for a long time. Instead, requeue.
Fixes: #9487
Backport: dumpling, firefly, giant
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit c17ac03a50da523f250eb6394c89cc7e93cb4659)
Sage Weil [Tue, 23 Sep 2014 23:21:33 +0000 (16:21 -0700)]
osd: initialize purged_snap on backfill start; restart backfill if change
If we backfill a PG to a new OSD, we currently neglect to initialize
purged_snaps. As a result, the first time the snaptrimmer runs it has to
churn through every deleted snap for all time, and to make matters worse
does so in one go with the PG lock held. This leads to badness on any
cluster with a significant number of removed snaps that experiences
backfill.
Resolve this by initializing purged_snaps when we finish backfill. The
backfill itself will clear out any stray snaps and ensure the object set
is in sync with purged_snaps. Note that purged_snaps on the primary
that is driving backfill will not change during this period as the
snaptrimmer is not scheduled unless the PG is clean (which it won't be
during backfill).
If by chance we interrupt backfill, go clean with other OSDs,
purge snaps, and then let this OSD rejoin, we will either restart
backfill (non-contiguous log) or the log will include the result of
the snap trim (the events that remove the trimmed snap).
Samuel Just [Mon, 29 Sep 2014 23:26:54 +0000 (16:26 -0700)]
ReplicatedPG: do not queue the snap trimmer constantly
Previously, we continuously requeued the snap trimmer while in
TrimmingObjects. This is not a good idea now that we try to
limit the number of snap trimming repops in flight and requeue
the snap trimmer directly as those repops complete.
Fixes: #9113
Backport: giant, dumpling, firefly
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 34f38b68d89baf1dcbb4571d4f4d3076dc354538)
Greg Farnum [Thu, 23 Oct 2014 00:16:31 +0000 (17:16 -0700)]
client: cast m->get_client_tid() to compare to 16-bit Inode::flushing_cap_tid
m->get_client_tid() is 64 bits (as it should be), but Inode::flushing_cap_tid
is only 16 bits. 16 bits should be plenty to let the cap flush updates
pipeline appropriately, but we need to cast in the proper direction when
comparing these differently-sized versions. So downcast the 64-bit one
to 16 bits.
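Illustrative only (flush_acked is a made-up name, the types are simplified, and whether the real call site compares with == or >= is not asserted here): the point is that the 64-bit tid is downcast so the comparison happens at the 16-bit width that Inode::flushing_cap_tid actually stores.

    #include <cstdint>

    // Downcast the wide message tid before comparing; comparing at 64 bits
    // would go wrong once the low 16 bits have wrapped.
    bool flush_acked(uint64_t msg_client_tid, uint16_t flushing_cap_tid) {
      return static_cast<uint16_t>(msg_client_tid) >= flushing_cap_tid;
    }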
Xiaoxi Chen [Wed, 20 Aug 2014 07:35:44 +0000 (15:35 +0800)]
CrushWrapper: pick a ruleset same as rule_id
Originally, in the add_simple_ruleset function, the ruleset_id
is not reused but the rule_id is reused. So after some add/remove
operations against rules, a newly created rule is likely to have
ruleset != rule_id.
We don't want this to happen because we are trying to maintain the
invariant that ruleset == rule_id.
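A toy illustration of the invariant (not the CrushWrapper code; the container and names are invented): when reusing a free rule id, assign the ruleset the same number, so ruleset == rule_id survives add/remove churn.

    #include <optional>
    #include <vector>

    struct Rule { int rule_id; int ruleset; };

    // Hypothetical add: reuse the first free rule slot for the rule id and
    // set ruleset to the same value, preserving ruleset == rule_id.
    int add_simple_rule(std::vector<std::optional<Rule>>& rules) {
      int rule_id = 0;
      while (rule_id < static_cast<int>(rules.size()) && rules[rule_id])
        ++rule_id;
      if (rule_id == static_cast<int>(rules.size()))
        rules.emplace_back();
      rules[rule_id] = Rule{rule_id, /*ruleset=*/rule_id};
      return rule_id;
    }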
Greg Farnum [Thu, 9 Oct 2014 22:12:19 +0000 (15:12 -0700)]
mds: Locker: fix a NULL deref in _update_cap_fields
The MClientCaps* is allowed to be NULL, so we can't deref it unless
the dirty param is non-zero. So don't do the ahead-of-time lookup;
just call it explicitly in the if block.
David Zafman [Wed, 11 Dec 2013 01:29:48 +0000 (17:29 -0800)]
osd: Fix problems in ReplicatedPG::do_op() logic
Fix assert(is_degraded_object(soid)) in ReplicatedPG::wait_for_degraded_object()
Use last_backfill_started as the backfill line
Handle uncommon case of multi op source after backfill line and target before
backfill line and !is_degraded_object().
Include backfill line itself for before_backfill (<= instead of <)
We can't adjust last_backfill to object x until x has been fully
backfilled. pending_backfill_updates contains all those backfills
started, but which have not yet been reflected in pinfo.last_update.
backfills_in_flight contains those backfills which have not yet
completed. Thus, we can adjust last_backfill to the largest entry
in pending_backfill_updates not in backfills_in_flight.
Samuel Just [Mon, 28 Oct 2013 22:53:24 +0000 (15:53 -0700)]
ReplicatedPG: replace backfill_pos with last_backfill_started
last_backfill_started reflects what pinfo.last_backfill will be
once all currently outstanding backfills complete. backfill_pos
was tricky since we couldn't correctly initialize it without
doing the first backfill scan pair.
In recover_backfill, we rescan from last_backfill_started rather
than from backfill_pos. This ensures that we capture all clones
created between last_backfill_started and what previously had been
backfill_pos without special handling in make_writeable. The main
downside is that we will tend to "rescan" last_backfill_started.
Jason Dillaman [Sun, 7 Sep 2014 02:59:40 +0000 (22:59 -0400)]
Enforce cache size on read requests
In-flight cache reads were not previously counted against
new cache read requests, which could result in very large
cache usage. This effect is most noticeable when writing
small chunks to a cloned image since each write requires
a full object read from the parent.
Locker: accept ctime updates from clients without dirty write caps
The ctime changes any time the inode does. That can happen even without
the file itself having changed, so we'd better accept the update whenever
the auth caps have dirtied, without worrying about the file caps!
Fixes: #9514
Backport: firefly
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: John Spray <john.spray@redhat.com>
(cherry picked from commit 0ea20a668cf859881c49b33d1b6db4e636eda18a)
c70331db13bdbf1f967ece48cdbec28b97c3d19c backported some
CRUSH changes which necessitate backporting this as well, or
we get failures like successfully looking up an OSD with ID -1.
Reviewed-by: Greg Farnum <greg@inktank.com>
Loic Dachary [Sat, 26 Oct 2013 00:11:43 +0000 (02:11 +0200)]
common: rebuild_page_aligned sometimes rebuilds unaligned
rebuild_page_aligned relies on rebuild to create memory that is aligned
according to list::is_page_aligned(). However, when the bufferlist only
contains a single ptr and its size is not list::is_n_page_size(),
rebuild will not create the expected aligned bufferlist.
The allocation of the ptr is moved out of rebuild, which is now given the
ptr as an argument. The rebuild_page_aligned function always requires an
aligned ptr with buffer::create_page_aligned(_len) for consistency.
demonstrated the problem. It was assumed to be a feature but should have
been identified as a bug. The last line is replaced with
EXPECT_TRUE(bl.is_page_aligned());
Most tests related to is_page_aligned() wrongly assumed that
bufferptr ptr(2);
is never page aligned. Most of the time it is not, but sometimes it is,
when the pointer address is by chance on a CEPH_PAGE_SIZE boundary,
which triggered #6614. Non-aligned ptrs are created as follows instead:
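The alignment behaviour described above can be shown with a small self-contained sketch (this is not the bufferlist code and not the snippet the commit refers to; PAGE_SIZE_ stands in for CEPH_PAGE_SIZE): a plain heap allocation is page aligned only when its address happens to land on a page boundary, while posix_memalign(), like buffer::create_page_aligned(), guarantees it.

    #include <cstdint>
    #include <cstdlib>

    static const std::size_t PAGE_SIZE_ = 4096;  // stand-in for CEPH_PAGE_SIZE

    // True only if the address happens to sit on a page boundary; for a small
    // heap buffer either outcome is possible, so tests must not assume one.
    bool is_page_aligned(const void* p) {
      return (reinterpret_cast<std::uintptr_t>(p) & (PAGE_SIZE_ - 1)) == 0;
    }

    // Guaranteed-aligned allocation, analogous to buffer::create_page_aligned().
    void* make_page_aligned(std::size_t len) {
      void* p = nullptr;
      if (posix_memalign(&p, PAGE_SIZE_, len) != 0)
        return nullptr;
      return p;
    }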
Sage Weil [Thu, 25 Sep 2014 20:16:52 +0000 (13:16 -0700)]
osdc/Objecter: only post_rx_buffer if no op timeout
If we post an rx buffer and there is a timeout, the revocation can happen
while the reader has consumed the buffers but before it has decoded and
constructed the message. In particular, we calculate a crc32c over the
data portion of the message after we've taken the buffers and dropped the
lock.
Instead of fixing this race (for example, by reverifying rx_buffers under
the lock while calculating the crc.. bleh), just skip the rx buffer
optimization entirely when a timeout is present.
Note that this doesn't cover the op_cancel() paths, but none of those users
provide static buffers to read into.
Sage Weil [Tue, 24 Sep 2013 18:22:19 +0000 (11:22 -0700)]
osd/ReplicatedPG: respect RWORDERED rados flag
If this flag is set, we need to order reads as writes. In particular, this
means that reads will wait for degraded object recovery even if there is a
local copy. And subsequently will be ordered after a preceding write that
is waiting for the same thing.
Greg Farnum [Thu, 22 May 2014 04:41:23 +0000 (21:41 -0700)]
cephfs-java: build against older jni headers
Older versions of the JNI interface expected non-const parameters
to their memory move functions. It's unpleasant, but won't actually
change the memory in question, to do a const_cast in order to satisfy
those older headers. (And even if it *did* modify the memory, that
would be okay given our single user.)
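A hedged example of the workaround (copy_region and its signature are invented stand-ins for the JNI memory-move call): the cast only strips constness to satisfy the older prototype; the source region is still only read.

    #include <cstddef>
    #include <cstring>

    // Older JNI headers would declare the source as a plain 'void *'; cast
    // away const at the call site so the same code builds against both
    // header versions without modifying the memory.
    void copy_region(void* dst, const void* src, std::size_t len) {
      void* unconst_src = const_cast<void*>(src);
      std::memcpy(dst, unconst_src, len);  // stand-in for the JNI move call
    }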
Samuel Just [Thu, 31 Oct 2013 20:19:32 +0000 (13:19 -0700)]
sharedptr_registry.hpp: removed ptrs need to not blast contents
See the included unit test update. Consider:
1) x = lookup_or_create(1, 1)
2) remove(1)
3) y = lookup_or_create(1, 2)
4) x.reset()
5) z = lookup(1)
The bug is that z will be null since x.reset() caused the
cleanup callback to remove y's key value from contents.
To fix this, contents also records the pointer value for
the weak_ptr. The removal callback only removes the
key from contents if it matches the ptr in contents.
This should work since the pointer passed to the removal
callback must be unique up to that point, as it has
not yet been deleted.
This allowed a pg removal -> pg recreation -> pg removal
sequence to cause the second pg removal entry to be
erroneously cleared by the first pg removal's destructor
as it finally made its way through the removal queue.
Fixes: #5951
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 28e4271267976ab8405a24d015f2fb50a2f82c49)
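A minimal sketch of the fix described above (generic types, no locking, not the Ceph class): the map records the raw pointer next to the weak_ptr, and the deleter erases the entry only when that recorded pointer is the one being destroyed, so a newer value registered under the same key is left intact.

    #include <map>
    #include <memory>
    #include <utility>

    template <typename K, typename V>
    class Registry {
      // key -> (weak_ptr to the live value, raw pointer that created it)
      std::map<K, std::pair<std::weak_ptr<V>, V*>> contents;

      void on_removal(const K& key, V* ptr) {
        auto it = contents.find(key);
        // only erase if the entry still refers to this exact pointer
        if (it != contents.end() && it->second.second == ptr)
          contents.erase(it);
      }

    public:
      std::shared_ptr<V> lookup_or_create(const K& key, const V& val) {
        auto it = contents.find(key);
        if (it != contents.end())
          if (auto sp = it->second.first.lock())
            return sp;
        V* raw = new V(val);
        std::shared_ptr<V> sp(raw, [this, key](V* p) {
          on_removal(key, p);  // no-op if the key was re-registered meanwhile
          delete p;
        });
        contents[key] = {sp, raw};
        return sp;
      }

      std::shared_ptr<V> lookup(const K& key) {
        auto it = contents.find(key);
        if (it == contents.end())
          return nullptr;
        return it->second.first.lock();
      }
    };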
Loic Dachary [Mon, 12 Aug 2013 12:05:38 +0000 (14:05 +0200)]
sharedptr_registry: add a variant of get_next() and the empty() method
The SharedPtrRegistry::get_next() method with a value of type VPtr
instead of V is added because it is sometimes more convenient to not
copy the value when walking the registry. The
SharedPtrRegistry::empty() predicate method is added.
A single counter ( waiting ) accurately reflects the number of
waiters, regardless of which method is waiting. It is enough to allow
unit tests to synthesize all situations, including:
T1: x = lookup_or_create(0)
T1: release x part 1 (weak_ptrs now fail to lock)
T2: y = lookup_or_create(0)
T2: block in lookup_or_create (waiting == 1)
T1: z = lookup_or_create(1) (does not block because the key is different)
while holding the lock it increments waiting (waiting == 2)
and decrements it before returning (waiting is back to == 1)
T1: complete release x
T2: complete lookup_or_create(0) (waiting == 0)
The unit tests are modified to add a lookup on an unrelated key to
demonstrate that it does not reset waiting counter.
Covers 100% of the LOC and all the expected behavior, including thread
safety.
The sharedptr_registry is made a friend of the test class so that it can
synthesize race conditions. The lookup and lookup_or_create methods
set the new in_method data member before calling cond.Wait() so that
the caller knows it is waiting.
The crash occurs due to ImageCtx->parent->parent being uninitialized,
since the initial open_parent() -> open_image(parent) ->
ictx_refresh(parent) occurs before ImageCtx->parent->snap_id is set,
so refresh_parent() is not called to open an ImageCtx for the parent
of the parent. This leaves the ImageCtx->parent->parent NULL, but the
rest of ImageCtx->parent updated to point at the correct parent snapshot.
Setting the parent->snap_id earlier has some unintended side effects
currently, so for now just call refresh_parent() during
open_parent(). This is the easily backportable version of the
fix. Further patches can clean up this whole initialization process.
Sage Weil [Wed, 10 Sep 2014 15:00:50 +0000 (08:00 -0700)]
test/cli-integration/rbd: fix trailing space
Newer versions of json.tool remove the trailing ' ' after the comma. Add
it back in with sed so that the .t works on both old and new versions, and
so that we don't have to remove the trailing spaces from all of the test
cases.
Sage Weil [Sat, 16 Aug 2014 19:42:33 +0000 (12:42 -0700)]
os/FileStore: fix mount/remount force_sync race
Consider:
- mount
- sync_entry is doing some work
- umount
- set force_sync = true
- set done = true
- sync_entry exits (due to done)
- ..but does not set force_sync = false
- mount
- journal replay starts
- sync_entry sees force_sync and does a commit while op_seq == 0
...crash...
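One way to picture the problem, as a heavily simplified sketch (member names abbreviated, not the FileStore code): the sync thread must not let a force_sync request outlive its exit, otherwise a later mount in the same process sees the stale flag and commits while op_seq is still 0.

    #include <atomic>

    struct SyncState {
      std::atomic<bool> force_sync{false};
      std::atomic<bool> done{false};
    };

    // Simplified sync loop: consume force_sync requests while running, and
    // clear any request that raced with shutdown so it cannot leak into a
    // later mount.
    void sync_entry(SyncState& s) {
      while (!s.done) {
        bool forced = s.force_sync.exchange(false);
        (void)forced;  // a real loop would commit here when forced or on timer
      }
      s.force_sync = false;  // do not leave a stale flag behind on exit
    }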
Sage Weil [Mon, 8 Sep 2014 20:44:57 +0000 (13:44 -0700)]
osdc/Objecter: revoke rx_buffer on op_cancel
If we cancel a read, revoke the rx buffers to avoid a use-after-free and/or
other undefined badness by using user buffers that may no longer be
present.
Fixes: #9362
Backport: firefly, dumpling
Reported-by: Matthias Kiefer <matthias.kiefer@1und1.de>
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 2305b2897acba38384358c33ca3bbfcae6f1c74e)
Josh Durgin [Mon, 11 Aug 2014 23:41:26 +0000 (16:41 -0700)]
librbd: fix error path cleanup for opening an image
If the image doesn't exist and caching is enabled, the ObjectCacher
was not being shutdown, and the ImageCtx was leaked. The IoCtx could
later be closed while the ObjectCacher was still running, resulting in
a segfault. Simply use the usual cleanup path in open_image(), which
works fine here.
Haomai Wang [Mon, 14 Jul 2014 06:27:17 +0000 (14:27 +0800)]
Add rbdcache max dirty object option
Librbd calculates the max number of dirty objects from rbd_cache_max_size,
which is not suitable for every case. If the user sets an image order of 24,
the calculated result is too small in practice, which increases the overhead
of the trim call invoked on each read/write op.
Now we make this a tunable option; by default the value is still calculated.
Danny Al-Gaaf [Wed, 4 Jun 2014 21:22:18 +0000 (23:22 +0200)]
librbd/internal.cc: check earlier for null pointer
Fix a potential null pointer deref: move the check for 'order != NULL'
to the beginning of the function to a) prevent a deref in the ldout() call
and b) leave the function as early as possible if the check fails.
[src/librbd/internal.cc:843] -> [src/librbd/internal.cc:865]: (warning)
Possible null pointer dereference: order - otherwise it is redundant
to check it against null.
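A hedged sketch of the reordering (get_image_order and its signature are made up for illustration): validate the out parameter before anything, including logging, can dereference it, and return early on failure.

    #include <cerrno>
    #include <iostream>

    // Check 'order' first; in the real function the ldout() call dereferenced
    // it, so the check must come before any logging or other work.
    int get_image_order(const char* image_name, int* order) {
      if (order == nullptr)
        return -EINVAL;
      std::cerr << "querying order for image " << image_name << "\n";
      *order = 22;  // placeholder for the real lookup
      return 0;
    }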