Revert "ceph-disk-activate: don't override default or configured osd journal path"

This reverts commit 813e9fe2b4291a1c1922ef78f031daa9b78fe53b.

We run --mkfs with the osd disk mounted in a temporary location, so it is
necessary to explicitly pass in these paths.

If we want to support journals in a different location, we need to make
ceph-disk-prepare update the journal symlink accordingly, not control it via
the config option.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3e628eee770508e750f64ea50179bbce52e7b8e0)

ceph-disk-activate: rely on default/configured keyring path

No reason to override the default or configured value here.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 936b8f20af1d390976097c427b6e92da4b39b218)

ceph-disk-activate: don't override default or configured osd journal path

There is no reason not to rely on the default or obey any configured
value here.

Fixes: #4031
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 813e9fe2b4291a1c1922ef78f031daa9b78fe53b)

ceph-disk-prepare: move in-use checks to the top, before zap

Move the in-use checks to the very top, before we (say) zap!

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 32407c994f309cd788bf13fe9af27e17a422309a)

ceph-disk-prepare: verify device is not in use by device-mapper

Be nice and tell the user which devices/mappings are consuming the device,
too.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit a6196de9e2f3ca9d67691f79d44e9a9f669443e9)

ceph-disk-prepare: verify device is not mounted before using

Make sure the data and/or journal device(s) are not in use (mounted)
before using them. Make room for additional "in-use" checks in the future.

Closes: #3256
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3bd0ac0ab011c4cdf0121f0d9732938d085fb8bf)
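
The mounted check described in the commit above can be sketched roughly as follows. This is not the ceph-disk-prepare code: the helper names `is_mounted` and `verify_not_in_use` and the sample mount table are hypothetical, and the sketch compares device names literally, whereas a real check would likely canonicalize paths first.

```python
def is_mounted(dev, mounts_text):
    """Return the mount point if dev appears in /proc/mounts-style text, else None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field 0 is the device, field 1 is the mount point.
        if len(fields) >= 2 and fields[0] == dev:
            return fields[1]
    return None


def verify_not_in_use(dev, mounts_text):
    """Raise if dev is mounted. Further in-use checks could be added here."""
    where = is_mounted(dev, mounts_text)
    if where is not None:
        raise RuntimeError("%s is mounted on %s" % (dev, where))


# Hypothetical mount table, in the same layout as /proc/mounts.
SAMPLE_MOUNTS = (
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
    "/dev/sdb1 /var/lib/ceph/osd/ceph-0 xfs rw,noatime 0 0\n"
)
```

Keeping the check in its own function leaves room for the additional in-use checks the commit message anticipates.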

ceph-disk-prepare: clean up stupid check for a digit

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f03f62697f170d42b4b62c53d2860ff2f24a2d73)

ceph-disk-prepare: use os.path.realpath()

My janky symlink resolution is broken in various ways.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 59505546e52a175435881b431bd349d532ae627e)
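
A small illustration of why os.path.realpath() is a safer choice than hand-rolled symlink resolution: it follows chains of links and relative link targets in one call. The helper name `resolve_device` and the temporary file layout are assumptions made for the example only.

```python
import os
import tempfile


def resolve_device(path):
    """Follow every symlink in path and return the canonical absolute path."""
    return os.path.realpath(path)


def demo():
    """Build link2 -> link1 -> target in a temp dir and resolve the chain."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target")
        open(target, "w").close()
        link1 = os.path.join(tmp, "link1")
        link2 = os.path.join(tmp, "link2")
        os.symlink("target", link1)  # relative link target
        os.symlink(link1, link2)     # link pointing at another link
        # Compare against the canonical target, since the temp dir itself
        # may sit behind a symlink on some systems.
        return resolve_device(link2) == os.path.realpath(target)
```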

ceph.spec.in: add new Requires from ceph-disk-prepare

Added new Requires from ceph-disk-prepare: cryptsetup, gptfdisk,
parted and util-linux.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
(cherry picked from commit 5c3f5c5b69a9edc99138d4f1ddb016689303dc28)

Conflicts:

ceph.spec.in

debian: require cryptsetup-bin

This is needed for ceph-disk-prepare's dmcrypt support.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit cfcaceac44d6b7b7c55e81d0bfb05f4893f3b1d0)

Conflicts:

debian/control

Fix: use absolute path with udev

Avoids the following: udevd[61613]: failed to execute '/lib/udev/bash'
'bash -c 'while [ ! -e /dev/mapper/....

Signed-off-by: Alexandre Marangone <alexandre.marangone@inktank.com>
(cherry picked from commit 785b25f53dc7f8035eeba2aae8a196e3b102d930)

ceph-disk-prepare: -f for mkfs.xfs only

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit fecc3c3abf1176f4c7938e161559ea2db59f1cff)

debian: fix start of ceph-all

Tolerate failure, and do ceph-all, not ceph-osd-all.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit aff0bb6fdc8ca358f7ac1e941bb9cfecbefb4bb6)

ceph-disk-prepare: always force mkfs.xfs

Signed-off-by: Alexandre Marangone <alexandre.marangone@inktank.com>
(cherry picked from commit d950d83250db3a179c4b629fd32cd7bc8149997e)

udev: trigger on dmcrypted osd partitions

Automatically map encrypted journal partitions.

For encrypted OSD partitions, map them, wait for the mapped device to
appear, and then ceph-disk-activate.

This is much simpler than doing the work in ceph-disk-activate.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e090a92a20f4161f473d16bc966f7d6aacac75ee)

ceph-disk-prepare: add initial support for dm-crypt

Keep keys in /etc/ceph/dmcrypt-keys.

Identify partition instances by the partition UUID. Identify encrypted
partitions by a parallel set of type UUIDs.

Signed-off-by: Alexandre Marangone <alexandre.maragone@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c6ac0ddf91915ba2aeae46d21367f017e18e82cd)

ceph-disk-activate: pull mount options from ceph.conf

Signed-off-by: Alexandre Marangone <alexandre.marangone@inktank.com>
(cherry picked from commit e7040f55f01db3de7d5cebfc79de50c8b6ad5d45)

ceph-disk-activate: use full paths for everything

We are run from udev, which doesn't get a decent PATH.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b1c0fccba42dd184a2891ee873c0d6d8f8c79d14)

ceph-disk-prepare: do partprobe after setting final partition type

This is necessary to kick udev into processing the updated partition and
running its rules.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 95835de9f80712eb26652ac6b66ba9c5eeb093d6)

debian: start/stop ceph-all event on install/uninstall

This helps us avoid the confusing situation with upstart where an individual
daemon job is running (like ceph-osd id=2) but the container jobs ceph-osd-all
and ceph-all are not.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b7b9af5c0d531dcee7ce9b10043a29b0a1b31f47)

ceph-disk-activate: catch daemon start errors

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 690ae05309db118fb3fe390a48df33355fd068a0)

udev: trigger ceph-disk-activate directly from udev

There is no need to depend on upstart for this.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5bd85ee5aa31bfd1f4f0e434f08c2a19414358ef)

Conflicts:

ceph.spec.in

ceph-disk-activate: auto detect init system

Look for an option 'init' in ceph.conf. Otherwise, check if we're ubuntu.
If so, use upstart. Otherwise, use sysvinit.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d1904b2a848af3c02d2065ac2a42abe0e2699d0f)
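
The detection order described in the commit above can be sketched as a small function. The name `choose_init` and its arguments are hypothetical; how the 'init' option is read from ceph.conf and how the distribution is identified are left out.

```python
def choose_init(conf_init, distro_id):
    """Pick an init system: explicit config first, then a distro-based default.

    conf_init: value of the 'init' option from ceph.conf, or None if unset.
    distro_id: distribution name as reported by the platform, e.g. "Ubuntu".
    """
    if conf_init:
        return conf_init
    if distro_id.lower() == "ubuntu":
        return "upstart"
    return "sysvinit"
```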

ceph-disk-activate: specify full path for blkid, initctl, service

/sbin apparently isn't in the path when udev runs us.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f06b45e66315310abb0720e021da377186455048)

upstart: ceph-hotplug -> ceph-osd-activate

This is a more meaningful name.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e011ad128e7f302cb6955d9a7171ac0ec8890ddf)

upstart/ceph-hotplug: tell activate to start via upstart

This will mark the OSD data dir as upstart-managed.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 792e45c63dd7a9622fddd6e15ee4c075f995ea56)

ceph-disk-prepare: refactor to support DIR, DISK, or PARTITION for data or journal

Lots of code reorganization collapsed into a single commit here.

- detect whether the user gave us a directory, disk, or partition, and Do The
Right Thing
- allow them to force that the input was of type X, for the careful/paranoid.
- make --zap-disk an option -- no longer the default

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b2ff6e8c9d96dee2c063b126de7030a5c2ae0d02)

ceph-disk-activate: detect whether PATH is mount or dir

remove in-the-way symlinks in /var/lib/ceph/osd

This is simpler. Just detect what the path is and Do The Right Thing.

Closes #3341 (which wanted to make --mount the default)

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 191d5f7535f8d96d493e1b35b43a421c67c168ea)

ceph-disk-activate: add --mark-init INITSYSTEM option

Do not assume we will manage via upstart; let that be passed down via the
command line.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit fd4a921085a861e4aa428376219bb39055731f2b)

ceph-disk-activate: factor mounting out of activate

The activate stuff is generic for any OSD, regardless of whether we want
to mount it or not. Pull that part out.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 07655288281c9c6f691f87352dc26b7c11ae07e8)

debian: put ceph-mds upstart conf in ceph-mds package

Fixes: #3157
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 23ad3a46a0099e263f43e0f0c1df1d21cfe58b3f)

debian: include /var/lib/ceph/bootstrap-mds in package

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e80675a0f333c04452d4822fd0eb3c6e92eda3df)

ceph-create-keys: create mds bootstrap key

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 809143f16c70483ba5bb429dea812d31b67f2b49)

upstart/ceph-hotplug: drop -- in ceph-disk-activate args

We would like to transition to

ceph-disk-activate --mount DEV

and away from a generic multi-definition PATH argument.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4698b6a1035dee8509ce2d4dab7b34a16b78f7cd)

init-ceph: iterate/locate local sysvinit-tagged directories

Search /var/lib/ceph/$type/ceph-$id and start/stop those daemons if
present and tagged with the sysvinit file.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c8f528a4070dd3aa0b25c435c6234032aee39b21)

init-ceph: consider sysvinit-tagged dirs as local

If there is a 'sysvinit' file in the daemon directory in the default
location (/var/lib/ceph/$type/ceph-$id), consider it sysvinit-managed.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b8aa4769a62e0d88174678cbefd89d9ee2baceea)
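
The directory scan described in the two init-ceph commits above could look roughly like this. init-ceph itself is a shell script, so this Python version is only an illustration; `find_sysvinit_daemons` and the demo layout are hypothetical.

```python
import os
import tempfile


def find_sysvinit_daemons(base):
    """Return (type, id) pairs for base/<type>/ceph-<id> dirs holding a 'sysvinit' file."""
    found = []
    if not os.path.isdir(base):
        return found
    for dtype in sorted(os.listdir(base)):
        type_dir = os.path.join(base, dtype)
        if not os.path.isdir(type_dir):
            continue
        for name in sorted(os.listdir(type_dir)):
            if not name.startswith("ceph-"):
                continue
            # The tag file marks the daemon as sysvinit-managed.
            if os.path.exists(os.path.join(type_dir, name, "sysvinit")):
                found.append((dtype, name[len("ceph-"):]))
    return found


def demo():
    """Create a throwaway tree standing in for /var/lib/ceph and scan it."""
    with tempfile.TemporaryDirectory() as base:
        layout = [("osd", "0", True), ("osd", "1", False), ("mon", "a", True)]
        for dtype, daemon_id, tagged in layout:
            path = os.path.join(base, dtype, "ceph-" + daemon_id)
            os.makedirs(path)
            if tagged:
                open(os.path.join(path, "sysvinit"), "w").close()
        return find_sysvinit_daemons(base)
```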

ceph-disk-prepare: align mkfs, mount config options with mkcephfs

'osd mkfs ...', not 'osd fs mkfs ...'. Sigh. Support both.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit af2372ca4a702da70275edd1b1357fcff51e6ae2)

init-ceph: fix (and simplify) pushing ceph.conf to remote unique name

The old code would only do the push once per remote node (due to the
list in $pushed_to) but would reset $unique on each attempt. This would
break if a remote host was processed twice.

Fix by just skipping the $pushed_to optimization entirely.

Fixes: #4794
Reported-by: Andreas Friedrich <andreas.friedrich@ts.fujitsu.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ccbc4dbc6edf09626459ca52a53a72682f541e86)

Merge pull request #238 from ceph/wip-bobtail-rbd-backports-req-order

Reviewed-by: Sage Weil <sage.weil@inktank.com>

rbd: only set STRIPINGV2 feature when needed

Only set the STRIPINGV2 feature if the striping parameters are non-default.
Specifically, fix the case where the passed-in size and count are == 0.

Fixes: #4710
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5926ffa576e9477324ca00eaec731a224195e7db)

Conflicts:

src/rbd.cc

LibrbdWriteback: complete writes strictly in order

RADOS returns writes to the same object in the same order. The
ObjectCacher relies on this assumption to make sure previous writes
are complete and maintain consistency. Reads, however, may be
reordered with respect to each other. When writing to an rbd clone,
reads to the parent must be performed when the object does not exist
in the child yet. These reads may be reordered, resulting in the
original writes being reordered. This breaks the assumptions of the
ObjectCacher, causing an assert to fail.

To fix this, keep a per-object queue of outstanding writes to an
object in the LibrbdWriteback handler, and finish them in the order in
which they were sent.

Fixes: #4531
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 06d05e5ed7e09fa873cc05021d16f21317a1f8ef)
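
The per-object queue idea from the commit above can be illustrated with a minimal sketch. The real fix lives in the C++ LibrbdWriteback handler; the class `OrderedWriteCompleter` and its methods here are hypothetical and only show the ordering technique: acknowledgements may arrive in any order, but callbacks fire strictly in submission order.

```python
from collections import deque


class OrderedWriteCompleter:
    """Finish writes to each object in the order they were submitted."""

    def __init__(self):
        self._queues = {}  # object name -> deque of [tid, done, callback]

    def submit(self, oid, tid, callback):
        """Record an outstanding write for object oid."""
        self._queues.setdefault(oid, deque()).append([tid, False, callback])

    def ack(self, oid, tid):
        """Mark a write as acknowledged, then complete whatever is ready."""
        queue = self._queues[oid]
        for entry in queue:
            if entry[0] == tid:
                entry[1] = True
                break
        # Only pop from the front, so a later write never completes
        # before an earlier one to the same object.
        while queue and queue[0][1]:
            _, _, callback = queue.popleft()
            callback()
        if not queue:
            del self._queues[oid]


def demo():
    finished = []
    completer = OrderedWriteCompleter()
    for tid in (1, 2, 3):
        completer.submit("obj", tid, lambda tid=tid: finished.append(tid))
    completer.ack("obj", 2)  # arrives early, held back
    completer.ack("obj", 3)  # also held back
    completer.ack("obj", 1)  # releases all three in order
    return finished
```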

LibrbdWriteback: removed unused and undefined method

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 909dfb7d183f54f7583a70c05550bec07856d4e4)

LibrbdWriteback: use a tid_t for tids

An int could be much smaller, leading to overflow and bad behavior.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 9d19961539b2d50d0c9edee1e3d5ac6912a37f24)

WritebackHandler: make read return nothing

The tid returned by reads is ignored, and would make tracking writes
internally more difficult by using the same id-space as them. Make read
void and update all implementations.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 870f9cd421ca7b0094f9f89e13b1898a8302c494)

ObjectCacher: deduplicate final part of flush_set()

Both versions of flush_set() did the same thing. Move it into a
helper called from both.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit f5b81d8d167d1aa7f82a5776bbb1f319063ab809)

test_stress_watch: remove bogus asserts

There's no reason to check the duration of a watch. The notify will
time out after 30s on the OSD, but there's no guarantee the client will
see that in any bounded time. This test is really meant as a stress
test of the OSDs anyway, not of the clients, so just remove asserts
about operation duration.

Fixes: #4591
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sam Just <sam.just@inktank.com>
(cherry picked from commit 4b656730ffff21132f358c2b9a63504dfbf0998d)

librados: don't use lockdep for AioCompletionImpl

This is a quick workaround for the next branch. A more complete fix
will be done for the master branch. This does not affect correctness,
just what qa runs with lockdep enabled do.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage.weil@inktank.com>
(cherry picked from commit 267ce0d90b8f3afaaddfdc0556c9bafbf4628426)

librados: move snapc creation to caller for aio_operate

The common case already has a snapshot context, so avoid duplicating
it (copying a potentially large vector) in IoCtxImpl::aio_operate().

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 4c4d5591bdb048cd9ffa25b529c6127356e7f9a7)

librbd: add an async flush

At this point it's a simple wrapper around the ObjectCacher or
librados.

This is needed for QEMU so that its main thread can continue while a
flush is occurring. Since this will be backported, don't update the
librbd version yet, just add a #define that QEMU and others can use to
detect the presence of aio_flush().

Refs: #3737
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 302b93c478b3f4bc2c82bfb08329e3c98389dd97)

librbd: use the same IoCtx for each request

Before we were duplicating the IoCtx for each new request since they
could have a different snapshot context or read from a different
snapshot id. Since librados now supports setting these explicitly
for a given request, do that instead.

Since librados tracks outstanding requests on a per-IoCtx basis, this
also fixes a bug that causes flush() without caching to ignore
all the outstanding requests, since they were to separate,
duplicate IoCtxs.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 860493e7ff0d87d02069b243fc1c8326ce0721f9)

librbd: add an is_complete() method to AioCompletions

Mainly this is useful for testing, like flushing and checking that
all pending writes are complete after the flush finishes.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 2ae32068dee22a0ca0698e230ead98f2eeeff3e6)

librados: add versions of a couple functions taking explicit snap args

Usually the snapid to read from or the snapcontext to send with a write
are determined implicitly by the IoCtx the operations are done on.

This makes it difficult to have multiple ops in flight to the same
IoCtx using different snapcontexts or reading from different snapshots,
particularly when more than one operation may be needed past the initial
scheduling.

Add versions of aio_read, aio_sparse_read, and aio_operate
that don't depend on the snap id or snapcontext stored in the IoCtx,
but get them from the caller. Specifying this information for each
operation can be a more useful interface in general, but for now just
add it for the methods used by librbd.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit f06debef6c293750539501ec4e6103e5ae078392)

librados: add async flush interface

Sometimes you don't want flush to block, and can't modify
already scheduled aio_writes. This will be useful for a
librbd async flush interface.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 7cc0940f89070dadab5b9102b1e78362f762f402)

Conflicts:

src/include/rados/librados.h
src/include/rados/librados.hpp

ObjectCacher: remove unneeded var from flush_set()

The gather will only have subs if there is something to flush. Remove
the safe variable, which indicates the same thing, and convert the
conditionals that used it to an else branch. Moving gather.activate()
inside the has_subs() check has no effect since activate() does
nothing when there are no subs.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 553aaac8a19e2359acf6d9d2e1bb4ef0bdba7801)

ObjectCacher: remove NULL checks in flush_set()

Callers will always pass a callback, so assert this and remove the
checks for it being NULL.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 41568b904de6d155e5ee87c68e9c31cbb69508e5)

ObjectCacher: always complete flush_set() callback

This removes the last remnants of
b5e9995f59d363ba00d9cac413d9b754ee44e370. If there's nothing to flush,
immediately call the callback instead of deleting it. Callers were
assuming they were responsible for completing the callback whenever
flush_set() returned true, and always called complete(0) in this
case. Simplify the interface and just do this in flush_set(), so that
it always calls the callback.

Since C_GatherBuilder deletes its finisher if there are no subs,
only set its finisher when subs are present. This way we can still
call ->complete() for the callback.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 92db06c05dc2cad8ed31648cb08866781aee2855)

Conflicts:

src/client/Client.cc

ObjectCacher: fix flush_set when no flushing is needed

C_GatherBuilder takes ownership of the Context we pass it. Deleting it
in flush_set after constructing the C_GatherBuilder results in a
double delete.

Fixes: #3946
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
(cherry picked from commit 3bc21143552b35698c9916c67494336de8964d2a)

objectcacher: Remove commit_set, use flush_set

commit_set() and flush_set() are identical in functionality,
so use flush_set everywhere and remove commit_set from
the code.

Also fixes a bug in flush_set where the finisher context was
getting freed twice if no objects needed to be flushed.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
(cherry picked from commit 72147fd3a1da8ecbcb31ddf6b66a158d71933909)

librbd: make aio_writes to the cache always non-blocking by default

When the ObjectCacher's writex blocks, it affects the thread requesting
the aio, which can cause starvation for other I/O when used by QEMU.

Preserve the old behavior via a config option in case this has any
bad side-effects, like too much memory usage under heavy write loads.

Fixes: #4091
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 03ac01fa6a94fa7a66ede057e9267e0a562c3cdb)

ObjectCacher: optionally make writex always non-blocking

Add a callback argument to writex, and a finisher to run the
callbacks. Move the check for dirty+tx > max_dirty into a helper that
can be called from a wrapper around the callbacks from writex, or from
the current place in _wait_for_write().

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit c21250406eced8e5c467f492a2148c57978634f4)

librbd: flush cache when set_snap() is called

If there are writes pending, they should be sent while the image
is still writeable. If the image becomes read-only, flushing the
cache will just mark everything dirty again due to -EROFS.

Fixes: #4525
Backport: bobtail
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 613b7085bb48cde1e464b7a97c00b8751e0e917f)

librbd: optionally wait for a flush before enabling writeback

Older guests may not send flushes properly (i.e. never), so if this is
enabled, rbd_cache=true is safe for them transparently.

Disable by default, since it will unnecessarily slow down newer guest
boot, and prevent writeback caching for things that don't need to send
flushes, like the command line tool.

Refs: #3817
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 1597b3e3a1d776b56e05c57d7c3de396f4f2b5b2)

librbd: invalidate cache when flattening

The cache stores which objects don't exist. Flatten bypasses the cache
when doing its copyups, so when it is done the -ENOENT from the cache
is treated as zeroes instead of 'need to read from parent'.

Clients that have the image open need to forget about the cached
non-existent objects as well. Do this during ictx_refresh, while the
parent_lock is held exclusively, so no new reads from the parent can
happen until the updated parent metadata is visible.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 46e8fc00b2dc8eb17d8777b6ef5ad1cfcc389cea)

ObjectCacher: add a method to clear -ENOENT caching

Clear the exists and complete flags for any objects that have exists
set to false, and force any in-flight reads to retry if they get
-ENOENT instead of generating zeros.

This is useful for getting the cache into a consistent state for rbd
after an image has been flattened, since many objects which previously
did not exist and went up to the parent to retrieve data may now exist
in the child.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit f2a23dc0b092c5ac081893e8f28c6d4bcabd0c2e)

ObjectCacher: keep track of outstanding reads on an object

Reads always use C_ReadFinish as a callback (and they are the only
user of this callback). Keep an xlist of these for each object, so
they can remove themselves as they finish. To prevent races between
in-flight requests and discard removing objects from the cache, clear
the xlist in the object destructor, so if the Object is still valid the
set_item will still be on the list.

Make the ObjectCacher constructor take an Object* instead of the pool
and object id, which are derived from the Object* anyway.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit f6f876fe51e40570596c25ac84ba3689f72776c2)

test_rbd: move flatten tests back into TestClone

They need the same setup, and it's easy enough to run specific
subtests. Making them a separate subclass accidentally duplicated
tests from TestClone.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 9c693d7e8312026f6d8d9586381b026ada35d808)

librbd: fix rollback size

The duplicate calls to get_image_size() and get_snap_size() replaced
by 5806226cf0743bb44eaf7bc815897c6846d43233 uncovered this. The first
call was using the currently set snap_id instead of the snapshot being
rolled back to.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit d6c126e2131fefab6df676f2b9d0addf78f7a488)

Merge branch 'wip-4249' into wip-4249-master

Make snap_rollback() only take a read lock on snap_lock, since
it does not modify snapshot-related fields.
Conflicts:
src/librbd/internal.cc
(cherry picked from commit db5fc2270f91aae220fc3c97b0c62e92e263527b)

librbd: make sure racing flattens don't crash

The only way for a parent to disappear is a racing flatten completing,
or possibly in the future the image being forcibly removed. In either
case, continuing to flatten makes no sense, so stop early.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit a1ae8562877d1b902918e866a1699214090c40bd)

librbd: use rwlocks instead of mutexes for several fields

Image metadata like snapshots, size, and parent is frequently read,
but rarely updated. During flatten, we were depending on the parent
lock to prevent the parent ImageCtx from disappearing out from under
us while we read from it. The copy-up path also needed the parent lock
to be able to read from the parent image, which led to a deadlock.

Convert parent_lock, snap_lock, and md_lock to RWLocks, and change
their use to read instead of exclusive locks where appropriate. The
main place exclusive locks are needed is in ictx_refresh, so this is
pretty simple. This fixes the deadlock, since parent_lock is only
needed for read access in both flatten and the copy-up operation.

cache_lock and refresh_lock are only really used for exclusive access,
so leave them as regular mutexes.

One downside to this is that there's no way to assert is_locked()
for RWLocks, so we'll have to be very careful about changing code
in the future.

Fixes: #3665
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 995ff0e3eaa560b242da8c019a2e11e735e854f7)

common: add lockers for RWLocks

This makes them easier to use, especially instead of existing mutexes.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit e0f8e5a80d6d22bd4dee79a4996ea7265d11b0c1)

objecter: initialize linger op snapid

Since they are write ops now, it must be CEPH_NOSNAP or the OSD
returns EINVAL.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 15bb9ba9fbb4185708399ed6deee070d888ef6d2)

objecter: separate out linger_read() and linger_mutate()

A watch is a mutation, while a notify is a read. The mutations need to
pass in a proper snap context to be fully correct.

Also, make the WRITE flag implicit so the caller doesn't need to pass it
in.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 6c08c7c1c6d354d090eb16df279d4b63ca7a355a)

osd: make watch OSDOp print sanely

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit de4fa95f03b99a55b5713911c364d7e2a4588679)

osdc/Objecter: unwatch is a mutation, not a read

This was causing librados to unblock after the ACK on unwatch, which meant
that librbd users raced and tried to delete the image before the unwatch
change was committed, and got EBUSY. See #3958.

The watch operation has a similar problem.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit fea77682a6cf9c7571573bc9791c03373d1d976d)

Conflicts:

src/librados/IoCtxImpl.cc

osd: an interval can't go readwrite if its acting is empty

Let's not forget that min_size can be zero.

Fixes: #4159
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4277265d99647c9fe950ba627e5d86234cfd70a9)

mon: restrict pool size to 1..10

See: #4159
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 30b8d653751acb4bc4be5ca611f154e19afe910a)

init-ceph: do not stop start on first failure

When starting we often loop over many daemon instances. Currently we stop
on the first error and do not try to start other daemons.

Instead, try them all, but return a failure if anything did not start.

Fixes: #2545
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
(cherry picked from commit d395aa521e8a4b295ed2b08dd7cfb7d9f995fcf7)

Conflicts:

src/init-ceph.in
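
The "try them all, but return a failure if anything did not start" behaviour from the commit above can be sketched like this. init-ceph is a shell script; the Python function `start_all` and the `start_one` callable are hypothetical stand-ins.

```python
def start_all(daemons, start_one):
    """Attempt every daemon and return (exit_code, failed_names).

    start_one is a callable returning True on success. An exception
    is treated as a failed start rather than aborting the loop.
    """
    failed = []
    for name in daemons:
        try:
            ok = start_one(name)
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return (1 if failed else 0), failed
```

The loop never exits early, so one broken daemon cannot prevent the others from starting, while the non-zero exit code still reports the problem.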

Merge pull request #210 from dalgaaf/wip-da-bobtail-pybind

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

rados.py: fix create_pool()

Call rados_pool_create_with_all() only if auid and crush_rule
are set properly. In case only crush_rule is set call
rados_pool_create_with_crush_rule() on librados, not the other
way around.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
(cherry picked from commit 94a1f25e7230a700f06a2699c9c2b99ec1bf7144)
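
The dispatch that the commit above corrects can be written out as a small decision function. This is not the rados.py source: `pick_create_call` is a hypothetical helper that only returns the name of the librados C function to use, so the selection logic can be read without any ctypes plumbing.

```python
def pick_create_call(auid=None, crush_rule=None):
    """Return the name of the librados pool-creation call for the given arguments."""
    if auid is None and crush_rule is None:
        return "rados_pool_create"
    if auid is None:
        # Only crush_rule was given.
        return "rados_pool_create_with_crush_rule"
    if crush_rule is None:
        # Only auid was given.
        return "rados_pool_create_with_auid"
    # Both were given.
    return "rados_pool_create_with_all"
```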

mon: Use _daemon version of argparse functions

Allow argparse functions to fail if no argument given by using
special versions that avoid the default CLI behavior of "cerr/exit"

Fixes: #4678
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit be801f6c506d9fbfb6c06afe94663abdb0037be5)

Conflicts:
src/mon/Monitor.cc

ceph_argparse: add _daemon versions of argparse calls

mon needs to call argparse for a couple of -- options, and the
argparse_witharg routines were attempting to cerr/exit on missing
arguments. This is appropriate for the CLI usage, but not the daemon
usage. Add a 'cli' flag that can be set false for the daemon usage
(and cause the parsing routine to return false instead of exit).

The daemon's parsing code is due for a rewrite soon.

Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit c76bbc2e6df16d283cac3613628a44937e38bed8)
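
The 'cli' flag described in the commit above can be illustrated with a minimal sketch. The actual change is in the C++ ceph_argparse code; the Python functions `witharg` and `witharg_daemon` are hypothetical and only mirror the idea that the CLI variant exits on a missing argument while the daemon variant reports failure to its caller.

```python
import sys


def witharg(args, i, option, cli=True):
    """Try to read `option VALUE` at args[i]; return (matched, value)."""
    if args[i] != option:
        return False, None
    if i + 1 >= len(args):
        if cli:
            # CLI behaviour: complain and exit.
            sys.stderr.write("Option %s requires an argument.\n" % option)
            sys.exit(1)
        # Daemon behaviour: let the caller decide what to do.
        return False, None
    return True, args[i + 1]


def witharg_daemon(args, i, option):
    """Daemon-safe variant: never exits the process."""
    return witharg(args, i, option, cli=False)
```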

silence logrotate some more

I was getting email with logrotate error output from “which invoke-rc.d”
on systems without an invoke-rc.d. This patch silences it.

Silence stderr from which when running logrotate

From: Alexandre Oliva <oliva@gnu.org>

Signed-off-by: Alexandre Oliva <oliva@gnu.org>
(cherry picked from commit d02340d90c9d30d44c962bea7171db3fe3bfba8e)

Merge remote-tracking branch 'upstream/bobtail-4556' into bobtail

Reviewed-by: Samuel Just <sam.just@inktank.com>

OSD: always activate_map in advance_pgs, only send messages if up

We should always handle_activate_map() after handle_advance_map() in
order to kick the pg into a valid peering state for processing requests
prior to dropping the lock.

Additionally, we would prefer to avoid sending irrelevant messages
during boot, so only send if we are up according to the current service
osdmap.

Fixes: #4572
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4dfcad44431855ba7d68a1ccb41dc3cb5db6bb50)

PG: update PGPool::name in PGPool::update

Fixes: #4471
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f804892d725cfa25c242bdc577b12ee81dcc0dcc)

ReplicatedPG: send entire stats on OP_BACKFILL_FINISH

Otherwise, we update the stat.stat structure, but not the
stat.invalid_stats part. This will result in a recently
split primary propagating the invalid stats but not the
invalid marker. Sending the whole pg_stat_t structure
also mirrors MOSDSubOp.

Fixes: #4557
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 76b296f01fd0d337c8fc9f79013883e62146f0c6)

osd: disallow classes with flags==0

They must be RD, WR, or something....

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 89c69016e1dddb9f3ca40fd699e4a995ef1e3eee)

osd: EINVAL when rmw_flags is 0

A broken client (e.g., v0.56) can send a request that ends up with an
rmw_flags of 0. Treat this as invalid and return EINVAL.

Fixes: #4556
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f2dda43c9ed4fda9cfa87362514985ee79e0ae15)

osd: fix detection of non-existent class method

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 50b831e3641c21cd5b145271688189e199f432d1)

osd: tolerate rmw_flags==0

We will let OSD return a proper error instead of asserting.

This is effectively a backport of c313423cfda55a2231e000cd5ff20729310867f8.

Signed-off-by: Sage Weil <sage@inktank.com>

test_librbd_fsx: fix image closing

Always close the image we opened in check_clone(), and check the
return code of the rbd_close() called before cloning.

Refs: #3958
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 94ae72546507799667197fd941633bb1fd2520c2)

rbd: remove fiemap use from import

On some kernels and filesystems fiemap can be racy and provide
incorrect data even after an fsync. Later we can use SEEK_HOLE and
SEEK_DATA, but for now just detect zero runs like we do with stdin.

Basically this adapts import from stdin to work in the case of a file
or block device, and gets rid of other cruft in the import that used
fiemap.

Fixes: #4388
Backport: bobtail
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 3091283895e8ffa3e4bda13399318a6e720d498f)
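
Detecting zero runs, as the commit above describes for import, can be sketched as a scan over fixed-size blocks. The function name `data_extents` and the block size are assumptions for the example; the rbd tool itself is C++ and its exact buffering is not shown here.

```python
def data_extents(buf, block_size):
    """Yield (offset, length) for each run of blocks that is not entirely zero."""
    zero = bytes(block_size)
    start = None
    for off in range(0, len(buf), block_size):
        block = buf[off:off + block_size]
        # The last block may be short, so compare against a matching slice.
        if block == zero[:len(block)]:
            if start is not None:
                yield (start, off - start)
                start = None
        elif start is None:
            start = off
    if start is not None:
        yield (start, len(buf) - start)
```

Only the yielded extents would need to be written to the image; the all-zero blocks in between can be skipped, which keeps the result sparse without relying on fiemap.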

v0.56.4

rgw: bucket index ops on system buckets shouldn't do anything

Fixes: #4508
Backport: bobtail
On certain bucket index operations we didn't check whether
the bucket was a system bucket, which caused the operations
to fail. This triggered an error message on bucket removal
operations.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 70e0ee8ba955322832f0c366537ddf7a0288761e)

systest: restrict list error acceptance

Only ignore errors after the midway point if the midway_sem_post is
defined.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 5b24a68b6e7d57bac688021b822fb2f73494c3e9)

systest: fix race with pool deletion

The second test had pool deletion and object listing wait on the same
semaphore to connect and start. This led to errors sometimes when the
pool was deleted before it could be opened by the listing process. Add
another semaphore so the pool deletion happens only after the listing
has begun.

Fixes: #4147
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit b0271e390564119e998e18189282252d54f75eb6)

os/FileJournal: fix aio self-throttling deadlock

This block of code tries to limit the number of aios in flight by waiting
for the amount of data to be written to grow relative to a function of the
number of aios. Strictly speaking, the condition we are waiting for is a
function of both aio_num and the write queue, but we are only woken by
changes in aio_num, and were (in rare cases) waiting when aio_num == 0 and
there was no possibility of being woken.

Fix this by verifying that aio_num > 0, and restructuring the loop to
recheck that condition on each wakeup.

Fixes: #4079
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit e5940da9a534821d0d8f872c13f9ac26fb05a0f5)

common/MemoryModel: remove logging to /tmp/memlog

This was a hack for dev purposes ages ago; remove it. The predictable
filename is a security issue.

CVE-2013-1882

Reported-by: Michael Scherer <misc@zarb.org>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit c524e2e01da41ab5b6362c117939ea1efbd98095)

init-ceph: clean up temp ceph.conf filename on exit

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 6a7ad2eac1db6abca3d7edb23ca9b80751400a23)