We run --mkfs with the osd disk mounted in a temporary location, so it is
necessary to explicitly pass in these paths.
If we want to support journals in a different location, we need to make
ceph-disk-prepare update the journal symlink accordingly, not control it via
the config option.
Sage Weil [Thu, 14 Feb 2013 05:47:30 +0000 (21:47 -0800)]
debian: start/stop ceph-all event on install/uninstall
This helps us avoid the confusing situation with upstart where an individual
daemon job is running (like ceph-osd id=2) but the container jobs ceph-osd-all
and ceph-all are not.
Sage Weil [Sun, 27 Jan 2013 03:08:22 +0000 (19:08 -0800)]
ceph-disk-prepare: refactor to support DIR, DISK, or PARTITION for data or journal
Lots of code reorganization collapsed into a single commit here.
- detect whether the user gave us a directory, disk, or partition, and Do The
Right Thing
- allow them to force that the input was of type X, for the careful/paranoid.
- make --zap-disk an option -- no longer the default
Sage Weil [Tue, 23 Apr 2013 17:00:38 +0000 (10:00 -0700)]
init-ceph: fix (and simplify) pushing ceph.conf to remote unique name
The old code would only do the push once per remote node (due to the
list in $pushed_to) but would reset $unique on each attempt. This would
break if a remote host was processed twice.
Fix by just skipping the $pushed_to optimization entirely.
Fixes: #4794
Reported-by: Andreas Friedrich <andreas.friedrich@ts.fujitsu.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ccbc4dbc6edf09626459ca52a53a72682f541e86)
LibrbdWriteback: complete writes strictly in order
RADOS returns writes to the same object in the same order. The
ObjectCacher relies on this assumption to make sure previous writes
are complete and maintain consistency. Reads, however, may be
reordered with respect to each other. When writing to an rbd clone,
reads to the parent must be performed when the object does not exist
in the child yet. These reads may be reordered, resulting in the
original writes being reordered. This breaks the assumptions of the
ObjectCacher, causing an assert to fail.
To fix this, keep a per-object queue of outstanding writes to an
object in the LibrbdWriteback handler, and finish them in the order in
which they were sent.
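
The per-object queue described above can be sketched roughly as follows.
This is a minimal Python illustration, not the LibrbdWriteback code: the
class name, the send()/ack() methods and the dict-based entries are all
hypothetical. It only shows the idea of holding back an early completion
until every earlier write to the same object has finished.

```python
from collections import defaultdict, deque


class OrderedWriteCompleter:
    """Finish writes to each object in the order they were sent (hypothetical)."""

    def __init__(self):
        # object name -> queue of outstanding writes, oldest first
        self._queues = defaultdict(deque)
        self._next_tid = 0

    def send(self, oid, callback):
        """Record a write as outstanding and return its id."""
        self._next_tid += 1
        self._queues[oid].append(
            {"tid": self._next_tid, "cb": callback, "done": False, "result": None})
        return self._next_tid

    def ack(self, oid, tid, result):
        """Mark one write done, then complete every finished write at the head."""
        queue = self._queues[oid]
        for entry in queue:
            if entry["tid"] == tid:
                entry["done"] = True
                entry["result"] = result
                break
        # Only the head of the queue may complete, so order is preserved.
        while queue and queue[0]["done"]:
            entry = queue.popleft()
            entry["cb"](entry["result"])
        if not queue:
            del self._queues[oid]


finished = []
completer = OrderedWriteCompleter()
first = completer.send("obj1", lambda r: finished.append(("first", r)))
second = completer.send("obj1", lambda r: finished.append(("second", r)))
completer.ack("obj1", second, 0)   # arrives early, so it is held back
held_back = list(finished)
completer.ack("obj1", first, 0)    # releases both, in send order
```

The queue is keyed by object because ordering is only required among
writes to the same object.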
The tid returned by reads is ignored, and would make tracking writes
internally more difficult by using the same id-space as them. Make read
void and update all implementations.
There's no reason to check the duration of a watch. The notify will
timeout after 30s on the OSD, but there's no guarantee the client will
see that in any bounded time. This test is really meant as a stress
test of the OSDs anyway, not of the clients, so just remove asserts
about operation duration.
Fixes: #4591
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sam Just <sam.just@inktank.com>
(cherry picked from commit 4b656730ffff21132f358c2b9a63504dfbf0998d)
This is a quick workaround for the next branch. A more complete fix
will be done for the master branch. This does not affect correctness,
just what qa runs with lockdep enabled do.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage.weil@inktank.com>
(cherry picked from commit 267ce0d90b8f3afaaddfdc0556c9bafbf4628426)
Josh Durgin [Thu, 21 Mar 2013 23:04:10 +0000 (16:04 -0700)]
librbd: add an async flush
At this point it's a simple wrapper around the ObjectCacher or
librados.
This is needed for QEMU so that its main thread can continue while a
flush is occurring. Since this will be backported, don't update the
librbd version yet, just add a #define that QEMU and others can use to
detect the presence of aio_flush().
Josh Durgin [Wed, 27 Mar 2013 22:42:10 +0000 (15:42 -0700)]
librbd: use the same IoCtx for each request
Before we were duplicating the IoCtx for each new request since they
could have a different snapshot context or read from a different
snapshot id. Since librados now supports setting these explicitly
for a given request, do that instead.
Since librados tracks outstanding requests on a per-IoCtx basis, this
also fixes a bug that causes flush() without caching to ignore
all the outstanding requests, since they were to separate,
duplicate IoCtxs.
Josh Durgin [Wed, 27 Mar 2013 22:32:29 +0000 (15:32 -0700)]
librados: add versions of a couple functions taking explicit snap args
Usually the snapid to read from or the snapcontext to send with a write
are determined implicitly by the IoCtx the operations are done on.
This makes it difficult to have multiple ops in flight to the same
IoCtx using different snapcontexts or reading from different snapshots,
particularly when more than one operation may be needed past the initial
scheduling.
Add versions of aio_read, aio_sparse_read, and aio_operate
that don't depend on the snap id or snapcontext stored in the IoCtx,
but get them from the caller. Specifying this information for each
operation can be a more useful interface in general, but for now just
add it for the methods used by librbd.
Josh Durgin [Thu, 28 Mar 2013 17:34:37 +0000 (10:34 -0700)]
ObjectCacher: remove unneeded var from flush_set()
The gather will only have subs if there is something to flush. Remove
the safe variable, which indicates the same thing, and convert the
conditionals that used it to an else branch. Moving gather.activate()
inside the has_subs() check has no effect since activate() does
nothing when there are no subs.
This removes the last remnants of b5e9995f59d363ba00d9cac413d9b754ee44e370. If there's nothing to flush,
immediately call the callback instead of deleting it. Callers were
assuming they were responsible for completing the callback whenever
flush_set() returned true, and always called complete(0) in this
case. Simplify the interface and just do this in flush_set(), so that
it always calls the callback.
Since C_GatherBuilder deletes its finisher if there are no subs,
only set its finisher when subs are present. This way we can still
call ->complete() for the callback.
Josh Durgin [Tue, 29 Jan 2013 22:22:15 +0000 (14:22 -0800)]
ObjectCacher: fix flush_set when no flushing is needed
C_GatherBuilder takes ownership of the Context we pass it. Deleting it
in flush_set after constructing the C_GatherBuilder results in a
double delete.
Fixes: #3946
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
(cherry picked from commit 3bc21143552b35698c9916c67494336de8964d2a)
Josh Durgin [Wed, 13 Mar 2013 16:37:21 +0000 (09:37 -0700)]
ObjectCacher: optionally make writex always non-blocking
Add a callback argument to writex, and a finisher to run the
callbacks. Move the check for dirty+tx > max_dirty into a helper that
can be called from a wrapper around the callbacks from writex, or from
the current place in _wait_for_write().
Josh Durgin [Thu, 28 Mar 2013 00:30:42 +0000 (17:30 -0700)]
librbd: flush cache when set_snap() is called
If there are writes pending, they should be sent while the image
is still writeable. If the image becomes read-only, flushing the
cache will just mark everything dirty again due to -EROFS.
Josh Durgin [Sat, 16 Mar 2013 00:28:13 +0000 (17:28 -0700)]
librbd: optionally wait for a flush before enabling writeback
Older guests may not send flushes properly (i.e. never), so if this is
enabled, rbd_cache=true is safe for them transparently.
Disable by default, since it will unnecessarily slow down newer guest
boot, and prevent writeback caching for things that don't need to send
flushes, like the command line tool.
Refs: #3817
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 1597b3e3a1d776b56e05c57d7c3de396f4f2b5b2)
Josh Durgin [Sat, 9 Mar 2013 02:57:24 +0000 (18:57 -0800)]
librbd: invalidate cache when flattening
The cache stores which objects don't exist. Flatten bypasses the cache
when doing its copyups, so when it is done the -ENOENT from the cache
is treated as zeroes instead of 'need to read from parent'.
Clients that have the image open need to forget about the cached
non-existent objects as well. Do this during ictx_refresh, while the
parent_lock is held exclusively, so no new reads from the parent can
happen until the updated parent metadata is visible.
Josh Durgin [Sat, 9 Mar 2013 01:53:31 +0000 (17:53 -0800)]
ObjectCacher: add a method to clear -ENOENT caching
Clear the exists and complete flags for any objects that have exists
set to false, and force any in-flight reads to retry if they get
-ENOENT instead of generating zeros.
This is useful for getting the cache into a consistent state for rbd
after an image has been flattened, since many objects which previously
did not exist and went up to the parent to retrieve data may now exist
in the child.
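
As a rough illustration of the method described above, here is a small
Python sketch. It is not the ObjectCacher code: CachedObject, its fields
and clear_nonexistence() are hypothetical stand-ins, and the sketch
assumes that "clearing" means resetting exists to True and complete to
False so the next read goes back to the OSD.

```python
from dataclasses import dataclass


@dataclass
class CachedObject:
    """Hypothetical stand-in for a cached object's bookkeeping."""
    name: str
    exists: bool = True            # False means "known not to exist"
    complete: bool = False         # True means the cache holds the whole object
    reads_in_flight: int = 0
    retry_on_enoent: bool = False  # set when an -ENOENT reply must not be trusted


def clear_nonexistence(objects):
    """Forget cached non-existence and distrust -ENOENT on pending reads."""
    for obj in objects:
        if not obj.exists:
            obj.exists = True
            obj.complete = False
        if obj.reads_in_flight:
            # A reply already on the wire may predate the flatten.
            obj.retry_on_enoent = True


missing = CachedObject("a", exists=False, complete=True, reads_in_flight=1)
present = CachedObject("b", exists=True, complete=True)
clear_nonexistence([missing, present])
```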
Josh Durgin [Sat, 9 Mar 2013 01:49:27 +0000 (17:49 -0800)]
ObjectCacher: keep track of outstanding reads on an object
Reads always use C_ReadFinish as a callback (and they are the only
user of this callback). Keep an xlist of these for each object, so
they can remove themselves as they finish. To prevent races between
requests and with discard removing objects from the cache, clear the
xlist in the object destructor, so if the Object is still valid the
set_item will still be on the list.
Make the ObjectCacher constructor take an Object* instead of the pool
and object id, which are derived from the Object* anyway.
Josh Durgin [Tue, 26 Feb 2013 21:20:08 +0000 (13:20 -0800)]
librbd: fix rollback size
The duplicate calls to get_image_size() and get_snap_size() replaced
by 5806226cf0743bb44eaf7bc815897c6846d43233 uncovered this. The first
call was using the currently set snap_id instead of the snapshot being
rolled back to.
Josh Durgin [Mon, 25 Feb 2013 20:05:16 +0000 (12:05 -0800)]
Merge branch 'wip-4249' into wip-4249-master
Make snap_rollback() only take a read lock on snap_lock, since
it does not modify snapshot-related fields.
Conflicts:
src/librbd/internal.cc
(cherry picked from commit db5fc2270f91aae220fc3c97b0c62e92e263527b)
Josh Durgin [Thu, 21 Feb 2013 19:26:45 +0000 (11:26 -0800)]
librbd: make sure racing flattens don't crash
The only way for a parent to disappear is a racing flatten completing,
or possibly in the future the image being forcibly removed. In either
case, continuing to flatten makes no sense, so stop early.
Josh Durgin [Thu, 21 Feb 2013 19:17:18 +0000 (11:17 -0800)]
librbd: use rwlocks instead of mutexes for several fields
Image metadata like snapshots, size, and parent is frequently read,
but rarely updated. During flatten, we were depending on the parent
lock to prevent the parent ImageCtx from disappearing out from under
us while we read from it. The copy-up path also needed the parent lock
to be able to read from the parent image, which led to a deadlock.
Convert parent_lock, snap_lock, and md_lock to RWLocks, and change
their use to read instead of exclusive locks where appropriate. The
main place exclusive locks are needed is in ictx_refresh, so this is
pretty simple. This fixes the deadlock, since parent_lock is only
needed for read access in both flatten and the copy-up operation.
cache_lock and refresh_lock are only really used for exclusive access,
so leave them as regular mutexes.
One downside to this is that there's no way to assert is_locked()
for RWLocks, so we'll have to be very careful about changing code
in the future.
Sage Weil [Thu, 21 Feb 2013 21:28:47 +0000 (13:28 -0800)]
osdc/Objecter: unwatch is a mutation, not a read
This was causing librados to unblock after the ACK on unwatch, which meant
that librbd users raced and tried to delete the image before the unwatch
change was committed, and got EBUSY. See #3958.
Sage Weil [Fri, 19 Apr 2013 20:05:43 +0000 (13:05 -0700)]
init-ceph: do not stop start on first failure
When starting we often loop over many daemon instances. Currently we stop
on the first error and do not try to start other daemons.
Instead, try them all, but return a failure if anything did not start.
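
The behaviour change can be sketched like this. init-ceph itself is a
shell script; this Python version is only an illustration, and
start_all(), fake_start() and the daemon names are hypothetical.

```python
def start_all(daemons, start_one):
    """Try to start every daemon; return (exit_status, names_that_failed)."""
    failed = []
    for name in daemons:
        try:
            ok = start_one(name)
        except Exception:
            ok = False
        if not ok:
            # Remember the failure but keep going with the remaining daemons.
            failed.append(name)
    return (1 if failed else 0), failed


attempted = []


def fake_start(name):
    """Pretend starter used only for this example: osd.1 refuses to start."""
    attempted.append(name)
    return name != "osd.1"


status, failed = start_all(["mon.a", "osd.0", "osd.1", "osd.2"], fake_start)
```

Here osd.2 is still attempted after osd.1 fails, and the overall status
is non-zero.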
Fixes: #2545
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
(cherry picked from commit d395aa521e8a4b295ed2b08dd7cfb7d9f995fcf7)
Danny Al-Gaaf [Fri, 5 Apr 2013 13:55:34 +0000 (15:55 +0200)]
rados.py: fix create_pool()
Call rados_pool_create_with_all() only if auid and crush_rule
are set properly. In case only crush_rule is set call
rados_pool_create_with_crush_rule() on librados, not the other
way around.
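
The dispatch described above can be sketched as a pure function. This is
not the rados.py code: choose_pool_create_call() is hypothetical and only
returns the name of the librados call that would be made. The two
branches for "neither set" and "only auid set" are not described in this
commit message and are filled in here as assumptions.

```python
def choose_pool_create_call(auid=None, crush_rule=None):
    """Return the name of the librados pool-create call to use."""
    if auid is None and crush_rule is None:
        return "rados_pool_create"                    # assumption, see above
    if crush_rule is None:
        return "rados_pool_create_with_auid"          # assumption, see above
    if auid is None:
        # Only crush_rule is set.
        return "rados_pool_create_with_crush_rule"
    # Both auid and crush_rule are set.
    return "rados_pool_create_with_all"


only_rule = choose_pool_create_call(crush_rule=3)
both = choose_pool_create_call(auid=100, crush_rule=3)
```

Comparing against None rather than testing truthiness keeps a value of 0
from being mistaken for "not set".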
Dan Mick [Mon, 8 Apr 2013 20:49:22 +0000 (13:49 -0700)]
ceph_argparse: add _daemon versions of argparse calls
mon needs to call argparse for a couple of -- options, and the
argparse_witharg routines were attempting to cerr/exit on missing
arguments. This is appropriate for the CLI usage, but not the daemon
usage. Add a 'cli' flag that can be set false for the daemon usage
(and cause the parsing routine to return false instead of exit).
The daemon's parsing code is due for a rewrite soon.
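
The effect of the 'cli' flag can be illustrated with a short sketch. The
real routines are C++; take_option_value() below is a hypothetical
Python analogue whose name, signature and return shape are assumptions.

```python
import sys


def take_option_value(args, i, option, cli=True):
    """Return (matched, value) for an option that requires an argument."""
    if args[i] != option:
        return False, None
    if i + 1 >= len(args):
        if cli:
            # CLI usage: complain and exit, as before.
            sys.stderr.write("Option %s requires an argument.\n" % option)
            sys.exit(1)
        # Daemon usage: report failure and let the caller decide.
        return False, None
    return True, args[i + 1]


daemon_result = take_option_value(["--public-addr"], 0, "--public-addr", cli=False)
ok_result = take_option_value(["--public-addr", "10.0.0.1"], 0, "--public-addr")
```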
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit c76bbc2e6df16d283cac3613628a44937e38bed8)
Samuel Just [Thu, 14 Feb 2013 22:03:56 +0000 (14:03 -0800)]
OSD: always activate_map in advance_pgs, only send messages if up
We should always handle_activate_map() after handle_advance_map() in
order to kick the pg into a valid peering state for processing requests
prior to dropping the lock.
Additionally, we would prefer to avoid sending irrelevant messages
during boot, so only send if we are up according to the current service
osdmap.
Fixes: #4572
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4dfcad44431855ba7d68a1ccb41dc3cb5db6bb50)
Samuel Just [Thu, 28 Mar 2013 21:09:17 +0000 (14:09 -0700)]
PG: update PGPool::name in PGPool::update
Fixes: #4471
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f804892d725cfa25c242bdc577b12ee81dcc0dcc)
Samuel Just [Tue, 26 Mar 2013 22:10:37 +0000 (15:10 -0700)]
ReplicatedPG: send entire stats on OP_BACKFILL_FINISH
Otherwise, we update the stat.stat structure, but not the
stat.invalid_stats part. This will result in a recently
split primary propagating the invalid stats but not the
invalid marker. Sending the whole pg_stat_t structure
also mirrors MOSDSubOp.
Fixes: #4557
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 76b296f01fd0d337c8fc9f79013883e62146f0c6)
Josh Durgin [Thu, 14 Mar 2013 00:05:42 +0000 (17:05 -0700)]
rbd: remove fiemap use from import
On some kernels and filesystems fiemap can be racy and provide
incorrect data even after an fsync. Later we can use SEEK_HOLE and
SEEK_DATA, but for now just detect zero runs like we do with stdin.
Basically this adapts import from stdin to work in the case of a file
or block device, and gets rid of other cruft in the import that used
fiemap.
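
Detecting zero runs by reading the data, rather than asking the
filesystem, can be sketched as below. This is an illustration in Python,
not the rbd tool's code; data_extents() and the 4096-byte block size are
assumptions.

```python
import io


def data_extents(stream, block_size=4096):
    """Yield (offset, data) for every block that is not entirely zero."""
    zeros = bytes(block_size)
    offset = 0
    while True:
        block = stream.read(block_size)
        if not block:
            break
        # Compare against a zero buffer of the same length (handles short reads).
        if block != zeros[:len(block)]:
            yield offset, block
        offset += len(block)


image = b"a" * 4096 + bytes(8192) + b"b" * 100
extents = list(data_extents(io.BytesIO(image)))
offsets = [off for off, _ in extents]
```

Only the non-zero blocks would need to be written to the destination
image; the all-zero blocks in the middle are skipped.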
Yehuda Sadeh [Mon, 25 Mar 2013 16:50:33 +0000 (09:50 -0700)]
rgw: bucket index ops on system buckets shouldn't do anything
Fixes: #4508
Backport: bobtail
On certain bucket index operations we didn't check whether
the bucket was a system bucket, which caused the operations
to fail. This triggered an error message on bucket removal
operations.
Josh Durgin [Mon, 25 Feb 2013 22:55:34 +0000 (14:55 -0800)]
systest: fix race with pool deletion
The second test had pool deletion and object listing wait on the same
semaphore to connect and start. This led to errors sometimes when the
pool was deleted before it could be opened by the listing process. Add
another semaphore so the pool deletion happens only after the listing
has begun.
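
The ordering fix can be sketched with threads standing in for the test
processes. This is not the systest code; the function names and the
listing_started semaphore are hypothetical.

```python
import threading

events = []                               # list.append is atomic in CPython
listing_started = threading.Semaphore(0)  # hypothetical second semaphore


def list_objects():
    events.append("list: pool opened")
    listing_started.release()   # tell the deleter it may proceed
    events.append("list: objects listed")


def delete_pool():
    listing_started.acquire()   # block until the listing has begun
    events.append("delete: pool deleted")


deleter = threading.Thread(target=delete_pool)
lister = threading.Thread(target=list_objects)
deleter.start()                 # started first on purpose; it still waits
lister.start()
deleter.join()
lister.join()
```

Whichever thread is scheduled first, the pool is opened before it is
deleted. The order of the deletion relative to the end of the listing is
deliberately left unconstrained.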
Sage Weil [Tue, 19 Mar 2013 21:26:16 +0000 (14:26 -0700)]
os/FileJournal: fix aio self-throttling deadlock
This block of code tries to limit the number of aios in flight by waiting
for the amount of data to be written to grow relative to a function of the
number of aios. Strictly speaking, the condition we are waiting for is a
function of both aio_num and the write queue, but we are only woken by
changes in aio_num, and were (in rare cases) waiting when aio_num == 0 and
there was no possibility of being woken.
Fix this by verifying that aio_num > 0, and restructuring the loop to
recheck that condition on each wakeup.
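
The restructured wait loop can be sketched as follows. This is a Python
illustration with a condition variable, not the FileJournal code:
AioThrottle, its fields and the enough_data predicate are hypothetical,
and the real throttling function is not reproduced here.

```python
import threading


class AioThrottle:
    """Hypothetical sketch of a self-throttle that cannot wait forever."""

    def __init__(self, enough_data):
        self.cond = threading.Condition()
        self.aio_num = 0
        self.queued_bytes = 0
        self._enough_data = enough_data  # assumed predicate(aio_num, queued_bytes)

    def aio_finished(self):
        """Called when an aio completes; the only event that wakes waiters."""
        with self.cond:
            self.aio_num -= 1
            self.cond.notify_all()

    def wait_before_submit(self):
        """Wait only while an aio is in flight, rechecking after each wakeup."""
        with self.cond:
            while self.aio_num > 0 and not self._enough_data(
                    self.aio_num, self.queued_bytes):
                self.cond.wait()


throttle = AioThrottle(lambda aio_num, queued: queued >= aio_num * 4096)
throttle.wait_before_submit()   # aio_num == 0, so this returns at once

throttle.aio_num = 1            # one aio in flight, little data queued
finisher = threading.Timer(0.05, throttle.aio_finished)
finisher.start()
throttle.wait_before_submit()   # released once aio_finished() has run
finisher.join()
```

Because wakeups come only from changes in aio_num, the loop never waits
when aio_num is 0.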
Fixes: #4079
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit e5940da9a534821d0d8f872c13f9ac26fb05a0f5)
Sage Weil [Fri, 22 Mar 2013 20:25:49 +0000 (13:25 -0700)]
common/MemoryModel: remove logging to /tmp/memlog
This was a hack for dev purposes ages ago; remove it. The predictable
filename is a security issue.
CVE-2013-1882
Reported-by: Michael Scherer <misc@zarb.org>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit c524e2e01da41ab5b6362c117939ea1efbd98095)