Sage Weil [Thu, 14 Feb 2013 05:47:30 +0000 (21:47 -0800)]
debian: start/stop ceph-all event on install/uninstall
This helps us avoid the confusing situation with upstart where an individual
daemon job is running (like ceph-osd id=2) but the container jobs ceph-osd-all
and ceph-all are not.
Sage Weil [Sun, 27 Jan 2013 03:08:22 +0000 (19:08 -0800)]
ceph-disk-prepare: refactor to support DIR, DISK, or PARTITION for data or journal
Lots of code reorganization collapsed into a single commit here.
- detect whether the user gave us a directory, disk, or partition, and Do The
Right Thing
- allow them to force that the input was of type X, for the careful/paranoid.
- make --zap-disk an option -- no longer the default
Sage Weil [Tue, 23 Apr 2013 17:00:38 +0000 (10:00 -0700)]
init-ceph: fix (and simplify) pushing ceph.conf to remote unique name
The old code would only do the push once per remote node (due to the
list in $pushed_to) but would reset $unique on each attempt. This would
break if a remote host was processed twice.
Fix by just skipping the $pushed_to optimization entirely.
Fixes: #4794 Reported-by: Andreas Friedrich <andreas.friedrich@ts.fujitsu.com> Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit ccbc4dbc6edf09626459ca52a53a72682f541e86)
LibrbdWriteback: complete writes strictly in order
RADOS returns writes to the same object in the same order. The
ObjectCacher relies on this assumption to make sure previous writes
are complete and maintain consistency. Reads, however, may be
reordered with respect to each other. When writing to an rbd clone,
reads to the parent must be performed when the object does not exist
in the child yet. These reads may be reordered, resulting in the
original writes being reordered. This breaks the assumptions of the
ObjectCacher, causing an assert to fail.
To fix this, keep a per-object queue of outstanding writes to an
object in the LibrbdWriteback handler, and finish them in the order in
which they were sent.
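As a rough illustration of the per-object queue described above, here is a minimal Python sketch. It is not the LibrbdWriteback code; the class name, method names, and the dict-based entries are all hypothetical, and it only assumes that each write has an id and a completion callback.

```python
from collections import defaultdict, deque


class OrderedWriteTracker:
    """Hypothetical sketch: finish writes to each object in submission order."""

    def __init__(self):
        # object name -> queue of outstanding writes, oldest first
        self._queues = defaultdict(deque)

    def submit(self, oid, tid, on_finish):
        """Record a write that has been sent for object `oid`."""
        self._queues[oid].append(
            {"tid": tid, "cb": on_finish, "done": False, "ret": None})

    def ack(self, oid, tid, ret):
        """Note that write `tid` was acknowledged, possibly out of order."""
        queue = self._queues[oid]
        for entry in queue:
            if entry["tid"] == tid:
                entry["done"] = True
                entry["ret"] = ret
                break
        # Only the head of the queue may complete, so callbacks fire in
        # the order the writes were submitted.
        while queue and queue[0]["done"]:
            entry = queue.popleft()
            entry["cb"](entry["ret"])
        if not queue:
            del self._queues[oid]


finished = []
tracker = OrderedWriteTracker()
tracker.submit("obj1", 1, lambda ret: finished.append(1))
tracker.submit("obj1", 2, lambda ret: finished.append(2))
tracker.ack("obj1", 2, 0)  # acked first, but held until write 1 is done
tracker.ack("obj1", 1, 0)  # releases both callbacks, in submission order
```

A later ack is simply parked until everything queued ahead of it for the same object has finished, which is the ordering property the ObjectCacher is said to rely on.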
The tid returned by reads is ignored, and would make tracking writes
internally more difficult by using the same id-space as them. Make read
void and update all implementations.
There's no reason to check the duration of a watch. The notify will
time out after 30s on the OSD, but there's no guarantee the client will
see that in any bounded time. This test is really meant as a stress
test of the OSDs anyway, not of the clients, so just remove asserts
about operation duration.
Fixes: #4591 Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sam Just <sam.just@inktank.com>
(cherry picked from commit 4b656730ffff21132f358c2b9a63504dfbf0998d)
This is a quick workaround for the next branch. A more complete fix
will be done for the master branch. This does not affect correctness,
just what qa runs with lockdep enabled do.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage.weil@inktank.com>
(cherry picked from commit 267ce0d90b8f3afaaddfdc0556c9bafbf4628426)
Josh Durgin [Thu, 21 Mar 2013 23:04:10 +0000 (16:04 -0700)]
librbd: add an async flush
At this point it's a simple wrapper around the ObjectCacher or
librados.
This is needed for QEMU so that its main thread can continue while a
flush is occurring. Since this will be backported, don't update the
librbd version yet, just add a #define that QEMU and others can use to
detect the presence of aio_flush().
Josh Durgin [Wed, 27 Mar 2013 22:42:10 +0000 (15:42 -0700)]
librbd: use the same IoCtx for each request
Before we were duplicating the IoCtx for each new request since they
could have a different snapshot context or read from a different
snapshot id. Since librados now supports setting these explicitly
for a given request, do that instead.
Since librados tracks outstanding requests on a per-IoCtx basis, this
also fixes a bug that causes flush() without caching to ignore
all the outstanding requests, since they were to separate,
duplicate IoCtxs.
Josh Durgin [Wed, 27 Mar 2013 22:32:29 +0000 (15:32 -0700)]
librados: add versions of a couple functions taking explicit snap args
Usually the snapid to read from or the snapcontext to send with a write
are determined implicitly by the IoCtx the operations are done on.
This makes it difficult to have multiple ops in flight to the same
IoCtx using different snapcontexts or reading from different snapshots,
particularly when more than one operation may be needed past the initial
scheduling.
Add versions of aio_read, aio_sparse_read, and aio_operate
that don't depend on the snap id or snapcontext stored in the IoCtx,
but get them from the caller. Specifying this information for each
operation can be a more useful interface in general, but for now just
add it for the methods used by librbd.
Josh Durgin [Thu, 28 Mar 2013 17:34:37 +0000 (10:34 -0700)]
ObjectCacher: remove unneeded var from flush_set()
The gather will only have subs if there is something to flush. Remove
the safe variable, which indicates the same thing, and convert the
conditionals that used it to an else branch. Moving gather.activate()
inside the has_subs() check has no effect since activate() does
nothing when there are no subs.
This removes the last remnants of b5e9995f59d363ba00d9cac413d9b754ee44e370. If there's nothing to flush,
immediately call the callback instead of deleting it. Callers were
assuming they were responsible for completing the callback whenever
flush_set() returned true, and always called complete(0) in this
case. Simplify the interface and just do this in flush_set(), so that
it always calls the callback.
Since C_GatherBuilder deletes its finisher if there are no subs,
only set its finisher when subs are present. This way we can still
call ->complete() for the callback.
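The interface change described above (flush_set() always completes the callback itself) might look roughly like the following Python sketch. It is an illustration only: `dirty_writes` is a hypothetical list of functions that each start one write and take a completion callback, standing in for the gather's subs.

```python
def flush_set(dirty_writes, on_finish):
    """Hypothetical sketch: always complete on_finish exactly once.

    Returns True if there was nothing to flush, False otherwise.
    """
    if not dirty_writes:
        # Nothing to flush: complete the callback here instead of
        # leaving that to the caller.
        on_finish(0)
        return True

    remaining = [len(dirty_writes)]

    def sub_finished(ret):
        # Complete the caller's callback once the last sub-write is done.
        remaining[0] -= 1
        if remaining[0] == 0:
            on_finish(0)

    for start_write in list(dirty_writes):
        start_write(sub_finished)
    return False


calls = []
nothing_to_do = flush_set([], calls.append)
flushed = flush_set([lambda done: done(0), lambda done: done(0)], calls.append)
```

With this shape, callers no longer need a special case for the "nothing to flush" return value.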
Josh Durgin [Tue, 29 Jan 2013 22:22:15 +0000 (14:22 -0800)]
ObjectCacher: fix flush_set when no flushing is needed
C_GatherBuilder takes ownership of the Context we pass it. Deleting it
in flush_set after constructing the C_GatherBuilder results in a
double delete.
Fixes: #3946 Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sam Lang <sam.lang@inktank.com>
(cherry picked from commit 3bc21143552b35698c9916c67494336de8964d2a)
Josh Durgin [Wed, 13 Mar 2013 16:37:21 +0000 (09:37 -0700)]
ObjectCacher: optionally make writex always non-blocking
Add a callback argument to writex, and a finisher to run the
callbacks. Move the check for dirty+tx > max_dirty into a helper that
can be called from a wrapper around the callbacks from writex, or from
the current place in _wait_for_write().
Josh Durgin [Thu, 28 Mar 2013 00:30:42 +0000 (17:30 -0700)]
librbd: flush cache when set_snap() is called
If there are writes pending, they should be sent while the image
is still writeable. If the image becomes read-only, flushing the
cache will just mark everything dirty again due to -EROFS.
Josh Durgin [Sat, 16 Mar 2013 00:28:13 +0000 (17:28 -0700)]
librbd: optionally wait for a flush before enabling writeback
Older guests may not send flushes properly (i.e. never), so if this is
enabled, rbd_cache=true is safe for them transparently.
Disable by default, since it will unnecessarily slow down newer guest
boot, and prevent writeback caching for things that don't need to send
flushes, like the command line tool.
Refs: #3817 Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 1597b3e3a1d776b56e05c57d7c3de396f4f2b5b2)
Josh Durgin [Sat, 9 Mar 2013 02:57:24 +0000 (18:57 -0800)]
librbd: invalidate cache when flattening
The cache stores which objects don't exist. Flatten bypasses the cache
when doing its copyups, so when it is done the -ENOENT from the cache
is treated as zeroes instead of 'need to read from parent'.
Clients that have the image open need to forget about the cached
non-existent objects as well. Do this during ictx_refresh, while the
parent_lock is held exclusively, so no new reads from the parent can
happen until the updated parent metadata is visible.
Josh Durgin [Sat, 9 Mar 2013 01:53:31 +0000 (17:53 -0800)]
ObjectCacher: add a method to clear -ENOENT caching
Clear the exists and complete flags for any objects that have exists
set to false, and force any in-flight reads to retry if they get
-ENOENT instead of generating zeros.
This is useful for getting the cache into a consistent state for rbd
after an image has been flattened, since many objects which previously
did not exist and went up to the parent to retrieve data may now exist
in the child.
Josh Durgin [Sat, 9 Mar 2013 01:49:27 +0000 (17:49 -0800)]
ObjectCacher: keep track of outstanding reads on an object
Reads always use C_ReadFinish as a callback (and they are the only
user of this callback). Keep an xlist of these for each object, so
they can remove themselves as they finish. To prevent races between
these requests and discard removing objects from the cache, clear the xlist in
the object destructor, so if the Object is still valid the set_item
will still be on the list.
Make the ObjectCacher constructor take an Object* instead of the pool
and object id, which are derived from the Object* anyway.
Josh Durgin [Tue, 26 Feb 2013 21:20:08 +0000 (13:20 -0800)]
librbd: fix rollback size
The duplicate calls to get_image_size() and get_snap_size() replaced
by 5806226cf0743bb44eaf7bc815897c6846d43233 uncovered this. The first
call was using the currently set snap_id instead of the snapshot being
rolled back to.
Josh Durgin [Mon, 25 Feb 2013 20:05:16 +0000 (12:05 -0800)]
Merge branch 'wip-4249' into wip-4249-master
Make snap_rollback() only take a read lock on snap_lock, since
it does not modify snapshot-related fields.
Conflicts:
src/librbd/internal.cc
(cherry picked from commit db5fc2270f91aae220fc3c97b0c62e92e263527b)
Josh Durgin [Thu, 21 Feb 2013 19:26:45 +0000 (11:26 -0800)]
librbd: make sure racing flattens don't crash
The only way for a parent to disappear is a racing flatten completing,
or possibly in the future the image being forcibly removed. In either
case, continuing to flatten makes no sense, so stop early.
Josh Durgin [Thu, 21 Feb 2013 19:17:18 +0000 (11:17 -0800)]
librbd: use rwlocks instead of mutexes for several fields
Image metadata like snapshots, size, and parent is frequently read,
but rarely updated. During flatten, we were depending on the parent
lock to prevent the parent ImageCtx from disappearing out from under
us while we read from it. The copy-up path also needed the parent lock
to be able to read from the parent image, which led to a deadlock.
Convert parent_lock, snap_lock, and md_lock to RWLocks, and change
their use to read instead of exclusive locks where appropriate. The
main place exclusive locks are needed is in ictx_refresh, so this is
pretty simple. This fixes the deadlock, since parent_lock is only
needed for read access in both flatten and the copy-up operation.
cache_lock and refresh_lock are only really used for exclusive access,
so leave them as regular mutexes.
One downside to this is that there's no way to assert is_locked()
for RWLocks, so we'll have to be very careful about changing code
in the future.
Sage Weil [Thu, 21 Feb 2013 21:28:47 +0000 (13:28 -0800)]
osdc/Objecter: unwatch is a mutation, not a read
This was causing librados to unblock after the ACK on unwatch, which meant
that librbd users raced and tried to delete the image before the unwatch
change was committed, and got EBUSY. See #3958.
Sage Weil [Fri, 19 Apr 2013 20:05:43 +0000 (13:05 -0700)]
init-ceph: do not stop start on first failure
When starting we often loop over many daemon instances. Currently we stop
on the first error and do not try to start other daemons.
Instead, try them all, but return a failure if anything did not start.
Fixes: #2545 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
(cherry picked from commit d395aa521e8a4b295ed2b08dd7cfb7d9f995fcf7)
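The "try them all, but return a failure if anything did not start" behaviour of the init-ceph change above can be sketched as follows. This is Python rather than the init script's shell, and `start_all` and `start_one` are hypothetical names used only for illustration.

```python
def start_all(daemons, start_one):
    """Try to start every daemon; report failure if any did not start.

    `start_one` is a hypothetical callable returning True on success.
    Returns (exit_status, names_that_failed).
    """
    failed = []
    for name in daemons:
        try:
            ok = start_one(name)
        except Exception:
            ok = False
        if not ok:
            # Remember the failure, but keep going with the other daemons.
            failed.append(name)
    return (1 if failed else 0), failed


status, failed = start_all(
    ["mon.a", "osd.0", "osd.1"],
    lambda name: name != "osd.0",  # pretend osd.0 fails to start
)
```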
Danny Al-Gaaf [Fri, 5 Apr 2013 13:55:34 +0000 (15:55 +0200)]
rados.py: fix create_pool()
Call rados_pool_create_with_all() only if auid and crush_rule
are set properly. In case only crush_rule is set call
rados_pool_create_with_crush_rule() on librados, not the other
way around.
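The dispatch described in the rados.py fix above could be sketched like this. The function below only returns the name of the librados call to use and is not the rados.py code. The two `with_all` and `with_crush_rule` names come from the commit message; the `rados_pool_create` and `rados_pool_create_with_auid` fallbacks are an assumption about the remaining cases.

```python
def pick_pool_create_call(auid=None, crush_rule=None):
    """Hypothetical helper: choose the librados call for create_pool()."""
    if auid is None and crush_rule is None:
        return "rados_pool_create"               # assumed fallback
    if auid is None:
        # Only crush_rule is set.
        return "rados_pool_create_with_crush_rule"
    if crush_rule is None:
        return "rados_pool_create_with_auid"     # assumed fallback
    # Both auid and crush_rule are set.
    return "rados_pool_create_with_all"
```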
Dan Mick [Mon, 8 Apr 2013 20:49:22 +0000 (13:49 -0700)]
ceph_argparse: add _daemon versions of argparse calls
mon needs to call argparse for a couple of -- options, and the
argparse_witharg routines were attempting to cerr/exit on missing
arguments. This is appropriate for the CLI usage, but not the daemon
usage. Add a 'cli' flag that can be set false for the daemon usage
(and cause the parsing routine to return false instead of exit).
The daemon's parsing code is due for a rewrite soon.
Signed-off-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit c76bbc2e6df16d283cac3613628a44937e38bed8)
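To illustrate the 'cli' flag from the ceph_argparse change above, here is a small Python sketch of the idea. It is not the C++ argparse_witharg routine; `take_option_value` and its return shape are hypothetical.

```python
import sys


def take_option_value(args, name, cli=True):
    """Hypothetical sketch: return (ok, value) for an option needing a value.

    With cli=True a missing value prints an error and exits, as a command
    line tool would. With cli=False (daemon use) it reports failure to the
    caller instead of exiting.
    """
    if name not in args:
        return False, None
    index = args.index(name)
    if index + 1 >= len(args):
        if cli:
            sys.stderr.write("Option %s requires an argument.\n" % name)
            sys.exit(1)
        return False, None
    return True, args[index + 1]


daemon_result = take_option_value(["--id"], "--id", cli=False)
cli_result = take_option_value(["--id", "3"], "--id")
```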
Samuel Just [Thu, 14 Feb 2013 22:03:56 +0000 (14:03 -0800)]
OSD: always activate_map in advance_pgs, only send messages if up
We should always handle_activate_map() after handle_advance_map() in
order to kick the pg into a valid peering state for processing requests
prior to dropping the lock.
Additionally, we would prefer to avoid sending irrelevant messages
during boot, so only send if we are up according to the current service
osdmap.
Fixes: #4572
Backport: bobtail Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4dfcad44431855ba7d68a1ccb41dc3cb5db6bb50)
Samuel Just [Thu, 28 Mar 2013 21:09:17 +0000 (14:09 -0700)]
PG: update PGPool::name in PGPool::update
Fixes: #4471 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f804892d725cfa25c242bdc577b12ee81dcc0dcc)
Samuel Just [Tue, 26 Mar 2013 22:10:37 +0000 (15:10 -0700)]
ReplicatedPG: send entire stats on OP_BACKFILL_FINISH
Otherwise, we update the stat.stat structure, but not the
stat.invalid_stats part. This will result in a recently
split primary propagating the invalid stats but not the
invalid marker. Sending the whole pg_stat_t structure
also mirrors MOSDSubOp.
Fixes: #4557
Backport: bobtail Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 76b296f01fd0d337c8fc9f79013883e62146f0c6)
Josh Durgin [Thu, 14 Mar 2013 00:05:42 +0000 (17:05 -0700)]
rbd: remove fiemap use from import
On some kernels and filesystems fiemap can be racy and provide
incorrect data even after an fsync. Later we can use SEEK_HOLE and
SEEK_DATA, but for now just detect zero runs like we do with stdin.
Basically this adapts import from stdin to work in the case of a file
or block device, and gets rid of other cruft in the import that used
fiemap.
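Detecting zero runs while reading, as the rbd import change above describes, can be sketched in Python as below. This is not the rbd tool's code; the chunk size and function names are hypothetical, and the sketch only skips chunks that are entirely zero.

```python
import io


def read_chunks(stream, size):
    """Yield successive chunks of up to `size` bytes from a file-like object."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        yield chunk


def data_extents(stream, size):
    """Yield (offset, data) for each chunk that is not entirely zero."""
    offset = 0
    for chunk in read_chunks(stream, size):
        if chunk != bytes(len(chunk)):  # bytes(n) is n zero bytes
            yield offset, chunk
        offset += len(chunk)


source = io.BytesIO(b"\0" * 8 + b"abcd" + b"\0" * 4 + b"\0" * 8)
extents = list(data_extents(source, 8))
```

Only the non-zero extents would need to be written to the image; all-zero chunks can be left as holes.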
Yehuda Sadeh [Mon, 25 Mar 2013 16:50:33 +0000 (09:50 -0700)]
rgw: bucket index ops on system buckets shouldn't do anything
Fixes: #4508
Backport: bobtail
On certain bucket index operations we didn't check whether
the bucket was a system bucket, which caused the operations
to fail. This triggered an error message on bucket removal
operations.
Josh Durgin [Mon, 25 Feb 2013 22:55:34 +0000 (14:55 -0800)]
systest: fix race with pool deletion
The second test had pool deletion and object listing wait on the same
semaphore to connect and start. This led to errors sometimes when the
pool was deleted before it could be opened by the listing process. Add
another semaphore so the pool deletion happens only after the listing
has begun.
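The extra semaphore described in the systest fix above orders two actors so that deletion cannot happen before the listing side has opened the pool. A minimal Python sketch, using threads in place of the test's processes and hypothetical names throughout:

```python
import threading

events = []
events_lock = threading.Lock()
listing_started = threading.Semaphore(0)  # hypothetical second semaphore


def record(event):
    with events_lock:
        events.append(event)


def lister():
    record("listing: pool opened")
    listing_started.release()  # signal that deletion may now proceed


def deleter():
    listing_started.acquire()  # block until the lister has the pool open
    record("deleter: pool deleted")


# Start the deleter first to show that start order does not matter.
threads = [threading.Thread(target=deleter), threading.Thread(target=lister)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```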
Sage Weil [Tue, 19 Mar 2013 21:26:16 +0000 (14:26 -0700)]
os/FileJournal: fix aio self-throttling deadlock
This block of code tries to limit the number of aios in flight by waiting
for the amount of data to be written to grow relative to a function of the
number of aios. Strictly speaking, the condition we are waiting for is a
function of both aio_num and the write queue, but we are only woken by
changes in aio_num, and were (in rare cases) waiting when aio_num == 0 and
there was no possibility of being woken.
Fix this by verifying that aio_num > 0, and restructuring the loop to
recheck that condition on each wakeup.
Fixes: #4079 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit e5940da9a534821d0d8f872c13f9ac26fb05a0f5)
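The shape of the FileJournal fix above (only wait while aio_num > 0, and recheck the condition on every wakeup) can be sketched with a Python condition variable. The class and the `should_wait` predicate are hypothetical; the real throttling function is not reproduced here.

```python
import threading


class AioThrottle:
    """Hypothetical sketch of a self-throttling wait loop."""

    def __init__(self, should_wait):
        # should_wait(aio_num, queued_bytes) -> bool, supplied by the caller
        self._should_wait = should_wait
        self._cond = threading.Condition()
        self.aio_num = 0
        self.queued_bytes = 0

    def aio_started(self):
        with self._cond:
            self.aio_num += 1

    def aio_finished(self):
        with self._cond:
            self.aio_num -= 1
            self._cond.notify_all()  # wakeups only come from aio completion

    def wait_until_ready(self):
        with self._cond:
            # Never wait with no aios in flight: nothing would wake us.
            # The whole condition is rechecked after every wakeup.
            while self.aio_num > 0 and self._should_wait(
                    self.aio_num, self.queued_bytes):
                self._cond.wait()


throttle = AioThrottle(lambda aio_num, queued_bytes: True)
throttle.wait_until_ready()  # returns at once because aio_num == 0
```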
Sage Weil [Fri, 22 Mar 2013 20:25:49 +0000 (13:25 -0700)]
common/MemoryModel: remove logging to /tmp/memlog
This was a hack for dev purposes ages ago; remove it. The predictable
filename is a security issue.
CVE-2013-1882
Reported-by: Michael Scherer <misc@zarb.org> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit c524e2e01da41ab5b6362c117939ea1efbd98095)
Sage Weil [Fri, 22 Mar 2013 20:25:33 +0000 (13:25 -0700)]
init-ceph: push temp conf file to a unique location on remote host
The predictable file name is a security problem.
CVE-2013-1882
Reported-by: Michael Scherer <misc@zarb.org> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 051734522fea92878dd8139f28ec4e6b01371ede)
Sage Weil [Fri, 22 Mar 2013 20:25:23 +0000 (13:25 -0700)]
mkcephfs: make remote temp directory name unique
The predictable file name is a security problem.
CVE-2013-1882
Reported-by: Michael Scherer <misc@zarb.org> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit f463ef78d77b11b5ad78b31e9a3a88d0a6e62bca)
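The three CVE-2013-1882 fixes above all replace a predictable temporary name with a unique one. The scripts themselves are shell and operate on remote hosts; the following is only a local Python sketch of the same idea, with a hypothetical helper name and prefix.

```python
import os
import tempfile


def write_private_temp(data, prefix="ceph.conf."):
    """Write `data` to a uniquely named temp file and return its path.

    mkstemp() picks an unpredictable name and creates the file atomically,
    so another user cannot pre-create it or guess the path.
    """
    fd, path = tempfile.mkstemp(prefix=prefix)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


first = write_private_temp(b"[global]\n")
second = write_private_temp(b"[global]\n")
```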
Samuel Just [Fri, 22 Mar 2013 20:51:14 +0000 (13:51 -0700)]
PG::GetMissing: need to check need_up_thru in MLogRec handler
Backport: bobtail Fixes: #4534 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4fe4deafbe1758a6b3570048aca57485bd562440)
Backport: bobtail Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d611eba9caf45f2d977c627b123462a073f523a4)
which is what we are shoving into statvfs, but we have the b_size and
f_frsize arithmetic swapped. However, doing the *correct* reporting would
then break the old stat by making both sizes appear to be 4KB (or
whatever).
Sidestep the issue by making *both* values 4MB, which is both large enough
to report large FS sizes, and also the default stripe size and thus a
"reasonable" value to report for a block size.
Perhaps in the future, when we no longer care about old userland, we can
report the page size for f_bsize, which is probably the "most correct"
thing to do.
Fixes: #3794. See also #3793. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 7c94083643891c9d66a117352f312b268bdb1135)
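As a worked example of the arithmetic implied above, the sketch below reports sizes in 4MB units for both f_bsize and f_frsize. The helper and the example sizes are hypothetical; only the "both values 4MB" choice comes from the commit message.

```python
BLOCK_SIZE = 4 * 1024 * 1024  # 4MB, reported for both f_bsize and f_frsize


def statvfs_sizes(total_bytes, free_bytes):
    """Hypothetical sketch: express filesystem sizes in 4MB blocks."""
    return {
        "f_bsize": BLOCK_SIZE,
        "f_frsize": BLOCK_SIZE,
        "f_blocks": total_bytes // BLOCK_SIZE,
        "f_bfree": free_bytes // BLOCK_SIZE,
    }


# A 10 TiB filesystem with 3 TiB free.
report = statvfs_sizes(total_bytes=10 * 2**40, free_bytes=3 * 2**40)
```

Because both fields are equal, a consumer that multiplies the block count by either one arrives at the same total.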
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 5fc83c8d9887d2a916af11436ccc94fcbfe59b7a)
Samuel Just [Thu, 21 Feb 2013 23:31:36 +0000 (15:31 -0800)]
PG::proc_replica_log: adjust oinfo.last_complete based on omissing
Otherwise, search_for_missing may neglect to check the missing
set for some objects assuming that if the need version is
prior to last_complete, the replica must have it.
Sage Weil [Sat, 9 Feb 2013 08:05:33 +0000 (00:05 -0800)]
osd: fix load_pgs collection handling
On a _TEMP pg, is_pg() would succeed, which meant we weren't actually
hitting the cleanup checks. Instead, restructure this loop as positive
checks and handle each type of collection we understand.
This fixes _TEMP cleanup.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit b19b6dced85617d594c15631571202aab2f94ae8)
Sage Weil [Sat, 9 Feb 2013 08:04:29 +0000 (00:04 -0800)]
osd: fix load_pgs handling of pg dirs without a head
If there is a pgid that passes coll_t::is_pg() but there is no head, we
will populate the pgs map but then fail later when we try to do
read_state. This is a side-effect of 55f8579.
Take explicit note of _head collections we see, and then warn when we
find stray snap collections.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 1f80a0b576c0af1931f743ad988b6293cbf2d6d9)
Samuel Just [Thu, 7 Feb 2013 21:34:47 +0000 (13:34 -0800)]
OSD::load_pgs: first scan colls before initing PGs
Backport: bobtail Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 073f58ede2e473af91f76d01679631c169274af7)
David Zafman [Wed, 9 Jan 2013 03:24:13 +0000 (19:24 -0800)]
osd: Add digest of omap for deep-scrub
Add ScrubMap encode/decode v4 message with omap digest
Compute digest of header and key/value. Use bufferlist
to reflect structure and compute as we go, clearing
bufferlist to reduce memory usage.
Signed-off-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 509a93e89f04d7e9393090563cf7be8e0ea53891)
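Computing a digest incrementally over the omap header and each key/value pair, without holding everything in memory, might look like the sketch below. It uses zlib.crc32 purely as a stand-in checksum and simple concatenation as the encoding; neither is claimed to match what the OSD does.

```python
import zlib


def omap_digest(header, items):
    """Hypothetical sketch: running checksum over header, then key/values.

    `items` may be any iterable of (key, value) byte strings, so pairs can
    be streamed and discarded as the digest is updated.
    """
    digest = zlib.crc32(header)
    for key, value in items:
        digest = zlib.crc32(key, digest)
        digest = zlib.crc32(value, digest)
    return digest


primary = omap_digest(b"hdr", [(b"k1", b"v1"), (b"k2", b"v2")])
replica = omap_digest(b"hdr", [(b"k1", b"v1"), (b"k2", b"vX")])
```

Comparing such digests across replicas is enough to flag a mismatch without shipping the omap contents themselves.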
Samuel Just [Fri, 15 Mar 2013 22:13:46 +0000 (15:13 -0700)]
OSD: split temp collection as well
Otherwise, when we eventually remove the temp collection, there might be
objects in the temp collection which were independently pulled into the child
pg collection. Thus, removing the old stale parent link from its temp
collection also blasts the omap entries and snap mappings for the real child
object.
Backport: bobtail Fixes: #4452 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit f8d66e87a5c155b027cc6249006b83b4ac9b6c9b)
Samuel Just [Fri, 15 Mar 2013 02:59:36 +0000 (19:59 -0700)]
PG: ignore non MISSING pg query in ReplicaActive
1) Replica sends notify
2) Prior to processing notify, primary queues query to replica
3) Primary processes notify and activates sending MOSDPGLog
to replica.
4) Primary does do_notifies at end of process_peering_events
and sends the Query.
5) Replica sees MOSDPGLog and activates
6) Replica sees Query and asserts.
In the above case, the Replica should simply ignore the old
Query.
Fixes: #4050
Backport: bobtail Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 8222cbc8f35c359a35f8381ad90ff0eed5615dac)
If queue_pos == header.max_size when we create the entry
header magic, the entry will be rejected at get_top() on
replay.
Fixes: #4436
Backport: bobtail Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit de8edb732e3a5ce4471670e43cfe6357ae6a2758)