Samuel Just [Tue, 26 Mar 2013 20:08:29 +0000 (13:08 -0700)]
PG: make _select_auth_object smarter
Previously, we just picked the first one to have the object in
question. Now, we will attempt to choose one that has as
much of the following as possible:
1) has the object (there must be one)
2) has an object_info attr
3) has a valid object_info attr
4) has an object_info whose size matches the scrubbed size
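The ranking above can be sketched as follows. This is an illustrative Python model, not the Ceph C++ code: the Replica record, its fields, and select_auth are hypothetical names, and the assumption that each criterion only counts once the previous ones hold is ours.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Replica:
    """Hypothetical summary of what scrub found on one replica."""
    name: str
    has_object: bool
    has_oi_attr: bool = False            # object_info attr is present
    oi_valid: bool = False               # object_info attr decodes cleanly
    oi_size: Optional[int] = None        # size recorded in object_info
    scrubbed_size: Optional[int] = None  # size observed during scrub


def score(r: Replica) -> int:
    """Count how many criteria hold, in order, stopping at the first miss."""
    if not r.has_object:
        return 0
    if not r.has_oi_attr:
        return 1
    if not r.oi_valid:
        return 2
    if r.oi_size != r.scrubbed_size:
        return 3
    return 4


def select_auth(replicas: Sequence[Replica]) -> Optional[Replica]:
    """Pick the replica meeting the most criteria; ties go to the first one."""
    candidates = [r for r in replicas if r.has_object]
    if not candidates:
        return None
    return max(candidates, key=score)
```

With this scoring, a replica that merely has the object is still chosen when nothing better exists, which matches the old behaviour as a fallback.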
Loic Dachary [Thu, 28 Mar 2013 12:38:09 +0000 (13:38 +0100)]
unit test LFNIndex::remove_object and LFNIndex::lfn_unlink
When the object name is short, check that the corresponding file is
::unlink()ed. When the object name is long, there may be multiple files
with the same name, modulo the anti-collision number showing just before
the FILENAME_COOKIE. The following scenarios are tested:
* there is only one file
* there are multiple files and the last one is removed
* there are multiple files and the last one is moved in place of the
file that is to be removed
lfn_unlink and remove_object are tested together because
lfn_unlink is a private function and remove_object is a protected function
that does very little besides calling lfn_unlink.
mon: ConfigKeyService: stash config keys on the monitor
Building on the Single-Paxos and our existing k/v store that backs
the monitor, we now introduce a simple service so that the monitors
act as a generic k/v store available to the cluster, in which a user
can stash (and later obtain) configuration keys at their own discretion.
Users can put, get, delete, list and check for values using the
following commands:
- ceph config-key put <key> [<value>]
or
- ceph config-key put <key> [-i <in-file>]
with 'value' and 'in-file' being optional; if these are not specified,
'put' will act as 'touch' if 'key' does not exist, or will overwrite
the value of 'key' with a zero byte value (i.e., truncates the
contents of the value to zero)
- ceph config-key get <key>
or
- ceph config-key get <key> -o <out-file>
- ceph config-key delete <key>
- ceph config-key list [-o <out-file>]
- ceph config-key exists <key>
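The 'put' semantics described above (touch when the key is missing, truncate when no value is given) can be modeled with a small in-memory store. This is only a sketch: ConfigKeyStore and its methods are hypothetical names, the sorted listing and the silent delete of a missing key are assumptions, and nothing here reflects how the monitor persists keys through Paxos.

```python
from typing import Dict, List, Optional


class ConfigKeyStore:
    """Toy in-memory model of the config-key commands (hypothetical class)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: Optional[bytes] = None) -> None:
        # No value given: create the key if it is missing, otherwise
        # truncate its value to zero bytes.
        self._data[key] = b"" if value is None else value

    def get(self, key: str) -> bytes:
        return self._data[key]  # raises KeyError if the key is absent

    def delete(self, key: str) -> None:
        # Assumption: deleting a missing key is not an error in this model.
        self._data.pop(key, None)

    def list(self) -> List[str]:
        # Assumption: keys are reported in sorted order.
        return sorted(self._data)

    def exists(self, key: str) -> bool:
        return key in self._data
```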
Fixes: #4313
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Gary Lowell [Thu, 28 Mar 2013 23:12:33 +0000 (16:12 -0700)]
ceph.spec.in: Move four scripts from sbin to usr/bin
The ceph-create-keys, ceph-disk, ceph-disk-activate, and
ceph-disk-prepare scripts are built in sbin, but debian installs
them into usr/bin, and several utilities look for them there.
This commit changes the RPM to install them in /usr/bin. (Bug #3921)
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
ceph: propagate do_command()'s return value to user space
On error, we were returning '1' regardless of what do_command() returned.
This made tools that rely on command error codes close to useless, and
forced them to rely on error messages instead.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit e91405d540ce11b9996e4977212553bd33afb3ed)
Josh Durgin [Thu, 21 Mar 2013 23:04:10 +0000 (16:04 -0700)]
librbd: add an async flush
At this point it's a simple wrapper around the ObjectCacher or
librados.
This is needed for QEMU so that its main thread can continue while a
flush is occurring. Since this will be backported, don't update the
librbd version yet, just add a #define that QEMU and others can use to
detect the presence of aio_flush().
Josh Durgin [Wed, 27 Mar 2013 22:42:10 +0000 (15:42 -0700)]
librbd: use the same IoCtx for each request
Before we were duplicating the IoCtx for each new request since they
could have a different snapshot context or read from a different
snapshot id. Since librados now supports setting these explicitly
for a given request, do that instead.
Since librados tracks outstanding requests on a per-IoCtx basis, this
also fixes a bug that causes flush() without caching to ignore
all the outstanding requests, since they were to separate,
duplicate IoCtxs.
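The flush bug described above can be illustrated with a toy model. This is a sketch only: ToyIoCtx and the two helper functions are hypothetical, they are not the librados API, and flush() here simply counts and drops the requests it can see.

```python
from typing import Iterable, List, Optional, Tuple


class ToyIoCtx:
    """Hypothetical stand-in for an I/O context that tracks its own requests."""

    def __init__(self, snap_context: Optional[str] = None) -> None:
        self.snap_context = snap_context
        self.outstanding: List[Tuple[bytes, Optional[str]]] = []

    def dup(self, snap_context: Optional[str]) -> "ToyIoCtx":
        # A duplicate has its own, separate list of outstanding requests.
        return ToyIoCtx(snap_context)

    def aio_write(self, data: bytes, snap_context: Optional[str] = None) -> None:
        # An explicit snap context overrides the one stored on the context.
        self.outstanding.append((data, snap_context or self.snap_context))

    def flush(self) -> int:
        """Complete (here: drop) this context's requests; return how many."""
        done = len(self.outstanding)
        self.outstanding.clear()
        return done


def write_with_duplicates(main: ToyIoCtx,
                          writes: Iterable[Tuple[bytes, str]]) -> int:
    """Old approach: one duplicate per request, so main.flush() sees nothing."""
    for data, snapc in writes:
        main.dup(snapc).aio_write(data)
    return main.flush()


def write_with_explicit_args(main: ToyIoCtx,
                             writes: Iterable[Tuple[bytes, str]]) -> int:
    """New approach: one shared context, snap context passed per request."""
    for data, snapc in writes:
        main.aio_write(data, snapc)
    return main.flush()
```

In the first helper the flush returns 0 because every request was queued on a throwaway duplicate; in the second it accounts for every request.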
Josh Durgin [Wed, 27 Mar 2013 22:32:29 +0000 (15:32 -0700)]
librados: add versions of a couple functions taking explicit snap args
Usually the snapid to read from or the snapcontext to send with a write
are determined implicitly by the IoCtx the operations are done on.
This makes it difficult to have multiple ops in flight to the same
IoCtx using different snapcontexts or reading from different snapshots,
particularly when more than one operation may be needed past the initial
scheduling.
Add versions of aio_read, aio_sparse_read, and aio_operate
that don't depend on the snap id or snapcontext stored in the IoCtx,
but get them from the caller. Specifying this information for each
operation can be a more useful interface in general, but for now just
add it for the methods used by librbd.
Josh Durgin [Thu, 28 Mar 2013 17:34:37 +0000 (10:34 -0700)]
ObjectCacher: remove unneeded var from flush_set()
The gather will only have subs if there is something to flush. Remove
the safe variable, which indicates the same thing, and convert the
conditionals that used it to an else branch. Moving gather.activate()
inside the has_subs() check has no effect since activate() does
nothing when there are no subs.
This removes the last remnants of b5e9995f59d363ba00d9cac413d9b754ee44e370. If there's nothing to flush,
immediately call the callback instead of deleting it. Callers were
assuming they were responsible for completing the callback whenever
flush_set() returned true, and always called complete(0) in this
case. Simplify the interface and just do this in flush_set(), so that
it always calls the callback.
Since C_GatherBuilder deletes its finisher if there are no subs,
only set its finisher when subs are present. This way we can still
call ->complete() for the callback.
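The resulting contract (flush_set() always completes the callback, either immediately or once the last sub finishes) can be sketched with a toy gather. Gather, flush_set, and the returned tuple are hypothetical simplifications in Python, not the ObjectCacher or C_GatherBuilder interfaces.

```python
from typing import Callable, List, Optional, Tuple


class Gather:
    """Toy gather: runs its finisher once every sub has completed."""

    def __init__(self) -> None:
        self.pending = 0
        self.finisher: Optional[Callable[[int], None]] = None

    def new_sub(self) -> Callable[[], None]:
        self.pending += 1
        return self._sub_done

    def has_subs(self) -> bool:
        return self.pending > 0

    def set_finisher(self, fn: Callable[[int], None]) -> None:
        self.finisher = fn

    def _sub_done(self) -> None:
        self.pending -= 1
        if self.pending == 0 and self.finisher is not None:
            self.finisher(0)


def flush_set(dirty: List[str],
              on_finish: Callable[[int], None]) -> Tuple[bool, List[Callable[[], None]]]:
    """Return (was_clean, subs); on_finish is always called exactly once."""
    gather = Gather()
    subs = [gather.new_sub() for _ in dirty]
    if gather.has_subs():
        # Only hand the callback to the gather when there is work to wait for.
        gather.set_finisher(on_finish)
        return False, subs
    # Nothing to flush: complete the callback here rather than in the caller.
    on_finish(0)
    return True, subs
```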
Josh Durgin [Wed, 13 Mar 2013 16:37:21 +0000 (09:37 -0700)]
ObjectCacher: optionally make writex always non-blocking
Add a callback argument to writex, and a finisher to run the
callbacks. Move the check for dirty+tx > max_dirty into a helper that
can be called from a wrapper around the callbacks from writex, or from
the current place in _wait_for_write().
Loic Dachary [Wed, 27 Mar 2013 20:02:57 +0000 (16:02 -0400)]
unit test LFNIndex::lfn_get_name
The escape logic is tested for
* leading . => \.
* / => \s
* \ => \\
* leading DIR_ => \d
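The four escape rules listed above can be sketched in Python as follows. This is an illustration, not the LFNIndex code: escape_object_name is a hypothetical name, only the rules named in this message are handled, and treating the leading "DIR_" as replaced by "\d" (rather than prefixed) is an assumption.

```python
def escape_object_name(name: str) -> str:
    """Apply the four escape rules from the commit message to a name."""
    out = []
    rest = name
    # Leading-prefix rules apply at most once, at the start of the name.
    if name.startswith("."):
        out.append("\\.")
        rest = name[1:]
    elif name.startswith("DIR_"):
        out.append("\\d")
        rest = name[len("DIR_"):]
    # Per-character rules apply to the remainder of the name.
    for ch in rest:
        if ch == "\\":
            out.append("\\\\")
        elif ch == "/":
            out.append("\\s")
        else:
            out.append(ch)
    return "".join(out)
```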
The file names for small object names ( size < FILENAME_PREFIX_LEN )
are created with CEPH_NOSNAP and checked to contain the _head string
and not the _long string.
The file names for long object names ( size >= FILENAME_PREFIX_LEN )
are tested to contain the _long string. A matching file is created to
check that it is removed unless it contains the expected extended
attribute.
If the SHA1 hashes of two long object names collide and they have the same
prefix, lfn_get_name increments an anticollision counter to
differentiate them. This condition is engineered because it would be
really difficult to find two long names that actually create such a
collision.
The lfn_get_name method is private and the get_mangled_name method is
used to access it. The out_path argument is not available and cannot
be tested. However, it is a trivial concatenation of the strings in the
path vector.
A TestIndex class is derived from the LFNIndex class to set the pure
virtual functions. The TestLFNIndex fixture is derived from it so that
the tests get access to the protected methods of LFNIndex.
The SetUp method of the fixture creates a PATH directory to be used by
all tests as the base path for all object files.
Josh Durgin [Thu, 28 Mar 2013 00:30:42 +0000 (17:30 -0700)]
librbd: flush cache when set_snap() is called
If there are writes pending, they should be sent while the image
is still writeable. If the image becomes read-only, flushing the
cache will just mark everything dirty again due to -EROFS.
Sage Weil [Thu, 28 Mar 2013 01:43:59 +0000 (18:43 -0700)]
ceph-disk: reimplement is_partition
Previously we were assuming any device that ended in a digit was a
partition, but this is not at all correct (e.g., /dev/sr0, /dev/rbd1).
Instead, look in /dev/disk/by-id and see if there is a symlink that ends in
-partNN that links to our device.
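A minimal sketch of the symlink check described above, assuming the by-id entries for partitions end in "-part" followed by digits. The function below is an illustration and not the ceph-disk source; the by_id_dir parameter is an addition so the check can be tried against a scratch directory.

```python
import os


def is_partition(dev: str, by_id_dir: str = "/dev/disk/by-id") -> bool:
    """Return True if a '-partNN' symlink in by_id_dir resolves to dev."""
    dev = os.path.realpath(dev)
    try:
        names = os.listdir(by_id_dir)
    except OSError:
        # No by-id directory: this sketch treats the device as not a partition.
        return False
    for name in names:
        _base, sep, suffix = name.rpartition("-part")
        if not sep or not suffix.isdigit():
            continue  # not a '-partNN' entry
        if os.path.realpath(os.path.join(by_id_dir, name)) == dev:
            return True
    return False
```

Devices such as /dev/sr0 or /dev/rbd1 end in a digit but have no "-partNN" link pointing at them, so they are no longer misclassified.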
Sam Lang [Wed, 27 Mar 2013 15:58:25 +0000 (10:58 -0500)]
mds: Delay session close if in clientreplay
If the mds is in clientreplay, a session close
request needs to be delayed until it reaches
active. Otherwise, the session state gets set to
'closing', and the replay requests get dropped on the
floor.
Fixes #4564.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Sam Lang [Wed, 27 Mar 2013 14:35:08 +0000 (09:35 -0500)]
mds: Clear backtrace updates on standby_trim_seg
If the mds is in standby, we need to clear the backtrace updates
list when a segment is trimmed, to avoid an assertion failure
when the segment is deleted.
Samuel Just [Tue, 26 Mar 2013 22:10:37 +0000 (15:10 -0700)]
ReplicatedPG: send entire stats on OP_BACKFILL_FINISH
Otherwise, we update the stat.stat structure, but not the
stat.invalid_stats part. This will result in a recently
split primary propagating the invalid stats but not the
invalid marker. Sending the whole pg_stat_t structure
also mirrors MOSDSubOp.
Fixes: #4557
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Joe Buck [Tue, 26 Mar 2013 21:17:14 +0000 (14:17 -0700)]
testing: fix hadoop-internal-test
Remove now superfluous directory changes
that are causing tests to fail.
This code should have been removed when we transitioned
from running tests with Ant to using Java to run the tests.
Signed-off-by: Joe Buck <jbbuck@gmail.com>
Reviewed-by: Noah Watkins <noahwatkins@gmail.com>
Sam Lang [Mon, 25 Mar 2013 19:55:20 +0000 (14:55 -0500)]
client: Don't signal requests already handled
The assertion failure reported in #4530 is triggered
by the following:
1. client sends request
2. mds sends unsafe reply
3. before request gets journaled, mds is killed
4. mds restarts
5. client receives session close (from close request before restart)
6. session close does kick_requests()
7. kick_requests tries to signal caller that doesn't exist.
This fix avoids signaling a caller if the unsafe reply
has been received and the make_request() function has completed.
We do this by setting the caller_cond to null once the caller
is woken up, and only signal the caller in kick_requests if
caller_cond is non-null. This avoids trying to resend requests
listed in mds_request but that have already received unsafe replies.
The unsafe requests are handled by resend_unsafe_requests() code,
so skipping those requests is allowable.
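The null check described above can be sketched as follows. This is a toy Python model with hypothetical names (ToyRequest, handle_unsafe_reply), and a threading.Event stands in for the condition variable used by the real client.

```python
import threading
from typing import Dict, List, Optional


class ToyRequest:
    """Hypothetical request record with a waiter that may already be gone."""

    def __init__(self, tid: int) -> None:
        self.tid = tid
        self.caller_cond: Optional[threading.Event] = threading.Event()
        self.got_unsafe = False


def handle_unsafe_reply(req: ToyRequest) -> None:
    """Wake the caller once, then forget it so nobody signals it again."""
    req.got_unsafe = True
    if req.caller_cond is not None:
        req.caller_cond.set()
        req.caller_cond = None


def kick_requests(requests: Dict[int, ToyRequest]) -> List[int]:
    """Signal only requests that still have a waiting caller; return their tids."""
    kicked = []
    for tid in sorted(requests):
        req = requests[tid]
        if req.caller_cond is None:
            continue  # caller already woken by an unsafe reply
        req.caller_cond.set()
        kicked.append(tid)
    return kicked
```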
Fixes #4530.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Loic Dachary [Mon, 25 Mar 2013 17:40:32 +0000 (13:40 -0400)]
fix append to uninitialized buffer in FlatIndex::created
The long_name variable is not initialized. When the append_oname
function is called, it will strlen(long_name) and get a result
that depends on the stack content. The long_name is truncated to a
zero length string to prevent this unexpected behavior.
There is no sure way to trigger the problem by writing a unit
test. Unit tests are added for all public methods of the FlatIndex
class. Most of the time the tests fail if the long_name variable is
not properly initialized.
* uint32_t collection_version()
* coll_t coll() const
* void set_ref(std::tr1::shared_ptr<CollectionIndex> ref)
* int cleanup()
* int init()
* int created(const hobject_t &hoid, const char *path)
* int unlink(const hobject_t &hoid)
* int lookup(const hobject_t &hoid, IndexedPath *path, int *exist)
* int collection_list(vector<hobject_t> *ls)
* int collection_list_partial(const hobject_t &start, int min_count, int max_count, snapid_t seq, vector<hobject_t> *ls, hobject_t *next)
There are a number of border cases that cannot be tested, such as the
logic of the lfn_get static function. Since FlatIndex code is designed
to transition from older namespace conventions, it is difficult to
figure out.
The tests rely on xattr(2) and their availability is checked before
running them.