Yehuda Sadeh [Tue, 15 Oct 2013 17:20:48 +0000 (10:20 -0700)]
rgw: fix authenticated users acl group check
Fixes: #6553
Backport: bobtail, cuttlefish, dumpling
The authenticated users group ACL bit was not working correctly: the
check for whether the user is anonymous was wrong.
rgw: drain pending requests before completing write
Fixes: #6268
When doing an aio write of objects (either regular or multipart parts) we
need to drain pending aio requests. Otherwise, if the gateway goes down,
the object might end up corrupted.
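A minimal sketch of the drain step, in Python rather than the gateway's C++. The PendingWrites class and its method names are hypothetical, and a thread pool stands in for the aio machinery. The only point is that completion is reported after every outstanding write has finished.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class PendingWrites:
    """Hypothetical stand-in for an object write with in-flight aio requests."""

    def __init__(self, executor):
        self._executor = executor
        self._pending = []   # futures for writes not yet known to be done
        self.stored = {}     # offset -> data, filled in by the async writes

    def aio_write(self, offset, data):
        # Queue the write and remember it so that it can be drained later.
        fut = self._executor.submit(self.stored.__setitem__, offset, data)
        self._pending.append(fut)

    def drain(self):
        # Block until every queued write has finished, re-raising any failure.
        wait(self._pending)
        for fut in self._pending:
            fut.result()
        self._pending.clear()

    def complete(self):
        # Drain first: only then is it safe to report the object as written.
        self.drain()
        return sorted(self.stored)
```

Calling complete() without the drain would let the caller see success while writes are still queued, which is the window the commit closes.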
When posting an object it is possible to provide a key
name that refers to the original filename, however we
need to verify that in the end we don't end up with an
empty object name.
Josh Durgin [Wed, 21 Aug 2013 21:28:49 +0000 (14:28 -0700)]
objecter: resend unfinished lingers when osdmap is no longer paused
Plain Ops that haven't finished yet need to be resent if the osdmap
transitions from full or paused to unpaused. If these Ops are
triggered by LingerOps, they will be cancelled instead (since
should_resend = false), but the LingerOps that triggered them will not
be resent.
Fix this by checking the registered flag for all linger ops, and
resending any of them that aren't paused anymore.
Fixes: #6070
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage.weil@inktank.com>
(cherry picked from commit 38a0ca66a79af4b541e6322467ae3a8a4483cc72)
Sage Weil [Tue, 13 Aug 2013 19:52:41 +0000 (12:52 -0700)]
librados: fix async aio completion wakeup
For aio flush, we register a wait on the most recent write. The write
completion code, however, was *only* waking the waiter if they were waiting
on that write, without regard to previous writes (completed or not).
For example, we might have 6 and 7 outstanding and wait on 7. If they
finish in order all is well, but if 7 finishes first we do the flush
completion early. Similarly, if we
Josh Durgin [Tue, 13 Aug 2013 02:17:09 +0000 (19:17 -0700)]
librados: fix locking for AioCompletionImpl refcounting
Add an already-locked helper so that C_Aio{Safe,Complete} can
increment the reference count when their caller holds the
lock. C_AioCompleteAndSafe's caller is not holding the lock, so call
regular get() to ensure no racing updates can occur.
This eliminates all direct manipulations of AioCompletionImpl->ref,
and makes the necessary locking clear.
The only place C_AioCompleteAndSafe is used is in handling
aio_flush_async(). This could cause a missing completion.
Refs: #5919
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Tested-by: Oliver Francke <Oliver.Francke@filoo.de>
(cherry picked from commit 7a52e2ff5025754f3040eff3fc52d4893cafc389)
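The split between a locking get() and an already-locked helper might look roughly like the following Python sketch. The Completion class and the _get name are hypothetical and do not mirror the real AioCompletionImpl interface.

```python
import threading


class Completion:
    """Hypothetical stand-in for a refcounted completion object."""

    def __init__(self):
        self.lock = threading.Lock()
        self._ref = 1

    def _get(self):
        # Already-locked helper: the caller must hold self.lock.
        # locked() only shows that someone holds the lock, so this is a
        # sanity check rather than proof that the caller is the holder.
        assert self.lock.locked()
        self._ref += 1

    def get(self):
        # Regular variant: takes the lock itself, for callers that do not hold it.
        with self.lock:
            self._get()

    def put(self):
        # Drop one reference under the lock and return the remaining count.
        with self.lock:
            self._ref -= 1
            return self._ref
```

With both entry points in place, no caller needs to touch the counter directly, so every update happens under the lock.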
Yehuda Sadeh [Mon, 12 Aug 2013 17:05:44 +0000 (10:05 -0700)]
rgw: fix multi delete
Fixes: #5931
Backport: bobtail, cuttlefish
Fix a bad check, where we compare the wrong field. Instead of
comparing the ret code to 0, we compare the string value to 0
which generates implicit casting, hence the crash.
Sage Weil [Mon, 3 Jun 2013 04:21:09 +0000 (21:21 -0700)]
ceph-fuse: create finisher threads after fork()
The ObjectCacher and MonClient classes both instantiate Finisher
threads. We need to make sure they are created *after* the fork(2)
or else the process will fail to join() them on shutdown, and the
threads will not exist while fuse is doing useful work.
Put CephFuse on the heap and move all this initialization into the child
block, and make sure errors are passed back to the parent.
Fix-proposed-by: Alexandre Marangone <alexandre.maragone@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Yehuda Sadeh [Fri, 14 Jun 2013 21:53:54 +0000 (14:53 -0700)]
rgw: escape prefix correctly when listing objects
Fixes: #5362
When listing objects, the prefix needs to be escaped correctly (the
same as with the marker). Otherwise, listing objects with a prefix
that starts with an underscore doesn't work.
Backport: bobtail, cuttlefish
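A small illustration of why the prefix has to go through the same escaping as stored names. The escaping rule used here (one extra leading underscore for names that start with an underscore) is an assumption made for the sketch; this log does not spell out rgw's actual encoding, and both function names are hypothetical.

```python
def escape_name(name):
    """Assumed escaping rule: add one underscore in front of a leading underscore."""
    return "_" + name if name.startswith("_") else name


def list_objects(stored_names, prefix):
    """List logical names starting with `prefix`, given escaped stored names."""
    # Escape the prefix exactly like an object name before comparing.
    stored_prefix = escape_name(prefix)
    out = []
    for stored in stored_names:
        if stored.startswith(stored_prefix):
            # Undo the assumed escaping to recover the logical name.
            out.append(stored[1:] if stored.startswith("_") else stored)
    return out
```

Comparing the raw prefix "_tmp/" against the stored name "__tmp/a" fails, which matches the symptom described above.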
Sage Weil [Thu, 6 Jun 2013 23:35:54 +0000 (16:35 -0700)]
osd: do not include logbl in scrub map
This is a potentially huge object/file, usually prefixed by a zeroed region
on disk, that is not used by scrub at all. It dates back to f51348dc8bdd5071b7baaf3f0e4d2e0496618f08 (2008) and the original version of
scrub.
This *might* fix #4179. It is not a leak per se, but I observed 1GB
scrub messages going over the wire. Maybe the allocations are causing
fragmentation, or the sub_op queues are growing.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 0b036ecddbfd82e651666326d6f16b3c000ade18)
Yehuda Sadeh [Fri, 7 Jun 2013 04:53:00 +0000 (21:53 -0700)]
rgw: handle deep uri resources
In the case of deep uri resources (ones created beyond a single level
of hierarchy, e.g. auth/v1.0) we want to create a new empty
handler for the path if no handler exists. E.g., for
auth/v1.0 we need to have a handler for 'auth', otherwise
the default S3 handler will be used, which we don't want.
Yehuda Sadeh [Fri, 7 Jun 2013 04:47:21 +0000 (21:47 -0700)]
rgw: fix get_resource_mgr() to correctly identify resource
Fixes: #5262
The original test was not comparing the correct string, so it ended up
just checking whether a substring of the uri matched the resource.
Samuel Just [Mon, 15 Apr 2013 23:33:48 +0000 (16:33 -0700)]
PG: don't write out pg map epoch every handle_activate_map
We don't actually need to write out the pg map epoch on every
activate_map as long as:
a) the osd does not trim past the oldest pg map persisted
b) the pg does update the persisted map epoch from time
to time.
To that end, we now keep a reference to the last map persisted.
The OSD already does not trim past the oldest live OSDMapRef.
Second, handle_activate_map will trim if the difference between
the current map and the last_persisted_map is large enough.
Fixes: #4731
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
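The scheme can be modelled as follows, assuming that the "large enough" check triggers a write of the epoch. The PGEpochTracker class, its max_lag threshold and the write counter are all hypothetical and exist only for illustration.

```python
class PGEpochTracker:
    """Hypothetical model of when a PG persists its map epoch."""

    def __init__(self, persisted_epoch, max_lag):
        self.last_persisted = persisted_epoch  # epoch most recently written out
        self.max_lag = max_lag                 # assumed threshold, in epochs
        self.writes = 0                        # number of epoch writes issued

    def handle_activate_map(self, current_epoch):
        # Persist only when the gap to the last persisted epoch is large enough.
        if current_epoch - self.last_persisted > self.max_lag:
            self.last_persisted = current_epoch
            self.writes += 1

    def oldest_needed_epoch(self):
        # Condition (a): the OSD must not trim maps older than this epoch.
        return self.last_persisted
```

With a threshold of 10, thirty consecutive map activations cause two writes instead of thirty, while the oldest epoch the PG still needs stays bounded.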
Yehuda Sadeh [Thu, 30 May 2013 19:58:11 +0000 (12:58 -0700)]
rgw: only append prefetched data if reading from head
Fixes: #5209
Backport: bobtail, cuttlefish
If the head object wrongfully contains data, but according to the
manifest we don't read from the head, we shouldn't copy the prefetched
data. Also fix the length calculation for that data.
Yehuda Sadeh [Thu, 30 May 2013 16:34:21 +0000 (09:34 -0700)]
rgw: don't copy object idtag when copying object
Fixes: #5204
When copying an object we ended up also copying the original
object's idtag, which overrode the newly generated one. When
refcount put is called with the wrong idtag the count
doesn't go down.
Samuel Just [Tue, 28 May 2013 18:10:05 +0000 (11:10 -0700)]
HashIndex: sync top directory during start_split,merge,col_split
Otherwise, the links might be ordered after the in progress
operation tag write. We need the in progress operation tag to
correctly recover from an interrupted merge, split, or col_split.
Fixes: #5180
Backport: cuttlefish, bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5bca9c38ef5187c7a97916970a7fa73b342755ac)
Yehuda Sadeh [Thu, 23 May 2013 04:34:52 +0000 (21:34 -0700)]
rgw: iterate usage entries from correct entry
Fixes: #5152
When iterating through usage entries, and when user id was
provided, we started at the user's first entry and not from
the entry indexed by the request start time.
This commit fixes the issue.
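A sketch of the intended seek, assuming the usage index is an ordered list keyed by (user, time). That layout and the usage_entries name are assumptions for illustration, not the actual rgw data structures.

```python
from bisect import bisect_left


def usage_entries(index, user, start_time):
    """Yield (time, value) for `user` from `start_time` onwards.

    `index` is a list of ((user, time), value) pairs sorted by key; this
    layout is an assumption made for the sketch.
    """
    keys = [key for key, _ in index]
    # Seek to the entry indexed by (user, start_time), not to the user's first entry.
    i = bisect_left(keys, (user, start_time))
    while i < len(index) and index[i][0][0] == user:
        yield index[i][0][1], index[i][1]
        i += 1
```

Starting the scan at (user, 0) instead would return entries that predate the requested start time, which is the behaviour the commit removes.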
Josh Durgin [Thu, 16 May 2013 22:28:40 +0000 (15:28 -0700)]
librbd: make image creation defaults configurable
Programs using older versions of the image creation functions can't
set newer parameters like image format and fancier striping.
Setting these options lets them use all the new functionality without
being patched and recompiled to use e.g. rbd_create3().
This is particularly useful for things like qemu-img, which does not
know how to create format 2 images yet.
Sage Weil [Thu, 25 Apr 2013 18:13:33 +0000 (11:13 -0700)]
init-ceph: use remote config when starting daemons on remote nodes (-a)
If you use -a to start a remote daemon, assume the remote config is present
instead of pushing the local config. This makes more sense and simplifies
things.
Note that this means that -a in concert with -c foo means that foo must
also be present on the remote node in the same path. That, however, is a
use case that I don't particularly care about right now. :)
Samuel Just [Wed, 24 Apr 2013 19:20:17 +0000 (12:20 -0700)]
PG: call check_recovery_sources in remove_down_peer_info
If we transition out of peering due to affected
prior set, we won't trigger start_peering_interval
and check_recovery_sources won't get called. This
will leave an entry in missing_loc_sources without
a matching missing set. We always want to
check_recovery_sources with remove_down_peer_info.
Fixes: #4805
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 81a6165c13c533e9c1c6684ab7beac09d52ca6b5)
Samuel Just [Thu, 25 Apr 2013 21:08:57 +0000 (14:08 -0700)]
PG: clear want_acting when we leave Primary
This is somewhat annoying actually. Intuitively we want to
clear_primary_state when we leave primary, but when we restart
peering due to a change in prior set status, we can't afford
to forget most of our peering state. want_acting, on the
other hand, should never persist across peering attempts.
In fact, in the future, want_acting should be pulled into
the Primary state structure.
Fixes: #3904
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit a5cade1fe7338602fb2bbfa867433d825f337c87)
elector: trigger a mon reset whenever we bump the epoch
We need to call reset during every election cycle; luckily we
can call it more than once. bump_epoch is (by definition!) only called
once per cycle, and it's called at the beginning, so we put it there.
With an OSD sharing data and journal, the previous code created the
journal partition from the end of the device. A uint32_t is
used in sgdisk to get the last sector; with a large HD, uint32_t
is too small.
The journal partition will be created backwards from
a sector in the middle of the disk, leaving space before
and after it. The data partition will use whichever of
these spaces is greater. The remainder will not be used.
This patch creates the journal partition from the start as a workaround.
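The arithmetic behind the overflow can be checked with a few lines of Python. The 512-byte sector size and the 3 TB example disk are assumptions, and the truncation is modelled as keeping the low 32 bits; how sgdisk actually narrows the value is not described here.

```python
SECTOR_SIZE = 512          # bytes; assumed logical sector size
UINT32_MAX = 2**32 - 1     # largest value a uint32_t can hold


def last_sector(disk_bytes):
    """Index of the last sector on a disk of the given size."""
    return disk_bytes // SECTOR_SIZE - 1


def as_uint32(value):
    """Model storing a value in a uint32_t: keep only the low 32 bits."""
    return value & UINT32_MAX


disk = 3 * 10**12                      # a 3 TB drive (example size)
true_last = last_sector(disk)
seen_last = as_uint32(true_last)
print(true_last > UINT32_MAX)          # whether the real index overflows 32 bits
print(seen_last / true_last)           # where on the disk the wrapped index lands
```

With 512-byte sectors a 32-bit sector index covers 2 TiB, so any larger device wraps around to a sector somewhere inside the disk rather than at its end.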
Danny Al-Gaaf [Thu, 4 Apr 2013 13:54:31 +0000 (15:54 +0200)]
Makefile.am: install ceph-* python scripts to /usr/bin directly
Install ceph-* scripts directly to $(prefix)$(sbindir) (which
normally would be /usr/sbin) instead of moving them around after
installation in the SPEC file or debian files.
Sage Weil [Fri, 29 Mar 2013 03:49:24 +0000 (20:49 -0700)]
ceph-disk: implement 'list'
This is based on Sandon's initial patch, but much-modified.
Mounts ceph data volumes temporarily to see what is inside. Attempts to
associate journals with osds.
Resolves: #3120
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 6a65b9131c444041d16b880c6a7f332776063a78)
Sage Weil [Thu, 28 Mar 2013 01:43:59 +0000 (18:43 -0700)]
ceph-disk: reimplement is_partition
Previously we were assuming any device that ended in a digit was a
partition, but this is not at all correct (e.g., /dev/sr0, /dev/rbd1).
Instead, look in /dev/disk/by-id and see if there is a symlink that ends in
-partNN that links to our device.
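A sketch of that check, not the actual ceph-disk code. The by_id_dir parameter is added here so the function can be pointed at a test directory; the default is the path named above.

```python
import os
import re


def is_partition(dev, by_id_dir="/dev/disk/by-id"):
    """Return True if some symlink named *-partNN in by_id_dir points at dev."""
    dev = os.path.realpath(dev)
    if not os.path.isdir(by_id_dir):
        return False
    for name in os.listdir(by_id_dir):
        # Only links whose name ends in -part<digits> describe partitions.
        if not re.search(r"-part\d+$", name):
            continue
        # Resolve the link and compare it with the device we were asked about.
        if os.path.realpath(os.path.join(by_id_dir, name)) == dev:
            return True
    return False
```

Under this test /dev/sr0 and /dev/rbd1 are not treated as partitions merely because their names end in a digit; they would need a matching -partNN link.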
Gary Lowell [Tue, 26 Mar 2013 18:31:16 +0000 (11:31 -0700)]
ceph-disk: udevadm settle before partprobe
After changing the partition table, allow the udev event to be
processed before calling partprobe. This helps prevent partprobe
from getting a resource busy error on some platforms.
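The ordering can be sketched like this. The function name and the injectable run parameter are hypothetical, and the real ceph-disk code may invoke the tools differently; only the sequence of udevadm settle followed by partprobe comes from the description above.

```python
import subprocess


def reread_partition_table(dev, run=subprocess.check_call):
    """Let udev finish handling the change, then re-probe the device."""
    # Wait for queued udev events (from the partition table change) to finish.
    run(["udevadm", "settle"])
    # Only then re-probe the device, so partprobe does not race with udev.
    run(["partprobe", dev])
```

Passing a different run callable makes it possible to check the order of the two commands without touching a real device.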