Yehuda Sadeh [Thu, 23 Jan 2014 21:48:28 +0000 (13:48 -0800)]
rgw: fix listing of multipart upload parts
Fixes: #7169
There are two issues here. One is that we may return more entries than
we should (as specified by max_parts). Second issue is that the
NextPartNumberMarker is set incorrectly. Both of these issues mainly
affect uploads with > 1000 parts, although can be triggered with less
than that.
Fixes: #6829
Backport: dumpling, emperor
We didn't init this member variable, which might cause that when
modifying user info that has this flag set the 'system' flag might
inadvertently reset.
Robin H. Johnson [Sun, 15 Dec 2013 20:26:19 +0000 (12:26 -0800)]
rgw: Fix CORS allow-headers validation
This fix is needed because Ceph presently validates CORS headers in a
case-sensitive manner. Keeps a local cache of lowercased allowed headers
to avoid converting the allowed headers to lowercase each time.
CORS 6.2.6: If any of the header field-names is not a ASCII
case-insensitive match for any of the values in list of headers do not
set any additional headers and terminate this set of steps.
Signed-off-by: Robin H. Johnson <robbat2@gentoo.org> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 31b60bfd9347a386ff12b4e4f1812d664bcfff01)
Robin H. Johnson [Sun, 15 Dec 2013 19:40:31 +0000 (11:40 -0800)]
rgw: Clarify naming of case-change functions
It is not clear that the lowercase_http_attr & uppercase_http_attr
functions replace dashes with underscores. Rename them to match the
pattern established by the camelcase_dash_http_attr function in
preperation for more case-change functions as needed by later fixes.
Signed-off-by: Robin H. Johnson <robbat2@gentoo.org> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 6a7edab2673423c53c6a422a10cb65fe07f9b235)
Robin H. Johnson [Sun, 15 Dec 2013 19:27:49 +0000 (11:27 -0800)]
rgw: Look at correct header about headers for CORS
The CORS standard dictates that preflight requests are made with the
Access-Control-Request-Headers header containing the headers of the
author request. The Access-Control-Allow-Headers header is sent in the
response.
The present code looks for Access-Control-Allow-Headers in request, so
fix it to look at Access-Control-Request-Headers instead.
Signed-off-by: Robin H. Johnson <robbat2@gentoo.org> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 2abacd9678ae04cefac457882ba718a454948915)
Yehuda Sadeh [Fri, 6 Dec 2013 19:07:09 +0000 (11:07 -0800)]
rgw: fix reading bucket policy in RGWBucket::get_policy()
Fixes: 6940
Backport: dumpling, emperor
We changed the way we keep the bucket policy, and we shouldn't try to
access the bucket object directly. This had changed when we added the
bucket instance object around dumpling.
Reported-by: Gao, Wei M <wei.m.gao@intel.com> Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 7a9a088d82d04f6105d72f6347673724ac16c9f8)
Yehuda Sadeh [Thu, 16 Jan 2014 19:45:27 +0000 (11:45 -0800)]
rgw: handle racing object puts when object doesn't exist
If the object didn't exist before and now we have multiple puts coming
in concurrently, we need to make sure that we behave correctly. Only one
needs to win, the other one can fail silently. We do that by setting
exclusive flag on the object creation and handling the error correctly.
Note that we still want to return -EEXIST in some cases (when the
exclusive flag is passed to put_obj_meta(), e.g., on bucket creation).
Yehuda Sadeh [Thu, 16 Jan 2014 19:33:49 +0000 (11:33 -0800)]
rgw: don't return -ENOENT in put_obj_meta()
Fixes: #7168
An object put may race with the same object's delete. In this case just
ignore the error, same behavior as if object was created and then
removed.
Robin H. Johnson [Sun, 19 Jan 2014 02:01:20 +0000 (18:01 -0800)]
rgw: Use correct secret key for POST authn
The POST authentication by signature validation looked up a user based
on the access key, then used the first secret key for the user. If the
access key used was not the first access key, then the expected
signature would be wrong, and the POST would be rejected.
There's a window in-between receiving an MOSDPGTemp message from an OSD
and actually handling it that may lead to the pool the pg temps refer to
no longer existing. This may happen if the MOSDPGTemp message is queued
pending dispatching due to an on-going proposal (maybe even the pool
removal).
This patch fixes such behavior in two steps:
1. Check if the pool exists in the osdmap upon preprocessing
- if pool does not exist in the osdmap, then the pool must have been
removed prior to handling the message, but after the osd sent it.
- safe to ignore the pg update
2. If all pg updates in the message have been ignored, ignore the whole
message. Otherwise, let prepare handle the rest.
3. Recheck if pool exists in the osdmap upon prepare
- We may have ignored this pg back in preprocess, but other pgs in the
message may have led the message to be passed on to prepare; ignore
pg update once more.
4. Check if pool is pending removal and ignore pg update if so.
We delegate checking the pending value to prepare_pgtemp() because in this
case we should only ignore the update IFF the pending value is in fact
committed. Otherwise we should retry the message. prepare_pgtemp() is
the appropriate place to do so.
Sage Weil [Tue, 28 Jan 2014 18:26:12 +0000 (10:26 -0800)]
buffer: make 0-length splice() a no-op
This was causing a problem in the Striper, but fixing it here will avoid
corner cases all over the tree. Note that we have to bail out before
the end-of-buffer check to avoid hitting that check when the bufferlist is
also empty.
Derek Yarnell [Mon, 27 Jan 2014 19:27:51 +0000 (12:27 -0700)]
packaging: apply udev hack rule to RHEL
In the RPM spec file there is a test to deploy the uuid hack udev rules
for older udev operating systems. This includes CentOS and RHEL, but the
check currently only is for CentOS, causing RHEL clients to get a bogus
osd rules file.
Adjust the conditional to apply to RHEL as well as CentOS. (The %{rhel}
macro is defined in both platforms' redhat-rpm-config package.)
Yehuda Sadeh [Tue, 7 Jan 2014 02:32:42 +0000 (18:32 -0800)]
rgw: convert bucket info if needed
Fixes: #7110
In dumpling, the bucket info was separated into bucket entry point and
bucket instance objects. When setting bucket attrs we only ended up
updating the bucket instance object. However, pre-dumpling buckets still
keep everything at the entry-point object, so acl changes didn't affect
anything (because we never updated the entry point). This change just
converts the bucket info into the new format.
Sage Weil [Sun, 5 Jan 2014 06:40:43 +0000 (22:40 -0800)]
osd: ignore OSDMap messages while we are initializing
The mon may occasionally send OSDMap messages to random OSDs, but is not
very descriminating in that we may not have authenticated yet. Ignore any
messages if that is the case; we will reqeust whatever we need during the
BOOTING state.
Sage Weil [Sun, 5 Jan 2014 06:43:26 +0000 (22:43 -0800)]
mon: only send messages to current OSDs
When choosing a random OSD to send a message to, verify not only that
the OSD id is up but that the session is for the same instance of that OSD
by checking that the address matches.
Sage Weil [Mon, 26 Aug 2013 20:58:47 +0000 (13:58 -0700)]
osd: discriminate based on connection messenger, not peer type
Replace ->get_source().is_osd() checks and instead see if it is the
cluster_messenger so that we do not confuse ourselves when we get
legit requests from other OSDs on our public interface.
NOTE: backporting this because a mixed cluster may send OSD requests
via the client interface, even though dumpling doesn't do this.
Loic Dachary [Sun, 15 Dec 2013 15:27:02 +0000 (16:27 +0100)]
mon: set ceph osd (down|out|in|rm) error code on failure
Instead of always returning true, the error code is set if at least one
operation fails.
EINVAL if the OSD id is invalid (osd.foobar for instance).
EBUSY if trying to remove and OSD that is up.
When used with the ceph command line, it looks like this:
ceph -c ceph.conf osd rm osd.0
Error EBUSY: osd.0 is still up; must be down before removal.
kill PID_OF_osd.0
ceph -c ceph.conf osd down osd.0
marked down osd.0.
ceph -c ceph.conf osd rm osd.0 osd.1
Error EBUSY: removed osd.0, osd.1 is still up; must be down before removal.
Josh Durgin [Fri, 27 Dec 2013 01:38:52 +0000 (17:38 -0800)]
librbd: call user completion after incrementing perfcounters
The perfcounters (and the ictx) are only valid while the image is
still open. If the librbd user gets the callback for its last I/O,
then closes the image, the ictx and its perfcounters will be
invalid. If the AioCompletion object is has not run the rest of its
complete() method yet, it will access these now-invalid addresses,
possibly leading to a crash.
The AioCompletion object is independent of the ictx and does not
access it again after incrementing perfcounters, so avoid this race by
calling the user's callback after this step. The AioCompletion object
will be cleaned up by the rest of complete_request(), independent of
the ImageCtx.
Josh Durgin [Sat, 7 Dec 2013 00:03:20 +0000 (16:03 -0800)]
objecter: don't take extra throttle budget for resent ops
These ops have already taken their budget in the original op_submit().
It will be returned via put_op_budget() when they complete.
If there were many localized reads of missing objects from replicas,
or cache pool redirects, this would cause the objecter to use up all
of its op throttle budget and hang.
Josh Durgin [Fri, 6 Dec 2013 01:34:38 +0000 (17:34 -0800)]
osd: drop writes when full instead of returning an error
There's a race between the client and osd with a newly marked full
osdmap. If the client gets the new map first, it blocks writes and
everything works as expected, with no errors from the osd.
If the osd gets the map first, however, it will respond to any writes
with -ENOSPC. Clients will pass this up the stack, and not retry these
writes later. -ENOSPC isn't handled well by all clients. RBD, for
example, may pass it on to qemu or kernel rbd which will both
interpret it as EIO. Filesystems on top of rbd will not behave well
when they receive EIOs like this, especially if the cluster oscillates
between full and not full, so some writes succeed.
To fix this, never return ENOSPC from the osd because of a map marked
full, and rely on the client to retry all writes when the map is no
longer marked full.
Old clients talking to osds with this fix will hang instead of
propagating an error, but only if they run into this race
condition. ceph-fuse and rbd with caching enabled are not affected,
since the ObjectCacher will retry writes that return errors.
Yehuda Sadeh [Thu, 7 Nov 2013 00:15:47 +0000 (16:15 -0800)]
objecter: set op->paused in recalc_op_target(), resend in not paused
When going through scan_requests() in handle_osd_map() we need to make
sure that if an op should not be paused anymore then set it on the op
itself, and return NEED_RESEND. Otherwise we're going to miss reset of
the full flag.
Also in handle_osd_map(), make sure that op shouldn't be paused before
sending it. There's a lot of cleanup around that area that we should
probably be doing, make the code much more tight.
Josh Durgin [Wed, 6 Nov 2013 02:46:37 +0000 (10:46 +0800)]
objecter: don't resend paused ops
Paused ops are meant to block on the client side until a new map that
unpauses them is recieved. If we send paused writes when the FULL flag
is set, we'll get -ENOSPC from the osds, which is not what Objecter
users expect. This may cause rbd without caching to produce an I/O
error instead of waiting for the cluster to have capacity.
Yehuda Sadeh [Wed, 18 Dec 2013 21:10:21 +0000 (13:10 -0800)]
rgw: don't return data within the librados cb
Fixes: #7030
The callback is running within a single Finisher thread, thus we
shouldn't block there. Append read data to a list and flush it within
the iterate context.
- Added config option to allow S3 to use Keystone auth
- Implemented JSONDecoder for KeystoneToken
- RGW_Auth_S3::authorize now uses rgw_store_user_info on keystone auth
- Minor fix in get_canon_resource; dout is now after the assignment
Reviewed-by: Yehuda Sadeh<yehuda@inktank.com> Signed-off-by: Roald J. van Loon <roaldvanloon@gmail.com>
(cherry picked from commit a200e184b15a03a4ca382e94caf01efb41cb9db7)
Yehuda Sadeh [Mon, 21 Oct 2013 21:17:12 +0000 (14:17 -0700)]
rgw: turn swift COPY into PUT
Fixes: #6606
The swift COPY operation is unique in a sense that it's a write
operation that has its destination not set by the URI target, but by a
different HTTP header. This is problematic as there are some hidden
assumptions in the code that the specified bucket/object in the URI is
the operation target. E.g., certain initialization functions, quota,
etc. Instead of creating a specialized code everywhere for this case
just turn it into a regular copy operation, that is, a PUT with
a specified copy source.
Sage Weil [Wed, 4 Dec 2013 05:51:26 +0000 (21:51 -0800)]
osd/OSDMonitor: accept 'osd pool set ...' value as string
Newer monitors take this as a CephString. Accept that so that if we are
mid-upgrade and get a forwarded message using the alternate schema from
a future mon we will handle it properly.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Josh Durgin [Mon, 25 Nov 2013 21:43:43 +0000 (13:43 -0800)]
init, upstart: prevent daemons being started by both
There can be only one init system starting a daemon. If there is a
host entry in ceph.conf for a daemon, sysvinit would try to start it
even if the daemon's directory did not include a sysvinit file. This
preserves backwards compatibility with older installs using sysvinit,
but if an upstart file is present in the daemon's directory, upstart
will try to start them, regardless of host entries in ceph.conf.
If there's an upstart file in a daemon's directory and a host entry
for that daemon in ceph.conf, both sysvinit and upstart would attempt
to manage it.
Fix this by only starting daemons if the marker file for the other
init system is not present. This maintains backwards compatibility
with older installs using neither sysvinit or upstart marker files,
and does not break any valid configurations. The only configuration
that would break is one with both sysvinit and upstart files present
for the same daemon.
Backport: emperor, dumpling Reported-by: Tim Spriggs <tims@uahirise.org> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 5e34beb61b3f5a1ed4afd8ee2fe976de40f95ace)
Samuel Just [Mon, 4 Nov 2013 05:02:36 +0000 (21:02 -0800)]
OSD: allow project_pg_history to handle a missing map
If we get a peering message for an old map we don't have, we
can throwit out: the sending OSD will learn about the newer
maps and update itself accordingly, and we don't have the
information to know if the message is valid. This situation
can only happen if the sender was down for a long enough time
to create a map gap and its PGs have not yet advanced from
their boot-up maps to the current ones, so we can rely on it
Fixes: #6712 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit cd0d612e1abdf5c87082eeeccd4ca09dd14fd737)
Yehuda Sadeh [Mon, 19 Aug 2013 23:56:27 +0000 (16:56 -0700)]
rgw: bucket meta remove don't overwrite entry point first
Fixes: #6056
When removing a bucket metadata entry we first unlink the bucket
and then we remove the bucket entrypoint object. Originally
when unlinking the bucket we first overwrote the bucket entrypoint
entry marking it as 'unlinked'. However, this is not really needed
as we're just about to remove it. The original version triggered
a bug, as we needed to propagate the new header version first (which
we didn't do, so the subsequent bucket removal failed).
Josh Durgin [Mon, 18 Nov 2013 22:39:12 +0000 (14:39 -0800)]
osd: fix bench block size
The command was declared to take 'size' in dumpling, but was trying to
read 'bsize' instead, so it always used the default of 4MiB. Change
the bench command to read 'size', so it matches what existing clients
are sending.
David Zafman [Wed, 25 Sep 2013 16:19:16 +0000 (09:19 -0700)]
os, osd, tools: Add backportable compatibility checking for sharded objects
OSD
New CEPH_OSD_FEATURE_INCOMPAT_SHARDS
FileStore
NEW CEPH_FS_FEATURE_INCOMPAT_SHARDS
Add FSSuperblock with feature CompatSet in it
Store sharded_objects state using CompatSet
Add set_allow_sharded_objects() and get_allow_sharded_objects() to FileStore/ObjectStore
Add read_superblock()/write_superblock() internal filestore functions
ceph_filestore_dump
Add OSDsuperblock to export format
Use CompatSet from OSD code itself in filestore-dump tool
Always check compatibility of OSD features with on-disk features
On import verify compatibility of on-disk features with export data
Bump super_ver due to export format change
Backport: dumpling, cuttlefish
Signed-off-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit c6b83180f9f769de27ca7890f5f8ec507ee743ca)
Excluded from cherry-pick:
Didn't add set_allow_sharded_objects() and get_allow_sharded_objects() to FileStore/ObjectStore
Didn't add code to check for incomplete transition to sharded objects in ceph-filestore-dump
rgw: when failing read from client, return correct error
Fixes: #6214
When getting a failed read from client when putting an object
we returned the wrong value (always 0), which in the chunked-
upload case ended up in assuming that the write was done
successfully.
Yehuda Sadeh [Mon, 26 Aug 2013 18:16:08 +0000 (11:16 -0700)]
rgw: quiet down warning message
Fixes: #6123
We don't want to know about failing to read region map info
if it's not found, only if failed on some other error. In
any case it's just a warning.
Josh Durgin [Thu, 24 Oct 2013 15:42:48 +0000 (08:42 -0700)]
rgw: escape bucket and object names in StreamReadRequests
This fixes copy operations for objects that contain unsafe characters,
like a newline, which would return a 403 otherwise, since the GET to
the source rgw would be unable to verify the signature on a partially
valid bucket name.
Josh Durgin [Thu, 24 Oct 2013 15:37:25 +0000 (08:37 -0700)]
rgw: move url escaping to a common place
This is useful outside of the s3 interface. Rename url_escape()
url_encode() for consistency with the exsting common url_decode()
function. This is in preparation for the next commit, which needs
to escape url-unsafe characters in another place.
Josh Durgin [Thu, 24 Oct 2013 15:34:24 +0000 (08:34 -0700)]
rgw: update metadata log list to match data log list
Send the last marker whether the log is truncated in the same format
as data log list, so clients don't have more needless complexity
handling the difference. Keep bucket index logs the same, since they
contain the marker already, and are not used in exactly the same way
metadata and data logs are.
Josh Durgin [Thu, 24 Oct 2013 15:26:19 +0000 (08:26 -0700)]
rgw: include marker and truncated flag in data log list api
Consumers of this api need to know their position in the log. It's
readily available when fetching the log, so return it. Without the
marker in this call, a client could not easily or efficiently figure
out its position in the log, since it would require getting the global
last marker in the log, and then reading all the log entries.
This would be slow for large logs, and would be subject to races that
would cause potentially very expensive duplicate work.
Returning this atomically while fetching the log entries simplifies
all of this.
Josh Durgin [Thu, 24 Oct 2013 15:18:19 +0000 (08:18 -0700)]
cls_log: always return final marker from log_list
There's no reason to restrict returning the marker to the case where
less than the whole log is returned, since there's already a truncated
flag to tell the client what happened.
Giving the client the last marker makes it easy to consume when the
log entries do not contain their own marker. If the last marker is not
returned, the client cannot get the last marker without racing with
updates to the log.
Yan, Zheng [Thu, 10 Oct 2013 02:35:48 +0000 (10:35 +0800)]
mds: fix infinite loop of MDCache::populate_mydir().
make MDCache::populate_mydir() only fetch bare-bone stray dirs.
After all stray dirs are populated, call MDCache::scan_stray_dir(),
it fetches incomplete stray dirs.
Fixes: #4405 Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Greg Farnum <greg@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 007f06ec174d4ee5cfb578c8b3f1c96b2bb0c238)
Yehuda Sadeh [Tue, 15 Oct 2013 17:20:48 +0000 (10:20 -0700)]
rgw: fix authenticated users acl group check
Fixes: #6553
Backport: bobtail, cuttlefish, dumpling
Authenticated users group acl bit was not working correctly. Check to
test whether user is anonymous was wrong.
osdc/ObjectCacher: finish contexts after dropping object reference
The context to finish can be class C_Client_PutInode, which may drop
inode's last reference. So we should first drop object's reference,
then finish contexts.
Sandon Van Ness [Tue, 8 Oct 2013 19:08:08 +0000 (12:08 -0700)]
Go back to $PWD in fsstress.sh if compiling from source.
Although fsstress was being called with a static path the directory
it was writing to was in the current directory so doing a cd to the
source directory that is made in /tmp and then removing it later
caused it to be unable to write the files in a non-existent dir.
This change gets the current path first and cd's back into it after
it is done compiling fsstress.
Issue #6479.
Signed-off-by: Sandon Van Ness <sandon@inktank.com> Reviewed-by: Alfredo Deza <alfredo.deza@inktank.com>
Signed-off-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 139a714e13aa3c7f42091270b55dde8a17b3c4b8)
mon: OSDMonitor: do not write full_latest during trim
On commit 81983bab we patched OSDMonitor::update_from_paxos() such that we
write the latest full map version to 'full_latest' each time the latest
full map was built from the incremental versions.
This change however clashed with OSDMonitor::encode_trim_extra(), which
also wrote to 'full_latest' on each trim, writing instead the version of
the *oldest* full map. This duality of behaviors could lead the store
to an inconsistent state across the monitors (although there's no sign of
it actually imposing any issues besides rebuilding already existing full
maps on some monitors).
We now stop OSDMonitor::encode_trim_extra() from writing to 'full_latest'.
This function will still write out the oldest full map it has in the store,
but it will no longer write to full_latest, instead leaving it up to
OSDMonitor::update_from_paxos() to figure it out -- and it already does.
Fixes: #6378
Backport: dumpling
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit bd0f29a2c28cca496ec830eac932477ebf3182ba)
We now put CEPH_ARGS in the actual args we parse in python, which are passed
to rados piecemeal later. This lets you put things like --id ... in there
that need to be parsed before librados is initialized.
(cherry picked from commit 97f462be4829f0167ed3d65e6694dfc16f1f3243)
David Zafman [Mon, 9 Sep 2013 20:01:12 +0000 (13:01 -0700)]
crushtool: do not dump core with non-unique bucket IDs
Return -EEXIST on duplicate ID
BUG FIX: crush_add_bucket() mixes error returns and IDs
Add optional argument to return generated ID
Fixes: #6246 Signed-off-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 8c76f3a0f9cf100ea2c941dc2b61c470aa5033d7)