Ilya Dryomov [Wed, 29 Jan 2014 14:12:01 +0000 (16:12 +0200)]
rbd: check for watchers before trimming an image on 'rbd rm'
Check for watchers before trimming image data to try to avoid getting
into the following situation:
- user does 'rbd rm' on a mapped image with an fs mounted from it
- 'rbd rm' trims (removes) all image data, only header is left
- 'rbd rm' tries to remove a header and fails because krbd has a
watcher registered on the header
- at this point image cannot be unmapped because of the mounted fs
- fs cannot be unmounted because all its data and metadata is gone
Unfortunately, this fix doesn't make it impossible to happen (the
required atomicity isn't there), but it's a big improvement over the
status quo.
Loic Dachary [Sat, 15 Feb 2014 10:43:13 +0000 (11:43 +0100)]
common: ping existing admin socket before unlink
When a daemon initializes it tries to create an admin socket and unlinks
any pre-existing file, regardless. If such a file is in use, it causes
the existing daemon to loose its admin socket.
The AdminSocketClient::ping is implemented to probe an existing socket,
using the "0" message. The AdminSocket::bind_and_listen function is
modified to call ping() on when it finds existing file. It unlinks the
file only if the ping fails.
http://tracker.ceph.com/issues/7188 fixes: #7188
Backport: emperor, dumpling Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 45600789f1ca399dddc5870254e5db883fb29b38)
mon: OSDMonitor: don't crash if formatter is invalid during osd crush dump
Code would assume a formatter would always be defined. If a 'plain'
formatter or even an invalid formatter were to be supplied, the monitor
would crash and burn in poor style.
Samuel Just [Tue, 15 Oct 2013 20:11:29 +0000 (13:11 -0700)]
OSD: ping tphandle during pg removal
Fixes: #6528 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c658258d9e2f590054a30c0dee14a579a51bda8c)
mon: OSDMonitor: allow (un)setting 'hashpspool' flag via 'osd pool set'
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 1c2886964a0c005545abab0cf8feae7e06ac02a8)
Conflicts:
src/mon/MonCommands.h
src/mon/OSDMonitor.cc
mon: ceph hashpspool false clears the flag
instead of toggling it. Signed-off-by: Loic Dachary <loic@dachary.org> Reviewed-by: Christophe Courtaut <christophe.courtaut@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 589e2fa485b94244c79079f249428d4d545fca18
Replace some of the infrastructure required by this command that
was not present in Dumpling with single-use code. Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Wed, 12 Feb 2014 19:30:15 +0000 (11:30 -0800)]
OSD: disable the PGStatsAck timeout when we are reconnecting to a monitor
Previously, the timeout counter started as soon as we issued the reopen,
but if the reconnect process itself took a while, we might time out and
issue another reopen just as we get to the point where it's possible to
get work done. Since the mon client has its own reconnect timeouts (that is,
the OSD doesn't need to trigger those), we instead disable our timeouts
while the reconnect is happening, and then turn them back on again starting
from when we get the reconnect callback.
Greg Farnum [Wed, 12 Feb 2014 21:51:48 +0000 (13:51 -0800)]
monc: backoff the timeout period when reconnecting
If the monitors are systematically slowing down, we don't want to spam
them with reconnect attempts every three seconds. Instead, every time
we issue a reconnect, multiply our timeout period by a configurable; when
we complete the connection, reduce that multipler by 50%. This should let
us respond to monitor load.
Of course, we don't want to do that for initial startup in the case of a
couple down monitors, so don't apply the backoff until we've successfully
connected to a monitor at least once.
Greg Farnum [Wed, 12 Feb 2014 01:53:56 +0000 (17:53 -0800)]
monc: let users specify a callback when they reopen their monitor session
Then the callback is triggered when a new session is established, and the
daemon can do whatever it likes. There are no guarantees about how long it
might take to trigger, though. In particular we call the provided callback
while not holding our own lock in order to avoid deadlock. This could lead
to some funny ordering from the user's perspective if they call
reopen_session() again before getting the callback, but there's no way around
that, so they just have to use it appropriately.
Yehuda Sadeh [Mon, 6 Jan 2014 20:53:58 +0000 (12:53 -0800)]
radosgw-admin: fix object policy read op
Fixes: #7083
This was broken when we fixed #6940. We use the same function to both
read the bucket policy and the object policy. However, each needed to be
treated differently. Restore old behavior for objects.
Sage Weil [Mon, 7 Oct 2013 12:22:20 +0000 (05:22 -0700)]
os/FileStore: fix ENOENT error code for getattrs()
In commit dc0dfb9e01d593afdd430ca776cf4da2c2240a20 the omap xattrs code
moved up a block and r was no longer local to the block. Translate
ENOENT -> 0 to compensate.
Fix the same error in _rmattrs().
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 6da4b91c07878e07f23eee563cf1d2422f348c2f)
Alfredo Deza [Wed, 12 Feb 2014 21:43:59 +0000 (16:43 -0500)]
add support for absence of PATH
Note that this commit is actually bisecting the changes from
Loic Dachary that touch ceph-disk only (ad515bf). As that changeset
also touches other files it causes conflicts that are not resolvable
for backporting it to dumpling.
Sage Weil [Tue, 10 Sep 2013 05:27:23 +0000 (22:27 -0700)]
ceph-disk: make initial journal files 0 bytes
The ceph-osd will resize journal files up and properly fallocate() them
so that the blocks are preallocated and (hopefully) contiguous. We
don't need to do it here too, and getting fallocate() to work from
python is a pain in the butt.
Josh Durgin [Wed, 29 Jan 2014 01:26:58 +0000 (17:26 -0800)]
ceph-disk: run the right executables from udev
When run by the udev rules, PATH is not defined. Thus,
ceph-disk-activate relies on its which() function to locate the
correct executable. The which() function used os.defpath if none was
set, and this worked for anything using it.
ad6b4b4b08b6ef7ae8086f2be3a9ef521adaa88c added a new default value to
PATH, so only /usr/bin was checked by callers that did not use
which(). This resulted in the mount command not being found when
ceph-disk-activate was run by udev, and thus osds failing to start
after being prepared by ceph-deploy.
Make ceph-disk consistently use the existing helpers (command() and
command_check_call()) that use which(), so lack of PATH does not
matter. Simplify _check_output() to use command(),
another wrapper around subprocess.Popen.
Loic Dachary [Wed, 1 Jan 2014 21:11:30 +0000 (22:11 +0100)]
ceph-disk: create the data directory if it does not exist
Instead of failing if the OSD data directory does not exist, create
it. Only do so if the data directory is not enforced to be a device via
the use of the --data-dev flag. The directory is not recursively created.
Loic Dachary [Mon, 30 Dec 2013 22:57:39 +0000 (23:57 +0100)]
ceph-disk: implement --mark-init=none
It is meant to be used when preparing and activating a directory that is
not to be used with init. No file is created to identify the init
system, no symbolic link is made to the directory in /var/lib/ceph
and the init scripts are not called.
Loic Dachary [Wed, 1 Jan 2014 21:07:57 +0000 (22:07 +0100)]
ceph-disk: fsid is a known configuration option
Use get_conf_with_default instead of get_conf because fsid is a known
ceph configuration option. It allows overriding via CEPH_ARGS which is
convenient for testing. Only options that are not found in config_opts.h
are fetch via get_conf.
Loic Dachary [Mon, 30 Dec 2013 22:07:27 +0000 (23:07 +0100)]
ceph-disk: which() uses PATH first
Instead of relying on a hardcoded set of if paths. Although this has the
potential of changing the location of the binary being used by ceph-disk
on an existing installation, it is currently only used for sgdisk. It
could be disruptive for someone using a modified version of sgdisk but
the odds of this happening are very low.
Loic Dachary [Mon, 30 Dec 2013 21:48:46 +0000 (22:48 +0100)]
ceph-disk: add --prepend-to-path to control execution
/usr/bin is hardcoded in front of some ceph programs which makes it
impossible to control where they are located via the PATH.
The hardcoded path cannot be removed altogether because it will most
likely lead to unexpected and difficult to diagnose problems for
existing installations where the PATH finds the program elsewhere.
The --prepend-to-path flag is added and defaults to /usr/bin : it prepends
to the PATH environment variable. The hardcoded path is removed
and the PATH will be used: since /usr/bin is searched first, the
legacy behavior will not change.
Loic Dachary [Mon, 30 Dec 2013 11:26:20 +0000 (12:26 +0100)]
ceph-disk: prepare --data-dir must not override files
ceph-disk does nothing when given a device that is already prepared. If
given a directory that already contains a successfully prepared OSD, it
will however override it.
Instead of overriding the files in the osd data directory, return
immediately if the magic file exists. Make it so the magic file is
created last to accurately reflect the success of the OSD preparation.
Josh Durgin [Tue, 11 Feb 2014 18:14:36 +0000 (10:14 -0800)]
librbd: remove limit on number of objects in the cache
The number of objects is not a significant indicated of when data
should be written out for rbd. Use the highest possible value for
number of objects and just rely on the dirty data limits to trigger
flushing. When the number of objects is low, and many start being
flushed before they accumulate many requests, it hurts average request
size and performance for many concurrent sequential writes.
Josh Durgin [Tue, 11 Feb 2014 19:53:00 +0000 (11:53 -0800)]
ObjectCacher: use uint64_t for target and max values
All the options are uint64_t, but the ObjectCacher was converting them
to int64_t. There's never any reason for these to be negative, so
change the type.
Adjust a few conditionals so that they only convert known-positive
signed values to uint64_t before comparing with the target and max
values. Leave the actual stats accounting as loff_t for now, since
bugs in accounting will have bad effects if negative values wrap
around.
Loic Dachary [Mon, 10 Feb 2014 22:42:38 +0000 (23:42 +0100)]
common: admin socket fallback to json-pretty format
If the format argument to a command sent to the admin socket is not
among the supported formats ( json, json-pretty, xml, xml-pretty ) the
new_formatter function will return null and the AdminSocketHook::call
function must fall back to a sensible default.
The CephContextHook::call and HelpHook::call failed to do that and a
malformed format argument would cause the mon to crash. A check is added
to each of them and fallback to json-pretty if the format is not
recognized.
To further protect AdminSocketHook::call implementations from similar
problems the format argument is checked immediately after accepting the
command in AdminSocket::do_accept and replaced with json-pretty if it is
not known.
A test case is added for both CephContextHook::call and HelpHook::call
to demonstrate the problem exists and is fixed by the patch.
Three other instances of unsafe calls to new_formatter were found and
a fallback to json-pretty was added. All other calls have been audited
and appear to be safe.
Josh Durgin [Thu, 6 Feb 2014 01:22:14 +0000 (17:22 -0800)]
msg/Pipe: add option to restrict delay injection to specific msg type
This makes it possible to test timeouts reliably by delaying certain
messages effectively forever, but still being able to e.g. connect and
authenticate to the monitors.
Josh Durgin [Tue, 4 Feb 2014 01:59:21 +0000 (17:59 -0800)]
Objecter: implement mon and osd operation timeouts
This captures almost all operations from librados other than mon_commands().
Get the values for the timeouts from the Objecter constructor, so only
librados uses them.
Add C_Cancel_*_Op, finish_*_op(), and *_op_cancel() for each type of
operation, to mirror those for Op. Create a callback and schedule it
in the existing timer thread if the timeouts are specified.
Josh Durgin [Mon, 3 Feb 2014 20:53:15 +0000 (12:53 -0800)]
librados: add timeout to wait_for_osdmap()
This is used by several pool operations independent of the objecter,
including rados_ioctx_create() to look up the pool id in the first
osdmap.
Unfortunately we can't just rely on WaitInterval returning ETIMEDOUT,
since it may also get interrupted by a signal, so we can't avoid
keeping track of time explicitly here.
Sage Weil [Mon, 3 Feb 2014 16:54:14 +0000 (08:54 -0800)]
client: use 64-bit value in sync read eof logic
The file size can jump to a value that is very much larger than our current
position (for example, it could be a disk image file that gets a sparse
write at a large offset). Use a 64-bit value so that 'some' doesn't
overflow.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: John Spray <john.spray@inktank.com>
(cherry picked from commit 7ff2b541c24d1c81c3bcfbcb347694c2097993d7)
Yehuda Sadeh [Thu, 23 Jan 2014 21:48:28 +0000 (13:48 -0800)]
rgw: fix listing of multipart upload parts
Fixes: #7169
There are two issues here. One is that we may return more entries than
we should (as specified by max_parts). Second issue is that the
NextPartNumberMarker is set incorrectly. Both of these issues mainly
affect uploads with > 1000 parts, although can be triggered with less
than that.
Fixes: #6829
Backport: dumpling, emperor
We didn't init this member variable, which might cause that when
modifying user info that has this flag set the 'system' flag might
inadvertently reset.
Robin H. Johnson [Sun, 15 Dec 2013 20:26:19 +0000 (12:26 -0800)]
rgw: Fix CORS allow-headers validation
This fix is needed because Ceph presently validates CORS headers in a
case-sensitive manner. Keeps a local cache of lowercased allowed headers
to avoid converting the allowed headers to lowercase each time.
CORS 6.2.6: If any of the header field-names is not a ASCII
case-insensitive match for any of the values in list of headers do not
set any additional headers and terminate this set of steps.
Signed-off-by: Robin H. Johnson <robbat2@gentoo.org> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 31b60bfd9347a386ff12b4e4f1812d664bcfff01)
Robin H. Johnson [Sun, 15 Dec 2013 19:40:31 +0000 (11:40 -0800)]
rgw: Clarify naming of case-change functions
It is not clear that the lowercase_http_attr & uppercase_http_attr
functions replace dashes with underscores. Rename them to match the
pattern established by the camelcase_dash_http_attr function in
preperation for more case-change functions as needed by later fixes.
Signed-off-by: Robin H. Johnson <robbat2@gentoo.org> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 6a7edab2673423c53c6a422a10cb65fe07f9b235)
Robin H. Johnson [Sun, 15 Dec 2013 19:27:49 +0000 (11:27 -0800)]
rgw: Look at correct header about headers for CORS
The CORS standard dictates that preflight requests are made with the
Access-Control-Request-Headers header containing the headers of the
author request. The Access-Control-Allow-Headers header is sent in the
response.
The present code looks for Access-Control-Allow-Headers in request, so
fix it to look at Access-Control-Request-Headers instead.
Signed-off-by: Robin H. Johnson <robbat2@gentoo.org> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 2abacd9678ae04cefac457882ba718a454948915)
Yehuda Sadeh [Fri, 6 Dec 2013 19:07:09 +0000 (11:07 -0800)]
rgw: fix reading bucket policy in RGWBucket::get_policy()
Fixes: 6940
Backport: dumpling, emperor
We changed the way we keep the bucket policy, and we shouldn't try to
access the bucket object directly. This had changed when we added the
bucket instance object around dumpling.
Reported-by: Gao, Wei M <wei.m.gao@intel.com> Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 7a9a088d82d04f6105d72f6347673724ac16c9f8)
Yehuda Sadeh [Thu, 16 Jan 2014 19:45:27 +0000 (11:45 -0800)]
rgw: handle racing object puts when object doesn't exist
If the object didn't exist before and now we have multiple puts coming
in concurrently, we need to make sure that we behave correctly. Only one
needs to win, the other one can fail silently. We do that by setting
exclusive flag on the object creation and handling the error correctly.
Note that we still want to return -EEXIST in some cases (when the
exclusive flag is passed to put_obj_meta(), e.g., on bucket creation).
Yehuda Sadeh [Thu, 16 Jan 2014 19:33:49 +0000 (11:33 -0800)]
rgw: don't return -ENOENT in put_obj_meta()
Fixes: #7168
An object put may race with the same object's delete. In this case just
ignore the error, same behavior as if object was created and then
removed.
Robin H. Johnson [Sun, 19 Jan 2014 02:01:20 +0000 (18:01 -0800)]
rgw: Use correct secret key for POST authn
The POST authentication by signature validation looked up a user based
on the access key, then used the first secret key for the user. If the
access key used was not the first access key, then the expected
signature would be wrong, and the POST would be rejected.
There's a window in-between receiving an MOSDPGTemp message from an OSD
and actually handling it that may lead to the pool the pg temps refer to
no longer existing. This may happen if the MOSDPGTemp message is queued
pending dispatching due to an on-going proposal (maybe even the pool
removal).
This patch fixes such behavior in two steps:
1. Check if the pool exists in the osdmap upon preprocessing
- if pool does not exist in the osdmap, then the pool must have been
removed prior to handling the message, but after the osd sent it.
- safe to ignore the pg update
2. If all pg updates in the message have been ignored, ignore the whole
message. Otherwise, let prepare handle the rest.
3. Recheck if pool exists in the osdmap upon prepare
- We may have ignored this pg back in preprocess, but other pgs in the
message may have led the message to be passed on to prepare; ignore
pg update once more.
4. Check if pool is pending removal and ignore pg update if so.
We delegate checking the pending value to prepare_pgtemp() because in this
case we should only ignore the update IFF the pending value is in fact
committed. Otherwise we should retry the message. prepare_pgtemp() is
the appropriate place to do so.
Sage Weil [Tue, 28 Jan 2014 18:26:12 +0000 (10:26 -0800)]
buffer: make 0-length splice() a no-op
This was causing a problem in the Striper, but fixing it here will avoid
corner cases all over the tree. Note that we have to bail out before
the end-of-buffer check to avoid hitting that check when the bufferlist is
also empty.
Derek Yarnell [Mon, 27 Jan 2014 19:27:51 +0000 (12:27 -0700)]
packaging: apply udev hack rule to RHEL
In the RPM spec file there is a test to deploy the uuid hack udev rules
for older udev operating systems. This includes CentOS and RHEL, but the
check currently only is for CentOS, causing RHEL clients to get a bogus
osd rules file.
Adjust the conditional to apply to RHEL as well as CentOS. (The %{rhel}
macro is defined in both platforms' redhat-rpm-config package.)
Yehuda Sadeh [Tue, 7 Jan 2014 02:32:42 +0000 (18:32 -0800)]
rgw: convert bucket info if needed
Fixes: #7110
In dumpling, the bucket info was separated into bucket entry point and
bucket instance objects. When setting bucket attrs we only ended up
updating the bucket instance object. However, pre-dumpling buckets still
keep everything at the entry-point object, so acl changes didn't affect
anything (because we never updated the entry point). This change just
converts the bucket info into the new format.