Loic Dachary [Sat, 15 Feb 2014 10:43:13 +0000 (11:43 +0100)]
common: ping existing admin socket before unlink
When a daemon initializes it tries to create an admin socket and unlinks
any pre-existing file, regardless. If such a file is in use, it causes
the existing daemon to loose its admin socket.
The AdminSocketClient::ping is implemented to probe an existing socket,
using the "0" message. The AdminSocket::bind_and_listen function is
modified to call ping() on when it finds existing file. It unlinks the
file only if the ping fails.
http://tracker.ceph.com/issues/7188 fixes: #7188
Backport: emperor, dumpling Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 45600789f1ca399dddc5870254e5db883fb29b38)
mon: OSDMonitor: don't crash if formatter is invalid during osd crush dump
Code would assume a formatter would always be defined. If a 'plain'
formatter or even an invalid formatter were to be supplied, the monitor
would crash and burn in poor style.
Josh Durgin [Tue, 11 Feb 2014 18:14:36 +0000 (10:14 -0800)]
librbd: remove limit on number of objects in the cache
The number of objects is not a significant indicated of when data
should be written out for rbd. Use the highest possible value for
number of objects and just rely on the dirty data limits to trigger
flushing. When the number of objects is low, and many start being
flushed before they accumulate many requests, it hurts average request
size and performance for many concurrent sequential writes.
Josh Durgin [Tue, 11 Feb 2014 19:53:00 +0000 (11:53 -0800)]
ObjectCacher: use uint64_t for target and max values
All the options are uint64_t, but the ObjectCacher was converting them
to int64_t. There's never any reason for these to be negative, so
change the type.
Adjust a few conditionals so that they only convert known-positive
signed values to uint64_t before comparing with the target and max
values. Leave the actual stats accounting as loff_t for now, since
bugs in accounting will have bad effects if negative values wrap
around.
Loic Dachary [Mon, 10 Feb 2014 22:42:38 +0000 (23:42 +0100)]
common: admin socket fallback to json-pretty format
If the format argument to a command sent to the admin socket is not
among the supported formats ( json, json-pretty, xml, xml-pretty ) the
new_formatter function will return null and the AdminSocketHook::call
function must fall back to a sensible default.
The CephContextHook::call and HelpHook::call failed to do that and a
malformed format argument would cause the mon to crash. A check is added
to each of them and fallback to json-pretty if the format is not
recognized.
To further protect AdminSocketHook::call implementations from similar
problems the format argument is checked immediately after accepting the
command in AdminSocket::do_accept and replaced with json-pretty if it is
not known.
A test case is added for both CephContextHook::call and HelpHook::call
to demonstrate the problem exists and is fixed by the patch.
Three other instances of unsafe calls to new_formatter were found and
a fallback to json-pretty was added. All other calls have been audited
and appear to be safe.
Josh Durgin [Thu, 6 Feb 2014 01:22:14 +0000 (17:22 -0800)]
msg/Pipe: add option to restrict delay injection to specific msg type
This makes it possible to test timeouts reliably by delaying certain
messages effectively forever, but still being able to e.g. connect and
authenticate to the monitors.
Josh Durgin [Tue, 4 Feb 2014 01:59:21 +0000 (17:59 -0800)]
Objecter: implement mon and osd operation timeouts
This captures almost all operations from librados other than mon_commands().
Get the values for the timeouts from the Objecter constructor, so only
librados uses them.
Add C_Cancel_*_Op, finish_*_op(), and *_op_cancel() for each type of
operation, to mirror those for Op. Create a callback and schedule it
in the existing timer thread if the timeouts are specified.
Josh Durgin [Mon, 3 Feb 2014 20:53:15 +0000 (12:53 -0800)]
librados: add timeout to wait_for_osdmap()
This is used by several pool operations independent of the objecter,
including rados_ioctx_create() to look up the pool id in the first
osdmap.
Unfortunately we can't just rely on WaitInterval returning ETIMEDOUT,
since it may also get interrupted by a signal, so we can't avoid
keeping track of time explicitly here.
Sage Weil [Mon, 3 Feb 2014 16:54:14 +0000 (08:54 -0800)]
client: use 64-bit value in sync read eof logic
The file size can jump to a value that is very much larger than our current
position (for example, it could be a disk image file that gets a sparse
write at a large offset). Use a 64-bit value so that 'some' doesn't
overflow.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: John Spray <john.spray@inktank.com>
(cherry picked from commit 7ff2b541c24d1c81c3bcfbcb347694c2097993d7)
There's a window in-between receiving an MOSDPGTemp message from an OSD
and actually handling it that may lead to the pool the pg temps refer to
no longer existing. This may happen if the MOSDPGTemp message is queued
pending dispatching due to an on-going proposal (maybe even the pool
removal).
This patch fixes such behavior in two steps:
1. Check if the pool exists in the osdmap upon preprocessing
- if pool does not exist in the osdmap, then the pool must have been
removed prior to handling the message, but after the osd sent it.
- safe to ignore the pg update
2. If all pg updates in the message have been ignored, ignore the whole
message. Otherwise, let prepare handle the rest.
3. Recheck if pool exists in the osdmap upon prepare
- We may have ignored this pg back in preprocess, but other pgs in the
message may have led the message to be passed on to prepare; ignore
pg update once more.
4. Check if pool is pending removal and ignore pg update if so.
We delegate checking the pending value to prepare_pgtemp() because in this
case we should only ignore the update IFF the pending value is in fact
committed. Otherwise we should retry the message. prepare_pgtemp() is
the appropriate place to do so.
Sage Weil [Tue, 28 Jan 2014 18:26:12 +0000 (10:26 -0800)]
buffer: make 0-length splice() a no-op
This was causing a problem in the Striper, but fixing it here will avoid
corner cases all over the tree. Note that we have to bail out before
the end-of-buffer check to avoid hitting that check when the bufferlist is
also empty.
Derek Yarnell [Mon, 27 Jan 2014 19:27:51 +0000 (12:27 -0700)]
packaging: apply udev hack rule to RHEL
In the RPM spec file there is a test to deploy the uuid hack udev rules
for older udev operating systems. This includes CentOS and RHEL, but the
check currently only is for CentOS, causing RHEL clients to get a bogus
osd rules file.
Adjust the conditional to apply to RHEL as well as CentOS. (The %{rhel}
macro is defined in both platforms' redhat-rpm-config package.)
Yehuda Sadeh [Tue, 7 Jan 2014 02:32:42 +0000 (18:32 -0800)]
rgw: convert bucket info if needed
Fixes: #7110
In dumpling, the bucket info was separated into bucket entry point and
bucket instance objects. When setting bucket attrs we only ended up
updating the bucket instance object. However, pre-dumpling buckets still
keep everything at the entry-point object, so acl changes didn't affect
anything (because we never updated the entry point). This change just
converts the bucket info into the new format.
Sage Weil [Sun, 5 Jan 2014 06:40:43 +0000 (22:40 -0800)]
osd: ignore OSDMap messages while we are initializing
The mon may occasionally send OSDMap messages to random OSDs, but is not
very descriminating in that we may not have authenticated yet. Ignore any
messages if that is the case; we will reqeust whatever we need during the
BOOTING state.
Sage Weil [Sun, 5 Jan 2014 06:43:26 +0000 (22:43 -0800)]
mon: only send messages to current OSDs
When choosing a random OSD to send a message to, verify not only that
the OSD id is up but that the session is for the same instance of that OSD
by checking that the address matches.
Loic Dachary [Sun, 15 Dec 2013 15:27:02 +0000 (16:27 +0100)]
mon: set ceph osd (down|out|in|rm) error code on failure
Instead of always returning true, the error code is set if at least one
operation fails.
EINVAL if the OSD id is invalid (osd.foobar for instance).
EBUSY if trying to remove and OSD that is up.
When used with the ceph command line, it looks like this:
ceph -c ceph.conf osd rm osd.0
Error EBUSY: osd.0 is still up; must be down before removal.
kill PID_OF_osd.0
ceph -c ceph.conf osd down osd.0
marked down osd.0.
ceph -c ceph.conf osd rm osd.0 osd.1
Error EBUSY: removed osd.0, osd.1 is still up; must be down before removal.
Josh Durgin [Fri, 27 Dec 2013 01:38:52 +0000 (17:38 -0800)]
librbd: call user completion after incrementing perfcounters
The perfcounters (and the ictx) are only valid while the image is
still open. If the librbd user gets the callback for its last I/O,
then closes the image, the ictx and its perfcounters will be
invalid. If the AioCompletion object is has not run the rest of its
complete() method yet, it will access these now-invalid addresses,
possibly leading to a crash.
The AioCompletion object is independent of the ictx and does not
access it again after incrementing perfcounters, so avoid this race by
calling the user's callback after this step. The AioCompletion object
will be cleaned up by the rest of complete_request(), independent of
the ImageCtx.
Josh Durgin [Sat, 7 Dec 2013 00:03:20 +0000 (16:03 -0800)]
objecter: don't take extra throttle budget for resent ops
These ops have already taken their budget in the original op_submit().
It will be returned via put_op_budget() when they complete.
If there were many localized reads of missing objects from replicas,
or cache pool redirects, this would cause the objecter to use up all
of its op throttle budget and hang.
Josh Durgin [Fri, 6 Dec 2013 01:34:38 +0000 (17:34 -0800)]
osd: drop writes when full instead of returning an error
There's a race between the client and osd with a newly marked full
osdmap. If the client gets the new map first, it blocks writes and
everything works as expected, with no errors from the osd.
If the osd gets the map first, however, it will respond to any writes
with -ENOSPC. Clients will pass this up the stack, and not retry these
writes later. -ENOSPC isn't handled well by all clients. RBD, for
example, may pass it on to qemu or kernel rbd which will both
interpret it as EIO. Filesystems on top of rbd will not behave well
when they receive EIOs like this, especially if the cluster oscillates
between full and not full, so some writes succeed.
To fix this, never return ENOSPC from the osd because of a map marked
full, and rely on the client to retry all writes when the map is no
longer marked full.
Old clients talking to osds with this fix will hang instead of
propagating an error, but only if they run into this race
condition. ceph-fuse and rbd with caching enabled are not affected,
since the ObjectCacher will retry writes that return errors.
Yehuda Sadeh [Thu, 7 Nov 2013 00:15:47 +0000 (16:15 -0800)]
objecter: set op->paused in recalc_op_target(), resend in not paused
When going through scan_requests() in handle_osd_map() we need to make
sure that if an op should not be paused anymore then set it on the op
itself, and return NEED_RESEND. Otherwise we're going to miss reset of
the full flag.
Also in handle_osd_map(), make sure that op shouldn't be paused before
sending it. There's a lot of cleanup around that area that we should
probably be doing, make the code much more tight.
Josh Durgin [Wed, 6 Nov 2013 02:46:37 +0000 (10:46 +0800)]
objecter: don't resend paused ops
Paused ops are meant to block on the client side until a new map that
unpauses them is recieved. If we send paused writes when the FULL flag
is set, we'll get -ENOSPC from the osds, which is not what Objecter
users expect. This may cause rbd without caching to produce an I/O
error instead of waiting for the cluster to have capacity.
Yehuda Sadeh [Wed, 18 Dec 2013 21:10:21 +0000 (13:10 -0800)]
rgw: don't return data within the librados cb
Fixes: #7030
The callback is running within a single Finisher thread, thus we
shouldn't block there. Append read data to a list and flush it within
the iterate context.
mon: OSDMonitor: receive CephInt on 'osd pool set' instead on CephString
This partially reverts 2fe0d0d9 in order to allow Emperor monitors to
forward mon command messages to Dumpling monitors without breaking a
cluster.
The need for this patch became obvious after issue #6796 was triggered.
Basically, in a mixed cluster of Emperor/Dumpling monitors, if a client
happens to obtain the command descriptions from an Emperor monitor and
then issue an 'osd pool set' this can turn out in one of two ways:
1. client msg gets forwarded to an Emperor leader and everything's a-okay;
2. client msg gets forwarded to a Dumpling leader and the string fails to
be interpreted without the monitor noticing, thus leaving the monitor with
an uninitialized variable leading to trouble.
If 2 is triggered, a multitude of bad things can happen, such as thousands
of pg splits, due to a simple 'osd set pool foo pg_num 128' turning out
to be interpreted as 109120394 or some other random number.
This patch is such that we make sure the client sends an integer instead
of a string. We also make sure to interpret anything the client sends as
possibly being a string, or an integer.
Fixes: 6796
Backport: emperor
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 337195f04653eed8e8f153a5b074f3bd48408998) Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Josh Durgin [Mon, 25 Nov 2013 21:43:43 +0000 (13:43 -0800)]
init, upstart: prevent daemons being started by both
There can be only one init system starting a daemon. If there is a
host entry in ceph.conf for a daemon, sysvinit would try to start it
even if the daemon's directory did not include a sysvinit file. This
preserves backwards compatibility with older installs using sysvinit,
but if an upstart file is present in the daemon's directory, upstart
will try to start them, regardless of host entries in ceph.conf.
If there's an upstart file in a daemon's directory and a host entry
for that daemon in ceph.conf, both sysvinit and upstart would attempt
to manage it.
Fix this by only starting daemons if the marker file for the other
init system is not present. This maintains backwards compatibility
with older installs using neither sysvinit or upstart marker files,
and does not break any valid configurations. The only configuration
that would break is one with both sysvinit and upstart files present
for the same daemon.
Backport: emperor, dumpling Reported-by: Tim Spriggs <tims@uahirise.org> Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
(cherry picked from commit 5e34beb61b3f5a1ed4afd8ee2fe976de40f95ace)
Yehuda Sadeh [Wed, 27 Nov 2013 21:34:00 +0000 (13:34 -0800)]
rgw: don't error out on empty owner when setting acls
Fixes: #6892
Backport: dumpling, emperor
s3cmd specifies empty owner field when trying to set acls on object
/ bucket. We errored out as it didn't match the current owner name, but
with this change we ignore it.
Samuel Just [Tue, 12 Nov 2013 23:15:26 +0000 (15:15 -0800)]
ReplicatedPG: test for missing head before find_object_context
find_object_context doesn't return EAGAIN for a missing head.
I chose not to change that behavior since it might hide bugs
in the future. All other callers check for missing on head
before calling into find_object_context because we potentially
need head or snapdir to map a snapid onto a clone.
Backport: emperor Fixes: 6758 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit dd9d8b020286d5e3a69455023c3724a7b436d687)
Josh Durgin [Mon, 18 Nov 2013 22:39:12 +0000 (14:39 -0800)]
osd: fix bench block size
The command was declared to take 'size' in dumpling, but was trying to
read 'bsize' instead, so it always used the default of 4MiB. Change
the bench command to read 'size', so it matches what existing clients
are sending.
Shipping an object_info_t to a replica with the dirty
flag set would cause the replica to interpret that
object as being lost. Instead, we always encode
lost into the slot where dumpling expects to find
it and add another field at the end of the encoding.
Backport: emperor Fixes: #6761 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Samuel Just [Wed, 6 Nov 2013 22:33:03 +0000 (14:33 -0800)]
ReplicatedPG: don't skip missing if sentries is empty on pgls
Formerly, if sentries is empty, we skip missing. In general,
we need to continue adding items from missing until we get
to next (returned from collection_list_partial) to avoid
missing any objects.
Fixes: #6633 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
Samuel Just [Tue, 5 Nov 2013 23:40:29 +0000 (15:40 -0800)]
RadosModel: use sharedptr_registry for snaps_in_use
There might be two concurrent rollback ops each of which
adds snap x to snaps_in_use. Between when the first
completes and the second completes, snap x may be removed
since the first would have removed snap x from snaps_in_use.
Using sharedptr_registry here avoids this by ensuring that
the snap won't be removed from snaps_in_use until all refs
are gone.
This patch also adds size() to sharedptr_registry.
Fixes: #6719 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
Samuel Just [Mon, 4 Nov 2013 19:25:31 +0000 (11:25 -0800)]
FileStore::_collection_move_rename: handle missing dst dir on replay
In case of a replay, a missing destination directory indicates that
the destination object and directory have been removed by a later
transaction. Thus, we need to remove the src object and return
0.
Fixes: #6714 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Danny Al-Gaaf [Mon, 4 Nov 2013 22:30:47 +0000 (23:30 +0100)]
galois.c: fix compiler warning
galois_create_split_w8_tables() takes no parameter, remove '8' passed
to the function in one case.
osd/ErasureCodePluginJerasure/galois.c: In function 'galois_w32_region_multiply':
osd/ErasureCodePluginJerasure/galois.c:696:5: warning: call to function 'galois_create_split_w8_tables' without a real prototype [-Wunprototyped-calls]
In file included from osd/ErasureCodePluginJerasure/galois.c:53:0:
osd/ErasureCodePluginJerasure/galois.h:71:12: note: 'galois_create_split_w8_tables' was declared here
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Samuel Just [Mon, 4 Nov 2013 05:02:36 +0000 (21:02 -0800)]
OSD: allow project_pg_history to handle a missing map
If we get a peering message for an old map we don't have, we
can throwit out: the sending OSD will learn about the newer
maps and update itself accordingly, and we don't have the
information to know if the message is valid. This situation
can only happen if the sender was down for a long enough time
to create a map gap and its PGs have not yet advanced from
their boot-up maps to the current ones, so we can rely on it
Fixes: #6712 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Samuel Just [Sun, 3 Nov 2013 19:06:10 +0000 (11:06 -0800)]
OSD: don't clear peering_wait_for_split in advance_map()
I really don't know why I added this... Ops can be discarded from the
waiting_for_pg queue if we aren't primary simply because there must have
been an exchange of peering events before subops will be sent within a
particular epoch. Thus, any events in the waiting_for_pg queue must be
client ops which should only be seen by the primary. Peering events, on
the other hand, should only be discarded if we are in a new interval,
and that check might as well be performed in the peering wq.
Fixes: #6681 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Samuel Just [Sat, 2 Nov 2013 20:54:51 +0000 (13:54 -0700)]
ReplicatedPG::recover_backfill: adjust last_backfill to HEAD if snapdir
Otherwise, if last_backfill_started is a snapdir, we will fail to send a
transaction for a client IO creating the head object and removing the
snapdir object. The result will be that head will eventually be
backfilled, but the snapdir object will erroneously not be removed.
Fixes: #6685 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Josh Durgin [Sat, 2 Nov 2013 02:02:29 +0000 (19:02 -0700)]
rbd: omit 'rw' option during map
The ro and rw options were added in linux 3.7. To be compatible with
older kernels, don't specify rw. The default will probably always be
rw, so this should not present any problems in the future.
Greg Farnum [Fri, 1 Nov 2013 22:45:02 +0000 (15:45 -0700)]
OSDMonitor: be a little nicer about letting users do pg splitting
We were previously blocking pg splits whenever pg creations were in-
progress, but we only really need to avoid splitting any pgs which are
currently being created. Let the user set a different pg_num if there
are no creating PGs on the pool in question.
Fixes: #6673, take two Signed-off-by: Greg Farnum <greg@inktank.com>
Samuel Just [Thu, 31 Oct 2013 20:19:32 +0000 (13:19 -0700)]
sharedptr_registry.hpp: removed ptrs need to not blast contents
See the included unit test update. Consider:
1) x = lookup_or_create(1, 1)
2) remove(1)
3) y = lookup_or_create(1, 2)
4) x.reset()
5) z = lookup(1)
The bug is that z will be null since x.reset() caused the
cleanup callback to remove y's key value from contents.
To fix this, contents also records the pointer value for
the weak_ptr. The removal callback only removes the
key from contents if it matches the ptr in contents.
This should work since the pointer passed to the removal
callback must be unique up to that point since it has
not yet been deleted.
This allowed a pg removal -> pg recreation -> pg removal
sequence to cause the second pg removal entry to be
erroneously cleared by the first pg removal's destructor
as it finally made its way through the removal queue.
Fixes: #5951 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Noah Watkins [Wed, 30 Oct 2013 23:34:29 +0000 (16:34 -0700)]
prio-q: initialize cur iterator
For new SubQueues `cur` is not intialized, so front/pop_front will freak
out. I honestly I have no idea how this hasn't been seen, but it was
being triggered frequently on OSX.
Fixes: #6686 Signed-off-by: Noah Watkins <noahwatkins@gmail.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Wed, 30 Oct 2013 23:54:39 +0000 (16:54 -0700)]
PGLog: remove obsolete assert in merge_log
This assert assumes that if olog.head != log.head, olog contains
a log entry at log.head, which may not be true since pg splitting
might have left the log with arbitrary holes.
Sage Weil [Tue, 29 Oct 2013 18:08:58 +0000 (11:08 -0700)]
upstart, sysvinit: use ceph-crush-location hook
Instead of hard-coding a check in ceph.conf and some reasonable
defaults, defer this work to ceph-crush-location, and allow users to
specify their own hook with alternative logic.
This can be helpful in a nubmer of cases, like:
- rack (or other) information included in hostname and easily parsed
out by a hook
- multiple types of devices in each host, resulting in 'parallel'
crush trees (e.g., one for hdd, one for ssd)