auth: add rwlock to AuthClientHandler to prevent races
For cephx, build_authorizer reads a bunch of state (especially the
current session_key) which can be updated by the MonClient. With no
locks held, Pipe::connect() calls SimpleMessenger::get_authorizer()
which ends up calling RadosClient::get_authorizer() and then
AuthClientHandler::build_authorizer(). This unsafe usage can lead to
crashes like:
Program terminated with signal 11, Segmentation fault.
0x00007fa0d2ddb7cb in ceph::buffer::ptr::release (this=0x7f987a5e3070) at common/buffer.cc:370
370 common/buffer.cc: No such file or directory.
in common/buffer.cc
(gdb) bt
0x00007fa0d2ddb7cb in ceph::buffer::ptr::release (this=0x7f987a5e3070) at common/buffer.cc:370
0x00007fa0d2ddec00 in ~ptr (this=0x7f989c03b830) at ./include/buffer.h:171
ceph::buffer::list::rebuild (this=0x7f989c03b830) at common/buffer.cc:817
0x00007fa0d2ddecb9 in ceph::buffer::list::c_str (this=0x7f989c03b830) at common/buffer.cc:1045
0x00007fa0d2ea4dc2 in Pipe::connect (this=0x7fa0c4307340) at msg/Pipe.cc:907
0x00007fa0d2ea7d73 in Pipe::writer (this=0x7fa0c4307340) at msg/Pipe.cc:1518
0x00007fa0d2eb44dd in Pipe::Writer::entry (this=<value optimized out>) at msg/Pipe.h:59
0x00007fa0e0f5f9d1 in start_thread (arg=0x7f987a5e4700) at pthread_create.c:301
0x00007fa0de560b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Fix this by adding an rwlock to AuthClientHandler. A simpler fix would
be to move RadosClient::get_authorizer() into the MonClient, under the
MonClient lock, but that would not cover other uses of the Authorizer
(e.g. for verify_authorizer()) and it would serialize independent
connection attempts.
This mainly matters for cephx, but none and unknown can have the
global_id reset as well.
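To illustrate the reader/writer split, a minimal sketch follows (using
std::shared_mutex rather than Ceph's own RWLock type; all names here are
illustrative, not the actual AuthClientHandler interface):

    #include <cstdint>
    #include <shared_mutex>

    struct Authorizer;            // stands in for the real authorizer type
    struct SessionKey {};         // stands in for the cephx session key

    class AuthClientHandlerSketch {
      mutable std::shared_mutex lock;   // the new rwlock
      SessionKey session_key;           // rotated by the MonClient thread
      uint64_t global_id = 0;

    public:
      // Called from Pipe::connect() via get_authorizer(): readers only.
      Authorizer *build_authorizer() const {
        std::shared_lock<std::shared_mutex> l(lock);
        // ... build an authorizer from session_key / global_id ...
        return nullptr;
      }

      // Called by the MonClient when the session key or global_id changes.
      void set_session_key(const SessionKey &k) {
        std::unique_lock<std::shared_mutex> l(lock);
        session_key = k;
      }
    };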
pipe: only read AuthSessionHandler under pipe_lock
session_security, the AuthSessionHandler for a Pipe, is deleted and
recreated while the pipe_lock is held. read_message() is called
without pipe_lock held, and examines session_security. To make this
safe, make session_security a shared_ptr and take a reference to it
while the pipe_lock is still held, and use that shared_ptr in
read_message().
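A minimal sketch of the pattern (standard library types instead of Ceph's
Mutex; names are illustrative):

    #include <memory>
    #include <mutex>

    struct AuthSessionHandler { /* signature checks, etc. */ };

    struct PipeSketch {
      std::mutex pipe_lock;
      std::shared_ptr<AuthSessionHandler> session_security;  // replaced under pipe_lock

      void read_message() {
        std::shared_ptr<AuthSessionHandler> auth_handler;
        {
          std::lock_guard<std::mutex> l(pipe_lock);
          auth_handler = session_security;   // take a reference while still locked
        }
        // pipe_lock is released here; even if session_security is reset
        // concurrently, auth_handler keeps the old handler alive for this read.
        if (auth_handler) {
          // ... verify the message with auth_handler ...
        }
      }
    };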
Sage Weil [Fri, 3 Jan 2014 20:51:15 +0000 (12:51 -0800)]
osdc/ObjectCacher: back off less during flush
In cce990efc8f2a58c8d0fa11c234ddf2242b1b856 we added a limit to avoid
holding the lock for too long. However, if we back off, we currently
wait for a full second, which is probably a bit much--we really just want
to give other threads a chance.
Sage Weil [Tue, 1 Oct 2013 16:28:29 +0000 (09:28 -0700)]
osdc/ObjectCacher: limit writeback IOs generated while holding lock
While analyzing a log from Mike Dawson I saw a long stall while librbd's
objectcacher was starting lots (many hundreds) of IOs. Limit the amount of
time we spend doing this at a time to allow IO replies to be processed so
that the cache remains responsive.
I'm not sure this warrants a tunable (which we would need to add for both
libcephfs and librbd).
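Together with the previous entry, the idea is roughly the following (a
sketch only; the time budget, container, and names are illustrative, not
the actual ObjectCacher code):

    #include <chrono>
    #include <deque>
    #include <functional>

    void flush_some_dirty_sketch(std::deque<std::function<void()>> &dirty_writes,
                                 std::chrono::milliseconds budget) {
      auto deadline = std::chrono::steady_clock::now() + budget;
      while (!dirty_writes.empty()) {
        dirty_writes.front()();          // issue one writeback IO
        dirty_writes.pop_front();
        if (std::chrono::steady_clock::now() >= deadline)
          break;   // yield briefly (not a full second) so IO replies can be processed
      }
    }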
Sage Weil [Tue, 8 Apr 2014 17:52:43 +0000 (10:52 -0700)]
os/FileStore: reset journal state on umount
We observed a sequence like:
- replay journal
- sets JournalingObjectStore applied_op_seq
- umount
- mount
- initiate commit with previous applied_op_seq
- replay journal
- commit finishes
- on replay commit, we fail assert op > committed_seq
Although strictly speaking the assert failure is harmless here, in
general we should not let state leak through from a previous mount into
this mount, or else assertions become harder to reason about.
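The shape of the fix is just to clear the replay bookkeeping when
unmounting; a rough sketch (field names are illustrative, not the actual
JournalingObjectStore members):

    #include <cstdint>

    struct JournalingStateSketch {
      uint64_t applied_op_seq = 0;
      uint64_t committing_seq = 0;
      uint64_t committed_seq = 0;

      void umount() {
        // ... flush outstanding ops and stop the journal ...
        applied_op_seq = 0;    // forget this mount's sequence numbers so the next
        committing_seq = 0;    // mount starts from what it replays, not from stale
        committed_seq = 0;     // state left behind here
      }
    };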
Sage Weil [Sat, 5 Apr 2014 23:58:55 +0000 (16:58 -0700)]
mon: wait for quorum for MMonGetVersion
We should not respond to checks for map versions when we are in the
probing or electing states or else clients will get incorrect results when
they ask what the latest map version is.
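Conceptually the handler now defers the reply instead of answering from an
uncertain state; a minimal sketch (not the actual Monitor code):

    #include <vector>

    struct MonSketch {
      enum State { PROBING, ELECTING, LEADER, PEON } state = PROBING;
      std::vector<int> waiting_for_quorum;   // stand-in for deferred MMonGetVersion msgs

      void handle_get_version(int msg) {
        if (state == PROBING || state == ELECTING) {
          waiting_for_quorum.push_back(msg); // answer once a quorum has formed
          return;
        }
        // ... in quorum: reply with the latest committed map version ...
      }
    };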
Samuel Just [Tue, 26 Nov 2013 21:20:21 +0000 (13:20 -0800)]
PG: retry GetLog() each time we get a notify in Incomplete
If for some reason there are no up OSDs in the history which
happen to have usable copies of the pg, it's possible that
there is a usable copy elsewhere on the cluster which will
become known to the primary if it waits.
Fixes: #6909 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 964c8e978f86713e37a13b4884a6c0b9b41b5bae)
Sage Weil [Mon, 17 Mar 2014 23:21:17 +0000 (16:21 -0700)]
mon/Paxos: commit only after entire quorum acks
If a subset of the quorum accepts the proposal and we commit, we will start
sharing the new state. However, the mon that didn't yet reply with the
accept may still be sharing the old and stale value.
The simplest way to prevent this is not to commit until the entire quorum
replies. In the general case, there are no failures and this is just fine.
In the failure case, we will call a new election and have a smaller quorum
of (live) nodes and will recommit the same value.
A more performant solution would be to have a separate message invalidate
the old state and commit once we have all invalidations and a majority of
accepts. This will lower latency a bit in the non-failure case, but not
change the failure case significantly. Later!
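In effect the accept-counting condition changes from a majority to the full
quorum; a sketch of that condition (illustrative, not the actual Paxos
implementation):

    #include <set>

    struct PaxosRoundSketch {
      std::set<int> quorum;    // ranks of the monitors in the current quorum
      std::set<int> accepted;  // ranks that have acked the pending proposal

      // returns true when it is safe to commit and start sharing the new state
      bool handle_accept(int rank) {
        accepted.insert(rank);
        return accepted == quorum;   // previously: accepted.size() > quorum.size() / 2
      }
    };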
Fixes: #7736 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit fa1d957c115a440e162dba1b1002bc41fc1eac43)
Samuel Just [Thu, 13 Mar 2014 21:04:19 +0000 (14:04 -0700)]
PrioritizedQueue: cap costs at max_tokens_per_subqueue
Otherwise, you can get a recovery op in the queue which has a cost
higher than the max token value. It won't get serviced until all other
queues also do not have enough tokens and higher priority queues are
empty.
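The cap itself is essentially a one-liner; roughly (illustrative names):

    #include <algorithm>
    #include <cstdint>

    // an op's cost never exceeds the subqueue's full token budget, so it will
    // eventually be serviced once the subqueue refills
    uint64_t effective_cost_sketch(uint64_t cost, uint64_t max_tokens_per_subqueue) {
      return std::min(cost, max_tokens_per_subqueue);
    }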
Dan Mick [Thu, 3 Apr 2014 20:59:59 +0000 (13:59 -0700)]
Fix byte-order dependency in calculation of initial challenge
Fixes: #7977 Signed-off-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4dc62669ecd679bc4d0ef2b996b2f0b45b8b4dc7)
Josh Durgin [Thu, 21 Nov 2013 02:35:34 +0000 (18:35 -0800)]
test: use older names for module setup/teardown
setUp and tearDown require nosetests 0.11, but 0.10.4 is the latest on
centos. Rename to use the older aliases, which still work with newer
versions of nosetests as well.
Fixes: #6368 Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit f753d56a9edba6ce441520ac9b52b93bd8f1b5b4)
Samuel Just [Sun, 3 Nov 2013 19:06:10 +0000 (11:06 -0800)]
OSD: don't clear peering_wait_for_split in advance_map()
I really don't know why I added this... Ops can be discarded from the
waiting_for_pg queue if we aren't primary simply because there must have
been an exchange of peering events before subops will be sent within a
particular epoch. Thus, any events in the waiting_for_pg queue must be
client ops which should only be seen by the primary. Peering events, on
the other hand, should only be discarded if we are in a new interval,
and that check might as well be performed in the peering wq.
Fixes: #6681 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 9ab513334c7ff9544bac07bd420c6d5d200cf535)
Split may cause holes such that head != tail and yet
log.empty().
Fixes: #6722 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit c6826c1e8a301b2306530c6e5d0f4a3160c4e691)
Samuel Just [Wed, 6 Nov 2013 01:47:48 +0000 (17:47 -0800)]
PGLog::rewind_divergent_log: log may not contain newhead
Due to split, there may be a hole at newhead.
Fixes: #6722 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit f4648bc6fec89c870e0c47b38b2f13496742b10f)
Sage Weil [Fri, 28 Mar 2014 04:09:13 +0000 (21:09 -0700)]
msgr: add KEEPALIVE2 feature
This is similar to KEEPALIVE, except a timestamp is also exchanged. It is
sent with the KEEPALIVE, and then returned with the ACK. The last
received stamp is stored in the Connection so that it can be queried for
liveness. Since all of the users of keepalive are already regularly
triggering a keepalive, they can check the liveness at the same time.
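A rough sketch of the exchange and the liveness query it enables (standard
library clocks instead of utime_t; names are illustrative):

    #include <chrono>
    #include <mutex>

    struct ConnectionSketch {
      std::mutex lock;
      std::chrono::steady_clock::time_point last_keepalive_ack{};

      // sender: KEEPALIVE2 carries the current timestamp
      std::chrono::steady_clock::time_point send_keepalive() {
        return std::chrono::steady_clock::now();
      }

      // receiver of the ACK: the peer echoed our stamp back; remember it
      void handle_keepalive_ack(std::chrono::steady_clock::time_point stamp) {
        std::lock_guard<std::mutex> l(lock);
        last_keepalive_ack = stamp;
      }

      // callers that already send periodic keepalives can piggyback this check
      bool is_alive_since(std::chrono::steady_clock::time_point t) {
        std::lock_guard<std::mutex> l(lock);
        return last_keepalive_ack >= t;
      }
    };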
Sage Weil [Fri, 28 Mar 2014 19:34:07 +0000 (12:34 -0700)]
osdc/ObjectCacher: call read completion even when no target buffer
If we do not assemble a target bl, we still want to return a valid return
code with the number of bytes read-ahead so that the C_RetryRead completion
will see this as a finish and call the caller's provided Context.
Samuel Just [Wed, 30 Oct 2013 23:54:39 +0000 (16:54 -0700)]
PGLog: remove obsolete assert in merge_log
This assert assumes that if olog.head != log.head, olog contains
a log entry at log.head, which may not be true since pg splitting
might have left the log with arbitrary holes.
Samuel Just [Wed, 27 Nov 2013 03:17:59 +0000 (19:17 -0800)]
PG: don't query unfound on empty pgs
When the replica responds, it responds with a notify
rather than a log, which the primary then ignores since
it is already in the peer_info map. Rather than fix that
we'll simply not send queries to peers we already know to
have no unfound objects.
Fixes: #6910 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 838b6c8387087543ce50837277f7f6b52ae87d00)
Yehuda Sadeh [Wed, 19 Feb 2014 16:11:56 +0000 (08:11 -0800)]
rgw: reset objv tracker on bucket recreation
Fixes: #6951
If we cannot create a new bucket (as it already existed), we need to
read the old bucket's info. However, this was failing as we were holding
the objv tracker that we created for the bucket creation. We need to
clear it, as subsequent read using it will fail.
Samuel Just [Wed, 6 Nov 2013 22:33:03 +0000 (14:33 -0800)]
ReplicatedPG: don't skip missing if sentries is empty on pgls
Formerly, if sentries was empty, we skipped missing. In general,
we need to continue adding items from missing until we get
to next (returned from collection_list_partial) to avoid
missing any objects.
Fixes: #6633 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit c7a30b881151e08b37339bb025789921e7115288)
Sage Weil [Sat, 15 Feb 2014 16:59:51 +0000 (08:59 -0800)]
mon/Elector: bootstrap on timeout
Currently if an election times out we call a new
election. If we have never joined a quorum, bootstrap
instead. This is heavier weight, but captures the case
where, during bootstrap:
- a and b have learned each others' addresses
- everybody calls an election
- a and b form a quorum
- c loops trying to call an election, but is ignored
because a and b don't see its address in the monmap
See logs:
ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-02-14_13:50:04-ceph-deploy-wip-7212-sage-b-testing-basic-plana/83194
Sage Weil [Fri, 14 Feb 2014 19:25:52 +0000 (11:25 -0800)]
mon: tell MonmapMonitor first about winning an election
It is important in the bootstrap case that the very first paxos round
also codify the contents of the monmap itself in order to avoid any manner
of confusing scenarios where subsequent elections are called and people
try to recover and modify paxos without agreeing on who the quorum
participants are.
Sage Weil [Fri, 14 Feb 2014 19:13:26 +0000 (11:13 -0800)]
mon: only learn peer addresses when monmap == 0
It is only safe to dynamically update the address for a peer mon in our
monmap if we are in the midst of the initial quorum formation (i.e.,
monmap.epoch == 0). If it is a later epoch, we have formed our initial
quorum and any and all monmap changes need to be agreed upon by the quorum
and committed via paxos.
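The guard amounts to a single epoch check before mutating the local monmap;
a sketch (illustrative types, not the actual MonMap code):

    #include <map>
    #include <string>

    struct MonMapSketch {
      unsigned epoch = 0;                           // 0 = initial quorum not yet formed
      std::map<std::string, std::string> mon_addr;  // mon name -> address
    };

    void maybe_learn_peer_addr_sketch(MonMapSketch &monmap,
                                      const std::string &name,
                                      const std::string &addr) {
      if (monmap.epoch > 0)
        return;   // initial quorum exists; monmap changes must go through paxos
      monmap.mon_addr[name] = addr;
    }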
Danny Al-Gaaf [Wed, 12 Mar 2014 21:56:44 +0000 (22:56 +0100)]
RGWListBucketMultiparts: init max_uploads/default_max with 0
CID 717377 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member "max_uploads" is not initialized
in this constructor nor in any functions that it calls.
4. uninit_member: Non-static class member "default_max" is not initialized
in this constructor nor in any functions that it calls.
Ilya Dryomov [Wed, 29 Jan 2014 14:12:01 +0000 (16:12 +0200)]
rbd: check for watchers before trimming an image on 'rbd rm'
Check for watchers before trimming image data to try to avoid getting
into the following situation:
- user does 'rbd rm' on a mapped image with an fs mounted from it
- 'rbd rm' trims (removes) all image data, only header is left
- 'rbd rm' tries to remove a header and fails because krbd has a
watcher registered on the header
- at this point image cannot be unmapped because of the mounted fs
- fs cannot be unmounted because all its data and metadata is gone
Unfortunately, this fix doesn't make the above scenario impossible (the
required atomicity isn't there), but it's a big improvement over the
status quo.
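The ordering change, in outline (a sketch with illustrative callbacks, not
the librbd code; as noted above, the check is advisory rather than atomic):

    #include <functional>
    #include <string>
    #include <vector>

    struct Watcher { unsigned long long cookie; std::string addr; };

    bool remove_image_sketch(const std::function<std::vector<Watcher>()> &list_header_watchers,
                             const std::function<void()> &trim_image_data,
                             const std::function<void()> &remove_header) {
      if (!list_header_watchers().empty())
        return false;        // image still mapped or open somewhere: refuse up front
      trim_image_data();     // only now remove the data objects
      remove_header();
      return true;
    }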
Loic Dachary [Sat, 15 Feb 2014 10:43:13 +0000 (11:43 +0100)]
common: ping existing admin socket before unlink
When a daemon initializes it tries to create an admin socket and
unconditionally unlinks any pre-existing file. If such a file is in
use, this causes the existing daemon to lose its admin socket.
AdminSocketClient::ping is implemented to probe an existing socket
using the "0" message. The AdminSocket::bind_and_listen function is
modified to call ping() when it finds an existing file. It unlinks the
file only if the ping fails.
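The resulting bind logic looks roughly like this (a sketch with
illustrative helpers, not the actual AdminSocket code):

    #include <functional>
    #include <string>
    #include <unistd.h>

    bool claim_socket_path_sketch(const std::string &path,
                                  const std::function<bool(const std::string &)> &exists,
                                  const std::function<bool(const std::string &)> &ping) {
      if (exists(path)) {
        if (ping(path))              // send the "0" command to whoever is listening
          return false;              // a live daemon answered: leave its socket alone
        ::unlink(path.c_str());      // no answer: the file is stale, remove it
      }
      // ... bind() and listen() on path as before ...
      return true;
    }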
Fixes: #7188 (http://tracker.ceph.com/issues/7188)
Backport: emperor, dumpling Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 45600789f1ca399dddc5870254e5db883fb29b38)
mon: OSDMonitor: don't crash if formatter is invalid during osd crush dump
Code would assume a formatter would always be defined. If a 'plain'
formatter or even an invalid formatter were to be supplied, the monitor
would crash and burn in poor style.
Samuel Just [Tue, 15 Oct 2013 20:11:29 +0000 (13:11 -0700)]
OSD: ping tphandle during pg removal
Fixes: #6528 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit c658258d9e2f590054a30c0dee14a579a51bda8c)
mon: OSDMonitor: allow (un)setting 'hashpspool' flag via 'osd pool set'
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 1c2886964a0c005545abab0cf8feae7e06ac02a8)
Conflicts:
src/mon/MonCommands.h
src/mon/OSDMonitor.cc
mon: ceph hashpspool false clears the flag
instead of toggling it. Signed-off-by: Loic Dachary <loic@dachary.org> Reviewed-by: Christophe Courtaut <christophe.courtaut@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 589e2fa485b94244c79079f249428d4d545fca18)
Some of the infrastructure this command requires was not present in
Dumpling; replace it with single-use code. Signed-off-by: Greg Farnum <greg@inktank.com>
Fixes: #7511 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Sam Just <sam.just@inktank.com>
(cherry picked from commit 70d23b9a0ad9af5ca35a627a7f93c7e610e17549) Reviewed-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Tue, 11 Feb 2014 21:34:39 +0000 (13:34 -0800)]
OSD: create a helper for handling OSDMap subscriptions, and clean them up
We've had some trouble with not clearing out subscription requests and
overloading the monitors (though only because of other bugs). Write a
helper for handling subscription requests that we can use to centralize
safety logic. Clear out the subscription whenever we get a map that covers
it; if there are more maps available than we received, we will issue another
subscription request based on "m->newest_map" at the end of handle_osd_map().
Notice that the helper will no longer request old maps which we already have,
and that unless forced it will not dispatch multiple subscribe requests
to a single monitor.
Skipping old maps is safe:
1) we only trim old maps when the monitor tells us to,
2) we do not send messages to our peers until we have updated our maps
from the monitor.
That means only old and broken OSDs will send us messages based on maps
in our past, and we can (and should) ignore any directives from them anyway.
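A rough model of what such a helper tracks (a sketch only; the real helper
works in terms of osdmap epochs and MonClient subscriptions, and these
names are illustrative):

    #include <cstdint>

    struct OsdMapSubscriberSketch {
      uint64_t newest_have = 0;      // newest osdmap epoch we hold locally
      uint64_t requested_from = 0;   // epoch of the outstanding request, 0 = none

      bool maybe_subscribe(uint64_t want_from, bool force) {
        if (want_from <= newest_have)
          return false;                      // old maps we already have: skip
        if (!force && requested_from != 0)
          return false;                      // a request is already outstanding
        requested_from = want_from;
        // ... send the subscribe request to the monitor ...
        return true;
      }

      void got_maps_up_to(uint64_t epoch) {
        newest_have = epoch;
        if (requested_from != 0 && requested_from <= epoch)
          requested_from = 0;                // satisfied: clear the subscription
      }
    };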
Greg Farnum [Wed, 12 Feb 2014 19:30:15 +0000 (11:30 -0800)]
OSD: disable the PGStatsAck timeout when we are reconnecting to a monitor
Previously, the timeout counter started as soon as we issued the reopen,
but if the reconnect process itself took a while, we might time out and
issue another reopen just as we get to the point where it's possible to
get work done. Since the mon client has its own reconnect timeouts (that is,
the OSD doesn't need to trigger those), we instead disable our timeouts
while the reconnect is happening, and then turn them back on again starting
from when we get the reconnect callback.
Greg Farnum [Wed, 12 Feb 2014 21:51:48 +0000 (13:51 -0800)]
monc: backoff the timeout period when reconnecting
If the monitors are systematically slowing down, we don't want to spam
them with reconnect attempts every three seconds. Instead, every time
we issue a reconnect, multiply our timeout period by a configurable
factor; when we complete the connection, reduce that multiplier by 50%.
This should let
us respond to monitor load.
Of course, we don't want to do that for initial startup in the case of a
couple down monitors, so don't apply the backoff until we've successfully
connected to a monitor at least once.
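The backoff scheme in outline (a sketch; the actual multiplier is a config
option, here simply doubled):

    #include <algorithm>

    struct ReconnectBackoffSketch {
      double base_timeout = 3.0;      // seconds between reconnect attempts
      double multiplier = 1.0;        // grows while the monitors are struggling
      double max_multiplier = 10.0;   // illustrative cap
      bool ever_connected = false;

      double next_timeout() {         // called each time we issue a reconnect
        double t = base_timeout * multiplier;
        if (ever_connected)           // no backoff during initial startup
          multiplier = std::min(multiplier * 2.0, max_multiplier);
        return t;
      }

      void on_connected() {           // called when the session is established
        ever_connected = true;
        multiplier = std::max(1.0, multiplier * 0.5);   // "reduce by 50%"
      }
    };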
Greg Farnum [Wed, 12 Feb 2014 01:53:56 +0000 (17:53 -0800)]
monc: let users specify a callback when they reopen their monitor session
Then the callback is triggered when a new session is established, and the
daemon can do whatever it likes. There are no guarantees about how long it
might take to trigger, though. In particular we call the provided callback
while not holding our own lock in order to avoid deadlock. This could lead
to some funny ordering from the user's perspective if they call
reopen_session() again before getting the callback, but there's no way around
that, so they just have to use it appropriately.
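The callback handling looks roughly like this (a sketch with standard
library types; not the MonClient API itself):

    #include <functional>
    #include <mutex>
    #include <utility>

    struct MonClientSketch {
      std::mutex monc_lock;
      std::function<void()> session_established_cb;

      void reopen_session(std::function<void()> cb = {}) {
        std::unique_lock<std::mutex> l(monc_lock);
        session_established_cb = std::move(cb);   // latest caller wins
        // ... tear down the current session and pick a new monitor ...
      }

      void _finish_session_established() {
        std::function<void()> cb;
        {
          std::unique_lock<std::mutex> l(monc_lock);
          cb.swap(session_established_cb);
        }
        if (cb)
          cb();   // invoked without monc_lock held, to avoid deadlock
      }
    };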
Yehuda Sadeh [Mon, 6 Jan 2014 20:53:58 +0000 (12:53 -0800)]
radosgw-admin: fix object policy read op
Fixes: #7083
This was broken when we fixed #6940. We use the same function to both
read the bucket policy and the object policy. However, each needed to be
treated differently. Restore old behavior for objects.
Sage Weil [Mon, 7 Oct 2013 12:22:20 +0000 (05:22 -0700)]
os/FileStore: fix ENOENT error code for getattrs()
In commit dc0dfb9e01d593afdd430ca776cf4da2c2240a20 the omap xattrs code
moved up a block and r was no longer local to the block. Translate
ENOENT -> 0 to compensate.
Fix the same error in _rmattrs().
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 6da4b91c07878e07f23eee563cf1d2422f348c2f)
Alfredo Deza [Wed, 12 Feb 2014 21:43:59 +0000 (16:43 -0500)]
add support for absence of PATH
Note that this commit is actually bisecting the changes from
Loic Dachary that touch ceph-disk only (ad515bf). As that changeset
also touches other files it causes conflicts that are not resolvable
for backporting it to dumpling.
Sage Weil [Tue, 10 Sep 2013 05:27:23 +0000 (22:27 -0700)]
ceph-disk: make initial journal files 0 bytes
The ceph-osd will resize journal files up and properly fallocate() them
so that the blocks are preallocated and (hopefully) contiguous. We
don't need to do it here too, and getting fallocate() to work from
python is a pain in the butt.
Josh Durgin [Wed, 29 Jan 2014 01:26:58 +0000 (17:26 -0800)]
ceph-disk: run the right executables from udev
When run by the udev rules, PATH is not defined. Thus,
ceph-disk-activate relies on its which() function to locate the
correct executable. The which() function used os.defpath if none was
set, and this worked for anything using it.
ad6b4b4b08b6ef7ae8086f2be3a9ef521adaa88c added a new default value to
PATH, so only /usr/bin was checked by callers that did not use
which(). This resulted in the mount command not being found when
ceph-disk-activate was run by udev, and thus osds failing to start
after being prepared by ceph-deploy.
Make ceph-disk consistently use the existing helpers (command() and
command_check_call()) that use which(), so lack of PATH does not
matter. Simplify _check_output() to use command(),
another wrapper around subprocess.Popen.
Loic Dachary [Wed, 1 Jan 2014 21:11:30 +0000 (22:11 +0100)]
ceph-disk: create the data directory if it does not exist
Instead of failing if the OSD data directory does not exist, create
it. Only do so if the data directory is not enforced to be a device via
the use of the --data-dev flag. The directory is not recursively created.