If we found a prefix, calculate a string greater than it so that the next
request can skip past it. This is still not the most efficient way to
do it. It'll be better to push it down to the objclass, but that'll
require a much bigger change.
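A minimal Python sketch of the idea, not the rgw code itself: given the
prefix that was found, compute a marker that sorts after every key sharing
that prefix, so the next listing request can resume past all of them. The
name marker_after_prefix is hypothetical, and keys are assumed to compare
bytewise.

```python
def marker_after_prefix(prefix: bytes) -> bytes:
    """Return the smallest byte string greater than every key that starts with prefix.

    Hypothetical helper: bump the last byte that is not 0xff and drop
    everything after it.
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        raise ValueError("no byte string sorts after every key with this prefix")
    return trimmed[:-1] + bytes([trimmed[-1] + 1])
```

Every key under the prefix sorts below the returned marker, so passing it as
the next request's marker skips the whole prefix in one step.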
rgw: don't try to wait for pending if list is empty
Fixes: #8846
Backport: firefly, dumpling
This was broken at ea68b9372319fd0bab40856db26528d36359102e. We ended
up calling wait_pending_front() when pending list was empty.
This commit also moves the need_to_wait check to a different place,
where we actually throttle (and not just drain completed IOs).
Modified qemu-iotests workunit script to check for versions
that use the latest qemu (currently only Trusty). Limit the
tests to those that are applicable to rbd.
Josh Durgin [Mon, 31 Mar 2014 21:53:31 +0000 (14:53 -0700)]
librbd: skip zeroes when copying an image
This is the simple coarse-grained solution, but it works well in
common cases like a small base image resized with a bunch of empty
space at the end. Finer-grained sparseness can be copied by using rbd
{export,import}-diff.
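A rough Python sketch of the coarse-grained approach, under assumptions: the
copy proceeds in fixed-size chunks, a chunk that is entirely zero is simply
not written, and the destination is already sized and reads back as zeroes
where nothing was written. sparse_copy and its chunk_size default are
hypothetical, not librbd names.

```python
def sparse_copy(src, dst, chunk_size=4 * 1024 * 1024):
    """Copy src to dst chunk by chunk, skipping chunks that are all zero bytes.

    src and dst are seekable binary file-like objects.
    Returns the number of bytes actually written.
    """
    written = 0
    offset = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        if chunk.count(0) != len(chunk):  # at least one non-zero byte
            dst.seek(offset)
            dst.write(chunk)
            written += len(chunk)
        offset += len(chunk)
    return written
```

A chunk with even one non-zero byte is copied whole, which is why this only
helps with large empty regions and not fine-grained sparseness.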
Yehuda Sadeh [Mon, 16 Jun 2014 18:48:24 +0000 (11:48 -0700)]
rgw: allocate enough space for bucket instance id
Fixes: #8608
Backport: dumpling, firefly
Bucket instance id is a concatenation of zone name, rados instance id,
and a running counter. We need to allocate enough space to account for the
zone name length.
Sage Weil [Thu, 8 May 2014 15:52:51 +0000 (08:52 -0700)]
ceph-disk: partprobe before settle when preparing dev
Two users have reported this fixes a problem with using --dmcrypt.
Fixes: #6966
Tested-by: Eric Eastman <eric0e@aol.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0f196265f049d432e399197a3af3f90d2e916275)
Sage Weil [Tue, 17 Jun 2014 17:47:24 +0000 (10:47 -0700)]
osd: introduce simple sleep during scrub
This option is similar to osd_snap_trim_sleep: simply inject an optional
sleep in the thread that is doing scrub work. This is a very kludgey and
coarse knob for limiting the impact of scrub on the cluster, but can help
until we have a more robust and elegant solution.
Only sleep if we are in the NEW_CHUNK state to avoid delaying processing of
an in-progress chunk. In this state nothing is blocked on anything.
Conveniently, chunky_scrub() requeues itself for each new chunk.
John Spray [Tue, 20 May 2014 15:50:18 +0000 (16:50 +0100)]
mon: pool set <pool> crush_ruleset must not use rule_exists
Implement CrushWrapper::ruleset_exists that iterates over the existing
rulesets to find the one matching the ruleset argument.
ceph osd pool set <pool> crush_ruleset must not use
CrushWrapper::rule_exists, which checks for a *rule* existing, whereas
the value being set is a *ruleset*.
(cherry picked from commit fb504baed98d57dca8ec141bcc3fd021f99d82b0)
A test via ceph osd pool set data crush_ruleset verifies the ruleset
argument is accepted.
http://tracker.ceph.com/issues/8599 fixes: #8599
Backport: firefly, emperor, dumpling
Signed-off-by: John Spray <john.spray@inktank.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit d02d46e25080d5f7bb8ddd4874d9019a078b816b)
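The rule/ruleset distinction can be illustrated with a simplified Python
model. This is a sketch, not CrushWrapper: the Rule class and its fields are
hypothetical, and it only assumes that each rule has its own id and also
names the ruleset it belongs to.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    rule_id: int  # id of the rule itself
    ruleset: int  # ruleset the rule belongs to


def rule_exists(rules, rule_id):
    """True if some rule has this rule id."""
    return any(r.rule_id == rule_id for r in rules)


def ruleset_exists(rules, ruleset):
    """True if some rule belongs to this ruleset (iterates over all rules)."""
    return any(r.ruleset == ruleset for r in rules)
```

With rules [Rule(0, 0), Rule(1, 3)], ruleset 3 is valid for a pool even
though there is no rule with id 3, so checking rule_exists would wrongly
reject it.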
Yehuda Sadeh [Tue, 6 May 2014 18:06:29 +0000 (11:06 -0700)]
rgw: cut short object read if a chunk returns error
Fixes: #8289
Backport: firefly, dumpling
When reading an object, if we hit an error when trying to read one of
the rados objects then we should just stop. Otherwise we're just going
to continue reading the rest of the object, and since it can't be sent
back to the client (as we have a hole in the middle), we end up
accumulating everything in memory.
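A small Python sketch of the control flow, assuming each chunk read returns
an (error_code, data) pair with a negative code on failure. read_object and
the reader callables are hypothetical stand-ins for the rados reads.

```python
def read_object(chunk_readers):
    """Read chunks in order and stop at the first error.

    chunk_readers: callables returning (error_code, data), error_code < 0 on failure.
    Returns (error_code, chunks_read_before_the_error).
    """
    chunks = []
    for read_chunk in chunk_readers:
        ret, data = read_chunk()
        if ret < 0:
            # Bail out: anything read after a hole could not be sent anyway.
            return ret, chunks
        chunks.append(data)
    return 0, chunks
```

Readers after the failing one are never invoked, so nothing further is
buffered in memory.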
Sage Weil [Tue, 20 May 2014 17:46:34 +0000 (10:46 -0700)]
OSD::handle_pg_query: on dne pg, send lb=hobject_t() if deleting
We will set lb=hobject_t() if we resurrect the pg. In that case,
we need to have sent that to the primary beforehand. If we
finish the removal before the pg is recreated, we'll just end
up backfilling it, which is ok since the pg doesn't exist anyway.
Calling _finish_hunting() clears out the bool hunting flag, which means we
don't retry by connecting to another mon periodically. Instead, we send
keepalives every 10s. But, since we aren't yet in state HAVE_SESSION, we
don't check that the keepalives are getting responses. This means that an
ill-timed connection reset (say, after we get a MonMap, but before we
finish authenticating) can drop the monc into a black hole that does not
retry.
Instead, we should *only* call _finish_hunting() when we complete the
authentication handshake.
Fixes: #8278
Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 77a6f0aefebebf057f02bfb95c088a30ed93c53f)
Greg Farnum [Thu, 15 May 2014 23:50:43 +0000 (16:50 -0700)]
OSD: fix an osdmap_subscribe interface misuse
When calling osdmap_subscribe, you have to pass an epoch newer than the
current map's. _maybe_boot() was not doing this correctly -- we would
fail a check for being *in* the monitor's existing map range, and then
pass along the map prior to the monitor's range. But if we were exactly
one behind, that value would be our current epoch, and the request would
get dropped. So instead, make sure we are not *in contact* with the monitor's
existing map range.
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 290ac818696414758978b78517b137c226110bb4)
Samuel Just [Wed, 16 Oct 2013 17:07:37 +0000 (10:07 -0700)]
OSD: check for splitting when processing recover/backfill reservations
Fixes: 6565
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 15ec5332ba4154930a0447e2bcf1acec02691e97)
which is actually the next backfill to complete. We want to update
last_backfill to the largest completed backfill instead.
We use the pending_backfill_updates mapping to identify the most
recently completed backfill. Due to the previous patch, deletes
will also be included in that mapping.
Yehuda Sadeh [Wed, 27 Nov 2013 21:34:00 +0000 (13:34 -0800)]
rgw: don't error out on empty owner when setting acls
Fixes: #6892
Backport: dumpling, emperor
s3cmd specifies an empty owner field when trying to set acls on an object
or bucket. We errored out as it didn't match the current owner name, but
with this change we ignore it.
Sage Weil [Fri, 18 Apr 2014 20:50:11 +0000 (13:50 -0700)]
osd: throttle snap trimming with simple delay
This is not particularly smart, but it is *a* knob that lets you make
the snap trimmer slow down. It's a float and a simple delay, so it is
adjustable at runtime. Default is 0 (no change in behavior).
Sage Weil [Tue, 1 Apr 2014 23:01:28 +0000 (16:01 -0700)]
PG: only complete replicas should count toward min_size
Backport: emperor,dumpling,cuttlefish
Fixes: #7805
Signed-off-by: Samuel Just <sam.just@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0d5d3d1a30685e7c47173b974caa12076c43a9c4)
Yehuda Sadeh [Sat, 3 May 2014 00:06:05 +0000 (17:06 -0700)]
rgw: don't allow multiple writers to same multiobject part
Fixes: #8269
A client might need to retry a multipart part write. The original thread
might race with the new one, trying to clean up after it, clobbering the
part's data.
The fix is to detect whether an original part already existed, and if so
use a different part name for it.
auth: add rwlock to AuthClientHandler to prevent races
For cephx, build_authorizer reads a bunch of state (especially the
current session_key) which can be updated by the MonClient. With no
locks held, Pipe::connect() calls SimpleMessenger::get_authorizer()
which ends up calling RadosClient::get_authorizer() and then
AuthClientHandler::build_authorizer(). This unsafe usage can lead to
crashes like:
Program terminated with signal 11, Segmentation fault.
0x00007fa0d2ddb7cb in ceph::buffer::ptr::release (this=0x7f987a5e3070) at common/buffer.cc:370
370 common/buffer.cc: No such file or directory.
in common/buffer.cc
(gdb) bt
0x00007fa0d2ddb7cb in ceph::buffer::ptr::release (this=0x7f987a5e3070) at common/buffer.cc:370
0x00007fa0d2ddec00 in ~ptr (this=0x7f989c03b830) at ./include/buffer.h:171
ceph::buffer::list::rebuild (this=0x7f989c03b830) at common/buffer.cc:817
0x00007fa0d2ddecb9 in ceph::buffer::list::c_str (this=0x7f989c03b830) at common/buffer.cc:1045
0x00007fa0d2ea4dc2 in Pipe::connect (this=0x7fa0c4307340) at msg/Pipe.cc:907
0x00007fa0d2ea7d73 in Pipe::writer (this=0x7fa0c4307340) at msg/Pipe.cc:1518
0x00007fa0d2eb44dd in Pipe::Writer::entry (this=<value optimized out>) at msg/Pipe.h:59
0x00007fa0e0f5f9d1 in start_thread (arg=0x7f987a5e4700) at pthread_create.c:301
0x00007fa0de560b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
Fix this by adding an rwlock to AuthClientHandler. A simpler fix would
be to move RadosClient::get_authorizer() into the MonClient() under
the MonClient lock, but this would not catch all uses of other
Authorizer, e.g. for verify_authorizer() and it would serialize
independent connection attempts.
This mainly matters for cephx, but none and unknown can have the
global_id reset as well.
pipe: only read AuthSessionHandler under pipe_lock
session_security, the AuthSessionHandler for a Pipe, is deleted and
recreated while the pipe_lock is held. read_message() is called
without pipe_lock held, and examines session_security. To make this
safe, make session_security a shared_ptr and take a reference to it
while the pipe_lock is still held, and use that shared_ptr in
read_message().
Sage Weil [Fri, 3 Jan 2014 20:51:15 +0000 (12:51 -0800)]
osdc/ObjectCacher: back off less during flush
In cce990efc8f2a58c8d0fa11c234ddf2242b1b856 we added a limit to avoid
holding the lock for too long. However, if we back off, we currently
wait for a full second, which is probably a bit much--we really just want
to give other threads a chance.
Sage Weil [Tue, 1 Oct 2013 16:28:29 +0000 (09:28 -0700)]
osdc/ObjectCacher: limit writeback IOs generated while holding lock
While analyzing a log from Mike Dawson I saw a long stall while librbd's
objectcacher was starting lots (many hundreds) of IOs. Limit the amount of
time we spend doing this at a time to allow IO replies to be processed so
that the cache remains responsive.
I'm not sure this warrants a tunable (which we would need to add for both
libcephfs and librbd).
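A hedged Python sketch of bounding the work done per lock hold. It assumes
the limit is a simple count of IOs started per pass; flush_batch, max_ios
and the start_io callback are hypothetical names, not ObjectCacher's.

```python
from collections import deque


def flush_batch(dirty, start_io, max_ios=10):
    """Start at most max_ios writebacks, then return so the lock can be dropped.

    dirty: deque of dirty buffers; start_io: callable that issues one writeback.
    Returns True if dirty buffers remain and the caller should come back later.
    """
    started = 0
    while dirty and started < max_ios:
        start_io(dirty.popleft())
        started += 1
    return bool(dirty)
```

The caller would release the lock between batches so that IO replies can be
processed before the next batch is started.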
Sage Weil [Tue, 8 Apr 2014 17:52:43 +0000 (10:52 -0700)]
os/FileStore: reset journal state on umount
We observed a sequence like:
- replay journal
- sets JournalingObjectStore applied_op_seq
- umount
- mount
- initiate commit with previous applied_op_seq
- replay journal
- commit finishes
- on replay commit, we fail assert op > committed_seq
Although strictly speaking the assert failure is harmless here, in general
we should not let state leak through from a previous mount into this
mount or else assertions are in general more difficult to reason about.
Sage Weil [Sat, 5 Apr 2014 23:58:55 +0000 (16:58 -0700)]
mon: wait for quorum for MMonGetVersion
We should not respond to checks for map versions when we are in the
probing or electing states or else clients will get incorrect results when
they ask what the latest map version is.
Samuel Just [Tue, 26 Nov 2013 21:20:21 +0000 (13:20 -0800)]
PG: retry GetLog() each time we get a notify in Incomplete
If for some reason there are no up OSDs in the history which
happen to have usable copies of the pg, it's possible that
there is a usable copy elsewhere on the cluster which will
become known to the primary if it waits.
Fixes: #6909
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 964c8e978f86713e37a13b4884a6c0b9b41b5bae)
Sage Weil [Mon, 17 Mar 2014 23:21:17 +0000 (16:21 -0700)]
mon/Paxos: commit only after entire quorum acks
If a subset of the quorum accepts the proposal and we commit, we will start
sharing the new state. However, the mon that didn't yet reply with the
accept may still be sharing the old and stale value.
The simplest way to prevent this is not to commit until the entire quorum
replies. In the general case, there are no failures and this is just fine.
In the failure case, we will call a new election and have a smaller quorum
of (live) nodes and will recommit the same value.
A more performant solution would be to have a separate message invalidate
the old state and commit once we have all invalidations and a majority of
accepts. This will lower latency a bit in the non-failure case, but not
change the failure case significantly. Later!
Fixes: #7736
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit fa1d957c115a440e162dba1b1002bc41fc1eac43)
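The commit condition described above reduces to a set comparison. A minimal
Python sketch, assuming mons are identified by name; ready_to_commit is a
hypothetical helper, not a Paxos method.

```python
def ready_to_commit(accepted, quorum):
    """Commit only once every member of the current quorum has accepted."""
    return set(quorum) <= set(accepted)
```

With quorum {a, b, c} and accepts from only a and b, this returns False. If
c has failed, a new election yields quorum {a, b}, and the same accepts are
then sufficient to recommit the value.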
Samuel Just [Thu, 13 Mar 2014 21:04:19 +0000 (14:04 -0700)]
PrioritizedQueue: cap costs at max_tokens_per_subqueue
Otherwise, you can get a recovery op in the queue which has a cost
higher than the max token value. It won't get serviced until all other
queues also do not have enough tokens and higher priority queues are
empty.
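Why the cap matters can be shown with a simplified token-bucket model in
Python. This is a sketch under the assumption that a subqueue never holds
more than max_tokens_per_subqueue tokens; capped_cost and can_service are
hypothetical helpers, not PrioritizedQueue members.

```python
def capped_cost(cost, max_tokens_per_subqueue):
    """Clamp an op's cost so that a full token bucket can always pay for it."""
    return min(cost, max_tokens_per_subqueue)


def can_service(cost, tokens):
    """An op is serviceable once the subqueue holds at least cost tokens."""
    return tokens >= cost
```

With a maximum of 100 tokens, an op of cost 250 can never be paid for from
tokens alone, while the same op capped to 100 is serviceable as soon as the
bucket is full.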
Dan Mick [Thu, 3 Apr 2014 20:59:59 +0000 (13:59 -0700)]
Fix byte-order dependency in calculation of initial challenge
Fixes: #7977
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4dc62669ecd679bc4d0ef2b996b2f0b45b8b4dc7)
Josh Durgin [Thu, 21 Nov 2013 02:35:34 +0000 (18:35 -0800)]
test: use older names for module setup/teardown
setUp and tearDown require nosetests 0.11, but 0.10.4 is the latest on
centos. Rename to use the older aliases, which still work with newer
versions of nosetests as well.
Fixes: #6368
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit f753d56a9edba6ce441520ac9b52b93bd8f1b5b4)
Samuel Just [Sun, 3 Nov 2013 19:06:10 +0000 (11:06 -0800)]
OSD: don't clear peering_wait_for_split in advance_map()
I really don't know why I added this... Ops can be discarded from the
waiting_for_pg queue if we aren't primary simply because there must have
been an exchange of peering events before subops will be sent within a
particular epoch. Thus, any events in the waiting_for_pg queue must be
client ops which should only be seen by the primary. Peering events, on
the other hand, should only be discarded if we are in a new interval,
and that check might as well be performed in the peering wq.
Fixes: #6681
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 9ab513334c7ff9544bac07bd420c6d5d200cf535)
Split may cause holes such that head != tail and yet
log.empty().
Fixes: #6722
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit c6826c1e8a301b2306530c6e5d0f4a3160c4e691)
Samuel Just [Wed, 6 Nov 2013 01:47:48 +0000 (17:47 -0800)]
PGLog::rewind_divergent_log: log may not contain newhead
Due to split, there may be a hole at newhead.
Fixes: #6722
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit f4648bc6fec89c870e0c47b38b2f13496742b10f)
Sage Weil [Fri, 28 Mar 2014 04:09:13 +0000 (21:09 -0700)]
msgr: add KEEPALIVE2 feature
This is similar to KEEPALIVE, except a timestamp is also exchanged. It is
sent with the KEEPALIVE, and then returned with the ACK. The last
received stamp is stored in the Connection so that it can be queried for
liveness. Since all of the users of keepalive are already regularly
triggering a keepalive, they can check the liveness at the same time.
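A minimal Python sketch of the liveness check this enables. The Connection
class here is a hypothetical model, not the messenger's class, and it
assumes a connection with no ack yet is treated as not alive.

```python
class Connection:
    """Hypothetical model of a connection that records keepalive acks."""

    def __init__(self):
        self.last_keepalive_ack = None  # stamp echoed back by the peer

    def handle_keepalive_ack(self, stamp):
        """Store the timestamp returned with the ACK."""
        self.last_keepalive_ack = stamp

    def is_alive(self, now, timeout):
        """True if an ack was seen and its stamp is within timeout of now."""
        if self.last_keepalive_ack is None:
            return False
        return now - self.last_keepalive_ack <= timeout
```

A user that already sends keepalives periodically can call is_alive() in the
same place and reconnect elsewhere when it returns False.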
Sage Weil [Fri, 28 Mar 2014 19:34:07 +0000 (12:34 -0700)]
osdc/ObjectCacher: call read completion even when no target buffer
If we do not assemble a target bl, we still want to return a valid return
code with the number of bytes read-ahead so that the C_RetryRead completion
will see this as a finish and call the caller's provided Context.
Samuel Just [Wed, 30 Oct 2013 23:54:39 +0000 (16:54 -0700)]
PGLog: remove obsolete assert in merge_log
This assert assumes that if olog.head != log.head, olog contains
a log entry at log.head, which may not be true since pg splitting
might have left the log with arbitrary holes.
Samuel Just [Wed, 27 Nov 2013 03:17:59 +0000 (19:17 -0800)]
PG: don't query unfound on empty pgs
When the replica responds, it responds with a notify
rather than a log, which the primary then ignores since
it is already in the peer_info map. Rather than fix that
we'll simply not send queries to peers we already know to
have no unfound objects.
Fixes: #6910
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 838b6c8387087543ce50837277f7f6b52ae87d00)
Yehuda Sadeh [Wed, 19 Feb 2014 16:11:56 +0000 (08:11 -0800)]
rgw: reset objv tracker on bucket recreation
Fixes: #6951
If we cannot create a new bucket (as it already existed), we need to
read the old bucket's info. However, this was failing as we were holding
the objv tracker that we created for the bucket creation. We need to
clear it, as subsequent read using it will fail.
Samuel Just [Wed, 6 Nov 2013 22:33:03 +0000 (14:33 -0800)]
ReplicatedPG: don't skip missing if sentries is empty on pgls
Formerly, if sentries is empty, we skip missing. In general,
we need to continue adding items from missing until we get
to next (returned from collection_list_partial) to avoid
missing any objects.
Fixes: #6633
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit c7a30b881151e08b37339bb025789921e7115288)
Sage Weil [Sat, 15 Feb 2014 16:59:51 +0000 (08:59 -0800)]
mon/Elector: bootstrap on timeout
Currently if an election times out we call a new
election. If we have never joined a quorum, bootstrap
instead. This is heavier weight, but captures the case
where, during bootstrap:
- a and b have learned each others' addresses
- everybody calls an election
- a and b form a quorum
- c loops trying to call an election, but is ignored
because a and b don't see its address in the monmap
See logs:
ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-02-14_13:50:04-ceph-deploy-wip-7212-sage-b-testing-basic-plana/83194
Sage Weil [Fri, 14 Feb 2014 19:25:52 +0000 (11:25 -0800)]
mon: tell MonmapMonitor first about winning an election
It is important in the bootstrap case that the very first paxos round
also codify the contents of the monmap itself in order to avoid any manner
of confusing scenarios where subsequent elections are called and people
try to recover and modify paxos without agreeing on who the quorum
participants are.
Sage Weil [Fri, 14 Feb 2014 19:13:26 +0000 (11:13 -0800)]
mon: only learn peer addresses when monmap == 0
It is only safe to dynamically update the address for a peer mon in our
monmap if we are in the midst of the initial quorum formation (i.e.,
monmap.epoch == 0). If it is a later epoch, we have formed our initial
quorum and any and all monmap changes need to be agreed upon by the quorum
and committed via paxos.
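The rule can be sketched in a few lines of Python. The monmap is modeled as
a plain dict with hypothetical "epoch" and "addrs" keys; this is an
illustration of the guard, not the monitor code.

```python
def maybe_learn_peer_addr(monmap, name, addr):
    """Update a peer's address only while the initial quorum is still forming.

    Returns True if the address was learned, False if the update was refused.
    """
    if monmap["epoch"] != 0:
        return False  # later changes must be agreed by the quorum via paxos
    monmap["addrs"][name] = addr
    return True
```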
Danny Al-Gaaf [Wed, 12 Mar 2014 21:56:44 +0000 (22:56 +0100)]
RGWListBucketMultiparts: init max_uploads/default_max with 0
CID 717377 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member "max_uploads" is not initialized
in this constructor nor in any functions that it calls.
4. uninit_member: Non-static class member "default_max" is not initialized
in this constructor nor in any functions that it calls.
Ilya Dryomov [Wed, 29 Jan 2014 14:12:01 +0000 (16:12 +0200)]
rbd: check for watchers before trimming an image on 'rbd rm'
Check for watchers before trimming image data to try to avoid getting
into the following situation:
- user does 'rbd rm' on a mapped image with an fs mounted from it
- 'rbd rm' trims (removes) all image data, only header is left
- 'rbd rm' tries to remove a header and fails because krbd has a
watcher registered on the header
- at this point image cannot be unmapped because of the mounted fs
- fs cannot be unmounted because all its data and metadata is gone
Unfortunately, this fix doesn't make it impossible to happen (the
required atomicity isn't there), but it's a big improvement over the
status quo.
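The ordering of the check can be sketched in Python. The image is modeled as
a dict with hypothetical "watchers", "objects" and "header" keys, and
ImageBusyError is an illustrative exception, not an rbd error type.

```python
class ImageBusyError(Exception):
    """Raised when an image still has watchers and must not be trimmed."""


def remove_image(image):
    """Refuse to trim data while watchers exist; otherwise trim, then drop the header."""
    if image["watchers"]:
        raise ImageBusyError("image still has watchers")
    # Not atomic: a watcher could still register after the check above.
    image["objects"].clear()  # trim all image data
    image["header"] = None    # finally remove the header
```

The check runs before any data is removed, so a mapped image keeps its data
and header intact when the removal is refused.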