Greg Farnum [Thu, 9 Oct 2014 22:12:19 +0000 (15:12 -0700)]
mds: Locker: fix a NULL deref in _update_cap_fields
The MClientCaps* is allowed to be NULL, so we can't deref it unless
the dirty param is non-zero. So don't do the ahead-of-time lookup;
just call it explicitly in the if block.
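A minimal sketch of the pattern described above, assuming simplified stand-in types (ClientCapsMsg, InodeFields and update_cap_fields are hypothetical and are not the real MDS definitions). The message pointer is only dereferenced inside the branch where the dirty mask is non-zero:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the MDS types; not the real Ceph definitions.
struct ClientCapsMsg {
  std::uint64_t size;
  std::uint64_t get_size() const { return size; }
};

struct InodeFields {
  std::uint64_t size = 0;
};

// Applies client-reported fields to the inode. `m` may be null when `dirty`
// is 0, so the message is only touched after the dirty check.
void update_cap_fields(InodeFields& in, int dirty, const ClientCapsMsg* m) {
  if (dirty == 0)
    return;              // nothing dirtied: m is allowed to be null here
  assert(m != nullptr);  // a non-zero dirty mask implies a message is present
  std::uint64_t size = m->get_size();  // looked up only where it is safe
  if (size > in.size)
    in.size = size;
}
```

Calling update_cap_fields(in, 0, nullptr) is a no-op in this sketch, which is the case the ahead-of-time lookup would have crashed on.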
Jason Dillaman [Sun, 7 Sep 2014 02:59:40 +0000 (22:59 -0400)]
Enforce cache size on read requests
In-flight cache reads were not previously counted against
new cache read requests, which could result in very large
cache usage. This effect is most noticeable when writing
small chunks to a cloned image since each write requires
a full object read from the parent.
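The accounting idea can be sketched as follows. This is an illustration only: ReadBudget and its methods are hypothetical names, not the ObjectCacher interface, and the real cache also has to deal with dirty data and waiters.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical cache-accounting sketch; names are illustrative only.
class ReadBudget {
public:
  explicit ReadBudget(std::size_t max_bytes) : max_bytes_(max_bytes) {}

  // Returns true if the read may start now. In-flight reads count against
  // the limit together with data already resident in the cache.
  bool try_start_read(std::size_t len) {
    if (cached_bytes_ + inflight_bytes_ + len > max_bytes_)
      return false;  // caller should wait and retry later
    inflight_bytes_ += len;
    return true;
  }

  // Called when a read completes: the bytes move from in-flight to cached.
  void finish_read(std::size_t len) {
    assert(len <= inflight_bytes_);
    inflight_bytes_ -= len;
    cached_bytes_ += len;
  }

  // Called when the cache trims clean data.
  void evict(std::size_t len) {
    assert(len <= cached_bytes_);
    cached_bytes_ -= len;
  }

  std::size_t inflight() const { return inflight_bytes_; }

private:
  std::size_t max_bytes_;
  std::size_t cached_bytes_ = 0;
  std::size_t inflight_bytes_ = 0;
};
```

Without the inflight_bytes_ term in try_start_read(), any number of reads could be admitted before the first one completes, which is the unbounded growth described above.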
Locker: accept ctime updates from clients without dirty write caps
The ctime changes any time the inode does. That can happen even without
the file itself having changed, so we'd better accept the update whenever
the auth caps have dirtied, without worrying about the file caps!
Fixes: #9514
Backport: firefly
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: John Spray <john.spray@redhat.com>
(cherry picked from commit 0ea20a668cf859881c49b33d1b6db4e636eda18a)
c70331db13bdbf1f967ece48cdbec28b97c3d19c backported some
CRUSH changes which necessitate backporting this as well, or
we get failures like successfully looking up an OSD with ID -1.
Reviewed-by: Greg Farnum <greg@inktank.com>
Loic Dachary [Sat, 26 Oct 2013 00:11:43 +0000 (02:11 +0200)]
common: rebuild_page_aligned sometimes rebuilds unaligned
rebuild_page_aligned relies on rebuild to create memory that is aligned
according to list::is_page_aligned(). However, when the bufferlist
contains only a single ptr whose size does not satisfy
list::is_n_page_size(), rebuild will not create the expected aligned
bufferlist.
The allocation of the ptr is moved out of rebuild, which is now given the
ptr as an argument. The rebuild_page_aligned function always requires an
aligned ptr, created with buffer::create_page_aligned(_len), for consistency.
demonstrated the problem. It was assumed to be a feature but should have
been identified as a bug. The last line is replaced with
EXPECT_TRUE(bl.is_page_aligned());
Most tests related to is_page_aligned() wrongfully assumed that
bufferptr ptr(2);
is never page aligned. Most of the time it is not, but sometimes it is,
when the pointer address happens to fall on a CEPH_PAGE_SIZE boundary,
which triggered #6614. Non-aligned ptrs are created as follows instead:
Sage Weil [Thu, 25 Sep 2014 20:16:52 +0000 (13:16 -0700)]
osdc/Objecter: only post_rx_buffer if no op timeout
If we post an rx buffer and there is a timeout, the revocation can happen
while the reader has consumed the buffers but before it has decoded and
constructed the message. In particular, we calculate a crc32c over the
data portion of the message after we've taken the buffers and dropped the
lock.
Instead of fixing this race (for example, by reverifying rx_buffers under
the lock while calculating the crc.. bleh), just skip the rx buffer
optimization entirely when a timeout is present.
Note that this doesn't cover the op_cancel() paths, but none of those users
provide static buffers to read into.
Sage Weil [Tue, 24 Sep 2013 18:22:19 +0000 (11:22 -0700)]
osd/ReplicatedPG: respect RWORDERED rados flag
If this flag is set, we need to order reads as writes. In particular, this
means that reads will wait for degraded object recovery even if there is a
local copy, and will subsequently be ordered after a preceding write that
is waiting for the same thing.
Greg Farnum [Thu, 22 May 2014 04:41:23 +0000 (21:41 -0700)]
cephfs-java: build against older jni headers
Older versions of the JNI interface expected non-const parameters
to their memory move functions. Doing a const_cast to satisfy those
older headers is unpleasant, but it won't actually change the memory
in question. (And even if it *did* modify the memory, that
would be okay given our single user.)
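A small sketch of the cast, assuming a stand-in function: legacy_copy_out below is hypothetical and only mimics an older header whose copy routine takes a non-const source pointer. It is not the JNI API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Stand-in for an older C-style header whose copy routine takes a non-const
// source pointer even though it only reads from it. Hypothetical; not JNI.
void legacy_copy_out(char* src, char* dst, std::size_t len) {
  std::memcpy(dst, src, len);
}

// Const-correct wrapper: the cast only satisfies the old signature, and the
// callee never writes through the pointer.
void copy_out(const char* src, char* dst, std::size_t len) {
  assert(src != nullptr && dst != nullptr);
  legacy_copy_out(const_cast<char*>(src), dst, len);
}
```

The cast is safe here only because the callee never writes through src; writing through a pointer obtained this way to an object that was defined const would be undefined behavior.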
Samuel Just [Thu, 31 Oct 2013 20:19:32 +0000 (13:19 -0700)]
sharedptr_registry.hpp: removed ptrs need to not blast contents
See the included unit test update. Consider:
1) x = lookup_or_create(1, 1)
2) remove(1)
3) y = lookup_or_create(1, 2)
4) x.reset()
5) z = lookup(1)
The bug is that z will be null since x.reset() caused the
cleanup callback to remove y's key value from contents.
To fix this, contents also records the pointer value for
the weak_ptr. The removal callback only removes the
key from contents if it matches the ptr in contents.
This should work since the pointer passed to the removal
callback must be unique up to that point since it has
not yet been deleted.
This allowed a pg removal -> pg recreation -> pg removal
sequence to cause the second pg removal entry to be
erroneously cleared by the first pg removal's destructor
as it finally made its way through the removal queue.
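The fix can be illustrated with a single-threaded sketch. The class below is hypothetical and much simpler than the real SharedPtrRegistry (no lock, no condition variable, int values only); it only shows the removal callback comparing the released pointer with the one recorded in contents.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>

// Single-threaded illustration; the registry must outlive every pointer it
// hands out, because the deleter calls back into it.
class Registry {
public:
  using VPtr = std::shared_ptr<int>;

  VPtr lookup(int key) {
    auto it = contents_.find(key);
    return it == contents_.end() ? VPtr() : it->second.first.lock();
  }

  VPtr lookup_or_create(int key, int value) {
    if (VPtr existing = lookup(key))
      return existing;
    int* raw = new int(value);
    // The deleter remembers which raw pointer it was registered for.
    VPtr p(raw, [this, key](int* v) { on_release(key, v); });
    contents_[key] = std::make_pair(std::weak_ptr<int>(p), raw);
    return p;
  }

  void remove(int key) { contents_.erase(key); }

private:
  void on_release(int key, int* v) {
    auto it = contents_.find(key);
    // Only erase the entry if it still belongs to the pointer being released.
    if (it != contents_.end() && it->second.second == v)
      contents_.erase(it);
    delete v;
  }

  std::map<int, std::pair<std::weak_ptr<int>, int*>> contents_;
};
```

Running the five steps listed above against this sketch leaves z pointing at y's value, because x's deleter sees a different raw pointer recorded under key 1 and leaves the entry alone.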
Fixes: #5951
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 28e4271267976ab8405a24d015f2fb50a2f82c49)
Loic Dachary [Mon, 12 Aug 2013 12:05:38 +0000 (14:05 +0200)]
sharedptr_registry: add a variant of get_next() and the empty() method
The SharedPtrRegistry::get_next() method with a value of type VPtr
instead of V is added because it is sometimes more convenient not to
copy the value when walking the registry. The
SharedPtrRegistry::empty() predicate method is added.
A single counter (waiting) accurately reflects the number of
waiters, regardless of the method waiting. It is enough to allow
unit tests to synthesize all situations, including:
T1: x = lookup_or_create(0)
T1: release x part 1 (weak_ptrs now fail to lock)
T2: y = lookup_or_create(0)
T2: block in lookup_or_create (waiting == 1)
T1: z = lookup_or_create(1) (does not block because the key is different)
while holding the lock it does waiting++, so waiting == 2,
and before returning it does waiting--, so waiting is back to 1
T1: complete release x
T2: complete lookup_or_create(0) (waiting == 0)
The unit tests are modified to add a lookup on an unrelated key to
demonstrate that it does not reset waiting counter.
Covers 100% of the LOC and all the expected behavior, including thread
safety.
The sharedptr_registry is made a friend of the test class so that it can
synthesize race conditions. The lookup and lookup_or_create methods
set the new in_method data member before calling cond.Wait() so that
the caller knows it is waiting.
The crash occurs due to ImageCtx->parent->parent being uninitialized,
since the initial open_parent() -> open_image(parent) ->
ictx_refresh(parent) occurs before ImageCtx->parent->snap_id is set,
so refresh_parent() is not called to open an ImageCtx for the parent
of the parent. This leaves the ImageCtx->parent->parent NULL, but the
rest of ImageCtx->parent updated to point at the correct parent snapshot.
Setting the parent->snap_id earlier has some unintended side effects
currently, so for now just call refresh_parent() during
open_parent(). This is the easily backportable version of the
fix. Further patches can clean up this whole initialization process.
Sage Weil [Wed, 10 Sep 2014 15:00:50 +0000 (08:00 -0700)]
test/cli-integration/rbd: fix trailing space
Newer versions of json.tool remove the trailing ' ' after the comma. Add
it back in with sed so that the .t works on both old and new versions, and
so that we don't have to remove the trailing spaces from all of the test
cases.
Sage Weil [Sat, 16 Aug 2014 19:42:33 +0000 (12:42 -0700)]
os/FileStore: fix mount/remount force_sync race
Consider:
- mount
- sync_entry is doing some work
- umount
- set force_sync = true
- set done = true
- sync_entry exits (due to done)
- ..but does not set force_sync = false
- mount
- journal replay starts
- sync_entry sees force_sync and does a commit while op_seq == 0
...crash...
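The sequence above can be modelled with a small single-threaded state machine. This is a sketch under assumptions: SyncState and its methods are hypothetical, and clearing force_sync when the sync loop exits is only one plausible way to close the race; the actual commit may clear the flag elsewhere.

```cpp
#include <cassert>

// Hypothetical, single-threaded model of the flags named above.
struct SyncState {
  bool done = false;
  bool force_sync = false;
  int commits = 0;

  void mount() { done = false; }

  void umount() {
    force_sync = true;  // ask for one last commit
    done = true;        // then ask the sync loop to exit
  }

  // One pass of the sync loop. Returns false once the loop should exit.
  bool sync_entry_step() {
    if (done) {
      force_sync = false;  // assumed fix: never leak the flag across mounts
      return false;
    }
    if (force_sync) {
      ++commits;
      force_sync = false;
    }
    return true;
  }
};
```

With the reset in place, a remount starts with force_sync == false, so the first pass of the loop during journal replay does not commit.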
Sage Weil [Mon, 8 Sep 2014 20:44:57 +0000 (13:44 -0700)]
osdc/Objecter: revoke rx_buffer on op_cancel
If we cancel a read, revoke the rx buffers to avoid a use-after-free and/or
other undefined badness by using user buffers that may no longer be
present.
Fixes: #9362
Backport: firefly, dumpling
Reported-by: Matthias Kiefer <matthias.kiefer@1und1.de>
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 2305b2897acba38384358c33ca3bbfcae6f1c74e)
Josh Durgin [Mon, 11 Aug 2014 23:41:26 +0000 (16:41 -0700)]
librbd: fix error path cleanup for opening an image
If the image doesn't exist and caching is enabled, the ObjectCacher
was not being shut down, and the ImageCtx was leaked. The IoCtx could
later be closed while the ObjectCacher was still running, resulting in
a segfault. Simply use the usual cleanup path in open_image(), which
works fine here.
Haomai Wang [Mon, 14 Jul 2014 06:27:17 +0000 (14:27 +0800)]
Add rbdcache max dirty object option
Librbd calculates the maximum number of dirty objects from
rbd_cache_max_size, which isn't suitable for every case. If the user sets
the image order to 24, the calculated value is too small in practice, which
increases the overhead of the trim call made on each read/write op.
Now we make it an option for tuning; by default this value is still calculated.
Danny Al-Gaaf [Wed, 4 Jun 2014 21:22:18 +0000 (23:22 +0200)]
librbd/internal.cc: check earlier for null pointer
Fix potential null pointer deref: move the check for 'order != NULL'
to the beginning of the function to a) prevent a deref in the ldout() call
and b) leave the function as early as possible if the check fails.
[src/librbd/internal.cc:843] -> [src/librbd/internal.cc:865]: (warning)
Possible null pointer dereference: order - otherwise it is redundant
to check it against null.
librbd: check return code and error out if invalidate_cache fails
This will only happen when shrinking or rolling back an image is done
while other I/O is in flight to the same ImageCtx. This is unsafe, so
return an error before performing the resize or rollback.
Dan Mick [Wed, 26 Mar 2014 00:09:48 +0000 (17:09 -0700)]
rbd.cc: tolerate lack of NUL-termination on block_name_prefix
Fixes: #7577
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit fd76fec589be13a4a6362ef388929d3e3d1d21f6)
Ilya Dryomov [Wed, 29 Jan 2014 14:12:01 +0000 (16:12 +0200)]
rbd: don't forget to call close_image() if remove_child() fails
close_image() among other things unregisters a watcher that's been
registered by open_image(). Even though it'll time out in 30 or so
seconds, it's not nice now that we check for watchers before starting
the removal process.
Yehuda Sadeh [Wed, 19 Feb 2014 00:43:48 +0000 (16:43 -0800)]
rgw: return error if accessing object in non-existent bucket
Fixes: #7064
Return an error instead of trying to access the object, which is impossible
as we don't even have proper bucket info. Up until now we ended up creating
an empty pool and eventually returning ENOENT; this fix catches the issue
earlier in the process.
Sage Weil [Thu, 7 Aug 2014 00:04:02 +0000 (17:04 -0700)]
os/FileStore: force any new xattr into omap on E2BIG
If we have a huge xattr (or many little ones), the _fgetattrs() for the
inline_set will fail with E2BIG. The conditions later where we decide
whether to clean up the old xattr will then also fail. We *will* put
the xattr in omap, but the non-omap version isn't cleaned up.
Fix this by setting a flag if we get E2BIG that the inline_set is known
to be incomplete. In that case, take the conservative step of assuming
the xattr might be present and chain_fremovexattr(). Ignore any error
because it might not be there.
This is clearly harmless in the general case because it won't be there. If
it is, we will hopefully remove enough xattrs that the E2BIG condition
will go away (usually by removing some really big chained xattr).
Fixes: #8169
Backport: firefly
We didn't calculate the user manifest's object etag at all. The etag
needs to be the md5 of the concatenation of all the parts' etags.
Sage Weil [Wed, 2 Jul 2014 17:38:43 +0000 (10:38 -0700)]
qa/workunits/rest/test.py: make osd create test idempotent
Avoid the possibility that we create multiple OSDs due to retries by passing
in the optional uuid arg. (A stray osd id will make the osd tell tests a
few lines down fail.)
mon: Monitor: suicide on start if mon has been removed from monmap
If the monitor has been marked as having been part of an existing quorum
and is no longer in the monmap, then it is safe to assume the monitor
was removed from the monmap. In that event, do not allow the monitor
to start, as it will try to find its way into the quorum again (and
someone clearly stated they don't really want them there), unless
'mon force quorum join' is specified.
rgw: account common prefixes for MaxKeys in bucket listing
To be more in line with the S3 API. Previously we didn't count the
common prefixes towards MaxKeys (a single common prefix counts as a
single key). We also need to adjust the marker now if it is pointing at a
common prefix.
If we find a prefix, calculate a string greater than it so that the next
request can skip past it. This is still not the most efficient way to
do it. It would be better to push it down to the objclass, but that would
require a much bigger change.
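A rough sketch of both ideas, assuming an in-memory sorted key set in place of the bucket index. list_keys, skip_past and ListResult are hypothetical names, and bumping the last byte of the prefix is just one simple way to build a string that sorts after every key under it.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct ListResult {
  std::vector<std::string> keys;
  std::vector<std::string> common_prefixes;
  bool truncated = false;
};

// Builds a string that sorts after every key starting with `prefix` by
// bumping its last byte. Assumes a non-empty prefix not ending in 0xff.
std::string skip_past(const std::string& prefix) {
  assert(!prefix.empty());
  std::string next = prefix;
  next[next.size() - 1] = static_cast<char>(
      static_cast<unsigned char>(next[next.size() - 1]) + 1);
  return next;
}

// Lists at most max_keys entries; a plain key and a common prefix each
// consume one slot.
ListResult list_keys(const std::set<std::string>& bucket, char delim,
                     std::size_t max_keys) {
  ListResult out;
  std::size_t count = 0;
  auto it = bucket.begin();
  while (it != bucket.end()) {
    if (count == max_keys) {
      out.truncated = true;
      break;
    }
    std::size_t pos = it->find(delim);
    if (pos == std::string::npos) {
      out.keys.push_back(*it);
      ++it;
    } else {
      std::string prefix = it->substr(0, pos + 1);
      out.common_prefixes.push_back(prefix);
      // Jump over everything that shares this prefix.
      it = bucket.lower_bound(skip_past(prefix));
    }
    ++count;
  }
  return out;
}
```

For the keys a/1, a/2, b, c/1 and d with delimiter '/' and a limit of 3, this sketch returns the key b plus the prefixes a/ and c/, and reports the listing as truncated.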
rgw: don't try to wait for pending if list is empty
Fixes: #8846
Backport: firefly, dumpling
This was broken at ea68b9372319fd0bab40856db26528d36359102e. We ended
up calling wait_pending_front() when the pending list was empty.
This commit also moves the need_to_wait check to a different place,
where we actually throttle (and not just drain completed IOs).
Modified qemu-iotests workunit script to check for versions
that use the latest qemu (currently only Trusty). Limit the
tests to those that are applicable to rbd.
Josh Durgin [Mon, 31 Mar 2014 21:53:31 +0000 (14:53 -0700)]
librbd: skip zeroes when copying an image
This is the simple coarse-grained solution, but it works well in
common cases like a small base image resized with a bunch of empty
space at the end. Finer-grained sparseness can be copied by using rbd
{export,import}-diff.
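The coarse-grained approach can be sketched with in-memory buffers. This is an assumption-laden illustration: copy_skip_zeroes and is_zero are hypothetical helpers, and the real copy works on image extents rather than vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if every byte in the buffer is zero.
bool is_zero(const std::vector<char>& buf) {
  return std::all_of(buf.begin(), buf.end(), [](char c) { return c == 0; });
}

// Copies src to dst in fixed-size chunks, skipping chunks that are all
// zeroes. dst must already be the same size as src and zero-filled, like a
// freshly created sparse image. Returns the number of chunks written.
std::size_t copy_skip_zeroes(const std::vector<char>& src,
                             std::vector<char>& dst, std::size_t chunk) {
  assert(chunk > 0 && dst.size() == src.size());
  std::size_t written = 0;
  for (std::size_t off = 0; off < src.size(); off += chunk) {
    std::size_t len = std::min(chunk, src.size() - off);
    std::vector<char> piece(src.begin() + off, src.begin() + off + len);
    if (is_zero(piece))
      continue;  // nothing to write: the destination already reads as zero
    std::copy(piece.begin(), piece.end(), dst.begin() + off);
    ++written;
  }
  return written;
}
```

A chunk containing even one non-zero byte is written in full, which is why finer-grained sparseness is not preserved by this approach.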
Yehuda Sadeh [Mon, 16 Jun 2014 18:48:24 +0000 (11:48 -0700)]
rgw: allocate enough space for bucket instance id
Fixes: #8608
Backport: dumpling, firefly
Bucket instance id is a concatenation of zone name, rados instance id,
and a running counter. We need to allocate enough space to account for the
zone name length.
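One way to size the buffer from the actual inputs is to let snprintf report the required length first. This is a sketch only: make_bucket_instance_id is a hypothetical helper, and the '.' separators are an assumption, since the text above only says the three parts are concatenated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical formatter producing "<zone>.<instance id>.<counter>".
std::string make_bucket_instance_id(const std::string& zone,
                                    std::uint64_t instance_id,
                                    std::uint64_t counter) {
  const unsigned long long id = instance_id;
  const unsigned long long n = counter;
  // First pass: ask snprintf how many characters the result needs, so the
  // zone name length is included rather than guessed.
  int needed = std::snprintf(nullptr, 0, "%s.%llu.%llu", zone.c_str(), id, n);
  assert(needed >= 0);
  // Second pass: format into a buffer sized from that answer (+1 for NUL).
  std::vector<char> buf(static_cast<std::size_t>(needed) + 1);
  std::snprintf(buf.data(), buf.size(), "%s.%llu.%llu", zone.c_str(), id, n);
  return std::string(buf.data(), static_cast<std::size_t>(needed));
}
```

A fixed-size buffer that only budgets for the two numbers would truncate the id as soon as the zone name grows, which is the class of bug described above.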
Sage Weil [Thu, 8 May 2014 15:52:51 +0000 (08:52 -0700)]
ceph-disk: partprobe before settle when preparing dev
Two users have reported this fixes a problem with using --dmcrypt.
Fixes: #6966
Tested-by: Eric Eastman <eric0e@aol.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0f196265f049d432e399197a3af3f90d2e916275)
Sage Weil [Tue, 17 Jun 2014 17:47:24 +0000 (10:47 -0700)]
osd: introduce simple sleep during scrub
This option is similar to osd_snap_trim_sleep: simply inject an optional
sleep in the thread that is doing scrub work. This is a very kludgey and
coarse knob for limiting the impact of scrub on the cluster, but can help
until we have a more robust and elegant solution.
Only sleep if we are in the NEW_CHUNK state to avoid delaying processing of
an in-progress chunk. In this state nothing is blocked on anything.
Conveniently, chunky_scrub() requeues itself for each new chunk.
John Spray [Tue, 20 May 2014 15:50:18 +0000 (16:50 +0100)]
mon: pool set <pool> crush_ruleset must not use rule_exists
Implement CrushWrapper::ruleset_exists that iterates over the existing
rulesets to find the one matching the ruleset argument.
ceph osd pool set <pool> crush_ruleset must not use
CrushWrapper::rule_exists, which checks for a *rule* existing, whereas
the value being set is a *ruleset*.
(cherry picked from commit fb504baed98d57dca8ec141bcc3fd021f99d82b0)
A test via ceph osd pool set data crush_ruleset verifies the ruleset
argument is accepted.
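The difference between the two checks can be shown with a miniature model. MiniCrush and Rule are hypothetical and far simpler than CrushWrapper; here the index into rules plays the role of the rule id, and each rule records the ruleset it belongs to.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature of a CRUSH map: the index in `rules` is the rule
// id, and each rule carries the ruleset number it belongs to.
struct Rule {
  int ruleset;
};

struct MiniCrush {
  std::vector<Rule> rules;

  // True if a rule with this *rule id* exists.
  bool rule_exists(int rule_id) const {
    return rule_id >= 0 && static_cast<std::size_t>(rule_id) < rules.size();
  }

  // True if any rule belongs to this *ruleset*: iterate over all rules and
  // compare their ruleset numbers.
  bool ruleset_exists(int ruleset) const {
    for (const Rule& r : rules)
      if (r.ruleset == ruleset)
        return true;
    return false;
  }
};
```

With two rules whose rulesets are 0 and 5, ruleset 5 exists even though there is no rule with id 5, so validating the argument with rule_exists() would wrongly reject it.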
http://tracker.ceph.com/issues/8599
fixes: #8599
Backport: firefly, emperor, dumpling
Signed-off-by: John Spray <john.spray@inktank.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit d02d46e25080d5f7bb8ddd4874d9019a078b816b)
Yehuda Sadeh [Tue, 6 May 2014 18:06:29 +0000 (11:06 -0700)]
rgw: cut short object read if a chunk returns error
Fixes: #8289
Backport: firefly, dumpling
When reading an object, if we hit an error when trying to read one of
the rados objects then we should just stop. Otherwise we're just going
to continue reading the rest of the object, and since it can't be sent
back to the client (as we have a hole in the middle), we end up
accumulating everything in memory.
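The control flow can be sketched as a loop that returns on the first failed chunk. read_object and its two callbacks are hypothetical; the negative return value stands in for an errno-style error code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Reads chunks in order via `read_chunk` (hypothetical callback returning 0
// on success or a negative errno-style code) and forwards each one to
// `send`. Stops at the first error instead of reading the rest.
int read_object(std::size_t num_chunks,
                const std::function<int(std::size_t, std::string*)>& read_chunk,
                const std::function<void(const std::string&)>& send) {
  assert(read_chunk && send);
  for (std::size_t i = 0; i < num_chunks; ++i) {
    std::string data;
    int r = read_chunk(i, &data);
    if (r < 0)
      return r;  // cut the read short; later chunks could not be sent anyway
    send(data);
  }
  return 0;
}
```

Because the loop returns immediately, chunks after the failed one are never read, so nothing piles up in memory behind the hole.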
Sage Weil [Tue, 20 May 2014 17:46:34 +0000 (10:46 -0700)]
OSD::handle_pg_query: on dne pg, send lb=hobject_t() if deleting
We will set lb=hobject_t() if we resurrect the pg. In that case,
we need to have sent that to the primary beforehand. If we
finish the removal before the pg is recreated, we'll just end
up backfilling it, which is ok since the pg doesn't exist anyway.
Calling _finish_hunting() clears out the bool hunting flag, which means we
don't retry by connecting to another mon periodically. Instead, we send
keepalives every 10s. But, since we aren't yet in state HAVE_SESSION, we
don't check that the keepalives are getting responses. This means that an
ill-timed connection reset (say, after we get a MonMap, but before we
finish authenticating) can drop the monc into a black hole that does not
retry.
Instead, we should *only* call _finish_hunting() when we complete the
authentication handshake.
Fixes: #8278
Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 77a6f0aefebebf057f02bfb95c088a30ed93c53f)
Greg Farnum [Thu, 15 May 2014 23:50:43 +0000 (16:50 -0700)]
OSD: fix an osdmap_subscribe interface misuse
When calling osdmap_subscribe, you have to pass an epoch newer than the
current map's. _maybe_boot() was not doing this correctly -- we would
fail a check for being *in* the monitor's existing map range, and then
pass along the map prior to the monitor's range. But if we were exactly
one behind, that value would be our current epoch, and the request would
get dropped. So instead, make sure we are not *in contact* with the monitor's
existing map range.
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 290ac818696414758978b78517b137c226110bb4)
Samuel Just [Wed, 16 Oct 2013 17:07:37 +0000 (10:07 -0700)]
OSD: check for splitting when processing recover/backfill reservations
Fixes: 6565
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 15ec5332ba4154930a0447e2bcf1acec02691e97)
which is actually the next backfill to complete. We want to update
last_backfill to the largest completed backfill instead.
We use the pending_backfill_updates mapping to identify the most
recently completed backfill. Due to the previous patch, deletes
will also be included in that mapping.
Yehuda Sadeh [Wed, 27 Nov 2013 21:34:00 +0000 (13:34 -0800)]
rgw: don't error out on empty owner when setting acls
Fixes: #6892
Backport: dumpling, emperor
s3cmd specifies empty owner field when trying to set acls on object
/ bucket. We errored out as it didn't match the current owner name, but
with this change we ignore it.
Sage Weil [Fri, 18 Apr 2014 20:50:11 +0000 (13:50 -0700)]
osd: throttle snap trimming with simple delay
This is not particularly smart, but it is *a* knob that lets you make
the snap trimmer slow down. It's a float and a simple delay, so it is
adjustable at runtime. Default is 0 (no change in behavior).
Sage Weil [Tue, 1 Apr 2014 23:01:28 +0000 (16:01 -0700)]
PG: only complete replicas should count toward min_size
Backport: emperor,dumpling,cuttlefish
Fixes: #7805
Signed-off-by: Samuel Just <sam.just@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0d5d3d1a30685e7c47173b974caa12076c43a9c4)
Yehuda Sadeh [Sat, 3 May 2014 00:06:05 +0000 (17:06 -0700)]
rgw: don't allow multiple writers to same multiobject part
Fixes: #8269
A client might need to retry a multipart part write. The original thread
might race with the new one, trying to clean up after it, clobbering the
part's data.
The fix is to detect whether an original part already existed, and if so
use a different part name for it.