Sage Weil [Tue, 2 Dec 2014 02:15:59 +0000 (18:15 -0800)]
osd: tolerate sessionless con in fast dispatch path
We can now get a session cleared from a Connection at any time. Change
the assert to an if in ms_fast_dispatch to cope. It's pretty rare, but it
can happen, especially with delay injection. In particular, a racing
thread can call mark_down() on us.
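A minimal sketch of the shape of the change, with stand-in types (the real code lives in the OSD's ms_fast_dispatch; names are approximate):

    struct Session { /* per-connection OSD state */ };
    struct Connection {
      Session *priv = nullptr;
      Session *get_priv() { return priv; }
    };
    struct Message {
      Connection *con = nullptr;
      Connection *get_connection() { return con; }
      void put() { /* drop a reference */ }
    };

    void ms_fast_dispatch(Message *m) {
      Session *session = m->get_connection()->get_priv();
      if (!session) {   // was: assert(session); a racing mark_down() can clear it
        m->put();       // drop the message; the connection is being torn down
        return;
      }
      // ... dispatch the op using the session ...
    }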
Greg Farnum [Fri, 6 Feb 2015 05:12:17 +0000 (21:12 -0800)]
fsync-tester: print info about PATH and locations of lsof lookup
We're seeing the lsof invocation fail (as not found) in testing and nobody can
identify why. Since attempting to reproduce the issue has not worked, this
patch will gather data from a genuinely in-vitro location.
Jason Dillaman [Mon, 27 Oct 2014 18:47:19 +0000 (14:47 -0400)]
osdc: Constrain max number of in-flight read requests
Constrain the number of in-flight RADOS read requests to the
cache size. This reduces the chance of the cache memory
ballooning during certain scenarios like copy-up which can
invoke many concurrent read requests.
Fixes: #9854
Backport: giant, firefly, dumpling
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
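A toy throttle in the spirit of the change, assuming in-flight read bytes are capped at the cache size (not the actual osdc/librbd types; a real throttle must also admit single requests larger than the cap):

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    class ReadThrottle {
      std::mutex m;
      std::condition_variable cv;
      uint64_t in_flight = 0;
      const uint64_t max_in_flight;   // e.g. the configured cache size
    public:
      explicit ReadThrottle(uint64_t max) : max_in_flight(max) {}
      void start_read(uint64_t len) {        // blocks until capacity frees up
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&]{ return in_flight + len <= max_in_flight; });
        in_flight += len;
      }
      void finish_read(uint64_t len) {       // called from the read completion
        std::lock_guard<std::mutex> l(m);
        in_flight -= len;
        cv.notify_all();
      }
    };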
This commit ensures that the timestamp of an S3 request is within the
acceptable grace time of radosgw.
Addresses some failures in #10062
Fixes: #10062
Signed-off-by: Abhishek Lekshmanan <abhishek.lekshmanan@gmail.com>
(cherry picked from commit 4b35ae067fef9f97b886afe112d662c61c564365)
Lei Dong [Mon, 27 Oct 2014 02:29:48 +0000 (10:29 +0800)]
fix inability to disable max_size quota
Currently if we enable quota and set max_size = -1, it doesn't
mean max_size is unlimited as expected. Instead, an object of
any size is rejected with "QuotaExceeded".
The root cause is that rgw_rounded_kb, the function which converts
max_size to max_size_kb, returns 0 for -1 because it takes an
unsigned int but we pass it a signed int. A simple fix is to check
max_size before it's rounded to max_size_kb.
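A sketch of the failure mode; the helper's exact signature is an assumption here, but the unsigned wraparound is the point:

    #include <cstdint>
    #include <cstdio>

    // assumed shape of the helper: it rounds an *unsigned* byte count up to KB
    static inline uint64_t rgw_rounded_kb(uint64_t bytes) {
      return (bytes + 1023) / 1024;
    }

    int main() {
      int64_t max_size = -1;  // the "unlimited" sentinel
      // -1 converts to 2^64-1; adding 1023 wraps around to 1022, so the
      // result is 0 -- a zero-KB quota that rejects every non-empty upload
      printf("%llu\n", (unsigned long long)rgw_rounded_kb(max_size));   // 0
      // the fix: test the sentinel before rounding
      int64_t max_size_kb =
          max_size < 0 ? -1 : (int64_t)rgw_rounded_kb((uint64_t)max_size);
      printf("%lld\n", (long long)max_size_kb);                         // -1
      return 0;
    }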
Test case:
1. Enable and set quota:
   radosgw-admin quota enable --uid={user_id} --quota-scope=user
   radosgw-admin quota set --quota-scope=user --uid={user_id} \
     --max-objects=100 --max-size=-1
2. Upload any object with non-zero length.
Without the fix this returns 403 with "QuotaExceeded"; with the fix it
returns 200.
Fixes: #5595
Backport: dumpling, firefly
We need to update the bucket index when updating object attrs; otherwise
we're missing meta changes that need to be registered. This also
solves the issue of the bucket index not knowing about object ACL changes,
although that part still requires some more work.
Yehuda Sadeh [Wed, 5 Nov 2014 21:28:02 +0000 (13:28 -0800)]
rgw: send back ETag on S3 object copy
Fixes: #9479
Backport: firefly, giant
We didn't send the etag back correctly. The original code assumed the etag
resided in the attrs, but attrs only contained the request attrs.
Yehuda Sadeh [Wed, 5 Nov 2014 21:40:55 +0000 (13:40 -0800)]
rgw: remove swift user manifest (DLO) hash calculation
Fixes: #9973
Backport: firefly, giant
Previously we were iterating through the parts, creating a hash of the
parts' etags (as S3 does for multipart uploads). However, swift just
calculates the etag of the empty manifest object.
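For illustration, the etag Swift reports for a DLO is just the MD5 of the empty manifest body. A self-contained check of that value (using OpenSSL, link with -lcrypto; this is not rgw code):

    #include <openssl/md5.h>
    #include <cstdio>

    int main() {
      unsigned char md[MD5_DIGEST_LENGTH];
      // the manifest object itself has no data, so its etag is md5("")
      MD5(reinterpret_cast<const unsigned char *>(""), 0, md);
      for (int i = 0; i < MD5_DIGEST_LENGTH; i++)
        printf("%02x", md[i]);
      printf("\n");   // d41d8cd98f00b204e9800998ecf8427e
      return 0;
    }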
Yehuda Sadeh [Thu, 20 Nov 2014 18:36:05 +0000 (10:36 -0800)]
rgw-admin: create subuser if needed when creating user
Fixes: #10103
Backport: firefly, giant
This turned up after fixing #9973. Earlier we also didn't create the
subuser in this case, but we didn't actually read the subuser info during
authentication. Now we do read it as required, so we end up failing the
authentication. This only applies to cases where a subuser was created
using 'user create' rather than the 'subuser create' command.
Yehuda Sadeh [Sat, 13 Dec 2014 01:07:30 +0000 (17:07 -0800)]
rgw: use s->bucket_attrs instead of trying to read obj attrs
Fixes: #10307
Backport: firefly, giant
This is needed since we can't really read the bucket attrs by trying to
read the bucket entry point attrs. We already have the bucket attrs
anyway; use those.
Petr Machata [Thu, 29 Jan 2015 17:15:02 +0000 (10:15 -0700)]
support Boost 1.57.0
Sometime after 1.55, Boost introduced a forward declaration of
operator<< in optional.hpp. In 1.55 and earlier, when << was used
without optional_io.hpp having been included, what got streamed was the
implicit bool conversion.
http://tracker.ceph.com/issues/10688
Refs: #10688
Signed-off-by: Ken Dreyer <kdreyer@redhat.com>
(cherry picked from commit 85717394c33137eb703a7b88608ec9cf3287f67a)
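A small reproducer of the build issue and the usual fix of including the io header (version specifics as described above):

    #include <boost/optional.hpp>
    #include <boost/optional/optional_io.hpp>  // required to stream optionals on newer Boost
    #include <iostream>

    int main() {
      boost::optional<int> x = 5;
      // With old Boost and no optional_io.hpp this silently streamed the
      // implicit bool conversion; with the io header it streams the value.
      // Under Boost 1.57, code like this fails to build without the header.
      std::cout << x << std::endl;
      return 0;
    }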
Sage Weil [Tue, 16 Dec 2014 01:04:32 +0000 (17:04 -0800)]
osd: handle no-op write with snapshot case
If we have a transaction that does something to the object but it !exists
both before and after, we will continue through the write path. If the
snapdir object already exists, and we try to create it again, we will
leak a snapdir obc and lock and later crash on an assert when the obc
is destroyed:
0> 2014-12-06 01:49:51.750163 7f08d6ade700 -1 osd/osd_types.h: In function 'ObjectContext::~ObjectContext()' thread 7f08d6ade700 time 2014-12-06 01:49:51.605411
osd/osd_types.h: 2944: FAILED assert(rwstate.empty())
The fix is to not recreate the snapdir if it already exists.
John Spray [Wed, 14 Jan 2015 10:35:53 +0000 (10:35 +0000)]
mds: handle heartbeat_reset during shutdown
Because any thread might grab mds_lock and call heartbeat_reset
immediately after a call to suicide() completes, this needs
to be handled as a special case where we tolerate MDS::hb having
already been destroyed.
Fixes: #10382
Signed-off-by: John Spray <john.spray@redhat.com>
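The shape of the fix, as a compilable sketch with stand-in types (the real check guards MDS::hb):

    #include <mutex>

    struct HeartbeatHandle { /* registered with the heartbeat map */ };

    struct MDSLike {
      std::mutex mds_lock;
      HeartbeatHandle *hb = nullptr;   // destroyed by suicide()

      void heartbeat_reset() {
        std::lock_guard<std::mutex> l(mds_lock);
        if (!hb)          // tolerate a racing suicide() having torn hb down
          return;
        // ... reset the heartbeat timeout via hb ...
      }
    };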
Jason Dillaman [Mon, 15 Dec 2014 15:53:53 +0000 (10:53 -0500)]
librbd: complete all pending aio ops prior to closing image
It was possible for an image to be closed while aio operations
were still outstanding. Now all aio operations are tracked and
completed before the image is closed.
Fixes: #10299
Backport: giant, firefly, dumpling
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
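A toy version of the tracking idea: count in-flight aio ops and drain them before close returns (not the actual librbd structures):

    #include <condition_variable>
    #include <mutex>

    class AioTracker {
      std::mutex m;
      std::condition_variable cv;
      int pending = 0;
    public:
      void start_op()  { std::lock_guard<std::mutex> l(m); ++pending; }
      void finish_op() {
        std::lock_guard<std::mutex> l(m);
        if (--pending == 0) cv.notify_all();
      }
      void wait_for_all() {   // image close blocks here until aio drains
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [this]{ return pending == 0; });
      }
    };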
Jason Dillaman [Mon, 19 Jan 2015 15:28:56 +0000 (10:28 -0500)]
librbd: gracefully handle deleted/renamed pools
snap_unprotect and list_children both attempt to scan all
pools. If a pool is deleted or renamed during the scan,
the methods would previously return -ENOENT. Both methods
have been modified to more gracefully handle this condition.
Fixes: #10270
Backport: giant, firefly
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
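The gist as a compilable sketch; open_pool is a hypothetical stand-in for the rados pool lookup:

    #include <cerrno>
    #include <string>
    #include <vector>

    struct IoCtx { /* per-pool io handle */ };
    static int open_pool(const std::string &, IoCtx *) { return -ENOENT; } // stand-in

    int scan_pools(const std::vector<std::string> &pools) {
      for (const auto &p : pools) {
        IoCtx ioctx;
        int r = open_pool(p, &ioctx);
        if (r == -ENOENT)
          continue;      // pool deleted/renamed mid-scan: skip it, don't fail
        if (r < 0)
          return r;      // real errors still propagate
        // ... scan this pool for children / protected snapshots ...
      }
      return 0;
    }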
Josh Durgin [Wed, 14 Jan 2015 23:01:38 +0000 (15:01 -0800)]
qa: ignore duplicates in rados ls
These can happen with split or with state changes due to reordering
results within the hash range requested. It's easy enough to filter
them out at this stage.
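The filtering itself is trivial; a sketch of the idea (the QA change applies it to rados ls output):

    #include <set>
    #include <string>
    #include <vector>

    // keep the first occurrence of each name, dropping duplicates
    std::vector<std::string> dedup_listing(const std::vector<std::string> &ls) {
      std::set<std::string> seen;
      std::vector<std::string> out;
      for (const auto &name : ls)
        if (seen.insert(name).second)   // false means we already saw it
          out.push_back(name);
      return out;
    }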
Sage Weil [Tue, 14 Oct 2014 19:41:48 +0000 (12:41 -0700)]
msg/simple: do not stop_and_wait on mark_down
We originally blocked in mark_down for fast dispatch threads
to complete to avoid various races in the code. Most of these
were in the OSD itself, where we were not prepared to get
messages on connections that had no attached session. Since
then, the OSD checks have been cleaned up to handle this.
There were other races we were worried about too, but the
details have been lost in the depths of time.
Instead, take the other route: make mark_down never block on
dispatch. This lets us remove the special case that
was added in order to cope with fast dispatch calling
mark_down on itself.
Now, the only stop_and_wait() user is the shutdown sequence.
Sage Weil [Fri, 31 Oct 2014 23:25:09 +0000 (16:25 -0700)]
msg/Pipe: inject delay in stop_and_wait
Inject a delay in stop_and_wait. This will mostly affect the connection
race Pipe takeover code which currently calls stop_and_wait while holding
the msgr->lock. This should make it easier for a racing fast_dispatch
method to get stuck on a call that (indirectly) needs the msgr lock.
See #9921.
Greg Farnum [Tue, 28 Oct 2014 23:45:43 +0000 (16:45 -0700)]
SimpleMessenger: Pipe: do not block on takeover while holding global lock
We previously were able to cause deadlocks:
1) Existing pipe is fast_dispatching
2) Replacement incoming pipe is accepted
*) blocks on stop_and_wait() of existing Pipe
3) External things are blocked on SimpleMessenger::lock() while
blocking completion of the fast dispatch.
To resolve this, if we detect that an existing Pipe we want to take over is
in the process of fast dispatching, we unlock our locks and wait on it to
finish. Then we go back to the lookup step and retry.
The effect of this should be safe:
1) We are not making any changes to the existing Pipe in new ways
2) We have not registered the new Pipe anywhere
3) We have not sent back any replies based on Messenger state to
the remote endpoint.
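A compilable sketch of the retry loop with stand-in types; the key property is that we never wait on fast dispatch while holding the global lock:

    #include <mutex>

    struct Pipe {
      bool fast_dispatching = false;
      void wait_for_fast_dispatch_done() { /* cond-wait in the real code */ }
    };

    std::mutex msgr_lock;
    Pipe *lookup_existing_pipe() { return nullptr; }   // stand-in for the addr lookup

    void accept_replacement_pipe() {
      for (;;) {
        std::unique_lock<std::mutex> l(msgr_lock);
        Pipe *existing = lookup_existing_pipe();
        if (existing && existing->fast_dispatching) {
          l.unlock();                                // drop the global lock first
          existing->wait_for_fast_dispatch_done();   // then block
          continue;                                  // redo the lookup from scratch
        }
        // ... safe to take over the connection and register the new Pipe ...
        return;
      }
    }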
Sage Weil [Thu, 8 Jan 2015 21:34:52 +0000 (13:34 -0800)]
osd: requeue PG when we skip handling a peering event
If we don't handle the event, we need to put the PG back into the peering
queue or else the event won't get processed until the next event is
queued, at which point we'll be processing events with a delay.
The queue_null is not necessary (and is a waste of effort) because the
event is still in pg->peering_queue and the PG is queued.
Note that this only triggers when we exceed osd_map_max_advance, usually
when there is a lot of peering and recovery activity going on. A
workaround is to increase that value, but if you exceed osd_map_cache_size
you expose yourself to cache thrashing by the peering work queue, which
can cause serious problems with heavily degraded clusters and has bitten
lots of people on dumpling.
Loic Dachary [Tue, 16 Dec 2014 12:31:30 +0000 (13:31 +0100)]
erasure-code: relax cauchy w restrictions
A restriction that the w parameter of the cauchy technique is limited to
8, 16 or 32 was added incorrectly while refactoring parameter parsing in
the jerasure plugin and must be relaxed.
Sage Weil [Mon, 24 Nov 2014 02:50:51 +0000 (18:50 -0800)]
crush/CrushWrapper: fix create_or_move_item when name exists but item does not
We were using item_exists(), which simply checks if we have a name defined
for the item. Instead, use _search_item_exists(), which looks for an
instance of the item somewhere in the hierarchy. This matches what
get_item_weightf() is doing, which ensures we get a non-negative weight
that converts properly to floating point.
Backport: giant, firefly
Fixes: #9998
Reported-by: Pawel Sadowski <ceph@sadziu.pl>
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 9902383c690dca9ed5ba667800413daa8332157e)
Sage Weil [Sat, 22 Nov 2014 01:47:56 +0000 (17:47 -0800)]
crush/builder: prevent bucket weight underflow on item removal
It is possible to set a bucket weight that is not the sum of the item
weights if you manually modify/build the CRUSH map. Protect against any
underflow on the bucket weight when removing items.
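The guard is essentially a one-line clamp, sketched here (CRUSH stores fixed-point weights in unsigned 32-bit ints):

    #include <stdint.h>

    // subtract an item's weight from its bucket without wrapping below zero;
    // hand-built maps may violate the sum-of-items invariant
    static uint32_t bucket_weight_after_remove(uint32_t bucket_weight,
                                               uint32_t item_weight) {
      return item_weight < bucket_weight ? bucket_weight - item_weight : 0;
    }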
Yan, Zheng [Thu, 4 Dec 2014 04:18:47 +0000 (12:18 +0800)]
osdc/Filer: use finisher to execute C_Probe and C_PurgeRange
Currently contexts C_Probe/C_PurgeRange are executed while holding
OSDSession::completion_lock. C_Probe and C_PurgeRange may call
Objecter::stat() and Objecter::remove() respectively, which acquire
Objecter::rwlock. This can cause deadlock because there is an intermediate
dependency between Objecter::rwlock and OSDSession::completion_lock.
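The pattern of the fix, sketched with a stand-in finisher: completions are queued and run from a separate thread, so they no longer call back into Objecter while completion_lock is held:

    #include <functional>
    #include <list>
    #include <mutex>

    struct FinisherLike {
      std::mutex m;
      std::list<std::function<void()>> q;   // drained by a dedicated thread
      void queue(std::function<void()> fn) {
        std::lock_guard<std::mutex> l(m);
        q.push_back(std::move(fn));
      }
    };

    // under completion_lock: instead of running ctx->complete(r) inline
    // (which may take Objecter::rwlock), hand the context off:
    //   finisher.queue([ctx, r]{ ctx->complete(r); });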
Dan Mick [Wed, 10 Dec 2014 21:19:16 +0000 (13:19 -0800)]
rados.py: remove Rados.__del__(); it just causes problems
Recent versions of Python contain a change to thread shutdown that
causes ceph to hang on exit; see http://bugs.python.org/issue21963.
As it turns out, this is relatively easy to avoid by not spawning
threads on exit, as Rados.__del__() will certainly do by calling
shutdown(); I suspect, but haven't proven, that the problem is
that shutdown() tries to start() a threading.Thread() that never
makes it all the way back to signal start().
Also add a PendingReleaseNote and extra doc comments to clarify.
Loic Dachary [Fri, 14 Nov 2014 00:16:10 +0000 (01:16 +0100)]
common: do not omit shard when ghobject NO_GEN is set
Do not silence the display of shard_id when the generation is NO_GEN.
The JSON representation of erasure-coded objects used by
ceph_objectstore_tool needs the shard_id to find the file containing
the chunk.
Minimal testing is added in ceph_objectstore_tool.py.