Greg Farnum [Fri, 6 Feb 2015 05:12:17 +0000 (21:12 -0800)]
fsync-tester: print info about PATH and locations of lsof lookup
We're seeing the lsof invocation fail (as not found) in testing and nobody can
identify why. Since attempting to reproduce the issue has not worked, this
patch will gather data from a genuinely in-vitro location.
Yehuda Sadeh [Sat, 13 Dec 2014 01:07:30 +0000 (17:07 -0800)]
rgw: use s->bucket_attrs instead of trying to read obj attrs
Fixes: #10307
Backport: firefly, giant
This is needed, since we can't really read the bucket attrs by trying to
read the bucket entry point attrs. We already have the bucket attrs
anyway, use these.
Petr Machata [Thu, 29 Jan 2015 17:15:02 +0000 (10:15 -0700)]
support Boost 1.57.0
Sometime after 1.55, boost introduced a forward declaration of
operator<< in optional.hpp. In 1.55 and earlier, when << was used
without the _io having been included, what got dumped was an implicit
bool conversion.
http://tracker.ceph.com/issues/10688 Refs: #10688 Signed-off-by: Ken Dreyer <kdreyer@redhat.com>
(cherry picked from commit 85717394c33137eb703a7b88608ec9cf3287f67a)
Sage Weil [Tue, 16 Dec 2014 01:04:32 +0000 (17:04 -0800)]
osd: handle no-op write with snapshot case
If we have a transaction that does something to the object but it !exists
both before and after, we will continue through the write path. If the
snapdir object already exists, and we try to create it again, we will
leak a snapdir obc and lock and later crash on an assert when the obc
is destroyed:
0> 2014-12-06 01:49:51.750163 7f08d6ade700 -1 osd/osd_types.h: In function 'ObjectContext::~ObjectContext()' thread 7f08d6ade700 time 2014-12-06 01:49:51.605411
osd/osd_types.h: 2944: FAILED assert(rwstate.empty())
Fix is to not recreated the snapdir if it already exists.
John Spray [Wed, 14 Jan 2015 10:35:53 +0000 (10:35 +0000)]
mds: handle heartbeat_reset during shutdown
Because any thread might grab mds_lock and call heartbeat_reset
immediately after a call to suicide() completes, this needs
to be handled as a special case where we tolerate MDS::hb having
already been destroyed.
Fixes: #10382 Signed-off-by: John Spray <john.spray@redhat.com>
Jason Dillaman [Mon, 15 Dec 2014 15:53:53 +0000 (10:53 -0500)]
librbd: complete all pending aio ops prior to closing image
It was possible for an image to be closed while aio operations
were still outstanding. Now all aio operations are tracked and
completed before the image is closed.
Fixes: #10299
Backport: giant, firefly, dumpling Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Mon, 19 Jan 2015 15:28:56 +0000 (10:28 -0500)]
librbd: gracefully handle deleted/renamed pools
snap_unprotect and list_children both attempt to scan all
pools. If a pool is deleted or renamed during the scan,
the methods would previously return -ENOENT. Both methods
have been modified to more gracefully handle this condition.
Fixes: #10270
Backport: giant, firefly Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Josh Durgin [Wed, 14 Jan 2015 23:01:38 +0000 (15:01 -0800)]
qa: ignore duplicates in rados ls
These can happen with split or with state changes due to reordering
results within the hash range requested. It's easy enough to filter
them out at this stage.
Sage Weil [Tue, 14 Oct 2014 19:41:48 +0000 (12:41 -0700)]
msg/simple: do not stop_and_wait on mark_down
We originally blocked in mark_down for fast dispatch threads
to complete to avoid various races in the code. Most of these
were in the OSD itself, where we were not prepared to get
messges on connections that had no attached session. Since
then, the OSD checks have been cleaned up to handle this.
There were other races we were worried about too, but the
details have been lost in the depths of time.
Instead, take the other route: make mark_down never block on
dispatch. This lets us remove the special case that
was added in order to cope with fast dispatch calling
mark_down on itself.
Now, the only stop_and_wait() user is the shutdown sequence.
Sage Weil [Fri, 31 Oct 2014 23:25:09 +0000 (16:25 -0700)]
msg/Pipe: inject delay in stop_and_wait
Inject a delay in stop_and_wait. This will mostly affect the connection
race Pipe takeover code which currently calls stop_and_wait while holding
the msgr->lock. This should make it easier for a racing fast_dispatch
method to get stuck on a call that (indirectly) needs the msgr lock.
See #9921.
Greg Farnum [Tue, 28 Oct 2014 23:45:43 +0000 (16:45 -0700)]
SimpleMessenger: Pipe: do not block on takeover while holding global lock
We previously were able to cause deadlocks:
1) Existing pipe is fast_dispatching
2) Replacement incoming pipe is accepted
*) blocks on stop_and_wait() of existing Pipe
3) External things are blocked on SimpleMessenger::lock() while
blocking completion of the fast dispatch.
To resolve this, if we detect that an existing Pipe we want to take over is
in the process of fast dispatching, we unlock our locks and wait on it to
finish. Then we go back to the lookup step and retry.
The effect of this should be safe:
1) We are not making any changes to the existing Pipe in new ways
2) We have not registered the new Pipe anywhere
3) We have not sent back any replies based on Messenger state to
the remote endpoint.
Sage Weil [Thu, 8 Jan 2015 21:34:52 +0000 (13:34 -0800)]
osd: requeue PG when we skip handling a peering event
If we don't handle the event, we need to put the PG back into the peering
queue or else the event won't get processed until the next event is
queued, at which point we'll be processing events with a delay.
The queue_null is not necessary (and is a waste of effort) because the
event is still in pg->peering_queue and the PG is queued.
Note that this only triggers when we exceeed osd_map_max_advance, usually
when there is a lot of peering and recovery activity going on. A
workaround is to increase that value, but if you exceed osd_map_cache_size
you expose yourself to crache thrashing by the peering work queue, which
can cause serious problems with heavily degraded clusters and bit lots of
people on dumpling.
Yan, Zheng [Thu, 4 Dec 2014 04:18:47 +0000 (12:18 +0800)]
osdc/Filer: use finisher to execute C_Probe and C_PurgeRange
Currently contexts C_Probe/C_PurgeRange are executed while holding
OSDSession::completion_lock. C_Probe and C_PurgeRange may call
Objecter::stat() and Objecter::remove() respectively, which acquire
Objecter::rwlock. This can cause deadlock because there is intermediate
dependency between Objecter::rwlock and OSDSession::completion_lock:
Loic Dachary [Fri, 14 Nov 2014 00:16:10 +0000 (01:16 +0100)]
common: do not omit shard when ghobject NO_GEN is set
Do not silence the display of shard_id when generation is NO_GEN.
Erasure coded objects JSON representation used by ceph_objectstore_tool
need the shard_id to find the file containing the chunk.
Minimal testing is added to ceph_objectstore_tool.py
John Spray [Tue, 25 Nov 2014 16:54:42 +0000 (16:54 +0000)]
mon: OSDMonitor: allow adding tiers to FS pools
This was an overly-strict check. In fact it is perfectly
fine to set an overlay on a pool that is already in use
as a filesystem data or metadata pool.
Jason Dillaman [Tue, 18 Nov 2014 02:49:26 +0000 (21:49 -0500)]
librbd: protect list_children from invalid child pool IoCtxs
While listing child images, don't ignore error codes returned
from librados when creating an IoCtx. This will prevent seg
faults from occurring when an invalid IoCtx is used.
Fixes: #10123
Backport: giant, firefly, dumpling Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 0d350b6817d7905908a4e432cd359ca1d36bab50)
John Spray [Thu, 6 Nov 2014 11:46:29 +0000 (11:46 +0000)]
osdc: fix Journaler write error handling
Since we started wrapping the write error
handler in a finisher, multiple calls to
handle_write_error would hit the assert()
on the second call before the actual
handler had been called (at the other end
of the finisher) from the first call.
The symptom was that the MDS was intermittently
failing to respawn on blacklist, seen in #10011.
Introduce ceph_erasure_code_non_regression to check and compare how an
erasure code plugin encodes and decodes content with a given set of
parameters. For instance:
Will create an encoded object (--create) and store it into a directory
along with the chunks, one chunk per file. The directory name is derived
from the parameters. The content of the object is a random pattern of 31
bytes repeated to fill the object size specified with --stripe-width.
The check function (--check) reads the object back from the file,
encodes it and compares the result with the content of the chunks read
from the files. It also attempts recover from one or two erasures.
Chunks encoded by a given version of Ceph are expected to be encoded
exactly in the same way by all Ceph versions going forward.
Loic Dachary [Thu, 9 Oct 2014 16:52:17 +0000 (18:52 +0200)]
ceph-disk: run partprobe after zap
Not running partprobe after zapping a device can lead to the following:
* ceph-disk prepare /dev/loop2
* links are created in /dev/disk/by-partuuid
* ceph-disk zap /dev/loop2
* links are not removed from /dev/disk/by-partuuid
* ceph-disk prepare /dev/loop2
* some links are not created in /dev/disk/by-partuuid
This is assuming there is a bug in the way udev events are handled by
the operating system.
Loic Dachary [Fri, 10 Oct 2014 08:23:34 +0000 (10:23 +0200)]
ceph-disk: encapsulate partprobe / partx calls
Add the update_partition function to reduce code duplication.
The action is made an argument although it always is -a because it will
be -d when deleting a partition.
Use the update_partition function in prepare_journal_dev
John Spray [Mon, 24 Nov 2014 11:00:25 +0000 (11:00 +0000)]
mon: fix MDS health status from peons
The health data was there, but we were attempting
to enumerate MDS GIDs from pending_mdsmap (empty on
peons) instead of mdsmap (populated from paxos updates)
06a245a added a section def to assembly files; I added it twice to
this file. There's no damage, but a compiler warning (on machines with
yasm installed)
Vicente Cheng [Wed, 29 Oct 2014 04:21:11 +0000 (12:21 +0800)]
rbd: Fix the rbd export when image size more than 2G
When using export <image-name> <path> and the size of image is more
than 2G, the previous version about finish() could not handle in
seeking the offset in image and return error.
This is caused by the incorrect variable type. Try to use the correct
variable type to fixed it.
I use another variable which type is uint64_t for confirming seeking
and still use the previous r for return error.
uint64_t is more better than type int for handle lseek64().
Sage Weil [Thu, 13 Nov 2014 01:11:10 +0000 (17:11 -0800)]
osd/OSD: use OSDMap helper to determine if we are correct op target
Use the new helper. This fixes our behavior for EC pools where targetting
a different shard is not correct, while for replicated pools it may be. In
the EC case, it leaves the op hanging indefinitely in the OpTracker because
the pgid exists but as a different shard.
Sage Weil [Thu, 13 Nov 2014 01:04:35 +0000 (17:04 -0800)]
osd/OSDMap: add osd_is_valid_op_target()
Helper to check whether an osd is a given op target for a pg. This
assumes that for EC we always send ops to the primary, while for
replicated we may target any replica.
Josh Durgin [Wed, 12 Nov 2014 02:16:02 +0000 (18:16 -0800)]
qa: allow small allocation diffs for exported rbds
The local filesytem may behave slightly differently. This isn't
foolproof, but seems to be reliable enough on rhel7 rootfs, where
exact comparison was failing.
John Spray [Mon, 27 Oct 2014 12:02:17 +0000 (12:02 +0000)]
client: allow xattr caps in inject_release_failure
Because some test environments generate spurious
rmxattr operations, allow the client to release
'X' caps. Allows xattr operations to proceed
while still preventing client releasing other caps.
John Spray [Fri, 7 Nov 2014 11:34:43 +0000 (11:34 +0000)]
tools: fix MDS journal import
Previously it only worked on fresh filesystems which
hadn't been trimmed yet, and resulted in an invalid
trimmed_pos when expire_pos wasn't on an object
boundary.
Yan, Zheng [Mon, 27 Oct 2014 20:57:16 +0000 (13:57 -0700)]
client: fix I_COMPLETE_ORDERED checking
Current code marks a directory inode as complete and ordered when readdir
finishes, but it does not check if the directory was modified in the middle
of readdir. This is wrong, directory inode should not be marked as ordered
if it was modified during readddir
The fix is introduce a new counter to the inode data struct, we increase
the counter each time the directory is modified. When readdir finishes, we
check the counter to decide if the directory should be marked as ordered.
client: preserve ordering of readdir result in cache
Preserve ordering of readdir result in a list, so that the result of cached
readdir is consistant with uncached readdir.
As a side effect, this commit also removes the code that removes stale dentries.
This is OK because stale dentries does not have valid lease, they will be
filter out by the shared gen check in Client::_readdir_cache_cb()
client: introduce a new flag indicating if dentries in directory are sorted
When creating a file, Client::insert_dentry_inode() set the dentry's offset
based on directory's max offset. The offset does not reflect the real
postion of the dentry in directory. Later readdir reply from real postion
of the dentry in directory. Later readdir reply from MDS may change the
dentry's position/offset. This inconsistency can cause missing/duplicate
entries in readdir result if readdir is partly satisfied by dcache_readdir().
The fix is introduce a new flag indicating if dentries in directory are
sorted. We use _readdir_cache_cb() to handle readdir only when the flag is
set, clear the flag after creating/deleting/renaming file.
Loic Dachary [Thu, 2 Oct 2014 07:23:55 +0000 (09:23 +0200)]
tools: rados put /dev/null should write() and not create()
In the rados.cc special case to handle put an empty objects, use
write_full() instead of create().
A special case was introduced 6843a0b81f10125842c90bc63eccc4fd873b58f2
to create() an object if the rados put file is empty. Prior to this fix
an attempt to rados put an empty file was a noop. The problem with this
fix is that it is not idempotent. rados put an empty file twice would
fail the second time and rados put a file with one byte would succeed as
expected.
Yehuda Sadeh [Thu, 9 Oct 2014 17:20:27 +0000 (10:20 -0700)]
rgw: set length for keystone token validation request
Fixes: #7796
Backport: giany, firefly
Need to set content length to this request, as the server might not
handle a chunked request (even though we don't send anything).
Tested-by: Mark Kirkwood <mark.kirkwood@catalyst.net.nz> Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
(cherry picked from commit 3dd4ccad7fe97fc16a3ee4130549b48600bc485c)