Greg Farnum [Thu, 6 Nov 2014 19:10:29 +0000 (11:10 -0800)]
MDS: clean up internal MDRequests the standard way
All cleanup is now routed through respond_to_request(),
which invokes the internal_op_finish Context*, then does
mdcache->request_finish(). This is easier to reason about,
and indeed fixes a bug (I was not cleaning up locks
following flush). Use the MDSContinuation to facilitate
this in scrub's case.
Greg Farnum [Fri, 29 Aug 2014 06:03:59 +0000 (23:03 -0700)]
MDCache: make scrub_dentry schedulable and reentrant
Rather than assuming that any necessary inodes are in the cache, split up
MDCache::scrub_dentry into setup and work phases. Add an internal_op_finisher()
to MDRequest. Dispatch any CEPH_MDS_OP_VALIDATE internal operations to
scrub_dentry_work(). Taken together, these make everything work properly when
path_traverse() (by way of rdlock_path_pin_ref()) needs to go to disk before
satisfying the lookup.
Greg Farnum [Wed, 27 Aug 2014 21:11:26 +0000 (14:11 -0700)]
MDCache: "handle" request_forward on internal ops
For now, just return -EXDEV ("Cross-device link") on internal ops that
require forwarding, as forwarding internal ops will require a great deal more
infrastructure. But push the issue down to this level instead of worrying
about it in path_traverse, and consider the possibility that the MDRequest
might not have a client_request that it's wrapped around.
Greg Farnum [Thu, 21 Aug 2014 03:12:00 +0000 (20:12 -0700)]
Server: rename reply_request -> reply_client_request; make it private
The generic reply_request(MDRequest, int) is now the only caller. It's still
just building an MClientRequest to pass along, but we can change it a lot more
easily now to support responding to non-client requests.
Greg Farnum [Thu, 21 Aug 2014 02:47:00 +0000 (19:47 -0700)]
Server: use mdr->reply_extra_bl instead of explicit MClientReply
Set the MClientReply::extra_bl from reply_extra_bl unconditionally in
reply_request(), instead of only in early_reply(). Further isolate
the reply_request() callers from the use of MClientReply this way.
Greg Farnum [Fri, 1 Aug 2014 14:01:04 +0000 (07:01 -0700)]
Server: remove tracei and tracedn parameters from reply_request
We have members for these two parameters in the MDRequestImpl already, so
make use of them. This helps us move towards dropping the expectation of an
MClientRequest from functions like rdlock_path_pin_ref().
MDCache: add a scrub_dentry() function, and wire it up to the admin socket
scrub_dentry() is passed a string path, and it validates it before replying. We
hook up an admin socket command "scrub_path" to call it and dump the output.
Add a function that will validate the on-disk state of the CInode. We currently
check that the on-disk backtrace matches (or is older) and compare rstats on
dirfrags against the parent dir's inode (for directories only).
TODO: validate that the on-disk Inode object matches what the parent
directory holds.
It's using a sort-of new programming model, trying to stuff stack data into
a Continuation object and write everything sequentially instead of having
a function and Context per IO.
Signed-off-by: Greg Farnum <greg@inktank.com>
Signed-off-by: John Spray <john.spray@redhat.com>
Greg Farnum [Wed, 5 Nov 2014 22:23:32 +0000 (14:23 -0800)]
Rebase: MDS: Add an MDSContinuation for ease of use
Unlike the regular Continuation, this one works in terms of an MDRequest
and has wrappers to provide Context callbacks that are either
internal MDS or IO appropriate.
Greg Farnum [Wed, 1 Oct 2014 00:03:24 +0000 (17:03 -0700)]
MDCache: create_unlinked_system_inode() as the guts of create_system_inode()
This way we can create duplicate CInodes without actually linking them
into the cache. It'll be helpful for comparing different versions of
disk states and in-memory state, etc.
mdstypes: add a same_sums() function to nest_info_t
operator== is checking equality of the version as well, but I want
something I can use to check that the internal sums match. This is useful,
e.g., for comparing the sums of a set of dirfrags to the tally stored in
the inode.
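As a rough illustration of the difference between the two comparisons, here is a minimal sketch. The struct is a simplified stand-in rather than the real nest_info_t, and its field names (version, rbytes, rfiles, rsubdirs) are assumptions made for the example.

```cpp
#include <cstdint>

// Simplified stand-in for nest_info_t. The field names are assumptions
// chosen for illustration, not the actual mdstypes definition.
struct NestInfoSketch {
  uint64_t version = 0;   // changes whenever the stats are updated
  int64_t rbytes = 0;     // recursive byte count
  int64_t rfiles = 0;     // recursive file count
  int64_t rsubdirs = 0;   // recursive subdirectory count

  // Compare only the accumulated sums and ignore the version.
  bool same_sums(const NestInfoSketch& o) const {
    return rbytes == o.rbytes &&
           rfiles == o.rfiles &&
           rsubdirs == o.rsubdirs;
  }
};

// Full equality: the version has to match as well as the sums.
inline bool operator==(const NestInfoSketch& l, const NestInfoSketch& r) {
  return l.version == r.version && l.same_sums(r);
}
```

With this shape, two records that carry the same totals but different versions compare unequal under operator== while same_sums() still reports a match.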
Boris Ranto [Thu, 6 Nov 2014 14:38:51 +0000 (15:38 +0100)]
Fedora 19 uses systemd, but systemd-run is not available in that release (rhbz#1157938). This patch makes the init scripts check for systemd-run before using it; if it is missing, they fall back to the default method.
Yehuda Sadeh [Wed, 5 Nov 2014 21:40:55 +0000 (13:40 -0800)]
rgw: remove swift user manifest (DLO) hash calculation
Fixes: #9973
Backport: firefly, giant
Previously we were iterating through the parts, creating a hash of the
parts' etags (as S3 does for multipart uploads). However, swift just
calculates the etag for the empty manifest object.
Yehuda Sadeh [Wed, 5 Nov 2014 21:28:02 +0000 (13:28 -0800)]
rgw: send back ETag on S3 object copy
Fixes: #9479
Backport: firefly, giant
We didn't send the etag back correctly. Original code assumed the etag
resided in the attrs, but attrs only contained request attrs.
John Wilkins [Mon, 3 Nov 2014 23:08:11 +0000 (15:08 -0800)]
Merge pull request #2854 from ceph/wip-doc-openstack-juno
doc: Update for OpenStack Juno.
New users working with Nova (Juno) noted that libvirt settings are now under a [libvirt] section, and that the option names drop the leading libvirt_ prefix. Made subsections for Havana and Icehouse, added a new subsection for Juno.
Sage Weil [Sat, 1 Nov 2014 04:29:42 +0000 (21:29 -0700)]
buffer: implement list::get_contiguous
Return a pointer to a contiguous range of the bufferlist, rebuilding
into a contiguous region as needed. For now, if we need to rebuild,
we just do the whole thing. We can obviously optimize this later to
rebuild only the necessary region, but this is good enough for the
(presumably) common case where the needed region is already in fact
contiguous.
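The fast-path/slow-path behavior described above can be sketched roughly as follows. SegmentedBuffer is a toy class invented for this example; it is not ceph::bufferlist and does not reproduce its API or memory management.

```cpp
#include <cstddef>
#include <list>
#include <stdexcept>
#include <string>
#include <utility>

// Toy segmented buffer used only to illustrate the idea. This is a
// hypothetical class, not ceph::bufferlist.
class SegmentedBuffer {
  std::list<std::string> segs_;

public:
  void append(const std::string& s) {
    if (!s.empty()) segs_.push_back(s);
  }

  std::size_t length() const {
    std::size_t n = 0;
    for (const auto& s : segs_) n += s.size();
    return n;
  }

  std::size_t num_segments() const { return segs_.size(); }

  // Return a pointer to `len` contiguous bytes starting at `off`.
  char* get_contiguous(std::size_t off, std::size_t len) {
    const std::size_t total = length();
    if (len > total || off > total - len)
      throw std::out_of_range("range past end of buffer");

    // Fast path: the range already lies inside a single segment.
    std::size_t base = 0;
    for (auto& s : segs_) {
      if (off >= base && off + len <= base + s.size())
        return &s[off - base];
      base += s.size();
    }

    // Slow path: rebuild the whole buffer into one segment.
    std::string all;
    all.reserve(total);
    for (const auto& s : segs_) all += s;
    segs_.clear();
    segs_.push_back(std::move(all));
    return &segs_.front()[off];
  }
};
```

A request that fits inside one segment leaves the segment list alone; a request that straddles a boundary collapses everything into a single segment, which is the unoptimized rebuild the message describes.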
Most of the time this works well, but the program occasionally
hangs forever. Output of gstack:
Thread 1 (Thread 0x7fe0afba0760 (LWP 18509)):
0 0x000000330f20822d in pthread_join () from /lib64/libpthread.so.0
1 0x000000347566cea2 in Thread::join(void**) () from /usr/lib64/librados.so.2
2 0x00000034755ac535 in librados::RadosClient::shutdown() () from /usr/lib64/librados.so.2
3 0x0000003475592269 in rados_shutdown () from /usr/lib64/librados.so.2
4 0x0000000000402349 in main ()
Thread 4 (Thread 0x7fe0ab14d700 (LWP 18541)):
0 0x000000330f20e264 in __lll_lock_wait () from /lib64/libpthread.so.0
1 0x000000330f209508 in _L_lock_854 () from /lib64/libpthread.so.0
2 0x000000330f2093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
3 0x0000003475633af1 in Mutex::Lock(bool) () from /usr/lib64/librados.so.2
4 0x00000034755abd37 in librados::RadosClient::put() () from /usr/lib64/librados.so.2
5 0x0000003475592501 in librados::Rados::shutdown() () from /usr/lib64/librados.so.2
6 0x00007fe0afbba9f7 in libradosstriper::RadosStriperImpl::CompletionData::~CompletionData() () from /usr/lib64/libradosstriper.so.1
7 0x00007fe0afbbaad9 in libradosstriper::RadosStriperImpl::WriteCompletionData::~WriteCompletionData() () from /usr/lib64/libradosstriper.so.1
8 0x00007fe0afbc1d75 in RefCountedObject::put() () from /usr/lib64/libradosstriper.so.1
9 0x00007fe0afbc224d in libradosstriper::MultiAioCompletionImpl::safe_request(long) () from /usr/lib64/libradosstriper.so.1
10 0x00000034755c5ce8 in librados::C_AioSafe::finish(int) () from /usr/lib64/librados.so.2
11 0x00000034755a0e89 in Context::complete(int) () from /usr/lib64/librados.so.2
12 0x000000347564d4c8 in Finisher::finisher_thread_entry() () from /usr/lib64/librados.so.2
13 0x000000330f2079d1 in start_thread () from /lib64/libpthread.so.0
14 0x000000330eae886d in clone () from /lib64/libc.so.6
Clearly librados::Rados::shutdown() is not thread-safe here, and the
result is a permanent hang. The cause is the order of events when a
CompletionData is released: it first notifies
"rados_aio_wait_for_safe" so that the waiter can continue, and only
then calls put() to release its other data. If the main thread (Thread 1
here) runs fast enough, rados_striper_destroy is executed before the
other thread (Thread 4 here) has released its refcount. In that case
the main thread and the other thread run Rados::shutdown() at the
same time.
My suggestion is to make RadosStriperImpl::aio_flush block until all
CompletionData objects have been released. This ensures that no other
thread ever calls rados_shutdown.
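One way to picture the suggested fix is a counter guarded by a mutex and a condition variable, as in the sketch below. CompletionTracker and its method names are hypothetical; this is not the libradosstriper implementation, only an illustration of blocking until every outstanding completion has been released.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical helper, not part of libradosstriper: counts outstanding
// completion objects and lets a flush-style call wait for all of them.
class CompletionTracker {
  std::mutex m_;
  std::condition_variable cv_;
  int outstanding_ = 0;

public:
  // Called when a completion object is created.
  void add() {
    std::lock_guard<std::mutex> l(m_);
    ++outstanding_;
  }

  // Called as the very last step of a completion's teardown, after it
  // has dropped every reference it holds.
  void release() {
    std::lock_guard<std::mutex> l(m_);
    if (--outstanding_ == 0) cv_.notify_all();
  }

  // Called from the flush path: block until nothing is outstanding.
  void wait_all_released() {
    std::unique_lock<std::mutex> l(m_);
    cv_.wait(l, [this] { return outstanding_ == 0; });
  }

  int outstanding() {
    std::lock_guard<std::mutex> l(m_);
    return outstanding_;
  }
};
```

The important design point is that release() runs only after the completion has finished all of its own cleanup, so a thread that returns from wait_all_released() can shut down without racing a late put().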
Sage Weil [Thu, 30 Oct 2014 17:56:36 +0000 (10:56 -0700)]
osdc/Objecter: fix null dref when pool dne
If the base pool does not exist, we need to avoid dereferencing pi.
The simplest fix is to return with POOL_DNE early and skip all of the
checks.
Note that there is one other small semantic change in this function: if
we are using the precalc_pgid then base_oloc pool has to match. But
the list_objects() caller does that, so we're fine.
Backport: giant
Fixes: #9944
Signed-off-by: Sage Weil <sage@redhat.com>
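The early-return pattern described in the Objecter fix above can be sketched as follows. All of the names here (PoolInfo, TargetResult, get_pool, calc_target) are hypothetical and much simpler than the real Objecter code; the only point is that the lookup result is checked before anything dereferences it.

```cpp
#include <cstdint>
#include <map>

// Hypothetical, heavily simplified types; not the real Objecter code.
struct PoolInfo {
  uint32_t pg_num;
};

enum class TargetResult { OK, POOL_DNE };

// Return the pool's info, or nullptr if the pool does not exist.
const PoolInfo* get_pool(const std::map<int64_t, PoolInfo>& pools,
                         int64_t id) {
  auto it = pools.find(id);
  return it == pools.end() ? nullptr : &it->second;
}

TargetResult calc_target(const std::map<int64_t, PoolInfo>& pools,
                         int64_t base_pool, uint32_t hash,
                         uint32_t* pg_out) {
  const PoolInfo* pi = get_pool(pools, base_pool);
  // Bail out before any later check can dereference a null pi.
  if (!pi) return TargetResult::POOL_DNE;

  *pg_out = pi->pg_num ? hash % pi->pg_num : 0;
  return TargetResult::OK;
}
```

When the pool is missing the output parameter is left untouched, so callers can tell that no placement was computed.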
Sébastien Han [Thu, 30 Oct 2014 10:59:14 +0000 (11:59 +0100)]
doc: update RBD for Juno
This commit introduces some updates for the OpenStack Juno release. New
flags have been added, many trailing spaces were removed and a new
recommendation for Glance cache management has been added too.
Signed-off-by: Sébastien Han <sebastien.han@enovance.com>
John Spray [Mon, 27 Oct 2014 12:02:17 +0000 (12:02 +0000)]
client: allow xattr caps in inject_release_failure
Because some test environments generate spurious
rmxattr operations, allow the client to release
'X' caps. This allows xattr operations to proceed
while still preventing the client from releasing other caps.
Vicente Cheng [Wed, 29 Oct 2014 04:21:11 +0000 (12:21 +0800)]
rbd: Fix rbd export when the image size is more than 2G
When running export <image-name> <path> on an image larger than 2G,
the previous version of finish() could not handle seeking to the
offset in the image and returned an error.
This was caused by an incorrect variable type; use the correct
variable type to fix it.
I use another variable of type uint64_t to check the seek result,
and still use the previous r to return the error.
uint64_t is better suited than int for handling lseek64().
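A small sketch of why the narrower type breaks for large offsets: fake_seek() below is a stand-in for lseek64() (it returns the resulting offset, or -1 on error), and both helper functions are hypothetical rather than the actual rbd code. The sketch uses a signed 64-bit type for the offset, which is an assumption of the example. Narrowing a value above INT_MAX to int wraps on mainstream platforms (it was implementation-defined before C++20).

```cpp
#include <cstdint>

// Stand-in for lseek64(): returns the resulting offset, or -1 on error.
// Hypothetical helper used only for this illustration.
int64_t fake_seek(int64_t target) {
  return target >= 0 ? target : -1;
}

// Similar in spirit to the bug: the 64-bit result is squeezed into an
// int, so a valid offset above 2G can look like a negative error code.
bool seek_ok_narrow(int64_t target) {
  int r = static_cast<int>(fake_seek(target));
  return r >= 0;
}

// Similar in spirit to the fix: keep the result in a 64-bit variable,
// compare it with the requested offset, and report the status through
// a separate int.
int seek_checked(int64_t target) {
  int64_t pos = fake_seek(target);
  if (pos != target) return -1;
  return 0;
}
```

With a 3 GiB offset, the narrow variant reports a failure even though the seek succeeded, while the 64-bit check accepts it.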