git.apps.os.sepia.ceph.com Git

libcephfs: choose more unique nonce

We were using a per-process counter combined with the pid.  A short
running process can easily loop through and reuse the same pid later.
Instead, go for 48 bits of randomness and the pid.  This way if we get
a dup pid we'll only get a dup nonce once out of 2^48 tries.

Avoids #3630 when running a libcephfs test in a loop (so that the pid
is eventually reused).  This is a better fix than the broken
8b599083705c2495810c00f9f5fd5bb8ace7f32e.  The real solution on the MDS
side involves cleaning up the msgr/MDS interaction with session
shutdown.

Signed-off-by: Sage Weil <sage@inktank.com>
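
The nonce scheme described above can be sketched roughly as follows. This
is only an illustration: the helper names pack_nonce() and make_nonce() are
hypothetical, and the bit layout (48 random bits above the low 16 bits of
the pid) is an assumption, not necessarily how libcephfs combines them.

```cpp
#include <cstdint>
#include <random>

// Hypothetical helper: keep 48 bits of the random value and place them
// above the low 16 bits of the pid. The layout is an assumption made for
// this sketch only.
inline uint64_t pack_nonce(uint64_t random_bits, uint32_t pid) {
  const uint64_t rand48 = random_bits & ((uint64_t(1) << 48) - 1);
  return (rand48 << 16) | (pid & 0xffffu);
}

// Hypothetical helper: draw the random part from the platform's entropy
// source and combine it with the pid.
inline uint64_t make_nonce(uint32_t pid) {
  std::random_device rd;
  const uint64_t hi = rd();
  const uint64_t lo = rd();
  return pack_nonce((hi << 32) | lo, pid);
}
```

With this layout, two processes that happen to share a pid only collide if
they also draw the same 48 random bits.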

Sage Weil [Mon, 31 Dec 2012 23:23:29 +0000 (15:23 -0800)]

client: fix _create

make_request() clears out req->reply and frees req; we can't inspect
it here.

Instead, just assume that extra_bl is the create flag/ino if it is
present. Old code does not include an extra_bl on CREATE, and new code
will have the same first bytes for compatibility.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Mon, 31 Dec 2012 18:16:31 +0000 (10:16 -0800)]

Merge remote-tracking branch 'gh/wip-3625'

Sage Weil [Sun, 30 Dec 2012 23:29:37 +0000 (15:29 -0800)]

Merge remote-tracking branch 'gh/wip-rbd-unprotect' into next

Reviewed-by: Sage Weil <sage@inktank.com>

Joao Eduardo Luis [Thu, 20 Dec 2012 18:25:14 +0000 (18:25 +0000)]

doc: add-or-rm-mons.rst: Add 'Changing Monitor's IPs' section

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Signed-off-by: John Wilkins <john.wilkins@inktank.com>

Joao Eduardo Luis [Wed, 19 Dec 2012 16:48:37 +0000 (16:48 +0000)]

doc: add-or-rm-mons.rst: Clarify what the monitor name/id is.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

Josh Durgin [Sun, 30 Dec 2012 07:57:01 +0000 (23:57 -0800)]

doc: fix rbd permissions for unprotect

Unprotect examines all pools, so use blanket x before 0.54. After
that, use class-read restricted by object_prefix to rbd_children.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Josh Durgin [Sun, 30 Dec 2012 04:35:15 +0000 (20:35 -0800)]

librbd: fix race between unprotect and clone

Clone needs to actually re-read the header to make sure the image is
still protected before returning. Additionally, it needs to consider
the image protected *only* if the protection status is protected -
unprotecting does not count. I thought I'd already fixed this, but
can't find the commit.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
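
The protection check described above can be sketched as follows. The enum
and function names are hypothetical stand-ins, not librbd's actual
identifiers; the point is that the status is re-read before the clone
returns and that only a fully protected state counts.

```cpp
#include <cstdint>

// Hypothetical names for the three states implied by the message.
enum class ProtectionStatus : uint8_t { Unprotected, Unprotecting, Protected };

// Only a fully protected snapshot counts; "unprotecting" does not.
inline bool is_clone_source_ok(ProtectionStatus s) {
  return s == ProtectionStatus::Protected;
}

// Re-read the status (via a caller-supplied callback standing in for a
// header refresh) right before the clone returns.
template <typename RereadStatus>
bool finish_clone(RereadStatus reread_status) {
  return is_clone_source_ok(reread_status());
}
```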

Josh Durgin [Sun, 30 Dec 2012 04:26:57 +0000 (20:26 -0800)]

rbd: open (source) image as read-only

This allows users without write access to copy, export and list
information about an image.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Josh Durgin [Sat, 29 Dec 2012 06:13:37 +0000 (22:13 -0800)]

librbd: open parent as read-only during clone

We never write to the parent, and don't need to watch it during this process.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Josh Durgin [Sat, 29 Dec 2012 00:54:51 +0000 (16:54 -0800)]

librbd: add {rbd_}open_read_only()

Since 58890cfad5f7bee933baa599a68e6c65993379d4, regular {rbd_}open()
would fail with -EPERM if the user did not have write access to the
pool, since a watch on the header was requested.

For many uses of read-only access, establishing a watch is not
necessary, since changes to the header do not matter. For example,
getting metadata about an image via 'rbd info' does not care if a new
snapshot is created while it is in progress.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Josh Durgin [Sat, 29 Dec 2012 03:47:09 +0000 (19:47 -0800)]

OSD: remove RD flag from CALL ops

20496b8d2b2c3779a771695c6f778abbdb66d92a forgot to do this. Without
this change, all class methods required regular read permission in
addition to class-read or class-write.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Josh Durgin [Sat, 29 Dec 2012 03:44:36 +0000 (19:44 -0800)]

cls_rbd: get_children does not need write permission

This prevented a read-only user from being able to unprotect a
snapshot without write permission on all pools. This was masked before
by the CLS_METHOD_PUBLIC flag.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Sage Weil [Sat, 29 Dec 2012 16:38:52 +0000 (08:38 -0800)]

Revert "mds: replace closed sessions on connect"

This reverts commit 8b599083705c2495810c00f9f5fd5bb8ace7f32e.

This fix is not correct. See #3696.

Sage Weil [Sat, 29 Dec 2012 01:20:43 +0000 (17:20 -0800)]

msg/Pipe: use state_closed atomic_t for _lookup_pipe

We shouldn't look at Pipe::state in SimpleMessenger::_lookup_pipe() without
holding pipe_lock. Instead, use an atomic that we set to non-zero only
when transitioning to the terminal STATE_CLOSED state.

Signed-off-by: Sage Weil <sage@inktank.com>
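
A minimal sketch of the idea, using std::atomic in place of Ceph's
atomic_t. The Pipe and PipeRegistry types below are simplified,
hypothetical stand-ins for the messenger classes: the lookup never reads
the lock-protected state field, only the atomic flag.

```cpp
#include <atomic>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a messenger pipe; not Ceph's real Pipe class.
struct Pipe {
  enum State { STATE_OPEN, STATE_CLOSED };

  std::mutex pipe_lock;
  State state = STATE_OPEN;               // only touched under pipe_lock
  std::atomic<bool> state_closed{false};  // readable without pipe_lock

  void close() {
    std::lock_guard<std::mutex> l(pipe_lock);
    state = STATE_CLOSED;
    state_closed.store(true);  // set once, on the terminal transition only
  }
};

// Hypothetical registry standing in for the messenger's pipe hash.
class PipeRegistry {
  std::mutex lock_;
  std::map<std::string, std::shared_ptr<Pipe>> pipes_;

 public:
  void add(const std::string& addr, std::shared_ptr<Pipe> p) {
    std::lock_guard<std::mutex> l(lock_);
    pipes_[addr] = std::move(p);
  }

  // Looks up a pipe without taking pipe_lock; closed entries are ignored.
  std::shared_ptr<Pipe> lookup(const std::string& addr) {
    std::lock_guard<std::mutex> l(lock_);
    auto it = pipes_.find(addr);
    if (it == pipes_.end() || it->second->state_closed.load())
      return nullptr;
    return it->second;
  }
};
```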

Sage Weil [Sun, 23 Dec 2012 21:43:15 +0000 (13:43 -0800)]

msgr: inject delays at inconvenient times

Exercise some rare races by injecting delays before taking locks
via the 'ms inject internal delays' option.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Sun, 23 Dec 2012 17:22:18 +0000 (09:22 -0800)]

msgr: fix race on Pipe removal from hash

When a pipe is faulting and shutting down, we have to drop pipe_lock to
take msgr lock and then remove the entry. The Pipe in this case will
have STATE_CLOSED. Handle this case in all places we do a lookup on
the rank_pipe hash so that we effectively ignore entries that are
CLOSED.

This fixes a race introduced by the previous commit where we won't use
the CLOSED pipe and try to register a new one, but the old one is still
registered.

See bug #3675.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Sun, 23 Dec 2012 17:19:05 +0000 (09:19 -0800)]

msgr: don't queue message on closed pipe

If we have a con that refs a pipe but it is closed, don't use it. If
the ref is still there, it is only because we are racing with fault()
and it is about to be (or just was) detached. Either way,

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Sun, 23 Dec 2012 05:24:52 +0000 (21:24 -0800)]

msgr: atomically queue first message with connect_rank

Atomically queue the first message on the new pipe, without dropping
and retaking pipe_lock.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Sat, 29 Dec 2012 01:19:46 +0000 (17:19 -0800)]

Merge remote-tracking branch 'gh/next'

Joao Eduardo Luis [Wed, 19 Dec 2012 01:37:47 +0000 (01:37 +0000)]

test: mon: workloadgen: debug when message fsid != monmap fsid

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

Joao Eduardo Luis [Tue, 18 Dec 2012 15:34:12 +0000 (15:34 +0000)]

test: mon: workloadgen: assert if monmap's fsid is zero after authenticate

Fixes: #3629
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>

Noah Watkins [Mon, 24 Dec 2012 00:01:42 +0000 (16:01 -0800)]

doc: update Hadoop documentation

Updates configuration option names, and adds object.size,
localize.reads, and root.dir control options.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>

Sage Weil [Sat, 29 Dec 2012 01:12:06 +0000 (17:12 -0800)]

init-ceph: ok, 8K files

16K might be a bit many.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Fri, 28 Dec 2012 00:01:49 +0000 (16:01 -0800)]

msg/Pipe: remove broken cephx signing requirement check

Remove the special-case check, which does not inform the peer what
protocol features are missing. It also enforces this requirement even
when we negotiate auth none.

Reported as part of bug #3657.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Sat, 29 Dec 2012 00:00:47 +0000 (16:00 -0800)]

msg/Pipe: include remote socket addr in debug output

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Fri, 28 Dec 2012 23:14:25 +0000 (15:14 -0800)]

client: fix fh leak in non-create case

We may take the O_CREAT path and get an fh from _create, but created can
still be false. In that case, skip the final _open call.

Signed-off-by: Sage Weil <sage@inktank.com>

Sam Lang [Wed, 19 Dec 2012 20:17:29 +0000 (10:17 -1000)]

mds: Return created inode in mds reply to create

If multiple clients race to create a file, multiple clients will send a
create request and get back a valid dentry+inode, but only one client
will actually win the race to create the file. All other clients should
treat the reply as an open of an existing file and check permissions.
This patch adds the created inode number to the mds create reply if that
request actually created the inode/file (and the feature is supported),
so the client can properly check permissions if the inode number isn't
returned. Fixes #3625.

Signed-off-by: Sam Lang <sam.lang@inktank.com>

Sam Lang [Mon, 17 Dec 2012 19:54:23 +0000 (09:54 -1000)]

client: Make ll_create use _create

This is a fix for bug #3625, where multiple clients race to create a
file, and the loser returns EEXIST instead of a valid file handle.
The patch modifies ll_create in the Client class to use _create(),
which sends the request to the MDS (where an atomic create/open is
performed).

Signed-off-by: Sam Lang <sam.lang@inktank.com>

Sage Weil [Fri, 28 Dec 2012 21:07:18 +0000 (13:07 -0800)]

log: broadcast cond signals

We were using a single cond, and only signalling one waiter. That means
that if the flusher and several logging threads are waiting, and we hit
a limit, the logger could signal another logger instead of the flusher,
and we could deadlock.

Similarly, if the flusher empties the queue, it might signal only a single
logger, and that logger could re-signal the flusher, and the other logger
could wait forever.

Instead, break the single cond into two: one for loggers, and one for the
flusher. Always signal the (one) flusher, and always broadcast to all
loggers.

Backport: bobtail, argonaut
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
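
The two-condition scheme can be sketched with standard C++ primitives.
This is not Ceph's Log class: BoundedLog and its members are hypothetical
names, and the sketch assumes a single flusher thread, as the message does.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical bounded log queue with separate conds for loggers and flusher.
class BoundedLog {
  std::mutex lock_;
  std::condition_variable cond_loggers_;  // loggers wait here for space
  std::condition_variable cond_flusher_;  // the one flusher waits here for work
  std::deque<std::string> queue_;
  const std::size_t max_;

 public:
  explicit BoundedLog(std::size_t max) : max_(max) {}

  // Called by logging threads; blocks while the queue is at its limit.
  void submit(std::string entry) {
    std::unique_lock<std::mutex> l(lock_);
    cond_loggers_.wait(l, [this] { return queue_.size() < max_; });
    queue_.push_back(std::move(entry));
    cond_flusher_.notify_one();  // there is exactly one flusher to wake
  }

  // Called by the flusher; blocks until there is work, then takes it all.
  std::deque<std::string> take_all() {
    std::unique_lock<std::mutex> l(lock_);
    cond_flusher_.wait(l, [this] { return !queue_.empty(); });
    std::deque<std::string> batch;
    batch.swap(queue_);
    cond_loggers_.notify_all();  // broadcast: every blocked logger may proceed
    return batch;
  }
};
```

Because loggers and the flusher wait on different conds, a logger can no
longer consume a wakeup that was meant for the flusher, and the broadcast
means no logger is left waiting after the queue empties.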

Sage Weil [Fri, 28 Dec 2012 19:34:47 +0000 (11:34 -0800)]

osd: allow RecoveryDone self-transition in RepNotRecovering

In a mixed cluster where some OSDs support the recovery reservations and
some don't, the replica may be new code in RepNotRecovering and will
complete a backfill. In that case, we want to just stay in
RepNotRecovering.

It may also be possible to make it infer what the primary is doing even
though it is not sending recovery reservation messages, but this is much
more complicated and doesn't accomplish much.

Fixes: #3689
Signed-off-by: Sage Weil <sage@inktank.com>

Gary Lowell [Fri, 28 Dec 2012 22:15:37 +0000 (14:15 -0800)]

Merge remote-tracking branch 'origin/wip-gl-docs'

Update release process documentation.

Gary Lowell [Fri, 28 Dec 2012 22:05:56 +0000 (14:05 -0800)]

docs: fix typo in release-process doc

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>

Sage Weil [Fri, 28 Dec 2012 20:34:15 +0000 (12:34 -0800)]

osd: less noise about inefficient tmap updates

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Fri, 28 Dec 2012 20:11:55 +0000 (12:11 -0800)]

init-ceph: default to 16K max_open_files

Signed-off-by: Sage Weil <sage@inktank.com>

Sam Lang [Fri, 28 Dec 2012 19:58:39 +0000 (13:58 -0600)]

ceph-fuse: Avoid doing handle cleanup in dtor

The CephFuse::Handle class needs the client
pointer to be valid for finalizing, so don't finalize
in the destructor (which doesn't get called till the
fuse handle leaves scope), instead use a finalize method that
gets called explicitly before the client pointer is freed.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
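
The pattern can be sketched as below. Client and Handle here are
simplified, hypothetical stand-ins rather than the real ceph-fuse classes:
cleanup that needs the client lives in an explicit finalize() method, and
the destructor never touches the client pointer.

```cpp
#include <cassert>

// Hypothetical stand-in for the client object the handle depends on.
struct Client {
  int open_handles = 0;
};

// Hypothetical handle: cleanup needs a live Client, so it is explicit.
class Handle {
  Client* client_;
  bool finalized_ = false;

 public:
  explicit Handle(Client* c) : client_(c) { ++client_->open_handles; }

  // Must be called while the client is still valid; safe to call twice.
  void finalize() {
    if (finalized_) return;
    assert(client_ != nullptr);
    --client_->open_handles;
    client_ = nullptr;
    finalized_ = true;
  }

  bool finalized() const { return finalized_; }

  // Deliberately empty: the client may already be gone by the time the
  // handle leaves scope.
  ~Handle() {}
};
```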

Sam Lang [Fri, 28 Dec 2012 19:10:04 +0000 (13:10 -0600)]

ceph-fuse: Pass client handle as userdata

The fuse lowlevel API isn't getting the client
handle when it gets initialized, resulting
in a null pointer for all the subsequent calls.

Signed-off-by: Sam Lang <sam.lang@inktank.com>

Josh Durgin [Wed, 26 Dec 2012 22:25:51 +0000 (14:25 -0800)]

doc: warn about using caching without QEMU knowing

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Sage Weil [Thu, 27 Dec 2012 21:27:46 +0000 (13:27 -0800)]

rgw: disable ops and usage logging by default

Most users don't need this, and having it on will just fill their clusters
with objects that will need to be cleaned up later.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>

Sage Weil [Fri, 28 Dec 2012 00:38:45 +0000 (16:38 -0800)]

features is uint64_t

This won't bite us for a while yet (we're on bit 26), but it will soon!

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
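
A small sketch of why the width matters. The helper names are hypothetical;
the example only shows that a feature bit above 31 disappears if the mask
is ever stored in a 32-bit variable.

```cpp
#include <cstdint>

// Build a single-bit mask with 64-bit arithmetic.
constexpr uint64_t feature_bit(unsigned n) { return uint64_t(1) << n; }

constexpr bool has_feature(uint64_t features, unsigned n) {
  return (features & feature_bit(n)) != 0;
}

// What survives if the mask passes through a 32-bit variable.
constexpr uint32_t truncate_to_32(uint64_t features) {
  return static_cast<uint32_t>(features);
}
```

Bit 26 survives the truncation, but any bit from 32 upward is silently
dropped.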

Sage Weil [Fri, 28 Dec 2012 01:15:29 +0000 (17:15 -0800)]

Merge remote-tracking branch 'gh/next'

Sam Lang [Thu, 6 Dec 2012 05:21:12 +0000 (23:21 -0600)]

ceph-fuse: Split main into init/main/finalize

With the invalidate callback enabled for fuse, the Client::unmount
call requires the fuse channel and session objects remain for performing
the invalidate callbacks. This patch splits the ceph_fuse_ll_main
call into init, main, and finalize functions, so finalization of the
channel and session objects can be done after the unmount completes.

The patch includes cleanup for the code in fuse_ll.cc to make it more
in the style of C++ and make use of the pimpl idiom to hide the fuse
structures within the CephFuse::Handle pimpl class.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

Noah Watkins [Thu, 27 Dec 2012 20:06:02 +0000 (12:06 -0800)]

java: remove deprecated libcephfs

Removes ceph_set_default_*

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>

Sage Weil [Fri, 28 Dec 2012 00:06:24 +0000 (16:06 -0800)]

init-ceph: fix status version check across machines

The local state isn't propagated into the backtick shell, resulting in
'unknown' for all remote daemons. Avoid backticks altogether.

Signed-off-by: Sage Weil <sage@inktank.com>

Gary Lowell [Thu, 27 Dec 2012 23:39:46 +0000 (15:39 -0800)]

docs: update release process documentation.

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>

Sage Weil [Thu, 27 Dec 2012 21:40:01 +0000 (13:40 -0800)]

Merge remote-tracking branch 'gh/wip-mds'

Sage Weil [Wed, 26 Dec 2012 23:27:07 +0000 (15:27 -0800)]

osd: fix recovery assert for pg repair case

In the case of PG repair, this assert is not valid. Disable it for now.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Thu, 27 Dec 2012 21:09:24 +0000 (13:09 -0800)]

Merge branch 'wip-osd-flags'

Sage Weil [Thu, 27 Dec 2012 21:07:57 +0000 (13:07 -0800)]

Merge remote-tracking branch 'gh/wip-mds-pool'

Reviewed-by: Sam Lang <sam.lang@inktank.com>

Sage Weil [Fri, 7 Dec 2012 21:28:55 +0000 (13:28 -0800)]

osd: only calculate OpRequest rmw flags once

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Fri, 7 Dec 2012 21:18:50 +0000 (13:18 -0800)]

messages/MOSDOpReply: remove misleading may_read/may_write

These are OpRequest properties, calculated/enforced at the OSD. They don't
belong in the MOSDOp or MOSDOpReply messages.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Fri, 7 Dec 2012 21:14:26 +0000 (13:14 -0800)]

osd: move rmw_flags to OpRequest, out of MOSDOp

It was very sloppy to put a server-side processing state inside the
message. Move it to the OpRequestRef instead.

Note that the client was filling in bogus data that was then lost during
encoding/decoding; clean that up.

Signed-off-by: Sage Weil <sage@inktank.com>

tamil [Thu, 27 Dec 2012 19:27:31 +0000 (11:27 -0800)]

dropping xfs test 186 due to bug: 3685

Signed-off-by: tamil <tamil.muthamizhan@inktank.com>

Gary Lowell [Thu, 27 Dec 2012 19:12:27 +0000 (11:12 -0800)]

docs: remove extra release-process2 file.

This file mostly duplicated the existing release documentation. Differences
have been merged into the primary file.

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>

Sage Weil [Thu, 27 Dec 2012 19:12:33 +0000 (11:12 -0800)]

osd: drop 'osd recovery max active' back to previous default (5)

Having this too large means that queues get too deep on the OSDs during
backfill and latency is very high. In my tests, it also meant we generated
a lot of slow recovery messages just from the recovery ops themselves (no
client io).

Keeping this at the old default means we are no worse in this respect than
argonaut, which is a safe position to start from.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Thu, 27 Dec 2012 19:11:08 +0000 (11:11 -0800)]

journal: reduce journal max queue size

Keep the journal queue size smaller than the filestore queue size.

Keeping this small also means that we can lower the latency for new
high priority ops that come into the op queue.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Thu, 27 Dec 2012 19:09:00 +0000 (11:09 -0800)]

mds: use set to store MDSMap data pools

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Wed, 26 Dec 2012 18:45:08 +0000 (10:45 -0800)]

mds: wait for client's mdsmap when specifying data pool

The client may have a newer map than we do; make sure we wait for it lest
we inadvertently reply because we think the pool doesn't exist.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Thu, 27 Dec 2012 17:33:27 +0000 (09:33 -0800)]

doc: document mds config options

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Wed, 26 Dec 2012 18:58:58 +0000 (10:58 -0800)]

doc: journaler config options

Signed-off-by: Sage Weil <sage@inktank.com>

Gary Lowell [Wed, 26 Dec 2012 20:54:27 +0000 (12:54 -0800)]

docs: Merge changes from release-process2 document.

Sage Weil [Wed, 26 Dec 2012 18:40:53 +0000 (10:40 -0800)]

mds: add waiting_for_mdsmap queue

Defer events until we get a specific MDSMap epoch.

Signed-off-by: Sage Weil <sage@inktank.com>
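
A deferred-event queue keyed by epoch might look roughly like this. The
class and method names are hypothetical and the MDS implementation may be
structured differently; the sketch only shows events being held until a
map with at least the requested epoch arrives.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

using epoch_t = uint32_t;

// Hypothetical waiter queue; not the MDS's actual data structure.
class MapWaiters {
  epoch_t current_ = 0;
  std::map<epoch_t, std::vector<std::function<void()>>> waiting_;

 public:
  // Run fn now if we already have epoch e, otherwise defer it.
  void wait_for(epoch_t e, std::function<void()> fn) {
    if (e <= current_) {
      fn();
      return;
    }
    waiting_[e].push_back(std::move(fn));
  }

  // Called when a newer map arrives; runs every waiter it satisfies.
  void handle_new_map(epoch_t e) {
    if (e <= current_) return;
    current_ = e;
    // Collect first so that callbacks may safely call wait_for() again.
    std::vector<std::function<void()>> ready;
    auto end = waiting_.upper_bound(e);
    for (auto it = waiting_.begin(); it != end; ++it)
      for (auto& fn : it->second) ready.push_back(std::move(fn));
    waiting_.erase(waiting_.begin(), end);
    for (auto& fn : ready) fn();
  }
};
```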

Sage Weil [Wed, 26 Dec 2012 19:42:04 +0000 (11:42 -0800)]

mds: do not check for pool existence in osdmap

We don't have a wait mechanism to ensure the MDSMap has the latest osdmap
here. Just trust the MDSMap.

Signed-off-by: Sage Weil <sage@inktank.com>

Josh Durgin [Wed, 26 Dec 2012 18:55:47 +0000 (10:55 -0800)]

qa: remove xfstests 172 and 173 from qemu testing

These seem to require newer xfs.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>

Sage Weil [Sat, 22 Dec 2012 01:00:48 +0000 (17:00 -0800)]

doc/man/8/mkcephfs: update --mkfs a bit

Document that 'devs' and 'osd mkfs type' must be defined.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Sun, 16 Dec 2012 20:26:06 +0000 (12:26 -0800)]

mds: replace closed sessions on connect

If a connection comes and there is a closed session attached, remove it.
This is probably a failure of an old session to get cleaned up properly,
and in certain cases it may even be from a different client (if the addr
nonce is reused). In that case this prevents further damage, although
a complete solution would also clean up the closed connection state if
there is a fault. See #3630.

This fixes a hang that is reproduced by running the libcephfs
Caps.ReadZero test in a loop; eventually the client addr is reused and
we are linked to an ancient Session with a different client id.

Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Sat, 3 Mar 2012 23:39:06 +0000 (15:39 -0800)]

mds: don't force in->first == dn->first

The fullbit sets it now. For multiversion inodes, its "first" can be in
the future, since this dentry may not have changed when the inode was
cowed in place. (OTOH, the dentry cannot have changed without the inode
also having changed.)

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>

Yan, Zheng [Fri, 30 Nov 2012 01:46:31 +0000 (09:46 +0800)]

mds: compare sessionmap version before replaying imported sessions

Otherwise we may wrongly increase mds->sessionmap.version, which
will confuse future journal replays that involve the sessionmap.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

Yan, Zheng [Sat, 8 Dec 2012 14:43:32 +0000 (22:43 +0800)]

mds: fix race between send_dentry_link() and cache expire

An MDentryLink message can race with cache expire. When it arrives at
the target MDS, there may be no corresponding dentry in the cache. If
this race happens, we should expire the replica inode encoded in the
MDentryLink message. But to expire an inode, the MDS needs to know
which subtree the inode belongs to, so modify the MDentryLink message
to include this information.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

Yan, Zheng [Mon, 10 Dec 2012 07:43:44 +0000 (15:43 +0800)]

mds: fix file existing check in Server::handle_client_openc()

Creating a new file needs to be handled by the directory fragment's
auth MDS; opening an existing file in write mode needs to be handled by
the corresponding inode's auth MDS. If a file is a remote link, its
parent directory fragment's auth MDS can differ from the corresponding
inode's auth MDS. So which MDS handles a create-file request can depend
on whether the file already exists.

handle_client_openc() calls rdlock_path_xlock_dentry() at the very
beginning. It always assumes the request needs to be handled by the
directory fragment's auth MDS. When handling a create-file request, if
the file already exists and is remotely linked to a non-auth inode,
handle_client_openc() falls back to handle_client_open(), which
forwards the request because this MDS is not the inode's auth MDS.
Then, when the request arrives at the inode's auth MDS,
rdlock_path_xlock_dentry() is called again and forwards the request
back.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

Yan, Zheng [Mon, 10 Dec 2012 02:06:28 +0000 (10:06 +0800)]

mds: delay processing cache expire when state >= EXPORT_EXPORTING

It's possible that the MDS receives a cache expire in the
EXPORT_LOGGINGFINISH and EXPORT_NOTIFYING states.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

Yan, Zheng [Sun, 9 Dec 2012 05:03:41 +0000 (13:03 +0800)]

mds: don't retry readdir request after issuing caps

If a remote linkage without an inode is encountered after some caps
are issued, Server::handle_client_readdir() should send the reply to
the client immediately instead of retrying the request after opening
the remote dentry. This is because the MDS may want to revoke these
caps before it succeeds in opening the remote dentry.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

Yan, Zheng [Sat, 8 Dec 2012 16:53:28 +0000 (00:53 +0800)]

mds: take export lock set before sending MExportDirDiscover

Migrator::export_dir() only checks whether it can lock the export lock
set; it does not take the lock set. So someone else can change the path
to the exporting dir and confuse Migrator::handle_export_discover().

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

Yan, Zheng [Sun, 9 Dec 2012 05:06:33 +0000 (13:06 +0800)]

mds: re-issue caps after importing caps

The imported caps may prevent unstable locks from entering stable
states. So we should call Locker::eval_gather() with parameter
"first" set to true after caps are imported.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

Yan, Zheng [Fri, 7 Dec 2012 08:02:08 +0000 (16:02 +0800)]

mds: always send discover if want_xlocked is true

If want_xlocked is true, we cannot rely on a previously sent discover
because it's likely the previous discover is blocking on the xlocked
dentry.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

Yan, Zheng [Sat, 8 Dec 2012 07:07:53 +0000 (15:07 +0800)]

mds: fix error handling in MDCache::handle_discover_reply()

The error handling code in MDCache::handle_discover_reply() has two
main issues. First, MDCache::handle_discover_reply() does not wake
waiters if dir_auth_hint in the reply message is equal to its own
nodeid. This can happen if a discover races with subtree importing.
Second, it checks for the existence of a cached directory fragment to
decide whether to take waiters from the inode or from the directory
fragment. The check is unreliable because subtree importing can add
directory fragments to the cache.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

Yan, Zheng [Sat, 8 Dec 2012 05:59:38 +0000 (13:59 +0800)]

mds: set want_base_dir to false for MDCache::discover_ino()

When a frozen inode is encountered, MDCache::handle_discover() sends
the reply immediately if the reply message is not empty. When handling
"discover ino" requests, the reply message always contains the base
directory fragment. But the requestor already has the base directory
fragment, so the only effect of the reply message is to wake the
requestor and make it send the same "discover ino" request again. The
requestor keeps sending "discover ino" requests but can't make any
progress.

The fix is to set want_base_dir to false for MDCache::discover_ino().
After setting want_base_dir to false, we also need to update the code
that handles "discover ino" errors.

This patch also removes unused error handling code for flag_error_dn.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

Yan, Zheng [Fri, 7 Dec 2012 07:59:56 +0000 (15:59 +0800)]

mds: no bloom filter for replica dir

We should delete a dir fragment's bloom filter after exporting the dir
fragment to another MDS. Otherwise the residual bloom filter may cause
problems if the MDS imports the dir fragment later.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

Yan, Zheng [Thu, 6 Dec 2012 01:28:46 +0000 (09:28 +0800)]

mds: properly mark dirfrag dirty

If predirty_journal_parents() does not propagate changes in dir's
fragstat into corresponding inode's dirstat, it should mark the
inode as dirfrag dirty. This happens when we modify dir fragments
that are auth subtree roots.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

Yan, Zheng [Fri, 30 Nov 2012 00:53:33 +0000 (08:53 +0800)]

mds: allow handle_client_readdir() to fetch a freezing dir

At that point, the request has already auth pinned and locked some
objects, so CDir::fetch() should ignore the can_auth_pin check and
continue to fetch the freezing dir.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>

Sage Weil [Mon, 24 Dec 2012 03:59:04 +0000 (19:59 -0800)]

Merge branch 'wip-create-layout'

Reviewed-by: Greg Farnum <greg@inktank.com>
The functional tests for the create operations should add and specify non-default
pools, but we don't have a set of library methods to do that yet (to interact with
the monitor).

Sage Weil [Fri, 7 Dec 2012 20:45:19 +0000 (12:45 -0800)]

mds: *_pg_pool -> *_pool

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Fri, 7 Dec 2012 09:16:53 +0000 (01:16 -0800)]

client, libcephfs: add method to get the pool name for an open file

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Thu, 6 Dec 2012 08:12:17 +0000 (00:12 -0800)]

client: specify data pool on create operations

Fill in the data pool field if specified by the client, or set to -1.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Thu, 6 Dec 2012 08:11:00 +0000 (00:11 -0800)]

mds: verify that the pool id is valid on SET[DIR]LAYOUT

Make sure the data pool exists and is part of the MDSMap data pools list.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Thu, 6 Dec 2012 08:10:29 +0000 (00:10 -0800)]

mds: allow data pool to be specified on create

Reuse old preferred_pg field. Only use if the new CREATEPOOLID feature
is present, and the value is >= 0.

Verify that the data pool is allowed, or return EINVAL to the client.

Signed-off-by: Sage Weil <sage@inktank.com>
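
The decision logic described above can be sketched as a single function.
The function name, parameters, and the use of a std::set for the allowed
pools are assumptions made for illustration, not the MDS code itself.

```cpp
#include <cerrno>
#include <cstdint>
#include <set>

// Hypothetical helper: returns the pool id to use for the new file, or
// -EINVAL if the client asked for a pool that is not an allowed data pool.
inline int64_t choose_create_pool(bool peer_has_createpoolid,
                                  int64_t requested_pool,
                                  const std::set<int64_t>& data_pools,
                                  int64_t default_pool) {
  // Ignore the field unless the feature is present and the value is >= 0.
  if (!peer_has_createpoolid || requested_pool < 0)
    return default_pool;
  // The requested pool must be one of the allowed data pools.
  if (data_pools.count(requested_pool) == 0)
    return -EINVAL;
  return requested_pool;
}
```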

Sage Weil [Thu, 6 Dec 2012 06:03:34 +0000 (22:03 -0800)]

client: remove set_default_*() methods

This is a poor interface. The hadoop stuff is shifting to specify this
information on file creation instead.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Sun, 23 Dec 2012 23:17:12 +0000 (15:17 -0800)]

osd: fix dup failure cancellations

If we had a pending failure report, and send a cancellation, take it
out of our pending list so that we don't keep resending cancellations.

Signed-off-by: Sage Weil <sage@inktank.com>
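
The bookkeeping can be sketched as follows. FailureTracker and its methods
are hypothetical names, and OSD ids are modeled as plain ints; the point
is that sending a cancellation also erases the pending entry, so a second
cancel for the same OSD sends nothing.

```cpp
#include <set>

// Hypothetical tracker for outstanding failure reports.
class FailureTracker {
  std::set<int> pending_;  // OSD ids with an outstanding failure report

 public:
  void report_failure(int osd) { pending_.insert(osd); }

  // Returns true if a cancellation should be sent. The entry is removed
  // at the same time, so the cancellation is not resent later.
  bool cancel_failure(int osd) { return pending_.erase(osd) > 0; }
};
```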

Sage Weil [Sun, 23 Dec 2012 23:16:06 +0000 (15:16 -0800)]

osd: make MOSDFailure output more sensible

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Sun, 23 Dec 2012 23:11:39 +0000 (15:11 -0800)]

mon: make osd failure report log msgs sensible

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Sun, 23 Dec 2012 22:42:51 +0000 (14:42 -0800)]

Merge branch 'wip-scrub' into next

Reviewed-by: Sage Weil <sage@inktank.com>
Conflicts:
src/osd/PG.cc

Sage Weil [Sun, 23 Dec 2012 21:29:08 +0000 (13:29 -0800)]

monclient: fix get_monmap_privately retry interval

Use mon_client_hunt_interval (default 3) instead of hardcoding 1 second.

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Sun, 23 Dec 2012 04:56:45 +0000 (20:56 -0800)]

Makefile: fix 'base' rule

Signed-off-by: Sage Weil <sage@inktank.com>

Sage Weil [Sun, 23 Dec 2012 19:19:39 +0000 (11:19 -0800)]

Merge branch 'next'
