Sage Weil [Thu, 7 Jul 2011 20:35:32 +0000 (13:35 -0700)]
mds: set old and new dentry lease bits
Recent kernels got the new CEPH_LOCK_DN definition but we were still
setting the old bit. Set both so we work with both classes of clients. In
the meantime, update the kernel to ignore this field so that eventually we
can drop/reuse it.
Sage Weil [Tue, 5 Jul 2011 21:22:24 +0000 (14:22 -0700)]
mds: always clear_flushed() after finish_flush()
The scatter_writebehind_finish() is always followed up by an eval_gather(),
which does the clear_flushed(). For everyone else (replicas!), we need to
clear it immediately to avoid confusing things later.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Tue, 5 Jul 2011 15:58:26 +0000 (08:58 -0700)]
mds: fix file_excl assert
If we are in XSYN state and want to move to anything else, we must go via
EXCL, but we may not be loner anymore. Weaken the file_excl() assert so we
don't crash.
Reported-by: Fyodor Ustinov <ufm@ufm.su>
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Fri, 1 Jul 2011 06:17:39 +0000 (23:17 -0700)]
mon: add 'osd create [id]' command
If the id is specified, mark a nonexistent osd rank as existing. The id
must fall within the current [0,max) range. This is the counterpart of
'osd rm <id>'.
If the id is not specified, allocate an unused osd id and set the EXISTS
flag. Increase max_osd as needed.
Closes: #1244
Signed-off-by: Sage Weil <sage@newdream.net>
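The two forms of the command described above can be sketched roughly as follows. This is an illustration only, not the monitor code: `OsdTable` and its members are hypothetical names, the vector length stands in for max_osd, and both the "lowest unused id first" policy and the -1 error return are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the osd map: exists[i] is true when rank i
// carries the EXISTS flag, and exists.size() plays the role of max_osd.
struct OsdTable {
  std::vector<bool> exists;

  // 'osd create <id>': mark a nonexistent rank inside [0, max) as existing.
  // Returns the id, or -1 (assumed error convention) if the id is out of
  // range or the rank already exists.
  int create(int id) {
    if (id < 0 || id >= static_cast<int>(exists.size())) return -1;
    if (exists[id]) return -1;
    exists[id] = true;
    return id;
  }

  // 'osd create': pick an unused id (lowest first, by assumption) and set
  // its flag; grow the table when every rank is already in use.
  int create() {
    for (std::size_t i = 0; i < exists.size(); ++i) {
      if (!exists[i]) {
        exists[i] = true;
        return static_cast<int>(i);
      }
    }
    exists.push_back(true);
    assert(exists.back());  // the new rank exists and max has grown by one
    return static_cast<int>(exists.size()) - 1;
  }
};
```

With three ranks where only rank 1 is free, `create(1)` succeeds once and is rejected the second time, while a later `create()` has to grow the table and returns 3.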
Sage Weil [Fri, 1 Jul 2011 05:04:42 +0000 (22:04 -0700)]
mds: fix off-by-one in cow_inode vs snap flushes
We need to wait for the client to flush snapped caps if the client has
not already flushed for the given snap. If the client has already flushed
caps through the last snapid for the old inode, we do not need to set up
the snapped inode's locks to wait for that.
This fixes an occasional hang on the snaps/snaptest-multiple-capsnaps.sh
workunit.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 30 Jun 2011 20:44:24 +0000 (13:44 -0700)]
client: only send one flushsnap once per mds session
This mirrors a kclient change a while back (e835124).
We only want to send one flushsnap cap message per MDS session:
- it's a waste to send multiples
- the mds will only reply to the first one
If the mds restarts we need to resend.
This fixes a hang where we send multiples, the first (and only) reply is
ignored (due to tid mismatch), and we are left with dangling references to
the inode and hang on umount. (Reliably reproduced by running the full
snaps/ workunit directory.)
Fixes: #1239
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Add an activate() function that must be called before we call the
onfinish callback. This is especially important in multi-threaded
contexts, since otherwise if completions come in in the wrong order, we
may delete the C_Gather object right before calling new_sub on it!
Also delete rm_subs because it is redundant with sub_finish.
Finally, num_subs_created, num_subs_remaining are now methods on
C_GatherBuilder rather than C_Gather.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
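The role of activate() can be illustrated with a small sketch. `Gather` below is a simplified, single-threaded stand-in and not the C_Gather class itself: it does not delete itself, it takes no locks, and its member names are hypothetical. It only shows the ordering rule that the finish callback cannot run until the caller has declared that no more subs will be created.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Simplified stand-in (not Ceph's C_Gather): on_finish runs exactly once,
// after activate() has been called and every sub-completion has fired.
class Gather {
 public:
  explicit Gather(std::function<void()> on_finish)
      : on_finish_(std::move(on_finish)) {}

  // Hand out one sub-completion. Creating subs after activate() is a bug.
  std::function<void()> new_sub() {
    assert(!activated_);
    ++remaining_;
    return [this] {
      --remaining_;
      maybe_finish();
    };
  }

  // Declare that all subs have been created.
  void activate() {
    activated_ = true;
    maybe_finish();
  }

 private:
  void maybe_finish() {
    if (activated_ && remaining_ == 0 && !finished_) {
      finished_ = true;
      on_finish_();
    }
  }

  std::function<void()> on_finish_;
  int remaining_ = 0;
  bool activated_ = false;
  bool finished_ = false;
};
```

Even if every sub completes before the caller is done creating them, the callback waits for activate(). The same call covers the case with no subs at all, where activate() fires the callback directly.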
C_Gather objects are deleted by the last sub-context to execute.
If you create a C_Gather object manually, you must worry about the case
where there are no sub-contexts.
C_GatherBuilder is a little object that sits on the stack that allows
you to build C_Gather objects without worrying about this.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Samuel Just [Wed, 29 Jun 2011 22:37:23 +0000 (15:37 -0700)]
osd: don't spew spurious scrub unreserve messages
The past primary was sending out scrub unreserve messages to all the
non-primary OSDs in the acting set on a PG state change. They're
spurious since the other OSDs will cancel the scrubs themselves
on state change, and they weren't right anyway because the loop
was looking at all the non-primary OSDs and sending out a message,
which could have excluded the new primary (if it was a replica before),
included other OSDs new to the PG, and included the current OSD.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
Sage Weil [Wed, 29 Jun 2011 22:23:40 +0000 (15:23 -0700)]
mds: fix mds scatter_writebehind starvation
scatter_writebehind is called by eval_gather on dirty locks, and
eval_gather is called by wrlock_finish on unstable locks when you
drop the last wrlock...and scatter_writebehind force-takes a wrlock.
This meant that a workload like:
seq 3000|xargs -i mkdir a/b/{} &
mkdir a/c
could cause the mkdir a/c to wait until after the other process
finished because rstats can propagate upwards asynchronously, but
mark the directory dirty synchronously, while the mkdir a/c requires
an actual wrlock in order to modify the rstats.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
Greg Farnum [Wed, 29 Jun 2011 17:46:06 +0000 (10:46 -0700)]
messenger: add a set_ip function to initialize the IP
Previously we only filled in IPs if they were set by the config file
(at startup) or after we connected to the monitor. Unfortunately this
could lead to conditions where the OSD connected to itself without
knowing that's what it was doing, because the cluster_addr IP wasn't
filled in until much too late. We've provided a mechanism for filling
in the IP and do so in OSD::boot_start.
Sage Weil [Wed, 29 Jun 2011 04:47:01 +0000 (21:47 -0700)]
mds: fix snaprealm split for dir inode
The snaprealm root directory inode belongs to the snaprealm, at least
currently. This makes split_at() consistent with the non-directory case
at the top of the method, and prevents a crash later down the line when we
try to tear down the parent snaprealm and we aren't part of it.
Fixes: #1238
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Greg Farnum [Tue, 28 Jun 2011 21:13:55 +0000 (14:13 -0700)]
filepath: don't parse multiple slashes as multiple dname bits.
This causes all kinds of trouble if it occurs because most of the
code isn't prepared for it. So prevent that from happening except
on messages that were explicitly created that way.
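As a rough illustration of the parsing rule, a splitter that treats runs of slashes as a single separator might look like the sketch below. `split_path` is a hypothetical free function, not the filepath class, and it ignores the exception for messages that were explicitly created with empty components.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: split a path into dname bits, never emitting an
// empty bit for leading, trailing, or repeated slashes.
std::vector<std::string> split_path(const std::string& path) {
  std::vector<std::string> bits;
  std::string::size_type pos = 0;
  while (pos < path.size()) {
    std::string::size_type next = path.find('/', pos);
    if (next == std::string::npos) next = path.size();
    if (next > pos) {
      bits.push_back(path.substr(pos, next - pos));
      assert(!bits.back().empty());  // invariant: no empty dname bits
    }
    pos = next + 1;
  }
  return bits;
}
```

Under this rule "a//b/c" yields the three bits a, b, and c rather than four bits with an empty one in the middle.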
Greg Farnum [Tue, 28 Jun 2011 19:36:00 +0000 (12:36 -0700)]
Journaler: pay attention to return codes from read head.
Previously we ignored them, except for printing them out. This could
lead to bad things like creating new journals for non-existent MDSes
if you entered an invalid rank during --reset-journal.
Also assert that the stripe unit is valid
before using it as a divisor.
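The divisor check amounts to something like the following sketch. `block_index` is a hypothetical helper, not a Journaler method, and "valid" is narrowed here to "nonzero" as an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: index of the stripe-unit-sized block that contains
// a given byte offset.
std::uint64_t block_index(std::uint64_t offset, std::uint32_t stripe_unit) {
  // Fail loudly on a bad header instead of dividing by zero.
  assert(stripe_unit != 0);
  return offset / stripe_unit;
}
```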
Samuel Just [Fri, 24 Jun 2011 00:06:02 +0000 (17:06 -0700)]
PG: fix add_next_event and merge add_event
Previously, we would assume that we had an object at the prior_version
in the log event if we encounter it but don't see the object in missing.
Now, if prior_version < log_tail, we assume that we do not have the
object.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Greg Farnum [Tue, 28 Jun 2011 00:03:42 +0000 (17:03 -0700)]
mds: Explicitly initialize layout fields, and to the correct values.
We were previously encoding an fl_pg_preferred of 0, which did
horrible things to the kernel client since 0 is a valid osd to ask for!
To make such things easier to track down in the future, explicitly
fill in defaults when memsetting the struct here.
(There remain other places that memset the struct to zero without
a lot of checks. But we definitely don't want to force them all
to fill in the individual fields, as that's fragile, and since they
don't seem to be breaking anything yet I'm inclined to leave them as
they are.)
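The pattern of zeroing the struct and then filling in the defaults that must not be zero can be sketched as below. `FileLayout` is a hypothetical, trimmed-down struct rather than the real layout definition, and the -1 "no preferred OSD" sentinel is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical, trimmed-down layout struct; the real one has more fields.
struct FileLayout {
  std::uint32_t stripe_unit;
  std::uint32_t stripe_count;
  std::uint32_t object_size;
  std::int32_t pg_preferred;  // preferred OSD; note that 0 is a real OSD id
};

// Assumed sentinel meaning "no preferred OSD".
constexpr std::int32_t kNoPreferredOsd = -1;

FileLayout make_default_layout() {
  FileLayout layout;
  std::memset(&layout, 0, sizeof(layout));
  // Zero would name a real OSD, so the default is set explicitly.
  layout.pg_preferred = kNoPreferredOsd;
  assert(layout.pg_preferred != 0);
  return layout;
}
```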