C_Gather objects are deleted by the last sub-context to execute.
If you create a C_Gather object manually, you must worry about the case
where there are no sub-contexts.
C_GatherBuilder is a little object that sits on the stack that allows
you to build C_Gather objects without worrying about this.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
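The lifetime problem described in this commit can be illustrated with a minimal sketch. This is not the actual C_Gather or C_GatherBuilder code: the Gather and GatherBuilder classes below, and their new_sub(), has_subs() and activate() methods, are hypothetical stand-ins built on std::function. The sketch only shows why a stack-side builder helps when there turn out to be no sub-contexts.

```cpp
#include <functional>
#include <utility>

// Hypothetical stand-in for a gather object. It lives on the heap and is
// deleted by whichever sub-context happens to finish last.
class Gather {
public:
  explicit Gather(std::function<void()> on_finish)
    : on_finish_(std::move(on_finish)) {}

  // Hand out one sub-context. Each one must be called exactly once.
  std::function<void()> new_sub() {
    ++pending_;
    return [this] { --pending_; maybe_finish(); };
  }

  // Signal that no more sub-contexts will be created.
  void activate() { activated_ = true; maybe_finish(); }

private:
  ~Gather() = default;  // heap-only: destroyed via "delete this"

  void maybe_finish() {
    if (activated_ && pending_ == 0) {
      on_finish_();
      delete this;  // the last sub-context to execute frees the gather
    }
  }

  std::function<void()> on_finish_;
  int pending_ = 0;
  bool activated_ = false;
};

// Hypothetical builder that sits on the stack. It creates the Gather lazily,
// so the "no sub-contexts" case never allocates one at all.
class GatherBuilder {
public:
  explicit GatherBuilder(std::function<void()> on_finish)
    : on_finish_(std::move(on_finish)) {}

  std::function<void()> new_sub() {
    if (!gather_)
      gather_ = new Gather(on_finish_);
    return gather_->new_sub();
  }

  bool has_subs() const { return gather_ != nullptr; }

  // Must be called once after all subs are created. With no subs, the
  // finisher runs immediately instead of waiting forever (or leaking).
  void activate() {
    if (gather_) {
      Gather* g = gather_;
      gather_ = nullptr;  // the gather may delete itself inside activate()
      g->activate();
    } else {
      on_finish_();
    }
  }

private:
  std::function<void()> on_finish_;
  Gather* gather_ = nullptr;
};
```

In this sketch a caller constructs a GatherBuilder on the stack, calls new_sub() zero or more times, and then calls activate(). The caller never has to special-case an empty set of sub-contexts.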
Samuel Just [Wed, 29 Jun 2011 22:37:23 +0000 (15:37 -0700)]
osd: don't spew spurious scrub unreserve messages
The past primary was sending out scrub unreserve messages to all the
non-primary OSDs in the acting set on a PG state change. They're
spurious since the other OSDs will cancel the scrubs themselves
on state change, and they weren't right anyway because the loop
was looking at all the non-primary OSDs and sending out a message,
which could have excluded the new primary (if it was a replica before),
included other OSDs new to the PG, and included the current OSD.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
Sage Weil [Wed, 29 Jun 2011 22:23:40 +0000 (15:23 -0700)]
mds: fix mds scatter_writebehind starvation
scatter_writebehind is called by eval_gather on dirty locks, and
eval_gather is called by wrlock_finish on unstable locks when you
drop the last wrlock...and scatter_writebehind force-takes a wrlock.
This meant that a workload like:
    seq 3000|xargs -i mkdir a/b/{} &
    mkdir a/c
could cause the mkdir a/c to wait until after the other process
finished because rstats can propagate upwards asynchronously, but
mark the directory dirty synchronously, while the mkdir a/c requires
an actual wrlock in order to modify the rstats.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
Greg Farnum [Wed, 29 Jun 2011 17:46:06 +0000 (10:46 -0700)]
messenger: add a set_ip function to initialize the IP
Previously we only filled in IPs if they were set by the config file
(at startup) or after we connected to the monitor. Unfortunately this
could lead to conditions where the OSD connected to itself without
knowing that's what it was doing, because the cluster_addr IP wasn't
filled in until much too late. We've provided a mechanism for filling
in the IP and do so in OSD::boot_start.
Sage Weil [Wed, 29 Jun 2011 04:47:01 +0000 (21:47 -0700)]
mds: fix snaprealm split for dir inode
The snaprealm root directory inode belongs to the snaprealm, at least
currently. This makes split_at() consistent with the non-directory case
at the top of the method, and prevents a crash later down the line when we
try to tear down the parent snaprealm and we aren't part of it.
Fixes: #1238
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Greg Farnum [Tue, 28 Jun 2011 21:13:55 +0000 (14:13 -0700)]
filepath: don't parse multiple slashes as multiple dname bits.
This causes all kinds of trouble if it occurs because most of the
code isn't prepared for it. So prevent that from happening except
on messages that were explicitly created that way.
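As a rough illustration of the parsing rule described here, the sketch below splits a path into name components while treating any run of slashes as a single separator. split_path() is a hypothetical free function written for this example and is not the filepath class itself.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not the real filepath code): split a path into
// dentry-name components. Any run of '/' counts as one separator, so an
// empty component is never produced.
std::vector<std::string> split_path(const std::string& path) {
  std::vector<std::string> bits;
  std::string::size_type pos = 0;
  while (pos < path.size()) {
    // Skip one or more consecutive slashes.
    while (pos < path.size() && path[pos] == '/')
      ++pos;
    if (pos == path.size())
      break;
    // Find the end of this component.
    std::string::size_type end = path.find('/', pos);
    if (end == std::string::npos)
      end = path.size();
    bits.push_back(path.substr(pos, end - pos));
    pos = end;
  }
  return bits;
}
```

With this rule, "a//b///c" yields the three components a, b and c, where a naive split on every slash would also emit empty names.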
Greg Farnum [Tue, 28 Jun 2011 19:36:00 +0000 (12:36 -0700)]
Journaler: pay attention to return codes from read head.
Previously we ignored them, except for printing them out. This could
lead to bad things like creating new journals for non-existent MDSes
if you entered an invalid rank during --reset-journal.
Also assert that the stripe unit is valid
before using it as a divisor.
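The two changes in this commit follow a common pattern, sketched below under stated assumptions. JournalHeader, its fields, and handle_read_head() are hypothetical names for illustration, not the Journaler interface. The sketch assumes the read reports 0 on success and a negative errno on failure.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Hypothetical, simplified header. Field names are illustrative only.
struct JournalHeader {
  uint64_t stripe_unit;
  uint64_t write_pos;
};

// Hypothetical completion handler for a "read head" operation.
// r is assumed to be 0 on success or a negative errno on failure.
int handle_read_head(int r, const JournalHeader& h, uint64_t* stripe_index) {
  if (r < 0)
    return r;  // propagate the error instead of only printing it

  // Validate the divisor before dividing by it.
  assert(h.stripe_unit != 0);
  *stripe_index = h.write_pos / h.stripe_unit;
  return 0;
}
```

Returning the error early means a failed read (for example, of a journal that does not exist) stops the caller before it acts on an uninitialized header.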
Samuel Just [Fri, 24 Jun 2011 00:06:02 +0000 (17:06 -0700)]
PG: fix add_next_event and merge add_event
Previously, we would assume that we had an object at the prior_version
in the log event if we encounter it but don't see the object in missing.
Now, if prior_version < log_tail, we assume that we do not have the
object.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
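The rule change can be sketched roughly as follows. This is a simplified model, not the PG code: versions are plain integers rather than the real version type, and MissingItem, LogEntry and this add_next_event() signature are hypothetical. A "have" value of 0 stands for "we hold no copy of the object".

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical, simplified types. have == 0 means "no local copy".
struct MissingItem {
  uint64_t need;
  uint64_t have;
};

struct LogEntry {
  std::string oid;
  uint64_t version;
  uint64_t prior_version;
};

// Sketch of the rule described above, with log_tail passed in explicitly.
void add_next_event(std::map<std::string, MissingItem>& missing,
                    const LogEntry& e, uint64_t log_tail) {
  auto it = missing.find(e.oid);
  if (it != missing.end()) {
    // Already missing: we now need the newer version.
    it->second.need = e.version;
    return;
  }
  // Not in missing. Only trust prior_version if the log covers it.
  // If it predates the log tail, assume we do not have the object.
  uint64_t have = (e.prior_version < log_tail) ? 0 : e.prior_version;
  missing[e.oid] = MissingItem{e.version, have};
}
```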
Greg Farnum [Tue, 28 Jun 2011 00:03:42 +0000 (17:03 -0700)]
mds: Explicitly initialize layout fields, and to the correct values.
We were previously encoding an fl_pg_preferred of 0, which did
horrible things to the kernel client since 0 is a valid osd to ask for!
To make such things easier to track down in the future, explicitly
fill in defaults when memsetting the struct here.
(There remain other places that memset the struct to zero without
a lot of checks. But we definitely don't want to force them all
to fill in the individual fields, as that's fragile, and since they
don't seem to be breaking anything yet I'm inclined to leave them as
they are.)
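A minimal sketch of the "memset, then fill in explicit defaults" approach is shown below. The file_layout struct is a hypothetical, cut-down stand-in for the real layout struct, and the use of -1 to mean "no preferred OSD" is an assumption made for this example.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical, simplified layout struct for illustration only.
struct file_layout {
  uint32_t fl_stripe_unit;
  uint32_t fl_stripe_count;
  uint32_t fl_object_size;
  int32_t fl_pg_preferred;
};

// Zero the struct, then overwrite any field for which zero is a
// meaningful (and therefore wrong) value.
file_layout make_default_layout() {
  file_layout l;
  std::memset(&l, 0, sizeof(l));
  // Assumption: -1 means "no preference". 0 would name a real OSD.
  l.fl_pg_preferred = -1;
  return l;
}
```

Keeping the explicit assignment next to the memset makes it obvious at the call site which fields cannot safely default to zero.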
Sage Weil [Fri, 24 Jun 2011 18:19:48 +0000 (11:19 -0700)]
client: fix trim_caps
We can't blindly remove caps from inodes because we need at least one cap
for any inode in our cache. Try to trim non-essential caps (when there's
>1), otherwise try to drop referring dentries and indirectly release caps
that way.
Sage Weil [Fri, 24 Jun 2011 17:46:32 +0000 (10:46 -0700)]
client: prefer auth cap in caps_issued_mask()
If we have an auth cap, prefer to touch that one. This helps consolidate
caps on a single mds and allows mds replicas to eventually recall their
state.
Sage Weil [Fri, 24 Jun 2011 17:45:46 +0000 (10:45 -0700)]
client: touch cap on lookup even if we use the dentry lease
Touch the dir cap for the lease's mds even if we use the lease to traverse.
This makes trim_caps() behave better because it keeps the dentry and
session cap LRUs in sync.
Sage Weil [Thu, 23 Jun 2011 21:26:34 +0000 (14:26 -0700)]
osd: instrument readable latency too
Time before a write is readable (not necessarily on disk). Note that if we
get the commit first (e.g. writeahead journal) this value isn't calculated
or logged.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>