Sage Weil [Sun, 23 Oct 2011 23:16:03 +0000 (16:16 -0700)]
osd: pg_pool_t: set crash_replay_interval on data pool when decoding old
We want to preserve the crash_replay_interval on old clusters being
upgraded. Kludge this by setting it to 60 (the old default) if the
crush_ruleset == 0 and owner == 0, which is normally true for just the
data pool.
This may catch other pools that users created by hand, but it's still
better than applying the replay interval to all pools when it is not needed.
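A minimal sketch of the decode-time kludge described above. PoolSketch, apply_legacy_replay_default() and the encoding_has_interval flag are hypothetical names used only for illustration; this is not the real pg_pool_t decoder.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the decoded pool fields.
struct PoolSketch {
  uint32_t crush_ruleset = 0;
  uint64_t owner = 0;
  uint32_t crash_replay_interval = 0;
};

// The old default mentioned in the commit message.
constexpr uint32_t kOldDefaultReplayInterval = 60;

// Call after decoding. encoding_has_interval is an assumed flag meaning
// "the encoded pool already carried a crash_replay_interval field".
void apply_legacy_replay_default(PoolSketch& p, bool encoding_has_interval) {
  if (encoding_has_interval)
    return;  // new encoding: keep whatever was decoded
  if (p.crush_ruleset == 0 && p.owner == 0)
    p.crash_replay_interval = kOldDefaultReplayInterval;  // looks like the data pool
  else
    p.crash_replay_interval = 0;  // no replay window for other pools
}
```

As the message notes, a hand-created pool with crush_ruleset == 0 and owner == 0 would also match this heuristic.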
Sage Weil [Sun, 23 Oct 2011 22:32:58 +0000 (15:32 -0700)]
osd: make osd replay interval a per-pool property
Change the config value to only control the interval set when the data
pool is first created (presumably during mkfs). Start replay interval
based on the pool property.
Introduce a per-pool crash_replay_interval so we can control whether
the OSD waits for replayed ACKed but not COMMITted requests for this
PG. For the metadata and rbd pools, for instance, the replay window
is useless.
Introduce a generic flags field, while we're modifying the encoding.
Sage Weil [Sun, 23 Oct 2011 03:41:03 +0000 (20:41 -0700)]
osd: fix PG::Log::copy_after wrt backlogs (again)
Commit 68fe748fc2d703623050e8f2a448a0fd31ca8a0f fixed half of this problem,
but set this->tail incorrectly. If we read olog.tail, the entry we are
on is a backlog entry, and probably not other.tail. Do not reset tail in
this case because we already set it to other.tail above.
OTOH if we hit v, we do want to set this->tail to the current record as it
is the one that precedes the first log entry.
This fixes an incorrect log.tail sent to other nodes, which eventually
propagates as a log bound mismatch. For example,
Sage Weil [Fri, 21 Oct 2011 22:23:51 +0000 (15:23 -0700)]
osd: move may_need_replay calculation out of PriorSet
Although they both depend on past intervals, they are unrelated. Factor
out the may_need_replay calculation from PriorSet. Instead, do it right
before we activate when we need to decide whether to do a replay window
or not.
Sage Weil [Fri, 21 Oct 2011 22:02:34 +0000 (15:02 -0700)]
osd: fix last_clean interval bounds
It was _first and _last, inclusive, but the epochs are really points in
time, so _last should have been non-inclusive. Rename the variables
_begin and _end, print them as proper intervals [begin,end), and fix the
PriorSet calculation to interpret the end bound properly.
Also break that check out into separate cases so that it is clear what is
really happening.
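To illustrate the half-open [begin,end) convention described above, here is a small sketch. EpochIntervalSketch and its helpers are hypothetical names, not the types used in the OSD code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical half-open epoch interval: begin is included, end is not.
struct EpochIntervalSketch {
  uint32_t begin;
  uint32_t end;
};

// An epoch lies in the interval if begin <= epoch < end.
bool contains(const EpochIntervalSketch& i, uint32_t epoch) {
  return i.begin <= epoch && epoch < i.end;
}

// Print as a proper interval, e.g. "[3,7)".
std::string format_interval(const EpochIntervalSketch& i) {
  return "[" + std::to_string(i.begin) + "," + std::to_string(i.end) + ")";
}
```

With this convention, the end epoch is the first epoch that is no longer covered, so adjacent intervals can share a boundary without overlapping.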
Sage Weil [Fri, 21 Oct 2011 21:45:59 +0000 (14:45 -0700)]
mon: fix last_clean_interval calculation
This up_from == first check is old and wrong. It may have been correct at
the time, when the OSD had a defined shutdown procedure, but that is not
currently the case. And if/when it is, the OSD can simply provide an
accurate clean_thru value.
Sage Weil [Fri, 21 Oct 2011 21:44:56 +0000 (14:44 -0700)]
osd: eliminate CRASHED state
This was an intermediate state that indicated that replay would be needed.
It was poorly named, and not very useful. Instead, just set the REPLAY
bit if we need replay, and then do it. No need for a separate CRASHED.
Sage Weil [Fri, 21 Oct 2011 16:56:19 +0000 (09:56 -0700)]
osd: simplify finalizing scrub on replica
We can simply call osr.flush() (with pg lock held) to ensure that prior
writes are visible and scrubbable. This avoids the funky handoff to
op_applied() (which didn't seem to work for me just now, although I didn't
fully debug).
Sage Weil [Fri, 21 Oct 2011 16:14:15 +0000 (09:14 -0700)]
osd: PriorSet: acting/up membership implies still alive
If an OSD is in the acting or up set, we can assume it is still alive,
even though we don't know that for sure, because if it is not, we will
rebuild PriorSet.
Note that we have a dependency here on up_thru that we could/should rebuild
PriorSet based on, IF we think it might change the value of the CRASHED
flag and IF we care enough. Right now we don't. Marking CRASHED when we
don't need to is conservative, and not dangerous.
Josh Durgin [Fri, 21 Oct 2011 00:13:21 +0000 (17:13 -0700)]
OSDMonitor: reweight towards average utilization
The existing reweight-by-utilization calculation did not take into
account the current weight of an OSD, and depended in part on the
threshold given by the user. Also send the user both the old and new
weights.
Sage Weil [Thu, 20 Oct 2011 22:46:11 +0000 (15:46 -0700)]
osd: PgPriorSet: kill whoami; make PG arg strictly optional
It is only used for the debug output prefix. Make it so we can leave it
out entirely (e.g. for unit tests).
We don't want to, say, pass in the string prefix itself, or else we are
stuck with generating that string even on low debug levels where it won't
be used.
Sage Weil [Thu, 20 Oct 2011 21:11:20 +0000 (14:11 -0700)]
osd: fix requeue_ops
The ls argument passed to requeue_ops() is a reference, and one of the
methods we call (say, _handle_op) might want to requeue the message on the
same list we were passed, leading to an infinite loop.
Set ls contents aside to avoid that.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
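The "set ls contents aside" fix can be sketched as follows. MessageSketch, requeue_ops_sketch() and the handler signature are hypothetical simplifications, not the real OSD interfaces.

```cpp
#include <cassert>
#include <functional>
#include <list>

// Hypothetical message type standing in for the real message pointers.
using MessageSketch = int;

// Drain ls into a local list before dispatching, so that a handler which
// requeues onto ls cannot extend the loop we are currently running.
// Returns the number of messages dispatched.
int requeue_ops_sketch(
    std::list<MessageSketch>& ls,
    const std::function<void(MessageSketch, std::list<MessageSketch>&)>& handle) {
  std::list<MessageSketch> pending;
  pending.swap(ls);  // ls is now empty; pending holds the original contents
  int dispatched = 0;
  for (MessageSketch m : pending) {
    handle(m, ls);  // the handler may push onto ls; pending is unaffected
    ++dispatched;
  }
  return dispatched;
}
```

Even if the handler puts every message back on ls, the loop visits each original entry exactly once; the requeued entries simply wait on ls for a later pass.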
Sage Weil [Thu, 20 Oct 2011 16:19:45 +0000 (09:19 -0700)]
perfcounters: use simple names
We don't need to uniquely identify ourselves in the global namespace with
the PerfCounter name, only in the current process. Collectd will handle
the per-daemon naming part.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Tue, 18 Oct 2011 21:12:06 +0000 (14:12 -0700)]
test_filestore_idempotent: simple tool to generate a workload of non-idempotent operations
Generate a workload of operations that are non-idempotent. These are:
  transaction {
    clone A -> A.($n-1)
    write $n to A
  }
  $n++
  loop!
If we apply any transaction to the file system more than once, we will
find that the A.$n object does not contain $n, but instead contains
some larger value.
First run in 'write' mode to generate a workload and fake a crash.
Then run in 'verify' mode to see if the result was bad.
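The effect of replaying one of these transactions can be sketched with an in-memory simulation. This is only an illustration under assumed names (StoreSketch, apply_txn, first_bad_clone); the real tool runs against a FileStore and fakes a crash instead.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical in-memory object store: object name -> stored value.
using StoreSketch = std::map<std::string, int>;

// One transaction from the workload: clone A -> A.(n-1), then write n to A.
void apply_txn(StoreSketch& s, int n) {
  s["A." + std::to_string(n - 1)] = s["A"];
  s["A"] = n;
}

// Verify step: return the first n in [0, last_n) whose clone A.n does not
// contain n, or -1 if every clone holds the expected value.
int first_bad_clone(const StoreSketch& s, int last_n) {
  for (int n = 0; n < last_n; ++n) {
    auto it = s.find("A." + std::to_string(n));
    if (it == s.end() || it->second != n)
      return n;
  }
  return -1;
}
```

Starting from A = 0 and applying transactions 1, 2, 3 once each leaves every clone correct. Applying transaction 2 twice clones the already-updated A, so A.1 ends up holding 2 rather than 1, which is the "larger value" the verify mode looks for.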
Sage Weil [Tue, 18 Oct 2011 20:38:21 +0000 (13:38 -0700)]
mds: handle xattrs on inode creation
Allow mknod, mkdir, symlink, create to provide xattrs for the new
inode. This will be used by the kclient to set ACLs on new inodes
based on the parent directory.
Sage Weil [Tue, 18 Oct 2011 23:04:22 +0000 (16:04 -0700)]
radosgw-admin: fix conflict with KeyType in libnss
rgw/rgw_admin.cc:459:6: error: using typedef-name 'KeyType' after 'enum'
/usr/include/nss3/keythi.h:69:3: error: 'KeyType' has a previous declaration here
Sage Weil [Tue, 18 Oct 2011 18:40:06 +0000 (11:40 -0700)]
osd: PgPriorSet: restructure lost checks for prior set
When we add down osds to the cur set, we block peering because there
are OSDs that may have data we need and they are not currently up.
When that happens, marking those OSDs as lost may allow peering to
proceed.
Keep an explicit map blocked_by for exactly that set of OSDs (a subset
of cur), and compare lost_by values in prior_set_affected() to that.
Any single OSD from a given interval surviving is sufficient to ensure
that an ACKed write during that interval was committed to disk.
Currently, at least.
In any case, update the prior set calculation to reflect that. Also
make the survival conditional a bit smarter, to include both last_clean
interval (from the OSD's previous interval of up-ness) as well as the
current interval [up_from..up_thru).
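One way the blocked_by comparison in this commit could look is sketched below. It assumes blocked_by records, for each blocking OSD, the lost epoch that was seen when the prior set was built; the type names and the current_lost map are hypothetical and the real prior_set_affected() may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using epoch_sketch_t = uint32_t;

// Hypothetical slice of the prior set: only the OSDs that block peering.
struct PriorSetSketch {
  // osd id -> lost epoch observed when the prior set was computed (0 = not lost)
  std::map<int, epoch_sketch_t> blocked_by;
};

// Returns true if any blocking OSD now has a different lost epoch, meaning
// the prior set should be recomputed. OSDs outside blocked_by are ignored.
bool prior_set_affected_sketch(const PriorSetSketch& prior,
                               const std::map<int, epoch_sketch_t>& current_lost) {
  for (const auto& entry : prior.blocked_by) {
    auto it = current_lost.find(entry.first);
    epoch_sketch_t now = (it == current_lost.end()) ? 0 : it->second;
    if (now != entry.second)
      return true;
  }
  return false;
}
```

Restricting the comparison to blocked_by means that marking an unrelated OSD lost does not trigger a rebuild.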
Sage Weil [Tue, 18 Oct 2011 00:51:53 +0000 (17:51 -0700)]
osd: do not short-cut up_thru update for new PGs
Commit e731885d2550ee985bf875ab5bb5faf28f1693eb made it possible for
a new PG to go active without forcing the OSDs' up_thru to update.
This was motivated by the desire for PG creation by radosgw to go
faster. Radosgw no longer creates a pool per bucket, so this is not
useful there, and it is unclear what other application (that is not
abusing rados pools) would need it.
Since it complicates the prior set calculation for dubious reasons,
let's revert it.
Sage Weil [Tue, 18 Oct 2011 00:44:32 +0000 (17:44 -0700)]
osd: PgPriorSet: revert start_since_joining check
Commit 5b78f5db8c200edcc949033e1badae70fecd2e08 added a check to
prevent some sort of badness when osds were marked lost, but I can't
figure out what it was. Remove the check for now until we can
reproduce/observe the badness in practice, and then write a test and
better motivated/documented fix.
Sage Weil [Tue, 18 Oct 2011 00:02:23 +0000 (17:02 -0700)]
osd: PgPriorSet: do not include UP osds in prior.cur
The up osds are not (directly) relevant since they are not necessarily
members of the PG. We only care about acting OSDs, which may have
committed writes to the PG during this past interval.
The issue: we redo the prior set calculation if the up_thru for these
OSDs changes in the current map, but the prior_set result does not
depend on the current map's up_thru values in any way; it only depends
on the up_thru in the last epoch of each past interval, and that is
fixed in the past.
Sage Weil [Mon, 17 Oct 2011 15:50:54 +0000 (08:50 -0700)]
ceph.spec: work around build.opensuse.org
The redhat-rpm-config isn't installed on build.opensuse.org, which means
the processor is set to i386 instead of something less ancient. This
breaks compilation on 32-bit x86.
Sage Weil [Mon, 17 Oct 2011 03:37:28 +0000 (20:37 -0700)]
osd: fix add_next_event Missing::item::have
The missing set should be accurate up to the current point in the log. The
log_tail has no bearing on that, nor does last_update, since we're
processing new events in forward order, and updating missing as we go.
Sage Weil [Thu, 13 Oct 2011 19:47:57 +0000 (12:47 -0700)]
osd: all_unfound_are_queried_or_lost
The check to make isn't whether all locations are lost, but whether all
locations are either lost or have been queried and don't have the object(s)
we want.
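The corrected condition can be sketched as a predicate over candidate locations. LocationSketch and its fields are hypothetical; the real check works on the PG's own peer state.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical per-location state for an unfound object.
struct LocationSketch {
  bool lost;                    // the OSD has been marked lost
  bool queried_without_object;  // we asked it, and it does not have the object
};

// True only when every location is either lost or has been queried and
// turned out not to hold the object(s) we want.
bool all_unfound_are_queried_or_lost_sketch(
    const std::vector<LocationSketch>& locations) {
  return std::all_of(locations.begin(), locations.end(),
                     [](const LocationSketch& l) {
                       return l.lost || l.queried_without_object;
                     });
}
```

A location that is neither lost nor queried keeps the result false, since it might still hold the object.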