Sage Weil [Fri, 19 Jun 2009 19:45:36 +0000 (12:45 -0700)]
osd: pass updated stats to replica
When we ship the raw transaction to the replica, we need to ship the
new pg_stat_t as well, since the replica applies the transaction
verbatim and never runs prepare_transaction() to update the stats in
parallel.
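Schematically, the replica op ends up carrying both pieces (stand-in
types; not the actual message layout):

  // stand-ins for the real types
  struct Transaction { /* encoded ops, prepared on the primary */ };
  struct pg_stat_t   { /* per-pg stats */ };

  // the replica applies t verbatim and takes the primary's stats
  // as-is, since it never runs prepare_transaction() itself
  struct RepOp {
    Transaction t;
    pg_stat_t   stats;
  };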
Sage Weil [Fri, 19 Jun 2009 04:05:59 +0000 (21:05 -0700)]
osd: fix initialization of log.complete_to in PG::activate()
The complete_to should point to the next object to get, which
should be just PAST info.last_complete. That is because we
can trim the log up to and including last_complete (because
that entry is recovered), and we don't want to invalidate
the iterator.
That is,

  while (log.complete_to->version <= info.last_complete)
    log.complete_to++;

and in sub_op_push,

  while (...) {
    ...
    if (info.last_complete < log.complete_to->version)
      info.last_complete = log.complete_to->version;
    log.complete_to++;
  }
Sage Weil [Thu, 18 Jun 2009 23:37:03 +0000 (16:37 -0700)]
osd: make add_next_event behave when we start at backlog split point
Weaken the assertions a bit and just adjust missing appropriately.
Things may not match up perfectly if the split point is a backlog
entry, so just make missing what it should be and worry less about
what it was.
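Roughly, the weakened behavior looks like this (stand-in types; the
real code in osd/PG.h tracks more state than shown):

  #include <map>
  #include <string>

  struct eversion_t { unsigned epoch = 0, ver = 0; };  // stand-in
  struct Entry {                                       // stand-in log entry
    int op;
    std::string soid;
    eversion_t version, prior_version;
    bool is_delete() const { return op == 0; }         // 0 == DELETE here
  };

  struct Missing {
    struct item { eversion_t need, have; };
    std::map<std::string, item> missing;

    void add_next_event(const Entry &e) {
      if (e.is_delete()) {
        missing.erase(e.soid);
        return;
      }
      // a backlog entry at the split point may disagree with what we
      // already recorded, so set need outright instead of asserting
      // that missing[e.soid].need == e.prior_version
      missing[e.soid].need = e.version;
    }
  };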
Here is the specific crash:
09.06.18 16:29:15.085353 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 lcod 0'0 stray m=1] my log = log(0'0,5'4]+backlog
3'1 (0'0) m 200.00000000/head by mds0.1:1 09.06.18 16:20:07.524996 indexed
3'2 (0'0) m 2.00000000/head by mds0.1:5 09.06.18 16:20:07.527454 indexed
5'3 (3'1) m 200.00000000/head by mds0.1:23 09.06.18 16:20:25.128842 indexed
5'4 (5'3) m 200.00000000/head by mds0.1:35 09.06.18 16:20:48.623669 indexed
09.06.18 16:29:15.085393 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 lcod 0'0 stray m=1] osd2 log = log(8'68,9'69]+backlog
3'2 (0'0) b 2.00000000/head by mds0.1:5 09.06.18 16:20:07.527454
9'69 (8'68) m 200.00000000/head by mds0.1:1114 09.06.18 16:28:08.837907
09.06.18 16:29:15.085416 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 lcod 0'0 stray m=1] merge_log log(8'68,9'69]+backlog from osd2 into log(0'0,5'4]+backlog
09.06.18 16:29:15.085456 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 (log bound mismatch, actual=[3'2,9'69] len=2) lcod 0'0 stray m=1] merge_log split point is 3'2 (0'0) b 2.00000000/head by mds0.1:5 09.06.18 16:20:07.527454
09.06.18 16:29:15.085472 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 (log bound mismatch, actual=[3'2,9'69] len=2) lcod 0'0 stray m=1] merge_log merging 3'2 (0'0) b 2.00000000/head by mds0.1:5 09.06.18 16:20:07.527454
09.06.18 16:29:15.085493 1124096336 osd1 10 pg[1.8( v 5'4/3'2 (0'0,5'4] n=2 ec=2 les=10 10/3) r=1 (log bound mismatch, actual=[3'2,9'69] len=2) lcod 0'0 stray m=2] merge_log merging 9'69 (8'68) m 200.00000000/head by mds0.1:1114 09.06.18 16:28:08.837907
osd/PG.h: In function 'void PG::Missing::add_next_event(PG::Log::Entry&)':
osd/PG.h:494: FAILED assert(missing[e.soid].need == e.prior_version)
Sage Weil [Thu, 18 Jun 2009 21:23:29 +0000 (14:23 -0700)]
crush: redefine hash using __u32, for consistency across 32/64 bit
I'm pretty sure this was giving inconsistent results across archs,
because bits would get shifted up into the high 32 bits and then back
down again on x86_64, but not on x86_32.
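For illustration (a stand-in mixer, not CRUSH's actual hash), the same
shift-and-fold yields different low 32 bits depending on the width of
the working type:

  #include <cstdint>
  #include <cstdio>

  static unsigned long mix_long(unsigned long a) {  // 64-bit on x86_64
    a = (a << 13) ^ a;     // shifted bits survive in bits 32..63...
    return a + (a >> 19);  // ...and get folded back down here
  }

  static uint32_t mix_u32(uint32_t a) {             // 32 bits everywhere
    a = (a << 13) ^ a;     // shifted-out bits are dropped on every arch
    return a + (a >> 19);
  }

  int main() {
    unsigned long x = 0xdeadbeef;
    // low words disagree on x86_64; pinning the type to __u32/uint32_t
    // makes the result arch-independent
    printf("%08lx vs %08x\n",
           mix_long(x) & 0xffffffff, mix_u32((uint32_t)x));
  }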
Sage Weil [Thu, 18 Jun 2009 18:40:55 +0000 (11:40 -0700)]
osd: trim pg logs on recovery completion
When a replica finds itself fully up to date (last_complete ==
last_update) it tells the primary, and the primary checks the same
for itself. If the primary finds that min_last_complete_ondisk has
changed, it sends out a trim command.
This will let us drop huge pg logs out of memory after a recovery
without waiting for IO and the usual piggybacked trimming logic
to kick in.
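Schematically (names and shapes are illustrative, not the actual
messages):

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  using eversion = uint64_t;                // stand-in for eversion_t

  struct Peer { eversion last_complete_ondisk = 0; };

  // primary side, on hearing that a replica (or itself) is fully up
  // to date: recompute the floor across all replicas and, if it
  // advanced, tell everyone to trim up to it
  void check_trim(std::vector<Peer> &peers, eversion &min_lcod) {
    eversion m = UINT64_MAX;
    for (const Peer &p : peers)
      m = std::min(m, p.last_complete_ondisk);
    if (m != UINT64_MAX && m > min_lcod) {
      min_lcod = m;
      // send trim-to(m) to all replicas; each can free log entries
      // <= m right away instead of waiting for the trimming that
      // piggybacks on later writes
    }
  }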
Sage Weil [Wed, 17 Jun 2009 22:22:12 +0000 (15:22 -0700)]
kclient: fix readdir vs rm
Okay, do not rely on the MDS to provide dentry positioning information,
since it is all relative to the start _string_ we provide, and that
string's directory position can change without notice.
Simplify readdir a bit wrt seeks.  A seek to 0, to a new frag, or to a
point prior to the current chunk resets buffered state.
For each frag, we walk through chunks, always in order. We set
dentry positions/offsets based on the frag and position within our
sweep across the frag. Successive chunks are grabbed from the MDS
relative to a filename (not offset), so concurrent
insertions/removals don't bother us (although we will not see
insertions lexicographically prior to our position).
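The position encoding amounts to something like the following (the
24-bit split is an assumption for illustration):

  #include <cstdint>

  // readdir position: high bits name the frag, low bits are the index
  // within our in-order sweep across that frag
  static inline int64_t make_fpos(uint32_t frag, uint32_t off) {
    return ((int64_t)frag << 24) | off;
  }
  static inline uint32_t fpos_frag(int64_t pos) { return pos >> 24; }
  static inline uint32_t fpos_off(int64_t pos)  { return pos & 0xffffff; }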
Sage Weil [Wed, 17 Jun 2009 19:50:06 +0000 (12:50 -0700)]
osd: fix pps calculation from pgid.ps() and pgid.pool()
The final placement seed needs to factor in the pool, but that can't
be fed into stable_mod or you get weird results (for example, 1.ff and
1.adff won't necessarily map to the same thing, because of the
stable_mod).  Add the pool to the stable_mod result instead.  The seed
itself doesn't need to be bounded; it's just an input for CRUSH, so
long as there are a limited number of such inputs for a given pool.
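Concretely, something along these lines (a sketch; ceph_stable_mod as
commonly defined, the seed arithmetic per the description above):

  #include <cstdint>

  // stable modulo: for a given x the result stays put as pg_num
  // grows, as long as bmask tracks pg_num
  static uint32_t ceph_stable_mod(uint32_t x, uint32_t b, uint32_t bmask) {
    return ((x & bmask) < b) ? (x & bmask) : (x & (bmask >> 1));
  }

  // placement seed: bound ps with stable_mod first, then add the pool
  // so different pools get distinct CRUSH inputs; the sum need not be
  // bounded, it's only a seed
  static uint32_t calc_pps(uint32_t ps, uint32_t pool,
                           uint32_t pg_num, uint32_t pg_num_mask) {
    return ceph_stable_mod(ps, pg_num, pg_num_mask) + pool;
  }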
Needs to factor in frag_is_leftmost to account for . and .., just
like the fi->offset calculation in readdir_prepopulate. Fixes the
problem where an ls on a large dir returns duplicate entries.
Sage Weil [Tue, 16 Jun 2009 23:04:12 +0000 (16:04 -0700)]
kclient: specify smallish blksize for directories
This is mainly just because /bin/ls will use the size, or blocks,
or blksize to decide how big of a buffer to allocate for getdents,
and the default of 4MB is unreasonably big. 64k seems like an
okay number, I guess.
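As a sketch (illustrative helper, not the kclient's actual code path):

  #include <cstdint>

  // blksize reported to stat(): /bin/ls sizes its getdents buffer
  // from this, and 4MB is overkill for a directory listing
  static uint32_t stat_blksize(bool is_dir, uint32_t fs_blksize) {
    return is_dir ? 65536 : fs_blksize;   // 64k for directories
  }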
Sage Weil [Tue, 16 Jun 2009 22:03:50 +0000 (15:03 -0700)]
crush: fix perm_choose bug
We would get incorrect results if we calculated the same mapping
twice in a row in certain cases. Der. Also, the permutation
calculation was basically just wrong.
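For reference, an unbiased lazily-generated permutation with a cache
looks like this (names, layout, and the mixer are assumptions, not the
CRUSH source):

  #include <cstdint>
  #include <numeric>
  #include <utility>
  #include <vector>

  struct Bucket {
    uint32_t size;
    uint32_t perm_x = ~0u;     // input x the cached permutation is for
    uint32_t perm_n = 0;       // how many slots have been generated
    std::vector<uint32_t> perm;
  };

  static uint32_t mix(uint32_t x, uint32_t i) {       // stand-in hash
    uint32_t h = x * 2654435761u ^ i * 2246822519u;
    return h ^ (h >> 15);
  }

  // r'th item of the permutation of b's items for input x.  The key
  // points: reset perm_n whenever x changes (so a repeated query
  // can't reuse stale progress), and draw each swap target uniformly
  // from the untouched tail (lazy Fisher-Yates).
  uint32_t perm_choose(Bucket &b, uint32_t x, uint32_t r) {
    uint32_t want = r % b.size;
    if (b.perm_x != x) {
      b.perm.resize(b.size);
      std::iota(b.perm.begin(), b.perm.end(), 0u);
      b.perm_x = x;
      b.perm_n = 0;
    }
    for (uint32_t p = b.perm_n; p <= want; p++) {
      uint32_t i = p + mix(x, p) % (b.size - p);
      std::swap(b.perm[p], b.perm[i]);
    }
    if (want >= b.perm_n)
      b.perm_n = want + 1;
    return b.perm[want];
  }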
Sage Weil [Mon, 15 Jun 2009 23:16:57 +0000 (16:16 -0700)]
kclient: fix di->off calculation
The dentry dir offset calculation wasn't taking into account the
possibility of multiple readdir requests, which in turn meant bad
results for readdir-from-dcache.
Since doing this on the client side was a mess, the MDS includes a dentry
offset for each readdir dentry within the dirfrag. This value is stored
in di->offset (with an adjustment in the leftmost frag for . and ..), and that's
the value that's passed back via filldir.
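Wherever it is computed, the adjustment looks roughly like this (the
frag/offset split is assumed, as in the earlier readdir sketch):

  #include <cstdint>

  // in the leftmost frag, positions 0 and 1 belong to "." and "..",
  // so real entries start at 2; other frags start at 0
  static int64_t dentry_offset(uint32_t frag, bool frag_is_leftmost,
                               uint32_t index_in_frag) {
    uint32_t base = frag_is_leftmost ? 2 : 0;
    return ((int64_t)frag << 24) | (base + index_in_frag);
  }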
Sage Weil [Mon, 15 Jun 2009 22:35:10 +0000 (15:35 -0700)]
kclient: fix I_COMPLETE
The previous use of I_READDIR vs I_COMPLETE was flawed, mainly because
the state was maintained on a per-inode basis, but readdir proceeds on a
per-file basis.
Instead of flags, maintain a counter in the inode that is incremented each
time a dentry is released. When readdir starts, note the counter, and if
it is the same when readdir completes, AND we did not do any forward
seeks on the file handle, AND prepopulate succeeded on each hunk, then we
can set I_COMPLETE.
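A minimal sketch of the counter scheme (illustrative names, not the
kclient's actual fields):

  #include <cstdint>

  struct CephInode {
    uint64_t release_count = 0;  // bumped whenever a dentry is released
    bool complete = false;       // I_COMPLETE: dir fully cached
  };

  void dentry_released(CephInode &ci) {
    ci.release_count++;          // invalidates any readdir in flight
    ci.complete = false;
  }

  // at the end of a full readdir sweep over one file handle
  void maybe_set_complete(CephInode &ci, uint64_t count_at_start,
                          bool did_forward_seek, bool all_hunks_ok) {
    if (ci.release_count == count_at_start &&
        !did_forward_seek && all_hunks_ok)
      ci.complete = true;
  }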
Sage Weil [Fri, 12 Jun 2009 05:55:07 +0000 (22:55 -0700)]
osd: you may, but need not, specify READ|WRITE flag in MOSDOp.
The OSD will implicitly set the bits based on your OSDOps or class method
calls. The client may still find it useful to specify these explicitly
for its own informational purposes.
Make sure the MOSDOpReply has bits set based on the _actual_ op performed.
Note that as things stand, this will confuse the Objecter, who relies on
these bits to choose read or modify reply paths and doesn't know a priori
what mode a method is.
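Schematically (op kinds and flag values here are assumptions, not the
real wire protocol):

  #include <cstdint>
  #include <vector>

  enum : uint32_t { FLAG_READ = 1, FLAG_WRITE = 2 };
  enum class OpKind { Read, Stat, Write, Delete, Call };

  // the OSD ORs in the mode implied by each op it actually executed,
  // whether or not the client set the flags up front
  uint32_t reply_flags(const std::vector<OpKind> &ops_performed,
                       uint32_t client_flags) {
    uint32_t flags = client_flags;      // optional client hints survive
    for (OpKind op : ops_performed) {
      switch (op) {
      case OpKind::Read:
      case OpKind::Stat:   flags |= FLAG_READ;  break;
      case OpKind::Write:
      case OpKind::Delete: flags |= FLAG_WRITE; break;
      case OpKind::Call:
        // a class method's mode isn't known a priori; only the
        // reply's bits, set after the fact, tell the Objecter what
        // actually happened
        break;
      }
    }
    return flags;
  }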