Sage Weil [Thu, 18 Jun 2009 18:40:55 +0000 (11:40 -0700)]
osd: trim pg logs on recovery completion
When a replica finds itself fully up to date (last_complete ==
last_update) it tells the primary. The primary checks the same.
If the primary finds that min_last_complete_ondisk has changed,
it sends out a trim command.
This will let us drop huge pg logs out of memory after a recovery
without waiting for IO and the usual piggybacked trimming logic
to kick in.
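A minimal sketch of the trim decision described above, not the OSD code itself. Versions are modelled as plain integers, and the names Peer, maybe_trim and trim_log are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    # Versions are plain ints here; that is an assumption of this sketch.
    last_update: int
    last_complete: int
    last_complete_ondisk: int

    def fully_up_to_date(self):
        # The condition a replica reports to the primary.
        return self.last_complete == self.last_update


def maybe_trim(peers, prev_min):
    """Return (new_min, send_trim) for the primary.

    peers includes the primary itself. A trim command is only worth
    sending when the minimum last_complete_ondisk has changed.
    """
    new_min = min(p.last_complete_ondisk for p in peers)
    return new_min, new_min != prev_min


def trim_log(log, trim_to):
    # Drop every log entry at or below the trim point.
    return [v for v in log if v > trim_to]
```

In this model the primary would call maybe_trim whenever a peer reports that it is fully up to date, and each peer would apply trim_log with the value it receives.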
Sage Weil [Wed, 17 Jun 2009 22:22:12 +0000 (15:22 -0700)]
kclient: fix readdir vs rm
Okay, do not rely on MDS to provide dentry positioning information,
since it is all relative to the start _string_ we provide, and that
can change directory position without notice.
Simplify readdir a bit wrt seeks. A seek to 0, a new frag, or
prior to the current chunk resets buffered state.
For each frag, we walk through chunks, always in order. We set
dentry positions/offsets based on the frag and position within our
sweep across the frag. Successive chunks are grabbed from the MDS
relative to a filename (not offset), so concurrent
insertions/removals don't bother us (although we will not see
insertions lexicographically prior to our position).
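To illustrate why fetching chunks relative to a filename tolerates concurrent removals, here is a small sketch. It is a toy model, not the kclient code: fetch_chunk stands in for the MDS request, list_dir is a hypothetical callback returning the current sorted names in the frag, and positions are simple counters within the sweep.

```python
import bisect


def fetch_chunk(names, start_after, max_entries):
    # Stand-in for the MDS: return up to max_entries names that sort
    # strictly after start_after (None means start of the frag).
    begin = 0 if start_after is None else bisect.bisect_right(names, start_after)
    return names[begin:begin + max_entries]


def readdir_frag(list_dir, max_entries):
    """Walk one frag in order, chunk by chunk.

    list_dir() returns the frag's current sorted names, so the directory
    may change between chunks. Positions come from our own sweep, not
    from anything the server reports.
    """
    pos = 0
    last = None
    entries = []
    while True:
        chunk = fetch_chunk(list_dir(), last, max_entries)
        if not chunk:
            break
        for name in chunk:
            entries.append((pos, name))
            pos += 1
        last = chunk[-1]
    return entries
```

If entries already returned are removed between chunks, the next request still resumes after the last name seen, whereas a numeric offset into the shrunken listing would skip entries.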
Sage Weil [Wed, 17 Jun 2009 19:50:06 +0000 (12:50 -0700)]
osd: fix pps calculation from pgid.ps() and pgid.pool()
The final placement seed needs to factor in pool, but that can't be
fed into stable_mod or you get weird results (for example, 1.ff and
1.adff won't necessarily map to the same thing because of the
stable_mod). Add pool to the stable_mod result, instead. The seed
itself doesn't need to be bounded; it's just an input for CRUSH.
Just so long as there are a limited number of such inputs for a given
pool.
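A rough sketch of the fixed calculation: apply stable_mod to the placement seed alone, then add the pool. The stable_mod body is modelled on the usual Ceph helper; placement_seed and the way the mask is derived from pg_num are assumptions made for this example.

```python
def stable_mod(x, b, bmask):
    # Map x into [0, b) so that most values keep their slot as b grows.
    # bmask is the smallest (2^n - 1) that is >= b - 1.
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def placement_seed(ps, pool, pg_num):
    # Hypothetical helper: bound ps first, then add the pool.
    # The result is not bounded by pg_num; it is only a CRUSH input.
    bmask = (1 << (pg_num - 1).bit_length()) - 1
    return stable_mod(ps, pg_num, bmask) + pool


# pgids 1.ff and 1.adff with 256 PGs in pool 1 give the same seed.
same = placement_seed(0xFF, 1, 256) == placement_seed(0xADFF, 1, 256)
```

Because the pool is added after the modulo, each pool still produces at most pg_num distinct seeds.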
Needs to factor in frag_is_leftmost to account for . and .., just
like the fi->offset calculation in readdir_prepopulate. Fixes the
problem where an ls on a large dir returns duplicate entries.
Sage Weil [Tue, 16 Jun 2009 23:04:12 +0000 (16:04 -0700)]
kclient: specify smallish blksize for directories
This is mainly just because /bin/ls will use the size, or blocks,
or blksize to decide how big of a buffer to allocate for getdents,
and the default of 4MB is unreasonably big. 64k seems like an
okay number, I guess.
Sage Weil [Tue, 16 Jun 2009 22:03:50 +0000 (15:03 -0700)]
crush: fix perm_choose bug
We would get incorrect results if we calculated the same mapping
twice in a row in certain cases. Der. Also, the permutation
calculation was basically just wrong.
Sage Weil [Mon, 15 Jun 2009 23:16:57 +0000 (16:16 -0700)]
kclient: fix di->off calculation
The dentry dir offset calculation wasn't taking into account the
possibility of multiple readdir requests, which in turn meant bad results
for readdir-from-dcache.
Since doing this on the client side was a mess, the MDS includes a dentry
offset for each readdir dentry within the dirfrag. This value is stored
in di->offset (with adjustment in leftmost frag for . and ..), and that's
the value that's passed back via filldir.
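The adjustment can be pictured as below. This is only a sketch: the function name is hypothetical, and the shift of exactly 2 in the leftmost frag is an assumption based on "." and ".." occupying the first two positions.

```python
def dentry_dir_offset(mds_offset, frag_is_leftmost):
    # mds_offset: position of the dentry within its dirfrag, as sent by the MDS.
    # In the leftmost frag, "." and ".." come first, so shift by 2 (assumed).
    return mds_offset + (2 if frag_is_leftmost else 0)
```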
Sage Weil [Mon, 15 Jun 2009 22:35:10 +0000 (15:35 -0700)]
kclient: fix I_COMPLETE
The previous use of I_READDIR vs I_COMPLETE was flawed, mainly because
the state was maintained on a per-inode basis, but readdir proceeds on a
per-file basis.
Instead of flags, maintain a counter in the inode that is incremented each
time a dentry is released. When readdir starts, note the counter, and if
it is the same when readdir completes, AND we did not do any forward
seeks on the file handle, AND prepopulate succeeded on each hunk, then we
can set I_COMPLETE.
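The counter scheme might look roughly like this. It is a toy model with hypothetical class and field names, not the kclient structures; only the three conditions come from the description above.

```python
class Inode:
    def __init__(self):
        self.release_count = 0  # incremented each time a dentry is released
        self.complete = False   # stands in for the I_COMPLETE flag

    def dentry_released(self):
        self.release_count += 1


class DirHandle:
    """Per-file readdir state; the inode only holds the shared counter."""

    def __init__(self, inode):
        self.inode = inode
        self.start_count = inode.release_count  # noted when readdir starts
        self.seeked_forward = False
        self.prepopulate_ok = True

    def finish(self):
        # Set the flag only if nothing was released during the sweep, no
        # forward seek skipped entries, and every chunk was prepopulated.
        ok = (self.inode.release_count == self.start_count
              and not self.seeked_forward
              and self.prepopulate_ok)
        if ok:
            self.inode.complete = True
        return ok
```

Keeping the sweep state on the file handle means two concurrent readdirs on the same directory cannot clobber each other's progress.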
Sage Weil [Fri, 12 Jun 2009 05:55:07 +0000 (22:55 -0700)]
osd: you may, but need not, specify READ|WRITE flag in MOSDOp.
The OSD will implicitly set the bits based on your OSDOps or class method
calls. The client may still find it useful to specify these explicitly
for its own informational purposes.
Make sure the MOSDOpReply has bits set based on the _actual_ op performed.
Note that as things stand, this will confuse the Objecter, which relies on
these bits to choose read or modify reply paths and doesn't know a priori
which mode a method uses.
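One way to picture the implicit flag handling is sketched below. The flag values, the op names and the OP_MODE table are all hypothetical, and OR-ing the derived bits into the client's flags is an assumption about how "implicitly set" behaves.

```python
# Hypothetical flag bits; not the real wire values.
FLAG_READ = 0x1
FLAG_WRITE = 0x2

# Hypothetical table of what each op (or class method call) actually does.
OP_MODE = {
    "read": FLAG_READ,
    "stat": FLAG_READ,
    "write": FLAG_WRITE,
    "delete": FLAG_WRITE,
}


def effective_flags(client_flags, ops):
    # The client may pass 0; the bits are derived from the ops themselves.
    flags = client_flags
    for op in ops:
        flags |= OP_MODE[op]
    return flags


def reply_flags(ops):
    # The reply reflects only what was actually performed.
    return effective_flags(0, ops)
```

A client that sent a method call with no flags would only learn from the reply whether it was a read or a modify, which is the Objecter problem noted above.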