mds: remove broken delay of cap releases from a replica.
This hasn't worked in a very long time and serves little purpose
since the clients will have their own delayed cap releases.
Nix a few of the repeated asserts while we're at it.
Revert "mds: Only change in->replica_caps_wanted when actually messaging"
This reverts commit a2c761e62acdb3cff941867c224ae295cf6337b3. We actually
want to change this whenever we try to send a message, and we do want
to send messages during state REJOIN (the auth MDS will take the message if it's
appropriate to do so; otherwise it drops the message because the information
it contains is going to arrive anyway when we tell the MDS our entire
replicated state as part of its rejoin). Instead, we're going to fix when we
send messages so that it's not broken.
mds: reorder timing checks in request_inode_file_caps
We do want to hold onto caps for a few seconds after the client
closes the file, just in case it decides to re-open it! With the
old arrangement the keep time was never moved off of zero.
I think this has just been broken since it was written: previously it
dropped the caps if the keep time was after current time. Since the
keep time was never set to non-zero except after failing this test,
and only changed once, if you didn't come into this function again
within the 2-second window then you would never drop the caps.
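A minimal sketch of the intended behaviour (start a keep window on the first release attempt, drop only after it expires). This is an illustration, not the actual request_inode_file_caps code: the class, method name, and the 2-second constant (taken from the window mentioned above) are assumptions.

```python
KEEP_SECONDS = 2.0  # assumed from the "2-second window" in the message above


class ReplicaCaps:
    """Hypothetical stand-in for the per-inode state; not Ceph's data structure."""

    def __init__(self):
        self.keep_until = 0.0  # zero means "no keep window started yet"
        self.wanted = True

    def try_release(self, now):
        """Return True if the caps are dropped at time `now`."""
        if self.keep_until == 0.0:
            # First attempt: start the keep window instead of dropping.
            self.keep_until = now + KEEP_SECONDS
            return False
        if now < self.keep_until:
            # Still inside the window: keep holding the caps.
            return False
        # Window expired: drop the caps and reset the timer.
        self.wanted = False
        self.keep_until = 0.0
        return True
```

With this ordering the keep time moves off zero on the first call, and a later call after the window has passed releases the caps.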
Sage Weil [Fri, 22 Jul 2011 15:41:18 +0000 (08:41 -0700)]
mds: fix ambiguous check when journaling subtree map
We journal the EImportStart--and become ambiguous--when we set the state
to IMPORT_LOGGINGSTART; the subtree auth becomes ambiguous a few stages
before that.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 21 Jul 2011 19:54:00 +0000 (12:54 -0700)]
common: add Formatter class
This is based on the RGW class, but
- uses a stringstream
- has an additional dump_stream() method that gives you a usable ostream
- handles object keys properly
We should merge these implementations.
Not sure if either a stringstream or the raw buffer in the rgw class is
ideal. Probably output should accumulate on a bufferlist so we can avoid
ever reallocating for big dumps.
Filer: return error codes from probing up to the calling layer.
This is pretty limited; if you get multiple errors in one batch
of probes it'll only return the last one to get sent back (ENOENT is
excluded). But it's better than nothing.
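As a rough illustration of the limitation described above, a sketch of "last error wins, ENOENT ignored" over a batch of probe results. The function name and the convention of negative errno return codes are assumptions for the example, not the Filer interface.

```python
import errno


def combine_probe_results(results):
    """Collapse a batch of probe return codes (0 or negative errno) into one.

    Only the last non-ENOENT error survives; earlier errors are overwritten.
    """
    err = 0
    for r in results:
        if r < 0 and r != -errno.ENOENT:
            err = r
    return err
```

A batch that fails with EIO and then EACCES reports only EACCES to the caller, which is the "pretty limited" part.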
mds: Drop locks and auth pins when waiting for freezingtree
In most cases we don't end up in this branch because there's an escape
if you already have an auth_pin on the ref in question. But when
you've got snapshots going on, you can process the request, block on
a lock state change, commit a snapshot change to the inode, and then
try to process the request again. On this second attempt, though,
you end up with a different ref which you don't have a pin on.
Deadlock and breakage ensue!
To fix, drop all locks and auth pins when you wait -- either
you don't have any to drop anyway, or you've got them on a previous
version of the inode that isn't useful anymore.
uclient: correctly initialize mseq in flush_snaps.
Previously we set mseq=0 unconditionally; this was a mistake that
crept in via bitrot. Instead, set mseq from the auth cap.
This bug meant that any inodes which had gotten migrated between
MDSes would have their flush_snap messages dropped by the MDS.
Resolves #1324.
Sage Weil [Wed, 20 Jul 2011 19:46:19 +0000 (12:46 -0700)]
mds: simplify journaled subtree_map
We may have subtrees split locally due to migrations that are just getting
started or stopped. Simplify the map we journal to disk. Among other
things this makes the replay check simple (it can compare against the
current "live" map).
Sage Weil [Fri, 15 Jul 2011 23:36:07 +0000 (16:36 -0700)]
mds: witness rmdir when subtrees are on other hosts
If there is an rmdir with an empty subtree on another mds, we need to witness/
journal that on the dirfrag's auth mds so that replay correctly updates the
subtree map.
This is simpler than the rename witnesses (and the link/unlink ones) because
we aren't actually journaling a modification to any actual metadata; it's just
the subtree map that is changing.
The projection of the subtree map update still needs work, but that is also the
case for renames.
Sage Weil [Tue, 19 Jul 2011 18:25:46 +0000 (11:25 -0700)]
mds: clean up file flags to file mode translation
There was some seriously wrong and ancient cruft in there. open(2)
specifies that one of O_RDONLY, O_WRONLY, and O_RDWR must always be
specified. Drop all the crazy.
Linux VFS interprets O_WRONLY|O_RDWR as read+write, so we'll do the same.
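A small sketch of the translation rule described above: look only at the access-mode bits, and treat O_WRONLY|O_RDWR as read+write. The MODE_RD/MODE_WR constants and the function name are hypothetical, chosen for the example rather than taken from the MDS code.

```python
import os

# Hypothetical mode bits for illustration only.
MODE_RD = 1
MODE_WR = 2


def flags_to_mode(flags):
    """Map open(2) flags to a read/write mode using only the access bits."""
    access = flags & (os.O_RDONLY | os.O_WRONLY | os.O_RDWR)
    if access == os.O_RDONLY:
        return MODE_RD
    if access == os.O_WRONLY:
        return MODE_WR
    # O_RDWR, and also O_WRONLY|O_RDWR, are treated as read+write.
    return MODE_RD | MODE_WR
```

Other flags such as O_CREAT do not affect the result because they are masked out before the comparison.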
Introduce a new configuration variable, internal_safe_to_start_threads.
This will be set by common_init_finish once it is safe to start threads.
That will trigger callbacks for any configuration observers listening
for this event.
This is used by ProfLogger to hold off on starting the UNIX domain
socket thread until it is safe to do so.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
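The observer mechanism in the message above could be sketched roughly as follows. Only the key name internal_safe_to_start_threads comes from the commit; the Config class, its methods, and the callback signature are assumptions for illustration and do not mirror md_config_t.

```python
class Config:
    """Hypothetical config store that notifies observers of changed keys."""

    def __init__(self):
        self.values = {"internal_safe_to_start_threads": False}
        self.observers = []  # list of (watched_keys, callback)

    def add_observer(self, keys, callback):
        self.observers.append((set(keys), callback))

    def set(self, key, value):
        self.values[key] = value
        # Notify every observer that registered interest in this key.
        for keys, callback in self.observers:
            if key in keys:
                callback(self, key)


class SocketLogger:
    """Hypothetical observer that defers its thread until it is safe."""

    def __init__(self):
        self.thread_started = False

    def on_change(self, conf, key):
        if conf.values["internal_safe_to_start_threads"]:
            self.thread_started = True  # a real logger would start its thread here
```

The logger does nothing at construction time; it only acts once the init code flips the flag.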
Write out the length of the message first, so that it's easier to write
clients. Also, serialize ProfLogger instances to memory rather than
directly to the socket, to avoid blocking while holding one of the
ProfLogger locks.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
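A sketch of the two ideas in the message above: serialize to memory while holding the lock, then send a length-prefixed frame without the lock held. The commit does not state the width or byte order of the length field, so the 4-byte big-endian prefix here is an assumption, as are all class and function names.

```python
import struct
import threading


class LoggerSnapshot:
    """Hypothetical logger: counters guarded by a lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.values = {}

    def serialize(self):
        # Copy under the lock; the (possibly blocking) socket write happens later.
        with self.lock:
            items = sorted(self.values.items())
        return ",".join(f"{k}={v}" for k, v in items).encode()


def frame(payload):
    """Prefix the payload with its length (assumed 4-byte big-endian)."""
    return struct.pack("!I", len(payload)) + payload


def read_frame(buf):
    """Client side: read the length first, then exactly that many bytes."""
    (n,) = struct.unpack_from("!I", buf, 0)
    body = buf[4:4 + n]
    if len(body) < n:
        raise ValueError("short read")
    return body
```

Because the length comes first, a client knows how many bytes to wait for before it tries to parse anything.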
* Replace existing proflogger config options with "profiling_logger_uri".
This option controls profiling logger sinks.
* ProfLogger: replace file-writing code with code that sends the
information over a UNIX domain socket.
* handle_conf_change is now fully and correctly implemented. We never
read from the md_config_t structure except in this function, so there
are no races. We re-create the thread when the settings change (no need
for SIGHUP, etc.)
* Replace the single big lock with a lock per ProfLogger.
* No need for favg any more; just use fset everywhere for floating-point
variables.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>