Sage Weil [Tue, 29 Mar 2011 18:58:13 +0000 (11:58 -0700)]
cmon: add --inject-monmap option
This lets you manually inject a monmap into a down monitor. This is useful
in cases where you need to change the monmap but aren't able to get a
quorum with the old map.
Sage Weil [Fri, 25 Mar 2011 21:34:17 +0000 (14:34 -0700)]
mds: include .ceph in root directory
If the dentry isn't marked dirty, _commit_partial won't save it. This is
caught later by check_rstats() (or by anyone actually trying to use the
/.ceph directory).
Fixes: #938
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Fri, 25 Mar 2011 20:50:45 +0000 (13:50 -0700)]
mds: fix client session removal on journal replay
We want to remove the client session from the map as long as it is not
attached to an actual messenger Connection. This key point got lost
somewhere the last time the session states were restructured. It is now
explicit.
This fixes the symptom where a recovering MDS reconnect has to time out on
clients that cleanly closed their sessions.
Also, fix a use-after-free when (uselessly) printing the session state.
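The rule described above can be illustrated with a minimal sketch. This is not the MDS code: `Session`, `Connection`, `SessionMap`, and `replay_close` are hypothetical stand-ins, and the only behavior taken from the commit message is that a session is dropped on replay when it has no live connection, and that its state is read before the entry is freed.

```cpp
#include <cassert>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a messenger Connection.
struct Connection {};

// Hypothetical session record; connection is null when nothing is attached.
struct Session {
  std::string state;
  std::shared_ptr<Connection> connection;
};

class SessionMap {
 public:
  void add(int client, Session s) { sessions_[client] = std::move(s); }
  bool has(int client) const { return sessions_.count(client) > 0; }

  // Replay a journaled session close. Returns true if the session was removed.
  bool replay_close(int client) {
    auto it = sessions_.find(client);
    if (it == sessions_.end())
      return false;                      // nothing to remove
    if (it->second.connection)
      return false;                      // attached to a live connection: keep it
    // Copy what we want to log BEFORE erasing, so we never touch freed memory.
    const std::string state = it->second.state;
    sessions_.erase(it);
    std::clog << "removed client." << client << " (was " << state << ")\n";
    return true;
  }

 private:
  std::map<int, Session> sessions_;
};
```

The explicit connection check is the whole point: removal depends on whether a connection is attached, not on which state the session happens to be in.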
Sage Weil [Thu, 24 Mar 2011 03:58:03 +0000 (20:58 -0700)]
mds: remove mds_log_unsafe mode
The mds_log_unsafe mode would wait for ack for some journal writes, and
safe for others. Now that we can reply to client requests without waiting
for the journal to flush (as of ~2 years ago), this distinction is no
longer useful. It is also more error-prone, as it complicates the code
and vastly expands the possible combinations of MDS failures and replay
scenarios we need to verify.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 24 Mar 2011 00:17:44 +0000 (17:17 -0700)]
mds: reimplement laggy
The goal is for the MDS to stop processing requests when it hasn't heard
from the monitors, to avoid a situation where a rogue process goes off
doing its own thing. Yes, if we fail it over, the cmds can't write to the
object store, but it can still reply to clients when it may not be
appropriate or good to do so.
The old logic was fragile and wonky, with messages getting deferred, and
then re-deferred. This implementation is much cleaner and should be much
more efficient and less fragile. There are still improvements to be made
as far as which messages we do/do not process when we think we're laggy.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
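As a rough illustration of the idea (not the actual MDS implementation), the sketch below holds incoming messages while the daemon considers itself laggy and drains them once, in order, after the monitors are heard from again. The class name `LaggyGate`, the grace period, and the string messages are all assumptions made for the example; the commit message does not say how laggy is detected beyond not having heard from the monitors.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical gate: "laggy" means no monitor ack within the grace period.
class LaggyGate {
 public:
  LaggyGate(std::chrono::seconds grace, Clock::time_point now)
      : grace_(grace), last_ack_(now) {}

  void on_monitor_ack(Clock::time_point now) { last_ack_ = now; }

  bool is_laggy(Clock::time_point now) const {
    return now - last_ack_ > grace_;
  }

  // Process the message now, or park it exactly once while laggy.
  // Returns true if the message was processed immediately.
  bool handle(const std::string& msg, Clock::time_point now) {
    if (is_laggy(now)) {
      waiting_.push_back(msg);
      return false;
    }
    processed_.push_back(msg);
    return true;
  }

  // Drain parked messages in arrival order once we are no longer laggy.
  // Returns the number of messages drained.
  std::size_t resume(Clock::time_point now) {
    if (is_laggy(now))
      return 0;
    const std::size_t n = waiting_.size();
    while (!waiting_.empty()) {
      processed_.push_back(std::move(waiting_.front()));
      waiting_.pop_front();
    }
    return n;
  }

  const std::vector<std::string>& processed() const { return processed_; }

 private:
  std::chrono::seconds grace_;
  Clock::time_point last_ack_;
  std::deque<std::string> waiting_;
  std::vector<std::string> processed_;
};
```

Parked messages sit in a single queue and are never re-deferred, which is the property the old logic lacked. Which message types should bypass the gate is left open here, as it is in the commit message.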
Sage Weil [Thu, 24 Mar 2011 03:37:04 +0000 (20:37 -0700)]
mds: skip redundant flush before journal segment trim
Back in olden times when we would wait for acks for some journal
writes, we did an extra wait_for_safe() before discarding a journal segment
to make sure anything being discarded was safely committed in newer
segments. These days mds_log_unsafe is always false (and
journaler_safe is true), so we can skip this check.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Tue, 22 Mar 2011 04:38:36 +0000 (21:38 -0700)]
osd: factor pg get-or-create code into common helper
handle_pg_notify and _process_pg_info both look up or create a PG based
on an incoming message. Factor that code into a common helper. There
were a few differences in that the pg notify handler code deals with
more cases (namely, pg creation), but this is harmless for the more
general _process_pg_info caller.
Closes: #577
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
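The shape of such a helper might look like the sketch below. It is only an illustration under assumed names: `PG`, `PGRegistry`, and `get_or_create` are hypothetical and do not reflect the real OSD signatures. One function serves both callers, and the creation path is simply unused by a caller that never needs it.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical, minimal placement group record.
struct PG {
  std::string id;
};

class PGRegistry {
 public:
  // Look up pgid; if it is missing and may_create is set, create it.
  // Returns nullptr when the PG is missing and creation is not allowed.
  // *created (if given) reports whether this call created the PG.
  PG* get_or_create(const std::string& pgid, bool may_create,
                    bool* created = nullptr) {
    if (created)
      *created = false;
    auto it = pgs_.find(pgid);
    if (it != pgs_.end())
      return it->second.get();
    if (!may_create)
      return nullptr;
    auto pg = std::make_unique<PG>();
    pg->id = pgid;
    PG* raw = pg.get();
    pgs_[pgid] = std::move(pg);
    if (created)
      *created = true;
    return raw;
  }

 private:
  std::map<std::string, std::unique_ptr<PG>> pgs_;
};
```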
Samuel Just [Tue, 22 Mar 2011 21:52:15 +0000 (14:52 -0700)]
FileStore: replace op_queue_throttle with op_queue_reserve_throttle
Previously, queue_op would call op_queue_throttle while holding the
journal_lock. op_queue_throttle, however, can sleep.
We fix the problem as follows:
1) Factor build_op out of queue_op
2) op_queue_throttle is now op_queue_reserve_throttle and takes an op as
an argument. op_queue_reserve_throttle can be called before the journal
lock is taken. This also avoids the race between calling throttle and
incrementing op_queue_bytes and op_queue_len.
3) queue_op now takes the op generated using build_op as an argument.
4) _journaled_ahead no longer needs to call throttle as
queue_transactions has already reserved space.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
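The ordering described in the steps above can be sketched as follows. This is a simplified model, not FileStore itself: `Op`, `OpQueue`, `try_reserve`, `release`, and `submit` are hypothetical names, and the limits are arbitrary. What it shows is that the reservation may sleep but runs before the journal lock is taken, and that the throttle check and the counter update happen in one critical section, so no race remains between them.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for a built op; only its size matters here.
struct Op {
  uint64_t bytes;
};

class OpQueue {
 public:
  OpQueue(std::size_t max_ops, uint64_t max_bytes)
      : max_ops_(max_ops), max_bytes_(max_bytes) {}

  // May sleep. Waits for room and accounts for the op atomically.
  void reserve_throttle(const Op& op) {
    std::unique_lock<std::mutex> l(throttle_lock_);
    cond_.wait(l, [&] { return has_room(op); });
    account(op);
  }

  // Non-blocking variant, handy for single-threaded demonstration.
  bool try_reserve(const Op& op) {
    std::lock_guard<std::mutex> l(throttle_lock_);
    if (!has_room(op))
      return false;
    account(op);
    return true;
  }

  // Called when an op completes; wakes anyone waiting for room.
  void release(const Op& op) {
    {
      std::lock_guard<std::mutex> l(throttle_lock_);
      --len_;
      bytes_ -= op.bytes;
    }
    cond_.notify_all();
  }

  // Reserve first (may sleep), then take the journal lock (must not sleep
  // on the throttle while holding it).
  void submit(const Op& op) {
    reserve_throttle(op);
    std::lock_guard<std::mutex> j(journal_lock_);
    ++queued_;  // stands in for queue_op(op); space is already reserved
  }

  std::size_t len() const { return len_; }
  uint64_t bytes() const { return bytes_; }

 private:
  // An empty queue always admits one op, so an oversized op cannot wedge.
  bool has_room(const Op& op) const {
    return len_ == 0 ||
           (len_ < max_ops_ && bytes_ + op.bytes <= max_bytes_);
  }
  void account(const Op& op) {
    ++len_;
    bytes_ += op.bytes;
  }

  std::size_t max_ops_;
  uint64_t max_bytes_;
  std::size_t len_ = 0;
  uint64_t bytes_ = 0;
  std::size_t queued_ = 0;
  std::mutex throttle_lock_;
  std::mutex journal_lock_;
  std::condition_variable cond_;
};
```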
Sage Weil [Wed, 23 Mar 2011 23:53:10 +0000 (16:53 -0700)]
mds: add FIXME for snaprealm on rename slave
Replicas don't get the snaprealm opened or updated on rename.
For example:
everything on mds0
mksnap on /foo
/foo/bar has a replica on mds1
rename /foo/bar /bar
-> a snaprealm will get created for /bar on mds0
-> mds1 currently does not do anything about that... it needs to get an
accurate replica of the snaprealm portion of the inode, either from
the master, or via a lock update, or something.
See: #925
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Wed, 23 Mar 2011 23:46:07 +0000 (16:46 -0700)]
mds: remove bad open_snaprealm()
This was added in b438b3d65b478a25ae1b9cab2cdd16c851d65fc8. We don't
want it here, though, because this is a _remote_ dentry rename and all
we are doing to the inode is adjusting nlink. No snaprealms are involved
because the inode isn't moving in the namespace.
Greg Farnum [Fri, 18 Mar 2011 21:25:45 +0000 (14:25 -0700)]
MDCache: make linkunlink rstat propagation work properly.
We could be in a lock state (i.e., gather) where we can't take new locks.
But if we're in this function for linkunlink we have to already have
a lock, so in that case make sure the function succeeds and assert
that we do have a lock.