Sage Weil [Tue, 12 Apr 2011 18:07:54 +0000 (11:07 -0700)]
mds: fix create_mydir_hierarchy to save dir
Mark the dentries dirty so they get saved to disk (they're not journaled!).
This fixes rstat problems on startup, where populate_mydir was recreating
the entries and munging rstats accordingly.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Tue, 5 Apr 2011 18:57:13 +0000 (11:57 -0700)]
osd: process missing when log is empty
There are important cases where the replica will send a missing set and
empty log to the primary during peering (e.g., when the primary asks for it).
In that case, we need to go into pg->proc_replica_log so that peer_missing
gets updated and peering can complete. It is okay from proc_replica_log's
perspective to pass in an empty log; it will have no effect. Fix the
if() guard appropriately.
Note that the only path into _process_pg_info where missing is NULL is
the handle_pg_info path, which is used for primary->replica "activate now"
messages, updates after already active, and for replica->primary "ok i
activated" messages.
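
A minimal sketch of the guard change described above, not the actual OSD code. The Log, Missing, and PG types and the process_pg_info signature are simplified, hypothetical stand-ins; only the names proc_replica_log and peer_missing come from the commit message. The point is that the guard tests whether a log and missing set were sent, not whether the log is non-empty.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the real OSD types.
struct Log {
  std::vector<std::string> entries;
  bool empty() const { return entries.empty(); }
};
struct Missing {
  std::set<std::string> objects;
};

struct PG {
  std::map<int, Missing> peer_missing;  // replica id -> its missing set

  // An empty log contributes nothing here, but the missing set is
  // still recorded so that peering can complete.
  void proc_replica_log(int from, const Log& log, const Missing& missing) {
    (void)log;
    peer_missing[from] = missing;
  }
};

// Returns true if the replica's state was passed to proc_replica_log.
// The guard checks that a log and missing set are present; it does not
// require the log to be non-empty.
bool process_pg_info(PG& pg, int from, const Log* log, const Missing* missing) {
  if (log && missing) {
    pg.proc_replica_log(from, *log, *missing);
    return true;
  }
  return false;  // e.g. the handle_pg_info path, where missing is NULL
}
```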
Sage Weil [Fri, 1 Apr 2011 18:00:44 +0000 (11:00 -0700)]
mds: fix discover_path
If we have the base dirfrag, do not request it. Otherwise we can get a
reply that contains only that (partial progress), and we will then fail
to wake up our dentry waiter.
Josh Durgin [Wed, 30 Mar 2011 23:41:00 +0000 (16:41 -0700)]
librbd: fix snapshot handling
To ensure consistency, always set the snap context when the header is
updated. If snapid is set, we update librados' snapid when refreshing
the header as well. Also use CEPH_NOSNAP instead of 0 as the default
snapid to prevent confusion. These changes fix snapshot creation
and removal, and prevent writing to a snapshot.
Rollback is fixed by using selfmanaged_snap_rollback.
Josh Durgin [Wed, 30 Mar 2011 22:00:55 +0000 (15:00 -0700)]
librados: add selfmanaged_snap_rollback
This was removed in 2cb86f713df38ebee6aa10a81157f99264a59a70, but is
required for selfmanaged snaps because their snapids aren't in the
pool's snap list, which is how regular rollback finds them.
Samuel Just [Wed, 30 Mar 2011 20:14:55 +0000 (13:14 -0700)]
mkcephfs: copy to daemon nodes for each daemon
The tmp directory is removed after each daemon. Previously, this would
break if two daemons were on the same node. Now, the files will be
copied for each daemon.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Wed, 30 Mar 2011 23:46:04 +0000 (16:46 -0700)]
journaler: don't block when we adjust back write_pos
is_readable() may need to adjust the write_pos backward, but will return
false. If we are at the end we still need to wake up any waiters so they
know about it.
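
A rough sketch of the behavior described above, under stated assumptions. The Journal struct, the valid_end field, and wake_waiters are hypothetical; only is_readable() and write_pos are named in the commit message. It shows is_readable() returning false after pulling write_pos back, while still waking waiters when the reader is already at the new end.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical, simplified journal reader state.
struct Journal {
  uint64_t read_pos = 0;
  uint64_t write_pos = 0;
  uint64_t valid_end = 0;  // assumption: where decodable data really ends
  std::vector<std::function<void()>> waiters;

  // Run and clear all pending waiter callbacks.
  void wake_waiters() {
    std::vector<std::function<void()>> ready;
    ready.swap(waiters);
    for (auto& cb : ready) cb();
  }

  bool is_readable() {
    if (read_pos < valid_end)
      return true;
    bool adjusted = false;
    if (write_pos > valid_end) {
      write_pos = valid_end;  // adjust write_pos backward
      adjusted = true;
    }
    // Nothing is readable, but if we are now at the end the waiters
    // must still be told, or they would block indefinitely.
    if (adjusted && read_pos == write_pos)
      wake_waiters();
    return false;
  }
};
```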
Sage Weil [Tue, 29 Mar 2011 18:58:13 +0000 (11:58 -0700)]
cmon: add --inject-monmap option
This lets you manually inject a monmap into a down monitor. This is useful
in cases where you need to change the monmap but aren't able to get a
quorum with the old map.
Sage Weil [Fri, 25 Mar 2011 21:34:17 +0000 (14:34 -0700)]
mds: include .ceph in root directory
If the dentry isn't marked dirty, _commit_partial won't save it. This is
caught later by check_rstats() (or by anyone actually trying to use the
/.ceph directory).
Fixes: #938
Signed-off-by: Sage Weil <sage@newdream.net>
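
A minimal sketch of why the dirty flag matters, assuming a commit routine that writes only dirty dentries. The Dir and Dentry types and the on_disk map are hypothetical illustrations, not the MDS data structures; only the name _commit_partial comes from the commit message.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a directory entry.
struct Dentry {
  std::string target;
  bool dirty = false;
};

// Hypothetical stand-in for a directory fragment.
struct Dir {
  std::map<std::string, Dentry> items;         // in-memory state
  std::map<std::string, std::string> on_disk;  // simulated stored copy

  void add(const std::string& name, const std::string& target, bool mark_dirty) {
    Dentry dn;
    dn.target = target;
    dn.dirty = mark_dirty;
    items[name] = dn;
  }

  // Write only the dirty dentries and clear their flags.
  // Returns the number of dentries written.
  int commit_partial() {
    int written = 0;
    for (auto& p : items) {
      if (!p.second.dirty)
        continue;  // a clean dentry is assumed to be on disk already
      on_disk[p.first] = p.second.target;
      p.second.dirty = false;
      ++written;
    }
    return written;
  }
};
```

In this sketch a dentry added without the dirty flag never reaches on_disk, which mirrors the symptom the fix addresses.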
Sage Weil [Fri, 25 Mar 2011 20:50:45 +0000 (13:50 -0700)]
mds: fix client session removal on journal replay
We want to remove the client session from the map as long as it is not
attached to an actual messenger Connection. This key point got lost
somewhere the last time the session states were restructured. It is now
explicit.
This fixes the symptom where a recovering MDS reconnect has to time out on
clients that cleanly closed their sessions.
Also, fix a use-after-free when (uselessly) printing the session state.
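
A small sketch of the rule stated above, under simplifying assumptions. SessionMap, Session, Connection, and replay_close are hypothetical stand-ins for the MDS types; the only behavior illustrated is that replay drops a session unless it is attached to a live Connection.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for the messenger connection and MDS session.
struct Connection {};
struct Session {
  std::shared_ptr<Connection> connection;  // null when not attached
};

struct SessionMap {
  std::map<std::string, Session> sessions;

  // Replay of a journaled session close. Returns true if the session
  // was removed from the map.
  bool replay_close(const std::string& client) {
    auto it = sessions.find(client);
    if (it == sessions.end())
      return false;
    if (it->second.connection)
      return false;  // attached to a live Connection: keep it
    sessions.erase(it);
    return true;
  }
};
```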
Sage Weil [Thu, 24 Mar 2011 03:58:03 +0000 (20:58 -0700)]
mds: remove mds_log_unsafe mode
The mds_log_unsafe mode would wait for ack for some journal writes, and
safe for others. Now that we can reply to client requests without waiting
for the journal to flush (as of ~2 years ago), this distinction is no
longer useful. It is also more error-prone, as it complicates the code
and vastly expands the possible combinations of MDS failures and replay
scenarios we need to verify.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 24 Mar 2011 00:17:44 +0000 (17:17 -0700)]
mds: reimplement laggy
The goal is for the MDS to stop processing requests when it hasn't heard
from the monitors, to avoid a situation where a rogue process goes off
doing its own thing. Yes, if we fail it over, the cmds can't write to the
object store, but it can reply to clients when it may not be appropriate
or good to do so.
The old logic was fragile and wonky, with messages getting deferred, and
then re-deferred. This implementation is much cleaner and should be much
more efficient and less fragile. There are still improvements to be made
as far as which messages we do/do not process when we think we're laggy.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 24 Mar 2011 03:37:04 +0000 (20:37 -0700)]
mds: skip redundant flush before journal segment trim
Back in olden times when we would wait for acks for some journal
writes, we did an extra wait_for_safe() before discarding a journal segment
to make sure anything being discarded was safely committed in newer
segments. These days mds_log_unsafe is always false (and
journaler_safe is true), so we can skip this check.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Tue, 22 Mar 2011 04:38:36 +0000 (21:38 -0700)]
osd: factor pg get-or-create code into common helper
handle_pg_notify and _process_pg_info both look up or create a PG based
on an incoming message. Factor that code into a common helper. There
were a few differences in that the pg notify handler code deals with
more cases (namely, pg creation), but this is harmless for the more
general _process_pg_info caller.
Closes: #577
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Samuel Just [Tue, 22 Mar 2011 21:52:15 +0000 (14:52 -0700)]
FileStore: replace op_queue_throttle with op_queue_reserve_throttle
Previously, queue_op would call op_queue_throttle while holding the
journal_lock. op_queue_throttle, however, can sleep.
We fix the problem by:
1) Factor build_op out of queue_op
2) op_queue_throttle is now op_queue_reserve_throttle and takes an op as
an argument. op_queue_reserve_throttle can be called before the journal
lock is taken. This also avoids the race between calling throttle and
incrementing op_queue_bytes and op_queue_len.
3) queue_op now takes the op generated using build_op as an argument.
4) _journaled_ahead no longer needs to call throttle as
queue_transactions has already reserved space.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
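
The ordering described in steps 1-4 can be sketched as follows. This is an illustration under stated assumptions, not the FileStore code: the OpQueue class, its limits, and finish_op are hypothetical, and only build_op, queue_op, and the reserve-before-lock idea come from the commit message. The throttle may sleep, so it runs before journal_lock is taken, and it accounts for the op in the same critical section that checks the limits.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for a queued op.
struct Op {
  uint64_t bytes;
};

class OpQueue {
  std::mutex throttle_lock;
  std::condition_variable throttle_cond;
  std::mutex journal_lock;
  const uint64_t max_ops, max_bytes;
  uint64_t queued_ops = 0, queued_bytes = 0;
  uint64_t submitted = 0;

public:
  OpQueue(uint64_t mo, uint64_t mb) : max_ops(mo), max_bytes(mb) {}

  // Step 1: building the op is separate from queueing it.
  Op build_op(uint64_t bytes) { return Op{bytes}; }

  // Step 2: may sleep, so call it before taking journal_lock. Checking
  // the limits and reserving space happen under one lock, so there is
  // no window between the check and the counter update.
  void reserve_throttle(const Op& op) {
    std::unique_lock<std::mutex> l(throttle_lock);
    throttle_cond.wait(l, [&] {
      // An op is always admitted into an empty queue so that an
      // oversized op cannot block forever.
      return queued_ops == 0 ||
             (queued_ops < max_ops && queued_bytes + op.bytes <= max_bytes);
    });
    ++queued_ops;
    queued_bytes += op.bytes;
  }

  // Step 3: takes an already built and reserved op; never sleeps on
  // the throttle while holding journal_lock.
  void queue_op(const Op& op) {
    std::lock_guard<std::mutex> j(journal_lock);
    (void)op;
    ++submitted;
  }

  // Release the reservation once the op has been applied.
  void finish_op(const Op& op) {
    {
      std::lock_guard<std::mutex> l(throttle_lock);
      --queued_ops;
      queued_bytes -= op.bytes;
    }
    throttle_cond.notify_all();
  }

  uint64_t ops() {
    std::lock_guard<std::mutex> l(throttle_lock);
    return queued_ops;
  }
  uint64_t bytes() {
    std::lock_guard<std::mutex> l(throttle_lock);
    return queued_bytes;
  }
};
```

A caller would run build_op, then reserve_throttle, then queue_op, and call finish_op when the op completes.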
Sage Weil [Wed, 23 Mar 2011 23:53:10 +0000 (16:53 -0700)]
mds: add FIXME for snaprealm on rename slave
Replicas don't get the snaprealm opened or updated on rename.
For example:
  everything on mds0
  mksnap on /foo
  /foo/bar has a replica on mds1
  rename /foo/bar /bar
   -> a snaprealm will get created for /bar on mds0
   -> mds1 currently does not do anything about that... it needs to get an
      accurate replica of the snaprealm portion of the inode, either from
      the master, or via a lock update, or something.
See: #925
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Wed, 23 Mar 2011 23:46:07 +0000 (16:46 -0700)]
mds: remove bad open_snaprealm()
This was added in b438b3d65b478a25ae1b9cab2cdd16c851d65fc8. We don't
want it here, though, because this is a _remote_ dentry rename and all
we are doing to the inode is adjusting nlink. No snaprealms are involved
because the inode isn't moving in the namespace.
Greg Farnum [Fri, 18 Mar 2011 21:25:45 +0000 (14:25 -0700)]
MDCache: make linkunlink rstat propagation work properly.
We could be in a lock state (i.e., gather) where we can't take new locks.
But if we're in this function for linkunlink we have to already have
a lock, so in that case make sure the function succeeds and assert
that we do have a lock.