Samuel Just [Wed, 29 Jun 2011 01:11:55 +0000 (18:11 -0700)]
ReplicatedPG: project changes to clone_overlap
Previously, changes to clone_overlap were incorrect since make_writeable
is called after do_osd_ops. Now, ctx->modified_ranges will hold the set
of offsets modified during the transaction to be applied to
clone_overlap during make_writeable.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
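A minimal sketch of the idea, not the ReplicatedPG code: writes made during
the transaction are recorded as (offset, length) ranges, and those ranges are
subtracted from the clone overlap afterwards. OpContext, record_write,
subtract_ranges and the list-of-tuples representation are assumptions made
for illustration; only modified_ranges, clone_overlap and make_writeable are
names taken from the commit message.

```python
def subtract_ranges(overlap, modified):
    """Return overlap minus modified; both are lists of (offset, length)."""
    result = []
    for o_off, o_len in sorted(overlap):
        pieces = [(o_off, o_off + o_len)]  # half-open [start, end)
        for m_off, m_len in modified:
            if m_len == 0:
                continue  # an empty write changes nothing
            m_start, m_end = m_off, m_off + m_len
            next_pieces = []
            for start, end in pieces:
                if m_end <= start or m_start >= end:
                    next_pieces.append((start, end))  # no intersection
                    continue
                if start < m_start:
                    next_pieces.append((start, m_start))  # part before the write
                if m_end < end:
                    next_pieces.append((m_end, end))  # part after the write
            pieces = next_pieces
        result.extend((s, e - s) for s, e in pieces)
    return result


class OpContext:
    """Hypothetical stand-in for the per-transaction context."""

    def __init__(self):
        self.modified_ranges = []

    def record_write(self, offset, length):
        # Called while the ops run; the overlap is not touched yet.
        self.modified_ranges.append((offset, length))


def make_writeable(ctx, clone_overlap):
    # Runs after the ops, so it sees every range they modified.
    return subtract_ranges(clone_overlap, ctx.modified_ranges)
```

With an overlap of [(0, 100)] and writes at (10, 20) and (50, 10), the
remaining overlap is [(0, 10), (30, 20), (60, 40)].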
Build tests (that check if there are unresolved symbols in libraries)
can slow down the build a lot. We should only enable them when
developers need them.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
* change RETURN_IF_NOT_VAL -> RETURN1_IF_NOT_VAL.
We want to return a non-zero error code when the value is something we
don't expect, even if that unexpected value is 0.
* st_rados_list_objects: add option to ignore list errors (for the
deletion test)
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Sage Weil [Wed, 13 Jul 2011 00:27:41 +0000 (17:27 -0700)]
mds: move to MIX state if writer wanted and no wanted loner
We can just look at the target loner here, which also takes any caps wanted
by other replicas on other MDSs into account. Otherwise we need to
duplicate the CInode::calc_ideal_loner() logic.
This assumes the loner field is accurate.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 13 Jul 2011 20:18:38 +0000 (13:18 -0700)]
mds: migrate loner_cap state
It is tedious to infer what the old loner_cap was pre-migration. Just send
it over the wire and set it explicitly. Usually when we eval() we would
have come to the same conclusion, but when we didn't, we got into
inconsistent/impossible states where the issued caps don't match the loner
state (Asx issued but no loner_cap set). That meant the next issue was
a revocation with no lock state change, which led to yet more problems
down the line.
Sage Weil [Tue, 12 Jul 2011 23:49:25 +0000 (16:49 -0700)]
mds: verify deferred messages aren't stale
We may defer processing of some messages because we are laggy (in hearing
from the monitor). When we eventually get to those messages, make sure
they haven't since become stale (i.e., the source mds isn't now down).
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
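A minimal sketch of the staleness check, assuming each deferred message
carries the rank of the MDS that sent it and that the set of down ranks is
known when the queue is drained. process_deferred, source_rank and
down_ranks are hypothetical names, not the MDS API.

```python
from collections import deque


def process_deferred(deferred, down_ranks, handle):
    """Handle each deferred message unless its source MDS is now down.

    Returns the list of messages dropped as stale.
    """
    dropped = []
    while deferred:
        msg = deferred.popleft()
        if msg["source_rank"] in down_ranks:
            dropped.append(msg)  # source went down while we were laggy
            continue
        handle(msg)
    return dropped
```

The check happens at drain time rather than at enqueue time, because the
source can go down in between.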
Sage Weil [Mon, 11 Jul 2011 18:35:15 +0000 (11:35 -0700)]
mds: only issue xlocker_caps if we are also the loner
We cannot issue caps to a client purely because they have something
xlocked, because we do not revoke caps when we drop the xlock. However,
if they are a loner AND have the object xlocked, we can; this is why the
xlock release code either moves to LOCK or EXCL state.
Remember, the goal here is to issue caps when we do operations on objects
that xlock them (e.g. setattr, mknod) and move directly to the EXCL state
afterward. That only works (or makes sense) when we are the lone client
with caps.
Fix the get_caps_allowed_by_type() helper to do this properly.
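The rule can be sketched as follows. This is an illustration under stated
assumptions, not the body of get_caps_allowed_by_type(): the function name,
its arguments and the integer cap masks are hypothetical.

```python
def caps_allowed_for_client(client, loner, xlock_holders,
                            any_caps, loner_caps, xlocker_caps):
    """Return the cap mask a client may be issued.

    xlock_holders maps lock name -> client holding the xlock.
    xlocker_caps maps lock name -> extra caps an xlocker may get.
    """
    if loner is not None and client == loner:
        allowed = loner_caps
        # Xlocker caps are added only for locks this loner has xlocked.
        for lock, holder in xlock_holders.items():
            if holder == client:
                allowed |= xlocker_caps.get(lock, 0)
        return allowed
    # Holding an xlock alone grants nothing extra: those caps would not
    # be revoked when the xlock is dropped.
    return any_caps
```

A client that holds an xlock but is not the loner falls through to the
any_caps branch.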
Sage Weil [Sun, 10 Jul 2011 21:15:38 +0000 (14:15 -0700)]
mds: rely on master to do anchor locks for slave_link_prep
The replica can't take all these locks without confusing things, since it
may need to unlock/relock, may screw up auth_pins, and worse. The master
can take the locks.
The only problem is that the master may not know if the inode has already
been anchored if the lock hasn't cycled since then. In that case, we take
more locks than we need to.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Sun, 10 Jul 2011 21:05:10 +0000 (14:05 -0700)]
mds: defer lock eval if freezing or frozen
We were only deferring if frozen. But if freezing we need to too, because
of the way cap messages are deferred. We defer cap messages if
- inode is frozen
- inode is freezing and locks are stable (to avoid starvation)
So if we are in a stable freezing state and start deferring caps, we can't
twiddle locks further or else we can
- potentially starve (okay, in rare cases)
- get stuck because we already started deferring cap messages
We would also screw up the cap message ordering if we became unstable again
and were allowed to start processing cap messages while others were still
deferred.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
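The two deferral conditions can be written as predicates. This is a sketch
with hypothetical function names; the boolean arguments stand for the inode
and lock state described above.

```python
def defer_cap_message(frozen, freezing, locks_stable):
    # Cap messages wait if the inode is frozen, or if it is freezing
    # and its locks are stable.
    return frozen or (freezing and locks_stable)


def defer_lock_eval_old(frozen, freezing):
    # Previous behavior: only a frozen inode deferred lock eval.
    return frozen


def defer_lock_eval(frozen, freezing):
    # New behavior: freezing defers lock eval too.
    return frozen or freezing
```

With the old predicate, a freezing inode with stable locks defers cap
messages while lock eval still runs. With the new one, every state that
defers cap messages also defers lock eval.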
Sage Weil [Fri, 8 Jul 2011 16:32:04 +0000 (09:32 -0700)]
mds: take a remote_wrlock on srcdir for cross-mds rename
This ensures that we hold a wrlock on the srcdn auth when the slave
makes its changes to the src directory, and prevents us from corrupting
the scatterlock state.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 8 Jul 2011 16:30:29 +0000 (09:30 -0700)]
mds: implement remote_wrlock
For the rename code to behave, we need to hold a wrlock on the slave node
to ensure that any racing gather (mix->lock) is not sent prior to the
_rename_prepare() running; otherwise we violate the locking rules and
corrupt rstats.
Implement a remote_wrlock that will be used by rename. The wrlock is held
on a remote node instead of the local node, and is set up similarly to
remote_xlocks.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Wed, 6 Jul 2011 20:48:35 +0000 (13:48 -0700)]
mds: add mix->lock(2) state
There is a problem with the wrlocks and cross-mds renames:
- master (dest auth, srci auth, srcdir replica) takes wrlock on srcdiri
- something triggers a srcdiri lock, putting inest/ifile lock in mix->lock
state
- slave (srcdir auth) sends LOCKACK
- master sends prepare_rename
- slave (srcdir auth) does rename prepare, which modifies srcdir
Even though the master holds a wrlock on the srcdiri, the gather starts
immediately and the slave sends the LOCKACK before the master's wrlock is
released.
To fix this, we add a new mix->lock(2) state, and we do not start the
mix->lock gather from replicas until the local gather completes, _after_
the auth's wrlock is released. This makes the master's wrlock sufficient
to ensure the prepare_rename on the slave is safe.
This also works when the slave is the srci auth, since the gather won't
complete until the master releases its wrlock. BUT, it does NOT work if a
third MDS is the srcdiri auth, since it can still gather from the slave
prior to the master releasing its wrlock.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 7 Jul 2011 21:13:14 +0000 (14:13 -0700)]
mon: fix up pending_inc pool op mess
You can't look at pending_inc in preprocess methods. Or return an error
based on pending_inc before it commits. Fix up the snap-related error
checking.
Sage Weil [Thu, 7 Jul 2011 20:35:32 +0000 (13:35 -0700)]
mds: set old and new dentry lease bits
Recent kernels got the new CEPH_LOCK_DN definition but we were still
setting the old bit. Set both so we work with both classes of clients. In
the meantime, update the kernel to ignore this field so that eventually we
can drop/reuse it.