objecter: add accounting to keep track of total in-flight messages.
If the user wishes, they can call throttle_op to hold an operation
until it fits within the limits. The user is responsible for
consistency guarantees and making sure the locking will work!
This means we don't need to take a mutex (and possibly force
ourselves to sleep!) in order to take or put. This is about to
be important since we're adding accounting to the Objecter.
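
A minimal sketch of mutex-free take/put accounting, assuming a plain atomic counter is enough. The class and method names below (InflightCounter, would_fit) are hypothetical and are not Ceph's actual throttle interface:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical lock-free in-flight counter (illustration only).
class InflightCounter {
  std::atomic<int64_t> count_{0};
  const int64_t limit_;

public:
  explicit InflightCounter(int64_t limit) : limit_(limit) {
    assert(limit > 0);
  }

  // Account for n units without blocking; returns the new total.
  int64_t take(int64_t n) {
    return count_.fetch_add(n, std::memory_order_relaxed) + n;
  }

  // Release n units taken earlier; returns the new total.
  int64_t put(int64_t n) {
    int64_t prev = count_.fetch_sub(n, std::memory_order_relaxed);
    assert(prev >= n);  // putting more than was taken is a caller bug
    return prev - n;
  }

  // Advisory check only: the answer can be stale by the time the caller
  // acts on it, so consistency remains the caller's responsibility.
  bool would_fit(int64_t n) const {
    return count_.load(std::memory_order_relaxed) + n <= limit_;
  }
};
```

Neither take nor put can sleep here; a caller that wants to wait for room has to layer that on top.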
msgr: when both ends support it, exchange in_seq values on reconnect
to prevent gratuitously re-sending messages.
This adds a new feature "CEPH_FEATURE_RECONNECT_SEQ" which goes into
the default msgr features, as well as a CEPH_MSGR_TAG_SEQ which indicates
this step is being taken and substitutes for CEPH_MSGR_TAG_READY.
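
A rough sketch of what the sender can do once it learns the peer's in_seq on reconnect, assuming it keeps sent-but-unacknowledged messages in sequence order. Message and discard_acked are hypothetical names, not the messenger's real types:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical stand-in for a queued message; only the sequence number matters.
struct Message {
  uint64_t seq;
};

// Drop every queued message the peer reports having already received
// (seq <= peer_in_seq). Whatever remains still needs to be re-sent.
// Returns the number of messages discarded.
std::size_t discard_acked(std::deque<Message>& sent, uint64_t peer_in_seq) {
  std::size_t dropped = 0;
  while (!sent.empty() && sent.front().seq <= peer_in_seq) {
    sent.pop_front();
    ++dropped;
  }
  return dropped;
}
```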
Sage Weil [Mon, 13 Sep 2010 17:03:17 +0000 (10:03 -0700)]
client: buffer name->ino linkage only; do stat at time of readdir
This makes the results more correct/consistent with the current state of
the cache at the time of the readdir(plus). It also means we buffer less
data for the readdir result. Inodes are pinned for the duration.
Sage Weil [Sat, 11 Sep 2010 21:18:54 +0000 (14:18 -0700)]
mon: remove laggy standby nodes (instead of marking laggy)
This will implicitly clean out old standby nodes that go away. Laggy
cmds instances that don't go away will just re-add themselves. Since the laggy flag is
a map change, this creates no more or less map churn.
Sage Weil [Sat, 11 Sep 2010 21:06:43 +0000 (14:06 -0700)]
osd: don't create clone_obc on replica
We don't need the clone_obc on the replica, since we don't read from there.
We don't have head obc's either.
We do, however, keep the snapset_obc, since there are a few paths that
depend on whether it exists on disk (to clean it up) and that information
isn't currently fed through from the primary.
Sage Weil [Thu, 9 Sep 2010 22:53:10 +0000 (15:53 -0700)]
mds: journal dirfrag accounted_{r,frag}stat on inode update
When the inode sucks in the dir's updated frag info, we need to journal
the dirfrag update as well. This ensures that any subsequent changes in
the dirfrag will be reflected by the difference between rstat and
accounted_rstat from the perspective of the journal contents (in case we
fail/recover during that process).
Sage Weil [Thu, 9 Sep 2010 22:37:09 +0000 (15:37 -0700)]
osdmap: allow blacklist of an entire ip
We can blacklist either a specific instance (1.2.3.4:1234/5678) or an
entire IP, in which case the table has something like "1.2.3.4:0/0" (a port
and nonce of 0).
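
The lookup this implies can be sketched as follows, assuming the table is a set keyed on (ip, port, nonce). Addr and is_blacklisted are hypothetical names used only for illustration, not the OSDMap API:

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <tuple>

// Hypothetical address type: ip:port/nonce.
struct Addr {
  std::string ip;
  uint16_t port;
  uint32_t nonce;

  bool operator<(const Addr& o) const {
    return std::tie(ip, port, nonce) < std::tie(o.ip, o.port, o.nonce);
  }
};

// An address is blacklisted if the exact instance is listed, or if the
// whole IP is listed as ip:0/0 (port and nonce of 0).
bool is_blacklisted(const std::set<Addr>& table, const Addr& a) {
  if (table.count(a) > 0)
    return true;
  return table.count(Addr{a.ip, 0, 0}) > 0;
}
```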
Sage Weil [Thu, 9 Sep 2010 17:56:22 +0000 (10:56 -0700)]
mon: handle subscribe to osdmap=1
We would send an incremental for anything >1, or the latest map, but not
osdmap e1 itself. Fix the condition, and make send_incremental() smart
about starting with the full map at 1 as needed.
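
One way to picture the fixed behaviour, as a sketch only: epoch 1 goes out as a full map and every later epoch as an incremental. MapToSend and plan_osdmap_send are hypothetical and do not mirror the monitor's real send_incremental() signature:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one map to send: which epoch, and whether
// it goes out as a full map or as an incremental.
struct MapToSend {
  uint32_t epoch;
  bool full;
};

// Plan what to send to a subscriber that asked for maps starting at
// `start`, given that the newest epoch is `latest`.
std::vector<MapToSend> plan_osdmap_send(uint32_t start, uint32_t latest) {
  assert(start >= 1);  // "give me the latest map" is not modelled here
  std::vector<MapToSend> out;
  for (uint32_t e = start; e <= latest; ++e)
    out.push_back({e, e == 1});  // only epoch 1 is sent as a full map
  return out;
}
```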
Sage Weil [Thu, 9 Sep 2010 17:01:24 +0000 (10:01 -0700)]
mds: fix journal replay of session close->open after reconnect
If the client reconnects, the journal 'close' replay doesn't remove the
session, which leaves the session state intact. It needs to reset it in
that case, or else we get problems if the session is reopened and the
state doesn't match up.
Reported-by: Nat N <phenisha@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Tue, 7 Sep 2010 21:57:49 +0000 (14:57 -0700)]
mds: fix replica state for mix->sync
Should be mix->sync(2): the same as a replica who already got the first
SYNC message and is waiting in mix->sync(2) for the final SYNC to indicate
the gather is completed.
Sage Weil [Tue, 7 Sep 2010 20:59:37 +0000 (13:59 -0700)]
mds: lock path, parent dir scatterlocks _after_ freezing
This fixes an ABBA deadlock between
acquire_locks(): auth_pins items, then locks in order
export_dir: locks paths, then freezes.
Instead, we check for lockability (but don't lock), do the freeze, and then
try to take the locks after. If we can't do so atomically, we currently
just fail. In theory this could wait for the distributed locks, but it's
probably not worth the complexity at this stage; export_dir is currently
still opportunistic and can bail out for a variety of reasons.
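
The check, freeze, then try-lock ordering can be sketched as below. This is a single-threaded toy model with hypothetical types (Lock, Dir, try_export), not the MDS migrator code; it shows only the ordering and the bail-out:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical non-blocking lock: try_lock() never waits.
struct Lock {
  bool held = false;
  bool try_lock() {
    if (held) return false;
    held = true;
    return true;
  }
  void unlock() { held = false; }
};

// Hypothetical directory with a freeze flag.
struct Dir {
  bool frozen = false;
};

// Returns true if the export may proceed with the dir frozen and all
// locks held; returns false (with everything rolled back) otherwise.
bool try_export(Dir& dir, const std::vector<Lock*>& locks) {
  // 1. Check lockability only; take nothing before the freeze.
  for (Lock* l : locks)
    if (l->held) return false;

  // 2. Freeze first, so we never wait on a freeze while holding locks.
  dir.frozen = true;

  // 3. Try to take every lock without waiting.
  std::size_t taken = 0;
  while (taken < locks.size() && locks[taken]->try_lock())
    ++taken;
  if (taken == locks.size()) return true;

  // 4. Could not get them all: undo and give up.
  for (std::size_t i = 0; i < taken; ++i)
    locks[i]->unlock();
  dir.frozen = false;
  return false;
}
```

Bailing out in step 4 rather than waiting matches the opportunistic behaviour described above.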
Sage Weil [Mon, 23 Aug 2010 22:36:14 +0000 (15:36 -0700)]
mds: can next state lockability checks in eval_gather
The can_* fields need to be ANY or AUTH, not REQ or XCL (at least not
without trickier checks). Otherwise we progress to the next state too
early and violate the locking rules.
Sage Weil [Tue, 7 Sep 2010 17:01:58 +0000 (10:01 -0700)]
osd: log error instead of crashing on failed pull attempt
If peering screws up and the primary mistakenly tries to pull an object
from us that we don't have, log an error instead of crashing. This will still
throw off recovery (it will hang), but that's better than crashing
outright.