Sage Weil [Wed, 22 Sep 2010 18:31:12 +0000 (11:31 -0700)]
mon: move election start reset to starting_election() helper
An election can start either because we call it, or because someone else
calls it. Either way, we need to reset our state, so move that code into
the starting_election() callback, which is called by the elector's
start()/call_election() anyway.
This hopefully fixes a case where we see a timeout expire on the monitor
and fail the assertion
Sage Weil [Tue, 21 Sep 2010 20:41:13 +0000 (13:41 -0700)]
mds: correctly set straydn->first for rename target
Make sure the straydn->first matches the rename target (destdnl->inode).
Unfortunately the cow happens _after_ the destdn->first is set, so instead
of trivially copying it, we dup the MAX calculation. Add some temp
variables to clean up similar code in this method.
Sage Weil [Tue, 21 Sep 2010 20:44:02 +0000 (13:44 -0700)]
mds: do full pre_dirty()/mark_dirty() on cowed dentries
The dir commit/fetch and LogSegment::try_to_expire() rely on any new or
dirty items in the directory getting new versions that correspond to a bump in
the dirfrag version. This must include dentries/inodes that are created
by the cow process, or else we have problems during dir commit/fetch or
segment expire.
Change the dirty list in the Mutation to include the pv so that we can
properly mark them dirty later.
Leave the inode one alone. We could theoretically do the same for the
dirty inodes, but this way we avoid projecting them and copying stuff
around. Any dirty cowed inode will also have a dirty dentry, so it will
still get saved regardless.
Sage Weil [Tue, 21 Sep 2010 20:54:00 +0000 (13:54 -0700)]
mds: only return pdnvec for full path_traverse
We should only return the pdnvec for a full traverse. i.e., either a
success, or a failure in which we instantiate a null dn for the trailing
entry. This makes pdnvec well defined, and allows callers like
rdlock_path_pin_ref() to reply with a null lease when appropriate.
Sage Weil [Tue, 21 Sep 2010 02:59:00 +0000 (19:59 -0700)]
mds: don't instantiate null dentries for snapped namespace
The dentry needs a [first,last] range and we don't know what first is when
we miss a lookup. And part of the point of instantiating null dentries is
to issue leases against them, which we don't do. The client will cache
the null result.
objecter: add accounting to keep track of total in-flight messages.
If the user wishes, they can call throttle_op to hold an operation
until it fits within the limits. The user is responsible for
consistency guarantees and making sure the locking will work!
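
A minimal sketch of this kind of in-flight accounting, assuming a simple
op-count limit. InFlightBudget and its take()/try_take()/put() methods are
hypothetical names for illustration, not the Objecter's actual interface.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for in-flight accounting; not the real Objecter API.
class InFlightBudget {
public:
  explicit InFlightBudget(uint64_t max_ops) : max_ops_(max_ops) {}

  // Non-blocking: claim a slot only if the op fits right now.
  bool try_take() {
    std::lock_guard<std::mutex> l(lock_);
    if (in_flight_ >= max_ops_)
      return false;
    ++in_flight_;
    return true;
  }

  // Blocking: hold the caller until the op fits within the limit.
  void take() {
    std::unique_lock<std::mutex> l(lock_);
    cond_.wait(l, [this] { return in_flight_ < max_ops_; });
    ++in_flight_;
  }

  // Release a slot, e.g. when the reply for the op arrives.
  void put() {
    {
      std::lock_guard<std::mutex> l(lock_);
      assert(in_flight_ > 0);
      --in_flight_;
    }
    cond_.notify_one();
  }

  uint64_t in_flight() const {
    std::lock_guard<std::mutex> l(lock_);
    return in_flight_;
  }

private:
  mutable std::mutex lock_;
  std::condition_variable cond_;
  const uint64_t max_ops_;
  uint64_t in_flight_ = 0;
};
```

A caller that blocks in take() while holding its own locks can deadlock
against the thread that would call put(), which is why the locking is left
to the user.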
Sage Weil [Tue, 17 Aug 2010 19:16:02 +0000 (12:16 -0700)]
mds: drop x/wrlocks before, rdlocks after sending reply
This lets us issue the most leases/caps possible. It also ensures we can
issue caps in the snapped namespace when we are still on the head inode
(previously, releasing the rdlock twiddled the state, the client didn't
get, say, Frc, and hung indefinitely).
Sage Weil [Fri, 17 Sep 2010 16:46:29 +0000 (09:46 -0700)]
mds: touch missed dentry when fetching dir on path traverse
We can get into a loop when doing a path traverse if we miss on a large
directory and then end up trimming the result we need before handling the
original request. To avoid this, we simply put the wanted dentry at the
top of the LRU (instead of midpoint).
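
A toy illustration of why the insertion point matters, assuming an LRU that
trims from the bottom. MiniLRU and its methods are hypothetical and much
simpler than the MDS cache's real LRU.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>

// Hypothetical toy LRU: front of the list is the top, trimming pops the bottom.
class MiniLRU {
public:
  // Insert at the very top, so the entry is the last one to be trimmed.
  void insert_top(const std::string& name) { items_.push_front(name); }

  // Insert at the midpoint, so later midpoint inserts can push it downward.
  void insert_mid(const std::string& name) {
    auto it = items_.begin();
    std::advance(it, static_cast<std::ptrdiff_t>(items_.size() / 2));
    items_.insert(it, name);
  }

  // Drop entries from the bottom until at most max_items remain.
  void trim(std::size_t max_items) {
    while (items_.size() > max_items)
      items_.pop_back();
  }

  bool contains(const std::string& name) const {
    for (const auto& n : items_)
      if (n == name)
        return true;
    return false;
  }

private:
  std::list<std::string> items_;
};
```

In this toy model, a wanted dentry inserted at the midpoint while the rest
of a large directory is loaded can sink far enough to be trimmed, whereas
one placed at the top survives the same trim.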
Greg Farnum [Thu, 26 Aug 2010 21:04:45 +0000 (14:04 -0700)]
client: Make truncation work properly
The previous if block didn't work because inode->size was usually
changed well before handle_cap_trunc was ever invoked, so it never
did the truncation in the objectcacher! This was okay if you just truncated
a file and then closed it, but if you wrote a file, truncated part of it out,
and then wrote past the (new) end you would get reads that returned
the previously-truncated data out of what should have been a hole.
Now, we do the actual objectcacher truncation in update_inode_file_bits,
because all methods of truncation will move through there and this maintains
proper ordering.
Sage Weil [Thu, 16 Sep 2010 23:15:30 +0000 (16:15 -0700)]
osd: copy truncate_seq et al to clone oi
These fields are logically object attributes that should be preserved
across the clone COW process. (Not copying truncate_seq in particular
corrupts snapshot file data, depending on the order of arrival of racing
trimtrunc and writes.)
Sage Weil [Thu, 16 Sep 2010 22:50:50 +0000 (15:50 -0700)]
osd: fix is_pool_snaps_mode() for empty pools
The data pool in particular has seq 0 and (initially) no removed snaps. We
must not return true for that case, or else the OSD will use an empty
pool snap context and not the user/mds provided one.
This means we don't need to take a mutex (and possibly force
ourselves to sleep!) in order to take or put. This is about to
be important since we're adding accounting to the Objecter.
msgr: when both ends support it, exchange in_seq values on reconnect
to prevent gratuitously re-sending messages.
This adds a new feature "CEPH_FEATURE_RECONNECT_SEQ" which goes into
the default msgr features, as well as a CEPH_MSGR_TAG_SEQ which indicates
this step is being taken and substitutes for CEPH_MSGR_TAG_READY.
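
A rough sketch of what the sender can do with the peer's in_seq after a
reconnect, assuming in_seq is the highest sequence number the peer has
received. QueuedMessage and discard_acked() are hypothetical names, not
the messenger's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical queued message; only the sequence number matters here.
struct QueuedMessage {
  uint64_t seq;
};

// Drop messages the peer reports it already received, so that only newer
// ones are re-sent. Returns the number of messages discarded.
std::size_t discard_acked(std::deque<QueuedMessage>& sent, uint64_t peer_in_seq) {
  std::size_t dropped = 0;
  while (!sent.empty() && sent.front().seq <= peer_in_seq) {
    sent.pop_front();
    ++dropped;
  }
  return dropped;
}
```

Without the exchange, the sender has to assume the whole queue is
unacknowledged and re-send all of it.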
Sage Weil [Mon, 13 Sep 2010 17:03:17 +0000 (10:03 -0700)]
client: buffer name->ino linkage only; do stat at time of readdir
This makes the results more correct/consistent with the current state of
the cache at the time of the readdir(plus). It also means we buffer less
data for the readdir result, and we pin inodes for the duration.
Sage Weil [Sat, 11 Sep 2010 21:18:54 +0000 (14:18 -0700)]
mon: remove laggy standby nodes (instead of marking laggy)
This will implicitly clean out old standby nodes that go away. Laggy
cmds instances that don't go away will just re-add themselves. Since the
laggy flag is a map change, this creates no more or less map churn.
Sage Weil [Sat, 11 Sep 2010 21:06:43 +0000 (14:06 -0700)]
osd: don't create clone_obc on replica
We don't need the clone_obc on the replica, since we don't read from there.
We don't have head obc's either.
We do, however, keep the snapset_obc, since there are a few paths that
depend on whether it exists on disk (to clean it up) and that information
isn't currently fed through from the primary.