mon: Fix issue first addressed in 2c5a3d99aa3be5ce114072e84f73a0a6426e63fd.
We were properly falling out of the while loop when we reached end(), but
not checking for it in the following if-else. Now we do!
Reported-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
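The bug class described here (a loop that can stop at end(), followed by branches that dereference the iterator anyway) can be sketched as below. This is a minimal illustration and not the monitor code; the helper name `first_at_or_after` and the map contents are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: return the value of the first entry whose key is
// >= want, or fallback if the scan runs off the end of the map.
std::string first_at_or_after(const std::map<int, std::string>& m,
                              int want,
                              const std::string& fallback) {
  auto p = m.begin();
  while (p != m.end() && p->first < want)
    ++p;
  // The loop may have stopped because p == m.end(); check for that
  // before dereferencing p in the branches that follow.
  if (p == m.end())
    return fallback;
  else
    return p->second;
}
```

Without the `p == m.end()` test, the else branch would dereference a past-the-end iterator whenever no key qualifies.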
The setup-chroot.sh script is very handy for building the server in a
chroot environment. I thought I would share it here in case anyone else
finds it useful.
Sage Weil [Mon, 27 Sep 2010 15:31:34 +0000 (08:31 -0700)]
mds: don't block request on freezing if we're already auth_pinned.
If we are already auth_pinned, we're past the gates; don't stop on freezable.
This screws up xlock: the lock moves to PREXLOCK state, but the request
that would normally xlock it gets deferred because of a racing freezing
of the tree. Then the PREXLOCK gather kicks in and badness happens.
Sage Weil [Sat, 25 Sep 2010 03:10:08 +0000 (20:10 -0700)]
osd: add coll_t::is_pg() method
This makes the interface a bit more adaptable for a situation where it has
a simple string representation instead of the strict structure it has now.
Eventually this function can simply attempt a pg_t parse.
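As a rough sketch of where this could lead, an is_pg() built on a string representation might just attempt a parse. The `coll_sketch` type and the "<pool>.<seed>" decimal format below are assumptions for illustration only, not the actual coll_t or pg_t encoding.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical stand-in for a collection id that is just a string.
struct coll_sketch {
  std::string name;

  // Assumed format, for illustration only: "<pool>.<seed>", where both
  // parts are non-empty runs of decimal digits. Anything else is
  // treated as "not a PG collection".
  bool is_pg() const {
    std::string::size_type dot = name.find('.');
    if (dot == std::string::npos || dot == 0 || dot + 1 == name.size())
      return false;
    for (std::string::size_type i = 0; i < name.size(); ++i) {
      if (i == dot)
        continue;
      if (!std::isdigit(static_cast<unsigned char>(name[i])))
        return false;
    }
    return true;
  }
};
```

Callers only ask the yes/no question, so the internal representation can change without touching them.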
Sage Weil [Fri, 24 Sep 2010 18:43:37 +0000 (11:43 -0700)]
osd: make sparse data/clone push behave with partial object push
We can't error out if we don't get everything we want in one go now that
we support pushing objects in pieces. Remove this check entirely, since
we don't have a good error handling case anyway.
Sage Weil [Fri, 24 Sep 2010 16:40:40 +0000 (09:40 -0700)]
mds: defer cap release and update consistently when frozen
We need to preserve the order of processing of cap release and writeback
messages across handle_client_caps() and process_request_cap_release().
Use a helper with the appropriate condition, and defer the release
processing as needed.
Sage Weil [Fri, 24 Sep 2010 15:15:54 +0000 (08:15 -0700)]
mds: always mark parent scatterlock when marking dirty rstat
Note that this will let the parent nestlock 'dirty' state get out of
sync with the lock state, as the whole point of the dirty rstat lists is
that it can happen any time. It does, however, queue us up.
Sage Weil [Thu, 23 Sep 2010 23:12:21 +0000 (16:12 -0700)]
mds: scatter pin frozen tree on importer too
The importer also needs to scatter pin. This avoids scatterlock gather
races like so:
A: start exporting to B
A: freeze, scatter pin tree
C: initiate gather
A: delay reply to gather
B: reply to gather, do not include (non-auth) dirfrag
A,B: finish migration
A: reply to gather, do not include (now non-auth) dirfrag
C: gets no info about the dirfrag!
By pinning on the importer, we ensure that at least one MDS will respond
to the gather with auth dirfrag info.
Sage Weil [Thu, 23 Sep 2010 17:00:07 +0000 (10:00 -0700)]
mds: fix bounding frag rstat/fragstat update during import
Be careful about when we update bounding dirfrag info during an import. If
the lock is in a MIX state, we do NOT want to update, since the inode
auth doesn't know jack (unless they are also dirfrag auth, in which case
we'll find out when we unscatter anyway).
Sage Weil [Thu, 23 Sep 2010 04:10:18 +0000 (21:10 -0700)]
mds: do not scatter_writebehind on nudge if replicated
This can cause the inode rstat etc to become out of sync with dirfrag
accounted_rstat when the scatterlock is not in a gathered state: the
local values will get updated but those on other nodes will not, and the
inode will drift out of sync with the dirfrags.
Other callers to scatter_writebehind() are all in contexts where we have
_just_ gathered dirfrag state, or there is no remote dirfrag state to
gather.
Sage Weil [Wed, 22 Sep 2010 22:42:52 +0000 (15:42 -0700)]
mds: use scatter pins for migration instead of rd/wrlocks
This is simpler (for the migrator), and wrlocks allow scatter_writebehind,
which is a no-no for a frozen tree. By pinning the frozen dir's parent
inode, we prevent any scatter or unscatter operations from implicitly
updating metadata within the frozen root dirfrag.
Sage Weil [Wed, 22 Sep 2010 18:31:12 +0000 (11:31 -0700)]
mon: move election start reset to starting_election() helper
An election can start either because we call it, or because someone else
calls it. Either way, we need to reset our state, so move that code into
the starting_election() callback, which is called by the elector's
start()/call_election() anyway.
This hopefully fixes a case where we see a timeout expire on the monitor
and fail the assertion
Sage Weil [Tue, 21 Sep 2010 20:41:13 +0000 (13:41 -0700)]
mds: correctly set straydn->first for rename target
Make sure the straydn->first matches the rename target (destdnl->inode).
Unfortunately the cow happens _after_ the destdn->first is set, so instead
of trivially copying it, we dup the MAX calculation. Add some temp
variables to clean up similar code in this method.
Sage Weil [Tue, 21 Sep 2010 20:44:02 +0000 (13:44 -0700)]
mds: do full pre_dirty()/mark_dirty() on cowed dentries
The dir commit/fetch and LogSegment::try_to_expire() rely on any new or
dirty items in the directory getting new versions that correspond to a bump in
the dirfrag version. This must include dentries/inodes that are created
by the cow process, or else we have problems during dir commit/fetch or
segment expire.
Change the dirty list in the Mutation to include the pv so that we can
properly mark them dirty later.
Leave the inode one alone. We could theoretically do the same for the
dirty inodes, but this way we avoid projecting them and copying stuff
around. Any dirty cowed inode will also have a dirty dentry, so it will
still get saved regardless.
Sage Weil [Tue, 21 Sep 2010 20:54:00 +0000 (13:54 -0700)]
mds: only return pdnvec for full path_traverse
We should only return the pdnvec for a full traverse, i.e., either a
success, or a failure in which we instantiate a null dn for the trailing
entry. This makes pdnvec well defined, and allows callers like
rdlock_path_pin_ref() to reply with a null lease when appropriate.
Sage Weil [Tue, 21 Sep 2010 02:59:00 +0000 (19:59 -0700)]
mds: don't instantiate null dentries for snapped namespace
The dentry needs a [first,last] range and we don't know what first is when
we miss a lookup. And part of the point of instantiating null dentries is
to issue leases against them, which we don't do. The client will cache
the null result.
objecter: add accounting to keep track of total in-flight messages.
If the user wishes, they can call throttle_op to hold an operation
until it fits within the limits. The user is responsible for
consistency guarantees and making sure the locking will work!
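A minimal sketch of this kind of in-flight accounting, assuming a simple op-count limit: ops that do not fit are held and released as earlier ops complete. The class `inflight_sketch` and its methods are hypothetical and are not the Objecter interface; the sketch is single-threaded on purpose, since locking is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical accounting sketch: track in-flight ops against a limit
// and hold back ops that do not fit yet.
class inflight_sketch {
  std::size_t max_ops;
  std::size_t in_flight = 0;
  std::deque<int> held;  // op ids waiting for room

public:
  explicit inflight_sketch(std::size_t max) : max_ops(max) {}

  // Returns true if the op was sent now, false if it was held.
  bool submit(int op) {
    if (in_flight < max_ops) {
      ++in_flight;
      return true;
    }
    held.push_back(op);
    return false;
  }

  // Called when one in-flight op completes. Returns the ids of held
  // ops that are released (sent) as a result, oldest first.
  std::vector<int> complete() {
    assert(in_flight > 0);
    --in_flight;
    std::vector<int> released;
    while (!held.empty() && in_flight < max_ops) {
      released.push_back(held.front());
      held.pop_front();
      ++in_flight;
    }
    return released;
  }

  std::size_t ops_in_flight() const { return in_flight; }
};
```

Held ops are released in submission order here; whether that ordering is sufficient for a given workload is exactly the consistency question the commit leaves to the user.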
Sage Weil [Tue, 17 Aug 2010 19:16:02 +0000 (12:16 -0700)]
mds: drop x/wrlocks before, rdlocks after sending reply
This lets us issue the most leases/caps possible. It also ensures we can
issue caps in the snapped namespace when we are still on the head inode
(previously, releasing the rdlock twiddled the state, the client didn't
get say Frc, and hung indefinitely).
Sage Weil [Fri, 17 Sep 2010 16:46:29 +0000 (09:46 -0700)]
mds: touch missed dentry when fetching dir on path traverse
We can get into a loop when doing a path traverse if we miss on a large
directory and then end up trimming the result we need before handling the
original request. To avoid this, we simply put the wanted dentry at the
top of the LRU (instead of midpoint).
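The effect of top versus midpoint placement can be shown with a toy LRU. This is a sketch under simplifying assumptions (a single list, trimming strictly from the bottom); `lru_sketch` and its method names are hypothetical and do not mirror the MDS cache LRU.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>

// Hypothetical toy LRU: front() is the top (trimmed last), back() is
// the bottom (trimmed first).
struct lru_sketch {
  std::list<std::string> items;

  void insert_top(const std::string& s) { items.push_front(s); }

  // Insert halfway down the list.
  void insert_mid(const std::string& s) {
    auto it = items.begin();
    std::advance(it, items.size() / 2);
    items.insert(it, s);
  }

  // Move an existing entry to the top; returns false if it is absent.
  bool touch(const std::string& s) {
    for (auto it = items.begin(); it != items.end(); ++it) {
      if (*it == s) {
        items.splice(items.begin(), items, it);
        return true;
      }
    }
    return false;
  }

  // Drop entries from the bottom until at most max remain.
  void trim(std::size_t max) {
    while (items.size() > max)
      items.pop_back();
  }

  bool contains(const std::string& s) const {
    for (const auto& x : items)
      if (x == s)
        return true;
    return false;
  }
};
```

When a large fetch inserts many entries at the midpoint, the wanted entry can sink toward the bottom and be trimmed before the request is retried; touching it first keeps it at the top.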
Greg Farnum [Thu, 26 Aug 2010 21:04:45 +0000 (14:04 -0700)]
client: Make truncation work properly
The previous if block didn't work because inode->size was usually
changed well before handle_cap_trunc was ever invoked, so it never
did the truncation in the objectcacher! This was okay if you just truncated
a file and then closed it, but if you wrote a file, truncated part of it out,
and then wrote past the (new) end you would get reads that returned
the previously-truncated data out of what should have been a hole.
Now, we do the actual objectcacher truncation in update_inode_file_bits,
because all methods of truncation will move through there and this maintains
proper ordering.
Sage Weil [Thu, 16 Sep 2010 23:15:30 +0000 (16:15 -0700)]
osd: copy truncate_seq et al to clone oi
These fields are logically object attributes that should be preserved
across the clone COW process. (Not copying truncate_seq in particular
corrupts snapshot file data, depending on the order of arrival of racing
trimtrunc and writes.)
Sage Weil [Thu, 16 Sep 2010 22:50:50 +0000 (15:50 -0700)]
osd: fix is_pool_snaps_mode() for empty pools
The data pool in particular has seq 0 and (initially) no removed snaps. We
must not return true for that case, or else the OSD will use an empty
pool snap context and not the user/mds provided one.
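One predicate that satisfies this requirement is sketched below. The `pool_sketch` struct and its field names are simplified assumptions for illustration, and the exact condition used in the OSD code may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical, simplified pool record; field names are illustrative.
struct pool_sketch {
  std::uint64_t snap_seq = 0;
  std::set<std::uint64_t> removed_snaps;

  // True only when a snapshot sequence exists and removed_snaps is
  // empty. A fresh pool (seq 0, no removed snaps) is deliberately NOT
  // reported as being in pool-snaps mode.
  bool is_pool_snaps_mode() const {
    return removed_snaps.empty() && snap_seq > 0;
  }
};
```

Testing only for an empty removed_snaps set would misclassify the fresh pool, which is the case the commit message calls out.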
This means we don't need to take a mutex (and possibly force
ourselves to sleep!) in order to take or put. This is about to
be important since we're adding accounting to the Objecter.
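The subject line for this message is missing, so the mechanism is not stated. One common way to take and put without a mutex is an atomic counter with a compare-and-swap loop, sketched here. That the change used atomics is an assumption, and `atomic_budget_sketch` is a hypothetical name.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical lock-free counter with an upper limit.
class atomic_budget_sketch {
  std::atomic<std::int64_t> count{0};
  const std::int64_t max;

public:
  explicit atomic_budget_sketch(std::int64_t m) : max(m) {}

  // Try to take n units. Never blocks and never takes a mutex.
  bool take(std::int64_t n) {
    std::int64_t cur = count.load();
    while (cur + n <= max) {
      // On failure compare_exchange_weak reloads cur and we re-check
      // the limit before retrying.
      if (count.compare_exchange_weak(cur, cur + n))
        return true;
    }
    return false;
  }

  // Return n units to the budget.
  void put(std::int64_t n) { count.fetch_sub(n); }

  std::int64_t current() const { return count.load(); }
};
```

This variant refuses instead of sleeping when the budget is exhausted; a caller that must wait would still need some blocking mechanism on top.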