Greg Farnum [Thu, 19 Aug 2010 19:01:11 +0000 (12:01 -0700)]
backtrace: fix segfault in tcmalloc.
The print function is only called when we're about to crash anyway,
and the data member 'foo' is allocated by ptmalloc, not tcmalloc. Freeing
it via tcmalloc causes its own crash which pollutes our debugging and
incorrectly sticks tcmalloc into the stack. So, just don't free.
Sage Weil [Fri, 20 Aug 2010 16:26:34 +0000 (09:26 -0700)]
crush: return error instead of BUGing on bad forcefed mapping
The forcefed mapping relies on a parent map. However, the current
implementation assumes that the parent mapping is unique for all rules. If
that is not the case (i.e., some osd exists in multiple hierarchies) then
we cannot assert that the TAKE matches the calculated force_context.
For now, we can just fail the mapping in that case (we don't use forcefed
mappings yet). The real solution is probably to define parent maps for
all possible hierarchies (i.e., one for each unique TAKE starting
point).
Sage Weil [Fri, 20 Aug 2010 04:47:19 +0000 (21:47 -0700)]
mds: fix ENOTEMPTY checking on rmdir/rename
We can't trust the inode rstat size without holding the locks. We can,
however, look at our auth frags without fear of a false positive
ENOTEMPTY.
Rename the function, introduce a helper for the locked check, update
comments, etc.
Sage Weil [Wed, 18 Aug 2010 20:49:11 +0000 (13:49 -0700)]
mds: fix null snapflush logic
We only want to do a null snapflush if we _know_ there isn't another one
coming: that is, there aren't any outstanding issued excl/wr cap bits at
the client. The old test has the bitwise NOT backwards. We can also
limit the test to the bits we care about.
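A minimal sketch of the kind of test described above, assuming hypothetical cap bit values (CAP_RD, CAP_WR, CAP_EXCL are illustrative names, not the real Ceph constants). The second function shows one way a misplaced bitwise NOT could invert the check; it is an illustration, not the actual old code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cap bits, for illustration only (not Ceph's real values).
constexpr uint32_t CAP_RD   = 1u << 0;
constexpr uint32_t CAP_WR   = 1u << 1;
constexpr uint32_t CAP_EXCL = 1u << 2;
constexpr uint32_t CAP_ALL  = CAP_RD | CAP_WR | CAP_EXCL;

// A null snapflush is safe only if the client holds none of the bits that
// could still produce a real snapflush. Only WR/EXCL are examined.
bool can_do_null_snapflush(uint32_t issued) {
  assert((issued & ~CAP_ALL) == 0);  // no unknown bits in this sketch
  const uint32_t interesting = CAP_WR | CAP_EXCL;
  return (issued & interesting) == 0;
}

// With the NOT applied to the wrong operand, the test is true only when
// *both* WR and EXCL are issued, which is the opposite of what we want.
bool inverted_test(uint32_t issued) {
  const uint32_t interesting = CAP_WR | CAP_EXCL;
  return (~issued & interesting) == 0;
}
```

Masking with only the WR/EXCL bits keeps unrelated caps (such as a read cap) from affecting the decision.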
Sage Weil [Wed, 18 Aug 2010 20:44:34 +0000 (13:44 -0700)]
qa: add snaptest-snap-rm-cmp
This (usually) reproduced a bug where:
- we write a big file
- snap it
- remove it. this makes the mds cow it.
- cp the snapped version.
- mds syncs the head
- client starts writeback
- (sometimes!) client sends other caps back to the mds
- mds does null flushsnap, not realizing a real one is still coming
Sage Weil [Wed, 18 Aug 2010 20:16:57 +0000 (13:16 -0700)]
mds: remove forward-on-nonauth-rdlock behavior
The problem is that we may be rdlocking items with a different auth than
the main item we are modifying, so forwarding based on lock state is
inconsistent with our requirement that we be on the modified item's auth.
Either we can somehow mark whether the locked item is the "main thing" we
are operating on, or we can drop the forward behavior from the locker and
put any forwarding heuristics elsewhere. I'm opting for the latter.
Sage Weil [Wed, 18 Aug 2010 17:18:21 +0000 (10:18 -0700)]
mds: fix null snapflush inode lookup
Using pick_inode_snap is totally wrong (it depends on the current set
of snaps, etc.). Look up the inode directly via the ino and last snapid,
which we have. Fixes a failure at the assert.
Reported-by: Thomas Mueller <thomas@chaschperli.ch>
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Tue, 17 Aug 2010 23:38:20 +0000 (16:38 -0700)]
mds: handle no flushsnap
We won't get a flushsnap when the client has EXCL/WR caps but no dirty
data. The MDS needs to release the snapped inode's locks when it gets a
normal update but no FLUSHSNAP.
Sage Weil [Tue, 17 Aug 2010 19:16:02 +0000 (12:16 -0700)]
mds: drop x/wrlocks before, rdlocks after sending reply
This lets us issue the most leases/caps possible. It also ensures we can
issue caps in the snapped namespace when we are still on the head inode
(previously, releasing the rdlock twiddled the state, the client didn't
get, say, Frc, and hung indefinitely).
Sage Weil [Mon, 16 Aug 2010 23:45:14 +0000 (16:45 -0700)]
mds: make inode first track dn first on rename
This mirrors the logic in cc8f5ac47c77d1e336e16d8deb024d507e0e8c59. Make
the renamed inode first match the destdn to avoid problems down the line.
Do this after we've (potentially) cowed the inode in the journal_cow_dentry
on srcdn.
Here, the foo dentry has a high seq, and subsequent mkdir foo should make
sure we give the new foo dir inode the same seq (and not a lower one from
the parent). Otherwise, things get confused later on.
Sage Weil [Mon, 16 Aug 2010 21:27:34 +0000 (14:27 -0700)]
mds: flush log on cap writeback if !dirty and unstable locks
The problem is if we revoke caps, nothing is dirty, but we do writeback
because we are adjusting max_size. Then we have to wait for the log to
flush even though the revocation should proceed immediately. To move
things along, flush the log immediately.
This still isn't ideal. It would be nice to allow the locking to continue
and adjust max_size in parallel, but that isn't always possible, and will
require some more thought.
Greg Farnum [Mon, 16 Aug 2010 18:48:12 +0000 (11:48 -0700)]
qa: add snap-rm-diff.sh to look for issues with snapshot integrity.
The script currently passes, although running these steps manually
(especially with smaller files) fails a fair percentage of the time for me.
Sage Weil [Thu, 12 Aug 2010 20:28:47 +0000 (13:28 -0700)]
mds: only kick head on snap rdlock if auth
- If we are non-auth, stick with the snap, and the auth will do the
inference.
- If we are auth, the head had better exist, because our lock is
pinned in an unreadable state for some reason. Assert as much.
Sage Weil [Wed, 11 Aug 2010 20:30:48 +0000 (13:30 -0700)]
mon: use elector's epoch
This fixes a race with successive elections: we may see a new election
(X+1), then get a victory (X). The victory is ignored (rightly so). But
then a paxos message follows that assumes X, and our check was against
mon_epoch, which is only updated on election victory, so we end up
inconsistent and crash.
So, just get rid of private mon_epoch and use the elector's value.
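A rough sketch of the idea, using heavily simplified, hypothetical stand-ins for the monitor and elector (the struct layouts, saw_election and is_current are assumptions for illustration, not the real Ceph classes). How the monitor actually handles a stale message is not stated above; rejecting it here is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the elector.
struct Elector {
  uint32_t epoch = 0;
  // Seeing a newer election bumps the epoch right away; older ones are ignored.
  void saw_election(uint32_t e) { if (e > epoch) epoch = e; }
  uint32_t get_epoch() const { return epoch; }
};

// Hypothetical, simplified stand-in for the monitor.
struct Monitor {
  Elector elector;
  // No private mon_epoch copy that can go stale: always ask the elector.
  uint32_t get_epoch() const { return elector.get_epoch(); }
  // A message is current only if it was sent in the epoch we are in now.
  bool is_current(uint32_t msg_epoch) const { return msg_epoch == get_epoch(); }
};

// Replays the sequence from the commit message with X = 1 and returns
// true if the paxos message that assumes X is recognized as stale.
bool stale_paxos_rejected() {
  Monitor mon;
  mon.elector.saw_election(1);  // election X
  mon.elector.saw_election(2);  // new election X+1
  mon.elector.saw_election(1);  // late victory for X: ignored
  assert(mon.get_epoch() == 2);
  return !mon.is_current(1);    // paxos message assuming X
}
```

With a single source of truth for the epoch, the election state and the paxos check cannot disagree about which epoch is current.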
Sage Weil [Wed, 11 Aug 2010 17:05:18 +0000 (10:05 -0700)]
osd: write (empty) log, bounds on remove_pg start
This zeros the log, and the bounds, when we start pg removal. Previously
we just removed the log and didn't write the (empty) bounds, breaking
cosd startup later when the old bounds are totally wrong.
Sage Weil [Sun, 8 Aug 2010 15:59:59 +0000 (08:59 -0700)]
filestore: flush using sync(2) hammer
Since we can't easily detect ext3 (let alone whether we have data=journal),
by default use sync(2) as an overly large hammer to flush all prior applied
ops to disk. If the option is explicitly enabled, use fsync(2) on a
file to implicitly flush the journal and prior writes. Admins should only
enable this if they have ext3 in data=journal mode.
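The two flush strategies can be sketched roughly as follows. The function name, the use_fsync flag, and the scratch file path are hypothetical; the actual filestore option name is not given above. Only the POSIX calls sync(2), open(2), fsync(2), and close(2) are real.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Flush previously applied ops to disk. Returns 0 on success, -1 on error.
// 'use_fsync' stands in for the explicitly enabled option; 'path' is a
// scratch file on the same filesystem (both are illustrative assumptions).
int flush_applied_ops(bool use_fsync, const char *path) {
  if (!use_fsync) {
    ::sync();  // big hammer: schedules a flush of everything; returns void
    return 0;
  }
  assert(path != nullptr);
  int fd = ::open(path, O_WRONLY | O_CREAT, 0644);
  if (fd < 0)
    return -1;
  // On a data=journal filesystem, fsync on one file is expected to
  // commit the journal, and with it the prior writes.
  int r = ::fsync(fd);
  int c = ::close(fd);
  return (r == 0 && c == 0) ? 0 : -1;
}
```

Defaulting to sync(2) trades performance for safety, since it does not depend on knowing the filesystem type or journal mode.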
Sage Weil [Thu, 5 Aug 2010 20:08:22 +0000 (13:08 -0700)]
mds: do not clone caps to snapped inodes
Instead, explicitly track which locks need to be flushed (via a FLUSHSNAP)
with a LOCK_SNAP_SYNC lock state.
Restructures handle_client_caps.
Also changes the client ranges format in the inode to keep a follows for
each client (basically 'flushed through') so that the client ranges can
get cleaned out later when it gets cowed.