Sage Weil [Wed, 18 Nov 2009 00:30:36 +0000 (16:30 -0800)]
msgr: fix possible use-after-free by taking pipe lock during reap
This ensures that whoever called stop() (mark_down, in particular) finished
with the pipe (unlocked it) before we go and free it. Otherwise we might
call p->lock.Unlock() after the reaper deleted the Pipe, mucking up some
other memory.
I don't think this was actually triggered, though, since we would have seen
the assert(nlock == 0) in ~Mutex?
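A minimal single-threaded sketch of the idea, not the actual messenger code: the reaper takes and drops each pipe's lock before deleting it, so it cannot free a Pipe while a stopper still holds (and will later unlock) that lock. The Pipe struct, stop_pipe(), reap() and the reap queue here are hypothetical stand-ins, and std::mutex is used in place of Ceph's Mutex.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the messenger's Pipe; not the real class.
struct Pipe {
    std::mutex lock;
    bool stopped = false;
};

// A caller such as mark_down(): stops the pipe while holding its lock.
void stop_pipe(Pipe& p, std::vector<Pipe*>& reap_queue) {
    std::lock_guard<std::mutex> guard(p.lock);
    p.stopped = true;
    reap_queue.push_back(&p);
}   // the lock is released here; only now is it safe to free p

// Reaper: acquire and release each pipe's lock before deleting it.
// Returns the number of pipes freed.
std::size_t reap(std::vector<Pipe*>& reap_queue) {
    std::size_t freed = 0;
    for (Pipe* p : reap_queue) {
        p->lock.lock();      // waits until the stopper has unlocked p
        assert(p->stopped);
        p->lock.unlock();
        delete p;
        ++freed;
    }
    reap_queue.clear();
    return freed;
}
```

In a real multi-threaded messenger the reap queue would need its own synchronization; that is left out to keep the sketch short.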
Sage Weil [Fri, 13 Nov 2009 23:06:31 +0000 (15:06 -0800)]
mds: journal open_files based on is_any_caps_wanted(), not is_any_caps()
Actually we're a bit conservative in a few places since the wanted check
is a bit more expensive. We always do a full check in try_to_expire, so
much of the time we can do the quick check only.
Sage Weil [Fri, 13 Nov 2009 22:57:39 +0000 (14:57 -0800)]
mds: don't rejournal files with caps that are unwanted
If they're unwanted, it's no biggie to fail to reconnect the cap. And
Locker::adjust_cap_wanted() already adjusts the open_file logseg lists in
this way, so let's just be totally consistent.
Sage Weil [Fri, 13 Nov 2009 22:44:22 +0000 (14:44 -0800)]
mds: don't croak on open_files without caps
We can get a capless inode here if we replay an open file, the client
fails to reconnect it, but does REPLAY an open request (that adds it
to the logseg). AFAICS it's ok for the client to replay an open on a
file it doesn't have in its cache anymore.
Sage Weil [Thu, 12 Nov 2009 22:54:22 +0000 (14:54 -0800)]
mds: recommit after commit if waiting for newer version
If there are waiters for a later version of the dir to hit
disk, then we need to recommit as soon as the prior commit
completes. We auth_pin on adding the first waiter, and do
not unpin until removing the last waiter, so this doesn't
break auth_pin rules.
Previously we could stall because we didn't finish the
waiter (on the later version) but also never started the
commit. Sometimes we would get lucky and someone else would
ask for a commit, but sometimes not. We would then see old
LogSegments that would never get fully expired.
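The recommit rule can be sketched roughly as follows. This is an illustration only: the Dir struct, its fields and method names are hypothetical and do not reflect the real CDir interface, and auth_pin handling is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical directory object; names are illustrative, not CDir's.
struct Dir {
    uint64_t version = 0;      // current in-core version
    uint64_t committing = 0;   // version being written, 0 if none in flight
    uint64_t committed = 0;    // last version known to be on disk
    int commits_started = 0;
    std::map<uint64_t, std::vector<std::function<void()>>> waiters;

    void start_commit() {
        committing = version;
        ++commits_started;
    }

    // Queue a waiter for version `want`; kick off a commit if none is running.
    void wait_for_commit(uint64_t want, std::function<void()> cb) {
        waiters[want].push_back(std::move(cb));
        if (committing == 0)
            start_commit();
    }

    void commit_finished() {
        committed = committing;
        committing = 0;
        // Finish every waiter whose version is now on disk.
        while (!waiters.empty() && waiters.begin()->first <= committed) {
            auto cbs = std::move(waiters.begin()->second);
            waiters.erase(waiters.begin());
            for (auto& cb : cbs)
                cb();
        }
        // The fix: someone still waits for a newer version, so recommit now
        // instead of hoping that another caller asks for a commit.
        if (!waiters.empty())
            start_commit();
    }
};
```

Without the final check in commit_finished(), a waiter queued while a commit was in flight would sit there until some unrelated caller happened to start another commit.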
Sage Weil [Wed, 11 Nov 2009 23:47:28 +0000 (15:47 -0800)]
mds: force rdlock on any snapped inodes
When the client has an excl lock on an inode, and it's
stating a snapped version of it, we can't expect it to
put 2 and 2 together and look at its head metadata. If
the cap does not follow the snapid we're trying to stat, do
the full rdlock to force the snapped values back to the
mds so we can do the cow.
If there is nothing to cow, the cap will get reissued with an
accurate follows value, and we won't have to do this again.
Sage Weil [Wed, 11 Nov 2009 00:26:44 +0000 (16:26 -0800)]
mds: underwater is function of _loaded_ version, not in core version
We may load a dir version off disk that is older than the
in-core version (because we got newer data from the
journal, say). When marking underwater items clean, do
so based on the _loaded_ version, not our in-core version.
Sage Weil [Tue, 10 Nov 2009 22:51:20 +0000 (14:51 -0800)]
osd: do not apply_transaction in finish_recovery
finish_recovery needs to set up a callback for when the current set of
changes commit to disk (to kickstart cleanup of stray replicas etc). We
can't call apply_transaction this deep inside the call chain without
causing problems. So, pass a list of completion contexts all the way down
so that we can set up the completion callback.
Sage Weil [Mon, 9 Nov 2009 21:17:29 +0000 (13:17 -0800)]
osd: log misdirected ops; reply with -ENXIO
This is more helpful than assert(0). It's still bad (it means the client
and osd calculated different pg mappings), but this makes it easier
to identify and fix.
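Roughly, the behavior change looks like the sketch below. The Op struct, handle_op() and the way the pg ids are compared are assumptions made for illustration; the real OSD code path is more involved.

```cpp
#include <cerrno>
#include <cstdint>
#include <iostream>

// Hypothetical request: carries the pg the client computed for the object.
struct Op {
    uint32_t client_pg;
};

// Returns 0 if the op was sent to the right pg, -ENXIO if it was misdirected.
// Previously this situation would have hit assert(0) and killed the daemon.
int handle_op(const Op& op, uint32_t osd_pg) {
    if (op.client_pg != osd_pg) {
        std::clog << "misdirected op: client mapped to pg " << op.client_pg
                  << ", osd expected pg " << osd_pg << "\n";
        return -ENXIO;
    }
    return 0;
}
```

The log line records both mappings, which is what makes the mismatch easy to track down afterwards.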
Sage Weil [Sat, 7 Nov 2009 00:43:47 +0000 (16:43 -0800)]
osd: use stronger hash function for mapping objects -> pgs
The old hash (from linux dcache) was very weak, such that the
least significant bits may not change and you could get lots of
consecutive objects on the same osds (because lsbits of the
pg weren't changing).
This is Robert Jenkins' hash and is quite strong. Public
domain.
Rev osd disk format, protocol, since we're totally changing
object placement here.
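To illustrate the weakness of the old hash (this sketch does not show the new Jenkins hash): the function below is modelled on the Linux dcache per-character hashing step, and distinct_pgs() is a hypothetical helper that assumes the pg is picked by masking the low bits of the hash. For names that differ only in a trailing decimal digit, the low 4 bits of the hash never change, so all ten objects land in the same one of 16 pgs.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>

// Modelled on the Linux dcache string hash step:
//   hash = (hash + (c << 4) + (c >> 4)) * 11
uint32_t dcache_style_hash(const std::string& name) {
    uint32_t hash = 0;
    for (unsigned char c : name)
        hash = (hash + (c << 4) + (c >> 4)) * 11;
    return hash;
}

// Hypothetical helper: hash prefix+"0" .. prefix+"9", mask off the low bits
// as a stand-in for pg selection, and count how many distinct pgs are hit.
std::size_t distinct_pgs(const std::string& prefix, uint32_t pg_mask) {
    std::set<uint32_t> pgs;
    for (char d = '0'; d <= '9'; ++d)
        pgs.insert(dcache_style_hash(prefix + d) & pg_mask);
    return pgs.size();
}
```

The reason is visible in the formula: the digits '0' to '9' share the same high nibble, and (c << 4) only affects bits 4 and up, so the last character contributes nothing that varies in the low 4 bits.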
Greg Farnum [Wed, 4 Nov 2009 22:14:41 +0000 (14:14 -0800)]
Hadoop: Numerous fixes.
Set bufUsed=0 on a flush to avoid bad rewrites of data
Downgrade a warning in IOStream since Hadoop apparently checks for EOF by reading until a read returns -1.
Remove some leftover if checks that don't do anything.
TODO: Remove something Sage did already.
Sage Weil [Tue, 3 Nov 2009 23:15:22 +0000 (15:15 -0800)]
msgr: encode sockaddr.ss_family big endian in ceph_entity_addr
The ss_family field is normally host endianness, but we
want to exchange ceph_entity_addr across the wire and store
it on disk. So, encode ss_family in big endian (to match
the other sockaddr field endianness).
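A minimal sketch of fixed big-endian encoding for a 16-bit family value. The helper names are hypothetical and this is not the actual ceph_entity_addr encode path; it only shows that writing the bytes explicitly gives the same wire and disk representation on any host.

```cpp
#include <cstdint>

// Write a 16-bit address family most-significant byte first, regardless of
// the host's native byte order.
void encode_family_be(uint16_t family, unsigned char out[2]) {
    out[0] = static_cast<unsigned char>(family >> 8);
    out[1] = static_cast<unsigned char>(family & 0xff);
}

// Read it back into host order.
uint16_t decode_family_be(const unsigned char in[2]) {
    return static_cast<uint16_t>((in[0] << 8) | in[1]);
}
```

Code that already includes the socket headers could use htons()/ntohs() for the same conversion; the explicit shifts are used here only to keep the example free of platform headers.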