Sage Weil [Mon, 28 Jun 2010 21:15:59 +0000 (14:15 -0700)]
msgr: use dedicated reaper thread
We were calling the reaper from the wait() loop. The problem is that
the OSD has two messengers, and only the first was in wait(); the second
wait() was only called after the first terminated (i.e., when the OSD was
shutting down).
Instead, launch a separate reaper thread when we bind, and close it out
on shutdown right after the accepter.
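A minimal sketch of the idea, in Python rather than the messenger's own C++, and not the actual implementation: the reaper runs on its own thread from bind() onward, so it no longer depends on anyone sitting in wait(). The class and method names (Messenger, queue_reap, the reaped list) are hypothetical, and the accepter is left out.

```python
import threading


class Messenger:
    """Toy messenger that reaps closed pipes on a dedicated thread.

    All names here are illustrative assumptions, not the real msgr API.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._reap_queue = []   # pipes waiting to be torn down
        self._stopping = False
        self._reaper = None
        self.reaped = []        # record of reaped pipes, for illustration

    def bind(self):
        # Launch the reaper when we bind, instead of reaping from wait().
        self._reaper = threading.Thread(target=self._reaper_entry, daemon=True)
        self._reaper.start()

    def queue_reap(self, pipe):
        # Hand a closed pipe to the reaper and wake it up.
        with self._cond:
            self._reap_queue.append(pipe)
            self._cond.notify()

    def _reaper_entry(self):
        with self._cond:
            while True:
                # Drain everything queued so far before checking for stop.
                while self._reap_queue:
                    self.reaped.append(self._reap_queue.pop(0))
                if self._stopping:
                    return
                self._cond.wait()

    def shutdown(self):
        # Ask the reaper to finish, then join it.
        with self._cond:
            self._stopping = True
            self._cond.notify()
        if self._reaper is not None:
            self._reaper.join()
```

With two such messengers in one process, each reaps its own pipes independently of which one the caller happens to be waiting on.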
Sage Weil [Tue, 29 Jun 2010 21:32:28 +0000 (14:32 -0700)]
osd: always use original Connection when replying
...even when the op came from another OSD. Not that that should happen
anyway, since we don't forward messages currently. (And can't, since the
OSD doesn't initiate connections to the client!)
Sage Weil [Mon, 28 Jun 2010 18:44:26 +0000 (11:44 -0700)]
journal: set max journal write to 10MB
If we take too big a bite of data to write in a single writev(2), we can
end up making performance worse, because everyone waits for the full write
to complete. Bigger writes mean better throughput but higher latency.
So, balance the two by placing some upper limit.
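As a rough illustration of the trade-off (a Python sketch, not the FileJournal code), a writer can cap how many queued bytes it takes per write. The helper name take_batch is hypothetical, and always taking at least one entry, even an oversized one, is an assumption made here so the queue keeps moving.

```python
MAX_WRITE_BYTES = 10 * 1024 * 1024  # 10MB cap, as in the commit subject


def take_batch(queue, max_bytes=MAX_WRITE_BYTES):
    """Pop queued entries (bytes objects) for one write, up to max_bytes.

    Assumption: at least one entry is always taken, so an entry larger
    than the cap still gets written on its own.
    """
    batch = []
    total = 0
    while queue:
        size = len(queue[0])
        if batch and total + size > max_bytes:
            break  # the next entry would push this write past the cap
        batch.append(queue.pop(0))
        total += size
    return batch
```

Entries left in the queue go out in the next write, so waiters on the first batch are not held up by everything queued behind them.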
Sage Weil [Mon, 28 Jun 2010 18:34:29 +0000 (11:34 -0700)]
filejournal: fix journal write_pos advance
This was broken by bd4188a02abff9efffb87a0a2031efe51c1b4d9a. @pos needs to
be advanced (it is passed by reference) or else we just overwrite the same
bytes at the journal start over and over again.
Sage Weil [Sat, 26 Jun 2010 17:28:38 +0000 (10:28 -0700)]
msgr: fix throttle deadlock
Do msgr throttle after peer policy throttle. The msgr (dispatch) throttle
is short-lived and won't deadlock (unless dispatch blocks), so it's safe to
take last. In contrast, the policy throttle carries over the lifetime of
the message, and may block until replication completes or whatever else.
Sage Weil [Sat, 26 Jun 2010 04:46:23 +0000 (21:46 -0700)]
crushwrapper: gracefully handle crush error
crush_do_rule can return <0 in certain error cases (e.g., the force-fed device
does not exist in the crush map). We should take that to mean an empty []
result instead of crashing.
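The wrapper behaviour can be sketched as follows, in Python for illustration only. The crush_do_rule argument is a stand-in callable with a simplified, hypothetical signature (it fills result and returns the item count, or a negative value on error); it is not the real C function's signature.

```python
def do_rule(crush_do_rule, rule, x, max_results):
    """Return the chosen items, or [] if rule evaluation reports an error.

    `crush_do_rule` is a hypothetical stand-in: it fills `result` in place
    and returns the number of items chosen, or a value < 0 on error.
    """
    result = [0] * max_results
    n = crush_do_rule(rule, x, result, max_results)
    if n < 0:
        return []  # treat an error as an empty mapping rather than crashing
    return result[:n]
```

Callers then only have to handle an empty list, which they already need to do for rules that map to nothing.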
Sage Weil [Thu, 24 Jun 2010 23:49:12 +0000 (16:49 -0700)]
mds: keep cap follows above in->first in FLUSHSNAP
The client has a follows of 0 initially, which is correct (it does follow
0, and there are no prior snaps). But the inode has ->first of 2, which
is also fine. The follows here needs to be at least higher than the
inode first, though, or the caps cloning gets off...
Sage Weil [Thu, 24 Jun 2010 22:50:47 +0000 (15:50 -0700)]
mds: fix client cap condition
In 551a12f52e36 we fixed a bug with cow_inode() where the
cap->client_follows didn't match last precisely. Instead, we compare
to first. But the == is too strict: a cap whose follows is equal to _or_older_
than the clone's first should be copied to the clone inode.
This fixes the simple test case
$ echo asdf > bar ; mkdir .snap/bar ; rm bar ; cat .snap/bar/bar
asdf
(Previously we would get nothing unless we waited for the cap to flush on
its own.)
Sage Weil [Thu, 24 Jun 2010 17:40:14 +0000 (10:40 -0700)]
crush: make CHOOSE_LEAF behave when leaf type is encountered
We may not want to recursively call crush_choose() if we start out with a
leaf. If that happens, we need to fill out the out2[] vector with
our result immediately.
Sage Weil [Wed, 23 Jun 2010 21:08:39 +0000 (14:08 -0700)]
crush: behave when chooseleaf is given leaf type
Fill in the out2 choose_leaf vector if it's defined. This is necessary
because we may not recursively call choose on out2 if the item we're on is
not a bucket (e.g., when chooseleaf is given the leaf type 0).
Thomas Mueller [Mon, 21 Jun 2010 10:32:26 +0000 (10:32 +0000)]
add helptext for option "snapdirname" to manpage of mount.ceph
Inspired by the addition to
http://ceph.newdream.net/wiki/Snapshots about the snapdirname
option, I've created a patch for the mount.ceph manpage.
Sage Weil [Sun, 20 Jun 2010 21:41:19 +0000 (14:41 -0700)]
journal: initialize applied_seq during journal replay
This should avoid the following assertion failure:
#0 0x00007f41b1a18a75 in raise () from /lib/libc.so.6
#1 0x00007f41b1a1c5c0 in abort () from /lib/libc.so.6
#2 0x00007f41b22cd8e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/libstdc++.so.6
#3 0x00007f41b22cbd16 in ?? () from /usr/lib/libstdc++.so.6
#4 0x00007f41b22cbd43 in std::terminate() () from /usr/lib/libstdc++.so.6
#5 0x00007f41b22cbe3e in __cxa_throw () from /usr/lib/libstdc++.so.6
#6 0x00000000005b39f8 in ceph::__ceph_assert_fail (assertion=0x5ec3b2 "seq >= last_committed_seq", file=<value optimized out>, line=711, func=<value optimized out>) at common/assert.cc:30
#7 0x00000000005649e1 in FileJournal::committed_thru (this=0x1116310, seq=0) at os/FileJournal.cc:711
#8 0x000000000055d265 in JournalingObjectStore::commit_finish (this=0x1125740) at os/JournalingObjectStore.cc:186
#9 0x00000000005543f3 in FileStore::sync_entry (this=0x1125740) at os/FileStore.cc:1714
#10 0x00000000004ef93d in FileStore::SyncThread::entry() ()
#11 0x0000000000469a4a in Thread::_entry_func (arg=0x6315) at ./common/Thread.h:39
#12 0x00007f41b28ab9ca in start_thread () from /lib/libpthread.so.0
#13 0x00007f41b1acb6cd in clone () from /lib/libc.so.6
#14 0x0000000000000000 in ?? ()
Sage Weil [Sat, 19 Jun 2010 04:26:41 +0000 (21:26 -0700)]
initscript: remove class loading for now
- only need to do it once, by connecting to a random monitor, not for
each monitor
- not sure we should try it every time we start the monitor for all time,
as opposed to once after mkfs, or whenever the admin chooses to load
new classes
Sage Weil [Fri, 18 Jun 2010 22:59:36 +0000 (15:59 -0700)]
filestore: op_start when op is _queued_, so that q is drained on commit
We need the store in a consistent state on commit, which means flushing
transactions such that we have all ops <= a given seq applied. That is
handled by the commit_start()/commit_started() pair, but will only include
ops in the FileStore queue if we op_start when it is initially queued.
This is exactly what we want: the queue can reorder things, so pausing
only the updates that are currently being applied would keep transactions
atomic but not ordered.
Sage Weil [Thu, 17 Jun 2010 20:37:34 +0000 (13:37 -0700)]
msgr: ref count Pipe to avoid use after free
The Connection has a Pipe pointer to facilitate
send_message(Message, Connection)
but the reaper() clears that pointer when tearing down old pipes. This
leads to a race in which submit_message dereferences the old Pipe pointer.
Instead, make Pipe ref counted, and only submit_message() if we get a
valid Pipe reference. This fixes races between send_message() and
reaper() (as well as any use of the Connection after the pipe is closed).
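A small sketch of the pattern, in Python for illustration and not the msgr code: the sender takes its own reference to the pipe under a lock and only sends if it got one, while the reaper clears the Connection's pointer. In Python every object reference is already counted, so the local variable plays the role of the explicit Pipe ref; the class and method names (get_pipe, clear_pipe) are hypothetical.

```python
import threading


class Pipe:
    """Stand-in for a messenger pipe; records what was sent on it."""

    def __init__(self):
        self.sent = []
        self.closed = False

    def send(self, msg):
        self.sent.append(msg)


class Connection:
    """Holds a pipe pointer that the reaper may clear at any time."""

    def __init__(self, pipe):
        self._lock = threading.Lock()
        self._pipe = pipe

    def get_pipe(self):
        # Hand out a reference (or None) taken under the lock.
        with self._lock:
            return self._pipe

    def clear_pipe(self):
        # Called by the reaper when tearing the pipe down.
        with self._lock:
            pipe, self._pipe = self._pipe, None
        if pipe is not None:
            pipe.closed = True


def submit_message(conn, msg):
    # Only submit if we obtained a valid pipe reference.
    pipe = conn.get_pipe()
    if pipe is None:
        return False
    pipe.send(msg)  # our own reference keeps the pipe object alive here
    return True
```

A sender that loses the race simply gets None back and drops out, instead of dereferencing a pointer the reaper has already freed.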