Vangelis Koukis [Thu, 9 Dec 2010 18:53:22 +0000 (20:53 +0200)]
Fix overflow in FileJournal::_open_file()
Running the unstable branch, mkcephfs fails when trying to create
a 3GB journal file on the OSDs.
Relevant messages from the osd logfile:
2010-12-09 19:03:54.419737 7fdde4d51720 journal _open_file: unable to extend journal to 18446744072560312320 bytes
2010-12-09 19:03:54.419789 7fdde4d51720 filestore(/osd) mkjournal error creating journal on /osd/journal
The problem is that the calculation of the journal size in bytes
overflows, in FileJournal::_open_file().
Signed-off-by: Vangelis Koukis <vkoukis@cslab.ece.ntua.gr>
Signed-off-by: Sage Weil <sage@newdream.net>
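The overflow described above can be illustrated with a small sketch. The commit message does not show the original expression or the configured size, so the function names below are hypothetical, and the 3000 MB value is an assumption chosen because it reproduces the byte count in the log when the product is computed in 32 bits and then widened.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reconstruction of the failure mode: the MB-to-bytes product
// is computed in 32 bits, wraps, and is then sign-extended to 64 bits.
// Unsigned arithmetic is used to model the wrap without undefined behavior.
uint64_t journal_bytes_32bit(int size_mb) {
  assert(size_mb >= 0);
  uint32_t wrapped = static_cast<uint32_t>(size_mb) * (1u << 20);  // mod 2^32
  // Reinterpret as signed (two's complement on common platforms).
  int32_t as_int = static_cast<int32_t>(wrapped);
  // Widening a negative int sign-extends, producing a huge unsigned value.
  return static_cast<uint64_t>(static_cast<int64_t>(as_int));
}

// Widening before the shift keeps the whole computation in 64 bits.
uint64_t journal_bytes_64bit(int size_mb) {
  assert(size_mb >= 0);
  return static_cast<uint64_t>(size_mb) << 20;
}
```

Under these assumptions, journal_bytes_32bit(3000) yields 18446744072560312320, the value in the "unable to extend journal" message, while journal_bytes_64bit(3000) yields 3145728000.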
Sage Weil [Wed, 8 Dec 2010 23:53:13 +0000 (15:53 -0800)]
filejournal: reset last_committed_seq if we find journal to be invalid
If we read an event that's later than our expected entry, we set read_pos
to -1 and discard the journal. If that happens we also need to reset
last_committed_seq to avoid a crash like
Sage Weil [Tue, 7 Dec 2010 21:31:01 +0000 (13:31 -0800)]
mds: sync->mix replica state is sync->mix(2)
When auth first moves to sync->mix,
- auth sends AC_MIX to replicas
- replicas go to sync->mix
- replicas finish gather, send AC_SYNCACK, move to sync->mix(2)
- auth gets all acks, sends AC_MIX again
- replica moves to MIX
So any new replica should just get sync->mix(2), so that it is not confused
by the second AC_MIX.
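The replica-side sequence above can be sketched as a toy state machine. This is not the MDS lock code: the enum and function names are hypothetical, and only the transitions listed in the commit message are modeled.

```cpp
// Toy model of the replica states named in the commit message.
enum class LockState { Sync, SyncMix, SyncMix2, Mix };

// Replica receives AC_MIX from auth.
LockState on_ac_mix(LockState s) {
  switch (s) {
    case LockState::Sync:     return LockState::SyncMix;  // first AC_MIX
    case LockState::SyncMix2: return LockState::Mix;      // second AC_MIX
    default:                  return s;  // not described in the source; left unchanged here
  }
}

// Replica finishes its gather (and would send AC_SYNCACK at this point).
LockState on_gather_finished(LockState s) {
  return s == LockState::SyncMix ? LockState::SyncMix2 : s;
}

// State handed to a replica that joins while auth is in sync->mix:
// it starts in sync->mix(2) so the second AC_MIX moves it straight to MIX.
LockState initial_replica_state(LockState auth_state) {
  return auth_state == LockState::SyncMix ? LockState::SyncMix2 : auth_state;
}
```

In this model a replica created with initial_replica_state(LockState::SyncMix) reaches Mix on the next AC_MIX, the same end state as a replica that went through the full sequence.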
Sage Weil [Tue, 7 Dec 2010 19:15:56 +0000 (11:15 -0800)]
mds: open undef dirfrags during rejoin
Any invented dirfrags have a version of 0. This will cause problems later
if we pre_dirty() anything in that dir because the dir version won't be
in sync (it'll be way too small). Also, we can do that at any point,
e.g. when flushing dirty caps, and aren't allowed to delay, so we need to
load those dirfrags now.
In theory we could read only the fnode and not all the dentries, but we
may as well. We should be more careful about memory than this patch is,
though.
Sage Weil [Tue, 7 Dec 2010 17:06:47 +0000 (09:06 -0800)]
mds: send LOCKFLUSHED to trigger finish_flush on replicas
Since f741766a we have triggered start_flush and finish_flush on replicas.
The problem is that the finish_flush didn't always happen for the mix->lock
case: we would start_flush when we sent the AC_LOCKACK, but could only
finish_flush if/when we got another SYNC or MIX. If the primary stayed in
the LOCK state, we would keep our flushing flag. That in turn causes
problems later when we try to eval_gather() (esp if we are auth at that
point?).
Fix this by sending an explicit AC_LOCKFLUSHED message to replicas after
we do a scatter_writebehind. The replica will only set flushing if it
flushed dirty data, which forces scatter_writebehind, so we will always
get the LOCKFLUSHED to match. Replicas that didn't flush will also get
it, but oh well. To avoid that we'd need to keep track of which ones sent
dirty data.
TODO: still need to verify that this is correct for rejoin.
Sage Weil [Tue, 7 Dec 2010 15:58:01 +0000 (07:58 -0800)]
mds: clear EXPORTINGCAPS on export_reverse
We need to reverse the effects of encode_export_inode_caps(), which is just
the pin and state bit.
The original problem can be reproduced with
- ceph tell mds 0 injectargs '--mds-kill-import-at 5'
- restart mds
- recovery completes successfully
- wait for the subtree to be reexported
- fail with bad EXPORTINGCAPS get in encode_export_inode_caps
derr was really just an alias for STDERR. Unfortunately, after we call
daemonize, STDERR is connected to /dev/null. So just replace calls to
derr with dout so that our important messages don't get lost.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Rather than having to write logclient.log(LOG_ERROR, ss), coders can now
write clog.error() << "str". Auto-flushing, if enabled, is still
handled automatically.
Rename instances of LogClient to clog (central log) for consistency and
brevity.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
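The stream-style interface described above can be sketched as follows. This is an illustration of the general pattern only, not the Ceph implementation: CentralLog, LogLine, submit() and the entries vector are hypothetical names, and the "submit when the temporary is destroyed" behavior is an assumption about how such an interface is commonly built.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

enum class Level { Info, Warn, Error };

class CentralLog;

// Temporary returned by CentralLog::error(); collects text via operator<<
// and hands the finished message to the log when it is destroyed.
class LogLine {
 public:
  LogLine(CentralLog &log, Level lvl) : log_(log), lvl_(lvl) {}
  LogLine(LogLine &&o)
      : log_(o.log_), lvl_(o.lvl_), ss_(std::move(o.ss_)), active_(o.active_) {
    o.active_ = false;  // the moved-from line must not submit
  }
  ~LogLine();

  template <typename T>
  LogLine &operator<<(const T &v) {
    ss_ << v;
    return *this;
  }

 private:
  CentralLog &log_;
  Level lvl_;
  std::ostringstream ss_;
  bool active_ = true;
};

class CentralLog {
 public:
  LogLine error() { return LogLine(*this, Level::Error); }
  void submit(Level lvl, const std::string &msg) {
    entries.emplace_back(lvl, msg);
  }
  std::vector<std::pair<Level, std::string>> entries;  // stand-in for the real sink
};

inline LogLine::~LogLine() {
  if (active_) log_.submit(lvl_, ss_.str());
}
```

With a CentralLog instance named clog, the statement clog.error() << "str"; appends one entry at the end of the full expression, which is what makes the one-line call form possible.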
Sage Weil [Mon, 6 Dec 2010 22:01:28 +0000 (14:01 -0800)]
osd: drop not-quite-copy constructor for object_info_t
Making a copy-like constructor that doesn't actually copy is confusing
and error prone. In this case, we initialized a clone's object_info with
the head's snapid, causing problems with what info was encoded and crashing
later in the snap_trimmer. Here the one caller already called
copy_user_bits(); let's move the lost copy there.
Tune Debian packaging for the upcoming v0.24 release.
This includes switching the OpenSSL dependency to Crypto++, as it's being
used instead of the former; removing radosacl, as it's not compiled anymore;
and pristine-cleaning the source. Explicitly note that this is a 1.0 package
format.
Jim Schutt [Thu, 2 Dec 2010 19:41:35 +0000 (12:41 -0700)]
msgr: Correctly handle half-open connections.
If poll() says a socket is ready for reading, but zero bytes
are read, that means that the peer has sent a FIN. Handle that.
One way the incorrect handling was manifesting is as follows:
Under a heavy write load, clients log many messages like this:
[19021.523192] libceph: tid 876 timed out on osd6, will reset osd
[19021.523328] libceph: tid 866 timed out on osd10, will reset osd
[19081.616032] libceph: tid 841 timed out on osd0, will reset osd
[19081.616121] libceph: tid 826 timed out on osd2, will reset osd
[19081.616176] libceph: tid 806 timed out on osd3, will reset osd
[19081.616226] libceph: tid 875 timed out on osd9, will reset osd
[19081.616275] libceph: tid 834 timed out on osd12, will reset osd
[19081.616326] libceph: tid 874 timed out on osd10, will reset osd
After the clients are done writing and the file system should
be quiet, osd hosts have a high load with many active threads:
$ ps u -C cosd
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1383 162 11.5 1456248 943224 ? Ssl 11:31 406:59 /usr/bin/cosd -i 7 -c /etc/ceph/ceph.conf
$ for p in `ps -C cosd -o pid --no-headers`; do grep -nH State /proc/$p/task/*/status | grep -v sleep; done
/proc/1383/task/10702/status:2:State: R (running)
/proc/1383/task/10710/status:2:State: R (running)
/proc/1383/task/10717/status:2:State: R (running)
/proc/1383/task/11396/status:2:State: R (running)
/proc/1383/task/27111/status:2:State: R (running)
/proc/1383/task/27117/status:2:State: R (running)
/proc/1383/task/27162/status:2:State: R (running)
/proc/1383/task/27694/status:2:State: R (running)
/proc/1383/task/27704/status:2:State: R (running)
/proc/1383/task/27728/status:2:State: R (running)
With this fix applied, a heavy load still causes many client
resets of osds, but no runaway threads result.
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
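The condition handled by the msgr fix above (poll() reports the socket readable, but the read returns zero bytes) can be sketched with plain POSIX calls. This is a minimal illustration, not the messenger code: ReadStatus and read_when_ready() are hypothetical names.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

enum class ReadStatus { Data, PeerClosed, Timeout, Error };

// Wait until fd is readable, then read once. A zero-byte read on a stream
// socket that poll() reported as readable means the peer has sent a FIN.
ReadStatus read_when_ready(int fd, char *buf, size_t len, int timeout_ms,
                           ssize_t *nread) {
  assert(len > 0);  // a zero-length request would also return 0

  struct pollfd pfd;
  pfd.fd = fd;
  pfd.events = POLLIN;
  pfd.revents = 0;

  int rc;
  do {
    rc = poll(&pfd, 1, timeout_ms);
  } while (rc < 0 && errno == EINTR);
  if (rc < 0) return ReadStatus::Error;
  if (rc == 0) return ReadStatus::Timeout;

  ssize_t n;
  do {
    n = recv(fd, buf, len, 0);
  } while (n < 0 && errno == EINTR);
  if (n < 0) return ReadStatus::Error;
  if (n == 0) return ReadStatus::PeerClosed;  // stop polling this socket

  *nread = n;
  return ReadStatus::Data;
}
```

A caller that treats PeerClosed like "no data yet" and polls again will spin, because poll() keeps reporting the socket as readable. That is consistent with the always-running threads shown in the ps output.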
object_info_t has one constructor that initializes everything from a
bufferlist. This means that the decode function needs to give default
values to fields in object_info_t that aren't found in the bufferlist.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
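The point about decode-time defaults can be sketched with a versioned decoder. Everything here is hypothetical: the Info struct, its fields, the version numbers and the byte layout are invented for illustration and do not reflect object_info_t or the bufferlist encoding. The sketch also copies integers in host byte order for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

struct Info {
  uint64_t size;
  uint32_t flags;  // assumed to have been added in encoding version 2

  // The only constructor: every field must end up initialized by decode().
  explicit Info(const std::vector<uint8_t> &buf) { decode(buf); }

  void decode(const std::vector<uint8_t> &buf) {
    size_t off = 0;
    uint8_t version = 0;
    read(buf, off, &version, sizeof(version));
    read(buf, off, &size, sizeof(size));
    if (version >= 2)
      read(buf, off, &flags, sizeof(flags));
    else
      flags = 0;  // field absent in older encodings: give it a default
  }

 private:
  static void read(const std::vector<uint8_t> &buf, size_t &off, void *out,
                   size_t n) {
    if (buf.size() - off < n) throw std::out_of_range("short buffer");
    std::memcpy(out, buf.data() + off, n);
    off += n;
  }
};

// Helper for building test inputs; version 1 omits the flags field.
std::vector<uint8_t> encode(uint8_t version, uint64_t size, uint32_t flags) {
  std::vector<uint8_t> buf(1 + sizeof(size) + (version >= 2 ? sizeof(flags) : 0));
  buf[0] = version;
  std::memcpy(buf.data() + 1, &size, sizeof(size));
  if (version >= 2) std::memcpy(buf.data() + 1 + sizeof(size), &flags, sizeof(flags));
  return buf;
}
```

Without the else branch, an Info built from a version 1 buffer would carry an uninitialized flags value, which is the kind of problem the commit message describes.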