Vangelis Koukis [Thu, 9 Dec 2010 18:53:22 +0000 (20:53 +0200)]
Fix overflow in FileJournal::_open_file()
Running the unstable branch, mkcephfs fails when trying to create
a 3GB journal file on the OSDs.
Relevant messages from the osd logfile:
2010-12-09 19:03:54.419737 7fdde4d51720 journal _open_file: unable to extend journal to 18446744072560312320 bytes
2010-12-09 19:03:54.419789 7fdde4d51720 filestore(/osd) mkjournal error creating journal on /osd/journal
The problem is that the calculation of the journal size in bytes
overflows, in FileJournal::_open_file().
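A minimal sketch of the bug (variable names are hypothetical; the journal size option is given in megabytes):

    int megs = 3072;                       // a 3 GB journal
    uint64_t bad  = megs << 20;            // 32-bit shift overflows first,
                                           // then sign-extends to a huge value
    uint64_t good = (uint64_t)megs << 20;  // widen before shifting: 3221225472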
Signed-off-by: Vangelis Koukis <vkoukis@cslab.ece.ntua.gr>
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Wed, 8 Dec 2010 23:53:13 +0000 (15:53 -0800)]
filejournal: reset last_committed_seq if we find the journal to be invalid
If we read an event that's later than our expected entry, we set read_pos
to -1 and discard the journal. If that happens, we also need to reset
last_committed_seq to avoid a subsequent crash.
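A sketch of the fix, with field names assumed from the description above:

    // In read_entry(): the entry we just read is later than expected,
    // so the rest of the journal is invalid.
    if (h.seq > next_seq) {
      read_pos = -1;           // discard the journal from here on
      last_committed_seq = 0;  // reset too, or a stale value trips an
                               // assert during the next commit
      return false;
    }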
Sage Weil [Tue, 7 Dec 2010 21:31:01 +0000 (13:31 -0800)]
mds: sync->mix replica state is sync->mix(2)
When auth first moves to sync->mix,
- auth sends AC_MIX to replicas
- replicas go to sync->mix
- replicas finish gather, send AC_SYNCACK, move to sync->mix(2)
- auth gets all acks, sends AC_MIX again
- replica moves to MIX
So any new replica should just get sync->mix(2), so that it is not confused
by the second AC_MIX.
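A sketch of how the lock could pick the state handed to brand-new replicas (function and constant names are assumptions modeled on the states above):

    // When replicating a lock mid-transition, give new replicas the
    // second-stage state so the follow-up AC_MIX is expected.
    int get_replica_state() const {
      switch (state) {
      case LOCK_SYNC_MIX:
      case LOCK_SYNC_MIX2:
        return LOCK_SYNC_MIX2;   // new replicas join at sync->mix(2)
      default:
        return state;
      }
    }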
Sage Weil [Tue, 7 Dec 2010 19:15:56 +0000 (11:15 -0800)]
mds: open undef dirfrags during rejoin
Any invented dirfrags have a version of 0. This will cause problems later
if we pre_dirty() anything in that dir because the dir version won't be
in sync (it'll be way too small). Also, pre_dirty() can happen at any
point, e.g. when flushing dirty caps, and we aren't allowed to delay it,
so we need to load those dirfrags now.
In theory we could read only the fnode and not all the dentries, but we
may as well load everything. We should be more careful about memory than
this patch is, though.
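A sketch of the rejoin-time load, under assumed names:

    // During rejoin: any dirfrag we invented has version 0 and must be
    // read from disk before anything in it can be pre_dirty()'d.
    for (CDir *dir : undef_dirfrags) {
      assert(dir->get_version() == 0);
      dir->fetch(gather.new_sub());  // loads the fnode (and dentries)
    }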
Sage Weil [Tue, 7 Dec 2010 17:06:47 +0000 (09:06 -0800)]
mds: send LOCKFLUSHED to trigger finish_flush on replicas
Since f741766a we have triggered start_flush and finish_flush on replicas.
The problem is that finish_flush didn't always happen for the mix->lock
case: we would start_flush when we sent the AC_LOCKACK, but could only
finish_flush if/when we got another SYNC or MIX. If the primary stayed in
the LOCK state, we would keep our flushing flag. That in turn causes
problems later when we try to eval_gather() (especially if we are auth at
that point?).
Fix this by sending an explicit AC_LOCKFLUSHED message to replicas after
we do a scatter_writebehind. The replica will only set flushing if it
flushed dirty data, which forces scatter_writebehind, so we will always
get the LOCKFLUSHED to match. Replicas that didn't flush will also get
it, but oh well. We'd need to keep track of which ones sent dirty data to
do that properly, though.
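A sketch of the auth-side send (the message constant and callback name are assumptions):

    // After scatter_writebehind() completes on the auth, tell replicas
    // the flush is done so they can clear their flushing flag.
    void scatter_writebehind_finish(ScatterLock *lock) {
      // ... dirty fnode has been written out ...
      send_lock_message(lock, LOCK_AC_LOCKFLUSHED);
    }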
TODO: still need to verify that this is correct for rejoin.
Sage Weil [Tue, 7 Dec 2010 15:58:01 +0000 (07:58 -0800)]
mds: clear EXPORTINGCAPS on export_reverse
We need to reverse the effects of encode_export_inode_caps(), which is just
the pin and state bit.
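The reversal is small; a sketch with names assumed to mirror encode_export_inode_caps():

    // In Migrator::export_reverse(): undo encode_export_inode_caps(),
    // which set a state bit and took a matching pin.
    in->state_clear(CInode::STATE_EXPORTINGCAPS);
    in->put(CInode::PIN_EXPORTINGCAPS);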
The original problem can be reproduced with
- ceph tell mds 0 injectargs '--mds-kill-import-at 5'
- restart mds
- recovery completes successfully
- wait for the subtree to be reexported
- fail with bad EXPORTINGCAPS get in encode_export_inode_caps
Sage Weil [Mon, 6 Dec 2010 22:01:28 +0000 (14:01 -0800)]
osd: drop not-quite-copy constructor for object_info_t
Making a copy-like constructor that doesn't actually copy is confusing
and error-prone. In this case, we initialized a clone's object_info with
the head's snapid, causing problems with what info was encoded and crashing
later in the snap_trimmer. Here the one caller already called
copy_user_bits(); let's move the lost copy there.
Tune Debian packaging for the upcoming v0.24 release.
This includes switching the dependency from OpenSSL to Crypto++, since
Crypto++ is now used instead; removing radosacl, since it is no longer
compiled; and pristine-cleaning the source. Explicitly note that this is
in the 1.0 package format.
Jim Schutt [Thu, 2 Dec 2010 19:41:35 +0000 (12:41 -0700)]
msgr: Correctly handle half-open connections.
If poll() says a socket is ready for reading, but zero bytes
are read, that means that the peer has sent a FIN. Handle that.
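A minimal sketch of the detection (surrounding names are hypothetical): poll() already reported the socket readable, so a zero-byte read can only mean the peer closed its end.

    ssize_t r = ::recv(sd, buf, len, 0);
    if (r == 0) {
      // Peer sent FIN: connection is half-open. Fail the read so the
      // connection is torn down, rather than spinning on a dead socket.
      return -1;
    }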
One way the incorrect handling was manifesting is as follows:
Under a heavy write load, clients log many messages like this:
[19021.523192] libceph: tid 876 timed out on osd6, will reset osd
[19021.523328] libceph: tid 866 timed out on osd10, will reset osd
[19081.616032] libceph: tid 841 timed out on osd0, will reset osd
[19081.616121] libceph: tid 826 timed out on osd2, will reset osd
[19081.616176] libceph: tid 806 timed out on osd3, will reset osd
[19081.616226] libceph: tid 875 timed out on osd9, will reset osd
[19081.616275] libceph: tid 834 timed out on osd12, will reset osd
[19081.616326] libceph: tid 874 timed out on osd10, will reset osd
After the clients are done writing and the file system should
be quiet, osd hosts have a high load with many active threads:
$ ps u -C cosd
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1383 162 11.5 1456248 943224 ? Ssl 11:31 406:59 /usr/bin/cosd -i 7 -c /etc/ceph/ceph.conf
$ for p in `ps -C cosd -o pid --no-headers`; do grep -nH State /proc/$p/task/*/status | grep -v sleep; done
/proc/1383/task/10702/status:2:State: R (running)
/proc/1383/task/10710/status:2:State: R (running)
/proc/1383/task/10717/status:2:State: R (running)
/proc/1383/task/11396/status:2:State: R (running)
/proc/1383/task/27111/status:2:State: R (running)
/proc/1383/task/27117/status:2:State: R (running)
/proc/1383/task/27162/status:2:State: R (running)
/proc/1383/task/27694/status:2:State: R (running)
/proc/1383/task/27704/status:2:State: R (running)
/proc/1383/task/27728/status:2:State: R (running)
With this fix applied, a heavy load still causes many client
resets of osds, but no runaway threads result.
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
object_info_t has one constructor that initializes everything from a
bufferlist. This means that the decode function needs to give default
values to fields in object_info_t that aren't found in the bufferlist.
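A sketch of what that implies for decode(), with a hypothetical versioned field:

    void object_info_t::decode(bufferlist::iterator &bl) {
      __u8 struct_v;
      ::decode(struct_v, bl);
      ::decode(soid, bl);
      if (struct_v >= 2)
        ::decode(lost, bl);  // newer encodings carry the field
      else
        lost = false;        // older ones don't: default it explicitly,
                             // since this is the only initialization path
    }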
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Wed, 1 Dec 2010 21:48:56 +0000 (13:48 -0800)]
filestore: call lower-level do_transactions() during journal replay
We used to call apply_transactions, which avoided rejournaling anything
because the journal wasn't writeable yet, but that uses all kinds of other
machinery that relies on threads and finishers and such that aren't
appropriate or necessary when we're just replaying journaled events.
Instead, call the lower-level do_transactions() directly.
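A sketch of the replay loop under this change (helper names are assumptions):

    // Journal replay runs single-threaded before the journal is
    // writeable, so call the low-level apply path directly.
    while (journal->read_entry(bl, seq)) {
      list<Transaction*> tls;
      unpack_transactions(bl, tls);  // hypothetical helper
      do_transactions(tls, seq);     // no re-journaling, no finishers
    }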
I've found the manpage problem that I noted before. It's about
monmaptool: the CLI gives its usage as:
[--print] [--create [--clobber]] [--add name 1.2.3.4:567] [--rm name]
<mapfilename>
But the manpage states this as an example:
monmaptool --create --add 192.168.0.10:6789 --add 192.168.0.11:6789 --add
192.168.0.12:6789 --clobber monmap
This is missing 'name' after the '--add' switch, resulting in
"invalid ip:port '--add'" as an error message. The attached patch fixes
this inconsistency.
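For reference, a corrected example matching the usage string (the monitor names a/b/c are placeholders):

    monmaptool --create --add a 192.168.0.10:6789 --add b 192.168.0.11:6789 \
        --add c 192.168.0.12:6789 --clobber monmap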
Create a copy constructor for object_info_t, since we often want to copy
an object_info_t and would rather not try to remember all the fields.
Drop the lost parameter from one of the other constructors, because it's
not used that much.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
In PG::mark_obj_as_lost, we have to mark a missing object as lost. We
should not assume that we have an old version of the missing object in
the ObjectStore. If the object doesn't exist in the object store, we
have to create it so that recovery can function correctly.
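A sketch of the guard, under assumed names:

    // PG::mark_obj_as_lost(): the missing object may not exist locally
    // at all; create an empty one so recovery has something to work on.
    if (!osd->store->exists(coll, oid)) {
      ObjectStore::Transaction t;
      t.touch(coll, oid);
      osd->store->apply_transaction(t);
    }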
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Tue, 30 Nov 2010 23:43:53 +0000 (15:43 -0800)]
mds: fix resolve for surviving observers
Make all survivors participate in resolve stage, so that survivors can
properly determine the outcome of migrations to the failed node that did
not complete.
The sequence (before):
- A starts to export /foo to B
- C has ambiguous auth (A,B) in its subtree map
- B journals import_start
- B fails
...
- B restarts
- B sends resolves to everyone
- does not claim /foo
- A sends resolve _only_ to B
- does claim /foo
- B knows its import did not complete
- C doesn't know anything
Also, the maybe_resolve_finish stuff was totally broken because the
recovery_set wasn't initialized.
See new (commented out) assert in Migrator.cc to reproduce the above.
In _process_pg_info, if the primary sends a replica a PG::Log, the
replica should merge that log into its own.
mark_all_unfound_as_lost / share_pg_log: don't send the whole PG::Log.
Just send the new entries that were just added when marking the objects
as lost.
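A sketch of the trimmed send (helper names are assumptions):

    // share_pg_log(): send only the entries appended while marking
    // objects lost, not the whole PG::Log.
    PG::Log partial;
    partial.copy_after(log, last_shared_version);  // tail only
    send_log_to_replicas(partial);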
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Sage Weil [Tue, 30 Nov 2010 20:48:32 +0000 (12:48 -0800)]
osd: fix misuses of OLOC_BLANK
Commit 6e2b594b fixed a bunch of bad get_object_context() calls, but even
with the parameter fixed some were still broken. Pass in a valid oloc in
those cases. The only places where OLOC_BLANK _is_ still used are where
we know we have the object locally and will load a valid value off disk.
Sage Weil [Tue, 30 Nov 2010 16:30:57 +0000 (08:30 -0800)]
filestore: make sure blocked op_start's wake up in order
If they wake up out of order (which, theoretically, they could before),
we can screw up the journal submission order in writebehind mode, or the
apply order in parallel and writeahead journal modes.
Sage Weil [Tue, 30 Nov 2010 16:24:57 +0000 (08:24 -0800)]
filestore: assert op_submit_finish is called in order
Verify/assert that we aren't screwing up the submission pipeline ordering.
Namely, we want to make sure that if op_apply_start() blocks, we wake up
in the proper order and don't screw up the journaling.
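A sketch of the assert (counter names are assumptions):

    void FileStore::op_submit_finish(uint64_t op_seq) {
      Mutex::Locker l(lock);
      assert(op_seq == op_submitted + 1);  // must finish in start order
      op_submitted = op_seq;
    }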
Sage Weil [Tue, 30 Nov 2010 15:54:42 +0000 (07:54 -0800)]
filejournal: rework journal FULL behavior and fix throttling
Keep distinct states for FULL, WAIT, and NOTFULL.
The old code was more or less correct at one point, but assumed the seq
changed on each commit, not on each operation; in its prior state it was
totally broken.
Also fix throttling (we were leaking items in the throttler that were
submitted while the journal was full).
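A sketch of the three states (names assumed):

    enum {
      FULL_NOTFULL,  // journal has room; normal operation
      FULL_FULL,     // out of room: new ops are not journaled at all
      FULL_WAIT      // space has been freed; wait for the committed seq
                     // to catch up before journaling new ops again
    };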
Sage Weil [Tue, 30 Nov 2010 15:51:16 +0000 (07:51 -0800)]
filestore: refactor op_queue/journal locking
- Combine journal_lock and lock.
- Move throttling outside of the lock (this fixes potential deadlock in
parallel journal mode)
- Make interface nomenclature a bit more helpful