git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
14 years agorpm: fix ceph.spec to work with gcephtool
Colin Patrick McCabe [Thu, 9 Dec 2010 19:46:28 +0000 (11:46 -0800)]
rpm: fix ceph.spec to work with gcephtool

Don't try to package gui_resources unless we are building the GUI.
Get GUI dependencies correct.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoFix overflow in FileJournal::_open_file()
Vangelis Koukis [Thu, 9 Dec 2010 18:53:22 +0000 (20:53 +0200)]
Fix overflow in FileJournal::_open_file()

Running the unstable branch, mkcephfs fails when trying to create
a 3GB journal file on the OSDs.

Relevant messages from the osd logfile:

2010-12-09 19:03:54.419737 7fdde4d51720 journal _open_file: unable to extend journal to 18446744072560312320 bytes
2010-12-09 19:03:54.419789 7fdde4d51720 filestore(/osd) mkjournal error creating journal on /osd/journal

The problem is that the calculation of the journal size in bytes
overflows, in FileJournal::_open_file().
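
The figure in the log above can be reproduced with a short sketch. It assumes (the message does not say this) that the size is configured in megabytes, held in a 32-bit signed integer, and shifted left by 20 to get bytes. A configured value of 3000 MB is also an assumption, chosen because it yields exactly the logged number. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the buggy calculation: MB -> bytes done in 32 bits.
// The 32-bit result is emulated with unsigned arithmetic because real signed
// overflow is undefined behaviour in C++.
uint64_t journal_bytes_32bit(int32_t size_mb) {
  assert(size_mb >= 0);
  uint32_t shifted = static_cast<uint32_t>(size_mb) << 20;  // kept to 32 bits
  // Reinterpret as signed; values above INT32_MAX become negative on
  // two's-complement platforms (guaranteed since C++20).
  int32_t as_signed = static_cast<int32_t>(shifted);
  // Sign extension to 64 bits produces the huge value seen in the log.
  return static_cast<uint64_t>(static_cast<int64_t>(as_signed));
}

// Widening before the shift keeps the result in range.
uint64_t journal_bytes_64bit(int32_t size_mb) {
  assert(size_mb >= 0);
  return static_cast<uint64_t>(size_mb) << 20;
}
```

Under these assumptions, journal_bytes_32bit(3000) gives 18446744072560312320, the value in the log, while journal_bytes_64bit(3000) gives 3145728000.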

Signed-off-by: Vangelis Koukis <vkoukis@cslab.ece.ntua.gr>
Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoReplicatedPG.cc: Fixes a bug in snap_trimmer where a pointer to a stack
Samuel Just [Thu, 9 Dec 2010 18:25:39 +0000 (10:25 -0800)]
ReplicatedPG.cc: Fixes a bug in snap_trimmer where a pointer to a stack
Cond is left in the mode.waiting_cond list.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agoReplicatedPG: snap_trimmer now acquires a read lock on the osd map
Samuel Just [Thu, 9 Dec 2010 18:24:34 +0000 (10:24 -0800)]
ReplicatedPG: snap_trimmer now acquires a read lock on the osd map
before calling share_pg_info.

Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
14 years agorpm: don't try to package radosacl
Colin Patrick McCabe [Thu, 9 Dec 2010 18:59:57 +0000 (10:59 -0800)]
rpm: don't try to package radosacl

radosacl is just a test binary, so unless we build with --with-debug, we
won't get it.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agorpm: add pkgconfig to BuildRequires
Colin Patrick McCabe [Thu, 9 Dec 2010 18:39:34 +0000 (10:39 -0800)]
rpm: add pkgconfig to BuildRequires

You can't build without pkgconfig.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agorpm: set files-attr for radosgw
Colin Patrick McCabe [Thu, 9 Dec 2010 18:26:55 +0000 (10:26 -0800)]
rpm: set files-attr for radosgw

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agofilejournal: reset last_committed_seq if we find journal to be invalid
Sage Weil [Wed, 8 Dec 2010 23:53:13 +0000 (15:53 -0800)]
filejournal: reset last_committed_seq if we find journal to be invalid

If we read an event that's later than our expected entry, we set read_pos
to -1 and discard the journal.  If that happens we also need to reset
last_committed_seq to avoid a crash like

2010-12-08 17:04:39.246950 7f269d138910 journal commit_finish thru 16904
2010-12-08 17:04:39.246961 7f269d138910 journal committed_thru 16904 < last_committed_seq 37778589
os/FileJournal.cc: In function 'virtual void FileJournal::committed_thru(uint64_t)':
os/FileJournal.cc:854: FAILED assert(seq >= last_committed_seq)
 ceph version 0.24~rc (commit:fe10300317383ec29948d7dbe3cb31b3aa277e3c)
 1: (FileJournal::committed_thru(unsigned long)+0xad) [0x588e7d]
 2: (JournalingObjectStore::commit_finish()+0x8c) [0x57f2ec]
 3: (FileStore::sync_entry()+0xcff) [0x5764cf]
 4: (FileStore::SyncThread::entry()+0xd) [0x506d9d]
 5: (Thread::_entry_func(void*)+0xa) [0x4790ba]
 6: /lib/libpthread.so.0 [0x7f26a2f8373a]
 7: (clone()+0x6d) [0x7f26a1c2569d]

Fixes #631
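
The failure mode can be illustrated with a toy model. This is only a sketch: ToyJournal is hypothetical and keeps just the two fields named in the message, and it returns false where the real code asserts, so that both behaviours can be observed.

```cpp
#include <cstdint>

// Toy model, not the real FileJournal.
struct ToyJournal {
  uint64_t last_committed_seq = 0;
  int64_t read_pos = 0;

  // Discard the journal after reading an unexpected entry. reset_seq selects
  // between the old behaviour (false) and the fixed behaviour (true).
  void discard(bool reset_seq) {
    read_pos = -1;
    if (reset_seq)
      last_committed_seq = 0;
  }

  // The real code asserts seq >= last_committed_seq; the toy reports the
  // violation by returning false instead.
  bool committed_thru(uint64_t seq) {
    if (seq < last_committed_seq)
      return false;
    last_committed_seq = seq;
    return true;
  }
};
```

With a stale last_committed_seq of 37778589, committed_thru(16904) fails after discard(false) and succeeds after discard(true).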

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomon: use helper for clock drift check; log relative instead of absolute time
Sage Weil [Wed, 8 Dec 2010 19:12:51 +0000 (11:12 -0800)]
mon: use helper for clock drift check; log relative instead of absolute time

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: sync->mix replica state is sync->mix(2)
Sage Weil [Tue, 7 Dec 2010 21:31:01 +0000 (13:31 -0800)]
mds: sync->mix replica state is sync->mix(2)

When auth first moves to sync->mix,
 - auth sends AC_MIX to replicas
 - replicas go to sync->mix
 - replicas finish gather, send AC_SYNCACK, move to sync->mix(2)
 - auth gets all acks, sends AC_MIX again
 - replica moves to MIX

So any new replica should just get sync->mix(2), so that it is not confused
by the second AC_MIX.
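
The replica side of the sequence above can be sketched as a small state machine. The type and method names are hypothetical, message sending is not modelled, and how the real MDS reacts to an unexpected AC_MIX is not described here; the toy simply returns false in that case.

```cpp
// Replica lock states from the sequence above.
enum class LockState { SYNC, SYNC_MIX, SYNC_MIX2, MIX };

struct ToyReplicaLock {
  LockState state;

  // Handle AC_MIX from the auth. Returns false if the message is not
  // expected in the current state.
  bool handle_ac_mix() {
    switch (state) {
      case LockState::SYNC:       // first AC_MIX: start the transition
        state = LockState::SYNC_MIX;
        return true;
      case LockState::SYNC_MIX2:  // second AC_MIX: transition completes
        state = LockState::MIX;
        return true;
      default:
        return false;
    }
  }

  // Gather finished: the replica would send AC_SYNCACK here and advance.
  bool finish_gather() {
    if (state != LockState::SYNC_MIX)
      return false;
    state = LockState::SYNC_MIX2;
    return true;
  }
};
```

A replica created in SYNC_MIX2 handles the second AC_MIX cleanly, whereas one created in SYNC_MIX would see an AC_MIX it does not expect.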

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: do not choose lock state on replicas
Sage Weil [Tue, 7 Dec 2010 20:50:15 +0000 (12:50 -0800)]
mds: do not choose lock state on replicas

The lock state has already been set during rejoin.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: small rejoin cleanup
Sage Weil [Tue, 7 Dec 2010 20:45:04 +0000 (12:45 -0800)]
mds: small rejoin cleanup

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: rev mds cluster internal protocol
Sage Weil [Tue, 7 Dec 2010 19:26:24 +0000 (11:26 -0800)]
mds: rev mds cluster internal protocol

The lock encoding changed with the dirty bit on scatterlocks.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: fix replay of already-journaled requests
Sage Weil [Tue, 7 Dec 2010 19:21:39 +0000 (11:21 -0800)]
mds: fix replay of already-journaled requests

Check for already-completed tids for both retried and replayed requests.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: open undef dirfrags during rejoin
Sage Weil [Tue, 7 Dec 2010 19:15:56 +0000 (11:15 -0800)]
mds: open undef dirfrags during rejoin

Any invented dirfrags have a version of 0.  This will cause problems later
if we pre_dirty() anything in that dir because the dir version won't be
in sync (it'll be way too small).  Also, we can do that at any point,
e.g. when flushing dirty caps, and aren't allowed to delay, so we need to
load those dirfrags now.

In theory we could read only the fnode and not all the dentries, but we
may as well.  We should be more careful about memory than this patch is,
though.

Fixes #15.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: add missing try_clear_more() to scatterlock
Sage Weil [Tue, 7 Dec 2010 18:47:58 +0000 (10:47 -0800)]
mds: add missing try_clear_more() to scatterlock

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: explicitly pass scatterlock dirty flag to auth on gather
Sage Weil [Tue, 7 Dec 2010 18:47:30 +0000 (10:47 -0800)]
mds: explicitly pass scatterlock dirty flag to auth on gather

This ensures that if the replica thinks it is flushing something the
auth will always do a scatter_writebehind.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: send LOCKFLUSHED to trigger finish_flush on replicas
Sage Weil [Tue, 7 Dec 2010 17:06:47 +0000 (09:06 -0800)]
mds: send LOCKFLUSHED to trigger finish_flush on replicas

Since f741766a we have triggered start_flush and finish_flush on replicas.
The problem is that the finish_flush didn't always happen for the mix->lock
case: we would start_flush when we sent the AC_LOCKACK, but could only
finish_flush if/when we got another SYNC or MIX.  If the primary stayed in
the LOCK state, we would keep our flushing flag.  That in turn causes
problems later when we try to eval_gather() (esp if we are auth at that
point?).

Fix this by sending an explicit AC_LOCKFLUSHED message to replicas after
we do a scatter_writebehind.  The replica will only set flushing if it
flushed dirty data, which forces scatter_writebehind, so we will always
get the LOCKFLUSHED to match.  Replicas that didn't flush will also get
it, but oh well.  We'd need to keep track which ones sent dirty data to
do that properly, though.

TODO: still need to verify that this is correct for rejoin.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: clear EXPORTINGCAPS on export_reverse
Sage Weil [Tue, 7 Dec 2010 15:58:01 +0000 (07:58 -0800)]
mds: clear EXPORTINGCAPS on export_reverse

We need to reverse the effects of encode_export_inode_caps(), which is just
the pin and state bit.

The original problem can be reproduced with
 - ceph tell mds 0 injectargs '--mds-kill-import-at 5'
 - restart mds
 - recovery completes successfully
 - wait for the subtree to be reexported
 - fail with bad EXPORTINGCAPS get in encode_export_inode_caps

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: fix LOOKUPHASH to avoid creating bogus replica CDir
Sage Weil [Tue, 7 Dec 2010 00:31:56 +0000 (16:31 -0800)]
mds: fix LOOKUPHASH to avoid creating bogus replica CDir

We can't create the CDir if we are non-auth.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: introduce rejoin_invent_dirfrag() helper
Sage Weil [Mon, 6 Dec 2010 22:34:36 +0000 (14:34 -0800)]
mds: introduce rejoin_invent_dirfrag() helper

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoautomake: in scripts, use sysconfdir as-is
Colin Patrick McCabe [Tue, 7 Dec 2010 18:56:05 +0000 (10:56 -0800)]
automake: in scripts, use sysconfdir as-is

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoautomake: in deb pkg, use --sysconfdir=/etc
Colin Patrick McCabe [Tue, 7 Dec 2010 18:48:19 +0000 (10:48 -0800)]
automake: in deb pkg, use --sysconfdir=/etc

When building the debian packages, use --sysconfdir=/etc.

Also, don't fudge sysconfdir in the init-ceph script.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agomkcephfs: require -k; update man page
Sage Weil [Tue, 7 Dec 2010 06:17:47 +0000 (22:17 -0800)]
mkcephfs: require -k; update man page

Force users to specify keyring location; update man page accordingly.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoconfigure: detect crypto++ library
Yehuda Sadeh [Mon, 6 Dec 2010 23:25:08 +0000 (15:25 -0800)]
configure: detect crypto++ library

14 years agoosd: drop not-quite-copy constructor for object_info_t
Sage Weil [Mon, 6 Dec 2010 22:01:28 +0000 (14:01 -0800)]
osd: drop not-quite-copy constructor for object_info_t

Making a copy-like constructor that doesn't actually copy is confusing
and error prone.  In this case, we initialized a clone's object_info with
the head's snapid, causing problems with what info was encoded and crashing
later in the snap_trimmer.  Here the one caller already called
copy_user_bits(); let's move the lost copy there.

This backs out one of the changes in 0cc8d34e.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agolibrados: fix error path in rados_deinitialize
Colin Patrick McCabe [Mon, 6 Dec 2010 19:11:41 +0000 (11:11 -0800)]
librados: fix error path in rados_deinitialize

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agolibrados: fix the C++ interface init
Yehuda Sadeh [Mon, 6 Dec 2010 19:16:35 +0000 (11:16 -0800)]
librados: fix the C++ interface init

14 years agolibrados: fix C interface error handling in init code
Yehuda Sadeh [Mon, 6 Dec 2010 18:28:20 +0000 (10:28 -0800)]
librados: fix C interface error handling in init code

14 years agoclient: resync ioctl header from ceph-client.
Greg Farnum [Mon, 6 Dec 2010 17:59:50 +0000 (09:59 -0800)]
client: resync ioctl header from ceph-client.

Previous change to the CEPH_IOCTL_MAGIC in fbbf448 was incorrect!

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
14 years agoTune Debian packaging for the upcoming v0.24 release.
Laszlo Boszormenyi [Mon, 6 Dec 2010 05:27:23 +0000 (06:27 +0100)]
Tune Debian packaging for the upcoming v0.24 release.

This includes switching the OpenSSL dependency to Crypto++, as it's being used
instead of the former; removing radosacl, as it's not compiled anymore; and
pristine-cleaning the source. Explicitly note this is in a 1.0 package format.

14 years agoosd: search for unfound on osds in might_have_unfound
Sage Weil [Sun, 5 Dec 2010 05:28:55 +0000 (21:28 -0800)]
osd: search for unfound on osds in might_have_unfound

We were looking at 'up', which is just the set of OSDs we should be on in
the current epoch; nothing to do with where the objects might be found.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoMakefile: make radosacl build WITH_DEBUG only
Sage Weil [Sun, 5 Dec 2010 04:45:50 +0000 (20:45 -0800)]
Makefile: make radosacl build WITH_DEBUG only

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoceph.spec.in: update dependency
Yehuda Sadeh [Sat, 4 Dec 2010 00:12:49 +0000 (16:12 -0800)]
ceph.spec.in: update dependency

14 years agorgw: null terminate armor result
Yehuda Sadeh [Fri, 3 Dec 2010 23:20:45 +0000 (15:20 -0800)]
rgw: null terminate armor result

14 years agorgw: get rid of openssl altogether
Yehuda Sadeh [Fri, 3 Dec 2010 22:45:59 +0000 (14:45 -0800)]
rgw: get rid of openssl altogether

14 years agoconfigure: check for the presence of libcrypto++ header files
Yehuda Sadeh [Fri, 3 Dec 2010 21:35:55 +0000 (13:35 -0800)]
configure: check for the presence of libcrypto++ header files

14 years agocrypto: change include
Yehuda Sadeh [Fri, 3 Dec 2010 21:19:06 +0000 (13:19 -0800)]
crypto: change include

14 years agocommon: remove base64.c
Yehuda Sadeh [Fri, 3 Dec 2010 21:09:20 +0000 (13:09 -0800)]
common: remove base64.c

14 years agocrypto: remove old openssl implementation
Yehuda Sadeh [Fri, 3 Dec 2010 21:06:42 +0000 (13:06 -0800)]
crypto: remove old openssl implementation

14 years agomakefile.am: most binaries (except rgw_*) don't link with openssl
Yehuda Sadeh [Fri, 3 Dec 2010 21:04:41 +0000 (13:04 -0800)]
makefile.am: most binaries (except rgw_*) don't link with openssl

14 years agocommon: use ceph_armor instead of openssl based functions
Yehuda Sadeh [Fri, 3 Dec 2010 20:48:26 +0000 (12:48 -0800)]
common: use ceph_armor instead of openssl based functions

also modify ceph_[un]armor to get dest buffer length

14 years agocrypto: test for allocation failure, cleanup
Yehuda Sadeh [Fri, 3 Dec 2010 20:47:13 +0000 (12:47 -0800)]
crypto: test for allocation failure, cleanup

14 years agocrypto: use crypto++ for aes instead of openssl
Yehuda Sadeh [Fri, 3 Dec 2010 01:23:16 +0000 (17:23 -0800)]
crypto: use crypto++ for aes instead of openssl

need to implement it more efficiently, currently going through a string object

14 years agoosd: remove poid/soid from ScrubMap::object; clean up callers
Sage Weil [Fri, 3 Dec 2010 17:56:15 +0000 (09:56 -0800)]
osd: remove poid/soid from ScrubMap::object; clean up callers

The soid is in the key in the map; no need to store it in the value.
Update the scrub code appropriately.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomon: fix typo
Sage Weil [Fri, 3 Dec 2010 17:47:50 +0000 (09:47 -0800)]
mon: fix typo

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomake: create log directories and tmp directories
Colin Patrick McCabe [Fri, 3 Dec 2010 17:35:55 +0000 (09:35 -0800)]
make: create log directories and tmp directories

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agomsgr: Correctly handle half-open connections.
Jim Schutt [Thu, 2 Dec 2010 19:41:35 +0000 (12:41 -0700)]
msgr: Correctly handle half-open connections.

If poll() says a socket is ready for reading, but zero bytes
are read, that means that the peer has sent a FIN.  Handle that.

One way the incorrect handling was manifesting is as follows:

Under a heavy write load, clients log many messages like this:

[19021.523192] libceph:  tid 876 timed out on osd6, will reset osd
[19021.523328] libceph:  tid 866 timed out on osd10, will reset osd
[19081.616032] libceph:  tid 841 timed out on osd0, will reset osd
[19081.616121] libceph:  tid 826 timed out on osd2, will reset osd
[19081.616176] libceph:  tid 806 timed out on osd3, will reset osd
[19081.616226] libceph:  tid 875 timed out on osd9, will reset osd
[19081.616275] libceph:  tid 834 timed out on osd12, will reset osd
[19081.616326] libceph:  tid 874 timed out on osd10, will reset osd

After the clients are done writing and the file system should
be quiet, osd hosts have a high load with many active threads:

$ ps u -C cosd
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1383  162 11.5 1456248 943224 ?      Ssl  11:31 406:59 /usr/bin/cosd -i 7 -c /etc/ceph/ceph.conf

$ for p in `ps -C cosd -o pid --no-headers`; do grep -nH State /proc/$p/task/*/status | grep -v sleep; done
/proc/1383/task/10702/status:2:State:   R (running)
/proc/1383/task/10710/status:2:State:   R (running)
/proc/1383/task/10717/status:2:State:   R (running)
/proc/1383/task/11396/status:2:State:   R (running)
/proc/1383/task/27111/status:2:State:   R (running)
/proc/1383/task/27117/status:2:State:   R (running)
/proc/1383/task/27162/status:2:State:   R (running)
/proc/1383/task/27694/status:2:State:   R (running)
/proc/1383/task/27704/status:2:State:   R (running)
/proc/1383/task/27728/status:2:State:   R (running)

With this fix applied, a heavy load still causes many client
resets of osds, but no runaway threads result.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomake: create /etc/ceph if it doesn't exist
Colin Patrick McCabe [Fri, 3 Dec 2010 01:31:45 +0000 (17:31 -0800)]
make: create /etc/ceph if it doesn't exist

make: create /etc/ceph if it doesn't exist. On uninstall, remove the
directory if it's empty. (Never remove a user's config file, though.)

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoosd: object_info_t: decode old versions correctly
Colin Patrick McCabe [Fri, 3 Dec 2010 00:53:44 +0000 (16:53 -0800)]
osd: object_info_t: decode old versions correctly

object_info_t has one constructor that initializes everything from a
bufferlist. This means that the decode function needs to give default
values to fields in object_info_t that aren't found in the bufferlist.
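
A minimal sketch of the idea, using a toy struct rather than the real object_info_t. The wire format here is invented for illustration (one version byte, an 8-byte little-endian size, and a lost byte from version 2 on), and the choice of lost as the newer field is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a versioned, decodable struct.
struct ToyObjectInfo {
  uint64_t size = 0;
  bool lost = false;

  // Decode from a byte buffer. Returns false if the buffer is too short.
  bool decode(const std::vector<uint8_t>& bl) {
    if (bl.size() < 9)
      return false;
    const uint8_t struct_v = bl[0];

    size = 0;
    for (size_t i = 0; i < 8; ++i)
      size |= static_cast<uint64_t>(bl[1 + i]) << (8 * i);

    if (struct_v >= 2) {
      if (bl.size() < 10)
        return false;
      lost = bl[9] != 0;
    } else {
      // Field absent in old encodings: assign an explicit default rather
      // than leaving whatever value the object held before.
      lost = false;
    }
    return true;
  }
};
```

Decoding a version 1 buffer into an object whose lost flag was previously true leaves lost set to false.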

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoman: add man page for cephfs
Greg Farnum [Thu, 2 Dec 2010 22:14:50 +0000 (14:14 -0800)]
man: add man page for cephfs

Add to Makefile, debian, and ceph.spec.in bits

14 years agoosd: fix log tail vs last_complete assert on replica activation
Sage Weil [Wed, 1 Dec 2010 23:40:28 +0000 (15:40 -0800)]
osd: fix log tail vs last_complete assert on replica activation

The last_complete may be below the log tail IFF we have a backlog.

Fixes 756918be3b24d8164699da301ddfbc8e6fd6b751.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agofilestore: call lower-level do_transactions() during journal replay
Sage Weil [Wed, 1 Dec 2010 21:48:56 +0000 (13:48 -0800)]
filestore: call lower-level do_transactions() during journal replay

We used to call apply_transactions, which avoided rejournaling anything
because the journal wasn't writeable yet, but that uses all kinds of other
machinery that relies on threads and finishers and such that aren't
appropriate or necessary when we're just replaying journaled events.

Instead, call the lower-level do_transactions() directly.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agofilestore: do journal mode autodetect and sanity check _before_ replay
Sage Weil [Wed, 1 Dec 2010 21:46:30 +0000 (13:46 -0800)]
filestore: do journal mode autodetect and sanity check _before_ replay

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agofilestore: fix journal locking on trailing mode
Sage Weil [Wed, 1 Dec 2010 19:05:11 +0000 (11:05 -0800)]
filestore: fix journal locking on trailing mode

We're already holding journal_lock due to the surrounding
op_submit_{start,finish}.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoMerge branch 'testing' into rc
Sage Weil [Wed, 1 Dec 2010 18:20:43 +0000 (10:20 -0800)]
Merge branch 'testing' into rc

Conflicts:
configure.ac

14 years agorbd: use MIN instead of min()
Sage Weil [Wed, 1 Dec 2010 17:51:27 +0000 (09:51 -0800)]
rbd: use MIN instead of min()

Not even sure where min() was coming from, but it seems to be missing on
i386 lucid:

g++ -DHAVE_CONFIG_H -I.     -Wall -D__CEPH__ -D_FILE_OFFSET_BITS=64 -D_REENTRANT -D_THREAD_SAFE -rdynamic -g -O2 -MT rbd.o -MD -MP -MF .deps/rbd.Tpo -c -o rbd.o rbd.cc
rbd.cc: In function 'int do_import(void*, const char*, int, const char*)':
rbd.cc:837: error: no matching function for call to 'min(uint64_t&, off_t)'
make[3]: *** [rbd.o] Error 1
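
For context, std::min deduces a single type for both arguments, so a call mixing uint64_t and off_t does not compile without help. The sketch below shows two ways around that. The MIN definition is a hypothetical stand-in (the macro actually used by the tree is not shown here), and the function names are invented.

```cpp
#include <sys/types.h>
#include <algorithm>
#include <cassert>
#include <cstdint>

#ifndef MIN
// Hypothetical definition; arguments are evaluated more than once.
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
#endif

// std::min(remaining, chunk) would fail to compile because the two
// argument types differ. Converting explicitly makes the intent clear.
uint64_t clamp_with_std_min(uint64_t remaining, off_t chunk) {
  assert(chunk >= 0);
  return std::min<uint64_t>(remaining, static_cast<uint64_t>(chunk));
}

// Macro variant: no template deduction is involved.
uint64_t clamp_with_macro(uint64_t remaining, off_t chunk) {
  assert(chunk >= 0);
  return MIN(remaining, static_cast<uint64_t>(chunk));
}
```

The explicit cast in both variants avoids a signed/unsigned comparison, at the cost of requiring a non-negative chunk.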

Reported-by: John Leach <john@johnleach.co.uk>
Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoclient: connect to export targets on cap EXPORT
Sage Weil [Wed, 1 Dec 2010 17:44:58 +0000 (09:44 -0800)]
client: connect to export targets on cap EXPORT

Also unconditionally connect on reconnect, even when there aren't any
outstanding requests.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoceph v0.23.2 v0.23.2
Sage Weil [Wed, 1 Dec 2010 17:27:19 +0000 (09:27 -0800)]
ceph v0.23.2

14 years agofilestore: do not autodetect BTRFS_IOC_SNAP_CREATE_ASYNC until interface is finalized
Sage Weil [Wed, 1 Dec 2010 18:03:19 +0000 (10:03 -0800)]
filestore: do not autodetect BTRFS_IOC_SNAP_CREATE_ASYNC until interface is finalized

Li has proposed an alternative V2 ioctl that looks nicer, so wait until
that is finalized.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoclient: fix cap export handler
Sage Weil [Wed, 1 Dec 2010 17:44:26 +0000 (09:44 -0800)]
client: fix cap export handler

An EXPORT cap msg can race with a cap release; deal with that (realigning
this code with the kclient).

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoman: fix monmaptool man page
Laszlo Boszormenyi [Wed, 1 Dec 2010 17:24:45 +0000 (09:24 -0800)]
man: fix monmaptool man page

I've found the manpage problem that I noted before. It's about
monmaptool; the CLI says its usage is:
[--print] [--create [--clobber]] [--add name 1.2.3.4:567] [--rm name]
<mapfilename>
But the manpage gives this as an example:
monmaptool --create --add 192.168.0.10:6789 --add 192.168.0.11:6789 --add
192.168.0.12:6789 --clobber monmap
This definitely misses 'name' after the 'add' switch, resulting in
"invalid ip:port '--add'" as an error message. The attached patch fixes this
inconsistency.

Signed-off-by: Laszlo Boszormenyi <gcs@debian.hu>
14 years agoosd: simplify scrub sanity checks
Sage Weil [Wed, 1 Dec 2010 00:50:41 +0000 (16:50 -0800)]
osd: simplify scrub sanity checks

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: only adjust osd scrub_pending if pg was reserved
Sage Weil [Wed, 1 Dec 2010 00:50:25 +0000 (16:50 -0800)]
osd: only adjust osd scrub_pending if pg was reserved

If for some reason we enter scrub() without scrub_reserved == true, don't
adjust the osd->scrubs_pending or we'll screw up the accounting.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: fix import_reverse re-exporting of caps
Sage Weil [Wed, 1 Dec 2010 00:38:21 +0000 (16:38 -0800)]
mds: fix import_reverse re-exporting of caps

Make the import_reverse() set the pin/state before it clears them by using
the helper that sets them.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: turn off mds_bal_frag until resolve vs split/merge is fixed
Sage Weil [Wed, 1 Dec 2010 00:25:15 +0000 (16:25 -0800)]
mds: turn off mds_bal_frag until resolve vs split/merge is fixed

See #594

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoMerge remote branch 'origin/lost' into unstable
Sage Weil [Wed, 1 Dec 2010 00:11:20 +0000 (16:11 -0800)]
Merge remote branch 'origin/lost' into unstable

Conflicts:
src/osd/osd_types.h

14 years agoosd: refactor object_info_t constructor a bit
Colin Patrick McCabe [Tue, 30 Nov 2010 23:04:15 +0000 (15:04 -0800)]
osd: refactor object_info_t constructor a bit

Create a copy constructor for object_info_t, since we often want to copy
an object_info_t and would rather not try to remember all the fields.
Drop the lost parameter from one of the other constructors, because it's
not used that much.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoosd: share_pg_log: update peer_missing
Colin Patrick McCabe [Tue, 30 Nov 2010 22:43:51 +0000 (14:43 -0800)]
osd: share_pg_log: update peer_missing

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoosd: mark_obj_as_lost: fix oloc init, eversion
Colin Patrick McCabe [Tue, 30 Nov 2010 21:42:07 +0000 (13:42 -0800)]
osd: mark_obj_as_lost: fix oloc init, eversion

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoosd: mark_all_unfound_as_lost: bugfix, refactor
Colin Patrick McCabe [Mon, 29 Nov 2010 20:01:55 +0000 (12:01 -0800)]
osd: mark_all_unfound_as_lost: bugfix, refactor

mark_all_unfound_as_lost: just delete items from the rmissing set as we
find them, rather than using a multi-pass system.
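
The single-pass approach can be sketched with the usual erase-while-iterating idiom. The container types and names are simplified assumptions (versions as integers, object ids as strings), not the real PG data structures.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Walk a toy "rmissing" map once, erasing entries as they are handled.
// Returns the number of entries removed; the ids are appended to lost_out.
size_t mark_unfound_as_lost(std::map<uint64_t, std::string>& rmissing,
                            const std::set<std::string>& unfound,
                            std::vector<std::string>& lost_out) {
  size_t removed = 0;
  for (auto it = rmissing.begin(); it != rmissing.end();) {
    if (unfound.count(it->second)) {
      lost_out.push_back(it->second);
      it = rmissing.erase(it);  // erase returns the next valid iterator
      ++removed;
    } else {
      ++it;
    }
  }
  return removed;
}
```

Using the iterator returned by erase() avoids both a second pass and the use of an invalidated iterator.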

Update info.last_update as we go so that log printouts will look correct
(the log printout function checks info.last_update)

Don't remove from missing or missing_loc in mark_obj_as_lost.
PG::missing_loc should never have the soid, and PG::missing we handle
elsewhere.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoosd: mark_obj_as_lost: don't assume we have obj
Colin Patrick McCabe [Mon, 29 Nov 2010 19:33:39 +0000 (11:33 -0800)]
osd: mark_obj_as_lost: don't assume we have obj

In PG::mark_obj_as_lost, we have to mark a missing object as lost. We
should not assume that we have an old version of the missing object in
the ObjectStore. If the object doesn't exist in the object store, we
have to create it so that recovery can function correctly.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoosd: create lost2 test
Colin Patrick McCabe [Thu, 25 Nov 2010 05:15:00 +0000 (21:15 -0800)]
osd: create lost2 test

This one verifies:
1. Client asks for an unfound object and gets put to sleep
2. Object gets declared lost
3. Client wakes up

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoosd: mark_all_unfound_as_lost: set lost attr
Colin Patrick McCabe [Thu, 25 Nov 2010 04:55:14 +0000 (20:55 -0800)]
osd: mark_all_unfound_as_lost: set lost attr

In mark_all_unfound_as_lost, we need to set the lost bit in the objects'
object_info_t.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoradostool: fix memleak in error path
Colin Patrick McCabe [Thu, 25 Nov 2010 01:26:35 +0000 (17:26 -0800)]
radostool: fix memleak in error path

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoosd: mark_all_unfound_as_lost: wake waiters
Colin Patrick McCabe [Wed, 24 Nov 2010 06:04:53 +0000 (22:04 -0800)]
osd: mark_all_unfound_as_lost: wake waiters

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agotest_lost: add lost1 test
Colin Patrick McCabe [Wed, 24 Nov 2010 05:55:26 +0000 (21:55 -0800)]
test_lost: add lost1 test

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoosd: ReplicatedPG::do_op: error on read-from-lost
Colin Patrick McCabe [Wed, 24 Nov 2010 05:45:47 +0000 (21:45 -0800)]
osd: ReplicatedPG::do_op: error on read-from-lost

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoosd: don't mark objs as lost unless we're active
Colin Patrick McCabe [Tue, 23 Nov 2010 22:30:06 +0000 (14:30 -0800)]
osd: don't mark objs as lost unless we're active

We don't have enough information to mark objects as lost until we
activate the PG. might_have_unfound isn't even built until PG::activate.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agomds: fix resolve for surviving observers
Sage Weil [Tue, 30 Nov 2010 23:43:53 +0000 (15:43 -0800)]
mds: fix resolve for surviving observers

Make all survivors participate in resolve stage, so that survivors can
properly determine the outcome of migrations to the failed node that did
not complete.

The sequence (before):
 - A starts to export /foo to B
 - C has ambiguous auth (A,B) in its subtree map
 - B journals import_start
 - B fails
...
 - B restarts
 - B sends resolves to everyone
   - does not claim /foo
 - A sends resolve _only_ to B
   - does claim /foo
 - B knows its import did not complete
 - C doesn't know anything.  Also, the maybe_resolve_finish stuff was
   totally broken because the recovery_set wasn't initialized

See new (commented out) assert in Migrator.cc to reproduce the above.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agotest: dump_osd_store: sort dump output
Colin Patrick McCabe [Tue, 23 Nov 2010 18:55:59 +0000 (10:55 -0800)]
test: dump_osd_store: sort dump output

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoosd: active replicas process logs from primaries
Colin Patrick McCabe [Tue, 23 Nov 2010 18:18:29 +0000 (10:18 -0800)]
osd: active replicas process logs from primaries

In _process_pg_info, if the primary sends us a PG::Log, a replica should
merge that log into its own.

mark_all_unfound_as_lost / share_pg_log: don't send the whole PG::Log.
Just send the new entries that were just added when marking the objects
as lost.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoosd: object_info_t: add lost field
Colin Patrick McCabe [Mon, 22 Nov 2010 23:56:06 +0000 (15:56 -0800)]
osd: object_info_t: add lost field

We can now permanently mark objects as lost by setting the lost bit in
their object_info_t. Rev the object_info_t struct.

get_object_context: re-arrange this so that we're always setting the
lost bit. Also avoid some unnecessary steps.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoAdd ./ceph dump pg debug degraded_pgs_exist
Colin Patrick McCabe [Mon, 22 Nov 2010 19:32:38 +0000 (11:32 -0800)]
Add ./ceph dump pg debug degraded_pgs_exist

./ceph dump pg debug degraded_pgs_exist returns TRUE if some pgs are
degraded; false otherwise.

tests: move start_recovery into test_common.sh.
Create recovery1 test.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years ago(re)add mechanism for marking objects as lost
Colin Patrick McCabe [Sat, 20 Nov 2010 03:15:40 +0000 (19:15 -0800)]
(re)add mechanism for marking objects as lost

In activate_map, we now mark objects that we know are unfindable as
lost. This relies on the might_have_unfound set introduced earlier.

Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
14 years agoosd: fix object_info_t() initialization of oloc
Sage Weil [Tue, 30 Nov 2010 20:57:43 +0000 (12:57 -0800)]
osd: fix object_info_t() initialization of oloc

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agomds: add debug output to make completions easier to track
Sage Weil [Tue, 30 Nov 2010 20:56:15 +0000 (12:56 -0800)]
mds: add debug output to make completions easier to track

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoosd: fix misuses of OLOC_BLANK
Sage Weil [Tue, 30 Nov 2010 20:48:32 +0000 (12:48 -0800)]
osd: fix misuses of OLOC_BLANK

Commit 6e2b594b fixed a bunch of bad get_object_context() calls, but even
with the parameter fixed some were still broken.  Pass in a valid oloc in
those cases.  The only place where OLOC_BLANK _is_ still used is when we
know we have the object locally and will load a valid value off disk.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoRevert "mds: resolve cleanup"
Sage Weil [Tue, 30 Nov 2010 20:23:18 +0000 (12:23 -0800)]
Revert "mds: resolve cleanup"

This reverts commit cd53719f3ce712a060e4ac80cab934c597531a5e.

We need this on surviving nodes too to resolve ambiguous migrations to/from recovering
nodes.

14 years agoMerge branch 'testing' into unstable
Sage Weil [Tue, 30 Nov 2010 20:19:39 +0000 (12:19 -0800)]
Merge branch 'testing' into unstable

Conflicts:
src/os/FileJournal.cc

14 years agoosd: make recovery_oids debug list per-pg
Sage Weil [Tue, 30 Nov 2010 19:43:19 +0000 (11:43 -0800)]
osd: make recovery_oids debug list per-pg

Otherwise we hit bad asserts if an object of the same name in different
pools is getting recovered simultaneously.
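
A small sketch of why the scope of the list matters. Both structs are hypothetical; pg ids and object names are plain strings, and start() returns false where a debug assert about a duplicate entry would presumably fire.

```cpp
#include <map>
#include <set>
#include <string>

// One list shared by every pg: names from different pools collide.
struct GlobalRecoveryList {
  std::set<std::string> recovery_oids;
  bool start(const std::string& oid) {
    return recovery_oids.insert(oid).second;  // false if already present
  }
};

// One list per pg: the same name in two pools lands in two separate sets.
struct PerPgRecoveryList {
  std::map<std::string, std::set<std::string>> recovery_oids;  // keyed by pg
  bool start(const std::string& pgid, const std::string& oid) {
    return recovery_oids[pgid].insert(oid).second;
  }
};
```

With the per-pg layout, recovering "foo" in two different pgs at the same time is no longer reported as a duplicate.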

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoclient: Set the DirResult buffer to NULL when deleting it.
Greg Farnum [Tue, 30 Nov 2010 18:56:34 +0000 (10:56 -0800)]
client: Set the DirResult buffer to NULL when deleting it.

This should fix a crash exposed by our bonnie workunit. Previously
the client would keep trying to read out of the (deleted) buffer!
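
The pattern described above, sketched on a toy struct. ToyDirResult and its members are hypothetical and do not mirror the client's real DirResult.

```cpp
#include <string>
#include <vector>

struct ToyDirResult {
  std::vector<std::string>* buffer = nullptr;

  ToyDirResult() = default;
  ToyDirResult(const ToyDirResult&) = delete;             // owns a raw pointer
  ToyDirResult& operator=(const ToyDirResult&) = delete;
  ~ToyDirResult() { delete buffer; }

  // Replace any existing buffer with a copy of names.
  void fill(const std::vector<std::string>& names) {
    delete buffer;
    buffer = new std::vector<std::string>(names);
  }

  // Free the buffer and clear the pointer so later checks see "no buffer"
  // instead of a dangling address.
  void reset() {
    delete buffer;
    buffer = nullptr;
  }

  bool has_buffer() const { return buffer != nullptr; }
};
```

Because deleting a null pointer is a no-op, calling reset() twice is safe once the pointer is cleared.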

14 years agoceph.spec.in: include gui files
Sage Weil [Tue, 30 Nov 2010 17:22:42 +0000 (09:22 -0800)]
ceph.spec.in: include gui files

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agodebian: many many cleanups
Sage Weil [Tue, 30 Nov 2010 17:13:54 +0000 (09:13 -0800)]
debian: many many cleanups

Signed-off-by: Laszlo Boszormenyi <gcs@debian.hu>
14 years agofilejournal: fix throttle vs FULL behavior
Sage Weil [Tue, 30 Nov 2010 16:55:29 +0000 (08:55 -0800)]
filejournal: fix throttle vs FULL behavior

We don't want to add to the throttler if we aren't going to queue the
write, or else we'll never take it off again.
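
A toy sketch of the accounting rule. The throttle and journal types are hypothetical and single-threaded; the only point is that take() happens only on the path that also queues the write, so every take() is matched by a later put().

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

struct ToyThrottle {
  uint64_t current = 0;
  void take(uint64_t n) { current += n; }
  void put(uint64_t n) {
    assert(n <= current);
    current -= n;
  }
};

struct ToyThrottledJournal {
  bool full = false;
  ToyThrottle throttle;
  std::deque<uint64_t> writeq;  // sizes of queued writes

  // Returns false when the journal is full. Nothing is queued in that case,
  // so nothing is added to the throttle either.
  bool submit(uint64_t bytes) {
    if (full)
      return false;
    throttle.take(bytes);
    writeq.push_back(bytes);
    return true;
  }

  // Completing a queued write releases its share of the throttle.
  void complete_one() {
    assert(!writeq.empty());
    uint64_t bytes = writeq.front();
    writeq.pop_front();
    throttle.put(bytes);
  }
};
```

If submit() charged the throttle before checking full, a rejected write would leave bytes in the throttle that no completion ever releases.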

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agoMerge branch 'osd_journaling' into unstable
Sage Weil [Tue, 30 Nov 2010 16:32:55 +0000 (08:32 -0800)]
Merge branch 'osd_journaling' into unstable

14 years agofilestore: make sure blocked op_start's wake up in order
Sage Weil [Tue, 30 Nov 2010 16:30:57 +0000 (08:30 -0800)]
filestore: make sure blocked op_start's wake up in order

If they wake up out of order (which, theoretically, they could before) we
can screw up journal submitting order in writebehind mode, or apply order
in parallel and writeahead journal mode.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agofilestore: assert op_submit_finish is called in order
Sage Weil [Tue, 30 Nov 2010 16:24:57 +0000 (08:24 -0800)]
filestore: assert op_submit_finish is called in order

Verify/assert that we aren't screwing up the submission pipeline ordering.
Namely, we want to make sure that if op_apply_start() blocks, we wake up
in the proper order and don't screw up the journaling.

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agofilejournal: rework journal FULL behavior and fix throttling
Sage Weil [Tue, 30 Nov 2010 15:54:42 +0000 (07:54 -0800)]
filejournal: rework journal FULL behavior and fix throttling

Keep distinct states for FULL, WAIT, and NOTFULL.

The old code was more or less correct at one point, but assumed the seq
changed on each commit, not each operation; in its prior state it was
totally broken.

Also fix throttling (we were leaking items in the throttler that were
submitted while the journal was full).

Signed-off-by: Sage Weil <sage@newdream.net>
14 years agofilestore: refactor op_queue/journal locking
Sage Weil [Tue, 30 Nov 2010 15:51:16 +0000 (07:51 -0800)]
filestore: refactor op_queue/journal locking

- Combine journal_lock and lock.
- Move throttling outside of the lock (this fixes potential deadlock in
  parallel journal mode)
- Make interface nomenclature a bit more helpful

Signed-off-by: Sage Weil <sage@newdream.net>