The filter_policy (bloom filter) stuff is fairly new in LevelDB's life,
and it turns out that precise's version is too old for it. Add conditional
compilation for those members in order to build and work properly.
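A minimal sketch of the guard, assuming a HAVE_LEVELDB_FILTER_POLICY define supplied by the build system (the macro name and surrounding code are illustrative, not the actual patch):

    #include <leveldb/options.h>
    #ifdef HAVE_LEVELDB_FILTER_POLICY
    #include <leveldb/filter_policy.h>
    #endif

    leveldb::Options make_options(bool use_bloom) {
      leveldb::Options opts;
    #ifdef HAVE_LEVELDB_FILTER_POLICY
      // Newer LevelDB only: bloom filters cut read amplification.  The
      // returned policy must outlive the DB and be deleted by the caller.
      if (use_bloom)
        opts.filter_policy = leveldb::NewBloomFilterPolicy(10);  // ~10 bits/key
    #else
      (void)use_bloom;  // precise's LevelDB predates filter_policy
    #endif
      return opts;
    }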
For simplelock and filelock, XLOCK/XLOCKDONE's next state is SYNC.
But a filelock in the XLOCK/XLOCKDONE state allows Fb caps, while a
filelock in the SYNC state does not. So a filelock can be stuck in the
XLOCK/XLOCKDONE state forever if Fb caps are issued.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Store conversion wasn't converting the osdmap_full/ versions, only the
incrementals under osdmap/ and the latest stashed full version. This
would lead to serious problems during OSDMonitor's update_from_paxos
when the latest stashed full map didn't correspond to the first available
incremental.
Fixes: #4521
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Tue, 16 Apr 2013 22:45:41 +0000 (15:45 -0700)]
librbd: flush on diff_iterate
The diff_iterate() tests fail when caching is enabled because recent writes
aren't visible to listsnaps. Flush from diff_iterate to ensure that they
are. Someday, maybe, we might make diff_iterate() inspect the cache
contents to make this more efficient, but for now that is not necessary.
We plug completions when transitioning from a full to non-full journal
to ensure that we do not complete items before we have a stable journal
starting point that is past the committed_thru marker. However, the order
of the header update and completion queueing means that we never remove
the plug if the journalq is empty--the seq test is always false. The
result is very slow osd requests that only commit when we do a full sync.
config: provide settings for the LevelDB stores we use
Now that we can set up the LevelDB options internally, provide
config options on the OSD and the Monitor. We leave the OSD values
at the defaults for now as they're performance-sensitive, but we
set new values on the Monitor so that it can scale to large PGMaps.
(Previously there were issues with large PGMaps taking forever to write;
these changes to the use of compression and the default block and
write buffers counteract them.)
Since we pass these variables through, interested users can now test
and tune them more appropriately.
Reported-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Greg Farnum <greg@inktank.com>
Sam Lang [Fri, 12 Apr 2013 16:08:35 +0000 (11:08 -0500)]
client: Fix inode remove from snaprealm race
This is a follow-on fix to b5ce4d0. Always remove the inode from the
snaprealm's list of inodes_with_caps before the snaprealm ref is
decremented (and the snaprealm potentially gets freed).
Fixes #4694.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
This uses the old stand-alone qemu-iotests repo so it works with the
version of qemu in Ubuntu 12.04. The tests depend tightly on qemu
version, so to use later tests we'd need to install corresponding
versions of qemu.
LevelDB has a lot of options which we don't expose right now. Add
an options struct to the LevelDBStore which users can access as they
wish in order to set values different from the defaults.
This will let us set various size values, as well as turn on
caching or bloom filter read optimizations.
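A rough sketch of the pattern, with illustrative member names (the actual
LevelDBStore fields may differ): callers fill a tuning struct before the store
is opened, and the store translates it into leveldb::Options.

    #include <leveldb/cache.h>
    #include <leveldb/options.h>

    // Tunables a user of the store can set before init; zero/unset means
    // "keep LevelDB's default".
    struct StoreTuning {
      size_t write_buffer_size = 0;
      size_t block_size = 0;
      size_t cache_size = 0;        // enables a block cache when non-zero
      bool compression = true;
    };

    leveldb::Options to_leveldb_options(const StoreTuning &t) {
      leveldb::Options opts;
      if (t.write_buffer_size)
        opts.write_buffer_size = t.write_buffer_size;
      if (t.block_size)
        opts.block_size = t.block_size;
      if (t.cache_size)
        opts.block_cache = leveldb::NewLRUCache(t.cache_size);
      opts.compression = t.compression ? leveldb::kSnappyCompression
                                       : leveldb::kNoCompression;
      return opts;
    }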
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Greg Farnum <greg@inktank.com>
Gary Lowell [Wed, 20 Feb 2013 01:25:27 +0000 (17:25 -0800)]
init-radosgw.sysv: New radosgw init file for rpm based systems
Added the init-radosgw.sysv file for rpm-based systems, added it to
the tarball list in the makefile, and updated the specfile to
install it. Also added a dependency on ceph since it uses
utility routines from that package (on Debian systems these are
packaged in ceph-common). Incorporated review comments from
Alex. (Bug #4571)
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Reviewed-by: Alexandre Marangone <alexandre.marangone@inktank.com>
Sam Lang [Tue, 9 Apr 2013 15:35:19 +0000 (10:35 -0500)]
mds: Delay export on missing inodes for reconnect
The reconnect caps sent by the client may not have corresponding
inodes in the inode cache until after clientreplay (when
the client creates a new file, for example). Currently, we send an
export for such a cap to the client if we don't see an inode in the cache
and path_is_mine() returns false (for example, if the client didn't
send a path because the file was already unlinked).
Instead, we want to delay handling of the reconnect cap until
clientreplay completes.
This patch modifies handle_client_reconnect() so that we don't assume
the cap isn't ours if we don't have an inode for it, but instead delay
its recovery until later. An export cap message is only sent if the inode exists
and the cap isn't ours (non-auth) during reconnect. If any remaining
recovered caps exist in the recovered list once the mds goes active, we
send export messages at that point.
Also, after removing the path_is_mine check,
MDCache::parallel_fetch_traverse_dir() needs to skip non-auth dirfrags.
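A loose, self-contained sketch of the shape of the change (simplified stand-in
types and hypothetical helpers, not the real handle_client_reconnect() code):

    #include <cstdint>
    #include <map>

    using inodeno_t = uint64_t;
    struct CapReconnect {};              // the client's remembered cap state
    struct Inode { bool auth = true; };  // auth: this mds owns the inode

    struct ReconnectCaps {
      std::map<inodeno_t, CapReconnect> delayed;   // inode not in cache yet

      // One cap from the client's reconnect message.
      void handle(inodeno_t ino, const CapReconnect &rc, Inode *in) {
        if (!in) {
          // Don't conclude the cap isn't ours: the inode may only appear
          // during clientreplay (e.g. a file the client just created).
          delayed[ino] = rc;
        } else if (!in->auth) {
          send_export(ino);              // clearly another mds's inode
        } else {
          recover(ino, rc);
        }
      }

      // When the mds goes active: whatever is still unresolved after
      // clientreplay gets an export message.
      template <typename Lookup>
      void on_active(Lookup lookup) {
        for (auto &p : delayed) {
          Inode *in = lookup(p.first);
          if (in && in->auth)
            recover(p.first, p.second);
          else
            send_export(p.first);
        }
        delayed.clear();
      }

      void send_export(inodeno_t) {}
      void recover(inodeno_t, const CapReconnect &) {}
    };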
Fixes #4451.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Sam Lang [Thu, 4 Apr 2013 20:59:56 +0000 (15:59 -0500)]
client: Unify session close handling
If mds failure causes client reconnect while the
client is unmounting, the client will send a session
close request to the mds even if there are outstanding
inodes in the cache waiting to receive flush_acks. This
causes the mds to send back a session close message and
the client closes the connection, so that when the mds tries
to send flush acks back to the client, they get dropped, resulting
in the client hanging on unmount. The pattern for this bug is:
1. mds restart
2. client sends session open request
3. client unmount sets unmounting flag and waits for flush_acks
4. mds sends session open reply
5. client sends session close request (because it's unmounting)
6. mds sends session close, client closes connection
7. mds tries to send flush_acks, but drops them because the connection
is gone
This patch unifies the session close handling so that during unmount the
client only sends a session close once all flush acks have been
received. If the mds restarts during session close, the reconnect
logic will kick the session close waiter so that session close requests
are re-sent for session close replies not yet received.
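A simplified, self-contained sketch of the unified check (hypothetical names,
not the actual Client code): the close request is issued from one place, gated
on the outstanding flush acks, and re-armed on reconnect.

    #include <cstdint>
    #include <set>

    struct SessionCloser {
      bool unmounting = false;
      bool close_sent = false;
      std::set<uint64_t> awaiting_flush_ack;   // inodes with flushes in flight

      // The single decision point used by both unmount and reconnect.
      void maybe_send_close() {
        if (unmounting && !close_sent && awaiting_flush_ack.empty()) {
          send_session_close();
          close_sent = true;
        }
      }

      void on_flush_ack(uint64_t ino) {
        awaiting_flush_ack.erase(ino);
        maybe_send_close();              // the last ack may unblock the close
      }

      void on_mds_reconnect() {
        // The mds restarted, so any close request in flight was lost;
        // allow it to be re-sent once the remaining flush acks arrive.
        close_sent = false;
        maybe_send_close();
      }

      void send_session_close() { /* queue a CEPH_SESSION_REQUEST_CLOSE */ }
    };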
LibrbdWriteback: complete writes strictly in order
RADOS returns writes to the same object in the same order. The
ObjectCacher relies on this assumption to make sure previous writes
are complete and maintain consistency. Reads, however, may be
reordered with respect to each other. When writing to an rbd clone,
reads to the parent must be performed when the object does not exist
in the child yet. These reads may be reordered, resulting in the
original writes being reordered. This breaks the assumptions of the
ObjectCacher, causing an assert to fail.
To fix this, keep a per-object queue of outstanding writes to an
object in the LibrbdWriteback handler, and finish them in the order in
which they were sent.
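The ordering trick in isolation, as a self-contained sketch; the real
LibrbdWriteback keeps one such queue per object and hands the finished
contexts back to the ObjectCacher.

    #include <cstdint>
    #include <deque>
    #include <functional>
    #include <utility>

    // Writes to one object, completed strictly in the order they were issued
    // even if the underlying ops (e.g. copy-up reads from the parent) finish
    // out of order.
    struct InOrderWrites {
      struct Entry {
        uint64_t tid;
        bool done;
        int result;
        std::function<void(int)> on_finish;
      };
      std::deque<Entry> queue;   // issue order

      void start_write(uint64_t tid, std::function<void(int)> on_finish) {
        queue.push_back(Entry{tid, false, 0, std::move(on_finish)});
      }

      // Underlying op for 'tid' completed, possibly ahead of earlier writes.
      void complete_write(uint64_t tid, int result) {
        for (auto &e : queue) {
          if (e.tid == tid) { e.done = true; e.result = result; break; }
        }
        // Drain only from the front: a write that finished early waits for
        // every earlier write before its completion is delivered.
        while (!queue.empty() && queue.front().done) {
          Entry e = std::move(queue.front());
          queue.pop_front();
          e.on_finish(e.result);
        }
      }
    };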
Samuel Just [Tue, 9 Apr 2013 21:53:52 +0000 (14:53 -0700)]
Journal: commits may not include all journaled seqs
At one point, a commit had to drain the FileStore op
queue. This is no longer the case. Consequently, the
journal may have to wait more than one commit for the
filestore to create a stable commit point at a particular
sequence. Handling this requires two changes:
1) We cannot transition to FULL_WAIT until we receive
a commit_start on a seq >= journaled_seq.
2) We cannot remove the journal completion plug until we get
a committed_thru on a seq >= header.start_seq, i.e. at least as
new as the oldest committed item in the journal (see the sketch
below). If, on replay, the journal does not include fs_op_seq, we
ignore it, which is fine since we won't have reported those
entries as committed.
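A reduced sketch of the plug rule in (2), with hypothetical names; the point
is that the plug stays in place across however many commit cycles it takes
for the filestore to reach header.start_seq.

    #include <cstdint>
    #include <functional>
    #include <map>

    struct CompletionPlug {
      bool plugged = false;
      uint64_t unplug_at = 0;                           // header.start_seq
      std::map<uint64_t, std::function<void()>> held;   // seq -> completion

      // Full -> non-full transition: hold completions until the filestore
      // has a stable commit point past the new journal start.
      void plug(uint64_t start_seq) {
        plugged = true;
        unplug_at = start_seq;
      }

      void queue_completion(uint64_t seq, std::function<void()> fin) {
        if (plugged)
          held[seq] = std::move(fin);
        else
          fin();
      }

      // committed_thru from the filestore; may arrive several commits later.
      void committed_thru(uint64_t committed_seq) {
        if (plugged && committed_seq >= unplug_at) {
          plugged = false;
          for (auto &p : held)
            p.second();        // std::map iterates in seq order
          held.clear();
        }
      }
    };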
commit 0bcf2ac081 changes session_info_t's format, but there is
a typo in the code that decodes the old format. We also need to
handle struct_v == 1, which had the same encoding but without
the size guards (which is all handled by DECODE_START_LEGACY_COMPAT).
The tid returned by reads is ignored, and having reads share the same
id-space as writes would make tracking writes internally more difficult.
Make read() return void and update all implementations.
There's no reason to check the duration of a watch. The notify will
timeout after 30s on the OSD, but there's no guarantee the client will
see that in any bounded time. This test is really meant as a stress
test of the OSDs anyway, not of the clients, so just remove asserts
about operation duration.
Fixes: #4591
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sam Just <sam.just@inktank.com>
Revert "global: call config observers on global_init (and start logging!)"
This reverts commit a30917746614275baeb718e902133f06ef44fba6. This commit
includes calls that involve Mutexes, Lockers, and lockdep -- which isn't
yet set up, so things break horribly. A more subtle approach is required.
Dan Mick [Mon, 8 Apr 2013 20:49:22 +0000 (13:49 -0700)]
ceph_argparse: add _daemon versions of argparse calls
mon needs to call argparse for a couple of -- options, and the
argparse_witharg routines were attempting to cerr/exit on missing
arguments. This is appropriate for the CLI usage, but not the daemon
usage. Add a 'cli' flag that can be set false for the daemon usage
(and cause the parsing routine to return false instead of exit).
The daemon's parsing code is due for a rewrite soon.
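A stripped-down sketch of the idea (the real ceph_argparse signatures differ):
the same routine serves both callers, and the cli flag decides between exiting
with a usage error and simply reporting failure.

    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <vector>

    // Parse "--name <value>" at position i.  In CLI mode a missing value is
    // fatal; in daemon mode the caller gets 'false' back and decides itself.
    static bool parse_witharg(const std::vector<std::string> &args, size_t &i,
                              const std::string &name, std::string *out,
                              bool cli) {
      if (args[i] != name)
        return false;                              // not this option
      if (i + 1 >= args.size()) {
        if (cli) {
          std::cerr << "Option " << name << " requires an argument" << std::endl;
          std::exit(1);
        }
        return false;                              // daemon: no exit, no cerr
      }
      *out = args[++i];
      return true;
    }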
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sam Lang [Mon, 8 Apr 2013 14:09:41 +0000 (09:09 -0500)]
mds: Keep LogSegment ref for openc backtrace
The MDRequest is destroyed once the client reply is sent, but
we need the reference to the LogSegment for updating the backtrace, so
store a temporary ref to the LogSegment for later.
Fixes #4660.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
mds: fix journaler to set temp_fetch_len appropriately and read the requested amount
The _prefetch() function interprets temp_fetch_len as
the amount of data we need from read_pos, which is the beginning
of read_buf. So by setting it to the amount *more* we needed, we were
getting stuck forever if we actually hit this condition. Fix it by
setting temp_fetch_len based on the amount of data we need in aggregate.
Furthermore, we were previously rounding *down* the requested amount in
order to read only full log segments. Round up instead!
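The rounding in isolation, as a small worked sketch; 'period' stands in for
whatever unit the journaler reads in (the exact unit isn't important here).

    #include <cstdint>

    // How much to fetch starting at read_pos so the buffer covers everything
    // up to 'target', rounded *up* to whole periods rather than down.
    static uint64_t fetch_len(uint64_t read_pos, uint64_t target, uint64_t period) {
      uint64_t need = target - read_pos;               // aggregate need from read_pos
      return ((need + period - 1) / period) * period;  // round up, never down
    }

    // Example: read_pos=4096, target=10000, period=4096 -> need=5904, fetch 8192.
    // Rounding down would fetch 4096 and never reach the target: the old hang.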
Sage Weil [Sun, 7 Apr 2013 16:06:23 +0000 (09:06 -0700)]
global: call config observers on global_init (and start logging!)
Currently we don't start logging on daemon startup unless the log_file
parameter was adjusted by ceph.conf. Instead, we should call all config
observers so that the logging subsystem is fully configured and we log
even prior to the daemonize and common_init_finish (when we call observers
again). This fixes logging for the initial period before we daemonize.
For some of the daemons (osd, mon), this includes significant work. It
also fixes the problem where users don't see the 'ceph version ...' banner
on daemon start.
Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 6 Apr 2013 16:37:52 +0000 (09:37 -0700)]
librbd: fix DiffIterateStress again
- fix seed
- the array indices are points in time; no need to subtract one from i!
- pick a random seed and print it to stdout
I ran this with several different seeds without failure, so I am confident
we are in good shape. And if we ever get a future failure, we'll have the
seed to reproduce.
Sage Weil [Thu, 4 Apr 2013 04:30:51 +0000 (21:30 -0700)]
msgr: add second per-message throttler to message policy
We already have a throttler that lets us limit the amount of memory
consumed by messages from a given source. Currently this is based only
on the size of the message payload. Add a second throttler that limits
the number of messages so that we can effectively throttle small requests
as well.
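A self-contained sketch of the two-budget idea, with plain counters standing
in for Ceph's Throttle class (the real dispatcher blocks instead of returning
false):

    #include <cstdint>

    // Admit a message only when both the byte budget and the message-count
    // budget have room, so a flood of tiny messages can't evade the byte limit.
    struct DualThrottle {
      uint64_t max_bytes, max_messages;
      uint64_t cur_bytes = 0, cur_messages = 0;

      DualThrottle(uint64_t bytes, uint64_t msgs)
        : max_bytes(bytes), max_messages(msgs) {}

      bool try_admit(uint64_t msg_bytes) {
        if (cur_bytes + msg_bytes > max_bytes || cur_messages + 1 > max_messages)
          return false;
        cur_bytes += msg_bytes;
        ++cur_messages;
        return true;
      }

      // Called once the message has been processed and its memory released.
      void release(uint64_t msg_bytes) {
        cur_bytes -= msg_bytes;
        --cur_messages;
      }
    };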
Sage Weil [Sat, 6 Apr 2013 05:28:38 +0000 (22:28 -0700)]
librbd: fix DiffIterateStress test
If we write to an interval that didn't previously exist and then discard
it so that it again doesn't exist, all during the same interval, then we
should not include it in the 'written' set (or exists set, obviously).
Similarly, when we go to look at a merged diff, we can ignore extents
that were written (and possibly zeroed) if they neither existed before nor
after.
Bump up the iteration count to get more confidence that this is actually
correct.
Samuel Just [Wed, 3 Apr 2013 22:44:39 +0000 (15:44 -0700)]
FileJournal: introduce start_seq header entry
FileJournal::header_t::start_seq now encodes the op seq which may be
written at FileJournal::header_t::start. This way, FileJournal::open()
can pass a valid sequence number to read_entry for validation.
Otherwise, read_entry has no way of knowing whether a failure of a
read at header.start indicates that the journal was empty, or that
the entry is corrupt. With start_seq, read_entry can assume
corruption if start_seq <= committed_up_to.
Fixes: #4527
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Danny Al-Gaaf [Thu, 4 Apr 2013 13:54:31 +0000 (15:54 +0200)]
Makefile.am: install ceph-* python scripts to /usr/bin directly
Install ceph-* scripts directly to $(prefix)$(sbindir) (which
normally would be /usr/sbin) instead of moving them around after
installation in the spec file or debian files.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Sage Weil [Wed, 3 Apr 2013 22:32:51 +0000 (15:32 -0700)]
mds: verify mds tell 'dumpcache <filename>' target does not exist
Open target with O_CREAT|O_EXCL to ensure we don't overwrite some other
important file (like, say, /etc/passwd). This is irritating because there
is no C++ ofstream equivalent for O_EXCL; kludge around it using
ostringstream instead.
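The shape of the kludge, roughly: build the dump in an ostringstream, then
push it through a descriptor opened with O_CREAT|O_EXCL so an existing file
is never clobbered. (Sketch only; error handling in the mds differs.)

    #include <fcntl.h>
    #include <unistd.h>
    #include <cerrno>
    #include <sstream>
    #include <string>

    // Returns 0 on success, -errno on failure (-EEXIST if the target exists).
    int dump_to_new_file(const std::string &path, const std::ostringstream &ss) {
      int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_EXCL, 0600);
      if (fd < 0)
        return -errno;             // refuses to clobber, e.g., /etc/passwd
      std::string s = ss.str();
      ssize_t r = ::write(fd, s.data(), s.size());
      int ret = 0;
      if (r < 0)
        ret = -errno;
      else if ((size_t)r != s.size())
        ret = -EIO;                // short write
      ::close(fd);
      return ret;
    }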
Fixes: #3266
Signed-off-by: Sage Weil <sage@inktank.com>
mds: do not go through handle_mds_failure for oneself
A standby MDS can attempt the handle_mds_failure paths for itself, if
it sees the transition from up to down. This leads it to insert itself
into the resolve_gather set, which is bad. So check if the failed MDS
is the same as whoami, and abort if so. This fixes #4637.
Sam Lang [Mon, 1 Apr 2013 14:06:59 +0000 (09:06 -0500)]
client: Kick waiters for max size
If the mds restarts without successfully logging a max size
cap update, the client waits indefinitely in Client::get_caps
on the waitfor_caps list. So when the client gets an mds map
indicating a new active mds has replaced a down mds, we need to
kick the caps update request. This patch mimics the kernel client's
behavior by setting wanted_max_size
and requested_max_size to 0 and waking up the waiters.
Fixes #4582.
Signed-off-by: Sam Lang <sam.lang@inktank.com>