Josh Durgin [Sat, 16 Mar 2013 00:28:13 +0000 (17:28 -0700)]
librbd: optionally wait for a flush before enabling writeback
Older guests may not send flushes properly (i.e. never), so if this is
enabled, rbd_cache=true is safe for them transparently.
Disable by default, since it will unnecessarily slow down newer guest
boot, and prevent writeback caching for things that don't need to send
flushes, like the command line tool.
Refs: #3817 Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
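A rough sketch of the idea, not librbd's actual code: the cache behaves as write-through until the first flush arrives, and only then allows writeback. The class and member names here are hypothetical.

```cpp
#include <cassert>

// Hypothetical sketch: stay in write-through mode until the guest proves
// it can send flushes. Names are illustrative, not taken from librbd.
class WritebackGate {
public:
  // wait_for_flush == false models the default described above:
  // writeback is allowed immediately.
  explicit WritebackGate(bool wait_for_flush)
    : writeback_enabled_(!wait_for_flush) {}

  // Called whenever the guest issues a flush.
  void on_flush() { writeback_enabled_ = true; }

  // True if a write may be acknowledged before it reaches the cluster.
  bool may_buffer_write() const { return writeback_enabled_; }

private:
  bool writeback_enabled_;
};
```

A guest that never flushes never leaves write-through mode, which is what makes the option safe for it.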
Sam Lang [Mon, 18 Mar 2013 21:59:04 +0000 (16:59 -0500)]
mds: Handle ENODATA returned from getxattr
The osds might return ENODATA if we request an
xattr that doesn't exist. In this case, we're
requesting the 'parent' xattr so that we can
remove all the forwarding pointers, but the xattr
may not have been written (which only happens on
log segment trim), so we don't assert here.
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
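A minimal sketch of the error handling described here, assuming a negative-errno return convention. The enum and function name are hypothetical and do not come from the MDS code.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical classification of a getxattr return value for the
// 'parent' xattr. A missing xattr is an expected case, not a failure.
enum class ParentXattr { Present, Missing, Error };

ParentXattr classify_getxattr_result(int r) {
  if (r >= 0)
    return ParentXattr::Present;
  if (r == -ENODATA)
    return ParentXattr::Missing;  // xattr was never written; do not assert
  return ParentXattr::Error;
}
```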
mon: HealthMonitor: Keep track of monitor cluster's health
The HealthMonitor builds upon the QuorumService interface, and should be
used to keep track of any and all relevant information about the monitor
cluster (and possibly about the whole cluster, if need be).
This patch also introduces the HealthService interface, used to define
a HealthMonitor service, responsible for dispatching 'MMonHealth' messages
(the QuorumService interface dispatches generic 'Message').
Based on the HealthService interface, we introduce the DataHealthService
class, a service that will track disk space consumption by the monitors,
warn when a given threshold is crossed, and gracefully shut down the monitor
if disk space usage hits critical levels that might affect the correct
monitor behavior.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
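The warn/shutdown decision can be sketched as a simple threshold check. This is an illustration only: the function name, the use of "percent available", and the threshold parameters are assumptions, not the DataHealthService implementation.

```cpp
#include <cassert>

// Hypothetical disk-space classification for a monitor's data store.
enum class DiskState { Ok, Warn, Critical };

// avail_percent: free space as a percentage of the volume.
// warn_below / crit_below: assumed thresholds, with crit_below <= warn_below.
DiskState classify_disk(unsigned avail_percent,
                        unsigned warn_below,
                        unsigned crit_below) {
  assert(crit_below <= warn_below);
  if (avail_percent <= crit_below)
    return DiskState::Critical;  // caller would shut the monitor down
  if (avail_percent <= warn_below)
    return DiskState::Warn;      // caller would emit a health warning
  return DiskState::Ok;
}
```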
mon: QuorumService: Allow for services quorum-bound to be easily created
As the monitor grows in features, we have been dumping them in the Monitor
class as they don't really fit anywhere else.
Most of those latest features have been, and some of the future changes
will also be, quorum-bounded. By that we mean that these features tend
to require a quorum to be present in order to work.
Although we already have the PaxosService interface, it really isn't
adequate for this kind of feature, as they don't really require Paxos,
nor do they access the store. Furthermore, they don't really need to
tick at the same rate as the monitor, and can be fairly independent.
Therefore we now introduce the concept of a QuorumService, a class to be
built upon, managing the tick and dispatch for any kind of service
basically requiring a quorum to function.
Among the already existing monitor features that could take advantage of
this new class we can find the Timecheck infrastructure, as it is by
nature quorum bounded. The monitor store sync could also take advantage
of this service, although it doesn't really require a quorum to work,
and even the PaxosService-related classes could use this.
This patch also introduces the MMonQuorumService base class, to be used
by any message that needs it.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
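One way to picture such a base class is sketched here. All names and signatures are hypothetical and much simpler than the real interface; the point is only that tick and dispatch are gated on quorum membership and delegated to the derived service.

```cpp
#include <cassert>
#include <string>

// Hypothetical base class: does work only while a quorum is present and
// is ticked independently of the monitor's own tick.
class QuorumBoundService {
public:
  virtual ~QuorumBoundService() = default;

  void set_in_quorum(bool q) { in_quorum_ = q; }

  // Would be driven by a timer owned by the service itself.
  void tick() {
    if (in_quorum_)
      service_tick();
  }

  // Returns true if the message was handled.
  bool dispatch(const std::string& msg) {
    return in_quorum_ && service_dispatch(msg);
  }

protected:
  virtual void service_tick() = 0;
  virtual bool service_dispatch(const std::string& msg) = 0;

private:
  bool in_quorum_ = false;
};

// Trivial derived service used to exercise the base class.
class CountingService : public QuorumBoundService {
public:
  int ticks = 0;
  int handled = 0;

protected:
  void service_tick() override { ++ticks; }
  bool service_dispatch(const std::string&) override {
    ++handled;
    return true;
  }
};
```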
Sam Lang [Mon, 18 Mar 2013 19:40:48 +0000 (14:40 -0500)]
client: Remove unnecessary set_inode() in _rmdir()
With the recent changes in fc80c1dc6ee315ae5e039986602ffadba46cb43b,
we only allow setting the inode once on a MetaRequest. This triggered
a bug in _rmdir(), where the parent dir inode passed in was set on the
MetaRequest, and then the dir inode was also set on the same MetaRequest.
Removing the set_inode() using the parent dir inode resolves this issue.
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Fix iterator handling in ~TestFileStoreState(). After std::map::erase()
the used iterator is invalid. Use a while-loop and erase the object with
post-incremented iterator instead.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
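The erase-while-iterating pattern can be sketched as follows. The Coll type and destroy_all() are stand-ins for illustration, not the test's actual types.

```cpp
#include <cassert>
#include <map>

struct Coll { int id; };  // stand-in for the objects owned by the map

// Deletes every owned object and empties the map; returns how many
// entries were freed.
int destroy_all(std::map<int, Coll*>& m) {
  int freed = 0;
  std::map<int, Coll*>::iterator it = m.begin();
  while (it != m.end()) {
    delete it->second;
    // it++ advances the iterator first and passes the old position to
    // erase(), so 'it' never refers to the erased element.
    m.erase(it++);
    ++freed;
  }
  return freed;
}
```

Since C++11, std::map::erase() also returns the iterator following the erased element, so `it = m.erase(it);` is an equivalent alternative.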
Danny Al-Gaaf [Mon, 18 Mar 2013 11:45:15 +0000 (12:45 +0100)]
rgw/rgw_rados.cc: make sure range_iter != ranges.end()
Make sure range_iter is valid, set range_iter = next_iter instead of
++range_iter, since next_iter is already checked against ranges.end() and
is the same as ++range_iter.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Sam Lang [Wed, 6 Mar 2013 17:40:49 +0000 (11:40 -0600)]
mds: Add config option for log segment size
The mds log segment size is chosen from the
default layout object size (4MB). Add a parameter
to the config to enable setting the log segment
size to an alternate value.
If the config option to set the journal log segment
size is specified, the log layout must be modified
both for the object size and the striping unit size.
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
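A sketch of the layout adjustment described here. The struct, the function name, and the convention that 0 means "keep the default" are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of a log layout.
struct LogLayout {
  uint32_t object_size;
  uint32_t stripe_unit;
  uint32_t stripe_count;
};

// If a segment size is configured (assumed: 0 means "not set"), both the
// object size and the stripe unit must change together.
LogLayout apply_segment_size(LogLayout layout, uint32_t configured_size) {
  if (configured_size > 0) {
    layout.object_size = configured_size;
    layout.stripe_unit = configured_size;
  }
  return layout;
}
```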
Sam Lang [Thu, 21 Feb 2013 13:39:12 +0000 (07:39 -0600)]
qa/workunits/restart: Add test to check backtrace
This script uses the python bindings to libcephfs and rados
to create files and check the correctness of the backtrace
written to the 'parent' xattr on the first object (if it's
a file) or inode (if it's a dir). The script includes test cases
that kill the mds at specific kill points and restart it through
teuthology using the teuthology restart task.
Sam Lang [Thu, 21 Feb 2013 14:07:35 +0000 (08:07 -0600)]
mds: Add kill points for backtrace testing
To test the mds journal and replay behavior, and the
functionality for storing backtraces on inodes, we
add kill points to the MDS in the openc, journal replay,
and journal expire paths.
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Sam Lang [Thu, 21 Feb 2013 13:35:14 +0000 (07:35 -0600)]
mds: Cleanup new segment conditionals
The second conditional for adding a new segment is always
true when the first conditional is true. Clean this up
to simply create a new segment when we've reached the end of
the current segment.
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Adds a backtrace to the data pool for supporting lookup-by-ino,
storing the backtrace on the first object in the data pool
or the metadata pool for a directory, as the 'parent' xattr
on the object (named by inode) in that pool. For create, rename,
mkdir, and setlayout operations, the backtrace is
queued (with the current log segment) after the journal is
committed and the safe reply is returned to the client, but
the backtrace operation itself isn't started until the log segment is
expired.
For journal replay, we queue the backtrace so that it gets
written out on journal expire. Inodes get added to the EMetaBlob
in the fullbits list, so we queue backtraces while iterating through
the fullbits during replay.
Using setlayout or setxattr('ceph.file.layout.pool'),
the data pool for a file can be changed after it is created
but before anything is written to the file. A forwarding backtrace
is written to the old pool on a setlayout, to ensure we can always find
the latest backtrace. We store a list of old pools with the backtrace
for cleaning up all forwarding pointers of an inode.
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Sam Lang [Tue, 5 Mar 2013 14:48:29 +0000 (08:48 -0600)]
mds: New backtrace handling
Add unified backtrace handling for storing a backtrace on file objects
(the first data object) and dirs. The backtrace store operation is
queued on the LogSegment (for performing the store on log segment
expire). We encode the backtrace on queue to avoid keeping a reference
around to the CInode, which may get dropped from the cache by the time
the log segment is expired (and the backtrace is written out).
Fetching the backtrace is implemented on the CInode.
Also allow incrementing/decrementing the DIRTYPARENT pin ref as needed,
instead of using a state semaphore to keep track of whether it's set or
not. This allows us to remove the STATE_DIRTYPARENT field on CInode.
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
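The "encode on queue" choice can be illustrated with a small sketch: the segment keeps only serialized bytes, so nothing dangles if the inode is dropped before expiry. The types and the string encoding are hypothetical stand-ins, not the MDS data structures.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Inode {  // stand-in for CInode
  unsigned long ino;
  std::string path;
};

// Hypothetical segment queue that stores encoded backtraces by value.
class SegmentQueue {
public:
  // Serialize immediately; no pointer or reference to the inode is kept.
  void queue_backtrace(const Inode& in) {
    pending_.push_back(std::to_string(in.ino) + ":" + in.path);
  }

  // On segment expire, hand back everything queued and reset the queue.
  std::vector<std::string> expire() {
    std::vector<std::string> out;
    out.swap(pending_);
    return out;
  }

private:
  std::vector<std::string> pending_;
};
```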
Sam Lang [Tue, 26 Feb 2013 18:22:47 +0000 (12:22 -0600)]
include/elist: Fix clear() to use pop_front()
elist<T>::clear() is calling remove(), which isn't a
method defined on elist<T> (it was never defined according
to git). Because elist is templated and clear() was never referenced,
the member was never instantiated, so the bad call went unnoticed. Once
clear() is invoked on an instance of elist<T>, the only candidates the
compiler finds are the remove(const char *) library function declared in
stdio.h and std::remove(), neither of which matches, and we get the
compile error shown below.
The fix here is to use pop_front() instead of remove().
Compile error is:
In file included from ../../src/mds/CInode.h:22:0,
from ../../src/mds/CInode.cc:19:
../../src/include/elist.h: In instantiation of ‘void elist<T>::clear() [with T = cinode_backtrace_info_t*]’:
../../src/mds/CInode.cc:1129:20: required from here
../../src/include/elist.h:101:7: error: no matching function for call to ‘remove(cinode_backtrace_info_t*)’
../../src/include/elist.h:101:7: note: candidates are:
In file included from ../../src/mds/CInode.cc:17:0:
/usr/include/stdio.h:179:12: note: int remove(const char*)
/usr/include/stdio.h:179:12: note: no known conversion for argument 1 from ‘cinode_backtrace_info_t*’ to ‘const char*’
In file included from /usr/include/c++/4.7/algorithm:63:0,
from /usr/include/c++/4.7/backward/hashtable.h:65,
from /usr/include/c++/4.7/ext/hash_map:65,
from ../../src/include/encoding.h:292,
from ../../src/common/entity_name.h:22,
from ../../src/common/config.h:26,
from ../../src/mds/CInode.h:20,
from ../../src/mds/CInode.cc:19:
/usr/include/c++/4.7/bits/stl_algo.h:1117:5: note: template<class _FIter, class _Tp> _FIter std::remove(_FIter, _FIter, const _Tp&)
/usr/include/c++/4.7/bits/stl_algo.h:1117:5: note: template argument deduction/substitution failed:
In file included from ../../src/mds/CInode.h:22:0,
from ../../src/mds/CInode.cc:19:
../../src/include/elist.h:101:7: note: candidate expects 3 arguments, 1 provided
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
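A simplified stand-in for the fixed clear() is sketched here. SimpleList wraps std::list and is not the real intrusive elist<T>; it only shows clear() written in terms of members the class actually defines.

```cpp
#include <cassert>
#include <list>

// Simplified stand-in for a list template; not the real elist<T>.
template <typename T>
class SimpleList {
public:
  void push_back(T v) { items_.push_back(v); }
  bool empty() const { return items_.empty(); }
  void pop_front() { items_.pop_front(); }

  // Member functions of a class template are only instantiated when they
  // are used, so an error in this body would surface at the first call.
  void clear() {
    while (!empty())
      pop_front();
  }

private:
  std::list<T> items_;
};
```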
Sam Lang [Tue, 5 Mar 2013 14:28:47 +0000 (08:28 -0600)]
mds: Use map for CInode pinrefs
Implements pin refs on the inode as a map instead of
a multiset, allowing individual ref counts to act as
real references with values that can be >1.
The pin refs are only used for debugging, but allowing
them to be >1 avoids the need for a separate state field
for things like DIRTYPARENT.
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
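A sketch of per-type reference counts kept in a map. The class, its methods, and the PIN_DIRTYPARENT value are hypothetical and chosen for illustration.

```cpp
#include <cassert>
#include <map>

const int PIN_DIRTYPARENT = 1;  // hypothetical value, for illustration only

// Hypothetical pin-ref tracker: one counter per pin type.
class PinRefs {
public:
  void get(int by) {
    ++refs_[by];
    ++total_;
  }

  void put(int by) {
    std::map<int, int>::iterator it = refs_.find(by);
    assert(it != refs_.end() && it->second > 0);
    if (--it->second == 0)
      refs_.erase(it);
    --total_;
  }

  int count(int by) const {
    std::map<int, int>::const_iterator it = refs_.find(by);
    return it == refs_.end() ? 0 : it->second;
  }

  // Replaces a separate "is this state set?" flag.
  bool is_pinned_by(int by) const { return count(by) > 0; }

  int total() const { return total_; }

private:
  std::map<int, int> refs_;
  int total_ = 0;
};
```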
Sam Lang [Tue, 26 Feb 2013 00:51:19 +0000 (18:51 -0600)]
client: Ensure inode/dentries are ref counted
The MetaRequest holds onto inodes and dentries
for retrying unsafe requests, but those objects
might be removed from the cache (unlink for example)
causing the inode/dentry to be freed. Ensure that
the inode/dentry is never freed while the MetaRequest
holds onto it by putting/getting the refs using
set/get interfaces.
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
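The set/get approach can be sketched as follows. Inode and Request are simplified stand-ins with hypothetical members, not the client's real classes; the set-once assertion is an added assumption of this sketch.

```cpp
#include <cassert>

struct Inode {  // stand-in with a bare reference count
  int nref = 0;
  void get() { ++nref; }
  void put() { --nref; }
};

// Hypothetical request that pins its inode for as long as it holds it.
class Request {
public:
  Request() = default;
  Request(const Request&) = delete;             // avoid double put()
  Request& operator=(const Request&) = delete;
  ~Request() {
    if (inode_)
      inode_->put();  // drop the ref taken in set_inode()
  }

  // Takes a ref on the inode. Assumed to be called at most once.
  void set_inode(Inode* in) {
    assert(inode_ == nullptr);
    in->get();
    inode_ = in;
  }

  Inode* inode() const { return inode_; }

private:
  Inode* inode_ = nullptr;
};
```

Routing every assignment through set_inode() keeps the get/put pairing in one place, so the inode cannot be freed while the request still points at it.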
Samuel Just [Fri, 15 Mar 2013 22:13:46 +0000 (15:13 -0700)]
OSD: split temp collection as well
Otherwise, when we eventually remove the temp collection, there might be
objects in the temp collection which were independently pulled into the child
pg collection. Thus, removing the old stale parent link from its temp
collection also blasts the omap entries and snap mappings for the real child
object.
Backport: bobtail Fixes: #4452 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>