This is a quick workaround for the next branch. A more complete fix
will be done for the master branch. This does not affect correctness,
just what qa runs with lockdep enabled do.
Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage.weil@inktank.com>
ceph: propagate do_command()'s return value to user space
We were returning '1' regardless of what do_command() returned in case
of error. This would make building tools relying on command error codes
short of useless, and forced them to rely instead on error messages.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit e91405d540ce11b9996e4977212553bd33afb3ed)
Josh Durgin [Thu, 21 Mar 2013 23:04:10 +0000 (16:04 -0700)]
librbd: add an async flush
At this point it's a simple wrapper around the ObjectCacher or
librados.
This is needed for QEMU so that its main thread can continue while a
flush is occurring. Since this will be backported, don't update the
librbd version yet, just add a #define that QEMU and others can use to
detect the presence of aio_flush().
Josh Durgin [Wed, 27 Mar 2013 22:42:10 +0000 (15:42 -0700)]
librbd: use the same IoCtx for each request
Before we were duplicating the IoCtx for each new request since they
could have a different snapshot context or read from a different
snapshot id. Since librados now supports setting these explicitly
for a given request, do that instead.
Since librados tracks outstanding requests on a per-IoCtx basis, this
also fixes a bug that causes flush() without caching to ignore
all the outstanding requests, since they were to separate,
duplicate IoCtxs.
Josh Durgin [Wed, 27 Mar 2013 22:32:29 +0000 (15:32 -0700)]
librados: add versions of a couple functions taking explicit snap args
Usually the snapid to read from or the snapcontext to send with a write
are determined implicitly by the IoCtx the operations are done on.
This makes it difficult to have multiple ops in flight to the same
IoCtx using different snapcontexts or reading from different snapshots,
particularly when more than one operation may be needed past the initial
scheduling.
Add versions of aio_read, aio_sparse_read, and aio_operate
that don't depend on the snap id or snapcontext stored in the IoCtx,
but get them from the caller. Specifying this information for each
operation can be a more useful interface in general, but for now just
add it for the methods used by librbd.
Josh Durgin [Thu, 28 Mar 2013 17:34:37 +0000 (10:34 -0700)]
ObjectCacher: remove unneeded var from flush_set()
The gather will only have subs if there is something to flush. Remove
the safe variable, which indicates the same thing, and convert the
conditionals that used it to an else branch. Movinig gather.activate()
inside the has_subs() check has no effect since activate() does
nothing when there are no subs.
This removes the last remnants of b5e9995f59d363ba00d9cac413d9b754ee44e370. If there's nothing to flush,
immediately call the callback instead of deleting it. Callers were
assuming they were responsible for completing the callback whenever
flush_set() returned true, and always called complete(0) in this
case. Simplify the interface and just do this in flush_set(), so that
it always calls the callback.
Since C_GatherBuilder deletes its finisher if there are no subs,
only set its finisher when subs are present. This way we can still
call ->complete() for the callback.
Josh Durgin [Wed, 13 Mar 2013 16:37:21 +0000 (09:37 -0700)]
ObjectCacher: optionally make writex always non-blocking
Add a callback argument to writex, and a finisher to run the
callbacks. Move the check for dirty+tx > max_dirty into a helper that
can be called from a wrapper around the callbacks from writex, or from
the current place in _wait_for_write().
Josh Durgin [Thu, 28 Mar 2013 00:30:42 +0000 (17:30 -0700)]
librbd: flush cache when set_snap() is called
If there are writes pending, they should be sent while the image
is still writeable. If the image becomes read-only, flushing the
cache will just mark everything dirty again due to -EROFS.
Sam Lang [Wed, 27 Mar 2013 15:58:25 +0000 (10:58 -0500)]
mds: Delay session close if in clientreplay
If the mds is in clientreplay, a session close
request needs to be delayed until it reaches
active. Otherwise, the session state gets set to
'closing', and the replay requests get dropped on the
floor.
Fixes #4564. Signed-off-by: Sam Lang <sam.lang@inktank.com>
Sam Lang [Wed, 27 Mar 2013 14:35:08 +0000 (09:35 -0500)]
mds: Clear backtrace updates on standby_trim_seg
If the mds is standby, when a segment is trimmed, we need
to clear the backtrace updates list to avoid the following
assertion when the segment is deleted.
Samuel Just [Tue, 26 Mar 2013 22:10:37 +0000 (15:10 -0700)]
ReplicatedPG: send entire stats on OP_BACKFILL_FINISH
Otherwise, we update the stat.stat structure, but not the
stat.invalid_stats part. This will result in a recently
split primary propogating the invalid stats but not the
invalid marker. Sending the whole pg_stat_t structure
also mirrors MOSDSubOp.
Fixes: #4557
Backport: bobtail Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sam Lang [Tue, 26 Mar 2013 13:55:40 +0000 (08:55 -0500)]
mds: CInode::build_backtrace() always incr iter
Always increment the iterator when adding old pools
to the backtrace. This fixes a bug on files where
the layout had been set to a different pool and then
back to the same pool, causing continuous looping in
the build_backtrace() function.
Fixes #4537. Signed-off-by: Sam Lang <sam.lang@inktank.com>
Yehuda Sadeh [Mon, 25 Mar 2013 16:50:33 +0000 (09:50 -0700)]
rgw: bucket index ops on system buckets shouldn't do anything
Fixes: #4508
Backport: bobtail
On certain bucket index operations we didn't check whether
the bucket was a system bucket, which caused the operations
to fail. This triggered an error message on bucket removal
operations.
Sage Weil [Tue, 19 Mar 2013 21:26:16 +0000 (14:26 -0700)]
os/FileJournal: fix aio self-throttling deadlock
This block of code tries to limit the number of aios in flight by waiting
for the amount of data to be written to grow relative to a function of the
number of aios. Strictly speaking, the condition we are waiting for is a
function of both aio_num and the write queue, but we are only woken by
changes in aio_num, and were (in rare cases) waiting when aio_num == 0 and
there was no possibility of being woken.
Fix this by verifying that aio_num > 0, and restructuring the loop to
recheck that condition on each wakeup.
Fixes: #4079 Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
mon: AuthMonitor: delete auth_handler while increasing max_global_id
By not deleting and setting NULL the session's auth_handler, we could
hit a scenario in which we'd end up dispatching a previously-wait-listed
auth message and we wouldn't start its auth session.
This only happened when increasing max_global_id via Paxos (in which case
we would wait-list the message) and would only be noticeable when running
with cephx disabled.
Fixes: #4519 Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Samuel Just [Tue, 19 Mar 2013 21:45:41 +0000 (14:45 -0700)]
FileJournal: quieter debugging on journal scanning
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 6740d512ac12263f7bee370bc14b1179f83af5be)
Signed-off-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
(cherry picked from commit e1e2d5d2176cc9debd436ba944e6ca65b3253c8a)
Josh Durgin [Sat, 16 Mar 2013 00:28:13 +0000 (17:28 -0700)]
librbd: optionally wait for a flush before enabling writeback
Older guests may not send flushes properly (i.e. never), so if this is
enabled, rbd_cache=true is safe for them transparently.
Disable by default, since it will unnecessarily slow down newer guest
boot, and prevent writeback caching for things that don't need to send
flushes, like the command line tool.
Refs: #3817 Signed-off-by: Josh Durgin <josh.durgin@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 19 Mar 2013 17:15:41 +0000 (10:15 -0700)]
mon/Paxos: set state to RECOVERING during restart
This ensures that the paxos state is not active when the PaxosService
restart() methods run right afterwards, and that EAGAIN waiters will get
requeued appropriately.
Sam Lang [Mon, 18 Mar 2013 21:59:04 +0000 (16:59 -0500)]
mds: Handle ENODATA returned from getxattr
The osds might return ENODATA if we request an
xattr that doesn't exist. In this case, we're
requesting the 'parent' xattr so that we can
remove all the forwarding pointers, but the xattr
may not have been written (which only happens on
log segment trim), so we don't assert here.
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
mon: HealthMonitor: Keep track of monitor cluster's health
The HealthMonitor builds upon the QuorumService interface, and should be
used to keep track of all and any relevant information about the monitor
cluster (maybe even about all the cluster if need be).
This patch also introduces the HealthService interface, used to define
a HealthMonitor service, responsible for dispatching 'MMonHealth' messages
(the QuorumService interface dispatches generic 'Message').
Based on the HealthService interface, we introduce the DataHealthService
class, a service that will track disk space consumption by the monitors,
warn when a given threshold is crossed, and gracefully shutdown the monitor
if disk space usage hits critical levels that might affect the correct
monitor behavior.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
mon: QuorumService: Allow for services quorum-bound to be easily created
As the monitor grows in features, we have been dumping them in the Monitor
class as they don't really fit anywhere else.
Most of those latest features have been, and some of the future changes
will also be, quorum-bounded. By that we mean that these features tend
to require a quorum to be present in order to work.
Although we already have the PaxosService interface, it really isn't
adequate for this kind of features, as they don't really require Paxos,
nor do they access the store. Furthermore, they don't really need to
tick at the same rate as the monitor, and can be fairly independent.
Therefore we now introduce the concept of a QuorumService, a class to be
built upon, managing the tick and dispatch for any kind of service
basically requiring a quorum to function.
Among the already existing monitor features that could take advantage of
this new class we can find the Timecheck infrastructure, as it is by
nature quorum bounded. The monitor store sync could also take advantage
of this service, although it doesn't really require a quorum to work,
and even the PaxosService-related classes could use this.
This patch also introduces the MMonQuorumService base class, to be used
by any message that should want to.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sam Lang [Mon, 18 Mar 2013 19:40:48 +0000 (14:40 -0500)]
client: Remove unecessary set_inode() in _rmdir()
With the recent changes in fc80c1dc6ee315ae5e039986602ffadba46cb43b,
we only allow setting the inode once on a MetaRequest. This triggered
a bug in _rmdir(), where the parent dir inode passed in and being set
on the MetaRequest, and then also setting the dir inode on the MetaRequest.
Removing the set_inode() using the parent dir inode resolves this issue.
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Fix interator handling in ~TestFileStoreState(). After std::map::erase()
the used iterator is invalid. Use a while-loop and erase the object with
post-incremented iterator instead.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>