Mike Ryan [Fri, 5 Oct 2012 21:37:34 +0000 (14:37 -0700)]
pg: recovery reservations
This extends the backfill reservation system to work with log-based
recovery. The Active and RepActive states of the PG state machine are
greatly expanded to deal with the increased complexity of handling both
recovery and backfill reservations.
Samuel Just [Mon, 15 Oct 2012 22:39:55 +0000 (15:39 -0700)]
FileJournal: break writeq locking from queue_lock
This keeps the relatively long process of queueing
finishers from blocking op submission.
In submit_entry, we no longer check whether the journal is full before
placing the write in the writeq; committed_thru should work anyway,
and we don't want to grab the lock that check would require.
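A minimal sketch of the two-lock idea, with hypothetical names (this is
not the actual FileJournal code): submitters only take the write-queue
lock, so the long-running finisher work under queue_lock no longer
blocks them.

    #include <cstdint>
    #include <deque>
    #include <mutex>

    struct Entry { uint64_t seq; /* payload ... */ };

    class JournalSketch {
      std::mutex queue_lock;   // protects completion/finisher bookkeeping
      std::mutex writeq_lock;  // protects only the write queue
      std::deque<Entry> writeq;

    public:
      // Fast path for submitters: never touches queue_lock.
      void submit_entry(Entry e) {
        std::lock_guard<std::mutex> l(writeq_lock);
        writeq.push_back(std::move(e));
      }

      // Writer thread: drain the queue quickly, then do the slower
      // finisher queueing under queue_lock without holding writeq_lock.
      void write_thread_once() {
        std::deque<Entry> batch;
        {
          std::lock_guard<std::mutex> l(writeq_lock);
          batch.swap(writeq);
        }
        std::lock_guard<std::mutex> l(queue_lock);
        // ... write out 'batch' and queue completions ...
      }
    };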
Samuel Just [Sat, 6 Oct 2012 00:33:36 +0000 (17:33 -0700)]
JournalingFileStore: move apply/commit sequencing to apply_manager
Syncing the filestore requires a stable commit point (i.e., all ops
up to applied_seq must have been applied). Previously, we used
journal_lock to atomically block new applies while waiting for
the remaining ones to finish. This creates unnecessary contention.
We now use apply_manager to manage that state atomically with its
own lock.
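A rough sketch of that arrangement, with hypothetical names (not the
real apply_manager): apply/commit sequencing gets its own mutex and
condition variable instead of piggybacking on journal_lock.

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    class ApplyManagerSketch {
      std::mutex lock;
      std::condition_variable cond;
      bool blocked = false;        // true while a commit point is being taken
      int open_ops = 0;            // applies currently in flight
      uint64_t applied_seq = 0;

    public:
      void op_apply_start() {
        std::unique_lock<std::mutex> l(lock);
        cond.wait(l, [this] { return !blocked; });  // new applies park here
        ++open_ops;
      }

      void op_apply_finish(uint64_t seq) {
        std::lock_guard<std::mutex> l(lock);
        if (seq > applied_seq) applied_seq = seq;
        if (--open_ops == 0) cond.notify_all();
      }

      // Block new applies, wait for the in-flight ones, and return a
      // stable commit point; journal_lock is never involved.
      uint64_t commit_start() {
        std::unique_lock<std::mutex> l(lock);
        blocked = true;
        cond.wait(l, [this] { return open_ops == 0; });
        return applied_seq;
      }

      void commit_finish() {
        std::lock_guard<std::mutex> l(lock);
        blocked = false;
        cond.notify_all();
      }
    };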
Samuel Just [Fri, 5 Oct 2012 20:46:13 +0000 (13:46 -0700)]
JournalingFileStore: create submit_manager to order op submission
Previously, we ensured op ordering by queueing for journal and
the op queue under the journal lock. All that is required is
that obtaining an op sequence, queueing for journal, and
(for parallel) queueing for application to the fs are done
atomically. To that end, submit_manager now handles op submission.
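A minimal sketch of that requirement, with hypothetical names: taking
the op sequence and enqueueing happen under one small lock so journal
order always matches submission order.

    #include <cstdint>
    #include <deque>
    #include <mutex>

    struct Op { uint64_t seq; };

    class SubmitManagerSketch {
      std::mutex lock;
      uint64_t next_seq = 1;
      std::deque<Op> journal_q;
      std::deque<Op> apply_q;   // only used in "parallel" journaling mode

    public:
      uint64_t op_submit(bool parallel) {
        std::lock_guard<std::mutex> l(lock);
        Op op{next_seq++};
        journal_q.push_back(op);               // queue for journal ...
        if (parallel) apply_q.push_back(op);   // ... and for the fs, atomically
        return op.seq;
      }
    };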
Samuel Just [Tue, 31 Jul 2012 16:04:40 +0000 (09:04 -0700)]
JournalingFileStore: pass -1 as the alignment if unimportant
Previously, data_align began at 0 and remained that way if no
transaction contained a large data segment. This 0 was propagated
to prepare_single_write, which padded out most of a page to ensure
that the bl started with 0 alignment. Passing -1 will ensure that
we don't prepad these small segments.
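An illustrative sketch of the sentinel (hypothetical helper, not the
real prepare_single_write): a negative alignment means "no
requirement", so small segments are not padded out to a page boundary.

    #include <cstddef>
    #include <cstdint>

    constexpr size_t PAGE = 4096;

    // How many pad bytes to insert before the data so it lands at the
    // requested in-page offset; zero when the alignment is unimportant.
    size_t prepad_bytes(int64_t data_align, size_t current_off) {
      if (data_align < 0)
        return 0;                                   // -1: don't prepad
      size_t want = static_cast<size_t>(data_align) % PAGE;
      size_t have = current_off % PAGE;
      return want >= have ? want - have : PAGE - have + want;
    }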
Sage Weil [Tue, 30 Oct 2012 20:19:30 +0000 (13:19 -0700)]
msg/SimpleMessenger: start accepter in ready()
Start the accepter thread when the first dispatcher is ready. This ensures
that there will be someone around to verify authorizers for incoming
connections, and means we have a bit less failure noise on the monitors
as a result.
Sage Weil [Tue, 30 Oct 2012 17:00:42 +0000 (10:00 -0700)]
msg/Pipe: only randomize start seq #'s if MSG_AUTH feature is present
The kernel client expects seq #'s to start at 1 or else it is unhappy.
So, only randomize these values if the MSG_AUTH feature is present--that is
the only time it matters anyway.
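A sketch of the gating logic (the feature bit and names here are
placeholders): pick a random starting sequence number only when the
peer advertises MSG_AUTH; older peers such as the kernel client then
still see sequence numbers starting at 1.

    #include <cstdint>
    #include <random>

    constexpr uint64_t FEATURE_MSG_AUTH = 1ULL << 30;   // placeholder bit

    uint64_t initial_out_seq(uint64_t peer_features) {
      if (peer_features & FEATURE_MSG_AUTH) {
        std::mt19937_64 rng{std::random_device{}()};
        return rng() & 0x7fffffff;   // random, positive, well below wrap
      }
      return 0;                      // first message sent will be seq 1
    }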
Sam Lang [Mon, 29 Oct 2012 15:30:01 +0000 (10:30 -0500)]
client: Fix ref counting double free with hardlink
Performing a hard link through the libcephfs interface causes
a double free on shutdown, due to the Client::link call decrementing
the parent (of the target) directory's inode. This fix removes the
put_inode(dir) call, to match the behavior of Client::ll_link.
Signed-off-by: Sam Lang <sam.lang@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Dan Mick [Tue, 23 Oct 2012 04:15:51 +0000 (21:15 -0700)]
librbd: clip requests past end-of-image.
Rename check_io to clip_io, which can modify the passed-in length
to clamp it to the device size. This is expected behavior for
block-device emulation.
Call clip_io() in rbd_write(); we need to return the clipped length
there, even though aio_write() calls clip_io() as well (for the
direct path).
Signed-off-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
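A hedged sketch of the clipping behavior (the signature is
illustrative): clamp [off, off+len) to the image size and hand the
clipped length back to the caller, the way a block device would.

    #include <cstdint>

    int clip_io(uint64_t image_size, uint64_t off, uint64_t *len) {
      if (off >= image_size) {
        *len = 0;                  // entirely past the end: nothing to do
        return 0;
      }
      if (off + *len > image_size)
        *len = image_size - off;   // partially past the end: shorten it
      return 0;
    }

rbd_write() would then return the (possibly shortened) *len rather
than the length the caller asked for.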
Jim Schutt [Thu, 27 Sep 2012 21:56:15 +0000 (15:56 -0600)]
PG: Do not discard op data too early
Under a sustained cephfs write load where the offered load is higher
than the storage cluster write throughput, a backlog of replication ops
that arrive via the cluster messenger builds up. The client message
policy throttler, which should limit the total write workload accepted
by the storage cluster, cannot prevent this under such an overload
condition, for any value of osd_client_message_size_cap.
The root cause is that op data is released too early, in op_applied().
If instead the op data is released at op deletion, then the limit
imposed by the client policy throttler applies over the entire
lifetime of the op, including commits of replication ops. That
makes the policy throttler an effective means for an OSD to
protect itself from a sustained high offered load, because it can
effectively limit the total, cluster-wide resources needed to process
in-progress write ops.
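A conceptual sketch of the difference (not the OSD's actual types):
the throttle budget taken when the client message arrives is released
in the op's destructor, so it stays charged for the op's whole
lifetime, replication commits included, instead of being returned at
op_applied().

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    class Throttle {
      std::mutex m;
      std::condition_variable cv;
      const uint64_t max;
      uint64_t cur = 0;
    public:
      explicit Throttle(uint64_t max_bytes) : max(max_bytes) {}
      void get(uint64_t n) {                  // blocks new client IO at the cap
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return cur + n <= max; });
        cur += n;
      }
      void put(uint64_t n) {
        std::lock_guard<std::mutex> l(m);
        cur -= n;
        cv.notify_all();
      }
    };

    struct OpSketch {
      Throttle &throttle;
      uint64_t bytes;
      OpSketch(Throttle &t, uint64_t b) : throttle(t), bytes(b) { throttle.get(bytes); }
      ~OpSketch() { throttle.put(bytes); }    // released only at op deletion
    };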
Sage Weil [Wed, 24 Oct 2012 21:41:38 +0000 (14:41 -0700)]
osdc/ObjectCacher: set complete flag when we observe ENOENT
If we observe an ENOENT on a read, set the complete flag. Any dirty
buffers we have will still be in memory, even if the writes are in flight,
because the TX state remains pinned until the writes commit. Writes cannot
proceed faster than reads, even though reads may proceed faster than
writes.
Sage Weil [Wed, 24 Oct 2012 19:48:02 +0000 (12:48 -0700)]
osdc/ObjectCacher: refresh iterator in read apply loop
The p iterator points to the next bh, but try_merge_bh() at the end of the
loop might merge that into our result and invalidate the iterator. Fix
this by repeating the lookup on each pass through the loop.
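The pattern in generic form (std::map standing in for the bh map): any
step that can erase or merge the element an iterator points to
invalidates it, so the position is looked up again on every pass.

    #include <cstdint>
    #include <map>

    void apply_loop(std::map<uint64_t, int> &bh_map) {
      uint64_t pos = 0;
      while (true) {
        auto p = bh_map.lower_bound(pos);   // fresh lookup each iteration
        if (p == bh_map.end())
          break;
        pos = p->first + 1;                 // remember where to resume
        // ... fill *p with read data; a merge step here may erase the
        // element p referred to, which is why p is never reused ...
      }
    }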
Sage Weil [Wed, 24 Oct 2012 19:44:25 +0000 (12:44 -0700)]
osdc/ObjectCacher: do read completions after assimilating read result
Wait until we have applied the entire read result to the cache before we
trigger any read completion events. This is a cleaner and safer approach
since we can be sure that the callback won't get blocked again on data we
have but haven't applied yet. It also fixes a crash I just observed where
the completion did a read, called trim(), and invalidated/destroyed the
iterator/bh p was referencing.
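A sketch of the deferral (types are illustrative): completions that
become runnable during the apply loop are collected on a side list and
only run after the whole read result has been folded into the cache,
so a callback that re-enters the cache or trims it cannot invalidate
state the loop still needs.

    #include <functional>
    #include <list>

    void apply_read_result(std::list<std::function<void()>> &waiters_for_data) {
      std::list<std::function<void()>> ls;

      // ... loop over the read result filling buffers; completions that
      // become runnable are moved onto ls instead of being called here ...
      ls.splice(ls.end(), waiters_for_data);

      // Only now, with the cache consistent, fire the completions.
      for (auto &fin : ls)
        fin();
    }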
Sage Weil [Tue, 23 Oct 2012 16:18:04 +0000 (09:18 -0700)]
osdc/ObjectCacher: check lru_is_expireable() in can_close()
We assert that if can_close(), the Object isn't pinned in the LRU. This
assumes we did our get/put refcounting properly, such that the pins are
at least as restrictive as can_close().
Sage Weil [Fri, 26 Oct 2012 18:30:06 +0000 (11:30 -0700)]
librbd: fix race in AioCompletions that are still being built
When caching is enabled, it is possible for the io completion to happen
faster than we call ->finish_adding_requests() (e.g., on cache read).
When that happens, the final read request completion doesn't see
pending_count == 0 and thus doesn't do all the final buffer construction
that is necessary to return correct data. In particular, users will see
zeroed buffers. test_librbd_fsx is turning this up consistently after
several thousand ops with an image size of ~100MB and cloning disabled.
This was introduced with the extra logic added here with striping.
Fix this by making a separate flag to indicate the completion is under
construction, and make sure we call complete() when both pending_count==0
and building==false.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
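A simplified sketch of the fix (member names are illustrative):
requests added while the completion is still being built cannot fire
it early; complete() runs only once building is false and
pending_count has reached zero, whichever condition is satisfied last.

    #include <mutex>

    struct AioCompletionSketch {
      std::mutex lock;
      int pending_count = 0;
      bool building = true;        // set while requests are still being added

      void add_request() {
        std::lock_guard<std::mutex> l(lock);
        ++pending_count;
      }

      void complete_request() {
        std::lock_guard<std::mutex> l(lock);
        if (--pending_count == 0 && !building)
          complete();              // last request finished after building ended
      }

      void finish_adding_requests() {
        std::lock_guard<std::mutex> l(lock);
        building = false;
        if (pending_count == 0)
          complete();              // everything already finished (e.g. cache hit)
      }

      void complete() { /* assemble destination buffers, call user callback */ }
    };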