Samuel Just [Wed, 23 Jan 2013 20:15:10 +0000 (12:15 -0800)]
ReplicatedPG: start scanning omap at omap_recovered_to
Previously, we started scanning omap after omap_recovered_to.
This is a problem since the break in the loop implies that
omap_recovered_to is the first key not recovered.
Backport: bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
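The off-by-one described above can be illustrated with a minimal Python sketch. This is not the ReplicatedPG code; `keys_to_scan` and the sample keys are hypothetical, and the only point is the difference between resuming at the marker and resuming after it.

```python
from bisect import bisect_left, bisect_right

def keys_to_scan(sorted_keys, omap_recovered_to, inclusive=True):
    # inclusive=True resumes AT omap_recovered_to (what the commit describes);
    # inclusive=False resumes AFTER it (the previous behaviour).
    if inclusive:
        start = bisect_left(sorted_keys, omap_recovered_to)
    else:
        start = bisect_right(sorted_keys, omap_recovered_to)
    return sorted_keys[start:]

keys = ["a", "b", "c", "d"]
# Suppose the previous pass broke out of its loop at "c", so "c" is the
# first key that was NOT recovered.
print(keys_to_scan(keys, "c"))                   # ['c', 'd']
print(keys_to_scan(keys, "c", inclusive=False))  # ['d'], so "c" is skipped
```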
Sage Weil [Wed, 23 Jan 2013 02:08:22 +0000 (18:08 -0800)]
os/FileStore: add stall injection into filestore op queue
Allow admin to artificially induce a stall in the op queue. Forces the
thread(s) to sleep for N seconds. We pause in 1-second increments and
recheck the value so that a previously stalled thread can be unwedged by
reinjecting a lower value (or 0). To stall indefinitely, just inject a
very large number.
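A rough Python sketch of the pause-and-recheck loop described above. It is not the FileStore implementation; `injected_stall`, `get_stall_seconds` and the `config` dict are hypothetical names used only for illustration.

```python
import time

def injected_stall(get_stall_seconds, sleep=time.sleep):
    # Sleep in 1-second steps and re-read the injected value before each
    # step, so lowering it (or setting 0) releases a stalled thread.
    # Returns the number of seconds actually slept.
    slept = 0
    while slept < get_stall_seconds():
        sleep(1)
        slept += 1
    return slept

config = {"stall": 1000}  # hypothetical injected value, in seconds

def fake_sleep(_seconds):
    # Stand-in for time.sleep: after 3 ticks the admin "reinjects" 0.
    fake_sleep.ticks += 1
    if fake_sleep.ticks == 3:
        config["stall"] = 0

fake_sleep.ticks = 0
print(injected_stall(lambda: config["stall"], sleep=fake_sleep))  # 3
```

Rechecking once per second bounds how long a thread keeps sleeping after the value is lowered, at the cost of waking up once per second during a long stall.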
Sage Weil [Wed, 23 Jan 2013 02:01:07 +0000 (18:01 -0800)]
osd: hold lock while calling start_boot on startup
This probably doesn't strictly matter because start_boot doesn't need the
lock (currently) and few other threads should be running, but it is
better to be consistent.
Sage Weil [Wed, 23 Jan 2013 01:56:32 +0000 (17:56 -0800)]
osd: do not reply to ping if internal heartbeat is not healthy
If we find that our internal threads are stalled, do not reply to ping
requests. If we do this long enough, peers will mark us down. If we are
only transiently unhealthy, we will reply to the next ping and they will
be satisfied. If we are unhealthy and marked down, and eventually recover,
we will mark ourselves back up.
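The behaviour above can be sketched in Python as follows. This is an illustration only: `handle_ping`, `peer_marks_down` and the "N consecutive missed replies" policy are assumptions, not the OSD's actual heartbeat code.

```python
def handle_ping(internal_heartbeat_healthy):
    # Return a reply, or None to stay silent while internal threads are stalled.
    if not internal_heartbeat_healthy:
        return None
    return "ping_reply"

def peer_marks_down(replies, grace):
    # Assumed peer policy: mark us down after `grace` consecutive missed replies.
    missed = 0
    for r in replies:
        missed = missed + 1 if r is None else 0
        if missed >= grace:
            return True
    return False

transient = [handle_ping(h) for h in (True, False, True)]
stalled = [handle_ping(h) for h in (False, False, False)]
print(peer_marks_down(transient, grace=3))  # False
print(peer_marks_down(stalled, grace=3))    # True
```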
David Zafman [Wed, 9 Jan 2013 03:24:13 +0000 (19:24 -0800)]
osd: Add digest of omap for deep-scrub
Add ScrubMap encode/decode v4 message with omap digest
Compute digest of header and key/value. Use bufferlist
to reflect structure and compute as we go, clearing
bufferlist to reduce memory usage.
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
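A minimal Python sketch of computing a digest "as we go". It assumes `zlib.crc32` as the checksum and a length-prefixed layout for each key/value pair; neither is taken from the commit, and `omap_digest` is a hypothetical name.

```python
import zlib

def omap_digest(header, items):
    # Start from the header, then fold in one key/value pair at a time.
    digest = zlib.crc32(header)
    buf = bytearray()
    for key, value in items:
        # Length prefixes make the buffer reflect the pair's structure.
        buf += len(key).to_bytes(4, "little") + key
        buf += len(value).to_bytes(4, "little") + value
        digest = zlib.crc32(bytes(buf), digest)
        buf.clear()  # keep only one pair in memory at a time
    return digest

a = omap_digest(b"hdr", [(b"k1", b"v1"), (b"k2", b"v2")])
b = omap_digest(b"hdr", [(b"k1", b"v1"), (b"k2", b"vX")])
print(a == b)  # False
```

Because the running value is passed back into each call, memory use stays proportional to the largest single pair rather than to the whole omap.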
Sage Weil [Mon, 21 Jan 2013 23:29:28 +0000 (15:29 -0800)]
common/PrioritizedQueue: add min cost, max tokens per bucket
Two problems.
First, we need to cap the tokens per bucket. Otherwise, a stream of
items at one priority over time will indefinitely inflate the tokens
available at another priority. The cap should represent how "bursty"
we allow a given bucket to be. Start with 4MB for now.
Second, set a floor on the item cost. Otherwise, we can have an
infinite queue of 0 cost items that starve other queues. More
realistically, we need to balance the overhead of processing small items
with the cost of large items. I.e., a 4 KB item is not 1/1000th as
expensive as a 4MB item.
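The two fixes can be sketched in Python as a cap and a floor. The 4 MB cap comes from the message above; the `MIN_COST` value and the function names are hypothetical, since the commit message does not state them.

```python
MAX_TOKENS_PER_BUCKET = 4 * 1024 * 1024  # 4 MB cap, as stated in the commit message
MIN_COST = 64 * 1024                     # hypothetical floor; not given in the message

def add_tokens(current, granted, max_tokens=MAX_TOKENS_PER_BUCKET):
    # Cap the bucket so idle priorities cannot bank unlimited tokens.
    return min(current + granted, max_tokens)

def effective_cost(cost, min_cost=MIN_COST):
    # Charge at least min_cost so a stream of 0-cost items still spends tokens.
    return max(cost, min_cost)

tokens = 0
for _ in range(1000):
    tokens = add_tokens(tokens, 1024 * 1024)
print(tokens)                # 4194304
print(effective_cost(0))     # 65536
print(effective_cost(4096))  # 65536
```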
Sage Weil [Mon, 21 Jan 2013 22:14:25 +0000 (14:14 -0800)]
osd: add OpRequest flag point when commit is sent
With writeahead journaling in particular, we can get requests that
stay in the queue for a long time even after the commit is sent to the
client while we are waiting for the transaction to apply to the fs.
Instead of showing up as 'waiting for subops', make it clear that the
client has gotten its reply and it is local state that is slow.
Sage Weil [Tue, 22 Jan 2013 04:00:26 +0000 (20:00 -0800)]
osd: target transaction size 300 -> 30
Small transactions make pg removal nicer to the op queue. It also slows
down PG deletion a bit, which may exacerbate the PG resurrection case
until #3884 is addressed.
At least one user reported this fixed an osd that kept failing due to
an internal heartbeat failure.
Sage Weil [Sun, 20 Jan 2013 01:33:25 +0000 (17:33 -0800)]
filestore: disable extra committing queue allowance
The allowance exists for the case where there is a problem draining the op queue
during a sync. For XFS and ext4, this isn't generally a problem: you
can continue to make writes while a syncfs(2) is in progress. There
are currently some possible implementation issues with btrfs, but we
have not demonstrated them recently.
Meanwhile, this can cause queue length spikes that screw up latency.
During a commit, we allow too much into the queue (say, recovery
operations). After the sync finishes, we have to drain it out before
we can queue new work (say, a higher priority client request). Having
a deep queue below the point where priorities order work limits the
value of the priority queue.
Sage Weil [Mon, 21 Jan 2013 16:45:10 +0000 (08:45 -0800)]
config: don't make noise about 'internal_safe_to_start_threads'
This is set on start, and subsequently gets into the changed set.
Once any other config value is injected, it is the first thing reported
by the logs, but is confusing and useless to the user. Hide it.
Sage Weil [Mon, 21 Jan 2013 00:11:10 +0000 (16:11 -0800)]
osd: calculate initial PG mapping from PG's osdmap
The initial values of up/acting need to be based on the PG's osdmap, not
the OSD's latest. Using the latest map can cause various confusion in
pg_interval_t::check_new_interval() when calling OSDMap methods due to the
up/acting OSDs not existing yet (for example).
Fixes: #3879
Reported-by: Jens Kristian Søgaard <jens@mermaidconsulting.dk>
Tested-by: Jens Kristian Søgaard <jens@mermaidconsulting.dk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Travis Rhoden [Sat, 19 Jan 2013 03:26:07 +0000 (22:26 -0500)]
Clarify journal size based on filestore max sync
The docs had the recommended journal size based on the option
"filestore min sync interval" when it should have been
"filestore max sync interval".
While in there, fix a couple of typos -- multiple when it should
be multiply, and a missing word. Change "Should at least twice"
to "Should be at least twice..."
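As a worked example of the sizing rule, here is a small Python sketch. It assumes the rule is "at least twice the product of expected throughput and filestore max sync interval"; the function name and the figures are hypothetical.

```python
def min_journal_size_mb(expected_throughput_mb_s, filestore_max_sync_interval_s):
    # Assumed rule: 2 * (expected throughput * filestore max sync interval).
    return 2 * expected_throughput_mb_s * filestore_max_sync_interval_s

# Hypothetical figures: 100 MB/s of throughput and a 5 s max sync interval.
print(min_journal_size_mb(100, 5))  # 1000
```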
Sam Lang [Thu, 17 Jan 2013 20:03:51 +0000 (14:03 -0600)]
client: Respect O_SYNC, O_DSYNC, and O_RSYNC
If the file is opened with O_SYNC, O_DSYNC, or O_RSYNC, we need to
flush cached data (and metadata for O_SYNC) on a write.
For O_RSYNC, we need to flush dirty data on a read.
This patch adds a file_flush() call to the objectCacher
to allow a specific range to be flushed from the cache, and
in the O_SYNC, O_DSYNC case for write and O_RSYNC case for read,
calls that function and waits for the flush to complete. The patch
also adds a flags field directly to the file handle struct, and
replaces the append boolean with the use of the flags field directly.
Gary Lowell [Fri, 18 Jan 2013 06:43:07 +0000 (22:43 -0800)]
build: Add perl installation dependency to rpm and debian packages.
There was already a dependency on python in the debian control file;
a similar dependency was added to the rpm spec file. perl is needed
for the logrotate script, so a dependency on perl was added to
both. Bug 3768.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Josh Durgin [Wed, 26 Dec 2012 22:24:22 +0000 (14:24 -0800)]
rbd: fix bench-write infinite loop
I/O was continuously submitted as long as there were few enough ops in
flight. If the number of 'threads' was high, or caching was turned on,
there would never be that many ops in flight, so the loop would continue
indefinitely. Instead, submit at most io_threads ops per offset.
Fixes: #3413
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage.weil@inktank.com>
Dan Mick [Thu, 17 Jan 2013 19:32:03 +0000 (11:32 -0800)]
crushtool: warn usefully about missing output spec
When running with --test, you must request output to CSV files or
specific types of output to --show-X; make the error message
clarify what the tool wants.
Fixes: #3827
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Dan Mick [Thu, 17 Jan 2013 19:18:46 +0000 (11:18 -0800)]
crushtool: consolidate_whitespace() should eat everything except \n
CRUSH map source with \r (like a DOS text file) failed to compile
with the usual nonuseful message; turns out that eating \r along with
' ' and '\t' etc. solves that problem.
Fixes: #3834
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
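The idea can be sketched in Python with a regular expression. This is not crushtool's code; in particular, collapsing each run to a single space is an assumption about what "eating" means here.

```python
import re

def consolidate_whitespace(text):
    # Collapse every run of whitespace other than a newline (space, tab,
    # carriage return, ...) into one space; newlines are left untouched.
    return re.sub(r"[^\S\n]+", " ", text)

dos_line = "device 0\t osd.0\r\n"
print(repr(consolidate_whitespace(dos_line)))  # 'device 0 osd.0 \n'
```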
Sage Weil [Fri, 28 Dec 2012 00:18:19 +0000 (16:18 -0800)]
mon: enforce 'cephx require signatures' during negotiation
If we are negotiating which auth protocol to use, and the client does not
support the MSG_AUTH feature, and the server has 'cephx require signatures'
set to true, then remove cephx from the list of allowed protocols.
Also print something in the mon log so that we know wtf is going on.
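A minimal Python sketch of the negotiation rule described above. The function and parameter names are hypothetical, and the log line only stands in for whatever the monitor actually prints.

```python
def allowed_auth_protocols(supported, client_has_msg_auth, cephx_require_signatures):
    # Drop cephx when signatures are required but the client cannot sign.
    if cephx_require_signatures and not client_has_msg_auth and "cephx" in supported:
        print("client lacks MSG_AUTH; removing cephx from allowed protocols")
        return [p for p in supported if p != "cephx"]
    return list(supported)

print(allowed_auth_protocols(["cephx", "none"], False, True))  # ['none']
print(allowed_auth_protocols(["cephx", "none"], True, True))   # ['cephx', 'none']
```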
Sage Weil [Fri, 28 Dec 2012 00:03:20 +0000 (16:03 -0800)]
msg/Pipe: require MSG_AUTH feature on server if option is enabled
If we
negotiate cephx AND
are a server AND
cephx require signatures = true
then require the MSG_AUTH feature bit. Put this in the Policy struct for
this connection so that the existing feature bit checks and error reporting
are used, and the peer knows what feature it is missing.
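The three-way condition and the resulting feature check might look like this in Python. The bit position of `MSG_AUTH` and the helper names are hypothetical; only the condition itself comes from the message above.

```python
MSG_AUTH = 1 << 0  # hypothetical bit position, for illustration only

def required_features(base_required, negotiated_cephx, is_server, cephx_require_signatures):
    # Add MSG_AUTH to the connection policy only when all three conditions hold.
    if negotiated_cephx and is_server and cephx_require_signatures:
        return base_required | MSG_AUTH
    return base_required

def missing_features(required, peer_features):
    # Bits that are required but that the peer does not advertise.
    return required & ~peer_features

required = required_features(0, True, True, True)
print(missing_features(required, peer_features=0) == MSG_AUTH)  # True
```

Folding the requirement into the policy's required bits means the generic feature check reports the missing bit, rather than needing a separate error path.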
Sage Weil [Thu, 17 Jan 2013 23:01:35 +0000 (15:01 -0800)]
osdmap: make replica separate in default crush map configurable
Add 'osd crush chooseleaf type' option to control what the default
CRUSH rule separates replicas across. Default to 1 (host), and set it
to 0 in vstart.sh.
Fixes: #3785
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>