Dan Mick [Tue, 30 Oct 2012 21:02:53 +0000 (14:02 -0700)]
rbd-fuse: add simple RBD FUSE client
Currently written in C on FUSE hi-level interfaces, so error reporting
could be better. No serious work done for performance. But it's
usable as it stands.
Specify -c <conf> and a mountpoint, and images show up as files in
that mountpoint. You can create new images; they'll be created
with attributes stored in xattrs:
Images may be truncated or extended by rewriting. Currently
once an image is opened, it's not closed, so it can't be deleted
or changed outside of the fuse path.
Dan Mick [Sat, 26 Jan 2013 05:22:45 +0000 (21:22 -0800)]
s3/php: update to 1.5? version of API
Something like v1.5 of the Amazon PHP library requires the AmazonS3
constructor to be given an array of parameters rather than using
the globals. More research needs to happen, and particularly
about the v2 API, but this might solve someone's problem with
v1.5 while we do that research.
Ross Turk [Fri, 25 Jan 2013 20:48:31 +0000 (12:48 -0800)]
doc: wider sidebar, larger font, cleaned tip CSS
The sidebar is now about a hundred pixels wider and the fonts
are larger throughout. This works a lot better when you get
deep into the doc structure - it used to wrap horribly.
I also fixed how literals look inside .tip and .important.
For some reason, the lookup() retry loop (for when happened to
race with a removal and grab an invalid WeakPtr) locked
the lock again. This causes the #3836 crash since the lock
is already locked. It's rare since it requires a lookup between
invalidation of the WeakPtr and removal of the WeakPtr entry.
Fixes: #3836
Backport: bobtail Signed-off-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Fri, 25 Jan 2013 17:29:37 +0000 (09:29 -0800)]
osd: share incoming maps via Connection*, not addrs
Kill a set of parallel methods that are using the old addr/inst-based
msgr APIs, and instead use Connection handles. This is much safer and gets
us closer to killing the old msgr API.
Sage Weil [Fri, 25 Jan 2013 17:27:00 +0000 (09:27 -0800)]
osd: pass new maps to dead osds via existing Connection
Previously we were sending these maps to dead osds via their old addrs
using a new outgoing connection and setting the flags so that the msgr
would clean up. That mechanism is possibly buggy and fragile, and we can
avoid it entirely if we just reuse the existing heartbeat Connection.
Sage Weil [Fri, 25 Jan 2013 17:25:28 +0000 (09:25 -0800)]
osd: requeue osdmaps on heartbeat connections for cluster connection
If we receive an OSDMap on the cluster connection, requeue it for the
cluster messenger, and process it there where we normally do. This avoids
any concerns about locking and ordering rules.
The allowance is not only added for btrfs as of commit e639254a0c5f8e3528fa8f2b2b451296653556bc, which makes us happy
for both non-btrfs (lower latency) and btrfs (better small io
throughput, no big stall during commit).
Sage Weil [Thu, 24 Jan 2013 06:16:49 +0000 (22:16 -0800)]
os/FileStore: only adjust up op queue for btrfs
We only need to adjust up the op queue limits during commit for btrfs,
because the snapshot initiation (async create) is currently
high-latency and the op queue is quiesced during that period.
This lets us revert 44dca5c, which disabled the extra allowance because
it is generally bad for non-btrfs writeahead mode.
Samuel Just [Thu, 24 Jan 2013 19:07:37 +0000 (11:07 -0800)]
OSD: use TPHandle in peering_wq
Implement _process overload with TPHandle argument and use
that to ping the hb map between pgs and between map epochs
when advancing a pg. The thread will still timeout if
genuinely stuck at any point.
Fixes: 3905
Backport: bobtail Signed-off-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Wed, 23 Jan 2013 20:15:10 +0000 (12:15 -0800)]
ReplicatedPG: start scanning omap at omap_recovered_to
Previously, we started scanning omap after omap_recovered_to.
This is a problem since the break in the loop implies that
omap_recovered_to is the first key not recovered.
Backport: bobtail Signed-off-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Wed, 23 Jan 2013 02:08:22 +0000 (18:08 -0800)]
os/FileStore: add stall injection into filestore op queue
Allow admin to artificially induce a stall in the op queue. Forces the
thread(s) to sleep for N seconds. We pause for 1 second increments and
recheck the value so that a previously stalled thread can be unwedged by
reinjecting a lower value (or 0). To stall indefinitely, just injust
very large number.
Sage Weil [Wed, 23 Jan 2013 02:01:07 +0000 (18:01 -0800)]
osd: hold lock while calling start_boot on startup
This probably doesn't strictly matter because start_boot doesn't need the
lock (currently) and few other threads should be running, but it is
better to be consistent.
Sage Weil [Wed, 23 Jan 2013 01:56:32 +0000 (17:56 -0800)]
osd: do not reply to ping if internal heartbeat is not healthy
If we find that our internal threads are stalled, do not reply to ping
requests. If we do this long enough, peers will mark us down. If we are
only transiently unhealthy, we will reply to the next ping and they will
be satisfied. If we are unhealthy and marked down, and eventually recover,
we will mark ourselves back up.
David Zafman [Wed, 9 Jan 2013 03:24:13 +0000 (19:24 -0800)]
osd: Add digest of omap for deep-scrub
Add ScrubMap encode/decode v4 message with omap digest
Compute digest of header and key/value. Use bufferlist
to reflect structure and compute as we go, clearing
bufferlist to reduce memory usage.
Signed-off-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Mon, 21 Jan 2013 23:29:28 +0000 (15:29 -0800)]
common/PrioritizedQueue: add min cost, max tokens per bucket
Two problems.
First, we need to cap the tokens per bucket. Otherwise, a stream of
items at one priority over time will indefinitely inflate the tokens
available at another priority. The cap should represent how "bursty"
we allow a given bucket to be. Start with 4MB for now.
Second, set a floor on the item cost. Otherwise, we can have an
infinite queue of 0 cost items that start over queues. More
realistically, we need to balance the overhead of processing small items
with the cost of large items. I.e., a 4 KB item is not 1/1000th as
expensive as a 4MB item.
Sage Weil [Mon, 21 Jan 2013 22:14:25 +0000 (14:14 -0800)]
osd: add OpRequest flag point when commit is sent
With writeahead journaling in particular, we can get requests that
stay in the queue for a long time even after the commit is sent to the
client while we are waiting for the transaction to apply to the fs.
Instead of showing up as 'waiting for subops', make it clear that the
client has gotten its reply and it is local state that is slow.
Sage Weil [Tue, 22 Jan 2013 04:00:26 +0000 (20:00 -0800)]
osd: target transaction size 300 -> 30
Small transactions make pg removal nicer to the op queue. It also slows
down PG deletion a bit, which may exacerbate the PG resurrection case
until #3884 is addressed.
At least on user reported this fixed an osd that kept failing due to
an internal heartbeat failure.
Sage Weil [Sun, 20 Jan 2013 01:33:25 +0000 (17:33 -0800)]
filestore: disable extra committing queue allowance
The motivation here is if there is a problem draining the op queue
during a sync. For XFS and ext4, this isn't generally a problem: you
can continue to make writes while a syncfs(2) is in progress. There
are currently some possible implementation issues with btrfs, but we
have not demonstrated them recently.
Meanwhile, this can cause queue length spikes that screw up latency.
During a commit, we allow too much into the queue (say, recovery
operations). After the sync finishes, we have to drain it out before
we can queue new work (say, a higher priority client request). Having
a deep queue below the point where priorities order work limits the
value of the priority queue.