Dan Mick [Thu, 23 Mar 2017 23:36:53 +0000 (16:36 -0700)]
debian/rules: invoke cmake with -DBOOST_J
Allow boost build during toplevel cmake from Debian package build
to benefit from multiple processors. Should speed up the build a lot
on many-proc machines (say, arm64). Use argument passed to
debhelper.
Kefu Chai [Wed, 22 Mar 2017 03:48:40 +0000 (11:48 +0800)]
vstart.sh: remove start_*
so there are only two ways to override the number of daemons to start
- using the env var CEPH_NUM_{MON|OSD|MGR|MDS} or {MON|OSD|MGR|MDS}
- command line options: --{mon,osd,mds}_num
To prevent a daemon from running, set the corresponding env var to 0.
Piotr Dałek [Mon, 20 Mar 2017 12:51:25 +0000 (13:51 +0100)]
TrackedOp: allow dumping historic ops sorted by duration
Currently dump_historic_ops dumps ops sorted by their initiation time,
which may have no relation to how long each op took, and sorting the output
of that command by op duration is neither fast nor convenient.
New asok command ("dump_historic_ops_by_duration") outputs the same
op list, but ordered by their duration time (longest first).
Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
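As a rough illustration of the two orderings, here is a minimal Python sketch. The record fields (`description`, `initiated_at`, `duration`) and both helper names are hypothetical; they do not reflect the actual TrackedOp structures or the asok output format.

```python
# Hypothetical op records; the field names are assumptions for illustration.
ops = [
    {"description": "op-a", "initiated_at": 10.0, "duration": 0.25},
    {"description": "op-b", "initiated_at": 11.0, "duration": 1.50},
    {"description": "op-c", "initiated_at": 12.0, "duration": 0.75},
]


def by_initiation(ops):
    """Order ops by initiation time, oldest first."""
    return sorted(ops, key=lambda op: op["initiated_at"])


def by_duration(ops):
    """Order ops by duration, longest first."""
    return sorted(ops, key=lambda op: op["duration"], reverse=True)
```

With the sample data, the slowest op (`op-b`) sits in the middle of the initiation-time ordering but comes first in the duration ordering.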
Erwan Velu [Wed, 22 Mar 2017 09:11:44 +0000 (10:11 +0100)]
ceph-disk: Reporting /sys directory in get_partition_dev()
When get_partition_dev() fails, it reports the following message:
ceph_disk.main.Error: Error: partition 2 for /dev/sdb does not appear to exist
The code searches for a directory inside /sys/block/get_dev_name(os.path.realpath(dev)).
The issue is that the error message doesn't report that path on failure, even though the path may be part of the problem.
This patch reports where the code was looking when it tried to determine whether the partition was available.
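The idea can be sketched in Python as follows. This is not the ceph-disk code: `find_partition_dir`, its parameters, the matching rule, and the local `Error` class are stand-ins invented for illustration. The only point is that the searched directory appears in the error text.

```python
import os


class Error(Exception):
    """Stand-in for ceph_disk.main.Error (assumption)."""


def find_partition_dir(sys_block_dir, dev_name, pnum):
    """Hypothetical helper: find the entry for partition pnum under
    sys_block_dir/dev_name, naming the searched directory on failure."""
    sys_entry = os.path.join(sys_block_dir, dev_name)
    candidates = []
    if os.path.isdir(sys_entry):
        candidates = [f for f in sorted(os.listdir(sys_entry))
                      if f.startswith(dev_name) and f.endswith(str(pnum))]
    if not candidates:
        # Include the path that was searched so the user can inspect it.
        raise Error("partition %d for %s does not appear to exist in %s"
                    % (pnum, dev_name, sys_entry))
    return os.path.join(sys_entry, candidates[0])
```

Calling `find_partition_dir("/sys/block", "sdb", 2)` on a machine without that partition would raise an error that names the directory that was searched.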
Sage Weil [Sat, 18 Mar 2017 17:51:08 +0000 (13:51 -0400)]
os/bluestore: handle zombie OpSequencers
It's possible for the Sequencer to go away while the OpSequencer still has
txcs in flight. We were handling the case where the osr was on the
deferred_queue, but it may also be off the deferred_queue and still waiting for the
commit to happen, and we still need to wait for that.
Fix this by introducing a 'zombie' state for the osr, in which we keep the
osr in the osr_set.
Clean up the OpSequencer methods and a few other method names.
Sage Weil [Fri, 17 Mar 2017 14:13:22 +0000 (10:13 -0400)]
os/bluestore: move cached items around on collection split
We've been avoiding doing this for a while and it has finally caught up
with us: the SharedBlob may outlive the split due to deferred IO, and
a read on the child collection may load a competing Blob and SharedBlob
and read from the on-disk blocks that haven't been written yet.
Fix by preserving the one-SharedBlob-instance invariant by moving cache
items to the new Collection and cache shard like we should have from the
beginning.
Sage Weil [Thu, 16 Mar 2017 20:33:53 +0000 (16:33 -0400)]
unittest_bluestore_types: fix Collection using tests
We can't use a bare Collection since we get/put refs, the last put will
delete it, and the dtor asserts nref == 0 (no faking a ref and deliberately
leaking!).
Sage Weil [Wed, 15 Mar 2017 19:01:52 +0000 (15:01 -0400)]
os/bluestore: take Collection ref from SharedBlob
These can survive as long as the txc, which can be longer than the
Collection. Make sure we have a valid ref as both finish_write and
~SharedBlob use coll for the SharedBlobSet (and coll->store->cct for
debug).
Sage Weil [Fri, 10 Mar 2017 15:27:52 +0000 (10:27 -0500)]
ceph_test_objectstore: fix Synthetic to never modify bufferlists
We were modifying bufferlists in place, and kludging around it by making
full copies elsewhere. Instead, never modify a buffer.
This fixes issues where the buffer we submit to ObjectStore ends up in
the cache and we modify in place later, corrupting the implementation's
copy. (This was affecting BlueStore.)
Rearrange the data methods to be next to each other and clean them up a
bit too.
Sage Weil [Thu, 9 Mar 2017 22:28:58 +0000 (17:28 -0500)]
os/bluestore: avoid extra dev flush on single device when all io is deferred
If we have no non-deferred IO to flush, and we are running bluefs on a
single shared device, then we can rely on the bluefs flush to make our
current batch of deferred ios stable.
Separate deferred into a "done" and "stable" list. If we do sync, put
everything from "done" onto "stable". Otherwise, after we do our kv
commit via bluefs, move "done" to "stable" then.
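A toy Python model of the "done" and "stable" lists described above. The class and method names are hypothetical, and the real code runs inside the kv sync thread with locking that is omitted here. `need_device_flush` is assumed to be true whenever there is non-deferred IO to flush or bluefs is not on a single shared device.

```python
class DeferredTracker:
    """Toy model of the "done" and "stable" lists (not BlueStore code)."""

    def __init__(self):
        self.done = []    # deferred IOs whose aio completed, not yet known durable
        self.stable = []  # deferred IOs known to be durable on the device

    def io_finished(self, txc):
        self.done.append(txc)

    def kv_sync_cycle(self, need_device_flush, flush_device, kv_commit):
        # Snapshot first: anything that finishes later waits for the next cycle.
        batch, self.done = self.done, []
        if need_device_flush:
            flush_device()             # explicit flush makes the batch stable
            self.stable.extend(batch)
            kv_commit()
        else:
            kv_commit()                # the bluefs flush covers the shared device
            self.stable.extend(batch)
```

In the second branch no explicit device flush is issued, which is the saving the commit describes.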
Sage Weil [Thu, 9 Mar 2017 19:17:47 +0000 (14:17 -0500)]
os/bluestore: batch up to bluestore_deferred_batch_ops before submitting
Allow several deferred writes to accumulate before we submit them. In
general we have no time pressure, and on HDD (and perhaps sometimes SSD)
it is beneficial to accumulate and batch these so that they result in
fewer seeks. On HDD, this is particularly true of seeks away from the
journal. And on sequential workloads this can avoid seeks. It may even
allow the block layer or SSD firmware to merge IOs and perform fewer
writes.
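The accumulate-then-submit pattern can be sketched in Python as below. `DeferredBatcher` and its methods are hypothetical names. Treating bluestore_deferred_batch_ops as a simple count threshold is an assumption based on the option name, not a description of the actual implementation.

```python
class DeferredBatcher:
    """Accumulate deferred writes and submit them in batches (illustrative only)."""

    def __init__(self, batch_ops, submit):
        self.batch_ops = batch_ops  # assumed meaning of bluestore_deferred_batch_ops
        self.submit = submit        # callback that receives a list of writes
        self.pending = []

    def queue(self, write):
        self.pending.append(write)
        if len(self.pending) >= self.batch_ops:
            self.flush()

    def flush(self):
        # Submit whatever has accumulated, e.g. when draining on shutdown.
        if self.pending:
            batch, self.pending = self.pending, []
            self.submit(batch)
```

Submitting a whole batch at once gives the lower layers a chance to order or merge the IOs, which is where the reduction in seeks would come from.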
Sage Weil [Thu, 9 Mar 2017 16:53:06 +0000 (11:53 -0500)]
os/bluestore: avoid waking up kv thread on deferred write completion
In a simple HDD workload with queue depth of 1, we halve our throughput
because the kv thread does a full commit twice per IO: once for the
initial commit, and then again to clean up the deferred write record. The
second wakeup is unnecessary; we can clean it up on the next commit.
We do need to do this wakeup in a few cases, though, when draining the
OpSequencers: (1) on replay during startup, and (2) on shutdown in
_osr_drain_all().
Send everything through _osr_drain_all() for simplicity.
This doubles HDD qd=1 IOPS from ~50 to ~100 on my 7200 rpm test device
(rados bench 30 write -b 4096 -t 1).
Sage Weil [Thu, 9 Mar 2017 02:53:22 +0000 (21:53 -0500)]
os/bluestore: restructure deferred write queue
First, eliminate the work queue--it's useless. We are dispatching aio and
should not block. And if a single thread isn't sufficient to do it, it
probably means we should be parallelizing kv_sync_thread too (which is our
only caller that matters).
Repurpose the old osr-list -> txc-list-per-osr queue structure to manage
the queuing. For any given osr, dispatch one batch of aios at a time,
taking care to collapse any overwrites so that the latest write wins.
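"Latest write wins" can be illustrated with a small Python sketch. It works at byte granularity for clarity, which is an assumption made for the example and not how BlueStore tracks extents. `collapse_overwrites` is a hypothetical name.

```python
def collapse_overwrites(writes):
    """Collapse (offset, data) writes, given in submission order, into
    non-overlapping extents in which the most recent write to each byte wins."""
    latest = {}
    for offset, data in writes:
        for i, byte in enumerate(data):
            latest[offset + i] = byte  # later writes replace earlier ones

    extents = []
    for pos in sorted(latest):
        if extents and extents[-1][0] + len(extents[-1][1]) == pos:
            extents[-1][1].append(latest[pos])  # extend the contiguous extent
        else:
            extents.append([pos, bytearray([latest[pos]])])
    return [(off, bytes(buf)) for off, buf in extents]
```

For example, the writes `(0, b"aaaa")`, `(2, b"bb")`, `(8, b"cc")`, `(1, b"Z")` collapse into two extents, so only two IOs would need to be dispatched for that batch.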
Sage Weil [Sat, 11 Mar 2017 19:30:53 +0000 (14:30 -0500)]
os/bluestore: keep onode refs for lifetime of obc
This ensures that we don't trim an onode from the cache while it has a
txc that is still in flight, which in turn ensures that if we try to read
the object, we will have any writing buffers available.
Sage Weil [Tue, 14 Mar 2017 02:49:37 +0000 (22:49 -0400)]
os/bluestore: add OpSequencer::drain()
Currently this is the same as flush, but more precisely it is an internal
method that means all txc's must complete. Update _wal_apply() to use it
instead of flush(), which is part of the public Sequencer interface.
Sage Weil [Wed, 8 Mar 2017 19:57:52 +0000 (14:57 -0500)]
os/bluestore: release deferred throttle on io finish, before cleanup
The throttle is really about limiting deferred IO; we do not need to
actually remove the deferred record from the kv db before queueing more.
(In fact, the txc that queues more will do the cleanup.)