Sage Weil [Sat, 2 May 2015 00:22:57 +0000 (17:22 -0700)]
os/newstore: queue kv transactions in kv_sync_thread
It appears that db->submit_transaction() will block if there is a sync
commit that is in progress instead of simply queueing the new txn for
later. To work around this, submit these to the backend in the
kv_sync_thread prior to the synchronous submit_transaction_sync().
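A rough sketch of the queueing idea in this message, as a single-threaded model rather than the NewStore code. FakeDB, KVSyncQueue, queue() and sync_pass() are hypothetical names, and the "flush" payload is a placeholder; only submit_transaction() and submit_transaction_sync() are named in the message.

```cpp
#include <deque>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the kv backend; it only records the calls it gets.
struct FakeDB {
  std::vector<std::string> log;
  void submit_transaction(const std::string& t) { log.push_back("async:" + t); }
  void submit_transaction_sync(const std::string& t) { log.push_back("sync:" + t); }
};

// Callers only enqueue; the sync pass is the only code that talks to the backend.
class KVSyncQueue {
 public:
  explicit KVSyncQueue(FakeDB& db) : db_(db) {}

  // Called from op threads: never touches the backend, so it cannot block
  // behind a sync commit that is in progress.
  void queue(std::string txn) {
    std::lock_guard<std::mutex> l(lock_);
    pending_.push_back(std::move(txn));
  }

  // One pass of a (hypothetical) kv_sync_thread loop body: submit everything
  // queued so far, then issue the synchronous commit.
  void sync_pass() {
    std::deque<std::string> batch;
    {
      std::lock_guard<std::mutex> l(lock_);
      batch.swap(pending_);
    }
    for (const auto& t : batch) db_.submit_transaction(t);
    db_.submit_transaction_sync("flush");
  }

 private:
  FakeDB& db_;
  std::mutex lock_;
  std::deque<std::string> pending_;
};
```

Swapping the pending list out under the lock keeps the critical section short, so queue() is not held up while the backend calls run.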
Sage Weil [Wed, 29 Apr 2015 18:52:55 +0000 (11:52 -0700)]
os/newstore: avoid sync append for small ios
An append is expensive in terms of latency (write, fdatasync, kv commit),
while a wal write is just the kv commit and the write and fdatasync are
async. For small IOs, going through the wal may improve performance.
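The decision described above could look roughly like this. It is only an illustration: WritePath, choose_append_path() and the wal_threshold parameter are hypothetical and are not taken from the NewStore source.

```cpp
#include <cstdint>

// The two ways an append can be handled in this simplified model.
enum class WritePath { SyncAppend, Wal };

// wal_threshold is a hypothetical tunable: appends at or below it are routed
// through the wal (kv commit only, with the write and fdatasync done async);
// larger appends take the write + fdatasync + kv commit path.
WritePath choose_append_path(std::uint64_t length, std::uint64_t wal_threshold) {
  return length <= wal_threshold ? WritePath::Wal : WritePath::SyncAppend;
}
```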
Zhiqiang Wang [Wed, 29 Apr 2015 06:32:25 +0000 (14:32 +0800)]
os/NewStore: delay the read of all the overlays until wal applying
The read of all the overlays can be delayed until applying the wal. If
we are doing async wal apply, this can reduce write op latency by
eliminating unnecessary reads in the write code path.
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
os/newstore: fix deadlock when newstore_sync_transaction=true
There is a deadlock in NewStore when newstore_sync_transaction = true.
With sync_transaction set to true, the txc state machine goes all the way
from STATE_IO_DONE to STATE_FINISHING in the same thread while holding osr->qlock.
The deadlock occurs in _txc_finish and _osr_reap_done when they try to
lock osr->qlock again.
Since _txc_finish can be called with (in sync transaction mode) or without
(in async transaction mode) the qlock held, fix this by making the qlock
PTHREAD_MUTEX_RECURSIVE so that it can be acquired recursively.
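A minimal sketch of why a recursive lock resolves this, using std::recursive_mutex as the standard C++ analogue of a PTHREAD_MUTEX_RECURSIVE mutex. The Sequencer struct and its methods are hypothetical stand-ins, not the NewStore types.

```cpp
#include <mutex>

struct Sequencer {
  // With a non-recursive mutex, locking again on the same thread would
  // deadlock; a recursive mutex lets the owning thread re-acquire it.
  std::recursive_mutex qlock;
  int finished = 0;

  // May be reached with or without qlock already held by this thread.
  void txc_finish() {
    std::lock_guard<std::recursive_mutex> l(qlock);
    ++finished;
  }

  // Sync mode: the whole state machine runs in one thread under qlock.
  void run_sync() {
    std::lock_guard<std::recursive_mutex> l(qlock);
    txc_finish();  // second acquisition by the same thread succeeds
  }

  // Async mode: the finish step is reached without qlock held.
  void run_async() { txc_finish(); }
};
```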
os/Newstore: flush_commit return true on STATE_KV_DONE
There is a race condition here: if flush_commit() is called
after _txc_finish_kv and before the next state, the context
is pushed to on_commits but nothing will handle it, since
we have already passed _txc_finish_kv. The bug is easy to reproduce
by putting a sleep(5) after _txc_finish_kv and then running
ceph-osd -i 0 --mkfs.
Fix this by returning true directly when state >= STATE_KV_DONE (instead
of > as in the previous code). The data is already persisted in
STATE_KV_DONE, so this is safe.
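The comparison change can be illustrated with a reduced model. The state list here is shortened and its ordering is an assumption: STATE_PREPARE is a hypothetical placeholder, and the Txc struct is not the real TransContext.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Reduced, assumed ordering of states; the real state machine has more.
enum State { STATE_PREPARE, STATE_IO_DONE, STATE_KV_DONE, STATE_FINISHING };

struct Txc {
  State state = STATE_PREPARE;
  std::vector<std::function<void()>> on_commits;

  // Returns true if the txc is already committed, so the caller can proceed
  // without a callback. Otherwise the callback is queued and false returned.
  bool flush_commit(std::function<void()> c) {
    if (state >= STATE_KV_DONE)  // '>' here would queue a callback nobody runs
      return true;
    on_commits.push_back(std::move(c));
    return false;
  }
};
```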
Zhiqiang Wang [Tue, 28 Apr 2015 08:24:16 +0000 (16:24 +0800)]
os/NewStore: avoid dup the data of the overlays in the WAL
When writing all the overlays, there is no need to dup the data in the WAL.
Instead, we can reference the overlays in the WAL, and remove these
overlays after committing them to the fs. When replaying, we can get
the data from the referenced overlays. This way, we save a
write and a deletion in the db for each overlay's data.
Signed-off-by: Zhiqiang Wang <zhiqiang.wang@intel.com>
Sage Weil [Mon, 27 Apr 2015 21:42:55 +0000 (14:42 -0700)]
os/newstore: fix race in _txc_aio_submit
We cannot rely on the iterator pointers being valid after we submit the
aio because we are racing with the completion. Make our loop decision
before submitting and avoid dereferencing txc after that point.
Sage Weil [Fri, 24 Apr 2015 20:41:35 +0000 (13:41 -0700)]
os/newstore: fix _txc_aio_submit
The aios may complete before _txc_aio_submit completes. In fact, the aio
may complete, commit to the kv store, and then queue more wal aio's before
we finish the loop. Move aios to a separate list to ensure we only submit
them once and do not race with another CPU adjusting the list.
Sage Weil [Thu, 23 Apr 2015 21:51:51 +0000 (14:51 -0700)]
os/newstore: throttle over entire write lifecycle
Take a global throttle when we submit ops and release when they complete.
The first throttles cover the period from submit to commit, while the wal
ones also cover the async post-commit wal work. The configs are additive
since the wal ones cover both periods; this should make them reasonably
idiot-proof.
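One reading of the additive limits, as a non-blocking model. Throttle, WriteThrottles and the max_ops / wal_max_ops parameters are hypothetical names; a real throttle would block the caller rather than return false, and this sketch assumes every op takes both throttles.

```cpp
#include <cstdint>

// Simple counter with a ceiling; try_get() fails instead of blocking.
class Throttle {
 public:
  explicit Throttle(std::uint64_t max) : max_(max) {}
  bool try_get() {
    if (cur_ >= max_) return false;
    ++cur_;
    return true;
  }
  void put() { if (cur_ > 0) --cur_; }
  std::uint64_t current() const { return cur_; }

 private:
  std::uint64_t max_;
  std::uint64_t cur_ = 0;
};

struct WriteThrottles {
  Throttle ops;      // held from submit to commit
  Throttle wal_ops;  // held from submit until the post-commit wal work is done

  // The wal limit is additive because it spans both periods.
  WriteThrottles(std::uint64_t max_ops, std::uint64_t wal_max_ops)
      : ops(max_ops), wal_ops(max_ops + wal_max_ops) {}

  bool submit() {
    if (!ops.try_get()) return false;
    if (!wal_ops.try_get()) { ops.put(); return false; }
    return true;
  }
  void committed() { ops.put(); }
  void wal_done() { wal_ops.put(); }
};
```

With limits of 2 and 1, at most two ops can be between submit and commit, and at most three can be anywhere between submit and wal completion.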
os/Newstore: Check onode.omap_head in valid() and next()
The db iter is set to KeyValueDB::Iterator() if onode.omap_head is
not present. In that case, touching the db iter causes a segmentation
fault.
Avoid touching the db iter when onode.omap_head is invalid (equal to 0).
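A sketch of the guard, using a std::map as a stand-in for the kv store. DbIter and OmapIterator are hypothetical, and the -1 return from next() is an assumption made for this example only.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a kv iterator.
struct DbIter {
  std::map<std::string, std::string>::const_iterator it, end;
  bool valid() const { return it != end; }
  void next() { ++it; }
};

class OmapIterator {
 public:
  OmapIterator(std::uint64_t omap_head,
               const std::map<std::string, std::string>& db)
      : omap_head_(omap_head) {
    // No omap for this object: leave the inner iterator null.
    if (omap_head_ != 0)
      it_ = std::make_shared<DbIter>(DbIter{db.begin(), db.end()});
  }

  // Check omap_head_ first so the null inner iterator is never dereferenced.
  bool valid() const { return omap_head_ != 0 && it_->valid(); }

  int next() {
    if (omap_head_ == 0 || !it_->valid()) return -1;
    it_->next();
    return 0;
  }

 private:
  std::uint64_t omap_head_;
  std::shared_ptr<DbIter> it_;  // null when omap_head_ == 0
};
```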
Sage Weil [Fri, 10 Apr 2015 22:29:16 +0000 (15:29 -0700)]
os/newstore: do not call completions from kv thread
Reads may call wait_wal() holding user locks, and so we cannot block
progress on WAL completion/flushing by calling callbacks that may take
user locks.
Sage Weil [Tue, 7 Apr 2015 18:25:00 +0000 (11:25 -0700)]
os/newstore: keep smallish overlay extents in kv db
If we have a small overwrite, keep the extent in the key/value database.
Only write it back to the file/fragment later, and when we do, write them
all at once.
Sage Weil [Tue, 18 Aug 2015 14:09:20 +0000 (10:09 -0400)]
newstore: initial version
This includes a bunch of new ceph_test_objectstore tests, and a ton of fixes
to existing tests so that objects actually live inside the collections they
are written to.
David Zafman [Tue, 1 Sep 2015 17:25:08 +0000 (10:25 -0700)]
Merge pull request #5173 from ceph/wip-12000-12200
Fast read for erasure coding pool and erasure code error handling
Error handling
Reviewed-by: Loic Dachary <ldachary@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
Fast Read
Reviewed-by: David Zafman <dzafman@redhat.com>
Reviewed-by: Samuel Just <sjust@redhat.com>
Kefu Chai [Fri, 28 Aug 2015 06:27:53 +0000 (14:27 +0800)]
osd: should use ec_pool() when checking for an ecpool
We were using pool.info.require_rollback() in do_osd_ops() when
handling OP_SPARSE_READ to tell whether a pool is an ecpool. We should
use pool.info.ec_pool() instead.