git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
8 years agoceph.spec.in: derive _smp_ncpus and use it for -DBOOST_J 14114/head
Dan Mick [Fri, 24 Mar 2017 02:35:08 +0000 (19:35 -0700)]
ceph.spec.in: derive _smp_ncpus and use it for -DBOOST_J

Signed-off-by: Dan Mick <dan.mick@redhat.com>
8 years agoceph.spec.in: move lowmem_build setting of _smp_mflags
Dan Mick [Fri, 24 Mar 2017 02:34:28 +0000 (19:34 -0700)]
ceph.spec.in: move lowmem_build setting of _smp_mflags

Signed-off-by: Dan Mick <dan.mick@redhat.com>
8 years agodebian/rules: invoke cmake with -DBOOST_J
Dan Mick [Thu, 23 Mar 2017 23:36:53 +0000 (16:36 -0700)]
debian/rules: invoke cmake with -DBOOST_J

Allow boost build during toplevel cmake from Debian package build
to benefit from multiple processors.  Should speed build a lot
on many-proc machines (say, arm64).  Use argument passed to
debhelper.

Signed-off-by: Dan Mick <dan.mick@redhat.com>
8 years agoMerge pull request #14110 from dachary/wip-crush-cleanup
Loic Dachary [Thu, 23 Mar 2017 20:48:00 +0000 (21:48 +0100)]
Merge pull request #14110 from dachary/wip-crush-cleanup

crush: builder: clean the arguments of crush_reweight* methods

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Loic Dachary <ldachary@redhat.com>
8 years agocrush: builder: clean the arguments of crush_reweight* methods 14110/head
Sahid Orentino Ferdjaoui [Mon, 13 Mar 2017 16:36:16 +0000 (12:36 -0400)]
crush: builder: clean the arguments of crush_reweight* methods

This commit is just a cleanup to make the arguments of the
crush_reweight* methods consistent with one another.

Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@redhat.com>
8 years agoMerge pull request #14050 from ovh/bp-dump-ops-by-duration
Yuri Weinstein [Thu, 23 Mar 2017 15:47:55 +0000 (08:47 -0700)]
Merge pull request #14050 from ovh/bp-dump-ops-by-duration

common/TrackedOp: allow dumping historic ops sorted by duration

Reviewed-by: Sage Weil <sage@redhat.com>
8 years agoMerge pull request #14060 from LiumxNL/wip-170321
Yuri Weinstein [Thu, 23 Mar 2017 15:46:36 +0000 (08:46 -0700)]
Merge pull request #14060 from LiumxNL/wip-170321

osd: combine unstable stats with info.stats when publish stats to osd

Reviewed-by: Sage Weil <sage@redhat.com>
8 years agoMerge pull request #13293 from Liuchang0812/cleanup-coverity
Yuri Weinstein [Thu, 23 Mar 2017 15:45:58 +0000 (08:45 -0700)]
Merge pull request #13293 from Liuchang0812/cleanup-coverity

test, osd: fix some coverity issues

Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years agoMerge pull request #14014 from Liuchang0812/wip-fix-seg-fault
Casey Bodley [Thu, 23 Mar 2017 13:54:47 +0000 (09:54 -0400)]
Merge pull request #14014 from Liuchang0812/wip-fix-seg-fault

rgw: fix memory leak in RGWGetObjLayout

Reviewed-by: Jos Collin <jcollin@redhat.com>
Reviewed-by: Casey Bodley <cbodley@redhat.com>
8 years agoMerge pull request #14094 from optimistyzy/322
Haomai Wang [Thu, 23 Mar 2017 11:23:34 +0000 (19:23 +0800)]
Merge pull request #14094 from optimistyzy/322

bluestore, NVMeDevice: use task' own lock for (random) read

Reviewed-by: Haomai Wang <haomai@xsky.com>
8 years agoMerge pull request #14004 from liewegas/wip-osd-full-failsafe
Kefu Chai [Thu, 23 Mar 2017 08:09:34 +0000 (16:09 +0800)]
Merge pull request #14004 from liewegas/wip-osd-full-failsafe

osd: fall back to failsafe threshold if osdmap doesn't set [near]full

Reviewed-by: David Zafman <dzafman@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years agoMerge pull request #13903 from wjwithagen/wip-wjw-run-classes-sed
Kefu Chai [Thu, 23 Mar 2017 08:08:22 +0000 (16:08 +0800)]
Merge pull request #13903 from wjwithagen/wip-wjw-run-classes-sed

test: sed on FreeBSD requires "-i extension", so use gsed

Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years agoMerge pull request #9940 from aclamk/common-recursive-mutex-fix
Kefu Chai [Thu, 23 Mar 2017 08:04:52 +0000 (16:04 +0800)]
Merge pull request #9940 from aclamk/common-recursive-mutex-fix

common: fix lockdep vs recursive mutexes

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years agobluestore, NVMeDevice: use task' own lock for (random) read 14094/head
Ziye Yang [Wed, 22 Mar 2017 03:41:00 +0000 (11:41 +0800)]
bluestore, NVMeDevice: use task' own lock for (random) read

The reason is that ioc may be reaped in the _aio_thread function
by the following statements:
for (auto &&it : registered_devices)
          it->reap_ioc();

So if we still use ioc's lock for (random) read, it will cause
a core dump.

Signed-off-by: optimistyzy <optimistyzy@gmail.com>
8 years agoMerge pull request #14080 from ceph/evelu-ceph-disk
Loic Dachary [Wed, 22 Mar 2017 18:43:37 +0000 (19:43 +0100)]
Merge pull request #14080 from ceph/evelu-ceph-disk

ceph-disk: Reporting /sys directory in get_partition_dev()

Reviewed-by: Loic Dachary <ldachary@redhat.com>
8 years agoMerge pull request #13942 from xiexingguo/wip-cleanup-proc-repinfo
Kefu Chai [Wed, 22 Mar 2017 15:57:13 +0000 (23:57 +0800)]
Merge pull request #13942 from xiexingguo/wip-cleanup-proc-repinfo

osd/PG: conditionally retry on receiving pg-notify when Primary is Incomplete

Reviewed-by: Sage Weil <sage@redhat.com>
8 years agoMerge pull request #14061 from tchaikov/wip-19312
Kefu Chai [Wed, 22 Mar 2017 15:56:27 +0000 (23:56 +0800)]
Merge pull request #14061 from tchaikov/wip-19312

tests: ceph_test_rados_api_watch_notify: test timeout using rados_wat…

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
8 years agoMerge pull request #12449 from cbodley/wip-rgw-test-multi-vers-acl
Casey Bodley [Wed, 22 Mar 2017 15:46:33 +0000 (11:46 -0400)]
Merge pull request #12449 from cbodley/wip-rgw-test-multi-vers-acl

test/rgw: add bucket acl and versioning tests to test_multi.py

Reviewed-by: Orit Wasserman <owasserm@redhat.com>
8 years agoMerge pull request #14059 from vumrao/wip-vumrao-19318
Kefu Chai [Wed, 22 Mar 2017 14:43:41 +0000 (22:43 +0800)]
Merge pull request #14059 from vumrao/wip-vumrao-19318

common/config_opts.h: Remove deprecated osd_compact_leveldb_on_mount option

Reviewed-by: Jos Collin <jcollin@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years agoMerge pull request #14076 from liewegas/wip-bluestore-min-alloc-size
Mark Nelson [Wed, 22 Mar 2017 14:12:41 +0000 (09:12 -0500)]
Merge pull request #14076 from liewegas/wip-bluestore-min-alloc-size

os/bluestore: default 16KB min_alloc_size on ssd

8 years agoMerge pull request #14068 from optimistyzy/321_new
Haomai Wang [Wed, 22 Mar 2017 13:16:48 +0000 (21:16 +0800)]
Merge pull request #14068 from optimistyzy/321_new

Bluestore, NVMEDevice: add the spdk core mask check

Reviewed-by: Haomai Wang <haomai@xsky.com>
8 years agoTrackedOp: allow dumping historic ops sorted by duration 14050/head
Piotr Dałek [Mon, 20 Mar 2017 12:51:25 +0000 (13:51 +0100)]
TrackedOp: allow dumping historic ops sorted by duration

Currently dump_historic_ops dumps ops sorted by their initiation time,
which may not have any relation to how long they took, and sorting the
output of that command by op duration is neither fast nor convenient.
A new asok command ("dump_historic_ops_by_duration") outputs the same
op list, but ordered by duration (longest first).

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
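The difference between the two orderings described in the TrackedOp commit above can be sketched as follows. This is a minimal illustration, not Ceph's TrackedOp code: the HistoricOp struct, its fields, and both function names are hypothetical, and the oldest-first direction for the initiation-time ordering is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a tracked op (not Ceph's TrackedOp).
struct HistoricOp {
  std::string description;
  double initiated_at;  // seconds since an arbitrary epoch
  double duration;      // seconds the op took to complete
};

// Ordering by initiation time; oldest-first is an assumption of this sketch.
std::vector<HistoricOp> by_initiation(std::vector<HistoricOp> ops) {
  std::stable_sort(ops.begin(), ops.end(),
                   [](const HistoricOp& a, const HistoricOp& b) {
                     return a.initiated_at < b.initiated_at;
                   });
  return ops;
}

// Ordering used by the new command as described: longest duration first.
std::vector<HistoricOp> by_duration(std::vector<HistoricOp> ops) {
  std::stable_sort(ops.begin(), ops.end(),
                   [](const HistoricOp& a, const HistoricOp& b) {
                     return a.duration > b.duration;
                   });
  return ops;
}
```

Both helpers take the list by value and return a sorted copy, so the same op list can be shown in either order without disturbing the other.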
8 years agoBluestore, NVMEDevice: add the spdk core mask check 14068/head
optimistyzy [Tue, 21 Mar 2017 11:00:15 +0000 (19:00 +0800)]
Bluestore, NVMEDevice: add the spdk core mask check

This patch adds the spdk core mask check and also
sets the master core for starting DPDK.

Signed-off-by: optimistyzy <optimistyzy@gmail.com>
8 years agorgw/rgw_op: fix memory leak in RGWGetObjLayout 14014/head
liuchang0812 [Wed, 22 Mar 2017 09:27:20 +0000 (17:27 +0800)]
rgw/rgw_op: fix memory leak in RGWGetObjLayout

Signed-off-by: liuchang0812 <liuchang0812@gmail.com>
8 years agoceph-disk: Reporting /sys directory in get_partition_dev() 14080/head
Erwan Velu [Wed, 22 Mar 2017 09:11:44 +0000 (10:11 +0100)]
ceph-disk: Reporting /sys directory in get_partition_dev()

When get_partition_dev() fails, it reports the following message:
    ceph_disk.main.Error: Error: partition 2 for /dev/sdb does not appear to exist
The code searches for a directory inside /sys/block/get_dev_name(os.path.realpath(dev)).

The issue here is that the error message doesn't report that path when failing, even though the path may be involved in the failure.

This patch reports where the code was looking when trying to determine whether the partition was available.

Signed-off-by: Erwan Velu <erwan@redhat.com>
8 years agoos/bluestore: default 16KB min_alloc_size on ssd 14076/head
Sage Weil [Wed, 22 Mar 2017 02:27:23 +0000 (21:27 -0500)]
os/bluestore: default 16KB min_alloc_size on ssd

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoMerge pull request #13963 from cbodley/wip-18725
Orit Wasserman [Tue, 21 Mar 2017 21:44:22 +0000 (23:44 +0200)]
Merge pull request #13963 from cbodley/wip-18725

rgw-admin: remove deprecated regionmap commands

Reviewed-by: Orit Wasserman <owasserm@redhat.com>
8 years agoMerge pull request #13888 from liewegas/wip-bluestore-dw
Sage Weil [Tue, 21 Mar 2017 20:05:56 +0000 (15:05 -0500)]
Merge pull request #13888 from liewegas/wip-bluestore-dw

os/bluestore: fix deferred writes; improve flush

Reviewed-by: Igor Fedotov <ifedotov@mirantis.com>
8 years agoMerge pull request #13902 from Wilhelmshaven/rm_redundant_code
Casey Bodley [Tue, 21 Mar 2017 19:43:48 +0000 (15:43 -0400)]
Merge pull request #13902 from Wilhelmshaven/rm_redundant_code

rgw: remove redundant codes in rgw_cache.h

Reviewed-by: Casey Bodley <cbodley@redhat.com>
8 years agoos/bluestore: handle zombie OpSequencers 13888/head
Sage Weil [Sat, 18 Mar 2017 17:51:08 +0000 (13:51 -0400)]
os/bluestore: handle zombie OpSequencers

It's possible for the Sequencer to go away while the OpSequencer still has
txcs in flight.  We were handling the case where the osr was on the
deferred_queue, but it may be off the deferred_queue but waiting for the
commit to happen, and we still need to wait for that.

Fix this by introducing a 'zombie' state for the osr, in which we keep the
osr in the osr_set.

Clean up the OpSequencer methods and a few other method names.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: clean up flush_all()
Sage Weil [Fri, 17 Mar 2017 21:52:56 +0000 (17:52 -0400)]
os/bluestore: clean up flush_all()

Add assertions if we fail to flush everything.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: move cached items around on collection split
Sage Weil [Fri, 17 Mar 2017 14:13:22 +0000 (10:13 -0400)]
os/bluestore: move cached items around on collection split

We've been avoiding doing this for a while and it has finally caught up
with us: the SharedBlob may outlive the split due to deferred IO, and
a read on the child collection may load a competing Blob and SharedBlob
and read from the on-disk blocks that haven't been written yet.

Fix by preserving the one-SharedBlob-instance invariant by moving cache
items to the new Collection and cache shard like we should have from the
beginning.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: simplify flush() wake-up condition
Sage Weil [Fri, 17 Mar 2017 17:54:20 +0000 (13:54 -0400)]
os/bluestore: simplify flush() wake-up condition

Clearer, and fewer wakeups.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoceph_test_objectstore: set bluestore cache shards to 5
Sage Weil [Fri, 17 Mar 2017 14:12:02 +0000 (10:12 -0400)]
ceph_test_objectstore: set bluestore cache shards to 5

Better test coverage!

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agounittest_bluestore_types: fix Collection using tests
Sage Weil [Thu, 16 Mar 2017 20:33:53 +0000 (16:33 -0400)]
unittest_bluestore_types: fix Collection using tests

We can't use a bare Collection since we get/put refs, the last put will
delete it, and the dtor asserts nref == 0 (no faking a ref and deliberately
leaking!).

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore/KernelDevice: drop unused flush_lock
Sage Weil [Thu, 16 Mar 2017 16:24:51 +0000 (12:24 -0400)]
os/bluestore/KernelDevice: drop unused flush_lock

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: better debugging around collections
Sage Weil [Thu, 16 Mar 2017 16:19:30 +0000 (12:19 -0400)]
os/bluestore: better debugging around collections

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: nicer Onode dout prefix
Sage Weil [Thu, 16 Mar 2017 15:30:59 +0000 (11:30 -0400)]
os/bluestore: nicer Onode dout prefix

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: flush_cache on umount, fsck finish, etc.
Sage Weil [Thu, 16 Mar 2017 15:30:37 +0000 (11:30 -0400)]
os/bluestore: flush_cache on umount, fsck finish, etc.

Otherwise cache items survive beyond umount into the next mount cycle!

Also, ensure that we flush_cache *before* clearing coll_map, as some cache
items have references back to the Collection.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: take Collection ref from SharedBlob
Sage Weil [Wed, 15 Mar 2017 19:01:52 +0000 (15:01 -0400)]
os/bluestore: take Collection ref from SharedBlob

These can survive as long as the txc, which can be longer than the
Collection.  Make sure we have a valid ref as both finish_write and
~SharedBlob use coll for the SharedBlobSet (and coll->store->cct for
debug).

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: fix perfcounters for deferred io
Sage Weil [Tue, 14 Mar 2017 20:47:48 +0000 (16:47 -0400)]
os/bluestore: fix perfcounters for deferred io

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: remove dead _do_deferred_op code
Sage Weil [Tue, 14 Mar 2017 20:47:40 +0000 (16:47 -0400)]
os/bluestore: remove dead _do_deferred_op code

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: make throttles tunable online
Sage Weil [Tue, 14 Mar 2017 18:17:20 +0000 (14:17 -0400)]
os/bluestore: make throttles tunable online

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: prevent throttle deadlock due to deferred writes
Sage Weil [Mon, 13 Mar 2017 11:43:57 +0000 (07:43 -0400)]
os/bluestore: prevent throttle deadlock due to deferred writes

Kick off deferred IOs if we pass the throttle midpoint or if we would
block during submission.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoceph_test_objectstore: fix Synthetic to never modify bufferlists
Sage Weil [Fri, 10 Mar 2017 15:27:52 +0000 (10:27 -0500)]
ceph_test_objectstore: fix Synthetic to never modify bufferlists

We were modifying bufferlists in place, and kludging around it by making
full copies elsewhere.  Instead, never modify a buffer.

This fixes issues where the buffer we submit to ObjectStore ends up in
the cache and we modify in place later, corrupting the implementation's
copy.  (This was affecting BlueStore.)

Rearrange the data methods to be next to each other and clean them up a
bit too.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: drop obsolete comment
Sage Weil [Fri, 10 Mar 2017 15:20:22 +0000 (10:20 -0500)]
os/bluestore: drop obsolete comment

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: avoid extra dev flush on single device when all io is deferred
Sage Weil [Thu, 9 Mar 2017 22:28:58 +0000 (17:28 -0500)]
os/bluestore: avoid extra dev flush on single device when all io is deferred

If we have no non-deferred IO to flush, and we are running bluefs on a
single shared device, then we can rely on the bluefs flush to make our
current batch of deferred ios stable.

Separate deferred into a "done" and "stable" list.  If we do sync, put
everything from "done" onto "stable".  Otherwise, after we do our kv
commit via bluefs, move "done" to "stable" then.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: debug alloc release
Sage Weil [Tue, 14 Mar 2017 14:33:23 +0000 (10:33 -0400)]
os/bluestore: debug alloc release

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: flush old/discarded OpSequencers too
Sage Weil [Tue, 14 Mar 2017 14:33:17 +0000 (10:33 -0400)]
os/bluestore: flush old/discarded OpSequencers too

When the Sequencer goes away it gets deregistered.  If there are still
deferred IOs in flight, we need to wait for those too.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: batch up to bluestore_deferred_batch_ops before submitting
Sage Weil [Thu, 9 Mar 2017 19:17:47 +0000 (14:17 -0500)]
os/bluestore: batch up to bluestore_deferred_batch_ops before submitting

Allow several deferred writes to accumulate before we submit them.  In
general we have no time pressure, and on HDD (and perhaps sometimes SSD)
it is beneficial to accumulate and batch these so that they result in
fewer seeks.  On HDD, this is particularly true of seeks away from the
journal.  And on sequential workloads this can avoid seeks.  It may even
allow the block layer or SSD firmware to merge IOs and perform fewer
writes.

Signed-off-by: Sage Weil <sage@redhat.com>
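The batching idea in the commit above can be sketched as a small accumulator that holds deferred writes until a threshold is reached. This is a hypothetical illustration, not BlueStore's implementation: DeferredWrite, DeferredBatcher, and the constructor parameter (standing in for the bluestore_deferred_batch_ops setting) are invented names, and sorting a batch by offset is this sketch's own choice to show how batching can reduce seeks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical deferred write descriptor: a byte range on the device.
struct DeferredWrite {
  uint64_t offset;
  uint64_t length;
};

class DeferredBatcher {
 public:
  // batch_ops plays the role of a batch-size threshold (assumed semantics).
  explicit DeferredBatcher(std::size_t batch_ops) : batch_ops_(batch_ops) {}

  // Queue one write. Returns an empty vector until the threshold is reached,
  // then returns the whole accumulated batch.
  std::vector<DeferredWrite> queue(const DeferredWrite& w) {
    pending_.push_back(w);
    if (pending_.size() < batch_ops_) return {};
    return flush();
  }

  // Hand back everything accumulated so far, ordered by offset so that a
  // rotating disk would visit the ranges in one sweep (sketch's assumption).
  std::vector<DeferredWrite> flush() {
    std::vector<DeferredWrite> batch;
    batch.swap(pending_);
    std::sort(batch.begin(), batch.end(),
              [](const DeferredWrite& a, const DeferredWrite& b) {
                return a.offset < b.offset;
              });
    return batch;
  }

  std::size_t pending() const { return pending_.size(); }

 private:
  std::size_t batch_ops_;
  std::vector<DeferredWrite> pending_;
};
```

A caller would submit the returned batch as one group of aios whenever queue() returns a non-empty vector, and call flush() explicitly when draining.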
8 years agoos/bluestore: only discard deallocated regions of a blob if !shared
Sage Weil [Mon, 13 Mar 2017 11:32:12 +0000 (07:32 -0400)]
os/bluestore: only discard deallocated regions of a blob if !shared

If a blob is shared, we can't discard deallocated regions: there may
be deferred buffers in flight and we might get a read via the clone.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: avoid waking up kv thread on deferred write completion
Sage Weil [Thu, 9 Mar 2017 16:53:06 +0000 (11:53 -0500)]
os/bluestore: avoid waking up kv thread on deferred write completion

In a simple HDD workload with queue depth of 1, we halve our throughput
because the kv thread does a full commit twice per IO: once for the
initial commit, and then again to clean up the deferred write record. The
second wakeup is unnecessary; we can clean it up on the next commit.

We do need to do this wakeup in a few cases, though, when draining the
OpSequencers: (1) on replay during startup, and (2) on shutdown in
_osr_drain_all().

Send everything through _osr_drain_all() for simplicity.

This doubles HDD qd=1 IOPS from ~50 to ~100 on my 7200 rpm test device
(rados bench 30 write -b 4096 -t 1).

Signed-off-by: Sage Weil <sage@redhat.com>
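The halving described in the commit above follows from simple arithmetic at queue depth 1: if each client write waits for a fixed number of full kv commits, throughput is the inverse of that total wait. The sketch below is only a back-of-the-envelope model; the function name is hypothetical and the 10 ms commit latency is an assumption chosen because it reproduces the quoted ~50 and ~100 IOPS, not a measured value from the commit.

```cpp
#include <cassert>
#include <cmath>

// Rough qd=1 model: every IO waits for `commits_per_io` serialized kv
// commits, each taking `commit_latency_s` seconds (assumed constant).
double qd1_iops(int commits_per_io, double commit_latency_s) {
  return 1.0 / (commits_per_io * commit_latency_s);
}
```

With an assumed 10 ms per commit, qd1_iops(2, 0.010) is about 50 and qd1_iops(1, 0.010) is about 100, which matches the shape of the improvement reported for the 7200 rpm test device.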
8 years agoos/bluestore: move many initializations into header
Sage Weil [Thu, 9 Mar 2017 15:34:50 +0000 (10:34 -0500)]
os/bluestore: move many initializations into header

This is less fragile, especially with 2 constructors.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: restructure deferred write queue
Sage Weil [Thu, 9 Mar 2017 02:53:22 +0000 (21:53 -0500)]
os/bluestore: restructure deferred write queue

First, eliminate the work queue--it's useless.  We are dispatching aio and
should not block.  And if a single thread isn't sufficient to do it, it
probably means we should be parallelizing kv_sync_thread too (which is our
only caller that matters).

Repurpose the old osr-list -> txc-list-per-osr queue structure to manage
the queuing.  For any given osr, dispatch one batch of aios at a time,
taking care to collapse any overwrites so that the latest write wins.

Signed-off-by: Sage Weil <sage@redhat.com>
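The "latest write wins" collapsing mentioned in the commit above can be illustrated with a byte-level replay. This is a deliberately simple sketch under stated assumptions, not the BlueStore data structure: Write and collapse() are hypothetical names, and a real implementation would work on extents and bufferlists rather than individual bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical pending write: payload `data` destined for byte `offset`.
struct Write {
  uint64_t offset;
  std::string data;
};

// Replay the writes in submission order into a byte map, so that a later
// write to the same byte replaces the earlier one, then coalesce adjacent
// bytes back into contiguous writes.
std::vector<Write> collapse(const std::vector<Write>& writes) {
  std::map<uint64_t, char> bytes;
  for (const Write& w : writes) {
    for (std::size_t i = 0; i < w.data.size(); ++i) {
      bytes[w.offset + i] = w.data[i];
    }
  }

  std::vector<Write> out;
  for (const auto& [off, ch] : bytes) {
    if (!out.empty() && out.back().offset + out.back().data.size() == off) {
      out.back().data.push_back(ch);  // extends the current contiguous run
    } else {
      out.push_back(Write{off, std::string(1, ch)});  // starts a new run
    }
  }
  return out;
}
```

The result contains each byte at most once, carrying the value from the most recently queued write, so one batch of aios per osr never writes stale data over newer data.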
8 years agoos/bluestore: fix OpSequencer/Sequencer lifecycle
Sage Weil [Fri, 10 Mar 2017 03:56:28 +0000 (22:56 -0500)]
os/bluestore: fix OpSequencer/Sequencer lifecycle

Make osr_set refcounts so that it can tolerate a Sequencer destruction
racing with flush or a Sequencer that outlives the BlueStore instance
itself.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: move _osr_reap_done
Sage Weil [Wed, 8 Mar 2017 20:01:35 +0000 (15:01 -0500)]
os/bluestore: move _osr_reap_done

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: reimplement/rename _sync -> _flush_all
Sage Weil [Wed, 8 Mar 2017 20:01:28 +0000 (15:01 -0500)]
os/bluestore: reimplement/rename _sync -> _flush_all

The old implementation is racy and doesn't actually work.  Instead, rely
on a list of all OpSequencers and drain them all.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: keep all OpSequencers registered
Sage Weil [Tue, 14 Mar 2017 02:49:41 +0000 (22:49 -0400)]
os/bluestore: keep all OpSequencers registered

Maintain the set of all live OpSequencers.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: keep onode refs for lifetime of obc
Sage Weil [Sat, 11 Mar 2017 19:30:53 +0000 (14:30 -0500)]
os/bluestore: keep onode refs for lifetime of obc

This ensures that we don't trim an onode from the cache while it has a
txc that is still in flight.  Which in turn ensures that if we try to read
the object, we will have any writing buffers available.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: make OnodeSpace onode_map private
Sage Weil [Sat, 11 Mar 2017 19:21:47 +0000 (14:21 -0500)]
os/bluestore: make OnodeSpace onode_map private

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: make Sequencer::flush() more efficient
Sage Weil [Thu, 9 Mar 2017 23:05:48 +0000 (18:05 -0500)]
os/bluestore: make Sequencer::flush() more efficient

BlueStore collection methods only need preceding transactions to be
applied to the kv db; they do not need to be committed.

Note that this is *only* needed for collection listings; all other read
operations are immediately safe after queue_transactions().

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: add OpSequencer::drain()
Sage Weil [Tue, 14 Mar 2017 02:49:37 +0000 (22:49 -0400)]
os/bluestore: add OpSequencer::drain()

Currently this is the same as flush, but more precisely it is an internal
method that means all txc's must complete.  Update _wal_apply() to use it
instead of flush(), which is part of the public Sequencer interface.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: revert throttle perfcounters
Sage Weil [Wed, 8 Mar 2017 20:45:31 +0000 (15:45 -0500)]
os/bluestore: revert throttle perfcounters

This reverts 3e40595f3cd8626cdceffa4a3a4efb088127f726

The individual throttles have their own set of perfcounters; no need to
duplicate them here.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: release deferred throttle on io finish, before cleanup
Sage Weil [Wed, 8 Mar 2017 19:57:52 +0000 (14:57 -0500)]
os/bluestore: release deferred throttle on io finish, before cleanup

The throttle is really about limiting deferred IO; we do not need to
actually remove the deferred record from the kv db before queueing more.
(In fact, the txc that queues more will do the cleanup.)

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: separate _txc_finish_kv into _txc_{applied,committed}_kv
Sage Weil [Wed, 8 Mar 2017 19:51:39 +0000 (14:51 -0500)]
os/bluestore: separate _txc_finish_kv into _txc_{applied,committed}_kv

We can unblock flush()ing threads as soon as we have applied to the kv db,
while the callbacks must wait until we have committed.

Move methods around a bit to better match the execution order.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: make flush() only wait for kv commit
Sage Weil [Wed, 8 Mar 2017 19:48:12 +0000 (14:48 -0500)]
os/bluestore: make flush() only wait for kv commit

The only remaining flush() users only need to see previous txc's applied
to the kv db (e.g., _omap_clear needs to see the records to delete them).

Signed-off-by: Sage Weil <sage@redhat.com>
# Conflicts:
# src/os/bluestore/BlueStore.h

8 years agoos/bluestore: no need to Onode::flush() on truncate
Sage Weil [Wed, 8 Mar 2017 19:45:27 +0000 (14:45 -0500)]
os/bluestore: no need to Onode::flush() on truncate

We do not release extents until after any deferred IO, so this flush() is
unnecessary.

Signed-off-by: Sage Weil <sage@redhat.com>
# Conflicts:
# src/os/bluestore/BlueStore.cc

8 years agoos/bluestore: no need to Onode::flush() in _do_read
Sage Weil [Mon, 6 Mar 2017 18:51:30 +0000 (13:51 -0500)]
os/bluestore: no need to Onode::flush() in _do_read

We now ensure that deferred writes are in cache until the txc retires,
so there is no need to wait here.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: pin writing cache buffers until txc is finished
Sage Weil [Mon, 6 Mar 2017 18:50:30 +0000 (13:50 -0500)]
os/bluestore: pin writing cache buffers until txc is finished

Notably, this includes WAL writes, which means an in-flight WAL write will
always be in the cache.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoos/bluestore: write padded data into buffer cache
Sage Weil [Thu, 9 Mar 2017 14:38:50 +0000 (09:38 -0500)]
os/bluestore: write padded data into buffer cache

We rely on the buffer cache to avoid reading any deferred write data. In
order for that to work, we have to ensure the entire block whose
overwrite is deferred is in the buffer cache.  Otherwise, a write to 0~5
that results in a deferred write could break a subsequent read from 5~5
that reads the same block from disk before the deferred write lands.

Signed-off-by: Sage Weil <sage@redhat.com>
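The padding requirement in the commit above comes down to range arithmetic: the cached range must be widened to whole blocks so that a later read of any byte in the same block is served from cache. The sketch below uses the offset~length notation from the commit message; Extent, pad_to_block(), covers(), and the 4096-byte block size used in the example are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// offset~length, as written in the commit message (e.g. 0~5).
struct Extent {
  uint64_t offset;
  uint64_t length;
};

// Widen an extent outward to block boundaries. block_size must be non-zero.
Extent pad_to_block(Extent e, uint64_t block_size) {
  uint64_t start = e.offset - (e.offset % block_size);
  uint64_t end = e.offset + e.length;
  uint64_t rem = end % block_size;
  if (rem != 0) end += block_size - rem;
  return Extent{start, end - start};
}

// True if `outer` fully contains `inner`.
bool covers(Extent outer, Extent inner) {
  return outer.offset <= inner.offset &&
         inner.offset + inner.length <= outer.offset + outer.length;
}
```

With an assumed 4096-byte block, the write 0~5 pads to 0~4096. The unpadded range does not cover a read of 5~5, while the padded range does, which is the case the commit message describes.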
8 years agoos/bluestore: update freelist on initial commit
Sage Weil [Wed, 8 Mar 2017 19:28:55 +0000 (14:28 -0500)]
os/bluestore: update freelist on initial commit

It does not matter if we update the freelist in the initial commit or when
cleaning up the deferred transaction; both will eventually update the
persistent kv freelist.  We maintain one case to ensure that legacy
deferred events (from a kraken upgrade) release when they are replayed.

What matters while online is the Allocator, which has an independent
in-memory copy of the freelist to make decisions.  And we can delay that
as long as we want.  To avoid any concerns about deferred writes racing
against released blocks, just defer any release until the txc is fully
completed (including any deferred writes).  This ensures that even if we
have a pattern like

 txc 1: schedule deferred write on block A
 txc 2: release block A
 txc 1+2: commit
 txc 2: done!
 txc 1: do deferred write
 txc 1: done!

then txc 2 won't do its release because it is stuck behind txc 1 in the
OpSequencer queue:

 ...
 txc 1: reaped
 txc 2: reaped (and extents released to alloc)

This builds in some delay in just-released space being usable again, but
it should be a very small amount of space relative to the size of the
store!

Signed-off-by: Sage Weil <sage@redhat.com>
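The ordering argument in the commit above (txc 2 cannot release block A until txc 1 has been reaped) can be modeled with a FIFO queue that only reaps from the head. This is a hypothetical sketch, not the OpSequencer code: Txc, Sequencer, submit(), mark_done(), and free_blocks() are invented for illustration, and block numbers are plain integers.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical transaction context: an id, a completion flag, and the
// blocks it wants returned to the allocator once it is reaped.
struct Txc {
  int id = 0;
  bool done = false;
  std::vector<uint64_t> released_blocks;
};

class Sequencer {
 public:
  void submit(int id, std::vector<uint64_t> released) {
    Txc t;
    t.id = id;
    t.released_blocks = std::move(released);
    queue_.push_back(std::move(t));
  }

  // Mark a txc as fully completed (including any deferred writes), then
  // reap whatever has become reapable at the head of the queue.
  void mark_done(int id) {
    for (Txc& t : queue_) {
      if (t.id == id) t.done = true;
    }
    reap();
  }

  const std::vector<uint64_t>& free_blocks() const { return freed_; }

 private:
  // Reap strictly in submission order: a finished txc stuck behind an
  // unfinished one keeps its releases pending.
  void reap() {
    while (!queue_.empty() && queue_.front().done) {
      for (uint64_t b : queue_.front().released_blocks) freed_.push_back(b);
      queue_.pop_front();
    }
  }

  std::deque<Txc> queue_;
  std::vector<uint64_t> freed_;
};
```

In the scenario from the commit message, marking txc 2 done first releases nothing; only after txc 1's deferred write finishes are both reaped and block A handed back to the allocator.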
8 years agoos/bluestore: wal -> deferred
Sage Weil [Wed, 8 Mar 2017 19:04:47 +0000 (14:04 -0500)]
os/bluestore: wal -> deferred

"wal" can refer to both the rocksdb wal (effectively, our journal) and the
"wal" events we include in it (mainly promises to do future IO or release
extents to the freelist).  This is super confusing!

Instead, call them 'deferred': deferred transactions, ops, writes, or
releases.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agovstart.sh: larger wal device
Sage Weil [Thu, 9 Mar 2017 21:46:50 +0000 (16:46 -0500)]
vstart.sh: larger wal device

Signed-off-by: Sage Weil <sage@redhat.com>
8 years agoMerge pull request #14030 from tchaikov/wip-denc-without-nullptr
Sage Weil [Tue, 21 Mar 2017 17:58:14 +0000 (12:58 -0500)]
Merge pull request #14030 from tchaikov/wip-denc-without-nullptr

os/bluestore: do not use nullptr to calc the size of bluestore_pextent_t

Reviewed-by: Sage Weil <sage@redhat.com>
8 years agoMerge pull request #12041 from yangdongsheng/rbd_mirror_clone
Jason Dillaman [Tue, 21 Mar 2017 15:42:15 +0000 (11:42 -0400)]
Merge pull request #12041 from yangdongsheng/rbd_mirror_clone

librbd: asynchronous clone state machine

Reviewed-by: Jason Dillaman <dillaman@redhat.com>
8 years agoMerge pull request #14058 from tchaikov/wip-doc-linkcheck
Kefu Chai [Tue, 21 Mar 2017 14:41:59 +0000 (22:41 +0800)]
Merge pull request #14058 from tchaikov/wip-doc-linkcheck

doc: add optional argument for build-doc

Reviewed-by: Ken Dreyer <kdreyer@redhat.com>
Reviewed-by: liuchang0812 <liuchang0812@gmail.com>
8 years agoMerge pull request #14023 from dillaman/wip-rbd-coverity
Mykola Golub [Tue, 21 Mar 2017 14:40:36 +0000 (16:40 +0200)]
Merge pull request #14023 from dillaman/wip-rbd-coverity

librbd: fix valid coverity warnings

Reviewed-by: Pan Liu <liupan1111@gmail.com>
Reviewed-by: Mykola Golub <mgolub@mirantis.com>
8 years agoMerge pull request #14034 from liupan1111/wip-fix-comment-nbd
Mykola Golub [Tue, 21 Mar 2017 14:37:04 +0000 (16:37 +0200)]
Merge pull request #14034 from liupan1111/wip-fix-comment-nbd

rbd-nbd: fix typo in comment

Reviewed-by: Mykola Golub <mgolub@mirantis.com>
8 years agorbd-nbd: fix typo in comment. 14034/head
Pan Liu [Tue, 21 Mar 2017 11:22:26 +0000 (19:22 +0800)]
rbd-nbd: fix typo in comment.

Signed-off-by: Pan Liu <liupan1111@gmail.com>
8 years agoosd: no need to set dirty_info flag, append_log() will do this 14060/head
Mingxin Liu [Mon, 20 Mar 2017 08:57:16 +0000 (16:57 +0800)]
osd: no need to set dirty_info flag, append_log() will do this

Signed-off-by: Mingxin Liu <mingxin@xsky.com>
8 years agoosd: should consider unstable_stats when publish stats to osd
Mingxin Liu [Tue, 21 Mar 2017 10:15:51 +0000 (18:15 +0800)]
osd: should consider unstable_stats when publish stats to osd

Signed-off-by: Mingxin Liu <mingxin@xsky.com>
8 years agotests: ceph_test_rados_api_watch_notify: test timeout using rados_watch3() 14061/head
Kefu Chai [Tue, 21 Mar 2017 06:03:23 +0000 (14:03 +0800)]
tests: ceph_test_rados_api_watch_notify: test timeout using rados_watch3()

The objecter resends the watch request upon TCP reconnect, and the OSD
resets the watch timeout when handling the resent watch request, so,
with a small "ms tcp read timeout", the timeout test always fails. In
this change, we use rados_watch3(), which supports the "timeout" option,
so we can pass a timeout smaller than "ms tcp read timeout", and hence
the test can pass.

Fixes: http://tracker.ceph.com/issues/19312
Signed-off-by: Kefu Chai <kchai@redhat.com>
8 years agodoc: cephfs: fix the unexpected indent warning 14058/head
Kefu Chai [Tue, 21 Mar 2017 04:49:45 +0000 (12:49 +0800)]
doc: cephfs: fix the unexpected indent warning

Signed-off-by: Kefu Chai <kchai@redhat.com>
8 years agoadmin/build-doc: support optional argument for specifying sphinx builders
Kefu Chai [Tue, 21 Mar 2017 04:22:57 +0000 (12:22 +0800)]
admin/build-doc: support optional argument for specifying sphinx builders

Signed-off-by: Kefu Chai <kchai@redhat.com>
8 years agocommon/config_opts.h: Remove deprecated osd_compact_leveldb_on_mount option. 14059/head
Vikhyat Umrao [Tue, 21 Mar 2017 04:23:29 +0000 (09:53 +0530)]
common/config_opts.h: Remove deprecated osd_compact_leveldb_on_mount option.
This option was removed in commit 1a5dea72012a8818b076d6ca97b71bd6766fa903.

Fixes: http://tracker.ceph.com/issues/19318
Signed-off-by: Vikhyat Umrao <vumrao@redhat.com>
8 years agoMerge pull request #13997 from tchaikov/wip-doc-fixings
Kefu Chai [Tue, 21 Mar 2017 03:46:12 +0000 (11:46 +0800)]
Merge pull request #13997 from tchaikov/wip-doc-fixings

doc:  fixes to silence sphinx-build

Reviewed-by: Brad Hubbard <bhubbard@redhat.com>
8 years agoMerge pull request #14057 from badone/wip-RadosImport-connect
Brad Hubbard [Tue, 21 Mar 2017 03:35:09 +0000 (13:35 +1000)]
Merge pull request #14057 from badone/wip-RadosImport-connect

tools/rados: Check return value of connect

Reviewed-by: David Zafman <dzafman@redhat.com>
8 years agotools/rados: Check return value of connect 14057/head
Brad Hubbard [Tue, 21 Mar 2017 02:22:20 +0000 (12:22 +1000)]
tools/rados: Check return value of connect

Fail gracefully if Rados::connect returns an error.

Fixes: http://tracker.ceph.com/issues/19319
Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
8 years agorgw: remove redundant codes in rgw_cache.h 13902/head
lihongjie [Thu, 9 Mar 2017 10:12:16 +0000 (18:12 +0800)]
rgw: remove redundant codes in rgw_cache.h

Signed-off-by: lihongjie <lihongjie@cmss.chinamobile.com>
8 years agoMerge pull request #13971 from optimistyzy/0315_1
Haomai Wang [Mon, 20 Mar 2017 21:18:20 +0000 (05:18 +0800)]
Merge pull request #13971 from optimistyzy/0315_1

os/blestore/NVMEDevice: fix the I/O logic for read

Reviewed-by: Haomai Wang <haomai@xsky.com>
8 years agoMerge pull request #13923 from xiexingguo/wip-clean-pglog-t
Yuri Weinstein [Mon, 20 Mar 2017 20:05:54 +0000 (13:05 -0700)]
Merge pull request #13923 from xiexingguo/wip-clean-pglog-t

OSD: drop parameter t from merge_log()

Reviewed-by: Gregory Farnum <gfarnum@redhat.com>
8 years agoMerge pull request #13938 from jimmyway/wip-chg-return-value-to-refs
Yuri Weinstein [Mon, 20 Mar 2017 20:04:44 +0000 (13:04 -0700)]
Merge pull request #13938 from jimmyway/wip-chg-return-value-to-refs

osd: replace object_info_t::operator=() with decode()

Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years agoMerge pull request #13980 from majianpeng/filejournal-bufferlist-rebuild
Yuri Weinstein [Mon, 20 Mar 2017 20:03:58 +0000 (13:03 -0700)]
Merge pull request #13980 from majianpeng/filejournal-bufferlist-rebuild

os/filestore/FileJournal: bufferlist rebuild

Reviewed-by: Sage Weil <sage@redhat.com>
8 years agoqa: rgw task uses period instead of region-map 13963/head
Casey Bodley [Tue, 14 Mar 2017 19:43:13 +0000 (15:43 -0400)]
qa: rgw task uses period instead of region-map

Signed-off-by: Casey Bodley <cbodley@redhat.com>
8 years agorgw-admin: remove deprecated regionmap commands
Casey Bodley [Tue, 14 Mar 2017 18:18:15 +0000 (14:18 -0400)]
rgw-admin: remove deprecated regionmap commands

Fixes: http://tracker.ceph.com/issues/18725
Signed-off-by: Casey Bodley <cbodley@redhat.com>
8 years agoMerge pull request #13535 from dongbula/add-rgw-finisher-to-perfcounter
Sage Weil [Mon, 20 Mar 2017 15:20:48 +0000 (10:20 -0500)]
Merge pull request #13535 from dongbula/add-rgw-finisher-to-perfcounter

rgw: add radosclient finisher to perf counter

Reviewed-by: Casey Bodley <cbodley@redhat.com>
8 years agoMerge pull request #13955 from wangzhengyong/notify_finish
Casey Bodley [Mon, 20 Mar 2017 14:25:39 +0000 (10:25 -0400)]
Merge pull request #13955 from wangzhengyong/notify_finish

rgw: handle error return value in build_linked_oids_index

Reviewed-by: Casey Bodley <cbodley@redhat.com>
8 years agoMerge pull request #13820 from mikulely/cleanup-rgw-lc
Casey Bodley [Mon, 20 Mar 2017 14:25:14 +0000 (10:25 -0400)]
Merge pull request #13820 from mikulely/cleanup-rgw-lc

rgw: cleanup lifecycle managament

Reviewed-by: Casey Bodley <cbodley@redhat.com>
8 years agoMerge pull request #13481 from theanalyst/rgw/env-dout
Casey Bodley [Mon, 20 Mar 2017 14:20:43 +0000 (10:20 -0400)]
Merge pull request #13481 from theanalyst/rgw/env-dout

rgw: don't log the env_map twice

Reviewed-by: Casey Bodley <cbodley@redhat.com>
8 years agoMerge pull request #13895 from guihecheng/rgw_file-fix
Matt Benjamin [Mon, 20 Mar 2017 13:54:02 +0000 (09:54 -0400)]
Merge pull request #13895 from guihecheng/rgw_file-fix

rgw_file: fix reversed return value of getattr