git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
8 years ago rbd: use min<uint64_t>() explicitly 14202/head
Kefu Chai [Tue, 28 Mar 2017 18:05:07 +0000 (02:05 +0800)]
rbd: use min<uint64_t>() explicitly

On arm32, size_t is a 32-bit unsigned int, so std::min() cannot deduce a
single type when it is compared with a uint64_t.

Fixes: http://tracker.ceph.com/issues/18938
Signed-off-by: Kefu Chai <kchai@redhat.com>
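
A minimal sketch of the technique named in this commit, assuming a made-up helper rather than the actual rbd code: when size_t and uint64_t are distinct types, std::min(a, b) does not compile because template argument deduction finds two different candidates for T. Naming the type explicitly converts both arguments first. The function name clamp_len and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not taken from the Ceph tree): clamp a buffer length
// to the number of bytes that remain.
inline uint64_t clamp_len(size_t buf_len, uint64_t remaining) {
  // std::min(buf_len, remaining) is ill-formed on targets where size_t and
  // uint64_t are different types, since T would be deduced two ways.
  // Spelling out T makes both arguments convert to uint64_t before comparing.
  return std::min<uint64_t>(buf_len, remaining);
}
```

Because the widening is from a 32-bit or 64-bit unsigned type to uint64_t, no value is lost in the conversion.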
8 years ago Merge pull request #14114 from dmick/wip-boost-j
Kefu Chai [Sat, 25 Mar 2017 04:13:17 +0000 (12:13 +0800)]
Merge pull request #14114 from dmick/wip-boost-j

debian/rules, ceph.spec.in: invoke cmake with -DBOOST_J

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years ago Merge pull request #13889 from liewegas/wip-denc-nullptr
Sage Weil [Fri, 24 Mar 2017 21:41:37 +0000 (16:41 -0500)]
Merge pull request #13889 from liewegas/wip-denc-nullptr

include/denc: remove nullptr runtime magic boundedness check

Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years ago Merge pull request #14096 from baiyanchun/remove_useless_parameter
Sage Weil [Fri, 24 Mar 2017 21:41:18 +0000 (16:41 -0500)]
Merge pull request #14096 from baiyanchun/remove_useless_parameter

common: remove useless parameter

Reviewed-by: Kefu Chai <kchai@redhat.com>
Reviewed-by: Pan Liu <liupan1111@gmail.com>
8 years ago Merge pull request #14131 from liewegas/wip-crush-encode
Sage Weil [Fri, 24 Mar 2017 20:28:27 +0000 (15:28 -0500)]
Merge pull request #14131 from liewegas/wip-crush-encode

crush: only encode class info if SERVER_LUMINOUS

Reviewed-by: Loic Dachary <ldachary@redhat.com>
8 years ago Merge pull request #13960 from wangzhengyong/kstore
Sage Weil [Fri, 24 Mar 2017 18:17:39 +0000 (13:17 -0500)]
Merge pull request #13960 from wangzhengyong/kstore

os/kstore: some error handling

Reviewed-by: Sage Weil <sage@redhat.com>
8 years ago Merge pull request #13973 from shinobu-x/wp-sk-primarylogpg-null-nullptr
Sage Weil [Fri, 24 Mar 2017 18:16:58 +0000 (13:16 -0500)]
Merge pull request #13973 from shinobu-x/wp-sk-primarylogpg-null-nullptr

osd/PrimaryLogPG: nullptr not NULL

Reviewed-by: Sage Weil <sage@redhat.com>
8 years ago Merge pull request #13995 from liuhongtong/wip-config
Sage Weil [Fri, 24 Mar 2017 18:13:39 +0000 (13:13 -0500)]
Merge pull request #13995 from liuhongtong/wip-config

common/config: set rocksdb_cache_size to OPT_U64

Reviewed-by: Sage Weil <sage@redhat.com>
8 years ago Merge pull request #14013 from ShiqiCooperation/newshiqi
Sage Weil [Fri, 24 Mar 2017 18:12:16 +0000 (13:12 -0500)]
Merge pull request #14013 from ShiqiCooperation/newshiqi

test/unittest_bluefs: check whether add_block_device success

Reviewed-by: Sage Weil <sage@redhat.com>
8 years ago crush: only encode class info if SERVER_LUMINOUS 14131/head
Sage Weil [Fri, 24 Mar 2017 17:59:34 +0000 (13:59 -0400)]
crush: only encode class info if SERVER_LUMINOUS

This fixes OSDMap reencode crc mismatches on jewel to
luminous upgrades.

Fixes: http://tracker.ceph.com/issues/19361
Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago ceph.spec.in: derive _smp_ncpus and use it for -DBOOST_J 14114/head
Dan Mick [Fri, 24 Mar 2017 02:35:08 +0000 (19:35 -0700)]
ceph.spec.in: derive _smp_ncpus and use it for -DBOOST_J

Signed-off-by: Dan Mick <dan.mick@redhat.com>
8 years ago ceph.spec.in: move lowmem_build setting of _smp_mflags
Dan Mick [Fri, 24 Mar 2017 02:34:28 +0000 (19:34 -0700)]
ceph.spec.in: move lowmem_build setting of _smp_mflags

Signed-off-by: Dan Mick <dan.mick@redhat.com>
8 years ago debian/rules: invoke cmake with -DBOOST_J
Dan Mick [Thu, 23 Mar 2017 23:36:53 +0000 (16:36 -0700)]
debian/rules: invoke cmake with -DBOOST_J

Allow boost build during toplevel cmake from Debian package build
to benefit from multiple processors.  Should speed build a lot
on many-proc machines (say, arm64).  Use argument passed to
debhelper.

Signed-off-by: Dan Mick <dan.mick@redhat.com>
8 years ago Merge pull request #14082 from idealguo/update-bucket-acl
Casey Bodley [Fri, 24 Mar 2017 15:15:05 +0000 (11:15 -0400)]
Merge pull request #14082 from idealguo/update-bucket-acl

rgw: enable to update acl of bucket created in slave zonegroup

Reviewed-by: Casey Bodley <cbodley@redhat.com>
8 years ago Merge pull request #14043 from zhangsw/fix-rgw-deletebucket
Casey Bodley [Fri, 24 Mar 2017 15:11:50 +0000 (11:11 -0400)]
Merge pull request #14043 from zhangsw/fix-rgw-deletebucket

rgw: delete non-empty buckets in slave zonegroup works not well

Reviewed-by: Casey Bodley <cbodley@redhat.com>
8 years ago Merge pull request #13991 from Liuchang0812/wip-rgw-optimization
Casey Bodley [Fri, 24 Mar 2017 15:10:28 +0000 (11:10 -0400)]
Merge pull request #13991 from Liuchang0812/wip-rgw-optimization

rgw: avoid listing user buckets for rgw_delete_user

Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Reviewed-by: Casey Bodley <cbodley@redhat.com>
8 years ago Merge pull request #13504 from rzarzynski/wip-rgw-chunkingfilter-cleanup
Casey Bodley [Fri, 24 Mar 2017 15:08:18 +0000 (11:08 -0400)]
Merge pull request #13504 from rzarzynski/wip-rgw-chunkingfilter-cleanup

rgw: clean up the unneeded rgw::io::ChunkingFilter::has_content_length.

Reviewed-by: Casey Bodley <cbodley@redhat.com>
8 years ago Merge pull request #13847 from wjwithagen/wip-wjw-ceph-disk-tests-2
Kefu Chai [Fri, 24 Mar 2017 14:44:15 +0000 (22:44 +0800)]
Merge pull request #13847 from wjwithagen/wip-wjw-ceph-disk-tests-2

ceph-disk/tests/test_main.py: FreeBSD does not do multipath

Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years ago Merge pull request #13974 from tchaikov/wip-vstart-start-mgr
Kefu Chai [Fri, 24 Mar 2017 13:44:56 +0000 (21:44 +0800)]
Merge pull request #13974 from tchaikov/wip-vstart-start-mgr

vstart: do not start mgr if not start_all

Reviewed-by: Sage Weil <sage@redhat.com>
8 years ago Merge pull request #13197 from asheplyakov/master-18740
Kefu Chai [Fri, 24 Mar 2017 07:53:17 +0000 (15:53 +0800)]
Merge pull request #13197 from asheplyakov/master-18740

systemd/ceph-disk: make it possible to customize timeout

Reviewed-by: Loic Dachary <ldachary@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years ago Merge pull request #14103 from tchaikov/wip-https-github
Kefu Chai [Fri, 24 Mar 2017 06:43:48 +0000 (14:43 +0800)]
Merge pull request #14103 from tchaikov/wip-https-github

script: ceph-release-notes: use https instead of http

Reviewed-by: Abhishek Lekshmanan <abhishek@suse.com>
8 years ago Merge pull request #14085 from wjwithagen/wip-wjw-bluestore-fixture
Sage Weil [Fri, 24 Mar 2017 01:47:45 +0000 (20:47 -0500)]
Merge pull request #14085 from wjwithagen/wip-wjw-bluestore-fixture

test/objectstore/store_test_fixture.cc: Exclude bluestore code if required.

Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years ago Merge pull request #13931 from wangzhengyong/extent
Sage Weil [Fri, 24 Mar 2017 01:47:12 +0000 (20:47 -0500)]
Merge pull request #13931 from wangzhengyong/extent

os/bluestore: fix bug for calc extent_avg in reshard function

Reviewed-by: xie xingguo <xie.xingguo@zte.com.cn>
Reviewed-by: Igor Fedotov <ifedotov@mirantis.com>
8 years ago Merge pull request #14073 from liewegas/wip-bluestore-nullptr
Sage Weil [Fri, 24 Mar 2017 01:44:59 +0000 (20:44 -0500)]
Merge pull request #14073 from liewegas/wip-bluestore-nullptr

os/bluestore: avoid nullptr in bluestore_extent_ref_map_t::bound_encode

Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years ago Merge pull request #13577 from yonghengdexin735/wip-zzz-openalloc
Sage Weil [Fri, 24 Mar 2017 01:44:35 +0000 (20:44 -0500)]
Merge pull request #13577 from yonghengdexin735/wip-zzz-openalloc

os/bluestore: fix bug in _open_alloc()

Reviewed-by: Varada Kari <varada.kari@sandisk.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years ago Merge pull request #14110 from dachary/wip-crush-cleanup
Loic Dachary [Thu, 23 Mar 2017 20:48:00 +0000 (21:48 +0100)]
Merge pull request #14110 from dachary/wip-crush-cleanup

crush: builder: clean the arguments of crush_reweight* methods

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Loic Dachary <ldachary@redhat.com>
8 years ago vstart.sh: do not init fsmap if "$new == 0" 13974/head
Kefu Chai [Wed, 22 Mar 2017 05:04:06 +0000 (13:04 +0800)]
vstart.sh: do not init fsmap if "$new == 0"

We can no longer create a new cephfs on a non-empty pool without the
'--force' option, so the "ceph fs new" command fails with "vstart.sh -k".

Signed-off-by: Kefu Chai <kchai@redhat.com>
8 years ago tests: remove mds,osd,mon args passed to vstart.sh
Kefu Chai [Wed, 22 Mar 2017 15:33:30 +0000 (23:33 +0800)]
tests: remove mds,osd,mon args passed to vstart.sh

Signed-off-by: Kefu Chai <kchai@redhat.com>
8 years ago crush: builder: clean the arguments of crush_reweight* methods 14110/head
Sahid Orentino Ferdjaoui [Mon, 13 Mar 2017 16:36:16 +0000 (12:36 -0400)]
crush: builder: clean the arguments of crush_reweight* methods

This commit is just a cleanup that makes the arguments of the
crush_reweight* methods consistent.

Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@redhat.com>
8 years ago vstart.sh: remove start_*
Kefu Chai [Wed, 22 Mar 2017 03:48:40 +0000 (11:48 +0800)]
vstart.sh: remove start_*

so there are only two ways to override the number of daemons to start
- using the env var CEPH_NUM_{MON|OSD|MGR|MDS} or {MON|OSD|MGR|MDS}
- command line options: --{mon,osd,mds}_num

To prevent a daemon from running, set the corresponding env var to 0.

Signed-off-by: Kefu Chai <kchai@redhat.com>
8 years ago Merge pull request #14050 from ovh/bp-dump-ops-by-duration
Yuri Weinstein [Thu, 23 Mar 2017 15:47:55 +0000 (08:47 -0700)]
Merge pull request #14050 from ovh/bp-dump-ops-by-duration

common/TrackedOp: allow dumping historic ops sorted by duration

Reviewed-by: Sage Weil <sage@redhat.com>
8 years ago Merge pull request #14060 from LiumxNL/wip-170321
Yuri Weinstein [Thu, 23 Mar 2017 15:46:36 +0000 (08:46 -0700)]
Merge pull request #14060 from LiumxNL/wip-170321

osd: combine unstable stats with info.stats when publish stats to osd

Reviewed-by: Sage Weil <sage@redhat.com>
8 years ago Merge pull request #13293 from Liuchang0812/cleanup-coverity
Yuri Weinstein [Thu, 23 Mar 2017 15:45:58 +0000 (08:45 -0700)]
Merge pull request #13293 from Liuchang0812/cleanup-coverity

test, osd: fix some coverity issues

Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years ago Merge pull request #14014 from Liuchang0812/wip-fix-seg-fault
Casey Bodley [Thu, 23 Mar 2017 13:54:47 +0000 (09:54 -0400)]
Merge pull request #14014 from Liuchang0812/wip-fix-seg-fault

rgw: fix memory leak in RGWGetObjLayout

Reviewed-by: Jos Collin <jcollin@redhat.com>
Reviewed-by: Casey Bodley <cbodley@redhat.com>
8 years ago os/bluestore: avoid nullptr in bluestore_extent_ref_map_t::bound_encode 14073/head
Sage Weil [Thu, 23 Mar 2017 13:21:39 +0000 (08:21 -0500)]
os/bluestore: avoid nullptr in bluestore_extent_ref_map_t::bound_encode

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago Merge pull request #14094 from optimistyzy/322
Haomai Wang [Thu, 23 Mar 2017 11:23:34 +0000 (19:23 +0800)]
Merge pull request #14094 from optimistyzy/322

bluestore, NVMeDevice: use task' own lock for (random) read

Reviewed-by: Haomai Wang <haomai@xsky.com>
8 years ago script: ceph-release-notes: use https instead of http 14103/head
Kefu Chai [Thu, 23 Mar 2017 11:13:41 +0000 (19:13 +0800)]
script: ceph-release-notes: use https instead of http

Signed-off-by: Kefu Chai <kchai@redhat.com>
8 years ago Merge pull request #14004 from liewegas/wip-osd-full-failsafe
Kefu Chai [Thu, 23 Mar 2017 08:09:34 +0000 (16:09 +0800)]
Merge pull request #14004 from liewegas/wip-osd-full-failsafe

osd: fall back to failsafe threshold if osdmap doesn't set [near]full

Reviewed-by: David Zafman <dzafman@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years ago Merge pull request #13903 from wjwithagen/wip-wjw-run-classes-sed
Kefu Chai [Thu, 23 Mar 2017 08:08:22 +0000 (16:08 +0800)]
Merge pull request #13903 from wjwithagen/wip-wjw-run-classes-sed

test: sed on FreeBSD requires "-i extension", so use gsed

Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years ago Merge pull request #9940 from aclamk/common-recursive-mutex-fix
Kefu Chai [Thu, 23 Mar 2017 08:04:52 +0000 (16:04 +0800)]
Merge pull request #9940 from aclamk/common-recursive-mutex-fix

common: fix lockdep vs recursive mutexes

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years ago common: remove useless parameter 14096/head
baiyanchun [Thu, 23 Mar 2017 02:38:15 +0000 (10:38 +0800)]
common: remove useless parameter

Signed-off-by: baiyanchun <yanchun.bai@istuary.com>
8 years ago bluestore, NVMeDevice: use task' own lock for (random) read 14094/head
Ziye Yang [Wed, 22 Mar 2017 03:41:00 +0000 (11:41 +0800)]
bluestore, NVMeDevice: use task' own lock for (random) read

The reason is that ioc may be reaped in the _aio_thread function
by the following statements:
for (auto &&it : registered_devices)
          it->reap_ioc();

So if we still use ioc's lock for (random) read, it will cause
a core dump.

Signed-off-by: optimistyzy <optimistyzy@gmail.com>
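
A minimal sketch of the idea behind this commit, assuming a heavily simplified task type: each read task carries its own mutex and condition variable, so the waiting thread never holds a lock that lives inside an I/O context another thread may reap. ReadTask and its members are hypothetical names, not the actual NVMEDevice types.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a synchronous read task. The lock belongs to the
// task itself, so its lifetime matches the lifetime of the wait.
struct ReadTask {
  std::mutex lock;
  std::condition_variable cond;
  bool done = false;
  int return_code = 0;

  // Called from the completion path once the device has finished the read.
  void complete(int rc) {
    {
      std::lock_guard<std::mutex> l(lock);
      return_code = rc;
      done = true;
    }
    cond.notify_all();
  }

  // Called by the submitting thread; blocks until complete() has run.
  int wait() {
    std::unique_lock<std::mutex> l(lock);
    cond.wait(l, [this] { return done; });
    return return_code;
  }
};
```

The predicate form of wait() also covers the case where completion happens before the submitter starts waiting.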
8 years ago rgw: enable to update acl of bucket created in slave zonegroup 14082/head
Guo Zhandong [Wed, 22 Mar 2017 10:00:37 +0000 (18:00 +0800)]
rgw: enable to update acl of bucket created in slave zonegroup

Fixes: http://tracker.ceph.com/issues/16888
Signed-off-by: Guo Zhandong <guozhandong@cmss.chinamobile.com>
8 years ago Merge pull request #14080 from ceph/evelu-ceph-disk
Loic Dachary [Wed, 22 Mar 2017 18:43:37 +0000 (19:43 +0100)]
Merge pull request #14080 from ceph/evelu-ceph-disk

ceph-disk: Reporting /sys directory in get_partition_dev()

Reviewed-by: Loic Dachary <ldachary@redhat.com>
8 years ago Merge pull request #13942 from xiexingguo/wip-cleanup-proc-repinfo
Kefu Chai [Wed, 22 Mar 2017 15:57:13 +0000 (23:57 +0800)]
Merge pull request #13942 from xiexingguo/wip-cleanup-proc-repinfo

osd/PG: conditionally retry on receiving pg-notify when Primary is Incomplete

Reviewed-by: Sage Weil <sage@redhat.com>
8 years ago Merge pull request #14061 from tchaikov/wip-19312
Kefu Chai [Wed, 22 Mar 2017 15:56:27 +0000 (23:56 +0800)]
Merge pull request #14061 from tchaikov/wip-19312

tests: ceph_test_rados_api_watch_notify: test timeout using rados_wat…

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
8 years ago Merge pull request #12449 from cbodley/wip-rgw-test-multi-vers-acl
Casey Bodley [Wed, 22 Mar 2017 15:46:33 +0000 (11:46 -0400)]
Merge pull request #12449 from cbodley/wip-rgw-test-multi-vers-acl

test/rgw: add bucket acl and versioning tests to test_multi.py

Reviewed-by: Orit Wasserman <owasserm@redhat.com>
8 years ago Merge pull request #14059 from vumrao/wip-vumrao-19318
Kefu Chai [Wed, 22 Mar 2017 14:43:41 +0000 (22:43 +0800)]
Merge pull request #14059 from vumrao/wip-vumrao-19318

common/config_opts.h: Remove deprecated osd_compact_leveldb_on_mount option

Reviewed-by: Jos Collin <jcollin@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
8 years ago Merge pull request #14076 from liewegas/wip-bluestore-min-alloc-size
Mark Nelson [Wed, 22 Mar 2017 14:12:41 +0000 (09:12 -0500)]
Merge pull request #14076 from liewegas/wip-bluestore-min-alloc-size

os/bluestore: default 16KB min_alloc_size on ssd

8 years ago test/objectstore/store_test_fixture.cc: Exclude bluestore code if required. 14085/head
Willem Jan Withagen [Wed, 22 Mar 2017 14:03:32 +0000 (15:03 +0100)]
test/objectstore/store_test_fixture.cc: Exclude bluestore code if required.

Signed-off-by: Willem Jan Withagen <wjw@digiware.nl>
8 years ago Merge pull request #14068 from optimistyzy/321_new
Haomai Wang [Wed, 22 Mar 2017 13:16:48 +0000 (21:16 +0800)]
Merge pull request #14068 from optimistyzy/321_new

Bluestore, NVMEDevice: add the spdk core mask check

Reviewed-by: Haomai Wang <haomai@xsky.com>
8 years ago TrackedOp: allow dumping historic ops sorted by duration 14050/head
Piotr Dałek [Mon, 20 Mar 2017 12:51:25 +0000 (13:51 +0100)]
TrackedOp: allow dumping historic ops sorted by duration

Currently dump_historic_ops dumps ops sorted by their initiation time,
which may not have any relation to how long they took, and sorting the output
of that command by op duration is neither fast nor convenient.
New asok command ("dump_historic_ops_by_duration") outputs the same
op list, but ordered by their duration time (longest first).

Signed-off-by: Piotr Dałek <piotr.dalek@corp.ovh.com>
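
A minimal sketch of the ordering this commit describes, assuming a simplified record type: the same list of ops, reordered so the longest-running op comes first. OpRecord and by_duration are hypothetical names and do not reflect the TrackedOp data structures.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified view of a historic op.
struct OpRecord {
  std::string description;
  double initiated_at;  // seconds since some epoch
  double duration;      // seconds the op took
};

// Return the ops ordered by duration, longest first. stable_sort keeps the
// original (initiation-time) order among ops with equal duration.
inline std::vector<OpRecord> by_duration(std::vector<OpRecord> ops) {
  std::stable_sort(ops.begin(), ops.end(),
                   [](const OpRecord& a, const OpRecord& b) {
                     return a.duration > b.duration;
                   });
  return ops;
}
```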
8 years ago Bluestore, NVMEDevice: add the spdk core mask check 14068/head
optimistyzy [Tue, 21 Mar 2017 11:00:15 +0000 (19:00 +0800)]
Bluestore, NVMEDevice: add the spdk core mask check

This patch adds the spdk core mask check and also
sets the master core for starting DPDK.

Signed-off-by: optimistyzy <optimistyzy@gmail.com>
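
The log does not show what the check actually tests, so the following is only a sketch under stated assumptions: the mask is a hexadecimal string, it must be non-zero and must not name cores beyond the machine's core count, and the master core is taken to be the lowest set bit. The function name master_core_from_mask and all of these rules are assumptions, not the patch's logic.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical validation of a hex core mask such as "0x6".
// Returns the index of the lowest set bit (used here as the master core),
// or -1 if the mask is malformed, empty, or selects a core >= num_cores.
inline int master_core_from_mask(const std::string& mask_str,
                                 unsigned num_cores) {
  if (mask_str.empty() || num_cores == 0 || num_cores > 64) return -1;
  // Reject signs and whitespace, which strtoull would otherwise accept.
  if (!std::isxdigit(static_cast<unsigned char>(mask_str[0]))) return -1;

  char* end = nullptr;
  errno = 0;
  unsigned long long mask = std::strtoull(mask_str.c_str(), &end, 16);
  if (errno != 0 || *end != '\0' || mask == 0) return -1;

  // Any bit at or above num_cores refers to a core that does not exist.
  if (num_cores < 64 && (mask >> num_cores) != 0) return -1;

  int core = 0;
  while ((mask & 1ULL) == 0) {
    mask >>= 1;
    ++core;
  }
  return core;
}
```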
8 years ago rgw/rgw_op: fix memory leak in RGWGetObjLayout 14014/head
liuchang0812 [Wed, 22 Mar 2017 09:27:20 +0000 (17:27 +0800)]
rgw/rgw_op: fix memory leak in RGWGetObjLayout

Signed-off-by: liuchang0812 <liuchang0812@gmail.com>
8 years ago ceph-disk: Reporting /sys directory in get_partition_dev() 14080/head
Erwan Velu [Wed, 22 Mar 2017 09:11:44 +0000 (10:11 +0100)]
ceph-disk: Reporting /sys directory in get_partition_dev()

When get_partition_dev() fails, it reports the following message:
    ceph_disk.main.Error: Error: partition 2 for /dev/sdb does not appear to exist
The code searches for a directory inside /sys/block/get_dev_name(os.path.realpath(dev)).

The issue is that the error message does not report that path, even though the path may be the cause of the failure.

This patch reports where the code was looking when it tried to determine whether the partition was available.

Signed-off-by: Erwan Velu <erwan@redhat.com>
8 years ago vstart.sh: do nothing if $CEPH_NUM_* is 0
Kefu Chai [Wed, 22 Mar 2017 03:34:21 +0000 (11:34 +0800)]
vstart.sh: do nothing if $CEPH_NUM_* is 0

Signed-off-by: Kefu Chai <kchai@redhat.com>
8 years ago vstart.sh: extract start_{osd,mon,mgr,mds} into functions
Kefu Chai [Wed, 15 Mar 2017 07:28:09 +0000 (15:28 +0800)]
vstart.sh: extract start_{osd,mon,mgr,mds} into functions

Signed-off-by: Kefu Chai <kchai@redhat.com>
8 years ago os/bluestore: default 16KB min_alloc_size on ssd 14076/head
Sage Weil [Wed, 22 Mar 2017 02:27:23 +0000 (21:27 -0500)]
os/bluestore: default 16KB min_alloc_size on ssd

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago Merge pull request #13963 from cbodley/wip-18725
Orit Wasserman [Tue, 21 Mar 2017 21:44:22 +0000 (23:44 +0200)]
Merge pull request #13963 from cbodley/wip-18725

rgw-admin: remove deprecated regionmap commands
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
8 years ago Merge pull request #13888 from liewegas/wip-bluestore-dw
Sage Weil [Tue, 21 Mar 2017 20:05:56 +0000 (15:05 -0500)]
Merge pull request #13888 from liewegas/wip-bluestore-dw

os/bluestore: fix deferred writes; improve flush

Reviewed-by: Igor Fedotov <ifedotov@mirantis.com>
8 years ago Merge pull request #13902 from Wilhelmshaven/rm_redundant_code
Casey Bodley [Tue, 21 Mar 2017 19:43:48 +0000 (15:43 -0400)]
Merge pull request #13902 from Wilhelmshaven/rm_redundant_code

rgw: remove redundant codes in rgw_cache.h

Reviewed-by: Casey Bodley <cbodley@redhat.com>
8 years ago os/bluestore: handle zombie OpSequencers 13888/head
Sage Weil [Sat, 18 Mar 2017 17:51:08 +0000 (13:51 -0400)]
os/bluestore: handle zombie OpSequencers

It's possible for the Sequencer to go away while the OpSequencer still has
txcs in flight.  We were handling the case where the osr was on the
deferred_queue, but it may be off the deferred_queue but waiting for the
commit to happen, and we still need to wait for that.

Fix this by introducing a 'zombie' state for the osr, in which we keep the
osr in the osr_set.

Clean up the OpSequencer methods and a few other method names.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: clean up flush_all()
Sage Weil [Fri, 17 Mar 2017 21:52:56 +0000 (17:52 -0400)]
os/bluestore: clean up flush_all()

Add assertions if we fail to flush everything.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: move cached items around on collection split
Sage Weil [Fri, 17 Mar 2017 14:13:22 +0000 (10:13 -0400)]
os/bluestore: move cached items around on collection split

We've been avoiding doing this for a while and it has finally caught up
with us: the SharedBlob may outlive the split due to deferred IO, and
a read on the child collection may load a competing Blob and SharedBlob
and read from the on-disk blocks that haven't been written yet.

Fix by preserving the one-SharedBlob-instance invariant by moving cache
items to the new Collection and cache shard like we should have from the
beginning.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: simplify flush() wake-up condition
Sage Weil [Fri, 17 Mar 2017 17:54:20 +0000 (13:54 -0400)]
os/bluestore: simplify flush() wake-up condition

Clearer, and fewer wakeups.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago ceph_test_objectstore: set bluestore cache shards to 5
Sage Weil [Fri, 17 Mar 2017 14:12:02 +0000 (10:12 -0400)]
ceph_test_objectstore: set bluestore cache shards to 5

Better test coverage!

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago unittest_bluestore_types: fix Collection using tests
Sage Weil [Thu, 16 Mar 2017 20:33:53 +0000 (16:33 -0400)]
unittest_bluestore_types: fix Collection using tests

We can't use a bare Collection since we get/put refs, the last put will
delete it, and the dtor asserts nref == 0 (no faking a ref and deliberately
leaking!).

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore/KernelDevice: drop unused flush_lock
Sage Weil [Thu, 16 Mar 2017 16:24:51 +0000 (12:24 -0400)]
os/bluestore/KernelDevice: drop unused flush_lock

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: better debugging around collections
Sage Weil [Thu, 16 Mar 2017 16:19:30 +0000 (12:19 -0400)]
os/bluestore: better debugging around collections

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: nicer Onode dout prefix
Sage Weil [Thu, 16 Mar 2017 15:30:59 +0000 (11:30 -0400)]
os/bluestore: nicer Onode dout prefix

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: flush_cache on umount, fsck finish, etc.
Sage Weil [Thu, 16 Mar 2017 15:30:37 +0000 (11:30 -0400)]
os/bluestore: flush_cache on umount, fsck finish, etc.

Otherwise cache items survive beyond umount into the next mount cycle!

Also, ensure that we flush_cache *before* clearing coll_map, as some cache
items have references back to the Collection.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: take Collection ref from SharedBlob
Sage Weil [Wed, 15 Mar 2017 19:01:52 +0000 (15:01 -0400)]
os/bluestore: take Collection ref from SharedBlob

These can survive as long as the txc, which can be longer than the
Collection.  Make sure we have a valid ref as both finish_write and
~SharedBlob use coll for the SharedBlobSet (and coll->store->cct for
debug).

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: fix perfcounters for deferred io
Sage Weil [Tue, 14 Mar 2017 20:47:48 +0000 (16:47 -0400)]
os/bluestore: fix perfcounters for deferred io

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: remove dead _do_deferred_op code
Sage Weil [Tue, 14 Mar 2017 20:47:40 +0000 (16:47 -0400)]
os/bluestore: remove dead _do_deferred_op code

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: make throttles tunable online
Sage Weil [Tue, 14 Mar 2017 18:17:20 +0000 (14:17 -0400)]
os/bluestore: make throttles tunable online

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: prevent throttle deadlock due to deferred writes
Sage Weil [Mon, 13 Mar 2017 11:43:57 +0000 (07:43 -0400)]
os/bluestore: prevent throttle deadlock due to deferred writes

Kick off deferred IOs if we pass the throttle midpoint or if we would
block during submission.

Signed-off-by: Sage Weil <sage@redhat.com>
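
A minimal sketch of the decision rule in this commit message, assuming a simple byte-counting throttle: deferred IO is kicked off either when usage has passed half of the limit or when the next request would not fit. DeferredThrottle and should_kick_deferred are hypothetical names, not BlueStore's throttle types.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical throttle state: `current` units in use out of `max`.
struct DeferredThrottle {
  uint64_t max = 0;
  uint64_t current = 0;

  // True once usage has passed the midpoint of the limit.
  bool past_midpoint() const { return current > max / 2; }

  // True if taking `cost` more units would exceed the limit, i.e. the
  // submitter would have to block.
  bool would_block(uint64_t cost) const { return current + cost > max; }
};

// Kick off deferred IOs early enough that the throttle can drain; otherwise
// the submitter could wait on IOs that nobody has dispatched yet.
inline bool should_kick_deferred(const DeferredThrottle& t, uint64_t cost) {
  return t.past_midpoint() || t.would_block(cost);
}
```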
8 years ago ceph_test_objectstore: fix Synthetic to never modify bufferlists
Sage Weil [Fri, 10 Mar 2017 15:27:52 +0000 (10:27 -0500)]
ceph_test_objectstore: fix Synthetic to never modify bufferlists

We were modifying bufferlists in place, and kludging around it by making
full copies elsewhere.  Instead, never modify a buffer.

This fixes issues where the buffer we submit to ObjectStore ends up in
the cache and we modify in place later, corrupting the implementation's
copy.  (This was affecting BlueStore.)

Rearrange the data methods to be next to each other and clean them up a
bit too.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: drop obsolete comment
Sage Weil [Fri, 10 Mar 2017 15:20:22 +0000 (10:20 -0500)]
os/bluestore: drop obsolete comment

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: avoid extra dev flush on single device when all io is deferred
Sage Weil [Thu, 9 Mar 2017 22:28:58 +0000 (17:28 -0500)]
os/bluestore: avoid extra dev flush on single device when all io is deferred

If we have no non-deferred IO to flush, and we are running bluefs on a
single shared device, then we can rely on the bluefs flush to make our
current batch of deferred ios stable.

Separate deferred into a "done" and "stable" list.  If we do sync, put
everything from "done" onto "stable".  Otherwise, after we do our kv
commit via bluefs, move "done" to "stable" then.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: debug alloc release
Sage Weil [Tue, 14 Mar 2017 14:33:23 +0000 (10:33 -0400)]
os/bluestore: debug alloc release

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: flush old/discarded OpSequencers too
Sage Weil [Tue, 14 Mar 2017 14:33:17 +0000 (10:33 -0400)]
os/bluestore: flush old/discarded OpSequencers too

When the Sequencer goes away it gets deregistered.  If there are still
deferred IOs in flight, we need to wait for those too.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: batch up to bluestore_deferred_batch_ops before submitting
Sage Weil [Thu, 9 Mar 2017 19:17:47 +0000 (14:17 -0500)]
os/bluestore: batch up to bluestore_deferred_batch_ops before submitting

Allow several deferred writes to accumulate before we submit them.  In
general we have no time pressure, and on HDD (and perhaps sometimes SSD)
it is beneficial to accumulate and batch these so that they result in
fewer seeks.  On HDD, this is particularly true of seeks away from the
journal.  And on sequential workloads this can avoid seeks.  It may even
allow the block layer or SSD firmware to merge IOs and perform fewer
writes.

Signed-off-by: Sage Weil <sage@redhat.com>
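
A minimal sketch of the batching described here, assuming ops are identified by plain integers: writes accumulate until a configured count is reached and are then handed over as one batch. DeferredBatcher is a hypothetical class; only the idea of a batch-size threshold comes from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical accumulator for deferred writes.
class DeferredBatcher {
 public:
  explicit DeferredBatcher(std::size_t batch_ops) : batch_ops_(batch_ops) {
    assert(batch_ops > 0);
  }

  // Queue one deferred write. Returns the batch to submit once the threshold
  // is reached, or an empty vector while still accumulating.
  std::vector<int> queue(int op_id) {
    pending_.push_back(op_id);
    if (pending_.size() < batch_ops_) return {};
    return flush();
  }

  // Hand over whatever has accumulated, e.g. when a caller must drain.
  std::vector<int> flush() {
    std::vector<int> out;
    out.swap(pending_);
    return out;
  }

  std::size_t pending() const { return pending_.size(); }

 private:
  std::size_t batch_ops_;
  std::vector<int> pending_;
};
```

An explicit flush() is kept separate because a pure count threshold would otherwise leave a partial batch waiting indefinitely.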
8 years ago os/bluestore: only discard deallocated regions of a blob if !shared
Sage Weil [Mon, 13 Mar 2017 11:32:12 +0000 (07:32 -0400)]
os/bluestore: only discard deallocated regions of a blob if !shared

If a blob is shared, we can't discard deallocated regions: there may
be deferred buffers in flight and we might get a read via the clone.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: avoid waking up kv thread on deferred write completion
Sage Weil [Thu, 9 Mar 2017 16:53:06 +0000 (11:53 -0500)]
os/bluestore: avoid waking up kv thread on deferred write completion

In a simple HDD workload with queue depth of 1, we halve our throughput
because the kv thread does a full commit twice per IO: once for the
initial commit, and then again to clean up the deferred write record. The
second wakeup is unnecessary; we can clean it up on the next commit.

We do need to do this wakeup in a few cases, though, when draining the
OpSequencers: (1) on replay during startup, and (2) on shutdown in
_osr_drain_all().

Send everything through _osr_drain_all() for simplicity.

This doubles HDD qd=1 IOPS from ~50 to ~100 on my 7200 rpm test device
(rados bench 30 write -b 4096 -t 1).

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: move many initializations into header
Sage Weil [Thu, 9 Mar 2017 15:34:50 +0000 (10:34 -0500)]
os/bluestore: move many initializations into header

This is less fragile, especially with 2 constructors.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: restructure deferred write queue
Sage Weil [Thu, 9 Mar 2017 02:53:22 +0000 (21:53 -0500)]
os/bluestore: restructure deferred write queue

First, eliminate the work queue--it's useless.  We are dispatching aio and
should not block.  And if a single thread isn't sufficient to do it, it
probably means we should be parallelizing kv_sync_thread too (which is our
only caller that matters).

Repurpose the old osr-list -> txc-list-per-osr queue structure to manage
the queuing.  For any given osr, dispatch one batch of aios at a time,
taking care to collapse any overwrites so that the latest write wins.

Signed-off-by: Sage Weil <sage@redhat.com>
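
A minimal sketch of "the latest write wins", assuming every deferred write to a given offset covers exactly the same range: replaying the writes oldest first into a map keyed by offset leaves only the newest data per offset. Partial overlaps are not handled, and DeferredWrite and collapse are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical deferred write: a device offset and the bytes to put there.
struct DeferredWrite {
  uint64_t offset;
  std::string data;
};

// Collapse a sequence of writes (oldest first) into one batch in which each
// offset appears once, carrying the data of the most recent write to it.
inline std::map<uint64_t, std::string> collapse(
    const std::vector<DeferredWrite>& writes) {
  std::map<uint64_t, std::string> batch;
  for (const auto& w : writes) {
    batch[w.offset] = w.data;  // a later write replaces an earlier one
  }
  return batch;
}
```

Keying by offset also yields the batch in ascending device order, which is one plausible way to submit it.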
8 years ago os/bluestore: fix OpSequencer/Sequencer lifecycle
Sage Weil [Fri, 10 Mar 2017 03:56:28 +0000 (22:56 -0500)]
os/bluestore: fix OpSequencer/Sequencer lifecycle

Make osr_set refcounts so that it can tolerate a Sequencer destruction
racing with flush or a Sequencer that outlives the BlueStore instance
itself.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: move _osr_reap_done
Sage Weil [Wed, 8 Mar 2017 20:01:35 +0000 (15:01 -0500)]
os/bluestore: move _osr_reap_done

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: reimplement/rename _sync -> _flush_all
Sage Weil [Wed, 8 Mar 2017 20:01:28 +0000 (15:01 -0500)]
os/bluestore: reimplement/rename _sync -> _flush_all

The old implementation is racy and doesn't actually work.  Instead, rely
on a list of all OpSequencers and drain them all.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: keep all OpSequencers registered
Sage Weil [Tue, 14 Mar 2017 02:49:41 +0000 (22:49 -0400)]
os/bluestore: keep all OpSequencers registered

Maintain the set of all live OpSequencers.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: keep onode refs for lifetime of obc
Sage Weil [Sat, 11 Mar 2017 19:30:53 +0000 (14:30 -0500)]
os/bluestore: keep onode refs for lifetime of obc

This ensures that we don't trim an onode from the cache while it has a
txc that is still in flight.  Which in turn ensures that if we try to read
the object, we will have any writing buffers available.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: make OnodeSpace onode_map private
Sage Weil [Sat, 11 Mar 2017 19:21:47 +0000 (14:21 -0500)]
os/bluestore: make OnodeSpace onode_map private

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: make Sequencer::flush() more efficient
Sage Weil [Thu, 9 Mar 2017 23:05:48 +0000 (18:05 -0500)]
os/bluestore: make Sequencer::flush() more efficient

BlueStore collection methods only need preceding transactions to be
applied to the kv db; they do not need to be committed.

Note that this is *only* needed for collection listings; all other read
operations are immediately safe after queue_transactions().

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: add OpSequencer::drain()
Sage Weil [Tue, 14 Mar 2017 02:49:37 +0000 (22:49 -0400)]
os/bluestore: add OpSequencer::drain()

Currently this is the same as flush, but more precisely it is an internal
method that means all txc's must complete.  Update _wal_apply() to use it
instead of flush(), which is part of the public Sequencer interface.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: revert throttle perfcounters
Sage Weil [Wed, 8 Mar 2017 20:45:31 +0000 (15:45 -0500)]
os/bluestore: revert throttle perfcounters

This reverts 3e40595f3cd8626cdceffa4a3a4efb088127f726

The individual throttles have their own set of perfcounters; no need to
duplicate them here.

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: release deferred throttle on io finish, before cleanup
Sage Weil [Wed, 8 Mar 2017 19:57:52 +0000 (14:57 -0500)]
os/bluestore: release deferred throttle on io finish, before cleanup

The throttle is really about limiting deferred IO; we do not need to
actually remove the deferred record from the kv db before queueing more.
(In fact, the txc that queues more will do the cleanup.)

Signed-off-by: Sage Weil <sage@redhat.com>
8 years ago os/bluestore: separate _txc_finish_kv into _txc_{applied,committed}_kv
Sage Weil [Wed, 8 Mar 2017 19:51:39 +0000 (14:51 -0500)]
os/bluestore: separate _txc_finish_kv into _txc_{applied,committed}_kv

We can unblock flush()ing threads as soon as we have applied to the kv db,
while the callbacks must wait until we have committed.

Move methods around a bit to better match the execution order.

Signed-off-by: Sage Weil <sage@redhat.com>
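
A single-threaded sketch of the split this commit describes, assuming a toy transaction object: flush() waiters are released as soon as the transaction is applied to the kv db, while commit callbacks run only after it is committed. TxnState and its methods are hypothetical and leave out all locking and threading.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical model of one transaction's two kv milestones.
class TxnState {
 public:
  // Register a callback that must not run before the commit is durable.
  void on_commit(std::function<void()> cb) {
    callbacks_.push_back(std::move(cb));
  }

  // Milestone 1: the transaction is visible in the kv db.
  void kv_applied() { applied_ = true; }

  // Milestone 2: the transaction is durable; now run the callbacks.
  void kv_committed() {
    assert(applied_);  // commit can only follow apply
    committed_ = true;
    for (auto& cb : callbacks_) cb();
    callbacks_.clear();
  }

  // A flush() only needs milestone 1.
  bool flush_would_block() const { return !applied_; }
  bool committed() const { return committed_; }

 private:
  bool applied_ = false;
  bool committed_ = false;
  std::vector<std::function<void()>> callbacks_;
};
```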
8 years ago os/bluestore: make flush() only wait for kv commit
Sage Weil [Wed, 8 Mar 2017 19:48:12 +0000 (14:48 -0500)]
os/bluestore: make flush() only wait for kv commit

The only remaining flush() users only need to see previous txc's applied
to the kv db (e.g., _omap_clear needs to see the records to delete them).

Signed-off-by: Sage Weil <sage@redhat.com>
# Conflicts:
# src/os/bluestore/BlueStore.h

8 years ago os/bluestore: no need to Onode::flush() on truncate
Sage Weil [Wed, 8 Mar 2017 19:45:27 +0000 (14:45 -0500)]
os/bluestore: no need to Onode::flush() on truncate

We do not release extents until after any deferred IO, so this flush() is
unnecessary.

Signed-off-by: Sage Weil <sage@redhat.com>
# Conflicts:
# src/os/bluestore/BlueStore.cc

8 years ago os/bluestore: no need to Onode::flush() in _do_read
Sage Weil [Mon, 6 Mar 2017 18:51:30 +0000 (13:51 -0500)]
os/bluestore: no need to Onode::flush() in _do_read

We now ensure that deferred writes are in cache until the txc retires,
so there is no need to wait here.

Signed-off-by: Sage Weil <sage@redhat.com>