git.apps.os.sepia.ceph.com Git - ceph-ci.git log
Kefu Chai [Sat, 5 Jun 2021 04:19:41 +0000 (12:19 +0800)]
script/ceph-debug-docker: s/x86_64/$(arch)/

try to avoid hardwiring to a certain architecture if possible.

Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Sat, 5 Jun 2021 02:06:07 +0000 (10:06 +0800)]
Merge pull request #41581 from tchaikov/wip-options-mgr-mon

common/options: extract mgr and mon options out

Reviewed-by: Neha Ojha <nojha@redhat.com>
Kefu Chai [Sat, 5 Jun 2021 00:44:42 +0000 (08:44 +0800)]
Merge pull request #40073 from jmolmo/delete_service_causes_osd_removal

mgr/cephadm: Warn about OSDs to remove manually when deleting an OSD service

Reviewed-by: Sebastian Wagner <sewagner@redhat.com>
Reviewed-by: Adam King <adking@redhat.com>
Patrick Donnelly [Fri, 4 Jun 2021 20:07:42 +0000 (13:07 -0700)]
Merge PR #41697 into master

* refs/pull/41697/head:
script: add a few more volume mounts for sepia
script: drop ceph-fuse from docker debugging
script: enable centos debuginfo repo for debugging
script: update repo url for multi-arch builds
script: fetch autobuild.asc key via HTTPS

Reviewed-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Fri, 4 Jun 2021 17:57:03 +0000 (01:57 +0800)]
Merge pull request #41690 from tchaikov/wip-test-alloc_aging

test/objectstore/unittest_alloc_aging: init cct

Reviewed-by: Igor Fedotov <ifedotov@suse.com>
Kefu Chai [Fri, 4 Jun 2021 17:23:35 +0000 (01:23 +0800)]
Merge pull request #41698 from tchaikov/wip-qa-rook

qa/suites/orch/rook/smoke: stop testing on ubuntu 18.04

Reviewed-by: Sage Weil <sage@redhat.com>
Kefu Chai [Fri, 4 Jun 2021 17:11:13 +0000 (01:11 +0800)]
qa/suites/orch/rook/smoke: stop testing on ubuntu 18.04

even though rook does not install ceph packages on the host directly
(it uses the ceph container image), teuthology insists on checking the
existence of debian packages by querying the shaman server when it sees a
teuthology facet file which includes:

os_type: ubuntu
os_version: "18.04"

but since we've stopped building ubuntu/bionic packages, teuthology
complains when we schedule test suites composed from facets in
qa/suites/orch/rook/smoke.

in this change, ubuntu_18.04.yaml is dropped because ubuntu/bionic
does not really increase the test coverage of ceph, although it does
help to test rook and the container runtime.

Signed-off-by: Kefu Chai <kchai@redhat.com>
Patrick Donnelly [Fri, 4 Jun 2021 16:33:54 +0000 (09:33 -0700)]
script: add a few more volume mounts for sepia

We now have a few Ceph file systems with various possible mount points
depending on which lab machine you're using.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Patrick Donnelly [Fri, 4 Jun 2021 16:33:30 +0000 (09:33 -0700)]
script: drop ceph-fuse from docker debugging

Install this on the fly as necessary...

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Patrick Donnelly [Fri, 4 Jun 2021 16:32:52 +0000 (09:32 -0700)]
script: enable centos debuginfo repo for debugging

So we can fetch e.g. the sqlite debuginfo packages.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Patrick Donnelly [Fri, 4 Jun 2021 16:31:19 +0000 (09:31 -0700)]
script: update repo url for multi-arch builds

Brad suggested this change based on his commit [1]. Thank you!

[1] https://github.com/ceph/ceph-ansible/commit/267cce9e8360fc8cb9c192fde2406e5dca724610

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Patrick Donnelly [Fri, 4 Jun 2021 16:30:04 +0000 (09:30 -0700)]
script: fetch autobuild.asc key via HTTPS

Rather than relying on the key being available on the LRC /ceph file system.
(Someone appears to have deleted it recently.)

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Kefu Chai [Fri, 4 Jun 2021 12:13:54 +0000 (20:13 +0800)]
Merge pull request #41679 from AmnonHanuhov/wip-get_rid_of_pending_q

crimson/net: Use out_q instead of pending_q

Reviewed-by: Yingxin Cheng <yingxin.cheng@intel.com>
Amnon Hanuhov [Thu, 3 Jun 2021 13:57:41 +0000 (16:57 +0300)]
crimson/net: Use out_q instead of pending_q

pending_q contains the same messages as in out_q and it is only used
for creating a bytestream out of these messages. We can just use out_q for that.
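
A minimal Python sketch of the idea, assuming a simplified connection that keeps a single queue of outgoing messages. `Connection`, `Message`, `encode()` and `sweep_out_q()` are hypothetical names, not crimson's C++ API, and the framing is made up; the point is only that the bytestream is built by walking `out_q` itself rather than a duplicate `pending_q`.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class Message:
    seq: int
    payload: bytes

    def encode(self) -> bytes:
        # hypothetical framing: 8-byte big-endian sequence number + payload
        return self.seq.to_bytes(8, "big") + self.payload


@dataclass
class Connection:
    # the only queue of outgoing messages; there is no pending_q copy
    out_q: Deque[Message] = field(default_factory=deque)

    def send(self, msg: Message) -> None:
        self.out_q.append(msg)

    def sweep_out_q(self) -> bytes:
        # build the bytestream straight from out_q, draining it as we go
        chunks = []
        while self.out_q:
            chunks.append(self.out_q.popleft().encode())
        return b"".join(chunks)
```

Keeping one queue means there is no second copy that can drift out of sync with `out_q`.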

Signed-off-by: Amnon Hanuhov <ahanukov@redhat.com>
Kefu Chai [Fri, 4 Jun 2021 09:15:06 +0000 (17:15 +0800)]
Merge pull request #41631 from tchaikov/wip-keyring-decode

auth/KeyRing: always decode keyring as plaintext

Reviewed-by: Willem Jan Withagen <wjw@digiware.nl>
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Kefu Chai [Fri, 4 Jun 2021 09:00:48 +0000 (17:00 +0800)]
Merge pull request #41587 from cfsnyder/bugfix_47738

mgr/DaemonServer.cc: prevent mgr crashes caused by integer underflow that is triggered by large increases to pg_num/pgp_num

Reviewed-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Fri, 4 Jun 2021 08:59:24 +0000 (16:59 +0800)]
Merge pull request #41592 from tchaikov/wip-ceph-default-confffile

ceph.in: use rados.Rados.DEFAULT_CONF_FILES

Reviewed-by: Neha Ojha <nojha@redhat.com>
Kefu Chai [Fri, 4 Jun 2021 08:58:59 +0000 (16:58 +0800)]
Merge pull request #41594 from tchaikov/wip/test/librados/list

test/librados/list: print reason why test fails

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>
Kefu Chai [Fri, 4 Jun 2021 08:57:41 +0000 (16:57 +0800)]
Merge pull request #36941 from hoamer/patch-1

doc/mgr/administrator: add a more precise description for creating key

Reviewed-by: Kefu Chai <kchai@redhat.com>
hoamer [Wed, 2 Sep 2020 07:13:12 +0000 (09:13 +0200)]
doc/mgr/administrator: add a more precise description for creating key

added a more precise description of how to handle the filename when creating the key for mgr

Signed-off-by: hoamer <kontakt@sebastian-neugebauer.de>
Kefu Chai [Wed, 2 Jun 2021 09:54:18 +0000 (17:54 +0800)]
test/objectstore/unittest_alloc_aging: init cct

* initialize the cct used by the test, otherwise g_ceph_context is
  not set at all.
* instead of using g_ceph_context, use the static member variable cct,
  for less dependency on the global instance.
* set up and tear down the cct for the test suite, because global_init()
  initializes g_ceph_context, which cannot be set multiple times.

Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Wed, 2 Jun 2021 09:38:49 +0000 (17:38 +0800)]
test/objectstore: s/TearDownTestCase/TearDownTestSuite/

TearDownTestCase is deprecated by GTest. let's use the new API instead.

Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Fri, 4 Jun 2021 05:50:38 +0000 (13:50 +0800)]
Merge pull request #41652 from tchaikov/wip-qa-asock-or

qa/tasks/admin_socket: support "foo || bar" as command

Reviewed-by: Samuel Just <sjust@redhat.com>
Kefu Chai [Fri, 4 Jun 2021 04:30:23 +0000 (12:30 +0800)]
Merge pull request #41686 from t-msn/update-trace-doc

doc/dev: update how to use lttng/blkin trace

Reviewed-by: Kefu Chai <kchai@redhat.com>
Misono Tomohiro [Fri, 4 Jun 2021 02:36:49 +0000 (11:36 +0900)]
doc/dev: update how to use lttng/blkin trace

Update doc to reflect current status.

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Sage Weil [Fri, 4 Jun 2021 02:04:55 +0000 (22:04 -0400)]
Merge PR #41553 into master

* refs/pull/41553/head:
ceph-volume: replace __ with _ in device_id

Reviewed-by: Kefu Chai <kchai@redhat.com>
Sage Weil [Fri, 4 Jun 2021 02:04:32 +0000 (22:04 -0400)]
Merge PR #41636 into master

* refs/pull/41636/head:
mgr/cephadm/inventory: do not try to resolve current mgr host
pybind/mgr/mgr_module: make get_mgr_ip() return mgr's IP from mgrmap
mgr/restful: use get_mgr_ip() instead of hostname

Reviewed-by: Adam King <adking@redhat.com>
Reviewed-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Fri, 4 Jun 2021 01:44:58 +0000 (09:44 +0800)]
Merge pull request #41674 from tchaikov/wip-vstart-without-restful

vstart.sh: add an option named --without-restful

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 23:50:44 +0000 (07:50 +0800)]
Merge pull request #41670 from tchaikov/wip-op-tracking-spin-off-0

crimson, common: improve const-correctness of Operation::dump()s.

Reviewed-by: Samuel Just <sjust@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 23:50:21 +0000 (07:50 +0800)]
Merge pull request #41672 from tchaikov/wip-crimson-test-handle-fut

test/crimson/seastore: always handle returned future<>

Reviewed-by: Samuel Just <sjust@redhat.com>
Patrick Donnelly [Thu, 3 Jun 2021 20:34:54 +0000 (13:34 -0700)]
Merge PR #41654 into master

* refs/pull/41654/head:
mds: do not infinitely recursively print a metric

Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Patrick Donnelly [Thu, 3 Jun 2021 20:33:58 +0000 (13:33 -0700)]
Merge PR #41639 into master

* refs/pull/41639/head:
mds/scrub: write root inode backtrace at creation

Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Patrick Donnelly [Thu, 3 Jun 2021 20:33:27 +0000 (13:33 -0700)]
Merge PR #41499 into master

* refs/pull/41499/head:
qa/tasks/mds_thrash: fix thrash iteration never skip

Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Patrick Donnelly [Thu, 3 Jun 2021 20:23:17 +0000 (13:23 -0700)]
Merge PR #41443 into master

* refs/pull/41443/head:
test: update log-ignorelist for fs:mirror test

Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Patrick Donnelly [Thu, 3 Jun 2021 20:22:23 +0000 (13:22 -0700)]
Merge PR #39910 into master

* refs/pull/39910/head:
test: Add test for mgr hang when osd is full
mgr: Set client_check_pool_perm to false
mds: Add full caps to avoid osd full check

Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Dan Mick [Thu, 3 Jun 2021 18:32:24 +0000 (11:32 -0700)]
Merge pull request #41559 from dmick/wip-grafana-container

monitoring/grafana/build/Makefile: revamp for arm64 builds, pushes to docker and quay, jenkins

Sage Weil [Thu, 3 Jun 2021 14:29:00 +0000 (10:29 -0400)]
mgr/cephadm/inventory: do not try to resolve current mgr host

The CNI configuration may set up a private network for the container, which
is mapped to the hostname in /etc/hosts.  For example, my test box sets
up 10.88.0.0/24 because I was using crio + kubeadm on this host earlier
(at least I think that's why):

$ sudo podman run --rm --name test123 --entrypoint /bin/bash -it quay.ceph.io/ceph-ci/ceph:master -c "cat /etc/hosts"
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.88.0.8 f9e91bf2478f test123

In any case, we should never trust a lookup of our own hostname from inside
a container!

This isn't quite sufficient, though: if this is a single-host cluster, then
we fall back to using get_mgr_ip(). That value may be distorted by the
public_network option on the mgr, but we don't have any other good
options here, and single-node clusters are unlikely to have complex
network configs.

Refactor a bit to avoid the try/except nesting.
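
The selection order can be sketched in Python roughly as follows. This is a simplified, hypothetical helper (`resolve_host_addr`, `inventory_addrs` and the injectable `resolver` are illustrative names, not the actual cephadm inventory code); it only shows the flat ordering: prefer an already-known address, never resolve the mgr's own hostname, and fall back to `get_mgr_ip()` for it.

```python
import socket
from typing import Callable, Dict


def resolve_host_addr(
    hostname: str,
    mgr_hostname: str,
    inventory_addrs: Dict[str, str],
    get_mgr_ip: Callable[[], str],
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> str:
    """Pick an address for `hostname` without nested try/except blocks."""
    # 1. an address already recorded in the inventory always wins
    addr = inventory_addrs.get(hostname)
    if addr:
        return addr
    # 2. never look up our own hostname from inside the container: a CNI
    #    setup may map it to a private address (e.g. 10.88.0.x) in /etc/hosts
    if hostname == mgr_hostname:
        return get_mgr_ip()
    # 3. other hosts are resolved normally
    return resolver(hostname)
```

Passing the resolver in as a parameter keeps the sketch testable without touching real DNS or /etc/hosts.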

Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Wed, 2 Jun 2021 02:31:11 +0000 (22:31 -0400)]
pybind/mgr/mgr_module: make get_mgr_ip() return mgr's IP from mgrmap

The previous approach was convoluted: we tried to do a DNS lookup on the
hostname, which would fail if /etc/hosts had an entry.  Which, with podman,
it does.  And the IP it has will vary in all sorts of weird ways.  For
example, CNI on my host means that I get a dynamic address in 10.88.0.0/24.

Avoid all of that nonsense and use the IP that is in the mgrmap.  There
may be multiple IPs (v2 + v1, or maybe even IPv4 + v6 in the future); in
that case, use the first one.
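
A minimal sketch of "use the first IP in the mgrmap", assuming the active mgr's addresses are available as a dict shaped like `{"active_addrs": {"addrvec": [{"type": "v2", "addr": "IP:PORT", ...}, ...]}}`. That layout and the helper name `mgr_ip_from_mgrmap` are assumptions for illustration, not the exact mgr_module code.

```python
from typing import Any, Dict


def mgr_ip_from_mgrmap(mgrmap: Dict[str, Any]) -> str:
    """Return the IP of the first address of the active mgr in the mgrmap."""
    addrvec = mgrmap["active_addrs"]["addrvec"]
    if not addrvec:
        raise ValueError("mgrmap lists no address for the active mgr")
    # v2 + v1 (or IPv4 + IPv6) entries may all be present; use the first one
    addr = addrvec[0]["addr"]                 # e.g. "10.0.0.1:6800" or "[::1]:6800"
    host, _sep, _port = addr.rpartition(":")  # drop the trailing port
    return host.strip("[]")                   # unwrap a bracketed IPv6 literal
```

No DNS lookup is involved, so whatever /etc/hosts says inside the container no longer matters.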

Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Wed, 2 Jun 2021 02:31:47 +0000 (22:31 -0400)]
mgr/restful: use get_mgr_ip() instead of hostname

Now we match dashboard!

Signed-off-by: Sage Weil <sage@newdream.net>
Neha Ojha [Thu, 3 Jun 2021 15:39:22 +0000 (08:39 -0700)]
Merge pull request #41308 from sseshasa/wip-osd-benchmark-for-mclock

osd: Run osd bench test to override default max osd capacity for mclock

Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Casey Bodley [Thu, 3 Jun 2021 15:05:00 +0000 (11:05 -0400)]
Merge pull request #41316 from cbodley/wip-50785

rgw: parse tenant name out of rgwx-bucket-instance

Reviewed-by: Daniel Gryniewicz <dang@redhat.com>
Reviewed-by: Shilpa Jagannath <smanjara@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 14:40:26 +0000 (22:40 +0800)]
Merge pull request #41677 from tchaikov/wip-oom

ceph.spec.in: increase the mem_per_job to 3GiB

Reviewed-by: David Galloway <dgallowa@redhat.com>
Casey Bodley [Thu, 3 Jun 2021 14:28:35 +0000 (10:28 -0400)]
Merge pull request #41668 from pleiadesian/patch-bucket-chown

rgw: require bucket name in bucket chown

Reviewed-by: Or Friedmann <ofriedma@redhat.com>
Reviewed-by: Daniel Gryniewicz <dang@redhat.com>
Casey Bodley [Thu, 3 Jun 2021 14:16:30 +0000 (10:16 -0400)]
Merge pull request #41462 from yehudasa/wip-50920

rgw: auth v4 client: don't convert '+' to space

Reviewed-by: Casey Bodley <cbodley@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 12:48:53 +0000 (20:48 +0800)]
cmake: increase the MAX_{LINK,COMPILE}_MEM

based on recent observations, quite a few C++ source files take
more than 3.0GiB to compile. for instance,
test_mock_HttpClient.cc could take up to 6270MiB of memory to compile.

so increase MAX_{LINK,COMPILE}_MEM accordingly.

Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 12:41:36 +0000 (20:41 +0800)]
ceph.spec.in: increase the mem_per_job to 3GiB

to lower the number of jobs. we are experiencing build failures on
a builder with 48c96t and 193 free mem. the failures were caused by
the OOM killer, which kills the c++ compiler:

[498376.128969] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/jenkins.service,task=cc1plus,pid=1387895,uid=1110
[498376.145288] Out of memory: Killed process 1387895 (cc1plus) total-vm:3323312kB, anon-rss:3164568kB, file-rss:0kB, shmem-rss:0kB, UID:1110
[498376.315185] oom_reaper: reaped process 1387895 (cc1plus), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[498377.882072] cc1plus invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

before this change, we used the total memory to calculate the number
of jobs, and assumed that each job takes at most 2.5GiB of memory. in the
case above, the number of jobs is 96.

after this change, we use the free memory, and increase the memory per job
to 3.0GiB. in the case above, the number of jobs would be 85.
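
The job-count arithmetic can be sketched as follows, assuming the number of jobs is the smaller of the CPU count and the available memory divided by the per-job budget. `build_jobs` is a hypothetical helper; the real logic lives in ceph.spec.in and may round or cap differently.

```python
import math


def build_jobs(ncpus: int, mem_gib: float, mem_per_job_gib: float) -> int:
    """Number of parallel compile jobs, bounded by CPUs and by memory."""
    jobs_by_mem = math.floor(mem_gib / mem_per_job_gib)
    # never go below one job, never exceed the CPU count
    return max(1, min(ncpus, jobs_by_mem))
```

For example, on a hypothetical builder with 16 CPUs and 40 GiB of memory, `build_jobs(16, 40, 2.5)` gives 16 jobs, while `build_jobs(16, 40, 3.0)` gives 13: raising the per-job budget lowers parallelism once memory, not CPU count, is the limit.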

Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 11:45:23 +0000 (19:45 +0800)]
Merge pull request #41669 from tchaikov/wip-crimson-asok-dump-metrics

crimson/admin: s/perf dump_seastar/dump_metrics/

Reviewed-by: Amnon Hanuhov <ahanukov@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 10:45:48 +0000 (18:45 +0800)]
vstart.sh: use here document to display multi-line message

for better readability

Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 10:42:48 +0000 (18:42 +0800)]
vstart.sh: add an option named --without-restful

so we don't need to wait for the restful module to be loaded if we are
not working on this mgr module.

Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 10:38:08 +0000 (18:38 +0800)]
vstart.sh: extract create_mgr_restful_secret() out

for better readability, and so it's easier to make this step optional if
the developer is not interested in using the restful mgr module.

Signed-off-by: Kefu Chai <kchai@redhat.com>
Sridhar Seshasayee [Wed, 12 May 2021 14:50:20 +0000 (20:20 +0530)]
doc: Update mclock-config-ref to reflect automated OSD benchmarking

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 07:39:16 +0000 (15:39 +0800)]
Merge pull request #41671 from liu-chunmei/seastore-logger

crimson/seastore: cleanup ceph_subsystem_filestore to seastore

Reviewed-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 07:32:20 +0000 (15:32 +0800)]
test/crimson/seastore: declare return type explicitly

for better readability

Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 07:28:45 +0000 (15:28 +0800)]
test/crimson/seastore: always handle returned future<>

this change also silences the [-Wunused-result] warning.

Signed-off-by: Kefu Chai <kchai@redhat.com>
Radoslaw Zarzynski [Tue, 19 Jan 2021 16:05:47 +0000 (17:05 +0100)]
common: fix a formatting nit in OpTracker::dump_ops_in_flight().

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Radoslaw Zarzynski [Tue, 19 Jan 2021 16:05:12 +0000 (17:05 +0100)]
crimson: improve const-correctness of Operation::dump()s.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
chunmei-liu [Thu, 3 Jun 2021 06:41:42 +0000 (23:41 -0700)]
crimson/seastore: cleanup ceph_subsystem_filestore to seastore

Signed-off-by: chunmei-liu <chunmei.liu@intel.com>
Kefu Chai [Thu, 3 Jun 2021 06:33:52 +0000 (14:33 +0800)]
Merge pull request #41666 from tchaikov/wip-crimson-stop

crimson/osd: wait for SIGINT and SIGTERM before stopping

Reviewed-by: Chunmei Liu <chunmei.liu@intel.com>
Radoslaw Zarzynski [Mon, 17 May 2021 14:49:20 +0000 (14:49 +0000)]
qa: use dump_metrics as alternative of get_heap_property

"get_heap_property *" asock commands are exposed to operators
to check the tcmalloc internals for understanding the performance
of the memory subsystem. but crimson uses the builtin seastar allocator,
which is not backed by tcmalloc. instead, we can dump the metrics using
the "dump_metrics" asock command, which is only available from
crimson-osd.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Wed, 2 Jun 2021 14:06:22 +0000 (22:06 +0800)]
qa/tasks/admin_socket: support "foo || bar" as command

so we can cater to the needs of different implementations of osd, i.e.,
classic osd and crimson osd, which offer different sets of asock commands.
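
A minimal Python sketch of how a "foo || bar" command could be handled: split on `||`, try each alternative in order, and stop at the first one the daemon accepts. `run_with_alternatives`, `CommandFailed` and the `run` callback (standing in for sending an asock command) are hypothetical; the actual qa/tasks/admin_socket.py change may be structured differently.

```python
from typing import Callable, List, Optional


class CommandFailed(Exception):
    """Raised by `run` when the daemon rejects a command."""


def run_with_alternatives(command: str,
                          run: Callable[[List[str]], str]) -> str:
    """Try each "||"-separated alternative in order; return the first success."""
    last_error: Optional[CommandFailed] = None
    for alternative in command.split("||"):
        argv = alternative.split()
        if not argv:
            continue  # tolerate stray separators such as "foo ||"
        try:
            return run(argv)
        except CommandFailed as e:
            last_error = e
    raise CommandFailed(f"no alternative succeeded: {command!r}") from last_error
```

With this, a single command string such as `get_heap_property foo || dump_metrics` would be served by whichever osd implementation understands one of the alternatives.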

Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 05:58:40 +0000 (13:58 +0800)]
crimson/admin/osd_admin: sort forward declarations

Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 05:48:27 +0000 (13:48 +0800)]
crimson/admin: fix the indent

Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 05:45:05 +0000 (13:45 +0800)]
crimson/admin: s/perf dump_seastar/dump_metrics/

as this is a user-facing interface, there is no need to expose seastar in the name;
what matters to the user is the content, not the underlying technology or library.

so rename the command prefix to "dump_metrics"

Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 05:39:28 +0000 (13:39 +0800)]
crimson/admin: s/SeastarMetricsHook/DumpMetricsHook/

seastar is the name of one of the libraries used to implement crimson,
but the asok hook dumps not only builtin metrics in seastar, but also
the ones registered by crimson and seastore, so rename it to a more
general name.

Signed-off-by: Kefu Chai <kchai@redhat.com>
Zulai Wang [Thu, 3 Jun 2021 05:13:15 +0000 (13:13 +0800)]
rgw: require bucket name in bucket chown

Checking for and reporting the missing mandatory parameter avoids a confusing
error message for bucket chown.
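
The check amounts to failing early with a specific message. A minimal Python sketch, assuming a hypothetical `bucket_chown()` entry point that returns an errno-style exit code; the real change is in radosgw-admin's C++ code, and the message strings here are illustrative rather than its actual output.

```python
import errno
import sys
from typing import Optional, TextIO


def bucket_chown(bucket_name: Optional[str], new_owner: Optional[str],
                 err: TextIO = sys.stderr) -> int:
    """Validate mandatory arguments before doing any work."""
    if not bucket_name:
        print("ERROR: bucket name not specified", file=err)
        return errno.EINVAL
    if not new_owner:
        print("ERROR: new owner not specified", file=err)
        return errno.EINVAL
    # the actual ownership change would happen here
    return 0
```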

Signed-off-by: Zulai Wang <zl31wang@gmail.com>
Kefu Chai [Thu, 3 Jun 2021 05:26:17 +0000 (13:26 +0800)]
crimson/osd: wait for SIGINT and SIGTERM before stopping

this change addresses a regression introduced by
37b83f4ed7ca69f105b93bf482cb2289cbaf9a4d, as we should not stop
services without being asked to do so.

in this change, signal handlers for SIGINT and SIGTERM are registered to
handle these signals, and in the seastar thread, we wait until either of
these two signals is caught.
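
The same pattern can be sketched with Python's asyncio, swapped in here for seastar: register handlers for SIGINT and SIGTERM, then wait until either one fires before tearing anything down. `wait_for_stop_signal()` and `run_osd()` are hypothetical, and `loop.add_signal_handler()` is Unix-only.

```python
import asyncio
import signal


async def wait_for_stop_signal() -> int:
    """Block until SIGINT or SIGTERM is delivered; return its number."""
    loop = asyncio.get_running_loop()
    caught = loop.create_future()

    def on_signal(signum: int) -> None:
        if not caught.done():
            caught.set_result(signum)

    stop_signals = (signal.SIGINT, signal.SIGTERM)
    for sig in stop_signals:
        loop.add_signal_handler(sig, on_signal, sig)
    try:
        return await caught
    finally:
        for sig in stop_signals:
            loop.remove_signal_handler(sig)


async def run_osd() -> int:
    # ... start services (hypothetical) ...
    signum = await wait_for_stop_signal()
    # ... stop services only now that we have been asked to ...
    return signum
```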

Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 01:36:15 +0000 (09:36 +0800)]
Merge pull request #41627 from tchaikov/wip-mgr-repl-doc

doc/mgr/modules: add a "debugging" section

Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 01:34:56 +0000 (09:34 +0800)]
Merge pull request #41138 from kalebskeithley/python39

do_cmake: build with python3.9 on RHEL9

Reviewed-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 01:29:19 +0000 (09:29 +0800)]
do_cmake: build with python3.9 on RHEL9

rhel9 has python3.9 as of rhel9beta

Signed-off-by: Kaleb S KEITHLEY <kkeithle@redhat.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Thu, 3 Jun 2021 01:16:42 +0000 (09:16 +0800)]
Merge pull request #41496 from Huber-ming/correct_spell

rgw: correct the spelling of "instace"

Reviewed-by: Kefu Chai <kchai@redhat.com>
Patrick Donnelly [Wed, 2 Jun 2021 15:18:22 +0000 (08:18 -0700)]
Merge PR #41635 into master

* refs/pull/41635/head:
qa: increase fragmentation to improve uniform distribution

Reviewed-by: Ramana Raja <rraja@redhat.com>
Kefu Chai [Wed, 2 Jun 2021 14:43:40 +0000 (22:43 +0800)]
Merge pull request #41644 from rzarzynski/wip-crimson-fix-blocked-peering

crimson/monc: fix subscription stall that blocked peering.

Reviewed-by: Kefu Chai <kchai@redhat.com>
Patrick Donnelly [Wed, 2 Jun 2021 14:28:49 +0000 (07:28 -0700)]
mds: do not infinitely recursively print a metric

Fixes: b1b44d775df3160d937c068d5e1079e24199ed6b
Fixes: https://tracker.ceph.com/issues/51067
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Sage Weil [Wed, 2 Jun 2021 14:27:03 +0000 (10:27 -0400)]
Merge PR #41651 into master

* refs/pull/41651/head:
doc/cephadm: s/the the/the

Reviewed-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Wed, 2 Jun 2021 14:10:12 +0000 (22:10 +0800)]
Merge pull request #41645 from tchaikov/wip-crimson-osd-mkfs

crimson/osd: check existing superblock when mkfs

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Zac Dover [Wed, 2 Jun 2021 14:06:06 +0000 (00:06 +1000)]
doc/cephadm: s/the the/the

This removes an extraneous "the" and reworks a
sentence so that it adheres to the grammatical
rules of the English language.

Signed-off-by: Zac Dover <zac.dover@gmail.com>
Kefu Chai [Wed, 2 Jun 2021 12:57:14 +0000 (20:57 +0800)]
crimson/osd: check existing superblock when mkfs

in case mkfs is run on an existing store.

this change mirrors the behavior of classic osd, and also addresses the
assert failure when BlueStore tries to create a collection when it
already contains a collection with the same collection id.
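
A simplified Python sketch of an idempotent mkfs, assuming a hypothetical in-memory store whose meta collection holds at most one superblock. It creates the superblock only when none exists, and otherwise verifies that the existing one belongs to the same cluster and osd id instead of recreating collections. `FakeStore`, `Superblock` and `mkfs()` are illustrative names, not crimson's API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Superblock:
    cluster_fsid: str
    osd_fsid: str
    whoami: int


class FakeStore:
    """Hypothetical in-memory stand-in for a store's meta collection."""

    def __init__(self) -> None:
        self.meta: Dict[str, Superblock] = {}


def mkfs(store: FakeStore, sb: Superblock) -> str:
    existing: Optional[Superblock] = store.meta.get("superblock")
    if existing is None:
        # fresh store: create the meta collection contents and the superblock
        store.meta["superblock"] = sb
        return "created"
    # store already formatted: verify it instead of re-creating collections
    if existing.cluster_fsid != sb.cluster_fsid or existing.whoami != sb.whoami:
        raise ValueError(
            f"store already belongs to osd.{existing.whoami} "
            f"of cluster {existing.cluster_fsid}"
        )
    return "already exists"
```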

Signed-off-by: Kefu Chai <kchai@redhat.com>
Kefu Chai [Wed, 2 Jun 2021 12:47:03 +0000 (20:47 +0800)]
crimson/osd: extract OSD::_write_superblock() out

prepare for the change to verify the existing meta collection and the
superblock stored in it.

Signed-off-by: Kefu Chai <kchai@redhat.com>
Radoslaw Zarzynski [Wed, 2 Jun 2021 11:59:37 +0000 (11:59 +0000)]
crimson/monc: fix subscription stall that blocked peering.

There is a scenario in which the `active_con` is properly
chosen but isn't marked as `ready_to_send`.
If `renew_subs()` is called during `on_session_opened()`,
the flag is only turned on after the subscriptions are
renewed, but the renewal cannot happen because it requires the
flag to be already set. In other words: there is a circular data dependency.
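
The cycle can be modeled in a few lines of Python. `MonClientSketch` and its methods are hypothetical; the "ready first" variant simply marks the connection ready before renewing, which is one way to break such a cycle and not necessarily what this commit does.

```python
from typing import List, Optional


class MonClientSketch:
    """Hypothetical model of the active_con / ready_to_send interaction."""

    def __init__(self, wanted_subs: List[str]) -> None:
        self.active_con: Optional[str] = None
        self.ready_to_send = False
        self.wanted_subs = list(wanted_subs)
        self.sent_subs: List[str] = []

    def renew_subs(self) -> bool:
        # subscriptions can only be sent on a connection that is ready
        if self.active_con is None or not self.ready_to_send:
            return False
        self.sent_subs.extend(self.wanted_subs)
        return True

    def on_session_opened_stalling(self, con: str) -> None:
        self.active_con = con
        self.renew_subs()          # flag not set yet, so nothing is sent
        self.ready_to_send = True  # ...and nothing retries the renewal

    def on_session_opened_ready_first(self, con: str) -> None:
        self.active_con = con
        self.ready_to_send = True  # set the flag before renewing
        self.renew_subs()
```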

The net result is stalling the subscription machinery,
particularly the `OSDMap` subs. This caused a nasty peering
issue at Sepia [1] where PG 2.7 got stuck in the `GetInfo`
state.

```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-05-26_12:20:26-rados-master-distro-basic-smithi/6136908$ less ./remote/smithi039/log/ceph-osd.1.log.gz
...
DEBUG 2021-05-26 20:19:48,134 [shard 0] osd -  pg_epoch 14 pg[2.7( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0 les/c/f=0/0/0 sis=0) [] r=-1 lpr=0 crt=0'0 mlcod 0'0 unknown enter Initial
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=0 crt=0'0 mlcod 0'0 unknown enter Reset
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 unknown enter Started
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 unknown enter Start
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 unknown enter Started/Primary
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating enter Started/Primary/Peering
...
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering enter Started/Primary/Peering/GetInfo
DEBUG 2021-05-26 20:19:48,138 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering build_prior all_probe
DEBUG 2021-05-26 20:19:48,139 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering build_prior final: probe 0,1 down  blocked_by {}
DEBUG 2021-05-26 20:19:48,139 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering up_thru 0 < same_since 14, must notify monitor
DEBUG 2021-05-26 20:19:48,139 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>:  no prior_set down osds, clearing prior_readable_until_ub
DEBUG 2021-05-26 20:19:48,139 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>:  querying info from osd.0
...
DEBUG 2021-05-26 20:19:48,237 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering  got osd.0 2.7( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c=0/0 les/c/f=0/0/0 sis=0)
DEBUG 2021-05-26 20:19:48,237 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>: Adding osd: 0 peer features: 3f01cfbb7ffdffff
DEBUG 2021-05-26 20:19:48,237 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>: Common peer features: 3f01cfbb7ffdffff
DEBUG 2021-05-26 20:19:48,237 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>: Common acting features: 3f01cfbb7ffdffff
DEBUG 2021-05-26 20:19:48,238 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering state<Started/Primary/Peering/GetInfo>: Common upacting features: 3f01cfbb7ffdffff
DEBUG 2021-05-26 20:19:48,238 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering exit Started/Primary/Peering/GetInfo 0.099480 4 2021-05-26T20:19:48.146172+0000
...
DEBUG 2021-05-26 20:19:48,238 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering enter Started/Primary/Peering/GetLog
...
DEBUG 2021-05-26 20:19:48,238 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering enter Started/Primary/Peering/GetMissing
...
DEBUG 2021-05-26 20:19:48,238 [shard 0] osd -  pg_epoch 14 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+peering enter Started/Primary/Peering/WaitUpThru
...
DEBUG 2021-05-26 20:19:49,139 [shard 0] osd -  pg_epoch 15 pg[2.7( empty local-lis/les=0/0 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating enter Started/Primary/Active
...
DEBUG 2021-05-26 20:19:49,142 [shard 0] osd -  pg_epoch 15 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=0/0 les/c/f=0/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 creating+activating enter Started/Primary/Active/Activating
...
DEBUG 2021-05-26 20:19:49,204 [shard 0] osd -  pg_epoch 15 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/0 les/c/f=15/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 active enter Started/Primary/Active/Recovered
...
DEBUG 2021-05-26 20:19:49,204 [shard 0] osd -  pg_epoch 15 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/0 les/c/f=15/0/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 active enter Started/Primary/Active/Clean
...
DEBUG 2021-05-26 20:22:31,223 [shard 0] osd -  pg_epoch 86 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=14) [1,0] r=0 lpr=14 crt=0'0 mlcod 0'0 active enter Reset
...
<a lot of flipping>
...
DEBUG 2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown activate_map
DEBUG 2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown exit Reset 0.035744 1 2021-05-26T20:24:07.817331+0000
INFO  2021-05-26 20:24:07,851 [shard 0] osd - Exiting state: Reset, entered at 1622060647.81581881622060647.8173316 spent on 1 events
DEBUG 2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown enter Started
INFO  2021-05-26 20:24:07,851 [shard 0] osd - Entering state: Started
DEBUG 2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown enter Start
INFO  2021-05-26 20:24:07,851 [shard 0] osd - Entering state: Start
INFO  2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown state<Start>: transitioning to Primary
DEBUG 2021-05-26 20:24:07,851 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown exit Start 0.000041 0 0.000000
INFO  2021-05-26 20:24:07,851 [shard 0] osd - Exiting state: Start, entered at 1622060647.8516333, 0.0 spent on 0 events
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown enter Started/Primary
INFO  2021-05-26 20:24:07,852 [shard 0] osd - Entering state: Started/Primary
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 unknown enter Started/Primary/Peering
INFO  2021-05-26 20:24:07,852 [shard 0] osd - Entering state: Started/Primary/Peering
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering enter Started/Primary/Peering/GetInfo
INFO  2021-05-26 20:24:07,852 [shard 0] osd - Entering state: Started/Primary/Peering/GetInfo
...
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering build_prior all_probe 0,1,4
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering build_prior maybe_rw interval:139, acting: 0
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering build_prior final: probe 0,1,4 down  blocked_by {}
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering up_thru 125 < same_since 163, must notify monitor
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering state<Started/Primary/Peering/GetInfo>:  no prior_set down osds, clearing prior_readable_until_ub
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering state<Started/Primary/Peering/GetInfo>:  querying info from osd.0
DEBUG 2021-05-26 20:24:07,852 [shard 0] osd -  pg_epoch 163 pg[2.7( empty local-lis/les=14/15 n=0 ec=14/14 lis/c=14/14 les/c/f=15/15/0 sis=163) [1,0] r=0 lpr=163 pi=[14,163)/1 crt=0'0 mlcod 0'0 peering state<Started/Primary/Peering/GetInfo>:  querying info from osd.4
...
DEBUG 2021-05-26 20:24:07,924 [shard 0] ms - [osd.1(cluster) v2:172.21.15.39:6803/34727@61064 >> osd.4 v2:172.21.15.62:6802/34686] connect to existing
DEBUG 2021-05-26 20:24:07,924 [shard 0] ms - [osd.1(cluster) v2:172.21.15.39:6803/34727@61064 >> osd.4 v2:172.21.15.62:6802/34686] --> #62 === pg_query2(2.7 2.7 query(info 0'0 epoch_sent 163) e163/163) v1 (131)
...
DEBUG 2021-05-26 20:24:07,942 [shard 0] ms - [osd.1(cluster) v2:172.21.15.39:6803/34727@61064 >> osd.4 v2:172.21.15.62:6802/34686] GOT AckFrame: seq=62
...
<plenty of osd_ping messaging but no reply to the pg_query for 2.7>
...
DEBUG 2021-05-26 20:58:19,829 [shard 0] ms - [osd.1(hb_front) v2:172.21.15.39:6807/34727 >> osd.4 v2:172.21.15.62:6807/34686@54816] <== #772 === osd_ping(ping e17 up_from 10 ping_stamp 2021-05-26T20:58:19.825573+0000/2319.780029297s send_stamp 2319.780029297s) v5 (70)
DEBUG 2021-05-26 20:58:19,829 [shard 0] ms - [osd.1(hb_front) v2:172.21.15.39:6807/34727 >> osd.4 v2:172.21.15.62:6807/34686@54816] --> #772 === osd_ping(ping_reply e249 up_from 10 ping_stamp 2021-05-26T20:58:19.825573+0000/2319.780029297s send_stamp 2320.039062500s) v5 (70
```

The peering request got stuck waiting for an `OSDMap`.

```
DEBUG 2021-05-26 20:24:07,930 [shard 0] ms - [osd.4(cluster) v2:172.21.15.62:6802/34686 >> osd.1 v2:172.21.15.39:6803/34727@61064] <== #62 === pg_query2(2.7 2.7 query(info 0'0 epoch_sent 163) e163/163) v1 (131)
DEBUG 2021-05-26 20:24:07,930 [shard 0] osd - handle_peering_op on 2.7 from 1
DEBUG 2021-05-26 20:24:07,930 [shard 0] osd - peering_event(id=517, detail=PeeringEvent(from=1 pgid=2.7 sent=163 requested=163 evt=epoch_sent: 163 epoch_requested: 163 MQuery 2.7 from 1 query_epoch 163 query: query(info 0'0 epoch_sent 163))): star
```

```
INFO  2021-05-26 20:19:49,127 [shard 0] osd - evt epoch is 15, i have 14, will wait
INFO  2021-05-26 20:19:49,128 [shard 0] osd - osdmap_subscribe(14)
DEBUG 2021-05-26 20:19:49,128 [shard 0] ms - [osd.4(client) v2:172.21.15.62:6801/34686@63208 >> mon.1 v2:172.21.15.62:3300/0] --> #9 === mon_subscribe({osdmap=14}) v3 (15)
...
INFO  2021-05-26 20:19:49,131 [shard 0] osd - handle_osd_map osd_map(14..15 src has 1..15) v4
INFO  2021-05-26 20:19:49,131 [shard 0] osd - handle_osd_map epochs [14..15], i have 15, src has [1..15]
...
INFO  2021-05-26 20:19:49,138 [shard 0] osd - handle_osd_map osd_map(14..15 src has 1..15) v4
INFO  2021-05-26 20:19:49,138 [shard 0] osd - handle_osd_map epochs [14..15], i have 15, src has [1..15]
...
INFO  2021-05-26 20:19:49,139 [shard 0] osd - evt epoch is 15, i have 14, will wait
INFO  2021-05-26 20:19:49,141 [shard 0] osd - osdmap_subscribe(14)
WARN  2021-05-26 20:19:49,141 [shard 0] monc - renew_subs - empty
...
INFO  2021-05-26 20:19:50,140 [shard 0] osd - handle_osd_map osd_map(15..16 src has 1..16) v4
INFO  2021-05-26 20:19:50,140 [shard 0] osd - handle_osd_map epochs [15..16], i have 15, src has [1..16]
DEBUG 2021-05-26 20:19:50,141 [shard 0] bluestore - do_transaction
INFO  2021-05-26 20:19:50,145 [shard 0] osd - osd.4: committed_osd_maps(16, 16)
...
INFO  2021-05-26 20:20:42,881 [shard 0] osd - handle_osd_map epochs [16..17], i have 16, src has [1..17]
DEBUG 2021-05-26 20:20:42,882 [shard 0] bluestore - do_transaction
INFO  2021-05-26 20:20:42,886 [shard 0] osd - osd.4: committed_osd_maps(17, 17)
...
INFO  2021-05-26 20:20:43,941 [shard 0] osd - evt epoch is 18, i have 17, will wait
INFO  2021-05-26 20:20:43,941 [shard 0] osd - osdmap_subscribe(17)
...
INFO  2021-05-26 20:20:43,957 [shard 0] osd - evt epoch is 18, i have 17, will wait
INFO  2021-05-26 20:20:43,957 [shard 0] osd - osdmap_subscribe(17)
...
INFO  2021-05-26 20:20:43,969 [shard 0] osd - evt epoch is 18, i have 17, will wait
INFO  2021-05-26 20:20:43,969 [shard 0] osd - osdmap_subscribe(17)
...
DEBUG 2021-05-26 20:20:46,930 [shard 0] ms - [osd.4(client) v2:172.21.15.62:6801/34686@57288 >> mon.2 v2:172.21.15.39:3301/0] <== #4 === osd_map(20..21 src has 1..21) v4 (41)
INFO  2021-05-26 20:20:46,930 [shard 0] osd - handle_osd_map osd_map(20..21 src has 1..21) v4
INFO  2021-05-26 20:20:46,930 [shard 0] osd - handle_osd_map epochs [20..21], i have 17, src has [1..21]
INFO  2021-05-26 20:20:46,930 [shard 0] osd - handle_osd_map message skips epochs 18..19
INFO  2021-05-26 20:20:46,930 [shard 0] osd - osdmap_subscribe(18)
...
DEBUG 2021-05-26 20:20:47,936 [shard 0] ms - [osd.4(client) v2:172.21.15.62:6801/34686@57288 >> mon.2 v2:172.21.15.39:3301/0] <== #5 === osd_map(21..22 src has 1..22) v4 (41)
INFO  2021-05-26 20:20:47,936 [shard 0] osd - handle_osd_map osd_map(21..22 src has 1..22) v4
INFO  2021-05-26 20:20:47,936 [shard 0] osd - handle_osd_map epochs [21..22], i have 17, src has [1..22]
INFO  2021-05-26 20:20:47,936 [shard 0] osd - handle_osd_map message skips epochs 18..20
INFO  2021-05-26 20:20:47,936 [shard 0] osd - osdmap_subscribe(18)
...
<osdmap_subscribe(18) over and over>
```
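
The "evt epoch is N, i have M, will wait" lines show the peering event being parked until the OSD commits a newer map. The following minimal sketch models such an epoch gate, assuming a simple callback-based design. `EpochGate`, `wait_for()` and `advance()` are hypothetical names, not crimson's API. The sketch only illustrates why a parked event never runs if the epoch it needs is never committed:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical, simplified "wait for epoch" gate; not crimson's implementation.
using epoch_t = std::uint32_t;

class EpochGate {
public:
  explicit EpochGate(epoch_t current) : current_(current) {}

  // Run `fn` right away if the required epoch is already committed,
  // otherwise park it ("evt epoch is N, i have M, will wait").
  void wait_for(epoch_t need, std::function<void()> fn) {
    if (need <= current_) {
      fn();
    } else {
      waiters_.emplace(need, std::move(fn));
    }
  }

  // Called once a newer OSDMap has been committed: wake every waiter whose
  // required epoch is now satisfied. Waiters for later epochs stay parked.
  void advance(epoch_t committed) {
    if (committed <= current_) {
      return;
    }
    current_ = committed;
    auto last = waiters_.upper_bound(current_);
    std::vector<std::function<void()>> ready;
    for (auto it = waiters_.begin(); it != last; ++it) {
      ready.push_back(std::move(it->second));
    }
    waiters_.erase(waiters_.begin(), last);
    for (auto& fn : ready) {
      fn();  // run outside the map so callbacks may safely call wait_for()
    }
  }

  epoch_t current() const { return current_; }
  std::size_t pending() const { return waiters_.size(); }

private:
  epoch_t current_;
  std::multimap<epoch_t, std::function<void()>> waiters_;
};
```

In this model, osd.4 staying at epoch 17 means `advance()` is never called with anything newer. Any event that needs a later epoch, such as the pg_query2 sent at e163, therefore stays parked indefinitely.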

```
2021-05-26T20:19:42.048+0000 7f4712ffd700  1 -- [v2:172.21.15.62:3300/0,v1:172.21.15.62:6789/0] <== osd.4 v2:172.21.15.62:6801/34686 4 ==== mon_subscribe({mgrmap=0+,osd_pg_creates=0+,osdmap=0+}) v3 ==== 82+0+0 (secure 0 0 0) 0x7f46fc04e150 con 0x7f470401c480
2021-05-26T20:19:42.048+0000 7f4712ffd700 20 mon.b@1(peon) e1 _ms_dispatch existing session 0x7f46fc02f500 for osd.4
2021-05-26T20:19:42.048+0000 7f4712ffd700 20 mon.b@1(peon) e1  entity_name osd.4 global_id 4168 (new_ok) caps allow *
2021-05-26T20:19:42.048+0000 7f4712ffd700 10 mon.b@1(peon) e1 handle_subscribe mon_subscribe({mgrmap=0+,osd_pg_creates=0+,osdmap=0+}) v3
...
2021-05-26T20:19:49.129+0000 7f4712ffd700  1 -- [v2:172.21.15.62:3300/0,v1:172.21.15.62:6789/0] <== osd.4 v2:172.21.15.62:6801/34686 9 ==== mon_subscribe({osdmap=14}) v3 ==== 36+0+0 (secure 0 0 0) 0x7f46e8556210 con 0x7f470401c480
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 mon.b@1(peon) e1 _ms_dispatch existing session 0x7f46fc02f500 for osd.4
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 mon.b@1(peon) e1  entity_name osd.4 global_id 4168 (new_ok) caps allow *
2021-05-26T20:19:49.129+0000 7f4712ffd700 10 mon.b@1(peon) e1 handle_subscribe mon_subscribe({osdmap=14}) v3
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 is_capable service=mon command= read addr v2:172.21.15.62:6801/34686 on cap allow *
2021-05-26T20:19:49.129+0000 7f4712ffd700 20  allow so far , doing grant allow *
2021-05-26T20:19:49.129+0000 7f4712ffd700 20  allow all
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 is_capable service=osd command= read addr v2:172.21.15.62:6801/34686 on cap allow *
2021-05-26T20:19:49.129+0000 7f4712ffd700 20  allow so far , doing grant allow *
2021-05-26T20:19:49.129+0000 7f4712ffd700 20  allow all
2021-05-26T20:19:49.129+0000 7f4712ffd700 10 mon.b@1(peon).osd e15 check_osdmap_sub 0x7f46e84f0150 next 14 (onetime)
2021-05-26T20:19:49.129+0000 7f4712ffd700  5 mon.b@1(peon).osd e15 send_incremental [14..15] to osd.4
2021-05-26T20:19:49.129+0000 7f4712ffd700 10 mon.b@1(peon).osd e15 build_incremental [14..15] with features 3f01cfbb7ffdffff
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 mon.b@1(peon).osd e15 build_incremental    inc 15 622 bytes
2021-05-26T20:19:49.129+0000 7f4712ffd700 20 mon.b@1(peon).osd e15 build_incremental    inc 14 578 bytes
2021-05-26T20:19:49.129+0000 7f4712ffd700  1 -- [v2:172.21.15.62:3300/0,v1:172.21.15.62:6789/0] --> v2:172.21.15.62:6801/34686 -- osd_map(14..15 src has 1..15) v4 -- 0x7f46e856a100 con 0x7f470401c480
```

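```
seastar::future<> Client::renew_subs()
{
  // nothing new to (re)send: return early without sending an MMonSubscribe;
  // this is where "renew_subs - empty" in the log comes from
  if (!sub.have_new()) {
    logger().warn("{} - empty", __func__);
    return seastar::now();
  }
  logger().trace("{}", __func__);

  // build a single MMonSubscribe carrying the subscriptions to renew
  auto m = crimson::make_message<MMonSubscribe>();
  m->what = sub.get_subs();
  m->hostname = ceph_get_short_hostname();
  return send_message(std::move(m)).then([this] {
    // the message is out: mark the pending subscriptions as renewed
    sub.renewed();
  });
}
```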
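
A small model of subscription bookkeeping shows how a repeated `osdmap_subscribe()` can end up as "renew_subs - empty". This is a hypothetical, simplified sketch: `SubTracker`, `SubItem` and the "only a newer epoch counts" rule in `want_increment()` are assumptions made for illustration, not the real `MonSub` code. It shows one pattern: a request no newer than one already sent adds nothing, so `have_new()` is false and `renew_subs()` returns early without sending an MMonSubscribe. It does not try to reproduce every lost request in the log.

```cpp
#include <map>
#include <string>

// Hypothetical model of monitor-subscription bookkeeping; the names and the
// "only newer epochs count" rule are assumptions for illustration.
struct SubItem {
  unsigned start = 0;    // first epoch requested
  bool onetime = false;  // one-shot subscription, retired once satisfied
};

class SubTracker {
public:
  // Record a request for `what` from epoch `start`. Returns false, and
  // records nothing, if an equal or newer request is already pending or sent.
  bool want_increment(const std::string& what, unsigned start, bool onetime) {
    if (auto p = pending_.find(what); p != pending_.end()) {
      if (p->second.start >= start) {
        return false;
      }
      p->second = SubItem{start, onetime};
      return true;
    }
    if (auto s = sent_.find(what); s != sent_.end() && s->second.start >= start) {
      return false;
    }
    pending_[what] = SubItem{start, onetime};
    return true;
  }

  // Counterpart of sub.have_new(): is there anything to put on the wire?
  bool have_new() const { return !pending_.empty(); }

  // Counterpart of sub.renewed(): everything pending has now been sent.
  void renewed() {
    for (const auto& [what, item] : pending_) {
      sent_[what] = item;
    }
    pending_.clear();
  }

  // Retire a satisfied one-time subscription so it can be requested again.
  void got(const std::string& what) {
    if (auto s = sent_.find(what); s != sent_.end() && s->second.onetime) {
      sent_.erase(s);
    }
  }

private:
  std::map<std::string, SubItem> pending_;
  std::map<std::string, SubItem> sent_;
};
```

In this model, the same request becomes sendable again only after something like `got()` retires the earlier one-time subscription. If that never happens, every later identical request is silently absorbed.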

```
INFO  2021-05-26 20:19:42,081 [shard 0] osd - osdmap_subscribe(1)
DEBUG 2021-05-26 20:19:42,081 [shard 0] ms - [osd.4(client) v2:172.21.15.62:6801/34686@63208 >> mon.1 v2:172.21.15.62:3300/0] --> #6 === mon_subscribe({osdmap=1}) v3 (15)
...
INFO  2021-05-26 20:19:49,128 [shard 0] osd - osdmap_subscribe(14)
DEBUG 2021-05-26 20:19:49,128 [shard 0] ms - [osd.4(client) v2:172.21.15.62:6801/34686@63208 >> mon.1 v2:172.21.15.62:3300/0] --> #9 === mon_subscribe({osdmap=14}) v3 (15)
...
INFO  2021-05-26 20:19:49,141 [shard 0] osd - osdmap_subscribe(14)
WARN  2021-05-26 20:19:49,141 [shard 0] monc - renew_subs - empty
<no MMonSubscribe>
...
INFO  2021-05-26 20:20:43,941 [shard 0] osd - evt epoch is 18, i have 17, will wait
INFO  2021-05-26 20:20:43,941 [shard 0] osd - osdmap_subscribe(17)
<no MMonSubscribe>
...
INFO  2021-05-26 20:20:46,930 [shard 0] osd - handle_osd_map message skips epochs 18..19
INFO  2021-05-26 20:20:46,930 [shard 0] osd - osdmap_subscribe(18)
<no MMonSubscribe>
```

[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-26_12:20:26-rados-master-distro-basic-smithi/6136908

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
4 years ago Merge pull request #41630 from rhcs-dashboard/fix-bucket-calculations
Ernesto Puerta [Wed, 2 Jun 2021 12:12:56 +0000 (14:12 +0200)]
Merge pull request #41630 from rhcs-dashboard/fix-bucket-calculations

mgr/dashboard: fix bucket objects and size calculations

Reviewed-by: Alfonso Martínez <almartin@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
4 years ago Merge pull request #41638 from tchaikov/wip-doc-crimson-doc
Kefu Chai [Wed, 2 Jun 2021 10:43:47 +0000 (18:43 +0800)]
Merge pull request #41638 from tchaikov/wip-doc-crimson-doc

doc/dev/crimson: update link to scylladb debugging tips

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
4 years ago mds/scrub: write root inode backtrace at creation
Milind Changire [Wed, 2 Jun 2021 09:42:09 +0000 (15:12 +0530)]
mds/scrub: write root inode backtrace at creation

Write the root inode backtrace as soon as the root inode is created;
an unwritten backtrace always caused scrub to fail for the root inode.

Fixes: https://tracker.ceph.com/issues/50976
Signed-off-by: Milind Changire <mchangir@redhat.com>
4 years ago doc/dev/crimson: update link to scylladb debugging tips
Kefu Chai [Wed, 2 Jun 2021 09:10:25 +0000 (17:10 +0800)]
doc/dev/crimson: update link to scylladb debugging tips

the old one is not reachable anymore.

Signed-off-by: Kefu Chai <kchai@redhat.com>
4 years ago Merge pull request #41637 from tchaikov/wip-crimson-never-discard-future
Kefu Chai [Wed, 2 Jun 2021 09:00:53 +0000 (17:00 +0800)]
Merge pull request #41637 from tchaikov/wip-crimson-never-discard-future

crimson: always handle returned future

Reviewed-by: Xuehan Xu <xuxuehan@360.cn>
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
4 years ago doc/mgr/modules: add a "debugging" section
Kefu Chai [Tue, 1 Jun 2021 11:58:47 +0000 (19:58 +0800)]
doc/mgr/modules: add a "debugging" section

Signed-off-by: Kefu Chai <kchai@redhat.com>
4 years ago pybind/ceph_mgr_repl: define "timeout" opt as an int
Kefu Chai [Tue, 1 Jun 2021 12:40:16 +0000 (20:40 +0800)]
pybind/ceph_mgr_repl: define "timeout" opt as an int

otherwise it would be a str, which would be rejected by the underlying
`json_command()` method.

Signed-off-by: Kefu Chai <kchai@redhat.com>
4 years ago qa/tasks: Enhance wait_until_true() to check & retry recovery progress
Sridhar Seshasayee [Wed, 19 May 2021 15:22:15 +0000 (20:52 +0530)]
qa/tasks: Enhance wait_until_true() to check & retry recovery progress

With the mclock scheduler enabled, recovery throughput is throttled based
on factors such as the enabled mclock profile and the OSD capacity, among
others. As a result, recovery times may vary, and the existing timeout of
120 secs may not be sufficient.

To address this, a new method called _is_inprogress_or_complete() is
introduced in the TestProgress class. It checks whether the event with the
specified 'id' is in progress by checking the 'progress' key of the
progress command response. This method also handles the corner case where
the event completes just before it is called.

The existing wait_until_true() method in the CephTestCase class is
modified to accept another function argument called "check_fn". In the
"test_turn_off_module" test, which has been observed to fail for the
reasons described above, check_fn is set to _is_inprogress_or_complete().
A retry mechanism with a maximum of 5 attempts kicks in after the first
timeout is hit. This means that the wait can extend up to a maximum of
600 secs (120 secs * 5) as long as the 'ceph progress' command reports
recovery progress.
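
The retry logic described above can be sketched as follows. The real helper is Python code in the qa suite; this is only an illustrative transliteration into C++. `WaitTimeout`, the parameter names and the attempt counting are assumptions, with the counting chosen so that 5 windows of 120 secs give the 600 secs mentioned above. Elapsed time is accumulated from the poll period rather than read from a clock, which keeps the sketch deterministic.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Illustrative C++ transliteration of the retry idea; the real helper is
// Python qa code, and these names/parameters are assumptions.
struct WaitTimeout : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Poll `condition` every `period`. When a `timeout` window expires, start a
// new window only if `check_fn` is set and reports progress (or completion),
// for at most `max_attempts` windows in total (max_attempts * timeout).
// Returns the number of windows used.
inline int wait_until_true(const std::function<bool()>& condition,
                           std::chrono::milliseconds timeout,
                           std::chrono::milliseconds period,
                           const std::function<bool()>& check_fn = nullptr,
                           int max_attempts = 5) {
  int attempt = 1;
  std::chrono::milliseconds elapsed{0};
  while (!condition()) {
    if (elapsed >= timeout) {
      if (check_fn && attempt < max_attempts && check_fn()) {
        ++attempt;  // still progressing: open a fresh timeout window
        elapsed = std::chrono::milliseconds{0};
        continue;
      }
      throw WaitTimeout("timed out waiting for condition");
    }
    std::this_thread::sleep_for(period);
    elapsed += period;
  }
  return attempt;
}
```

Without `check_fn`, the behaviour is a plain single-window timeout. The extension only applies while the progress check keeps succeeding.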

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
4 years ago osd: Disable heartbeat timeout until a non-future workitem can be processed
Sridhar Seshasayee [Tue, 25 May 2021 14:09:33 +0000 (19:39 +0530)]
osd: Disable heartbeat timeout until a non-future workitem can be processed

In rare cases with the mclock scheduler, a worker thread for a shard may
not get an immediate work item to process. Such not-yet-eligible items are
designated as future work items. In such cases, the _process() loop waits
until the time indicated by the scheduler before attempting another
dequeue from the scheduler queue. If there are multiple threads per shard,
a thread may not get an immediate item for a long time. This time could
exceed the heartbeat timeout for the thread and result in heartbeat
timeouts being reported for the osd in question. To prevent this, the
heartbeat timeout for the thread is disabled before waiting for an item
and re-enabled once the wait period is over.
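
The following minimal sketch shows the idea of disabling the heartbeat while waiting for a future item. It uses an RAII guard so the timeout is re-armed on every exit path. `HeartbeatHandle`, `HeartbeatSuspend` and `wait_for_future_item()` are hypothetical stand-ins, not Ceph's HeartbeatMap API, and a grace of zero is assumed to mean "no timeout".

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

using hb_clock = std::chrono::steady_clock;

// Hypothetical per-thread heartbeat handle: a watchdog would report the
// thread as stuck if it stays silent longer than `grace`. A grace of zero
// means the timeout is disabled. Not Ceph's HeartbeatMap API.
struct HeartbeatHandle {
  hb_clock::duration grace = hb_clock::duration::zero();
  void reset_timeout(hb_clock::duration g) { grace = g; }
  void clear_timeout() { grace = hb_clock::duration::zero(); }
  bool enabled() const { return grace != hb_clock::duration::zero(); }
};

// RAII guard: disable the heartbeat timeout for the lifetime of the guard
// and re-arm it with `grace` on every exit path.
class HeartbeatSuspend {
public:
  HeartbeatSuspend(HeartbeatHandle& hb, hb_clock::duration grace)
      : hb_(hb), grace_(grace) {
    hb_.clear_timeout();
  }
  ~HeartbeatSuspend() { hb_.reset_timeout(grace_); }
  HeartbeatSuspend(const HeartbeatSuspend&) = delete;
  HeartbeatSuspend& operator=(const HeartbeatSuspend&) = delete;

private:
  HeartbeatHandle& hb_;
  hb_clock::duration grace_;
};

// The scheduler handed back a future item that becomes eligible at
// `not_before`: sleep until then (or until new work is signalled) without
// tripping the heartbeat, then re-arm it when the guard goes away.
inline void wait_for_future_item(HeartbeatHandle& hb, hb_clock::duration grace,
                                 std::unique_lock<std::mutex>& lock,
                                 std::condition_variable& cv,
                                 hb_clock::time_point not_before) {
  HeartbeatSuspend suspend(hb, grace);
  cv.wait_until(lock, not_before);
}
```

`wait_until()` can return early, either on a spurious wakeup or because new work was signalled. The caller is therefore expected to loop back and retry the dequeue.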

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
4 years ago osd: Run osd bench test to override default max osd capacity for mclock
Sridhar Seshasayee [Mon, 10 May 2021 09:11:54 +0000 (14:41 +0530)]
osd: Run osd bench test to override default max osd capacity for mclock

If the mclock scheduler is enabled, run the osd bench test as part of the
osd initialization sequence to determine the max osd capacity. The IOPS
measured by the test is used to override the default
osd_mclock_max_capacity_iops_[hdd,ssd] option, depending on the
underlying device type.

The test performs random writes to 100 objects of 4MiB each using a
4KiB block size. The existing test, which was part of asok_command(), is
factored out into a separate method called run_osd_bench_test() so that it
can be used for both purposes. If the test fails, the default values
of the above-mentioned options are used.

A new method called update_configuration() is introduced in the
OpScheduler base class to facilitate propagating changes to a config
option that are not user initiated. This method applies such changes and
updates any internal variables associated with a config option, as long
as the option is tracked. In this case, the change to the max osd capacity
is propagated to each op shard using this method. In the future, this
method can be useful for propagating changes to advanced config option(s)
that the user is not expected to modify.
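
The flow above can be sketched as below: measure IOPS, override the option matching the device type, fall back to the defaults on failure, then push the change to every op shard. Only the option names and update_configuration() come from the commit message. `BenchResult`, `Config`, `MClockScheduler`, `iops_from_bench()` and `apply_bench_capacity()` are hypothetical, and IOPS is assumed to be the number of block-size writes divided by the elapsed time.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <vector>

// Hypothetical bench summary: total bytes written and the write block size.
struct BenchResult {
  std::uint64_t bytes_written;
  std::uint64_t block_size;
  double elapsed_sec;
};

// IOPS = number of block-size writes / elapsed seconds; nullopt on bad
// input, which the caller treats like a failed bench run.
inline std::optional<double> iops_from_bench(const BenchResult& r) {
  if (r.block_size == 0 || r.elapsed_sec <= 0.0) {
    return std::nullopt;
  }
  double ops = static_cast<double>(r.bytes_written / r.block_size);
  return ops / r.elapsed_sec;
}

// Hypothetical stand-ins for osd_mclock_max_capacity_iops_hdd / _ssd.
struct Config {
  double max_capacity_iops_hdd;
  double max_capacity_iops_ssd;
};

class OpScheduler {
public:
  virtual ~OpScheduler() = default;
  // Re-read tracked options after a change that was not user initiated.
  virtual void update_configuration() = 0;
};

class MClockScheduler : public OpScheduler {
public:
  MClockScheduler(const Config& conf, bool rotational)
      : conf_(conf), rotational_(rotational) {
    update_configuration();
  }
  void update_configuration() override {
    capacity_iops_ = rotational_ ? conf_.max_capacity_iops_hdd
                                 : conf_.max_capacity_iops_ssd;
  }
  double capacity_iops() const { return capacity_iops_; }

private:
  const Config& conf_;
  bool rotational_;
  double capacity_iops_ = 0.0;
};

// Override the option that matches the device type and propagate it to
// every shard; if the bench failed, keep the defaults and report false.
inline bool apply_bench_capacity(Config& conf, bool rotational,
                                 std::optional<double> measured,
                                 std::vector<std::unique_ptr<OpScheduler>>& shards) {
  if (!measured) {
    return false;
  }
  double& option = rotational ? conf.max_capacity_iops_hdd
                              : conf.max_capacity_iops_ssd;
  option = *measured;
  for (auto& shard : shards) {
    shard->update_configuration();
  }
  return true;
}
```

Keeping the propagation behind a virtual hook means the code that measures capacity does not need to know which scheduler each shard runs.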

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
4 years ago osd: Remove the generic "osd_mclock_max_capacity_iops" option.
Sridhar Seshasayee [Fri, 14 May 2021 08:24:19 +0000 (13:54 +0530)]
osd: Remove the generic "osd_mclock_max_capacity_iops" option.

Remove the generic "osd_mclock_max_capacity_iops" option and use the
"osd_mclock_max_capacity_iops_[hdd,ssd]" options. It is better to have a
clear indication about the type of underlying device. This helps in
avoiding confusion when trying to read or override the options.

Signed-off-by: Sridhar Seshasayee <sseshasa@redhat.com>
4 years ago mgr/dashboard: fix bucket objects and size calculations
Avan Thakkar [Tue, 1 Jun 2021 14:21:16 +0000 (19:51 +0530)]
mgr/dashboard: fix bucket objects and size calculations

Fixes: https://tracker.ceph.com/issues/51035
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
4 years ago crimson/common/interruptible_future: mark future 'nodiscard'
Kefu Chai [Wed, 2 Jun 2021 06:16:25 +0000 (14:16 +0800)]
crimson/common/interruptible_future: mark future 'nodiscard'

so the compiler is able to error out if we discard a future.

Signed-off-by: Kefu Chai <kchai@redhat.com>
4 years ago crimson/common/errorator: mark errorator::future 'nodiscard'
Kefu Chai [Wed, 2 Jun 2021 06:15:43 +0000 (14:15 +0800)]
crimson/common/errorator: mark errorator::future 'nodiscard'

so the compiler is able to error out if we discard a future.
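
The following toy example illustrates what marking a future type `[[nodiscard]]` buys. `toy_future` and `flush_journal()` are hypothetical and much simpler than errorator::future or interruptible_future. The point is only that the compiler diagnoses a call whose returned future is dropped.

```cpp
#include <utility>

// Hypothetical, minimal future type; only the [[nodiscard]] marking matters.
template <typename T>
class [[nodiscard]] toy_future {
public:
  explicit toy_future(T v) : value_(std::move(v)) {}
  // Consume the future and return its value.
  T get() && { return std::move(value_); }

private:
  T value_;
};

// Hypothetical async operation returning a future.
inline toy_future<int> flush_journal() { return toy_future<int>(0); }

inline int caller() {
  // flush_journal();  // would be diagnosed: the returned future is discarded
  return flush_journal().get();  // OK: the future is consumed
}
```

Uncommenting the bare `flush_journal();` call makes GCC and Clang warn about an ignored nodiscard value, which becomes an error with `-Werror`. An intentional discard has to be spelled out, for example with a `(void)` cast.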

Signed-off-by: Kefu Chai <kchai@redhat.com>
4 years ago crimson: always handle returned future
Kefu Chai [Wed, 2 Jun 2021 06:13:04 +0000 (14:13 +0800)]
crimson: always handle returned future

to ignore a future without good reason could lead to catastrophic
issues. see also b127fa3cdd405c71cf09875f61f107c23af6b8cf

Signed-off-by: Kefu Chai <kchai@redhat.com>
4 years ago crimson/os: do not return a future in finally()
Kefu Chai [Wed, 2 Jun 2021 06:11:07 +0000 (14:11 +0800)]
crimson/os: do not return a future in finally()

errorator always discards the returned future.

Signed-off-by: Kefu Chai <kchai@redhat.com>
4 years ago Merge pull request #41026 from TRYTOBE8TME/wip-rgw-rabbitmq
Yuval Lifshitz [Wed, 2 Jun 2021 04:47:39 +0000 (07:47 +0300)]
Merge pull request #41026 from TRYTOBE8TME/wip-rgw-rabbitmq

qa/tasks: Adding RabbitMQ task for bucket notification tests

4 years ago auth/KeyRing: rename decode_plaintext() to decode()
Kefu Chai [Wed, 2 Jun 2021 04:04:30 +0000 (12:04 +0800)]
auth/KeyRing: rename decode_plaintext() to decode()

as the former is just an alias of the latter.

Signed-off-by: Kefu Chai <kchai@redhat.com>
4 years ago auth/KeyRing: do not decode a copy of bl
Kefu Chai [Wed, 2 Jun 2021 03:59:16 +0000 (11:59 +0800)]
auth/KeyRing: do not decode a copy of bl

i checked all the code paths calling into KeyRing::decode(); none of
them relies on the bl being left untouched once it has been decoded
through the iterator.

actually, it is more intuitive to always move the iterator forward when
decoding the encoded keyring in the bufferlist.
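
The following small sketch shows the difference between decoding from a copy of the iterator and decoding through the caller's iterator. `Cursor` is a hypothetical stand-in for `bufferlist::const_iterator`, and the length-prefixed record format is invented for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical read cursor standing in for bufferlist::const_iterator.
struct Cursor {
  const std::vector<std::uint8_t>* buf;
  std::size_t pos = 0;

  std::uint8_t next() {
    if (pos >= buf->size()) {
      throw std::out_of_range("end of buffer");
    }
    return (*buf)[pos++];
  }
};

// Decode one record (1-byte length, then that many bytes) through the
// caller's cursor, which ends up just past the record.
inline std::string decode_record(Cursor& c) {
  std::size_t len = c.next();
  std::string out;
  for (std::size_t i = 0; i < len; ++i) {
    out.push_back(static_cast<char>(c.next()));
  }
  return out;
}

// Decode from a copy: the caller's cursor is left where it was.
inline std::string decode_record_copy(Cursor c) {
  return decode_record(c);
}
```

Decoding in place leaves the cursor at the next record, so back-to-back decodes from the same buffer work naturally.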

Signed-off-by: Kefu Chai <kchai@redhat.com>
4 years ago qa: increase fragmentation to improve uniform distribution
Patrick Donnelly [Tue, 1 Jun 2021 21:00:23 +0000 (14:00 -0700)]
qa: increase fragmentation to improve uniform distribution

Fixes: https://tracker.ceph.com/issues/51060
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
4 years ago auth/KeyRing: always decode keyring as plaintext
Kefu Chai [Tue, 1 Jun 2021 14:44:08 +0000 (22:44 +0800)]
auth/KeyRing: always decode keyring as plaintext

for three reasons:

* we haven't encoded binary KeyRings since v0.48: the binary encoder for
  KeyRing was dropped in eaea7aa9b28849be612b22ce84971db671319806,
  which was first included in v0.48 (argonaut), and we haven't manually
  encoded KeyRing in binary elsewhere since then.
* we should not use exceptions in the normal code path. in C++,
  exceptions are neither designed to be efficient nor semantically meant
  to be part of the normal code path. so, for readability, we should not
  use an exception here, as all encoded KeyRings are in plaintext.
* simpler this way.

Signed-off-by: Kefu Chai <kchai@redhat.com>