Kefu Chai [Sat, 8 May 2021 08:43:55 +0000 (16:43 +0800)]
crimson/common: use string_view when appropriate
the typical use case of get_val() passes a literal string as the key,
in that case, there is no need to create a std::string. as
md_config_t::get_val() always accepts a string_view as the option name.
crimson/monc: honor auth_result_t::canceled as the result of do_auth().
An attempt to `Connection::do_auth()` may finish in one of three states:
_success_, _failure_ and _cancellation_. Unfortunately, its callers were
missing the third treating cancellation like a failure. This was the root
cause of the following failure at Sepia:
```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-05-06_22:08:43-rados-master-distro-basic-smithi/6102605$ less ./remote/smithi204/log/ceph-osd.3.log.gz
...
WARN 2021-05-06 22:35:40,464 [shard 0] osd - ms_handle_reset
...
INFO 2021-05-06 22:35:40,465 [shard 0] monc - do_auth_single: connection closed
INFO 2021-05-06 22:35:40,465 [shard 0] ms - [osd.3(client) v2:172.21.15.204:6808/31418@57568 >> mon.? v2:172.21.15.204:3300/0] execute_connecting(): protocol aborted at CLOSING -- std::system_error (error crimson::net:6, protocol aborted)
...
ERROR 2021-05-06 22:35:40,465 [shard 0] osd - mon.osd.3 dispatch() ms_handle_reset caught exception: std::system_error (error crimson::net:3, negotiation failure)
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-3909-g81233a18/rpm/el8/BUILD/ceph-17.0.0-3909-g81233a18/src/crimson/common/gated.h:36: crimson::common::Gated::dispatch(const char*, T&, Func&&) [with Func = crimson::mon::Client::ms_handle_reset(crimson::net::ConnectionRef, bool)::<lambda()>&; T = crimson::mon::Client]::<lambda(std::__exception_ptr::exception_ptr)>: Assertion `*eptr.__cxa_exception_type() == typeid(seastar::gate_closed_exception)' failed.
Aborting on shard 0.
Backtrace:
0# 0x00005618C973932F in ceph-osd
1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007F7BB592EB20 in /lib64/libpthread.so.0
4# gsignal in /lib64/libc.so.6
5# abort in /lib64/libc.so.6
6# 0x00007F7BB3F29B09 in /lib64/libc.so.6
7# 0x00007F7BB3F37DE6 in /lib64/libc.so.6
8# 0x00005618C9FF295C in ceph-osd
9# 0x00005618C3907313 in ceph-osd
10# 0x00005618CCA2F84F in ceph-osd
11# 0x00005618CCA34D90 in ceph-osd
12# 0x00005618CCBEC9BB in ceph-osd
13# 0x00005618CC744E9A in ceph-osd
14# main in ceph-osd
15# __libc_start_main in /lib64/libc.so.6
16# _start in ceph-osd
daemon-helper: command crashed with signal 6
```
The low-level signal handler above assumes `local_engine._backend`
is not null which stays true only for threads from the S*'s world.
Unfortunately, as we don't block the `SIGHUP` for alien threads,
kernel is perfectly authorized to pick up one them to run the handler
leading to weirdly-looking segfaults like this one:
```
INFO 2021-04-23 07:06:57,807 [shard 0] bluestore - stat
DEBUG 2021-04-23 07:06:58,753 [shard 0] ms - [osd.1(client) v2:172.21.15.100:6802/30478@51064 >> mgr.4105 v2:172.21.15.109:6800/29891] --> #7 === pg_stats(0 pgs seq 55834574872 v 0) v2 (87)
...
INFO 2021-04-23 07:06:58,813 [shard 0] bluestore - stat
DEBUG 2021-04-23 07:06:59,753 [shard 0] osd - AdminSocket::handle_client: incoming asok string: {"prefix": "get_command_descriptions"}
INFO 2021-04-23 07:06:59,753 [shard 0] osd - asok response length: 2947
INFO 2021-04-23 07:06:59,817 [shard 0] bluestore - stat
DEBUG 2021-04-23 07:06:59,865 [shard 0] osd - AdminSocket::handle_client: incoming asok string: {"prefix": "get_command_descriptions"}
INFO 2021-04-23 07:06:59,866 [shard 0] osd - asok response length: 2947
DEBUG 2021-04-23 07:07:00,020 [shard 0] osd - AdminSocket::handle_client: incoming asok string: {"prefix": "get_command_descriptions"}
INFO 2021-04-23 07:07:00,020 [shard 0] osd - asok response length: 2947
INFO 2021-04-23 07:07:00,820 [shard 0] bluestore - stat
...
Backtrace:
0# 0x00005600CD0D6AAF in ceph-osd
1# FatalSignal::signaled(int) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<11>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007F5877C7EB20 in /lib64/libpthread.so.0
4# 0x00005600CD830B81 in ceph-osd
5# 0x00007F5877C7EB20 in /lib64/libpthread.so.0
6# pthread_cond_timedwait in /lib64/libpthread.so.0
7# crimson::os::ThreadPool::loop(std::chrono::duration<long, std::ratio<1l, 1000l> >, unsigned long) in ceph-osd
8# 0x00007F5877999BA3 in /lib64/libstdc++.so.6
9# 0x00007F5877C7414A in /lib64/libpthread.so.0
10# clone in /lib64/libc.so.6
daemon-helper: command crashed with signal 11
```
Ultimately, it turned out the thread came out from a syscall (`futex`)
and started crunching the `SIGHUP` handler's code in which a nullptr
dereference happened.
This patch blocks `SIGHUP` for all threads spawned by `AlienStore`.
Kefu Chai [Fri, 7 May 2021 13:36:48 +0000 (21:36 +0800)]
crimson/os/seastore: use map::merge() to merge maps
C++17's std::map allows us to merge two maps, and in this case, we can
even consume `child_result`. so map::merge() is used instead of insert()
in hope to avoid the memcpy and allocation of pair<> nodes.
Lucian Petrut [Fri, 7 May 2021 09:23:30 +0000 (09:23 +0000)]
win*.sh,cmake: Fix Windows linking errors
The Windows build is hitting linking errors after
bumping the Boost version to 1.75. The issue is that Boost
is now setting the zlib dependecy using INTERFACE_LINK_LIBRARIES,
which means that it's no longer located using the standard
"find_package" mechanism.
In order for the linker to locate zlib, we'll add it to the
linker search path.
Samuel Just [Wed, 28 Apr 2021 07:24:00 +0000 (00:24 -0700)]
crimson/os/seastore: clean up meta implementation
There's really no reason to cache the decoded representation here since
the meta keys are only accessed during startup, mkfs. This approach is
much simpler.
Neha Ojha [Mon, 3 May 2021 19:28:27 +0000 (19:28 +0000)]
qa/standalone: use osd op queue = wpq
mclock_scheduler is now the default and some of these tests need to be modified
to run well with it. Continue using wpq until
https://tracker.ceph.com/issues/50574 is addressed.
Sage Weil [Thu, 6 May 2021 14:33:00 +0000 (10:33 -0400)]
Merge PR #41107 into master
* refs/pull/41107/head:
mgr/cephadm: apply hostname/addr checks to 'orch host set-addr' too
mgr/cephadm: make 'host add' idempotent
mgr/cephadm: set host crush location based on HostSpec
python-common: add location property to HostSpec, + tests
Reviewed-by: Juan Miguel Olmo <jolmomar@redhat.com>
Kefu Chai [Thu, 6 May 2021 07:36:27 +0000 (15:36 +0800)]
doc/_ext: rewrite directive using ObjectDescription
which allows us to use different scheme when defining an option,
without this change, if two options in different mgr module share the
same name we cannot differentiate them, after this change, their id
would prefixed with the module name.
link() and unlink() are an artifact of the representation of buckets and
users in RADOS, and don't need to be in the API. Instead, fix create()
and chown() to behave properly.
Signed-off-by: Daniel Gryniewicz <dang@redhat.com>