Currently, when deployed with cephadm and podman, logs from ceph daemons are
routed as stderr -> conmon -> journald. We can send logs to journald directly
by connecting to its socket. This has the following advantages:
- record structured metadata along with log message
- prettier output from journalctl
- no duplicated timestamp
- colorized according to priority
- multi-line logs are indented properly
- easier to filter the logs afterward
- theoretically better performance
- works around https://tracker.ceph.com/issues/49551
This depends on commit 9ef8055, and also on commit 1a76f47 because the cgroup
is needed to associate these logs with the systemd unit.
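As a rough illustration of "connecting to its socket", the sketch below serializes a few fields in journald's native datagram format and sends them to `/run/systemd/journal/socket`. It is not the code from this commit: `encode_journal_fields` and `send_to_journald` are hypothetical helper names, and only `MESSAGE` and `PRIORITY` are well-known journald fields.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Hypothetical helper: serialize fields in journald's native format.
//   single-line value:   KEY=value\n
//   multi-line value:    KEY\n <64-bit little-endian length> value \n
std::string encode_journal_fields(
    const std::vector<std::pair<std::string, std::string>>& fields)
{
  std::string out;
  for (const auto& [key, value] : fields) {
    out += key;
    if (value.find('\n') == std::string::npos) {
      out += '=';
      out += value;
    } else {
      out += '\n';
      const std::uint64_t len = value.size();
      for (int i = 0; i < 8; ++i) {
        out += static_cast<char>((len >> (8 * i)) & 0xff);
      }
      out += value;
    }
    out += '\n';
  }
  return out;
}

// Hypothetical helper: send one entry as a single datagram.
// Returns false if the socket is unavailable or the send fails.
bool send_to_journald(const std::string& payload,
                      const char* path = "/run/systemd/journal/socket")
{
  const int fd = ::socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
  if (fd < 0) {
    return false;
  }
  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
  const ssize_t n = ::sendto(fd, payload.data(), payload.size(), 0,
                             reinterpret_cast<const sockaddr*>(&addr),
                             sizeof(addr));
  ::close(fd);
  return n == static_cast<ssize_t>(payload.size());
}
```

Because every field travels as structured data, journalctl can filter on `PRIORITY` without parsing the message text, and a multi-line `MESSAGE` stays one entry.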
Kefu Chai [Tue, 11 May 2021 09:55:32 +0000 (17:55 +0800)]
doc/_theme: show the menu button
because we have a top-nav bar, which is sitting on top of the bar
containing the menu button when the docs are displayed on a device with
a smaller width. in this change, the container of the menu button is moved
down a little bit, so it is visible again.
max_misplaced was replaced by target_max_misplaced_ratio in
edbd592ee44e02a5328e1510879555c2f9dcfc9e, but the document was not
sync'ed. let's update it accordingly.
Zac Dover [Mon, 10 May 2021 23:19:10 +0000 (09:19 +1000)]
doc/cephadm: rewrite "config ssl/tls f. grafana"
This PR streamlines the grammar in the subsection
called "Configuring SSL/TLS for Grafana" in the
monitoring.rst file. It also corrects the prompt
rst.
Kefu Chai [Sat, 8 May 2021 08:43:55 +0000 (16:43 +0800)]
crimson/common: use string_view when appropriate
the typical use case of get_val() passes a literal string as the key.
in that case, there is no need to create a std::string, as
md_config_t::get_val() always accepts a string_view as the option name.
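A minimal sketch of the idea, assuming a simplified stand-in for the config store: `ConfigSketch` is hypothetical and is not the real `md_config_t` interface. It only shows how a `string_view` key lets a literal be looked up without building a temporary `std::string`.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for a config store; not Ceph's md_config_t.
class ConfigSketch {
  // std::less<> enables heterogeneous lookup, so find() can compare the
  // stored std::string keys against a string_view directly.
  std::map<std::string, std::string, std::less<>> values_;

public:
  void set_val(std::string key, std::string value) {
    values_.insert_or_assign(std::move(key), std::move(value));
  }

  // Taking string_view means get_val("literal") allocates nothing
  // for the key.
  std::optional<std::string> get_val(std::string_view key) const {
    const auto it = values_.find(key);
    if (it == values_.end()) {
      return std::nullopt;
    }
    return it->second;
  }
};
```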
crimson/monc: honor auth_result_t::canceled as the result of do_auth().
An attempt to `Connection::do_auth()` may finish in one of three states:
_success_, _failure_ and _cancellation_. Unfortunately, its callers were
missing the third, treating cancellation like a failure. This was the root
cause of the following failure at Sepia:
```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-05-06_22:08:43-rados-master-distro-basic-smithi/6102605$ less ./remote/smithi204/log/ceph-osd.3.log.gz
...
WARN 2021-05-06 22:35:40,464 [shard 0] osd - ms_handle_reset
...
INFO 2021-05-06 22:35:40,465 [shard 0] monc - do_auth_single: connection closed
INFO 2021-05-06 22:35:40,465 [shard 0] ms - [osd.3(client) v2:172.21.15.204:6808/31418@57568 >> mon.? v2:172.21.15.204:3300/0] execute_connecting(): protocol aborted at CLOSING -- std::system_error (error crimson::net:6, protocol aborted)
...
ERROR 2021-05-06 22:35:40,465 [shard 0] osd - mon.osd.3 dispatch() ms_handle_reset caught exception: std::system_error (error crimson::net:3, negotiation failure)
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-3909-g81233a18/rpm/el8/BUILD/ceph-17.0.0-3909-g81233a18/src/crimson/common/gated.h:36: crimson::common::Gated::dispatch(const char*, T&, Func&&) [with Func = crimson::mon::Client::ms_handle_reset(crimson::net::ConnectionRef, bool)::<lambda()>&; T = crimson::mon::Client]::<lambda(std::__exception_ptr::exception_ptr)>: Assertion `*eptr.__cxa_exception_type() == typeid(seastar::gate_closed_exception)' failed.
Aborting on shard 0.
Backtrace:
0# 0x00005618C973932F in ceph-osd
1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007F7BB592EB20 in /lib64/libpthread.so.0
4# gsignal in /lib64/libc.so.6
5# abort in /lib64/libc.so.6
6# 0x00007F7BB3F29B09 in /lib64/libc.so.6
7# 0x00007F7BB3F37DE6 in /lib64/libc.so.6
8# 0x00005618C9FF295C in ceph-osd
9# 0x00005618C3907313 in ceph-osd
10# 0x00005618CCA2F84F in ceph-osd
11# 0x00005618CCA34D90 in ceph-osd
12# 0x00005618CCBEC9BB in ceph-osd
13# 0x00005618CC744E9A in ceph-osd
14# main in ceph-osd
15# __libc_start_main in /lib64/libc.so.6
16# _start in ceph-osd
daemon-helper: command crashed with signal 6
```
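The three outcomes of `do_auth()` described above can be sketched as an exhaustive switch. Only `auth_result_t::canceled` is named in the commit title; the other enumerators, `next_step_t`, `on_auth_result`, and the chosen reactions are assumptions made for illustration, not the actual patch.

```cpp
// Enumerator names other than `canceled` are assumed for illustration.
enum class auth_result_t { success, failure, canceled };

// Hypothetical caller-side reactions.
enum class next_step_t { proceed, report_failure, stop_quietly };

// Handle cancellation as its own case instead of letting it fall into
// the failure path.
next_step_t on_auth_result(auth_result_t result)
{
  switch (result) {
  case auth_result_t::success:
    return next_step_t::proceed;
  case auth_result_t::failure:
    return next_step_t::report_failure;
  case auth_result_t::canceled:
    // The attempt was abandoned (for example, the connection was reset),
    // so this is not an authentication error.
    return next_step_t::stop_quietly;
  }
  return next_step_t::stop_quietly;  // unreachable for valid enumerators
}
```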
The low-level signal handler above assumes `local_engine._backend`
is not null, which stays true only for threads from the S*'s world.
Unfortunately, as we don't block `SIGHUP` for alien threads, the
kernel is perfectly authorized to pick one of them to run the handler,
leading to weird-looking segfaults like this one:
```
INFO 2021-04-23 07:06:57,807 [shard 0] bluestore - stat
DEBUG 2021-04-23 07:06:58,753 [shard 0] ms - [osd.1(client) v2:172.21.15.100:6802/30478@51064 >> mgr.4105 v2:172.21.15.109:6800/29891] --> #7 === pg_stats(0 pgs seq 55834574872 v 0) v2 (87)
...
INFO 2021-04-23 07:06:58,813 [shard 0] bluestore - stat
DEBUG 2021-04-23 07:06:59,753 [shard 0] osd - AdminSocket::handle_client: incoming asok string: {"prefix": "get_command_descriptions"}
INFO 2021-04-23 07:06:59,753 [shard 0] osd - asok response length: 2947
INFO 2021-04-23 07:06:59,817 [shard 0] bluestore - stat
DEBUG 2021-04-23 07:06:59,865 [shard 0] osd - AdminSocket::handle_client: incoming asok string: {"prefix": "get_command_descriptions"}
INFO 2021-04-23 07:06:59,866 [shard 0] osd - asok response length: 2947
DEBUG 2021-04-23 07:07:00,020 [shard 0] osd - AdminSocket::handle_client: incoming asok string: {"prefix": "get_command_descriptions"}
INFO 2021-04-23 07:07:00,020 [shard 0] osd - asok response length: 2947
INFO 2021-04-23 07:07:00,820 [shard 0] bluestore - stat
...
Backtrace:
0# 0x00005600CD0D6AAF in ceph-osd
1# FatalSignal::signaled(int) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<11>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007F5877C7EB20 in /lib64/libpthread.so.0
4# 0x00005600CD830B81 in ceph-osd
5# 0x00007F5877C7EB20 in /lib64/libpthread.so.0
6# pthread_cond_timedwait in /lib64/libpthread.so.0
7# crimson::os::ThreadPool::loop(std::chrono::duration<long, std::ratio<1l, 1000l> >, unsigned long) in ceph-osd
8# 0x00007F5877999BA3 in /lib64/libstdc++.so.6
9# 0x00007F5877C7414A in /lib64/libpthread.so.0
10# clone in /lib64/libc.so.6
daemon-helper: command crashed with signal 11
```
Ultimately, it turned out the thread came out from a syscall (`futex`)
and started crunching the `SIGHUP` handler's code in which a nullptr
dereference happened.
This patch blocks `SIGHUP` for all threads spawned by `AlienStore`.
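One common way to achieve this, shown here as a sketch rather than the actual `AlienStore` change, relies on a new thread inheriting its creator's signal mask: block `SIGHUP` in the spawning thread, create the worker, then restore the old mask. `spawn_with_sighup_blocked` and `sighup_blocked_in_current_thread` are hypothetical names.

```cpp
#include <pthread.h>
#include <signal.h>

#include <thread>
#include <utility>

// Hypothetical helper: start a thread that has SIGHUP blocked.
template <typename Func>
std::thread spawn_with_sighup_blocked(Func&& func)
{
  sigset_t block_set;
  sigset_t old_set;
  sigemptyset(&block_set);
  sigaddset(&block_set, SIGHUP);

  // Block SIGHUP in the spawning thread; the new thread inherits a copy
  // of this mask at creation time.
  pthread_sigmask(SIG_BLOCK, &block_set, &old_set);
  std::thread worker(std::forward<Func>(func));
  // Restore the spawner's mask so it can still receive SIGHUP.
  pthread_sigmask(SIG_SETMASK, &old_set, nullptr);
  return worker;
}

// Hypothetical helper: query the calling thread's mask without changing it.
bool sighup_blocked_in_current_thread()
{
  sigset_t current;
  sigemptyset(&current);
  pthread_sigmask(SIG_SETMASK, nullptr, &current);
  return sigismember(&current, SIGHUP) == 1;
}
```

With the signal blocked in every such worker, the kernel can only deliver `SIGHUP` to a thread that left it unblocked.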
Kefu Chai [Fri, 7 May 2021 13:36:48 +0000 (21:36 +0800)]
crimson/os/seastore: use map::merge() to merge maps
C++17's std::map allows us to merge two maps, and in this case, we can
even consume `child_result`. so map::merge() is used instead of insert()
in the hope of avoiding the memcpy and allocation of pair<> nodes.
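A small sketch of what `std::map::merge()` does, using a hypothetical `result_map` type and `absorb` helper rather than the seastore code: nodes are relinked from the source map into the destination, so no elements are copied or reallocated.

```cpp
#include <map>
#include <string>

// Hypothetical map type standing in for the per-child results.
using result_map = std::map<std::string, int>;

// Splice the nodes of child_result into result. Entries whose key already
// exists in result are left behind in child_result.
void absorb(result_map& result, result_map& child_result)
{
  result.merge(child_result);
}
```

Unlike `insert(begin, end)`, which copies every element, `merge()` moves ownership of the existing nodes, which is why the source map must be safe to consume.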
Lucian Petrut [Fri, 7 May 2021 09:23:30 +0000 (09:23 +0000)]
win*.sh,cmake: Fix Windows linking errors
The Windows build is hitting linking errors after
bumping the Boost version to 1.75. The issue is that Boost
is now setting the zlib dependency using INTERFACE_LINK_LIBRARIES,
which means that it's no longer located using the standard
"find_package" mechanism.
In order for the linker to locate zlib, we'll add it to the
linker search path.
Samuel Just [Wed, 28 Apr 2021 07:24:00 +0000 (00:24 -0700)]
crimson/os/seastore: clean up meta implementation
There's really no reason to cache the decoded representation here since
the meta keys are only accessed during startup and mkfs. This approach is
much simpler.