crimson/osd: extend lifetime of OpsExecuter to match all_completed.
f7181ab2f65803ecd8204f8f4f5aad4713b747f3 optimized client
parallelism. To achieve that, `PG::do_osd_ops()` was converted to
return, essentially, a future of a pair of futures. Unfortunately,
the lifetime management of `OpsExecuter` was kept intact. As a
result, the object stayed valid only until the outer future was
fulfilled while, due to the `rollbacker` instances, it must stay
available until `all_completed` becomes available.
This issue can explain the following problem that has been observed
in a Teuthology job [1].
The commit deals with the problem by repacking the outer future.
An alternative would be to switch from `std::unique_ptr` to
`seastar::shared_ptr` for managing `OpsExecuter`.
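A minimal sketch of the lifetime-extension idea, using placeholder
names (`OpsExecuterStub`, `keep_alive_until_completed`) rather than the
actual crimson code; it only illustrates how moving the `unique_ptr`
into a continuation chained onto the inner future keeps the object
alive until `all_completed` resolves:
```cpp
#include <memory>
#include <seastar/core/future.hh>

struct OpsExecuterStub {};  // hypothetical stand-in for OpsExecuter

seastar::future<> keep_alive_until_completed(seastar::future<> all_completed,
                                             std::unique_ptr<OpsExecuterStub> ox) {
  // The lambda capture owns the executer, so it is destroyed only after
  // all_completed becomes available, not when the outer future resolves.
  return all_completed.finally([ox = std::move(ox)] {});
}
```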
Sage Weil [Thu, 6 May 2021 22:47:27 +0000 (18:47 -0400)]
mgr/nfs: take --ingress argument to 'nfs cluster create'
It is likely that the rook/k8s variation of ingress will not take a
virtual_ip argument. We want to make sure that ingress yes/no can be
specified independently of the virtual_ip.
Sage Weil [Thu, 6 May 2021 14:57:46 +0000 (10:57 -0400)]
cephadm: --stop-signal=SIGTERM
haproxy's container image tells docker|podman to send SIGUSR1 for a "clean"
shutdown. For NFS, the connections never close, so we will always hit the
podman|docker 10s timeout and get a SIGKILL. That, in turn, causes haproxy
to exit with 143, and puts the systemd unit in a failed state.
This highlights a general problem(?) with stopping containers: if they
don't stop quickly, we end up in this error state. We don't directly
address that here.
Avoid this problem by always stopping containers with SIGTERM. In the
haproxy case, that means an immediate shutdown (no graceful drain of
open connections). In theory we could do this only for haproxy with
NFS, but we can easily imagine RGW connections that don't close in 10s
either, and we don't want containers exiting in an error state; we just
want the proxy to stop quickly.
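A hedged illustration of the effect; the flag is the one named in this
commit, while the rest of the command line is elided and not taken from
cephadm itself:
```
# With an explicit stop signal, `podman stop` / `docker stop` sends SIGTERM
# instead of the image's default (SIGUSR1 for haproxy), so the container
# exits promptly and the systemd unit does not end up in a failed state.
podman run --stop-signal=SIGTERM ... <haproxy-image>
```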
Sage Weil [Mon, 3 May 2021 15:48:45 +0000 (11:48 -0400)]
mgr/orchestrator: default nfs pool, namespaces
Apply nfs default pool (currently 'nfs-ganesha'), and default the
namespace to the service_id.
There is no practical reason for users to ever need to change this, and
requiring them to provide this information at config/apply time just
complicates life.
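As a sketch under these defaults (field names follow the orchestrator
service spec format; the host name is made up), a minimal NFS spec no
longer needs to spell out a pool or namespace:
```yaml
service_type: nfs
service_id: foo
placement:
  hosts:
    - host1
# pool defaults to 'nfs-ganesha' and the namespace defaults to the
# service_id ('foo'), so neither needs to be given here.
```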
Sage Weil [Wed, 5 May 2021 16:59:44 +0000 (12:59 -0400)]
mgr/nfs: remove 'nfs cluster update'
This command is very awkward to implement unless all service spec fields
are always required. That will soon mean both the placement *and* the
virtual_ip (if any), making it much less convenient for a human to use.
Instead, let users update the yaml, or adjust the nfs and/or ingress
specs directly. I don't think this command is needed.
crimson/osd: fix assertion failure in OpSequencer.
`OpSequencer` assumes that the ID of a previous client request
is always lower than the ID of the current one. This is reflected
by the assertion in `OpSequencer::start_op()`. It triggered
the following failure [1] in Teuthology:
```
DEBUG 2021-05-07 08:01:41,227 [shard 0] osd - client_request(id=1, detail=osd_op(client.4171.0:1 2.2 2.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e29) v8) same_interval_since: 31
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-3910-g1b18e076/rpm/el8/BUILD/ceph- 17.0.0-3910-g1b18e076/src/crimson/osd/osd_operation_sequencer.h:38: seastar::futurize_t<Result> crimson::osd::OpSequencer::start_op(HandleT&, uint64_t, uint64_t, FuncT&&) [with HandleT = crimson::PipelineHa
ndle; FuncT = crimson::interruptible::interruptor<InterruptCond>::wrap_function(Func&&) [with Func = crimson::osd::ClientRequest::start()::<lambda()> mutable::<lambda(Ref<crimson::osd::PG>)> mutable::<lambd
a()> mutable::<lambda()>; InterruptCond = crimson::osd::IOInterruptCondition]::<lambda()>; Result = crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, seastar::future<>
>; seastar::futurize_t<Result> = crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, seastar::future<> >; uint64_t = long unsigned int]: Assertion `prev_op < this_op' fai
led.
Aborting on shard 0.
Backtrace:
Segmentation fault.
Backtrace:
0# 0x00005592B028932F in ceph-osd
1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007F57B72E7B20 in /lib64/libpthread.so.0
4# gsignal in /lib64/libc.so.6
5# abort in /lib64/libc.so.6
6# 0x00007F57B58E2B09 in /lib64/libc.so.6
7# 0x00007F57B58F0DE6 in /lib64/libc.so.6
8# 0x00005592ABB8484D in ceph-osd
9# 0x00005592ABB8ACB3 in ceph-osd
10# seastar::continuation<seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >, seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<boost::intrusive_ptr<crimson::osd::PG> >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > >(seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >&&, seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>&, seastar::future_state<boost::intrusive_ptr<crimson::osd::PG> >&&)#1}, boost::intrusive_ptr<crimson::osd::PG> >::run_and_dispose() in ceph-osd
11# 0x00005592B357F88F in ceph-osd
12# 0x00005592B3584DD0 in ceph-osd
```
Crash analysis resulted in two observations:
1. during the request's execution the acting set changed,
   the request was interrupted and an attempt to
   re-execute it was made;
2. the interrupted request was the very first client
   request the OSD had ever seen.
Code analysis showed a problem in how `ClientRequest`
establishes `prev_op_id`: although it is supposed to be performed
only once per request, it can get executed twice, but only
for the very first request `OpSequencer` has seen.
```cpp
void ClientRequest::may_set_prev_op()
{
  // set prev_op_id if it's not set yet
  if (__builtin_expect(prev_op_id == 0, true)) {
    prev_op_id = sequencer.get_last_issued();
  }
}
```
Unfortunately, `0` isn't a distinguished value that can never be
returned by `get_last_issued()`:
```cpp
  // the id of last op which is issued
  uint64_t last_issued = 0;
```
As a result, on the second call `OpSequencer` returned a new
value (actually `this_op`), violating the assertion.
The commit fixes the problem by switching from a designated
sentinel value to `std::optional`.
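A minimal, self-contained sketch of the `std::optional`-based fix; the
`*Sketch` types below are assumptions modeled on the snippet above, not
the verbatim patch:
```cpp
#include <cstdint>
#include <optional>

struct OpSequencerSketch {
  // the id of the last op which was issued
  uint64_t last_issued = 0;
  uint64_t get_last_issued() const { return last_issued; }
};

struct ClientRequestSketch {
  OpSequencerSketch& sequencer;
  std::optional<uint64_t> prev_op_id;  // empty means "not set yet"

  void may_set_prev_op() {
    // set prev_op_id only once; std::nullopt cannot be confused with a
    // legitimately returned 0, unlike the old designated value
    if (__builtin_expect(!prev_op_id.has_value(), true)) {
      prev_op_id = sequencer.get_last_issued();
    }
  }
};
```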
Kefu Chai [Mon, 24 May 2021 09:27:34 +0000 (17:27 +0800)]
ceph.in: put libasan.so path before other LD_PRELOAD paths
to ensure it is the first one to be preloaded, and to address errors
like the following:
==821517==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
sunilkumarn417 [Wed, 19 May 2021 10:02:45 +0000 (15:32 +0530)]
qa/tasks/cephadm: Include bootstrap registry options for downstream
- registry-url, registry-username and registry-password bootstrap options
  are supported now. This is needed to access monitoring service container
  images.
- use the RHEL-distribution-based cephadm in the download_cephadm task.
Kefu Chai [Mon, 24 May 2021 02:21:52 +0000 (10:21 +0800)]
vstart.sh: specify mon_data_avail_crit in ceph.conf
ceph-mon consumes this option when it boots, and exits if the ratio
of free space is lower than the specified number, which is 5% by
default. but we use `ceph -c $conf_fn config assimilate-conf -i -`
to absorb these options after the monitor starts. so, without this
change, the default value of mon_data_avail_crit is always used; if
the machine has a lower ratio of free space on the partition where
the mon store is located, ceph-mon just exits with an error message
like:
2021-05-24T01:53:14.644+0000 7ff64961e580 -1 error: monitor data
filesystem reached concerning levels of available storage space
(available: 4% 17 GiB)
after this change, the option is written to ceph.conf, and can be
read by ceph-mon when it boots, so the overridden value of 1% has
a chance to take effect. this helps to address some test failures
found in our "make check" runs performed by jenkins on machines whose
disk space is enough to complete the test, but whose ratio of free
space is lower than 5%.
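As a rough illustration (the section name is an assumption and the
value follows the 1% override mentioned above; this is not copied from
vstart.sh), the generated ceph.conf would carry something like:
```
[mon]
    # tolerate test machines that have little free disk space
    mon_data_avail_crit = 1
```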
Kefu Chai [Mon, 24 May 2021 02:00:25 +0000 (10:00 +0800)]
pybind/ceph_daemon: import OrderedDict from collections
as OrderedDict has been provided by the collections module since
Python 3.1, and we only support Python 3.6 and up, there is no need to
try to import OrderedDict from collections.abc anymore.