Sage Weil [Tue, 25 May 2021 20:10:49 +0000 (16:10 -0400)]
mgr/cephadm: convert host addr if non-IP to IP
Previously we allowed the host.addr to be a DNS name (short or FQDN).
This is problematic because of the inconsistent way that docker and podman
handle /etc/hosts, and undesirable because relying on external DNS adds
a source of failure for the cluster without any benefit in return
(simply updating DNS is not sufficient to make ceph behave).
So: update any non-IP to an IP as soon as we start up (presumably on
upgrade). If we get a loopback address (127.0.0.1 or 127.0.1.1), then
wait and hope that the next instance of the manager has better luck.
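A minimal sketch of that conversion in Python (the helper name and exact policy are illustrative, not the actual mgr/cephadm code):
```python
import ipaddress
import socket
from typing import Optional

def convert_to_ip(addr: str) -> Optional[str]:
    """Return addr as an IP, or None to retry on a later mgr instance."""
    try:
        ipaddress.ip_address(addr)
        return addr                       # already an IP; nothing to do
    except ValueError:
        pass                              # a DNS name; resolve it below
    try:
        ip = socket.getaddrinfo(addr, None, proto=socket.IPPROTO_TCP)[0][4][0]
    except socket.gaierror:
        return None                       # resolution failed; try again later
    if ip in ('127.0.0.1', '127.0.1.1'):
        return None                       # loopback; hope the next mgr does better
    return ip
```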
Sage Weil [Tue, 25 May 2021 17:00:35 +0000 (13:00 -0400)]
mgr/dashboard,prometheus: new method of getting mgr IP
- Use a centralized method get_mgr_ip()
- Look up the hostname via DNS. This is a bit more reliable than
getfqdn() since it will work even when podman adds the container
name to /etc/hosts.
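One plausible shape for such a method, sketched in Python (an illustration only; the real module code differs):
```python
import socket

def get_mgr_ip() -> str:
    # Resolve our own hostname through the resolver rather than relying
    # on getfqdn(), which podman's /etc/hosts entries can confuse.
    hostname = socket.gethostname()
    return socket.getaddrinfo(hostname, None,
                              proto=socket.IPPROTO_TCP)[0][4][0]
```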
Sage Weil [Fri, 21 May 2021 17:31:31 +0000 (13:31 -0400)]
mgr/cephadm: use known host addr
If the host IP/addr is known, use that. The addr might even be an FQDN
instead of an IP address, in which case we want to look that up instead
of the bare hostname.
Sage Weil [Fri, 21 May 2021 16:32:49 +0000 (12:32 -0400)]
mgr/cephadm: resolve IP at 'orch host add' time
We prefer to always have a real IP for hosts in the cluster. This avoids
a reliance on DNS for most operations.
Perhaps more importantly, it means we are less sensitive to inconsistent
host lookup results, for example due to (1) mismatched /etc/hosts files
between machines, or (2) a lookup of the local hostname that returns
127.0.1.1.
Adjust with_hosts() fixture to take an addr, and adjust tests accordingly.
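A minimal sketch of the add-time resolution (a hypothetical function; the actual cephadm code differs): fail the 'orch host add' call rather than storing a bare name or a loopback address.
```python
import ipaddress
import socket

def resolve_host_addr(addr: str) -> str:
    """Return a real IP for 'orch host add', or raise."""
    try:
        ipaddress.ip_address(addr)
        return addr                        # the caller already gave an IP
    except ValueError:
        pass
    try:
        ip = socket.getaddrinfo(addr, None, proto=socket.IPPROTO_TCP)[0][4][0]
    except socket.gaierror as e:
        raise RuntimeError(f'cannot resolve host {addr!r}: {e}')
    if ipaddress.ip_address(ip).is_loopback:
        raise RuntimeError(f'{addr!r} resolves to loopback {ip}; '
                           'please pass an explicit IP')
    return ip
```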
Adam C. Emerson [Thu, 20 May 2021 23:19:55 +0000 (19:19 -0400)]
rgw: Simplify log shard probing and err on the side of omap
In the multigeneration version we no longer care whether entries
exist, since we never delete and recreate empty logs. Remove the logic
that marked entirely empty shards as DNE on the assumption that,
being empty, they would have been deleted.
Fixes: https://tracker.ceph.com/issues/50169
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
crimson/osd: extend lifetime of OpsExecuter to match all_completed.
f7181ab2f65803ecd8204f8f4f5aad4713b747f3 has optimized the client
parallelism. To achieve that, `PG::do_osd_ops()` was converted to
return what is basically a future of futures. Unfortunately, the
lifetime management of `OpsExecuter` was kept intact. As a result,
the object was valid only until the outer future was fulfilled while,
due to the `rollbacker` instances, it must remain alive until
`all_completed` becomes available.
This issue can explain the following problem that has been observed
in a Teuthology job [1].
The commit deals with the problem by repacking the outer future.
An alternative could be switching from `std::unique_ptr` to
`seastar::shared_ptr` for managing `OpsExecuter`.
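The lifetime rule is easier to see in a toy asyncio analogy (all names here are hypothetical; the actual fix is in crimson's C++): the executor must outlive the inner future, not merely the outer one.
```python
import asyncio

class OpsExecuterLike:
    """Stand-in for OpsExecuter: owns state the rollback path still needs."""
    def __init__(self) -> None:
        self.alive = True
    def release(self) -> None:
        self.alive = False

async def do_osd_ops(ox: OpsExecuterLike):
    # The outer stage finishes here and hands back the inner "all_completed".
    async def all_completed() -> None:
        assert ox.alive, 'rollback state used after executor teardown'
    return all_completed()

async def main() -> None:
    ox = OpsExecuterLike()
    inner = await do_osd_ops(ox)   # outer future fulfilled here
    # Releasing ox at this point would reproduce the bug described above;
    # the fix keeps it alive until the inner future resolves.
    await inner
    ox.release()

asyncio.run(main())
```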
Sage Weil [Thu, 6 May 2021 22:47:27 +0000 (18:47 -0400)]
mgr/nfs: take --ingress argument to 'nfs cluster create'
It is likely that the rook/k8s variation of ingress will not take a
virtual_ip argument. We want to make sure that ingress yes/no can be
specified independently of the virtual_ip.
Sage Weil [Thu, 6 May 2021 14:57:46 +0000 (10:57 -0400)]
cephadm: --stop-signal=SIGTERM
haproxy's container image tells docker|podman to send SIGUSR1 for a "clean"
shutdown. For NFS, the connections never close, so we will always hit the
podman|docker 10s timeout and get a SIGKILL. That, in turn, causes haproxy
to exit with 143, and puts the systemd unit in a failed state.
This highlights a general problem(?) with stopping containers: if they
don't stop quickly, then we end up in this error state. We don't directly
address that here.
Avoid this problem by always stopping containers with SIGTERM. In the
haproxy case, that means an immediate shutdown (no graceful drain of
open connections). In theory we could do this only for haproxy with
NFS, but we can easily imagine RGW connections that don't close in 10s
either, and we don't want containers exiting in an error state; we just
want the proxy to stop quickly.
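For illustration only (not cephadm's actual code), the change boils down to creating containers with an explicit stop signal; --stop-signal is a standard docker/podman flag:
```python
def container_run_args(image: str, name: str) -> list:
    """Build a podman run command line (simplified sketch)."""
    return [
        'podman', 'run', '-d', '--name', name,
        # Override the image's STOPSIGNAL (SIGUSR1 for haproxy) so that
        # 'stop' sends SIGTERM and we never hit the 10s SIGKILL path.
        '--stop-signal=SIGTERM',
        image,
    ]

print(' '.join(container_run_args('docker.io/library/haproxy', 'ceph-haproxy')))
```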
Sage Weil [Mon, 3 May 2021 15:48:45 +0000 (11:48 -0400)]
mgr/orchestrator: default nfs pool, namespaces
Apply nfs default pool (currently 'nfs-ganesha'), and default the
namespace to the service_id.
There is no practical reason for users to ever need to change this, and
requiring them to provide this information at config/apply time just
complicates life.
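A tiny sketch of the defaulting rule (the helper and constant names are illustrative):
```python
from typing import Optional, Tuple

DEFAULT_NFS_POOL = 'nfs-ganesha'   # the current default named above

def nfs_rados_location(service_id: str,
                       namespace: Optional[str] = None) -> Tuple[str, str]:
    # The pool is fixed; the namespace falls back to the service_id.
    return DEFAULT_NFS_POOL, namespace or service_id

assert nfs_rados_location('mynfs') == ('nfs-ganesha', 'mynfs')
```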
Sage Weil [Wed, 5 May 2021 16:59:44 +0000 (12:59 -0400)]
mgr/nfs: remove 'nfs cluster update'
This command is very awkward to implement unless all service spec fields
are always required. That will soon mean both the placement *and* the
virtual_ip (if any), making it much less convenient for a human to use.
Instead, let users update the yaml, or adjust the nfs and/or ingress specs
directly. I don't think this command is needed.
Sage Weil [Mon, 24 May 2021 21:03:33 +0000 (16:03 -0500)]
doc/cephfs/nfs: remove documented limitation
At the time NFS support was added, this limitation applied.
However, in
https://github.com/nfs-ganesha/nfs-ganesha/commit/b3d97f8157a131f55d848ff57e46af91b746b944
and
https://github.com/nfs-ganesha/nfs-ganesha/commit/1cfe7e2df96f9785367ba94d41559140f584a875
we added support for multiple filesystems and started mixing
the fscid into the filehandle.
crimson/osd: fix assertion failure in OpSequencer.
`OpSequencer` assumes that the ID of a previous client request
is always lower than the ID of the current one. This is reflected
by the assertion in `OpSequencer::start_op()`. It triggered
the following failure [1] in Teuthology:
```
DEBUG 2021-05-07 08:01:41,227 [shard 0] osd - client_request(id=1, detail=osd_op(client.4171.0:1 2.2 2.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e29) v8) same_interval_since: 31
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-3910-g1b18e076/rpm/el8/BUILD/ceph-17.0.0-3910-g1b18e076/src/crimson/osd/osd_operation_sequencer.h:38: seastar::futurize_t<Result> crimson::osd::OpSequencer::start_op(HandleT&, uint64_t, uint64_t, FuncT&&) [with HandleT = crimson::PipelineHandle; FuncT = crimson::interruptible::interruptor<InterruptCond>::wrap_function(Func&&) [with Func = crimson::osd::ClientRequest::start()::<lambda()> mutable::<lambda(Ref<crimson::osd::PG>)> mutable::<lambda()> mutable::<lambda()>; InterruptCond = crimson::osd::IOInterruptCondition]::<lambda()>; Result = crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, seastar::future<> >; seastar::futurize_t<Result> = crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, seastar::future<> >; uint64_t = long unsigned int]: Assertion `prev_op < this_op' failed.
Aborting on shard 0.
Backtrace:
Segmentation fault.
Backtrace:
0# 0x00005592B028932F in ceph-osd
1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
3# 0x00007F57B72E7B20 in /lib64/libpthread.so.0
4# gsignal in /lib64/libc.so.6
5# abort in /lib64/libc.so.6
6# 0x00007F57B58E2B09 in /lib64/libc.so.6
7# 0x00007F57B58F0DE6 in /lib64/libc.so.6
8# 0x00005592ABB8484D in ceph-osd
9# 0x00005592ABB8ACB3 in ceph-osd
10# seastar::continuation<seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >, seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<boost::intrusive_ptr<crimson::osd::PG> >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > >(seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >&&, seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>&, seastar::future_state<boost::intrusive_ptr<crimson::osd::PG> >&&)#1}, boost::intrusive_ptr<crimson::osd::PG> >::run_and_dispose() in ceph-osd
11# 0x00005592B357F88F in ceph-osd
12# 0x00005592B3584DD0 in ceph-osd
```
Crash analysis resulted in two observations:
1. during the request execution the acting set changed,
the request was interrupted, and an attempt to
re-execute it was made;
2. the interrupted request was the very first client
request the OSD had ever seen.
Code analysis showed a problem in how `ClientRequest`
establishes `prev_op_id`: although this is supposed to happen
only once per request, it can get executed twice, but only
for the very first request `OpSequencer` saw.
```cpp
void ClientRequest::may_set_prev_op()
{
  // set prev_op_id if it's not set yet
  if (__builtin_expect(prev_op_id == 0, true)) {
    prev_op_id = sequencer.get_last_issued();
  }
}
```
Unfortunately, `0` isn't a distinct value that cannot be
returned by `get_last_issued()`:
```cpp
// the id of last op which is issued
uint64_t last_issued = 0;
```
As a result, on the second call `OpSequencer` returned a
new value (actually `this_op`), violating the assertion.
The commit fixes the problem by switching from a designated
sentinel value to `std::optional`.
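In Python terms the fix amounts to the following (a sketch of the idea, not the C++ change itself): since 0 is a legitimate op id, "unset" needs an out-of-band marker.
```python
from typing import Optional

class ClientRequestLike:
    """Toy model: None (an optional), rather than 0, means "not set yet"."""
    def __init__(self, sequencer) -> None:
        self.sequencer = sequencer
        self.prev_op_id: Optional[int] = None   # was: 0 doubled as "unset"

    def may_set_prev_op(self) -> None:
        # Now runs at most once per request, even across re-execution,
        # because a real id of 0 can no longer look like "unset".
        if self.prev_op_id is None:
            self.prev_op_id = self.sequencer.get_last_issued()
```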
Kefu Chai [Mon, 24 May 2021 09:27:34 +0000 (17:27 +0800)]
ceph.in: put libasan.so path before other LD_PRELOAD paths
to ensure it is the first one to be preloaded, and to address errors
like the following:
==821517==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
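The ordering requirement can be sketched like this (an illustrative helper; the libasan path is an assumption, and ceph.in's real logic differs):
```python
import os

def prepend_ld_preload(lib: str) -> None:
    """Put lib first in LD_PRELOAD so the ASan runtime is loaded first."""
    rest = [p for p in os.environ.get('LD_PRELOAD', '').split(':')
            if p and p != lib]
    os.environ['LD_PRELOAD'] = ':'.join([lib] + rest)

prepend_ld_preload('/usr/lib64/libasan.so.6')   # path is an assumption
print(os.environ['LD_PRELOAD'])
```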
sunilkumarn417 [Wed, 19 May 2021 10:02:45 +0000 (15:32 +0530)]
qa/tasks/cephadm: Include bootstrap registry options for downstream
- The registry-url, registry-username and registry-password bootstrap options
are now supported. This is needed to access monitoring service container images.
- Use the RHEL-distribution-based cephadm in the download_cephadm task.
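For illustration, the options map onto 'cephadm bootstrap' flags roughly like this (the helper and example values are assumptions; the flags themselves are real bootstrap options):
```python
def registry_bootstrap_args(url: str, username: str, password: str) -> list:
    # Passed through to 'cephadm bootstrap' so the cluster can pull
    # monitoring images from an authenticated registry.
    return ['--registry-url', url,
            '--registry-username', username,
            '--registry-password', password]

print(registry_bootstrap_args('registry.example.com', 'user', 'secret'))
```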