crimson/os: fix use-after-free in AlienStore::get_attr().
The `FuturizedStore` interface imposes the `get_attr()`
takes the `name` parameter as `std::string_view`, and
thus burdens implementations with extending the life-
time of the data the instance refers to.
Unfortunately, `AlienStore` is unaware that prolonging
the life of a `std::string_view` instance doesn't prolong
the data memory it points to. This problem has manifested
in the following use-after-free detected at Sepia:
```
rzarzynski@teuthology:/home/teuthworker/archive/rzarzynski-2021-05-26_12:20:26-rados-master-distro-basic-smithi/6136929$ less ./remote/smithi194/log/ceph-osd.7.log.gz
...
DEBUG 2021-05-26 20:24:54,077 [shard 0] osd - do_osd_ops_execute: object 14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head - handling op
call
DEBUG 2021-05-26 20:24:54,077 [shard 0] osd - handling op call on object 14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head
DEBUG 2021-05-26 20:24:54,078 [shard 0] osd - calling method lock.lock, num_read=0, num_write=0
DEBUG 2021-05-26 20:24:54,078 [shard 0] osd - handling op getxattr on object 14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head
DEBUG 2021-05-26 20:24:54,078 [shard 0] osd - getxattr on obj=14:55e1a5b4:test-rados-api-smithi067-38889-2::foo:head for attr=_lock.TestLockPP1
DEBUG 2021-05-26 20:24:54,078 [shard 0] bluestore - get_attr
=================================================================
==34068==ERROR: AddressSanitizer: heap-use-after-free on address 0x6030001851d0 at pc 0x7f824d6a5b27 bp 0x7f822b4201c0 sp 0x7f822b41f968
READ of size 17 at 0x6030001851d0 thread T28 (alien-store-tp)
...
#0 0x7f824d6a5b26 (/lib64/libasan.so.5+0x40b26)
#1 0x55e2cbb2e00b (/usr/bin/ceph-osd+0x2b6dc00b)
#2 0x55e2d31f086e (/usr/bin/ceph-osd+0x32d9e86e)
#3 0x55e2d3467607 in crimson::os::ThreadPool::loop(std::chrono::duration<long, std::ratio<1l, 1000l> >, unsigned long) (/usr/bin/ceph-osd+0x33015607)
#4 0x55e2d346b14a (/usr/bin/ceph-osd+0x3301914a)
#5 0x7f8249d32ba2 (/lib64/libstdc++.so.6+0xc2ba2)
#6 0x7f824a00d149 in start_thread (/lib64/libpthread.so.0+0x8149)
#7 0x7f82486edf22 in clone (/lib64/libc.so.6+0xfcf22)
0x6030001851d0 is located 0 bytes inside of 31-byte region [0x6030001851d0,0x6030001851ef)
freed by thread T0 here:
#0 0x7f824d757688 in operator delete(void*) (/lib64/libasan.so.5+0xf2688)
previously allocated by thread T0 here:
#0 0x7f824d7567b0 in operator new(unsigned long) (/lib64/libasan.so.5+0xf17b0)
Thread T28 (alien-store-tp) created by T0 here:
#0 0x7f824d6b7ea3 in __interceptor_pthread_create (/lib64/libasan.so.5+0x52ea3)
SUMMARY: AddressSanitizer: heap-use-after-free (/lib64/libasan.so.5+0x40b26)
Shadow bytes around the buggy address:
0x0c06800289e0: fd fd fd fa fa fa fd fd fd fa fa fa 00 00 00 fa
0x0c06800289f0: fa fa fd fd fd fa fa fa fd fd fd fa fa fa fd fd
0x0c0680028a00: fd fa fa fa fd fd fd fa fa fa fd fd fd fa fa fa
0x0c0680028a10: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa
0x0c0680028a20: fa fa fd fd fd fa fa fa fd fd fd fa fa fa fd fd
=>0x0c0680028a30: fd fd fa fa fd fd fd fd fa fa[fd]fd fd fd fa fa
0x0c0680028a40: fd fd fd fd fa fa fd fd fd fd fa fa 00 00 00 07
0x0c0680028a50: fa fa 00 00 00 fa fa fa 00 00 00 fa fa fa fd fd
0x0c0680028a60: fd fd fa fa fd fd fd fd fa fa fd fd fd fd fa fa
0x0c0680028a70: 00 00 00 00 fa fa fd fd fd fd fa fa fd fd fd fd
0x0c0680028a80: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==34068==ABORTING
```
Kefu Chai [Sat, 29 May 2021 08:24:59 +0000 (16:24 +0800)]
crimson/os/alienstore: do not cleanup if not started
there is chance stop() and umount() methods get called even if start()
is not called in the error handling path. in that case, just make these
methods no-op. to ensure that OSD behaves in that case.
Kefu Chai [Sat, 29 May 2021 08:03:50 +0000 (16:03 +0800)]
crimson/os/alienstore: create tp in AlienStore::start()
thread pool is not needed until AlienStore::start(). with this change,
we are able to tell if the AlienStore is actually started or not in
AlienStore::stop().
as seastar::sharded<Service> start a service in two phases:
1. construct the shard instances
2. actually start them
and it stops a service in a single shot, which both stops the services
and destructs the service instance(s).
so we have to implement a proper stop() method for services whose
start() might not be called after its instance is created by
seastar::sharded<Service>::start() in case of error handling or if
we just don't want to call start().
to ensure we can skip the steps to clean up the stuff created by
start(), we need to have a flag in the sharded service, because
AlienStore is a member variable of OSD, and when we do mkfs, AlienStore
is not start()'ed, and as explained above, we have to call OSD::stop()
to ensure OSD instance is destructed properly. but OSD::stop()
calls store->umount() and store->stop() unconditionally. these methods
in AlienStore rely on a functional thread pool.
fortunately, we don't need to call these methods if the store is never
mounted or started. in a case of failed "mkfs", store is not mounted at
all but the store and osd instances are created.
so, in this change, thread pool is created in AlienStore::start(), and
we will use it to tell if AlienStore is started or not in the following
change which makes the related method no-op if AlienStore is not started
yet.
also, postpone the creation of `store` until in AlienStore::start(), so
we don't need to destroy it in the dtor of AlienStore. otherwise,
BlueStore::~BlueStore() would need to reference resources which are only
available in alien threads, but when OSD::~OSD() is called, we are in
seastar's reactor.
Kefu Chai [Sat, 29 May 2021 06:51:09 +0000 (14:51 +0800)]
crimson/osd/main: catch exception thrown in the async() call
* use seastar::app_template::run() instead of
seastar::app_template::run_deprecated() for returning int,
instead of returning `void`. so the application can return
int explicitly in the continuation passed to run(). more
readable this way.
* wrap the all the block in run() in a giant try-catch block,
so the exceptions thrown by the startup code can be captured
and handled.
* do not capture the exceptions individually, in the try-catch
block anymore. the outer catch block takes care of them.
this change improves the error handling when crimson-osd launches.
with the recent support for async rbd operations from pacific+ when an
older client(non async support) goes on upgrade, and simultaneously
interacts with a newer client which expects the requests to be async,
experiences hang; considering the return code for request completion to
be acknowledgement for async request, which then keeps waiting for
another acknowledgement of request completion.
this if happens should be a rare only when lockowner is an old client
and should be deferred if compatibility issues arises.
Sage Weil [Thu, 27 May 2021 23:14:53 +0000 (19:14 -0400)]
Merge PR #41483 into master
* refs/pull/41483/head:
cephadm: stop passing --no-hosts to podman
mgr/nfs: use host.addr for backend IP where possible
mgr/cephadm: convert host addr if non-IP to IP
mgr/dashboard,prometheus: new method of getting mgr IP
doc/cephadm: remove any reference to the use of DNS or /etc/hosts
mgr/cephadm: use known host addr
mgr/cephadm: resolve IP at 'orch host add' time
Sage Weil [Tue, 25 May 2021 17:55:08 +0000 (13:55 -0400)]
cephadm: stop passing --no-hosts to podman
This reverts cfc1f914ce74f1fd1f45e2efd3ba2ddcb2da129a, which is no longer
neceesary because (1) we don't use socket.getfqdn(), and (2) we generally
do not rely on DNS or /etc/hosts at all anymore (with the exception of
the upgrade transition).
Sage Weil [Tue, 25 May 2021 20:10:49 +0000 (16:10 -0400)]
mgr/cephadm: convert host addr if non-IP to IP
Previously we allowed the host.addr to be a DNS name (short or fqdn).
This is problematic because of the inconsistent way that docker and podman
handle /etc/hosts, and undesirable because relying on external DNS is
an external source of failure for the cluster without any benefit in
return (simply updating DNS is not sufficient to make ceph behave).
So: update any non-IP to an IP as soon as we start up (presumably on
upgrade). If we get a loopback address (127.0.0.1 or 127.0.1.1), then
wait and hope that the next instance of the manager has better luck.
Sage Weil [Tue, 25 May 2021 17:00:35 +0000 (13:00 -0400)]
mgr/dashboard,prometheus: new method of getting mgr IP
- Use a centralized method get_mgr_ip()
- Look up the hostname via DNS. This is a bit more reliable than
getfqdn() since it will work even when podman adds the container
name to /etc/hosts.
Sage Weil [Fri, 21 May 2021 17:31:31 +0000 (13:31 -0400)]
mgr/cephadm: use known host addr
If the host IP/addr is known, use that. The addr might even be a FQDN
instead of an IP address, in which case we want to look that up instead
of the bare hostname.
Kefu Chai [Thu, 27 May 2021 14:26:05 +0000 (22:26 +0800)]
os/bluestore: pass string_view to ctor of Allocator
just for the sake of correctness, as they don't need a full-blown
std::string, what they need is but a string like object. and they always
create a std::string instance as a member variable if they want to have
a copy of it.
I found that the difference between "rbd cp" and "rbd deep cp",
i.e. what "deep" means in this context, is documented only in
the mailing list archive and in the Mimic reelase notes.
Let's make the difference explicit in the manpage and in rbd --help.
Signed-off-by: Jan "Yenya" Kasprzak <kas@fi.muni.cz>
Zac Dover [Thu, 27 May 2021 01:28:38 +0000 (11:28 +1000)]
doc/cephadm: enrich "service status"
This PR improves the syntax of the "Service
Status" section of the "Service Managment"
section of the cephadm guide. This includes
pretty significant reworking of the information
in the section, so vetting this one might be
annoying. Anyway, I think I've lowered the
cognitive load on the reader.
Sage Weil [Fri, 21 May 2021 16:32:49 +0000 (12:32 -0400)]
mgr/cephadm: resolve IP at 'orch host add' time
We prefer to always have a real IP for hosts in the cluster. This avoids
a reliance on DNS for most operations.
Perhaps more importantly, it means we are less sensitive to inconsistent
host lookup results, for example due to (1) mismatched /etc/hosts files
between machines, or (2) a lookup of the local hostname that returns
127.0.1.1.
Adjust with_hosts() fixture to take an addr, and adjust tests accordingly.
so work is allocated on stack, but the alignment of the
crush_work_bucket struct is not taken into consideration, so in
crush_init_workspace(), point could point to an address which is not
aligned to 8 bytes, which is the alignment of crush_work_bucket by
default. so is its member variables, all of them are uint32_t, and hence
are also 8-bytes aligned.
to ensure the compiler generate the correct assembly for accessing
the member variables without assuming that the struct is 8-byte
aligned, we should specify the alignment explicitly.
in this change, `__attribute__ ((packed))` is specified for
crush_work_bucket, so that its alignment is 1.
this issue is spotted by ASan, it complains like:
../src/crush/mapper.c:881:22: runtime error: member access within misaligned address 0x7ffe051f90dc for type 'struct crush_work_bucket', which requires 8 byte alignment
0x7ffe051f90dc: note: pointer points here
1d e5 77 3d 68 55 00 00 00 00 00 00 00 00 00 00 20 93 1f 05 fe 7f 00 00 10 91 1f 05 fe 7f 00 00
^
../src/crush/mapper.c:882:22: runtime error: member access within misaligned address 0x7ffe051f90dc for type 'struct crush_work_bucket', which requires 8 byte alignment
0x7ffe051f90dc: note: pointer points here
1d e5 77 3d 00 00 00 00 00 00 00 00 00 00 00 00 20 93 1f 05 fe 7f 00 00 10 91 1f 05 fe 7f 00 00
^
../src/crush/mapper.c:883:20: runtime error: member access within misaligned address 0x7ffe051f90dc for type 'struct crush_work_bucket', which requires 8 byte alignment
0x7ffe051f90dc: note: pointer points here
1d e5 77 3d 00 00 00 00 00 00 00 00 00 00 00 00 20 93 1f 05 fe 7f 00 00 10 91 1f 05 fe 7f 00 00
^