"If multiple threads of execution access the same std::shared_ptr
object without synchronization and any of those accesses uses
a non-const member function of shared_ptr then a data race will
occur (...)"
One of the coredumps showed the `shared_ptr`-typed `OSD::osdmap`
with healthy-looking content but a damaged control block:
```
[Current thread is 1 (Thread 0x7f7dcaf73700 (LWP 205295))]
(gdb) bt
#0 0x0000559cb81c3ea0 in ?? ()
#1 0x0000559c97675b27 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
#2 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
#3 0x0000559c975ef8aa in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#4 std::__shared_ptr<OSDMap const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#5 std::shared_ptr<OSDMap const>::~shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr.h:103
#6 OSD::create_context (this=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9053
#7 0x0000559c97655571 in OSD::dequeue_peering_evt (this=0x559ca22ac000, sdata=0x559ca2ef2900, pg=0x559cb4aa3400, evt=std::shared_ptr<PGPeeringEvent> (use count 2, weak count 0) = {...}, handle=...)
at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9665
#8 0x0000559c97886db6 in ceph::osd::scheduler::PGPeeringItem::run (this=<optimized out>, osd=<optimized out>, sdata=<optimized out>, pg=..., handle=...) at /usr/include/c++/8/ext/atomicity.h:96
#9 0x0000559c9764862f in ceph::osd::scheduler::OpSchedulerItem::run (handle=..., pg=..., sdata=<optimized out>, osd=<optimized out>, this=0x7f7dcaf703f0) at /usr/include/c++/8/bits/unique_ptr.h:342
#10 OSD::ShardedOpWQ::_process (this=<optimized out>, thread_index=<optimized out>, hb=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:10677
#11 0x0000559c97c76094 in ShardedThreadPool::shardedthreadpool_worker (this=0x559ca22aca28, thread_index=14) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WorkQueue.cc:311
#12 0x0000559c97c78cf4 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WorkQueue.h:706
#13 0x00007f7df17852de in start_thread () from /lib64/libpthread.so.0
#14 0x00007f7df052f133 in __libc_ifunc_impl_list () from /lib64/libc.so.6
#15 0x0000000000000000 in ?? ()
(gdb) frame 7
#7 0x0000559c97655571 in OSD::dequeue_peering_evt (this=0x559ca22ac000, sdata=0x559ca2ef2900, pg=0x559cb4aa3400, evt=std::shared_ptr<PGPeeringEvent> (use count 2, weak count 0) = {...}, handle=...)
at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9665
9665 in /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc
(gdb) print osdmap
$24 = std::shared_ptr<const OSDMap> (expired, weak count 0) = {get() = 0x559cba028000}
(gdb) print *osdmap
# pretty sane OSDMap
(gdb) print sizeof(osdmap)
$26 = 16
(gdb) x/2a &osdmap
0x559ca22acef0: 0x559cba028000 0x559cba0ec900
(gdb) frame 2
#2 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
148 /usr/include/c++/8/bits/shared_ptr_base.h: No such file or directory.
(gdb) disassemble
Dump of assembler code for function std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release():
...
0x0000559c97675b1e <+62>: mov (%rdi),%rax
0x0000559c97675b21 <+65>: mov %rdi,%rbx
0x0000559c97675b24 <+68>: callq *0x10(%rax)
=> 0x0000559c97675b27 <+71>: test %rbp,%rbp
...
End of assembler dump.
(gdb) info registers rdi rbx rax
rdi 0x559cba0ec900 94131624790272
rbx 0x559cba0ec900 94131624790272
rax 0x559cba0ec8a0 94131624790176
(gdb) x/a 0x559cba0ec8a0 + 0x10
0x559cba0ec8b0: 0x559cb81c3ea0
(gdb) bt
#0 0x0000559cb81c3ea0 in ?? ()
...
(gdb) p $_siginfo._sifields._sigfault.si_addr
$27 = (void *) 0x559cb81c3ea0
```
Helgrind seems to agree:
```
==00:00:02:54.519 510301== Possible data race during write of size 8 at 0xF123930 by thread #90
==00:00:02:54.519 510301== Locks held: 2, at addresses 0xF122A58 0xF1239A8
==00:00:02:54.519 510301== at 0x7218DD: operator= (shared_ptr_base.h:1078)
==00:00:02:54.519 510301== by 0x7218DD: operator= (shared_ptr.h:103)
==00:00:02:54.519 510301== by 0x7218DD: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8116)
==00:00:02:54.519 510301== by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
==00:00:02:54.519 510301== by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:02:54.519 510301== by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
==00:00:02:54.519 510301== by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:02:54.519 510301== by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:02:54.519 510301== by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
==00:00:02:54.519 510301==
==00:00:02:54.519 510301== This conflicts with a previous read of size 8 by thread #117
==00:00:02:54.519 510301== Locks held: 1, at address 0x2123E9A0
==00:00:02:54.519 510301== at 0x6B5842: __shared_ptr (shared_ptr_base.h:1165)
==00:00:02:54.519 510301== by 0x6B5842: shared_ptr (shared_ptr.h:129)
==00:00:02:54.519 510301== by 0x6B5842: get_osdmap (OSD.h:1700)
==00:00:02:54.519 510301== by 0x6B5842: OSD::create_context() (OSD.cc:9053)
==00:00:02:54.519 510301== by 0x71B570: OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&) (OSD.cc:9665)
==00:00:02:54.519 510301== by 0x71B997: OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&) (OSD.cc:9701)
==00:00:02:54.519 510301== by 0x70E62E: run (OpSchedulerItem.h:148)
==00:00:02:54.519 510301== by 0x70E62E: OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) (OSD.cc:10677)
==00:00:02:54.519 510301== by 0xD3C093: ShardedThreadPool::shardedthreadpool_worker(unsigned int) (WorkQueue.cc:311)
==00:00:02:54.519 510301== by 0xD3ECF3: ShardedThreadPool::WorkThreadSharded::entry() (WorkQueue.h:706)
==00:00:02:54.519 510301== by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:02:54.519 510301== by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:02:54.519 510301== Address 0xf123930 is 3,824 bytes inside a block of size 10,296 alloc'd
==00:00:02:54.519 510301== at 0xA7DC0C3: operator new[](unsigned long) (vg_replace_malloc.c:433)
==00:00:02:54.519 510301== by 0x66F766: main (ceph_osd.cc:688)
==00:00:02:54.519 510301== Block was alloc'd by thread #1
```
Actually, there are plenty of similar issues reported, like:
```
==00:00:05:04.903 510301== Possible data race during read of size 8 at 0x1E3E0588 by thread #119
==00:00:05:04.903 510301== Locks held: 1, at address 0x1EAD41D0
==00:00:05:04.903 510301== at 0x753165: clear (hashtable.h:2051)
==00:00:05:04.903 510301== by 0x753165: std::_Hashtable&lt;entity_addr_t, std::pair&lt;entity_addr_t const, utime_t&gt;, mempool::pool_allocator&lt;(mempool::pool_index_t)15, std::pair&lt;entity_addr_t const, utime_t&gt; &gt;, std::__detail::_Select1st, std::equal_to&lt;entity_addr_t&gt;, std::hash&lt;entity_addr_t&gt;, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits&lt;true, false, true&gt; &gt;::~_Hashtable() (hashtable.h:1369)
==00:00:05:04.903 510301== by 0x75331C: ~unordered_map (unordered_map.h:102)
==00:00:05:04.903 510301== by 0x75331C: OSDMap::~OSDMap() (OSDMap.h:350)
==00:00:05:04.903 510301== by 0x753606: operator() (shared_cache.hpp:100)
==00:00:05:04.903 510301== by 0x753606: std::_Sp_counted_deleter&lt;OSDMap const*, SharedLRU&lt;unsigned int, OSDMap const&gt;::Cleanup, std::allocator&lt;void&gt;, (__gnu_cxx::_Lock_policy)2&gt;::_M_dispose() (shared_ptr_base.h:471)
==00:00:05:04.903 510301== by 0x73BB26: _M_release (shared_ptr_base.h:155)
==00:00:05:04.903 510301== by 0x73BB26: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:148)
==00:00:05:04.903 510301== by 0x6B58A9: ~__shared_count (shared_ptr_base.h:728)
==00:00:05:04.903 510301== by 0x6B58A9: ~__shared_ptr (shared_ptr_base.h:1167)
==00:00:05:04.903 510301== by 0x6B58A9: ~shared_ptr (shared_ptr.h:103)
==00:00:05:04.903 510301== by 0x6B58A9: OSD::create_context() (OSD.cc:9053)
==00:00:05:04.903 510301== by 0x71B570: OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&) (OSD.cc:9665)
==00:00:05:04.903 510301== by 0x71B997: OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&) (OSD.cc:9701)
==00:00:05:04.903 510301== by 0x70E62E: run (OpSchedulerItem.h:148)
==00:00:05:04.903 510301== by 0x70E62E: OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) (OSD.cc:10677)
==00:00:05:04.903 510301== by 0xD3C093: ShardedThreadPool::shardedthreadpool_worker(unsigned int) (WorkQueue.cc:311)
==00:00:05:04.903 510301== by 0xD3ECF3: ShardedThreadPool::WorkThreadSharded::entry() (WorkQueue.h:706)
==00:00:05:04.903 510301== by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:05:04.903 510301== by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:05:04.903 510301== by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
==00:00:05:04.903 510301==
==00:00:05:04.903 510301== This conflicts with a previous write of size 8 by thread #90
==00:00:05:04.903 510301== Locks held: 2, at addresses 0xF122A58 0xF1239A8
==00:00:05:04.903 510301== at 0x7531E1: clear (hashtable.h:2054)
==00:00:05:04.903 510301== by 0x7531E1: std::_Hashtable<entity_addr_t, std::pair<entity_addr_t const, utime_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<entity_addr_t const, utime_t> >, std::__detail::_Select1st, std::equal_to<entity_addr_t>, std::hash<entity_addr_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() (hashtable.h:1369)
==00:00:05:04.903 510301== by 0x75331C: ~unordered_map (unordered_map.h:102)
==00:00:05:04.903 510301== by 0x75331C: OSDMap::~OSDMap() (OSDMap.h:350)
==00:00:05:04.903 510301== by 0x753606: operator() (shared_cache.hpp:100)
==00:00:05:04.903 510301== by 0x753606: std::_Sp_counted_deleter<OSDMap const*, SharedLRU<unsigned int, OSDMap const>::Cleanup, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:471)
==00:00:05:04.903 510301== by 0x73BB26: _M_release (shared_ptr_base.h:155)
==00:00:05:04.903 510301== by 0x73BB26: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:148)
==00:00:05:04.903 510301== by 0x72191E: operator= (shared_ptr_base.h:747)
==00:00:05:04.903 510301== by 0x72191E: operator= (shared_ptr_base.h:1078)
==00:00:05:04.903 510301== by 0x72191E: operator= (shared_ptr.h:103)
==00:00:05:04.903 510301== by 0x72191E: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8116)
==00:00:05:04.903 510301== by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
==00:00:05:04.903 510301== by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:05:04.903 510301== by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
==00:00:05:04.903 510301== Address 0x1e3e0588 is 872 bytes inside a block of size 1,208 alloc'd
==00:00:05:04.903 510301== at 0xA7DC0C3: operator new[](unsigned long) (vg_replace_malloc.c:433)
==00:00:05:04.903 510301== by 0x6C7C0C: OSDService::try_get_map(unsigned int) (OSD.cc:1606)
==00:00:05:04.903 510301== by 0x7213BD: get_map (OSD.h:699)
==00:00:05:04.903 510301== by 0x7213BD: get_map (OSD.h:1732)
==00:00:05:04.903 510301== by 0x7213BD: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8076)
==00:00:05:04.903 510301== by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
==00:00:05:04.903 510301== by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:05:04.903 510301== by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
==00:00:05:04.903 510301== by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:05:04.903 510301== by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:05:04.903 510301== by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
```
Josh Durgin [Sat, 1 Feb 2020 19:00:24 +0000 (14:00 -0500)]
mgr/pg_autoscaler: correct and simplify progress tracking
Reset the progress each time we make an adjustment, and track progress
from that initial state to that new target. Previously we were also
using the wrong target: the current pg_num_target, not the new value
(pg_num_final) that we set.
Look up the pool by name, not id, in _maybe_adjust(), since that is how it is
retrieved by osdmap.get_pools_by_name().
Dedupe some logic into PgAdjustmentProgress to simplify things.
Sage Weil [Sun, 9 Feb 2020 19:55:27 +0000 (13:55 -0600)]
Merge PR #33133 into master
* refs/pull/33133/head:
qa/workunits/cephadm/test_cephadm.sh: make monitoring tests faster
qa/workunits/cephadm/test_cephadm: 2 OSDs is enough
cephadm: disable node-exporter cpu/memory limits for the time being
Sage Weil [Sun, 9 Feb 2020 15:40:10 +0000 (09:40 -0600)]
Merge PR #33117 into master
* refs/pull/33117/head:
qa/suites/upgrade/nautilus-x-singleton: ensure hit sets behave across upgrade
osd/PrimaryLogPG: use legacy timestamp rendering for hit_set objects
include/utime: allow legacy rendering of timestamp
* refs/pull/32816/head:
mds: check inode type when deciding if filelock should be in EXCL state
mds: don't delegate inos when handling replayed requests
mds: process re-sent async dir operations at clientreplay stage
mds: consider async dirops when checking directory empty
mds: always suppress issuing caps in Locker::issue_new_caps()
mds: try reconnect cap only when replayed request creates new inode
mds: set cap id to 1 for newly created inode
Reviewed-by: Jeff Layton &lt;jlayton@redhat.com&gt;
Reviewed-by: Patrick Donnelly &lt;pdonnell@redhat.com&gt;
Sage Weil [Fri, 7 Feb 2020 19:37:47 +0000 (13:37 -0600)]
Merge PR #33070 into master
* refs/pull/33070/head:
mgr/telemetry: use raise_for_status()
mgr/telemetry: factor post into helper
mgr/telemetry: catch exception during requests.put
Sage Weil [Wed, 5 Feb 2020 23:19:56 +0000 (17:19 -0600)]
mgr/orch: HostSpec -> HostPlacementSpec
This object is about describing where to place a service on a host: it
includes a host name and either an IP or network and possibly even a name
for the service.
Sage Weil [Fri, 7 Feb 2020 19:27:13 +0000 (13:27 -0600)]
Merge PR #33062 into master
* refs/pull/33062/head:
qa/suites/rados/cephadm: collect all cephadm tests together here
qa/workunits/cephadm/test_repos: add test for the repo commands
cephadm: add '{add,rm}-repo', with initial centos/rhel/fedora/ubuntu support
Sage Weil [Fri, 7 Feb 2020 19:26:54 +0000 (13:26 -0600)]
Merge PR #33119 into master
* refs/pull/33119/head:
mgr/upgrade: fix mgr self check
mgr/cephadm: fail upgrade if target image doesn't exist
mgr/cephadm: factor upgrade failure into helper
mgr/cephadm: refresh if we don't know a daemon's image_id
mgr/cephadm: refresh services in upgrade loop
mgr/cephadm: clean up upgrade messages a bit
ceph-volume: add unit test test_safe_prepare_osd_already_created
This commit adds a new unit test,
`test_safe_prepare_osd_already_created()`, in order to verify that a
`RuntimeError` is raised when `is_ceph_device()` returns `True`.
Kefu Chai [Fri, 7 Feb 2020 14:44:53 +0000 (22:44 +0800)]
pybind/ceph_argparse: avoid int overflow
in Python 3.6.8, `thread.join(timeout)` tries to convert the given
timeout to PyTime, but it turns out that `2 << 32` overflows when the
Python runtime converts the timeout from seconds to nanoseconds.
That's why the `lock.acquire()` call always fails in
`Thread._wait_for_tstate_lock()`,
and we end up with an alive thread after calling `thread.join()`.
Jan Fajerski [Mon, 6 Jan 2020 17:02:57 +0000 (18:02 +0100)]
ceph-volume: add available property in target specific flavors
This adds two properties available_[lvm,raw] to device (and thus inventory).
The goal is to have different notions of availability based on the
intended use case. For example finding LVM structures make a drive
unavailable for the raw mode, but might be available for the lvm mode.
Fixes: https://tracker.ceph.com/issues/43400
Signed-off-by: Jan Fajerski &lt;jfajerski@suse.com&gt;
Sage Weil [Thu, 6 Feb 2020 15:18:13 +0000 (09:18 -0600)]
cephadm: fix inspect-image
This was broken by d8debba782cd4f40ed13db7f1af8ef43503ccec5
because the 'images' json output works with podman but not with
docker. (Also, the inspect command is more explicit and cleaner.)