"If multiple threads of execution access the same std::shared_ptr
object without synchronization and any of those accesses uses
a non-const member function of shared_ptr then a data race will
occur (...)"
One of the coredumps showed the `shared_ptr`-typed `OSD::osdmap`
with healthy-looking content but a damaged control block:
```
[Current thread is 1 (Thread 0x7f7dcaf73700 (LWP 205295))]
(gdb) bt
#0 0x0000559cb81c3ea0 in ?? ()
#1 0x0000559c97675b27 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
#2 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
#3 0x0000559c975ef8aa in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#4 std::__shared_ptr<OSDMap const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr_base.h:1167
#5 std::shared_ptr<OSDMap const>::~shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/8/bits/shared_ptr.h:103
#6 OSD::create_context (this=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9053
#7 0x0000559c97655571 in OSD::dequeue_peering_evt (this=0x559ca22ac000, sdata=0x559ca2ef2900, pg=0x559cb4aa3400, evt=std::shared_ptr<PGPeeringEvent> (use count 2, weak count 0) = {...}, handle=...)
at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9665
#8 0x0000559c97886db6 in ceph::osd::scheduler::PGPeeringItem::run (this=<optimized out>, osd=<optimized out>, sdata=<optimized out>, pg=..., handle=...) at /usr/include/c++/8/ext/atomicity.h:96
#9 0x0000559c9764862f in ceph::osd::scheduler::OpSchedulerItem::run (handle=..., pg=..., sdata=<optimized out>, osd=<optimized out>, this=0x7f7dcaf703f0) at /usr/include/c++/8/bits/unique_ptr.h:342
#10 OSD::ShardedOpWQ::_process (this=<optimized out>, thread_index=<optimized out>, hb=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:10677
#11 0x0000559c97c76094 in ShardedThreadPool::shardedthreadpool_worker (this=0x559ca22aca28, thread_index=14) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WorkQueue.cc:311
#12 0x0000559c97c78cf4 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/common/WorkQueue.h:706
#13 0x00007f7df17852de in start_thread () from /lib64/libpthread.so.0
#14 0x00007f7df052f133 in __libc_ifunc_impl_list () from /lib64/libc.so.6
#15 0x0000000000000000 in ?? ()
(gdb) frame 7
#7 0x0000559c97655571 in OSD::dequeue_peering_evt (this=0x559ca22ac000, sdata=0x559ca2ef2900, pg=0x559cb4aa3400, evt=std::shared_ptr<PGPeeringEvent> (use count 2, weak count 0) = {...}, handle=...)
at /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc:9665
9665 in /usr/src/debug/ceph-15.0.0-10071.g5b5a3a3.el8.x86_64/src/osd/OSD.cc
(gdb) print osdmap
$24 = std::shared_ptr<const OSDMap> (expired, weak count 0) = {get() = 0x559cba028000}
(gdb) print *osdmap
# pretty sane OSDMap
(gdb) print sizeof(osdmap)
$26 = 16
(gdb) x/2a &osdmap
0x559ca22acef0: 0x559cba028000 0x559cba0ec900
(gdb) frame 2
#2 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x559cba0ec900) at /usr/include/c++/8/bits/shared_ptr_base.h:148
148 /usr/include/c++/8/bits/shared_ptr_base.h: No such file or directory.
(gdb) disassemble
Dump of assembler code for function std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release():
...
0x0000559c97675b1e <+62>: mov (%rdi),%rax
0x0000559c97675b21 <+65>: mov %rdi,%rbx
0x0000559c97675b24 <+68>: callq *0x10(%rax)
=> 0x0000559c97675b27 <+71>: test %rbp,%rbp
...
End of assembler dump.
(gdb) info registers rdi rbx rax
rdi 0x559cba0ec900 94131624790272
rbx 0x559cba0ec900 94131624790272
rax 0x559cba0ec8a0 94131624790176
(gdb) x/a 0x559cba0ec8a0 + 0x10
0x559cba0ec8b0: 0x559cb81c3ea0
(gdb) bt
#0 0x0000559cb81c3ea0 in ?? ()
...
(gdb) p $_siginfo._sifields._sigfault.si_addr
$27 = (void *) 0x559cb81c3ea0
```
Helgrind seems to agree:
```
==00:00:02:54.519 510301== Possible data race during write of size 8 at 0xF123930 by thread #90
==00:00:02:54.519 510301== Locks held: 2, at addresses 0xF122A58 0xF1239A8
==00:00:02:54.519 510301== at 0x7218DD: operator= (shared_ptr_base.h:1078)
==00:00:02:54.519 510301== by 0x7218DD: operator= (shared_ptr.h:103)
==00:00:02:54.519 510301== by 0x7218DD: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8116)
==00:00:02:54.519 510301== by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
==00:00:02:54.519 510301== by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:02:54.519 510301== by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
==00:00:02:54.519 510301== by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:02:54.519 510301== by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:02:54.519 510301== by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
==00:00:02:54.519 510301==
==00:00:02:54.519 510301== This conflicts with a previous read of size 8 by thread #117
==00:00:02:54.519 510301== Locks held: 1, at address 0x2123E9A0
==00:00:02:54.519 510301== at 0x6B5842: __shared_ptr (shared_ptr_base.h:1165)
==00:00:02:54.519 510301== by 0x6B5842: shared_ptr (shared_ptr.h:129)
==00:00:02:54.519 510301== by 0x6B5842: get_osdmap (OSD.h:1700)
==00:00:02:54.519 510301== by 0x6B5842: OSD::create_context() (OSD.cc:9053)
==00:00:02:54.519 510301== by 0x71B570: OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&) (OSD.cc:9665)
==00:00:02:54.519 510301== by 0x71B997: OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&) (OSD.cc:9701)
==00:00:02:54.519 510301== by 0x70E62E: run (OpSchedulerItem.h:148)
==00:00:02:54.519 510301== by 0x70E62E: OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) (OSD.cc:10677)
==00:00:02:54.519 510301== by 0xD3C093: ShardedThreadPool::shardedthreadpool_worker(unsigned int) (WorkQueue.cc:311)
==00:00:02:54.519 510301== by 0xD3ECF3: ShardedThreadPool::WorkThreadSharded::entry() (WorkQueue.h:706)
==00:00:02:54.519 510301== by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:02:54.519 510301== by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:02:54.519 510301== Address 0xf123930 is 3,824 bytes inside a block of size 10,296 alloc'd
==00:00:02:54.519 510301== at 0xA7DC0C3: operator new[](unsigned long) (vg_replace_malloc.c:433)
==00:00:02:54.519 510301== by 0x66F766: main (ceph_osd.cc:688)
==00:00:02:54.519 510301== Block was alloc'd by thread #1
```
Actually, there are plenty of similar issues reported, like:
```
==00:00:05:04.903 510301== Possible data race during read of size 8 at 0x1E3E0588 by thread #119
==00:00:05:04.903 510301== Locks held: 1, at address 0x1EAD41D0
==00:00:05:04.903 510301== at 0x753165: clear (hashtable.h:2051)
==00:00:05:04.903 510301== by 0x753165: std::_Hashtable&lt;entity_addr_t, std::pair&lt;entity_addr_t const, utime_t&gt;, mempool::pool_allocator&lt;(mempool::pool_index_t)15, std::pair&lt;entity_addr_t const, utime_t&gt; &gt;, std::__detail::_Select1st, std::equal_to&lt;entity_addr_t&gt;, std::hash&lt;entity_addr_t&gt;, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits&lt;true, false, true&gt; &gt;::~_Hashtable() (hashtable.h:1369)
==00:00:05:04.903 510301== by 0x75331C: ~unordered_map (unordered_map.h:102)
==00:00:05:04.903 510301== by 0x75331C: OSDMap::~OSDMap() (OSDMap.h:350)
==00:00:05:04.903 510301== by 0x753606: operator() (shared_cache.hpp:100)
==00:00:05:04.903 510301== by 0x753606: std::_Sp_counted_deleter&lt;OSDMap const*, SharedLRU&lt;unsigned int, OSDMap const&gt;::Cleanup, std::allocator&lt;void&gt;, (__gnu_cxx::_Lock_policy)2&gt;::_M_dispose() (shared_ptr_base.h:471)
==00:00:05:04.903 510301== by 0x73BB26: _M_release (shared_ptr_base.h:155)
==00:00:05:04.903 510301== by 0x73BB26: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:148)
==00:00:05:04.903 510301== by 0x6B58A9: ~__shared_count (shared_ptr_base.h:728)
==00:00:05:04.903 510301== by 0x6B58A9: ~__shared_ptr (shared_ptr_base.h:1167)
==00:00:05:04.903 510301== by 0x6B58A9: ~shared_ptr (shared_ptr.h:103)
==00:00:05:04.903 510301== by 0x6B58A9: OSD::create_context() (OSD.cc:9053)
==00:00:05:04.903 510301== by 0x71B570: OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&) (OSD.cc:9665)
==00:00:05:04.903 510301== by 0x71B997: OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&) (OSD.cc:9701)
==00:00:05:04.903 510301== by 0x70E62E: run (OpSchedulerItem.h:148)
==00:00:05:04.903 510301== by 0x70E62E: OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) (OSD.cc:10677)
==00:00:05:04.903 510301== by 0xD3C093: ShardedThreadPool::shardedthreadpool_worker(unsigned int) (WorkQueue.cc:311)
==00:00:05:04.903 510301== by 0xD3ECF3: ShardedThreadPool::WorkThreadSharded::entry() (WorkQueue.h:706)
==00:00:05:04.903 510301== by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:05:04.903 510301== by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:05:04.903 510301== by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
==00:00:05:04.903 510301==
==00:00:05:04.903 510301== This conflicts with a previous write of size 8 by thread #90
==00:00:05:04.903 510301== Locks held: 2, at addresses 0xF122A58 0xF1239A8
==00:00:05:04.903 510301== at 0x7531E1: clear (hashtable.h:2054)
==00:00:05:04.903 510301== by 0x7531E1: std::_Hashtable<entity_addr_t, std::pair<entity_addr_t const, utime_t>, mempool::pool_allocator<(mempool::pool_index_t)15, std::pair<entity_addr_t const, utime_t> >, std::__detail::_Select1st, std::equal_to<entity_addr_t>, std::hash<entity_addr_t>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() (hashtable.h:1369)
==00:00:05:04.903 510301== by 0x75331C: ~unordered_map (unordered_map.h:102)
==00:00:05:04.903 510301== by 0x75331C: OSDMap::~OSDMap() (OSDMap.h:350)
==00:00:05:04.903 510301== by 0x753606: operator() (shared_cache.hpp:100)
==00:00:05:04.903 510301== by 0x753606: std::_Sp_counted_deleter<OSDMap const*, SharedLRU<unsigned int, OSDMap const>::Cleanup, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() (shared_ptr_base.h:471)
==00:00:05:04.903 510301== by 0x73BB26: _M_release (shared_ptr_base.h:155)
==00:00:05:04.903 510301== by 0x73BB26: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (shared_ptr_base.h:148)
==00:00:05:04.903 510301== by 0x72191E: operator= (shared_ptr_base.h:747)
==00:00:05:04.903 510301== by 0x72191E: operator= (shared_ptr_base.h:1078)
==00:00:05:04.903 510301== by 0x72191E: operator= (shared_ptr.h:103)
==00:00:05:04.903 510301== by 0x72191E: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8116)
==00:00:05:04.903 510301== by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
==00:00:05:04.903 510301== by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:05:04.903 510301== by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
==00:00:05:04.903 510301== Address 0x1e3e0588 is 872 bytes inside a block of size 1,208 alloc'd
==00:00:05:04.903 510301== at 0xA7DC0C3: operator new[](unsigned long) (vg_replace_malloc.c:433)
==00:00:05:04.903 510301== by 0x6C7C0C: OSDService::try_get_map(unsigned int) (OSD.cc:1606)
==00:00:05:04.903 510301== by 0x7213BD: get_map (OSD.h:699)
==00:00:05:04.903 510301== by 0x7213BD: get_map (OSD.h:1732)
==00:00:05:04.903 510301== by 0x7213BD: OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*) (OSD.cc:8076)
==00:00:05:04.903 510301== by 0x7752CA: C_OnMapCommit::finish(int) (OSD.cc:7678)
==00:00:05:04.903 510301== by 0x72A06C: Context::complete(int) (Context.h:77)
==00:00:05:04.903 510301== by 0xD07F14: Finisher::finisher_thread_entry() (Finisher.cc:66)
==00:00:05:04.903 510301== by 0xA7E1203: mythread_wrapper (hg_intercepts.c:389)
==00:00:05:04.903 510301== by 0xC6182DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==00:00:05:04.903 510301== by 0xD8B34B2: clone (in /usr/lib64/libc-2.28.so)
```
Josh Durgin [Sat, 1 Feb 2020 19:00:24 +0000 (14:00 -0500)]
mgr/pg_autoscaler: correct and simplify progress tracking
Reset the progress each time we make an adjustment, and track progress
from that initial state to that new target. Previously we were also
using the wrong target: the current pg_num_target, not the new value
(pg_num_final) that we set.
Look up the pool by name, not id, in _maybe_adjust(), since that is how it is
retrieved by osdmap.get_pools_by_name().
Dedupe some logic into PgAdjustmentProgress to simplify things.
Sage Weil [Sun, 9 Feb 2020 19:55:27 +0000 (13:55 -0600)]
Merge PR #33133 into master
* refs/pull/33133/head:
qa/workunits/cephadm/test_cephadm.sh: make monitoring tests faster
qa/workunits/cephadm/test_cephadm: 2 OSDs is enough
cephadm: disable node-exporter cpu/memory limits for the time being
Sage Weil [Sun, 9 Feb 2020 15:40:10 +0000 (09:40 -0600)]
Merge PR #33117 into master
* refs/pull/33117/head:
qa/suites/upgrade/nautilus-x-singleton: ensure hit sets behave across upgrade
osd/PrimaryLogPG: use legacy timestamp rendering for hit_set objects
include/utime: allow legacy rendering of timestamp
* refs/pull/32816/head:
mds: check inode type when deciding if filelock should be in EXCL state
mds: don't delegate inos when handling replayed requests
mds: process re-sent async dir operations at clientreplay stage
mds: consider async dirops when checking directory empty
mds: always suppress issuing caps in Locker::issue_new_caps()
mds: try reconnect cap only when replayed request creates new inode
mds: set cap id to 1 for newly created inode
Reviewed-by: Jeff Layton &lt;jlayton@redhat.com&gt;
Reviewed-by: Patrick Donnelly &lt;pdonnell@redhat.com&gt;
Sage Weil [Fri, 7 Feb 2020 19:37:47 +0000 (13:37 -0600)]
Merge PR #33070 into master
* refs/pull/33070/head:
mgr/telemetry: use raise_for_status()
mgr/telemetry: factor post into helper
mgr/telemetry: catch exception during requests.put
Sage Weil [Wed, 5 Feb 2020 23:19:56 +0000 (17:19 -0600)]
mgr/orch: HostSpec -> HostPlacementSpec
This object is about describing where to place a service on a host: it
includes a host name and either an IP or network and possibly even a name
for the service.
Sage Weil [Fri, 7 Feb 2020 19:27:13 +0000 (13:27 -0600)]
Merge PR #33062 into master
* refs/pull/33062/head:
qa/suites/rados/cephadm: collect all cephadm tests together here
qa/workunits/cephadm/test_repos: add test for the repo commands
cephadm: add '{add,rm}-repo', with initial centos/rhel/fedora/ubuntu support
Sage Weil [Fri, 7 Feb 2020 19:26:54 +0000 (13:26 -0600)]
Merge PR #33119 into master
* refs/pull/33119/head:
mgr/upgrade: fix mgr self check
mgr/cephadm: fail upgrade if target image doesn't exist
mgr/cephadm: factor upgrade failure into helper
mgr/cephadm: refresh if we don't know a daemon's image_id
mgr/cephadm: refresh services in upgrade loop
mgr/cephadm: clean up upgrade messages a bit
ceph-volume: add unit test test_safe_prepare_osd_already_created
This commit adds a new unit test,
`test_safe_prepare_osd_already_created()`, in order to verify that a
`RuntimeError` is raised when `is_ceph_device()` returns `True`.
Kefu Chai [Fri, 7 Feb 2020 14:44:53 +0000 (22:44 +0800)]
pybind/ceph_argparse: avoid int overflow
in Python 3.6.8, `thread.join(timeout)` tries to convert the given
timeout to PyTime, but it turns out that `2 << 32` overflows when the
Python runtime converts the timeout from seconds to nanoseconds.
That's why the `lock.acquire()` call always fails in
`Thread._wait_for_tstate_lock()`,
and we end up with an alive thread after calling `thread.join()`.
Jan Fajerski [Mon, 6 Jan 2020 17:02:57 +0000 (18:02 +0100)]
ceph-volume: add available property in target specific flavors
This adds two properties available_[lvm,raw] to device (and thus inventory).
The goal is to have different notions of availability based on the
intended use case. For example finding LVM structures make a drive
unavailable for the raw mode, but might be available for the lvm mode.
Fixes: https://tracker.ceph.com/issues/43400
Signed-off-by: Jan Fajerski &lt;jfajerski@suse.com&gt;
Sage Weil [Thu, 6 Feb 2020 15:18:13 +0000 (09:18 -0600)]
cephadm: fix inspect-image
This was broken by d8debba782cd4f40ed13db7f1af8ef43503ccec5
because the 'images' json output works with podman but not with
docker. (Also, the inspect command is more explicit and cleaner.)