log: use smaller buffer for ConcreteEntry

This brings down the static size of the memory used by the logging infrastructure:

    If we used 1024, we'd have 1088*10000 = 10880000 bytes (about 10MB) in use
    by the ring buffer and 2*1088*100 = 2*108800 bytes (about 2*106KB) for the
    m_new and m_flush vectors.
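The arithmetic above can be checked with a short sketch. The 1088-byte entry size and the 10000 and 100 entry counts are the figures quoted in this message; the variable names are illustrative only and are not Ceph identifiers.

```python
# Figures quoted in the commit message; names are illustrative, not Ceph identifiers.
ENTRY_SIZE = 1088        # bytes per entry, as quoted for a 1024-byte buffer
RING_ENTRIES = 10000     # entries held by the ring buffer
VECTOR_ENTRIES = 100     # entries in each of the m_new and m_flush vectors

ring_bytes = ENTRY_SIZE * RING_ENTRIES
vector_bytes = ENTRY_SIZE * VECTOR_ENTRIES   # per vector; there are two of them

print(f"ring buffer: {ring_bytes} bytes ({ring_bytes / 2**20:.1f} MiB)")
print(f"each vector: {vector_bytes} bytes ({vector_bytes / 2**10:.2f} KiB)")
```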

In my testing, 1024 covers most log entries.

Note, I've kept 4096 for the StackStringStream via MutableEntry as these are
already allocated on the heap and cached in a thread local vector. Generally
there should only be about a dozen of these allocated so it's worth keeping a
larger buffer.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

common: remove obsolete buffer classes

In favor of the simpler StackStringBuf.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

log: avoid heap allocations for most log entries

Each log Entry now exists on the stack and uses a large (4k) buffer for its log
stream. This Entry is std::move'd to the queues (std::vector and
boost::circular_buffer) in the Log, involving only memory copies in the general
case. There are two memory copies (std::move) for any given Entry, once in
Log::submit_entry and again in Log::_flush.
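As a rough illustration of the two hand-offs, here is a toy Python model. It only mimics the queue flow (a growable pending queue feeding a bounded ring of recent entries); Python has no stack allocation or std::move, and the class, method and attribute names are hypothetical, not Ceph's.

```python
from collections import deque


class LogModel:
    """Toy model of the two hand-offs described above (not Ceph's Log class)."""

    def __init__(self, recent_capacity):
        self.new = []                                 # stands in for the std::vector queue
        self.recent = deque(maxlen=recent_capacity)   # stands in for boost::circular_buffer

    def submit_entry(self, entry):
        # First hand-off: the caller's entry goes onto the pending queue.
        self.new.append(entry)

    def flush(self):
        # Take the whole pending batch, leaving an empty queue behind.
        batch, self.new = self.new, []
        for entry in batch:
            # Second hand-off: a full ring drops its oldest entry.
            self.recent.append(entry)


log = LogModel(recent_capacity=3)
for i in range(5):
    log.submit_entry(f"entry {i}")
log.flush()
print(list(log.recent))
```

In this model each entry is handed over exactly twice, and the bounded ring keeps only the most recent entries.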

In practice, this eliminates 100% of the allocations for log entries, outside
of those made at startup.

I've run a simple experiment with the MDS that copies /usr/bin to CephFS. I got
measurements for the number of allocations from the heap profiler and the
profile of CPU usage in the MDS.

** Before this patch **

    == Heap profile: ==

    $ google-pprof --alloc_objects --text bin/ceph-mds out/mds.a.profile.0001.heap
    Total: 1105048 objects
      433329  39.2%  39.2%   433329  39.2% ceph::logging::Log::create_entry
      209311  18.9%  58.2%   209311  18.9% __gnu_cxx::__aligned_membuf::_M_addr (inline)
      192963  17.5%  75.6%   192963  17.5% __gnu_cxx::new_allocator::allocate (inline)
       61774   5.6%  81.2%    61774   5.6% std::__cxx11::basic_string::_M_mutate
       37689   3.4%  84.6%    37689   3.4% ceph::buffer::raw_combined::create (inline)
       22773   2.1%  86.7%    22773   2.1% mempool::pool_allocator::allocate (inline)
       17761   1.6%  88.3%    20523   1.9% std::pair::pair (inline)
       15795   1.4%  89.7%    15797   1.4% std::swap (inline)
       11011   1.0%  90.7%   130061  11.8% std::__cxx11::list::_M_insert (inline)
       10822   1.0%  91.7%    10822   1.0% std::__cxx11::basic_string::reserve
        9108   0.8%  92.5%    32721   3.0% __gnu_cxx::new_allocator::construct (inline)
        8608   0.8%  93.3%     8610   0.8% std::_Deque_base::_M_initialize_map (inline)
        7694   0.7%  94.0%     7694   0.7% std::__cxx11::basic_string::_M_capacity (inline)
        6160   0.6%  94.5%     6160   0.6% Journaler::wrap_finisher
        6084   0.6%  95.1%    70892   6.4% std::map::operator[] (inline)
        5347   0.5%  95.6%     5347   0.5% MutationImpl::add_projected_fnode
        4381   0.4%  96.0%     7706   0.7% mempool::pool_allocator::construct (inline)
        3588   0.3%  96.3%   182966  16.6% Locker::_do_cap_update
        3049   0.3%  96.6%     5280   0.5% std::__shared_count::__shared_count (inline)
        3043   0.3%  96.9%     3043   0.3% MDSLogContextBase::MDSLogContextBase (inline)
        3038   0.3%  97.1%    14763   1.3% std::__uninitialized_copy::__uninit_copy (inline)

So approximately 430k heap allocations were made for Entry! The
basic_string::_M_mutate line is another allocation, made via
PrebufferedStreambuf.

== Profile data ==

Selecting interesting functions:

Samples: 798K of event 'cycles:pp', Event count (approx
  Children      Self  Com  Shared Object        Symbol
+    1.04%     1.04%  log  libceph-common.so.0  [.] ceph::logging::Log::_flush
+    0.05%     0.05%  log  libceph-common.so.0  [.] ceph::logging::Log::flush
+    0.00%     0.00%  log  libceph-common.so.0  [.] ceph::logging::Log::_log_safe_write
+    0.00%     0.00%  log  libceph-common.so.0  [.] ceph::logging::Log::entry
+    0.00%     0.00%  log  libceph-common.so.0  [.] ceph::logging::Log::_flush_logbuf
...

  Children      Self  Command         Shared Object        Symbol
+    3.69%     0.00%  safe_timer      libceph-common.so.0  [.] CachedPrebufferedStreambuf::~CachedPrebufferedStreambuf
+    0.53%     0.00%  ms_dispatch     libceph-common.so.0  [.] CachedPrebufferedStreambuf::~CachedPrebufferedStreambuf
+    0.13%     0.00%  fn_anonymous    libceph-common.so.0  [.] CachedPrebufferedStreambuf::~CachedPrebufferedStreambuf
+    0.00%     0.00%  ms_dispatch     libceph-common.so.0  [.] CachedPrebufferedStreambuf::create
+    0.00%     0.00%  fn_anonymous    libceph-common.so.0  [.] CachedPrebufferedStreambuf::create

  Children      Self  Command         Shared Object        Symbol
+    0.07%     0.07%  fn_anonymous    libceph-common.so.0  [.] ceph::logging::Log::create_entry
+    0.00%     0.00%  ms_dispatch     libceph-common.so.0  [.] ceph::logging::Log::create_entry
+    0.00%     0.00%  ms_dispatch     ceph-mds             [.] _ZN4ceph7logging3Log12create_entryEiiPm@plt
+    0.00%     0.00%  md_submit       libceph-common.so.0  [.] ceph::logging::Log::create_entry
+    0.00%     0.00%  fn_anonymous    ceph-mds             [.] _ZN4ceph7logging3Log12create_entryEiiPm@plt
+    0.00%     0.00%  safe_timer      libceph-common.so.0  [.] ceph::logging::Log::create_entry
+    0.00%     0.00%  mds_rank_progr  libceph-common.so.0  [.] ceph::logging::Log::create_entry
+    0.00%     0.00%  mds_rank_progr  ceph-mds             [.] _ZN4ceph7logging3Log12create_entryEiiPm@plt
+    0.00%     0.00%  msgr-worker-0   libceph-common.so.0  [.] ceph::logging::Log::create_entry
+    0.00%     0.00%  msgr-worker-2   libceph-common.so.0  [.] ceph::logging::Log::create_entry
+    0.00%     0.00%  md_submit       ceph-mds             [.] _ZN4ceph7logging3Log12create_entryEiiPm@plt
+    0.00%     0.00%  msgr-worker-1   libceph-common.so.0  [.] ceph::logging::Log::create_entry

  Children      Self  Command         Shared Object        Symbol
+    8.29%     0.00%  ms_dispatch     libstdc++.so.6.0.24  [.] virtual thunk to std::basic_ostream<char, std::char_traits<char> >::~basic_ostream
+    7.55%     1.46%  ms_dispatch     libstdc++.so.6.0.24  [.] std::ostream::_M_insert<long>
+    3.87%     0.00%  fn_anonymous    libstdc++.so.6.0.24  [.] std::basic_ostream<char, std::char_traits<char> >::~basic_ostream
+    2.92%     0.00%  md_submit       libstdc++.so.6.0.24  [.] virtual thunk to std::basic_ostream<char, std::char_traits<char> >::~basic_ostream
+    2.38%     0.00%  fn_anonymous    libstdc++.so.6.0.24  [.] virtual thunk to std::basic_ostream<char, std::char_traits<char> >::~basic_ostream
+    2.22%     2.22%  fn_anonymous    libstdc++.so.6.0.24  [.] std::ostream::sentry::sentry
+    1.89%     0.13%  fn_anonymous    libstdc++.so.6.0.24  [.] std::__ostream_insert<char, std::char_traits<char> >
+    0.71%     0.00%  ms_dispatch     libstdc++.so.6.0.24  [.] std::basic_ostream<char, std::char_traits<char> >::~basic_ostream
+    0.39%     0.06%  fn_anonymous    libstdc++.so.6.0.24  [.] std::ostream::_M_insert<long>
+    0.29%     0.21%  ms_dispatch     libstdc++.so.6.0.24  [.] std::__ostream_insert<char, std::char_traits<char> >
+    0.27%     0.27%  ms_dispatch     libstdc++.so.6.0.24  [.] std::ostream::sentry::~sentry
+    0.27%     0.27%  fn_anonymous    libstdc++.so.6.0.24  [.] std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<long>
+    0.22%     0.22%  ms_dispatch     libstdc++.so.6.0.24  [.] std::basic_streambuf<char, std::char_traits<char> >::xsputn
+    0.20%     0.20%  ms_dispatch     libstdc++.so.6.0.24  [.] std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<long>
+    0.15%     0.15%  fn_anonymous    libstdc++.so.6.0.24  [.] std::ostream::sentry::~sentry
+    0.14%     0.14%  ms_dispatch     libstdc++.so.6.0.24  [.] std::ostream::sentry::sentry
+    0.13%     0.00%  ms_dispatch     libstdc++.so.6.0.24  [.] std::ostream::_M_insert<unsigned long>
+    0.13%     0.13%  fn_anonymous    libstdc++.so.6.0.24  [.] std::basic_streambuf<char, std::char_traits<char> >::xsputn
+    0.00%     0.00%  fn_anonymous    libstdc++.so.6.0.24  [.] std::ostream::_M_insert<unsigned long>
+    0.00%     0.00%  ms_dispatch     libstdc++.so.6.0.24  [.] std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<unsigned long>

    And the unittest_log time:

    $ bin/unittest_log
    [==========] Running 15 tests from 1 test case
    [----------] Global test environment set-up
    [----------] 15 tests from Log
    [ RUN      ] Log.Simple
    [       OK ] Log.Simple (0 ms)
    [ RUN      ] Log.ReuseBad
    [       OK ] Log.ReuseBad (1 ms)
    [ RUN      ] Log.ManyNoGather
    [       OK ] Log.ManyNoGather (0 ms)
    [ RUN      ] Log.ManyGatherLog
    [       OK ] Log.ManyGatherLog (12 ms)
    [ RUN      ] Log.ManyGatherLogStringAssign
    [       OK ] Log.ManyGatherLogStringAssign (27 ms)
    [ RUN      ] Log.ManyGatherLogStringAssignWithReserve
    [       OK ] Log.ManyGatherLogStringAssignWithReserve (27 ms)
    [ RUN      ] Log.ManyGatherLogPrebuf
    [       OK ] Log.ManyGatherLogPrebuf (15 ms)
    [ RUN      ] Log.ManyGatherLogPrebufOverflow
    [       OK ] Log.ManyGatherLogPrebufOverflow (15 ms)
    [ RUN      ] Log.ManyGather
    [       OK ] Log.ManyGather (8 ms)
    [ RUN      ] Log.InternalSegv

    [WARNING] /home/pdonnell/cephfs-shell/src/googletest/googletest/src/gtest-death-test.cc:836:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 3 threads
    [       OK ] Log.InternalSegv (8 ms)
    [ RUN      ] Log.LargeLog
    [       OK ] Log.LargeLog (43 ms)
    [ RUN      ] Log.TimeSwitch
    [       OK ] Log.TimeSwitch (1 ms)
    [ RUN      ] Log.TimeFormat
    [       OK ] Log.TimeFormat (0 ms)
    [ RUN      ] Log.Speed_gather
    [       OK ] Log.Speed_gather (1779 ms)
    [ RUN      ] Log.Speed_nogather
    [       OK ] Log.Speed_nogather (64 ms)
    [----------] 15 tests from Log (2000 ms total)

    [----------] Global test environment tear-down
    [==========] 15 tests from 1 test case ran. (2000 ms total)
    [  PASSED  ] 15 tests

** After this patch **

The StackStringBuf uses 4k for its default buffer. This appears to be more than
enough to prevent allocations for logging.

    == Heap profile: ==

$ google-pprof --alloc_objects --text bin/ceph-mds out/mds.a.profile.0001.heap
Total: 1052127 objects
      384957  36.6%  36.6%   384957  36.6% __gnu_cxx::new_allocator::allocate (inline)
      274720  26.1%  62.7%   274720  26.1% __gnu_cxx::__aligned_membuf::_M_addr (inline)
       68496   6.5%  69.2%    68496   6.5% std::__cxx11::basic_string::_M_mutate
       44140   4.2%  73.4%    51828   4.9% std::pair::pair (inline)
       43091   4.1%  77.5%    43091   4.1% ceph::buffer::raw_combined::create (inline)
       27706   2.6%  80.1%   236407  22.5% std::__cxx11::list::_M_insert (inline)
       25699   2.4%  82.6%    25699   2.4% std::__cxx11::basic_string::reserve
       23183   2.2%  84.8%    23183   2.2% mempool::pool_allocator::allocate (inline)
       19466   1.9%  86.6%    81716   7.8% __gnu_cxx::new_allocator::construct (inline)
       17606   1.7%  88.3%    17606   1.7% std::__cxx11::basic_string::_M_capacity (inline)
       16879   1.6%  89.9%    16881   1.6% std::swap (inline)
        8572   0.8%  90.7%     8574   0.8% std::_Deque_base::_M_initialize_map (inline)
        8477   0.8%  91.5%    11808   1.1% mempool::pool_allocator::construct (inline)
        6166   0.6%  92.1%     6166   0.6% Journaler::wrap_finisher
        6080   0.6%  92.7%   134975  12.8% std::map::operator[] (inline)
        6079   0.6%  93.3%     6079   0.6% MutationImpl::add_projected_fnode

== Profile data ==

    Samples: 62K of event 'cycles:u', Event count (approx.)
      Overhead  Command         Shared Object         Symbol
    +    5.91%  log             libc-2.23.so          [.] vfprintf
    +    5.75%  ms_dispatch     libstdc++.so.6.0.24   [.] std::__ostream_insert<char, std::char_traits<char> >
    +    4.59%  ms_dispatch     ceph-mds              [.] StackStringBuf<4096ul>::xsputn
    +    4.26%  ms_dispatch     libc-2.23.so          [.] __memmove_ssse3_back
    +    4.07%  log             libceph-common.so.0   [.] ceph::logging::Log::_flush
    +    2.48%  ms_dispatch     libstdc++.so.6.0.24   [.] std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<l
    +    2.13%  fn_anonymous    libstdc++.so.6.0.24   [.] std::__ostream_insert<char, std::char_traits<char> >
    +    2.09%  ms_dispatch     ceph-mds              [.] CDir::check_rstats
    +    2.06%  ms_dispatch     libstdc++.so.6.0.24   [.] std::ostream::sentry::sentry
    +    1.98%  ms_dispatch     libstdc++.so.6.0.24   [.] std::ostream::sentry::~sentry
    +    1.87%  log             libc-2.23.so          [.] __strcpy_sse2_unaligned
    +    1.60%  fn_anonymous    ceph-mds              [.] StackStringBuf<4096ul>::xsputn
    +    1.46%  log             libc-2.23.so          [.] _IO_default_xsputn
    +    1.45%  log             libc-2.23.so          [.] _itoa_word
    +    1.43%  fn_anonymous    libc-2.23.so          [.] __memmove_ssse3_back
    +    1.40%  ms_dispatch     libstdc++.so.6.0.24   [.] std::ostream::_M_insert<long>
    +    0.98%  fn_anonymous    libstdc++.so.6.0.24   [.] std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<l
    +    0.89%  ms_dispatch     libstdc++.so.6.0.24   [.] 0x
    +    0.88%  ms_dispatch     libstdc++.so.6.0.24   [.] std::_Rb_tree_increment

    And the unittest_log time:

    $ bin/unittest_log
    [==========] Running 13 tests from 1 test case.
    [----------] Global test environment set-up.
    [----------] 13 tests from Log
    [ RUN      ] Log.Simple
    [       OK ] Log.Simple (1 ms)
    [ RUN      ] Log.ReuseBad
    [       OK ] Log.ReuseBad (0 ms)
    [ RUN      ] Log.ManyNoGather
    [       OK ] Log.ManyNoGather (0 ms)
    [ RUN      ] Log.ManyGatherLog
    [       OK ] Log.ManyGatherLog (83 ms)
    [ RUN      ] Log.ManyGatherLogStringAssign
    [       OK ] Log.ManyGatherLogStringAssign (79 ms)
    [ RUN      ] Log.ManyGatherLogStackSpillover
    [       OK ] Log.ManyGatherLogStackSpillover (81 ms)
    [ RUN      ] Log.ManyGather
    [       OK ] Log.ManyGather (80 ms)
    [ RUN      ] Log.InternalSegv

    [WARNING] /home/pdonnell/ceph/src/googletest/googletest/src/gtest-death-test.cc:836:: Death tests use fork(), which is unsafe particularly in a threaded context. For this test, Google Test detected 3 threads.
    [       OK ] Log.InternalSegv (7 ms)
    [ RUN      ] Log.LargeLog
    [       OK ] Log.LargeLog (55 ms)
    [ RUN      ] Log.TimeSwitch
    [       OK ] Log.TimeSwitch (4 ms)
    [ RUN      ] Log.TimeFormat
    [       OK ] Log.TimeFormat (1 ms)
    [ RUN      ] Log.Speed_gather
    [       OK ] Log.Speed_gather (1441 ms)
    [ RUN      ] Log.Speed_nogather
    [       OK ] Log.Speed_nogather (63 ms)
    [----------] 13 tests from Log (1895 ms total)

    [----------] Global test environment tear-down
    [==========] 13 tests from 1 test case ran. (1895 ms total)
    [  PASSED  ] 13 tests.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

Merge PR #23824 into master

* refs/pull/23824/head:
doc/cephfs: add notes on application best practices

Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>

Merge pull request #21379 from pritha-srivastava/wip-rgw-user-policy

rgw: User Policy

Merge pull request #23497 from noahdesu/insights

mgr/insights: insights reporting module

Reviewed-by: John Spray <john.spray@redhat.com>

Merge pull request #23146 from jcsp/wip-progress

mgr/progress: improve+test OSD out handling

Reviewed-by: Noah Watkins <nwatkins@redhat.com>

Merge pull request #23897 from votdev/improve_delete_modal

mgr/dashboard: Make deletion dialog more touch device friendly

Reviewed-by: Ricardo Marques <rimarques@suse.com>
Reviewed-by: Stephan Müller <smueller@suse.com>

Merge pull request #24016 from votdev/bug_35907

mgr/dashboard: Progress bar does not stop in TableKeyValueComponent

Reviewed-by: Ricardo Marques <rimarques@suse.com>
Reviewed-by: Stephan Müller <smueller@suse.com>

Merge PR #22987 into master

* refs/pull/22987/head:
common,rgw: rename sha1_digest_t
osd: decrement old chunk's reference count if the chunk has a reference.
src/test: add a unit test
osd: using fingerprint OID if fingerprint is set
osd: add flag interfaces in chunk_info_t
common/buffer.cc: add sha1 fingerprint
osd: add fingerprint property
mon: add a command to set fingerprint algorithm

Merge PR #24006 into master

* refs/pull/24006/head:
osd/OSD: clear ping_history on heartbeat_reset
mon/OSDMonitor: share new maps with even non-active osds

Reviewed-by: Sage Weil <sage@redhat.com>

Merge PR #24010 into master

* refs/pull/24010/head:
osd/OSD: kick right merge source
mgr/DaemonServer: split should respect inflight creating pgs

Reviewed-by: Sage Weil <sage@redhat.com>

Merge pull request #23464 from wido/influx-optimize

mgr/influx: Use Queue to store points which need to be written

Reviewed-by: John Spray <john.spray@redhat.com>

Merge pull request #24019 from alfredodeza/wip-rm34535

ceph-volume batch carve out lvs for bluestore

Reviewed-by: Andrew Schoen <aschoen@redhat.com>

Merge pull request #23957 from tchaikov/wip-crimson-logging

common,auth,crimson: add logging to crimson

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Liu-Chunmei <chunmei.liu@intel.com>

Merge pull request #23992 from badone/wip-librados-client-unique-ptr-compile-error

librados: Include memory for unique_ptr definition

Reviewed-by: Kefu Chai <kchai@redhat.com>

Merge pull request #23723 from xiexingguo/wip-list-missing

osd/PrimaryLogPG: rename list_missing -> list_unfound command

Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>

Merge pull request #23921 from croit/fix-35544

osd/OSDMap: add osd status to utilization dumper

Reviewed-by: Sage Weil <sage@redhat.com>

Merge pull request #23488 from xiaomanh/master

tools: correct the description of Allowed options in osdomap tool

Reviewed-by: Kefu Chai <kchai@redhat.com>

mgr/dashboard: Make deletion dialog more touch device friendly

* Refactor deletion dialog
* Add directives.module.ts to be able to use 'autofocus' in deletion dialog

Signed-off-by: Volker Theile <vtheile@suse.com>

qa: add task for progress module

Signed-off-by: John Spray <john.spray@redhat.com>

qa: add 4th OSD to mgr test cluster

This is useful for testing progress module.

Signed-off-by: John Spray <john.spray@redhat.com>

qa: add tests for progress module

Signed-off-by: John Spray <john.spray@redhat.com>

mgr/progress: no progress event on unmoved pgs

PGs may not be moved on osd out, if there is no suitable
location for them to move to. In this situation
it doesn't make sense to have a progress event, as the
health warnings adequately communicate the situation.

Signed-off-by: John Spray <john.spray@redhat.com>

mgr/progress: fix PgRecoveryEvent completion cases

The event was previously not getting moved to the completed
list. There are a couple more cases too:
- When some pgs go away (a pool is removed) during the event
- When the OSD comes back in after going out

Signed-off-by: John Spray <john.spray@redhat.com>

mgr: expose osdmap pg_to_up_acting_osds

It's not efficient to have python calling this
O(pg_num) times to find the pgs for an OSD, but
I'm just shooting for something functional for now.
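A minimal sketch of the O(pg_num) pattern described here, assuming a hypothetical callable that maps (pool_id, ps) to an (up, acting) pair. The real binding's signature and return shape may differ; the mapping below is a stand-in for demonstration only.

```python
def pgs_for_osd(osd_id, pool_id, pg_num, pg_to_up_acting):
    """Return the placement seeds whose up or acting set includes osd_id.

    pg_to_up_acting is a hypothetical callable (pool_id, ps) -> (up, acting).
    """
    found = []
    for ps in range(pg_num):          # one lookup per PG: O(pg_num) calls
        up, acting = pg_to_up_acting(pool_id, ps)
        if osd_id in up or osd_id in acting:
            found.append(ps)
    return found


def fake_mapping(pool_id, ps):
    # Stand-in mapping: each PG lives on two consecutive OSDs out of four.
    up = [ps % 4, (ps + 1) % 4]
    return up, up


result = pgs_for_osd(0, pool_id=1, pg_num=8, pg_to_up_acting=fake_mapping)
print(result)
```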

Signed-off-by: John Spray <john.spray@redhat.com>

Merge pull request #23910 from votdev/improve_autofocus_directive

mgr/dashboard: Refactor autofocus directive

Reviewed-by: Ricardo Marques <rimarques@suse.com>
Reviewed-by: Stephan Müller <smueller@suse.com>

Merge pull request #23752 from ifed01/wip-ifed-fix-garbage-test

os/tests: fix garbageCollection test case from store_test suite.

Reviewed-by: Kefu Chai <kchai@redhat.com>

Merge pull request #24021 from libingyang-zte/master

doc: Fix Spelling Error of Radosgw

Reviewed-by: xie xingguo <xie.xingguo@zte.com.cn>

doc: Fix Spelling Error of Radosgw

Signed-off-by: Li Bingyang <li.bingyang1@zte.com.cn>

qa/tasks/mgr: whitelist insights test health checks

These drive health history tracking tests.

Signed-off-by: Noah Watkins <nwatkins@redhat.com>

Merge pull request #23602 from smanjara/wip-test-netem

qa: Task to emulate network delay and packet drop between two given h…

mgr/dashboard: Progress bar does not stop in TableKeyValueComponent

Fixes: https://tracker.ceph.com/issues/35907
Signed-off-by: Volker Theile <vtheile@suse.com>

ceph-volume tests.util verify Disk objects don't change state with divisions

Signed-off-by: Alfredo Deza <adeza@redhat.com>

ceph-volume util.disk fix an issue where Disk objects would mutate on div operations
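The class of bug named in the subject can be sketched as follows. The Size class here is hypothetical and is not ceph-volume's implementation; it only shows a division operator that returns a new object instead of modifying its left operand.

```python
class Size:
    """Hypothetical size wrapper, for illustration only."""

    def __init__(self, b):
        self.b = b   # size in bytes

    def __truediv__(self, divisor):
        # Build a new object; self.b is left untouched.
        return Size(self.b / divisor)

    def __repr__(self):
        return f"Size(b={self.b})"


total = Size(1024)
half = total / 2
print(total, half)
```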

Signed-off-by: Alfredo Deza <adeza@redhat.com>

ceph-volume util.prepare add a helper to get block.db sizes from ceph.conf

Signed-off-by: Alfredo Deza <adeza@redhat.com>

ceph-volume lvm.batch.bluestore add TODOs for custom fast/slow devices

Signed-off-by: Alfredo Deza <adeza@redhat.com>

-f ceph-volume lvm.batch.bluestore validation and reporting with VG reuse

Signed-off-by: Alfredo Deza <adeza@redhat.com>

doc/cephfs: add notes on application best practices

Signed-off-by: John Spray <john.spray@redhat.com>

Merge pull request #23997 from batrick/multimds-qa-broken-symlink

qa: fix symlink

Merge PR #23965 into master

* refs/pull/23965/head:
doc/dev/msgr2: better formatting
doc/dev/msgr2: clarify padding alignment
doc/dev/msgr2: tweak message flow handshake
doc/dev/msgr2: remove stream concept, streamline auth

Reviewed-by: Ricardo Dias <rdias@suse.com>

ceph-volume lvm.batch fix error reporting, Device objects aren't strings

Signed-off-by: Alfredo Deza <adeza@redhat.com>

ceph-volume lvm.batch.bluestore validation and reporting with VG reuse

Reworks the bluestore validation and reporting to account for reusable
VGs from fast devices, and adds validation calls to ensure the new way
to calculate this process will work.

Signed-off-by: Alfredo Deza <adeza@redhat.com>

ceph-volume lvm.batch.filestore capture SizeAllocationErrors

Signed-off-by: Alfredo Deza <adeza@redhat.com>

ceph-volume lvm.batch make sure data devices don't have existing LVs on bluestore

Signed-off-by: Alfredo Deza <adeza@redhat.com>

Merge pull request #21271 from cbodley/wip-rgw-beast-async

rgw: beast frontend reworks pause/stop and yields during body io

Reviewed-by: Adam C. Emerson <aemerson@redhat.com>

Merge PR #23845 into master

* refs/pull/23845/head:
osd/OSDMap: include age in up and in counts for ceph status
mon/OSDMonitor: set new_last_{up,in}_change
osd/OSDMap: store last_up_change and last_in_change
mgr/MgrMap: include mgr age in map printer
mon/MgrMap: track active_changed timestamp
mon: include mon quorum age in status
include/utime: add utimespan_str helper

Reviewed-by: John Spray <john.spray@redhat.com>

Merge PR #23949 into master

* refs/pull/23949/head:
mon/OSDMonitor: invalidate max_failed_since on cancel_report

Reviewed-by: Sage Weil <sage@redhat.com>

Merge PR #23968 into master

* refs/pull/23968/head:
dout: add basic prefix providers
dout: add DoutPrefixPipe for composing prefix providers

Reviewed-by: Kefu Chai <kchai@redhat.com>

Merge PR #23971 into master

* refs/pull/23971/head:
cls/numops: fix cls_numops.cc log add to mul

Reviewed-by: xie xingguo <xie.xingguo@zte.com.cn>

Merge PR #23975 into master

* refs/pull/23975/head:
common/buffer.cc: add create_small_page_aligned to avoid mem waste when apply for small mem in big page size(e.g. 64k) OS

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>

Merge pull request #23939 from votdev/bug_35685

mgr/dashboard: Fix bug in user form when changing password

Reviewed-by: Stephan Müller <smueller@suse.com>

osd/OSD: kick right merge source

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>

Merge pull request #23839 from trociny/wip-migration-commit-race

librbd: fix potential live migration after commit issues due to not refreshed image header

Reviewed-by: Jason Dillaman <dillaman@redhat.com>

osd/OSD: clear ping_history on heartbeat_reset

The old connections are gone, so we should not leave behind a long list
of obsolete ping_history entries, which is misleading.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>

mon/OSDMonitor: share new maps with even non-active osds

OSDs may be unaware that they have been marked down and stay stuck at
an obsolete map in which they are still marked up:

```
host        osd     down_at     stuck_at
ceph-03     9       e712        e711
ceph-03     13      e700        e699
ceph-03     28      e697        e696
ceph-03     48      e697        e696
ceph-03     52      e707        e704
ceph-03     61      e710        e708
ceph-03     73      e712        e710
ceph-03     77      e708        e707

ceph-05     12      e711        e710
ceph-05     21      e703        e702
ceph-05     24      e700        e699
ceph-05     29      e703        e699
ceph-05     41      e711        e710
ceph-05     53      e711        e710
ceph-05     72      e712        e711

```

Since https://github.com/ceph/ceph/pull/23958 an OSD pings the monitor
periodically if it is stuck at __waiting_for_healthy__. But in the
above case the OSDs still consider themselves __active__ and
hence miss that fix.

Since these OSDs might still be able to contact the monitors
(otherwise there is no way for them to be marked up again) and send
beacons continuously, we can simply get them out of the trap by
sharing some new maps with them.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
Signed-off-by: runsisi <runsisi@zte.com.cn>

mgr/dashboard: Refactor autofocus directive

Refactor the autofocus directive and add some unit tests.

Signed-off-by: Volker Theile <vtheile@suse.com>

mgr/dashboard: Unable to edit user when making an accidental change to the password field

Fixes: https://tracker.ceph.com/issues/35685
Signed-off-by: Volker Theile <vtheile@suse.com>

mgr/DaemonServer: split should respect inflight creating pgs

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>

Merge pull request #23993 from badone/wip-fedora-build-Cython3-error

rpm: Fix Fedora error "No matching package to install: 'Cython3'"

Reviewed-by: Kefu Chai <kchai@redhat.com>

Merge pull request #23833 from falcon78921/wip-docs-34539

doc/rados: fixed hit set type link

Reviewed-by: Kefu Chai <kchai@redhat.com>

Merge pull request #24000 from libingyang-zte/master

doc: Fix Spelling Error of Radosgw

Reviewed-by: xie xingguo <xie.xingguo@zte.com.cn>

doc: fixed hit set type link

Fixed reference link for hit set type value. Restructured wording in description.
Fixes: https://tracker.ceph.com/issues/34539
Signed-off-by: James McClune <jmcclune@mcclunetechnologies.net>

doc: Fix Spelling Error of Radosgw

Signed-off-by: Li Bingyang <li.bingyang1@zte.com.cn>

qa: fix symlink

Introduced-by: 6ac1882dc49bc714390d81e582d5febc9bb5e549
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

Merge pull request #23895 from xiexingguo/wip-more-async-fixes

osd/PrimaryLogPG: update missing_loc more carefully

Reviewed-by: Neha Ojha <nojha@redhat.com>

Merge pull request #23958 from xiexingguo/wip-heartbeat-stuck

osd/OSD: ping monitor if we are stuck at __waiting_for_healthy__

Reviewed-by: Sage Weil <sage@redhat.com>

mon/OSDMonitor: invalidate max_failed_since on cancel_report

max_failed_since might reference the very failure-report which is to be
cancelled. We can simply invalidate it here to make **get_failed_since()**
recalculate if necessary.
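A toy model of the invalidate-and-recalculate idea, with hypothetical names and data shapes (a dict of reporter to failed_since); it is a sketch of the caching pattern, not the monitor's actual structure.

```python
class FailureInfo:
    """Toy model: reporters maps reporter name -> failed_since timestamp."""

    def __init__(self):
        self.reporters = {}
        self.max_failed_since = None   # cached value; None means "recalculate"

    def add_report(self, who, failed_since):
        self.reporters[who] = failed_since
        self.max_failed_since = None   # cache may be stale after any change

    def cancel_report(self, who):
        self.reporters.pop(who, None)
        # The cached value may have come from the cancelled report.
        self.max_failed_since = None

    def get_failed_since(self):
        # Recalculate lazily, only when the cache was invalidated.
        if self.max_failed_since is None and self.reporters:
            self.max_failed_since = max(self.reporters.values())
        return self.max_failed_since


info = FailureInfo()
info.add_report("osd.1", 10)
info.add_report("osd.2", 20)
before = info.get_failed_since()
info.cancel_report("osd.2")
after = info.get_failed_since()
print(before, after)
```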

Fixes: http://tracker.ceph.com/issues/35860
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>

Merge PR #23449 into master

* refs/pull/23449/head:
osd/OSDMap: cleanup: s/tmpmap/nextmap/
qa/standalone/osd/osd-backfill-stats: fixes
osd/OSDMap: clean out pg_temp mappings that exceed pool size
mon/OSDMonitor: clean temps and upmaps in encode_pending, efficiently
osd/OSDMapMapping: do not crash if acting > pool size

Reviewed-by: David Zafman <dzafman@redhat.com>
Reviewed-by: xie xingguo <xie.xingguo@zte.com.cn>
Reviewed-by: Neha Ojha <nojha@redhat.com>

Merge PR #23984 into master

* refs/pull/23984/head:
mon: test if gid exists in pending for prepare_beacon

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Greg Farnum <gfarnum@redhat.com>

osd/OSDMap: cleanup: s/tmpmap/nextmap/

Be consistent with OSDMap.h

Signed-off-by: Sage Weil <sage@redhat.com>

qa/standalone/osd/osd-backfill-stats: fixes

Grep from the primary's log, not every osd's log.

For the backfill_remapped task in particular, after the pg_temp change it
just so happens that the primary changes across the pool size change and
thus two different primaries do (some) backfill. Fix that test to pass
the correct primary.

Other tests are unaffected as they do not (happen to) trigger a primary
change and already satisfied the (removed) check that only one OSD does
backfill.

Signed-off-by: Sage Weil <sage@redhat.com>

osd/OSDMap: clean out pg_temp mappings that exceed pool size

If the pool size is reduced, we can end up with pg_temp mappings that are
too big. This can trigger bad behavior elsewhere (e.g., OSDMapMapping,
which assumes that acting and up are always <= pool size).
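The cleanup can be sketched like this, assuming pg_temp is a dict keyed by (pool_id, ps) and pool sizes are a dict keyed by pool_id. Both shapes and the function name are hypothetical.

```python
def clean_oversized_pg_temp(pg_temp, pool_size):
    """Drop pg_temp mappings that list more OSDs than their pool's size.

    pg_temp: {(pool_id, ps): [osd, ...]}; pool_size: {pool_id: size}.
    """
    return {
        pg: osds
        for pg, osds in pg_temp.items()
        if len(osds) <= pool_size[pg[0]]
    }


# Pool 1 was reduced to size 2, so the three-OSD mapping is now too big.
pg_temp = {(1, 0): [0, 1, 2], (1, 1): [3, 4]}
cleaned = clean_oversized_pg_temp(pg_temp, pool_size={1: 2})
print(cleaned)
```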

Fixes: http://tracker.ceph.com/issues/26866
Signed-off-by: Sage Weil <sage@redhat.com>

mon/OSDMonitor: clean temps and upmaps in encode_pending, efficiently

- do not rebuild the next map when we already have it
- do this work in encode_pending, not create_pending, so we get bad
values before they are published.

Signed-off-by: Sage Weil <sage@redhat.com>

osd/OSDMapMapping: do not crash if acting > pool size

Existing oversized pg_temp mappings (or some other bug) might make acting
exceed the pool size. Avoid overrunning our buffer if that happens.

Note that the mapping won't be completely accurate in that case!
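A minimal sketch of the guard, using a fixed-length Python list as a stand-in for the per-PG buffer; the function name and the -1 placeholder are assumptions made for illustration.

```python
def store_acting(row, acting):
    """Copy acting into a fixed-size row, truncating instead of overrunning.

    Returns the number of OSDs actually stored.
    """
    n = min(len(acting), len(row))
    row[:n] = acting[:n]   # equal-length slices, so the row never grows
    return n


row = [-1, -1, -1]                        # room for a pool of size 3
stored = store_acting(row, [5, 6, 7, 8])  # oversized acting set
print(stored, row)
```

The fourth OSD is silently dropped, which is why the stored mapping is incomplete in that case.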

Signed-off-by: Sage Weil <sage@redhat.com>

mon: test if gid exists in pending for prepare_beacon

If it does not, send a null map. Bug introduced by
624efc64323f99b2e843f376879c1080276e036f which made preprocess_beacon only look
at the current fsmap (correctly). prepare_beacon relied on preprocess_beacon
doing that check on pending.

Running:

    while sleep 0.5; do bin/ceph mds fail 0; done

is sufficient to reproduce this bug. You will see:

    2018-09-07 15:33:30.350 7fffe36a8700  5 mon.a@0(leader).mds e69 preprocess_beacon mdsbeacon(24412/a up:reconnect seq 2 v69) v7 from mds.0 127.0.0.1:6813/2891525302 compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
    2018-09-07 15:33:30.350 7fffe36a8700 10 mon.a@0(leader).mds e69 preprocess_beacon: GID exists in map: 24412
    2018-09-07 15:33:30.350 7fffe36a8700  5 mon.a@0(leader).mds e69 _note_beacon mdsbeacon(24412/a up:reconnect seq 2 v69) v7 noting time
    2018-09-07 15:33:30.350 7fffe36a8700  7 mon.a@0(leader).mds e69 prepare_update mdsbeacon(24412/a up:reconnect seq 2 v69) v7
    2018-09-07 15:33:30.350 7fffe36a8700 12 mon.a@0(leader).mds e69 prepare_beacon mdsbeacon(24412/a up:reconnect seq 2 v69) v7 from mds.0 127.0.0.1:6813/2891525302
    2018-09-07 15:33:30.350 7fffe36a8700 15 mon.a@0(leader).mds e69 prepare_beacon got health from gid 24412 with 0 metrics.
    2018-09-07 15:33:30.350 7fffe36a8700  5 mon.a@0(leader).mds e69 mds_beacon mdsbeacon(24412/a up:reconnect seq 2 v69) v7 is not in fsmap (state up:reconnect)

in the mon leader log. The last line indicates the problem was safely handled.

Fixes: http://tracker.ceph.com/issues/35848
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

Merge PR #20469 into master

* refs/pull/20469/head:
osd/PG: remove warn on delete+merge race
osd: base project_pg_history on is_new_interval
osd: make project_pg_history handle concurrent osdmap publish
osd: handle pg delete vs merge race
osd/PG: do not purge strays in premerge state
doc/rados/operations/placement-groups: a few minor corrections
doc/man/8/ceph: drop enumeration of pg states
doc/dev/placement-groups: drop old 'splitting' reference
osd: wait for laggy pgs without osd_lock in handle_osd_map
osd: drain peering wq in start_boot, not _committed_maps
osd: kick split children
osd: no osd_lock for finish_splits
osd/osd_types: remove is_split assert
ceph-objectstore-tool: prevent import of pg that has since merged
qa/suites: test pg merging
qa/tasks/thrashosds: support merging pgs too
mon/OSDMonitor: mon_inject_pg_merge_bounce_probability
doc/rados/operations/placement-groups: update to describe pg_num reductions too
doc/rados/operations: remove reference to lpgs
osd: implement pg merge
osd/PG: implement merge_from
osdc/Objecter: resend ops on pg merge
osd: collect and record pg_num changes by pool
osd: make load_pgs remove message more accurate
osd/osd_types: pg_t: add is_merge_target()
osd/osd_types: pg_t::is_merge -> is_merge_source
osd/osd_types: adding or subtracting invalid stats -> invalid stats
osd/PG: clear_ready_to_merge on_shutdown (or final merge source prep)
osd: debug pending_creates_from_osd cleanup, don't use cbegin
ceph-objectstore-tool: debug intervals update
mgr/ClusterState: discard pg updates for pgs >= pg_num
mon/OSDMonitor: fix long line
mon/OSDMonitor: move pool created check into caller
mon/OSDMonitor: adjust pgp_num_target down along with pg_num_target as needed
mon/OSDMonitor: add mon_osd_max_initial_pgs to cap initial pool pgs
osd/OSDMap: set pg[p]_num_target in build_simple*() methods
mon/PGMap: adjust SMALLER_PGP_NUM warning to use *_target values
mon/OSDMonitor: set CREATING flag for force-create-pg
mon/OSDMonitor: start sending new-style pg_create2 messages
mon/OSDMonitor: set last_force_resend_prenautilus for pg_num_pending changes
osd: ignore pg creates when pool FLAG_CREATING is not set
mgr: do not adjust pg_num until FLAG_CREATING removed from pool
mon/OSDMonitor: add FLAG_CREATING on upgrade if pools still creating
mon/OSDMonitor: prevent FLAG_CREATING from getting set pre-nautilus
mon/OSDMonitor: disallow pg_num changes while CREATING flag is set
mon/OSDMonitor: set POOL_CREATING flag until initial pool pgs are created
osd/osd_types: add pg_pool_t FLAG_POOL_CREATING
osd/osd_types: introduce last_force_resend_prenautilus
osd/PGLog: merge_from helper
osd: no cache agent or snap trimming during premerge
osd: notify mon when pending PGs are ready to merge
mgr: add simple controller to adjust pg[p]_num_actual
mon/OSDMonitor: MOSDPGReadyToMerge to complete a pg_num change
mon/OSDMonitor: allow pg_num to be adjusted up or down via pg[p]_num_target
osd/osd_types: make pg merge an interval boundary
osd/osd_types: add pg_t::is_merge() method
osd/osd_types: add pg_num_pending to pg_pool_t
osd: allow multiple threads to block on wait_min_pg_epoch
osd: restructure advance_pg() call mechanism
mon/PGMap: prune merged pgs
mon/PGMap: track pgs by state for each pool
osd/SnapMapper: allow split_bits to decrease (merge)
os/bluestore: fix osr_drain before merge
os/bluestore: allow reuse of osr from existing collection
os/filestore: (re)implement merge
os/filestore: add _merge_collections post-check
os: implement merge_collection
os/ObjectStore: add merge_collection operation to Transaction

Merge pull request #23894 from xiexingguo/wip-complete-to-2

osd/PrimaryLogPG: avoid dereferencing invalid complete_to

Reviewed-by: Sage Weil <sage@redhat.com>

Merge pull request #23976 from idryomov/wip-cram-git-clone

qa/tasks/cram: tasks now must live in the repository

Reviewed-by: Jason Dillaman <dillaman@redhat.com>

Merge pull request #23828 from cbodley/wip-rgw-sync-trace-cleanup

rgw: cleanups for sync tracing

Reviewed-by: Yehuda Sadeh <yehuda@redhat.com>

Merge pull request #23571 from cbodley/wip-26938

rgw: data sync respects error_retry_time for backoff on error_repo

Reviewed-by: Yehuda Sadeh <yehuda@redhat.com>

rgw: beast frontend closes connections on stop

the strategy for stop relies on the fact that process_request() is
completely synchronous, so that io_context.stop() would still complete
each request and clean up properly

to tolerate an asynchronous process_request(), we instead need to drain
all outstanding work on the io_context so that io_context.run() can
return control naturally to all of the worker threads. that would allow
us to suspend our coroutine in the middle of process_request(), and
still guarantee that process_request() will resume and run to completion
before the worker threads exit

each connected socket also counts as outstanding work, and needs to be
closed in order to drain the io_context. each connection now adds itself
to a connection list so that stop() can close its socket
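
a rough sketch of the connection list idea, using only the standard
library rather than the real beast/asio types. SocketSketch and
ConnectionListSketch are hypothetical names, and the real frontend may
track connections differently

```cpp
#include <cstddef>
#include <list>
#include <mutex>

// Hypothetical stand-in for a connected socket.
struct SocketSketch {
  bool open = true;
  void close() { open = false; }
};

// Hypothetical registry: a connection adds its socket when it is accepted and
// removes it when its handler finishes. stop() can then close what is left.
class ConnectionListSketch {
  std::mutex mutex;
  std::list<SocketSketch*> connections;

 public:
  using Handle = std::list<SocketSketch*>::iterator;

  Handle add(SocketSketch* s) {
    std::lock_guard<std::mutex> l(mutex);
    return connections.insert(connections.end(), s);
  }

  void remove(Handle h) {
    std::lock_guard<std::mutex> l(mutex);
    connections.erase(h);
  }

  // Close every socket that is still registered and report how many were
  // closed. In the real frontend this is what lets pending reads fail so the
  // io_context runs out of work.
  std::size_t close_all() {
    std::lock_guard<std::mutex> l(mutex);
    std::size_t closed = 0;
    for (SocketSketch* s : connections) {
      if (s->open) {
        s->close();
        ++closed;
      }
    }
    return closed;
  }
};
```

a std::list is used here so that each connection can keep a stable iterator
as its handle and remove itself without searching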

Signed-off-by: Casey Bodley <cbodley@redhat.com>

rgw: beast frontend uses async SharedMutex for pause

the strategy for pause relied on stopping the io_context and waiting for
io_context.run() to return control to all of the worker threads. this
relies on the fact that process_request() is completely synchronous (so
considered a single unit of work in the io_context) - otherwise, pause
could complete in the middle of a call to process_request(), and destroy
the RGWRados instance while it's still in use

calling io_context.stop() to pause the worker threads also assumes that
no other work will be scheduled on these threads

to decouple pause from worker threads, handle_connection() now uses an
async shared mutex to synchronize with pause/unpause
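
a simplified, blocking sketch of the shared/exclusive pattern. the commit
uses an async SharedMutex; this sketch swaps in a hand-rolled gate built on
std::mutex and std::condition_variable, and PauseGateSketch and its
methods are hypothetical names. a request that arrives while paused is
simply refused here, where the real code would wait asynchronously

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical gate: requests hold it in "shared" mode, pause() takes it in
// "exclusive" mode and returns only once no request is in flight.
class PauseGateSketch {
  std::mutex mutex;
  std::condition_variable cond;
  int active = 0;       // requests currently inside process_request()
  bool paused = false;  // set while the exclusive side holds the gate

 public:
  // Shared side: called before handling one request.
  bool try_enter() {
    std::lock_guard<std::mutex> l(mutex);
    if (paused) {
      return false;
    }
    ++active;
    return true;
  }

  // Shared side: called when the request has run to completion.
  void leave() {
    std::lock_guard<std::mutex> l(mutex);
    if (--active == 0) {
      cond.notify_all();
    }
  }

  // Exclusive side: block new requests, then wait for in-flight ones.
  void pause() {
    std::unique_lock<std::mutex> l(mutex);
    paused = true;
    cond.wait(l, [this] { return active == 0; });
  }

  void unpause() {
    std::lock_guard<std::mutex> l(mutex);
    paused = false;
  }
};
```

because pause() waits on the request count rather than on the worker
threads, it no longer matters what else is scheduled on those threads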

Signed-off-by: Casey Bodley <cbodley@redhat.com>

osd/PG: remove warn on delete+merge race

This was there just to confirm that this path was exercised by the
rados suite (it is, several hits per rados run of 1/666).

Signed-off-by: Sage Weil <sage@redhat.com>

osd: base project_pg_history on is_new_interval

Signed-off-by: Sage Weil <sage@redhat.com>

osd: make project_pg_history handle concurrent osdmap publish

The class's osdmap may be updated while we are in our loop. Pass it in
explicitly instead.
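
A minimal sketch of the pattern, assuming a shared_ptr-published map.
OSDMapSketch, OSDSketch and count_epochs_to are hypothetical names and do
not reflect the real project_pg_history signature.

```cpp
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for an OSDMap: only the epoch matters here.
struct OSDMapSketch {
  unsigned epoch;
};
using OSDMapRef = std::shared_ptr<const OSDMapSketch>;

// Hypothetical owner whose map can be replaced by another thread at any time.
class OSDSketch {
  mutable std::mutex lock;
  OSDMapRef osdmap;

 public:
  explicit OSDSketch(OSDMapRef m) : osdmap(std::move(m)) {}

  void publish(OSDMapRef m) {
    std::lock_guard<std::mutex> l(lock);
    osdmap = std::move(m);
  }

  OSDMapRef get_map() const {
    std::lock_guard<std::mutex> l(lock);
    return osdmap;
  }
};

// The loop bound comes from the map passed in by the caller, so a concurrent
// publish() cannot change it while the loop is running.
inline unsigned count_epochs_to(unsigned from, const OSDMapRef& endmap) {
  unsigned n = 0;
  for (unsigned e = from + 1; e <= endmap->epoch; ++e) {
    ++n;
  }
  return n;
}
```

The caller takes one reference with get_map() and passes it down; the
reference keeps that map alive and unchanged even if a newer one is
published in the meantime.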

Fixes: http://tracker.ceph.com/issues/26970
Signed-off-by: Sage Weil <sage@redhat.com>

osd: handle pg delete vs merge race

Deletion involves an awkward dance between the pg lock and shard locks,
while the merge prep and tracking is "shard down". If the delete has
finished its work we may find that a merge has since been prepped.

Unwinding the merge tracking is nontrivial, especially because it might
involve a second PG, possibly even a fabricated placeholder one. Instead,
if we delete and find that a merge is coming, undo our deletion and let
things play out in the future map epoch.

Signed-off-by: Sage Weil <sage@redhat.com>

osd/PG: do not purge strays in premerge state

The point of premerge is to ensure that the constituent parts of the
target PG are fully clean.  If there is an intervening PG migration and
one of the halves finishes migrating before the other, one half could
get removed and the final merge could result in an incomplete PG.  In the
worst case, the two halves (let's call them A and B) could have started
out together on say [0,1,2], A moves to [3,4,5] and gets deleted from
[0,1,2], and then the final merge happens such that *all* copies of the PG
are incomplete.

We could construct a clever check that does allow removal of strays when
the sibling PG is also ready to go, but it would be complicated.  Do the
simple thing.  In reality, this would be an extremely hard case to hit
because the premerge window is generally very short.

Signed-off-by: Sage Weil <sage@redhat.com>

doc/rados/operations/placement-groups: a few minor corrections

Signed-off-by: Sage Weil <sage@redhat.com>

doc/man/8/ceph: drop enumeration of pg states

This is more maintainable.

Signed-off-by: Sage Weil <sage@redhat.com>

doc/dev/placement-groups: drop old 'splitting' reference

Signed-off-by: Sage Weil <sage@redhat.com>

osd: wait for laggy pgs without osd_lock in handle_osd_map

We can't hold osd_lock while blocking because other objectstore completions
need to take osd_lock (e.g., _committed_osd_maps), and those objectstore
completions need to complete in order to finish_splits. Move the blocking
to the top before we establish any local state in this stack frame since
both the public and cluster dispatchers may race in handle_osd_map and
we are dropping and retaking osd_lock.

Signed-off-by: Sage Weil <sage@redhat.com>

osd: drain peering wq in start_boot, not _committed_maps

We can't safely block in _committed_osd_maps because we are being run
by the store's finisher threads, and we may have to wait for a PG to split
and then merge via that same queue and deadlock.

Do not hold osd_lock while waiting as this can interfere with *other*
objectstore completions that take osd_lock.

Signed-off-by: Sage Weil <sage@redhat.com>

osd: kick split children

Ensure that we bring split children up to date to the latest map even in
the absence of new OSDMaps feeding in NullEvts. This is important when
the handle_osd_map (or boot) thread is blocked waiting for pgs to catch
up, but we also need a newly-split child to catch up (perhaps so that it
can merge).

Signed-off-by: Sage Weil <sage@redhat.com>

osd: no osd_lock for finish_splits

This probably used to protect the pg registration. There is no need for
it now.

More importantly, having it here can cause a deadlock when we are holding
osd_lock and blocking on wait_min_pg_epoch(), because a PG may need to
finish splitting to advance and then merge with a peer. (The wait won't
block on *this* PG since it isn't registered in the shard yet, but it
will block on the merge peer.)

Signed-off-by: Sage Weil <sage@redhat.com>

osd/osd_types: remove is_split assert

The problem is:

osd is at epoch 80
import pg 1.a as of e57
1.a and 1.1a merged in epoch 60something
we set up a merge now,
but in should_restart_peering via advance_pg we hit the is_split assert
that the ps is < old_pg_num

We can meaningfully return false (this is not a split) for a pg that is
beyond pg_num.
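
A brute-force sketch of that behavior, not the real pg_t::is_split. It
assumes a stable-modulo mapping of seeds onto PGs; pg_mask, stable_mod and
is_split_sketch are hypothetical names, and the real method also reports
the children.

```cpp
#include <cassert>
#include <cstdint>

// Mask covering the smallest power of two >= n (requires n > 0).
inline uint64_t pg_mask(uint64_t n) {
  uint64_t p = 1;
  while (p < n) {
    p <<= 1;
  }
  return p - 1;
}

// Map x onto b buckets so that growing b only moves seeds into new buckets.
inline uint64_t stable_mod(uint64_t x, uint64_t b, uint64_t bmask) {
  return (x & bmask) < b ? (x & bmask) : (x & (bmask >> 1));
}

// True if the PG with seed ps gains at least one child when pg_num grows
// from old_pg_num to new_pg_num.
inline bool is_split_sketch(uint32_t ps, uint32_t old_pg_num,
                            uint32_t new_pg_num) {
  assert(old_pg_num > 0);
  if (ps >= old_pg_num) {
    return false;  // beyond pg_num: report "not a split" instead of asserting
  }
  const uint64_t mask = pg_mask(old_pg_num);
  for (uint64_t s = old_pg_num; s < new_pg_num; ++s) {
    if (stable_mod(s, old_pg_num, mask) == ps) {
      return true;  // seed s is a new PG whose parent is ps
    }
  }
  return false;
}
```

For example, with old_pg_num = 16 and new_pg_num = 32, seed 0xa splits
(its child is 0x1a), while asking about seed 0x1a itself returns false
rather than tripping an assert.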

Signed-off-by: Sage Weil <sage@redhat.com>

ceph-objectstore-tool: prevent import of pg that has since merged

We currently import a portion of the PG if it has split. Merge is more
complicated, though, mainly because COT is operating in a mode where it
fast-forwards the PG to the latest OSDMap epoch, which means it has to
implement any transformations to the PG (split/merge) independently.
Avoid doing this for merge.

Signed-off-by: Sage Weil <sage@redhat.com>

qa/suites: test pg merging

Signed-off-by: Sage Weil <sage@redhat.com>

qa/tasks/thrashosds: support merging pgs too

Signed-off-by: Sage Weil <sage@redhat.com>

mon/OSDMonitor: mon_inject_pg_merge_bounce_probability

Optionally bounce pg_num back up right after we decrease it. This triggers
conditions in the OSD where the merge and split logic may conflict.
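
A minimal sketch of the injection idea, not the OSDMonitor code.
pg_num_steps_sketch is a hypothetical helper, and the probability argument
stands in for the mon_inject_pg_merge_bounce_probability option.

```cpp
#include <random>
#include <vector>

// Returns the sequence of pg_num values to apply for a requested decrease.
// Normally that is just the new value; with the given probability the old
// value is appended so that pg_num bounces straight back up.
template <class Rng>
std::vector<unsigned> pg_num_steps_sketch(unsigned old_pg_num,
                                          unsigned new_pg_num,
                                          double bounce_probability,
                                          Rng& rng) {
  std::vector<unsigned> steps{new_pg_num};
  std::bernoulli_distribution bounce(bounce_probability);
  if (new_pg_num < old_pg_num && bounce(rng)) {
    steps.push_back(old_pg_num);  // inject a split right after the merge
  }
  return steps;
}
```

A probability of 0 disables the injection; a probability of 1 forces a
merge followed by a split on every decrease.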

Signed-off-by: Sage Weil <sage@redhat.com>