git-server-git.apps.pok.os.sepia.ceph.com Git

ceph-volume/tests: update setup_mixed_type playbook

we need to create a file with a larger size.
see https://github.com/ceph/ceph/pull/43300#issuecomment-951961243

Signed-off-by: Guillaume Abrioux <gabrioux@redhat.com>
(cherry picked from commit 8af00e25aa4ab60d0309e31f6c20edd6cd5be1ee)

ceph-volume: fix bug with miscalculation of required db/wal slot size for VGs with multiple PVs

Previous logic for calculating db/wal slot sizes made the assumption that there would only be
a single PV backing each db/wal VG. This wasn't the case for OSDs deployed prior to v15.2.8,
since ceph-volume previously deployed multiple SSDs in the same VG. This fix removes the
assumption and does the correct calculation in either case.

Fixes: https://tracker.ceph.com/issues/52730
Signed-off-by: Cory Snyder <csnyder@iland.com>
(cherry picked from commit cd6aa1329f70f89338757ba295e279ecfdbc2d07)

Merge pull request #43890 from batrick/i53231

pacific: MDSMonitor: assertion during upgrade to v16.2.5+

Reviewed-by: Venky Shankar vshankar@redhat.com

Merge pull request #43841 from vshankar/wip-52952

pacific: mds: skip journaling blocklisted clients when in `replay` state

Reviewed-by: Venky Shankar vshankar@redhat.com
Reviewed-by: Kotresh HR khiremat@redhat.com

Merge pull request #43891 from batrick/i53232

pacific: MDSMonitor: no active MDS after cluster deployment

Reviewed-by: Venky Shankar vshankar@redhat.com

Merge pull request #43828 from batrick/i53006

pacific: qa: reduce frag split confs for dir_split counter test

Reviewed-by: Venky Shankar <vshankar@redhat.com>

Merge pull request #43941 from sebastian-philipp/backport-43780

pacific: doc/radosgw/nfs: add note about NFSv3 deprecation

Reviewed-by: Michael Fritch <mfritch@suse.com>

Merge pull request #43937 from neha-ojha/wip-42709-pacific

pacific: qa/tasks/kubeadm: force docker cgroup engine to systemd

Reviewed-by: Joseph Sawaya <jsawaya@redhat.com>

doc/radosgw/nfs: add note about NFSv3 deprecation

cephadm and rook based deployments have deprecated
the NFSv3 deprecated in favor of NFSv4

Signed-off-by: Michael Fritch <mfritch@suse.com>
(cherry picked from commit a4fa3dd71474f48e898ae604fd398154ce91b49c)

Merge pull request #43731 from ifed01/wip-ifed-multiple-backports-pac

pacific: os/bluestore: multiple repair fixes

Reviewed-by: Neha Ojha <nojha@redhat.com>

Merge pull request #43697 from cfsnyder/wip-52844-pacific

pacific: mon,auth: fix proposal (and mon db rebuild) of rotating secrets

Reviewed-by: Neha Ojha <nojha@redhat.com>

qa/tasks/kubeadm: force docker cgroup engine to systemd

Signed-off-by: Sage Weil <sage@newdream.net>
(cherry picked from commit 517b7759b3ab2b84b2a4ddace411e6ac7599eddd)

Merge pull request #43756 from ifed01/wip-ifed-fix-write-small-head-pad-pac

pacific: os/bluestore: _do_write_small fix head_pad

Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>

Merge pull request #43918 from ideepika/wip-remain-ssd-cache-pacific

pacific: librbd/cache/pwl: SSD caching backports

Reviewed-by: Mykola Golub <mykola.golub@clyso.com>
Reviewed-by: Deepika Upadhyay <dupadhya@redhat.com>

qa/suites/rbd/persistent-writeback-cache: add test case

Add the test case which size is 8GB, So that some problems that occur
only in test scenarios above 4GB may be found in this test. For example,
the variables of 32-bit may be unexpected value when it operates with
a 64 bit value.

Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit 3da4a9401c887f9fa92d090786c5ef2121665bd2)

librbd/cache/pwl/ssd: add layout version control

Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit dc566a3cd30d91ecbe87cb049df5e9462a526b6d)

librbd/cache/pwl/ssd: make log entry pointers 64 bit (on-disk format change)

Fixes: https://tracker.ceph.com/issues/50675
Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit c091ec3471973f41717b237d500b8e4374af660f)

librbd/cache/pwl: fix reorder issue between func process_writeback_dirty_entries

In fact, we not only make sure ops in order in func process_writeback_dirty_entries,
but also make sure ops in order between func process_writeback_dirty_entries.

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit 76f4d29d92be3f9f45767cb1ac6cc50da528ecec)

Merge pull request #43823 from adamemerson/wip-53160-pacific

pacific: rgw: Ensure buckets too old to decode a layout have layout logs

Reviewed-by: Casey Bodley <cbodley@redhat.com>

mds/FSMap: allow upgrade when no MDS is "in"

Fixes: https://tracker.ceph.com/issues/52975
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 9b8e3187ed491e4bcfdecbc82babb0bca835bad6)

pybind/mgr/cephadm: disable FSMap sanity checks during MDS upgrade

See comment for explanation.

Fixes: https://tracker.ceph.com/issues/53155
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit b29423111be8fcfc7702a39a7b0f6da43d12067f)

mds/FSMap: assign v16.2.4 compat to pre-v16.2.5 standby daemons

With v16.2.5, the monitors store an MDS's CompatSet with its mds_info_t
in the MDSMap. If an older MDS fails and rejoins the cluster, it gets
assigned the empty CompatSet. This is problematic during upgrades as an
MDS failure may prevent the upgrade process from continuing and cause
file system unavailability.

This patch makes it so the mons will assign a reasonable default: a
CompatSet used since v14.2.0 until v16.2.5.

Fixes: https://tracker.ceph.com/issues/53150
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 74e3f5ec5a49ce99b56c305624e9110fcb4b787c)

Merge pull request #43251 from kotreshhr/wip-52633-pacific

pacific: mds: Add new flag to MClientSession

Reviewed-by: Kotresh HR <khiremat@redhat.com>

Merge pull request #43223 from kotreshhr/wip-52628-pacific

pacific: mgr/volumes: Fix permission during subvol creation with mode

Reviewed-by: Kotresh HR <khiremat@redhat.com>

Merge pull request #43862 from ivancich/wip-multipart-purge-fix-pacific

pacific: rgw: fix bucket purge incomplete multipart uploads

Reviewed-by: Adam Emerson <aemerson@redhat.com>

Merge pull request #43772 from ideepika/wip-ssd-cache-testing-backports-pacific

pacific: librbd/cache/pwl: persistant cache backports

Reviewed-by: Mykola Golub <mykola.golub@clyso.com>

cmake/modules/Findpmem: always set pmem_VERSION_STRING

before this change, `pmem_VERSION_STRING` is not set if it is not able
to fulfill the specified version requirement. the intention was to check
if the version is able to satisfy the requirement. but actually, passing
an empty `pmem_VERSION_STRING` to `find_package_handle_standard_args()`
as the option of `VERSION_VAR` does not fail this check. on the
contrary, it prints

-- Found pmem: pmem_pmemobj_INCLUDE_DIR;pmem_pmem_INCLUDE_DIR (Required
is at least version "1.17")

if we requires pmem 1.17, while the found version is, for instance,
1.10.

if the required version is 1.7, and the found version is 1.10, the
output from cmake is:

-- Found pmem: pmem_pmemobj_INCLUDE_DIR;pmem_pmem_INCLUDE_DIR (found
suitable version "1.10", minimum required is "1.7")

in this change, the version spec is not specified when calling
`pkg_check_modules()`. so, `PKG_${component}_VERSION` is always set.
and we can always delegate the version checking to
`find_package_handle_standard_args()`. please note, we use the lower
version returned by pkg-config if multiple components are required and
both pkg-config settings return their versions.

Signed-off-by: Kefu Chai <kchai@redhat.com>
(cherry picked from commit ad85af2ef8b76481adc0d1500033cd5094b8f21e)

Conflicts:
cmake/modules/Findpmem.cmake
<<<<<<< HEAD
=======
  pkg_check_modules(PKG_${component} QUIET "lib${component}")
  if(NOT pmem_VERSION_STRING OR PKG_${component}_VERSION VERSION_LESS pmem_VERSION_STRING)
    set(pmem_VERSION_STRING ${PKG_${component}_VERSION})
  endif()
  find_path(pmem_${component}_INCLUDE_DIR
    NAMES lib${component}.h
    HINTS ${PKG_${component}_INCLUDE_DIRS})
  find_library(pmem_${component}_LIBRARY
    NAMES ${component}
    HINTS ${PKG_${component}_LIBRARY_DIRS})
>>>>>>> ad85af2ef8b (cmake/modules/Findpmem: always set pmem_VERSION_STRING)

ceph.spec.in: do not build with system pmdk by default

we need to use libpmem 1.10 in #40493.

without enabling the module stream offering libpmem 1.9.2, we can only
have access to libpmem 1.6.1. and fedora 33 only has libpmem 1.9
packaged. the same applies to openSUSE Tumbleweed and openSUSE Leap. so
let's stop using libpmem packaged by distro by default, until these
distros include libpmem 1.10.

Signed-off-by: Kefu Chai <kchai@redhat.com>
(cherry picked from commit ecb8d2cae2c063acf4e7e1bffed887d52117762f)

cmake: bump to PMDK v1.10

Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit 636ab08f2604efd4cac3200d5741fa15b070f072)

cmake: use src/pmdk for building pmdk if it exists

so we can build with pmdk enabled if the dist tarball
contains pmdk

Signed-off-by: Feng Hualong <hualong.feng@intel.com>
(cherry picked from commit 14c2a2e59fbdc716d35c08735d50bdadfab8300d)

Conflicts:
cmake/modules/Buildpmem.cmake
trivial fix, adopt what master has:
1. external_project args >> source_dir_args
2. INSTALL_COMMAND ""

make-dist: add pmdk to dist tarball

Signed-off-by: Feng Hualong <hualong.feng@intel.com>
(cherry picked from commit 9d958d0b9d33bfa0e27323597986e541eed23951)

os/bluestore: fix improper offset calculation when repairing.

While repairing misreferenced blobs BlueStore could improperly calculate
an offset within a blob being fixed. This could happen when single
physical extent has been replaced by multiple ones - the following
pextent (if any in the current blob) would be treated with the improper offset within the blob. Offset calculation didn't account for each of that new pextents but the last one only.

Fixes: https://tracker.ceph.com/issues/51682
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
(cherry picked from commit ca4b6675fc3fd2f4cadad58044c97c5bb23d5938)

rgw: fix bucket purge incomplete multipart uploads

The marker was not working correctly as segments of the bucket index
were listed to shut down any incomplete multipart uploads. This fixes
the marker, so it's maintained properly across iterations.

Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>

Merge pull request #43787 from tchaikov/pacific-pr-41353

pacific: pybind/mgr/CMakeLists.txt: exclude files not used at runtime

Reviewed-by: Ernesto Puerta <epuertat@redhat.com>

Merge pull request #43682 from rhcs-dashboard/wip-52806-pacific

pacific: mgr/dashboard: BATCH incl.: NFS integration, Cluster Expansion Workflow, and Angular 11 upgrade

Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>

Merge pull request #43703 from cfsnyder/wip-52783-pacific

pacific: rgw/sts: fix for copy object operation using sts

Reviewed-by: Casey Bodley <cbodley@redhat.com>

Merge pull request #43809 from cbodley/wip-qa-rgw-java-pacific

pacific: qa/rgw: pacific branch targets ceph-pacific branch of java_s3tests

Reviewed-by: Yuri Weinstein <yweinste@redhat.com>
Reviewed-by: Nathan Cutler <ncutler@suse.com>
Reviewed-by: Ali Maredia <amaredia@redhat.com>

Merge pull request #43777 from cfsnyder/wip-52468-pacific

pacific: rgw: use existing s->bucket in s3 website retarget()

Reviewed-by: Casey Bodley <cbodley@redhat.com>

mds: skip journaling blocklisted clients when in `replay` state

When a standby MDS is transitioning to active, it passes through
`replay` state. When the MDS is in this state, there are no journal
segments available for recording journal updates. If the MDS receives
an OSDMap update in this state, journaling blocklisted clients causes
a crash since no journal segments are available. This is a bit hard
to reproduce as it requires correct timing of an OSDMap update along
with various other factors.

Note that, when the MDS reaches `reconnect` state, it will journal
the blocklisted clients anyway.

This partially fixes tracker: https://tracker.ceph.com/issues/51589
which mentions a similar crash but in `reconnect` state. However,
that crash was seen in nautilus.

A couple of minor changes include removing hardcoded function names
and carving out reusable parts into a separate function.

Partially-fixes: https://tracker.ceph.com/issues/51589
Signed-off-by: Venky Shankar <vshankar@redhat.com>
(cherry picked from commit 6d6236dc8d15636af8060057e6e69c26c473f987)

Merge pull request #43708 from cfsnyder/wip-52598-pacific

pacific: ceph-volume: util/prepare fix osd_id_available()

Merge pull request #43812 from rhcs-dashboard/wip-53153-pacific

pacific: mgr/dashboard: fix missing alert rule details

Reviewed-by: Avan Thakkar <athakkar@redhat.com>
Reviewed-by: Nizamudeen A <nia@redhat.com>

qa: reduce frag split confs for dir_split counter test

Fixes: https://tracker.ceph.com/issues/52949
Fixes: a5675535ba532cfe782238d995751f9bc91f5078
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit abd19aa215eeef6da66c522c9fbeec870acee85a)

qa: only set frag confs for workloads

Otherwise these local conf overrides prevent functional testing.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

qa: fold frag confs into conf/mds.yaml

These overrides are standard for all configurations. The config to
enable fragmentation is also long removed.

Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>

rgw: Ensure buckets too old to decode a layout have layout logs

When decoding `RGWBucketInfo` data from before Pacific, we won't call
`rgw::BucketLayout::decode`, but will instead synthesize the layout
information. This leaves the `rgw::BucketLayout::logs` empty, as the
fallback to populate it only applies to old versions of
`rgw::BucketLayout`.

Add a check at the end of `RGWBUcketInfo::decode` to populate it if
empty.

Fixes: https://tracker.ceph.com/issues/53132
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
(cherry picked from commit 3279509127e65314c07963a3e127e926308bd76a)
Fixes: https://tracker.ceph.com/issues/53160
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>

librbd/cache/pwl: cancel advance dispatch of external flush request

For external flush request, it new syncpoint after passing
guardedrequest and before dispatch. Then dispatch bypass deferred
queue But the last write request may still in the deferred queue.
It don't dispatch and not associated with any syncpoint. The
external flush request will bypass the previous write request in
deferred queue now. This does not conform to the semantics of
external flush requests. External flush request should strictly
follow the order of dispath.

But for internal flush request, it will be dispatched after all
write request which associated with previous syncpoint, persisted
in cache. C_gather guarantee it.

It is necessary to distinguish between external and internal
flush requests. Internal flush can and should be dispatched in
advance bypass deferred queue. At the same time, the order of
external requests needs to be kept unchanged. So cancel advance
dispatch of external flush request.

Fixes: https://tracker.ceph.com/issues/52599
Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit 9951868fc33f0469cd4135fadebb5a8d2b4b55dd)

librbd/cache/pwl: fix assert in _aio_stop() during shutdown

For wait_for_ops(next_ctx). this next_ctx may run in aio_thread.
Then the next program runs on the aio thread. remove_pool_file()
calls bdev->close(), then calles _aio_stop(), exec aio_thread.join(),
cause assert. Thread can't join itself. Fix it by adding close ctx
to m_work_queue, so close() can run in work queue thread.

At the same time, correct the order of wait_for_ops().
flush_dirty_entries(next_ctx) may call wake_up() and start_op().
so moving wait_for_ops() behind flush_dirty_entries(next_ctx) is more
appropriate.

Fixes: https://tracker.ceph.com/issues/52566
Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit 94f9873718a82d2def8f268c1581fbf97fee0f49)

librbd/cache/pwl/ssd: move finish_op() to the end of callback function

finish_op() of ssd cache is not in the end of callback function in
append_op_log_entries(), and after finish_op(), some operation also
need to get m_lock. So, during shutdown, wait_for_ops() thinks all OPs
are over, and no thread will acquire the m_lock, In the subsequent
operation of shutdown, the m_lock is obtained, and _aio_stop() in
bdev->close() waits for all aio_writes() and aio_submit() to end
when the m_lock is held, but the callback function of aio_write() is
waiting for the m_lock, causing a deadlock. Move finish_op() to the
end to fix dead lock.

Fixes: https://tracker.ceph.com/issues/52235
Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit c531768838e44ed8eb28e8b27594d7e03ca3ffcf)

librbd/cache/pwl: Check the cache is clean

Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit 066b8a6d2ee091839b9b21ac89b8dfcebf8825cd)

librbd/cache/pwl/ssd: Remove unused parameter.

Met the following compiler warning message:
>[38/80] Building CXX object
src/librbd/CMakeFiles/librbd_plugin_pwl_cache.dir/cache/pwl/ssd/WriteLog.cc.o
>../src/librbd/cache/pwl/ssd/WriteLog.cc:37:25: warning: unused variable
'ops_appended_together' [-Wunused-const-variable]
>const unsigned long int ops_appended_together = MAX_WRITES_PER_SYNC_POINT;

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit 40dad4c30c3b46ed3ac06961ebe740f6d82a6bc1)

librbd/cache/pwl/ssd: Fix a race between get_cache_bl() and remove_cache_bl()

In fact, although in get_cache_bl it use lock to protect, it can't
protect function "list& operator= (const list& other)".
So we should use copy_cache_bl.

Fixes: https://tracker.ceph.com/issues/52400
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit fe72b3953735329441397f257d5dd18f6819187d)

librbd/cache/pwl/ssd: flushed_sync_gen capture is unused

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit f8fb7609b30bf3e60a64e4282250b121151d1559)

librbd/cache/pwl/ssd: ensure first_{valid,free}_entry aren't bogus

Ensure first_{valid,free}_entry are inside the expected range when
scheduling root updates and decoding the root on recovery.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit b46f81fbb22e67548e9825756da625d8e9017d10)

librbd/cache/pwl/ssd: avoid corrupting m_first_free_entry

In append_op_log_entries(), new_first_free_entry is read after
append_ops() returns.  This can result in accessing freed memory
because all I/Os may complete and append_ctx callback may run
by the time new_first_free_entry is read.  Garbage value gets
written to m_first_free_entry and depending on the circumstances
it may allow AbstractWriteLog code to accept more dirty user data
than we have space for.  Luckily we usually crash before then.

Fixes: https://tracker.ceph.com/issues/50832
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit d83a0f6db8ff26eeb2c817b1bd192fb357f715df)

librbd/cache/pwl/ssd: avoid corrupting first_free_entry

In append_ops(), new_first_free_entry is assigned to after aio_submit()
is called.  This can result in accessing uninitialized or freed memory
because all I/Os may complete and append_ctx callback may run before the
assignment is executed.  Garbage value gets written to first_free_entry
and we eventually crash, most likely in bufferlist manipulation code.

But worse, the corrupted first_free_entry makes it to media almost all
the time.   The result is a corrupted cache -- dirty user data is lost.

Fixes: https://tracker.ceph.com/issues/50832
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit ef381d993ce29c5d0d774a6af27c3af861392ca1)

librbd/cache/pwl/ssd: remove correct m_blocks_to_log_entries entry

When retiring, m_blocks_to_log_entries doesn't remove
corresponding write_entry (should be `*it` not `entry`)
that will be retired. It leads to read error. And
there should also consider discard entries.

Fixes: https://tracker.ceph.com/issues/52579
Signed-off-by: Feng Hualong <hualong.feng@intel.com>
(cherry picked from commit 01bb75a1056d26ae43832d567087b3d67ab84261)

librbd/cache/pwl/ssd: Remove unused parameter.

Met the following compiler warning message:
>[38/80] Building CXX object
src/librbd/CMakeFiles/librbd_plugin_pwl_cache.dir/cache/pwl/ssd/WriteLog.cc.o
>../src/librbd/cache/pwl/ssd/WriteLog.cc:37:25: warning: unused variable
'ops_appended_together' [-Wunused-const-variable]
>const unsigned long int ops_appended_together = MAX_WRITES_PER_SYNC_POINT;

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit 40dad4c30c3b46ed3ac06961ebe740f6d82a6bc1)

librbd/cache/pwl/ssd: Remove useless locks.

Return a reference don't need by lock protect.

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit ca9eb28b4f249725e3ecae328fbf8076c648819a)

librbd/cache/pwl/ssd: flushed_sync_gen capture is unused

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit f8fb7609b30bf3e60a64e4282250b121151d1559)

librbd/cache/pwl/ssd: ensure first_{valid,free}_entry aren't bogus

Ensure first_{valid,free}_entry are inside the expected range when
scheduling root updates and decoding the root on recovery.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit b46f81fbb22e67548e9825756da625d8e9017d10)

librbd/cache/pwl/ssd: avoid corrupting m_first_free_entry

In append_op_log_entries(), new_first_free_entry is read after
append_ops() returns.  This can result in accessing freed memory
because all I/Os may complete and append_ctx callback may run
by the time new_first_free_entry is read.  Garbage value gets
written to m_first_free_entry and depending on the circumstances
it may allow AbstractWriteLog code to accept more dirty user data
than we have space for.  Luckily we usually crash before then.

Fixes: https://tracker.ceph.com/issues/50832
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit d83a0f6db8ff26eeb2c817b1bd192fb357f715df)

librbd/cache/pwl: don't clear next_sync_point_entry prematurely

In SyncPointLogOperation::clear_earlier_sync_point(),
sync_point->log_entry->next_sync_point_entry was prematurely set to
nullptr in clear_earlier_sync_point(). It is in write op stage, but
next_sync_point_entry is used in writeback stage in
handle_flushed_sync_point().

handle_flushed_sync_point() may pass a nullptr
cause assert in m_work_queue.The solution is to move the statement
that set next_sync_point_entry to nullptr after it is used.

Fixes: https://tracker.ceph.com/issues/52465
Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit a1a20041d57c9a90bc0a60d86469445ba8efb5ea)

librbd/cache/pwl: Fix pmem cache fragment issue

I/O may hang due to pmem cache fragment issue when blocks are diffrent
in size. Call pmdk API(pmemobj_defrag) to solve.

Fixes: https://tracker.ceph.com/issues/49879
Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit b53392a15380b57d6111cb2926083393627f1ed7)

librbd/cache/pwl/ssd: ensure bdev and pool sizes match

m_log_pool_size should be multiple of 1M but, just in case, prevent
is_valid_io() assert in KernelDevice::aio_write().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 41b95ac987954dc2c29090926236660b8348e9d2)

librbd/cache/pwl: make pool size a multiple of 1M

In ssd mode, we need it to be a multiple of bdev block size.
Instead of munging it after opening the bdev in ssd/WriteLog.cc, let's
impose a common restriction and round rbd_persistent_cache_size down to
a 1M boundary.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 1fc722bc218acfe8a52ad7be2c06347bddb42bbe)

librbd/cache/pwl/ssd: ensure bdev and pool sizes match

m_log_pool_size should be multiple of 1M but, just in case, prevent
is_valid_io() assert in KernelDevice::aio_write().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 41b95ac987954dc2c29090926236660b8348e9d2)

librbd/cache/pwl: make pool size a multiple of 1M

In ssd mode, we need it to be a multiple of bdev block size.
Instead of munging it after opening the bdev in ssd/WriteLog.cc, let's
impose a common restriction and round rbd_persistent_cache_size down to
a 1M boundary.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 1fc722bc218acfe8a52ad7be2c06347bddb42bbe)

librbd/cache/pwl: supplement error handle

add return after error handling and clean bdev before return

Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit 2cd9881116a47794051615fcf0a95a5e879ae7e6)

librdb/cache/pwl/ssd: fix m_bytes_allocated exceeding m_bytes_allocated_cap

SSD cache mode uses m_bytes_allocated and m_bytes_allocated_cap to control
whether the allocation is successful. m_bytes_allocated may exceeding
m_bytes_allocated_cap due to the multi-thread under no lock. When adding
to m_bytes_allocated, check again to solve this issue.

On the other hand, m_bytes_allocated_cap should be less than the actual
ring buffer size. keep m_first_free_entry == m_first_valid_entry
only means the cache is empty.

Fixes: https://tracker.ceph.com/issues/50670
Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit eee12082ff77afd7a69c6261acf5c1f59f1f9fed)

librbd/cache/pwl: initialize number_log_entries

Using uninitialized number_log_entries cause writesame req space
calculation error. sometimes fail in TestMockCacheSSDWriteLog.writesame.

Fixes: https://tracker.ceph.com/issues/52852
Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit dd33684733014a4dedf9551d75efe591d330d864)

librbd/cache/pwl: supplement error handle

add return after error handling and clean bdev before return

Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit 2cd9881116a47794051615fcf0a95a5e879ae7e6)

librbd/cache/pwl: fix reorder when flush cache-data to osd.

Consider the following workload:
writeA(0, 4096)
writeB(0, 512).
pwl can makre sure writeA persist to cache before writeB.
But when flush to osd, it use async-read to read data from cache and in
the callback function they issue write to osd.
So although we by order issue aio-read(4096), aio-read(512). But we
can't make sure the return order.
If aio-read(512) firstly return, the write order to next layer is
writeB(0, 512)
writeA(0, 4096).
This is wrong from the user point.

To avoid this occur, we should firstly read all data from cache. And
then send write by order.

Fiexs: https://tracker.ceph.com/issues/52511

Tested-by: Feng Hualong <hualong.feng@intel.com>
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit 1fc3be248097eb6087560c22374c1f924bfe0735)

librbd/cache/pwl: Fix dead lock issue when pwl initialization failed

when pwl initialization failed, 'AbstractWriteLog' will release itself
in callback, it hold guard lock and want to get lock to delete data,
which causes dead lock. This PR works by release image_cache outside
the callback function.

Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit 937af36e204791554708632245b4bca1d52f49a6)

librbd/cache/pwl: solve the problem of calc m_bytes_allocated when reload entries.

Currently, it will load existing entries after restart and cacl
m_bytes_allocated based on those entries. But currently there are
the following problems：
1: The allocated of write-same is not calculated for rwl & ssd cache.
2: for ssd cache, it not include the size of log-entry itself and don't
consider data alignment. This will cause less calculation and more
allocatation later. And will overwrite the data which don't flush to osd
and make data lost.

The calculation methods of ssd and rwl are different. So add new api
allocated_and_cached_data() to implement their own method.

For SSD cache, we dirtly use m_first_valid_entry & m_first_free_entry to
calc m_bytes_allocated.

trivial fix: new code from PR, nothing unrelated: https://www.diffchecker.com/S1eXatpM
Fixes:https://tracker.ceph.com/issues/52341
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit a96ca93d69d5c1f302f3141082302d4699915397)

librbd/cache/pwl/ssd: fix first_valid_entry calculation in retire_entries()

Consider one control_block which cotain multi encode(WriteLogCacheEntry):
Log1: WriteLogEntry
Log2: WriteLogEntry
Log3: Non-WriteLogEntry
For this case, currently calc method is: control_block_pos + sizeof(control_block).
But in fact, it should: control_block_pos + sizeof(control_block) +
data_length(Log1 + Log2).

Wrong first_valid_entry will persist to superblock and restart to read.
This cause read wrong position and when decode(WriteLogCacheEntry) it
will report bug.

Fixes: https://tracker.ceph.com/issues/52323
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit 2d337fb122d147e32d027d1e7211cd4156a5b72b)

librbd/cache/pwl/ssd: solve competition between read and retire

SSD read is not like rwl's. SSD need aio read. Therefore,
we cannot guarantee that the data will not be retire
during the period from sending the read request to the SSD
and receiving the data to the memory, which may cause
the corresponding data on the SSD to be overwritten.

Fixes: https://tracker.ceph.com/issues/52236
Signed-off-by: Feng Hualong <hualong.feng@intel.com>
(cherry picked from commit dfdb452aa996af40f7b0ec684670ccf9a9b2d4c1)

librbd/cache/pwl/rwl: fix buf_persist and add writeback_lat perf counters

initialize buf_persist_time, then change name buf_persist_time to
buf_persist_start_time, change flush to internal_flush. add
writeback_lat perf conters. update some print formats for perf.

Fixes: https://tracker.ceph.com/issues/52090
Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit 8ed3080078ad5eaa51145a9481c7c2223ad38765)

librbd/cache/pwl: avoid stack overflow caused by nested shared_ptr destruction

Destruction of nested shared_ptr will cause stack overflow.
With the explicit assignment of nullptr, the deleted node
is completely disconnected from the current linked list

-------              *******               -------
|sync | <--earlier-- |sync | <--earlier-x- |sync |
|point| --later----> |point| --later----x> |point|
-------              *******               -------
   |                    |                     |
   V                    V                     V
-------              -------               -------
|log_ | ---next----> |log_ | ---next----x> |log_ |
|entry|              |entry|               |entry|
-------              -------               -------

earlier: earlier_sync_point
later:   later_sync_point
next:    next_sync_point_entry

Fixes: https://tracker.ceph.com/issues/51418
Signed-off-by: Feng Hualong <hualong.feng@intel.com>
(cherry picked from commit e706b9db5c5d79366c5167d01ad46e13f8500936)

librbd/cache/pwl/ssd: fix use-after-free on C_BlockIORequest

In setup_schedule_append() function, its first expression
will cause the req to be deleted, and subsequent use of
the variable req becomes an illegal operation. And due to
delete, rep->m_image_ctx will be empty, so it lead to
segfault in AbstractWriteLog::get_context().
So pass the `req` into `schedule_append()` function.

Fixes: https://tracker.ceph.com/issues/50951
Signed-off-by: Hualong Feng <hualong.feng@intel.com>
(cherry picked from commit 2dc3b8881290f1e12c536a232bea37547a73a45b)

librbd/cache/pwl/ssd: flushed_sync_gen capture is unused

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit f8fb7609b30bf3e60a64e4282250b121151d1559)

librbd/cache/pwl/ssd: ensure first_{valid,free}_entry aren't bogus

Ensure first_{valid,free}_entry are inside the expected range when
scheduling root updates and decoding the root on recovery.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit b46f81fbb22e67548e9825756da625d8e9017d10)

librbd/cache/pwl/ssd: avoid corrupting m_first_free_entry

In append_op_log_entries(), new_first_free_entry is read after
append_ops() returns.  This can result in accessing freed memory
because all I/Os may complete and append_ctx callback may run
by the time new_first_free_entry is read.  Garbage value gets
written to m_first_free_entry and depending on the circumstances
it may allow AbstractWriteLog code to accept more dirty user data
than we have space for.  Luckily we usually crash before then.

Fixes: https://tracker.ceph.com/issues/50832
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit d83a0f6db8ff26eeb2c817b1bd192fb357f715df)

librbd/cache/pwl/ssd: ensure bdev and pool sizes match

m_log_pool_size should be multiple of 1M but, just in case, prevent
is_valid_io() assert in KernelDevice::aio_write().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 41b95ac987954dc2c29090926236660b8348e9d2)

librbd/cache/pwl: make pool size a multiple of 1M

In ssd mode, we need it to be a multiple of bdev block size.
Instead of munging it after opening the bdev in ssd/WriteLog.cc, let's
impose a common restriction and round rbd_persistent_cache_size down to
a 1M boundary.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 1fc722bc218acfe8a52ad7be2c06347bddb42bbe)

librbd/cache/pwl: supplement error handle

add return after error handling and clean bdev before return

trivial fix:
<<<<<<< head
      on_finish->complete(-1);
      return;
=======
      on_finish->complete(r);
      return false;
>>>>>>> 2cd9881116a (librbd/cache/pwl: supplement error handle)

picked second patch
Signed-off-by: Yin Congmin <congmin.yin@intel.com>
(cherry picked from commit 2cd9881116a47794051615fcf0a95a5e879ae7e6)

librbd/cache/pwl/ssd: stronger assert in aio_read_data_blocks()

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit ab05cc4a8f4709db649ad51d32141e429d9dc2b3)

librbd/cache/pwl/ssd: rename aio_read_data_block() overload

Rename the overload that deals with multiple data blocks to
aio_read_data_blocks().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 00363a0f7c20a7f002b37eac93d88791e5bc2568)

librbd/cache/pwl/ssd: persist correct write_data_pos

WriteLogCacheEntry gets appended to persist_log_entries before
write_data_pos is updated with the actual media offset. Because
push_back() makes a copy, the updated write_data_pos value never
makes it to media, making recovery impossible.

Fixes: https://tracker.ceph.com/issues/50669
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit fe757401ada7bfd6784b6f9ca5556e1459df7a69)

librbd/cache/pwl/ssd: set m_bytes_allocated_cap on recovery

Currently it's set only when a new cache is formatted.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit cb9b3afd87a28e96d3688e1c73900d8043fac6cc)

librbd/cache/pwl/ssd: actually use first_{valid,free}_entry on recovery

first_valid_entry and first_free_entry pointers are read from media
but not actually used: both m_first_valid_entry and m_first_free_entry
get assigned 0 (or garbage). next_log_pos gets the same value as well
meaning that not only no recovery is attempted but the cache also gets
corrupted because DATA_RING_BUFFER_OFFSET is not applied.

Fixes: https://tracker.ceph.com/issues/50669
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit ef020b85fb16c1730fc08eadfd1b51d3c4cd019a)

librbd/cache/pwl/ssd: don't count log entries

In ssd mode log entries are variable size. Attempting to count and
impose watermarks on the number of log entries is bogus because the
total number of entries it would take to fill the cache to capacity
is also variable and can't be precisely estimated.

had conflicts, but no new changes
Fixes: https://tracker.ceph.com/issues/50669
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit ea65553b4a9ee1349c6da8452d861afe579e99e9)

librbd/cache/pwl: fix AbstractWriteLog::check_allocation() signature

All parameters are integers and none of them are (in-)out, so don't
take them by reference. Additionally num_lanes, num_log_entries and
num_unpublished_reserves don't need to be 64-bit as their respective
fields in AbstractWriteLog are 32-bit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 74ecc4b76a10c53be928807b5be077f080d34724)

librbd/cache/pwl: rename m_log_pool_config_size to m_log_pool_size

trivial fix: no new changes: https://www.diffchecker.com/9btXJhCC
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 829ef952d2e408fe3676b38e7ecd26cbb04571a5)

librbd/cache/pwl: get rid of AbstractWriteLog::m_log_pool_actual_size

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 820bbecfb130ad99483f0d468f1b1c9612e54935)

librbd/cache/pwl/ssd: get rid of WriteLog::pool_size

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 6cd36592f631051d2ae49b7d92f2a30cd12c9c41)

mgr/dashboard: fix missing alert rule details

Fixes: https://tracker.ceph.com/issues/53144
Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
(cherry picked from commit b47f9c83d87057bbb1a0c6052450088532a31f81)

Merge pull request #43793 from ifed01/wip-ifed-fix-omap-upgrade-pac

pacific: os/bluestore: fix invalid omap name conversion when upgrading to per-pg

Reviewed-by: Neha Ojha <nojha@redhat.com>

mgr/dashboard: NFS 'create export' form: fixes

* Do not allow a pseudo that is already in use by another export.
* Create mode form: prefill dropdown selectors if options > 0.
* Edit mode form: do not reset the field values that depend on other values that are being edited (unlike Create mode).
* Fix broken link: cluster service.
* Fix error message style for non-existent cephfs path.
* nfs-service.ts: lsDir: thow error if volume is not provided.
* File renaming: nfsganesha.py => nfs.py; test_ganesha.py => test_nfs.py

Fixes: https://tracker.ceph.com/issues/53083
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
(cherry picked from commit d817a24e345516229bc637e2c675d12e6bfcc456)

Conflicts:
src/pybind/mgr/dashboard/tests/test_ganesha.py
- Delete file: in fact it's renamed as test_nfs.py

mgr/dashboard: python unit tests refactoring

* Controller tests: cherrypy config: authentication disabled by default; ability to pass custom config (e.g. enable authentication).
* Auth controller: add tests; test that unauthorized request fails when authentication is enabled.
* DocsTest: clear ENDPOINT_MAP so the test_gen_tags test becomes deterministic.
* pylint: disable=no-name-in-module: fix imports in tests.

Fixes: https://tracker.ceph.com/issues/53083
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
(cherry picked from commit a6aeded5141ec3a959a86aa4002ccf5ac8d0a523)

Conflicts:
src/pybind/mgr/dashboard/tests/test_docs.py
- Resolved conflicts.
src/pybind/mgr/dashboard/tests/test_iscsi.py
- Resolved conflicts.

qa/rgw: pacific branch targets ceph-pacific branch of java_s3tests

this commit is applied directly to the pacific branch instead of a
backport from master, because it targets the ceph-pacific branch instead
of the ceph-master branch on master

Signed-off-by: Casey Bodley <cbodley@redhat.com>