Yin Congmin [Fri, 12 Nov 2021 08:54:31 +0000 (16:54 +0800)]
qa/suites/rbd/persistent-writeback-cache: add test case
Add the test case which size is 8GB, So that some problems that occur
only in test scenarios above 4GB may be found in this test. For example,
the variables of 32-bit may be unexpected value when it operates with
a 64 bit value.
Jianpeng Ma [Mon, 8 Nov 2021 06:33:28 +0000 (14:33 +0800)]
librbd/cache/pwl: fix reorder issue between func process_writeback_dirty_entries
In fact, we not only make sure ops in order in func process_writeback_dirty_entries,
but also make sure ops in order between func process_writeback_dirty_entries.
cmake/modules/Findpmem: always set pmem_VERSION_STRING
before this change, `pmem_VERSION_STRING` is not set if it is not able
to fulfill the specified version requirement. the intention was to check
if the version is able to satisfy the requirement. but actually, passing
an empty `pmem_VERSION_STRING` to `find_package_handle_standard_args()`
as the option of `VERSION_VAR` does not fail this check. on the
contrary, it prints
-- Found pmem: pmem_pmemobj_INCLUDE_DIR;pmem_pmem_INCLUDE_DIR (Required
is at least version "1.17")
if we requires pmem 1.17, while the found version is, for instance,
1.10.
if the required version is 1.7, and the found version is 1.10, the
output from cmake is:
-- Found pmem: pmem_pmemobj_INCLUDE_DIR;pmem_pmem_INCLUDE_DIR (found
suitable version "1.10", minimum required is "1.7")
in this change, the version spec is not specified when calling
`pkg_check_modules()`. so, `PKG_${component}_VERSION` is always set.
and we can always delegate the version checking to
`find_package_handle_standard_args()`. please note, we use the lower
version returned by pkg-config if multiple components are required and
both pkg-config settings return their versions.
ceph.spec.in: do not build with system pmdk by default
we need to use libpmem 1.10 in #40493.
without enabling the module stream offering libpmem 1.9.2, we can only
have access to libpmem 1.6.1. and fedora 33 only has libpmem 1.9
packaged. the same applies to openSUSE Tumbleweed and openSUSE Leap. so
let's stop using libpmem packaged by distro by default, until these
distros include libpmem 1.10.
Igor Fedotov [Thu, 15 Jul 2021 12:10:14 +0000 (15:10 +0300)]
os/bluestore: fix improper offset calculation when repairing.
While repairing misreferenced blobs BlueStore could improperly calculate
an offset within a blob being fixed. This could happen when single
physical extent has been replaced by multiple ones - the following
pextent (if any in the current blob) would be treated with the improper offset within the blob. Offset calculation didn't account for each of that new pextents but the last one only.
Fixes: https://tracker.ceph.com/issues/51682 Signed-off-by: Igor Fedotov <ifedotov@suse.com>
(cherry picked from commit ca4b6675fc3fd2f4cadad58044c97c5bb23d5938)
The marker was not working correctly as segments of the bucket index
were listed to shut down any incomplete multipart uploads. This fixes
the marker, so it's maintained properly across iterations.
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
Adam C. Emerson [Tue, 2 Nov 2021 16:46:15 +0000 (12:46 -0400)]
rgw: Ensure buckets too old to decode a layout have layout logs
When decoding `RGWBucketInfo` data from before Pacific, we won't call
`rgw::BucketLayout::decode`, but will instead synthesize the layout
information. This leaves the `rgw::BucketLayout::logs` empty, as the
fallback to populate it only applies to old versions of
`rgw::BucketLayout`.
Add a check at the end of `RGWBUcketInfo::decode` to populate it if
empty.
Fixes: https://tracker.ceph.com/issues/53132 Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
(cherry picked from commit 3279509127e65314c07963a3e127e926308bd76a) Fixes: https://tracker.ceph.com/issues/53160 Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
librbd/cache/pwl: cancel advance dispatch of external flush request
For external flush request, it new syncpoint after passing
guardedrequest and before dispatch. Then dispatch bypass deferred
queue But the last write request may still in the deferred queue.
It don't dispatch and not associated with any syncpoint. The
external flush request will bypass the previous write request in
deferred queue now. This does not conform to the semantics of
external flush requests. External flush request should strictly
follow the order of dispath.
But for internal flush request, it will be dispatched after all
write request which associated with previous syncpoint, persisted
in cache. C_gather guarantee it.
It is necessary to distinguish between external and internal
flush requests. Internal flush can and should be dispatched in
advance bypass deferred queue. At the same time, the order of
external requests needs to be kept unchanged. So cancel advance
dispatch of external flush request.
librbd/cache/pwl: fix assert in _aio_stop() during shutdown
For wait_for_ops(next_ctx). this next_ctx may run in aio_thread.
Then the next program runs on the aio thread. remove_pool_file()
calls bdev->close(), then calles _aio_stop(), exec aio_thread.join(),
cause assert. Thread can't join itself. Fix it by adding close ctx
to m_work_queue, so close() can run in work queue thread.
At the same time, correct the order of wait_for_ops().
flush_dirty_entries(next_ctx) may call wake_up() and start_op().
so moving wait_for_ops() behind flush_dirty_entries(next_ctx) is more
appropriate.
Yin Congmin [Fri, 27 Aug 2021 15:41:49 +0000 (15:41 +0000)]
librbd/cache/pwl/ssd: move finish_op() to the end of callback function
finish_op() of ssd cache is not in the end of callback function in
append_op_log_entries(), and after finish_op(), some operation also
need to get m_lock. So, during shutdown, wait_for_ops() thinks all OPs
are over, and no thread will acquire the m_lock, In the subsequent
operation of shutdown, the m_lock is obtained, and _aio_stop() in
bdev->close() waits for all aio_writes() and aio_submit() to end
when the m_lock is held, but the callback function of aio_write() is
waiting for the m_lock, causing a deadlock. Move finish_op() to the
end to fix dead lock.
Jianpeng Ma [Tue, 7 Sep 2021 02:19:55 +0000 (10:19 +0800)]
librbd/cache/pwl/ssd: Remove unused parameter.
Met the following compiler warning message:
>[38/80] Building CXX object
src/librbd/CMakeFiles/librbd_plugin_pwl_cache.dir/cache/pwl/ssd/WriteLog.cc.o
>../src/librbd/cache/pwl/ssd/WriteLog.cc:37:25: warning: unused variable
'ops_appended_together' [-Wunused-const-variable]
>const unsigned long int ops_appended_together = MAX_WRITES_PER_SYNC_POINT;
Jianpeng Ma [Tue, 7 Sep 2021 02:00:53 +0000 (10:00 +0800)]
librbd/cache/pwl/ssd: Fix a race between get_cache_bl() and remove_cache_bl()
In fact, although in get_cache_bl it use lock to protect, it can't
protect function "list& operator= (const list& other)".
So we should use copy_cache_bl.
Fixes: https://tracker.ceph.com/issues/52400 Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit fe72b3953735329441397f257d5dd18f6819187d)
In append_op_log_entries(), new_first_free_entry is read after
append_ops() returns. This can result in accessing freed memory
because all I/Os may complete and append_ctx callback may run
by the time new_first_free_entry is read. Garbage value gets
written to m_first_free_entry and depending on the circumstances
it may allow AbstractWriteLog code to accept more dirty user data
than we have space for. Luckily we usually crash before then.
In append_ops(), new_first_free_entry is assigned to after aio_submit()
is called. This can result in accessing uninitialized or freed memory
because all I/Os may complete and append_ctx callback may run before the
assignment is executed. Garbage value gets written to first_free_entry
and we eventually crash, most likely in bufferlist manipulation code.
But worse, the corrupted first_free_entry makes it to media almost all
the time. The result is a corrupted cache -- dirty user data is lost.
When retiring, m_blocks_to_log_entries doesn't remove
corresponding write_entry (should be `*it` not `entry`)
that will be retired. It leads to read error. And
there should also consider discard entries.
Jianpeng Ma [Tue, 7 Sep 2021 02:19:55 +0000 (10:19 +0800)]
librbd/cache/pwl/ssd: Remove unused parameter.
Met the following compiler warning message:
>[38/80] Building CXX object
src/librbd/CMakeFiles/librbd_plugin_pwl_cache.dir/cache/pwl/ssd/WriteLog.cc.o
>../src/librbd/cache/pwl/ssd/WriteLog.cc:37:25: warning: unused variable
'ops_appended_together' [-Wunused-const-variable]
>const unsigned long int ops_appended_together = MAX_WRITES_PER_SYNC_POINT;
In append_op_log_entries(), new_first_free_entry is read after
append_ops() returns. This can result in accessing freed memory
because all I/Os may complete and append_ctx callback may run
by the time new_first_free_entry is read. Garbage value gets
written to m_first_free_entry and depending on the circumstances
it may allow AbstractWriteLog code to accept more dirty user data
than we have space for. Luckily we usually crash before then.
In SyncPointLogOperation::clear_earlier_sync_point(),
sync_point->log_entry->next_sync_point_entry was prematurely set to
nullptr in clear_earlier_sync_point(). It is in write op stage, but
next_sync_point_entry is used in writeback stage in
handle_flushed_sync_point().
handle_flushed_sync_point() may pass a nullptr
cause assert in m_work_queue.The solution is to move the statement
that set next_sync_point_entry to nullptr after it is used.
Ilya Dryomov [Sat, 14 Aug 2021 17:06:28 +0000 (19:06 +0200)]
librbd/cache/pwl: make pool size a multiple of 1M
In ssd mode, we need it to be a multiple of bdev block size.
Instead of munging it after opening the bdev in ssd/WriteLog.cc, let's
impose a common restriction and round rbd_persistent_cache_size down to
a 1M boundary.
Ilya Dryomov [Sat, 14 Aug 2021 17:06:28 +0000 (19:06 +0200)]
librbd/cache/pwl: make pool size a multiple of 1M
In ssd mode, we need it to be a multiple of bdev block size.
Instead of munging it after opening the bdev in ssd/WriteLog.cc, let's
impose a common restriction and round rbd_persistent_cache_size down to
a 1M boundary.
SSD cache mode uses m_bytes_allocated and m_bytes_allocated_cap to control
whether the allocation is successful. m_bytes_allocated may exceeding
m_bytes_allocated_cap due to the multi-thread under no lock. When adding
to m_bytes_allocated, check again to solve this issue.
On the other hand, m_bytes_allocated_cap should be less than the actual
ring buffer size. keep m_first_free_entry == m_first_valid_entry
only means the cache is empty.
Jianpeng Ma [Mon, 1 Nov 2021 01:25:52 +0000 (09:25 +0800)]
librbd/cache/pwl: fix reorder when flush cache-data to osd.
Consider the following workload:
writeA(0, 4096)
writeB(0, 512).
pwl can makre sure writeA persist to cache before writeB.
But when flush to osd, it use async-read to read data from cache and in
the callback function they issue write to osd.
So although we by order issue aio-read(4096), aio-read(512). But we
can't make sure the return order.
If aio-read(512) firstly return, the write order to next layer is
writeB(0, 512)
writeA(0, 4096).
This is wrong from the user point.
To avoid this occur, we should firstly read all data from cache. And
then send write by order.
Fiexs: https://tracker.ceph.com/issues/52511
Tested-by: Feng Hualong <hualong.feng@intel.com> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit 1fc3be248097eb6087560c22374c1f924bfe0735)
librbd/cache/pwl: Fix dead lock issue when pwl initialization failed
when pwl initialization failed, 'AbstractWriteLog' will release itself
in callback, it hold guard lock and want to get lock to delete data,
which causes dead lock. This PR works by release image_cache outside
the callback function.
Jianpeng Ma [Tue, 31 Aug 2021 01:02:56 +0000 (09:02 +0800)]
librbd/cache/pwl: solve the problem of calc m_bytes_allocated when reload entries.
Currently, it will load existing entries after restart and cacl
m_bytes_allocated based on those entries. But currently there are
the following problems:
1: The allocated of write-same is not calculated for rwl & ssd cache.
2: for ssd cache, it not include the size of log-entry itself and don't
consider data alignment. This will cause less calculation and more
allocatation later. And will overwrite the data which don't flush to osd
and make data lost.
The calculation methods of ssd and rwl are different. So add new api
allocated_and_cached_data() to implement their own method.
For SSD cache, we dirtly use m_first_valid_entry & m_first_free_entry to
calc m_bytes_allocated.
trivial fix: new code from PR, nothing unrelated: https://www.diffchecker.com/S1eXatpM
Fixes:https://tracker.ceph.com/issues/52341 Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit a96ca93d69d5c1f302f3141082302d4699915397)
Jianpeng Ma [Fri, 20 Aug 2021 06:29:37 +0000 (14:29 +0800)]
librbd/cache/pwl/ssd: fix first_valid_entry calculation in retire_entries()
Consider one control_block which cotain multi encode(WriteLogCacheEntry):
Log1: WriteLogEntry
Log2: WriteLogEntry
Log3: Non-WriteLogEntry
For this case, currently calc method is: control_block_pos + sizeof(control_block).
But in fact, it should: control_block_pos + sizeof(control_block) +
data_length(Log1 + Log2).
Wrong first_valid_entry will persist to superblock and restart to read.
This cause read wrong position and when decode(WriteLogCacheEntry) it
will report bug.
Fixes: https://tracker.ceph.com/issues/52323 Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit 2d337fb122d147e32d027d1e7211cd4156a5b72b)
librbd/cache/pwl/ssd: solve competition between read and retire
SSD read is not like rwl's. SSD need aio read. Therefore,
we cannot guarantee that the data will not be retire
during the period from sending the read request to the SSD
and receiving the data to the memory, which may cause
the corresponding data on the SSD to be overwritten.
librbd/cache/pwl/rwl: fix buf_persist and add writeback_lat perf counters
initialize buf_persist_time, then change name buf_persist_time to
buf_persist_start_time, change flush to internal_flush. add
writeback_lat perf conters. update some print formats for perf.
librbd/cache/pwl: avoid stack overflow caused by nested shared_ptr destruction
Destruction of nested shared_ptr will cause stack overflow.
With the explicit assignment of nullptr, the deleted node
is completely disconnected from the current linked list
librbd/cache/pwl/ssd: fix use-after-free on C_BlockIORequest
In setup_schedule_append() function, its first expression
will cause the req to be deleted, and subsequent use of
the variable req becomes an illegal operation. And due to
delete, rep->m_image_ctx will be empty, so it lead to
segfault in AbstractWriteLog::get_context().
So pass the `req` into `schedule_append()` function.
In append_op_log_entries(), new_first_free_entry is read after
append_ops() returns. This can result in accessing freed memory
because all I/Os may complete and append_ctx callback may run
by the time new_first_free_entry is read. Garbage value gets
written to m_first_free_entry and depending on the circumstances
it may allow AbstractWriteLog code to accept more dirty user data
than we have space for. Luckily we usually crash before then.
Ilya Dryomov [Sat, 14 Aug 2021 17:06:28 +0000 (19:06 +0200)]
librbd/cache/pwl: make pool size a multiple of 1M
In ssd mode, we need it to be a multiple of bdev block size.
Instead of munging it after opening the bdev in ssd/WriteLog.cc, let's
impose a common restriction and round rbd_persistent_cache_size down to
a 1M boundary.
WriteLogCacheEntry gets appended to persist_log_entries before
write_data_pos is updated with the actual media offset. Because
push_back() makes a copy, the updated write_data_pos value never
makes it to media, making recovery impossible.
Ilya Dryomov [Thu, 13 May 2021 11:11:57 +0000 (13:11 +0200)]
librbd/cache/pwl/ssd: actually use first_{valid,free}_entry on recovery
first_valid_entry and first_free_entry pointers are read from media
but not actually used: both m_first_valid_entry and m_first_free_entry
get assigned 0 (or garbage). next_log_pos gets the same value as well
meaning that not only no recovery is attempted but the cache also gets
corrupted because DATA_RING_BUFFER_OFFSET is not applied.
Ilya Dryomov [Sat, 8 May 2021 08:24:37 +0000 (10:24 +0200)]
librbd/cache/pwl/ssd: don't count log entries
In ssd mode log entries are variable size. Attempting to count and
impose watermarks on the number of log entries is bogus because the
total number of entries it would take to fill the cache to capacity
is also variable and can't be precisely estimated.
had conflicts, but no new changes Fixes: https://tracker.ceph.com/issues/50669 Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit ea65553b4a9ee1349c6da8452d861afe579e99e9)
All parameters are integers and none of them are (in-)out, so don't
take them by reference. Additionally num_lanes, num_log_entries and
num_unpublished_reserves don't need to be 64-bit as their respective
fields in AbstractWriteLog are 32-bit.
Ilya Dryomov [Wed, 12 May 2021 10:19:07 +0000 (12:19 +0200)]
librbd/cache/pwl: rename m_log_pool_config_size to m_log_pool_size
trivial fix: no new changes: https://www.diffchecker.com/9btXJhCC Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 829ef952d2e408fe3676b38e7ecd26cbb04571a5)
* Do not allow a pseudo that is already in use by another export.
* Create mode form: prefill dropdown selectors if options > 0.
* Edit mode form: do not reset the field values that depend on other values that are being edited (unlike Create mode).
* Fix broken link: cluster service.
* Fix error message style for non-existent cephfs path.
* nfs-service.ts: lsDir: thow error if volume is not provided.
* File renaming: nfsganesha.py => nfs.py; test_ganesha.py => test_nfs.py
Fixes: https://tracker.ceph.com/issues/53083 Signed-off-by: Alfonso Martínez <almartin@redhat.com>
(cherry picked from commit d817a24e345516229bc637e2c675d12e6bfcc456)
Conflicts:
src/pybind/mgr/dashboard/tests/test_ganesha.py
- Delete file: in fact it's renamed as test_nfs.py
* Controller tests: cherrypy config: authentication disabled by default; ability to pass custom config (e.g. enable authentication).
* Auth controller: add tests; test that unauthorized request fails when authentication is enabled.
* DocsTest: clear ENDPOINT_MAP so the test_gen_tags test becomes deterministic.
* pylint: disable=no-name-in-module: fix imports in tests.
Fixes: https://tracker.ceph.com/issues/53083 Signed-off-by: Alfonso Martínez <almartin@redhat.com>
(cherry picked from commit a6aeded5141ec3a959a86aa4002ccf5ac8d0a523)
Casey Bodley [Thu, 4 Nov 2021 16:12:04 +0000 (12:12 -0400)]
qa/rgw: pacific branch targets ceph-pacific branch of java_s3tests
this commit is applied directly to the pacific branch instead of a
backport from master, because it targets the ceph-pacific branch instead
of the ceph-master branch on master
Sage Weil [Wed, 29 Sep 2021 20:28:07 +0000 (16:28 -0400)]
mon,auth: fix proposal of rotating keys
Instead of updating the live CephxKeyServer's rotating_keys and also
including them in a paxos proposal, propose new keys only in the proposal,
and only make them live once they are committed. This keeps mons fully in
sync and avoids any inconsistency between the live behavior and committed
state (e.g., stale or divergent keys being applied and passed out to
daemons).