librbd/cache/pwl: fix deadlock when pwl initialization fails
When pwl initialization fails, AbstractWriteLog releases itself
from inside the callback. The callback holds the guard lock and then
tries to take the same lock to delete data, which causes a deadlock.
This PR fixes it by releasing image_cache outside the callback
function.
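A minimal sketch of the pattern with hypothetical names (the real
AbstractWriteLog teardown is more involved): a non-recursive mutex is
taken once around the callback and again by the destructor.

  #include <memory>
  #include <mutex>

  struct CacheSketch {
    std::mutex m_lock;
    ~CacheSketch() {
      std::lock_guard<std::mutex> locker(m_lock);  // frees cached data under the lock
    }
  };

  // Init-failure callback, invoked with the guard lock held.
  void handle_init(int r, std::unique_ptr<CacheSketch>& image_cache) {
    bool failed;
    {
      std::lock_guard<std::mutex> locker(image_cache->m_lock);
      failed = (r < 0);
      // BUG: image_cache.reset() here would run ~CacheSketch(), which
      // blocks taking m_lock a second time -> deadlock.
    }
    if (failed) {
      image_cache.reset();  // FIX: release outside the guarded section
    }
  }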
Jianpeng Ma [Tue, 31 Aug 2021 01:02:56 +0000 (09:02 +0800)]
librbd/cache/pwl: fix the calculation of m_bytes_allocated when reloading entries
Currently, existing entries are loaded after a restart and
m_bytes_allocated is calculated from those entries. There are
the following problems:
1: The allocation for write-same is not counted, for either the rwl
or the ssd cache.
2: For the ssd cache, the size of the log entry itself is not
included and data alignment is not considered. This undercounts the
allocation, so more is allocated later, which can overwrite data that
has not yet been flushed to the osd and lose it.
The calculation methods for ssd and rwl differ, so add a new API
allocated_and_cached_data() so each mode can implement its own method.
For the SSD cache, we directly use m_first_valid_entry and
m_first_free_entry to calculate m_bytes_allocated.
trivial fix: new code from PR, nothing unrelated: https://www.diffchecker.com/S1eXatpM
Fixes: https://tracker.ceph.com/issues/52341 Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit a96ca93d69d5c1f302f3141082302d4699915397)
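A minimal sketch of the SSD-side idea, assuming a byte-addressed ring
buffer whose data region starts at a fixed offset (names and layout
are illustrative, not the actual WriteLog code):

  #include <cstdint>

  // Bytes allocated between the valid and free pointers, handling
  // wrap-around of the ring.
  uint64_t bytes_allocated(uint64_t first_valid_entry,  // oldest live data
                           uint64_t first_free_entry,   // next write position
                           uint64_t pool_size,          // ring size in bytes
                           uint64_t data_offset) {      // start of data region
    if (first_free_entry >= first_valid_entry) {
      return first_free_entry - first_valid_entry;
    }
    // the free pointer wrapped past the end of the ring
    return (pool_size - first_valid_entry) + (first_free_entry - data_offset);
  }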
Jianpeng Ma [Fri, 20 Aug 2021 06:29:37 +0000 (14:29 +0800)]
librbd/cache/pwl/ssd: fix first_valid_entry calculation in retire_entries()
Consider one control_block that contains multiple encoded WriteLogCacheEntry records:
Log1: WriteLogEntry
Log2: WriteLogEntry
Log3: Non-WriteLogEntry
For this case, the current calculation is control_block_pos +
sizeof(control_block). In fact it should be control_block_pos +
sizeof(control_block) + data_length(Log1 + Log2).
The wrong first_valid_entry is persisted to the superblock and read
back on restart. Reads then start from the wrong position, and
decode(WriteLogCacheEntry) reports a bug.
Fixes: https://tracker.ceph.com/issues/52323 Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit 2d337fb122d147e32d027d1e7211cd4156a5b72b)
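A minimal sketch of the corrected advance (illustrative names): the
next first_valid_entry must skip both the control block and the data
payloads of the write entries it covers.

  #include <cstdint>

  uint64_t next_first_valid_entry(uint64_t control_block_pos,
                                  uint64_t control_block_size,
                                  uint64_t write_data_bytes) {  // Log1 + Log2 data
    // old (buggy) result: control_block_pos + control_block_size
    return control_block_pos + control_block_size + write_data_bytes;
  }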
librbd/cache/pwl/ssd: solve competition between read and retire
SSD reads are not like rwl's: SSD needs aio reads. Therefore,
we cannot guarantee that the data will not be retired in the
window between sending the read request to the SSD and receiving
the data into memory, which may cause the corresponding data on
the SSD to be overwritten.
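One way to close that window, sketched with assumed names (the actual
patch may differ): pin a log entry while its aio read is in flight so
retire skips it.

  #include <atomic>
  #include <memory>

  struct LogEntrySketch {
    std::atomic<int> reads_in_flight{0};
  };

  void start_aio_read(const std::shared_ptr<LogEntrySketch>& entry) {
    entry->reads_in_flight++;  // pin before the aio is submitted
    // ... submit the bdev aio read; its completion, possibly much
    // later, does: entry->reads_in_flight--;
  }

  bool can_retire(const LogEntrySketch& entry) {
    return entry.reads_in_flight.load() == 0;  // retire skips pinned entries
  }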
librbd/cache/pwl/rwl: fix buf_persist and add writeback_lat perf counters
Initialize buf_persist_time, then rename buf_persist_time to
buf_persist_start_time and rename flush to internal_flush. Add
writeback_lat perf counters. Update some print formats for perf.
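A minimal sketch of what the renamed field conveys, using std::chrono
in place of Ceph's internal clocks (illustrative only): the
_start_time suffix signals that the latency is taken at completion.

  #include <chrono>

  struct RequestSketch {
    std::chrono::steady_clock::time_point buf_persist_start_time;
  };

  void on_persist_start(RequestSketch& req) {
    req.buf_persist_start_time = std::chrono::steady_clock::now();
  }

  std::chrono::nanoseconds on_persist_complete(const RequestSketch& req) {
    // elapsed time, fed to a counter such as buf_persist or writeback_lat
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now() - req.buf_persist_start_time);
  }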
librbd/cache/pwl: avoid stack overflow caused by nested shared_ptr destruction
Destruction of nested shared_ptrs can overflow the stack, because
each node's destructor recursively destroys the next node. With the
explicit assignment of nullptr, the deleted node is completely
disconnected from the current linked list, so the chain is torn
down one node at a time.
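A standalone illustration with a hypothetical Node type: each
shared_ptr destructor recursively destroys the next node, one stack
frame per node, unless the links are broken explicitly.

  #include <memory>
  #include <utility>

  struct Node {
    std::shared_ptr<Node> next;
  };

  int main() {
    auto head = std::make_shared<Node>();
    auto tail = head;
    for (int i = 0; i < 1000000; ++i) {
      tail->next = std::make_shared<Node>();
      tail = tail->next;
    }
    tail.reset();
    // head.reset();  // recursive ~Node chain -> stack overflow

    // Fix mirrors the commit: disconnect each node with an explicit
    // nullptr assignment so destruction proceeds one node at a time.
    while (head) {
      head = std::exchange(head->next, nullptr);
    }
    return 0;
  }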
librbd/cache/pwl/ssd: fix use-after-free on C_BlockIORequest
In the setup_schedule_append() function, the first expression can
cause req to be deleted, making any subsequent use of the variable
req an illegal operation. And because of the delete, req->m_image_ctx
is gone, which leads to a segfault in AbstractWriteLog::get_context().
So pass `req` into the `schedule_append()` function.
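A minimal sketch of the hazard with illustrative types (not the real
C_BlockIORequest): once the first call can free req, the later
dereference is a use-after-free; passing req through keeps its last
use before the handoff.

  #include <vector>

  struct RequestSketch {
    std::vector<int> log_entries;
  };

  void hand_off(RequestSketch* req) { delete req; }   // may free req
  void schedule_append(std::vector<int>& entries) {}  // old shape
  void schedule_append(RequestSketch* req) {}         // fixed shape

  void setup_buggy(RequestSketch* req) {
    hand_off(req);
    schedule_append(req->log_entries);  // use-after-free: req is gone
  }

  void setup_fixed(RequestSketch* req) {
    schedule_append(req);  // req travels with the call; no later deref
  }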
In append_op_log_entries(), new_first_free_entry is read after
append_ops() returns. This can result in accessing freed memory
because all I/Os may complete and append_ctx callback may run
by the time new_first_free_entry is read. Garbage value gets
written to m_first_free_entry and depending on the circumstances
it may allow AbstractWriteLog code to accept more dirty user data
than we have space for. Luckily we usually crash before then.
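A minimal sketch of the safe ordering (assumed shapes, not the
literal patch): capture the value before the I/O is issued, since the
completion may free the state the old code read afterwards.

  #include <cstdint>

  struct AppendStateSketch {
    uint64_t new_first_free_entry;
  };

  // Stand-in for issuing the appends; the completion may free `state`
  // before the caller's next statement runs.
  void issue_async_appends(AppendStateSketch* state) { delete state; }
  void publish_first_free(uint64_t first_free) { (void)first_free; }

  void append_op_log_entries_sketch(AppendStateSketch* state) {
    uint64_t first_free = state->new_first_free_entry;  // read before I/O
    issue_async_appends(state);      // all I/Os may complete right here
    publish_first_free(first_free);  // safe: uses the captured copy
  }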
Ilya Dryomov [Sat, 14 Aug 2021 17:06:28 +0000 (19:06 +0200)]
librbd/cache/pwl: make pool size a multiple of 1M
In ssd mode, we need it to be a multiple of bdev block size.
Instead of munging it after opening the bdev in ssd/WriteLog.cc, let's
impose a common restriction and round rbd_persistent_cache_size down to
a 1M boundary.
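A minimal sketch of the rounding in plain bit math (Ceph has its own
alignment helpers):

  #include <cstdint>

  constexpr uint64_t MiB = 1ULL << 20;

  uint64_t effective_pool_size(uint64_t rbd_persistent_cache_size) {
    return rbd_persistent_cache_size & ~(MiB - 1);  // round down to 1M
  }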
WriteLogCacheEntry gets appended to persist_log_entries before
write_data_pos is updated with the actual media offset. Because
push_back() makes a copy, the updated write_data_pos value never
makes it to media, making recovery impossible.
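A minimal illustration of the copy semantics with a simplified entry
type: the field must be set before the entry is appended, because
push_back() stores a copy.

  #include <cstdint>
  #include <vector>

  struct EntrySketch {
    uint64_t write_data_pos = 0;
  };

  void persist_buggy(std::vector<EntrySketch>& persist_log_entries,
                     EntrySketch& entry, uint64_t media_offset) {
    persist_log_entries.push_back(entry);  // copy taken here...
    entry.write_data_pos = media_offset;   // ...so the persisted copy keeps 0
  }

  void persist_fixed(std::vector<EntrySketch>& persist_log_entries,
                     EntrySketch& entry, uint64_t media_offset) {
    entry.write_data_pos = media_offset;   // update first
    persist_log_entries.push_back(entry);  // the copy carries the real offset
  }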
Ilya Dryomov [Thu, 13 May 2021 11:11:57 +0000 (13:11 +0200)]
librbd/cache/pwl/ssd: actually use first_{valid,free}_entry on recovery
first_valid_entry and first_free_entry pointers are read from media
but not actually used: both m_first_valid_entry and m_first_free_entry
get assigned 0 (or garbage). next_log_pos gets the same value as well
meaning that not only no recovery is attempted but the cache also gets
corrupted because DATA_RING_BUFFER_OFFSET is not applied.
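A minimal sketch of the intended recovery path (field names follow
the commit; the offset value is an assumption):

  #include <cstdint>

  constexpr uint64_t DATA_RING_BUFFER_OFFSET = 8192;  // assumed value

  struct SuperblockSketch {
    uint64_t first_valid_entry;
    uint64_t first_free_entry;
  };

  void load_existing(const SuperblockSketch& sb,
                     uint64_t& m_first_valid_entry,
                     uint64_t& m_first_free_entry,
                     uint64_t& next_log_pos) {
    m_first_valid_entry = sb.first_valid_entry;  // no longer left 0/garbage
    m_first_free_entry = sb.first_free_entry;
    // resume at the valid data, which already sits past the ring's
    // DATA_RING_BUFFER_OFFSET
    next_log_pos = m_first_valid_entry;
  }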
Ilya Dryomov [Sat, 8 May 2021 08:24:37 +0000 (10:24 +0200)]
librbd/cache/pwl/ssd: don't count log entries
In ssd mode log entries are variable size. Attempting to count and
impose watermarks on the number of log entries is bogus because the
total number of entries it would take to fill the cache to capacity
is also variable and can't be precisely estimated.
had conflicts, but no new changes
Fixes: https://tracker.ceph.com/issues/50669 Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit ea65553b4a9ee1349c6da8452d861afe579e99e9)
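A minimal sketch of the byte-based alternative (the watermark
fraction is assumed, for illustration): compare bytes against
capacity instead of counting variable-size entries.

  #include <cstdint>

  bool over_retire_high_water(uint64_t bytes_allocated,
                              uint64_t bytes_allocated_cap) {
    return bytes_allocated > bytes_allocated_cap / 2;  // assumed 50% mark
  }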
All parameters are integers and none of them are (in-)out, so don't
take them by reference. Additionally num_lanes, num_log_entries and
num_unpublished_reserves don't need to be 64-bit as their respective
fields in AbstractWriteLog are 32-bit.
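A sketch of the signature cleanup this describes (the function name
and exact parameter list are assumptions):

  #include <cstdint>

  // before: void reserve_resources(uint64_t& num_lanes,
  //                                uint64_t& num_log_entries,
  //                                uint64_t& num_unpublished_reserves);
  void reserve_resources(uint32_t num_lanes,            // by value,
                         uint32_t num_log_entries,      // 32-bit to match
                         uint32_t num_unpublished_reserves) {  // the fields
    // ... reserve lanes, log entries and unpublished reserves ...
  }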
Ilya Dryomov [Wed, 12 May 2021 10:19:07 +0000 (12:19 +0200)]
librbd/cache/pwl: rename m_log_pool_config_size to m_log_pool_size
trivial fix: no new changes: https://www.diffchecker.com/9btXJhCC Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 829ef952d2e408fe3676b38e7ecd26cbb04571a5)
librbd/cache/pwl/ssd/WriteLog: don't crash on split log entries
write_log_entries() will split a log entry at the end of the log, the
remainder is written to the beginning at DATA_RING_BUFFER_OFFSET. On
the read side aio_read_data_block() doesn't handle this case and just
crashes. Unless the workload in use is <= 4K, the image is rendered
unusable sooner or later.
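A minimal sketch of the read-side handling (ring layout and constant
are assumptions): if the block was split at the end of the ring, the
remainder lives at DATA_RING_BUFFER_OFFSET.

  #include <algorithm>
  #include <cstdint>
  #include <functional>

  constexpr uint64_t DATA_RING_BUFFER_OFFSET = 8192;  // assumed value

  void read_data_block(uint64_t pool_size, uint64_t pos, uint64_t len,
                       const std::function<void(uint64_t, uint64_t)>& bdev_read) {
    uint64_t first_chunk = std::min(len, pool_size - pos);
    bdev_read(pos, first_chunk);
    if (first_chunk < len) {
      // the writer split this entry: wrap to the start of the data ring
      bdev_read(DATA_RING_BUFFER_OFFSET, len - first_chunk);
    }
  }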
librbd/cache/pwl: use m_bytes_allocated_cap for both rwl and ssd
Follow rwl mode and use AbstractWriteLog::m_bytes_allocated_cap
instead of m_log_pool_ring_buffer_size specific to ssd. This fixes
"bytes available" calculation in STATS output.
librbd/cache/pwl/ssd/WriteLog: decrement m_bytes_allocated when retiring
Currently if ssd cache is filled to capacity, all future I/O hangs
indefinitely because even though the cache eventually becomes clean
and retires enough entries to get back under RETIRE_HIGH_WATER, this
isn't communicated to AbstractWriteLog::check_allocation().
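A minimal sketch of the accounting this fixes (illustrative
structure): retiring must give the bytes back so check_allocation()
can admit new writes again.

  #include <cstdint>

  struct WriteLogSketch {
    uint64_t m_bytes_allocated = 0;
    uint64_t m_bytes_allocated_cap = 0;

    void retire_entries(uint64_t freed_bytes) {
      m_bytes_allocated -= freed_bytes;  // the decrement ssd mode lacked
      // ...then wake up writers blocked in check_allocation()
    }

    bool check_allocation(uint64_t bytes) const {
      return m_bytes_allocated + bytes <= m_bytes_allocated_cap;
    }
  };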
Kefu Chai [Sat, 30 Oct 2021 03:18:17 +0000 (11:18 +0800)]
admin/doc-requirements.txt: pin Sphinx at 3.5.4
* pin Sphinx at 3.5.4
* pin docutils at 0.18
At least the combination of these two versions is known to build
the docs.
This addresses the bug reported at
https://sourceforge.net/p/docutils/bugs/431/
The backtrace looks like:
/home/jenkins-build/build/workspace/ceph-pr-docs/build-doc/virtualenv/lib/python3.8/site-packages/sphinx/util/docutils.py:285:
RemovedInSphinx30Warning: function based directive support is now
deprecated. Use class based directive instead.
warnings.warn('function based directive support is now deprecated. '
Exception occurred:
File
"/home/jenkins-build/build/workspace/ceph-pr-docs/build-doc/virtualenv/lib/python3.8/site-packages/docutils/writers/html5_polyglot/__init__.py",
line 445, in section_title_tags
if (ids and self.settings.section_self_link
AttributeError: 'Values' object has no attribute 'section_self_link'
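For reference, the pins described above would look like this in
admin/doc-requirements.txt (a sketch assuming plain pip requirement
syntax; the real file pins other packages too):

  Sphinx==3.5.4
  docutils==0.18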
Nathan Cutler [Wed, 20 Oct 2021 10:51:02 +0000 (12:51 +0200)]
rgw/tracing: unify SO version numbers within librgw2 package
The librgw2 package contains several SO files. Two of those - librgw_op_tp.so
and librgw_rados_tp.so - had a different version number than the main librgw.
This was a violation of the openSUSE Shared Library Packaging Policy [1] but it
also seems like a "violation" of common sense.
* APIVersion:
* Moved to a separate file
* Added doctests
* Added sentinel values:
* DEFAULT = 1.0
* EXPERIMENTAL = 0.1
* NONE = 0.0
* Added to_mime_type() helper method
* Controllers.__init__:
* Added type hints
* Replaced string versions with APIVersions
* Feedback controller:
* Replaced with EXPERIMENTAL (probably it should be NONE)
Fixes: https://tracker.ceph.com/issues/52480 Signed-off-by: Ernesto Puerta <epuertat@redhat.com>
Conflicts:
src/pybind/mgr/dashboard/controllers/__init__.py
- Remove the current changes and keep the incoming new changes
src/pybind/mgr/dashboard/controllers/crush_rule.py
- Changes related to the versioning like importing the APIVersion
src/pybind/mgr/dashboard/controllers/docs.py
- Changes related to the versioning like importing the APIVersion
src/pybind/mgr/dashboard/controllers/feedback.py
- Deleted the file since feedback module isn't backported to pacific
src/pybind/mgr/dashboard/controllers/host.py
- Changes related to the versioning like importing the APIVersion
src/pybind/mgr/dashboard/openapi.yaml
- Generated a new openapi yaml file
src/pybind/mgr/dashboard/tests/__init__.py
- Changes related to the versioning like importing the APIVersion
src/pybind/mgr/dashboard/tests/test_docs.py
- Changes related to the versioning like importing the APIVersion
src/pybind/mgr/dashboard/tests/test_host.py
- Changes related to the versioning like importing the APIVersion
src/pybind/mgr/dashboard/tests/test_tools.py
- Changes related to the versioning like importing the APIVersion
src/pybind/mgr/dashboard/tests/test_versioning.py
- Changes related to the versioning like importing the APIVersion
src/pybind/mgr/dashboard/controllers/crush_rule.py
- Removed the MethodMap decorator, which updates the version of the
endpoint to 2.0, because the changes that caused that version bump
were not backported to pacific
Patrick Donnelly [Tue, 14 Sep 2021 17:02:12 +0000 (13:02 -0400)]
test/libcephfs: put inodes after lookup
Otherwise, the client umount will hang due to inability to trim the
inodes looked up using the low-level interface. This results in slow-op
warnings and an eviction:
2021-09-11T17:23:31.097+0000 7f99c3522700 0 log_channel(cluster) log [WRN] : evicting unresponsive client smithi176 (9756), after 303.924 seconds
2021-09-11T17:23:31.097+0000 7f99c3522700 10 mds.0.server autoclosing stale session client.9756 172.21.15.176:0/3891214934 last renewed caps 303.924s ago
mgr/dashboard: make modified API endpoints backward compatible
Fixes: https://tracker.ceph.com/issues/52480 Signed-off-by: Avan Thakkar <athakkar@redhat.com>
Introducing the APIVersion class to handle versioning for API
endpoints and making them backward compatible.
The test is failing on deleting a host because the agent daemon is
present on that host. It's not possible to simply delete a host; we
need to drain it first and then delete it.
where the numbers of scrubbed objects, clones, dirty and omap
objects are always less than the corresponding totals, if the PG
contains object(s) whose hash happens to be 0xffffffff.
In this change, if the calculated hash of the upper bound is greater
than the maximum number representable by uint32_t, then in addition
to setting the hash of the upper-bound hobj to 0xffffffff, we also
set the nspace of the upper-bound hobj to "\xff", so that the upper
bound is greater than an hobj whose hash happens to be 0xffffffff.
Please note, the nspace "\xff" is not an ASCII string, so it is not
likely to compare less than a real-world nspace of an hobj.
With this new *greater* upper bound, we are able to include the
previously missing hobj when listing the objects in a PG, so scrub
is no longer annoyed when the number of objects does not match.
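A minimal sketch of the upper-bound adjustment with a simplified
hobject type: clamp the hash and push the namespace past any real one.

  #include <cstdint>
  #include <string>

  struct HObjectSketch {
    uint32_t hash = 0;
    std::string nspace;
  };

  void set_scrub_upper_bound(uint64_t calculated_hash, HObjectSketch& upper) {
    if (calculated_hash > UINT32_MAX) {
      upper.hash = 0xffffffff;
      upper.nspace = "\xff";  // non-ASCII, sorts after real namespaces
    } else {
      upper.hash = static_cast<uint32_t>(calculated_hash);
    }
  }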
Mykola Golub [Mon, 30 Aug 2021 06:58:04 +0000 (07:58 +0100)]
osd: re-cache peer_bytes on every peering state activate
peer_bytes is used for the backfill reservation request and may be
reset if backfill is interrupted; we want it set back before
continuing backfill and re-sending the reservation request.