cmake/modules/Findpmem: always set pmem_VERSION_STRING
before this change, `pmem_VERSION_STRING` is not set if it is not able
to fulfill the specified version requirement. the intention was to check
if the version is able to satisfy the requirement. but actually, passing
an empty `pmem_VERSION_STRING` to `find_package_handle_standard_args()`
as the option of `VERSION_VAR` does not fail this check. on the
contrary, it prints
-- Found pmem: pmem_pmemobj_INCLUDE_DIR;pmem_pmem_INCLUDE_DIR (Required
is at least version "1.17")
if we requires pmem 1.17, while the found version is, for instance,
1.10.
if the required version is 1.7, and the found version is 1.10, the
output from cmake is:
-- Found pmem: pmem_pmemobj_INCLUDE_DIR;pmem_pmem_INCLUDE_DIR (found
suitable version "1.10", minimum required is "1.7")
in this change, the version spec is not specified when calling
`pkg_check_modules()`. so, `PKG_${component}_VERSION` is always set.
and we can always delegate the version checking to
`find_package_handle_standard_args()`. please note, we use the lower
version returned by pkg-config if multiple components are required and
both pkg-config settings return their versions.
ceph.spec.in: do not build with system pmdk by default
we need to use libpmem 1.10 in #40493.
without enabling the module stream offering libpmem 1.9.2, we can only
have access to libpmem 1.6.1. and fedora 33 only has libpmem 1.9
packaged. the same applies to openSUSE Tumbleweed and openSUSE Leap. so
let's stop using libpmem packaged by distro by default, until these
distros include libpmem 1.10.
The marker was not working correctly as segments of the bucket index
were listed to shut down any incomplete multipart uploads. This fixes
the marker, so it's maintained properly across iterations.
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
librbd/cache/pwl: cancel advance dispatch of external flush request
For external flush request, it new syncpoint after passing
guardedrequest and before dispatch. Then dispatch bypass deferred
queue But the last write request may still in the deferred queue.
It don't dispatch and not associated with any syncpoint. The
external flush request will bypass the previous write request in
deferred queue now. This does not conform to the semantics of
external flush requests. External flush request should strictly
follow the order of dispath.
But for internal flush request, it will be dispatched after all
write request which associated with previous syncpoint, persisted
in cache. C_gather guarantee it.
It is necessary to distinguish between external and internal
flush requests. Internal flush can and should be dispatched in
advance bypass deferred queue. At the same time, the order of
external requests needs to be kept unchanged. So cancel advance
dispatch of external flush request.
librbd/cache/pwl: fix assert in _aio_stop() during shutdown
For wait_for_ops(next_ctx). this next_ctx may run in aio_thread.
Then the next program runs on the aio thread. remove_pool_file()
calls bdev->close(), then calles _aio_stop(), exec aio_thread.join(),
cause assert. Thread can't join itself. Fix it by adding close ctx
to m_work_queue, so close() can run in work queue thread.
At the same time, correct the order of wait_for_ops().
flush_dirty_entries(next_ctx) may call wake_up() and start_op().
so moving wait_for_ops() behind flush_dirty_entries(next_ctx) is more
appropriate.
Yin Congmin [Fri, 27 Aug 2021 15:41:49 +0000 (15:41 +0000)]
librbd/cache/pwl/ssd: move finish_op() to the end of callback function
finish_op() of ssd cache is not in the end of callback function in
append_op_log_entries(), and after finish_op(), some operation also
need to get m_lock. So, during shutdown, wait_for_ops() thinks all OPs
are over, and no thread will acquire the m_lock, In the subsequent
operation of shutdown, the m_lock is obtained, and _aio_stop() in
bdev->close() waits for all aio_writes() and aio_submit() to end
when the m_lock is held, but the callback function of aio_write() is
waiting for the m_lock, causing a deadlock. Move finish_op() to the
end to fix dead lock.
Jianpeng Ma [Tue, 7 Sep 2021 02:19:55 +0000 (10:19 +0800)]
librbd/cache/pwl/ssd: Remove unused parameter.
Met the following compiler warning message:
>[38/80] Building CXX object
src/librbd/CMakeFiles/librbd_plugin_pwl_cache.dir/cache/pwl/ssd/WriteLog.cc.o
>../src/librbd/cache/pwl/ssd/WriteLog.cc:37:25: warning: unused variable
'ops_appended_together' [-Wunused-const-variable]
>const unsigned long int ops_appended_together = MAX_WRITES_PER_SYNC_POINT;
Jianpeng Ma [Tue, 7 Sep 2021 02:00:53 +0000 (10:00 +0800)]
librbd/cache/pwl/ssd: Fix a race between get_cache_bl() and remove_cache_bl()
In fact, although in get_cache_bl it use lock to protect, it can't
protect function "list& operator= (const list& other)".
So we should use copy_cache_bl.
Fixes: https://tracker.ceph.com/issues/52400 Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit fe72b3953735329441397f257d5dd18f6819187d)
In append_op_log_entries(), new_first_free_entry is read after
append_ops() returns. This can result in accessing freed memory
because all I/Os may complete and append_ctx callback may run
by the time new_first_free_entry is read. Garbage value gets
written to m_first_free_entry and depending on the circumstances
it may allow AbstractWriteLog code to accept more dirty user data
than we have space for. Luckily we usually crash before then.
In append_ops(), new_first_free_entry is assigned to after aio_submit()
is called. This can result in accessing uninitialized or freed memory
because all I/Os may complete and append_ctx callback may run before the
assignment is executed. Garbage value gets written to first_free_entry
and we eventually crash, most likely in bufferlist manipulation code.
But worse, the corrupted first_free_entry makes it to media almost all
the time. The result is a corrupted cache -- dirty user data is lost.
When retiring, m_blocks_to_log_entries doesn't remove
corresponding write_entry (should be `*it` not `entry`)
that will be retired. It leads to read error. And
there should also consider discard entries.
Jianpeng Ma [Tue, 7 Sep 2021 02:19:55 +0000 (10:19 +0800)]
librbd/cache/pwl/ssd: Remove unused parameter.
Met the following compiler warning message:
>[38/80] Building CXX object
src/librbd/CMakeFiles/librbd_plugin_pwl_cache.dir/cache/pwl/ssd/WriteLog.cc.o
>../src/librbd/cache/pwl/ssd/WriteLog.cc:37:25: warning: unused variable
'ops_appended_together' [-Wunused-const-variable]
>const unsigned long int ops_appended_together = MAX_WRITES_PER_SYNC_POINT;
In append_op_log_entries(), new_first_free_entry is read after
append_ops() returns. This can result in accessing freed memory
because all I/Os may complete and append_ctx callback may run
by the time new_first_free_entry is read. Garbage value gets
written to m_first_free_entry and depending on the circumstances
it may allow AbstractWriteLog code to accept more dirty user data
than we have space for. Luckily we usually crash before then.
In SyncPointLogOperation::clear_earlier_sync_point(),
sync_point->log_entry->next_sync_point_entry was prematurely set to
nullptr in clear_earlier_sync_point(). It is in write op stage, but
next_sync_point_entry is used in writeback stage in
handle_flushed_sync_point().
handle_flushed_sync_point() may pass a nullptr
cause assert in m_work_queue.The solution is to move the statement
that set next_sync_point_entry to nullptr after it is used.
Ilya Dryomov [Sat, 14 Aug 2021 17:06:28 +0000 (19:06 +0200)]
librbd/cache/pwl: make pool size a multiple of 1M
In ssd mode, we need it to be a multiple of bdev block size.
Instead of munging it after opening the bdev in ssd/WriteLog.cc, let's
impose a common restriction and round rbd_persistent_cache_size down to
a 1M boundary.
Ilya Dryomov [Sat, 14 Aug 2021 17:06:28 +0000 (19:06 +0200)]
librbd/cache/pwl: make pool size a multiple of 1M
In ssd mode, we need it to be a multiple of bdev block size.
Instead of munging it after opening the bdev in ssd/WriteLog.cc, let's
impose a common restriction and round rbd_persistent_cache_size down to
a 1M boundary.
SSD cache mode uses m_bytes_allocated and m_bytes_allocated_cap to control
whether the allocation is successful. m_bytes_allocated may exceeding
m_bytes_allocated_cap due to the multi-thread under no lock. When adding
to m_bytes_allocated, check again to solve this issue.
On the other hand, m_bytes_allocated_cap should be less than the actual
ring buffer size. keep m_first_free_entry == m_first_valid_entry
only means the cache is empty.
Jianpeng Ma [Mon, 1 Nov 2021 01:25:52 +0000 (09:25 +0800)]
librbd/cache/pwl: fix reorder when flush cache-data to osd.
Consider the following workload:
writeA(0, 4096)
writeB(0, 512).
pwl can makre sure writeA persist to cache before writeB.
But when flush to osd, it use async-read to read data from cache and in
the callback function they issue write to osd.
So although we by order issue aio-read(4096), aio-read(512). But we
can't make sure the return order.
If aio-read(512) firstly return, the write order to next layer is
writeB(0, 512)
writeA(0, 4096).
This is wrong from the user point.
To avoid this occur, we should firstly read all data from cache. And
then send write by order.
Fiexs: https://tracker.ceph.com/issues/52511
Tested-by: Feng Hualong <hualong.feng@intel.com> Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit 1fc3be248097eb6087560c22374c1f924bfe0735)
librbd/cache/pwl: Fix dead lock issue when pwl initialization failed
when pwl initialization failed, 'AbstractWriteLog' will release itself
in callback, it hold guard lock and want to get lock to delete data,
which causes dead lock. This PR works by release image_cache outside
the callback function.
Jianpeng Ma [Tue, 31 Aug 2021 01:02:56 +0000 (09:02 +0800)]
librbd/cache/pwl: solve the problem of calc m_bytes_allocated when reload entries.
Currently, it will load existing entries after restart and cacl
m_bytes_allocated based on those entries. But currently there are
the following problems:
1: The allocated of write-same is not calculated for rwl & ssd cache.
2: for ssd cache, it not include the size of log-entry itself and don't
consider data alignment. This will cause less calculation and more
allocatation later. And will overwrite the data which don't flush to osd
and make data lost.
The calculation methods of ssd and rwl are different. So add new api
allocated_and_cached_data() to implement their own method.
For SSD cache, we dirtly use m_first_valid_entry & m_first_free_entry to
calc m_bytes_allocated.
trivial fix: new code from PR, nothing unrelated: https://www.diffchecker.com/S1eXatpM
Fixes:https://tracker.ceph.com/issues/52341 Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit a96ca93d69d5c1f302f3141082302d4699915397)
Jianpeng Ma [Fri, 20 Aug 2021 06:29:37 +0000 (14:29 +0800)]
librbd/cache/pwl/ssd: fix first_valid_entry calculation in retire_entries()
Consider one control_block which cotain multi encode(WriteLogCacheEntry):
Log1: WriteLogEntry
Log2: WriteLogEntry
Log3: Non-WriteLogEntry
For this case, currently calc method is: control_block_pos + sizeof(control_block).
But in fact, it should: control_block_pos + sizeof(control_block) +
data_length(Log1 + Log2).
Wrong first_valid_entry will persist to superblock and restart to read.
This cause read wrong position and when decode(WriteLogCacheEntry) it
will report bug.
Fixes: https://tracker.ceph.com/issues/52323 Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit 2d337fb122d147e32d027d1e7211cd4156a5b72b)
librbd/cache/pwl/ssd: solve competition between read and retire
SSD read is not like rwl's. SSD need aio read. Therefore,
we cannot guarantee that the data will not be retire
during the period from sending the read request to the SSD
and receiving the data to the memory, which may cause
the corresponding data on the SSD to be overwritten.
librbd/cache/pwl/rwl: fix buf_persist and add writeback_lat perf counters
initialize buf_persist_time, then change name buf_persist_time to
buf_persist_start_time, change flush to internal_flush. add
writeback_lat perf conters. update some print formats for perf.
librbd/cache/pwl: avoid stack overflow caused by nested shared_ptr destruction
Destruction of nested shared_ptr will cause stack overflow.
With the explicit assignment of nullptr, the deleted node
is completely disconnected from the current linked list
librbd/cache/pwl/ssd: fix use-after-free on C_BlockIORequest
In setup_schedule_append() function, its first expression
will cause the req to be deleted, and subsequent use of
the variable req becomes an illegal operation. And due to
delete, rep->m_image_ctx will be empty, so it lead to
segfault in AbstractWriteLog::get_context().
So pass the `req` into `schedule_append()` function.
In append_op_log_entries(), new_first_free_entry is read after
append_ops() returns. This can result in accessing freed memory
because all I/Os may complete and append_ctx callback may run
by the time new_first_free_entry is read. Garbage value gets
written to m_first_free_entry and depending on the circumstances
it may allow AbstractWriteLog code to accept more dirty user data
than we have space for. Luckily we usually crash before then.
Ilya Dryomov [Sat, 14 Aug 2021 17:06:28 +0000 (19:06 +0200)]
librbd/cache/pwl: make pool size a multiple of 1M
In ssd mode, we need it to be a multiple of bdev block size.
Instead of munging it after opening the bdev in ssd/WriteLog.cc, let's
impose a common restriction and round rbd_persistent_cache_size down to
a 1M boundary.
WriteLogCacheEntry gets appended to persist_log_entries before
write_data_pos is updated with the actual media offset. Because
push_back() makes a copy, the updated write_data_pos value never
makes it to media, making recovery impossible.
Ilya Dryomov [Thu, 13 May 2021 11:11:57 +0000 (13:11 +0200)]
librbd/cache/pwl/ssd: actually use first_{valid,free}_entry on recovery
first_valid_entry and first_free_entry pointers are read from media
but not actually used: both m_first_valid_entry and m_first_free_entry
get assigned 0 (or garbage). next_log_pos gets the same value as well
meaning that not only no recovery is attempted but the cache also gets
corrupted because DATA_RING_BUFFER_OFFSET is not applied.
Ilya Dryomov [Sat, 8 May 2021 08:24:37 +0000 (10:24 +0200)]
librbd/cache/pwl/ssd: don't count log entries
In ssd mode log entries are variable size. Attempting to count and
impose watermarks on the number of log entries is bogus because the
total number of entries it would take to fill the cache to capacity
is also variable and can't be precisely estimated.
had conflicts, but no new changes Fixes: https://tracker.ceph.com/issues/50669 Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit ea65553b4a9ee1349c6da8452d861afe579e99e9)
All parameters are integers and none of them are (in-)out, so don't
take them by reference. Additionally num_lanes, num_log_entries and
num_unpublished_reserves don't need to be 64-bit as their respective
fields in AbstractWriteLog are 32-bit.
Ilya Dryomov [Wed, 12 May 2021 10:19:07 +0000 (12:19 +0200)]
librbd/cache/pwl: rename m_log_pool_config_size to m_log_pool_size
trivial fix: no new changes: https://www.diffchecker.com/9btXJhCC Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 829ef952d2e408fe3676b38e7ecd26cbb04571a5)
* Do not allow a pseudo that is already in use by another export.
* Create mode form: prefill dropdown selectors if options > 0.
* Edit mode form: do not reset the field values that depend on other values that are being edited (unlike Create mode).
* Fix broken link: cluster service.
* Fix error message style for non-existent cephfs path.
* nfs-service.ts: lsDir: thow error if volume is not provided.
* File renaming: nfsganesha.py => nfs.py; test_ganesha.py => test_nfs.py
Fixes: https://tracker.ceph.com/issues/53083 Signed-off-by: Alfonso Martínez <almartin@redhat.com>
(cherry picked from commit d817a24e345516229bc637e2c675d12e6bfcc456)
Conflicts:
src/pybind/mgr/dashboard/tests/test_ganesha.py
- Delete file: in fact it's renamed as test_nfs.py
* Controller tests: cherrypy config: authentication disabled by default; ability to pass custom config (e.g. enable authentication).
* Auth controller: add tests; test that unauthorized request fails when authentication is enabled.
* DocsTest: clear ENDPOINT_MAP so the test_gen_tags test becomes deterministic.
* pylint: disable=no-name-in-module: fix imports in tests.
Fixes: https://tracker.ceph.com/issues/53083 Signed-off-by: Alfonso Martínez <almartin@redhat.com>
(cherry picked from commit a6aeded5141ec3a959a86aa4002ccf5ac8d0a523)
Casey Bodley [Thu, 4 Nov 2021 16:12:04 +0000 (12:12 -0400)]
qa/rgw: pacific branch targets ceph-pacific branch of java_s3tests
this commit is applied directly to the pacific branch instead of a
backport from master, because it targets the ceph-pacific branch instead
of the ceph-master branch on master
The current check only allows to request an OSD id that exists but
marked as 'destroyed'.
With this small fix, we can now use `--osd-id` with an id that doesn't
exist at all.
Avan Thakkar [Wed, 20 Oct 2021 14:07:26 +0000 (19:37 +0530)]
mgr/dashboard: gather facts should only be fetched when orch backend is cephadm
Fixes: https://tracker.ceph.com/issues/52981 Signed-off-by: Avan Thakkar <athakkar@redhat.com>
Gather facts in UI should only be fetched if there is orch available and
get_facts feature is available for that orch backend.
Alfonso Martínez [Thu, 26 Aug 2021 10:05:54 +0000 (12:05 +0200)]
mgr/dashboard: NFS exports: API + UI: integration with mgr/nfs; cleanups
mgr/dashboard: move NFS_GANESHA_SUPPORTED_FSALS to mgr_module.py
Importing from nfs module throws AttributeError because as a side effect the dashboard module is impersonating the nfs module.
https://gist.github.com/varshar16/61ac26426bbe5f5f562ebb14bcd0f548
mgr/dashboard: 'Create NFS export' form: list clusters from nfs module
mgr/dashboard: frontend+backend cleanups for NFS export
Removed all code and references related to daemons. UI cleanup and adopted unit-testing for
nfs-epxort create form for CEPHFS backend. Cleanup for export list/get/create/set/delete endpoints.
mgr/dashboard: rm set-ganesha ref + update docs
Remove existing set-ganesha-clusters-rados-pool-namespace references as
they are no longer required. Moreover, nfs doc in dashboard doc is
updated accordingly to the current nfs status.
mgr/dashboard: add nfs-export e2e test coverage
mgr/dashboard: 'Create NFS export' form: remove RGW user id field.
- Improve bucket typeahead behavior.
- Increase version for bucket list endpoint.
- Some refactoring.
mgr/dashboard: 'Create NFS export' form: allow RGW backend only when default realm is selected.
When RGW multisite is configured, the NFS module can only handle buckets in the default realm.
mgr/dashboard: 'Create service' form: fix NFS service creation.
After https://github.com/ceph/ceph/pull/42073, NFS pool and namespace are not customizable.
- Allow only existing buckets.
- Refactoring:
- Moved bucket validator from bucket form to cd-validators.ts
- Split bucket validator into 2: bucket name validator and bucket existence (that checks either existence or non-existence).
None of these values are required for listing nfs cluster by mgr/nfs module.
Instead directly list available cluster names
mgr/dashboard: add comment to remove listing of daemons
As the configs are per cluster. There is no need to list daemons per cluster.
mgr/dashboard/controllers/nfsganesha: Add comments to update/remove status endpoint
This endpoint can be updated in suggested way or even removed. As it was
initially[1] introduced to check if dashboard pool and namespace configuration was
set.
Varsha Rao [Mon, 16 Aug 2021 05:39:18 +0000 (11:09 +0530)]
mgr/nfs: don't log fsal keys
FSAL keys are written to export objects. Don't log it, instead log the initial
export before generating keys and user config object mostly will not contain
export block. It's okay to log user config object content.
Sage Weil [Wed, 28 Jul 2021 14:29:47 +0000 (10:29 -0400)]
mgr/dashboard: consume mgr/nfs via mgr.remote()
Stop using the dashboard version of the Ganesha config classes; consume
mgr/nfs instead via remote().
mgr/nfs/export: return Export from _apply_export
Future callers will want this.
mgr/nfs: new module methods for dashboard consumption
Add some new methods that are easy for the dashboard API to consume. These
are very similar to the CLI methods but do now have the @CLICommand and
related decorators, and have slightly different interfaces (e.g., returning
the created/modified Export dict).
mgr/dashboard: remove old ganesha code (and tests)