iovec have unsigned length (size_t) and before this patch the
total length was computed by adding iovec's length to a signed
length variable (ssize_t). While the code checked if the resulting
length was negative on overflow, the case where length is positive
after overflow was not checked. This patch fixes the overflow check
by changing length to unsigned size_t.
Additionally, this patch fixes the case where some iovecs have been
added to the bufferlist and the aio completion has been blocked, but
adding an additional iovec fails because of overflow. This leads to
the UserBufferDeleter trying to unblock the completion on destruction
of the bufferlist but asserting because the completion was never
armed. We avoid this by first computing the total length and checking
for overflows and iovcnt before adding them to the bufferlist.
Ilya Dryomov [Sat, 19 Mar 2022 13:04:52 +0000 (14:04 +0100)]
qa/workunits/rbd/cli_generic.sh: relax trash purge schedule status assert
Commit 08df6e0fd006 ("qa/workunits/rbd: expand LevelSpec parsing
coverage") didn't account for images with a separate data pool. This
was missed because of small-cache-pool.yaml breakage.
Addition of a SCRUB_DURATION field that shows how long the scrub/deep-scrub of a pg took.
This field will be displayed in the output of the "ceph pg dump --format=json" and "ceph pg ls-by-pool --format=json" commands.
Lucian Petrut [Mon, 7 Mar 2022 08:12:23 +0000 (08:12 +0000)]
include: Define dlfcn.h on Windows
"dlfcn.h" is not available on Windows, so Ceph provides a drop-in
replacement through "dlfcn_compat.h".
The issue is that directly importing "dlfcn.h" fails at the moment,
for which reason we'll simply add a file called "dlfcn.h" that
includes "dlfcn_compat.h".
Kamoltat [Mon, 28 Feb 2022 21:40:43 +0000 (21:40 +0000)]
upgrade/pacific-x/parallel: Added mds.a and mds.b
Added mds daemons so that it can create
cephFS pools and set options using
`do_set_pool()` in FSCommand.cc. Such that
we can cover corner cases like that in
ceph-fuse: perform cleanup if test_dentry_handling failed
If remount failed due to some reason then ceph_abort() is
getting called which causes child process termination
without cleanup.
To fix this issue, ceph_abort() call moved after
performing cleanup.
quiesce all activities and destage allocations to disk before killing the OSD
1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
2) skip service.prepare_to_stop() which can take as much as 10 seconds
3) skip debug options in fast-shutdown
4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
5) clear op_shardedwq queues, this is safe since we didn't started processing them
6) stop timer
7) drain osd_op_tp (no new items will be added)
8) now we can safely call umount which will close_db/bluefs and will destage allocation to disk
9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
10) increase debug level on fast-shutdown
11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
12) disable fsck-on-umount when running fast-shutdown
13) add an option to increase debug level at fast-shutdown umount()
14) set a time limit to fast-shutdown
15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
16) Fix error message for qfsck (error was caused by PR https://github.com/ceph/ceph/pull/44563)
17) make shutdown-timeout configurable
Fixes: https://tracker.ceph.com/issues/53266 Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
(cherry picked from commit 9b2a64a5f6ea743b2a4f4c2dbd703248d88b2a96)
Dan van der Ster [Thu, 24 Feb 2022 08:42:00 +0000 (09:42 +0100)]
osd: require osd_pg_max_concurrent_snap_trims > 0
If osd_pg_max_concurrent_snap_trims is zero, we mistakenly clear
the snaptrim queue. Require it to be > 0.
Fixes: https://tracker.ceph.com/issues/54396 Signed-off-by: Dan van der Ster <daniel.vanderster@cern.ch>
(cherry picked from commit 29545b617b3b0324f9b0b20e032e3e38557115eb)
Ilya Dryomov [Tue, 8 Mar 2022 12:56:15 +0000 (13:56 +0100)]
test/librbd/test_notify.py: effect post object map rebuild assert
Instead of just optionally skipping update_features test, commit 9c0b239d70cd ("qa/upgrade: conditionally disable update_features
tests") moved it after rebuild_object_map test. This isn't right
because update_features test invalidates the object map as a side
effect and rebuild_object_map test is what makes it valid again:
Invoke "rbd mirror snapshot schedule ls -R" and "rbd mirror snapshot
schedule status" commands on all levels, consistently. In particular,
make sure that an image level schedule is listed for a recursive query
at the pool level both before and after the schedule kicks in:
$ rbd create --size 1G --mirror-image-mode snapshot -p foo bar
$ rbd mirror snapshot schedule add -p foo --image bar 1m
$ rbd mirror snapshot schedule ls -p foo -R
POOL NAMESPACE IMAGE SCHEDULE
foo bar every 1m
<wait for schedule to become visible in status>
$ rbd mirror snapshot schedule ls -p foo -R
POOL NAMESPACE IMAGE SCHEDULE
foo bar every 1m
Also, make sure that pool and image level status queries work:
$ rbd mirror snapshot schedule status -p foo
SCHEDULE TIME IMAGE
2022-03-04 07:14:00 foo/bar
$ rbd mirror snapshot schedule status -p foo --image bar
SCHEDULE TIME IMAGE
2022-03-04 07:14:00 foo/bar
Both of these issues are fixed by the previous commit.
Sunny Kumar [Thu, 24 Feb 2022 16:07:39 +0000 (16:07 +0000)]
mgr/rbd_support: cast pool_id from int to str when collecting LevelSpec
While collecting LevelSpec using class method from_name make sure to cast
pool_id from int to string. This is necessary to match the internal
representation of LevelSpec where pool_id is maintained as str.
Kefu Chai [Sat, 5 Mar 2022 04:49:57 +0000 (12:49 +0800)]
cmake: pass RTE_DEVEL_BUILD=n when building dpdk
ceph is still using the Makefile based building system for building
DPDK. and DPDK enables -Werror if RTE_DEVEL_BUILD is 'y' which is
enabled by default when the dpdk is built from a git repo.
but newer GCC is more picky than the older versions, to prevent
the possible FTBFS when we switch to newer GCC for building old
branches whose dpdk submodule might be include the changes addressing
those warnings. let's just disable this option.
the only effect of this option is to add -Werror to CFLAGS. but
the building warnings from DPDK is not our focus when developing
Ceph in the most cases. so it should be fine.
afd8be7eac5e996c3bd07656601a4534053e2516 broke it.
It has dropped`block_wal` and `block_db` from
`ceph_volume.devices.raw.activate.activate_bluestore` but
`activate.main.Activate.main` still passes those arguments when
calling `RAWActivate([]).activate()`
A minimal change extracted from PR#44050, to facilitate
backporting.
The multitudes of bogus events generated fill up the logs.
Signed-off-by: Ronen Friedman <rfriedma@redhat.com>
(cherry picked from commit e1b5347b81d17c8a5a1f6e1d4d76d18977ec2b0c)
Conflicts: the logic changes were already part of Quincy. Left is
a removal of an unneeded log message.
Kotresh HR [Tue, 1 Feb 2022 11:06:34 +0000 (16:36 +0530)]
mgr/volumes: Fix subvolumegroup ls
The subvolumegroup ls listed '_deleting' directory which is
internal to 'mgr/volumes' and should not be listed as a
subvolumegroup. This patch fixes the same by filtering it.
Kotresh HR [Thu, 10 Feb 2022 05:34:41 +0000 (11:04 +0530)]
mgr/volumes: Fix clone uid/gid mismatch
This is the regression caused by commit 18b85c53a.
The 'set_attrs' function sets the uid/gid of the
group to the subvolume if uid/gid is not passed.
The attrs of the clone should match the source
snapshot. Hence, don't use the 'set_attrs'
function to set only the quota attrs for the
clone.
Yaarit Hatuka [Tue, 22 Feb 2022 19:22:09 +0000 (19:22 +0000)]
mgr/devicehealth: skip null pages when extracting wear level
Some devices have null pages in their ata_device_statistics struct; skip
those pages in order to avoid an AttributeError when extracting device's
wear level.
osd: Write non-zero data as part of osd benchmark test.
An optimization (see PR: https://github.com/ceph/ceph/pull/43337) was made
in BlueStore to avoid writing bufferlists made up of zeros. The osd
benchmark used zero filled bufferlists and this resulted in inflated osd
benchmark results.
This issue is fixed by using bufferlists filled with non-zero values.
Mykola Golub [Fri, 18 Feb 2022 10:42:23 +0000 (10:42 +0000)]
rbd-mirror: make mirror properly detect pool replayer needs restart
When a PoolReplayer detects remote pool metadata change it
sets "stopping" flag expecting the Mirror will restart it.
Although setting "stopping" flag makes the PoolReplayer::run
thread to terminate, the thread's is_started function will still
return true until join is called (and reset the thread id).
This made impossible for the Mirror to detect (by calling
PoolReplayer::is_running) that the PoolReplayer needed restart.
Ilya Dryomov [Sun, 20 Feb 2022 16:33:08 +0000 (17:33 +0100)]
rbd-mirror: synchronize with in-flight stop in ImageReplayer::stop()
Complete on_finish right away only if the replayer is stopped (meaning
that it is legible to be restarted immediately, possibly from on_finish
itself). This is the behaviour pretty much anyone would assume and
also what ImageReplayer::restart() relies on.
Ilya Dryomov [Sun, 20 Feb 2022 12:11:02 +0000 (13:11 +0100)]
rbd-mirror: manual stop should take precedence over regular stop
Somewhat similar to commit 0a3794e56256 ("rbd-mirror: make stop
properly cancel restart"), make it so that a) if a manual stop is
joined to regular stop, the stop becomes manual and b) if a regular
stop is joined to a manual stop, the stop stays manual.
Ilya Dryomov [Sat, 19 Feb 2022 15:43:04 +0000 (16:43 +0100)]
rbd-mirror: straighten ImageReplayer::stop() a bit
- don't default on_finish parameter
- m_restart_requested is set in ImageReplayer::restart() which is the
only restart=true call site, so setting m_restart_requested here is
redundant
- is_stopped_() can't be true in is_running_() branch
- on_finish->complete(0) in the end is unreachable
Casey Bodley [Tue, 15 Feb 2022 23:27:10 +0000 (18:27 -0500)]
common: replace BitVector::NoInitAllocator with wrapper struct
in c++20, the deprecated `struct std::allocator<T>::rebind` template was
removed, so `BitVector` no longer compiles. without a `rebind` to
inherit, `std::allocator_traits<NoInitAllocator>::rebind_alloc<U>` was
looking for `NoInitAllocator<U>`, but it isn't a template class
further investigation found that in c++17, `vector<__u32, NoInitAllocator>`
was rebinding this `NoInitAllocator` to `std::allocator<__u32>` and
preventing the no-init optimization from taking effect
instead of messing with the allocator to avoid zero-initialization, wrap
each __u32 in a struct whose constructor does not initialize the value
Volker Theile [Wed, 9 Feb 2022 08:37:48 +0000 (09:37 +0100)]
mgr/dashboard: "Please expand your cluster first" shouldn't be shown if cluster is already meaningfully running
This PR will assume that a cluster is already up and fully running. If this should not be the expected behaviour, deployment tools have to set 'INSTALLED' explicitly. Without this assumption it might happen that upgraded and fully running clusters, e.g. Octopus -> Pacific, will show the 'Expand Cluster' on first log in.
cephadm will take care that the bootstrap phase will write the necessary key to show the 'Expand cluster' page.
IvanGuan [Tue, 18 Jan 2022 13:01:59 +0000 (21:01 +0800)]
mds: kill session when mds do ms_handle_remote_reset
if the mds decide to reuse the old connection it will
do reset_session and should also kill the session
which are open state in MDSDaemon::ms_handle_remote_reset
to prevent the situation client session is stuck in
opening state and never has chance to becaome open.
the root cause is client missed the request_open
reply but the mds session has become open already.
so we should kill the session in mds side and let
mds recreate the session when received the connect
request from client.
Fixes: http://tracker.ceph.com/issues/53911 Signed-off-by: YunfeiGuan <yunfeiguan@xtaotech.com>
(cherry picked from commit 3651deb4e0b0c102adcaddce79ee4e053f033418)
Laura Flores [Fri, 11 Feb 2022 19:37:26 +0000 (19:37 +0000)]
mgr/telemetry: handle empty device report when "send" is triggered
On certain environments, such as the "ceph-dev-docker" environment
(https://github.com/ricardoasmarques/ceph-dev-docker), the mgr
module is unable to fetch device metrics. As a result, the device
report generated by "gather_device_report()" returns an empty dict.
This causes an AssertionError when the "send" function is triggered
(i.e. by running `ceph telemetry status` or `ceph telemetry send`),
and the module crashes.
The fix in this commit checks that the generated device report
contains metrics before trying to send it. If the device report
does not contain metrics (it returns an empty dict), the module
will log an appropriate message in the mgr log and not send the
device report.
If this scenario happens when running the `ceph telemetry send` command,
the user will additionally see this message:
```
Ceph report sent to https://telemetry.ceph.com/report
Unable to send device report: channel is on, but generated report was empty.
```
I also added a few more debug messages in gather_device_report() to make
future debugging easier.
Fixes: https://tracker.ceph.com/issues/54250 Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit 54e0e58f1b3f431281df0e2dd2b258f85cbade19)
Miaomiao Liu [Fri, 14 Jan 2022 05:40:11 +0000 (13:40 +0800)]
cmake: replace BuildQatDrv.cmake with FindQatDrv.cmake
because QAT driver with version v1.7.l.4.14.0 or higher cannot be dowmloaded
directly by URL, FindQatDrv.cmake can find the locally installed QAT package and libraries
Signed-off-by: Miaomiao Liu <miaomiao.liu@intel.com> Signed-off-by: Hualong Feng <hualong.feng@intel.com>
(cherry picked from commit 082dcf0a58b47946cc1375a7e8336b261d499f64)
Adam Kupczyk [Wed, 9 Feb 2022 15:19:56 +0000 (16:19 +0100)]
os/bluestore/bluefs: Fix improper vselector tracking in _flush_special()
Moves vselector size tracking outside _flush_special().
Function _compact_log_async...() updated sizes twice.
Problem could not be solved by making second modification of size just update,
as it will possibly disrupt vselector consistency check (_vselector_check()).
Feature to track vselector consistency relies on the fact that either log.lock or nodes.lock
are taken when the check is performed. Which is not true for _compact_log_async...().
Now _flush_special does not update vselector sizes by itself but leaves the update to
the caller.
Fixes: https://tracker.ceph.com/issues/54248 Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 4bc0f61d23299724fad2d8e6f2858734f1db6e5a)