John Mulligan [Fri, 14 Feb 2025 19:51:03 +0000 (14:51 -0500)]
doc: document the new container build tool and link to it in README
Add a new markdown file in the root of the tree, ContainerBuild.md, that
can serve as a basic introduction to the new container build tools
recently merged to ceph.
Add a small 'breadcrumb' section to the project README.md to help find
this new document.
John Mulligan [Thu, 20 Feb 2025 00:17:30 +0000 (19:17 -0500)]
script/build-with-container: add support for overlay dir
The source dir (aka homedir, default /ceph) is mounted in the container
read-write. This is needed as the various ceph build scripts expect to
write things into the tree - often this is in the build directory - but
not always. This can lead to small messes and/or situations that are
confusing to debug, especially if one is jumping between distros often.
Add an option to use an overlay volume for the homedir - by default we
enable a persistent overlay with a supplied "upper dir" where files that
were written will appear. One can also enable a temporary overlay that
forgets the writes when the container exits - maybe useful when doing
experiments in 'interactive' mode.
To use this option run the command with the `--overlay-dir=<dir>` option.
For example: `./src/script/build-with-container.py -b build.inner
--overlay-dir build.ovr`. This will create a directory
`build.ovr/content` automatically and all new files will appear there.
For example the build directory will appear at
`build.ovr/content/build.inner`.
To use the temporary overlay use a `-` as the directory name. For
example: `./src/script/build-with-container.py -b build.inner
--overlay-dir -`
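A minimal sketch of how the two overlay modes could map onto a podman volume argument. The helper name `overlay_volume_arg` and the `work` subdirectory are assumptions made for illustration, not the script's actual code; the `:O`, `upperdir=` and `workdir=` volume options are podman's overlay mount syntax.

```python
from pathlib import Path


def overlay_volume_arg(src: str, homedir: str, overlay_dir: str) -> str:
    """Return a podman --volume value that overlays `src` at `homedir`.

    Hypothetical helper: "-" selects a temporary overlay whose writes are
    discarded, any other value names a persistent overlay directory.
    """
    if overlay_dir == "-":
        # Temporary overlay: writes are forgotten when the container exits.
        return f"{src}:{homedir}:O"
    base = Path(overlay_dir)
    upper = base / "content"  # new and modified files appear here
    work = base / "work"      # assumed name for the overlayfs work dir
    return f"{src}:{homedir}:O,upperdir={upper},workdir={work}"


print(overlay_volume_arg("/src/ceph", "/ceph", "/src/ceph/build.ovr"))
print(overlay_volume_arg("/src/ceph", "/ceph", "-"))
```

Absolute paths are used in the example because a real invocation would likely need them.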
John Mulligan [Thu, 20 Feb 2025 14:50:49 +0000 (09:50 -0500)]
script/build-with-container: skip dnf cache dir volume mounts on docker
When using docker the --volume option is not available during build
(docker [buildx] build), unlike podman. Since passing these volumes must
be conditional on them being set up I see no way to handle this short of
just disabling the option on docker. Log the fact that it's being
skipped - the only other issue is that we pointlessly set up some dirs
and the build may be a bit slower.
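A rough sketch of the conditional described above, assuming a hypothetical helper `cache_volume_args` and a hypothetical mapping of host cache directories to in-container paths. It is not the script's real code, only the shape of the check.

```python
import logging
from typing import Dict, List

log = logging.getLogger("build-with-container")


def cache_volume_args(engine: str, cache_dirs: Dict[str, str]) -> List[str]:
    """Return --volume args for the image build, or none on docker.

    Hypothetical helper: `cache_dirs` maps host dirs to container paths.
    """
    if engine == "docker":
        # docker build has no --volume option, so the mounts are skipped
        # and the fact is logged.
        log.warning("docker build lacks --volume: skipping dnf cache dirs")
        return []
    return [f"--volume={host}:{ctr}" for host, ctr in cache_dirs.items()]


print(cache_volume_args("podman", {"/tmp/dnf-cache": "/var/cache/dnf"}))
print(cache_volume_args("docker", {"/tmp/dnf-cache": "/var/cache/dnf"}))
```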
John Mulligan [Wed, 19 Feb 2025 18:20:36 +0000 (13:20 -0500)]
script/build-with-container: remove default --volume arg from ctr build
On the original github pr #59841 user fayak kindly informed us that the
--volume option was not supported by docker build. Since this section
was a leftover from a previous way of constructing the builder image and
was no longer needed we simply removed it.
John Mulligan [Wed, 19 Feb 2025 18:20:01 +0000 (13:20 -0500)]
script/build-with-container.py: build builder image with --pull=always
Construct the builder image using the --pull=always flag to initiate a
pull of the base image (centos, ubuntu, etc) in order to avoid using a
stale base image. Since the script automatically (by default) avoids
building if a matching tag is in local container storage it is handy to
use a fresh base when it *is* time to build something. Otherwise, you
end up in a situation like I sometimes do - using a months old base
unintentionally.
John Mulligan [Fri, 14 Feb 2025 19:50:42 +0000 (14:50 -0500)]
script/build-with-container: add a common packages target
Add a `packages` target to build-with-container.py that requests a build
of packages, whatever package type is native to the distro selected.
For example `./src/script/build-with-container.py -d ubuntu22.04 -e
packages` will automatically select a deb packages build where
`./src/script/build-with-container.py -d centos9 -e packages` will
trigger rpm packages to be built. The underlying package-type specific
targets remain unchanged.
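As an illustration of the dispatch, a small sketch that maps the generic target to a distro-native one. The function name and the returned target names ("debs", "rpm") are placeholders chosen for this example and may not match the names the script uses.

```python
def native_package_target(distro: str) -> str:
    """Map the generic `packages` target to a distro-native target.

    The distro prefixes and target names here are illustrative only.
    """
    if distro.startswith("ubuntu"):
        return "debs"
    if distro.startswith("centos"):
        return "rpm"
    raise ValueError(f"no native package type known for {distro!r}")


print(native_package_target("ubuntu22.04"))
print(native_package_target("centos9"))
```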
John Mulligan [Fri, 14 Feb 2025 16:44:35 +0000 (11:44 -0500)]
script/build-with-container: support custom tag suffixes
Previously, one could use the `--tag` option to completely override the
container tag generated by the script. However, there are cases where
one may want to add information to the tag rather than override it.
Allow the tag value to start with a plus (+) character that indicates
that the remainder of the string is to be suffixed to the generated tag.
Add a command line option --base-branch that allows the user to supply a
custom base branch name. git doesn't make determining this easy so we
always assume a base branch of 'main' by default - but this option lets
one change that.
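The override-versus-suffix rule can be sketched as follows. The helper name `resolve_tag`, the example generated tag, and the `.` separator between the generated tag and the suffix are assumptions for illustration; the script may join them differently.

```python
from typing import Optional


def resolve_tag(generated: str, tag_opt: Optional[str]) -> str:
    """Apply a --tag value to the generated tag.

    A leading "+" means "append the rest", anything else replaces the
    generated tag outright. The "." separator is an assumption.
    """
    if not tag_opt:
        return generated
    if tag_opt.startswith("+"):
        return f"{generated}.{tag_opt[1:]}"
    return tag_opt


print(resolve_tag("main.centos9", None))
print(resolve_tag("main.centos9", "+debug"))
print(resolve_tag("main.centos9", "mytag"))
```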
John Mulligan [Fri, 14 Feb 2025 16:24:29 +0000 (11:24 -0500)]
src/script: rename CEPH_BRANCH to CEPH_BASE_BRANCH for build container
Previously, we were passing build argument of CEPH_BRANCH, but that was
a bit misleading as we expect the current branch to vary a bit (as users
will be using branches to develop and test the code). What we actually
care about is the base branch ('main', 'squid', etc) as that is fed into
our bootstrap script and we want the option to make simple variations based
on the name of said base branch.
Rename CEPH_BRANCH to CEPH_BASE_BRANCH for clarity.
Add a new --current-branch argument that lets the user supply a name for
the current branch. This allows the automatic tag generation to avoid
calling git - something useful if the tree is not using a git checkout
(like a tarball). It also allows you to pull a temporary branch in git
but ignore it and act like the temporary branch is the base branch.
John Mulligan [Tue, 11 Feb 2025 23:36:13 +0000 (18:36 -0500)]
script/build-with-container: add more distro aliases
Add a system to define distro name aliases and use that to define some
additional aliases, primarily to match ubuntu codenames rather than
version numbers. Requested by Zack.
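One simple way to picture an alias system is a lookup table consulted before the distro name is used. The table below is illustrative and is not the script's actual alias list; jammy and noble are the Ubuntu codenames for 22.04 and 24.04.

```python
# Illustrative alias table: codename -> canonical distro name.
DISTRO_ALIASES = {
    "jammy": "ubuntu22.04",
    "noble": "ubuntu24.04",
}


def canonical_distro(name: str) -> str:
    """Resolve an alias to its canonical name; pass other names through."""
    return DISTRO_ALIASES.get(name, name)


print(canonical_distro("jammy"))
print(canonical_distro("centos9"))
```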
Ilya Dryomov [Mon, 3 Mar 2025 16:59:35 +0000 (17:59 +0100)]
test/pybind/rbd: fix read offset in write zeroes tests
Random data is written and write zeroes is invoked on 0~256, but the
read is done on 256~256. This means that if write zeroes malfunctions
the test wouldn't catch it (especially in the thick provision case).
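To see why the read offset matters, here is a self-contained simulation using a bytearray as a stand-in for the image. It does not use the rbd bindings, and the fixed byte pattern stands in for the random data; all names are made up for the example.

```python
PATTERN = bytes([0xA5]) * 512  # stand-in for the random data written first


def write_zeroes(buf: bytearray, offset: int, length: int) -> None:
    """A correct write zeroes: clear `length` bytes starting at `offset`."""
    buf[offset:offset + length] = bytes(length)


def broken_write_zeroes(buf: bytearray, offset: int, length: int) -> None:
    """A malfunctioning write zeroes that silently does nothing."""


def read_after_zeroing(zero_fn, read_offset: int) -> bytes:
    """Fill an image, zero 0~256 with `zero_fn`, then read 256 bytes."""
    image = bytearray(PATTERN)
    zero_fn(image, 0, 256)
    return bytes(image[read_offset:read_offset + 256])


# Reading 256~256 returns the same bytes whether or not zeroing worked.
same_at_256 = (read_after_zeroing(write_zeroes, 256)
               == read_after_zeroing(broken_write_zeroes, 256))
# Reading 0~256 tells the working and broken versions apart.
differ_at_0 = (read_after_zeroing(write_zeroes, 0)
               != read_after_zeroing(broken_write_zeroes, 0))
print(same_at_256, differ_at_0)
```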
VinayBhaskar-V [Tue, 26 Nov 2024 11:18:51 +0000 (16:48 +0530)]
librbd: add rbd_diff_iterate3() API to take source snapshot by ID
Allow a diff to start from a non-user snapshot. This would be used by
the "rbd du" command, and in other places, to account for non-user
snapshots, which are currently just skipped, potentially resulting in
underreported space usage.
Ilya Dryomov [Sun, 2 Mar 2025 08:24:52 +0000 (09:24 +0100)]
librbd: fix a deadlock on image_lock caused by Mirror::image_disable()
With Mirror::image_disable() taking image_lock for write and calling
list_children() under it, the following deadlock is possible:
1. Mirror::image_disable() takes image_lock for write and calls
list_children()
2. AbstractWriteLog::periodic_stats() timer fires (it runs every
5 seconds) and ImageCacheState::write_image_cache_state() is called
under a global timer_lock
3. ImageCacheState::write_image_cache_state() successfully takes
owner_lock and blocks attempting to take image_lock for read because
it's already held for write by Mirror::image_disable()
4. list_children() blocks inside of a call to ImageState::close() on
a descendant image
5. The descendant image close can't proceed because TokenBucketThrottle
requires a global timer_lock to complete QosImageDispatch shutdown
6. safe_timer thread which is holding timer_lock can't proceed because
ImageCacheState::write_image_cache_state() is effectively blocked on
the descendant image close through Mirror::image_disable()
Until commit 281a64acf920 ("librbd: remove snapshot mirror image-meta
when disabling"), Mirror::image_disable() was taking image_lock only for
read meaning that this deadlock wasn't possible. The only other change
that commit 281a64acf920 made to the code block protected by image_lock
was using child_mirror_image_internal for cls_client::mirror_image_get()
call on descendant images instead of mirror_image_internal to preserve
the value of mirror_image_internal for later. Both are local variables
that have nothing to do with image_lock, so I'm going back and making
Mirror::image_disable() take image_lock only for read again.
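The six steps above boil down to a cycle in a wait-for graph. The sketch below models only that cycle in Python; the node labels are shorthand for the parties named in the steps, and the helper is illustrative rather than anything in librbd.

```python
# Each entry reads: "key cannot proceed until value proceeds".
WAITS_FOR = {
    # Holds image_lock for write, blocked in list_children() (steps 1, 4).
    "Mirror::image_disable()": "descendant image close",
    # Needs timer_lock to finish QosImageDispatch shutdown (step 5).
    "descendant image close": "safe_timer thread",
    # Holds timer_lock, blocked taking image_lock for read (steps 3, 6).
    "safe_timer thread": "Mirror::image_disable()",
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from `start`; return the cycle or []."""
    seen = []
    node = start
    while node in waits_for:
        if node in seen:
            return seen[seen.index(node):]
        seen.append(node)
        node = waits_for[node]
    return []


print(find_cycle(WAITS_FOR, "Mirror::image_disable()"))
```

With image_lock taken only for read, the safe_timer thread no longer waits on Mirror::image_disable(), so the last edge disappears and the walk terminates without a cycle.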
Conflicts:
src/pybind/mgr/dashboard/frontend/src/app/ceph/block/nvmeof-namespaces-form/nvmeof-namespaces-form.component.ts
src/pybind/mgr/dashboard/frontend/src/app/shared/api/nvmeof.service.spec.ts
src/pybind/mgr/dashboard/frontend/src/app/shared/api/nvmeof.service.ts
- restore it to the request type where groups is not present
workunit/dencoder: fix corpus test for backward and forward compatibility
- changed the check for non-deterministic, return code 1 is also legit
- unneeded check for is_dir, if it exists
- limit the number of threads to prevent error
Fixes: https://tracker.ceph.com/issues/67263
Signed-off-by: NitzanMordhai <nmordech@redhat.com>
(cherry picked from commit 30921272ddee5e7c8aaf4bdb8d69645ce92ba379)
Dan Mick [Thu, 27 Feb 2025 00:16:26 +0000 (16:16 -0800)]
container/build.sh: remove local container images
Make the removal optional, for those who want to run build.sh locally
and use the images. The default is to remove them, which suits Jenkins
builders that build, push, and rmi.
Ilya Dryomov [Thu, 20 Feb 2025 15:38:41 +0000 (16:38 +0100)]
qa/workunits/rbd: add a test for force promote with a user snapshot
Add a reproducer for the crash on a bad variant access which was fixed
in commit 7d75161051da ("librbd: fix a crash in get_rollback_snap_id").
The reproducer deliberately works around many other issues with force
promote in snapshot-based mirroring: stopping rbd-mirror daemon
shouldn't be necessary (let alone with SIGKILL), get_rollback_snap_id()
and its caller can_create_primary_snapshot() are flawed and can pick
the wrong snapshot to roll back to or skip rollback when it's actually
required, the user snapshot in this scenario should be removed as part
of force promoting because it's incomplete and won't be usable after
the image is promoted, etc.
Zac Dover [Mon, 3 Feb 2025 13:37:34 +0000 (23:37 +1000)]
doc/rados: improve pg_num/pgp_num info
Improve the guidance around setting pg_num, and clear up confusion
around whether pgp_num should be set manually or, indeed, if it even can
be set manually.
This PR was raised in response to Mark Schouten's email here: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/CBDJTLTTIEZVG7GVZBX37UAWGYNSSMPD/
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit c43e7337212fe38e8db63d00345fa9858b3cb10a)
N Balachandran [Sat, 15 Feb 2025 13:26:31 +0000 (18:56 +0530)]
rbd-mirror: fix possible recursive lock of ImageReplayer::m_lock
If periodic status update (LambdaContext which is queued from
handle_update_mirror_image_replay_status()) races with shutdown and
ends up being the last in-flight operation that shutdown was pending
on, we attempt to recursively acquire m_lock in shut_down() because
m_in_flight_op_tracker.finish_op() is called with m_lock (and also
m_threads->timer_lock) held. These locks are needed only for the call
to schedule_update_mirror_image_replay_status() and should be unlocked
immediately.
Fixes: https://tracker.ceph.com/issues/69978
Co-authored-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: N Balachandran <nithya.balachandran@ibm.com>
(cherry picked from commit c60514087bc29540d3babd7855c5a4e28f2bf1b0)
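The recursive-lock hazard in the rbd-mirror fix above can be sketched in Python with a non-reentrant `threading.Lock` standing in for m_lock. The function names mirror the description but are simplified stand-ins, and a non-blocking acquire is used so the example reports the problem instead of hanging.

```python
import threading

m_lock = threading.Lock()  # non-reentrant stand-in for m_lock


def shut_down() -> bool:
    """Try to take m_lock as shut_down() would; report whether it worked."""
    acquired = m_lock.acquire(blocking=False)
    if acquired:
        m_lock.release()
    return acquired


def status_update_buggy() -> bool:
    """Finish the in-flight op (which runs shut_down) with m_lock held."""
    with m_lock:
        # ... schedule the next status update ...
        return shut_down()  # recursive acquire attempt on m_lock


def status_update_fixed() -> bool:
    """Hold m_lock only for scheduling, then finish the in-flight op."""
    with m_lock:
        pass  # ... schedule the next status update ...
    return shut_down()  # m_lock is free again here


print(status_update_buggy(), status_update_fixed())
```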
Zac Dover [Tue, 25 Feb 2025 04:57:11 +0000 (14:57 +1000)]
doc/releases: correct squid release order
Put the releases of Squid in descending order. This change alters the
order of the Squid releases so that it is the same as the order of the
other Ceph releases.
ceph-volume: migrate unit tests from 'mock' to 'unittest.mock'
The unit tests in ceph-volume were still using the external 'mock'
library, which is unnecessary since 'unittest.mock' is part of the
Python standard library (available since Python 3.3).
This commit updates all imports to use 'unittest.mock' instead,
ensuring better maintainability and removing the need for an extra
dependency.
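For a typical test the migration amounts to changing the import line. A minimal sketch, where `device_exists` is a made-up function used only to have something to patch:

```python
import os
from unittest import mock  # previously: import mock


def device_exists(path: str) -> bool:
    """Hypothetical code under test."""
    return os.path.exists(path)


def check_device_exists() -> bool:
    # patch() and assert_called_once_with() behave the same as they did
    # with the external library.
    with mock.patch("os.path.exists", return_value=True) as fake_exists:
        result = device_exists("/dev/vdb")
        fake_exists.assert_called_once_with("/dev/vdb")
    return result


print(check_device_exists())
```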
This refactors `get_physical_osds()`.
The calculation of `data_slots` is now more concise. The handling of
`dev_size`, `rel_data_size`, and `abs_size` is standardized.
The initialization of `free_size` is moved outside the loop
for clarity. Redundant checks and assignments are removed to simplify
the code.
ceph-volume: support splitting db even on collocated scenario
This change enables ceph-volume to create OSDs where the DB is
explicitly placed on a separate LVM partition, even in collocated
scenarios (i.e., block and DB on the same device).
This helps mitigate BlueStore fragmentation issues.
Given that ceph-volume can't automatically predict a proper default size for the db device,
the idea is to use the `--block-db-size` parameter:
Passing `--block-db-size` and `--db-devices` makes ceph-volume create db devices
on dedicated devices (current implementation):
```
Total OSDs: 2
  Type            Path            LV Size         % of device
----------------------------------------------------------------------------------------------------
  data            /dev/vdb        200.00 GB       100.00%
  block_db        /dev/vdd        4.00 GB         2.00%
----------------------------------------------------------------------------------------------------
  data            /dev/vdc        200.00 GB       100.00%
  block_db        /dev/vdd        4.00 GB         2.00%
```
Passing `--block-db-size` without `--db-devices` makes ceph-volume create a separate
LV for db device on the same device (new behavior):
```
Total OSDs: 2
  Type            Path            LV Size         % of device
----------------------------------------------------------------------------------------------------
  data            /dev/vdb        196.00 GB       98.00%
  block_db        /dev/vdb        4.00 GB         2.00%
----------------------------------------------------------------------------------------------------
  data            /dev/vdc        196.00 GB       98.00%
  block_db        /dev/vdc        4.00 GB         2.00%
```
This new behavior is supported with the `--osds-per-device` parameter:
```
Total OSDs: 4
  Type            Path            LV Size         % of device
----------------------------------------------------------------------------------------------------
  data            /dev/vdb        96.00 GB        48.00%
  block_db        /dev/vdb        4.00 GB         2.00%
----------------------------------------------------------------------------------------------------
  data            /dev/vdb        96.00 GB        48.00%
  block_db        /dev/vdb        4.00 GB         2.00%
----------------------------------------------------------------------------------------------------
  data            /dev/vdc        96.00 GB        48.00%
  block_db        /dev/vdc        4.00 GB         2.00%
----------------------------------------------------------------------------------------------------
  data            /dev/vdc        96.00 GB        48.00%
  block_db        /dev/vdc        4.00 GB         2.00%
```
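The sizes in these reports follow from simple arithmetic, reproduced below for a 200 GB device and a 4 GB `--block-db-size`. This is a worked example only; `plan_collocated` is a made-up name and not ceph-volume's sizing code.

```python
def plan_collocated(dev_size_gb: float, db_size_gb: float,
                    osds_per_device: int = 1):
    """Split one device into data and db LVs for each OSD slot.

    Each slot gets an equal share of the device; the db LV is carved out
    of that share and the data LV takes the remainder.
    """
    slot_gb = dev_size_gb / osds_per_device
    data_gb = slot_gb - db_size_gb
    return [
        {
            "data_gb": data_gb,
            "data_pct": 100 * data_gb / dev_size_gb,
            "db_gb": db_size_gb,
            "db_pct": 100 * db_size_gb / dev_size_gb,
        }
        for _ in range(osds_per_device)
    ]


for osds in (1, 2):
    for lv in plan_collocated(200, 4, osds):
        print(osds, lv)
```

One OSD per device gives a 196 GB data LV (98%), and two OSDs per device give 96 GB each (48%), with the db LV at 2% in both cases.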
This adds Python type annotations to `ceph_volume.util.device`,
along with all necessary adjustments to ensure compatibility
and maintain code clarity.
ceph-volume: set default value for BlueStore.block_lv to None
This change updates the `BlueStore` class in
`ceph_volume.objectstore` by initializing the `block_lv` attribute
to `None` with the type `Optional[Volume]`. This ensures that the
attribute has a default value and avoids potential runtime errors
when the attribute is accessed before being explicitly assigned.
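A stripped-down illustration of the pattern, using stand-in classes rather than the real ones from `ceph_volume`. The `block_path` method is hypothetical and only shows how a caller can check for `None`.

```python
from typing import Optional


class Volume:
    """Stand-in for the real Volume class."""

    def __init__(self, path: str) -> None:
        self.path = path


class BlueStore:
    """Stand-in showing only the attribute in question."""

    def __init__(self) -> None:
        # Defaulting to None means reads before assignment see None
        # instead of raising AttributeError.
        self.block_lv: Optional[Volume] = None

    def block_path(self) -> Optional[str]:
        """Hypothetical accessor that tolerates an unset block_lv."""
        if self.block_lv is None:
            return None
        return self.block_lv.path


store = BlueStore()
print(store.block_path())
store.block_lv = Volume("/dev/vg/osd-block")
print(store.block_path())
```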
ceph-volume: improve wipefs retry logic in lvm.zap
- Simplify the initialization of `tries` and `interval` variables for clarity.
- Adjust the retry logic in the `wipefs` function to:
- Include the attempt count in the warning message for better debugging.
- Start the retry loop at 1 and increment up to `tries`.
- Remove unnecessary unpacking of `stdout` and `stderr` since they were unused.
- Update the loop to increment `tries` by 1 to reflect the intended number of attempts.
This change improves code readability and makes retry behavior more transparent.
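The retry shape described by these bullets might look roughly like the following. It is a sketch, not the code in lvm.zap: the `run` callable (returning an exit code), the injected `sleep`, and the default values for `tries` and `interval` are assumptions made so the example is self-contained.

```python
import time


def wipefs_with_retry(run, path, tries=8, interval=5, sleep=time.sleep):
    """Run wipefs on `path`, retrying up to `tries` times.

    `run` is a hypothetical callable that takes an argv list and returns
    the exit code. Returns the attempt number that succeeded.
    """
    for attempt in range(1, tries + 1):  # attempts count from 1
        if run(["wipefs", "--all", path]) == 0:
            return attempt
        # The attempt count in the message makes retries easier to debug.
        print(f"wipefs failed on {path} (attempt {attempt}/{tries}), "
              f"retrying in {interval}s")
        sleep(interval)
    raise RuntimeError(f"could not wipe {path} after {tries} attempts")


# Usage with a fake runner that fails twice and then succeeds.
results = iter([1, 1, 0])
print(wipefs_with_retry(lambda argv: next(results), "/dev/vdb",
                        sleep=lambda seconds: None))
```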
Ilya Dryomov [Tue, 18 Feb 2025 16:51:47 +0000 (17:51 +0100)]
test/rbd_mirror: clear Namespace::s_instance at the end of a test
TestMockPoolReplayer.Namespaces and NamespacesError tests leave behind
a dangling pointer to a stack-allocated MockNamespace which leads to an
easily reproducible use-after-free and segfault when tests are shuffled.
Ilya Dryomov [Mon, 17 Feb 2025 11:41:51 +0000 (12:41 +0100)]
test/rbd_mirror: flush watch/notify callbacks in TestImageReplayer
TestImageReplayer establishes its own (i.e. outside of the SUT code)
watch on the header of the remote image to be able to synchronize the
execution of the test with certain notifications. This watch is
established before the remote image is opened and isn't torn down until
after the remote image is closed but while the image replayer is still
running. The flush that is part of image close sequence thus isn't
guaranteed to cover all callbacks, especially for snapshot-based
mirroring where UnlinkPeerRequest spawned from Replayer::unlink_peer()
generates a notification on the remote image for each completed unlink.
Since TestImageReplayer further immediately deletes C_WatchCtx, pretty
much any test can segfault when C_WatchCtx::handle_notify() is invoked
by TestWatchNotify infrastructure. Because it's a virtual method, the
segfault often involves a completely bogus instruction pointer:
Kefu Chai [Sun, 24 Mar 2024 10:41:40 +0000 (18:41 +0800)]
squid: crush: use std::vector instead of variable length arrays
Although variable length arrays (VLAs) have been around for a long
time, they are an extension supported by GCC and Clang and are not part
of the C++ standard. The implementation allocates the dynamically sized
array on the stack, which makes it a potential source of stack overflow.
Clang warns about them when compiling. So in this change, we switch to
std::vector<>, which is defined by the C++ standard and allocates its
storage on the heap, so it is immune to this stack overflow problem.
```
/home/kefu/dev/ceph/src/crush/CrushWrapper.h:1613:16: warning: variable length arrays in C++ are a Clang extension [-Wvla-cxx-extension]
1613 | int rawout[maxout];
| ^~~~~~
/home/kefu/dev/ceph/src/crush/CrushWrapper.h:1613:16: note: function parameter 'maxout' with unknown value cannot be used in a constant expression
/home/kefu/dev/ceph/src/crush/CrushWrapper.h:1610:60: note: declared here
1610 | void do_rule(int rule, int x, std::vector<int>& out, int maxout,
| ^
/home/kefu/dev/ceph/src/crush/CrushWrapper.h:1614:15: warning: variable length arrays in C++ are a Clang extension [-Wvla-cxx-extension]
1614 | char work[crush_work_size(crush, maxout)];
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kefu/dev/ceph/src/crush/CrushWrapper.h:1614:31: note: implicit use of 'this' pointer is only allowed within the evaluation of a call to a 'constexpr' member function
1614 | char work[crush_work_size(crush, maxout)];
| ^
2 warnings generated.
```