Zac Dover [Fri, 22 Aug 2025 08:39:29 +0000 (18:39 +1000)]
doc/cephfs: edit troubleshooting.rst (Slow MDS)
Move the "Slow requests (MDS)" section immediately after the first
section in this document ("Slow/Stuck Operations"), because the first
procedure on the page directs the reader to undertake the operation in
"Slow requests (MDS)" before trying anything else.
Dan Mick [Tue, 26 Aug 2025 00:45:21 +0000 (17:45 -0700)]
Remove git clean -fdx
The clean is unnecessary, because either
1) a source tarball is supplied, in which case the local dir is
irrelevant, or
2) make-debs calls make-dist, which doesn't care about a dirty cwd.
So it just punishes the unaware by removing things that they may
have wanted to keep.
Dan Mick [Sat, 23 Aug 2025 00:43:24 +0000 (17:43 -0700)]
make-debs.sh: invoke tar with --no-same-owner
When running as a normal user, tar does not attempt to preserve
owners set on the tar content files. When running as root, it does.
Containerized builds are running as root. Stop make-debs.sh from
trying to set other owners for files, and leaving files in the
host system with mapped UIDs other than the user running the container
(which causes jenkins to be unable to clear the workspace).
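As a rough illustration of the flag in question (not the actual
make-debs.sh code), the sketch below builds a GNU tar extraction command
that always passes --no-same-owner, so that extracted files belong to the
invoking user even when that user is root. The function name and the
paths are hypothetical.

```python
def tar_extract_cmd(tarball, dest):
    """Return an argv list that extracts `tarball` into `dest`.

    --no-same-owner tells tar not to restore the owners recorded in the
    archive; extracted files are owned by the user running tar instead.
    """
    return ["tar", "--no-same-owner", "-C", dest, "-xf", tarball]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(" ".join(tar_extract_cmd("ceph.tar.bz2", "/tmp/release")))
```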
Dan Mick [Thu, 21 Aug 2025 20:00:43 +0000 (13:00 -0700)]
make-debs.sh: make "skip debug packages" conditional
Now that we're using make-debs.sh as a builder inside containers,
the default should be to build all the packages, including debug.
(Also, fix a typo.)
Niklas Hambüchen [Sat, 21 Jun 2025 17:46:13 +0000 (19:46 +0200)]
doc/rados/configuration: Mention show-with-defaults and ceph-conf
A small improvement based on
"Why is it still so difficult to just dump all config and where it comes from?"
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/EZSLRYBYEWDA6YIARQVMUKQUWHAE3PGR/
`show-with-defaults` is very useful, and `ceph-conf` is mentioned
so that it's clear that it's legacy, and the user doesn't have to
wonder if it's actually useful but was forgotten in the list.
Improve source rpm detection by adding a new detection method that
executes an rpm command in a container to get exactly the version of
the source rpm that the ceph.spec file would have generated. For
backwards compatibility, and because I don't entirely trust myself to
have tested this fully, the old methods are still available.
The old `--rpm-no-match-sha` is now an alias for `--srpm-match=any` to
cause it to build any (unique) ceph srpm it finds.
`--srpm-match=versionglob` retains the previous default behavior of
using a glob matching on the git id or ceph version value. The new
default of `--srpm-match=auto` implements the rpm command based behavior
described above.
All of this is wrapped in a new step `find-rpm` but that's mostly an
implementation detail and for testing.
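The option handling described above could be sketched with argparse as
follows. This is an assumption about how such a CLI might be wired up,
not the script's actual code; only the option names and values come from
the text above.

```python
import argparse


def make_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--srpm-match",
        choices=("any", "versionglob", "auto"),
        default="auto",  # new default: rpm-command-based detection
        help="how to select the ceph source rpm to build",
    )
    # Legacy spelling, kept as an alias for --srpm-match=any.
    parser.add_argument(
        "--rpm-no-match-sha",
        dest="srpm_match",
        action="store_const",
        const="any",
        default=argparse.SUPPRESS,  # do not override the default above
        help="alias for --srpm-match=any",
    )
    return parser


if __name__ == "__main__":
    print(make_parser().parse_args().srpm_match)
```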
Dan Mick [Wed, 13 Aug 2025 19:16:45 +0000 (12:16 -0700)]
pybind/mgr/dashboard/frontend: add NPM_CACHEDIR envvar, use in bwc
Add an optional NPM_CACHEDIR environment variable to serve as the
cache parameter for npm in the dashboard frontend build. The idea
is to allow it to persist across builds so that we decrease the load
on registry.npmjs.org, which has been throttling our requests when
using build-with-container.py, and also hopefully improve the time
of the frontend npm operations.
build-with-container.py also grows a --npm-cache-path option to allow
setting it for container builds and passing the envvar to the build.
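A minimal sketch of the idea, assuming the variable is simply forwarded
to npm's `--cache` option. The helper name is hypothetical, and the real
frontend build is driven from cmake rather than from Python.

```python
import os


def npm_ci_cmd(env=None):
    """Build an `npm ci` command, honoring an optional NPM_CACHEDIR."""
    env = os.environ if env is None else env
    cmd = ["npm", "ci"]
    cache_dir = env.get("NPM_CACHEDIR")
    if cache_dir:
        # Reuse a persistent cache directory across builds.
        cmd += ["--cache", cache_dir]
    return cmd


if __name__ == "__main__":
    print(npm_ci_cmd())
```

When the variable is unset, the command is unchanged and npm falls back
to its own default cache location.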
John Mulligan [Wed, 21 May 2025 21:46:40 +0000 (17:46 -0400)]
dashboard: fix the workaround for unpacking node sources
My previous workaround in the dashboard for unpacking a non-root-owned
tarball as the fake root of a container did not work because of the
strange quoting/escaping behavior of cmake (it tried to run `id -u` as a
single command, not a command and an argument).
Use single quoted string and old school backticks to work around this issue.
Fixes: 24dbfb5da4813c6588f9cd199b9f527bb67f1e88
Signed-off-by: John Mulligan <jmulligan@redhat.com>
(cherry picked from commit 3a36180a373d91adcf9726660204f0cc1dcecba3)
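The quoting problem is not specific to cmake. The Python sketch below
(an analogy, not the cmake code from this commit) shows the difference
between passing "id -u" as one program name and splitting it into a
command plus an argument.

```python
import shlex
import shutil

# One argv element: the OS would look for a program literally named "id -u".
one_word = ["id -u"]

# Two argv elements: the command "id" and its argument "-u".
two_words = shlex.split("id -u")


def program_exists(argv):
    """Return True if the first argv element resolves to a program on PATH."""
    return shutil.which(argv[0]) is not None


if __name__ == "__main__":
    print(one_word, program_exists(one_word))
    print(two_words, program_exists(two_words))
```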
John Mulligan [Fri, 2 May 2025 15:17:53 +0000 (11:17 -0400)]
dashboard: ensure nodeenv downloaded content is owned by current user
When testing ceph builds in a container we discovered that certain files
could not be deleted by jenkins after a build. This was due to the way
the container maps IDs - files owned by the root user in the container
become owned by the "real" user/jenkins user on the "host".
However, the node tarball that is fetched and unpacked by nodeenv has
a different owner name/uid that is preserved in the tree and this id
gets mapped to something that can be managed by the "fake root" of the
container but not by the "regular" user outside the container.
The simplest workaround I can think of is to chown the tree back
to the current user and avoid leaving files on disk with uncleanly
mapped uids.
John Mulligan [Thu, 29 May 2025 17:41:45 +0000 (13:41 -0400)]
mgr/dashboard: add a cobertura xml file workaround variable
Add an environment variable REWRITE_COVERAGE_ROOTDIR that
changes the "hardcoded" path in the cobertura-coverage.xml file.
This can be used to map the paths used in a container build to
the paths known to a jenkins job (or whatever else you want to
do with the file).
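For illustration only, a rewrite of this kind could look like the sketch
below. It assumes that the hardcoded path lives in the `<source>`
element of the Cobertura report; the function name and the sample paths
are hypothetical, and the dashboard's real implementation may differ.

```python
import xml.etree.ElementTree as ET

# Tiny hypothetical report; real files carry many more elements.
SAMPLE = (
    "<coverage><sources>"
    "<source>/ceph/src/pybind/mgr/dashboard/frontend</source>"
    "</sources><packages/></coverage>"
)


def rewrite_coverage_root(xml_text, new_root):
    """Replace every <source> path in a Cobertura report with new_root."""
    root = ET.fromstring(xml_text)
    for source in root.iter("source"):
        source.text = new_root
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(rewrite_coverage_root(SAMPLE, "/jenkins/workspace/frontend"))
```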
Nitzan Mordechai [Thu, 19 Jun 2025 08:54:43 +0000 (08:54 +0000)]
monitor: Enhance historic ops command output and error handling
Dumping monitor historic operations currently yields no results
and incorrectly issues an error message indicating that
"mon_enable_op_tracker" is not enabled, even when it should be.
This commit addresses these issues by:
- Adding previously missing commands for historic operations.
- Correcting the dump operations check to only issue an error when
"mon_enable_op_tracker" is genuinely not enabled.
- Tracking "mon_enable_op_tracker" changes.
- Refactoring and organizing the historic operations dump command code.
- Improving the appearance and clarity of error messages.
Zac Dover [Sat, 9 Aug 2025 05:53:59 +0000 (15:53 +1000)]
doc/cephfs: edit troubleshooting.rst
Edit the section "RADOS Health" in the file
doc/cephfs/troubleshooting.rst. Add a Sphinx directive to the
doc/rados/troubleshooting/index.rst file that directs to the index of
the RADOS troubleshooting documentation.
msg: drain stack before stopping processors to avoid shutdown hang
`AsyncMessenger::shutdown()` called WorkerProcessor::stop() first,
killing the worker threads, then queued a C_drain callback via
stack->drain(). If a worker had already exited its event loop it never
processed the callback, so drain.wait() blocked forever and the monitor
shutdown hung for minutes.
Move stack->drain() ahead of the processors->stop() loop. With the new
order the workers are still alive to acknowledge the drain.
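The ordering issue can be modeled with a toy Python sketch. This is not
the AsyncMessenger code: the classes and names are hypothetical, and
because stop() joins the thread here, the old order fails every time
instead of only when a worker loses the race.

```python
import queue
import threading


class Worker:
    """Toy event loop: runs queued callbacks until it receives None."""

    def __init__(self):
        self.events = queue.Queue()
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def _loop(self):
        while True:
            callback = self.events.get()
            if callback is None:
                return
            callback()

    def stop(self):
        self.events.put(None)
        self.thread.join()


def drain(workers, timeout):
    """Queue an ack callback on each worker and wait for all of them."""
    acks = [threading.Event() for _ in workers]
    for worker, ack in zip(workers, acks):
        worker.events.put(ack.set)
    return all(ack.wait(timeout) for ack in acks)


def shutdown(workers, drain_first=True, timeout=0.5):
    """Return True if the drain was acknowledged by every worker."""
    if drain_first:
        drained = drain(workers, timeout)  # workers still alive
        for worker in workers:
            worker.stop()
    else:
        for worker in workers:
            worker.stop()
        drained = drain(workers, timeout)  # nobody left to acknowledge
    return drained
```

The timeout stands in for the unbounded drain.wait() so that the old
order reports failure instead of blocking.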
qa/suites/krbd: use a standard fixed-1 cluster in unmap subsuite
A custom "fixed-1, but with the client on a separate node" cluster was
needed only for pre-single-major.yaml kernel which is no longer around.
This can be a single-node job now -- see commits 311a450163cf
("krbd/unmap: put client.0 on a separate remote") and 39a579144cd8
("qa/suites/krbd: drop pre-single-major test").
Ronen Friedman [Fri, 8 Aug 2025 13:03:16 +0000 (08:03 -0500)]
osd/scrub: do not limit operator-initiated repairs
'auto-repair' scrubs are limited to a maximum of
'scrub_auto_repair_num_errors' damaged objects.
However, operator-initiated repairs should not be limited
by that number. Alas, a bug in a previous commit
(97de817) modified the code in such a way that it applied the
'scrub_auto_repair_num_errors' limit to all repairs,
including operator-initiated ones. This commit fixes that.
Fixes: https://tracker.ceph.com/issues/72438
Note: the fix is similar to the 'Tentacle' & 'main' fixes
(PR#64860 & PR#64849), but because the surrounding code
has changed, this is not a backport.
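The intended rule can be stated as a small Python sketch. The function
and parameter names are hypothetical, and treating the limit as
inclusive is an assumption; the OSD scrubber itself is C++ and is
structured differently.

```python
def repair_allowed(damaged_objects, auto_repair_limit, operator_initiated):
    """Decide whether a scrub may go on to repair what it found.

    Only auto-repair scrubs are subject to the error limit;
    operator-initiated repairs always proceed.
    """
    if operator_initiated:
        return True
    return damaged_objects <= auto_repair_limit


if __name__ == "__main__":
    print(repair_allowed(12, 5, operator_initiated=False))
    print(repair_allowed(12, 5, operator_initiated=True))
```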
Zac Dover [Thu, 7 Aug 2025 05:03:22 +0000 (15:03 +1000)]
doc/cephfs: edit troubleshooting.rst
Follow up on comments made by Anthony D'Atri in
https://github.com/ceph/ceph/pull/64832 and make other small changes to
increase the ease of reading this text.
Zac Dover [Thu, 7 Aug 2025 04:41:01 +0000 (14:41 +1000)]
doc/rados: document section absent in release < T
Add a note to doc/rados/operations/erasure-code.rst to warn future
backporters against adding the section "Erasure Coding Optimizations" to
versions of the documentation prior to the Tentacle release.
Zac Dover [Tue, 5 Aug 2025 11:24:41 +0000 (21:24 +1000)]
doc/cephfs: edit troubleshooting.rst
Edit "Stuck in up:replay" under the "Stuck During Recovery" section of
doc/cephfs/troubleshooting.rst. I had planned to edit the entire "Stuck
During Recovery" section in a single commit, but I think that the
material is too involved for that.
Zack Cerza [Fri, 7 Mar 2025 20:53:23 +0000 (13:53 -0700)]
make-debs.sh: Optionally take debian version
Our existing CI builds have names like:
ceph-base_20.0.0-194-g6efaea33-1jammy_amd64.deb
Before this change, make-debs.sh produces names like:
ceph-base_20.0.0-158-gb0de3a42-1_amd64.deb
This way we can pass e.g. "jammy" to end up with names compatible with our CI
builds.
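The naming difference can be reproduced with a short Python sketch. The
helper is hypothetical and is only meant to show where the optional
distribution name lands in the file name.

```python
def deb_filename(package, version, arch, dist=""):
    """Compose a .deb file name; `dist` (e.g. "jammy") is optional."""
    return f"{package}_{version}-1{dist}_{arch}.deb"


if __name__ == "__main__":
    print(deb_filename("ceph-base", "20.0.0-194-g6efaea33", "amd64", "jammy"))
    print(deb_filename("ceph-base", "20.0.0-158-gb0de3a42", "amd64"))
```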
Incorporate into doc/cephfs/ceph-dokan.rst the suggestions made by
Anthony D'Atri in https://github.com/ceph/ceph/pull/64737, and make a
few other small improvements to the English language in that file.