Patrick Donnelly [Mon, 25 Mar 2024 17:58:13 +0000 (13:58 -0400)]
Merge PR #56406 into squid
* refs/pull/56406/head:
doc/dev: update quiesce developer document
qa: wrap quiesce verification to dump debugging on error
qa: update quiesce tests for control via locallock
qa: set archive path in vstart_runner
qa: refactor CephFSMount.kill_background to optionally kill all background jobs
qa: use kwarg for rank parameter
qa: simplify calls to (rank|mds)_(tell|asok)
Revert "pybind/mgr/volumes: block quiesce for critical .meta file"
mds: remove is_root indication on quiesce_inode op
mds: prevent new lock cache cons when invalidating an existing one
mds: use XLOCK_WAIT For local lock xlockers
mds: prevent new wrlocks on LocalLock if there exists any xlock waiter
mds: block import discover when parent directory inode is quiesced
mds: avoid issuing exclusive caps to clients lacking w caps
mds: print lock cache during invalidation
mds: use inodeno_t to track quiesce requests
mds: dispatch quiesce_inode ops after dir traversal
mds: remove quiescelock handling for SimpleLock type
mds: quiescelock as local lock + cap masking
qa: run quiesce unit tests in fs:functional
qa: add quiesce protocol unit tests
qa: detect partial migrations during large config of dist epin
qa: use stdin-killer to timeout run_shell_payload
qa: simplify run_shell argument processing
doc: add dev docs for quiesce protocol
pybind/mgr/volumes: block quiesce for critical .meta file
mds: add vxattr to block quiesce on an inode
mds: convert encoded ephemeral dist pin to flags
mds: add counter to throttle quiesce
mds: add quiesce set feature flag
mds: skip non-head inodes for quiesce
mds: add quiesce op
mds: print all SimpleLock flags in debug output
mds: pretty print mutation when dumping lock
mds: add new inode quiescelock
mds: use 128 bits for waiters on MDSCacheObject
mds: provide mechanism to authpin while freezing
mds: add command to get specific op
mds: finish request before completing internal req
mds: complete internal op if killed
mds: avoid killing dead requests
mds: add command to kill request
mds: add path argument to `ops` and `dump tree` to stream result to local file
mds: print internal_request filepaths if present
mds: add more information to debug message
mds: remove redundant parenthesis
mds: implement Mutation::dump method
mds: make LockType fields const
mds: annotate mdr with try_rdlock_snap_layout failure
mds: refactor if into switch
mds: call Locker method using this
mds: simplify assert
mds: dump locks passed to Locker::acquire_locks
mds: add LockOp::print method for debugging
mds: use new insert template via print
mds: add request result to mutation for analysis by tests
mds: add comment on locking order rules
mds: allow specifying rdlock position
mds: remove dead method
common: provide a template for object dumps
common: support long running ops without slow warnings
common: simplify loop
common: add JSONFormatterFile class
common: use more efficient vector for stack
include: use larger int for large gathers
Patrick Donnelly [Mon, 25 Mar 2024 17:57:14 +0000 (13:57 -0400)]
Merge PR #56407 into squid
* refs/pull/56407/head:
qa/cephfs: stop ignoring MON_DOWN globally
qa: extend mon timeout coming up after mondb creation
qa: update dashboard schema for mon_status
mon: do not log MON_DOWN if monitor uptime is less than threshold
Nizamudeen A [Tue, 19 Mar 2024 14:57:13 +0000 (20:27 +0530)]
mgr/dashboard: rm warning/error threshold for cpu usage
For multi-core CPUs the value can exceed 100%, so it doesn't make
sense to show a warning/error when the usage is at or above 100%.
Hence, remove the thresholds.
Patrick Donnelly [Thu, 15 Feb 2024 15:28:32 +0000 (10:28 -0500)]
mds: reverse MDSMap encoding of max_xattr_size/bal_rank_mask
Commit e134c890 adds the bal_rank_mask with encoded (ev) version 17. This was
merged into main Oct 2022 and made it into the reef release normally.
Commit 7b8def5c adds the max_xattr_size also with encoded (ev) version 17 but
places it before bal_rank_mask. This is problematic as there were no plans to
backport e134c890 to quincy or pacific so piggybacking on the ev 17 bump would
not work and otherwise would require the backports to be done as a set to
ensure consistency (including with the kernel client).
However, the real issue is that 7b8def5c was not merged until after reef was
already cut. This required 7b8def5c to be backported separately in [1] which
was not merged until after v18.2.1 (current reef HEAD as of this commit).
Ultimately, this means that there are reef versions (v18.2.[01]) in the wild
which expect bal_rank_mask to be encoded at ev17 and not (max_xattr_size,
bal_rank_mask). Adding to the complications, the kernel client has already
merged code [2] expecting max_xattr_size for ev17.
It was decided in a github discussion [3] to move bal_rank_mask to ev18 to
avoid updating the kernel client. This was done in the main branch via 36ee8e7e,
and the reef max_xattr_size backport was updated with the same change (d8cebd67).
Unfortunately, this breaks upgrades v18.2.[01] to newer reef versions or to
main. The reason is that monitors will encode v17 with bal_rank_mask
(max_xattr_size is not merged yet) and send that to upgraded mgrs (which are
upgraded first). The mgr will attempt to decode bal_rank_mask as a uint64_t
(max_xattr_size) but fail because an empty (by default) bal_rank_mask is simply
encoded as a signed 32-bit integer. Consequently, the mgr will fail decoding
with:
failed to decode message of type 45 v1: End of buffer [buffer:2]
Of course, the problem does not stop there: even if the mgr were able to handle
this, the monitors/mds/clients would fail in similar fashion.
So the only choice left is to fix max_xattr_size to be encoded at ev18.
Fortunately, v18.2.2 has not been released nor has any max_xattr_size backport
to quincy/pacific been merged. The main downside will be that kernels will
wrongly decode ev17 (which is already true for ceph clusters running
v18.2.[01]). A follow-up kernel fix will be required.
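The decode failure described above can be reproduced in miniature. The following Python sketch is not Ceph's encoder: it models only the ev17 tail of the MDSMap, and the helper names (encode_ev17_old, decode_ev17_new) are hypothetical. It assumes, as stated above, that an empty bal_rank_mask occupies four bytes on the wire while max_xattr_size is read as a uint64_t.

```python
import struct
from typing import Tuple


def encode_ev17_old(bal_rank_mask: str) -> bytes:
    """Model of the v18.2.[01] ev17 tail: bal_rank_mask only.

    Assumption: the string is a 32-bit length followed by its bytes.
    """
    data = bal_rank_mask.encode()
    return struct.pack("<i", len(data)) + data


def decode_ev17_new(buf: bytes) -> Tuple[int, str]:
    """Model of a decoder expecting (max_xattr_size, bal_rank_mask) at ev17."""
    if len(buf) < 8:
        # Only 4 bytes arrive for an empty mask, so the uint64 read overruns.
        raise EOFError("End of buffer")
    (max_xattr_size,) = struct.unpack_from("<Q", buf, 0)
    if len(buf) < 12:
        raise EOFError("End of buffer")
    (length,) = struct.unpack_from("<i", buf, 8)
    if len(buf) < 12 + length:
        raise EOFError("End of buffer")
    return max_xattr_size, buf[12:12 + length].decode()
```

Passing encode_ev17_old("") to decode_ev17_new raises EOFError in this model, mirroring the "End of buffer" error the upgraded mgr reports.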
This flag is no longer necessary as the volumes plugin issues quiesce calls
against the data (i.e. root) directory of the subvolume rather than the
subvolume directory (with its associated .meta file).
mds: prevent new lock cache cons when invalidating an existing one
The previous scheme invalidated a lock cache and then immediately removed it
from its Capability list. The lock cache would eventually be deleted but a new
one could be constructed shortly after. The main reason for this is that simply
invalidating the lock cache does not drive a state change in the local locks
preventing new writers. This is mostly important for acquiring the quiescelock.
This commit also corrects a bug where a MDLockCache would be created for a
given opcode type (like create) when the capability does not have the issued
cap (CEPH_CAP_DIR_CREATE). The bug would not cause any negative side-effects
but would hold locks unnecessarily when only MDS ops (and not the client
executing ops asynchronously) are acquiring the locks.
mds: prevent new wrlocks on LocalLock if there exists any xlock waiter
Otherwise, an xlock waiter can become starved as a LocalLock supports multiple
writers.
Strictly speaking, a new lock state would be appropriate for this but we cheat
frequently with the LocalLock -- there is only one state. All transition checks
are already manually performed by the Locker.
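The starvation argument above can be illustrated with a toy model. This Python sketch is not the MDS Locker; the class and method names are hypothetical, and it assumes only what the message states: many writers may hold the lock at once, and new wrlocks are refused while an xlock waiter exists.

```python
class LocalLockModel:
    """Toy lock allowing many concurrent writers or one exclusive holder."""

    def __init__(self):
        self.num_wrlocks = 0
        self.xlock_waiters = 0
        self.xlocked = False

    def can_wrlock(self) -> bool:
        # Refuse new writers while anyone holds or waits for the xlock,
        # otherwise a steady stream of writers could starve the waiter.
        return not self.xlocked and self.xlock_waiters == 0

    def wrlock(self) -> bool:
        if not self.can_wrlock():
            return False
        self.num_wrlocks += 1
        return True

    def wrlock_finish(self) -> None:
        assert self.num_wrlocks > 0
        self.num_wrlocks -= 1

    def xlock_request(self) -> bool:
        """Take the xlock if free; otherwise register as a waiter."""
        if self.num_wrlocks == 0 and not self.xlocked:
            self.xlocked = True
            return True
        self.xlock_waiters += 1
        return False

    def xlock_retry(self) -> bool:
        """Retry by a registered waiter once the writers have drained."""
        if self.num_wrlocks == 0 and not self.xlocked:
            self.xlock_waiters -= 1
            self.xlocked = True
            return True
        return False
```

In this model the check lives in can_wrlock rather than in a separate lock state, which loosely mirrors the remark above that the Locker performs the transition checks manually.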
mds: block import discover when parent directory inode is quiesced
This is to prevent two racing ranks quiescing some root from exporting a tree
under a completely quiesced directory (inode). The state of that imported tree
may take time to quiesce and cause the root to be QUIESCED before all inodes
under it are actually quiesced.
If a dirfrag to be imported is discovered before the parent is quiesced, then
the quiesce traversal will issue a quiesce_inode op normally for parent which
will attempt to authpin the parent. That will block if the export is still
in-progress (causing quiesce to wait for the export to finish or abort).
If a CInode is removed from cache before the quiesce_inode request can process
it (and pin it in cache), a new CInode may be created with the same address.
That pointer still exists in MutationImpl::quiesce_ops and would prevent
issuing a quiesce_inode op for the new inode.
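The address-reuse hazard described above can be sketched as follows. This is a Python model, not the MDS code: addresses are plain integers supplied by the caller, and the class and method names are hypothetical. It contrasts tracking dispatched ops by object address with tracking them by inode number.

```python
class QuiesceTrackerModel:
    """Remembers which inodes already had a quiesce_inode op dispatched."""

    def __init__(self):
        self.seen_addresses = set()  # models tracking by CInode pointer
        self.seen_inos = set()       # models tracking by inode number

    def should_dispatch_by_address(self, address: int) -> bool:
        # A recycled address looks like an inode that was already handled.
        if address in self.seen_addresses:
            return False
        self.seen_addresses.add(address)
        return True

    def should_dispatch_by_ino(self, ino: int) -> bool:
        # Inode numbers identify the inode itself, not its memory location.
        if ino in self.seen_inos:
            return False
        self.seen_inos.add(ino)
        return True
```

If a second inode with a different inode number is later allocated at the same address as a trimmed one, should_dispatch_by_address wrongly returns False for it while should_dispatch_by_ino returns True.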
Patrick Donnelly [Tue, 27 Feb 2024 20:17:28 +0000 (15:17 -0500)]
mds: quiescelock as local lock + cap masking
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
(cherry picked from commit 7fa8bc8b29f22f3cbe3a15c34f86003ed7c73088)
Patrick Donnelly [Sat, 17 Feb 2024 15:26:14 +0000 (10:26 -0500)]
qa: detect partial migrations during large config of dist epin
This method would wrongly "succeed" when looking for setup of distributed
ephemerally pinned directory fragments. If the migrator splits a subtree during
the course of migration (to reduce the migration size) then the operation may
not actually be complete.
Patrick Donnelly [Wed, 24 Jan 2024 02:25:35 +0000 (21:25 -0500)]
qa: use stdin-killer to timeout run_shell_payload
- simplify argument processing / forwarding
- use stdin-killer to kill all sub-processes of the shell
- do not needlessly use run_shell to execute the command as it adds a timeout
to the stdout/stderr processing
- provide a stdin (PIPE) by default; otherwise teuthology's code closes stdin
and triggers stdin-killer to time out the shell.
- use a 15 minute timeout by default
Patrick Donnelly [Tue, 20 Feb 2024 22:08:32 +0000 (17:08 -0500)]
mds: use 128 bits for waiters on MDSCacheObject
Adding a new inode lock will overflow inode wait bits into the MDSCacheObject
wait bits. Make space for the quiescelock.
This includes a minor refactor to no longer attempt scoping the set of masks we
test in MDSCacheObject::waiting when calling MDSCacheObject::is_waiter_for.
This optimization wasn't worth the overhead and would be awkward to keep as
std::bitset cannot be used as a key for a std::multimap (easily). Instead, we
use the sequence number as a key which helps us to avoid allocating another map
whenever we call MDSCacheObject::take_waiting.
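A rough model of the waiter bookkeeping described above, written in Python rather than C++. The names are hypothetical and the container layout (sequence number mapped to a mask and callback pair) is an assumption; only the ideas of a 128-bit wait mask and a sequence-number key come from the message.

```python
import itertools
from typing import Callable, Dict, List, Tuple

WAIT_BITS = 128  # width of the wait mask after this change


class WaiterTableModel:
    """Waiters keyed by a monotonically increasing sequence number."""

    def __init__(self):
        self._seq = itertools.count(1)
        self._waiters: Dict[int, Tuple[int, Callable[[], None]]] = {}

    def add_waiter(self, mask: int, callback: Callable[[], None]) -> int:
        assert 0 < mask < (1 << WAIT_BITS)
        seq = next(self._seq)
        self._waiters[seq] = (mask, callback)
        return seq

    def is_waiter_for(self, mask: int) -> bool:
        # Test every waiter's mask; no attempt to narrow the scan by mask.
        return any(m & mask for m, _ in self._waiters.values())

    def take_waiting(self, mask: int) -> List[Callable[[], None]]:
        # Remove matching waiters and return callbacks in arrival order.
        taken = sorted(s for s, (m, _) in self._waiters.items() if m & mask)
        return [self._waiters.pop(s)[1] for s in taken]
```

Because the key is the sequence number, the mask never needs to be ordered or hashed as a key, which sidesteps the std::bitset limitation mentioned above.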
Patrick Donnelly [Tue, 13 Feb 2024 16:07:26 +0000 (11:07 -0500)]
mds: provide mechanism to authpin while freezing
When a subtree is freezing, it's no longer possible to acquire new authpins.
This is a problem when a compound request like quiescing a subtree is trying to
acquire authpins for each sub-op. This creates a situation where some quiesce
sub-ops complete with authpins (thereby preventing the tree from becoming
"frozen") and new sub-ops cannot acquire authpins (because the tree is
"freezing"). To circumvent this, allow some authpin requests to proceed if
FLAG_BYPASSFREEZING is set.
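The interaction between freezing and authpins can be sketched with a small state model. This is illustrative Python, not the MDS implementation: the class, its methods, and the bypass_freezing parameter are hypothetical stand-ins for the FLAG_BYPASSFREEZING behavior described above, and the rule that a fully frozen tree refuses all pins is an assumption of the sketch.

```python
class SubtreeModel:
    """Toy subtree that freezes once its authpin count drains to zero."""

    def __init__(self):
        self.freezing = False
        self.frozen = False
        self.auth_pins = 0

    def can_auth_pin(self, bypass_freezing: bool = False) -> bool:
        if self.frozen:
            return False  # assumption: the bypass does not apply once frozen
        if self.freezing:
            return bypass_freezing
        return True

    def auth_pin(self, bypass_freezing: bool = False) -> bool:
        if not self.can_auth_pin(bypass_freezing):
            return False
        self.auth_pins += 1
        return True

    def auth_unpin(self) -> None:
        assert self.auth_pins > 0
        self.auth_pins -= 1
        self._maybe_finish_freeze()

    def start_freeze(self) -> None:
        self.freezing = True
        self._maybe_finish_freeze()

    def _maybe_finish_freeze(self) -> None:
        # The tree only becomes frozen after every authpin is released.
        if self.freezing and self.auth_pins == 0:
            self.freezing = False
            self.frozen = True
```

Without the bypass, a compound request holding one pin while a later sub-op is refused could never release its pins, so the tree would stay "freezing" indefinitely; the bypass lets the remaining sub-ops finish so that all pins eventually drain.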