git.apps.os.sepia.ceph.com Git - xfsprogs-dev.git/log
xfsprogs-dev.git
14 months ago Merge tag 'scrub-media-scan-service-6.10_2024-07-29' of https://git.kernel.org/pub...
Carlos Maiolino [Tue, 6 Aug 2024 13:48:39 +0000 (15:48 +0200)]
Merge tag 'scrub-media-scan-service-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next

xfs_scrub_all: automatic media scan service [v30.9 15/28]

Now that we've completed the online fsck functionality, there are a few
things that could be improved in the automatic service.  Specifically,
we would like to perform a more intensive metadata + media scan once per
month, to give the user confidence that the filesystem isn't losing data
silently.  To accomplish this, enhance xfs_scrub_all to be able to
trigger media scans.  Next, add a duplicate set of system services that
start the media scans automatically.

This has been running on the djcloud for months with no problems.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago Merge tag 'scrub-service-security-6.10_2024-07-29' of https://git.kernel.org/pub...
Carlos Maiolino [Tue, 6 Aug 2024 13:48:23 +0000 (15:48 +0200)]
Merge tag 'scrub-service-security-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next

xfs_scrub: tighten security of systemd services [v30.9 14/28]

To reduce the risk of the online fsck service suffering some sort of
catastrophic breach that results in attackers reconfiguring the running
system, I embarked on a security audit of the systemd service files.
The result should be that all elements of the background service
(individual scrub jobs, the scrub_all initiator, and the failure
reporting) run with as few privileges and within as strong of a sandbox
as possible.

Granted, this does nothing about the potential for the /kernel/ screwing
up, but at least we could prevent obvious container escapes.

This has been running on the djcloud for months with no problems.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago Merge tag 'scrub-fstrim-minlen-freesp-histogram-6.10_2024-07-29' of https://git.kerne...
Carlos Maiolino [Tue, 6 Aug 2024 13:47:58 +0000 (15:47 +0200)]
Merge tag 'scrub-fstrim-minlen-freesp-histogram-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next

xfs_scrub: use free space histograms to reduce fstrim runtime [v30.9 13/28]

This patchset dramatically reduces the runtime of the FITRIM calls made
during phase 8 of xfs_scrub.  It turns out that phase 8 can really get
bogged down if the free space contains a large number of very small
extents.  In these cases, the runtime can increase by an order of
magnitude to free less than 1% of the free space.  This is not worth the
time, since we're spending a lot of time to do very little work.  The
FITRIM ioctl allows us to specify a minimum extent length, so we can use
statistical methods to compute a minlen parameter.

It turns out xfs_db/spaceman already have the code needed to create
histograms of free space extent lengths.  We add the ability to compute
a CDF of the extent lengths, which makes it easy to pick a minimum length
corresponding to 99% of the free space.  In most cases, this results in
dramatic reductions in phase 8 runtime.  Hence, move the histogram code
to libfrog, and wire up xfs_scrub, since phase 7 already walks the
fsmap.

We also add a new -o suboption to xfs_scrub so that people who /do/ want
to examine every free extent can do so.

This has been running on the djcloud for months with no problems.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago Merge tag 'scrub-fstrim-phase-6.10_2024-07-29' of https://git.kernel.org/pub/scm...
Carlos Maiolino [Tue, 6 Aug 2024 13:47:46 +0000 (15:47 +0200)]
Merge tag 'scrub-fstrim-phase-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next

xfs_scrub: move fstrim to a separate phase [v30.9 12/28]

Back when I originally designed xfs_scrub, all filesystem metadata
checks were complete by the end of phase 3, and phase 4 was where all
the metadata repairs occurred.  On the grounds that the filesystem
should be fully consistent by then, I made a call to FITRIM at the end
of phase 4 to discard empty space in the filesystem.

Unfortunately, that's no longer the case -- summary counters, link
counts, and quota counters are not checked until phase 7.  It's not safe
to instruct the storage to unmap "empty" areas if we don't know where
those empty areas are, so we need to create a phase 8 to trim the fs.
While we're at it, make it more obvious that fstrim only gets to run if
there are no unfixed corruptions and no other runtime errors have
occurred.

Finally, reduce the latency impacts on the rest of the system by
breaking up the fstrim work into a loop that targets only 16GB per call.
This enables better progress reporting for interactive runs and cgroup
based resource constraints for background runs.

This has been running on the djcloud for months with no problems.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago Merge tag 'scrub-detect-deceptive-extensions-6.10_2024-07-29' of https://git.kernel...
Carlos Maiolino [Tue, 6 Aug 2024 13:47:28 +0000 (15:47 +0200)]
Merge tag 'scrub-detect-deceptive-extensions-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next

xfs_scrub: detect deceptive filename extensions [v30.9 11/28]

In early 2023, malware researchers disclosed a phishing attack that was
targeted at people running Linux workstations.  The attack vector
involved the use of filenames containing what looked like a file
extension but instead contained a lookalike for the full stop (".")
and a common extension ("pdf").  Enhance xfs_scrub phase 5 to detect
these types of attacks and warn the system administrator.
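
As a rough illustration of the check described above, the sketch below
flags names in which a full stop lookalike is followed by a common
extension.  The lookalike set, the extension list, and the function name
are assumptions made for this example; they are not taken from
xfs_scrub, and the real phase 5 code may detect these names differently.

```python
# Assumed, non-exhaustive set of code points that can pass for ".":
# ONE DOT LEADER, FULLWIDTH FULL STOP, SMALL FULL STOP, IDEOGRAPHIC FULL STOP.
DOT_LOOKALIKES = {"\u2024", "\uff0e", "\ufe52", "\u3002"}

# Assumed short list of extensions that users tend to trust.
COMMON_EXTENSIONS = {"pdf", "doc", "txt", "jpg", "png", "zip"}


def looks_deceptive(name):
    """Return True if a dot lookalike is followed by a common extension."""
    for i, ch in enumerate(name):
        if ch not in DOT_LOOKALIKES:
            continue
        rest = name[i + 1:].lower()
        for ext in COMMON_EXTENSIONS:
            # The extension must end at the end of the name or at a
            # non-alphanumeric character, so "pdfx" does not match "pdf".
            if rest.startswith(ext) and not rest[len(ext):len(ext) + 1].isalnum():
                return True
    return False
```

A name such as "invoice" + U+2024 + "pdf" is flagged, while the ordinary
"invoice.pdf" is not.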

This has been running on the djcloud for months with no problems.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago Merge tag 'scrub-repair-scheduling-6.10_2024-07-29' of https://git.kernel.org/pub...
Carlos Maiolino [Tue, 6 Aug 2024 13:47:09 +0000 (15:47 +0200)]
Merge tag 'scrub-repair-scheduling-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next

xfs_scrub: improve scheduling of repair items [v30.9 10/28]

Currently, phase 4 of xfs_scrub uses per-AG repair item lists to
schedule repair work across a thread pool.  This scheme is suboptimal
when most of the repairs involve a single AG because all the work gets
dumped on a single pool thread.

Instead, we should create a thread pool with the same number of workers
as CPUs, and dispatch individual repair tickets as separate work items
to maximize parallelization.

However, we also need to ensure that repairs to space metadata and file
metadata are kept in separate queues because file repairs generally
depend on correctness of space metadata.
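
The scheduling idea can be sketched as follows.  This is only an
illustration in Python (xfs_scrub itself is written in C); the function
and parameter names are hypothetical, and draining the space metadata
queue completely before starting on file metadata is an assumption about
how the two queues might be ordered.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_repairs(space_items, file_items, repair_fn):
    """Dispatch each repair item as its own work item, space metadata first."""
    nr_workers = os.cpu_count() or 1
    results = []
    with ThreadPoolExecutor(max_workers=nr_workers) as pool:
        # extend() consumes the whole map() iterator, so every space
        # metadata repair has finished before any file repair is submitted.
        results.extend(pool.map(repair_fn, space_items))
        results.extend(pool.map(repair_fn, file_items))
    return results
```

Because every item is a separate work item, repairs that all target a
single AG can still be spread across the worker threads.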

This has been running on the djcloud for months with no problems.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago Merge tag 'scrub-object-tracking-6.10_2024-07-29' of https://git.kernel.org/pub/scm...
Carlos Maiolino [Tue, 6 Aug 2024 13:46:57 +0000 (15:46 +0200)]
Merge tag 'scrub-object-tracking-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next

xfs_scrub: use scrub_item to track check progress [v30.9 09/28]

Now that we've introduced tickets to track the status of repairs to a
specific principal XFS object (fs, ag, file), use them to track the
scrub state of those same objects.  Ultimately, we want to make it easy
to introduce vectorized repair, where we send a batch of repair requests
to the kernel instead of making millions of ioctl calls.  For now,
however, we'll settle for easier bookkeeping.

This has been running on the djcloud for months with no problems.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago Merge tag 'scrub-repair-data-deps-6.10_2024-07-29' of https://git.kernel.org/pub...
Carlos Maiolino [Tue, 6 Aug 2024 13:46:15 +0000 (15:46 +0200)]
Merge tag 'scrub-repair-data-deps-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next

xfs_scrub: track data dependencies for repairs [v30.9 08/28]

Certain kinds of XFS metadata depend on the correctness of lower level
metadata.  For example, directory indexes depend on the directory data
fork, which in turn depends on the directory inode being correct.  The
current scrub code does not strictly preserve these dependencies if it
has to defer a repair until phase 4, because phase 4 prioritizes repairs
by type (corruption, then cross referencing, and then preening) and
loses the ordering established in the previous phases.  This leads to
absurd things like trying to repair a directory before repairing its
corrupted fork.

To solve this problem, introduce a repair ticket structure to track all
the repairs pending for a principal object (inode, AG, etc).  This
reduces memory requirements if an object requires more than one type of
repair and makes it very easy to track the data dependencies between
sub-objects of a principal object.  Repair dependencies between object
types (e.g.  bnobt before inodes) must still be encoded statically into
phase 4.
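
A minimal sketch of the ticket idea, assuming a fixed dependency order
for the sub-objects of a directory inode.  The class name, the field
names, and the SUBOBJECT_ORDER list are hypothetical and chosen only for
this example; the real structure lives in the C code of xfs_scrub.

```python
from dataclasses import dataclass, field

# Assumed dependency order for one directory inode, lowest level first.
SUBOBJECT_ORDER = ["inode", "data_fork", "directory"]


@dataclass
class RepairTicket:
    """Tracks every pending repair for one principal object."""
    owner: str
    pending: set = field(default_factory=set)

    def flag(self, part):
        # Flagging the same sub-object twice costs no extra memory.
        self.pending.add(part)

    def repair_order(self):
        # Lower level metadata is always repaired before what depends on it.
        return [part for part in SUBOBJECT_ORDER if part in self.pending]
```

With one ticket per object, flagging "directory" before "data_fork"
still yields the data fork first when the repairs are replayed.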

A secondary benefit of this new ticket structure is that we can decide
to attempt a repair of an object A that was flagged for a cross
referencing error during the scan if a different object B depends on A
but only B showed definitive signs of corruption.

This has been running on the djcloud for months with no problems.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago Merge tag 'scrub-better-repair-warnings-6.10_2024-07-29' of https://git.kernel.org...
Carlos Maiolino [Tue, 6 Aug 2024 13:45:53 +0000 (15:45 +0200)]
Merge tag 'scrub-better-repair-warnings-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next

xfs_scrub: improve warnings about difficult repairs [v30.9 07/28]

While I was poking through the QA results for xfs_scrub, I noticed that
it doesn't warn the user when the primary and secondary realtime
metadata are so out of whack that the chances of a successful repair are
not so high.  I decided that it was worth refactoring the scrub code a
bit so that we could warn the user about these types of things, and
ended up refactoring unnecessary helpers out of existence and fixing
other reporting gaps.

This has been running on the djcloud for months with no problems.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago Merge tag 'scrub-repair-fixes-6.10_2024-07-29' of https://git.kernel.org/pub/scm...
Carlos Maiolino [Tue, 6 Aug 2024 13:45:40 +0000 (15:45 +0200)]
Merge tag 'scrub-repair-fixes-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next

xfs_scrub: fixes to the repair code [v30.9 06/28]

Now that we've landed the new kernel code, it's time to reorganize the
xfs_scrub code that handles repairs.  Clean up various naming warts and
misleading error messages.  Move the repair code to scrub/repair.c as
the first step.  Then, fix various issues in the repair code before we
start reorganizing things.

This has been running on the djcloud for months with no problems.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago Merge tag 'inode-repair-improvements-6.10_2024-07-29' of https://git.kernel.org/pub...
Carlos Maiolino [Tue, 6 Aug 2024 13:45:27 +0000 (15:45 +0200)]
Merge tag 'inode-repair-improvements-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next

xfsprogs: inode-related repair fixes [v30.9 05/28]

While doing QA of the online fsck code, I made a few observations:
First, nobody was checking that the di_onlink field is actually zero;
Second, that allocating a temporary file for repairs can fail (and
thus bring down the entire fs) if the inode cluster is corrupt; and
Third, that file link counts do not pin at ~0U to prevent integer
overflows.

This scattered patchset fixes those three problems.
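
The third item, pinning link counts, amounts to a saturating increment.
The sketch below assumes a 32-bit unsigned counter; the constant and
function names are made up for illustration and are not the identifiers
used in the patches.

```python
NLINK_PINNED = 0xFFFFFFFF  # ~0U for a 32-bit unsigned link count


def bump_nlink(nlink):
    """Increment a link count, sticking at the pinned value instead of wrapping."""
    if nlink >= NLINK_PINNED:
        return NLINK_PINNED
    return nlink + 1
```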

This has been running on the djcloud for months with no problems.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago Merge tag 'dirattr-validate-owners-6.10_2024-07-29' of https://git.kernel.org/pub...
Carlos Maiolino [Tue, 6 Aug 2024 13:45:05 +0000 (15:45 +0200)]
Merge tag 'dirattr-validate-owners-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next

xfsprogs: set and validate dir/attr block owners [v30.9 04/28]

There are a couple of significant changes that need to be made to the
directory and xattr code before we can support online repairs of those
data structures.

The first change is because online repair is designed to use libxfs to
create a replacement dir/xattr structure in a temporary file, and use
atomic extent swapping to commit the corrected structure.  To avoid the
performance hit of walking every block of the new structure to rewrite
the owner number, we instead change libxfs to allow callers of the dir
and xattr code the ability to set an explicit owner number to be written
into the header fields of any new blocks that are created.

The second change is to update the dir/xattr code to actually *check*
the owner number in each block that is read off the disk, since we don't
currently do that.

This has been running on the djcloud for months with no problems.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago Merge tag 'atomic-file-updates-6.10_2024-07-29' of https://git.kernel.org/pub/scm...
Carlos Maiolino [Tue, 6 Aug 2024 13:44:20 +0000 (15:44 +0200)]
Merge tag 'atomic-file-updates-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next

xfsprogs: atomic file updates [v30.9 03/28]

This series creates a new XFS_IOC_EXCHANGE_RANGE ioctl to exchange
ranges of bytes between two files atomically.

This new functionality enables data storage programs to stage and commit
file updates such that reader programs will see either the old contents
or the new contents in their entirety, with no chance of torn writes.  A
successful call completion guarantees that the new contents will be seen
even if the system fails.

The ability to exchange file fork mappings between files in this manner
is critical to supporting online filesystem repair, which is built upon
the strategy of constructing a clean copy of a damaged structure and
committing the new structure into the metadata file atomically.  The
ioctls exist to facilitate testing of the new functionality and to
enable future application program designs.

User programs will be able to update files atomically by opening an
O_TMPFILE, reflinking the source file to it, making whatever updates
they want to make, and exchange the relevant ranges of the temp file
with the original file.  If the updates are aligned with the file block
size, a new (since v2) flag provides for exchanging only the written
areas.  Note that application software must quiesce writes to the file
while it stages an atomic update.  This will be addressed by a
subsequent series.

This mechanism solves the clunkiness of two existing atomic file update
mechanisms: for O_TRUNC + rewrite, this eliminates the brief period
where other programs can see an empty file.  For create tempfile +
rename, the need to copy file attributes and extended attributes for
each file update is eliminated.

However, this method introduces its own awkwardness -- any program
initiating an exchange now needs to have a way to signal to other
programs that the file contents have changed.  For file access mediated
via read and write, fanotify or inotify are probably sufficient.  For
mmaped files, that may not be fast enough.

The reference implementation in XFS creates a new log incompat feature
and log intent items to track high level progress of swapping ranges of
two files and finish interrupted work if the system goes down.  Sample
code can be found in the corresponding changes to xfs_io to exercise the
use case mentioned above.

Note that this function is /not/ the O_DIRECT atomic untorn file writes
concept that has also been floating around for years.  It is also not
the RWF_ATOMIC patchset that has been shared.  This RFC is constructed
entirely in software, which means that there are no limitations other
than the general filesystem limits.

As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata.  The atomic file content
exchange is implemented as an atomic exchange of file fork mappings,
which means that we can implement online reconstruction of extended
attributes and directories by building a new one in another inode and
exchanging the contents.

Subsequent patchsets adapt the online filesystem repair code to use
atomic file exchanges.  This enables repair functions to construct a
clean copy of a directory, xattr information, symbolic links, realtime
bitmaps, and realtime summary information in a temporary inode.  If this
completes successfully, the new contents can be committed atomically
into the inode being repaired.  This is essential to avoid making
corruption problems worse if the system goes down in the middle of
running repair.

For userspace, this series also includes the userspace pieces needed to
test the new functionality, and a sample implementation of atomic file
updates.

This has been running on the djcloud for months with no problems.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago Merge tag 'libxfs-sync-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kerne...
Carlos Maiolino [Tue, 6 Aug 2024 13:43:49 +0000 (15:43 +0200)]
Merge tag 'libxfs-sync-6.10_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next

libxfs: sync with 6.10 [02/28]

Synchronize libxfs with the kernel.

This has been running on the djcloud for months with no problems.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago Merge tag 'libxfs-6.9-fixes_2024-07-29' of https://git.kernel.org/pub/scm/linux/kerne...
Carlos Maiolino [Tue, 6 Aug 2024 13:42:46 +0000 (15:42 +0200)]
Merge tag 'libxfs-6.9-fixes_2024-07-29' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev into for-next

libxfs: fixes for 6.9 [01/28]

A couple more last minute fixes for 6.9.

This has been running on the djcloud for months with no problems.  Enjoy!

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago xfs_scrub_all: failure reporting for the xfs_scrub_all job
Darrick J. Wong [Mon, 29 Jul 2024 23:23:17 +0000 (16:23 -0700)]
xfs_scrub_all: failure reporting for the xfs_scrub_all job

Create a failure reporting service for when xfs_scrub_all fails.  This
shouldn't happen often, but let's report anyways.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub_all: tighten up the security on the background systemd service
Darrick J. Wong [Mon, 29 Jul 2024 23:23:15 +0000 (16:23 -0700)]
xfs_scrub_all: tighten up the security on the background systemd service

Currently, xfs_scrub_all has to run with enough privileges to find
mounted XFS filesystems and the device associated with that mount and to
start xfs_scrub@<mountpoint> sub-services.  Minimize the risk of
xfs_scrub_all escaping its service container or contaminating the rest
of the system by using systemd's sandboxing controls to prohibit as much
access as possible.

The directives added by this patch were recommended by the command
'systemd-analyze security xfs_scrub_all.service' in systemd 249.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub_all: trigger automatic media scans once per month
Darrick J. Wong [Mon, 29 Jul 2024 23:23:16 +0000 (16:23 -0700)]
xfs_scrub_all: trigger automatic media scans once per month

Teach the xfs_scrub_all background service to trigger an automatic scan
of all file data once per month.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub_fail: tighten up the security on the background systemd service
Darrick J. Wong [Mon, 29 Jul 2024 23:23:15 +0000 (16:23 -0700)]
xfs_scrub_fail: tighten up the security on the background systemd service

Currently, xfs_scrub_fail has to run with enough privileges to access
the journal contents for a given scrub run and to send a report via
email.  Minimize the risk of xfs_scrub_fail escaping its service
container or contaminating the rest of the system by using systemd's
sandboxing controls to prohibit as much access as possible.

The directives added by this patch were recommended by the command
'systemd-analyze security xfs_scrub_fail@.service' in systemd 249.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub_all: enable periodic file data scrubs automatically
Darrick J. Wong [Mon, 29 Jul 2024 23:23:16 +0000 (16:23 -0700)]
xfs_scrub_all: enable periodic file data scrubs automatically

Enhance xfs_scrub_all with the ability to initiate a file data scrub
periodically.  The user must specify the period, and they may optionally
specify the path to a file that will record the last time the file data
was scrubbed.
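
One way to picture the period check is shown below.  It assumes that the
last scan time is kept as the modification time of a stamp file; that
detail, along with the function names, is an assumption for this sketch
and may not match what xfs_scrub_all actually does.

```python
import os
import time


def media_scan_due(stamp_path, period_seconds, now=None):
    """Return True if no scan is recorded or the period has elapsed."""
    now = time.time() if now is None else now
    try:
        last = os.stat(stamp_path).st_mtime
    except FileNotFoundError:
        # No stamp file yet, so the file data has never been scrubbed.
        return True
    return now - last >= period_seconds


def record_media_scan(stamp_path):
    """Create the stamp file if needed and set its mtime to the current time."""
    with open(stamp_path, "a"):
        pass
    os.utime(stamp_path, None)
```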

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub_all: support metadata+media scans of all filesystems
Darrick J. Wong [Mon, 29 Jul 2024 23:23:16 +0000 (16:23 -0700)]
xfs_scrub_all: support metadata+media scans of all filesystems

Add the necessary systemd services and control bits so that
xfs_scrub_all can kick off a metadata+media scan of a filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub_all: remove journalctl background process
Darrick J. Wong [Mon, 29 Jul 2024 23:23:16 +0000 (16:23 -0700)]
xfs_scrub_all: remove journalctl background process

Now that we only start systemd services if we're running in service
mode, there's no need for the background journalctl process that only
ran if we had started systemd services in non-service mode.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub_all: only use the xfs_scrub@ systemd services in service mode
Darrick J. Wong [Mon, 29 Jul 2024 23:23:16 +0000 (16:23 -0700)]
xfs_scrub_all: only use the xfs_scrub@ systemd services in service mode

Since the per-mount xfs_scrub@.service definition includes a bunch of
resource usage constraints, we no longer want to use those services if
xfs_scrub_all is being run directly by the sysadmin (aka not in service
mode) on the presumption that sysadmins want answers as quickly as
possible.

Therefore, only try to call the systemd service from xfs_scrub_all if
SERVICE_MODE is set in the environment.  If reaching out to systemd
fails and we're in service mode, we still want to run xfs_scrub
directly.  Split the makefile variables as necessary so that we only
pass -b to xfs_scrub in service mode.
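
The decision described above can be sketched as a small function.  The
function name, its parameters, and the shape of the return value are
hypothetical; only SERVICE_MODE and the -b option come from the text.

```python
def plan_scrub(mountpoint, environ, systemd_available):
    """Decide how to scrub one mount: via the systemd service or directly."""
    service_mode = "SERVICE_MODE" in environ
    if service_mode and systemd_available:
        # The resource-constrained per-mount service is only used in service mode.
        return ("service", mountpoint)
    args = ["xfs_scrub"]
    if service_mode:
        # Falling back from systemd while in service mode still passes -b.
        args.append("-b")
    args.append(mountpoint)
    return ("direct", args)
```

A sysadmin running xfs_scrub_all by hand gets ("direct", ["xfs_scrub",
mountpoint]) with no -b, so the scrub is not slowed down by the service
limits.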

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub: tune fstrim minlen parameter based on free space histograms
Darrick J. Wong [Mon, 29 Jul 2024 23:23:14 +0000 (16:23 -0700)]
xfs_scrub: tune fstrim minlen parameter based on free space histograms

Currently, phase 8 runs very slowly on filesystems with a lot of small
free space extents.  To reduce the amount of time spent on fstrim
activities during phase 8, we want to balance estimated runtime against
completeness of the trim.  In short, the goal is to reduce runtime by
avoiding small trim requests.

At the start of phase 8, a CDF is computed in decreasing order of extent
length from the histogram buckets created during the fsmap scan in phase
7.  A point corresponding to the fstrim percentage target is chosen from
the CDF and mapped back to a histogram bucket, and free space extents
smaller than that amount are omitted from fstrim.

On my aging /home filesystem, the free space histogram reported by
xfs_spaceman looks like this:

   from      to extents    blocks    pct blkcdf extcdf
      1       1  121953    121953   0.04 100.00 100.00
      2       3  124741    299694   0.09  99.96  81.16
      4       7  113492    593763   0.18  99.87  61.89
      8      15  109215   1179524   0.36  99.69  44.36
     16      31   76972   1695455   0.52  99.33  27.48
     32      63   48655   2219667   0.68  98.82  15.59
     64     127   31398   2876898   0.88  98.14   8.08
    128     255    8014   1447920   0.44  97.27   3.23
    256     511    4142   1501758   0.46  96.82   1.99
    512    1023    2433   1768732   0.54  96.37   1.35
   1024    2047    1795   2648460   0.81  95.83   0.97
   2048    4095    1429   4206103   1.28  95.02   0.69
   4096    8191    1045   6162111   1.88  93.74   0.47
   8192   16383     791   9242745   2.81  91.87   0.31
  16384   32767     473  10883977   3.31  89.06   0.19
  32768   65535     272  12385566   3.77  85.74   0.12
  65536  131071     192  18098739   5.51  81.98   0.07
 131072  262143     108  20675199   6.29  76.47   0.04
 262144  524287      80  29061285   8.84  70.18   0.03
 524288 1048575      39  29002829   8.83  61.33   0.02
1048576 2097151      25  36824985  11.21  52.51   0.01
2097152 4194303      32 101727192  30.95  41.30   0.01
4194304 8388607       7  34007410  10.35  10.35   0.00

From this table, we see that free space extents that are 16 blocks or
longer constitute 99.3% of the free space in the filesystem but only
27.5% of the extents.  If we set the fstrim minlen parameter to 16
blocks, that means that we can trim over 99% of the space in one third
of the time it would take to trim everything.
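
The selection can be reproduced from the table with a few lines of
Python.  This is a sketch of the idea only: the function name and the
data layout are assumptions for this example, and the real code works on
the histogram buckets that phase 7 collects.

```python
# (shortest extent length in the bucket, free blocks in the bucket),
# copied from the "from" and "blocks" columns of the table above.
FREE_SPACE_HIST = [
    (1, 121953), (2, 299694), (4, 593763), (8, 1179524),
    (16, 1695455), (32, 2219667), (64, 2876898), (128, 1447920),
    (256, 1501758), (512, 1768732), (1024, 2648460), (2048, 4206103),
    (4096, 6162111), (8192, 9242745), (16384, 10883977),
    (32768, 12385566), (65536, 18098739), (131072, 20675199),
    (262144, 29061285), (524288, 29002829), (1048576, 36824985),
    (2097152, 101727192), (4194304, 34007410),
]


def pick_minlen(hist, target_pct):
    """Return the bucket lower bound at which target_pct of free blocks is covered."""
    total = sum(blocks for _, blocks in hist)
    covered = 0
    minlen = 0
    # Walk from the longest extents to the shortest, accumulating the CDF.
    for low, blocks in sorted(hist, reverse=True):
        covered += blocks
        minlen = low
        if covered * 100 >= target_pct * total:
            break
    return minlen
```

With these numbers, a 99% target selects a minlen of 16 blocks and a 95%
target selects 2048 blocks.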

Add a new -o fstrim_pct= option to xfs_scrub just in case there are
users out there who want a different percentage.  For example, accepting
a 95% trim would net us a speed increase of nearly two orders of
magnitude, ignoring system call overhead.  Setting it to 100% will trim
everything, just like fstrim(8).

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub: improve responsiveness while trimming the filesystem
Darrick J. Wong [Mon, 29 Jul 2024 23:23:13 +0000 (16:23 -0700)]
xfs_scrub: improve responsiveness while trimming the filesystem

On a 10TB filesystem where the free space in each AG is heavily
fragmented, I noticed some very high runtimes on a FITRIM call for the
entire filesystem.  xfs_scrub likes to report progress information on
each phase of the scrub, but an strace of a FITRIM call covering the
entire filesystem:

ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839>

shows that scrub is uncommunicative for the entire duration.  We can't
report any progress for the duration of the call, and the program is not
responsive to signals.  Reducing the size of the FITRIM requests to a
single AG at a time produces lower times for each individual call, but
even this isn't quite acceptable, because the time between progress
reports is still very high:

Strace for the first 4x 1TB AGs looks like (2):
ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033>
ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323>
ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226>
ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744>

I then had the idea to limit the length parameter of each call to a
smallish amount (~11GB) so that we could report progress relatively
quickly, but much to my surprise, each FITRIM call still took ~68
seconds!

Unfortunately, the by-length fstrim implementation handles this poorly
because it walks the entire free space by length index (cntbt), which is
a very inefficient way to walk a subset of an AG when the free space is
fragmented.

To fix that, I created a second implementation in the kernel that will
walk the bnobt and perform the trims in block number order.  This
algorithm constrains the amount of btree scanning to something
resembling the range passed in, which reduces the amount of time it
takes to respond to a signal.

Therefore, break up the FITRIM calls so they don't scan more than 11GB
of space at a time.  Break the calls up by AG so that each call only has
to take one AGF, because each AG that we traverse causes a log force.
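
The resulting request pattern can be sketched as a generator of byte
ranges.  The function name is hypothetical, and treating "11GB" as
11 GiB is an assumption; the 1 TiB AG size matches the strace output
above.

```python
def trim_ranges(fs_bytes, ag_bytes, max_bytes):
    """Yield (start, length) ranges that stay inside one AG and under max_bytes."""
    start = 0
    while start < fs_bytes:
        # End of the AG that contains 'start', clamped to the end of the fs.
        ag_end = min((start // ag_bytes + 1) * ag_bytes, fs_bytes)
        length = min(max_bytes, ag_end - start)
        yield start, length
        start += length
```

Each yielded range would become one FITRIM call, which leaves room to
report progress and notice signals between calls.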

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub: dump unicode points
Darrick J. Wong [Mon, 29 Jul 2024 23:23:11 +0000 (16:23 -0700)]
xfs_scrub: dump unicode points

Add some debug functions to make it easier to query unicode character
properties.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub: tighten up the security on the background systemd service
Darrick J. Wong [Mon, 29 Jul 2024 23:23:15 +0000 (16:23 -0700)]
xfs_scrub: tighten up the security on the background systemd service

Currently, xfs_scrub has to run with some elevated privileges.  Minimize
the risk of xfs_scrub escaping its service container or contaminating
the rest of the system by using systemd's sandboxing controls to
prohibit as much access as possible.

The directives added by this patch were recommended by the command
'systemd-analyze security xfs_scrub@.service' in systemd 249.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub: collect free space histograms during phase 7
Darrick J. Wong [Mon, 29 Jul 2024 23:23:14 +0000 (16:23 -0700)]
xfs_scrub: collect free space histograms during phase 7

Collect a histogram of free space observed during phase 7.  We'll put
this information to use in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub: don't call FITRIM after runtime errors
Darrick J. Wong [Mon, 29 Jul 2024 23:23:12 +0000 (16:23 -0700)]
xfs_scrub: don't call FITRIM after runtime errors

Don't call FITRIM if there have been runtime errors -- we don't want to
touch anything after any kind of unfixable problem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub: use dynamic users when running as a systemd service
Darrick J. Wong [Mon, 29 Jul 2024 23:23:15 +0000 (16:23 -0700)]
xfs_scrub: use dynamic users when running as a systemd service

Five years ago, systemd introduced the DynamicUser directive that
allocates a new unique user/group id, runs a service with those ids, and
deletes them after the service exits.  This is a good replacement for
User=nobody, since it eliminates the threat of nobody-services messing
with each other.

Make this transition ahead of all the other security tightenings that
will land in the next few patches, and add credits for the people who
suggested the change and reviewed it.

Link: https://0pointer.net/blog/dynamic-users-with-systemd.html
Suggested-by: Helle Vaanzinn <glitsj16@riseup.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
14 months ago xfs_scrub: remove pointless spacemap.c arguments
Darrick J. Wong [Mon, 29 Jul 2024 23:23:14 +0000 (16:23 -0700)]
xfs_scrub: remove pointless spacemap.c arguments

Remove unused parameters from the full-device spacemap scan functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub: report FITRIM errors properly
Darrick J. Wong [Mon, 29 Jul 2024 23:23:12 +0000 (16:23 -0700)]
xfs_scrub: report FITRIM errors properly

Move the error reporting for the FITRIM ioctl out of vfs.c and into
phase8.c.  This makes it so that IO errors encountered during trim are
counted as runtime errors instead of being dropped silently.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub.service: reduce background CPU usage to less than one core if possible
Darrick J. Wong [Mon, 29 Jul 2024 23:23:15 +0000 (16:23 -0700)]
xfs_scrub.service: reduce background CPU usage to less than one core if possible

Currently, the xfs_scrub background service is configured to use -b,
which means that the program runs completely serially.  However, even
using all of one CPU core with idle priority may be enough to cause
thermal throttling and unwanted fan noise on smaller systems (e.g.
laptops) with fast IO systems.

Let's try to avoid this (at least on systemd) by using cgroups to limit
the program's usage to slightly more than half of one CPU and lowering
the nice priority in the scheduler.  What we /really/ want is to run
steadily on an efficiency core, but there doesn't seem to be a means to
ask the scheduler not to ramp up the CPU frequency for a particular
task.

While we're at it, group the resource limit directives together.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: don't close stdout when closing the progress bar
Darrick J. Wong [Mon, 29 Jul 2024 23:23:13 +0000 (16:23 -0700)]
xfs_scrub: don't close stdout when closing the progress bar

When we're tearing down the progress bar file stream, check that it's
not an alias of stdout before closing it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: fix the work estimation for phase 8
Darrick J. Wong [Mon, 29 Jul 2024 23:23:12 +0000 (16:23 -0700)]
xfs_scrub: fix the work estimation for phase 8

If there are latent errors on the filesystem, we aren't going to do any
work during phase 8 and it makes no sense to add that into the work
estimate for the progress bar.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: allow auxiliary pathnames for sandboxing
Darrick J. Wong [Mon, 29 Jul 2024 23:23:14 +0000 (16:23 -0700)]
xfs_scrub: allow auxiliary pathnames for sandboxing

In the next patch, we'll tighten up the security on the xfs_scrub
service so that it can't escape.  However, sandboxing the service
involves making the host filesystem as inaccessible as possible, with
the filesystem to scrub bind mounted onto a known location within the
sandbox.  Hence we need one path for reporting and a new -M argument to
tell scrub what it should actually be trying to open.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agolibfrog: print cdf of free space buckets
Darrick J. Wong [Mon, 29 Jul 2024 23:23:13 +0000 (16:23 -0700)]
libfrog: print cdf of free space buckets

Print the cumulative distribution function of the free space buckets in
reverse order.
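
As a rough illustration of what a reverse-order CDF over the buckets
looks like, here is a sketch in Python.  It is not the libfrog code; the
bucket tuple layout and the function name are made up for the example.

```python
def reverse_cdf(buckets):
    """Walk the buckets from largest to smallest, accumulating blocks.

    buckets: list of (low, high, extents, blocks) tuples sorted by
    ascending extent size.  This layout is an assumption for the sketch.
    Returns one row per bucket with the cumulative percentage of free
    blocks found in that bucket and all larger ones.
    """
    total = sum(blocks for _, _, _, blocks in buckets)
    rows = []
    running = 0
    for low, high, extents, blocks in reversed(buckets):
        running += blocks
        pct = 100.0 * running / total if total else 0.0
        rows.append((low, high, extents, blocks, pct))
    return rows
```

Reading the result top down answers "how much of the free space lives in
extents at least this large", which is probably the question people care
about when they worry about fragmentation.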

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: collapse trim_filesystem
Darrick J. Wong [Mon, 29 Jul 2024 23:23:12 +0000 (16:23 -0700)]
xfs_scrub: collapse trim_filesystem

Collapse this two-line helper into the main function since it's trivial.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agolibfrog: print wider columns for free space histogram
Darrick J. Wong [Mon, 29 Jul 2024 23:23:13 +0000 (16:23 -0700)]
libfrog: print wider columns for free space histogram

The numbers reported here can become very large, so compute the column
width dynamically.
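
The idea, sketched in Python rather than the C that libfrog uses (the
function name is hypothetical): size each column to the widest thing
that will be printed in it, header included.

```python
def column_width(header, values):
    """Return the width needed to print header and every value."""
    widest_value = max((len(str(v)) for v in values), default=0)
    return max(len(header), widest_value)


def format_column(header, values):
    """Right-justify a header and its values to a common width."""
    width = column_width(header, values)
    return [header.rjust(width)] + [str(v).rjust(width) for v in values]
```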

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: ignore phase 8 if the user disabled fstrim
Darrick J. Wong [Mon, 29 Jul 2024 23:23:12 +0000 (16:23 -0700)]
xfs_scrub: ignore phase 8 if the user disabled fstrim

If the user told us to skip trimming the filesystem, don't run the phase
at all.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agolibfrog: hoist free space histogram code
Darrick J. Wong [Mon, 29 Jul 2024 23:23:13 +0000 (16:23 -0700)]
libfrog: hoist free space histogram code

Combine the two free space histograms in xfs_db and xfs_spaceman into a
single implementation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: move FITRIM to phase 8
Darrick J. Wong [Mon, 29 Jul 2024 23:23:11 +0000 (16:23 -0700)]
xfs_scrub: move FITRIM to phase 8

Issuing discards against the filesystem should be the *last* thing that
xfs_scrub does, after everything else has been checked, repaired, and
found to be clean.  If we can't satisfy all those conditions, we have no
business telling the storage to discard itself.
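
Put another way, the decision boils down to a predicate along these
lines.  This is a Python sketch with made-up names, not the xfs_scrub
code, and it assumes the caller has counted whatever problems are still
outstanding at the end of the run.

```python
def may_discard(checked_everything, unfixed_problems):
    """Only discard if every check ran and nothing is left broken.

    Both parameters are hypothetical stand-ins for scrub's own state.
    """
    return checked_everything and unfixed_problems == 0
```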

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: try to repair space metadata before file metadata
Darrick J. Wong [Mon, 29 Jul 2024 23:23:09 +0000 (16:23 -0700)]
xfs_scrub: try to repair space metadata before file metadata

Phase 4 (metadata repairs) of xfs_scrub has suffered a mild race
condition since the beginning of its existence.  Repair functions for
higher level metadata such as directories build the new directory blocks
in an unlinked temporary file and use atomic extent swapping to commit
the corrected directory contents into the existing directory.  Atomic
extent swapping requires consistent filesystem space metadata, but phase
4 has never enforced correctness dependencies between space and file
metadata repairs.

Before the previous patch eliminated the per-AG repair lists, this error
was not often hit in testing scenarios because the allocator generally
succeeds in placing file data blocks in the same AG as the inode.  With
pool threads now able to pop file repairs from the repair list before
space repairs complete, this error became much more obvious.

Fortunately, the new phase 4 design makes it easy to try to enforce the
consistency requirements of higher level file metadata repairs.  Split
the repair list into one for space metadata and another for file
metadata.  Phase 4 will now try to fix the space metadata until it stops
making progress on that, and only then will it try to fix file metadata.
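
A minimal Python sketch of that ordering follows.  The names are
invented, try_repair is assumed to return True once an item is fully
fixed, and "progress" is simplified to mean that at least one item got
fixed in a pass.

```python
def run_until_stalled(items, try_repair):
    """Retry the items until a whole pass fixes nothing.

    Returns the items that are still unfixed.
    """
    pending = list(items)
    while pending:
        remaining = [item for item in pending if not try_repair(item)]
        if len(remaining) == len(pending):
            break  # no forward progress in this pass
        pending = remaining
    return pending


def repair_in_order(space_items, file_items, try_repair):
    """Work on space metadata first, and only then on file metadata."""
    unfixed_space = run_until_stalled(space_items, try_repair)
    unfixed_file = run_until_stalled(file_items, try_repair)
    return unfixed_space, unfixed_file
```

Note that file repairs are still attempted if some space repairs stay
stuck, which matches the "try to" wording above: this orders the work,
it does not gate it.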

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: hoist scrub retry loop to scrub_item_check_file
Darrick J. Wong [Mon, 29 Jul 2024 23:23:08 +0000 (16:23 -0700)]
xfs_scrub: hoist scrub retry loop to scrub_item_check_file

For metadata check calls, use the ioctl retry and freeze permission
tracking in scrub_item that we created in the last patch.  This enables
us to move the check retry loop out of xfs_scrub_metadata and into its
caller to remove a long backwards jump, and gets us closer to
vectorizing scrub calls.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: report deceptive file extensions
Darrick J. Wong [Mon, 29 Jul 2024 23:23:11 +0000 (16:23 -0700)]
xfs_scrub: report deceptive file extensions

Earlier this year, ESET revealed that Linux users had been tricked into
opening executables containing malware payloads.  The trickery came in
the form of a malicious zip file containing a filename with the string
"job offer․pdf".  Note that the filename does *not* denote a real pdf
file, since the last four codepoints in the file name are "ONE DOT
LEADER", p, d, and f.  Not period (ok, FULL STOP), p, d, f like you'd
normally expect.

Teach xfs_scrub to look for codepoints that could be confused with a
period followed by alphanumerics.
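
To make the check concrete, here is a small Python sketch.  xfs_scrub
itself is written in C on top of libicu; the regular expression below
and its short, hand-picked list of lookalike code points are assumptions
made purely for illustration and are nowhere near complete.

```python
import re

# Hypothetical, deliberately incomplete set of code points that can pass
# for a period: ONE DOT LEADER, FULLWIDTH FULL STOP, SMALL FULL STOP.
PERIOD_LOOKALIKES = "\u2024\uff0e\ufe52"

# A lookalike followed by ASCII alphanumerics at the end of the name.
_FAKE_EXTENSION = re.compile("[" + PERIOD_LOOKALIKES + r"][0-9A-Za-z]+\Z")


def has_deceptive_extension(name):
    """Return True if the name ends in something that only looks like
    a file extension."""
    return _FAKE_EXTENSION.search(name) is not None
```

With this, "job offer" + U+2024 + "pdf" is flagged, while a name that
ends in a real FULL STOP and "pdf" is not.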

Link: https://www.welivesecurity.com/2023/04/20/linux-malware-strengthens-links-lazarus-3cx-supply-chain-attack/
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: recheck entire metadata objects after corruption repairs
Darrick J. Wong [Mon, 29 Jul 2024 23:23:09 +0000 (16:23 -0700)]
xfs_scrub: recheck entire metadata objects after corruption repairs

When we've finished making repairs to some domain of filesystem metadata
(file, AG, etc.) to correct an inconsistency, we should recheck all the
other metadata types within that domain to make sure that we neither
made things worse nor introduced more cross-referencing problems.  If we
did, requeue the item to make the repairs.  If the only changes we made
were optimizations, don't bother.

The XFS_SCRUB_TYPE_ values are getting close to the max for a u32, so
I chose u64 for sri_selected.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: hoist repair retry loop to repair_item_class
Darrick J. Wong [Mon, 29 Jul 2024 23:23:08 +0000 (16:23 -0700)]
xfs_scrub: hoist repair retry loop to repair_item_class

For metadata repair calls, move the ioctl retry and freeze permission
tracking into scrub_item.  This enables us to move the repair retry loop
out of xfs_repair_metadata and into its caller to remove a long
backwards jump, and gets us closer to vectorizing scrub calls.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: rename struct unicrash.normalizer
Darrick J. Wong [Mon, 29 Jul 2024 23:23:11 +0000 (16:23 -0700)]
xfs_scrub: rename struct unicrash.normalizer

We're about to introduce a second normalizer, so change the name of the
existing one to reflect the algorithm that you'll get if you use it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: improve thread scheduling repair items during phase 4
Darrick J. Wong [Mon, 29 Jul 2024 23:23:08 +0000 (16:23 -0700)]
xfs_scrub: improve thread scheduling repair items during phase 4

As it stands, xfs_scrub doesn't do a good job of scheduling repair items
during phase 4.  The repair lists are sharded by AG, and one repair
worker is started for each per-AG repair list.  Consequently, if one AG
requires considerably more work than the others (e.g. inodes are not
spread evenly among the AGs) then phase 4 can stall waiting for that one
worker thread when there's still plenty of CPU power available.

While our initial assumptions were that repairs would be vanishingly
scarce, the reality is that "repairs" can be triggered for optimizations
like gaps in the xattr structures, or clearing the inode reflink flag on
inodes that no longer share data.  In real world testing scenarios, the
lack of balance leads to complaints about excessive runtime of
xfs_scrub.

To fix these balance problems, we replace the per-AG repair item lists
in the scrub context with a single repair item list.  Phase 4 will be
redesigned as follows:

The repair worker will grab a repair item from the main list, try to
repair it, record whether the repair attempt made any progress, and
requeue the item if it was not fully fixed.  A separate repair scheduler
function starts the repair workers, and waits for them all to complete.
Requeued repairs are merged back into the main repair list.  If we made
any forward progress, we'll start another round of repairs with the
repair workers.  Phase 4 retains the behavior that if the pool stops
making forward progress, it will try all the repairs one last time,
serially.

To facilitate this new design, phase 2 will queue repairs of space
metadata items directly to the main list.  Phase 3's worker threads will
queue repair items to per-thread lists and splice those lists into the
main list at the end.

On a filesystem crafted to put all the inodes in a single AG, this
restores xfs_scrub's ability to parallelize repairs.  There seems to be
a slight performance hit for the evenly-spread case, but avoiding a
performance cliff due to an unbalanced fs is more important here.
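
The scheduling scheme described above can be sketched in Python like so.
This is not the xfs_scrub implementation: the names are invented,
try_repair is assumed to return True when an item is fully fixed, and
"forward progress" is simplified to "at least one item was fixed during
the round".

```python
from concurrent.futures import ThreadPoolExecutor
import queue


def repair_round(items, try_repair, nr_workers):
    """Run one round of repairs out of a single shared list.

    Returns (requeued_items, made_progress).
    """
    work = queue.SimpleQueue()
    for item in items:
        work.put(item)
    requeued = queue.SimpleQueue()

    def worker():
        progress = False
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return progress
            if try_repair(item):
                progress = True
            else:
                requeued.put(item)  # not fully fixed, try again later

    # The scheduler starts the workers and waits for all of them.
    with ThreadPoolExecutor(max_workers=nr_workers) as pool:
        futures = [pool.submit(worker) for _ in range(nr_workers)]
        progress = any([f.result() for f in futures])

    leftover = []
    while not requeued.empty():
        leftover.append(requeued.get_nowait())
    return leftover, progress


def phase4(items, try_repair, nr_workers=4):
    """Repeat rounds while they make progress, then retry serially.

    Returns the items that could not be fixed.
    """
    pending = list(items)
    while pending:
        pending, progress = repair_round(pending, try_repair, nr_workers)
        if not progress:
            break
    # One last serial attempt at whatever is left.
    return [item for item in pending if not try_repair(item)]
```

Because every worker pulls from the same list, one AG with a pile of
work no longer ties up a single thread while the others sit idle.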

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: reduce size of struct name_entry
Darrick J. Wong [Mon, 29 Jul 2024 23:23:11 +0000 (16:23 -0700)]
xfs_scrub: reduce size of struct name_entry

libicu doesn't support processing strings longer than 2GB in length, and
we never feed the unicrash code a name longer than about 300 bytes.
Rearrange the structure to reduce the head structure size from 56 bytes
to 44 bytes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agolibfrog: enhance ptvar to support initializer functions
Darrick J. Wong [Mon, 29 Jul 2024 23:23:08 +0000 (16:23 -0700)]
libfrog: enhance ptvar to support initializer functions

Modify the per-thread variable code to support passing in an initializer
function that will set up each thread's variable space when it is
claimed.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: type-coerce the UNICRASH_* flags
Darrick J. Wong [Mon, 29 Jul 2024 23:23:11 +0000 (16:23 -0700)]
xfs_scrub: type-coerce the UNICRASH_* flags

Promote this type to something that we can type-check.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: rename UNICRASH_ZERO_WIDTH to UNICRASH_INVISIBLE
Darrick J. Wong [Mon, 29 Jul 2024 23:23:10 +0000 (16:23 -0700)]
xfs_scrub: rename UNICRASH_ZERO_WIDTH to UNICRASH_INVISIBLE

"Zero width" doesn't fully describe what the flag represents -- it gets
set for any codepoint that doesn't render.  Rename it accordingly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: store bad flags with the name entry
Darrick J. Wong [Mon, 29 Jul 2024 23:23:10 +0000 (16:23 -0700)]
xfs_scrub: store bad flags with the name entry

When scrub is checking unicode names, there are certain properties of
the directory/attribute/label name itself that it can complain about.
Store these in struct name_entry so that the confusable names detector
can pick this up later.

This restructuring enables a subsequent patch to detect suspicious
sequences in the NFC normalized form of the name without needing to hang
on to that NFC form until the end of processing.  IOWs, it's a memory
usage optimization.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: hoist non-rendering character predicate
Darrick J. Wong [Mon, 29 Jul 2024 23:23:10 +0000 (16:23 -0700)]
xfs_scrub: hoist non-rendering character predicate

Hoist this predicate code into its own function; we're going to use it
elsewhere later on.  While we're at it, document how we generated this
list in the first place.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: guard against libicu returning negative buffer lengths
Darrick J. Wong [Mon, 29 Jul 2024 23:23:10 +0000 (16:23 -0700)]
xfs_scrub: guard against libicu returning negative buffer lengths

The libicu functions u_strFromUTF8, unorm2_normalize, and
uspoof_getSkeleton return int32_t values.  Guard against negative return
values, even though the library itself never does this.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: avoid potential UAF after freeing a duplicate name entry
Darrick J. Wong [Mon, 29 Jul 2024 23:23:10 +0000 (16:23 -0700)]
xfs_scrub: avoid potential UAF after freeing a duplicate name entry

Change the function declaration of unicrash_add to set the caller's
@new_entry to NULL if we detect an updated name entry and do not wish to
continue processing.  This avoids a theoretical UAF if the unicrash_add
caller were to accidentally continue using the pointer.

This isn't an /actual/ UAF because the function formerly set @badflags
to zero, but let's be a little defensive.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: add a couple of omitted invisible code points
Darrick J. Wong [Mon, 29 Jul 2024 23:23:09 +0000 (16:23 -0700)]
xfs_scrub: add a couple of omitted invisible code points

I missed a few non-rendering code points in the "zero width"
classification code.  Add them now, and sort the list.  Finding them is
an annoyingly manual process because there are various code points that
are not supposed to affect the rendering of a string of text but are not
explicitly named as such.  There are other code points that, when
surrounded by code points from the same chart, actually /do/ affect the
rendering.

IOWs, the only way to figure this out is to grep the likely code points
and then go figure out how each of them render by reading the Unicode
spec or trying it.

$ wget https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
$ grep -E '(separator|zero width|invisible|joiner|application)' -i UnicodeData.txt

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: hoist code that removes ignorable characters
Darrick J. Wong [Mon, 29 Jul 2024 23:23:09 +0000 (16:23 -0700)]
xfs_scrub: hoist code that removes ignorable characters

Hoist the loop that removes "ignorable" code points from the skeleton
string into a separate function and give the UChar cursors names that
are easier to understand.  Convert the code to use the safe versions of
the U16_ accessor functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: use proper UChar string iterators
Darrick J. Wong [Mon, 29 Jul 2024 23:23:09 +0000 (16:23 -0700)]
xfs_scrub: use proper UChar string iterators

For code that wants to examine a UChar string, use libicu's string
iterators to walk UChar strings, instead of the open-coded U16_NEXT*
macros that perform no typechecking.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: remove unused action_list fields
Darrick J. Wong [Mon, 29 Jul 2024 23:23:07 +0000 (16:23 -0700)]
xfs_scrub: remove unused action_list fields

Remove some fields since we don't need them anymore.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: enable users to bump information messages to warnings
Darrick J. Wong [Mon, 29 Jul 2024 23:23:05 +0000 (16:23 -0700)]
xfs_scrub: enable users to bump information messages to warnings

Add a -o iwarn option that enables users to specify that informational
messages (such as incomplete scans, or confusing names) should be
treated as warnings.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: refactor scrub_meta_type out of existence
Darrick J. Wong [Mon, 29 Jul 2024 23:23:07 +0000 (16:23 -0700)]
xfs_scrub: refactor scrub_meta_type out of existence

Remove this helper function since it's trivial now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: retry incomplete repairs
Darrick J. Wong [Mon, 29 Jul 2024 23:23:06 +0000 (16:23 -0700)]
xfs_scrub: retry incomplete repairs

If a repair says it didn't do anything on account of not being able to
complete a scan of the metadata, retry the repair a few times; if even
that doesn't work, we can delay it to phase 4.
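
Roughly, in Python (a sketch only: the outcome strings and the retry
count of 3 are assumptions, since "a few times" is all this message
promises):

```python
def repair_with_retries(try_repair, max_tries=3):
    """Retry a repair that reports an incomplete scan.

    try_repair() is assumed to return "done", "failed", or "incomplete".
    Returns the final outcome, or "deferred" if every attempt came back
    incomplete and the item should be handed to phase 4.
    """
    for _ in range(max_tries):
        outcome = try_repair()
        if outcome != "incomplete":
            return outcome
    return "deferred"
```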

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: warn about difficult repairs to rt and quota metadata
Darrick J. Wong [Mon, 29 Jul 2024 23:23:04 +0000 (16:23 -0700)]
xfs_scrub: warn about difficult repairs to rt and quota metadata

Warn the user if there are problems with the rt or quota metadata that
might make repairs difficult.  For now there aren't any corruption
conditions that would trigger this, but we don't want to leave a gap.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: remove enum check_outcome
Darrick J. Wong [Mon, 29 Jul 2024 23:23:07 +0000 (16:23 -0700)]
xfs_scrub: remove enum check_outcome

Get rid of this enumeration, and just do what we will directly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: check dependencies of a scrub type before repairing
Darrick J. Wong [Mon, 29 Jul 2024 23:23:06 +0000 (16:23 -0700)]
xfs_scrub: check dependencies of a scrub type before repairing

Now that we have a map of a scrub type to its dependent scrub types, use
this information to avoid trying to fix higher level metadata before the
lower levels have passed.
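
In sketch form (Python, with a made-up and much abbreviated dependency
map; the real table lives in the C source and uses XFS_SCRUB_TYPE_
values):

```python
# Hypothetical map of a metadata type to the lower level types it
# relies on.
DEPENDS_ON = {
    "dir": ("inode", "bmap_data"),
    "xattr": ("inode", "bmap_attr"),
    "bmap_data": ("inode",),
}


def ready_to_repair(mtype, needs_repair):
    """Return True if none of mtype's dependencies still need repair.

    needs_repair: set of type names with outstanding problems.
    """
    return not any(dep in needs_repair for dep in DEPENDS_ON.get(mtype, ()))
```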

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: any inconsistency in metadata should trigger difficulty warnings
Darrick J. Wong [Mon, 29 Jul 2024 23:23:04 +0000 (16:23 -0700)]
xfs_scrub: any inconsistency in metadata should trigger difficulty warnings

Any inconsistency in the space metadata can be a sign that repairs will
be difficult, so set off the warning if there were cross referencing
problems too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: start tracking scrub state in scrub_item
Darrick J. Wong [Mon, 29 Jul 2024 23:23:07 +0000 (16:23 -0700)]
xfs_scrub: start tracking scrub state in scrub_item

Start using the scrub_item to track which metadata objects need
checking by adding a new flag to the scrub_item state set.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: clean up repair_item_difficulty a little
Darrick J. Wong [Mon, 29 Jul 2024 23:23:06 +0000 (16:23 -0700)]
xfs_scrub: clean up repair_item_difficulty a little

Document the flags handling in repair_item_difficulty.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: add missing repair types to the mustfix and difficulty assessment
Darrick J. Wong [Mon, 29 Jul 2024 23:23:04 +0000 (16:23 -0700)]
xfs_scrub: add missing repair types to the mustfix and difficulty assessment

Add a few scrub types that ought to trigger a mustfix (such as AGI
corruption) and all the AG space metadata to the repair difficulty
assessment.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: boost the repair priority of dependencies of damaged items
Darrick J. Wong [Mon, 29 Jul 2024 23:23:06 +0000 (16:23 -0700)]
xfs_scrub: boost the repair priority of dependencies of damaged items

In XFS, certain types of metadata objects depend on the correctness of
lower level metadata objects.  For example, directory blocks are stored
in the data fork of directory files, which means that any issues with
the inode core and the data fork should be dealt with before we try to
repair a directory.

xfs_scrub prioritises repairs by the severity of what the kernel scrub
function reports -- anything directly observed to be corrupt gets
repaired first, then anything that had trouble with cross referencing,
and finally anything that was correct but could be further optimised.
Returning to the above example, if a directory data fork mapping offset
is off by a bit flip, scrub will mark that as failing cross referencing,
but it'll mark the directory as corrupt.  Repair should check out the
mapping problem before it tackles the directory.

Do this by embedding a dependency table and using it to boost the
priority of the repair_item fields as needed.
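
The boosting step can be pictured with this Python sketch.  The severity
names, the table contents, and the function are all hypothetical, and
the sketch only looks one level down the dependency chain.

```python
# Severity states in ascending repair priority (hypothetical names).
CLEAN, PREEN, XFAIL, CORRUPT = range(4)

# Hypothetical dependency table: type -> lower level types it relies on.
DEPENDS_ON = {
    "dir": ("inode", "bmap_data"),
    "xattr": ("inode", "bmap_attr"),
}


def boost_dependencies(state):
    """Raise each damaged dependency to at least the priority of the
    item that depends on it.  Clean dependencies are left alone.

    state: dict mapping type name to severity.
    """
    boosted = dict(state)
    for mtype, deps in DEPENDS_ON.items():
        severity = state.get(mtype, CLEAN)
        for dep in deps:
            if boosted.get(dep, CLEAN) != CLEAN:
                boosted[dep] = max(boosted[dep], severity)
    return boosted
```

For the example above, a data fork mapping that only failed cross
referencing gets bumped to the same priority as the corrupt directory
that sits on top of it.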

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: split up the mustfix repairs and difficulty assessment functions
Darrick J. Wong [Mon, 29 Jul 2024 23:23:04 +0000 (16:23 -0700)]
xfs_scrub: split up the mustfix repairs and difficulty assessment functions

Currently, action_list_find_mustfix does two things -- it figures out
which repairs must be tried during phase 2 to enable the inode scan in
phase 3; and it figures out if xfs_scrub should warn about secondary and
primary metadata corruption that might make repair difficult.

Split these into separate functions to make each more coherent.  A long
time from now we'll need this to enable warnings about difficult rt
repairs, but for now this is merely a code cleanup.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: remove scrub_metadata_file
Darrick J. Wong [Mon, 29 Jul 2024 23:23:06 +0000 (16:23 -0700)]
xfs_scrub: remove scrub_metadata_file

Collapse this function with scrub_meta_type.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: get rid of trivial fs metadata scanner helpers
Darrick J. Wong [Mon, 29 Jul 2024 23:23:04 +0000 (16:23 -0700)]
xfs_scrub: get rid of trivial fs metadata scanner helpers

Get rid of these pointless wrappers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: remove action lists from phaseX code
Darrick J. Wong [Mon, 29 Jul 2024 23:23:05 +0000 (16:23 -0700)]
xfs_scrub: remove action lists from phaseX code

Now that we track repair schedules by filesystem object (and not
individual repairs) we can get rid of all the onstack list heads and
whatnot in the phaseX code.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: use repair_item to direct repair activities
Darrick J. Wong [Mon, 29 Jul 2024 23:23:05 +0000 (16:23 -0700)]
xfs_scrub: use repair_item to direct repair activities

Now that the new scrub_item tracks the state of any filesystem object
needing any kind of repair, use it to drive filesystem repairs and
updates to the in-kernel health status when repair finishes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: track repair items by principal, not by individual repairs
Darrick J. Wong [Mon, 29 Jul 2024 23:23:05 +0000 (16:23 -0700)]
xfs_scrub: track repair items by principal, not by individual repairs

Create a new structure to track scrub and repair state by principal
filesystem object (e.g. ag number or inode number/generation) so that we
can more easily examine and ensure that we satisfy repair order
dependencies.  This transposition will eventually enable bulk scrub
operations and will also save a lot of memory if a given object needs a
lot of work.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: actually try to fix summary counters ahead of repairs
Darrick J. Wong [Mon, 29 Jul 2024 23:23:03 +0000 (16:23 -0700)]
xfs_scrub: actually try to fix summary counters ahead of repairs

A while ago, I decided to make phase 4 check the summary counters before
it starts any other repairs, having observed that repairs of primary
metadata can fail because the summary counters (incorrectly) claim that
there aren't enough free resources in the filesystem.  However, if
problems are found in the summary counters, the repair work will be run
as part of the AG 0 repairs, which means that it runs concurrently with
other scrubbers.  This doesn't quite get us to the intended goal, so try
to fix the summary counters ahead of time.  If that fails, tough, we'll get
back to it in phase 7 if scrub gets that far.

Fixes: cbaf1c9d91a0 ("xfs_scrub: check summary counters")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agomkfs/repair: pin inodes that would otherwise overflow link count
Darrick J. Wong [Mon, 29 Jul 2024 23:23:02 +0000 (16:23 -0700)]
mkfs/repair: pin inodes that would otherwise overflow link count

Update userspace utilities not to allow integer overflows of inode link
counts to result in a file that is referenced by parent directories but
has zero link count.
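
A toy model of the pinning behavior, in Python.  The sentinel value
below (the largest 32-bit number) and both function names are
assumptions for the sketch; consult the C source for the real
definitions.

```python
NLINK_PINNED = 0xFFFFFFFF  # assumed sentinel: link count is stuck here


def bump_link_unsafe(nlink):
    """What a plain 32-bit increment does: it wraps around to zero."""
    return (nlink + 1) & 0xFFFFFFFF


def bump_link(nlink):
    """Increment the link count, but never move off the pinned value."""
    if nlink == NLINK_PINNED:
        return nlink
    return nlink + 1
```

The unsafe version turns a file with the maximum number of links into
one with a link count of zero; the pinned version leaves it alone.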

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_{db,repair}: add an explicit owner field to xfs_da_args
Darrick J. Wong [Mon, 29 Jul 2024 23:23:02 +0000 (16:23 -0700)]
xfs_{db,repair}: add an explicit owner field to xfs_da_args

Update these two utilities to set the owner field of the da args
structure prior to calling directory and extended attribute functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agomkfs: add a formatting option for exchange-range
Darrick J. Wong [Mon, 29 Jul 2024 23:23:01 +0000 (16:23 -0700)]
mkfs: add a formatting option for exchange-range

Allow users to enable the logged file mapping exchange intent items on a
filesystem, which in turn enables XFS_IOC_EXCHANGE_RANGE and online
repair of metadata that lives in files, e.g. directories and xattrs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: collapse trivial superblock scrub helpers
Darrick J. Wong [Mon, 29 Jul 2024 23:23:03 +0000 (16:23 -0700)]
xfs_scrub: collapse trivial superblock scrub helpers

Remove the trivial primary super scrub helper function since it makes
tracing code paths difficult and will become annoying in the patches
that follow.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: require primary superblock repairs to complete before proceeding
Darrick J. Wong [Mon, 29 Jul 2024 23:23:03 +0000 (16:23 -0700)]
xfs_scrub: require primary superblock repairs to complete before proceeding

Phase 2 of the xfs_scrub program calls the kernel to check the primary
superblock before scanning the rest of the filesystem.  Though doing so
is a no-op now (since the primary super must pass all checks as a
prerequisite for mounting), the goal of this code is to enable future
kernel code to intercept an xfs_scrub run before it actually does
anything.  If this some day involves fixing the primary superblock, it
seems reasonable to require that /all/ repairs complete successfully
before moving on to the rest of the filesystem.

Unfortunately, that's not what xfs_scrub does now -- primary super
repairs that fail are theoretically deferred to phase 4!  So make this
mandatory.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agolibxfs: port the bumplink function from the kernel
Darrick J. Wong [Mon, 29 Jul 2024 23:23:02 +0000 (16:23 -0700)]
libxfs: port the bumplink function from the kernel

Port the xfs_bumplink function from the kernel and use it to replace raw
calls to inc_nlink.  The next patch will need this common function to
prevent integer overflows in the link count.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_repair: add exchange-range to file systems
Darrick J. Wong [Mon, 29 Jul 2024 23:23:01 +0000 (16:23 -0700)]
xfs_repair: add exchange-range to file systems

Enable upgrading existing filesystems to support the file exchange range
feature.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months agoxfs_scrub: fix missing scrub coverage for broken inodes
Darrick J. Wong [Mon, 29 Jul 2024 23:23:03 +0000 (16:23 -0700)]
xfs_scrub: fix missing scrub coverage for broken inodes

If INUMBERS says that an inode is allocated, but BULKSTAT skips over the
inode and BULKSTAT_SINGLE errors out when loading the inumber, there are
two possibilities: One, we're racing with ifree; or two, the inode is
corrupt and iget failed.

When this happens, the scrub_scan_all_inodes code will insert a dummy
bulkstat record with all fields zeroed except bs_ino and bs_blksize.
Hence the use of i_mode switches in phase3 to schedule file content
scrubbing are not entirely correct -- bs_mode==0 means "type unknown",
which ought to mean "schedule all scrubbers".

Unfortunately, the current code doesn't do that, so instead we schedule
no content scrubs.  If the broken file was actually a directory, we fail
to check the directory contents for further corruptions.

Found by using fuzzing with xfs/385 and core.format = 0.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
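
The scheduling rule described in the commit above can be sketched as follows. This is a minimal illustration and not the xfs_scrub phase 3 code: the scrubber names and the `content_scrubbers` function are hypothetical, and only the rule "mode 0 means type unknown, so schedule everything" comes from the commit message.

```python
import stat

# Hypothetical names for the per-file-type content scrubbers.
DIR_SCRUBBER = "directory"
SYMLINK_SCRUBBER = "symlink"


def content_scrubbers(bs_mode):
    """Pick the content scrubbers to schedule for one bulkstat record."""
    if stat.S_IFMT(bs_mode) == 0:
        # Dummy record inserted for a broken inode: the file type is
        # unknown, so schedule every content scrubber.
        return [DIR_SCRUBBER, SYMLINK_SCRUBBER]
    if stat.S_ISDIR(bs_mode):
        return [DIR_SCRUBBER]
    if stat.S_ISLNK(bs_mode):
        return [SYMLINK_SCRUBBER]
    # Other file types have no type-specific content scrubber in this sketch.
    return []


print(content_scrubbers(0))
print(content_scrubbers(stat.S_IFDIR | 0o755))
```

A switch that only handles the known file types falls through to "schedule nothing" for a zeroed mode, which is the coverage gap the commit describes.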
14 months ago xfs_scrub: log when a repair was unnecessary
Darrick J. Wong [Mon, 29 Jul 2024 23:23:03 +0000 (16:23 -0700)]
xfs_scrub: log when a repair was unnecessary

If the kernel tells us that a filesystem object didn't need repairs, we
should log that with a message specific to that outcome.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago libfrog: advertise exchange-range support
Darrick J. Wong [Mon, 29 Jul 2024 23:23:01 +0000 (16:23 -0700)]
libfrog: advertise exchange-range support

Report the presence of exchange range for a given filesystem.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub: move repair functions to repair.c
Darrick J. Wong [Mon, 29 Jul 2024 23:23:02 +0000 (16:23 -0700)]
xfs_scrub: move repair functions to repair.c

Move all the repair functions to repair.c.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_io: create exchangerange command to test file range exchange ioctl
Darrick J. Wong [Mon, 29 Jul 2024 23:23:01 +0000 (16:23 -0700)]
xfs_io: create exchangerange command to test file range exchange ioctl

Create a new xfs_io command to make raw calls to the
XFS_IOC_EXCHANGE_RANGE ioctl.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_scrub: remove ALP_* flags namespace
Darrick J. Wong [Mon, 29 Jul 2024 23:23:02 +0000 (16:23 -0700)]
xfs_scrub: remove ALP_* flags namespace

In preparation to move all the repair code to repair.[ch], remove the
ALP_* flags namespace since it mostly overlaps with XRM_*.  Rename the
clunky "COMPLAIN_IF_UNFIXED" flag to "FINAL_WARNING", because that's
what it really means.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_fsr: skip the xattr/forkoff levering with the newer swapext implementations
Darrick J. Wong [Mon, 29 Jul 2024 23:23:00 +0000 (16:23 -0700)]
xfs_fsr: skip the xattr/forkoff levering with the newer swapext implementations

The newer swapext implementations in the kernel run at a high enough
level (above the bmap layer) that it's no longer required to manipulate
bs_forkoff by creating garbage xattrs to get the extent tree that we
want.  If we detect the newer algorithms, skip this error-prone step.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_fsr: convert to bulkstat v5 ioctls
Darrick J. Wong [Mon, 29 Jul 2024 23:23:00 +0000 (16:23 -0700)]
xfs_fsr: convert to bulkstat v5 ioctls

Now that libhandle can, er, handle bulkstat information coming from the
v5 bulkstat ioctl, port xfs_fsr to use the new interfaces instead of
repeatedly converting things back and forth.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_logprint: support dumping exchmaps log items
Darrick J. Wong [Mon, 29 Jul 2024 23:23:00 +0000 (16:23 -0700)]
xfs_logprint: support dumping exchmaps log items

Support dumping exchmaps log items.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs_db: advertise exchange-range in the version command
Darrick J. Wong [Mon, 29 Jul 2024 23:23:00 +0000 (16:23 -0700)]
xfs_db: advertise exchange-range in the version command

Amend the version command to advertise exchange-range support.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs: fix direction in XFS_IOC_EXCHANGE_RANGE
Darrick J. Wong [Mon, 29 Jul 2024 23:22:59 +0000 (16:22 -0700)]
xfs: fix direction in XFS_IOC_EXCHANGE_RANGE

Source kernel commit: dc5e1cbae270b625dcb978f8ea762eb16a93a016

The kernel reads userspace's buffer but does not write it back.
Therefore this is really an _IOW ioctl.  Change this before 6.10 final
releases.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
14 months ago libfrog: add support for exchange range ioctl family
Darrick J. Wong [Mon, 29 Jul 2024 23:23:00 +0000 (16:23 -0700)]
libfrog: add support for exchange range ioctl family

Add some library code to support the new file range exchange and commit
ioctls.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
14 months ago xfs: allow unlinked symlinks and dirs with zero size
Darrick J. Wong [Mon, 29 Jul 2024 23:22:58 +0000 (16:22 -0700)]
xfs: allow unlinked symlinks and dirs with zero size

Source kernel commit: 1ec9307fc066dd8a140d5430f8a7576aa9d78cd3

For a very very long time, inode inactivation has set the inode size to
zero before unmapping the extents associated with the data fork.
Unfortunately, commit 3c6f46eacd876 changed the inode verifier to
prohibit zero-length symlinks and directories.  If an inode happens to
get logged in this state and the system crashes before freeing the
inode, log recovery will also fail on the broken inode.

Therefore, allow zero-size symlinks and directories as long as the link
count is zero; nobody will be able to open these files by handle so
there isn't any risk of data exposure.

Fixes: 3c6f46eacd876 ("xfs: sanity check directory inode di_size")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
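
The relaxed rule in the commit above can be sketched as a small predicate. This is an illustration of the stated policy only; the function name `zero_size_ok` and its arguments are hypothetical and do not reproduce the inode verifier.

```python
import stat


def zero_size_ok(mode, di_size, nlink):
    """Return True if the size passes the relaxed zero-length check.

    Symlinks and directories may have size zero only when the link count
    is also zero, meaning the inode is unlinked and awaiting inactivation.
    """
    if stat.S_ISDIR(mode) or stat.S_ISLNK(mode):
        if di_size == 0 and nlink != 0:
            return False
    return True


# An unlinked directory caught mid-inactivation is accepted.
print(zero_size_ok(stat.S_IFDIR | 0o755, 0, 0))
# A still-linked zero-length symlink is rejected.
print(zero_size_ok(stat.S_IFLNK | 0o777, 0, 1))
```

Tying the exception to a zero link count keeps the original sanity check in force for every inode that is still reachable by name.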
14 months ago libhandle: add support for bulkstat v5
Darrick J. Wong [Mon, 29 Jul 2024 23:22:59 +0000 (16:22 -0700)]
libhandle: add support for bulkstat v5

Add support to libhandle for generating file handles with bulkstat v5
structures.  xfs_fsr will need this to be able to interface with the new
vfs range swap ioctl, and other client programs will probably want this
over time.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>