git-server-git.apps.pok.os.sepia.ceph.com Git

libfrog: allow bitmap_free to handle a null bitmap pointer

Allow bitmap_free() callers to pass a pointer to a NULL pointer.
This will help subsequent refactorings in xfs_scrub have cleaner
bitmap_free callsites.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

debian: enable xfs_healer on the root filesystem by default

Now that we're finished building autonomous repair, enable the healer
service on the root filesystem by default.  The root filesystem is
mounted by the initrd prior to starting systemd, which is why the
xfs_healer_start service cannot autostart the service for the root
filesystem.

dh_installsystemd won't activate a template service (aka one with an
at-sign in the name) even if it provides a DefaultInstance directive to
make that possible.  Hence we enable this explicitly via the postinst
script.

Note that Debian enables services by default upon package installation,
so this is consistent with their policies.  Their kernel doesn't enable
online fsck, so healer won't do much more than monitor for corruptions
and log them.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

debian/control: listify the build dependencies

This will make it less gross to add more build deps later.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

mkfs: enable online repair if all backrefs are enabled

If all backreferences are enabled in the filesystem, then enable online
repair by default if the user didn't supply any other autofsck setting.
Users might as well get full self-repair capability if they're paying
for the extra metadata.

Note that it's up to each distro to enable the systemd services
according to their own service activation policies. Debian policy is to
enable all systemd services at package installation but they don't
enable online fsck in their Kconfig so the services won't activate.
RHEL and SUSE policy requires sysadmins to enable them explicitly unless
the OS vendor also ships a systemd preset file enabling the services.
Distros without systemd won't get any of the systemd services,
obviously.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_io: add listmount and statmount commands

Add two new commands: one to list all mounts via statmount, now that we
use this in xfs_healer_start, and another to statmount each open file.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_scrub: print systemd service names

Add a hidden switch to xfs_scrub to emit systemd service names for XFS
services targetting filesystems paths instead of opencoding the
computation in things like fstests.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: add a manual page

Add a new section 8 manpage for this service daemon so others can read
about what this program is supposed to do.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: validate that repair fds point to the monitored fs

When xfs_healer reopens a mountpoint to perform a repair, it should
validate that the opened fd points to a file on the same filesystem as
the one being monitored.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: use statmount to find moved filesystems even faster

As noted in the previous patch, it's possible that a mounted filesystem
can move mountpoints between the time of the initial mount (at which
point xfs_healer starts) and when it actually wants to start a repair.
The previous patch fixed that problem by using getmntent to walk
/proc/self/mounts to see if it finds a mount with the same "source"
name, aka data device.

However, this is really slow if there are a lot of filesystems because
we end up wading through a lot of irrelevant information.  However,
statmount() can help us here because as of Linux 7.0 we can open the
passed-in path at startup, call statmount() on it to retrieve the
mnt_id, and then call it again later with that same mnt_id to find the
mountpoint.  Luckily xfs_healthmon didn't get merged until 7.0 so it's
more or less guaranteed to be there if XFS_IOC_HEALTH_MONITOR succeeds.

Obviously if this doesn't work, we can fall back to the slow walk.

This statmount code enables xfs_healer to find a filesystem that has
had its mountpoint moved to a different place in the directory tree
without the use of bind mounts and without needing to walk the entire
mount list:

# mount -t tmpfs urk /mnt
# mount --make-rprivate /mnt
# mkdir -p /mnt/a /mnt/b
# mount /dev/sda /mnt/a
# mount --move /mnt/a /mnt/b

The key here is that the struct mount object is moved, and no new ones
are created.  Therefore, the original mnt_id is still usable.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: use getmntent to find moved filesystems

It's possible that a mounted filesystem can move mountpoints between the
time of the initial mount (at which point xfs_healer starts) and when
it actually wants to start a repair. When this happens,
weakhandle::mountpoint becomes obsolete and opening it will either fail
with ENOENT or the handle revalidation will return ESTALE.

However, we do still have a means to find the mounted filesystem -- the
fsname parameter (aka the path to the data device at mount time). This
is record in /proc/mounts, which means that we can iterate getmntent to
see if we can find the mount elsewhere.

As documented a few patches ago, this would be easier if we had
revocable fds that didn't pin mounts, but that's a very huge ask.

This getmntent code enables xfs_healer to find a filesystem that has
been bind mounted in a new place and the original mountpoint detached:

# mount /dev/sda /mnt
# xfs_healer /mnt &
# mount /mnt /opt --bind
# umount /mnt

The key here is that each bind mount gets a separate struct mount
object.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: run full scrub after lost corruption events or targeted repair failure

If we fail to perform a spot repair of metadata or the kernel tells us
that it lost corruption events due to queue limits, initiate a full run
of the online fsck service to try to fix the error.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: use the autofsck fsproperty to select mode

Make the xfs_healer background service query the autofsck filesystem
property to figure out which operating mode it should use.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: don't start service if kernel support unavailable

Use ExecCondition= in the system service to check if kernel support for
the health monitor is available. If not, we don't want to run the
service, have it fail, and generate a bunch of silly log messages.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: create a service to start the per-mount healer service

Create a daemon to wait for xfs mount events via fsnotify and start up
the per-mount healer service. It's important that we're running in the
same mount namespace as the mount, so we're a fanotify client to avoid
having to filter the mount namespaces ourselves.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: create a per-mount background monitoring service

Create a systemd service definition for our self-healing filesystem
daemon so that we can run it for every mounted filesystem. Add a
hidden switch so that we can print the service unit name for fstests.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: use getparents to look up file names

If the kernel tells about something that happened to a file, use the
GETPARENTS ioctl to try to look up the path to that file for more
ergonomic reporting.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: enable repairing filesystems

Make it so that our health monitoring daemon can initiate repairs in
response to reports of corrupt filesystem metadata.  Repairs are
initiated from the background workers as explained in the previous
patch.

Note that just like xfs_scrub, xfs_healer's ability to repair metadata
relies heavily on back references such as reverse mappings and directory
parent pointers to add redundancy to the filesystem.  Check for these
two features and whine a bit if they are missing, just like scrub.

There's a bit of trickery with the fd that is used to initiate repairs
in the kernel.  Because an open fd will pin the filesystem in memory,
xfs_healer can only hold an open fd to the target filesystem while it's
performing repairs.  Therefore, at startup xfs_healer must sample enough
information about the target filesystem to reconnect to it later on.
Currently, the fs source (aka the data device path) and the root
directory handle are sufficient to do this.

Someday we might be able to have revocable fds, which would eliminate
the need for such efforts in userspace.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_healer: create daemon to listen for health events

Create a daemon program that can listen for and log health events.
Eventually this will be used to self-heal filesystems in real time.

Because events can take a while to process, the main thread reads event
objects from the healthmon fd and dispatches them to a background
workqueue as quickly as it can.  This split of responsibilities is
necessary because the kernel event queue will drop events if the queue
fills up, and each event can take some time to process (logging,
repairs, etc.) so we don't want to lose events.

To be clear, xfs_healer and xfs_scrub are complementary tools:

Scrub walks the whole filesystem, finds stuff that needs fixing or
rebuilding, and rebuilds it.  This is sort of analogous to a patrol
scrub.

Healer listens for metadata corruption messages from the kernel and
issues a targeted repair of that structure.  This is kind of like an
ondemand scrub.

My end goal is that xfs_healer (the service) is active all the time and
can respond instantly to a corruption report, whereas xfs_scrub (the
service) gets run periodically as a cron job.

xfs_healer can decide that it's overwhelmed with problems and start
xfs_scrub to deal with the mess.  Ideally you don't crash the filesystem
and then have to use xfs_repair to smash your way back to a mountable
filesystem.

By default we run xfs_healer as a background service, which means that
we only start two threads -- one to read the events, and another to
process them.  In other words, we try not to use all available hardware
resources for repairs.  The foreground mode switch starts up a large
number of threads to try to increase parallelism, which may or may not
be useful for repairs depending on how much metadata the kernel needs to
scan.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_io: add a media verify command

Add a subcommand to invoke the media verification ioctl to make sure
that we can actually check the storage underneath an xfs filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_io: monitor filesystem health events

Create a subcommand to monitor for health events generated by the kernel.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

man2: document the media verification ioctl

Document XFS_IOC_VERIFY_MEDIA, which is a new ioctl for xfs_scrub to
perform media scans on the disks underneath the filesystem. This will
enable media errors to be reported to xfs_healer and fsnotify.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

man2: document the healthmon ioctl

Document the XFS_IOC_HEALTH_MONITOR and
XFS_IOC_HEALTH_FD_ON_MONITORED_FS ioctls.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: add wrappers for listmount and statmount

Add some wrappers for listmount and statmount so that we don't have to
open-code the kernel ABI quirks in every utility program that uses it.
Note that glibc seems to have discussed providing a wrapper in late 2023
but took no action; and the listmount manpage says that there is no
glibc wrapper.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: hoist a couple of service helper functions

Hoist a couple of service/daemon-related helper functions to libfrog so
that we can share the code between xfs_scrub and xfs_healer.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: add support code for starting systemd services programmatically

Add some simple routines for computing the name of systemd service
instances and starting systemd services.  These will be used by the
xfs_healer_start service to start per-filesystem xfs_healer service
instances.

Note that we run systemd helper programs as subprocesses for a couple of
reasons.  First, the path-escaping functionality is not a part of any
library-accessible API, which means it can only be accessed via
systemd-escape(1).  Second, although the service startup functionality
can be reached via dbus, doing so would introduce a new library
dependency.  Systemd is also undergoing a dbus -> varlink RPC transition
so we avoid that mess by calling the cli systemctl(1) program.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: create healthmon event log library functions

Add some helper functions to log health monitoring events so that xfs_io
and xfs_healer can share logging code.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: add a function to grab the path from an open fd and a file handle

handle_walk_paths operates on a file handle, but requires that the fs
has been registered with libhandle via path_to_fshandle.  For a normal
libhandle client this is the desirable behavior because the application
*should* maintain an open fd to the filesystem mount.

However for xfs_healer this isn't going to work well because the healer
mustn't pin the mount while it's running.  It's smart enough to know how
to find and reconnect to the mountpoint, but libhandle doesn't have any
such concept.

Therefore, alter the libfrog getparents code so that xfs_healer can pass
in the mountpoint and reconnected fd without needing libhandle.  All
we're really doing here is trying to obtain a user-visible path for a
file that encountered problems for logging purposes; if it fails, we'll
fall back to logging the inode number.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libxfs: fix printing of cache.c_maxcount.

c_maxcount is usigned, but is being printed as signed.

Signed-off-by: Thiago Becker <tbecker@redhat.com>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>

fsr: always print error messages from xfrog_defragrange()

Error messages from xfrog_defragrange() are only printed when
verbose/debug flags are used.

We had reports from users complaining it's hard to find out the error
messages lost in the middle of dozens of other informational messages.

Change packfile() behavior to print errors independently of verbose or
debug flags.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

fsr: package function should check for negative errors

xfrog_defragrange as most other functions from libfrog return
a negative error value, while xfs_fsr's packfile(), expects
a positive error value.

Whenever xfrog_defragrange fails, the switch case always falls into the
default clausule, making the error message pointless.
Fix this by inverting xfrog_defragrange() return value call.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

xfs: fix returned valued from xfs_defer_can_append

Source kernel commit: 54fcd2f95f8d216183965a370ec69e1aab14f5da

xfs_defer_can_append returns a bool, it shouldn't be returning
a NULL.

Found by code inspection.

Fixes: 4dffb2cbb483 ("xfs: allow pausing of pending deferred work items")
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Souptick Joarder <souptick.joarder@hpe.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: Remove redundant NULL check after __GFP_NOFAIL

Source kernel commit: 281cb17787d4284a7790b9cbd80fded826ca7739

kzalloc() is called with __GFP_NOFAIL, so a NULL return is not expected.
Drop the redundant !map check in xfs_dabuf_map().
Also switch the nirecs-sized allocation to kcalloc().

Signed-off-by: hongao <hongao@uniontech.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: add static size checks for ioctl UABI

Source kernel commit: 650b774cf94495465d6a38c31bb1a6ce697b6b37

The ioctl structures in libxfs/xfs_fs.h are missing static size checks.
It is useful to have static size checks for these structures as adding
new fields to them could cause issues (e.g. extra padding that may be
inserted by the compiler). So add these checks to xfs/xfs_ondisk.h.

Due to different padding/alignment requirements across different
architectures, to avoid build failures, some structures are ommited from
the size checks. For example, structures with "compat_" definitions in
xfs/xfs_ioctl32.h are ommited.

Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: remove duplicate static size checks

Source kernel commit: e97cbf863d8918452c9f81bebdade8d04e2e7b60

In libxfs/xfs_ondisk.h, remove some duplicate entries of
XFS_CHECK_STRUCT_SIZE().

Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: Add a comment in xfs_log_sb()

Source kernel commit: ac1d977096a17d56c55bd7f90be48e81ac4cec3f

Add a comment explaining why the sb_frextents are updated outside the
if (xfs_has_lazycount(mp) check even though it is a lazycounter.
RT groups are supported only in v5 filesystems which always have
lazycounter enabled - so putting it inside the if(xfs_has_lazycount(mp)
check is redundant.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: remove metafile inodes from the active inode stat

Source kernel commit: 47553dd60b1da88df2354f841a4f71dd4de6478a

The active inode (or active vnode until recently) stat can get much larger
than expected on file systems with a lot of metafile inodes like zoned
file systems on SMR hard disks with 10.000s of rtg rmap inodes.

Remove all metafile inodes from the active counter to make it more useful
to track actual workloads and add a separate counter for active metafile
inodes.

This fixes xfs/177 on SMR hard drives.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: fix code alignment issues in xfs_ondisk.c

Source kernel commit: fd81d3fd01a5ee4bd26a7dc440e7a2209277d14b

Fixup some code alignment issues in xfs_ondisk.c

Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: Refactoring the nagcount and delta calculation

Source kernel commit: a49b7ff63f98ba1c4503869c568c99ecffa478f2

Introduce xfs_growfs_compute_delta() to calculate the nagcount
and delta blocks and refactor the code from xfs_growfs_data_private().
No functional changes.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

Convert 'alloc_obj' family to use the new default GFP_KERNEL argument

Source kernel commit: bf4afc53b77aeaa48b5409da5c8da6bb4eff7f43

This was done entirely with mindless brute force, using

git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/$alloc_objs*(.*$, GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

treewide: Replace kmalloc with kmalloc_obj for non-scalar types

Source kernel commit: 69050f8d6d075dc01af7a5f2f550a8067510366f

This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:     kmalloc(sizeof(TYPE), ...)
are replaced with:      kmalloc_obj(TYPE, ...)

Array allocations:      kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:      kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:      kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>

xfs: give the defer_relog stat a xs_ prefix

Source kernel commit: edf6078212c366459d3c70833290579b200128b1

Make this counter naming consistent with all the others.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: add zone reset error injection

Source kernel commit: 41374ae69ec3a910950d3888f444f80678c6f308

Add a new errortag to test that zone reset errors are handled correctly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: don't validate error tags in the I/O path

Source kernel commit: b8862a09d8256a9037293f1da3b4617b21de26f1

We can trust XFS developers enough to not pass random stuff to
XFS_ERROR_TEST/DELAY. Open code the validity check in xfs_errortag_add,
which is the only place that receives unvalidated error tag values from
user space, and drop the now pointless xfs_errortag_enabled helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: fix spacing style issues in xfs_alloc.c

Source kernel commit: 0ead3b72469e52ca02946b2e5b35fff38bfa061f

Fix checkpatch.pl errors regarding missing spaces around assignment
operators in xfs_alloc_compute_diff() and xfs_alloc_fixup_trees().

Adhere to the Linux kernel coding style by ensuring spaces are placed
around the assignment operator '='.

Signed-off-by: Shin Seong-jun <shinsj4653@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: add a method to replace shortform attrs

Source kernel commit: eaec8aeff31d0679eadb27a13a62942ddbfd7b87

If we're trying to replace an xattr in a shortform attr structure and
the old entry fits the new entry, we can just memcpy and exit without
having to delete, compact, and re-add the entry (or worse use the attr
intent machinery). For parent pointers this only advantages renaming
where the filename length stays the same (e.g. mv autoexec.bat
scandisk.exe) but for regular xattrs it might be useful for updating
security labels and the like.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: speed up parent pointer operations when possible

Source kernel commit: d693534513d8dcdaafcf855986d0fe0476a47462

After a recent fsmark benchmarking run, I observed that the overhead of
parent pointers on file creation and deletion can be a bit high.  On a
machine with 20 CPUs, 128G of memory, and an NVME SSD capable of pushing
750000iops, I see the following results:

$ mkfs.xfs -f -l logdev=/dev/nvme1n1,size=1g /dev/nvme0n1 -n parent=0
meta-data=/dev/nvme0n1           isize=512    agcount=40, agsize=9767586 blks
=                       sectsz=4096  attr=2, projid32bit=1
=                       crc=1        finobt=1, sparse=1, rmapbt=1
=                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
=                       exchange=0   metadir=0
data     =                       bsize=4096   blocks=390703440, imaxpct=5
=                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =/dev/nvme1n1           bsize=4096   blocks=262144, version=2
=                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
=                       rgcount=0    rgsize=0 extents
=                       zoned=0      start=0 reserved=0

So we created 40 AGs, one per CPU.  Now we create 40 directories and run
fsmark:

$ time fs_mark  -D  10000  -S  0  -n  100000  -s  0  -L  8 -d ...
# Version 3.3, 40 thread(s) starting at Wed Dec 10 14:22:07 2025
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 0 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.

parent=0               parent=1
==================     ==================
real    0m57.573s      real    1m2.934s
user    3m53.578s      user    3m53.508s
sys     19m44.440s     sys     25m14.810s

$ time rm -rf ...

parent=0               parent=1
==================     ==================
real    0m59.649s      real    1m12.505s
user    0m41.196s      user    0m47.489s
sys     13m9.566s      sys     20m33.844s

Parent pointers increase the system time by 28% overhead to create 32
million files that are totally empty.  Removing them incurs a system
time increase of 56%.  Wall time increases by 9% and 22%.

For most filesystems, each file tends to have a single owner and not
that many xattrs.  If the xattr structure is shortform, then all xattr
changes are logged with the inode and do not require the the xattr
intent mechanism to persist the parent pointer.

Therefore, we can speed up parent pointer operations by calling the
shortform xattr functions directly if the child's xattr is in short
format.  Now the overhead looks like:

$ time fs_mark  -D  10000  -S  0  -n  100000  -s  0  -L  8 -d ...

parent=0               parent=1
==================     ==================
real    0m58.030s      real    1m0.983s
user    3m54.141s      user    3m53.758s
sys     19m57.003s     sys     21m30.605s

$ time rm -rf ...

parent=0               parent=1
==================     ==================
real    0m58.911s      real    1m4.420s
user    0m41.329s      user    0m45.169s
sys     13m27.857s     sys     15m58.564s

Now parent pointers only increase the system time by 8% for creation and
19% for deletion.  Wall time increases by 5% and 9% now.

Close the performance gap by creating helpers for the attr set, remove,
and replace operations that will try to make direct shortform updates,
and fall back to the attr intent machinery if that doesn't work.  This
works for regular xattrs and for parent pointers.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: reduce xfs_attr_try_sf_addname parameters

Source kernel commit: 1ef7729df1f0c5f7bb63a121164f54d376d35835

The dp parameter to this function is an alias of args->dp, so remove it
for clarity before we go adding new callers.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: strengthen attr leaf block freemap checking

Source kernel commit: 27a0c41f33d8d31558d334b07eb58701aab0b3dd

Check for erroneous overlapping freemap regions and collisions between
freemap regions and the xattr leaf entry array.

Note that we must explicitly zero out the extra freemaps in
xfs_attr3_leaf_compact so that the in-memory buffer has a correctly
initialized freemap array to satisfy the new verification code, even if
subsequent code changes the contents before unlocking the buffer.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: refactor attr3 leaf table size computation

Source kernel commit: a165f7e7633ee0d83926d29e7909fdd8dd4dfadc

Replace all the open-coded callsites with a single static inline helper.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: fix freemap adjustments when adding xattrs to leaf blocks

Source kernel commit: 3eefc0c2b78444b64feeb3783c017d6adc3cd3ce

xfs/592 and xfs/794 both trip this assertion in the leaf block freemap
adjustment code after ~20 minutes of running on my test VMs:

ASSERT(ichdr->firstused >= ichdr->count * sizeof(xfs_attr_leaf_entry_t)
+ xfs_attr3_leaf_hdr_size(leaf));

Upon enabling quite a lot more debugging code, I narrowed this down to
fsstress trying to set a local extended attribute with namelen=3 and
valuelen=71.  This results in an entry size of 80 bytes.

At the start of xfs_attr3_leaf_add_work, the freemap looks like this:

i 0 base 448 size 0 rhs 448 count 46
i 1 base 388 size 132 rhs 448 count 46
i 2 base 2120 size 4 rhs 448 count 46
firstused = 520

where "rhs" is the first byte past the end of the leaf entry array.
This is inconsistent -- the entries array ends at byte 448, but
freemap[1] says there's free space starting at byte 388!

By the end of the function, the freemap is in worse shape:

i 0 base 456 size 0 rhs 456 count 47
i 1 base 388 size 52 rhs 456 count 47
i 2 base 2120 size 4 rhs 456 count 47
firstused = 440

Important note: 388 is not aligned with the entries array element size
of 8 bytes.

Based on the incorrect freemap, the name area starts at byte 440, which
is below the end of the entries array!  That's why the assertion
triggers and the filesystem shuts down.

How did we end up here?  First, recall from the previous patch that the
freemap array in an xattr leaf block is not intended to be a
comprehensive map of all free space in the leaf block.  In other words,
it's perfectly legal to have a leaf block with:

* 376 bytes in use by the entries array
* freemap[0] has [base = 376, size = 8]
* freemap[1] has [base = 388, size = 1500]
* the space between 376 and 388 is free, but the freemap stopped
tracking that some time ago

If we add one xattr, the entries array grows to 384 bytes, and
freemap[0] becomes [base = 384, size = 0].  So far, so good.  But if we
add a second xattr, the entries array grows to 392 bytes, and freemap[0]
gets pushed up to [base = 392, size = 0].  This is bad, because
freemap[1] hasn't been updated, and now the entries array and the free
space claim the same space.

The fix here is to adjust all freemap entries so that none of them
collide with the entries array.  Note that this fix relies on commit
2a2b5932db6758 ("xfs: fix attr leaf header freemap.size underflow") and
the previous patch that resets zero length freemap entries to have
base = 0.

Fixes: 1da177e4c3f415 ("Linux-2.6.12-rc2")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: delete attr leaf freemap entries when empty

Source kernel commit: 6f13c1d2a6271c2e73226864a0e83de2770b6f34

Back in commit 2a2b5932db6758 ("xfs: fix attr leaf header freemap.size
underflow"), Brian Foster observed that it's possible for a small
freemap at the end of the end of the xattr entries array to experience
a size underflow when subtracting the space consumed by an expansion of
the entries array.  There are only three freemap entries, which means
that it is not a complete index of all free space in the leaf block.

This code can leave behind a zero-length freemap entry with a nonzero
base.  Subsequent setxattr operations can increase the base up to the
point that it overlaps with another freemap entry.  This isn't in and of
itself a problem because the code in _leaf_add that finds free space
ignores any freemap entry with zero size.

However, there's another bug in the freemap update code in _leaf_add,
which is that it fails to update a freemap entry that begins midway
through the xattr entry that was just appended to the array.  That can
result in the freemap containing two entries with the same base but
different sizes (0 for the "pushed-up" entry, nonzero for the entry
that's actually tracking free space).  A subsequent _leaf_add can then
allocate xattr namevalue entries on top of the entries array, leading to
data loss.  But fixing that is for later.

For now, eliminate the possibility of confusion by zeroing out the base
of any freemap entry that has zero size.  Because the freemap is not
intended to be a complete index of free space, a subsequent failure to
find any free space for a new xattr will trigger block compaction, which
regenerates the freemap.

It looks like this bug has been in the codebase for quite a long time.

Fixes: 1da177e4c3f415 ("Linux-2.6.12-rc2")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: split and refactor zone validation

Source kernel commit: 19c5b6051ed62d8c4b1cf92e463c1bcf629107f4

Currently xfs_zone_validate mixes validating the software zone state in
the XFS realtime group with validating the hardware state reported in
struct blk_zone and deriving the write pointer from that.

Move all code that works on the realtime group to xfs_init_zone, and only
keep the hardware state validation in xfs_zone_validate. This makes the
code more clear, and allows for better reuse in userspace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: add a xfs_rtgroup_raw_size helper

Source kernel commit: fc633b5c5b80c1d840b7a8bc2828be96582c6b55

Add a helper to figure the on-disk size of a group, accounting for the
XFS_SB_FEAT_INCOMPAT_ZONE_GAPS feature if needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: add missing forward declaration in xfs_zones.h

Source kernel commit: 41263267ef26d315b1425eb9c8a8d7092f9db7c8

Add the missing forward declaration for struct blk_zone in xfs_zones.h.
This avoids headaches with the order of header file inclusion to avoid
compilation errors.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: remove xfs_attr_leaf_hasname

Source kernel commit: 3a65ea768b8094e4699e72f9ab420eb9e0f3f568

The calling convention of xfs_attr_leaf_hasname() is problematic, because
it returns a NULL buffer when xfs_attr3_leaf_read fails, a valid buffer
when xfs_attr3_leaf_lookup_int returns -ENOATTR or -EEXIST, and a
non-NULL buffer pointer for an already released buffer when
xfs_attr3_leaf_lookup_int fails with other error values.

Fix this by simply open coding xfs_attr_leaf_hasname in the callers, so
that the buffer release code is done by each caller of
xfs_attr3_leaf_read.

Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Reported-by: Mark Tinguely <mark.tinguely@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: directly include xfs_platform.h

Source kernel commit: cf9b52fa7d65362b648927d1d752ec99659f5c43

The xfs.h header conflicts with the public xfs.h in xfsprogs, leading
to a spurious difference in all shared libxfs files that have to
include libxfs_priv.h in userspace. Directly include xfs_platform.h so
that we can add a header of the same name to xfsprogs and remove this
major annoyance for the shared code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: move struct xfs_log_iovec to xfs_log_priv.h

Source kernel commit: 027410591418bded6ba6051151d88fc6fb8a7614

This structure is now only used by the core logging and CIL code.

Also remove the unused typedef.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: add media verification ioctl

Source kernel commit: b8accfd65d31f25b9df15ec2419179b6fa0b21d5

Add a new privileged ioctl so that xfs_scrub can ask the kernel to
verify the media of the devices backing an xfs filesystem, and have any
resulting media errors reported to fsnotify and xfs_healer.

To accomplish this, the kernel allocates a folio between the base page
size and 1MB, and issues read IOs to a gradually incrementing range of
one of the storage devices underlying an xfs filesystem. If any error
occurs, that raw error is reported to the calling process. If the error
happens to be one of the ones that the kernel considers indicative of
data loss, then it will also be reported to xfs_healthmon and fsnotify.

Driving the verification from the kernel enables xfs (and by extension
xfs_scrub) to have precise control over the size and error handling of
IOs that are issued to the underlying block device, and to emit
notifications about problems to other relevant kernel subsystems
immediately.

Note that the caller is also allowed to reduce the size of the IO and
to ask for a relaxation period after each IO.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: check if an open file is on the health monitored fs

Source kernel commit: 8b85dc4090e1c72c6d42acd823514cce67cd54fc

Create a new ioctl for the healthmon file that checks that a given fd
points to the same filesystem that the healthmon file is monitoring.
This allows xfs_healer to check that when it reopens a mountpoint to
perform repairs, the file that it gets matches the filesystem that
generated the corruption report.

(Note that xfs_healer doesn't maintain an open fd to a filesystem that
it's monitoring so that it doesn't pin the mount.)

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: convey file I/O errors to the health monitor

Source kernel commit: dfa8bad3a8796ce1ca4f1d15158e2ecfb9c5c014

Connect the fserror reporting to the health monitor so that xfs can send
events about file I/O errors to the xfs_healer daemon. These events are
entirely informational because xfs cannot regenerate user data, so
hopefully the fsnotify I/O error event gets noticed by the relevant
management systems.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: convey externally discovered fsdax media errors to the health monitor

Source kernel commit: e76e0e3fc9957a5183ddc51dc84c3e471125ab06

Connect the fsdax media failure notification code to the health monitor
so that xfs can send events about that to the xfs_healer daemon.

Later on we'll add the ability for the xfs_scrub media scan (phase 6) to
report the errors that it finds to the kernel so that those are also
logged by xfs_healer.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: convey filesystem shutdown events to the health monitor

Source kernel commit: 74c4795e50f816dbf5cf094691fc4f95bbc729ad

Connect the filesystem shutdown code to the health monitor so that xfs
can send events about that to the xfs_healer daemon.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: convey metadata health events to the health monitor

Source kernel commit: 5eb4cb18e445d09f64ef4b7c8fdc3b2296cb0702

Connect the filesystem metadata health event collection system to the
health monitor so that xfs can send events to xfs_healer as it collects
information.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: convey filesystem unmount events to the health monitor

Source kernel commit: 25ca57fa3624cae9c6b5c6d3fc7f38318ca1402e

In xfs_healthmon_unmount, send events to xfs_healer so that it knows
that nothing further can be done for the filesystem.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: create event queuing, formatting, and discovery infrastructure

Source kernel commit: b3a289a2a9397b2e731f334d7d36623a0f9192c5

Create the basic infrastructure that we need to report health events to
userspace.  We need a compact form for recording critical information
about an event and queueing them; a means to notice that we've lost some
events; and a means to format the events into something that userspace
can handle.  Make the kernel export C structures via read().

In a previous iteration of this new subsystem, I wanted to explore data
exchange formats that are more flexible and easier for humans to read
than C structures.  The thought being that when we want to rev (or
worse, enlarge) the event format, it ought to be trivially easy to do
that in a way that doesn't break old userspace.

I looked at formats such as protobufs and capnproto.  These look really
nice in that extending the wire format is fairly easy, you can give it a
data schema and it generates the serialization code for you, handles
endianness problems, etc.  The huge downside is that neither support C
all that well.

Too hard, and didn't want to port either of those huge sprawling
libraries first to the kernel and then again to xfsprogs.  Then I
thought, how about JSON?  Javascript objects are human readable, the
kernel can emit json without much fuss (it's all just strings!) and
there are plenty of interpreters for python/rust/c/etc.

There's a proposed schema format for json, which means that xfs can
publish a description of the events that kernel will emit.  Userspace
consumers (e.g. xfsprogs/xfs_healer) can embed the same schema document
and use it to validate the incoming events from the kernel, which means
it can discard events that it doesn't understand, or garbage being
emitted due to bugs.

However, json has a huge crutch -- javascript is well known for its
vague definitions of what are numbers.  This makes expressing a large
number rather fraught, because the runtime is free to represent a number
in nearly any way it wants.  Stupider ones will truncate values to word
size, others will roll out doubles for uint52_t (yes, fifty-two) with
the resulting loss of precision.  Not good when you're dealing with
discrete units.

It just so happens that python's json library is smart enough to see a
sequence of digits and put them in a u64 (at least on x86_64/aarch64)
but an actual javascript interpreter (pasting into Firefox) isn't
necessarily so clever.

It turns out that none of the proposed json schemas were ever ratified
even in an open-consensus way, so json blobs are still just loosely
structured blobs.  The parsing in userspace was also noticeably slow and
memory-consumptive.

Hence only the C interface survives.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: start creating infrastructure for health monitoring

Source kernel commit: a48373e7d35a89f6f9b39f0d0da9bf158af054ee

Start creating helper functions and infrastructure to pass filesystem
health events to a health monitoring file. Since this is an
administrative interface, we only support a single health monitor
process per filesystem, so we don't need to use anything fancy such as
notifier chains (== tons of indirect calls).

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: fix missing gettext call in current_fixed_time

This error message can be seen by regular users, so it should be looked
up in the internationalization catalogue with gettext.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: hoist some utilities from libxfs

This started with a desire to move the duplicate cmn_err declarations in
libxfs into libfrog/util.h ahead of the patch that renames libxfs_priv.h
to xfs_platform.h.

Then this patch expanded in scope when I realized that there were
several other utility functions that weren't specific to xfs; those go
in libfrog.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libxfs: port various kernel apis from 7.0

Port more kernel APIs from Linux as of 7.0.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libxfs: fix XFS_STATS_DEC

This macro only takes two arguments in the kernel, so fix the definition
here too. All existing callsites #if 0 it into oblivion which is why
we've never noticed, but an upcoming patch in the libxfs sync will not
be so lucky.

Cc: <linux-xfs@vger.kernel.org> # v4.9.0
Fixes: ece930fa14a343 ("xfs: refactor xfs_bunmapi_cow")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_repair: don't fail on INCOMPLETE attrs in leaf blocks

While trying to fix problems in generic/753, I noticed test failures on
account of xfs_repair:

attribute entry #4 in attr block 0, inode 131 is INCOMPLETE
problem with attribute contents in inode 131
would clear attr fork
bad nblocks 4 for inode 131, would reset to 0
bad anextents 1 for inode 131, would reset to 0

Looking at the dumped filesystem, inode 131 is a linked file, and the
"incomplete" xattr was clearly part of an xfs_attr_set operation that
failed midway through because the induced log shutdown prevented xfs
from finishing the creation of a remote xattr.  This kind of thing is
expected, but instead xfs_repair deletes the entire attr fork!

It's far too drastic to delete every xattr because doing that destroys
things like security labels.  The kernel won't show incomplete attrs so
it's not a big deal to leave them attached to the file.  Note that
xfs_scrub can fix such things.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfsprogs: Release v6.19.0

Update all the necessary files for a v6.19.0 release.

Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>

xfs_io: print more realtime subvolume related information in statfs

The statfs call report information from the fs geometry structure.
Add the fields added for realtime group and zoned device support to
this output.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs_io: fix fsmap help

By default fsmap prints the reverse mapping for all devices. Update the
help text to correctly reflect that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andrey Albershteyn <aalbersh@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

mkfs: fix log sunit automatic configuration

This patch fixes ~96% failure rates on fstests on my test fleet, some
of which now have 4k LBA disks with unexciting min/opt io geometry:

# lsblk -t /dev/sda
NAME   ALIGNMENT MIN-IO  OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE   RA WSAME
sda            0   4096 1048576    4096     512    1 bfq       256 2048    0B
# mkfs.xfs -f -N /dev/sda3
meta-data=/dev/sda3              isize=512    agcount=4, agsize=2183680 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=1   metadir=0
data     =                       bsize=4096   blocks=8734720, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=1
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=4096  sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 extents
         =                       zoned=0      start=0 reserved=0

Note that MIN-IO == PHY-SEC, so dsunit/dswidth are zero.  With this
change, we no longer set the lsunit to the fsblock size if the log
sector size is greater than 512.  Unfortunately, dsunit is also not set,
so mkfs never sets the log sunit and it remains zero.  I think
this causes problems with the log roundoff computation in the kernel:

if (xfs_has_logv2(mp) && mp->m_sb.sb_logsunit > 1)
log->l_iclog_roundoff = mp->m_sb.sb_logsunit;
else
log->l_iclog_roundoff = BBSIZE;

because now the roundoff factor is less than the log sector size.  After
a while, the filesystem cannot be mounted anymore because:

XFS (sda3): Mounting V5 Filesystem 81b8ffa8-383b-4574-a68c-9b8202707a26
XFS (sda3): Corruption warning: Metadata has LSN (4:2729) ahead of current LSN (4:2727). Please unmount and run xfs_repair (>= v4.3) to resolve.
XFS (sda3): log mount/recovery failed: error -22
XFS (sda3): log mount failed

Reverting this patch makes the problem go away, but I think you're
trying to make it so that mkfs will set lsunit = dsunit if dsunit>0 and
the caller didn't specify any -lsunit= parameter, right?

But there's something that just seems off with this whole function.  If
the user provided a -lsunit/-lsu option then we need to validate the
value and either use it if it makes sense, or complain if not.  If the
user didn't specify any option, then we should figure it out
automatically from the other data device geometry options (internal) or
the external log device probing.

But that's not what this function does.  Why would you do this:

else if (cfg->lsectorsize > XLOG_HEADER_SIZE)
lsu = cfg->blocksize; /* lsunit matches filesystem block size */

and then loudly validate that lsu (bytes) is congruent with the fsblock
size?  This is trivially true, but then it disables the "make lsunit use
dsunit if set" logic below:

} else if (cfg->sb_feat.log_version == 2 &&
   cfg->loginternal && cfg->dsunit) {
/* lsunit and dsunit now in fs blocks */
cfg->lsunit = cfg->dsunit;
}

AFAICT, the "lsunit matches fs block size" logic is buggy.  This code
was added with no justification as part of a "reworking" commit
2f44b1b0e5adc4 ("mkfs: rework stripe calculations") back in 2017.  I
think the correct logic is to move the "lsunit matches fs block size"
logic to the no-lsunit-option code after the validation code.

This seems to set sb_logsunit to 4096 on my test VM, to 0 on the even
more boring VMs with 512 physical sectors, and to 262144 with the
scsi_debug device that Lukas Herbolt created with:

# modprobe scsi_debug inq_vendor=XFS_TEST physblk_exp=3 sector_size=512 \
opt_xferlen_exp=9 opt_blks=512 dev_size_mb=100 virtual_gb=1000

We also need to adjust the external log lsunit computation code, which
was omitted from the first version of this patch.

Cc: <linux-xfs@vger.kernel.org> # v4.15.0
Fixes: 2f44b1b0e5adc4 ("mkfs: rework stripe calculations")
Fixes: ca1eb448e116da ("mkfs.xfs fix sunit size on 512e and 4kN disks.")
Link: https://lore.kernel.org/linux-xfs/20250926123829.2101207-2-lukas@herbolt.com/
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

mkfs: fix protofile data corruption when in/out file block sizes don't match

As written in 73fb78e5ee8940, if libxfs_file_write is passed an
unaligned file range to write, it will zero the unaligned regions at the
head and tail of the block. This is what we want for a newly allocated
(and hence unwritten) block, but this is definitely not what we want
if some other part of the block has already been written.

Fix this by extending the data/hole_pos range to be aligned to the block
size of the new filesystem. This means we read slightly more, but we
never rewrite blocks in the new filesystem, sidestepping the behavior.

Found by xfs/841 when the test filesystem has a 1k fsblock size.

Cc: <linux-xfs@vger.kernel.org> # v6.13.0
Fixes: 73fb78e5ee8940 ("mkfs: support copying in large or sparse files")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libxfs: fix data corruption bug in libxfs_file_write

libxfs_file_write tries to initialize the entire file block buffer,
which includes zeroing the head portion if @pos is not aligned to the
filesystem block size. However, @buf is the file data to copy in at
position @pos, not the position of the file block. Therefore, block_off
should be added to b_addr, not buf.

Cc: <linux-xfs@vger.kernel.org> # v6.13.0
Fixes: 73fb78e5ee8940 ("mkfs: support copying in large or sparse files")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

misc: fix a few memory leaks

valgrind caught these while I was trying to debug xfs/841 regression so
fix them.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

debian: Drop Uploader: Bastian Germann

I am no longer uploading the package to Debian.
The package is the same except for debian/upstream/signing-key.asc
which I have kept on the actual signer's key for the releases.

Signed-off-by: Bastian Germann <bage@debian.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

xfs: set max_agbno to allow sparse alloc of last full inode chunk

Source kernel commit: c360004c0160dbe345870f59f24595519008926f

Sparse inode cluster allocation sets min/max agbno values to avoid
allocating an inode cluster that might map to an invalid inode
chunk. For example, we can't have an inode record mapped to agbno 0
or that extends past the end of a runt AG of misaligned size.

The initial calculation of max_agbno is unnecessarily conservative,
however. This has triggered a corner case allocation failure where a
small runt AG (i.e. 2063 blocks) is mostly full save for an extent
to the EOFS boundary: [2050,13]. max_agbno is set to 2048 in this
case, which happens to be the offset of the last possible valid
inode chunk in the AG. In practice, we should be able to allocate
the 4-block cluster at agbno 2052 to map to the parent inode record
at agbno 2048, but the max_agbno value precludes it.

Note that this can result in filesystem shutdown via dirty trans
cancel on stable kernels prior to commit 9eb775968b68 ("xfs: walk
all AGs if TRYLOCK passed to xfs_alloc_vextent_iterate_ags") because
the tail AG selection by the allocator sets t_highest_agno on the
transaction. If the inode allocator spins around and finds an inode
chunk with free inodes in an earlier AG, the subsequent dir name
creation path may still fail to allocate due to the AG restriction
and cancel.

To avoid this problem, update the max_agbno calculation to the agbno
prior to the last chunk aligned agbno in the AG. This is not
necessarily the last valid allocation target for a sparse chunk, but
since inode chunks (i.e. records) are chunk aligned and sparse
allocs are cluster sized/aligned, this allows the sb_spino_align
alignment restriction to take over and round down the max effective
agbno to within the last valid inode chunk in the AG.

Note that even though the allocator improvements in the
aforementioned commit seem to avoid this particular dirty trans
cancel situation, the max_agbno logic improvement still applies as
we should be able to allocate from an AG that has been appropriately
selected. The more important target for this patch however are
older/stable kernels prior to this allocator rework/improvement.

Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: fix an overly long line in xfs_rtgroup_calc_geometry

Source kernel commit: baed03efe223b1649320e835d7e0c03b3dde0b0c

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: mark __xfs_rtgroup_extents static

Source kernel commit: e0aea42a32984a6fd13410aed7afd3bd0caeb1c1

__xfs_rtgroup_extents is not used outside of xfs_rtgroup.c, so mark it
static. Move it and xfs_rtgroup_extents up in the file to avoid forward
declarations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: validate that zoned RT devices are zone aligned

Source kernel commit: 982d2616a2906113e433fdc0cfcc122f8d1bb60a

Garbage collection assumes all zones contain the full amount of blocks.
Mkfs already ensures this happens, but make the kernel check it as well
to avoid getting into trouble due to fuzzers or mkfs bugs.

Fixes: 2167eaabe2fa ("xfs: define the zoned on-disk format")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: remove xarray mark for reclaimable zones

Source kernel commit: bf3b8e915215ef78319b896c0ccc14dc57dac80f

We can easily check if there are any reclaimble zones by just looking
at the used counters in the reclaim buckets, so do that to free up the
xarray mark we currently use for this purpose.

Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: remove the xlog_rec_header_t typedef

Source kernel commit: ef1e275638fe6f6d54c18a770c138e4d5972b280

There are almost no users of the typedef left, kill it and switch the
remaining users to use the underlying struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: remove xlog_in_core_2_t

Source kernel commit: fe985b910e03fd91193f399a1aca9d1ea22c2557

xlog_in_core_2_t is a really odd type, not only is it grossly
misnamed because it actually is an on-disk structure, but it also
reprents the actual on-disk structure in a rather odd way.

A v1 or small v2 log header look like:

+-----------------------+
|      xlog_record      |
+-----------------------+

while larger v2 log headers look like:

+-----------------------+
|      xlog_record      |
+-----------------------+
|  xlog_rec_ext_header  |
+-------------------+---+
|         .....         |
+-----------------------+
|  xlog_rec_ext_header  |
+-----------------------+

I.e., the ext headers are a variable sized array at the end of the
header.  So instead of declaring a union of xlog_rec_header,
xlog_rec_ext_header and padding to BBSIZE, add the proper padding to
struct struct xlog_rec_header and struct xlog_rec_ext_header, and
add a variable sized array of the latter to the former.  This also
exposes the somewhat unusual scope of the log checksums, which is
made explicitly now by adding proper padding and macro designating
the actual payload length.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: add a XLOG_CYCLE_DATA_SIZE constant

Source kernel commit: 74d975ed6c9f8ba44179502a8ad5a839b38e8630

The XLOG_HEADER_CYCLE_SIZE / BBSIZE expression is used a lot
in the log code, give it a symbolic name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: use a lockref for the xfs_dquot reference count

Source kernel commit: 0c5e80bd579f7bec3704bad6c1f72b13b0d73b53

The xfs_dquot structure currently uses the anti-pattern of using the
in-object lock that protects the content to also serialize reference
count updates for the structure, leading to a cumbersome free path.
This is partially papered over by the fact that we never free the dquot
directly but always through the LRU. Switch to use a lockref instead and
move the reference counter manipulations out of q_qlock.

To make this work, xfs_qm_flush_one and xfs_qm_flush_one are converted to
acquire a dquot reference while flushing to integrate with the lockref
"get if not dead" scheme.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: add a xfs_groups_to_rfsbs helper

Source kernel commit: 0ec73eb3f12350799c4b3fb764225f6e38b42d1e

Plus a rtgroup wrapper and use that to avoid overflows when converting
zone/rtg counts to block counts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>

xfs: error tag to force zeroing on debug kernels

Source kernel commit: 66d78a11479cfea00e8d1d9d3e33f3db1597e6bf

iomap_zero_range() has to cover various corner cases that are
difficult to test on production kernels because it is used in fairly
limited use cases. For example, it is currently only used by XFS and
mostly only in partial block zeroing cases.

While it's possible to test most of these functional cases, we can
provide more robust test coverage by co-opting fallocate zero range
to invoke zeroing of the entire range instead of the more efficient
block punch/allocate sequence. Add an errortag to occasionally
invoke forced zeroing.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>

mkfs.xfs fix sunit size on 512e and 4kN disks.

Creating of XFS on 4kN or 512e disk result in suboptimal LSU/LSUNIT.
As of now we check if the sectorsize is bigger than XLOG_HEADER_SIZE
and so we set lsu to blocksize. But we do not check the the size if
lsunit can be bigger to fit the disk geometry.

Signed-off-by: Lukas Herbolt <lukas@herbolt.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_scrub_all: fix non-service-mode arguments to xfs_scrub

Back in commit 7da76e2745d6a7, we changed the default arguments to
xfs_scrub for the xfs_scrub@ service to derive the fix/preen/check mode
from the "autofsck" filesystem property instead of hardcoding "-p".
Unfortunately, I forgot to make the same update for xfs_scrub_all being
run from the CLI and directly invoking xfs_scrub.

Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1125314
Cc: linux-xfs@vger.kernel.org # v6.10.0
Fixes: 7da76e2745d6a7 ("xfs_scrub: use the autofsck fsproperty to select mode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

libfrog: enable cached report zones

Modify the function xfrog_report_zones() to default to always trying
first a cached report zones using the BLKREPORTZONEV2 ioctl.
If the kernel does not support BLKREPORTZONEV2, fall back to the
(slower) regular report zones BLKREPORTZONE ioctl.

TO enable this feature even if xfsprogs is compiled on a system where
linux/blkzoned.h does not define BLKREPORTZONEV2, this ioctl is defined
in libfrog/zones.h, together with the BLK_ZONE_REP_CACHED flag and the
BLK_ZONE_COND_ACTIVE zone condition.

Since a cached report zone always return the condition
BLK_ZONE_COND_ACTIVE for any zone that is implicitly open, explicitly
open or closed, the function xfs_zone_validate_seq() is modified to
handle this new condition as being equivalent to the implicit open,
explicit open or closed conditions.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
[hch: don't try cached reporting again if not supported]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

libfrog: lift common zone reporting code from mkfs and repair

Define the new helper function xfrog_report_zones() to report zones of
a zoned block device. This function is implemented in the new file
libfrog/zones.c and defined in the header file libfrog/zones.h and
use it from mkfs and repair instead of the previous open coded versions.

xfrog_report_zones() allocates and returns a struct blk_zone_report
structure, which can be be reused by subsequent invocations. It is the
responsibility of the caller to free this structure after use.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
[hch: refactored to allow buffer reuse]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

mkfs: remove unnecessary return value affectation

The function report_zones() in mkfs/xfs_mkfs.c is a void function. So
there is no need to set the variable ret to -EIO before returning if
fstat() fails.

Fixes: 2e5a737a61d3 ("xfs_mkfs: support creating file system with zoned RT devices")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs: use blkdev_report_zones_cached()

Source kernel commit: e04ccfc28252f181ea8d469d834b48e7dece65b2

Modify xfs_mount_zones() to replace the call to blkdev_report_zones()
with blkdev_report_zones_cached() to speed-up mount operations.
Since this causes xfs_zone_validate_seq() to see zones with the
BLK_ZONE_COND_ACTIVE condition, this function is also modified to acept
this condition as valid.

With this change, mounting a freshly formatted large capacity (30 TB)
SMR HDD completes under 2s compared to over 4.7s before.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>

include blkzoned.h in platform_defs.h

We'll need to conditionally add definitions added in later version of
blkzoned.h soon. The right place for that is platform_defs.h, which
means blkzoned.h needs to be included there for cpp trickery to work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>

debian: don't explicitly reload systemd from postinst

Now that we use dh_installsystemd, it's no longer necessary to run
systemctl daemon-reload explicitly from postinst because
dh_installsystemd will inject that into the DEBHELPER section on its
own.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

xfs_mdrestore: fix restoration on filesystems with 4k sectors

Running xfs/129 on a disk with 4k LBAs produces the following failure:

--- /run/fstests/bin/tests/xfs/129.out 2025-07-15 14:41:40.210489431 -0700
+++ /run/fstests/logs/xfs/129.out.bad 2026-01-05 21:43:08.814485633 -0800
@@ -2,3 +2,8 @@ QA output created by 129
  Create the original file blocks
  Reflink every other block
  Create metadump file, restore it and check restored fs
+xfs_mdrestore: Invalid superblock disk address/length
+mount: /opt: can't read superblock on /dev/loop0.
+       dmesg(1) may have more information after failed mount system call.
+mount /dev/loop0 /opt failed
+(see /run/fstests/logs/xfs/129.full for details)

This is a failure to restore a v2 metadump to /dev/loop0.  Looking at
the metadump itself, the first xfs_meta_extent contains:

{
.xme_addr = 0,
.xme_len = 8,
}

Hrm.  This is the primary superblock on the data device, with a length
of 8x512B = 4K.  The original filesystem has this geometry:

# xfs_info /dev/sda4
meta-data=/dev/sda4              isize=512    agcount=4, agsize=2183680 blks
         =                       sectsz=4096  attr=2, projid32bit=1

In other words, a sector size of 4k because the device's LBA size is 4k.
Regrettably, the metadump validation in mdrestore assumes that the
primary superblock is only 512 bytes long, which is not correct for this
scenario.

Fix this by allowing an xme_len value of up to the maximum sector size
for xfs, which is 32k.  Also remove a redundant and confusing mask check
for the xme_addr.

Note that this error was masked (at least on little-endian platforms
that most of us test on) until recent commit 98f05de13e7815 ("mdrestore:
fix restore_v2() superblock length check") which is why I didn't spot it
earlier.

Cc: linux-xfs@vger.kernel.org # v6.6.0
Fixes: fa9f484b79123c ("mdrestore: Define mdrestore ops for v2 format")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

mkfs: quiet down warning about insufficient write zones

xfs/067 fails with the following weird mkfs message:

--- tests/xfs/067.out 2025-07-15 14:41:40.191273467 -0700
+++ /run/fstests/logs/xfs/067.out.bad 2026-01-06 16:59:11.907677987 -0800
@@ -1,4 +1,8 @@
QA output created by 067
+Warning: not enough zones (134/133) for backing requested rt size due to
+over-provisioning needs, writable size will be less than (null)
+Warning: not enough zones (134/133) for backing requested rt size due to
+over-provisioning needs, writable size will be less than (null)

In this case, MKFS_OPTIONS is set to: "-rrtdev=/dev/sdb4 -m
metadir=1,autofsck=1,uquota,gquota,pquota -d rtinherit=1 -r zoned=1
/dev/sda4"

In other words, we didn't pass an explicit rt volume size to mkfs, so
the message is a bit bogus. Let's skip printing the message when
the user did not provide an explicit rtsize parameter.

Cc: linux-xfs@vger.kernel.org # v6.18.0
Fixes: b5d372d96db1ad ("mkfs: adjust_nr_zones for zoned file system on conventional devices")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>