Darrick J. Wong [Tue, 17 Mar 2026 17:15:16 +0000 (10:15 -0700)]
libfrog: allow bitmap_free to handle a null bitmap pointer
Allow bitmap_free() callers to pass a pointer to a NULL pointer.
This will help subsequent refactorings in xfs_scrub have cleaner
bitmap_free callsites.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Sun, 22 Feb 2026 22:41:18 +0000 (14:41 -0800)]
debian: enable xfs_healer on the root filesystem by default
Now that we're finished building autonomous repair, enable the healer
service on the root filesystem by default. The root filesystem is
mounted by the initrd prior to starting systemd, which is why the
xfs_healer_start service cannot autostart the service for the root
filesystem.
dh_installsystemd won't activate a template service (aka one with an
at-sign in the name) even if it provides a DefaultInstance directive to
make that possible. Hence we enable this explicitly via the postinst
script.
Note that Debian enables services by default upon package installation,
so this is consistent with their policies. Their kernel doesn't enable
online fsck, so healer won't do much more than monitor for corruptions
and log them.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Sun, 22 Feb 2026 22:41:18 +0000 (14:41 -0800)]
mkfs: enable online repair if all backrefs are enabled
If all backreferences are enabled in the filesystem, then enable online
repair by default if the user didn't supply any other autofsck setting.
Users might as well get full self-repair capability if they're paying
for the extra metadata.
Note that it's up to each distro to enable the systemd services
according to their own service activation policies. Debian policy is to
enable all systemd services at package installation but they don't
enable online fsck in their Kconfig so the services won't activate.
RHEL and SUSE policy requires sysadmins to enable them explicitly unless
the OS vendor also ships a systemd preset file enabling the services.
Distros without systemd won't get any of the systemd services,
obviously.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Thu, 5 Mar 2026 21:41:14 +0000 (13:41 -0800)]
xfs_scrub: print systemd service names
Add a hidden switch to xfs_scrub to emit systemd service names for XFS
services targeting filesystem paths instead of open-coding the
computation in things like fstests.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Sun, 22 Feb 2026 22:41:16 +0000 (14:41 -0800)]
xfs_healer: validate that repair fds point to the monitored fs
When xfs_healer reopens a mountpoint to perform a repair, it should
validate that the opened fd points to a file on the same filesystem as
the one being monitored.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Thu, 5 Mar 2026 18:47:54 +0000 (10:47 -0800)]
xfs_healer: use statmount to find moved filesystems even faster
As noted in the previous patch, it's possible that a mounted filesystem
can move mountpoints between the time of the initial mount (at which
point xfs_healer starts) and when it actually wants to start a repair.
The previous patch fixed that problem by using getmntent to walk
/proc/self/mounts to see if it finds a mount with the same "source"
name, aka data device.
However, this is really slow if there are a lot of filesystems because
we end up wading through a lot of irrelevant information. Fortunately,
statmount() can help us here because as of Linux 7.0 we can open the
passed-in path at startup, call statmount() on it to retrieve the
mnt_id, and then call it again later with that same mnt_id to find the
mountpoint. Luckily xfs_healthmon didn't get merged until 7.0 so it's
more or less guaranteed to be there if XFS_IOC_HEALTH_MONITOR succeeds.
Obviously if this doesn't work, we can fall back to the slow walk.
This statmount code enables xfs_healer to find a filesystem that has
had its mountpoint moved to a different place in the directory tree
without the use of bind mounts and without needing to walk the entire
mount list:
# mount -t tmpfs urk /mnt
# mount --make-rprivate /mnt
# mkdir -p /mnt/a /mnt/b
# mount /dev/sda /mnt/a
# mount --move /mnt/a /mnt/b
The key here is that the struct mount object is moved, and no new ones
are created. Therefore, the original mnt_id is still usable.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Sun, 22 Feb 2026 22:41:16 +0000 (14:41 -0800)]
xfs_healer: use getmntent to find moved filesystems
It's possible that a mounted filesystem can move mountpoints between the
time of the initial mount (at which point xfs_healer starts) and when
it actually wants to start a repair. When this happens,
weakhandle::mountpoint becomes obsolete and opening it will either fail
with ENOENT or the handle revalidation will return ESTALE.
However, we do still have a means to find the mounted filesystem -- the
fsname parameter (aka the path to the data device at mount time). This
is recorded in /proc/mounts, which means that we can iterate getmntent to
see if we can find the mount elsewhere.
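As a rough illustration of the lookup described above (not the xfs_healer code, which is C and uses getmntent), the following sketch parses text in the /proc/mounts format and returns every mountpoint whose source matches. The function names and the sample data are made up for this example.

```python
import re


def unescape_mnt(field: str) -> str:
    """Undo the three-digit octal escapes used in /proc/mounts fields."""
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def find_mountpoints(mounts_text: str, fsname: str, fstype: str = "xfs") -> list[str]:
    """Return every mountpoint whose source (fsname) and type match."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        source, mountpoint, mtype = fields[0], fields[1], fields[2]
        if mtype == fstype and unescape_mnt(source) == fsname:
            found.append(unescape_mnt(mountpoint))
    return found


# Hypothetical sample in /proc/mounts format; a space in a path shows up as \040.
SAMPLE = """\
proc /proc proc rw 0 0
/dev/sda /opt xfs rw,relatime 0 0
/dev/sdb /srv/my\\040data xfs rw 0 0
"""

print(find_mountpoints(SAMPLE, "/dev/sda"))
```

A real caller would still have to reopen each candidate and revalidate the handle, since a matching source name alone does not prove it is the same filesystem.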
As documented a few patches ago, this would be easier if we had
revocable fds that didn't pin mounts, but that's a very big ask.
This getmntent code enables xfs_healer to find a filesystem that has
been bind mounted in a new place and the original mountpoint detached:
# mount /dev/sda /mnt
# xfs_healer /mnt &
# mount /mnt /opt --bind
# umount /mnt
The key here is that each bind mount gets a separate struct mount
object.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Sun, 22 Feb 2026 22:41:16 +0000 (14:41 -0800)]
xfs_healer: run full scrub after lost corruption events or targeted repair failure
If we fail to perform a spot repair of metadata or the kernel tells us
that it lost corruption events due to queue limits, initiate a full run
of the online fsck service to try to fix the error.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Sun, 22 Feb 2026 22:41:15 +0000 (14:41 -0800)]
xfs_healer: don't start service if kernel support unavailable
Use ExecCondition= in the system service to check if kernel support for
the health monitor is available. If not, we don't want to run the
service, have it fail, and generate a bunch of silly log messages.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Sun, 22 Feb 2026 22:41:15 +0000 (14:41 -0800)]
xfs_healer: create a service to start the per-mount healer service
Create a daemon to wait for xfs mount events via fsnotify and start up
the per-mount healer service. It's important that we're running in the
same mount namespace as the mount, so we're a fanotify client to avoid
having to filter the mount namespaces ourselves.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Thu, 5 Mar 2026 21:26:27 +0000 (13:26 -0800)]
xfs_healer: create a per-mount background monitoring service
Create a systemd service definition for our self-healing filesystem
daemon so that we can run it for every mounted filesystem. Add a
hidden switch so that we can print the service unit name for fstests.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Sun, 22 Feb 2026 22:41:14 +0000 (14:41 -0800)]
xfs_healer: use getparents to look up file names
If the kernel tells us about something that happened to a file, use the
GETPARENTS ioctl to try to look up the path to that file for more
ergonomic reporting.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Sun, 22 Feb 2026 22:41:14 +0000 (14:41 -0800)]
xfs_healer: enable repairing filesystems
Make it so that our health monitoring daemon can initiate repairs in
response to reports of corrupt filesystem metadata. Repairs are
initiated from the background workers as explained in the previous
patch.
Note that just like xfs_scrub, xfs_healer's ability to repair metadata
relies heavily on back references such as reverse mappings and directory
parent pointers to add redundancy to the filesystem. Check for these
two features and whine a bit if they are missing, just like scrub.
There's a bit of trickery with the fd that is used to initiate repairs
in the kernel. Because an open fd will pin the filesystem in memory,
xfs_healer can only hold an open fd to the target filesystem while it's
performing repairs. Therefore, at startup xfs_healer must sample enough
information about the target filesystem to reconnect to it later on.
Currently, the fs source (aka the data device path) and the root
directory handle are sufficient to do this.
Someday we might be able to have revocable fds, which would eliminate
the need for such efforts in userspace.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Sun, 22 Feb 2026 22:41:14 +0000 (14:41 -0800)]
xfs_healer: create daemon to listen for health events
Create a daemon program that can listen for and log health events.
Eventually this will be used to self-heal filesystems in real time.
Because events can take a while to process, the main thread reads event
objects from the healthmon fd and dispatches them to a background
workqueue as quickly as it can. This split of responsibilities is
necessary because the kernel event queue will drop events if the queue
fills up, and each event can take some time to process (logging,
repairs, etc.) so we don't want to lose events.
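The reader/worker split described above can be sketched roughly as follows. This is only an illustration of the pattern, not the daemon's implementation: the event source is an ordinary iterable standing in for read() on the healthmon fd, and run_monitor and handle are hypothetical names.

```python
import queue
import threading


def run_monitor(events, handle, nr_workers=1):
    """Read events quickly on this thread; process them on worker threads."""
    work = queue.Queue()  # unbounded, so the reader never blocks on put()

    def worker():
        while True:
            ev = work.get()
            if ev is None:  # sentinel: no more events
                break
            handle(ev)  # the slow part: logging, repairs, ...

    workers = [threading.Thread(target=worker) for _ in range(nr_workers)]
    for t in workers:
        t.start()
    for ev in events:  # stands in for read() on the healthmon fd
        work.put(ev)  # hand off immediately, then go back to reading
    for _ in workers:
        work.put(None)  # one sentinel per worker
    for t in workers:
        t.join()
```

With a single worker, events are handled in arrival order; the reader only ever pays the cost of a queue insertion per event.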
To be clear, xfs_healer and xfs_scrub are complementary tools:
Scrub walks the whole filesystem, finds stuff that needs fixing or
rebuilding, and rebuilds it. This is sort of analogous to a patrol
scrub.
Healer listens for metadata corruption messages from the kernel and
issues a targeted repair of that structure. This is kind of like an
on-demand scrub.
My end goal is that xfs_healer (the service) is active all the time and
can respond instantly to a corruption report, whereas xfs_scrub (the
service) gets run periodically as a cron job.
xfs_healer can decide that it's overwhelmed with problems and start
xfs_scrub to deal with the mess. Ideally you don't crash the filesystem
and then have to use xfs_repair to smash your way back to a mountable
filesystem.
By default we run xfs_healer as a background service, which means that
we only start two threads -- one to read the events, and another to
process them. In other words, we try not to use all available hardware
resources for repairs. The foreground mode switch starts up a large
number of threads to try to increase parallelism, which may or may not
be useful for repairs depending on how much metadata the kernel needs to
scan.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Sun, 22 Feb 2026 22:41:13 +0000 (14:41 -0800)]
man2: document the media verification ioctl
Document XFS_IOC_VERIFY_MEDIA, which is a new ioctl for xfs_scrub to
perform media scans on the disks underneath the filesystem. This will
enable media errors to be reported to xfs_healer and fsnotify.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 3 Mar 2026 19:01:39 +0000 (11:01 -0800)]
libfrog: add wrappers for listmount and statmount
Add some wrappers for listmount and statmount so that we don't have to
open-code the kernel ABI quirks in every utility program that uses it.
Note that glibc seems to have discussed providing a wrapper in late 2023
but took no action; and the listmount manpage says that there is no
glibc wrapper.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Sun, 22 Feb 2026 22:41:12 +0000 (14:41 -0800)]
libfrog: add support code for starting systemd services programmatically
Add some simple routines for computing the name of systemd service
instances and starting systemd services. These will be used by the
xfs_healer_start service to start per-filesystem xfs_healer service
instances.
Note that we run systemd helper programs as subprocesses for a couple of
reasons. First, the path-escaping functionality is not a part of any
library-accessible API, which means it can only be accessed via
systemd-escape(1). Second, although the service startup functionality
can be reached via dbus, doing so would introduce a new library
dependency. Systemd is also undergoing a dbus -> varlink RPC transition
so we avoid that mess by calling the cli systemctl(1) program.
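For readers unfamiliar with the escaping step, here is a simplified pure-Python approximation of what systemd-escape --path does to a mountpoint. It is for illustration only: libfrog runs the real systemd-escape(1) as a subprocess, this sketch does not handle "." or ".." path components, and the "xfs_healer" template name used in the example call is an assumption.

```python
def escape_path(path: str) -> str:
    """Approximate systemd path escaping for a mountpoint."""
    parts = [p for p in path.split("/") if p]  # drop empty components
    if not parts:
        return "-"  # the root directory
    out = []
    for i, byte in enumerate("/".join(parts).encode()):
        ch = chr(byte)
        if ch == "/":
            out.append("-")
        elif (ch.isascii() and ch.isalnum()) or ch in ":_" or (ch == "." and i > 0):
            out.append(ch)
        else:
            out.append(f"\\x{byte:02x}")  # everything else becomes \xNN
    return "".join(out)


def instance_name(template: str, mountpoint: str) -> str:
    """Build a template instance name such as template@escaped-path.service."""
    return f"{template}@{escape_path(mountpoint)}.service"


# "xfs_healer" is assumed here as the template name for illustration.
print(instance_name("xfs_healer", "/mnt/a"))
```

Note that a literal dash in the path has to be escaped, because an unescaped dash is how the path separator is encoded.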
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Sun, 22 Feb 2026 22:41:12 +0000 (14:41 -0800)]
libfrog: add a function to grab the path from an open fd and a file handle
handle_walk_paths operates on a file handle, but requires that the fs
has been registered with libhandle via path_to_fshandle. For a normal
libhandle client this is the desirable behavior because the application
*should* maintain an open fd to the filesystem mount.
However for xfs_healer this isn't going to work well because the healer
mustn't pin the mount while it's running. It's smart enough to know how
to find and reconnect to the mountpoint, but libhandle doesn't have any
such concept.
Therefore, alter the libfrog getparents code so that xfs_healer can pass
in the mountpoint and reconnected fd without needing libhandle. All
we're really doing here is trying to obtain a user-visible path for a
file that encountered problems for logging purposes; if it fails, we'll
fall back to logging the inode number.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Carlos Maiolino [Tue, 7 Apr 2026 11:14:59 +0000 (13:14 +0200)]
fsr: package function should check for negative errors
xfrog_defragrange(), like most other libfrog functions, returns a
negative error value, while xfs_fsr's packfile() expects a positive
error value.
Whenever xfrog_defragrange() fails, the switch statement always falls
into the default clause, making the error message pointless.
Fix this by negating the return value of the xfrog_defragrange() call.
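The sign mismatch is easy to reproduce in miniature. In this sketch the stub, the caller, and the messages are all hypothetical; only the convention (callee returns -errno, caller matches on positive errno) comes from the description above.

```python
import errno


def defragrange_stub() -> int:
    """Hypothetical stand-in for a libfrog call: 0 on success, -errno on failure."""
    return -errno.EBUSY


def describe_failure(err: int) -> str:
    """Caller that, like a C switch on errno, expects a positive error code."""
    if err == errno.EBUSY:
        return "file is busy"
    if err == errno.EINVAL:
        return "invalid arguments"
    return f"unexpected error {err}"  # the default case


ret = defragrange_stub()
wrong = describe_failure(ret)   # negative value: always hits the default case
right = describe_failure(-ret)  # negated first: the specific case matches
print(wrong)
print(right)
```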
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
xfs_defer_can_append returns a bool, so it shouldn't be returning
NULL.
Found by code inspection.
Fixes: 4dffb2cbb483 ("xfs: allow pausing of pending deferred work items") Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Souptick Joarder <souptick.joarder@hpe.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
kzalloc() is called with __GFP_NOFAIL, so a NULL return is not expected.
Drop the redundant !map check in xfs_dabuf_map().
Also switch the nirecs-sized allocation to kcalloc().
Signed-off-by: hongao <hongao@uniontech.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The ioctl structures in libxfs/xfs_fs.h are missing static size checks.
It is useful to have static size checks for these structures as adding
new fields to them could cause issues (e.g. extra padding that may be
inserted by the compiler). So add these checks to xfs/xfs_ondisk.h.
Due to different padding/alignment requirements across different
architectures, to avoid build failures, some structures are omitted from
the size checks. For example, structures with "compat_" definitions in
xfs/xfs_ioctl32.h are omitted.
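To see why such checks are useful and why some structures cannot have a single expected size, consider this small ctypes sketch. The structure is hypothetical (not a real XFS ioctl structure), and the check runs at runtime here, whereas the kernel checks are evaluated at build time.

```python
import ctypes


class example_ioc_args(ctypes.Structure):
    """Hypothetical ioctl argument block; not a real XFS structure."""
    _fields_ = [
        ("flags", ctypes.c_uint32),
        ("offset", ctypes.c_uint64),
    ]


def check_struct_size(struct, expected: int) -> None:
    """Fail loudly if the structure is not the size we promised userspace."""
    actual = ctypes.sizeof(struct)
    if actual != expected:
        raise AssertionError(f"{struct.__name__}: size {actual}, expected {expected}")


# Bytes occupied by the fields themselves: 4 + 8.
payload = sum(ctypes.sizeof(ftype) for _, ftype in example_ioc_args._fields_)
# Whatever the ABI adds on top is padding; the amount depends on the platform's
# alignment rules for 64-bit integers.
padding = ctypes.sizeof(example_ioc_args) - payload
print(payload, padding)
```

A structure whose padding differs between ABIs is exactly the kind that would make a single hard-coded size check fail on some architectures.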
Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Add a comment explaining why the sb_frextents are updated outside the
if (xfs_has_lazycount(mp)) check even though it is a lazycounter.
RT groups are supported only in v5 filesystems which always have
lazycounter enabled - so putting it inside the if (xfs_has_lazycount(mp))
check is redundant.
Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The active inode (or active vnode until recently) stat can get much larger
than expected on file systems with a lot of metafile inodes like zoned
file systems on SMR hard disks with 10,000s of rtg rmap inodes.
Remove all metafile inodes from the active counter to make it more useful
to track actual workloads and add a separate counter for active metafile
inodes.
This fixes xfs/177 on SMR hard drives.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Introduce xfs_growfs_compute_delta() to calculate the nagcount
and delta blocks and refactor the code from xfs_growfs_data_private().
No functional changes.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Add a new errortag to test that zone reset errors are handled correctly.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
We can trust XFS developers enough to not pass random stuff to
XFS_ERROR_TEST/DELAY. Open code the validity check in xfs_errortag_add,
which is the only place that receives unvalidated error tag values from
user space, and drop the now pointless xfs_errortag_enabled helper.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
If we're trying to replace an xattr in a shortform attr structure and
the old entry fits the new entry, we can just memcpy and exit without
having to delete, compact, and re-add the entry (or worse use the attr
intent machinery). For parent pointers this only helps renames
where the filename length stays the same (e.g. mv autoexec.bat
scandisk.exe) but for regular xattrs it might be useful for updating
security labels and the like.
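A toy model of the fast path, assuming that "fits" means the old and new entries are the same size (as in the rename example above, where both names are 12 bytes long). The buffer layout and function name are invented for illustration and do not reflect the shortform attr format.

```python
def replace_in_place(buf: bytearray, offset: int, old_len: int, new_value: bytes) -> bool:
    """Overwrite buf[offset:offset+old_len] if the new value is the same size."""
    if len(new_value) != old_len:
        return False  # caller must fall back to remove, compact, and re-add
    buf[offset:offset + old_len] = new_value  # the memcpy equivalent
    return True


buf = bytearray(b"hdr:autoexec.bat:tail")
ok = replace_in_place(buf, 4, len(b"autoexec.bat"), b"scandisk.exe")
print(ok, bytes(buf))
```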
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
After a recent fsmark benchmarking run, I observed that the overhead of
parent pointers on file creation and deletion can be a bit high. On a
machine with 20 CPUs, 128G of memory, and an NVME SSD capable of pushing
750000iops, I see the following results:
So we created 40 AGs, one per CPU. Now we create 40 directories and run
fsmark:
$ time fs_mark -D 10000 -S 0 -n 100000 -s 0 -L 8 -d ...
# Version 3.3, 40 thread(s) starting at Wed Dec 10 14:22:07 2025
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 0 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
parent=0 parent=1
================== ==================
real 0m57.573s real 1m2.934s
user 3m53.578s user 3m53.508s
sys 19m44.440s sys 25m14.810s
$ time rm -rf ...
parent=0 parent=1
================== ==================
real 0m59.649s real 1m12.505s
user 0m41.196s user 0m47.489s
sys 13m9.566s sys 20m33.844s
Parent pointers increase the system time by 28% to create 32
million files that are totally empty. Removing them incurs a system
time increase of 56%. Wall time increases by 9% and 22%.
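The percentages above can be re-derived from the time(1) figures with a few lines of arithmetic. The helper names are just for this example.

```python
def seconds(stamp: str) -> float:
    """Convert time(1) output such as '19m44.440s' to seconds."""
    minutes, rest = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(rest)


def overhead_pct(base: str, with_parent: str) -> int:
    """Percentage increase of with_parent over base, rounded to an integer."""
    return round((seconds(with_parent) / seconds(base) - 1) * 100)


create_sys = overhead_pct("19m44.440s", "25m14.810s")
remove_sys = overhead_pct("13m9.566s", "20m33.844s")
create_real = overhead_pct("0m57.573s", "1m2.934s")
remove_real = overhead_pct("0m59.649s", "1m12.505s")
print(create_sys, remove_sys, create_real, remove_real)  # prints "28 56 9 22"
```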
For most filesystems, each file tends to have a single owner and not
that many xattrs. If the xattr structure is shortform, then all xattr
changes are logged with the inode and do not require the xattr
intent mechanism to persist the parent pointer.
Therefore, we can speed up parent pointer operations by calling the
shortform xattr functions directly if the child's xattr is in short
format. Now the overhead looks like:
parent=0 parent=1
================== ==================
real 0m58.030s real 1m0.983s
user 3m54.141s user 3m53.758s
sys 19m57.003s sys 21m30.605s
$ time rm -rf ...
parent=0 parent=1
================== ==================
real 0m58.911s real 1m4.420s
user 0m41.329s user 0m45.169s
sys 13m27.857s sys 15m58.564s
Now parent pointers only increase the system time by 8% for creation and
19% for deletion. Wall time increases by 5% and 9% now.
Close the performance gap by creating helpers for the attr set, remove,
and replace operations that will try to make direct shortform updates,
and fall back to the attr intent machinery if that doesn't work. This
works for regular xattrs and for parent pointers.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Check for erroneous overlapping freemap regions and collisions between
freemap regions and the xattr leaf entry array.
Note that we must explicitly zero out the extra freemaps in
xfs_attr3_leaf_compact so that the in-memory buffer has a correctly
initialized freemap array to satisfy the new verification code, even if
subsequent code changes the contents before unlocking the buffer.
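The two conditions being checked can be modelled like this. It is a sketch rather than the verifier itself: regions are plain (base, size) tuples, the sample values are hypothetical, and treating zero-size regions as exempt is an assumption.

```python
def check_freemap(freemap, entries_end):
    """Return a list of problems found in (base, size) freemap regions."""
    problems = []
    # Assumption: zero-size regions track nothing, so they cannot collide.
    live = [(base, size) for base, size in freemap if size > 0]
    for base, size in live:
        if base < entries_end:
            problems.append(
                f"region at {base} collides with entries array ending at {entries_end}")
    for i, (b1, s1) in enumerate(live):
        for b2, s2 in live[i + 1:]:
            if b1 < b2 + s2 and b2 < b1 + s1:  # half-open intervals intersect
                problems.append(f"regions at {b1} and {b2} overlap")
    return problems


print(check_freemap([(448, 0), (456, 100), (2120, 4)], entries_end=448))
print(check_freemap([(448, 0), (440, 100), (500, 20)], entries_end=448))
```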
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Upon enabling quite a lot more debugging code, I narrowed this down to
fsstress trying to set a local extended attribute with namelen=3 and
valuelen=71. This results in an entry size of 80 bytes.
At the start of xfs_attr3_leaf_add_work, the freemap looks like this:
i 0 base 448 size 0 rhs 448 count 46
i 1 base 388 size 132 rhs 448 count 46
i 2 base 2120 size 4 rhs 448 count 46
firstused = 520
where "rhs" is the first byte past the end of the leaf entry array.
This is inconsistent -- the entries array ends at byte 448, but
freemap[1] says there's free space starting at byte 388!
By the end of the function, the freemap is in worse shape:
i 0 base 456 size 0 rhs 456 count 47
i 1 base 388 size 52 rhs 456 count 47
i 2 base 2120 size 4 rhs 456 count 47
firstused = 440
Important note: 388 is not aligned with the entries array element size
of 8 bytes.
Based on the incorrect freemap, the name area starts at byte 440, which
is below the end of the entries array! That's why the assertion
triggers and the filesystem shuts down.
How did we end up here? First, recall from the previous patch that the
freemap array in an xattr leaf block is not intended to be a
comprehensive map of all free space in the leaf block. In other words,
it's perfectly legal to have a leaf block with:
* 376 bytes in use by the entries array
* freemap[0] has [base = 376, size = 8]
* freemap[1] has [base = 388, size = 1500]
* the space between 376 and 388 is free, but the freemap stopped
tracking that some time ago
If we add one xattr, the entries array grows to 384 bytes, and
freemap[0] becomes [base = 384, size = 0]. So far, so good. But if we
add a second xattr, the entries array grows to 392 bytes, and freemap[0]
gets pushed up to [base = 392, size = 0]. This is bad, because
freemap[1] hasn't been updated, and now the entries array and the free
space claim the same space.
The fix here is to adjust all freemap entries so that none of them
collide with the entries array. Note that this fix relies on commit 2a2b5932db6758 ("xfs: fix attr leaf header freemap.size underflow") and
the previous patch that resets zero length freemap entries to have
base = 0.
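The example from earlier in this message can be replayed with a small model of the adjustment. This is only a sketch of the idea (trim or reset every region that the grown entries array now covers), not the kernel code; it ignores where the name/value bytes are allocated, and the function name is invented.

```python
ENTRY_SIZE = 8  # bytes per element of the leaf entries array (from the text)


def grow_entries(entries_end, freemap):
    """Grow the entries array by one element and fix up every freemap region."""
    new_end = entries_end + ENTRY_SIZE
    adjusted = []
    for base, size in freemap:
        overlap = new_end - base  # bytes of this region now under the array
        if size == 0 or overlap >= size:
            adjusted.append((0, 0))  # nothing left to track: reset the region
        elif overlap > 0:
            adjusted.append((new_end, size - overlap))  # trim the front
        else:
            adjusted.append((base, size))  # untouched
    return new_end, adjusted


end, fm = 376, [(376, 8), (388, 1500), (0, 0)]
end, fm = grow_entries(end, fm)  # first xattr: entries array ends at 384
print(end, fm)
end, fm = grow_entries(end, fm)  # second xattr: entries array ends at 392
print(end, fm)
```

After the second call the region that used to start at 388 starts at 392, so the entries array and the free space no longer claim the same bytes.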
Fixes: 1da177e4c3f415 ("Linux-2.6.12-rc2") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Back in commit 2a2b5932db6758 ("xfs: fix attr leaf header freemap.size
underflow"), Brian Foster observed that it's possible for a small
freemap at the end of the xattr entries array to experience
a size underflow when subtracting the space consumed by an expansion of
the entries array. There are only three freemap entries, which means
that it is not a complete index of all free space in the leaf block.
This code can leave behind a zero-length freemap entry with a nonzero
base. Subsequent setxattr operations can increase the base up to the
point that it overlaps with another freemap entry. This isn't in and of
itself a problem because the code in _leaf_add that finds free space
ignores any freemap entry with zero size.
However, there's another bug in the freemap update code in _leaf_add,
which is that it fails to update a freemap entry that begins midway
through the xattr entry that was just appended to the array. That can
result in the freemap containing two entries with the same base but
different sizes (0 for the "pushed-up" entry, nonzero for the entry
that's actually tracking free space). A subsequent _leaf_add can then
allocate xattr namevalue entries on top of the entries array, leading to
data loss. But fixing that is for later.
For now, eliminate the possibility of confusion by zeroing out the base
of any freemap entry that has zero size. Because the freemap is not
intended to be a complete index of free space, a subsequent failure to
find any free space for a new xattr will trigger block compaction, which
regenerates the freemap.
It looks like this bug has been in the codebase for quite a long time.
Fixes: 1da177e4c3f415 ("Linux-2.6.12-rc2") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Currently xfs_zone_validate mixes validating the software zone state in
the XFS realtime group with validating the hardware state reported in
struct blk_zone and deriving the write pointer from that.
Move all code that works on the realtime group to xfs_init_zone, and only
keep the hardware state validation in xfs_zone_validate. This makes the
code more clear, and allows for better reuse in userspace.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Add a helper to figure out the on-disk size of a group, accounting for the
XFS_SB_FEAT_INCOMPAT_ZONE_GAPS feature if needed.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Add the missing forward declaration for struct blk_zone in xfs_zones.h.
This avoids headaches with the order of header file inclusion to avoid
compilation errors.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The calling convention of xfs_attr_leaf_hasname() is problematic, because
it returns a NULL buffer when xfs_attr3_leaf_read fails, a valid buffer
when xfs_attr3_leaf_lookup_int returns -ENOATTR or -EEXIST, and a
non-NULL buffer pointer for an already released buffer when
xfs_attr3_leaf_lookup_int fails with other error values.
Fix this by simply open coding xfs_attr_leaf_hasname in the callers, so
that the buffer release code is done by each caller of
xfs_attr3_leaf_read.
Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines") Reported-by: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The xfs.h header conflicts with the public xfs.h in xfsprogs, leading
to a spurious difference in all shared libxfs files that have to
include libxfs_priv.h in userspace. Directly include xfs_platform.h so
that we can add a header of the same name to xfsprogs and remove this
major annoyance for the shared code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Add a new privileged ioctl so that xfs_scrub can ask the kernel to
verify the media of the devices backing an xfs filesystem, and have any
resulting media errors reported to fsnotify and xfs_healer.
To accomplish this, the kernel allocates a folio between the base page
size and 1MB, and issues read IOs to a gradually incrementing range of
one of the storage devices underlying an xfs filesystem. If any error
occurs, that raw error is reported to the calling process. If the error
happens to be one of the ones that the kernel considers indicative of
data loss, then it will also be reported to xfs_healthmon and fsnotify.
Driving the verification from the kernel enables xfs (and by extension
xfs_scrub) to have precise control over the size and error handling of
IOs that are issued to the underlying block device, and to emit
notifications about problems to other relevant kernel subsystems
immediately.
Note that the caller is also allowed to reduce the size of the IO and
to ask for a relaxation period after each IO.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new ioctl for the healthmon file that checks that a given fd
points to the same filesystem that the healthmon file is monitoring.
This allows xfs_healer to check that when it reopens a mountpoint to
perform repairs, the file that it gets matches the filesystem that
generated the corruption report.
(Note that xfs_healer doesn't maintain an open fd to a filesystem that
it's monitoring so that it doesn't pin the mount.)
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Connect the fserror reporting to the health monitor so that xfs can send
events about file I/O errors to the xfs_healer daemon. These events are
entirely informational because xfs cannot regenerate user data, so
hopefully the fsnotify I/O error event gets noticed by the relevant
management systems.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Connect the fsdax media failure notification code to the health monitor
so that xfs can send events about that to the xfs_healer daemon.
Later on we'll add the ability for the xfs_scrub media scan (phase 6) to
report the errors that it finds to the kernel so that those are also
logged by xfs_healer.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Connect the filesystem metadata health event collection system to the
health monitor so that xfs can send events to xfs_healer as it collects
information.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Create the basic infrastructure that we need to report health events to
userspace. We need a compact form for recording critical information
about an event and queueing them; a means to notice that we've lost some
events; and a means to format the events into something that userspace
can handle. Make the kernel export C structures via read().
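The three requirements above (a compact record, a queue, and a way to notice lost events) can be sketched as a fixed-size ring buffer with a drop counter. This is only an illustration: struct hm_event, struct hm_queue, hm_push() and hm_pop() are hypothetical names, and the field layout is an assumption, not the format that the kernel actually exports.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compact event record; not the kernel's actual layout. */
struct hm_event {
	uint32_t	type;	/* what happened */
	uint32_t	domain;	/* which part of the filesystem */
	uint64_t	detail;	/* e.g. an inode or group number */
};

#define HM_QUEUE_DEPTH	4

struct hm_queue {
	struct hm_event	events[HM_QUEUE_DEPTH];
	size_t		head;	/* next slot to read */
	size_t		count;	/* events currently queued */
	uint64_t	lost;	/* events dropped because the queue was full */
};

/* Queue an event, or count it as lost if there is no room. */
static bool hm_push(struct hm_queue *q, const struct hm_event *ev)
{
	assert(q != NULL && ev != NULL);

	if (q->count == HM_QUEUE_DEPTH) {
		q->lost++;
		return false;
	}
	q->events[(q->head + q->count) % HM_QUEUE_DEPTH] = *ev;
	q->count++;
	return true;
}

/* Dequeue the oldest event; returns false if the queue is empty. */
static bool hm_pop(struct hm_queue *q, struct hm_event *ev)
{
	assert(q != NULL && ev != NULL);

	if (q->count == 0)
		return false;
	*ev = q->events[q->head];
	q->head = (q->head + 1) % HM_QUEUE_DEPTH;
	q->count--;
	return true;
}
```

A reader that sees a nonzero lost count knows that its view of the filesystem is incomplete, even though it cannot tell which events went missing.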
In a previous iteration of this new subsystem, I wanted to explore data
exchange formats that are more flexible and easier for humans to read
than C structures. The thought being that when we want to rev (or
worse, enlarge) the event format, it ought to be trivially easy to do
that in a way that doesn't break old userspace.
I looked at formats such as protobufs and capnproto. These look really
nice in that extending the wire format is fairly easy, you can give it a
data schema and it generates the serialization code for you, handles
endianness problems, etc. The huge downside is that neither supports C
all that well.
Too hard, and didn't want to port either of those huge sprawling
libraries first to the kernel and then again to xfsprogs. Then I
thought, how about JSON? Javascript objects are human readable, the
kernel can emit json without much fuss (it's all just strings!) and
there are plenty of interpreters for python/rust/c/etc.
There's a proposed schema format for json, which means that xfs can
publish a description of the events that kernel will emit. Userspace
consumers (e.g. xfsprogs/xfs_healer) can embed the same schema document
and use it to validate the incoming events from the kernel, which means
it can discard events that it doesn't understand, or garbage being
emitted due to bugs.
However, json has a huge crutch -- javascript is well known for its
vague definitions of what are numbers. This makes expressing a large
number rather fraught, because the runtime is free to represent a number
in nearly any way it wants. Stupider ones will truncate values to word
size, others will roll out doubles for uint52_t (yes, fifty-two) with
the resulting loss of precision. Not good when you're dealing with
discrete units.
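The precision problem is easy to reproduce in C by pushing a 64-bit integer through an IEEE 754 double, which is how a javascript-style runtime may store every number. The helper name survives_double() is made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static_assert(sizeof(double) == 8, "sketch assumes a 64-bit IEEE 754 double");

/*
 * Return true if @val is unchanged after a round trip through a double.
 * A double has a 53-bit significand, so not every uint64_t fits exactly.
 */
static bool survives_double(uint64_t val)
{
	/* volatile forces the value to be stored as a 64-bit double */
	volatile double d = (double)val;

	/* 2^64 and above cannot be converted back to uint64_t at all */
	if (d >= 18446744073709551616.0)
		return false;
	return (uint64_t)d == val;
}
```

For example, survives_double(1ULL << 53) is true, but survives_double((1ULL << 53) + 1) is false because the value is rounded to the nearest representable double.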
It just so happens that python's json library is smart enough to see a
sequence of digits and put them in a u64 (at least on x86_64/aarch64)
but an actual javascript interpreter (pasting into Firefox) isn't
necessarily so clever.
It turns out that none of the proposed json schemas were ever ratified
even in an open-consensus way, so json blobs are still just loosely
structured blobs. The parsing in userspace was also noticeably slow and
memory-consumptive.
Hence only the C interface survives.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Start creating helper functions and infrastructure to pass filesystem
health events to a health monitoring file. Since this is an
administrative interface, we only support a single health monitor
process per filesystem, so we don't need to use anything fancy such as
notifier chains (== tons of indirect calls).
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Sun, 22 Feb 2026 22:41:02 +0000 (14:41 -0800)]
libfrog: hoist some utilities from libxfs
This started with a desire to move the duplicate cmn_err declarations in
libxfs into libfrog/util.h ahead of the patch that renames libxfs_priv.h
to xfs_platform.h.
Then this patch expanded in scope when I realized that there were
several other utility functions that weren't specific to xfs; those go
in libfrog.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 3 Mar 2026 18:48:55 +0000 (10:48 -0800)]
libxfs: fix XFS_STATS_DEC
This macro only takes two arguments in the kernel, so fix the definition
here too. All existing callsites #if 0 it into oblivion which is why
we've never noticed, but an upcoming patch in the libxfs sync will not
be so lucky.
Cc: <linux-xfs@vger.kernel.org> # v4.9.0 Fixes: ece930fa14a343 ("xfs: refactor xfs_bunmapi_cow") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Sat, 14 Mar 2026 15:57:26 +0000 (08:57 -0700)]
xfs_repair: don't fail on INCOMPLETE attrs in leaf blocks
While trying to fix problems in generic/753, I noticed test failures on
account of xfs_repair:
attribute entry #4 in attr block 0, inode 131 is INCOMPLETE
problem with attribute contents in inode 131
would clear attr fork
bad nblocks 4 for inode 131, would reset to 0
bad anextents 1 for inode 131, would reset to 0
Looking at the dumped filesystem, inode 131 is a linked file, and the
"incomplete" xattr was clearly part of an xfs_attr_set operation that
failed midway through because the induced log shutdown prevented xfs
from finishing the creation of a remote xattr. This kind of thing is
expected, but instead xfs_repair deletes the entire attr fork!
It's far too drastic to delete every xattr because doing that destroys
things like security labels. The kernel won't show incomplete attrs so
it's not a big deal to leave them attached to the file. Note that
xfs_scrub can fix such things.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Note that MIN-IO == PHY-SEC, so dsunit/dswidth are zero. With this
change, we no longer set the lsunit to the fsblock size if the log
sector size is greater than 512. Unfortunately, dsunit is also not set,
so mkfs never sets the log sunit and it remains zero. I think
this causes problems with the log roundoff computation in the kernel:
because now the roundoff factor is less than the log sector size. After
a while, the filesystem cannot be mounted anymore because:
XFS (sda3): Mounting V5 Filesystem 81b8ffa8-383b-4574-a68c-9b8202707a26
XFS (sda3): Corruption warning: Metadata has LSN (4:2729) ahead of current LSN (4:2727). Please unmount and run xfs_repair (>= v4.3) to resolve.
XFS (sda3): log mount/recovery failed: error -22
XFS (sda3): log mount failed
Reverting this patch makes the problem go away, but I think you're
trying to make it so that mkfs will set lsunit = dsunit if dsunit>0 and
the caller didn't specify any -lsunit= parameter, right?
But there's something that just seems off with this whole function. If
the user provided a -lsunit/-lsu option then we need to validate the
value and either use it if it makes sense, or complain if not. If the
user didn't specify any option, then we should figure it out
automatically from the other data device geometry options (internal) or
the external log device probing.
But that's not what this function does. Why would you do this:
and then loudly validate that lsu (bytes) is congruent with the fsblock
size? This is trivially true, but then it disables the "make lsunit use
dsunit if set" logic below:
} else if (cfg->sb_feat.log_version == 2 &&
cfg->loginternal && cfg->dsunit) {
/* lsunit and dsunit now in fs blocks */
cfg->lsunit = cfg->dsunit;
}
AFAICT, the "lsunit matches fs block size" logic is buggy. This code
was added with no justification as part of a "reworking" commit 2f44b1b0e5adc4 ("mkfs: rework stripe calculations") back in 2017. I
think the correct logic is to move the "lsunit matches fs block size"
logic to the no-lsunit-option code after the validation code.
This seems to set sb_logsunit to 4096 on my test VM, to 0 on the even
more boring VMs with 512 physical sectors, and to 262144 with the
scsi_debug device that Lukas Herbolt created with:
Darrick J. Wong [Mon, 2 Mar 2026 20:46:56 +0000 (12:46 -0800)]
mkfs: fix protofile data corruption when in/out file block sizes don't match
As written in 73fb78e5ee8940, if libxfs_file_write is passed an
unaligned file range to write, it will zero the unaligned regions at the
head and tail of the block. This is what we want for a newly allocated
(and hence unwritten) block, but this is definitely not what we want
if some other part of the block has already been written.
Fix this by extending the data/hole_pos range to be aligned to the block
size of the new filesystem. This means we read slightly more, but we
never rewrite blocks in the new filesystem, sidestepping the behavior.
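The range extension can be sketched as rounding the start down and the end up to the block size of the new filesystem. The struct byte_range type and the align_range() helper are hypothetical names for illustration; the real patch presumably also has to deal with end of file, which this sketch ignores.

```c
#include <assert.h>
#include <stdint.h>

struct byte_range {
	uint64_t	pos;
	uint64_t	len;
};

/* Expand [pos, pos + len) outward to filesystem block boundaries. */
static struct byte_range align_range(uint64_t pos, uint64_t len,
				     uint64_t blocksize)
{
	struct byte_range r;
	uint64_t end = pos + len;

	assert(blocksize > 0);

	r.pos = pos - (pos % blocksize);			/* round down */
	end = (end + blocksize - 1) / blocksize * blocksize;	/* round up */
	r.len = end - r.pos;
	return r;
}
```

With a 1k fsblock size, a 512-byte data segment at offset 512 becomes the range [0, 1024), so the whole destination block is written in one pass.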
Found by xfs/841 when the test filesystem has a 1k fsblock size.
Cc: <linux-xfs@vger.kernel.org> # v6.13.0 Fixes: 73fb78e5ee8940 ("mkfs: support copying in large or sparse files") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Mar 2026 20:55:34 +0000 (12:55 -0800)]
libxfs: fix data corruption bug in libxfs_file_write
libxfs_file_write tries to initialize the entire file block buffer,
which includes zeroing the head portion if @pos is not aligned to the
filesystem block size. However, @buf is the file data to copy in at
position @pos, not the position of the file block. Therefore, block_off
should be added to b_addr, not buf.
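A minimal sketch of the corrected copy, assuming a freshly allocated block buffer: fill_block() is a hypothetical helper, not the libxfs function, and only the b_addr, buf and block_off names come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Fill one block buffer for a write of @count bytes from @buf that lands
 * @block_off bytes into the block.  @buf already points at the data for
 * the write position, so the offset applies to the block buffer only.
 */
static void fill_block(char *b_addr, size_t blocksize, const char *buf,
		       size_t block_off, size_t count)
{
	assert(block_off + count <= blocksize);

	/* zero the head of the block */
	memset(b_addr, 0, block_off);
	/* copy to b_addr + block_off; do not advance buf */
	memcpy(b_addr + block_off, buf, count);
	/* zero the tail of the block */
	memset(b_addr + block_off + count, 0, blocksize - block_off - count);
}
```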
Cc: <linux-xfs@vger.kernel.org> # v6.13.0 Fixes: 73fb78e5ee8940 ("mkfs: support copying in large or sparse files") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Bastian Germann [Fri, 20 Feb 2026 17:17:10 +0000 (18:17 +0100)]
debian: Drop Uploader: Bastian Germann
I am no longer uploading the package to Debian.
The package is the same except for debian/upstream/signing-key.asc
which I have kept on the actual signer's key for the releases.
Signed-off-by: Bastian Germann <bage@debian.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Sparse inode cluster allocation sets min/max agbno values to avoid
allocating an inode cluster that might map to an invalid inode
chunk. For example, we can't have an inode record mapped to agbno 0
or that extends past the end of a runt AG of misaligned size.
The initial calculation of max_agbno is unnecessarily conservative,
however. This has triggered a corner case allocation failure where a
small runt AG (i.e. 2063 blocks) is mostly full save for an extent
to the EOFS boundary: [2050,13]. max_agbno is set to 2048 in this
case, which happens to be the offset of the last possible valid
inode chunk in the AG. In practice, we should be able to allocate
the 4-block cluster at agbno 2052 to map to the parent inode record
at agbno 2048, but the max_agbno value precludes it.
Note that this can result in filesystem shutdown via dirty trans
cancel on stable kernels prior to commit 9eb775968b68 ("xfs: walk
all AGs if TRYLOCK passed to xfs_alloc_vextent_iterate_ags") because
the tail AG selection by the allocator sets t_highest_agno on the
transaction. If the inode allocator spins around and finds an inode
chunk with free inodes in an earlier AG, the subsequent dir name
creation path may still fail to allocate due to the AG restriction
and cancel.
To avoid this problem, update the max_agbno calculation to the agbno
prior to the last chunk aligned agbno in the AG. This is not
necessarily the last valid allocation target for a sparse chunk, but
since inode chunks (i.e. records) are chunk aligned and sparse
allocs are cluster sized/aligned, this allows the sb_spino_align
alignment restriction to take over and round down the max effective
agbno to within the last valid inode chunk in the AG.
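The numbers in the example can be worked through with a short sketch. It assumes an 8-block inode chunk (and chunk alignment) and a 4-block sparse cluster alignment; those values were chosen to match the 2063-block AG described above and are not taken from the patch. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t round_down_u32(uint32_t x, uint32_t align)
{
	assert(align > 0);
	return x - (x % align);
}

/* Old bound: start of the last chunk that fits entirely in the AG. */
static uint32_t old_max_agbno(uint32_t agblocks, uint32_t chunk_blocks)
{
	return round_down_u32(agblocks, chunk_blocks) - chunk_blocks;
}

/* New bound: the agbno just before the last chunk-aligned agbno. */
static uint32_t new_max_agbno(uint32_t agblocks, uint32_t chunk_blocks)
{
	return round_down_u32(agblocks, chunk_blocks) - 1;
}
```

Under these assumptions the old bound for a 2063-block AG is 2048, which excludes the free extent starting at 2050. The new bound is 2055, and rounding that down to the 4-block cluster alignment gives 2052, the cluster that maps to the inode record at agbno 2048.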
Note that even though the allocator improvements in the
aforementioned commit seem to avoid this particular dirty trans
cancel situation, the max_agbno logic improvement still applies as
we should be able to allocate from an AG that has been appropriately
selected. The more important targets for this patch, however, are
older/stable kernels prior to this allocator rework/improvement.
Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
__xfs_rtgroup_extents is not used outside of xfs_rtgroup.c, so mark it
static. Move it and xfs_rtgroup_extents up in the file to avoid forward
declarations.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
Garbage collection assumes all zones contain the full amount of blocks.
Mkfs already ensures this happens, but make the kernel check it as well
to avoid getting into trouble due to fuzzers or mkfs bugs.
Fixes: 2167eaabe2fa ("xfs: define the zoned on-disk format") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
We can easily check if there are any reclaimable zones by just looking
at the used counters in the reclaim buckets, so do that to free up the
xarray mark we currently use for this purpose.
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
There are almost no users of the typedef left, kill it and switch the
remaining users to use the underlying struct.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
xlog_in_core_2_t is a really odd type, not only is it grossly
misnamed because it actually is an on-disk structure, but it also
represents the actual on-disk structure in a rather odd way.
I.e., the ext headers are a variable sized array at the end of the
header. So instead of declaring a union of xlog_rec_header,
xlog_rec_ext_header and padding to BBSIZE, add the proper padding to
struct xlog_rec_header and struct xlog_rec_ext_header, and
add a variable sized array of the latter to the former. This also
exposes the somewhat unusual scope of the log checksums, which is
made explicit now by adding proper padding and a macro designating
the actual payload length.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The XLOG_HEADER_CYCLE_SIZE / BBSIZE expression is used a lot
in the log code, give it a symbolic name.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
The xfs_dquot structure currently uses the anti-pattern of using the
in-object lock that protects the content to also serialize reference
count updates for the structure, leading to a cumbersome free path.
This is partially papered over by the fact that we never free the dquot
directly but always through the LRU. Switch to use a lockref instead and
move the reference counter manipulations out of q_qlock.
To make this work, xfs_qm_flush_one and xfs_qm_flush_one are converted to
acquire a dquot reference while flushing to integrate with the lockref
"get if not dead" scheme.
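The "get if not dead" idea can be illustrated in userspace with C11 atomics. This is a simplified sketch, not the kernel's lockref, which pairs a spinlock with the count; struct obj_ref and the ref_*() helpers are hypothetical names, and treating any negative count as dead is an assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define REF_DEAD	(-128)	/* any negative count means "dead" here */

struct obj_ref {
	atomic_int	count;
};

/* Take a reference unless the object has already been marked dead. */
static bool ref_get_not_dead(struct obj_ref *r)
{
	int old = atomic_load(&r->count);

	while (old >= 0) {
		/* on failure, old is reloaded with the current count */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference that was previously taken. */
static void ref_put(struct obj_ref *r)
{
	int old = atomic_fetch_sub(&r->count, 1);

	assert(old > 0);
	(void)old;
}

/* Mark the object dead so that no new references can be taken. */
static void ref_mark_dead(struct obj_ref *r)
{
	atomic_store(&r->count, REF_DEAD);
}
```

A flusher that calls ref_get_not_dead() first either holds the object for the duration of the flush or learns that it is already being torn down and skips it.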
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
iomap_zero_range() has to cover various corner cases that are
difficult to test on production kernels because it is used in fairly
limited use cases. For example, it is currently only used by XFS and
mostly only in partial block zeroing cases.
While it's possible to test most of these functional cases, we can
provide more robust test coverage by co-opting fallocate zero range
to invoke zeroing of the entire range instead of the more efficient
block punch/allocate sequence. Add an errortag to occasionally
invoke forced zeroing.
Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
Lukas Herbolt [Thu, 19 Feb 2026 11:44:09 +0000 (12:44 +0100)]
mkfs.xfs fix sunit size on 512e and 4kN disks.
Creating an XFS filesystem on a 4kN or 512e disk results in a suboptimal
LSU/LSUNIT. As of now we check if the sectorsize is bigger than
XLOG_HEADER_SIZE and so we set lsu to blocksize. But we do not check
whether lsunit can be bigger to fit the disk geometry.
Signed-off-by: Lukas Herbolt <lukas@herbolt.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Mon, 2 Feb 2026 19:14:05 +0000 (11:14 -0800)]
xfs_scrub_all: fix non-service-mode arguments to xfs_scrub
Back in commit 7da76e2745d6a7, we changed the default arguments to
xfs_scrub for the xfs_scrub@ service to derive the fix/preen/check mode
from the "autofsck" filesystem property instead of hardcoding "-p".
Unfortunately, I forgot to make the same update for xfs_scrub_all being
run from the CLI and directly invoking xfs_scrub.
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1125314 Cc: linux-xfs@vger.kernel.org # v6.10.0 Fixes: 7da76e2745d6a7 ("xfs_scrub: use the autofsck fsproperty to select mode") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Modify the function xfrog_report_zones() to always try a cached zone
report first, using the BLKREPORTZONEV2 ioctl.
If the kernel does not support BLKREPORTZONEV2, fall back to the
(slower) regular report zones BLKREPORTZONE ioctl.
To enable this feature even if xfsprogs is compiled on a system where
linux/blkzoned.h does not define BLKREPORTZONEV2, this ioctl is defined
in libfrog/zones.h, together with the BLK_ZONE_REP_CACHED flag and the
BLK_ZONE_COND_ACTIVE zone condition.
Since a cached report zone always returns the condition
BLK_ZONE_COND_ACTIVE for any zone that is implicitly open, explicitly
open or closed, the function xfs_zone_validate_seq() is modified to
handle this new condition as being equivalent to the implicit open,
explicit open or closed conditions.
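The try-cached-then-fall-back flow, including the "don't try cached reporting again if not supported" behavior, can be sketched without touching a real device. Everything here is hypothetical: the report_fn callbacks stand in for wrappers around the two ioctls, and the use of -ENOTTY as the "not supported" result is an assumption of this sketch, not something the commit states.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for one report-zones ioctl call; returns 0 or -errno. */
typedef int (*report_fn)(void *dev);

struct zone_reporter {
	report_fn	cached;		/* e.g. a BLKREPORTZONEV2 wrapper */
	report_fn	regular;	/* e.g. a BLKREPORTZONE wrapper */
	bool		cached_unsupported;
};

static int report_zones(struct zone_reporter *zr, void *dev)
{
	int ret;

	assert(zr != NULL && zr->regular != NULL);

	if (zr->cached && !zr->cached_unsupported) {
		ret = zr->cached(dev);
		if (ret != -ENOTTY)	/* assumed "not supported" error */
			return ret;
		/* remember the failure so the cached path is not retried */
		zr->cached_unsupported = true;
	}
	return zr->regular(dev);
}

/* Fake backends for demonstration; @dev points to a call counter. */
static int fake_cached_unsupported(void *dev)
{
	(*(int *)dev)++;
	return -ENOTTY;
}

static int fake_regular(void *dev)
{
	(void)dev;
	return 0;
}
```

With the fake backends, the first report_zones() call tries the cached path once and falls back; later calls go straight to the regular path.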
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
[hch: don't try cached reporting again if not supported] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Damien Le Moal [Wed, 28 Jan 2026 04:32:58 +0000 (05:32 +0100)]
libfrog: lift common zone reporting code from mkfs and repair
Define the new helper function xfrog_report_zones() to report zones of
a zoned block device. This function is implemented in the new file
libfrog/zones.c and declared in the header file libfrog/zones.h. Use it
from mkfs and repair instead of the previous open coded versions.
xfrog_report_zones() allocates and returns a struct blk_zone_report
structure, which can be reused by subsequent invocations. It is the
responsibility of the caller to free this structure after use.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
[hch: refactored to allow buffer reuse] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Damien Le Moal [Wed, 28 Jan 2026 04:32:57 +0000 (05:32 +0100)]
mkfs: remove unnecessary return value affectation
The function report_zones() in mkfs/xfs_mkfs.c is a void function. So
there is no need to set the variable ret to -EIO before returning if
fstat() fails.
Fixes: 2e5a737a61d3 ("xfs_mkfs: support creating file system with zoned RT devices") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de>
Modify xfs_mount_zones() to replace the call to blkdev_report_zones()
with blkdev_report_zones_cached() to speed-up mount operations.
Since this causes xfs_zone_validate_seq() to see zones with the
BLK_ZONE_COND_ACTIVE condition, this function is also modified to accept
this condition as valid.
With this change, mounting a freshly formatted large capacity (30 TB)
SMR HDD completes in under 2s compared to over 4.7s before.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de>
We'll need to conditionally add definitions added in later versions of
blkzoned.h soon. The right place for that is platform_defs.h, which
means blkzoned.h needs to be included there for cpp trickery to work.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Darrick J. Wong [Tue, 20 Jan 2026 17:51:51 +0000 (09:51 -0800)]
debian: don't explicitly reload systemd from postinst
Now that we use dh_installsystemd, it's no longer necessary to run
systemctl daemon-reload explicitly from postinst because
dh_installsystemd will inject that into the DEBHELPER section on its
own.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 20 Jan 2026 17:51:35 +0000 (09:51 -0800)]
xfs_mdrestore: fix restoration on filesystems with 4k sectors
Running xfs/129 on a disk with 4k LBAs produces the following failure:
--- /run/fstests/bin/tests/xfs/129.out 2025-07-15 14:41:40.210489431 -0700
+++ /run/fstests/logs/xfs/129.out.bad 2026-01-05 21:43:08.814485633 -0800
@@ -2,3 +2,8 @@ QA output created by 129
Create the original file blocks
Reflink every other block
Create metadump file, restore it and check restored fs
+xfs_mdrestore: Invalid superblock disk address/length
+mount: /opt: can't read superblock on /dev/loop0.
+ dmesg(1) may have more information after failed mount system call.
+mount /dev/loop0 /opt failed
+(see /run/fstests/logs/xfs/129.full for details)
This is a failure to restore a v2 metadump to /dev/loop0. Looking at
the metadump itself, the first xfs_meta_extent contains:
{
.xme_addr = 0,
.xme_len = 8,
}
Hrm. This is the primary superblock on the data device, with a length
of 8x512B = 4K. The original filesystem has this geometry:
In other words, a sector size of 4k because the device's LBA size is 4k.
Regrettably, the metadump validation in mdrestore assumes that the
primary superblock is only 512 bytes long, which is not correct for this
scenario.
Fix this by allowing an xme_len value of up to the maximum sector size
for xfs, which is 32k. Also remove a redundant and confusing mask check
for the xme_addr.
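The relaxed length check can be sketched as follows. This is not the mdrestore code: struct first_extent is a hypothetical host-endian copy of the first extent header, and the helper and macro names are invented for illustration. Only the 512-byte unit and the 32k maximum sector size come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BASIC_BLOCK_SIZE	512u		/* xme_len counts 512-byte units */
#define MAX_SECTOR_SIZE		(32u * 1024u)	/* largest xfs sector size */

static_assert(MAX_SECTOR_SIZE % BASIC_BLOCK_SIZE == 0,
	      "maximum sector size must be a whole number of basic blocks");

/* Hypothetical host-endian copy of the first extent header. */
struct first_extent {
	uint64_t	addr;
	uint32_t	len;
};

/* The primary superblock extent starts at 0 and spans at most one sector. */
static bool primary_sb_extent_ok(const struct first_extent *xme)
{
	return xme->addr == 0 &&
	       xme->len >= 1 &&
	       xme->len <= MAX_SECTOR_SIZE / BASIC_BLOCK_SIZE;
}
```

The extent from the failing test (address 0, length 8, i.e. 4k) passes this check, while a length of 65 basic blocks would exceed the 32k limit and be rejected.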
Note that this error was masked (at least on little-endian platforms
that most of us test on) until recent commit 98f05de13e7815 ("mdrestore:
fix restore_v2() superblock length check") which is why I didn't spot it
earlier.
Cc: linux-xfs@vger.kernel.org # v6.6.0 Fixes: fa9f484b79123c ("mdrestore: Define mdrestore ops for v2 format") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong [Tue, 20 Jan 2026 17:51:19 +0000 (09:51 -0800)]
mkfs: quiet down warning about insufficient write zones
xfs/067 fails with the following weird mkfs message:
--- tests/xfs/067.out 2025-07-15 14:41:40.191273467 -0700
+++ /run/fstests/logs/xfs/067.out.bad 2026-01-06 16:59:11.907677987 -0800
@@ -1,4 +1,8 @@
QA output created by 067
+Warning: not enough zones (134/133) for backing requested rt size due to
+over-provisioning needs, writable size will be less than (null)
+Warning: not enough zones (134/133) for backing requested rt size due to
+over-provisioning needs, writable size will be less than (null)
In this case, MKFS_OPTIONS is set to: "-rrtdev=/dev/sdb4 -m
metadir=1,autofsck=1,uquota,gquota,pquota -d rtinherit=1 -r zoned=1
/dev/sda4"
In other words, we didn't pass an explicit rt volume size to mkfs, so
the message is a bit bogus. Let's skip printing the message when
the user did not provide an explicit rtsize parameter.
Cc: linux-xfs@vger.kernel.org # v6.18.0 Fixes: b5d372d96db1ad ("mkfs: adjust_nr_zones for zoned file system on conventional devices") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>