git-server-git.apps.pok.os.sepia.ceph.com Git

NFSD: Clear TIME_DELEG in the suppattr_exclcreat bitmap

>From RFC 8881:

5.8.1.14. Attribute 75: suppattr_exclcreat

> The bit vector that would set all REQUIRED and RECOMMENDED
> attributes that are supported by the EXCLUSIVE4_1 method of file
> creation via the OPEN operation. The scope of this attribute
> applies to all objects with a matching fsid.

There's nothing in RFC 8881 that states that suppattr_exclcreat is
or is not allowed to contain bits for attributes that are clear in
the reported supported_attrs bitmask. But it doesn't make sense for
an NFS server to indicate that it /doesn't/ implement an attribute,
but then also indicate that clients /are/ allowed to set that
attribute using OPEN(create) with EXCLUSIVE4_1.

The FATTR4_WORD2_TIME_DELEG attributes are also not to be allowed
for OPEN(create) with EXCLUSIVE4_1. It doesn't make sense to set
a delegated timestamp on a new file.

Fixes: 7e13f4f8d27d ("nfsd: handle delegated timestamps in SETATTR")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Clear SECLABEL in the suppattr_exclcreat bitmap

>From RFC 8881:

5.8.1.14. Attribute 75: suppattr_exclcreat

> The bit vector that would set all REQUIRED and RECOMMENDED
> attributes that are supported by the EXCLUSIVE4_1 method of file
> creation via the OPEN operation. The scope of this attribute
> applies to all objects with a matching fsid.

There's nothing in RFC 8881 that states that suppattr_exclcreat is
or is not allowed to contain bits for attributes that are clear in
the reported supported_attrs bitmask. But it doesn't make sense for
an NFS server to indicate that it /doesn't/ implement an attribute,
but then also indicate that clients /are/ allowed to set that
attribute using OPEN(create) with EXCLUSIVE4_1.

Ensure that the SECURITY_LABEL and ACL bits are not set in the
suppattr_exclcreat bitmask when they are also not set in the
supported_attrs bitmask.

Fixes: 8c18f2052e75 ("nfsd41: SUPPATTR_EXCLCREAT attribute")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: fix memory leak in nfsd_create_serv error paths

When nfsd_create_serv() calls percpu_ref_init() to initialize
nn->nfsd_net_ref, it allocates both a percpu reference counter
and a percpu_ref_data structure (64 bytes). However, if the
function fails later due to svc_create_pooled() returning NULL
or svc_bind() returning an error, these allocations are not
cleaned up, resulting in a memory leak.

The leak manifests as:
- Unreferenced percpu allocation (8 bytes per CPU)
- Unreferenced percpu_ref_data structure (64 bytes)

Fix this by adding percpu_ref_exit() calls in both error paths
to properly clean up the percpu_ref_init() allocations.

This patch fixes the percpu_ref leak in nfsd_create_serv() seen
as an auxiliary leak in syzbot report 099461f8558eb0a1f4f3; the
prepare_creds() and vsock-related leaks in the same report
remain to be addressed separately.

Reported-by: syzbot+099461f8558eb0a1f4f3@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=099461f8558eb0a1f4f3
Fixes: 47e988147f40 ("nfsd: add nfsd_serv_try_get and nfsd_serv_put")
Signed-off-by: Shardul Bankar <shardul.b@mpiricsoftware.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: Mark variable __maybe_unused to avoid W=1 build break

Clang is not happy about set but (in some cases) unused variable:

fs/nfsd/export.c:1027:17: error: variable 'inode' set but not used [-Werror,-Wunused-but-set-variable]

since it's used as a parameter to dprintk() which might be configured
a no-op. To avoid uglifying code with the specific ifdeffery just mark
the variable __maybe_unused.

The commit [1], which introduced this behaviour, is quite old and hence
the Fixes tag points to the first of the Git era.

Link: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=0431923fb7a1
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

svcrdma: bound check rq_pages index in inline path

svc_rdma_copy_inline_range indexed rqstp->rq_pages[rc_curpage] without
verifying rc_curpage stays within the allocated page array. Add guards
before the first use and after advancing to a new page.

Fixes: d7cc73972661 ("svcrdma: support multiple Read chunks per RPC")
Cc: stable@vger.kernel.org
Signed-off-by: Joshua Rogers <linux@joshua.hu>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

svcrdma: return 0 on success from svc_rdma_copy_inline_range

The function comment specifies 0 on success and -EINVAL on invalid
parameters. Make the tail return 0 after a successful copy loop.

Fixes: d7cc73972661 ("svcrdma: support multiple Read chunks per RPC")
Cc: stable@vger.kernel.org
Signed-off-by: Joshua Rogers <linux@joshua.hu>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

svcrdma: use rc_pageoff for memcpy byte offset

svc_rdma_copy_inline_range added rc_curpage (page index) to the page
base instead of the byte offset rc_pageoff. Use rc_pageoff so copies
land within the current page.

Found by ZeroPath (https://zeropath.com)

Fixes: 8e122582680c ("svcrdma: Move svc_rdma_read_info::ri_pageno to struct svc_rdma_recv_ctxt")
Cc: stable@vger.kernel.org
Signed-off-by: Joshua Rogers <linux@joshua.hu>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

SUNRPC: svcauth_gss: avoid NULL deref on zero length gss_token in gss_read_proxy_verf

A zero length gss_token results in pages == 0 and in_token->pages[0]
is NULL. The code unconditionally evaluates
page_address(in_token->pages[0]) for the initial memcpy, which can
dereference NULL even when the copy length is 0. Guard the first
memcpy so it only runs when length > 0.

Fixes: 5866efa8cbfb ("SUNRPC: Fix svcauth_gss_proxy_init()")
Cc: stable@vger.kernel.org
Signed-off-by: Joshua Rogers <linux@joshua.hu>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: nfsd-io-modes: Separate lists

Sphinx reports htmldocs indentation warnings:

Documentation/filesystems/nfs/nfsd-io-modes.rst:58: ERROR: Unexpected indentation. [docutils]
Documentation/filesystems/nfs/nfsd-io-modes.rst:59: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils]

These caused the lists to be shown as long running paragraphs merged
with their previous paragraphs.

Fix these by separating the lists with a blank line.

Fixes: fa8d4e6784d1b6 ("NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/20251202152506.7a2d2d41@canb.auug.org.au/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: nfsd-io-modes: Wrap shell snippets in literal code blocks

Sphinx reports htmldocs indentation warnings:

Documentation/filesystems/nfs/nfsd-io-modes.rst:29: ERROR: Unexpected indentation. [docutils]
Documentation/filesystems/nfs/nfsd-io-modes.rst:34: ERROR: Unexpected indentation. [docutils]

Fix these by wrapping shell snippets in literal code blocks.

Fixes: fa8d4e6784d1b6 ("NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/20251202152506.7a2d2d41@canb.auug.org.au/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Add toctree entry for NFSD IO modes docs

Commit fa8d4e6784d1b6 ("NFSD: add
Documentation/filesystems/nfs/nfsd-io-modes.rst") adds documentation for
NFSD I/O modes, but it forgets to add toctree entry for it. Hence,
Sphinx reports:

Documentation/filesystems/nfs/nfsd-io-modes.rst: WARNING: document isn't included in any toctree [toc.not_included]

Add the entry.

Fixes: fa8d4e6784d1b6 ("NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/20251202152506.7a2d2d41@canb.auug.org.au/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst

This document details the NFSD IO modes that are configurable using
NFSD's experimental debugfs interfaces:

/sys/kernel/debug/nfsd/io_cache_read
/sys/kernel/debug/nfsd/io_cache_write

This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
debugfs interfaces are replaced with per-export controls).

Future updates will provide more specific guidance and howto
information to help others use and evaluate NFSD's IO modes:
BUFFERED, DONTCACHE and DIRECT.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Implement NFSD_IO_DIRECT for NFS WRITE

When NFSD_IO_DIRECT is selected via the
/sys/kernel/debug/nfsd/io_cache_write experimental tunable, split
incoming unaligned NFS WRITE requests into a prefix, middle and
suffix segment, as needed. The middle segment is now DIO-aligned and
the prefix and/or suffix are unaligned. Synchronous buffered IO is
used for the unaligned segments, and IOCB_DIRECT is used for the
middle DIO-aligned extent.

Although IOCB_DIRECT avoids the use of the page cache, by itself it
doesn't guarantee data durability. For UNSTABLE WRITE requests,
durability is obtained by a subsequent NFS COMMIT request.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Co-developed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Make FILE_SYNC WRITEs comply with spec

Mike noted that when NFSD responds to an NFS_FILE_SYNC WRITE, it
does not also persist file time stamps. To wit, Section 18.32.3
of RFC 8881 mandates:

> The client specifies with the stable parameter the method of how
> the data is to be processed by the server. If stable is
> FILE_SYNC4, the server MUST commit the data written plus all file
> system metadata to stable storage before returning results. This
> corresponds to the NFSv2 protocol semantics. Any other behavior
> constitutes a protocol violation. If stable is DATA_SYNC4, then
> the server MUST commit all of the data to stable storage and
> enough of the metadata to retrieve the data before returning.

Commit 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()") replaced:

- flags |= RWF_SYNC;

with:

+ kiocb.ki_flags |= IOCB_DSYNC;

which appears to be correct given:

if (flags & RWF_SYNC)
kiocb_flags |= IOCB_DSYNC;

in kiocb_set_rw_flags(). However the author of that commit did not
appreciate that the previous line in kiocb_set_rw_flags() results
in IOCB_SYNC also being set:

kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);

RWF_SUPPORTED contains RWF_SYNC, and RWF_SYNC is the same bit as
IOCB_SYNC. Reviewers at the time did not catch the omission.

Reported-by: Mike Snitzer <snitzer@kernel.org>
Closes: https://lore.kernel.org/linux-nfs/20251018005431.3403-1-cel@kernel.org/T/#t
Fixes: 3f3503adb332 ("NFSD: Use vfs_iocb_iter_write()")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Add trace point for SCSI fencing operation.

Add trace point to print client IP address, net namespace number,
device name and status of SCSI pr_preempt command.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: use correct reservation type in nfsd4_scsi_fence_client

The reservation type argument for the pr_preempt call should match the
one used in nfsd4_block_get_device_info_scsi.

Fixes: f99d4fbdae67 ("nfsd: add SCSI layout support")
Cc: stable@vger.kernel.org
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

xdrgen: Don't generate unnecessary semicolon

The Jinja2 templates add a semicolon at the end of every function.
The C language does not require this punctuation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

xdrgen: Fix union declarations

Add a missing template file. This file is used when a union is
defined as a public API (ie, "pragma public <union name>;").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: don't start nfsd if sv_permsocks is empty

Previously, while trying to create a server instance, if no
listening sockets were present then default parameter udp
and tcp listeners were created. It's unclear what purpose
was of starting these listeners were and how this could have
been triggered by the userland setup. This patch proposed
to ensure the reverse that we never end in a situation where
no listener sockets are created and we are trying to create
nfsd threads.

The problem it solves is: when nfs.conf only has tcp=n (and
nothing else for the choice of transports), nfsdctl would
still start the server and create udp and tcp listeners.

Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

xdrgen: handle _XdrString in union encoder/decoder

Running xdrgen on xdrgen/tests/test.x fails when
generating encoder or decoder functions for union
members of type _XdrString. It was because _XdrString
does not have a spec attribute like _XdrBasic,
leading to AttributeError.

This patch updates emit_union_case_spec_definition
and emit_union_case_spec_decoder/encoder to handle
_XdrString by assigning type_name = "char *" and
avoiding referencing to spec.

Testing: Fixed xdrgen tool was run on originally failing
test file (tools/net/sunrpc/xdrgen/tests/test.x) and now
completes without AttributeError. Modified xdrgen tool was
also run against nfs4_1.x (Documentation/sunrpc/xdr/nfs4_1.x).
The output header file matches with nfs4_1.h
(include/linux/sunrpc/xdrgen/nfs4_1.h).
This validates the patch for all XDR input files currently
within the kernel.

Changes since v2:
- Moved the shebang to the first line
- Removed SPDX header to match style of current xdrgen files

Changes since v1:
- Corrected email address in Signed-off-by.
- Wrapped patch description lines to 72 characters.

Signed-off-by: Khushal Chitturi <kc9282016@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

xdrgen: Fix the variable-length opaque field decoder template

Ensure that variable-length opaques are decoded into the named
field, and do not overwrite the structure itself.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

xdrgen: Make the xdrgen script location-independent

The @pythondir@ placeholder is meant for build-time substitution,
such as with autoconf. autoconf is not used in the kernel. Let's
replace that mechanism with one that better enables the xdrgen
script to be run from any directory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

xdrgen: Generalize/harden pathname construction

Use Python's built-in Path constructor to find the Jinja templates.
This provides better error checking, proper use of path component
separators, and more reliable location of the template files.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

lockd: don't allow locking on reexported NFSv2/3

Since commit 9254c8ae9b81 ("nfsd: disallow file locking and delegations
for NFSv4 reexport"), file locking when reexporting an NFS mount via
NFSv4 is expressly prohibited by nfsd. Do the same in lockd:

Add a new nlmsvc_file_cannot_lock() helper that will test whether file
locking is allowed for a given file, and return nlm_lck_denied_nolocks
if it isn't.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

MAINTAINERS: add a nfsd blocklayout reviewer

Add a minimal entry for the block layout driver to make sure Christoph
who wrote the code gets Cced on all patches. The actual maintenance
stays with the nfsd maintainer team.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: Use MD5 library instead of crypto_shash

Update NFSD's support for "legacy client tracking" (which uses MD5) to
use the MD5 library instead of crypto_shash.  This has several benefits:

- Simpler code.  Notably, much of the error-handling code is no longer
  needed, since the library functions can't fail.

- Improved performance due to reduced overhead.  A microbenchmark of
  nfs4_make_rec_clidname() shows a speedup from 1455 cycles to 425.

- The MD5 code can now safely be built as a loadable module when nfsd is
  built as a loadable module.  (Previously, nfsd forced the MD5 code to
  built-in, presumably to work around the unreliability of the
  name-based loading.)  Thus select MD5 from the tristate option NFSD if
  NFSD_LEGACY_CLIENT_TRACKING, instead of from the bool option NFSD_V4.

- Fixes a bug where legacy client tracking was not supported on kernels
  booted with "fips=1", due to crypto_shash not allowing MD5 to be used.
  This particular use of MD5 is not for a cryptographic purpose, though,
  so it is acceptable even when fips=1 (see
  https://lore.kernel.org/r/dae495a93cbcc482f4ca23c3a0d9360a1fd8c3a8.camel@redhat.com/).

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: stop pretending that we cache the SEQUENCE reply.

nfsd does not cache the reply to a SEQUENCE.  As the comment above
nfsd4_replay_cache_entry() says:

* The sequence operation is not cached because we can use the slot and
* session values.

The comment above nfsd4_cache_this() suggests otherwise.

* The session reply cache only needs to cache replies that the client
* actually asked us to.  But it's almost free for us to cache compounds
* consisting of only a SEQUENCE op, so we may as well cache those too.
* Also, the protocol doesn't give us a convenient response in the case
* of a replay of a solo SEQUENCE op that wasn't cached

The code in nfsd4_store_cache_entry() makes it clear that only responses
beyond 'cstate.data_offset' are actually cached, and data_offset is set
at the end of nfsd4_encode_sequence() *after* the sequence response has
been encoded.

This patch simplifies code and removes the confusing comments.

- nfsd4_is_solo_sequence() is discarded as not-useful.
- nfsd4_cache_this() is now trivial so it too is discarded with the
  code placed in-line at the one call-site in nfsd4_store_cache_entry().
- nfsd4_enc_sequence_replay() is open-coded in to
  nfsd4_replay_cache_entry(), and then simplified to (hopefully) make
  the process of replaying a reply clearer.

Signed-off-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFS: nfsd-maintainer-entry-profile: Inline function name prefixes

Sphinx reports htmldocs warnings:

Documentation/filesystems/nfs/nfsd-maintainer-entry-profile.rst:185: ERROR: Unknown target name: "nfsd". [docutils]
Documentation/filesystems/nfs/nfsd-maintainer-entry-profile.rst:188: ERROR: Unknown target name: "nfsdn". [docutils]
Documentation/filesystems/nfs/nfsd-maintainer-entry-profile.rst:192: ERROR: Unknown target name: "nfsd4m". [docutils]

These are due to Sphinx confusing function name prefixes for external
link syntax. Fix the warnings by inlining the prefixes.

Fixes: 3a1ce35030e1e0 ("NFSD: Add a subsystem policy document")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/20251117174218.29365f30@canb.auug.org.au/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Add a subsystem policy document

Steer contributors to NFSD's patchworks instance, list our patch
submission preferences, and more. The new document is based on the
existing netdev and xfs subsystem policy documents.

This is an attempt to add transparency to the process of accepting
contributions to NFSD and getting them merged upstream.

Suggested-by: "Darrick J. Wong" <djwong@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NeilBrown <neil@brown.name>
[ cel: Hand-edits to address review comments ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

sunrpc: allocate a separate bvec array for socket sends

svc_tcp_sendmsg() calls xdr_buf_to_bvec() with the second slot of
rq_bvec as the start, but doesn't reduce the array length by one, which
could lead to an array overrun. Also, rq_bvec is always rq_maxpages in
length, which can be too short in some cases, since the TCP record
marker consumes a slot.

Fix both problems by adding a separate bvec array to the svc_sock that
is specifically for sending. For TCP, make this array one slot longer
than rq_maxpages, to account for the record marker. For UDP, only
allocate as large an array as we need since it's limited to 64k of
payload.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Improve "fragment too large" warning

Including the client IP address that generated the overrun traffic
seems like it would be helpful. The message now reads:

kernel: svc: nfsd oversized RPC fragment (1064958 octets) from 100.64.0.11:45866

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Implement NFSD_IO_DIRECT for NFS READ

Add an experimental option that forces NFS READ operations to use
direct I/O instead of reading through the NFS server's page cache.

There is already at least one other layer of read caching: the page
cache on NFS clients.

The server's page cache, in many cases, is unlikely to provide
additional benefit. Some benchmarks have demonstrated that the
server's page cache is actively detrimental for workloads whose
working set is larger than the server's available physical memory.

For instance, on small NFS servers, cached NFS file content can
squeeze out local memory consumers. For large sequential workloads,
an enormous amount of data flows into and out of the page cache
and is consumed by NFS clients exactly once -- caching that data
is expensive to do and totally valueless.

For now this is a hidden option that can be enabled on test
systems for benchmarking. In the longer term, this option might
be enabled persistently or per-export. When the exported file
system does not support direct I/O, NFSD falls back to using
either DONTCACHE or buffered I/O to fulfill NFS READ requests.

Suggested-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Relocate the xdr_reserve_space_vec() call site

In order to detect when a direct READ is possible, we need the send
buffer's .page_len to be zero when there is nothing in the buffer's
.pages array yet.

However, when xdr_reserve_space_vec() extends the size of the
xdr_stream to accommodate a READ payload, it adds to the send
buffer's .page_len.

It should be safe to reserve the stream space /after/ the VFS read
operation completes. This is, for example, how an NFSv3 READ works:
the VFS read goes into the rq_bvec, and is then added to the send
xdr_stream later by svcxdr_encode_opaque_pages().

Now that xdr_reserve_space_vec() uses the number of bytes actually
read, the xdr_truncate_encode() call is no longer necessary.

Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: pass nfsd_file to nfsd_iter_read()

Prepare for nfsd_iter_read() to use the DIO alignment stored in
nfsd_file by passing the nfsd_file to nfsd_iter_read() rather than
just the file which is associaed with the nfsd_file.

This means nfsd4_encode_readv() now also needs the nfsd_file rather
than the file. Instead of changing the file arg to be the nfsd_file,
we discard the file arg as the nfsd_file (and indeed the file) is
already available via the "read" argument.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD/blocklayout: Support multiple extents per LAYOUTGET

Allow the pNFS server to respond with multiple extents to a LAYOUTGET
request, thereby avoiding unnecessary load on the server and improving
performance for the client. The number of LAYOUTGET requests is
significantly reduced for various file access patterns, including
random and parallel writes.

Additionally, this change allows the client to request layouts with the
loga_minlength value greater than the minimum possible length of a single
extent in XFS. We use this functionality to fix a livelock in the client.

Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD/blocklayout: Introduce layout content structure

Add a layout content structure instead of a single extent. The ability
to store and encode an array of extents is then used to implement support
for multiple extents per LAYOUTGET.

Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD/blocklayout: Extract extent mapping from proc_layoutget

No changes in functionality. Split the proc_layoutget function to
create a helper function that maps single extent to the requested
range. This helper function is then used to implement support for
multiple extents per LAYOUTGET.

Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD/blocklayout: Fix minlength check in proc_layoutget

The extent returned by the file system may have a smaller offset than
the segment offset requested by the client. In this case, the minimum
segment length must be checked against the requested range. Otherwise,
the client may not be able to continue the read/write operation.

Fixes: 8650b8a05850 ("nfsd: pNFS block layout driver")
Signed-off-by: Sergey Bashirov <sergeybashirov@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

svcrdma: Increase the server's default RPC/RDMA credit grant

The range of commits from commit e3274026e2ec ("SUNRPC: move all of
xprt handling into svc_xprt_handle()") to commit 15d39883ee7d
("SUNRPC: change the back-channel queue to lwq") enabled NFSD
performance to scale better as the number of nfsd threads is
increased. These commits were merged in v6.7.

Now that the nfsd thread count can scale to more threads, permit
individual clients to make more use of those threads. Increase the
RPC/RDMA per-connection credit grant from 64 to 128 -- same as the
Linux NFS client.

Simple single client fio-based benchmarking so far shows only
improvement, no regression.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Update comment documenting unsupported fattr4 attributes

TIME_CREATE has been supported since commit e377a3e698fb ("nfsd: Add
support for the birth time attribute").

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: delete unreachable confusing code in nfs4_open_delegation()

op_delegate_type is assigned OPEN_DELEGATE_NONE just before the if-block
where condition specifies it not be equal to OPEN_DELEGATE_NONE. Compiler
treats the block as unreachable and optimizes it out from the resulting
executable.

In that aspect commit d08d32e6e5c0 ("nfsd4: return delegation immediately
if lease fails") notably makes no difference.

Seems it's better to just drop this code instead of fiddling with memory
barriers or atomics.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Matvey Kovalev <matvey.kovalev@ispras.ru>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

NFSD: Add array bounds-checking in nfsd_iter_read()

The *count parameter does not appear to be explicitly restricted
to being smaller than rsize, so it might be possible to overrun
the rq_bvec or rq_pages arrays.

Rather than overrunning these arrays (damage done!) and then WARNING
once, let's harden the loop so that it terminates before the end of
the arrays are reached. This should result in a short read, which is
OK -- clients recover by sending additional READ requests for the
remaining unread bytes.

Reported-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: switch the default for NFSD_LEGACY_CLIENT_TRACKING to "n"

We added this Kconfig option a little over a year ago. Switch the
default to "n" in preparation for its eventual removal.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: change nfs4_client_to_reclaim() to allocate data

The calling convention for nfs4_client_to_reclaim() is clumsy in that
the caller needs to free memory if the function fails. It is much
cleaner if the function frees its own memory.

This patch changes nfs4_client_to_reclaim() to re-allocate the .data
fields to be stored in the newly allocated struct nfs4_client_reclaim,
and to free everything on failure.

__cld_pipe_inprogress_downcall() needs to allocate the data anyway to
copy it from user-space, so now that data is allocated twice. I think
that is a small price to pay for a cleaner interface.

Signed-off-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

nfsd: move name lookup out of nfsd4_list_rec_dir()

nfsd4_list_rec_dir() is called with two different callbacks.
One of the callbacks uses vfs_rmdir() to remove the directory.
The other doesn't use the dentry at all, just the name.

As only one callback needs the dentry, this patch moves the lookup into
that callback. This prepares of changes to how directory operations
are locked.

Signed-off-by: NeilBrown <neil@brown.name>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

svcrdma: Release transport resources synchronously

NFSD has always supported added network listeners. The new netlink
protocol now enables the removal of listeners.

Olga noticed that if an RDMA listener is removed and immediately
re-added, the deferred __svc_rdma_free() function might not have
run yet, so some or all of the old listener's RDMA resources
linger, which prevents a new listener on the same address from
being created.

Also, svc_xprt_free() does a module_put() just after calling
->xpo_free(). That means if there is deferred work going on, the
module could be unloaded before that work is even started,
resulting in a UAF.

Neil asks:
> What particular part of __svc_rdma_free() needs to run in order for a
> subsequent registration to succeed?
> Can that bit be run directory from svc_rdma_free() rather than be
> delayed?
> (I know almost nothing about rdma so forgive me if the answers to these
> questions seems obvious)

The reasons I can recall are:

- Some of the transport tear-down work can sleep
- Releasing a cm_id is tricky and can deadlock

We might be able to mitigate the second issue with judicious
application of transport reference counting.

Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Closes: https://lore.kernel.org/linux-nfs/20250821204328.89218-1-okorniev@redhat.com/
Suggested-by: NeilBrown <neil@brown.name>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

Merge tag 'perf-tools-fixes-for-v6.18-2-2025-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools

Pull perf tools fixes from Arnaldo Carvalho de Melo:

- Fix writing bpf_prog (infos|btfs)_cnt to data file, to not generate
   invalid perf.data files in some corner cases.

- Fix 'perf top' segfault by ensuring libbfd is initialized. This is an
   opt-in feature due to license incompatibilities.

- Fix segfault in 'perf lock' due to missing kernel map.

- Fix 'perf lock contention' test.

- Don't fail fast path detection if binutils-devel isn't available.

- Sync KVM's vmx.h with the kernel to pick SEAMCALL exit reason.

* tag 'perf-tools-fixes-for-v6.18-2-2025-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
  perf libbfd: Ensure libbfd is initialized prior to use
  perf test: Fix lock contention test
  perf lock: Fix segfault due to missing kernel map
  tools headers UAPI: Sync KVM's vmx.h with the kernel to pick SEAMCALL exit reason
  perf build: Don't fail fast path feature detection when binutils-devel is not available
  perf header: Write bpf_prog (infos|btfs)_cnt to data file

Merge tag 'mm-hotfixes-stable-2025-11-16-10-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"7 hotfixes.  5 are cc:stable, 4 are against mm/

  All are singletons - please see the respective changelogs for details"

* tag 'mm-hotfixes-stable-2025-11-16-10-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm, swap: fix potential UAF issue for VMA readahead
  selftests/user_events: fix type cast for write_index packed member in perf_test
  lib/test_kho: check if KHO is enabled
  mm/huge_memory: fix folio split check for anon folios in swapcache
  MAINTAINERS: update David Hildenbrand's email address
  crash: fix crashkernel resource shrink
  mm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlb

Merge tag 'firewire-fixes-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394

Pull firewire fixes from Takashi Sakamoto:
"This includes some fixes for the topology map, newly introduced in
  v6.18 kernel"

* tag 'firewire-fixes-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  firewire: core: fix to update generation field in topology map
  firewire: core: Initialize topology_map.lock

Merge tag 'edac_urgent_for_v6.18_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC fixes from Borislav Petkov:

- In Versalnet, handle the reporting of non-standard hw errors whose
   information can come in more than one remote processor message.

- Explicitly reenable ECC checking after a warm reset in Altera OCRAM
   as those registers are reset to default otherwise

- Fix single-bit error injection in Altera EDAC to not inject errors
   directly in ECC RAM and thus lead to false double-bit errors due to
   same ECC RAM being in concurrent use

* tag 'edac_urgent_for_v6.18_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/altera: Use INTTEST register for Ethernet and USB SBE injection
  EDAC/altera: Handle OCRAM ECC enable after warm reset
  EDAC/versalnet: Handle split messages for non-standard errors

firewire: core: fix to update generation field in topology map

The generation field of topology map is updated after initialized by zero.
The updated value of generation field is always zero, and is against
specification.

This commit fixes the bug.

Fixes: 7d138cb269db ("firewire: core: use spin lock specific to topology map")
Link: https://lore.kernel.org/r/20251114144421.415278-1-o-takashi@sakamocchi.jp
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>

mm, swap: fix potential UAF issue for VMA readahead

Since commit 78524b05f1a3 ("mm, swap: avoid redundant swap device
pinning"), the common helper for allocating and preparing a folio in the
swap cache layer no longer tries to get a swap device reference
internally, because all callers of __read_swap_cache_async are already
holding a swap entry reference. The repeated swap device pinning isn't
needed on the same swap device.

Caller of VMA readahead is also holding a reference to the target entry's
swap device, but VMA readahead walks the page table, so it might encounter
swap entries from other devices, and call __read_swap_cache_async on
another device without holding a reference to it.

So it is possible to cause a UAF when swapoff of device A raced with
swapin on device B, and VMA readahead tries to read swap entries from
device A. It's not easy to trigger, but in theory, it could cause real
issues.

Make VMA readahead try to get the device reference first if the swap
device is a different one from the target entry.

Link: https://lkml.kernel.org/r/20251111-swap-fix-vma-uaf-v1-1-41c660e58562@tencent.com
Fixes: 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning")
Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/user_events: fix type cast for write_index packed member in perf_test

Accessing 'reg.write_index' directly triggers a -Waddress-of-packed-member
warning due to potential unaligned pointer access:

perf_test.c:239:38: warning: taking address of packed member 'write_index'
of class or structure 'user_reg' may result in an unaligned pointer value
[-Waddress-of-packed-member]
239 | ASSERT_NE(-1, write(self->data_fd, &reg.write_index,
| ^~~~~~~~~~~~~~~

Since write(2) works with any alignment. Casting '&reg.write_index'
explicitly to 'void *' to suppress this warning.

Link: https://lkml.kernel.org/r/20251106095532.15185-1-ankitkhushwaha.linux@gmail.com
Fixes: 42187bdc3ca4 ("selftests/user_events: Add perf self-test for empty arguments events")
Signed-off-by: Ankit Khushwaha <ankitkhushwaha.linux@gmail.com>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: sunliming <sunliming@kylinos.cn>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

lib/test_kho: check if KHO is enabled

We must check whether KHO is enabled prior to issuing KHO commands,
otherwise KHO internal data structures are not initialized.

Link: https://lkml.kernel.org/r/20251106220635.2608494-1-pasha.tatashin@soleen.com
Fixes: b753522bed0b ("kho: add test for kexec handover")
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202511061629.e242724-lkp@intel.com
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/huge_memory: fix folio split check for anon folios in swapcache

Both uniform and non uniform split check missed the check to prevent
splitting anon folios in swapcache to non-zero order.

Splitting anon folios in swapcache to non-zero order can cause data
corruption since swapcache only support PMD order and order-0 entries.
This can happen when one use split_huge_pages under debugfs to split
anon folios in swapcache.

In-tree callers do not perform such an illegal operation. Only debugfs
interface could trigger it. I will put adding a test case on my TODO
list.

Fix the check.

Link: https://lkml.kernel.org/r/20251105162910.752266-1-ziy@nvidia.com
Fixes: 58729c04cf10 ("mm/huge_memory: add buddy allocator like (non-uniform) folio_split()")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reported-by: "David Hildenbrand (Red Hat)" <david@kernel.org>
Closes: https://lore.kernel.org/all/dc0ecc2c-4089-484f-917f-920fdca4c898@kernel.org/
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

MAINTAINERS: update David Hildenbrand's email address

Switch to kernel.org email address as I will be leaving Red Hat. The old
address will remain active until end of January 2026, so performing the
change now should make sure that most mails will reach me.

Link: https://lkml.kernel.org/r/20251103103659.379335-1-david@kernel.org
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

crash: fix crashkernel resource shrink

When crashkernel is configured with a high reservation, shrinking its
value below the low crashkernel reservation causes two issues:

1. Invalid crashkernel resource objects
2. Kernel crash if crashkernel shrinking is done twice

For example, with crashkernel=200M,high, the kernel reserves 200MB of high
memory and some default low memory (say 256MB). The reservation appears
as:

cat /proc/iomem | grep -i crash
af000000-beffffff : Crash kernel
433000000-43f7fffff : Crash kernel

If crashkernel is then shrunk to 50MB (echo 52428800 >
/sys/kernel/kexec_crash_size), /proc/iomem still shows 256MB reserved:
af000000-beffffff : Crash kernel

Instead, it should show 50MB:
af000000-b21fffff : Crash kernel

Further shrinking crashkernel to 40MB causes a kernel crash with the
following trace (x86):

BUG: kernel NULL pointer dereference, address: 0000000000000038
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
<snip...>
Call Trace: <TASK>
? __die_body.cold+0x19/0x27
? page_fault_oops+0x15a/0x2f0
? search_module_extables+0x19/0x60
? search_bpf_extables+0x5f/0x80
? exc_page_fault+0x7e/0x180
? asm_exc_page_fault+0x26/0x30
? __release_resource+0xd/0xb0
release_resource+0x26/0x40
__crash_shrink_memory+0xe5/0x110
crash_shrink_memory+0x12a/0x190
kexec_crash_size_store+0x41/0x80
kernfs_fop_write_iter+0x141/0x1f0
vfs_write+0x294/0x460
ksys_write+0x6d/0xf0
<snip...>

This happens because __crash_shrink_memory()/kernel/crash_core.c
incorrectly updates the crashk_res resource object even when
crashk_low_res should be updated.

Fix this by ensuring the correct crashkernel resource object is updated
when shrinking crashkernel memory.

Link: https://lkml.kernel.org/r/20251101193741.289252-1-sourabhjain@linux.ibm.com
Fixes: 16c6006af4d4 ("kexec: enable kexec_crash_size to support two crash kernel regions")
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlb

In the past, CONFIG_ARCH_HAS_GIGANTIC_PAGE indicated that we support
runtime allocation of gigantic hugetlb folios.  In the meantime it evolved
into a generic way for the architecture to state that it supports gigantic
hugetlb folios.

In commit fae7d834c43c ("mm: add __dump_folio()") we started using
CONFIG_ARCH_HAS_GIGANTIC_PAGE to decide MAX_FOLIO_ORDER: whether we could
have folios larger than what the buddy can handle.  In the context of that
commit, we started using MAX_FOLIO_ORDER to detect page corruptions when
dumping tail pages of folios.  Before that commit, we assumed that we
cannot have folios larger than the highest buddy order, which was
obviously wrong.

In commit 7b4f21f5e038 ("mm/hugetlb: check for unreasonable folio sizes
when registering hstate"), we used MAX_FOLIO_ORDER to detect
inconsistencies, and in fact, we found some now.

Powerpc allows for configs that can allocate gigantic folio during boot
(not at runtime), that do not set CONFIG_ARCH_HAS_GIGANTIC_PAGE and can
exceed PUD_ORDER.

To fix it, let's make powerpc select CONFIG_ARCH_HAS_GIGANTIC_PAGE with
hugetlb on powerpc, and increase the maximum folio size with hugetlb to 16
GiB on 64bit (possible on arm64 and powerpc) and 1 GiB on 32 bit
(powerpc).  Note that on some powerpc configurations, whether we actually
have gigantic pages depends on the setting of CONFIG_ARCH_FORCE_MAX_ORDER,
but there is nothing really problematic about setting it unconditionally:
we just try to keep the value small so we can better detect problems in
__dump_folio() and inconsistencies around the expected largest folio in
the system.

Ideally, we'd have a better way to obtain the maximum hugetlb folio size
and detect ourselves whether we really end up with gigantic folios.  Let's
defer bigger changes and fix the warnings first.

While at it, handle gigantic DAX folios more clearly: DAX can only end up
creating gigantic folios with HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD.

Add a new Kconfig option HAVE_GIGANTIC_FOLIOS to make both cases clearer.
In particular, worry about ARCH_HAS_GIGANTIC_PAGE only with HUGETLB_PAGE.

Note: with enabling CONFIG_ARCH_HAS_GIGANTIC_PAGE on powerpc, we will now
also allow for runtime allocations of folios in some more powerpc configs.
I don't think this is a problem, but if it is we could handle it through
__HAVE_ARCH_GIGANTIC_PAGE_RUNTIME_SUPPORTED.

While __dump_page()/__dump_folio was also problematic (not handling
dumping of tail pages of such gigantic folios correctly), it doesn't seem
critical enough to mark it as a fix.

Link: https://lkml.kernel.org/r/20251114214920.2550676-1-david@kernel.org
Fixes: 7b4f21f5e038 ("mm/hugetlb: check for unreasonable folio sizes when registering hstate")
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Closes: https://lore.kernel.org/r/3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu/
Reported-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Closes: https://lore.kernel.org/r/94377f5c-d4f0-4c0f-b0f6-5bf1cd7305b1@linux.ibm.com/
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

firewire: core: Initialize topology_map.lock

Lockdep barfs on the new uninitialized spinlock.
Initialize it.

protip: enable lockdep (CONFIG_PROVE_LOCKING=y) when
doing locking changes

firewire_ohci 0000:02:01.1: added OHCI v1.10 device as card 0, 4 IR + 4 IT contexts, quirks 0x11
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 0 UID: 0 PID: 1042 Comm: irq/17-firewire Not tainted 6.17.0-rc2-cl-bisect2-00026-g7d138cb269db #136 PREEMPT
Hardware name: Dell Inc. Latitude E5400 /0D695C, BIOS A19 06/13/2013
Call Trace:
<TASK>
dump_stack_lvl+0x6d/0xa0
register_lock_class+0x783/0x790
? find_held_lock+0x2b/0x80
? __mod_timer+0x110/0x320
? __mod_timer+0x110/0x320
__lock_acquire+0x405/0x2600
lock_acquire+0xca/0x2e0
? fw_core_handle_bus_reset+0x888/0xca0 [firewire_core]
? fw_core_handle_bus_reset+0x878/0xca0 [firewire_core]
? fw_core_handle_bus_reset+0x878/0xca0 [firewire_core]
_raw_spin_lock+0x2e/0x40
? fw_core_handle_bus_reset+0x888/0xca0 [firewire_core]
fw_core_handle_bus_reset+0x888/0xca0 [firewire_core]
handle_selfid_complete_event+0x35c/0x7a0 [firewire_ohci]
? irq_thread+0x8d/0x280
irq_thread_fn+0x18/0x50
irq_thread+0x15a/0x280
? irq_check_status_bit+0x100/0x100
? lockdep_hardirqs_on+0x78/0x100
? irq_finalize_oneshot.part.0+0xc0/0xc0
? irq_forced_thread_fn+0x60/0x60
kthread+0x114/0x200
? kthreads_online_cpu+0x110/0x110
ret_from_fork+0x158/0x1e0
? kthreads_online_cpu+0x110/0x110
ret_from_fork_asm+0x11/0x20
</TASK>

Reported-by: Erhard Furtner <erhard_f@mailbox.org>
Fixes: 7d138cb269db ("firewire: core: use spin lock specific to topology map")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp>

perf libbfd: Ensure libbfd is initialized prior to use

Multiple threads may be creating and destroying BFD objects in
situations like `perf top`.

Without appropriate initialization crashes may occur during libbfd's
cache management.

BFD's locks require recursive mutexes, add support for these.

Committer testing:

This happens only when building with 'make BUILD_NONDISTRO=1' and having
the binutils-devel package (or equivalent) installed, i.e. linking with
binutils devel files, an opt-in perf build.

Before:

  root@x1:~# perf top
  perf: Segmentation fault
  -------- backtrace --------
  <SNIP multiple failed attempts at printing a backtrace>
  root@x1:~#

After this patch it works as before.

Closes: https://lore.kernel.org/lkml/aQt66zhfxSA80xwt@gentoo.org/
Fixes: 95931d9a594dd0b5 ("perf libbfd: Move libbfd functionality to its own file")
Reported-by: Guilherme Amadio <amadio@gentoo.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf test: Fix lock contention test

Couple of independent fixes:

1. Wire in SIGSEGV handler that terminates the test with a failure code.

2. Use "--lock-cgroup" instead of "-g"; "-g" was proposed but never
   merged. See commit 4d1792d0a2564caf ("perf lock contention: Add
   --lock-cgroup option")

3. Call cleanup() on every normal exit so trap_cleanup() doesn't mistake
   it for an unexpected signal and emit a false-negative "Unexpected
   signal in main" message.

Before patch:

  # ./perf test -vv "lock contention"
   85: kernel lock contention analysis test:
  --- start ---
  test child forked, pid 610711
  Testing perf lock record and perf lock contention
  Testing perf lock contention --use-bpf
  Testing perf lock record and perf lock contention at the same time
  Testing perf lock contention --threads
  Testing perf lock contention --lock-addr
  Testing perf lock contention --lock-cgroup
  Unexpected signal in test_aggr_cgroup
  ---- end(0) ----
   85: kernel lock contention analysis test                            : Ok

After patch:

  # ./perf test -vv "lock contention"
   85: kernel lock contention analysis test:
  --- start ---
  test child forked, pid 602637
  Testing perf lock record and perf lock contention
  Testing perf lock contention --use-bpf
  Testing perf lock record and perf lock contention at the same time
  Testing perf lock contention --threads
  Testing perf lock contention --lock-addr
  Testing perf lock contention --lock-cgroup
  Testing perf lock contention --type-filter (w/ spinlock)
  Testing perf lock contention --lock-filter (w/ tasklist_lock)
  Testing perf lock contention --callstack-filter (w/ unix_stream)
  [Skip] Could not find 'unix_stream'
  Testing perf lock contention --callstack-filter with task aggregation
  [Skip] Could not find 'unix_stream'
  Testing perf lock contention --cgroup-filter
  Testing perf lock contention CSV output
  ---- end(0) ----
   85: kernel lock contention analysis test                            : Ok

Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ananth Narayan <ananth.narayan@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Santosh Shukla <santosh.shukla@amd.com>
Cc: Tycho Andersen <tycho@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf lock: Fix segfault due to missing kernel map

Kernel maps are encoded in PERF_RECORD_MMAP2 samples but "perf lock
report" and "perf lock contention" do not process MMAP2 samples.

Because of that, machine->vmlinux_map stays NULL and any later access
triggers a segmentation fault.

Fix it by adding ->mmap2() callbacks.

Fixes: 53b00ff358dc75b1 ("perf record: Make --buildid-mmap the default")
Reported-by: Tycho Andersen (AMD) <tycho@kernel.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Tested-by: Tycho Andersen (AMD) <tycho@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ananth Narayan <ananth.narayan@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Santosh Shukla <santosh.shukla@amd.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

tools headers UAPI: Sync KVM's vmx.h with the kernel to pick SEAMCALL exit reason

To pick the changes in:

  9d7dfb95da2cb5c1 ("KVM: VMX: Inject #UD if guest tries to execute SEAMCALL or TDCALL")

The 'perf kvm-stat' tool uses the exit reasons that are included in the
VMX_EXIT_REASONS define, this new SEAMCALL isn't included there (TDCALL
is), so shouldn't be causing any change in behaviour, this patch ends up
being just addressess the following perf build warning:

  Warning: Kernel ABI header differences:
    diff -u tools/arch/x86/include/uapi/asm/vmx.h arch/x86/include/uapi/asm/vmx.h

Please see tools/include/uapi/README for further details.

Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf build: Don't fail fast path feature detection when binutils-devel is not available

This is one more remnant of the BUILD_NONDISTRO series to make building
with binutils-devel opt-in due to license incompatibility.

In this case just the references at link time were still in place, which
make building the test-all.bin file fail, which wasn't detected before
probably because the last test was done with binutils-devel available,
doh.

Now:

  $ rpm -q binutils-devel
  package binutils-devel is not installed
  $ file /tmp/build/perf-tools/feature/test-all.bin
  /tmp/build/perf-tools/feature/test-all.bin: ELF 64-bit LSB executable, x86-64, version 1 (SYSV),
  dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2,
  BuildID[sha1]=4b5388a346b51f1b993f0b0dbd49f4570769b03c, for GNU/Linux 3.2.0, not stripped
  $

Fixes: 970ae86307718c34 ("perf build: The bfd features are opt-in, stop testing for them by default")
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

perf header: Write bpf_prog (infos|btfs)_cnt to data file

With commit f0d0f978f3f5830a ("perf header: Don't write empty BPF/BTF
info"), the write_bpf_( prog_info() | btf() ) functions exit without
writing anything if env->bpf_prog.(infos| btfs)_cnt is zero.

process_bpf_( prog_info() | btf() ), however, still expect a "count"
value to exist in the data file. If btf information is empty, for
example, process_bpf_btf will read garbage or some other data as the
number of btf nodes in the data file. As a result, the data file will
not be processed correctly.

Instead, write the count to the data file and exit if it is zero.

Fixes: f0d0f978f3f5830a ("perf header: Don't write empty BPF/BTF info")
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Thomas Falcon <thomas.falcon@intel.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

Merge tag 'slab-for-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab fix from Vlastimil Babka:

- Fix memory leak of objects from remote NUMA node when bulk freeing to
a cache with sheaves (Harry Yoo)

* tag 'slab-for-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slub: fix memory leak in free_to_pcs_bulk()

Merge tag 'linux_kselftest-fixes-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kselftest fix from Shuah Khan:
"Fixes event-filter-function.tc tracing test failure caused when a
  first run to sample events triggers kmem_cache_free which interferes
  with the rest of the test.

  Fix this by calling sample_events twice to eliminate the
  kmem_cache_free related noise from the sampling"

* tag 'linux_kselftest-fixes-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests/tracing: Run sample events to clear page cache events

mm/slub: fix memory leak in free_to_pcs_bulk()

The commit 989b09b73978 ("slab: skip percpu sheaves for remote object
freeing") introduced the remote_objects array in free_to_pcs_bulk() to
skip sheaves when objects from a remote node are freed.

However, the array is flushed only when:
1) the array becomes full (++remote_nr >= PCS_BATCH_MAX), or
2) slab_free_hook() returns false and size becomes zero.

When neither of the conditions is met, objects in the array are leaked.
This resulted in a memory leak [1], where 82 GiB of memory was allocated
for the maple_node cache.

Flush the array after successfully freeing objects to sheaves
in the do_free: path.

In the meantime, move the snippet if (!size) goto flush_remote; outside
the while loop for readability. Let's say all objects in the array are
from a remote node: then we acquire s->cpu_sheaves->lock and try to free
an object even when size is zero. This doesn't appear to be harmful,
but isn't really readable.

Reported-by: Tytus Rogalewski <admin@simplepod.ai>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220765 [1]
Closes: https://lore.kernel.org/linux-mm/20251107094809.12e9d705b7bf4815783eb184@linux-foundation.org
Closes: https://lore.kernel.org/all/aRGDTwbt2EIz2CYn@hyeyoo
Fixes: 989b09b73978 ("slab: skip percpu sheaves for remote object freeing")
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251111125331.12246-1-harry.yoo@oracle.com
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Tytus Rogalewski <admin@simplepod.ai>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>