net: cache snapshot entries for ndo_set_rx_mode_async

Add a per-device netdev_hw_addr_list cache (rx_mode_addr_cache) that
allows __hw_addr_list_snapshot() and __hw_addr_list_reconcile() to
reuse previously allocated entries instead of hitting GFP_ATOMIC on
every snapshot cycle.

snapshot pops entries from the cache when available, falling back to
__hw_addr_create(). reconcile splices both snapshot lists back into
the cache via __hw_addr_splice(). The cache is flushed in
free_netdev().
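
As a rough userspace illustration of the pop-or-allocate pattern described
above (not the kernel code): the struct and function names below are
hypothetical stand-ins for netdev_hw_addr, rx_mode_addr_cache and the
snapshot/reconcile helpers, and malloc() stands in for the GFP_ATOMIC
allocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct netdev_hw_addr. */
struct addr_entry {
	unsigned char addr[6];
	struct addr_entry *next;
};

/* Hypothetical stand-in for the per-device rx_mode_addr_cache. */
struct addr_cache {
	struct addr_entry *free;	/* entries available for reuse */
	unsigned long allocs;		/* how often we fell back to malloc() */
};

/* Snapshot side: reuse a cached entry if there is one, else allocate. */
static struct addr_entry *entry_get(struct addr_cache *c,
				    const unsigned char *addr)
{
	struct addr_entry *e = c->free;

	if (e) {
		c->free = e->next;
	} else {
		e = malloc(sizeof(*e));
		if (!e)
			return NULL;
		c->allocs++;
	}
	memcpy(e->addr, addr, sizeof(e->addr));
	e->next = NULL;
	return e;
}

/* Reconcile side: hand a whole snapshot list back to the cache. */
static void cache_splice(struct addr_cache *c, struct addr_entry *list)
{
	while (list) {
		struct addr_entry *n = list->next;

		list->next = c->free;
		c->free = list;
		list = n;
	}
}

/* Teardown: release everything still sitting in the cache. */
static void cache_flush(struct addr_cache *c)
{
	while (c->free) {
		struct addr_entry *n = c->free->next;

		free(c->free);
		c->free = n;
	}
}
```

After the first cycle has populated the cache, a second snapshot of the
same size should not reach the allocator at all in this model.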

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-4-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work

Add ndo_set_rx_mode_async callback that drivers can implement instead
of the legacy ndo_set_rx_mode. The legacy callback runs under the
netif_addr_lock spinlock with BHs disabled, preventing drivers from
sleeping. The async variant runs from a work queue with rtnl_lock and
netdev_lock_ops held, in fully sleepable context.

When __dev_set_rx_mode() sees ndo_set_rx_mode_async, it schedules
netdev_rx_mode_work instead of calling the driver inline. The work
function takes two snapshots of each address list (uc/mc) under
the addr_lock, then drops the lock and calls the driver with the
work copies. After the driver returns, it reconciles the snapshots
back to the real lists under the lock.

Add netif_rx_mode_sync() to opportunistically execute the pending
workqueue update inline, so that rx mode changes are committed
before returning to userspace:
  - dev_change_flags (SIOCSIFFLAGS / RTM_NEWLINK)
  - dev_set_promiscuity
  - dev_set_allmulti
  - dev_ifsioc SIOCADDMULTI / SIOCDELMULTI
  - do_setlink (RTM_SETLINK)

Note that some deep hierarchies still skip the lower updates via:
  - dev_uc_sync
  - dev_mc_sync

If we do end up hitting user-visible issues, we can add more calls to
netif_rx_mode_sync in specific places. But hopefully we should not:
the actual user-visible lists are still synced, it is just the HW state
that might be lagging.

Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-3-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: add address list snapshot and reconciliation infrastructure

Introduce __hw_addr_list_snapshot() and __hw_addr_list_reconcile()
for use by the upcoming ndo_set_rx_mode_async callback.

The async rx_mode path needs to snapshot the device's unicast and
multicast address lists under the addr_lock, hand those snapshots
to the driver (which may sleep), and then propagate any sync_cnt
changes back to the real lists. Two identical snapshots are taken:
a work copy for the driver to pass to __hw_addr_sync_dev() and a
reference copy to compute deltas against.

__hw_addr_list_reconcile() walks the reference snapshot comparing
each entry against the work snapshot to determine what the driver
synced or unsynced. It then applies those deltas to the real list,
handling concurrent modifications:

  - If the real entry was concurrently removed but the driver synced
    it to hardware (delta > 0), re-insert a stale entry so the next
    work run properly unsyncs it from hardware.
  - If the entry still exists, apply the delta normally. An entry
    whose refcount drops to zero is removed.
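
The two cases above can be modelled per entry roughly as follows. This is
a simplified sketch, not the kernel implementation: struct addr and
reconcile_one() are hypothetical, and the assumptions that the delta is
applied to both sync_cnt and refcount, and that a re-inserted stale entry
starts with both set to the delta, are choices made for this model only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced view of one address entry. */
struct addr {
	int sync_cnt;
	int refcount;
	bool present;	/* false: removed from the real list meanwhile */
};

/*
 * ref:  reference snapshot taken before the driver ran
 * work: work snapshot the driver synced/unsynced against
 * real: the live entry, possibly changed or removed concurrently
 */
static void reconcile_one(struct addr *real, const struct addr *ref,
			  const struct addr *work)
{
	int delta = work->sync_cnt - ref->sync_cnt;

	if (delta == 0)
		return;		/* driver did not touch this entry */

	if (!real->present) {
		if (delta > 0) {
			/*
			 * Removed concurrently but now programmed into
			 * hardware: bring back a stale entry so the next
			 * run can unsync it (model assumption: counts
			 * start at delta).
			 */
			real->present = true;
			real->sync_cnt = delta;
			real->refcount = delta;
		}
		return;
	}

	/* Entry still exists: apply the delta normally. */
	real->sync_cnt += delta;
	real->refcount += delta;
	if (real->refcount <= 0)
		real->present = false;	/* dropped to zero: remove */
}
```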

  # dev_addr_test_snapshot_benchmark: 1024 addrs x 1000 snapshots: 89872802 ns total, 89872 ns/iter
  # dev_addr_test_snapshot_benchmark.speed: slow

Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260416185712.2155425-2-sdf@fomichev.me
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

netfilter: nf_tables: add hook transactions for device deletions

Restore the flag that indicates that the hook is going away, i.e.
NFT_HOOK_REMOVE, but add a new transaction object to track deletion
of hooks without altering the basechain/flowtable hook_list during
the preparation phase.

The existing approach that moves the hook from the basechain/flowtable
hook_list to transaction hook_list breaks netlink dump path readers
of this RCU-protected list.

It should be possible to use an array for nft_trans_hook to store the
deleted hooks to compact the representation, but I am not expecting
many hook objects, especially now that wildcard support for devices
is in place.

Note that the nft_trans_chain_hooks() list contains a list of struct
nft_trans_hook objects for DELCHAIN and DELFLOWTABLE commands, while
this list stores struct nft_hook objects for NEWCHAIN and NEWFLOWTABLE.
Note that new commands can be updated to use nft_trans_hook for
consistency.

This patch also adapts the event notification path to deal with the list
of hook transactions.

Fixes: 7d937b107108 ("netfilter: nf_tables: support for deleting devices in an existing netdev chain")
Fixes: b6d9014a3335 ("netfilter: nf_tables: delete flowtable hooks via transaction list")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: join hook list via list_splice_rcu() in commit phase

Publish the new hooks in the list into the basechain/flowtable using
list_splice_rcu() so that netlink dump list traversal via RCU is safe
while a concurrent ruleset update is going on.

Fixes: 78d9f48f7f44 ("netfilter: nf_tables: add devices to existing flowtable")
Fixes: b9703ed44ffb ("netfilter: nf_tables: support for adding new devices to an existing netdev chain")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

rculist: add list_splice_rcu() for private lists

This patch adds a helper function, list_splice_rcu(), to safely splice
a private (non-RCU-protected) list into an RCU-protected list.

The function ensures that only the pointer visible to RCU readers
(prev->next) is updated using rcu_assign_pointer(), while the rest of
the list manipulations are performed with regular assignments, as the
source list is private and not visible to concurrent RCU readers.

This is useful for moving elements from a private list into a global
RCU-protected list, ensuring safe publication for RCU readers.
Subsystems with some sort of batching mechanism from userspace can
benefit from this new function.

The function __list_splice_rcu() has been added for clarity and to
follow the same pattern as in the existing list_splice*() interfaces,
where there is a check to ensure that the list to splice is not
empty. Note that __list_splice_rcu() has no documentation for this
reason.
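
The publication order described above can be sketched in userspace with
C11 atomics, where a release store plays the role of rcu_assign_pointer().
This is an illustration only: struct list_head is redefined locally, and
list_init(), private_add_tail() and sketch_list_splice_rcu() are
hypothetical names, not the kernel helpers or their signatures.

```c
#include <assert.h>
#include <stdatomic.h>

struct list_head {
	_Atomic(struct list_head *) next;	/* followed by lockless readers */
	struct list_head *prev;			/* only used by the writer */
};

static void list_init(struct list_head *h)
{
	atomic_init(&h->next, h);
	h->prev = h;
}

/* Append to a list no reader can see yet; relaxed stores are enough. */
static void private_add_tail(struct list_head *n, struct list_head *head)
{
	struct list_head *last = head->prev;

	n->prev = last;
	atomic_store_explicit(&n->next, head, memory_order_relaxed);
	atomic_store_explicit(&last->next, n, memory_order_relaxed);
	head->prev = n;
}

/* Splice the private list @list right after @head of a shared list. */
static void sketch_list_splice_rcu(struct list_head *list,
				   struct list_head *head)
{
	struct list_head *first =
		atomic_load_explicit(&list->next, memory_order_relaxed);
	struct list_head *last = list->prev;
	struct list_head *next =
		atomic_load_explicit(&head->next, memory_order_relaxed);

	if (first == list)
		return;		/* empty source list, nothing to do */

	/* Private side: not yet reachable by readers, plain ordering. */
	atomic_store_explicit(&last->next, next, memory_order_relaxed);
	first->prev = head;
	next->prev = last;

	/* The only reader-visible pointer: publish with release order. */
	atomic_store_explicit(&head->next, first, memory_order_release);
}
```

A reader that loads head->next and sees the new first element is then
guaranteed to also see the fully linked chain behind it. The source list
head is left untouched and must not be reused without reinitialising it.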

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: use list_del_rcu for netlink hooks

nft_netdev_unregister_hooks and __nft_unregister_flowtable_net_hooks need
to use list_del_rcu() because this list can be walked by concurrent dumpers.

Add a new helper and use it consistently.

Fixes: f9a43007d3f7 ("netfilter: nf_tables: double hook unregistration in netns path")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: arp_tables: fix IEEE1394 ARP payload parsing

Weiming Shi says:

"arp_packet_match() unconditionally parses the ARP payload assuming two
hardware addresses are present (source and target). However,
IPv4-over-IEEE1394 ARP (RFC 2734) omits the target hardware address
field, and arp_hdr_len() already accounts for this by returning a
shorter length for ARPHRD_IEEE1394 devices.

As a result, on IEEE1394 interfaces arp_packet_match() advances past a
nonexistent target hardware address and reads the wrong bytes for both
the target device address comparison and the target IP address. This
causes arptables rules to match against garbage data, leading to
incorrect filtering decisions: packets that should be accepted may be
dropped and vice versa.

The ARP stack in net/ipv4/arp.c (arp_create and arp_process) already
handles this correctly by skipping the target hardware address for
ARPHRD_IEEE1394. Apply the same pattern to arp_packet_match()."
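
To make the offset problem concrete, here is a small sketch of where the
target IP address sits in the ARP payload for both layouts. The helper
name and the IPV4_ALEN constant are illustrative only; the payload order
(sender hardware address, sender IP, target hardware address, target IP)
is the standard ARP layout, with the target hardware address left out for
IEEE1394.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define IPV4_ALEN 4	/* bytes in an IPv4 address */

/*
 * Offset of the target IP address from the start of the ARP payload
 * (the bytes following the fixed ARP header).
 */
static size_t arp_target_ip_offset(size_t hw_alen, bool ieee1394)
{
	size_t off = hw_alen + IPV4_ALEN;	/* sender hw addr + sender IP */

	if (!ieee1394)
		off += hw_alen;			/* target hw addr, if present */
	return off;
}
```

With a 16-byte hardware address, treating an IEEE1394 payload as a
standard one moves the computed target IP offset from 20 to 36, which is
the kind of misplaced read the quoted text describes.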

Mangle the original patch to always return 0 (no match) in case user
matches on the target hardware address which is never present in
IEEE1394.

Note that this returns 0 (no match) for both normal and inverse matches
because matching on the target hardware address in ARPHRD_IEEE1394 has
never been supported by arptables. This is intentional: matching on the
target hardware address should never evaluate true for ARPHRD_IEEE1394.

Moreover, adjust arpt_mangle to drop the packet too as AI suggests:

In arpt_mangle, the logic assumes a standard ARP layout. Because
IEEE1394 (FireWire) omits the target hardware address, the linear
pointer arithmetic miscalculates the offset for the target IP address.
This causes mangling operations to write to the wrong location, leading
to packet corruption. To ensure safety, this patch drops packets
(NF_DROP) when mangling is requested for these fields on IEEE1394
devices, as the current implementation cannot correctly map the FireWire
ARP payload.

This skips mangling of both the target hardware address and the target
IP address. Even if IP address mangling should be possible in IEEE1394,
it would require adjusting the arpt_mangle offset calculation, which has
never been supported.

Based on patch from Weiming Shi <bestswngs@gmail.com>.

Fixes: 6752c8db8e0c ("firewire net, ipv4 arp: Extend hardware address and remove driver-level packet inspection.")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

erofs: unify lcn as u64 for 32-bit platforms

As sashiko reported [1], `lcn` was typed as `unsigned long` (or
sometimes `unsigned int`), which is only 32 bits wide on 32-bit
platforms, so `(lcn << lclusterbits)` is truncated at 4 GiB.

In order to consolidate the logic, just use `u64` consistently
across the codebase.

[1] https://sashiko.dev/r/20260420034612.1899973-1-hsiangkao%40linux.alibaba.com

Fixes: 152a333a5895 ("staging: erofs: add compacted compression indexes support")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

erofs: fix offset truncation when shifting pgoff on 32-bit platforms

On 32-bit platforms, pgoff_t is 32 bits wide, so left-shifting
large arbitrary pgoff_t values by PAGE_SHIFT performs 32-bit arithmetic
and silently truncates the result for pages beyond the 4 GiB boundary.

Cast the page index to loff_t before shifting to produce a correct
64-bit byte offset.
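
The truncation is easy to reproduce in userspace. In this sketch
pgoff32_t models a 32-bit pgoff_t, int64_t stands in for loff_t, and
SKETCH_PAGE_SHIFT assumes 4 KiB pages; none of these names come from the
patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

typedef uint32_t pgoff32_t;	/* models pgoff_t on a 32-bit platform */

/* Shift happens in 32-bit arithmetic, then the result is widened. */
static int64_t offset_truncated(pgoff32_t index)
{
	return index << SKETCH_PAGE_SHIFT;
}

/* Widen first, then shift: the full byte offset is preserved. */
static int64_t offset_widened(pgoff32_t index)
{
	return (int64_t)index << SKETCH_PAGE_SHIFT;
}
```

For page index 0x100001 (just past 4 GiB) the first helper yields 0x1000
while the second yields 0x100001000.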

Fixes: 386292919c25 ("erofs: introduce readmore decompression strategy")
Fixes: 307210c262a2 ("erofs: verify metadata accesses for file-backed mounts")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>

erofs: fix the out-of-bounds nameoff handling for trailing dirents

Currently we already have boundary-checks for nameoffs, but the trailing
dirents are special since the namelens are calculated with strnlen()
with unchecked nameoffs.

If a crafted EROFS has a trailing dirent with nameoff >= maxsize,
maxsize - nameoff can underflow, causing strnlen() to read past the
directory block.

nameoff0 should also be verified to be a multiple of
`sizeof(struct erofs_dirent)` [1].

[1] https://sashiko.dev/#/patchset/20260416063511.3173774-1-hsiangkao%40linux.alibaba.com
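
A minimal userspace model of the two checks, for illustration only. The
helper names are hypothetical, SKETCH_DIRENT_SIZE assumes a 12-byte
struct erofs_dirent, and a hand-written bounded scan stands in for
strnlen().

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_DIRENT_SIZE 12u	/* assumed sizeof(struct erofs_dirent) */

/* nameoff0 must lie inside the block and on a dirent boundary. */
static int nameoff0_valid(unsigned int nameoff0, unsigned int maxsize)
{
	return nameoff0 >= SKETCH_DIRENT_SIZE &&
	       nameoff0 < maxsize &&
	       nameoff0 % SKETCH_DIRENT_SIZE == 0;
}

/*
 * Length of the trailing dirent's name, or -1 if nameoff is out of
 * bounds. Without the first check, maxsize - nameoff would wrap around
 * to a huge unsigned value and the scan would leave the block.
 */
static long trailing_namelen(const char *block, unsigned int nameoff,
			     unsigned int maxsize)
{
	unsigned int len = 0;

	if (nameoff >= maxsize)
		return -1;
	while (len < maxsize - nameoff && block[nameoff + len] != '\0')
		len++;
	return (long)len;
}
```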

Fixes: 3aa8ec716e52 ("staging: erofs: add directory operations")
Fixes: 33bac912840f ("staging: erofs: keep corrupted fs from crashing kernel in erofs_readdir()")
Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Reported-by: Junrui Luo <moonafterrain@outlook.com>
Closes: https://lore.kernel.org/r/A0FD7E0F-7558-49B0-8BC8-EB1ECDB2479A@outlook.com
Cc: stable@vger.kernel.org
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>

slip: bound decode() reads against the compressed packet length

slhc_uncompress() parses a VJ-compressed TCP header by advancing a
pointer through the packet via decode() and pull16(). Neither helper
bounds-checks against isize, and decode() masks its return with
& 0xffff so it can never return the -1 that callers test for -- those
error paths are dead code.

A short compressed frame whose change byte requests optional fields
lets decode() read past the end of the packet. The over-read bytes
are folded into the cached cstate and reflected into subsequent
reconstructed packets.

Make decode() and pull16() take the packet end pointer and return -1
when exhausted. Add a bounds check before the TCP-checksum read.
The existing == -1 tests now do what they were always meant to.
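
In outline, the bounded helpers could look like the sketch below. It is a
standalone approximation, not the patched slhc.c: the sketch_ prefix
marks the functions as illustrative, and the encoding modelled here is a
single non-zero byte, or a zero byte followed by a 16-bit big-endian
value.

```c
#include <assert.h>
#include <stddef.h>

/* Read a 16-bit big-endian value, or return -1 if fewer than 2 bytes. */
static long sketch_pull16(const unsigned char **cpp,
			  const unsigned char *end)
{
	long v;

	if (end - *cpp < 2)
		return -1;
	v = ((long)(*cpp)[0] << 8) | (*cpp)[1];
	*cpp += 2;
	return v;
}

/* Decode one variable-length field, or return -1 when exhausted. */
static long sketch_decode(const unsigned char **cpp,
			  const unsigned char *end)
{
	int x;

	if (*cpp >= end)
		return -1;
	x = *(*cpp)++;
	if (x == 0)
		return sketch_pull16(cpp, end);	/* may be -1 */
	return x;
}
```

Because the error value is no longer masked with 0xffff, a caller testing
for -1 can actually observe it.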

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Simon Horman <horms@kernel.org>
Closes: https://lore.kernel.org/netdev/20260414134126.758795-2-horms@kernel.org/
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260416100147.531855-5-bestswngs@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

ALSA: usb-audio/line6: Add support for POD HD PRO

The POD HD PRO is the rackmount version of the POD 500, with most of the
same behaviors. As with some of the other rackmount POD devices it will
not send captured audio to the host unless the host is sending playback
audio, so it has LINE6_CAP_IN_NEEDS_OUT in addition to the POD 500
flags.

Tested-by: Phil Willoughby <willerz@gmail.com>
Signed-off-by: Phil Willoughby <willerz@gmail.com>
Link: https://patch.msgid.link/20260420152405.7230-1-willerz@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>

ALSA: hda/realtek: Add LED fixup for HP EliteBook 6 G2a Laptops

The HP EliteBook 6 G2a laptops require the specific LED control method
ALC236_FIXUP_HP_MUTE_LED_MICMUTE_VREF to work.

Signed-off-by: Chris Chiu <chris.chiu@canonical.com>
Link: https://patch.msgid.link/20260421023429.3723154-1-chris.chiu@canonical.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>

slip: reject VJ receive packets on instances with no rstate array

slhc_init() accepts rslots == 0 as a valid configuration, with the
documented meaning of 'no receive compression'. In that case the
allocation loop in slhc_init() is skipped, so comp->rstate stays
NULL and comp->rslot_limit stays 0 (from the kzalloc of struct
slcompress).

The receive helpers do not defend against that configuration.
slhc_uncompress() dereferences comp->rstate[x] when the VJ header
carries an explicit connection ID, and slhc_remember() later assigns
cs = &comp->rstate[...] after only comparing the packet's slot number
to comp->rslot_limit. Because rslot_limit is 0, slot 0 passes the
range check, and the code dereferences a NULL rstate.

The configuration is reachable in-tree through PPP. PPPIOCSMAXCID
stores its argument in a signed int, and (val >> 16) uses arithmetic
shift. Passing 0xffff0000 therefore sign-extends to -1, so val2 + 1
is 0 and ppp_generic.c ends up calling slhc_init(0, 1). Because
/dev/ppp open is gated by ns_capable(CAP_NET_ADMIN), the whole path
is reachable from an unprivileged user namespace. Once the malformed
VJ state is installed, any inbound VJ-compressed or VJ-uncompressed
frame that selects slot 0 crashes the kernel in softirq context:

Oops: general protection fault, probably for non-canonical
       address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:slhc_uncompress (drivers/net/slip/slhc.c:519)
Call Trace:
  <TASK>
  ppp_receive_nonmp_frame (drivers/net/ppp/ppp_generic.c:2466)
  ppp_input (drivers/net/ppp/ppp_generic.c:2359)
  ppp_async_process (drivers/net/ppp/ppp_async.c:492)
  tasklet_action_common (kernel/softirq.c:926)
  handle_softirqs (kernel/softirq.c:623)
  run_ksoftirqd (kernel/softirq.c:1055)
  smpboot_thread_fn (kernel/smpboot.c:160)
  kthread (kernel/kthread.c:436)
  ret_from_fork (arch/x86/kernel/process.c:164)
  </TASK>
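
The PPPIOCSMAXCID arithmetic described above can be replayed in
userspace. The sketch below is a reconstruction for illustration: the
struct and function names are hypothetical, the default of 15 for val2
is an assumption about the surrounding ioctl code, and it relies on the
compiler wrapping the unsigned-to-signed conversion and using an
arithmetic right shift for negative values, as GCC and Clang do.

```c
#include <assert.h>
#include <stdint.h>

struct vj_slots {
	int rslots;
	int tslots;
};

/* Model of how the ioctl argument is split into slhc_init() arguments. */
static struct vj_slots maxcid_to_slots(uint32_t arg)
{
	int32_t val = (int32_t)arg;	/* stored in a signed int */
	int32_t val2 = 15;		/* assumed default */

	if ((val >> 16) != 0) {
		val2 = val >> 16;	/* arithmetic shift: 0xffff -> -1 */
		val &= 0xffff;
	}
	return (struct vj_slots){ .rslots = val2 + 1, .tslots = val + 1 };
}
```

Feeding 0xffff0000 into this model gives rslots == 0 and tslots == 1,
matching the slhc_init(0, 1) call mentioned above.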

Reject the receive side on such instances instead of touching rstate.
slhc_uncompress() falls through to its existing 'bad' label, which
bumps sls_i_error and enters the toss state. slhc_remember() mirrors
that with an explicit sls_i_error increment followed by slhc_toss();
the sls_i_runt counter is not used here because a missing rstate is
an internal configuration state, not a runt packet.

The transmit path is unaffected: the only in-tree caller that picks
rslots from userspace (ppp_generic.c) still supplies tslots >= 1, and
slip.c always calls slhc_init(16, 16), so comp->tstate remains valid
and slhc_compress() continues to work.

Fixes: 4ab42d78e37a ("ppp, slip: Validate VJ compression slot parameters completely")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260415204130.258866-2-bestswngs@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

rhashtable: Bounce deferred worker kick through irq_work

Inserts past 75% load call schedule_work(&ht->run_work) to kick an
async resize. If a caller holds a raw spinlock (e.g. an
insecure_elasticity user), schedule_work() under that lock records

  caller_lock -> pool->lock -> pi_lock -> rq->__lock

A cycle forms if any of these locks is acquired in the reverse
direction elsewhere. sched_ext, the only current insecure_elasticity
user, hits this: it holds scx_sched_lock across rhashtable inserts of
sub-schedulers, while scx_bypass() takes rq->__lock -> scx_sched_lock.
Exercising the resize path produces:

  Chain exists of:
    &pool->lock --> &rq->__lock --> scx_sched_lock

Bounce the kick from the insert paths through irq_work so
schedule_work() runs from hard IRQ context with the caller's lock no
longer held. rht_deferred_worker()'s self-rearm on error stays on
schedule_work(&ht->run_work) - the worker runs in process context with
no caller lock held, and keeping the self-requeue on @run_work lets
cancel_work_sync() in rhashtable_free_and_destroy() drain it.

v3: Keep rht_deferred_worker()'s self-rearm on schedule_work(&run_work).
    Routing it through irq_work in v2 broke cancel_work_sync()'s
    self-requeue handling - an irq_work queued after irq_work_sync()
    returned but while cancel_work_sync() was still waiting could fire
    post-teardown.

v2: Bounce unconditionally instead of gating on insecure_elasticity,
    as suggested by Herbert.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

scsi: hisi_sas: Fix sparse warnings in prep_ata_v3_hw()

In prep_ata_v3_hw(), add cpu_to_le32() to fix warning:

  drivers/scsi/hisi_sas/hisi_sas_v3_hw.c:1448:26: sparse: sparse: invalid assignment: |=
  drivers/scsi/hisi_sas/hisi_sas_v3_hw.c:1448:26: sparse:    left side has type restricted __le32
  drivers/scsi/hisi_sas/hisi_sas_v3_hw.c:1448:26: sparse:    right side has type unsigned int

Fixes: 8aa580cd9284 ("scsi: hisi_sas: Enable force phy when SATA disk directly connected")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604191850.IVYPTaML-lkp@intel.com/
Signed-off-by: Yihang Li <liyihang9@huawei.com>
Link: https://patch.msgid.link/20260420021044.3339459-1-liyihang9@huawei.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

scsi: pmcraid: Fix typo in comments

Fix typo in structure comment.

Signed-off-by: Hugo Villeneuve <hvilleneuve@dimonoff.com>
Link: https://patch.msgid.link/20260417200738.3920001-1-hugo@hugovil.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

scsi: scsi_dh_alua: Increase default ALUA timeout to maximum spec value

The ALUA handler maps a 0 value (no implicit transition timeout provided
by the target) to the ALUA_FAILOVER_TIMEOUT constant, currently 60
seconds. This means the kernel already does not accept an infinite
transition time.

However, 60 seconds is insufficient for some arrays that may take longer
to complete ALUA transitions. Since the highest value allowed by the
SCSI specification for the implicit transition timeout is a single byte
(255 seconds), change the default to 255. This way, when a target does
not provide an explicit transition timeout, we default to the maximum
value the spec allows rather than an arbitrary 60 second limit.

Co-developed-by: Krishna Kant <krishna.kant@purestorage.com>
Signed-off-by: Krishna Kant <krishna.kant@purestorage.com>
Co-developed-by: Riya Savla <rsavla@purestorage.com>
Signed-off-by: Riya Savla <rsavla@purestorage.com>
Signed-off-by: Brian Bunker <brian@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20260416165512.26497-2-brian@purestorage.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

scsi: smartpqi: Silence a recursive lock warning

On systems with multiple controllers, a debug kernel shows

WARNING: possible recursive locking detected

during shutdown.

Each controller has its own ctrl_info (and mutex), which the debug
kernel does not correctly recognize. Suppress the warning by releasing
the mutex at the end of pqi_shutdown().

Signed-off-by: Tomas Henzl <thenzl@redhat.com>
Acked-by: Don Brace <don.brace@microchip.com>
Link: https://patch.msgid.link/20260414124118.23661-1-thenzl@redhat.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

scsi: mpt3sas: Limit NVMe request size to 2 MiB

The HBA firmware reports NVMe MDTS values based on the underlying drive
capability. However, because the driver allocates a fixed 4K buffer for
the PRP list, accommodating at most 512 entries, the driver supports a
maximum I/O transfer size of 2 MiB.

Limit max_hw_sectors to the smaller of the reported MDTS and the 2 MiB
driver limit to prevent issuing oversized I/O that may lead to a kernel
oops.
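
The 2 MiB figure follows from the buffer size. A worked version of the
arithmetic, with hypothetical names, assuming 8-byte PRP entries, 4 KiB
covered per entry, and 512-byte sectors:

```c
#include <assert.h>
#include <stdint.h>

#define PRP_LIST_BYTES	4096u	/* fixed PRP list buffer (from the text) */
#define PRP_ENTRY_BYTES	8u	/* assumption: one 64-bit pointer per entry */
#define PRP_PAGE_BYTES	4096u	/* assumption: each entry maps 4 KiB */
#define SECTOR_BYTES	512u

/* 4096 / 8 = 512 entries, 512 * 4 KiB = 2 MiB = 4096 sectors. */
static uint32_t driver_limit_sectors(void)
{
	uint32_t entries = PRP_LIST_BYTES / PRP_ENTRY_BYTES;

	return entries * (PRP_PAGE_BYTES / SECTOR_BYTES);
}

/* Use the smaller of the firmware-reported MDTS and the driver limit. */
static uint32_t clamp_max_hw_sectors(uint32_t mdts_sectors)
{
	uint32_t limit = driver_limit_sectors();

	return mdts_sectors < limit ? mdts_sectors : limit;
}
```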

Cc: stable@vger.kernel.org
Fixes: 9b8b84879d4a ("block: Increase BLK_DEF_MAX_SECTORS_CAP")
Reported-by: Mira Limbeck <m.limbeck@proxmox.com>
Closes: https://lore.kernel.org/r/291f78bf-4b4a-40dd-867d-053b36c564b3@proxmox.com
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9b8b84879d4a
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Ranjan Kumar <ranjan.kumar@broadcom.com>
Tested-by: Mira Limbeck <m.limbeck@proxmox.com>
Link: https://patch.msgid.link/20260414110811.85156-1-ranjan.kumar@broadcom.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

scsi: sg: Don't use GFP_ATOMIC in sg_start_req()

sg_start_req() is called from normal user context and can sleep when
waiting for memory. Switch it to use GFP_KERNEL, which fixes allocation
failures seen with the bio_alloc rework.

Fixes: b520c4eef83d ("block: split bio_alloc_bioset more clearly into a fast and slowpath")
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20260415060813.807659-2-hch@lst.de
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

btrfs: fix double-decrement of bytes_may_use in submit_one_async_extent()

submit_one_async_extent() calls btrfs_reserve_extent(), which decrements
bytes_may_use. If the call btrfs_create_io_em() fails, we jump to
out_free_reserve, which calls extent_clear_unlock_delalloc().

Because we're specifying EXTENT_DO_ACCOUNTING, i.e.
EXTENT_CLEAR_META_RESV | EXTENT_CLEAR_DATA_RESV, this decreases
bytes_may_use again. This can lead to problems later on, as an initial
write can fail only for the writeback to silently ENOSPC.

Fix this by replacing EXTENT_DO_ACCOUNTING with EXTENT_CLEAR_META_RESV.
This parallels a4fe134fc1d8eb ("btrfs: fix a double release on reserved
extents in cow_one_range()"), which is the same fix in cow_one_range().

Fixes: 151a41bc46df ("Btrfs: fix what bits we clear when erroring out from delalloc")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: check return value of btrfs_partially_delete_raid_extent()

btrfs_partially_delete_raid_extent() returns an error code (e.g.
-ENOMEM from kzalloc(), or errors from btrfs_del_item/btrfs_insert_item()),
but all three call sites in btrfs_delete_raid_extent() discard the
return value, silently losing errors and potentially leaving the stripe
tree in an inconsistent state.

Fix by capturing the return value into ret at all three call sites and
breaking out of the loop on error where appropriate.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: robbieko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: handle -EAGAIN from btrfs_duplicate_item and refresh stale leaf pointer

In the 'punch a hole' case of btrfs_delete_raid_extent(),
btrfs_duplicate_item() can return -EAGAIN when the leaf needs to be
split and the path becomes invalid. The old code treats any error as
fatal and breaks out of the loop.

Additionally, btrfs_duplicate_item() may trigger setup_leaf_for_split()
which can reallocate the leaf node. The code continues using the old
leaf pointer, leading to use-after-free or stale data access.

Fix both issues by:

- Handling -EAGAIN specifically: release the path and retry the loop.
- Refreshing leaf = path->nodes[0] after successful duplication.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: robbieko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: replace ASSERT with proper error handling in stripe lookup fallback

After falling back to the previous item in btrfs_delete_raid_extent(),
the code uses ASSERT(found_start <= start) to verify the found extent
actually precedes our target range. If the B-tree state is unexpected
(e.g. no overlapping extent exists), this triggers a kernel BUG/panic
in debug builds, or silently continues with wrong data otherwise.

Replace the ASSERT with a proper bounds check that returns -ENOENT if
the found extent does not actually overlap with the start position.

Signed-off-by: robbieko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix wrong min_objectid in btrfs_previous_item() call

When found_start > start and slot == 0, btrfs_previous_item() is called
with min_objectid=start to find the previous stripe extent. However, the
previous stripe extent we are looking for has objectid < start (it starts
before our deletion range), so passing start as min_objectid prevents
finding it.

Fix by passing 0 as min_objectid to allow finding any preceding stripe
extent regardless of its objectid.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: robbieko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix raid stripe search missing entries at leaf boundaries

In btrfs_delete_raid_extent(), the search key uses offset=0. When the
target stripe entry is the first item on a leaf, btrfs_search_slot()
may land on the previous leaf, and decrementing the slot from nritems
still points to the wrong entry, causing the stripe extent to be
silently missed.

Fix this by searching with offset=(u64)-1 instead. Since no real stripe
entry has this offset, btrfs_search_slot() always returns 1 with the
slot pointing past the last matching objectid entry. Then unconditionally
decrement the slot with a proper slots[0]==0 early-exit check to handle
the case where no matching entry exists.
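
The "search past the end, then step back" idea can be shown on a flat
sorted array. This is only a model: it ignores the key type field and the
B-tree leaf structure, and struct key, search_slot() and find_stripe()
are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct key {
	uint64_t objectid;
	uint64_t offset;
};

static int key_cmp(const struct key *a, const struct key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* First slot whose key is >= target (n if there is none). */
static size_t search_slot(const struct key *keys, size_t n,
			  const struct key *target)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (key_cmp(&keys[mid], target) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Slot of the last entry with objectid <= start, or -1 if none.
 * No real entry has offset == UINT64_MAX, so the search never hits
 * an exact match and always lands one past the entry we want.
 */
static long find_stripe(const struct key *keys, size_t n, uint64_t start)
{
	struct key target = { start, UINT64_MAX };
	size_t slot = search_slot(keys, n, &target);

	if (slot == 0)
		return -1;	/* nothing at or before start */
	return (long)(slot - 1);
}
```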

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: robbieko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: copy devid in btrfs_partially_delete_raid_extent()

When btrfs_partially_delete_raid_extent() rebuilds a truncated/shifted
stripe extent into newitem, the loop copies the physical address for
each stride but forgets to copy the devid. The resulting item written
back to the stripe tree has zeroed-out devids, corrupting the stripe
mapping.

Fix this by reading the devid with btrfs_raid_stride_devid() and
writing it into the new item with btrfs_set_stack_raid_stride_devid()
before copying the physical address.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: robbieko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: handle unexpected free-space-tree key types

Replace the conditional assertions with proper error handling and
transaction abort if we find an unexpected key type in the free space
tree.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix missing last_unlink_trans update when removing a directory

When removing a directory we are not updating its last_unlink_trans field,
which can result in incorrect fsync behaviour if someone fsyncs the
directory after it was removed while still holding a file descriptor on
it.

Example scenario:

   mkdir /mnt/dir1
   mkdir /mnt/dir1/dir2
   mkdir /mnt/dir3

   sync -f /mnt

   # Do some change to the directory and fsync it.
   chmod 700 /mnt/dir1
   xfs_io -c fsync /mnt/dir1

   # Move dir2 out of dir1 so that dir1 becomes empty.
   mv /mnt/dir1/dir2 /mnt/dir3/

   open fd on /mnt/dir1
   call rmdir(2) on path "/mnt/dir1"
   fsync fd

   <trigger power failure>

When attempting to mount the filesystem, the log replay will fail with
an -EIO error and dmesg/syslog has the following:

   [445771.626482] BTRFS info (device dm-0): first mount of filesystem 0368bbea-6c5e-44b5-b409-09abe496e650
   [445771.626486] BTRFS info (device dm-0): using crc32c checksum algorithm
   [445771.627912] BTRFS info (device dm-0): start tree-log replay
   [445771.628335] page: refcount:2 mapcount:0 mapping:0000000061443ddc index:0x1d00 pfn:0x7072a5
   [445771.629453] memcg:ffff89f400351b00
   [445771.629892] aops:btree_aops [btrfs] ino:1
   [445771.630737] flags: 0x17fffc00000402a(uptodate|lru|private|writeback|node=0|zone=2|lastcpupid=0x1ffff)
   [445771.632359] raw: 017fffc00000402a fffff47284d950c8 fffff472907b7c08 ffff89f458e412b8
   [445771.633713] raw: 0000000000001d00 ffff89f6c51d1a90 00000002ffffffff ffff89f400351b00
   [445771.635029] page dumped because: eb page dump
   [445771.635825] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=10 ino=258, invalid nlink: has 2 expect no more than 1 for dir
   [445771.638088] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14878 owner 5
   [445771.638091] BTRFS info (device dm-0): refs 4 lock_owner 0 current 3581087
   [445771.638094] item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
   [445771.638097] inode generation 3 transid 9 size 16 nbytes 16384
   [445771.638098] block group 0 mode 40755 links 1 uid 0 gid 0
   [445771.638100] rdev 0 sequence 2 flags 0x0
   [445771.638102] atime 1775744884.0
   [445771.660056] ctime 1775744885.645502983
   [445771.660058] mtime 1775744885.645502983
   [445771.660060] otime 1775744884.0
   [445771.660062] item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
   [445771.660064] index 0 name_len 2
   [445771.660066] item 2 key (256 DIR_ITEM 1843588421) itemoff 16077 itemsize 34
   [445771.660068] location key (259 1 0) type 2
   [445771.660070] transid 9 data_len 0 name_len 4
   [445771.660075] item 3 key (256 DIR_ITEM 2363071922) itemoff 16043 itemsize 34
   [445771.660076] location key (257 1 0) type 2
   [445771.660077] transid 9 data_len 0 name_len 4
   [445771.660078] item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34
   [445771.660079] location key (257 1 0) type 2
   [445771.660080] transid 9 data_len 0 name_len 4
   [445771.660081] item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34
   [445771.660082] location key (259 1 0) type 2
   [445771.660083] transid 9 data_len 0 name_len 4
   [445771.660084] item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160
   [445771.660086] inode generation 9 transid 9 size 8 nbytes 0
   [445771.660087] block group 0 mode 40777 links 1 uid 0 gid 0
   [445771.660088] rdev 0 sequence 2 flags 0x0
   [445771.660089] atime 1775744885.641174097
   [445771.660090] ctime 1775744885.645502983
   [445771.660091] mtime 1775744885.645502983
   [445771.660105] otime 1775744885.641174097
   [445771.660106] item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14
   [445771.660107] index 2 name_len 4
   [445771.660108] item 8 key (257 DIR_ITEM 2676584006) itemoff 15767 itemsize 34
   [445771.660109] location key (258 1 0) type 2
   [445771.660110] transid 9 data_len 0 name_len 4
   [445771.660111] item 9 key (257 DIR_INDEX 2) itemoff 15733 itemsize 34
   [445771.660112] location key (258 1 0) type 2
   [445771.660113] transid 9 data_len 0 name_len 4
   [445771.660114] item 10 key (258 INODE_ITEM 0) itemoff 15573 itemsize 160
   [445771.660115] inode generation 9 transid 10 size 0 nbytes 0
   [445771.660116] block group 0 mode 40755 links 2 uid 0 gid 0
   [445771.660117] rdev 0 sequence 0 flags 0x0
   [445771.660118] atime 1775744885.645502983
   [445771.660119] ctime 1775744885.645502983
   [445771.660120] mtime 1775744885.645502983
   [445771.660121] otime 1775744885.645502983
   [445771.660122] item 11 key (258 INODE_REF 257) itemoff 15559 itemsize 14
   [445771.660123] index 2 name_len 4
   [445771.660124] item 12 key (258 INODE_REF 259) itemoff 15545 itemsize 14
   [445771.660125] index 2 name_len 4
   [445771.660126] item 13 key (259 INODE_ITEM 0) itemoff 15385 itemsize 160
   [445771.660127] inode generation 9 transid 10 size 8 nbytes 0
   [445771.660128] block group 0 mode 40755 links 1 uid 0 gid 0
   [445771.660129] rdev 0 sequence 1 flags 0x0
   [445771.660130] atime 1775744885.645502983
   [445771.660130] ctime 1775744885.645502983
   [445771.660131] mtime 1775744885.645502983
   [445771.660132] otime 1775744885.645502983
   [445771.660133] item 14 key (259 INODE_REF 256) itemoff 15371 itemsize 14
   [445771.660134] index 3 name_len 4
   [445771.660135] item 15 key (259 DIR_ITEM 2676584006) itemoff 15337 itemsize 34
   [445771.660136] location key (258 1 0) type 2
   [445771.660137] transid 10 data_len 0 name_len 4
   [445771.660138] item 16 key (259 DIR_INDEX 2) itemoff 15303 itemsize 34
   [445771.660139] location key (258 1 0) type 2
   [445771.660140] transid 10 data_len 0 name_len 4
   [445771.660144] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected
   [445771.661650] ------------[ cut here ]------------
   [445771.662358] WARNING: fs/btrfs/disk-io.c:326 at btree_csum_one_bio+0x217/0x230 [btrfs], CPU#8: mount/3581087
   [445771.663588] Modules linked in: btrfs f2fs xfs (...)
   [445771.671229] CPU: 8 UID: 0 PID: 3581087 Comm: mount Tainted: G        W           7.0.0-rc6-btrfs-next-230+ #2 PREEMPT(full)
   [445771.672575] Tainted: [W]=WARN
   [445771.672987] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
   [445771.674460] RIP: 0010:btree_csum_one_bio+0x217/0x230 [btrfs]
   [445771.675222] Code: 89 44 24 (...)
   [445771.677364] RSP: 0018:ffffd23882247660 EFLAGS: 00010246
   [445771.678029] RAX: 0000000000000000 RBX: ffff89f6c51d1a90 RCX: 0000000000000000
   [445771.678975] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff89f406020000
   [445771.679983] RBP: ffff89f821204000 R08: 0000000000000000 R09: 00000000ffefffff
   [445771.680905] R10: ffffd23882247448 R11: 0000000000000003 R12: ffffd23882247668
   [445771.681978] R13: ffff89f458e40fc0 R14: ffff89f737f4f500 R15: ffff89f737f4f500
   [445771.682912] FS:  00007f0447a98840(0000) GS:ffff89fb9771d000(0000) knlGS:0000000000000000
   [445771.684393] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [445771.685230] CR2: 00007f0447bf1330 CR3: 000000017cb02002 CR4: 0000000000370ef0
   [445771.686273] Call Trace:
   [445771.686646]  <TASK>
   [445771.686969]  btrfs_submit_bbio+0x83f/0x860 [btrfs]
   [445771.687750]  ? write_one_eb+0x28f/0x340 [btrfs]
   [445771.688428]  btree_writepages+0x2e3/0x550 [btrfs]
   [445771.689180]  ? kmem_cache_alloc_noprof+0x12a/0x490
   [445771.689963]  ? alloc_extent_state+0x19/0x120 [btrfs]
   [445771.690801]  ? kmem_cache_free+0x135/0x380
   [445771.691328]  ? preempt_count_add+0x69/0xa0
   [445771.691831]  ? set_extent_bit+0x252/0x8e0 [btrfs]
   [445771.692468]  ? xas_load+0x9/0xc0
   [445771.692873]  ? xas_find+0x14d/0x1a0
   [445771.693304]  do_writepages+0xc6/0x160
   [445771.693756]  filemap_writeback+0xb8/0xe0
   [445771.694274]  btrfs_write_marked_extents+0x61/0x170 [btrfs]
   [445771.694999]  btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs]
   [445771.695818]  btrfs_commit_transaction+0x5c8/0xd10 [btrfs]
   [445771.696530]  ? kmem_cache_free+0x135/0x380
   [445771.697120]  ? release_extent_buffer+0x34/0x160 [btrfs]
   [445771.697786]  btrfs_recover_log_trees+0x7be/0x7e0 [btrfs]
   [445771.698525]  ? __pfx_replay_one_buffer+0x10/0x10 [btrfs]
   [445771.699206]  open_ctree+0x11e5/0x1810 [btrfs]
   [445771.699776]  btrfs_get_tree.cold+0xb/0x162 [btrfs]
   [445771.700463]  ? fscontext_read+0x165/0x180
   [445771.701146]  ? rw_verify_area+0x50/0x180
   [445771.701866]  vfs_get_tree+0x25/0xd0
   [445771.702491]  vfs_cmd_create+0x59/0xe0
   [445771.703125]  __do_sys_fsconfig+0x303/0x610
   [445771.703603]  do_syscall_64+0xe9/0xf20
   [445771.703974]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [445771.704700] RIP: 0033:0x7f0447cbd4aa
   [445771.705108] Code: 73 01 c3 (...)
   [445771.707263] RSP: 002b:00007ffc4e528318 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
   [445771.708107] RAX: ffffffffffffffda RBX: 00005561585d8c20 RCX: 00007f0447cbd4aa
   [445771.708931] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
   [445771.709744] RBP: 00005561585d9120 R08: 0000000000000000 R09: 0000000000000000
   [445771.710674] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   [445771.711477] R13: 00007f0447e4f580 R14: 00007f0447e5126c R15: 00007f0447e36a23
   [445771.712277]  </TASK>
   [445771.712541] ---[ end trace 0000000000000000 ]---
   [445771.713382] BTRFS error (device dm-0): error while writing out transaction: -5
   [445771.714679] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
   [445771.715562] BTRFS error (device dm-0 state A): Transaction aborted (error -5)
   [445771.716459] BTRFS: error (device dm-0 state A) in cleanup_transaction:2068: errno=-5 IO failure
   [445771.717936] BTRFS error (device dm-0 state EA): failed to recover log trees with error: -5
   [445771.719681] BTRFS error (device dm-0 state EA): open_ctree failed: -5

The problem is that such a fsync should have resulted in a fallback to a
transaction commit, but that did not happen because in btrfs_rmdir() we
never update the directory's last_unlink_trans field.
Any inode that had a link removed must have its last_unlink_trans updated
to the ID of transaction used for the operation, otherwise fsync and log
replay will not work correctly.

btrfs_rmdir() calls btrfs_unlink_inode() and through that call chain we
never call btrfs_record_unlink_dir() in order to update last_unlink_trans.
However btrfs_unlink(), which is used for unlinking regular files, calls
btrfs_record_unlink_dir() and then calls btrfs_unlink_inode(). So fix
this by moving the call to btrfs_record_unlink_dir() from btrfs_unlink()
to btrfs_unlink_inode().
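
The effect of moving the call can be sketched with a toy model. This is
only an illustration under stated assumptions: the demo_* names and the
'fixed' switch are hypothetical, and the real functions take a transaction
handle and do far more work than setting one field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_inode {
	uint64_t last_unlink_trans;
};

/* Toy version of the bookkeeping done by btrfs_record_unlink_dir(). */
static void demo_record_unlink(struct demo_inode *inode, uint64_t transid)
{
	inode->last_unlink_trans = transid;
}

/* Shared helper; 'fixed' selects whether it does the recording itself. */
static void demo_unlink_inode(struct demo_inode *inode, uint64_t transid, bool fixed)
{
	if (fixed)
		demo_record_unlink(inode, transid);
	/* ... removal of the directory entry would happen here ... */
}

/* File path: before the fix, the caller recorded the unlink itself. */
uint64_t demo_unlink(uint64_t transid, bool fixed)
{
	struct demo_inode inode = { 0 };

	if (!fixed)
		demo_record_unlink(&inode, transid);
	demo_unlink_inode(&inode, transid, fixed);
	return inode.last_unlink_trans;
}

/* Directory path: never recorded anything on its own. */
uint64_t demo_rmdir(uint64_t transid, bool fixed)
{
	struct demo_inode inode = { 0 };

	demo_unlink_inode(&inode, transid, fixed);
	return inode.last_unlink_trans;
}

/* Usage: only the unfixed rmdir path leaves the field stale. */
void demo_usage(void)
{
	assert(demo_unlink(10, false) == 10);
	assert(demo_rmdir(10, false) == 0);
	assert(demo_rmdir(10, true) == 10);
}
```

With the recording done inside the shared helper, every caller that
removes a link gets the update without having to remember it.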

A test case for fstests will follow soon.

Reported-by: Slava0135 <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAAJYhww5ov62Hm+n+tmhcL-e_4cBobg+OWogKjOJxVUXivC=MQ@mail.gmail.com/
CC: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: don't clobber errors in add_remap_tree_entries()

In add_remap_tree_entries(), we only process a certain number of entries
at a time, meaning we may need to loop.

But because we weren't checking the return value of btrfs_insert_empty_items()
within the loop, this meant that if the last iteration of the loop
succeeded but a previous iteration failed, we were erroneously returning
0.

Fix this by breaking the loop early if btrfs_insert_empty_items() fails.
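
A minimal sketch of the two loop shapes, assuming a hypothetical
demo_insert_batch() in place of btrfs_insert_empty_items(). The batch
count and the failing index are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for btrfs_insert_empty_items(): batch 'fail_at' fails. */
static int demo_insert_batch(size_t batch, size_t fail_at)
{
	return batch == fail_at ? -ENOMEM : 0;
}

/* Old shape: ret is overwritten on every pass, so only the last batch counts. */
int demo_add_entries_clobbering(size_t nr_batches, size_t fail_at)
{
	int ret = 0;

	for (size_t i = 0; i < nr_batches; i++)
		ret = demo_insert_batch(i, fail_at);

	return ret;
}

/* Fixed shape: leave the loop at the first failure so the error survives. */
int demo_add_entries_checked(size_t nr_batches, size_t fail_at)
{
	int ret = 0;

	for (size_t i = 0; i < nr_batches; i++) {
		ret = demo_insert_batch(i, fail_at);
		if (ret)
			break;
	}

	return ret;
}

/* Usage: the first of three batches fails. */
void demo_usage(void)
{
	assert(demo_add_entries_clobbering(3, 0) == 0);		/* error lost */
	assert(demo_add_entries_checked(3, 0) == -ENOMEM);	/* error kept */
}
```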

Fixes: b56f35560b82 ("btrfs: handle setting up relocation of block group with remap-tree")
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: enable shutdown ioctl for non-experimental builds

Although commit 304076527c38 ("btrfs: move shutdown and remove_bdev
callbacks out of experimental features") tries to move both shutdown and
remove_bdev out of experimental features, that commit has only addressed
the super block operation callback, the ioctl one is left untouched.

Fix that missing aspect by also moving shutdown ioctl out of
experimental features.

Since we're here, also add unknown flag detection to reject any
unsupported shutdown flags.
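
Unknown flag detection commonly looks like the sketch below. The flag
names, their values and the error code are hypothetical and are not the
btrfs UAPI definitions; the actual check in the ioctl may be shaped
differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits, for illustration only. */
#define DEMO_SHUTDOWN_LOGFLUSH		(1U << 0)
#define DEMO_SHUTDOWN_NOLOGFLUSH	(1U << 1)
#define DEMO_SHUTDOWN_KNOWN \
	(DEMO_SHUTDOWN_LOGFLUSH | DEMO_SHUTDOWN_NOLOGFLUSH)

/* Reject any bit outside the known set before acting on the request. */
int demo_shutdown(uint32_t flags)
{
	if (flags & ~DEMO_SHUTDOWN_KNOWN)
		return -EINVAL;

	/* ... the actual shutdown would start here ... */
	return 0;
}

/* Usage: a known flag passes, a stray bit is refused. */
void demo_usage(void)
{
	assert(demo_shutdown(DEMO_SHUTDOWN_LOGFLUSH) == 0);
	assert(demo_shutdown(1U << 7) == -EINVAL);
}
```

Rejecting unknown bits up front keeps them available for future use,
since no existing caller can have been relying on them being ignored.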

Fixes: 304076527c38 ("btrfs: move shutdown and remove_bdev callbacks out of experimental features")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: apply first key check for readahead when possible

Currently for tree block readahead we never pass a
btrfs_tree_parent_check with @has_first_key set.

Without @has_first_key set, btrfs will skip the following extra
checks:

- Header generation check
  This is a minor one.

- Empty leaf/node checks
  This is more serious. Certain trees, like the csum tree, are allowed
  to be empty, thus an empty leaf can pass the tree checker.
  But if there is a parent node for such an empty leaf, it indicates
  corruption.

  Without @has_first_key set, we can no longer detect such a problem.

  In fact there is already a fuzzed image report that a corrupted csum
  leaf which has zero nritems but still has a parent node can trigger
  a BUG_ON() during csum deletion.

However there are only two call sites of btrfs_readahead_tree_block():

- Inside relocate_tree_blocks()
  At this call site we are trying to grab the first key of the tree
  block, thus we are not able to pass a @first_key parameter.

- Inside btrfs_readahead_node_child()
  This is the more common call site, where we have the parent node and
  want to readahead the child tree blocks.

  In this case we can easily grab the node key and pass it for checks.

Add a new parameter @first_key to btrfs_readahead_tree_block() and pass
the node key to it inside btrfs_readahead_node_child().

This should plug the gap in empty leaf detection during readahead.
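
The gap can be illustrated with a toy verifier. The struct layouts and
demo_* names are hypothetical simplifications (keys are reduced to a
single integer), not the btrfs_tree_parent_check definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_parent_check {
	bool has_first_key;
	uint64_t first_key;	/* key the parent node records for this child */
};

struct demo_tree_block {
	uint32_t nritems;
	uint64_t first_key;	/* only meaningful when nritems > 0 */
};

/* Returns true when the block passes the parent-derived checks. */
bool demo_verify_block(const struct demo_tree_block *eb,
		       const struct demo_parent_check *check)
{
	if (!check->has_first_key)
		return true;	/* nothing to compare against, checks skipped */
	if (eb->nritems == 0)
		return false;	/* a parent points here, so it cannot be empty */
	return eb->first_key == check->first_key;
}

/* Does an empty leaf that has a parent slip through? */
bool demo_empty_leaf_accepted(bool has_first_key)
{
	struct demo_tree_block empty_leaf = { .nritems = 0, .first_key = 0 };
	struct demo_parent_check check = {
		.has_first_key = has_first_key,
		.first_key = 256,
	};

	return demo_verify_block(&empty_leaf, &check);
}

/* Usage: the empty leaf is only caught once the parent key is passed. */
void demo_usage(void)
{
	assert(demo_empty_leaf_accepted(false));
	assert(!demo_empty_leaf_accepted(true));
}
```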

Link: https://lore.kernel.org/linux-btrfs/20260409071255.3358044-1-gality369@gmail.com/
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: abort transaction in do_remap_reloc_trans() on failure

If one of the calls made by do_remap_reloc_trans() fails, we can leave
the remap tree in an inconsistent state. Abort the transaction if this
happens, to prevent the corrupt state from reaching the disk.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix bytes_may_use leak in do_remap_reloc_trans()

If the call to btrfs_reserve_extent() in do_remap_reloc_trans() returns
a smaller extent than we asked for, currently we're not undoing the
bytes_may_use change that we made. Fix this by calling
btrfs_space_info_update_bytes_may_use() again for the difference.
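
The accounting can be sketched with two counters. This is a toy model:
the demo_* helpers are hypothetical, and it assumes that a successful
reservation moves the granted bytes from bytes_may_use to bytes_reserved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy counters; the real struct btrfs_space_info tracks much more. */
struct demo_space_info {
	int64_t bytes_may_use;
	int64_t bytes_reserved;
};

static void demo_update_bytes_may_use(struct demo_space_info *s, int64_t delta)
{
	s->bytes_may_use += delta;
}

/*
 * Hypothetical allocator: grants at most 'avail' bytes and moves them
 * from bytes_may_use to bytes_reserved.
 */
static int64_t demo_reserve_extent(struct demo_space_info *s,
				   int64_t wanted, int64_t avail)
{
	int64_t got = wanted < avail ? wanted : avail;

	s->bytes_may_use -= got;
	s->bytes_reserved += got;
	return got;
}

/* Returns the bytes_may_use left behind after one reservation attempt. */
int64_t demo_leftover_may_use(int64_t wanted, int64_t avail, bool give_back)
{
	struct demo_space_info s = { 0, 0 };
	int64_t got;

	demo_update_bytes_may_use(&s, wanted);
	got = demo_reserve_extent(&s, wanted, avail);
	if (give_back && got < wanted)
		demo_update_bytes_may_use(&s, -(wanted - got));

	return s.bytes_may_use;
}

/* Usage: ask for 16 KiB, get 4 KiB; 12 KiB leaks unless handed back. */
void demo_usage(void)
{
	assert(demo_leftover_may_use(16384, 4096, false) == 12288);
	assert(demo_leftover_may_use(16384, 4096, true) == 0);
}
```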

Fixes: fd6594b1446c ("btrfs: replace identity remaps with actual remaps when doing relocations")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix bytes_may_use leak in move_existing_remap()

If the call to btrfs_reserve_extent() in move_existing_remap() returns a
smaller extent than we asked for, currently we're not undoing the
bytes_may_use change that we made. Fix this by calling
btrfs_space_info_update_bytes_may_use() again for the difference.

Fixes: bbea42dfb91f ("btrfs: move existing remaps before relocating block group")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>

tracing: tell git to ignore the generated 'undefsyms_base.c' file

This odd file was added to automatically figure out tool-generated
symbols.

Honestly, it *should* have been just a real honest-to-goodness regular
file in git, instead of having strange code to generate it in the
Makefile, but that is not how that silly thing works. So now we need to
ignore it explicitly.

Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer")
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'linux_kselftest-next-7.1-next-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kselftest fixes from Shuah Khan:
"Fix regressions in non-bash shells and busybox support, and revert a
  commit that regressed in build and installation when one or more tests
  fail to build.

  Fix duplicated test number reporting introduced in ktap support patch"

* tag 'linux_kselftest-next-7.1-next-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests: Fix duplicated test number reporting
  selftests: Fix runner.sh for non-bash shells
  selftests: Fix runner.sh busybox support
  selftests: Deescalate error reporting

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull more arm64 updates from Catalin Marinas:
"The main 'feature' is a workaround for C1-Pro erratum 4193714
  requiring IPIs during TLB maintenance if a process is running in user
  space with SME enabled.

  The hardware acknowledges the DVMSync messages before completing
  in-flight SME accesses, with security implications. The workaround
  makes use of the mm_cpumask() to track the cores that need
  interrupting (arm64 hasn't used this mask before).

  The rest are fixes for MPAM, CCA and generated header that turned up
  during the merging window or shortly before.

  Summary:

  Core features:

   - Add workaround for C1-Pro erratum 4193714 - early CME (SME unit)
     DVMSync acknowledgement. The fix consists of sending IPIs on TLB
     maintenance to those CPUs running in user space with SME enabled

   - Include kernel-hwcap.h in list of generated files (missed in a
     recent commit generating the KERNEL_HWCAP_* macros)

  CCA:

   - Fix RSI_INCOMPLETE error check in arm-cca-guest

  MPAM:

   - Fix an unmount->remount problem with the CDP emulation,
     uninitialised variable and checker warnings"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm_mpam: resctrl: Make resctrl_mon_ctx_waiters static
  arm_mpam: resctrl: Fix the check for no monitor components found
  arm_mpam: resctrl: Fix MBA CDP alloc_capable handling on unmount
  virt: arm-cca-guest: fix error check for RSI_INCOMPLETE
  arm64/hwcap: Include kernel-hwcap.h in list of generated files
  arm64: errata: Work around early CME DVMSync acknowledgement
  arm64: cputype: Add C1-Pro definitions
  arm64: tlb: Pass the corresponding mm to __tlbi_sync_s1ish()
  arm64: tlb: Introduce __tlbi_sync_s1ish_{kernel,batch}() for TLB maintenance

Merge tag 'sh-for-v7.1-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux

Pull sh updates from John Paul Adrian Glaubitz:
"Two patches from Thomas Zimmermann, one by Tim Bird and one by Thomas
  Weißschuh.

  The first patch by Thomas Zimmermann adds a missing include in dac.h
  for SH-3 which became necessary after 243ce64b2b37 ("backlight: Do not
  include <linux/fb.h> in header file") which made __raw_readb() and
  __raw_writeb() inaccessible in dac.h.

  Thomas' second patch drops CONFIG_FIRMWARE_EDID for SH as it depends
  on X86 or EFI_GENERIC_STUB which are not defined on SH for obvious
  reasons.

  The patch by Tim Bird fixes just a small typo in two SPDX ID lines
  which he stumbled over by accident.

  And, last but not least, the patch by Thomas Weißschuh removes the
  CONFIG_VSYSCALL reference from UAPI. This was necessary as the
  definition of AT_SYSINFO_EHDR was gated behind CONFIG_VSYSCALL to
  avoid creating a default gate VMA. However that default gate VMA
  was removed entirely in commit a6c19dfe3994 (arm64,ia64,ppc,s390,
  sh,tile,um,x86,mm: remove default gate area)"

* tag 'sh-for-v7.1-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux:
  sh: Drop CONFIG_FIRMWARE_EDID from defconfig files
  sh: Remove CONFIG_VSYSCALL reference from UAPI
  sh: Fix typo in SPDX license ID lines
  sh: Include <linux/io.h> in dac.h

Merge tag 'uml-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux

Pull uml updates from Johannes Berg:
"Mostly cleanups and small things, notably:

   - musl libc compatibility

   - vDSO installation fix

   - TLB sync race fix for recent SMP support

   - build fix for 32-bit with Clang 20/21"

* tag 'uml-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux:
  um: Disable GCOV_PROFILE_ALL on 32-bit UML with Clang 20/21
  um: drivers: call kernel_strrchr() explicitly in cow_user.c
  um: Replace strncpy() with strnlen()+memcpy_and_pad() in strncpy_chunk_from_user()
  x86/um: fix vDSO installation
  um: Remove CONFIG_FRAME_WARN from x86_64_defconfig
  um: Fix pte_read() and pte_exec() for kernel mappings
  um: Fix potential race condition in TLB sync
  um: time-travel: clean up kernel-doc warnings
  um: avoid struct sigcontext redefinition with musl
  um: fix address-of CMSG_DATA() rvalue in stub

Merge tag 'printk-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk updates from Petr Mladek:

- Fix printk ring buffer initialization and sanity checks

- Workaround printf kunit test compilation with gcc < 12.1

- Add IPv6 address printf format tests

- Misc code and documentation cleanup

* tag 'printk-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printf: Compile the kunit test with DISABLE_BRANCH_PROFILING DISABLE_BRANCH_PROFILING
  lib/vsprintf: use bool for local decode variable
  lib/hexdump: print_hex_dump_bytes() calls print_hex_dump_debug()
  printk: ringbuffer: fix errors in comments
  printk_ringbuffer: Add sanity check for 0-size data
  printk_ringbuffer: Fix get_data() size sanity check
  printf: add IPv6 address format tests
  printk: Fix _DESCS_COUNT type for 64-bit systems

Merge tag 'timers-urgent-2026-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
"Fix timer stalls caused by incorrect handling of the
dev->next_event_forced flag"

* tag 'timers-urgent-2026-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clockevents: Add missing resets of the next_event_forced flag

rtmutex: Use waiter::task instead of current in remove_waiter()

remove_waiter() is used by the slowlock paths, but it is also used for
proxy-lock rollback in rt_mutex_start_proxy_lock() when invoked from
futex_requeue().

In the latter case waiter::task is not current, but remove_waiter()
operates on current for the dequeue operation. That results in several
problems:

  1) the rbtree dequeue happens without waiter::task::pi_lock being held

  2) the waiter task's pi_blocked_on state is not cleared, which leaves a
     dangling pointer around, primed for a UAF

  3) rt_mutex_adjust_prio_chain() operates on the wrong top priority waiter
     task

Use waiter::task instead of current in all related operations in
remove_waiter() to cure those problems.
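
Problem 2) can be shown with a stripped-down model. The demo_* types and
the 'use_waiter_task' switch are hypothetical; locking, the rbtree and
the priority chain walk are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_waiter;

struct demo_task {
	struct demo_waiter *pi_blocked_on;
};

struct demo_waiter {
	struct demo_task *task;	/* the task that is actually blocked */
};

/* Clears the blocked-on state of either the waiter's task or the caller. */
static void demo_remove_waiter(struct demo_task *current_task,
			       struct demo_waiter *waiter,
			       bool use_waiter_task)
{
	struct demo_task *task = use_waiter_task ? waiter->task : current_task;

	task->pi_blocked_on = NULL;
}

/* Proxy rollback: the caller is not the task that owns the waiter. */
bool demo_rollback_leaves_dangling(bool use_waiter_task)
{
	struct demo_task requeuer = { NULL };	/* plays the role of current */
	struct demo_task sleeper = { NULL };
	struct demo_waiter waiter = { &sleeper };

	sleeper.pi_blocked_on = &waiter;
	demo_remove_waiter(&requeuer, &waiter, use_waiter_task);

	return sleeper.pi_blocked_on != NULL;
}

/* Usage: operating on current leaves the sleeper pointing at the waiter. */
void demo_usage(void)
{
	assert(demo_rollback_leaves_dangling(false));
	assert(!demo_rollback_leaves_dangling(true));
}
```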

[ tglx: Fixup rt_mutex_adjust_prio_chain(), add a comment and amend the
   changelog ]

Fixes: 8161239a8bcc ("rtmutex: Simplify PI algorithm and make highest prio task get lock")
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org

Merge tag 'core-urgent-2026-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull entry cleanup from Ingo Molnar:
"Remove the unused ARCH_SYSCALL_WORK_{ENTER,EXIT} flags"

* tag 'core-urgent-2026-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Kill ARCH_SYSCALL_WORK_{ENTER,EXIT}

KVM: selftests: Replace "paddr" with "gpa" throughout

Replace all variations of "paddr" variables in KVM selftests with "gpa",
with the exception of the ELF structures, as those fields are not specific
to guest physical addresses, to complete the conversion from vm_paddr_t to
gpa_t.

No functional change intended.

Link: https://patch.msgid.link/20260420212004.3938325-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Replace "u64 nested_paddr" with "gpa_t l2_gpa"

In x86's nested TDP APIs, use the appropriate gpa_t typedef and rename
variables from nested_paddr to l2_gpa to match KVM x86's nomenclature.

No functional change intended.

Link: https://patch.msgid.link/20260420212004.3938325-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Replace "u64 gpa" with "gpa_t" throughout

Use gpa_t instead of u64 for obvious declarations of GPA variables.

No functional change intended.

Link: https://patch.msgid.link/20260420212004.3938325-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Replace "vaddr" with "gva" throughout

Replace all variations of "vaddr" variables in KVM selftests with "gva",
with the exception of the ELF structures, as those fields are not specific
to guest virtual addresses, to complete the conversion from vm_vaddr_t to
gva_t.

Opportunistically use gva_t instead of u64 for relevant variables, and
fixup indentation as appropriate.

No functional change intended.

Link: https://patch.msgid.link/20260420212004.3938325-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Clarify that arm64's inject_uer() takes a host PA, not a guest PA

Rename inject_uer()'s @paddr to @hpa to make it more obvious that it
injects an error using a host PA, not a guest PA.

No functional change intended.

Link: https://patch.msgid.link/20260420212004.3938325-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Rename translate_to_host_paddr() => translate_hva_to_hpa()

Rename arm64's translate_to_host_paddr() to translate_hva_to_hpa() and
update variable names to match, as using "vaddr" and "paddr" terminology
is super confusing due to selftests using those exact names for *guest*
addresses.

Opportunistically drop superfluous local page_addr and paddr variables.

No functional change intended.

Link: https://patch.msgid.link/20260420212004.3938325-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Rename vm_vaddr_populate_bitmap() => vm_populate_gva_bitmap()

Now that KVM selftests use gva_t instead of vm_vaddr_t, rename the helper
for populating the initial GVA bitmap to drop the defunct terminology and
use "vm" for the scope.

Opportunistically fixup the declaration of the API, which has been broken
since day 1. The flaw went unnoticed because the sole caller is defined
after the weak version, i.e. can see the prototype without a previous
declaration.

No functional change intended.

Fixes: e8b9a055fa04 ("KVM: arm64: selftests: Align VA space allocator with TTBR0")
Link: https://patch.msgid.link/20260420212004.3938325-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Rename vm_vaddr_unused_gap() => vm_unused_gva_gap()

Now that KVM selftests use gva_t instead of vm_vaddr_t, rename the API
for finding an unused range of virtual memory to drop the defunct
terminology and use "vm" for the scope.

Opportunistically clean up the function comment to drop superfluous
and redundant information.

No functional change intended.

Link: https://patch.msgid.link/20260420212004.3938325-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Drop "vaddr_" from APIs that allocate memory for a given VM

Now that KVM selftests use gva_t instead of vm_vaddr_t, drop "vaddr_" from
the core memory allocation APIs as the information is extraneous and does
more harm than good. E.g. the APIs don't _just_ allocate virtual memory,
they allocate backing physical memory and install mappings in the guest
page tables. And as proven by kmalloc() and malloc(), developers generally
expect that allocations come with a working virtual address.

Opportunistically clean up the function comment for vm_alloc(), and drop
the misleading and superfluous comments for its wrappers.

No functional change intended.

Link: https://patch.msgid.link/20260420212004.3938325-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Use u8 instead of uint8_t

Use u8 instead of uint8_t to make the KVM selftests code more concise
and more similar to the kernel (since selftests are primarily developed
by kernel developers).

This commit was generated with the following command:

git ls-files tools/testing/selftests/kvm | xargs sed -i 's/uint8_t/u8/g'

Then by manually adjusting whitespace to make checkpatch.pl happy.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://patch.msgid.link/20260420212004.3938325-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Use s16 instead of int16_t

Use s16 instead of int16_t to make the KVM selftests code more concise
and more similar to the kernel (since selftests are primarily developed
by kernel developers).

This commit was generated with the following command:

git ls-files tools/testing/selftests/kvm | xargs sed -i 's/int16_t/s16/g'

Then by manually adjusting whitespace to make checkpatch.pl happy.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://patch.msgid.link/20260420212004.3938325-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Use u16 instead of uint16_t

Use u16 instead of uint16_t to make the KVM selftests code more concise
and more similar to the kernel (since selftests are primarily developed
by kernel developers).

This commit was generated with the following command:

git ls-files tools/testing/selftests/kvm | xargs sed -i 's/uint16_t/u16/g'

Then by manually adjusting whitespace to make checkpatch.pl happy.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://patch.msgid.link/20260420212004.3938325-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Use s32 instead of int32_t

Use s32 instead of int32_t to make the KVM selftests code more concise
and more similar to the kernel (since selftests are primarily developed
by kernel developers).

This commit was generated with the following command:

git ls-files tools/testing/selftests/kvm | xargs sed -i 's/int32_t/s32/g'

Then by manually adjusting whitespace to make checkpatch.pl happy.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://patch.msgid.link/20260420212004.3938325-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Use u32 instead of uint32_t

Use u32 instead of uint32_t to make the KVM selftests code more concise
and more similar to the kernel (since selftests are primarily developed
by kernel developers).

This commit was generated with the following command:

git ls-files tools/testing/selftests/kvm | xargs sed -i 's/uint32_t/u32/g'

Then by manually adjusting whitespace to make checkpatch.pl happy.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://patch.msgid.link/20260420212004.3938325-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Use s64 instead of int64_t

Use s64 instead of int64_t to make the KVM selftests code more concise
and more similar to the kernel (since selftests are primarily developed
by kernel developers).

This commit was generated with the following command:

git ls-files tools/testing/selftests/kvm | xargs sed -i 's/int64_t/s64/g'

Then by manually adjusting whitespace to make checkpatch.pl happy.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://patch.msgid.link/20260420212004.3938325-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Use u64 instead of uint64_t

Use u64 instead of uint64_t to make the KVM selftests code more concise
and more similar to the kernel (since selftests are primarily developed
by kernel developers).

This commit was generated with the following command:

git ls-files tools/testing/selftests/kvm | xargs sed -i 's/uint64_t/u64/g'

Then by manually adjusting whitespace to make checkpatch.pl happy.

Include <linux/types.h> in include/kvm_util_types.h, include/test_util.h,
and include/x86/pmu.h to pick up the tools-defined u64. Arguably, all
headers (especially kvm_util_types.h) should have already been including
stdint.h to get uint64_t from the libc headers, but the missing dependency
only rears its head once KVM uses u64 instead of uint64_t.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
[sean: rename pread_uint64() => pread_u64, expand on types.h include]
Link: https://patch.msgid.link/20260420212004.3938325-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Use gpa_t for GPAs in Hyper-V selftests

Fix various Hyper-V selftests to use gpa_t for variables that contain
guest physical addresses, rather than gva_t. In practice, the bugs are
benign as both gva_t and gpa_t are u64 typedefs, i.e. gpa_t and gva_t are
interchangeable from a functional perspective, the code is just confusing.
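
A small sketch of why the compiler never flagged the mixups. The
typedefs follow the description above; demo_gpa_page_base() and the
address used are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef u64 gva_t;	/* guest virtual address */
typedef u64 gpa_t;	/* guest physical address */

/* Both names denote the same type, so the compiler cannot tell them apart. */
_Static_assert(_Generic((gva_t)0, gpa_t: 1, default: 0),
	       "gva_t and gpa_t are the same underlying type");

/* Hypothetical helper that expects a GPA. */
gpa_t demo_gpa_page_base(gpa_t gpa)
{
	return gpa & ~(gpa_t)0xfff;
}

/* Passing a gva_t compiles without a warning; only the reader is misled. */
u64 demo_mislabeled_call(void)
{
	gva_t mislabeled = 0x12345;

	return demo_gpa_page_base(mislabeled);
}

/* Usage: the result is identical whichever typedef is used. */
void demo_usage(void)
{
	assert(demo_mislabeled_call() == 0x12000);
}
```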

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
[sean: call out that both are u64 typedefs]
Link: https://patch.msgid.link/20260420212004.3938325-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Use gpa_t instead of vm_paddr_t

Replace all occurrences of vm_paddr_t with gpa_t to align with KVM code
and with the conversion helpers (e.g. addr_hva2gpa()).

This commit was generated with the following command:

git ls-files tools/testing/selftests/kvm | xargs sed -i 's/vm_paddr_/gpa_/g'

Then by manually adjusting whitespace to make checkpatch.pl happy.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
[sean: drop bogus changelog blurb about renaming functions]
Link: https://patch.msgid.link/20260420212004.3938325-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

KVM: selftests: Use gva_t instead of vm_vaddr_t

Replace all occurrences of vm_vaddr_t with gva_t to align with KVM code
and with the conversion helpers (e.g. addr_gva2hva()).

This commit was generated with the following command:

git ls-files tools/testing/selftests/kvm | xargs sed -i 's/vm_vaddr_/gva_/g'

Then by manually adjusting whitespace to make checkpatch.pl happy, and
dropping renames of functions that allocate memory within a given VM.

No functional change intended.

Signed-off-by: David Matlack <dmatlack@google.com>
[sean: drop renames of allocator APIs]
Link: https://patch.msgid.link/20260420212004.3938325-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

netfilter: nfnetlink_osf: fix potential NULL dereference in ttl check

The nf_osf_ttl() function accessed skb->dev to perform a local interface
address lookup without verifying that the device pointer was valid.

Additionally, the implementation utilized an in_dev_for_each_ifa_rcu
loop to match the packet source address against local interface
addresses. It assumed that packets from the same subnet should not see a
decrement on the initial TTL. A packet might appear to be from the same
subnet when it actually is not, especially in modern environments with
containers and virtual switching.

Remove the device dereference and interface loop. Replace the logic with
a switch statement that evaluates the TTL according to the ttl_check.
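
As a rough illustration of the switch-based check described above, here
is a minimal user-space sketch. The enum and function names are
hypothetical stand-ins, not the kernel's identifiers, and the assumed
semantics (exact match, less-or-equal, no check) are a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's TTL check modes. */
enum osf_ttl_mode {
	OSF_TTL_TRUE,    /* packet TTL must equal the fingerprint TTL */
	OSF_TTL_LESS,    /* packet TTL must not exceed the fingerprint TTL */
	OSF_TTL_NOCHECK, /* TTL is not compared at all */
};

/* Decide the TTL match from the packet alone, with no skb->dev lookup. */
static bool osf_ttl_matches(uint8_t pkt_ttl, uint8_t fp_ttl,
			    enum osf_ttl_mode mode)
{
	switch (mode) {
	case OSF_TTL_TRUE:
		return pkt_ttl == fp_ttl;
	case OSF_TTL_LESS:
		return pkt_ttl <= fp_ttl;
	case OSF_TTL_NOCHECK:
		return true;
	}

	/* Only reachable with a value outside the enum. */
	assert(!"unknown TTL mode");
	return false;
}
```

Because the decision depends only on the two TTL values and the mode,
no device pointer is needed at all in this model.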

Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match")
Reported-by: Kito Xu (veritas501) <hxzene@gmail.com>
Closes: https://lore.kernel.org/netfilter-devel/20260414074556.2512750-1-hxzene@gmail.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nfnetlink_osf: fix out-of-bounds read on option matching

In nf_osf_match(), the nf_osf_hdr_ctx structure is initialized once
and passed by reference to nf_osf_match_one() for each fingerprint
checked. During TCP option parsing, nf_osf_match_one() advances the
shared ctx->optp pointer.

If a fingerprint perfectly matches, the function returns early without
restoring ctx->optp to its initial state. If the user has configured
NF_OSF_LOGLEVEL_ALL, the loop continues to the next fingerprint.
However, because ctx->optp was not restored, the next call to
nf_osf_match_one() starts parsing from the end of the options buffer.
This causes subsequent matches to read garbage data and fail
immediately, making it impossible to log more than one match, or causing
incorrect matches to be logged.

Instead of using a shared ctx->optp pointer, pass the context as a
constant pointer and use a local pointer (optp) for TCP option
traversal. This makes nf_osf_match_one() strictly stateless from the
caller's perspective, ensuring every fingerprint check starts at the
correct option offset.
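
The local-cursor idea can be sketched in user space as follows. The
struct and function here are hypothetical and much simpler than the real
nf_osf_hdr_ctx and nf_osf_match_one(); the point is only that a const
context plus a local pointer leaves the caller's state untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified parsing context: only the option bytes. */
struct opt_ctx {
	const uint8_t *optp; /* start of the TCP options */
	size_t optsize;      /* number of option bytes */
};

/*
 * Compare the options against one fingerprint. The context is const and
 * the cursor is a local copy, so the caller's ctx->optp never moves.
 */
static bool match_one(const struct opt_ctx *ctx,
		      const uint8_t *fp, size_t fp_len)
{
	const uint8_t *optp;
	size_t left;

	assert(ctx != NULL && fp != NULL);
	optp = ctx->optp;    /* local cursor */
	left = ctx->optsize;

	while (fp_len > 0) {
		if (left == 0 || *optp != *fp)
			return false;
		optp++;
		left--;
		fp++;
		fp_len--;
	}
	return left == 0;
}
```

Calling match_one() twice with the same context gives the same answer
both times, which is the property the shared ctx->optp pointer broke.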

Fixes: 1a6a0951fc00 ("netfilter: nfnetlink_osf: add missing fmatch check")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

ipvs: fix MTU check for GSO packets in tunnel mode

Currently, IPVS skips MTU checks for GSO packets by excluding them with
the !skb_is_gso(skb) condition. This creates problems when IPVS tunnel
mode encapsulates GSO packets with IPIP headers.

The issue manifests in two ways:

1. MTU violation after encapsulation:
   When a GSO packet passes through IPVS tunnel mode, the original MTU
   check is bypassed. After adding the IPIP tunnel header, the packet
   size may exceed the outgoing interface MTU, leading to unexpected
   fragmentation at the IP layer.

2. Fragmentation with problematic IP IDs:
   When net.ipv4.vs.pmtu_disc=1 and a GSO packet with multiple segments
   is fragmented after encapsulation, each segment gets a sequentially
   incremented IP ID (0, 1, 2, ...). This happens because:

   a) The GSO packet bypasses MTU check and gets encapsulated
   b) At __ip_finish_output, the oversized GSO packet is split into
      separate SKBs (one per segment), with IP IDs incrementing
   c) Each SKB is then fragmented again based on the actual MTU

   This sequential IP ID allocation differs from the expected behavior
   and can cause issues with fragment reassembly and packet tracking.

Fix this by properly validating GSO packets using
skb_gso_validate_network_len(). This function correctly validates
whether the GSO segments will fit within the MTU after segmentation. If
validation fails, send an ICMP Fragmentation Needed message to enable
proper PMTU discovery.
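
To make the per-segment check concrete, here is a small user-space
model. The struct and its fields are hypothetical, not sk_buff members,
and the helper only imitates the idea behind
skb_gso_validate_network_len(): a GSO packet is judged by the size of
one resulting segment, not by its aggregate length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical packet model; none of these are sk_buff fields. */
struct pkt_model {
	bool is_gso;
	unsigned int len;      /* network-layer length of a non-GSO packet */
	unsigned int hdr_len;  /* network + transport headers per segment */
	unsigned int gso_size; /* payload bytes per segment */
};

/* True if the packet, or each of its GSO segments, is larger than mtu. */
static bool exceeds_mtu(const struct pkt_model *p, unsigned int mtu)
{
	assert(p != NULL);
	if (p->is_gso)
		return p->hdr_len + p->gso_size > mtu;
	return p->len > mtu;
}
```

For example, assuming a 20-byte outer IPv4 header, segments of
40 + 1460 bytes fit a 1500-byte MTU but not the 1480 bytes left after
encapsulation, so the second case is the one that should trigger the
ICMP Fragmentation Needed reply.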

Fixes: 4cdd34084d53 ("netfilter: nf_conntrack_ipv6: improve fragmentation handling")
Signed-off-by: Yingnan Zhang <342144303@qq.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nat: use kfree_rcu to release ops

Florian Westphal says:

"Historically this is not an issue, even for normal base hooks: the data
path doesn't use the original nf_hook_ops that are used to register the
callbacks.

However, in v5.14 I added the ability to dump the active netfilter
hooks from userspace.

This code will peek back into the nf_hook_ops that are available
at the tail of the pointer-array blob used by the datapath.

The nat hooks are special, because they are called indirectly from
the central nat dispatcher hook. They are currently invisible to
the nfnl hook dump subsystem though.

But once that changes the nat ops structures have to be deferred too."

Update nf_nat_register_fn() to deal with partial exposure of the hooks
from the error path, which can also be an issue for nfnetlink_hook.

Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: xtables: restrict several matches to inet family

This is a partial revert of:

commit ab4f21e6fb1c ("netfilter: xtables: use NFPROTO_UNSPEC in more extensions")

to allow ipv4 and ipv6 only.

- xt_mac
- xt_owner
- xt_physdev

These extensions are not used by ebtables in userspace.

Moreover, xt_realm is only for ipv4, since dst->tclassid is ipv4
specific.

Fixes: ab4f21e6fb1c ("netfilter: xtables: use NFPROTO_UNSPEC in more extensions")
Reported-by: "Kito Xu (veritas501)" <hxzene@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: remove sprintf usage

Replace it with scnprintf, the buffer sizes are expected to be large enough
to hold the result, no need for snprintf+overflow check.

Increase buffer size in mangle_content_len() while at it.
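
For readers unfamiliar with scnprintf(), the following user-space sketch
imitates its return-value convention on top of vsnprintf(). The name
bounded_printf() is hypothetical; the kernel helper itself is not
available outside the kernel.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format into buf without ever writing past size bytes, and return the
 * number of characters actually stored (excluding the trailing NUL).
 */
static int bounded_printf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	assert(buf != NULL && fmt != NULL);
	if (size == 0)
		return 0;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (n < 0)              /* user-space vsnprintf() can report errors */
		return 0;
	if ((size_t)n >= size)  /* output was truncated */
		return (int)(size - 1);
	return n;
}
```

Unlike sprintf(), the output is clamped to the buffer, and unlike
snprintf(), the return value never exceeds what was stored, so it can be
used directly as a length.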

BUG: KASAN: stack-out-of-bounds in vsnprintf+0xea5/0x1270
Write of size 1 at addr [..]
vsnprintf+0xea5/0x1270
sprintf+0xb1/0xe0
mangle_content_len+0x1ac/0x280
nf_nat_sdp_session+0x1cc/0x240
process_sdp+0x8f8/0xb80
process_invite_request+0x108/0x2b0
process_sip_msg+0x5da/0xf50
sip_help_tcp+0x45e/0x780
nf_confirm+0x34d/0x990
[..]

Fixes: 9fafcd7b2032 ("[NETFILTER]: nf_conntrack/nf_nat: add SIP helper port")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nfnetlink_osf: fix divide-by-zero in OSF_WSS_MODULO

nf_osf_match_one() computes ctx->window % f->wss.val in the
OSF_WSS_MODULO branch with no guard for f->wss.val == 0. A
CAP_NET_ADMIN user can add such a fingerprint via nfnetlink; a
subsequent matching TCP SYN divides by zero and panics the kernel.

Reject the bogus fingerprint in nfnl_osf_add_callback() above the
per-option for-loop. f->wss is per-fingerprint, not per-option, so
the check must run regardless of f->opt_num (including 0). Also
reject wss.wc >= OSF_WSS_MAX; nf_osf_match_one() already treats that
as "should not happen".
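
A minimal sketch of the kind of validation described above, assuming a
simplified fingerprint layout. The enum, struct, and function names are
hypothetical and only mirror the shape of the kernel's OSF_WSS_* values;
the exact condition and error code in the real patch may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical mirror of the window-size check types. */
enum wss_check { WSS_PLAIN, WSS_MSS, WSS_MTU, WSS_MODULO, WSS_MAX };

struct wss_model {
	unsigned int wc;  /* which check to apply */
	unsigned int val; /* operand; used as a divisor for WSS_MODULO */
};

/* Reject fingerprints that would later divide by zero or hit an
 * unknown check type. Runs once per fingerprint, not per option. */
static int validate_wss(const struct wss_model *w)
{
	assert(w != NULL);
	if (w->wc >= WSS_MAX)
		return -EINVAL;
	if (w->wc == WSS_MODULO && w->val == 0)
		return -EINVAL;
	return 0;
}
```

Doing this at insertion time keeps the packet path free of extra
branches: the match code can assume the divisor is non-zero.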

Crash:
Oops: divide error: 0000 [#1] SMP KASAN NOPTI
RIP: 0010:nf_osf_match_one (net/netfilter/nfnetlink_osf.c:98)
Call Trace:
<IRQ>
  nf_osf_match (net/netfilter/nfnetlink_osf.c:220)
  xt_osf_match_packet (net/netfilter/xt_osf.c:32)
  ipt_do_table (net/ipv4/netfilter/ip_tables.c:348)
  nf_hook_slow (net/netfilter/core.c:622)
  ip_local_deliver (net/ipv4/ip_input.c:265)
  ip_rcv (include/linux/skbuff.h:1162)
  __netif_receive_skb_one_core (net/core/dev.c:6181)
  process_backlog (net/core/dev.c:6642)
  __napi_poll (net/core/dev.c:7710)
  net_rx_action (net/core/dev.c:7945)
  handle_softirqs (kernel/softirq.c:622)

Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_osf: restrict it to ipv4

This expression only supports ipv4, restrict it.

Fixes: b96af92d6eaf ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf")
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

io_uring: fix iowq_limits data race in tctx node addition

__io_uring_add_tctx_node() reads ctx->int_flags and
ctx->iowq_limits[0..1] without holding ctx->uring_lock, while
io_register_iowq_max_workers() writes these same fields under the lock.

Mostly an application problem if you try and make these race, but let's
silence KCSAN by just grabbing the ->uring_lock around the operation.
This is a slow path operation anyway, and ->uring_lock will be grabbed
by submission right after anyway.

Fixes: 2e480058ddc2 ("io-wq: provide a way to limit max number of workers")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

x86/shstk: Prevent deadlock during shstk sigreturn

During sigreturn the shadow stack signal frame is popped. The kernel does
this by reading the shadow stack using normal read accesses. When it can't
assume the memory is shadow stack, it takes extra steps to make sure it is
reading actual shadow stack memory and not other normal readable memory. It
does this by holding the mmap read lock while doing the access and checking
the flags of the VMA.

Unfortunately that is not safe. If the read of the shadow stack sigframe
hits a page fault, the fault handler will try to recursively grab another
mmap read lock. This normally works ok, but if a writer on another CPU is
also waiting, the second read lock could fail and cause a deadlock.

Fix this by not holding mmap lock during the read access to userspace.

Instead use mmap_lock_speculate_...() to watch for changes between dropping
mmap lock and the userspace access. Retry if anything grabbed an mmap write
lock in between and could have changed the VMA.
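
The speculate-and-retry pattern can be modelled in user space with a
sequence counter. Everything below is a hypothetical sketch: the struct
and function names are invented, and it assumes the counter is odd while
a writer holds the lock, which is how a seqcount-style scheme typically
works. It is not the kernel's mmap_lock_speculate_...() implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a lock sequence: odd while write-locked. */
struct mm_model {
	atomic_uint lock_seq;
};

static void model_write_lock(struct mm_model *mm)
{
	atomic_fetch_add(&mm->lock_seq, 1); /* even -> odd */
}

static void model_write_unlock(struct mm_model *mm)
{
	atomic_fetch_add(&mm->lock_seq, 1); /* odd -> even */
}

/* Record the sequence; refuse to speculate while a writer is active. */
static bool speculate_try_begin(struct mm_model *mm, unsigned int *seq)
{
	assert(mm != NULL && seq != NULL);
	*seq = atomic_load(&mm->lock_seq);
	return (*seq & 1u) == 0;
}

/* True if a writer ran since speculate_try_begin(); caller must retry. */
static bool speculate_retry(struct mm_model *mm, unsigned int seq)
{
	return atomic_load(&mm->lock_seq) != seq;
}
```

In this model the reader samples the sequence, drops the lock, performs
the userspace access, and then discards the result and loops if
speculate_retry() reports that a writer intervened.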

These mmap_lock_speculate_...() helpers use mm::mm_lock_seq, which is only
available when PER_VMA_LOCK is configured. So make X86_USER_SHADOW_STACK
depend on it. On x86, PER_VMA_LOCK is a default configuration for SMP
kernels. So drop support for the other configs under the assumption that
the !SMP shadow stack user base does not exist.

Currently there is a check that skips the lookup work when the SSP can be
assumed to be on a shadow stack. While reorganizing the function, remove
the optimization to make the tricky code flows more common, such that
issues like this cannot escape detection for so long.

Fixes: 7fad2a432cd3 ("x86/shstk: Check that signal frame is shadow stack mem")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org

io_uring/tctx: mark io_wq as exiting before error path teardown

syzbot reports that it's hitting the below condition for exiting an
io_wq context:

WARN_ON_ONCE(!test_bit(IO_WQ_BIT_EXIT, &wq->state))

in io_wq_put_and_exit(), which can be triggered with memory allocation
fault injection. Ensure that the io_wq is marked as exiting to silence
this warning trigger.

Reported-by: syzbot+79a4cc863a8db58cd92b@syzkaller.appspotmail.com
Fixes: 7880174e1e5e ("io_uring/tctx: clean up __io_uring_add_tctx_node() error handling")
Reviewed-by: Clément Léger <cleger@meta.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/tctx: check for setup tctx->io_wq before teardown

As with the idling code before it, the error exit path should check for
a NULL tctx->io_wq before calling io_wq_put_and_exit().

Fixes: 7880174e1e5e ("io_uring/tctx: clean up __io_uring_add_tctx_node() error handling")
Reported-by: Dan Carpenter <error27@gmail.com>
Reviewed-by: Clément Léger <cleger@meta.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

drm/nouveau: fix u32 overflow in pushbuf reloc bounds check

nouveau_gem_pushbuf_reloc_apply() validates each relocation with

if (r->reloc_bo_offset + 4 > nvbo->bo.base.size)

but reloc_bo_offset is __u32 (uapi/drm/nouveau_drm.h) and the integer
literal 4 promotes to unsigned int, so the addition is performed in 32
bits and wraps before the comparison against the size_t bo size.

Cast to u64 so the addition happens in 64-bit arithmetic.
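
The wraparound is easy to reproduce in user space. The two helpers below
are hypothetical illustrations of the check before and after the cast,
assuming a 32-bit unsigned int as on the platforms in question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static_assert(sizeof(unsigned int) == 4, "sketch assumes 32-bit unsigned int");

/* Buggy form: offset + 4 is computed in 32 bits and can wrap to a
 * small value before it is compared with the buffer size. */
static bool reloc_oob_u32(uint32_t offset, size_t bo_size)
{
	return offset + 4 > bo_size;
}

/* Fixed form: widen first so the addition cannot wrap. */
static bool reloc_oob_u64(uint32_t offset, size_t bo_size)
{
	return (uint64_t)offset + 4 > bo_size;
}
```

With offset 0xfffffffe and a 4096-byte buffer, the 32-bit sum wraps to 2
and the buggy check passes, while the widened check correctly flags the
relocation as out of bounds.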

Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Reported-by: Anthropic
Cc: stable <stable@kernel.org>
Assisted-by: gkh_clanker_t1000
Fixes: a1606a9596e5 ("drm/nouveau: new gem pushbuf interface, bump to 0.0.16")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Add Fixes: tag. - Danilo ]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

ktest: Add logfile to failure directory

The logfile contains a lot of useful information about the tests being
run. Add it to the stored failure directory when the test fails.

Cc: John 'Warthog9' Hawley <warthog9@kernel.org>
Link: https://patch.msgid.link/20260420142315.7bbc3624@fedora
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ktest: Fix the month in the name of the failure directory

The Perl localtime() function returns the month starting at 0 not 1. This
caused the date produced to create the directory for saving files of a
failed run to have the month off by one.

machine-test-useconfig-fail-20260314073628

The above happened in April, not March. The correct name should have been:

machine-test-useconfig-fail-20260414073628

This was somewhat confusing.
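
ktest itself is Perl, but C's struct tm follows the same convention
(months counted from 0, years counted from 1900), so the off-by-one can
be illustrated with a small C sketch. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Build a YYYYMMDDhhmmss stamp; tm_mon is 0-based, so add 1. */
static int format_stamp(const struct tm *tm, char *buf, size_t size)
{
	assert(tm != NULL && buf != NULL);
	return snprintf(buf, size, "%04d%02d%02d%02d%02d%02d",
			tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
			tm->tm_hour, tm->tm_min, tm->tm_sec);
}
```

Omitting the "+ 1" on tm_mon reproduces the bug: a run in April
(tm_mon == 3) would be stamped with month 03.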

Cc: stable@vger.kernel.org
Cc: John 'Warthog9' Hawley <warthog9@kernel.org>
Link: https://patch.msgid.link/20260420142426.33ad0293@fedora
Fixes: 7faafbd69639b ("ktest: Add open and close console and start stop monitor")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Merge tag 'platform-drivers-x86-v7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver updates from Ilpo Järvinen:
"asus-wmi:
   - Retain battery charge threshold during boot which avoids
     unsolicited change to 100%. Return -ENODATA when the limit
     is not yet known
   - Improve screenpad power/brightness handling consistency
   - Fix screenpad brightness range

  barco-p50-gpio:
   - Normalize gpio_get return values

  bitland-mifs-wmi:
   - Add driver for Bitland laptops (supports platform profile,
     hwmon, kbd backlight, gpu mode, hotkeys, and fan boost)

  dell_rbu:
   - Fix using uninitialized value in sysfs write function

  dell-wmi-sysman:
   - Respect destination length when constructing enum strings

  hp-wmi:
   - Propagate fan setting apply failures and log an error
   - Fix sysfs write vs work handler cancel_delayed_work_sync() deadlock
   - Correct keepalive schedule_delayed_work() to mod_delayed_work()
   - Fix u8 underflows in GPU delta calculation
   - Use mutex to protect fan pwm/mode
   - Ignore kbd backlight and FnLock key events that are handled by FW
   - Fix fan table parsing (use correct field)
   - Add support for Omen 14-fb0xxx, 16-n0xxx, 16-wf1xxx, and
     Omen MAX 16-ak0xxxx

  input: trackpoint & thinkpad_acpi:
   - Enable doubletap by default and add sysfs enable/disable

  int3472:
   - Add support for GPIO type 0x02 (IR flood LED)

  intel-speed-select: (updated to v1.26)
   - Avoid using current base frequency as maximum
   - Fix CPU extended family ID decoding
   - Fix exit code
   - Improve error reporting

  intel/vsec:
   - Refactor to support ACPI-enumerated PMT endpoints

  pcengines-apuv2:
   - Attach software node to the gpiochip

  uniwill:
   - Refactor hwmon to smaller parts to accommodate HW diversity
   - Support USB-C power/performance priority switch through sysfs
   - Add another XMG Fusion 15 (L19) DMI vendor
   - Enable fine-grained features to device lineup mapping

  wmi:
   - Perform output size check within WMI core to allow simpler WMI
     drivers

  misc:
   - acpi_driver -> platform driver conversions (a large number of
     changes from Rafael J. Wysocki)
   - cleanups / refactoring / improvements"

* tag 'platform-drivers-x86-v7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (106 commits)
  platform/x86: hp-wmi: Add support for Omen 16-wf1xxx (8C77)
  platform/x86: hp-wmi: Add support for Omen 16-n0xxx (8A44)
  platform/x86: hp-wmi: Add support for OMEN MAX 16-ak0xxx (8D87)
  platform/x86: hp-wmi: fix fan table parsing
  platform/x86: hp-wmi: add Omen 14-fb0xxx (board 8C58) support
  platform/wmi: Replace .no_notify_data with .min_event_size
  platform/wmi: Extend wmidev_query_block() to reject undersized data
  platform/wmi: Extend wmidev_invoke_method() to reject undersized data
  platform/wmi: Prepare to reject undersized unmarshalling results
  platform/wmi: Convert drivers to use wmidev_invoke_procedure()
  platform/wmi: Add wmidev_invoke_procedure()
  platform/x86: int3472: Add support for GPIO type 0x02 (IR flood LED)
  platform/x86: int3472: Parameterize LED con_id in registration
  platform/x86: int3472: Rename pled to led in LED registration code
  platform/x86: int3472: Use local variable for LED struct access
  platform/x86: thinkpad_acpi: remove obsolete TODO comment
  platform/x86: dell-wmi-sysman: bound enumeration string aggregation
  platform/x86: hp-wmi: Ignore backlight and FnLock events
  platform/x86: uniwill-laptop: Fix signedness bug
  platform/x86: dell_rbu: avoid uninit value usage in packet_size_write()
  ...

Merge tag 'backlight-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight

Pull backlight updates from Lee Jones:
"Apple Backlight:
   - Convert the Apple Backlight ACPI driver to a proper platform
     driver, aligning with current ACPI binding practices

  Skyworks SKY81452:
   - Check the return value of `devm_gpiod_get_optional()`
     to properly handle GPIO acquisition errors"

* tag 'backlight-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
  backlight: apple_bl: Convert to a platform driver
  backlight: sky81452-backlight: Check return value of devm_gpiod_get_optional() in sky81452_bl_parse_dt()

net: mctp: fix don't require received header reserved bits to be zero

From the MCTP Base specification (DSP0236 v1.2.1), the first byte of
the MCTP header contains a 4 bit reserved field, and 4 bit version.

On our current receive path, we require those 4 reserved bits to be
zero, but the 9500-8i card is non-conformant, and may set these
reserved bits.

DSP0236 states that the reserved bits must be written as zero, and
ignored when read. While the device might not conform to the former,
we should accept these messages to conform to the latter.

Relax our check on the MCTP version byte to allow non-zero bits in the
reserved field.
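
A sketch of the relaxed check, assuming the version occupies the low
four bits of the first header byte and the reserved field the high four.
The macro and function names are hypothetical, as is the supported
version range of exactly 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HDR_VER_MASK 0x0fu /* assumed: low nibble carries the version */
#define HDR_VER_MIN  1u    /* assumed supported range */
#define HDR_VER_MAX  1u

static_assert(HDR_VER_MIN <= HDR_VER_MAX, "version range must be ordered");

/* Accept a header if its version is supported, ignoring reserved bits. */
static bool mctp_ver_ok(uint8_t first_byte)
{
	uint8_t ver = first_byte & HDR_VER_MASK;

	return ver >= HDR_VER_MIN && ver <= HDR_VER_MAX;
}
```

Under these assumptions a first byte of 0xf1 is accepted just like 0x01,
whereas comparing the whole byte against the version would reject it.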

Fixes: 889b7da23abf ("mctp: Add initial routing framework")
Signed-off-by: Yuan Zhaoming <yuanzm2@lenovo.com>
Cc: stable@vger.kernel.org
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20260417141340.5306-1-yuanzhaoming901030@126.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

gtp: disable BH before calling udp_tunnel_xmit_skb()

gtp_genl_send_echo_req() runs as a generic netlink doit handler in
process context with BH not disabled. It calls udp_tunnel_xmit_skb(),
which eventually invokes iptunnel_xmit() — that uses __this_cpu_inc/dec
on softnet_data.xmit.recursion to track the tunnel xmit recursion level.

Without local_bh_disable(), the task may migrate between
dev_xmit_recursion_inc() and dev_xmit_recursion_dec(), breaking the
per-CPU counter pairing. The result is stale or negative recursion
levels that can later produce false-positive
SKB_DROP_REASON_RECURSION_LIMIT drops on either CPU.

The other udp_tunnel_xmit_skb() call sites in gtp.c are unaffected:
the data path runs under ndo_start_xmit and the echo response handlers
run from the UDP encap rx softirq, both with BH already disabled.

Fix it by disabling BH around the udp_tunnel_xmit_skb() call, mirroring
commit 2cd7e6971fc2 ("sctp: disable BH before calling
udp_tunnel_xmit_skb()").

Fixes: 6f1a9140ecda ("net: add xmit recursion limit to tunnel xmit functions")
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
Link: https://patch.msgid.link/20260417055408.4667-1-devnexen@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

hv_sock: Report EOF instead of -EIO for FIN

Commit f0c5827d07cb unluckily causes a regression for the FIN packet,
and the final read syscall gets an error rather than 0.

Ideally, we would want to fix hvs_channel_readable_payload() so that it
could return 0 in the FIN scenario, but it's not good for the hv_sock
driver to use the VMBus ringbuffer's cached priv_read_index, which is
internal data in the VMBus driver.

Fix the regression in hv_sock by returning 0 rather than -EIO.

Fixes: f0c5827d07cb ("hv_sock: Return the readable bytes in hvs_stream_has_data()")
Cc: stable@vger.kernel.org
Reported-by: Ben Hillis <Ben.Hillis@microsoft.com>
Reported-by: Mitchell Levy <levymitchell0@gmail.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260416191433.840637-1-decui@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'leds-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/leds

Pull LED updates from Lee Jones:
"Core:
   - Implement fallback to software node name for LED names
   - Fix formatting issues in `led-core.c` reported by checkpatch.pl
   - Make `led_remove_lookup()` NULL-aware
   - Switch from `class_find_device_by_of_node()` to
     `class_find_device_by_fwnode()`
   - Drop the unneeded dependency on `OF_GPIO` from `LEDS_NETXBIG`
     in Kconfig

  Kinetic KTD2692:
   - Make the `ktd2692_timing` variable static to resolve a
     sparse warning

  LGM SSO:
   - Fix a typo in the `GET_SRC_OFFSET` macro
   - Remove a duplicate assignment of `priv->mmap` in
     `intel_sso_led_probe()`

  Multicolor:
   - Fix a signedness error by changing the `intensity_value` type
     to `unsigned int`

  Qualcomm LPG:
   - Prevent array overflow when selecting high-resolution values

  Spreadtrum SC2731:
   - Add a compatible string for the SC2730 PMIC LED controller

  TI LM3642:
   - Use `guard(mutex)` to simplify locking and avoid manual
     `mutex_unlock()` calls

  TI LP5569:
   - Use `sysfs_emit()` instead of `sprintf()` for sysfs outputs

  TI LP5860:
   - Add the `enable-gpios` property for the `VIO_EN` pin

  TI LP8860:
   - Do not unconditionally program the EEPROM on probe
   - Hold the mutex lock for the entirety of the EEPROM programming
     process
   - Return directly from `lp8860_init()` instead of using empty `goto`
     statements
   - Use a single regmap table and an access table instead of separate
     maps for normal and EEPROM registers
   - Remove an unused read of the `STATUS` register during EEPROM
     programming

  TTY Trigger:
   - Prefer `IS_ERR_OR_NULL()` over manual NULL checks"

* tag 'leds-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/leds:
  leds: class: Make led_remove_lookup() NULL-aware
  leds: led-class: Switch to using class_find_device_by_fwnode()
  leds: Kconfig: Drop unneeded dependency on OF_GPIO
  leds: lm3642: Use guard to simplify locking
  leds: core: Fix formatting issues
  leds: core: Implement fallback to software node name for LED names
  leds: lgm-sso: Fix typo in macro for src offset
  dt-bindings: leds: lp5860: add enable-gpio
  leds: Prefer IS_ERR_OR_NULL over manual NULL check
  dt-bindings: leds: sc2731: Add compatible for SC2730
  leds: lp8860: Do not always program EEPROM on probe
  leds: lp8860: Remove unused read of STATUS register
  leds: lp8860: Hold lock for all of EEPROM programming
  leds: lp8860: Return directly from lp8860_init
  leds: lp8860: Use a single regmap table
  leds: lgm-sso: Remove duplicate assignments for priv->mmap
  leds: qcom-lpg: Check for array overflow when selecting the high resolution
  leds: ktd2692: Make ktd2692_timing variable static
  leds: lp5569: Use sysfs_emit instead of sprintf()
  leds: multicolor: Change intensity_value to unsigned int

net: airoha: Fix possible TX queue stall in airoha_qdma_tx_napi_poll()

Since multiple net_device TX queues can share the same hw QDMA TX queue,
there is no guarantee we have inflight packets queued in hw belonging to a
net_device TX queue stopped in the xmit path because hw QDMA TX queue
can be full. In this corner case the net_device TX queue will never be
re-activated. In order to avoid any potential net_device TX queue stall,
we need to wake all the net_device TX queues feeding the same hw QDMA TX
queue in airoha_qdma_tx_napi_poll routine.

Fixes: 23020f0493270 ("net: airoha: Introduce ethernet support for EN7581 SoC")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260416-airoha-txq-potential-stall-v2-1-42c732074540@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

openvswitch: cap upcall PID array size and pre-size vport replies

The vport netlink reply helpers allocate a fixed-size skb with
nlmsg_new(NLMSG_DEFAULT_SIZE, ...) but serialize the full upcall PID
array via ovs_vport_get_upcall_portids().  Since
ovs_vport_set_upcall_portids() accepts any non-zero multiple of
sizeof(u32) with no upper bound, a CAP_NET_ADMIN user can install a PID
array large enough to overflow the reply buffer, causing nla_put() to
fail with -EMSGSIZE and hitting BUG_ON(err < 0).  On systems with
unprivileged user namespaces enabled (e.g., Ubuntu default), this is
reachable via unshare -Urn since OVS vport mutation operations use
GENL_UNS_ADMIN_PERM.

kernel BUG at net/openvswitch/datapath.c:2414!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 1 UID: 0 PID: 65 Comm: poc Not tainted 7.0.0-rc7-00195-geb216e422044 #1
RIP: 0010:ovs_vport_cmd_set+0x34c/0x400
Call Trace:
  <TASK>
  genl_family_rcv_msg_doit (net/netlink/genetlink.c:1116)
  genl_rcv_msg (net/netlink/genetlink.c:1194)
  netlink_rcv_skb (net/netlink/af_netlink.c:2550)
  genl_rcv (net/netlink/genetlink.c:1219)
  netlink_unicast (net/netlink/af_netlink.c:1344)
  netlink_sendmsg (net/netlink/af_netlink.c:1894)
  __sys_sendto (net/socket.c:2206)
  __x64_sys_sendto (net/socket.c:2209)
  do_syscall_64 (arch/x86/entry/syscall_64.c:63)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
  </TASK>
Kernel panic - not syncing: Fatal exception

Reject attempts to set more PIDs than nr_cpu_ids in
ovs_vport_set_upcall_portids(), and pre-compute the worst-case reply
size in ovs_vport_cmd_msg_size() based on that bound, similar to the
existing ovs_dp_cmd_msg_size().  nr_cpu_ids matches the cap already
used by the per-CPU dispatch configuration on the datapath side
(ovs_dp_cmd_fill_info() serialises at most nr_cpu_ids PIDs), so the
two sides stay consistent.
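
The bound described above can be sketched as a length check on the
attribute payload. The function name is hypothetical, and returning
-EINVAL for the oversized case is an assumption; the actual patch may
use a different error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Validate the byte length of an upcall PID array: it must be a
 * non-zero multiple of sizeof(u32) and hold at most nr_cpu_ids entries.
 */
static int validate_upcall_pids(size_t attr_len, unsigned int nr_cpu_ids)
{
	assert(nr_cpu_ids > 0);
	if (attr_len == 0 || attr_len % sizeof(uint32_t) != 0)
		return -EINVAL;
	if (attr_len / sizeof(uint32_t) > nr_cpu_ids)
		return -EINVAL; /* assumed error code */
	return 0;
}
```

With the array capped this way, the worst-case reply size is a simple
function of nr_cpu_ids, which is what allows the reply skb to be sized
up front instead of relying on NLMSG_DEFAULT_SIZE.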

Fixes: 5cd667b0a456 ("openvswitch: Allow each vport to have an array of 'port_id's.")
Reported-by: Xiang Mei <xmei5@asu.edu>
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Ilya Maximets <i.maximets@ovn.org>
Link: https://patch.msgid.link/20260416024653.153456-2-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5: Fix HCA caps leak on notifier init failure

mlx5_mdev_init() allocates HCA caps via mlx5_hca_caps_alloc() before
calling mlx5_notifiers_init(). If notifier initialization fails, the
error path jumps to err_hca_caps and skips mlx5_hca_caps_free(), leaking
allocated caps.

Add a dedicated unwind label for notifier-init failure that frees HCA
caps before continuing the existing cleanup sequence.
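
The unwind-label pattern can be shown with a toy model. All names below
are hypothetical; the sketch only demonstrates why a failure in the
second step needs its own label that frees what the first step
allocated before falling through to the older cleanup.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model tracking what has been set up. */
struct dev_model {
	bool caps_allocated;
	bool notifiers_ready;
};

static int caps_alloc(struct dev_model *d)
{
	d->caps_allocated = true;
	return 0;
}

static void caps_free(struct dev_model *d)
{
	d->caps_allocated = false;
}

static int notifiers_init(struct dev_model *d, bool fail)
{
	if (fail)
		return -ENOMEM;
	d->notifiers_ready = true;
	return 0;
}

static int dev_init(struct dev_model *d, bool fail_notifiers)
{
	int err;

	assert(d != NULL);
	err = caps_alloc(d);
	if (err)
		goto err_caps;

	err = notifiers_init(d, fail_notifiers);
	if (err)
		goto err_notifiers;

	return 0;

err_notifiers:
	caps_free(d); /* undo caps_alloc() before the older cleanup */
err_caps:
	return err;
}
```

Jumping straight to err_caps on a notifier failure would skip
caps_free(), which is the leak the commit describes.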

Fixes: b6b03097f982 ("net/mlx5: Initialize events outside devlink lock")
Signed-off-by: Prathamesh Deshpande <prathameshdeshpande7@gmail.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260415005022.34764-1-prathameshdeshpande7@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

pppoe: drop PFC frames

RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT
RECOMMENDED for PPPoE. In practice, pppd does not support negotiating
PFC for PPPoE sessions, and the current PPPoE driver assumes an
uncompressed (2-byte) protocol field. However, the generic PPP layer
function ppp_input() is not aware of the negotiation result, and still
accepts PFC frames.

If a peer with a broken implementation or an attacker sends a frame with
a compressed (1-byte) protocol field, the subsequent PPP payload is
shifted by one byte. This causes the network header to be 4-byte
misaligned, which may trigger unaligned access exceptions on some
architectures.

To reduce the attack surface, drop PPPoE PFC frames. Introduce
ppp_skb_is_compressed_proto() helper function to be used in both
ppp_generic.c and pppoe.c to avoid open-coding.
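
A compressed protocol field can be recognised from its first byte: PPP
protocol numbers are assigned so that the high-order octet is even and
the low-order octet is odd, so an odd first byte means the leading octet
was dropped. The user-space helper below is a hypothetical sketch
modelled on that rule, not the kernel's ppp_skb_is_compressed_proto().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the PPP protocol field at data[0] is in compressed
 * (1-byte) form, i.e. its first byte is odd. */
static bool ppp_proto_is_compressed(const uint8_t *data)
{
	assert(data != NULL);
	return (data[0] & 0x01) != 0;
}
```

For IPv4 (protocol 0x0021), the uncompressed field starts with 0x00 and
the compressed form is the single byte 0x21, so a PPPoE receiver can
drop the frame as soon as it sees an odd first byte.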

Fixes: 7fb1b8ca8fa1 ("ppp: Move PFC decompression to PPP generic layer")
Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260415022456.141758-2-qingfang.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

flow_dissector: do not dissect PPPoE PFC frames

RFC 2516 Section 7 states that Protocol Field Compression (PFC) is NOT
RECOMMENDED for PPPoE. In practice, pppd does not support negotiating
PFC for PPPoE sessions, and the flow dissector driver has assumed an
uncompressed frame until the blamed commit.

During the review process of that commit [1], support for PFC was
suggested. However, having a compressed (1-byte) protocol field means
the subsequent PPP payload is shifted by one byte, causing 4-byte
misalignment for the network header and an unaligned access exception
on some architectures.

The exception can be reproduced by sending a PPPoE PFC frame to an
ethernet interface of a MIPS board, with RPS enabled, even if no PPPoE
session is active on that interface:

$ 0   : 00000000 80c40000 00000000 85144817
$ 4   : 00000008 00000100 80a75758 81dc9bb8
$ 8   : 00000010 8087ae2c 0000003d 00000000
$12   : 000000e0 00000039 00000000 00000000
$16   : 85043240 80a75758 81dc9bb8 00006488
$20   : 0000002f 00000007 85144810 80a70000
$24   : 81d1bda0 00000000
$28   : 81dc8000 81dc9aa8 00000000 805ead08
Hi    : 00009d51
Lo    : 2163358a
epc   : 805e91f0 __skb_flow_dissect+0x1b0/0x1b50
ra    : 805ead08 __skb_get_hash_net+0x74/0x12c
Status: 11000403        KERNEL EXL IE
Cause : 40800010 (ExcCode 04)
BadVA : 85144817
PrId  : 0001992f (MIPS 1004Kc)
Call Trace:
[<805e91f0>] __skb_flow_dissect+0x1b0/0x1b50
[<805ead08>] __skb_get_hash_net+0x74/0x12c
[<805ef330>] get_rps_cpu+0x1b8/0x3fc
[<805fca70>] netif_receive_skb_list_internal+0x324/0x364
[<805fd120>] napi_complete_done+0x68/0x2a4
[<8058de5c>] mtk_napi_rx+0x228/0xfec
[<805fd398>] __napi_poll+0x3c/0x1c4
[<805fd754>] napi_threaded_poll_loop+0x234/0x29c
[<805fd848>] napi_threaded_poll+0x8c/0xb0
[<80053544>] kthread+0x104/0x12c
[<80002bd8>] ret_from_kernel_thread+0x14/0x1c

Code: 02d51821  1060045b  00000000 <8c640000> 3084000f  2c820005  144001a2  00042080  8e220000

To reduce the attack surface and maintain performance, do not process
PPPoE PFC frames.

[1] https://lore.kernel.org/r/20220630231016.GA392@debian.home
Fixes: 46126db9c861 ("flow_dissector: Add PPPoE dissectors")
Signed-off-by: Qingfang Deng <qingfang.deng@linux.dev>
Link: https://patch.msgid.link/20260415022456.141758-1-qingfang.deng@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'mfd-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd

Pull MFD updates from Lee Jones:
"Core:
   - Add a resource-managed version of alloc_workqueue()
     (`devm_alloc_workqueue()`)
   - Preserve the Open Firmware (OF) node when an ACPI handle
     is present

  Apple SMC:
   - Wire up the Apple SMC power driver by adding a new MFD cell

  Atmel HLCDC:
   - Fetch the LVDS PLL clock as a fallback if the generic sys_clk
     is unavailable

  Broadcom BCM2835 PM:
   - Add support for the BCM2712 power management device
   - Introduce a hardware type identifier to distinguish SoC variants

  Congatec CGBC, KEMPLD, RSMU, Si476x:
   - Fix various kernel-doc warnings and correct struct member names

  DLN2:
   - Drop redundant USB device references and switch to managed
     resource allocations
   - Update bare 'unsigned' types to 'unsigned int'

  ENE KB3930:
   - Use the of_device_is_system_power_controller() wrapper

  EZX PCAP:
   - Avoid rescheduling after destroying the workqueue by switching
     to a device-managed workqueue
   - Drop redundant memory allocation error messages
   - Return directly instead of using empty goto statements

  Freescale i.MX25 TSADC:
   - Convert devicetree bindings from TXT to YAML format

  Freescale MC13xxx:
   - Fix a memory leak in subdevice platform data allocation by
     using devm_kmemdup()

  Intel LPC ICH:
   - Expose a software node for the GPIO controller cell to fix
     GPIO lookups

  Intel LPSS:
   - Add PCI IDs for the Intel Nova Lake-H platform

  Maxim MAX77620:
   - Convert devicetree bindings from TXT to YAML format
   - Document an optional I2C address for the MAX77663 RTC device

  Maxim MAX77705:
   - Make the max77705_pm_ops variable static to resolve a
     sparse warning

  MediaTek MT6397:
   - Correct the hardware CIDs for the MT6328, MT6331, and MT6332
     PMICs to allow proper driver binding

  ROHM BD71828:
   - Enable system wakeup via the power button

  ROHM BD72720:
   - Add a new compatible string for the ROHM BD73900 PMIC

  SpacemiT P1:
   - Drop the deprecated "vin-supply" property from the devicetree
     bindings
   - Add individual regulator supply properties to match actual
     hardware topology

  STMicroelectronics STPMIC1:
   - Attempt system shutdown a second time to handle transient I2C
     communication failures

  Viperboard:
   - Drop redundant USB device references"

* tag 'mfd-next-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (28 commits)
  mfd: core: Preserve OF node when ACPI handle is present
  mfd: ene-kb3930: Use of_device_is_system_power_controller() wrapper
  mfd: intel-lpss: Add Intel Nova Lake-H PCI IDs
  dt-bindings: mfd: max77620: Document optional RTC address for MAX77663
  dt-bindings: mfd: max77620: Convert to DT schema
  mfd: ezx-pcap: Avoid rescheduling after destroying workqueue
  mfd: ezx-pcap: Return directly instead of empty gotos
  mfd: ezx-pcap: Drop memory allocation error message
  mfd: bcm2835-pm: Add BCM2712 PM device support
  mfd: bcm2835-pm: Introduce SoC-specific type identifier
  dt-bindings: mfd: bd72720: Add ROHM BD73900
  mfd: si476x: Fix kernel-doc warnings
  mfd: rsmu: Remove a empty kernel-doc line
  mfd: kempld: Fix kernel-doc struct member names
  mfd: congatec: Fix kernel-doc struct member names
  dt-bindings: mfd: Convert fsl-imx25-tsadc.txt to yaml format
  mfd: viperboard: Drop redundant device reference
  mfd: dln2: Switch to managed resources and fix bare unsigned types
  mfd: macsmc: Wire up Apple SMC power driver
  mfd: mt6397: Properly fix CID of MT6328, MT6331 and MT6332
  ...

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
"The usual collection of driver changes, more core infrastructure
  updates than typical this cycle:

   - Minor cleanups and kernel-doc fixes in bnxt_re, hns, rdmavt, efa,
     ocrdma, erdma, rtrs, hfi1, ionic, and pvrdma

   - New udata validation framework and driver updates

   - Modernize CQ creation interface in mlx4 and mlx5, manage CQ umem in
     core

   - Promote UMEM to a core component, split out DMA block iterator
     logic

   - Introduce FRMR pools with aging, statistics, pinned handles, and
     netlink control and use it in mlx5

   - Add PCIe TLP emulation support in mlx5

   - Extend umem to work with revocable pinned dmabuf's and use it in
     irdma

   - More net namespace improvements for rxe

   - GEN4 hardware support in irdma

   - First steps to MW and UC support in mana_ib

   - Support for CQ umem and doorbells in bnxt_re

   - Drop opa_vnic driver from hfi1

  Fixes:

   - IB/core zero dmac neighbor resolution race

   - GID table memory free

   - rxe pad/ICRC validation and r_key async errors

   - mlx4 external umem for CQ

   - umem DMA attributes on unmap

   - mana_ib RX steering on RSS QP destroy"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (116 commits)
  RDMA/core: Fix user CQ creation for drivers without create_cq
  RDMA/ionic: bound node_desc sysfs read with %.64s
  IB/core: Fix zero dmac race in neighbor resolution
  RDMA/mana_ib: Support memory windows
  RDMA/rxe: Validate pad and ICRC before payload_size() in rxe_rcv
  RDMA/core: Prefer NLA_NUL_STRING
  RDMA/core: Fix memory free for GID table
  RDMA/hns: Remove the duplicate calls to ib_copy_validate_udata_in()
  RDMA: Remove redundant = {} for udata req structs
  RDMA/irdma: Add missing comp_mask check in alloc_ucontext
  RDMA/hns: Add missing comp_mask check in create_qp
  RDMA/mlx5: Pull comp_mask validation into ib_copy_validate_udata_in_cm()
  RDMA: Use ib_copy_validate_udata_in_cm() for zero comp_mask
  RDMA/hns: Use ib_copy_validate_udata_in()
  RDMA/mlx4: Use ib_copy_validate_udata_in() for QP
  RDMA/mlx4: Use ib_copy_validate_udata_in()
  RDMA/mlx5: Use ib_copy_validate_udata_in() for MW
  RDMA/mlx5: Use ib_copy_validate_udata_in() for SRQ
  RDMA/pvrdma: Use ib_copy_validate_udata_in() for srq
  RDMA: Use ib_copy_validate_udata_in() for implicit full structs
  ...

Merge tag 'ntfs3_for_7.1' of https://github.com/Paragon-Software-Group/linux-ntfs3

Pull ntfs3 updates from Konstantin Komarov:
"New:
   - reject inodes with zero non-DOS link count
   - return folios from ntfs_lock_new_page()
   - subset of W=1 warnings for stricter checks
   - work around -Wmaybe-uninitialized warnings
   - buffer boundary checks to run_unpack()
   - terminate the cached volume label after UTF-8 conversion

  Fixes:
   - check return value of indx_find to avoid infinite loop
   - prevent uninitialized lcn caused by zero len
   - increase CLIENT_REC name field size to prevent buffer overflow
   - missing run load for vcn0 in attr_data_get_block_locked()
   - memory leak in indx_create_allocate()
   - OOB write in attr_wof_frame_info()
   - mount failure on volumes with fragmented MFT bitmap
   - integer overflow in run_unpack() volume boundary check
   - validate rec->used in journal-replay file record check

  Updates:
   - resolve compare function in public index APIs
   - $LXDEV xattr lookup
   - potential double iput on d_make_root() failure
   - initialize err in ni_allocate_da_blocks_locked()
   - correct the pre_alloc condition in attr_allocate_clusters()"

* tag 'ntfs3_for_7.1' of https://github.com/Paragon-Software-Group/linux-ntfs3:
  fs/ntfs3: fix Smatch warnings
  fs/ntfs3: validate rec->used in journal-replay file record check
  fs/ntfs3: terminate the cached volume label after UTF-8 conversion
  fs/ntfs3: fix potential double iput on d_make_root() failure
  ntfs3: fix integer overflow in run_unpack() volume boundary check
  ntfs3: add buffer boundary checks to run_unpack()
  ntfs3: fix mount failure on volumes with fragmented MFT bitmap
  fs/ntfs3: fix $LXDEV xattr lookup
  ntfs3: fix OOB write in attr_wof_frame_info()
  ntfs3: fix memory leak in indx_create_allocate()
  ntfs3: work around false-postive -Wmaybe-uninitialized warnings
  fs/ntfs3: fix missing run load for vcn0 in attr_data_get_block_locked()
  fs/ntfs3: increase CLIENT_REC name field size
  fs/ntfs3: prevent uninitialized lcn caused by zero len
  fs/ntfs3: add a subset of W=1 warnings for stricter checks
  fs/ntfs3: return folios from ntfs_lock_new_page()
  fs/ntfs3: resolve compare function in public index APIs
  ntfs3: reject inodes with zero non-DOS link count

selftests/sched_ext: Add non_scx_kfunc_deny test

Verify that the BPF verifier rejects a non-SCX struct_ops program
(tcp_congestion_ops) that attempts to call an SCX kfunc (scx_bpf_kick_cpu).
The test expects the load to fail with -EACCES from scx_kfunc_context_filter.

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

sched_ext: Deny SCX kfuncs to non-SCX struct_ops programs

scx_kfunc_context_filter() currently allows non-SCX struct_ops programs
(e.g. tcp_congestion_ops) to call SCX unlocked kfuncs. This is wrong
for two reasons:

- It is semantically incorrect: a TCP congestion control program has no
  business calling SCX kfuncs such as scx_bpf_kick_cpu().

- With CONFIG_EXT_SUB_SCHED=y, kfuncs like scx_bpf_kick_cpu() call
  scx_prog_sched(aux), which invokes bpf_prog_get_assoc_struct_ops(aux)
  and casts the result to struct sched_ext_ops * before reading ops->priv.
  For a non-SCX struct_ops program the returned pointer is the kdata of
  that struct_ops type, which is far smaller than sched_ext_ops, making
  the read an out-of-bounds access (confirmed with KASAN).

Extend the filter to cover scx_kfunc_set_any and scx_kfunc_set_idle as
well, and deny all SCX kfuncs for any struct_ops program that is not the
SCX struct_ops. This addresses both issues: the semantic contract is
enforced at the verifier level, and the runtime out-of-bounds access
becomes unreachable.
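
The resulting policy can be pictured with a small standalone model. This
is a sketch, not the kernel's scx_kfunc_context_filter(): the toy_*
types and names are hypothetical, and the model assumes that programs
which are not struct_ops at all are left to whatever other checks apply
to them.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of the calling BPF program. */
enum toy_prog_kind {
	TOY_PROG_SCX_STRUCT_OPS,	/* the sched_ext struct_ops */
	TOY_PROG_OTHER_STRUCT_OPS,	/* e.g. tcp_congestion_ops */
	TOY_PROG_NOT_STRUCT_OPS,	/* not modelled further here */
};

struct toy_prog {
	enum toy_prog_kind kind;
};

/* Returns 0 to allow the call, -EACCES to reject it at load time. */
static int toy_scx_kfunc_filter(const struct toy_prog *prog, bool is_scx_kfunc)
{
	assert(prog != NULL);

	/* Kfuncs outside the SCX sets are not this filter's concern. */
	if (!is_scx_kfunc)
		return 0;
	/* A struct_ops program that is not SCX may not call any SCX kfunc. */
	if (prog->kind == TOY_PROG_OTHER_STRUCT_OPS)
		return -EACCES;
	return 0;
}
```

Rejecting the call at load time means the cast to struct sched_ext_ops *
is never reached for a foreign struct_ops program.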

Fixes: d1d3c1c6ae36 ("sched_ext: Add verifier-time kfunc context filter")
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

dm-thin: fix metadata refcount underflow

There's a bug in dm-thin in the function rebalance_children. If the
internal btree node has one entry, the code tries to copy all btree
entries from the node's child to the node itself and then decrement the
child's reference count.

If the child node is shared (it has reference count > 1), we won't free
it, so there would be two pointers to each of the grandchildren nodes.
But the reference counts of the grandchildren are not increased, so the
reference counts no longer match the number of pointers that point to the
grandchildren. This results in "device mapper: space map common: unable
to decrement block" errors.

Fix this bug by incrementing reference counts on the grandchildren if the
btree node is shared.
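
The accounting can be checked with a toy model. This is a simplified
sketch, not the dm persistent-data code: struct toy_node,
toy_collapse_single_child() and the in-memory reference counts are
hypothetical stand-ins for the on-disk btree nodes and the space map.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ENTRIES 4

/* Hypothetical in-memory stand-in for a btree node and its refcount. */
struct toy_node {
	int refcount;
	size_t nr_entries;
	struct toy_node *entries[TOY_MAX_ENTRIES];
};

/* Pull the entries of the parent's only child up into the parent. */
static void toy_collapse_single_child(struct toy_node *parent)
{
	struct toy_node *child;
	size_t i;

	assert(parent != NULL && parent->nr_entries == 1);
	child = parent->entries[0];

	/*
	 * A shared child survives the decrement below, so every grandchild
	 * ends up referenced by both the child and the parent. Account for
	 * the extra pointer before copying.
	 */
	if (child->refcount > 1) {
		for (i = 0; i < child->nr_entries; i++)
			child->entries[i]->refcount++;
	}

	for (i = 0; i < child->nr_entries; i++)
		parent->entries[i] = child->entries[i];
	parent->nr_entries = child->nr_entries;

	/* The parent no longer points at the child. */
	child->refcount--;
}
```

In the unshared case the child's count drops to zero and its pointers
disappear with it, so the grandchildren keep a count of one without any
adjustment.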

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 3241b1d3e0aa ("dm: add persistent data library")
Cc: stable@vger.kernel.org

Merge tag 'ecryptfs-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull eCryptfs updates from Tyler Hicks:

- avoid unnecessary eCryptfs inode timestamp truncation by re-using the
   lower filesystem's time granularity

- various small code cleanups

- reorganize the setattr hook inode resizing to improve style and
   readability, remove an unnecessary memory allocation when shrinking,
   and support an upcoming rework of the VFS interfaces involved in
   truncation

* tag 'ecryptfs-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  ecryptfs: keep the lower iattr contained in truncate_upper
  ecryptfs: factor out a ecryptfs_iattr_to_lower helper
  ecryptfs: merge ecryptfs_inode_newsize_ok into truncate_upper
  ecryptfs: combine the two ATTR_SIZE blocks in ecryptfs_setattr
  ecryptfs: use ZERO_PAGE instead of allocating zeroed memory in truncate_upper
  ecryptfs: streamline truncate_upper
  ecryptfs: cleanup ecryptfs_setattr
  ecryptfs: Drop TODO comment in ecryptfs_derive_iv
  ecryptfs: Fix typo in ecryptfs_derive_iv function comment
  ecryptfs: Log function name only once in decode_and_decrypt_filename
  ecryptfs: Remove redundant if checks in encrypt_and_encode_filename
  ecryptfs: Fix tag number in encrypt_filename() error message
  ecryptfs: Use struct_size to improve process_response + send_miscdev
  ecryptfs: Replace memcpy + manual NUL termination with strscpy
  ecryptfs: Set s_time_gran to get correct time granularity

Merge tag 'nfsd-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:

- filehandle signing to defend against filehandle-guessing attacks
   (Benjamin Coddington)

   The server now appends a SipHash-2-4 MAC to each filehandle when
   the new "sign_fh" export option is enabled. NFSD then verifies
   filehandles received from clients against the expected MAC;
   mismatches return NFS error STALE

- convert the entire NLMv4 server-side XDR layer from hand-written C to
   xdrgen-generated code, spanning roughly thirty patches (Chuck Lever)

   XDR functions are generally boilerplate code and are easy to get
   wrong. The goals of this conversion are improved memory safety, lower
   maintenance burden, and groundwork for eventual Rust code generation
   for these functions.

- improve pNFS block/SCSI layout robustness with two related changes
   (Dai Ngo)

   SCSI persistent reservation fencing is now tracked per client and
   per device via an xarray, to avoid both redundant preempt operations
   on devices already fenced and a potential NFSD deadlock when all nfsd
   threads are waiting for a layout return.

- scalability and infrastructure improvements

   Sincere thanks to all contributors, reviewers, testers, and bug
   reporters who participated in the v7.1 NFSD development cycle.

* tag 'nfsd-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (83 commits)
  NFSD: Docs: clean up pnfs server timeout docs
  nfsd: fix comment typo in nfsxdr
  nfsd: fix comment typo in nfs3xdr
  NFSD: convert callback RPC program to per-net namespace
  NFSD: use per-operation statidx for callback procedures
  svcrdma: Use contiguous pages for RDMA Read sink buffers
  SUNRPC: Add svc_rqst_page_release() helper
  SUNRPC: xdr.h: fix all kernel-doc warnings
  svcrdma: Factor out WR chain linking into helper
  svcrdma: Add Write chunk WRs to the RPC's Send WR chain
  svcrdma: Clean up use of rdma->sc_pd->device
  svcrdma: Clean up use of rdma->sc_pd->device in Receive paths
  svcrdma: Add fair queuing for Send Queue access
  SUNRPC: Optimize rq_respages allocation in svc_alloc_arg
  SUNRPC: Track consumed rq_pages entries
  svcrdma: preserve rq_next_page in svc_rdma_save_io_pages
  SUNRPC: Handle NULL entries in svc_rqst_release_pages
  SUNRPC: Allocate a separate Reply page array
  SUNRPC: Tighten bounds checking in svc_rqst_replace_page
  NFSD: Sign filehandles
  ...

ASoC: Correct bug parsing DisCo booleans

Charles Keepax <ckeepax@opensource.cirrus.com> says:

MIPI DisCo uses the unfortunate convention of allowing boolean
properties to be present but have a zero value, as opposed to the
normal convention of simply not specifying the property. Fix an
issue in the SDCA code where mipi-sdca-control-deferrable is not
parsed correctly.
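
The difference between the two conventions can be sketched with a toy
property table. This is an assumption-laden illustration, not the SDCA
parsing code: struct toy_fw_prop and toy_disco_read_bool() are
hypothetical and stand in for the firmware property lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical firmware property: a name and an integer value. */
struct toy_fw_prop {
	const char *name;
	uint32_t value;
};

/*
 * DisCo-style boolean: true only if the property is present AND non-zero.
 * A presence-only check would wrongly report true for a property that is
 * present with the value 0.
 */
static bool toy_disco_read_bool(const struct toy_fw_prop *props, size_t n,
				const char *name)
{
	size_t i;

	assert(name != NULL);

	for (i = 0; i < n; i++) {
		if (strcmp(props[i].name, name) == 0)
			return props[i].value != 0;
	}
	return false;	/* absent means false under both conventions */
}
```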

However, we also have some shipping ACPIs where these properties
are not specified correctly. Update the MBQ regmap to attempt defers,
albeit with a warning, in the case where a control attempts to defer
but is not marked as such. There is little downside to this: if
defer is genuinely not supported, the control will just return
the same error again.