Victor Nogueira [Wed, 25 Feb 2026 13:43:48 +0000 (10:43 -0300)]
net/sched: Only allow act_ct to bind to clsact/ingress qdiscs and shared blocks
As Paolo said earlier [1]:
"Since the blamed commit below, classify can return TC_ACT_CONSUMED while
the current skb being held by the defragmentation engine. As reported by
GangMin Kim, if such packet is that may cause a UaF when the defrag engine
later on tries to tuch again such packet."
act_ct was never meant to be used in the egress path, however some users
are attaching it to egress today [2]. Attempting to reach a middle
ground, we noticed that, while most qdiscs are not handling
TC_ACT_CONSUMED, clsact/ingress qdiscs are. With that in mind, we
address the issue by only allowing act_ct to bind to clsact/ingress
qdiscs and shared blocks. That way it's still possible to attach act_ct to
egress (albeit only with clsact).
Florian Westphal [Thu, 26 Feb 2026 16:19:17 +0000 (17:19 +0100)]
selftests: netfilter: nft_queue.sh: avoid flakes on debug kernels
Jakub reports test flakes on debug kernels:
FAIL: test_udp_gro_ct: Expected software segmentation to occur, had 23 and 17
This test assumes that the kernels nfnetlink_queue module sees N GSO
packets, segments them into M skbs and queues them to userspace for
reinjection.
Hence, if M >= N, no segmentation occurred.
However, its possible that this happens:
- nfnetlink_queue gets GSO packet
- segments that into n skbs
- userspace buffer is full, kernel drops the segmented skbs
-> "toqueue" counter incremented by 1, "fromqueue" is unchanged.
If this happens often enough in a single run, M >= N check triggers
incorrectly.
To solve this, allow the nf_queue.c test program to set the FAIL_OPEN
flag so that the segmented skbs bypass the queueing step in the kernel
if the receive buffer is full.
Also, reduce number of sending socat instances, decrease their priority
and increase nice value for the nf_queue program itself to reduce the
probability of overruns happening in the first place.
Fixes: 59ecffa3995e ("selftests: netfilter: nft_queue.sh: add udp fraglist gro test case") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20260218184114.0b405b72@kernel.org/ Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260226161920.1205-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jonas Köppeler [Thu, 26 Feb 2026 11:40:16 +0000 (12:40 +0100)]
net/sched: sch_cake: fixup cake_mq rate adjustment for diffserv config
cake_mq's rate adjustment during the sync periods did not adjust the
rates for every tin in a diffserv config. This lead to inconsistencies
of rates between the tins. Fix this by setting the rates for all tins
during synchronization.
====================
Fix invariant violation for single-value tnums
We're hitting an invariant violation in Cilium that sometimes leads to
BPF programs being rejected and Cilium failing to start [1]. As far as
I know this is the first case of invariant violation found in a real
program (i.e., not by a fuzzer). The following extract from verifier
logs shows what's happening:
More details are given in the second patch, but in short, the verifier
should be able to detect that the false branch of instruction 237 is
never true. After instruction 236, the u64 range and the tnum overlap
in a single value, 0xf00.
The long-term solution to invariant violation is likely to rely on the
refinement + invariant violation check to detect dead branches, as
started by Eduard. To fix the current issue, we need something with
less refactoring that we can backport to affected kernels.
The solution implemented in the second patch is to improve the bounds
refinement to avoid this case. It relies on a new tnum helper,
tnum_step, first sent as an RFC in [2]. The last two patches extend and
update the selftests.
Link: https://github.com/cilium/cilium/issues/44216 Link: https://lore.kernel.org/bpf/20251107192328.2190680-2-harishankar.vishwanathan@gmail.com/
Changes in v3:
- Fix commit description error spotted by AI bot.
- Simplify constants in first two tests (Eduard).
- Rework comment on third test (Eduard).
- Add two new negative test cases (Eduard).
- Rebased.
Changes in v2:
- Add guard suggested by Hari in tnum_step, to avoid undefined
behavior spotted by AI code review.
- Add explanation diagrams in code as suggested by Eduard.
- Rework conditions for readability as suggested by Eduard.
- Updated reference to SMT formula.
- Rebased.
====================
Paul Chaignon [Fri, 27 Feb 2026 21:42:45 +0000 (22:42 +0100)]
selftests/bpf: Avoid simplification of crafted bounds test
The reg_bounds_crafted tests validate the verifier's range analysis
logic. They focus on the actual ranges and thus ignore the tnum. As a
consequence, they carry the assumption that the tested cases can be
reproduced in userspace without using the tnum information.
Unfortunately, the previous change the refinement logic breaks that
assumption for one test case:
The tested bytecode is shown below. Without our previous improvement, on
the false branch of the condition, R7 is only known to have u64 range
[0xfffffffe; 0x100000000]. With our improvement, and using the tnum
information, we can deduce that R7 equals 0x100000000.
R7's tnum is (0; 0x1ffffffff). On the false branch, regs_refine_cond_op
refines R7's u32 range to [0; 0x7fffffff]. Then, __reg32_deduce_bounds
refines the s32 range to 0 using u32 and finally also sets u32=0.
From this, __reg_bound_offset improves the tnum to (0; 0x100000000).
Finally, our previous patch uses this new tnum to deduce that it only
intersect with u64=[0xfffffffe; 0x100000000] in a single value:
0x100000000.
Because the verifier uses the tnum to reach this constant value, the
selftest is unable to reproduce it by only simulating ranges. The
solution implemented in this patch is to change the test case such that
there is more than one overlap value between u64 and the tnum. The max.
u64 value is thus changed from 0x100000000 to 0x300000000.
Paul Chaignon [Fri, 27 Feb 2026 21:36:30 +0000 (22:36 +0100)]
selftests/bpf: Test refinement of single-value tnum
This patch introduces selftests to cover the new bounds refinement
logic introduced in the previous patch. Without the previous patch,
the first two tests fail because of the invariant violation they
trigger. The last test fails because the R10 access is not detected as
dead code. In addition, all three tests fail because of R0 having a
non-constant value in the verifier logs.
In addition, the last two cases are covering the negative cases: when we
shouldn't refine the bounds because the u64 and tnum overlap in at least
two values.
Paul Chaignon [Fri, 27 Feb 2026 21:35:02 +0000 (22:35 +0100)]
bpf: Improve bounds when tnum has a single possible value
We're hitting an invariant violation in Cilium that sometimes leads to
BPF programs being rejected and Cilium failing to start [1]. The
following extract from verifier logs shows what's happening:
We reach instruction 236 with two possible values for R9, 0xe00 and
0xf00. This is perfectly reflected in the tnum, but of course the ranges
are less accurate and cover [0xe00; 0xf00]. Taking the fallthrough path
at instruction 236 allows the verifier to reduce the range to
[0xe01; 0xf00]. The tnum is however not updated.
With these ranges, at instruction 237, the verifier is not able to
deduce that R9 is always equal to 0xf00. Hence the fallthrough pass is
explored first, the verifier refines the bounds using the assumption
that R9 != 0xf00, and ends up with an invariant violation.
This pattern of impossible branch + bounds refinement is common to all
invariant violations seen so far. The long-term solution is likely to
rely on the refinement + invariant violation check to detect dead
branches, as started by Eduard. To fix the current issue, we need
something with less refactoring that we can backport.
This patch uses the tnum_step helper introduced in the previous patch to
detect the above situation. In particular, three cases are now detected
in the bounds refinement:
1. The u64 range and the tnum only overlap in umin.
u64: ---[xxxxxx]-----
tnum: --xx----------x-
2. The u64 range and the tnum only overlap in the maximum value
represented by the tnum, called tmax.
u64: ---[xxxxxx]-----
tnum: xx-----x--------
3. The u64 range and the tnum only overlap in between umin (excluded)
and umax.
u64: ---[xxxxxx]-----
tnum: xx----x-------x-
To detect these three cases, we call tnum_step(tnum, umin), which
returns the smallest member of the tnum greater than umin, called
tnum_next here. We're in case (1) if umin is part of the tnum and
tnum_next is greater than umax. We're in case (2) if umin is not part of
the tnum and tnum_next is equal to tmax. Finally, we're in case (3) if
umin is not part of the tnum, tnum_next is inferior or equal to umax,
and calling tnum_step a second time gives us a value past umax.
This change implements these three cases. With it, the above bytecode
looks as follows:
In addition to the new selftests, this change was also verified with
Agni [3]. For the record, the raw SMT is available at [4]. The property
it verifies is that: If a concrete value x is contained in all input
abstract values, after __update_reg_bounds, it will continue to be
contained in all output abstract values.
bpf: Introduce tnum_step to step through tnum's members
This commit introduces tnum_step(), a function that, when given t, and a
number z returns the smallest member of t larger than z. The number z
must be greater or equal to the smallest member of t and less than the
largest member of t.
The first step is to compute j, a number that keeps all of t's known
bits, and matches all unknown bits to z's bits. Since j is a member of
the t, it is already a candidate for result. However, we want our result
to be (minimally) greater than z.
There are only two possible cases:
(1) Case j <= z. In this case, we want to increase the value of j and
make it > z.
(2) Case j > z. In this case, we want to decrease the value of j while
keeping it > z.
(Case 1.1) Let's first consider the case where j < z. We will address j
== z later.
Since z > j, there had to be a bit position that was 1 in z and a 0 in
j, beyond which all positions of higher significance are equal in j and
z. Further, this position could not have been unknown in a, because the
unknown positions of a match z. This position had to be a 1 in z and
known 0 in t.
Let k be position of the most significant 1-to-0 flip. In our example, k
= 3 (starting the count at 1 at the least significant bit). Setting (to
1) the unknown bits of t in positions of significance smaller than
k will not produce a result > z. Hence, we must set/unset the unknown
bits at positions of significance higher than k. Specifically, we look
for the next larger combination of 1s and 0s to place in those
positions, relative to the combination that exists in z. We can achieve
this by concatenating bits at unknown positions of t into an integer,
adding 1, and writing the bits of that result back into the
corresponding bit positions previously extracted from z.
>From our example, considering only positions of significance greater
than k:
t = xx..x
z = 10..1
+ 1
-----
11..0
This is the exact combination 1s and 0s we need at the unknown bits of t
in positions of significance greater than k. Further, our result must
only increase the value minimally above z. Hence, unknown bits in
positions of significance smaller than k should remain 0. We finally
have,
Matching the unknown bits of the t to the bits of z yielded exactly z.
To produce a number greater than z, we must set/unset the unknown bits
in t, and *all* the unknown bits of t candidates for being set/unset. We
can do this similar to Case 1.1, by adding 1 to the bits extracted from
the masked bit positions of z. Essentially, this case is equivalent to
Case 1.1, with k = 0.
t = 1x1x0xxx
z = .0.1.100
+ 1
---------
.0.1.101
This is the exact combination of bits needed in the unknown positions of
t. After recalling the known positions of t, we get
Since j > z, there had to be a bit position which was 0 in z, and a 1 in
j, beyond which all positions of higher significance are equal in j and
z. This position had to be a 0 in z and known 1 in t. Let k be the
position of the most significant 0-to-1 flip. In our example, k = 4.
Because of the 0-to-1 flip at position k, a member of t can become
greater than z if the bits in positions greater than k are themselves >=
to z. To make that member *minimally* greater than z, the bits in
positions greater than k must be exactly = z. Hence, we simply match all
of t's unknown bits in positions more significant than k to z's bits. In
positions less significant than k, we set all t's unknown bits to 0
to retain minimality.
In our example, in positions of greater significance than k (=4),
t=x000. These positions are matched with z (1000) to produce 1000. In
positions of lower significance than k, t=10x1. All unknown bits are set
to 0 to produce 1001. The final result is:
This concludes the computation for a result > z that is a member of t.
The procedure for tnum_step() in this commit implements the idea
described above. As a proof of correctness, we verified the algorithm
against a logical specification of tnum_step. The specification asserts
the following about the inputs t, z and output res that:
1. res is a member of t, and
2. res is strictly greater than z, and
3. there does not exist another value res2 such that
3a. res2 is also a member of t, and
3b. res2 is greater than z
3c. res2 is smaller than res
We checked the implementation against this logical specification using
an SMT solver. The verification formula in SMTLIB format is available
at [1]. The verification returned an "unsat": indicating that no input
assignment exists for which the implementation and the specification
produce different outputs.
In addition, we also automatically generated the logical encoding of the
C implementation using Agni [2] and verified it against the same
specification. This verification also returned an "unsat", confirming
that the implementation is equivalent to the specification. The formula
for this check is also available at [3].
Paul Moses [Mon, 23 Feb 2026 15:05:44 +0000 (15:05 +0000)]
net/sched: act_gate: snapshot parameters with RCU on replace
The gate action can be replaced while the hrtimer callback or dump path is
walking the schedule list.
Convert the parameters to an RCU-protected snapshot and swap updates under
tcf_lock, freeing the previous snapshot via call_rcu(). When REPLACE omits
the entry list, preserve the existing schedule so the effective state is
unchanged.
Fixes: a51c328df310 ("net: qos: introduce a gate control flow action") Cc: stable@vger.kernel.org Signed-off-by: Paul Moses <p@1g4.org> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260223150512.2251594-2-p@1g4.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
bpf: Fix per-CPU bulk queue races on PREEMPT_RT
On PREEMPT_RT kernels, local_bh_disable() only calls migrate_disable()
(when PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption. This means CFS scheduling can preempt a task inside the
per-CPU bulk queue (bq) operations in cpumap and devmap, allowing
another task on the same CPU to concurrently access the same bq,
leading to use-after-free, list corruption, and kernel panics.
Patch 1 fixes the cpumap race in bq_flush_to_queue(), originally
reported by syzbot [1].
Patch 2 fixes the same class of race in devmap's bq_xmit_all(),
identified by code inspection after Sebastian Andrzej Siewior pointed
out that devmap has the same per-CPU bulk queue pattern [2].
Both patches use local_lock_nested_bh() to serialize access to the
per-CPU bq. On non-RT this is a pure lockdep annotation with no
overhead; on PREEMPT_RT it provides a per-CPU sleeping lock.
To reproduce the devmap race, insert an mdelay(100) in bq_xmit_all()
after "cnt = bq->count" and before the actual transmit loop. Then pin
two threads to the same CPU, each running BPF_PROG_TEST_RUN with an XDP
program that redirects to a DEVMAP entry (e.g. a veth pair). CFS
timeslicing during the mdelay window causes interleaving. Without the
fix, KASAN reports null-ptr-deref due to operating on freed frames:
BUG: KASAN: null-ptr-deref in __build_skb_around+0x22d/0x340
Write of size 32 at addr 0000000000000d50 by task devmap_race_rep/449
v3 -> v4: https://lore.kernel.org/all/20260213034018.284146-1-jiayuan.chen@linux.dev/
- Move panic trace to cover letter. (Sebastian Andrzej Siewior)
- Add Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> to both patches
from cover letter.
v2 -> v3: https://lore.kernel.org/bpf/20260212023634.366343-1-jiayuan.chen@linux.dev/
- Fix commit message: remove incorrect "spin_lock() becomes rt_mutex"
claim, the per-CPU bq has no spin_lock at all. (Sebastian Andrzej Siewior)
- Fix commit message: accurately describe local_lock_nested_bh()
behavior instead of referencing local_lock(). (Sebastian Andrzej Siewior)
- Remove incomplete discussion of snapshot alternative.
(Sebastian Andrzej Siewior)
- Remove panic trace from commit message. (Sebastian Andrzej Siewior)
- Add patch 2/2 for devmap, same race pattern. (Sebastian Andrzej Siewior)
v1 -> v2: https://lore.kernel.org/bpf/20260211064417.196401-1-jiayuan.chen@linux.dev/
- Use local_lock_nested_bh()/local_unlock_nested_bh() instead of
local_lock()/local_unlock(), since these paths already run under
local_bh_disable(). (Sebastian Andrzej Siewior)
- Replace "Caller must hold bq->bq_lock" comment with
lockdep_assert_held() in bq_flush_to_queue(). (Sebastian Andrzej Siewior)
- Fix Fixes tag to 3253cb49cbad ("softirq: Allow to drop the
softirq-BKL lock on PREEMPT_RT") which is the actual commit that
makes the race possible. (Sebastian Andrzej Siewior)
====================
Jiayuan Chen [Wed, 25 Feb 2026 12:14:56 +0000 (20:14 +0800)]
bpf: Fix race in devmap on PREEMPT_RT
On PREEMPT_RT kernels, the per-CPU xdp_dev_bulk_queue (bq) can be
accessed concurrently by multiple preemptible tasks on the same CPU.
The original code assumes bq_enqueue() and __dev_flush() run atomically
with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() (when
PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption, which allows CFS scheduling to preempt a task during
bq_xmit_all(), enabling another task on the same CPU to enter
bq_enqueue() and operate on the same per-CPU bq concurrently.
This leads to several races:
1. Double-free / use-after-free on bq->q[]: bq_xmit_all() snapshots
cnt = bq->count, then iterates bq->q[0..cnt-1] to transmit frames.
If preempted after the snapshot, a second task can call bq_enqueue()
-> bq_xmit_all() on the same bq, transmitting (and freeing) the
same frames. When the first task resumes, it operates on stale
pointers in bq->q[], causing use-after-free.
2. bq->count and bq->q[] corruption: concurrent bq_enqueue() modifying
bq->count and bq->q[] while bq_xmit_all() is reading them.
3. dev_rx/xdp_prog teardown race: __dev_flush() clears bq->dev_rx and
bq->xdp_prog after bq_xmit_all(). If preempted between
bq_xmit_all() return and bq->dev_rx = NULL, a preempting
bq_enqueue() sees dev_rx still set (non-NULL), skips adding bq to
the flush_list, and enqueues a frame. When __dev_flush() resumes,
it clears dev_rx and removes bq from the flush_list, orphaning the
newly enqueued frame.
4. __list_del_clearprev() on flush_node: similar to the cpumap race,
both tasks can call __list_del_clearprev() on the same flush_node,
the second dereferences the prev pointer already set to NULL.
The race between task A (__dev_flush -> bq_xmit_all) and task B
(bq_enqueue -> bq_xmit_all) on the same CPU:
Task A (xdp_do_flush) Task B (ndo_xdp_xmit redirect)
---------------------- --------------------------------
__dev_flush(flush_list)
bq_xmit_all(bq)
cnt = bq->count /* e.g. 16 */
/* start iterating bq->q[] */
<-- CFS preempts Task A -->
bq_enqueue(dev, xdpf)
bq->count == DEV_MAP_BULK_SIZE
bq_xmit_all(bq, 0)
cnt = bq->count /* same 16! */
ndo_xdp_xmit(bq->q[])
/* frames freed by driver */
bq->count = 0
<-- Task A resumes -->
ndo_xdp_xmit(bq->q[])
/* use-after-free: frames already freed! */
Fix this by adding a local_lock_t to xdp_dev_bulk_queue and acquiring
it in bq_enqueue() and __dev_flush(). These paths already run under
local_bh_disable(), so use local_lock_nested_bh() which on non-RT is
a pure annotation with no overhead, and on PREEMPT_RT provides a
per-CPU sleeping lock that serializes access to the bq.
Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT") Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260225121459.183121-3-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jiayuan Chen [Wed, 25 Feb 2026 12:14:55 +0000 (20:14 +0800)]
bpf: Fix race in cpumap on PREEMPT_RT
On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed
concurrently by multiple preemptible tasks on the same CPU.
The original code assumes bq_enqueue() and __cpu_map_flush() run
atomically with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() (when
PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption, which allows CFS scheduling to preempt a task during
bq_flush_to_queue(), enabling another task on the same CPU to enter
bq_enqueue() and operate on the same per-CPU bq concurrently.
This leads to several races:
1. Double __list_del_clearprev(): after bq->count is reset in
bq_flush_to_queue(), a preempting task can call bq_enqueue() ->
bq_flush_to_queue() on the same bq when bq->count reaches
CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev()
on the same bq->flush_node, the second call dereferences the
prev pointer that was already set to NULL by the first.
2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt
the packet queue while bq_flush_to_queue() is processing it.
The race between task A (__cpu_map_flush -> bq_flush_to_queue) and
task B (bq_enqueue -> bq_flush_to_queue) on the same CPU:
Task A (xdp_do_flush) Task B (cpu_map_enqueue)
---------------------- ------------------------
bq_flush_to_queue(bq)
spin_lock(&q->producer_lock)
/* flush bq->q[] to ptr_ring */
bq->count = 0
spin_unlock(&q->producer_lock)
bq_enqueue(rcpu, xdpf)
<-- CFS preempts Task A --> bq->q[bq->count++] = xdpf
/* ... more enqueues until full ... */
bq_flush_to_queue(bq)
spin_lock(&q->producer_lock)
/* flush to ptr_ring */
spin_unlock(&q->producer_lock)
__list_del_clearprev(flush_node)
/* sets flush_node.prev = NULL */
<-- Task A resumes -->
__list_del_clearprev(flush_node)
flush_node.prev->next = ...
/* prev is NULL -> kernel oops */
Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
in bq_enqueue() and __cpu_map_flush(). These paths already run under
local_bh_disable(), so use local_lock_nested_bh() which on non-RT is
a pure annotation with no overhead, and on PREEMPT_RT provides a
per-CPU sleeping lock that serializes access to the bq.
To reproduce, insert an mdelay(100) between bq->count = 0 and
__list_del_clearprev() in bq_flush_to_queue(), then run reproducer
provided by syzkaller.
Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT") Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/ Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260225121459.183121-2-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
====================
Close race in freeing special fields and map value
There exists a race across various map types where the freeing of
special fields (tw, timer, wq, kptr, etc.) can be done eagerly when a
logical delete operation is done on a map value, such that the program
which continues to have access to such a map value can recreate the
fields and cause them to leak.
The set contains fixes for this case. It is a continuation of Mykyta's
previous attempt in [0], but applies to all fields. A test is included
which reproduces the bug reliably in absence of the fixes.
Local Storage Benchmarks
------------------------
Evaluation Setup: Benchmarked on a dual-socket Intel Xeon Gold 6348 (Ice
Lake) @ 2.60GHz (56 cores / 112 threads), with the CPU governor set to
performance. Bench was pinned to a single NUMA node throughout the test.
Benchmark comes from [1] using the following command:
./bench -p 1 local-storage-create --storage-type <socket,task> --batch-size <16,32,64>
Before the test, 10 runs of all cases ([socket|task] x 3 batch sizes x 7
iterations per batch size) are done to warm up and prime the machine.
Then, 3 runs of all cases are done (with and without the patch, across
reboots).
For each comparison, we have 21 samples, i.e. per batch size (e.g.
socket 16) of a given local storage, we have 3 runs x 7 iterations.
The statistics (mean, median, stddev) and t-test is done for each
scenario (local storage and batch size pair) individually (21 samples
for either case). All values are for local storage creations in thousand
creations / sec (k/s).
The cases for socket are within the range of noise, and improvements in task
local storage are due to high variance (CV ~4%-6% across batch sizes). The only
statistically significant case worth mentioning is socket with batch size 64
with p-value from t-test < 0.05, but the absolute difference is small (~2k/s).
TL;DR there doesn't appear to be any significant regression or improvement.
* Add Paul's Reviewed-by.
* Fix use-after-free in accessing bpf_mem_alloc embedded in map. (syzbot CI)
* Add benchmark numbers for local storage.
* Add extra test case for per-cpu hashmap coverage with up to 16 refcount leaks.
* Target bpf tree.
====================
Add a couple of tests to ensure that the refcount drops to zero when we
exercise the race where creation of a special field succeeds the logical
bpf_obj_free_fields done when deleting an element. Prior to previous
changes, the fields would be freed eagerly and repopulate and end up
leaking, causing the reference to not drop down correctly. Running this
test on a kernel without fixes will cause a hang in delete_module, since
the module reference stays active due to the leaked kptr not dropping
it. After the fixes tests succeed as expected.
Currently, when use_kmalloc_nolock is false, the freeing of fields for a
local storage selem is done eagerly before waiting for the RCU or RCU
tasks trace grace period to elapse. This opens up a window where the
program which has access to the selem can recreate the fields after the
freeing of fields is done eagerly, causing memory leaks when the element
is finally freed and returned to the kernel.
Make a few changes to address this. First, delay the freeing of fields
until after the grace periods have expired using a __bpf_selem_free_rcu
wrapper which is eventually invoked after transitioning through the
necessary number of grace period waits. Replace usage of the kfree_rcu
with call_rcu to be able to take a custom callback. Finally, care needs
to be taken to extend the rcu barriers for all cases, and not just when
use_kmalloc_nolock is true, as RCU and RCU tasks trace callbacks can be
in flight for either case and access the smap field, which is used to
obtain the BTF record to walk over special fields in the map value.
While we're at it, drop migrate_disable() from bpf_selem_free_rcu, since
migration should be disabled for RCU callbacks already.
BPF hash map may now use the map_check_btf() callback to decide whether
to set a dtor on its bpf_mem_alloc or not. Unlike C++ where members can
opt out of const-ness using mutable, we must lose the const qualifier on
the callback such that we can avoid the ugly cast. Make the change and
adjust all existing users, and lose the comment in hashtab.c.
There is a race window where BPF hash map elements can leak special
fields if the program with access to the map value recreates these
special fields between the check_and_free_fields done on the map value
and its eventual return to the memory allocator.
Several ways were explored prior to this patch, most notably [0] tried
to use a poison value to reject attempts to recreate special fields for
map values that have been logically deleted but still accessible to BPF
programs (either while sitting in the free list or when reused). While
this approach works well for task work, timers, wq, etc., it is harder
to apply the idea to kptrs, which have a similar race and failure mode.
Instead, we change bpf_mem_alloc to allow registering destructor for
allocated elements, such that when they are returned to the allocator,
any special fields created while they were accessible to programs in the
mean time will be freed. If these values get reused, we do not free the
fields again before handing the element back. The special fields thus
may remain initialized while the map value sits in a free list.
When bpf_mem_alloc is retired in the future, a similar concept can be
introduced to kmalloc_nolock-backed kmem_cache, paired with the existing
idea of a constructor.
Note that the destructor registration happens in map_check_btf, after
the BTF record is populated and (at that point) avaiable for inspection
and duplication. Duplication is necessary since the freeing of embedded
bpf_mem_alloc can be decoupled from actual map lifetime due to logic
introduced to reduce the cost of rcu_barrier()s in mem alloc free path in 9f2c6e96c65e ("bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.").
As such, once all callbacks are done, we must also free the duplicated
record. To remove dependency on the bpf_map itself, also stash the key
size of the map to obtain value from htab_elem long after the map is
gone.
Linus Torvalds [Fri, 27 Feb 2026 21:40:30 +0000 (13:40 -0800)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"The diffstat is dominated by changes to our TLB invalidation errata
handling and the introduction of a new GCS selftest to catch one of
the issues that is fixed here relating to PROT_NONE mappings.
- Fix cpufreq warning due to attempting a cross-call with interrupts
masked when reading local AMU counters
- Fix DEBUG_PREEMPT warning from the delay loop when it tries to
access per-cpu errata workaround state for the virtual counter
- Re-jig and optimise our TLB invalidation errata workarounds in
preparation for more hardware brokenness
- Fix GCS mappings to interact properly with PROT_NONE and to avoid
corrupting the pte on CPUs with FEAT_LPA2
- Fix ioremap_prot() to extract only the memory attributes from the
user pte and ignore all the other 'prot' bits"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: topology: Fix false warning in counters_read_on_cpu() for same-CPU reads
arm64: Fix sampling the "stable" virtual counter in preemptible section
arm64: tlb: Optimize ARM64_WORKAROUND_REPEAT_TLBI
arm64: tlb: Allow XZR argument to TLBI ops
kselftest: arm64: Check access to GCS after mprotect(PROT_NONE)
arm64: gcs: Honour mprotect(PROT_NONE) on shadow stack mappings
arm64: gcs: Do not set PTE_SHARED on GCS mappings if FEAT_LPA2 is enabled
arm64: io: Extract user memory type in ioremap_prot()
arm64: io: Rename ioremap_prot() to __ioremap_prot()
Linus Torvalds [Fri, 27 Feb 2026 21:32:52 +0000 (13:32 -0800)]
Merge tag 'pci-v7.0-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fixes from Bjorn Helgaas:
- Update MAINTAINERS email address (Shawn Guo)
- Refresh cached Endpoint driver MSI Message Address to fix a v7.0
regression when kernel changes the address after firmware has
configured it (Niklas Cassel)
- Flush Endpoint MSI-X writes so they complete before the outbound ATU
entry is unmapped (Niklas Cassel)
- Correct the PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 value, which broke VMM use
of PCI capabilities (Bjorn Helgaas)
* tag 'pci-v7.0-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI: Correct PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 value
PCI: dwc: ep: Flush MSI-X write before unmapping its ATU entry
PCI: dwc: ep: Refresh MSI Message Address cache on change
MAINTAINERS: Update Shawn Guo's address for HiSilicon PCIe controller driver
Merge patch series "tighten nstree visibility checks"
Christian Brauner <brauner@kernel.org> says:
Listing various namespaces is currently only scoped on owning namespace.
We can make this more fine-grained so that we scope visibility even
tighter. To make it possible to change behavior restrict visibility for
now. This shouldn't be a big deal as there aren't actual large users out
there and paves the way to make this even cleaner in the future.
* patches from https://patch.msgid.link/20260226-work-visibility-fixes-v1-0-d2c2853313bd@kernel.org:
selftests: fix mntns iteration selftests
nstree: tighten permission checks for listing
nsfs: tighten permission checks for handle opening
nsfs: tighten permission checks for ns iteration ioctls
Even privileged services should not necessarily be able to see other
privileged service's namespaces so they can't leak information to each
other. Use may_see_all_namespaces() helper that centralizes this policy
until the nstree adapts.
nsfs: tighten permission checks for handle opening
Even privileged services should not necessarily be able to see other
privileged service's namespaces so they can't leak information to each
other. Use may_see_all_namespaces() helper that centralizes this policy
until the nstree adapts.
nsfs: tighten permission checks for ns iteration ioctls
Even privileged services should not necessarily be able to see other
privileged service's namespaces so they can't leak information to each
other. Use may_see_all_namespaces() helper that centralizes this policy
until the nstree adapts.
Jakub Kicinski [Fri, 27 Feb 2026 17:07:45 +0000 (09:07 -0800)]
io_uring/zcrx: don't set rx_page_size when not requested
The rx_buf_len parameter was recently added to the Rx zero-copy
implementation. The expectation is that when not set system will
maintain previous behavior and use the default buffer size (PAGE_SIZE).
This works correctly at the iouring level, but we don't preserve
the same "zero means default" semantics when registering the memory
provider on the netdev. mp_param.rx_page_size is unconditionally
set to PAGE_SIZE. This causes __net_mp_open_rxq() to check for
QCFG_RX_PAGE_SIZE support in the driver, and return -EOPNOTSUPP
for drivers that don't advertise it -- even though the user never
asked for large buffers.
Only set mp_param.rx_page_size when rx_buf_len was explicitly provided,
so that the default page size path works on all zcrx-capable drivers.
mlx5 and fbnic only support 4kB pages in the current release.
Fixes: 795663b4d160 ("io_uring/zcrx: implement large rx buffer support") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 27 Feb 2026 18:52:57 +0000 (10:52 -0800)]
Merge tag 'cxl-fixes-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl fixes from Dave Jiang:
- Fix incorrect usages of decoder flags
- Validate payload size before accessing contents
- Fix race condition when creating nvdimm objects
- Fix deadlock on attach failure
* tag 'cxl-fixes-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/region: Test CXL_DECODER_F_NORMALIZED_ADDRESSING as a bitmask
cxl: Test CXL_DECODER_F_LOCK as a bitmask
cxl/mbox: validate payload size before accessing contents in cxl_payload_from_user_allowed()
cxl: Fix race of nvdimm_bus object when creating nvdimm objects
cxl: Move devm_cxl_add_nvdimm_bridge() to cxl_pmem.ko
cxl/port: Hold port host lock during dport adding.
cxl/port: Introduce port_to_host() helper
cxl/memdev: fix deadlock in cxl_memdev_autoremove() on attach failure
Linus Torvalds [Fri, 27 Feb 2026 18:49:54 +0000 (10:49 -0800)]
Merge tag 'mmc-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC core:
- Avoid bitfield RMW for claim/retune flags
MMC host:
- dw_mmc-rockchip: Fix runtime PM support for internal phase support
- mmci: Fix device_node reference leak in of_get_dml_pipe_index()
- sdhci-brcmstb: Use correct register offset for V1 pin_sel restore"
* tag 'mmc-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: core: Avoid bitfield RMW for claim/retune flags
mmc: sdhci-brcmstb: use correct register offset for V1 pin_sel restore
mmc: dw_mmc-rockchip: Fix runtime PM support for internal phase support
mmc: mmci: Fix device_node reference leak in of_get_dml_pipe_index()
Linus Torvalds [Fri, 27 Feb 2026 18:42:02 +0000 (10:42 -0800)]
Merge tag 'block-7.0-20260227' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
"Two sets of fixes, one for drbd, and one for the zoned loop driver"
* tag 'block-7.0-20260227' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
zloop: check for spurious options passed to remove
zloop: advertise a volatile write cache
drbd: fix null-pointer dereference on local read error
drbd: Replace deprecated strcpy with strscpy
drbd: fix "LOGIC BUG" in drbd_al_begin_io_nonblock()
Linus Torvalds [Fri, 27 Feb 2026 18:39:11 +0000 (10:39 -0800)]
Merge tag 'io_uring-7.0-20260227' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull io_uring fixes from Jens Axboe:
"Just two minor patches in here, ensuring the use of READ_ONCE() for
sqe field reading is consistent across the codebase. There were two
missing cases, now they are covered too"
* tag 'io_uring-7.0-20260227' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/timeout: READ_ONCE sqe->addr
io_uring/cmd_net: use READ_ONCE() for ->addr3 read
Linus Torvalds [Fri, 27 Feb 2026 18:21:06 +0000 (10:21 -0800)]
Merge tag 'xfs-fixes-7.0-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
"Nothing reeeally stands out here: a few bug fixes, some refactoring to
easily fit the bug fixes, and a couple cosmetic changes"
* tag 'xfs-fixes-7.0-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: add static size checks for ioctl UABI
xfs: remove duplicate static size checks
xfs: Add comments for usages of some macros.
xfs: Update lazy counters in xfs_growfs_rt_bmblock()
xfs: Add a comment in xfs_log_sb()
xfs: Fix xfs_last_rt_bmblock()
xfs: don't report half-built inodes to fserror
xfs: don't report metadata inodes to fserror
xfs: fix potential pointer access race in xfs_healthmon_get
xfs: fix xfs_group release bug in xfs_dax_notify_dev_failure
xfs: fix xfs_group release bug in xfs_verify_report_losses
xfs: fix copy-paste error in previous fix
xfs: Fix error pointer dereference
xfs: remove metafile inodes from the active inode stat
xfs: cleanup inode counter stats
xfs: fix code alignment issues in xfs_ondisk.c
xfs: Replace &rtg->rtg_group with rtg_group()
xfs: Refactoring the nagcount and delta calculation
xfs: Replace ASSERT with XFS_IS_CORRUPT in xfs_rtcopy_summary()
Linus Torvalds [Fri, 27 Feb 2026 17:54:02 +0000 (09:54 -0800)]
Merge tag 'slab-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fixes from Vlastimil Babka:
- Fix for spurious page allocation warnings on sheaf refill (Harry Yoo)
- Fix for CONFIG_MEM_ALLOC_PROFILING_DEBUG warnings (Suren
Baghdasaryan)
- Fix for kernel-doc warning on ksize() (Sanjay Chitroda)
- Fix to avoid setting slab->stride later than on slab allocation.
Doesn't yet fix the reports from powerpc; debugging is making
progress (Harry Yoo)
* tag 'slab-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slab: initialize slab->stride early to avoid memory ordering issues
mm/slub: drop duplicate kernel-doc for ksize()
mm/slab: mark alloc tags empty for sheaves allocated with __GFP_NO_OBJ_EXT
mm/slab: pass __GFP_NOWARN to refill_sheaf() if fallback is available
Linus Torvalds [Fri, 27 Feb 2026 17:42:17 +0000 (09:42 -0800)]
Merge tag 'gpio-fixes-for-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix memory leaks in shared GPIO management
- normalize the return values of gpio_chip::get() in GPIO core on
behalf of drivers that return invalid values (this is done because
adding stricter sanitization of callback retvals led to breakages in
existing users, we'll revert that once all are fixed)
* tag 'gpio-fixes-for-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: normalize the return value of gc->get() on behalf of buggy drivers
gpio: shared: fix memory leaks
Linus Torvalds [Fri, 27 Feb 2026 17:34:02 +0000 (09:34 -0800)]
Merge tag 'sound-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A bunch of small device-specific fixes. Mostly quirks and fix-ups for
USB- and HD-audio at this time, in addition to a couple of ASoC AMD
and Cirrus fixes"
* tag 'sound-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (24 commits)
ASoC: SDCA: Fix comments for sdca_irq_request()
ALSA: us144mkii: Drop kernel-doc markers
ALSA: usb: qcom: Correct parameter comment for uaudio_transfer_buffer_setup()
ALSA: usb-audio: Drop superfluous kernel-doc markers
ALSA: hda: cs35l56: Remove unnecessary struct cs_dsp_client_ops
ALSA: hda: cs35l56: Fix signedness error in cs35l56_hda_posture_put()
ALSA: usb-audio: Use correct version for UAC3 header validation
ALSA: hda/realtek: add quirk for Acer Nitro ANV15-51
ALSA: hda/intel: increase default bdl_pos_adj for Nvidia controllers
ALSA: usb-audio: Use inclusive terms
ALSA: usb-audio: Avoid implicit feedback mode on DIYINHK USB Audio 2.0
ALSA: usb-audio: Check max frame size for implicit feedback mode, too
ALSA: usb-audio: Cap the packet size pre-calculations
ASoC: amd: yc: Add ASUS EXPERTBOOK BM1503CDA to quirk table
ASoC: cs42l43: Report insert for exotic peripherals
ALSA: usb-audio: Skip clock selector for Focusrite devices
ALSA: usb-audio: Add QUIRK_FLAG_SKIP_IFACE_SETUP
ALSA: usb-audio: Remove VALIDATE_RATES quirk for Focusrite devices
ALSA: usb-audio: Improve Focusrite sample rate filtering
ALSA: hda/realtek: add quirk for Samsung Galaxy Book Flex (NT950QCT-A38A)
...
Maíra Canal [Thu, 12 Feb 2026 14:49:44 +0000 (11:49 -0300)]
pmdomain: bcm: bcm2835-power: Fix broken reset status read
bcm2835_reset_status() has a misplaced parenthesis on every PM_READ()
call. Since PM_READ(reg) expands to readl(power->base + (reg)), the
expression:
PM_READ(PM_GRAFX & PM_V3DRSTN)
computes the bitwise AND of the register offset PM_GRAFX with the
bitmask PM_V3DRSTN before using the result as a register offset, reading
from the wrong MMIO address instead of the intended PM_GRAFX register.
The same issue affects the PM_IMAGE cases.
Fix by moving the closing parenthesis so PM_READ() receives only the
register offset, and the bitmask is applied to the value returned by
the read.
Fixes: 670c672608a1 ("soc: bcm: bcm2835-pm: Add support for power domains under a new binding.") Signed-off-by: Maíra Canal <mcanal@igalia.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Stefan Wahren <wahrenst@gmx.net> Cc: stable@vger.kernel.org Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Linus Torvalds [Fri, 27 Feb 2026 16:56:07 +0000 (08:56 -0800)]
Merge tag 'drm-fixes-2026-02-27' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Regular fixes pull, amdxdna and amdgpu are the main ones, with a
couple of intel fixes, then a scattering of fixes across drivers,
nothing too major.
i915/display:
- Fix Panel Replay stuck with X during mode transitions on Panther
Lake
vmwgfx:
- A reference count and error handling fix"
* tag 'drm-fixes-2026-02-27' of https://gitlab.freedesktop.org/drm/kernel: (39 commits)
drm/amd: Disable MES LR compute W/A
drm/amdgpu: Fix error handling in slot reset
drm/amdgpu/vcn5: Add SMU dpm interface type
drm/amdgpu: Fix locking bugs in error paths
drm/amdgpu: Unlock a mutex before destroying it
drm/amd/display: Use GFP_ATOMIC in dc_create_stream_for_sink
drm/amdgpu: add upper bound check on user inputs in wait ioctl
drm/amdgpu: add upper bound check on user inputs in signal ioctl
drm/amdgpu/userq: Do not allow userspace to trivially triger kernel warnings
drm/amdgpu/userq: Fix reference leak in amdgpu_userq_wait_ioctl
accel/amdxdna: Use a different name for latest firmware
drm/client: Do not destroy NULL modes
drm/gpusvm: Fix drm_gpusvm_pages_valid_unlocked() kernel-doc
drm/xe/sync: Fix user fence leak on alloc failure
drm/xe/sync: Cleanup partially initialized sync on parse failure
drm/xe/wa: Steer RMW of MCR registers while building default LRC
accel/amdxdna: Validate command buffer payload count
accel/amdxdna: Prevent ubuf size overflow
accel/amdxdna: Fix out-of-bounds memset in command slot handling
accel/amdxdna: Fix command hang on suspended hardware context
...
Bjorn Helgaas [Fri, 27 Feb 2026 12:10:08 +0000 (06:10 -0600)]
PCI: Correct PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 value
fb82437fdd8c ("PCI: Change capability register offsets to hex") incorrectly
converted the PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 value from decimal 52 to hex
0x32:
-#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 52 /* v2 endpoints with link end here */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 0x32 /* end of v2 EPs w/ link */
This broke PCI capabilities in a VMM because subsequent ones weren't
DWORD-aligned.
Change PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 to the correct value of 0x34.
fb82437fdd8c was from Baruch Siach <baruch@tkos.co.il>, but this was not
Baruch's fault; it's a mistake I made when applying the patch.
Fixes: fb82437fdd8c ("PCI: Change capability register offsets to hex") Reported-by: David Woodhouse <dwmw2@infradead.org> Closes: https://lore.kernel.org/all/3ae392a0158e9d9ab09a1d42150429dd8ca42791.camel@infradead.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Krzysztof Wilczyński <kwilczynski@kernel.org>
Thorsten Blum [Thu, 26 Feb 2026 22:15:22 +0000 (23:15 +0100)]
smb: client: Use snprintf in cifs_set_cifscreds
Replace unbounded sprintf() calls with the safer snprintf(). Avoid using
magic numbers and use strlen() to calculate the key descriptor buffer
size. Save the size in a local variable and reuse it for the bounded
snprintf() calls. Remove CIFSCREDS_DESC_SIZE.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Steve French <stfrench@microsoft.com>
Keith Busch [Wed, 25 Feb 2026 19:38:05 +0000 (11:38 -0800)]
nvme-multipath: fix leak on try_module_get failure
We need to fall back to the synchronous removal if we can't get a
reference on the module needed for the deferred removal.
Fixes: 62188639ec16 ("nvme-multipath: introduce delayed removal of the multipath head node") Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Harry Yoo [Mon, 23 Feb 2026 07:58:09 +0000 (16:58 +0900)]
mm/slab: initialize slab->stride early to avoid memory ordering issues
When alloc_slab_obj_exts() is called later (instead of during slab
allocation and initialization), slab->stride and slab->obj_exts are
updated after the slab is already accessible by multiple CPUs.
The current implementation does not enforce memory ordering between
slab->stride and slab->obj_exts. For correctness, slab->stride must be
visible before slab->obj_exts. Otherwise, concurrent readers may observe
slab->obj_exts as non-zero while stride is still stale.
With stale slab->stride, slab_obj_ext() could return the wrong obj_ext.
This could cause two problems:
- obj_cgroup_put() is called on the wrong objcg, leading to
a use-after-free due to incorrect reference counting [1] by
decrementing the reference count more than it was incremented.
- refill_obj_stock() is called on the wrong objcg, leading to
a page_counter overflow [2] by uncharging more memory than charged.
Fix this by unconditionally initializing slab->stride in
alloc_slab_obj_exts_early(), before the need_slab_obj_exts() check.
In the case of SLAB_OBJ_EXT_IN_OBJ, it is overridden in the function.
This ensures updates to slab->stride become visible before the slab
can be accessed by other CPUs via the per-node partial slab list
(protected by spinlock with acquire/release semantics).
Thanks to Shakeel Butt for pointing out this issue [3].
[vbabka@kernel.org: the bug reports [1] and [2] are not yet fully fixed,
with investigation ongoing, but it is nevertheless a step in the right
direction to only set stride once after allocating the slab and not
change it later ]
Tvrtko Ursulin [Fri, 27 Feb 2026 12:49:01 +0000 (12:49 +0000)]
drm/ttm: Fix ttm_pool_beneficial_order() return type
Fix a nasty copy and paste bug, where the incorrect boolean return type of
the ttm_pool_beneficial_order() helper had a consequence of avoiding
direct reclaim too eagerly for drivers which use this feature (currently
amdgpu).
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Fixes: 7e9c548d3709 ("drm/ttm: Allow drivers to specify maximum beneficial TTM pool size") Cc: Christian König <christian.koenig@amd.com> Cc: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Cc: dri-devel@lists.freedesktop.org Cc: <stable@vger.kernel.org> # v6.19+ Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net> Link: https://lore.kernel.org/r/20260227124901.3177-1-tvrtko.ursulin@igalia.com
Randy Dunlap [Thu, 26 Feb 2026 05:12:29 +0000 (21:12 -0800)]
platform_data/mlxreg: mlxreg.h: fix all kernel-doc warnings
Use the correct kernel-doc format & notation to eliminate
kernel-doc warnings:
Warning: include/linux/platform_data/mlxreg.h:24 Enum value
'MLX_WDT_TYPE1' not described in enum 'mlxreg_wdt_type'
Warning: include/linux/platform_data/mlxreg.h:24 Enum value
'MLX_WDT_TYPE2' not described in enum 'mlxreg_wdt_type'
Warning: include/linux/platform_data/mlxreg.h:24 Enum value
'MLX_WDT_TYPE3' not described in enum 'mlxreg_wdt_type'
Warning: include/linux/platform_data/mlxreg.h:37 bad line:
PHYs ready / unready state;
Warning: include/linux/platform_data/mlxreg.h:153 struct member 'np'
not described in 'mlxreg_core_data'
Warning: include/linux/platform_data/mlxreg.h:153 struct member 'hpdev'
not described in 'mlxreg_core_data'
platform/x86: hp-bioscfg: Support allocations of larger data
Some systems have much larger amounts of enumeration attributes
than have been previously encountered. This can lead to page allocation
failures when using kcalloc(). Switch over to using kvcalloc() to
allow larger allocations.
Fixes: 6b2770bfd6f92 ("platform/x86: hp-bioscfg: enum-attributes") Cc: stable@vger.kernel.org Reported-by: Paul Kerry <p.kerry@sheffield.ac.uk> Tested-by: Paul Kerry <p.kerry@sheffield.ac.uk> Closes: https://bugs.debian.org/1127612 Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://patch.msgid.link/20260225210646.59381-1-mario.limonciello@amd.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
platform/x86: oxpec: Add support for Aokzoe A2 Pro
Aokzoe A2 Pro is an older device that the oxpec driver is missing the
quirk for. It has the same behavior as the AOKZOE A1 devices. Add a
quirk for it to the oxpec driver.
Ariel Silver [Sat, 21 Feb 2026 14:26:00 +0000 (15:26 +0100)]
media: dvb-net: fix OOB access in ULE extension header tables
The ule_mandatory_ext_handlers[] and ule_optional_ext_handlers[] tables
in handle_one_ule_extension() are declared with 255 elements (valid
indices 0-254), but the index htype is derived from network-controlled
data as (ule_sndu_type & 0x00FF), giving a range of 0-255. When
htype equals 255, an out-of-bounds read occurs on the function pointer
table, and the OOB value may be called as a function pointer.
Add a bounds check on htype against the array size before either table
is accessed. Out-of-range values now cause the SNDU to be discarded.
Chintan Vankar [Tue, 24 Feb 2026 18:13:59 +0000 (23:43 +0530)]
net: ethernet: ti: am65-cpsw-nuss/cpsw-ale: Fix multicast entry handling in ALE table
In the current implementation, flushing multicast entries in MAC mode
incorrectly deletes entries for all ports instead of only the target port,
disrupting multicast traffic on other ports. The cause is adding multicast
entries by setting only host port bit, and not setting the MAC port bits.
Fix this by setting the MAC port's bit in the port mask while adding the
multicast entry. Also fix the flush logic to preserve the host port bit
during removal of MAC port and free ALE entries when mask contains only
host port.
Fixes: 5c50a856d550 ("drivers: net: ethernet: cpsw: add multicast address to ALE table") Signed-off-by: Chintan Vankar <c-vankar@ti.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260224181359.2055322-1-c-vankar@ti.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
bridge: Check relevant options in VLAN range grouping
The br_vlan_opts_eq_range() function determines if consecutive VLANs can
be grouped together in a range for compact netlink notifications. It
currently checks state, tunnel info, and multicast router configuration,
but misses two categories of per-VLAN options that affect the output:
1. User-visible priv_flags (neigh_suppress, mcast_enabled)
2. Port multicast context options (mcast_max_groups, mcast_n_groups)
When VLANs have different settings for these options, they are incorrectly
grouped into ranges, causing netlink notifications to report only one
VLAN's settings for the entire range.
Fix by checking priv_flags equality, but only for flags that affect netlink
output (BR_VLFLAG_NEIGH_SUPPRESS_ENABLED and BR_VLFLAG_MCAST_ENABLED),
and comparing multicast context options (mcast_max_groups, mcast_n_groups).
Add a test with four test cases for each option, to ensure that VLANs with
different values are not grouped into ranges and VLANs with matching
values are properly grouped together.
====================
Danielle Ratson [Wed, 25 Feb 2026 14:39:56 +0000 (16:39 +0200)]
selftests: net: Add bridge VLAN range grouping tests
Add a new test file bridge_vlan_dump.sh with four test cases that verify
VLANs with different per-VLAN options are not incorrectly grouped into
ranges in the dump output.
The tests verify the kernel's br_vlan_opts_eq_range() function correctly
prevents VLAN range grouping when neigh_suppress, mcast_max_groups,
mcast_n_groups, or mcast_enabled options differ.
Each test verifies that VLANs with different option values appear as
individual entries rather than ranges, and that VLANs with matching
values are properly grouped together.
Example output:
$ ./bridge_vlan_dump.sh
TEST: VLAN range grouping with neigh_suppress [ OK ]
TEST: VLAN range grouping with mcast_max_groups [ OK ]
TEST: VLAN range grouping with mcast_n_groups [ OK ]
TEST: VLAN range grouping with mcast_enabled [ OK ]
Danielle Ratson [Wed, 25 Feb 2026 14:39:55 +0000 (16:39 +0200)]
bridge: Check relevant per-VLAN options in VLAN range grouping
The br_vlan_opts_eq_range() function determines if consecutive VLANs can
be grouped together in a range for compact netlink notifications. It
currently checks state, tunnel info, and multicast router configuration,
but misses two categories of per-VLAN options that affect the output:
1. User-visible priv_flags (neigh_suppress, mcast_enabled)
2. Port multicast context (mcast_max_groups, mcast_n_groups)
When VLANs have different settings for these options, they are incorrectly
grouped into ranges, causing netlink notifications to report only one
VLAN's settings for the entire range.
Fix by checking priv_flags equality, but only for flags that affect netlink
output (BR_VLFLAG_NEIGH_SUPPRESS_ENABLED and BR_VLFLAG_MCAST_ENABLED),
and comparing multicast context (mcast_max_groups and mcast_n_groups).
Example showing the bugs before the fix:
$ bridge vlan set vid 10 dev dummy1 neigh_suppress on
$ bridge vlan set vid 11 dev dummy1 neigh_suppress off
$ bridge -d vlan show dev dummy1
port vlan-id
dummy1 10-11
... neigh_suppress on
$ bridge vlan set vid 10 dev dummy1 mcast_max_groups 100
$ bridge vlan set vid 11 dev dummy1 mcast_max_groups 200
$ bridge -d vlan show dev dummy1
port vlan-id
dummy1 10-11
... mcast_max_groups 100
After the fix, VLANs 10 and 11 are shown as separate entries with their
correct individual settings.
Fixes: a1aee20d5db2 ("net: bridge: Add netlink knobs for number / maximum MDB entries") Fixes: 83f6d600796c ("bridge: vlan: Allow setting VLAN neighbor suppression state") Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20260225143956.3995415-2-danieller@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Davide Caratti [Tue, 24 Feb 2026 20:28:32 +0000 (21:28 +0100)]
net/sched: ets: fix divide by zero in the offload path
Offloading ETS requires computing each class' WRR weight: this is done by
averaging over the sums of quanta as 'q_sum' and 'q_psum'. Using unsigned
int, the same integer size as the individual DRR quanta, can overflow and
even cause division by zero, like it happened in the following splat:
Thorsten Blum [Thu, 26 Feb 2026 21:28:45 +0000 (22:28 +0100)]
smb: client: Don't log plaintext credentials in cifs_set_cifscreds
When debug logging is enabled, cifs_set_cifscreds() logs the key
payload and exposes the plaintext username and password. Remove the
debug log to avoid exposing credentials.
Fixes: 8a8798a5ff90 ("cifs: fetch credentials out of keyring for non-krb5 auth multiuser mounts") Cc: stable@vger.kernel.org Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Steve French <stfrench@microsoft.com>
Paulo Alcantara [Thu, 26 Feb 2026 00:34:55 +0000 (21:34 -0300)]
smb: client: fix broken multichannel with krb5+signing
When mounting a share with 'multichannel,max_channels=n,sec=krb5i',
the client was duplicating signing key for all secondary channels,
thus making the server fail all commands sent from secondary channels
due to bad signatures.
Every channel has its own signing key, so when establishing a new
channel with krb5 auth, make sure to use the new session key as the
derived key to generate channel's signing key in SMB2_auth_kerberos().
Paulo Alcantara [Mon, 23 Feb 2026 16:34:35 +0000 (13:34 -0300)]
smb: client: use atomic_t for mnt_cifs_flags
Use atomic_t for cifs_sb_info::mnt_cifs_flags as it's currently
accessed locklessly and may be changed concurrently in mount/remount
and reconnect paths.
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Reviewed-by: David Howells <dhowells@redhat.com> Cc: linux-cifs@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
Linus Torvalds [Fri, 27 Feb 2026 00:01:18 +0000 (16:01 -0800)]
Merge tag 'v7.0-rc1-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- auth security improvement
- fix potential buffer overflow in smbdirect negotiation
* tag 'v7.0-rc1-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix signededness bug in smb_direct_prepare_negotiation()
ksmbd: Compare MACs in constant time
Linus Torvalds [Thu, 26 Feb 2026 23:27:41 +0000 (15:27 -0800)]
Merge tag 'mm-hotfixes-stable-2026-02-26-14-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"12 hotfixes. 7 are cc:stable. 8 are for MM.
All are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-02-26-14-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: update Yosry Ahmed's email address
mailmap: add entry for Daniele Alessandrelli
mm: fix NULL NODE_DATA dereference for memoryless nodes on boot
mm/tracing: rss_stat: ensure curr is false from kthread context
mm/kfence: fix KASAN hardware tag faults during late enablement
mm/damon/core: disallow non-power of two min_region_sz
Squashfs: check metadata block offset is within range
MAINTAINERS, mailmap: update e-mail address for Vlastimil Babka
liveupdate: luo_file: remember retrieve() status
mm: thp: deny THP for files on anonymous inodes
mm: change vma_alloc_folio_noprof() macro to inline function
mm/kfence: disable KFENCE upon KASAN HW tags enablement
Linus Torvalds [Thu, 26 Feb 2026 23:19:16 +0000 (15:19 -0800)]
Merge tag 'dma-mapping-7.0-2026-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
"Two DMA-mapping fixes for the recently merged API rework (Jiri Pirko
and Stian Halseth)"
* tag 'dma-mapping-7.0-2026-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
sparc: Fix page alignment in dma mapping
dma-mapping: avoid random addr value print out on error path
Alain Volmat [Tue, 24 Feb 2026 15:09:22 +0000 (16:09 +0100)]
spi: stm32: fix missing pointer assignment in case of dma chaining
Commit c4f2c05ab029 ("spi: stm32: fix pointer-to-pointer variables usage")
introduced a regression since dma descriptors generated as part of the
stm32_spi_prepare_rx_dma_mdma_chaining function are not well propagated
to the caller function, leading to mdma-dma chaining being no more
functional.
Linus Torvalds [Thu, 26 Feb 2026 22:45:29 +0000 (14:45 -0800)]
Merge tag 'acpi-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"New platform quirks for two systems:
- Add a quirk for Lenovo G70-35 to save the ACPI NVS memory on system
suspend (Piotr Mazek)
- Add a DMI quirk for Acer Aspire One D255 to work around a backlight
issue by returning false to _OSI("Windows 2009") (Sofia Schneider)"
* tag 'acpi-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: OSI: Add DMI quirk for Acer Aspire One D255
ACPI: PM: Save NVS memory on Lenovo G70-35
Andy Shevchenko [Mon, 23 Feb 2026 18:06:51 +0000 (19:06 +0100)]
pinctrl: cy8c95x0: Don't miss reading the last bank registers
When code had been changed to use for_each_set_clump8(), it mistakenly
switched from chip->nport to chip->tpin since the cy8c9540 and cy8c9560
have a 4-pin gap. This, in particular, led to the missed read of
the last bank interrupt status register and hence missing interrupts
on those pins. Restore the upper limit in for_each_set_clump8() to take
into consideration that gap.
Fixes: 83e29a7a1fdf ("pinctrl: cy8c95x0; Switch to use for_each_set_clump8()") Cc: stable@vger.kernel.org Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Linus Walleij <linusw@kernel.org>
Linus Torvalds [Thu, 26 Feb 2026 22:40:21 +0000 (14:40 -0800)]
Merge tag 'pm-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix two intel_pstate driver issues causing it to crash on sysfs
attribute accesses when some CPUs in the system are offline, finalize
changes related to turning pm_runtime_put() into a void function, and
update Daniel Lezcano's contact information:
- Fix two issues in the intel_pstate driver causing it to crash when
its sysfs interface is used on a system with some offline CPUs
(David Arcari, Srinivas Pandruvada)
- Update the last user of the pm_runtime_put() return value to
discard it and turn pm_runtime_put() into a void function (Rafael
Wysocki)
- Update Daniel Lezcano's contact information in MAINTAINERS and
.mailmap (Daniel Lezcano)"
* tag 'pm-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
MAINTAINERS: Update contact with the kernel.org address
cpufreq: intel_pstate: Fix crash during turbo disable
cpufreq: intel_pstate: Fix NULL pointer dereference in update_cpu_qos_request()
PM: runtime: Change pm_runtime_put() return type to void
pmdomain: imx: gpcv2: Discard pm_runtime_put() return value
Justin Tee [Thu, 4 Dec 2025 20:26:13 +0000 (12:26 -0800)]
nvmet-fcloop: Check remoteport port_state before calling done callback
In nvme_fc_handle_ls_rqst_work, the lsrsp->done callback is only set when
remoteport->port_state is FC_OBJSTATE_ONLINE. Otherwise, the
nvme_fc_xmt_ls_rsp's LLDD call to lport->ops->xmt_ls_rsp is expected to
fail and the nvme-fc transport layer itself will directly call
nvme_fc_xmt_ls_rsp_free instead of relying on LLDD's done callback to free
the lsrsp resources.
Update the fcloop_t2h_xmt_ls_rsp routine to check remoteport->port_state.
If online, then lsrsp->done callback will free the lsrsp. Else, return
-ENODEV to signal the nvme-fc transport to handle freeing lsrsp.
Cc: Ewan D. Milne <emilne@redhat.com> Tested-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Closes: https://lore.kernel.org/linux-nvme/21255200-a271-4fa0-b099-97755c8acd4c@work/ Fixes: 10c165af35d2 ("nvmet-fcloop: call done callback even when remote port is gone") Signed-off-by: Justin Tee <justintee8345@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Linus Torvalds [Thu, 26 Feb 2026 22:34:21 +0000 (14:34 -0800)]
Merge tag 'for-linus-7.0-1' of https://github.com/cminyard/linux-ipmi
Pull IPMI driver fixes from Corey Minyard:
"This mostly revolves around getting the driver to behave when the IPMI
device misbehaves. Past attempts have not worked very well because I
didn't have hardware I could make do this, and AI was fairly useless
for help on this.
So I modified qemu and my test suite so I could reproduce a
misbehaving IPMI device, and with that I was able to fix the issues"
* tag 'for-linus-7.0-1' of https://github.com/cminyard/linux-ipmi:
ipmi:si: Fix check for a misbehaving BMC
ipmi:msghandler: Handle error returns from the SMI sender
ipmi:si: Don't block module unload if the BMC is messed up
ipmi:si: Use a long timeout when the BMC is misbehaving
ipmi:si: Handle waiting messages when BMC failure detected
ipmi:ls2k: Make ipmi_ls2k_platform_driver static
ipmi: ipmb: initialise event handler read bytes
ipmi: Consolidate the run to completion checking for xmit msgs lock
ipmi: Fix use-after-free and list corruption on sender error
David Carlier [Thu, 26 Feb 2026 12:45:17 +0000 (12:45 +0000)]
sched_ext: Fix SCX_EFLAG_INITIALIZED being a no-op flag
SCX_EFLAG_INITIALIZED is the sole member of enum scx_exit_flags with no
explicit value, so the compiler assigns it 0. This makes the bitwise OR
in scx_ops_init() a no-op:
Thomas Weißschuh [Thu, 26 Feb 2026 07:41:48 +0000 (08:41 +0100)]
kbuild: install-extmod-build: Package resolve_btfids if necessary
When CONFIG_DEBUG_INFO_BTF_MODULES is enabled and vmlinux is available,
Makefile.modfinal and gen-btf.sh will try to use resolve_btfids on the
module .ko. install-extmod-build currently does not package
resolve_btfids, so that step fails.
- Fix two issues in the intel_pstate driver causing it to crash when
its sysfs interface is used on a system with some offline CPUs (David
Arcari, Srinivas Pandruvada)
- Update the last user of the pm_runtime_put() return value to discard
it and turn pm_runtime_put() into a void function (Rafael Wysocki)
* pm-cpufreq:
cpufreq: intel_pstate: Fix crash during turbo disable
cpufreq: intel_pstate: Fix NULL pointer dereference in update_cpu_qos_request()
* pm-runtime:
PM: runtime: Change pm_runtime_put() return type to void
pmdomain: imx: gpcv2: Discard pm_runtime_put() return value
Dave Airlie [Thu, 26 Feb 2026 19:49:05 +0000 (05:49 +1000)]
Merge tag 'drm-misc-fixes-2026-02-26' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
Several fixes for:
- amdxdna: Fix for a deadlock, a NULL pointer dereference, a suspend
failure, a hang, an out-of-bounds access, a buffer overflow, input
sanitization and other minor fixes.
- dw-dp: An error handling fix
- ethosu: A binary shift overflow fix
- imx: An error handling fix
- logicvc: A dt node reference leak fix
- nouveau: A WARN_ON removal
- samsung-dsim: A memory leak fix
- sharp-memory: A NULL pointer dereference fix
- vmgfx: A reference count and error handling fix
T.J. Mercier [Wed, 25 Feb 2026 00:33:48 +0000 (16:33 -0800)]
selftests/bpf: Fix OOB read in dmabuf_collector
Dmabuf name allocations can be less than DMA_BUF_NAME_LEN characters,
but bpf_probe_read_kernel always tries to read exactly that many bytes.
If a name is less than DMA_BUF_NAME_LEN characters,
bpf_probe_read_kernel will read past the end. bpf_probe_read_kernel_str
stops at the first NUL terminator so use it instead, like
iter_dmabuf_for_each already does.
Fixes: ae5d2c59ecd7 ("selftests/bpf: Add test for dmabuf_iter") Reported-by: Jerome Lee <jaewookl@quicinc.com> Signed-off-by: T.J. Mercier <tjmercier@google.com> Link: https://lore.kernel.org/r/20260225003349.113746-1-tjmercier@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Direct leak of 32 byte(s) in 1 object(s) allocated from:
#0 0x7f1ff3cfa340 in calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:77
#1 0x5610c15bb520 in bpf_program_attach_fd /codebuild/output/src685977285/src/actions-runner/_work/vmtest/vmtest/src/tools/lib/bpf/libbpf.c:13164
#2 0x5610c15bb740 in bpf_program__attach_xdp /codebuild/output/src685977285/src/actions-runner/_work/vmtest/vmtest/src/tools/lib/bpf/libbpf.c:13204
#3 0x5610c14f91d3 in test_xdp_flowtable /codebuild/output/src685977285/src/actions-runner/_work/vmtest/vmtest/src/tools/testing/selftests/bpf/prog_tests/xdp_flowtable.c:138
#4 0x5610c1533566 in run_one_test /codebuild/output/src685977285/src/actions-runner/_work/vmtest/vmtest/src/tools/testing/selftests/bpf/test_progs.c:1406
#5 0x5610c1537fb0 in main /codebuild/output/src685977285/src/actions-runner/_work/vmtest/vmtest/src/tools/testing/selftests/bpf/test_progs.c:2097
#6 0x7f1ff25df1c9 (/lib/x86_64-linux-gnu/libc.so.6+0x2a1c9) (BuildId: 8e9fd827446c24067541ac5390e6f527fb5947bb)
#7 0x7f1ff25df28a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2a28a) (BuildId: 8e9fd827446c24067541ac5390e6f527fb5947bb)
#8 0x5610c0bd3180 in _start (/tmp/work/vmtest/vmtest/selftests/bpf/test_progs+0x593180) (BuildId: cdf9f103f42307dc0a2cd6cfc8afcbc1366cf8bd)
Fix by properly destroying bpf_link on exit in xdp_flowtable test.
Kohei Enju [Wed, 25 Feb 2026 05:34:44 +0000 (05:34 +0000)]
bpf: Fix stack-out-of-bounds write in devmap
get_upper_ifindexes() iterates over all upper devices and writes their
indices into an array without checking bounds.
Also the callers assume that the max number of upper devices is
MAX_NEST_DEV and allocate excluded_devices[1+MAX_NEST_DEV] on the stack,
but that assumption is not correct and the number of upper devices could
be larger than MAX_NEST_DEV (e.g., many macvlans), causing a
stack-out-of-bounds write.
Add a max parameter to get_upper_ifindexes() to avoid the issue.
When there are too many upper devices, return -EOVERFLOW and abort the
redirect.
To reproduce, create more than MAX_NEST_DEV(8) macvlans on a device with
an XDP program attached using BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS.
Then send a packet to the device to trigger the XDP redirect path.
Fuad Tabba [Thu, 26 Feb 2026 07:55:25 +0000 (07:55 +0000)]
bpf, arm64: Force 8-byte alignment for JIT buffer to prevent atomic tearing
struct bpf_plt contains a u64 target field. Currently, the BPF JIT
allocator requests an alignment of 4 bytes (sizeof(u32)) for the JIT
buffer.
Because the base address of the JIT buffer can be 4-byte aligned (e.g.,
ending in 0x4 or 0xc), the relative padding logic in build_plt() fails
to ensure that target lands on an 8-byte boundary.
This leads to two issues:
1. UBSAN reports misaligned-access warnings when dereferencing the
structure.
2. More critically, target is updated concurrently via WRITE_ONCE() in
bpf_arch_text_poke() while the JIT'd code executes ldr. On arm64,
64-bit loads/stores are only guaranteed to be single-copy atomic if
they are 64-bit aligned. A misaligned target risks a torn read,
causing the JIT to jump to a corrupted address.
Fix this by increasing the allocation alignment requirement to 8 bytes
(sizeof(u64)) in bpf_jit_binary_pack_alloc(). This anchors the base of
the JIT buffer to an 8-byte boundary, allowing the relative padding math
in build_plt() to correctly align the target field.
genksyms: Fix parsing a declarator with a preceding attribute
After commit 07919126ecfc ("netfilter: annotate NAT helper hook pointers
with __rcu"), genksyms fails to parse the __rcu annotation when building
with CONFIG_DEBUG_INFO_BTF=y, CONFIG_PAHOLE_HAS_BTF_TAG=y, and a version
of clang that supports btf_type_tag.
$ cat kernel/configs/repro.config
CONFIG_BPF_SYSCALL=y
CONFIG_MODVERSIONS=y
# CONFIG_DEBUG_INFO_NONE is not set
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
CONFIG_DEBUG_INFO_BTF=y
$ make -skj"$(nproc)" ARCH=x86_64 LLVM=1 mrproper defconfig repro.config all
WARNING: modpost: EXPORT symbol "nf_nat_ftp_hook" [vmlinux] version generation failed, symbol will not be versioned.
...
WARNING: modpost: EXPORT symbol "nf_nat_irc_hook" [vmlinux] version generation failed, symbol will not be versioned.
...
genksyms falls over parsing the __rcu attribute in the declarator:
# Kernel reproducer
$ make -skj"$(nproc)" ARCH=x86_64 KCFLAGS=-D__GENKSYMS__ LLVM=1 net/netfilter/nf_conntrack_ftp.i
Commit 3e86e4d74c04 ("kbuild: keep .modinfo section in
vmlinux.unstripped") added .modinfo to ELF_DETAILS while removing it
from COMMON_DISCARDS, as it was needed in vmlinux.unstripped and
ELF_DETAILS was present in all architecture specific vmlinux linker
scripts. While this shuffle is fine for vmlinux, ELF_DETAILS and
COMMON_DISCARDS may be used by other linker scripts, such as the s390
and x86 compressed boot images, which may not expect to have a .modinfo
section. In certain circumstances, this could result in a bootloader
failing to load the compressed kernel [1].
Commit ddc6cbef3ef1 ("s390/boot/vmlinux.lds.S: Ensure bzImage ends with
SecureBoot trailer") recently addressed this for the s390 bzImage but
the same bug remains for arm, parisc, and x86. The presence of .modinfo
in the x86 bzImage was the root cause of the issue worked around with
commit d50f21091358 ("kbuild: align modinfo section for Secureboot
Authenticode EDK2 compat"). misc.c in arch/x86/boot/compressed includes
lib/decompress_unzstd.c, which in turn includes lib/xxhash.c and its
MODULE_LICENSE / MODULE_DESCRIPTION macros due to the STATIC definition.
Split .modinfo out from ELF_DETAILS into its own macro and handle it in
all vmlinux linker scripts. Discard .modinfo in the places where it was
previously being discarded from being in COMMON_DISCARDS, as it has
never been necessary in those uses.
Cc: stable@vger.kernel.org Fixes: 3e86e4d74c04 ("kbuild: keep .modinfo section in vmlinux.unstripped") Reported-by: Ed W <lists@wildgooses.com> Closes: https://lore.kernel.org/587f25e0-a80e-46a5-9f01-87cb40cfa377@wildgooses.com/ [1] Tested-by: Ed W <lists@wildgooses.com> # x86_64 Link: https://patch.msgid.link/20260225-separate-modinfo-from-elf-details-v1-1-387ced6baf4b@kernel.org Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Sumit Gupta [Thu, 26 Feb 2026 11:59:11 +0000 (17:29 +0530)]
arm64: topology: Fix false warning in counters_read_on_cpu() for same-CPU reads
The counters_read_on_cpu() function warns when called with IRQs
disabled to prevent deadlock in smp_call_function_single(). However,
this warning is spurious when reading counters on the current CPU,
since no IPI is needed for same CPU reads.
Commit 12eb8f4fff24 ("cpufreq: CPPC: Update FIE arch_freq_scale in
ticks for non-PCC regs") changed the CPPC Frequency Invariance Engine
to read AMU counters directly from the scheduler tick for non-PCC
register spaces (like FFH), instead of deferring to a kthread. This
means counters_read_on_cpu() is now called with IRQs disabled from the
tick handler, triggering the warning.
Fix this by restructuring the logic: when IRQs are disabled (tick
context), call the function directly for same-CPU reads. Otherwise
use smp_call_function_single().
Fixes: 997c021abc6e ("cpufreq: CPPC: Update FIE arch_freq_scale in ticks for non-PCC regs") Suggested-by: Will Deacon <will@kernel.org> Signed-off-by: Sumit Gupta <sumitg@nvidia.com> Signed-off-by: Will Deacon <will@kernel.org>
Marc Zyngier [Thu, 26 Feb 2026 08:22:32 +0000 (08:22 +0000)]
arm64: Fix sampling the "stable" virtual counter in preemptible section
Ben reports that when running with CONFIG_DEBUG_PREEMPT, using
__arch_counter_get_cntvct_stable() results in well deserves warnings,
as we access a per-CPU variable without preemption disabled.
Fix the issue by disabling preemption on reading the counter. We can
probably do a lot better by not disabling preemption on systems that
do not require horrible workarounds to return a valid counter value,
but this plugs the issue for the time being.
Fixes: 29cc0f3aa7c6 ("arm64: Force the use of CNTVCT_EL0 in __delay()") Reported-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/aZw3EGs4rbQvbAzV@e134344.arm.com Tested-by: Ben Horgan <ben.horgan@arm.com> Tested-by: André Draszik <andre.draszik@linaro.org> Signed-off-by: Will Deacon <will@kernel.org>
Linus Torvalds [Thu, 26 Feb 2026 18:05:15 +0000 (10:05 -0800)]
Merge tag 'kmalloc_obj-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull kmalloc_obj fixes from Kees Cook:
- Fix pointer-to-array allocation types for ubd and kcsan
- Force size overflow helpers to __always_inline
- Bump __builtin_counted_by_ref to Clang 22.1 from 22.0 (Nathan Chancellor)
* tag 'kmalloc_obj-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
kcsan: test: Adjust "expect" allocation type for kmalloc_obj
overflow: Make sure size helpers are always inlined
init/Kconfig: Adjust fixed clang version for __builtin_counted_by_ref
ubd: Use pointer-to-pointers for io_thread_req arrays
Kees Cook [Mon, 23 Feb 2026 22:01:56 +0000 (14:01 -0800)]
kcsan: test: Adjust "expect" allocation type for kmalloc_obj
The call to kmalloc_obj(observed.lines) returns "char (*)[3][512]",
a pointer to the whole 2D array. But "expect" wants to be "char (*)[512]",
the decayed pointer type, as if it were observed.lines itself (though
without the "3" bounds). This produces the following build error:
../kernel/kcsan/kcsan_test.c: In function '__report_matches':
../kernel/kcsan/kcsan_test.c:171:16: error: assignment to 'char (*)[512]' from incompatible pointer type 'char (*)[3][512]'
[-Wincompatible-pointer-types]
171 | expect = kmalloc_obj(observed.lines);
| ^
Instead of changing the "expect" type to "char (*)[3][512]" and
requiring a dereference at each use (e.g. "(expect*)[0]"), just
explicitly cast the return to the desired type.
Note that I'm intentionally not switching back to byte-based "kmalloc"
here because I cannot find a way for the Coccinelle script (which will
be used going forward to catch future conversions) to exclude this case.
Sanjay Chitroda [Thu, 26 Feb 2026 05:47:12 +0000 (11:17 +0530)]
mm/slub: drop duplicate kernel-doc for ksize()
The implementation of ksize() was updated with kernel-doc by commit fab0694646d7 ("mm/slab: move [__]ksize and slab_ksize() to mm/slub.c")
However, the public header still contains a kernel-doc comment
attached to the ksize() prototype.
Having documentation both in the header and next to the implementation
causes Sphinx to treat the function as being documented twice,
resulting in the warning:
WARNING: Duplicate C declaration, also defined at core-api/mm-api:521
Declaration is '.. c:function:: size_t ksize(const void *objp)'
Kernel-doc guidelines recommend keeping the documentation with the
function implementation. Therefore remove the redundant kernel-doc
block from include/linux/slab.h so that the implementation in slub.c
remains the canonical source for documentation.
mm/slab: mark alloc tags empty for sheaves allocated with __GFP_NO_OBJ_EXT
alloc_empty_sheaf() allocates sheaves from SLAB_KMALLOC caches using
__GFP_NO_OBJ_EXT to avoid recursion, however it does not mark their
allocation tags empty before freeing, which results in a warning when
CONFIG_MEM_ALLOC_PROFILING_DEBUG is set. Fix this by marking allocation
tags for such sheaves as empty.
The problem was technically introduced in commit 4c0a17e28340 but only
becomes possible to hit with commit 913ffd3a1bf5.
Fixes: 4c0a17e28340 ("slab: prevent recursive kmalloc() in alloc_empty_sheaf()") Fixes: 913ffd3a1bf5 ("slab: handle kmalloc sheaves bootstrap") Reported-by: David Wang <00107082@163.com> Closes: https://lore.kernel.org/all/20260223155128.3849-1-00107082@163.com/ Analyzed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Tested-by: Harry Yoo <harry.yoo@oracle.com> Tested-by: David Wang <00107082@163.com> Link: https://patch.msgid.link/20260225163407.2218712-1-surenb@google.com Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Harry Yoo [Mon, 23 Feb 2026 13:33:22 +0000 (22:33 +0900)]
mm/slab: pass __GFP_NOWARN to refill_sheaf() if fallback is available
When refill_sheaf() is called, failing to refill the sheaf doesn't
necessarily mean the allocation will fail because a fallback path
might be available and serve the allocation request.
Suppress spurious warnings by passing __GFP_NOWARN along with
__GFP_NOMEMALLOC whenever a fallback path is available.
When the caller is alloc_full_sheaf() or __pcs_replace_empty_main(),
the kernel always falls back to the slowpath (__slab_alloc_node()).
For __prefill_sheaf_pfmemalloc(), the fallback path is available
only when gfp_pfmemalloc_allowed() returns true.