Ma Ke [Sun, 16 Nov 2025 02:44:11 +0000 (10:44 +0800)]
powerpc/warp: Fix error handling in pika_dtm_thread
pika_dtm_thread() acquires client through of_find_i2c_device_by_node()
but fails to release it in error handling path. This could result in a
reference count leak, preventing proper cleanup and potentially
leading to resource exhaustion. Add put_device() to release the
reference in the error handling path.
Found by code review.
Cc: stable@vger.kernel.org Fixes: 3984114f0562 ("powerpc/warp: Platform fix for i2c change") Signed-off-by: Ma Ke <make24@iscas.ac.cn> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/20251116024411.21968-1-make24@iscas.ac.cn
Ally Heev [Sun, 16 Nov 2025 14:25:44 +0000 (19:55 +0530)]
powerpc: 82xx: fix uninitialized pointers with free attribute
Uninitialized pointers with `__free` attribute can cause undefined
behavior as the memory allocated to the pointer is freed automatically
when the pointer goes out of scope.
powerpc/km82xx doesn't have any bugs related to this as of now, but,
it is better to initialize and assign pointers with `__free` attribute
in one statement to ensure proper scope-based cleanup
Linus Walleij [Tue, 5 May 2026 18:47:56 +0000 (20:47 +0200)]
powerpc/g5: Enable all windfarms by default
The G5 defconfig is clearly intended for the G5 Powermac
series, and that should enable all the available
windfarm drivers, or the machine will overheat a short
while after booting and shut itself down, which is
annoying.
Guangshuo Li [Thu, 7 May 2026 10:06:03 +0000 (18:06 +0800)]
drm/bridge: imx8qxp-pxl2dpi: avoid ERR_PTR with device_node cleanup
imx8qxp_pxl2dpi_get_available_ep_from_port() returns ERR_PTR()
on errors. imx8qxp_pxl2dpi_find_next_bridge() stores its return
value in a __free(device_node) variable before checking IS_ERR().
When the function returns on the error path, the cleanup action calls
of_node_put() on the ERR_PTR() value.
Do not let a device_node cleanup variable hold error pointers. Change
imx8qxp_pxl2dpi_get_available_ep_from_port() to return an int and pass
the endpoint node through an output argument. Initialize the output
argument to NULL so callers hold either NULL on error paths or a valid
device_node pointer on successful path.
Fixes: ceea3f7806a10 ("drm/bridge: imx8qxp-pxl2dpi: simplify put of device_node pointers") Cc: stable@vger.kernel.org Reviewed-by: Liu Ying <victor.liu@nxp.com> Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com> Link: https://patch.msgid.link/20260507100604.667731-1-lgs201920130244@gmail.com Signed-off-by: Liu Ying <victor.liu@nxp.com>
qed: fix division by zero in qed_init_wfq_param when all vports are configured
In qed_init_wfq_param(), variable non_requested_count can become zero
when the number of vports with the configured flag set (including the
current vport being configured) equals total num_vports. This happens
when configuring the last unconfigured vport or when re-configuring
an already configured vport.
The function then calculates left_rate_per_vp = total_left_rate /
non_requested_count, which causes division by zero.
Fix this by skipping the division when non_requested_count is zero.
In that case, there is no remaining bandwidth to distribute, so just
record the configuration for the current vport and return success.
Davide Caratti [Fri, 8 May 2026 17:05:10 +0000 (19:05 +0200)]
net/sched: dualpi2: initialize timer earlier in dualpi2_init()
'pi2_timer' needs to be initialized in all error paths of dualpi2_init():
otherwise, a failure in qdisc_create_dflt() causes the following crash in
dualpi2_destroy():
net: ena: PHC: Check return code before setting timestamp output
ena_phc_gettimex64() is setting the output parameter regardless
of whether ena_com_phc_get_timestamp() succeeded or failed.
When ena_com_phc_get_timestamp() returns an error, the timestamp
parameter may contain uninitialized stack memory (e.g., when PHC is
disabled or in blocked state) or invalid hardware values. Passing
these to userspace via the PTP ioctl is both a security issue
(information leak) and a correctness bug.
Fix by checking the return code after releasing the lock and only
setting the output timestamp on success.
Fixes: e0ea34158ee8 ("net: ena: Add PHC support in the ENA driver") Cc: stable@vger.kernel.org Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20260507003518.22554-1-akiyano@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/rds: reset op_nents when zerocopy page pin fails
When iov_iter_get_pages2() fails in rds_message_zcopy_from_user(),
the pinned pages are released with put_page(), and
rm->data.op_mmp_znotifier is cleared. But we fail to properly
clear rm->data.op_nents.
Later when rds_message_purge() is called from rds_sendmsg() the
cleanup loop iterates over the incorrectly non zero number of
op_nents and frees them again.
Fix this by properly resetting op_nents when it should be in
rds_message_zcopy_from_user().
Linus Torvalds [Mon, 11 May 2026 22:38:49 +0000 (15:38 -0700)]
Merge tag 'linux_kselftest-kunit-fixes-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kunit fixes from Shuah Khan:
"Fix to decouple KUNIT_DEBUGFS and KUNIT_ALL_TESTS options and fix
KUNIT_DEBUGFS dependencies so it depends on DEBUG_FS without which it
will not be useful"
* tag 'linux_kselftest-kunit-fixes-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: config: KUNIT_DEBUGFS should depend on DEBUG_FS
kunit: config: Enable KUNIT_DEBUGFS by default
Tejun Heo [Mon, 11 May 2026 22:05:48 +0000 (12:05 -1000)]
sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path
In scx_root_enable_workfn(), put_task_struct(p) is called before scx_error()
dereferences p->comm and p->pid. If the iterator's reference is the last
drop, the task is freed synchronously and the deref becomes a UAF.
drm/amdgpu/gfx_v12_0: set gfx.rs64_enable from PFP header on GFX12
gfx_v12_0_init_microcode() always loads RS64 CP ucode but never set
adev->gfx.rs64_enable, so it stayed false and code that branches on it
(e.g. MEC pipe reset) used the legacy CP_MEC_CNTL path incorrectly.
Match GFX11: derive RS64 mode from the PFP firmware header (v2.0) via
amdgpu_ucode_hdr_version(). Log at debug when RS64 is enabled.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit b03d53598b0d2048e8fa7303b8d0784768ec4fa6)
Xiang Liu [Thu, 7 May 2026 12:56:15 +0000 (20:56 +0800)]
drm/amd/ras: Fix CPER ring debugfs read overflow
The legacy CPER debugfs reader can reach the payload path without a
valid pointer snapshot. The remaining user byte count is also treated as
the ring occupancy in dwords, so reads past the header can copy more than
requested.
Take the CPER lock before sampling pointers. Resample rptr/wptr for
payload reads, bound the payload copy by available dwords and the
remaining user size, and advance the file position for each dword copied.
Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1e40ef87ffdc291e05ccdade8b9170cc9c1c4249)
drm/amd/display: Wrap DCN32 phantom-plane allocation in DC_RUN_WITH_PREEMPTION_ENABLED
[Why]
dcn32_validate_bandwidth() wraps dcn32_internal_validate_bw() with
DC_FP_START()/DC_FP_END(). In x86 non-RT, DC_FP_START takes fpregs_lock(),
which disables local softirqs.
The DML1 path through dcn32_enable_phantom_plane() calls kvzalloc() to
allocate ~335 KiB for dc_plane_state. This triggers the vmalloc path,
which calls BUG_ON(in_interrupt()) because it's invoked within the
FPU-enabled (softirq disabled) region, leading to a kernel crash.
[How]
Wrap the dc_state_create_phantom_plane() call with the
DC_RUN_WITH_PREEMPTION_ENABLED() macro to allow preemption during
this memory allocation.
Fixes: 235c67634230 ("drm/amd/display: add DCN32/321 specific files for Display Core") Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/4470 Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Signed-off-by: James Lin <pinglei.lin@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 885ccbef7b94a8b38f69c4211c679021aa27ad11) Cc: stable@vger.kernel.org
Christian König [Mon, 20 Apr 2026 14:08:35 +0000 (16:08 +0200)]
drm/amdgpu: fix userq hang detection and reset
Fix lock inversions pointed out by Prike and Sunil. The hang detection
timeout *CAN'T* grab locks under which we wait for fences, especially
not the userq_mutex lock.
Then instead of this completely broken handling with the
hang_detect_fence just cancel the work when fences are processed and
re-start if necessary.
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1b62077f045ac6ffde7c97005c6659569ac5c1ec)
Christian König [Mon, 20 Apr 2026 13:13:57 +0000 (15:13 +0200)]
drm/amdgpu: remove almost all calls to amdgpu_userq_detect_and_reset_queues
Well the reset handling seems broken on multiple levels.
As first step of fixing this remove most calls to the hang detection.
That function should only be called after we run into a timeout! And *NOT*
as random check spread over the code in multiple places.
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 71bea36b54ccfb14cbc90f94267af6369af4e702)
Christian König [Thu, 16 Apr 2026 13:32:11 +0000 (15:32 +0200)]
drm/amdgpu: rework amdgpu_userq_signal_ioctl v3
This one was fortunately not looking so bad as the wait ioctl path, but
there were still a few things which could be fixed/improved:
1. Allocating with GFP_ATOMIC was quite unnecessary, we can do that
before taking the userq_lock.
2. Use a new mutex as protection for the fence_drv_xa so that we can do
memory allocations while holding it.
3. Starting the reset timer is unnecessary when the fence is already
signaled when we create it.
4. Cleanup error handling, avoid trying to free the queue when we don't
even got one.
v2: fix incorrect usage of xa_find, destroy the new mutex on error
v3: cleanup ref ordering
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1609eb0f81a609d350169839128cecf298c84e7a)
Christian König [Mon, 20 Apr 2026 18:18:43 +0000 (20:18 +0200)]
drm/amdgpu: remove deadlocks from amdgpu_userq_pre_reset
The purpose of a GPU reset is to make sure that fence can be signaled
again and the signal and resume workers can make progress again.
So waiting for the resume worker or any fence in the GPU reset path is
just utterly nonsense.
Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit fcd5f065eab46993af43442fd77ee8d9eb9c5bdf)
Matthew Auld [Fri, 8 May 2026 10:26:37 +0000 (11:26 +0100)]
drm/xe/dma-buf: fix UAF with retry loop
Retry doesn't work here, since bo will be freed on error, leading to
UAF. However, now that we do the alloc & init before the attach, we can
now combine this as one unit and have the init do the alloc for us. This
should make the retry safe.
Reported by Sashiko.
v2: Fix up the error unwind (CI)
Closes: https://sashiko.dev/#/patchset/20260506184332.86743-2-matthew.auld%40intel.com Fixes: eb289a5f6cc6 ("drm/xe: Convert xe_dma_buf.c for exhaustive eviction") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.18+ Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patch.msgid.link/20260508102635.149172-4-matthew.auld@intel.com
(cherry picked from commit 479669418253e0f27f8cf5db01a731352ea592e7) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Matthew Auld [Fri, 8 May 2026 10:26:36 +0000 (11:26 +0100)]
drm/xe/dma-buf: handle empty bo and UAF races
There look to be some nasty races here when triggering the
invalidate_mappings hook:
1) We do xe_bo_alloc() followed by the attach, before the actual full bo
init step in xe_dma_buf_init_obj(). However the bo is visible on the
attachments list after the attach. This is bad since exporter driver,
say amdgpu, can at any time call back into our invalidate_mappings hook,
with an empty/bogus bo, leading to potential bugs/crashes.
2) Similar to 1) but here we get a UAF, when the invalidate_mappings
hook is triggered. For example, we get as far as xe_bo_init_locked()
but this fails in some way. But here the bo will be freed on error, but
we still have it attached from dma-buf pov, so if the
invalidate_mappings is now triggered then the bo we access is gone and
we trigger UAF and more bugs/crashes.
To fix this, move the attach step until after we actually have a fully
set up buffer object. Note that the bo is not published to userspace
until later, so not sure what the comment "Don't publish the bo
until we have a valid attachment", is referring to.
We have at least two different customers reporting hitting a NULL ptr
deref in evict_flags when importing something from amdgpu, followed by
triggering the evict flow. Hit rate is also pretty low, which would
hint at some kind of race, so something like 1) or 2) might explain
this.
v2:
- Shuffle the order of the ops slightly (no functional change)
- Improve the comment to better explain the ordering (Matt B)
Matt Roper [Thu, 7 May 2026 21:00:15 +0000 (14:00 -0700)]
drm/xe: Make decision to use Xe2-style blitter instructions a feature flag
The blitter engines' MEM_COPY and MEM_SET instructions were added as
part of the same hardware change that introduced service copy engines
(i.e., BCS1-BCS8) which is why the driver checks for service copy engine
presence when deciding whether to use these instructions or the older
XY_* instructions. However when making this decision the driver should
consider which engines are part of the hardware architecture, not which
engines are present/usable on the current device. For graphics IP
versions that architecturally include service copy engines (i.e.,
everything Xe2 and later, plus PVC's Xe_HPC) we should use MEM_SET and
MEM_COPY even in if all of the service copy engines wind up getting
fused off. I.e., we need to decide based on whether the platform's
graphics descriptor contains these engines, rather than whether the
usable engine mask contains them. This logic got broken when
gt->info.__engine_mask was removed, although in practice that mistake
has been harmless so far because there haven't been any hardware
SKUs that fuse off all of the service copy engines yet.
Replace the incorrect has_service_copy_support() function with a GT
feature flag that tracks more accurately whether the new blitter
instructions are usable. In addition to fixing incorrect logic if all
service copies are fused off, the flag also makes it more obvious what
the calling code is trying to do; previously it wasn't terribly obvious
why "has service copy engines" was being used as the condition for using
different instructions on all copy engine types.
The new feature flag is named 'has_xe2_blt_instructions' because we
expect this flag to be set for all Xe2 and later platforms (i.e.,
everything officially supported by the Xe driver). Technically there's
also one Xe1-era platform (PVC) that supports these engines/instructions
and will set this flag, but this still seems to be the most clear and
understandable name for the flag.
Fixes: 61549a2ee594 ("drm/xe: Drop __engine_mask") Cc: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Reviewed-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Link: https://patch.msgid.link/20260507-xe2_copy-v1-1-26506381b821@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit 09b399842907565a64e351fb22da790b4c673ffb) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Arvind Yadav [Wed, 6 May 2026 13:20:27 +0000 (18:50 +0530)]
drm/xe/madvise: Track purgeability with BO-local counters
xe_bo_recompute_purgeable_state() walks all VMAs of a BO to determine
whether the BO can be made purgeable. This makes VMA create/destroy and
madvise updates O(n) in the number of mappings.
Replace the walk with BO-local counters protected by the BO dma-resv
lock:
- vma_count tracks the number of VMAs mapping the BO.
- willneed_count tracks active WILLNEED holders, including WILLNEED
VMAs and active dma-buf exports for non-imported BOs.
A DONTNEED BO is promoted back to WILLNEED on a 0->1 transition of
willneed_count. A BO is demoted to DONTNEED on a 1->0 transition only
when it still has VMAs, preserving the previous behaviour where a BO
with no mappings keeps its current madvise state.
Fixes: 4f44961eab84 ("drm/xe/vm: Prevent binding of purged buffer objects")
v2:
- Use early return for imported BOs in all four helpers to avoid
nesting (Matt B).
- Group purgeability state into a purgeable sub-struct on struct
xe_bo (Matt B).
- Reword xe_bo_willneed_put_locked() kernel-doc to explain that a 1->0
transition means all remaining active VMAs are DONTNEED (Matt B).
v3:
- Move DONTNEED/PURGED reject from vma_lock_and_validate() into
xe_vma_create(), gated on attr->purgeable_state == WILLNEED.
Fixes vm_bind bypass and partial-unbind rejection on DONTNEED
BOs (Matt B).
- Drop .check_purged from MAP and REMAP; keep it for PREFETCH and
add a comment why (Matt B).
- Skip BO validation in vma_lock_and_validate() for non-WILLNEED
VMA remnants so cleanup/remap paths do not repopulate
DONTNEED/PURGED BOs.
Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Signed-off-by: Arvind Yadav <arvind.yadav@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patch.msgid.link/20260506132027.2556046-1-arvind.yadav@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
(cherry picked from commit 23fb2ea56cb4fa2587bc072b04e4e698687a48e4) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Guopeng Zhang [Sat, 9 May 2026 10:20:31 +0000 (18:20 +0800)]
cgroup/cpuset: Reserve DL bandwidth only for root-domain moves
cpuset_can_attach() currently adds the bandwidth of all migrating
SCHED_DEADLINE tasks to sum_migrate_dl_bw. If the source and destination
cpuset effective CPU masks do not overlap, the whole sum is then
reserved in the destination root domain.
set_cpus_allowed_dl(), however, subtracts bandwidth from the source
root domain only when the affinity change really moves the task between
root domains. A DL task can move between cpusets that are still in the
same root domain, so including that task in sum_migrate_dl_bw can reserve
destination bandwidth without a matching source-side subtraction.
Share the root-domain move test with set_cpus_allowed_dl(). Keep
nr_migrate_dl_tasks counting all migrating deadline tasks for cpuset DL
task accounting, but add to sum_migrate_dl_bw only for tasks that need a
root-domain bandwidth move. Keep using the destination cpuset effective
CPU mask and leave the broader can_attach()/attach() transaction model
unchanged.
Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails") Cc: stable@vger.kernel.org # v6.10+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Reviewed-by: Waiman Long <longman@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Raphael Zimmer [Tue, 5 May 2026 09:08:12 +0000 (11:08 +0200)]
libceph: Fix potential out-of-bounds access in osdmap_decode()
When decoding osd_state and osd_weight from an incoming osdmap in
osdmap_decode(), both are decoded for each osd, i.e., map->max_osd
times. The ceph_decode_need() check only accounts for
sizeof(*map->osd_weight) once. This can potentially result in an
out-of-bounds memory access if the incoming message is corrupted such
that the max_osd value exceeds the actual content of the osdmap message.
This patch fixes the issue by changing the corresponding part in the
ceph_decode_need() check to account for
map->max_osd*sizeof(*map->osd_weight).
Jens Axboe [Mon, 11 May 2026 16:58:56 +0000 (10:58 -0600)]
io_uring: hold uring_lock across io_kill_timeouts() in cancel path
io_uring_try_cancel_requests() dropped ctx->uring_lock before calling
io_kill_timeouts(), which walks each timeout's link chain via
io_match_task() to test REQ_F_INFLIGHT. With chain mutation now
serialized by ctx->uring_lock, that walk needs the lock too.
Jens Axboe [Mon, 11 May 2026 16:58:50 +0000 (10:58 -0600)]
io_uring: defer linked-timeout chain splice out of hrtimer context
io_link_timeout_fn() is the hrtimer callback that fires when a linked
timeout expires. It currently calls io_remove_next_linked(prev) under
ctx->timeout_lock to splice the timeout request out of the link chain.
This is the only chain-mutation site that runs without ctx->uring_lock,
because hrtimer callbacks cannot take a mutex. Defer the splicing until
the task_work callback.
Jens Axboe [Mon, 11 May 2026 16:58:38 +0000 (10:58 -0600)]
io_uring: hold uring_lock when walking link chain in io_wq_free_work()
io_wq_free_work() calls io_req_find_next() from io-wq worker context,
which reads and clears req->link without holding any lock. This can
potentially race with other paths that mutate the same chain under
ctx->uring_lock.
Take ctx->uring_lock around the io_req_find_next() call. Only requests
with IO_REQ_LINK_FLAGS reach this path, which is not the hot path.
nvme: fix race condition between connected uevent and STARTED_ONCE flag
When a controller connects, nvme_start_ctrl() emits the
"NVME_EVENT=connected" uevent and sets the NVME_CTRL_STARTED_ONCE flag.
Currently, the uevent is emitted before the flag is set. This creates a
race condition for userspace tools (like udev rules) that might rely on
the "connected" event to configure other attributes.
Swap the order of operations in nvme_start_ctrl() so that the
NVME_CTRL_STARTED_ONCE flag is set before the uevent is sent. This
guarantees that the admin_timeout can already be changed when userspace
is notified.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
ACPI: driver: Check ACPI_COMPANION() against NULL during probe
Since every platform driver can be forced to match a device that doesn't
match its list of device IDs because of device_match_driver_override(),
platform drivers that rely on the existence of a device's ACPI companion
object should verify its presence.
Accordingly, add requisite ACPI_COMPANION() or ACPI_HANDLE() checks
against NULL to 13 platform drivers handling core ACPI devices.
Also change the value returned by the ACPI thermal zone driver when
the device's ACPI companion is not present to -ENODEV for consistency
with the other drivers.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Hans de Goede <johannes.goede@oss.qualcomm.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/4516068.ejJDZkT8p0@rafael.j.wysocki Cc: 7.0+ <stable@vger.kernel.org> # 7.0+
Jann Horn [Mon, 11 May 2026 15:55:11 +0000 (08:55 -0700)]
exit: prevent preemption of oopsing TASK_DEAD task
When an already-exiting task oopses, make_task_dead() currently calls
do_task_dead() with preemption enabled. That is forbidden:
do_task_dead() calls __schedule(), which has a comment saying "WARNING:
must be called with preemption disabled!".
If an oopsing task is preempted in do_task_dead(), between becoming
TASK_DEAD and entering the scheduler explicitly, bad things happen:
finish_task_switch() assumes that once the scheduler has switched away
from a TASK_DEAD task, the task can never run again and its stack is no
longer needed; but that assumption apparently doesn't hold if the dead
task was preempted (the SM_PREEMPT case).
This means that the scheduler ends up repeatedly dropping references on
the dead task's stack, which can lead to use-after-free or double-free
of the entire task stack; in other words, two tasks can end up running
on the same stack, resulting in various kinds of memory corruption.
(This does not just affect "recursively oopsing" tasks; it is enough to
oops once during task exit, for example in a file_operations::release
handler)
Fixes: 7f80a2fd7db9 ("exit: Stop poorly open coding do_task_dead in make_task_dead") Cc: stable@kernel.org Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The incorrect implementations of SUBNQN is a known issue in a massive number of
NVMe units. However, the warning "nvme nvmex: missing or invalid SUBNQN field."
is usually appropriate and will not affect performance or behavior etc. That is
because the support for SUBNQN is mandatory if the controller supports NVMe
revision 1.2.1 or greater, and it reported itself without a SUBNQN field which
breaks compliance with the specification. It should be not quirked by the Linux
Kernel.
Signed-off-by: Alan Cui <me@alancui.cc> Signed-off-by: Keith Busch <kbusch@kernel.org>
Sagi Grimberg [Sun, 10 May 2026 20:30:29 +0000 (23:30 +0300)]
nvmet-tcp: Fix potential UAF when ddgst mismatch
Shivam Kumar found via vulnerability testing:
When data digest is enabled on an NVMe/TCP connection and a digest
mismatch occurs on a non-final H2C_DATA PDU during an R2T-based
data transfer, the digest error handler in nvmet_tcp_try_recv_ddgst()
calls nvmet_req_uninit() — which performs percpu_ref_put() on the
submission queue — but does NOT mark the command as completed. It
does not set cqe->status, does not modify rbytes_done, and does not
clear any flag. When the subsequent fatal error triggers queue
teardown, nvmet_tcp_uninit_data_in_cmds() iterates all commands,
checks nvmet_tcp_need_data_in() for each one, and finds that the
already-uninited command still appears to need data (because
rbytes_done < transfer_len and cqe->status == 0). It therefore calls
nvmet_req_uninit() a second time on the same command — a double
percpu_ref_put against a single percpu_ref_get.
Reported-by: Shivam Kumar <kumar.shivam43666@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
nvme-pci: fix use-after-free in nvme_free_host_mem()
nvme_free_host_mem() frees dev->hmb_sgt via dma_free_noncontiguous()
but never clears the pointer afterward. This leads to a use-after-free
if nvme_free_host_mem() is called twice in the same error path.
This can happen during nvme_probe() when nvme_setup_host_mem() succeeds
in allocating the HMB (setting dev->hmb_sgt) but nvme_set_host_mem()
fails with an I/O error:
nvme_setup_host_mem()
nvme_alloc_host_mem_single() -> sets dev->hmb_sgt
nvme_set_host_mem() -> fails with -EIO
nvme_free_host_mem() -> frees hmb_sgt, but does NOT NULL it
return error
nvme_probe() error path:
nvme_free_host_mem() -> dev->hmb_sgt is stale, use-after-free
The second call dereferences the freed sgt, causing a NULL pointer
dereference in iommu_dma_free_noncontiguous() when it accesses
sgt->sgl->dma_address (the backing memory has been freed and zeroed).
This is reproducible on Thunderbolt-attached NVMe devices (e.g., OWC
Envoy Express behind a Dell WD22TB4 dock) where the device intermittently
returns I/O errors during HMB setup due to PCIe link instability.
Fix this by setting dev->hmb_sgt to NULL after freeing it, so the
second call takes the multi-descriptor path which safely handles the
already-cleaned-up state.
Fixes: 63a5c7a4b4c4 ("nvme-pci: use dma_alloc_noncontigous if possible") Signed-off-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Hannes Reinecke [Thu, 30 Apr 2026 13:22:32 +0000 (15:22 +0200)]
nvmet-auth: Do not print DH-HMAC-CHAP secrets
From a security standpoint we should not allow to print out the DH-HMAC-CHAP
secrets, but at the same time having them is useful for debugging
authentication failures.
So add a Kconfig option NVME_TARGET_AUTH_DEBUG to only enable debugging
if explictly requested at build time.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
Keith Busch [Wed, 6 May 2026 13:16:02 +0000 (06:16 -0700)]
nvme: fix bio leak on mapping failure
The local bio is always NULL, so we'd leak the bio if the integrity
mapping failed. Just get it directly from the request.
Fixes: d0d1d522316e91f ("blk-map: provide the bdev to bio if one exists") Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Keith Busch [Wed, 6 May 2026 13:05:14 +0000 (06:05 -0700)]
nvme: make prp passthrough usage less scary
The warning is a bit alarming, and it only prints for the very first
non-sgl capable device that receives a passthrough command. Just log an
informational message on initial discovery for every device.
Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
Jiayuan Chen [Mon, 30 Mar 2026 07:32:29 +0000 (15:32 +0800)]
irq_work: Fix use-after-free in irq_work_single() on PREEMPT_RT
On PREEMPT_RT, non-HARD irq_work runs in per-CPU kthreads via
run_irq_workd(), so irq_work_sync() uses rcuwait() to wait for BUSY==0.
After irq_work_single() clears BUSY via atomic_cmpxchg(), it still
dereferences @work for irq_work_is_hard() and rcuwait_wake_up().
An irq_work_sync() caller on another CPU that enters after BUSY is cleared
can observe BUSY==0 immediately, return, and free the work before those
accesses complete — causing a use-after-free.
Fix this by wrapping run_irq_workd() in guard(rcu)() so that the entire
irq_work_single() execution is within an RCU read-side critical
section. Then add synchronize_rcu() in irq_work_sync() after
rcuwait_wait_event() to ensure the caller waits for the RCU grace period
before returning, preventing premature frees.
Fixes: 810979682ccc ("irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.") Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20260330073234.303732-1-jiayuan.chen@linux.dev
Paolo Pisati [Fri, 8 May 2026 07:09:56 +0000 (09:09 +0200)]
platform/x86: asus-nb-wmi: add DMI quirk for ASUS Zenbook Duo UX8407AA
Use the existing zenbook duo keyboard quirk for the UX8407AA model too.
Signed-off-by: Paolo Pisati <p.pisati@gmail.com> Reviewed-by: Denis Benato <denis.benato@linux.dev> Link: https://patch.msgid.link/20260508070956.62201-1-p.pisati@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Ming Lei [Sun, 10 May 2026 14:48:43 +0000 (22:48 +0800)]
ublk: reject max_sectors smaller than PAGE_SECTORS in parameter validation
blk_validate_limits() requires max_hw_sectors >= PAGE_SECTORS and fires
a WARN_ON_ONCE if this invariant is violated. ublk_validate_params()
only checked the upper bound of max_sectors against max_io_buf_bytes,
allowing userspace to pass small values (including zero) that trigger
the warning when blk_mq_alloc_disk() is called from
ublk_ctrl_start_dev().
Before 494ea040bcb5, ublk used blk_queue_max_hw_sectors() which silently
clamped small values up to PAGE_SECTORS. The conversion to passing
queue_limits directly to blk_mq_alloc_disk() lost that clamping and now
hits blk_validate_limits()'s WARN_ON_ONCE instead.
Validate that max_sectors is at least PAGE_SECTORS in
ublk_validate_params() so invalid values are rejected early with
-EINVAL instead of reaching the block layer.
Maoyi Xie [Sun, 10 May 2026 08:41:19 +0000 (16:41 +0800)]
io_uring/fdinfo: translate SqThread PID through caller's pid_ns
SQPOLL stores current->pid (init_pid_ns view) in sqd->task_pid
at thread creation. fdinfo prints it raw via
seq_printf("SqThread:\t%d\n", sq_pid). A reader inside a
non-initial pid_ns sees the host PID, not the kthread's PID in
the reader's own pid_ns.
The SQPOLL kthread is created with CLONE_THREAD and no
CLONE_NEW*, so it lives in the submitter's pid_ns. An
unprivileged user_ns + pid_ns submitter can read fdinfo and
learn the host PID of a kthread whose in-namespace PID is
different.
Reproducer (mainline 7.0, KASAN): unshare CLONE_NEWUSER |
CLONE_NEWPID | CLONE_NEWNS, mount a private /proc, then have a
grandchild that is pid 1 in the new pid_ns open an io_uring
ring with IORING_SETUP_SQPOLL. /proc/self/task lists {1, 2};
the SQPOLL kthread is pid 2. Before: fdinfo prints
SqThread = <host pid>. After: SqThread = 2.
Use task_pid_nr_ns() against the proc inode's pid_ns to compute
sq_pid, instead of reading the stored sq->task_pid (which holds
the init_pid_ns view). pidfd_show_fdinfo() in kernel/pid.c
follows the same pattern.
Yong-Xuan Wang [Fri, 8 May 2026 09:31:21 +0000 (02:31 -0700)]
irqchip/riscv-imsic: Clear interrupt move state during CPU offlining
Affinity changes of IMSIC interrupts have to be careful to not lose an
interrupt in the process. Each vector keeps track of an affinity change in
progress with two pointers in struct imsic_vector.
imsic_vector::move_prev points to the previous CPU target data and
imsic_vector::move_next to the designated new CPU target data.
imsic_vector::move_prev on the new CPU can only be cleared after the
previous CPU has cleared imsic_vector::move_next, which ususally happens in
__imsic_remote_sync().
In case of CPU hot-unplug __imsic_remote_sync() is not invoked because the
CPU is already marked offline. That means imsic_vector::move_prev becomes
stale until the CPU is onlined again.
The stale pointer prevents further affinity changes for the affected
interrupts.
Solve this by clearing the imsic_vector::move_prev pointers in the CPU
hotplug offline path.
[ tglx: Replace word salad in change log ]
Fixes: 0f67911e821c ("irqchip/riscv-imsic: Separate next and previous pointers in IMSIC vector") Signed-off-by: Yong-Xuan Wang <yongxuan.wang@sifive.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260508-imsic-v2-1-e9f08dd46cf5@sifive.com
Mark Rutland [Thu, 7 May 2026 11:05:18 +0000 (12:05 +0100)]
genirq/chip: Don't call add_interrupt_randomness() for NMIs
Recently handle_percpu_devid_irq() was changed to call
add_interrupt_randomness(). This introduced a potential deadlock when
handle_percpu_devid_irq() is used to handle an NMI, which can be
detected with lockdep, e.g.
================================
WARNING: inconsistent lock state
7.1.0-rc2-pnmi #465 Not tainted
--------------------------------
inconsistent {INITIAL USE} -> {IN-NMI} usage.
perf/695 [HC1[1]:SC0[0]:HE0:SE1] takes: ffff00837dfd3a18 (&base->lock){-.-.}-{2:2}, at: lock_timer_base+0x6c/0xac
{INITIAL USE} state was registered at:
_raw_spin_lock_irqsave+0x68/0xb0
lock_timer_base+0x6c/0xac
__mod_timer+0x100/0x32c
add_timer_global+0x2c/0x40
__queue_delayed_work+0xf0/0x140
queue_delayed_work_on+0x134/0x138
mem_cgroup_css_online+0x30c/0x310
online_css+0x34/0x10c
cgroup_init_subsys+0x158/0x1c8
cgroup_init+0x440/0x524
start_kernel+0x888/0x998
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&base->lock);
<Interrupt>
lock(&base->lock);
*** DEADLOCK ***
During review, Thomas pointed out it wouldn't be safe for
handle_percpu_devid_irq() to call add_interrupt_randomness() if it was
used to handle NMIs:
https://lore.kernel.org/lkml/87bjgik042.ffs@tglx/
... but evidently people missed that handle_percpu_devid_irq() *is* used
for NMIs.
While it might seem that NMIs should be handled with a separate
handle_percpu_devid_nmi() function, for various structural reasons this was
impractical, and handle_percpu_devid_irq() has been expected to be used for
NMIs since commits:
21bbbc50f398f ("irqchip/gic-v3: Switch high priority PPIs over to handle_percpu_devid_irq()") 5ff78c8de9d83 ("genirq: Kill handle_percpu_devid_fasteoi_nmi()")
Taking the above into account, avoid the deadlock by not calling
add_interrupt_randomness() when handle_percpu_devid_irq() is called in an
NMI context. This is consistent with other NNI handling flows, which do not
call add_interrupt_randomness().
At the same time, update the kernel-doc comment to make it clear that
handle_percpu_devid_irq() can be called in NMI context. The rest of
handle_percpu_devid_irq() is currently NMI safe and doesn't need to change.
Fixes: fd7400cfcbaa ("genirq/chip: Invoke add_interrupt_randomness() in handle_percpu_devid_irq()") Reported-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://patch.msgid.link/20260507110518.3128248-1-mark.rutland@arm.com
Sascha Bischoff [Wed, 6 May 2026 09:37:43 +0000 (09:37 +0000)]
irqchip/gic-v5: Allocate ITS parent LPIs as a range
The ITS MSI domain no longer manages LPI allocation directly. LPIs are
allocated and freed by the parent LPI domain, which can now handle a
full range of interrupts and unwind partial allocations internally.
Make the ITS domain request and release the parent IRQs as a single
range instead of iterating over each interrupt. The ITS allocation
path then only needs to reserve EventIDs, allocate the parent range,
and fill in the ITS irq_data for each MSI. Since no operation in the
per-MSI loop can fail, the partial parent-free unwind becomes
unnecessary.
On teardown, reset the ITS irq_data for the range and then release the
parent range in one call, leaving LPI teardown to the LPI domain.
Fixes: 0f0101325876 ("irqchip/gic-v5: Add GICv5 LPI/IPI support") Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260506093634.382062-4-sascha.bischoff@arm.com
Sascha Bischoff [Wed, 6 May 2026 09:37:23 +0000 (09:37 +0000)]
irqchip/gic-v5: Support range allocation for LPIs
The per-IPI parent allocation loop returns immediately on failure and leaks
any parent interrupts allocated by earlier iterations.
The GICv5 LPI domain now owns LPI allocation and teardown internally,
but its irq_domain callbacks still reject requests where nr_irqs is
greater than one. This forces child domains to allocate and free LPIs
one at a time even when the interrupt core requests a contiguous
range.
Handle multi-interrupt allocation and teardown in the LPI domain by
iterating over the requested range and unwinding any partially
allocated state on failure.
Allocate the parent LPIs for the IPI domain with a single range
request as well, which cures the leakage problem.
Fixes: 0f0101325876 ("irqchip/gic-v5: Add GICv5 LPI/IPI support") Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260506093634.382062-3-sascha.bischoff@arm.com
Sascha Bischoff [Wed, 6 May 2026 09:37:02 +0000 (09:37 +0000)]
irqchip/gic-v5: Move LPI allocation into the LPI domain
The IPI and ITS MSI domains currently allocate and release LPIs
directly, then pass the selected LPI ID to the parent LPI domain. This
leaks the LPI domain's allocation policy into its child domains and
forces each child to duplicate part of the parent domain's teardown.
Make the LPI domain allocate LPIs in its .alloc() callback and release
them in a matching .free() callback. Child domains can then request a
parent interrupt without passing an implementation-specific LPI ID,
and the LPI lifetime is tied to the domain that owns the LPI
namespace.
Remove the gicv5_alloc_lpi() and gicv5_free_lpi() wrappers now that no
external caller needs to manage LPIs directly.
This is a preparatory change for an actual leakage problem in the
allocation code and therefore tagged with the same Fixes tag.
Fixes: 0f0101325876 ("irqchip/gic-v5: Add GICv5 LPI/IPI support") Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260506093634.382062-2-sascha.bischoff@arm.com
Derek J. Clark [Sun, 10 May 2026 04:25:39 +0000 (04:25 +0000)]
platform/x86: lenovo-wmi-other: Limit adding attributes to supported devices
Adds lwmi_is_attr_01_supported, and only creates the attribute subfolder
if the attribute is supported by the hardware. Due to some poorly
implemented BIOS this is a multi-step sequence of events. This is
because:
- Some BIOS support getting the capability data from custom mode (0xff),
while others only support it in no-mode (0x00).
- Some BIOS support get/set for the current value from custom mode (0xff),
while others only support it in no-mode (0x00).
- Some BIOS report capability data for a method that is not fully
implemented.
- Some BIOS have methods fully implemented, but no complimentary
capability data.
To ensure we only expose fully implemented methods with corresponding
capability data, we check each outcome before reporting that an
attribute can be supported.
Checking for lwmi_is_attr_01_supported during remove is not done to
ensure that we don't attempt to call cd01 or send WMI events if one of
the interfaces being removed was the cause of the driver unloading.
Fixes: edc4b183b794 ("platform/x86: Add Lenovo Other Mode WMI Driver") Reported-by: Kurt Borja <kuurtb@gmail.com> Closes: https://lore.kernel.org/platform-driver-x86/DG60P3SHXR8H.3NSEHMZ6J7XRC@gmail.com/ Cc: stable@vger.kernel.org Reviewed-by: Rong Zhang <i@rong.moe> Tested-by: Rong Zhang <i@rong.moe> Reviewed-by: Mark Pearson <mpearson-lenovo@squebb.ca> Signed-off-by: Derek J. Clark <derekjohn.clark@gmail.com> Link: https://patch.msgid.link/20260510042546.436874-10-derekjohn.clark@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Derek J. Clark [Sun, 10 May 2026 04:25:38 +0000 (04:25 +0000)]
platform/x86: lenovo-wmi-other: Add Attribute ID helper functions
Adds lwmi_attr_id() function. In the same vein as LWMI_ATTR_ID_FAN_RPM(),
but as a generic, to de-duplicate attribute_id assignment boilerplate.
Adds tunable_attr_01_id() function that breaks out the members of a
tunable_attr_01 struct and passes them to lwmi_attr_id().
No functional change intended.
Cc: stable@vger.kernel.org Reviewed-by: Rong Zhang <i@rong.moe> Tested-by: Rong Zhang <i@rong.moe> Reviewed-by: Mark Pearson <mpearson-lenovo@squebb.ca> Signed-off-by: Derek J. Clark <derekjohn.clark@gmail.com> Link: https://patch.msgid.link/20260510042546.436874-9-derekjohn.clark@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Derek J. Clark [Sun, 10 May 2026 04:25:37 +0000 (04:25 +0000)]
platform/x86: lenovo-wmi-helpers: Move gamezone enums to wmi-helpers
In a later patch in the series the thermal mode enum will be accessed
across three separate drivers (wmi-capdata, wmi-gamezonem and wmi-other).
An additional patch in the series will also add a function prototype that
needs to reference this enum in wmi-helpers.h. To avoid having all these
drivers begin to import each others headers, and to avoid declaring an
opaque enum to hande the second case, move the thermal mode enum to
helpers where it can be safely accessed by everything that needs it from
a single import.
While at it, since the gamezone_events_type enum is the only remaining
item in the header, move that as well and remove the gamezone header
entirely.
Cc: stable@vger.kernel.org Reviewed-by: Mark Pearson <mpearson-lenovo@squebb.ca> Reviewed-by: Rong Zhang <i@rong.moe> Tested-by: Rong Zhang <i@rong.moe> Signed-off-by: Derek J. Clark <derekjohn.clark@gmail.com> Link: https://patch.msgid.link/20260510042546.436874-8-derekjohn.clark@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Rong Zhang [Sun, 10 May 2026 04:25:36 +0000 (04:25 +0000)]
platform/x86: lenovo: Decouple lenovo-wmi-gamezone and lenovo-wmi-other
Currently, lenovo-wmi-gamezone depends on lenovo-wmi-other as the former
imports symbols from the latter. The imported symbols are just used to
register a notifier block. However, there is no runtime dependency
between both drivers, and either of them can run without the other,
which is the major purpose of using the notifier framework.
Such a link-time dependency is non-optimal. A previous attempt to "fix"
it made LENOVO_WMI_GAMEZONE select LENOVO_WMI_TUNING, which was
fundamentally broken and resulted in undefined Kconfig behavior, as
`select' cannot be used on a symbol with potentially unmet dependencies.
Decouple both drivers by moving the thermal mode notifier chain to
lenovo-wmi-helpers. Methods for notifier block (un)registration are
exported for lenovo-wmi-gamezone, while a method for querying the
current thermal mode are exported for lenovo-wmi-other.
This turns the dependency graph from
+------------ lenovo-wmi-gamezone
| |
v |
lenovo-wmi-helpers |
^ |
| V
+------------ lenovo-wmi-other
into
+------------ lenovo-wmi-gamezone
|
v
lenovo-wmi-helpers
^
|
+------------ lenovo-wmi-other
To make it clear, the name of the notifier chain is also renamed from
`om_chain_head' to `tm_chain_head', indicating that it's used to query
the current thermal mode.
No functional change intended.
Reviewed-by: Mark Pearson <mpearson-lenovo@squebb.ca> Fixes: 6e38b9fcbfa3 ("platform/x86: lenovo: gamezone needs "other mode"") Cc: stable@vger.kernel.org Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202603252259.gHvJDyh3-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202603260302.X0NjQOda-lkp@intel.com/ Signed-off-by: Rong Zhang <i@rong.moe> Signed-off-by: Derek J. Clark <derekjohn.clark@gmail.com> Link: https://patch.msgid.link/20260510042546.436874-7-derekjohn.clark@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Derek J. Clark [Sun, 10 May 2026 04:25:35 +0000 (04:25 +0000)]
platform/x86: lenovo-wmi-other: Fix tunable_attr_01 struct members
In struct tunable_attr_01 the capdata pointer is unused and the size of
the id members is u32 when it should be u8. Fix these prior to adding
additional members.
No functional change intended.
Reviewed-by: Mark Pearson <mpearson-lenovo@squebb.ca> Cc: stable@vger.kernel.org Reviewed-by: Rong Zhang <i@rong.moe> Tested-by: Rong Zhang <i@rong.moe> Signed-off-by: Derek J. Clark <derekjohn.clark@gmail.com> Link: https://patch.msgid.link/20260510042546.436874-6-derekjohn.clark@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Derek J. Clark [Sun, 10 May 2026 04:25:34 +0000 (04:25 +0000)]
platform/x86: lenovo-wmi-other: Zero initialize WMI arguments
Adds explicit initialization of wmi_method_args_32 declarations with
zero values to prevent uninitialized data from being sent to the device
BIOS when passed.
No functional change intended.
Reviewed-by: Mark Pearson <mpearson-lenovo@squebb.ca> Fixes: 22024ac5366f ("platform/x86: Add Lenovo Gamezone WMI Driver") Fixes: edc4b183b794 ("platform/x86: Add Lenovo Other Mode WMI Driver") Reported-by: Rong Zhang <i@rong.moe> Closes: https://lore.kernel.org/platform-driver-x86/95c7e7b539dd0af41189c754fcd35cec5b6fe182.camel@rong.moe/ Cc: stable@vger.kernel.org Reviewed-by: Rong Zhang <i@rong.moe> Tested-by: Rong Zhang <i@rong.moe> Signed-off-by: Derek J. Clark <derekjohn.clark@gmail.com> Link: https://patch.msgid.link/20260510042546.436874-5-derekjohn.clark@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Rong Zhang [Sun, 10 May 2026 04:25:33 +0000 (04:25 +0000)]
platform/x86: lenovo-wmi-other: Balance component bind and unbind
When lwmi_om_master_bind() fails, the master device's components are
left bound, with the aggregate device destroyed due to the failure
(found by sashiko.dev [1]).
Balance calls to component_bind_all() and component_unbind_all() when an
error is propagated to the component framework.
Rong Zhang [Sun, 10 May 2026 04:25:32 +0000 (04:25 +0000)]
platform/x86: lenovo-wmi-other: Balance IDA id allocation and free
Currently, the IDA id is only freed on wmi-other device removal or
failure to create firmware-attributes device, kset, or attributes. It
leaks IDA ids if the wmi-other device is bound multiple times, as the
unbind callback never frees the previously allocated IDA id.
Additionally, if the wmi-other device has failed to create a
firmware-attributes device before it gets removed, the wmi-device
removal callback double frees the same IDA id.
These bugs were found by sashiko.dev [1].
Fix them by moving ida_free() into lwmi_om_fw_attr_remove() so it is
balanced with ida_alloc() in lwmi_om_fw_attr_add(). With them fixed,
properly set and utilize the validity of priv->ida_id to balance
firmware-attributes registration and removal, without relying on
propagating the registration error to the component framework, which is
more reliable and aligns with the hwmon device registration and removal
sequences.
Fix spelling mistake in comment:
- occured -> occurred
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Md Shofiqul Islam <shofiqtest@gmail.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
fprobe: Fix unregister_fprobe() to wait for RCU grace period
Commit 4346ba1604093 ("fprobe: Rewrite fprobe on function-graph tracer")
changed fprobe to register struct fprobe to an rcu-hlist, but it forgot
to wait for RCU GP. Thus there can be use-after-free if the fprobe is
released right after unregistering. This can be happened on fprobe
event and sample module code.
To fix this issue, add synchronize_rcu() in unregister_fprobe().
Note that BPF is OK because fprobe is used as a part of
bpf_kprobe_multi_link. This unregisters its fprobe in
bpf_kprobe_multi_link_release() and it is deallocated via
bpf_kprobe_multi_link_dealloc(), which is invoked from
bpf_link_defer_dealloc_rcu_gp() RCU callback.
For BPF, this also introduced unregister_fprobe_async() which does
NOT wait for RCU grace priod.
Andrea Righi [Mon, 11 May 2026 08:31:30 +0000 (10:31 +0200)]
sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths
scx_alloc_and_add_sched() can fail after @sch has been assigned to
ops->priv. In those cases @sch is torn down (either via kfree() through
the err_free_* chain or via kobject_put() -> scx_kobj_release() -> RCU
work), but @ops->priv is left pointing at the about-to-be-freed pointer.
With the recent -EBUSY gate in scx_root_enable_workfn() and
scx_sub_enable_workfn() that rejects an attach when @ops->priv is still
non-NULL, see commit bbf30b383cf6 ("sched_ext: Fix ops->priv clobber on
concurrent attach/detach"), a dangling @ops->priv permanently locks the
kdata out: every future attach attempt sees a stale binding and returns
-EBUSY even though no scheduler is actually attached.
Clear @ops->priv on the post-assign failure paths so that the kdata
returns to its pre-attach state when the function returns ERR_PTR().
ceph: add ceph_has_realms_with_quotas() check to ceph_quota_update_statfs()
When MDS rejects a session, remove_session_caps() ->
__ceph_remove_cap() -> ceph_change_snap_realm() clears
i_snap_realm for every inode that loses its last cap.
The realm is restored once caps are re-granted after
reconnect. It is not a real error and this patch changes
pr_err_ratelimited_client() on doutc().
Every quota methods ceph_quota_is_max_files_exceeded(),
ceph_quota_is_max_bytes_exceeded(),
ceph_quota_is_max_bytes_approaching() calls
ceph_has_realms_with_quotas() check. This patch adds
the missing ceph_has_realms_with_quotas() call into
ceph_quota_update_statfs().
[ idryomov: add braces around both arms of multiline ifs ]
libceph: Fix potential out-of-bounds access in __ceph_x_decrypt()
In __ceph_x_decrypt(), a part of the buffer p is interpreted as a
ceph_x_encrypt_header, and the magic field of this struct is accessed.
This happens without any guarantee that the buffer is large enough to
hold this struct. The function parameter ciphertext_len represents the
length of the ciphertext to decrypt and is guaranteed to be at most the
remaining size of the allocated buffer p. However, this value is not
necessarily greater than sizeof(ceph_x_encrypt_header). E.g., a message
frame of type FRAME_TAG_AUTH_REPLY_MORE, that is just as long to hold
the ciphertext at its end with a ciphertext_len of 8 or less, can
trigger an out-of-bounds memory access when accessing hdr->magic.
This patch fixes the issue by adding a check to ensure that the
decrypted plaintext in the buffer is large enough to represent at least
the ceph_x_encrypt_header.
Commit d93231a6bc8a ("ceph: prevent a client from exceeding the MDS
maximum xattr size") moved the required_blob_size computation to before
the __build_xattrs() call, introducing a race.
__build_xattrs() releases and reacquires i_ceph_lock during execution.
In that window, handle_cap_grant() may update i_xattrs.blob with a
newer MDS-provided blob and bump i_xattrs.version. When
__build_xattrs() detects that index_version < version, it destroys and
rebuilds the entire xattr rb-tree from the new blob, potentially
increasing count, names_size, and vals_size.
The prealloc_blob size check that follows still uses the stale
required_blob_size computed before the rebuild, so it passes even when
prealloc_blob is too small for the now-larger tree. After __set_xattr()
adds one more xattr on top, __ceph_build_xattrs_blob() is called from
the cap flush path and hits:
Fix this by recomputing required_blob_size after __build_xattrs()
returns, using the current tree state. Also re-validate against
m_max_xattr_size to fall back to the sync path if the rebuilt tree now
exceeds the MDS limit.
Cc: stable@vger.kernel.org Fixes: d93231a6bc8a ("ceph: prevent a client from exceeding the MDS maximum xattr size") Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The old_blob in __ceph_setxattr() can store
ci->i_xattrs.prealloc_blob value during the retry.
However, it is never called the ceph_buffer_put()
for the old_blob object. This patch fixes the issue of
the buffer leak.
libceph: Fix unnecessarily high ceph_decode_need() for uniform bucket
In crush_decode_uniform_bucket(), the item_weight field of the bucket
is set. This is a single field of type u32 since the uniform bucket uses
the same weight for all items. The value in ceph_decode_need() is set to
(1+b->h.size) * sizeof(u32), which is higher than actually needed.
This patch removes the call to ceph_decode_need() with the unnecessarily
high value and switches the subsequent operation from ceph_decode_32()
to ceph_decode_32_safe(), which already includes the correct bounds
check.
libceph: Fix potential out-of-bounds access in crush_decode()
A message of type CEPH_MSG_OSD_MAP containing a crush map with at least
one bucket has two fields holding the bucket algorithm. If the values
in these two fields differ, an out-of-bounds access can occur. This is
the case because the first algorithm field (alg) is used to allocate
the correct amount of memory for a bucket of this type, while the second
algorithm field inside the bucket (b->alg) is used in the subsequent
processing.
This patch fixes the issue by adding a check that compares alg and
b->alg and aborts the processing in case they differ. Furthermore,
b->alg is set to 0 in this case, because the destruction of the crush
map also uses this field to determine the bucket type, which can again
result in an out-of-bounds access when trying to free the memory pointed
to by the fields of the bucket. To correctly free the memory allocated
for the bucket in such a case, the corresponding call to kfree is moved
from the algorithm-specific crush_destroy_bucket functions to the
generic crush_destroy_bucket().
Zhenzhong Duan [Sat, 9 May 2026 02:43:46 +0000 (10:43 +0800)]
iommu/vt-d: Avoid NULL pointer dereference or refcount corruption
Commit 60f030f7418d ("iommu/vt-d: Avoid use of NULL after WARN_ON_ONCE")
fixed a NULL pointer dereference in an unlikely situation partly.
If dev_pasid is not found in the dev_pasids list, it remains NULL.
However, the teardown operations are executed unconditionally, this lead
to a NULL pointer dereference or refcount corruption.
If the domain was never attached to this IOMMU, info will be NULL, which
would cause an immediate dereference when checking --info->refcnt.
Even if info is not NULL, decrementing the refcount without having removed
a valid PASID might unbalance the count. This could lead to premature
dropping of the refcount to 0, potentially causing a use-after-free for the
remaining active devices sharing the domain.
Fix it by returning early if dev_pasid is NULL, before executing the
teardown operations.
Issue found by AI review and suggested by Kevin Tian.
https://sashiko.dev/#/patchset/20260421031347.1408890-1-zhenzhong.duan%40intel.com
Fixes: 60f030f7418d ("iommu/vt-d: Avoid use of NULL after WARN_ON_ONCE") Cc: stable@vger.kernel.org Suggested-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20260422033538.95000-1-zhenzhong.duan@intel.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The global static blocked domain is a dummy domain without corresponding
dmar_domain structure, accessing beyond iommu_domain structure triggers
oops easily. Fix it by return early in domain_remove_dev_pasid() like
identity domain.
Fixes: 7d0c9da6c150 ("iommu/vt-d: Add set_dev_pasid callback for dma domain") Cc: stable@vger.kernel.org Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20260421031347.1408890-1-zhenzhong.duan@intel.com Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Naval Alcalá [Sat, 9 May 2026 02:43:44 +0000 (10:43 +0800)]
iommu/vt-d: Disable DMAR for Intel Q35 IGFX
Intel Q35 integrated graphics (8086:29b2) exhibits broken DMAR
behaviour similar to other G4x/GM45 devices for which DMAR is
already disabled via quirks.
When DMAR is enabled, the system may hard lock up during boot or
early device initialization, requiring a reset.
Add the missing PCI ID to the existing quirk list to disable
DMAR for this device.
Guopeng Zhang [Sat, 9 May 2026 10:20:30 +0000 (18:20 +0800)]
cgroup/cpuset: Reset DL migration state on can_attach() failure
cpuset_can_attach() accumulates temporary SCHED_DEADLINE migration
state in the destination cpuset while walking the taskset.
If a later task_can_attach() or security_task_setscheduler() check
fails, cgroup_migrate_execute() treats cpuset as the failing subsystem
and does not call cpuset_cancel_attach() for it. The partially
accumulated state is then left behind and can be consumed by a later
attach, corrupting cpuset DL task accounting and pending DL bandwidth
accounting.
Reset the pending DL migration state from the common error exit when
ret is non-zero. Successful can_attach() keeps the state for
cpuset_attach() or cpuset_cancel_attach().
Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails") Cc: stable@vger.kernel.org # v6.10+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Waiman Long <longman@redhat.com>
iommu: Warn on premature unblock during DMA aliased sibling reset
When two aliased siblings are in the same iommu_group, they might share the
same RID. The reset functions don't support this case, though it is unclear
whether there is a real case of having an ATS capable device on a PCI/PCI-X
bus.
Theoretically, however, if two aliased devices are resetting concurrently,
one might be unblocked prematurely in the middle of the reset by the other
sibling who completes the reset first.
This isn't a regression from this series but it's better to spit a warning,
so we can know if such use case is common enough for us to make subsequent
patches for its coverage.
iommu: Fix WARN_ON in __iommu_group_set_domain_nofail() due to reset
In __iommu_group_set_domain_internal(), concurrent domain attachments are
rejected when any device in the group is recovering. This is necessary to
fence concurrent attachments to a multi-device group where devices might
share the same RID due to PCI DMA alias quirks, but triggers the WARN_ON in
__iommu_group_set_domain_nofail().
Other IOMMU_SET_DOMAIN_MUST_SUCCEED callers in detach/teardown paths, such
as __iommu_group_set_core_domain and __iommu_release_dma_ownership, should
not be rejected, as the domain would be freed anyway in these nofail paths
while group->domain is still pointing to it. So pci_dev_reset_iommu_done()
could trigger a UAF when re-attaching group->domain.
Honor the IOMMU_SET_DOMAIN_MUST_SUCCEED flag, allowing the callers through
the group->recovery_cnt fence, so as to update the group->domain pointer.
Instead add a gdev->blocked check in the device iteration loop, to prevent
any concurrent per-device detachment.
iommu: Fix ATS invalidation timeouts during __iommu_remove_group_pasid()
If a device is blocked, its PASID domains are already detached. Repeating
iommu_remove_dev_pasid() is unnecessary and might trigger ATS invalidation
timeouts.
Skip the iommu_remove_dev_pasid() call upon gdev->blocked.
Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
internally while both are calling pci_dev_reset_iommu_prepare/done().
As pci_dev_reset_iommu_prepare() doesn't support re-entry, the inner call
will trigger a WARN_ON and return -EBUSY, resulting in failing the entire
device reset.
On the other hand, removing the outer calls in the PCI callers is unsafe.
As pointed out by Kevin, device-specific quirks like reset_hinic_vf_dev()
execute custom firmware waits after their inner pcie_flr() completes. If
the IOMMU protection relies solely on the inner reset, the IOMMU will be
unblocked prematurely while the device is still resetting.
Instead, fix this by making pci_dev_reset_iommu_prepare/done() reentrant.
Introduce gdev->reset_depth to handle the re-entries on the same device.
iommu: Replace per-group resetting_domain with per-gdev blocked flag
The core tracks device resetting states with a per-group resetting_domain,
while a reset is actually per group-device. Such a mismatch might lead to
confusion and even difficulty to untangle per-gdev handling requirement.
Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
internally while both are calling pci_dev_reset_iommu_prepare/done(). And
the solution requires the core to track at the group_device level as well.
Introduce a 'blocked' flag to struct group_device, to allow a multi-device
group to isolate concurrent device resets independently.
As the reset routine is per gdev, it cannot clear group->resetting_domain
without iterating over the device list to ensure no other device is being
reset. Simplify it by replacing the resetting_domain with a 'recovery_cnt'
in the struct iommu_group.
No functional change. But this is essential to apply following bug fixes.
iommu: Fix NULL group->domain dereference in pci_dev_reset_iommu_done()
Local sashiko review pointed it out that group->domain could be NULL when
a default domain fails to allocate during the first probe, which can crash
at domain->ops->attach_dev dereference in __iommu_attach_device() invoked
by pci_dev_reset_iommu_done().
pci_dev_reset_iommu_prepare() is fine as an old_domain pointer can be NULL.
Skip the re-attach in pci_dev_reset_iommu_done() to fix the bug.
iommu/amd: Bounds-check devid in __rlookup_amd_iommu()
iommu_device_register() walks every device on the PCI bus via
bus_for_each_dev() and calls amd_iommu_probe_device() for each. The
inlined check_device() path computes the device's sbdf, calls
rlookup_amd_iommu() to find the owning IOMMU, and only afterwards
verifies devid <= pci_seg->last_bdf. __rlookup_amd_iommu() indexes
rlookup_table[devid] with no bounds check of its own, so for a PCI
device whose BDF is not described by the IVRS, the lookup reads past
the end of the allocation before the caller's bounds check can run.
This was harmless before commit e874c666b15b ("iommu/amd: Change
rlookup, irq_lookup, and alias to use kvalloc()"): the table was a
zeroed page-order allocation, so the over-read returned NULL and the
caller's NULL check skipped the device. After that commit the table is
a tight kvcalloc() and the over-read returns adjacent slab contents,
which check_device() then dereferences as a struct amd_iommu *,
causing a boot-time GPF.
Seen on Google Compute Engine ct6e VMs, where the virtualized IVRS
describes only the four TPU endpoints 00:04.0-07.0; the gVNIC at
00:08.0 (devid 0x40) indexes 56 bytes past the 456-byte allocation,
into the adjacent kmalloc-512 slab object:
pci 0000:00:04.0: Adding to iommu group 0
pci 0000:00:05.0: Adding to iommu group 1
pci 0000:00:06.0: Adding to iommu group 2
pci 0000:00:07.0: Adding to iommu group 3
Oops: general protection fault, probably for non-canonical address 0x3a64695f78746382: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.18.22 #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 12/06/2025
RIP: 0010:amd_iommu_probe_device+0x54/0x3a0
Call Trace:
__iommu_probe_device+0x107/0x520
probe_iommu_group+0x29/0x50
bus_for_each_dev+0x7e/0xe0
iommu_device_register+0xc9/0x240
iommu_go_to_state+0x9c0/0x1c60
amd_iommu_init+0x14/0x40
pci_iommu_init+0x16/0x60
do_one_initcall+0x47/0x2f0
Guard the array access in __rlookup_amd_iommu(). With the fix applied
on 6.18.22, the gVNIC at 00:08.0 is skipped cleanly and the VM boots.
Fixes: e874c666b15b ("iommu/amd: Change rlookup, irq_lookup, and alias to use kvalloc()") Cc: stable@vger.kernel.org Reported-by: Ziyuan Chen <zc@anthropic.com> Tested-by: Ziyuan Chen <zc@anthropic.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Assisted-by: Claude:unspecified Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Eder Zulian [Fri, 10 Apr 2026 12:55:50 +0000 (14:55 +0200)]
iommu/amd: Remove latent out-of-bounds access in IOMMU debugfs
In iommu_mmio_write() and iommu_capability_write(), the variables
dbg_mmio_offset and dbg_cap_offset are declared as int. However, they
are populated using kstrtou32_from_user(). If a user provides a
sufficiently large value, it can become a negative integer.
Prior to this patch, the AMD IOMMU debugfs implementation was already
protected by different mechanisms.
1. #define OFS_IN_SZ 8 ensures the user string <= 8 bytes, so
e.g. 0xffffffff isn't a valid input.
if (cnt > OFS_IN_SZ)
return -EINVAL;
2. Implicit type promotion in iommu_mmio_write(), dbg_mmio_offset is int
and iommu->mmio_phys_end is u64
if (dbg_mmio_offset > iommu->mmio_phys_end - sizeof(u64))
return -EINVAL;
3. The show handlers would currently catch the negative number and
refuse to perform the read.
Replace kstrtou32_from_user() with kstrtos32_from_user() to parse the
input, and check for negative values to explicitly prevent out-of-bounds
memory accesses directly in iommu_mmio_write() and
iommu_capability_write().
Signed-off-by: Eder Zulian <ezulian@redhat.com> Fixes: 7a4ee419e8c1 ("iommu/amd: Add debugfs support to dump IOMMU MMIO registers") Cc: stable@vger.kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Andrea Righi [Mon, 11 May 2026 06:18:12 +0000 (08:18 +0200)]
sched_ext: Fix ops->priv clobber on concurrent attach/detach
Under heavy concurrent attach/detach operations, scx_claim_exit() can
trigger a NULL pointer dereference. This can be reproduced running the
reload_loop kselftests inside a virtme-ng session:
T1 acquires scx_enable_mutex inside scx_root_disable()'s mutex_unlock
window and starts a fresh attach on the same kdata, assigning sch_a800
to @ops->priv. T2 then continues out of scx_disable()/flush_disable_work
and clobbers @ops->priv to NULL, leaking sch_a800; the bpf_link is gone
but state stays SCX_ENABLED, so all future attaches fail with -EBUSY
permanently. The next bpf_scx_unreg() on that kdata then reads NULL
@ops->priv and dereferences it in scx_claim_exit().
Make @ops->priv the lifecycle binding: in scx_root_enable_workfn() and
scx_sub_enable_workfn(), after the existing state check and still under
scx_enable_mutex, refuse with -EBUSY if @ops->priv is non-NULL. This
rejects an attempt to reuse a kdata that is still bound to a previous
scheduler instance, closing the race without changing the unreg side.
Andrea Righi [Sun, 10 May 2026 17:52:11 +0000 (19:52 +0200)]
selftests/sched_ext: Fix build error in dequeue selftest
Building the dequeue selftest with newer compilers (e.g., gcc 16)
triggers the following error:
dequeue.c:28:22: error: variable 'sum' set but not used
The 'volatile' qualifier prevents the writes from being optimized away,
but does not silence the unused variable 'sum' is indeed only written
and never read.
Consume 'sum' via an empty asm() with a register input constraint. This
forces the compiler to keep the accumulated value (preserving the CPU
stress loop) and avoiding the build error.
Fixes: 658ad2259b3e ("selftests/sched_ext: Add test to validate ops.dequeue() semantics") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
cg_read_strcmp() allocated a buffer sized to strlen(expected) + 1,
then passed it to read_text() which calls read(fd, buf, size-1).
When comparing against an empty string (""), strlen("") = 0 gives a
1-byte buffer, and read() is asked to read 0 bytes. The file content
is never actually read, so strcmp("", buf) always returns 0 regardless
of the real content. This caused cg_test_proc_killed() to always
report the cgroup as empty immediately, making OOM tests pass without
verifying that processes were killed.
Signed-off-by: Hongfu Li <lihongfu@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>
Guopeng Zhang [Mon, 11 May 2026 01:31:50 +0000 (09:31 +0800)]
cgroup/dmem: Return -ENOMEM on failed pool preallocation
get_cg_pool_unlocked() handles allocation failures under dmemcg_lock by
dropping the lock, preallocating a pool with GFP_KERNEL, and retrying the
locked lookup and creation path.
If the fallback allocation fails too, pool remains NULL. Since the loop
condition is while (!pool), the function can keep retrying instead of
propagating the allocation failure to the caller.
Set pool to ERR_PTR(-ENOMEM) when the fallback allocation fails so the
loop exits through the existing common return path. The callers already
handle ERR_PTR() from get_cg_pool_unlocked(), so this restores the
expected error path.
Tejun Heo [Sun, 10 May 2026 20:08:16 +0000 (10:08 -1000)]
sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths
scx_fail_parent() leaves cgroup tasks at (state=NONE, sched=parent,
sched_class=ext) until the parent itself is torn down by the scx_error() it
raised. When the later root_disable iterates them, two paths trip on NONE.
scx_disable_and_exit_task() re-enters the wrapper at NONE: the inner switch
returns early but the trailing scx_set_task_sched(p, NULL) clobbers the
parent sched left by scx_fail_parent(), and scx_set_task_state(p, NONE)
wastes a write on an already-NONE task. switched_from_scx() then calls
scx_disable_task(), which WARNs on non-ENABLED state and writes state=READY,
producing a NONE -> READY transition the validation matrix rejects.
Treat NONE as "nothing to do" in both paths. Add a NONE early-return at the
top of scx_disable_and_exit_task() and a parallel NONE check in
switched_from_scx() next to task_dead_and_done().
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
Tejun Heo [Sun, 10 May 2026 20:08:16 +0000 (10:08 -1000)]
sched_ext: Close sub-sched init race with post-init DEAD recheck
scx_sub_enable_workfn()'s init pass and scx_sub_disable() migration both
drop the rq lock to call __scx_init_task() against the other sched. A
TASK_DEAD @p can fall through sched_ext_dead() in that window.
sched_ext_dead() runs ops.exit_task() on the sched @p was attached to, not
on the sched whose init just completed, so the new allocation leaks.
Reuse the DEAD signal set by sched_ext_dead(). After __scx_init_task()
returns, take task_rq_lock(p) and check for DEAD; on hit, call
scx_sub_init_cancel_task() against the sub sched the init ran for and drop
@p; on miss, proceed as before.
Tejun Heo [Sun, 10 May 2026 20:08:16 +0000 (10:08 -1000)]
sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
scx_root_enable_workfn() drops the iter rq lock for ops.init_task() and a
TASK_DEAD @p can fall through sched_ext_dead() in that window. The race hits
when sched_ext_dead() observes SCX_TASK_INIT (the intermediate state before
@p->scx.sched is published) and dereferences NULL via SCX_HAS_OP(NULL,
exit_task), or observes SCX_TASK_NONE during the unlocked init window and
skips cleanup so exit_task() never runs.
Add SCX_TASK_INIT_BEGIN. The enable path writes NONE -> INIT_BEGIN under the
iter rq lock, then takes the rq lock again after init to walk INIT_BEGIN ->
INIT -> READY. sched_ext_dead() that wins the rq-lock race observes
INIT_BEGIN and sets DEAD without calling into ops; the post-init recheck
unwinds via scx_sub_init_cancel_task().
scx_fork() runs single-threaded against sched_ext_dead() (the task is not on
scx_tasks until scx_post_fork() adds it) so its INIT_BEGIN -> INIT walk
needs no rq-lock pairing; it rolls back to NONE on ops.init_task() failure.
The validation matrix grows the INIT_BEGIN row and the INIT_BEGIN -> DEAD
edge; INIT now requires INIT_BEGIN as the predecessor. scx_sub_disable()'s
migration writes INIT_BEGIN as a synthetic predecessor to satisfy the
tightened verification.
The sub-sched paths still race with sched_ext_dead() during the unlocked
init window. This will be fixed by the next patch.
Tejun Heo [Sun, 10 May 2026 20:08:16 +0000 (10:08 -1000)]
sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
SCX_TASK_OFF_TASKS marked tasks already through sched_ext_dead() so cgroup
task iteration would skip them. This can be expressed better with a task
state. Replace the flag with SCX_TASK_DEAD.
scx_disable_and_exit_task() resets state to NONE on its way out, so
sched_ext_dead() now sets DEAD after the wrapper returns. The validation
matrix grows NONE -> DEAD, warns on DEAD -> NONE, and tightens READY's
predecessor to INIT or ENABLED so the new DEAD value cannot silently
transition to READY.
Prepares for the following enable vs dead race fix.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
Tejun Heo [Sun, 10 May 2026 20:08:16 +0000 (10:08 -1000)]
sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
Prepare for the SCX_TASK_INIT_BEGIN/DEAD work that follows by collapsing the
scx_init_task() helper. Move the SCX_TASK_RESET_RUNNABLE_AT setting into
scx_set_task_state() on the INIT transition (it was set unconditionally at
every INIT site through the scx_init_task() helper), inline scx_init_task()
into scx_fork() and scx_root_enable_workfn(), and drop the helper.
As a side effect, scx_sub_disable() migration sequence now also sets
RESET_RUNNABLE_AT (it previously wrote INIT directly without going through
scx_init_task()). The flag triggers a runnable_at reset on the next
set_task_runnable(), which is harmless on a task that has just been moved
between scheds.
On root-enable, p->scx.flags is written without the task's rq lock. The task
isn't visible to scx yet, and a follow-up patch restores the lock-held
write.
v2: Note p->scx.flags rq-lock relaxation on root-enable path. (Andrea)
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
Tejun Heo [Sun, 10 May 2026 20:08:16 +0000 (10:08 -1000)]
sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
Cleanups in preparation for the state-machine work that follows:
- Convert three sub-sched call sites that open-code
rcu_assign_pointer(p->scx.sched, ...) to scx_set_task_sched().
- Move scx_get_task_state()/scx_set_task_state() above the SCX task iter
section so scx_task_iter_next_locked() can use them without a forward
declaration.
No functional change.
Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
net: xgene: fix mdio_np leak in xgene_mdiobus_register()
The for_each_child_of_node() loop captures mdio_np via break,
holding the refcount. of_mdiobus_register() does not consume the
reference, so it leaks on success.
Jakub Kicinski [Sun, 10 May 2026 17:07:37 +0000 (10:07 -0700)]
Merge tag 'batadv-net-pullrequest-20260508' of https://git.open-mesh.org/batadv
Simon Wunderlich says:
====================
Here are some batman-adv bugfixes:
- fix integer overflow on buff_pos, by Lyes Bourennani
- fix invalid tp_meter access during teardown, by Jiexun Wang (2 patches)
- stop caching unowned originator pointers in BAT IV, by Jiexun Wang
- tp_meter: fix tp_num leak on kmalloc failure, by Sven Eckelmann
- fix BLA refcounting issues, by Sven Eckelmann (3 patches)
* tag 'batadv-net-pullrequest-20260508' of https://git.open-mesh.org/batadv:
batman-adv: bla: put backbone reference on failed claim hash insert
batman-adv: bla: only purge non-released claims
batman-adv: bla: prevent use-after-free when deleting claims
batman-adv: tp_meter: fix tp_num leak on kmalloc failure
batman-adv: stop caching unowned originator pointers in BAT IV
batman-adv: stop tp_meter sessions during mesh teardown
batman-adv: reject new tp_meter sessions during teardown
batman-adv: fix integer overflow on buff_pos
====================
net: ena: PHC: Fix potential use-after-free in get_timestamp
Move the phc->active check and resp pointer assignment to after
acquiring the spinlock. Previously, phc->active was checked without
holding the lock, and resp was cached from ena_dev->phc.virt_addr
before the lock was acquired.
If ena_com_phc_destroy() runs between the lockless active check and
the lock acquisition, it sets active=false, releases the lock, frees
the DMA memory, and sets virt_addr=NULL. The get_timestamp path would
then read a NULL virt_addr and dereference it.
With both the active check and the pointer read under the lock,
destroy cannot free the memory while get_timestamp is using it.
Fixes: e0ea34158ee8 ("net: ena: Add PHC support in the ENA driver") Cc: stable@vger.kernel.org Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20260508062126.7273-1-akiyano@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Chuck Lever [Sun, 19 Apr 2026 18:52:59 +0000 (14:52 -0400)]
NFSD: Fix infinite loop in layout state revocation
find_one_sb_stid() skips stids whose sc_status is non-zero, but the
SC_TYPE_LAYOUT case in nfsd4_revoke_states() never sets sc_status
before calling nfsd4_close_layout(). The retry loop therefore finds
the same layout stid on every iteration, hanging the revoker
indefinitely.
Fixes: 1e33e1414bec ("nfsd: allow layout state to be admin-revoked.") Reported-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Jeff Layton [Sat, 11 Apr 2026 21:12:16 +0000 (17:12 -0400)]
sunrpc: start cache request seqno at 1 to fix netlink GET_REQS
sunrpc_cache_requests_snapshot() filters requests with
crq->seqno <= min_seqno. The min_seqno for the first netlink
dump call is cb->args[0] which is 0. Since next_seqno was
initialized to 0, the very first cache request got seqno=0
and was silently skipped by the snapshot (0 <= 0 is true).
This caused netlink-based GET_REQS to return 0 pending requests
even when a request was queued, preventing mountd from resolving
cache entries (particularly expkey/nfsd.fh). The unresolved
CACHE_PENDING state blocked all further notifications for the
entry, leading to permanent NFS4ERR_DELAY hangs.
Start next_seqno at 1 so all requests have seqno >= 1 and pass
the snapshot filter when min_seqno is 0.
Fixes: facc4e3c8042 ("sunrpc: split cache_detail queue into request and reader lists") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>