ksmbd: fix slab-out-of-bounds in smb2_allocate_rsp_buf
If ->ProtocolId is SMB2_TRANSFORM_PROTO_NUM, smb2 request size
validation could be skipped. if request size is smaller than
sizeof(struct smb2_query_info_req), slab-out-of-bounds read can happen in
smb2_allocate_rsp_buf(). This patch allocate response buffer after
decrypting transform request. smb3_decrypt_req() will validate transform
request size and avoid slab-out-of-bound in smb2_allocate_rsp_buf().
Reported-by: Norbert Szetei <norbert@doyensec.com> Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
Merge tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull sysfs fix from Al Viro:
"Get rid of lockdep false positives around sysfs/overlayfs
syzbot has uncovered a class of lockdep false positives for setups
with sysfs being one of the backing layers in overlayfs. The root
cause is that of->mutex allocated when opening a sysfs file read-only
(which overlayfs might do) is confused with of->mutex of a file opened
writable (held in write to sysfs file, which overlayfs won't do).
Assigning them separate lockdep classes fixes that bunch and it's
obviously safe"
* tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kernfs: annotate different lockdep class for of->mutex of writable files
Merge tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Follow up fixes for the BHI mitigations code
- Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as
expected
- Work around an APIC emulation bug when the kernel is built with Clang
and run as a SEV guest
- Follow up x86 topology fixes
* tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Move TOPOEXT enablement into the topology parser
x86/cpu/amd: Make the NODEID_MSR union actually work
x86/cpu/amd: Make the CPUID 0x80000008 parser correct
x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto
x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
x86/bugs: Fix BHI handling of RRSBA
x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
x86/bugs: Fix BHI documentation
x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
x86/topology: Don't update cpu_possible_map in topo_set_cpuids()
x86/bugs: Fix return type of spectre_bhi_state()
x86/apic: Force native_apic_mem_read() to use the MOV instruction
Merge tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Fix a PREEMPT_RT build bug"
* tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking: Make rwsem_assert_held_write_nolockdep() build with PREEMPT_RT=y
Merge tag 'irq-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
"Fix a bug in the GIC irqchip driver"
* tag 'irq-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3-its: Fix VSYNC referencing an unmapped VPE on GIC v4.1
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio bugfixes from Michael Tsirkin:
"Some small, obvious (in hindsight) bugfixes:
- new ioctl in vhost-vdpa has a wrong # - not too late to fix
- vhost has apparently been lacking an smp_rmb() - due to code
duplication :( The duplication will be fixed in the next merge
cycle, this is a minimal fix
- an error message in vhost talks about guest moving used index -
which of course never happens, guest only ever moves the available
index
- i2c-virtio didn't set the driver owner so it did not get refcounted
correctly"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: correct misleading printing information
vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE
virtio: store owner from modules with register_virtio_driver()
vhost: Add smp_rmb() in vhost_enable_notify()
vhost: Add smp_rmb() in vhost_vq_avail_empty()
Merge tag 'dma-maping-6.9-2024-04-14' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- fix up swiotlb buffer padding even more (Petr Tesarik)
- fix for partial dma_sync on swiotlb (Michael Kelley)
- swiotlb debugfs fix (Dexuan Cui)
* tag 'dma-maping-6.9-2024-04-14' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: do not set total_used to 0 in swiotlb_create_debugfs_files()
swiotlb: fix swiotlb_bounce() to do partial sync's correctly
swiotlb: extend buffer pre-padding to alloc_align_mask if necessary
Amir Goldstein [Fri, 5 Apr 2024 14:56:35 +0000 (17:56 +0300)]
kernfs: annotate different lockdep class for of->mutex of writable files
The writable file /sys/power/resume may call vfs lookup helpers for
arbitrary paths and readonly files can be read by overlayfs from vfs
helpers when sysfs is a lower layer of overalyfs.
To avoid a lockdep warning of circular dependency between overlayfs
inode lock and kernfs of->mutex, use a different lockdep class for
writable and readonly kernfs files.
Reported-by: syzbot+9a5b0ced8b1bfb238b56@syzkaller.appspotmail.com Fixes: 0fedefd4c4e3 ("kernfs: sysfs: support custom llseek method for sysfs entries") Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge tag 'ata-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fixes from Damien Le Moal:
- Add the mask_port_map parameter to the ahci driver. This is a
follow-up to the recent snafu with the ASMedia controller and its
virtual port hidding port-multiplier devices. As ASMedia confirmed
that there is no way to determine if these slow-to-probe virtual
ports are actually representing the ports of a port-multiplier
devices, this new parameter allow masking ports to significantly
speed up probing during system boot, resulting in shorter boot times.
- A fix for an incorrect handling of a port unlock in
ata_scsi_dev_rescan().
- Allow command duration limits to be detected for ACS-4 devices are
there are such devices out in the field.
* tag 'ata-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: libata-core: Allow command duration limits detection for ACS-4 drives
ata: libata-scsi: Fix ata_scsi_dev_rescan() error path
ata: ahci: Add mask_port_map module parameter
Merge tag 'v6.9-rc3-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- fix for oops in cifs_get_fattr of deleted files
- fix for the remote open counter going negative in some directory
lease cases
- fix for mkfifo to instantiate dentry to avoid possible crash
- important fix to allow handling key rotation for mount and remount
(ie cases that are becoming more common when password that was used
for the mount will expire soon but will be replaced by new password)
* tag 'v6.9-rc3-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb3: fix broken reconnect when password changing on the server by allowing password rotation
smb: client: instantiate when creating SFU files
smb3: fix Open files on server counter going negative
smb: client: fix NULL ptr deref in cifs_mark_open_handles_for_deleted_file()
Igor Pylypiv [Thu, 11 Apr 2024 20:12:24 +0000 (20:12 +0000)]
ata: libata-core: Allow command duration limits detection for ACS-4 drives
Even though the command duration limits (CDL) feature was first added
in ACS-5 (major version 12), there are some ACS-4 (major version 11)
drives that implement CDL as well.
IDENTIFY_DEVICE, SUPPORTED_CAPABILITIES, and CURRENT_SETTINGS log pages
are mandatory in the ACS-4 standard so it should be safe to read these
log pages on older drives implementing the ACS-4 standard.
Fixes: 62e4a60e0cdb ("scsi: ata: libata: Detect support for command duration limits") Cc: stable@vger.kernel.org Signed-off-by: Igor Pylypiv <ipylypiv@google.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Commit 0c76106cb975 ("scsi: sd: Fix TCG OPAL unlock on system resume")
incorrectly handles failures of scsi_resume_device() in
ata_scsi_dev_rescan(), leading to a double call to
spin_unlock_irqrestore() to unlock a device port. Fix this by redefining
the goto labels used in case of errors and only unlock the port
scsi_scan_mutex when scsi_resume_device() fails.
Bug found with the Smatch static checker warning:
drivers/ata/libata-scsi.c:4774 ata_scsi_dev_rescan()
error: double unlocked 'ap->lock' (orig line 4757)
Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 0c76106cb975 ("scsi: sd: Fix TCG OPAL unlock on system resume") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Niklas Cassel <cassel@kernel.org>
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Catalin Marinas:
"Fix the TLBI RANGE operand calculation causing live migration under
KVM/arm64 to miss dirty pages due to stale TLB entries"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: tlb: Fix TLBI RANGE operand
Merge tag 'soc-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC fixes from Arnd Bergmann:
"The device tree changes this time are all for NXP i.MX platforms,
addressing issues with clocks and regulators on i.MX7 and i.MX8.
The old OMAP2 based Nokia N8x0 tablet get a couple of code fixes for
regressions that came in.
The ARM SCMI and FF-A firmware interfaces get a couple of minor bug
fixes.
A regression fix for RISC-V cache management addresses a problem with
probe order on Sifive cores"
* tag 'soc-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (23 commits)
MAINTAINERS: Change Krzysztof Kozlowski's email address
arm64: dts: imx8qm-ss-dma: fix can lpcg indices
arm64: dts: imx8-ss-dma: fix can lpcg indices
arm64: dts: imx8-ss-dma: fix adc lpcg indices
arm64: dts: imx8-ss-dma: fix pwm lpcg indices
arm64: dts: imx8-ss-dma: fix spi lpcg indices
arm64: dts: imx8-ss-conn: fix usb lpcg indices
arm64: dts: imx8-ss-lsio: fix pwm lpcg indices
ARM: dts: imx7s-warp: Pass OV2680 link-frequencies
ARM: dts: imx7-mba7: Use 'no-mmc' property
arm64: dts: imx8-ss-conn: fix usdhc wrong lpcg clock order
arm64: dts: freescale: imx8mp-venice-gw73xx-2x: fix USB vbus regulator
arm64: dts: freescale: imx8mp-venice-gw72xx-2x: fix USB vbus regulator
cache: sifive_ccache: Partially convert to a platform driver
firmware: arm_scmi: Make raw debugfs entries non-seekable
firmware: arm_scmi: Fix wrong fastchannel initialization
firmware: arm_ffa: Fix the partition ID check in ffa_notification_info_get()
ARM: OMAP2+: fix USB regression on Nokia N8x0
mmc: omap: restore original power up/down steps
mmc: omap: fix deferred probe
...
Merge tag 'iommu-fixes-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- Intel VT-d Fixes:
- Allocate local memory for PRQ page
- Fix WARN_ON in iommu probe path
- Fix wrong use of pasid config
- AMD IOMMU Fixes:
- Lock inversion fix
- Log message severity fix
- Disable SNP when v2 page-tables are used
- Mediatek driver:
- Fix module autoloading
* tag 'iommu-fixes-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/amd: Change log message severity
iommu/vt-d: Fix WARN_ON in iommu probe path
iommu/vt-d: Allocate local memory for page request queue
iommu/vt-d: Fix wrong use of pasid config
iommu: mtk: fix module autoloading
iommu/amd: Do not enable SNP when V2 page table is enabled
iommu/amd: Fix possible irq lock inversion dependency issue
Merge tag 'pci-v6.9-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fixes from Bjorn Helgaas:
- Revert a quirk that prevented Secondary Bus Reset for LSI / Agere
FW643.
We thought the device was broken, but the reset does work correctly
on other platforms, and the reset avoids leaking data out of VMs
(Bjorn Helgaas)
- Update MAINTAINERS to reflect that Gustavo Pimentel is no longer
reachable (Manivannan Sadhasivam)
* tag 'pci-v6.9-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
Revert "PCI: Mark LSI FW643 to avoid bus reset"
MAINTAINERS: Drop Gustavo Pimentel as PCI DWC Maintainer
* tag 'block-6.9-20240412' of git://git.kernel.dk/linux:
block: fix that blk_time_get_ns() doesn't update time after schedule
block: allow device to have both virt_boundary_mask and max segment size
block: fix q->blkg_list corruption during disk rebind
blk-iocost: avoid out of bounds shift
raid1: fix use-after-free for original bio in raid1_write_request()
Merge tag 'io_uring-6.9-20240412' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Fix for sigmask restoring while waiting for events (Alexey)
- Typo fix in comment (Haiyue)
- Fix for a msg_control retstore on SEND_ZC retries (Pavel)
* tag 'io_uring-6.9-20240412' of git://git.kernel.dk/linux:
io-uring: correct typo in comment for IOU_F_TWQ_LAZY_WAKE
io_uring/net: restore msg_control on sendzc retry
io_uring: Fix io_cqring_wait() not restoring sigmask on get_timespec64() failure
Merge tag 'ceph-for-6.9-rc4' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Two CephFS fixes marked for stable and a MAINTAINERS update"
* tag 'ceph-for-6.9-rc4' of https://github.com/ceph/ceph-client:
MAINTAINERS: remove myself as a Reviewer for Ceph
ceph: switch to use cap_delay_lock for the unlink delay list
ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE
Commit d96c36004e31 ("tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig
entry") removed a hidden tab because it apparently showed breakage in
some third-party kernel config parsing tool.
It wasn't clear what tool it was, but let's make sure it gets fixed.
Because if you can't parse tabs as whitespace, you should not be parsing
the kernel Kconfig files.
In fact, let's make such breakage more obvious than some esoteric ftrace
record size option. If you can't parse tabs, you can't have page sizes.
Yes, tab-vs-space confusion is sadly a traditional Unix thing, and
'make' is famous for being broken in this regard. But no, that does not
mean that it's ok.
I'd add more random tabs to our Kconfig files, but I don't want to make
things uglier than necessary. But it *might* bbe necessary if it turns
out we see more of this kind of silly tooling.
Merge tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix the buffer_percent accounting as it is dependent on three
variables:
1) pages_read - number of subbuffers read
2) pages_lost - number of subbuffers lost due to overwrite
3) pages_touched - number of pages that a writer entered
These three counters only increment, and to know how many active
pages there are on the buffer at any given time, the pages_read and
pages_lost are subtracted from pages_touched.
But the pages touched was incremented whenever any writer went to the
next subbuffer even if it wasn't the only one, so it was incremented
more than it should be causing the counter for how many subbuffers
currently have content incorrect, which caused the buffer_percent
that holds waiters until the ring buffer is filled to a given
percentage to wake up early.
- Fix warning of unused functions when PERF_EVENTS is not configured in
- Replace bad tab with space in Kconfig for FTRACE_RECORD_RECURSION_SIZE
- Fix to some kerneldoc function comments in eventfs code.
* tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Only update pages_touched when a new page is touched
tracing: hide unused ftrace_event_id_fops
tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig entry
eventfs: Fix kernel-doc comments to functions
Merge tag 'drm-fixes-2024-04-12' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Looks like everyone woke up after holidays, this weeks pull has a
bunch of stuff all over, 2 weeks worth of amdgpu is a lot of it, then
i915/xe have a few, a bunch of msm fixes, then some scattered driver
fixes.
I expect things will settle down for rc5.
client:
- Protect connector modes with mode_config mutex
ast:
- Fix soft lockup
host1x:
- Do not setup DMA for virtual addresses
ivpu:
- Fix deadlock in context_xa
- PCI fixes
- Fixes to error handling
nouveau:
- gsp: Fix OOB access
- Fix casting
panfrost:
- Fix error path in MMU code
qxl:
- Revert "drm/qxl: simplify qxl_fence_wait"
vmwgfx:
- Enable DMA for SEV mappings
i915:
- Couple CDCLK programming fixes
- HDCP related fix
- 4 Bigjoiner related fixes
- Fix for a circular locking around GuC on reset+wedged case
msm:
- DP refcount leak fix on disconnect
- Add missing newlines to prints in msm_fb and msm_kms
- fix dpu debugfs entry permissions
- Fix the interface table for the catalog of X1E80100
- fix irq message printing
- Bindings fix to add DP node as child of mdss for mdss node
- Minor typo fix in DP driver API which handles port status change
- fix CHRASHDUMP_READ()
- fix HHB (highest bank bit) for a619 to fix UBWC corruption
* tag 'drm-fixes-2024-04-12' of https://gitlab.freedesktop.org/drm/kernel: (65 commits)
amdkfd: use calloc instead of kzalloc to avoid integer overflow
drm/xe: Label RING_CONTEXT_CONTROL as masked
drm/xe/xe_migrate: Cast to output precision before multiplying operands
drm/xe/hwmon: Cast result to output precision on left shift of operand
drm/xe/display: Fix double mutex initialization
drm/amdgpu: differentiate external rev id for gfx 11.5.0
drm/amd/display: Adjust dprefclk by down spread percentage.
drm/amd/display: Set VSC SDP Colorimetry same way for MST and SST
drm/amd/display: Program VSC SDP colorimetry for all DP sinks >= 1.4
drm/amd/display: fix disable otg wa logic in DCN316
drm/amd/display: Do not recursively call manual trigger programming
drm/amd/display: always reset ODM mode in context when adding first plane
drm/amdgpu: fix incorrect number of active RBs for gfx11
drm/amd/display: Return max resolution supported by DWB
amd/amdkfd: sync all devices to wait all processes being evicted
drm/amdgpu: clear set_q_mode_offs when VM changed
drm/amdgpu: Fix VCN allocation in CPX partition
drm/amd/pm: fix the high voltage issue after unload
drm/amd/display: Skip on writeback when it's not applicable
drm/amdgpu: implement IRQ_STATE_ENABLE for SDMA v4.4.2
...
block: fix that blk_time_get_ns() doesn't update time after schedule
While monitoring the throttle time of IO from iocost, it's found that
such time is always zero after the io_schedule() from ioc_rqos_throttle,
for example, with the following debug patch:
+ printk("%s-%d: %s enter %llu\n", current->comm, current->pid, __func__, blk_time_get_ns());
while (true) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (wait.committed)
break;
io_schedule();
}
+ printk("%s-%d: %s exit %llu\n", current->comm, current->pid, __func__, blk_time_get_ns());
It can be observerd that blk_time_get_ns() always return the same time:
And I think the root cause is that 'PF_BLOCK_TS' is always cleared
by blk_flush_plug() before scheduel(), hence blk_plug_invalidate_ts()
will never be called:
io_schedule:
io_schedule_prepare
blk_flush_plug
__blk_flush_plug
/* the flag is cleared, while time is not */
current->flags &= ~PF_BLOCK_TS;
schedule
sched_update_worker
/* the flag is not set, hence plug->cur_ktime is not cleared */
if (tsk->flags & PF_BLOCK_TS)
blk_plug_invalidate_ts()
blk_time_get_ns
/* got the time stashed before schedule */
return plug->cur_ktime;
Fix the problem by clearing cached time in __blk_flush_plug().
John Stultz [Wed, 10 Apr 2024 23:26:30 +0000 (16:26 -0700)]
selftests: timers: Fix abs() warning in posix_timers test
Building with clang results in the following warning:
posix_timers.c:69:6: warning: absolute value function 'abs' given an
argument of type 'long long' but has parameter of type 'int' which may
cause truncation of value [-Wabsolute-value]
if (abs(diff - DELAY * USECS_PER_SEC) > USECS_PER_SEC / 2) {
^
So switch to using llabs() instead.
selftests: kselftest: Mark functions that unconditionally call exit() as __noreturn
After commit 6d029c25b71f ("selftests/timers/posix_timers: Reimplement
check_timer_distribution()"), clang warns:
tools/testing/selftests/timers/../kselftest.h:398:6: warning: variable 'major' is used uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
398 | if (uname(&info) || sscanf(info.release, "%u.%u.", &major, &minor) != 2)
| ^~~~~~~~~~~~
tools/testing/selftests/timers/../kselftest.h:401:9: note: uninitialized use occurs here
401 | return major > min_major || (major == min_major && minor >= min_minor);
| ^~~~~
tools/testing/selftests/timers/../kselftest.h:398:6: note: remove the '||' if its condition is always false
398 | if (uname(&info) || sscanf(info.release, "%u.%u.", &major, &minor) != 2)
| ^~~~~~~~~~~~~~~
tools/testing/selftests/timers/../kselftest.h:395:20: note: initialize the variable 'major' to silence this warning
395 | unsigned int major, minor;
| ^
| = 0
This is a false positive because if uname() fails, ksft_exit_fail_msg()
will be called, which unconditionally calls exit(), a noreturn function.
However, clang does not know that ksft_exit_fail_msg() will call exit() at
the point in the pipeline that the warning is emitted because inlining has
not occurred, so it assumes control flow will resume normally after
ksft_exit_fail_msg() is called.
Make it clear to clang that all of the functions that call exit()
unconditionally in kselftest.h are noreturn transitively by marking them
explicitly with '__attribute__((__noreturn__))', which clears up the
warning above and any future warnings that may appear for the same reason.
After commit 6d029c25b71f ("selftests/timers/posix_timers: Reimplement
check_timer_distribution()") the following warning occurs when building
with an older gcc:
posix_timers.c:250:2: warning: format not a string literal and no format arguments [-Wformat-security]
250 | ksft_print_msg(errmsg);
| ^~~~~~~~~~~~~~
Fix this up by changing it to ksft_print_msg("%s", errmsg)
Lu Baolu [Thu, 11 Apr 2024 03:07:44 +0000 (11:07 +0800)]
iommu/vt-d: Fix WARN_ON in iommu probe path
Commit 1a75cc710b95 ("iommu/vt-d: Use rbtree to track iommu probed
devices") adds all devices probed by the iommu driver in a rbtree
indexed by the source ID of each device. It assumes that each device
has a unique source ID. This assumption is incorrect and the VT-d
spec doesn't state this requirement either.
The reason for using a rbtree to track devices is to look up the device
with PCI bus and devfunc in the paths of handling ATS invalidation time
out error and the PRI I/O page faults. Both are PCI ATS feature related.
Only track the devices that have PCI ATS capabilities in the rbtree to
avoid unnecessary WARN_ON in the iommu probe path. Otherwise, on some
platforms below kernel splat will be displayed and the iommu probe results
in failure.
Thomas Gleixner [Thu, 11 Apr 2024 16:55:38 +0000 (18:55 +0200)]
x86/cpu/amd: Move TOPOEXT enablement into the topology parser
The topology rework missed that early_init_amd() tries to re-enable the
Topology Extensions when the BIOS disabled them.
The new parser is invoked before early_init_amd() so the re-enable attempt
happens too late.
Move it into the AMD specific topology parser code where it belongs.
Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/878r1j260l.ffs@tglx
Thomas Gleixner [Wed, 10 Apr 2024 19:45:28 +0000 (21:45 +0200)]
x86/cpu/amd: Make the NODEID_MSR union actually work
A system with NODEID_MSR was reported to crash during early boot without
any output.
The reason is that the union which is used for accessing the bitfields in
the MSR is written wrongly and the resulting executable code accesses the
wrong part of the MSR data.
As a consequence a later division by that value results in 0 and that
result is used for another division as divisor, which obviously does not
work well.
The magic world of C, unions and bitfields:
union {
u64 bita : 3,
bitb : 3;
u64 all;
} x;
x.all = foo();
a = x.bita;
b = x.bitb;
results in the effective executable code of:
a = b = x.bita;
because bita and bitb are treated as union members and therefore both end
up at bit offset 0.
Rework the NODEID_MSR union in exactly that way to cure the problem.
Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser") Reported-by: "kernelci.org bot" <bot@kernelci.org> Reported-by: Laura Nao <laura.nao@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Laura Nao <laura.nao@collabora.com> Link: https://lore.kernel.org/r/20240410194311.596282919@linutronix.de Closes: https://lore.kernel.org/all/20240322175210.124416-1-laura.nao@collabora.com/
Thomas Gleixner [Wed, 10 Apr 2024 19:45:27 +0000 (21:45 +0200)]
x86/cpu/amd: Make the CPUID 0x80000008 parser correct
CPUID 0x80000008 ECX.cpu_nthreads describes the number of threads in the
package. The parser uses this value to initialize the SMT domain level.
That's wrong because cpu_nthreads does not describe the number of threads
per physical core. So this needs to set the CORE domain level and let the
later parsers set the SMT shift if available.
Preset the SMT domain level with the assumption of one thread per core,
which is correct ifrt here are no other CPUID leafs to parse, and propagate
cpu_nthreads and the core level APIC bitwidth into the CORE domain.
Fixes: f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology parser") Reported-by: "kernelci.org bot" <bot@kernelci.org> Reported-by: Laura Nao <laura.nao@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Laura Nao <laura.nao@collabora.com> Link: https://lore.kernel.org/r/20240410194311.535206450@linutronix.de
x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
For consistency with the other CONFIG_MITIGATION_* options, replace the
CONFIG_SPECTRE_BHI_{ON,OFF} options with a single
CONFIG_MITIGATION_SPECTRE_BHI option.
iommu/amd: Do not enable SNP when V2 page table is enabled
DTE[Mode]=0 is not supported when SNP is enabled in the host. That means
to support SNP, IOMMU must be configured with V1 page table (See IOMMU
spec [1] for the details). If user passes kernel command line to configure
IOMMU domains with v2 page table (amd_iommu=pgtbl_v2) then disable SNP
as the user asked by not forcing the page table to v1.
Dave Airlie [Fri, 12 Apr 2024 01:01:44 +0000 (11:01 +1000)]
Merge tag 'drm-msm-next-2024-04-11' of https://gitlab.freedesktop.org/drm/msm into drm-fixes
Fixes for v6.9
Display:
- Fixes for PM refcount leak when DP goes to disconnected state and
also when link training fails. This is also one of the issues found
with the pm runtime series
- Add missing newlines to prints in msm_fb and msm_kms
- Change permissions of some dpu debugfs entries which write to const
data from catalog to read-only to avoid protection faults
- Fix the interface table for the catalog of X1E80100. This is an
important fix to bringup DP for X1E80100.
- Logging fix to print the callback symbol in the invalid IRQ message
case rather than printing when its known to be NULL.
- Bindings fix to add DP node as child of mdss for mdss node
- Minor typo fix in DP driver API which handles port status change
GPU:
- fix CHRASHDUMP_READ()
- fix HHB (highest bank bit) for a619 to fix UBWC corruption
Merge tag 'cxl-fixes-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl fixes from Dave Jiang:
- Fix index of Clear Event Record handles in cxl_clear_event_record()
- Fix use before init of map->reg_type in cxl_decode_regblock()
- Fix initialization of mbox_cmd.size_out in cxl_mem_get_records_log()
- Fix CXL path access_coordinate computation:
- Remove unneded check of iter in loop
- Fix of retrieving of access_coordinate in PCI topology walk
- Fix of incorrect region access_coordinate data calculation
- Consolidate of access_coordinates attached to downstream port
context
- Add check to validate access_coordinate validity to prevent
incorrect data being exposed via sysfs
* tag 'cxl-fixes-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl: Add checks to access_coordinate calculation to fail missing data
cxl: Consolidate dport access_coordinate ->hb_coord and ->sw_coord into ->coord
cxl: Fix incorrect region perf data calculation
cxl: Fix retrieving of access_coordinates in PCIe path
cxl: Remove checking of iter in cxl_endpoint_get_perf_coordinates()
cxl/core: Fix initialization of mbox_cmd.size_out in get event
cxl/core/regs: Fix usage of map->reg_type in cxl_decode_regblock() before assigned
cxl/mem: Fix for the index of Clear Event Record Handle
Merge tag 'hyperv-fixes-signed-20240411' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- Some cosmetic changes (Erni Sri Satya Vennela, Li Zhijian)
- Introduce hv_numa_node_to_pxm_info() (Nuno Das Neves)
- Fix KVP daemon to handle IPv4 and IPv6 combination for keyfile format
(Shradha Gupta)
- Avoid freeing decrypted memory in a confidential VM (Rick Edgecombe
and Michael Kelley)
* tag 'hyperv-fixes-signed-20240411' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
Drivers: hv: vmbus: Don't free ring buffers that couldn't be re-encrypted
uio_hv_generic: Don't free decrypted memory
hv_netvsc: Don't free decrypted memory
Drivers: hv: vmbus: Track decrypted status in vmbus_gpadl
Drivers: hv: vmbus: Leak pages if set_memory_encrypted() fails
hv/hv_kvp_daemon: Handle IPv4 and Ipv6 combination for keyfile format
hv: vmbus: Convert sprintf() family to sysfs_emit() family
mshyperv: Introduce hv_numa_node_to_pxm_info()
x86/hyperv: Cosmetic changes for hv_apic.c
ring-buffer: Only update pages_touched when a new page is touched
The "buffer_percent" logic that is used by the ring buffer splice code to
only wake up the tasks when there's no data after the buffer is filled to
the percentage of the "buffer_percent" file is dependent on three
variables that determine the amount of data that is in the ring buffer:
1) pages_read - incremented whenever a new sub-buffer is consumed
2) pages_lost - incremented every time a writer overwrites a sub-buffer
3) pages_touched - incremented when a write goes to a new sub-buffer
Basically, the amount of data is the total number of sub-bufs that have been
touched, minus the number of sub-bufs lost and sub-bufs consumed. This is
divided by the total count to give the buffer percentage. When the
percentage is greater than the value in the "buffer_percent" file, it
wakes up splice readers waiting for that amount.
It was observed that over time, the amount read from the splice was
constantly decreasing the longer the trace was running. That is, if one
asked for 60%, it would read over 60% when it first starts tracing, but
then it would be woken up at under 60% and would slowly decrease the
amount of data read after being woken up, where the amount becomes much
less than the buffer percent.
This was due to an accounting of the pages_touched incrementation. This
value is incremented whenever a writer transfers to a new sub-buffer. But
the place where it was incremented was incorrect. If a writer overflowed
the current sub-buffer it would go to the next one. If it gets preempted
by an interrupt at that time, and the interrupt performs a trace, it too
will end up going to the next sub-buffer. But only one should increment
the counter. Unfortunately, that was not the case.
Change the cmpxchg() that does the real switch of the tail-page into a
try_cmpxchg(), and on success, perform the increment of pages_touched. This
will only increment the counter once for when the writer moves to a new
sub-buffer, and not when there's a race and is incremented for when a
writer and its preempting writer both move to the same new sub-buffer.
Link: https://lore.kernel.org/linux-trace-kernel/20240409151309.0d0e5056@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 2c2b0a78b3739 ("ring-buffer: Add percentage of ring buffer full to wake up reader") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When CONFIG_PERF_EVENTS, a 'make W=1' build produces a warning about the
unused ftrace_event_id_fops variable:
kernel/trace/trace_events.c:2155:37: error: 'ftrace_event_id_fops' defined but not used [-Werror=unused-const-variable=]
2155 | static const struct file_operations ftrace_event_id_fops = {
Hide this in the same #ifdef as the reference to it.
Steve French [Thu, 4 Apr 2024 23:06:56 +0000 (18:06 -0500)]
smb3: fix broken reconnect when password changing on the server by allowing password rotation
There are various use cases that are becoming more common in which password
changes are scheduled on a server(s) periodically but the clients connected
to this server need to stay connected (even in the face of brief network
reconnects) due to mounts which can not be easily unmounted and mounted at
will, and servers that do password rotation do not always have the ability
to tell the clients exactly when to the new password will be effective,
so add support for an alt password ("password2=") on mount (and also
remount) so that we can anticipate the upcoming change to the server
without risking breaking existing mounts.
An alternative would have been to use the kernel keyring for this but the
processes doing the reconnect do not have access to the keyring but do
have access to the ses structure.
Reviewed-by: Bharath SM <bharathsm@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
Paulo Alcantara [Tue, 9 Apr 2024 14:28:59 +0000 (11:28 -0300)]
smb: client: instantiate when creating SFU files
In cifs_sfu_make_node(), on success, instantiate rather than leave it
with dentry unhashed negative to support callers that expect mknod(2)
to always instantiate.
Cc: stable@vger.kernel.org Fixes: 72bc63f5e23a ("smb3: fix creating FIFOs when mounting with "sfu" mount option") Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
Steve French [Sun, 7 Apr 2024 04:16:08 +0000 (23:16 -0500)]
smb3: fix Open files on server counter going negative
We were decrementing the count of open files on server twice
for the case where we were closing cached directories.
Fixes: 8e843bf38f7b ("cifs: return a single-use cfid if we did not get a lease") Cc: stable@vger.kernel.org Acked-by: Bharath SM <bharathsm@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
Jeff Layton [Tue, 9 Apr 2024 11:01:57 +0000 (07:01 -0400)]
MAINTAINERS: remove myself as a Reviewer for Ceph
It has been a couple of years since I stepped down as CephFS maintainer.
I'm not involved in any meaningful way with the project these days, so
while I'm happy to help review the occasional patch, I don't need to be
cc'ed on all of them.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Merge tag 'acpi-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix the handling of dependencies between devices in the ACPI
device enumeration code and address a _UID matching regression from
the 6.8 development cycle.
Specifics:
- Modify the ACPI device enumeration code to avoid counting
dependencies that have been met already as unmet (Hans de Goede)
- Make _UID matching take the integer value of 0 into account as
appropriate (Raag Jadav)"
* tag 'acpi-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: bus: allow _UID matching for integer zero
ACPI: scan: Do not increase dep_unmet for already met dependencies
Merge tag 'pm-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Fix the suspend-to-idle core code to guarantee that timers queued on
CPUs other than the one that has first left the idle state, which
should expire directly after resume, will be handled (Anna-Maria
Behnsen)"
* tag 'pm-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: s2idle: Make sure CPUs will wakeup directly on resume
- drv: virtio_net: fix guest hangup on invalid RSS update
- drv: mlx5e: Fix mlx5e_priv_init() cleanup flow
- dsa: mt7530: trap link-local frames regardless of ST Port State"
* tag 'net-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (59 commits)
net: ena: Set tx_info->xdpf value to NULL
net: ena: Fix incorrect descriptor free behavior
net: ena: Wrong missing IO completions check order
net: ena: Fix potential sign extension issue
af_unix: Fix garbage collector racing against connect()
net: dsa: mt7530: trap link-local frames regardless of ST Port State
Revert "s390/ism: fix receive message buffer allocation"
net: sparx5: fix wrong config being used when reconfiguring PCS
net/mlx5: fix possible stack overflows
net/mlx5: Disallow SRIOV switchdev mode when in multi-PF netdev
net/mlx5e: RSS, Block XOR hash with over 128 channels
net/mlx5e: Do not produce metadata freelist entries in Tx port ts WQE xmit
net/mlx5e: HTB, Fix inconsistencies with QoS SQs number
net/mlx5e: Fix mlx5e_priv_init() cleanup flow
net/mlx5e: RSS, Block changing channels number when RXFH is configured
net/mlx5: Correctly compare pkt reformat ids
net/mlx5: Properly link new fs rules into the tree
net/mlx5: offset comp irq index in name by one
net/mlx5: Register devlink first under devlink lock
net/mlx5: E-switch, store eswitch pointer before registering devlink_param
...
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"The most important fix is the sg one because the regression it fixes
(spurious warning and use after final put) is already backported to
stable.
The next biggest impact is the target fix for wrong credentials used
to load a module because it's affecting new kernels installed on
selinux based distributions.
The other three fixes are an obvious off by one and SATA protocol
issues"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: qla2xxx: Fix off by one in qla_edif_app_getstats()
scsi: hisi_sas: Modify the deadline for ata_wait_after_reset()
scsi: hisi_sas: Handle the NCQ error returned by D2H frame
scsi: target: Fix SELinux error when systemd-modules loads the target module
scsi: sg: Avoid race in error handling & drop bogus warn
Merge tag 'loongarch-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
- make {virt, phys, page, pfn} translation work with KFENCE for
LoongArch (otherwise NVMe and virtio-blk cannot work with KFENCE
enabled)
- update dts files for Loongson-2K series to make devices work
correctly
- fix a build error
* tag 'loongarch-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: Include linux/sizes.h in addrspace.h to prevent build errors
LoongArch: Update dts for Loongson-2K2000 to support GMAC/GNET
LoongArch: Update dts for Loongson-2K2000 to support PCI-MSI
LoongArch: Update dts for Loongson-2K2000 to support ISA/LPC
LoongArch: Update dts for Loongson-2K1000 to support ISA/LPC
LoongArch: Make virt_addr_valid()/__virt_addr_valid() work with KFENCE
LoongArch: Make {virt, phys, page, pfn} translation work with KFENCE
mm: Move lowmem_page_address() a little later
Merge tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs
Pull more bcachefs fixes from Kent Overstreet:
"Notable user impacting bugs
- On multi device filesystems, recovery was looping in
btree_trans_too_many_iters(). This checks if a transaction has
touched too many btree paths (because of iteration over many keys),
and isuses a restart to drop unneeded paths.
But it's now possible for some paths to exceed the previous limit
without iteration in the interior btree update path, since the
transaction commit will do alloc updates for every old and new
btree node, and during journal replay we don't use the btree write
buffer for locking reasons and thus those updates use btree paths
when they wouldn't normally.
- Fix a corner case in rebalance when moving extents on a
durability=0 device. This wouldn't be hit when a device was
formatted with durability=0 since in that case we'll only use it as
a write through cache (only cached extents will live on it), but
durability can now be changed on an existing device.
- bch2_get_acl() could rarely forget to handle a transaction restart;
this manifested as the occasional missing acl that came back after
dropping caches.
- Fix a major performance regression on high iops multithreaded write
workloads (only since 6.9-rc1); a previous fix for a deadlock in
the interior btree update path to check the journal watermark
introduced a dependency on the state of btree write buffer flushing
that we didn't want.
- Assorted other repair paths and recovery fixes"
* tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs: (25 commits)
bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()
bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail()
bcachefs: Fix a race in btree_update_nodes_written()
bcachefs: btree_node_scan: Respect member.data_allowed
bcachefs: Don't scan for btree nodes when we can reconstruct
bcachefs: Fix check_topology() when using node scan
bcachefs: fix eytzinger0_find_gt()
bcachefs: fix bch2_get_acl() transaction restart handling
bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu list
bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take()
bcachefs: Rename struct field swap to prevent macro naming collision
MAINTAINERS: Add entry for bcachefs documentation
Documentation: filesystems: Add bcachefs toctree
bcachefs: JOURNAL_SPACE_LOW
bcachefs: Disable errors=panic for BCH_IOCTL_FSCK_OFFLINE
bcachefs: Fix BCH_IOCTL_FSCK_OFFLINE for encrypted filesystems
bcachefs: fix rand_delete unit test
bcachefs: fix ! vs ~ typo in __clear_bit_le64()
bcachefs: Fix rebalance from durability=0 device
bcachefs: Print shutdown journal sequence number
...
Lucas De Marchi [Fri, 5 Apr 2024 20:07:11 +0000 (13:07 -0700)]
drm/xe/display: Fix double mutex initialization
All of these mutexes are already initialized by the display side since
commit 3fef3e6ff86a ("drm/i915: move display mutex inits to display
code"), so the xe shouldn´t initialize them.
David Arinzon [Wed, 10 Apr 2024 09:13:58 +0000 (09:13 +0000)]
net: ena: Set tx_info->xdpf value to NULL
The patch mentioned in the `Fixes` tag removed the explicit assignment
of tx_info->xdpf to NULL with the justification that there's no need
to set tx_info->xdpf to NULL and tx_info->num_of_bufs to 0 in case
of a mapping error. Both values won't be used once the mapping function
returns an error, and their values would be overridden by the next
transmitted packet.
While both values do indeed get overridden in the next transmission
call, the value of tx_info->xdpf is also used to check whether a TX
descriptor's transmission has been completed (i.e. a completion for it
was polled).
An example scenario:
1. Mapping failed, tx_info->xdpf wasn't set to NULL
2. A VF reset occurred leading to IO resource destruction and
a call to ena_free_tx_bufs() function
3. Although the descriptor whose mapping failed was freed by the
transmission function, it still passes the check
if (!tx_info->skb)
(skb and xdp_frame are in a union)
4. The xdp_frame associated with the descriptor is freed twice
This patch returns the assignment of NULL to tx_info->xdpf to make the
cleaning function knows that the descriptor is already freed.
Fixes: 504fd6a5390c ("net: ena: fix DMA mapping function issues in XDP") Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
David Arinzon [Wed, 10 Apr 2024 09:13:57 +0000 (09:13 +0000)]
net: ena: Fix incorrect descriptor free behavior
ENA has two types of TX queues:
- queues which only process TX packets arriving from the network stack
- queues which only process TX packets forwarded to it by XDP_REDIRECT
or XDP_TX instructions
The ena_free_tx_bufs() cycles through all descriptors in a TX queue
and unmaps + frees every descriptor that hasn't been acknowledged yet
by the device (uncompleted TX transactions).
The function assumes that the processed TX queue is necessarily from
the first category listed above and ends up using napi_consume_skb()
for descriptors belonging to an XDP specific queue.
This patch solves a bug in which, in case of a VF reset, the
descriptors aren't freed correctly, leading to crashes.
Fixes: 548c4940b9f1 ("net: ena: Implement XDP_TX action") Signed-off-by: Shay Agroskin <shayagr@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
David Arinzon [Wed, 10 Apr 2024 09:13:56 +0000 (09:13 +0000)]
net: ena: Wrong missing IO completions check order
Missing IO completions check is called every second (HZ jiffies).
This commit fixes several issues with this check:
1. Duplicate queues check:
Max of 4 queues are scanned on each check due to monitor budget.
Once reaching the budget, this check exits under the assumption that
the next check will continue to scan the remainder of the queues,
but in practice, next check will first scan the last already scanned
queue which is not necessary and may cause the full queue scan to
last a couple of seconds longer.
The fix is to start every check with the next queue to scan.
For example, on 8 IO queues:
Bug: [0,1,2,3], [3,4,5,6], [6,7]
Fix: [0,1,2,3], [4,5,6,7]
2. Unbalanced queues check:
In case the number of active IO queues is not a multiple of budget,
there will be checks which don't utilize the full budget
because the full scan exits when reaching the last queue id.
The fix is to run every TX completion check with exact queue budget
regardless of the queue id.
For example, on 7 IO queues:
Bug: [0,1,2,3], [4,5,6], [0,1,2,3]
Fix: [0,1,2,3], [4,5,6,0], [1,2,3,4]
The budget may be lowered in case the number of IO queues is less
than the budget (4) to make sure there are no duplicate queues on
the same check.
For example, on 3 IO queues:
Bug: [0,1,2,0], [1,2,0,1]
Fix: [0,1,2], [0,1,2]
Fixes: 1738cd3ed342 ("net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)") Signed-off-by: Amit Bernstein <amitbern@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
David Arinzon [Wed, 10 Apr 2024 09:13:55 +0000 (09:13 +0000)]
net: ena: Fix potential sign extension issue
Small unsigned types are promoted to larger signed types in
the case of multiplication, the result of which may overflow.
In case the result of such a multiplication has its MSB
turned on, it will be sign extended with '1's.
This changes the multiplication result.
Code example of the phenomenon:
-------------------------------
u16 x, y;
size_t z1, z2;
The expected result of ffff*ffff is fffe0001, and without the
explicit casting to avoid the unwanted sign extension we got fffffffffffe0001.
This commit adds an explicit casting to avoid the sign extension
issue.
Fixes: 689b2bdaaa14 ("net: ena: add functions for handling Low Latency Queues in ena_com") Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: David Arinzon <darinzon@amazon.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 11 Apr 2024 08:42:43 +0000 (10:42 +0200)]
Merge tag 'for-net-2024-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- L2CAP: Don't double set the HCI_CONN_MGMT_CONNECTED bit
- Fix memory leak in hci_req_sync_complete
- hci_sync: Fix using the same interval and window for Coded PHY
- Fix not validating setsockopt user input
* tag 'for-net-2024-04-10' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: l2cap: Don't double set the HCI_CONN_MGMT_CONNECTED bit
Bluetooth: hci_sock: Fix not validating setsockopt user input
Bluetooth: ISO: Fix not validating setsockopt user input
Bluetooth: L2CAP: Fix not validating setsockopt user input
Bluetooth: RFCOMM: Fix not validating setsockopt user input
Bluetooth: SCO: Fix not validating setsockopt user input
Bluetooth: Fix memory leak in hci_req_sync_complete()
Bluetooth: hci_sync: Fix using the same interval and window for Coded PHY
Bluetooth: ISO: Don't reject BT_ISO_QOS if parameters are unset
====================
x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
While syscall hardening helps prevent some BHI attacks, there's still
other low-hanging fruit remaining. Don't classify it as a mitigation
and make it clear that the system may still be vulnerable if it doesn't
have a HW or SW mitigation enabled.
The ARCH_CAP_RRSBA check isn't correct: RRSBA may have already been
disabled by the Spectre v2 mitigation (or can otherwise be disabled by
the BHI mitigation itself if needed). In that case retpolines are fine.
x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
So we are using the 'ia32_cap' value in a number of places,
which got its name from MSR_IA32_ARCH_CAPABILITIES MSR register.
But there's very little 'IA32' about it - this isn't 32-bit only
code, nor does it originate from there, it's just a historic
quirk that many Intel MSR names are prefixed with IA32_.
This is already clear from the helper method around the MSR:
x86_read_arch_cap_msr(), which doesn't have the IA32 prefix.
So rename 'ia32_cap' to 'x86_arch_cap_msr' to be consistent with
its role and with the naming of the helper function.
x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
There's no need to keep reading MSR_IA32_ARCH_CAPABILITIES over and
over. It's even read in the BHI sysfs function which is a big no-no.
Just read it once and cache it.
Michal Luczaj [Tue, 9 Apr 2024 20:09:39 +0000 (22:09 +0200)]
af_unix: Fix garbage collector racing against connect()
Garbage collector does not take into account the risk of embryo getting
enqueued during the garbage collection. If such embryo has a peer that
carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
different set of children. Leading to an incorrectly elevated inflight
count, and then a dangling pointer within the gc_inflight_list.
sockets are AF_UNIX/SOCK_STREAM
S is an unconnected socket
L is a listening in-flight socket bound to addr, not in fdtable
V's fd will be passed via sendmsg(), gets inflight count bumped
// V count=1 inflight=1
// GC candidate condition met
for u in gc_inflight_list:
if (total_refs == inflight_refs)
add u to gc_candidates
// gc_candidates={L, V}
for u in gc_candidates:
scan_children(u, dec_inflight)
// embryo (skb1) was not
// reachable from L yet, so V's
// inflight remains unchanged
__skb_queue_tail(L, skb1)
unix_state_unlock(L)
for u in gc_candidates:
if (u.inflight)
scan_children(u, inc_inflight_move_tail)
// V count=1 inflight=2 (!)
If there is a GC-candidate listening socket, lock/unlock its state. This
makes GC wait until the end of any ongoing connect() to that socket. After
flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
there is another embryo coming, it can not possibly carry SCM_RIGHTS. At
this point, unix_inflight() can not happen because unix_gc_lock is already
taken. Inflight graph remains unaffected.
net: dsa: mt7530: trap link-local frames regardless of ST Port State
In Clause 5 of IEEE Std 802-2014, two sublayers of the data link layer
(DLL) of the Open Systems Interconnection basic reference model (OSI/RM)
are described; the medium access control (MAC) and logical link control
(LLC) sublayers. The MAC sublayer is the one facing the physical layer.
In 8.2 of IEEE Std 802.1Q-2022, the Bridge architecture is described. A
Bridge component comprises a MAC Relay Entity for interconnecting the Ports
of the Bridge, at least two Ports, and higher layer entities with at least
a Spanning Tree Protocol Entity included.
Each Bridge Port also functions as an end station and shall provide the MAC
Service to an LLC Entity. Each instance of the MAC Service is provided to a
distinct LLC Entity that supports protocol identification, multiplexing,
and demultiplexing, for protocol data unit (PDU) transmission and reception
by one or more higher layer entities.
It is described in 8.13.9 of IEEE Std 802.1Q-2022 that in a Bridge, the LLC
Entity associated with each Bridge Port is modeled as being directly
connected to the attached Local Area Network (LAN).
On the switch with CPU port architecture, CPU port functions as Management
Port, and the Management Port functionality is provided by software which
functions as an end station. Software is connected to an IEEE 802 LAN that
is wholly contained within the system that incorporates the Bridge.
Software provides access to the LLC Entity associated with each Bridge Port
by the value of the source port field on the special tag on the frame
received by software.
We call frames that carry control information to determine the active
topology and current extent of each Virtual Local Area Network (VLAN),
i.e., spanning tree or Shortest Path Bridging (SPB) and Multiple VLAN
Registration Protocol Data Units (MVRPDUs), and frames from other link
constrained protocols, such as Extensible Authentication Protocol over LAN
(EAPOL) and Link Layer Discovery Protocol (LLDP), link-local frames. They
are not forwarded by a Bridge. Permanently configured entries in the
filtering database (FDB) ensure that such frames are discarded by the
Forwarding Process. In 8.6.3 of IEEE Std 802.1Q-2022, this is described in
detail:
Each of the reserved MAC addresses specified in Table 8-1
(01-80-C2-00-00-[00,01,02,03,04,05,06,07,08,09,0A,0B,0C,0D,0E,0F]) shall be
permanently configured in the FDB in C-VLAN components and ERs.
Each of the reserved MAC addresses specified in Table 8-2
(01-80-C2-00-00-[01,02,03,04,05,06,07,08,09,0A,0E]) shall be permanently
configured in the FDB in S-VLAN components.
Each of the reserved MAC addresses specified in Table 8-3
(01-80-C2-00-00-[01,02,04,0E]) shall be permanently configured in the FDB
in TPMR components.
The FDB entries for reserved MAC addresses shall specify filtering for all
Bridge Ports and all VIDs. Management shall not provide the capability to
modify or remove entries for reserved MAC addresses.
The addresses in Table 8-1, Table 8-2, and Table 8-3 determine the scope of
propagation of PDUs within a Bridged Network, as follows:
The Nearest Bridge group address (01-80-C2-00-00-0E) is an address that
no conformant Two-Port MAC Relay (TPMR) component, Service VLAN (S-VLAN)
component, Customer VLAN (C-VLAN) component, or MAC Bridge can forward.
PDUs transmitted using this destination address, or any other addresses
that appear in Table 8-1, Table 8-2, and Table 8-3
(01-80-C2-00-00-[00,01,02,03,04,05,06,07,08,09,0A,0B,0C,0D,0E,0F]), can
therefore travel no further than those stations that can be reached via a
single individual LAN from the originating station.
The Nearest non-TPMR Bridge group address (01-80-C2-00-00-03), is an
address that no conformant S-VLAN component, C-VLAN component, or MAC
Bridge can forward; however, this address is relayed by a TPMR component.
PDUs using this destination address, or any of the other addresses that
appear in both Table 8-1 and Table 8-2 but not in Table 8-3
(01-80-C2-00-00-[00,03,05,06,07,08,09,0A,0B,0C,0D,0F]), will be relayed
by any TPMRs but will propagate no further than the nearest S-VLAN
component, C-VLAN component, or MAC Bridge.
The Nearest Customer Bridge group address (01-80-C2-00-00-00) is an
address that no conformant C-VLAN component, MAC Bridge can forward;
however, it is relayed by TPMR components and S-VLAN components. PDUs
using this destination address, or any of the other addresses that appear
in Table 8-1 but not in either Table 8-2 or Table 8-3
(01-80-C2-00-00-[00,0B,0C,0D,0F]), will be relayed by TPMR components and
S-VLAN components but will propagate no further than the nearest C-VLAN
component or MAC Bridge.
Because the LLC Entity associated with each Bridge Port is provided via CPU
port, we must not filter these frames but forward them to CPU port.
In a Bridge, the transmission Port is majorly decided by ingress and egress
rules, FDB, and spanning tree Port State functions of the Forwarding
Process. For link-local frames, only CPU port should be designated as
destination port in the FDB, and the other functions of the Forwarding
Process must not interfere with the decision of the transmission Port. We
call this process trapping frames to CPU port.
Therefore, on the switch with CPU port architecture, link-local frames must
be trapped to CPU port, and certain link-local frames received by a Port of
a Bridge comprising a TPMR component or an S-VLAN component must be
excluded from it.
A Bridge of the switch with CPU port architecture cannot comprise a
Two-Port MAC Relay (TPMR) component as a TPMR component supports only a
subset of the functionality of a MAC Bridge. A Bridge comprising two Ports
(Management Port doesn't count) of this architecture will either function
as a standard MAC Bridge or a standard VLAN Bridge.
Therefore, a Bridge of this architecture can only comprise S-VLAN
components, C-VLAN components, or MAC Bridge components. Since there's no
TPMR component, we don't need to relay PDUs using the destination addresses
specified on the Nearest non-TPMR section, and the proportion of the
Nearest Customer Bridge section where they must be relayed by TPMR
components.
One option to trap link-local frames to CPU port is to add static FDB
entries with CPU port designated as destination port. However, because that
Independent VLAN Learning (IVL) is being used on every VID, each entry only
applies to a single VLAN Identifier (VID). For a Bridge comprising a MAC
Bridge component or a C-VLAN component, there would have to be 16 times
4096 entries. This switch intellectual property can only hold a maximum of
2048 entries. Using this option, there also isn't a mechanism to prevent
link-local frames from being discarded when the spanning tree Port State of
the reception Port is discarding.
The remaining option is to utilise the BPC, RGAC1, RGAC2, RGAC3, and RGAC4
registers. Whilst this applies to every VID, it doesn't contain all of the
reserved MAC addresses without affecting the remaining Standard Group MAC
Addresses. The REV_UN frame tag utilised using the RGAC4 register covers
the remaining 01-80-C2-00-00-[04,05,06,07,08,09,0A,0B,0C,0D,0F] destination
addresses. It also includes the 01-80-C2-00-00-22 to 01-80-C2-00-00-FF
destination addresses which may be relayed by MAC Bridges or VLAN Bridges.
The latter option provides better but not complete conformance.
This switch intellectual property also does not provide a mechanism to trap
link-local frames with specific destination addresses to CPU port by
Bridge, to conform to the filtering rules for the distinct Bridge
components.
Therefore, regardless of the type of the Bridge component, link-local
frames with these destination addresses will be trapped to CPU port:
01-80-C2-00-00-[00,01,02,03,0E]
In a Bridge comprising a MAC Bridge component or a C-VLAN component:
Link-local frames with these destination addresses won't be trapped to
CPU port which won't conform to IEEE Std 802.1Q-2022:
01-80-C2-00-00-[04,05,06,07,08,09,0A,0B,0C,0D,0F]
In a Bridge comprising an S-VLAN component:
Link-local frames with these destination addresses will be trapped to CPU
port which won't conform to IEEE Std 802.1Q-2022:
01-80-C2-00-00-00
Link-local frames with these destination addresses won't be trapped to
CPU port which won't conform to IEEE Std 802.1Q-2022:
01-80-C2-00-00-[04,05,06,07,08,09,0A]
Currently on this switch intellectual property, if the spanning tree Port
State of the reception Port is discarding, link-local frames will be
discarded.
To trap link-local frames regardless of the spanning tree Port State, make
the switch regard them as Bridge Protocol Data Units (BPDUs). This switch
intellectual property only lets the frames regarded as BPDUs bypass the
spanning tree Port State function of the Forwarding Process.
With this change, the only remaining interference is the ingress rules.
When the reception Port has no PVID assigned on software, VLAN-untagged
frames won't be allowed in. There doesn't seem to be a mechanism on the
switch intellectual property to have link-local frames bypass this function
of the Forwarding Process.
A couple of debug functions use a 512 byte temporary buffer and call another
function that has another buffer of the same size, which in turn exceeds the
usual warning limit for excessive stack usage:
Rework these so that each of the various code paths only ever has one of
these buffers in it, and exactly the functions that declare one have
the 'noinline_for_stack' annotation that prevents them from all being
inlined into the same caller.
net/mlx5e: RSS, Block XOR hash with over 128 channels
When supporting more than 128 channels, the RQT size is
calculated by multiplying the number of channels by 2
and rounding up to the nearest power of 2.
The index of the RQT is derived from the RSS hash
calculations. If XOR8 is used as the RSS hash function,
there are only 256 possible hash results, and therefore,
only 256 indexes can be reached in the RQT.
Block setting the RSS hash function to XOR when the number
of channels exceeds 128.
Fixes: 74a8dadac17e ("net/mlx5e: Preparations for supporting larger number of channels") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240409190820.227554-11-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/mlx5e: Do not produce metadata freelist entries in Tx port ts WQE xmit
Free Tx port timestamping metadata entries in the NAPI poll context and
consume metadata enties in the WQE xmit path. Do not free a Tx port
timestamping metadata entry in the WQE xmit path even in the error path to
avoid a race between two metadata entry producers.
net/mlx5e: HTB, Fix inconsistencies with QoS SQs number
When creating a new HTB class while the interface is down,
the variable that follows the number of QoS SQs (htb_max_qos_sqs)
may not be consistent with the number of HTB classes.
Previously, we compared these two values to ensure that
the node_qid is lower than the number of QoS SQs, and we
allocated stats for that SQ when they are equal.
Change the check to compare the node_qid with the current
number of leaf nodes and fix the checking conditions to
ensure allocation of stats_list and stats for each node.
When mlx5e_priv_init() fails, the cleanup flow calls mlx5e_selq_cleanup which
calls mlx5e_selq_apply() that assures that the `priv->state_lock` is held using
lockdep_is_held().
net/mlx5e: RSS, Block changing channels number when RXFH is configured
Changing the channels number after configuring the receive flow hash
indirection table may affect the RSS table size. The previous
configuration may no longer be compatible with the new receive flow
hash indirection table.
Block changing the channels number when RXFH is configured and changing
the channels number requires resizing the RSS table size.
Fixes: 74a8dadac17e ("net/mlx5e: Preparations for supporting larger number of channels") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240409190820.227554-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
struct mlx5_pkt_reformat contains a naked union of a u32 id and a
dr_action pointer which is used when the action is SW-managed (when
pkt_reformat.owner is set to MLX5_FLOW_RESOURCE_OWNER_SW). Using id
directly in that case is incorrect, as it maps to the least significant
32 bits of the 64-bit pointer in mlx5_fs_dr_action and not to the pkt
reformat id allocated in firmware.
For the purpose of comparing whether two rules are identical,
interpreting the least significant 32 bits of the mlx5_fs_dr_action
pointer as an id mostly works... until it breaks horribly and produces
the outcome described in [1].
This patch fixes mlx5_flow_dests_cmp to correctly compare ids using
mlx5_fs_dr_action_get_pkt_reformat_id for the SW-managed rules.
net/mlx5: Properly link new fs rules into the tree
Previously, add_rule_fg would only add newly created rules from the
handle into the tree when they had a refcount of 1. On the other hand,
create_flow_handle tries hard to find and reference already existing
identical rules instead of creating new ones.
These two behaviors can result in a situation where create_flow_handle
1) creates a new rule and references it, then
2) in a subsequent step during the same handle creation references it
again,
resulting in a rule with a refcount of 2 that is not linked into the
tree, will have a NULL parent and root and will result in a crash when
the flow group is deleted because del_sw_hw_rule, invoked on rule
deletion, assumes node->parent is != NULL.
This happened in the wild, due to another bug related to incorrect
handling of duplicate pkt_reformat ids, which lead to the code in
create_flow_handle incorrectly referencing a just-added rule in the same
flow handle, resulting in the problem described above. Full details are
at [1].
This patch changes add_rule_fg to add new rules without parents into
the tree, properly initializing them and avoiding the crash. This makes
it more consistent with how rules are added to an FTE in
create_flow_handle.
Michael Liang [Tue, 9 Apr 2024 19:08:11 +0000 (22:08 +0300)]
net/mlx5: offset comp irq index in name by one
The mlx5 comp irq name scheme is changed a little bit between
commit 3663ad34bc70 ("net/mlx5: Shift control IRQ to the last index")
and commit 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation").
The index in the comp irq name used to start from 0 but now it starts
from 1. There is nothing critical here, but it's harmless to change
back to the old behavior, a.k.a starting from 0.