Gaosheng Cui [Fri, 23 Sep 2022 09:00:12 +0000 (17:00 +0800)]
ftrace: Remove obsoleted code from ftrace and task_struct
The trace of "struct task_struct" was no longer used since
commit 345ddcc882d8 ("ftrace: Have set_ftrace_pid use the
bitmap like events do"), and the functions about flags for
current->trace is useless, so remove them.
Waiman Long [Thu, 22 Sep 2022 14:56:22 +0000 (10:56 -0400)]
tracing: Disable interrupt or preemption before acquiring arch_spinlock_t
It was found that some tracing functions in kernel/trace/trace.c acquire
an arch_spinlock_t with preemption and irqs enabled. An example is the
tracing_saved_cmdlines_size_read() function which intermittently causes
a "BUG: using smp_processor_id() in preemptible" warning when the LTP
read_all_proc test is run.
That can be problematic in case preemption happens after acquiring the
lock. Add the necessary preemption or interrupt disabling code in the
appropriate places before acquiring an arch_spinlock_t.
The convention here is to disable preemption for trace_cmdline_lock and
interupt for max_lock.
Link: https://lkml.kernel.org/r/20220922145622.1744826-1-longman@redhat.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: stable@vger.kernel.org Fixes: a35873a0993b ("tracing: Add conditional snapshot") Fixes: 939c7a4f04fc ("tracing: Introduce saved_cmdlines_size file") Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
tracing/osnoise: Fix possible recursive locking in stop_per_cpu_kthreads
There is a recursive lock on the cpu_hotplug_lock.
In kernel/trace/trace_osnoise.c:<start/stop>_per_cpu_kthreads:
- start_per_cpu_kthreads calls cpus_read_lock() and if
start_kthreads returns a error it will call stop_per_cpu_kthreads.
- stop_per_cpu_kthreads then calls cpus_read_lock() again causing
deadlock.
Fix this by calling cpus_read_unlock() before calling
stop_per_cpu_kthreads. This behavior can also be seen in commit f46b16520a08 ("trace/hwlat: Implement the per-cpu mode").
This error was noticed during the LTP ftrace-stress-test:
WARNING: possible recursive locking detected
--------------------------------------------
sh/275006 is trying to acquire lock: ffffffffb02f5400 (cpu_hotplug_lock){++++}-{0:0}, at: stop_per_cpu_kthreads
but task is already holding lock: ffffffffb02f5400 (cpu_hotplug_lock){++++}-{0:0}, at: start_per_cpu_kthreads
other info that might help us debug this:
Possible unsafe locking scenario:
Yipeng Zou [Mon, 19 Sep 2022 12:56:29 +0000 (20:56 +0800)]
tracing: kprobe: Make gen test module work in arm and riscv
For now, this selftest module can only work in x86 because of the
kprobe cmd was fixed use of x86 registers.
This patch adapted to register names under arm and riscv, So that
this module can be worked on those platform.
All uses of arch_kprobe_override_function() have been removed by
commit 540adea3809f ("error-injection: Separate error-injection
from kprobe"), so remove the declaration, too.
tracing/filter: Call filter predicate functions directly via a switch statement
Due to retpolines, indirect calls are much more expensive than direct
calls. The filters have a select set of functions it uses for the
predicates. Instead of using function pointers to call them, create a
filter_pred_fn_call() function that uses a switch statement to call the
predicate functions directly. This gives almost a 10% speedup to the
filter logic.
tracing: Move struct filter_pred into trace_events_filter.c
The structure filter_pred and the typedef of the function used are only
referenced by trace_events_filter.c. There's no reason to have it in an
external header file. Move them into the only file they are used in.
Link: https://lkml.kernel.org/r/20220906225529.598047132@goodmis.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
tracing/hist: Call hist functions directly via a switch statement
Due to retpolines, indirect calls are much more expensive than direct
calls. The histograms have a select set of functions it uses for the
histograms, instead of using function pointers to call them, create a
hist_fn_call() function that uses a switch statement to call the histogram
functions directly. This gives a 13% speedup to the histogram logic.
tracing: Add numeric delta time to the trace event benchmark
In order to testing filtering and histograms via the trace event
benchmark, record the delta time of the last event as a numeric value
(currently, it just saves it within the string) so that filters and
histograms can use it.
Link: https://lkml.kernel.org/r/20220906225529.213677569@goodmis.org Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Regression and bug fixes:
- Performance regression fix from 5.18 on a Rasberry Pi
- Fix extent parsing bug which triggers a BUG_ON when a (corrupted)
extent tree has has a non-root node when zero entries.
- Fix a livelock where in the right (wrong) circumstances a large
number of nfsd threads can try to write to a nearly full file
system, and retry for hours(!)"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: limit the number of retries after discarding preallocations blocks
ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0
ext4: use buckets for cr 1 block scan instead of rbtree
ext4: use locality group preallocation for small closed files
ext4: make directory inode spreading reflect flexbg size
ext4: avoid unnecessary spreading of allocations among groups
ext4: make mballoc try target group first even with mb_optimize_scan
Merge tag 'dax-and-nvdimm-fixes-v6.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull NVDIMM and DAX fixes from Dan Williams:
"A recently discovered one-line fix for devdax that further addresses a
v5.5 regression, and (a bit embarrassing) a small batch of fixes that
have been sitting in my fixes tree for weeks.
The older fixes have soaked in linux-next during that time and address
an fsdax infinite loop and some other minor fixups.
- Fix a infinite loop bug in fsdax
- Fix memory-type detection for devdax (EINJ regression)
- Small cleanups"
* tag 'dax-and-nvdimm-fixes-v6.0-final' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
devdax: Fix soft-reservation memory description
fsdax: Fix infinite loop in dax_iomap_rw()
nvdimm/namespace: drop nested variable in create_namespace_pmem()
ndtest: Cleanup all of blk namespace specific code
pmem: fix a name collision
Merge tag 'i2c-for-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"I2C driver bugfixes for mlxbf and imx, a few documentation fixes after
the rework this cycle, and one hardening for the i2c-mux core"
* tag 'i2c-for-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: mux: harden i2c_mux_alloc() against integer overflows
i2c: mlxbf: Fix frequency calculation
i2c: mlxbf: prevent stack overflow in mlxbf_i2c_smbus_start_transaction()
i2c: mlxbf: incorrect base address passed during io write
Documentation: i2c: fix references to other documents
MAINTAINERS: remove Nehal Shah from AMD MP2 I2C DRIVER
i2c: imx: If pm_runtime_get_sync() returned 1 device access is possible
Dan Williams [Fri, 23 Sep 2022 22:05:56 +0000 (15:05 -0700)]
devdax: Fix soft-reservation memory description
The "hmem" platform-devices that are created to represent the
platform-advertised "Soft Reserved" memory ranges end up inserting a
resource that causes the iomem_resource tree to look like this:
This is because insert_resource() reparents ranges when they completely
intersect an existing range.
This matters because code that uses region_intersects() to scan for a
given IORES_DESC will only check that top-level 'hmem.0' resource and
not the 'Soft Reserved' descendant.
So, to support EINJ (via einj_error_inject()) to inject errors into
memory hosted by a dax-device, be sure to describe the memory as
IORES_DESC_SOFT_RESERVED. This is a follow-on to:
commit b13a3e5fd40b ("ACPI: APEI: Fix _EINJ vs EFI_MEMORY_SP")
...that fixed EINJ support for "Soft Reserved" ranges in the first
instance.
Fixes: 262b45ae3ab4 ("x86/efi: EFI soft reservation to E820 enumeration") Reported-by: Ricardo Sandoval Torres <ricardo.sandoval.torres@intel.com> Tested-by: Ricardo Sandoval Torres <ricardo.sandoval.torres@intel.com> Cc: <stable@vger.kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Omar Avelar <omar.avelar@intel.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Mark Gross <markgross@kernel.org> Link: https://lore.kernel.org/r/166397075670.389916.7435722208896316387.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Merge tag 'kbuild-fixes-v6.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix build error for the combination of SYSTEM_TRUSTED_KEYRING=y and
X509_CERTIFICATE_PARSER=m
- Fix DEBUG_INFO_SPLIT to generate debug info for GCC 11+ and Clang 12+
- Revive debug info for assembly files
- Remove unused code
* tag 'kbuild-fixes-v6.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
Makefile.debug: re-enable debug info for .S files
Makefile.debug: set -g unconditional on CONFIG_DEBUG_INFO_SPLIT
certs: make system keyring depend on built-in x509 parser
Kconfig: remove unused function 'menu_get_root_menu'
scripts/clang-tools: remove unused module
Merge tag 'pm-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix an uninitialized variable usage in the operating performance
points code and add missing DT bindings for it.
Specifics:
- Fix uninitialized variable usage in dev_pm_opp_config_clks_simple()
(Christophe JAILLET)
- Add missing OPP DT properties (Rob Herring)"
* tag 'pm-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
dt-bindings: opp: Add missing (unevaluated|additional)Properties on child nodes
OPP: Fix an un-initialized variable usage
Merge tag 'char-misc-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are three tiny driver fixes for 6.0-rc7. They include:
- phy driver reset bugfix
- fpga memleak bugfix
- counter irq config bugfix
The first two have been in linux-next for a while, the last one has
only been added to my tree in the past few days, but was in linux-next
under a different commit id. I couldn't pull directly from the counter
tree due to some gpg key propagation issue, so I took the commit
directly from email instead"
* tag 'char-misc-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
counter: 104-quad-8: Fix skipped IRQ lines during events configuration
fpga: m10bmc-sec: Fix possible memory leak of flash_buf
phy: marvell: phy-mvebu-a3700-comphy: Remove broken reset support
Merge tag 'tty-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver fixes from Greg KH:
"Here are some small, and late, serial driver fixes for 6.0-rc7 to
resolve some reported problems.
Included in here are:
- tegra icount accounting fixes, including a framework function that
other drivers will be converted over to using in 6.1-rc1.
- fsl_lpuart reset bugfix
- 8250 omap 485 bugfix
- sifive serial clock bugfix
The last three patches have not shown up in linux-next due to them
being added to my tree only 2 days ago, but they are tiny and
self-contained and the developers say they resolve issues that they
have with 6.0-rc. The other three have been in linux-next for a while
with no reported issues"
* tag 'tty-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: sifive: enable clocks for UART when probed
serial: 8250: omap: Use serial8250_em485_supported
serial: fsl_lpuart: Reset prior to registration
serial: tegra-tcu: Use uart_xmit_advance(), fixes icount.tx accounting
serial: tegra: Use uart_xmit_advance(), fixes icount.tx accounting
serial: Create uart_xmit_advance()
Merge tag 'cgroup-for-6.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Add Waiman Long as a cpuset maintainer
- cgroup_get_from_id() could be fed a kernfs ID which doesn't point to
a cgroup directory but a knob file and then crash. Error out if the
lookup kernfs_node isn't a directory.
* tag 'cgroup-for-6.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: cgroup_get_from_id() must check the looked-up kn is a directory
cpuset: Add Waiman Long as a cpuset maintainer
Merge tag 'wq-for-6.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
"Just one patch to improve flush lockdep coverage"
* tag 'wq-for-6.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: don't skip lockdep work dependency in cancel_work_sync()
Merge tag 'block-6.0-2022-09-22' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"Fix a regression that's been plaguing us by reverting the offending
commit, as attempts to both reproduce the issue and fix it in a saner
fashion have failed.
Fix for a potential oops condition in the s390 dasd block driver"
* tag 'block-6.0-2022-09-22' of git://git.kernel.dk/linux:
Revert "block: freeze the queue earlier in del_gendisk"
s390/dasd: fix Oops in dasd_alias_get_start_dev due to missing pavgroup
Nick Desaulniers [Mon, 19 Sep 2022 17:45:47 +0000 (10:45 -0700)]
Makefile.debug: re-enable debug info for .S files
Alexey reported that the fraction of unknown filename instances in
kallsyms grew from ~0.3% to ~10% recently; Bill and Greg tracked it down
to assembler defined symbols, which regressed as a result of:
commit b8a9092330da ("Kbuild: do not emit debug info for assembly with LLVM_IAS=1")
In that commit, I allude to restoring debug info for assembler defined
symbols in a follow up patch, but it seems I forgot to do so in
commit a66049e2cf0e ("Kbuild: make DWARF version a choice")
Link: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=31bf18645d98b4d3d7357353be840e320649a67d Fixes: b8a9092330da ("Kbuild: do not emit debug info for assembly with LLVM_IAS=1") Reported-by: Alexey Alexandrov <aalexand@google.com> Reported-by: Bill Wendling <morbo@google.com> Reported-by: Greg Thelen <gthelen@google.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Suggested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Nick Desaulniers [Mon, 19 Sep 2022 17:30:30 +0000 (10:30 -0700)]
Makefile.debug: set -g unconditional on CONFIG_DEBUG_INFO_SPLIT
Dmitrii, Fangrui, and Mashahiro note:
Before GCC 11 and Clang 12 -gsplit-dwarf implicitly uses -g2.
Fix CONFIG_DEBUG_INFO_SPLIT for gcc-11+ & clang-12+ which now need -g
specified in order for -gsplit-dwarf to work at all.
-gsplit-dwarf has been mutually exclusive with -g since support for
CONFIG_DEBUG_INFO_SPLIT was introduced in
commit 866ced950bcd ("kbuild: Support split debug info v4")
I don't think it ever needed to be.
io_uring: ensure that cached task references are always put on exit
io_uring caches task references to avoid doing atomics for each of them
per request. If a request is put from the same task that allocated it,
then we can maintain a per-ctx cache of them. This obviously relies
on io_uring always pruning caches in a reliable way, and there's
currently a case off io_uring fd release where we can miss that.
One example is a ring setup with IOPOLL, which relies on the task
polling for completions, which will free them. However, if such a task
submits a request and then exits or closes the ring without reaping
the completion, then ring release will reap and put. If release happens
from that very same task, the completed request task refs will get
put back into the cache pool. This is problematic, as we're now beyond
the point of pruning caches.
Manually drop these caches after doing an IOPOLL reap. This releases
references from the current task, which is enough. If another task
happens to be doing the release, then the caching will not be
triggered and there's no issue.
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"These are all very simple and self-contained, although the CFI
jump-table fix touches the generic linker script as that's where the
problematic macro lives.
- Fix false positive "sleeping while atomic" warning resulting from
the kPTI rework taking a mutex too early.
- Fix possible overflow in AMU frequency calculation
- Fix incorrect shift in CMN PMU driver which causes problems with
newer versions of the IP
- Reduce alignment of the CFI jump table to avoid huge kernel images
and link errors with !4KiB page size configurations"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
vmlinux.lds.h: CFI: Reduce alignment of jump-table to function alignment
perf/arm-cmn: Add more bits to child node address offset field
arm64: topology: fix possible overflow in amu_fie_setup()
arm64: mm: don't acquire mutex when rewriting swapper
certs: make system keyring depend on built-in x509 parser
Commit e90886291c7c ("certs: make system keyring depend on x509 parser")
is not the right fix because x509_load_certificate_list() can be modular.
The combination of CONFIG_SYSTEM_TRUSTED_KEYRING=y and
CONFIG_X509_CERTIFICATE_PARSER=m still results in the following error:
LD .tmp_vmlinux.kallsyms1
ld: certs/system_keyring.o: in function `load_system_certificate_list':
system_keyring.c:(.init.text+0x8c): undefined reference to `x509_load_certificate_list'
make: *** [Makefile:1169: vmlinux] Error 1
Fixes: e90886291c7c ("certs: make system keyring depend on x509 parser") Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Tested-by: Adam Borowski <kilobyte@angband.pl>
Merge tag 'driver-core-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are two tiny driver core fixes for 6.0-rc7 that resolve some
oft-reported problems.
The first is a revert of the "fw_devlink.strict=1" default option that
we keep trying to enable, but we keep finding platforms that this just
breaks everything on. So again, we need it reverted and hopefully it
can be worked on in future releases.
The second is a sysfs file-size bugfix that resolves an issue that
many people are starting to hit as the fix it is fixing also was
backported to stable kernels. The util-linux developers are starting
to get bugreports about sysfs files that contain no data because of
this problem, and this fix which has been in linux-next in the
bitfield tree for a long time, resolves it. I'm submitting it here as
it needs to be merged for 6.0-final, not for 6.1-rc1.
Both of these have been in linux-next with no reported issues, only
reports were that these fixed problems"
* tag 'driver-core-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
drivers/base: Fix unsigned comparison to -1 in CPUMAP_FILE_MAX_BYTES
Revert "driver core: Set fw_devlink.strict=1 by default"
Merge tag 'usb-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB / Thunderbolt driver fixes and ids from Greg KH:
"Here are a few small USB and Thunderbolt driver fixes and new device
ids for 6.0-rc7.
They contain:
- new usb-serial driver ids
- documentation build warning fix in USB hub code
- flexcop-usb long-posted bugfix (the v4l maintainer for this is MIA
so I have finally picked this up as it is a fix for a reported
problem.)
- dwc3 64bit DMA bugfix
- new thunderbolt device ids
- typec build error fix
All of these have been in linux-next with no reported issues"
* tag 'usb-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: typec: anx7411: Fix build error without CONFIG_POWER_SUPPLY
media: flexcop-usb: fix endpoint type check
USB: serial: option: add Quectel RM520N
USB: serial: option: add Quectel BG95 0x0203 composition
thunderbolt: Add support for Intel Maple Ridge single port controller
usb: dwc3: core: leave default DMA if the controller does not support 64-bit DMA
USB: core: Fix RST error in hub.c
Merge tag 'riscv-for-linus-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A handful of build fixes for the T-Head errata, including some
functional issues the compilers found
- A fix for a nasty sigreturn bug
* tag 'riscv-for-linus-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
RISC-V: Avoid coupling the T-Head CMOs and Zicbom
riscv: fix a nasty sigreturn bug...
riscv: make t-head erratas depend on MMU
riscv: fix RISCV_ISA_SVPBMT kconfig dependency warning
RISC-V: Clean up the Zicbom block size probing
* tag 'drm-fixes-2022-09-23-1' of git://anongit.freedesktop.org/drm/drm: (30 commits)
MAINTAINERS: switch graphics to airlied other addresses
drm/mediatek: dsi: Move mtk_dsi_stop() call back to mtk_dsi_poweroff()
drm/amd/display: Reduce number of arguments of dml314's CalculateFlipSchedule()
drm/amd/display: Reduce number of arguments of dml314's CalculateWatermarksAndDRAMSpeedChangeSupport()
drm/amdgpu: don't register a dirty callback for non-atomic
drm/amd/pm: drop the pptable related workarounds for SMU 13.0.0
drm/amd/pm: add support for 3794 pptable for SMU13.0.0
drm/amd/display: correct num_dsc based on HW cap
drm/amd/display: Disable OTG WA for the plane_state NULL case on DCN314
drm/amd/display: Add shift and mask for ICH_RESET_AT_END_OF_LINE
drm/amd/display: increase dcn315 pstate change latency
drm/amd/display: Fix DP MST timeslot issue when fallback happened
drm/amd/display: Display distortion after hotplug 5K tiled display
drm/amd/display: Update dummy P-state search to use DCN32 DML
drm/amd/display: skip audio setup when audio stream is enabled
drm/amd/display: update gamut remap if plane has changed
drm/amd/display: Assume an LTTPR is always present on fixed_vs links
drm/amd/display: fix dcn315 memory channel count and width read
drm/amd/display: Fix double cursor on non-video RGB MPO
drm/amd/display: Only consider pixle rate div policy for DCN32+
...
Will Deacon [Thu, 22 Sep 2022 21:57:15 +0000 (22:57 +0100)]
vmlinux.lds.h: CFI: Reduce alignment of jump-table to function alignment
Due to undocumented, hysterical raisins on x86, the CFI jump-table
sections in .text are needlessly aligned to PMD_SIZE in the vmlinux
linker script. When compiling a CFI-enabled arm64 kernel with a 64KiB
page-size, a PMD maps 512MiB of virtual memory and so the .text section
increases to a whopping 940MiB and blows the final Image up to 960MiB.
Others report a link failure.
Since the CFI jump-table requires only instruction alignment, reduce the
alignment directives to function alignment for parity with other parts
of the .text section. This reduces the size of the .text section for the
aforementioned 64KiB page size arm64 kernel to 19MiB for a much more
reasonable total Image size of 39MiB.
Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: "Mohan Rao .vanimina" <mailtoc.mohanrao@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/all/CAL_GTzigiNOMYkOPX1KDnagPhJtFNqSK=1USNbS0wUL4PW6-Uw@mail.gmail.com/ Fixes: cf68fffb66d6 ("add support for Clang CFI") Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220922215715.13345-1-will@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Three small and pretty obvious fixes, all in drivers"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: mpt3sas: Fix return value check of dma_get_required_mask()
scsi: qla2xxx: Fix memory leak in __qlt_24xx_handle_abts()
scsi: qedf: Fix a UAF bug in __qedf_probe()
Merge tag 'slab-for-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fixes from Vlastimil Babka:
- Fix a possible use-after-free in SLUB's kmem_cache removal,
introduced in this cycle, by Feng Tang.
- WQ_MEM_RECLAIM dependency fix for the workqueue-based cpu slab
flushing introduced in 5.15, by Maurizio Lombardi.
- Add missing KASAN hooks in two kmalloc entry paths, by Peter
Collingbourne.
- A BUG_ON() removal in SLUB's kmem_cache creation when allocation
fails (too small to possibly happen in practice, syzbot used fault
injection), by Chao Yu.
* tag 'slab-for-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm: slub: fix flush_cpu_slab()/__free_slab() invocations in task context.
mm/slab_common: fix possible double free of kmem_cache
kasan: call kasan_malloc() from __kmalloc_*track_caller()
mm/slub: fix to return errno if kmalloc() fails
KVM: x86: Inject #UD on emulated XSETBV if XSAVES isn't enabled
Inject #UD when emulating XSETBV if CR4.OSXSAVE is not set. This also
covers the "XSAVE not supported" check, as setting CR4.OSXSAVE=1 #GPs if
XSAVE is not supported (and userspace gets to keep the pieces if it
forces incoherent vCPU state).
Add a comment to kvm_emulate_xsetbv() to call out that the CPU checks
CR4.OSXSAVE before checking for intercepts. AMD'S APM implies that #UD
has priority (says that intercepts are checked before #GP exceptions),
while Intel's SDM says nothing about interception priority. However,
testing on hardware shows that both AMD and Intel CPUs prioritize the #UD
over interception.
Fixes: 02d4160fbd76 ("x86: KVM: add xsetbv to the emulator") Cc: stable@vger.kernel.org Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220824033057.3576315-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Always enable legacy FP/SSE in allowed user XFEATURES
Allow FP and SSE state to be saved and restored via KVM_{G,SET}_XSAVE on
XSAVE-capable hosts even if their bits are not exposed to the guest via
XCR0.
Failing to allow FP+SSE first showed up as a QEMU live migration failure,
where migrating a VM from a pre-XSAVE host, e.g. Nehalem, to an XSAVE
host failed due to KVM rejecting KVM_SET_XSAVE. However, the bug also
causes problems even when migrating between XSAVE-capable hosts as
KVM_GET_SAVE won't set any bits in user_xfeatures if XSAVE isn't exposed
to the guest, i.e. KVM will fail to actually migrate FP+SSE.
Because KVM_{G,S}ET_XSAVE are designed to allowing migrating between
hosts with and without XSAVE, KVM_GET_XSAVE on a non-XSAVE (by way of
fpu_copy_guest_fpstate_to_uabi()) always sets the FP+SSE bits in the
header so that KVM_SET_XSAVE will work even if the new host supports
XSAVE.
Fixes: ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0")
bz: https://bugzilla.redhat.com/show_bug.cgi?id=2079311 Cc: stable@vger.kernel.org Cc: Leonardo Bras <leobras@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
[sean: add comment, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220824033057.3576315-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reinstate the per-vCPU guest_supported_xcr0 by partially reverting
commit 988896bb6182; the implicit assessment that guest_supported_xcr0 is
always the same as guest_fpu.fpstate->user_xfeatures was incorrect.
kvm_vcpu_after_set_cpuid() isn't the only place that sets user_xfeatures,
as user_xfeatures is set to fpu_user_cfg.default_features when guest_fpu
is allocated via fpu_alloc_guest_fpstate() => __fpstate_reset().
guest_supported_xcr0 on the other hand is zero-allocated. If userspace
never invokes KVM_SET_CPUID2, supported XCR0 will be '0', whereas the
allowed user XFEATURES will be non-zero.
Practically speaking, the edge case likely doesn't matter as no sane
userspace will live migrate a VM without ever doing KVM_SET_CPUID2. The
primary motivation is to prepare for KVM intentionally and explicitly
setting bits in user_xfeatures that are not set in guest_supported_xcr0.
Because KVM_{G,S}ET_XSAVE can be used to svae/restore FP+SSE state even
if the host doesn't support XSAVE, KVM needs to set the FP+SSE bits in
user_xfeatures even if they're not allowed in XCR0, e.g. because XCR0
isn't exposed to the guest. At that point, the simplest fix is to track
the two things separately (allowed save/restore vs. allowed XCR0).
Fixes: 988896bb6182 ("x86/kvm/fpu: Remove kvm_vcpu_arch.guest_supported_xcr0") Cc: stable@vger.kernel.org Cc: Leonardo Bras <leobras@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220824033057.3576315-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Miaohe Lin [Wed, 7 Sep 2022 08:06:57 +0000 (16:06 +0800)]
KVM: x86/mmu: add missing update to max_mmu_rmap_size
The update to statistic max_mmu_rmap_size is unintentionally removed by
commit 4293ddb788c1 ("KVM: x86/mmu: Remove redundant spte present check
in mmu_set_spte"). Add missing update to it or max_mmu_rmap_size will
always be nonsensical 0.
Fixes: 4293ddb788c1 ("KVM: x86/mmu: Remove redundant spte present check in mmu_set_spte") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Message-Id: <20220907080657.42898-1-linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jinrong Liang [Tue, 2 Aug 2022 07:12:40 +0000 (15:12 +0800)]
selftests: kvm: Fix a compile error in selftests/kvm/rseq_test.c
The following warning appears when executing:
make -C tools/testing/selftests/kvm
rseq_test.c: In function ‘main’:
rseq_test.c:237:33: warning: implicit declaration of function ‘gettid’; did you mean ‘getgid’? [-Wimplicit-function-declaration]
(void *)(unsigned long)gettid());
^~~~~~
getgid
/usr/bin/ld: /tmp/ccr5mMko.o: in function `main':
../kvm/tools/testing/selftests/kvm/rseq_test.c:237: undefined reference to `gettid'
collect2: error: ld returned 1 exit status
make: *** [../lib.mk:173: ../kvm/tools/testing/selftests/kvm/rseq_test] Error 1
Use the more compatible syscall(SYS_gettid) instead of gettid() to fix it.
More subsequent reuse may cause it to be wrapped in a lib file.
Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
Message-Id: <20220802071240.84626-1-cloudliang@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
mm: slub: fix flush_cpu_slab()/__free_slab() invocations in task context.
Commit 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations
__free_slab() invocations out of IRQ context") moved all flush_cpu_slab()
invocations to the global workqueue to avoid a problem related
with deactivate_slab()/__free_slab() being called from an IRQ context
on PREEMPT_RT kernels.
When the flush_all_cpu_locked() function is called from a task context
it may happen that a workqueue with WQ_MEM_RECLAIM bit set ends up
flushing the global workqueue, this will cause a dependency issue.
Merge tag 'soc-fixes-6.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"Another set of fixes for fixes for the soc tree:
- A fix for the interrupt number on at91/lan966 ethernet PHYs
- A second round of fixes for NXP i.MX series, including a couple of
build issues, and board specific DT corrections on TQMa8MPQL,
imx8mp-venice-gw74xx and imx8mm-verdin for reliability and
partially broken functionality
- Several fixes for Rockchip SoCs, addressing a USB issue on
BPI-R2-Pro, wakeup on Gru-Bob and reliability of high-speed SD
cards, among other minor issues
- A fix for a long-running naming mistake that prevented the moxart
mmc driver from working at all
- Multiple Arm SCMI firmware fixes for hardening some corner cases"
* tag 'soc-fixes-6.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (30 commits)
arm64: dts: imx8mp-venice-gw74xx: fix port/phy validation
ARM: dts: lan966x: Fix the interrupt number for internal PHYs
arm64: dts: imx8mp-venice-gw74xx: fix ksz9477 cpu port
arm64: dts: imx8mp-venice-gw74xx: fix CAN STBY polarity
dt-bindings: memory-controllers: fsl,imx8m-ddrc: drop Leonard Crestez
arm64: dts: tqma8mqml: Include phy-imx8-pcie.h header
arm64: defconfig: enable ARCH_NXP
arm64: dts: imx8mp-tqma8mpql-mba8mpxl: add missing pinctrl for RTC alarm
ARM: dts: fix Moxa SDIO 'compatible', remove 'sdhci' misnomer
arm64: dts: imx8mm-verdin: extend pmic voltages
arm64: dts: rockchip: Remove 'enable-active-low' from rk3566-quartz64-a
arm64: dts: rockchip: Remove 'enable-active-low' from rk3399-puma
arm64: dts: rockchip: fix property for usb2 phy supply on rk3568-evb1-v10
arm64: dts: rockchip: fix property for usb2 phy supply on rock-3a
arm64: dts: imx8ulp: add #reset-cells for pcc
arm64: dts: tqma8mpxl-ba8mpxl: Fix button GPIOs
arm64: dts: imx8mn: remove GPU power domain reset
arm64: dts: rockchip: Set RK3399-Gru PCLK_EDP to 24 MHz
arm64: dts: imx8mm: Reverse CPLD_Dn GPIO label mapping on MX8Menlo
arm64: dts: rockchip: fix upper usb port on BPI-R2-Pro
...
Merge tag 'net-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from wifi, netfilter and can.
A handful of awaited fixes here - revert of the FEC changes, bluetooth
fix, fixes for iwlwifi spew.
We added a warning in PHY/MDIO code which is triggering on a couple of
platforms in a false-positive-ish way. If we can't iron that out over
the week we'll drop it and re-add for 6.1.
I've added a new "follow up fixes" section for fixes to fixes in
6.0-rcs but it may actually give the false impression that those are
problematic or that more testing time would have caught them. So
likely a one time thing.
- ebtables: fix memory leak when blob is malformed
- nf_ct_ftp: fix deadlock when nat rewrite is needed
Current release - regressions:
- Revert "fec: Restart PPS after link state change" and the related
"net: fec: Use a spinlock to guard `fep->ptp_clk_on`"
- Bluetooth: fix HCIGETDEVINFO regression
- wifi: mt76: fix 5 GHz connection regression on mt76x0/mt76x2
- mptcp: fix fwd memory accounting on coalesce
- rwlock removal fall out:
- ipmr: always call ip{,6}_mr_forward() from RCU read-side
critical section
- ipv6: fix crash when IPv6 is administratively disabled
- tcp: read multiple skbs in tcp_read_skb()
- mdio_bus_phy_resume state warning fallout:
- eth: ravb: fix PHY state warning splat during system resume
- eth: sh_eth: fix PHY state warning splat during system resume
Current release - new code bugs:
- wifi: iwlwifi: don't spam logs with NSS>2 messages
- eth: mtk_eth_soc: enable XDP support just for MT7986 SoC
Previous releases - regressions:
- bonding: fix NULL deref in bond_rr_gen_slave_id
- wifi: iwlwifi: mark IWLMEI as broken
Previous releases - always broken:
- nf_conntrack helpers:
- irc: tighten matching on DCC message
- sip: fix ct_sip_walk_headers
- osf: fix possible bogus match in nf_osf_find()
- ipvlan: fix out-of-bound bugs caused by unset skb->mac_header
- core: fix flow symmetric hash
- bonding, team: unsync device addresses on ndo_stop
- phy: micrel: fix shared interrupt on LAN8814"
* tag 'net-6.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
selftests: forwarding: add shebang for sch_red.sh
bnxt: prevent skb UAF after handing over to PTP worker
net: marvell: Fix refcounting bugs in prestera_port_sfp_bind()
net: sched: fix possible refcount leak in tc_new_tfilter()
net: sunhme: Fix packet reception for len < RX_COPY_THRESHOLD
udp: Use WARN_ON_ONCE() in udp_read_skb()
selftests: bonding: cause oops in bond_rr_gen_slave_id
bonding: fix NULL deref in bond_rr_gen_slave_id
net: phy: micrel: fix shared interrupt on LAN8814
net/smc: Stop the CLC flow if no link to map buffers on
ice: Fix ice_xdp_xmit() when XDP TX queue number is not sufficient
net: atlantic: fix potential memory leak in aq_ndev_close()
can: gs_usb: gs_usb_set_phys_id(): return with error if identify is not supported
can: gs_usb: gs_can_open(): fix race dev->can.state condition
can: flexcan: flexcan_mailbox_read() fix return value for drop = true
net: sh_eth: Fix PHY state warning splat during system resume
net: ravb: Fix PHY state warning splat during system resume
netfilter: nf_ct_ftp: fix deadlock when nat rewrite is needed
netfilter: ebtables: fix memory leak when blob is malformed
netfilter: nf_tables: fix percpu memory leak at nf_tables_addchain()
...
Merge tag 'efi-urgent-for-v6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- Use the right variable to check for shim insecure mode
- Wipe setup_data field when booting via EFI
- Add missing error check to efibc driver
* tag 'efi-urgent-for-v6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: libstub: check Shim mode using MokSBStateRT
efi: x86: Wipe setup_data on pure EFI boot
efi: efibc: Guard against allocation failure
Merge tag 'perf-tools-fixes-for-v6.0-2022-09-21' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Fix polling of system-wide events related to mixing per-cpu and
per-thread events.
- Do not check if /proc/modules is unchanged when copying /proc/kcore,
that doesn't get in the way of post processing analysis.
- Include program header in ELF files generated for JIT files, so that
they can be opened by tools using elfutils libraries.
- Enter namespaces when synthesizing build-ids.
- Fix some bugs related to a recent cpu_map overhaul where we should be
using an index and not the cpu number.
- Fix BPF program ELF section name, using the naming expected by libbpf
when using BPF counters in 'perf stat'.
- Add a new test for perf stat cgroup BPF counter.
- Adjust check on 'perf test wp' for older kernels, where the
PERF_EVENT_IOC_MODIFY_ATTRIBUTES ioctl isn't supported.
- Sync x86 cpufeatures with the kernel sources, no changes in tooling.
* tag 'perf-tools-fixes-for-v6.0-2022-09-21' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
perf tools: Honor namespace when synthesizing build-ids
tools headers cpufeatures: Sync with the kernel sources
perf kcore_copy: Do not check /proc/modules is unchanged
libperf evlist: Fix polling of system-wide events
perf record: Fix cpu mask bit setting for mixed mmaps
perf test: Skip wp modify test on old kernels
perf jit: Include program header in ELF files
perf test: Add a new test for perf stat cgroup BPF counter
perf stat: Use evsel->core.cpus to iterate cpus in BPF cgroup counters
perf stat: Fix cpu map index in bperf cgroup code
perf stat: Fix BPF program section name
ext4: limit the number of retries after discarding preallocations blocks
This patch avoids threads live-locking for hours when a large number
threads are competing over the last few free extents as they blocks
getting added and removed from preallocation pools. From our bug
reporter:
A reliable way for triggering this has multiple writers
continuously write() to files when the filesystem is full, while
small amounts of space are freed (e.g. by truncating a large file
-1MiB at a time). In the local filesystem, this can be done by
simply not checking the return code of write (0) and/or the error
(ENOSPACE) that is set. Over NFS with an async mount, even clients
with proper error checking will behave this way since the linux NFS
client implementation will not propagate the server errors [the
write syscalls immediately return success] until the file handle is
closed. This leads to a situation where NFS clients send a
continuous stream of WRITE rpcs which result in ERRNOSPACE -- but
since the client isn't seeing this, the stream of writes continues
at maximum network speed.
When some space does appear, multiple writers will all attempt to
claim it for their current write. For NFS, we may see dozens to
hundreds of threads that do this.
The real-world scenario of this is database backup tooling (in
particular, github.com/mdkent/percona-xtrabackup) which may write
large files (>1TiB) to NFS for safe keeping. Some temporary files
are written, rewound, and read back -- all before closing the file
handle (the temp file is actually unlinked, to trigger automatic
deletion on close/crash.) An application like this operating on an
async NFS mount will not see an error code until TiB have been
written/read.
The lockup was observed when running this database backup on large
filesystems (64 TiB in this case) with a high number of block
groups and no free space. Fragmentation is generally not a factor
in this filesystem (~thousands of large files, mostly contiguous
except for the parts written while the filesystem is at capacity.)
LuÃs Henriques [Mon, 22 Aug 2022 09:42:35 +0000 (10:42 +0100)]
ext4: fix bug in extents parsing when eh_entries == 0 and eh_depth > 0
When walking through an inode extents, the ext4_ext_binsearch_idx() function
assumes that the extent header has been previously validated. However, there
are no checks that verify that the number of entries (eh->eh_entries) is
non-zero when depth is > 0. And this will lead to problems because the
EXT_FIRST_INDEX() and EXT_LAST_INDEX() will return garbage and result in this:
Olof Johansson [Tue, 20 Sep 2022 16:00:18 +0000 (09:00 -0700)]
serial: sifive: enable clocks for UART when probed
When the PWM driver was changed to disable clocks if no PWMs are enabled,
it ended up also disabling the shared parent with the UART, since the
UART doesn't do any clock enablement on its own.
To avoid these surprises, switch to clk_get_enabled().
Fixes: ace41d7564e655 ("pwm: sifive: Ensure the clk is enabled exactly once per running PWM") Cc: stable <stable@kernel.org> Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Cc: Emil Renner Berthing <emil.renner.berthing@canonical.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com> Reviewed-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Olof Johansson <olof@lixom.net> Link: https://lore.kernel.org/r/20220920160017.7315-1-olof@lixom.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
serial: 8250: omap: Use serial8250_em485_supported
8250_omap uses em485, fill in rs485_supported accordingly. This makes
RS485 work with 8250_omap again, which was broken with the introduction
of the RS485 config sanitization.
Since commit bd5305dcabbc ("tty: serial: fsl_lpuart: do software reset
for imx7ulp and imx8qxp"), certain i.MX UARTs are reset after they've
already been registered. Register state may thus be clobbered after
user space has begun to open and access the UART.
Avoid by performing the reset prior to registration.
Fixes: bd5305dcabbc ("tty: serial: fsl_lpuart: do software reset for imx7ulp and imx8qxp") Cc: stable@vger.kernel.org # v5.15+ Cc: Fugang Duan <fugang.duan@nxp.com> Cc: Sherry Sun <sherry.sun@nxp.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Lukas Wunner <lukas@wunner.de> Link: https://lore.kernel.org/r/72fb646c1b0b11c989850c55f52f9ff343d1b2fa.1662884345.git.lukas@wunner.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Hangbin Liu [Thu, 22 Sep 2022 02:44:53 +0000 (10:44 +0800)]
selftests: forwarding: add shebang for sch_red.sh
RHEL/Fedora RPM build checks are stricter, and complain when executable
files don't have a shebang line, e.g.
*** WARNING: ./kselftests/net/forwarding/sch_red.sh is executable but has no shebang, removing executable bit
Fix it by adding shebang line.
Fixes: 6cf0291f9517 ("selftests: forwarding: Add a RED test for SW datapath") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://lore.kernel.org/r/20220922024453.437757-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 21 Sep 2022 20:10:05 +0000 (13:10 -0700)]
bnxt: prevent skb UAF after handing over to PTP worker
When reading the timestamp is required bnxt_tx_int() hands
over the ownership of the completed skb to the PTP worker.
The skb should not be used afterwards, as the worker may
run before the rest of our code and free the skb, leading
to a use-after-free.
Since dev_kfree_skb_any() accepts NULL make the loss of
ownership more obvious and set skb to NULL.
Fixes: 83bb623c968e ("bnxt_en: Transmit and retrieve packet timestamps") Reviewed-by: Andy Gospodarek <gospo@broadcom.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/20220921201005.335390-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Liang He [Wed, 21 Sep 2022 13:32:45 +0000 (21:32 +0800)]
net: marvell: Fix refcounting bugs in prestera_port_sfp_bind()
In prestera_port_sfp_bind(), there are two refcounting bugs:
(1) we should call of_node_get() before of_find_node_by_name() as
it will automaitcally decrease the refcount of 'from' argument;
(2) we should call of_node_put() for the break of the iteration
for_each_child_of_node() as it will automatically increase and
decrease the 'child'.
net: sched: fix possible refcount leak in tc_new_tfilter()
tfilter_put need to be called to put the refount got by tp->ops->get to
avoid possible refcount leak when chain->tmplt_ops != NULL and
chain->tmplt_ops != tp->ops.
Fixes: 7d5509fa0d3d ("net: sched: extend proto ops with 'put' callback") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Link: https://lore.kernel.org/r/20220921092734.31700-1-hbh25y@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sean Anderson [Tue, 20 Sep 2022 23:50:18 +0000 (19:50 -0400)]
net: sunhme: Fix packet reception for len < RX_COPY_THRESHOLD
There is a separate receive path for small packets (under 256 bytes).
Instead of allocating a new dma-capable skb to be used for the next packet,
this path allocates a skb and copies the data into it (reusing the existing
sbk for the next packet). There are two bytes of junk data at the beginning
of every packet. I believe these are inserted in order to allow aligned DMA
and IP headers. We skip over them using skb_reserve. Before copying over
the data, we must use a barrier to ensure we see the whole packet. The
current code only synchronizes len bytes, starting from the beginning of
the packet, including the junk bytes. However, this leaves off the final
two bytes in the packet. Synchronize the whole packet.
To reproduce this problem, ping a HME with a payload size between 17 and
214
$ ping -s 17 <hme_address>
which will complain rather loudly about the data mismatch. Small packets
(below 60 bytes on the wire) do not have this issue. I suspect this is
related to the padding added to increase the minimum packet size.
====================
bonding: fix NULL deref in bond_rr_gen_slave_id
Fix a NULL dereference of the struct bonding.rr_tx_counter member because
if a bond is initially created with an initial mode != zero (Round Robin)
the memory required for the counter is never created and when the mode is
changed there is never any attempt to verify the memory is allocated upon
switching modes.
====================
Jonathan Toppins [Tue, 20 Sep 2022 17:45:51 +0000 (13:45 -0400)]
selftests: bonding: cause oops in bond_rr_gen_slave_id
This bonding selftest used to cause a kernel oops on aarch64
and should be architectures agnostic.
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jonathan Toppins [Tue, 20 Sep 2022 17:45:52 +0000 (13:45 -0400)]
bonding: fix NULL deref in bond_rr_gen_slave_id
Fix a NULL dereference of the struct bonding.rr_tx_counter member because
if a bond is initially created with an initial mode != zero (Round Robin)
the memory required for the counter is never created and when the mode is
changed there is never any attempt to verify the memory is allocated upon
switching modes.
The fix is to allocate the memory in bond_open() which is guaranteed
to be called before any packets are processed.
Fixes: 848ca9182a7d ("net: bonding: Use per-cpu rr_tx_counter") CC: Jussi Maki <joamaki@gmail.com> Signed-off-by: Jonathan Toppins <jtoppins@redhat.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Michael Walle [Tue, 20 Sep 2022 14:16:19 +0000 (16:16 +0200)]
net: phy: micrel: fix shared interrupt on LAN8814
Since commit ece19502834d ("net: phy: micrel: 1588 support for LAN8814
phy") the handler always returns IRQ_HANDLED, except in an error case.
Before that commit, the interrupt status register was checked and if
it was empty, IRQ_NONE was returned. Restore that behavior to play nice
with the interrupt line being shared with others.
Fixes: ece19502834d ("net: phy: micrel: 1588 support for LAN8814 phy") Signed-off-by: Michael Walle <michael@walle.cc> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Reviewed-by: Divya Koppera <Divya.Koppera@microchip.com> Link: https://lore.kernel.org/r/20220920141619.808117-1-michael@walle.cc Signed-off-by: Jakub Kicinski <kuba@kernel.org>
usb: typec: anx7411: Fix build error without CONFIG_POWER_SUPPLY
Building without CONFIG_POWER_SUPPLY will fail:
drivers/usb/typec/anx7411.o: In function `anx7411_detect_power_mode':
anx7411.c:(.text+0x527): undefined reference to `power_supply_changed'
drivers/usb/typec/anx7411.o: In function `anx7411_psy_set_prop':
anx7411.c:(.text+0x90d): undefined reference to `power_supply_get_drvdata'
anx7411.c:(.text+0x930): undefined reference to `power_supply_changed'
drivers/usb/typec/anx7411.o: In function `anx7411_psy_get_prop':
anx7411.c:(.text+0x94d): undefined reference to `power_supply_get_drvdata'
drivers/usb/typec/anx7411.o: In function `anx7411_i2c_probe':
anx7411.c:(.text+0x111d): undefined reference to
`devm_power_supply_register'
drivers/usb/typec/anx7411.o: In function `anx7411_work_func':
anx7411.c:(.text+0x167c): undefined reference to `power_supply_changed'
anx7411.c:(.text+0x1b55): undefined reference to `power_supply_changed'
counter: 104-quad-8: Fix skipped IRQ lines during events configuration
IRQ trigger configuration is skipped if it has already been set before;
however, the IRQ line still needs to be OR'd to irq_enabled because
irq_enabled is reset for every events_configure call. This patch moves
the irq_enabled OR operation update to before the irq_trigger check so
that IRQ line enablement is not skipped.
arm64: topology: fix possible overflow in amu_fie_setup()
cpufreq_get_hw_max_freq() returns max frequency in kHz as *unsigned int*,
while freq_inv_set_max_ratio() gets passed this frequency in Hz as 'u64'.
Multiplying max frequency by 1000 can potentially result in overflow --
multiplying by 1000ULL instead should avoid that...
Found by Linux Verification Center (linuxtesting.org) with the SVACE static
analysis tool.
Mark Rutland [Tue, 20 Sep 2022 13:47:31 +0000 (14:47 +0100)]
arm64: mm: don't acquire mutex when rewriting swapper
Since commit:
47546a1912fc4a03 ("arm64: mm: install KPTI nG mappings with MMU enabled)"
... when building with CONFIG_DEBUG_ATOMIC_SLEEP=y and booting under
QEMU TCG with '-cpu max', there's a boot-time splat:
| BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
| in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 15, name: migration/0
| preempt_count: 1, expected: 0
| RCU nest depth: 0, expected: 0
| no locks held by migration/0/15.
| irq event stamp: 28
| hardirqs last enabled at (27): [<ffff8000091ed180>] _raw_spin_unlock_irq+0x3c/0x7c
| hardirqs last disabled at (28): [<ffff8000081b8d74>] multi_cpu_stop+0x150/0x18c
| softirqs last enabled at (0): [<ffff80000809a314>] copy_process+0x594/0x1964
| softirqs last disabled at (0): [<0000000000000000>] 0x0
| CPU: 0 PID: 15 Comm: migration/0 Not tainted 6.0.0-rc3-00002-g419b42ff7eef #3
| Hardware name: linux,dummy-virt (DT)
| Stopper: multi_cpu_stop+0x0/0x18c <- stop_cpus.constprop.0+0xa0/0xfc
| Call trace:
| dump_backtrace.part.0+0xd0/0xe0
| show_stack+0x1c/0x5c
| dump_stack_lvl+0x88/0xb4
| dump_stack+0x1c/0x38
| __might_resched+0x180/0x230
| __might_sleep+0x4c/0xa0
| __mutex_lock+0x5c/0x450
| mutex_lock_nested+0x30/0x40
| create_kpti_ng_temp_pgd+0x4fc/0x6d0
| kpti_install_ng_mappings+0x2b8/0x3b0
| cpu_enable_non_boot_scope_capabilities+0x7c/0xd0
| multi_cpu_stop+0xa0/0x18c
| cpu_stopper_thread+0x88/0x11c
| smpboot_thread_fn+0x1ec/0x290
| kthread+0x118/0x120
| ret_from_fork+0x10/0x20
Since commit:
ee017ee353506fce ("arm64/mm: avoid fixmap race condition when create pud mapping")
... once the kernel leave the SYSTEM_BOOTING state, the fixmap pagetable
entries are protected by the fixmap_lock mutex.
The new KPTI rewrite code uses __create_pgd_mapping() to create a
temporary pagetable. This happens in atomic context, after secondary
CPUs are brought up and the kernel has left the SYSTEM_BOOTING state.
Hence we try to acquire a mutex in atomic context, which is generally
unsound (though benign in this case as the mutex should be free and all
other CPUs are quiescent).
This patch avoids the issue by pulling the mutex out of alloc_init_pud()
and calling it at a higher level in the pagetable manipulation code.
This allows it to be used without locking where one CPU is known to be
in exclusive control of the machine, even after having left the
SYSTEM_BOOTING state.
Fixes: 47546a1912fc ("arm64: mm: install KPTI nG mappings with MMU enabled") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20220920134731.1625740-1-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
net/smc: Stop the CLC flow if no link to map buffers on
There might be a potential race between SMC-R buffer map and
link group termination.
smc_smcr_terminate_all() | smc_connect_rdma()
--------------------------------------------------------------
| smc_conn_create()
for links in smcibdev |
schedule links down |
| smc_buf_create()
| \- smcr_buf_map_usable_links()
| \- no usable links found,
| (rmb->mr = NULL)
|
| smc_clc_send_confirm()
| \- access conn->rmb_desc->mr[]->rkey
| (panic)
During reboot and IB device module remove, all links will be set
down and no usable links remain in link groups. In such situation
smcr_buf_map_usable_links() should return an error and stop the
CLC flow accessing to uninitialized mr.
Johan Hovold [Mon, 22 Aug 2022 15:10:27 +0000 (17:10 +0200)]
media: flexcop-usb: fix endpoint type check
Commit d725d20e81c2 ("media: flexcop-usb: sanity checking of endpoint
type") tried to add an endpoint type sanity check for the single
isochronous endpoint but instead broke the driver by checking the wrong
descriptor or random data beyond the last endpoint descriptor.
Make sure to check the right endpoint descriptor.
Fixes: d725d20e81c2 ("media: flexcop-usb: sanity checking of endpoint type") Cc: Oliver Neukum <oneukum@suse.com> Cc: stable@vger.kernel.org # 5.9 Reported-by: Dongliang Mu <mudongliangabcd@gmail.com> Signed-off-by: Johan Hovold <johan@kernel.org> Link: https://lore.kernel.org/r/20220822151027.27026-1-johan@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We currently check the MokSBState variable to decide whether we should
treat UEFI secure boot as being disabled, even if the firmware thinks
otherwise. This is used by shim to indicate that it is not checking
signatures on boot images. In the kernel, we use this to relax lockdown
policies.
However, in cases where shim is not even being used, we don't want this
variable to interfere with lockdown, given that the variable may be
non-volatile and therefore persist across a reboot. This means setting
it once will persistently disable lockdown checks on a given system.
So switch to the mirrored version of this variable, called MokSBStateRT,
which is supposed to be volatile, and this is something we can check.
Cc: <stable@vger.kernel.org> # v4.19+ Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Peter Jones <pjones@redhat.com>
Ard Biesheuvel [Thu, 4 Aug 2022 13:39:48 +0000 (15:39 +0200)]
efi: x86: Wipe setup_data on pure EFI boot
When booting the x86 kernel via EFI using the LoadImage/StartImage boot
services [as opposed to the deprecated EFI handover protocol], the setup
header is taken from the image directly, and given that EFI's LoadImage
has no Linux/x86 specific knowledge regarding struct bootparams or
struct setup_header, any absolute addresses in the setup header must
originate from the file and not from a prior loading stage.
Since we cannot generally predict where LoadImage() decides to load an
image (*), such absolute addresses must be treated as suspect: even if a
prior boot stage intended to make them point somewhere inside the
[signed] image, there is no way to validate that, and if they point at
an arbitrary location in memory, the setup_data nodes will not be
covered by any signatures or TPM measurements either, and could be made
to contain an arbitrary sequence of SETUP_xxx nodes, which could
interfere quite badly with the early x86 boot sequence.
(*) Note that, while LoadImage() does take a buffer/size tuple in
addition to a device path, which can be used to provide the image
contents directly, it will re-allocate such images, as the memory
footprint of an image is generally larger than the PE/COFF file
representation.
Jan Kara [Thu, 8 Sep 2022 09:21:28 +0000 (11:21 +0200)]
ext4: use buckets for cr 1 block scan instead of rbtree
Using rbtree for sorting groups by average fragment size is relatively
expensive (needs rbtree update on every block freeing or allocation) and
leads to wide spreading of allocations because selection of block group
is very sentitive both to changes in free space and amount of blocks
allocated. Furthermore selecting group with the best matching average
fragment size is not necessary anyway, even more so because the
variability of fragment sizes within a group is likely large so average
is not telling much. We just need a group with large enough average
fragment size so that we have high probability of finding large enough
free extent and we don't want average fragment size to be too big so
that we are likely to find free extent only somewhat larger than what we
need.
So instead of maintaing rbtree of groups sorted by fragment size keep
bins (lists) or groups where average fragment size is in the interval
[2^i, 2^(i+1)). This structure requires less updates on block allocation
/ freeing, generally avoids chaotic spreading of allocations into block
groups, and still is able to quickly (even faster that the rbtree)
provide a block group which is likely to have a suitably sized free
space extent.
This patch reduces number of block groups used when untarring archive
with medium sized files (size somewhat above 64k which is default
mballoc limit for avoiding locality group preallocation) to about half
and thus improves write speeds for eMMC flash significantly.
Jan Kara [Thu, 8 Sep 2022 09:21:27 +0000 (11:21 +0200)]
ext4: use locality group preallocation for small closed files
Curently we don't use any preallocation when a file is already closed
when allocating blocks (from writeback code when converting delayed
allocation). However for small files, using locality group preallocation
is actually desirable as that is not specific to a particular file.
Rather it is a method to pack small files together to reduce
fragmentation and for that the fact the file is closed is actually even
stronger hint the file would benefit from packing. So change the logic
to allow locality group preallocation in this case.
Jan Kara [Thu, 8 Sep 2022 09:21:26 +0000 (11:21 +0200)]
ext4: make directory inode spreading reflect flexbg size
Currently the Orlov inode allocator searches for free inodes for a
directory only in flex block groups with at most inodes_per_group/16
more directory inodes than average per flex block group. However with
growing size of flex block group this becomes unnecessarily strict.
Scale allowed difference from average directory count per flex block
group with flex block group size as we do with other metrics.
Jan Kara [Thu, 8 Sep 2022 09:21:25 +0000 (11:21 +0200)]
ext4: avoid unnecessary spreading of allocations among groups
mb_set_largest_free_order() updates lists containing groups with largest
chunk of free space of given order. The way it updates it leads to
always moving the group to the tail of the list. Thus allocations
looking for free space of given order effectively end up cycling through
all groups (and due to initialization in last to first order). This
spreads allocations among block groups which reduces performance for
rotating disks or low-end flash media. Change
mb_set_largest_free_order() to only update lists if the order of the
largest free chunk in the group changed.
Jan Kara [Thu, 8 Sep 2022 09:21:24 +0000 (11:21 +0200)]
ext4: make mballoc try target group first even with mb_optimize_scan
One of the side-effects of mb_optimize_scan was that the optimized
functions to select next group to try were called even before we tried
the goal group. As a result we no longer allocate files close to
corresponding inodes as well as we don't try to expand currently
allocated extent in the same group. This results in reaim regression
with workfile.disk workload of upto 8% with many clients on my test
machine:
Also this is causing huge regression (time increased by a factor of 5 or
so) when untarring archive with lots of small files on some eMMC storage
cards.
Fix the problem by making sure we try goal group first.
Jakub Kicinski [Thu, 22 Sep 2022 01:39:23 +0000 (18:39 -0700)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2022-09-20 (ice)
Michal re-sets TC configuration when changing number of queues.
Mateusz moves the check and call for link-down-on-close to the specific
path for downing/closing the interface.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
ice: Fix interface being down after reset with link-down-on-close flag on
ice: config netdev tc before setting queues number
====================