After commit 41c5ef31ad71 ("x86/ibt: Base IBT bits"), the
.note.gnu.property section is always generated when
CONFIG_X86_KERNEL_IBT is enabled, which causes the first issue to become
visible with an allmodconfig build:
ld.lld: error: arch/x86/entry/vdso/vclock_gettime-x32.o:(.note.gnu.property+0x1c): program property is too short
To avoid this error, do not allow CONFIG_X86_X32_ABI to be selected when
using llvm-objcopy. If the two issues ever get fixed in llvm-objcopy,
this can be turned into a feature check.
Masahiro Yamada [Mon, 14 Mar 2022 19:48:41 +0000 (12:48 -0700)]
x86: Remove toolchain check for X32 ABI capability
Commit 0bf6276392e9 ("x32: Warn and disable rather than error if
binutils too old") added a small test in arch/x86/Makefile because
binutils 2.22 or newer is needed to properly support elf32-x86-64. This
check is no longer necessary, as the minimum supported version of
binutils is 2.23, which is enforced at configuration time with
scripts/min-tool-version.sh.
Remove this check and replace all uses of CONFIG_X86_X32 with
CONFIG_X86_X32_ABI, as two symbols are no longer necessary.
[nathan: Rebase, fix up a few places where CONFIG_X86_X32 was still
used, and simplify commit message to satisfy -tip requirements]
Peter Zijlstra [Tue, 8 Mar 2022 15:30:56 +0000 (16:30 +0100)]
x86/alternative: Use .ibt_endbr_seal to seal indirect calls
Objtool's --ibt option generates .ibt_endbr_seal which lists
superfluous ENDBR instructions. That is those instructions for which
the function is never indirectly called.
Overwrite these ENDBR instructions with a NOP4 such that these
function can never be indirect called, reducing the number of viable
ENDBR targets in the kernel.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:55 +0000 (16:30 +0100)]
objtool: Find unused ENDBR instructions
Find all ENDBR instructions which are never referenced and stick them
in a section such that the kernel can poison them, sealing the
functions from ever being an indirect call target.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:54 +0000 (16:30 +0100)]
objtool: Validate IBT assumptions
Intel IBT requires that every indirect JMP/CALL targets an ENDBR
instructions, failing this #CP happens and we die. Similarly, all
exception entries should be ENDBR.
Find all code relocations and ensure they're either an ENDBR
instruction or ANNOTATE_NOENDBR. For the exceptions look for
UNWIND_HINT_IRET_REGS at sym+0 not being ENDBR.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:53 +0000 (16:30 +0100)]
objtool: Add IBT/ENDBR decoding
Intel IBT requires the target of any indirect CALL or JMP instruction
to be the ENDBR instruction; optionally it allows those two
instructions to have a NOTRACK prefix in order to avoid this
requirement.
The kernel will not enable the use of NOTRACK, as such any occurence
of it in compiler generated code should be flagged.
Teach objtool to Decode ENDBR instructions and WARN about NOTRACK
prefixes.
Both cases are due to a call_on_stack() calling a __noreturn function.
Since that's an inline asm, GCC can't do anything about the
instructions after the CALL. Therefore put in an explicit
ASM_REACHABLE annotation to make sure objtool and gcc are consistently
confused about control flow.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:46 +0000 (16:30 +0100)]
objtool: Ignore extra-symbol code
There's a fun implementation detail on linking STB_WEAK symbols. When
the linker combines two translation units, where one contains a weak
function and the other an override for it. It simply strips the
STB_WEAK symbol from the symbol table, but doesn't actually remove the
code.
The result is that when objtool is ran in a whole-archive kind of way,
it will encounter *heaps* of unused (and unreferenced) code. All
rudiments of weak functions.
Additionally, when a weak implementation is split into a .cold
subfunction that .cold symbol is left in place, even though completely
unused.
Teach objtool to ignore such rudiments by searching for symbol holes;
that is, code ranges that fall outside the given symbol bounds.
Specifically, ignore a sequence of unreachable instruction iff they
occupy a single hole, additionally ignore any .cold subfunctions
referenced.
Both ld.bfd and ld.lld behave like this. LTO builds otoh can (and do)
properly DCE weak functions.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:44 +0000 (16:30 +0100)]
x86/ibt: Ensure module init/exit points have references
Since the references to the module init/exit points only have external
references, a module LTO run will consider them 'unused' and seal
them, leading to an immediate fail on module load.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:43 +0000 (16:30 +0100)]
x86/ibt: Dont generate ENDBR in .discard.text
Having ENDBR in discarded sections can easily lead to relocations into
discarded sections which the linkers aren't really fond of. Objtool
also shouldn't generate them, but why tempt fate.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:40 +0000 (16:30 +0100)]
x86/ibt: Annotate text references
Annotate away some of the generic code references. This is things
where we take the address of a symbol for exception handling or return
addresses (eg. context switch).
Peter Zijlstra [Tue, 8 Mar 2022 15:30:35 +0000 (16:30 +0100)]
x86/ibt: Add IBT feature, MSR and #CP handling
The bits required to make the hardware go.. Of note is that, provided
the syscall entry points are covered with ENDBR, #CP doesn't need to
be an IST because we'll never hit the syscall gap.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:33 +0000 (16:30 +0100)]
x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline
With IBT enabled builds we need ENDBR instructions at indirect jump
target sites, since we start execution of the JIT'ed code through an
indirect jump, the very first instruction needs to be ENDBR.
Similarly, since eBPF tail-calls use indirect branches, their landing
site needs to be an ENDBR too.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:32 +0000 (16:30 +0100)]
x86/ibt,kprobes: Cure sym+0 equals fentry woes
In order to allow kprobes to skip the ENDBR instructions at sym+0 for
X86_KERNEL_IBT builds, change _kprobe_addr() to take an architecture
callback to inspect the function at hand and modify the offset if
needed.
This streamlines the existing interface to cover more cases and
require less hooks. Once PowerPC gets fully converted there will only
be the one arch hook.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:31 +0000 (16:30 +0100)]
x86/ibt,ftrace: Make function-graph play nice
Return trampoline must not use indirect branch to return; while this
preserves the RSB, it is fundamentally incompatible with IBT. Instead
use a retpoline like ROP gadget that defeats IBT while not unbalancing
the RSB.
And since ftrace_stub is no longer a plain RET, don't use it to copy
from. Since RET is a trivial instruction, poke it directly.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:30 +0000 (16:30 +0100)]
x86/livepatch: Validate __fentry__ location
Currently livepatch assumes __fentry__ lives at func+0, which is most
likely untrue with IBT on. Instead make it use ftrace_location() by
default which both validates and finds the actual ip if there is any
in the same symbol.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:29 +0000 (16:30 +0100)]
x86/ibt,ftrace: Search for __fentry__ location
Currently a lot of ftrace code assumes __fentry__ is at sym+0. However
with Intel IBT enabled the first instruction of a function will most
likely be ENDBR.
Change ftrace_location() to not only return the __fentry__ location
when called for the __fentry__ location, but also when called for the
sym+0 location.
Then audit/update all callsites of this function to consistently use
these new semantics.
Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20220308154318.227581603@infradead.org
Peter Zijlstra [Tue, 8 Mar 2022 15:30:24 +0000 (16:30 +0100)]
x86/ibt,entry: Sprinkle ENDBR dust
Kernel entry points should be having ENDBR on for IBT configs.
The SYSCALL entry points are found through taking their respective
address in order to program them in the MSRs, while the exception
entry points are found through UNWIND_HINT_IRET_REGS.
The rule is that any UNWIND_HINT_IRET_REGS at sym+0 should have an
ENDBR, see the later objtool ibt validation patch.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:22 +0000 (16:30 +0100)]
x86/entry,xen: Early rewrite of restore_regs_and_return_to_kernel()
By doing an early rewrite of 'jmp native_iret` in
restore_regs_and_return_to_kernel() we can get rid of the last
INTERRUPT_RETURN user and paravirt_iret.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:21 +0000 (16:30 +0100)]
x86/entry: Cleanup PARAVIRT
Since commit 5c8f6a2e316e ("x86/xen: Add
xenpv_restore_regs_and_return_to_usermode()") Xen will no longer reach
this code and we can do away with the paravirt
SWAPGS/INTERRUPT_RETURN.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:18 +0000 (16:30 +0100)]
x86/ibt: Add ANNOTATE_NOENDBR
In order to have objtool warn about code references to !ENDBR
instruction, we need an annotation to allow this for non-control-flow
instances -- consider text range checks, text patching, or return
trampolines etc.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:17 +0000 (16:30 +0100)]
x86/ibt: Base IBT bits
Add Kconfig, Makefile and basic instruction support for x86 IBT.
(Ab)use __DISABLE_EXPORTS to disable IBT since it's already employed
to mark compressed and purgatory. Additionally mark realmode with it
as well to avoid inserting ENDBR instructions there. While ENDBR is
technically a NOP, inserting them was causing some grief due to code
growth. There's also a problem with using __noendbr in code compiled
without -fcf-protection=branch.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:16 +0000 (16:30 +0100)]
objtool: Have WARN_FUNC fall back to sym+off
Currently WARN_FUNC() either prints func+off and failing that prints
sec+off, add an intermediate sym+off. This is useful when playing
around with entry code.
Peter Zijlstra [Tue, 8 Mar 2022 15:30:15 +0000 (16:30 +0100)]
objtool,efi: Update __efi64_thunk annotation
The current annotation relies on not running objtool on the file; this
won't work when running objtool on vmlinux.o. Instead explicitly mark
__efi64_thunk() to be ignored.
This preserves the status quo, which is somewhat unfortunate. Luckily
this code is hardly ever used.
Fenghua Yu [Mon, 7 Feb 2022 23:02:53 +0000 (15:02 -0800)]
tools/objtool: Check for use of the ENQCMD instruction in the kernel
The ENQCMD instruction implicitly accesses the PASID_MSR to fill in the
pasid field of the descriptor being submitted to an accelerator. But
there is no precise (and stable across kernel changes) point at which
the PASID_MSR is updated from the value for one task to the next.
Kernel code that uses accelerators must always use the ENQCMDS instruction
which does not access the PASID_MSR.
Check for use of the ENQCMD instruction in the kernel and warn on its
usage.
Linus Torvalds [Sun, 13 Mar 2022 17:36:38 +0000 (10:36 -0700)]
Merge tag 'x86_urgent_for_v5.17_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Free shmem backing storage for SGX enclave pages when those are
swapped back into EPC memory
- Prevent do_int3() from being kprobed, to avoid recursion
- Remap setup_data and setup_indirect structures properly when
accessing their members
- Correct the alternatives patching order for modules too
* tag 'x86_urgent_for_v5.17_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Free backing memory after faulting the enclave page
x86/traps: Mark do_int3() NOKPROBE_SYMBOL
x86/boot: Add setup_indirect support in early_memremap_is_setup_data()
x86/boot: Fix memremap of setup_indirect structures
x86/module: Fix the paravirt vs alternative order
Linus Torvalds [Sat, 12 Mar 2022 18:29:25 +0000 (10:29 -0800)]
Merge tag 'perf-tools-fixes-for-v5.17-2022-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Fix event parser error for hybrid systems
- Fix NULL check against wrong variable in 'perf bench' and in the
parsing code
- Update arm64 KVM headers from the kernel sources
- Sync cpufeatures header with the kernel sources
* tag 'perf-tools-fixes-for-v5.17-2022-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
perf parse: Fix event parser error for hybrid systems
perf bench: Fix NULL check against wrong variable
perf parse-events: Fix NULL check against wrong variable
tools headers cpufeatures: Sync with the kernel sources
tools kvm headers arm64: Update KVM headers from the kernel sources
Linus Torvalds [Sat, 12 Mar 2022 18:22:43 +0000 (10:22 -0800)]
Merge tag 'drm-fixes-2022-03-12' of git://anongit.freedesktop.org/drm/drm
Pull drm kconfig fix from Dave Airlie:
"Thorsten pointed out this had fallen down the cracks and was in -next
only, I've picked it out, fixed up it's Fixes: line.
- fix regression in Kconfig"
* tag 'drm-fixes-2022-03-12' of git://anongit.freedesktop.org/drm/drm:
drm/panel: Select DRM_DP_HELPER for DRM_PANEL_EDP
Zhengjun Xing [Mon, 7 Mar 2022 15:16:27 +0000 (23:16 +0800)]
perf parse: Fix event parser error for hybrid systems
This bug happened on hybrid systems when both cpu_core and cpu_atom
have the same event name such as "UOPS_RETIRED.MS" while their event
terms are different, then during perf stat, the event for cpu_atom
will parse fail and then no output for cpu_atom.
It is because event terms in the "head" of parse_events_multi_pmu_add
will be changed to event terms for cpu_core after parsing UOPS_RETIRED.MS
for cpu_core, then when parsing the same event for cpu_atom, it still
uses the event terms for cpu_core, but event terms for cpu_atom are
different with cpu_core, the event parses for cpu_atom will fail. This
patch fixes it, the event terms should be parsed from the original
event.
This patch can work for the hybrid systems that have the same event
in more than 2 PMUs. It also can work in non-hybrid systems.
Before:
# perf stat -v -e UOPS_RETIRED.MS -a sleep 1
Using CPUID GenuineIntel-6-97-1
UOPS_RETIRED.MS -> cpu_core/period=0x1e8483,umask=0x4,event=0xc2,frontend=0x8/
Control descriptor is not initialized
UOPS_RETIRED.MS: 27378451606851848516068518485
Using CPUID GenuineIntel-6-97-1
UOPS_RETIRED.MS -> cpu_core/period=0x1e8483,umask=0x4,event=0xc2,frontend=0x8/
UOPS_RETIRED.MS -> cpu_atom/period=0x1e8483,umask=0x1,event=0xc2/
Control descriptor is not initialized
UOPS_RETIRED.MS: 19775551607695071116076950711
UOPS_RETIRED.MS: 568684 80386942348038694234
Weiguo Li [Fri, 11 Mar 2022 13:06:57 +0000 (21:06 +0800)]
perf parse-events: Fix NULL check against wrong variable
We did a null check after "tmp->symbol = strdup(...)", but we checked
"list->symbol" other than "tmp->symbol".
Reviewed-by: John Garry <john.garry@huawei.com> Signed-off-by: Weiguo Li <liwg06@foxmail.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/tencent_DF39269807EC9425E24787E6DB632441A405@qq.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
tools headers cpufeatures: Sync with the kernel sources
To pick the changes from:
d45476d983240937 ("x86/speculation: Rename RETPOLINE_AMD to RETPOLINE_LFENCE")
Its just a comment fixup.
This only causes these perf files to be rebuilt:
CC /tmp/build/perf/bench/mem-memcpy-x86-64-asm.o
CC /tmp/build/perf/bench/mem-memset-x86-64-asm.o
And addresses this perf build warning:
Warning: Kernel ABI header at 'tools/arch/x86/include/asm/cpufeatures.h' differs from latest version at 'arch/x86/include/asm/cpufeatures.h'
diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h
tools kvm headers arm64: Update KVM headers from the kernel sources
To pick the changes from:
a5905d6af492ee6a ("KVM: arm64: Allow SMCCC_ARCH_WORKAROUND_3 to be discovered and migrated")
That don't causes any changes in tooling (when built on x86), only
addresses this perf build warning:
Warning: Kernel ABI header at 'tools/arch/arm64/include/uapi/asm/kvm.h' differs from latest version at 'arch/arm64/include/uapi/asm/kvm.h'
diff -u tools/arch/arm64/include/uapi/asm/kvm.h arch/arm64/include/uapi/asm/kvm.h
Randy Dunlap [Fri, 11 Mar 2022 19:49:12 +0000 (11:49 -0800)]
ARM: Spectre-BHB: provide empty stub for non-config
When CONFIG_GENERIC_CPU_VULNERABILITIES is not set, references
to spectre_v2_update_state() cause a build error, so provide an
empty stub for that function when the Kconfig option is not set.
Fixes this build error:
arm-linux-gnueabi-ld: arch/arm/mm/proc-v7-bugs.o: in function `cpu_v7_bugs_init':
proc-v7-bugs.c:(.text+0x52): undefined reference to `spectre_v2_update_state'
arm-linux-gnueabi-ld: proc-v7-bugs.c:(.text+0x82): undefined reference to `spectre_v2_update_state'
Fixes: b9baf5c8c5c3 ("ARM: Spectre-BHB workaround") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Cc: Russell King <rmk+kernel@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: patches@armlinux.org.uk Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 11 Mar 2022 20:28:21 +0000 (12:28 -0800)]
Merge tag 'riscv-for-linus-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- prevent users from enabling the alternatives framework (and thus
errata handling) on XIP kernels, where runtime code patching does not
function correctly.
- properly detect offset overflow for AUIPC-based relocations in
modules. This may manifest as modules calling arbitrary invalid
addresses, depending on the address allocated when a module is
loaded.
* tag 'riscv-for-linus-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Fix auipc+jalr relocation range checks
riscv: alternative only works on !XIP_KERNEL
When building for Thumb2, the vectors make use of a local label. Sadly,
the Spectre BHB code also uses a local label with the same number which
results in the Thumb2 reference pointing at the wrong place. Fix this
by changing the number used for the Spectre BHB local label.
Fixes: b9baf5c8c5c3 ("ARM: Spectre-BHB workaround") Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 11 Mar 2022 19:24:58 +0000 (11:24 -0800)]
Merge tag 'mmc-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC core:
- Restore (mostly) the busy polling for MMC_SEND_OP_COND
MMC host:
- meson-gx: Fix DMA usage of meson_mmc_post_req()"
* tag 'mmc-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: core: Restore (almost) the busy polling for MMC_SEND_OP_COND
mmc: meson: Fix usage of meson_mmc_post_req()
Jarkko Sakkinen [Thu, 3 Mar 2022 22:38:58 +0000 (00:38 +0200)]
x86/sgx: Free backing memory after faulting the enclave page
There is a limited amount of SGX memory (EPC) on each system. When that
memory is used up, SGX has its own swapping mechanism which is similar
in concept but totally separate from the core mm/* code. Instead of
swapping to disk, SGX swaps from EPC to normal RAM. That normal RAM
comes from a shared memory pseudo-file and can itself be swapped by the
core mm code. There is a hierarchy like this:
EPC <-> shmem <-> disk
After data is swapped back in from shmem to EPC, the shmem backing
storage needs to be freed. Currently, the backing shmem is not freed.
This effectively wastes the shmem while the enclave is running. The
memory is recovered when the enclave is destroyed and the backing
storage freed.
Sort this out by freeing memory with shmem_truncate_range(), as soon as
a page is faulted back to the EPC. In addition, free the memory for
PCMD pages as soon as all PCMD's in a page have been marked as unused
by zeroing its contents.
Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220303223859.273187-1-jarkko@kernel.org
Linus Torvalds [Fri, 11 Mar 2022 18:28:32 +0000 (10:28 -0800)]
Merge branch 'davidh' (fixes from David Howells)
Merge misc fixes from David Howells:
"A set of patches for watch_queue filter issues noted by Jann. I've
added in a cleanup patch from Christophe Jaillet to convert to using
formal bitmap specifiers for the note allocation bitmap.
Also two filesystem fixes (afs and cachefiles)"
* emailed patches from David Howells <dhowells@redhat.com>:
cachefiles: Fix volume coherency attribute
afs: Fix potential thrashing in afs writeback
watch_queue: Make comment about setting ->defunct more accurate
watch_queue: Fix lack of barrier/sync/lock between post and read
watch_queue: Free the alloc bitmap when the watch_queue is torn down
watch_queue: Fix the alloc bitmap size to reflect notes allocated
watch_queue: Use the bitmap API when applicable
watch_queue: Fix to always request a pow-of-2 pipe ring size
watch_queue: Fix to release page in ->release()
watch_queue, pipe: Free watchqueue state after clearing pipe ring
watch_queue: Fix filter limit check
David Howells [Fri, 11 Mar 2022 16:02:18 +0000 (16:02 +0000)]
cachefiles: Fix volume coherency attribute
A network filesystem may set coherency data on a volume cookie, and if
given, cachefiles will store this in an xattr on the directory in the
cache corresponding to the volume.
The function that sets the xattr just stores the contents of the volume
coherency buffer directly into the xattr, with nothing added; the
checking function, on the other hand, has a cut'n'paste error whereby it
tries to interpret the xattr contents as would be the xattr on an
ordinary file (using the cachefiles_xattr struct). This results in a
failure to match the coherency data because the buffer ends up being
shifted by 18 bytes.
Fix this by defining a structure specifically for the volume xattr and
making both the setting and checking functions use it.
Since the volume coherency doesn't work if used, take the opportunity to
insert a reserved field for future use, set it to 0 and check that it is
0. Log mismatch through the appropriate tracepoint.
Note that this only affects cifs; 9p, afs, ceph and nfs don't use the
volume coherency data at the moment.
Fixes: 32e150037dce ("fscache, cachefiles: Store the volume coherency data") Reported-by: Rohith Surabattula <rohiths.msft@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Steve French <smfrench@gmail.com>
cc: linux-cifs@vger.kernel.org
cc: linux-cachefs@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Fri, 11 Mar 2022 15:58:21 +0000 (15:58 +0000)]
afs: Fix potential thrashing in afs writeback
In afs_writepages_region(), if the dirty page we find is undergoing
writeback or write to cache, but the sync_mode is WB_SYNC_NONE, we go
round the loop trying the same page again and again with no pausing or
waiting unless and until another thread manages to clear the writeback
and fscache flags.
Fix this with three measures:
(1) Advance start to after the page we found.
(2) Break out of the loop and return if rescheduling is requested.
(3) Arbitrarily give up after a maximum of 5 skips.
Fixes: 31143d5d515e ("AFS: implement basic file write support") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Marc Dionne <marc.dionne@auristor.com> Acked-by: Marc Dionne <marc.dionne@auristor.com> Link: https://lore.kernel.org/r/164692725757.2097000.2060513769492301854.stgit@warthog.procyon.org.uk/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Huafei [Thu, 10 Mar 2022 12:09:15 +0000 (20:09 +0800)]
x86/traps: Mark do_int3() NOKPROBE_SYMBOL
Since kprobe_int3_handler() is called in do_int3(), probing do_int3()
can cause a breakpoint recursion and crash the kernel. Therefore,
do_int3() should be marked as NOKPROBE_SYMBOL.
David Howells [Fri, 11 Mar 2022 13:24:47 +0000 (13:24 +0000)]
watch_queue: Make comment about setting ->defunct more accurate
watch_queue_clear() has a comment stating that setting ->defunct to true
preventing new additions as well as preventing notifications. Whilst
the latter is true, the first bit is superfluous since at the time this
function is called, the pipe cannot be accessed to add new event
sources.
Remove the "new additions" bit from the comment.
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Fri, 11 Mar 2022 13:24:36 +0000 (13:24 +0000)]
watch_queue: Fix lack of barrier/sync/lock between post and read
There's nothing to synchronise post_one_notification() versus
pipe_read(). Whilst posting is done under pipe->rd_wait.lock, the
reader only takes pipe->mutex which cannot bar notification posting as
that may need to be made from contexts that cannot sleep.
Fix this by setting pipe->head with a barrier in post_one_notification()
and reading pipe->head with a barrier in pipe_read().
If that's not sufficient, the rd_wait.lock will need to be taken,
possibly in a ->confirm() op so that it only applies to notifications.
The lock would, however, have to be dropped before copy_page_to_iter()
is invoked.
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Fri, 11 Mar 2022 13:24:22 +0000 (13:24 +0000)]
watch_queue: Fix the alloc bitmap size to reflect notes allocated
Currently, watch_queue_set_size() sets the number of notes available in
wqueue->nr_notes according to the number of notes allocated, but sets
the size of the bitmap to the unrounded number of notes originally asked
for.
Fix this by setting the bitmap size to the number of notes we're
actually going to make available (ie. the number allocated).
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Fri, 11 Mar 2022 13:24:08 +0000 (13:24 +0000)]
watch_queue: Fix to always request a pow-of-2 pipe ring size
The pipe ring size must always be a power of 2 as the head and tail
pointers are masked off by AND'ing with the size of the ring - 1.
watch_queue_set_size(), however, lets you specify any number of notes
between 1 and 511. This number is passed through to pipe_resize_ring()
without checking/forcing its alignment.
Fix this by rounding the number of slots required up to the nearest
power of two. The request is meant to guarantee that at least that many
notifications can be generated before the queue is full, so rounding
down isn't an option, but, alternatively, it may be better to give an
error if we aren't allowed to allocate that much ring space.
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Fri, 11 Mar 2022 13:23:46 +0000 (13:23 +0000)]
watch_queue: Fix to release page in ->release()
When a pipe ring descriptor points to a notification message, the
refcount on the backing page is incremented by the generic get function,
but the release function, which marks the bitmap, doesn't drop the page
ref.
Fix this by calling generic_pipe_buf_release() at the end of
watch_queue_pipe_buf_release().
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Fri, 11 Mar 2022 13:23:38 +0000 (13:23 +0000)]
watch_queue, pipe: Free watchqueue state after clearing pipe ring
In free_pipe_info(), free the watchqueue state after clearing the pipe
ring as each pipe ring descriptor has a release function, and in the
case of a notification message, this is watch_queue_pipe_buf_release()
which tries to mark the allocation bitmap that was previously released.
Fix this by moving the put of the pipe's ref on the watch queue to after
the ring has been cleared. We still need to call watch_queue_clear()
before doing that to make sure that the pipe is disconnected from any
notification sources first.
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Fri, 11 Mar 2022 13:23:31 +0000 (13:23 +0000)]
watch_queue: Fix filter limit check
In watch_queue_set_filter(), there are a couple of places where we check
that the filter type value does not exceed what the type_filter bitmap
can hold. One place calculates the number of bits by:
if (tf[i].type >= sizeof(wfilter->type_filter) * 8)
which is fine, but the second does:
if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG)
which is not. This can lead to a couple of out-of-bounds writes due to
a too-large type:
(1) __set_bit() on wfilter->type_filter
(2) Writing more elements in wfilter->filters[] than we allocated.
Fix this by just using the proper WATCH_TYPE__NR instead, which is the
number of types we actually know about.
The bug may cause an oops looking something like:
BUG: KASAN: slab-out-of-bounds in watch_queue_set_filter+0x659/0x740
Write of size 4 at addr ffff88800d2c66bc by task watch_queue_oob/611
...
Call Trace:
<TASK>
dump_stack_lvl+0x45/0x59
print_address_description.constprop.0+0x1f/0x150
...
kasan_report.cold+0x7f/0x11b
...
watch_queue_set_filter+0x659/0x740
...
__x64_sys_ioctl+0x127/0x190
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Allocated by task 611:
kasan_save_stack+0x1e/0x40
__kasan_kmalloc+0x81/0xa0
watch_queue_set_filter+0x23a/0x740
__x64_sys_ioctl+0x127/0x190
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88800d2c66a0
which belongs to the cache kmalloc-32 of size 32
The buggy address is located 28 bytes inside of
32-byte region [ffff88800d2c66a0, ffff88800d2c66c0)
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 11 Mar 2022 05:15:42 +0000 (21:15 -0800)]
Merge tag 'drm-fixes-2022-03-11' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"As expected at this stage its pretty quiet, one sun4i mixer fix and
one i915 display flicker fix:
i915:
- fix psr screen flicker
sun4i:
- mixer format fix"
* tag 'drm-fixes-2022-03-11' of git://anongit.freedesktop.org/drm/drm:
drm/sun4i: mixer: Fix P010 and P210 format numbers
drm/i915/psr: Set "SF Partial Frame Enable" also on full update
RISC-V can do PC-relative jumps with a 32bit range using the following
two instructions:
auipc t0, imm20 ; t0 = PC + imm20 * 2^12
jalr ra, t0, imm12 ; ra = PC + 4, PC = t0 + imm12
Crucially both the 20bit immediate imm20 and the 12bit immediate imm12
are treated as two's-complement signed values. For this reason the
immediates are usually calculated like this:
..where offset is the signed offset from the auipc instruction. When
the 11th bit of offset is 0 the addition of 0x800 doesn't change the top
20 bits and imm12 considered positive. When the 11th bit is 1 the carry
of the addition by 0x800 means imm20 is one higher, but since imm12 is
then considered negative the two's complement representation means it
all cancels out nicely.
However, this addition by 0x800 (2^11) means an offset greater than or
equal to 2^31 - 2^11 would overflow so imm20 is considered negative and
result in a backwards jump. Similarly the lower range of offset is also
moved down by 2^11 and hence the true 32bit range is
Linus Torvalds [Fri, 11 Mar 2022 01:23:08 +0000 (17:23 -0800)]
Merge tag 'trace-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Minor tracing fixes:
- Fix unregistering the same event twice. A user could disable the
same event that osnoise will disable on unregistering.
- Inform RCU of a quiescent state in the osnoise testing thread.
- Fix some kerneldoc comments"
* tag 'trace-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix some W=1 warnings in kernel doc comments
tracing/osnoise: Force quiescent states while tracing
tracing/osnoise: Do not unregister events twice
Linus Torvalds [Fri, 11 Mar 2022 00:47:58 +0000 (16:47 -0800)]
Merge tag 'net-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth, and ipsec.
Current release - regressions:
- Bluetooth: fix unbalanced unlock in set_device_flags()
- Bluetooth: fix not processing all entries on cmd_sync_work, make
connect with qualcomm and intel adapters reliable
- Revert "xfrm: state and policy should fail if XFRMA_IF_ID 0"
- xdp: xdp_mem_allocator can be NULL in trace_mem_connect()
- eth: ice: fix race condition and deadlock during interface enslave
Current release - new code bugs:
- tipc: fix incorrect order of state message data sanity check
Previous releases - regressions:
- esp: fix possible buffer overflow in ESP transformation
- dsa: unlock the rtnl_mutex when dsa_master_setup() fails
- phy: meson-gxl: fix interrupt handling in forced mode
- smsc95xx: ignore -ENODEV errors when device is unplugged
Previous releases - always broken:
- xfrm: fix tunnel mode fragmentation behavior
- esp: fix inter address family tunneling on GSO
- tipc: fix null-deref due to race when enabling bearer
- sctp: fix kernel-infoleak for SCTP sockets
- eth: macb: fix lost RX packet wakeup race in NAPI receive
- eth: intel stop disabling VFs due to PF error responses
- eth: bcmgenet: don't claim WOL when its not available"
* tag 'net-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
xdp: xdp_mem_allocator can be NULL in trace_mem_connect().
ice: Fix race condition during interface enslave
net: phy: meson-gxl: improve link-up behavior
net: bcmgenet: Don't claim WOL when its not available
net: arc_emac: Fix use after free in arc_mdio_probe()
sctp: fix kernel-infoleak for SCTP sockets
net: phy: correct spelling error of media in documentation
net: phy: DP83822: clear MISR2 register to disable interrupts
gianfar: ethtool: Fix refcount leak in gfar_get_ts_info
selftests: pmtu.sh: Kill nettest processes launched in subshell.
selftests: pmtu.sh: Kill tcpdump processes launched by subshell.
NFC: port100: fix use-after-free in port100_send_complete
net/mlx5e: SHAMPO, reduce TIR indication
net/mlx5e: Lag, Only handle events from highest priority multipath entry
net/mlx5: Fix offloading with ESWITCH_IPV4_TTL_MODIFY_ENABLE
net/mlx5: Fix a race on command flush flow
net/mlx5: Fix size field in bufferx_reg struct
ax25: Fix NULL pointer dereference in ax25_kill_by_device
net: marvell: prestera: Add missing of_node_put() in prestera_switch_set_base_mac_addr
net: ethernet: lpc_eth: Handle error for clk_enable
...
xdp: xdp_mem_allocator can be NULL in trace_mem_connect().
Since the commit mentioned below __xdp_reg_mem_model() can return a NULL
pointer. This pointer is dereferenced in trace_mem_connect() which leads
to segfault.
The trace points (mem_connect + mem_disconnect) were put in place to
pair connect/disconnect using the IDs. The ID is only assigned if
__xdp_reg_mem_model() does not return NULL. That connect trace point is
of no use if there is no ID.
Skip that connect trace point if xdp_alloc is NULL.
[ Toke Høiland-Jørgensen delivered the reasoning for skipping the trace
point ]
Fixes: 4a48ef70b93b8 ("xdp: Allow registering memory model without rxq reference") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/YikmmXsffE+QajTB@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ivan Vecera [Thu, 10 Mar 2022 17:16:41 +0000 (18:16 +0100)]
ice: Fix race condition during interface enslave
Commit 5dbbbd01cbba83 ("ice: Avoid RTNL lock when re-creating
auxiliary device") changes a process of re-creation of aux device
so ice_plug_aux_dev() is called from ice_service_task() context.
This unfortunately opens a race window that can result in dead-lock
when interface has left LAG and immediately enters LAG again.
Reproducer:
```
#!/bin/sh
ip link add lag0 type bond mode 1 miimon 100
ip link set lag0
for n in {1..10}; do
echo Cycle: $n
ip link set ens7f0 master lag0
sleep 1
ip link set ens7f0 nomaster
done
```
1. Command 'ip link ... set nomaster' causes that ice_plug_aux_dev()
is called from ice_service_task() context, aux device is created
and associated device->lock is taken.
2. Command 'ip link ... set master...' calls ice's notifier under
RTNL lock and that notifier calls ice_unplug_aux_dev(). That
function tries to take aux device->lock but this is already taken
by ice_plug_aux_dev() in step 1
3. Later ice_plug_aux_dev() tries to take RTNL lock but this is already
taken in step 2
4. Dead-lock
The patch fixes this issue by following changes:
- Bit ICE_FLAG_PLUG_AUX_DEV is kept to be set during ice_plug_aux_dev()
call in ice_service_task()
- The bit is checked in ice_clear_rdma_cap() and only if it is not set
then ice_unplug_aux_dev() is called. If it is set (in other words
plugging of aux device was requested and ice_plug_aux_dev() is
potentially running) then the function only clears the bit
- Once ice_plug_aux_dev() call (in ice_service_task) is finished
the bit ICE_FLAG_PLUG_AUX_DEV is cleared but it is also checked
whether it was already cleared by ice_clear_rdma_cap(). If so then
aux device is unplugged.
Signed-off-by: Ivan Vecera <ivecera@redhat.com> Co-developed-by: Petr Oros <poros@redhat.com> Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Dave Ertman <david.m.ertman@intel.com> Link: https://lore.kernel.org/r/20220310171641.3863659-1-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jeremy Linton [Thu, 10 Mar 2022 04:55:35 +0000 (22:55 -0600)]
net: bcmgenet: Don't claim WOL when its not available
Some of the bcmgenet platforms don't correctly support WOL, yet
ethtool returns:
"Supports Wake-on: gsf"
which is false.
Ideally if there isn't a wol_irq, or there is something else that
keeps the device from being able to wakeup it should display:
"Supports Wake-on: d"
This patch checks whether the device can wakup, before using the
hard-coded supported flags. This corrects the ethtool reporting, as
well as the WOL configuration because ethtool verifies that the mode
is supported before attempting it.
Fixes: c51de7f3976b ("net: bcmgenet: add Wake-on-LAN support code") Signed-off-by: Jeremy Linton <jeremy.linton@arm.com> Tested-by: Peter Robinson <pbrobinson@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20220310045535.224450-1-jeremy.linton@arm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jianglei Nie [Wed, 9 Mar 2022 12:18:24 +0000 (20:18 +0800)]
net: arc_emac: Fix use after free in arc_mdio_probe()
If bus->state is equal to MDIOBUS_ALLOCATED, mdiobus_free(bus) will free
the "bus". But bus->name is still used in the next line, which will lead
to a use after free.
We can fix it by putting the name in a local variable and make the
bus->name point to the rodata section "name",then use the name in the
error message without referring to bus to avoid the uaf.
Jakub Kicinski [Thu, 10 Mar 2022 22:32:32 +0000 (14:32 -0800)]
Merge tag 'mlx5-fixes-2022-03-09' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2022-03-09
This series provides bug fixes to mlx5 driver.
* tag 'mlx5-fixes-2022-03-09' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: SHAMPO, reduce TIR indication
net/mlx5e: Lag, Only handle events from highest priority multipath entry
net/mlx5: Fix offloading with ESWITCH_IPV4_TTL_MODIFY_ENABLE
net/mlx5: Fix a race on command flush flow
net/mlx5: Fix size field in bufferx_reg struct
====================
Linus Torvalds [Thu, 10 Mar 2022 20:43:06 +0000 (12:43 -0800)]
Merge tag 'staging-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver fixes from Greg KH:
"Here are three small fixes for staging drivers for 5.17-rc8 or -final,
which ever comes next.
They resolve some reported problems:
- rtl8723bs wifi driver deadlock fix for reported problem that is a
revert of a previous patch. Also a documentation fix is added so
that the same problem hopefully can not come back again.
- gdm724x driver use-after-free fix for a reported problem.
All of these have been in linux-next for a while with no reported
problems"
* tag 'staging-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: rtl8723bs: Improve the comment explaining the locking rules
staging: rtl8723bs: Fix access-point mode deadlock
staging: gdm724x: fix use after free in gdm_lte_rx()
Miaoqian Lin [Thu, 10 Mar 2022 01:53:13 +0000 (01:53 +0000)]
gianfar: ethtool: Fix refcount leak in gfar_get_ts_info
The of_find_compatible_node() function returns a node pointer with
refcount incremented, We should use of_node_put() on it when done
Add the missing of_node_put() to release the refcount.
Fixes: 7349a74ea75c ("net: ethernet: gianfar_ethtool: get phc index through drvdata") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20220310015313.14938-1-linmq006@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>