Namhyung Kim [Mon, 29 Jul 2024 00:41:25 +0000 (17:41 -0700)]
perf ftrace: Factor out check_ftrace_capable()
The check is a common part of the ftrace commands, let's move it out.
Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Changbin Du <changbin.du@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lore.kernel.org/lkml/20240729004127.238611-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Mon, 29 Jul 2024 00:41:24 +0000 (17:41 -0700)]
perf ftrace: Add 'tail' option to --graph-opts
The 'graph-tail' option is to print function name as a comment at the end.
This is useful when a large function is mixed with other functions
(possibly from different CPUs).
For example,
$ sudo perf ftrace -- perf stat true
...
1) | get_unused_fd_flags() {
1) | alloc_fd() {
1) 0.178 us | _raw_spin_lock();
1) 0.187 us | expand_files();
1) 0.169 us | _raw_spin_unlock();
1) 1.211 us | }
1) 1.503 us | }
$ sudo perf ftrace --graph-opts tail -- perf stat true
...
1) | get_unused_fd_flags() {
1) | alloc_fd() {
1) 0.099 us | _raw_spin_lock();
1) 0.083 us | expand_files();
1) 0.081 us | _raw_spin_unlock();
1) 0.601 us | } /* alloc_fd */
1) 0.751 us | } /* get_unused_fd_flags */
Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Changbin Du <changbin.du@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lore.kernel.org/lkml/20240729004127.238611-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Commit aa1551f299ba ("perf test pmu: Refactor format test and exposed
test APIs") added the 'test_pmus' list, but didn't use it.
(It seems to put them on the other_pmus list?)
Remove it.
Fixes: aa1551f299ba414c ("perf test pmu: Refactor format test and exposed test APIs") Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Cc: Ian Rogers <irogers@google.com> Cc: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/lkml/20240727175919.1041468-1-linux@treblig.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Mon, 15 Jul 2024 16:07:05 +0000 (19:07 +0300)]
perf tools: Enable evsel__is_aux_event() to work for S390_CPUMSF
evsel__is_aux_event() identifies AUX area tracing selected events.
S390_CPUMSF uses a raw event type (PERF_TYPE_RAW - refer
s390_cpumsf_evsel_is_auxtrace()) not a PMU type value that could be checked
in evsel__is_aux_event(). However it sets needs_auxtrace_mmap (refer
auxtrace_record__init()), so check that first.
Currently, the features that use evsel__is_aux_event() are used only by
Intel PT, but that may change in the future.
Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Leo Yan <leo.yan@arm.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Acked-by: Ian Rogers <irogers@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Hendrik Brueckner <brueckner@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@arm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mike Leach <mike.leach@linaro.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suzuki Poulouse <suzuki.poulose@arm.com> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: coresight@lists.linaro.org Cc: linux-arm-kernel@lists.infradead.org Link: https://lore.kernel.org/r/20240715160712.127117-7-adrian.hunter@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adrian Hunter [Mon, 15 Jul 2024 16:07:04 +0000 (19:07 +0300)]
perf tools: Enable evsel__is_aux_event() to work for ARM/ARM64
Set pmu->auxtrace on ARM/ARM64 AUX area PMUs. evsel__is_aux_event() needs
the setting to identify AUX area tracing selected events.
Currently, the features that use evsel__is_aux_event() are used only by
Intel PT, but that may change in the future.
Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Leo Yan <leo.yan@arm.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Acked-by: Ian Rogers <irogers@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Hendrik Brueckner <brueckner@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@arm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mike Leach <mike.leach@linaro.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suzuki Poulouse <suzuki.poulose@arm.com> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yicong Yang <yangyicong@hisilicon.com> Cc: coresight@lists.linaro.org Cc: linux-arm-kernel@lists.infradead.org Link: https://lore.kernel.org/r/20240715160712.127117-6-adrian.hunter@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
James Clark [Tue, 23 Jul 2024 13:28:58 +0000 (14:28 +0100)]
perf scripts python cs-etm: Restore first sample log in verbose mode
The linked commit moved the early return on the first sample to before
the verbose log, so move the log earlier too. Now the first sample is
also logged and not skipped.
Fixes: 2d98dbb4c9c5b09c ("perf scripts python arm-cs-trace-disasm.py: Do not ignore disam first sample") Reviewed-by: Leo Yan <leo.yan@arm.com> Signed-off-by: James Clark <james.clark@linaro.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Benjamin Gray <bgray@linux.ibm.com> Cc: coresight@lists.linaro.org Cc: gankulkarni@os.amperecomputing.com Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mike Leach <mike.leach@linaro.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ruidong Tian <tianruidong@linux.alibaba.com> Cc: Suzuki Poulouse <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20240723132858.12747-1-james.clark@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
James Clark [Mon, 22 Jul 2024 15:27:56 +0000 (16:27 +0100)]
perf cs-etm: Output 0 instead of 0xdeadbeef when exception packets are flushed
Normally exception packets don't directly output a branch sample, but
if they're the last record in a buffer then they will. Because they
don't have addresses set we'll see the placeholder value
CS_ETM_INVAL_ADDR (0xdeadbeef) in the output.
Since commit 6035b6804bdf ("perf cs-etm: Support dummy address value for
CS_ETM_TRACE_ON packet") we've used 0 as an externally visible "not set"
address value. For consistency reasons and to not make exceptions look
like an error, change them to use 0 too.
This is particularly visible when doing userspace only tracing because
trace is disabled when jumping to the kernel, causing the flush and then
forcing the last exception packet to be emitted as a branch. With kernel
trace included, there is no flush so exception packets don't generate
samples until the next range packet and they'll pick up the correct
address.
Reviewed-by: Mike Leach <mike.leach@linaro.org> Signed-off-by: James Clark <james.clark@linaro.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@arm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Garry <john.g.garry@oracle.com> Cc: Leo Yan <leo.yan@arm.com> Cc: Leo Yan <leo.yan@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suzuki Poulouse <suzuki.poulose@arm.com> Cc: Will Deacon <will@kernel.org> Cc: coresight@lists.linaro.org Cc: gankulkarni@os.amperecomputing.com Cc: linux-arm-kernel@lists.infradead.org Link: https://lore.kernel.org/r/20240722152756.59453-2-james.clark@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Chen Ni [Tue, 16 Jul 2024 07:53:47 +0000 (15:53 +0800)]
perf inject: Convert comma to semicolon
Replace a comma between expression statements by a semicolon.
Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20240716075347.969041-1-nichen@iscas.ac.cn Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Chen Ni [Tue, 16 Jul 2024 07:43:40 +0000 (15:43 +0800)]
perf daemon: Convert comma to semicolon
Replace a comma between expression statements by a semicolon.
Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20240716074340.968909-1-nichen@iscas.ac.cn Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf annotate: Add support to use libcapstone in powerpc
Now perf uses the capstone library to disassemble the instructions in
x86. capstone is used (if available) for perf annotate to speed up.
Currently it only supports x86 architecture.
This patch includes changes to enable this in powerpc.
For now, only for data type sort keys, this method is used and only
binary code (raw instruction) is read. This is because powerpc approach
to understand instructions and reg fields uses raw instruction.
The "cs_disasm" is currently not enabled. While attempting to do
cs_disasm, observation is that some of the instructions were not
identified (ex: extswsli, maddld) and it had to fallback to use objdump.
Hence enabling "cs_disasm" is added in comment section as a TODO for
powerpc.
Reviewed-by: Kajol Jain <kjain@linux.ibm.com> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Tested-by: Kajol Jain <kjain@linux.ibm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Akanksha J N <akanksha@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Disha Goel <disgoel@linux.vnet.ibm.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Link: https://lore.kernel.org/lkml/20240718084358.72242-15-atrajeev@linux.vnet.ibm.com
[ Use dso__nsinfo(dso) as required to match EXTRA_CFLAGS=-DREFCNT_CHECKING=1 build expectations ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf annotate: Use capstone_init and remove open_capstone_handle from disasm.c
capstone_init is made availbale for all archs to use and updated to
enable support for CS_ARCH_PPC as well. Patch removes
open_capstone_handle and uses capstone_init in all the places.
Committer notes:
Avoid including capstone/capstone.h from print_insn.h to not break the
build in builtin-script.c due to the namespace clash with libbpf:
/usr/include/capstone/bpf.h:94:14: error: 'bpf_insn' defined as wrong kind of tag
Reviewed-by: Kajol Jain <kjain@linux.ibm.com> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Tested-by: Kajol Jain <kjain@linux.ibm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Akanksha J N <akanksha@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Disha Goel <disgoel@linux.vnet.ibm.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Link: https://lore.kernel.org/lkml/20240718084358.72242-14-atrajeev@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf annotate: Add some of the arithmetic instructions to support instruction tracking in powerpc
Data-type profiling has the concept of instruction tracking.
Example sequence in powerpc:
ld r10,264(r3)
mr r31,r3
<<after some sequence>
ld r9,312(r31)
or differently
lwz r10,264(r3)
add r31, r3, RB
lwz r9, 0(r31)
If a sample is hit at "lwz r9, 0(r31)", data type of r31 depends
on previous instruction sequence here. So to track the previous
instructions, patch adds changes to identify some of the arithmetic
instructions which are having opcode as 31.
Since memory instructions also has cases with opcode 31, use the bits
22:30 to filter the arithmetic instructions here.
Also there are instructions with just two operands like "addme", "addze".
This patch adds new instructions ops "arithmetic_ops" to handle this
Reviewed-by: Kajol Jain <kjain@linux.ibm.com> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Tested-by: Kajol Jain <kjain@linux.ibm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Akanksha J N <akanksha@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Disha Goel <disgoel@linux.vnet.ibm.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Link: https://lore.kernel.org/lkml/20240718084358.72242-10-atrajeev@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The opcode for "ldx" is 31. There are other instructions also with
opcode 31 which are memory insn like ldux, stbx, lwzx, lhaux
But all instructions with opcode 31 are not memory. Example is add
instruction: "add RT,RA,RB"
The value in bit 21-30 [ 21 for ldx ] is different for these
instructions. Patch uses this value to assign instruction ops for these
cases. The naming convention and value to identify these are picked from
defines in "arch/powerpc/include/asm/ppc-opcode.h"
Reviewed-by: Kajol Jain <kjain@linux.ibm.com> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Tested-by: Kajol Jain <kjain@linux.ibm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Akanksha J N <akanksha@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Disha Goel <disgoel@linux.vnet.ibm.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Link: https://lore.kernel.org/lkml/20240718084358.72242-9-atrajeev@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf annotate: Add parse function for memory instructions in powerpc
Use the raw instruction code and macros to identify memory instructions,
extract register fields and also offset.
The implementation addresses the D-form, X-form, DS-form instructions.
Two main functions are added.
New parse function "load_store__parse" as instruction ops parser for
memory instructions.
Unlike other parsers (like mov__parse), this one fills in the
"multi_regs" field for source/target and new added "mem_ref" field. No
other fields are set because, here there is no need to parse the
disassembled code and arch specific macros will take care of extracting
offset and regs which is easier and will be precise.
In powerpc, all instructions with a primary opcode from 32 to 63
are memory instructions. Update "ins__find" function to have "raw_insn"
also as a parameter.
Reviewed-by: Kajol Jain <kjain@linux.ibm.com> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Tested-by: Kajol Jain <kjain@linux.ibm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Akanksha J N <akanksha@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Disha Goel <disgoel@linux.vnet.ibm.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Link: https://lore.kernel.org/lkml/20240718084358.72242-8-atrajeev@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf annotate: Update parameters for reg extract functions to use raw instruction on powerpc
Use the raw instruction code and macros to identify memory instructions,
extract register fields and also offset.
The implementation addresses the D-form, X-form, DS-form instructions.
Adds "mem_ref" field to check whether source/target has memory
reference.
Add function "get_powerpc_regs" which will set these fields: reg1, reg2,
offset depending of where it is source or target ops.
Update "parse" callback for "struct ins_ops" to also pass "struct
disasm_line" as argument. This is needed in parse functions where opcode
is used to determine whether to set multi_regs and other fields
Reviewed-by: Kajol Jain <kjain@linux.ibm.com> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Tested-by: Kajol Jain <kjain@linux.ibm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Akanksha J N <akanksha@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Disha Goel <disgoel@linux.vnet.ibm.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Link: https://lore.kernel.org/lkml/20240718084358.72242-7-atrajeev@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf annotate: Add support to capture and parse raw instruction in powerpc using dso__data_read_offset utility
Add support to capture and parse raw instruction in powerpc.
Currently, the perf tool infrastructure uses two ways to disassemble
and understand the instruction. One is objdump and other option is
via libcapstone.
Currently, the perf tool infrastructure uses "--no-show-raw-insn" option
with "objdump" while disassemble. Example from powerpc with this option
for an instruction address is:
This line "lwz r10,0(r9)" is parsed to extract instruction name,
registers names and offset. Also to find whether there is a memory
reference in the operands, "memory_ref_char" field of objdump is used.
For x86, "(" is used as memory_ref_char to tackle instructions of the
form "mov (%rax), %rcx".
In case of powerpc, not all instructions using "(" are the only memory
instructions. Example, above instruction can also be of extended form (X
form) "lwzx r10,0,r19". Inorder to easy identify the instruction category
and extract the source/target registers, patch adds support to use raw
instruction for powerpc. Approach used is to read the raw instruction
directly from the DSO file using "dso__data_read_offset" utility which
is already implemented in perf infrastructure in "util/dso.c".
Example:
38 01 81 e8 ld r4,312(r1)
Here "38 01 81 e8" is the raw instruction representation. In powerpc,
this translates to instruction form: "ld RT,DS(RA)" and binary code
as:
Function "symbol__disassemble_dso" is updated to read raw instruction
directly from DSO using dso__data_read_offset utility. In case of
above example, this captures:
line: 38 01 81 e8
The above works well when 'perf report' is invoked with only sort keys
for data type ie type and typeoff.
Because there is no instruction level annotation needed if only data
type information is requested for.
For annotating sample, along with type and typeoff sort key, "sym" sort
key is also needed. And by default invoking just "perf report" uses sort
key "sym" that displays the symbol information.
With approach changes in powerpc which first reads DSO for raw
instruction, "perf annotate" and "perf report" + a key breaks since
it doesn't do the instruction level disassembly.
Here, raw instruction is displayed in the output instead of human
readable annotated form.
One way to get the appropriate data is to specify "--objdump path", by
which code annotation will be done. But the default behaviour will be
changed. To fix this breakage, check if "sym" sort key is set. If so
fallback and use the libcapstone/objdump way of disassmbling the sample.
With the changes and "perf report"
Samples: 1K of event 'mem-loads', 4000 Hz, Event count (approx.): 937238
do_work /usr/bin/pmlogger [Percent: local period]
Percentâ ld r17,16(r3)
â addi r18,r21,16
â li r19,0
Just add the extern for 'sort_order' in disasm.c so that we don't end up
breaking the build due to this type colision with capstone and libbpf:
In file included from /usr/include/capstone/capstone.h:325,
from /git/perf-6.10.0/tools/perf/util/print_insn.h:23,
from builtin-script.c:38:
/usr/include/capstone/bpf.h:94:14: error: 'bpf_insn' defined as wrong kind of tag
94 | typedef enum bpf_insn {
I reported this to the bpf mailing list, see one of the links below.
Reviewed-by: Kajol Jain <kjain@linux.ibm.com> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Tested-by: Kajol Jain <kjain@linux.ibm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Akanksha J N <akanksha@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Disha Goel <disgoel@linux.vnet.ibm.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Link: https://lore.kernel.org/lkml/20240718084358.72242-6-atrajeev@linux.vnet.ibm.com Link: https://lore.kernel.org/bpf/ZqOltPk9VQGgJZAA@x1/T/#u Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This line "lwz r10,0(r9)" is parsed to extract instruction name,
registers names and offset.
In powerpc, the approach for data type profiling uses raw instruction
instead of result from objdump to identify the instruction category and
extract the source/target registers.
Example: 38 01 81 e8 ld r4,312(r1)
Here "38 01 81 e8" is the raw instruction representation. Add function
"disasm_line__parse_powerpc" to handle parsing of raw instruction.
Also update "struct disasm_line" to save the binary code/
With the change, function captures:
line -> "38 01 81 e8 ld r4,312(r1)"
raw instruction "38 01 81 e8"
Raw instruction is used later to extract the reg/offset fields. Macros
are added to extract opcode and register fields. "struct disasm_line"
is updated to carry union of "bytes" and "raw_insn" of 32 bit to carry raw
code (raw).
Function "disasm_line__parse_powerpc fills the raw instruction hex value
and can use macros to get opcode. There is no changes in existing code
paths, which parses the disassembled code. The size of raw instruction
depends on architecture.
In case of powerpc, the parsing the disasm line needs to handle cases
for reading binary code directly from DSO as well as parsing the objdump
result. Hence adding the logic into separate function instead of
updating "disasm_line__parse". The architecture using the instruction
name and present approach is not altered. Since this approach targets
powerpc, the macro implementation is added for powerpc as of now.
Since the disasm_line__parse is used in other cases (perf annotate) and
not only data tye profiling, the powerpc callback includes changes to
work with binary code as well as mnemonic representation.
Also in case if the DSO read fails and libcapstone is not supported, the
approach fallback to use objdump as option. Hence as option, patch has
changes to ensure objdump option also works well.
Reviewed-by: Kajol Jain <kjain@linux.ibm.com> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Tested-by: Kajol Jain <kjain@linux.ibm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Akanksha J N <akanksha@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Disha Goel <disgoel@linux.vnet.ibm.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Link: https://lore.kernel.org/lkml/20240718084358.72242-5-atrajeev@linux.vnet.ibm.com
[ Add check for strndup() result ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf annotate: Add "update_insn_state" callback function to handle arch specific instruction tracking
Add "update_insn_state" callback to "struct arch" to handle instruction
tracking. Currently updating instruction state is handled by static
function "update_insn_state_x86" which is defined in "annotate-data.c".
Make this as a callback for specific arch and move to archs specific
file "arch/x86/annotate/instructions.c" . This will help to add helper
function for other platforms in file:
"arch/<platform>/annotate/instructions.c" and make changes/updates
easier.
Define callback "update_insn_state" as part of "struct arch", also make
some of the debug functions non-static so that it can be referenced from
other places.
Reviewed-by: Kajol Jain <kjain@linux.ibm.com> Reviewed-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Tested-by: Kajol Jain <kjain@linux.ibm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Akanksha J N <akanksha@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Disha Goel <disgoel@linux.vnet.ibm.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Segher Boessenkool <segher@kernel.crashing.org> Link: https://lore.kernel.org/lkml/20240718084358.72242-3-atrajeev@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Direct leak of 13628 byte(s) in 6 object(s) allocated from:
#0 0x56354f60c092 in malloc (/tmp/perf/perf+0x29c092)
#1 0x7ff25c7d02e7 in _PyObject_Malloc /build/python3.11/../Objects/obmalloc.c:2003:11
#2 0x7ff25c7d02e7 in _PyObject_Malloc /build/python3.11/../Objects/obmalloc.c:1996:1
perf trace: Remove arg_fmt->is_enum, we can get that from the BTF type
This is to pave the way for other BTF types, i.e. we try to find BTF
type then use things like btf_is_enum(btf_type) that we cached to find
the right strtoul and scnprintf routines.
For now only enum is supported, all the other types simple return zero
for scnprintf which makes it have the same behaviour as when BTF isn't
available, i.e. fallback to no pretty printing. Ditto for strtoul.
root@x1:~# perf test -v enum
124: perf trace enum augmentation tests : Ok
root@x1:~# perf test -v enum
124: perf trace enum augmentation tests : Ok
root@x1:~# perf test -v enum
124: perf trace enum augmentation tests : Ok
root@x1:~# perf test -v enum
124: perf trace enum augmentation tests : Ok
root@x1:~# perf test -v enum
124: perf trace enum augmentation tests : Ok
root@x1:~#
Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Howard Chu <howardchu95@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Howard Chu <howardchu95@gmail.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240624181345.124764-9-howardchu95@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To have a central place that will look at the BTF type and call the
right scnprintf routine or return zero, meaning BTF pretty printing
isn't available or not implemented for a specific type.
Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Howard Chu <howardchu95@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240624181345.124764-8-howardchu95@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Howard Chu [Mon, 24 Jun 2024 18:13:43 +0000 (02:13 +0800)]
perf test trace_btf_enum: Add regression test for the BTF augmentation of enums in 'perf trace'
Trace landlock_add_rule syscall to see if the output is desirable.
Trace the non-syscall tracepoint 'timer:hrtimer_init' and
'timer:hrtimer_start', see if the 'mode' argument is augmented,
the 'mode' enum argument has the prefix of 'HRTIMER_MODE_'
in its name.
It was agreed on the discussion (see Link below) to shorten then name of
the workload from 'landlock_add_rule' to 'landlock', and I moved it to a
separate patch.
Also, to address a build failure from Namhyung, I stopped loading
linux/landlock.h and instead added the used defines, enums and types to
make this build in older systems. All we want is to emit the syscall and
intercept it.
Howard Chu [Mon, 24 Jun 2024 18:13:39 +0000 (02:13 +0800)]
perf trace: BTF-based enum pretty printing for syscall args
In this patch, BTF is used to turn enum value to the corresponding
name. There is only one system call that uses enum value as its
argument, that is `landlock_add_rule()`.
The vmlinux btf is loaded lazily, when user decided to trace the
`landlock_add_rule` syscall. But if one decide to run `perf trace`
without any arguments, the behaviour is to trace `landlock_add_rule`,
so vmlinux btf will be loaded by default.
The laziest behaviour is to load vmlinux btf when a
`landlock_add_rule` syscall hits. But I think you could lose some
samples when loading vmlinux btf at run time, for it can delay the
handling of other samples. I might need your precious opinions on
this...
before:
```
perf $ ./perf trace -e landlock_add_rule
0.000 ( 0.008 ms): ldlck-test/438194 landlock_add_rule(rule_type: 2) = -1 EBADFD (File descriptor in bad state)
0.010 ( 0.001 ms): ldlck-test/438194 landlock_add_rule(rule_type: 1) = -1 EBADFD (File descriptor in bad state)
```
after:
```
perf $ ./perf trace -e landlock_add_rule
0.000 ( 0.029 ms): ldlck-test/438194 landlock_add_rule(rule_type: LANDLOCK_RULE_NET_PORT) = -1 EBADFD (File descriptor in bad state)
0.036 ( 0.004 ms): ldlck-test/438194 landlock_add_rule(rule_type: LANDLOCK_RULE_PATH_BENEATH) = -1 EBADFD (File descriptor in bad state)
```
Committer notes:
Made it build with NO_LIBBPF=1, simplified btf_enum_fprintf(), see [1]
for the discussion.
Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: GĂŒnther Noack <gnoack@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: MickaĂ«l SalaĂŒn <mic@digikod.net> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/lkml/20240613022757.3589783-1-howardchu95@gmail.com Link: https://lore.kernel.org/lkml/ZnXAhFflUl_LV1QY@x1 Link: https://lore.kernel.org/r/20240624181345.124764-3-howardchu95@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Merge tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix regression in extent map rework when handling insertion of
overlapping compressed extent
- fix unexpected file length when appending to a file using direct io
and buffer not faulted in
- in zoned mode, fix accounting of unusable space when flipping
read-only block group back to read-write
- fix page locking when COWing an inline range, assertion failure found
by syzbot
- fix calculation of space info in debugging print
- tree-checker, add validation of data reference item
- fix a few -Wmaybe-uninitialized build warnings
* tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
btrfs: fix corruption after buffer fault in during direct IO append write
btrfs: zoned: fix zone_unusable accounting on making block group read-write again
btrfs: do not subtract delalloc from avail bytes
btrfs: make cow_file_range_inline() honor locked_page on error
btrfs: fix corrupt read due to bad offset of a compressed extent map
btrfs: tree-checker: validate dref root and objectid
Merge tag 'perf-tools-fixes-for-v6.11-2024-07-30' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
Pull perf tools fixes from Namhyung Kim:
"Some more build fixes and a random crash fix:
- Fix cross-build by setting pkg-config env according to the arch
- Fix static build for missing library dependencies
- Fix Segfault when callchain has no symbols"
* tag 'perf-tools-fixes-for-v6.11-2024-07-30' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
perf docs: Document cross compilation
perf: build: Link lib 'zstd' for static build
perf: build: Link lib 'lzma' for static build
perf: build: Only link libebl.a for old libdw
perf: build: Set Python configuration for cross compilation
perf: build: Setup PKG_CONFIG_LIBDIR for cross compilation
perf tool: fix dereferencing NULL al->maps
Merge tag 'chrome-platform-fixes-for-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
Pull chrome-platform fix from Tzung-Bi Shih:
"Fix a race condition that sends multiple host commands at a time"
* tag 'chrome-platform-fixes-for-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
platform/chrome: cros_ec_proto: Lock device when updating MKBP version
This clarifies the rules for min()/max()/clamp() type checking and makes
them a much more efficient macro expansion.
In particular, we now look at the type and range of the inputs to see
whether they work together, generating a mask of acceptable comparisons,
and then just verifying that the inputs have a shared case:
- an expression with a signed type can be used for
(1) signed comparisons
(2) unsigned comparisons if it is statically known to have a
non-negative value
- an expression with an unsigned type can be used for
(3) unsigned comparison
(4) signed comparisons if the type is smaller than 'int' and thus
the C integer promotion rules will make it signed anyway
Here rule (1) and (3) are obvious, and rule (2) is important in order to
allow obvious trivial constants to be used together with unsigned
values.
Rule (4) is not necessarily a good idea, but matches what we used to do,
and we have extant cases of this situation in the kernel. Notably with
bcachefs having an expression like
where bch2_bucket_sectors_dirty() returns an 's64', and
'ca->mi.bucket_size' is of type 'u16'.
Technically that bcachefs comparison is clearly sensible on a C type
level, because the 'u16' will go through the normal C integer promotion,
and become 'int', and then we're comparing two signed values and
everything looks sane.
However, it's not entirely clear that a 'min(s64,u16)' operation makes a
lot of conceptual sense, and it's possible that we will remove rule (4).
After all, the _reason_ we have these complicated type checks is exactly
that the C type promotion rules are not very intuitive.
But at least for now the rule is in place for backwards compatibility.
Also note that rule (2) existed before, but is hugely relaxed by this
commit. It used to be true only for the simplest compile-time
non-negative integer constants. The new macro model will allow cases
where the compiler can trivially see that an expression is non-negative
even if it isn't necessarily a constant.
because our old 'min()' macro would see that 'pia[addr] & 0x3F' is of
type 'int' and clearly not a C constant expression, so doing a 'min()'
with a 'size_t' is a signedness violation.
Our new 'min()' macro still sees that 'pia[addr] & 0x3F' is of type
'int', but is smart enough to also see that it is clearly non-negative,
and thus would allow that case without any complaints.
Cc: Arnd Bergmann <arnd@kernel.org> Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Sterba [Mon, 29 Jul 2024 19:59:24 +0000 (21:59 +0200)]
btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry()
Some arch + compiler combinations report a potentially unused variable
location in btrfs_lookup_dentry(). This is a false alert as the variable
is passed by value and always valid or there's an error. The compilers
cannot probably reason about that although btrfs_inode_by_name() is in
the same file.
> + /kisskb/src/fs/btrfs/inode.c: error: 'location.objectid' may be used
+uninitialized in this function [-Werror=maybe-uninitialized]: => 5603:9
> + /kisskb/src/fs/btrfs/inode.c: error: 'location.type' may be used
+uninitialized in this function [-Werror=maybe-uninitialized]: => 5674:5
Initialize it to zero, this should fix the warnings and won't change the
behaviour as btrfs_inode_by_name() accepts only a root or inode item
types, otherwise returns an error.
platform/chrome: cros_ec_proto: Lock device when updating MKBP version
The cros_ec_get_host_command_version_mask() function requires that the
caller must have ec_dev->lock mutex before calling it. This requirement
was not met and as a result it was possible that two commands were sent
to the device at the same time.
The problem was observed while using UART backend which doesn't use any
additional locks, unlike SPI backend which locks the controller until
response is received.
profiling: remove stale percpu flip buffer variables
For some reason I didn't see this issue on my arm64 or x86-64 builds,
but Stephen Rothwell reports that commit 2accfdb7eff6 ("profiling:
attempt to remove per-cpu profile flip buffer") left these static
variables around, and the powerpc build is unhappy about them:
kernel/profile.c:52:28: warning: 'cpu_profile_flip' defined but not used [-Wunused-variable]
52 | static DEFINE_PER_CPU(int, cpu_profile_flip);
| ^~~~~~~~~~~~~~~~
..
So remove these stale left-over remnants too.
Fixes: 2accfdb7eff6 ("profiling: attempt to remove per-cpu profile flip buffer") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'for-linus-2024072901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Pull HID fixes from Benjamin Tissoires:
- fixes for HID-BPF after the merge with the bpf tree (Arnd Bergmann
and Benjamin Tissoires)
- some tool type fix for the Wacom driver (Tatsunosuke Tobita)
- a reorder of the sensor discovery to ensure the HID AMD SFH is
removed when no sensors are available (Basavaraj Natikar)
* tag 'for-linus-2024072901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
selftests/hid: add test for attaching multiple time the same struct_ops
HID: bpf: prevent the same struct_ops to be attached more than once
selftests/hid: disable struct_ops auto-attach
selftests/hid: fix bpf_wq new API
HID: amd_sfh: Move sensor discovery before HID device initialization
hid: bpf: add BPF_JIT dependency
HID: wacom: more appropriate tool type categorization
HID: wacom: Modify pen IDs
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"The biggest thing here is the adminq change - but it looks like the
only way to avoid headq blocking causing indefinite stalls.
This fixes three issues:
- Prevent admin commands on one VF blocking another.
This prevents a bad VF from blocking a good one, as well as fixing
a scalability issue with large # of VFs
- Correctly return error on command failure on octeon. We used to
treat failed commands as a success.
- Fix modpost warning when building virtio_dma_buf. Harmless, but the
fix is trivial"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio_pci_modern: remove admin queue serialization lock
virtio_pci_modern: use completion instead of busy loop to wait on admin cmd result
virtio_pci_modern: pass cmd as an identification token
virtio_pci_modern: create admin queue of queried size
virtio: create admin queues alongside other virtqueues
virtio_pci: pass vq info as an argument to vp_setup_vq()
virtio: push out code to vp_avq_index()
virtio_pci_modern: treat vp_dev->admin_vq.info.vq pointer as static
virtio_pci: introduce vector allocation fallback for slow path virtqueues
virtio_pci: pass vector policy enum to vp_find_one_vq_msix()
virtio_pci: pass vector policy enum to vp_find_vqs_msix()
virtio_pci: simplify vp_request_msix_vectors() call a bit
virtio_pci: push out single vq find code to vp_find_one_vq_msix()
vdpa/octeon_ep: Fix error code in octep_process_mbox()
virtio: add missing MODULE_DESCRIPTION() macro
task_work: make TWA_NMI_CURRENT handling conditional on IRQ_WORK
The TWA_NMI_CURRENT handling very much depends on IRQ_WORK, but that
isn't universally enabled everywhere.
Maybe the IRQ_WORK infrastructure should just be unconditional - x86
ends up indirectly enabling it through unconditionally enabling
PERF_EVENTS, for example. But it also gets enabled by having SMP
support, or even if you just have PRINTK enabled.
But in the meantime TWA_NMI_CURRENT causes tons of build failures on
various odd minimal configs. Which did show up in linux-next, but
despite that nobody bothered to fix it or even inform me until -rc1 was
out.
Fixes: 466e4d801cd4 ("task_work: Add TWA_NMI_CURRENT as an additional notify mode") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: kernelci.org bot <bot@kernelci.org> Reported-by: Guenter Roeck <linux@roeck-us.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
profiling: attempt to remove per-cpu profile flip buffer
This is the really old legacy kernel profiling code, which has long
since been obviated by "real profiling" (ie 'prof' and company), and
mainly remains as a source of syzbot reports.
There are anecdotal reports that people still use it for boot-time
profiling, but it's unlikely that such use would care about the old NUMA
optimizations in this code from 2004 (commit ad02973d42: "profile: 512x
Altix timer interrupt livelock fix" in the BK import archive at [1])
So in order to head off future syzbot reports, let's try to simplify
this code and get rid of the per-cpu profile buffers that are quite a
large portion of the complexity footprint of this thing (including CPU
hotplug callbacks etc).
It's unlikely anybody will actually notice, or possibly, as Thomas put
it: "Only people who indulge in nostalgia will notice :)".
That said, if it turns out that this code is actually actively used by
somebody, we can always revert this removal. Thus the "attempt" in the
summary line.
[ Note: in a small nod to "the profiling code can cause NUMA problems",
this also removes the "increment the last entry in the profiling array
on any unknown hits" logic. That would account any program counter in
a module to that single counter location, and might exacerbate any
NUMA cacheline bouncing issues ]
in profile_tick(); prof_cpu_mask remains uninitialzed until cpumask_copy()
completes while cpumask_available(prof_cpu_mask) returns true as soon as
alloc_cpumask_var(&prof_cpu_mask) completes.
We could replace alloc_cpumask_var() with zalloc_cpumask_var() and
call cpumask_copy() from create_proc_profile() on only UP kernels, for
profile_online_cpu() calls cpumask_set_cpu() as needed via
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN) on SMP kernels. But this patch
removes prof_cpu_mask because it seems unnecessary.
The cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) test
in profile_tick() is likely always true due to
a CPU cannot call profile_tick() if that CPU is offline
and
cpumask_set_cpu(cpu, prof_cpu_mask) is called when that CPU becomes
online and cpumask_clear_cpu(cpu, prof_cpu_mask) is called when that
CPU becomes offline
. This test could be false during transition between online and offline.
But according to include/linux/cpuhotplug.h , CPUHP_PROFILE_PREPARE
belongs to PREPARE section, which means that the CPU subjected to
profile_dead_cpu() cannot be inside profile_tick() (i.e. no risk of
use-after-free bug) because interrupt for that CPU is disabled during
PREPARE section. Therefore, this test is guaranteed to be true, and
can be removed. (Since profile_hits() checks prof_buffer != NULL, we
don't need to check prof_buffer != NULL here unless get_irq_regs() or
user_mode() is such slow that we want to avoid when prof_buffer == NULL).
do_profile_hits() is called from profile_tick() from timer interrupt
only if cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) is true and
prof_buffer is not NULL. But syzbot is also reporting that sometimes
do_profile_hits() is called while current thread is still doing vzalloc(),
where prof_buffer must be NULL at this moment. This indicates that multiple
threads concurrently tried to write to /sys/kernel/profiling interface,
which caused that somebody else try to re-allocate prof_buffer despite
somebody has already allocated prof_buffer. Fix this by using
serialization.
Filipe Manana [Fri, 26 Jul 2024 10:12:52 +0000 (11:12 +0100)]
btrfs: fix corruption after buffer fault in during direct IO append write
During an append (O_APPEND write flag) direct IO write if the input buffer
was not previously faulted in, we can corrupt the file in a way that the
final size is unexpected and it includes an unexpected hole.
The problem happens like this:
1) We have an empty file, with size 0, for example;
2) We do an O_APPEND direct IO with a length of 4096 bytes and the input
buffer is not currently faulted in;
3) We enter btrfs_direct_write(), lock the inode and call
generic_write_checks(), which calls generic_write_checks_count(), and
that function sets the iocb position to 0 with the following code:
if (iocb->ki_flags & IOCB_APPEND)
iocb->ki_pos = i_size_read(inode);
4) We call btrfs_dio_write() and enter into iomap, which will end up
calling btrfs_dio_iomap_begin() and that calls
btrfs_get_blocks_direct_write(), where we update the i_size of the
inode to 4096 bytes;
5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access
the page of the write input buffer (at iomap_dio_bio_iter(), with a
call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets
returned to btrfs at btrfs_direct_write() via btrfs_dio_write();
6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode,
fault in the write buffer and then goto to the label 'relock';
7) We lock again the inode, do all the necessary checks again and call
again generic_write_checks(), which calls generic_write_checks_count()
again, and there we set the iocb's position to 4K, which is the current
i_size of the inode, with the following code pointed above:
if (iocb->ki_flags & IOCB_APPEND)
iocb->ki_pos = i_size_read(inode);
8) Then we go again to btrfs_dio_write() and enter iomap and the write
succeeds, but it wrote to the file range [4K, 8K), leaving a hole in
the [0, 4K) range and an i_size of 8K, which goes against the
expectations of having the data written to the range [0, 4K) and get an
i_size of 4K.
Fix this by not unlocking the inode before faulting in the input buffer,
in case we get -EFAULT or an incomplete write, and not jumping to the
'relock' label after faulting in the buffer - instead jump to a location
immediately before calling iomap, skipping all the write checks and
relocking. This solves this problem and it's fine even in case the input
buffer is memory mapped to the same file range, since only holding the
range locked in the inode's io tree can cause a deadlock, it's safe to
keep the inode lock (VFS lock), as was fixed and described in commit 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO
reads and writes").
A sample reproducer provided by a reporter is the following:
Reported-by: Hanna Czenczek <hreitz@redhat.com> Link: https://lore.kernel.org/linux-btrfs/0b841d46-12fe-4e64-9abb-871d8d0de271@redhat.com/ Fixes: 8184620ae212 ("btrfs: fix lost file sync on direct IO write with nowait and dsync iocb") CC: stable@vger.kernel.org # 6.1+ Tested-by: Hanna Czenczek <hreitz@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Wed, 15 Feb 2023 00:18:02 +0000 (09:18 +0900)]
btrfs: zoned: fix zone_unusable accounting on making block group read-write again
When btrfs makes a block group read-only, it adds all free regions in the
block group to space_info->bytes_readonly. That free space excludes
reserved and pinned regions. OTOH, when btrfs makes the block group
read-write again, it moves all the unused regions into the block group's
zone_unusable. That unused region includes reserved and pinned regions.
As a result, it counts too much zone_unusable bytes.
Fortunately (or unfortunately), having erroneous zone_unusable does not
affect the calculation of space_info->bytes_readonly, because free
space (num_bytes in btrfs_dec_block_group_ro) calculation is done based on
the erroneous zone_unusable and it reduces the num_bytes just to cancel the
error.
This behavior can be easily discovered by adding a WARN_ON to check e.g,
"bg->pinned > 0" in btrfs_dec_block_group_ro(), and running fstests test
case like btrfs/282.
Fix it by properly considering pinned and reserved in
btrfs_dec_block_group_ro(). Also, add a WARN_ON and introduce
btrfs_space_info_update_bytes_zone_unusable() to catch a similar mistake.
Fixes: 169e0da91a21 ("btrfs: zoned: track unusable bytes for zones") CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
The block group's avail bytes printed when dumping a space info subtract
the delalloc_bytes. However, as shown in btrfs_add_reserved_bytes() and
btrfs_free_reserved_bytes(), it is added or subtracted along with
"reserved" for the delalloc case, which means the "delalloc_bytes" is a
part of the "reserved" bytes. So, excluding it to calculate the avail space
counts delalloc_bytes twice, which can lead to an invalid result.
Fixes: e50b122b832b ("btrfs: print available space for a block group when dumping a space info") CC: stable@vger.kernel.org # 6.6+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
Boris Burkov [Mon, 22 Jul 2024 23:49:45 +0000 (16:49 -0700)]
btrfs: make cow_file_range_inline() honor locked_page on error
The btrfs buffered write path runs through __extent_writepage() which
has some tricky return value handling for writepage_delalloc().
Specifically, when that returns 1, we exit, but for other return values
we continue and end up calling btrfs_folio_end_all_writers(). If the
folio has been unlocked (note that we check the PageLocked bit at the
start of __extent_writepage()), this results in an assert panic like
this one from syzbot:
I was hitting the same issue by doing hundreds of accelerated runs of
generic/475, which also hits IO errors by design.
I instrumented that reproducer with bpftrace and found that the
undesirable folio_unlock was coming from the following callstack:
folio_unlock+5
__process_pages_contig+475
cow_file_range_inline.constprop.0+230
cow_file_range+803
btrfs_run_delalloc_range+566
writepage_delalloc+332
__extent_writepage # inlined in my stacktrace, but I added it here
extent_write_cache_pages+622
Looking at the bisected-to patch in the syzbot report, Josef realized
that the logic of the cow_file_range_inline error path subtly changing.
In the past, on error, it jumped to out_unlock in cow_file_range(),
which honors the locked_page, so when we ultimately call
folio_end_all_writers(), the folio of interest is still locked. After
the change, we always unlocked ignoring the locked_page, on both success
and error. On the success path, this all results in returning 1 to
__extent_writepage(), which skips the folio_end_all_writers() call,
which makes it OK to have unlocked.
Fix the bug by wiring the locked_page into cow_file_range_inline() and
only setting locked_page to NULL on success.
Reported-by: syzbot+a14d8ac9af3a2a4fd0c8@syzkaller.appspotmail.com Fixes: 0586d0a89e77 ("btrfs: move extent bit and page cleanup into cow_file_range_inline") CC: stable@vger.kernel.org # 6.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
Now that we no longer have any C constant expression contexts (ie array
size declarations or static initializers) that use min() or max(), we
can simpify the implementation by not having to worry about the result
staying as a C constant expression.
So now we can unconditionally just use temporary variables of the right
type, and get rid of the excessive expansion that used to come from the
use of
__builtin_choose_expr(__is_constexpr(...), ..
to pick the specialized code for constant expressions.
Another expansion simplification is to pass the temporary variables (in
addition to the original expression) to our __types_ok() macro. That
may superficially look like it complicates the macro, but when we only
want the type of the expression, expanding the temporary variable names
is much simpler and smaller than expanding the potentially complicated
original expression.
As a result, on my machine, doing a
$ time make drivers/staging/media/atomisp/pci/isp/kernels/ynr/ynr_1.0/ia_css_ynr.host.i
goes from
real 0m16.621s
user 0m15.360s
sys 0m1.221s
to
real 0m2.532s
user 0m2.091s
sys 0m0.452s
because the token expansion goes down dramatically.
In particular, the longest line expansion (which was line 71 of that
'ia_css_ynr.host.c' file) shrinks from 23,338kB (yes, 23MB for one
single line) to "just" 1,444kB (now "only" 1.4MB).
And yes, that line is still the line from hell, because it's doing
multiple levels of "min()/max()" expansion thanks to some of them being
hidden inside the uDIGIT_FITTING() macro.
Lorenzo has a nice cleanup patch that makes that driver use inline
functions instead of macros for sDIGIT_FITTING() and uDIGIT_FITTING(),
which will fix that line once and for all, but the 16-fold reduction in
this case does show why we need to simplify these helpers.
Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
minmax: don't use max() in situations that want a C constant expression
We only had a couple of array[] declarations, and changing them to just
use 'MAX()' instead of 'max()' fixes the issue.
This will allow us to simplify our min/max macros enormously, since they
can now unconditionally use temporary variables to avoid using the
argument values multiple times.
Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While working on simplifying the minmax functions, and avoiding
excessive macro expansion, it turns out that the sr.c use of the
'clamp()' macro has the arguments the wrong way around.
The clamp logic is
val = clamp(in, low, high);
and it returns the input clamped to the low/high limits. But sr.c ddid
speed = clamp(0, speed, 0xffff / 177);
which clamps the value '0' to the range '[speed, 0xffff / 177]' and ends
up being nonsensical.
minmax: make generic MIN() and MAX() macros available everywhere
This just standardizes the use of MIN() and MAX() macros, with the very
traditional semantics. The goal is to use these for C constant
expressions and for top-level / static initializers, and so be able to
simplify the min()/max() macros.
These macro names were used by various kernel code - they are very
traditional, after all - and all such users have been fixed up, with a
few different approaches:
- trivial duplicated macro definitions have been removed
Note that 'trivial' here means that it's obviously kernel code that
already included all the major kernel headers, and thus gets the new
generic MIN/MAX macros automatically.
- non-trivial duplicated macro definitions are guarded with #ifndef
This is the "yes, they define their own versions, but no, the include
situation is not entirely obvious, and maybe they don't get the
generic version automatically" case.
- strange use case #1
A couple of drivers decided that the way they want to describe their
versioning is with
#define MAJ 1
#define MIN 2
#define DRV_VERSION __stringify(MAJ) "." __stringify(MIN)
which adds zero value and I just did my Alexander the Great
impersonation, and rewrote that pointless Gordian knot as
#define DRV_VERSION "1.2"
instead.
- strange use case #2
A couple of drivers thought that it's a good idea to have a random
'MIN' or 'MAX' define for a value or index into a table, rather than
the traditional macro that takes arguments.
These values were re-written as C enum's instead. The new
function-line macros only expand when followed by an open
parenthesis, and thus don't clash with enum use.
Happily, there weren't really all that many of these cases, and a lot of
users already had the pattern of using '#ifndef' guarding (or in one
case just using '#undef MIN') before defining their own private version
that does the same thing. I left such cases alone.
Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'kbuild-fixes-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix RPM package build error caused by an incorrect locale setup
- Mark modules.weakdep as ghost in RPM package
- Fix the odd combination of -S and -c in stack protector scripts,
which is an error with the latest Clang
* tag 'kbuild-fixes-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: Fix '-S -c' in x86 stack protector scripts
kbuild: rpm-pkg: ghost modules.weakdep file
kbuild: rpm-pkg: Fix C locale setup
minmax: simplify and clarify min_t()/max_t() implementation
This simplifies the min_t() and max_t() macros by no longer making them
work in the context of a C constant expression.
That means that you can no longer use them for static initializers or
for array sizes in type definitions, but there were only a couple of
such uses, and all of them were converted (famous last words) to use
MIN_T/MAX_T instead.
Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 3a7e02c040b1 ("minmax: avoid overly complicated constant
expressions in VM code") added the simpler MIN_T/MAX_T macros in order
to avoid some excessive expansion from the rather complicated regular
min/max macros.
The complexity of those macros stems from two issues:
(a) trying to use them in situations that require a C constant
expression (in static initializers and for array sizes)
(b) the type sanity checking
and MIN_T/MAX_T avoids both of these issues.
Now, in the whole (long) discussion about all this, it was pointed out
that the whole type sanity checking is entirely unnecessary for
min_t/max_t which get a fixed type that the comparison is done in.
But that still leaves min_t/max_t unnecessarily complicated due to
worries about the C constant expression case.
However, it turns out that there really aren't very many cases that use
min_t/max_t for this, and we can just force-convert those.
This does exactly that.
Which in turn will then allow for much simpler implementations of
min_t()/max_t(). All the usual "macros in all upper case will evaluate
the arguments multiple times" rules apply.
We should do all the same things for the regular min/max() vs MIN/MAX()
cases, but that has the added complexity of various drivers defining
their own local versions of MIN/MAX, so that needs another level of
fixes first.
kbuild: Fix '-S -c' in x86 stack protector scripts
After a recent change in clang to stop consuming all instances of '-S'
and '-c' [1], the stack protector scripts break due to the kernel's use
of -Werror=unused-command-line-argument to catch cases where flags are
not being properly consumed by the compiler driver:
$ echo | clang -o - -x c - -S -c -Werror=unused-command-line-argument
clang: error: argument unused during compilation: '-c' [-Werror,-Wunused-command-line-argument]
This results in CONFIG_STACKPROTECTOR getting disabled because
CONFIG_CC_HAS_SANE_STACKPROTECTOR is no longer set.
'-c' and '-S' both instruct the compiler to stop at different stages of
the pipeline ('-S' after compiling, '-c' after assembling), so having
them present together in the same command makes little sense. In this
case, the test wants to stop before assembling because it is looking at
the textual assembly output of the compiler for either '%fs' or '%gs',
so remove '-c' from the list of arguments to resolve the error.
All versions of GCC continue to work after this change, along with
versions of clang that do or do not contain the change mentioned above.
Cc: stable@vger.kernel.org Fixes: 4f7fd4d7a791 ("[PATCH] Add the -fstack-protector option to the CFLAGS") Fixes: 60a5317ff0f4 ("x86: implement x86_32 stack protector") Link: https://github.com/llvm/llvm-project/commit/6461e537815f7fa68cef06842505353cf5600e9c Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Merge tag 'v6.11-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
Pull turbostat updates from Len Brown:
- Enable turbostat extensions to add both perf and PMT (Intel
Platform Monitoring Technology) counters via the cmdline
- Demonstrate PMT access with built-in support for Meteor Lake's
Die C6 counter
* tag 'v6.11-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
tools/power turbostat: version 2024.07.26
tools/power turbostat: Include umask=%x in perf counter's config
tools/power turbostat: Document PMT in turbostat.8
tools/power turbostat: Add MTL's PMT DC6 builtin counter
tools/power turbostat: Add early support for PMT counters
tools/power turbostat: Add selftests for added perf counters
tools/power turbostat: Add selftests for SMI, APERF and MPERF counters
tools/power turbostat: Move verbose counter messages to level 2
tools/power turbostat: Move debug prints from stdout to stderr
tools/power turbostat: Fix typo in turbostat.8
tools/power turbostat: Add perf added counter example to turbostat.8
tools/power turbostat: Fix formatting in turbostat.8
tools/power turbostat: Extend --add option with perf counters
tools/power turbostat: Group SMI counter with APERF and MPERF
tools/power turbostat: Add ZERO_ARRAY for zero initializing builtin array
tools/power turbostat: Replace enum rapl_source and cstate_source with counter_source
tools/power turbostat: Remove anonymous union from rapl_counter_info_t
tools/power/turbostat: Switch to new Intel CPU model defines
Merge tag 'cxl-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull CXL updates from Dave Jiang:
"Core:
- A CXL maturity map has been added to the documentation to detail
the current state of CXL enabling.
It provides the status of the current state of various CXL features
to inform current and future contributors of where things are and
which areas need contribution.
- A notifier handler has been added in order for a newly created CXL
memory region to trigger the abstract distance metrics calculation.
This should bring parity for CXL memory to the same level vs
hotplugged DRAM for NUMA abstract distance calculation. The
abstract distance reflects relative performance used for memory
tiering handling.
- An addition for XOR math has been added to address the CXL DPA to
SPA translation.
CXL address translation did not support address interleave math
with XOR prior to this change.
Fixes:
- Fix to address race condition in the CXL memory hotplug notifier
- Add missing MODULE_DESCRIPTION() for CXL modules
- Fix incorrect vendor debug UUID define
Misc:
- A warning has been added to inform users of an unsupported
configuration when mixing CXL VH and RCH/RCD hierarchies
- The ENXIO error code has been replaced with EBUSY for inject poison
limit reached via debugfs and cxl-test support
- Moving the PCI config read in cxl_dvsec_rr_decode() to avoid
unnecessary PCI config reads
- A refactor to a common struct for DRAM and general media CXL
events"
* tag 'cxl-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/core/pci: Move reading of control register to immediately before usage
cxl: Remove defunct code calculating host bridge target positions
cxl/region: Verify target positions using the ordered target list
cxl: Restore XOR'd position bits during address translation
cxl/core: Fold cxl_trace_hpa() into cxl_dpa_to_hpa()
cxl/test: Replace ENXIO with EBUSY for inject poison limit reached
cxl/memdev: Replace ENXIO with EBUSY for inject poison limit reached
cxl/acpi: Warn on mixed CXL VH and RCH/RCD Hierarchy
cxl/core: Fix incorrect vendor debug UUID define
Documentation: CXL Maturity Map
cxl/region: Simplify cxl_region_nid()
cxl/region: Support to calculate memory tier abstract distance
cxl/region: Fix a race condition in memory hotplug notifier
cxl: add missing MODULE_DESCRIPTION() macros
cxl/events: Use a common struct for DRAM and General Media events
Merge tag 'unicode-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode
Pull unicode update from Gabriel Krisman Bertazi:
"Two small fixes to silence the compiler and static analyzers tools
from Ben Dooks and Jeff Johnson"
* tag 'unicode-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode:
unicode: add MODULE_DESCRIPTION() macros
unicode: make utf8 test count static
In the same way as for other similar files, mark as ghost the new file
generated by depmod for configured weak dependencies for modules,
modules.weakdep, so that although it is not included in the package,
claim the ownership on it.
Signed-off-by: Jose Ignacio Tornos Martinez <jtornosm@redhat.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Merge tag '6.11-rc-smb-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull more smb client updates from Steve French:
- fix for potential null pointer use in init cifs
- additional dynamic trace points to improve debugging of some common
scenarios
- two SMB1 fixes (one addressing reconnect with POSIX extensions, one a
mount parsing error)
* tag '6.11-rc-smb-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
smb3: add dynamic trace point for session setup key expired failures
smb3: add four dynamic tracepoints for copy_file_range and reflink
smb3: add dynamic tracepoint for reflink errors
cifs: mount with "unix" mount option for SMB1 incorrectly handled
cifs: fix reconnect with SMB1 UNIX Extensions
cifs: fix potential null pointer use in destroy_workqueue in init_cifs error path
Merge tag 'block-6.11-20240726' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Fix request without payloads cleanup (Leon)
- Use new protection information format (Francis)
- Improved debug message for lost pci link (Bart)
- Another apst quirk (Wang)
- Use appropriate sysfs api for printing chars (Markus)
- ublk async device deletion fix (Ming)
- drbd kerneldoc fixups (Simon)
- Fix deadlock between sd removal and release (Yang)
* tag 'block-6.11-20240726' of git://git.kernel.dk/linux:
nvme-pci: add missing condition check for existence of mapped data
ublk: fix UBLK_CMD_DEL_DEV_ASYNC handling
block: fix deadlock between sd_remove & sd_release
drbd: Add peer_device to Kernel doc
nvme-core: choose PIF from QPIF if QPIFS supports and PIF is QTYPE
nvme-pci: Fix the instructions for disabling power management
nvme: remove redundant bdev local variable
nvme-fabrics: Use seq_putc() in __nvmf_concat_opt_tokens()
nvme/pci: Add APST quirk for Lenovo N60z laptop
Merge tag 'vfs-6.11-rc1.fixes.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"This contains two fixes for this merge window:
VFS:
- I noticed that it is possible for a privileged user to mount most
filesystems with a non-initial user namespace in sb->s_user_ns.
When fsopen() is called in a non-init namespace the caller's
namespace is recorded in fs_context->user_ns. If the returned file
descriptor is then passed to a process privileged in init_user_ns,
that process can call fsconfig(fd_fs, FSCONFIG_CMD_CREATE*),
creating a new superblock with sb->s_user_ns set to the namespace
of the process which called fsopen().
This is problematic as only filesystems that raise FS_USERNS_MOUNT
are known to be able to support a non-initial s_user_ns. Others may
suffer security issues, on-disk corruption or outright crash the
kernel. Prevent that by restricting such delegation to filesystems
that allow FS_USERNS_MOUNT.
Note, that this delegation requires a privileged process to
actually create the superblock so either the privileged process is
cooperaing or someone must have tricked a privileged process into
operating on a fscontext file descriptor whose origin it doesn't
know (a stupid idea).
The bug dates back to about 5 years afaict.
Misc:
- Fix hostfs parsing when the mount request comes in via the legacy
mount api.
In the legacy mount api hostfs allows to specify the host directory
mount without any key.
Restore that behavior"
* tag 'vfs-6.11-rc1.fixes.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
hostfs: fix the host directory parse when mounting.
fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT
Merge tag 'rust-6.11' of https://github.com/Rust-for-Linux/linux
Pull Rust updates from Miguel Ojeda:
"The highlight is the establishment of a minimum version for the Rust
toolchain, including 'rustc' (and bundled tools) and 'bindgen'.
The initial minimum will be the pinned version we currently have, i.e.
we are just widening the allowed versions. That covers three stable
Rust releases: 1.78.0, 1.79.0, 1.80.0 (getting released tomorrow),
plus beta, plus nightly.
This should already be enough for kernel developers in distributions
that provide recent Rust compiler versions routinely, such as Arch
Linux, Debian Unstable (outside the freeze period), Fedora Linux,
Gentoo Linux (especially the testing channel), Nix (unstable) and
openSUSE Slowroll and Tumbleweed.
In addition, the kernel is now being built-tested by Rust's pre-merge
CI. That is, every change that is attempting to land into the Rust
compiler is tested against the kernel, and it is merged only if it
passes. Similarly, the bindgen tool has agreed to build the kernel in
their CI too.
Thus, with the pre-merge CI in place, both projects hope to avoid
unintentional changes to Rust that break the kernel. This means that,
in general, apart from intentional changes on their side (that we will
need to workaround conditionally on our side), the upcoming Rust
compiler versions should generally work.
In addition, the Rust project has proposed getting the kernel into
stable Rust (at least solving the main blockers) as one of its three
flagship goals for 2024H2 [1].
I would like to thank Niko, Sid, Emilio et al. for their help
promoting the collaboration between Rust and the kernel.
Toolchain and infrastructure:
- Support several Rust toolchain versions.
- Support several bindgen versions.
- Remove 'cargo' requirement and simplify 'rusttest', thanks to
'alloc' having been dropped last cycle.
- Provide proper error reporting for the 'rust-analyzer' target.
'kernel' crate:
- Add 'uaccess' module with a safe userspace pointers abstraction.
- Add 'page' module with a 'struct page' abstraction.
- Support more complex generics in workqueue's 'impl_has_work!'
macro.
'macros' crate:
- Add 'firmware' field support to the 'module!' macro.
- Improve 'module!' macro documentation.
Documentation:
- Provide instructions on what packages should be installed to build
the kernel in some popular Linux distributions.
- Introduce the new kernel.org LLVM+Rust toolchains.
- Explain '#[no_std]'.
And a few other small bits"
Link: https://rust-lang.github.io/rust-project-goals/2024h2/index.html#flagship-goals
* tag 'rust-6.11' of https://github.com/Rust-for-Linux/linux: (26 commits)
docs: rust: quick-start: add section on Linux distributions
rust: warn about `bindgen` versions 0.66.0 and 0.66.1
rust: start supporting several `bindgen` versions
rust: work around `bindgen` 0.69.0 issue
rust: avoid assuming a particular `bindgen` build
rust: start supporting several compiler versions
rust: simplify Clippy warning flags set
rust: relax most deny-level lints to warnings
rust: allow `dead_code` for never constructed bindings
rust: init: simplify from `map_err` to `inspect_err`
rust: macros: indent list item in `paste!`'s docs
rust: add abstraction for `struct page`
rust: uaccess: add typed accessors for userspace pointers
uaccess: always export _copy_[from|to]_user with CONFIG_RUST
rust: uaccess: add userspace pointers
kbuild: rust-analyzer: improve comment documentation
kbuild: rust-analyzer: better error handling
docs: rust: no_std is used
rust: alloc: add __GFP_HIGHMEM flag
rust: alloc: fix typo in docs for GFP_NOWAIT
...
Merge tag 'apparmor-pr-2024-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor
Pull apparmor updates from John Johansen:
"Cleanups
- optimization: try to avoid refing the label in apparmor_file_open
- remove useless static inline function is_deleted
- use kvfree_sensitive to free data->data
- fix typo in kernel doc
Bug fixes:
- unpack transition table if dfa is not present
- test: add MODULE_DESCRIPTION()
- take nosymfollow flag into account
- fix possible NULL pointer dereference
- fix null pointer deref when receiving skb during sock creation"
* tag 'apparmor-pr-2024-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor:
apparmor: unpack transition table if dfa is not present
apparmor: try to avoid refing the label in apparmor_file_open
apparmor: test: add MODULE_DESCRIPTION()
apparmor: take nosymfollow flag into account
apparmor: fix possible NULL pointer dereference
apparmor: fix typo in kernel doc
apparmor: remove useless static inline function is_deleted
apparmor: use kvfree_sensitive to free data->data
apparmor: Fix null pointer deref when receiving skb during sock creation
Merge tag 'landlock-6.11-rc1-houdini-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux
Pull landlock fix from MickaĂ«l SalaĂŒn:
"Jann Horn reported a sandbox bypass for Landlock. This includes the
fix and new tests. This should be backported"
* tag 'landlock-6.11-rc1-houdini-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
selftests/landlock: Add cred_transfer test
landlock: Don't lose track of restrictions on cred_transfer
Merge tag 'devicetree-fixes-for-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull more devicetree updates from Rob Herring:
"Most of this is a treewide change to of_property_for_each_u32() which
was small enough to do in one go before rc1 and avoids the need to
create of_property_for_each_u32_some_new_name().
- Treewide conversion of of_property_for_each_u32() to drop internal
arguments making struct property opaque
- Add binding for Amlogic A4 SoC watchdog
- Fix constraints for AD7192 'single-channel' property"
* tag 'devicetree-fixes-for-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
dt-bindings: iio: adc: ad7192: Fix 'single-channel' constraints
of: remove internal arguments from of_property_for_each_u32()
dt-bindings: watchdog: add support for Amlogic A4 SoCs
Merge tag 'iommu-fixes-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu fixes from Will Deacon:
"We're still resolving a regression with the handling of unexpected
page faults on SMMUv3, but we're not quite there with a fix yet.
- Fix NULL dereference when freeing domain in Unisoc SPRD driver
- Separate assignment statements with semicolons in AMD page-table
code
- Fix Tegra erratum workaround when the CPU is using 16KiB pages"
* tag 'iommu-fixes-v6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
iommu: arm-smmu: Fix Tegra workaround for PAGE_SIZE mappings
iommu/amd: Convert comma to semicolon
iommu: sprd: Avoid NULL deref in sprd_iommu_hw_en
Merge tag 'firewire-fixes-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
Pull firewire fixes from Takashi Sakamoto:
"The recent integration of compiler collections introduced the
technology to check flexible array length at runtime by providing
proper annotations. In v6.10 kernel, a patch was merged into firewire
subsystem to utilize it, however the annotation was inadequate.
There is also the related change for the flexible array in sound
subsystem, but it causes a regression where the data in the payload of
isochronous packet is incorrect for some devices. These bugs are now
fixed"
* tag 'firewire-fixes-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
ALSA: firewire-lib: fix wrong value as length of header for CIP_NO_HEADER case
Revert "firewire: Annotate struct fw_iso_packet with __counted_by()"
Merge tag 'spi-fix-v6.11-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"The bulk of this is a series of fixes for the microchip-core driver
mostly originating from one of their customers, I also applied an
additional patch adding support for controlling the word size which
came along with it since it's still the merge window and clearly had a
bunch of fairly thorough testing.
We also have a fix for the compatible used to bind spidev to the
BH2228FV"
* tag 'spi-fix-v6.11-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: spidev: add correct compatible for Rohm BH2228FV
dt-bindings: trivial-devices: fix Rohm BH2228FV compatible string
spi: microchip-core: add support for word sizes of 1 to 32 bits
spi: microchip-core: ensure TX and RX FIFOs are empty at start of a transfer
spi: microchip-core: fix init function not setting the master and motorola modes
spi: microchip-core: only disable SPI controller when register value change requires it
spi: microchip-core: defer asserting chip select until just before write to TX FIFO
spi: microchip-core: fix the issues in the isr
Merge tag 'regulator-fix-v6.11-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"These two commits clean up the excessively loose dependencies for the
RZG2L USB VBCTRL regulator driver, ensuring it shouldn't prompt for
people who can't use it"
* tag 'regulator-fix-v6.11-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: Further restrict RZG2L USB VBCTRL regulator dependencies
regulator: renesas-usb-vbus-regulator: Update the default
Merge tag 'regmap-fix-v6.11-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap fix from Mark Brown:
"Arnd sent a workaround for a false positive warning which was showing
up with GCC 14.1"
* tag 'regmap-fix-v6.11-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: maple: work around gcc-14.1 false-positive warning
Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"A few clk driver fixes for the merge window to fix the build and boot
on some SoCs.
- Initialize struct clk_init_data in the TI da8xx-cfgchip driver so
that stack contents aren't used for things like clk flags leading
to unexpected behavior
- Don't leak stack contents in a debug print in the new Sophgo clk
driver
- Disable the new T-Head clk driver on 32-bit targets to fix the
build due to a division
- Fix Samsung Exynos4 fin_pll wreckage from the clkdev rework done
last cycle by using a struct clk_hw directly instead of a struct
clk consumer"
* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: samsung: fix getting Exynos4 fin_pll rate from external clocks
clk: T-Head: Disable on 32-bit Targets
clk: sophgo: clk-sg2042-pll: Fix uninitialized variable in debug output
clk: davinci: da8xx-cfgchip: Initialize clk_init_data before use
Merge tag 'i3c/for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux
Pull i3c updates from Alexandre Belloni:
"This cycle, there are new features for the Designware controller and
fixes for the other IPs:
- dw: optional apb clock and power management support, IBI handling
fixes
- mipi-i3c-hci: IBI handling fixes
- svc: a few fixes"
* tag 'i3c/for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux:
dt-bindings: i3c: add header for generic I3C flags
i3c: master: svc: Fix error code in svc_i3c_master_do_daa_locked()
i3c: master: Enhance i3c_bus_type visibility for device searching & event monitoring
i3c: dw: Add power management support
i3c: dw: Add some functions for reusability
i3c: dw: Save timing registers and other values
i3c: master: svc: Improve DAA STOP handle code logic
i3c: dw: Add optional apb clock
i3c: dw: Use new *_enabled clk API
dt-bindings: i3c: dw: Add apb clock binding
i3c: master: svc: Convert comma to semicolon
i3c: mipi-i3c-hci: Round IBI data chunk size to HW supported value
i3c: mipi-i3c-hci: Error out instead on BUG_ON() in IBI DMA setup
i3c: mipi-i3c-hci: Set IBI Status and Data Ring base addresses
i3c: mipi-i3c-hci: Switch to lower_32_bits()/upper_32_bits() helpers
i3c: dw: Remove ibi_capable property
i3c: dw: Fix IBI intr programming
i3c: dw: Fix clearing queue thld
i3c: mipi-i3c-hci: Fix number of DAT/DCT entries for HCI versions < 1.1
i3c: master: svc: resend target address when get NACK
Merge tag 'thermal-6.11-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull thermal control fix from Rafael Wysocki:
"Prevent the thermal core from flooding the kernel log with useless
messages if thermal zone temperature can never be determined (or its
sensor has failed permanently) and make it finally give up and disable
defective thermal zones (Rafael Wysocki)"
* tag 'thermal-6.11-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: core: Back off when polling thermal zones on errors
thermal: trip: Split thermal_zone_device_set_mode()
Merge tag 'mm-hotfixes-stable-2024-07-26-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"11 hotfixes, 7 of which are cc:stable. 7 are MM, 4 are other"
* tag 'mm-hotfixes-stable-2024-07-26-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
nilfs2: handle inconsistent state in nilfs_btnode_create_block()
selftests/mm: skip test for non-LPA2 and non-LVA systems
mm/page_alloc: fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist()
mm: memcg: add cacheline padding after lruvec in mem_cgroup_per_node
alloc_tag: outline and export free_reserved_page()
decompress_bunzip2: fix rare decompression failure
mm/huge_memory: avoid PMD-size page cache if needed
mm: huge_memory: use !CONFIG_64BIT to relax huge page alignment on 32 bit machines
mm: fix old/young bit handling in the faulting path
dt-bindings: arm: update James Clark's email address
MAINTAINERS: mailmap: update James Clark's email address
Merge tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer migration updates from Thomas Gleixner:
"Fixes and minor updates for the timer migration code:
- Stop testing the group->parent pointer as it is not guaranteed to
be stable over a chain of operations by design.
This includes a warning which would be nice to have but it produces
false positives due to the racy nature of the check.
- Plug a race between CPUs going in and out of idle and a CPU hotplug
operation. The latter can create and connect a new hierarchy level
which is missed in the concurrent updates of CPUs which go into
idle. As a result the events of such a CPU might not be processed
and timers go stale.
Cure it by splitting the hotplug operation into a prepare and
online callback. The prepare callback is guaranteed to run on an
online and therefore active CPU. This CPU updates the hierarchy and
being online ensures that there is always at least one migrator
active which handles the modified hierarchy correctly when going
idle. The online callback which runs on the incoming CPU then just
marks the CPU active and brings it into operation.
- Improve tracing and polish the code further so it is more obvious
what's going on"
* tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Fix grammar in comment
timers/migration: Spare write when nothing changed
timers/migration: Rename childmask by groupmask to make naming more obvious
timers/migration: Read childmask and parent pointer in a single place
timers/migration: Use a single struct for hierarchy walk data
timers/migration: Improve tracing
timers/migration: Move hierarchy setup into cpuhotplug prepare callback
timers/migration: Do not rely always on group->parent
Merge tag 'riscv-for-linus-6.11-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull more RISC-V updates from Palmer Dabbelt:
- Support for NUMA (via SRAT and SLIT), console output (via SPCR), and
cache info (via PPTT) on ACPI-based systems.
- The trap entry/exit code no longer breaks the return address stack
predictor on many systems, which results in an improvement to trap
latency.
- Support for HAVE_ARCH_STACKLEAK.
- The sv39 linear map has been extended to support 128GiB mappings.
- The frequency of the mtime CSR is now visible via hwprobe.
* tag 'riscv-for-linus-6.11-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (21 commits)
RISC-V: Provide the frequency of time CSR via hwprobe
riscv: Extend sv39 linear mapping max size to 128G
riscv: enable HAVE_ARCH_STACKLEAK
riscv: signal: Remove unlikely() from WARN_ON() condition
riscv: Improve exception and system call latency
RISC-V: Select ACPI PPTT drivers
riscv: cacheinfo: initialize cacheinfo's level and type from ACPI PPTT
riscv: cacheinfo: remove the useless input parameter (node) of ci_leaf_init()
RISC-V: ACPI: Enable SPCR table for console output on RISC-V
riscv: boot: remove duplicated targets line
trace: riscv: Remove deprecated kprobe on ftrace support
riscv: cpufeature: Extract common elements from extension checking
riscv: Introduce vendor variants of extension helpers
riscv: Add vendor extensions to /proc/cpuinfo
riscv: Extend cpufeature.c to detect vendor extensions
RISC-V: run savedefconfig for defconfig
RISC-V: hwprobe: sort EXT_KEY()s in hwprobe_isa_ext0() alphabetically
ACPI: NUMA: replace pr_info with pr_debug in arch_acpi_numa_init
ACPI: NUMA: change the ACPI_NUMA to a hidden option
ACPI: NUMA: Add handler for SRAT RINTC affinity structure
...
Merge tag 'for-linus-6.11-rc1a-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Two fixes for issues introduced in this merge window:
- fix enhanced debugging in the Xen multicall handling
- two patches fixing a boot failure when running as dom0 in PVH mode"
* tag 'for-linus-6.11-rc1a-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: fix memblock_reserve() usage on PVH
x86/xen: move xen_reserve_extra_memory()
xen: fix multicall debug data referencing
Hongbo Li [Thu, 25 Jul 2024 06:51:30 +0000 (14:51 +0800)]
hostfs: fix the host directory parse when mounting.
hostfs not keep the host directory when mounting. When the host
directory is none (default), fc->source is used as the host root
directory, and this is wrong. Here we use `parse_monolithic` to
handle the old mount path for parsing the root directory. For new
mount path, The `parse_param` is used for the host directory parse.
fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT
Christian noticed that it is possible for a privileged user to mount
most filesystems with a non-initial user namespace in sb->s_user_ns.
When fsopen() is called in a non-init namespace the caller's namespace
is recorded in fs_context->user_ns. If the returned file descriptor is
then passed to a process priviliged in init_user_ns, that process can
call fsconfig(fd_fs, FSCONFIG_CMD_CREATE), creating a new superblock
with sb->s_user_ns set to the namespace of the process which called
fsopen().
This is problematic. We cannot assume that any filesystem which does not
set FS_USERNS_MOUNT has been written with a non-initial s_user_ns in
mind, increasing the risk for bugs and security issues.
Prevent this by returning EPERM from sget_fc() when FS_USERNS_MOUNT is
not set for the filesystem and a non-initial user namespace will be
used. sget() does not need to be updated as it always uses the user
namespace of the current context, or the initial user namespace if
SB_SUBMOUNT is set.
Fixes: cb50b348c71f ("convenience helpers: vfs_get_super() and sget_fc()") Reported-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Link: https://lore.kernel.org/r/20240724-s_user_ns-fix-v1-1-895d07c94701@kernel.org Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
ALSA: firewire-lib: fix wrong value as length of header for CIP_NO_HEADER case
In a commit 1d717123bb1a ("ALSA: firewire-lib: Avoid
-Wflex-array-member-not-at-end warning"), DEFINE_FLEX() macro was used to
handle variable length of array for header field in struct fw_iso_packet
structure. The usage of macro has a side effect that the designated
initializer assigns the count of array to the given field. Therefore
CIP_HEADER_QUADLETS (=2) is assigned to struct fw_iso_packet.header,
while the original designated initializer assigns zero to all fields.
With CIP_NO_HEADER flag, the change causes invalid length of header in
isochronous packet for 1394 OHCI IT context. This bug affects all of
devices supported by ALSA fireface driver; RME Fireface 400, 800, UCX, UFX,
and 802.
This commit fixes the bug by replacing it with the alternative version of
macro which corresponds no initializer.
The header_length field is byte unit, thus it can not express the number of
elements in header field. It seems that the argument for counted_by
attribute can have no arithmetic expression, therefore this commit just
reverts the issued commit.
minmax: avoid overly complicated constant expressions in VM code
The minmax infrastructure is overkill for simple constants, and can
cause huge expansions because those simple constants are then used by
other things.
For example, 'pageblock_order' is a core VM constant, but because it was
implemented using 'min_t()' and all the type-checking that involves, it
actually expanded to something like 2.5kB of preprocessor noise.
And when that simple constant was then used inside other expansions:
the end result was that one statement expanding to 253kB in size.
There are probably other cases of this, but this one case certainly
stood out.
I've added 'MIN_T()' and 'MAX_T()' macros for this kind of "core simple
constant with specific type" use. These macros skip the type checking,
and as such need to be very sparingly used only for obvious cases that
have active issues like this.
minmax: avoid overly complex min()/max() macro arguments in xen
We have some very fancy min/max macros that have tons of sanity checking
to warn about mixed signedness etc.
This is all things that a sane compiler should warn about, but there are
no sane compiler interfaces for this, and '-Wsign-compare' is broken [1]
and not useful.
So then we compensate (some would say over-compensate) by doing the
checks manually with some truly horrid macro games.
And no, we can't just use __builtin_types_compatible_p(), because the
whole question of "does it make sense to compare these two values" is a
lot more complicated than that.
For example, it makes a ton of sense to compare unsigned values with
simple constants like "5", even if that is indeed a signed type. So we
have these very strange macros to try to make sensible type checking
decisions on the arguments to 'min()' and 'max()'.
But that can cause enormous code expansion if the min()/max() macros are
used with complicated expressions, and particularly if you nest these
things so that you get the first big expansion then expanded again.
The xen setup.c file ended up ballooning to over 50MB of preprocessed
noise that takes 15s to compile (obviously depending on the build host),
largely due to one single line.
So let's split that one single line to just be simpler. I think it ends
up being more legible to humans too at the same time. Now that single
file compiles in under a second.
nilfs2: handle inconsistent state in nilfs_btnode_create_block()
Syzbot reported that a buffer state inconsistency was detected in
nilfs_btnode_create_block(), triggering a kernel bug.
It is not appropriate to treat this inconsistency as a bug; it can occur
if the argument block address (the buffer index of the newly created
block) is a virtual block number and has been reallocated due to
corruption of the bitmap used to manage its allocation state.
So, modify nilfs_btnode_create_block() and its callers to treat it as a
possible filesystem error, rather than triggering a kernel bug.
Dev Jain [Thu, 18 Jul 2024 05:25:04 +0000 (10:55 +0530)]
selftests/mm: skip test for non-LPA2 and non-LVA systems
Post my improvement of the test in e4a4ba415419 ("selftests/mm:
va_high_addr_switch: dynamically initialize testcases to enable LPA2
testing"):
The test begins to fail on 4k and 16k pages, on non-LPA2 systems. To
reduce noise in the CI systems, let us skip the test when higher address
space is not implemented.
Link: https://lkml.kernel.org/r/20240718052504.356517-1-dev.jain@arm.com Fixes: e4a4ba415419 ("selftests/mm: va_high_addr_switch: dynamically initialize testcases to enable LPA2 testing") Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Zhijian [Tue, 23 Jul 2024 06:44:28 +0000 (14:44 +0800)]
mm/page_alloc: fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist()
It's expected that no page should be left in pcp_list after calling
zone_pcp_disable() in offline_pages(). Previously, it's observed that
offline_pages() gets stuck [1] due to some pages remaining in pcp_list.
Cause:
There is a race condition between drain_pages_zone() and __rmqueue_pcplist()
involving the pcp->count variable. See below scenario:
CPU0 CPU1
---------------- ---------------
spin_lock(&pcp->lock);
__rmqueue_pcplist() {
zone_pcp_disable() {
/* list is empty */
if (list_empty(list)) {
/* add pages to pcp_list */
alloced = rmqueue_bulk()
mutex_lock(&pcp_batch_high_lock)
...
__drain_all_pages() {
drain_pages_zone() {
/* read pcp->count, it's 0 here */
count = READ_ONCE(pcp->count)
/* 0 means nothing to drain */
/* update pcp->count */
pcp->count += alloced << order;
...
...
spin_unlock(&pcp->lock);
In this case, after calling zone_pcp_disable() though, there are still some
pages in pcp_list. And these pages in pcp_list are neither movable nor
isolated, offline_pages() gets stuck as a result.
Solution:
Expand the scope of the pcp->lock to also protect pcp->count in
drain_pages_zone(), to ensure no pages are left in the pcp list after
zone_pcp_disable()
Roman Gushchin [Tue, 23 Jul 2024 17:12:44 +0000 (17:12 +0000)]
mm: memcg: add cacheline padding after lruvec in mem_cgroup_per_node
Oliver Sand reported a performance regression caused by commit 98c9daf5ae6b ("mm: memcg: guard memcg1-specific members of struct
mem_cgroup_per_node"), which puts some fields of the mem_cgroup_per_node
structure under the CONFIG_MEMCG_V1 config option. Apparently it causes a
false cache sharing between lruvec and lru_zone_size members of the
structure. Fix it by adding an explicit padding after the lruvec member.
Even though the padding is not required with CONFIG_MEMCG_V1 set, it seems
like the introduced memory overhead is not significant enough to warrant
another divergence in the mem_cgroup_per_node layout, so the padding is
added unconditionally.
Link: https://lkml.kernel.org/r/20240723171244.747521-1-roman.gushchin@linux.dev Fixes: 98c9daf5ae6b ("mm: memcg: guard memcg1-specific members of struct mem_cgroup_per_node") Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202407121335.31a10cb6-oliver.sang@intel.com Tested-by: Oliver Sang <oliver.sang@intel.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
alloc_tag: outline and export free_reserved_page()
Outline and export free_reserved_page() because modules use it and it in
turn uses page_ext_{get|put} which should not be exported. The same
result could be obtained by outlining {get|put}_page_tag_ref() but that
would have higher performance impact as these functions are used in more
performance critical paths.
Link: https://lkml.kernel.org/r/20240717212844.2749975-1-surenb@google.com Fixes: dcfe378c81f7 ("lib: introduce support for page allocation tagging") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202407080044.DWMC9N9I-lkp@intel.com/ Suggested-by: Christoph Hellwig <hch@infradead.org> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Sourav Panda <souravpanda@google.com> Cc: <stable@vger.kernel.org> [6.10] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>