Yonghong Song [Thu, 6 Apr 2023 16:45:00 +0000 (09:45 -0700)]
selftests/bpf: Add tests for non-constant cond_op NE/EQ bound deduction
Add various tests for code pattern '<non-const> NE/EQ <const>' implemented
in the previous verifier patch. Without the verifier patch, these new
tests will fail.
Yonghong Song [Thu, 6 Apr 2023 16:44:55 +0000 (09:44 -0700)]
bpf: Improve verifier JEQ/JNE insn branch taken checking
Currently, for BPF_JEQ/BPF_JNE insn, verifier determines
whether the branch is taken or not only if both operands
are constants. Therefore, for the following code snippet,
0: (85) call bpf_ktime_get_ns#5 ; R0_w=scalar()
1: (a5) if r0 < 0x3 goto pc+2 ; R0_w=scalar(umin=3)
2: (b7) r2 = 2 ; R2_w=2
3: (1d) if r0 == r2 goto pc+2 6
At insn 3, since r0 is not a constant, verifier assumes both branch
can be taken which may lead inproper verification failure.
Add comparing umin/umax value and the constant. If the umin value
is greater than the constant, or umax value is smaller than the constant,
for JEQ the branch must be not-taken, and for JNE the branch must be taken.
The jmp32 mode JEQ/JNE branch taken checking is also handled similarly.
The following lists the veristat result w.r.t. changed number
of processes insns during verification:
Mostly a small improvement but test_cls_redirect.bpf.linked3.o has a 13% regression.
I checked with verifier log and found it this is due to pruning.
For some JEQ/JNE branches impacted by this patch,
one branch is explored and the other has state equivalence and
pruned.
Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Dave Marchevsky <davemarchevsky@fb.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230406164455.1045294-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patchset includes the test with the bugfix as requested here:
https://lore.kernel.org/all/f1a32d5a-03e7-fce1-f5a5-6095f365f0a9@linux.dev/
Patch #1 (the bugfix) is identical to the previous submission except
that I improved the commit message slightly.
Magnus: I improved the test code a little different than you asked
since I thought this was a little simpler than having a separate
function for now. Hopefully, you can live with this :-).
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
selftests: xsk: Add test UNALIGNED_INV_DESC_4K1_FRAME_SIZE
Add unaligned descriptor test for frame size of 4001. Using an odd frame
size ensures that the end of the UMEM is not near a page boundary. This
allows testing descriptors that staddle the end of the UMEM but not a
page.
This test used to fail without the previous commit ("xsk: Fix unaligned
descriptor validation").
Make sure unaligned descriptors that straddle the end of the UMEM are
considered invalid. Currently, descriptor validation is broken for
zero-copy mode which only checks descriptors at page granularity.
For example, descriptors in zero-copy mode that overrun the end of the
UMEM but not a page boundary are (incorrectly) considered valid. The
UMEM boundary check needs to happen before the page boundary and
contiguity checks in xp_desc_crosses_non_contig_pg(). Do this check in
xp_unaligned_validate_desc() instead like xp_check_unaligned() already
does.
Fixes: 2b43470add8c ("xsk: Introduce AF_XDP buffer allocation API") Signed-off-by: Kal Conley <kal.conley@dectris.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230405235920.7305-2-kal.conley@dectris.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Viktor Malik [Thu, 30 Mar 2023 10:20:01 +0000 (12:20 +0200)]
kallsyms: move module-related functions under correct configs
Functions for searching module kallsyms should have non-empty
definitions only if CONFIG_MODULES=y and CONFIG_KALLSYMS=y. Until now,
only CONFIG_MODULES check was used for many of these, which may have
caused complilation errors on some configs.
This patch moves all relevant functions under the correct configs.
Merge branch 'bpftool: Add inline annotations when dumping program CFGs'
Quentin Monnet says:
====================
This set contains some improvements for bpftool's "visual" program dump
option, which produces the control flow graph in a DOT format. The main
objective is to add support for inline annotations on such graphs, so that
we can have the C source code for the program showing up alongside the
instructions, when available. The last commits also make it possible to
display the line numbers or the bare opcodes in the graph, as supported by
regular program dumps.
v3:
- Fixed formatting of DOT graph: escape spaces, and remove indent that
would cause some unwanted spaces to show up in the resulting graph.
- Don't print line information if the record is empty.
- Add '<' and ' ' to the list of escaped characters for generting the
DOT graph.
- Truncate long file paths, use shorter field names ("line", "col") for
code location information in the graph, add missing separator space.
- Add a commit to return an error if JSON output and CFG are both
required.
- Add a drive-by, clean up commit for bash completion (avoid unnecessary
calls to _bpftool_once_attr()).
v2: Replace fputc(..., stdout) with putchar(...) in dotlabel_puts().
====================
bpftool: Clean up _bpftool_once_attr() calls in bash completion
In bpftool's bash completion file, function _bpftool_once_attr() is able
to process multiple arguments. There are a few locations where this
function is called multiple times in a row, each time for a single
argument; let's pass all arguments instead to minimize the number of
function calls required for the completion.
bpftool: Support printing opcodes and source file references in CFG
Add support for displaying opcodes or/and file references (filepath,
line and column numbers) when dumping the control flow graphs of loaded
BPF programs with bpftool.
The filepaths in the records are absolute. To avoid blocks on the graph
to get too wide, we truncate them when they get too long (but we always
keep the entire file name). In the unlikely case where the resulting
file name is ambiguous, it remains possible to get the full path with a
regular dump (no CFG).
bpftool: Support "opcodes", "linum", "visual" simultaneously
When dumping a program, the keywords "opcodes" (for printing the raw
opcodes), "linum" (for displaying the filename, line number, column
number along with the source code), and "visual" (for generating the
control flow graph for translated programs) are mutually exclusive. But
there's no reason why they should be. Let's make it possible to pass
several of them at once. The "file FILE" option, which makes bpftool
output a binary image to a file, remains incompatible with the others.
bpftool: Return an error on prog dumps if both CFG and JSON are required
We do not support JSON output for control flow graphs of programs with
bpftool. So far, requiring both the CFG and JSON output would result in
producing a null JSON object. It makes more sense to raise an error
directly when parsing command line arguments and options, so that users
know they won't get any output they might expect.
If JSON is required for the graph, we leave it to Graphviz instead:
bpftool: Support inline annotations when dumping the CFG of a program
We support dumping the control flow graph of loaded programs to the DOT
format with bpftool, but so far this feature wouldn't display the source
code lines available through BTF along with the eBPF bytecode. Let's add
support for these annotations, to make it easier to read the graph.
In prog.c, we move the call to dump_xlated_cfg() in order to pass and
use the full struct dump_data, instead of creating a minimal one in
draw_bb_node().
We pass the pointer to this struct down to dump_xlated_for_graph() in
xlated_dumper.c, where most of the logics is added. We deal with BTF
mostly like we do for plain or JSON output, except that we cannot use a
"nr_skip" value to skip a given number of linfo records (we don't
process the BPF instructions linearly, and apart from the root of the
graph we don't know how many records we should skip, so we just store
the last linfo and make sure the new one we find is different before
printing it).
When printing the source instructions to the label of a DOT graph node,
there are a few subtleties to address. We want some special newline
markers, and there are some characters that we must escape. To deal with
them, we introduce a new dedicated function btf_dump_linfo_dotlabel() in
btf_dumper.c. We'll reuse this function in a later commit to format the
filepath, line, and column references as well.
bpftool: Fix bug for long instructions in program CFG dumps
When dumping the control flow graphs for programs using the 16-byte long
load instruction, we need to skip the second part of this instruction
when looking for the next instruction to process. Otherwise, we end up
printing "BUG_ld_00" from the kernel disassembler in the CFG.
bpftool: Fix documentation about line info display for prog dumps
The documentation states that when line_info is available when dumping a
program, the source line will be displayed "by default". There is no
notion of "default" here: the line is always displayed if available,
there is no way currently to turn it off.
In the next sentence, the documentation states that if "linum" is used
on the command line, the relevant filename, line, and column will be
displayed "on top of the source line". This is incorrect, as they are
currently displayed on the right side of the source line (or on top of
the eBPF instruction, not the source).
This commit fixes the documentation to address these points.
selftests/bpf: Wait for receive in cg_storage_multi test
In some cases the loopback latency might be large enough, causing
the assertion on invocations to be run before ingress prog getting
executed. The assertion would fail and the test would flake.
This can be reliably reproduced by arbitrarily increasing the
loopback latency (thanks to [1]):
tc qdisc add dev lo root handle 1: htb default 12
tc class add dev lo parent 1:1 classid 1:12 htb rate 20kbps ceil 20kbps
tc qdisc add dev lo parent 1:12 netem delay 100ms
Fix this by waiting on the receive end, instead of instantly
returning to the assert. The call to read() will wait for the
default SO_RCVTIMEO timeout of 3 seconds provided by
start_server().
Fix flaky STATS_RX_DROPPED test. The receiver calls getsockopt after
receiving the last (valid) packet which is not the final packet sent in
the test (valid and invalid packets are sent in alternating fashion with
the final packet being invalid). Since the last packet may or may not
have been dropped already, both outcomes must be allowed.
This issue could also be fixed by making sure the last packet sent is
valid. This alternative is left as an exercise to the reader (or the
benevolent maintainers of this file).
This problem was quite visible on certain setups. On one machine this
failure was observed 50% of the time.
Also, remove a redundant assignment of pkt_stream->nb_pkts. This field
is already initialized by __pkt_stream_alloc.
Fixes: 27e934bec35b ("selftests: xsk: make stat tests not spin on getsockopt") Signed-off-by: Kal Conley <kal.conley@dectris.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230403120400.31018-1-kal.conley@dectris.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
This change fixes flakiness in the BIDIRECTIONAL test:
# [is_pkt_valid] expected length [60], got length [90]
not ok 1 FAIL: SKB BUSY-POLL BIDIRECTIONAL
When IPv6 is enabled, the interface will periodically send MLDv1 and
MLDv2 packets. These packets can cause the BIDIRECTIONAL test to fail
since it uses VETH0 for RX.
For other tests, this was not a problem since they only receive on VETH1
and IPv6 was already disabled on VETH0.
selftests: xsk: Use correct UMEM size in testapp_invalid_desc
Avoid UMEM_SIZE macro in testapp_invalid_desc which is incorrect when
the frame size is not XSK_UMEM__DEFAULT_FRAME_SIZE. Also remove the
macro since it's no longer being used.
Fixes: 909f0e28207c ("selftests: xsk: Add tests for 2K frame size") Signed-off-by: Kal Conley <kal.conley@dectris.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230403145047.33065-2-kal.conley@dectris.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Jiri Olsa [Mon, 3 Apr 2023 22:02:54 +0000 (00:02 +0200)]
kallsyms: Disable preemption for find_kallsyms_symbol_value
Artem reported suspicious RCU usage [1]. The reason is that verifier
calls find_kallsyms_symbol_value with preemption enabled which will
trigger suspicious RCU usage warning in rcu_dereference_sched call.
Disabling preemption in find_kallsyms_symbol_value and adding
__find_kallsyms_symbol_value function.
Merge branch 'bpf: Follow up to RCU enforcement in the verifier.'
Alexei Starovoitov says:
====================
From: Alexei Starovoitov <ast@kernel.org>
The patch set is addressing a fallout from
commit 6fcd486b3a0a ("bpf: Refactor RCU enforcement in the verifier.")
It was too aggressive with PTR_UNTRUSTED marks.
Patches 1-6 are cleanup and adding verifier smartness to address real
use cases in bpf programs that broke with too aggressive PTR_UNTRUSTED.
The partial revert is done in patch 7 anyway.
====================
bpf: Undo strict enforcement for walking untagged fields.
The commit 6fcd486b3a0a ("bpf: Refactor RCU enforcement in the verifier.")
broke several tracing bpf programs. Even in clang compiled kernels there are
many fields that are not marked with __rcu that are safe to read and pass into
helpers, but the verifier doesn't know that they're safe. Aggressively marking
them as PTR_UNTRUSTED was premature.
bpf: Refactor NULL-ness check in check_reg_type().
check_reg_type() unconditionally disallows PTR_TO_BTF_ID | PTR_MAYBE_NULL.
It's problematic for helpers that allow ARG_PTR_TO_BTF_ID_OR_NULL like
bpf_sk_storage_get(). Allow passing PTR_TO_BTF_ID | PTR_MAYBE_NULL into such
helpers. That technically includes bpf_kptr_xchg() helper, but in practice:
bpf_kptr_xchg(..., bpf_cpumask_create());
is still disallowed because bpf_cpumask_create() returns ref counted pointer
with ref_obj_id > 0.
bpf: Teach verifier that certain helpers accept NULL pointer.
bpf_[sk|inode|task|cgrp]_storage_[get|delete]() and bpf_get_socket_cookie() helpers
perform run-time check that sk|inode|task|cgrp pointer != NULL.
Teach verifier about this fact and allow bpf programs to pass
PTR_TO_BTF_ID | PTR_MAYBE_NULL into such helpers.
It will be used in the subsequent patch that will do
bpf_sk_storage_get(.., skb->sk, ...);
Even when 'skb' pointer is trusted the 'sk' pointer may be NULL.
btf_nested_type_is_trusted() tries to find a struct member at corresponding offset.
It works for flat structures and falls apart in more complex structs with nested structs.
The offset->member search is already performed by btf_struct_walk() including nested structs.
Reuse this work and pass {field name, field btf id} into btf_nested_type_is_trusted()
instead of offset to make BTF_TYPE_SAFE*() logic more robust.
bpf: Invoke btf_struct_access() callback only for writes.
Remove duplicated if (atype == BPF_READ) btf_struct_access() from
btf_struct_access() callback and invoke it only for writes. This is
possible to do because currently btf_struct_access() custom callback
always delegates to generic btf_struct_access() helper for BPF_READ
accesses.
Dave Marchevsky [Mon, 3 Apr 2023 20:00:27 +0000 (13:00 -0700)]
bpf: Fix struct_meta lookup for bpf_obj_free_fields kfunc call
bpf_obj_drop_impl has a void return type. In check_kfunc_call, the "else
if" which sets insn_aux->kptr_struct_meta for bpf_obj_drop_impl is
surrounded by a larger if statement which checks btf_type_is_ptr. As a
result:
* The bpf_obj_drop_impl-specific code will never execute
* The btf_struct_meta input to bpf_obj_drop is always NULL
* __bpf_obj_drop_impl will always see a NULL btf_record when called
from BPF program, and won't call bpf_obj_free_fields
* program-allocated kptrs which have fields that should be cleaned up
by bpf_obj_free_fields may instead leak resources
This patch adds a btf_type_is_void branch to the larger if and moves
special handling for bpf_obj_drop_impl there, fixing the issue.
Dave Thaler [Sun, 26 Mar 2023 05:49:46 +0000 (05:49 +0000)]
bpf, docs: Add docs on extended 64-bit immediate instructions
Add docs on extended 64-bit immediate instructions, including six instructions
previously undocumented. Include a brief description of maps and variables,
as used by those instructions.
bpf: compute hashes in bloom filter similar to hashmap
If the value size in a bloom filter is a multiple of 4, then the jhash2()
function is used to compute hashes. The length parameter of this function
equals to the number of 32-bit words in input. Compute it in the hot path
instead of pre-computing it, as this is translated to one extra shift to
divide the length by four vs. one extra memory load of a pre-computed length.
bpf: optimize hashmap lookups when key_size is divisible by 4
The BPF hashmap uses the jhash() hash function. There is an optimized version
of this hash function which may be used if hash size is a multiple of 4. Apply
this optimization to the hashmap in a similar way as it is done in the bloom
filter map.
On practice the optimization is only noticeable for smaller key sizes, which,
however, is sufficient for many applications. An example is listed in the
following table of measurements (a hashmap of 65536 elements was used):
Merge branch 'Enable RCU semantics for task kptrs'
David Vernet says:
====================
In commit 22df776a9a86 ("tasks: Extract rcu_users out of union"), the
'refcount_t rcu_users' field was extracted out of a union with the
'struct rcu_head rcu' field. This allows us to use the field for
refcounting struct task_struct with RCU protection, as the RCU callback
no longer flips rcu_users to be nonzero after the callback is scheduled.
This patch set leverages this to do a few things:
1. Marks struct task_struct as RCU safe in the verifier, allowing
referenced kptr tasks stored in maps to be accessed in an RCU
read region without acquiring a reference (with just a NULL check).
2. Makes bpf_task_acquire() a KF_ACQUIRE | KF_RCU | KF_RET_NULL kfunc.
3. Removes bpf_task_kptr_get() and bpf_task_acquire_not_zero(), as
they're now redundant with the above two changes.
4. Updates selftests and documentation accordingly.
---
Changelog:
v1: https://lore.kernel.org/all/20230331005733.406202-1-void@manifault.com/
v1 -> v2:
- Remove testcases validating nested trust inheritance. The first
version used 'struct task_struct __rcu *parent', but because that
field has the __rcu tag it functions differently on gcc and llvm and
causes gcc selftests to fail. Alexei is reworking nested trust,
anyways so let's leave it off for now (Alexei).
====================
David Vernet [Fri, 31 Mar 2023 19:57:33 +0000 (14:57 -0500)]
bpf,docs: Update documentation to reflect new task kfuncs
Now that struct task_struct objects are RCU safe, and bpf_task_acquire()
can return NULL, we should update the BPF task kfunc documentation to
reflect the current state of the API.
David Vernet [Fri, 31 Mar 2023 19:57:32 +0000 (14:57 -0500)]
bpf: Remove now-defunct task kfuncs
In commit 22df776a9a86 ("tasks: Extract rcu_users out of union"), the
'refcount_t rcu_users' field was extracted out of a union with the
'struct rcu_head rcu' field. This allows us to safely perform a
refcount_inc_not_zero() on task->rcu_users when acquiring a reference on
a task struct. A prior patch leveraged this by making struct task_struct
an RCU-protected object in the verifier, and by bpf_task_acquire() to
use the task->rcu_users field for synchronization.
Now that we can use RCU to protect tasks, we no longer need
bpf_task_kptr_get(), or bpf_task_acquire_not_zero(). bpf_task_kptr_get()
is truly completely unnecessary, as we can just use RCU to get the
object. bpf_task_acquire_not_zero() is now equivalent to
bpf_task_acquire().
In addition to these changes, this patch also updates the associated
selftests to no longer use these kfuncs.
David Vernet [Fri, 31 Mar 2023 19:57:31 +0000 (14:57 -0500)]
bpf: Make struct task_struct an RCU-safe type
struct task_struct objects are a bit interesting in terms of how their
lifetime is protected by refcounts. task structs have two refcount
fields:
1. refcount_t usage: Protects the memory backing the task struct. When
this refcount drops to 0, the task is immediately freed, without
waiting for an RCU grace period to elapse. This is the field that
most callers in the kernel currently use to ensure that a task
remains valid while it's being referenced, and is what's currently
tracked with bpf_task_acquire() and bpf_task_release().
2. refcount_t rcu_users: A refcount field which, when it drops to 0,
schedules an RCU callback that drops a reference held on the 'usage'
field above (which is acquired when the task is first created). This
field therefore provides a form of RCU protection on the task by
ensuring that at least one 'usage' refcount will be held until an RCU
grace period has elapsed. The qualifier "a form of" is important
here, as a task can remain valid after task->rcu_users has dropped to
0 and the subsequent RCU gp has elapsed.
In terms of BPF, we want to use task->rcu_users to protect tasks that
function as referenced kptrs, and to allow tasks stored as referenced
kptrs in maps to be accessed with RCU protection.
Let's first determine whether we can safely use task->rcu_users to
protect tasks stored in maps. All of the bpf_task* kfuncs can only be
called from tracepoint, struct_ops, or BPF_PROG_TYPE_SCHED_CLS, program
types. For tracepoint and struct_ops programs, the struct task_struct
passed to a program handler will always be trusted, so it will always be
safe to call bpf_task_acquire() with any task passed to a program.
Note, however, that we must update bpf_task_acquire() to be KF_RET_NULL,
as it is possible that the task has exited by the time the program is
invoked, even if the pointer is still currently valid because the main
kernel holds a task->usage refcount. For BPF_PROG_TYPE_SCHED_CLS, tasks
should never be passed as an argument to the any program handlers, so it
should not be relevant.
The second question is whether it's safe to use RCU to access a task
that was acquired with bpf_task_acquire(), and stored in a map. Because
bpf_task_acquire() now uses task->rcu_users, it follows that if the task
is present in the map, that it must have had at least one
task->rcu_users refcount by the time the current RCU cs was started.
Therefore, it's safe to access that task until the end of the current
RCU cs.
With all that said, this patch makes struct task_struct is an
RCU-protected object. In doing so, we also change bpf_task_acquire() to
be KF_ACQUIRE | KF_RCU | KF_RET_NULL, and adjust any selftests as
necessary. A subsequent patch will remove bpf_task_kptr_get(), and
bpf_task_acquire_not_zero() respectively.
Andrii Nakryiko [Fri, 31 Mar 2023 22:24:02 +0000 (15:24 -0700)]
veristat: relicense veristat.c as dual GPL-2.0-only or BSD-2-Clause licensed
Dual-license veristat.c to dual GPL-2.0-only or BSD-2-Clause license.
This is needed to mirror it to Github to make it convenient for distro
packagers to package veristat as a separate package.
Veristat grew into a useful tool by itself, and there are already
a bunch of users relying on veristat as generic BPF loading and
verification helper tool. So making it easy to packagers by providing
Github mirror just like we do for bpftool and libbpf is the next step to
get veristat into the hands of users.
Apart from few typo fixes, I'm the sole contributor to veristat.c so
far, so no extra Acks should be needed for relicensing.
In file included from progs/bench_local_storage_create.c:6:
progs/bench_local_storage_create.c:43:14: error: conflicting types for
built-in function 'fork'; expected 'int(void)'
[-Werror=builtin-declaration-mismatch]
43 | int BPF_PROG(fork, struct task_struct *parent, struct
task_struct *child)
| ^~~~
hi,
this selftests cleanup was previously posted as part of file build id changes [1],
which might take more time, so I'm sending the selftests changes separately so it
won't get stuck.
v4 changes:
- added size argument to read_build_id [Andrii]
- condition changes in parse_build_id_buf [Andrii]
- use ELF_C_READ_MMAP in elf_begin [Andrii]
- return -ENOENT in read_build_id if build id is not found [Andrii]
- dropped elf class check [Andrii]
David Vernet [Thu, 30 Mar 2023 14:52:03 +0000 (09:52 -0500)]
selftests/bpf: Add testcases for ptr_*_or_null_ in bpf_kptr_xchg
The second argument of the bpf_kptr_xchg() helper function is
ARG_PTR_TO_BTF_ID_OR_NULL. A recent patch fixed a bug whereby the
verifier would fail with an internal error message if a program invoked
the helper with a PTR_TO_BTF_ID | PTR_MAYBE_NULL register. This testcase
adds some testcases to ensure that it fails gracefully moving forward.
Before the fix, these testcases would have failed an error resembling
the following:
David Vernet [Thu, 30 Mar 2023 14:52:02 +0000 (09:52 -0500)]
bpf: Handle PTR_MAYBE_NULL case in PTR_TO_BTF_ID helper call arg
When validating a helper function argument, we use check_reg_type() to
ensure that the register containing the argument is of the correct type.
When the register's base type is PTR_TO_BTF_ID, there is some
supplemental logic where we do extra checks for various combinations of
PTR_TO_BTF_ID type modifiers. For example, for PTR_TO_BTF_ID,
PTR_TO_BTF_ID | PTR_TRUSTED, and PTR_TO_BTF_ID | MEM_RCU, we call
map_kptr_match_type() for bpf_kptr_xchg() calls, and
btf_struct_ids_match() for other helper calls.
When an unhandled PTR_TO_BTF_ID type modifier combination is passed to
check_reg_type(), the verifier fails with an internal verifier error
message. This can currently be triggered by passing a PTR_MAYBE_NULL
pointer to helper functions (currently just bpf_kptr_xchg()) with an
ARG_PTR_TO_BTF_ID_OR_NULL arg type. For example, by callin
bpf_kptr_xchg(&v->kptr, bpf_cpumask_create()).
Whether or not passing a PTR_MAYBE_NULL arg to an
ARG_PTR_TO_BTF_ID_OR_NULL argument is valid is an interesting question.
In a vacuum, it seems fine. A helper function with an
ARG_PTR_TO_BTF_ID_OR_NULL arg would seem to be implying that it can
handle either a NULL or non-NULL arg, and has logic in place to detect
and gracefully handle each. This is the case for bpf_kptr_xchg(), which
of course simply does an xchg(). On the other hand, bpf_kptr_xchg() also
specifies OBJ_RELEASE, and refcounting semantics for a PTR_MAYBE_NULL
pointer is different than handling it for a NULL _OR_ non-NULL pointer.
For example, with a non-NULL arg, we should always fail if there was not
a nonzero refcount for the value in the register being passed to the
helper. For PTR_MAYBE_NULL on the other hand, it's unclear. If the
pointer is NULL it would be fine, but if it's not NULL, it would be
incorrect to load the program.
The current solution to this is to just fail if PTR_MAYBE_NULL is
passed, and to instead require programs to have a NULL check to
explicitly handle the NULL and non-NULL cases. This seems reasonable.
Not only would it possibly be quite complicated to correctly handle
PTR_MAYBE_NULL refcounting in the verifier, but it's also an arguably
odd programming pattern in general to not explicitly handle the NULL
case anyways. For example, it seems odd to not care about whether a
pointer you're passing to bpf_kptr_xchg() was successfully allocated in
a program such as the following:
This patch therefore updates the verifier to explicitly check for
PTR_MAYBE_NULL in check_reg_type(), and fail gracefully if it's
observed. This isn't really "fixing" anything unsafe or incorrect. We're
just updating the verifier to fail gracefully, and explicitly handle
this pattern rather than unintentionally falling back to an internal
verifier error path. A subsequent patch will update selftests.
Xu Kuohai [Wed, 29 Mar 2023 01:10:48 +0000 (21:10 -0400)]
selftests/bpf: Rewrite two infinite loops in bound check cases
The two infinite loops in bound check cases added by commit 1a3148fc171f ("selftests/bpf: Check when bounds are not in the 32-bit range")
increased the execution time of test_verifier from about 6 seconds to
about 9 seconds. Rewrite these two infinite loops to finite loops to get
rid of this extra time cost.
Merge branch 'veristat: add better support of freplace programs'
Andrii Nakryiko says:
====================
Teach veristat how to deal with freplace BPF programs. As they can't be
directly loaded by veristat without custom user-space part that sets correct
target program FD, veristat always fails freplace programs. This patch set
teaches veristat to guess target program type that will be inherited by
freplace program itself, and subtitute it for BPF_PROG_TYPE_EXT (freplace) one
for the purposes of BPF verification.
Patch #1 fixes bug in libbpf preventing overriding freplace with specific
program type.
Patch #2 adds convenient -d flag to request veristat to emit libbpf debug
logs. It help debugging why a specific BPF program fails to load, if the
problem is not due to BPF verification itself.
v3->v4:
- fix optional kern_name check when guessing prog type (Alexei);
v2->v3:
- fix bpf_obj_id selftest that uses legacy bpf_prog_test_load() helper,
which always sets program type programmatically; teach the helper to do it
only if actually necessary (Stanislav);
v1->v2:
- fix compilation error reported by old GCC (my GCC v11 doesn't produce even
a warning) and Clang (see CI failure at [0]):
GCC version:
veristat.c: In function ‘fixup_obj’:
veristat.c:908:1: error: label at end of compound statement
908 | skip_freplace_fixup:
| ^~~~~~~~~~~~~~~~~~~
Clang version:
veristat.c:909:1: error: label at end of compound statement is a C2x extension [-Werror,-Wc2x-extensions]
}
^
1 error generated.
Andrii Nakryiko [Mon, 27 Mar 2023 18:52:02 +0000 (11:52 -0700)]
veristat: guess and substitue underlying program type for freplace (EXT) progs
SEC("freplace") (i.e., BPF_PROG_TYPE_EXT) programs are not loadable as
is through veristat, as kernel expects actual program's FD during
BPF_PROG_LOAD time, which veristat has no way of knowing.
Unfortunately, freplace programs are a pretty important class of
programs, especially when dealing with XDP chaining solutions, which
rely on EXT programs.
So let's do our best and teach veristat to try to guess the original
program type, based on program's context argument type. And if guessing
process succeeds, we manually override freplace/EXT with guessed program
type using bpf_program__set_type() setter to increase chances of proper
BPF verification.
We rely on BTF and maintain a simple lookup table. This process is
obviously not 100% bulletproof, as valid program might not use context
and thus wouldn't have to specify correct type. Also, __sk_buff is very
ambiguous and is the context type across many different program types.
We pick BPF_PROG_TYPE_CGROUP_SKB for now, which seems to work fine in
practice so far. Similarly, some program types require specifying attach
type, and so we pick one out of possible few variants.
Best effort at its best. But this makes veristat even more widely
applicable.
Andrii Nakryiko [Mon, 27 Mar 2023 18:52:00 +0000 (11:52 -0700)]
libbpf: disassociate section handler on explicit bpf_program__set_type() call
If user explicitly overrides programs's type with
bpf_program__set_type() API call, we need to disassociate whatever
SEC_DEF handler libbpf determined initially based on program's SEC()
definition, as it's not goind to be valid anymore and could lead to
crashes and/or confusing failures.
Also, fix up bpf_prog_test_load() helper in selftests/bpf, which is
force-setting program type (even if that's completely unnecessary; this
is quite a legacy piece of code), and thus should expect auto-attach to
not work, yet one of the tests explicitly relies on auto-attach for
testing.
Instead, force-set program type only if it differs from the desired one.
Martin KaFai Lau [Wed, 29 Mar 2023 20:10:56 +0000 (13:10 -0700)]
Merge branch 'Allow BPF TCP CCs to write app_limited'
Yixin Shen says:
====================
This series allow BPF TCP CCs to write app_limited of struct
tcp_sock. A built-in CC or one from a kernel module is already
able to write to app_limited of struct tcp_sock. Until now,
a BPF CC doesn't have write access to this member of struct
tcp_sock.
v2:
- Merge the test of writing app_limited into the test of
writing sk_pacing. (Martin KaFai Lau)
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Yixin Shen [Wed, 29 Mar 2023 07:35:57 +0000 (07:35 +0000)]
bpf: allow a TCP CC to write app_limited
A CC that implements tcp_congestion_ops.cong_control() should be able to
write app_limited. A built-in CC or one from a kernel module is already
able to write to this member of struct tcp_sock.
For a BPF program, write access has not been allowed, yet.
Andrii Nakryiko [Tue, 28 Mar 2023 21:48:27 +0000 (14:48 -0700)]
Merge branch 'verifier/xdp_direct_packet_access.c converted to inline assembly'
Eduard Zingerman says:
====================
verifier/xdp_direct_packet_access.c automatically converted to inline
assembly using [1].
This is a leftover from [2], the last patch in a batch was blocked by
mail server for being too long. This patch-set splits it in two:
- one to add migrated test to progs/
- one to remove old test from verifier/
[1] Migration tool
https://github.com/eddyz87/verifier-tests-migrator
[2] First batch of migrated verifier/*.c tests
https://lore.kernel.org/bpf/167979433109.17761.17302808621381963629.git-patchwork-notify@kernel.org/
====================
Eduard Zingerman [Tue, 28 Mar 2023 02:08:13 +0000 (05:08 +0300)]
selftests/bpf: Remove verifier/xdp_direct_packet_access.c, converted to progs/verifier_xdp_direct_packet_access.c
Removing verifier/xdp_direct_packet_access.c.c as it was automatically converted to use
inline assembly in the previous commit. It is available in
progs/verifier_xdp_direct_packet_access.c.c.
Eduard Zingerman [Tue, 28 Mar 2023 00:47:38 +0000 (03:47 +0300)]
libbpf: Fix double-free when linker processes empty sections
Double-free error in bpf_linker__free() was reported by James Hilliard.
The error is caused by miss-use of realloc() in extend_sec().
The error occurs when two files with empty sections of the same name
are linked:
- when first file is processed:
- extend_sec() calls realloc(dst->raw_data, dst_align_sz)
with dst->raw_data == NULL and dst_align_sz == 0;
- dst->raw_data is set to a special pointer to a memory block of
size zero;
- when second file is processed:
- extend_sec() calls realloc(dst->raw_data, dst_align_sz)
with dst->raw_data == <special pointer> and dst_align_sz == 0;
- realloc() "frees" dst->raw_data special pointer and returns NULL;
- extend_sec() exits with -ENOMEM, and the old dst->raw_data value
is preserved (it is now invalid);
- eventually, bpf_linker__free() attempts to free dst->raw_data again.
This patch fixes the bug by avoiding -ENOMEM exit for dst_align_sz == 0.
The fix was suggested by Andrii Nakryiko <andrii.nakryiko@gmail.com>.
Hengqi Chen [Sun, 26 Mar 2023 09:53:41 +0000 (09:53 +0000)]
selftests/bpf: Don't assume page size is 4096
The verifier test creates BPF ringbuf maps using hard-coded
4096 as max_entries. Some tests will fail if the page size
of the running kernel is not 4096. Use getpagesize() instead.
Nuno Gonçalves [Fri, 24 Mar 2023 10:02:22 +0000 (10:02 +0000)]
xsk: allow remap of fill and/or completion rings
The remap of fill and completion rings was frowned upon as they
control the usage of UMEM which does not support concurrent use.
At the same time this would disallow the remap of these rings
into another process.
A possible use case is that the user wants to transfer the socket/
UMEM ownership to another process (via SYS_pidfd_getfd) and so
would need to also remap these rings.
This will have no impact on current usages and just relaxes the
remap limitation.
Signed-off-by: Nuno Gonçalves <nunog@fr24.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230324100222.13434-1-nunog@fr24.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Dave Thaler [Sun, 26 Mar 2023 03:31:17 +0000 (03:31 +0000)]
bpf, docs: Add extended call instructions
Add extended call instructions. Uses the term "program-local" for
call by offset. And there are instructions for calling helper functions
by "address" (the old way of using integer values), and for calling
helper functions by BTF ID (for kfuncs).
V1 -> V2: addressed comments from David Vernet
V2 -> V3: make descriptions in table consistent with updated names
v3:
- Only use bpf_mem_alloc for task and cgrp storage.
- sk and inode storage stay with kzalloc/kfree.
- Check NULL and add comments in bpf_mem_cache_raw_free() in patch 1.
- Added test and benchmark for task storage.
v2:
- Added bpf_mem_cache_alloc_flags() and bpf_mem_cache_raw_free()
to hide the internal data structure of the bpf allocator.
- Fixed a typo bug in bpf_selem_free()
- Simplified the test_local_storage test by directly using
err returned from libbpf
====================
Martin KaFai Lau [Wed, 22 Mar 2023 21:52:44 +0000 (14:52 -0700)]
bpf: Use bpf_mem_cache_alloc/free for bpf_local_storage
This patch uses bpf_mem_cache_alloc/free for allocating and freeing
bpf_local_storage for task and cgroup storage.
The changes are similar to the previous patch. A few things that
worth to mention for bpf_local_storage:
The local_storage is freed when the last selem is deleted.
Before deleting a selem from local_storage, it needs to retrieve the
local_storage->smap because the bpf_selem_unlink_storage_nolock()
may have set it to NULL. Note that local_storage->smap may have
already been NULL when the selem created this local_storage has
been removed. In this case, call_rcu will be used to free the
local_storage.
Also, the bpf_ma (true or false) value is needed before calling
bpf_local_storage_free(). The bpf_ma can either be obtained from
the local_storage->smap (if available) or any of its selem's smap.
A new helper check_storage_bpf_ma() is added to obtain
bpf_ma for a deleting bpf_local_storage.
When bpf_local_storage_alloc getting a reused memory, all
fields are either in the correct values or will be initialized.
'cache[]' must already be all NULLs. 'list' must be empty.
Others will be initialized.
Martin KaFai Lau [Wed, 22 Mar 2023 21:52:43 +0000 (14:52 -0700)]
bpf: Use bpf_mem_cache_alloc/free in bpf_local_storage_elem
This patch uses bpf_mem_alloc for the task and cgroup local storage that
the bpf prog can easily get a hold of the storage owner's PTR_TO_BTF_ID.
eg. bpf_get_current_task_btf() can be used in some of the kmalloc code
path which will cause deadlock/recursion. bpf_mem_cache_alloc is
deadlock free and will solve a legit use case in [1].
For sk storage, its batch creation benchmark shows a few percent
regression when the sk create/destroy batch size is larger than 32.
The sk creation/destruction happens much more often and
depends on external traffic. Considering it is hypothetical
to be able to cause deadlock with sk storage, it can cross
the bridge to use bpf_mem_alloc till a legit (ie. useful)
use case comes up.
For inode storage, bpf_local_storage_destroy() is called before
waiting for a rcu gp and its memory cannot be reused immediately.
inode stays with kmalloc/kfree after the rcu [or tasks_trace] gp.
A 'bool bpf_ma' argument is added to bpf_local_storage_map_alloc().
Only task and cgroup storage have 'bpf_ma == true' which
means to use bpf_mem_cache_alloc/free(). This patch only changes
selem to use bpf_mem_alloc for task and cgroup. The next patch
will change the local_storage to use bpf_mem_alloc also for
task and cgroup.
Here is some more details on the changes:
* memory allocation:
After bpf_mem_cache_alloc(), the SDATA(selem)->data is zero-ed because
bpf_mem_cache_alloc() could return a reused selem. It is to keep
the existing bpf_map_kzalloc() behavior. Only SDATA(selem)->data
is zero-ed. SDATA(selem)->data is the visible part to the bpf prog.
No need to use zero_map_value() to do the zeroing because
bpf_selem_free(..., reuse_now = true) ensures no bpf prog is using
the selem before returning the selem through bpf_mem_cache_free().
For the internal fields of selem, they will be initialized when
linking to the new smap and the new local_storage.
When 'bpf_ma == false', nothing changes in this patch. It will
stay with the bpf_map_kzalloc().
* memory free:
The bpf_selem_free() and bpf_selem_free_rcu() are modified to handle
the bpf_ma == true case.
For the common selem free path where its owner is also being destroyed,
the mem is freed in bpf_local_storage_destroy(), the owner (task
and cgroup) has gone through a rcu gp. The memory can be reused
immediately, so bpf_local_storage_destroy() will call
bpf_selem_free(..., reuse_now = true) which will do
bpf_mem_cache_free() for immediate reuse consideration.
An exception is the delete elem code path. The delete elem code path
is called from the helper bpf_*_storage_delete() and the syscall
bpf_map_delete_elem(). This path is an unusual case for local
storage because the common use case is to have the local storage
staying with its owner life time so that the bpf prog and the user
space does not have to monitor the owner's destruction. For the delete
elem path, the selem cannot be reused immediately because there could
be bpf prog using it. It will call bpf_selem_free(..., reuse_now = false)
and it will wait for a rcu tasks trace gp before freeing the elem. The
rcu callback is changed to do bpf_mem_cache_raw_free() instead of kfree().
When 'bpf_ma == false', it should be the same as before.
__bpf_selem_free() is added to do the kfree_rcu and call_tasks_trace_rcu().
A few words on the 'reuse_now == true'. When 'reuse_now == true',
it is still racing with bpf_local_storage_map_free which is under rcu
protection, so it still needs to wait for a rcu gp instead of kfree().
Otherwise, the selem may be reused by slab for a totally different struct
while the bpf_local_storage_map_free() is still using it (as a
rcu reader). For the inode case, there may be other rcu readers also.
In short, when bpf_ma == false and reuse_now == true => vanilla rcu.
Martin KaFai Lau [Wed, 22 Mar 2023 21:52:42 +0000 (14:52 -0700)]
bpf: Add a few bpf mem allocator functions
This patch adds a few bpf mem allocator functions which will
be used in the bpf_local_storage in a later patch.
bpf_mem_cache_alloc_flags(..., gfp_t flags) is added. When the
flags == GFP_KERNEL, it will fallback to __alloc(..., GFP_KERNEL).
bpf_local_storage knows its running context is sleepable (GFP_KERNEL)
and provides a better guarantee on memory allocation.
bpf_local_storage has some uncommon cases that its selem
cannot be reused immediately. It handles its own
rcu_head and goes through a rcu_trace gp and then free it.
bpf_mem_cache_raw_free() is added for direct free purpose
without leaking the LLIST_NODE_SZ internal knowledge.
During free time, the 'struct bpf_mem_alloc *ma' is no longer
available. However, the caller should know if it is
percpu memory or not and it can call different raw_free functions.
bpf_local_storage does not support percpu value, so only
the non-percpu 'bpf_mem_cache_raw_free()' is added in
this patch.
Merge branch 'First set of verifier/*.c migrated to inline assembly'
Eduard Zingerman says:
====================
This is a follow up for RFC [1]. It migrates a first batch of 38
verifier/*.c tests to inline assembly and use of ./test_progs for
actual execution. The migration is done by a python script (see [2]).
Each migrated verifier/xxx.c file is mapped to progs/verifier_xxx.c
plus an entry in the prog_tests/verifier.c. One patch per each file.
A few patches at the beginning of the patch-set extend test_loader
with necessary functionality, mainly:
- support for tests execution in unprivileged mode;
- support for test runs for test programs.
Migrated tests could be selected for execution using the following filter:
Changes compared to RFC:
- test_loader.c is extended to support test program runs;
- capabilities handling now matches behavior of test_verifier;
- BPF_ST_MEM instructions are automatically replaced by BPF_STX_MEM
instructions to overcome current clang limitations;
- tests styling updates according to RFC feedback;
- 38 migrated files are included instead of 1.
I used the following means for testing:
- migration tool itself has a set of self-tests;
- migrated tests are passing;
- manually compared each old/new file side-by-side.
While doing side-by-side comparison I've noted a few defects in the
original tests:
- and.c:
- One of the jump targets is off by one;
- BPF_ST_MEM wrong OFF/IMM ordering;
- array_access.c:
- BPF_ST_MEM wrong OFF/IMM ordering;
- value_or_null.c:
- BPF_ST_MEM wrong OFF/IMM ordering.