Masami Hiramatsu [Tue, 15 Mar 2022 14:00:50 +0000 (23:00 +0900)]
rethook: Add a generic return hook
Add a return hook framework which hooks the function return. Most of the
logic came from the kretprobe, but this is independent from kretprobe.
Note that this is expected to be used with other function entry hooking
feature, like ftrace, fprobe, adn kprobes. Eventually this will replace
the kretprobe (e.g. kprobe + rethook = kretprobe), but at this moment,
this is just an additional hook.
Masami Hiramatsu [Tue, 15 Mar 2022 14:00:38 +0000 (23:00 +0900)]
fprobe: Add ftrace based probe APIs
The fprobe is a wrapper API for ftrace function tracer.
Unlike kprobes, this probes only supports the function entry, but this
can probe multiple functions by one fprobe. The usage is similar, user
will set their callback to fprobe::entry_handler and call
register_fprobe*() with probed functions.
There are 3 registration interfaces,
- register_fprobe() takes filtering patterns of the functin names.
- register_fprobe_ips() takes an array of ftrace-location addresses.
- register_fprobe_syms() takes an array of function names.
The registered fprobes can be unregistered with unregister_fprobe().
e.g.
Jiri Olsa [Tue, 15 Mar 2022 14:00:26 +0000 (23:00 +0900)]
ftrace: Add ftrace_set_filter_ips function
Adding ftrace_set_filter_ips function to be able to set filter on
multiple ip addresses at once.
With the kprobe multi attach interface we have cases where we need to
initialize ftrace_ops object with thousands of functions, so having
single function diving into ftrace_hash_move_and_update_ops with
ftrace_lock is faster.
The functions ips are passed as unsigned long array with count.
Kaixi Fan [Sun, 13 Mar 2022 16:41:16 +0000 (00:41 +0800)]
selftests/bpf: Fix tunnel remote IP comments
In namespace at_ns0, the IP address of tnl dev is 10.1.1.100 which is the
overlay IP, and the ip address of veth0 is 172.16.1.100 which is the vtep
IP. When doing 'ping 10.1.1.100' from root namespace, the remote_ip should
be 172.16.1.100.
Fixes: 933a741e3b82 ("selftests/bpf: bpf tunnel test.") Signed-off-by: Kaixi Fan <fankaixi.li@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220313164116.5889-1-fankaixi.li@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Lorenzo Bianconi [Fri, 11 Mar 2022 09:14:19 +0000 (10:14 +0100)]
veth: Rework veth_xdp_rcv_skb in order to accept non-linear skb
Introduce veth_convert_skb_to_xdp_buff routine in order to
convert a non-linear skb into a xdp buffer. If the received skb
is cloned or shared, veth_convert_skb_to_xdp_buff will copy it
in a new skb composed by order-0 pages for the linear and the
fragmented area. Moreover veth_convert_skb_to_xdp_buff guarantees
we have enough headroom for xdp.
This is a preliminary patch to allow attaching xdp programs with frags
support on veth devices.
Lorenzo Bianconi [Fri, 11 Mar 2022 09:14:18 +0000 (10:14 +0100)]
net: veth: Account total xdp_frame len running ndo_xdp_xmit
Even if this is a theoretical issue since it is not possible to perform
XDP_REDIRECT on a non-linear xdp_frame, veth driver does not account
paged area in ndo_xdp_xmit function pointer.
Introduce xdp_get_frame_len utility routine to get the xdp_frame full
length and account total frame size running XDP_REDIRECT of a
non-linear xdp frame into a veth device.
Hou Tao [Wed, 9 Mar 2022 12:33:21 +0000 (20:33 +0800)]
selftests/bpf: Test subprog jit when toggle bpf_jit_harden repeatedly
When bpf_jit_harden is toggled between 0 and 2, subprog jit may fail
due to inconsistent twice read values of bpf_jit_harden during jit. So
add a test to ensure the problem is fixed.
Hou Tao [Wed, 9 Mar 2022 12:33:20 +0000 (20:33 +0800)]
bpf: Fix net.core.bpf_jit_harden race
It is the bpf_jit_harden counterpart to commit 60b58afc96c9 ("bpf: fix
net.core.bpf_jit_enable race"). bpf_jit_harden will be tested twice
for each subprog if there are subprogs in bpf program and constant
blinding may increase the length of program, so when running
"./test_progs -t subprogs" and toggling bpf_jit_harden between 0 and 2,
jit_subprogs may fail because constant blinding increases the length
of subprog instructions during extra passs.
So cache the value of bpf_jit_blinding_enabled() during program
allocation, and use the cached value during constant blinding, subprog
JITing and args tracking of tail call.
Hou Tao [Wed, 9 Mar 2022 12:33:18 +0000 (20:33 +0800)]
bpf, x86: Fall back to interpreter mode when extra pass fails
Extra pass for subprog jit may fail (e.g. due to bpf_jit_harden race),
but bpf_func is not cleared for the subprog and jit_subprogs will
succeed. The running of the bpf program may lead to oops because the
memory for the jited subprog image has already been freed.
So fall back to interpreter mode by clearing bpf_func/jited/jited_len
when extra pass fails.
Merge branch 'Remove libcap dependency from bpf selftests'
Martin KaFai Lau says:
====================
After upgrading to the newer libcap (>= 2.60),
the libcap commit aca076443591 ("Make cap_t operations thread safe.")
added a "__u8 mutex;" to the "struct _cap_struct". It caused a few byte
shift that breaks the assumption made in the "struct libcap" definition
in test_verifier.c.
This set is to remove the libcap dependency from the bpf selftests.
v2:
- Define CAP_PERFMON and CAP_BPF when the older <linux/capability.h>
does not have them. (Andrii)
====================
Martin KaFai Lau [Wed, 16 Mar 2022 17:38:35 +0000 (10:38 -0700)]
bpf: selftests: Remove libcap usage from test_progs
This patch removes the libcap usage from test_progs.
bind_perm.c is the only user. cap_*_effective() helpers added in the
earlier patch are directly used instead.
No other selftest binary is using libcap, so '-lcap' is also removed
from the Makefile.
Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Stanislav Fomichev <sdf@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220316173835.2039334-1-kafai@fb.com
Martin KaFai Lau [Wed, 16 Mar 2022 17:38:23 +0000 (10:38 -0700)]
bpf: selftests: Add helpers to directly use the capget and capset syscall
After upgrading to the newer libcap (>= 2.60),
the libcap commit aca076443591 ("Make cap_t operations thread safe.")
added a "__u8 mutex;" to the "struct _cap_struct". It caused a few byte
shift that breaks the assumption made in the "struct libcap" definition
in test_verifier.c.
The bpf selftest usage only needs to enable and disable the effective
caps of the running task. It is easier to directly syscall the
capget and capset instead. It can also remove the libcap
library dependency.
The cap_helpers.{c,h} is added. One __u64 is used for all CAP_*
bits instead of two __u32.
Dmitrii Dolgov [Wed, 9 Mar 2022 16:31:12 +0000 (17:31 +0100)]
bpftool: Add bpf_cookie to link output
Commit 82e6b1eee6a8 ("bpf: Allow to specify user-provided bpf_cookie for
BPF perf links") introduced the concept of user specified bpf_cookie,
which could be accessed by BPF programs using bpf_get_attach_cookie().
For troubleshooting purposes it is convenient to expose bpf_cookie via
bpftool as well, so there is no need to meddle with the target BPF
program itself.
Implemented using the pid iterator BPF program to actually fetch
bpf_cookies, which allows constraining code changes only to bpftool.
$ bpftool link
1: type 7 prog 5
bpf_cookie 123
pids bootstrap(81)
Guo Zhengkui [Tue, 15 Mar 2022 13:01:26 +0000 (21:01 +0800)]
selftests/bpf: Clean up array_size.cocci warnings
Clean up the array_size.cocci warnings under tools/testing/selftests/bpf/:
Use `ARRAY_SIZE(arr)` instead of forms like `sizeof(arr)/sizeof(arr[0])`.
tools/testing/selftests/bpf/test_cgroup_storage.c uses ARRAY_SIZE() defined
in tools/include/linux/kernel.h (sys/sysinfo.h -> linux/kernel.h), while
others use ARRAY_SIZE() in bpf_util.h.
Niklas Söderlund [Tue, 15 Mar 2022 10:29:48 +0000 (11:29 +0100)]
samples/bpf, xdpsock: Fix race when running for fix duration of time
When running xdpsock for a fix duration of time before terminating
using --duration=<n>, there is a race condition that may cause xdpsock
to terminate immediately.
When running for a fixed duration of time the check to determine when to
terminate execution is in is_benchmark_done() and is being executed in
the context of the poller thread,
if (opt_duration > 0) {
unsigned long dt = (get_nsecs() - start_time);
if (dt >= opt_duration)
benchmark_done = true;
}
However start_time is only set after the poller thread have been
created. This leaves a small window when the poller thread is starting
and calls is_benchmark_done() for the first time that start_time is not
yet set. In that case start_time have its initial value of 0 and the
duration check fails as it do not correlate correctly for the
applications start time and immediately sets benchmark_done which in
turn terminates the xdpsock application.
Fix this by setting start_time before creating the poller thread.
Fixes: d3f11b018f6c ("samples/bpf: xdpsock: Add duration option to specify how long to run") Signed-off-by: Niklas Söderlund <niklas.soderlund@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220315102948.466436-1-niklas.soderlund@corigine.com
The mem of msg has been uncharged in tcp_bpf_send_verdict() by
sk_msg_return(), and would be uncharged by sk_msg_free() again. When psock
is null, we can simply returning an error code, this would then trigger
the sk_msg_free_nocharge in the error path of __SK_REDIRECT and would have
the side effect of throwing an error up to user space. This would be a
slight change in behavior from user side but would look the same as an
error if the redirect on the socket threw an error.
This issue can cause the following info:
WARNING: CPU: 0 PID: 2136 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260
Call Trace:
<TASK>
__sk_destruct+0x24/0x1f0
sk_psock_destroy+0x19b/0x1c0
process_one_work+0x1b3/0x3c0
worker_thread+0x30/0x350
? process_one_work+0x3c0/0x3c0
kthread+0xe6/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
</TASK>
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220304081145.2037182-5-wangyufen@huawei.com
Wang Yufen [Fri, 4 Mar 2022 08:11:43 +0000 (16:11 +0800)]
bpf, sockmap: Fix memleak in tcp_bpf_sendmsg while sk msg is full
If tcp_bpf_sendmsg() is running while sk msg is full. When sk_msg_alloc()
returns -ENOMEM error, tcp_bpf_sendmsg() goes to wait_for_memory. If partial
memory has been alloced by sk_msg_alloc(), that is, msg_tx->sg.size is
greater than osize after sk_msg_alloc(), memleak occurs. To fix we use
sk_msg_trim() to release the allocated memory, then goto wait for memory.
Other call paths of sk_msg_alloc() have the similar issue, such as
tls_sw_sendmsg(), so handle sk_msg_trim logic inside sk_msg_alloc(),
as Cong Wang suggested.
This issue can cause the following info:
WARNING: CPU: 3 PID: 7950 at net/core/stream.c:208 sk_stream_kill_queues+0xd4/0x1a0
Call Trace:
<TASK>
inet_csk_destroy_sock+0x55/0x110
__tcp_close+0x279/0x470
tcp_close+0x1f/0x60
inet_release+0x3f/0x80
__sock_release+0x3d/0xb0
sock_close+0x11/0x20
__fput+0x92/0x250
task_work_run+0x6a/0xa0
do_exit+0x33b/0xb60
do_group_exit+0x2f/0xa0
get_signal+0xb6/0x950
arch_do_signal_or_restart+0xac/0x2a0
exit_to_user_mode_prepare+0xa9/0x200
syscall_exit_to_user_mode+0x12/0x30
do_syscall_64+0x46/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
While drop_sk_msg(), the msg has charged memory form sk by sk_mem_charge
and has sg pages need to put. To fix we use sk_msg_free() and then kfee()
msg.
This issue can cause the following info:
WARNING: CPU: 0 PID: 9202 at net/core/stream.c:205 sk_stream_kill_queues+0xc8/0xe0
Call Trace:
<IRQ>
inet_csk_destroy_sock+0x55/0x110
tcp_rcv_state_process+0xe5f/0xe90
? sk_filter_trim_cap+0x10d/0x230
? tcp_v4_do_rcv+0x161/0x250
tcp_v4_do_rcv+0x161/0x250
tcp_v4_rcv+0xc3a/0xce0
ip_protocol_deliver_rcu+0x3d/0x230
ip_local_deliver_finish+0x54/0x60
ip_local_deliver+0xfd/0x110
? ip_protocol_deliver_rcu+0x230/0x230
ip_rcv+0xd6/0x100
? ip_local_deliver+0x110/0x110
__netif_receive_skb_one_core+0x85/0xa0
process_backlog+0xa4/0x160
__napi_poll+0x29/0x1b0
net_rx_action+0x287/0x300
__do_softirq+0xff/0x2fc
do_softirq+0x79/0x90
</IRQ>
Yonghong Song [Fri, 11 Mar 2022 00:37:21 +0000 (16:37 -0800)]
selftests/bpf: Fix a clang compilation error for send_signal.c
Building selftests/bpf with latest clang compiler (clang15 built
from source), I hit the following compilation error:
/.../prog_tests/send_signal.c:43:16: error: variable 'j' set but not used [-Werror,-Wunused-but-set-variable]
volatile int j = 0;
^
1 error generated.
The problem also exists with clang13 and clang14. clang12 is okay.
In send_signal.c, we have the following code ...
volatile int j = 0;
[...]
for (int i = 0; i < 100000000 && !sigusr1_received; i++)
j /= i + 1;
... to burn CPU cycles so bpf_send_signal() helper can be tested
in NMI mode.
Slightly changing 'j /= i + 1' to 'j /= i + j + 1' or 'j++' can
fix the problem. Further investigation indicated this should be
a clang bug ([1]). The upstream fix will be proposed later. But it
is a good idea to workaround the issue to unblock people who build
kernel/selftests with clang.
selftests/bpf: Add a test for maximum packet size in xdp_do_redirect
This adds an extra test to the xdp_do_redirect selftest for XDP live packet
mode, which verifies that the maximum permissible packet size is accepted
without any errors, and that a too big packet is correctly rejected.
bpf, test_run: Fix packet size check for live packet mode
The live packet mode uses some extra space at the start of each page to
cache data structures so they don't have to be rebuilt at every repetition.
This space wasn't correctly accounted for in the size checking of the
arguments supplied to userspace. In addition, the definition of the frame
size should include the size of the skb_shared_info (as there is other
logic that subtracts the size of this).
Together, these mistakes resulted in userspace being able to trip the
XDP_WARN() in xdp_update_frame_from_buff(), which syzbot discovered in
short order. Fix this by changing the frame size define and adding the
extra headroom to the bpf_prog_test_run_xdp() function. Also drop the
max_len parameter to the page_pool init, since this is related to DMA which
is not used for the page pool instance in PROG_TEST_RUN.
Fixes: b530e9e1063e ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN") Reported-by: syzbot+0e91362d99386dc5de99@syzkaller.appspotmail.com Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220310225621.53374-1-toke@redhat.com
Hao Luo [Thu, 10 Mar 2022 21:16:55 +0000 (13:16 -0800)]
compiler_types: Refactor the use of btf_type_tag attribute.
Previous patches have introduced the compiler attribute btf_type_tag for
__user and __percpu. The availability of this attribute depends on
some CONFIGs and compiler support. This patch refactors the use
of btf_type_tag by introducing BTF_TYPE_TAG, which hides all the
dependencies.
Merge branch 'bpf-lsm: Extend interoperability with IMA'
Roberto Sassu says:
====================
Extend the interoperability with IMA, to give wider flexibility for the
implementation of integrity-focused LSMs based on eBPF.
Patch 1 fixes some style issues.
Patches 2-6 give the ability to eBPF-based LSMs to take advantage of the
measurement capability of IMA without needing to setup a policy in IMA
(those LSMs might implement the policy capability themselves).
Patches 7-9 allow eBPF-based LSMs to evaluate files read by the kernel.
Changelog
v2:
- Add better description to patch 1 (suggested by Shuah)
- Recalculate digest if it is not fresh (when IMA_COLLECTED flag not set)
- Move declaration of bpf_ima_file_hash() at the end (suggested by
Yonghong)
- Add tests to check if the digest has been recalculated
- Add deny test for bpf_kernel_read_file()
- Add description to tests
v1:
- Modify ima_file_hash() only and allow the usage of the function with the
modified behavior by eBPF-based LSMs through the new function
bpf_ima_file_hash() (suggested by Mimi)
- Make bpf_lsm_kernel_read_file() sleepable so that bpf_ima_inode_hash()
and bpf_ima_file_hash() can be called inside the implementation of
eBPF-based LSMs for this hook
====================
Roberto Sassu [Wed, 2 Mar 2022 11:14:03 +0000 (12:14 +0100)]
selftests/bpf: Add test for bpf_lsm_kernel_read_file()
Test the ability of bpf_lsm_kernel_read_file() to call the sleepable
functions bpf_ima_inode_hash() or bpf_ima_file_hash() to obtain a
measurement of a loaded IMA policy.
Roberto Sassu [Wed, 2 Mar 2022 11:14:02 +0000 (12:14 +0100)]
bpf-lsm: Make bpf_lsm_kernel_read_file() as sleepable
Make bpf_lsm_kernel_read_file() as sleepable, so that bpf_ima_inode_hash()
or bpf_ima_file_hash() can be called inside the implementation of this
hook.
Roberto Sassu [Wed, 2 Mar 2022 11:14:01 +0000 (12:14 +0100)]
selftests/bpf: Check if the digest is refreshed after a file write
Verify that bpf_ima_inode_hash() returns a non-fresh digest after a file
write, and that bpf_ima_file_hash() returns a fresh digest. Verification is
done by requesting the digest from the bprm_creds_for_exec hook, called
before ima_bprm_check().
Roberto Sassu [Wed, 2 Mar 2022 11:13:58 +0000 (12:13 +0100)]
bpf-lsm: Introduce new helper bpf_ima_file_hash()
ima_file_hash() has been modified to calculate the measurement of a file on
demand, if it has not been already performed by IMA or the measurement is
not fresh. For compatibility reasons, ima_inode_hash() remains unchanged.
Keep the same approach in eBPF and introduce the new helper
bpf_ima_file_hash() to take advantage of the modified behavior of
ima_file_hash().
Roberto Sassu [Wed, 2 Mar 2022 11:13:57 +0000 (12:13 +0100)]
ima: Always return a file measurement in ima_file_hash()
__ima_inode_hash() checks if a digest has been already calculated by
looking for the integrity_iint_cache structure associated to the passed
inode.
Users of ima_file_hash() (e.g. eBPF) might be interested in obtaining the
information without having to setup an IMA policy so that the digest is
always available at the time they call this function.
In addition, they likely expect the digest to be fresh, e.g. recalculated
by IMA after a file write. Although getting the digest from the
bprm_committed_creds hook (as in the eBPF test) ensures that the digest is
fresh, as the IMA hook is executed before that hook, this is not always the
case (e.g. for the mmap_file hook).
Call ima_collect_measurement() in __ima_inode_hash(), if the file
descriptor is available (passed by ima_file_hash()) and the digest is not
available/not fresh, and store the file measurement in a temporary
integrity_iint_cache structure.
This change does not cause memory usage increase, due to using the
temporary integrity_iint_cache structure, and due to freeing the
ima_digest_data structure inside integrity_iint_cache before exiting from
__ima_inode_hash().
For compatibility reasons, the behavior of ima_inode_hash() remains
unchanged.
Roberto Sassu [Wed, 2 Mar 2022 11:13:56 +0000 (12:13 +0100)]
ima: Fix documentation-related warnings in ima_main.c
Fix the following warnings in ima_main.c, displayed with W=n make argument:
security/integrity/ima/ima_main.c:432: warning: Function parameter or
member 'vma' not described in 'ima_file_mprotect'
security/integrity/ima/ima_main.c:636: warning: Function parameter or
member 'inode' not described in 'ima_post_create_tmpfile'
security/integrity/ima/ima_main.c:636: warning: Excess function parameter
'file' description in 'ima_post_create_tmpfile'
security/integrity/ima/ima_main.c:843: warning: Function parameter or
member 'load_id' not described in 'ima_post_load_data'
security/integrity/ima/ima_main.c:843: warning: Excess function parameter
'id' description in 'ima_post_load_data'
Also, fix some style issues in the description of ima_post_create_tmpfile()
and ima_post_path_mknod().
Chris J Arges [Wed, 9 Mar 2022 21:41:58 +0000 (15:41 -0600)]
bpftool: Ensure bytes_memlock json output is correct
If a BPF map is created over 2^32 the memlock value as displayed in JSON
format will be incorrect. Use atoll instead of atoi so that the correct
number is displayed.
```
$ bpftool map create /sys/fs/bpf/test_bpfmap type hash key 4 \
value 1024 entries 4194304 name test_bpfmap
$ bpftool map list
1: hash name test_bpfmap flags 0x0
key 4B value 1024B max_entries 4194304 memlock 4328521728B
$ sudo bpftool map list -j | jq .[].bytes_memlock 33554432
```
Daniel Borkmann [Thu, 10 Mar 2022 21:57:06 +0000 (22:57 +0100)]
Merge branch 'bpf-tstamp-follow-ups'
Martin KaFai Lau says:
====================
This set is a follow up on the bpf side based on discussion [0].
Patch 1 is to remove some skbuff macros that are used in bpf filter.c.
Patch 2 and 3 are to simplify the bpf insn rewrite on __sk_buff->tstamp.
Patch 4 is to simplify the bpf uapi by modeling the __sk_buff->tstamp
and __sk_buff->tstamp_type (was delivery_time_type) the same as its kernel
counter part skb->tstamp and skb->mono_delivery_time.
Patch 5 is to adjust the bpf selftests due to changes in patch 4.
bpf: selftests: Update tests after s/delivery_time/tstamp/ change in bpf.h
The previous patch made the follow changes:
- s/delivery_time_type/tstamp_type/
- s/bpf_skb_set_delivery_time/bpf_skb_set_tstamp/
- BPF_SKB_DELIVERY_TIME_* to BPF_SKB_TSTAMP_*
This patch is to change the test_tc_dtime.c to reflect the above.
bpf: Remove BPF_SKB_DELIVERY_TIME_NONE and rename s/delivery_time_/tstamp_/
This patch is to simplify the uapi bpf.h regarding to the tstamp type
and use a similar way as the kernel to describe the value stored
in __sk_buff->tstamp.
My earlier thought was to avoid describing the semantic and
clock base for the rcv timestamp until there is more clarity
on the use case, so the __sk_buff->delivery_time_type naming instead
of __sk_buff->tstamp_type.
With some thoughts, it can reuse the UNSPEC naming. This patch first
removes BPF_SKB_DELIVERY_TIME_NONE and also
rename BPF_SKB_DELIVERY_TIME_UNSPEC to BPF_SKB_TSTAMP_UNSPEC
and BPF_SKB_DELIVERY_TIME_MONO to BPF_SKB_TSTAMP_DELIVERY_MONO.
The semantic of BPF_SKB_TSTAMP_DELIVERY_MONO is the same:
__sk_buff->tstamp has delivery time in mono clock base.
BPF_SKB_TSTAMP_UNSPEC means __sk_buff->tstamp has the (rcv)
tstamp at ingress and the delivery time at egress. At egress,
the clock base could be found from skb->sk->sk_clockid.
__sk_buff->tstamp == 0 naturally means NONE, so NONE is not needed.
With BPF_SKB_TSTAMP_UNSPEC for the rcv tstamp at ingress,
the __sk_buff->delivery_time_type is also renamed to __sk_buff->tstamp_type
which was also suggested in the earlier discussion:
https://lore.kernel.org/bpf/b181acbe-caf8-502d-4b7b-7d96b9fc5d55@iogearbox.net/
The above will then make __sk_buff->tstamp and __sk_buff->tstamp_type
the same as its kernel skb->tstamp and skb->mono_delivery_time
counter part.
The internal kernel function bpf_skb_convert_dtime_type_read() is then
renamed to bpf_skb_convert_tstamp_type_read() and it can be simplified
with the BPF_SKB_DELIVERY_TIME_NONE gone. A BPF_ALU32_IMM(BPF_AND)
insn is also saved by using BPF_JMP32_IMM(BPF_JSET).
The bpf helper bpf_skb_set_delivery_time() is also renamed to
bpf_skb_set_tstamp(). The arg name is changed from dtime
to tstamp also. It only allows setting tstamp 0 for
BPF_SKB_TSTAMP_UNSPEC and it could be relaxed later
if there is use case to change mono delivery time to
non mono.
prog->delivery_time_access is also renamed to prog->tstamp_type_access.
bpf: Simplify insn rewrite on BPF_READ __sk_buff->tstamp
The skb->tc_at_ingress and skb->mono_delivery_time are at the same
byte offset. Thus, only one BPF_LDX_MEM(BPF_B) is needed
and both bits can be tested together.
/* BPF_READ: a = __sk_buff->tstamp */
if (skb->tc_at_ingress && skb->mono_delivery_time)
a = 0;
else
a = skb->tstamp;
bpf: net: Remove TC_AT_INGRESS_OFFSET and SKB_MONO_DELIVERY_TIME_OFFSET macro
This patch removes the TC_AT_INGRESS_OFFSET and
SKB_MONO_DELIVERY_TIME_OFFSET macros. Instead, PKT_VLAN_PRESENT_OFFSET
is used because all of them are at the same offset. Comment is added to
make it clear that changing the position of tc_at_ingress or
mono_delivery_time will require to adjust the defined macros.
The earlier discussion can be found here:
https://lore.kernel.org/bpf/419d994e-ff61-7c11-0ec7-11fefcb0186e@iogearbox.net/
The kernel test robot pointed out that the newly added
bpf_test_run_xdp_live() runner doesn't set the retval in the caller (by
design), which means that the variable can be passed unitialised to
bpf_test_finish(). Fix this by initialising the variable properly.
Fixes: b530e9e1063e ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220310110228.161869-1-toke@redhat.com
Niklas Söderlund [Thu, 10 Mar 2022 12:18:46 +0000 (13:18 +0100)]
bpftool: Restore support for BPF offload-enabled feature probing
Commit 1a56c18e6c2e4e74 ("bpftool: Stop supporting BPF offload-enabled
feature probing") removed the support to probe for BPF offload features.
This is still something that is useful for NFP NIC that can support
offloading of BPF programs.
The reason for the dropped support was that libbpf starting with v1.0
would drop support for passing the ifindex to the BPF prog/map/helper
feature probing APIs. In order to keep this useful feature for NFP
restore the functionality by moving it directly into bpftool.
The code restored is a simplified version of the code that existed in
libbpf which supposed passing the ifindex. The simplification is that it
only targets the cases where ifindex is given and call into libbpf for
the cases where it's not.
Before restoring support for probing offload features:
# bpftool feature probe dev ens4np0
Scanning system call availability...
bpf() syscall is available
Scanning eBPF program types...
Scanning eBPF map types...
Scanning eBPF helper functions...
eBPF helpers supported for program type sched_cls:
eBPF helpers supported for program type xdp:
Scanning miscellaneous eBPF features...
Large program size limit is NOT available
Bounded loop support is NOT available
ISA extension v2 is NOT available
ISA extension v3 is NOT available
With support for probing offload features restored:
# bpftool feature probe dev ens4np0
Scanning system call availability...
bpf() syscall is available
Scanning eBPF program types...
eBPF program_type sched_cls is available
eBPF program_type xdp is available
Scanning eBPF map types...
eBPF map_type hash is available
eBPF map_type array is available
Scanning eBPF helper functions...
eBPF helpers supported for program type sched_cls:
- bpf_map_lookup_elem
- bpf_get_prandom_u32
- bpf_perf_event_output
eBPF helpers supported for program type xdp:
- bpf_map_lookup_elem
- bpf_get_prandom_u32
- bpf_perf_event_output
- bpf_xdp_adjust_head
- bpf_xdp_adjust_tail
Scanning miscellaneous eBPF features...
Large program size limit is NOT available
Bounded loop support is NOT available
ISA extension v2 is NOT available
ISA extension v3 is NOT available
Merge branch 'Add support for transmitting packets using XDP in bpf_prog_run()'
Toke Høiland-Jørgensen says:
====================
This series adds support for transmitting packets using XDP in
bpf_prog_run(), by enabling a new mode "live packet" mode which will handle
the XDP program return codes and redirect the packets to the stack or other
devices.
The primary use case for this is testing the redirect map types and the
ndo_xdp_xmit driver operation without an external traffic generator. But it
turns out to also be useful for creating a programmable traffic generator
in XDP, as well as injecting frames into the stack. A sample traffic
generator, which was included in previous versions of the series, but now
moved to xdp-tools, transmits up to 9 Mpps/core on my test machine.
To transmit the frames, the new mode instantiates a page_pool structure in
bpf_prog_run() and initialises the pages to contain XDP frames with the
data passed in by userspace. These frames can then be handled as though
they came from the hardware XDP path, and the existing page_pool code takes
care of returning and recycling them. The setup is optimised for high
performance with a high number of repetitions to support stress testing and
the traffic generator use case; see patch 1 for details.
v11:
- Fix override of return code in xdp_test_run_batch()
- Add Martin's ACKs to remaining patches
v10:
- Only propagate memory allocation errors from xdp_test_run_batch()
- Get rid of BPF_F_TEST_XDP_RESERVED; batch_size can be used to probe
- Check that batch_size is unset in non-XDP test_run funcs
- Lower the number of repetitions in the selftest to 10k
- Count number of recycled pages in the selftest
- Fix a few other nits from Martin, carry forward ACKs
v9:
- XDP_DROP packets in the selftest to ensure pages are recycled
- Fix a few issues reported by the kernel test robot
- Rewrite the documentation of the batch size to make it a bit clearer
- Rebase to newest bpf-next
v8:
- Make the batch size configurable from userspace
- Don't interrupt the packet loop on errors in do_redirect (this can be
caught from the tracepoint)
- Add documentation of the feature
- Add reserved flag userspace can use to probe for support (kernel didn't
check flags previously)
- Rebase to newest bpf-next, disallow live mode for jumbo frames
v7:
- Extend the local_bh_disable() to cover the full test run loop, to prevent
running concurrently with the softirq. Fixes a deadlock with veth xmit.
- Reinstate the forwarding sysctl setting in the selftest, and bump up the
number of packets being transmitted to trigger the above bug.
- Update commit message to make it clear that user space can select the
ingress interface.
v6:
- Fix meta vs data pointer setting and add a selftest for it
- Add local_bh_disable() around code passing packets up the stack
- Create a new netns for the selftest and use a TC program instead of the
forwarding hack to count packets being XDP_PASS'ed from the test prog.
- Check for the correct ingress ifindex in the selftest
- Rebase and drop patches 1-5 that were already merged
v5:
- Rebase to current bpf-next
v4:
- Fix a few code style issues (Alexei)
- Also handle the other return codes: XDP_PASS builds skbs and injects them
into the stack, and XDP_TX is turned into a redirect out the same
interface (Alexei).
- Drop the last patch adding an xdp_trafficgen program to samples/bpf; this
will live in xdp-tools instead (Alexei).
- Add a separate bpf_test_run_xdp_live() function to test_run.c instead of
entangling the new mode in the existing bpf_test_run().
v3:
- Reorder patches to make sure they all build individually (Patchwork)
- Remove a couple of unused variables (Patchwork)
- Remove unlikely() annotation in slow path and add back John's ACK that I
accidentally dropped for v2 (John)
v2:
- Split up up __xdp_do_redirect to avoid passing two pointers to it (John)
- Always reset context pointers before each test run (John)
- Use get_mac_addr() from xdp_sample_user.h instead of rolling our own (Kumar)
- Fix wrong offset for metadata pointer
====================
selftests/bpf: Add selftest for XDP_REDIRECT in BPF_PROG_RUN
This adds a selftest for the XDP_REDIRECT facility in BPF_PROG_RUN, that
redirects packets into a veth and counts them using an XDP program on the
other side of the veth pair and a TC program on the local side of the veth.
Documentation/bpf: Add documentation for BPF_PROG_RUN
This adds documentation for the BPF_PROG_RUN command; a short overview of
the command itself, and a more verbose description of the "live packet"
mode for XDP introduced in the previous commit.
bpf: Add "live packet" mode for XDP in BPF_PROG_RUN
This adds support for running XDP programs through BPF_PROG_RUN in a mode
that enables live packet processing of the resulting frames. Previous uses
of BPF_PROG_RUN for XDP returned the XDP program return code and the
modified packet data to userspace, which is useful for unit testing of XDP
programs.
The existing BPF_PROG_RUN for XDP allows userspace to set the ingress
ifindex and RXQ number as part of the context object being passed to the
kernel. This patch reuses that code, but adds a new mode with different
semantics, which can be selected with the new BPF_F_TEST_XDP_LIVE_FRAMES
flag.
When running BPF_PROG_RUN in this mode, the XDP program return codes will
be honoured: returning XDP_PASS will result in the frame being injected
into the networking stack as if it came from the selected networking
interface, while returning XDP_TX and XDP_REDIRECT will result in the frame
being transmitted out that interface. XDP_TX is translated into an
XDP_REDIRECT operation to the same interface, since the real XDP_TX action
is only possible from within the network drivers themselves, not from the
process context where BPF_PROG_RUN is executed.
Internally, this new mode of operation creates a page pool instance while
setting up the test run, and feeds pages from that into the XDP program.
The setup cost of this is amortised over the number of repetitions
specified by userspace.
To support the performance testing use case, we further optimise the setup
step so that all pages in the pool are pre-initialised with the packet
data, and pre-computed context and xdp_frame objects stored at the start of
each page. This makes it possible to entirely avoid touching the page
content on each XDP program invocation, and enables sending up to 9
Mpps/core on my test box.
Because the data pages are recycled by the page pool, and the test runner
doesn't re-initialise them for each run, subsequent invocations of the XDP
program will see the packet data in the state it was after the last time it
ran on that particular page. This means that an XDP program that modifies
the packet before redirecting it has to be careful about which assumptions
it makes about the packet content, but that is only an issue for the most
naively written programs.
Enabling the new flag is only allowed when not setting ctx_out and data_out
in the test specification, since using it means frames will be redirected
somewhere else, so they can't be returned.
Andrii Nakryiko [Wed, 9 Mar 2022 01:39:29 +0000 (17:39 -0800)]
Merge branch 'BPF test_progs tests improvement'
Mykola Lysenko says:
====================
First patch reduces the sample_freq to 1000 to ensure test will
work even when kernel.perf_event_max_sample_rate was reduced to 1000.
Patches for send_signal and find_vma tune the test implementation to
make sure needed thread is scheduled. Also, both tests will finish as
soon as possible after the test condition is met.
====================
Mykola Lysenko [Tue, 8 Mar 2022 20:04:49 +0000 (12:04 -0800)]
Improve stability of find_vma BPF test
Remove unneeded spleep and increase length of dummy CPU
intensive computation to guarantee test process execution.
Also, complete aforemention computation as soon as
test success criteria is met
Mykola Lysenko [Tue, 8 Mar 2022 20:04:48 +0000 (12:04 -0800)]
Improve send_signal BPF test stability
Substitute sleep with dummy CPU intensive computation.
Finish aforemention computation as soon as signal was
delivered to the test process. Make the BPF code to
only execute when PID global variable is set
Mykola Lysenko [Tue, 8 Mar 2022 20:04:47 +0000 (12:04 -0800)]
Improve perf related BPF tests (sample_freq issue)
Linux kernel may automatically reduce kernel.perf_event_max_sample_rate
value when running tests in parallel on slow systems. Linux kernel checks
against this limit when opening perf event with freq=1 parameter set.
The lower bound is 1000. This patch reduces sample_freq value to 1000
in all BPF tests that use sample_freq to ensure they always can open
perf event.
Adrian Ratiu [Tue, 8 Mar 2022 12:14:28 +0000 (14:14 +0200)]
tools: Fix unavoidable GCC call in Clang builds
In ChromeOS and Gentoo we catch any unwanted mixed Clang/LLVM
and GCC/binutils usage via toolchain wrappers which fail builds.
This has revealed that GCC is called unconditionally in Clang
configured builds to populate GCC_TOOLCHAIN_DIR.
Allow the user to override CLANG_CROSS_FLAGS to avoid the GCC
call - in our case we set the var directly in the ebuild recipe.
In theory Clang could be able to autodetect these settings so
this logic could be removed entirely, but in practice as the
commit cebdb7374577 ("tools: Help cross-building with clang")
mentions, this does not always work, so giving distributions
more control to specify their flags & sysroot is beneficial.
Felix Maurer [Thu, 3 Mar 2022 11:15:26 +0000 (12:15 +0100)]
selftests/bpf: Make test_lwt_ip_encap more stable and faster
In test_lwt_ip_encap, the ingress IPv6 encap test failed from time to
time. The failure occured when an IPv4 ping through the IPv6 GRE
encapsulation did not receive a reply within the timeout. The IPv4 ping
and the IPv6 ping in the test used different timeouts (1 sec for IPv4
and 6 sec for IPv6), probably taking into account that IPv6 might need
longer to successfully complete. However, when IPv4 pings (with the
short timeout) are encapsulated into the IPv6 tunnel, the delays of IPv6
apply.
The actual reason for the long delays with IPv6 was that the IPv6
neighbor discovery sometimes did not complete in time. This was caused
by the outgoing interface only having a tentative link local address,
i.e., not having completed DAD for that lladdr. The ND was successfully
retried after 1 sec but that was too late for the ping timeout.
The IPv6 addresses for the test were already added with nodad. However,
for the lladdrs, DAD was still performed. We now disable DAD in the test
netns completely and just assume that the two lladdrs on each veth pair
do not collide. This removes all the delays for IPv6 traffic in the
test.
Without the delays, we can now also reduce the delay of the IPv6 ping to
1 sec. This makes the whole test complete faster because we don't need
to wait for the excessive timeout for each IPv6 ping that is supposed
to fail.
Instead of determining buf_info string in the caller of check_buffer_access(),
we can determine whether the register type is read-only through
type_is_rdonly_mem() helper inside check_buffer_access() and construct
buf_info, making the code slightly cleaner.
KP Singh [Mon, 7 Mar 2022 13:30:47 +0000 (13:30 +0000)]
bpf/docs: Update vmtest docs for static linking
Dynamic linking when compiling on the host can cause issues when the
libc version does not match the one in the VM image. Update the
docs to explain how to do this.
Before:
./vmtest.sh -- ./test_progs -t test_ima
./test_progs: /usr/lib/libc.so.6: version `GLIBC_2.33' not found (required by ./test_progs)
Guo Zhengkui [Sun, 6 Mar 2022 02:34:26 +0000 (10:34 +0800)]
libbpf: Fix array_size.cocci warning
Fix the following coccicheck warning:
tools/lib/bpf/bpf.c:114:31-32: WARNING: Use ARRAY_SIZE
tools/lib/bpf/xsk.c:484:34-35: WARNING: Use ARRAY_SIZE
tools/lib/bpf/xsk.c:485:35-36: WARNING: Use ARRAY_SIZE
It has been tested with gcc (Debian 8.3.0-6) 8.3.0 on x86_64.
Yuntao Wang [Fri, 4 Mar 2022 07:04:08 +0000 (15:04 +0800)]
bpf: Replace strncpy() with strscpy()
Using strncpy() on NUL-terminated strings is considered deprecated[1].
Moreover, if the length of 'task->comm' is less than the destination buffer
size, strncpy() will NUL-pad the destination buffer, which is a needless
performance penalty.
Replacing strncpy() with strscpy() fixes all these issues.
lic121 [Tue, 1 Mar 2022 13:26:23 +0000 (13:26 +0000)]
libbpf: Unmap rings when umem deleted
xsk_umem__create() does mmap for fill/comp rings, but xsk_umem__delete()
doesn't do the unmap. This works fine for regular cases, because
xsk_socket__delete() does unmap for the rings. But for the case that
xsk_socket__create_shared() fails, umem rings are not unmapped.
fill_save/comp_save are checked to determine if rings have already be
unmapped by xsk. If fill_save and comp_save are NULL, it means that the
rings have already been used by xsk. Then they are supposed to be
unmapped by xsk_socket__delete(). Otherwise, xsk_umem__delete() does the
unmap.
Fixes: 2f6324a3937f ("libbpf: Support shared umems between queues and devices") Signed-off-by: Cheng Li <lic121@chinatelecom.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220301132623.GA19995@vscode.7~
Merge branch 'bpf: add __percpu tagging in vmlinux BTF'
Hao Luo says:
====================
This patchset is very much similar to Yonghong's patchset on adding
__user tagging [1], where a "user" btf_type_tag was introduced to
describe __user memory pointers. Similar approach can be applied on
__percpu pointers. The __percpu attribute in kernel is used to identify
pointers that point to memory allocated in percpu region. Normally,
accessing __percpu memory requires using special functions like
per_cpu_ptr() etc. Directly accessing __percpu pointer is meaningless.
Currently vmlinux BTF does not have a way to differentiate a __percpu
pointer from a regular pointer. So BPF programs are allowed to load
__percpu memory directly, which is an incorrect behavior.
With the previous work that encodes __user information in BTF, a nice
framework has been set up to allow us to encode __percpu information in
BTF and let the verifier to reject programs that try to directly access
percpu pointer. Previously, there is a PTR_TO_PERCPU_BTF_ID reg type which
is used to represent those percpu static variables in the kernel. Pahole
is able to collect variables that are stored in ".data..percpu" section
in the kernel image and emit BTF information for those variables. The
bpf_per_cpu_ptr() and bpf_this_cpu_ptr() helper functions were added to
access these variables. Now with __percpu information, we can tag those
__percpu fields in a struct (such as cgroup->rstat_cpu) and allow the
pair of bpf percpu helpers to access them as well.
In addition to adding __percpu tagging, this patchset also fixes a
harmless bug in the previous patch that introduced __user. Patch 01/04
is for that. Patch 02/04 adds the new attribute "percpu". Patch 03/04
adds MEM_PERCPU tag for PTR_TO_BTF_ID and replaces PTR_TO_PERCPU_BTF_ID
with (BTF_ID | MEM_PERCPU). Patch 04/04 refactors the btf_tag test a bit
and adds tests for percpu tag.
Like [1], the minimal requirements for btf_type_tag is
clang (>= clang14) and pahole (>= 1.23).
Hao Luo [Fri, 4 Mar 2022 19:16:57 +0000 (11:16 -0800)]
selftests/bpf: Add a test for btf_type_tag "percpu"
Add test for percpu btf_type_tag. Similar to the "user" tag, we test
the following cases:
1. __percpu struct field.
2. __percpu as function parameter.
3. per_cpu_ptr() accepts dynamically allocated __percpu memory.
Because the test for "user" and the test for "percpu" are very similar,
a little bit of refactoring has been done in btf_tag.c. Basically, both
tests share the same function for loading vmlinux and module btf.
Hao Luo [Fri, 4 Mar 2022 19:16:56 +0000 (11:16 -0800)]
bpf: Reject programs that try to load __percpu memory.
With the introduction of the btf_type_tag "percpu", we can add a
MEM_PERCPU to identify those pointers that point to percpu memory.
The ability of differetiating percpu pointers from regular memory
pointers have two benefits:
1. It forbids unexpected use of percpu pointers, such as direct loads.
In kernel, there are special functions used for accessing percpu
memory. Directly loading percpu memory is meaningless. We already
have BPF helpers like bpf_per_cpu_ptr() and bpf_this_cpu_ptr() that
wrap the kernel percpu functions. So we can now convert percpu
pointers into regular pointers in a safe way.
2. Previously, bpf_per_cpu_ptr() and bpf_this_cpu_ptr() only work on
PTR_TO_PERCPU_BTF_ID, a special reg_type which describes static
percpu variables in kernel (we rely on pahole to encode them into
vmlinux BTF). Now, since we can identify __percpu tagged pointers,
we can also identify dynamically allocated percpu memory as well.
It means we can use bpf_xxx_cpu_ptr() on dynamic percpu memory.
This would be very convenient when accessing fields like
"cgroup->rstat_cpu".
Hao Luo [Fri, 4 Mar 2022 19:16:55 +0000 (11:16 -0800)]
compiler_types: Define __percpu as __attribute__((btf_type_tag("percpu")))
This is similar to commit 7472d5a642c9 ("compiler_types: define __user as
__attribute__((btf_type_tag("user")))"), where a type tag "user" was
introduced to identify the pointers that point to user memory. With that
change, the newest compile toolchain can encode __user information into
vmlinux BTF, which can be used by the BPF verifier to enforce safe
program behaviors.
Similarly, we have __percpu attribute, which is mainly used to indicate
memory is allocated in percpu region. The __percpu pointers in kernel
are supposed to be used together with functions like per_cpu_ptr() and
this_cpu_ptr(), which perform necessary calculation on the pointer's
base address. Without the btf_type_tag introduced in this patch,
__percpu pointers will be treated as regular memory pointers in vmlinux
BTF and BPF programs are allowed to directly dereference them, generating
incorrect behaviors. Now with "percpu" btf_type_tag, the BPF verifier is
able to differentiate __percpu pointers from regular pointers and forbids
unexpected behaviors like direct load.
The following is an example similar to the one given in commit 7472d5a642c9:
for the function argument "int __percpu *arg", its type is described as
PTR -> TYPE_TAG(percpu) -> INT
The kernel can use this information for bpf verification or other
use cases.
Like commit 7472d5a642c9, this feature requires clang (>= clang14) and
pahole (>= 1.23).
Hao Luo [Fri, 4 Mar 2022 19:16:54 +0000 (11:16 -0800)]
bpf: Fix checking PTR_TO_BTF_ID in check_mem_access
With the introduction of MEM_USER in
commit c6f1bfe89ac9 ("bpf: reject program if a __user tagged memory accessed in kernel way")
PTR_TO_BTF_ID can be combined with a MEM_USER tag. Therefore, most
likely, when we compare reg_type against PTR_TO_BTF_ID, we want to use
the reg's base_type. Previously the check in check_mem_access() wants
to say: if the reg is BTF_ID but not NULL, the execution flow falls
into the 'then' branch. But now a reg of (BTF_ID | MEM_USER), which
should go into the 'then' branch, goes into the 'else'.
The end results before and after this patch are the same: regs tagged
with MEM_USER get rejected, but not in a way we intended. So fix the
condition, the error message now is correct.
$ ./test_progs -v -n 22/3
...
libbpf: prog 'test_user1': BPF program load failed: Permission denied
libbpf: prog 'test_user1': -- BEGIN PROG LOAD LOG --
R1 type=ctx expected=fp
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
; int BPF_PROG(test_user1, struct bpf_testmod_btf_type_tag_1 *arg)
0: (79) r1 = *(u64 *)(r1 +0)
func 'bpf_testmod_test_btf_type_tag_user_1' arg0 has btf_id 136561 type STRUCT 'bpf_testmod_btf_type_tag_1'
1: R1_w=user_ptr_bpf_testmod_btf_type_tag_1(id=0,off=0,imm=0)
; g = arg->a;
1: (61) r1 = *(u32 *)(r1 +0)
R1 invalid mem access 'user_ptr_'
Now:
libbpf: prog 'test_user1': BPF program load failed: Permission denied
libbpf: prog 'test_user1': -- BEGIN PROG LOAD LOG --
R1 type=ctx expected=fp
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
; int BPF_PROG(test_user1, struct bpf_testmod_btf_type_tag_1 *arg)
0: (79) r1 = *(u64 *)(r1 +0)
func 'bpf_testmod_test_btf_type_tag_user_1' arg0 has btf_id 104036 type STRUCT 'bpf_testmod_btf_type_tag_1'
1: R1_w=user_ptr_bpf_testmod_btf_type_tag_1(id=0,ref_obj_id=0,off=0,imm=0)
; g = arg->a;
1: (61) r1 = *(u32 *)(r1 +0)
R1 is ptr_bpf_testmod_btf_type_tag_1 access user memory: off=0
Note the error message for the reason of rejection.
Fixes: c6f1bfe89ac9 ("bpf: reject program if a __user tagged memory accessed in kernel way") Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220304191657.981240-2-haoluo@google.com
This set fixes a bug related to bad var_off being permitted for kfunc call in
case of PTR_TO_BTF_ID, consolidates offset checks for all register types allowed
as helper or kfunc arguments into a common shared helper, and introduces a
couple of other checks to harden the kfunc release logic and prevent future
bugs. Some selftests are also included that fail in absence of these fixes,
serving as demonstration of the issues being fixed.
* Update commit message for __diag patch to say clang instead of LLVM (Nathan)
* Address nits for check_func_arg_reg_off (Martin)
* Add comment for fixed_off_ok case, remove is_kfunc check (Martin)
* Put reg->off check for release kfunc inside check_func_arg_reg_off,
make the check a bit more readable
* Squash verifier selftests errstr update into patch 3 for bisect (Alexei)
* Include fix from Nathan for clang warning about missing prototypes
* Add unified __diag_ingore_all that works for both GCC/LLVM (Alexei)
Older discussion: Link: https://lore.kernel.org/bpf/20220219113744.1852259-1-memxor@gmail.com
Kumar Kartikeya Dwivedi (7):
bpf: Add check_func_arg_reg_off function
bpf: Fix PTR_TO_BTF_ID var_off check
bpf: Disallow negative offset in check_ptr_off_reg
bpf: Harden register offset checks for release helpers and kfuncs
compiler_types.h: Add unified __diag_ignore_all for GCC/LLVM
bpf: Replace __diag_ignore with unified __diag_ignore_all
selftests/bpf: Add tests for kfunc register offset checks
====================
Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
selftests/bpf: Add tests for kfunc register offset checks
Include a few verifier selftests that test against the problems being
fixed by previous commits, i.e. release kfunc always require
PTR_TO_BTF_ID fixed and var_off to be 0, and negative offset is not
permitted and returns a helpful error message.
bpf: Replace __diag_ignore with unified __diag_ignore_all
Currently, -Wmissing-prototypes warning is ignored for GCC, but not
clang. This leads to clang build warning in W=1 mode. Since the flag
used by both compilers is same, we can use the unified __diag_ignore_all
macro that works for all supported versions and compilers which have
__diag macro support (currently GCC >= 8.0, and Clang >= 11.0).
Also add nf_conntrack_bpf.h include to prevent missing prototype warning
for register_nf_conntrack_bpf.
compiler_types.h: Add unified __diag_ignore_all for GCC/LLVM
Add a __diag_ignore_all macro, to ignore warnings for both GCC and LLVM,
without having to specify the compiler type and version. By default, GCC
8 and clang 11 are used. This will be used by bpf subsystem to ignore
-Wmissing-prototypes warning for functions that are meant to be global
functions so that they are in vmlinux BTF, but don't have a prototype.
compiler-clang.h: Add __diag infrastructure for clang
Add __diag macros similar to those in compiler-gcc.h, so that warnings
that need to be adjusted for specific cases but not globally can be
ignored when building with clang.
bpf: Harden register offset checks for release helpers and kfuncs
Let's ensure that the PTR_TO_BTF_ID reg being passed in to release BPF
helpers and kfuncs always has its offset set to 0. While not a real
problem now, there's a very real possibility this will become a problem
when more and more kfuncs are exposed, and more BPF helpers are added
which can release PTR_TO_BTF_ID.
Previous commits already protected against non-zero var_off. One of the
case we are concerned about now is when we have a type that can be
returned by e.g. an acquire kfunc:
struct foo {
int a;
int b;
struct bar b;
};
... and struct bar is also a type that can be returned by another
acquire kfunc.
... would work with the current code, since the btf_struct_ids_match
takes reg->off into account for matching pointer type with release kfunc
argument type, but would obviously be incorrect, and most likely lead to
a kernel crash. A test has been included later to prevent regressions in
this area.
bpf: Disallow negative offset in check_ptr_off_reg
check_ptr_off_reg only allows fixed offset to be set for PTR_TO_BTF_ID,
where reg->off < 0 doesn't make sense. This would shift the pointer
backwards, and fails later in btf_struct_ids_match or btf_struct_walk
due to out of bounds access (since offset is interpreted as unsigned).
Improve the verifier by rejecting this case by using a better error
message for BPF helpers and kfunc, by putting a check inside the
check_func_arg_reg_off function.
Also, update existing verifier selftests to work with new error string.
When kfunc support was added, check_ctx_reg was called for PTR_TO_CTX
register, but no offset checks were made for PTR_TO_BTF_ID. Only
reg->off was taken into account by btf_struct_ids_match, which protected
against type mismatch due to non-zero reg->off, but when reg->off was
zero, a user could set the variable offset of the register and allow it
to be passed to kfunc, leading to bad pointer being passed into the
kernel.
Fix this by reusing the extracted helper check_func_arg_reg_off from
previous commit, and make one call before checking all supported
register types. Since the list is maintained, any future changes will be
taken into account by updating check_func_arg_reg_off. This function
prevents non-zero var_off to be set for PTR_TO_BTF_ID, but still allows
a fixed non-zero reg->off, which is needed for type matching to work
correctly when using pointer arithmetic.
ARG_DONTCARE is passed as arg_type, since kfunc doesn't support
accepting a ARG_PTR_TO_ALLOC_MEM without relying on size of parameter
type from BTF (in case of pointer), or using a mem, len pair. The
forcing of offset check for ARG_PTR_TO_ALLOC_MEM is done because ringbuf
helpers obtain the size from the header located at the beginning of the
memory region, hence any changes to the original pointer shouldn't be
allowed. In case of kfunc, size is always known, either at verification
time, or using the length parameter, hence this forcing is not required.
Since this check will happen once already for PTR_TO_CTX, remove the
check_ptr_off_reg call inside its block.
Lift the list of register types allowed for having fixed and variable
offsets when passed as helper function arguments into a common helper,
so that they can be reused for kfunc checks in later commits. Keeping a
common helper aids maintainability and allows us to follow the same
consistent rules across helpers and kfuncs. Also, convert check_func_arg
to use this function.
Merge branch 'libbpf: support custom SEC() handlers'
Andrii Nakryiko says:
====================
Add ability for user applications and libraries to register custom BPF program
SEC() handlers. See patch #2 for examples where this is useful.
Patch #1 does some preliminary refactoring to allow exponsing program
init, preload, and attach callbacks as public API. It also establishes
a protocol to allow optional auto-attach behavior. This will also help the
case of sometimes auto-attachable uprobes.
v4->v5:
- API documentation improvements (Daniel);
v3->v4:
- init_fn -> prog_setup_fn, preload_fn -> prog_prepare_load_fn (Alexei);
v2->v3:
- moved callbacks and cookie into OPTS struct (Alan);
- added more test scenarios (Alan);
- address most of Alan's feedback, but kept API name;
v1->v2:
- resubmitting due to git send-email screw up.
Cc: Alan Maguire <alan.maguire@oracle.com>
====================
Andrii Nakryiko [Sat, 5 Mar 2022 01:01:29 +0000 (17:01 -0800)]
selftests/bpf: Add custom SEC() handling selftest
Add a selftest validating various aspects of libbpf's handling of custom
SEC() handlers. It also demonstrates how libraries can ensure very early
callbacks registration and unregistration using
__attribute__((constructor))/__attribute__((destructor)) functions.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alan Maguire <alan.maguire@oracle.com> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/bpf/20220305010129.1549719-4-andrii@kernel.org
Andrii Nakryiko [Sat, 5 Mar 2022 01:01:28 +0000 (17:01 -0800)]
libbpf: Support custom SEC() handlers
Allow registering and unregistering custom handlers for BPF program.
This allows user applications and libraries to plug into libbpf's
declarative SEC() definition handling logic. This allows to offload
complex and intricate custom logic into external libraries, but still
provide a great user experience.
One such example is USDT handling library, which has a lot of code and
complexity which doesn't make sense to put into libbpf directly, but it
would be really great for users to be able to specify BPF programs with
something like SEC("usdt/<path-to-binary>:<usdt_provider>:<usdt_name>")
and have correct BPF program type set (BPF_PROGRAM_TYPE_KPROBE, as it is
uprobe) and even support BPF skeleton's auto-attach logic.
In some cases, it might be even good idea to override libbpf's default
handling, like for SEC("perf_event") programs. With custom library, it's
possible to extend logic to support specifying perf event specification
right there in SEC() definition without burdening libbpf with lots of
custom logic or extra library dependecies (e.g., libpfm4). With current
patch it's possible to override libbpf's SEC("perf_event") handling and
specify a completely custom ones.
Further, it's possible to specify a generic fallback handling for any
SEC() that doesn't match any other custom or standard libbpf handlers.
This allows to accommodate whatever legacy use cases there might be, if
necessary.
See doc comments for libbpf_register_prog_handler() and
libbpf_unregister_prog_handler() for detailed semantics.
This patch also bumps libbpf development version to v0.8 and adds new
APIs there.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alan Maguire <alan.maguire@oracle.com> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/bpf/20220305010129.1549719-3-andrii@kernel.org
Andrii Nakryiko [Sat, 5 Mar 2022 01:01:27 +0000 (17:01 -0800)]
libbpf: Allow BPF program auto-attach handlers to bail out
Allow some BPF program types to support auto-attach only in subste of
cases. Currently, if some BPF program type specifies attach callback, it
is assumed that during skeleton attach operation all such programs
either successfully attach or entire skeleton attachment fails. If some
program doesn't support auto-attachment from skeleton, such BPF program
types shouldn't have attach callback specified.
This is limiting for cases when, depending on how full the SEC("")
definition is, there could either be enough details to support
auto-attach or there might not be and user has to use some specific API
to provide more details at runtime.
One specific example of such desired behavior might be SEC("uprobe"). If
it's specified as just uprobe auto-attach isn't possible. But if it's
SEC("uprobe/<some_binary>:<some_func>") then there are enough details to
support auto-attach. Note that there is a somewhat subtle difference
between auto-attach behavior of BPF skeleton and using "generic"
bpf_program__attach(prog) (which uses the same attach handlers under the
cover). Skeleton allow some programs within bpf_object to not have
auto-attach implemented and doesn't treat that as an error. Instead such
BPF programs are just skipped during skeleton's (optional) attach step.
bpf_program__attach(), on the other hand, is called when user *expects*
auto-attach to work, so if specified program doesn't implement or
doesn't support auto-attach functionality, that will be treated as an
error.
Another improvement to the way libbpf is handling SEC()s would be to not
require providing dummy kernel function name for kprobe. Currently,
SEC("kprobe/whatever") is necessary even if actual kernel function is
determined by user at runtime and bpf_program__attach_kprobe() is used
to specify it. With changes in this patch, it's possible to support both
SEC("kprobe") and SEC("kprobe/<actual_kernel_function"), while only in
the latter case auto-attach will be performed. In the former one, such
kprobe will be skipped during skeleton attach operation.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Alan Maguire <alan.maguire@oracle.com> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/bpf/20220305010129.1549719-2-andrii@kernel.org
David S. Miller [Sat, 5 Mar 2022 11:16:56 +0000 (11:16 +0000)]
Merge branch 'bnxt_en-updates'
Michael Chan says:
====================
bnxt_en: Updates.
This patch series contains mainly NVRAM related features. More
NVRAM error checking and logging are added when installing firmware
packages. A new devlink hw health report is now added to report
and diagnose NVRAM issues. Other miscellaneous patches include
reporting correctly cards that don't support link pause, adding
an internal unknown link state, and avoiding unnecessary link
toggle during firmware reset.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vikas Gupta [Sat, 5 Mar 2022 08:54:42 +0000 (03:54 -0500)]
bnxt_en: add an nvm test for hw diagnose
Add an NVM test function for devlink hw reporter.
In this function an NVM VPD area is read followed by
a write. Test result is cached and if it is successful then
the next test can be conducted only after HW_RETEST_MIN_TIME to
avoid frequent writes to the NVM.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kalesh AP [Sat, 5 Mar 2022 08:54:41 +0000 (03:54 -0500)]
bnxt_en: implement hw health reporter
This reporter will report NVM errors which are non-fatal.
When we receive these NVM error events, we'll report it
through this new hw health reporter.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Edwin Peer [Sat, 5 Mar 2022 08:54:40 +0000 (03:54 -0500)]
bnxt_en: Do not destroy health reporters during reset
Health reporter state should be maintained over resets. Previously
reporters were destroyed if the device capabilities changed, but
since none of the reporters depend on capabilities anymore, this
logic should be removed.
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sat, 5 Mar 2022 08:54:39 +0000 (03:54 -0500)]
bnxt_en: Eliminate unintended link toggle during FW reset
If the flow control settings have been changed, a subsequent FW reset
may cause the ethernet link to toggle unnecessarily. This link toggle
will increase the down time by a few seconds.
The problem is caused by bnxt_update_phy_setting() detecting a false
mismatch in the flow control settings between the stored software
settings and the current FW settings after the FW reset. This mismatch
is caused by the AUTONEG bit added to link_info->req_flow_ctrl in an
inconsistent way in bnxt_set_pauseparam() in autoneg mode. The AUTONEG
bit should not be added to link_info->req_flow_ctrl.
Reviewed-by: Colin Winegarden <colin.winegarden@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sat, 5 Mar 2022 08:54:38 +0000 (03:54 -0500)]
bnxt_en: Properly report no pause support on some cards
Some cards are configured to never support link pause or PFC. Discover
these cards and properly report no pause support to ethtool. Disable
PFC settings from DCBNL if PFC is unsupported.
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Edwin Peer [Sat, 5 Mar 2022 08:54:37 +0000 (03:54 -0500)]
bnxt_en: introduce initial link state of unknown
This will force link state to always be logged for initial NIC open.
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kalesh AP [Sat, 5 Mar 2022 08:54:36 +0000 (03:54 -0500)]
bnxt_en: parse result field when NVRAM package install fails
Instead of always returning -ENOPKG, decode the firmware error
code further when the HWRM_NVM_INSTALL_UPDATE firmware call fails.
Return a more suitable error code to userspace and log an error
in dmesg.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kalesh AP [Sat, 5 Mar 2022 08:54:35 +0000 (03:54 -0500)]
bnxt_en: add more error checks to HWRM_NVM_INSTALL_UPDATE
FW returns error code "NVM_INSTALL_UPDATE_CMD_ERR_CODE_ANTI_ROLLBACK"
in the response to indicate that HWRM_NVM_INSTALL_UPDATE command has
failed due to Anti-rollback feature. Parse the error and return an
appropriate error code to the user.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kalesh AP [Sat, 5 Mar 2022 08:54:34 +0000 (03:54 -0500)]
bnxt_en: refactor error handling of HWRM_NVM_INSTALL_UPDATE
This is in anticipation of handling more "cmd_err" from FW in the next
patch.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Add the ability to configure the RX/TX coalesce timer with ethtool.
Change default setting to scale with the clock rate rather than being a
fixed number of clock cycles.
Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Sat, 5 Mar 2022 02:24:42 +0000 (20:24 -0600)]
net: axienet: reduce default RX interrupt threshold to 1
Now that NAPI has been implemented, the hardware interrupt mitigation
mechanism is not needed to avoid excessive interrupt load in most cases.
Reduce the default RX interrupt threshold to 1 to reduce introduced
latency. This can be increased with ethtool if desired if some applications
still want to reduce interrupts.
Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Sat, 5 Mar 2022 02:24:41 +0000 (20:24 -0600)]
net: axienet: implement NAPI and GRO receive
Implement NAPI and GRO receive. In addition to better performance, this
also avoids handling RX packets in hard IRQ context, which reduces the
IRQ latency impact to other devices.
Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Sat, 5 Mar 2022 02:24:40 +0000 (20:24 -0600)]
net: axienet: don't set IRQ timer when IRQ delay not used
When the RX or TX coalesce count is set to 1, there's no point in
setting the delay timer value since an interrupt will already be raised
on every packet, and the delay interrupt just causes extra pointless
interrupts.
Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>