The usage header of the pm_nl_ctl command doesn't match the context, so
this patch adds the missing userspace PM keywords 'ann', 'rem', 'csf',
'dsf', 'events' and 'listen' to it.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
selftests: mptcp: avoid Terminated messages in userspace_pm
There are some 'Terminated' messages in the output of the userspace PM
test script after killing './pm_nl_ctl events' processes:
Created network namespaces ns1, ns2 [OK]
./userspace_pm.sh: line 166: 13735 Terminated ip netns exec "$ns2" ./pm_nl_ctl events >> "$client_evts" 2>&1
./userspace_pm.sh: line 172: 13737 Terminated ip netns exec "$ns1" ./pm_nl_ctl events >> "$server_evts" 2>&1
Established IPv4 MPTCP Connection ns2 => ns1 [OK]
./userspace_pm.sh: line 166: 13753 Terminated ip netns exec "$ns2" ./pm_nl_ctl events >> "$client_evts" 2>&1
./userspace_pm.sh: line 172: 13755 Terminated ip netns exec "$ns1" ./pm_nl_ctl events >> "$server_evts" 2>&1
Established IPv6 MPTCP Connection ns2 => ns1 [OK]
ADD_ADDR 10.0.2.2 (ns2) => ns1, invalid token [OK]
This patch adds a helper, kill_wait(), which runs 'wait $pid 2>/dev/null'
after 'kill $pid' to avoid printing these Terminated messages.
Use this helper instead of a bare 'kill $pid'.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds userspace PM subflow test support to the mptcp_join.sh
script. Add userspace PM subflow creation and destruction test cases in
userspace_tests().
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds userspace PM test support to the mptcp_join.sh script. Add
userspace PM add_addr and rm_addr test cases in userspace_tests().
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 8 Jul 2022 17:14:09 +0000 (10:14 -0700)]
selftests: mptcp: tweak simult_flows for debug kernels
The mentioned test measures the transfer run-time to verify
that the user-space program is able to use the full aggregate B/W.
Even on (virtual) link-speed-bound tests, a debug kernel can slow
down the transfer enough to cause sporadic test failures.
Instead of unconditionally raising the maximum allowed run-time,
raise it only when the running kernel is a debug one, and use a simple,
rough heuristic to detect such kernels.
Note: this intentionally avoids looking for /boot/config-<version> as
the latter file is not always available in our reference CI
environments.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Move macro MPTCPOPT_HMAC_LEN definition from net/mptcp/protocol.h to
include/net/mptcp.h.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When we are operating in SGMII inband mode, it implies that there is a
PHY connected, and the ethtool advertisement for autoneg applies to
the PHY, not the SGMII link. When in 1000base-X mode, then this applies
to the 802.3z link and needs to be applied to the PCS.
Antoine Tenart [Thu, 7 Jul 2022 08:02:45 +0000 (10:02 +0200)]
Documentation: add a description for net.core.high_order_alloc_disable
A description is missing for the net.core.high_order_alloc_disable
option in admin-guide/sysctl/net.rst ; add it. The above sysctl option
was introduced by commit ce27ec60648d ("net: add high_order_alloc_disable
sysctl/static key").
Thanks to Eric for re-running the benchmark cited in the above
commit, showing this knob is now mostly of historical importance.
When building with Clang we encounter this warning:
| net/rxrpc/rxkad.c:434:33: error: format specifies type 'unsigned short'
| but the argument has type 'u32' (aka 'unsigned int') [-Werror,-Wformat]
| _leave(" = %d [set %hx]", ret, y);
y is a u32 but the format specifier is `%hx`. Going from unsigned int to
short int results in a loss of data. This is surely not intended
behavior. If it is intended, the warning should be suppressed through
other means.
This patch should get us closer to the goal of enabling the -Wformat
flag for Clang builds.
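The data loss described above can be illustrated outside the kernel. The following is a minimal userspace sketch, not the rxkad code: the helper names are hypothetical, and the explicit casts are only there so the example compiles without the very warning under discussion. It assumes a 16-bit unsigned short.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format y the way "%hx" effectively does: only the low 16 bits survive.
 * The explicit cast keeps the call well-defined and warning-free. */
static int format_short_hex(char *buf, size_t len, uint32_t y)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%hx", (unsigned short)y);
}

/* Format y with a specifier that matches a 32-bit unsigned value. */
static int format_full_hex(char *buf, size_t len, uint32_t y)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%x", (unsigned int)y);
}

/* Returns 1 when both specifiers print the same text,
 * 0 when the short specifier dropped the upper bits. */
static int formats_agree(uint32_t y)
{
	char narrow[16], wide[16];

	format_short_hex(narrow, sizeof(narrow), y);
	format_full_hex(wide, sizeof(wide), y);
	return strcmp(narrow, wide) == 0;
}
```

Any value that fits in 16 bits prints identically with both specifiers; a larger value is where the two outputs diverge.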
====================
tls: pad strparser, internal header, decrypt_ctx etc.
A grab bag of non-functional refactoring to make the series
which will let us decrypt into a fresh skb smaller.
Patches in this series are not strictly required to get the
decryption into a fresh skb going, they are more in the "things
which had been annoying me for a while" category.
====================
Jakub Kicinski [Fri, 8 Jul 2022 01:03:13 +0000 (18:03 -0700)]
tls: create an internal header
include/net/tls.h is getting a little long, and is probably hard
for driver authors to navigate. Split out the internals into a
header which will live under net/tls/. While at it move some
static inlines with a single user into the source files, add
a few tls_ prefixes and fix spelling of 'proccess'.
Jakub Kicinski [Fri, 8 Jul 2022 01:03:11 +0000 (18:03 -0700)]
tls: rx: wrap decrypt params in a struct
The max size of iv + aad + tail is 22B. That's smaller
than a single sg entry (32B). Don't bother with the
memory packing, just create a struct which holds the
max size of those members.
Jakub Kicinski [Fri, 8 Jul 2022 01:03:10 +0000 (18:03 -0700)]
tls: rx: always allocate max possible aad size for decrypt
AAD size is either 5 or 13. Really no point complicating
the code for the 8B of difference. This will also let us
turn the chunked up buffer into a sane struct.
Jakub Kicinski [Fri, 8 Jul 2022 01:03:09 +0000 (18:03 -0700)]
strparser: pad sk_skb_cb to avoid straddling cachelines
sk_skb_cb lives within skb->cb[]. skb->cb[] straddles
2 cache lines, each containing 24B of data.
The first cache line does not contain much interesting
information for users of strparser, so pad things a little.
Previously strp_msg->full_len would live in the first cache
line and strp_msg->offset in the second.
We need to reorder the 8 byte temp_reg with struct tls_msg
to prevent a 4B hole which would push the struct over 48B.
net: ag71xx: switch to napi_build_skb() to reuse skbuff_heads
napi_build_skb() reuses NAPI skbuff_head cache in order to save some
cycles on freeing/allocating skbuff_heads on every new Rx or completed
Tx.
Use napi_consume_skb() to feed the cache with skbuff_heads of completed
Tx, so it's never empty. The budget parameter is added to indicate NAPI
context, as a value of zero can be passed in the case of netpoll.
Signed-off-by: Sieng-Piaw Liew <liew.s.piaw@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
In the current implementation the driver first resets the existing
PFC config before applying a new PFC configuration.
As a result, whenever a PF or VF requests a PFC config, the PFC
config previously set by other PFs/VFs gets reset.
This patch fixes the problem by removing the unnecessary reset
of the PFC config. It also lowers the pause quanta value, as the
current value is too high.
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When building with Clang we encounter the following warnings:
| net/l2tp/l2tp_debugfs.c:187:40: error: format specifies type 'unsigned
| short' but the argument has type 'u32' (aka 'unsigned int')
| [-Werror,-Wformat] seq_printf(m, " nr %hu, ns %hu\n", session->nr,
| session->ns);
-
| net/l2tp/l2tp_debugfs.c:196:32: error: format specifies type 'unsigned
| short' but the argument has type 'int' [-Werror,-Wformat]
| session->l2specific_type, l2tp_get_l2specific_len(session));
-
| net/l2tp/l2tp_debugfs.c:219:6: error: format specifies type 'unsigned
| short' but the argument has type 'u32' (aka 'unsigned int')
| [-Werror,-Wformat] session->nr, session->ns,
Both session->nr and ->ns are of type `u32`. The currently used format
specifier is `%hu` which describes a `u16`. My proposed fix is to listen
to Clang and use the correct format specifier `%u`.
For the warning at line 196, l2tp_get_l2specific_len() returns an int
and should therefore be using the `%d` format specifier.
Link: https://github.com/ClangBuiltLinux/linux/issues/378 Signed-off-by: Justin Stitt <justinstitt@google.com> Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 8 Jul 2022 03:01:53 +0000 (20:01 -0700)]
Merge branch 'polarfire-soc-macb-reset-support'
Conor Dooley says:
====================
PolarFire SoC macb reset support
The Cadence MACBs on PolarFire SoC (MPFS) have reset capability and are
compatible with the zynqmp's init function. I have removed the zynqmp
specific comments from that function & renamed it to reflect what it
does, since it is no longer zynqmp only.
MPFS's MACB had previously used the generic binding, so I also added
the required specific binding.
For v2, I noticed some low hanging cleanup fruit so there are extra
patches added for that:
moving the init function out of the config structs, aligning the
layout of the zynqmp & default config structs with the other dozen
or so structs & simplifying the error paths to use dev_err_probe().
Feel free to apply as many or as few of those as you like.
====================
net: macb: sort init_reset_optional() with other init()s
init_reset_optional() is somewhat oddly placed amidst the macb_config
struct definitions. Move it to a more reasonable location alongside
the fu540 init functions.
To date, the Microchip PolarFire SoC (MPFS) has been using the
cdns,macb compatible, however the generic device does not have reset
support. Add a new compatible & .data for MPFS to hook into the reset
functionality added for zynqmp support (and make the zynqmp init
function generic in the process).
Until now the PolarFire SoC (MPFS) has been using the generic
"cdns,macb" compatible but has optional reset support. Add a specific
compatible which falls back to the currently used generic binding.
Acked-by: Rob Herring <robh@kernel.org> Reviewed-by: Claudiu Beznea <claudiu.beznea@microchip.com> Signed-off-by: Conor Dooley <conor.dooley@microchip.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When building with clang we encounter this warning:
| net/l2tp/l2tp_ppp.c:1557:6: error: format specifies type 'unsigned
| short' but the argument has type 'u32' (aka 'unsigned int')
| [-Werror,-Wformat] session->nr, session->ns,
Both session->nr and session->ns are of type u32. The format specifier
previously used is `%hu` which would truncate our unsigned integer from
32 to 16 bits. This doesn't seem like intended behavior; if it is, then
perhaps we need to consider suppressing the warning with pragma clauses.
This patch should get us closer to the goal of enabling the -Wformat
flag for Clang builds.
Jie Wang [Tue, 5 Jul 2022 11:35:15 +0000 (19:35 +0800)]
net: page_pool: optimize page pool page allocation in NUMA scenario
Currently, NIC packet receive performance based on page pool occasionally
deteriorates. To analyze the causes of this problem, page allocation stats
were collected. Here are the stats when NIC rx performance deteriorates:
The rx_pp_alloc_waive count indicates that the NUMA node of a large number
of pages does not match the NUMA node of the NIC device. Therefore these
pages can't be reused by the page pool. As a result, many new pages are
allocated by __page_pool_alloc_pages_slow, which is time consuming. This
causes the NIC rx performance fluctuations.
The main reason for the large number of NUMA-mismatched pages in the page
pool is that it uses alloc_pages_bulk_array to allocate the original pages.
This function is not suitable for page allocation in a NUMA scenario. So
this patch uses alloc_pages_bulk_array_node, which takes a NUMA id as an
input parameter, to ensure NUMA consistency between the NIC device and the
allocated pages.
The NIC rx performance test was repeated 40 times. NIC rx bandwidth
is higher and more stable compared to the data above. Here are the stats
from three tests: the rx_pp_alloc_waive count is zero, and rx_pp_alloc_slow,
which counts pages allocated from the slow path, is relatively low.
- eth: ibmvnic: properly dispose of all skbs during a failover
Previous releases - always broken:
- bpf:
- fix insufficient bounds propagation from
adjust_scalar_min_max_vals
- clear page contiguity bit when unmapping pool
- netfilter: nft_set_pipapo: release elements in clone from
abort path
- mptcp: netlink: issue MP_PRIO signals from userspace PMs
- can:
- rcar_canfd: fix data transmission failed on R-Car V3U
- gs_usb: gs_usb_open/close(): fix memory leak
Misc:
- add Wenjia as SMC maintainer"
* tag 'net-5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (57 commits)
wireguard: Kconfig: select CRYPTO_CHACHA_S390
crypto: s390 - do not depend on CRYPTO_HW for SIMD implementations
wireguard: selftests: use microvm on x86
wireguard: selftests: always call kernel makefile
wireguard: selftests: use virt machine on m68k
wireguard: selftests: set fake real time in init
r8169: fix accessing unset transport header
net: rose: fix UAF bug caused by rose_t0timer_expiry
usbnet: fix memory leak in error case
Revert "tls: rx: move counting TlsDecryptErrors for sync"
mptcp: update MIB_RMSUBFLOW in cmd_sf_destroy
mptcp: fix local endpoint accounting
selftests: mptcp: userspace PM support for MP_PRIO signals
mptcp: netlink: issue MP_PRIO signals from userspace PMs
mptcp: Acquire the subflow socket lock before modifying MP_PRIO flags
mptcp: Avoid acquiring PM lock for subflow priority changes
mptcp: fix locking in mptcp_nl_cmd_sf_destroy()
net/mlx5e: Fix matchall police parameters validation
net/sched: act_police: allow 'continue' action offload
net: lan966x: hardcode the number of external ports
...
Merge tag 'pinctrl-v5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
- Tag Intel pin control as supported in MAINTAINERS
- Fix a NULL pointer exception in the Aspeed driver
- Correct some NAND functions in the Sunxi A83T driver
- Use the right offset for some Sunxi pins
- Fix a zero base offset in the Freescale (NXP) i.MX93
- Fix the IRQ support in the STM32 driver
* tag 'pinctrl-v5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: stm32: fix optional IRQ support to gpios
pinctrl: imx: Add the zero base flag for imx93
pinctrl: sunxi: sunxi_pconf_set: use correct offset
pinctrl: sunxi: a83t: Fix NAND function name for some pins
pinctrl: aspeed: Fix potential NULL dereference in aspeed_pinmux_set_mux()
MAINTAINERS: Update Intel pin control to Supported
These are indeed "should not happen" situations, but it turns out recent
changes made the 'task_is_stopped_or_trace()' case trigger (fix for that
exists, is pending more testing), and the BUG_ON() makes it
unnecessarily hard to actually debug for no good reason.
It's been that way for a long time, but let's make it clear: BUG_ON() is
not good for debugging, and should never be used in situations where you
could just say "this shouldn't happen, but we can continue".
Use WARN_ON_ONCE() instead to make sure it gets logged, and then just
continue running, instead of making the system basically unusable
because you crashed the machine while potentially holding some very core
locks (eg this function is commonly called while holding 'tasklist_lock'
for writing).
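The "log it once, then keep going" pattern can be sketched in plain userspace C. This is only an analogy for WARN_ON_ONCE(), not the kernel macro: warn_on_once(), handle_state() and warnings_logged are hypothetical names, and the once-only flag is passed explicitly instead of being hidden in a macro.

```c
#include <assert.h>
#include <stdio.h>

/* Number of warnings actually printed; kept only so the behavior is observable. */
static unsigned int warnings_logged;

/* Report a "should not happen" condition the first time it is seen and
 * return whether it held, so the caller can recover instead of crashing. */
static int warn_on_once(int *warned, int cond, const char *what)
{
	assert(warned != NULL && what != NULL);
	if (cond && !*warned) {
		*warned = 1;
		warnings_logged++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond != 0;
}

/* Example caller: an unexpected state is logged once and this operation
 * is abandoned, but the program (and any locks it holds) stays usable. */
static int handle_state(int state_is_unexpected)
{
	static int warned;

	if (warn_on_once(&warned, state_is_unexpected, "unexpected task state"))
		return -1; /* give up on this operation only */
	return 0;
}
```

A BUG_ON()-style check would instead abort at the first unexpected state, which is exactly what makes the underlying problem hard to debug.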
Commit 6dd4142fb5a9 ("Merge branch 'af_unix-per-netns-socket-hash'") and
commit 51bae889fe11 ("af_unix: Put pathname sockets in the global hash
table.") changed a hash table layout.
Now, while looking up sockets, we traverse the global table for the
pathname sockets and the first half of each per-netns hash table for
abstract sockets, where pathname sockets are also linked. Thus, the
more pathname sockets we have, the longer we take to look up abstract
sockets. This characteristic was already there before the layout change,
but we can improve it now.
This patch changes the per-netns hash table's layout so that sockets not
requiring lookup reside in the first half and do not impact the lookup of
abstract sockets.
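The split layout can be pictured with a small sketch. The table size and function names below are assumptions made for illustration and are not taken from the af_unix code; the only point is that the two classes of sockets land in disjoint halves, so an abstract lookup never walks entries that are not abstract sockets.

```c
#include <assert.h>

/* Hypothetical table size; the kernel's real constants may differ. */
#define TABLE_SIZE 256u
#define HALF_SIZE  (TABLE_SIZE / 2u)

/* Sockets that are never looked up in this table go to the first half. */
static unsigned int unlooked_bucket(unsigned int hash)
{
	return hash % HALF_SIZE;
}

/* Abstract sockets go to the second half, so their chains
 * contain abstract sockets only. */
static unsigned int abstract_bucket(unsigned int hash)
{
	return HALF_SIZE + hash % HALF_SIZE;
}

/* Tells which half a bucket index belongs to. */
static int bucket_is_abstract(unsigned int bucket)
{
	assert(bucket < TABLE_SIZE);
	return bucket >= HALF_SIZE;
}
```

With this arrangement the length of an abstract-socket chain depends only on the number of abstract sockets, regardless of how many other sockets exist.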
We have run a test that bind()s 100,000 abstract/pathname sockets for
each, bind()s an abstract socket 100,000 times and measures the time
on __unix_find_socket_byname(). The result shows that the patch makes
each lookup faster.
Jakub Kicinski [Thu, 7 Jul 2022 03:04:09 +0000 (20:04 -0700)]
Merge branch 'wireguard-patches-for-5-19-rc6'
Jason A. Donenfeld says:
====================
wireguard patches for 5.19-rc6
1) A few small fixups to the selftests, per usual. Of particular note is
a fix for a test flake that occurred on especially fast systems that
boot in less than a second.
2) An addition during this cycle of some s390 crypto interacted with the
way wireguard selects dependencies, resulting in linker errors
reported by the kernel test robot. So Vladis sent in a patch for
that, which also required a small preparatory fix moving some Kconfig
symbols around.
====================
Select the new implementation of CHACHA20 for S390 when available.
It is faster than the generic software implementation, but also prevents
some linker errors in certain situations.
crypto: s390 - do not depend on CRYPTO_HW for SIMD implementations
Various accelerated software implementation Kconfig values for S390 were
mistakenly placed into drivers/crypto/Kconfig, even though they're
mainly just SIMD code and live in arch/s390/crypto/ like usual. This
gives them the very unusual dependency on CRYPTO_HW, which leads to
problems elsewhere.
This patch fixes the issue by moving the Kconfig values for non-hardware
drivers into the usual place in crypto/Kconfig.
Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These selftests are used for much more extensive changes than just the
wireguard source files. So always call the kernel's build file, which
will do something or nothing after checking the whole tree, per usual.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Not all platforms have an RTC, and rather than trying to force one into
each, it's much easier to just set a fixed time. This is necessary
because WireGuard's latest handshakes parameter is returned in wallclock
time, and if the system time isn't set, and the system is really fast,
then this returns 0, which trips the test.
Turning this on requires setting CONFIG_COMPAT_32BIT_TIME=y, as musl
doesn't support settimeofday without it.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
66e4c8d95008 ("net: warn if transport header was not set") added
a check that triggers a warning in r8169, see [0].
The commit referenced in the Fixes tag refers to the change from
which the patch applies cleanly, there's nothing wrong with this
commit. It seems the actual issue (not bug, because the warning
is harmless here) was introduced with bdfa4ed68187
("r8169: use Giant Send").
net: rose: fix UAF bug caused by rose_t0timer_expiry
There are UAF bugs caused by rose_t0timer_expiry(). The
root cause is that del_timer() could not stop the timer
handler that is running and there is no synchronization.
One of the race conditions is shown below:
The rose_neigh is deallocated at position [1] and used at
position [2].
The crash trace triggered by the PoC is shown below:
BUG: KASAN: use-after-free in expire_timers+0x144/0x320
Write of size 8 at addr ffff888009b19658 by task swapper/0/0
...
Call Trace:
<IRQ>
dump_stack_lvl+0xbf/0xee
print_address_description+0x7b/0x440
print_report+0x101/0x230
? expire_timers+0x144/0x320
kasan_report+0xed/0x120
? expire_timers+0x144/0x320
expire_timers+0x144/0x320
__run_timers+0x3ff/0x4d0
run_timer_softirq+0x41/0x80
__do_softirq+0x233/0x544
...
This patch changes rose_stop_ftimer() and rose_stop_t0timer()
in rose_remove_neigh() to del_timer_sync() so that the
timer handler finishes before resources such as rose_neigh
are deallocated. As a result, the UAF bugs are
mitigated.
There are build warnings here which break the normal
build due to -Werror. Ratheesh was nice enough to quickly
follow up with fixes, but they didn't cover all the warnings I
see on GCC 12, so to unblock net-next from taking patches
let's get this series out for now.
Merge tag 'for-linus' of https://github.com/openrisc/linux
Pull OpenRISC fixes from Stafford Horne:
"Fixups for OpenRISC found during recent testing:
- An OpenRISC irqchip fix to stop acking level interrupts which was
causing issues on SMP platforms
- A comment typo fix in our unwinder code"
* tag 'for-linus' of https://github.com/openrisc/linux:
openrisc: unwinder: Fix grammar issue in comment
irqchip: or1k-pic: Undefine mask_ack for level triggered hardware
Merge tag 'sound-5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This became largish as it includes the pending ASoC fixes.
Almost all changes are device-specific small fixes, while many of them
are coverage for mixer issues that were detected by selftest. In
addition, usual suspects for HD/USB-audio are there"
* tag 'sound-5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (43 commits)
ALSA: cs46xx: Fix missing snd_card_free() call at probe error
ALSA: usb-audio: Add quirk for Fiero SC-01 (fw v1.0.0)
ALSA: usb-audio: Add quirk for Fiero SC-01
ALSA: hda/realtek: Add quirk for Clevo L140PU
ALSA: usb-audio: Add quirks for MacroSilicon MS2100/MS2106 devices
ASoC: madera: Fix event generation for rate controls
ASoC: madera: Fix event generation for OUT1 demux
ASoC: cs47l15: Fix event generation for low power mux control
ASoC: cs35l41: Add ASP TX3/4 source to register patch
ASoC: dapm: Initialise kcontrol data for mux/demux controls
ASoC: rt711-sdca: fix kernel NULL pointer dereference when IO error
ASoC: cs35l41: Correct some control names
ASoC: wm5110: Fix DRE control
ASoC: wm_adsp: Fix event for preloader
MAINTAINERS: update ASoC Qualcomm maintainer email-id
ASoC: rockchip: i2s: switch BCLK to GPIO
ASoC: SOF: Intel: disable IMR boot when resuming from ACPI S4 and S5 states
ASoC: SOF: pm: add definitions for S4 and S5 states
ASoC: SOF: pm: add explicit behavior for ACPI S1 and S2
ASoC: SOF: Intel: hda: Fix compressed stream position tracking
...
Gal Pressman [Tue, 5 Jul 2022 11:08:37 +0000 (14:08 +0300)]
Revert "tls: rx: move counting TlsDecryptErrors for sync"
This reverts commit 284b4d93daee56dff3e10029ddf2e03227f50dbf.
When using TLS device offload and coming from tls_device_reencrypt()
flow, -EBADMSG error in tls_do_decryption() should not be counted
towards the TLSTlsDecryptError counter.
Move the counter increase back to the decrypt_internal() call site in
decrypt_skb_update().
This also fixes an issue where:
if (n_sgin < 1)
return -EBADMSG;
Errors in decrypt_internal() were not counted after the cited patch.
Fixes: 284b4d93daee ("tls: rx: move counting TlsDecryptErrors for sync") Cc: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 6 Jul 2022 12:09:28 +0000 (13:09 +0100)]
Merge branch 'hinic-dev_get_stats-fixes'
Qiao Ma says:
====================
net: hinic: fix bugs about dev_get_stats
These patches fix 2 bugs in the hinic driver:
- fix ethtool getting wrong stats because hinic_{txq|rxq}_clean_stats() is called
- avoid a kernel hang in hinic_get_stats64()
See every patch for more information.
Changes in v4:
- removed meaningless u64_stats_sync protection in hinic_{txq|rxq}_get_stats
- merged the third patch in v2 into first one
Changes in v3:
- fixes a compile warning reported by kernel test robot <lkp@intel.com>
Changes in v2:
- fixes another 2 bugs. (v1 is a single patch, see: https://lore.kernel.org/all/07736c2b7019b6883076a06129e06e8f7c5f7154.1656487154.git.mqaio@linux.alibaba.com/).
- to fix extra bugs, hinic_dev.tx_stats/rx_stats is removed, so there is no need to use spinlock or semaphore now.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The call trace of the task that actually caused the kernel hang is as follows:
__switch_to+124
__schedule+548
schedule+72
schedule_timeout+348
__down_common+188
__down+24
down+104
hinic_get_stats64+44 [hinic]
dev_get_stats+92
bond_get_stats+172 [bonding]
dev_get_stats+92
dev_seq_printf_stats+60
dev_seq_show+24
seq_read_iter+964
seq_read+220
proc_reg_read+164
vfs_read+172
ksys_read+108
__arm64_sys_read+28
el0_svc_common+132
do_el0_svc+40
el0_svc+24
el0_sync_handler+164
el0_sync+324
When getting device stats from bond, the kernel calls bond_get_stats().
It first takes the spinlock bond->stats_lock, and then calls
hinic_get_stats64() to collect the hinic device's stats.
However, hinic_get_stats64() calls `down(&nic_dev->mgmt_lock)` to
protect its critical section, which may schedule the current task out.
If the system is under high pressure, the task cannot be woken up
immediately, which eventually triggers a kernel hung-task panic.
Since the previous patch replaced hinic_dev.tx_stats/rx_stats with local
variables in hinic_get_stats64(), nothing needs to be protected by the
lock any more, so simply removing down()/up() is fine.
Fixes: edd384f682cc ("net-next/hinic: Add ethtool and stats") Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Qiao Ma [Tue, 5 Jul 2022 11:22:22 +0000 (19:22 +0800)]
net: hinic: fix bug that ethtool get wrong stats
Function hinic_get_stats64() will do two operations:
1. reads stats from every hinic_rxq/txq and accumulates them
2. calls hinic_rxq/txq_clean_stats() to clean every rxq/txq's stats
For hinic_get_stats64(), it could get right data, because it sums all
data to nic_dev->rx_stats/tx_stats.
But it is wrong for get_drv_queue_stats(), this function will read
hinic_rxq's stats, which have been cleared to zero by hinic_get_stats64().
I have observed hinic's cleanup operation by using the following command:
> watch -n 1 "ethtool -S eth4 | tail -40"
Result after a few seconds:
...
rxq7_pkts: 0
rxq7_bytes: 0
rxq7_errors: 0
rxq7_csum_errors: 0
rxq7_other_errors: 0
...
rxq9_pkts: 2
rxq9_bytes: 132
rxq9_errors: 0
rxq9_csum_errors: 0
rxq9_other_errors: 0
...
rxq11_pkts: 1
rxq11_bytes: 170
rxq11_errors: 0
rxq11_csum_errors: 0
rxq11_other_errors: 0
To solve this problem, we keep every queue's total stats in the queue
itself (aka hinic_{rxq|txq}), and simply sum all per-queue stats every time
hinic_get_stats64() is called.
With that solution, there is no longer any need to clean per-queue stats,
nor to maintain the global hinic_dev.{tx|rx}_stats.
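The approach described above can be sketched as follows. This is a simplified illustration rather than the hinic code: struct queue_stats and sum_queue_stats() are hypothetical names, and locking and per-CPU synchronization are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-queue counters; each queue only ever increments its own. */
struct queue_stats {
	uint64_t pkts;
	uint64_t bytes;
};

/* Sum all per-queue counters into a caller-provided total.
 * The per-queue values are read but never cleared, so any other
 * reader of the queues keeps seeing the full totals. */
static void sum_queue_stats(const struct queue_stats *queues, size_t nqueues,
			    struct queue_stats *total)
{
	size_t i;

	assert(total != NULL);
	assert(queues != NULL || nqueues == 0);

	total->pkts = 0;
	total->bytes = 0;
	for (i = 0; i < nqueues; i++) {
		total->pkts += queues[i].pkts;
		total->bytes += queues[i].bytes;
	}
}
```

Because the total lives in a variable owned by the caller, repeated calls return the same result and no shared aggregate has to be protected.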
Fixes: edd384f682cc ("net-next/hinic: Add ethtool and stats") Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 6 Jul 2022 11:56:35 +0000 (12:56 +0100)]
Merge branch 'tls-rx-nopad-and-backlog-flushing'
Jakub Kicinski says:
====================
tls: rx: nopad and backlog flushing
This small series contains the two changes I've been working
towards in the previous ~50 patches a couple of months ago.
The first major change is the optional "nopad" optimization.
Currently TLS 1.3 Rx performs quite poorly because it does
not support the "zero-copy" or rather direct decrypt to a user
space buffer. Because of TLS 1.3 record padding we don't
know if a record contains data or a control message until
we decrypt it. Most records will contain data, tho, so the
optimization is to try the decryption hoping it's data and
retry if it wasn't.
The performance gain from doing that is significant (~40%)
but if I'm completely honest the major reason is that we
call skb_cow_data() on the non-"zc" path. The next series
will remove the CoW, dropping the gain to only ~10%.
The second change is to flush the backlog every 128kB.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Tue, 5 Jul 2022 23:59:26 +0000 (16:59 -0700)]
tls: rx: periodically flush socket backlog
We continuously hold the socket lock during large reads and writes.
This may inflate RTT and negatively impact TCP performance.
Flush the backlog periodically. I tried to pick a flush period (128kB)
which gives significant benefit but the max Bps rate is not yet visibly
impacted.
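The periodic flush boils down to simple byte accounting. The sketch below is an assumption-laden simplification, not the tls_sw code: struct flush_state and account_and_maybe_flush() are hypothetical, and the actual flush (briefly letting the backlog be processed) is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Flush period mentioned in the commit message: 128kB. */
#define FLUSH_PERIOD (128u * 1024u)

/* Hypothetical bookkeeping for one long read. */
struct flush_state {
	size_t done;          /* bytes handled so far */
	size_t flushed_at;    /* value of 'done' at the last flush */
	unsigned int flushes; /* how many flushes were triggered */
};

/* Account for 'copied' more bytes. Returns 1 when a full period has
 * elapsed since the last flush and the backlog should be flushed now. */
static int account_and_maybe_flush(struct flush_state *st, size_t copied)
{
	assert(st != NULL);

	st->done += copied;
	if (st->done - st->flushed_at < FLUSH_PERIOD)
		return 0;

	st->flushed_at = st->done;
	st->flushes++;
	return 1;
}
```

A shorter period would bound the time the lock is held more tightly, at the cost of more frequent flushes; the 128kB figure is the trade-off the author reports choosing.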
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Tue, 5 Jul 2022 23:59:23 +0000 (16:59 -0700)]
tls: rx: support optimistic decrypt to user buffer with TLS 1.3
We currently don't support decrypt to user buffer with TLS 1.3
because we don't know the record type and how much padding the
record contains before decryption. In practice data records
are by far the most common and padding is rarely used, so
we can assume a data record with no padding, and if we find out
that wasn't the case - retry the crypto in place (decrypt
to skb).
To safeguard against the user overwriting the content type and
padding before we can check them, attach a 1B sg entry where the
last byte of the record will land.
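The check performed after the optimistic decrypt can be sketched like this. It is an illustration under stated assumptions, not the kernel implementation: the function and enum names are hypothetical, and the plaintext buffer stands in for the decrypted record whose last byte the kernel keeps in its own 1B entry. It relies on the TLS 1.3 record layout, where the content is followed by a one-byte content type and then optional zero padding, and on application_data being type 23.

```c
#include <assert.h>
#include <stddef.h>

/* TLS content type for application data (23). */
#define RECORD_TYPE_DATA 23

enum decrypt_verdict {
	DECRYPT_OK,    /* guess was right: data record, no padding */
	DECRYPT_RETRY, /* padded or control record: decrypt again in place */
};

/* Decide based on the final plaintext byte of the record.
 * Without padding that byte is the content type; with padding it is 0. */
static enum decrypt_verdict check_last_byte(unsigned char last_byte)
{
	if (last_byte == RECORD_TYPE_DATA)
		return DECRYPT_OK;
	return DECRYPT_RETRY;
}

/* Convenience wrapper taking the whole decrypted record. */
static enum decrypt_verdict verdict_for_record(const unsigned char *plain,
					       size_t len)
{
	assert(plain != NULL && len > 0);
	return check_last_byte(plain[len - 1]);
}
```

The retry path costs a second decryption, so the scheme only pays off because unpadded data records are the common case.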
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 6 Jul 2022 11:50:27 +0000 (12:50 +0100)]
Merge branch 'mptcp-path-manager-fixes'
Mat Martineau says:
====================
mptcp: Path manager fixes for 5.19
The MPTCP userspace path manager is new in 5.19, and these patches fix
some issues in that new code.
Patches 1-3 fix path manager locking issues.
Patches 4 and 5 allow userspace path managers to change priority of
established subflows using the existing MPTCP_PM_CMD_SET_FLAGS generic
netlink command. Includes corresponding self test update.
Patches 6 and 7 fix accounting of available endpoint IDs and the
MPTCP_MIB_RMSUBFLOW counter.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 5 Jul 2022 21:32:16 +0000 (14:32 -0700)]
mptcp: fix local endpoint accounting
In mptcp_pm_nl_rm_addr_or_subflow() we always mark as available
the id corresponding to the just removed address.
The used bitmap actually tracks only the local IDs: we must
restrict the operation when a (local) subflow is removed.
Fixes: a88c9e496937 ("mptcp: do not block subflows creation on errors") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
selftests: mptcp: userspace PM support for MP_PRIO signals
This change updates the testing sample (pm_nl_ctl) to exercise
the updated MPTCP_PM_CMD_SET_FLAGS command for userspace PMs to
issue MP_PRIO signals over the selected subflow.
E.g. ./pm_nl_ctl set 10.0.1.2 port 47234 flags backup token 823274047 rip 10.0.1.1 rport 50003
userspace_pm.sh has a new selftest that invokes this command.
Fixes: 259a834fadda ("selftests: mptcp: functional tests for the userspace PM type") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Kishen Maloor <kishen.maloor@intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
mptcp: netlink: issue MP_PRIO signals from userspace PMs
This change updates MPTCP_PM_CMD_SET_FLAGS to allow userspace PMs
to issue MP_PRIO signals over a specific subflow selected by
the connection token, local and remote address+port.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/286 Fixes: 702c2f646d42 ("mptcp: netlink: allow userspace-driven subflow establishment") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Kishen Maloor <kishen.maloor@intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mat Martineau [Tue, 5 Jul 2022 21:32:13 +0000 (14:32 -0700)]
mptcp: Acquire the subflow socket lock before modifying MP_PRIO flags
When setting up a subflow's flags for sending MP_PRIO MPTCP options, the
subflow socket lock was not held while reading and modifying several
struct members that are also read and modified in mptcp_write_options().
Acquire the subflow socket lock earlier and send the MP_PRIO ACK with
that lock already acquired. Add a new variant of the
mptcp_subflow_send_ack() helper to use with the subflow lock held.
Fixes: 067065422fcd ("mptcp: add the outgoing MP_PRIO support") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
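The "helper variant for callers that already hold the lock" idea from the MP_PRIO locking fix can be illustrated with a toy, single-threaded model. Everything below is an assumption made for illustration: the `sketch_*` names are hypothetical, and a boolean flag stands in for the real subflow socket lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a subflow socket; the flag models the socket lock. */
struct sketch_sock {
    bool locked;
    bool backup;
    int acks_sent;
};

static inline void sketch_lock(struct sketch_sock *sk)
{
    assert(!sk->locked);    /* no recursive locking in this model */
    sk->locked = true;
}

static inline void sketch_unlock(struct sketch_sock *sk)
{
    assert(sk->locked);
    sk->locked = false;
}

/* Variant for callers that already hold the lock. */
static inline void sketch_send_ack_locked(struct sketch_sock *sk)
{
    assert(sk->locked);
    sk->acks_sent++;
}

/* Original-style helper: takes and releases the lock itself. */
static inline void sketch_send_ack(struct sketch_sock *sk)
{
    sketch_lock(sk);
    sketch_send_ack_locked(sk);
    sketch_unlock(sk);
}

/*
 * Priority change: the lock is taken before the flag is modified, and
 * the ACK is sent with the same lock still held.
 */
static inline void sketch_set_backup(struct sketch_sock *sk, bool backup)
{
    sketch_lock(sk);
    sk->backup = backup;
    sketch_send_ack_locked(sk);
    sketch_unlock(sk);
}
```

The point of the split is that the flag update and the ACK happen inside one critical section, rather than the flag being written unlocked and the lock taken only for the ACK.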
Mat Martineau [Tue, 5 Jul 2022 21:32:12 +0000 (14:32 -0700)]
mptcp: Avoid acquiring PM lock for subflow priority changes
The in-kernel path manager code for changing subflow flags acquired both
the msk socket lock and the PM lock when possibly changing the "backup"
and "fullmesh" flags. mptcp_pm_nl_mp_prio_send_ack() does not access
anything protected by the PM lock, and it must release and reacquire
the PM lock.
By pushing the PM lock to where it is needed in mptcp_pm_nl_fullmesh(),
the lock is only acquired when the fullmesh flag is changed and the
backup flag code no longer has to release and reacquire the PM lock. The
change in locking context requires the MIB update to be modified - move
that to a better location instead.
This change also makes it possible to call
mptcp_pm_nl_mp_prio_send_ack() for the userspace PM commands without
manipulating the in-kernel PM lock.
Fixes: 0f9f696a502e ("mptcp: add set_flags command in PM netlink") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 5 Jul 2022 21:32:11 +0000 (14:32 -0700)]
mptcp: fix locking in mptcp_nl_cmd_sf_destroy()
The user-space PM subflow removal path uses a couple of helpers
that must be called under the msk socket lock, and the current
code does not satisfy that requirement.
Change the existing lock scope so that the relevant code is under
its protection.
Fixes: 702c2f646d42 ("mptcp: netlink: allow userspace-driven subflow establishment") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/287 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
TC act_police with 'continue' action had been supported by mlx5 matchall
classifier offload implementation for some time. However, 'continue' was
assumed implicitly and recently got broken in multiple places. Fix it in
both TC hardware offload validation code and mlx5 driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The referenced commit prepared the code for an upcoming extension that allows
mlx5 to offload a police action attached to the flower classifier. However,
for the existing matchall classifier offload the validation should be
reversed, as FLOW_ACTION_CONTINUE is the only supported notexceed police
action type. Fix the problem by allowing FLOW_ACTION_CONTINUE for the police
action and extend scan_tc_matchall_fdb_actions() to only allow such actions
with the matchall classifier.
Fixes: d97b4b105ce7 ("flow_offload: reject offload for all drivers with invalid police parameters") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Acked-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Offloading police with action TC_ACT_UNSPEC was erroneously disabled even
though it was supported by mlx5 matchall offload implementation, which
didn't verify the action type but instead assumed that any single police
action attached to matchall classifier is a 'continue' action. Lack of
action type check made it non-obvious what mlx5 matchall implementation
actually supports and caused implementers and reviewers of referenced
commits to disallow it as a part of improved validation code.
Fixes: b8cd5831c61c ("net: flow_offload: add tc police action parameters") Fixes: b50e462bc22d ("net/sched: act_police: Add extack messages for offload failure") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 6 Jul 2022 07:16:48 +0000 (08:16 +0100)]
Merge branch 'octeontx2-af-next'
Ratheesh Kannoth says:
====================
octeontx2: *** Exact Match Table and Field hash ***
*** Exact match table and Field hash support for CN10KB silicon ***
Ratheesh Kannoth (11):
This patch series enables the exact match table in CN10KB silicon. Legacy
silicon used NPC mcam to do packet fields/channel matching for NPC rules.
NPC mcam resources were exhausted as customer use cases increased.
Supporting many DMAC filters becomes a challenge, as the RPM based filter
count is low. The exact match table has a 4-way 2K entry table and a 32 entry
fully associative CAM table. The second table handles hash
table collision overflows in the 4-way 2K entry table. Enabling the exact match
table results in the KEX key being appended with a Hit/Miss status. This can be
used to match in NPC mcam for a more generic rule and drop those packets, rather
than having DMAC drop rules for each DMAC entry in NPC mcam.
octeontx2-af: Exact match support
octeontx2-af: Exact match scan from kex profile
octeontx2-af: devlink configuration support
octeontx2-af: FLR handler for exact match table.
octeontx2-af: Drop rules for NPC MCAM
octeontx2-af: Debugsfs support for exact match.
octeontx2: Modify mbox request and response structures
octeontx2-af: Wrapper functions for mac addr add/del/update/reset
octeontx2-af: Invoke exact match functions if supported
octeontx2-pf: Add support for exact match table.
octeontx2-af: Enable Exact match flag in kex profile
Suman Ghosh (1):
The CN10KB variant of the CN10K series of silicon supports
a new feature wherein a large protocol field
(e.g. a 128-bit IPv6 DIP) can be condensed into a small
hashed 32-bit value. This saves a lot of space in the MCAM key
and allows the user to add more protocol fields to the filter.
A maximum of two such protocol fields can be hashed.
This patch adds support for hashing IPv6 SIP and/or DIP.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
octeontx2-af: Wrapper functions for MAC addr add/del/update/reset
These functions are wrappers for MAC address add/del/update/reset in
the exact match table. They will be invoked from mbox handler routines
if the exact match table is supported and enabled.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
octeontx2: Modify mbox request and response structures
Exact match table modification requires wider fields, as the table has
more slots to fill. Modifying an entry in the exact match
table may cause a hash collision, in which case the entry may have to be
deleted from the 4-way 2K table and added to the fully associative 32 entry
CAM table.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
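The two-level layout mentioned in this series (a 4-way 2K table backed by a 32 entry fully associative CAM for collision overflow) can be sketched as a small software model. This is a hedged illustration only: the `em_*` names, the reading of "2K" as 2048 rows of 4 ways, and the modulo index function are all assumptions; the hardware hash is not described in these commit messages.

```c
#include <stdbool.h>
#include <stdint.h>

#define EM_WAYS     4
#define EM_SETS     2048    /* assumed: "2K" rows, each with 4 ways */
#define EM_CAM_SIZE 32

struct em_entry {
    bool used;
    uint64_t key;           /* up to 64 bits of parsed output, e.g. a DMAC */
};

struct em_table {
    struct em_entry mem[EM_SETS][EM_WAYS];  /* 4-way hashed table */
    struct em_entry cam[EM_CAM_SIZE];       /* overflow, fully associative */
};

enum em_where { EM_IN_MEM, EM_IN_CAM, EM_FULL };

/* Hypothetical index function, chosen only to keep the sketch simple. */
static inline unsigned int em_index(uint64_t key)
{
    return (unsigned int)(key % EM_SETS);
}

/* Try the hashed row first; fall back to the CAM when all ways are busy. */
static inline enum em_where em_insert(struct em_table *t, uint64_t key)
{
    struct em_entry *set = t->mem[em_index(key)];

    for (int w = 0; w < EM_WAYS; w++) {
        if (!set[w].used) {
            set[w].used = true;
            set[w].key = key;
            return EM_IN_MEM;
        }
    }
    for (int i = 0; i < EM_CAM_SIZE; i++) {
        if (!t->cam[i].used) {
            t->cam[i].used = true;
            t->cam[i].key = key;
            return EM_IN_CAM;
        }
    }
    return EM_FULL;
}

/* A hit in either table counts as an exact match hit. */
static inline bool em_lookup(const struct em_table *t, uint64_t key)
{
    const struct em_entry *set = t->mem[em_index(key)];

    for (int w = 0; w < EM_WAYS; w++)
        if (set[w].used && set[w].key == key)
            return true;
    for (int i = 0; i < EM_CAM_SIZE; i++)
        if (t->cam[i].used && t->cam[i].key == key)
            return true;
    return false;
}
```

In this model, the fifth key that hashes to an already full row lands in the CAM, which is the overflow case the mbox structures need to be able to report.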
The NPC exact match table installs drop-on-hit rules in
NPC mcam for each channel. Each rule has the broadcast and multicast
bits cleared, the exact match bit cleared and the channel bits
set. If the exact match table hit bit is 0, the corresponding NPC mcam
drop rule will be hit and the packet will be dropped.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
CN10KB silicon supports the exact match feature. This feature can be disabled
through devlink configuration. The devlink command fails if DMAC filter rules
are already present. Once disabled, legacy RPM based DMAC filters will be
configured.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
CN10KB silicon supports the exact match table. Scanning the KEX
profile should check whether the exact match feature is enabled
and then set the profile masks properly.
These KEX profile masks are required to configure NPC
MCAM drop rules. If there is a miss in the exact match table,
these drop rules will drop those packets.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
CN10KB silicon has support for the exact match table. This table
can be used to match a maximum 64 bit value of KPU parsed output.
Hit/non hit in the exact match table can be used as a KEX key to
NPC mcam.
This patch makes use of the exact match table to increase the number of
DMAC filters supported. NPC mcam is no longer needed for each of these
DMAC entries, as they will be populated in the exact match table.
This patch implements the following:
1. Initialization of the exact match table only for CN10KB.
2. Add/del/update interface functions for the exact match table.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The CN10KB variant of the CN10K series of silicon supports
a new feature wherein a large protocol field
(e.g. a 128-bit IPv6 DIP) can be condensed into a small
hashed 32-bit value. This saves a lot of space in the MCAM key
and allows the user to add more protocol fields to the filter.
A maximum of two such protocol fields can be hashed.
This patch adds support for hashing IPv6 SIP and/or DIP.
Signed-off-by: Suman Ghosh <sumang@marvell.com> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
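The space saving behind field hashing can be illustrated by folding a 128-bit IPv6 address into a 32-bit value. The sketch below uses 32-bit FNV-1a purely as a stand-in; the hash actually used by the silicon is not stated in the commit message, and `sketch_hash_ipv6` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Condense a 16-byte IPv6 address into 32 bits.
 * FNV-1a is used here only as an example hash, not the hardware's.
 */
static inline uint32_t sketch_hash_ipv6(const uint8_t addr[16])
{
    uint32_t h = 2166136261u;       /* FNV-1a 32-bit offset basis */

    for (size_t i = 0; i < 16; i++) {
        h ^= addr[i];
        h *= 16777619u;             /* FNV-1a 32-bit prime */
    }
    return h;
}
```

Hashing both SIP and DIP this way would take 8 bytes of key instead of 32. As with any hash, distinct addresses can collide, so a hashed field trades exactness for key space.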
Simon Horman [Tue, 5 Jul 2022 07:36:04 +0000 (08:36 +0100)]
nfp: enable TSO by default for nfp netdev
We can benefit from TSO when the host CPU is not powerful enough,
so enable it by default now.
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com> Reviewed-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
nfp: allow TSO packets with metadata prepended in NFDK path
Packets with prepended metadata can now be correctly handled in
firmware when TSO is enabled, so remove the error path and
related comments. Since no existing firmware uses prepended
metadata, there is no need to add a compatibility check
here.
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com> Reviewed-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Walle [Mon, 4 Jul 2022 15:36:54 +0000 (17:36 +0200)]
net: lan966x: hardcode the number of external ports
Instead of counting the child nodes in the device tree, hardcode the
number of ports in the driver itself. The counting won't work at all
if an ethernet port is marked as disabled, e.g. because it is not
connected on the board at all.
It turns out that the LAN9662 and LAN9668 use the same switching IP
with the same synthesis parameters. The only difference is that the
output ports are not connected. Thus, we can just hardcode the
number of physical ports to 8.
Fixes: db8bcaad5393 ("net: lan966x: add the basic lan966x driver") Signed-off-by: Michael Walle <michael@walle.cc> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Link: https://lore.kernel.org/r/20220704153654.1167886-1-michael@walle.cc Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Mon, 4 Jul 2022 19:02:41 +0000 (22:02 +0300)]
net: dsa: felix: build as module when tc-taprio is module
felix_vsc9959.c calls taprio_offload_get() and taprio_offload_free(),
symbols exported by net/sched/sch_taprio.c. As such, we must disallow
building the Felix driver as built-in when the symbol exported by
tc-taprio isn't present in the kernel image.
Vladimir Oltean [Mon, 4 Jul 2022 19:02:40 +0000 (22:02 +0300)]
net: sched: provide shim definitions for taprio_offload_{get,free}
All callers of taprio_offload_get() and taprio_offload_free() prior to
the blamed commit are conditionally compiled based on CONFIG_NET_SCH_TAPRIO.
felix_vsc9959.c is different; it provides vsc9959_qos_port_tas_set()
even when taprio is compiled out.
Provide shim definitions for the functions exported by taprio so that
felix_vsc9959.c is able to compile. vsc9959_qos_port_tas_set() in that
case is dead code anyway, and ocelot_port->taprio remains NULL, which is
fine for the rest of the logic.
Fixes: 1c9017e44af2 ("net: dsa: felix: keep reference on entire tc-taprio config") Reported-by: Colin Foster <colin.foster@in-advantage.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Colin Foster <colin.foster@in-advantage.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Link: https://lore.kernel.org/r/20220704190241.1288847-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
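The shim approach from "net: sched: provide shim definitions for taprio_offload_{get,free}" follows a common header pattern: real prototypes when the feature is built, no-op static inline stubs otherwise. The sketch below is a userspace approximation with hypothetical names (`SKETCH_TAPRIO_ENABLED`, `sketch_offload_get`, `sketch_offload_free`); it is not the kernel header.

```c
#include <stddef.h>

/*
 * Hypothetical switch standing in for the taprio config option. It is
 * deliberately left undefined so that the shim branch is compiled.
 */
/* #define SKETCH_TAPRIO_ENABLED 1 */

struct sketch_taprio_offload;   /* opaque to callers */

#ifdef SKETCH_TAPRIO_ENABLED

/* Feature built: the real implementation lives in another file. */
struct sketch_taprio_offload *
sketch_offload_get(struct sketch_taprio_offload *offload);
void sketch_offload_free(struct sketch_taprio_offload *offload);

#else

/* Feature compiled out: shims let callers build; they do nothing. */
static inline struct sketch_taprio_offload *
sketch_offload_get(struct sketch_taprio_offload *offload)
{
    (void)offload;
    return NULL;
}

static inline void sketch_offload_free(struct sketch_taprio_offload *offload)
{
    (void)offload;
}

#endif
```

A caller compiled against the shim branch simply keeps a NULL reference, which matches the commit's remark that the stored pointer remains NULL when taprio is compiled out.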
Jakub Kicinski [Tue, 5 Jul 2022 22:22:28 +0000 (15:22 -0700)]
eth: remove neterion/vxge
The last meaningful change to this driver was made by Jon in 2011.
As much as we'd like to believe that this is because the code is
perfect the chances are nobody is using this hardware.
Because of the size of this driver there is a nontrivial maintenance
cost to keeping this code around, in the last 2 years we're averaging
more than 1 change a month. Some of which require nontrivial review
effort, see commit 877fe9d49b74 ("Revert "drivers/net/ethernet/neterion/vxge:
Fix a use-after-free bug in vxge-main.c"") for example.
Let's try to remove this driver. In general, IMHO, we need to
establish a clear path for shedding dead code. It will be hard
to unless we have some experience trying to delete stuff.
Jakub Kicinski [Tue, 5 Jul 2022 21:42:09 +0000 (14:42 -0700)]
Merge tag 'for-net-2022-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- Fix deadlock when powering on.
* tag 'for-net-2022-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: core: Fix deadlock on hci_power_on_sync.
====================