igc: don't reserve excessive XDP_PACKET_HEADROOM on XSK Rx to skb
{__,}napi_alloc_skb() allocates and reserves additional NET_SKB_PAD
+ NET_IP_ALIGN for any skb.
OTOH, igc_construct_skb_zc() currently allocates and reserves
additional `xdp->data_meta - xdp->data_hard_start`, which is about
XDP_PACKET_HEADROOM for XSK frames.
There's no need for that at all as the frame is post-XDP and will
go only to the networking stack core.
Pass the size of the actual data only (+ meta) to
__napi_alloc_skb() and don't reserve anything. This will give
enough headroom for stack processing.
Also, net_prefetch() xdp->data_meta and align the copy size to
speed-up memcpy() a little and better match igc_construct_skb().
Fixes: fc9df2a0b520 ("igc: Enable RX via AF_XDP zero-copy") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Nechama Kraus <nechamax.kraus@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
For now, if the XDP prog returns XDP_PASS on XSK, the metadata will
be lost as it doesn't get copied to the skb.
Copy it along with the frame headers. Account its size on skb
allocation, and when copying just treat it as a part of the frame
and do a pull after to "move" it to the "reserved" zone.
net_prefetch() xdp->data_meta and align the copy size to speed-up
memcpy() a little and better match ice_construct_skb().
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ice: don't reserve excessive XDP_PACKET_HEADROOM on XSK Rx to skb
{__,}napi_alloc_skb() allocates and reserves additional NET_SKB_PAD
+ NET_IP_ALIGN for any skb.
OTOH, ice_construct_skb_zc() currently allocates and reserves
additional `xdp->data - xdp->data_hard_start`, which is
XDP_PACKET_HEADROOM for XSK frames.
There's no need for that at all as the frame is post-XDP and will
go only to the networking stack core.
Pass the size of the actual data only to __napi_alloc_skb() and
don't reserve anything. This will give enough headroom for stack
processing.
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ice: respect metadata in legacy-rx/ice_construct_skb()
In "legacy-rx" mode represented by ice_construct_skb(), we can
still use XDP (and XDP metadata), but after XDP_PASS the metadata
will be lost as it doesn't get copied to the skb.
Copy it along with the frame headers. Account its size on skb
allocation, and when copying just treat it as a part of the frame
and do a pull after to "move" it to the "reserved" zone.
Point net_prefetch() to xdp->data_meta instead of data. This won't
change anything when the meta is not here, but will save some cache
misses otherwise.
Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
For now, if the XDP prog returns XDP_PASS on XSK, the metadata will
be lost as it doesn't get copied to the skb.
Copy it along with the frame headers. Account its size on skb
allocation, and when copying just treat it as a part of the frame
and do a pull after to "move" it to the "reserved" zone.
net_prefetch() xdp->data_meta and align the copy size to speed-up
memcpy() a little and better match i40e_construct_skb().
Fixes: 0a714186d3c0 ("i40e: add AF_XDP zero-copy Rx support") Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
i40e: don't reserve excessive XDP_PACKET_HEADROOM on XSK Rx to skb
{__,}napi_alloc_skb() allocates and reserves additional NET_SKB_PAD
+ NET_IP_ALIGN for any skb.
OTOH, i40e_construct_skb_zc() currently allocates and reserves
additional `xdp->data - xdp->data_hard_start`, which is
XDP_PACKET_HEADROOM for XSK frames.
There's no need for that at all as the frame is post-XDP and will
go only to the networking stack core.
Pass the size of the actual data only to __napi_alloc_skb() and
don't reserve anything. This will give enough headroom for stack
processing.
Fixes: 0a714186d3c0 ("i40e: add AF_XDP zero-copy Rx support") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
David S. Miller [Mon, 31 Jan 2022 15:08:20 +0000 (15:08 +0000)]
Merge branch 'smc-improvements'
Tony Lu says:
====================
net/smc: Improvements for TCP_CORK and sendfile()
Currently, SMC use default implement for syscall sendfile() [1], which
is wildly used in nginx and big data sences. Usually, applications use
sendfile() with TCP_CORK:
The above is an example of Nginx, when sendfile() on, Nginx first
enables TCP_CORK, write headers, the data will not be sent. Then call
sendfile(), it reads file and write to sndbuf. When TCP_CORK is cleared,
all pending data is sent out.
The performance of the default implement of sendfile is lower than when
it is off. After investigation, it shows two parts to improve:
- unnecessary lock contention of delayed work
- less data per send than when sendfile off
Patch #1 tries to reduce lock_sock() contention in smc_tx_work().
Patch #2 removes timed work for corking, and let applications control
it. See TCP_CORK [2] MSG_MORE [3].
Patch #3 adds MSG_SENDPAGE_NOTLAST for corking more data when
sendfile().
Test environments:
- CPU Intel Xeon Platinum 8 core, mem 32 GiB, nic Mellanox CX4
- socket sndbuf / rcvbuf: 16384 / 131072 bytes
- server: smc_run nginx
- client: smc_run ./wrk -c 100 -t 2 -d 30 http://192.168.100.1:8080/4k.html
- payload: 4KB local disk file
Items QPS
sendfile off 272477.10
sendfile on (orig) 223622.79
sendfile on (this) 395847.21
This benchmark shows +45.28% improvement compared with sendfile off, and
+77.02% compared with original sendfile implement.
Tony Lu [Sun, 30 Jan 2022 18:02:57 +0000 (02:02 +0800)]
net/smc: Cork when sendpage with MSG_SENDPAGE_NOTLAST flag
This introduces a new corked flag, MSG_SENDPAGE_NOTLAST, which is
involved in syscall sendfile() [1], it indicates this is not the last
page. So we can cork the data until the page is not specify this flag.
It has the same effect as MSG_MORE, but existed in sendfile() only.
This patch adds a option MSG_SENDPAGE_NOTLAST for corking data, try to
cork more data before sending when using sendfile(), which acts like
TCP's behaviour. Also, this reimplements the default sendpage to inform
that it is supported to some extent.
Tony Lu [Sun, 30 Jan 2022 18:02:56 +0000 (02:02 +0800)]
net/smc: Remove corked dealyed work
Based on the manual of TCP_CORK [1] and MSG_MORE [2], these two options
have the same effect. Applications can set these options and informs the
kernel to pend the data, and send them out only when the socket or
syscall does not specify this flag. In other words, there's no need to
send data out by a delayed work, which will queue a lot of work.
This removes corked delayed work with SMC_TX_CORK_DELAY (250ms), and the
applications control how/when to send them out. It improves the
performance for sendfile and throughput, and remove unnecessary race of
lock_sock(). This also unlocks the limitation of sndbuf, and try to fill
it up before sending.
Tony Lu [Sun, 30 Jan 2022 18:02:55 +0000 (02:02 +0800)]
net/smc: Send directly when TCP_CORK is cleared
According to the man page of TCP_CORK [1], if set, don't send out
partial frames. All queued partial frames are sent when option is
cleared again.
When applications call setsockopt to disable TCP_CORK, this call is
protected by lock_sock(), and tries to mod_delayed_work() to 0, in order
to send pending data right now. However, the delayed work smc_tx_work is
also protected by lock_sock(). There introduces lock contention for
sending data.
To fix it, send pending data directly which acts like TCP, without
lock_sock() protected in the context of setsockopt (already lock_sock()ed),
and cancel unnecessary dealyed work, which is protected by lock.
[1] https://linux.die.net/man/7/tcp
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 31 Jan 2022 15:05:25 +0000 (15:05 +0000)]
Merge branch 'hash-rethink'
Akhmat Karakotov says:
====================
Make hash rethink configurable
As it was shown in the report by Alexander Azimov, hash rethink at the
client-side may lead to connection timeout toward stateful anycast
services. Tom Herbert created a patchset to address this issue by applying
hash rethink only after a negative routing event (3RTOs) [1]. This change
also affects server-side behavior, which we found undesirable. This
patchset changes defaults in a way to make them safe: hash rethink at the
client-side is disabled and enabled at the server-side upon each RTO
event or in case of duplicate acknowledgments.
This patchset provides two options to change default behaviour. The hash
rethink may be disabled at the server-side by the new sysctl option.
Changes in the sysctl option don't affect default behavior at the
client-side.
Hash rethink can also be enabled/disabled with socket option or bpf
syscalls which ovewrite both default and sysctl settings. This socket
option is available on both client and server-side. This should provide
mechanics to enable hash rethink inside administrative domain, such as DC,
where hash rethink at the client-side can be desirable.
v2:
- Changed sysctl default to ENABLED in all patches. Reduced sysctl
and socket option size to u8. Fixed netns bug reported by kernel
test robot.
v3:
- Fixed bug with bad u8 comparison. Moved sk_txrehash to use less
bytes in struct. Added WRITE_ONCE() in setsockopt in and
READ_ONCE() in tcp_rtx_synack.
v4:
- Rebase and add documentation for sysctl option.
v5:
- Move sk_txrehash out of busy poll ifdef.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Akhmat Karakotov [Mon, 31 Jan 2022 13:31:25 +0000 (16:31 +0300)]
tcp: Change SYN ACK retransmit behaviour to account for rehash
Disabling rehash behavior did not affect SYN ACK retransmits because hash
was forcefully changed bypassing the sk_rethink_hash function. This patch
adds a condition which checks for rehash mode before resetting hash.
Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Akhmat Karakotov [Mon, 31 Jan 2022 13:31:24 +0000 (16:31 +0300)]
bpf: Add SO_TXREHASH setsockopt
Add bpf socket option to override rehash behaviour from userspace or from bpf.
Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Akhmat Karakotov [Mon, 31 Jan 2022 13:31:22 +0000 (16:31 +0300)]
txhash: Add socket option to control TX hash rethink behavior
Add the SO_TXREHASH socket option to control hash rethink behavior per socket.
When default mode is set, sockets disable rehash at initialization and use
sysctl option when entering listen state. setsockopt() overrides default
behavior.
Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Akhmat Karakotov [Mon, 31 Jan 2022 13:31:21 +0000 (16:31 +0300)]
txhash: Make rethinking txhash behavior configurable via sysctl
Add a per ns sysctl that controls the txhash rethink behavior:
net.core.txrehash. When enabled, the same behavior is retained,
when disabled, rethink is not performed. Sysctl is enabled by default.
Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gerhard Engleder [Sun, 30 Jan 2022 09:54:22 +0000 (10:54 +0100)]
selftests/net: timestamping: Fix bind_phc check
timestamping checks socket options during initialisation. For the field
bind_phc of the socket option SO_TIMESTAMPING it expects the value -1 if
PHC is not bound. Actually the value of bind_phc is 0 if PHC is not
bound. This results in the following output:
This is fixed by setting default value and expected value of bind_phc to
0.
Fixes: 2214d7032479 ("selftests/net: timestamping: support binding PHC") Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 31 Jan 2022 11:42:13 +0000 (11:42 +0000)]
Merge branch 'renesas-dead-code'
Sergey Shtylyov says:
====================
Remove some dead code in the Renesas Ethernet drivers
Here are 2 patches against DaveM's 'net-next.git' repo. The Renesas drivers
call their ndo_stop() methods directly and they always return 0, making the
result checks pointless, hence remove them...
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Shtylyov [Sat, 29 Jan 2022 11:55:17 +0000 (14:55 +0300)]
sh_eth: sh_eth_close() always returns 0
sh_eth_close() always returns 0, hence the check in sh_eth_wol_restore()
is pointless (however we cannot change the prototype of sh_eth_close() as
it implements the driver's ndo_stop() method).
Found by Linux Verification Center (linuxtesting.org) with the SVACE static
analysis tool.
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Shtylyov [Sat, 29 Jan 2022 11:55:16 +0000 (14:55 +0300)]
ravb: ravb_close() always returns 0
ravb_close() always returns 0, hence the check in ravb_wol_restore() is
pointless (however, we cannot change the prototype of ravb_close() as it
implements the driver's ndo_stop() method).
Found by Linux Verification Center (linuxtesting.org) with the SVACE static
analysis tool.
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Sat, 29 Jan 2022 01:27:02 +0000 (01:27 +0000)]
net/fsl: xgmac_mdio: fix return value check in xgmac_mdio_probe()
In case of error, the function devm_ioremap() returns NULL pointer
not ERR_PTR(). The IS_ERR() test in the return value check should
be replaced with NULL test.
Fixes: 1d14eb15dc2c ("net/fsl: xgmac_mdio: Use managed device resources") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Fri, 28 Jan 2022 23:53:47 +0000 (16:53 -0700)]
ipv4: Make ip_idents_reserve static
ip_idents_reserve is only used in net/ipv4/route.c. Make it static
and remove the export.
Signed-off-by: David Ahern <dsahern@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Fri, 28 Jan 2022 20:41:42 +0000 (21:41 +0100)]
r8169: add rtl_disable_exit_l1()
Add rtl_disable_exit_l1() for ensuring that the chip doesn't
inadvertently exit ASPM L1 when being in a low-power mode.
The new function is called from rtl_prepare_power_down() which
has to be moved in the code to avoid a forward declaration.
According to Realtek OCP register 0xc0ac shadows ERI register 0xd4
on RTL8168 versions from RTL8168g. This allows to simplify the
code a little.
v2:
- call rtl_disable_exit_l1() also if DASH or WoL are enabled
Suggested-by: Chun-Hao Lin <hau@realtek.com> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Shtylyov [Fri, 28 Jan 2022 18:32:40 +0000 (21:32 +0300)]
phy: make phy_set_max_speed() *void*
After following the call tree of phy_set_max_speed(), it became clear
that this function never returns anything but 0, so we can change its
result type to *void* and drop the result checks from the three drivers
that actually bothered to do it...
Found by Linux Verification Center (linuxtesting.org) with the SVACE static
analysis tool.
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
The individual patches have all the details. This work was triggered
by recent work on a platform that took 16s (sic) to load the mv88e6xxx
module.
The first patch gets rid of most of that time by replacing a very long
delay with a tighter poll loop to wait for the busy bit to clear.
The second patch shaves off some more time by avoiding redundant
busy-bit-checks, saving 1 out of 4 MDIO operations for every register
read/write in the optimal case.
v1 -> v2:
- Make sure that we always poll the busy bit at least twice, in the
unlikely event that the first one is quick to query the hardware,
but is then scheduled out for a long time before the timeout is
checked.
v2 -> v3:
- Fallback to the longer sleeps after the initial two poll attempts.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this change, both the read and write callback would start out
by asserting that the chip's busy flag was cleared. However, both
callbacks also made sure to wait for the clearing of the busy bit
before returning - making the initial check superfluous. The only
time that would ever have an effect was if the busy bit was initially
set for some reason.
With that in mind, make sure to perform an initial check of the busy
bit, after which both read and write can rely the previous operation
to have waited for the bit to clear.
This cuts the number of operations on the underlying MDIO bus by 25%
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: mv88e6xxx: Improve performance of busy bit polling
Avoid a long delay when a busy bit is still set and has to be polled
again.
Measurements on a system with 2 Opals (6097F) and one Agate (6352)
show that even with this much tighter loop, we have about a 50% chance
of the bit being cleared on the first poll, all other accesses see the
bit being cleared on the second poll.
On a standard MDIO bus running MDC at 2.5MHz, a single access with 32
bits of preamble plus 32 bits of data takes 64*(1/2.5MHz) = 25.6us.
This means that mv88e6xxx_smi_direct_wait took 26us + CPU overhead in
the fast scenario, but 26us + 1500us + 26us + CPU overhead in the slow
case - bringing the average close to 1ms.
With this change in place, the slow case is closer to 2*26us + CPU
overhead, with the average well below 100us - a 10x improvement.
This translates to real-world winnings. On a 3-chip 20-port system,
the modprobe time drops by 88%:
Before:
root@coronet:~# time modprobe mv88e6xxx
real 0m 15.99s
user 0m 0.00s
sys 0m 1.52s
After:
root@coronet:~# time modprobe mv88e6xxx
real 0m 2.21s
user 0m 0.00s
sys 0m 1.54s
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Sun Shouxin [Fri, 28 Jan 2022 14:44:42 +0000 (09:44 -0500)]
net: bonding: Add support for IPV6 ns/na to balance-alb/balance-tlb mode
Since ipv6 neighbor solicitation and advertisement messages
isn't handled gracefully in bond6 driver, we can see packet
drop due to inconsistency between mac address in the option
message and source MAC .
Another examples is ipv6 neighbor solicitation and advertisement
messages from VM via tap attached to host bridge, the src mac
might be changed through balance-alb mode, but it is not synced
with Link-layer address in the option message.
The patch implements bond6's tx handle for ipv6 neighbor
solicitation and advertisement messages.
Suggested-by: Hu Yadi <huyd12@chinatelecom.cn> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Sun Shouxin <sunshouxin@chinatelecom.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 29 Jan 2022 17:49:21 +0000 (17:49 +0000)]
Merge branch 'Cadence-ZyncMP-SGMII'
Robert Hancock says:
====================
Cadence MACB/GEM support for ZynqMP SGMII
Changes to allow SGMII mode to work properly in the GEM driver on the
Xilinx ZynqMP platform.
Changes since v3:
-more code formatting and error handling fixes
Changes since v2:
-fixed missing includes in DT binding example
-fixed phy_init and phy_power_on error handling/cleanup, moved
phy_power_on to open rather than probe
Changes since v1:
-changed order of controller reset and PHY init as per suggestion
-switched device reset to be optional
-updated bindings doc patch for switch to YAML
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 27 Jan 2022 16:37:36 +0000 (10:37 -0600)]
arm64: dts: zynqmp: Added GEM reset definitions
The Cadence GEM/MACB driver now utilizes the platform-level reset on the
ZynqMP platform. Add reset definitions to the ZynqMP platform device
tree to allow this to be used.
Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 27 Jan 2022 16:37:35 +0000 (10:37 -0600)]
net: macb: Added ZynqMP-specific initialization
The GEM controllers on ZynqMP were missing some initialization steps which
are required in some cases when using SGMII mode, which uses the PS-GTR
transceivers managed by the phy-zynqmp driver.
The GEM core appears to need a hardware-level reset in order to work
properly in SGMII mode in cases where the GT reference clock was not
present at initial power-on. This can be done using a reset mapped to
the zynqmp-reset driver in the device tree.
Also, when in SGMII mode, the GEM driver needs to ensure the PHY is
initialized and powered on.
Signed-off-by: Robert Hancock <robert.hancock@calian.com> Reviewed-by: Claudiu Beznea <claudiu.beznea@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 27 Jan 2022 16:37:34 +0000 (10:37 -0600)]
dt-bindings: net: cdns,macb: added generic PHY and reset mappings for ZynqMP
Updated macb DT binding documentation to reflect the phy-names, phys,
resets, reset-names properties which are now used with ZynqMP GEM
devices, and added a ZynqMP-specific DT example.
Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 28 Jan 2022 21:39:04 +0000 (13:39 -0800)]
Merge tag 'for-net-next-2022-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
- Add support for RTL8822C hci_ver 0x08
- Add support for RTL8852AE part 0bda:2852
- Fix WBS setting for Intel legacy ROM products
- Enable SCO over I2S ib mt7921s
- Increment management interface revision
* tag 'for-net-next-2022-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (30 commits)
Bluetooth: Increment management interface revision
Bluetooth: hci_sync: Fix queuing commands when HCI_UNREGISTER is set
Bluetooth: hci_h5: Add power reset via gpio in h5_btrtl_open
Bluetooth: btrtl: Add support for RTL8822C hci_ver 0x08
Bluetooth: hci_event: Fix HCI_EV_VENDOR max_len
Bluetooth: hci_core: Rate limit the logging of invalid SCO handle
Bluetooth: hci_event: Ignore multiple conn complete events
Bluetooth: msft: fix null pointer deref on msft_monitor_device_evt
Bluetooth: btmtksdio: mask out interrupt status
Bluetooth: btmtksdio: run sleep mode by default
Bluetooth: btmtksdio: lower log level in btmtksdio_runtime_[resume|suspend]()
Bluetooth: mt7921s: fix btmtksdio_[drv|fw]_pmctrl()
Bluetooth: mt7921s: fix bus hang with wrong privilege
Bluetooth: btmtksdio: refactor btmtksdio_runtime_[suspend|resume]()
Bluetooth: mt7921s: fix firmware coredump retrieve
Bluetooth: hci_serdev: call init_rwsem() before p->open()
Bluetooth: Remove kernel-doc style comment block
Bluetooth: btusb: Whitespace fixes for btusb_setup_csr()
Bluetooth: btusb: Add one more Bluetooth part for the Realtek RTL8852AE
Bluetooth: btintel: Fix WBS setting for Intel legacy ROM products
...
====================
Jisheng Zhang [Fri, 28 Jan 2022 14:52:13 +0000 (22:52 +0800)]
net: stmmac: dwmac-sun8i: make clk really gated during rpm suspended
Currently, the dwmac-sun8i's stmmaceth clk isn't disabled even if the
the device has been runtime suspended. The reason is the driver gets
the "stmmaceth" clk as tx_clk and enabling it during probe. But
there's no other usage of tx_clk except preparing and enabling, so
we can remove tx_clk and its usage then rely on the common routine
stmmac_probe_config_dt() to prepare and enable the stmmaceth clk
during driver initialization, and benefit from the runtime pm feature
after probed.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 28 Jan 2022 15:02:50 +0000 (15:02 +0000)]
Merge branch 'dsa-realtek-MDIO'
Luiz Angelo Daros de Luca says:
====================
net: dsa: realtek: MDIO interface and RTL8367S,RTL8367RB-VB
The old realtek-smi driver was linking subdrivers into a single
realtek-smi.ko After this series, each subdriver will be an independent
module required by either realtek-smi (platform driver) or the new
realtek-mdio (mdio driver). Both interface drivers (SMI or MDIO) are
independent, and they might even work side-by-side, although it will be
difficult to find such device. The subdriver can be individually
selected but only at buildtime, saving some storage space for custom
embedded systems.
Existing realtek-smi devices continue to work untouched during the
tests. The realtek-smi was moved into a realtek subdirectory, but it
normally does not break things.
I couldn't identify a fixed relation between port numbers (0..9) and
external interfaces (0..2), and I'm not sure if it is fixed for each
chip version or a device configuration. Until there is more info about
it, there is a new port property "realtek,ext-int" that can inform the
external interface.
The rtl8365mb might now handle multiple CPU ports and extint ports not
used as CPU ports. RTL8367S has an SGMII external interface, but my test
device (TP-Link Archer C5v4) uses only the second RGMII interface. We
need a test device with more external ports to test these features.
The driver still cannot handle SGMII ports.
RTL8367RB-VB support was added using information from Frank Wunderlich
<frank-w@public-files.de> but I didn't test it myself.
The rtl8365mb was tested with a MDIO-connected RTL8367S (TP-Link Acher
C5v4) and a SMI-connected RTL8365MB-VC switch (Asus RT-AC88U)
The rtl8366rb subdriver was not tested with this patch series, but it
was only slightly touched. It would be nice to test it, especially in an
MDIO-connected switch.
Best,
Luiz
Changelog:
v1-v2)
- formatting fixes
- dropped the rtl8365mb->rtl8367c rename
- other suggestions
v2-v3)
* realtek-mdio.c:
- cleanup realtek-mdio.c (BUG_ON, comments and includes)
- check devm_regmap_init return code
- removed realtek,rtl8366s string from realtek-mdio
* realtek-smi.c:
- removed void* type cast
* rtl8365mb.c:
- using macros to identify EXT interfaces
- rename some extra extport->extint cases
- allow extint as non cpu (not tested)
- allow multple cpu ports (not tested)
- dropped cpu info from struct rtl8365mb
* dropped dt-bindings changes (dealing outside this series)
* formatting issues fixed
v3-v4)
* fix cover message numbering 0/13 -> 0/11
* use static for realtek_mdio_read_reg
- Reported-by: kernel test robot <lkp@intel.com>
* use dsa_switch_for_each_cpu_port
* mention realtek_smi_{variant,ops} to realtek_{variant,ops}
in commit message
v5) sent again v4 branch. Sorry
v4-v6)
- added support for RTL8367RB-VB
- cleanup mdio_{read,write}, removing misterious START_OP, checking and
returning errors
- renamed priv->phy_id to priv->mdio_addr
- duplicated priv->ds_ops into ds_ops_{smi,mdio}. ds_ops_smi must not
set
phy_read or else both dsa and this driver might free slave_mii.
Dropped 401fd75c92f37
- Map port to extint using code instead of device-tree property. Added
comment
about port number, port description and external interfaces. Dropped
'realtek,ext-int' device-tree property
- Redacted the non-cpu ext port commit message, not highlighting the
possibility of using multiple CPU ports as it was just a byproduct.
- In a possible case of multiple cpu ports, use the first one as the
trap port.
Dropped 'realtek,trap-port' device-tree property
- Some formatting fixes
- BUG: rtl8365mb_phy_mode_supported was still checking for a cpu port
and not
an external interface
- BUG: fix trapdoor masking for port>7. Got a compiler error with a
bigger
constant value
- WARN: completed kdoc for rtl8366rb_drop_untagged()
- WARN: removed marks from incomplete kdoc
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Trap door number is a 4-bit number divided in two regions (3 and 1-bit).
Both values were not masked properly. This bug does not affect supported
devices as they use up to port 7 (ext2). It would only be a problem if
the driver becomes compatible with 10-port switches like RTL8370MB and
RTL8310SR.
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
External interfaces can be configured, even if they are not CPU ports.
The first CPU port will also be the trap port (for receiving trapped
frames from the switch).
The CPU information was dropped from chip data as it was not used
outside setup. The only other place it was used is when it wrongly
checks for CPU port when it should check for extint.
The supported modes check now uses port type and not port usage.
As a byproduct, more than one CPU can be configured. although this
might not work well with DSA setups. Also, this driver is still only
blindly forwarding all traffic to CPU port(s).
This change was not tested in a device with multiple active external
interfaces ports.
realtek_priv->cpu_port is now only used by rtl8366rb.c
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: realtek: rtl8365mb: add RTL8367S support
Realtek's RTL8367S, a 5+2 port 10/100/1000M Ethernet switch.
It shares the same driver family (RTL8367C) with other models
as the RTL8365MB-VC. Its compatible string is "realtek,rtl8367s".
It was tested only with MDIO interface (realtek-mdio), although it might
work out-of-the-box with SMI interface (using realtek-smi).
This patch was based on an unpublished patch from Alvin Å ipraga
<alsi@bang-olufsen.dk>.
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Tested-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of a fixed CPU port, assume that DSA is correct.
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Tested-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: realtek: rtl8365mb: rename extport to extint
"extport" 0, 1, 2 was used to reference external ports id (ext0, ext1,
ext2). Meanwhile, port 0..9 is used as switch ports, including external
ports. "extport" was renamed to extint to make it clear it does not mean
the port number but the external interface number id.
The macros that map extint numbers to registers addresses now use inline
ifs instead of binary arithmetic.
Realtek uses in docs and drivers EXT_PORT0 (GMAC1) and EXT_PORT1
(GMAC2), with EXT_PORT0 being converted to ext_id == 1 and so on. It
might introduce some confusing while reading datasheets but it will not
be exposed to users.
"extint" was hardcoded to 1. However, some chips have multiple external
interfaces. It's not right to assume the CPU port uses extint 1 nor that
all extint are CPU ports. Now it came from a map between port number and
external interface id number.
This patch still does not allow multiple CPU ports nor extint as a non
CPU port.
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: realtek: add new mdio interface for drivers
This driver is a mdio_driver instead of a platform driver (like
realtek-smi).
ds_ops was duplicated for smi and mdio usage as mdio interfaces uses
phy_{read,write} in ds_ops and the presence of phy_read is incompatible
with external slave_mii_bus allocation.
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Tested-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: realtek: remove direct calls to realtek-smi
Remove the only two direct calls from subdrivers to realtek-smi.
Now they are called from realtek_priv. Subdrivers can now be
linked independently from realtek-smi.
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Reviewed-by: Alvin Å ipraga <alsi@bang-olufsen.dk> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: realtek: rename realtek_smi to realtek_priv
In preparation to adding other interfaces, the private data structure
was renamed to priv. Also, realtek_smi_variant and realtek_smi_ops
were renamed to realtek_variant and realtek_ops as those structs are
not SMI specific.
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Tested-by: Arınç ÜNAL <arinc.unal@arinc9.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
octeontx2-pf: Change receive buffer size using ethtool
ethtool rx-buf-len is for setting receive buffer size,
support setting it via ethtool -G parameter and getting
it via ethtool -g parameter.
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
TCP ZC Rx requires data to be placed neatly into pages, separate
from the networking headers. This is not supported by most devices
so to make deployment easy this set adds a way for the driver to
report support for this feature thru ethtool.
The larger scope of configuring splitting headers and data, or DMA
scatter seems dauntingly broad, so this set focuses specifically
on the question "is this device usable with TCP ZC Rx?".
The aim is to avoid a litany of conditions on HW platforms, features,
and firmware versions in orchestration systems when the drivers can
easily tell their SG config.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 27 Jan 2022 18:42:59 +0000 (10:42 -0800)]
ethtool: add header/data split indication
For applications running on a mix of platforms it's useful
to have a clear indication whether host's NIC supports the
geometry requirements of TCP zero-copy. TCP zero-copy Rx
requires data to be neatly placed into memory pages.
Most NICs can't do that.
This patch is adding GET support only, since the NICs
I work with either always have the feature enabled or
enable it whenever MTU is set to jumbo. In other words
I don't need SET. But adding set should be trivial.
(The only note on SET is that we will likely want
the setting to be "sticky" and use 0 / `unknown`
to reset it back to driver default.)
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
The reference clock output from the KSZ9477 and related Microchip
switch devices is not required on all board designs. Add a device
tree property to disable it for power and EMI reasons.
Changes since v3:
-rework some code for simplicity
Changes since v2:
-check for conflicting options in DT, added note in bindings doc
Changes since v1:
-added Acked-by on patch 1, rebase to net-next
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 27 Jan 2022 16:41:56 +0000 (10:41 -0600)]
net: dsa: microchip: Add property to disable reference clock
Add a new microchip,synclko-disable property which can be specified
to disable the reference clock output from the device if not required
by the board design.
Signed-off-by: Robert Hancock <robert.hancock@calian.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 27 Jan 2022 16:41:55 +0000 (10:41 -0600)]
net: dsa: microchip: Document property to disable reference clock
Document the new microchip,synclko-disable property which can be
specified to disable the reference clock output from the device if not
required by the board design.
Signed-off-by: Robert Hancock <robert.hancock@calian.com> Acked-by: Rob Herring <robh@kernel.org> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lorenzo Bianconi [Thu, 27 Jan 2022 14:47:49 +0000 (15:47 +0100)]
net: mvneta: remove unnecessary if condition in mvneta_xdp_submit_frame
Get rid of unnecessary if check on tx_desc pointer in
mvneta_xdp_submit_frame routine since num_frames is always greater than
0 and tx_desc pointer is always initialized.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Convert sparx5 to use the mac_select_interface rather than using
phylink_set_pcs(). The intention here is to unify the approach for
PCS and eventually remove phylink_set_pcs().
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 28 Jan 2022 03:46:13 +0000 (19:46 -0800)]
Merge branch 'udp-ipv6-optimisations'
Pavel Begunkov says:
====================
udp/ipv6 optimisations
Shed some weight from udp/ipv6. Zerocopy benchmarks over dummy showed
~5% tx/s improvement, should be similar for small payload non-zc
cases.
The performance comes from killing 4 atomics and a couple of big struct
memcpy/memset. 1/10 removes a pair of atomics on dst refcounting for
cork->skb setup, 9/10 saves another pair on cork init. 5/10 and 8/10
kill extra 88B memset and memcpy respectively.
====================
Pavel Begunkov [Thu, 27 Jan 2022 00:36:31 +0000 (00:36 +0000)]
ipv6: partially inline ipv6_fixup_options
Inline a part of ipv6_fixup_options() to avoid extra overhead on
function call if opt is NULL.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Thu, 27 Jan 2022 00:36:30 +0000 (00:36 +0000)]
ipv6: optimise dst refcounting on cork init
udpv6_sendmsg() doesn't need dst after calling ip6_make_skb(), so
instead of taking an additional reference inside ip6_setup_cork()
and releasing the initial one afterwards, we can hand over a reference
into ip6_make_skb() saving two atomics. The only other user of
ip6_setup_cork() is ip6_append_data() and it requires an extra
dst_hold().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Thu, 27 Jan 2022 00:36:29 +0000 (00:36 +0000)]
udp6: don't make extra copies of iflow
udpv6_sendmsg() first initialises an on-stack 88B struct flowi6 and then
copies it into cork, which is expensive. Avoid the copy in corkless case
by initialising on-stack cork->fl directly.
The main part is a couple of lines under !corkreq check. The rest
converts fl6 variable to be a pointer.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Thu, 27 Jan 2022 00:36:28 +0000 (00:36 +0000)]
udp6: pass flow in ip6_make_skb together with cork
Another preparation patch. inet_cork_full already contains a field for
iflow, so we can avoid passing a separate struct iflow6 into
__ip6_append_data() and ip6_make_skb(), and use the flow stored in
inet_cork_full. Make sure callers set cork->fl, i.e. we init it in
ip6_append_data() and before calling ip6_make_skb().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Thu, 27 Jan 2022 00:36:27 +0000 (00:36 +0000)]
ipv6: pass full cork into __ip6_append_data()
Convert a struct inet_cork argument in __ip6_append_data() to struct
inet_cork_full. As one struct contains another inet_cork is still can
be accessed via ->base field. It's a preparation patch making further
changes a bit cleaner.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Thu, 27 Jan 2022 00:36:26 +0000 (00:36 +0000)]
ipv6: don't zero inet_cork_full::fl after use
It doesn't appear there is any reason for ip6_cork_release() to zero
cork->fl, it'll be fully filled on next initialisation. This 88 bytes
memset accounts to 0.3-0.5% of total CPU cycles.
It's also needed in following patches and allows to remove an extar flow
copy in udp_v6_push_pending_frames().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Thu, 27 Jan 2022 00:36:25 +0000 (00:36 +0000)]
ipv6: clean up cork setup/release
Clean up ip6_setup_cork() and ip6_cork_release() adding a local variable
for v6_cork->opt. It's a preparation patch for further changes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Thu, 27 Jan 2022 00:36:24 +0000 (00:36 +0000)]
ipv6: remove daddr temp buffer in __ip6_make_skb
ipv6_push_nfrag_opts() doesn't change passed daddr, and so
__ip6_make_skb() doesn't actually need to keep an on-stack copy of
fl6->daddr. Set initially final_dst to fl6->daddr,
ipv6_push_nfrag_opts() will override it if needed, and get rid of extra
copies.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Thu, 27 Jan 2022 00:36:23 +0000 (00:36 +0000)]
udp6: shuffle up->pending AF_INET bits
Corked AF_INET for ipv6 socket doesn't appear to be the hottest case,
so move it out of the common path under up->pending check to remove
overhead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pavel Begunkov [Thu, 27 Jan 2022 00:36:22 +0000 (00:36 +0000)]
ipv6: optimise dst refcounting on skb init
__ip6_make_skb() gets a cork->dst ref, hands it over to skb and shortly
after puts cork->dst. Save two atomics by stealing it without extra
referencing, ip6_cork_release() handles NULL cork->dst.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Thu, 27 Jan 2022 09:02:26 +0000 (11:02 +0200)]
mlxsw: spectrum_acl: Allocate default actions for internal TCAM regions
In Spectrum-2 and later ASICs, each TCAM region has a default action
that is executed in case a packet did not match any rule in the region.
The location of the action in the database (KVDL) is computed by adding
the region's index to a base value.
Some TCAM regions are not exposed to the host and used internally by the
device. Allocate KVDL entries for the default actions of these regions
to avoid the host from overwriting them.
With mlxsw, lookups in the internal regions are not currently performed,
but it is a good practice not to overwrite their default actions.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Amit Cohen [Thu, 27 Jan 2022 09:02:25 +0000 (11:02 +0200)]
mlxsw: spectrum: Guard against invalid local ports
When processing events generated by the device's firmware, the driver
protects itself from events reported for non-existent local ports, but
not for the CPU port (local port 0), which exists, but does not have all
the fields as any local port.
This can result in a NULL pointer dereference when trying access
'struct mlxsw_sp_port' fields which are not initialized for CPU port.
Commit 63b08b1f6834 ("mlxsw: spectrum: Protect driver from buggy firmware")
already handled such issue by bailing early when processing a PUDE event
reported for the CPU port.
Generalize the approach by moving the check to a common function and
making use of it in all relevant places.
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jiri Pirko [Thu, 27 Jan 2022 09:02:24 +0000 (11:02 +0200)]
mlxsw: core: Consolidate trap groups to a single event group
For event traps which are used in core, avoid having a separate trap
group for each event. Instead of that introduce a single core event trap
group and use it for all event traps.
Jiri Pirko [Thu, 27 Jan 2022 09:02:23 +0000 (11:02 +0200)]
mlxsw: core: Move functions to register/unregister array of traps to core.c
These functions belong to core.c alongside the functions that
register/unregister a single trap. Move it there. Make the functions
possibly usable by other parts of mlxsw code.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 28 Jan 2022 03:10:25 +0000 (19:10 -0800)]
Merge tag 'mlx5-updates-2022-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2022-01-27
1) Dima, adds an internal mlx5 steering callback per steering provider
(FW vs SW steering), to advertise steering capabilities implemented by
each module, this helps upper modules in mlx5 to know what is
supported and what's not without the need to tell what is the underlying
steering mode.
2nd patch is the usecase where this interface is used to implement
Vlan Push/pop for uplink with SW steering, where in FW mode it's not
supported yet.
2) Roi Dayan improves code readability and maintainability
as preparation step for multi attribute instance per flow
in mlx5 TC module
Currently the mlx5_flow object contains a single mlx5_attr instance.
However, multi table actions (e.g. CT) instantiate multiple attr instances.
This is a refactoring series in a preparation to support multiple
attribute instances per flow.
The commits prepare functions to get attr instance instead of using
flow->attr and also using attr->flags if the flag is more relevant
to be attr flag and not a flow flag considering there will be multiple
attr instances. i.e. CT and SAMPLE flags.
* tag 'mlx5-updates-2022-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: VLAN push on RX, pop on TX
net/mlx5: Introduce software defined steering capabilities
net/mlx5: Remove unused TIR modify bitmask enums
net/mlx5e: CT, Remove redundant flow args from tc ct calls
net/mlx5e: TC, Store mapped tunnel id on flow attr
net/mlx5e: Test CT and SAMPLE on flow attr
net/mlx5e: Refactor eswitch attr flags to just attr flags
net/mlx5e: CT, Don't set flow flag CT for ct clear flow
net/mlx5e: TC, Hold sample_attr on stack instead of pointer
net/mlx5e: TC, Reject rules with multiple CT actions
net/mlx5e: TC, Refactor mlx5e_tc_add_flow_mod_hdr() to get flow attr
net/mlx5e: TC, Pass attr to tc_act can_offload()
net/mlx5e: TC, Split pedit offloads verify from alloc_tc_pedit_action()
net/mlx5e: TC, Move pedit_headers_action to parse_attr
net/mlx5e: Move counter creation call to alloc_flow_attr_counter()
net/mlx5e: Pass attr arg for attaching/detaching encaps
net/mlx5e: Move code chunk setting encap dests into its own function
====================
Dima Chumak [Mon, 13 Dec 2021 11:21:46 +0000 (13:21 +0200)]
net/mlx5: VLAN push on RX, pop on TX
Some older NIC hardware isn't capable of doing VLAN push on RX and pop
on TX.
A workaround has been added in software to support it, but it has a
performance penalty since it requires a hairpin + loopback.
There's no such limitation with the newer NICs, so no need to pay the
price of the w/a. With this change the software w/a is disabled for
certain HW versions and steering modes that support it.
Dima Chumak [Sun, 21 Nov 2021 21:45:12 +0000 (23:45 +0200)]
net/mlx5: Introduce software defined steering capabilities
There are two different internal steering modes, abstracted from the
rest of the driver. In order to keep upper layer of the driver agnostic
to the differences in capabilities of the steering modes, this patch
introduces mlx5_fs_get_capabilities() API to check if a certain software
defined capability is supported. It differs from the capabilities
exposed by the hardware, as it takes into account the flow steering mode
(SMFS/DMFS) currently enabled.
This implementation supports only two capability flags:
They map to DR_ACTION_STATE_PUSH_VLAN and DR_ACTION_STATE_POP_VLAN
actions, implemented in SW steering earlier in commit f5e22be534e0
("net/mlx5: DR, Split modify VLAN state to separate pop/push states").
Which enables using of pop/push vlan without restrictions, e.g. doing
vlan pop on TX and RX, compared to FW steering that supports only vlan
pop on RX and push on TX.
Roi Dayan [Wed, 15 Dec 2021 13:37:27 +0000 (15:37 +0200)]
net/mlx5e: Test CT and SAMPLE on flow attr
Currently the mlx5_flow object contains a single mlx5_attr instance.
However, multi table actions (e.g. CT) instantiate multiple attr instances.
Prepare for multiple attr instances by testing for CT or SAMPLE flag on attr
flags instead of flow flag.
Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Chris Mi <cmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Sun, 19 Dec 2021 09:31:01 +0000 (11:31 +0200)]
net/mlx5e: Refactor eswitch attr flags to just attr flags
The flags are flow attrs and not esw specific attr flags.
Refactor to remove the esw prefix and move from eswitch.h
to en_tc.h where struct mlx5_flow_attr exists.
Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Wed, 15 Dec 2021 08:48:36 +0000 (10:48 +0200)]
net/mlx5e: CT, Don't set flow flag CT for ct clear flow
ct clear action is a normal flow with a modify header for registers to
0. there is no need for any special handling in tc_ct.c.
Parsing of ct clear action still allocates mod acts to set 0 on the
registers and the driver continue to add a normal rule with modify hdr
context.
Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Sun, 5 Dec 2021 13:10:35 +0000 (15:10 +0200)]
net/mlx5e: TC, Hold sample_attr on stack instead of pointer
In later commit we are going to instantiate multiple attr instances
for flow instead of single attr.
Parsing TC sample allocates a new memory but there is no symmetric
cleanup in the infrastructure.
To avoid asymmetric alloc/free use sample_attr as part of the flow attr
and not allocated and held as a pointer.
This will avoid a cleanup leak when sample action is not on the first
attr.
Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>