]> git.apps.os.sepia.ceph.com Git - ceph-client.git/log
ceph-client.git
3 years agonet/sched: flower: Reduce identation after is_key_vlan refactoring
Boris Sukholitko [Tue, 19 Apr 2022 08:14:31 +0000 (11:14 +0300)]
net/sched: flower: Reduce identation after is_key_vlan refactoring

Whitespace only.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet/sched: flower: Helper function for vlan ethtype checks
Boris Sukholitko [Tue, 19 Apr 2022 08:14:30 +0000 (11:14 +0300)]
net/sched: flower: Helper function for vlan ethtype checks

There are somewhat repetitive ethertype checks in fl_set_key. Refactor
them into is_vlan_key helper function.

To make the changes clearer, avoid touching identation levels. This is
the job for the next patch in the series.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoar5523: Use kzalloc instead of kmalloc/memset
Haowen Bai [Tue, 19 Apr 2022 01:37:31 +0000 (09:37 +0800)]
ar5523: Use kzalloc instead of kmalloc/memset

Use kzalloc rather than duplicating its implementation, which
makes code simple and easy to understand.

Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: dsa: realtek: remove realtek,rtl8367s string
Luiz Angelo Daros de Luca [Mon, 18 Apr 2022 23:35:58 +0000 (20:35 -0300)]
net: dsa: realtek: remove realtek,rtl8367s string

There is no need to add new compatible strings for each new supported
chip version. The compatible string is used only to select the subdriver
(rtl8365mb.c or rtl8366rb.c). Once in the subdriver, it will detect the
chip model by itself, ignoring which compatible string was used.

Link: https://lore.kernel.org/netdev/20220414014055.m4wbmr7tdz6hsa3m@bang-olufsen.dk/
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agodt-bindings: net: dsa: realtek: cleanup compatible strings
Luiz Angelo Daros de Luca [Mon, 18 Apr 2022 23:35:57 +0000 (20:35 -0300)]
dt-bindings: net: dsa: realtek: cleanup compatible strings

Compatible strings are used to help the driver find the chip ID/version
register for each chip family. After that, the driver can setup the
switch accordingly. Keep only the first supported model for each family
as a compatible string and reference other chip models in the
description.

The removed compatible strings have never been used in a released kernel.

CC: devicetree@vger.kernel.org
Link: https://lore.kernel.org/netdev/20220414014055.m4wbmr7tdz6hsa3m@bang-olufsen.dk/
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'hns3-next'
David S. Miller [Wed, 20 Apr 2022 09:45:51 +0000 (10:45 +0100)]
Merge branch 'hns3-next'

Guangbin Huang says:

====================
net: hns3: updates for -next

This series includes some updates for the HNS3 ethernet driver.

Change logs:
V1 -> V2:
 - Fix failed to apply to net-next problem.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: hns3: remove unnecessary line wrap for hns3_set_tunable
Hao Chen [Tue, 19 Apr 2022 03:27:09 +0000 (11:27 +0800)]
net: hns3: remove unnecessary line wrap for hns3_set_tunable

Remove unnecessary line wrap for hns3_set_tunable to improve
function readability.

Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: hns3: replace magic value by HCLGE_RING_REG_OFFSET
Peng Li [Tue, 19 Apr 2022 03:27:08 +0000 (11:27 +0800)]
net: hns3: replace magic value by HCLGE_RING_REG_OFFSET

Magic values are not recommended.

Signed-off-by: Peng Li<lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: hns3: fix the wrong words in comments
Peng Li [Tue, 19 Apr 2022 03:27:07 +0000 (11:27 +0800)]
net: hns3: fix the wrong words in comments

This patch fixes wrong words in comments.

Signed-off-by: Peng Li<lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: hns3: update the comment of function hclgevf_get_mbx_resp
Peng Li [Tue, 19 Apr 2022 03:27:06 +0000 (11:27 +0800)]
net: hns3: update the comment of function hclgevf_get_mbx_resp

The param of function hclgevf_get_mbx_resp has been changed but the
comments not upodated. This patch updates it.

Signed-off-by: Peng Li<lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: hns3: add log for setting tx spare buf size
Hao Chen [Tue, 19 Apr 2022 03:27:05 +0000 (11:27 +0800)]
net: hns3: add log for setting tx spare buf size

For the active tx spare buffer size maybe changed according
to the page size, so add log to notice it.

Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: hns3: add failure logs in hclge_set_vport_mtu
Jie Wang [Tue, 19 Apr 2022 03:27:04 +0000 (11:27 +0800)]
net: hns3: add failure logs in hclge_set_vport_mtu

Currently, There is a low probability that pf mtu configuration fails, but
the information in logs is insufficient for problem locating when the VF
mtu value is illegally modified.

So record the vf index and vf mtu value at the failure scenario.

Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: hns3: refine the definition for struct hclge_pf_to_vf_msg
Jian Shen [Tue, 19 Apr 2022 03:27:03 +0000 (11:27 +0800)]
net: hns3: refine the definition for struct hclge_pf_to_vf_msg

The struct hclge_pf_to_vf_msg is used for mailbox message from
PF to VF, including both response and request. But its definition
can only indicate respone, which makes the message data copy in
function hclge_send_mbx_msg() unreadable. So refine it by edding
a general message definition into it.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: hns3: refactor hns3_set_ringparam()
Hao Chen [Tue, 19 Apr 2022 03:27:02 +0000 (11:27 +0800)]
net: hns3: refactor hns3_set_ringparam()

Use struct hns3_ring_param to replace variable new/old_xxx and
add hns3_is_ringparam_changed() to judge them if is changed to
improve code readability.

Signed-off-by: Hao Chen <chenhao288@hisilicon.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: hns3: add ethtool parameter check for CQE/EQE mode
Yufeng Mo [Tue, 19 Apr 2022 03:27:01 +0000 (11:27 +0800)]
net: hns3: add ethtool parameter check for CQE/EQE mode

For DEVICE_VERSION_V2, the hardware does not support the CQE mode.
So add capability bit for coalesce CQE mode and add parameter check
for it in ethtool.

Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'atlantic-xdp-multi-buffer'
David S. Miller [Wed, 20 Apr 2022 09:42:57 +0000 (10:42 +0100)]
Merge branch 'atlantic-xdp-multi-buffer'

[PATCH net-next v5 0/3] net: atlantic: Add XDP support
@ 2022-04-17 10:12 Taehee Yoo
  2022-04-17 10:12 ` [PATCH net-next v5 1/3] net: atlantic: Implement xdp control plane Taehee Yoo
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Taehee Yoo @ 2022-04-17 10:12 UTC (permalink / raw)
  To: davem, kuba, pabeni, netdev, irusskikh, ast, daniel, hawk,
john.fastabend, andrii, kafai, songliubraving, yhs, kpsingh, bpf
Cc: ap420073
This patchset is to make atlantic to support multi-buffer XDP.

The first patch implement control plane of xdp.
The aq_xdp(), callback of .xdp_bpf is added.

The second patch implements data plane of xdp.
XDP_TX, XDP_DROP, and XDP_PASS is supported.
__aq_ring_xdp_clean() is added to receive and execute xdp program.
aq_nic_xmit_xdpf() is added to send packet by XDP.

The third patch implements callback of .ndo_xdp_xmit.
aq_xdp_xmit() is added to send redirected packets and it internally
calls aq_nic_xmit_xdpf().

Memory model is MEM_TYPE_PAGE_SHARED.

Order-2 page allocation is used when XDP is enabled.

LRO will be disabled if XDP program doesn't supports multi buffer.

AQC chip supports 32 multi-queues and 8 vectors(irq).
There are two options.
1. under 8 cores and maximum 4 tx queues per core.
2. under 4 cores and maximum 8 tx queues per core.

Like other drivers, these tx queues can be used only for XDP_TX,
XDP_REDIRECT queue. If so, no tx_lock is needed.
But this patchset doesn't use this strategy because getting hardware tx
queue index cost is too high.
So, tx_lock is used in the aq_nic_xmit_xdpf().

single-core, single queue, 80% cpu utilization.

  32.30%  [kernel]                  [k] aq_get_rxpages_xdp
  10.44%  [kernel]                  [k] aq_hw_read_reg <---------- here
   9.86%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
   5.51%  [kernel]                  [k] aq_ring_rx_clean

single-core, 8 queues, 100% cpu utilization, half PPS.

  52.03%  [kernel]                  [k] aq_hw_read_reg <---------- here
  18.24%  [kernel]                  [k] aq_get_rxpages_xdp
   4.30%  [kernel]                  [k] hw_atl_b0_hw_ring_rx_receive
   4.24%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
   2.79%  [kernel]                  [k] aq_ring_rx_clean

Performance result(64 Byte)
1. XDP_TX
  a. xdp_geieric, single core
    - 2.5Mpps, 100% cpu
  b. xdp_driver, single core
    - 4.5Mpps, 80% cpu
  c. xdp_generic, 8 core(hyper thread)
    - 6.3Mpps, 40% cpu
  d. xdp_driver, 8 core(hyper thread)
    - 6.3Mpps, 30% cpu

2. XDP_REDIRECT
  a. xdp_generic, single core
    - 2.3Mpps
  b. xdp_driver, single core
    - 4.5Mpps

v5:
 - Use MEM_TYPE_PAGE_SHARED instead of MEM_TYPE_PAGE_ORDER0
 - Use 2K frame size instead of 3K
 - Use order-2 page allocation instead of order-0
 - Rename aq_get_rxpage() to aq_alloc_rxpages()
 - Add missing PageFree stats for ethtool
 - Remove aq_unset_rxpage_xdp(), introduced by v2 patch due to
   change of memory model
 - Fix wrong last parameter value of xdp_prepare_buff()
 - Add aq_get_rxpages_xdp() to increase page reference count

v4:
 - Fix compile warning

v3:
 - Change wrong PPS performance result 40% -> 80% in single
   core(Intel i3-12100)
 - Separate aq_nic_map_xdp() from aq_nic_map_skb()
 - Drop multi buffer packets if single buffer XDP is attached
 - Disable LRO when single buffer XDP is attached
 - Use xdp_get_{frame/buff}_len()

v2:
 - Do not use inline in C file

Taehee Yoo (3):
  net: atlantic: Implement xdp control plane
  net: atlantic: Implement xdp data plane
  net: atlantic: Implement .ndo_xdp_xmit handler

 .../net/ethernet/aquantia/atlantic/aq_cfg.h   |   1 +
 .../ethernet/aquantia/atlantic/aq_ethtool.c   |   9 +
 .../net/ethernet/aquantia/atlantic/aq_main.c  |  87 ++++
 .../net/ethernet/aquantia/atlantic/aq_main.h  |   2 +
 .../net/ethernet/aquantia/atlantic/aq_nic.c   | 136 ++++++
 .../net/ethernet/aquantia/atlantic/aq_nic.h   |   5 +
 .../net/ethernet/aquantia/atlantic/aq_ring.c  | 409 ++++++++++++++++--
 .../net/ethernet/aquantia/atlantic/aq_ring.h  |  21 +-
 .../net/ethernet/aquantia/atlantic/aq_vec.c   |  23 +-
 .../net/ethernet/aquantia/atlantic/aq_vec.h   |   6 +
 .../aquantia/atlantic/hw_atl/hw_atl_a0.c      |   6 +-
 .../aquantia/atlantic/hw_atl/hw_atl_b0.c      |  10 +-
 12 files changed, 670 insertions(+), 45 deletions(-)

--
2.17.1

^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH net-next v5 1/3] net: atlantic: Implement xdp control plane
  2022-04-17 10:12 [PATCH net-next v5 0/3] net: atlantic: Add XDP support Taehee Yoo
@ 2022-04-17 10:12 ` Taehee Yoo
  2022-04-17 10:12 ` [PATCH net-next v5 2/3] net: atlantic: Implement xdp data plane Taehee Yoo
  2022-04-17 10:12 ` [PATCH net-next v5 3/3] net: atlantic: Implement .ndo_xdp_xmit handler Taehee Yoo
  2 siblings, 0 replies; 4+ messages in thread
From: Taehee Yoo @ 2022-04-17 10:12 UTC (permalink / raw)
  To: davem, kuba, pabeni, netdev, irusskikh, ast, daniel, hawk,
john.fastabend, andrii, kafai, songliubraving, yhs, kpsingh, bpf
Cc: ap420073
aq_xdp() is a xdp setup callback function for Atlantic driver.
When XDP is attached or detached, the device will be restarted because
it uses different headroom, tailroom, and page order value.

If XDP enabled, it switches default page order value from 0 to 2.
Because the default maximum frame size is still 2K and it needs
additional area for headroom and tailroom.
The total size(headroom + frame size + tailroom) is 2624.
So, 1472Bytes will be always wasted for every frame.
But when order-2 is used, these pages can be used 6 times
with flip strategy.
It means only about 106Bytes per frame will be wasted.

Also, It supports xdp fragment feature.
MTU can be 16K if xdp prog supports xdp fragment.
If not, MTU can not exceed 2K - ETH_HLEN - ETH_FCS.

And a static key is added and It will be used to call the xdp_clean
handler in ->poll(). data plane implementation will be contained
the followed patch.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---

v5:
 - Use MEM_TYPE_PAGE_SHARED instead of MEM_TYPE_PAGE_ORDER0
 - Use 2K frame size instead of 3K
 - Use order-2 page allocation instead of order-0
 - Rename aq_get_rxpage() to aq_alloc_rxpages()

v4:
 - No changed

v3:
 - Disable LRO when single buffer XDP is attached

v2:
 - No changed

3 years agonet: atlantic: Implement .ndo_xdp_xmit handler
Taehee Yoo [Sun, 17 Apr 2022 10:12:47 +0000 (10:12 +0000)]
net: atlantic: Implement .ndo_xdp_xmit handler

aq_xdp_xmit() is the callback function of .ndo_xdp_xmit.
It internally calls aq_nic_xmit_xdpf() to send packet.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: atlantic: Implement xdp data plane
Taehee Yoo [Sun, 17 Apr 2022 10:12:46 +0000 (10:12 +0000)]
net: atlantic: Implement xdp data plane

It supports XDP_PASS, XDP_DROP and multi buffer.

The new function aq_nic_xmit_xdpf() is used to send packet with
xdp_frame and internally it calls aq_nic_map_xdp().

AQC chip supports 32 multi-queues and 8 vectors(irq).
there are two option
1. under 8 cores and 4 tx queues per core.
2. under 4 cores and 8 tx queues per core.

Like ixgbe, these tx queues can be used only for XDP_TX, XDP_REDIRECT
queue. If so, no tx_lock is needed.
But this patchset doesn't use this strategy because getting hardware tx
queue index cost is too high.
So, tx_lock is used in the aq_nic_xmit_xdpf().

single-core, single queue, 80% cpu utilization.

  30.75%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
  10.35%  [kernel]                  [k] aq_hw_read_reg <---------- here
   4.38%  [kernel]                  [k] get_page_from_freelist

single-core, 8 queues, 100% cpu utilization, half PPS.

  45.56%  [kernel]                  [k] aq_hw_read_reg <---------- here
  17.58%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
   4.72%  [kernel]                  [k] hw_atl_b0_hw_ring_rx_receive

The new function __aq_ring_xdp_clean() is a xdp rx handler and this is
called only when XDP is attached.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: atlantic: Implement xdp control plane
Taehee Yoo [Sun, 17 Apr 2022 10:12:45 +0000 (10:12 +0000)]
net: atlantic: Implement xdp control plane

aq_xdp() is a xdp setup callback function for Atlantic driver.
When XDP is attached or detached, the device will be restarted because
it uses different headroom, tailroom, and page order value.

If XDP enabled, it switches default page order value from 0 to 2.
Because the default maximum frame size is still 2K and it needs
additional area for headroom and tailroom.
The total size(headroom + frame size + tailroom) is 2624.
So, 1472Bytes will be always wasted for every frame.
But when order-2 is used, these pages can be used 6 times
with flip strategy.
It means only about 106Bytes per frame will be wasted.

Also, It supports xdp fragment feature.
MTU can be 16K if xdp prog supports xdp fragment.
If not, MTU can not exceed 2K - ETH_HLEN - ETH_FCS.

And a static key is added and It will be used to call the xdp_clean
handler in ->poll(). data plane implementation will be contained
the followed patch.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'dsa-cross-chip-notifier-cleanup'
David S. Miller [Wed, 20 Apr 2022 09:34:34 +0000 (10:34 +0100)]
Merge branch 'dsa-cross-chip-notifier-cleanup'

Vladimir Oltean says:

====================
DSA cross-chip notifier cleanups

This patch set makes the following improvements:

- Cross-chip notifiers pass a switch index, port index, sometimes tree
  index, all as integers. Sometimes we need to recover the struct
  dsa_port based on those integers. That recovery involves traversing a
  list. By passing directly a pointer to the struct dsa_port we can
  avoid that, and the indices passed previously can still be obtained
  from the passed struct dsa_port.

- Resetting VLAN filtering on a switch has explicit code to make it run
  on a single switch, so it has no place to stay in the cross-chip
  notifier code. Move it out.

- Changing the MTU on a user port affects only that single port, yet the
  code passes through the cross-chip notifier layer where all switches
  are notified. Avoid that.

- Other related cosmetic changes in the MTU changing procedure.

Apart from the slight improvement in performance given by
(a) doing less work in cross-chip notifiers
(b) emitting less cross-chip notifiers
we also end up with about 100 less lines of code.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: dsa: don't emit targeted cross-chip notifiers for MTU change
Vladimir Oltean [Fri, 15 Apr 2022 15:46:26 +0000 (18:46 +0300)]
net: dsa: don't emit targeted cross-chip notifiers for MTU change

A cross-chip notifier with "targeted_match=true" is one that matches
only the local port of the switch that emitted it. In other words,
passing through the cross-chip notifier layer serves no purpose.

Eliminate this concept by calling directly ds->ops->port_change_mtu
instead of emitting a targeted cross-chip notifier. This leaves the
DSA_NOTIFIER_MTU event being emitted only for MTU updates on the CPU
port, which need to be reflected also across all DSA links.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: dsa: drop dsa_slave_priv from dsa_slave_change_mtu
Vladimir Oltean [Fri, 15 Apr 2022 15:46:25 +0000 (18:46 +0300)]
net: dsa: drop dsa_slave_priv from dsa_slave_change_mtu

We can get a hold of the "ds" pointer directly from "dp", no need for
the dsa_slave_priv.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: dsa: avoid one dsa_to_port() in dsa_slave_change_mtu
Vladimir Oltean [Fri, 15 Apr 2022 15:46:24 +0000 (18:46 +0300)]
net: dsa: avoid one dsa_to_port() in dsa_slave_change_mtu

We could retrieve the cpu_dp pointer directly from the "dp" we already
have, no need to resort to dsa_to_port(ds, port).

This change also removes the need for an "int port", so that is also
deleted.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: dsa: use dsa_tree_for_each_user_port in dsa_slave_change_mtu
Vladimir Oltean [Fri, 15 Apr 2022 15:46:23 +0000 (18:46 +0300)]
net: dsa: use dsa_tree_for_each_user_port in dsa_slave_change_mtu

Use the more conventional iterator over user ports instead of explicitly
ignoring them, and use the more conventional name "other_dp" instead of
"dp_iter", for readability.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: dsa: make cross-chip notifiers more efficient for host events
Vladimir Oltean [Fri, 15 Apr 2022 15:46:22 +0000 (18:46 +0300)]
net: dsa: make cross-chip notifiers more efficient for host events

To determine whether a given port should react to the port targeted by
the notifier, dsa_port_host_vlan_match() and dsa_port_host_address_match()
look at the positioning of the switch port currently executing the
notifier relative to the switch port for which the notifier was emitted.

To maintain stylistic compatibility with the other match functions from
switch.c, the host address and host VLAN match functions take the
notifier information about targeted port, switch and tree indices as
argument. However, these functions only use that information to retrieve
the struct dsa_port *targeted_dp, which is an invariant for the outer
loop that calls them. So it makes more sense to calculate the targeted
dp only once, and pass it to them as argument.

But furthermore, the targeted dp is actually known at the time the call
to dsa_port_notify() is made. It is just that we decide to only save the
indices of the port, switch and tree in the notifier structure, just to
retrace our steps and find the dp again using dsa_switch_find() and
dsa_to_port().

But both the above functions are relatively expensive, since they need
to iterate through lists. It appears more straightforward to make all
notifiers just pass the targeted dp inside their info structure, and
have the code that needs the indices to look at info->dp->index instead
of info->port, or info->dp->ds->index instead of info->sw_index, or
info->dp->ds->dst->index instead of info->tree_index.

For the sake of consistency, all cross-chip notifiers are converted to
pass the "dp" directly.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: dsa: move reset of VLAN filtering to dsa_port_switchdev_unsync_attrs
Vladimir Oltean [Fri, 15 Apr 2022 15:46:21 +0000 (18:46 +0300)]
net: dsa: move reset of VLAN filtering to dsa_port_switchdev_unsync_attrs

In dsa_port_switchdev_unsync_attrs() there is a comment that resetting
the VLAN filtering isn't done where it is expected. And since commit
108dc8741c20 ("net: dsa: Avoid cross-chip syncing of VLAN filtering"),
there is no reason to handle this in switch.c either.

Therefore, move the logic to port.c, and adapt it slightly to the data
structures and naming conventions from there.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'rtnetlink-improve-alt_ifname-config-and-fix-dangerous-group-usage'
Paolo Abeni [Tue, 19 Apr 2022 11:39:01 +0000 (13:39 +0200)]
Merge branch 'rtnetlink-improve-alt_ifname-config-and-fix-dangerous-group-usage'

Florent Fourcot says:

====================
rtnetlink: improve ALT_IFNAME config and fix dangerous GROUP usage

First commit forbids dangerous calls when both IFNAME and GROUP are
given, since it can introduce unexpected behaviour when IFNAME does not
match any interface.

Second patch achieves primary goal of this patchset to fix/improve
IFLA_ALT_IFNAME attribute, since previous code was never working for
newlink/setlink. ip-link command is probably getting interface index
before, and was not using this feature.

Last two patches are improving error code on corner cases.

Changes in v2:
  * Remove ifname argument in rtnl_dev_get/do_setlink
    functions (simplify code)
  * Use a boolean to avoid condition duplication in __rtnl_newlink

Changes in v3:
  * Simplify rtnl_dev_get signature

Changes in v4:
  * Rename link_lookup to link_specified

Changes in v5:
  * Re-order patches
====================

Link: https://lore.kernel.org/r/20220415165330.10497-1-florent.fourcot@wifirst.fr
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 years agortnetlink: return EINVAL when request cannot succeed
Florent Fourcot [Fri, 15 Apr 2022 16:53:30 +0000 (18:53 +0200)]
rtnetlink: return EINVAL when request cannot succeed

A request without interface name/interface index/interface group cannot
work. We should return EINVAL

Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 years agortnetlink: return ENODEV when IFLA_ALT_IFNAME is used in dellink
Florent Fourcot [Fri, 15 Apr 2022 16:53:29 +0000 (18:53 +0200)]
rtnetlink: return ENODEV when IFLA_ALT_IFNAME is used in dellink

If IFLA_ALT_IFNAME is set and given interface is not found,
we should return ENODEV and be consistent with IFLA_IFNAME
behaviour
This commit extends feature of commit 76c9ac0ee878,
"net: rtnetlink: add possibility to use alternative names as message handle"

CC: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 years agortnetlink: enable alt_ifname for setlink/newlink
Florent Fourcot [Fri, 15 Apr 2022 16:53:28 +0000 (18:53 +0200)]
rtnetlink: enable alt_ifname for setlink/newlink

buffer called "ifname" given in function rtnl_dev_get
is always valid when called by setlink/newlink,
but contains only empty string when IFLA_IFNAME is not given. So
IFLA_ALT_IFNAME is always ignored

This patch fixes rtnl_dev_get function with a remove of ifname argument,
and move ifname copy in do_setlink when required.

It extends feature of commit 76c9ac0ee878,
"net: rtnetlink: add possibility to use alternative names as message
handle""

CC: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 years agortnetlink: return ENODEV when ifname does not exist and group is given
Florent Fourcot [Fri, 15 Apr 2022 16:53:27 +0000 (18:53 +0200)]
rtnetlink: return ENODEV when ifname does not exist and group is given

When the interface does not exist, and a group is given, the given
parameters are being set to all interfaces of the given group. The given
IFNAME/ALT_IF_NAME are being ignored in that case.

That can be dangerous since a typo (or a deleted interface) can produce
weird side effects for caller:

Case 1:

 IFLA_IFNAME=valid_interface
 IFLA_GROUP=1
 MTU=1234

Case 1 will update MTU and group of the given interface "valid_interface".

Case 2:

 IFLA_IFNAME=doesnotexist
 IFLA_GROUP=1
 MTU=1234

Case 2 will update MTU of all interfaces in group 1. IFLA_IFNAME is
ignored in this case

This behaviour is not consistent and dangerous. In order to fix this issue,
we now return ENODEV when the given IFNAME does not exist.

Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Brian Baboch <brian.baboch@wifirst.fr>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 years agoMerge branch 'net-sched-allow-user-to-select-txqueue'
Paolo Abeni [Tue, 19 Apr 2022 10:20:48 +0000 (12:20 +0200)]
Merge branch 'net-sched-allow-user-to-select-txqueue'

Tonghao Zhang says:

====================
net: sched: allow user to select txqueue

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

Patch 1 allow user to select txqueue in clsact hook.
Patch 2 support skbhash to select txqueue.
====================

Link: https://lore.kernel.org/r/20220415164046.26636-1-xiangxia.m.yue@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 years agonet: sched: support hash selecting tx queue
Tonghao Zhang [Fri, 15 Apr 2022 16:40:46 +0000 (00:40 +0800)]
net: sched: support hash selecting tx queue

This patch allows users to pick queue_mapping, range
from A to B. Then we can load balance packets from A
to B tx queue. The range is an unsigned 16bit value
in decimal format.

$ tc filter ... action skbedit queue_mapping skbhash A B

"skbedit queue_mapping QUEUE_MAPPING" (from "man 8 tc-skbedit")
is enhanced with flags: SKBEDIT_F_TXQ_SKBHASH

  +----+      +----+      +----+
  | P1 |      | P2 |      | Pn |
  +----+      +----+      +----+
    |           |           |
    +-----------+-----------+
                |
                | clsact/skbedit
                |      MQ
                v
    +-----------+-----------+
    | q0        | qn        | qm
    v           v           v
  HTB/FQ       FIFO   ...  FIFO

For example:
If P1 sends out packets to different Pods on other host, and
we want distribute flows from qn - qm. Then we can use skb->hash
as hash.

setup commands:
$ NETDEV=eth0
$ ip netns add n1
$ ip link add ipv1 link $NETDEV type ipvlan mode l2
$ ip link set ipv1 netns n1
$ ip netns exec n1 ifconfig ipv1 2.2.2.100/24 up

$ tc qdisc add dev $NETDEV clsact
$ tc filter add dev $NETDEV egress protocol ip prio 1 \
        flower skip_hw src_ip 2.2.2.100 action skbedit queue_mapping skbhash 2 6
$ tc qdisc add dev $NETDEV handle 1: root mq
$ tc qdisc add dev $NETDEV parent 1:1 handle 2: htb
$ tc class add dev $NETDEV parent 2: classid 2:1 htb rate 100kbit
$ tc class add dev $NETDEV parent 2: classid 2:2 htb rate 200kbit
$ tc qdisc add dev $NETDEV parent 1:2 tbf rate 100mbit burst 100mb latency 1
$ tc qdisc add dev $NETDEV parent 1:3 pfifo
$ tc qdisc add dev $NETDEV parent 1:4 pfifo
$ tc qdisc add dev $NETDEV parent 1:5 pfifo
$ tc qdisc add dev $NETDEV parent 1:6 pfifo
$ tc qdisc add dev $NETDEV parent 1:7 pfifo

$ ip netns exec n1 iperf3 -c 2.2.2.1 -i 1 -t 10 -P 10

pick txqueue from 2 - 6:
$ ethtool -S $NETDEV | grep -i tx_queue_[0-9]_bytes
     tx_queue_0_bytes: 42
     tx_queue_1_bytes: 0
     tx_queue_2_bytes: 11442586444
     tx_queue_3_bytes: 7383615334
     tx_queue_4_bytes: 3981365579
     tx_queue_5_bytes: 3983235051
     tx_queue_6_bytes: 6706236461
     tx_queue_7_bytes: 42
     tx_queue_8_bytes: 0
     tx_queue_9_bytes: 0

txqueues 2 - 6 are mapped to classid 1:3 - 1:7
$ tc -s class show dev $NETDEV
...
class mq 1:3 root leaf 8002:
 Sent 11949133672 bytes 7929798 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:4 root leaf 8003:
 Sent 7710449050 bytes 5117279 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:5 root leaf 8004:
 Sent 4157648675 bytes 2758990 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:6 root leaf 8005:
 Sent 4159632195 bytes 2759990 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
class mq 1:7 root leaf 8006:
 Sent 7003169603 bytes 4646912 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
...

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Wei Wang <weiwan@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 years agonet: sched: use queue_mapping to pick tx queue
Tonghao Zhang [Fri, 15 Apr 2022 16:40:45 +0000 (00:40 +0800)]
net: sched: use queue_mapping to pick tx queue

This patch fixes issue:
* If we install tc filters with act_skbedit in clsact hook.
  It doesn't work, because netdev_core_pick_tx() overwrites
  queue_mapping.

  $ tc filter ... action skbedit queue_mapping 1

And this patch is useful:
* We can use FQ + EDT to implement efficient policies. Tx queues
  are picked by xps, ndo_select_queue of netdev driver, or skb hash
  in netdev_core_pick_tx(). In fact, the netdev driver, and skb
  hash are _not_ under control. xps uses the CPUs map to select Tx
  queues, but we can't figure out which task_struct of pod/containter
  running on this cpu in most case. We can use clsact filters to classify
  one pod/container traffic to one Tx queue. Why ?

  In containter networking environment, there are two kinds of pod/
  containter/net-namespace. One kind (e.g. P1, P2), the high throughput
  is key in these applications. But avoid running out of network resource,
  the outbound traffic of these pods is limited, using or sharing one
  dedicated Tx queues assigned HTB/TBF/FQ Qdisc. Other kind of pods
  (e.g. Pn), the low latency of data access is key. And the traffic is not
  limited. Pods use or share other dedicated Tx queues assigned FIFO Qdisc.
  This choice provides two benefits. First, contention on the HTB/FQ Qdisc
  lock is significantly reduced since fewer CPUs contend for the same queue.
  More importantly, Qdisc contention can be eliminated completely if each
  CPU has its own FIFO Qdisc for the second kind of pods.

  There must be a mechanism in place to support classifying traffic based on
  pods/container to different Tx queues. Note that clsact is outside of Qdisc
  while Qdisc can run a classifier to select a sub-queue under the lock.

  In general recording the decision in the skb seems a little heavy handed.
  This patch introduces a per-CPU variable, suggested by Eric.

  The xmit.skip_txqueue flag is firstly cleared in __dev_queue_xmit().
  - Tx Qdisc may install that skbedit actions, then xmit.skip_txqueue flag
    is set in qdisc->enqueue() though tx queue has been selected in
    netdev_tx_queue_mapping() or netdev_core_pick_tx(). That flag is cleared
    firstly in __dev_queue_xmit(), is useful:
  - Avoid picking Tx queue with netdev_tx_queue_mapping() in next netdev
    in such case: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy:
    For example, eth0, macvlan in pod, which root Qdisc install skbedit
    queue_mapping, send packets to eth0.3, vlan in host. In __dev_queue_xmit() of
    eth0.3, clear the flag, does not select tx queue according to skb->queue_mapping
    because there is no filters in clsact or tx Qdisc of this netdev.
    Same action taked in eth0, ixgbe in Host.
  - Avoid picking Tx queue for next packet. If we set xmit.skip_txqueue
    in tx Qdisc (qdisc->enqueue()), the proper way to clear it is clearing it
    in __dev_queue_xmit when processing next packets.

  For performance reasons, use the static key. If user does not config the NET_EGRESS,
  the patch will not be compiled.

  +----+      +----+      +----+
  | P1 |      | P2 |      | Pn |
  +----+      +----+      +----+
    |           |           |
    +-----------+-----------+
                |
                | clsact/skbedit
                |      MQ
                v
    +-----------+-----------+
    | q0        | q1        | qn
    v           v           v
  HTB/FQ      HTB/FQ  ...  FIFO

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Talal Ahmad <talalahmad@google.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Wei Wang <weiwan@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 years agodocs: net: dsa: describe issues with checksum offload
Luiz Angelo Daros de Luca [Sat, 16 Apr 2022 05:27:37 +0000 (02:27 -0300)]
docs: net: dsa: describe issues with checksum offload

DSA tags before IP header (categories 1 and 2) or after the payload (3)
might introduce offload checksum issues.

Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'mlxsw-line-card'
David S. Miller [Mon, 18 Apr 2022 10:00:19 +0000 (11:00 +0100)]
Merge branch 'mlxsw-line-card'

Ido Schimmel says:

====================
mlxsw: Introduce line card support for modular switch

Jiri says:

This patchset introduces support for modular switch systems and also
introduces mlxsw support for NVIDIA Mellanox SN4800 modular switch.
It contains 8 slots to accommodate line cards - replaceable PHY modules
which may contain gearboxes.
Currently supported line card:
16X 100GbE (QSFP28)
Other line cards that are going to be supported:
8X 200GbE (QSFP56)
4X 400GbE (QSFP-DD)
There may be other types of line cards added in the future.

To be consistent with the port split configuration (splitter cabels),
the line card entities are treated in the similar way. The nature of
a line card is not "a pluggable device", but "a pluggable PHY module".

A concept of "provisioning" is introduced. The user may "provision"
certain slot with a line card type. Driver then creates all instances
(devlink ports, netdevices, etc) related to this line card type. It does
not matter if the line card is plugged-in at the time. User is able to
configure netdevices, devlink ports, setup port splitters, etc. From the
perspective of the switch ASIC, all is present and can be configured.

The carrier of netdevices stays down if the line card is not plugged-in.
Once the line card is inserted and activated, the carrier of
the related netdevices is then reflecting the physical line state,
same as for an ordinary fixed port.

Once user does not want to use the line card related instances
anymore, he can "unprovision" the slot. Driver then removes the
instances.

Patches 1-4 are extending devlink driver API and UAPI in order to
register, show, dump, provision and activate the line card.
Patches 5-17 are implementing the introduced API in mlxsw.
The last patch adds a selftest for mlxsw line cards.

Example:
$ devlink port # No ports are listed
$ devlink lc
pci/0000:01:00.0:
  lc 1 state unprovisioned
    supported_types:
       16x100G
  lc 2 state unprovisioned
    supported_types:
       16x100G
  lc 3 state unprovisioned
    supported_types:
       16x100G
  lc 4 state unprovisioned
    supported_types:
       16x100G
  lc 5 state unprovisioned
    supported_types:
       16x100G
  lc 6 state unprovisioned
    supported_types:
       16x100G
  lc 7 state unprovisioned
    supported_types:
       16x100G
  lc 8 state unprovisioned
    supported_types:
       16x100G

Note that driver exposes list supported line card types. Currently
there is only one: "16x100G".

To provision the slot #8:

$ devlink lc set pci/0000:01:00.0 lc 8 type 16x100G
$ devlink lc show pci/0000:01:00.0 lc 8
pci/0000:01:00.0:
  lc 8 state active type 16x100G
    supported_types:
       16x100G
$ devlink port
pci/0000:01:00.0/0: type notset flavour cpu port 0 splittable false
pci/0000:01:00.0/53: type eth netdev enp1s0nl8p1 flavour physical lc 8 port 1 splittable true lanes 4
pci/0000:01:00.0/54: type eth netdev enp1s0nl8p2 flavour physical lc 8 port 2 splittable true lanes 4
pci/0000:01:00.0/55: type eth netdev enp1s0nl8p3 flavour physical lc 8 port 3 splittable true lanes 4
pci/0000:01:00.0/56: type eth netdev enp1s0nl8p4 flavour physical lc 8 port 4 splittable true lanes 4
pci/0000:01:00.0/57: type eth netdev enp1s0nl8p5 flavour physical lc 8 port 5 splittable true lanes 4
pci/0000:01:00.0/58: type eth netdev enp1s0nl8p6 flavour physical lc 8 port 6 splittable true lanes 4
pci/0000:01:00.0/59: type eth netdev enp1s0nl8p7 flavour physical lc 8 port 7 splittable true lanes 4
pci/0000:01:00.0/60: type eth netdev enp1s0nl8p8 flavour physical lc 8 port 8 splittable true lanes 4
pci/0000:01:00.0/61: type eth netdev enp1s0nl8p9 flavour physical lc 8 port 9 splittable true lanes 4
pci/0000:01:00.0/62: type eth netdev enp1s0nl8p10 flavour physical lc 8 port 10 splittable true lanes 4
pci/0000:01:00.0/63: type eth netdev enp1s0nl8p11 flavour physical lc 8 port 11 splittable true lanes 4
pci/0000:01:00.0/64: type eth netdev enp1s0nl8p12 flavour physical lc 8 port 12 splittable true lanes 4
pci/0000:01:00.0/125: type eth netdev enp1s0nl8p13 flavour physical lc 8 port 13 splittable true lanes 4
pci/0000:01:00.0/126: type eth netdev enp1s0nl8p14 flavour physical lc 8 port 14 splittable true lanes 4
pci/0000:01:00.0/127: type eth netdev enp1s0nl8p15 flavour physical lc 8 port 15 splittable true lanes 4
pci/0000:01:00.0/128: type eth netdev enp1s0nl8p16 flavour physical lc 8 port 16 splittable true lanes 4

To uprovision the slot #8:

$ devlink lc set pci/0000:01:00.0 lc 8 notype
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoselftests: mlxsw: Introduce devlink line card provision/unprovision/activation tests
Jiri Pirko [Mon, 18 Apr 2022 06:42:41 +0000 (09:42 +0300)]
selftests: mlxsw: Introduce devlink line card provision/unprovision/activation tests

Introduce basic line card manipulation which consists of provisioning,
unprovisioning and activation of a line card.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: spectrum: Add port to linecard mapping
Jiri Pirko [Mon, 18 Apr 2022 06:42:40 +0000 (09:42 +0300)]
mlxsw: spectrum: Add port to linecard mapping

For each port get slot_index using PMLP register. For ports residing
on a linecard, identify it with the linecard by setting mapping
using devlink_port_linecard_set() helper. Use linecard slot index for
PMTDB register queries.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: core: Extend driver ops by remove selected ports op
Jiri Pirko [Mon, 18 Apr 2022 06:42:39 +0000 (09:42 +0300)]
mlxsw: core: Extend driver ops by remove selected ports op

In case of line card implementation, the core has to have a way to
remove relevant ports manually. Extend the Spectrum driver ops by an op
that implements port removal of selected ports upon request.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: core_linecards: Implement line card activation process
Jiri Pirko [Mon, 18 Apr 2022 06:42:38 +0000 (09:42 +0300)]
mlxsw: core_linecards: Implement line card activation process

Allow to process events generated upon line card getting "ready" and
"active".

When DSDSC event with "ready" bit set is delivered, that means the
line card is powered up. Use MDDC register to push the line card to
active state. Once FW is done with that, the DSDSC event with "active"
bit set is delivered.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: core_linecards: Add line card objects and implement provisioning
Jiri Pirko [Mon, 18 Apr 2022 06:42:37 +0000 (09:42 +0300)]
mlxsw: core_linecards: Add line card objects and implement provisioning

Introduce objects for line cards and an infrastructure around that.
Use devlink_linecard_create/destroy() to register the line card with
devlink core. Implement provisioning ops with a list of supported
line cards.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: reg: Add Management Binary Code Transfer Register
Jiri Pirko [Mon, 18 Apr 2022 06:42:36 +0000 (09:42 +0300)]
mlxsw: reg: Add Management Binary Code Transfer Register

The MBCT register allows to transfer binary INI codes from the host to
the management FW by transferring it by chunks of maximum 1KB.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: reg: Add Management DownStream Device Control Register
Jiri Pirko [Mon, 18 Apr 2022 06:42:35 +0000 (09:42 +0300)]
mlxsw: reg: Add Management DownStream Device Control Register

The MDDC register allows to control downstream devices and line cards.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: reg: Add Management DownStream Device Query Register
Jiri Pirko [Mon, 18 Apr 2022 06:42:34 +0000 (09:42 +0300)]
mlxsw: reg: Add Management DownStream Device Query Register

The MDDQ register allows to query the DownStream device properties.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: spectrum: Introduce port mapping change event processing
Jiri Pirko [Mon, 18 Apr 2022 06:42:33 +0000 (09:42 +0300)]
mlxsw: spectrum: Introduce port mapping change event processing

Register PMLPE trap and process the port mapping changes delivered
by it by creating related ports. Note that this happens after
provisioning. The INI of the linecard is processed and merged by FW.
PMLPE is generated for each port. Process this mapping change.

Layout of PMLPE is the same as layout of PMLP.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: Narrow the critical section of devl_lock during ports creation/removal
Jiri Pirko [Mon, 18 Apr 2022 06:42:32 +0000 (09:42 +0300)]
mlxsw: Narrow the critical section of devl_lock during ports creation/removal

No need to hold the lock for alloc and freecpu. So narrow the critical
section. Follow-up patch is going to benefit from this by adding more
code to the functions which will be out of the critical as well.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: reg: Add Ports Mapping Event Configuration Register
Jiri Pirko [Mon, 18 Apr 2022 06:42:31 +0000 (09:42 +0300)]
mlxsw: reg: Add Ports Mapping Event Configuration Register

The PMECR register is used to enable/disable event triggering
in case of local port mapping change.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: spectrum: Allocate port mapping array of structs instead of pointers
Jiri Pirko [Mon, 18 Apr 2022 06:42:30 +0000 (09:42 +0300)]
mlxsw: spectrum: Allocate port mapping array of structs instead of pointers

Instead of array of pointers to port mapping structures, allocate the
array of structures directly.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: spectrum: Allow lane to start from non-zero index
Jiri Pirko [Mon, 18 Apr 2022 06:42:29 +0000 (09:42 +0300)]
mlxsw: spectrum: Allow lane to start from non-zero index

So far, the lane index always started from zero. That is not true for
modular systems with gearbox-equipped linecards. Loose the check so the
lanes can start from non-zero index.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agodevlink: add port to line card relationship set
Jiri Pirko [Mon, 18 Apr 2022 06:42:28 +0000 (09:42 +0300)]
devlink: add port to line card relationship set

In order to properly inform user about relationship between port and
line card, introduce a driver API to set line card for a port. Use this
information to extend port devlink netlink message by line card index
and also include the line card index into phys_port_name and by that
into a netdevice name.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agodevlink: implement line card active state
Jiri Pirko [Mon, 18 Apr 2022 06:42:27 +0000 (09:42 +0300)]
devlink: implement line card active state

Allow driver to mark a line card as active. Expose this state to the
userspace over devlink netlink interface with proper notifications.
'active' state means that line card was plugged in after
being provisioned.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agodevlink: implement line card provisioning
Jiri Pirko [Mon, 18 Apr 2022 06:42:26 +0000 (09:42 +0300)]
devlink: implement line card provisioning

In order to be able to configure all needed stuff on a port/netdevice
of a line card without the line card being present, introduce line card
provisioning. Basically by setting a type, provisioning process will
start and driver is supposed to create a placeholder for instances
(ports/netdevices) for a line card type.

Allow the user to query the supported line card types over line card
get command. Then implement two netlink command SET to allow user to
set/unset the card type.

On the driver API side, add provision/unprovision ops and supported
types array to be advertised. Upon provision op call, the driver should
take care of creating the instances for the particular line card type.
Introduce provision_set/clear() functions to be called by the driver
once the provisioning/unprovisioning is done on its side. These helpers
are not to be called directly due to the async nature of provisioning.

Example:
$ devlink port # No ports are listed
$ devlink lc
pci/0000:01:00.0:
  lc 1 state unprovisioned
    supported_types:
       16x100G
  lc 2 state unprovisioned
    supported_types:
       16x100G
  lc 3 state unprovisioned
    supported_types:
       16x100G
  lc 4 state unprovisioned
    supported_types:
       16x100G
  lc 5 state unprovisioned
    supported_types:
       16x100G
  lc 6 state unprovisioned
    supported_types:
       16x100G
  lc 7 state unprovisioned
    supported_types:
       16x100G
  lc 8 state unprovisioned
    supported_types:
       16x100G

$ devlink lc set pci/0000:01:00.0 lc 8 type 16x100G
$ devlink lc show pci/0000:01:00.0 lc 8
pci/0000:01:00.0:
  lc 8 state active type 16x100G
    supported_types:
       16x100G
$ devlink port
pci/0000:01:00.0/0: type notset flavour cpu port 0 splittable false
pci/0000:01:00.0/53: type eth netdev enp1s0nl8p1 flavour physical lc 8 port 1 splittable true lanes 4
pci/0000:01:00.0/54: type eth netdev enp1s0nl8p2 flavour physical lc 8 port 2 splittable true lanes 4
pci/0000:01:00.0/55: type eth netdev enp1s0nl8p3 flavour physical lc 8 port 3 splittable true lanes 4
pci/0000:01:00.0/56: type eth netdev enp1s0nl8p4 flavour physical lc 8 port 4 splittable true lanes 4
pci/0000:01:00.0/57: type eth netdev enp1s0nl8p5 flavour physical lc 8 port 5 splittable true lanes 4
pci/0000:01:00.0/58: type eth netdev enp1s0nl8p6 flavour physical lc 8 port 6 splittable true lanes 4
pci/0000:01:00.0/59: type eth netdev enp1s0nl8p7 flavour physical lc 8 port 7 splittable true lanes 4
pci/0000:01:00.0/60: type eth netdev enp1s0nl8p8 flavour physical lc 8 port 8 splittable true lanes 4
pci/0000:01:00.0/61: type eth netdev enp1s0nl8p9 flavour physical lc 8 port 9 splittable true lanes 4
pci/0000:01:00.0/62: type eth netdev enp1s0nl8p10 flavour physical lc 8 port 10 splittable true lanes 4
pci/0000:01:00.0/63: type eth netdev enp1s0nl8p11 flavour physical lc 8 port 11 splittable true lanes 4
pci/0000:01:00.0/64: type eth netdev enp1s0nl8p12 flavour physical lc 8 port 12 splittable true lanes 4
pci/0000:01:00.0/125: type eth netdev enp1s0nl8p13 flavour physical lc 8 port 13 splittable true lanes 4
pci/0000:01:00.0/126: type eth netdev enp1s0nl8p14 flavour physical lc 8 port 14 splittable true lanes 4
pci/0000:01:00.0/127: type eth netdev enp1s0nl8p15 flavour physical lc 8 port 15 splittable true lanes 4
pci/0000:01:00.0/128: type eth netdev enp1s0nl8p16 flavour physical lc 8 port 16 splittable true lanes 4

$ devlink lc set pci/0000:01:00.0 lc 8 notype

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agodevlink: add support to create line card and expose to user
Jiri Pirko [Mon, 18 Apr 2022 06:42:25 +0000 (09:42 +0300)]
devlink: add support to create line card and expose to user

Extend the devlink API so the driver is going to be able to create and
destroy linecard instances. There can be multiple line cards per devlink
device. Expose this new type of object over devlink netlink API to the
userspace, with notifications.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotcp: fix signed/unsigned comparison
Eric Dumazet [Sun, 17 Apr 2022 18:34:32 +0000 (11:34 -0700)]
tcp: fix signed/unsigned comparison

Kernel test robot reported:

smatch warnings:
net/ipv4/tcp_input.c:5966 tcp_rcv_established() warn: unsigned 'reason' is never less than zero.

I actually had one packetdrill failing because of this bug,
and was about to send the fix :)

v2: Andreas Schwab also pointed out that @reason needs to be negated
    before we reach tcp_drop_reason()

Fixes: 4b506af9c5b8 ("tcp: add two drop reasons for tcp_ack()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'tcp-drop-reason-additions'
David S. Miller [Sun, 17 Apr 2022 12:31:32 +0000 (13:31 +0100)]
Merge branch 'tcp-drop-reason-additions'

Eric Dumazet says:

====================
tcp: drop reason additions

Currently, TCP is either missing drop reasons,
or pretending that some useful packets are dropped.

This patch series makes "perf record -a -e skb:kfree_skb"
much more usable.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotcp: add drop reason support to tcp_ofo_queue()
Eric Dumazet [Sat, 16 Apr 2022 00:10:48 +0000 (17:10 -0700)]
tcp: add drop reason support to tcp_ofo_queue()

packets in OFO queue might be redundant, and dropped.

tcp_drop() is no longer needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotcp: add drop reasons to tcp_rcv_synsent_state_process()
Eric Dumazet [Sat, 16 Apr 2022 00:10:47 +0000 (17:10 -0700)]
tcp: add drop reasons to tcp_rcv_synsent_state_process()

Re-use existing reasons.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotcp: make tcp_rcv_synsent_state_process() drop monitor friend
Eric Dumazet [Sat, 16 Apr 2022 00:10:46 +0000 (17:10 -0700)]
tcp: make tcp_rcv_synsent_state_process() drop monitor friend

1) A valid RST packet should be consumed, to not confuse drop monitor.

2) Same remark for packet validating cross syn setup,
   even if we might ignore part of it.

3) When third packet of 3WHS is delayed, do not pretend
   the SYNACK was dropped.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotcp: add drop reason support to tcp_prune_ofo_queue()
Eric Dumazet [Sat, 16 Apr 2022 00:10:45 +0000 (17:10 -0700)]
tcp: add drop reason support to tcp_prune_ofo_queue()

Add one reason for packets dropped from OFO queue because
of memory pressure.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotcp: add two drop reasons for tcp_ack()
Eric Dumazet [Sat, 16 Apr 2022 00:10:44 +0000 (17:10 -0700)]
tcp: add two drop reasons for tcp_ack()

Add TCP_TOO_OLD_ACK and TCP_ACK_UNSENT_DATA drop
reasons so that tcp_rcv_established() can report
them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotcp: add drop reasons to tcp_rcv_state_process()
Eric Dumazet [Sat, 16 Apr 2022 00:10:43 +0000 (17:10 -0700)]
tcp: add drop reasons to tcp_rcv_state_process()

Add basic support for drop reasons in tcp_rcv_state_process()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotcp: make tcp_rcv_state_process() drop monitor friendly
Eric Dumazet [Sat, 16 Apr 2022 00:10:42 +0000 (17:10 -0700)]
tcp: make tcp_rcv_state_process() drop monitor friendly

tcp_rcv_state_process() incorrectly drops packets
instead of consuming it, making drop monitor very noisy,
if not unusable.

Calling tcp_time_wait() or tcp_done() is part
of standard behavior, packets triggering these actions
were not dropped.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotcp: add drop reason support to tcp_validate_incoming()
Eric Dumazet [Sat, 16 Apr 2022 00:10:41 +0000 (17:10 -0700)]
tcp: add drop reason support to tcp_validate_incoming()

Creates four new drop reasons for the following cases:

1) packet being rejected by RFC 7323 PAWS check
2) packet being rejected by SEQUENCE check
3) Invalid RST packet
4) Invalid SYN packet

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotcp: get rid of rst_seq_match
Eric Dumazet [Sat, 16 Apr 2022 00:10:40 +0000 (17:10 -0700)]
tcp: get rid of rst_seq_match

Small cleanup in tcp_validate_incoming(), no need for rst_seq_match
setting and testing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agotcp: consume incoming skb leading to a reset
Eric Dumazet [Sat, 16 Apr 2022 00:10:39 +0000 (17:10 -0700)]
tcp: consume incoming skb leading to a reset

Whenever tcp_validate_incoming() handles a valid RST packet,
we should not pretend the packet was dropped.

Create a special section at the end of tcp_validate_incoming()
to handle this case.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'qca8k_preiv-shrink'
David S. Miller [Sun, 17 Apr 2022 12:28:29 +0000 (13:28 +0100)]
Merge branch 'qca8k_preiv-shrink'

Ansuel Smith says:

====================
net: Reduce qca8k_priv space usage

These 6 patch is a first attempt at reducting qca8k_priv space.
The code changed a lot during times and we have many old logic
that can be replaced with new implementation

The first patch drop the tracking of MTU. We mimic what was done
for mtk and we change MTU only when CPU port is changed.

The second patch finally drop a piece of story of this driver.
The ar8xxx_port_status struct was used by the first implementation
of this driver to put all sort of status data for the port...
With the evolution of DSA all that stuff got dropped till only
the enabled state was the only part of the that struct.
Since it's overkill to keep an array of int, we convert the variable
to a simple u8 where we store the status of each port. This is needed
to don't reanable ports on system resume.

The third patch is a preparation for patch 4. As Vladimir explained
in another patch, we waste a tons of space by keeping a duplicate of
the switch dsa ops in qca8k_priv. The only reason for this is to
dynamically set the correct mdiobus configuration (a legacy dsa one,
or a custom dedicated one)
To solve this problem, we just drop the phy_read/phy_write and we
declare a custom mdiobus in any case.
This way we can use a static dsa switch ops struct and we can drop it
from qca8k_priv

Patch 4 drop the duplicated dsa_switch_ops.

Patch 5 is a fixup for mdio read error.

Patch 6 is an effort to standardize how bus name are done.

This series is just a start of more cleanup.

The idea is to move this driver to the qca dir and split common code
from specific code. Also the mgmt eth code still requires some love
and can totally be optimized by recycling the same skb over time.

Also while working on the MTU it was notice some problem with
the stmmac driver and with the reloading phase that cause all
sort of problems with qca8k.

I'm sending this here just to try to keep small series instead of
proposing monster series hard to review.

v2:
- Rework MTU patch
v3:
- Drop unrealated changes from patch 3
- Add fixup patch for mdio read
- Unify bus name for legacy and OF mdio bus
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: dsa: qca8k: unify bus id naming with legacy and OF mdio bus
Ansuel Smith [Fri, 15 Apr 2022 23:30:17 +0000 (01:30 +0200)]
net: dsa: qca8k: unify bus id naming with legacy and OF mdio bus

Add support for multiple switch with OF mdio bus declaration.
Unify the bus id naming and use the same logic for both legacy and OF
mdio bus.

Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: dsa: qca8k: correctly handle mdio read error
Ansuel Smith [Fri, 15 Apr 2022 23:30:16 +0000 (01:30 +0200)]
net: dsa: qca8k: correctly handle mdio read error

Restore original way to handle mdio read error by returning 0xffff.
This was wrongly changed when the internal_mdio_read was introduced,
now that both legacy and internal use the same function, make sure that
they behave the same way.

Fixes: ce062a0adbfe ("net: dsa: qca8k: fix kernel panic with legacy mdio mapping")
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: dsa: qca8k: drop dsa_switch_ops from qca8k_priv
Ansuel Smith [Fri, 15 Apr 2022 23:30:15 +0000 (01:30 +0200)]
net: dsa: qca8k: drop dsa_switch_ops from qca8k_priv

Now that dsa_switch_ops is not switch specific anymore, we can drop it
from qca8k_priv and use the static ops directly for the dsa_switch
pointer.

Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: dsa: qca8k: rework and simplify mdiobus logic
Ansuel Smith [Fri, 15 Apr 2022 23:30:14 +0000 (01:30 +0200)]
net: dsa: qca8k: rework and simplify mdiobus logic

In an attempt to reduce qca8k_priv space, rework and simplify mdiobus
logic.
We now declare a mdiobus instead of relying on DSA phy_read/write even
if a mdio node is not present. This is all to make the qca8k ops static
and not switch specific. With a legacy implementation where port doesn't
have a phy map declared in the dts with a mdio node, we declare a
'qca8k-legacy' mdiobus. The conversion logic is used as legacy read and
write ops are used instead of the internal one.
Also drop the legacy_phy_port_mapping as we now declare mdiobus with ops
that already address the workaround.

Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: dsa: qca8k: drop port_sts from qca8k_priv
Ansuel Smith [Fri, 15 Apr 2022 23:30:13 +0000 (01:30 +0200)]
net: dsa: qca8k: drop port_sts from qca8k_priv

Port_sts is a thing of the past for this driver. It was something
present on the initial implementation of this driver and parts of the
original struct were dropped over time. Using an array of int to store if
a port is enabled or not to handle PM operation seems overkill. Switch
and use a simple u8 to store the port status where each bit correspond
to a port. (bit is set port is enabled, bit is not set, port is disabled)
Also add some comments to better describe why we need to track port
status.

Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: dsa: qca8k: drop MTU tracking from qca8k_priv
Ansuel Smith [Fri, 15 Apr 2022 23:30:12 +0000 (01:30 +0200)]
net: dsa: qca8k: drop MTU tracking from qca8k_priv

DSA set the CPU port based on the largest MTU of all the slave ports.
Based on this we can drop the MTU array from qca8k_priv and set the
port_change_mtu logic on DSA changing MTU of the CPU port as the switch
have a global MTU settingfor each port.

Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet/ipv6: Introduce accept_unsolicited_na knob to implement router-side changes for...
Arun Ajith S [Fri, 15 Apr 2022 08:34:02 +0000 (08:34 +0000)]
net/ipv6: Introduce accept_unsolicited_na knob to implement router-side changes for RFC9131

Add a new neighbour cache entry in STALE state for routers on receiving
an unsolicited (gratuitous) neighbour advertisement with
target link-layer-address option specified.
This is similar to the arp_accept configuration for IPv4.
A new sysctl endpoint is created to turn on this behaviour:
/proc/sys/net/ipv6/conf/interface/accept_unsolicited_na.

Signed-off-by: Arun Ajith S <aajith@arista.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoipv6: fix NULL deref in ip6_rcv_core()
Eric Dumazet [Wed, 13 Apr 2022 20:56:53 +0000 (13:56 -0700)]
ipv6: fix NULL deref in ip6_rcv_core()

idev can be NULL, as the surrounding code suggests.

Fixes: 4daf841a2ef3 ("net: ipv6: add skb drop reasons to ip6_rcv_core()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Menglong Dong <imagedong@tencent.com>
Cc: Jiang Biao <benbjiang@tencent.com>
Cc: Hao Peng <flyingpeng@tencent.com>
Link: https://lore.kernel.org/r/20220413205653.1178458-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet_sched: make qdisc_reset() smaller
Eric Dumazet [Thu, 14 Apr 2022 01:10:04 +0000 (18:10 -0700)]
net_sched: make qdisc_reset() smaller

For some unknown reason qdisc_reset() is using
a convoluted way of freeing two lists of skbs.

Use __skb_queue_purge() instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20220414011004.2378350-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoocteon_ep: Remove custom driver version
Leon Romanovsky [Thu, 14 Apr 2022 07:52:42 +0000 (10:52 +0300)]
octeon_ep: Remove custom driver version

In review comment [1] was pointed that new code is not supposed
to set driver version and should rely on kernel version instead.

As an outcome of that comment all the dance around setting such
driver version to FW should be removed too, because in upstream
kernel whole driver will have same version so read/write from/to
FW will give same result.

[1] https://lore.kernel.org/all/YladGTmon1x3dfxI@unreal

Fixes: 862cd659a6fb ("octeon_ep: Add driver framework and device initialization")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/5d76f3116ee795071ec044eabb815d6c2bdc7dbd.1649922731.git.leonro@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoMerge branch 'ibmvnic-use-a-set-of-ltbs-per-pool'
Jakub Kicinski [Fri, 15 Apr 2022 21:02:09 +0000 (14:02 -0700)]
Merge branch 'ibmvnic-use-a-set-of-ltbs-per-pool'

Sukadev Bhattiprolu says:

====================
ibmvnic: Use a set of LTBs per pool

ibmvnic uses a single large long term buffer (LTB) per rx or tx
pool (queue). This has two limitations.

First, if we need to free/allocate an LTB (eg during a reset), under
low memory conditions, the allocation can fail.

Second, the kernel limits the size of single LTB (DMA buffer) to 16MB
(based on MAX_ORDER). With jumbo frames (mtu = 9000) we can only have
about 1763 buffers per LTB (16MB / 9588 bytes per frame) even though
the max supported buffers is 4096. (The 9588 instead of 9088 comes from
IBMVNIC_BUFFER_HLEN.)

To overcome these limitations, allow creating a set of LTBs per queue.
====================

Link: https://lore.kernel.org/r/20220413171026.1264294-1-drt@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoibmvnic: Allow multiple ltbs in txpool ltb_set
Sukadev Bhattiprolu [Wed, 13 Apr 2022 17:10:26 +0000 (13:10 -0400)]
ibmvnic: Allow multiple ltbs in txpool ltb_set

Allow multiple LTBs in the txpool's ltb_set. i.e rather than using
a single large LTB, use several smaller LTBs.

The first n-1 LTBs will all be of the same size. The size of the last
LTB in the set depends on the number of buffers and buffer (mtu) size.
This strategy hopefully allows more reuse of the initial LTBs and also
reduces the chances of an allocation failure (of the large LTB) when
system is low in memory.

Suggested-by: Brian King <brking@linux.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoibmvnic: Allow multiple ltbs in rxpool ltb_set
Sukadev Bhattiprolu [Wed, 13 Apr 2022 17:10:25 +0000 (13:10 -0400)]
ibmvnic: Allow multiple ltbs in rxpool ltb_set

Allow multiple LTBs in the rxpool's ltb_set. The first n-1 LTBs will all
be of the same size. The size of the last LTB in the set depends on the
number of buffers and buffer (mtu) size.

Having a set of LTBs per pool provides a couple of benefits.

First, with the current value of IBMVNIC_MAX_LTB_SIZE of 16MB, with an
MTU of 9000, we need a LTB (DMA buffer) of that size but the allocation
can fail in low memory conditions. With a set of LTBs per pool, we can
use several smaller (8MB) LTBs and hopefully have fewer allocation
failures. (See also comments in ibmvnic.h on the trade-off with smaller
LTBs)

Second since the kernel limits the size of the DMA buffer to 16MB (based
on MAX_ORDER), with a single DMA buffer per pool, the pool is also limited
to 16MB. This in turn limits the number of buffers per pool to 1763 when
MTU is 9000. With a set of LTBs per pool, we can have upto the max of 4096
buffers per pool even when MTU is 9000.

Suggested-by: Brian King <brking@linux.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoibmvnic: convert rxpool ltb to a set of ltbs
Sukadev Bhattiprolu [Wed, 13 Apr 2022 17:10:24 +0000 (13:10 -0400)]
ibmvnic: convert rxpool ltb to a set of ltbs

Define and use interfaces that treat the long term buffer (LTB) of an
rxpool as a set of LTBs rather than a single LTB. The set only has one
LTB for now.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoibmvnic: define map_txpool_buf_to_ltb()
Sukadev Bhattiprolu [Wed, 13 Apr 2022 17:10:23 +0000 (13:10 -0400)]
ibmvnic: define map_txpool_buf_to_ltb()

Define a helper to map a given txpool buffer into its corresponding long
term buffer (LTB) and offset. Currently there is just one LTB per txpool
so the mapping is trivial. When we add support for multiple LTBs per
txpool, this helper will be more useful.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoibmvnic: define map_rxpool_buf_to_ltb()
Sukadev Bhattiprolu [Wed, 13 Apr 2022 17:10:22 +0000 (13:10 -0400)]
ibmvnic: define map_rxpool_buf_to_ltb()

Define a helper to map a given rx pool buffer into its corresponding long
term buffer (LTB) and offset. Currently there is just one LTB per pool so
the mapping is trivial. When we add support for multiple LTBs per pool,
this helper will be more useful.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoibmvnic: rename local variable index to bufidx
Sukadev Bhattiprolu [Wed, 13 Apr 2022 17:10:21 +0000 (13:10 -0400)]
ibmvnic: rename local variable index to bufidx

The local variable 'index' is heavily used in some functions and is
confusing with the presence of other "index" fields like pool->index,
->consumer_index, etc. Rename it to bufidx to better reflect that its
the index of a buffer in the pool.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoMerge branch 'net-ethool-add-support-to-get-set-tx-push-by-ethtool-g-g'
Jakub Kicinski [Fri, 15 Apr 2022 18:41:56 +0000 (11:41 -0700)]
Merge branch 'net-ethool-add-support-to-get-set-tx-push-by-ethtool-g-g'

Jie Wang says:

====================
net: ethool: add support to get/set tx push by ethtool -G/g

These three patches add tx push in ring params and adapt the set and get APIs
of ring params.
====================

Link: https://lore.kernel.org/r/20220412020121.14140-1-huangguangbin2@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: hns3: add tx push support in hns3 ring param process
Jie Wang [Tue, 12 Apr 2022 02:01:21 +0000 (10:01 +0800)]
net: hns3: add tx push support in hns3 ring param process

This patch adds tx push param to hns3 ring param and adapts the set and get
API of ring params. So users can set it by cmd ethtool -G and get it by cmd
ethtool -g.

Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: ethtool: move checks before rtnl_lock() in ethnl_set_rings
Jie Wang [Tue, 12 Apr 2022 02:01:20 +0000 (10:01 +0800)]
net: ethtool: move checks before rtnl_lock() in ethnl_set_rings

Currently these two checks in ethnl_set_rings are added after rtnl_lock()
which will do useless works if the request is invalid.

So this patch moves these checks before the rtnl_lock() to avoid these
costs.

Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agonet: ethtool: extend ringparam set/get APIs for tx_push
Jie Wang [Tue, 12 Apr 2022 02:01:19 +0000 (10:01 +0800)]
net: ethtool: extend ringparam set/get APIs for tx_push

Currently tx push is a standard driver feature which controls use of a fast
path descriptor push. So this patch extends the ringparam APIs and data
structures to support set/get tx push by ethtool -G/g.

Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3 years agoocteon_ep: fix error return code in octep_probe()
Yang Yingliang [Fri, 15 Apr 2022 02:39:57 +0000 (10:39 +0800)]
octeon_ep: fix error return code in octep_probe()

If register_netdev() fails , it should return error
code in octep_probe().

Fixes: 862cd659a6fb ("octeon_ep: Add driver framework and device initialization")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'emaclite-cleanups'
David S. Miller [Fri, 15 Apr 2022 10:46:29 +0000 (11:46 +0100)]
Merge branch 'emaclite-cleanups'

Radhey Shyam Pandey says:

====================
net: emaclite: Trivial code cleanup

This patchset fix coding style issues, remove BUFFER_ALIGN
macro and also update copyright text.

I have to resend as earlier series didn't reach mailing list
due to some configuration issue.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: emaclite: Remove custom BUFFER_ALIGN macro
Shravya Kumbham [Thu, 14 Apr 2022 12:37:11 +0000 (18:07 +0530)]
net: emaclite: Remove custom BUFFER_ALIGN macro

BUFFER_ALIGN macro is used to calculate the number of bytes
required for the next alignment. Instead of this, we can directly
use the skb_reserve(skb, NET_IP_ALIGN) to make the protocol header
buffer aligned on at least a 4-byte boundary, where the NET_IP_ALIGN
is by default defined as 2. So removing the BUFFER_ALIGN and its
related defines which it can be done by the skb_reserve() itself.

Signed-off-by: Shravya Kumbham <shravya.kumbham@xilinx.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: emaclite: Update copyright text to correct format
Michal Simek [Thu, 14 Apr 2022 12:37:10 +0000 (18:07 +0530)]
net: emaclite: Update copyright text to correct format

Based on recommended guidance Copyright term should be also present in
front of (c). That's why aligned driver to match this pattern.
It helps automated tools with source code scanning.

Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: emaclite: Fix coding style
Radhey Shyam Pandey [Thu, 14 Apr 2022 12:37:09 +0000 (18:07 +0530)]
net: emaclite: Fix coding style

Make coding style changes to fix checkpatch script warnings.
There is no functional change. Fixes below check and warnings-

CHECK: Blank lines aren't necessary after an open brace '{'
CHECK: spinlock_t definition without comment
CHECK: Please don't use multiple blank lines
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
CHECK: braces {} should be used on all arms of this statement
CHECK: Unbalanced braces around else statement
CHECK: Alignment should match open parenthesis
WARNING: Missing a blank line after declarations

Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agonet: ethernet: ti: davinci_emac: using pm_runtime_resume_and_get instead of pm_runtim...
Minghao Chi [Thu, 14 Apr 2022 09:08:00 +0000 (09:08 +0000)]
net: ethernet: ti: davinci_emac: using pm_runtime_resume_and_get instead of pm_runtime_get_sync

Using pm_runtime_resume_and_get() to replace pm_runtime_get_sync and
pm_runtime_put_noidle. This change is just to simplify the code, no
actual functional changes.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoocteon_ep: Fix spelling mistake "inerrupts" -> "interrupts"
Colin Ian King [Thu, 14 Apr 2022 08:08:34 +0000 (09:08 +0100)]
octeon_ep: Fix spelling mistake "inerrupts" -> "interrupts"

There is a spelling mistake in a dev_info message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agoMerge branch 'mlxsw-line-card-prep'
David S. Miller [Fri, 15 Apr 2022 10:06:13 +0000 (11:06 +0100)]
Merge branch 'mlxsw-line-card-prep'

Ido Schimmel says:

====================
mlxsw: Preparations for line cards support

Currently, mlxsw registers thermal zones as well as hwmon entries for
objects such as transceiver modules and gearboxes. In upcoming modular
systems, these objects are no longer found on the main board (i.e., slot
0), but on plug-able line cards. This patchset prepares mlxsw for such
systems in terms of hwmon, thermal and cable access support.

Patches #1-#3 gradually prepare mlxsw for transceiver modules access
support for line cards by splitting some of the internal structures and
some APIs.

Patches #4-#5 gradually prepare mlxsw for hwmon support for line cards
by splitting some of the internal structures and augmenting them with a
slot index.

Patches #6-#7 do the same for thermal zones.

Patch #8 selects cooling device for binding to a thermal zone by exact
name match to prevent binding to non-relevant devices.

Patch #9 replaces internal define for thermal zone name length with a
common define.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: core_thermal: Use common define for thermal zone name length
Vadim Pasternak [Wed, 13 Apr 2022 15:17:33 +0000 (18:17 +0300)]
mlxsw: core_thermal: Use common define for thermal zone name length

Replace internal define 'MLXSW_THERMAL_ZONE_MAX_NAME' by common
'THERMAL_NAME_LENGTH'.

Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: core_thermal: Use exact name of cooling devices for binding
Vadim Pasternak [Wed, 13 Apr 2022 15:17:32 +0000 (18:17 +0300)]
mlxsw: core_thermal: Use exact name of cooling devices for binding

Modular system supports additional cooling devices "mlxreg_fan1",
"mlxreg_fan2", etcetera. Thermal zones in "mlxsw" driver should be
bound to the same device as before called "mlxreg_fan". Used exact
match for cooling device name to avoid binding to new additional
cooling devices.

Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: core_thermal: Add line card id prefix to line card thermal zone name
Vadim Pasternak [Wed, 13 Apr 2022 15:17:31 +0000 (18:17 +0300)]
mlxsw: core_thermal: Add line card id prefix to line card thermal zone name

Add prefix "lc#n" to thermal zones associated with the thermal objects
found on line cards.

For example thermal zone for module #9 located at line card #7 will
have type:
mlxsw-lc7-module9.
And thermal zone for gearbox #3 located at line card #5 will have type:
mlxsw-lc5-gearbox3.

Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: core_thermal: Extend internal structures to support multi thermal areas
Vadim Pasternak [Wed, 13 Apr 2022 15:17:30 +0000 (18:17 +0300)]
mlxsw: core_thermal: Extend internal structures to support multi thermal areas

Introduce intermediate level for thermal zones areas.
Currently all thermal zones are associated with thermal objects located
within the main board. Such objects are created during driver
initialization and removed during driver de-initialization.

For line cards in modular system the thermal zones are to be associated
with the specific line card. They should be created whenever new line
card is available (inserted, validated, powered and enabled) and
removed, when line card is getting unavailable.
The thermal objects found on the line card #n are accessed by setting
slot index to #n, while for access to objects found on the main board
slot index should be set to default value zero.

Each thermal area contains the set of thermal zones associated with
particular slot index.
Thus introduction of thermal zone areas allows to use the same APIs for
the main board and line cards, by adding slot index argument.

Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3 years agomlxsw: core_hwmon: Introduce slot parameter in hwmon interfaces
Vadim Pasternak [Wed, 13 Apr 2022 15:17:29 +0000 (18:17 +0300)]
mlxsw: core_hwmon: Introduce slot parameter in hwmon interfaces

Add 'slot' parameter to 'mlxsw_hwmon_dev' structure. Use this parameter
in mlxsw_reg_mtmp_pack(), mlxsw_reg_mtbr_pack(), mlxsw_reg_mgpir_pack()
and mlxsw_reg_mtmp_slot_index_set() routines.
For main board it'll always be zero, for line cards it'll be set to
the physical slot number in modular systems.

Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>