Sean Anderson [Thu, 18 Aug 2022 16:16:25 +0000 (12:16 -0400)]
dt-bindings: net: Convert FMan MAC bindings to yaml
This converts the MAC portion of the FMan MAC bindings to yaml.
Signed-off-by: Sean Anderson <sean.anderson@seco.com> Reviewed-by: Rob Herring <robh@kernel.org> Acked-by: Camelia Groza <camelia.groza@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
selftests: mlxsw: Add ordering tests for unified bridge model
Amit Cohen writes:
Commit 798661c73672 ("Merge branch 'mlxsw-unified-bridge-conversion-part-6'")
converted mlxsw driver to use unified bridge model. In the legacy model,
when a RIF was created / destroyed, it was firmware's responsibility to
update it in the relevant FID classification records. In the unified bridge
model, this responsibility moved to software.
This set adds tests to check the order of configuration for the following
classifications:
1. {Port, VID} -> FID
2. VID -> FID
3. VNI -> FID (after decapsulation)
In addition, in the legacy model, software is responsible to update a
table which is used to determine the packet's egress VID. Add a test to
check that the order of configuration does not impact switch behavior.
See more details in the commit messages.
Note that the tests supposed to pass also using the legacy model, they
are added now as with the new model they test the driver and not the
firmware.
Patch set overview:
Patch #1 adds test for {Port, VID} -> FID
Patch #2 adds test for VID -> FID
Patch #3 adds test for VNI -> FID
Patch #4 adds test for egress VID classification
====================
Amit Cohen [Wed, 17 Aug 2022 15:28:28 +0000 (17:28 +0200)]
selftests: mlxsw: Add egress VID classification test
After routing, the device always consults a table that determines the
packet's egress VID based on {egress RIF, egress local port}. In the
unified bridge model, it is up to software to maintain this table via
REIV register.
The table needs to be updated in the following flows:
1. When a RIF is set on a FID, for each FID's {Port, VID} mapping, a new
{RIF, Port}->VID mapping should be created.
2. When a {Port, VID} is mapped to a FID and the FID already has a RIF,
a new {RIF, Port}->VID mapping should be created.
Add a test to verify that packets get the correct VID after routing,
regardless of the order of the configuration.
# ./egress_vid_classification.sh
TEST: Add RIF for existing {port, VID}->FID mapping [ OK ]
TEST: Add {port, VID}->FID mapping for FID with a RIF [ OK ]
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Amit Cohen [Wed, 17 Aug 2022 15:28:27 +0000 (17:28 +0200)]
selftests: mlxsw: Add ingress RIF configuration test for VXLAN
Before layer 2 forwarding, the device classifies an incoming packet to a
FID. After classification, the FID is known, but also all the attributes of
the FID, such as the router interface (RIF) via which a packet that needs
to be routed will ingress the router block.
For VXLAN decapsulation, the FID classification is done according to the
VNI. When a RIF is added on top of a FID, the existing VNI->FID mapping
should be updated by the software with the new RIF. In addition, when a new
mapping is added for FID which already has a RIF, the correct RIF should
be used for it.
Add a test to verify that packets can be routed after decapsulation which
is done after VNI->FID classification, regardless of the order of the
configuration.
# ./ingress_rif_conf_vxlan.sh
TEST: Add RIF for existing VNI->FID mapping [ OK ]
TEST: Add VNI->FID mapping for FID with a RIF [ OK ]
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Amit Cohen [Wed, 17 Aug 2022 15:28:26 +0000 (17:28 +0200)]
selftests: mlxsw: Add ingress RIF configuration test for 802.1Q bridge
Before layer 2 forwarding, the device classifies an incoming packet to a
FID. After classification, the FID is known, but also all the attributes of
the FID, such as the router interface (RIF) via which a packet that needs
to be routed will ingress the router block.
For VLAN-aware bridges (802.1Q), the FID classification is done according
to VID. When a RIF is added on top of a FID, the existing VID->FID mapping
should be updated by the software with the new RIF.
We never map multiple VLANs to the same FID using VID->FID, so we cannot
create VID->FID for FID which already has a RIF using 802.1Q. Anyway,
verify that packets can be routed via port which is added after the FID
already has a RIF.
Add a test to verify that packets can be routed after VID->FID
classification, regardless of the order of the configuration.
# ./ingress_rif_conf_1q.sh
TEST: Add RIF for existing VID->FID mapping [ OK ]
TEST: Add port to VID->FID mapping for FID with a RIF [ OK ]
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Amit Cohen [Wed, 17 Aug 2022 15:28:25 +0000 (17:28 +0200)]
selftests: mlxsw: Add ingress RIF configuration test for 802.1D bridge
Before layer 2 forwarding, the device classifies an incoming packet to a
FID. After classification, the FID is known, but also all the attributes of
the FID, such as the router interface (RIF) via which a packet that needs
to be routed will ingress the router block.
For VLAN-unaware bridges (802.1D), the FID classification is done according
to {Port, VID}. When a RIF is added on top of a FID, all the existing
{Port, VID}->FID mappings should be updated by the software with the new
RIF. In addition, when a new mapping is added for FID which already has a
RIF, the correct RIF should be used for it.
Add a test to verify that packets can be routed after {Port, VID}->FID
classification, regardless of the order of the configuration.
# ./ingress_rif_conf_1d.sh
TEST: Add RIF for existing {port, VID}->FID mapping [ OK ]
TEST: Add {port, VID}->FID mapping for FID with a RIF [ OK ]
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lorenzo Bianconi [Tue, 16 Aug 2022 17:14:30 +0000 (19:14 +0200)]
net: ethernet: mtk_eth_soc: remove unused txd_pdma pointer in mtk_xdp_submit_frame
Get rid of unnecessary txd_pdma pointer in mtk_xdp_submit_frame for loop
since it is actually used at the end of the routine using latest mtk_tx_dma
consumed pointer as reference.
Emeel Hakim [Thu, 18 Aug 2022 15:32:30 +0000 (18:32 +0300)]
net: macsec: Expose MACSEC_SALT_LEN definition to user space
Expose MACSEC_SALT_LEN definition to user space to be
used in various user space applications such as iproute.
Iproute will use this as part of adding macsec extended
packet number support.
- net: fix suspicious RCU usage in bpf_sk_reuseport_detach()
Current release - new code bugs:
- mlxsw: ptp: fix a couple of races, static checker warnings and
error handling
Previous releases - regressions:
- netfilter:
- nf_tables: fix possible module reference underflow in error path
- make conntrack helpers deal with BIG TCP (skbs > 64kB)
- nfnetlink: re-enable conntrack expectation events
- net: fix potential refcount leak in ndisc_router_discovery()
Previous releases - always broken:
- sched: cls_route: disallow handle of 0
- neigh: fix possible local DoS due to net iface start/stop loop
- rtnetlink: fix module refcount leak in rtnetlink_rcv_msg
- sched: fix adding qlen to qcpu->backlog in gnet_stats_add_queue_cpu
- virtio_net: fix endian-ness for RSS
- dsa: mv88e6060: prevent crash on an unused port
- fec: fix timer capture timing in `fec_ptp_enable_pps()`
- ocelot: stats: fix races, integer wrapping and reading incorrect
registers (the change of register definitions here accounts for
bulk of the changed LoC in this PR)"
* tag 'net-6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (77 commits)
net: moxa: MAC address reading, generating, validity checking
tcp: handle pure FIN case correctly
tcp: refactor tcp_read_skb() a bit
tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()
tcp: fix sock skb accounting in tcp_read_skb()
igb: Add lock to avoid data race
dt-bindings: Fix incorrect "the the" corrections
net: genl: fix error path memory leak in policy dumping
stmmac: intel: Add a missing clk_disable_unprepare() call in intel_eth_pci_remove()
net: ethernet: mtk_eth_soc: fix possible NULL pointer dereference in mtk_xdp_run
net/mlx5e: Allocate flow steering storage during uplink initialization
net: mscc: ocelot: report ndo_get_stats64 from the wraparound-resistant ocelot->stats
net: mscc: ocelot: keep ocelot_stat_layout by reg address, not offset
net: mscc: ocelot: make struct ocelot_stat_layout array indexable
net: mscc: ocelot: fix race between ndo_get_stats64 and ocelot_check_stats_work
net: mscc: ocelot: turn stats_lock into a spinlock
net: mscc: ocelot: fix address of SYS_COUNT_TX_AGING counter
net: mscc: ocelot: fix incorrect ndo_get_stats64 packet counters
net: dsa: felix: fix ethtool 256-511 and 512-1023 TX packet counters
net: dsa: don't warn in dsa_port_set_state_now() when driver doesn't support it
...
Linus Torvalds [Fri, 19 Aug 2022 02:24:57 +0000 (19:24 -0700)]
Merge tag 'linux-kselftest-next-6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull Kselftest fix from Shuah Khan:
- fix landlock test build regression
* tag 'linux-kselftest-next-6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/landlock: fix broken include of linux/landlock.h
Linus Torvalds [Fri, 19 Aug 2022 02:18:28 +0000 (19:18 -0700)]
Merge tag 'trace-rtla-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull rtla tool fixes from Steven Rostedt:
"Fixes for the Real-Time Linux Analysis tooling:
- Fix tracer name in comments and prints
- Fix setting up symlinks
- Allow extra flags to be set in build
- Consolidate and show all necessary libraries not found in build
error"
* tag 'trace-rtla-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
rtla: Consolidate and show all necessary libraries that failed for building
tools/rtla: Build with EXTRA_{C,LD}FLAGS
tools/rtla: Fix command symlinks
rtla: Fix tracer name
====================
Add DT property to disable hibernation mode
The patches add the ability to disable the hibernation mode of AR803x
PHYs. Hibernation mode defaults to enabled after hardware reset on
these PHYs. If the AR803x PHYs enter hibernation mode, they will not
provide any clock. For some MACs, they might need the clocks which
provided by the PHYs to support their own hardware logic.
So, the patches add the support to disable hibernation mode by adding
a boolean:
qca,disable-hibernation-mode
If one wished to disable hibernation mode to better match with the
specifical MAC, just add this property in the phy node of DT.
====================
Wei Fang [Thu, 18 Aug 2022 03:00:54 +0000 (11:00 +0800)]
net: phy: at803x: add disable hibernation mode support
When the cable is unplugged, the Atheros AR803x PHYs will enter
hibernation mode after about 10 seconds if the hibernation mode
is enabled and will not provide any clock to the MAC. But for
some MACs, this feature might cause unexpected issues due to the
logic of MACs.
Taking SYNP MAC (stmmac) as an example, if the cable is unplugged
and the "eth0" interface is down, the AR803x PHY will enter
hibernation mode. Then perform the "ifconfig eth0 up" operation,
the stmmac can't be able to complete the software reset operation
and fail to init it's own DMA. Therefore, the "eth0" interface is
failed to ifconfig up. Why does it cause this issue? The truth is
that the software reset operation of the stmmac is designed to
depend on the RX_CLK of PHY.
So, this patch offers an option for the user to determine whether
to disable the hibernation mode of AR803x PHYs.
Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The hibernation mode of Atheros AR803x PHYs defaults to be
enabled after hardware reset. When the cable is unplugged,
the PHY will enter hibernation mode after about 10 seconds
and the PHY clocks will be stopped to save power.
However, some MACs need the phy output clock for proper
functioning of their logic. For instance, stmmac needs the
RX_CLK of PHY for software reset to complete.
Therefore, add a DT property to configure the PHY to disable
this hardware hibernation mode.
Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sergei Antonov [Thu, 18 Aug 2022 09:23:17 +0000 (12:23 +0300)]
net: moxa: MAC address reading, generating, validity checking
This device does not remember its MAC address, so add a possibility
to get it from the platform. If it fails, generate a random address.
This will provide a MAC address early during boot without user space
being involved.
Also remove extra calls to is_valid_ether_addr().
Made after suggestions by Andrew Lunn:
1) Use eth_hw_addr_random() to assign a random MAC address during probe.
2) Remove is_valid_ether_addr() from moxart_mac_open()
3) Add a call to platform_get_ethdev_address() during probe
4) Remove is_valid_ether_addr() from moxart_set_mac_address(). The core does this
v1 -> v2:
Handle EPROBE_DEFER returned from platform_get_ethdev_address().
Move MAC reading code to the beginning of the probe function.
Signed-off-by: Sergei Antonov <saproj@gmail.com> Suggested-by: Andrew Lunn <andrew@lunn.ch> CC: Yang Yingliang <yangyingliang@huawei.com> CC: Pavel Skripkin <paskripkin@gmail.com> CC: Guobin Huang <huangguobin4@huawei.com> CC: Yang Wei <yang.wei9@zte.com.cn> CC: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/20220818092317.529557-1-saproj@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
tcp: some bug fixes for tcp_read_skb()
This patchset contains 3 bug fixes and 1 minor refactor patch for
tcp_read_skb(). V1 only had the first patch, as Eric prefers to fix all
of them together, I have to group them together.
====================
Cong Wang [Wed, 17 Aug 2022 19:54:45 +0000 (12:54 -0700)]
tcp: handle pure FIN case correctly
When skb->len==0, the recv_actor() returns 0 too, but we also use 0
for error conditions. This patch amends this by propagating the errors
to tcp_read_skb() so that we can distinguish skb->len==0 case from
error cases.
Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()") Reported-by: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cong Wang [Wed, 17 Aug 2022 19:54:44 +0000 (12:54 -0700)]
tcp: refactor tcp_read_skb() a bit
As tcp_read_skb() only reads one skb at a time, the while loop is
unnecessary, we can turn it into an if. This also simplifies the
code logic.
Cc: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cong Wang [Wed, 17 Aug 2022 19:54:43 +0000 (12:54 -0700)]
tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()
tcp_cleanup_rbuf() retrieves the skb from sk_receive_queue, it
assumes the skb is not yet dequeued. This is no longer true for
tcp_read_skb() case where we dequeue the skb first.
Fix this by introducing a helper __tcp_cleanup_rbuf() which does
not require any skb and calling it in tcp_read_skb().
Fixes: 04919bed948d ("tcp: Introduce tcp_read_skb()") Cc: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cong Wang [Wed, 17 Aug 2022 19:54:42 +0000 (12:54 -0700)]
tcp: fix sock skb accounting in tcp_read_skb()
Before commit 965b57b469a5 ("net: Introduce a new proto_ops
->read_skb()"), skb was not dequeued from receive queue hence
when we close TCP socket skb can be just flushed synchronously.
After this commit, we have to uncharge skb immediately after being
dequeued, otherwise it is still charged in the original sock. And we
still need to retain skb->sk, as eBPF programs may extract sock
information from skb->sk. Therefore, we have to call
skb_set_owner_sk_safe() here.
Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()") Reported-and-tested-by: syzbot+a0e6f8738b58f7654417@syzkaller.appspotmail.com Tested-by: Stanislav Fomichev <sdf@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lin Ma [Wed, 17 Aug 2022 18:49:21 +0000 (11:49 -0700)]
igb: Add lock to avoid data race
The commit c23d92b80e0b ("igb: Teardown SR-IOV before
unregister_netdev()") places the unregister_netdev() call after the
igb_disable_sriov() call to avoid functionality issue.
However, it introduces several race conditions when detaching a device.
For example, when .remove() is called, the below interleaving leads to
use-after-free.
To this end, this commit first eliminates the data races from netdev
core by using rtnl_lock (similar to commit 719479230893 ("dpaa2-eth: add
MAC/PHY support through phylink")). And then adds a spinlock to
eliminate races from driver requests. (similar to commit 1e53834ce541
("ixgbe: Add locking to prevent panic when setting sriov_numvfs to zero")
Fixes: c23d92b80e0b ("igb: Teardown SR-IOV before unregister_netdev()") Signed-off-by: Lin Ma <linma@zju.edu.cn> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20220817184921.735244-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 18 Aug 2022 18:02:11 +0000 (11:02 -0700)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2022-08-17 (ice)
This series contains updates to ice driver only.
Grzegorz prevents modifications to VLAN 0 when setting VLAN promiscuous
as it will already be set. He also ignores -EEXIST error when attempting
to set promiscuous and ensures promiscuous mode is properly cleared from
the hardware when being removed.
Benjamin ignores additional -EEXIST errors when setting promiscuous mode
since the existing mode is the desired mode.
Sylwester fixes VFs to allow sending of tagged traffic when no VLAN filters
exist.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
ice: Fix VF not able to send tagged traffic with no VLAN filters
ice: Ignore error message when setting same promiscuous mode
ice: Fix clearing of promisc mode with bridge over bond
ice: Ignore EEXIST when setting promisc mode
ice: Fix double VLAN error when entering promisc mode
====================
Jakub Kicinski [Tue, 16 Aug 2022 16:19:39 +0000 (09:19 -0700)]
net: genl: fix error path memory leak in policy dumping
If construction of the array of policies fails when recording
non-first policy we need to unwind.
netlink_policy_dump_add_policy() itself also needs fixing as
it currently gives up on error without recording the allocated
pointer in the pstate pointer.
Reported-by: syzbot+dc54d9ba8153b216cae0@syzkaller.appspotmail.com Fixes: 50a896cf2d6f ("genetlink: properly support per-op policy dumping") Link: https://lore.kernel.org/r/20220816161939.577583-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
stmmac: intel: Add a missing clk_disable_unprepare() call in intel_eth_pci_remove()
Commit 09f012e64e4b ("stmmac: intel: Fix clock handling on error and remove
paths") removed this clk_disable_unprepare()
This was partly revert by commit ac322f86b56c ("net: stmmac: Fix clock
handling on remove path") which removed this clk_disable_unprepare()
because:
"
While unloading the dwmac-intel driver, clk_disable_unprepare() is
being called twice in stmmac_dvr_remove() and
intel_eth_pci_remove(). This causes kernel panic on the second call.
"
However later on, commit 5ec55823438e8 ("net: stmmac: add clocks management
for gmac driver") has updated stmmac_dvr_remove() which do not call
clk_disable_unprepare() anymore.
So this call should now be called from intel_eth_pci_remove().
Jens Wiklander [Thu, 18 Aug 2022 11:08:59 +0000 (13:08 +0200)]
tee: add overflow check in register_shm_helper()
With special lengths supplied by user space, register_shm_helper() has
an integer overflow when calculating the number of pages covered by a
supplied user space memory region.
This causes internal_get_user_pages_fast() a helper function of
pin_user_pages_fast() to do a NULL pointer dereference:
Leon Romanovsky [Tue, 16 Aug 2022 08:47:23 +0000 (11:47 +0300)]
net/mlx5e: Allocate flow steering storage during uplink initialization
IPsec code relies on valid priv->fs pointer that is the case in NIC
flow, but not correct in uplink. Before commit that mentioned in the
Fixes line, that pointer was valid in all flows as it was allocated
together with priv struct.
In addition, the cleanup representors routine called to that
not-initialized priv->fs pointer and its internals which caused NULL
deference.
So, move FS allocation to be as early as possible.
Jakub Kicinski [Thu, 18 Aug 2022 04:58:48 +0000 (21:58 -0700)]
Merge branch 'fixes-for-ocelot-driver-statistics'
Vladimir Oltean says:
====================
Fixes for Ocelot driver statistics
This series contains bug fixes for the ocelot drivers (both switchdev
and DSA). Some concern the counters exposed to ethtool -S, and others to
the counters exposed to ifconfig. I'm aware that the changes are fairly
large, but I wanted to prioritize on a proper approach to addressing the
issues rather than a quick hack.
Some of the noticed problems:
- bad register offsets for some counters
- unhandled concurrency leading to corrupted counters
- unhandled 32-bit wraparound of ifconfig counters
The issues on the ocelot switchdev driver were noticed through code
inspection, I do not have the hardware to test.
This patch set necessarily converts ocelot->stats_lock from a mutex to a
spinlock. I know this affects Colin Foster's development with the SPI
controlled VSC7512. I have other changes prepared for net-next that
convert this back into a mutex (along with other changes in this area).
====================
Vladimir Oltean [Tue, 16 Aug 2022 13:53:52 +0000 (16:53 +0300)]
net: mscc: ocelot: report ndo_get_stats64 from the wraparound-resistant ocelot->stats
Rather than reading the stats64 counters directly from the 32-bit
hardware, it's better to rely on the output produced by the periodic
ocelot_port_update_stats().
It would be even better to call ocelot_port_update_stats() right from
ocelot_get_stats64() to make sure we report the current values rather
than the ones from 2 seconds ago. But we need to export
ocelot_port_update_stats() from the switch lib towards the switchdev
driver for that, and future work will largely undo that.
There are more ocelot-based drivers waiting to be introduced, an example
of which is the SPI-controlled VSC7512. In that driver's case, it will
be impossible to call ocelot_port_update_stats() from ndo_get_stats64
context, since the latter is atomic, and reading the stats over SPI is
sleepable. So the compromise taken here, which will also hold going
forward, is to report 64-bit counters to stats64, which are not 100% up
to date.
Fixes: a556c76adc05 ("net: mscc: Add initial Ocelot switch support") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Tue, 16 Aug 2022 13:53:51 +0000 (16:53 +0300)]
net: mscc: ocelot: keep ocelot_stat_layout by reg address, not offset
With so many counter addresses recently discovered as being wrong, it is
desirable to at least have a central database of information, rather
than two: one through the SYS_COUNT_* registers (used for
ndo_get_stats64), and the other through the offset field of struct
ocelot_stat_layout elements (used for ethtool -S).
The strategy will be to keep the SYS_COUNT_* definitions as the single
source of truth, but for that we need to expand our current definitions
to cover all registers. Then we need to convert the ocelot region
creation logic, and stats worker, to the read semantics imposed by going
through SYS_COUNT_* absolute register addresses, rather than offsets
of 32-bit words relative to SYS_COUNT_RX_OCTETS (which should have been
SYS_CNT, by the way).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Tue, 16 Aug 2022 13:53:50 +0000 (16:53 +0300)]
net: mscc: ocelot: make struct ocelot_stat_layout array indexable
The ocelot counters are 32-bit and require periodic reading, every 2
seconds, by ocelot_port_update_stats(), so that wraparounds are
detected.
Currently, the counters reported by ocelot_get_stats64() come from the
32-bit hardware counters directly, rather than from the 64-bit
accumulated ocelot->stats, and this is a problem for their integrity.
The strategy is to make ocelot_get_stats64() able to cherry-pick
individual stats from ocelot->stats the way in which it currently reads
them out from SYS_COUNT_* registers. But currently it can't, because
ocelot->stats is an opaque u64 array that's used only to feed data into
ethtool -S.
To solve that problem, we need to make ocelot->stats indexable, and
associate each element with an element of struct ocelot_stat_layout used
by ethtool -S.
This makes ocelot_stat_layout a fat (and possibly sparse) array, so we
need to change the way in which we access it. We no longer need
OCELOT_STAT_END as a sentinel, because we know the array's size
(OCELOT_NUM_STATS). We just need to skip the array elements that were
left unpopulated for the switch revision (ocelot, felix, seville).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Tue, 16 Aug 2022 13:53:49 +0000 (16:53 +0300)]
net: mscc: ocelot: fix race between ndo_get_stats64 and ocelot_check_stats_work
The 2 methods can run concurrently, and one will change the window of
counters (SYS_STAT_CFG_STAT_VIEW) that the other sees. The fix is
similar to what commit 7fbf6795d127 ("net: mscc: ocelot: fix mutex lock
error during ethtool stats read") has done for ethtool -S.
Fixes: a556c76adc05 ("net: mscc: Add initial Ocelot switch support") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Tue, 16 Aug 2022 13:53:48 +0000 (16:53 +0300)]
net: mscc: ocelot: turn stats_lock into a spinlock
ocelot_get_stats64() currently runs unlocked and therefore may collide
with ocelot_port_update_stats() which indirectly accesses the same
counters. However, ocelot_get_stats64() runs in atomic context, and we
cannot simply take the sleepable ocelot->stats_lock mutex. We need to
convert it to an atomic spinlock first. Do that as a preparatory change.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Tue, 16 Aug 2022 13:53:47 +0000 (16:53 +0300)]
net: mscc: ocelot: fix address of SYS_COUNT_TX_AGING counter
This register, used as part of stats->tx_dropped in
ocelot_get_stats64(), has a wrong address. At the address currently
given, there is actually the c_tx_green_prio_6 counter.
Fixes: a556c76adc05 ("net: mscc: Add initial Ocelot switch support") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reading stats using the SYS_COUNT_* register definitions is only used by
ocelot_get_stats64() from the ocelot switchdev driver, however,
currently the bucket definitions are incorrect.
Separately, on both RX and TX, we have the following problems:
- a 256-1023 bucket which actually tracks the 256-511 packets
- the 1024-1526 bucket actually tracks the 512-1023 packets
- the 1527-max bucket actually tracks the 1024-1526 packets
=> nobody tracks the packets from the real 1527-max bucket
Additionally, the RX_PAUSE, RX_CONTROL, RX_LONGS and RX_CLASSIFIED_DROPS
all track the wrong thing. However this doesn't seem to have any
consequence, since ocelot_get_stats64() doesn't use these.
Even though this problem only manifests itself for the switchdev driver,
we cannot split the fix for ocelot and for DSA, since it requires fixing
the bucket definitions from enum ocelot_reg, which makes us necessarily
adapt the structures from felix and seville as well.
Fixes: 84705fc16552 ("net: dsa: felix: introduce support for Seville VSC9953 switch") Fixes: 56051948773e ("net: dsa: ocelot: add driver for Felix switch family") Fixes: a556c76adc05 ("net: mscc: Add initial Ocelot switch support") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
What the driver actually reports as 256-511 is in fact 512-1023, and the
TX packets in the 256-511 bucket are not reported. Fix that.
Fixes: 56051948773e ("net: dsa: ocelot: add driver for Felix switch family") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Tue, 16 Aug 2022 20:14:45 +0000 (23:14 +0300)]
net: dsa: don't warn in dsa_port_set_state_now() when driver doesn't support it
ds->ops->port_stp_state_set() is, like most DSA methods, optional, and
if absent, the port is supposed to remain in the forwarding state (as
standalone). Such is the case with the mv88e6060 driver, which does not
offload the bridge layer. DSA warns that the STP state can't be changed
to FORWARDING as part of dsa_port_enable_rt(), when in fact it should not.
The error message is also not up to modern standards, so take the
opportunity to make it more descriptive.
Fixes: fd3645413197 ("net: dsa: change scope of STP state setter") Reported-by: Sergei Antonov <saproj@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Sergei Antonov <saproj@gmail.com> Link: https://lore.kernel.org/r/20220816201445.1809483-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We've added 45 non-merge commits during the last 14 day(s) which contain
a total of 61 files changed, 986 insertions(+), 372 deletions(-).
The main changes are:
1) New bpf_ktime_get_tai_ns() BPF helper to access CLOCK_TAI, from Kurt
Kanzenbach and Jesper Dangaard Brouer.
2) Few clean ups and improvements for libbpf 1.0, from Andrii Nakryiko.
3) Expose crash_kexec() as kfunc for BPF programs, from Artem Savkov.
4) Add ability to define sleepable-only kfuncs, from Benjamin Tissoires.
5) Teach libbpf's bpf_prog_load() and bpf_map_create() to gracefully handle
unsupported names on old kernels, from Hangbin Liu.
6) Allow opting out from auto-attaching BPF programs by libbpf's BPF skeleton,
from Hao Luo.
7) Relax libbpf's requirement for shared libs to be marked executable, from
Henqgi Chen.
8) Improve bpf_iter internals handling of error returns, from Hao Luo.
9) Few accommodations in libbpf to support GCC-BPF quirks, from James Hilliard.
10) Fix BPF verifier logic around tracking dynptr ref_obj_id, from Joanne Koong.
11) bpftool improvements to handle full BPF program names better, from Manu
Bretelle.
12) bpftool fixes around libcap use, from Quentin Monnet.
13) BPF map internals clean ups and improvements around memory allocations,
from Yafang Shao.
14) Allow to use cgroup_get_from_file() on cgroupv1, allowing BPF cgroup
iterator to work on cgroupv1, from Yosry Ahmed.
15) BPF verifier internal clean ups, from Dave Marchevsky and Joanne Koong.
16) Various fixes and clean ups for selftests/bpf and vmtest.sh, from Daniel
Xu, Artem Savkov, Joanne Koong, Andrii Nakryiko, Shibin Koikkara Reeny.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits)
selftests/bpf: Few fixes for selftests/bpf built in release mode
libbpf: Clean up deprecated and legacy aliases
libbpf: Streamline bpf_attr and perf_event_attr initialization
libbpf: Fix potential NULL dereference when parsing ELF
selftests/bpf: Tests libbpf autoattach APIs
libbpf: Allows disabling auto attach
selftests/bpf: Fix attach point for non-x86 arches in test_progs/lsm
libbpf: Making bpf_prog_load() ignore name if kernel doesn't support
selftests/bpf: Update CI kconfig
selftests/bpf: Add connmark read test
selftests/bpf: Add existing connection bpf_*_ct_lookup() test
bpftool: Clear errno after libcap's checks
bpf: Clear up confusion in bpf_skb_adjust_room()'s documentation
bpftool: Fix a typo in a comment
libbpf: Add names for auxiliary maps
bpf: Use bpf_map_area_alloc consistently on bpf map creation
bpf: Make __GFP_NOWARN consistent in bpf map creation
bpf: Use bpf_map_area_free instread of kvfree
bpf: Remove unneeded memset in queue_stack_map creation
libbpf: preserve errno across pr_warn/pr_info/pr_debug
...
====================
====================
netfilter: conntrack and nf_tables bug fixes
The following patchset contains netfilter fixes for net.
Broken since 5.19:
A few ancient connection tracking helpers assume TCP packets cannot
exceed 64kb in size, but this isn't the case anymore with 5.19 when
BIG TCP got merged, from myself.
Regressions since 5.19:
1. 'conntrack -E expect' won't display anything because nfnetlink failed
to enable events for expectations, only for normal conntrack events.
2. partially revert change that added resched calls to a function that can
be in atomic context. Both broken and fixed up by myself.
Broken for several releases (up to original merge of nf_tables):
Several fixes for nf_tables control plane, from Pablo.
This fixes up resource leaks in error paths and adds more sanity
checks for mutually exclusive attributes/flags.
Kconfig:
NF_CONNTRACK_PROCFS is very old and doesn't provide all info provided
via ctnetlink, so it should not default to y. From Geert Uytterhoeven.
Selftests:
rework nft_flowtable.sh: it frequently indicated failure; the way it
tried to detect an offload failure did not work reliably.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
testing: selftests: nft_flowtable.sh: rework test to detect offload failure
testing: selftests: nft_flowtable.sh: use random netns names
netfilter: conntrack: NF_CONNTRACK_PROCFS should no longer default to y
netfilter: nf_tables: check NFT_SET_CONCAT flag if field_count is specified
netfilter: nf_tables: disallow NFT_SET_ELEM_CATCHALL and NFT_SET_ELEM_INTERVAL_END
netfilter: nf_tables: NFTA_SET_ELEM_KEY_END requires concat and interval flags
netfilter: nf_tables: validate NFTA_SET_ELEM_OBJREF based on NFT_SET_OBJECT flag
netfilter: nf_tables: really skip inactive sets when allocating name
netfilter: nfnetlink: re-enable conntrack expectation events
netfilter: nf_tables: fix scheduling-while-atomic splat
netfilter: nf_ct_irc: cap packet search space to 4k
netfilter: nf_ct_ftp: prefer skb_linearize
netfilter: nf_ct_h323: cap packet size at 64k
netfilter: nf_ct_sane: remove pseudo skb linearization
netfilter: nf_tables: possible module reference underflow in error path
netfilter: nf_tables: disallow NFTA_SET_ELEM_KEY_END with NFT_SET_ELEM_INTERVAL_END flag
netfilter: nf_tables: use READ_ONCE and WRITE_ONCE for shared generation id access
====================
David Howells [Tue, 16 Aug 2022 09:34:40 +0000 (10:34 +0100)]
net: Fix suspicious RCU usage in bpf_sk_reuseport_detach()
bpf_sk_reuseport_detach() calls __rcu_dereference_sk_user_data_with_flags()
to obtain the value of sk->sk_user_data, but that function is only usable
if the RCU read lock is held, and neither that function nor any of its
callers hold it.
Fix this by adding a new helper, __locked_read_sk_user_data_with_flags()
that checks to see if sk->sk_callback_lock() is held and use that here
instead.
Alternatively, making __rcu_dereference_sk_user_data_with_flags() use
rcu_dereference_checked() might suffice.
Without this, the following warning can be occasionally observed:
* tag 'ntfs3_for_6.0' of https://github.com/Paragon-Software-Group/linux-ntfs3: (39 commits)
fs/ntfs3: uninitialized variable in ntfs_set_acl_ex()
fs/ntfs3: Remove unused function wnd_bits
fs/ntfs3: Make ni_ins_new_attr return error
fs/ntfs3: Create MFT zone only if length is large enough
fs/ntfs3: Refactoring attr_insert_range to restore after errors
fs/ntfs3: Refactoring attr_punch_hole to restore after errors
fs/ntfs3: Refactoring attr_set_size to restore after errors
fs/ntfs3: New function ntfs_bad_inode
fs/ntfs3: Make MFT zone less fragmented
fs/ntfs3: Check possible errors in run_pack in advance
fs/ntfs3: Added comments to frecord functions
fs/ntfs3: Fill duplicate info in ni_add_name
fs/ntfs3: Make static function attr_load_runs
fs/ntfs3: Add new argument is_mft to ntfs_mark_rec_free
fs/ntfs3: Remove unused mi_mark_free
fs/ntfs3: Fix very fragmented case in attr_punch_hole
fs/ntfs3: Fix work with fragmented xattr
fs/ntfs3: Make ntfs_fallocate return -ENOSPC instead of -EFBIG
fs/ntfs3: extend ni_insert_nonresident to return inserted ATTR_LIST_ENTRY
fs/ntfs3: Check reserved size for maximum allowed
...
Linus Torvalds [Thu, 11 Aug 2022 22:29:46 +0000 (15:29 -0700)]
dcache: move the DCACHE_OP_COMPARE case out of the __d_lookup_rcu loop
__d_lookup_rcu() is one of the hottest functions in the kernel on
certain loads, and it is complicated by filesystems that might want to
have their own name compare function.
We can improve code generation by moving the test of DCACHE_OP_COMPARE
outside the loop, which makes the loop itself much simpler, at the cost
of some code duplication. But both cases end up being simpler, and the
"native" direct case-sensitive compare particularly so.
Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrii Nakryiko [Tue, 16 Aug 2022 00:19:29 +0000 (17:19 -0700)]
selftests/bpf: Few fixes for selftests/bpf built in release mode
Fix few issues found when building and running test_progs in
release mode.
First, potentially uninitialized idx variable in xskxceiver,
force-initialize to zero to satisfy compiler.
Few instances of defining uprobe trigger functions break in release mode
unless marked as noinline, due to being static. Add noinline to make
sure everything works.
Andrii Nakryiko [Tue, 16 Aug 2022 00:19:27 +0000 (17:19 -0700)]
libbpf: Streamline bpf_attr and perf_event_attr initialization
Make sure that entire libbpf code base is initializing bpf_attr and
perf_event_attr with memset(0). Also for bpf_attr make sure we
clear and pass to kernel only relevant parts of bpf_attr. bpf_attr is
a huge union of independent sub-command attributes, so there is no need
to clear and pass entire union bpf_attr, which over time grows quite
a lot and for most commands this growth is completely irrelevant.
Few cases where we were relying on compiler initialization of BPF UAPI
structs (like bpf_prog_info, bpf_map_info, etc) with `= {};` were
switched to memset(0) pattern for future-proofing.
Arun Ramadoss [Tue, 16 Aug 2022 10:55:16 +0000 (16:25 +0530)]
net: dsa: microchip: ksz9477: fix fdb_dump last invalid entry
In the ksz9477_fdb_dump function it reads the ALU control register and
exit from the timeout loop if there is valid entry or search is
complete. After exiting the loop, it reads the alu entry and report to
the user space irrespective of entry is valid. It works till the valid
entry. If the loop exited when search is complete, it reads the alu
table. The table returns all ones and it is reported to user space. So
bridge fdb show gives ff:ff:ff:ff:ff:ff as last entry for every port.
To fix it, after exiting the loop the entry is reported only if it is
valid one.
Fixes: b987e98e50ab ("dsa: add DSA switch driver for Microchip KSZ9477") Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/20220816105516.18350-1-arun.ramadoss@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: dsa: bcm_sf2: Utilize PHYLINK for all ports
This patch series has the bcm_sf2 driver utilize PHYLINK to configure
the CPU port link parameters to unify the configuration and pave the way
for DSA to utilize PHYLINK for all ports in the future.
Tested on BCM7445 and BCM7278
====================
Florian Fainelli [Mon, 15 Aug 2022 17:50:09 +0000 (10:50 -0700)]
net: dsa: bcm_sf2: Have PHYLINK configure CPU/IMP port(s)
Remove the artificial limitations imposed upon
bcm_sf2_sw_mac_link_{up,down} and allow us to override the link
parameters for IMP port(s) as well as regular ports by accounting for
the special differences that exist there.
Remove the code that did override the link parameters in
bcm_sf2_imp_setup().
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian Fainelli [Mon, 15 Aug 2022 17:50:08 +0000 (10:50 -0700)]
net: dsa: bcm_sf2: Introduce helper for port override offset
Depending upon the generation of switches, we have different offsets for
configuring a given port's status override where link parameters are
applied. Introduce a helper function that we re-use throughout the code
in order to let phylink callbacks configure the IMP/CPU port(s) in
subsequent changes.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hao Luo [Tue, 16 Aug 2022 23:40:11 +0000 (16:40 -0700)]
libbpf: Allows disabling auto attach
Adds libbpf APIs for disabling auto-attach for individual functions.
This is motivated by the use case of cgroup iter [1]. Some iter
types require their parameters to be non-zero, therefore applying
auto-attach on them will fail. With these two new APIs, users who
want to use auto-attach and these types of iters can disable
auto-attach on the program and perform manual attach.
ice: Fix VF not able to send tagged traffic with no VLAN filters
VF was not able to send tagged traffic when it didn't
have any VLAN interfaces and VLAN anti-spoofing was enabled.
Fix this by allowing VFs with no VLAN filters to send tagged
traffic. After VF adds a VLAN interface it will be able to
send tagged traffic matching VLAN filters only.
Testing hints:
1. Spawn VF
2. Send tagged packet from a VF
3. The packet should be sent out and not dropped
4. Add a VLAN interface on VF
5. Send tagged packet on that VLAN interface
6. Packet should be sent out and not dropped
7. Send tagged packet with id different than VLAN interface
8. Packet should be dropped
ice: Ignore error message when setting same promiscuous mode
Commit 1273f89578f2 ("ice: Fix broken IFF_ALLMULTI handling")
introduced new checks when setting/clearing promiscuous mode. But if the
requested promiscuous mode setting already exists, an -EEXIST error
message would be printed. This is incorrect because promiscuous mode is
either on/off and shouldn't print an error when the requested
configuration is already set.
This can happen when removing a bridge with two bonded interfaces and
promiscuous most isn't fully cleared from VLAN VSI in hardware.
Fix this by ignoring cases where requested promiscuous mode exists.
Fixes: 1273f89578f2 ("ice: Fix broken IFF_ALLMULTI handling") Signed-off-by: Benjamin Mikailenko <benjamin.mikailenko@intel.com> Signed-off-by: Grzegorz Siwik <grzegorz.siwik@intel.com> Link: https://lore.kernel.org/all/CAK8fFZ7m-KR57M_rYX6xZN39K89O=LGooYkKsu6HKt0Bs+x6xQ@mail.gmail.com/ Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Grzegorz Siwik [Fri, 12 Aug 2022 13:25:49 +0000 (15:25 +0200)]
ice: Fix clearing of promisc mode with bridge over bond
When at least two interfaces are bonded and a bridge is enabled on the
bond, an error can occur when the bridge is removed and re-added. The
reason for the error is because promiscuous mode was not fully cleared from
the VLAN VSI in the hardware. With this change, promiscuous mode is
properly removed when the bridge disconnects from bonding.
[ 1033.676359] bond1: link status definitely down for interface enp95s0f0, disabling it
[ 1033.676366] bond1: making interface enp175s0f0 the new active one
[ 1033.676369] device enp95s0f0 left promiscuous mode
[ 1033.676522] device enp175s0f0 entered promiscuous mode
[ 1033.676901] ice 0000:af:00.0 enp175s0f0: Error setting Multicast promiscuous mode on VSI 6
[ 1041.795662] ice 0000:af:00.0 enp175s0f0: Error setting Multicast promiscuous mode on VSI 6
[ 1041.944826] bond1: link status definitely down for interface enp175s0f0, disabling it
[ 1041.944874] device enp175s0f0 left promiscuous mode
[ 1041.944918] bond1: now running without any active interface!
Fixes: c31af68a1b94 ("ice: Add outer_vlan_ops and VSI specific VLAN ops implementations") Co-developed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Grzegorz Siwik <grzegorz.siwik@intel.com> Link: https://lore.kernel.org/all/CAK8fFZ7m-KR57M_rYX6xZN39K89O=LGooYkKsu6HKt0Bs+x6xQ@mail.gmail.com/ Tested-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Tested-by: Igor Raits <igor@gooddata.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Grzegorz Siwik [Fri, 12 Aug 2022 13:25:48 +0000 (15:25 +0200)]
ice: Ignore EEXIST when setting promisc mode
Ignore EEXIST error when setting promiscuous mode.
This fix is needed because the driver could set promiscuous mode
when it still has not cleared properly.
Promiscuous mode could be set only once, so setting it second
time will be rejected.
Fixes: 5eda8afd6bcc ("ice: Add support for PF/VF promiscuous mode") Signed-off-by: Grzegorz Siwik <grzegorz.siwik@intel.com> Link: https://lore.kernel.org/all/CAK8fFZ7m-KR57M_rYX6xZN39K89O=LGooYkKsu6HKt0Bs+x6xQ@mail.gmail.com/ Tested-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Tested-by: Igor Raits <igor@gooddata.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Grzegorz Siwik [Fri, 12 Aug 2022 13:25:47 +0000 (15:25 +0200)]
ice: Fix double VLAN error when entering promisc mode
Avoid enabling or disabling VLAN 0 when trying to set promiscuous
VLAN mode if double VLAN mode is enabled. This fix is needed
because the driver tries to add the VLAN 0 filter twice (once for
inner and once for outer) when double VLAN mode is enabled. The
filter program is rejected by the firmware when double VLAN is
enabled, because the promiscuous filter only needs to be set once.
This issue was missed in the initial implementation of double VLAN
mode.
Fixes: 5eda8afd6bcc ("ice: Add support for PF/VF promiscuous mode") Signed-off-by: Grzegorz Siwik <grzegorz.siwik@intel.com> Link: https://lore.kernel.org/all/CAK8fFZ7m-KR57M_rYX6xZN39K89O=LGooYkKsu6HKt0Bs+x6xQ@mail.gmail.com/ Tested-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Tested-by: Igor Raits <igor@gooddata.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Linus Torvalds [Wed, 17 Aug 2022 15:58:54 +0000 (08:58 -0700)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"Most notably this drops the commits that trip up google cloud (turns
out, any legacy device).
Plus a kerneldoc patch"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio: kerneldocs fixes and enhancements
virtio: Revert "virtio: find_vqs() add arg sizes"
virtio_vdpa: Revert "virtio_vdpa: support the arg sizes of find_vqs()"
virtio_pci: Revert "virtio_pci: support the arg sizes of find_vqs()"
virtio-mmio: Revert "virtio_mmio: support the arg sizes of find_vqs()"
virtio: Revert "virtio: add helper virtio_find_vqs_ctx_size()"
virtio_net: Revert "virtio_net: set the default max ring size by find_vqs()"
Florian Westphal [Tue, 16 Aug 2022 12:15:22 +0000 (14:15 +0200)]
testing: selftests: nft_flowtable.sh: rework test to detect offload failure
This test fails on current kernel releases because the flotwable path
now calls dst_check from packet path and will then remove the offload.
Test script has two purposes:
1. check that file (random content) can be sent to other netns (and vv)
2. check that the flow is offloaded (rather than handled by classic
forwarding path).
Since dst_check is in place, 2) fails because the nftables ruleset in
router namespace 1 intentionally blocks traffic under the assumption
that packets are not passed via classic path at all.
Rework this: Instead of blocking traffic, create two named counters, one
for original and one for reverse direction.
The first three test cases are handled by classic forwarding path
(path mtu discovery is disabled and packets exceed MTU).
But all other tests enable PMTUD, so the originator and responder are
expected to lower packet size and flowtable is expected to do the packet
forwarding.
For those tests, check that the packet counters (which are only
incremented for packets that are passed up to classic forward path)
are significantly lower than the file size transferred.
I've tested that the counter-checks fail as expected when the 'flow add'
statement is removed from the ruleset.
M Chetan Kumar [Tue, 16 Aug 2022 04:24:17 +0000 (09:54 +0530)]
net: wwan: t7xx: Devlink documentation
Document the t7xx devlink commands usage for fw flashing &
coredump collection.
Refer to t7xx.rst file for details.
Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com> Signed-off-by: Devegowda Chandrashekar <chandrashekar.devegowda@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
M Chetan Kumar [Tue, 16 Aug 2022 04:24:05 +0000 (09:54 +0530)]
net: wwan: t7xx: Enable devlink based fw flashing and coredump collection
This patch brings-in support for t7xx wwan device firmware flashing &
coredump collection using devlink.
Driver Registers with Devlink framework.
Implements devlink ops flash_update callback that programs modem firmware.
Creates region & snapshot required for device coredump log collection.
On early detection of wwan device in fastboot mode driver sets up CLDMA0 HW
tx/rx queues for raw data transfer then registers with devlink framework.
Upon receiving firmware image & partition details driver sends fastboot
commands for flashing the firmware.
In this flow the fastboot command & response gets exchanged between driver
and device. Once firmware flashing is success completion status is reported
to user space application.
Below is the devlink command usage for firmware flashing
$devlink dev flash pci/$BDF file ABC.img component ABC
Note: ABC.img is the firmware to be programmed to "ABC" partition.
In case of coredump collection when wwan device encounters an exception
it reboots & stays in fastboot mode for coredump collection by host driver.
On detecting exception state driver collects the core dump, creates the
devlink region & reports an event to user space application for dump
collection. The user space application invokes devlink region read command
for dump collection.
Below are the devlink commands used for coredump collection.
devlink region new pci/$BDF/mr_dump
devlink region read pci/$BDF/mr_dump snapshot $ID address $ADD length $LEN
devlink region del pci/$BDF/mr_dump snapshot $ID
Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com> Signed-off-by: Devegowda Chandrashekar <chandrashekar.devegowda@intel.com> Signed-off-by: Mishra Soumya Prakash <soumya.prakash.mishra@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Haijun Liu [Tue, 16 Aug 2022 04:23:53 +0000 (09:53 +0530)]
net: wwan: t7xx: PCIe reset rescan
PCI rescan module implements "rescan work queue". In firmware flashing
or coredump collection procedure WWAN device is programmed to boot in
fastboot mode and a work item is scheduled for removal & detection.
The WWAN device is reset using APCI call as part driver removal flow.
Work queue rescans pci bus at fixed interval for device detection,
later when device is detect work queue exits.
Signed-off-by: Haijun Liu <haijun.liu@mediatek.com> Co-developed-by: Madhusmita Sahu <madhusmita.sahu@intel.com> Signed-off-by: Madhusmita Sahu <madhusmita.sahu@intel.com> Signed-off-by: Ricardo Martinez <ricardo.martinez@linux.intel.com> Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com> Signed-off-by: Devegowda Chandrashekar <chandrashekar.devegowda@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Haijun Liu [Tue, 16 Aug 2022 04:23:40 +0000 (09:53 +0530)]
net: wwan: t7xx: Infrastructure for early port configuration
To support cases such as FW update or Core dump, the t7xx device
is capable of signaling the host that a special port needs
to be created before the handshake phase.
This patch adds the infrastructure required to create the
early ports which also requires a different configuration of
CLDMA queues.
Signed-off-by: Haijun Liu <haijun.liu@mediatek.com> Co-developed-by: Madhusmita Sahu <madhusmita.sahu@intel.com> Signed-off-by: Madhusmita Sahu <madhusmita.sahu@intel.com> Signed-off-by: Ricardo Martinez <ricardo.martinez@linux.intel.com> Signed-off-by: Devegowda Chandrashekar <chandrashekar.devegowda@intel.com> Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Haijun Liu [Tue, 16 Aug 2022 04:23:28 +0000 (09:53 +0530)]
net: wwan: t7xx: Add AP CLDMA
The t7xx device contains two Cross Layer DMA (CLDMA) interfaces to
communicate with AP and Modem processors respectively. So far only
MD-CLDMA was being used, this patch enables AP-CLDMA.
Rename small Application Processor (sAP) to AP.
Signed-off-by: Haijun Liu <haijun.liu@mediatek.com> Co-developed-by: Madhusmita Sahu <madhusmita.sahu@intel.com> Signed-off-by: Madhusmita Sahu <madhusmita.sahu@intel.com> Signed-off-by: Moises Veleta <moises.veleta@linux.intel.com> Signed-off-by: Devegowda Chandrashekar <chandrashekar.devegowda@intel.com> Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 15 Aug 2022 19:07:47 +0000 (12:07 -0700)]
net: phy: broadcom: Implement suspend/resume for AC131 and BCM5241
Implement the suspend/resume procedure for the Broadcom AC131 and BCM5241 type
of PHYs (10/100 only) by entering the standard power down followed by the
proprietary standby mode in the auxiliary mode 4 shadow register. On resume,
the PHY software reset is enough to make it come out of standby mode so we can
utilize brcm_fet_config_init() as the resume hook.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Tue, 16 Aug 2022 00:23:58 +0000 (17:23 -0700)]
tls: rx: react to strparser initialization errors
Even though the normal strparser's init function has a return
value we got away with ignoring errors until now, as it only
validates the parameters and we were passing correct parameters.
tls_strp can fail to init on memory allocation errors, which
syzbot duly induced and reported.
Reported-by: syzbot+abd45eb849b05194b1b6@syzkaller.appspotmail.com Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 17 Aug 2022 09:20:45 +0000 (10:20 +0100)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/nex
t-queue
Tony Nguyen says:
====================
ice: detect and report PTP timestamp issues
Jacob Keller says:
This series fixes a few small issues with the cached PTP Hardware Clock
timestamp used for timestamp extension. It also introduces extra checks to
help detect issues with this logic, such as if the cached timestamp is not
updated within the 2 second window.
This introduces a few statistics similar to the ones already available in
other Intel drivers, including tx_hwtstamp_skipped and tx_hwtstamp_timeouts.
It is intended to aid in debugging issues we're seeing with some setups
which might be related to incorrect cached timestamp values.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
+0 < S 0:0(0) win 32792 <mss 1000,sackOK,FO TFO_COOKIE>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.01 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.02 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.04 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.01 < . 1:1(0) ack 1 win 32792
+0 accept(3, ..., ...) = 4
Signed-off-by: Jie Meng <jmeng@fb.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Tue, 16 Aug 2022 12:15:21 +0000 (14:15 +0200)]
testing: selftests: nft_flowtable.sh: use random netns names
"ns1" is a too generic name, use a random suffix to avoid
errors when such a netns exists. Also allows to run multiple
instances of the script in parallel.
Linus Torvalds [Tue, 16 Aug 2022 18:42:11 +0000 (11:42 -0700)]
Merge tag 'nios2_fixes_v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux
Pull NIOS2 fixes from Dinh Nguyen:
- Security fixes from Al Viro
* tag 'nios2_fixes_v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux:
nios2: add force_successful_syscall_return()
nios2: restarts apply only to the first sigframe we build...
nios2: fix syscall restart checks
nios2: traced syscall does need to check the syscall number
nios2: don't leave NULLs in sys_call_table[]
nios2: page fault et.al. are *not* restartable syscalls...
Linus Torvalds [Tue, 16 Aug 2022 18:40:15 +0000 (11:40 -0700)]
Merge tag 'spi-fix-v6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A few fixes that came in since my pull request, the Meson fix is a
little large since it's fixing all possible cases of the problem that
was observed with the driver and clock API trying to share
configuration by integrating the device clocking fully with the clock
API rather than spot fixing the one instance that was observed"
* tag 'spi-fix-v6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: dt-bindings: Drop Pratyush Yadav
spi: meson-spicc: add local pow2 clock ops to preserve rate between messages
MAINTAINERS: rectify entry for ARM/HPE GXP ARCHITECTURE
spi: spi.c: Add missing __percpu annotations in users of spi_statistics
Linus Torvalds [Tue, 16 Aug 2022 18:36:38 +0000 (11:36 -0700)]
Merge tag 'regulator-fix-v6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"A couple of small fixes that came in since my pull request, nothing
major here"
* tag 'regulator-fix-v6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: core: Fix missing error return from regulator_bulk_get()
regulator: pca9450: Remove restrictions for regulator-name
The exception for the "unaligned access at the end of the page, next
page not mapped" never happens, but the fixup code ends up causing
trouble for compilers to optimize well.
clang in particular ends up seeing it being in the middle of a loop, and
tries desperately to optimize the exception fixup code that is never
really reached.
The simple solution is to just move all the fixups into the exception
handler itself, which moves it all out of the hot case code, and means
that the compiler never sees it or needs to worry about it.
Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hector Martin [Tue, 16 Aug 2022 07:03:11 +0000 (16:03 +0900)]
locking/atomic: Make test_and_*_bit() ordered on failure
These operations are documented as always ordered in
include/asm-generic/bitops/instrumented-atomic.h, and producer-consumer
type use cases where one side needs to ensure a flag is left pending
after some shared data was updated rely on this ordering, even in the
failure case.
This is the case with the workqueue code, which currently suffers from a
reproducible ordering violation on Apple M1 platforms (which are
notoriously out-of-order) that ends up causing the TTY layer to fail to
deliver data to userspace properly under the right conditions. This
change fixes that bug.
Change the documentation to restrict the "no order on failure" story to
the _lock() variant (for which it makes sense), and remove the
early-exit from the generic implementation, which is what causes the
missing barrier semantics in that case. Without this, the remaining
atomic op is fully ordered (including on ARM64 LSE, as of recent
versions of the architecture spec).
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Fixes: e986a0d6cb36 ("locking/atomics, asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs") Fixes: 61e02392d3c7 ("locking/atomic/bitops: Document and clarify ordering semantics for failed test_and_{}_bit()") Signed-off-by: Hector Martin <marcan@marcan.st> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jacob Keller [Wed, 27 Jul 2022 23:16:02 +0000 (16:16 -0700)]
ice: introduce ice_ptp_reset_cached_phctime function
If the PTP hardware clock is adjusted, the ice driver must update the
cached PHC timestamp. This is required in order to perform timestamp
extension on the shorter timestamps captured by the PHY.
Currently, we simply call ice_ptp_update_cached_phctime in the settime and
adjtime callbacks. This has a few issues:
1) if ICE_CFG_BUSY is set because another thread is updating the Rx rings,
we will exit with an error. This is not checked, and the functions do
not re-schedule the update. This could leave the cached timestamp
incorrect until the next scheduled work item execution.
2) even if we did handle an update, any currently outstanding Tx timestamp
would be extended using the wrong cached PHC time. This would produce
incorrect results.
To fix these issues, introduce a new ice_ptp_reset_cached_phctime function.
This function calls the ice_ptp_update_cached_phctime, and discards
outstanding Tx timestamps.
If the ice_ptp_update_cached_phctime function fails because ICE_CFG_BUSY is
set, we log a warning and schedule the thread to execute soon. The update
function is modified so that it always updates the cached copy in the PF
regardless. This ensures we have the most up to date values possible and
minimizes the risk of a packet timestamp being extended with the wrong
value.
It would be nice if we could skip reporting Rx timestamps until the cached
values are up to date. However, we can't access the Rx rings while
ICE_CFG_BUSY is set because they are actively being updated by another
thread.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Wed, 27 Jul 2022 23:16:01 +0000 (16:16 -0700)]
ice: re-arrange some static functions in ice_ptp.c
A following change is going to want to make use of ice_ptp_flush_tx_tracker
earlier in the ice_ptp.c file. To make this work, move the Tx timestamp
tracking functions higher up in the file, and pull the
ice_ptp_update_cached_timestamp function below them. This should have no
functional change.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Wed, 27 Jul 2022 23:16:00 +0000 (16:16 -0700)]
ice: track and warn when PHC update is late
The ice driver requires a cached copy of the PHC time in order to perform
timestamp extension on Tx and Rx hardware timestamp values. This cached PHC
time must always be updated at least once every 2 seconds. Otherwise, the
math used to perform the extension would produce invalid results.
The updates are supposed to occur periodically in the PTP kthread work
item, which is scheduled to run every half second. Thus, we do not expect
an update to be delayed for so long. However, there are error conditions
which can cause the update to be delayed.
Track this situation by using jiffies to determine approximately how long
ago the last update occurred. Add a new statistic and a dev_warn when we
have failed to update the cached PHC time. This makes the error case more
obvious.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Wed, 27 Jul 2022 23:15:59 +0000 (16:15 -0700)]
ice: track Tx timestamp stats similar to other Intel drivers
Several Intel networking drivers which support PTP track when Tx timestamps
are skipped or when they timeout without a timestamp from hardware. The
conditions which could cause these events are rare, but it can be useful to
know when and how often they occur.
Implement similar statistics for the ice driver, tx_hwtstamp_skipped,
tx_hwtstamp_timeouts, and tx_hwtstamp_flushed.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Wed, 27 Jul 2022 23:15:58 +0000 (16:15 -0700)]
ice: initialize cached_phctime when creating Rx rings
When we create new Rx rings, the cached_phctime field is zero initialized.
This could result in incorrect timestamp reporting due to the cached value
not yet being updated. Although a background task will periodically update
the cached value, ensure it matches the existing cached value in the PF
structure at ring initialization.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Jacob Keller [Wed, 27 Jul 2022 23:15:57 +0000 (16:15 -0700)]
ice: set tx_tstamps when creating new Tx rings via ethtool
When the user changes the number of queues via ethtool, the driver
allocates new rings. This allocation did not initialize tx_tstamps. This
results in the tx_tstamps field being zero (due to kcalloc allocation), and
would result in a NULL pointer dereference when attempting a transmit
timestamp on the new ring.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Alan Brady [Tue, 2 Aug 2022 08:19:17 +0000 (10:19 +0200)]
i40e: Fix to stop tx_timeout recovery if GLOBR fails
When a tx_timeout fires, the PF attempts to recover by incrementally
resetting. First we try a PFR, then CORER and finally a GLOBR. If the
GLOBR fails, then we keep hitting the tx_timeout and incrementing the
recovery level and issuing dmesgs, which is both annoying to the user
and accomplishes nothing.
If the GLOBR fails, then we're pretty much totally hosed, and there's
not much else we can do to recover, so this makes it such that we just
kill the VSI and stop hitting the tx_timeout in such a case.
Fixes: 41c445ff0f48 ("i40e: main driver core") Signed-off-by: Alan Brady <alan.brady@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
i40e: Fix tunnel checksum offload with fragmented traffic
Fix checksum offload on VXLAN tunnels.
In case, when mpls protocol is not used, set l4 header to transport
header of skb. This fixes case, when user tries to offload checksums
of VXLAN tunneled traffic.
Steps for reproduction (requires link partner with tunnels):
ip l s enp130s0f0 up
ip a f enp130s0f0
ip a a 10.10.110.2/24 dev enp130s0f0
ip l s enp130s0f0 mtu 1600
ip link add vxlan12_sut type vxlan id 12 group 238.168.100.100 dev \
enp130s0f0 dstport 4789
ip l s vxlan12_sut up
ip a a 20.10.110.2/24 dev vxlan12_sut
iperf3 -c 20.10.110.1 #should connect
Without this patch, TX descriptor was using wrong data, due to
l4 header pointing wrong address. NIC would then drop those packets
internally, due to incorrect TX descriptor data, which increased
GLV_TEPC register.
Fixes: b4fb2d33514a ("i40e: Add support for MPLS + TSO") Signed-off-by: Przemyslaw Patynowski <przemyslawx.patynowski@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>