====================
net/smc: send and write inline optimization for smc
Send cdc msgs and write data inline if qp has sufficent inline
space, helps latency reducing.
In my test environment, which are 2 VMs running on the same
physical host and whose NICs(ConnectX-4Lx) are working on
SR-IOV mode, qperf shows 0.4us-1.3us improvement in latency.
The results shown below:
msgsize before after
1B 11.9 us 10.6 us (-1.3 us)
2B 11.7 us 10.7 us (-1.0 us)
4B 11.7 us 10.7 us (-1.0 us)
8B 11.6 us 10.6 us (-1.0 us)
16B 11.7 us 10.7 us (-1.0 us)
32B 11.7 us 10.6 us (-1.1 us)
64B 11.7 us 11.2 us (-0.5 us)
128B 11.6 us 11.2 us (-0.4 us)
256B 11.8 us 11.2 us (-0.6 us)
512B 11.8 us 11.3 us (-0.5 us)
1KB 11.9 us 11.5 us (-0.4 us)
2KB 12.1 us 11.5 us (-0.6 us)
====================
Guangguan Wang [Mon, 16 May 2022 05:51:37 +0000 (13:51 +0800)]
net/smc: rdma write inline if qp has sufficient inline space
Rdma write with inline flag when sending small packages,
whose length is shorter than the qp's max_inline_data, can
help reducing latency.
In my test environment, which are 2 VMs running on the same
physical host and whose NICs(ConnectX-4Lx) are working on
SR-IOV mode, qperf shows 0.5us-0.7us improvement in latency.
The results shown below:
msgsize before after
1B 11.2 us 10.6 us (-0.6 us)
2B 11.2 us 10.7 us (-0.5 us)
4B 11.3 us 10.7 us (-0.6 us)
8B 11.2 us 10.6 us (-0.6 us)
16B 11.3 us 10.7 us (-0.6 us)
32B 11.3 us 10.6 us (-0.7 us)
64B 11.2 us 11.2 us (0 us)
128B 11.2 us 11.2 us (0 us)
256B 11.2 us 11.2 us (0 us)
512B 11.4 us 11.3 us (-0.1 us)
1KB 11.4 us 11.5 us (0.1 us)
2KB 11.5 us 11.5 us (0 us)
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Tested-by: kernel test robot <lkp@intel.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Guangguan Wang [Mon, 16 May 2022 05:51:36 +0000 (13:51 +0800)]
net/smc: send cdc msg inline if qp has sufficient inline space
As cdc msg's length is 44B, cdc msgs can be sent inline in
most rdma devices, which can help reducing sending latency.
In my test environment, which are 2 VMs running on the same
physical host and whose NICs(ConnectX-4Lx) are working on
SR-IOV mode, qperf shows 0.4us-0.7us improvement in latency.
The results shown below:
msgsize before after
1B 11.9 us 11.2 us (-0.7 us)
2B 11.7 us 11.2 us (-0.5 us)
4B 11.7 us 11.3 us (-0.4 us)
8B 11.6 us 11.2 us (-0.4 us)
16B 11.7 us 11.3 us (-0.4 us)
32B 11.7 us 11.3 us (-0.4 us)
64B 11.7 us 11.2 us (-0.5 us)
128B 11.6 us 11.2 us (-0.4 us)
256B 11.8 us 11.2 us (-0.6 us)
512B 11.8 us 11.4 us (-0.4 us)
1KB 11.9 us 11.4 us (-0.5 us)
2KB 12.1 us 11.5 us (-0.6 us)
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Tested-by: kernel test robot <lkp@intel.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Leszek Polak [Mon, 16 May 2022 07:08:59 +0000 (09:08 +0200)]
net: phy: marvell: Add errata section 5.1 for Alaska PHY
As per Errata Section 5.1, if EEE is intended to be used, some register
writes must be done once after every hardware reset. This patch now adds
the necessary register writes as listed in the Marvell errata.
Without this fix we experience ethernet problems on some of our boards
equipped with a new version of this ethernet PHY (different supplier).
Signed-off-by: Leszek Polak <lpolak@arri.de> Signed-off-by: Stefan Roese <sr@denx.de> Cc: Marek Behún <kabel@kernel.org> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Heiner Kallweit <hkallweit1@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: David S. Miller <davem@davemloft.net> Reviewed-by: Marek Behún <kabel@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20220516070859.549170-1-sr@denx.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Minghao Chi [Mon, 16 May 2022 08:22:51 +0000 (08:22 +0000)]
net: qede: Remove unnecessary synchronize_irq() before free_irq()
Calling synchronize_irq() right before free_irq() is quite useless. On one
hand the IRQ can easily fire again before free_irq() is entered, on the
other hand free_irq() itself calls synchronize_irq() internally (in a race
condition free way), before any state associated with the IRQ is freed.
Minghao Chi [Mon, 16 May 2022 08:19:14 +0000 (08:19 +0000)]
net: vxge: Remove unnecessary synchronize_irq() before free_irq()
Calling synchronize_irq() right before free_irq() is quite useless. On one
hand the IRQ can easily fire again before free_irq() is entered, on the
other hand free_irq() itself calls synchronize_irq() internally (in a race
condition free way), before any state associated with the IRQ is freed.
Minghao Chi [Mon, 16 May 2022 07:26:46 +0000 (07:26 +0000)]
qed: Remove unnecessary synchronize_irq() before free_irq()
Calling synchronize_irq() right before free_irq() is quite useless. On one
hand the IRQ can easily fire again before free_irq() is entered, on the
other hand free_irq() itself calls synchronize_irq() internally (in a race
condition free way), before any state associated with the IRQ is freed.
the first 2 patches are by me and target the CAN raw protocol. The 1st
removes an unneeded assignment, the other one adds support for
SO_TXTIME/SCM_TXTIME.
Oliver Hartkopp contributes 2 patches for the ISOTP protocol. The 1st
adds support for transmission without flow control, the other let's
bind() return an error on incorrect CAN ID formatting.
Geert Uytterhoeven contributes a patch to clean up ctucanfd's Kconfig
file.
Vincent Mailhol's patch for the slcan driver uses the proper function
to check for invalid CAN frames in the xmit callback.
The next patch is by Geert Uytterhoeven and makes the interrupt-names
of the renesas,rcar-canfd dt bindings mandatory.
A patch by my update the ctucanfd dt bindings to include the common
CAN controller bindings.
The last patch is by Akira Yokosawa and fixes a breakage the
ctucanfd's documentation.
* tag 'linux-can-next-for-5.19-20220516' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
docs: ctucanfd: Use 'kernel-figure' directive instead of 'figure'
dt-bindings: can: ctucanfd: include common CAN controller bindings
dt-bindings: can: renesas,rcar-canfd: Make interrupt-names required
can: slcan: slc_xmit(): use can_dropped_invalid_skb() instead of manual check
can: ctucanfd: Let users select instead of depend on CAN_CTUCANFD
can: isotp: isotp_bind(): return -EINVAL on incorrect CAN ID formatting
can: isotp: add support for transmission without flow control
can: raw: add support for SO_TXTIME/SCM_TXTIME
can: raw: raw_sendmsg(): remove not needed setting of skb->sk
====================
Ricardo Martinez [Fri, 13 May 2022 17:34:00 +0000 (10:34 -0700)]
net: skb: Remove skb_data_area_size()
skb_data_area_size() is not needed. As Jakub pointed out [1]:
For Rx, drivers can use the size passed during skb allocation or
use skb_tailroom().
For Tx, drivers should use skb_headlen().
Ricardo Martinez [Fri, 13 May 2022 17:33:59 +0000 (10:33 -0700)]
net: wwan: t7xx: Avoid calls to skb_data_area_size()
skb_data_area_size() helper was used to calculate the size of the
DMA mapped buffer passed to the HW. Instead of doing this, use the
size passed to allocate the skbs.
Signed-off-by: Ricardo Martinez <ricardo.martinez@linux.intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Additional locks are not needed, all the touched sections
are already under mptcp socket lock protection.
Fixes: 4293248c6704 ("mptcp: add data lock for sk timers") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Geliang Tang [Sat, 14 May 2022 00:21:13 +0000 (17:21 -0700)]
selftests: mptcp: fix a mp_fail test warning
Old tc versions (iproute2 5.3) show actions in multiple lines, not a
single line. Then the following unexpected MP_FAIL selftest output
occurs:
file received by server has inverted byte at 169
./mptcp_join.sh: line 1277: [: [{"total acts":1},{"actions":[{"order":0 pedit ,"control_action":{"type":"pipe"}keys 1
index 1 ref 1 bind 1,"installed":0,"last_used":0
key #0 at 148: val ff000000 mask ffffffff
5: integer expression expected
001 Infinite map syn[ ok ] - synack[ ok ] - ack[ ok ]
sum[ ok ] - csum [ ok ]
ftx[ ok ] - failrx[ ok ]
rtx[ ok ] - rstrx [ ok ]
itx[ ok ] - infirx[ ok ]
ftx[ ok ] - failrx[ ok ] invert
This patch adds a 'grep' before 'sed' to fix this.
Akira Yokosawa [Tue, 10 May 2022 23:45:43 +0000 (08:45 +0900)]
docs: ctucanfd: Use 'kernel-figure' directive instead of 'figure'
Two issues were observed in the ReST doc added by commit c3a0addefbde
("docs: ctucanfd: CTU CAN FD open-source IP core documentation.")
with Sphinx versions 2.4.4 and 4.5.0.
The plain "figure" directive broke "make pdfdocs" due to a missing
PDF figure. For conversion of SVG -> PDF to work, the "kernel-figure"
directive, which is an extension for kernel documentation, should
be used instead.
The directive of "code:: raw" causes a warning from both
"make htmldocs" and "make pdfdocs", which reads:
[...]/can/ctu/ctucanfd-driver.rst:75: WARNING: Pygments lexer name
'raw' is not known
A plain literal-block marker should suffice where no syntax
highlighting is intended.
Fix the issues by using suitable directive and marker.
Fixes: c3a0addefbde ("docs: ctucanfd: CTU CAN FD open-source IP core documentation.") Link: https://lore.kernel.org/all/5986752a-1c2a-5d64-f91d-58b1e6decd17@gmail.com Signed-off-by: Akira Yokosawa <akiyks@gmail.com> Acked-by: Pavel Pisa <pisa@cmp.felk.cvut.cz> Cc: Martin Jerabek <martin.jerabek01@gmail.com> Cc: Ondrej Ille <ondrej.ille@gmail.com> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
there is a common CAN controller binding. Add this to the ctucanfd
binding.
Cc: Ondrej Ille <ondrej.ille@gmail.com> Acked-by: Pavel Pisa <pisa@cmp.felk.cvut.cz> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
dt-bindings: can: renesas,rcar-canfd: Make interrupt-names required
The Renesas R-Car CAN FD Controller always uses two or more interrupts.
Make the interrupt-names properties a required property, to make it
easier to identify the individual interrupts.
can: ctucanfd: Let users select instead of depend on CAN_CTUCANFD
The CTU CAN-FD IP core is only useful when used with one of the
corresponding PCI/PCIe or platform (FPGA, SoC) drivers, which depend on
PCI resp. OF.
Hence make the users select the core driver code, instead of letting
then depend on it. Keep the core code config option visible when
compile-testing, to maintain compile-coverage.
Oliver Hartkopp [Sun, 15 May 2022 18:16:33 +0000 (20:16 +0200)]
can: isotp: isotp_bind(): return -EINVAL on incorrect CAN ID formatting
Commit 3ea566422cbd ("can: isotp: sanitize CAN ID checks in
isotp_bind()") checks the given CAN ID address information by
sanitizing the input values.
This check (silently) removes obsolete bits by masking the given CAN
IDs.
Derek Will suggested to give a feedback to the application programmer
when the 'sanitizing' was actually needed which means the programmer
provided CAN ID content in a wrong format (e.g. SFF CAN IDs with a CAN
ID > 0x7FF).
Oliver Hartkopp [Sat, 7 May 2022 11:55:58 +0000 (13:55 +0200)]
can: isotp: add support for transmission without flow control
Usually the ISO 15765-2 protocol is a point-to-point protocol to transfer
segmented PDUs to a dedicated receiver. This receiver sends a flow control
message to specify protocol options and timings (e.g. block size / STmin).
The so called functional addressing communication allows a 1:N
communication but is limited to a single frame length.
This new CAN_ISOTP_CF_BROADCAST allows an unconfirmed 1:N communication
with PDU length that would not fit into a single frame. This feature is
not covered by the ISO 15765-2 standard.
This patch calls into sock_cmsg_send() to parse the user supplied
control information into a struct sockcm_cookie. Then assign the
requested transmit time to the skb.
This makes it possible to use the Earliest TXTIME First (ETF) packet
scheduler with the CAN_RAW protocol. The user can send a CAN_RAW frame
with a TXTIME and the kernel (with the ETF scheduler) will take care
of sending it to the network interface.
can: raw: raw_sendmsg(): remove not needed setting of skb->sk
The skb in raw_sendmsg() is allocated with sock_alloc_send_skb(),
which subsequently calls sock_alloc_send_pskb() -> skb_set_owner_w(),
which assigns "skb->sk = sk".
This patch removes the not needed setting of skb->sk.
Fabio Estevam [Fri, 13 May 2022 11:46:13 +0000 (08:46 -0300)]
net: phy: micrel: Use the kszphy probe/suspend/resume
Now that it is possible to use .probe without having .driver_data, let
KSZ8061 use the kszphy specific hooks for probe,suspend and resume,
which is preferred.
Switch to using the dedicated kszphy probe/suspend/resume functions.
Minghao Chi [Fri, 13 May 2022 08:19:18 +0000 (08:19 +0000)]
octeontx2-pf: Remove unnecessary synchronize_irq() before free_irq()
Calling synchronize_irq() right before free_irq() is quite useless. On one
hand the IRQ can easily fire again before free_irq() is entered, on the
other hand free_irq() itself calls synchronize_irq() internally (in a race
condition free way), before any state associated with the IRQ is freed.
Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
Zheng Bin [Fri, 13 May 2022 07:10:18 +0000 (15:10 +0800)]
octeon_ep: add missing destroy_workqueue in octep_init_module
octep_init_module misses destroy_workqueue in error path,
this patch fixes that.
Fixes: 862cd659a6fb ("octeon_ep: Add driver framework and device initialization") Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
While testing this recently added feature on a variety
of platforms/configurations, I found the following issues:
1) A race leading to concurrent calls to smp_call_function_single_async()
2) Missed opportunity to use napi_consume_skb()
3) Need to limit the max length of the per-cpu lists.
4) Process the per-cpu list more frequently, for the
(unusual) case where net_rx_action() has mutiple
napi_poll() to process per round.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 16 May 2022 04:24:55 +0000 (21:24 -0700)]
net: add skb_defer_max sysctl
commit 68822bdf76f1 ("net: generalize skb freeing
deferral to per-cpu lists") added another per-cpu
cache of skbs. It was expected to be small,
and an IPI was forced whenever the list reached 128
skbs.
We might need to be able to control more precisely
queue capacity and added latency.
An IPI is generated whenever queue reaches half capacity.
Default value of the new limit is 64.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 16 May 2022 04:24:54 +0000 (21:24 -0700)]
net: use napi_consume_skb() in skb_defer_free_flush()
skb_defer_free_flush() runs from softirq context,
we have the opportunity to refill the napi_alloc_cache,
and/or use kmem_cache_free_bulk() when this cache is full.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 16 May 2022 04:24:53 +0000 (21:24 -0700)]
net: fix possible race in skb_attempt_defer_free()
A cpu can observe sd->defer_count reaching 128,
and call smp_call_function_single_async()
Problem is that the remote CPU can clear sd->defer_count
before the IPI is run/acknowledged.
Other cpus can queue more packets and also decide
to call smp_call_function_single_async() while the pending
IPI was not yet delivered.
This is a common issue with smp_call_function_single_async().
Callers must ensure correct synchronization and serialization.
I triggered this issue while experimenting smaller threshold.
Performing the call to smp_call_function_single_async()
under sd->defer_lock protection did not solve the problem.
Commit 5a18ceca6350 ("smp: Allow smp_call_function_single_async()
to insert locked csd") replaced an informative WARN_ON_ONCE()
with a return of -EBUSY, which is often ignored.
Test of CSD_FLAG_LOCK presence is racy anyway.
Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ 274.452394] tulip0: no phy info, aborting mtable build
[ 274.499041] tulip0: MII transceiver #1 config 1000 status 782d advertising 01e1
[ 274.750691] net eth0: Digital DS21142/43 Tulip rev 65 at MMIO 0xf4008000, 00:30:6e:08:7d:21, IRQ 17
[ 283.104520] net eth0: Setting full-duplex based on MII#1 link partner capability of c1e1
Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Zheng Bin [Fri, 13 May 2022 07:09:22 +0000 (15:09 +0800)]
net: hinic: add missing destroy_workqueue in hinic_pf_to_mgmt_init
hinic_pf_to_mgmt_init misses destroy_workqueue in error path,
this patch fixes that.
Fixes: 6dbb89014dc3 ("hinic: fix sending mailbox timeout in aeq event work") Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 16 May 2022 09:47:44 +0000 (10:47 +0100)]
Merge branch 'skb-drop-reason-boundary'
Menglong Dong says:
====================
net: skb: check the boundrary of skb drop reason
In the commit 1330b6ef3313 ("skb: make drop reason booleanable"),
SKB_NOT_DROPPED_YET is added to the enum skb_drop_reason, which makes
the invalid drop reason SKB_NOT_DROPPED_YET can leak to the kfree_skb
tracepoint. Once this happen (it happened, as 4th patch says), it can
cause NULL pointer in drop monitor and result in kernel panic.
Therefore, check the boundrary of drop reason in both kfree_skb_reason
(2th patch) and drop monitor (1th patch) to prevent such case happens
again.
Meanwhile, fix the invalid drop reason passed to kfree_skb_reason() in
tcp_v4_rcv() and tcp_v6_rcv().
Changes since v2:
1/4 - don't reset the reason and print the debug warning only (Jakub
Kicinski)
4/4 - remove new lines between tags
Changes since v1:
- consider tcp_v6_rcv() in the 4th patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Menglong Dong [Fri, 13 May 2022 03:03:39 +0000 (11:03 +0800)]
net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv()
The 'drop_reason' that passed to kfree_skb_reason() in tcp_v4_rcv()
and tcp_v6_rcv() can be SKB_NOT_DROPPED_YET(0), as it is used as the
return value of tcp_inbound_md5_hash().
And it can panic the kernel with NULL pointer in
net_dm_packet_report_size() if the reason is 0, as drop_reasons[0]
is NULL.
Fixes: 1330b6ef3313 ("skb: make drop reason booleanable") Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Menglong Dong [Fri, 13 May 2022 03:03:38 +0000 (11:03 +0800)]
net: skb: change the definition SKB_DR_SET()
The SKB_DR_OR() is used to set the drop reason to a value when it is
not set or specified yet. SKB_NOT_DROPPED_YET should also be considered
as not set.
Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Menglong Dong [Fri, 13 May 2022 03:03:37 +0000 (11:03 +0800)]
net: skb: check the boundrary of drop reason in kfree_skb_reason()
Sometimes, we may forget to reset skb drop reason to NOT_SPECIFIED after
we make it the return value of the functions with return type of enum
skb_drop_reason, such as tcp_inbound_md5_hash. Therefore, its value can
be SKB_NOT_DROPPED_YET(0), which is invalid for kfree_skb_reason().
So we check the range of drop reason in kfree_skb_reason() with
DEBUG_NET_WARN_ON_ONCE().
Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Menglong Dong [Fri, 13 May 2022 03:03:36 +0000 (11:03 +0800)]
net: dm: check the boundary of skb drop reasons
The 'reason' will be set to 'SKB_DROP_REASON_NOT_SPECIFIED' if it not
small that SKB_DROP_REASON_MAX in net_dm_packet_trace_kfree_skb_hit(),
but it can't avoid it to be 0, which is invalid and can cause NULL
pointer in drop_reasons.
Therefore, reset it to SKB_DROP_REASON_NOT_SPECIFIED when 'reason <= 0'.
Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Guangguan Wang [Fri, 13 May 2022 02:24:53 +0000 (10:24 +0800)]
net/smc: align the connect behaviour with TCP
Connect with O_NONBLOCK will not be completed immediately
and returns -EINPROGRESS. It is possible to use selector/poll
for completion by selecting the socket for writing. After select
indicates writability, a second connect function call will return
0 to indicate connected successfully as TCP does, but smc returns
-EISCONN. Use socket state for smc to indicate connect state, which
can help smc aligning the connect behaviour with TCP.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 16 May 2022 09:31:06 +0000 (10:31 +0100)]
Merge branch 'sk_bound_dev_if-annotations'
Eric Dumazet says:
====================
net: add annotations for sk->sk_bound_dev_if
While writes on sk->sk_bound_dev_if are protected by socket lock,
we have many lockless reads all over the places.
This is based on syzbot report found in the first patch changelog.
v2: inline ipv6 function only defined if IS_ENABLED(CONFIG_IPV6) (kernel bots)
Change the INET6_MATCH() to inet6_match(), this is no longer a macro.
Change INET_MATCH() to inet_match() (Olivier Hartkopp & Jakub Kicinski)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 13 May 2022 18:55:50 +0000 (11:55 -0700)]
inet: rename INET_MATCH()
This is no longer a macro, but an inlined function.
INET_MATCH() -> inet_match()
Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Olivier Hartkopp <socketcan@hartkopp.net> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 13 May 2022 18:55:41 +0000 (11:55 -0700)]
net: annotate races around sk->sk_bound_dev_if
UDP sendmsg() is lockless, and reads sk->sk_bound_dev_if while
this field can be changed by another thread.
Adds minimal annotations to avoid KCSAN splats for UDP.
Following patches will add more annotations to potential lockless readers.
BUG: KCSAN: data-race in __ip6_datagram_connect / udpv6_sendmsg
write to 0xffff888136d47a94 of 4 bytes by task 7681 on cpu 0:
__ip6_datagram_connect+0x6e2/0x930 net/ipv6/datagram.c:221
ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272
inet_dgram_connect+0x107/0x190 net/ipv4/af_inet.c:576
__sys_connect_file net/socket.c:1900 [inline]
__sys_connect+0x197/0x1b0 net/socket.c:1917
__do_sys_connect net/socket.c:1927 [inline]
__se_sys_connect net/socket.c:1924 [inline]
__x64_sys_connect+0x3d/0x50 net/socket.c:1924
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff888136d47a94 of 4 bytes by task 7670 on cpu 1:
udpv6_sendmsg+0xc60/0x16e0 net/ipv6/udp.c:1436
inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:652
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg net/socket.c:725 [inline]
____sys_sendmsg+0x39a/0x510 net/socket.c:2413
___sys_sendmsg net/socket.c:2467 [inline]
__sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
__do_sys_sendmmsg net/socket.c:2582 [inline]
__se_sys_sendmmsg net/socket.c:2579 [inline]
__x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x00000000 -> 0xffffff9b
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 7670 Comm: syz-executor.3 Tainted: G W 5.18.0-rc1-syzkaller-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
I chose to not add Fixes: tag because race has minor consequences
and stable teams busy enough.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 13 May 2022 18:34:08 +0000 (11:34 -0700)]
mlx5: support BIG TCP packets
mlx5 supports LSOv2.
IPv6 gro/tcp stacks insert a temporary Hop-by-Hop header
with JUMBO TLV for big packets.
We need to ignore/skip this HBH header when populating TX descriptor.
Note that ipv6_has_hopopt_jumbo() only recognizes very specific packet
layout, thus mlx5e_sq_xmit_wqe() is taking care of this layout only.
v7: adopt unsafe_memcpy() and MLX5_UNSAFE_MEMCPY_DISCLAIMER
v2: clear hopbyhop in mlx5e_tx_get_gso_ihs()
v4: fix compile error for CONFIG_MLX5_CORE_IPOIB=y
Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Leon Romanovsky <leon@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 13 May 2022 18:34:07 +0000 (11:34 -0700)]
mlx4: support BIG TCP packets
mlx4 supports LSOv2 just fine.
IPv6 stack inserts a temporary Hop-by-Hop header
with JUMBO TLV for big packets.
We need to ignore the HBH header when populating TX descriptor.
Tested:
Before: (not enabling bigger TSO/GRO packets)
ip link set dev eth0 gso_max_size 65536 gro_max_size 65536
netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind
Local /Remote
Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem
Send Recv Size Size Time Rate local remote local remote
bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr
ip link set dev eth0 gso_max_size 185000 gro_max_size 185000
netperf -H lpaa18 -t TCP_RR -T2,2 -l 10 -Cc -- -r 70000,70000
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to lpaa18.prod.google.com () port 0 AF_INET6 : first burst 0 : cpu bind
Local /Remote
Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem
Send Recv Size Size Time Rate local remote local remote
bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 13 May 2022 18:34:06 +0000 (11:34 -0700)]
veth: enable BIG TCP packets
Set the TSO driver limit to GSO_MAX_SIZE (512 KB).
This allows the admin/user to set a GSO limit up to this value.
ip link set dev veth10 gso_max_size 200000
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Coco Li [Fri, 13 May 2022 18:34:04 +0000 (11:34 -0700)]
ipv6: Add hop-by-hop header to jumbograms in ip6_output
Instead of simply forcing a 0 payload_len in IPv6 header,
implement RFC 2675 and insert a custom extension header.
Note that only TCP stack is currently potentially generating
jumbograms, and that this extension header is purely local,
it wont be sent on a physical link.
This is needed so that packet capture (tcpdump and friends)
can properly dissect these large packets.
Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 13 May 2022 18:34:03 +0000 (11:34 -0700)]
net: allow gro_max_size to exceed 65536
Allow the gro_max_size to exceed a value larger than 65536.
There weren't really any external limitations that prevented this other
than the fact that IPv4 only supports a 16 bit length field. Since we have
the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to
exceed this value and for IPv4 and non-TCP flows we can cap things at 65536
via a constant rather than relying on gro_max_size.
[edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 13 May 2022 18:34:02 +0000 (11:34 -0700)]
ipv6/gro: insert temporary HBH/jumbo header
Following patch will add GRO_IPV6_MAX_SIZE, allowing gro to build
BIG TCP ipv6 packets (bigger than 64K).
This patch changes ipv6_gro_complete() to insert a HBH/jumbo header
so that resulting packet can go through IPv6/TCP stacks.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 13 May 2022 18:34:01 +0000 (11:34 -0700)]
ipv6/gso: remove temporary HBH/jumbo header
ipv6 tcp and gro stacks will soon be able to build big TCP packets,
with an added temporary Hop By Hop header.
If GSO is involved for these large packets, we need to remove
the temporary HBH header before segmentation happens.
v2: perform HBH removal from ipv6_gso_segment() instead of
skb_segment() (Alexander feedback)
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 13 May 2022 18:34:00 +0000 (11:34 -0700)]
ipv6: add struct hop_jumbo_hdr definition
Following patches will need to add and remove local IPv6 jumbogram
options to enable BIG TCP.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 13 May 2022 18:33:59 +0000 (11:33 -0700)]
tcp_cubic: make hystart_ack_delay() aware of BIG TCP
hystart_ack_delay() had the assumption that a TSO packet
would not be bigger than GSO_MAX_SIZE.
This will no longer be true.
We should use sk->sk_gso_max_size instead.
This reduces chances of spurious Hystart ACK train detections.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 13 May 2022 18:33:58 +0000 (11:33 -0700)]
net: limit GSO_MAX_SIZE to 524280 bytes
Make sure we will not overflow shinfo->gso_segs
Minimal TCP MSS size is 8 bytes, and shinfo->gso_segs
is a 16bit field.
TCP_MIN_GSO_SIZE is currently defined in include/net/tcp.h,
it seems cleaner to not bring tcp details into include/linux/netdevice.h
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 13 May 2022 18:33:57 +0000 (11:33 -0700)]
net: allow gso_max_size to exceed 65536
The code for gso_max_size was added originally to allow for debugging and
workaround of buggy devices that couldn't support TSO with blocks 64K in
size. The original reason for limiting it to 64K was because that was the
existing limits of IPv4 and non-jumbogram IPv6 length fields.
With the addition of Big TCP we can remove this limit and allow the value
to potentially go up to UINT_MAX and instead be limited by the tso_max_size
value.
So in order to support this we need to go through and clean up the
remaining users of the gso_max_size value so that the values will cap at
64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
limit for GSO_MAX_SIZE.
v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
in a new sk_trim_gso_size() helper.
netif_set_tso_max_size() caps the requested TSO size
with GSO_MAX_SIZE.
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 13 May 2022 18:33:56 +0000 (11:33 -0700)]
net: add IFLA_TSO_{MAX_SIZE|SEGS} attributes
New netlink attributes IFLA_TSO_MAX_SIZE and IFLA_TSO_MAX_SEGS
are used to report to user-space the device TSO limits.
ip -d link sh dev eth1
...
tso_max_size 65536 tso_max_segs 65535
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 16 May 2022 09:14:27 +0000 (10:14 +0100)]
Merge branch 'Renesas-RSZ-V2M-support'
Phil Edworthy says:
====================
Add Renesas RZ/V2M Ethernet support
The RZ/V2M Ethernet is very similar to R-Car Gen3 Ethernet-AVB, though
some small parts are the same as R-Car Gen2.
Other differences are:
* It has separate data (DI), error (Line 1) and management (Line 2) irqs
rather than one irq for all three.
* Instead of using the High-speed peripheral bus clock for gPTP, it has
a separate gPTP reference clock.
v4:
* Add clk_disable_unprepare() for gptp ref clk
v3:
* Really renamed irq_en_dis_regs to irq_en_dis this time
* Modified ravb_ptp_extts() to use irq_en_dis
* Added Reviewed-by tags
v2:
* Just net patches in this series
* Instead of reusing ch22 and ch24 interrupt names, use the proper names
* Renamed irq_en_dis_regs to irq_en_dis
* Squashed use of GIC reg versus GIE/GID and got rid of separate gptp_ptm_gic feature.
* Move err_mgmt_irqs code under multi_irqs
* Minor editing of the commit msgs
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Phil Edworthy [Thu, 12 May 2022 11:47:22 +0000 (12:47 +0100)]
ravb: Add support for RZ/V2M
RZ/V2M Ethernet is very similar to R-Car Gen3 Ethernet-AVB, though
some small parts are the same as R-Car Gen2.
Other differences to R-Car Gen3 and Gen2 are:
* It has separate data (DI), error (Line 1) and management (Line 2) irqs
rather than one irq for all three.
* Instead of using the High-speed peripheral bus clock for gPTP, it has a
separate gPTP reference clock.
Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com> Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
Phil Edworthy [Thu, 12 May 2022 11:47:21 +0000 (12:47 +0100)]
ravb: Use separate clock for gPTP
RZ/V2M has a separate gPTP reference clock that is used when the
AVB-DMAC Mode Register (CCC) gPTP Clock Select (CSEL) bits are
set to "01: High-speed peripheral bus clock".
Therefore, add a feature that allows this clock to be used for
gPTP.
Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com> Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
Phil Edworthy [Thu, 12 May 2022 11:47:20 +0000 (12:47 +0100)]
ravb: Support separate Line0 (Desc), Line1 (Err) and Line2 (Mgmt) irqs
R-Car has a combined interrupt line, ch22 = Line0_DiA | Line1_A | Line2_A.
RZ/V2M has separate interrupt lines for each of these, so add a feature
that allows the driver to get these interrupts and call the common handler.
Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com> Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
Phil Edworthy [Thu, 12 May 2022 11:47:19 +0000 (12:47 +0100)]
ravb: Separate handling of irq enable/disable regs into feature
Currently, when the HW has a single interrupt, the driver uses the
GIC, TIC, RIC0 registers to enable and disable interrupts.
When the HW has multiple interrupts, it uses the GIE, GID, TIE, TID,
RIE0, RID0 registers.
However, other devices, e.g. RZ/V2M, have multiple irqs and only have
the GIC, TIC, RIC0 registers.
Therefore, split this into a separate feature.
Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com> Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
Document the Ethernet AVB IP found on RZ/V2M SoC.
It includes the Ethernet controller (E-MAC) and Dedicated Direct memory
access controller (DMAC) for transferring transmitted Ethernet frames
to and received Ethernet frames from respective storage areas in the
RAM at high speed.
The AVB-DMAC is compliant with IEEE 802.1BA, IEEE 802.1AS timing and
synchronization protocol, IEEE 802.1Qav real-time transfer, and the
IEEE 802.1Qat stream reservation protocol.
R-Car has a pair of combined interrupt lines:
ch22 = Line0_DiA | Line1_A | Line2_A
ch23 = Line0_DiB | Line1_B | Line2_B
Line0 for descriptor interrupts (which we call dia and dib).
Line1 for error related interrupts (which we call err_a and err_b).
Line2 for management and gPTP related interrupts (mgmt_a and mgmt_b).
RZ/V2M hardware has separate interrupt lines for each of these.
It has 3 clocks; the main AXI clock, the AMBA CHI (Coherent Hub
Interface) clock and a gPTP reference clock.
Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com> Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
ice: Expose RSS indirection tables for queue groups via ethtool
When ADQ queue groups (TCs) are created via tc mqprio command,
RSS contexts and associated RSS indirection tables are configured
automatically per TC based on the queue ranges specified for
each traffic class.
will create 3 queue groups (TC 0-2) with queue ranges 2, 8 and 4
in 3 queue groups. Each queue group is associated with its
own RSS context and RSS indirection table.
Add support to expose RSS indirection tables for all ADQ queue
groups using ethtool RSS contexts interface.
ethtool -x enp175s0f0 context <tc-num>
1) Cleanups for the code behind the XFRM offload API. This is a
preparation for the extension of the API for policy offload.
From Leon Romanovsky.
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: drop not needed flags variable in XFRM offload struct
net/mlx5e: Use XFRM state direction instead of flags
netdevsim: rely on XFRM state direction instead of flags
ixgbe: propagate XFRM offload state direction instead of flags
xfrm: store and rely on direction to construct offload flags
xfrm: rename xfrm_state_offload struct to allow reuse
xfrm: delete not used number of external headers
xfrm: free not used XFRM_ESP_NO_TRAILER flag
====================
Ren Zhijie [Fri, 13 May 2022 01:27:21 +0000 (09:27 +0800)]
sfc: siena: Fix Kconfig dependencies
If CONFIG_PTP_1588_CLOCK=m and CONFIG_SFC_SIENA=y, the siena driver will fail to link:
drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_ptp_remove_channel':
ptp.c:(.text+0xa28): undefined reference to `ptp_clock_unregister'
drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_ptp_probe_channel':
ptp.c:(.text+0x13a0): undefined reference to `ptp_clock_register'
ptp.c:(.text+0x1470): undefined reference to `ptp_clock_unregister'
drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_ptp_pps_worker':
ptp.c:(.text+0x1d29): undefined reference to `ptp_clock_event'
drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_siena_ptp_get_ts_info':
ptp.c:(.text+0x301b): undefined reference to `ptp_clock_index'
To fix this build error, make SFC_SIENA depends on PTP_1588_CLOCK.
Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: d48523cb88e0 ("sfc: Copy shared files needed for Siena (part 2)") Signed-off-by: Ren Zhijie <renzhijie2@huawei.com> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Link: https://lore.kernel.org/r/20220513012721.140871-1-renzhijie2@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sven Auhagen [Wed, 27 Apr 2022 07:15:15 +0000 (09:15 +0200)]
netfilter: flowtable: nft_flow_route use more data for reverse route
When creating a flow table entry, the reverse route is looked
up based on the current packet.
There can be scenarios where the user creates a custom ip rule
to route the traffic differently.
In order to support those scenarios, the lookup needs to add
more information based on the current packet.
The patch adds multiple new information to the route lookup.
Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: prefer extension check to pointer check
The pointer check usually results in a 'false positive': its likely
that the ctnetlink module is loaded but no event monitoring is enabled.
After recent change to autodetect ctnetlink usage and only allocate
the ecache extension if a listener is active, check if the extension
is present on a given conntrack.
If its not there, there is nothing to report and calls to the
notification framework can be elided.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This adds the new nf_conntrack_events=2 mode and makes it the
default.
This leverages the earlier flag in struct net to allow to avoid
the event extension as long as no event listener is active in
the namespace.
This avoids, for most cases, allocation of ct->ext area.
A followup patch will take further advantage of this by avoiding
calls down into the event framework if the extension isn't present.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Only called when new ct is allocated or the extension isn't present.
This function will be extended, place this in the conntrack module
instead of inlining.
The callers already depend on nf_conntrack module.
Return value is changed to bool, noone used the returned pointer.
Make sure that the core drops the newly allocated conntrack
if the extension is requested but can't be added.
This makes it necessary to ifdef the section, as the stub
always returns false we'd drop every new conntrack if the
the ecache extension is disabled in kconfig.
Add from data path (xt_CT, nft_ct) is unchanged.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A. If entry is totally new, then the rcu-protected resource
must already have been removed from global visibility before call
to nf_ct_iterate_destroy.
B. If entry was allocated before, but is not yet in the hash table
(uncofirmed case), genid gets incremented and synchronize_rcu() call
makes sure access has completed.
C. Next attempt to peek at extension area will fail for unconfirmed
conntracks, because ext->genid != genid.
D. Conntracks in the hash are iterated as before.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: cttimeout: decouple unlink and free on netns destruction
Increment the extid on module removal; this makes sure that even
in extreme cases any old uncofirmed entry that happened to be kept
e.g. on nfnetlink_queue list will not trip over a stale timeout
reference.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When a helper or timeout policy is removed, the conntrack table gets
traversed and affected extensions are cleared.
Conntrack entries not yet in the hashtable are referenced via a special
list, the unconfirmed list.
On removal of a policy or connection tracking helper, the unconfirmed
list gets traversed an all entries are marked as dying, this prevents
them from getting committed to the table at insertion time: core checks
for dying bit, if set, the conntrack entry gets destroyed at confirm
time.
The disadvantage is that each new conntrack has to be added to the percpu
unconfirmed list, and each insertion needs to remove it from this list.
The list is only ever needed when a policy or helper is removed -- a rare
occurrence.
Add a generation ID count: Instead of adding to the list and then
traversing that list on policy/helper removal, increment a counter
that is stored in the extension area.
For unconfirmed conntracks, the extension has the genid valid at ct
allocation time.
Removal of a helper/policy etc. increments the counter.
At confirmation time, validate that ext->genid == global_id.
If the stored number is not the same, do not allow the conntrack
insertion, just like as if a confirmed-list traversal would have flagged
the entry as dying.
After insertion, the genid is no longer relevant (conntrack entries
are now reachable via the conntrack table iterators and is set to 0.
This allows removal of the percpu unconfirmed list.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This helper tags connections not yet in the conntrack table as
dying. These nf_conn entries will be dropped instead when the
core attempts to insert them from the input or postrouting
'confirm' hook.
After the previous change, the entries get unlinked from the
list earlier, so that by the time the actual exit hook runs,
new connections no longer have a timeout policy assigned.
Its enough to walk the hashtable instead.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>