Florian Fainelli [Fri, 27 Jan 2023 00:08:19 +0000 (16:08 -0800)]
net: bcmgenet: Add a check for oversized packets
Occasionnaly we may get oversized packets from the hardware which
exceed the nomimal 2KiB buffer size we allocate SKBs with. Add an early
check which drops the packet to avoid invoking skb_over_panic() and move
on to processing the next packet.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Fri, 27 Jan 2023 07:45:06 +0000 (08:45 +0100)]
net: netlink: recommend policy range validation
For large ranges (outside of s16) the documentation currently
recommends open-coding the validation, but it's better to use
the NLA_POLICY_FULL_RANGE() or NLA_POLICY_FULL_RANGE_SIGNED()
policy validation instead; recommend that.
Jakub Kicinski [Sat, 28 Jan 2023 07:59:45 +0000 (23:59 -0800)]
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
bpf-next 2023-01-28
We've added 124 non-merge commits during the last 22 day(s) which contain
a total of 124 files changed, 6386 insertions(+), 1827 deletions(-).
The main changes are:
1) Implement XDP hints via kfuncs with initial support for RX hash and
timestamp metadata kfuncs, from Stanislav Fomichev and
Toke Høiland-Jørgensen.
Measurements on overhead: https://lore.kernel.org/bpf/875yellcx6.fsf@toke.dk
2) Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case, from Andrii Nakryiko.
3) Significantly reduce the search time for module symbols by livepatch
and BPF, from Jiri Olsa and Zhen Lei.
4) Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs
in different time intervals, from David Vernet.
5) Fix several issues in the dynptr processing such as stack slot liveness
propagation, missing checks for PTR_TO_STACK variable offset, etc,
from Kumar Kartikeya Dwivedi.
6) Various performance improvements, fixes, and introduction of more
than just one XDP program to XSK selftests, from Magnus Karlsson.
7) Big batch to BPF samples to reduce deprecated functionality,
from Daniel T. Lee.
8) Enable struct_ops programs to be sleepable in verifier,
from David Vernet.
9) Reduce pr_warn() noise on BTF mismatches when they are expected under
the CONFIG_MODULE_ALLOW_BTF_MISMATCH config anyway, from Connor O'Brien.
10) Describe modulo and division by zero behavior of the BPF runtime
in BPF's instruction specification document, from Dave Thaler.
11) Several improvements to libbpf API documentation in libbpf.h,
from Grant Seltzer.
12) Improve resolve_btfids header dependencies related to subcmd and add
proper support for HOSTCC, from Ian Rogers.
13) Add ipip6 and ip6ip decapsulation support for bpf_skb_adjust_room()
helper along with BPF selftests, from Ziyang Xuan.
14) Simplify the parsing logic of structure parameters for BPF trampoline
in the x86-64 JIT compiler, from Pu Lehui.
15) Get BTF working for kernels with CONFIG_RUST enabled by excluding
Rust compilation units with pahole, from Martin Rodriguez Reboredo.
16) Get bpf_setsockopt() working for kTLS on top of TCP sockets,
from Kui-Feng Lee.
17) Disable stack protection for BPF objects in bpftool given BPF backends
don't support it, from Holger Hoffstätte.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (124 commits)
selftest/bpf: Make crashes more debuggable in test_progs
libbpf: Add documentation to map pinning API functions
libbpf: Fix malformed documentation formatting
selftests/bpf: Properly enable hwtstamp in xdp_hw_metadata
selftests/bpf: Calls bpf_setsockopt() on a ktls enabled socket.
bpf: Check the protocol of a sock to agree the calls to bpf_setsockopt().
bpf/selftests: Verify struct_ops prog sleepable behavior
bpf: Pass const struct bpf_prog * to .check_member
libbpf: Support sleepable struct_ops.s section
bpf: Allow BPF_PROG_TYPE_STRUCT_OPS programs to be sleepable
selftests/bpf: Fix vmtest static compilation error
tools/resolve_btfids: Alter how HOSTCC is forced
tools/resolve_btfids: Install subcmd headers
bpf/docs: Document the nocast aliasing behavior of ___init
bpf/docs: Document how nested trusted fields may be defined
bpf/docs: Document cpumask kfuncs in a new file
selftests/bpf: Add selftest suite for cpumask kfuncs
selftests/bpf: Add nested trust selftests suite
bpf: Enable cpumasks to be queried and used as kptrs
bpf: Disallow NULLable pointers for trusted kfuncs
...
====================
Breno Leitao [Wed, 25 Jan 2023 18:52:30 +0000 (10:52 -0800)]
netpoll: Remove 4s sleep during carrier detection
This patch removes the msleep(4s) during netpoll_setup() if the carrier
appears instantly.
Here are some scenarios where this workaround is counter-productive in
modern ages:
Servers which have BMC communicating over NC-SI via the same NIC as gets
used for netconsole. BMC will keep the PHY up, hence the carrier
appearing instantly.
The link is fibre, SERDES getting sync could happen within 0.1Hz, and
the carrier also appears instantly.
Other than that, if a driver is reporting instant carrier and then
losing it, this is probably a driver bug.
drivers/net/ethernet/intel/ice/ice_main.c 418e53401e47 ("ice: move devlink port creation/deletion") 643ef23bd9dd ("ice: Introduce local var for readability")
https://lore.kernel.org/all/20230127124025.0dacef40@canb.auug.org.au/
https://lore.kernel.org/all/20230124005714.3996270-1-anthony.l.nguyen@intel.com/
drivers/net/ethernet/engleder/tsnep_main.c 3d53aaef4332 ("tsnep: Fix TX queue stop/wake for multiple queues") 25faa6a4c5ca ("tsnep: Replace TX spin_lock with __netif_tx_lock")
https://lore.kernel.org/all/20230127123604.36bb3e99@canb.auug.org.au/
net/netfilter/nf_conntrack_proto_sctp.c 13bd9b31a969 ("Revert "netfilter: conntrack: add sctp DATA_SENT state"") a44b7651489f ("netfilter: conntrack: unify established states for SCTP paths") f71cb8f45d09 ("netfilter: conntrack: sctp: use nf log infrastructure for invalid packets")
https://lore.kernel.org/all/20230127125052.674281f9@canb.auug.org.au/
https://lore.kernel.org/all/d36076f3-6add-a442-6d4b-ead9f7ffff86@tessares.net/
====================
net: xdp: execute xdp_do_flush() before napi_complete_done()
Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found in [1].
The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in [2].
The drivers have only been compile-tested since I do not own any of
the HW below. So if you are a maintainer, it would be great if you
could take a quick look to make sure I did not mess something up.
Note that these were the drivers I found that violated the ordering by
running a simple script and manually checking the ones that came up as
potential offenders. But the script was not perfect in any way. There
might still be offenders out there, since the script can generate
false negatives.
Magnus Karlsson [Wed, 25 Jan 2023 07:49:01 +0000 (08:49 +0100)]
dpaa2-eth: execute xdp_do_flush() before napi_complete_done()
Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.
The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.
Magnus Karlsson [Wed, 25 Jan 2023 07:49:00 +0000 (08:49 +0100)]
dpaa_eth: execute xdp_do_flush() before napi_complete_done()
Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.
The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.
Magnus Karlsson [Wed, 25 Jan 2023 07:48:59 +0000 (08:48 +0100)]
virtio-net: execute xdp_do_flush() before napi_complete_done()
Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.
The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.
Magnus Karlsson [Wed, 25 Jan 2023 07:48:58 +0000 (08:48 +0100)]
lan966x: execute xdp_do_flush() before napi_complete_done()
Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.
The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.
Magnus Karlsson [Wed, 25 Jan 2023 07:48:57 +0000 (08:48 +0100)]
qede: execute xdp_do_flush() before napi_complete_done()
Make sure that xdp_do_flush() is always executed before
napi_complete_done(). This is important for two reasons. First, a
redirect to an XSKMAP assumes that a call to xdp_do_redirect() from
napi context X on CPU Y will be followed by a xdp_do_flush() from the
same napi context and CPU. This is not guaranteed if the
napi_complete_done() is executed before xdp_do_flush(), as it tells
the napi logic that it is fine to schedule napi context X on another
CPU. Details from a production system triggering this bug using the
veth driver can be found following the first link below.
The second reason is that the XDP_REDIRECT logic in itself relies on
being inside a single NAPI instance through to the xdp_do_flush() call
for RCU protection of all in-kernel data structures. Details can be
found in the second link below.
Grant Seltzer [Thu, 26 Jan 2023 02:47:49 +0000 (21:47 -0500)]
libbpf: Fix malformed documentation formatting
This fixes the doxygen format documentation above the
user_ring_buffer__* APIs. There has to be a newline
before the @brief, otherwise doxygen won't render them
for libbpf.readthedocs.org.
This patchset takes care of small cleanup of devlink params usage.
Some of the patches (first 2/3) are cosmetic, but I would like to
point couple of interesting ones:
Patch 9 is the main one of this set and introduces devlink instance
locking for params, similar to other devlink objects. That allows params
to be registered/unregistered when devlink instance is registered.
Patches 10-12 change mlx5 code to register non-driverinit params in the
code they are related to, and thanks to patch 8 this might be when
devlink instance is registered - for example during devlink reload.
---
v1->v2:
- Just small fix in the last patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:38 +0000 (08:58 +0100)]
net/mlx5: Move eswitch port metadata devlink param to flow eswitch code
Move the param registration and handling code into the eswitch offloads
code as they are related to each other. No point in having the
devlink param registration done in separate file.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:37 +0000 (08:58 +0100)]
net/mlx5: Move flow steering devlink param to flow steering code
Move the param registration and handling code into the flow steering
code as they are related to each other. No point in having the
devlink param registration done in separate file.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:36 +0000 (08:58 +0100)]
net/mlx5: Move fw reset devlink param to fw reset code
Move the param registration and handling code into the fw reset code
as they are related to each other. No point in having the devlink param
registration done in separate file.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:35 +0000 (08:58 +0100)]
devlink: protect devlink param list by instance lock
Commit 1d18bb1a4ddd ("devlink: allow registering parameters after
the instance") as the subject implies introduced possibility to register
devlink params even for already registered devlink instance. This is a
bit problematic, as the consistency or params list was originally
secured by the fact it is static during devlink lifetime. So in order to
protect the params list, take devlink instance lock during the params
operations. Introduce unlocked function variants and use them in drivers
in locked context. Put lock assertions to appropriate places.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Tested-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:34 +0000 (08:58 +0100)]
devlink: put couple of WARN_ONs in devlink_param_driverinit_value_get()
Put couple of WARN_ONs in devlink_param_driverinit_value_get() function
to clearly indicate, that it is a driver bug if used without reload
support or for non-driverinit param.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:33 +0000 (08:58 +0100)]
devlink: make devlink_param_driverinit_value_set() return void
devlink_param_driverinit_value_set() currently returns int with possible
error, but no user is checking it anyway. The only reason for a fail is
a driver bug. So convert the function to return void and put WARN_ONs
on error paths.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:32 +0000 (08:58 +0100)]
qed: remove pointless call to devlink_param_driverinit_value_set()
devlink_param_driverinit_value_set() call makes sense only for "
driverinit" params. However here, the param is "runtime".
devlink_param_driverinit_value_set() returns -EOPNOTSUPP in such case
and does not do anything. So remove the pointless call to
devlink_param_driverinit_value_set() entirely.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:31 +0000 (08:58 +0100)]
ice: remove pointless calls to devlink_param_driverinit_value_set()
devlink_param_driverinit_value_set() call makes sense only for
"driverinit" params. However here, both params are "runtime".
devlink_param_driverinit_value_set() returns -EOPNOTSUPP in such case
and does not do anything. So remove the pointless calls to
devlink_param_driverinit_value_set() entirely.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:30 +0000 (08:58 +0100)]
devlink: don't work with possible NULL pointer in devlink_param_unregister()
There is a WARN_ON checking the param_item for being NULL when the param
is not inserted in the list. That indicates a driver BUG. Instead of
continuing to work with NULL pointer with its consequences, return.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:29 +0000 (08:58 +0100)]
devlink: make devlink_param_register/unregister static
There is no user outside the devlink code, so remove the export and make
the functions static. Move them before callers to avoid forward
declarations.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:28 +0000 (08:58 +0100)]
net/mlx5: Covert devlink params registration to use devlink_params_register/unregister()
Since mlx5 is the only user of devlink API to register/unregister a
single param, convert it to use array registration function allowing to
simplify the devlink API by removing the single param registration
functions.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 26 Jan 2023 07:58:27 +0000 (08:58 +0100)]
net/mlx5: Change devlink param register/unregister function names
The functions are registering and unregistering devlink params, so
change the names accordingly.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 27 Jan 2023 12:24:32 +0000 (12:24 +0000)]
Merge branch 'ethtool-netlink-next'
Jakub Kicinski says:
====================
ethtool: netlink: handle SET intro/outro in the common code
Factor out the boilerplate code from SET handlers to common code.
I volunteered to refactor the extack in GET in a conversation
with Vladimir but I gave up.
The handling of failures during dump in GET handlers is a bit
unclear to me. Some code uses presence of info as indication
of dump and tries to avoid reporting errors altogether
(including extack messages).
There's also the question of whether we should have a validation
callback (similar to .set_validate here) for GET. It looks like
.parse_request was expected to perform the validation. It takes
the extack and tb directly, not via info:
int (*parse_request)(struct ethnl_req_info *req_info,
struct nlattr **tb,
struct netlink_ext_ack *extack);
But .parse_request doesn't run under rtnl nor ethnl_ops_begin().
As a result some implementations defer validation until .prepare_data
where all the locks are held and they can call out to the driver.
All this makes me think that maybe we should refactor GET in the
same direction I'm refactoring SET. Split .prepare_data, take
more locks in the core, and add a validation helper which would
take extack directly:
- ret = ops->prepare_data(req_info, reply_data, info);
+ ret = ops->prepare_data_validate(req_info, reply_data, attrs, extack);
+ if (ret < 1) // if 0 -> skip for dump; -EOPNOTSUPP in do
+ goto err1;
+
+ ret = ethnl_ops_begin(dev);
+ if (ret)
+ goto err1;
+
+ ret = ops->prepare_data(req_info, reply_data); // no extack
+ ethnl_ops_complete(dev);
I'll file that away as a TODO for posterity / older me.
v2:
- invert checks for coalescing to avoid error code changes
- rebase and convert MM as well
This leads to a lot of copy / pasted code an bugs when people
mis-handle the error path.
Add a generic implementation of this pattern with a .set callback
in struct ethnl_request_ops called to "do stuff".
Also add an optional .set_validate which is called before
ethnl_ops_begin() -- a lot of implementations do basic request
capability / sanity checking at that point.
Because we want to avoid generating the notification when
no change happened - adopt a slightly hairy return values:
- 0 means nothing to do (no notification)
- 1 means done / continue
- negative error codes on error
Reuse .hdr_attr from struct ethnl_request_ops, GET and SET
use the same attr spaces in all cases.
Convert pause as an example (and to avoid unused function warnings).
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Convert qca8k to regmap read/write bulk API. The mgmt eth can write up
to 32 bytes of data at times. Currently we use a custom function to do
it but regmap now supports declaration of read/write bulk even without a
bus.
Drop the custom function and rework the regmap function to this new
implementation.
Rework the qca8k_fdb_read/write function to use the new
regmap_bulk_read/write as the old qca8k_bulk_read/write are now dropped.
Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 26 Jan 2023 07:14:23 +0000 (23:14 -0800)]
net: skbuff: drop the linux/hrtimer.h include
linux/hrtimer.h include was added because apparently it used
to contain ktime related code. This is no longer the case
and we include linux/time.h explicitly.
Sadly this change is currently a noop because linux/dma-mapping.h
and net/page_pool.h pull in half of the universe.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 26 Jan 2023 07:14:22 +0000 (23:14 -0800)]
net: skbuff: drop the linux/splice.h include
splice.h is included since commit a60e3cc7c929 ("net: make
skb_splice_bits more configureable") but really even then
all we needed is some forward declarations. Most of that
code is now gone, and remaining has fwd declarations.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Thu, 26 Jan 2023 07:14:20 +0000 (23:14 -0800)]
net: skbuff: drop the linux/sched.h include
linux/sched.h was added for skb_mstamp_* (all the way back
before linux/sched.h got split and linux/sched/clock.h created).
We don't need it in skbuff.h any more.
Sadly this change is currently a noop because linux/dma-mapping.h
and net/page_pool.h pull in half of the universe.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 27 Jan 2023 11:16:29 +0000 (11:16 +0000)]
Merge branch 'ipa-abstract-status'
Alex Elder says:
====================
net: ipa: abstract status parsing
Under some circumstances, IPA generates a "packet status" structure
that describes information about a packet. This is used, for
example, when offload hardware detects an error in a packet, or
otherwise discovers a packet needs special handling. In this case,
the status is delivered (along with the packet it describes) to a
"default" endpoint so that it can be handled by the AP.
Until now, the structure of this status information hasn't changed.
However, to support more than 32 endpoints, this structure required
some changes, such that some fields are rearranged in ways that are
tricky to represent using C code.
This series updates code related to the IPA status structure. The
first patch uses a local variable to avoid recomputing a packet
length more than once. The second stops using sizeof() to determine
the size of an IPA packet status structure. Patches 3-5 extend the
definitions for values held in packet status fields. Patch 6 does a
little general cleanup to make patch 7 simpler. Patch 7 stops using
a C structure to represent packet status; instead, a new function
fetches values "by name" from a buffer containing such a structure.
The last patch updates this function so it also supports IPA v5.0+.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:45 +0000 (14:45 -0600)]
net: ipa: add IPA v5.0 packet status support
Update ipa_status_extract() to support IPA v5.0 and beyond. Because
the format of the IPA packet status depends on the version, pass an
IPA pointer to the function.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:44 +0000 (14:45 -0600)]
net: ipa: introduce generalized status decoder
Stop assuming the IPA packet status has a fixed format (defined by
a C structure). Instead, use a function to extract each field from
a block of data interpreted as an IPA packet status. Define an
enumerated type that identifies the fields that can be extracted.
The current function extracts fields based on the existing
ipa_status structure format (which is no longer used).
Define IPA_STATUS_RULE_MISS, to replace the calls to field_max() to
represent that condition; those depended on the knowing the width of
a filter or router rule in the IPA packet status structure.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:43 +0000 (14:45 -0600)]
net: ipa: IPA status preparatory cleanups
The next patch reworks how the IPA packet status structure is
interpreted. This patch does some preparatory work, to make it
easier to see the effect of that change:
- Change a few functions that access fields in a IPA packet status
structure to store field values in local variables with names
related to the field.
- Pass a void pointer rather than an (equivalent) status pointer
to two functions called by ipa_endpoint_status_parse().
- Use "rule" rather than "val" as the name of a variable that
holds a routing rule ID.
- Consistently use "IPA packet status" rather than "status
element" when referring to this data structure.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:42 +0000 (14:45 -0600)]
net: ipa: define remaining IPA status field values
Define the remaining values for opcode and exception fields in the
IPA packet status structure. Most of these values are powers-of-2,
suggesting they are meant to be used as bitmasks, but that is not
the case. Add comments to be clear about this, and express the
values in decimal format.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:41 +0000 (14:45 -0600)]
net: ipa: rename the NAT enumerated type
Rename the ipa_nat_en enumerated type to be ipa_nat_type, and rename
its symbols accordingly. Add a comment indicating those values are
also used in the IPA status nat_type field.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:40 +0000 (14:45 -0600)]
net: ipa: define all IPA status mask bits
There is a 16 bit status mask defined in the IPA packet status
structure, of which only one (TAG_VALID) is currently used.
Define all other IPA status mask values in an enumerated type whose
numeric values are bit mask values (in CPU byte order) in the status
mask. Use the TAG_VALID value from that type rather than defining a
separate field mask.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:39 +0000 (14:45 -0600)]
net: ipa: stop using sizeof(status)
The IPA packet status structure changes in IPA v5.0 in ways that are
difficult to represent cleanly. As a small step toward redefining
it as a parsed block of data, use a constant to define its size,
rather than the size of the IPA status structure type.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Wed, 25 Jan 2023 20:45:38 +0000 (14:45 -0600)]
net: ipa: refactor status buffer parsing
The packet length encoded in an IPA packet status buffer is computed
more than once in ipa_endpoint_status_parse(). It is also checked
again in ipa_endpoint_status_skip(), which that function calls.
Compute the length once, and use that computed value later rather
than recomputing it. Check for it being zero in the parse function
rather than in ipa_endpoint_status_skip().
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 25 Jan 2023 14:57:16 +0000 (16:57 +0200)]
net: dsa: ocelot: build felix.c into a dedicated kernel module
The build system currently complains:
scripts/Makefile.build:252: drivers/net/dsa/ocelot/Makefile:
felix.o is added to multiple modules: mscc_felix mscc_seville
Since felix.c holds the DSA glue layer, create a mscc_felix_dsa_lib.ko.
This is similar to how mscc_ocelot_switch_lib.ko holds a library for
configuring the hardware.
Jakub Kicinski [Fri, 27 Jan 2023 07:29:12 +0000 (23:29 -0800)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
virtchnl: update and refactor
Jesse Brandeburg says:
The virtchnl.h file is used by i40e/ice physical function (PF) drivers
and irdma when talking to the iavf driver. This series cleans up the
header file by removing unused elements, adding/cleaning some comments,
fixing the data structures so they are explicitly defined, including
padding, and finally does a long overdue rename of the IWARP members in
the structures to RDMA, since the ice driver and it's associated Intel
Ethernet E800 series adapters support both RDMA and IWARP.
The whole series should result in no functional change, but hopefully
clearer code.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
virtchnl: i40e/iavf: rename iwarp to rdma
virtchnl: do structure hardening
virtchnl: update header and increase header clarity
virtchnl: remove unused structure declaration
====================
selftests/bpf: Properly enable hwtstamp in xdp_hw_metadata
The existing timestamping_enable() is a no-op because it applies
to the socket-related path that we are not verifying here
anymore. (but still leaving the code around hoping we can
have xdp->skb path verified here as well)
====================
tools: ynl: prevent reorder and fix flags
Some codegen improvements for YAML specs.
First, Lorenzon discovered when switching the XDP feature family
to use flags instead of pure enum that the kdoc got garbled.
The support for enum and flags is therefore unified.
Second when regenerating all families we discussed so far I noticed
that some netlink policies jumped around. We need to ensure we don't
render code based on their ordering in a hash.
====================
Jakub Kicinski [Thu, 26 Jan 2023 00:02:35 +0000 (16:02 -0800)]
tools: ynl: store ops in ordered dict to avoid random ordering
When rendering code we should walk the ops in the order in which
they are declared in the spec. This is both more intuitive and
prevents code from jumping around when hashing in the dict changes.
Jakub Kicinski [Thu, 26 Jan 2023 00:02:34 +0000 (16:02 -0800)]
tools: ynl: rename ops_list -> msg_list
ops_list contains all the operations, but the main iteration use
case is to walk only ops which define attrs. Rename ops_list to
msg_list, because now it looks like the contents are the same,
just the format is different. While at it convert from tuple
to just keys, none of the users care about the name of the op.
====================
Convert drivers to return XFRM configuration errors through extack
This series continues effort started by Sabrina to return XFRM configuration
errors through extack. It allows for user space software stack easily present
driver failure reasons to users.
As a note, Intel drivers have a path where extack is equal to NULL, and error
prints won't be available in current patchset. If it is needed, it can be
changed by adding special to Intel macro to print to dmesg in case of
extack == NULL.
====================
Leon Romanovsky [Tue, 24 Jan 2023 11:55:02 +0000 (13:55 +0200)]
nfp: fill IPsec state validation failure reason
Rely on extack to return failure reason.
Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Thu, 26 Jan 2023 18:20:12 +0000 (10:20 -0800)]
Merge tag 'net-6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter.
Current release - regressions:
- sched: sch_taprio: do not schedule in taprio_reset()
Previous releases - regressions:
- core: fix UaF in netns ops registration error path
- ipv4: prevent potential spectre v1 gadgets
- ipv6: fix reachability confirmation with proxy_ndp
- netfilter: fix for the set rbtree
- eth: fec: use page_pool_put_full_page when freeing rx buffers
- eth: iavf: fix temporary deadlock and failure to set MAC address
Previous releases - always broken:
- netlink: prevent potential spectre v1 gadgets
- netfilter: fixes for SCTP connection tracking
- mctp: struct sock lifetime fixes
- eth: ravb: fix possible hang if RIS2_QFF1 happen
- eth: tg3: resolve deadlock in tg3_reset_task() during EEH
Misc:
- Mat stepped out as MPTCP co-maintainer"
* tag 'net-6.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (40 commits)
net: mdio-mux-meson-g12a: force internal PHY off on mux switch
docs: networking: Fix bridge documentation URL
tsnep: Fix TX queue stop/wake for multiple queues
net/tg3: resolve deadlock in tg3_reset_task() during EEH
net: mctp: mark socks as dead on unhash, prevent re-add
net: mctp: hold key reference when looking up a general key
net: mctp: move expiry timer delete to unhash
net: mctp: add an explicit reference from a mctp_sk_key to sock
net: ravb: Fix possible hang if RIS2_QFF1 happen
net: ravb: Fix lack of register setting after system resumed for Gen3
net/x25: Fix to not accept on connected socket
ice: move devlink port creation/deletion
sctp: fail if no bound addresses can be used for a given scope
net/sched: sch_taprio: do not schedule in taprio_reset()
Revert "Merge branch 'ethtool-mac-merge'"
netrom: Fix use-after-free of a listening socket.
netfilter: conntrack: unify established states for SCTP paths
Revert "netfilter: conntrack: add sctp DATA_SENT state"
netfilter: conntrack: fix bug in for_each_sctp_chunk
netfilter: conntrack: fix vtag checks for ABORT/SHUTDOWN_COMPLETE
...
Linus Torvalds [Thu, 26 Jan 2023 18:05:39 +0000 (10:05 -0800)]
treewide: fix up files incorrectly marked executable
I'm not exactly clear on what strange workflow causes people to do it,
but clearly occasionally some files end up being committed as executable
even though they clearly aren't.
This is a reprise of commit 90fda63fa115 ("treewide: fix up files
incorrectly marked executable"), just with a different set of files (but
with the same trivial shell scripting).
So apparently we need to re-do this every five years or so, and Joe
needs to just keep reminding me to do so ;)
Reported-by: Joe Perches <joe@perches.com> Fixes: 523375c943e5 ("drm/vmwgfx: Port vmwgfx to arm64") Fixes: 5c439937775d ("ASoC: codecs: add support for ES8326") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vladimir Oltean [Wed, 25 Jan 2023 11:02:14 +0000 (13:02 +0200)]
net: ethtool: provide shims for stats aggregation helpers when CONFIG_ETHTOOL_NETLINK=n
ethtool_aggregate_*_stats() are implemented in net/ethtool/stats.c, a
file which is compiled out when CONFIG_ETHTOOL_NETLINK=n. In order to
avoid adding Kbuild dependencies from drivers (which call these helpers)
on CONFIG_ETHTOOL_NETLINK, let's add some shim definitions which simply
make the helpers dead code.
This means the function prototypes should have been located in
include/linux/ethtool_netlink.h rather than include/linux/ethtool.h.
Fixes: 449c5459641a ("net: ethtool: add helpers for aggregate statistics") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230125110214.4127759-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
mptcp: add mixed v4/v6 support for the in-kernel PM
Before these patches, the in-kernel Path-Manager would not allow, for
the same MPTCP connection, having a mix of subflows in v4 and v6.
MPTCP's RFC 8684 doesn't forbid that and it is even recommended to do so
as the path in v4 and v6 are likely different. Some networks are also
v4 or v6 only, we cannot assume they all have both v4 and v6 support.
Patch 1 then removes this artificial constraint in the in-kernel PM
currently enforcing there are no mixed subflows in place, either in
address announcement or in subflow creation areas.
Patch 2 makes sure the sk_ipv6only attribute is also propagated to
subflows, just in case a new PM wouldn't respect it.
Some selftests have also been added for the in-kernel PM (patch 3).
Patches 4 to 8 are just some cleanups and small improvements in the
printed messages in the userspace PM. It is not linked to the rest but
identified when working on a related patch modifying this selftest,
already in -net:
Matthieu Baerts [Wed, 25 Jan 2023 10:47:28 +0000 (11:47 +0100)]
selftests: mptcp: userspace: avoid read errors
During the cleanup phase, the server pids were killed with a SIGTERM
directly, not using a SIGUSR1 first to quit safely. As a result, this
test was often ending with two error messages:
read: Connection reset by peer
While at it, use a for-loop to terminate all the PIDs the same way.
Also the different files are now removed after having killed the PIDs
using them. It makes more sense to do that in this order.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Wed, 25 Jan 2023 10:47:21 +0000 (11:47 +0100)]
mptcp: let the in-kernel PM use mixed IPv4 and IPv6 addresses
Currently the in-kernel PM arbitrary enforces that created subflow's
family must match the main MPTCP socket while the RFC allows mixing
IPv4 and IPv6 subflows.
This patch changes the in-kernel PM logic to create subflows matching
the currently selected source (or destination) address. IPv4 sockets
can pick only IPv4 addresses (and v4 mapped in v6), while IPv6 sockets
not restricted to V6ONLY can pick either IPv4 and IPv6 addresses as
long as the source and destination matches.
A helper, previously introduced is used to ease family matching checks,
taking care of IPv4 vs IPv4-mapped-IPv6 vs IPv6 only addresses.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/269 Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
However, when ICMP output is limited, there is no way to tell
which limit has been hit or even if the limits are responsible
for the lack of ICMP output.
Add counters for each of the cases above. As we are within
local_bh_disable(), use the __INC stats variant.
Paolo Abeni [Thu, 26 Jan 2023 09:08:05 +0000 (10:08 +0100)]
Merge branch 'adding-sparx5-is0-vcap-support'
Steen Hegelund says:
====================
Adding Sparx5 IS0 VCAP support
This provides the Ingress Stage 0 (IS0) VCAP (Versatile Content-Aware
Processor) support for the Sparx5 platform.
The IS0 VCAP (also known in the datasheet as CLM) is a classifier VCAP that
mainly extracts frame information to metadata that follows the frame in the
Sparx5 processing flow all the way to the egress port.
The IS0 VCAP has 4 lookups and they are accessible with a TC chain id:
Each of these lookups have their own port keyset configuration that decides
which keys will be used for matching on which traffic type.
The IS0 VCAP has these traffic classifications:
- IPv4 frames
- IPv6 frames
- Unicast MPLS frames (ethertype = 0x8847)
- Multicast MPLS frames (ethertype = 0x8847)
- Other frame types than MPLS, IPv4 and IPv6
The IS0 VCAP has an action that allows setting the value of a PAG (Policy
Association Group) key field in the frame metadata, and this can be used
for matching in an IS2 VCAP rule.
This allow rules in the IS0 VCAP to be linked to rules in the IS2 VCAP.
The linking is exposed by using the TC "goto chain" action with an offset
from the IS2 chain ids.
As an example a "goto chain 8000001" will use a PAG value of 1 to chain to
a rule in IS2 Lookup 0.
====================
Steen Hegelund [Tue, 24 Jan 2023 10:45:09 +0000 (11:45 +0100)]
net: microchip: sparx5: Add automatic selection of VCAP rule actionset
With more than one possible actionset in a VCAP instance, the VCAP API will
now use the actions in a VCAP rule to select the actionset that fits these
actions the best possible way.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Steen Hegelund [Tue, 24 Jan 2023 10:45:08 +0000 (11:45 +0100)]
net: microchip: sparx5: Add TC filter chaining support for IS0 and IS2 VCAPs
This allows rules to be chained between VCAP instances, e.g. from IS0
Lookup 0 to IS0 Lookup 1, or from one of the IS0 Lookups to one of the IS2
Lookups.
Chaining from an IS2 Lookup to another IS2 Lookup is not supported in the
hardware.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This patch set is a follow up to the "How to share IPv4 addresses by
partitioning the port space" talk given at LPC 2022 [1].
Please see patch #1 for the motivation & the use case description.
Patch #2 adds tests exercising the new option in various scenarios.
Documentation
-------------
Proposed update to the ip(7) man-page:
IP_LOCAL_PORT_RANGE (since Linux X.Y)
Set or get the per-socket default local port range. This
option can be used to clamp down the global local port
range, defined by the ip_local_port_range /proc interface
described below, for a given socket.
The option takes an uint32_t value with the high 16 bits
set to the upper range bound, and the low 16 bits set to
the lower range bound. Range bounds are inclusive. The
16-bit values should be in host byte order.
The lower bound has to be less than the upper bound when
both bounds are not zero. Otherwise, setting the option
fails with EINVAL.
If either bound is outside of the global local port range,
or is zero, then that bound has no effect.
To reset the setting, pass zero as both the upper and the
lower bound.
Interaction with SELinux bind() hook
------------------------------------
SELinux bind() hook - selinux_socket_bind() - performs a permission check
if the requested local port number lies outside of the netns ephemeral port
range.
The proposed socket option cannot be used change the ephemeral port range
to extend beyond the per-netns port range, as set by
net.ipv4.ip_local_port_range.
Hence, there is no interaction with SELinux, AFAICT.
Jakub Sitnicki [Tue, 24 Jan 2023 13:36:44 +0000 (14:36 +0100)]
selftests/net: Cover the IP_LOCAL_PORT_RANGE socket option
Exercise IP_LOCAL_PORT_RANGE socket option in various scenarios:
1. pass invalid values to setsockopt
2. pass a range outside of the per-netns port range
3. configure a single-port range
4. exhaust a configured multi-port range
5. check interaction with late-bind (IP_BIND_ADDRESS_NO_PORT)
6. set then get the per-socket port range
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Sitnicki [Tue, 24 Jan 2023 13:36:43 +0000 (14:36 +0100)]
inet: Add IP_LOCAL_PORT_RANGE socket option
Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.
A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:
1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both, the destination IP
and the destination port.
An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:
1. Manually pick the source port by bind()'ing to it before connect()'ing
the socket.
This approach has a couple of downsides:
a) Search for a free port has to be implemented in the user-space. If
the chosen 4-tuple happens to be busy, the application needs to retry
from a different local port number.
Detecting if 4-tuple is busy can be either easy (TCP) or hard
(UDP). In TCP case, the application simply has to check if connect()
returned an error (EADDRNOTAVAIL). That is assuming that the local
port sharing was enabled (REUSEADDR) by all the sockets.
# Assume desired local port range is 60_000-60_511
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
s.bind(("192.0.2.1", 60_000))
s.connect(("1.1.1.1", 53))
# Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
# Application must retry with another local port
In case of UDP, the network stack allows binding more than one socket
to the same 4-tuple, when local port sharing is enabled
(REUSEADDR). Hence detecting the conflict is much harder and involves
querying sock_diag and toggling the REUSEADDR flag [1].
b) For TCP, bind()-ing to a port within the ephemeral port range means
that no connecting sockets, that is those which leave it to the
network stack to find a free local port at connect() time, can use
the this port.
IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
will be skipped during the free port search at connect() time.
2. Isolate the app in a dedicated netns and use the use the per-netns
ip_local_port_range sysctl to adjust the ephemeral port range bounds.
The per-netns setting affects all sockets, so this approach can be used
only if:
- there is just one egress IP address, or
- the desired egress port range is the same for all egress IP addresses
used by the application.
For TCP, this approach avoids the downsides of (1). Free port search and
4-tuple conflict detection is done by the network stack:
s = socket(AF_INET, SOCK_STREAM)
s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
s.bind(("192.0.2.1", 0))
s.connect(("1.1.1.1", 53))
# Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy
For UDP this approach has limited applicability. Setting the
IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
port being shared with other connected UDP sockets.
Hence relying on the network stack to find a free source port, limits the
number of outgoing UDP flows from a single IP address down to the number
of available ephemeral ports.
To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.
To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.
The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.
UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.
PORT_LO = 40_000
PORT_HI = 40_511
s = socket(AF_INET, SOCK_STREAM)
v = struct.pack("I", PORT_HI << 16 | PORT_LO)
s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
s.bind(("127.0.0.1", 0))
s.getsockname()
# Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
# if there is a free port. EADDRINUSE otherwise.
Reviewed-by: Marek Majkowski <marek@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>