[PATCH v3 00/20] More flow table preliminaries: address handling improvements
Here's another batch of cleanups and tweaks in preparation for the flow table. This set focuses on improved helpers for handling addresses, particularly in the TCP splice path. Based on my other series adding more iovecs to the tap and pcap code, however the only conflicts should be trivial Makefile collisions. Changes since v2: * Minor stylistic and formatting changes based on review * Some clarifying changes to the theory of operation notes on flow lifecycle * Rebased on top of new series cleaning up socket pool error handling. This removes a couple of patches from this series. * Small edits to commit message for improved clarity Changes since v1: * Rebased, and reordered in a way I hope is clearer * Add patch to rename port_fwd.[ch] * Added doc comments to clarify flow life cycle * Added uniform logging of flow start / end to match that lifecycle * union inany_addr typed special address constants * inany based tests for unspecified and multicast addresses, as well as loopback * Dropped patch allowing NULL parameter to inany_from_af(), turned out not to be that useful * Dropped sockaddr_any_init function, turned out not to be very useful in that form * Added patch enforcing no loopback addresses on tap interface * Added logic to sanity check TCP endpoint addresses * Moved socket creation into tcp_splice_connect() * Moved epoll ref parsing into tcp_listen_handler() * Allowed IN4_IS_*() helpers to work on void * addresses David Gibson (20): inany: Helper to test for various address types inany: Add inany_ntop() helper inany: Provide more conveniently typed constants for special addresses inany: Introduce union sockaddr_inany util: Allow IN4_IS_* macros to operate on untyped addresses tcp, udp: Don't precompute port remappings in epoll references flow: Add helper to determine a flow's protocol tcp_splice: Simplify clean up logic tcp_splice: Don't use flow_trace() before setting flow type flow: Clarify flow entry life cycle, introduce uniform logging tcp_splice: More specific variable names in new splice path tcp_splice: Merge tcp_splice_new() into its caller tcp_splice: Make tcp_splice_connect() create its own sockets tcp_splice: Improve error reporting on connect path tcp_splice: Improve logic deciding when to splice tcp, tcp_splice: Parse listening socket epoll ref in tcp_listen_handler() tcp: Validate TCP endpoint addresses tap: Disallow loopback addresses on tap interface port_fwd: Fix copypasta error in port_fwd_scan_udp() comments fwd: Rename port_fwd.[ch] and their contents Makefile | 12 ++-- conf.c | 8 +-- flow.c | 84 ++++++++++++++++++++++- flow.h | 9 +++ port_fwd.c => fwd.c | 32 ++++----- port_fwd.h => fwd.h | 24 +++---- icmp.c | 18 ++--- inany.c | 50 ++++++++++++++ inany.h | 96 ++++++++++++++++++++++---- passt.h | 2 +- tap.c | 19 ++++++ tcp.c | 119 +++++++++++++++++++++++--------- tcp.h | 6 +- tcp_splice.c | 162 +++++++++++++++++++++++++------------------- tcp_splice.h | 7 +- udp.c | 32 +++++---- udp.h | 10 +-- util.h | 8 +-- 18 files changed, 502 insertions(+), 196 deletions(-) rename port_fwd.c => fwd.c (83%) rename port_fwd.h => fwd.h (62%) create mode 100644 inany.c -- 2.43.2
Add helpers to determine if an inany is loopback, unspecified or
multicast, regardless of whether it's a "true" IPv6 address or an IPv4
address represented as v4-mapped.
Use the loopback helper to simplify tcp_splice_conn_from_sock() slightly.
Signed-off-by: David Gibson
Add this helper to format an inany into either IPv4 or IPv6 text
format as appropriate.
Signed-off-by: David Gibson
Our inany_addr type is used in some places to represent either IPv4 or
IPv6 addresses, and we plan to use it more widely. We don't yet
provide constants of this type for special addresses (loopback and
"any"). Add some of these, both the IPv4 and IPv6 variants of those
addresses, but typed as union inany_addr.
To avoid actually adding more things to .data we can use some macros and
casting to overlay the IPv6 versions of these with the standard library's
in6addr_loopback and in6addr_any. For the IPv4 versions we need to create
new constant globals.
For complicated historical reasons, the standard library doesn't
provide constants for IPv4 loopback and any addresses as struct
in_addr. It just has macros of type in_addr_t == uint32_t, which has
some gotchas w.r.t. endianness. We can use some more macros to
address this lack, using macros to effectively create these IPv4
constants as pieces of the inany constants above.
We use this last to avoid some awkward temporary variables just used
to get an address of an IPv4 loopback address.
Signed-off-by: David Gibson
There are a number of places where we want to handle either a
sockaddr_in or a sockaddr_in6. In some of those we use a void *,
which works ok and matches some standard library interfaces, but
doesn't give a signature level hint that we're dealing with only
sockaddr_in or sockaddr_in6, not (say) sockaddr_un or another type of
socket address. Other places we use a sockaddr_storage, which also
works, but has the same problem in addition to allocating more on the
stack than we need to.
Introduce union sockaddr_inany to explictly handle this case: it has
variants for sockaddr_in and sockaddr_in6. Use it in a number of
places where it's easy to do so.
Signed-off-by: David Gibson
The IN4_IS_*() macros expect a pointer to a struct in_addr. That
makes sense, but sometimes we have an IPv4 address as a void * pointer
or union type which makes these less convenient. Additionally, this
doesn't match the behaviour of the standard library's IN6_IS_*()
macros on which they're modelled, nor our own IN4_ARE_ADDR_EQUAL().
Signed-off-by: David Gibson
The epoll references for both TCP listening sockets and UDP sockets
includes a port number. This gives the destination port that traffic
to that socket will be sent to on the other side. That will usually
be the same as the socket's bound port, but might not if the -t, -u,
-T or -U options are given with different original and forwarded port
numbers.
As we move towards a more flexible forwarding model for passt, it's
going to become possible for that destination port to vary depending
on more things (for example the source or destination address). So,
it will no longer make sense to have a fixed value for a listening
socket.
Change to simpler semantics where this field in the reference gives
the bound port of the socket. We apply the translations to the
correct destination port later on, when we're actually forwarding.
Signed-off-by: David Gibson
Each flow already has a type field. This implies the protocol the
flow represents, but also has more information: we have two ways to
represent TCP flows, "tap" and "spliced". In order to generalise some
of the flow mechanics, we'll need to determine a flow's protocol in
terms of the IP (L4) protocol number.
Introduce a constant table and helper macro to derive this from the flow
type.
Signed-off-by: David Gibson
Currently tcp_splice_flow_defer() contains specific logic to determine
if we're far enough initialised that we need to close pipes and/or
sockets. This is potentially fragile if we change something about the
order in which we do things. We can simplify this by initialising the
pipe and socket fields to -1 very early, then close()ing them if and
only if they're non-negative.
This lets us remove a special case cleanup if our connect() fails.
This will already trigger a CLOSING event, and the socket fd in
question is populated in the connection structure. Thus we can let
the new cleanup logic handle it rather than requiring an explicit
close().
Signed-off-by: David Gibson
In tcp_splice_conn_from_sock() we can call flow_trace() if there's an
error setting TCP_QUICKACK. However, we do so before we've set the
flow type in the flow entry. That means that flow_trace() will print
nonsense when it tries to print the flow type.
There's no reason the setsockopt() has to happen before initialising
the flow entry, so just move it after.
Signed-off-by: David Gibson
Our allocation scheme for flow entries means there are some
non-obvious constraints on when what things can be done with an entry.
Add a big doc comment explaining the life cycle.
In addition, make a FLOW_START() macro to mark one of the important
transitions. This encourages correct usage, by making it natural to
only access the flow type specific structure after calling it. It
also logs that a new flow has been created, which is useful for
debugging.
We also add logging when a flow's lifecycle ends. This doesn't need a
new helper, because it can only happen either from flow_alloc_cancel()
or from the flow deferred handler.
Signed-off-by: David Gibson
In tcp_splice_conn_from_sock(), the 'port' variable stores the source
port of the connection on the originating side. In tcp_splice_new(),
called directly from it, the 'port' parameter gives the _destination_
port of the originating connection and is then updated to the
destination port of the connection on the other side.
Similarly, in tcp_splice_conn_from_sock(), 's' is the fd of the
accetped socket (on side 0), whereas in tcp_splice_new(), 's' is the
fd of the connecting socket (side 1).
I, for one, find having the same variable name with different meanings
in such close proximity in the flow of control pretty confusing.
Alter the names for greater specificity and clarity.
Signed-off-by: David Gibson
The only caller of tcp_splice_new() is tcp_splice_conn_from_sock().
Both are quite short, and the division of responsibilities between the
two isn't particularly obvious. Simplify by merging the former into
the latter.
Signed-off-by: David Gibson
Currently creating the connected socket for a splice is split between
tcp_splice_conn_from_sock(), which opens the socket, and
tcp_splice_connect() which connects it. Alter tcp_splice_connect() to
open its own socket based on an address family and pif we pass it.
This does require a second conditional on pif, but makes for a more
logical split of functionality: tcp_splice_conn_from_sock() picks the
target, tcp_splice_connect() creates the connection. While we're
there improve reporting of errors
Signed-off-by: David Gibson
This makes a number of changes to improve error reporting while
connecting a new spliced socket:
* We use flow_err() and similar functions so all messages include info
on which specific flow was affected
* We use strerror() to interpret raw error values
* We now report errors on connection (at "trace" level, since this would
allow spamming the logs)
* We also look up and report some details on EPOLLERR events, which can
include connection errors, since we use a non-blocking connect(). Again
we use "trace" level since this can spam the logs.
Signed-off-by: David Gibson
This makes several tweaks to improve the logic which decides whether
we're able to use the splice method for a new connection.
* Rather than only calling tcp_splice_conn_from_sock() in pasta mode, we
check for pasta mode within it, better localising the checks.
* Previously if we got a connection from a non-loopback address we'd
always fall back to the "tap" path, even if the connection was on a
socket in the namespace. If we did get a non-loopback address on a
namespace socket, something has gone wrong and the "tap" path certainly
won't be able to handle it. Report the error and close, rather than
passing it along to tap.
Signed-off-by: David Gibson
tcp_listen_handler() uses the epoll reference for the listening socket
it handles, and also passes on one variant of it to
tcp_tap_conn_from_sock() and tcp_splice_conn_from_sock(). The latter
two functions only need a couple of specific fields from the
reference.
Pass those specific values instead of the whole reference, which
localises the handling of the listening (as opposed to accepted)
socket and its reference entirely within tcp_listen_handler().
Signed-off-by: David Gibson
TCP connections should typically not have wildcard addresses (0.0.0.0
or ::) nor a zero port number for either endpoint. It's not entirely
clear (at least to me) if it's strictly against the RFCs to do so, but
at any rate the socket interfaces often treat those values
specially[1], so it's not really possible to manipulate such
connections. Likewise they should not have broadcast or multicast
addresses for either endpoint.
However, nothing prevents a guest from creating a SYN packet with such
values, and it's not entirely clear what the effect on passt would be.
To ensure sane behaviour, explicitly check for this case and drop such
packets, logging a debug warning (we don't want a higher level,
because that would allow a guest to spam the logs).
We never expect such an address on an accept()ed socket either, but
just in case, check for it as well.
[1] Depending on context as "unknown", "match any" or "kernel, pick
something for me"
Signed-off-by: David Gibson
The "tap" interface, whether it's actually a tuntap device or a qemu
socket, presents a virtual external link between different network hosts.
Hence, loopback addresses make no sense there. However, nothing prevents
the guest from putting bogus packets with loopback addresses onto the
interface and it's not entirely clear what effect that will have on passt.
Explicitly test for such packets and drop them.
Signed-off-by: David Gibson
port_fwd_scan_udp() handles UDP, as the name suggests, but its
function comment has the wrong function name and references TCP, due
to a bad copy-paste from port_fwd_scan_tcp().
Signed-off-by: David Gibson
Currently port_fwd.[ch] contains helpers related to port forwarding,
particular automatic port forwarding. We're planning to allow much more
flexible sorts of forwarding, including both port translation and NAT based
on the flow table. This will subsume the existing port forwarding logic,
so rename port_fwd.[ch] to fwd.[ch] with matching updates to all the names
within.
Signed-off-by: David Gibson
On Wed, 28 Feb 2024 22:25:00 +1100
David Gibson
Here's another batch of cleanups and tweaks in preparation for the flow table. This set focuses on improved helpers for handling addresses, particularly in the TCP splice path.
Based on my other series adding more iovecs to the tap and pcap code, however the only conflicts should be trivial Makefile collisions.
Changes since v2: * Minor stylistic and formatting changes based on review * Some clarifying changes to the theory of operation notes on flow lifecycle * Rebased on top of new series cleaning up socket pool error handling. This removes a couple of patches from this series. * Small edits to commit message for improved clarity Changes since v1: * Rebased, and reordered in a way I hope is clearer * Add patch to rename port_fwd.[ch] * Added doc comments to clarify flow life cycle * Added uniform logging of flow start / end to match that lifecycle * union inany_addr typed special address constants * inany based tests for unspecified and multicast addresses, as well as loopback * Dropped patch allowing NULL parameter to inany_from_af(), turned out not to be that useful * Dropped sockaddr_any_init function, turned out not to be very useful in that form * Added patch enforcing no loopback addresses on tap interface * Added logic to sanity check TCP endpoint addresses * Moved socket creation into tcp_splice_connect() * Moved epoll ref parsing into tcp_listen_handler() * Allowed IN4_IS_*() helpers to work on void * addresses
Applied! -- Stefano
participants (2)
-
David Gibson
-
Stefano Brivio