[PATCH v8 00/27] Unified flow table
This is the seventh draft of an implementation of more general
"connection" tracking, as described at:
    https://pad.passt.top/p/NewForwardingModel

This series changes the TCP connection table and hash table into a
more general flow table that can track other protocols as well. Each
flow uniformly keeps track of all the relevant addresses and ports,
which will allow for more robust control of NAT and port forwarding.

ICMP and UDP are converted to use the new flow table.

This is based on the recent series of UDP flow table preliminaries.

Caveats:
 * We roughly double the size of a connection/flow entry
 * We don't yet record the local address of flows initiated from a
   socket, even in cases where it's bound to a specific address.

Changes since v7:
 * Rebase
 * Fix unintended regression in forwarding logic (we weren't applying
   map_gw logic to DNS packets, if they didn't hit explicit DNS
   forwarding rules).
 * Remove return value from pif_sockaddr(), it turned out not to be
   very useful.
 * More robust discarding of datagrams received between bind() and
   connect() on UDP reply sockets.
 * Avoid the name 'fside' for variables which was confusing in some
   contexts
 * Assorted minor changes based on feedback.

Changes since v6:
 * Complete redesign of the UDP flow handling
 * Rebased (handling the change to bind() probing for local addresses
   was surprisingly fiddly)
 * Replace sockaddr_from_inany() with pif_sockaddr() which can
   correctly handle scope_id for different interfaces, and returns
   whether the address is non-trivial for convenience
 * Preserve specific loopback addresses in forwarding logic

Changes since v5:
 * flowside_from_af() is now static
 * Small fixes to state verification
 * Pass protocol specific types into deferred/timer callbacks
 * No longer require complete forwarding address info for the hash
   table (we won't have it for UDP)
 * Fix bugs with logging of flow addresses
 * Make sure to initialise sin_zero field in sockaddr_from_inany()
 * Added patch to better type parameters to flow type specific
   callbacks
 * Terminology change "forwarded side" to "target side"
 * Assorted wording and style tweaks based on Stefano's review
 * Fold introduction of struct flowside and populating the initiating
   side together
 * Manage outbound addresses via the flow table as well
 * Support for UDP
 * Correct type of 'b' in flowside_lookup() (was a signed int)

Changes since v4:
 * flowside_from_af() no longer fills in unspecified addresses when
   passed NULL
 * Split and rename flow hash lookup function
 * Clarified flow state transitions, and enforced where practical
 * Made side 0 always the initiating side of a flow, rather than
   letting the protocol specific code decide
 * Separated pifs from flowside addresses to allow better structure
   packing

Changes since v3:
 * Complex rebase on top of the many things that have happened
   upstream since v2.
 * Assorted other changes.
 * Replace TAPFSIDE() and SOCKFSIDE() macros with local variables.

Changes since v2:
 * Cosmetic fixes based on review
 * Extra doc comments for enum flow_type
 * Rename flowside to flowaddrs which turns out to make more sense in
   light of future changes
 * Fix bug where the socket flowaddrs for tap initiated connections
   wasn't initialised to match the socket address we were using in
   the case of map-gw NAT
 * New flowaddrs_from_sock() helper used in most cases which is
   cleaner and should avoid bugs like the above
 * Using newer centralised workarounds for clang-tidy issue 58992
 * Remove duplicate definition of FLOW_MAX as maximum flow type and
   maximum number of tracked flows
 * Rebased on newer versions of preliminary work (ICMP, flow based
   dispatch and allocation, bind/address cleanups)
 * Unified hash table as well as base flow table
 * Integrated ICMP

Changes since v1:
 * Terminology changes
   - "Endpoint" address/port instead of "correspondent" address/port
   - "flowside" instead of "demiflow"
 * Actually move the connection table to a new flow table structure
   in new files
 * Significant rearrangement of earlier patches on top of that new
   table, to reduce churn

David Gibson (27):
  flow: Common address information for initiating side
  flow: Common address information for target side
  tcp, flow: Remove redundant information, repack connection structures
  tcp: Obtain guest address from flowside
  tcp: Manage outbound address via flow table
  tcp: Simplify endpoint validation using flowside information
  tcp_splice: Eliminate SPLICE_V6 flag
  tcp, flow: Replace TCP specific hash function with general flow hash
  flow, tcp: Generalise TCP hash table to general flow hash table
  tcp: Re-use flow hash for initial sequence number generation
  icmp: Remove redundant id field from flow table entry
  icmp: Obtain destination addresses from the flowsides
  icmp: Look up ping flows using flow hash
  icmp: Eliminate icmp_id_map
  flow: Helper to create sockets based on flowside
  icmp: Manage outbound socket address via flow table
  flow, tcp: Flow based NAT and port forwarding for TCP
  flow, icmp: Use general flow forwarding rules for ICMP
  fwd: Update flow forwarding logic for UDP
  udp: Create flows for datagrams from originating sockets
  udp: Handle "spliced" datagrams with per-flow sockets
  udp: Remove obsolete splice tracking
  udp: Find or create flows for datagrams from tap interface
  udp: Direct datagrams from host to guest via flow table
  udp: Remove obsolete socket tracking
  udp: Remove rdelta port forwarding maps
  udp: Rename UDP listening sockets

 Makefile       |    4 +-
 conf.c         |   14 +-
 epoll_type.h   |    6 +-
 flow.c         |  483 +++++++++++++++++++++-
 flow.h         |   47 +++
 flow_table.h   |   45 +-
 fwd.c          |  187 ++++++++-
 fwd.h          |    9 +
 icmp.c         |  105 ++---
 icmp_flow.h    |    2 -
 inany.h        |    2 -
 passt.c        |   10 +-
 passt.h        |    5 +-
 pif.c          |   40 ++
 pif.h          |   17 +
 tap.c          |   11 -
 tap.h          |    1 -
 tcp.c          |  522 ++++++-----------------
 tcp_buf.c      |    6 +-
 tcp_conn.h     |   47 +--
 tcp_internal.h |   10 +-
 tcp_splice.c   |   98 +----
 tcp_splice.h   |    5 +-
 udp.c          | 1079 ++++++++++++++++++++----------------------------
 udp.h          |   35 +-
 udp_flow.h     |   27 ++
 util.c         |    9 +-
 util.h         |    3 +
 28 files changed, 1563 insertions(+), 1266 deletions(-)
 create mode 100644 udp_flow.h

-- 
2.45.2
Handling of each protocol needs some degree of tracking of the
addresses and ports at the end of each connection or flow. Sometimes
that's explicit (as in the guest visible addresses for TCP
connections), sometimes implicit (the bound and connected addresses of
sockets).
To allow more consistent handling across protocols we want to
uniformly track the address and port at each end of the connection.
Furthermore, because we allow port remapping, and we sometimes need to
apply NAT, the addresses and ports can be different as seen by the
guest/namespace and as seen by the host.
Introduce 'struct flowside' to keep track of address and port
information related to one side of a flow. Store two of these in the
common fields of a flow to track that information for both sides.
For now we only populate the initiating side, requiring that
information be completed when a flow enters INI state. Later patches will
populate the target side.
For now this leaves some information redundantly recorded in both generic
and type specific fields. We'll fix that in later patches.
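
As a rough illustration of the idea (not the layout used in the patch),
the sketch below stores two per-side records in a common flow header
and fills in only the initiating side. Every name prefixed ex_ or EX_
is hypothetical, as is the choice to store IPv4 addresses in
IPv4-mapped IPv6 form and ports in host byte order.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: addresses and ports for one side of a flow. IPv4
 * addresses are assumed to be stored in IPv4-mapped IPv6 form. */
struct ex_flowside {
	struct in6_addr faddr;	/* forwarding address (our end) */
	struct in6_addr eaddr;	/* endpoint address (the peer) */
	in_port_t fport;	/* forwarding port, host byte order */
	in_port_t eport;	/* endpoint port, host byte order */
};

enum { EX_INISIDE = 0, EX_TGTSIDE = 1, EX_SIDES = 2 };

/* Hypothetical common header shared by every flow type */
struct ex_flow_common {
	uint8_t proto;			/* L4 protocol number */
	uint8_t pif[EX_SIDES];		/* interface each side lives on */
	struct ex_flowside side[EX_SIDES];
};

/* Fill in the initiating side only; the target side stays zeroed */
void ex_flow_initiate(struct ex_flow_common *f, uint8_t proto, uint8_t pif,
		      const struct in6_addr *eaddr, in_port_t eport,
		      const struct in6_addr *faddr, in_port_t fport)
{
	struct ex_flowside *ini;

	assert(f && eaddr && faddr);
	memset(f, 0, sizeof(*f));
	f->proto = proto;
	f->pif[EX_INISIDE] = pif;

	ini = &f->side[EX_INISIDE];
	ini->eaddr = *eaddr;
	ini->eport = eport;
	ini->faddr = *faddr;
	ini->fport = fport;
}
```

Keeping the interface identifiers in a separate array rather than
inside each side is one way to avoid padding between the two records.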
Signed-off-by: David Gibson
Require the address and port information for the target (non
initiating) side to be populated when a flow enters TGT state.
Implement that for TCP and ICMP. For now this leaves some information
redundantly recorded in both generic and type specific fields. We'll
fix that in later patches.
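
The ordering requirement can be pictured as a small state machine. The
following is only a sketch under assumptions: the state names, the
side_set[] bookkeeping and ex_flow_set_state() are invented for the
example, and only the INI and TGT states named above are modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of flow states, in the order they are entered */
enum ex_flow_state { EX_STATE_NEW, EX_STATE_INI, EX_STATE_TGT };

struct ex_flow {
	enum ex_flow_state state;
	bool side_set[2];	/* [0] initiating, [1] target side filled in */
};

/* Advance @f to @next. Returns 0 on success, -1 if the step skips a
 * state or the address information that state requires is missing. */
int ex_flow_set_state(struct ex_flow *f, enum ex_flow_state next)
{
	assert(f);

	/* Only single forward steps are allowed */
	if ((int)next != (int)f->state + 1)
		return -1;

	if (next == EX_STATE_INI && !f->side_set[0])
		return -1;
	if (next == EX_STATE_TGT && !f->side_set[1])
		return -1;

	f->state = next;
	return 0;
}
```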
For TCP we now use the information from the flow to construct the
destination socket address in both tcp_conn_from_tap() and
tcp_splice_connect().
Signed-off-by: David Gibson
Some information we explicitly store in the TCP connection is now
duplicated in the common flow structure. Access it from there instead, and
remove it from the TCP specific structure. With that done we can reorder
both the "tap" and "splice" TCP structures a bit to get better packing for
the new combined flow table entries.
Signed-off-by: David Gibson
Currently we always deliver inbound TCP packets to the guest's most
recent observed IP address. This has the odd side effect that if the
guest changes its IP address with active TCP connections we might
deliver packets from old connections to the new address. That won't
work; it will probably result in an RST from the guest. Worse, if the
guest added a new address but also retains the old one, then we could
break those old connections by redirecting them to the new address.
Now that we maintain flowside information, we have a record of the correct
guest side address and can just use it.
Signed-off-by: David Gibson
For now when we forward a connection to the host we leave the host side
forwarding address and port blank since we don't necessarily know what
source address and port will be used by the kernel. When the outbound
address option is active, though, we do know the address at least, so we
can record it in the flowside.
Having done that, use it as the primary source of truth, binding the
outgoing socket based on the information in there. This allows the
possibility of more complex rules for what outbound address and/or port
we use in future.
Signed-off-by: David Gibson
Now that we store all our endpoints in the flowside structure, use some
inany helpers to make validation of those endpoints simpler.
Signed-off-by: David Gibson
Since we're now constructing socket addresses based on information in the
flowside, we no longer need an explicit flag to tell if we're dealing with
an IPv4 or IPv6 connection. Hence, drop the now unused SPLICE_V6 flag.
As well as just simplifying the code, this allows for possible future
extensions where we could splice an IPv4 connection to an IPv6 connection
or vice versa.
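
To see why a separate flag becomes redundant, consider a helper that
derives the socket address family from the stored address itself. This
is a sketch, assuming addresses are kept as IPv6 with IPv4 in
IPv4-mapped form; ex_sockaddr_from_addr() is a hypothetical name, not a
function from the series.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Fill @ss from an address stored as IPv6 (IPv4 in IPv4-mapped form)
 * and a host byte order port. Returns the length to hand to bind() or
 * connect(). The family falls out of the address, no flag needed. */
socklen_t ex_sockaddr_from_addr(struct sockaddr_storage *ss,
				const struct in6_addr *addr, in_port_t port)
{
	assert(ss && addr);
	memset(ss, 0, sizeof(*ss));	/* also clears sin_zero */

	if (IN6_IS_ADDR_V4MAPPED(addr)) {
		struct sockaddr_in *sa4 = (struct sockaddr_in *)ss;

		sa4->sin_family = AF_INET;
		sa4->sin_port = htons(port);
		/* The IPv4 address is the last 4 bytes of the mapped form */
		memcpy(&sa4->sin_addr, &addr->s6_addr[12], 4);
		return sizeof(*sa4);
	} else {
		struct sockaddr_in6 *sa6 = (struct sockaddr_in6 *)ss;

		sa6->sin6_family = AF_INET6;
		sa6->sin6_port = htons(port);
		sa6->sin6_addr = *addr;
		return sizeof(*sa6);
	}
}
```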
Signed-off-by: David Gibson
Currently we match TCP packets received on the tap connection to a TCP
connection via a hash table based on the forwarding address and both
ports. We hope in future to allow for multiple guest side addresses, or
for multiple interfaces which means we may need to distinguish based on
the endpoint address and pif as well. We also want a unified hash table
to cover multiple protocols, not just TCP.
Replace the TCP specific hash function with one suitable for general flows,
or rather for one side of a general flow. This includes all the
information from struct flowside, plus the pif and the L4 protocol number.
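
A minimal sketch of such a per-side hash follows. It uses FNV-1a purely
as a stand-in so the example stays short; a real implementation would
want a keyed hash that resists collision attacks, such as SipHash. The
struct layout and all ex_ names are assumptions made for the example.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-side address record */
struct ex_flowside {
	struct in6_addr faddr, eaddr;
	in_port_t fport, eport;
};

/* One FNV-1a pass over @len bytes, continuing from state @h */
static uint64_t ex_fnv1a(uint64_t h, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		h ^= *p++;
		h *= 0x100000001b3ULL;		/* 64-bit FNV prime */
	}
	return h;
}

/* Hash one side of a flow: protocol, interface, then both address/port
 * pairs. Fields are fed in one by one so that struct padding never
 * influences the result. */
uint64_t ex_flow_hash(uint64_t secret, uint8_t proto, uint8_t pif,
		      const struct ex_flowside *side)
{
	uint64_t h = 0xcbf29ce484222325ULL ^ secret;	/* FNV offset basis */

	assert(side);
	h = ex_fnv1a(h, &proto, sizeof(proto));
	h = ex_fnv1a(h, &pif, sizeof(pif));
	h = ex_fnv1a(h, &side->faddr, sizeof(side->faddr));
	h = ex_fnv1a(h, &side->eaddr, sizeof(side->eaddr));
	h = ex_fnv1a(h, &side->fport, sizeof(side->fport));
	h = ex_fnv1a(h, &side->eport, sizeof(side->eport));
	return h;
}
```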
Signed-off-by: David Gibson
Move the data structures and helper functions for the TCP hash table to
flow.c, making it a general hash table indexing sides of flows. This is
largely code motion and straightforward renames. There are two semantic
changes:
* flow_lookup_af() now needs to verify that the entry has a matching
protocol and interface as well as matching addresses and ports.
* We double the size of the hash table, because it's now at least
theoretically possible for both sides of each flow to be hashed.
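
The first point can be sketched with a small open-addressing table. The
code below is illustrative only: the table size, the key layout and the
ex_ function names are assumptions, and it relies on the table never
being completely full so that probing always terminates.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_TABLE_SIZE 16	/* assumed larger than the number of entries */

/* Hypothetical key describing one side of a flow */
struct ex_side_key {
	uint8_t proto;
	uint8_t pif;
	struct in6_addr faddr, eaddr;
	in_port_t fport, eport;
};

static const struct ex_side_key *ex_table[EX_TABLE_SIZE];

/* Protocol and interface must match as well as addresses and ports */
bool ex_key_match(const struct ex_side_key *a, const struct ex_side_key *b)
{
	return a->proto == b->proto && a->pif == b->pif &&
	       a->fport == b->fport && a->eport == b->eport &&
	       !memcmp(&a->faddr, &b->faddr, sizeof(a->faddr)) &&
	       !memcmp(&a->eaddr, &b->eaddr, sizeof(a->eaddr));
}

/* Insert @key at the first free slot at or after its hash bucket */
void ex_hash_insert(uint64_t hash, const struct ex_side_key *key)
{
	size_t i = hash % EX_TABLE_SIZE;

	assert(key);
	while (ex_table[i])
		i = (i + 1) % EX_TABLE_SIZE;
	ex_table[i] = key;
}

/* Linear probe from the hash bucket; NULL if no entry fully matches */
const struct ex_side_key *ex_hash_lookup(uint64_t hash,
					 const struct ex_side_key *want)
{
	size_t i = hash % EX_TABLE_SIZE;

	assert(want);
	while (ex_table[i] && !ex_key_match(ex_table[i], want))
		i = (i + 1) % EX_TABLE_SIZE;
	return ex_table[i];
}
```

Two entries that differ only in protocol can land in the same bucket,
which is why the comparison cannot stop at addresses and ports.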
Signed-off-by: David Gibson
We generate TCP initial sequence numbers, when we need them, from a
hash of the source and destination addresses and ports, plus a
timestamp. Moments later, we generate another hash of the same
information plus some more to insert the connection into the flow hash
table.
With some tweaks to the flow_hash_insert() interface and changing the
order we can re-use that hash table hash for the initial sequence
number, rather than calculating another one. It won't generate
identical results, but that doesn't matter as long as the sequence
numbers are well scattered.
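
For illustration, one plausible way to fold an existing 64-bit flow
hash and a timestamp into a 32-bit sequence number is shown below. The
function name, the choice of the upper 32 bits and the 32ns clock tick
are all assumptions for the sketch, not details taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Derive an initial sequence number from an already computed flow hash
 * plus a coarse clock. Unsigned arithmetic wraps modulo 2^32, which is
 * what sequence number space wants. */
uint32_t ex_isn_from_hash(uint64_t flow_hash, const struct timespec *now)
{
	uint64_t ns;
	uint32_t seq;

	assert(now);
	ns = (uint64_t)now->tv_sec * 1000000000ULL + (uint64_t)now->tv_nsec;

	seq = (uint32_t)(flow_hash >> 32);	/* upper half of the hash */
	seq += (uint32_t)(ns >> 5);		/* advance once per 32ns */
	return seq;
}
```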
Signed-off-by: David Gibson
struct icmp_ping_flow contains a field for the ICMP id of the ping, but
this is now redundant, since the id is also stored as the "port" in the
common flowsides.
Signed-off-by: David Gibson
icmp_sock_handler() obtains the guest address from its most recently
observed IP. However, this can now be obtained from the common flowside
information.
icmp_tap_handler() builds its socket address for sendto() directly
from the destination address supplied by the incoming tap packet.
This can instead be generated from the flow.
Using the flowsides as the common source of truth here prepares us for
allowing more flexible NAT and forwarding by properly initialising
that flowside information.
Signed-off-by: David Gibson
When we receive a ping packet from the tap interface, we currently locate
the correct flow entry (if present) using an ancillary data structure, the
icmp_id_map[] tables. However, we can look this up using the flow hash
table - that's what it's for.
Signed-off-by: David Gibson
With previous reworks the icmp_id_map data structure is now maintained, but
never used for anything. Eliminate it.
Signed-off-by: David Gibson
We have upcoming use cases where it's useful to create a new bound socket
based on information from the flow table. Add flowside_sock_l4() to do
this for either PIF_HOST or PIF_SPLICE sockets.
Signed-off-by: David Gibson
For now when we forward a ping to the host we leave the host side
forwarding address and port blank since we don't necessarily know what
source address and id will be used by the kernel. When the outbound
address option is active, though, we do know the address at least, so we
can record it in the flowside.
Having done that, use it as the primary source of truth, binding the
outgoing socket based on the information in there. This allows the
possibility of more complex rules for what outbound address and/or id
we use in future.
To implement this we create a new helper which sets up a new socket based
on information in a flowside, which will also have future uses. It
behaves slightly differently from the existing ICMP code, in that it
doesn't bind to a specific interface if given a loopback address. This is
logically correct - the loopback address means we need to operate through
the host's loopback interface, not ifname_out. We didn't need it in ICMP
because ICMP will never generate a loopback address at this point, however
we intend to change that in future.
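
The loopback rule described above can be expressed as a small decision
helper. This is a sketch only: ex_is_loopback() and ex_bind_ifname()
are hypothetical names, and addresses are assumed to be stored as IPv6
with IPv4 in IPv4-mapped form.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>

/* True for ::1 and for anything in 127.0.0.0/8 (IPv4-mapped form) */
bool ex_is_loopback(const struct in6_addr *addr)
{
	assert(addr);
	if (IN6_IS_ADDR_V4MAPPED(addr))
		return addr->s6_addr[12] == 127;
	return IN6_IS_ADDR_LOOPBACK(addr);
}

/* Pick the interface name to bind the new socket to, or NULL for none.
 * A loopback source address has to go via the host's loopback
 * interface, so the configured outbound interface is ignored then. */
const char *ex_bind_ifname(const struct in6_addr *src, const char *ifname_out)
{
	return ex_is_loopback(src) ? NULL : ifname_out;
}
```

A caller would apply SO_BINDTODEVICE only when the returned name is not
NULL.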
Signed-off-by: David Gibson
Currently the code to translate host side addresses and ports to guest side
addresses and ports, and vice versa, is scattered across the TCP code.
This includes both port redirection as controlled by the -t and -T options,
and our special case NAT controlled by the --no-map-gw option.
Gather this logic into fwd_nat_from_*() functions for each input
interface in fwd.c which take protocol and address information for the
initiating side and generate the pif and address information for the
forwarded side. This performs any NAT or port forwarding needed.
We create a flow_target() helper which applies those forwarding functions
as needed to automatically move a flow from INI to TGT state.
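
As a simplified picture of what one such translation might look like,
the sketch below handles only the gateway mapping for IPv6 destinations
coming from the guest. struct ex_fwd_cfg and ex_nat_from_tap() are
hypothetical, and port redirection is deliberately left out.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical configuration subset */
struct ex_fwd_cfg {
	struct in6_addr gw;	/* gateway address as seen by the guest */
	bool map_gw;		/* false when --no-map-gw is given */
};

/* Given the destination the guest addressed, write the address to use
 * on the host side to @host_dst. Traffic to the gateway address is
 * redirected to the host's loopback unless mapping is disabled. */
void ex_nat_from_tap(const struct ex_fwd_cfg *cfg,
		     const struct in6_addr *guest_dst,
		     struct in6_addr *host_dst)
{
	assert(cfg && guest_dst && host_dst);

	if (cfg->map_gw && !memcmp(guest_dst, &cfg->gw, sizeof(*guest_dst)))
		*host_dst = in6addr_loopback;
	else
		*host_dst = *guest_dst;
}
```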
Signed-off-by: David Gibson
Currently ICMP hard codes its forwarding rules, and never applies any
translations. Change it to use the flow_target() function, so that
it's translated the same as TCP (excluding TCP specific port
redirection).
This means that gw mapping now applies to ICMP so "ping <gw address>" will
now ping the host's loopback instead of the actual gw machine. This
removes the surprising behaviour that the target you ping might not be the
same as you connect to with TCP.
This removes the last user of flow_target_af(), so that's removed as well.
Signed-off-by: David Gibson
Add logic to the fwd_nat_from_*() functions to forward UDP packets. The
logic here doesn't exactly match our current forwarding, since our current
forwarding has some very strange and buggy edge cases. Instead it's
attempting to replicate what appears to be the intended logic behind the
current forwarding.
Signed-off-by: David Gibson
This implements the first steps of tracking UDP packets with the flow table
rather than its own (buggy) set of port maps. Specifically we create flow
table entries for datagrams received from a socket (PIF_HOST or
PIF_SPLICE).
When splitting datagrams from sockets into batches, we group by the flow
as well as splicesrc. This may result in smaller batches, but makes things
easier down the line. We can re-optimise this later if necessary. For now
we don't do anything else with the flow, not even match reply packets to
the same flow.
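
The batching rule can be sketched as a run-length scan over
per-datagram metadata. The struct and function names are hypothetical,
and splicesrc is treated as an opaque value that must simply be equal
within a batch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-datagram metadata gathered after receiving */
struct ex_dgram_meta {
	uint32_t flow;		/* index of the flow the datagram belongs to */
	int splicesrc;		/* opaque here; must match within a batch */
};

/* Length of the batch starting at @start: the run of consecutive
 * datagrams sharing both flow and splicesrc with the first one. */
size_t ex_batch_len(const struct ex_dgram_meta *m, size_t start, size_t n)
{
	size_t i;

	assert(m && start < n);
	for (i = start + 1; i < n; i++) {
		if (m[i].flow != m[start].flow ||
		    m[i].splicesrc != m[start].splicesrc)
			break;
	}
	return i - start;
}
```

Interleaved traffic from several flows splits into many short runs,
which is the smaller-batch cost mentioned above.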
Signed-off-by: David Gibson
When forwarding a datagram to a socket, we need to find a socket with a
suitable local address to send it. Currently we keep track of such sockets
in an array indexed by local port, but this can't properly handle cases
where we have multiple local addresses in active use.
For "spliced" (socket to socket) cases, improve this by instead opening
a socket specifically for the target side of the flow. We connect() as
well as bind()ing that socket, so that it will only receive the flow's
reply packets, not anything else. We direct datagrams sent via that socket
using the addresses from the flow table, effectively replacing bespoke
addressing logic with the unified logic in fwd.c.
When we create the flow, we also take a duplicate of the originating
socket, and use that to deliver reply datagrams back to the origin, again
using addresses from the flow table entry.
Signed-off-by: David Gibson
Now that spliced datagrams are managed via the flow table, remove
UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used. With
those removed, the 'ts' field in udp_splice_port is also no longer used.
struct udp_splice_port now contains just a socket fd, so replace it with
a plain int in udp_splice_ns[] and udp_splice_init[]. The latter are still
used for tracking of automatic port forwarding.
Finally, the 'splice' field of union udp_epoll_ref is no longer used so
remove it as well.
Signed-off-by: David Gibson
Currently we create flows for datagrams from socket interfaces, and use
them to direct "spliced" (socket to socket) datagrams. We don't yet
match datagrams from the tap interface to existing flows, nor create new
flows for them. Add that functionality, matching datagrams from tap to
existing flows when they exist, or creating new ones.
As with spliced flows, when creating a new flow from tap to socket, we
create a new connected socket to receive reply datagrams attached to that
flow specifically. We extend udp_flow_sock_handler() to handle reply
packets bound for tap rather than another socket.
For non-obvious reasons (perhaps increased stack usage?), this caused
a failure for me when running under valgrind, because valgrind invoked
rt_sigreturn which is not in our seccomp filter. Since we already
allow rt_sigaction and others in the valgrind target, it seems
reasonable to add rt_sigreturn as well.
Signed-off-by: David Gibson
This replaces the last piece of existing UDP port tracking with the
common flow table. Specifically use the flow table to direct datagrams
from host sockets to the guest tap interface. Since this now requires
a flow for every datagram, we add some logging if we encounter any
datagrams for which we can't find or create a flow.
Signed-off-by: David Gibson
Now that UDP datagrams are all directed via the flow table, we no longer
use the udp_tap_map[] or udp_act[] arrays. Remove them and connected
code.
Signed-off-by: David Gibson
In addition to the struct fwd_ports used by both UDP and TCP to track
port forwarding, UDP also included an 'rdelta' field, which contained the
reverse mapping of the main port map. This was used so that we could
properly direct reply packets to a forwarded packet where we change the
destination port. This has now been taken over by the flow table: reply
packets will match the flow of the originating packet, and that gives the
correct ports on the originating side.
So, eliminate the rdelta field, and with it struct udp_fwd_ports, which
now has no additional information over struct fwd_ports.
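
To illustrate why the reverse map is no longer needed: once both sides
of a flow record their ports, the ports for a datagram travelling in
either direction can be read from the opposite side. The structures and
the function below are hypothetical and track ports only.

```c
#include <assert.h>
#include <netinet/in.h>

/* Hypothetical: the ports recorded for one side of a UDP flow */
struct ex_side_ports {
	in_port_t fport;	/* our port on this side */
	in_port_t eport;	/* the peer's port on this side */
};

struct ex_udp_flow {
	struct ex_side_ports side[2];	/* side 0 initiated the flow */
};

/* A datagram arrived on side @from. Work out the source and
 * destination ports to use when sending it out on the other side. */
void ex_forward_ports(const struct ex_udp_flow *f, unsigned from,
		      in_port_t *sport, in_port_t *dport)
{
	const struct ex_side_ports *to;

	assert(f && sport && dport && from < 2);
	to = &f->side[!from];
	*sport = to->fport;
	*dport = to->eport;
}
```

In this example a client using port 50000 reaches host port 8080,
which is forwarded to guest port 80. The guest's reply, arriving on
side 1, goes back out from port 8080 to port 50000 with no separate
reverse table involved.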
Signed-off-by: David Gibson
EPOLL_TYPE_UDP is now only used for "listening" sockets: long lived
sockets which can initiate new flows. Rename it to EPOLL_TYPE_UDP_LISTEN,
and rename associated functions to match. Along with that, remove the .orig
field from union udp_listen_epoll_ref, since it is now always true.
Signed-off-by: David Gibson
On Thu, 18 Jul 2024 15:26:26 +1000
David Gibson wrote:

> [...]

Applied (!)

-- 
Stefano
On Fri, Jul 19, 2024 at 09:20:27PM +0200, Stefano Brivio wrote:
> On Thu, 18 Jul 2024 15:26:26 +1000
> David Gibson wrote:
>
> > [...]
>
> Applied (!)

🎉

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson