[PATCH v7 00/27] Unified flow table
This is the seventh draft of an implementation of more general
"connection" tracking, as described at:

    https://pad.passt.top/p/NewForwardingModel

This series changes the TCP connection table and hash table into a
more general flow table that can track other protocols as well.  Each
flow uniformly keeps track of all the relevant addresses and ports,
which will allow for more robust control of NAT and port forwarding.
ICMP and UDP are converted to use the new flow table.

This is based on the recent series of UDP flow table preliminaries.

Caveats:
 * We roughly double the size of a connection/flow entry
 * We don't yet record the local address of flows initiated from a
   socket, even in cases where it's bound to a specific address.

Changes since v6:
 * Complete redesign of the UDP flow handling
 * Rebased (handling the change to bind() probing for local addresses
   was surprisingly fiddly)
 * Replace sockaddr_from_inany() with pif_sockaddr() which can
   correctly handle scope_id for different interfaces, and returns
   whether the address is non-trivial for convenience
 * Preserve specific loopback addresses in forwarding logic

Changes since v5:
 * flowside_from_af() is now static
 * Small fixes to state verification
 * Pass protocol specific types into deferred/timer callbacks
 * No longer require complete forwarding address info for the hash
   table (we won't have it for UDP)
 * Fix bugs with logging of flow addresses
 * Make sure to initialise sin_zero field in sockaddr_from_inany()
 * Added patch for better typing of parameters to flow type specific
   callbacks
 * Terminology change "forwarded side" to "target side"
 * Assorted wording and style tweaks based on Stefano's review
 * Fold introduction of struct flowside and populating the initiating
   side together
 * Manage outbound addresses via the flow table as well
 * Support for UDP
 * Correct type of 'b' in flowside_lookup() (was a signed int)

Changes since v4:
 * flowside_from_af() no longer fills in unspecified addresses when
   passed NULL
 * Split and rename flow hash lookup function
 * Clarified flow state transitions, and enforced where practical
 * Made side 0 always the initiating side of a flow, rather than
   letting the protocol specific code decide
 * Separated pifs from flowside addresses to allow better structure
   packing

Changes since v3:
 * Complex rebase on top of the many things that have happened
   upstream since v2.
 * Assorted other changes.
 * Replace TAPFSIDE() and SOCKFSIDE() macros with local variables.

Changes since v2:
 * Cosmetic fixes based on review
 * Extra doc comments for enum flow_type
 * Rename flowside to flowaddrs which turns out to make more sense in
   light of future changes
 * Fix bug where the socket flowaddrs for tap initiated connections
   wasn't initialised to match the socket address we were using in the
   case of map-gw NAT
 * New flowaddrs_from_sock() helper used in most cases which is
   cleaner and should avoid bugs like the above
 * Using newer centralised workarounds for clang-tidy issue 58992
 * Remove duplicate definition of FLOW_MAX as maximum flow type and
   maximum number of tracked flows
 * Rebased on newer versions of preliminary work (ICMP, flow based
   dispatch and allocation, bind/address cleanups)
 * Unified hash table as well as base flow table
 * Integrated ICMP

Changes since v1:
 * Terminology changes
   - "Endpoint" address/port instead of "correspondent" address/port
   - "flowside" instead of "demiflow"
 * Actually move the connection table to a new flow table structure in
   new files
 * Significant rearrangement of earlier patches on top of that new
   table, to reduce churn

David Gibson (27):
  flow: Common address information for initiating side
  flow: Common address information for target side
  tcp, flow: Remove redundant information, repack connection structures
  tcp: Obtain guest address from flowside
  tcp: Manage outbound address via flow table
  tcp: Simplify endpoint validation using flowside information
  tcp_splice: Eliminate SPLICE_V6 flag
  tcp, flow: Replace TCP specific hash function with general flow hash
  flow, tcp: Generalise TCP hash table to general flow hash table
  tcp: Re-use flow hash for initial sequence number generation
  icmp: Remove redundant id field from flow table entry
  icmp: Obtain destination addresses from the flowsides
  icmp: Look up ping flows using flow hash
  icmp: Eliminate icmp_id_map
  flow: Helper to create sockets based on flowside
  icmp: Manage outbound socket address via flow table
  flow, tcp: Flow based NAT and port forwarding for TCP
  flow, icmp: Use general flow forwarding rules for ICMP
  fwd: Update flow forwarding logic for UDP
  udp: Create flows for datagrams from originating sockets
  udp: Handle "spliced" datagrams with per-flow sockets
  udp: Remove obsolete splice tracking
  udp: Find or create flows for datagrams from tap interface
  udp: Direct datagrams from host to guest via flow table
  udp: Remove obsolete socket tracking
  udp: Remove rdelta port forwarding maps
  udp: Rename UDP listening sockets

 Makefile       |    4 +-
 conf.c         |   14 +-
 epoll_type.h   |    6 +-
 flow.c         |  481 +++++++++++++++++++++-
 flow.h         |   47 +++
 flow_table.h   |   57 ++-
 fwd.c          |  184 ++++++++-
 fwd.h          |    9 +
 icmp.c         |  105 ++---
 icmp_flow.h    |    2 -
 inany.h        |    2 -
 passt.c        |   10 +-
 passt.h        |    5 +-
 pif.c          |   45 +++
 pif.h          |   17 +
 tap.c          |   11 -
 tap.h          |    1 -
 tcp.c          |  521 ++++++------------------
 tcp_buf.c      |    6 +-
 tcp_conn.h     |   51 +--
 tcp_internal.h |   10 +-
 tcp_splice.c   |   98 +----
 tcp_splice.h   |    5 +-
 udp.c          | 1055 +++++++++++++++++++----------------------------
 udp.h          |   33 +-
 udp_flow.h     |   27 ++
 util.c         |    9 +-
 util.h         |    3 +
 28 files changed, 1549 insertions(+), 1269 deletions(-)
 create mode 100644 udp_flow.h

--
2.45.2
Handling of each protocol needs some degree of tracking of the
addresses and ports at the end of each connection or flow. Sometimes
that's explicit (as in the guest visible addresses for TCP
connections), sometimes implicit (the bound and connected addresses of
sockets).
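As a concrete illustration of the implicit case: a connected socket already records the addresses at both of its ends, and the kernel hands them back on request via getsockname() and getpeername(). The helper below is purely illustrative (socket_ends() is not a function from this series):

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Recover the "implicitly tracked" local and peer addresses of a
 * connected socket.  Hypothetical helper, for illustration only. */
static int socket_ends(int s, struct sockaddr_storage *local,
		       struct sockaddr_storage *peer)
{
	socklen_t ll = sizeof(*local), pl = sizeof(*peer);

	if (getsockname(s, (struct sockaddr *)local, &ll) < 0)
		return -1;

	return getpeername(s, (struct sockaddr *)peer, &pl);
}
```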
To allow more consistent handling across protocols we want to
uniformly track the address and port at each end of the connection.
Furthermore, because we allow port remapping, and we sometimes need to
apply NAT, the addresses and ports can differ as seen by the
guest/namespace and as seen by the host.
Introduce 'struct flowside' to keep track of address and port
information related to one side of a flow. Store two of these in the
common fields of a flow to track that information for both sides.
For now we only populate the initiating side, requiring that
information be completed when a flow enters INI state.  Later patches
will populate the target side.
For now this leaves some information redundantly recorded in both generic
and type specific fields. We'll fix that in later patches.
Signed-off-by: David Gibson
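As a rough sketch of the shape this gives each flow (not the exact definitions from the patch — the real code uses passt's union inany_addr, simplified here to a bare struct in6_addr, and the struct and helper names below are stand-ins):

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>

/* Simplified stand-in for one side of a flow.  Field names follow the
 * diffs in this series: eaddr/eport are the endpoint as seen on that
 * side, faddr/fport the forwarding address/port on that side. */
struct flowside {
	struct in6_addr eaddr;
	struct in6_addr faddr;
	in_port_t eport;
	in_port_t fport;
};

/* Each flow stores both sides; side 0 is always the initiating side */
#define INISIDE 0
#define TGTSIDE 1

struct flow_common_sketch {
	struct flowside side[2];
};

/* Comparison helper in the style of flowside_eq() from the series */
static bool flowside_eq_sketch(const struct flowside *l,
			       const struct flowside *r)
{
	return !memcmp(&l->eaddr, &r->eaddr, sizeof(l->eaddr)) &&
	       l->eport == r->eport &&
	       !memcmp(&l->faddr, &r->faddr, sizeof(l->faddr)) &&
	       l->fport == r->fport;
}
```

Keeping two of these per flow is what allows the guest-visible and host-visible address/port pairs to differ, which is the basis for the NAT and forwarding logic later in the series.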
Require the address and port information for the target
(non-initiating) side to be populated when a flow enters TGT state.
Implement that for TCP and ICMP. For now this leaves some information
redundantly recorded in both generic and type specific fields. We'll
fix that in later patches.
For TCP we now use the information from the flow to construct the
destination socket address in both tcp_conn_from_tap() and
tcp_splice_connect().
Signed-off-by: David Gibson
Two minor details:
On Fri, 5 Jul 2024 12:06:59 +1000
David Gibson
Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches.
For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect().
Signed-off-by: David Gibson
---
 flow.c       | 38 ++++++++++++++++++------
 flow_table.h |  5 +++-
 icmp.c       |  3 +-
 inany.h      |  1 -
 pif.c        | 45 ++++++++++++++++++++++++++++
 pif.h        | 17 +++++++++++
 tcp.c        | 82 ++++++++++++++++++++++++++++------------------------
 tcp_splice.c | 45 +++++++++++-----------------
 8 files changed, 158 insertions(+), 78 deletions(-)

diff --git a/flow.c b/flow.c
index 44e7b3b8..f064fad1 100644
--- a/flow.c
+++ b/flow.c
@@ -165,8 +165,10 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
  */
 static void flow_set_state(struct flow_common *f, enum flow_state state)
 {
-	char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
+	char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN];
+	char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN];
 	const struct flowside *ini = &f->side[INISIDE];
+	const struct flowside *tgt = &f->side[TGTSIDE];
 	uint8_t oldstate = f->state;

 	ASSERT(state < FLOW_NUM_STATES);
@@ -177,19 +179,24 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
 		  FLOW_STATE(f));

 	if (MAX(state, oldstate) >= FLOW_STATE_TGT)
-		flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s",
+		flow_log_(f, LOG_DEBUG,
+			  "%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu",
 			  pif_name(f->pif[INISIDE]),
-			  inany_ntop(&ini->eaddr, estr, sizeof(estr)),
+			  inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
 			  ini->eport,
-			  inany_ntop(&ini->faddr, fstr, sizeof(fstr)),
+			  inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)),
 			  ini->fport,
-			  pif_name(f->pif[TGTSIDE]));
+			  pif_name(f->pif[TGTSIDE]),
+			  inany_ntop(&tgt->faddr, fstr1, sizeof(fstr1)),
+			  tgt->fport,
+			  inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)),
+			  tgt->eport);
 	else if (MAX(state, oldstate) >= FLOW_STATE_INI)
 		flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?",
 			  pif_name(f->pif[INISIDE]),
-			  inany_ntop(&ini->eaddr, estr, sizeof(estr)),
+			  inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
 			  ini->eport,
-			  inany_ntop(&ini->faddr, fstr, sizeof(fstr)),
+			  inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)),
 			  ini->fport);
 }

@@ -261,21 +268,34 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
 }

 /**
- * flow_target() - Move flow to TGT, setting TGTSIDE details
+ * flow_target_af() - Move flow to TGT, setting TGTSIDE details
  * @flow:	Flow to change state
  * @pif:	pif of the target side
+ * @af:		Address family for @eaddr and @faddr
+ * @saddr:	Source address (pointer to in_addr or in6_addr)
+ * @sport:	Endpoint port
+ * @daddr:	Destination address (pointer to in_addr or in6_addr)
+ * @dport:	Destination port
+ *
+ * Return: pointer to the target flowside information
  */
-void flow_target(union flow *flow, uint8_t pif)
+const struct flowside *flow_target_af(union flow *flow, uint8_t pif,
+				      sa_family_t af,
+				      const void *saddr, in_port_t sport,
+				      const void *daddr, in_port_t dport)
 {
 	struct flow_common *f = &flow->f;
+	struct flowside *tgt = &f->side[TGTSIDE];

 	ASSERT(pif != PIF_NONE);
 	ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI);
 	ASSERT(f->type == FLOW_TYPE_NONE);
 	ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE);

+	flowside_from_af(tgt, af, daddr, dport, saddr, sport);
 	f->pif[TGTSIDE] = pif;
 	flow_set_state(f, FLOW_STATE_TGT);
+	return tgt;
 }

 /**
diff --git a/flow_table.h b/flow_table.h
index ad1bc787..00dca4b2 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -114,7 +114,10 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
 const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
 					const union sockaddr_inany *ssa,
 					in_port_t dport);
-void flow_target(union flow *flow, uint8_t pif);
+const struct flowside *flow_target_af(union flow *flow, uint8_t pif,
+				      sa_family_t af,
+				      const void *saddr, in_port_t sport,
+				      const void *daddr, in_port_t dport);

 union flow *flow_set_type(union flow *flow, enum flow_type type);
 #define FLOW_SET_TYPE(flow_, t_, var_)	(&flow_set_type((flow_), (t_))->var_)
diff --git a/icmp.c b/icmp.c
index cf88ac1f..fd92c7da 100644
--- a/icmp.c
+++ b/icmp.c
@@ -167,7 +167,8 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 		return NULL;

 	flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id);
-	flow_target(flow, PIF_HOST);
+	/* FIXME: Record outbound source address when known */
+	flow_target_af(flow, PIF_HOST, af, NULL, 0, daddr, 0);
 	pingf = FLOW_SET_TYPE(flow, flowtype, ping);

 	pingf->seq = -1;
diff --git a/inany.h b/inany.h
index 47b66fa9..8eaf5335 100644
--- a/inany.h
+++ b/inany.h
@@ -187,7 +187,6 @@ static inline bool inany_is_unspecified(const union inany_addr *a)
  *
  * Return: true if @a is in fe80::/10 (IPv6 link local unicast)
  */
-/* cppcheck-suppress unusedFunction */
 static inline bool inany_is_linklocal6(const union inany_addr *a)
 {
 	return IN6_IS_ADDR_LINKLOCAL(&a->a6);
diff --git a/pif.c b/pif.c
index ebf01cc8..9f2d39cc 100644
--- a/pif.c
+++ b/pif.c
@@ -7,9 +7,14 @@

 #include
 #include
+#include

 #include "util.h"
 #include "pif.h"
+#include "siphash.h"
+#include "ip.h"
+#include "inany.h"
+#include "passt.h"

 const char *pif_type_str[] = {
 	[PIF_NONE]		= "<none>",
@@ -19,3 +24,43 @@ const char *pif_type_str[] = {
 };
 static_assert(ARRAY_SIZE(pif_type_str) == PIF_NUM_TYPES,
	      "pif_type_str[] doesn't match enum pif_type");
+
+
+/** pif_sockaddr() - Construct a socket address suitable for an interface
+ * @c:		Execution context
+ * @sa:		Pointer to sockaddr to fill in
+ * @sl:		Updated to relevant of length of initialised @sa
to relevant length
+ * @pif:	Interface to create the socket address
+ * @addr:	IPv[46] address
+ * @port:	Port (host byte order)
+ *
+ * Return: true if resulting socket address is non-trivial (specified address or
+ *         non-zero port), false otherwise
This is not really intuitive in the only caller using this, tcp_bind_outbound(). I wonder if it would make more sense to perform this check directly there, and have this returning void instead.
+ */
+bool pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl,
+		  uint8_t pif, const union inany_addr *addr, in_port_t port)
+{
+	const struct in_addr *v4 = inany_v4(addr);
+
+	ASSERT(pif_is_socket(pif));
+
+	if (v4) {
+		sa->sa_family = AF_INET;
+		sa->sa4.sin_addr = *v4;
+		sa->sa4.sin_port = htons(port);
+		memset(&sa->sa4.sin_zero, 0, sizeof(sa->sa4.sin_zero));
+		*sl = sizeof(sa->sa4);
+		return !IN4_IS_ADDR_UNSPECIFIED(v4) || port;
+	}
+
+	sa->sa_family = AF_INET6;
+	sa->sa6.sin6_addr = addr->a6;
+	sa->sa6.sin6_port = htons(port);
+	if (pif == PIF_HOST && IN6_IS_ADDR_LINKLOCAL(&addr->a6))
+		sa->sa6.sin6_scope_id = c->ifi6;
+	else
+		sa->sa6.sin6_scope_id = 0;
+	sa->sa6.sin6_flowinfo = 0;
+	*sl = sizeof(sa->sa6);
+	return !IN6_IS_ADDR_UNSPECIFIED(&addr->a6) || port;
+}
-- Stefano
On Wed, Jul 10, 2024 at 11:30:38PM +0200, Stefano Brivio wrote:
Two minor details:
On Fri, 5 Jul 2024 12:06:59 +1000 David Gibson
wrote: Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches.
For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect().
Signed-off-by: David Gibson
[...]

+/** pif_sockaddr() - Construct a socket address suitable for an interface
+ * @sl:		Updated to relevant of length of initialised @sa
to relevant length
Done.
+ * @pif: Interface to create the socket address + * @addr: IPv[46] address + * @port: Port (host byte order) + * + * Return: true if resulting socket address is non-trivial (specified address or + * non-zero port), false otherwise
This is not really intuitive in the only caller using this, tcp_bind_outbound(). I wonder if it would make more sense to perform this check directly there, and have this returning void instead.
Yeah, done. When I implemented the return value I thought I was going to want it in more places than turned out to be the case.
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Some information we explicitly store in the TCP connection is now
duplicated in the common flow structure. Access it from there instead, and
remove it from the TCP specific structure. With that done we can reorder
both the "tap" and "splice" TCP structures a bit to get better packing for
the new combined flow table entries.
Signed-off-by: David Gibson
Currently we always deliver inbound TCP packets to the guest's most
recent observed IP address. This has the odd side effect that if the
guest changes its IP address with active TCP connections we might
deliver packets from old connections to the new address. That won't
work; it will probably result in an RST from the guest. Worse, if the
guest added a new address but also retains the old one, then we could
break those old connections by redirecting them to the new address.
Now that we maintain flowside information, we have a record of the correct
guest side address and can just use it.
Signed-off-by: David Gibson
For now when we forward a connection to the host we leave the host side
forwarding address and port blank since we don't necessarily know what
source address and port will be used by the kernel. When the outbound
address option is active, though, we do know the address at least, so we
can record it in the flowside.
Having done that, use it as the primary source of truth, binding the
outgoing socket based on the information in there. This allows the
possibility of more complex rules for what outbound address and/or port
we use in future.
Signed-off-by: David Gibson
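The underlying pattern — bind the outgoing socket to the recorded forwarding-side address before connecting, so the kernel uses that source address — can be sketched in isolation like this (illustrative only; connect_from() is a hypothetical helper, and the series does this through the flowside and pif_sockaddr()):

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind-before-connect: if a specific outbound address is recorded,
 * bind the socket to it (port 0, so the kernel still picks the port)
 * before connecting to the destination. */
static int connect_from(const struct in_addr *src,
			const struct sockaddr_in *dst)
{
	int s = socket(AF_INET, SOCK_STREAM, 0);

	if (s < 0)
		return -1;

	if (src && src->s_addr != htonl(INADDR_ANY)) {
		struct sockaddr_in local;

		memset(&local, 0, sizeof(local));
		local.sin_family = AF_INET;
		local.sin_addr = *src;

		if (bind(s, (struct sockaddr *)&local, sizeof(local)) < 0) {
			close(s);
			return -1;
		}
	}

	if (connect(s, (const struct sockaddr *)dst, sizeof(*dst)) < 0) {
		close(s);
		return -1;
	}
	return s;
}
```

Keeping the address in the flowside rather than in an option check at the call site is what leaves room for the more complex outbound-address rules mentioned above.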
Now that we store all our endpoints in the flowside structure, use some
inany helpers to make validation of those endpoints simpler.
Signed-off-by: David Gibson
Since we're now constructing socket addresses based on information in the
flowside, we no longer need an explicit flag to tell if we're dealing with
an IPv4 or IPv6 connection. Hence, drop the now unused SPLICE_V6 flag.
As well as just simplifying the code, this allows for possible future
extensions where we could splice an IPv4 connection to an IPv6 connection
or vice versa.
Signed-off-by: David Gibson
Currently we match TCP packets received on the tap connection to a TCP
connection via a hash table based on the forwarding address and both
ports. We hope in future to allow for multiple guest side addresses, or
for multiple interfaces which means we may need to distinguish based on
the endpoint address and pif as well. We also want a unified hash table
to cover multiple protocols, not just TCP.
Replace the TCP specific hash function with one suitable for general flows,
or rather for one side of a general flow. This includes all the
information from struct flowside, plus the pif and the L4 protocol number.
Signed-off-by: David Gibson
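To make the hash inputs concrete: the sketch below hashes a per-side tuple of both addresses, both ports, the pif, and the L4 protocol number. The tree uses keyed SipHash; FNV-1a here is only a stand-in to show what goes in, and struct side_tuple is a hypothetical flattened form of the real structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Everything one side of a flow is hashed over: the flowside fields
 * plus interface (pif) and L4 protocol.  Laid out so the struct has
 * no padding bytes. */
struct side_tuple {
	uint8_t pif;		/* interface this side is attached to */
	uint8_t proto;		/* L4 protocol number (IPPROTO_*) */
	uint16_t eport, fport;	/* endpoint and forwarding ports */
	uint8_t eaddr[16];	/* inany-style 16-byte addresses */
	uint8_t faddr[16];
};

/* FNV-1a over the tuple bytes; a stand-in for the keyed SipHash the
 * real code uses. */
static uint64_t side_hash(const struct side_tuple *t)
{
	const uint8_t *p = (const uint8_t *)t;
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV offset basis */
	size_t i;

	for (i = 0; i < sizeof(*t); i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;		/* FNV prime */
	}
	return h;
}
```

Including the pif and protocol is what lets a single table serve multiple protocols and, later, multiple interfaces.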
Move the data structures and helper functions for the TCP hash table to
flow.c, making it a general hash table indexing sides of flows. This is
largely code motion and straightforward renames. There are two semantic
changes:
* flow_lookup_af() now needs to verify that the entry has a matching
protocol and interface as well as matching addresses and ports.
* We double the size of the hash table, because it's now at least
theoretically possible for both sides of each flow to be hashed.
Signed-off-by: David Gibson
We generate TCP initial sequence numbers, when we need them, from a
hash of the source and destination addresses and ports, plus a
timestamp. Moments later, we generate another hash of the same
information plus some more to insert the connection into the flow hash
table.
With some tweaks to the flow_hash_insert() interface and changing the
order we can re-use that hash table hash for the initial sequence
number, rather than calculating another one. It won't generate
identical results, but that doesn't matter as long as the sequence
numbers are well scattered.
Signed-off-by: David Gibson
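The idea can be sketched as follows (hypothetical helper names; the real code feeds the SipHash-based flow hash and a clock into this general shape, which roughly follows RFC 6528):

```c
#include <assert.h>
#include <stdint.h>

/* Derive a TCP initial sequence number from the 64-bit hash already
 * computed for flow hash table insertion, plus a timestamp so ISNs
 * advance over time (~4 us granularity here, per RFC 6528's shape). */
static uint32_t isn_from_hash(uint64_t flow_hash, uint64_t now_ns)
{
	return (uint32_t)flow_hash + (uint32_t)(now_ns >> 12);
}
```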
struct icmp_ping_flow contains a field for the ICMP id of the ping, but
this is now redundant, since the id is also stored as the "port" in the
common flowsides.
Signed-off-by: David Gibson
icmp_sock_handler() obtains the guest address from its most recently
observed IP. However, this can now be obtained from the common flowside
information.
icmp_tap_handler() builds its socket address for sendto() directly
from the destination address supplied by the incoming tap packet.
This can instead be generated from the flow.
Using the flowsides as the common source of truth here prepares us for
allowing more flexible NAT and forwarding by properly initialising
that flowside information.
Signed-off-by: David Gibson
When we receive a ping packet from the tap interface, we currently locate
the correct flow entry (if present) using an ancillary data structure, the
icmp_id_map[] tables. However, we can look this up using the flow hash
table - that's what it's for.
Signed-off-by: David Gibson
With previous reworks the icmp_id_map data structure is now maintained, but
never used for anything. Eliminate it.
Signed-off-by: David Gibson
We have upcoming use cases where it's useful to create new bound sockets
based on information from the flow table.  Add flowside_sock_l4() to do
this for either PIF_HOST or PIF_SPLICE sockets.
Signed-off-by: David Gibson
On Fri, 5 Jul 2024 12:07:12 +1000
David Gibson
We have upcoming use cases where it's useful to create new bound socket based on information from the flow table. Add flowside_sock_l4() to do this for either PIF_HOST or PIF_SPLICE sockets.
Signed-off-by: David Gibson
---
 flow.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 flow.h |  3 ++
 util.c |  6 ++--
 util.h |  3 ++
 4 files changed, 101 insertions(+), 3 deletions(-)

diff --git a/flow.c b/flow.c
index 6f09781a..2d0a8a32 100644
--- a/flow.c
+++ b/flow.c
@@ -5,9 +5,11 @@
  * Tracking for logical "flows" of packets.
  */

+#include
 #include
 #include
 #include
+#include
 #include

 #include "util.h"
@@ -143,6 +145,96 @@ static void flowside_from_af(struct flowside *fside, sa_family_t af,
 	fside->eport = eport;
 }

+/**
+ * struct flowside_sock_args - Parameters for flowside_sock_splice()
+ * @c:		Execution context
+ * @fd:		Filled in with new socket fd
+ * @err:	Filled in with errno if something failed
+ * @type:	Socket epoll type
+ * @sa:		Socket address
+ * @sl:		Length of @sa
+ * @data:	epoll reference data
+ */
+struct flowside_sock_args {
+	const struct ctx *c;
+	int fd;
+	int err;
+	enum epoll_type type;
+	const struct sockaddr *sa;
+	socklen_t sl;
+	const char *path;
+	uint32_t data;
+};
+
+/** flowside_sock_splice() - Create and bind socket for PIF_SPLICE based on flowside
+ * @arg:	Argument as a struct flowside_sock_args
+ *
+ * Return: 0
+ */
+static int flowside_sock_splice(void *arg)
+{
+	struct flowside_sock_args *a = arg;
+
+	ns_enter(a->c);
+
+	a->fd =  sock_l4_sa(a->c, a->type, a->sa, a->sl, NULL,
Nit: assuming you wanted the extra whitespace here to align the assignment with the one of a->err below, I'd rather write this (at least for consistency) as "a->fd = ...".
+			   a->sa->sa_family == AF_INET6, a->data);
+	a->err = errno;
+
+	return 0;
+}
+
+/** flowside_sock_l4() - Create and bind socket based on flowside
+ * @c:		Execution context
+ * @type:	Socket epoll type
+ * @pif:	Interface for this socket
+ * @tgt:	Target flowside
+ * @data:	epoll reference portion for protocol handlers
+ *
+ * Return: socket fd of protocol @proto bound to the forwarding address and port
+ *         from @tgt (if specified).
+ */
+/* cppcheck-suppress unusedFunction */
+int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
+		     const struct flowside *tgt, uint32_t data)
+{
+	const char *ifname = NULL;
+	union sockaddr_inany sa;
+	socklen_t sl;
+
+	ASSERT(pif_is_socket(pif));
+
+	pif_sockaddr(c, &sa, &sl, pif, &tgt->faddr, tgt->fport);
+
+	switch (pif) {
+	case PIF_HOST:
+		if (inany_is_loopback(&tgt->faddr))
+			ifname = NULL;
+		else if (sa.sa_family == AF_INET)
+			ifname = c->ip4.ifname_out;
+		else if (sa.sa_family == AF_INET6)
+			ifname = c->ip6.ifname_out;
+
+		return sock_l4_sa(c, type, &sa, sl, ifname,
+				  sa.sa_family == AF_INET6, data);
+
+	case PIF_SPLICE: {
+		struct flowside_sock_args args = {
+			.c = c, .type = type,
+			.sa = &sa.sa, .sl = sl, .data = data,
+		};
+		NS_CALL(flowside_sock_splice, &args);
+		errno = args.err;
+		return args.fd;
+	}
+
+	default:
+		/* If we add new socket pifs, they'll need to be implemented
+		 * here */
For consistency:

	/* If we add new socket pifs, they'll need to be implemented
	 * here
	 */

there are a few occurrences in the next patches, not so important I guess, I can also do a pass later at some point.
+		ASSERT(0);
+	}
+}
+
 /** flow_log_ - Log flow-related message
  * @f:		flow the message is related to
  * @pri:	Log priority
diff --git a/flow.h b/flow.h
index c3a15ca6..e27f99be 100644
--- a/flow.h
+++ b/flow.h
@@ -164,6 +164,9 @@ static inline bool flowside_eq(const struct flowside *left,
 	       left->fport == right->fport;
 }

+int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
+		     const struct flowside *tgt, uint32_t data);
+
 /**
  * struct flow_common - Common fields for packet flows
  * @state:	State of the flow table entry
diff --git a/util.c b/util.c
index 9a73fbb9..f2994a79 100644
--- a/util.c
+++ b/util.c
@@ -44,9 +44,9 @@
  *
  * Return: newly created socket, negative error code on failure
  */
-static int sock_l4_sa(const struct ctx *c, enum epoll_type type,
-		      const void *sa, socklen_t sl,
-		      const char *ifname, bool v6only, uint32_t data)
+int sock_l4_sa(const struct ctx *c, enum epoll_type type,
+	       const void *sa, socklen_t sl,
+	       const char *ifname, bool v6only, uint32_t data)
 {
 	sa_family_t af = ((const struct sockaddr *)sa)->sa_family;
 	union epoll_ref ref = { .type = type, .data = data };
diff --git a/util.h b/util.h
index d0150396..f2e4f8cf 100644
--- a/util.h
+++ b/util.h
@@ -144,6 +144,9 @@ struct ctx;

 /* cppcheck-suppress funcArgNamesDifferent */
 __attribute__ ((weak)) int ffsl(long int i) { return __builtin_ffsl(i); }
+int sock_l4_sa(const struct ctx *c, enum epoll_type type,
+	       const void *sa, socklen_t sl,
+	       const char *ifname, bool v6only, uint32_t data);
 int sock_l4(const struct ctx *c, sa_family_t af, enum epoll_type type,
 	    const void *bind_addr, const char *ifname, uint16_t port,
 	    uint32_t data);
-- Stefano
On Wed, Jul 10, 2024 at 11:32:01PM +0200, Stefano Brivio wrote:
On Fri, 5 Jul 2024 12:07:12 +1000 David Gibson
wrote: We have upcoming use cases where it's useful to create a new bound socket based on information from the flow table. Add flowside_sock_l4() to do this for either PIF_HOST or PIF_SPLICE sockets.
Signed-off-by: David Gibson
--- flow.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ flow.h | 3 ++ util.c | 6 ++-- util.h | 3 ++ 4 files changed, 101 insertions(+), 3 deletions(-) diff --git a/flow.c b/flow.c index 6f09781a..2d0a8a32 100644 --- a/flow.c +++ b/flow.c @@ -5,9 +5,11 @@ * Tracking for logical "flows" of packets. */
+#include
#include #include #include +#include #include #include "util.h" @@ -143,6 +145,96 @@ static void flowside_from_af(struct flowside *fside, sa_family_t af, fside->eport = eport; }
+/** + * struct flowside_sock_args - Parameters for flowside_sock_splice() + * @c: Execution context + * @fd: Filled in with new socket fd + * @err: Filled in with errno if something failed + * @type: Socket epoll type + * @sa: Socket address + * @sl: Length of @sa + * @data: epoll reference data + */ +struct flowside_sock_args { + const struct ctx *c; + int fd; + int err; + enum epoll_type type; + const struct sockaddr *sa; + socklen_t sl; + const char *path; + uint32_t data; +}; + +/** flowside_sock_splice() - Create and bind socket for PIF_SPLICE based on flowside + * @arg: Argument as a struct flowside_sock_args + * + * Return: 0 + */ +static int flowside_sock_splice(void *arg) +{ + struct flowside_sock_args *a = arg; + + ns_enter(a->c); + + a->fd = sock_l4_sa(a->c, a->type, a->sa, a->sl, NULL,
Nit: assuming you wanted the extra whitespace here to align the assignment with the one of a->err below, I'd rather write this (at least for consistency) as "a->fd = ...".
Actually, I think it was just a boring old typo.
+ a->sa->sa_family == AF_INET6, a->data); + a->err = errno;
+ + return 0; +} + +/** flowside_sock_l4() - Create and bind socket based on flowside + * @c: Execution context + * @type: Socket epoll type + * @pif: Interface for this socket + * @tgt: Target flowside + * @data: epoll reference portion for protocol handlers + * + * Return: socket fd of protocol @proto bound to the forwarding address and port + * from @tgt (if specified). + */ +/* cppcheck-suppress unusedFunction */ +int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, + const struct flowside *tgt, uint32_t data) +{ + const char *ifname = NULL; + union sockaddr_inany sa; + socklen_t sl; + + ASSERT(pif_is_socket(pif)); + + pif_sockaddr(c, &sa, &sl, pif, &tgt->faddr, tgt->fport); + + switch (pif) { + case PIF_HOST: + if (inany_is_loopback(&tgt->faddr)) + ifname = NULL; + else if (sa.sa_family == AF_INET) + ifname = c->ip4.ifname_out; + else if (sa.sa_family == AF_INET6) + ifname = c->ip6.ifname_out; + + return sock_l4_sa(c, type, &sa, sl, ifname, + sa.sa_family == AF_INET6, data); + + case PIF_SPLICE: { + struct flowside_sock_args args = { + .c = c, .type = type, + .sa = &sa.sa, .sl = sl, .data = data, + }; + NS_CALL(flowside_sock_splice, &args); + errno = args.err; + return args.fd; + } + + default: + /* If we add new socket pifs, they'll need to be implemented + * here */
For consistency:
    /* If we add new socket pifs, they'll need to be implemented
     * here
     */
Done.
there are a few occurrences in the next patches, not so important I guess, I can also do a pass later at some point.
+ ASSERT(0); + } +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority diff --git a/flow.h b/flow.h index c3a15ca6..e27f99be 100644 --- a/flow.h +++ b/flow.h @@ -164,6 +164,9 @@ static inline bool flowside_eq(const struct flowside *left, left->fport == right->fport; }
+int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, + const struct flowside *tgt, uint32_t data); + /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry diff --git a/util.c b/util.c index 9a73fbb9..f2994a79 100644 --- a/util.c +++ b/util.c @@ -44,9 +44,9 @@ * * Return: newly created socket, negative error code on failure */ -static int sock_l4_sa(const struct ctx *c, enum epoll_type type, - const void *sa, socklen_t sl, - const char *ifname, bool v6only, uint32_t data) +int sock_l4_sa(const struct ctx *c, enum epoll_type type, + const void *sa, socklen_t sl, + const char *ifname, bool v6only, uint32_t data) { sa_family_t af = ((const struct sockaddr *)sa)->sa_family; union epoll_ref ref = { .type = type, .data = data }; diff --git a/util.h b/util.h index d0150396..f2e4f8cf 100644 --- a/util.h +++ b/util.h @@ -144,6 +144,9 @@ struct ctx;
/* cppcheck-suppress funcArgNamesDifferent */ __attribute__ ((weak)) int ffsl(long int i) { return __builtin_ffsl(i); } +int sock_l4_sa(const struct ctx *c, enum epoll_type type, + const void *sa, socklen_t sl, + const char *ifname, bool v6only, uint32_t data); int sock_l4(const struct ctx *c, sa_family_t af, enum epoll_type type, const void *bind_addr, const char *ifname, uint16_t port, uint32_t data);
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Wed, Jul 10, 2024 at 11:32:01PM +0200, Stefano Brivio wrote:
On Fri, 5 Jul 2024 12:07:12 +1000 David Gibson
wrote: We have upcoming use cases where it's useful to create a new bound socket based on information from the flow table. Add flowside_sock_l4() to do this for either PIF_HOST or PIF_SPLICE sockets.
Signed-off-by: David Gibson
[snip] + default: + /* If we add new socket pifs, they'll need to be implemented + * here */ For consistency:
    /* If we add new socket pifs, they'll need to be implemented
     * here
     */
there are a few occurrences in the next patches, not so important I guess, I can also do a pass later at some point.
I did some grepping, and I think I squashed the ones added by this series. There are also some instances that are already merged. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
For now when we forward a ping to the host we leave the host side
forwarding address and port blank since we don't necessarily know what
source address and id will be used by the kernel. When the outbound
address option is active, though, we do know the address at least, so we
can record it in the flowside.
Having done that, use it as the primary source of truth, binding the
outgoing socket based on the information in there. This allows the
possibility of more complex rules for what outbound address and/or id
we use in future.
To implement this we create a new helper which sets up a new socket based
on information in a flowside, which will also have future uses. It
behaves slightly differently from the existing ICMP code, in that it
doesn't bind to a specific interface if given a loopback address. This is
logically correct - the loopback address means we need to operate through
the host's loopback interface, not ifname_out. We didn't need it in ICMP
because ICMP will never generate a loopback address at this point, however
we intend to change that in future.
Signed-off-by: David Gibson
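The loopback special-casing described above (don't pin an outgoing socket to ifname_out when the target address is loopback) can be illustrated with a minimal, self-contained sketch. The struct and function names here are hypothetical stand-ins; the real code works on union inany_addr and the ctx fields ip4.ifname_out / ip6.ifname_out inside flowside_sock_l4():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified address: the real code uses union inany_addr */
struct addr {
	bool v6;
	bool loopback;
};

/* Pick the interface name to bind an outgoing socket to.  Loopback
 * traffic must go through the host's loopback interface, so no explicit
 * interface is selected for it; otherwise use the configured outbound
 * interface for the matching address family. */
static const char *pick_ifname(const struct addr *a,
			       const char *ifname4, const char *ifname6)
{
	if (a->loopback)
		return NULL;
	return a->v6 ? ifname6 : ifname4;
}
```

This mirrors the branch structure of the PIF_HOST case in the patch: loopback clears ifname, and the address family selects between the v4 and v6 outbound interface names.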
Currently the code to translate host side addresses and ports to guest side
addresses and ports, and vice versa, is scattered across the TCP code.
This includes both port redirection as controlled by the -t and -T options,
and our special case NAT controlled by the --no-map-gw option.
Gather this logic into fwd_nat_from_*() functions for each input
interface in fwd.c which take protocol and address information for the
initiating side and generate the pif and address information for the
forwarded side. This performs any NAT or port forwarding needed.
We create a flow_target() helper which applies those forwarding functions
as needed to automatically move a flow from INI to TGT state.
Signed-off-by: David Gibson
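The gateway-mapping NAT that these fwd_nat_from_*() functions centralise can be sketched as a small model. Addresses are plain uint32_t IPv4 values here, and the names are hypothetical; the real fwd_nat_from_tap() works on union inany_addr and struct flowside, and handles v6 as well:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GW_ADDR		0x0a000001u	/* 10.0.0.1, assumed gateway address */
#define LOOPBACK4	0x7f000001u	/* 127.0.0.1 */

/* Map a guest-visible target address to the host-side target address:
 * connections addressed to the gateway are redirected to the host's
 * loopback, unless the mapping is disabled (--no-map-gw). */
static uint32_t nat_from_tap(uint32_t eaddr, bool no_map_gw)
{
	if (!no_map_gw && eaddr == GW_ADDR)
		return LOOPBACK4;
	return eaddr;
}
```

The point of the patch is that this decision now happens in one place, at the moment a flow moves from INI to TGT state, rather than being repeated in protocol-specific code.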
Current ICMP hard codes its forwarding rules, and never applies any
translations. Change it to use the flow_target() function, so that
it's translated the same as TCP (excluding TCP specific port
redirection).
This means that gw mapping now applies to ICMP so "ping <gw address>" will
now ping the host's loopback instead of the actual gw machine. This
removes the surprising behaviour that the target you ping might not be the
same as you connect to with TCP.
This removes the last user of flow_target_af(), so that's removed as well.
Signed-off-by: David Gibson
Add logic to the fwd_nat_from_*() functions to forward UDP packets. The
logic here doesn't exactly match our current forwarding, since our current
forwarding has some very strange and buggy edge cases. Instead it's
attempting to replicate what appears to be the intended logic behind the
current forwarding.
Signed-off-by: David Gibson
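The delta-based port redirection that the diff below extends from TCP to UDP (tgt->eport += delta[tgt->eport]) can be modelled in isolation. This is a hedged sketch, not passt's actual table layout: delta[port] holds the 16-bit offset from the original port to the forwarded one, zero for ports that aren't remapped:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PORTS 65536

/* Apply a port-forwarding delta table: the forwarded port is the
 * original port plus its per-port offset, with 16-bit wraparound. */
static uint16_t fwd_port(const uint16_t *delta, uint16_t port)
{
	return (uint16_t)(port + delta[port]);
}
```

Storing deltas rather than absolute ports means an all-zero table is the identity mapping, so unconfigured ports need no special case.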
On Fri, 5 Jul 2024 12:07:16 +1000
David Gibson
Add logic to the fwd_nat_from_*() functions to forwarding UDP packets. The logic here doesn't exactly match our current forwarding, since our current forwarding has some very strange and buggy edge cases. Instead it's attempting to replicate what appears to be the intended logic behind the current forwarding.
Signed-off-by: David Gibson
--- fwd.c | 26 ++++++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/fwd.c b/fwd.c index 5731a536..4377de44 100644 --- a/fwd.c +++ b/fwd.c @@ -169,12 +169,15 @@ void fwd_scan_ports_init(struct ctx *c) uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, const struct flowside *ini, struct flowside *tgt) { - (void)proto; - tgt->eaddr = ini->faddr; tgt->eport = ini->fport;
- if (!c->no_map_gw) { + if (proto == IPPROTO_UDP && tgt->eport == 53) { + if (inany_equals4(&tgt->eaddr, &c->ip4.dns_match)) + tgt->eaddr = inany_from_v4(c->ip4.dns_host); + else if (inany_equals6(&tgt->eaddr, &c->ip6.dns_match)) + tgt->eaddr.a6 = c->ip6.dns_host; + } else if (!c->no_map_gw) {
There's a subtle difference here compared to the logic you dropped in 23/27 (udp_tap_handler()), which doesn't look correct to me. Earlier, with neither c->ip4.dns_match nor c->ip6.dns_match matching, we would let UDP traffic directed to port 53 be mapped to the host, if (!c->no_map_gw). That is, the logic was rather equivalent to this: if (proto == IPPROTO_UDP && tgt->eport == 53 && (inany_equals4(&tgt->eaddr, &c->ip4.dns_match) || inany_equals6(&tgt->eaddr, &c->ip6.dns_match)) { if (inany_equals4(&tgt->eaddr, &c->ip4.dns_match)) tgt->eaddr = inany_from_v4(c->ip4.dns_host); else if (inany_equals6(&tgt->eaddr, &c->ip6.dns_match)) tgt->eaddr.a6 = c->ip6.dns_host; } else if (!c->no_map_gw) { ... and I think we should maintain it, because if dns_match doesn't match, DNS traffic considerations shouldn't affect NAT decisions at all.
if (inany_equals4(&tgt->eaddr, &c->ip4.gw)) tgt->eaddr = inany_loopback4; else if (inany_equals6(&tgt->eaddr, &c->ip6.gw)) @@ -191,6 +194,10 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
/* Let the kernel pick a host side source port */ tgt->fport = 0; + if (proto == IPPROTO_UDP) { + /* But for UDP we preserve the source port */ + tgt->fport = ini->eport; + }
return PIF_HOST; } @@ -232,9 +239,14 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, tgt->eport = ini->fport; if (proto == IPPROTO_TCP) tgt->eport += c->tcp.fwd_out.delta[tgt->eport]; + else if (proto == IPPROTO_UDP) + tgt->eport += c->udp.fwd_out.f.delta[tgt->eport];
/* Let the kernel pick a host side source port */ tgt->fport = 0; + if (proto == IPPROTO_UDP) + /* But for UDP preserve the source port */ + tgt->fport = ini->eport;
return PIF_HOST; } @@ -256,20 +268,26 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, tgt->eport = ini->fport; if (proto == IPPROTO_TCP) tgt->eport += c->tcp.fwd_in.delta[tgt->eport]; + else if (proto == IPPROTO_UDP) + tgt->eport += c->udp.fwd_in.f.delta[tgt->eport];
if (c->mode == MODE_PASTA && inany_is_loopback(&ini->eaddr) && - proto == IPPROTO_TCP) { + (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */
/* Preserve the specific loopback address used, but let the * kernel pick a source port on the target side */ tgt->faddr = ini->eaddr; tgt->fport = 0; + if (proto == IPPROTO_UDP) + /* But for UDP preserve the source port */ + tgt->fport = ini->eport;
if (inany_v4(&ini->eaddr)) tgt->eaddr = inany_loopback4; else tgt->eaddr = inany_loopback6; + return PIF_SPLICE; }
-- Stefano
On Mon, Jul 08, 2024 at 11:26:55PM +0200, Stefano Brivio wrote:
On Fri, 5 Jul 2024 12:07:16 +1000 David Gibson
wrote: Add logic to the fwd_nat_from_*() functions to forwarding UDP packets. The logic here doesn't exactly match our current forwarding, since our current forwarding has some very strange and buggy edge cases. Instead it's attempting to replicate what appears to be the intended logic behind the current forwarding.
Signed-off-by: David Gibson
--- fwd.c | 26 ++++++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/fwd.c b/fwd.c index 5731a536..4377de44 100644 --- a/fwd.c +++ b/fwd.c @@ -169,12 +169,15 @@ void fwd_scan_ports_init(struct ctx *c) uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, const struct flowside *ini, struct flowside *tgt) { - (void)proto; - tgt->eaddr = ini->faddr; tgt->eport = ini->fport;
- if (!c->no_map_gw) { + if (proto == IPPROTO_UDP && tgt->eport == 53) { + if (inany_equals4(&tgt->eaddr, &c->ip4.dns_match)) + tgt->eaddr = inany_from_v4(c->ip4.dns_host); + else if (inany_equals6(&tgt->eaddr, &c->ip6.dns_match)) + tgt->eaddr.a6 = c->ip6.dns_host; + } else if (!c->no_map_gw) {
There's a subtle difference here compared to the logic you dropped in 23/27 (udp_tap_handler()), which doesn't look correct to me.
Earlier, with neither c->ip4.dns_match nor c->ip6.dns_match matching, we would let UDP traffic directed to port 53 be mapped to the host, if (!c->no_map_gw). That is, the logic was rather equivalent to this:
if (proto == IPPROTO_UDP && tgt->eport == 53 && (inany_equals4(&tgt->eaddr, &c->ip4.dns_match) || inany_equals6(&tgt->eaddr, &c->ip6.dns_match)) { if (inany_equals4(&tgt->eaddr, &c->ip4.dns_match)) tgt->eaddr = inany_from_v4(c->ip4.dns_host); else if (inany_equals6(&tgt->eaddr, &c->ip6.dns_match)) tgt->eaddr.a6 = c->ip6.dns_host; } else if (!c->no_map_gw) { ...
and I think we should maintain it, because if dns_match doesn't match, DNS traffic considerations shouldn't affect NAT decisions at all.
Good catch, I've adjusted that. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
This implements the first steps of tracking UDP packets with the flow table
rather than its own (buggy) set of port maps. Specifically we create flow
table entries for datagrams received from a socket (PIF_HOST or
PIF_SPLICE).
When splitting datagrams from sockets into batches, we group by the flow
as well as splicesrc. This may result in smaller batches, but makes things
easier down the line. We can re-optimise this later if necessary. For now
we don't do anything else with the flow, not even match reply packets to
the same flow.
Signed-off-by: David Gibson
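The expiry rule for these UDP pseudo-connections (drop the flow after UDP_CONN_TIMEOUT seconds without traffic, per the "UDP Flows" theory-of-operation comment in the patch) reduces to a one-line check. This sketch is simplified: the real udp_flow_timer() works on struct udp_flow and struct timespec, while here the last-traffic timestamp is just seconds:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

#define UDP_CONN_TIMEOUT 180	/* s, timeout used by the series */

/* A UDP pseudo-connection expires once no datagram has been seen on it
 * for UDP_CONN_TIMEOUT seconds in either direction. */
static bool udp_flow_expired(time_t last_traffic, time_t now)
{
	return now - last_traffic >= UDP_CONN_TIMEOUT;
}
```

Each datagram on the flow refreshes last_traffic (uflow->ts in the patch), so only genuinely idle pseudo-connections are reaped by the timer.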
Nits only, here:
On Fri, 5 Jul 2024 12:07:17 +1000
David Gibson
This implements the first steps of tracking UDP packets with the flow table rather than it's own (buggy) set of port maps. Specifically we create flow
its
table entries for datagrams received from a socket (PIF_HOST or PIF_SPLICE).
When splitting datagrams from sockets into batches, we group by the flow as well as splicesrc. This may result in smaller batches, but makes things easier down the line. We can re-optimise this later if necessary. For now we don't do anything else with the flow, not even match reply packets to the same flow.
Signed-off-by: David Gibson
--- Makefile | 2 +- flow.c | 31 ++++++++++ flow.h | 4 ++ flow_table.h | 14 +++++ udp.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++++-- udp_flow.h | 25 ++++++++ 6 files changed, 240 insertions(+), 5 deletions(-) create mode 100644 udp_flow.h diff --git a/Makefile b/Makefile index 09fc461d..92cbd5a6 100644 --- a/Makefile +++ b/Makefile @@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ - udp.h util.h + udp.h udp_flow.h util.h HEADERS = $(PASST_HEADERS) seccomp.h
C := \#include
\nstruct tcp_info x = { .tcpi_snd_wnd = 0 }; diff --git a/flow.c b/flow.c index 218033ae..0cb9495b 100644 --- a/flow.c +++ b/flow.c @@ -37,6 +37,7 @@ const char *flow_type_str[] = { [FLOW_TCP_SPLICE] = "TCP connection (spliced)", [FLOW_PING4] = "ICMP ping sequence", [FLOW_PING6] = "ICMPv6 ping sequence", + [FLOW_UDP] = "UDP flow", }; static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES, "flow_type_str[] doesn't match enum flow_type"); @@ -46,6 +47,7 @@ const uint8_t flow_proto[] = { [FLOW_TCP_SPLICE] = IPPROTO_TCP, [FLOW_PING4] = IPPROTO_ICMP, [FLOW_PING6] = IPPROTO_ICMPV6, + [FLOW_UDP] = IPPROTO_UDP, }; static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES, "flow_proto[] doesn't match enum flow_type"); @@ -700,6 +702,31 @@ flow_sidx_t flow_lookup_af(const struct ctx *c, return flowside_lookup(c, proto, pif, &fside); } +/** + * flow_lookup_sa() - Look up a flow given and endpoint socket address
s/and/an/
+ * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @esa: Socket address of the endpoint + * @fport: Forwarding port number + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport) +{ + struct flowside fside = {
And the "f" in "fside" stands for "forwarding"... I don't have any quick fix in mind, and it's _kind of_ clear anyway, but this makes me doubt a bit about the "forwarding" / "endpoint" choice of words.
+ .fport = fport, + }; + + inany_from_sockaddr(&fside.eaddr, &fside.eport, esa); + if (inany_v4(&fside.eaddr)) + fside.faddr = inany_any4; + else + fside.faddr = inany_any6;
The usual extra newline here?
+ return flowside_lookup(c, proto, pif, &fside); +} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context @@ -779,6 +806,10 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now) if (timer) closed = icmp_ping_timer(c, &flow->ping, now); break; + case FLOW_UDP: + if (timer) + closed = udp_flow_timer(c, &flow->udp, now); + break; default: /* Assume other flow types don't need any handling */ ; diff --git a/flow.h b/flow.h index e27f99be..3752e5ee 100644 --- a/flow.h +++ b/flow.h @@ -115,6 +115,8 @@ enum flow_type { FLOW_PING4, /* ICMPv6 echo requests from guest to host and matching replies back */ FLOW_PING6, + /* UDP pseudo-connection */ + FLOW_UDP,
FLOW_NUM_TYPES, }; @@ -238,6 +240,8 @@ flow_sidx_t flow_lookup_af(const struct ctx *c, uint8_t proto, uint8_t pif, sa_family_t af, const void *eaddr, const void *faddr, in_port_t eport, in_port_t fport); +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport);
union flow;
diff --git a/flow_table.h b/flow_table.h index 457f27b1..3fbc7c8d 100644 --- a/flow_table.h +++ b/flow_table.h @@ -9,6 +9,7 @@
#include "tcp_conn.h" #include "icmp_flow.h" +#include "udp_flow.h"
/** * struct flow_free_cluster - Information about a cluster of free entries @@ -35,6 +36,7 @@ union flow { struct tcp_tap_conn tcp; struct tcp_splice_conn tcp_splice; struct icmp_ping_flow ping; + struct udp_flow udp; };
/* Global Flow Table */ @@ -78,6 +80,18 @@ static inline union flow *flow_at_sidx(flow_sidx_t sidx) return FLOW(sidx.flow); }
+/** flow_sidx_opposite - Get the other side of the same flow
flow_sidx_opposite()
+ * @sidx: Flow & side index + * + * Return: sidx for the other side of the same flow as @sidx + */ +static inline flow_sidx_t flow_sidx_opposite(flow_sidx_t sidx) +{ + if (!flow_sidx_valid(sidx)) + return FLOW_SIDX_NONE;
Same here with the extra newline.
+ return (flow_sidx_t){.flow = sidx.flow, .side = !sidx.side}; +} + /** flow_sidx_t - Index of one side of a flow from common structure * @f: Common flow fields pointer * @side: Which side to refer to (0 or 1) diff --git a/udp.c b/udp.c index 6427b9ce..daf4fe26 100644 --- a/udp.c +++ b/udp.c @@ -15,6 +15,30 @@ /** * DOC: Theory of Operation * + * UDP Flows + * ========= + * + * UDP doesn't have true connections, but many protocols use a connection-like + * format. The flow is initiated by a client sending a datagram from a port of + * its choosing (usually ephemeral) to a specific port (usually well known) on a + * server. Both client and server address must be unicast. The server sends + * replies using the same addresses & ports with src/dest swapped. + * + * We track pseudo-connections of this type as flow table entries of type + * FLOW_UDP. We store the time of the last traffic on the flow in uflow->ts, + * and let the flow expire if there is no traffic for UDP_CONN_TIMEOUT seconds. + * + * NOTE: This won't handle multicast protocols, or some protocols with different + * port usage. We'll need specific logic if we want to handle those. + * + * "Listening" sockets + * =================== + * + * UDP doesn't use listen(), but we consider long term sockets which are allowed + * to create new flows "listening" by analogy with TCP. + * + * Port tracking + * ============= * * For UDP, a reduced version of port-based connection tracking is implemented * with two purposes: @@ -121,6 +145,7 @@ #include "tap.h" #include "pcap.h" #include "log.h" +#include "flow_table.h"
#define UDP_CONN_TIMEOUT 180 /* s, timeout for ephemeral or local bind */ #define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */ @@ -199,6 +224,7 @@ static struct ethhdr udp6_eth_hdr; * @taph: Tap backend specific header * @s_in: Source socket address, filled in by recvmmsg() * @splicesrc: Source port for splicing, or -1 if not spliceable + * @tosidx: sidx for the destination side of this datagram's flow */ static struct udp_meta_t { struct ipv6hdr ip6h; @@ -207,6 +233,7 @@ static struct udp_meta_t {
union sockaddr_inany s_in; int splicesrc; + flow_sidx_t tosidx; } #ifdef __AVX2__ __attribute__ ((aligned(32))) @@ -490,6 +517,115 @@ static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsghdr *mmh) return -1; }
+/** + * udp_at_sidx() - Get UDP specific flow at given sidx + * @sidx: Flow and side to retrieve + * + * Return: UDP specific flow at @sidx, or NULL of @sidx is invalid. Asserts if + * the flow at @sidx is not FLOW_UDP. + */ +struct udp_flow *udp_at_sidx(flow_sidx_t sidx) +{ + union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return NULL; + + ASSERT(flow->f.type == FLOW_UDP); + return &flow->udp; +} + +/* + * udp_flow_close() - Close and clean up UDP flow + * @c: Execution context + * @uflow: UDP flow + */ +static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) +{ + flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); +} + +/** + * udp_flow_new() - Common setup for a new UDP flow + * @c: Execution context + * @flow: Initiated flow + * @now: Timestamp + * + * Return: UDP specific flow, if successful, NULL on failure + */ +static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, + const struct timespec *now) +{ + const struct flowside *ini = &flow->f.side[INISIDE]; + struct udp_flow *uflow = NULL; + + if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) { + flow_dbg(flow, "Invalid endpoint to initiate UDP flow");
Do we risk making debug logs unusable if we see multicast traffic? Maybe this could be flow_trace() instead. -- Stefano
On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote:
Nits only, here:
On Fri, 5 Jul 2024 12:07:17 +1000 David Gibson
wrote: This implements the first steps of tracking UDP packets with the flow table rather than it's own (buggy) set of port maps. Specifically we create flow
its
Oops, fixed.
table entries for datagrams received from a socket (PIF_HOST or PIF_SPLICE).
When splitting datagrams from sockets into batches, we group by the flow as well as splicesrc. This may result in smaller batches, but makes things easier down the line. We can re-optimise this later if necessary. For now we don't do anything else with the flow, not even match reply packets to the same flow.
Signed-off-by: David Gibson
--- Makefile | 2 +- flow.c | 31 ++++++++++ flow.h | 4 ++ flow_table.h | 14 +++++ udp.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++++-- udp_flow.h | 25 ++++++++ 6 files changed, 240 insertions(+), 5 deletions(-) create mode 100644 udp_flow.h diff --git a/Makefile b/Makefile index 09fc461d..92cbd5a6 100644 --- a/Makefile +++ b/Makefile @@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ - udp.h util.h + udp.h udp_flow.h util.h HEADERS = $(PASST_HEADERS) seccomp.h
C := \#include
\nstruct tcp_info x = { .tcpi_snd_wnd = 0 }; diff --git a/flow.c b/flow.c index 218033ae..0cb9495b 100644 --- a/flow.c +++ b/flow.c @@ -37,6 +37,7 @@ const char *flow_type_str[] = { [FLOW_TCP_SPLICE] = "TCP connection (spliced)", [FLOW_PING4] = "ICMP ping sequence", [FLOW_PING6] = "ICMPv6 ping sequence", + [FLOW_UDP] = "UDP flow", }; static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES, "flow_type_str[] doesn't match enum flow_type"); @@ -46,6 +47,7 @@ const uint8_t flow_proto[] = { [FLOW_TCP_SPLICE] = IPPROTO_TCP, [FLOW_PING4] = IPPROTO_ICMP, [FLOW_PING6] = IPPROTO_ICMPV6, + [FLOW_UDP] = IPPROTO_UDP, }; static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES, "flow_proto[] doesn't match enum flow_type"); @@ -700,6 +702,31 @@ flow_sidx_t flow_lookup_af(const struct ctx *c, return flowside_lookup(c, proto, pif, &fside); } +/** + * flow_lookup_sa() - Look up a flow given and endpoint socket address
s/and/an/
Fixed.
+ * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @esa: Socket address of the endpoint + * @fport: Forwarding port number + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport) +{ + struct flowside fside = {
And the "f" in "fside" stands for "forwarding"... I don't have any quick fix in mind, and it's _kind of_ clear anyway, but this makes me doubt a bit about the "forwarding" / "endpoint" choice of words.
Heh, no, here "fside" is simply short for "flowside". Every flowside has both forwarding and endpoint elements. So it is confusing, but for a different reason. I need to find a different convention for naming struct flowside variables. I'd say 'side', but sometimes that's used for the 1-bit integer indicating which side in a flow. Hrm.. now that pif has been removed from here, maybe I could rename struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps?
+ .fport = fport, + }; + + inany_from_sockaddr(&fside.eaddr, &fside.eport, esa); + if (inany_v4(&fside.eaddr)) + fside.faddr = inany_any4; + else + fside.faddr = inany_any6;
The usual extra newline here?
Done.
+ return flowside_lookup(c, proto, pif, &fside); +} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context @@ -779,6 +806,10 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now) if (timer) closed = icmp_ping_timer(c, &flow->ping, now); break; + case FLOW_UDP: + if (timer) + closed = udp_flow_timer(c, &flow->udp, now); + break; default: /* Assume other flow types don't need any handling */ ; diff --git a/flow.h b/flow.h index e27f99be..3752e5ee 100644 --- a/flow.h +++ b/flow.h @@ -115,6 +115,8 @@ enum flow_type { FLOW_PING4, /* ICMPv6 echo requests from guest to host and matching replies back */ FLOW_PING6, + /* UDP pseudo-connection */ + FLOW_UDP,
FLOW_NUM_TYPES, }; @@ -238,6 +240,8 @@ flow_sidx_t flow_lookup_af(const struct ctx *c, uint8_t proto, uint8_t pif, sa_family_t af, const void *eaddr, const void *faddr, in_port_t eport, in_port_t fport); +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport);
union flow;
diff --git a/flow_table.h b/flow_table.h index 457f27b1..3fbc7c8d 100644 --- a/flow_table.h +++ b/flow_table.h @@ -9,6 +9,7 @@
#include "tcp_conn.h" #include "icmp_flow.h" +#include "udp_flow.h"
/** * struct flow_free_cluster - Information about a cluster of free entries @@ -35,6 +36,7 @@ union flow { struct tcp_tap_conn tcp; struct tcp_splice_conn tcp_splice; struct icmp_ping_flow ping; + struct udp_flow udp; };
/* Global Flow Table */ @@ -78,6 +80,18 @@ static inline union flow *flow_at_sidx(flow_sidx_t sidx) return FLOW(sidx.flow); }
+/** flow_sidx_opposite - Get the other side of the same flow
flow_sidx_opposite()
Done.
+ * @sidx: Flow & side index + * + * Return: sidx for the other side of the same flow as @sidx + */ +static inline flow_sidx_t flow_sidx_opposite(flow_sidx_t sidx) +{ + if (!flow_sidx_valid(sidx)) + return FLOW_SIDX_NONE;
Same here with the extra newline.
Done.
+ return (flow_sidx_t){.flow = sidx.flow, .side = !sidx.side}; +} + /** flow_sidx_t - Index of one side of a flow from common structure * @f: Common flow fields pointer * @side: Which side to refer to (0 or 1) diff --git a/udp.c b/udp.c index 6427b9ce..daf4fe26 100644 --- a/udp.c +++ b/udp.c @@ -15,6 +15,30 @@ /** * DOC: Theory of Operation * + * UDP Flows + * ========= + * + * UDP doesn't have true connections, but many protocols use a connection-like + * format. The flow is initiated by a client sending a datagram from a port of + * its choosing (usually ephemeral) to a specific port (usually well known) on a + * server. Both client and server address must be unicast. The server sends + * replies using the same addresses & ports with src/dest swapped. + * + * We track pseudo-connections of this type as flow table entries of type + * FLOW_UDP. We store the time of the last traffic on the flow in uflow->ts, + * and let the flow expire if there is no traffic for UDP_CONN_TIMEOUT seconds. + * + * NOTE: This won't handle multicast protocols, or some protocols with different + * port usage. We'll need specific logic if we want to handle those. + * + * "Listening" sockets + * =================== + * + * UDP doesn't use listen(), but we consider long term sockets which are allowed + * to create new flows "listening" by analogy with TCP. + * + * Port tracking + * ============= * * For UDP, a reduced version of port-based connection tracking is implemented * with two purposes: @@ -121,6 +145,7 @@ #include "tap.h" #include "pcap.h" #include "log.h" +#include "flow_table.h"
#define UDP_CONN_TIMEOUT 180 /* s, timeout for ephemeral or local bind */ #define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */ @@ -199,6 +224,7 @@ static struct ethhdr udp6_eth_hdr; * @taph: Tap backend specific header * @s_in: Source socket address, filled in by recvmmsg() * @splicesrc: Source port for splicing, or -1 if not spliceable + * @tosidx: sidx for the destination side of this datagram's flow */ static struct udp_meta_t { struct ipv6hdr ip6h; @@ -207,6 +233,7 @@ static struct udp_meta_t {
union sockaddr_inany s_in; int splicesrc; + flow_sidx_t tosidx; } #ifdef __AVX2__ __attribute__ ((aligned(32))) @@ -490,6 +517,115 @@ static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsghdr *mmh) return -1; }
+/**
+ * udp_at_sidx() - Get UDP specific flow at given sidx
+ * @sidx:	Flow and side to retrieve
+ *
+ * Return: UDP specific flow at @sidx, or NULL if @sidx is invalid.  Asserts if
+ *         the flow at @sidx is not FLOW_UDP.
+ */
+struct udp_flow *udp_at_sidx(flow_sidx_t sidx)
+{
+	union flow *flow = flow_at_sidx(sidx);
+
+	if (!flow)
+		return NULL;
+
+	ASSERT(flow->f.type == FLOW_UDP);
+	return &flow->udp;
+}
+
+/**
+ * udp_flow_close() - Close and clean up UDP flow
+ * @c:		Execution context
+ * @uflow:	UDP flow
+ */
+static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow)
+{
+	flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE));
+}
+
+/**
+ * udp_flow_new() - Common setup for a new UDP flow
+ * @c:		Execution context
+ * @flow:	Initiated flow
+ * @now:	Timestamp
+ *
+ * Return: sidx of the target side of the new UDP flow if successful,
+ *         FLOW_SIDX_NONE on failure
+ */
+static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
+				const struct timespec *now)
+{
+	const struct flowside *ini = &flow->f.side[INISIDE];
+	struct udp_flow *uflow = NULL;
+
+	if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) {
+		flow_dbg(flow, "Invalid endpoint to initiate UDP flow");
Do we risk making debug logs unusable if we see multicast traffic?
Um.. I'm not sure.
Maybe this could be flow_trace() instead.
Sure, why not.

-- 
David Gibson (he or they)		| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au		| minimalist, thank you, not the other way
					| around.
http://www.ozlabs.org/~dgibson
On Wed, 10 Jul 2024 09:59:08 +1000
David Gibson
On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote:
Nits only, here:
On Fri, 5 Jul 2024 12:07:17 +1000 David Gibson
wrote: [...]
+ * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @esa: Socket address of the endpoint + * @fport: Forwarding port number + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport) +{ + struct flowside fside = {
And the "f" in "fside" stands for "forwarding"... I don't have any quick fix in mind, and it's _kind of_ clear anyway, but this makes me doubt a bit about the "forwarding" / "endpoint" choice of words.
Heh, no, here "fside" is simply short for "flowside". Every flowside has both forwarding and endpoint elements.
Oh, I thought you called it fside here because you're setting the forwarding part of it directly, or something like that.
So it is confusing, but for a different reason. I need to find a different convention for naming struct flowside variables. I'd say 'side', but sometimes that's used for the 1-bit integer indicating which side in a flow.
Hrm.. now that pif has been removed from here, maybe I could rename struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps?
That's also confusing because it contains ports too (even though sure, in some sense they're part of the address). I would suggest keeping it like it is for this series, but after that, if it's not too long, what about flow_addrs_ports? Actually, I don't think flowside is that bad. What I'm struggling with is rather 'forwarding' and 'endpoint'. I don't have any good suggestion at the moment, anyway. Using 'local' and 'remote' (laddr/lport, raddr/rport) would be clearer to me and avoid the conflict with 'f' of flowside, but you had good reasons to avoid that, if I recall correctly. -- Stefano
On Wed, Jul 10, 2024 at 11:35:23PM +0200, Stefano Brivio wrote:
On Wed, 10 Jul 2024 09:59:08 +1000 David Gibson
wrote: On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote:
Nits only, here:
On Fri, 5 Jul 2024 12:07:17 +1000 David Gibson
wrote: [...]
+ * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @esa: Socket address of the endpoint + * @fport: Forwarding port number + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport) +{ + struct flowside fside = {
And the "f" in "fside" stands for "forwarding"... I don't have any quick fix in mind, and it's _kind of_ clear anyway, but this makes me doubt a bit about the "forwarding" / "endpoint" choice of words.
Heh, no, here "fside" is simply short for "flowside". Every flowside has both forwarding and endpoint elements.
Oh, I thought you called it fside here because you're setting the forwarding part of it directly, or something like that.
So it is confusing, but for a different reason. I need to find a different convention for naming struct flowside variables. I'd say 'side', but sometimes that's used for the 1-bit integer indicating which side in a flow.
Hrm.. now that pif has been removed from here, maybe I could rename struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps?
That's also confusing because it contains ports too (even though sure, in some sense they're part of the address).
Right :/.
I would suggest keeping it like it is in for this series, but after that, if it's not too long, what about flow_addrs_ports?
Still need a conventional name for the variables. "fap" probably isn't the best look, and still has the potentially confusing "f" in it.
Actually, I don't think flowside is that bad. What I'm struggling with is rather 'forwarding' and 'endpoint'. I don't have any good suggestion at the moment, anyway. Using 'local' and 'remote' (laddr/lport, raddr/rport) would be clearer to me and avoid the conflict with 'f' of flowside, but you had good reasons to avoid that, if I recall correctly.
Kind of, yeah. Local and remote are great when it's clear we're talking specifically from the point of view of the passt process. There are a bunch of cases where it's not necessarily obvious if we're talking from that point of view, the point of view of the guest, or the point of view of the host (the last usually when the endpoint is somewhere on an entirely different system). I wanted something that, wherever we were talking about it, is specifically relative to the passt process itself. I get the impression that "forwarding" is causing more trouble here than "endpoint". "midpoint address"? "intercepted address"? "redirected address" (probably not, that's 'r' like remote but this would be the local address)?
On Thu, 11 Jul 2024 14:26:38 +1000
David Gibson
On Wed, Jul 10, 2024 at 11:35:23PM +0200, Stefano Brivio wrote:
On Wed, 10 Jul 2024 09:59:08 +1000 David Gibson
wrote: On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote:
Nits only, here:
On Fri, 5 Jul 2024 12:07:17 +1000 David Gibson
wrote: [...]
+ * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @esa: Socket address of the endpoint + * @fport: Forwarding port number + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport) +{ + struct flowside fside = {
And the "f" in "fside" stands for "forwarding"... I don't have any quick fix in mind, and it's _kind of_ clear anyway, but this makes me doubt a bit about the "forwarding" / "endpoint" choice of words.
Heh, no, here "fside" is simply short for "flowside". Every flowside has both forwarding and endpoint elements.
Oh, I thought you called it fside here because you're setting the forwarding part of it directly, or something like that.
So it is confusing, but for a different reason. I need to find a different convention for naming struct flowside variables. I'd say 'side', but sometimes that's used for the 1-bit integer indicating which side in a flow.
Hrm.. now that pif has been removed from here, maybe I could rename struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps?
That's also confusing because it contains ports too (even though sure, in some sense they're part of the address).
Right :/.
I would suggest keeping it like it is in for this series, but after that, if it's not too long, what about flow_addrs_ports?
Still need a conventional name for the variables. "fap" probably isn't the best look, and still has the potentially confusing "f" in it.
Actually, I don't think flowside is that bad. What I'm struggling with is rather 'forwarding' and 'endpoint'. I don't have any good suggestion at the moment, anyway. Using 'local' and 'remote' (laddr/lport, raddr/rport) would be clearer to me and avoid the conflict with 'f' of flowside, but you had good reasons to avoid that, if I recall correctly.
Kind of, yeah. Local and remote are great when it's clear we're talking specifically from the point of view of the passt process. There are a bunch of cases where it's not necessarily obvious if we're talking from that point of view, the point of view of the guest, or the point of view of the host (the last usually when the endpoint is somewhere on an entirely different system). I wanted something that wherever we were talking about it is specifically relative to the passt process itself.
I get the impression that "forwarding" is causing more trouble here than "endpoint". "midpoint address"? "intercepted address"? "redirected address" (probably not, that's 'r' like remote but this would be the local address)?
I think "forwarding" is still better than any of those. Perhaps "passt address" (and paddr/pport)... but I'm not sure it's much better than "forwarding". -- Stefano
On Thu, Jul 11, 2024 at 10:20:01AM +0200, Stefano Brivio wrote:
On Thu, 11 Jul 2024 14:26:38 +1000 David Gibson
wrote: On Wed, Jul 10, 2024 at 11:35:23PM +0200, Stefano Brivio wrote:
On Wed, 10 Jul 2024 09:59:08 +1000 David Gibson
wrote: On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote:
Nits only, here:
On Fri, 5 Jul 2024 12:07:17 +1000 David Gibson
wrote: [...]
+ * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @esa: Socket address of the endpoint + * @fport: Forwarding port number + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport) +{ + struct flowside fside = {
And the "f" in "fside" stands for "forwarding"... I don't have any quick fix in mind, and it's _kind of_ clear anyway, but this makes me doubt a bit about the "forwarding" / "endpoint" choice of words.
Heh, no, here "fside" is simply short for "flowside". Every flowside has both forwarding and endpoint elements.
Oh, I thought you called it fside here because you're setting the forwarding part of it directly, or something like that.
So it is confusing, but for a different reason. I need to find a different convention for naming struct flowside variables. I'd say 'side', but sometimes that's used for the 1-bit integer indicating which side in a flow.
Hrm.. now that pif has been removed from here, maybe I could rename struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps?
That's also confusing because it contains ports too (even though sure, in some sense they're part of the address).
Right :/.
I would suggest keeping it like it is in for this series, but after that, if it's not too long, what about flow_addrs_ports?
Still need a conventional name for the variables. "fap" probably isn't the best look, and still has the potentially confusing "f" in it.
Actually, I don't think flowside is that bad. What I'm struggling with is rather 'forwarding' and 'endpoint'. I don't have any good suggestion at the moment, anyway. Using 'local' and 'remote' (laddr/lport, raddr/rport) would be clearer to me and avoid the conflict with 'f' of flowside, but you had good reasons to avoid that, if I recall correctly.
Kind of, yeah. Local and remote are great when it's clear we're talking specifically from the point of view of the passt process. There are a bunch of cases where it's not necessarily obvious if we're talking from that point of view, the point of view of the guest, or the point of view of the host (the last usually when the endpoint is somewhere on an entirely different system). I wanted something that wherever we were talking about it is specifically relative to the passt process itself.
I get the impression that "forwarding" is causing more trouble here than "endpoint". "midpoint address"? "intercepted address"? "redirected address" (probably not, that's 'r' like remote but this would be the local address)?
I think "forwarding" is still better than any of those. Perhaps "passt address" (and paddr/pport)... but I'm not sure it's much better than "forwarding".
Hm. "passthrough address"? Kind of means the same thing as "forwarding" in context, and maybe evokes the idea that this is the address that passt itself owns?
On Fri, 12 Jul 2024 08:58:57 +1000
David Gibson
On Thu, Jul 11, 2024 at 10:20:01AM +0200, Stefano Brivio wrote:
On Thu, 11 Jul 2024 14:26:38 +1000 David Gibson
wrote: On Wed, Jul 10, 2024 at 11:35:23PM +0200, Stefano Brivio wrote:
On Wed, 10 Jul 2024 09:59:08 +1000 David Gibson
wrote: On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote:
Nits only, here:
On Fri, 5 Jul 2024 12:07:17 +1000 David Gibson
wrote: > [...] > > + * @c: Execution context > + * @proto: Protocol of the flow (IP L4 protocol number) > + * @pif: Interface of the flow > + * @esa: Socket address of the endpoint > + * @fport: Forwarding port number > + * > + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found > + */ > +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, > + const void *esa, in_port_t fport) > +{ > + struct flowside fside = {
And the "f" in "fside" stands for "forwarding"... I don't have any quick fix in mind, and it's _kind of_ clear anyway, but this makes me doubt a bit about the "forwarding" / "endpoint" choice of words.
Heh, no, here "fside" is simply short for "flowside". Every flowside has both forwarding and endpoint elements.
Oh, I thought you called it fside here because you're setting the forwarding part of it directly, or something like that.
So it is confusing, but for a different reason. I need to find a different convention for naming struct flowside variables. I'd say 'side', but sometimes that's used for the 1-bit integer indicating which side in a flow.
Hrm.. now that pif has been removed from here, maybe I could rename struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps?
That's also confusing because it contains ports too (even though sure, in some sense they're part of the address).
Right :/.
I would suggest keeping it like it is in for this series, but after that, if it's not too long, what about flow_addrs_ports?
Still need a conventional name for the variables. "fap" probably isn't the best look, and still has the potentially confusing "f" in it.
Actually, I don't think flowside is that bad. What I'm struggling with is rather 'forwarding' and 'endpoint'. I don't have any good suggestion at the moment, anyway. Using 'local' and 'remote' (laddr/lport, raddr/rport) would be clearer to me and avoid the conflict with 'f' of flowside, but you had good reasons to avoid that, if I recall correctly.
Kind of, yeah. Local and remote are great when it's clear we're talking specifically from the point of view of the passt process. There are a bunch of cases where it's not necessarily obvious if we're talking from that point of view, the point of view of the guest, or the point of view of the host (the last usually when the endpoint is somewhere on an entirely different system). I wanted something that wherever we were talking about it is specifically relative to the passt process itself.
I get the impression that "forwarding" is causing more trouble here than "endpoint". "midpoint address"? "intercepted address"? "redirected address" (probably not, that's 'r' like remote but this would be the local address)?
I think "forwarding" is still better than any of those. Perhaps "passt address" (and paddr/pport)... but I'm not sure it's much better than "forwarding".
Hm. "passthrough address"? Kind of means the same thing as "forwarding" in context, and maybe evokes the idea that this is the address that passt itself owns?
I'd still prefer "forwarding" to that... at least I almost got used to it, "passthrough" is even more confusing because not everything really passes... through? The part I'm missing (with any of those) is "here" vs. "there", that is, a reference to the "where" and to the topology instead of what that side does. Again, I would just leave that as "forwarding" in this series, and then change it everywhere if we come up with something nicer. -- Stefano
On Fri, Jul 12, 2024 at 10:21:41AM +0200, Stefano Brivio wrote:
On Fri, 12 Jul 2024 08:58:57 +1000 David Gibson
wrote: On Thu, Jul 11, 2024 at 10:20:01AM +0200, Stefano Brivio wrote:
On Thu, 11 Jul 2024 14:26:38 +1000 David Gibson
wrote: On Wed, Jul 10, 2024 at 11:35:23PM +0200, Stefano Brivio wrote:
On Wed, 10 Jul 2024 09:59:08 +1000 David Gibson
wrote: On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote: > Nits only, here: > > On Fri, 5 Jul 2024 12:07:17 +1000 > David Gibson
wrote: > > > [...] > > > > + * @c: Execution context > > + * @proto: Protocol of the flow (IP L4 protocol number) > > + * @pif: Interface of the flow > > + * @esa: Socket address of the endpoint > > + * @fport: Forwarding port number > > + * > > + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found > > + */ > > +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, > > + const void *esa, in_port_t fport) > > +{ > > + struct flowside fside = { > > And the "f" in "fside" stands for "forwarding"... I don't have any > quick fix in mind, and it's _kind of_ clear anyway, but this makes me > doubt a bit about the "forwarding" / "endpoint" choice of words. Heh, no, here "fside" is simply short for "flowside". Every flowside has both forwarding and endpoint elements.
Oh, I thought you called it fside here because you're setting the forwarding part of it directly, or something like that.
So it is confusing, but for a different reason. I need to find a different convention for naming struct flowside variables. I'd say 'side', but sometimes that's used for the 1-bit integer indicating which side in a flow.
Hrm.. now that pif has been removed from here, maybe I could rename struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps?
That's also confusing because it contains ports too (even though sure, in some sense they're part of the address).
Right :/.
I would suggest keeping it like it is in for this series, but after that, if it's not too long, what about flow_addrs_ports?
Still need a conventional name for the variables. "fap" probably isn't the best look, and still has the potentially confusing "f" in it.
Actually, I don't think flowside is that bad. What I'm struggling with is rather 'forwarding' and 'endpoint'. I don't have any good suggestion at the moment, anyway. Using 'local' and 'remote' (laddr/lport, raddr/rport) would be clearer to me and avoid the conflict with 'f' of flowside, but you had good reasons to avoid that, if I recall correctly.
Kind of, yeah. Local and remote are great when it's clear we're talking specifically from the point of view of the passt process. There are a bunch of cases where it's not necessarily obvious if we're talking from that point of view, the point of view of the guest, or the point of view of the host (the last usually when the endpoint is somewhere on an entirely different system). I wanted something that wherever we were talking about it is specifically relative to the passt process itself.
I get the impression that "forwarding" is causing more trouble here than "endpoint". "midpoint address"? "intercepted address"? "redirected address" (probably not, that's 'r' like remote but this would be the local address)?
I think "forwarding" is still better than any of those. Perhaps "passt address" (and paddr/pport)... but I'm not sure it's much better than "forwarding".
Hm. "passthrough address"? Kind of means the same thing as "forwarding" in context, and maybe evokes the idea that this is the address that passt itself owns?
I'd still prefer "forwarding" to that... at least I almost got used to it, "passthrough" is even more confusing because not everything really passes... through?
Yeah, that's fair.
The part I'm missing (with any of those) is "here" vs. "there", that is, a reference to the "where" and to the topology instead of what that side does.
I don't really follow what you mean.
Again, I would just leave that as "forwarding" in this series, and then change it everywhere if we come up with something nicer.
Seems reasonable. It'd still be nice to have an alternative to 'fside' that doesn't have the confusing 'f'.
On Mon, 15 Jul 2024 14:06:45 +1000
David Gibson
On Fri, Jul 12, 2024 at 10:21:41AM +0200, Stefano Brivio wrote:
On Fri, 12 Jul 2024 08:58:57 +1000 David Gibson
wrote: On Thu, Jul 11, 2024 at 10:20:01AM +0200, Stefano Brivio wrote:
On Thu, 11 Jul 2024 14:26:38 +1000 David Gibson
wrote: On Wed, Jul 10, 2024 at 11:35:23PM +0200, Stefano Brivio wrote:
On Wed, 10 Jul 2024 09:59:08 +1000 David Gibson
wrote: > On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote: > > Nits only, here: > > > > On Fri, 5 Jul 2024 12:07:17 +1000 > > David Gibson
wrote: > > > > > [...] > > > > > > + * @c: Execution context > > > + * @proto: Protocol of the flow (IP L4 protocol number) > > > + * @pif: Interface of the flow > > > + * @esa: Socket address of the endpoint > > > + * @fport: Forwarding port number > > > + * > > > + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found > > > + */ > > > +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, > > > + const void *esa, in_port_t fport) > > > +{ > > > + struct flowside fside = { > > > > And the "f" in "fside" stands for "forwarding"... I don't have any > > quick fix in mind, and it's _kind of_ clear anyway, but this makes me > > doubt a bit about the "forwarding" / "endpoint" choice of words. > > Heh, no, here "fside" is simply short for "flowside". Every flowside > has both forwarding and endpoint elements. Oh, I thought you called it fside here because you're setting the forwarding part of it directly, or something like that.
> So it is confusing, but > for a different reason. I need to find a different convention for > naming struct flowside variables. I'd say 'side', but sometimes > that's used for the 1-bit integer indicating which side in a flow. > > Hrm.. now that pif has been removed from here, maybe I could rename > struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps?
That's also confusing because it contains ports too (even though sure, in some sense they're part of the address).
Right :/.
I would suggest keeping it like it is in for this series, but after that, if it's not too long, what about flow_addrs_ports?
Still need a conventional name for the variables. "fap" probably isn't the best look, and still has the potentially confusing "f" in it.
Actually, I don't think flowside is that bad. What I'm struggling with is rather 'forwarding' and 'endpoint'. I don't have any good suggestion at the moment, anyway. Using 'local' and 'remote' (laddr/lport, raddr/rport) would be clearer to me and avoid the conflict with 'f' of flowside, but you had good reasons to avoid that, if I recall correctly.
Kind of, yeah. Local and remote are great when it's clear we're talking specifically from the point of view of the passt process. There are a bunch of cases where it's not necessarily obvious if we're talking from that point of view, the point of view of the guest, or the point of view of the host (the last usually when the endpoint is somewhere on an entirely different system). I wanted something that wherever we were talking about it is specifically relative to the passt process itself.
I get the impression that "forwarding" is causing more trouble here than "endpoint". "midpoint address"? "intercepted address"? "redirected address" (probably not, that's 'r' like remote but this would be the local address)?
I think "forwarding" is still better than any of those. Perhaps "passt address" (and paddr/pport)... but I'm not sure it's much better than "forwarding".
Hm. "passthrough address"? Kind of means the same thing as "forwarding" in context, and maybe evokes the idea that this is the address that passt itself owns?
I'd still prefer "forwarding" to that... at least I almost got used to it, "passthrough" is even more confusing because not everything really passes... through?
Yeah, that's fair.
The part I'm missing (with any of those) is "here" vs. "there", that is, a reference to the "where" and to the topology instead of what that side does.
I don't really follow what you mean.
What makes "forwarding" and "passthrough" not so obvious to me is that they don't carry the information of _where_ in the network diagram the side is (such as "local" and "remote" would do), but rather what it does ("forward packets", "pass data through"). From passt and pasta's perspective, the "forwarding" side could be called the "here" side, and the endpoint could be called the "there" side. Those names are not great for other reasons though, just think of the sentence "there is the 'there' side and the 'here' side", or "here we deal with the 'there' side". So at the moment I don't have any better suggestion in that sense. Telecommunication standards frequently use "near end" and "far end", but I guess you would still have a similar issue as with local/remote, and also "far" starts with "f".
Again, I would just leave that as "forwarding" in this series, and then change it everywhere if we come up with something nicer.
Seems reasonable. It'd still be nice to have an alternative to 'fside' that doesn't have the confusing 'f'.
-- Stefano
On Mon, Jul 15, 2024 at 06:37:39PM +0200, Stefano Brivio wrote:
On Mon, 15 Jul 2024 14:06:45 +1000 David Gibson
wrote: On Fri, Jul 12, 2024 at 10:21:41AM +0200, Stefano Brivio wrote:
On Fri, 12 Jul 2024 08:58:57 +1000 David Gibson
wrote: On Thu, Jul 11, 2024 at 10:20:01AM +0200, Stefano Brivio wrote:
On Thu, 11 Jul 2024 14:26:38 +1000 David Gibson
wrote: On Wed, Jul 10, 2024 at 11:35:23PM +0200, Stefano Brivio wrote: > On Wed, 10 Jul 2024 09:59:08 +1000 > David Gibson
wrote: > > > On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote: > > > Nits only, here: > > > > > > On Fri, 5 Jul 2024 12:07:17 +1000 > > > David Gibson wrote: > > > > > > > [...] > > > > > > > > + * @c: Execution context > > > > + * @proto: Protocol of the flow (IP L4 protocol number) > > > > + * @pif: Interface of the flow > > > > + * @esa: Socket address of the endpoint > > > > + * @fport: Forwarding port number > > > > + * > > > > + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found > > > > + */ > > > > +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, > > > > + const void *esa, in_port_t fport) > > > > +{ > > > > + struct flowside fside = { > > > > > > And the "f" in "fside" stands for "forwarding"... I don't have any > > > quick fix in mind, and it's _kind of_ clear anyway, but this makes me > > > doubt a bit about the "forwarding" / "endpoint" choice of words. > > > > Heh, no, here "fside" is simply short for "flowside". Every flowside > > has both forwarding and endpoint elements. > > Oh, I thought you called it fside here because you're setting the > forwarding part of it directly, or something like that. > > > So it is confusing, but > > for a different reason. I need to find a different convention for > > naming struct flowside variables. I'd say 'side', but sometimes > > that's used for the 1-bit integer indicating which side in a flow. > > > > Hrm.. now that pif has been removed from here, maybe I could rename > > struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps? > > That's also confusing because it contains ports too (even though sure, > in some sense they're part of the address). Right :/.
> I would suggest keeping it > like it is in for this series, but after that, if it's not too long, > what about flow_addrs_ports?
Still need a conventional name for the variables. "fap" probably isn't the best look, and still has the potentially confusing "f" in it.
> Actually, I don't think flowside is that bad. What I'm struggling with > is rather 'forwarding' and 'endpoint'. I don't have any good suggestion > at the moment, anyway. Using 'local' and 'remote' (laddr/lport, > raddr/rport) would be clearer to me and avoid the conflict with 'f' of > flowside, but you had good reasons to avoid that, if I recall correctly.
Kind of, yeah. Local and remote are great when it's clear we're talking specifically from the point of view of the passt process. There are a bunch of cases where it's not necessarily obvious if we're talking from that point of view, the point of view of the guest, or the point of view of the host (the last usually when the endpoint is somewhere on an entirely different system). I wanted something that wherever we were talking about it is specifically relative to the passt process itself.
I get the impression that "forwarding" is causing more trouble here than "endpoint". "midpoint address"? "intercepted address"? "redirected address" (probably not, that's 'r' like remote but this would be the local address)?
I think "forwarding" is still better than any of those. Perhaps "passt address" (and paddr/pport)... but I'm not sure it's much better than "forwarding".
Hm. "passthrough address"? Kind of means the same thing as "forwarding" in context, and maybe evokes the idea that this is the address that passt itself owns?
I'd still prefer "forwarding" to that... at least I almost got used to it, "passthrough" is even more confusing because not everything really passes... through?
Yeah, that's fair.
The part I'm missing (with any of those) is "here" vs. "there", that is, a reference to the "where" and to the topology instead of what that side does.
I don't really follow what you mean.
What makes "forwarding" and "passthrough" not so obvious to me is that they don't carry the information of _where_ in the network diagram the side is (such as "local" and "remote" would do), but rather what it does ("forward packets", "pass data through").
Ah, I see. Fwiw, the reasoning here is that these are the addresses that belong to the thing doing the forwarding - i.e. passt/pasta.
From passt and pasta's perspective, the "forwarding" side could be called the "here" side, and the endpoint could be called the "there" side.
Those names are not great for other reasons though, just think of the sentence "there is the 'there' side and the 'here' side", or "here we deal with the 'there' side". So at the moment I don't have any better suggestion in that sense.
And in addition to that they share the problem with local/remote and near end/far end that they only make sense if your frame of reference is already passt/pasta's perspective.
Telecommunication standards frequently use "near end" and "far end", but I guess you would still have a similar issue as with local/remote, and also "far" starts with "f".
If we're going to assume we're talking from the perspective of the passt/pasta process, I don't think we'll do better than "local" and "remote". What I was aiming for with "forwarding" and "passthrough" was something unambiguous from "whole system" view encompassing the guest, the other endpoint and passt in between. Maybe that's not worth it, and we would be better with "local" and "remote", and add extra verbiage in the cases where it might seem like we're talking from the guest's view.
Again, I would just leave that as "forwarding" in this series, and then change it everywhere if we come up with something nicer.
Seems reasonable. It'd still be nice to have an alternative to 'fside' that doesn't have the confusing 'f'.
-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.  http://www.ozlabs.org/~dgibson
When forwarding a datagram to a socket, we need to find a socket with a
suitable local address to send it. Currently we keep track of such sockets
in an array indexed by local port, but this can't properly handle cases
where we have multiple local addresses in active use.
For "spliced" (socket to socket) cases, improve this by instead opening
a socket specifically for the target side of the flow. We connect() as
well as bind()ing that socket, so that it will only receive the flow's
reply packets, not anything else. We direct datagrams sent via that socket
using the addresses from the flow table, effectively replacing bespoke
addressing logic with the unified logic in fwd.c
When we create the flow, we also take a duplicate of the originating
socket, and use that to deliver reply datagrams back to the origin, again
using addresses from the flow table entry.
Signed-off-by: David Gibson
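The bind()+connect() pattern described in the commit message can be sketched standalone. This is illustrative only: the function and parameter names here are not passt's actual helpers, and real flow addresses replace the placeholder sockaddrs.

```c
/* Sketch of a flow-specific "reply" socket: binding fixes our local
 * address for the flow, and connect()ing means the kernel will only
 * deliver datagrams from that one peer to this socket, so no hash
 * lookup is needed to match received datagrams to the flow.
 * flow_reply_sock() is an illustrative name, not passt's API. */
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a UDP socket that will only see traffic for one flow,
 * or return -1 on failure */
int flow_reply_sock(const struct sockaddr_in *local,
		    const struct sockaddr_in *peer)
{
	int s = socket(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, IPPROTO_UDP);

	if (s < 0)
		return -1;

	/* bind() fixes the local address/port for the flow... */
	if (bind(s, (const struct sockaddr *)local, sizeof(*local)) < 0 ||
	    /* ...connect() restricts received datagrams to this peer */
	    connect(s, (const struct sockaddr *)peer, sizeof(*peer)) < 0) {
		close(s);
		return -1;
	}
	return s;
}
```

For UDP, connect() only records the peer address, so it succeeds immediately without any traffic being exchanged; filtering of unrelated datagrams then happens in the kernel.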
On Fri, 5 Jul 2024 12:07:18 +1000, David Gibson wrote:
When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use.
For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c
When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry.
Signed-off-by: David Gibson
---
 epoll_type.h |   2 +
 flow.c       |  20 +++
 flow.h       |   2 +
 flow_table.h |  14 ++
 passt.c      |   4 +
 udp.c        | 403 ++++++++++++++++++---------------------------------
 udp.h        |   6 +-
 udp_flow.h   |   2 +
 util.c       |   1 +
 9 files changed, 194 insertions(+), 260 deletions(-)

diff --git a/epoll_type.h b/epoll_type.h
index b6c04199..7a752ed1 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -22,6 +22,8 @@ enum epoll_type {
 	EPOLL_TYPE_TCP_TIMER,
 	/* UDP sockets */
 	EPOLL_TYPE_UDP,
+	/* UDP socket for replies on a specific flow */
+	EPOLL_TYPE_UDP_REPLY,
 	/* ICMP/ICMPv6 ping sockets */
 	EPOLL_TYPE_PING,
 	/* inotify fd watching for end of netns (pasta) */
diff --git a/flow.c b/flow.c
index 0cb9495b..2e100ddb 100644
--- a/flow.c
+++ b/flow.c
@@ -236,6 +236,26 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 	}
 }
+/** flowside_connect() - Connect a socket based on flowside
+ * @c:		Execution context
+ * @s:		Socket to connect
+ * @pif:	Target pif
+ * @tgt:	Target flowside
+ *
+ * Connect @s to the endpoint address and port from @tgt.
+ *
+ * Return: 0 on success, negative on error
+ */
+int flowside_connect(const struct ctx *c, int s,
+		     uint8_t pif, const struct flowside *tgt)
+{
+	union sockaddr_inany sa;
+	socklen_t sl;
+
+	pif_sockaddr(c, &sa, &sl, pif, &tgt->eaddr, tgt->eport);
+	return connect(s, &sa.sa, sl);
+}
+
 /** flow_log_ - Log flow-related message
  * @f:		flow the message is related to
  * @pri:	Log priority
diff --git a/flow.h b/flow.h
index 3752e5ee..3f65ceb9 100644
--- a/flow.h
+++ b/flow.h
@@ -168,6 +168,8 @@ static inline bool flowside_eq(const struct flowside *left,
 int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 		     const struct flowside *tgt, uint32_t data);
+int flowside_connect(const struct ctx *c, int s,
+		     uint8_t pif, const struct flowside *tgt);
 /**
  * struct flow_common - Common fields for packet flows
diff --git a/flow_table.h b/flow_table.h
index 3fbc7c8d..1faac4a7 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -92,6 +92,20 @@ static inline flow_sidx_t flow_sidx_opposite(flow_sidx_t sidx)
 	return (flow_sidx_t){.flow = sidx.flow, .side = !sidx.side};
 }
+/** pif_at_sidx - Interface for a given flow and side
pif_at_sidx()
+ * @sidx:	Flow & side index
+ *
+ * Return: pif for the flow & side given by @sidx
+ */
+static inline uint8_t pif_at_sidx(flow_sidx_t sidx)
+{
+	const union flow *flow = flow_at_sidx(sidx);
+
+	if (!flow)
+		return PIF_NONE;
+	return flow->f.pif[sidx.side];
+}
+
 /** flow_sidx_t - Index of one side of a flow from common structure
  * @f:		Common flow fields pointer
  * @side:	Which side to refer to (0 or 1)
diff --git a/passt.c b/passt.c
index e4d45daa..f9405bee 100644
--- a/passt.c
+++ b/passt.c
@@ -67,6 +67,7 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_TCP_LISTEN]		= "listening TCP socket",
 	[EPOLL_TYPE_TCP_TIMER]		= "TCP timer",
 	[EPOLL_TYPE_UDP]		= "UDP socket",
+	[EPOLL_TYPE_UDP_REPLY]		= "UDP reply socket",
 	[EPOLL_TYPE_PING]		= "ICMP/ICMPv6 ping socket",
 	[EPOLL_TYPE_NSQUIT_INOTIFY]	= "namespace inotify watch",
 	[EPOLL_TYPE_NSQUIT_TIMER]	= "namespace timer watch",
@@ -349,6 +350,9 @@ loop:
 		case EPOLL_TYPE_UDP:
 			udp_buf_sock_handler(&c, ref, eventmask, &now);
 			break;
+		case EPOLL_TYPE_UDP_REPLY:
+			udp_reply_sock_handler(&c, ref, eventmask, &now);
+			break;
 		case EPOLL_TYPE_PING:
 			icmp_sock_handler(&c, ref);
 			break;
diff --git a/udp.c b/udp.c
index daf4fe26..f4c696db 100644
--- a/udp.c
+++ b/udp.c
@@ -35,7 +35,31 @@
  * ===================
  *
  * UDP doesn't use listen(), but we consider long term sockets which are allowed
- * to create new flows "listening" by analogy with TCP.
+ * to create new flows "listening" by analogy with TCP. This listening socket
+ * could receive packets from multiple flows, so we use a hash table match to
+ * find the specific flow for a datagram.
+ *
+ * When a UDP flow is initiated from a listening socket we take a duplicate of
These are always "inbound" flows, right?
+ * the socket and store it in uflow->s[INISIDE]. This will last for the
+ * lifetime of the flow, even if the original listening socket is closed due to
+ * port auto-probing. The duplicate is used to deliver replies back to the
+ * originating side.
+ *
+ * Reply sockets
+ * =============
+ *
+ * When a UDP flow targets a socket, we create a "reply" socket in
...and these are outbound ones?
+ * uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive
+ * replies on the target side. This socket is both bound and connected and has
+ * EPOLL_TYPE_UDP_REPLY. The connect() means it will only receive datagrams
+ * associated with this flow, so the epoll reference directly points to the flow
+ * and we don't need a hash lookup.
+ *
+ * NOTE: it's possible that the reply socket could have a bound address
+ * overlapping with an unrelated listening socket. We assume datagrams for the
+ * flow will come to the reply socket in preference to a listening socket. The
+ * sample program contrib/udp-reuseaddr/reuseaddr-priority.c documents and tests
Now it's under doc/platform-requirements/
+ * that assumption.
  *
  * Port tracking
  * =============
@@ -56,62 +80,6 @@
  *
  * Packets are forwarded back and forth, by prepending and stripping UDP headers
  * in the obvious way, with no port translation.
- *
- * In PASTA mode, the L2-L4 translation is skipped for connections to ports
- * bound between namespaces using the loopback interface, messages are directly
- * transferred between L4 sockets instead. These are called spliced connections
- * for consistency with the TCP implementation, but the splice() syscall isn't
- * actually used as it wouldn't make sense for datagram-based connections: a
- * pair of recvmmsg() and sendmmsg() deals with this case.
I think we should keep this paragraph... or did you move/add this information somewhere else I missed? Given that the "Port tracking" section is going to be away, maybe this could be moved at the end of "UDP Flows".
- * - * The connection tracking for PASTA mode is slightly complicated by the absence - * of actual connections, see struct udp_splice_port, and these examples: - * - * - from init to namespace: - * - * - forward direction: 127.0.0.1:5000 -> 127.0.0.1:80 in init from socket s, - * with epoll reference: index = 80, splice = 1, orig = 1, ns = 0 - * - if udp_splice_ns[V4][5000].sock: - * - send packet to udp_splice_ns[V4][5000].sock, with destination port - * 80 - * - otherwise: - * - create new socket udp_splice_ns[V4][5000].sock - * - bind in namespace to 127.0.0.1:5000 - * - add to epoll with reference: index = 5000, splice = 1, orig = 0, - * ns = 1 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - * - reverse direction: 127.0.0.1:80 -> 127.0.0.1:5000 in namespace socket s, - * having epoll reference: index = 5000, splice = 1, orig = 0, ns = 1 - * - if udp_splice_init[V4][80].sock: - * - send to udp_splice_init[V4][80].sock, with destination port 5000 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - otherwise, discard - * - * - from namespace to init: - * - * - forward direction: 127.0.0.1:2000 -> 127.0.0.1:22 in namespace from - * socket s, with epoll reference: index = 22, splice = 1, orig = 1, ns = 1 - * - if udp4_splice_init[V4][2000].sock: - * - send packet to udp_splice_init[V4][2000].sock, with destination - * port 22 - * - otherwise: - * - create new socket udp_splice_init[V4][2000].sock - * - bind in init to 127.0.0.1:2000 - * - add to epoll with reference: index = 2000, splice = 1, orig = 0, - * ns = 0 - * - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - * - reverse direction: 127.0.0.1:22 -> 127.0.0.1:2000 in init from socket s, - * having epoll reference: index = 2000, splice = 1, orig = 0, ns = 0 - * - if udp_splice_ns[V4][22].sock: - * - send to udp_splice_ns[V4][22].sock, with destination port 2000 - * - 
update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - otherwise, discard */
#include
@@ -134,6 +102,7 @@ #include #include #include +#include #include "checksum.h" #include "util.h" @@ -223,7 +192,6 @@ static struct ethhdr udp6_eth_hdr; * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr) * @taph: Tap backend specific header * @s_in: Source socket address, filled in by recvmmsg() - * @splicesrc: Source port for splicing, or -1 if not spliceable * @tosidx: sidx for the destination side of this datagram's flow */ static struct udp_meta_t { @@ -232,7 +200,6 @@ static struct udp_meta_t { struct tap_hdr taph;
union sockaddr_inany s_in; - int splicesrc; flow_sidx_t tosidx; } #ifdef __AVX2__ @@ -270,7 +237,6 @@ static struct mmsghdr udp_mh_splice [UDP_MAX_FRAMES]; /* IOVs for L2 frames */ static struct iovec udp_l2_iov [UDP_MAX_FRAMES][UDP_NUM_IOVS];
- /** * udp_portmap_clear() - Clear UDP port map before configuration */ @@ -383,140 +349,6 @@ static void udp_iov_init(const struct ctx *c) udp_iov_init_one(c, i); }
-/** - * udp_splice_new() - Create and prepare socket for "spliced" binding - * @c: Execution context - * @v6: Set for IPv6 sockets - * @src: Source port of original connection, host order - * @ns: Does the splice originate in the ns or not - * - * Return: prepared socket, negative error code on failure - * - * #syscalls:pasta getsockname - */ -int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) -{ - struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP | EPOLLHUP }; - union epoll_ref ref = { .type = EPOLL_TYPE_UDP, - .udp = { .splice = true, .v6 = v6, .port = src } - }; - struct udp_splice_port *sp; - int act, s; - - if (ns) { - ref.udp.pif = PIF_SPLICE; - sp = &udp_splice_ns[v6 ? V6 : V4][src]; - act = UDP_ACT_SPLICE_NS; - } else { - ref.udp.pif = PIF_HOST; - sp = &udp_splice_init[v6 ? V6 : V4][src]; - act = UDP_ACT_SPLICE_INIT; - } - - s = socket(v6 ? AF_INET6 : AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, - IPPROTO_UDP); - - if (s > FD_REF_MAX) { - close(s); - return -EIO; - } - - if (s < 0) - return s; - - ref.fd = s; - - if (v6) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(src), - .sin6_addr = IN6ADDR_LOOPBACK_INIT, - }; - if (bind(s, (struct sockaddr *)&addr6, sizeof(addr6))) - goto fail; - } else { - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(src), - .sin_addr = IN4ADDR_LOOPBACK_INIT, - }; - if (bind(s, (struct sockaddr *)&addr4, sizeof(addr4))) - goto fail; - } - - sp->sock = s; - bitmap_set(udp_act[v6 ? 
V6 : V4][act], src); - - ev.data.u64 = ref.u64; - epoll_ctl(c->epollfd, EPOLL_CTL_ADD, s, &ev); - return s; - -fail: - close(s); - return -1; -} - -/** - * struct udp_splice_new_ns_arg - Arguments for udp_splice_new_ns() - * @c: Execution context - * @v6: Set for IPv6 - * @src: Source port of originating datagram, host order - * @dst: Destination port of originating datagram, host order - * @s: Newly created socket or negative error code - */ -struct udp_splice_new_ns_arg { - const struct ctx *c; - int v6; - in_port_t src; - int s; -}; - -/** - * udp_splice_new_ns() - Enter namespace and call udp_splice_new() - * @arg: See struct udp_splice_new_ns_arg - * - * Return: 0 - */ -static int udp_splice_new_ns(void *arg) -{ - struct udp_splice_new_ns_arg *a; - - a = (struct udp_splice_new_ns_arg *)arg; - - ns_enter(a->c); - - a->s = udp_splice_new(a->c, a->v6, a->src, true); - - return 0; -} - -/** - * udp_mmh_splice_port() - Is source address of message suitable for splicing? - * @ref: epoll reference for incoming message's origin socket - * @mmh: mmsghdr of incoming message - * - * Return: if source address of message in @mmh refers to localhost (127.0.0.1 - * or ::1) its source port (host order), otherwise -1. 
- */ -static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsghdr *mmh) -{ - const struct sockaddr_in6 *sa6 = mmh->msg_hdr.msg_name; - const struct sockaddr_in *sa4 = mmh->msg_hdr.msg_name; - - ASSERT(ref.type == EPOLL_TYPE_UDP); - - if (!ref.udp.splice) - return -1; - - if (ref.udp.v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr)) - return ntohs(sa6->sin6_port); - - if (!ref.udp.v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr)) - return ntohs(sa4->sin_port); - - return -1; -} - /** * udp_at_sidx() - Get UDP specific flow at given sidx * @sidx: Flow and side to retrieve @@ -542,6 +374,16 @@ struct udp_flow *udp_at_sidx(flow_sidx_t sidx) */ static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) { + if (uflow->s[INISIDE] >= 0) { + /* The listening socket needs to stay in epoll */ + close(uflow->s[INISIDE]); + } + + if (uflow->s[TGTSIDE] >= 0) { + /* But the flow specific one needs to be removed */ + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, uflow->s[TGTSIDE], NULL); + close(uflow->s[TGTSIDE]);
Keeping numbers of closed sockets around, instead of setting them to -1 right away, makes me a bit nervous. On the other hand it's obvious that udp_flow_new() sets them to -1 anyway, so there isn't much that can go wrong.
+ } flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); }
@@ -549,26 +391,80 @@ static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) * udp_flow_new() - Common setup for a new UDP flow * @c: Execution context * @flow: Initiated flow + * @s_ini: Initiating socket (or -1) * @now: Timestamp * * Return: UDP specific flow, if successful, NULL on failure */ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, - const struct timespec *now) + int s_ini, const struct timespec *now) { const struct flowside *ini = &flow->f.side[INISIDE]; struct udp_flow *uflow = NULL; + const struct flowside *tgt; + uint8_t tgtpif;
if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) { flow_dbg(flow, "Invalid endpoint to initiate UDP flow"); goto cancel; }
- if (!flow_target(c, flow, IPPROTO_UDP)) + if (!(tgt = flow_target(c, flow, IPPROTO_UDP))) goto cancel; + tgtpif = flow->f.pif[TGTSIDE];
uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp); uflow->ts = now->tv_sec; + uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1; + + if (s_ini >= 0) { + /* When using auto port-scanning the listening port could go + * away, so we need to duplicate it */
For consistency: closing */ on a newline. I would also say "the socket" instead of "it", otherwise it seems to be referring to the port at a first glance.
+ uflow->s[INISIDE] = fcntl(s_ini, F_DUPFD_CLOEXEC, 0);
There's one aspect of this I don't understand: if s_ini is closed while checking for bound ports (is it? I didn't really reach the end of this series), aren't duplicates also closed? That is, the documentation of dup2(2), which should be the same for this purpose, states that the duplicate inherits "file status flags", which I would assume also includes the fact that a socket is closed. I didn't test that though. If duplicates are closed, I guess an alternative solution could be to introduce some kind of reference counting for sockets... somewhere.
+ if (uflow->s[INISIDE] < 0) { + flow_err(uflow, + "Couldn't duplicate listening socket: %s", + strerror(errno)); + goto cancel; + } + } + + if (pif_is_socket(tgtpif)) { + union { + flow_sidx_t sidx; + uint32_t data; + } fref = { + .sidx = FLOW_SIDX(flow, TGTSIDE), + }; + + uflow->s[TGTSIDE] = flowside_sock_l4(c, EPOLL_TYPE_UDP_REPLY, + tgtpif, tgt, fref.data); + if (uflow->s[TGTSIDE] < 0) { + flow_dbg(uflow, + "Couldn't open socket for spliced flow: %s", + strerror(errno)); + goto cancel; + } + + if (flowside_connect(c, uflow->s[TGTSIDE], tgtpif, tgt) < 0) { + flow_dbg(uflow, + "Couldn't connect flow socket: %s", + strerror(errno)); + goto cancel; + } + + /* It's possible, if unlikely, that we could receive some + * unrelated packets in between the bind() and connect() of this + * socket. For now we just discard these. We could consider + * trying to re-direct these to an appropriate handler, if we
Simply "redirect"?
+ * need to. + */ + /* cppcheck-suppress nullPointer */ + while (recv(uflow->s[TGTSIDE], NULL, 0, MSG_DONTWAIT) >= 0) + ;
Could a local attacker (another user) attempt to use this for denial of service? Of course, somebody could flood us anyway and we would get and handle all the events that causes. But this case is different because we could get stuck for an unlimited amount of time without serving other sockets at all.

If that's a possibility, perhaps a limit for this loop (a maximum number of recv() tries) would be a good idea. I'm not sure how we should handle the case where we exceed the threshold. Another option, which adds some complexity but looks more correct to me, would be to try a single recv() call, and if we get data from it, fail creating the new flow entirely.

I'm still reviewing the rest (of this patch and of this series).

-- Stefano
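The first option suggested above (a capped drain loop) could look something like this. This is a sketch of the suggestion, not the patch's code; drain_sock() and DRAIN_MAX are illustrative names, and the cap value is an arbitrary assumption:

```c
/* Drain stray datagrams queued between bind() and connect(), but give
 * up after a fixed number so a flood can't stall the event loop.
 * DRAIN_MAX is an arbitrary illustrative cap. */
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

#define DRAIN_MAX	32

/* Return 0 once the queue is empty, -1 if the cap was hit
 * (caller could then fail flow creation) */
int drain_sock(int s)
{
	int i;

	for (i = 0; i < DRAIN_MAX; i++) {
		/* Zero-length non-blocking recv() just pops one datagram */
		if (recv(s, NULL, 0, MSG_DONTWAIT) < 0)
			return 0;	/* EAGAIN: queue is empty */
	}
	return -1;			/* still receiving: possible flood */
}
```

The second option (fail the flow on any stray datagram) is the degenerate case of this with a cap of one.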
On Wed, Jul 10, 2024 at 12:32:33AM +0200, Stefano Brivio wrote:
On Fri, 5 Jul 2024 12:07:18 +1000, David Gibson wrote:
When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use.
For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c
When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry.
Signed-off-by: David Gibson
---
 epoll_type.h |   2 +
 flow.c       |  20 +++
 flow.h       |   2 +
 flow_table.h |  14 ++
 passt.c      |   4 +
 udp.c        | 403 ++++++++++++++++++---------------------------------
 udp.h        |   6 +-
 udp_flow.h   |   2 +
 util.c       |   1 +
 9 files changed, 194 insertions(+), 260 deletions(-)

diff --git a/epoll_type.h b/epoll_type.h
index b6c04199..7a752ed1 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -22,6 +22,8 @@ enum epoll_type {
 	EPOLL_TYPE_TCP_TIMER,
 	/* UDP sockets */
 	EPOLL_TYPE_UDP,
+	/* UDP socket for replies on a specific flow */
+	EPOLL_TYPE_UDP_REPLY,
 	/* ICMP/ICMPv6 ping sockets */
 	EPOLL_TYPE_PING,
 	/* inotify fd watching for end of netns (pasta) */
diff --git a/flow.c b/flow.c
index 0cb9495b..2e100ddb 100644
--- a/flow.c
+++ b/flow.c
@@ -236,6 +236,26 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 	}
 }
+/** flowside_connect() - Connect a socket based on flowside
+ * @c:		Execution context
+ * @s:		Socket to connect
+ * @pif:	Target pif
+ * @tgt:	Target flowside
+ *
+ * Connect @s to the endpoint address and port from @tgt.
+ *
+ * Return: 0 on success, negative on error
+ */
+int flowside_connect(const struct ctx *c, int s,
+		     uint8_t pif, const struct flowside *tgt)
+{
+	union sockaddr_inany sa;
+	socklen_t sl;
+
+	pif_sockaddr(c, &sa, &sl, pif, &tgt->eaddr, tgt->eport);
+	return connect(s, &sa.sa, sl);
+}
+
 /** flow_log_ - Log flow-related message
  * @f:		flow the message is related to
  * @pri:	Log priority
diff --git a/flow.h b/flow.h
index 3752e5ee..3f65ceb9 100644
--- a/flow.h
+++ b/flow.h
@@ -168,6 +168,8 @@ static inline bool flowside_eq(const struct flowside *left,
 int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 		     const struct flowside *tgt, uint32_t data);
+int flowside_connect(const struct ctx *c, int s,
+		     uint8_t pif, const struct flowside *tgt);
 /**
  * struct flow_common - Common fields for packet flows
diff --git a/flow_table.h b/flow_table.h
index 3fbc7c8d..1faac4a7 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -92,6 +92,20 @@ static inline flow_sidx_t flow_sidx_opposite(flow_sidx_t sidx)
 	return (flow_sidx_t){.flow = sidx.flow, .side = !sidx.side};
 }
+/** pif_at_sidx - Interface for a given flow and side
pif_at_sidx()
Done.
+ * @sidx: Flow & side index + * + * Return: pif for the flow & side given by @sidx + */ +static inline uint8_t pif_at_sidx(flow_sidx_t sidx) +{ + const union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return PIF_NONE; + return flow->f.pif[sidx.side]; +} + /** flow_sidx_t - Index of one side of a flow from common structure * @f: Common flow fields pointer * @side: Which side to refer to (0 or 1) diff --git a/passt.c b/passt.c index e4d45daa..f9405bee 100644 --- a/passt.c +++ b/passt.c @@ -67,6 +67,7 @@ char *epoll_type_str[] = { [EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket", [EPOLL_TYPE_TCP_TIMER] = "TCP timer", [EPOLL_TYPE_UDP] = "UDP socket", + [EPOLL_TYPE_UDP_REPLY] = "UDP reply socket", [EPOLL_TYPE_PING] = "ICMP/ICMPv6 ping socket", [EPOLL_TYPE_NSQUIT_INOTIFY] = "namespace inotify watch", [EPOLL_TYPE_NSQUIT_TIMER] = "namespace timer watch", @@ -349,6 +350,9 @@ loop: case EPOLL_TYPE_UDP: udp_buf_sock_handler(&c, ref, eventmask, &now); break; + case EPOLL_TYPE_UDP_REPLY: + udp_reply_sock_handler(&c, ref, eventmask, &now); + break; case EPOLL_TYPE_PING: icmp_sock_handler(&c, ref); break; diff --git a/udp.c b/udp.c index daf4fe26..f4c696db 100644 --- a/udp.c +++ b/udp.c @@ -35,7 +35,31 @@ * =================== * * UDP doesn't use listen(), but we consider long term sockets which are allowed - * to create new flows "listening" by analogy with TCP. + * to create new flows "listening" by analogy with TCP. This listening socket + * could receive packets from multiple flows, so we use a hash table match to + * find the specific flow for a datagram. + * + * When a UDP flow is initiated from a listening socket we take a duplicate of
These are always "inbound" flows, right?
No, they could come from a socket listening in the namespace (-U).
+ * the socket and store it in uflow->s[INISIDE]. This will last for the + * lifetime of the flow, even if the original listening socket is closed due to + * port auto-probing. The duplicate is used to deliver replies back to the + * originating side. + * + * Reply sockets + * ============= + * + * When a UDP flow targets a socket, we create a "reply" socket in
...and these are outbound ones?
No. Again, we could be creating this socket in the namespace. A spliced flow will have both the dup() of a listening socket, and a reply socket, regardless of whether it's inbound (HOST -> SPLICE) or outbound (SPLICE -> HOST).
+ * uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive + * replies on the target side. This socket is both bound and connected and has + * EPOLL_TYPE_UDP_REPLY. The connect() means it will only receive datagrams + * associated with this flow, so the epoll reference directly points to the flow + * and we don't need a hash lookup. + * + * NOTE: it's possible that the reply socket could have a bound address + * overlapping with an unrelated listening socket. We assume datagrams for the + * flow will come to the reply socket in preference to a listening socket. The + * sample program contrib/udp-reuseaddr/reuseaddr-priority.c documents and tests
Now it's under doc/platform-requirements/
Thanks for the catch, corrected.
+ * that assumption. * * Port tracking * ============= @@ -56,62 +80,6 @@ * * Packets are forwarded back and forth, by prepending and stripping UDP headers * in the obvious way, with no port translation. - * - * In PASTA mode, the L2-L4 translation is skipped for connections to ports - * bound between namespaces using the loopback interface, messages are directly - * transferred between L4 sockets instead. These are called spliced connections - * for consistency with the TCP implementation, but the splice() syscall isn't - * actually used as it wouldn't make sense for datagram-based connections: a - * pair of recvmmsg() and sendmmsg() deals with this case.
I think we should keep this paragraph... or did you move/add this information somewhere else I missed? Given that the "Port tracking" section is going to be away, maybe this could be moved at the end of "UDP Flows".
Fair point. I've paraphrased and updated this a bit and added it as a new section about spliced flows.
- * - * The connection tracking for PASTA mode is slightly complicated by the absence - * of actual connections, see struct udp_splice_port, and these examples: - * - * - from init to namespace: - * - * - forward direction: 127.0.0.1:5000 -> 127.0.0.1:80 in init from socket s, - * with epoll reference: index = 80, splice = 1, orig = 1, ns = 0 - * - if udp_splice_ns[V4][5000].sock: - * - send packet to udp_splice_ns[V4][5000].sock, with destination port - * 80 - * - otherwise: - * - create new socket udp_splice_ns[V4][5000].sock - * - bind in namespace to 127.0.0.1:5000 - * - add to epoll with reference: index = 5000, splice = 1, orig = 0, - * ns = 1 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - * - reverse direction: 127.0.0.1:80 -> 127.0.0.1:5000 in namespace socket s, - * having epoll reference: index = 5000, splice = 1, orig = 0, ns = 1 - * - if udp_splice_init[V4][80].sock: - * - send to udp_splice_init[V4][80].sock, with destination port 5000 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - otherwise, discard - * - * - from namespace to init: - * - * - forward direction: 127.0.0.1:2000 -> 127.0.0.1:22 in namespace from - * socket s, with epoll reference: index = 22, splice = 1, orig = 1, ns = 1 - * - if udp4_splice_init[V4][2000].sock: - * - send packet to udp_splice_init[V4][2000].sock, with destination - * port 22 - * - otherwise: - * - create new socket udp_splice_init[V4][2000].sock - * - bind in init to 127.0.0.1:2000 - * - add to epoll with reference: index = 2000, splice = 1, orig = 0, - * ns = 0 - * - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - * - reverse direction: 127.0.0.1:22 -> 127.0.0.1:2000 in init from socket s, - * having epoll reference: index = 2000, splice = 1, orig = 0, ns = 0 - * - if udp_splice_ns[V4][22].sock: - * - send to udp_splice_ns[V4][22].sock, with destination port 2000 - * - 
update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - otherwise, discard */
#include
@@ -134,6 +102,7 @@ #include #include #include +#include #include "checksum.h" #include "util.h" @@ -223,7 +192,6 @@ static struct ethhdr udp6_eth_hdr; * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr) * @taph: Tap backend specific header * @s_in: Source socket address, filled in by recvmmsg() - * @splicesrc: Source port for splicing, or -1 if not spliceable * @tosidx: sidx for the destination side of this datagram's flow */ static struct udp_meta_t { @@ -232,7 +200,6 @@ static struct udp_meta_t { struct tap_hdr taph;
union sockaddr_inany s_in; - int splicesrc; flow_sidx_t tosidx; } #ifdef __AVX2__ @@ -270,7 +237,6 @@ static struct mmsghdr udp_mh_splice [UDP_MAX_FRAMES]; /* IOVs for L2 frames */ static struct iovec udp_l2_iov [UDP_MAX_FRAMES][UDP_NUM_IOVS];
- /** * udp_portmap_clear() - Clear UDP port map before configuration */ @@ -383,140 +349,6 @@ static void udp_iov_init(const struct ctx *c) udp_iov_init_one(c, i); }
I'm still reviewing the rest (of this patch and of this series).
-- 
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.  http://www.ozlabs.org/~dgibson
On Wed, 10 Jul 2024 10:23:14 +1000 David Gibson
On Wed, Jul 10, 2024 at 12:32:33AM +0200, Stefano Brivio wrote:
On Fri, 5 Jul 2024 12:07:18 +1000 David Gibson wrote:

When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use.
For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c
When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry.
Signed-off-by: David Gibson
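The kernel behaviour the commit message relies on - that a connect()ed UDP socket only receives datagrams from its connected peer - can be seen with a short standalone check. This is an illustrative sketch, not part of the patch; all names here are made up:

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a UDP socket to loopback with a kernel-chosen port */
static int udp_bound(struct sockaddr_in *addr)
{
	socklen_t sl = sizeof(*addr);
	int s = socket(AF_INET, SOCK_DGRAM, 0);

	memset(addr, 0, sizeof(*addr));
	addr->sin_family = AF_INET;
	addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	assert(s >= 0);
	assert(bind(s, (struct sockaddr *)addr, sl) == 0);
	assert(getsockname(s, (struct sockaddr *)addr, &sl) == 0);
	return s;
}

int connected_udp_filters_peers(void)
{
	struct sockaddr_in ra, pa, xa;
	int reply = udp_bound(&ra);	/* plays the "reply" socket */
	int peer = udp_bound(&pa);	/* the flow's endpoint */
	int other = udp_bound(&xa);	/* unrelated sender */
	char buf[8];

	/* Like the reply sockets in this patch: bound, then connected */
	assert(connect(reply, (struct sockaddr *)&pa, sizeof(pa)) == 0);

	/* A datagram from an unrelated source is dropped by the kernel... */
	assert(sendto(other, "stray", 5, 0,
		      (struct sockaddr *)&ra, sizeof(ra)) == 5);
	/* ...while one from the connected peer is delivered */
	assert(sendto(peer, "reply", 5, 0,
		      (struct sockaddr *)&ra, sizeof(ra)) == 5);

	return recv(reply, buf, sizeof(buf), 0) == 5 &&
	       !memcmp(buf, "reply", 5) &&
	       recv(reply, buf, sizeof(buf), MSG_DONTWAIT) < 0;
}
```

If the stray datagram were delivered, the first recv() would see "stray" instead: the connected socket's 4-tuple match is what lets the epoll reference point straight at the flow.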
---
 epoll_type.h |   2 +
 flow.c       |  20 +++
 flow.h       |   2 +
 flow_table.h |  14 ++
 passt.c      |   4 +
 udp.c        | 403 ++++++++++++++++++---
 udp.h        |   6 +-
 udp_flow.h   |   2 +
 util.c       |   1 +
 9 files changed, 194 insertions(+), 260 deletions(-)

diff --git a/epoll_type.h b/epoll_type.h
index b6c04199..7a752ed1 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -22,6 +22,8 @@ enum epoll_type {
 	EPOLL_TYPE_TCP_TIMER,
 	/* UDP sockets */
 	EPOLL_TYPE_UDP,
+	/* UDP socket for replies on a specific flow */
+	EPOLL_TYPE_UDP_REPLY,
 	/* ICMP/ICMPv6 ping sockets */
 	EPOLL_TYPE_PING,
 	/* inotify fd watching for end of netns (pasta) */
diff --git a/flow.c b/flow.c
index 0cb9495b..2e100ddb 100644
--- a/flow.c
+++ b/flow.c
@@ -236,6 +236,26 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 	}
 }
+/** flowside_connect() - Connect a socket based on flowside
+ * @c:		Execution context
+ * @s:		Socket to connect
+ * @pif:	Target pif
+ * @tgt:	Target flowside
+ *
+ * Connect @s to the endpoint address and port from @tgt.
+ *
+ * Return: 0 on success, negative on error
+ */
+int flowside_connect(const struct ctx *c, int s,
+		     uint8_t pif, const struct flowside *tgt)
+{
+	union sockaddr_inany sa;
+	socklen_t sl;
+
+	pif_sockaddr(c, &sa, &sl, pif, &tgt->eaddr, tgt->eport);
+	return connect(s, &sa.sa, sl);
+}
+
 /** flow_log_ - Log flow-related message
  * @f:	flow the message is related to
  * @pri:	Log priority
diff --git a/flow.h b/flow.h
index 3752e5ee..3f65ceb9 100644
--- a/flow.h
+++ b/flow.h
@@ -168,6 +168,8 @@ static inline bool flowside_eq(const struct flowside *left,
 int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 		     const struct flowside *tgt, uint32_t data);
+int flowside_connect(const struct ctx *c, int s,
+		     uint8_t pif, const struct flowside *tgt);
 /**
  * struct flow_common - Common fields for packet flows
diff --git a/flow_table.h b/flow_table.h
index 3fbc7c8d..1faac4a7 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -92,6 +92,20 @@ static inline flow_sidx_t flow_sidx_opposite(flow_sidx_t sidx)
 	return (flow_sidx_t){.flow = sidx.flow, .side = !sidx.side};
 }
+/** pif_at_sidx - Interface for a given flow and side
pif_at_sidx()
Done.
+ * @sidx:	Flow & side index
+ *
+ * Return: pif for the flow & side given by @sidx
+ */
+static inline uint8_t pif_at_sidx(flow_sidx_t sidx)
+{
+	const union flow *flow = flow_at_sidx(sidx);
+
+	if (!flow)
+		return PIF_NONE;
+	return flow->f.pif[sidx.side];
+}
+
 /** flow_sidx_t - Index of one side of a flow from common structure
  * @f:	Common flow fields pointer
  * @side:	Which side to refer to (0 or 1)
diff --git a/passt.c b/passt.c
index e4d45daa..f9405bee 100644
--- a/passt.c
+++ b/passt.c
@@ -67,6 +67,7 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_TCP_LISTEN]		= "listening TCP socket",
 	[EPOLL_TYPE_TCP_TIMER]		= "TCP timer",
 	[EPOLL_TYPE_UDP]		= "UDP socket",
+	[EPOLL_TYPE_UDP_REPLY]		= "UDP reply socket",
 	[EPOLL_TYPE_PING]		= "ICMP/ICMPv6 ping socket",
 	[EPOLL_TYPE_NSQUIT_INOTIFY]	= "namespace inotify watch",
 	[EPOLL_TYPE_NSQUIT_TIMER]	= "namespace timer watch",
@@ -349,6 +350,9 @@ loop:
 		case EPOLL_TYPE_UDP:
 			udp_buf_sock_handler(&c, ref, eventmask, &now);
 			break;
+		case EPOLL_TYPE_UDP_REPLY:
+			udp_reply_sock_handler(&c, ref, eventmask, &now);
+			break;
 		case EPOLL_TYPE_PING:
 			icmp_sock_handler(&c, ref);
 			break;
diff --git a/udp.c b/udp.c
index daf4fe26..f4c696db 100644
--- a/udp.c
+++ b/udp.c
@@ -35,7 +35,31 @@
  * ===================
  *
  * UDP doesn't use listen(), but we consider long term sockets which are allowed
- * to create new flows "listening" by analogy with TCP.
+ * to create new flows "listening" by analogy with TCP. This listening socket
+ * could receive packets from multiple flows, so we use a hash table match to
+ * find the specific flow for a datagram.
+ *
+ * When a UDP flow is initiated from a listening socket we take a duplicate of
These are always "inbound" flows, right?
No, they could come from a socket listening in the namespace (-U).
+ * the socket and store it in uflow->s[INISIDE]. This will last for the
+ * lifetime of the flow, even if the original listening socket is closed due to
+ * port auto-probing. The duplicate is used to deliver replies back to the
+ * originating side.
+ *
+ * Reply sockets
+ * =============
+ *
+ * When a UDP flow targets a socket, we create a "reply" socket in
...and these are outbound ones?
No. Again, we could be creating this socket in the namespace. A spliced flow will have both the dup() of a listening socket, and a reply socket, regardless of whether it's inbound (HOST -> SPLICE) or outbound (SPLICE -> HOST).
+ * uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive
+ * replies on the target side. This socket is both bound and connected and has
+ * EPOLL_TYPE_UDP_REPLY. The connect() means it will only receive datagrams
+ * associated with this flow, so the epoll reference directly points to the flow
+ * and we don't need a hash lookup.
+ *
+ * NOTE: it's possible that the reply socket could have a bound address
+ * overlapping with an unrelated listening socket. We assume datagrams for the
+ * flow will come to the reply socket in preference to a listening socket. The
+ * sample program contrib/udp-reuseaddr/reuseaddr-priority.c documents and tests
Now it's under doc/platform-requirements/
Thanks for the catch, corrected.
+ * that assumption.
  *
  * Port tracking
  * =============
@@ -56,62 +80,6 @@
  *
  * Packets are forwarded back and forth, by prepending and stripping UDP headers
  * in the obvious way, with no port translation.
- *
- * In PASTA mode, the L2-L4 translation is skipped for connections to ports
- * bound between namespaces using the loopback interface, messages are directly
- * transferred between L4 sockets instead. These are called spliced connections
- * for consistency with the TCP implementation, but the splice() syscall isn't
- * actually used as it wouldn't make sense for datagram-based connections: a
- * pair of recvmmsg() and sendmmsg() deals with this case.
I think we should keep this paragraph... or did you move/add this information somewhere else I missed? Given that the "Port tracking" section is going to be away, maybe this could be moved at the end of "UDP Flows".
Fair point. I've paraphrased and updated this a bit and added it as a new section about spliced flows.
- *
- * The connection tracking for PASTA mode is slightly complicated by the absence
- * of actual connections, see struct udp_splice_port, and these examples:
- *
- * - from init to namespace:
- *
- *   - forward direction: 127.0.0.1:5000 -> 127.0.0.1:80 in init from socket s,
- *     with epoll reference: index = 80, splice = 1, orig = 1, ns = 0
- *     - if udp_splice_ns[V4][5000].sock:
- *       - send packet to udp_splice_ns[V4][5000].sock, with destination port
- *         80
- *     - otherwise:
- *       - create new socket udp_splice_ns[V4][5000].sock
- *       - bind in namespace to 127.0.0.1:5000
- *       - add to epoll with reference: index = 5000, splice = 1, orig = 0,
- *         ns = 1
- *     - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with
- *       current time
- *
- *   - reverse direction: 127.0.0.1:80 -> 127.0.0.1:5000 in namespace socket s,
- *     having epoll reference: index = 5000, splice = 1, orig = 0, ns = 1
- *     - if udp_splice_init[V4][80].sock:
- *       - send to udp_splice_init[V4][80].sock, with destination port 5000
- *       - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with
- *         current time
- *     - otherwise, discard
- *
- * - from namespace to init:
- *
- *   - forward direction: 127.0.0.1:2000 -> 127.0.0.1:22 in namespace from
- *     socket s, with epoll reference: index = 22, splice = 1, orig = 1, ns = 1
- *     - if udp4_splice_init[V4][2000].sock:
- *       - send packet to udp_splice_init[V4][2000].sock, with destination
- *         port 22
- *     - otherwise:
- *       - create new socket udp_splice_init[V4][2000].sock
- *       - bind in init to 127.0.0.1:2000
- *       - add to epoll with reference: index = 2000, splice = 1, orig = 0,
- *         ns = 0
- *     - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with
- *       current time
- *
- *   - reverse direction: 127.0.0.1:22 -> 127.0.0.1:2000 in init from socket s,
- *     having epoll reference: index = 2000, splice = 1, orig = 0, ns = 0
- *     - if udp_splice_ns[V4][22].sock:
- *       - send to udp_splice_ns[V4][22].sock, with destination port 2000
- *       - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with
- *         current time
- *     - otherwise, discard
- */
#include
@@ -134,6 +102,7 @@
 #include
 #include
 #include
+#include
 
 #include "checksum.h"
 #include "util.h"
@@ -223,7 +192,6 @@ static struct ethhdr udp6_eth_hdr;
  * @ip4h:	Pre-filled IPv4 header (except for tot_len and saddr)
  * @taph:	Tap backend specific header
  * @s_in:	Source socket address, filled in by recvmmsg()
- * @splicesrc:	Source port for splicing, or -1 if not spliceable
  * @tosidx:	sidx for the destination side of this datagram's flow
  */
 static struct udp_meta_t {
@@ -232,7 +200,6 @@ static struct udp_meta_t {
 	struct tap_hdr taph;
 
 	union sockaddr_inany s_in;
-	int splicesrc;
 	flow_sidx_t tosidx;
 }
 #ifdef __AVX2__
@@ -270,7 +237,6 @@ static struct mmsghdr udp_mh_splice	[UDP_MAX_FRAMES];
 
 /* IOVs for L2 frames */
 static struct iovec	udp_l2_iov	[UDP_MAX_FRAMES][UDP_NUM_IOVS];
-
 /**
  * udp_portmap_clear() - Clear UDP port map before configuration
  */
@@ -383,140 +349,6 @@ static void udp_iov_init(const struct ctx *c)
 		udp_iov_init_one(c, i);
 }
-/**
- * udp_splice_new() - Create and prepare socket for "spliced" binding
- * @c:		Execution context
- * @v6:		Set for IPv6 sockets
- * @src:	Source port of original connection, host order
- * @ns:		Does the splice originate in the ns or not
- *
- * Return: prepared socket, negative error code on failure
- *
- * #syscalls:pasta getsockname
- */
-int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns)
-{
-	struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP | EPOLLHUP };
-	union epoll_ref ref = { .type = EPOLL_TYPE_UDP,
-				.udp = { .splice = true, .v6 = v6, .port = src }
-			      };
-	struct udp_splice_port *sp;
-	int act, s;
-
-	if (ns) {
-		ref.udp.pif = PIF_SPLICE;
-		sp = &udp_splice_ns[v6 ? V6 : V4][src];
-		act = UDP_ACT_SPLICE_NS;
-	} else {
-		ref.udp.pif = PIF_HOST;
-		sp = &udp_splice_init[v6 ? V6 : V4][src];
-		act = UDP_ACT_SPLICE_INIT;
-	}
-
-	s = socket(v6 ? AF_INET6 : AF_INET, SOCK_DGRAM | SOCK_NONBLOCK,
-		   IPPROTO_UDP);
-
-	if (s > FD_REF_MAX) {
-		close(s);
-		return -EIO;
-	}
-
-	if (s < 0)
-		return s;
-
-	ref.fd = s;
-
-	if (v6) {
-		struct sockaddr_in6 addr6 = {
-			.sin6_family = AF_INET6,
-			.sin6_port = htons(src),
-			.sin6_addr = IN6ADDR_LOOPBACK_INIT,
-		};
-		if (bind(s, (struct sockaddr *)&addr6, sizeof(addr6)))
-			goto fail;
-	} else {
-		struct sockaddr_in addr4 = {
-			.sin_family = AF_INET,
-			.sin_port = htons(src),
-			.sin_addr = IN4ADDR_LOOPBACK_INIT,
-		};
-		if (bind(s, (struct sockaddr *)&addr4, sizeof(addr4)))
-			goto fail;
-	}
-
-	sp->sock = s;
-	bitmap_set(udp_act[v6 ? V6 : V4][act], src);
-
-	ev.data.u64 = ref.u64;
-	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, s, &ev);
-	return s;
-
-fail:
-	close(s);
-	return -1;
-}
-
-/**
- * struct udp_splice_new_ns_arg - Arguments for udp_splice_new_ns()
- * @c:		Execution context
- * @v6:		Set for IPv6
- * @src:	Source port of originating datagram, host order
- * @dst:	Destination port of originating datagram, host order
- * @s:		Newly created socket or negative error code
- */
-struct udp_splice_new_ns_arg {
-	const struct ctx *c;
-	int v6;
-	in_port_t src;
-	int s;
-};
-
-/**
- * udp_splice_new_ns() - Enter namespace and call udp_splice_new()
- * @arg:	See struct udp_splice_new_ns_arg
- *
- * Return: 0
- */
-static int udp_splice_new_ns(void *arg)
-{
-	struct udp_splice_new_ns_arg *a;
-
-	a = (struct udp_splice_new_ns_arg *)arg;
-
-	ns_enter(a->c);
-
-	a->s = udp_splice_new(a->c, a->v6, a->src, true);
-
-	return 0;
-}
-
-/**
- * udp_mmh_splice_port() - Is source address of message suitable for splicing?
- * @ref:	epoll reference for incoming message's origin socket
- * @mmh:	mmsghdr of incoming message
- *
- * Return: if source address of message in @mmh refers to localhost (127.0.0.1
- *         or ::1) its source port (host order), otherwise -1.
- */
-static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsghdr *mmh)
-{
-	const struct sockaddr_in6 *sa6 = mmh->msg_hdr.msg_name;
-	const struct sockaddr_in *sa4 = mmh->msg_hdr.msg_name;
-
-	ASSERT(ref.type == EPOLL_TYPE_UDP);
-
-	if (!ref.udp.splice)
-		return -1;
-
-	if (ref.udp.v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
-		return ntohs(sa6->sin6_port);
-
-	if (!ref.udp.v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr))
-		return ntohs(sa4->sin_port);
-
-	return -1;
-}
-
 /**
  * udp_at_sidx() - Get UDP specific flow at given sidx
  * @sidx:	Flow and side to retrieve
@@ -542,6 +374,16 @@ struct udp_flow *udp_at_sidx(flow_sidx_t sidx)
  */
 static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow)
 {
+	if (uflow->s[INISIDE] >= 0) {
+		/* The listening socket needs to stay in epoll */
+		close(uflow->s[INISIDE]);
+	}
+
+	if (uflow->s[TGTSIDE] >= 0) {
+		/* But the flow specific one needs to be removed */
+		epoll_ctl(c->epollfd, EPOLL_CTL_DEL, uflow->s[TGTSIDE], NULL);
+		close(uflow->s[TGTSIDE]);
Keeping numbers of closed sockets around, instead of setting them to -1 right away, makes me a bit nervous.
On the other hand it's obvious that udp_flow_new() sets them to -1 anyway, so there isn't much that can go wrong.
Right, we're about to destroy the flow entry entirely, so it is safe. But I agree that it's nervous-making. It's pretty cheap to set them to -1 explicitly, so I'm doing that now.
+	}
 
 	flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE));
 }
@@ -549,26 +391,80 @@ static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow)
 /**
  * udp_flow_new() - Common setup for a new UDP flow
  * @c:		Execution context
  * @flow:	Initiated flow
+ * @s_ini:	Initiating socket (or -1)
  * @now:	Timestamp
  *
  * Return: UDP specific flow, if successful, NULL on failure
  */
 static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
-				const struct timespec *now)
+				int s_ini, const struct timespec *now)
 {
 	const struct flowside *ini = &flow->f.side[INISIDE];
 	struct udp_flow *uflow = NULL;
+	const struct flowside *tgt;
+	uint8_t tgtpif;
 	if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) {
 		flow_dbg(flow, "Invalid endpoint to initiate UDP flow");
 		goto cancel;
 	}
-	if (!flow_target(c, flow, IPPROTO_UDP))
+	if (!(tgt = flow_target(c, flow, IPPROTO_UDP)))
 		goto cancel;
+	tgtpif = flow->f.pif[TGTSIDE];
 	uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp);
 	uflow->ts = now->tv_sec;
+	uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1;
+
+	if (s_ini >= 0) {
+		/* When using auto port-scanning the listening port could go
+		 * away, so we need to duplicate it */
For consistency: closing */ on a newline. I would also say "the socket" instead of "it", otherwise it seems to be referring to the port at a first glance.
Done.
+ uflow->s[INISIDE] = fcntl(s_ini, F_DUPFD_CLOEXEC, 0);
There's one aspect of this I don't understand: if s_ini is closed while checking for bound ports (is it? I didn't really reach the end of this series), aren't duplicates also closed?
That is, the documentation of dup2(2), which should be the same for this purpose, states that the duplicate inherits "file status flags", which I would assume also includes the fact that a socket is closed. I didn't test that though.
I don't believe so. My understanding is that dup() (and the rest) make a new fd referencing the same underlying file object, yes. But AIUI, close() just closes one fd - the underlying object is only closed when all fds are gone.
Ah, probably, yes.
If duplicates are closed, I guess an alternative solution could be to introduce some kind of reference counting for sockets... somewhere.
.. in other words, I believe the kernel does the reference counting.
I should verify this though, I'll try to come up with something new for doc/platform-requirements.
I didn't really find the time to sketch this but I guess the easiest way to check this behaviour is to have a TCP connection between a socket pair, with one socket having two descriptors, then closing one descriptor and checking if the peer socket sees a closed connection (recv() returning 0 or similar). I was wondering whether it's worth using the vforked namespaced peer trick I drafted here: https://archives.passt.top/passt-dev/20231206160808.3d312733@elisabeth/ just in case we want to use some of those test cases for actual tests, where we don't want to bind an actual TCP port on the machine we're running on. But if it adds complexity I'd say it's not worth it.
+		if (uflow->s[INISIDE] < 0) {
+			flow_err(uflow,
+				 "Couldn't duplicate listening socket: %s",
+				 strerror(errno));
+			goto cancel;
+		}
+	}
+
+	if (pif_is_socket(tgtpif)) {
+		union {
+			flow_sidx_t sidx;
+			uint32_t data;
+		} fref = {
+			.sidx = FLOW_SIDX(flow, TGTSIDE),
+		};
+
+		uflow->s[TGTSIDE] = flowside_sock_l4(c, EPOLL_TYPE_UDP_REPLY,
+						     tgtpif, tgt, fref.data);
+		if (uflow->s[TGTSIDE] < 0) {
+			flow_dbg(uflow,
+				 "Couldn't open socket for spliced flow: %s",
+				 strerror(errno));
+			goto cancel;
+		}
+
+		if (flowside_connect(c, uflow->s[TGTSIDE], tgtpif, tgt) < 0) {
+			flow_dbg(uflow,
+				 "Couldn't connect flow socket: %s",
+				 strerror(errno));
+			goto cancel;
+		}
+
+		/* It's possible, if unlikely, that we could receive some
+		 * unrelated packets in between the bind() and connect() of this
+		 * socket. For now we just discard these. We could consider
+		 * trying to re-direct these to an appropriate handler, if we
Simply "redirect"?
Done.
+		 * need to.
+		 */
+		/* cppcheck-suppress nullPointer */
+		while (recv(uflow->s[TGTSIDE], NULL, 0, MSG_DONTWAIT) >= 0)
+			;
Could a local attacker (another user) attempt to use this for denial of service?
Ah, interesting question.
Of course, somebody could flood us anyway and we would get and handle all the events that that causes. But this case is different because we could get stuck for an unlimited amount of time without serving other sockets at all.
Right.
If that's a possibility, perhaps a limit for this loop (a maximum number of recv() tries) would be a good idea. I'm not sure how we should handle the case where we exceed the threshold.
We could fail to create the flow. That would limit the damage, but see below.
Another one, which adds some complexity, but looks more correct to me, would be to try a single recv() call, and if we get data from it, fail creating the new flow entirely.
Right. I also considered a single recvmmsg(); the difficulty with that is making suitable arrays for it, since the normal ones may be in use at this point.
For UDP, we don't need to make the buffers large enough to fit packets into them, see udp(7): All receive operations return only one packet. When the packet is smaller than the passed buffer, only that much data is returned; when it is bigger, the packet is truncated and the MSG_TRUNC flag is set. so probably 1024 bytes (1 * UIO_MAXIOV) on the stack would be enough... or maybe we can even pass 0 as size and NULL buffers? If recvmmsg() returns UIO_MAXIOV, then something weird is going on and we can abort the "connection" attempt.
This removes the potential DoS of wedging passt completely, but raises the possibility of a different one. Could an attacker - not necessarily on the same host, but network-closer than the guest's intended endpoint - prevent the guest from connecting to a particular service by spamming fake reply packets here? I believe it would need to anticipate the guest's source port to do so, which might be good enough.
Oops, I didn't think about that, that's also concerning. With my suggestion above, 1024 packets is all an attacker would need. But maybe there's a safe way to deal with this that's still simpler than a separate handler: using recvmmsg(), going through msg_name of the (truncated) messages we received with it, and trying to draw some conclusion about what kind of attack we're dealing with from there. I'm not sure exactly which conclusion, though.
I believe it would need to anticipate the guest's source port to do so, which might be good enough.
Linux does source port randomisation for UDP starting from 2.6.24, so that should be good enough, but given that we try to preserve the source port from the guest, we might make a guest that doesn't do port randomisation (albeit unlikely to be found) vulnerable to this. -- Stefano
On Wed, Jul 10, 2024 at 07:13:26PM +0200, Stefano Brivio wrote:
On Wed, 10 Jul 2024 10:23:14 +1000 David Gibson wrote:

On Wed, Jul 10, 2024 at 12:32:33AM +0200, Stefano Brivio wrote:
On Fri, 5 Jul 2024 12:07:18 +1000 David Gibson wrote:

[snip]

+		uflow->s[INISIDE] = fcntl(s_ini, F_DUPFD_CLOEXEC, 0);
There's one aspect of this I don't understand: if s_ini is closed while checking for bound ports (is it? I didn't really reach the end of this series), aren't duplicates also closed?
That is, the documentation of dup2(2), which should be the same for this purpose, states that the duplicate inherits "file status flags", which I would assume also includes the fact that a socket is closed. I didn't test that though.
I don't believe so. My understanding is that dup() (and the rest) make a new fd referencing the same underlying file object, yes. But AIUI, close() just closes one fd - the underlying object is only closed when all fds are gone.
Ah, probably, yes.
If duplicates are closed, I guess an alternative solution could be to introduce some kind of reference counting for sockets... somewhere.
.. in other words, I believe the kernel does the reference counting.
I should verify this though, I'll try to come up with something new for doc/platform-requirements.
I didn't really find the time to sketch this but I guess the easiest way to check this behaviour is to have a TCP connection between a socket pair, with one socket having two descriptors, then closing one descriptor and check if the peer socket sees a closed connection (recv() returning 0 or similar).
So.. yes, this would check whether close() on a non-last fd for a socket triggers socket closing actions, but that's much stricter than what we actually need here. I would, for example, expect shutdown() on a TCP socket to affect all dups - and I don't actually know if a close() on one dup might trigger that. But we're dealing with UDP here, so there's no "on wire" effect of a close. So all we actually need to check is: 1. Open a "listening" udp socket 2. Dup it 3. Close a dup 4. Can the remaining dup still receive datagrams? I've written a test program for this, which I'll include in the next spin.
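A sketch of those four steps might look like this (hypothetical names and layout; the actual doc/platform-requirements program may well differ):

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int udp_dup_survives_close(void)
{
	struct sockaddr_in addr;
	socklen_t sl = sizeof(addr);
	char buf[4];
	int s, dup_fd, peer, ret;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	/* 1. Open a "listening" UDP socket (kernel-chosen port) */
	s = socket(AF_INET, SOCK_DGRAM, 0);
	assert(s >= 0);
	assert(bind(s, (struct sockaddr *)&addr, sl) == 0);
	assert(getsockname(s, (struct sockaddr *)&addr, &sl) == 0);

	/* 2. Dup it, and 3. close the original fd */
	dup_fd = dup(s);
	assert(dup_fd >= 0);
	close(s);

	peer = socket(AF_INET, SOCK_DGRAM, 0);
	assert(peer >= 0);
	assert(sendto(peer, "ping", 4, 0,
		      (struct sockaddr *)&addr, sl) == 4);

	/* 4. Can the remaining dup still receive datagrams? */
	ret = recv(dup_fd, buf, sizeof(buf), 0) == 4;
	close(dup_fd);
	close(peer);
	return ret;
}
```

On Linux this returns 1: the bound socket stays alive as long as any fd refers to it, which is exactly what the F_DUPFD_CLOEXEC trick in the patch relies on.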
I was wondering whether it's worth using the vforked namespaced peer trick I drafted here: https://archives.passt.top/passt-dev/20231206160808.3d312733@elisabeth/
just in case we want to use some of those test cases for actual tests, where we don't want to bind an actual TCP port on the machine we're running on. But if it adds complexity I'd say it's not worth it.
Yeah, I tend to think we can add that sort of sandboxing when and if we need it.

-- 
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.  http://www.ozlabs.org/~dgibson
On Thu, 11 Jul 2024 11:30:52 +1000 David Gibson
On Wed, Jul 10, 2024 at 07:13:26PM +0200, Stefano Brivio wrote:
On Wed, 10 Jul 2024 10:23:14 +1000 David Gibson wrote:

On Wed, Jul 10, 2024 at 12:32:33AM +0200, Stefano Brivio wrote:
On Fri, 5 Jul 2024 12:07:18 +1000 David Gibson wrote:

[snip]

+		uflow->s[INISIDE] = fcntl(s_ini, F_DUPFD_CLOEXEC, 0);
There's one aspect of this I don't understand: if s_ini is closed while checking for bound ports (is it? I didn't really reach the end of this series), aren't duplicates also closed?
That is, the documentation of dup2(2), which should be the same for this purpose, states that the duplicate inherits "file status flags", which I would assume also includes the fact that a socket is closed. I didn't test that though.
I don't believe so. My understanding is that dup() (and the rest) make a new fd referencing the same underlying file object, yes. But AIUI, close() just closes one fd - the underlying object is only closed when all fds are gone.
Ah, probably, yes.
If duplicates are closed, I guess an alternative solution could be to introduce some kind of reference counting for sockets... somewhere.
.. in other words, I believe the kernel does the reference counting.
I should verify this though, I'll try to come up with something new for doc/platform-requirements.
I didn't really find the time to sketch this but I guess the easiest way to check this behaviour is to have a TCP connection between a socket pair, with one socket having two descriptors, then closing one descriptor and check if the peer socket sees a closed connection (recv() returning 0 or similar).
So.. yes, this would check whether close() on a non-last fd for a socket triggers socket closing actions, but that's much stricter than what we actually need here. I would, for example, expect shutdown() on a TCP socket to affect all dups - and I don't actually know if a close() on one dup might trigger that.
But we're dealing with UDP here, so there's no "on wire" effect of a close.
Right, that's why I was thinking of using a TCP socket. On the other hand it's a bit silly to test something slightly different to indirectly check the hypothesis.
So all we actually need to check is: 1. Open a "listening" udp socket 2. Dup it 3. Close a dup 4. Can the remaining dup still receive datagrams?
...sounds much better, yes. -- Stefano
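For reference, the four-step check sketched above could look like the following standalone program, in the style of the existing doc/platform-requirements/ checks. This is only an illustration, not something from the passt tree; dup_close_check() is a made-up name.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Check that close() on one duplicate of a UDP socket doesn't close the
 * underlying socket: the kernel keeps the open file description alive
 * until the last fd referencing it is gone.
 */
int dup_close_check(void)
{
	struct sockaddr_in a = { .sin_family = AF_INET };
	socklen_t al = sizeof(a);
	char buf[4];
	int s, d, t;

	a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	/* 1. Open a "listening" UDP socket */
	s = socket(AF_INET, SOCK_DGRAM, 0);
	assert(s >= 0 && !bind(s, (struct sockaddr *)&a, sizeof(a)));
	assert(!getsockname(s, (struct sockaddr *)&a, &al));

	/* 2. Dup it, 3. close the dup */
	d = dup(s);
	assert(d >= 0);
	close(d);

	/* 4. Can the remaining fd still receive datagrams? */
	t = socket(AF_INET, SOCK_DGRAM, 0);
	assert(t >= 0);
	assert(sendto(t, "x", 1, 0, (struct sockaddr *)&a, sizeof(a)) == 1);
	assert(recv(s, buf, sizeof(buf), 0) == 1);

	close(t);
	close(s);
	return 1;
}
```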
On Wed, Jul 10, 2024 at 07:13:26PM +0200, Stefano Brivio wrote:
On Wed, 10 Jul 2024 10:23:14 +1000 David Gibson wrote:
On Wed, Jul 10, 2024 at 12:32:33AM +0200, Stefano Brivio wrote:
On Fri, 5 Jul 2024 12:07:18 +1000 David Gibson wrote:
[snip]
+	 * need to.
+	 */
+	/* cppcheck-suppress nullPointer */
+	while (recv(uflow->s[TGTSIDE], NULL, 0, MSG_DONTWAIT) >= 0)
+		;
Could a local attacker (another user) attempt to use this for denial of service?
Ah, interesting question.
Of course, somebody could flood us anyway and we would get and handle all the events that that causes. But this case is different because we could get stuck for an unlimited amount of time without serving other sockets at all.
Right.
If that's a possibility, perhaps a limit for this loop (a maximum number of recv() tries) would be a good idea. I'm not sure how we should handle the case where we exceed the threshold.
We could fail to create the flow. That would limit the damage, but see below.
Another one, which adds some complexity, but looks more correct to me, would be to try a single recv() call, and if we get data from it, fail creating the new flow entirely.
Right. I also considered a single recvmmsg(); the difficulty with that is making suitable arrays for it, since the normal ones may be in use at this point.
For UDP, we don't need to make the buffers large enough to fit packets into them, see udp(7):
All receive operations return only one packet. When the packet is smaller than the passed buffer, only that much data is returned; when it is bigger, the packet is truncated and the MSG_TRUNC flag is set.
so probably 1024 bytes (1 * UIO_MAXIOV) on the stack would be enough... or maybe we can even pass 0 as size and NULL buffers?
We can (see doc/platform-requirements/recv-zero.c). And even if we couldn't, we could use the same 1 byte buffer for all the datagrams. We basically just need the actual msghdr array.
If recvmmsg() returns UIO_MAXIOV, then something weird is going on and we can abort the "connection" attempt.
Seems reasonable; I've coded something up.
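A bounded drain along the lines discussed might look like the sketch below. This is an illustration, not the actual patch: drain_sock(), drain_demo() and the DRAIN_MAX cap are made-up names and values (the real cap discussed is UIO_MAXIOV), and it assumes zero-length reads discard datagrams, as checked by doc/platform-requirements/recv-zero.c.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define DRAIN_MAX 64	/* illustrative cap, standing in for UIO_MAXIOV */

/* Drain queued datagrams with a single bounded recvmmsg(): zero-length
 * reads (msg_iov == NULL, msg_iovlen == 0) consume each datagram,
 * truncating it.  Return 0 if drained, -1 if a full batch was pending,
 * which the caller can treat as a flood and abort the new flow.
 */
int drain_sock(int s)
{
	struct mmsghdr mmh[DRAIN_MAX];
	int n;

	memset(mmh, 0, sizeof(mmh));

	n = recvmmsg(s, mmh, DRAIN_MAX, MSG_DONTWAIT, NULL);
	if (n == DRAIN_MAX)
		return -1;	/* something weird is going on */

	return 0;		/* n datagrams discarded, or none pending */
}

/* Demo: queue a few datagrams on a loopback socket, drain, verify empty */
int drain_demo(void)
{
	struct sockaddr_in a = { .sin_family = AF_INET };
	socklen_t al = sizeof(a);
	int r = socket(AF_INET, SOCK_DGRAM, 0);
	int t = socket(AF_INET, SOCK_DGRAM, 0);
	char c;
	int i;

	a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	assert(r >= 0 && t >= 0);
	assert(!bind(r, (struct sockaddr *)&a, sizeof(a)));
	assert(!getsockname(r, (struct sockaddr *)&a, &al));

	for (i = 0; i < 3; i++)
		assert(sendto(t, "x", 1, 0,
			      (struct sockaddr *)&a, sizeof(a)) == 1);

	assert(drain_sock(r) == 0);
	/* queue is now empty: a non-blocking recv() finds nothing */
	assert(recv(r, &c, 1, MSG_DONTWAIT) == -1);

	close(t);
	close(r);
	return 1;
}
```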
This removes the potential DoS of wedging passt completely, but raises the possibility of a different one. Could an attacker - not necessarily on the same host, but network-closer than the guest's intended endpoint - prevent the guest from connecting to a particular service by spamming fake reply packets here?
Oops, I didn't think about that, that's also concerning. With my suggestion above, 1024 packets is all an attacker would need.
But maybe there's a safe way to deal with this that's still simpler than a separate handler: using recvmmsg(), going through msg_name of the (truncated) messages we received with it, and trying to draw some conclusion about what kind of attack we're dealing with from there. I'm not sure exactly which conclusion, though.
I believe it would need to anticipate the guest's source port to do so, which might be good enough.
Linux does source port randomisation for UDP starting from 2.6.24, so that should be good enough, but given that we try to preserve the source port from the guest, we might make a guest that doesn't do port randomisation (albeit unlikely to be found) vulnerable to this.
Right. It feels like we're at least not making the guest dramatically more vulnerable to something here, so I'm inclined to treat it as good enough for now.

-- 
David Gibson (he or they)
On Fri, 5 Jul 2024 12:07:18 +1000 David Gibson wrote:
When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use.
For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c
When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry.
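The bind()-plus-connect() behaviour this relies on can be illustrated with a standalone sketch; flow_target_sock() is a made-up helper name, not the function added by the patch, and the demo only covers IPv4 on loopback.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a UDP socket for the target side of a flow: bind() pins the local
 * address replies come to, and connect() makes the kernel deliver only
 * datagrams from the flow's peer, filtering out everything else.
 */
int flow_target_sock(const struct sockaddr_in *local,
		     const struct sockaddr_in *peer)
{
	int s = socket(AF_INET, SOCK_DGRAM, 0);

	if (s < 0)
		return -1;

	if (bind(s, (const struct sockaddr *)local, sizeof(*local)) ||
	    connect(s, (const struct sockaddr *)peer, sizeof(*peer))) {
		close(s);
		return -1;
	}

	return s;
}

/* Demo: a datagram from a third party is dropped, the peer's gets through */
int flow_sock_demo(void)
{
	struct sockaddr_in lo = { .sin_family = AF_INET }, peer_a, self_a;
	socklen_t al = sizeof(peer_a);
	int peer = socket(AF_INET, SOCK_DGRAM, 0);
	int other = socket(AF_INET, SOCK_DGRAM, 0);
	int s;
	char c;

	lo.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	assert(peer >= 0 && other >= 0);
	assert(!bind(peer, (struct sockaddr *)&lo, sizeof(lo)));
	assert(!getsockname(peer, (struct sockaddr *)&peer_a, &al));

	s = flow_target_sock(&lo, &peer_a);
	assert(s >= 0);
	al = sizeof(self_a);
	assert(!getsockname(s, (struct sockaddr *)&self_a, &al));

	/* not the connected peer: rejected by the kernel */
	assert(sendto(other, "x", 1, 0,
		      (struct sockaddr *)&self_a, sizeof(self_a)) == 1);
	assert(recv(s, &c, 1, MSG_DONTWAIT) == -1);

	/* the flow's peer: delivered */
	assert(sendto(peer, "y", 1, 0,
		      (struct sockaddr *)&self_a, sizeof(self_a)) == 1);
	assert(recv(s, &c, 1, MSG_DONTWAIT) == 1);

	close(s);
	close(other);
	close(peer);
	return 1;
}
```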
For some reason, after this patch (I bisected), I'm getting an EPOLLERR loop:

pasta: epoll event on UDP socket 6 (events: 0x00000001)
Flow 0 (NEW): FREE -> NEW
Flow 0 (INI): NEW -> INI
Flow 0 (INI): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => ?
Flow 0 (TGT): INI -> TGT
Flow 0 (TGT): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:47041 -> [127.0.0.1]:10001
Flow 0 (UDP flow): TGT -> TYPED
Flow 0 (UDP flow): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:47041 -> [127.0.0.1]:10001
Flow 0 (UDP flow): Side 0 hash table insert: bucket: 97474
Flow 0 (UDP flow): TYPED -> ACTIVE
Flow 0 (UDP flow): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:47041 -> [127.0.0.1]:10001
pasta: epoll event on UDP reply socket 116 (events: 0x00000008)
pasta: epoll event on UDP reply socket 116 (events: 0x00000008)
pasta: epoll event on UDP reply socket 116 (events: 0x00000008)
[...repeated until I terminate the process]

by sending one UDP datagram from the parent namespace with no "listening" process in the namespace, using the "spliced" path, something like this:

echo a | nc -q1 -u localhost 10001

after running pasta with:

./pasta -u 10001 --trace -l /tmp/pasta.trace --log-size $((1 << 30))

I tried bigger/multiple datagrams, same result. Before this patch, I get something like this instead:

5.1018: pasta: epoll event on UDP socket 6 (events: 0x00000001)
5.1018: Flow 0 (NEW): FREE -> NEW
5.1018: Flow 0 (INI): NEW -> INI
5.1019: Flow 0 (INI): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => ?
5.1019: Flow 0 (TGT): INI -> TGT
5.1019: Flow 0 (TGT): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:41245 -> [127.0.0.1]:10001
5.1019: Flow 0 (UDP flow): TGT -> TYPED
5.1019: Flow 0 (UDP flow): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:41245 -> [127.0.0.1]:10001
5.1019: Flow 0 (UDP flow): Side 0 hash table insert: bucket: 111174
5.1019: Flow 0 (UDP flow): TYPED -> ACTIVE
5.1019: Flow 0 (UDP flow): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:41245 -> [127.0.0.1]:10001

I didn't really investigate, though.

-- Stefano
On Fri, Jul 12, 2024 at 03:34:07PM +0200, Stefano Brivio wrote:
On Fri, 5 Jul 2024 12:07:18 +1000 David Gibson wrote:
When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use.
For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c
When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry.
For some reason, after this patch (I bisected), I'm getting an EPOLLERR loop:
pasta: epoll event on UDP socket 6 (events: 0x00000001)
Flow 0 (NEW): FREE -> NEW
Flow 0 (INI): NEW -> INI
Flow 0 (INI): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => ?
Flow 0 (TGT): INI -> TGT
Flow 0 (TGT): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:47041 -> [127.0.0.1]:10001
Flow 0 (UDP flow): TGT -> TYPED
Flow 0 (UDP flow): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:47041 -> [127.0.0.1]:10001
Flow 0 (UDP flow): Side 0 hash table insert: bucket: 97474
Flow 0 (UDP flow): TYPED -> ACTIVE
Flow 0 (UDP flow): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:47041 -> [127.0.0.1]:10001
pasta: epoll event on UDP reply socket 116 (events: 0x00000008)
pasta: epoll event on UDP reply socket 116 (events: 0x00000008)
pasta: epoll event on UDP reply socket 116 (events: 0x00000008)
[...repeated until I terminate the process]
by sending one UDP datagram from the parent namespace with no "listening" process in the namespace, using the "spliced" path, something like this:
echo a | nc -q1 -u localhost 10001
after running pasta with:
./pasta -u 10001 --trace -l /tmp/pasta.trace --log-size $((1 << 30))
I tried bigger/multiple datagrams, same result.
Ouch. I see it too. I'll debug...
Before this patch, I get something like this instead:
5.1018: pasta: epoll event on UDP socket 6 (events: 0x00000001)
5.1018: Flow 0 (NEW): FREE -> NEW
5.1018: Flow 0 (INI): NEW -> INI
5.1019: Flow 0 (INI): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => ?
5.1019: Flow 0 (TGT): INI -> TGT
5.1019: Flow 0 (TGT): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:41245 -> [127.0.0.1]:10001
5.1019: Flow 0 (UDP flow): TGT -> TYPED
5.1019: Flow 0 (UDP flow): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:41245 -> [127.0.0.1]:10001
5.1019: Flow 0 (UDP flow): Side 0 hash table insert: bucket: 111174
5.1019: Flow 0 (UDP flow): TYPED -> ACTIVE
5.1019: Flow 0 (UDP flow): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:41245 -> [127.0.0.1]:10001
I didn't really investigate, though.
-- 
David Gibson (he or they)
Now that spliced datagrams are managed via the flow table, remove
UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used. With
those removed, the 'ts' field in udp_splice_port is also no longer used.
struct udp_splice_port now contains just a socket fd, so replace it with
a plain int in udp_splice_ns[] and udp_splice_init[]. The latter are still
used for tracking of automatic port forwarding.
Finally, the 'splice' field of union udp_epoll_ref is no longer used so
remove it as well.
Signed-off-by: David Gibson
On Fri, 5 Jul 2024 12:07:19 +1000 David Gibson wrote:
Now that spliced datagrams are managed via the flow table, remove UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used. With those removed, the 'ts' field in udp_splice_port is also no longer used. struct udp_splice_port now contains just a socket fd, so replace it with a plain int in udp_splice_ns[] and udp_splice_init[]. The latter are still used for tracking of automatic port forwarding.
Finally, the 'splice' field of union udp_epoll_ref is no longer used so remove it as well.
Signed-off-by: David Gibson
---
 udp.c | 65 +++++++++++++++++------------------------------------
 udp.h |  3 +--
 2 files changed, 19 insertions(+), 49 deletions(-)

diff --git a/udp.c b/udp.c
index f4c696db..a4eb6d0f 100644
--- a/udp.c
+++ b/udp.c
@@ -136,27 +136,15 @@ struct udp_tap_port {
 	time_t ts;
 };

-/**
- * struct udp_splice_port - Bound socket for spliced communication
- * @sock:	Socket bound to index port
- * @ts:		Activity timestamp
- */
-struct udp_splice_port {
-	int sock;
-	time_t ts;
-};
-
 /* Port tracking, arrays indexed by packet source port (host order) */
 static struct udp_tap_port	udp_tap_map	[IP_VERSIONS][NUM_PORTS];

 /* "Spliced" sockets indexed by bound port (host order) */
-static struct udp_splice_port	udp_splice_ns  [IP_VERSIONS][NUM_PORTS];
-static struct udp_splice_port	udp_splice_init[IP_VERSIONS][NUM_PORTS];
+static int	udp_splice_ns  [IP_VERSIONS][NUM_PORTS];
+static int	udp_splice_init[IP_VERSIONS][NUM_PORTS];

 enum udp_act_type {
 	UDP_ACT_TAP,
-	UDP_ACT_SPLICE_NS,
-	UDP_ACT_SPLICE_INIT,
 	UDP_ACT_TYPE_MAX,
 };

@@ -246,8 +234,8 @@ void udp_portmap_clear(void)

 	for (i = 0; i < NUM_PORTS; i++) {
 		udp_tap_map[V4][i].sock = udp_tap_map[V6][i].sock = -1;
-		udp_splice_ns[V4][i].sock = udp_splice_ns[V6][i].sock = -1;
-		udp_splice_init[V4][i].sock = udp_splice_init[V6][i].sock = -1;
+		udp_splice_ns[V4][i] = udp_splice_ns[V6][i] = -1;
+		udp_splice_init[V4][i] = udp_splice_init[V6][i] = -1;
 	}
 }
@@ -1050,8 +1038,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif,
 int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
 		  const void *addr, const char *ifname, in_port_t port)
 {
-	union udp_epoll_ref uref = { .splice = (c->mode == MODE_PASTA),
-				     .orig = true, .port = port };
+	union udp_epoll_ref uref = { .orig = true, .port = port };
 	int s, r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;

 	if (ns)
@@ -1067,12 +1054,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
 					 ifname, port, uref.u32);

 			udp_tap_map[V4][port].sock = s < 0 ? -1 : s;
-			udp_splice_init[V4][port].sock = s < 0 ? -1 : s;
+			udp_splice_init[V4][port] = s < 0 ? -1 : s;
 		} else {
 			r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP,
 					 &in4addr_loopback,
 					 ifname, port, uref.u32);
-			udp_splice_ns[V4][port].sock = s < 0 ? -1 : s;
+			udp_splice_ns[V4][port] = s < 0 ? -1 : s;
 		}
 	}

@@ -1084,12 +1071,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
 					 ifname, port, uref.u32);

 			udp_tap_map[V6][port].sock = s < 0 ? -1 : s;
-			udp_splice_init[V6][port].sock = s < 0 ? -1 : s;
+			udp_splice_init[V6][port] = s < 0 ? -1 : s;
 		} else {
 			r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP,
 					 &in6addr_loopback,
 					 ifname, port, uref.u32);
-			udp_splice_ns[V6][port].sock = s < 0 ? -1 : s;
+			udp_splice_ns[V6][port] = s < 0 ? -1 : s;
 		}
 	}

@@ -1130,7 +1117,6 @@ static void udp_splice_iov_init(void)
 static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type,
 			  in_port_t port, const struct timespec *now)
 {
-	struct udp_splice_port *sp;
 	struct udp_tap_port *tp;
 	int *sockp = NULL;

@@ -1143,20 +1129,6 @@ static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type,
 			tp->flags = 0;
 		}

-		break;
-	case UDP_ACT_SPLICE_INIT:
-		sp = &udp_splice_init[v6 ? V6 : V4][port];
-
-		if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT)
-			sockp = &sp->sock;
-
-		break;
-	case UDP_ACT_SPLICE_NS:
-		sp = &udp_splice_ns[v6 ? V6 : V4][port];
-
-		if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT)
-			sockp = &sp->sock;
-
 		break;
 	default:
 		return;
@@ -1184,20 +1156,19 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
 		= outbound ? c->udp.fwd_out.f.map : c->udp.fwd_in.f.map;
 	const uint8_t *rmap
 		= outbound ? c->udp.fwd_in.f.map : c->udp.fwd_out.f.map;
-	struct udp_splice_port (*socks)[NUM_PORTS]
-		= outbound ? udp_splice_ns : udp_splice_init;
+	int (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init;
Nit: this should be moved up now, before the declaration of 'fmap'.
unsigned port;
 	for (port = 0; port < NUM_PORTS; port++) {
 		if (!bitmap_isset(fmap, port)) {
-			if (socks[V4][port].sock >= 0) {
-				close(socks[V4][port].sock);
-				socks[V4][port].sock = -1;
+			if (socks[V4][port] >= 0) {
+				close(socks[V4][port]);
+				socks[V4][port] = -1;
 			}

-			if (socks[V6][port].sock >= 0) {
-				close(socks[V6][port].sock);
-				socks[V6][port].sock = -1;
+			if (socks[V6][port] >= 0) {
+				close(socks[V6][port]);
+				socks[V6][port] = -1;
 			}

 			continue;
@@ -1207,8 +1178,8 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
 		if (bitmap_isset(rmap, port))
 			continue;

-		if ((c->ifi4 && socks[V4][port].sock == -1) ||
-		    (c->ifi6 && socks[V6][port].sock == -1))
+		if ((c->ifi4 && socks[V4][port] == -1) ||
+		    (c->ifi6 && socks[V6][port] == -1))
 			udp_sock_init(c, outbound, AF_UNSPEC, NULL, NULL, port);
 	}
 }

diff --git a/udp.h b/udp.h
index db5e546e..310f42fd 100644
--- a/udp.h
+++ b/udp.h
@@ -36,8 +36,7 @@ union udp_epoll_ref {
 	struct {
 		in_port_t port;
 		uint8_t pif;
-		bool splice:1,
-		     orig:1,
+		bool orig:1,
The comment to the union should be updated, removing 'splice'. While at it, I guess you could also drop 'bound' (removed in 851723924356 ("udp: Remove the @bound field from union udp_epoll_ref"), but the comment wasn't updated). -- Stefano
On Wed, Jul 10, 2024 at 11:36:06PM +0200, Stefano Brivio wrote:
On Fri, 5 Jul 2024 12:07:19 +1000 David Gibson wrote:
Now that spliced datagrams are managed via the flow table, remove UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used. With those removed, the 'ts' field in udp_splice_port is also no longer used. struct udp_splice_port now contains just a socket fd, so replace it with a plain int in udp_splice_ns[] and udp_splice_init[]. The latter are still used for tracking of automatic port forwarding.
Finally, the 'splice' field of union udp_epoll_ref is no longer used so remove it as well.
Signed-off-by: David Gibson
---
 udp.c | 65 +++++++++++++++++------------------------------------
 udp.h |  3 +--
 2 files changed, 19 insertions(+), 49 deletions(-)

diff --git a/udp.c b/udp.c
index f4c696db..a4eb6d0f 100644
--- a/udp.c
+++ b/udp.c
@@ -136,27 +136,15 @@ struct udp_tap_port {
 	time_t ts;
 };

-/**
- * struct udp_splice_port - Bound socket for spliced communication
- * @sock:	Socket bound to index port
- * @ts:		Activity timestamp
- */
-struct udp_splice_port {
-	int sock;
-	time_t ts;
-};
-
 /* Port tracking, arrays indexed by packet source port (host order) */
 static struct udp_tap_port	udp_tap_map	[IP_VERSIONS][NUM_PORTS];

 /* "Spliced" sockets indexed by bound port (host order) */
-static struct udp_splice_port	udp_splice_ns  [IP_VERSIONS][NUM_PORTS];
-static struct udp_splice_port	udp_splice_init[IP_VERSIONS][NUM_PORTS];
+static int	udp_splice_ns  [IP_VERSIONS][NUM_PORTS];
+static int	udp_splice_init[IP_VERSIONS][NUM_PORTS];

 enum udp_act_type {
 	UDP_ACT_TAP,
-	UDP_ACT_SPLICE_NS,
-	UDP_ACT_SPLICE_INIT,
 	UDP_ACT_TYPE_MAX,
 };

@@ -246,8 +234,8 @@ void udp_portmap_clear(void)

 	for (i = 0; i < NUM_PORTS; i++) {
 		udp_tap_map[V4][i].sock = udp_tap_map[V6][i].sock = -1;
-		udp_splice_ns[V4][i].sock = udp_splice_ns[V6][i].sock = -1;
-		udp_splice_init[V4][i].sock = udp_splice_init[V6][i].sock = -1;
+		udp_splice_ns[V4][i] = udp_splice_ns[V6][i] = -1;
+		udp_splice_init[V4][i] = udp_splice_init[V6][i] = -1;
 	}
 }
@@ -1050,8 +1038,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif,
 int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
 		  const void *addr, const char *ifname, in_port_t port)
 {
-	union udp_epoll_ref uref = { .splice = (c->mode == MODE_PASTA),
-				     .orig = true, .port = port };
+	union udp_epoll_ref uref = { .orig = true, .port = port };
 	int s, r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;

 	if (ns)
@@ -1067,12 +1054,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
 					 ifname, port, uref.u32);

 			udp_tap_map[V4][port].sock = s < 0 ? -1 : s;
-			udp_splice_init[V4][port].sock = s < 0 ? -1 : s;
+			udp_splice_init[V4][port] = s < 0 ? -1 : s;
 		} else {
 			r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP,
 					 &in4addr_loopback,
 					 ifname, port, uref.u32);
-			udp_splice_ns[V4][port].sock = s < 0 ? -1 : s;
+			udp_splice_ns[V4][port] = s < 0 ? -1 : s;
 		}
 	}

@@ -1084,12 +1071,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
 					 ifname, port, uref.u32);

 			udp_tap_map[V6][port].sock = s < 0 ? -1 : s;
-			udp_splice_init[V6][port].sock = s < 0 ? -1 : s;
+			udp_splice_init[V6][port] = s < 0 ? -1 : s;
 		} else {
 			r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP,
 					 &in6addr_loopback,
 					 ifname, port, uref.u32);
-			udp_splice_ns[V6][port].sock = s < 0 ? -1 : s;
+			udp_splice_ns[V6][port] = s < 0 ? -1 : s;
 		}
 	}

@@ -1130,7 +1117,6 @@ static void udp_splice_iov_init(void)
 static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type,
 			  in_port_t port, const struct timespec *now)
 {
-	struct udp_splice_port *sp;
 	struct udp_tap_port *tp;
 	int *sockp = NULL;

@@ -1143,20 +1129,6 @@ static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type,
 			tp->flags = 0;
 		}

-		break;
-	case UDP_ACT_SPLICE_INIT:
-		sp = &udp_splice_init[v6 ? V6 : V4][port];
-
-		if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT)
-			sockp = &sp->sock;
-
-		break;
-	case UDP_ACT_SPLICE_NS:
-		sp = &udp_splice_ns[v6 ? V6 : V4][port];
-
-		if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT)
-			sockp = &sp->sock;
-
 		break;
 	default:
 		return;
@@ -1184,20 +1156,19 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
 		= outbound ? c->udp.fwd_out.f.map : c->udp.fwd_in.f.map;
 	const uint8_t *rmap
 		= outbound ? c->udp.fwd_in.f.map : c->udp.fwd_out.f.map;
-	struct udp_splice_port (*socks)[NUM_PORTS]
-		= outbound ? udp_splice_ns : udp_splice_init;
+	int (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init;
Nit: this should be moved up now, before the declaration of 'fmap'.
Done.
unsigned port;
 	for (port = 0; port < NUM_PORTS; port++) {
 		if (!bitmap_isset(fmap, port)) {
-			if (socks[V4][port].sock >= 0) {
-				close(socks[V4][port].sock);
-				socks[V4][port].sock = -1;
+			if (socks[V4][port] >= 0) {
+				close(socks[V4][port]);
+				socks[V4][port] = -1;
 			}

-			if (socks[V6][port].sock >= 0) {
-				close(socks[V6][port].sock);
-				socks[V6][port].sock = -1;
+			if (socks[V6][port] >= 0) {
+				close(socks[V6][port]);
+				socks[V6][port] = -1;
 			}

 			continue;
@@ -1207,8 +1178,8 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
 		if (bitmap_isset(rmap, port))
 			continue;

-		if ((c->ifi4 && socks[V4][port].sock == -1) ||
-		    (c->ifi6 && socks[V6][port].sock == -1))
+		if ((c->ifi4 && socks[V4][port] == -1) ||
+		    (c->ifi6 && socks[V6][port] == -1))
 			udp_sock_init(c, outbound, AF_UNSPEC, NULL, NULL, port);
 	}
 }

diff --git a/udp.h b/udp.h
index db5e546e..310f42fd 100644
--- a/udp.h
+++ b/udp.h
@@ -36,8 +36,7 @@ union udp_epoll_ref {
 	struct {
 		in_port_t port;
 		uint8_t pif;
-		bool splice:1,
-		     orig:1,
+		bool orig:1,
The comment to the union should be updated, removing 'splice'. While at it, I guess you could also drop 'bound' (removed in 851723924356 ("udp: Remove the @bound field from union udp_epoll_ref"), but the comment wasn't updated).
Done.

-- 
David Gibson (he or they)
Currently we create flows for datagrams from socket interfaces, and use
them to direct "spliced" (socket to socket) datagrams. We don't yet
match datagrams from the tap interface to existing flows, nor create new
flows for them. Add that functionality, matching datagrams from tap to
existing flows when they exist, or creating new ones.
As with spliced flows, when creating a new flow from tap to socket, we
create a new connected socket to receive reply datagrams attached to that
flow specifically. We extend udp_flow_sock_handler() to handle reply
packets bound for tap rather than another socket.
For non-obvious reasons, this caused a failure for me when running under
valgrind, because valgrind invoked rt_sigreturn which is not in our
seccomp filter. Since we already allow rt_signaction and others in the
valgrind, it seems reasonable to add rt_sigreturn as well.
Signed-off-by: David Gibson
Some nits in the commit message only:
On Fri, 5 Jul 2024 12:07:20 +1000 David Gibson wrote:
Currently we create flows for datagrams from socket interfaces, and use them to direct "spliced" (socket to socket) datagrams. We don't yet match datagrams from the tap interface to existing flows, nor create new flows for them. Add that functionality, matching datagrams from tap to existing flows when they exist, or creating new ones.
As with spliced flows, when creating a new flow from tap to socket, we create a new connected socket to receive reply datagrams attached to that flow specifically. We extend udp_flow_sock_handler() to handle reply packets bound for tap rather than another socket.
For non-obvious reasons, this caused a failure for me when running under valgrind, because valgrind invoked rt_sigreturn which is not in our seccomp filter.
That might be because the stack area needed by udp_reply_sock_handler() is now a bit bigger... or something like that, I suppose.
Since we already allow rt_signaction and others in the
rt_sigaction
valgrind, it seems reasonable to add rt_sigreturn as well.
valgrind target -- Stefano
On Wed, Jul 10, 2024 at 11:36:56PM +0200, Stefano Brivio wrote:
Some nits in the commit message only:
On Fri, 5 Jul 2024 12:07:20 +1000 David Gibson wrote:
Currently we create flows for datagrams from socket interfaces, and use them to direct "spliced" (socket to socket) datagrams. We don't yet match datagrams from the tap interface to existing flows, nor create new flows for them. Add that functionality, matching datagrams from tap to existing flows when they exist, or creating new ones.
As with spliced flows, when creating a new flow from tap to socket, we create a new connected socket to receive reply datagrams attached to that flow specifically. We extend udp_flow_sock_handler() to handle reply packets bound for tap rather than another socket.
For non-obvious reasons, this caused a failure for me when running under valgrind, because valgrind invoked rt_sigreturn which is not in our seccomp filter.
That might be because the stack area needed by udp_reply_sock_handler() is now a bit bigger... or something like that, I suppose.
Since we already allow rt_signaction and others in the
rt_sigaction
valgrind, it seems reasonable to add rt_sigreturn as well.
valgrind target
Revised accordingly.

-- 
David Gibson (he or they)
This replaces the last piece of existing UDP port tracking with the
common flow table. Specifically use the flow table to direct datagrams
from host sockets to the guest tap interface. Since this now requires
a flow for every datagram, we add some logging if we encounter any
datagrams for which we can't find or create a flow.
Signed-off-by: David Gibson
Two nits only:
On Fri, 5 Jul 2024 12:07:21 +1000 David Gibson wrote:
This replaces the last piece of existing UDP port tracking with the common flow table. Specifically use the flow table to direct datagrams from host sockets to the guest tap interface. Since this now requires a flow for every datagram, we add some logging if we encounter any datagrams for which we can't find or create a flow.
Signed-off-by: David Gibson
---
 flow_table.h |  14 ++++
 udp.c        | 188 +++++++++++++++------------------------------
 2 files changed, 67 insertions(+), 135 deletions(-)

diff --git a/flow_table.h b/flow_table.h
index 1faac4a7..da9483b3 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -106,6 +106,20 @@ static inline uint8_t pif_at_sidx(flow_sidx_t sidx)
 	return flow->f.pif[sidx.side];
 }
+/** flowside_at_sidx - Retrieve a specific flowside
flowside_at_sidx()
+ * @sidx:	Flow & side index
+ *
+ * Return: Flowside for the flow & side given by @sidx
+ */
+static inline const struct flowside *flowside_at_sidx(flow_sidx_t sidx)
+{
+	const union flow *flow = flow_at_sidx(sidx);
+
+	if (!flow)
+		return PIF_NONE;
Usual extra newline.
+	return &flow->f.side[sidx.side];
+}
I finished reviewing all the other patches, no further comments, everything else looks good to me. -- Stefano
On Wed, Jul 10, 2024 at 11:37:36PM +0200, Stefano Brivio wrote:
Two nits only:
On Fri, 5 Jul 2024 12:07:21 +1000 David Gibson wrote:
This replaces the last piece of existing UDP port tracking with the common flow table. Specifically use the flow table to direct datagrams from host sockets to the guest tap interface. Since this now requires a flow for every datagram, we add some logging if we encounter any datagrams for which we can't find or create a flow.
Signed-off-by: David Gibson
---
 flow_table.h |  14 ++++
 udp.c        | 188 +++++++++++++++------------------------------
 2 files changed, 67 insertions(+), 135 deletions(-)

diff --git a/flow_table.h b/flow_table.h
index 1faac4a7..da9483b3 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -106,6 +106,20 @@ static inline uint8_t pif_at_sidx(flow_sidx_t sidx)
 	return flow->f.pif[sidx.side];
 }
+/** flowside_at_sidx - Retrieve a specific flowside
flowside_at_sidx()
Done.
+ * @sidx:	Flow & side index
+ *
+ * Return: Flowside for the flow & side given by @sidx
+ */
+static inline const struct flowside *flowside_at_sidx(flow_sidx_t sidx)
+{
+	const union flow *flow = flow_at_sidx(sidx);
+
+	if (!flow)
+		return PIF_NONE;
Usual extra newline.
Done.
+	return &flow->f.side[sidx.side];
+}
I finished reviewing all the other patches, no further comments, everything else looks good to me.
-- 
David Gibson (he or they)
Now that UDP datagrams are all directed via the flow table, we no longer
use the udp_tap_map[] or udp_act[] arrays. Remove them and the connected
code.
Signed-off-by: David Gibson
In addition to the struct fwd_ports used by both UDP and TCP to track
port forwarding, UDP also included an 'rdelta' field, which contained the
reverse mapping of the main port map. This was used so that we could
properly direct reply packets to a forwarded packet where we change the
destination port. This has now been taken over by the flow table: reply
packets will match the flow of the originating packet, and that gives the
correct ports on the originating side.
So, eliminate the rdelta field, and with it struct udp_fwd_ports, which
now has no additional information over struct fwd_ports.
Signed-off-by: David Gibson
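The reasoning above can be illustrated with a minimal sketch: once a flow records the endpoints of both sides, the reply destination is simply read back from the initiating side, so no inverse port map is needed. The types and names below are simplified stand-ins for illustration, not the real flow table structures.

```c
#include <assert.h>
#include <stdint.h>

enum { INISIDE = 0, TGTSIDE = 1 };	/* initiating and target sides */

/* Simplified stand-in for a flowside: one side's endpoints */
struct mini_flowside {
	uint32_t eaddr;		/* endpoint (remote) address */
	uint16_t eport;		/* endpoint (remote) port */
	uint16_t oport;		/* our (local) port */
};

/* Simplified stand-in for a flow: both sides recorded */
struct mini_flow {
	struct mini_flowside side[2];
};

/* Replies to a forwarded datagram go back to the initiating endpoint
 * recorded in the flow, even when the forwarding rewrote the port; with
 * rdelta, the original port had to be recovered from an inverse map.
 */
const struct mini_flowside *reply_dest(const struct mini_flow *f)
{
	return &f->side[INISIDE];
}
```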
EPOLL_TYPE_UDP is now only used for "listening" sockets: long-lived
sockets which can initiate new flows. Rename to EPOLL_TYPE_UDP_LISTEN
and associated functions to match. Along with that, remove the .orig
field from union udp_listen_epoll_ref, since it is now always true.
Signed-off-by: David Gibson
participants (2)
- David Gibson
- Stefano Brivio