This is the seventh draft of an implementation of more general "connection" tracking, as described at: https://pad.passt.top/p/NewForwardingModel This series changes the TCP connection table and hash table into a more general flow table that can track other protocols as well. Each flow uniformly keeps track of all the relevant addresses and ports, which will allow for more robust control of NAT and port forwarding. ICMP and UDP are converted to use the new flow table. This is based on the recent series of UDP flow table preliminaries. Caveats: * We roughly double the size of a connection/flow entry * We don't yet record the local address of flows initiated from a socket, even in cases where it's bound to a specific address. Changes since v6: * Complete redesign of the UDP flow handling * Rebased (handling the change to bind() probing for local addresses was surprisingly fiddly) * Replace sockaddr_from_inany() with pif_sockaddr() which can correctly handle scope_id for different interfaces, and returns whether the address is non-trivial for convenience * Preserve specific loopback addresses in forwarding logic Changes since v5: * flowside_from_af() is now static * Small fixes to state verification * Pass protocol specific types into deferred/timer callbacks * No longer require complete forwarding address info for the hash table (we won't have it for UDP) * Fix bugs with logging of flow addresses * Make sure to initialise sin_zero field sockaddr_from_inany * Added patch better typing parameters to flow type specific callbacks * Terminology change "forwarded side" to "target side" * Assorted wording and style tweaks based on Stefano's review * Fold introduction of struct flowside and populating the initiating side together * Manage outbound addresses via the flow table as well * Support for UDP * Correct type of 'b' in flowside_lookup() (was a signed int) Changes since v4: * flowside_from_af() no longer fills in unspecified addresses when passed NULL * Split and rename flow hash lookup function * Clarified flow state transitions, and enforced where practical * Made side 0 always the initiating side of a flow, rather than letting the protocol specific code decide * Separated pifs from flowside addresses to allow better structure packing Changes since v3: * Complex rebase on top of the many things that have happened upstream since v2. * Assorted other changes. * Replace TAPFSIDE() and SOCKFSIDE() macros with local variables. Changes since v2: * Cosmetic fixes based on review * Extra doc comments for enum flow_type * Rename flowside to flowaddrs which turns out to make more sense in light of future changes * Fix bug where the socket flowaddrs for tap initiated connections wasn't initialised to match the socket address we were using in the case of map-gw NAT * New flowaddrs_from_sock() helper used in most cases which is cleaner and should avoid bugs like the above * Using newer centralised workarounds for clang-tidy issue 58992 * Remove duplicate definition of FLOW_MAX as maximum flow type and maximum number of tracked flows * Rebased on newer versions of preliminary work (ICMP, flow based dispatch and allocation, bind/address cleanups) * Unified hash table as well as base flow table * Integrated ICMP Changes since v1: * Terminology changes - "Endpoint" address/port instead of "correspondent" address/port - "flowside" instead of "demiflow" * Actually move the connection table to a new flow table structure in new files * Significant rearrangement of earlier patchs on top of that new table, to reduce churn David Gibson (27): flow: Common address information for initiating side flow: Common address information for target side tcp, flow: Remove redundant information, repack connection structures tcp: Obtain guest address from flowside tcp: Manage outbound address via flow table tcp: Simplify endpoint validation using flowside information tcp_splice: Eliminate SPLICE_V6 flag tcp, flow: Replace TCP specific hash function with general flow hash flow, tcp: Generalise TCP hash table to general flow hash table tcp: Re-use flow hash for initial sequence number generation icmp: Remove redundant id field from flow table entry icmp: Obtain destination addresses from the flowsides icmp: Look up ping flows using flow hash icmp: Eliminate icmp_id_map flow: Helper to create sockets based on flowside icmp: Manage outbound socket address via flow table flow, tcp: Flow based NAT and port forwarding for TCP flow, icmp: Use general flow forwarding rules for ICMP fwd: Update flow forwarding logic for UDP udp: Create flows for datagrams from originating sockets udp: Handle "spliced" datagrams with per-flow sockets udp: Remove obsolete splice tracking udp: Find or create flows for datagrams from tap interface udp: Direct datagrams from host to guest via flow table udp: Remove obsolete socket tracking udp: Remove rdelta port forwarding maps udp: Rename UDP listening sockets Makefile | 4 +- conf.c | 14 +- epoll_type.h | 6 +- flow.c | 481 +++++++++++++++++++++- flow.h | 47 +++ flow_table.h | 57 ++- fwd.c | 184 ++++++++- fwd.h | 9 + icmp.c | 105 ++--- icmp_flow.h | 2 - inany.h | 2 - passt.c | 10 +- passt.h | 5 +- pif.c | 45 +++ pif.h | 17 + tap.c | 11 - tap.h | 1 - tcp.c | 521 ++++++------------------ tcp_buf.c | 6 +- tcp_conn.h | 51 +-- tcp_internal.h | 10 +- tcp_splice.c | 98 +---- tcp_splice.h | 5 +- udp.c | 1055 +++++++++++++++++++----------------------------- udp.h | 33 +- udp_flow.h | 27 ++ util.c | 9 +- util.h | 3 + 28 files changed, 1549 insertions(+), 1269 deletions(-) create mode 100644 udp_flow.h -- 2.45.2
Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets). To allow more consistent handling across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host. Introduce 'struct flowside' to keep track of address and port information related to one side of a flow. Store two of these in the common fields of a flow to track that information for both sides. For now we only populate the initiating side, requiring that information be completed when a flows enter INI. Later patches will populate the target side. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++--- flow.h | 16 +++++++++ flow_table.h | 8 ++++- icmp.c | 9 +++-- passt.h | 3 ++ tcp.c | 6 ++-- 6 files changed, 127 insertions(+), 11 deletions(-) diff --git a/flow.c b/flow.c index d05aa495..44e7b3b8 100644 --- a/flow.c +++ b/flow.c @@ -108,6 +108,31 @@ static const union flow *flow_new_entry; /* = NULL */ /* Last time the flow timers ran */ static struct timespec flow_timer_run; +/** flowside_from_af() - Initialise flowside from addresses + * @fside: flowside to initialise + * @af: Address family (AF_INET or AF_INET6) + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + */ +static void flowside_from_af(struct flowside *fside, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport) +{ + if (faddr) + inany_from_af(&fside->faddr, af, faddr); + else + fside->faddr = inany_any6; + fside->fport = fport; + + if (eaddr) + inany_from_af(&fside->eaddr, af, eaddr); + else + fside->eaddr = inany_any6; + fside->eport = eport; +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority @@ -140,6 +165,8 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ static void flow_set_state(struct flow_common *f, enum flow_state state) { + char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + const struct flowside *ini = &f->side[INISIDE]; uint8_t oldstate = f->state; ASSERT(state < FLOW_NUM_STATES); @@ -150,18 +177,28 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f)); if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s => %s", pif_name(f->pif[INISIDE]), - pif_name(f->pif[TGTSIDE])); + flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport, + pif_name(f->pif[TGTSIDE])); else if (MAX(state, oldstate) >= FLOW_STATE_INI) - flow_log_(f, LOG_DEBUG, "%s => ?", pif_name(f->pif[INISIDE])); + flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport); } /** - * flow_initiate() - Move flow to INI, setting INISIDE details + * flow_initiate_() - Move flow to INI, setting pif[INISIDE] * @flow: Flow to change state * @pif: pif of the initiating side */ -void flow_initiate(union flow *flow, uint8_t pif) +static void flow_initiate_(union flow *flow, uint8_t pif) { struct flow_common *f = &flow->f; @@ -174,6 +211,55 @@ void flow_initiate(union flow *flow, uint8_t pif) flow_set_state(f, FLOW_STATE_INI); } +/** + * flow_initiate_af() - Move flow to INI, setting INISIDE details + * @flow: Flow to change state + * @pif: pif of the initiating side + * @af: Address family of @eaddr and @faddr + * @saddr: Source address (pointer to in_addr or in6_addr) + * @sport: Endpoint port + * @daddr: Destination address (pointer to in_addr or in6_addr) + * @dport: Destination port + * + * Return: pointer to the initiating flowside information + */ +const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport) +{ + struct flowside *ini = &flow->f.side[INISIDE]; + + flowside_from_af(ini, af, saddr, sport, daddr, dport); + flow_initiate_(flow, pif); + return ini; +} + +/** + * flow_initiate_sa() - Move flow to INI, setting INISIDE details + * @flow: Flow to change state + * @pif: pif of the initiating side + * @ssa: Source socket address + * @dport: Destination port + * + * Return: pointer to the initiating flowside information + */ +const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, + const union sockaddr_inany *ssa, + in_port_t dport) +{ + struct flowside *ini = &flow->f.side[INISIDE]; + + inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa); + if (inany_v4(&ini->eaddr)) + ini->faddr = inany_any4; + else + ini->faddr = inany_any6; + ini->fport = dport; + flow_initiate_(flow, pif); + return ini; +} + /** * flow_target() - Move flow to TGT, setting TGTSIDE details * @flow: Flow to change state diff --git a/flow.h b/flow.h index d1f49c65..4c3762b9 100644 --- a/flow.h +++ b/flow.h @@ -135,11 +135,26 @@ extern const uint8_t flow_proto[]; #define INISIDE 0 /* Initiating side */ #define TGTSIDE 1 /* Target side */ +/** + * struct flowside - Address information for one side of a flow + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport; + in_port_t eport; +}; + /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry * @type: Type of packet flow * @pif[]: Interface for each side of the flow + * @side[]: Information for each side of the flow */ struct flow_common { #ifdef __GNUC__ @@ -154,6 +169,7 @@ struct flow_common { "Not enough bits for type field"); #endif uint8_t pif[SIDES]; + struct flowside side[SIDES]; }; #define FLOW_INDEX_BITS 17 /* 128k - 1 */ diff --git a/flow_table.h b/flow_table.h index 226ddbdd..ad1bc787 100644 --- a/flow_table.h +++ b/flow_table.h @@ -107,7 +107,13 @@ static inline flow_sidx_t flow_sidx(const struct flow_common *f, union flow *flow_alloc(void); void flow_alloc_cancel(union flow *flow); -void flow_initiate(union flow *flow, uint8_t pif); +const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport); +const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, + const union sockaddr_inany *ssa, + in_port_t dport); void flow_target(union flow *flow, uint8_t pif); union flow *flow_set_type(union flow *flow, enum flow_type type); diff --git a/icmp.c b/icmp.c index d4ccc722..cf88ac1f 100644 --- a/icmp.c +++ b/icmp.c @@ -146,12 +146,15 @@ static void icmp_ping_close(const struct ctx *c, * @id_sock: Pointer to ping flow entry slot in icmp_id_map[] to update * @af: Address family, AF_INET or AF_INET6 * @id: ICMP id for the new socket + * @saddr: Source address + * @daddr: Destination address * * Return: Newly opened ping flow, or NULL on failure */ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, struct icmp_ping_flow **id_sock, - sa_family_t af, uint16_t id) + sa_family_t af, uint16_t id, + const void *saddr, const void *daddr) { uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6; union epoll_ref ref = { .type = EPOLL_TYPE_PING }; @@ -163,7 +166,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, if (!flow) return NULL; - flow_initiate(flow, PIF_TAP); + flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); flow_target(flow, PIF_HOST); pingf = FLOW_SET_TYPE(flow, flowtype, ping); @@ -269,7 +272,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, } if (!(pingf = *id_sock)) - if (!(pingf = icmp_ping_new(c, id_sock, af, id))) + if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) return 1; pingf->ts = now->tv_sec; diff --git a/passt.h b/passt.h index 867e77b7..0d76b498 100644 --- a/passt.h +++ b/passt.h @@ -17,6 +17,9 @@ union epoll_ref; #include "pif.h" #include "packet.h" +#include "siphash.h" +#include "ip.h" +#include "inany.h" #include "flow.h" #include "icmp.h" #include "fwd.h" diff --git a/tcp.c b/tcp.c index 75b959a2..f4c910f3 100644 --- a/tcp.c +++ b/tcp.c @@ -1620,7 +1620,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(flow = flow_alloc())) return; - flow_initiate(flow, PIF_TAP); + flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); flow_target(flow, PIF_HOST); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); @@ -2293,7 +2293,9 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) goto cancel; - flow_initiate(flow, ref.tcp_listen.pif); + /* FIXME: When listening port has a specific bound address, record that + * as the forwarding address */ + flow_initiate_sa(flow, ref.tcp_listen.pif, &sa, ref.tcp_listen.port); if (sa.sa_family == AF_INET) { const struct in_addr *addr = &sa.sa4.sin_addr; -- 2.45.2
Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect(). Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 38 ++++++++++++++++++------ flow_table.h | 5 +++- icmp.c | 3 +- inany.h | 1 - pif.c | 45 ++++++++++++++++++++++++++++ pif.h | 17 +++++++++++ tcp.c | 82 ++++++++++++++++++++++++++++------------------------ tcp_splice.c | 45 +++++++++++----------------- 8 files changed, 158 insertions(+), 78 deletions(-) diff --git a/flow.c b/flow.c index 44e7b3b8..f064fad1 100644 --- a/flow.c +++ b/flow.c @@ -165,8 +165,10 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ static void flow_set_state(struct flow_common *f, enum flow_state state) { - char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN]; + char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN]; const struct flowside *ini = &f->side[INISIDE]; + const struct flowside *tgt = &f->side[TGTSIDE]; uint8_t oldstate = f->state; ASSERT(state < FLOW_NUM_STATES); @@ -177,19 +179,24 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f)); if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + flow_log_(f, LOG_DEBUG, + "%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport, - pif_name(f->pif[TGTSIDE])); + pif_name(f->pif[TGTSIDE]), + inany_ntop(&tgt->faddr, fstr1, sizeof(fstr1)), + tgt->fport, + inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)), + tgt->eport); else if (MAX(state, oldstate) >= FLOW_STATE_INI) flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport); } @@ -261,21 +268,34 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, } /** - * flow_target() - Move flow to TGT, setting TGTSIDE details + * flow_target_af() - Move flow to TGT, setting TGTSIDE details * @flow: Flow to change state * @pif: pif of the target side + * @af: Address family for @eaddr and @faddr + * @saddr: Source address (pointer to in_addr or in6_addr) + * @sport: Endpoint port + * @daddr: Destination address (pointer to in_addr or in6_addr) + * @dport: Destination port + * + * Return: pointer to the target flowside information */ -void flow_target(union flow *flow, uint8_t pif) +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport) { struct flow_common *f = &flow->f; + struct flowside *tgt = &f->side[TGTSIDE]; ASSERT(pif != PIF_NONE); ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); ASSERT(f->type == FLOW_TYPE_NONE); ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); + flowside_from_af(tgt, af, daddr, dport, saddr, sport); f->pif[TGTSIDE] = pif; flow_set_state(f, FLOW_STATE_TGT); + return tgt; } /** diff --git a/flow_table.h b/flow_table.h index ad1bc787..00dca4b2 100644 --- a/flow_table.h +++ b/flow_table.h @@ -114,7 +114,10 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, const union sockaddr_inany *ssa, in_port_t dport); -void flow_target(union flow *flow, uint8_t pif); +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport); union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/icmp.c b/icmp.c index cf88ac1f..fd92c7da 100644 --- a/icmp.c +++ b/icmp.c @@ -167,7 +167,8 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, return NULL; flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); - flow_target(flow, PIF_HOST); + /* FIXME: Record outbound source address when known */ + flow_target_af(flow, PIF_HOST, af, NULL, 0, daddr, 0); pingf = FLOW_SET_TYPE(flow, flowtype, ping); pingf->seq = -1; diff --git a/inany.h b/inany.h index 47b66fa9..8eaf5335 100644 --- a/inany.h +++ b/inany.h @@ -187,7 +187,6 @@ static inline bool inany_is_unspecified(const union inany_addr *a) * * Return: true if @a is in fe80::/10 (IPv6 link local unicast) */ -/* cppcheck-suppress unusedFunction */ static inline bool inany_is_linklocal6(const union inany_addr *a) { return IN6_IS_ADDR_LINKLOCAL(&a->a6); diff --git a/pif.c b/pif.c index ebf01cc8..9f2d39cc 100644 --- a/pif.c +++ b/pif.c @@ -7,9 +7,14 @@ #include <stdint.h> #include <assert.h> +#include <netinet/in.h> #include "util.h" #include "pif.h" +#include "siphash.h" +#include "ip.h" +#include "inany.h" +#include "passt.h" const char *pif_type_str[] = { [PIF_NONE] = "<none>", @@ -19,3 +24,43 @@ const char *pif_type_str[] = { }; static_assert(ARRAY_SIZE(pif_type_str) == PIF_NUM_TYPES, "pif_type_str[] doesn't match enum pif_type"); + + +/** pif_sockaddr() - Construct a socket address suitable for an interface + * @c: Execution context + * @sa: Pointer to sockaddr to fill in + * @sl: Updated to relevant of length of initialised @sa + * @pif: Interface to create the socket address + * @addr: IPv[46] address + * @port: Port (host byte order) + * + * Return: true if resulting socket address is non-trivial (specified address or + * non-zero port), false otherwise + */ +bool pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl, + uint8_t pif, const union inany_addr *addr, in_port_t port) +{ + const struct in_addr *v4 = inany_v4(addr); + + ASSERT(pif_is_socket(pif)); + + if (v4) { + sa->sa_family = AF_INET; + sa->sa4.sin_addr = *v4; + sa->sa4.sin_port = htons(port); + memset(&sa->sa4.sin_zero, 0, sizeof(sa->sa4.sin_zero)); + *sl = sizeof(sa->sa4); + return !IN4_IS_ADDR_UNSPECIFIED(v4) || port; + } + + sa->sa_family = AF_INET6; + sa->sa6.sin6_addr = addr->a6; + sa->sa6.sin6_port = htons(port); + if (pif == PIF_HOST && IN6_IS_ADDR_LINKLOCAL(&addr->a6)) + sa->sa6.sin6_scope_id = c->ifi6; + else + sa->sa6.sin6_scope_id = 0; + sa->sa6.sin6_flowinfo = 0; + *sl = sizeof(sa->sa6); + return !IN6_IS_ADDR_UNSPECIFIED(&addr->a6) || port; +} diff --git a/pif.h b/pif.h index ca85b349..0e97afa1 100644 --- a/pif.h +++ b/pif.h @@ -7,6 +7,9 @@ #ifndef PIF_H #define PIF_H +union inany_addr; +union sockaddr_inany; + /** * enum pif_type - Type of passt/pasta interface ("pif") * @@ -43,4 +46,18 @@ static inline const char *pif_name(uint8_t pif) return pif_type(pif); } +/** + * pif_is_socket() - Is interface implemented via L4 sockets? + * @pif: pif to check + * + * Return: true of @pif is an L4 socket based interface, otherwise false + */ +static inline bool pif_is_socket(uint8_t pif) +{ + return pif == PIF_HOST || pif == PIF_SPLICE; +} + +bool pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl, + uint8_t pif, const union inany_addr *addr, in_port_t port); + #endif /* PIF_H */ diff --git a/tcp.c b/tcp.c index f4c910f3..46f6c145 100644 --- a/tcp.c +++ b/tcp.c @@ -1601,18 +1601,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(dstport), - .sin_addr = *(struct in_addr *)daddr, - }; - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(dstport), - .sin6_addr = *(struct in6_addr *)daddr, - }; - const struct sockaddr *sa; + const struct flowside *ini, *tgt; struct tcp_tap_conn *conn; + union inany_addr dstaddr; /* FIXME: Avoid bulky temporary */ + union sockaddr_inany sa; union flow *flow; int s = -1, mss; socklen_t sl; @@ -1620,9 +1612,22 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(flow = flow_alloc())) return; - flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); + ini = flow_initiate_af(flow, PIF_TAP, + af, saddr, srcport, daddr, dstport); + + dstaddr = ini->faddr; + if (!c->no_map_gw) { + if (inany_equals4(&dstaddr, &c->ip4.gw)) + dstaddr = inany_loopback4; + else if (inany_equals6(&dstaddr, &c->ip6.gw)) + dstaddr = inany_loopback6; + + } - flow_target(flow, PIF_HOST); + /* FIXME: Record outbound source address when known */ + tgt = flow_target_af(flow, PIF_HOST, AF_INET6, + NULL, 0, /* Kernel decides source address */ + &dstaddr, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); if (af == AF_INET) { @@ -1641,9 +1646,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, dstport); goto cancel; } - - sa = (struct sockaddr *)&addr4; - sl = sizeof(addr4); } else if (af == AF_INET6) { if (IN6_IS_ADDR_UNSPECIFIED(saddr) || IN6_IS_ADDR_MULTICAST(saddr) || srcport == 0 || @@ -1658,9 +1660,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, dstport); goto cancel; } - - sa = (struct sockaddr *)&addr6; - sl = sizeof(addr6); } else { ASSERT(0); } @@ -1668,12 +1667,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if ((s = tcp_conn_sock(c, af)) < 0) goto cancel; - if (!c->no_map_gw) { - if (af == AF_INET && IN4_ARE_ADDR_EQUAL(daddr, &c->ip4.gw)) - addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK); - if (af == AF_INET6 && IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw)) - addr6.sin6_addr = in6addr_loopback; - } + pif_sockaddr(c, &sa, &sl, PIF_HOST, &tgt->eaddr, tgt->eport); /* Use bind() to check if the target address is local (EADDRINUSE or * similar) and already bound, and set the LOCAL flag in that case. @@ -1685,7 +1679,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, * * So, if bind() succeeds, close the socket, get a new one, and proceed. */ - if (bind(s, sa, sl)) { + if (bind(s, &sa.sa, sl)) { if (errno != EADDRNOTAVAIL && errno != EACCES) conn_flag(c, conn, LOCAL); } else { @@ -1695,7 +1689,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, goto cancel; } - if (af == AF_INET6 && IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr)) { + if (inany_is_linklocal6(&tgt->eaddr)) { struct sockaddr_in6 addr6_ll = { .sin6_family = AF_INET6, .sin6_addr = c->ip6.addr_ll, @@ -1703,6 +1697,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, }; if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) goto cancel; + } else if (!inany_is_loopback(&tgt->eaddr)) { + tcp_bind_outbound(c, s, af); } conn->sock = s; @@ -1738,12 +1734,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, tcp_hash_insert(c, conn); - if ((af == AF_INET && !IN4_IS_ADDR_LOOPBACK(&addr4.sin_addr)) || - (af == AF_INET6 && !IN6_IS_ADDR_LOOPBACK(&addr6.sin6_addr) && - !IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr))) - tcp_bind_outbound(c, s, af); - - if (connect(s, sa, sl)) { + if (connect(s, &sa.sa, sl)) { if (errno != EINPROGRESS) { tcp_rst(c, conn); goto cancel; @@ -2241,9 +2232,25 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, const union sockaddr_inany *sa, const struct timespec *now) { + union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */ struct tcp_tap_conn *conn; + in_port_t srcport; + + inany_from_sockaddr(&saddr, &srcport, sa); + tcp_snat_inbound(c, &saddr); - flow_target(flow, PIF_TAP); + if (inany_v4(&saddr)) { + daddr = inany_from_v4(c->ip4.addr_seen); + } else { + if (inany_is_linklocal6(&saddr)) + daddr.a6 = c->ip6.addr_ll_seen; + else + daddr.a6 = c->ip6.addr_seen; + } + dstport += c->tcp.fwd_in.delta[dstport]; + + flow_target_af(flow, PIF_TAP, AF_INET6, + &saddr, srcport, &daddr, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); conn->sock = s; @@ -2251,10 +2258,9 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = dstport + c->tcp.fwd_in.delta[dstport]; - - tcp_snat_inbound(c, &conn->faddr); + conn->faddr = saddr; + conn->fport = srcport; + conn->eport = dstport; tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_splice.c b/tcp_splice.c index f2d4fc60..83fc7ba4 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -320,31 +320,20 @@ static int tcp_splice_connect_finish(const struct ctx *c, * tcp_splice_connect() - Create and connect socket for new spliced connection * @c: Execution context * @conn: Connection pointer - * @af: Address family - * @pif: pif on which to create socket - * @port: Destination port, host order * * Return: 0 for connect() succeeded or in progress, negative value on error */ -static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, - sa_family_t af, uint8_t pif, in_port_t port) +static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(port), - .sin6_addr = IN6ADDR_LOOPBACK_INIT, - }; - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(port), - .sin_addr = IN4ADDR_LOOPBACK_INIT, - }; - const struct sockaddr *sa; + const struct flowside *tgt = &conn->f.side[TGTSIDE]; + sa_family_t af = inany_v4(&tgt->eaddr) ? AF_INET : AF_INET6; + uint8_t tgtpif = conn->f.pif[TGTSIDE]; + union sockaddr_inany sa; socklen_t sl; - if (pif == PIF_HOST) + if (tgtpif == PIF_HOST) conn->s[1] = tcp_conn_sock(c, af); - else if (pif == PIF_SPLICE) + else if (tgtpif == PIF_SPLICE) conn->s[1] = tcp_conn_sock_ns(c, af); else ASSERT(0); @@ -358,15 +347,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, conn->s[1]); } - if (CONN_V6(conn)) { - sa = (struct sockaddr *)&addr6; - sl = sizeof(addr6); - } else { - sa = (struct sockaddr *)&addr4; - sl = sizeof(addr4); - } + pif_sockaddr(c, &sa, &sl, tgtpif, &tgt->eaddr, tgt->eport); - if (connect(conn->s[1], sa, sl)) { + if (connect(conn->s[1], &sa.sa, sl)) { if (errno != EINPROGRESS) { flow_trace(conn, "Couldn't connect socket for splice: %s", strerror(errno)); @@ -471,7 +454,13 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, return false; } - flow_target(flow, tgtpif); + /* FIXME: Record outbound source address when known */ + if (af == AF_INET) + flow_target_af(flow, tgtpif, AF_INET, + NULL, 0, &in4addr_loopback, dstport); + else + flow_target_af(flow, tgtpif, AF_INET6, + NULL, 0, &in6addr_loopback, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice); conn->flags = af == AF_INET ? 0 : SPLICE_V6; @@ -483,7 +472,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int))) flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0); - if (tcp_splice_connect(c, conn, af, tgtpif, dstport)) + if (tcp_splice_connect(c, conn)) conn_flag(c, conn, CLOSING); FLOW_ACTIVATE(conn); -- 2.45.2
Two minor details: On Fri, 5 Jul 2024 12:06:59 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect(). Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 38 ++++++++++++++++++------ flow_table.h | 5 +++- icmp.c | 3 +- inany.h | 1 - pif.c | 45 ++++++++++++++++++++++++++++ pif.h | 17 +++++++++++ tcp.c | 82 ++++++++++++++++++++++++++++------------------------ tcp_splice.c | 45 +++++++++++----------------- 8 files changed, 158 insertions(+), 78 deletions(-) diff --git a/flow.c b/flow.c index 44e7b3b8..f064fad1 100644 --- a/flow.c +++ b/flow.c @@ -165,8 +165,10 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ static void flow_set_state(struct flow_common *f, enum flow_state state) { - char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN]; + char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN]; const struct flowside *ini = &f->side[INISIDE]; + const struct flowside *tgt = &f->side[TGTSIDE]; uint8_t oldstate = f->state; ASSERT(state < FLOW_NUM_STATES); @@ -177,19 +179,24 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f)); if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + flow_log_(f, LOG_DEBUG, + "%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport, - pif_name(f->pif[TGTSIDE])); + pif_name(f->pif[TGTSIDE]), + inany_ntop(&tgt->faddr, fstr1, sizeof(fstr1)), + tgt->fport, + inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)), + tgt->eport); else if (MAX(state, oldstate) >= FLOW_STATE_INI) flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport); } @@ -261,21 +268,34 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, } /** - * flow_target() - Move flow to TGT, setting TGTSIDE details + * flow_target_af() - Move flow to TGT, setting TGTSIDE details * @flow: Flow to change state * @pif: pif of the target side + * @af: Address family for @eaddr and @faddr + * @saddr: Source address (pointer to in_addr or in6_addr) + * @sport: Endpoint port + * @daddr: Destination address (pointer to in_addr or in6_addr) + * @dport: Destination port + * + * Return: pointer to the target flowside information */ -void flow_target(union flow *flow, uint8_t pif) +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport) { struct flow_common *f = &flow->f; + struct flowside *tgt = &f->side[TGTSIDE]; ASSERT(pif != PIF_NONE); ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); ASSERT(f->type == FLOW_TYPE_NONE); ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); + flowside_from_af(tgt, af, daddr, dport, saddr, sport); f->pif[TGTSIDE] = pif; flow_set_state(f, FLOW_STATE_TGT); + return tgt; } /** diff --git a/flow_table.h b/flow_table.h index ad1bc787..00dca4b2 100644 --- a/flow_table.h +++ b/flow_table.h @@ -114,7 +114,10 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, const union sockaddr_inany *ssa, in_port_t dport); -void flow_target(union flow *flow, uint8_t pif); +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport); union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/icmp.c b/icmp.c index cf88ac1f..fd92c7da 100644 --- a/icmp.c +++ b/icmp.c @@ -167,7 +167,8 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, return NULL; flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); - flow_target(flow, PIF_HOST); + /* FIXME: Record outbound source address when known */ + flow_target_af(flow, PIF_HOST, af, NULL, 0, daddr, 0); pingf = FLOW_SET_TYPE(flow, flowtype, ping); pingf->seq = -1; diff --git a/inany.h b/inany.h index 47b66fa9..8eaf5335 100644 --- a/inany.h +++ b/inany.h @@ -187,7 +187,6 @@ static inline bool inany_is_unspecified(const union inany_addr *a) * * Return: true if @a is in fe80::/10 (IPv6 link local unicast) */ -/* cppcheck-suppress unusedFunction */ static inline bool inany_is_linklocal6(const union inany_addr *a) { return IN6_IS_ADDR_LINKLOCAL(&a->a6); diff --git a/pif.c b/pif.c index ebf01cc8..9f2d39cc 100644 --- a/pif.c +++ b/pif.c @@ -7,9 +7,14 @@ #include <stdint.h> #include <assert.h> +#include <netinet/in.h> #include "util.h" #include "pif.h" +#include "siphash.h" +#include "ip.h" +#include "inany.h" +#include "passt.h" const char *pif_type_str[] = { [PIF_NONE] = "<none>", @@ -19,3 +24,43 @@ const char *pif_type_str[] = { }; static_assert(ARRAY_SIZE(pif_type_str) == PIF_NUM_TYPES, "pif_type_str[] doesn't match enum pif_type"); + + +/** pif_sockaddr() - Construct a socket address suitable for an interface + * @c: Execution context + * @sa: Pointer to sockaddr to fill in + * @sl: Updated to relevant of length of initialised @sato relevant length+ * @pif: Interface to create the socket address + * @addr: IPv[46] address + * @port: Port (host byte order) + * + * Return: true if resulting socket address is non-trivial (specified address or + * non-zero port), false otherwiseThis is not really intuitive in the only caller using this, tcp_bind_outbound(). I wonder if it would make more sense to perform this check directly there, and have this returning void instead.+ */ +bool pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl, + uint8_t pif, const union inany_addr *addr, in_port_t port) +{ + const struct in_addr *v4 = inany_v4(addr); + + ASSERT(pif_is_socket(pif)); + + if (v4) { + sa->sa_family = AF_INET; + sa->sa4.sin_addr = *v4; + sa->sa4.sin_port = htons(port); + memset(&sa->sa4.sin_zero, 0, sizeof(sa->sa4.sin_zero)); + *sl = sizeof(sa->sa4); + return !IN4_IS_ADDR_UNSPECIFIED(v4) || port; + } + + sa->sa_family = AF_INET6; + sa->sa6.sin6_addr = addr->a6; + sa->sa6.sin6_port = htons(port); + if (pif == PIF_HOST && IN6_IS_ADDR_LINKLOCAL(&addr->a6)) + sa->sa6.sin6_scope_id = c->ifi6; + else + sa->sa6.sin6_scope_id = 0; + sa->sa6.sin6_flowinfo = 0; + *sl = sizeof(sa->sa6); + return !IN6_IS_ADDR_UNSPECIFIED(&addr->a6) || port; +}-- Stefano
On Wed, Jul 10, 2024 at 11:30:38PM +0200, Stefano Brivio wrote:Two minor details: On Fri, 5 Jul 2024 12:06:59 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Done.Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect(). Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 38 ++++++++++++++++++------ flow_table.h | 5 +++- icmp.c | 3 +- inany.h | 1 - pif.c | 45 ++++++++++++++++++++++++++++ pif.h | 17 +++++++++++ tcp.c | 82 ++++++++++++++++++++++++++++------------------------ tcp_splice.c | 45 +++++++++++----------------- 8 files changed, 158 insertions(+), 78 deletions(-) diff --git a/flow.c b/flow.c index 44e7b3b8..f064fad1 100644 --- a/flow.c +++ b/flow.c @@ -165,8 +165,10 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ static void flow_set_state(struct flow_common *f, enum flow_state state) { - char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN]; + char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN]; const struct flowside *ini = &f->side[INISIDE]; + const struct flowside *tgt = &f->side[TGTSIDE]; uint8_t oldstate = f->state; ASSERT(state < FLOW_NUM_STATES); @@ -177,19 +179,24 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f)); if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + flow_log_(f, LOG_DEBUG, + "%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport, - pif_name(f->pif[TGTSIDE])); + pif_name(f->pif[TGTSIDE]), + inany_ntop(&tgt->faddr, fstr1, sizeof(fstr1)), + tgt->fport, + inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)), + tgt->eport); else if (MAX(state, oldstate) >= FLOW_STATE_INI) flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport); } @@ -261,21 +268,34 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, } /** - * flow_target() - Move flow to TGT, setting TGTSIDE details + * flow_target_af() - Move flow to TGT, setting TGTSIDE details * @flow: Flow to change state * @pif: pif of the target side + * @af: Address family for @eaddr and @faddr + * @saddr: Source address (pointer to in_addr or in6_addr) + * @sport: Endpoint port + * @daddr: Destination address (pointer to in_addr or in6_addr) + * @dport: Destination port + * + * Return: pointer to the target flowside information */ -void flow_target(union flow *flow, uint8_t pif) +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport) { struct flow_common *f = &flow->f; + struct flowside *tgt = &f->side[TGTSIDE]; ASSERT(pif != PIF_NONE); ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); ASSERT(f->type == FLOW_TYPE_NONE); ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); + flowside_from_af(tgt, af, daddr, dport, saddr, sport); f->pif[TGTSIDE] = pif; flow_set_state(f, FLOW_STATE_TGT); + return tgt; } /** diff --git a/flow_table.h b/flow_table.h index ad1bc787..00dca4b2 100644 --- a/flow_table.h +++ b/flow_table.h @@ -114,7 +114,10 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, const union sockaddr_inany *ssa, in_port_t dport); -void flow_target(union flow *flow, uint8_t pif); +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport); union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/icmp.c b/icmp.c index cf88ac1f..fd92c7da 100644 --- a/icmp.c +++ b/icmp.c @@ -167,7 +167,8 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, return NULL; flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); - flow_target(flow, PIF_HOST); + /* FIXME: Record outbound source address when known */ + flow_target_af(flow, PIF_HOST, af, NULL, 0, daddr, 0); pingf = FLOW_SET_TYPE(flow, flowtype, ping); pingf->seq = -1; diff --git a/inany.h b/inany.h index 47b66fa9..8eaf5335 100644 --- a/inany.h +++ b/inany.h @@ -187,7 +187,6 @@ static inline bool inany_is_unspecified(const union inany_addr *a) * * Return: true if @a is in fe80::/10 (IPv6 link local unicast) */ -/* cppcheck-suppress unusedFunction */ static inline bool inany_is_linklocal6(const union inany_addr *a) { return IN6_IS_ADDR_LINKLOCAL(&a->a6); diff --git a/pif.c b/pif.c index ebf01cc8..9f2d39cc 100644 --- a/pif.c +++ b/pif.c @@ -7,9 +7,14 @@ #include <stdint.h> #include <assert.h> +#include <netinet/in.h> #include "util.h" #include "pif.h" +#include "siphash.h" +#include "ip.h" +#include "inany.h" +#include "passt.h" const char *pif_type_str[] = { [PIF_NONE] = "<none>", @@ -19,3 +24,43 @@ const char *pif_type_str[] = { }; static_assert(ARRAY_SIZE(pif_type_str) == PIF_NUM_TYPES, "pif_type_str[] doesn't match enum pif_type"); + + +/** pif_sockaddr() - Construct a socket address suitable for an interface + * @c: Execution context + * @sa: Pointer to sockaddr to fill in + * @sl: Updated to relevant of length of initialised @sato relevant lengthYeah, done. When I implemented the return value I thought I was going to want it in more places than turned out to be the case.+ * @pif: Interface to create the socket address + * @addr: IPv[46] address + * @port: Port (host byte order) + * + * Return: true if resulting socket address is non-trivial (specified address or + * non-zero port), false otherwiseThis is not really intuitive in the only caller using this, tcp_bind_outbound(). I wonder if it would make more sense to perform this check directly there, and have this returning void instead.-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson+ */ +bool pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl, + uint8_t pif, const union inany_addr *addr, in_port_t port) +{ + const struct in_addr *v4 = inany_v4(addr); + + ASSERT(pif_is_socket(pif)); + + if (v4) { + sa->sa_family = AF_INET; + sa->sa4.sin_addr = *v4; + sa->sa4.sin_port = htons(port); + memset(&sa->sa4.sin_zero, 0, sizeof(sa->sa4.sin_zero)); + *sl = sizeof(sa->sa4); + return !IN4_IS_ADDR_UNSPECIFIED(v4) || port; + } + + sa->sa_family = AF_INET6; + sa->sa6.sin6_addr = addr->a6; + sa->sa6.sin6_port = htons(port); + if (pif == PIF_HOST && IN6_IS_ADDR_LINKLOCAL(&addr->a6)) + sa->sa6.sin6_scope_id = c->ifi6; + else + sa->sa6.sin6_scope_id = 0; + sa->sa6.sin6_flowinfo = 0; + *sl = sizeof(sa->sa6); + return !IN6_IS_ADDR_UNSPECIFIED(&addr->a6) || port; +}
Some information we explicitly store in the TCP connection is now duplicated in the common flow structure. Access it from there instead, and remove it from the TCP specific structure. With that done we can reorder both the "tap" and "splice" TCP structures a bit to get better packing for the new combined flow table entries. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 52 ++++++++++++++++++++++++++------------------------ tcp_conn.h | 40 +++++++++++++++----------------------- tcp_internal.h | 6 +++++- 3 files changed, 47 insertions(+), 51 deletions(-) diff --git a/tcp.c b/tcp.c index 46f6c145..727bbd93 100644 --- a/tcp.c +++ b/tcp.c @@ -333,8 +333,6 @@ #define ACK_IF_NEEDED 0 /* See tcp_send_flag() */ -#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) - #define CONN_IS_CLOSING(conn) \ (((conn)->events & ESTABLISHED) && \ ((conn)->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD))) @@ -635,10 +633,11 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, */ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) { + const struct flowside *tapside = TAPFLOW(conn); int i; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return 1; return 0; @@ -653,6 +652,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, const struct tcp_info *tinfo) { #ifdef HAS_MIN_RTT + const struct flowside *tapside = TAPFLOW(conn); int i, hole = -1; if (!tinfo->tcpi_min_rtt || @@ -660,7 +660,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -672,7 +672,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return; - low_rtt_dst[hole++] = conn->faddr; + low_rtt_dst[hole++] = tapside->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -827,8 +827,10 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->faddr, faddr) && - conn->eport == eport && conn->fport == fport) + const struct flowside *tapside = TAPFLOW(conn); + + if (inany_equals(&tapside->faddr, faddr) && + tapside->eport == eport && tapside->fport == fport) return 1; return 0; @@ -862,7 +864,10 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, static uint64_t tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + const struct flowside *tapside = TAPFLOW(conn); + + return tcp_hash(c, &tapside->faddr, tapside->eport, + tapside->fport); } /** @@ -997,10 +1002,12 @@ void tcp_defer_handler(struct ctx *c) * @seq: Sequence number */ static void tcp_fill_header(struct tcphdr *th, - const struct tcp_tap_conn *conn, uint32_t seq) + const struct tcp_tap_conn *conn, uint32_t seq) { - th->source = htons(conn->fport); - th->dest = htons(conn->eport); + const struct flowside *tapside = TAPFLOW(conn); + + th->source = htons(tapside->fport); + th->dest = htons(tapside->eport); th->seq = htonl(seq); th->ack_seq = htonl(conn->seq_ack_to_tap); if (conn->events & ESTABLISHED) { @@ -1032,7 +1039,8 @@ static size_t tcp_fill_headers4(const struct ctx *c, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = TAPFLOW(conn); + const struct in_addr *a4 = inany_v4(&tapside->faddr); size_t l4len = dlen + sizeof(*th); size_t l3len = l4len + sizeof(*iph); @@ -1074,10 +1082,11 @@ static size_t tcp_fill_headers6(const struct ctx *c, struct ipv6hdr *ip6h, struct tcphdr *th, size_t dlen, uint32_t seq) { + const struct flowside *tapside = TAPFLOW(conn); size_t l4len = dlen + sizeof(*th); ip6h->payload_len = htons(l4len); - ip6h->saddr = conn->faddr.a6; + ip6h->saddr = tapside->faddr.a6; if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr)) ip6h->daddr = c->ip6.addr_ll_seen; else @@ -1116,7 +1125,8 @@ size_t tcp_l2_buf_fill_headers(const struct ctx *c, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = TAPFLOW(conn); + const struct in_addr *a4 = inany_v4(&tapside->faddr); if (a4) { return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base, @@ -1419,6 +1429,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, const struct timespec *now) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); + const struct flowside *tapside = TAPFLOW(conn); union inany_addr aany; uint64_t hash; uint32_t ns; @@ -1428,10 +1439,10 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, else inany_from_af(&aany, AF_INET6, &c->ip6.addr); - inany_siphash_feed(&state, &conn->faddr); + inany_siphash_feed(&state, &tapside->faddr); inany_siphash_feed(&state, &aany); hash = siphash_final(&state, 36, - (uint64_t)conn->fport << 16 | conn->eport); + (uint64_t)tapside->fport << 16 | tapside->eport); /* 32ns ticks, overflows 32 bits every 137s */ ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; @@ -1720,11 +1731,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1; - inany_from_af(&conn->faddr, af, daddr); - - conn->fport = dstport; - conn->eport = srcport; - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2258,10 +2264,6 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - conn->faddr = saddr; - conn->fport = srcport; - conn->eport = dstport; - tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_conn.h b/tcp_conn.h index 5f8c8fb6..3655f871 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -13,19 +13,16 @@ * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced) * @f: Generic flow information * @in_epoll: Is the connection in the epoll set? + * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT + * @ws_from_tap: Window scaling factor advertised from tap/guest + * @ws_to_tap: Window scaling factor advertised to tap/guest * @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS * @sock: Socket descriptor number * @events: Connection events, implying connection states * @timer: timerfd descriptor for timeout events * @flags: Connection flags representing internal attributes - * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT - * @ws_from_tap: Window scaling factor advertised from tap/guest - * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -49,6 +46,10 @@ struct tcp_tap_conn { unsigned int ws_from_tap :TCP_WS_BITS; unsigned int ws_to_tap :TCP_WS_BITS; +#define TCP_MSS_BITS 14 + unsigned int tap_mss :TCP_MSS_BITS; +#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS))) +#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS)) int sock :FD_REF_BITS; @@ -77,13 +78,6 @@ struct tcp_tap_conn { #define ACK_TO_TAP_DUE BIT(3) #define ACK_FROM_TAP_DUE BIT(4) - -#define TCP_MSS_BITS 14 - unsigned int tap_mss :TCP_MSS_BITS; -#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS))) -#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS)) - - #define SNDBUF_BITS 24 unsigned int sndbuf :SNDBUF_BITS; #define SNDBUF_SET(conn, bytes) (conn->sndbuf = ((bytes) >> (32 - SNDBUF_BITS))) @@ -91,11 +85,6 @@ struct tcp_tap_conn { uint8_t seq_dup_ack_approx; - - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap; @@ -109,22 +98,24 @@ struct tcp_tap_conn { /** * struct tcp_splice_conn - Descriptor for a spliced TCP connection * @f: Generic flow information - * @in_epoll: Is the connection in the epoll set? * @s: File descriptor for sockets * @pipe: File descriptors for pipes - * @events: Events observed/actions performed on connection - * @flags: Connection flags (attributes, not events) * @read: Bytes read (not fully written to other side in one shot) * @written: Bytes written (not fully written from one other side read) -*/ + * @events: Events observed/actions performed on connection + * @flags: Connection flags (attributes, not events) + * @in_epoll: Is the connection in the epoll set? + */ struct tcp_splice_conn { /* Must be first element */ struct flow_common f; - bool in_epoll :1; int s[SIDES]; int pipe[SIDES][2]; + uint32_t read[SIDES]; + uint32_t written[SIDES]; + uint8_t events; #define SPLICE_CLOSED 0 #define SPLICE_CONNECT BIT(0) @@ -144,8 +135,7 @@ struct tcp_splice_conn { #define RCVLOWAT_ACT_1 BIT(4) #define CLOSING BIT(5) - uint32_t read[SIDES]; - uint32_t written[SIDES]; + bool in_epoll :1; }; /* Socket pools */ diff --git a/tcp_internal.h b/tcp_internal.h index 51aaa169..4f61e5c3 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -39,7 +39,11 @@ #define OPT_SACKP 4 #define OPT_SACK 5 #define OPT_TS 8 -#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) + +#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) +#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)])) + +#define CONN_V4(conn) (!!inany_v4(&TAPFLOW(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) /* -- 2.45.2
Currently we always deliver inbound TCP packets to the guest's most recent observed IP address. This has the odd side effect that if the guest changes its IP address with active TCP connections we might deliver packets from old connections to the new address. That won't work; it will probably result in an RST from the guest. Worse, if the guest added a new address but also retains the old one, then we could break those old connections by redirecting them to the new address. Now that we maintain flowside information, we have a record of the correct guest side address and can just use it. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 41 +++++++++++++---------------------------- tcp_buf.c | 6 +++--- tcp_internal.h | 3 +-- 3 files changed, 17 insertions(+), 33 deletions(-) diff --git a/tcp.c b/tcp.c index 727bbd93..b03028ad 100644 --- a/tcp.c +++ b/tcp.c @@ -1021,7 +1021,6 @@ static void tcp_fill_header(struct tcphdr *th, /** * tcp_fill_headers4() - Fill 802.3, IPv4, TCP headers in pre-cooked buffers - * @c: Execution context * @conn: Connection pointer * @taph: tap backend specific header * @iph: Pointer to IPv4 header @@ -1032,27 +1031,26 @@ static void tcp_fill_header(struct tcphdr *th, * * Return: The IPv4 payload length, host order */ -static size_t tcp_fill_headers4(const struct ctx *c, - const struct tcp_tap_conn *conn, +static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn, struct tap_hdr *taph, struct iphdr *iph, struct tcphdr *th, size_t dlen, const uint16_t *check, uint32_t seq) { const struct flowside *tapside = TAPFLOW(conn); - const struct in_addr *a4 = inany_v4(&tapside->faddr); + const struct in_addr *src4 = inany_v4(&tapside->faddr); + const struct in_addr *dst4 = inany_v4(&tapside->eaddr); size_t l4len = dlen + sizeof(*th); size_t l3len = l4len + sizeof(*iph); - ASSERT(a4); + ASSERT(src4 && dst4); iph->tot_len = htons(l3len); - iph->saddr = a4->s_addr; - iph->daddr = c->ip4.addr_seen.s_addr; + iph->saddr = src4->s_addr; + iph->daddr = dst4->s_addr; iph->check = check ? *check : - csum_ip4_header(l3len, IPPROTO_TCP, - *a4, c->ip4.addr_seen); + csum_ip4_header(l3len, IPPROTO_TCP, *src4, *dst4); tcp_fill_header(th, conn, seq); @@ -1065,7 +1063,6 @@ static size_t tcp_fill_headers4(const struct ctx *c, /** * tcp_fill_headers6() - Fill 802.3, IPv6, TCP headers in pre-cooked buffers - * @c: Execution context * @conn: Connection pointer * @taph: tap backend specific header * @ip6h: Pointer to IPv6 header @@ -1076,8 +1073,7 @@ static size_t tcp_fill_headers4(const struct ctx *c, * * Return: The IPv6 payload length, host order */ -static size_t tcp_fill_headers6(const struct ctx *c, - const struct tcp_tap_conn *conn, +static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn, struct tap_hdr *taph, struct ipv6hdr *ip6h, struct tcphdr *th, size_t dlen, uint32_t seq) @@ -1087,10 +1083,7 @@ static size_t tcp_fill_headers6(const struct ctx *c, ip6h->payload_len = htons(l4len); ip6h->saddr = tapside->faddr.a6; - if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr)) - ip6h->daddr = c->ip6.addr_ll_seen; - else - ip6h->daddr = c->ip6.addr_seen; + ip6h->daddr = tapside->eaddr.a6; ip6h->hop_limit = 255; ip6h->version = 6; @@ -1111,7 +1104,6 @@ static size_t tcp_fill_headers6(const struct ctx *c, /** * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers - * @c: Execution context * @conn: Connection pointer * @iov: Pointer to an array of iovec of TCP pre-cooked buffers * @dlen: TCP payload length @@ -1120,8 +1112,7 @@ static size_t tcp_fill_headers6(const struct ctx *c, * * Return: IP payload length, host order */ -size_t tcp_l2_buf_fill_headers(const struct ctx *c, - const struct tcp_tap_conn *conn, +size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq) { @@ -1129,13 +1120,13 @@ size_t tcp_l2_buf_fill_headers(const struct ctx *c, const struct in_addr *a4 = inany_v4(&tapside->faddr); if (a4) { - return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base, + return tcp_fill_headers4(conn, iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_IP].iov_base, iov[TCP_IOV_PAYLOAD].iov_base, dlen, check, seq); } - return tcp_fill_headers6(c, conn, iov[TCP_IOV_TAP].iov_base, + return tcp_fill_headers6(conn, iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_IP].iov_base, iov[TCP_IOV_PAYLOAD].iov_base, dlen, seq); @@ -1430,17 +1421,11 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, { struct siphash_state state = SIPHASH_INIT(c->hash_secret); const struct flowside *tapside = TAPFLOW(conn); - union inany_addr aany; uint64_t hash; uint32_t ns; - if (CONN_V4(conn)) - inany_from_af(&aany, AF_INET, &c->ip4.addr); - else - inany_from_af(&aany, AF_INET6, &c->ip6.addr); - inany_siphash_feed(&state, &tapside->faddr); - inany_siphash_feed(&state, &aany); + inany_siphash_feed(&state, &tapside->eaddr); hash = siphash_final(&state, 36, (uint64_t)tapside->fport << 16 | tapside->eport); diff --git a/tcp_buf.c b/tcp_buf.c index ff9f7970..9878e4b1 100644 --- a/tcp_buf.c +++ b/tcp_buf.c @@ -316,7 +316,7 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) return ret; } - l4len = tcp_l2_buf_fill_headers(c, conn, iov, optlen, NULL, seq); + l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL, seq); iov[TCP_IOV_PAYLOAD].iov_len = l4len; if (flags & DUP_ACK) { @@ -373,7 +373,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, tcp4_frame_conns[tcp4_payload_used] = conn; iov = tcp4_l2_iov[tcp4_payload_used++]; - l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, check, seq); + l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq); iov[TCP_IOV_PAYLOAD].iov_len = l4len; if (tcp4_payload_used > TCP_FRAMES_MEM - 1) tcp_payload_flush(c); @@ -381,7 +381,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, tcp6_frame_conns[tcp6_payload_used] = conn; iov = tcp6_l2_iov[tcp6_payload_used++]; - l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, NULL, seq); + l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, NULL, seq); iov[TCP_IOV_PAYLOAD].iov_len = l4len; if (tcp6_payload_used > TCP_FRAMES_MEM - 1) tcp_payload_flush(c); diff --git a/tcp_internal.h b/tcp_internal.h index 4f61e5c3..ac6d4b21 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -88,8 +88,7 @@ void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn); tcp_rst_do(c, conn); \ } while (0) -size_t tcp_l2_buf_fill_headers(const struct ctx *c, - const struct tcp_tap_conn *conn, +size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq); int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, -- 2.45.2
For now when we forward a connection to the host we leave the host side forwarding address and port blank since we don't necessarily know what source address and port will be used by the kernel. When the outbound address option is active, though, we do know the address at least, so we can record it in the flowside. Having done that, use it as the primary source of truth, binding the outgoing socket based on the information in there. This allows the possibility of more complex rules for what outbound address and/or port we use in future. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 92 +++++++++++++++++++++++++++++++---------------------------- 1 file changed, 49 insertions(+), 43 deletions(-) diff --git a/tcp.c b/tcp.c index b03028ad..f92414fb 100644 --- a/tcp.c +++ b/tcp.c @@ -1535,46 +1535,47 @@ static uint16_t tcp_conn_tap_mss(const struct tcp_tap_conn *conn, /** * tcp_bind_outbound() - Bind socket to outbound address and interface if given * @c: Execution context + * @conn: Connection entry for socket to bind * @s: Outbound TCP socket - * @af: Address family */ -static void tcp_bind_outbound(const struct ctx *c, int s, sa_family_t af) +static void tcp_bind_outbound(const struct ctx *c, + const struct tcp_tap_conn *conn, int s) { - if (af == AF_INET) { - if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.addr_out)) { - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = 0, - .sin_addr = c->ip4.addr_out, - }; - - if (bind(s, (struct sockaddr *)&addr4, sizeof(addr4))) - debug_perror("IPv4 TCP socket address bind"); + const struct flowside *tgt = &conn->f.side[TGTSIDE]; + union sockaddr_inany bind_sa; + socklen_t sl; + + + if (pif_sockaddr(c, &bind_sa, &sl, PIF_HOST, &tgt->faddr, tgt->fport)) { + if (bind(s, &bind_sa.sa, sl)) { + char sstr[INANY_ADDRSTRLEN]; + + flow_dbg(conn, + "Can't bind TCP outbound socket to %s:%hu: %s", + inany_ntop(&tgt->faddr, sstr, sizeof(sstr)), + tgt->fport, strerror(errno)); } + } + if (bind_sa.sa_family == AF_INET) { if (*c->ip4.ifname_out) { if (setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE, c->ip4.ifname_out, - strlen(c->ip4.ifname_out))) - debug_perror("IPv4 TCP socket interface bind"); - } - } else if (af == AF_INET6) { - if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr_out)) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = 0, - .sin6_addr = c->ip6.addr_out, - }; - - if (bind(s, (struct sockaddr *)&addr6, sizeof(addr6))) - debug_perror("IPv6 TCP socket address bind"); + strlen(c->ip4.ifname_out))) { + flow_dbg(conn, "Can't bind IPv4 TCP socket to" + " interface %s: %s", c->ip4.ifname_out, + strerror(errno)); + } } - + } else if (bind_sa.sa_family == AF_INET6) { if (*c->ip6.ifname_out) { if (setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE, c->ip6.ifname_out, - strlen(c->ip6.ifname_out))) - debug_perror("IPv6 TCP socket interface bind"); + strlen(c->ip6.ifname_out))) { + flow_dbg(conn, "Can't bind IPv6 TCP socket to" + " interface %s: %s", c->ip6.ifname_out, + strerror(errno)); + } } } } @@ -1597,9 +1598,9 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); + union inany_addr srcaddr, dstaddr; /* FIXME: Avoid bulky temporaries */ const struct flowside *ini, *tgt; struct tcp_tap_conn *conn; - union inany_addr dstaddr; /* FIXME: Avoid bulky temporary */ union sockaddr_inany sa; union flow *flow; int s = -1, mss; @@ -1620,9 +1621,24 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, } - /* FIXME: Record outbound source address when known */ + if (inany_is_linklocal6(&dstaddr)) { + srcaddr.a6 = c->ip6.addr_ll; + } else if (inany_is_loopback(&dstaddr)) { + srcaddr = dstaddr; + } else if (inany_v4(&dstaddr)) { + if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.addr_out)) + srcaddr = inany_from_v4(c->ip4.addr_out); + else + srcaddr = inany_any4; + } else { + if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr_out)) + srcaddr.a6 = c->ip6.addr_out; + else + srcaddr = inany_any6; + } + tgt = flow_target_af(flow, PIF_HOST, AF_INET6, - NULL, 0, /* Kernel decides source address */ + &srcaddr, 0, /* Kernel decides source port */ &dstaddr, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); @@ -1685,18 +1701,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, goto cancel; } - if (inany_is_linklocal6(&tgt->eaddr)) { - struct sockaddr_in6 addr6_ll = { - .sin6_family = AF_INET6, - .sin6_addr = c->ip6.addr_ll, - .sin6_scope_id = c->ifi6, - }; - if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) - goto cancel; - } else if (!inany_is_loopback(&tgt->eaddr)) { - tcp_bind_outbound(c, s, af); - } - conn->sock = s; conn->timer = -1; conn_event(c, conn, TAP_SYN_RCVD); @@ -1725,6 +1729,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, tcp_hash_insert(c, conn); + tcp_bind_outbound(c, conn, s); + if (connect(s, &sa.sa, sl)) { if (errno != EINPROGRESS) { tcp_rst(c, conn); -- 2.45.2
Now that we store all our endpoints in the flowside structure, use some inany helpers to make validation of those endpoints simpler. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- inany.h | 1 - tcp.c | 72 +++++++++++++++------------------------------------------ 2 files changed, 18 insertions(+), 55 deletions(-) diff --git a/inany.h b/inany.h index 8eaf5335..d2893cec 100644 --- a/inany.h +++ b/inany.h @@ -211,7 +211,6 @@ static inline bool inany_is_multicast(const union inany_addr *a) * * Return: true if @a is specified and a unicast address */ -/* cppcheck-suppress unusedFunction */ static inline bool inany_is_unicast(const union inany_addr *a) { return !inany_is_unspecified(a) && !inany_is_multicast(a); diff --git a/tcp.c b/tcp.c index f92414fb..45ea9a71 100644 --- a/tcp.c +++ b/tcp.c @@ -1642,38 +1642,14 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, &dstaddr, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); - if (af == AF_INET) { - if (IN4_IS_ADDR_UNSPECIFIED(saddr) || - IN4_IS_ADDR_BROADCAST(saddr) || - IN4_IS_ADDR_MULTICAST(saddr) || srcport == 0 || - IN4_IS_ADDR_UNSPECIFIED(daddr) || - IN4_IS_ADDR_BROADCAST(daddr) || - IN4_IS_ADDR_MULTICAST(daddr) || dstport == 0) { - char sstr[INET_ADDRSTRLEN], dstr[INET_ADDRSTRLEN]; - - debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu", - inet_ntop(AF_INET, saddr, sstr, sizeof(sstr)), - srcport, - inet_ntop(AF_INET, daddr, dstr, sizeof(dstr)), - dstport); - goto cancel; - } - } else if (af == AF_INET6) { - if (IN6_IS_ADDR_UNSPECIFIED(saddr) || - IN6_IS_ADDR_MULTICAST(saddr) || srcport == 0 || - IN6_IS_ADDR_UNSPECIFIED(daddr) || - IN6_IS_ADDR_MULTICAST(daddr) || dstport == 0) { - char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN]; - - debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu", - inet_ntop(AF_INET6, saddr, sstr, sizeof(sstr)), - srcport, - inet_ntop(AF_INET6, daddr, dstr, sizeof(dstr)), - dstport); - goto cancel; - } - } else { - ASSERT(0); + if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0 || + !inany_is_unicast(&ini->faddr) || ini->fport == 0) { + char sstr[INANY_ADDRSTRLEN], dstr[INANY_ADDRSTRLEN]; + + debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu", + inany_ntop(&ini->eaddr, sstr, sizeof(sstr)), ini->eport, + inany_ntop(&ini->faddr, dstr, sizeof(dstr)), ini->fport); + goto cancel; } if ((s = tcp_conn_sock(c, af)) < 0) @@ -2279,7 +2255,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, void tcp_listen_handler(struct ctx *c, union epoll_ref ref, const struct timespec *now) { - char sastr[SOCKADDR_STRLEN]; + const struct flowside *ini; union sockaddr_inany sa; socklen_t sl = sizeof(sa); union flow *flow; @@ -2294,23 +2270,15 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, /* FIXME: When listening port has a specific bound address, record that * as the forwarding address */ - flow_initiate_sa(flow, ref.tcp_listen.pif, &sa, ref.tcp_listen.port); - - if (sa.sa_family == AF_INET) { - const struct in_addr *addr = &sa.sa4.sin_addr; - in_port_t port = sa.sa4.sin_port; - - if (IN4_IS_ADDR_UNSPECIFIED(addr) || - IN4_IS_ADDR_BROADCAST(addr) || - IN4_IS_ADDR_MULTICAST(addr) || port == 0) - goto bad_endpoint; - } else if (sa.sa_family == AF_INET6) { - const struct in6_addr *addr = &sa.sa6.sin6_addr; - in_port_t port = sa.sa6.sin6_port; - - if (IN6_IS_ADDR_UNSPECIFIED(addr) || - IN6_IS_ADDR_MULTICAST(addr) || port == 0) - goto bad_endpoint; + ini = flow_initiate_sa(flow, ref.tcp_listen.pif, &sa, + ref.tcp_listen.port); + + if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) { + char sastr[SOCKADDR_STRLEN]; + + err("Invalid endpoint from TCP accept(): %s", + sockaddr_ntop(&sa, sastr, sizeof(sastr))); + goto cancel; } if (tcp_splice_conn_from_sock(c, ref.tcp_listen.pif, @@ -2320,10 +2288,6 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, tcp_tap_conn_from_sock(c, ref.tcp_listen.port, flow, s, &sa, now); return; -bad_endpoint: - err("Invalid endpoint from TCP accept(): %s", - sockaddr_ntop(&sa, sastr, sizeof(sastr))); - cancel: flow_alloc_cancel(flow); } -- 2.45.2
Since we're now constructing socket addresses based on information in the flowside, we no longer need an explicit flag to tell if we're dealing with an IPv4 or IPv6 connection. Hence, drop the now unused SPLICE_V6 flag. As well as just simplifying the code, this allows for possible future extensions where we could splice an IPv4 connection to an IPv6 connection or vice versa. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp_conn.h | 11 +++++------ tcp_splice.c | 3 --- 2 files changed, 5 insertions(+), 9 deletions(-) diff --git a/tcp_conn.h b/tcp_conn.h index 3655f871..e9f26f93 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -128,12 +128,11 @@ struct tcp_splice_conn { #define FIN_SENT_1 BIT(7) uint8_t flags; -#define SPLICE_V6 BIT(0) -#define RCVLOWAT_SET_0 BIT(1) -#define RCVLOWAT_SET_1 BIT(2) -#define RCVLOWAT_ACT_0 BIT(3) -#define RCVLOWAT_ACT_1 BIT(4) -#define CLOSING BIT(5) +#define RCVLOWAT_SET_0 BIT(0) +#define RCVLOWAT_SET_1 BIT(1) +#define RCVLOWAT_ACT_0 BIT(2) +#define RCVLOWAT_ACT_1 BIT(3) +#define CLOSING BIT(4) bool in_epoll :1; }; diff --git a/tcp_splice.c b/tcp_splice.c index 83fc7ba4..c45fd55a 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -73,8 +73,6 @@ static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE]; /* Pool of pre-opened pipes */ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2]; -#define CONN_V6(x) ((x)->flags & SPLICE_V6) -#define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) (((conn)->events & (set)) == (set)) #define CONN(idx) (&FLOW(idx)->tcp_splice) @@ -463,7 +461,6 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, NULL, 0, &in6addr_loopback, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice); - conn->flags = af == AF_INET ? 0 : SPLICE_V6; conn->s[0] = s0; conn->s[1] = -1; conn->pipe[0][0] = conn->pipe[0][1] = -1; -- 2.45.2
Currently we match TCP packets received on the tap connection to a TCP connection via a hash table based on the forwarding address and both ports. We hope in future to allow for multiple guest side addresses, or for multiple interfaces which means we may need to distinguish based on the endpoint address and pif as well. We also want a unified hash table to cover multiple protocols, not just TCP. Replace the TCP specific hash function with one suitable for general flows, or rather for one side of a general flow. This includes all the information from struct flowside, plus the pif and the L4 protocol number. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 35 +++++++++++++++++++++++++++--- flow.h | 19 ++++++++++++++++ flow_table.h | 3 +++ tcp.c | 61 ++++++++++------------------------------------------ 4 files changed, 65 insertions(+), 53 deletions(-) diff --git a/flow.c b/flow.c index f064fad1..30d10e9d 100644 --- a/flow.c +++ b/flow.c @@ -116,9 +116,9 @@ static struct timespec flow_timer_run; * @faddr: Forwarding address (pointer to in_addr or in6_addr) * @fport: Forwarding port */ -static void flowside_from_af(struct flowside *fside, sa_family_t af, - const void *eaddr, in_port_t eport, - const void *faddr, in_port_t fport) +void flowside_from_af(struct flowside *fside, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport) { if (faddr) inany_from_af(&fside->faddr, af, faddr); @@ -401,6 +401,35 @@ void flow_alloc_cancel(union flow *flow) flow_new_entry = NULL; } +/** + * flow_hash() - Calculate hash value for one side of a flow + * @c: Execution context + * @proto: Protocol of this flow (IP L4 protocol number) + * @pif: pif of the side to hash + * @fside: Flowside (must not have unspecified parts) + * + * Return: hash value + */ +uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, + const struct flowside *fside) +{ + struct siphash_state state = SIPHASH_INIT(c->hash_secret); + + /* For the hash table to work, we need complete endpoint information, + * and at least a forwarding port. + */ + ASSERT(pif != PIF_NONE && !inany_is_unspecified(&fside->eaddr) && + fside->eport != 0 && fside->fport != 0); + + inany_siphash_feed(&state, &fside->faddr); + inany_siphash_feed(&state, &fside->eaddr); + + return siphash_final(&state, 38, (uint64_t)proto << 40 | + (uint64_t)pif << 32 | + (uint64_t)fside->fport << 16 | + (uint64_t)fside->eport); +} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context diff --git a/flow.h b/flow.h index 4c3762b9..0b1e5de2 100644 --- a/flow.h +++ b/flow.h @@ -149,6 +149,25 @@ struct flowside { in_port_t eport; }; +/** + * flowside_eq() - Check if two flowsides are equal + * @left, @right: Flowsides to compare + * + * Return: true if equal, false otherwise + */ +static inline bool flowside_eq(const struct flowside *left, + const struct flowside *right) +{ + return inany_equals(&left->eaddr, &right->eaddr) && + left->eport == right->eport && + inany_equals(&left->faddr, &right->faddr) && + left->fport == right->fport; +} + +void flowside_from_af(struct flowside *fside, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport); + /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry diff --git a/flow_table.h b/flow_table.h index 00dca4b2..9bfa1174 100644 --- a/flow_table.h +++ b/flow_table.h @@ -126,4 +126,7 @@ void flow_activate(struct flow_common *f); #define FLOW_ACTIVATE(flow_) \ (flow_activate(&(flow_)->f)) +uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, + const struct flowside *fside); + #endif /* FLOW_TABLE_H */ diff --git a/tcp.c b/tcp.c index 45ea9a71..b1ad1014 100644 --- a/tcp.c +++ b/tcp.c @@ -376,7 +376,7 @@ static struct iovec tcp_iov [UIO_MAXIOV]; #define CONN(idx) (&(FLOW(idx)->tcp)) -/* Table for lookup from remote address, local port, remote port */ +/* Table for lookup from flowside information */ static flow_sidx_t tc_hash[TCP_HASH_TABLE_SIZE]; static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX, @@ -814,46 +814,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find, return -1; } -/** - * tcp_hash_match() - Check if a connection entry matches address and ports - * @conn: Connection entry to match against - * @faddr: Guest side forwarding address - * @eport: Guest side endpoint port - * @fport: Guest side forwarding port - * - * Return: 1 on match, 0 otherwise - */ -static int tcp_hash_match(const struct tcp_tap_conn *conn, - const union inany_addr *faddr, - in_port_t eport, in_port_t fport) -{ - const struct flowside *tapside = TAPFLOW(conn); - - if (inany_equals(&tapside->faddr, faddr) && - tapside->eport == eport && tapside->fport == fport) - return 1; - - return 0; -} - -/** - * tcp_hash() - Calculate hash value for connection given address and ports - * @c: Execution context - * @faddr: Guest side forwarding address - * @eport: Guest side endpoint port - * @fport: Guest side forwarding port - * - * Return: hash value, needs to be adjusted for table size - */ -static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, - in_port_t eport, in_port_t fport) -{ - struct siphash_state state = SIPHASH_INIT(c->hash_secret); - - inany_siphash_feed(&state, faddr); - return siphash_final(&state, 20, (uint64_t)eport << 16 | fport); -} - /** * tcp_conn_hash() - Calculate hash bucket of an existing connection * @c: Execution context @@ -866,8 +826,7 @@ static uint64_t tcp_conn_hash(const struct ctx *c, { const struct flowside *tapside = TAPFLOW(conn); - return tcp_hash(c, &tapside->faddr, tapside->eport, - tapside->fport); + return flow_hash(c, IPPROTO_TCP, conn->f.pif[TAPSIDE(conn)], tapside); } /** @@ -941,25 +900,26 @@ static void tcp_hash_remove(const struct ctx *c, * tcp_hash_lookup() - Look up connection given remote address and ports * @c: Execution context * @af: Address family, AF_INET or AF_INET6 + * @eaddr: Guest side endpoint address (guest local address) * @faddr: Guest side forwarding address (guest remote address) * @eport: Guest side endpoint port (guest local port) * @fport: Guest side forwarding port (guest remote port) * * Return: connection pointer, if found, -ENOENT otherwise */ -static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, - sa_family_t af, const void *faddr, +static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, sa_family_t af, + const void *eaddr, const void *faddr, in_port_t eport, in_port_t fport) { - union inany_addr aany; + struct flowside fside; union flow *flow; unsigned b; - inany_from_af(&aany, af, faddr); + flowside_from_af(&fside, af, eaddr, eport, faddr, fport); - b = tcp_hash(c, &aany, eport, fport) % TCP_HASH_TABLE_SIZE; + b = flow_hash(c, IPPROTO_TCP, PIF_TAP, &fside) % TCP_HASH_TABLE_SIZE; while ((flow = flow_at_sidx(tc_hash[b])) && - !tcp_hash_match(&flow->tcp, &aany, eport, fport)) + !flowside_eq(&flow->f.side[TAPSIDE(flow)], &fside)) b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); return &flow->tcp; @@ -2047,7 +2007,8 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL); opts = packet_get(p, idx, sizeof(*th), optlen, NULL); - conn = tcp_hash_lookup(c, af, daddr, ntohs(th->source), ntohs(th->dest)); + conn = tcp_hash_lookup(c, af, saddr, daddr, + ntohs(th->source), ntohs(th->dest)); /* New connection from tap */ if (!conn) { -- 2.45.2
Move the data structures and helper functions for the TCP hash table to flow.c, making it a general hash table indexing sides of flows. This is largely code motion and straightforward renames. There are two semantic changes: * flow_lookup_af() now needs to verify that the entry has a matching protocol and interface as well as matching addresses and ports. * We double the size of the hash table, because it's now at least theoretically possible for both sides of each flow to be hashed. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 155 +++++++++++++++++++++++++++++++++++++++++++++++-- flow.h | 11 ++-- flow_table.h | 3 - tcp.c | 147 +++++----------------------------------------- tcp_internal.h | 1 + 5 files changed, 171 insertions(+), 146 deletions(-) diff --git a/flow.c b/flow.c index 30d10e9d..21310531 100644 --- a/flow.c +++ b/flow.c @@ -105,6 +105,16 @@ unsigned flow_first_free; union flow flowtab[FLOW_MAX]; static const union flow *flow_new_entry; /* = NULL */ +/* Hash table to index it */ +#define FLOW_HASH_LOAD 70 /* % */ +#define FLOW_HASH_SIZE ((2 * FLOW_MAX * 100 / FLOW_HASH_LOAD)) + +/* Table for lookup from flowside information */ +static flow_sidx_t flow_hashtab[FLOW_HASH_SIZE]; + +static_assert(ARRAY_SIZE(flow_hashtab) >= 2 * FLOW_MAX, +"Safe linear probing requires hash table with more entries than the number of sides in the flow table"); + /* Last time the flow timers ran */ static struct timespec flow_timer_run; @@ -116,9 +126,9 @@ static struct timespec flow_timer_run; * @faddr: Forwarding address (pointer to in_addr or in6_addr) * @fport: Forwarding port */ -void flowside_from_af(struct flowside *fside, sa_family_t af, - const void *eaddr, in_port_t eport, - const void *faddr, in_port_t fport) +static void flowside_from_af(struct flowside *fside, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport) { if (faddr) inany_from_af(&fside->faddr, af, faddr); @@ -410,8 +420,8 @@ void flow_alloc_cancel(union flow *flow) * * Return: hash value */ -uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, - const struct flowside *fside) +static uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, + const struct flowside *fside) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); @@ -430,6 +440,136 @@ uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, (uint64_t)fside->eport); } +/** + * flow_sidx_hash() - Calculate hash value for given side of a given flow + * @c: Execution context + * @sidx: Flow & side index to get hash for + * + * Return: hash value, of the flow & side represented by @sidx + */ +static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx) +{ + const struct flow_common *f = &flow_at_sidx(sidx)->f; + return flow_hash(c, FLOW_PROTO(f), + f->pif[sidx.side], &f->side[sidx.side]); +} + +/** + * flow_hash_probe() - Find hash bucket for a flow + * @c: Execution context + * @sidx: Flow and side to find bucket for + * + * Return: If @sidx is in the hash table, its current bucket, otherwise a + * suitable free bucket for it. + */ +static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) +{ + unsigned b = flow_sidx_hash(c, sidx) % FLOW_HASH_SIZE; + + /* Linear probing */ + while (flow_sidx_valid(flow_hashtab[b]) && + !flow_sidx_eq(flow_hashtab[b], sidx)) + b = mod_sub(b, 1, FLOW_HASH_SIZE); + + return b; +} + +/** + * flow_hash_insert() - Insert side of a flow into into hash table + * @c: Execution context + * @sidx: Flow & side index + */ +void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx) +{ + unsigned b = flow_hash_probe(c, sidx); + + flow_hashtab[b] = sidx; + flow_dbg(flow_at_sidx(sidx), "Side %u hash table insert: bucket: %u", + sidx.side, b); +} + +/** + * flow_hash_remove() - Drop side of a flow from the hash table + * @c: Execution context + * @sidx: Side of flow to remove + */ +void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx) +{ + unsigned b = flow_hash_probe(c, sidx), s; + + if (!flow_sidx_valid(flow_hashtab[b])) + return; /* Redundant remove */ + + flow_dbg(flow_at_sidx(sidx), "Side %u hash table remove: bucket: %u", + sidx.side, b); + + /* Scan the remainder of the cluster */ + for (s = mod_sub(b, 1, FLOW_HASH_SIZE); + flow_sidx_valid(flow_hashtab[s]); + s = mod_sub(s, 1, FLOW_HASH_SIZE)) { + unsigned h = flow_sidx_hash(c, flow_hashtab[s]) % FLOW_HASH_SIZE; + + if (!mod_between(h, s, b, FLOW_HASH_SIZE)) { + /* flow_hashtab[s] can live in flow_hashtab[b]'s slot */ + debug("hash table remove: shuffle %u -> %u", s, b); + flow_hashtab[b] = flow_hashtab[s]; + b = s; + } + } + + flow_hashtab[b] = FLOW_SIDX_NONE; +} + +/** + * flowside_lookup() - Look for a matching flowside in the flow table + * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: pif to look for in the table + * @fside: Flowside to look for in the table + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +static flow_sidx_t flowside_lookup(const struct ctx *c, uint8_t proto, + uint8_t pif, const struct flowside *fside) +{ + flow_sidx_t sidx; + union flow *flow; + unsigned b; + + b = flow_hash(c, proto, pif, fside) % FLOW_HASH_SIZE; + while ((sidx = flow_hashtab[b], flow = flow_at_sidx(sidx)) && + !(FLOW_PROTO(&flow->f) == proto && + flow->f.pif[sidx.side] == pif && + flowside_eq(&flow->f.side[sidx.side], fside))) + b = (b + 1) % FLOW_HASH_SIZE; + + return flow_hashtab[b]; +} + +/** + * flow_lookup_af() - Look up a flow given addressing information + * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @af: Address family, AF_INET or AF_INET6 + * @eaddr: Guest side endpoint address (guest local address) + * @faddr: Guest side forwarding address (guest remote address) + * @eport: Guest side endpoint port (guest local port) + * @fport: Guest side forwarding port (guest remote port) + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_af(const struct ctx *c, + uint8_t proto, uint8_t pif, sa_family_t af, + const void *eaddr, const void *faddr, + in_port_t eport, in_port_t fport) +{ + struct flowside fside; + + flowside_from_af(&fside, af, eaddr, eport, faddr, fport); + return flowside_lookup(c, proto, pif, &fside); +} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context @@ -543,7 +683,12 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now) */ void flow_init(void) { + unsigned b; + /* Initial state is a single free cluster containing the whole table */ flowtab[0].free.n = FLOW_MAX; flowtab[0].free.next = FLOW_MAX; + + for (b = 0; b < FLOW_HASH_SIZE; b++) + flow_hashtab[b] = FLOW_SIDX_NONE; } diff --git a/flow.h b/flow.h index 0b1e5de2..c0bf0010 100644 --- a/flow.h +++ b/flow.h @@ -164,10 +164,6 @@ static inline bool flowside_eq(const struct flowside *left, left->fport == right->fport; } -void flowside_from_af(struct flowside *fside, sa_family_t af, - const void *eaddr, in_port_t eport, - const void *faddr, in_port_t fport); - /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry @@ -233,6 +229,13 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b) return (a.flow == b.flow) && (a.side == b.side); } +void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx); +void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx); +flow_sidx_t flow_lookup_af(const struct ctx *c, + uint8_t proto, uint8_t pif, sa_family_t af, + const void *eaddr, const void *faddr, + in_port_t eport, in_port_t fport); + union flow; void flow_init(void); diff --git a/flow_table.h b/flow_table.h index 9bfa1174..00dca4b2 100644 --- a/flow_table.h +++ b/flow_table.h @@ -126,7 +126,4 @@ void flow_activate(struct flow_common *f); #define FLOW_ACTIVATE(flow_) \ (flow_activate(&(flow_)->f)) -uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, - const struct flowside *fside); - #endif /* FLOW_TABLE_H */ diff --git a/tcp.c b/tcp.c index b1ad1014..8cc91add 100644 --- a/tcp.c +++ b/tcp.c @@ -305,9 +305,6 @@ #include "tcp_internal.h" #include "tcp_buf.h" -#define TCP_HASH_TABLE_LOAD 70 /* % */ -#define TCP_HASH_TABLE_SIZE (FLOW_MAX * 100 / TCP_HASH_TABLE_LOAD) - /* MSS rounding: see SET_MSS() */ #define MSS_DEFAULT 536 #define WINDOW_DEFAULT 14600 /* RFC 6928 */ @@ -376,12 +373,6 @@ static struct iovec tcp_iov [UIO_MAXIOV]; #define CONN(idx) (&(FLOW(idx)->tcp)) -/* Table for lookup from flowside information */ -static flow_sidx_t tc_hash[TCP_HASH_TABLE_SIZE]; - -static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX, - "Safe linear probing requires hash table larger than connection table"); - /* Pools for pre-opened sockets (in init) */ int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; @@ -567,9 +558,6 @@ void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn, tcp_timer_ctl(c, conn); } -static void tcp_hash_remove(const struct ctx *c, - const struct tcp_tap_conn *conn); - /** * conn_event_do() - Set and log connection events, update epoll state * @c: Execution context @@ -615,7 +603,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, num == -1 ? "CLOSED" : tcp_event_str[num]); if (event == CLOSED) - tcp_hash_remove(c, conn); + flow_hash_remove(c, TAP_SIDX(conn)); else if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD)) conn_flag(c, conn, ACTIVE_CLOSE); else @@ -814,117 +802,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find, return -1; } -/** - * tcp_conn_hash() - Calculate hash bucket of an existing connection - * @c: Execution context - * @conn: Connection - * - * Return: hash value, needs to be adjusted for table size - */ -static uint64_t tcp_conn_hash(const struct ctx *c, - const struct tcp_tap_conn *conn) -{ - const struct flowside *tapside = TAPFLOW(conn); - - return flow_hash(c, IPPROTO_TCP, conn->f.pif[TAPSIDE(conn)], tapside); -} - -/** - * tcp_hash_probe() - Find hash bucket for a connection - * @c: Execution context - * @conn: Connection to find bucket for - * - * Return: If @conn is in the table, its current bucket, otherwise a suitable - * free bucket for it. - */ -static inline unsigned tcp_hash_probe(const struct ctx *c, - const struct tcp_tap_conn *conn) -{ - unsigned b = tcp_conn_hash(c, conn) % TCP_HASH_TABLE_SIZE; - flow_sidx_t sidx = FLOW_SIDX(conn, TAPSIDE(conn)); - - /* Linear probing */ - while (flow_sidx_valid(tc_hash[b]) && !flow_sidx_eq(tc_hash[b], sidx)) - b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); - - return b; -} - -/** - * tcp_hash_insert() - Insert connection into hash table, chain link - * @c: Execution context - * @conn: Connection pointer - */ -static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn) -{ - unsigned b = tcp_hash_probe(c, conn); - - tc_hash[b] = FLOW_SIDX(conn, TAPSIDE(conn)); - flow_dbg(conn, "hash table insert: sock %i, bucket: %u", conn->sock, b); -} - -/** - * tcp_hash_remove() - Drop connection from hash table, chain unlink - * @c: Execution context - * @conn: Connection pointer - */ -static void tcp_hash_remove(const struct ctx *c, - const struct tcp_tap_conn *conn) -{ - unsigned b = tcp_hash_probe(c, conn), s; - union flow *flow; - - if (!flow_sidx_valid(tc_hash[b])) - return; /* Redundant remove */ - - flow_dbg(conn, "hash table remove: sock %i, bucket: %u", conn->sock, b); - - /* Scan the remainder of the cluster */ - for (s = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); - (flow = flow_at_sidx(tc_hash[s])); - s = mod_sub(s, 1, TCP_HASH_TABLE_SIZE)) { - unsigned h = tcp_conn_hash(c, &flow->tcp) % TCP_HASH_TABLE_SIZE; - - if (!mod_between(h, s, b, TCP_HASH_TABLE_SIZE)) { - /* tc_hash[s] can live in tc_hash[b]'s slot */ - debug("hash table remove: shuffle %u -> %u", s, b); - tc_hash[b] = tc_hash[s]; - b = s; - } - } - - tc_hash[b] = FLOW_SIDX_NONE; -} - -/** - * tcp_hash_lookup() - Look up connection given remote address and ports - * @c: Execution context - * @af: Address family, AF_INET or AF_INET6 - * @eaddr: Guest side endpoint address (guest local address) - * @faddr: Guest side forwarding address (guest remote address) - * @eport: Guest side endpoint port (guest local port) - * @fport: Guest side forwarding port (guest remote port) - * - * Return: connection pointer, if found, -ENOENT otherwise - */ -static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, sa_family_t af, - const void *eaddr, const void *faddr, - in_port_t eport, in_port_t fport) -{ - struct flowside fside; - union flow *flow; - unsigned b; - - flowside_from_af(&fside, af, eaddr, eport, faddr, fport); - - b = flow_hash(c, IPPROTO_TCP, PIF_TAP, &fside) % TCP_HASH_TABLE_SIZE; - while ((flow = flow_at_sidx(tc_hash[b])) && - !flowside_eq(&flow->f.side[TAPSIDE(flow)], &fside)) - b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); - - return &flow->tcp; -} - /** * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections) * @conn: Connection to handle @@ -1663,7 +1540,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, tcp_seq_init(c, conn, now); conn->seq_ack_from_tap = conn->seq_to_tap; - tcp_hash_insert(c, conn); + flow_hash_insert(c, TAP_SIDX(conn)); tcp_bind_outbound(c, conn, s); @@ -1992,6 +1869,8 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, const struct tcphdr *th; size_t optlen, len; const char *opts; + union flow *flow; + flow_sidx_t sidx; int ack_due = 0; int count; @@ -2007,17 +1886,22 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL); opts = packet_get(p, idx, sizeof(*th), optlen, NULL); - conn = tcp_hash_lookup(c, af, saddr, daddr, - ntohs(th->source), ntohs(th->dest)); + sidx = flow_lookup_af(c, IPPROTO_TCP, PIF_TAP, af, saddr, daddr, + ntohs(th->source), ntohs(th->dest)); + flow = flow_at_sidx(sidx); /* New connection from tap */ - if (!conn) { + if (!flow) { if (opts && th->syn && !th->ack) tcp_conn_from_tap(c, af, saddr, daddr, th, opts, optlen, now); return 1; } + ASSERT(flow->f.type == FLOW_TCP); + ASSERT(flow->f.pif[sidx.side] == PIF_TAP); + conn = &flow->tcp; + flow_trace(conn, "packet length %zu from tap", len); if (th->rst) { @@ -2193,7 +2077,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn_event(c, conn, SOCK_ACCEPTED); tcp_seq_init(c, conn, now); - tcp_hash_insert(c, conn); + flow_hash_insert(c, TAP_SIDX(conn)); conn->seq_ack_from_tap = conn->seq_to_tap; @@ -2585,11 +2469,6 @@ static void tcp_sock_refill_init(const struct ctx *c) */ int tcp_init(struct ctx *c) { - unsigned b; - - for (b = 0; b < TCP_HASH_TABLE_SIZE; b++) - tc_hash[b] = FLOW_SIDX_NONE; - if (c->ifi4) tcp_sock4_iov_init(c); diff --git a/tcp_internal.h b/tcp_internal.h index ac6d4b21..8b60aabc 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -42,6 +42,7 @@ #define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) #define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)])) +#define TAP_SIDX(conn_) (FLOW_SIDX((conn_), TAPSIDE(conn_))) #define CONN_V4(conn) (!!inany_v4(&TAPFLOW(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) -- 2.45.2
We generate TCP initial sequence numbers, when we need them, from a hash of the source and destination addresses and ports, plus a timestamp. Moments later, we generate another hash of the same information plus some more to insert the connection into the flow hash table. With some tweaks to the flow_hash_insert() interface and changing the order we can re-use that hash table hash for the initial sequence number, rather than calculating another one. It won't generate identical results, but that doesn't matter as long as the sequence numbers are well scattered. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 30 ++++++++++++++++++++++++------ flow.h | 2 +- tcp.c | 33 +++++++++++---------------------- 3 files changed, 36 insertions(+), 29 deletions(-) diff --git a/flow.c b/flow.c index 21310531..6f09781a 100644 --- a/flow.c +++ b/flow.c @@ -455,16 +455,16 @@ static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx) } /** - * flow_hash_probe() - Find hash bucket for a flow - * @c: Execution context + * flow_hash_probe_() - Find hash bucket for a flow, given hash + * @hash: Raw hash value for flow & side * @sidx: Flow and side to find bucket for * * Return: If @sidx is in the hash table, its current bucket, otherwise a * suitable free bucket for it. */ -static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) +static inline unsigned flow_hash_probe_(uint64_t hash, flow_sidx_t sidx) { - unsigned b = flow_sidx_hash(c, sidx) % FLOW_HASH_SIZE; + unsigned b = hash % FLOW_HASH_SIZE; /* Linear probing */ while (flow_sidx_valid(flow_hashtab[b]) && @@ -474,18 +474,36 @@ static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) return b; } +/** + * flow_hash_probe() - Find hash bucket for a flow + * @c: Execution context + * @sidx: Flow and side to find bucket for + * + * Return: If @sidx is in the hash table, its current bucket, otherwise a + * suitable free bucket for it. + */ +static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) +{ + return flow_hash_probe_(flow_sidx_hash(c, sidx), sidx); +} + /** * flow_hash_insert() - Insert side of a flow into into hash table * @c: Execution context * @sidx: Flow & side index + * + * Return: raw (un-modded) hash value of side of flow */ -void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx) +uint64_t flow_hash_insert(const struct ctx *c, flow_sidx_t sidx) { - unsigned b = flow_hash_probe(c, sidx); + uint64_t hash = flow_sidx_hash(c, sidx); + unsigned b = flow_hash_probe_(hash, sidx); flow_hashtab[b] = sidx; flow_dbg(flow_at_sidx(sidx), "Side %u hash table insert: bucket: %u", sidx.side, b); + + return hash; } /** diff --git a/flow.h b/flow.h index c0bf0010..c3a15ca6 100644 --- a/flow.h +++ b/flow.h @@ -229,7 +229,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b) return (a.flow == b.flow) && (a.side == b.side); } -void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx); +uint64_t flow_hash_insert(const struct ctx *c, flow_sidx_t sidx); void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx); flow_sidx_t flow_lookup_af(const struct ctx *c, uint8_t proto, uint8_t pif, sa_family_t af, diff --git a/tcp.c b/tcp.c index 8cc91add..4d3ffe57 100644 --- a/tcp.c +++ b/tcp.c @@ -1248,28 +1248,16 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) } /** - * tcp_seq_init() - Calculate initial sequence number according to RFC 6528 - * @c: Execution context - * @conn: TCP connection, with faddr, fport and eport populated + * tcp_init_seq() - Calculate initial sequence number according to RFC 6528 + * @hash: Hash of connection details * @now: Current timestamp */ -static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, - const struct timespec *now) +static uint32_t tcp_init_seq(uint64_t hash, const struct timespec *now) { - struct siphash_state state = SIPHASH_INIT(c->hash_secret); - const struct flowside *tapside = TAPFLOW(conn); - uint64_t hash; - uint32_t ns; - - inany_siphash_feed(&state, &tapside->faddr); - inany_siphash_feed(&state, &tapside->eaddr); - hash = siphash_final(&state, 36, - (uint64_t)tapside->fport << 16 | tapside->eport); - /* 32ns ticks, overflows 32 bits every 137s */ - ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; + uint32_t ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; - conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns; + return ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns; } /** @@ -1441,6 +1429,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, union sockaddr_inany sa; union flow *flow; int s = -1, mss; + uint64_t hash; socklen_t sl; if (!(flow = flow_alloc())) @@ -1537,11 +1526,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; - tcp_seq_init(c, conn, now); + hash = flow_hash_insert(c, TAP_SIDX(conn)); + conn->seq_to_tap = tcp_init_seq(hash, now); conn->seq_ack_from_tap = conn->seq_to_tap; - flow_hash_insert(c, TAP_SIDX(conn)); - tcp_bind_outbound(c, conn, s); if (connect(s, &sa.sa, sl)) { @@ -2053,6 +2041,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */ struct tcp_tap_conn *conn; in_port_t srcport; + uint64_t hash; inany_from_sockaddr(&saddr, &srcport, sa); tcp_snat_inbound(c, &saddr); @@ -2076,8 +2065,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - tcp_seq_init(c, conn, now); - flow_hash_insert(c, TAP_SIDX(conn)); + hash = flow_hash_insert(c, TAP_SIDX(conn)); + conn->seq_to_tap = tcp_init_seq(hash, now); conn->seq_ack_from_tap = conn->seq_to_tap; -- 2.45.2
struct icmp_ping_flow contains a field for the ICMP id of the ping, but this is now redundant, since the id is also stored as the "port" in the common flowsides. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- icmp.c | 10 +++++----- icmp_flow.h | 2 -- 2 files changed, 5 insertions(+), 7 deletions(-) diff --git a/icmp.c b/icmp.c index fd92c7da..00142422 100644 --- a/icmp.c +++ b/icmp.c @@ -58,6 +58,7 @@ static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS]; void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) { struct icmp_ping_flow *pingf = PINGF(ref.flowside.flow); + const struct flowside *ini = &pingf->f.side[INISIDE]; union sockaddr_inany sr; socklen_t sl = sizeof(sr); char buf[USHRT_MAX]; @@ -83,7 +84,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) goto unexpected; /* Adjust packet back to guest-side ID */ - ih4->un.echo.id = htons(pingf->id); + ih4->un.echo.id = htons(ini->eport); seq = ntohs(ih4->un.echo.sequence); } else if (pingf->f.type == FLOW_PING6) { struct icmp6hdr *ih6 = (struct icmp6hdr *)buf; @@ -93,7 +94,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) goto unexpected; /* Adjust packet back to guest-side ID */ - ih6->icmp6_identifier = htons(pingf->id); + ih6->icmp6_identifier = htons(ini->eport); seq = ntohs(ih6->icmp6_sequence); } else { ASSERT(0); @@ -108,7 +109,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) } flow_dbg(pingf, "echo reply to tap, ID: %"PRIu16", seq: %"PRIu16, - pingf->id, seq); + ini->eport, seq); if (pingf->f.type == FLOW_PING4) tap_icmp4_send(c, sr.sa4.sin_addr, tap_ip4_daddr(c), buf, n); @@ -129,7 +130,7 @@ unexpected: static void icmp_ping_close(const struct ctx *c, const struct icmp_ping_flow *pingf) { - uint16_t id = pingf->id; + uint16_t id = pingf->f.side[INISIDE].eport; epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL); close(pingf->sock); @@ -172,7 +173,6 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, pingf = FLOW_SET_TYPE(flow, flowtype, ping); pingf->seq = -1; - pingf->id = id; if (af == AF_INET) { bind_addr = &c->ip4.addr_out; diff --git a/icmp_flow.h b/icmp_flow.h index c9847eae..fb93801d 100644 --- a/icmp_flow.h +++ b/icmp_flow.h @@ -13,7 +13,6 @@ * @seq: Last sequence number sent to tap, host order, -1: not sent yet * @sock: "ping" socket * @ts: Last associated activity from tap, seconds - * @id: ICMP id for the flow as seen by the guest */ struct icmp_ping_flow { /* Must be first element */ @@ -22,7 +21,6 @@ struct icmp_ping_flow { int seq; int sock; time_t ts; - uint16_t id; }; bool icmp_ping_timer(const struct ctx *c, const struct icmp_ping_flow *pingf, -- 2.45.2
icmp_sock_handler() obtains the guest address from it's most recently observed IP. However, this can now be obtained from the common flowside information. icmp_tap_handler() builds its socket address for sendto() directly from the destination address supplied by the incoming tap packet. This can instead be generated from the flow. Using the flowsides as the common source of truth here prepares us for allowing more flexible NAT and forwarding by properly initialising that flowside information. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- icmp.c | 27 +++++++++++++++++---------- tap.c | 11 ----------- tap.h | 1 - 3 files changed, 17 insertions(+), 22 deletions(-) diff --git a/icmp.c b/icmp.c index 00142422..56203708 100644 --- a/icmp.c +++ b/icmp.c @@ -111,11 +111,18 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) flow_dbg(pingf, "echo reply to tap, ID: %"PRIu16", seq: %"PRIu16, ini->eport, seq); - if (pingf->f.type == FLOW_PING4) - tap_icmp4_send(c, sr.sa4.sin_addr, tap_ip4_daddr(c), buf, n); - else if (pingf->f.type == FLOW_PING6) - tap_icmp6_send(c, &sr.sa6.sin6_addr, - tap_ip6_daddr(c, &sr.sa6.sin6_addr), buf, n); + if (pingf->f.type == FLOW_PING4) { + const struct in_addr *saddr = inany_v4(&ini->faddr); + const struct in_addr *daddr = inany_v4(&ini->eaddr); + + ASSERT(saddr && daddr); /* Must have IPv4 addresses */ + tap_icmp4_send(c, *saddr, *daddr, buf, n); + } else if (pingf->f.type == FLOW_PING6) { + const struct in6_addr *saddr = &ini->faddr.a6; + const struct in6_addr *daddr = &ini->eaddr.a6; + + tap_icmp6_send(c, saddr, daddr, buf, n); + } return; unexpected: @@ -225,11 +232,12 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, const struct timespec *now) { - union sockaddr_inany sa = { .sa_family = af }; - const socklen_t sl = af == AF_INET ? sizeof(sa.sa4) : sizeof(sa.sa6); struct icmp_ping_flow *pingf, **id_sock; + const struct flowside *tgt; + union sockaddr_inany sa; size_t dlen, l4len; uint16_t id, seq; + socklen_t sl; void *pkt; (void)saddr; @@ -250,7 +258,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, id = ntohs(ih->un.echo.id); id_sock = &icmp_id_map[V4][id]; seq = ntohs(ih->un.echo.sequence); - sa.sa4.sin_addr = *(struct in_addr *)daddr; } else if (af == AF_INET6) { const struct icmp6hdr *ih; @@ -266,8 +273,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, id = ntohs(ih->icmp6_identifier); id_sock = &icmp_id_map[V6][id]; seq = ntohs(ih->icmp6_sequence); - sa.sa6.sin6_addr = *(struct in6_addr *)daddr; - sa.sa6.sin6_scope_id = c->ifi6; } else { ASSERT(0); } @@ -276,8 +281,10 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) return 1; + tgt = &pingf->f.side[TGTSIDE]; pingf->ts = now->tv_sec; + pif_sockaddr(c, &sa, &sl, PIF_HOST, &tgt->eaddr, 0); if (sendto(pingf->sock, pkt, l4len, MSG_NOSIGNAL, &sa.sa, sl) < 0) { flow_dbg(pingf, "failed to relay request to socket: %s", strerror(errno)); diff --git a/tap.c b/tap.c index ec994a2e..32a7b09c 100644 --- a/tap.c +++ b/tap.c @@ -90,17 +90,6 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len) tap_send_frames(c, iov, iovcnt, 1); } -/** - * tap_ip4_daddr() - Normal IPv4 destination address for inbound packets - * @c: Execution context - * - * Return: IPv4 address - */ -struct in_addr tap_ip4_daddr(const struct ctx *c) -{ - return c->ip4.addr_seen; -} - /** * tap_ip6_daddr() - Normal IPv6 destination address for inbound packets * @c: Execution context diff --git a/tap.h b/tap.h index d496bd0e..ec9e2ace 100644 --- a/tap.h +++ b/tap.h @@ -43,7 +43,6 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len) thdr->vnet_len = htonl(l2len); } -struct in_addr tap_ip4_daddr(const struct ctx *c); void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport, struct in_addr dst, in_port_t dport, const void *in, size_t dlen); -- 2.45.2
When we receive a ping packet from the tap interface, we currently locate the correct flow entry (if present) using an anciliary data structure, the icmp_id_map[] tables. However, we can look this up using the flow hash table - that's what it's for. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- icmp.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/icmp.c b/icmp.c index 56203708..b78712a1 100644 --- a/icmp.c +++ b/icmp.c @@ -141,6 +141,7 @@ static void icmp_ping_close(const struct ctx *c, epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL); close(pingf->sock); + flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE)); if (pingf->f.type == FLOW_PING4) icmp_id_map[V4][id] = NULL; @@ -205,6 +206,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, flow_dbg(pingf, "new socket %i for echo ID %"PRIu16, pingf->sock, id); + flow_hash_insert(c, FLOW_SIDX(pingf, INISIDE)); *id_sock = pingf; FLOW_ACTIVATE(pingf); @@ -237,6 +239,8 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, union sockaddr_inany sa; size_t dlen, l4len; uint16_t id, seq; + union flow *flow; + uint8_t proto; socklen_t sl; void *pkt; @@ -255,6 +259,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (ih->type != ICMP_ECHO) return 1; + proto = IPPROTO_ICMP; id = ntohs(ih->un.echo.id); id_sock = &icmp_id_map[V4][id]; seq = ntohs(ih->un.echo.sequence); @@ -270,6 +275,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (ih->icmp6_type != ICMPV6_ECHO_REQUEST) return 1; + proto = IPPROTO_ICMPV6; id = ntohs(ih->icmp6_identifier); id_sock = &icmp_id_map[V6][id]; seq = ntohs(ih->icmp6_sequence); @@ -277,11 +283,17 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, ASSERT(0); } - if (!(pingf = *id_sock)) - if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) - return 1; + flow = flow_at_sidx(flow_lookup_af(c, proto, PIF_TAP, + af, saddr, daddr, id, id)); + + if (flow) + pingf = &flow->ping; + else if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) + return 1; tgt = &pingf->f.side[TGTSIDE]; + + ASSERT(flow_proto[pingf->f.type] == proto); pingf->ts = now->tv_sec; pif_sockaddr(c, &sa, &sl, PIF_HOST, &tgt->eaddr, 0); -- 2.45.2
With previous reworks the icmp_id_map data structure is now maintained, but never used for anything. Eliminate it. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- icmp.c | 19 ++----------------- 1 file changed, 2 insertions(+), 17 deletions(-) diff --git a/icmp.c b/icmp.c index b78712a1..e2e697e7 100644 --- a/icmp.c +++ b/icmp.c @@ -47,9 +47,6 @@ #define PINGF(idx) (&(FLOW(idx)->ping)) -/* Indexed by ICMP echo identifier */ -static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS]; - /** * icmp_sock_handler() - Handle new data from ICMP or ICMPv6 socket * @c: Execution context @@ -137,22 +134,14 @@ unexpected: static void icmp_ping_close(const struct ctx *c, const struct icmp_ping_flow *pingf) { - uint16_t id = pingf->f.side[INISIDE].eport; - epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL); close(pingf->sock); flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE)); - - if (pingf->f.type == FLOW_PING4) - icmp_id_map[V4][id] = NULL; - else - icmp_id_map[V6][id] = NULL; } /** * icmp_ping_new() - Prepare a new ping socket for a new id * @c: Execution context - * @id_sock: Pointer to ping flow entry slot in icmp_id_map[] to update * @af: Address family, AF_INET or AF_INET6 * @id: ICMP id for the new socket * @saddr: Source address @@ -161,7 +150,6 @@ static void icmp_ping_close(const struct ctx *c, * Return: Newly opened ping flow, or NULL on failure */ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, - struct icmp_ping_flow **id_sock, sa_family_t af, uint16_t id, const void *saddr, const void *daddr) { @@ -207,7 +195,6 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, flow_dbg(pingf, "new socket %i for echo ID %"PRIu16, pingf->sock, id); flow_hash_insert(c, FLOW_SIDX(pingf, INISIDE)); - *id_sock = pingf; FLOW_ACTIVATE(pingf); @@ -234,7 +221,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, const struct timespec *now) { - struct icmp_ping_flow *pingf, **id_sock; + struct icmp_ping_flow *pingf; const struct flowside *tgt; union sockaddr_inany sa; size_t dlen, l4len; @@ -261,7 +248,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, proto = IPPROTO_ICMP; id = ntohs(ih->un.echo.id); - id_sock = &icmp_id_map[V4][id]; seq = ntohs(ih->un.echo.sequence); } else if (af == AF_INET6) { const struct icmp6hdr *ih; @@ -277,7 +263,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, proto = IPPROTO_ICMPV6; id = ntohs(ih->icmp6_identifier); - id_sock = &icmp_id_map[V6][id]; seq = ntohs(ih->icmp6_sequence); } else { ASSERT(0); @@ -288,7 +273,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (flow) pingf = &flow->ping; - else if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) + else if (!(pingf = icmp_ping_new(c, af, id, saddr, daddr))) return 1; tgt = &pingf->f.side[TGTSIDE]; -- 2.45.2
We have upcoming use cases where it's useful to create new bound socket based on information from the flow table. Add flowside_sock_l4() to do this for either PIF_HOST or PIF_SPLICE sockets. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ flow.h | 3 ++ util.c | 6 ++-- util.h | 3 ++ 4 files changed, 101 insertions(+), 3 deletions(-) diff --git a/flow.c b/flow.c index 6f09781a..2d0a8a32 100644 --- a/flow.c +++ b/flow.c @@ -5,9 +5,11 @@ * Tracking for logical "flows" of packets. */ +#include <errno.h> #include <stdint.h> #include <stdio.h> #include <unistd.h> +#include <sched.h> #include <string.h> #include "util.h" @@ -143,6 +145,96 @@ static void flowside_from_af(struct flowside *fside, sa_family_t af, fside->eport = eport; } +/** + * struct flowside_sock_args - Parameters for flowside_sock_splice() + * @c: Execution context + * @fd: Filled in with new socket fd + * @err: Filled in with errno if something failed + * @type: Socket epoll type + * @sa: Socket address + * @sl: Length of @sa + * @data: epoll reference data + */ +struct flowside_sock_args { + const struct ctx *c; + int fd; + int err; + enum epoll_type type; + const struct sockaddr *sa; + socklen_t sl; + const char *path; + uint32_t data; +}; + +/** flowside_sock_splice() - Create and bind socket for PIF_SPLICE based on flowside + * @arg: Argument as a struct flowside_sock_args + * + * Return: 0 + */ +static int flowside_sock_splice(void *arg) +{ + struct flowside_sock_args *a = arg; + + ns_enter(a->c); + + a->fd = sock_l4_sa(a->c, a->type, a->sa, a->sl, NULL, + a->sa->sa_family == AF_INET6, a->data); + a->err = errno; + + return 0; +} + +/** flowside_sock_l4() - Create and bind socket based on flowside + * @c: Execution context + * @type: Socket epoll type + * @pif: Interface for this socket + * @tgt: Target flowside + * @data: epoll reference portion for protocol handlers + * + * Return: socket fd of protocol @proto bound to the forwarding address and port + * from @tgt (if specified). + */ +/* cppcheck-suppress unusedFunction */ +int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, + const struct flowside *tgt, uint32_t data) +{ + const char *ifname = NULL; + union sockaddr_inany sa; + socklen_t sl; + + ASSERT(pif_is_socket(pif)); + + pif_sockaddr(c, &sa, &sl, pif, &tgt->faddr, tgt->fport); + + switch (pif) { + case PIF_HOST: + if (inany_is_loopback(&tgt->faddr)) + ifname = NULL; + else if (sa.sa_family == AF_INET) + ifname = c->ip4.ifname_out; + else if (sa.sa_family == AF_INET6) + ifname = c->ip6.ifname_out; + + return sock_l4_sa(c, type, &sa, sl, ifname, + sa.sa_family == AF_INET6, data); + + case PIF_SPLICE: { + struct flowside_sock_args args = { + .c = c, .type = type, + .sa = &sa.sa, .sl = sl, .data = data, + }; + NS_CALL(flowside_sock_splice, &args); + errno = args.err; + return args.fd; + } + + default: + /* If we add new socket pifs, they'll need to be implemented + * here */ + ASSERT(0); + } +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority diff --git a/flow.h b/flow.h index c3a15ca6..e27f99be 100644 --- a/flow.h +++ b/flow.h @@ -164,6 +164,9 @@ static inline bool flowside_eq(const struct flowside *left, left->fport == right->fport; } +int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, + const struct flowside *tgt, uint32_t data); + /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry diff --git a/util.c b/util.c index 9a73fbb9..f2994a79 100644 --- a/util.c +++ b/util.c @@ -44,9 +44,9 @@ * * Return: newly created socket, negative error code on failure */ -static int sock_l4_sa(const struct ctx *c, enum epoll_type type, - const void *sa, socklen_t sl, - const char *ifname, bool v6only, uint32_t data) +int sock_l4_sa(const struct ctx *c, enum epoll_type type, + const void *sa, socklen_t sl, + const char *ifname, bool v6only, uint32_t data) { sa_family_t af = ((const struct sockaddr *)sa)->sa_family; union epoll_ref ref = { .type = type, .data = data }; diff --git a/util.h b/util.h index d0150396..f2e4f8cf 100644 --- a/util.h +++ b/util.h @@ -144,6 +144,9 @@ struct ctx; /* cppcheck-suppress funcArgNamesDifferent */ __attribute__ ((weak)) int ffsl(long int i) { return __builtin_ffsl(i); } +int sock_l4_sa(const struct ctx *c, enum epoll_type type, + const void *sa, socklen_t sl, + const char *ifname, bool v6only, uint32_t data); int sock_l4(const struct ctx *c, sa_family_t af, enum epoll_type type, const void *bind_addr, const char *ifname, uint16_t port, uint32_t data); -- 2.45.2
On Fri, 5 Jul 2024 12:07:12 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:We have upcoming use cases where it's useful to create new bound socket based on information from the flow table. Add flowside_sock_l4() to do this for either PIF_HOST or PIF_SPLICE sockets. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ flow.h | 3 ++ util.c | 6 ++-- util.h | 3 ++ 4 files changed, 101 insertions(+), 3 deletions(-) diff --git a/flow.c b/flow.c index 6f09781a..2d0a8a32 100644 --- a/flow.c +++ b/flow.c @@ -5,9 +5,11 @@ * Tracking for logical "flows" of packets. */ +#include <errno.h> #include <stdint.h> #include <stdio.h> #include <unistd.h> +#include <sched.h> #include <string.h> #include "util.h" @@ -143,6 +145,96 @@ static void flowside_from_af(struct flowside *fside, sa_family_t af, fside->eport = eport; } +/** + * struct flowside_sock_args - Parameters for flowside_sock_splice() + * @c: Execution context + * @fd: Filled in with new socket fd + * @err: Filled in with errno if something failed + * @type: Socket epoll type + * @sa: Socket address + * @sl: Length of @sa + * @data: epoll reference data + */ +struct flowside_sock_args { + const struct ctx *c; + int fd; + int err; + enum epoll_type type; + const struct sockaddr *sa; + socklen_t sl; + const char *path; + uint32_t data; +}; + +/** flowside_sock_splice() - Create and bind socket for PIF_SPLICE based on flowside + * @arg: Argument as a struct flowside_sock_args + * + * Return: 0 + */ +static int flowside_sock_splice(void *arg) +{ + struct flowside_sock_args *a = arg; + + ns_enter(a->c); + + a->fd = sock_l4_sa(a->c, a->type, a->sa, a->sl, NULL,Nit: assuming you wanted the extra whitespace here to align the assignment with the one of a->err below, I'd rather write this (at least for consistency) as "a->fd = ...".+ a->sa->sa_family == AF_INET6, a->data); + a->err = errno;+ + return 0; +} + +/** flowside_sock_l4() - Create and bind socket based on flowside + * @c: Execution context + * @type: Socket epoll type + * @pif: Interface for this socket + * @tgt: Target flowside + * @data: epoll reference portion for protocol handlers + * + * Return: socket fd of protocol @proto bound to the forwarding address and port + * from @tgt (if specified). + */ +/* cppcheck-suppress unusedFunction */ +int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, + const struct flowside *tgt, uint32_t data) +{ + const char *ifname = NULL; + union sockaddr_inany sa; + socklen_t sl; + + ASSERT(pif_is_socket(pif)); + + pif_sockaddr(c, &sa, &sl, pif, &tgt->faddr, tgt->fport); + + switch (pif) { + case PIF_HOST: + if (inany_is_loopback(&tgt->faddr)) + ifname = NULL; + else if (sa.sa_family == AF_INET) + ifname = c->ip4.ifname_out; + else if (sa.sa_family == AF_INET6) + ifname = c->ip6.ifname_out; + + return sock_l4_sa(c, type, &sa, sl, ifname, + sa.sa_family == AF_INET6, data); + + case PIF_SPLICE: { + struct flowside_sock_args args = { + .c = c, .type = type, + .sa = &sa.sa, .sl = sl, .data = data, + }; + NS_CALL(flowside_sock_splice, &args); + errno = args.err; + return args.fd; + } + + default: + /* If we add new socket pifs, they'll need to be implemented + * here */For consistency: /* If we add new socket pifs, they'll need to be implemented * here */ there are a few occurrences in the next patches, not so important I guess, I can also do a pass later at some point.+ ASSERT(0); + } +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority diff --git a/flow.h b/flow.h index c3a15ca6..e27f99be 100644 --- a/flow.h +++ b/flow.h @@ -164,6 +164,9 @@ static inline bool flowside_eq(const struct flowside *left, left->fport == right->fport; } +int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, + const struct flowside *tgt, uint32_t data); + /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry diff --git a/util.c b/util.c index 9a73fbb9..f2994a79 100644 --- a/util.c +++ b/util.c @@ -44,9 +44,9 @@ * * Return: newly created socket, negative error code on failure */ -static int sock_l4_sa(const struct ctx *c, enum epoll_type type, - const void *sa, socklen_t sl, - const char *ifname, bool v6only, uint32_t data) +int sock_l4_sa(const struct ctx *c, enum epoll_type type, + const void *sa, socklen_t sl, + const char *ifname, bool v6only, uint32_t data) { sa_family_t af = ((const struct sockaddr *)sa)->sa_family; union epoll_ref ref = { .type = type, .data = data }; diff --git a/util.h b/util.h index d0150396..f2e4f8cf 100644 --- a/util.h +++ b/util.h @@ -144,6 +144,9 @@ struct ctx; /* cppcheck-suppress funcArgNamesDifferent */ __attribute__ ((weak)) int ffsl(long int i) { return __builtin_ffsl(i); } +int sock_l4_sa(const struct ctx *c, enum epoll_type type, + const void *sa, socklen_t sl, + const char *ifname, bool v6only, uint32_t data); int sock_l4(const struct ctx *c, sa_family_t af, enum epoll_type type, const void *bind_addr, const char *ifname, uint16_t port, uint32_t data);-- Stefano
On Wed, Jul 10, 2024 at 11:32:01PM +0200, Stefano Brivio wrote:On Fri, 5 Jul 2024 12:07:12 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Actually, I think it was just a boring old typo.We have upcoming use cases where it's useful to create new bound socket based on information from the flow table. Add flowside_sock_l4() to do this for either PIF_HOST or PIF_SPLICE sockets. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ flow.h | 3 ++ util.c | 6 ++-- util.h | 3 ++ 4 files changed, 101 insertions(+), 3 deletions(-) diff --git a/flow.c b/flow.c index 6f09781a..2d0a8a32 100644 --- a/flow.c +++ b/flow.c @@ -5,9 +5,11 @@ * Tracking for logical "flows" of packets. */ +#include <errno.h> #include <stdint.h> #include <stdio.h> #include <unistd.h> +#include <sched.h> #include <string.h> #include "util.h" @@ -143,6 +145,96 @@ static void flowside_from_af(struct flowside *fside, sa_family_t af, fside->eport = eport; } +/** + * struct flowside_sock_args - Parameters for flowside_sock_splice() + * @c: Execution context + * @fd: Filled in with new socket fd + * @err: Filled in with errno if something failed + * @type: Socket epoll type + * @sa: Socket address + * @sl: Length of @sa + * @data: epoll reference data + */ +struct flowside_sock_args { + const struct ctx *c; + int fd; + int err; + enum epoll_type type; + const struct sockaddr *sa; + socklen_t sl; + const char *path; + uint32_t data; +}; + +/** flowside_sock_splice() - Create and bind socket for PIF_SPLICE based on flowside + * @arg: Argument as a struct flowside_sock_args + * + * Return: 0 + */ +static int flowside_sock_splice(void *arg) +{ + struct flowside_sock_args *a = arg; + + ns_enter(a->c); + + a->fd = sock_l4_sa(a->c, a->type, a->sa, a->sl, NULL,Nit: assuming you wanted the extra whitespace here to align the assignment with the one of a->err below, I'd rather write this (at least for consistency) as "a->fd = ...".Done.+ a->sa->sa_family == AF_INET6, a->data); + a->err = errno;+ + return 0; +} + +/** flowside_sock_l4() - Create and bind socket based on flowside + * @c: Execution context + * @type: Socket epoll type + * @pif: Interface for this socket + * @tgt: Target flowside + * @data: epoll reference portion for protocol handlers + * + * Return: socket fd of protocol @proto bound to the forwarding address and port + * from @tgt (if specified). + */ +/* cppcheck-suppress unusedFunction */ +int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, + const struct flowside *tgt, uint32_t data) +{ + const char *ifname = NULL; + union sockaddr_inany sa; + socklen_t sl; + + ASSERT(pif_is_socket(pif)); + + pif_sockaddr(c, &sa, &sl, pif, &tgt->faddr, tgt->fport); + + switch (pif) { + case PIF_HOST: + if (inany_is_loopback(&tgt->faddr)) + ifname = NULL; + else if (sa.sa_family == AF_INET) + ifname = c->ip4.ifname_out; + else if (sa.sa_family == AF_INET6) + ifname = c->ip6.ifname_out; + + return sock_l4_sa(c, type, &sa, sl, ifname, + sa.sa_family == AF_INET6, data); + + case PIF_SPLICE: { + struct flowside_sock_args args = { + .c = c, .type = type, + .sa = &sa.sa, .sl = sl, .data = data, + }; + NS_CALL(flowside_sock_splice, &args); + errno = args.err; + return args.fd; + } + + default: + /* If we add new socket pifs, they'll need to be implemented + * here */For consistency: /* If we add new socket pifs, they'll need to be implemented * here */there are a few occurrences in the next patches, not so important I guess, I can also do a pass later at some point.-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson+ ASSERT(0); + } +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority diff --git a/flow.h b/flow.h index c3a15ca6..e27f99be 100644 --- a/flow.h +++ b/flow.h @@ -164,6 +164,9 @@ static inline bool flowside_eq(const struct flowside *left, left->fport == right->fport; } +int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, + const struct flowside *tgt, uint32_t data); + /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry diff --git a/util.c b/util.c index 9a73fbb9..f2994a79 100644 --- a/util.c +++ b/util.c @@ -44,9 +44,9 @@ * * Return: newly created socket, negative error code on failure */ -static int sock_l4_sa(const struct ctx *c, enum epoll_type type, - const void *sa, socklen_t sl, - const char *ifname, bool v6only, uint32_t data) +int sock_l4_sa(const struct ctx *c, enum epoll_type type, + const void *sa, socklen_t sl, + const char *ifname, bool v6only, uint32_t data) { sa_family_t af = ((const struct sockaddr *)sa)->sa_family; union epoll_ref ref = { .type = type, .data = data }; diff --git a/util.h b/util.h index d0150396..f2e4f8cf 100644 --- a/util.h +++ b/util.h @@ -144,6 +144,9 @@ struct ctx; /* cppcheck-suppress funcArgNamesDifferent */ __attribute__ ((weak)) int ffsl(long int i) { return __builtin_ffsl(i); } +int sock_l4_sa(const struct ctx *c, enum epoll_type type, + const void *sa, socklen_t sl, + const char *ifname, bool v6only, uint32_t data); int sock_l4(const struct ctx *c, sa_family_t af, enum epoll_type type, const void *bind_addr, const char *ifname, uint16_t port, uint32_t data);
On Wed, Jul 10, 2024 at 11:32:01PM +0200, Stefano Brivio wrote:On Fri, 5 Jul 2024 12:07:12 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote: > We have upcoming use cases where it's useful to create new bound socket > based on information from the flow table. Add flowside_sock_l4() to do > this for either PIF_HOST or PIF_SPLICE sockets. > > Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>[snip]I did some grepping, and I think I squashed the ones added by this series. There are also some instances that are already merged. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson+ default: + /* If we add new socket pifs, they'll need to be implemented + * here */For consistency: /* If we add new socket pifs, they'll need to be implemented * here */ there are a few occurrences in the next patches, not so important I guess, I can also do a pass later at some point.
For now when we forward a ping to the host we leave the host side forwarding address and port blank since we don't necessarily know what source address and id will be used by the kernel. When the outbound address option is active, though, we do know the address at least, so we can record it in the flowside. Having done that, use it as the primary source of truth, binding the outgoing socket based on the information in there. This allows the possibility of more complex rules for what outbound address and/or id we use in future. To implement this we create a new helper which sets up a new socket based on information in a flowside, which will also have future uses. It behaves slightly differently from the existing ICMP code, in that it doesn't bind to a specific interface if given a loopback address. This is logically correct - the loopback address means we need to operate through the host's loopback interface, not ifname_out. We didn't need it in ICMP because ICMP will never generate a loopback address at this point, however we intend to change that in future. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 1 - icmp.c | 23 ++++++++++------------- 2 files changed, 10 insertions(+), 14 deletions(-) diff --git a/flow.c b/flow.c index 2d0a8a32..cf952a32 100644 --- a/flow.c +++ b/flow.c @@ -194,7 +194,6 @@ static int flowside_sock_splice(void *arg) * Return: socket fd of protocol @proto bound to the forwarding address and port * from @tgt (if specified). */ -/* cppcheck-suppress unusedFunction */ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, const struct flowside *tgt, uint32_t data) { diff --git a/icmp.c b/icmp.c index e2e697e7..29a66e6f 100644 --- a/icmp.c +++ b/icmp.c @@ -157,30 +157,27 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, union epoll_ref ref = { .type = EPOLL_TYPE_PING }; union flow *flow = flow_alloc(); struct icmp_ping_flow *pingf; + const struct flowside *tgt; const void *bind_addr; - const char *bind_if; if (!flow) return NULL; flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); - /* FIXME: Record outbound source address when known */ - flow_target_af(flow, PIF_HOST, af, NULL, 0, daddr, 0); - pingf = FLOW_SET_TYPE(flow, flowtype, ping); - - pingf->seq = -1; - if (af == AF_INET) { + if (af == AF_INET) bind_addr = &c->ip4.addr_out; - bind_if = c->ip4.ifname_out; - } else { + else if (af == AF_INET6) bind_addr = &c->ip6.addr_out; - bind_if = c->ip6.ifname_out; - } + + tgt = flow_target_af(flow, PIF_HOST, af, bind_addr, 0, daddr, 0); + pingf = FLOW_SET_TYPE(flow, flowtype, ping); + + pingf->seq = -1; ref.flowside = FLOW_SIDX(flow, TGTSIDE); - pingf->sock = sock_l4(c, af, EPOLL_TYPE_PING, bind_addr, bind_if, - 0, ref.data); + pingf->sock = flowside_sock_l4(c, EPOLL_TYPE_PING, PIF_HOST, + tgt, ref.data); if (pingf->sock < 0) { warn("Cannot open \"ping\" socket. You might need to:"); -- 2.45.2
Currently the code to translate host side addresses and ports to guest side addresses and ports, and vice versa, is scattered across the TCP code. This includes both port redirection as controlled by the -t and -T options, and our special case NAT controlled by the --no-map-gw option. Gather this logic into fwd_nat_from_*() functions for each input interface in fwd.c which take protocol and address information for the initiating side and generates the pif and address information for the forwarded side. This performs any NAT or port forwarding needed. We create a flow_target() helper which applies those forwarding functions as needed to automatically move a flow from INI to TGT state. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 53 +++++++++++++++++++ flow_table.h | 2 + fwd.c | 146 +++++++++++++++++++++++++++++++++++++++++++++++++++ fwd.h | 9 ++++ tcp.c | 103 ++++++++++-------------------------- tcp_splice.c | 64 ++-------------------- tcp_splice.h | 5 +- 7 files changed, 243 insertions(+), 139 deletions(-) diff --git a/flow.c b/flow.c index cf952a32..c638fc18 100644 --- a/flow.c +++ b/flow.c @@ -399,6 +399,59 @@ const struct flowside *flow_target_af(union flow *flow, uint8_t pif, return tgt; } + +/** + * flow_target() - Determine where flow should forward to, and move to TGT + * @c: Execution context + * @flow: Flow to forward + * @proto: Protocol + * + * Return: pointer to the target flowside information + */ +const struct flowside *flow_target(const struct ctx *c, union flow *flow, + uint8_t proto) +{ + char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + struct flow_common *f = &flow->f; + const struct flowside *ini = &f->side[INISIDE]; + struct flowside *tgt = &f->side[TGTSIDE]; + uint8_t tgtpif = PIF_NONE; + + ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); + ASSERT(f->type == FLOW_TYPE_NONE); + ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); + ASSERT(flow->f.state == FLOW_STATE_INI); + + switch (f->pif[INISIDE]) { + case PIF_TAP: + tgtpif = fwd_nat_from_tap(c, proto, ini, tgt); + break; + + case PIF_SPLICE: + tgtpif = fwd_nat_from_splice(c, proto, ini, tgt); + break; + + case PIF_HOST: + tgtpif = fwd_nat_from_host(c, proto, ini, tgt); + break; + + default: + flow_err(flow, "No rules to forward %s [%s]:%hu -> [%s]:%hu", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport); + } + + if (tgtpif == PIF_NONE) + return NULL; + + f->pif[TGTSIDE] = tgtpif; + flow_set_state(f, FLOW_STATE_TGT); + return tgt; +} + /** * flow_set_type() - Set type and move to TYPED * @flow: Flow to change state diff --git a/flow_table.h b/flow_table.h index 00dca4b2..457f27b1 100644 --- a/flow_table.h +++ b/flow_table.h @@ -118,6 +118,8 @@ const struct flowside *flow_target_af(union flow *flow, uint8_t pif, sa_family_t af, const void *saddr, in_port_t sport, const void *daddr, in_port_t dport); +const struct flowside *flow_target(const struct ctx *c, union flow *flow, + uint8_t proto); union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/fwd.c b/fwd.c index d3f17988..5731a536 100644 --- a/fwd.c +++ b/fwd.c @@ -25,6 +25,7 @@ #include "fwd.h" #include "passt.h" #include "lineread.h" +#include "flow_table.h" /* See enum in kernel's include/net/tcp_states.h */ #define UDP_LISTEN 0x07 @@ -154,3 +155,148 @@ void fwd_scan_ports_init(struct ctx *c) &c->tcp.fwd_out, &c->tcp.fwd_in); } } + +/** + * fwd_nat_from_tap() - Determine to forward a flow from the tap interface + * @c: Execution context + * @proto: Protocol (IP L4 protocol number) + * @ini: Flow address information of the initiating side + * @tgt: Flow address information on the target side (updated) + * + * Return: pif of the target interface to forward the flow to, PIF_NONE if the + * flow cannot or should not be forwarded at all. + */ +uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt) +{ + (void)proto; + + tgt->eaddr = ini->faddr; + tgt->eport = ini->fport; + + if (!c->no_map_gw) { + if (inany_equals4(&tgt->eaddr, &c->ip4.gw)) + tgt->eaddr = inany_loopback4; + else if (inany_equals6(&tgt->eaddr, &c->ip6.gw)) + tgt->eaddr = inany_loopback6; + } + + /* The relevant addr_out controls the host side source address. This + * may be unspecified, which allows the kernel to pick an address. + */ + if (inany_v4(&tgt->eaddr)) + tgt->faddr = inany_from_v4(c->ip4.addr_out); + else + tgt->faddr.a6 = c->ip6.addr_out; + + /* Let the kernel pick a host side source port */ + tgt->fport = 0; + + return PIF_HOST; +} + +/** + * fwd_nat_from_splice() - Determine to forward a flow from the splice interface + * @c: Execution context + * @proto: Protocol (IP L4 protocol number) + * @ini: Flow address information of the initiating side + * @tgt: Flow address information on the target side (updated) + * + * Return: pif of the target interface to forward the flow to, PIF_NONE if the + * flow cannot or should not be forwarded at all. + */ +uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt) +{ + if (!inany_is_loopback(&ini->eaddr) || + (!inany_is_loopback(&ini->faddr) && !inany_is_unspecified(&ini->faddr))) { + char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + + debug("Non loopback address on %s: [%s]:%hu -> [%s]:%hu", + pif_name(PIF_SPLICE), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), ini->fport); + return PIF_NONE; + } + + if (inany_v4(&ini->eaddr)) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_loopback6; + + /* Preserve the specific loopback adddress used, but let the kernel pick + * a source port on the target side */ + tgt->faddr = ini->eaddr; + tgt->fport = 0; + + tgt->eport = ini->fport; + if (proto == IPPROTO_TCP) + tgt->eport += c->tcp.fwd_out.delta[tgt->eport]; + + /* Let the kernel pick a host side source port */ + tgt->fport = 0; + + return PIF_HOST; +} + +/** + * fwd_nat_from_host() - Determine to forward a flow from the host interface + * @c: Execution context + * @proto: Protocol (IP L4 protocol number) + * @ini: Flow address information of the initiating side + * @tgt: Flow address information on the target side (updated) + * + * Return: pif of the target interface to forward the flow to, PIF_NONE if the + * flow cannot or should not be forwarded at all. + */ +uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt) +{ + /* Common for spliced and non-spliced cases */ + tgt->eport = ini->fport; + if (proto == IPPROTO_TCP) + tgt->eport += c->tcp.fwd_in.delta[tgt->eport]; + + if (c->mode == MODE_PASTA && inany_is_loopback(&ini->eaddr) && + proto == IPPROTO_TCP) { + /* spliceable */ + + /* Preserve the specific loopback adddress used, but let the + * kernel pick a source port on the target side */ + tgt->faddr = ini->eaddr; + tgt->fport = 0; + + if (inany_v4(&ini->eaddr)) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_loopback6; + return PIF_SPLICE; + } + + tgt->faddr = ini->eaddr; + tgt->fport = ini->eport; + + if (inany_is_loopback4(&tgt->faddr) || + inany_is_unspecified4(&tgt->faddr) || + inany_equals4(&tgt->faddr, &c->ip4.addr_seen)) { + tgt->faddr = inany_from_v4(c->ip4.gw); + } else if (inany_is_loopback6(&tgt->faddr) || + inany_equals6(&tgt->faddr, &c->ip6.addr_seen) || + inany_equals6(&tgt->faddr, &c->ip6.addr)) { + if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw)) + tgt->faddr.a6 = c->ip6.gw; + else + tgt->faddr.a6 = c->ip6.addr_ll; + } + + if (inany_v4(&tgt->faddr)) { + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + } else { + if (inany_is_linklocal6(&tgt->faddr)) + tgt->eaddr.a6 = c->ip6.addr_ll_seen; + else + tgt->eaddr.a6 = c->ip6.addr_seen; + } + + return PIF_TAP; +} diff --git a/fwd.h b/fwd.h index 41645d7f..b4aa8d57 100644 --- a/fwd.h +++ b/fwd.h @@ -7,6 +7,8 @@ #ifndef FWD_H #define FWD_H +struct flowside; + /* Number of ports for both TCP and UDP */ #define NUM_PORTS (1U << 16) @@ -42,4 +44,11 @@ void fwd_scan_ports_udp(struct fwd_ports *fwd, const struct fwd_ports *rev, const struct fwd_ports *tcp_rev); void fwd_scan_ports_init(struct ctx *c); +uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt); +uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt); +uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt); + #endif /* FWD_H */ diff --git a/tcp.c b/tcp.c index 4d3ffe57..e89b12b3 100644 --- a/tcp.c +++ b/tcp.c @@ -1423,7 +1423,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); - union inany_addr srcaddr, dstaddr; /* FIXME: Avoid bulky temporaries */ const struct flowside *ini, *tgt; struct tcp_tap_conn *conn; union sockaddr_inany sa; @@ -1438,34 +1437,16 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, ini = flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); - dstaddr = ini->faddr; - if (!c->no_map_gw) { - if (inany_equals4(&dstaddr, &c->ip4.gw)) - dstaddr = inany_loopback4; - else if (inany_equals6(&dstaddr, &c->ip6.gw)) - dstaddr = inany_loopback6; - - } + if (!(tgt = flow_target(c, flow, IPPROTO_TCP))) + goto cancel; - if (inany_is_linklocal6(&dstaddr)) { - srcaddr.a6 = c->ip6.addr_ll; - } else if (inany_is_loopback(&dstaddr)) { - srcaddr = dstaddr; - } else if (inany_v4(&dstaddr)) { - if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.addr_out)) - srcaddr = inany_from_v4(c->ip4.addr_out); - else - srcaddr = inany_any4; - } else { - if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr_out)) - srcaddr.a6 = c->ip6.addr_out; - else - srcaddr = inany_any6; + if (flow->f.pif[TGTSIDE] != PIF_HOST) { + flow_err(flow, "No support for forwarding TCP from %s to %s", + pif_name(flow->f.pif[INISIDE]), + pif_name(flow->f.pif[TGTSIDE])); + goto cancel; } - tgt = flow_target_af(flow, PIF_HOST, AF_INET6, - &srcaddr, 0, /* Kernel decides source port */ - &dstaddr, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0 || @@ -2003,63 +1984,20 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn) conn_flag(c, conn, ACK_FROM_TAP_DUE); } -/** - * tcp_snat_inbound() - Translate source address for inbound data if needed - * @c: Execution context - * @addr: Source address of inbound packet/connection - */ -static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) -{ - if (inany_is_loopback4(addr) || - inany_is_unspecified4(addr) || - inany_equals4(addr, &c->ip4.addr_seen)) { - *addr = inany_from_v4(c->ip4.gw); - } else if (inany_is_loopback6(addr) || - inany_equals6(addr, &c->ip6.addr_seen) || - inany_equals6(addr, &c->ip6.addr)) { - if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw)) - addr->a6 = c->ip6.gw; - else - addr->a6 = c->ip6.addr_ll; - } -} - /** * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context - * @dstport: Destination port for connection (host side) * @flow: flow to initialise * @s: Accepted socket * @sa: Peer socket address (from accept()) * @now: Current timestamp */ -static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, - union flow *flow, int s, - const union sockaddr_inany *sa, +static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s, const struct timespec *now) { - union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */ - struct tcp_tap_conn *conn; - in_port_t srcport; + struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); uint64_t hash; - inany_from_sockaddr(&saddr, &srcport, sa); - tcp_snat_inbound(c, &saddr); - - if (inany_v4(&saddr)) { - daddr = inany_from_v4(c->ip4.addr_seen); - } else { - if (inany_is_linklocal6(&saddr)) - daddr.a6 = c->ip6.addr_ll_seen; - else - daddr.a6 = c->ip6.addr_seen; - } - dstport += c->tcp.fwd_in.delta[dstport]; - - flow_target_af(flow, PIF_TAP, AF_INET6, - &saddr, srcport, &daddr, dstport); - conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); - conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; @@ -2115,11 +2053,26 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, goto cancel; } - if (tcp_splice_conn_from_sock(c, ref.tcp_listen.pif, - ref.tcp_listen.port, flow, s, &sa)) - return; + if (!flow_target(c, flow, IPPROTO_TCP)) + goto cancel; + + switch (flow->f.pif[TGTSIDE]) { + case PIF_SPLICE: + case PIF_HOST: + tcp_splice_conn_from_sock(c, flow, s); + break; + + case PIF_TAP: + tcp_tap_conn_from_sock(c, flow, s, now); + break; + + default: + flow_err(flow, "No support for forwarding TCP from %s to %s", + pif_name(flow->f.pif[INISIDE]), + pif_name(flow->f.pif[TGTSIDE])); + goto cancel; + } - tcp_tap_conn_from_sock(c, ref.tcp_listen.port, flow, s, &sa, now); return; cancel: diff --git a/tcp_splice.c b/tcp_splice.c index c45fd55a..28f01b59 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -394,72 +394,18 @@ static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af) /** * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection * @c: Execution context - * @pif0: pif id of side 0 - * @dstport: Side 0 destination port of connection * @flow: flow to initialise * @s0: Accepted (side 0) socket * @sa: Peer address of connection * - * Return: true if able to create a spliced connection, false otherwise * #syscalls:pasta setsockopt */ -bool tcp_splice_conn_from_sock(const struct ctx *c, - uint8_t pif0, in_port_t dstport, - union flow *flow, int s0, - const union sockaddr_inany *sa) +void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0) { - struct tcp_splice_conn *conn; - union inany_addr src; - in_port_t srcport; - sa_family_t af; - uint8_t tgtpif; + struct tcp_splice_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, + tcp_splice); - if (c->mode != MODE_PASTA) - return false; - - inany_from_sockaddr(&src, &srcport, sa); - af = inany_v4(&src) ? AF_INET : AF_INET6; - - switch (pif0) { - case PIF_SPLICE: - if (!inany_is_loopback(&src)) { - char str[INANY_ADDRSTRLEN]; - - /* We can't use flow_err() etc. because we haven't set - * the flow type yet - */ - warn("Bad source address %s for splice, closing", - inany_ntop(&src, str, sizeof(str))); - - /* We *don't* want to fall back to tap */ - flow_alloc_cancel(flow); - return true; - } - - tgtpif = PIF_HOST; - dstport += c->tcp.fwd_out.delta[dstport]; - break; - - case PIF_HOST: - if (!inany_is_loopback(&src)) - return false; - - tgtpif = PIF_SPLICE; - dstport += c->tcp.fwd_in.delta[dstport]; - break; - - default: - return false; - } - - /* FIXME: Record outbound source address when known */ - if (af == AF_INET) - flow_target_af(flow, tgtpif, AF_INET, - NULL, 0, &in4addr_loopback, dstport); - else - flow_target_af(flow, tgtpif, AF_INET6, - NULL, 0, &in6addr_loopback, dstport); - conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice); + ASSERT(c->mode == MODE_PASTA); conn->s[0] = s0; conn->s[1] = -1; @@ -473,8 +419,6 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, conn_flag(c, conn, CLOSING); FLOW_ACTIVATE(conn); - - return true; } /** diff --git a/tcp_splice.h b/tcp_splice.h index ed8f0c58..a20f3e21 100644 --- a/tcp_splice.h +++ b/tcp_splice.h @@ -11,10 +11,7 @@ union sockaddr_inany; void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events); -bool tcp_splice_conn_from_sock(const struct ctx *c, - uint8_t pif0, in_port_t dstport, - union flow *flow, int s0, - const union sockaddr_inany *sa); +void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0); void tcp_splice_init(struct ctx *c); #endif /* TCP_SPLICE_H */ -- 2.45.2
Current ICMP hard codes its forwarding rules, and never applies any translations. Change it to use the flow_target() function, so that it's translated the same as TCP (excluding TCP specific port redirection). This means that gw mapping now applies to ICMP so "ping <gw address>" will now ping the host's loopback instead of the actual gw machine. This removes the surprising behaviour that the target you ping might not be the same as you connect to with TCP. This removes the last user of flow_target_af(), so that's removed as well. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 32 -------------------------------- icmp.c | 16 ++++++++++------ 2 files changed, 10 insertions(+), 38 deletions(-) diff --git a/flow.c b/flow.c index c638fc18..218033ae 100644 --- a/flow.c +++ b/flow.c @@ -368,38 +368,6 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, return ini; } -/** - * flow_target_af() - Move flow to TGT, setting TGTSIDE details - * @flow: Flow to change state - * @pif: pif of the target side - * @af: Address family for @eaddr and @faddr - * @saddr: Source address (pointer to in_addr or in6_addr) - * @sport: Endpoint port - * @daddr: Destination address (pointer to in_addr or in6_addr) - * @dport: Destination port - * - * Return: pointer to the target flowside information - */ -const struct flowside *flow_target_af(union flow *flow, uint8_t pif, - sa_family_t af, - const void *saddr, in_port_t sport, - const void *daddr, in_port_t dport) -{ - struct flow_common *f = &flow->f; - struct flowside *tgt = &f->side[TGTSIDE]; - - ASSERT(pif != PIF_NONE); - ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); - ASSERT(f->type == FLOW_TYPE_NONE); - ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); - - flowside_from_af(tgt, af, daddr, dport, saddr, sport); - f->pif[TGTSIDE] = pif; - flow_set_state(f, FLOW_STATE_TGT); - return tgt; -} - - /** * flow_target() - Determine where flow should forward to, and move to TGT * @c: Execution context diff --git a/icmp.c b/icmp.c index 29a66e6f..4158a3fe 100644 --- a/icmp.c +++ b/icmp.c @@ -153,24 +153,28 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, sa_family_t af, uint16_t id, const void *saddr, const void *daddr) { + uint8_t proto = af == AF_INET ? IPPROTO_ICMP : IPPROTO_ICMPV6; uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6; union epoll_ref ref = { .type = EPOLL_TYPE_PING }; union flow *flow = flow_alloc(); struct icmp_ping_flow *pingf; const struct flowside *tgt; - const void *bind_addr; if (!flow) return NULL; flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); + if (!(tgt = flow_target(c, flow, proto))) + goto cancel; - if (af == AF_INET) - bind_addr = &c->ip4.addr_out; - else if (af == AF_INET6) - bind_addr = &c->ip6.addr_out; + if (flow->f.pif[TGTSIDE] != PIF_HOST) { + flow_err(flow, "No support for forwarding %s from %s to %s", + proto == IPPROTO_ICMP ? "ICMP" : "ICMPv6", + pif_name(flow->f.pif[INISIDE]), + pif_name(flow->f.pif[TGTSIDE])); + goto cancel; + } - tgt = flow_target_af(flow, PIF_HOST, af, bind_addr, 0, daddr, 0); pingf = FLOW_SET_TYPE(flow, flowtype, ping); pingf->seq = -1; -- 2.45.2
Add logic to the fwd_nat_from_*() functions to forwarding UDP packets. The logic here doesn't exactly match our current forwarding, since our current forwarding has some very strange and buggy edge cases. Instead it's attempting to replicate what appears to be the intended logic behind the current forwarding. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- fwd.c | 26 ++++++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/fwd.c b/fwd.c index 5731a536..4377de44 100644 --- a/fwd.c +++ b/fwd.c @@ -169,12 +169,15 @@ void fwd_scan_ports_init(struct ctx *c) uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, const struct flowside *ini, struct flowside *tgt) { - (void)proto; - tgt->eaddr = ini->faddr; tgt->eport = ini->fport; - if (!c->no_map_gw) { + if (proto == IPPROTO_UDP && tgt->eport == 53) { + if (inany_equals4(&tgt->eaddr, &c->ip4.dns_match)) + tgt->eaddr = inany_from_v4(c->ip4.dns_host); + else if (inany_equals6(&tgt->eaddr, &c->ip6.dns_match)) + tgt->eaddr.a6 = c->ip6.dns_host; + } else if (!c->no_map_gw) { if (inany_equals4(&tgt->eaddr, &c->ip4.gw)) tgt->eaddr = inany_loopback4; else if (inany_equals6(&tgt->eaddr, &c->ip6.gw)) @@ -191,6 +194,10 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, /* Let the kernel pick a host side source port */ tgt->fport = 0; + if (proto == IPPROTO_UDP) { + /* But for UDP we preserve the source port */ + tgt->fport = ini->eport; + } return PIF_HOST; } @@ -232,9 +239,14 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, tgt->eport = ini->fport; if (proto == IPPROTO_TCP) tgt->eport += c->tcp.fwd_out.delta[tgt->eport]; + else if (proto == IPPROTO_UDP) + tgt->eport += c->udp.fwd_out.f.delta[tgt->eport]; /* Let the kernel pick a host side source port */ tgt->fport = 0; + if (proto == IPPROTO_UDP) + /* But for UDP preserve the source port */ + tgt->fport = ini->eport; return PIF_HOST; } @@ -256,20 +268,26 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, tgt->eport = ini->fport; if (proto == IPPROTO_TCP) tgt->eport += c->tcp.fwd_in.delta[tgt->eport]; + else if (proto == IPPROTO_UDP) + tgt->eport += c->udp.fwd_in.f.delta[tgt->eport]; if (c->mode == MODE_PASTA && inany_is_loopback(&ini->eaddr) && - proto == IPPROTO_TCP) { + (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */ /* Preserve the specific loopback adddress used, but let the * kernel pick a source port on the target side */ tgt->faddr = ini->eaddr; tgt->fport = 0; + if (proto == IPPROTO_UDP) + /* But for UDP preserve the source port */ + tgt->fport = ini->eport; if (inany_v4(&ini->eaddr)) tgt->eaddr = inany_loopback4; else tgt->eaddr = inany_loopback6; + return PIF_SPLICE; } -- 2.45.2
On Fri, 5 Jul 2024 12:07:16 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Add logic to the fwd_nat_from_*() functions to forwarding UDP packets. The logic here doesn't exactly match our current forwarding, since our current forwarding has some very strange and buggy edge cases. Instead it's attempting to replicate what appears to be the intended logic behind the current forwarding. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- fwd.c | 26 ++++++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/fwd.c b/fwd.c index 5731a536..4377de44 100644 --- a/fwd.c +++ b/fwd.c @@ -169,12 +169,15 @@ void fwd_scan_ports_init(struct ctx *c) uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, const struct flowside *ini, struct flowside *tgt) { - (void)proto; - tgt->eaddr = ini->faddr; tgt->eport = ini->fport; - if (!c->no_map_gw) { + if (proto == IPPROTO_UDP && tgt->eport == 53) { + if (inany_equals4(&tgt->eaddr, &c->ip4.dns_match)) + tgt->eaddr = inany_from_v4(c->ip4.dns_host); + else if (inany_equals6(&tgt->eaddr, &c->ip6.dns_match)) + tgt->eaddr.a6 = c->ip6.dns_host; + } else if (!c->no_map_gw) {There's a subtle difference here compared to the logic you dropped in 23/27 (udp_tap_handler()), which doesn't look correct to me. Earlier, with neither c->ip4.dns_match nor c->ip6.dns_match matching, we would let UDP traffic directed to port 53 be mapped to the host, if (!c->no_map_gw). That is, the logic was rather equivalent to this: if (proto == IPPROTO_UDP && tgt->eport == 53 && (inany_equals4(&tgt->eaddr, &c->ip4.dns_match) || inany_equals6(&tgt->eaddr, &c->ip6.dns_match)) { if (inany_equals4(&tgt->eaddr, &c->ip4.dns_match)) tgt->eaddr = inany_from_v4(c->ip4.dns_host); else if (inany_equals6(&tgt->eaddr, &c->ip6.dns_match)) tgt->eaddr.a6 = c->ip6.dns_host; } else if (!c->no_map_gw) { ... and I think we should maintain it, because if dns_match doesn't match, DNS traffic considerations shouldn't affect NAT decisions at all.if (inany_equals4(&tgt->eaddr, &c->ip4.gw)) tgt->eaddr = inany_loopback4; else if (inany_equals6(&tgt->eaddr, &c->ip6.gw)) @@ -191,6 +194,10 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, /* Let the kernel pick a host side source port */ tgt->fport = 0; + if (proto == IPPROTO_UDP) { + /* But for UDP we preserve the source port */ + tgt->fport = ini->eport; + } return PIF_HOST; } @@ -232,9 +239,14 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, tgt->eport = ini->fport; if (proto == IPPROTO_TCP) tgt->eport += c->tcp.fwd_out.delta[tgt->eport]; + else if (proto == IPPROTO_UDP) + tgt->eport += c->udp.fwd_out.f.delta[tgt->eport]; /* Let the kernel pick a host side source port */ tgt->fport = 0; + if (proto == IPPROTO_UDP) + /* But for UDP preserve the source port */ + tgt->fport = ini->eport; return PIF_HOST; } @@ -256,20 +268,26 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, tgt->eport = ini->fport; if (proto == IPPROTO_TCP) tgt->eport += c->tcp.fwd_in.delta[tgt->eport]; + else if (proto == IPPROTO_UDP) + tgt->eport += c->udp.fwd_in.f.delta[tgt->eport]; if (c->mode == MODE_PASTA && inany_is_loopback(&ini->eaddr) && - proto == IPPROTO_TCP) { + (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */ /* Preserve the specific loopback adddress used, but let the * kernel pick a source port on the target side */ tgt->faddr = ini->eaddr; tgt->fport = 0; + if (proto == IPPROTO_UDP) + /* But for UDP preserve the source port */ + tgt->fport = ini->eport; if (inany_v4(&ini->eaddr)) tgt->eaddr = inany_loopback4; else tgt->eaddr = inany_loopback6; + return PIF_SPLICE; }-- Stefano
On Mon, Jul 08, 2024 at 11:26:55PM +0200, Stefano Brivio wrote:On Fri, 5 Jul 2024 12:07:16 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Good catch, I've adjusted that. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibsonAdd logic to the fwd_nat_from_*() functions to forwarding UDP packets. The logic here doesn't exactly match our current forwarding, since our current forwarding has some very strange and buggy edge cases. Instead it's attempting to replicate what appears to be the intended logic behind the current forwarding. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- fwd.c | 26 ++++++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/fwd.c b/fwd.c index 5731a536..4377de44 100644 --- a/fwd.c +++ b/fwd.c @@ -169,12 +169,15 @@ void fwd_scan_ports_init(struct ctx *c) uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, const struct flowside *ini, struct flowside *tgt) { - (void)proto; - tgt->eaddr = ini->faddr; tgt->eport = ini->fport; - if (!c->no_map_gw) { + if (proto == IPPROTO_UDP && tgt->eport == 53) { + if (inany_equals4(&tgt->eaddr, &c->ip4.dns_match)) + tgt->eaddr = inany_from_v4(c->ip4.dns_host); + else if (inany_equals6(&tgt->eaddr, &c->ip6.dns_match)) + tgt->eaddr.a6 = c->ip6.dns_host; + } else if (!c->no_map_gw) {There's a subtle difference here compared to the logic you dropped in 23/27 (udp_tap_handler()), which doesn't look correct to me. Earlier, with neither c->ip4.dns_match nor c->ip6.dns_match matching, we would let UDP traffic directed to port 53 be mapped to the host, if (!c->no_map_gw). That is, the logic was rather equivalent to this: if (proto == IPPROTO_UDP && tgt->eport == 53 && (inany_equals4(&tgt->eaddr, &c->ip4.dns_match) || inany_equals6(&tgt->eaddr, &c->ip6.dns_match)) { if (inany_equals4(&tgt->eaddr, &c->ip4.dns_match)) tgt->eaddr = inany_from_v4(c->ip4.dns_host); else if (inany_equals6(&tgt->eaddr, &c->ip6.dns_match)) tgt->eaddr.a6 = c->ip6.dns_host; } else if (!c->no_map_gw) { ... and I think we should maintain it, because if dns_match doesn't match, DNS traffic considerations shouldn't affect NAT decisions at all.
This implements the first steps of tracking UDP packets with the flow table rather than it's own (buggy) set of port maps. Specifically we create flow table entries for datagrams received from a socket (PIF_HOST or PIF_SPLICE). When splitting datagrams from sockets into batches, we group by the flow as well as splicesrc. This may result in smaller batches, but makes things easier down the line. We can re-optimise this later if necessary. For now we don't do anything else with the flow, not even match reply packets to the same flow. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- Makefile | 2 +- flow.c | 31 ++++++++++ flow.h | 4 ++ flow_table.h | 14 +++++ udp.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++++-- udp_flow.h | 25 ++++++++ 6 files changed, 240 insertions(+), 5 deletions(-) create mode 100644 udp_flow.h diff --git a/Makefile b/Makefile index 09fc461d..92cbd5a6 100644 --- a/Makefile +++ b/Makefile @@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ - udp.h util.h + udp.h udp_flow.h util.h HEADERS = $(PASST_HEADERS) seccomp.h C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 }; diff --git a/flow.c b/flow.c index 218033ae..0cb9495b 100644 --- a/flow.c +++ b/flow.c @@ -37,6 +37,7 @@ const char *flow_type_str[] = { [FLOW_TCP_SPLICE] = "TCP connection (spliced)", [FLOW_PING4] = "ICMP ping sequence", [FLOW_PING6] = "ICMPv6 ping sequence", + [FLOW_UDP] = "UDP flow", }; static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES, "flow_type_str[] doesn't match enum flow_type"); @@ -46,6 +47,7 @@ const uint8_t flow_proto[] = { [FLOW_TCP_SPLICE] = IPPROTO_TCP, [FLOW_PING4] = IPPROTO_ICMP, [FLOW_PING6] = IPPROTO_ICMPV6, + [FLOW_UDP] = IPPROTO_UDP, }; static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES, "flow_proto[] doesn't match enum flow_type"); @@ -700,6 +702,31 @@ flow_sidx_t flow_lookup_af(const struct ctx *c, return flowside_lookup(c, proto, pif, &fside); } +/** + * flow_lookup_sa() - Look up a flow given and endpoint socket address + * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @esa: Socket address of the endpoint + * @fport: Forwarding port number + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport) +{ + struct flowside fside = { + .fport = fport, + }; + + inany_from_sockaddr(&fside.eaddr, &fside.eport, esa); + if (inany_v4(&fside.eaddr)) + fside.faddr = inany_any4; + else + fside.faddr = inany_any6; + return flowside_lookup(c, proto, pif, &fside); +} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context @@ -779,6 +806,10 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now) if (timer) closed = icmp_ping_timer(c, &flow->ping, now); break; + case FLOW_UDP: + if (timer) + closed = udp_flow_timer(c, &flow->udp, now); + break; default: /* Assume other flow types don't need any handling */ ; diff --git a/flow.h b/flow.h index e27f99be..3752e5ee 100644 --- a/flow.h +++ b/flow.h @@ -115,6 +115,8 @@ enum flow_type { FLOW_PING4, /* ICMPv6 echo requests from guest to host and matching replies back */ FLOW_PING6, + /* UDP pseudo-connection */ + FLOW_UDP, FLOW_NUM_TYPES, }; @@ -238,6 +240,8 @@ flow_sidx_t flow_lookup_af(const struct ctx *c, uint8_t proto, uint8_t pif, sa_family_t af, const void *eaddr, const void *faddr, in_port_t eport, in_port_t fport); +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport); union flow; diff --git a/flow_table.h b/flow_table.h index 457f27b1..3fbc7c8d 100644 --- a/flow_table.h +++ b/flow_table.h @@ -9,6 +9,7 @@ #include "tcp_conn.h" #include "icmp_flow.h" +#include "udp_flow.h" /** * struct flow_free_cluster - Information about a cluster of free entries @@ -35,6 +36,7 @@ union flow { struct tcp_tap_conn tcp; struct tcp_splice_conn tcp_splice; struct icmp_ping_flow ping; + struct udp_flow udp; }; /* Global Flow Table */ @@ -78,6 +80,18 @@ static inline union flow *flow_at_sidx(flow_sidx_t sidx) return FLOW(sidx.flow); } +/** flow_sidx_opposite - Get the other side of the same flow + * @sidx: Flow & side index + * + * Return: sidx for the other side of the same flow as @sidx + */ +static inline flow_sidx_t flow_sidx_opposite(flow_sidx_t sidx) +{ + if (!flow_sidx_valid(sidx)) + return FLOW_SIDX_NONE; + return (flow_sidx_t){.flow = sidx.flow, .side = !sidx.side}; +} + /** flow_sidx_t - Index of one side of a flow from common structure * @f: Common flow fields pointer * @side: Which side to refer to (0 or 1) diff --git a/udp.c b/udp.c index 6427b9ce..daf4fe26 100644 --- a/udp.c +++ b/udp.c @@ -15,6 +15,30 @@ /** * DOC: Theory of Operation * + * UDP Flows + * ========= + * + * UDP doesn't have true connections, but many protocols use a connection-like + * format. The flow is initiated by a client sending a datagram from a port of + * its choosing (usually ephemeral) to a specific port (usually well known) on a + * server. Both client and server address must be unicast. The server sends + * replies using the same addresses & ports with src/dest swapped. + * + * We track pseudo-connections of this type as flow table entries of type + * FLOW_UDP. We store the time of the last traffic on the flow in uflow->ts, + * and let the flow expire if there is no traffic for UDP_CONN_TIMEOUT seconds. + * + * NOTE: This won't handle multicast protocols, or some protocols with different + * port usage. We'll need specific logic if we want to handle those. + * + * "Listening" sockets + * =================== + * + * UDP doesn't use listen(), but we consider long term sockets which are allowed + * to create new flows "listening" by analogy with TCP. + * + * Port tracking + * ============= * * For UDP, a reduced version of port-based connection tracking is implemented * with two purposes: @@ -121,6 +145,7 @@ #include "tap.h" #include "pcap.h" #include "log.h" +#include "flow_table.h" #define UDP_CONN_TIMEOUT 180 /* s, timeout for ephemeral or local bind */ #define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */ @@ -199,6 +224,7 @@ static struct ethhdr udp6_eth_hdr; * @taph: Tap backend specific header * @s_in: Source socket address, filled in by recvmmsg() * @splicesrc: Source port for splicing, or -1 if not spliceable + * @tosidx: sidx for the destination side of this datagram's flow */ static struct udp_meta_t { struct ipv6hdr ip6h; @@ -207,6 +233,7 @@ static struct udp_meta_t { union sockaddr_inany s_in; int splicesrc; + flow_sidx_t tosidx; } #ifdef __AVX2__ __attribute__ ((aligned(32))) @@ -490,6 +517,115 @@ static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsghdr *mmh) return -1; } +/** + * udp_at_sidx() - Get UDP specific flow at given sidx + * @sidx: Flow and side to retrieve + * + * Return: UDP specific flow at @sidx, or NULL of @sidx is invalid. Asserts if + * the flow at @sidx is not FLOW_UDP. + */ +struct udp_flow *udp_at_sidx(flow_sidx_t sidx) +{ + union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return NULL; + + ASSERT(flow->f.type == FLOW_UDP); + return &flow->udp; +} + +/* + * udp_flow_close() - Close and clean up UDP flow + * @c: Execution context + * @uflow: UDP flow + */ +static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) +{ + flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); +} + +/** + * udp_flow_new() - Common setup for a new UDP flow + * @c: Execution context + * @flow: Initiated flow + * @now: Timestamp + * + * Return: UDP specific flow, if successful, NULL on failure + */ +static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, + const struct timespec *now) +{ + const struct flowside *ini = &flow->f.side[INISIDE]; + struct udp_flow *uflow = NULL; + + if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) { + flow_dbg(flow, "Invalid endpoint to initiate UDP flow"); + goto cancel; + } + + if (!flow_target(c, flow, IPPROTO_UDP)) + goto cancel; + + uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp); + uflow->ts = now->tv_sec; + + flow_hash_insert(c, FLOW_SIDX(uflow, INISIDE)); + FLOW_ACTIVATE(uflow); + + return FLOW_SIDX(uflow, TGTSIDE); + +cancel: + if (uflow) + udp_flow_close(c, uflow); + flow_alloc_cancel(flow); + return FLOW_SIDX_NONE; + +} + +/** + * udp_flow_from_sock() - Find or create UDP flow for "listening" socket + * @c: Execution context + * @ref: epoll reference of the receiving socket + * @meta: Metadata buffer for the datagram + * @now: Timestamp + * + * Return: sidx for the destination side of the flow for this packet, or + * FLOW_SIDX_NONE if we couldn't find or create a flow. + */ +static flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref, + struct udp_meta_t *meta, + const struct timespec *now) +{ + struct udp_flow *uflow; + union flow *flow; + flow_sidx_t sidx; + + ASSERT(ref.type == EPOLL_TYPE_UDP); + + /* FIXME: Match reply packets to their flow as well */ + if (!ref.udp.orig) + return FLOW_SIDX_NONE; + + sidx = flow_lookup_sa(c, IPPROTO_UDP, ref.udp.pif, &meta->s_in, ref.udp.port); + if ((uflow = udp_at_sidx(sidx))) { + uflow->ts = now->tv_sec; + return flow_sidx_opposite(sidx); + } + + if (!(flow = flow_alloc())) { + char sastr[SOCKADDR_STRLEN]; + + debug("Couldn't allocate flow for UDP datagram from %s %s", + pif_name(ref.udp.pif), + sockaddr_ntop(&meta->s_in, sastr, sizeof(sastr))); + return FLOW_SIDX_NONE; + } + + flow_initiate_sa(flow, ref.udp.pif, &meta->s_in, ref.udp.port); + return udp_flow_new(c, flow, now); +} + /** * udp_splice_prepare() - Prepare one datagram for splicing * @mmh: Receiving mmsghdr array @@ -786,12 +922,15 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve dstport += c->udp.fwd_in.f.delta[dstport]; /* We divide datagrams into batches based on how we need to send them, - * determined by udp_meta[i].splicesrc. To avoid either two passes - * through the array, or recalculating splicesrc for a single entry, we - * have to populate it one entry *ahead* of the loop counter. + * determined by udp_meta[i].splicesrc and udp_meta[i].tosidx. To avoid + * either two passes through the array, or recalculating splicesrc and + * tosidxfor a single entry, we have to populate it one entry *ahead* of + * the loop counter. */ udp_meta[0].splicesrc = udp_mmh_splice_port(ref, mmh_recv); + udp_meta[0].tosidx = udp_flow_from_sock(c, ref, &udp_meta[0], now); for (i = 0; i < n; ) { + flow_sidx_t batchsidx = udp_meta[i].tosidx; int batchsrc = udp_meta[i].splicesrc; int batchstart = i; @@ -807,7 +946,11 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve udp_meta[i].splicesrc = udp_mmh_splice_port(ref, &mmh_recv[i]); - } while (udp_meta[i].splicesrc == batchsrc); + udp_meta[i].tosidx = udp_flow_from_sock(c, ref, + &udp_meta[i], + now); + } while (flow_sidx_eq(udp_meta[i].tosidx, batchsidx) && + udp_meta[i].splicesrc == batchsrc); if (batchsrc >= 0) udp_splice_send(c, batchstart, i - batchstart, @@ -1201,6 +1344,24 @@ static int udp_port_rebind_outbound(void *arg) return 0; } +/** + * udp_flow_timer() - Handler for timed events related to a given flow + * @c: Execution context + * @uflow: UDP flow + * @now: Current timestamp + * + * Return: true if the flow is ready to free, false otherwise + */ +bool udp_flow_timer(const struct ctx *c, const struct udp_flow *uflow, + const struct timespec *now) +{ + if (now->tv_sec - uflow->ts <= UDP_CONN_TIMEOUT) + return false; + + udp_flow_close(c, uflow); + return true; +} + /** * udp_timer() - Scan activity bitmaps for ports with associated timed events * @c: Execution context diff --git a/udp_flow.h b/udp_flow.h new file mode 100644 index 00000000..18af9ac4 --- /dev/null +++ b/udp_flow.h @@ -0,0 +1,25 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright Red Hat + * Author: David Gibson <david(a)gibson.dropbear.id.au> + * + * UDP flow tracking data structures + */ +#ifndef UDP_FLOW_H +#define UDP_FLOW_H + +/** + * struct udp - Descriptor for a flow of UDP packets + * @f: Generic flow information + * @ts: Activity timestamp + */ +struct udp_flow { + /* Must be first element */ + struct flow_common f; + + time_t ts; +}; + +bool udp_flow_timer(const struct ctx *c, const struct udp_flow *uflow, + const struct timespec *now); + +#endif /* UDP_FLOW_H */ -- 2.45.2
Nits only, here: On Fri, 5 Jul 2024 12:07:17 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:This implements the first steps of tracking UDP packets with the flow table rather than it's own (buggy) set of port maps. Specifically we create flowitstable entries for datagrams received from a socket (PIF_HOST or PIF_SPLICE). When splitting datagrams from sockets into batches, we group by the flow as well as splicesrc. This may result in smaller batches, but makes things easier down the line. We can re-optimise this later if necessary. For now we don't do anything else with the flow, not even match reply packets to the same flow. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- Makefile | 2 +- flow.c | 31 ++++++++++ flow.h | 4 ++ flow_table.h | 14 +++++ udp.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++++-- udp_flow.h | 25 ++++++++ 6 files changed, 240 insertions(+), 5 deletions(-) create mode 100644 udp_flow.h diff --git a/Makefile b/Makefile index 09fc461d..92cbd5a6 100644 --- a/Makefile +++ b/Makefile @@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ - udp.h util.h + udp.h udp_flow.h util.h HEADERS = $(PASST_HEADERS) seccomp.h C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 }; diff --git a/flow.c b/flow.c index 218033ae..0cb9495b 100644 --- a/flow.c +++ b/flow.c @@ -37,6 +37,7 @@ const char *flow_type_str[] = { [FLOW_TCP_SPLICE] = "TCP connection (spliced)", [FLOW_PING4] = "ICMP ping sequence", [FLOW_PING6] = "ICMPv6 ping sequence", + [FLOW_UDP] = "UDP flow", }; static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES, "flow_type_str[] doesn't match enum flow_type"); @@ -46,6 +47,7 @@ const uint8_t flow_proto[] = { [FLOW_TCP_SPLICE] = IPPROTO_TCP, [FLOW_PING4] = IPPROTO_ICMP, [FLOW_PING6] = IPPROTO_ICMPV6, + [FLOW_UDP] = IPPROTO_UDP, }; static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES, "flow_proto[] doesn't match enum flow_type"); @@ -700,6 +702,31 @@ flow_sidx_t flow_lookup_af(const struct ctx *c, return flowside_lookup(c, proto, pif, &fside); } +/** + * flow_lookup_sa() - Look up a flow given and endpoint socket addresss/and/an/+ * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @esa: Socket address of the endpoint + * @fport: Forwarding port number + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport) +{ + struct flowside fside = {And the "f" in "fside" stands for "forwarding"... I don't have any quick fix in mind, and it's _kind of_ clear anyway, but this makes me doubt a bit about the "forwarding" / "endpoint" choice of words.+ .fport = fport, + }; + + inany_from_sockaddr(&fside.eaddr, &fside.eport, esa); + if (inany_v4(&fside.eaddr)) + fside.faddr = inany_any4; + else + fside.faddr = inany_any6;The usual extra newline here?+ return flowside_lookup(c, proto, pif, &fside); +} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context @@ -779,6 +806,10 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now) if (timer) closed = icmp_ping_timer(c, &flow->ping, now); break; + case FLOW_UDP: + if (timer) + closed = udp_flow_timer(c, &flow->udp, now); + break; default: /* Assume other flow types don't need any handling */ ; diff --git a/flow.h b/flow.h index e27f99be..3752e5ee 100644 --- a/flow.h +++ b/flow.h @@ -115,6 +115,8 @@ enum flow_type { FLOW_PING4, /* ICMPv6 echo requests from guest to host and matching replies back */ FLOW_PING6, + /* UDP pseudo-connection */ + FLOW_UDP, FLOW_NUM_TYPES, }; @@ -238,6 +240,8 @@ flow_sidx_t flow_lookup_af(const struct ctx *c, uint8_t proto, uint8_t pif, sa_family_t af, const void *eaddr, const void *faddr, in_port_t eport, in_port_t fport); +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport); union flow; diff --git a/flow_table.h b/flow_table.h index 457f27b1..3fbc7c8d 100644 --- a/flow_table.h +++ b/flow_table.h @@ -9,6 +9,7 @@ #include "tcp_conn.h" #include "icmp_flow.h" +#include "udp_flow.h" /** * struct flow_free_cluster - Information about a cluster of free entries @@ -35,6 +36,7 @@ union flow { struct tcp_tap_conn tcp; struct tcp_splice_conn tcp_splice; struct icmp_ping_flow ping; + struct udp_flow udp; }; /* Global Flow Table */ @@ -78,6 +80,18 @@ static inline union flow *flow_at_sidx(flow_sidx_t sidx) return FLOW(sidx.flow); } +/** flow_sidx_opposite - Get the other side of the same flowflow_sidx_opposite()+ * @sidx: Flow & side index + * + * Return: sidx for the other side of the same flow as @sidx + */ +static inline flow_sidx_t flow_sidx_opposite(flow_sidx_t sidx) +{ + if (!flow_sidx_valid(sidx)) + return FLOW_SIDX_NONE;Same here with the extra newline.+ return (flow_sidx_t){.flow = sidx.flow, .side = !sidx.side}; +} + /** flow_sidx_t - Index of one side of a flow from common structure * @f: Common flow fields pointer * @side: Which side to refer to (0 or 1) diff --git a/udp.c b/udp.c index 6427b9ce..daf4fe26 100644 --- a/udp.c +++ b/udp.c @@ -15,6 +15,30 @@ /** * DOC: Theory of Operation * + * UDP Flows + * ========= + * + * UDP doesn't have true connections, but many protocols use a connection-like + * format. The flow is initiated by a client sending a datagram from a port of + * its choosing (usually ephemeral) to a specific port (usually well known) on a + * server. Both client and server address must be unicast. The server sends + * replies using the same addresses & ports with src/dest swapped. + * + * We track pseudo-connections of this type as flow table entries of type + * FLOW_UDP. We store the time of the last traffic on the flow in uflow->ts, + * and let the flow expire if there is no traffic for UDP_CONN_TIMEOUT seconds. + * + * NOTE: This won't handle multicast protocols, or some protocols with different + * port usage. We'll need specific logic if we want to handle those. + * + * "Listening" sockets + * =================== + * + * UDP doesn't use listen(), but we consider long term sockets which are allowed + * to create new flows "listening" by analogy with TCP. + * + * Port tracking + * ============= * * For UDP, a reduced version of port-based connection tracking is implemented * with two purposes: @@ -121,6 +145,7 @@ #include "tap.h" #include "pcap.h" #include "log.h" +#include "flow_table.h" #define UDP_CONN_TIMEOUT 180 /* s, timeout for ephemeral or local bind */ #define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */ @@ -199,6 +224,7 @@ static struct ethhdr udp6_eth_hdr; * @taph: Tap backend specific header * @s_in: Source socket address, filled in by recvmmsg() * @splicesrc: Source port for splicing, or -1 if not spliceable + * @tosidx: sidx for the destination side of this datagram's flow */ static struct udp_meta_t { struct ipv6hdr ip6h; @@ -207,6 +233,7 @@ static struct udp_meta_t { union sockaddr_inany s_in; int splicesrc; + flow_sidx_t tosidx; } #ifdef __AVX2__ __attribute__ ((aligned(32))) @@ -490,6 +517,115 @@ static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsghdr *mmh) return -1; } +/** + * udp_at_sidx() - Get UDP specific flow at given sidx + * @sidx: Flow and side to retrieve + * + * Return: UDP specific flow at @sidx, or NULL of @sidx is invalid. Asserts if + * the flow at @sidx is not FLOW_UDP. + */ +struct udp_flow *udp_at_sidx(flow_sidx_t sidx) +{ + union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return NULL; + + ASSERT(flow->f.type == FLOW_UDP); + return &flow->udp; +} + +/* + * udp_flow_close() - Close and clean up UDP flow + * @c: Execution context + * @uflow: UDP flow + */ +static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) +{ + flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); +} + +/** + * udp_flow_new() - Common setup for a new UDP flow + * @c: Execution context + * @flow: Initiated flow + * @now: Timestamp + * + * Return: UDP specific flow, if successful, NULL on failure + */ +static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, + const struct timespec *now) +{ + const struct flowside *ini = &flow->f.side[INISIDE]; + struct udp_flow *uflow = NULL; + + if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) { + flow_dbg(flow, "Invalid endpoint to initiate UDP flow");Do we risk making debug logs unusable if we see multicast traffic? Maybe this could be flow_trace() instead. -- Stefano
On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote:Nits only, here: On Fri, 5 Jul 2024 12:07:17 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Oops, fixed.This implements the first steps of tracking UDP packets with the flow table rather than it's own (buggy) set of port maps. Specifically we create flowitsFixed.table entries for datagrams received from a socket (PIF_HOST or PIF_SPLICE). When splitting datagrams from sockets into batches, we group by the flow as well as splicesrc. This may result in smaller batches, but makes things easier down the line. We can re-optimise this later if necessary. For now we don't do anything else with the flow, not even match reply packets to the same flow. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- Makefile | 2 +- flow.c | 31 ++++++++++ flow.h | 4 ++ flow_table.h | 14 +++++ udp.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++++-- udp_flow.h | 25 ++++++++ 6 files changed, 240 insertions(+), 5 deletions(-) create mode 100644 udp_flow.h diff --git a/Makefile b/Makefile index 09fc461d..92cbd5a6 100644 --- a/Makefile +++ b/Makefile @@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ - udp.h util.h + udp.h udp_flow.h util.h HEADERS = $(PASST_HEADERS) seccomp.h C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 }; diff --git a/flow.c b/flow.c index 218033ae..0cb9495b 100644 --- a/flow.c +++ b/flow.c @@ -37,6 +37,7 @@ const char *flow_type_str[] = { [FLOW_TCP_SPLICE] = "TCP connection (spliced)", [FLOW_PING4] = "ICMP ping sequence", [FLOW_PING6] = "ICMPv6 ping sequence", + [FLOW_UDP] = "UDP flow", }; static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES, "flow_type_str[] doesn't match enum flow_type"); @@ -46,6 +47,7 @@ const uint8_t flow_proto[] = { [FLOW_TCP_SPLICE] = IPPROTO_TCP, [FLOW_PING4] = IPPROTO_ICMP, [FLOW_PING6] = IPPROTO_ICMPV6, + [FLOW_UDP] = IPPROTO_UDP, }; static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES, "flow_proto[] doesn't match enum flow_type"); @@ -700,6 +702,31 @@ flow_sidx_t flow_lookup_af(const struct ctx *c, return flowside_lookup(c, proto, pif, &fside); } +/** + * flow_lookup_sa() - Look up a flow given and endpoint socket addresss/and/an/Heh, no, here "fside" is simply short for "flowside". Every flowside has both forwarding and endpoint elements. So it is confusing, but for a different reason. I need to find a different convention for naming struct flowside variables. I'd say 'side', but sometimes that's used for the 1-bit integer indicating which side in a flow. Hrm.. now that pif has been removed from here, maybe I could rename struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps?+ * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @esa: Socket address of the endpoint + * @fport: Forwarding port number + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport) +{ + struct flowside fside = {And the "f" in "fside" stands for "forwarding"... I don't have any quick fix in mind, and it's _kind of_ clear anyway, but this makes me doubt a bit about the "forwarding" / "endpoint" choice of words.Done.+ .fport = fport, + }; + + inany_from_sockaddr(&fside.eaddr, &fside.eport, esa); + if (inany_v4(&fside.eaddr)) + fside.faddr = inany_any4; + else + fside.faddr = inany_any6;The usual extra newline here?Done.+ return flowside_lookup(c, proto, pif, &fside); +} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context @@ -779,6 +806,10 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now) if (timer) closed = icmp_ping_timer(c, &flow->ping, now); break; + case FLOW_UDP: + if (timer) + closed = udp_flow_timer(c, &flow->udp, now); + break; default: /* Assume other flow types don't need any handling */ ; diff --git a/flow.h b/flow.h index e27f99be..3752e5ee 100644 --- a/flow.h +++ b/flow.h @@ -115,6 +115,8 @@ enum flow_type { FLOW_PING4, /* ICMPv6 echo requests from guest to host and matching replies back */ FLOW_PING6, + /* UDP pseudo-connection */ + FLOW_UDP, FLOW_NUM_TYPES, }; @@ -238,6 +240,8 @@ flow_sidx_t flow_lookup_af(const struct ctx *c, uint8_t proto, uint8_t pif, sa_family_t af, const void *eaddr, const void *faddr, in_port_t eport, in_port_t fport); +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport); union flow; diff --git a/flow_table.h b/flow_table.h index 457f27b1..3fbc7c8d 100644 --- a/flow_table.h +++ b/flow_table.h @@ -9,6 +9,7 @@ #include "tcp_conn.h" #include "icmp_flow.h" +#include "udp_flow.h" /** * struct flow_free_cluster - Information about a cluster of free entries @@ -35,6 +36,7 @@ union flow { struct tcp_tap_conn tcp; struct tcp_splice_conn tcp_splice; struct icmp_ping_flow ping; + struct udp_flow udp; }; /* Global Flow Table */ @@ -78,6 +80,18 @@ static inline union flow *flow_at_sidx(flow_sidx_t sidx) return FLOW(sidx.flow); } +/** flow_sidx_opposite - Get the other side of the same flowflow_sidx_opposite()Done.+ * @sidx: Flow & side index + * + * Return: sidx for the other side of the same flow as @sidx + */ +static inline flow_sidx_t flow_sidx_opposite(flow_sidx_t sidx) +{ + if (!flow_sidx_valid(sidx)) + return FLOW_SIDX_NONE;Same here with the extra newline.Um.. I'm not sure.+ return (flow_sidx_t){.flow = sidx.flow, .side = !sidx.side}; +} + /** flow_sidx_t - Index of one side of a flow from common structure * @f: Common flow fields pointer * @side: Which side to refer to (0 or 1) diff --git a/udp.c b/udp.c index 6427b9ce..daf4fe26 100644 --- a/udp.c +++ b/udp.c @@ -15,6 +15,30 @@ /** * DOC: Theory of Operation * + * UDP Flows + * ========= + * + * UDP doesn't have true connections, but many protocols use a connection-like + * format. The flow is initiated by a client sending a datagram from a port of + * its choosing (usually ephemeral) to a specific port (usually well known) on a + * server. Both client and server address must be unicast. The server sends + * replies using the same addresses & ports with src/dest swapped. + * + * We track pseudo-connections of this type as flow table entries of type + * FLOW_UDP. We store the time of the last traffic on the flow in uflow->ts, + * and let the flow expire if there is no traffic for UDP_CONN_TIMEOUT seconds. + * + * NOTE: This won't handle multicast protocols, or some protocols with different + * port usage. We'll need specific logic if we want to handle those. + * + * "Listening" sockets + * =================== + * + * UDP doesn't use listen(), but we consider long term sockets which are allowed + * to create new flows "listening" by analogy with TCP. + * + * Port tracking + * ============= * * For UDP, a reduced version of port-based connection tracking is implemented * with two purposes: @@ -121,6 +145,7 @@ #include "tap.h" #include "pcap.h" #include "log.h" +#include "flow_table.h" #define UDP_CONN_TIMEOUT 180 /* s, timeout for ephemeral or local bind */ #define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */ @@ -199,6 +224,7 @@ static struct ethhdr udp6_eth_hdr; * @taph: Tap backend specific header * @s_in: Source socket address, filled in by recvmmsg() * @splicesrc: Source port for splicing, or -1 if not spliceable + * @tosidx: sidx for the destination side of this datagram's flow */ static struct udp_meta_t { struct ipv6hdr ip6h; @@ -207,6 +233,7 @@ static struct udp_meta_t { union sockaddr_inany s_in; int splicesrc; + flow_sidx_t tosidx; } #ifdef __AVX2__ __attribute__ ((aligned(32))) @@ -490,6 +517,115 @@ static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsghdr *mmh) return -1; } +/** + * udp_at_sidx() - Get UDP specific flow at given sidx + * @sidx: Flow and side to retrieve + * + * Return: UDP specific flow at @sidx, or NULL of @sidx is invalid. Asserts if + * the flow at @sidx is not FLOW_UDP. + */ +struct udp_flow *udp_at_sidx(flow_sidx_t sidx) +{ + union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return NULL; + + ASSERT(flow->f.type == FLOW_UDP); + return &flow->udp; +} + +/* + * udp_flow_close() - Close and clean up UDP flow + * @c: Execution context + * @uflow: UDP flow + */ +static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) +{ + flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); +} + +/** + * udp_flow_new() - Common setup for a new UDP flow + * @c: Execution context + * @flow: Initiated flow + * @now: Timestamp + * + * Return: UDP specific flow, if successful, NULL on failure + */ +static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, + const struct timespec *now) +{ + const struct flowside *ini = &flow->f.side[INISIDE]; + struct udp_flow *uflow = NULL; + + if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) { + flow_dbg(flow, "Invalid endpoint to initiate UDP flow");Do we risk making debug logs unusable if we see multicast traffic?Maybe this could be flow_trace() instead.Sure, why not. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Wed, 10 Jul 2024 09:59:08 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote:Oh, I thought you called it fside here because you're setting the forwarding part of it directly, or something like that.Nits only, here: On Fri, 5 Jul 2024 12:07:17 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Heh, no, here "fside" is simply short for "flowside". Every flowside has both forwarding and endpoint elements.[...] + * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @esa: Socket address of the endpoint + * @fport: Forwarding port number + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport) +{ + struct flowside fside = {And the "f" in "fside" stands for "forwarding"... I don't have any quick fix in mind, and it's _kind of_ clear anyway, but this makes me doubt a bit about the "forwarding" / "endpoint" choice of words.So it is confusing, but for a different reason. I need to find a different convention for naming struct flowside variables. I'd say 'side', but sometimes that's used for the 1-bit integer indicating which side in a flow. Hrm.. now that pif has been removed from here, maybe I could rename struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps?That's also confusing because it contains ports too (even though sure, in some sense they're part of the address). I would suggest keeping it like it is in for this series, but after that, if it's not too long, what about flow_addrs_ports? Actually, I don't think flowside is that bad. What I'm struggling with is rather 'forwarding' and 'endpoint'. I don't have any good suggestion at the moment, anyway. Using 'local' and 'remote' (laddr/lport, raddr/rport) would be clearer to me and avoid the conflict with 'f' of flowside, but you had good reasons to avoid that, if I recall correctly. -- Stefano
On Wed, Jul 10, 2024 at 11:35:23PM +0200, Stefano Brivio wrote:On Wed, 10 Jul 2024 09:59:08 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Right :/.On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote:Oh, I thought you called it fside here because you're setting the forwarding part of it directly, or something like that.Nits only, here: On Fri, 5 Jul 2024 12:07:17 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Heh, no, here "fside" is simply short for "flowside". Every flowside has both forwarding and endpoint elements.[...] + * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @esa: Socket address of the endpoint + * @fport: Forwarding port number + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport) +{ + struct flowside fside = {And the "f" in "fside" stands for "forwarding"... I don't have any quick fix in mind, and it's _kind of_ clear anyway, but this makes me doubt a bit about the "forwarding" / "endpoint" choice of words.So it is confusing, but for a different reason. I need to find a different convention for naming struct flowside variables. I'd say 'side', but sometimes that's used for the 1-bit integer indicating which side in a flow. Hrm.. now that pif has been removed from here, maybe I could rename struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps?That's also confusing because it contains ports too (even though sure, in some sense they're part of the address).I would suggest keeping it like it is in for this series, but after that, if it's not too long, what about flow_addrs_ports?Still need a conventional name for the variables. "fap" probably isn't the best look, and still has the potentially confusing "f" in it.Actually, I don't think flowside is that bad. What I'm struggling with is rather 'forwarding' and 'endpoint'. I don't have any good suggestion at the moment, anyway. Using 'local' and 'remote' (laddr/lport, raddr/rport) would be clearer to me and avoid the conflict with 'f' of flowside, but you had good reasons to avoid that, if I recall correctly.Kind of, yeah. Local and remote are great when it's clear we're talking specifically from the point of view of the passt process. There are a bunch of cases where it's not necessarily obvious if we're talking from that point of view, the point of view of the guest, or the point of view of the host (the last usually when the endpoint is somewhere on an entirely different system). I wanted something that wherever we were talking about it is specifically relative to the passt process itself. I get the impression that "forwarding" is causing more trouble here than "endpoint". "midpoint address"? "intercepted address"? "redirected address" (probably not, that's 'r' like remote but this would be the local address)? -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, 11 Jul 2024 14:26:38 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Wed, Jul 10, 2024 at 11:35:23PM +0200, Stefano Brivio wrote:I think "forwarding" is still better than any of those. Perhaps "passt address" (and paddr/pport)... but I'm not sure it's much better than "forwarding". -- StefanoOn Wed, 10 Jul 2024 09:59:08 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Right :/.On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote:Oh, I thought you called it fside here because you're setting the forwarding part of it directly, or something like that.Nits only, here: On Fri, 5 Jul 2024 12:07:17 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote: > [...] > > + * @c: Execution context > + * @proto: Protocol of the flow (IP L4 protocol number) > + * @pif: Interface of the flow > + * @esa: Socket address of the endpoint > + * @fport: Forwarding port number > + * > + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found > + */ > +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, > + const void *esa, in_port_t fport) > +{ > + struct flowside fside = { And the "f" in "fside" stands for "forwarding"... I don't have any quick fix in mind, and it's _kind of_ clear anyway, but this makes me doubt a bit about the "forwarding" / "endpoint" choice of words.Heh, no, here "fside" is simply short for "flowside". Every flowside has both forwarding and endpoint elements.So it is confusing, but for a different reason. I need to find a different convention for naming struct flowside variables. I'd say 'side', but sometimes that's used for the 1-bit integer indicating which side in a flow. Hrm.. now that pif has been removed from here, maybe I could rename struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps?That's also confusing because it contains ports too (even though sure, in some sense they're part of the address).I would suggest keeping it like it is in for this series, but after that, if it's not too long, what about flow_addrs_ports?Still need a conventional name for the variables. "fap" probably isn't the best look, and still has the potentially confusing "f" in it.Actually, I don't think flowside is that bad. What I'm struggling with is rather 'forwarding' and 'endpoint'. I don't have any good suggestion at the moment, anyway. Using 'local' and 'remote' (laddr/lport, raddr/rport) would be clearer to me and avoid the conflict with 'f' of flowside, but you had good reasons to avoid that, if I recall correctly.Kind of, yeah. Local and remote are great when it's clear we're talking specifically from the point of view of the passt process. There are a bunch of cases where it's not necessarily obvious if we're talking from that point of view, the point of view of the guest, or the point of view of the host (the last usually when the endpoint is somewhere on an entirely different system). I wanted something that wherever we were talking about it is specifically relative to the passt process itself. I get the impression that "forwarding" is causing more trouble here than "endpoint". "midpoint address"? "intercepted address"? "redirected address" (probably not, that's 'r' like remote but this would be the local address)?
On Thu, Jul 11, 2024 at 10:20:01AM +0200, Stefano Brivio wrote:On Thu, 11 Jul 2024 14:26:38 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Hm. "passthrough address"? Kind of means the same thing as "forwarding" in context, and maybe evokes the idea that this is the address that passt itself owns? -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibsonOn Wed, Jul 10, 2024 at 11:35:23PM +0200, Stefano Brivio wrote:I think "forwarding" is still better than any of those. Perhaps "passt address" (and paddr/pport)... but I'm not sure it's much better than "forwarding".On Wed, 10 Jul 2024 09:59:08 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Right :/.On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote: > Nits only, here: > > On Fri, 5 Jul 2024 12:07:17 +1000 > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > [...] > > > > + * @c: Execution context > > + * @proto: Protocol of the flow (IP L4 protocol number) > > + * @pif: Interface of the flow > > + * @esa: Socket address of the endpoint > > + * @fport: Forwarding port number > > + * > > + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found > > + */ > > +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, > > + const void *esa, in_port_t fport) > > +{ > > + struct flowside fside = { > > And the "f" in "fside" stands for "forwarding"... I don't have any > quick fix in mind, and it's _kind of_ clear anyway, but this makes me > doubt a bit about the "forwarding" / "endpoint" choice of words. Heh, no, here "fside" is simply short for "flowside". Every flowside has both forwarding and endpoint elements.Oh, I thought you called it fside here because you're setting the forwarding part of it directly, or something like that.So it is confusing, but for a different reason. I need to find a different convention for naming struct flowside variables. I'd say 'side', but sometimes that's used for the 1-bit integer indicating which side in a flow. Hrm.. now that pif has been removed from here, maybe I could rename struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps?That's also confusing because it contains ports too (even though sure, in some sense they're part of the address).I would suggest keeping it like it is in for this series, but after that, if it's not too long, what about flow_addrs_ports?Still need a conventional name for the variables. "fap" probably isn't the best look, and still has the potentially confusing "f" in it.Actually, I don't think flowside is that bad. What I'm struggling with is rather 'forwarding' and 'endpoint'. I don't have any good suggestion at the moment, anyway. Using 'local' and 'remote' (laddr/lport, raddr/rport) would be clearer to me and avoid the conflict with 'f' of flowside, but you had good reasons to avoid that, if I recall correctly.Kind of, yeah. Local and remote are great when it's clear we're talking specifically from the point of view of the passt process. There are a bunch of cases where it's not necessarily obvious if we're talking from that point of view, the point of view of the guest, or the point of view of the host (the last usually when the endpoint is somewhere on an entirely different system). I wanted something that wherever we were talking about it is specifically relative to the passt process itself. I get the impression that "forwarding" is causing more trouble here than "endpoint". "midpoint address"? "intercepted address"? "redirected address" (probably not, that's 'r' like remote but this would be the local address)?
On Fri, 12 Jul 2024 08:58:57 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Thu, Jul 11, 2024 at 10:20:01AM +0200, Stefano Brivio wrote:I'd still prefer "forwarding" to that... at least I almost got used to it, "passthrough" is even more confusing because not everything really passes... through? The part I'm missing (with any of those) is "here" vs. "there", that is, a reference to the "where" and to the topology instead of what that side does. Again, I would just leave that as "forwarding" in this series, and then change it everywhere if we come up with something nicer. -- StefanoOn Thu, 11 Jul 2024 14:26:38 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Hm. "passthrough address"? Kind of means the same thing as "forwarding" in context, and maybe evokes the idea that this is the address that passt itself owns?On Wed, Jul 10, 2024 at 11:35:23PM +0200, Stefano Brivio wrote:I think "forwarding" is still better than any of those. Perhaps "passt address" (and paddr/pport)... but I'm not sure it's much better than "forwarding".On Wed, 10 Jul 2024 09:59:08 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote: > On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote: > > Nits only, here: > > > > On Fri, 5 Jul 2024 12:07:17 +1000 > > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > > > [...] > > > > > > + * @c: Execution context > > > + * @proto: Protocol of the flow (IP L4 protocol number) > > > + * @pif: Interface of the flow > > > + * @esa: Socket address of the endpoint > > > + * @fport: Forwarding port number > > > + * > > > + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found > > > + */ > > > +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, > > > + const void *esa, in_port_t fport) > > > +{ > > > + struct flowside fside = { > > > > And the "f" in "fside" stands for "forwarding"... I don't have any > > quick fix in mind, and it's _kind of_ clear anyway, but this makes me > > doubt a bit about the "forwarding" / "endpoint" choice of words. > > Heh, no, here "fside" is simply short for "flowside". Every flowside > has both forwarding and endpoint elements. Oh, I thought you called it fside here because you're setting the forwarding part of it directly, or something like that. > So it is confusing, but > for a different reason. I need to find a different convention for > naming struct flowside variables. I'd say 'side', but sometimes > that's used for the 1-bit integer indicating which side in a flow. > > Hrm.. now that pif has been removed from here, maybe I could rename > struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps? That's also confusing because it contains ports too (even though sure, in some sense they're part of the address).Right :/.I would suggest keeping it like it is in for this series, but after that, if it's not too long, what about flow_addrs_ports?Still need a conventional name for the variables. "fap" probably isn't the best look, and still has the potentially confusing "f" in it.Actually, I don't think flowside is that bad. What I'm struggling with is rather 'forwarding' and 'endpoint'. I don't have any good suggestion at the moment, anyway. Using 'local' and 'remote' (laddr/lport, raddr/rport) would be clearer to me and avoid the conflict with 'f' of flowside, but you had good reasons to avoid that, if I recall correctly.Kind of, yeah. Local and remote are great when it's clear we're talking specifically from the point of view of the passt process. There are a bunch of cases where it's not necessarily obvious if we're talking from that point of view, the point of view of the guest, or the point of view of the host (the last usually when the endpoint is somewhere on an entirely different system). I wanted something that wherever we were talking about it is specifically relative to the passt process itself. I get the impression that "forwarding" is causing more trouble here than "endpoint". "midpoint address"? "intercepted address"? "redirected address" (probably not, that's 'r' like remote but this would be the local address)?
On Fri, Jul 12, 2024 at 10:21:41AM +0200, Stefano Brivio wrote:On Fri, 12 Jul 2024 08:58:57 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Yeah, that's fair.On Thu, Jul 11, 2024 at 10:20:01AM +0200, Stefano Brivio wrote:I'd still prefer "forwarding" to that... at least I almost got used to it, "passthrough" is even more confusing because not everything really passes... through?On Thu, 11 Jul 2024 14:26:38 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Hm. "passthrough address"? Kind of means the same thing as "forwarding" in context, and maybe evokes the idea that this is the address that passt itself owns?On Wed, Jul 10, 2024 at 11:35:23PM +0200, Stefano Brivio wrote: > On Wed, 10 Jul 2024 09:59:08 +1000 > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote: > > > Nits only, here: > > > > > > On Fri, 5 Jul 2024 12:07:17 +1000 > > > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > > > > > [...] > > > > > > > > + * @c: Execution context > > > > + * @proto: Protocol of the flow (IP L4 protocol number) > > > > + * @pif: Interface of the flow > > > > + * @esa: Socket address of the endpoint > > > > + * @fport: Forwarding port number > > > > + * > > > > + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found > > > > + */ > > > > +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, > > > > + const void *esa, in_port_t fport) > > > > +{ > > > > + struct flowside fside = { > > > > > > And the "f" in "fside" stands for "forwarding"... I don't have any > > > quick fix in mind, and it's _kind of_ clear anyway, but this makes me > > > doubt a bit about the "forwarding" / "endpoint" choice of words. > > > > Heh, no, here "fside" is simply short for "flowside". Every flowside > > has both forwarding and endpoint elements. > > Oh, I thought you called it fside here because you're setting the > forwarding part of it directly, or something like that. > > > So it is confusing, but > > for a different reason. I need to find a different convention for > > naming struct flowside variables. I'd say 'side', but sometimes > > that's used for the 1-bit integer indicating which side in a flow. > > > > Hrm.. now that pif has been removed from here, maybe I could rename > > struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps? > > That's also confusing because it contains ports too (even though sure, > in some sense they're part of the address). Right :/. > I would suggest keeping it > like it is in for this series, but after that, if it's not too long, > what about flow_addrs_ports? Still need a conventional name for the variables. "fap" probably isn't the best look, and still has the potentially confusing "f" in it. > Actually, I don't think flowside is that bad. What I'm struggling with > is rather 'forwarding' and 'endpoint'. I don't have any good suggestion > at the moment, anyway. Using 'local' and 'remote' (laddr/lport, > raddr/rport) would be clearer to me and avoid the conflict with 'f' of > flowside, but you had good reasons to avoid that, if I recall correctly. Kind of, yeah. Local and remote are great when it's clear we're talking specifically from the point of view of the passt process. There are a bunch of cases where it's not necessarily obvious if we're talking from that point of view, the point of view of the guest, or the point of view of the host (the last usually when the endpoint is somewhere on an entirely different system). I wanted something that wherever we were talking about it is specifically relative to the passt process itself. I get the impression that "forwarding" is causing more trouble here than "endpoint". "midpoint address"? "intercepted address"? "redirected address" (probably not, that's 'r' like remote but this would be the local address)?I think "forwarding" is still better than any of those. Perhaps "passt address" (and paddr/pport)... but I'm not sure it's much better than "forwarding".The part I'm missing (with any of those) is "here" vs. "there", that is, a reference to the "where" and to the topology instead of what that side does.I don't really follow what you mean.Again, I would just leave that as "forwarding" in this series, and then change it everywhere if we come up with something nicer.Seems reasonable. It'd still be nice to have an alternative to 'fside' that doesn't have the confusing 'f'. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Mon, 15 Jul 2024 14:06:45 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Fri, Jul 12, 2024 at 10:21:41AM +0200, Stefano Brivio wrote:What makes "forwarding" and "passthrough" not so obvious to me is that they don't carry the information of _where_ in the network diagram the side is (such as "local" and "remote" would do), but rather what it does ("forward packets", "pass data through"). From passt and pasta's perspective, the "forwarding" side could be called the "here" side, and the endpoint could be called the "there" side. Those names are not great for other reasons though, just think of the sentence "there is the 'there' side and the 'here' side", or "here we deal with the 'there' side". So at the moment I don't have any better suggestion in that sense. Telecommunication standards frequently use "near end" and "far end", but I guess you would still have a similar issue as with local/remote, and also "far" starts with "f".On Fri, 12 Jul 2024 08:58:57 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Yeah, that's fair.On Thu, Jul 11, 2024 at 10:20:01AM +0200, Stefano Brivio wrote:I'd still prefer "forwarding" to that... at least I almost got used to it, "passthrough" is even more confusing because not everything really passes... through?On Thu, 11 Jul 2024 14:26:38 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote: > On Wed, Jul 10, 2024 at 11:35:23PM +0200, Stefano Brivio wrote: > > On Wed, 10 Jul 2024 09:59:08 +1000 > > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > > > On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote: > > > > Nits only, here: > > > > > > > > On Fri, 5 Jul 2024 12:07:17 +1000 > > > > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > > > > > > > [...] > > > > > > > > > > + * @c: Execution context > > > > > + * @proto: Protocol of the flow (IP L4 protocol number) > > > > > + * @pif: Interface of the flow > > > > > + * @esa: Socket address of the endpoint > > > > > + * @fport: Forwarding port number > > > > > + * > > > > > + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found > > > > > + */ > > > > > +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, > > > > > + const void *esa, in_port_t fport) > > > > > +{ > > > > > + struct flowside fside = { > > > > > > > > And the "f" in "fside" stands for "forwarding"... I don't have any > > > > quick fix in mind, and it's _kind of_ clear anyway, but this makes me > > > > doubt a bit about the "forwarding" / "endpoint" choice of words. > > > > > > Heh, no, here "fside" is simply short for "flowside". Every flowside > > > has both forwarding and endpoint elements. > > > > Oh, I thought you called it fside here because you're setting the > > forwarding part of it directly, or something like that. > > > > > So it is confusing, but > > > for a different reason. I need to find a different convention for > > > naming struct flowside variables. I'd say 'side', but sometimes > > > that's used for the 1-bit integer indicating which side in a flow. > > > > > > Hrm.. now that pif has been removed from here, maybe I could rename > > > struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps? > > > > That's also confusing because it contains ports too (even though sure, > > in some sense they're part of the address). > > Right :/. > > > I would suggest keeping it > > like it is in for this series, but after that, if it's not too long, > > what about flow_addrs_ports? > > Still need a conventional name for the variables. "fap" probably > isn't the best look, and still has the potentially confusing "f" in > it. > > > Actually, I don't think flowside is that bad. What I'm struggling with > > is rather 'forwarding' and 'endpoint'. I don't have any good suggestion > > at the moment, anyway. Using 'local' and 'remote' (laddr/lport, > > raddr/rport) would be clearer to me and avoid the conflict with 'f' of > > flowside, but you had good reasons to avoid that, if I recall correctly. > > Kind of, yeah. Local and remote are great when it's clear we're > talking specifically from the point of view of the passt process. > There are a bunch of cases where it's not necessarily obvious if we're > talking from that point of view, the point of view of the guest, or > the point of view of the host (the last usually when the endpoint is > somewhere on an entirely different system). I wanted something that > wherever we were talking about it is specifically relative to the > passt process itself. > > I get the impression that "forwarding" is causing more trouble here > than "endpoint". "midpoint address"? "intercepted address"? > "redirected address" (probably not, that's 'r' like remote but this > would be the local address)? I think "forwarding" is still better than any of those. Perhaps "passt address" (and paddr/pport)... but I'm not sure it's much better than "forwarding".Hm. "passthrough address"? Kind of means the same thing as "forwarding" in context, and maybe evokes the idea that this is the address that passt itself owns?The part I'm missing (with any of those) is "here" vs. "there", that is, a reference to the "where" and to the topology instead of what that side does.I don't really follow what you mean.-- StefanoAgain, I would just leave that as "forwarding" in this series, and then change it everywhere if we come up with something nicer.Seems reasonable. It'd still be nice to have an alternative to 'fside' that doesn't have the confusing 'f'.
On Mon, Jul 15, 2024 at 06:37:39PM +0200, Stefano Brivio wrote:On Mon, 15 Jul 2024 14:06:45 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Ah, I see. Fwiw, the reasoning here is that these are the addresses that belong to the thing doing the forwarding - i.e. passt/pasta.On Fri, Jul 12, 2024 at 10:21:41AM +0200, Stefano Brivio wrote:What makes "forwarding" and "passthrough" not so obvious to me is that they don't carry the information of _where_ in the network diagram the side is (such as "local" and "remote" would do), but rather what it does ("forward packets", "pass data through").On Fri, 12 Jul 2024 08:58:57 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Yeah, that's fair.On Thu, Jul 11, 2024 at 10:20:01AM +0200, Stefano Brivio wrote: > On Thu, 11 Jul 2024 14:26:38 +1000 > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > On Wed, Jul 10, 2024 at 11:35:23PM +0200, Stefano Brivio wrote: > > > On Wed, 10 Jul 2024 09:59:08 +1000 > > > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > > > > > On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote: > > > > > Nits only, here: > > > > > > > > > > On Fri, 5 Jul 2024 12:07:17 +1000 > > > > > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > > > > > > > > > [...] > > > > > > > > > > > > + * @c: Execution context > > > > > > + * @proto: Protocol of the flow (IP L4 protocol number) > > > > > > + * @pif: Interface of the flow > > > > > > + * @esa: Socket address of the endpoint > > > > > > + * @fport: Forwarding port number > > > > > > + * > > > > > > + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found > > > > > > + */ > > > > > > +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, > > > > > > + const void *esa, in_port_t fport) > > > > > > +{ > > > > > > + struct flowside fside = { > > > > > > > > > > And the "f" in "fside" stands for "forwarding"... I don't have any > > > > > quick fix in mind, and it's _kind of_ clear anyway, but this makes me > > > > > doubt a bit about the "forwarding" / "endpoint" choice of words. > > > > > > > > Heh, no, here "fside" is simply short for "flowside". Every flowside > > > > has both forwarding and endpoint elements. > > > > > > Oh, I thought you called it fside here because you're setting the > > > forwarding part of it directly, or something like that. > > > > > > > So it is confusing, but > > > > for a different reason. I need to find a different convention for > > > > naming struct flowside variables. I'd say 'side', but sometimes > > > > that's used for the 1-bit integer indicating which side in a flow. > > > > > > > > Hrm.. now that pif has been removed from here, maybe I could rename > > > > struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps? > > > > > > That's also confusing because it contains ports too (even though sure, > > > in some sense they're part of the address). > > > > Right :/. > > > > > I would suggest keeping it > > > like it is in for this series, but after that, if it's not too long, > > > what about flow_addrs_ports? > > > > Still need a conventional name for the variables. "fap" probably > > isn't the best look, and still has the potentially confusing "f" in > > it. > > > > > Actually, I don't think flowside is that bad. What I'm struggling with > > > is rather 'forwarding' and 'endpoint'. I don't have any good suggestion > > > at the moment, anyway. Using 'local' and 'remote' (laddr/lport, > > > raddr/rport) would be clearer to me and avoid the conflict with 'f' of > > > flowside, but you had good reasons to avoid that, if I recall correctly. > > > > Kind of, yeah. Local and remote are great when it's clear we're > > talking specifically from the point of view of the passt process. > > There are a bunch of cases where it's not necessarily obvious if we're > > talking from that point of view, the point of view of the guest, or > > the point of view of the host (the last usually when the endpoint is > > somewhere on an entirely different system). I wanted something that > > wherever we were talking about it is specifically relative to the > > passt process itself. > > > > I get the impression that "forwarding" is causing more trouble here > > than "endpoint". "midpoint address"? "intercepted address"? > > "redirected address" (probably not, that's 'r' like remote but this > > would be the local address)? > > I think "forwarding" is still better than any of those. Perhaps "passt > address" (and paddr/pport)... but I'm not sure it's much better than > "forwarding". Hm. "passthrough address"? Kind of means the same thing as "forwarding" in context, and maybe evokes the idea that this is the address that passt itself owns?I'd still prefer "forwarding" to that... at least I almost got used to it, "passthrough" is even more confusing because not everything really passes... through?The part I'm missing (with any of those) is "here" vs. "there", that is, a reference to the "where" and to the topology instead of what that side does.I don't really follow what you mean.And in addition to that they share the problem with local/remote and near end/far end that they only make sense if your frame of reference is already passt/pasta's perspective.From passt and pasta's perspective, the "forwarding" side could becalled the "here" side, and the endpoint could be called the "there" side. Those names are not great for other reasons though, just think of the sentence "there is the 'there' side and the 'here' side", or "here we deal with the 'there' side". So at the moment I don't have any better suggestion in that sense.Telecommunication standards frequently use "near end" and "far end", but I guess you would still have a similar issue as with local/remote, and also "far" starts with "f".If we're going to assume we're talking from the perspective of the passt/pasta process, I don't think we'll do better than "local" and "remote". What I was aiming for with "forwarding" and "passthrough" was something unambiguous from "whole system" view encompassing the guest, the other endpoint and passt in between. Maybe that's not worth it, and we would be better with "local" and "remote", and add extra verbiage in the cases where it might seem like we're talking from the guest's view.-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibsonAgain, I would just leave that as "forwarding" in this series, and then change it everywhere if we come up with something nicer.Seems reasonable. It'd still be nice to have an alternative to 'fside' that doesn't have the confusing 'f'.
When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use. For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- epoll_type.h | 2 + flow.c | 20 +++ flow.h | 2 + flow_table.h | 14 ++ passt.c | 4 + udp.c | 403 ++++++++++++++++++--------------------------------- udp.h | 6 +- udp_flow.h | 2 + util.c | 1 + 9 files changed, 194 insertions(+), 260 deletions(-) diff --git a/epoll_type.h b/epoll_type.h index b6c04199..7a752ed1 100644 --- a/epoll_type.h +++ b/epoll_type.h @@ -22,6 +22,8 @@ enum epoll_type { EPOLL_TYPE_TCP_TIMER, /* UDP sockets */ EPOLL_TYPE_UDP, + /* UDP socket for replies on a specific flow */ + EPOLL_TYPE_UDP_REPLY, /* ICMP/ICMPv6 ping sockets */ EPOLL_TYPE_PING, /* inotify fd watching for end of netns (pasta) */ diff --git a/flow.c b/flow.c index 0cb9495b..2e100ddb 100644 --- a/flow.c +++ b/flow.c @@ -236,6 +236,26 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, } } +/** flowside_connect() - Connect a socket based on flowside + * @c: Execution context + * @s: Socket to connect + * @pif: Target pif + * @tgt: Target flowside + * + * Connect @s to the endpoint address and port from @tgt. + * + * Return: 0 on success, negative on error + */ +int flowside_connect(const struct ctx *c, int s, + uint8_t pif, const struct flowside *tgt) +{ + union sockaddr_inany sa; + socklen_t sl; + + pif_sockaddr(c, &sa, &sl, pif, &tgt->eaddr, tgt->eport); + return connect(s, &sa.sa, sl); +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority diff --git a/flow.h b/flow.h index 3752e5ee..3f65ceb9 100644 --- a/flow.h +++ b/flow.h @@ -168,6 +168,8 @@ static inline bool flowside_eq(const struct flowside *left, int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, const struct flowside *tgt, uint32_t data); +int flowside_connect(const struct ctx *c, int s, + uint8_t pif, const struct flowside *tgt); /** * struct flow_common - Common fields for packet flows diff --git a/flow_table.h b/flow_table.h index 3fbc7c8d..1faac4a7 100644 --- a/flow_table.h +++ b/flow_table.h @@ -92,6 +92,20 @@ static inline flow_sidx_t flow_sidx_opposite(flow_sidx_t sidx) return (flow_sidx_t){.flow = sidx.flow, .side = !sidx.side}; } +/** pif_at_sidx - Interface for a given flow and side + * @sidx: Flow & side index + * + * Return: pif for the flow & side given by @sidx + */ +static inline uint8_t pif_at_sidx(flow_sidx_t sidx) +{ + const union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return PIF_NONE; + return flow->f.pif[sidx.side]; +} + /** flow_sidx_t - Index of one side of a flow from common structure * @f: Common flow fields pointer * @side: Which side to refer to (0 or 1) diff --git a/passt.c b/passt.c index e4d45daa..f9405bee 100644 --- a/passt.c +++ b/passt.c @@ -67,6 +67,7 @@ char *epoll_type_str[] = { [EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket", [EPOLL_TYPE_TCP_TIMER] = "TCP timer", [EPOLL_TYPE_UDP] = "UDP socket", + [EPOLL_TYPE_UDP_REPLY] = "UDP reply socket", [EPOLL_TYPE_PING] = "ICMP/ICMPv6 ping socket", [EPOLL_TYPE_NSQUIT_INOTIFY] = "namespace inotify watch", [EPOLL_TYPE_NSQUIT_TIMER] = "namespace timer watch", @@ -349,6 +350,9 @@ loop: case EPOLL_TYPE_UDP: udp_buf_sock_handler(&c, ref, eventmask, &now); break; + case EPOLL_TYPE_UDP_REPLY: + udp_reply_sock_handler(&c, ref, eventmask, &now); + break; case EPOLL_TYPE_PING: icmp_sock_handler(&c, ref); break; diff --git a/udp.c b/udp.c index daf4fe26..f4c696db 100644 --- a/udp.c +++ b/udp.c @@ -35,7 +35,31 @@ * =================== * * UDP doesn't use listen(), but we consider long term sockets which are allowed - * to create new flows "listening" by analogy with TCP. + * to create new flows "listening" by analogy with TCP. This listening socket + * could receive packets from multiple flows, so we use a hash table match to + * find the specific flow for a datagram. + * + * When a UDP flow is initiated from a listening socket we take a duplicate of + * the socket and store it in uflow->s[INISIDE]. This will last for the + * lifetime of the flow, even if the original listening socket is closed due to + * port auto-probing. The duplicate is used to deliver replies back to the + * originating side. + * + * Reply sockets + * ============= + * + * When a UDP flow targets a socket, we create a "reply" socket in + * uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive + * replies on the target side. This socket is both bound and connected and has + * EPOLL_TYPE_UDP_REPLY. The connect() means it will only receive datagrams + * associated with this flow, so the epoll reference directly points to the flow + * and we don't need a hash lookup. + * + * NOTE: it's possible that the reply socket could have a bound address + * overlapping with an unrelated listening socket. We assume datagrams for the + * flow will come to the reply socket in preference to a listening socket. The + * sample program contrib/udp-reuseaddr/reuseaddr-priority.c documents and tests + * that assumption. * * Port tracking * ============= @@ -56,62 +80,6 @@ * * Packets are forwarded back and forth, by prepending and stripping UDP headers * in the obvious way, with no port translation. - * - * In PASTA mode, the L2-L4 translation is skipped for connections to ports - * bound between namespaces using the loopback interface, messages are directly - * transferred between L4 sockets instead. These are called spliced connections - * for consistency with the TCP implementation, but the splice() syscall isn't - * actually used as it wouldn't make sense for datagram-based connections: a - * pair of recvmmsg() and sendmmsg() deals with this case. - * - * The connection tracking for PASTA mode is slightly complicated by the absence - * of actual connections, see struct udp_splice_port, and these examples: - * - * - from init to namespace: - * - * - forward direction: 127.0.0.1:5000 -> 127.0.0.1:80 in init from socket s, - * with epoll reference: index = 80, splice = 1, orig = 1, ns = 0 - * - if udp_splice_ns[V4][5000].sock: - * - send packet to udp_splice_ns[V4][5000].sock, with destination port - * 80 - * - otherwise: - * - create new socket udp_splice_ns[V4][5000].sock - * - bind in namespace to 127.0.0.1:5000 - * - add to epoll with reference: index = 5000, splice = 1, orig = 0, - * ns = 1 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - * - reverse direction: 127.0.0.1:80 -> 127.0.0.1:5000 in namespace socket s, - * having epoll reference: index = 5000, splice = 1, orig = 0, ns = 1 - * - if udp_splice_init[V4][80].sock: - * - send to udp_splice_init[V4][80].sock, with destination port 5000 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - otherwise, discard - * - * - from namespace to init: - * - * - forward direction: 127.0.0.1:2000 -> 127.0.0.1:22 in namespace from - * socket s, with epoll reference: index = 22, splice = 1, orig = 1, ns = 1 - * - if udp4_splice_init[V4][2000].sock: - * - send packet to udp_splice_init[V4][2000].sock, with destination - * port 22 - * - otherwise: - * - create new socket udp_splice_init[V4][2000].sock - * - bind in init to 127.0.0.1:2000 - * - add to epoll with reference: index = 2000, splice = 1, orig = 0, - * ns = 0 - * - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - * - reverse direction: 127.0.0.1:22 -> 127.0.0.1:2000 in init from socket s, - * having epoll reference: index = 2000, splice = 1, orig = 0, ns = 0 - * - if udp_splice_ns[V4][22].sock: - * - send to udp_splice_ns[V4][22].sock, with destination port 2000 - * - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - otherwise, discard */ #include <sched.h> @@ -134,6 +102,7 @@ #include <sys/socket.h> #include <sys/uio.h> #include <time.h> +#include <fcntl.h> #include "checksum.h" #include "util.h" @@ -223,7 +192,6 @@ static struct ethhdr udp6_eth_hdr; * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr) * @taph: Tap backend specific header * @s_in: Source socket address, filled in by recvmmsg() - * @splicesrc: Source port for splicing, or -1 if not spliceable * @tosidx: sidx for the destination side of this datagram's flow */ static struct udp_meta_t { @@ -232,7 +200,6 @@ static struct udp_meta_t { struct tap_hdr taph; union sockaddr_inany s_in; - int splicesrc; flow_sidx_t tosidx; } #ifdef __AVX2__ @@ -270,7 +237,6 @@ static struct mmsghdr udp_mh_splice [UDP_MAX_FRAMES]; /* IOVs for L2 frames */ static struct iovec udp_l2_iov [UDP_MAX_FRAMES][UDP_NUM_IOVS]; - /** * udp_portmap_clear() - Clear UDP port map before configuration */ @@ -383,140 +349,6 @@ static void udp_iov_init(const struct ctx *c) udp_iov_init_one(c, i); } -/** - * udp_splice_new() - Create and prepare socket for "spliced" binding - * @c: Execution context - * @v6: Set for IPv6 sockets - * @src: Source port of original connection, host order - * @ns: Does the splice originate in the ns or not - * - * Return: prepared socket, negative error code on failure - * - * #syscalls:pasta getsockname - */ -int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) -{ - struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP | EPOLLHUP }; - union epoll_ref ref = { .type = EPOLL_TYPE_UDP, - .udp = { .splice = true, .v6 = v6, .port = src } - }; - struct udp_splice_port *sp; - int act, s; - - if (ns) { - ref.udp.pif = PIF_SPLICE; - sp = &udp_splice_ns[v6 ? V6 : V4][src]; - act = UDP_ACT_SPLICE_NS; - } else { - ref.udp.pif = PIF_HOST; - sp = &udp_splice_init[v6 ? V6 : V4][src]; - act = UDP_ACT_SPLICE_INIT; - } - - s = socket(v6 ? AF_INET6 : AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, - IPPROTO_UDP); - - if (s > FD_REF_MAX) { - close(s); - return -EIO; - } - - if (s < 0) - return s; - - ref.fd = s; - - if (v6) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(src), - .sin6_addr = IN6ADDR_LOOPBACK_INIT, - }; - if (bind(s, (struct sockaddr *)&addr6, sizeof(addr6))) - goto fail; - } else { - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(src), - .sin_addr = IN4ADDR_LOOPBACK_INIT, - }; - if (bind(s, (struct sockaddr *)&addr4, sizeof(addr4))) - goto fail; - } - - sp->sock = s; - bitmap_set(udp_act[v6 ? V6 : V4][act], src); - - ev.data.u64 = ref.u64; - epoll_ctl(c->epollfd, EPOLL_CTL_ADD, s, &ev); - return s; - -fail: - close(s); - return -1; -} - -/** - * struct udp_splice_new_ns_arg - Arguments for udp_splice_new_ns() - * @c: Execution context - * @v6: Set for IPv6 - * @src: Source port of originating datagram, host order - * @dst: Destination port of originating datagram, host order - * @s: Newly created socket or negative error code - */ -struct udp_splice_new_ns_arg { - const struct ctx *c; - int v6; - in_port_t src; - int s; -}; - -/** - * udp_splice_new_ns() - Enter namespace and call udp_splice_new() - * @arg: See struct udp_splice_new_ns_arg - * - * Return: 0 - */ -static int udp_splice_new_ns(void *arg) -{ - struct udp_splice_new_ns_arg *a; - - a = (struct udp_splice_new_ns_arg *)arg; - - ns_enter(a->c); - - a->s = udp_splice_new(a->c, a->v6, a->src, true); - - return 0; -} - -/** - * udp_mmh_splice_port() - Is source address of message suitable for splicing? - * @ref: epoll reference for incoming message's origin socket - * @mmh: mmsghdr of incoming message - * - * Return: if source address of message in @mmh refers to localhost (127.0.0.1 - * or ::1) its source port (host order), otherwise -1. - */ -static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsghdr *mmh) -{ - const struct sockaddr_in6 *sa6 = mmh->msg_hdr.msg_name; - const struct sockaddr_in *sa4 = mmh->msg_hdr.msg_name; - - ASSERT(ref.type == EPOLL_TYPE_UDP); - - if (!ref.udp.splice) - return -1; - - if (ref.udp.v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr)) - return ntohs(sa6->sin6_port); - - if (!ref.udp.v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr)) - return ntohs(sa4->sin_port); - - return -1; -} - /** * udp_at_sidx() - Get UDP specific flow at given sidx * @sidx: Flow and side to retrieve @@ -542,6 +374,16 @@ struct udp_flow *udp_at_sidx(flow_sidx_t sidx) */ static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) { + if (uflow->s[INISIDE] >= 0) { + /* The listening socket needs to stay in epoll */ + close(uflow->s[INISIDE]); + } + + if (uflow->s[TGTSIDE] >= 0) { + /* But the flow specific one needs to be removed */ + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, uflow->s[TGTSIDE], NULL); + close(uflow->s[TGTSIDE]); + } flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); } @@ -549,26 +391,80 @@ static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) * udp_flow_new() - Common setup for a new UDP flow * @c: Execution context * @flow: Initiated flow + * @s_ini: Initiating socket (or -1) * @now: Timestamp * * Return: UDP specific flow, if successful, NULL on failure */ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, - const struct timespec *now) + int s_ini, const struct timespec *now) { const struct flowside *ini = &flow->f.side[INISIDE]; struct udp_flow *uflow = NULL; + const struct flowside *tgt; + uint8_t tgtpif; if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) { flow_dbg(flow, "Invalid endpoint to initiate UDP flow"); goto cancel; } - if (!flow_target(c, flow, IPPROTO_UDP)) + if (!(tgt = flow_target(c, flow, IPPROTO_UDP))) goto cancel; + tgtpif = flow->f.pif[TGTSIDE]; uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp); uflow->ts = now->tv_sec; + uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1; + + if (s_ini >= 0) { + /* When using auto port-scanning the listening port could go + * away, so we need to duplicate it */ + uflow->s[INISIDE] = fcntl(s_ini, F_DUPFD_CLOEXEC, 0); + if (uflow->s[INISIDE] < 0) { + flow_err(uflow, + "Couldn't duplicate listening socket: %s", + strerror(errno)); + goto cancel; + } + } + + if (pif_is_socket(tgtpif)) { + union { + flow_sidx_t sidx; + uint32_t data; + } fref = { + .sidx = FLOW_SIDX(flow, TGTSIDE), + }; + + uflow->s[TGTSIDE] = flowside_sock_l4(c, EPOLL_TYPE_UDP_REPLY, + tgtpif, tgt, fref.data); + if (uflow->s[TGTSIDE] < 0) { + flow_dbg(uflow, + "Couldn't open socket for spliced flow: %s", + strerror(errno)); + goto cancel; + } + + if (flowside_connect(c, uflow->s[TGTSIDE], tgtpif, tgt) < 0) { + flow_dbg(uflow, + "Couldn't connect flow socket: %s", + strerror(errno)); + goto cancel; + } + + /* It's possible, if unlikely, that we could receive some + * unrelated packets in between the bind() and connect() of this + * socket. For now we just discard these. We could consider + * trying to re-direct these to an appropriate handler, if we + * need to. + */ + /* cppcheck-suppress nullPointer */ + while (recv(uflow->s[TGTSIDE], NULL, 0, MSG_DONTWAIT) >= 0) + ; + if (errno != EAGAIN) + warn_perror("Unexpected error discarding datagrams"); + } flow_hash_insert(c, FLOW_SIDX(uflow, INISIDE)); FLOW_ACTIVATE(uflow); @@ -580,7 +476,6 @@ cancel: udp_flow_close(c, uflow); flow_alloc_cancel(flow); return FLOW_SIDX_NONE; - } /** @@ -590,6 +485,8 @@ cancel: * @meta: Metadata buffer for the datagram * @now: Timestamp * + * #syscalls fcntl + * * Return: sidx for the destination side of the flow for this packet, or * FLOW_SIDX_NONE if we couldn't find or create a flow. */ @@ -623,7 +520,7 @@ static flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref, } flow_initiate_sa(flow, ref.udp.pif, &meta->s_in, ref.udp.port); - return udp_flow_new(c, flow, now); + return udp_flow_new(c, flow, ref.fd, now); } /** @@ -647,55 +544,16 @@ static void udp_splice_prepare(struct mmsghdr *mmh, unsigned idx) * @now: Timestamp */ static void udp_splice_send(const struct ctx *c, size_t start, size_t n, - in_port_t src, in_port_t dst, - union epoll_ref ref, - const struct timespec *now) + flow_sidx_t tosidx) { - int s; - - if (ref.udp.v6) { - udp_splice_to.sa6 = (struct sockaddr_in6) { - .sin6_family = AF_INET6, - .sin6_addr = in6addr_loopback, - .sin6_port = htons(dst), - }; - } else { - udp_splice_to.sa4 = (struct sockaddr_in) { - .sin_family = AF_INET, - .sin_addr = in4addr_loopback, - .sin_port = htons(dst), - }; - } - - if (ref.udp.pif == PIF_SPLICE) { - src += c->udp.fwd_in.rdelta[src]; - s = udp_splice_init[ref.udp.v6][src].sock; - if (s < 0 && ref.udp.orig) - s = udp_splice_new(c, ref.udp.v6, src, false); - - if (s < 0) - return; - - udp_splice_ns[ref.udp.v6][dst].ts = now->tv_sec; - udp_splice_init[ref.udp.v6][src].ts = now->tv_sec; - } else { - ASSERT(ref.udp.pif == PIF_HOST); - src += c->udp.fwd_out.rdelta[src]; - s = udp_splice_ns[ref.udp.v6][src].sock; - if (s < 0 && ref.udp.orig) { - struct udp_splice_new_ns_arg arg = { - c, ref.udp.v6, src, -1, - }; - - NS_CALL(udp_splice_new_ns, &arg); - s = arg.s; - } - if (s < 0) - return; + const struct udp_flow *uflow = udp_at_sidx(tosidx); + const struct flowside *toside = &uflow->f.side[tosidx.side]; + uint8_t topif = uflow->f.pif[tosidx.side]; + int s = uflow->s[tosidx.side]; + socklen_t sl; - udp_splice_init[ref.udp.v6][dst].ts = now->tv_sec; - udp_splice_ns[ref.udp.v6][src].ts = now->tv_sec; - } + pif_sockaddr(c, &udp_splice_to, &sl, topif, + &toside->eaddr, toside->eport); sendmmsg(s, udp_mh_splice + start, n, MSG_NOSIGNAL); } @@ -922,20 +780,18 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve dstport += c->udp.fwd_in.f.delta[dstport]; /* We divide datagrams into batches based on how we need to send them, - * determined by udp_meta[i].splicesrc and udp_meta[i].tosidx. To avoid - * either two passes through the array, or recalculating splicesrc and - * tosidxfor a single entry, we have to populate it one entry *ahead* of - * the loop counter. + * determined by udp_meta[i].tosidx. To avoid either two passes through + * the array, or recalculating tosidx for a single entry, we have to + * populate it one entry *ahead* of the loop counter. */ - udp_meta[0].splicesrc = udp_mmh_splice_port(ref, mmh_recv); udp_meta[0].tosidx = udp_flow_from_sock(c, ref, &udp_meta[0], now); for (i = 0; i < n; ) { flow_sidx_t batchsidx = udp_meta[i].tosidx; - int batchsrc = udp_meta[i].splicesrc; + uint8_t batchpif = pif_at_sidx(batchsidx); int batchstart = i; do { - if (batchsrc >= 0) + if (pif_is_socket(batchpif)) udp_splice_prepare(mmh_recv, i); else udp_tap_prepare(c, mmh_recv, i, dstport, @@ -944,23 +800,54 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve if (++i >= n) break; - udp_meta[i].splicesrc = udp_mmh_splice_port(ref, - &mmh_recv[i]); udp_meta[i].tosidx = udp_flow_from_sock(c, ref, &udp_meta[i], now); - } while (flow_sidx_eq(udp_meta[i].tosidx, batchsidx) && - udp_meta[i].splicesrc == batchsrc); + } while (flow_sidx_eq(udp_meta[i].tosidx, batchsidx)); - if (batchsrc >= 0) + if (pif_is_socket(batchpif)) udp_splice_send(c, batchstart, i - batchstart, - batchsrc, dstport, ref, now); + batchsidx); else tap_send_frames(c, &udp_l2_iov[batchstart][0], UDP_NUM_IOVS, i - batchstart); } } +/** + * udp_reply_sock_handler() - Handle new data from flow specific socket + * @c: Execution context + * @ref: epoll reference + * @events: epoll events bitmap + * @now: Current timestamp + * + * #syscalls recvmmsg + */ +void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events, const struct timespec *now) +{ + flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside); + struct udp_flow *uflow = udp_at_sidx(ref.flowside); + const struct flowside *fromside = &uflow->f.side[ref.flowside.side]; + bool v6 = !inany_v4(&fromside->eaddr); + struct mmsghdr *mmh_recv = v6 ? udp6_mh_recv : udp4_mh_recv; + int from_s = uflow->s[ref.flowside.side]; + int n, i; + + ASSERT(!c->no_udp && uflow); + + if ((n = udp_sock_recv(c, from_s, events, mmh_recv)) <= 0) + return; + + flow_trace(uflow, "Received %d datagrams on reply socket", n); + uflow->ts = now->tv_sec; + + for (i = 0; i < n; i++) + udp_splice_prepare(mmh_recv, i); + + udp_splice_send(c, 0, n, tosidx); +} + /** * udp_tap_handler() - Handle packets from tap * @c: Execution context diff --git a/udp.h b/udp.h index 5865def2..db5e546e 100644 --- a/udp.h +++ b/udp.h @@ -9,8 +9,10 @@ #define UDP_TIMER_INTERVAL 1000 /* ms */ void udp_portmap_clear(void); -void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events, - const struct timespec *now); +void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events, const struct timespec *now); +void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events, const struct timespec *now); int udp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, int idx, const struct timespec *now); diff --git a/udp_flow.h b/udp_flow.h index 18af9ac4..d5b8784f 100644 --- a/udp_flow.h +++ b/udp_flow.h @@ -11,12 +11,14 @@ * struct udp - Descriptor for a flow of UDP packets * @f: Generic flow information * @ts: Activity timestamp + * @s: Socket fd (or -1) for each side of the flow */ struct udp_flow { /* Must be first element */ struct flow_common f; time_t ts; + int s[SIDES]; }; bool udp_flow_timer(const struct ctx *c, const struct udp_flow *uflow, diff --git a/util.c b/util.c index f2994a79..7c57ab15 100644 --- a/util.c +++ b/util.c @@ -61,6 +61,7 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type, socktype = SOCK_STREAM | SOCK_NONBLOCK; break; case EPOLL_TYPE_UDP: + case EPOLL_TYPE_UDP_REPLY: proto = IPPROTO_UDP; socktype = SOCK_DGRAM | SOCK_NONBLOCK; break; -- 2.45.2
On Fri, 5 Jul 2024 12:07:18 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use. For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- epoll_type.h | 2 + flow.c | 20 +++ flow.h | 2 + flow_table.h | 14 ++ passt.c | 4 + udp.c | 403 ++++++++++++++++++--------------------------------- udp.h | 6 +- udp_flow.h | 2 + util.c | 1 + 9 files changed, 194 insertions(+), 260 deletions(-) diff --git a/epoll_type.h b/epoll_type.h index b6c04199..7a752ed1 100644 --- a/epoll_type.h +++ b/epoll_type.h @@ -22,6 +22,8 @@ enum epoll_type { EPOLL_TYPE_TCP_TIMER, /* UDP sockets */ EPOLL_TYPE_UDP, + /* UDP socket for replies on a specific flow */ + EPOLL_TYPE_UDP_REPLY, /* ICMP/ICMPv6 ping sockets */ EPOLL_TYPE_PING, /* inotify fd watching for end of netns (pasta) */ diff --git a/flow.c b/flow.c index 0cb9495b..2e100ddb 100644 --- a/flow.c +++ b/flow.c @@ -236,6 +236,26 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, } } +/** flowside_connect() - Connect a socket based on flowside + * @c: Execution context + * @s: Socket to connect + * @pif: Target pif + * @tgt: Target flowside + * + * Connect @s to the endpoint address and port from @tgt. + * + * Return: 0 on success, negative on error + */ +int flowside_connect(const struct ctx *c, int s, + uint8_t pif, const struct flowside *tgt) +{ + union sockaddr_inany sa; + socklen_t sl; + + pif_sockaddr(c, &sa, &sl, pif, &tgt->eaddr, tgt->eport); + return connect(s, &sa.sa, sl); +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority diff --git a/flow.h b/flow.h index 3752e5ee..3f65ceb9 100644 --- a/flow.h +++ b/flow.h @@ -168,6 +168,8 @@ static inline bool flowside_eq(const struct flowside *left, int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, const struct flowside *tgt, uint32_t data); +int flowside_connect(const struct ctx *c, int s, + uint8_t pif, const struct flowside *tgt); /** * struct flow_common - Common fields for packet flows diff --git a/flow_table.h b/flow_table.h index 3fbc7c8d..1faac4a7 100644 --- a/flow_table.h +++ b/flow_table.h @@ -92,6 +92,20 @@ static inline flow_sidx_t flow_sidx_opposite(flow_sidx_t sidx) return (flow_sidx_t){.flow = sidx.flow, .side = !sidx.side}; } +/** pif_at_sidx - Interface for a given flow and sidepif_at_sidx()+ * @sidx: Flow & side index + * + * Return: pif for the flow & side given by @sidx + */ +static inline uint8_t pif_at_sidx(flow_sidx_t sidx) +{ + const union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return PIF_NONE; + return flow->f.pif[sidx.side]; +} + /** flow_sidx_t - Index of one side of a flow from common structure * @f: Common flow fields pointer * @side: Which side to refer to (0 or 1) diff --git a/passt.c b/passt.c index e4d45daa..f9405bee 100644 --- a/passt.c +++ b/passt.c @@ -67,6 +67,7 @@ char *epoll_type_str[] = { [EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket", [EPOLL_TYPE_TCP_TIMER] = "TCP timer", [EPOLL_TYPE_UDP] = "UDP socket", + [EPOLL_TYPE_UDP_REPLY] = "UDP reply socket", [EPOLL_TYPE_PING] = "ICMP/ICMPv6 ping socket", [EPOLL_TYPE_NSQUIT_INOTIFY] = "namespace inotify watch", [EPOLL_TYPE_NSQUIT_TIMER] = "namespace timer watch", @@ -349,6 +350,9 @@ loop: case EPOLL_TYPE_UDP: udp_buf_sock_handler(&c, ref, eventmask, &now); break; + case EPOLL_TYPE_UDP_REPLY: + udp_reply_sock_handler(&c, ref, eventmask, &now); + break; case EPOLL_TYPE_PING: icmp_sock_handler(&c, ref); break; diff --git a/udp.c b/udp.c index daf4fe26..f4c696db 100644 --- a/udp.c +++ b/udp.c @@ -35,7 +35,31 @@ * =================== * * UDP doesn't use listen(), but we consider long term sockets which are allowed - * to create new flows "listening" by analogy with TCP. + * to create new flows "listening" by analogy with TCP. This listening socket + * could receive packets from multiple flows, so we use a hash table match to + * find the specific flow for a datagram. + * + * When a UDP flow is initiated from a listening socket we take a duplicate ofThese are always "inbound" flows, right?+ * the socket and store it in uflow->s[INISIDE]. This will last for the + * lifetime of the flow, even if the original listening socket is closed due to + * port auto-probing. The duplicate is used to deliver replies back to the + * originating side. + * + * Reply sockets + * ============= + * + * When a UDP flow targets a socket, we create a "reply" socket in...and these are outbound ones?+ * uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive + * replies on the target side. This socket is both bound and connected and has + * EPOLL_TYPE_UDP_REPLY. The connect() means it will only receive datagrams + * associated with this flow, so the epoll reference directly points to the flow + * and we don't need a hash lookup. + * + * NOTE: it's possible that the reply socket could have a bound address + * overlapping with an unrelated listening socket. We assume datagrams for the + * flow will come to the reply socket in preference to a listening socket. The + * sample program contrib/udp-reuseaddr/reuseaddr-priority.c documents and testsNow it's under doc/platform-requirements/+ * that assumption. * * Port tracking * ============= @@ -56,62 +80,6 @@ * * Packets are forwarded back and forth, by prepending and stripping UDP headers * in the obvious way, with no port translation. - * - * In PASTA mode, the L2-L4 translation is skipped for connections to ports - * bound between namespaces using the loopback interface, messages are directly - * transferred between L4 sockets instead. These are called spliced connections - * for consistency with the TCP implementation, but the splice() syscall isn't - * actually used as it wouldn't make sense for datagram-based connections: a - * pair of recvmmsg() and sendmmsg() deals with this case.I think we should keep this paragraph... or did you move/add this information somewhere else I missed? Given that the "Port tracking" section is going to be away, maybe this could be moved at the end of "UDP Flows".- * - * The connection tracking for PASTA mode is slightly complicated by the absence - * of actual connections, see struct udp_splice_port, and these examples: - * - * - from init to namespace: - * - * - forward direction: 127.0.0.1:5000 -> 127.0.0.1:80 in init from socket s, - * with epoll reference: index = 80, splice = 1, orig = 1, ns = 0 - * - if udp_splice_ns[V4][5000].sock: - * - send packet to udp_splice_ns[V4][5000].sock, with destination port - * 80 - * - otherwise: - * - create new socket udp_splice_ns[V4][5000].sock - * - bind in namespace to 127.0.0.1:5000 - * - add to epoll with reference: index = 5000, splice = 1, orig = 0, - * ns = 1 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - * - reverse direction: 127.0.0.1:80 -> 127.0.0.1:5000 in namespace socket s, - * having epoll reference: index = 5000, splice = 1, orig = 0, ns = 1 - * - if udp_splice_init[V4][80].sock: - * - send to udp_splice_init[V4][80].sock, with destination port 5000 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - otherwise, discard - * - * - from namespace to init: - * - * - forward direction: 127.0.0.1:2000 -> 127.0.0.1:22 in namespace from - * socket s, with epoll reference: index = 22, splice = 1, orig = 1, ns = 1 - * - if udp4_splice_init[V4][2000].sock: - * - send packet to udp_splice_init[V4][2000].sock, with destination - * port 22 - * - otherwise: - * - create new socket udp_splice_init[V4][2000].sock - * - bind in init to 127.0.0.1:2000 - * - add to epoll with reference: index = 2000, splice = 1, orig = 0, - * ns = 0 - * - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - * - reverse direction: 127.0.0.1:22 -> 127.0.0.1:2000 in init from socket s, - * having epoll reference: index = 2000, splice = 1, orig = 0, ns = 0 - * - if udp_splice_ns[V4][22].sock: - * - send to udp_splice_ns[V4][22].sock, with destination port 2000 - * - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - otherwise, discard */ #include <sched.h> @@ -134,6 +102,7 @@ #include <sys/socket.h> #include <sys/uio.h> #include <time.h> +#include <fcntl.h> #include "checksum.h" #include "util.h" @@ -223,7 +192,6 @@ static struct ethhdr udp6_eth_hdr; * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr) * @taph: Tap backend specific header * @s_in: Source socket address, filled in by recvmmsg() - * @splicesrc: Source port for splicing, or -1 if not spliceable * @tosidx: sidx for the destination side of this datagram's flow */ static struct udp_meta_t { @@ -232,7 +200,6 @@ static struct udp_meta_t { struct tap_hdr taph; union sockaddr_inany s_in; - int splicesrc; flow_sidx_t tosidx; } #ifdef __AVX2__ @@ -270,7 +237,6 @@ static struct mmsghdr udp_mh_splice [UDP_MAX_FRAMES]; /* IOVs for L2 frames */ static struct iovec udp_l2_iov [UDP_MAX_FRAMES][UDP_NUM_IOVS]; - /** * udp_portmap_clear() - Clear UDP port map before configuration */ @@ -383,140 +349,6 @@ static void udp_iov_init(const struct ctx *c) udp_iov_init_one(c, i); } -/** - * udp_splice_new() - Create and prepare socket for "spliced" binding - * @c: Execution context - * @v6: Set for IPv6 sockets - * @src: Source port of original connection, host order - * @ns: Does the splice originate in the ns or not - * - * Return: prepared socket, negative error code on failure - * - * #syscalls:pasta getsockname - */ -int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) -{ - struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP | EPOLLHUP }; - union epoll_ref ref = { .type = EPOLL_TYPE_UDP, - .udp = { .splice = true, .v6 = v6, .port = src } - }; - struct udp_splice_port *sp; - int act, s; - - if (ns) { - ref.udp.pif = PIF_SPLICE; - sp = &udp_splice_ns[v6 ? V6 : V4][src]; - act = UDP_ACT_SPLICE_NS; - } else { - ref.udp.pif = PIF_HOST; - sp = &udp_splice_init[v6 ? V6 : V4][src]; - act = UDP_ACT_SPLICE_INIT; - } - - s = socket(v6 ? AF_INET6 : AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, - IPPROTO_UDP); - - if (s > FD_REF_MAX) { - close(s); - return -EIO; - } - - if (s < 0) - return s; - - ref.fd = s; - - if (v6) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(src), - .sin6_addr = IN6ADDR_LOOPBACK_INIT, - }; - if (bind(s, (struct sockaddr *)&addr6, sizeof(addr6))) - goto fail; - } else { - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(src), - .sin_addr = IN4ADDR_LOOPBACK_INIT, - }; - if (bind(s, (struct sockaddr *)&addr4, sizeof(addr4))) - goto fail; - } - - sp->sock = s; - bitmap_set(udp_act[v6 ? V6 : V4][act], src); - - ev.data.u64 = ref.u64; - epoll_ctl(c->epollfd, EPOLL_CTL_ADD, s, &ev); - return s; - -fail: - close(s); - return -1; -} - -/** - * struct udp_splice_new_ns_arg - Arguments for udp_splice_new_ns() - * @c: Execution context - * @v6: Set for IPv6 - * @src: Source port of originating datagram, host order - * @dst: Destination port of originating datagram, host order - * @s: Newly created socket or negative error code - */ -struct udp_splice_new_ns_arg { - const struct ctx *c; - int v6; - in_port_t src; - int s; -}; - -/** - * udp_splice_new_ns() - Enter namespace and call udp_splice_new() - * @arg: See struct udp_splice_new_ns_arg - * - * Return: 0 - */ -static int udp_splice_new_ns(void *arg) -{ - struct udp_splice_new_ns_arg *a; - - a = (struct udp_splice_new_ns_arg *)arg; - - ns_enter(a->c); - - a->s = udp_splice_new(a->c, a->v6, a->src, true); - - return 0; -} - -/** - * udp_mmh_splice_port() - Is source address of message suitable for splicing? - * @ref: epoll reference for incoming message's origin socket - * @mmh: mmsghdr of incoming message - * - * Return: if source address of message in @mmh refers to localhost (127.0.0.1 - * or ::1) its source port (host order), otherwise -1. - */ -static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsghdr *mmh) -{ - const struct sockaddr_in6 *sa6 = mmh->msg_hdr.msg_name; - const struct sockaddr_in *sa4 = mmh->msg_hdr.msg_name; - - ASSERT(ref.type == EPOLL_TYPE_UDP); - - if (!ref.udp.splice) - return -1; - - if (ref.udp.v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr)) - return ntohs(sa6->sin6_port); - - if (!ref.udp.v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr)) - return ntohs(sa4->sin_port); - - return -1; -} - /** * udp_at_sidx() - Get UDP specific flow at given sidx * @sidx: Flow and side to retrieve @@ -542,6 +374,16 @@ struct udp_flow *udp_at_sidx(flow_sidx_t sidx) */ static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) { + if (uflow->s[INISIDE] >= 0) { + /* The listening socket needs to stay in epoll */ + close(uflow->s[INISIDE]); + } + + if (uflow->s[TGTSIDE] >= 0) { + /* But the flow specific one needs to be removed */ + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, uflow->s[TGTSIDE], NULL); + close(uflow->s[TGTSIDE]);Keeping numbers of closed sockets around, instead of setting them to -1 right away, makes me a bit nervous. On the other hand it's obvious that udp_flow_new() sets them to -1 anyway, so there isn't much that can go wrong.+ } flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); } @@ -549,26 +391,80 @@ static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) * udp_flow_new() - Common setup for a new UDP flow * @c: Execution context * @flow: Initiated flow + * @s_ini: Initiating socket (or -1) * @now: Timestamp * * Return: UDP specific flow, if successful, NULL on failure */ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, - const struct timespec *now) + int s_ini, const struct timespec *now) { const struct flowside *ini = &flow->f.side[INISIDE]; struct udp_flow *uflow = NULL; + const struct flowside *tgt; + uint8_t tgtpif; if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) { flow_dbg(flow, "Invalid endpoint to initiate UDP flow"); goto cancel; } - if (!flow_target(c, flow, IPPROTO_UDP)) + if (!(tgt = flow_target(c, flow, IPPROTO_UDP))) goto cancel; + tgtpif = flow->f.pif[TGTSIDE]; uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp); uflow->ts = now->tv_sec; + uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1; + + if (s_ini >= 0) { + /* When using auto port-scanning the listening port could go + * away, so we need to duplicate it */For consistency: closing */ on a newline. I would also say "the socket" instead of "it", otherwise it seems to be referring to the port at a first glance.+ uflow->s[INISIDE] = fcntl(s_ini, F_DUPFD_CLOEXEC, 0);There's one aspect of this I don't understand: if s_ini is closed while checking for bound ports (is it? I didn't really reach the end of this series), aren't duplicates also closed? That is, the documentation of dup2(2), which should be the same for this purpose, states that the duplicate inherits "file status flags", which I would assume also includes the fact that a socket is closed. I didn't test that though. If duplicates are closed, I guess an alternative solution could be to introduce some kind of reference counting for sockets... somewhere.+ if (uflow->s[INISIDE] < 0) { + flow_err(uflow, + "Couldn't duplicate listening socket: %s", + strerror(errno)); + goto cancel; + } + } + + if (pif_is_socket(tgtpif)) { + union { + flow_sidx_t sidx; + uint32_t data; + } fref = { + .sidx = FLOW_SIDX(flow, TGTSIDE), + }; + + uflow->s[TGTSIDE] = flowside_sock_l4(c, EPOLL_TYPE_UDP_REPLY, + tgtpif, tgt, fref.data); + if (uflow->s[TGTSIDE] < 0) { + flow_dbg(uflow, + "Couldn't open socket for spliced flow: %s", + strerror(errno)); + goto cancel; + } + + if (flowside_connect(c, uflow->s[TGTSIDE], tgtpif, tgt) < 0) { + flow_dbg(uflow, + "Couldn't connect flow socket: %s", + strerror(errno)); + goto cancel; + } + + /* It's possible, if unlikely, that we could receive some + * unrelated packets in between the bind() and connect() of this + * socket. For now we just discard these. We could consider + * trying to re-direct these to an appropriate handler, if weSimply "redirect"?+ * need to. + */ + /* cppcheck-suppress nullPointer */ + while (recv(uflow->s[TGTSIDE], NULL, 0, MSG_DONTWAIT) >= 0) + ;Could a local attacker (another user) attempt to use this for denial of service? Of course, somebody could flood us anyway and we would get and handle all the events that that causes. But this case is different because we could get stuck for an unlimited amount of time without serving other sockets at all. If that's a possibility, perhaps a limit for this loop (a maximum amount of recv()) tries would be a good idea. I'm not sure how we should handle the case where we exceed the threshold. Another one, which adds some complexity, but looks more correct to me, would be to try a single recv() call, and if we get data from it, fail creating the new flow entirely. I'm still reviewing the rest (of this patch and of this series). -- Stefano
On Wed, Jul 10, 2024 at 12:32:33AM +0200, Stefano Brivio wrote:On Fri, 5 Jul 2024 12:07:18 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Done.When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use. For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- epoll_type.h | 2 + flow.c | 20 +++ flow.h | 2 + flow_table.h | 14 ++ passt.c | 4 + udp.c | 403 ++++++++++++++++++--------------------------------- udp.h | 6 +- udp_flow.h | 2 + util.c | 1 + 9 files changed, 194 insertions(+), 260 deletions(-) diff --git a/epoll_type.h b/epoll_type.h index b6c04199..7a752ed1 100644 --- a/epoll_type.h +++ b/epoll_type.h @@ -22,6 +22,8 @@ enum epoll_type { EPOLL_TYPE_TCP_TIMER, /* UDP sockets */ EPOLL_TYPE_UDP, + /* UDP socket for replies on a specific flow */ + EPOLL_TYPE_UDP_REPLY, /* ICMP/ICMPv6 ping sockets */ EPOLL_TYPE_PING, /* inotify fd watching for end of netns (pasta) */ diff --git a/flow.c b/flow.c index 0cb9495b..2e100ddb 100644 --- a/flow.c +++ b/flow.c @@ -236,6 +236,26 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, } } +/** flowside_connect() - Connect a socket based on flowside + * @c: Execution context + * @s: Socket to connect + * @pif: Target pif + * @tgt: Target flowside + * + * Connect @s to the endpoint address and port from @tgt. + * + * Return: 0 on success, negative on error + */ +int flowside_connect(const struct ctx *c, int s, + uint8_t pif, const struct flowside *tgt) +{ + union sockaddr_inany sa; + socklen_t sl; + + pif_sockaddr(c, &sa, &sl, pif, &tgt->eaddr, tgt->eport); + return connect(s, &sa.sa, sl); +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority diff --git a/flow.h b/flow.h index 3752e5ee..3f65ceb9 100644 --- a/flow.h +++ b/flow.h @@ -168,6 +168,8 @@ static inline bool flowside_eq(const struct flowside *left, int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, const struct flowside *tgt, uint32_t data); +int flowside_connect(const struct ctx *c, int s, + uint8_t pif, const struct flowside *tgt); /** * struct flow_common - Common fields for packet flows diff --git a/flow_table.h b/flow_table.h index 3fbc7c8d..1faac4a7 100644 --- a/flow_table.h +++ b/flow_table.h @@ -92,6 +92,20 @@ static inline flow_sidx_t flow_sidx_opposite(flow_sidx_t sidx) return (flow_sidx_t){.flow = sidx.flow, .side = !sidx.side}; } +/** pif_at_sidx - Interface for a given flow and sidepif_at_sidx()No, they could come from a socket listening in the namespace (-U).+ * @sidx: Flow & side index + * + * Return: pif for the flow & side given by @sidx + */ +static inline uint8_t pif_at_sidx(flow_sidx_t sidx) +{ + const union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return PIF_NONE; + return flow->f.pif[sidx.side]; +} + /** flow_sidx_t - Index of one side of a flow from common structure * @f: Common flow fields pointer * @side: Which side to refer to (0 or 1) diff --git a/passt.c b/passt.c index e4d45daa..f9405bee 100644 --- a/passt.c +++ b/passt.c @@ -67,6 +67,7 @@ char *epoll_type_str[] = { [EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket", [EPOLL_TYPE_TCP_TIMER] = "TCP timer", [EPOLL_TYPE_UDP] = "UDP socket", + [EPOLL_TYPE_UDP_REPLY] = "UDP reply socket", [EPOLL_TYPE_PING] = "ICMP/ICMPv6 ping socket", [EPOLL_TYPE_NSQUIT_INOTIFY] = "namespace inotify watch", [EPOLL_TYPE_NSQUIT_TIMER] = "namespace timer watch", @@ -349,6 +350,9 @@ loop: case EPOLL_TYPE_UDP: udp_buf_sock_handler(&c, ref, eventmask, &now); break; + case EPOLL_TYPE_UDP_REPLY: + udp_reply_sock_handler(&c, ref, eventmask, &now); + break; case EPOLL_TYPE_PING: icmp_sock_handler(&c, ref); break; diff --git a/udp.c b/udp.c index daf4fe26..f4c696db 100644 --- a/udp.c +++ b/udp.c @@ -35,7 +35,31 @@ * =================== * * UDP doesn't use listen(), but we consider long term sockets which are allowed - * to create new flows "listening" by analogy with TCP. + * to create new flows "listening" by analogy with TCP. This listening socket + * could receive packets from multiple flows, so we use a hash table match to + * find the specific flow for a datagram. + * + * When a UDP flow is initiated from a listening socket we take a duplicate ofThese are always "inbound" flows, right?No. Again, we could be creating this socket in the namespace. A spliced flow will have both the dup() of a listening socket, and a reply socket, regardless of whether it's inbound (HOST -> SPLICE) or outbound (SPLICE -> HOST).+ * the socket and store it in uflow->s[INISIDE]. This will last for the + * lifetime of the flow, even if the original listening socket is closed due to + * port auto-probing. The duplicate is used to deliver replies back to the + * originating side. + * + * Reply sockets + * ============= + * + * When a UDP flow targets a socket, we create a "reply" socket in...and these are outbound ones?Thanks for the catch, corrected.+ * uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive + * replies on the target side. This socket is both bound and connected and has + * EPOLL_TYPE_UDP_REPLY. The connect() means it will only receive datagrams + * associated with this flow, so the epoll reference directly points to the flow + * and we don't need a hash lookup. + * + * NOTE: it's possible that the reply socket could have a bound address + * overlapping with an unrelated listening socket. We assume datagrams for the + * flow will come to the reply socket in preference to a listening socket. The + * sample program contrib/udp-reuseaddr/reuseaddr-priority.c documents and testsNow it's under doc/platform-requirements/Fair point. I've paraphrahsed and updated this a bit and added it as a new section about spliced flows.+ * that assumption. * * Port tracking * ============= @@ -56,62 +80,6 @@ * * Packets are forwarded back and forth, by prepending and stripping UDP headers * in the obvious way, with no port translation. - * - * In PASTA mode, the L2-L4 translation is skipped for connections to ports - * bound between namespaces using the loopback interface, messages are directly - * transferred between L4 sockets instead. These are called spliced connections - * for consistency with the TCP implementation, but the splice() syscall isn't - * actually used as it wouldn't make sense for datagram-based connections: a - * pair of recvmmsg() and sendmmsg() deals with this case.I think we should keep this paragraph... or did you move/add this information somewhere else I missed? Given that the "Port tracking" section is going to be away, maybe this could be moved at the end of "UDP Flows".Right, we're about to destroy the flow entry entirely, so it is safe. But I agree that it's nervous-making. It's pretty cheap to set them to -1 explicitly, so I'm doing that now.- * - * The connection tracking for PASTA mode is slightly complicated by the absence - * of actual connections, see struct udp_splice_port, and these examples: - * - * - from init to namespace: - * - * - forward direction: 127.0.0.1:5000 -> 127.0.0.1:80 in init from socket s, - * with epoll reference: index = 80, splice = 1, orig = 1, ns = 0 - * - if udp_splice_ns[V4][5000].sock: - * - send packet to udp_splice_ns[V4][5000].sock, with destination port - * 80 - * - otherwise: - * - create new socket udp_splice_ns[V4][5000].sock - * - bind in namespace to 127.0.0.1:5000 - * - add to epoll with reference: index = 5000, splice = 1, orig = 0, - * ns = 1 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - * - reverse direction: 127.0.0.1:80 -> 127.0.0.1:5000 in namespace socket s, - * having epoll reference: index = 5000, splice = 1, orig = 0, ns = 1 - * - if udp_splice_init[V4][80].sock: - * - send to udp_splice_init[V4][80].sock, with destination port 5000 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - otherwise, discard - * - * - from namespace to init: - * - * - forward direction: 127.0.0.1:2000 -> 127.0.0.1:22 in namespace from - * socket s, with epoll reference: index = 22, splice = 1, orig = 1, ns = 1 - * - if udp4_splice_init[V4][2000].sock: - * - send packet to udp_splice_init[V4][2000].sock, with destination - * port 22 - * - otherwise: - * - create new socket udp_splice_init[V4][2000].sock - * - bind in init to 127.0.0.1:2000 - * - add to epoll with reference: index = 2000, splice = 1, orig = 0, - * ns = 0 - * - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - * - reverse direction: 127.0.0.1:22 -> 127.0.0.1:2000 in init from socket s, - * having epoll reference: index = 2000, splice = 1, orig = 0, ns = 0 - * - if udp_splice_ns[V4][22].sock: - * - send to udp_splice_ns[V4][22].sock, with destination port 2000 - * - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - otherwise, discard */ #include <sched.h> @@ -134,6 +102,7 @@ #include <sys/socket.h> #include <sys/uio.h> #include <time.h> +#include <fcntl.h> #include "checksum.h" #include "util.h" @@ -223,7 +192,6 @@ static struct ethhdr udp6_eth_hdr; * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr) * @taph: Tap backend specific header * @s_in: Source socket address, filled in by recvmmsg() - * @splicesrc: Source port for splicing, or -1 if not spliceable * @tosidx: sidx for the destination side of this datagram's flow */ static struct udp_meta_t { @@ -232,7 +200,6 @@ static struct udp_meta_t { struct tap_hdr taph; union sockaddr_inany s_in; - int splicesrc; flow_sidx_t tosidx; } #ifdef __AVX2__ @@ -270,7 +237,6 @@ static struct mmsghdr udp_mh_splice [UDP_MAX_FRAMES]; /* IOVs for L2 frames */ static struct iovec udp_l2_iov [UDP_MAX_FRAMES][UDP_NUM_IOVS]; - /** * udp_portmap_clear() - Clear UDP port map before configuration */ @@ -383,140 +349,6 @@ static void udp_iov_init(const struct ctx *c) udp_iov_init_one(c, i); } -/** - * udp_splice_new() - Create and prepare socket for "spliced" binding - * @c: Execution context - * @v6: Set for IPv6 sockets - * @src: Source port of original connection, host order - * @ns: Does the splice originate in the ns or not - * - * Return: prepared socket, negative error code on failure - * - * #syscalls:pasta getsockname - */ -int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) -{ - struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP | EPOLLHUP }; - union epoll_ref ref = { .type = EPOLL_TYPE_UDP, - .udp = { .splice = true, .v6 = v6, .port = src } - }; - struct udp_splice_port *sp; - int act, s; - - if (ns) { - ref.udp.pif = PIF_SPLICE; - sp = &udp_splice_ns[v6 ? V6 : V4][src]; - act = UDP_ACT_SPLICE_NS; - } else { - ref.udp.pif = PIF_HOST; - sp = &udp_splice_init[v6 ? V6 : V4][src]; - act = UDP_ACT_SPLICE_INIT; - } - - s = socket(v6 ? AF_INET6 : AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, - IPPROTO_UDP); - - if (s > FD_REF_MAX) { - close(s); - return -EIO; - } - - if (s < 0) - return s; - - ref.fd = s; - - if (v6) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(src), - .sin6_addr = IN6ADDR_LOOPBACK_INIT, - }; - if (bind(s, (struct sockaddr *)&addr6, sizeof(addr6))) - goto fail; - } else { - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(src), - .sin_addr = IN4ADDR_LOOPBACK_INIT, - }; - if (bind(s, (struct sockaddr *)&addr4, sizeof(addr4))) - goto fail; - } - - sp->sock = s; - bitmap_set(udp_act[v6 ? V6 : V4][act], src); - - ev.data.u64 = ref.u64; - epoll_ctl(c->epollfd, EPOLL_CTL_ADD, s, &ev); - return s; - -fail: - close(s); - return -1; -} - -/** - * struct udp_splice_new_ns_arg - Arguments for udp_splice_new_ns() - * @c: Execution context - * @v6: Set for IPv6 - * @src: Source port of originating datagram, host order - * @dst: Destination port of originating datagram, host order - * @s: Newly created socket or negative error code - */ -struct udp_splice_new_ns_arg { - const struct ctx *c; - int v6; - in_port_t src; - int s; -}; - -/** - * udp_splice_new_ns() - Enter namespace and call udp_splice_new() - * @arg: See struct udp_splice_new_ns_arg - * - * Return: 0 - */ -static int udp_splice_new_ns(void *arg) -{ - struct udp_splice_new_ns_arg *a; - - a = (struct udp_splice_new_ns_arg *)arg; - - ns_enter(a->c); - - a->s = udp_splice_new(a->c, a->v6, a->src, true); - - return 0; -} - -/** - * udp_mmh_splice_port() - Is source address of message suitable for splicing? - * @ref: epoll reference for incoming message's origin socket - * @mmh: mmsghdr of incoming message - * - * Return: if source address of message in @mmh refers to localhost (127.0.0.1 - * or ::1) its source port (host order), otherwise -1. - */ -static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsghdr *mmh) -{ - const struct sockaddr_in6 *sa6 = mmh->msg_hdr.msg_name; - const struct sockaddr_in *sa4 = mmh->msg_hdr.msg_name; - - ASSERT(ref.type == EPOLL_TYPE_UDP); - - if (!ref.udp.splice) - return -1; - - if (ref.udp.v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr)) - return ntohs(sa6->sin6_port); - - if (!ref.udp.v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr)) - return ntohs(sa4->sin_port); - - return -1; -} - /** * udp_at_sidx() - Get UDP specific flow at given sidx * @sidx: Flow and side to retrieve @@ -542,6 +374,16 @@ struct udp_flow *udp_at_sidx(flow_sidx_t sidx) */ static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) { + if (uflow->s[INISIDE] >= 0) { + /* The listening socket needs to stay in epoll */ + close(uflow->s[INISIDE]); + } + + if (uflow->s[TGTSIDE] >= 0) { + /* But the flow specific one needs to be removed */ + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, uflow->s[TGTSIDE], NULL); + close(uflow->s[TGTSIDE]);Keeping numbers of closed sockets around, instead of setting them to -1 right away, makes me a bit nervous. On the other hand it's obvious that udp_flow_new() sets them to -1 anyway, so there isn't much that can go wrong.Done.+ } flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); } @@ -549,26 +391,80 @@ static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) * udp_flow_new() - Common setup for a new UDP flow * @c: Execution context * @flow: Initiated flow + * @s_ini: Initiating socket (or -1) * @now: Timestamp * * Return: UDP specific flow, if successful, NULL on failure */ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, - const struct timespec *now) + int s_ini, const struct timespec *now) { const struct flowside *ini = &flow->f.side[INISIDE]; struct udp_flow *uflow = NULL; + const struct flowside *tgt; + uint8_t tgtpif; if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) { flow_dbg(flow, "Invalid endpoint to initiate UDP flow"); goto cancel; } - if (!flow_target(c, flow, IPPROTO_UDP)) + if (!(tgt = flow_target(c, flow, IPPROTO_UDP))) goto cancel; + tgtpif = flow->f.pif[TGTSIDE]; uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp); uflow->ts = now->tv_sec; + uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1; + + if (s_ini >= 0) { + /* When using auto port-scanning the listening port could go + * away, so we need to duplicate it */For consistency: closing */ on a newline. I would also say "the socket" instead of "it", otherwise it seems to be referring to the port at a first glance.I don't believe so. My understanding is that dup() (and the rest) make a new fd referencing the same underlying file object, yes. But AIUI, close() just closes one fd - the underlying object is only closed only when all fds are gone.+ uflow->s[INISIDE] = fcntl(s_ini, F_DUPFD_CLOEXEC, 0);There's one aspect of this I don't understand: if s_ini is closed while checking for bound ports (is it? I didn't really reach the end of this series), aren't duplicates also closed? That is, the documentation of dup2(2), which should be the same for this purpose, states that the duplicate inherits "file status flags", which I would assume also includes the fact that a socket is closed. I didn't test that though.If duplicates are closed, I guess an alternative solution could be to introduce some kind of reference counting for sockets... somewhere... in other words, I believe the kernel does the reference counting. I should verify this though, I'll try to come up with something new for doc/platform-requirements.Done.+ if (uflow->s[INISIDE] < 0) { + flow_err(uflow, + "Couldn't duplicate listening socket: %s", + strerror(errno)); + goto cancel; + } + } + + if (pif_is_socket(tgtpif)) { + union { + flow_sidx_t sidx; + uint32_t data; + } fref = { + .sidx = FLOW_SIDX(flow, TGTSIDE), + }; + + uflow->s[TGTSIDE] = flowside_sock_l4(c, EPOLL_TYPE_UDP_REPLY, + tgtpif, tgt, fref.data); + if (uflow->s[TGTSIDE] < 0) { + flow_dbg(uflow, + "Couldn't open socket for spliced flow: %s", + strerror(errno)); + goto cancel; + } + + if (flowside_connect(c, uflow->s[TGTSIDE], tgtpif, tgt) < 0) { + flow_dbg(uflow, + "Couldn't connect flow socket: %s", + strerror(errno)); + goto cancel; + } + + /* It's possible, if unlikely, that we could receive some + * unrelated packets in between the bind() and connect() of this + * socket. For now we just discard these. We could consider + * trying to re-direct these to an appropriate handler, if weSimply "redirect"?Ah, interesting question.+ * need to. + */ + /* cppcheck-suppress nullPointer */ + while (recv(uflow->s[TGTSIDE], NULL, 0, MSG_DONTWAIT) >= 0) + ;Could a local attacker (another user) attempt to use this for denial of service?Of course, somebody could flood us anyway and we would get and handle all the events that that causes. But this case is different because we could get stuck for an unlimited amount of time without serving other sockets at all.Right.If that's a possibility, perhaps a limit for this loop (a maximum amount of recv()) tries would be a good idea. I'm not sure how we should handle the case where we exceed the threshold.We could fail to create the flow. That would limit the damage, but see below.Another one, which adds some complexity, but looks more correct to me, would be to try a single recv() call, and if we get data from it, fail creating the new flow entirely.Right. I also considered a single recvmmsg(); the difficulty with that is making suitable arrays for it, since the normal ones may be in use at this point. This removes the potential DoS of wedging passt completely, but raises the possibility of a different one. Could an attacker - not necessarily on the same host, but network-closer than the guest's intended endpoint - prevent the guest from connecting to a particular service by spamming fake reply packets here. I believe it would need to anticipate the guest's source port to do so, which might be good enough.I'm still reviewing the rest (of this patch and of this series).-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Wed, 10 Jul 2024 10:23:14 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Wed, Jul 10, 2024 at 12:32:33AM +0200, Stefano Brivio wrote:Ah, probably, yes.On Fri, 5 Jul 2024 12:07:18 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Done.When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use. For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- epoll_type.h | 2 + flow.c | 20 +++ flow.h | 2 + flow_table.h | 14 ++ passt.c | 4 + udp.c | 403 ++++++++++++++++++--------------------------------- udp.h | 6 +- udp_flow.h | 2 + util.c | 1 + 9 files changed, 194 insertions(+), 260 deletions(-) diff --git a/epoll_type.h b/epoll_type.h index b6c04199..7a752ed1 100644 --- a/epoll_type.h +++ b/epoll_type.h @@ -22,6 +22,8 @@ enum epoll_type { EPOLL_TYPE_TCP_TIMER, /* UDP sockets */ EPOLL_TYPE_UDP, + /* UDP socket for replies on a specific flow */ + EPOLL_TYPE_UDP_REPLY, /* ICMP/ICMPv6 ping sockets */ EPOLL_TYPE_PING, /* inotify fd watching for end of netns (pasta) */ diff --git a/flow.c b/flow.c index 0cb9495b..2e100ddb 100644 --- a/flow.c +++ b/flow.c @@ -236,6 +236,26 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, } } +/** flowside_connect() - Connect a socket based on flowside + * @c: Execution context + * @s: Socket to connect + * @pif: Target pif + * @tgt: Target flowside + * + * Connect @s to the endpoint address and port from @tgt. + * + * Return: 0 on success, negative on error + */ +int flowside_connect(const struct ctx *c, int s, + uint8_t pif, const struct flowside *tgt) +{ + union sockaddr_inany sa; + socklen_t sl; + + pif_sockaddr(c, &sa, &sl, pif, &tgt->eaddr, tgt->eport); + return connect(s, &sa.sa, sl); +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority diff --git a/flow.h b/flow.h index 3752e5ee..3f65ceb9 100644 --- a/flow.h +++ b/flow.h @@ -168,6 +168,8 @@ static inline bool flowside_eq(const struct flowside *left, int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, const struct flowside *tgt, uint32_t data); +int flowside_connect(const struct ctx *c, int s, + uint8_t pif, const struct flowside *tgt); /** * struct flow_common - Common fields for packet flows diff --git a/flow_table.h b/flow_table.h index 3fbc7c8d..1faac4a7 100644 --- a/flow_table.h +++ b/flow_table.h @@ -92,6 +92,20 @@ static inline flow_sidx_t flow_sidx_opposite(flow_sidx_t sidx) return (flow_sidx_t){.flow = sidx.flow, .side = !sidx.side}; } +/** pif_at_sidx - Interface for a given flow and sidepif_at_sidx()No, they could come from a socket listening in the namespace (-U).+ * @sidx: Flow & side index + * + * Return: pif for the flow & side given by @sidx + */ +static inline uint8_t pif_at_sidx(flow_sidx_t sidx) +{ + const union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return PIF_NONE; + return flow->f.pif[sidx.side]; +} + /** flow_sidx_t - Index of one side of a flow from common structure * @f: Common flow fields pointer * @side: Which side to refer to (0 or 1) diff --git a/passt.c b/passt.c index e4d45daa..f9405bee 100644 --- a/passt.c +++ b/passt.c @@ -67,6 +67,7 @@ char *epoll_type_str[] = { [EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket", [EPOLL_TYPE_TCP_TIMER] = "TCP timer", [EPOLL_TYPE_UDP] = "UDP socket", + [EPOLL_TYPE_UDP_REPLY] = "UDP reply socket", [EPOLL_TYPE_PING] = "ICMP/ICMPv6 ping socket", [EPOLL_TYPE_NSQUIT_INOTIFY] = "namespace inotify watch", [EPOLL_TYPE_NSQUIT_TIMER] = "namespace timer watch", @@ -349,6 +350,9 @@ loop: case EPOLL_TYPE_UDP: udp_buf_sock_handler(&c, ref, eventmask, &now); break; + case EPOLL_TYPE_UDP_REPLY: + udp_reply_sock_handler(&c, ref, eventmask, &now); + break; case EPOLL_TYPE_PING: icmp_sock_handler(&c, ref); break; diff --git a/udp.c b/udp.c index daf4fe26..f4c696db 100644 --- a/udp.c +++ b/udp.c @@ -35,7 +35,31 @@ * =================== * * UDP doesn't use listen(), but we consider long term sockets which are allowed - * to create new flows "listening" by analogy with TCP. + * to create new flows "listening" by analogy with TCP. This listening socket + * could receive packets from multiple flows, so we use a hash table match to + * find the specific flow for a datagram. + * + * When a UDP flow is initiated from a listening socket we take a duplicate ofThese are always "inbound" flows, right?No. Again, we could be creating this socket in the namespace. A spliced flow will have both the dup() of a listening socket, and a reply socket, regardless of whether it's inbound (HOST -> SPLICE) or outbound (SPLICE -> HOST).+ * the socket and store it in uflow->s[INISIDE]. This will last for the + * lifetime of the flow, even if the original listening socket is closed due to + * port auto-probing. The duplicate is used to deliver replies back to the + * originating side. + * + * Reply sockets + * ============= + * + * When a UDP flow targets a socket, we create a "reply" socket in...and these are outbound ones?Thanks for the catch, corrected.+ * uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive + * replies on the target side. This socket is both bound and connected and has + * EPOLL_TYPE_UDP_REPLY. The connect() means it will only receive datagrams + * associated with this flow, so the epoll reference directly points to the flow + * and we don't need a hash lookup. + * + * NOTE: it's possible that the reply socket could have a bound address + * overlapping with an unrelated listening socket. We assume datagrams for the + * flow will come to the reply socket in preference to a listening socket. The + * sample program contrib/udp-reuseaddr/reuseaddr-priority.c documents and testsNow it's under doc/platform-requirements/Fair point. I've paraphrahsed and updated this a bit and added it as a new section about spliced flows.+ * that assumption. * * Port tracking * ============= @@ -56,62 +80,6 @@ * * Packets are forwarded back and forth, by prepending and stripping UDP headers * in the obvious way, with no port translation. - * - * In PASTA mode, the L2-L4 translation is skipped for connections to ports - * bound between namespaces using the loopback interface, messages are directly - * transferred between L4 sockets instead. These are called spliced connections - * for consistency with the TCP implementation, but the splice() syscall isn't - * actually used as it wouldn't make sense for datagram-based connections: a - * pair of recvmmsg() and sendmmsg() deals with this case.I think we should keep this paragraph... or did you move/add this information somewhere else I missed? Given that the "Port tracking" section is going to be away, maybe this could be moved at the end of "UDP Flows".Right, we're about to destroy the flow entry entirely, so it is safe. But I agree that it's nervous-making. It's pretty cheap to set them to -1 explicitly, so I'm doing that now.- * - * The connection tracking for PASTA mode is slightly complicated by the absence - * of actual connections, see struct udp_splice_port, and these examples: - * - * - from init to namespace: - * - * - forward direction: 127.0.0.1:5000 -> 127.0.0.1:80 in init from socket s, - * with epoll reference: index = 80, splice = 1, orig = 1, ns = 0 - * - if udp_splice_ns[V4][5000].sock: - * - send packet to udp_splice_ns[V4][5000].sock, with destination port - * 80 - * - otherwise: - * - create new socket udp_splice_ns[V4][5000].sock - * - bind in namespace to 127.0.0.1:5000 - * - add to epoll with reference: index = 5000, splice = 1, orig = 0, - * ns = 1 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - * - reverse direction: 127.0.0.1:80 -> 127.0.0.1:5000 in namespace socket s, - * having epoll reference: index = 5000, splice = 1, orig = 0, ns = 1 - * - if udp_splice_init[V4][80].sock: - * - send to udp_splice_init[V4][80].sock, with destination port 5000 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - otherwise, discard - * - * - from namespace to init: - * - * - forward direction: 127.0.0.1:2000 -> 127.0.0.1:22 in namespace from - * socket s, with epoll reference: index = 22, splice = 1, orig = 1, ns = 1 - * - if udp4_splice_init[V4][2000].sock: - * - send packet to udp_splice_init[V4][2000].sock, with destination - * port 22 - * - otherwise: - * - create new socket udp_splice_init[V4][2000].sock - * - bind in init to 127.0.0.1:2000 - * - add to epoll with reference: index = 2000, splice = 1, orig = 0, - * ns = 0 - * - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - * - reverse direction: 127.0.0.1:22 -> 127.0.0.1:2000 in init from socket s, - * having epoll reference: index = 2000, splice = 1, orig = 0, ns = 0 - * - if udp_splice_ns[V4][22].sock: - * - send to udp_splice_ns[V4][22].sock, with destination port 2000 - * - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - otherwise, discard */ #include <sched.h> @@ -134,6 +102,7 @@ #include <sys/socket.h> #include <sys/uio.h> #include <time.h> +#include <fcntl.h> #include "checksum.h" #include "util.h" @@ -223,7 +192,6 @@ static struct ethhdr udp6_eth_hdr; * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr) * @taph: Tap backend specific header * @s_in: Source socket address, filled in by recvmmsg() - * @splicesrc: Source port for splicing, or -1 if not spliceable * @tosidx: sidx for the destination side of this datagram's flow */ static struct udp_meta_t { @@ -232,7 +200,6 @@ static struct udp_meta_t { struct tap_hdr taph; union sockaddr_inany s_in; - int splicesrc; flow_sidx_t tosidx; } #ifdef __AVX2__ @@ -270,7 +237,6 @@ static struct mmsghdr udp_mh_splice [UDP_MAX_FRAMES]; /* IOVs for L2 frames */ static struct iovec udp_l2_iov [UDP_MAX_FRAMES][UDP_NUM_IOVS]; - /** * udp_portmap_clear() - Clear UDP port map before configuration */ @@ -383,140 +349,6 @@ static void udp_iov_init(const struct ctx *c) udp_iov_init_one(c, i); } -/** - * udp_splice_new() - Create and prepare socket for "spliced" binding - * @c: Execution context - * @v6: Set for IPv6 sockets - * @src: Source port of original connection, host order - * @ns: Does the splice originate in the ns or not - * - * Return: prepared socket, negative error code on failure - * - * #syscalls:pasta getsockname - */ -int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) -{ - struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP | EPOLLHUP }; - union epoll_ref ref = { .type = EPOLL_TYPE_UDP, - .udp = { .splice = true, .v6 = v6, .port = src } - }; - struct udp_splice_port *sp; - int act, s; - - if (ns) { - ref.udp.pif = PIF_SPLICE; - sp = &udp_splice_ns[v6 ? V6 : V4][src]; - act = UDP_ACT_SPLICE_NS; - } else { - ref.udp.pif = PIF_HOST; - sp = &udp_splice_init[v6 ? V6 : V4][src]; - act = UDP_ACT_SPLICE_INIT; - } - - s = socket(v6 ? AF_INET6 : AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, - IPPROTO_UDP); - - if (s > FD_REF_MAX) { - close(s); - return -EIO; - } - - if (s < 0) - return s; - - ref.fd = s; - - if (v6) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(src), - .sin6_addr = IN6ADDR_LOOPBACK_INIT, - }; - if (bind(s, (struct sockaddr *)&addr6, sizeof(addr6))) - goto fail; - } else { - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(src), - .sin_addr = IN4ADDR_LOOPBACK_INIT, - }; - if (bind(s, (struct sockaddr *)&addr4, sizeof(addr4))) - goto fail; - } - - sp->sock = s; - bitmap_set(udp_act[v6 ? V6 : V4][act], src); - - ev.data.u64 = ref.u64; - epoll_ctl(c->epollfd, EPOLL_CTL_ADD, s, &ev); - return s; - -fail: - close(s); - return -1; -} - -/** - * struct udp_splice_new_ns_arg - Arguments for udp_splice_new_ns() - * @c: Execution context - * @v6: Set for IPv6 - * @src: Source port of originating datagram, host order - * @dst: Destination port of originating datagram, host order - * @s: Newly created socket or negative error code - */ -struct udp_splice_new_ns_arg { - const struct ctx *c; - int v6; - in_port_t src; - int s; -}; - -/** - * udp_splice_new_ns() - Enter namespace and call udp_splice_new() - * @arg: See struct udp_splice_new_ns_arg - * - * Return: 0 - */ -static int udp_splice_new_ns(void *arg) -{ - struct udp_splice_new_ns_arg *a; - - a = (struct udp_splice_new_ns_arg *)arg; - - ns_enter(a->c); - - a->s = udp_splice_new(a->c, a->v6, a->src, true); - - return 0; -} - -/** - * udp_mmh_splice_port() - Is source address of message suitable for splicing? - * @ref: epoll reference for incoming message's origin socket - * @mmh: mmsghdr of incoming message - * - * Return: if source address of message in @mmh refers to localhost (127.0.0.1 - * or ::1) its source port (host order), otherwise -1. - */ -static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsghdr *mmh) -{ - const struct sockaddr_in6 *sa6 = mmh->msg_hdr.msg_name; - const struct sockaddr_in *sa4 = mmh->msg_hdr.msg_name; - - ASSERT(ref.type == EPOLL_TYPE_UDP); - - if (!ref.udp.splice) - return -1; - - if (ref.udp.v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr)) - return ntohs(sa6->sin6_port); - - if (!ref.udp.v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr)) - return ntohs(sa4->sin_port); - - return -1; -} - /** * udp_at_sidx() - Get UDP specific flow at given sidx * @sidx: Flow and side to retrieve @@ -542,6 +374,16 @@ struct udp_flow *udp_at_sidx(flow_sidx_t sidx) */ static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) { + if (uflow->s[INISIDE] >= 0) { + /* The listening socket needs to stay in epoll */ + close(uflow->s[INISIDE]); + } + + if (uflow->s[TGTSIDE] >= 0) { + /* But the flow specific one needs to be removed */ + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, uflow->s[TGTSIDE], NULL); + close(uflow->s[TGTSIDE]);Keeping numbers of closed sockets around, instead of setting them to -1 right away, makes me a bit nervous. On the other hand it's obvious that udp_flow_new() sets them to -1 anyway, so there isn't much that can go wrong.Done.+ } flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); } @@ -549,26 +391,80 @@ static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) * udp_flow_new() - Common setup for a new UDP flow * @c: Execution context * @flow: Initiated flow + * @s_ini: Initiating socket (or -1) * @now: Timestamp * * Return: UDP specific flow, if successful, NULL on failure */ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, - const struct timespec *now) + int s_ini, const struct timespec *now) { const struct flowside *ini = &flow->f.side[INISIDE]; struct udp_flow *uflow = NULL; + const struct flowside *tgt; + uint8_t tgtpif; if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) { flow_dbg(flow, "Invalid endpoint to initiate UDP flow"); goto cancel; } - if (!flow_target(c, flow, IPPROTO_UDP)) + if (!(tgt = flow_target(c, flow, IPPROTO_UDP))) goto cancel; + tgtpif = flow->f.pif[TGTSIDE]; uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp); uflow->ts = now->tv_sec; + uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1; + + if (s_ini >= 0) { + /* When using auto port-scanning the listening port could go + * away, so we need to duplicate it */For consistency: closing */ on a newline. I would also say "the socket" instead of "it", otherwise it seems to be referring to the port at a first glance.I don't believe so. My understanding is that dup() (and the rest) make a new fd referencing the same underlying file object, yes. But AIUI, close() just closes one fd - the underlying object is only closed only when all fds are gone.+ uflow->s[INISIDE] = fcntl(s_ini, F_DUPFD_CLOEXEC, 0);There's one aspect of this I don't understand: if s_ini is closed while checking for bound ports (is it? I didn't really reach the end of this series), aren't duplicates also closed? That is, the documentation of dup2(2), which should be the same for this purpose, states that the duplicate inherits "file status flags", which I would assume also includes the fact that a socket is closed. I didn't test that though.I didn't really find the time to sketch this but I guess the easiest way to check this behaviour is to have a TCP connection between a socket pair, with one socket having two descriptors, then closing one descriptor and check if the peer socket sees a closed connection (recv() returning 0 or similar). I was wondering whether it's worth to use the vforked namespaced peer trick I drafted here: https://archives.passt.top/passt-dev/20231206160808.3d312733@elisabeth/ just in case we want to use some of those test cases for actual tests, where we don't want to bind an actual TCP port on the machine we're running on. But if it adds complexity I'd say it's not worth it.If duplicates are closed, I guess an alternative solution could be to introduce some kind of reference counting for sockets... somewhere... in other words, I believe the kernel does the reference counting. I should verify this though, I'll try to come up with something new for doc/platform-requirements.For UDP, we don't need to make the buffers large enough to fit packets into them, see udp(7): All receive operations return only one packet. When the packet is smaller than the passed buffer, only that much data is returned; when it is bigger, the packet is truncated and the MSG_TRUNC flag is set. so probably 1024 bytes (1 * UIO_MAXIOV) on the stack would be enough... or maybe we can even pass 0 as size and NULL buffers? If recvmmsg() returns UIO_MAXIOV, then something weird is going on and we can abort the "connection" attempt.Done.+ if (uflow->s[INISIDE] < 0) { + flow_err(uflow, + "Couldn't duplicate listening socket: %s", + strerror(errno)); + goto cancel; + } + } + + if (pif_is_socket(tgtpif)) { + union { + flow_sidx_t sidx; + uint32_t data; + } fref = { + .sidx = FLOW_SIDX(flow, TGTSIDE), + }; + + uflow->s[TGTSIDE] = flowside_sock_l4(c, EPOLL_TYPE_UDP_REPLY, + tgtpif, tgt, fref.data); + if (uflow->s[TGTSIDE] < 0) { + flow_dbg(uflow, + "Couldn't open socket for spliced flow: %s", + strerror(errno)); + goto cancel; + } + + if (flowside_connect(c, uflow->s[TGTSIDE], tgtpif, tgt) < 0) { + flow_dbg(uflow, + "Couldn't connect flow socket: %s", + strerror(errno)); + goto cancel; + } + + /* It's possible, if unlikely, that we could receive some + * unrelated packets in between the bind() and connect() of this + * socket. For now we just discard these. We could consider + * trying to re-direct these to an appropriate handler, if weSimply "redirect"?Ah, interesting question.+ * need to. + */ + /* cppcheck-suppress nullPointer */ + while (recv(uflow->s[TGTSIDE], NULL, 0, MSG_DONTWAIT) >= 0) + ;Could a local attacker (another user) attempt to use this for denial of service?Of course, somebody could flood us anyway and we would get and handle all the events that that causes. But this case is different because we could get stuck for an unlimited amount of time without serving other sockets at all.Right.If that's a possibility, perhaps a limit for this loop (a maximum amount of recv()) tries would be a good idea. I'm not sure how we should handle the case where we exceed the threshold.We could fail to create the flow. That would limit the damage, but see below.Another one, which adds some complexity, but looks more correct to me, would be to try a single recv() call, and if we get data from it, fail creating the new flow entirely.Right. I also considered a single recvmmsg(); the difficulty with that is making suitable arrays for it, since the normal ones may be in use at this point.This removes the potential DoS of wedging passt completely, but raises the possibility of a different one. Could an attacker - not necessarily on the same host, but network-closer than the guest's intended endpoint - prevent the guest from connecting to a particular service by spamming fake reply packets here.Oops, I didn't think about that, that's also concerning. With my suggestion above, 1024 packets is all an attacker would need. But maybe there's a safe way to deal with this that's still simpler than a separate handler: using recvmmsg(), going through msg_name of the (truncated) messages we received with it, and trying to draw some conclusion about what kind of attack we're dealing with from there. I'm not sure exactly which conclusion, though.I believe it would need to anticipate the guest's source port to do so, which might be good enough.Linux does source port randomisation for UDP starting from 2.6.24, so that should be good enough, but given that we try to preserve the source port from the guest, we might make a guest that doesn't do port randomisation (albeit unlikely to be found) vulnerable to this. -- Stefano
On Wed, Jul 10, 2024 at 07:13:26PM +0200, Stefano Brivio wrote:On Wed, 10 Jul 2024 10:23:14 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote: > On Wed, Jul 10, 2024 at 12:32:33AM +0200, Stefano Brivio wrote: > > On Fri, 5 Jul 2024 12:07:18 +1000 > > David Gibson <david(a)gibson.dropbear.id.au> wrote:[snip]So.. yes, this would check whether close() on a non-last fd for a socket triggers socket closing actions, but that's much stricter than what we actually need here. I would, for example, expect shutdown() on a TCP socket to affect all dups - and I don't actually know if a close() on one dup might trigger that. But we're dealing with UDP here, so there's no "on wire" effect of a close. So all we actually need to check is: 1. Open a "listening" udp socket 2. Dup it 3. Close a dup 4. Can the remaining dup still receive datagrams? I've written a test program for this, which I'll include in the next spin.Ah, probably, yes.I don't believe so. My understanding is that dup() (and the rest) make a new fd referencing the same underlying file object, yes. But AIUI, close() just closes one fd - the underlying object is only closed only when all fds are gone.+ uflow->s[INISIDE] = fcntl(s_ini, F_DUPFD_CLOEXEC, 0);There's one aspect of this I don't understand: if s_ini is closed while checking for bound ports (is it? I didn't really reach the end of this series), aren't duplicates also closed? That is, the documentation of dup2(2), which should be the same for this purpose, states that the duplicate inherits "file status flags", which I would assume also includes the fact that a socket is closed. I didn't test that though.I didn't really find the time to sketch this but I guess the easiest way to check this behaviour is to have a TCP connection between a socket pair, with one socket having two descriptors, then closing one descriptor and check if the peer socket sees a closed connection (recv() returning 0 or similar).If duplicates are closed, I guess an alternative solution could be to introduce some kind of reference counting for sockets... somewhere... in other words, I believe the kernel does the reference counting. I should verify this though, I'll try to come up with something new for doc/platform-requirements.I was wondering whether it's worth to use the vforked namespaced peer trick I drafted here: https://archives.passt.top/passt-dev/20231206160808.3d312733@elisabeth/ just in case we want to use some of those test cases for actual tests, where we don't want to bind an actual TCP port on the machine we're running on. But if it adds complexity I'd say it's not worth it.Yeah, I tend to think we can add that sort of sandboxing when and if we need it. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, 11 Jul 2024 11:30:52 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Wed, Jul 10, 2024 at 07:13:26PM +0200, Stefano Brivio wrote:Right, that's why I was thinking of using a TCP socket. On the other hand it's a bit silly to test something slightly different to indirectly check the hypothesis.On Wed, 10 Jul 2024 10:23:14 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote: > On Wed, Jul 10, 2024 at 12:32:33AM +0200, Stefano Brivio wrote: > > On Fri, 5 Jul 2024 12:07:18 +1000 > > David Gibson <david(a)gibson.dropbear.id.au> wrote:[snip]So.. yes, this would check whether close() on a non-last fd for a socket triggers socket closing actions, but that's much stricter than what we actually need here. I would, for example, expect shutdown() on a TCP socket to affect all dups - and I don't actually know if a close() on one dup might trigger that. But we're dealing with UDP here, so there's no "on wire" effect of a close.Ah, probably, yes.> + uflow->s[INISIDE] = fcntl(s_ini, F_DUPFD_CLOEXEC, 0); There's one aspect of this I don't understand: if s_ini is closed while checking for bound ports (is it? I didn't really reach the end of this series), aren't duplicates also closed? That is, the documentation of dup2(2), which should be the same for this purpose, states that the duplicate inherits "file status flags", which I would assume also includes the fact that a socket is closed. I didn't test that though.I don't believe so. My understanding is that dup() (and the rest) make a new fd referencing the same underlying file object, yes. But AIUI, close() just closes one fd - the underlying object is only closed only when all fds are gone.I didn't really find the time to sketch this but I guess the easiest way to check this behaviour is to have a TCP connection between a socket pair, with one socket having two descriptors, then closing one descriptor and check if the peer socket sees a closed connection (recv() returning 0 or similar).If duplicates are closed, I guess an alternative solution could be to introduce some kind of reference counting for sockets... somewhere... in other words, I believe the kernel does the reference counting. I should verify this though, I'll try to come up with something new for doc/platform-requirements.So all we actually need to check is: 1. Open a "listening" udp socket 2. Dup it 3. Close a dup 4. Can the remaining dup still receive datagrams?...sounds much better, yes. -- Stefano
On Wed, Jul 10, 2024 at 07:13:26PM +0200, Stefano Brivio wrote:On Wed, 10 Jul 2024 10:23:14 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote: > On Wed, Jul 10, 2024 at 12:32:33AM +0200, Stefano Brivio wrote: > > On Fri, 5 Jul 2024 12:07:18 +1000 > > David Gibson <david(a)gibson.dropbear.id.au> wrote:[snip]We can (see doc/platform-requirements/recv-zero.c). And even if we couldn't, we could use the same 1 byte buffer for all the datagrams. We basically just need the actual msghdr array.For UDP, we don't need to make the buffers large enough to fit packets into them, see udp(7): All receive operations return only one packet. When the packet is smaller than the passed buffer, only that much data is returned; when it is bigger, the packet is truncated and the MSG_TRUNC flag is set. so probably 1024 bytes (1 * UIO_MAXIOV) on the stack would be enough... or maybe we can even pass 0 as size and NULL buffers?Ah, interesting question.+ * need to. + */ + /* cppcheck-suppress nullPointer */ + while (recv(uflow->s[TGTSIDE], NULL, 0, MSG_DONTWAIT) >= 0) + ;Could a local attacker (another user) attempt to use this for denial of service?Of course, somebody could flood us anyway and we would get and handle all the events that that causes. But this case is different because we could get stuck for an unlimited amount of time without serving other sockets at all.Right.If that's a possibility, perhaps a limit for this loop (a maximum amount of recv()) tries would be a good idea. I'm not sure how we should handle the case where we exceed the threshold.We could fail to create the flow. That would limit the damage, but see below.Another one, which adds some complexity, but looks more correct to me, would be to try a single recv() call, and if we get data from it, fail creating the new flow entirely.Right. I also considered a single recvmmsg(); the difficulty with that is making suitable arrays for it, since the normal ones may be in use at this point.If recvmmsg() returns UIO_MAXIOV, then something weird is going on and we can abort the "connection" attempt.Seems reasonable; I've coded something up.Right. It feels like we're at least not making the guest dramatically more vulnerable to something here, so I'm inclined to treat it as good enough for now. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibsonThis removes the potential DoS of wedging passt completely, but raises the possibility of a different one. Could an attacker - not necessarily on the same host, but network-closer than the guest's intended endpoint - prevent the guest from connecting to a particular service by spamming fake reply packets here.Oops, I didn't think about that, that's also concerning. With my suggestion above, 1024 packets is all an attacker would need. But maybe there's a safe way to deal with this that's still simpler than a separate handler: using recvmmsg(), going through msg_name of the (truncated) messages we received with it, and trying to draw some conclusion about what kind of attack we're dealing with from there. I'm not sure exactly which conclusion, though.I believe it would need to anticipate the guest's source port to do so, which might be good enough.Linux does source port randomisation for UDP starting from 2.6.24, so that should be good enough, but given that we try to preserve the source port from the guest, we might make a guest that doesn't do port randomisation (albeit unlikely to be found) vulnerable to this.
On Fri, 5 Jul 2024 12:07:18 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use. For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry.For some reason, after this patch (I bisected), I'm getting an EPOLLERR loop: pasta: epoll event on UDP socket 6 (events: 0x00000001) Flow 0 (NEW): FREE -> NEW Flow 0 (INI): NEW -> INI Flow 0 (INI): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => ? Flow 0 (TGT): INI -> TGT Flow 0 (TGT): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:47041 -> [127.0.0.1]:10001 Flow 0 (UDP flow): TGT -> TYPED Flow 0 (UDP flow): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:47041 -> [127.0.0.1]:10001 Flow 0 (UDP flow): Side 0 hash table insert: bucket: 97474 Flow 0 (UDP flow): TYPED -> ACTIVE Flow 0 (UDP flow): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:47041 -> [127.0.0.1]:10001 pasta: epoll event on UDP reply socket 116 (events: 0x00000008) pasta: epoll event on UDP reply socket 116 (events: 0x00000008) pasta: epoll event on UDP reply socket 116 (events: 0x00000008) [...repeated until I terminate the process] by sending one UDP datagram from the parent namespace with no "listening" process in the namespace, using the "spliced" path, something like this: echo a | nc -q1 -u localhost 10001 after running pasta with: ./pasta -u 10001 --trace -l /tmp/pasta.trace --log-size $((1 << 30)) I tried bigger/multiple datagrams, same result. Before this patch, I get something like this instead: 5.1018: pasta: epoll event on UDP socket 6 (events: 0x00000001) 5.1018: Flow 0 (NEW): FREE -> NEW 5.1018: Flow 0 (INI): NEW -> INI 5.1019: Flow 0 (INI): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => ? 5.1019: Flow 0 (TGT): INI -> TGT 5.1019: Flow 0 (TGT): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:41245 -> [127.0.0.1]:10001 5.1019: Flow 0 (UDP flow): TGT -> TYPED 5.1019: Flow 0 (UDP flow): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:41245 -> [127.0.0.1]:10001 5.1019: Flow 0 (UDP flow): Side 0 hash table insert: bucket: 111174 5.1019: Flow 0 (UDP flow): TYPED -> ACTIVE 5.1019: Flow 0 (UDP flow): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:41245 -> [127.0.0.1]:10001 I didn't really investigate, though. -- Stefano
On Fri, Jul 12, 2024 at 03:34:07PM +0200, Stefano Brivio wrote:On Fri, 5 Jul 2024 12:07:18 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Ouch. I see it too. I'll debug...When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use. For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry.For some reason, after this patch (I bisected), I'm getting an EPOLLERR loop: pasta: epoll event on UDP socket 6 (events: 0x00000001) Flow 0 (NEW): FREE -> NEW Flow 0 (INI): NEW -> INI Flow 0 (INI): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => ? Flow 0 (TGT): INI -> TGT Flow 0 (TGT): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:47041 -> [127.0.0.1]:10001 Flow 0 (UDP flow): TGT -> TYPED Flow 0 (UDP flow): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:47041 -> [127.0.0.1]:10001 Flow 0 (UDP flow): Side 0 hash table insert: bucket: 97474 Flow 0 (UDP flow): TYPED -> ACTIVE Flow 0 (UDP flow): HOST [127.0.0.1]:47041 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:47041 -> [127.0.0.1]:10001 pasta: epoll event on UDP reply socket 116 (events: 0x00000008) pasta: epoll event on UDP reply socket 116 (events: 0x00000008) pasta: epoll event on UDP reply socket 116 (events: 0x00000008) [...repeated until I terminate the process] by sending one UDP datagram from the parent namespace with no "listening" process in the namespace, using the "spliced" path, something like this: echo a | nc -q1 -u localhost 10001 after running pasta with: ./pasta -u 10001 --trace -l /tmp/pasta.trace --log-size $((1 << 30)) I tried bigger/multiple datagrams, same result.Before this patch, I get something like this instead: 5.1018: pasta: epoll event on UDP socket 6 (events: 0x00000001) 5.1018: Flow 0 (NEW): FREE -> NEW 5.1018: Flow 0 (INI): NEW -> INI 5.1019: Flow 0 (INI): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => ? 5.1019: Flow 0 (TGT): INI -> TGT 5.1019: Flow 0 (TGT): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:41245 -> [127.0.0.1]:10001 5.1019: Flow 0 (UDP flow): TGT -> TYPED 5.1019: Flow 0 (UDP flow): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:41245 -> [127.0.0.1]:10001 5.1019: Flow 0 (UDP flow): Side 0 hash table insert: bucket: 111174 5.1019: Flow 0 (UDP flow): TYPED -> ACTIVE 5.1019: Flow 0 (UDP flow): HOST [127.0.0.1]:41245 -> [0.0.0.0]:10001 => SPLICE [127.0.0.1]:41245 -> [127.0.0.1]:10001 I didn't really investigate, though.-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Now that spliced datagrams are managed via the flow table, remove UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used. With those removed, the 'ts' field in udp_splice_port is also no longer used. struct udp_splice_port now contains just a socket fd, so replace it with a plain int in udp_splice_ns[] and udp_splice_init[]. The latter are still used for tracking of automatic port forwarding. Finally, the 'splice' field of union udp_epoll_ref is no longer used so remove it as well. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 65 +++++++++++++++++------------------------------------------ udp.h | 3 +-- 2 files changed, 19 insertions(+), 49 deletions(-) diff --git a/udp.c b/udp.c index f4c696db..a4eb6d0f 100644 --- a/udp.c +++ b/udp.c @@ -136,27 +136,15 @@ struct udp_tap_port { time_t ts; }; -/** - * struct udp_splice_port - Bound socket for spliced communication - * @sock: Socket bound to index port - * @ts: Activity timestamp - */ -struct udp_splice_port { - int sock; - time_t ts; -}; - /* Port tracking, arrays indexed by packet source port (host order) */ static struct udp_tap_port udp_tap_map [IP_VERSIONS][NUM_PORTS]; /* "Spliced" sockets indexed by bound port (host order) */ -static struct udp_splice_port udp_splice_ns [IP_VERSIONS][NUM_PORTS]; -static struct udp_splice_port udp_splice_init[IP_VERSIONS][NUM_PORTS]; +static int udp_splice_ns [IP_VERSIONS][NUM_PORTS]; +static int udp_splice_init[IP_VERSIONS][NUM_PORTS]; enum udp_act_type { UDP_ACT_TAP, - UDP_ACT_SPLICE_NS, - UDP_ACT_SPLICE_INIT, UDP_ACT_TYPE_MAX, }; @@ -246,8 +234,8 @@ void udp_portmap_clear(void) for (i = 0; i < NUM_PORTS; i++) { udp_tap_map[V4][i].sock = udp_tap_map[V6][i].sock = -1; - udp_splice_ns[V4][i].sock = udp_splice_ns[V6][i].sock = -1; - udp_splice_init[V4][i].sock = udp_splice_init[V6][i].sock = -1; + udp_splice_ns[V4][i] = udp_splice_ns[V6][i] = -1; + udp_splice_init[V4][i] = udp_splice_init[V6][i] = -1; } } @@ -1050,8 +1038,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, const void *addr, const char *ifname, in_port_t port) { - union udp_epoll_ref uref = { .splice = (c->mode == MODE_PASTA), - .orig = true, .port = port }; + union udp_epoll_ref uref = { .orig = true, .port = port }; int s, r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1; if (ns) @@ -1067,12 +1054,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, ifname, port, uref.u32); udp_tap_map[V4][port].sock = s < 0 ? -1 : s; - udp_splice_init[V4][port].sock = s < 0 ? -1 : s; + udp_splice_init[V4][port] = s < 0 ? -1 : s; } else { r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP, &in4addr_loopback, ifname, port, uref.u32); - udp_splice_ns[V4][port].sock = s < 0 ? -1 : s; + udp_splice_ns[V4][port] = s < 0 ? -1 : s; } } @@ -1084,12 +1071,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, ifname, port, uref.u32); udp_tap_map[V6][port].sock = s < 0 ? -1 : s; - udp_splice_init[V6][port].sock = s < 0 ? -1 : s; + udp_splice_init[V6][port] = s < 0 ? -1 : s; } else { r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP, &in6addr_loopback, ifname, port, uref.u32); - udp_splice_ns[V6][port].sock = s < 0 ? -1 : s; + udp_splice_ns[V6][port] = s < 0 ? -1 : s; } } @@ -1130,7 +1117,6 @@ static void udp_splice_iov_init(void) static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, in_port_t port, const struct timespec *now) { - struct udp_splice_port *sp; struct udp_tap_port *tp; int *sockp = NULL; @@ -1143,20 +1129,6 @@ static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, tp->flags = 0; } - break; - case UDP_ACT_SPLICE_INIT: - sp = &udp_splice_init[v6 ? V6 : V4][port]; - - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; - - break; - case UDP_ACT_SPLICE_NS: - sp = &udp_splice_ns[v6 ? V6 : V4][port]; - - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; - break; default: return; @@ -1184,20 +1156,19 @@ static void udp_port_rebind(struct ctx *c, bool outbound) = outbound ? c->udp.fwd_out.f.map : c->udp.fwd_in.f.map; const uint8_t *rmap = outbound ? c->udp.fwd_in.f.map : c->udp.fwd_out.f.map; - struct udp_splice_port (*socks)[NUM_PORTS] - = outbound ? udp_splice_ns : udp_splice_init; + int (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init; unsigned port; for (port = 0; port < NUM_PORTS; port++) { if (!bitmap_isset(fmap, port)) { - if (socks[V4][port].sock >= 0) { - close(socks[V4][port].sock); - socks[V4][port].sock = -1; + if (socks[V4][port] >= 0) { + close(socks[V4][port]); + socks[V4][port] = -1; } - if (socks[V6][port].sock >= 0) { - close(socks[V6][port].sock); - socks[V6][port].sock = -1; + if (socks[V6][port] >= 0) { + close(socks[V6][port]); + socks[V6][port] = -1; } continue; @@ -1207,8 +1178,8 @@ static void udp_port_rebind(struct ctx *c, bool outbound) if (bitmap_isset(rmap, port)) continue; - if ((c->ifi4 && socks[V4][port].sock == -1) || - (c->ifi6 && socks[V6][port].sock == -1)) + if ((c->ifi4 && socks[V4][port] == -1) || + (c->ifi6 && socks[V6][port] == -1)) udp_sock_init(c, outbound, AF_UNSPEC, NULL, NULL, port); } } diff --git a/udp.h b/udp.h index db5e546e..310f42fd 100644 --- a/udp.h +++ b/udp.h @@ -36,8 +36,7 @@ union udp_epoll_ref { struct { in_port_t port; uint8_t pif; - bool splice:1, - orig:1, + bool orig:1, v6:1; }; uint32_t u32; -- 2.45.2
On Fri, 5 Jul 2024 12:07:19 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Now that spliced datagrams are managed via the flow table, remove UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used. With those removed, the 'ts' field in udp_splice_port is also no longer used. struct udp_splice_port now contains just a socket fd, so replace it with a plain int in udp_splice_ns[] and udp_splice_init[]. The latter are still used for tracking of automatic port forwarding. Finally, the 'splice' field of union udp_epoll_ref is no longer used so remove it as well. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 65 +++++++++++++++++------------------------------------------ udp.h | 3 +-- 2 files changed, 19 insertions(+), 49 deletions(-) diff --git a/udp.c b/udp.c index f4c696db..a4eb6d0f 100644 --- a/udp.c +++ b/udp.c @@ -136,27 +136,15 @@ struct udp_tap_port { time_t ts; }; -/** - * struct udp_splice_port - Bound socket for spliced communication - * @sock: Socket bound to index port - * @ts: Activity timestamp - */ -struct udp_splice_port { - int sock; - time_t ts; -}; - /* Port tracking, arrays indexed by packet source port (host order) */ static struct udp_tap_port udp_tap_map [IP_VERSIONS][NUM_PORTS]; /* "Spliced" sockets indexed by bound port (host order) */ -static struct udp_splice_port udp_splice_ns [IP_VERSIONS][NUM_PORTS]; -static struct udp_splice_port udp_splice_init[IP_VERSIONS][NUM_PORTS]; +static int udp_splice_ns [IP_VERSIONS][NUM_PORTS]; +static int udp_splice_init[IP_VERSIONS][NUM_PORTS]; enum udp_act_type { UDP_ACT_TAP, - UDP_ACT_SPLICE_NS, - UDP_ACT_SPLICE_INIT, UDP_ACT_TYPE_MAX, }; @@ -246,8 +234,8 @@ void udp_portmap_clear(void) for (i = 0; i < NUM_PORTS; i++) { udp_tap_map[V4][i].sock = udp_tap_map[V6][i].sock = -1; - udp_splice_ns[V4][i].sock = udp_splice_ns[V6][i].sock = -1; - udp_splice_init[V4][i].sock = udp_splice_init[V6][i].sock = -1; + udp_splice_ns[V4][i] = udp_splice_ns[V6][i] = -1; + udp_splice_init[V4][i] = udp_splice_init[V6][i] = -1; } } @@ -1050,8 +1038,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, const void *addr, const char *ifname, in_port_t port) { - union udp_epoll_ref uref = { .splice = (c->mode == MODE_PASTA), - .orig = true, .port = port }; + union udp_epoll_ref uref = { .orig = true, .port = port }; int s, r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1; if (ns) @@ -1067,12 +1054,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, ifname, port, uref.u32); udp_tap_map[V4][port].sock = s < 0 ? -1 : s; - udp_splice_init[V4][port].sock = s < 0 ? -1 : s; + udp_splice_init[V4][port] = s < 0 ? -1 : s; } else { r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP, &in4addr_loopback, ifname, port, uref.u32); - udp_splice_ns[V4][port].sock = s < 0 ? -1 : s; + udp_splice_ns[V4][port] = s < 0 ? -1 : s; } } @@ -1084,12 +1071,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, ifname, port, uref.u32); udp_tap_map[V6][port].sock = s < 0 ? -1 : s; - udp_splice_init[V6][port].sock = s < 0 ? -1 : s; + udp_splice_init[V6][port] = s < 0 ? -1 : s; } else { r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP, &in6addr_loopback, ifname, port, uref.u32); - udp_splice_ns[V6][port].sock = s < 0 ? -1 : s; + udp_splice_ns[V6][port] = s < 0 ? -1 : s; } } @@ -1130,7 +1117,6 @@ static void udp_splice_iov_init(void) static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, in_port_t port, const struct timespec *now) { - struct udp_splice_port *sp; struct udp_tap_port *tp; int *sockp = NULL; @@ -1143,20 +1129,6 @@ static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, tp->flags = 0; } - break; - case UDP_ACT_SPLICE_INIT: - sp = &udp_splice_init[v6 ? V6 : V4][port]; - - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; - - break; - case UDP_ACT_SPLICE_NS: - sp = &udp_splice_ns[v6 ? V6 : V4][port]; - - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; - break; default: return; @@ -1184,20 +1156,19 @@ static void udp_port_rebind(struct ctx *c, bool outbound) = outbound ? c->udp.fwd_out.f.map : c->udp.fwd_in.f.map; const uint8_t *rmap = outbound ? c->udp.fwd_in.f.map : c->udp.fwd_out.f.map; - struct udp_splice_port (*socks)[NUM_PORTS] - = outbound ? udp_splice_ns : udp_splice_init; + int (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init;Nit: this should be moved up now, before the declaration of 'fmap'.unsigned port; for (port = 0; port < NUM_PORTS; port++) { if (!bitmap_isset(fmap, port)) { - if (socks[V4][port].sock >= 0) { - close(socks[V4][port].sock); - socks[V4][port].sock = -1; + if (socks[V4][port] >= 0) { + close(socks[V4][port]); + socks[V4][port] = -1; } - if (socks[V6][port].sock >= 0) { - close(socks[V6][port].sock); - socks[V6][port].sock = -1; + if (socks[V6][port] >= 0) { + close(socks[V6][port]); + socks[V6][port] = -1; } continue; @@ -1207,8 +1178,8 @@ static void udp_port_rebind(struct ctx *c, bool outbound) if (bitmap_isset(rmap, port)) continue; - if ((c->ifi4 && socks[V4][port].sock == -1) || - (c->ifi6 && socks[V6][port].sock == -1)) + if ((c->ifi4 && socks[V4][port] == -1) || + (c->ifi6 && socks[V6][port] == -1)) udp_sock_init(c, outbound, AF_UNSPEC, NULL, NULL, port); } } diff --git a/udp.h b/udp.h index db5e546e..310f42fd 100644 --- a/udp.h +++ b/udp.h @@ -36,8 +36,7 @@ union udp_epoll_ref { struct { in_port_t port; uint8_t pif; - bool splice:1, - orig:1, + bool orig:1,The comment to the union should be updated, removing 'splice'. While at it, I guess you could also drop 'bound' (removed in 851723924356 ("udp: Remove the @bound field from union udp_epoll_ref"), but the comment wasn't updated). -- Stefano
On Wed, Jul 10, 2024 at 11:36:06PM +0200, Stefano Brivio wrote:On Fri, 5 Jul 2024 12:07:19 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Done.Now that spliced datagrams are managed via the flow table, remove UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used. With those removed, the 'ts' field in udp_splice_port is also no longer used. struct udp_splice_port now contains just a socket fd, so replace it with a plain int in udp_splice_ns[] and udp_splice_init[]. The latter are still used for tracking of automatic port forwarding. Finally, the 'splice' field of union udp_epoll_ref is no longer used so remove it as well. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 65 +++++++++++++++++------------------------------------------ udp.h | 3 +-- 2 files changed, 19 insertions(+), 49 deletions(-) diff --git a/udp.c b/udp.c index f4c696db..a4eb6d0f 100644 --- a/udp.c +++ b/udp.c @@ -136,27 +136,15 @@ struct udp_tap_port { time_t ts; }; -/** - * struct udp_splice_port - Bound socket for spliced communication - * @sock: Socket bound to index port - * @ts: Activity timestamp - */ -struct udp_splice_port { - int sock; - time_t ts; -}; - /* Port tracking, arrays indexed by packet source port (host order) */ static struct udp_tap_port udp_tap_map [IP_VERSIONS][NUM_PORTS]; /* "Spliced" sockets indexed by bound port (host order) */ -static struct udp_splice_port udp_splice_ns [IP_VERSIONS][NUM_PORTS]; -static struct udp_splice_port udp_splice_init[IP_VERSIONS][NUM_PORTS]; +static int udp_splice_ns [IP_VERSIONS][NUM_PORTS]; +static int udp_splice_init[IP_VERSIONS][NUM_PORTS]; enum udp_act_type { UDP_ACT_TAP, - UDP_ACT_SPLICE_NS, - UDP_ACT_SPLICE_INIT, UDP_ACT_TYPE_MAX, }; @@ -246,8 +234,8 @@ void udp_portmap_clear(void) for (i = 0; i < NUM_PORTS; i++) { udp_tap_map[V4][i].sock = udp_tap_map[V6][i].sock = -1; - udp_splice_ns[V4][i].sock = udp_splice_ns[V6][i].sock = -1; - udp_splice_init[V4][i].sock = udp_splice_init[V6][i].sock = -1; + udp_splice_ns[V4][i] = udp_splice_ns[V6][i] = -1; + udp_splice_init[V4][i] = udp_splice_init[V6][i] = -1; } } @@ -1050,8 +1038,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, const void *addr, const char *ifname, in_port_t port) { - union udp_epoll_ref uref = { .splice = (c->mode == MODE_PASTA), - .orig = true, .port = port }; + union udp_epoll_ref uref = { .orig = true, .port = port }; int s, r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1; if (ns) @@ -1067,12 +1054,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, ifname, port, uref.u32); udp_tap_map[V4][port].sock = s < 0 ? -1 : s; - udp_splice_init[V4][port].sock = s < 0 ? -1 : s; + udp_splice_init[V4][port] = s < 0 ? -1 : s; } else { r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP, &in4addr_loopback, ifname, port, uref.u32); - udp_splice_ns[V4][port].sock = s < 0 ? -1 : s; + udp_splice_ns[V4][port] = s < 0 ? -1 : s; } } @@ -1084,12 +1071,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, ifname, port, uref.u32); udp_tap_map[V6][port].sock = s < 0 ? -1 : s; - udp_splice_init[V6][port].sock = s < 0 ? -1 : s; + udp_splice_init[V6][port] = s < 0 ? -1 : s; } else { r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP, &in6addr_loopback, ifname, port, uref.u32); - udp_splice_ns[V6][port].sock = s < 0 ? -1 : s; + udp_splice_ns[V6][port] = s < 0 ? -1 : s; } } @@ -1130,7 +1117,6 @@ static void udp_splice_iov_init(void) static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, in_port_t port, const struct timespec *now) { - struct udp_splice_port *sp; struct udp_tap_port *tp; int *sockp = NULL; @@ -1143,20 +1129,6 @@ static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, tp->flags = 0; } - break; - case UDP_ACT_SPLICE_INIT: - sp = &udp_splice_init[v6 ? V6 : V4][port]; - - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; - - break; - case UDP_ACT_SPLICE_NS: - sp = &udp_splice_ns[v6 ? V6 : V4][port]; - - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; - break; default: return; @@ -1184,20 +1156,19 @@ static void udp_port_rebind(struct ctx *c, bool outbound) = outbound ? c->udp.fwd_out.f.map : c->udp.fwd_in.f.map; const uint8_t *rmap = outbound ? c->udp.fwd_in.f.map : c->udp.fwd_out.f.map; - struct udp_splice_port (*socks)[NUM_PORTS] - = outbound ? udp_splice_ns : udp_splice_init; + int (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init;Nit: this should be moved up now, before the declaration of 'fmap'.Done. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibsonunsigned port; for (port = 0; port < NUM_PORTS; port++) { if (!bitmap_isset(fmap, port)) { - if (socks[V4][port].sock >= 0) { - close(socks[V4][port].sock); - socks[V4][port].sock = -1; + if (socks[V4][port] >= 0) { + close(socks[V4][port]); + socks[V4][port] = -1; } - if (socks[V6][port].sock >= 0) { - close(socks[V6][port].sock); - socks[V6][port].sock = -1; + if (socks[V6][port] >= 0) { + close(socks[V6][port]); + socks[V6][port] = -1; } continue; @@ -1207,8 +1178,8 @@ static void udp_port_rebind(struct ctx *c, bool outbound) if (bitmap_isset(rmap, port)) continue; - if ((c->ifi4 && socks[V4][port].sock == -1) || - (c->ifi6 && socks[V6][port].sock == -1)) + if ((c->ifi4 && socks[V4][port] == -1) || + (c->ifi6 && socks[V6][port] == -1)) udp_sock_init(c, outbound, AF_UNSPEC, NULL, NULL, port); } } diff --git a/udp.h b/udp.h index db5e546e..310f42fd 100644 --- a/udp.h +++ b/udp.h @@ -36,8 +36,7 @@ union udp_epoll_ref { struct { in_port_t port; uint8_t pif; - bool splice:1, - orig:1, + bool orig:1,The comment to the union should be updated, removing 'splice'. While at it, I guess you could also drop 'bound' (removed in 851723924356 ("udp: Remove the @bound field from union udp_epoll_ref"), but the comment wasn't updated).
Currently we create flows for datagrams from socket interfaces, and use them to direct "spliced" (socket to socket) datagrams. We don't yet match datagrams from the tap interface to existing flows, nor create new flows for them. Add that functionality, matching datagrams from tap to existing flows when they exist, or creating new ones. As with spliced flows, when creating a new flow from tap to socket, we create a new connected socket to receive reply datagrams attached to that flow specifically. We extend udp_flow_sock_handler() to handle reply packets bound for tap rather than another socket. For non-obvious reasons, this caused a failure for me when running under valgrind, because valgrind invoked rt_sigreturn which is not in our seccomp filter. Since we already allow rt_signaction and others in the valgrind, it seems reasonable to add rt_sigreturn as well. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- Makefile | 2 +- udp.c | 211 +++++++++++++++++++++++++------------------------------ udp.h | 4 +- 3 files changed, 99 insertions(+), 118 deletions(-) diff --git a/Makefile b/Makefile index 92cbd5a6..bd504d23 100644 --- a/Makefile +++ b/Makefile @@ -128,7 +128,7 @@ qrap: $(QRAP_SRCS) passt.h $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(QRAP_SRCS) -o qrap $(LDFLAGS) valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \ - getpid gettid kill clock_gettime mmap \ + rt_sigreturn getpid gettid kill clock_gettime mmap \ munmap open unlink gettimeofday futex valgrind: FLAGS += -g -DVALGRIND valgrind: all diff --git a/udp.c b/udp.c index a4eb6d0f..a26ffe0c 100644 --- a/udp.c +++ b/udp.c @@ -103,6 +103,7 @@ #include <sys/uio.h> #include <time.h> #include <fcntl.h> +#include <arpa/inet.h> #include "checksum.h" #include "util.h" @@ -373,6 +374,8 @@ static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) close(uflow->s[TGTSIDE]); } flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); + if (!pif_is_socket(uflow->f.pif[TGTSIDE])) + flow_hash_remove(c, FLOW_SIDX(uflow, TGTSIDE)); } /** @@ -455,6 +458,13 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, } flow_hash_insert(c, FLOW_SIDX(uflow, INISIDE)); + + /* If the target side is a socket, it will be a reply socket that knows + * its own flowside. But if it's tap, then we need to look it up by + * hash. + */ + if (!pif_is_socket(tgtpif)) + flow_hash_insert(c, FLOW_SIDX(uflow, TGTSIDE)); FLOW_ACTIVATE(uflow); return FLOW_SIDX(uflow, TGTSIDE); @@ -817,6 +827,8 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside); struct udp_flow *uflow = udp_at_sidx(ref.flowside); const struct flowside *fromside = &uflow->f.side[ref.flowside.side]; + const struct flowside *toside = &uflow->f.side[tosidx.side]; + uint8_t topif = uflow->f.pif[tosidx.side]; bool v6 = !inany_v4(&fromside->eaddr); struct mmsghdr *mmh_recv = v6 ? udp6_mh_recv : udp4_mh_recv; int from_s = uflow->s[ref.flowside.side]; @@ -830,10 +842,64 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, flow_trace(uflow, "Received %d datagrams on reply socket", n); uflow->ts = now->tv_sec; - for (i = 0; i < n; i++) - udp_splice_prepare(mmh_recv, i); + for (i = 0; i < n; i++) { + if (pif_is_socket(topif)) + udp_splice_prepare(mmh_recv, i); + else + udp_tap_prepare(c, mmh_recv, i, toside->eport, v6, now); + } - udp_splice_send(c, 0, n, tosidx); + if (pif_is_socket(topif)) + udp_splice_send(c, 0, n, tosidx); + else + tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n); +} + +/** + * udp_flow_from_tap() - Find or create UDP flow for tap packets + * @c: Execution context + * @pif: pif on which the packet is arriving + * @af: Address family, AF_INET or AF_INET6 + * @saddr: Source address on guest side + * @daddr: Destination address guest side + * @srcport: Source port on guest side + * @dstport: Destination port on guest side + * + * Return: sidx for the destination side of the flow for this packet, or + * FLOW_SIDX_NONE if we couldn't find or create a flow. + */ +static flow_sidx_t udp_flow_from_tap(const struct ctx *c, + uint8_t pif, sa_family_t af, + const void *saddr, const void *daddr, + in_port_t srcport, in_port_t dstport, + const struct timespec *now) +{ + struct udp_flow *uflow; + union flow *flow; + flow_sidx_t sidx; + + ASSERT(pif == PIF_TAP); + + sidx = flow_lookup_af(c, IPPROTO_UDP, pif, af, saddr, daddr, + srcport, dstport); + if ((uflow = udp_at_sidx(sidx))) { + uflow->ts = now->tv_sec; + return flow_sidx_opposite(sidx); + } + + if (!(flow = flow_alloc())) { + char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN]; + + debug("Couldn't allocate flow for UDP datagram from %s %s:%hu -> %s:%hu", + pif_name(pif), + inet_ntop(af, saddr, sstr, sizeof(sstr)), srcport, + inet_ntop(af, daddr, dstr, sizeof(dstr)), dstport); + return FLOW_SIDX_NONE; + } + + flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); + + return udp_flow_new(c, flow, -1, now); } /** @@ -851,24 +917,22 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, * * #syscalls sendmmsg */ -int udp_tap_handler(struct ctx *c, uint8_t pif, +int udp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, int idx, const struct timespec *now) { + const struct flowside *toside; struct mmsghdr mm[UIO_MAXIOV]; + union sockaddr_inany to_sa; struct iovec m[UIO_MAXIOV]; - struct sockaddr_in6 s_in6; - struct sockaddr_in s_in; const struct udphdr *uh; - struct sockaddr *sa; + struct udp_flow *uflow; int i, s, count = 0; + flow_sidx_t tosidx; in_port_t src, dst; + uint8_t topif; socklen_t sl; - (void)c; - (void)saddr; - (void)pif; - uh = packet_get(p, idx, 0, sizeof(*uh), NULL); if (!uh) return 1; @@ -877,116 +941,33 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, * and destination, so we can just take those from the first message. */ src = ntohs(uh->source); - src += c->udp.fwd_in.rdelta[src]; dst = ntohs(uh->dest); - if (af == AF_INET) { - s_in = (struct sockaddr_in) { - .sin_family = AF_INET, - .sin_port = uh->dest, - .sin_addr = *(struct in_addr *)daddr, - }; + tosidx = udp_flow_from_tap(c, pif, af, saddr, daddr, src, dst, now); + if (!(uflow = udp_at_sidx(tosidx))) { + char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN]; - sa = (struct sockaddr *)&s_in; - sl = sizeof(s_in); - - if (IN4_ARE_ADDR_EQUAL(&s_in.sin_addr, &c->ip4.dns_match) && - ntohs(s_in.sin_port) == 53) { - s_in.sin_addr = c->ip4.dns_host; - udp_tap_map[V4][src].ts = now->tv_sec; - udp_tap_map[V4][src].flags |= PORT_DNS_FWD; - bitmap_set(udp_act[V4][UDP_ACT_TAP], src); - } else if (IN4_ARE_ADDR_EQUAL(&s_in.sin_addr, &c->ip4.gw) && - !c->no_map_gw) { - if (!(udp_tap_map[V4][dst].flags & PORT_LOCAL) || - (udp_tap_map[V4][dst].flags & PORT_LOOPBACK)) - s_in.sin_addr.s_addr = htonl(INADDR_LOOPBACK); - else - s_in.sin_addr = c->ip4.addr_seen; - } - - debug("UDP from tap src=%hu dst=%hu, s=%d", - src, dst, udp_tap_map[V4][src].sock); - if ((s = udp_tap_map[V4][src].sock) < 0) { - struct in_addr bind_addr = IN4ADDR_ANY_INIT; - union udp_epoll_ref uref = { - .port = src, - .pif = PIF_HOST, - }; - const char *bind_if = NULL; - - if (!IN4_IS_ADDR_LOOPBACK(&s_in.sin_addr)) - bind_if = c->ip4.ifname_out; - - if (!IN4_IS_ADDR_LOOPBACK(&s_in.sin_addr)) - bind_addr = c->ip4.addr_out; - - s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP, &bind_addr, - bind_if, src, uref.u32); - if (s < 0) - return p->count - idx; - - udp_tap_map[V4][src].sock = s; - bitmap_set(udp_act[V4][UDP_ACT_TAP], src); - } - - udp_tap_map[V4][src].ts = now->tv_sec; - } else { - s_in6 = (struct sockaddr_in6) { - .sin6_family = AF_INET6, - .sin6_port = uh->dest, - .sin6_addr = *(struct in6_addr *)daddr, - }; - const struct in6_addr *bind_addr = &in6addr_any; - - sa = (struct sockaddr *)&s_in6; - sl = sizeof(s_in6); - - if (IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.dns_match) && - ntohs(s_in6.sin6_port) == 53) { - s_in6.sin6_addr = c->ip6.dns_host; - udp_tap_map[V6][src].ts = now->tv_sec; - udp_tap_map[V6][src].flags |= PORT_DNS_FWD; - bitmap_set(udp_act[V6][UDP_ACT_TAP], src); - } else if (IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw) && - !c->no_map_gw) { - if (!(udp_tap_map[V6][dst].flags & PORT_LOCAL) || - (udp_tap_map[V6][dst].flags & PORT_LOOPBACK)) - s_in6.sin6_addr = in6addr_loopback; - else if (udp_tap_map[V6][dst].flags & PORT_GUA) - s_in6.sin6_addr = c->ip6.addr; - else - s_in6.sin6_addr = c->ip6.addr_seen; - } else if (IN6_IS_ADDR_LINKLOCAL(&s_in6.sin6_addr)) { - bind_addr = &c->ip6.addr_ll; - } - - if ((s = udp_tap_map[V6][src].sock) < 0) { - union udp_epoll_ref uref = { - .v6 = 1, - .port = src, - .pif = PIF_HOST, - }; - const char *bind_if = NULL; - - if (!IN6_IS_ADDR_LOOPBACK(&s_in6.sin6_addr)) - bind_if = c->ip6.ifname_out; + debug("Dropping datagram with no flow %s %s:%hu -> %s:%hu", + pif_name(pif), + inet_ntop(af, saddr, sstr, sizeof(sstr)), src, + inet_ntop(af, daddr, dstr, sizeof(dstr)), dst); + return 1; + } - if (!IN6_IS_ADDR_LOOPBACK(&s_in6.sin6_addr) && - !IN6_IS_ADDR_LINKLOCAL(&s_in6.sin6_addr)) - bind_addr = &c->ip6.addr_out; + topif = uflow->f.pif[tosidx.side]; + if (topif != PIF_HOST) { + uint8_t frompif = uflow->f.pif[!tosidx.side]; - s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP, bind_addr, - bind_if, src, uref.u32); - if (s < 0) - return p->count - idx; + flow_err(uflow, "No support for forwarding UDP from %s to %s", + pif_name(frompif), pif_name(topif)); + return 1; + } + toside = &uflow->f.side[tosidx.side]; - udp_tap_map[V6][src].sock = s; - bitmap_set(udp_act[V6][UDP_ACT_TAP], src); - } + s = udp_at_sidx(tosidx)->s[tosidx.side]; + ASSERT(s >= 0); - udp_tap_map[V6][src].ts = now->tv_sec; - } + pif_sockaddr(c, &to_sa, &sl, topif, &toside->eaddr, toside->eport); for (i = 0; i < (int)p->count - idx; i++) { struct udphdr *uh_send; @@ -996,7 +977,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, if (!uh_send) return p->count - idx; - mm[i].msg_hdr.msg_name = sa; + mm[i].msg_hdr.msg_name = &to_sa; mm[i].msg_hdr.msg_namelen = sl; if (len) { diff --git a/udp.h b/udp.h index 310f42fd..88417fb0 100644 --- a/udp.h +++ b/udp.h @@ -13,8 +13,8 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events, const struct timespec *now); void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events, const struct timespec *now); -int udp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, - const void *saddr, const void *daddr, +int udp_tap_handler(const struct ctx *c, uint8_t pif, + sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, int idx, const struct timespec *now); int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, const void *addr, const char *ifname, in_port_t port); -- 2.45.2
Some nits in the commit message only: On Fri, 5 Jul 2024 12:07:20 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Currently we create flows for datagrams from socket interfaces, and use them to direct "spliced" (socket to socket) datagrams. We don't yet match datagrams from the tap interface to existing flows, nor create new flows for them. Add that functionality, matching datagrams from tap to existing flows when they exist, or creating new ones. As with spliced flows, when creating a new flow from tap to socket, we create a new connected socket to receive reply datagrams attached to that flow specifically. We extend udp_flow_sock_handler() to handle reply packets bound for tap rather than another socket. For non-obvious reasons, this caused a failure for me when running under valgrind, because valgrind invoked rt_sigreturn which is not in our seccomp filter.That might be because the stack area needed by udp_reply_sock_handler() is now a bit bigger... or something like that, I suppose.Since we already allow rt_signaction and others in thert_sigactionvalgrind, it seems reasonable to add rt_sigreturn as well.valgrind target -- Stefano
On Wed, Jul 10, 2024 at 11:36:56PM +0200, Stefano Brivio wrote:Some nits in the commit message only: On Fri, 5 Jul 2024 12:07:20 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Revised accordingly. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibsonCurrently we create flows for datagrams from socket interfaces, and use them to direct "spliced" (socket to socket) datagrams. We don't yet match datagrams from the tap interface to existing flows, nor create new flows for them. Add that functionality, matching datagrams from tap to existing flows when they exist, or creating new ones. As with spliced flows, when creating a new flow from tap to socket, we create a new connected socket to receive reply datagrams attached to that flow specifically. We extend udp_flow_sock_handler() to handle reply packets bound for tap rather than another socket. For non-obvious reasons, this caused a failure for me when running under valgrind, because valgrind invoked rt_sigreturn which is not in our seccomp filter.That might be because the stack area needed by udp_reply_sock_handler() is now a bit bigger... or something like that, I suppose.Since we already allow rt_signaction and others in thert_sigactionvalgrind, it seems reasonable to add rt_sigreturn as well.valgrind target
This replaces the last piece of existing UDP port tracking with the common flow table. Specifically use the flow table to direct datagrams from host sockets to the guest tap interface. Since this now requires a flow for every datagram, we add some logging if we encounter any datagrams for which we can't find or create a flow. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow_table.h | 14 ++++ udp.c | 188 +++++++++++++++------------------------------------ 2 files changed, 67 insertions(+), 135 deletions(-) diff --git a/flow_table.h b/flow_table.h index 1faac4a7..da9483b3 100644 --- a/flow_table.h +++ b/flow_table.h @@ -106,6 +106,20 @@ static inline uint8_t pif_at_sidx(flow_sidx_t sidx) return flow->f.pif[sidx.side]; } +/** flowside_at_sidx - Retrieve a specific flowside + * @sidx: Flow & side index + * + * Return: Flowside for the flow & side given by @sidx + */ +static inline const struct flowside *flowside_at_sidx(flow_sidx_t sidx) +{ + const union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return PIF_NONE; + return &flow->f.side[sidx.side]; +} + /** flow_sidx_t - Index of one side of a flow from common structure * @f: Common flow fields pointer * @side: Which side to refer to (0 or 1) diff --git a/udp.c b/udp.c index a26ffe0c..7d63faf6 100644 --- a/udp.c +++ b/udp.c @@ -60,26 +60,6 @@ * flow will come to the reply socket in preference to a listening socket. The * sample program contrib/udp-reuseaddr/reuseaddr-priority.c documents and tests * that assumption. - * - * Port tracking - * ============= - * - * For UDP, a reduced version of port-based connection tracking is implemented - * with two purposes: - * - binding ephemeral ports when they're used as source port by the guest, so - * that replies on those ports can be forwarded back to the guest, with a - * fixed timeout for this binding - * - packets received from the local host get their source changed to a local - * address (gateway address) so that they can be forwarded to the guest, and - * packets sent as replies by the guest need their destination address to - * be changed back to the address of the local host. This is dynamic to allow - * connections from the gateway as well, and uses the same fixed 180s timeout - * - * Sockets for bound ports are created at initialisation time, one set for IPv4 - * and one for IPv6. - * - * Packets are forwarded back and forth, by prepending and stripping UDP headers - * in the obvious way, with no port translation. */ #include <sched.h> @@ -498,7 +478,6 @@ static flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref, ASSERT(ref.type == EPOLL_TYPE_UDP); - /* FIXME: Match reply packets to their flow as well */ if (!ref.udp.orig) return FLOW_SIDX_NONE; @@ -558,160 +537,87 @@ static void udp_splice_send(const struct ctx *c, size_t start, size_t n, /** * udp_update_hdr4() - Update headers for one IPv4 datagram - * @c: Execution context * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr) - * @s_in: Source socket address, filled in by recvmmsg() * @bp: Pointer to udp_payload_t to update - * @dstport: Destination port number + * @fside: Flowside with relevant addresses * @dlen: Length of UDP payload - * @now: Current timestamp * * Return: size of IPv4 payload (UDP header + data) */ -static size_t udp_update_hdr4(const struct ctx *c, - struct iphdr *ip4h, const struct sockaddr_in *s_in, - struct udp_payload_t *bp, - in_port_t dstport, size_t dlen, - const struct timespec *now) +static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp, + const struct flowside *fside, size_t dlen) { - const struct in_addr dst = c->ip4.addr_seen; - in_port_t srcport = ntohs(s_in->sin_port); + const struct in_addr *src = inany_v4(&fside->faddr); + const struct in_addr *dst = inany_v4(&fside->eaddr); size_t l4len = dlen + sizeof(bp->uh); size_t l3len = l4len + sizeof(*ip4h); - struct in_addr src = s_in->sin_addr; - - if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match) && - IN4_ARE_ADDR_EQUAL(&src, &c->ip4.dns_host) && srcport == 53 && - (udp_tap_map[V4][dstport].flags & PORT_DNS_FWD)) { - src = c->ip4.dns_match; - } else if (IN4_IS_ADDR_LOOPBACK(&src) || - IN4_ARE_ADDR_EQUAL(&src, &c->ip4.addr_seen)) { - udp_tap_map[V4][srcport].ts = now->tv_sec; - udp_tap_map[V4][srcport].flags |= PORT_LOCAL; - if (IN4_IS_ADDR_LOOPBACK(&src)) - udp_tap_map[V4][srcport].flags |= PORT_LOOPBACK; - else - udp_tap_map[V4][srcport].flags &= ~PORT_LOOPBACK; - - bitmap_set(udp_act[V4][UDP_ACT_TAP], srcport); - - src = c->ip4.gw; - } + ASSERT(src && dst); ip4h->tot_len = htons(l3len); - ip4h->daddr = dst.s_addr; - ip4h->saddr = src.s_addr; - ip4h->check = csum_ip4_header(l3len, IPPROTO_UDP, src, dst); + ip4h->daddr = dst->s_addr; + ip4h->saddr = src->s_addr; + ip4h->check = csum_ip4_header(l3len, IPPROTO_UDP, *src, *dst); - bp->uh.source = s_in->sin_port; - bp->uh.dest = htons(dstport); + bp->uh.source = htons(fside->fport); + bp->uh.dest = htons(fside->eport); bp->uh.len = htons(l4len); - csum_udp4(&bp->uh, src, dst, bp->data, dlen); + csum_udp4(&bp->uh, *src, *dst, bp->data, dlen); return l4len; } /** * udp_update_hdr6() - Update headers for one IPv6 datagram - * @c: Execution context * @ip6h: Pre-filled IPv6 header (except for payload_len and addresses) - * @s_in: Source socket address, filled in by recvmmsg() * @bp: Pointer to udp_payload_t to update - * @dstport: Destination port number + * @fside: Flowside with relevant addresses * @dlen: Length of UDP payload - * @now: Current timestamp * * Return: size of IPv6 payload (UDP header + data) */ -static size_t udp_update_hdr6(const struct ctx *c, - struct ipv6hdr *ip6h, struct sockaddr_in6 *s_in6, - struct udp_payload_t *bp, - in_port_t dstport, size_t dlen, - const struct timespec *now) +static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp, + const struct flowside *fside, size_t dlen) { - const struct in6_addr *src = &s_in6->sin6_addr; - const struct in6_addr *dst = &c->ip6.addr_seen; - in_port_t srcport = ntohs(s_in6->sin6_port); uint16_t l4len = dlen + sizeof(bp->uh); - if (IN6_IS_ADDR_LINKLOCAL(src)) { - dst = &c->ip6.addr_ll_seen; - } else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match) && - IN6_ARE_ADDR_EQUAL(src, &c->ip6.dns_host) && - srcport == 53 && - (udp_tap_map[V4][dstport].flags & PORT_DNS_FWD)) { - src = &c->ip6.dns_match; - } else if (IN6_IS_ADDR_LOOPBACK(src) || - IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr_seen) || - IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr)) { - udp_tap_map[V6][srcport].ts = now->tv_sec; - udp_tap_map[V6][srcport].flags |= PORT_LOCAL; - - if (IN6_IS_ADDR_LOOPBACK(src)) - udp_tap_map[V6][srcport].flags |= PORT_LOOPBACK; - else - udp_tap_map[V6][srcport].flags &= ~PORT_LOOPBACK; - - if (IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr)) - udp_tap_map[V6][srcport].flags |= PORT_GUA; - else - udp_tap_map[V6][srcport].flags &= ~PORT_GUA; - - bitmap_set(udp_act[V6][UDP_ACT_TAP], srcport); - - dst = &c->ip6.addr_ll_seen; - - if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw)) - src = &c->ip6.gw; - else - src = &c->ip6.addr_ll; - - } - ip6h->payload_len = htons(l4len); - ip6h->daddr = *dst; - ip6h->saddr = *src; + ip6h->daddr = fside->eaddr.a6; + ip6h->saddr = fside->faddr.a6; ip6h->version = 6; ip6h->nexthdr = IPPROTO_UDP; ip6h->hop_limit = 255; - bp->uh.source = s_in6->sin6_port; - bp->uh.dest = htons(dstport); + bp->uh.source = htons(fside->fport); + bp->uh.dest = htons(fside->eport); bp->uh.len = ip6h->payload_len; - csum_udp6(&bp->uh, src, dst, bp->data, dlen); + csum_udp6(&bp->uh, &fside->faddr.a6, &fside->eaddr.a6, bp->data, dlen); return l4len; } /** * udp_tap_prepare() - Convert one datagram into a tap frame - * @c: Execution context * @mmh: Receiving mmsghdr array * @idx: Index of the datagram to prepare - * @dstport: Destination port - * @v6: Prepare for IPv6? - * @now: Current timestamp + * @fside: flowside for destination side */ -static void udp_tap_prepare(const struct ctx *c, const struct mmsghdr *mmh, - unsigned idx, in_port_t dstport, bool v6, - const struct timespec *now) +static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx, + const struct flowside *fside) { struct iovec (*tap_iov)[UDP_NUM_IOVS] = &udp_l2_iov[idx]; struct udp_payload_t *bp = &udp_payload[idx]; struct udp_meta_t *bm = &udp_meta[idx]; size_t l4len; - if (v6) { - l4len = udp_update_hdr6(c, &bm->ip6h, &bm->s_in.sa6, bp, - dstport, mmh[idx].msg_len, now); + if (!inany_v4(&fside->eaddr) || !inany_v4(&fside->faddr)) { + l4len = udp_update_hdr6(&bm->ip6h, bp, fside, mmh[idx].msg_len); tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) + sizeof(udp6_eth_hdr)); (*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp6_eth_hdr); (*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip6h); } else { - l4len = udp_update_hdr4(c, &bm->ip4h, &bm->s_in.sa4, bp, - dstport, mmh[idx].msg_len, now); + l4len = udp_update_hdr4(&bm->ip4h, bp, fside, mmh[idx].msg_len); tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) + sizeof(udp4_eth_hdr)); (*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp4_eth_hdr); @@ -766,17 +672,11 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve const struct timespec *now) { struct mmsghdr *mmh_recv = ref.udp.v6 ? udp6_mh_recv : udp4_mh_recv; - in_port_t dstport = ref.udp.port; int n, i; if ((n = udp_sock_recv(c, ref.fd, events, mmh_recv)) <= 0) return; - if (ref.udp.pif == PIF_SPLICE) - dstport += c->udp.fwd_out.f.delta[dstport]; - else if (ref.udp.pif == PIF_HOST) - dstport += c->udp.fwd_in.f.delta[dstport]; - /* We divide datagrams into batches based on how we need to send them, * determined by udp_meta[i].tosidx. To avoid either two passes through * the array, or recalculating tosidx for a single entry, we have to @@ -791,9 +691,9 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve do { if (pif_is_socket(batchpif)) udp_splice_prepare(mmh_recv, i); - else - udp_tap_prepare(c, mmh_recv, i, dstport, - ref.udp.v6, now); + else if (batchpif == PIF_TAP) + udp_tap_prepare(mmh_recv, i, + flowside_at_sidx(batchsidx)); if (++i >= n) break; @@ -803,12 +703,24 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve now); } while (flow_sidx_eq(udp_meta[i].tosidx, batchsidx)); - if (pif_is_socket(batchpif)) + if (pif_is_socket(batchpif)) { udp_splice_send(c, batchstart, i - batchstart, batchsidx); - else + } else if (batchpif == PIF_TAP) { tap_send_frames(c, &udp_l2_iov[batchstart][0], UDP_NUM_IOVS, i - batchstart); + } else if (flow_sidx_valid(batchsidx)) { + flow_sidx_t fromsidx = flow_sidx_opposite(batchsidx); + struct udp_flow *uflow = udp_at_sidx(batchsidx); + + flow_err(uflow, + "No support for forwarding UDP from %s to %s", + pif_name(pif_at_sidx(fromsidx)), + pif_name(batchpif)); + } else { + debug("Discarding %d datagrams without flow", + i - batchstart); + } } } @@ -845,14 +757,20 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, for (i = 0; i < n; i++) { if (pif_is_socket(topif)) udp_splice_prepare(mmh_recv, i); - else - udp_tap_prepare(c, mmh_recv, i, toside->eport, v6, now); + else if (topif == PIF_TAP) + udp_tap_prepare(mmh_recv, i, toside); } - if (pif_is_socket(topif)) + if (pif_is_socket(topif)) { udp_splice_send(c, 0, n, tosidx); - else + } else if (topif == PIF_TAP) { tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n); + } else { + uint8_t frompif = uflow->f.pif[ref.flowside.side]; + + flow_err(uflow, "No support for forwarding UDP from %s to %s", + pif_name(frompif), pif_name(topif)); + } } /** -- 2.45.2
Two nits only: On Fri, 5 Jul 2024 12:07:21 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:This replaces the last piece of existing UDP port tracking with the common flow table. Specifically use the flow table to direct datagrams from host sockets to the guest tap interface. Since this now requires a flow for every datagram, we add some logging if we encounter any datagrams for which we can't find or create a flow. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow_table.h | 14 ++++ udp.c | 188 +++++++++++++++------------------------------------ 2 files changed, 67 insertions(+), 135 deletions(-) diff --git a/flow_table.h b/flow_table.h index 1faac4a7..da9483b3 100644 --- a/flow_table.h +++ b/flow_table.h @@ -106,6 +106,20 @@ static inline uint8_t pif_at_sidx(flow_sidx_t sidx) return flow->f.pif[sidx.side]; } +/** flowside_at_sidx - Retrieve a specific flowsideflowside_at_sidx()+ * @sidx: Flow & side index + * + * Return: Flowside for the flow & side given by @sidx + */ +static inline const struct flowside *flowside_at_sidx(flow_sidx_t sidx) +{ + const union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return PIF_NONE;Usual extra newline.+ return &flow->f.side[sidx.side]; +}I finished reviewing all the other patches, no further comments, everything else looks good to me. -- Stefano
On Wed, Jul 10, 2024 at 11:37:36PM +0200, Stefano Brivio wrote:Two nits only: On Fri, 5 Jul 2024 12:07:21 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Done.This replaces the last piece of existing UDP port tracking with the common flow table. Specifically use the flow table to direct datagrams from host sockets to the guest tap interface. Since this now requires a flow for every datagram, we add some logging if we encounter any datagrams for which we can't find or create a flow. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow_table.h | 14 ++++ udp.c | 188 +++++++++++++++------------------------------------ 2 files changed, 67 insertions(+), 135 deletions(-) diff --git a/flow_table.h b/flow_table.h index 1faac4a7..da9483b3 100644 --- a/flow_table.h +++ b/flow_table.h @@ -106,6 +106,20 @@ static inline uint8_t pif_at_sidx(flow_sidx_t sidx) return flow->f.pif[sidx.side]; } +/** flowside_at_sidx - Retrieve a specific flowsideflowside_at_sidx()Done.+ * @sidx: Flow & side index + * + * Return: Flowside for the flow & side given by @sidx + */ +static inline const struct flowside *flowside_at_sidx(flow_sidx_t sidx) +{ + const union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return PIF_NONE;Usual extra newline.-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson+ return &flow->f.side[sidx.side]; +}I finished reviewing all the other patches, no further comments, everything else looks good to me.
Now that UDP datagrams are all directed via the flow table, we no longer use the udp_tap_map[ or udp_act[] arrays. Remove them and connected code. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 92 +---------------------------------------------------------- 1 file changed, 1 insertion(+), 91 deletions(-) diff --git a/udp.c b/udp.c index 7d63faf6..e1cd9921 100644 --- a/udp.c +++ b/udp.c @@ -100,38 +100,10 @@ #define UDP_CONN_TIMEOUT 180 /* s, timeout for ephemeral or local bind */ #define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */ -/** - * struct udp_tap_port - Port tracking based on tap-facing source port - * @sock: Socket bound to source port used as index - * @flags: Flags for recent activity type seen from/to port - * @ts: Activity timestamp from tap, used for socket aging - */ -struct udp_tap_port { - int sock; - uint8_t flags; -#define PORT_LOCAL BIT(0) /* Port was contacted from local address */ -#define PORT_LOOPBACK BIT(1) /* Port was contacted from loopback address */ -#define PORT_GUA BIT(2) /* Port was contacted from global unicast */ -#define PORT_DNS_FWD BIT(3) /* Port used as source for DNS remapped query */ - - time_t ts; -}; - -/* Port tracking, arrays indexed by packet source port (host order) */ -static struct udp_tap_port udp_tap_map [IP_VERSIONS][NUM_PORTS]; - /* "Spliced" sockets indexed by bound port (host order) */ static int udp_splice_ns [IP_VERSIONS][NUM_PORTS]; static int udp_splice_init[IP_VERSIONS][NUM_PORTS]; -enum udp_act_type { - UDP_ACT_TAP, - UDP_ACT_TYPE_MAX, -}; - -/* Activity-based aging for bindings */ -static uint8_t udp_act[IP_VERSIONS][UDP_ACT_TYPE_MAX][DIV_ROUND_UP(NUM_PORTS, 8)]; - /* Static buffers */ /** @@ -214,7 +186,6 @@ void udp_portmap_clear(void) unsigned i; for (i = 0; i < NUM_PORTS; i++) { - udp_tap_map[V4][i].sock = udp_tap_map[V6][i].sock = -1; udp_splice_ns[V4][i] = udp_splice_ns[V6][i] = -1; udp_splice_init[V4][i] = udp_splice_init[V6][i] = -1; } @@ -952,7 +923,6 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP, addr, ifname, port, uref.u32); - udp_tap_map[V4][port].sock = s < 0 ? -1 : s; udp_splice_init[V4][port] = s < 0 ? -1 : s; } else { r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP, @@ -969,7 +939,6 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP, addr, ifname, port, uref.u32); - udp_tap_map[V6][port].sock = s < 0 ? -1 : s; udp_splice_init[V6][port] = s < 0 ? -1 : s; } else { r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP, @@ -1005,43 +974,6 @@ static void udp_splice_iov_init(void) } } -/** - * udp_timer_one() - Handler for timed events on one port - * @c: Execution context - * @v6: Set for IPv6 connections - * @type: Socket type - * @port: Port number, host order - * @now: Current timestamp - */ -static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, - in_port_t port, const struct timespec *now) -{ - struct udp_tap_port *tp; - int *sockp = NULL; - - switch (type) { - case UDP_ACT_TAP: - tp = &udp_tap_map[v6 ? V6 : V4][port]; - - if (now->tv_sec - tp->ts > UDP_CONN_TIMEOUT) { - sockp = &tp->sock; - tp->flags = 0; - } - - break; - default: - return; - } - - if (sockp && *sockp >= 0) { - int s = *sockp; - *sockp = -1; - epoll_ctl(c->epollfd, EPOLL_CTL_DEL, s, NULL); - close(s); - bitmap_clear(udp_act[v6 ? V6 : V4][type], port); - } -} - /** * udp_port_rebind() - Rebind ports to match forward maps * @c: Execution context @@ -1126,9 +1058,7 @@ bool udp_flow_timer(const struct ctx *c, const struct udp_flow *uflow, */ void udp_timer(struct ctx *c, const struct timespec *now) { - int n, t, v6 = 0; - unsigned int i; - long *word, tmp; + (void)now; if (c->mode == MODE_PASTA) { if (c->udp.fwd_out.f.mode == FWD_AUTO) { @@ -1143,26 +1073,6 @@ void udp_timer(struct ctx *c, const struct timespec *now) udp_port_rebind(c, false); } } - - if (!c->ifi4) - v6 = 1; -v6: - for (t = 0; t < UDP_ACT_TYPE_MAX; t++) { - word = (long *)udp_act[v6 ? V6 : V4][t]; - for (i = 0; i < ARRAY_SIZE(udp_act[0][0]); - i += sizeof(long), word++) { - tmp = *word; - while ((n = ffsl(tmp))) { - tmp &= ~(1UL << (n - 1)); - udp_timer_one(c, v6, t, i * 8 + n - 1, now); - } - } - } - - if (!v6 && c->ifi6) { - v6 = 1; - goto v6; - } } /** -- 2.45.2
In addition to the struct fwd_ports used by both UDP and TCP to track port forwarding, UDP also included an 'rdelta' field, which contained the reverse mapping of the main port map. This was used so that we could properly direct reply packets to a forwarded packet where we change the destination port. This has now been taken over by the flow table: reply packets will match the flow of the originating packet, and that gives the correct ports on the originating side. So, eliminate the rdelta field, and with it struct udp_fwd_ports, which now has no additional information over struct fwd_ports. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- conf.c | 14 +++++++------- fwd.c | 24 ++++++++++++------------ udp.c | 42 ++++++------------------------------------ udp.h | 14 ++------------ 4 files changed, 27 insertions(+), 67 deletions(-) diff --git a/conf.c b/conf.c index 3c38cebc..2a8d4dc9 100644 --- a/conf.c +++ b/conf.c @@ -1243,7 +1243,7 @@ void conf(struct ctx *c, int argc, char **argv) } c->tcp.fwd_in.mode = c->tcp.fwd_out.mode = FWD_UNSET; - c->udp.fwd_in.f.mode = c->udp.fwd_out.f.mode = FWD_UNSET; + c->udp.fwd_in.mode = c->udp.fwd_out.mode = FWD_UNSET; do { name = getopt_long(argc, argv, optstring, options, NULL); @@ -1659,7 +1659,7 @@ void conf(struct ctx *c, int argc, char **argv) if (name == 't') conf_ports(c, name, optarg, &c->tcp.fwd_in); else if (name == 'u') - conf_ports(c, name, optarg, &c->udp.fwd_in.f); + conf_ports(c, name, optarg, &c->udp.fwd_in); } while (name != -1); if (c->mode == MODE_PASTA) @@ -1694,7 +1694,7 @@ void conf(struct ctx *c, int argc, char **argv) if (name == 'T') conf_ports(c, name, optarg, &c->tcp.fwd_out); else if (name == 'U') - conf_ports(c, name, optarg, &c->udp.fwd_out.f); + conf_ports(c, name, optarg, &c->udp.fwd_out); } while (name != -1); if (!c->ifi4) @@ -1721,10 +1721,10 @@ void conf(struct ctx *c, int argc, char **argv) c->tcp.fwd_in.mode = fwd_default; if (!c->tcp.fwd_out.mode) c->tcp.fwd_out.mode = fwd_default; - if (!c->udp.fwd_in.f.mode) - c->udp.fwd_in.f.mode = fwd_default; - if (!c->udp.fwd_out.f.mode) - c->udp.fwd_out.f.mode = fwd_default; + if (!c->udp.fwd_in.mode) + c->udp.fwd_in.mode = fwd_default; + if (!c->udp.fwd_out.mode) + c->udp.fwd_out.mode = fwd_default; fwd_scan_ports_init(c); diff --git a/fwd.c b/fwd.c index 4377de44..36780bce 100644 --- a/fwd.c +++ b/fwd.c @@ -129,18 +129,18 @@ void fwd_scan_ports_init(struct ctx *c) c->tcp.fwd_in.scan4 = c->tcp.fwd_in.scan6 = -1; c->tcp.fwd_out.scan4 = c->tcp.fwd_out.scan6 = -1; - c->udp.fwd_in.f.scan4 = c->udp.fwd_in.f.scan6 = -1; - c->udp.fwd_out.f.scan4 = c->udp.fwd_out.f.scan6 = -1; + c->udp.fwd_in.scan4 = c->udp.fwd_in.scan6 = -1; + c->udp.fwd_out.scan4 = c->udp.fwd_out.scan6 = -1; if (c->tcp.fwd_in.mode == FWD_AUTO) { c->tcp.fwd_in.scan4 = open_in_ns(c, "/proc/net/tcp", flags); c->tcp.fwd_in.scan6 = open_in_ns(c, "/proc/net/tcp6", flags); fwd_scan_ports_tcp(&c->tcp.fwd_in, &c->tcp.fwd_out); } - if (c->udp.fwd_in.f.mode == FWD_AUTO) { - c->udp.fwd_in.f.scan4 = open_in_ns(c, "/proc/net/udp", flags); - c->udp.fwd_in.f.scan6 = open_in_ns(c, "/proc/net/udp6", flags); - fwd_scan_ports_udp(&c->udp.fwd_in.f, &c->udp.fwd_out.f, + if (c->udp.fwd_in.mode == FWD_AUTO) { + c->udp.fwd_in.scan4 = open_in_ns(c, "/proc/net/udp", flags); + c->udp.fwd_in.scan6 = open_in_ns(c, "/proc/net/udp6", flags); + fwd_scan_ports_udp(&c->udp.fwd_in, &c->udp.fwd_out, &c->tcp.fwd_in, &c->tcp.fwd_out); } if (c->tcp.fwd_out.mode == FWD_AUTO) { @@ -148,10 +148,10 @@ void fwd_scan_ports_init(struct ctx *c) c->tcp.fwd_out.scan6 = open("/proc/net/tcp6", flags); fwd_scan_ports_tcp(&c->tcp.fwd_out, &c->tcp.fwd_in); } - if (c->udp.fwd_out.f.mode == FWD_AUTO) { - c->udp.fwd_out.f.scan4 = open("/proc/net/udp", flags); - c->udp.fwd_out.f.scan6 = open("/proc/net/udp6", flags); - fwd_scan_ports_udp(&c->udp.fwd_out.f, &c->udp.fwd_in.f, + if (c->udp.fwd_out.mode == FWD_AUTO) { + c->udp.fwd_out.scan4 = open("/proc/net/udp", flags); + c->udp.fwd_out.scan6 = open("/proc/net/udp6", flags); + fwd_scan_ports_udp(&c->udp.fwd_out, &c->udp.fwd_in, &c->tcp.fwd_out, &c->tcp.fwd_in); } } @@ -240,7 +240,7 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, if (proto == IPPROTO_TCP) tgt->eport += c->tcp.fwd_out.delta[tgt->eport]; else if (proto == IPPROTO_UDP) - tgt->eport += c->udp.fwd_out.f.delta[tgt->eport]; + tgt->eport += c->udp.fwd_out.delta[tgt->eport]; /* Let the kernel pick a host side source port */ tgt->fport = 0; @@ -269,7 +269,7 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, if (proto == IPPROTO_TCP) tgt->eport += c->tcp.fwd_in.delta[tgt->eport]; else if (proto == IPPROTO_UDP) - tgt->eport += c->udp.fwd_in.f.delta[tgt->eport]; + tgt->eport += c->udp.fwd_in.delta[tgt->eport]; if (c->mode == MODE_PASTA && inany_is_loopback(&ini->eaddr) && (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { diff --git a/udp.c b/udp.c index e1cd9921..8affcf12 100644 --- a/udp.c +++ b/udp.c @@ -191,33 +191,6 @@ void udp_portmap_clear(void) } } -/** - * udp_invert_portmap() - Compute reverse port translations for return packets - * @fwd: Port forwarding configuration to compute reverse map for - */ -static void udp_invert_portmap(struct udp_fwd_ports *fwd) -{ - unsigned int i; - - static_assert(ARRAY_SIZE(fwd->f.delta) == ARRAY_SIZE(fwd->rdelta), - "Forward and reverse delta arrays must have same size"); - for (i = 0; i < ARRAY_SIZE(fwd->f.delta); i++) { - in_port_t delta = fwd->f.delta[i]; - - if (delta) { - /* Keep rport calculation separate from its usage: we - * need to perform the sum in in_port_t width (that is, - * modulo 65536), but C promotion rules would sum the - * two terms as 'int', if we just open-coded the array - * index as 'i + delta'. - */ - in_port_t rport = i + delta; - - fwd->rdelta[rport] = NUM_PORTS - delta; - } - } -} - /** * udp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses * @eth_d: Ethernet destination address, NULL if unchanged @@ -984,9 +957,9 @@ static void udp_splice_iov_init(void) static void udp_port_rebind(struct ctx *c, bool outbound) { const uint8_t *fmap - = outbound ? c->udp.fwd_out.f.map : c->udp.fwd_in.f.map; + = outbound ? c->udp.fwd_out.map : c->udp.fwd_in.map; const uint8_t *rmap - = outbound ? c->udp.fwd_in.f.map : c->udp.fwd_out.f.map; + = outbound ? c->udp.fwd_in.map : c->udp.fwd_out.map; int (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init; unsigned port; @@ -1061,14 +1034,14 @@ void udp_timer(struct ctx *c, const struct timespec *now) (void)now; if (c->mode == MODE_PASTA) { - if (c->udp.fwd_out.f.mode == FWD_AUTO) { - fwd_scan_ports_udp(&c->udp.fwd_out.f, &c->udp.fwd_in.f, + if (c->udp.fwd_out.mode == FWD_AUTO) { + fwd_scan_ports_udp(&c->udp.fwd_out, &c->udp.fwd_in, &c->tcp.fwd_out, &c->tcp.fwd_in); NS_CALL(udp_port_rebind_outbound, c); } - if (c->udp.fwd_in.f.mode == FWD_AUTO) { - fwd_scan_ports_udp(&c->udp.fwd_in.f, &c->udp.fwd_out.f, + if (c->udp.fwd_in.mode == FWD_AUTO) { + fwd_scan_ports_udp(&c->udp.fwd_in, &c->udp.fwd_out, &c->tcp.fwd_in, &c->tcp.fwd_out); udp_port_rebind(c, false); } @@ -1085,9 +1058,6 @@ int udp_init(struct ctx *c) { udp_iov_init(c); - udp_invert_portmap(&c->udp.fwd_in); - udp_invert_portmap(&c->udp.fwd_out); - if (c->mode == MODE_PASTA) { udp_splice_iov_init(); NS_CALL(udp_port_rebind_outbound, c); diff --git a/udp.h b/udp.h index 88417fb0..ba803d51 100644 --- a/udp.h +++ b/udp.h @@ -43,16 +43,6 @@ union udp_epoll_ref { }; -/** - * udp_fwd_ports - UDP specific port forwarding configuration - * @f: Generic forwarding configuration - * @rdelta: Reversed delta map to translate source ports on return packets - */ -struct udp_fwd_ports { - struct fwd_ports f; - in_port_t rdelta[NUM_PORTS]; -}; - /** * struct udp_ctx - Execution context for UDP * @fwd_in: Port forwarding configuration for inbound packets @@ -60,8 +50,8 @@ struct udp_fwd_ports { * @timer_run: Timestamp of most recent timer run */ struct udp_ctx { - struct udp_fwd_ports fwd_in; - struct udp_fwd_ports fwd_out; + struct fwd_ports fwd_in; + struct fwd_ports fwd_out; struct timespec timer_run; }; -- 2.45.2
EPOLL_TYPE_UDP is now only used for "listening" sockets; long lived sockets which can initiate new flows. Rename to EPOLL_TYPE_UDP_LISTEN and associated functions to match. Along with that, remove the .orig field from union udp_listen_epoll_ref, since it is now always true. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- epoll_type.h | 4 ++-- passt.c | 6 +++--- passt.h | 2 +- udp.c | 25 +++++++++++-------------- udp.h | 12 +++++------- util.c | 2 +- 6 files changed, 23 insertions(+), 28 deletions(-) diff --git a/epoll_type.h b/epoll_type.h index 7a752ed1..0ad1efa0 100644 --- a/epoll_type.h +++ b/epoll_type.h @@ -20,8 +20,8 @@ enum epoll_type { EPOLL_TYPE_TCP_LISTEN, /* timerfds used for TCP timers */ EPOLL_TYPE_TCP_TIMER, - /* UDP sockets */ - EPOLL_TYPE_UDP, + /* UDP "listening" sockets */ + EPOLL_TYPE_UDP_LISTEN, /* UDP socket for replies on a specific flow */ EPOLL_TYPE_UDP_REPLY, /* ICMP/ICMPv6 ping sockets */ diff --git a/passt.c b/passt.c index f9405bee..eed74ec9 100644 --- a/passt.c +++ b/passt.c @@ -66,7 +66,7 @@ char *epoll_type_str[] = { [EPOLL_TYPE_TCP_SPLICE] = "connected spliced TCP socket", [EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket", [EPOLL_TYPE_TCP_TIMER] = "TCP timer", - [EPOLL_TYPE_UDP] = "UDP socket", + [EPOLL_TYPE_UDP_LISTEN] = "listening UDP socket", [EPOLL_TYPE_UDP_REPLY] = "UDP reply socket", [EPOLL_TYPE_PING] = "ICMP/ICMPv6 ping socket", [EPOLL_TYPE_NSQUIT_INOTIFY] = "namespace inotify watch", @@ -347,8 +347,8 @@ loop: case EPOLL_TYPE_TCP_TIMER: tcp_timer_handler(&c, ref); break; - case EPOLL_TYPE_UDP: - udp_buf_sock_handler(&c, ref, eventmask, &now); + case EPOLL_TYPE_UDP_LISTEN: + udp_listen_sock_handler(&c, ref, eventmask, &now); break; case EPOLL_TYPE_UDP_REPLY: udp_reply_sock_handler(&c, ref, eventmask, &now); diff --git a/passt.h b/passt.h index 0d76b498..4cc2b6f0 100644 --- a/passt.h +++ b/passt.h @@ -48,7 +48,7 @@ union epoll_ref { uint32_t flow; flow_sidx_t flowside; union tcp_listen_epoll_ref tcp_listen; - union udp_epoll_ref udp; + union udp_listen_epoll_ref udp; uint32_t data; int nsdir_fd; }; diff --git a/udp.c b/udp.c index 8affcf12..58cf6b93 100644 --- a/udp.c +++ b/udp.c @@ -420,10 +420,7 @@ static flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref, union flow *flow; flow_sidx_t sidx; - ASSERT(ref.type == EPOLL_TYPE_UDP); - - if (!ref.udp.orig) - return FLOW_SIDX_NONE; + ASSERT(ref.type == EPOLL_TYPE_UDP_LISTEN); sidx = flow_lookup_sa(c, IPPROTO_UDP, ref.udp.pif, &meta->s_in, ref.udp.port); if ((uflow = udp_at_sidx(sidx))) { @@ -604,7 +601,7 @@ int udp_sock_recv(const struct ctx *c, int s, uint32_t events, } /** - * udp_buf_sock_handler() - Handle new data from socket + * udp_listen_sock_handler() - Handle new data from socket * @c: Execution context * @ref: epoll reference * @events: epoll events bitmap @@ -612,8 +609,8 @@ int udp_sock_recv(const struct ctx *c, int s, uint32_t events, * * #syscalls recvmmsg */ -void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events, - const struct timespec *now) +void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events, const struct timespec *now) { struct mmsghdr *mmh_recv = ref.udp.v6 ? udp6_mh_recv : udp4_mh_recv; int n, i; @@ -881,7 +878,7 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, const void *addr, const char *ifname, in_port_t port) { - union udp_epoll_ref uref = { .orig = true, .port = port }; + union udp_listen_epoll_ref uref = { .port = port }; int s, r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1; if (ns) @@ -893,12 +890,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, uref.v6 = 0; if (!ns) { - r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP, addr, - ifname, port, uref.u32); + r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP_LISTEN, + addr, ifname, port, uref.u32); udp_splice_init[V4][port] = s < 0 ? -1 : s; } else { - r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP, + r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP_LISTEN, &in4addr_loopback, ifname, port, uref.u32); udp_splice_ns[V4][port] = s < 0 ? -1 : s; @@ -909,12 +906,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, uref.v6 = 1; if (!ns) { - r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP, addr, - ifname, port, uref.u32); + r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP_LISTEN, + addr, ifname, port, uref.u32); udp_splice_init[V6][port] = s < 0 ? -1 : s; } else { - r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP, + r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP_LISTEN, &in6addr_loopback, ifname, port, uref.u32); udp_splice_ns[V6][port] = s < 0 ? -1 : s; diff --git a/udp.h b/udp.h index ba803d51..18a86892 100644 --- a/udp.h +++ b/udp.h @@ -9,8 +9,8 @@ #define UDP_TIMER_INTERVAL 1000 /* ms */ void udp_portmap_clear(void); -void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, - uint32_t events, const struct timespec *now); +void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events, const struct timespec *now); void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events, const struct timespec *now); int udp_tap_handler(const struct ctx *c, uint8_t pif, @@ -23,21 +23,19 @@ void udp_timer(struct ctx *c, const struct timespec *now); void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s); /** - * union udp_epoll_ref - epoll reference portion for TCP connections + * union udp_listen_epoll_ref - epoll reference for "listening" UDP sockets * @port: Source port for connected sockets, bound port otherwise * @pif: pif for this socket * @bound: Set if this file descriptor is a bound socket * @splice: Set if descriptor packets to be "spliced" - * @orig: Set if a spliced socket which can originate "connections" * @v6: Set for IPv6 sockets or connections * @u32: Opaque u32 value of reference */ -union udp_epoll_ref { +union udp_listen_epoll_ref { struct { in_port_t port; uint8_t pif; - bool orig:1, - v6:1; + bool v6:1; }; uint32_t u32; }; diff --git a/util.c b/util.c index 7c57ab15..82b0028b 100644 --- a/util.c +++ b/util.c @@ -60,7 +60,7 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type, proto = IPPROTO_TCP; socktype = SOCK_STREAM | SOCK_NONBLOCK; break; - case EPOLL_TYPE_UDP: + case EPOLL_TYPE_UDP_LISTEN: case EPOLL_TYPE_UDP_REPLY: proto = IPPROTO_UDP; socktype = SOCK_DGRAM | SOCK_NONBLOCK; -- 2.45.2