This is the eighth draft of an implementation of more general
"connection" tracking, as described at:
    https://pad.passt.top/p/NewForwardingModel

This series changes the TCP connection table and hash table into a more
general flow table that can track other protocols as well. Each flow
uniformly keeps track of all the relevant addresses and ports, which
will allow for more robust control of NAT and port forwarding.

ICMP and UDP are converted to use the new flow table.

This is based on the recent series of UDP flow table preliminaries.

Caveats:
 * We roughly double the size of a connection/flow entry
 * We don't yet record the local address of flows initiated from a
   socket, even in cases where it's bound to a specific address.

Changes since v7:
 * Rebase
 * Fix unintended regression in forwarding logic (we weren't applying
   map_gw logic to DNS packets, if they didn't hit explicit DNS
   forwarding rules).
 * Remove return value from pif_sockaddr(), it turned out not to be
   very useful.
 * More robust discarding of datagrams received between bind() and
   connect() on UDP reply sockets.
 * Avoid the name 'fside' for variables, which was confusing in some
   contexts
 * Assorted minor changes based on feedback.
Changes since v6:
 * Complete redesign of the UDP flow handling
 * Rebased (handling the change to bind() probing for local addresses
   was surprisingly fiddly)
 * Replace sockaddr_from_inany() with pif_sockaddr() which can
   correctly handle scope_id for different interfaces, and returns
   whether the address is non-trivial for convenience
 * Preserve specific loopback addresses in forwarding logic

Changes since v5:
 * flowside_from_af() is now static
 * Small fixes to state verification
 * Pass protocol specific types into deferred/timer callbacks
 * No longer require complete forwarding address info for the hash
   table (we won't have it for UDP)
 * Fix bugs with logging of flow addresses
 * Make sure to initialise the sin_zero field in sockaddr_from_inany()
 * Added a patch to better type the parameters to flow type specific
   callbacks
 * Terminology change "forwarded side" to "target side"
 * Assorted wording and style tweaks based on Stefano's review
 * Fold introduction of struct flowside and populating the initiating
   side together
 * Manage outbound addresses via the flow table as well
 * Support for UDP
 * Correct type of 'b' in flowside_lookup() (was a signed int)

Changes since v4:
 * flowside_from_af() no longer fills in unspecified addresses when
   passed NULL
 * Split and rename flow hash lookup function
 * Clarified flow state transitions, and enforced where practical
 * Made side 0 always the initiating side of a flow, rather than
   letting the protocol specific code decide
 * Separated pifs from flowside addresses to allow better structure
   packing

Changes since v3:
 * Complex rebase on top of the many things that have happened
   upstream since v2.
 * Assorted other changes.
 * Replace TAPFSIDE() and SOCKFSIDE() macros with local variables.
Changes since v2:
 * Cosmetic fixes based on review
 * Extra doc comments for enum flow_type
 * Rename flowside to flowaddrs which turns out to make more sense in
   light of future changes
 * Fix bug where the socket flowaddrs for tap initiated connections
   wasn't initialised to match the socket address we were using in the
   case of map-gw NAT
 * New flowaddrs_from_sock() helper used in most cases which is cleaner
   and should avoid bugs like the above
 * Using newer centralised workarounds for clang-tidy issue 58992
 * Remove duplicate definition of FLOW_MAX as maximum flow type and
   maximum number of tracked flows
 * Rebased on newer versions of preliminary work (ICMP, flow based
   dispatch and allocation, bind/address cleanups)
 * Unified hash table as well as base flow table
 * Integrated ICMP

Changes since v1:
 * Terminology changes
   - "Endpoint" address/port instead of "correspondent" address/port
   - "flowside" instead of "demiflow"
 * Actually move the connection table to a new flow table structure in
   new files
 * Significant rearrangement of earlier patches on top of that new
   table, to reduce churn

David Gibson (27):
  flow: Common address information for initiating side
  flow: Common address information for target side
  tcp, flow: Remove redundant information, repack connection structures
  tcp: Obtain guest address from flowside
  tcp: Manage outbound address via flow table
  tcp: Simplify endpoint validation using flowside information
  tcp_splice: Eliminate SPLICE_V6 flag
  tcp, flow: Replace TCP specific hash function with general flow hash
  flow, tcp: Generalise TCP hash table to general flow hash table
  tcp: Re-use flow hash for initial sequence number generation
  icmp: Remove redundant id field from flow table entry
  icmp: Obtain destination addresses from the flowsides
  icmp: Look up ping flows using flow hash
  icmp: Eliminate icmp_id_map
  flow: Helper to create sockets based on flowside
  icmp: Manage outbound socket address via flow table
  flow, tcp: Flow based NAT and port forwarding for TCP
  flow, icmp: Use general flow forwarding rules for ICMP
  fwd: Update flow forwarding logic for UDP
  udp: Create flows for datagrams from originating sockets
  udp: Handle "spliced" datagrams with per-flow sockets
  udp: Remove obsolete splice tracking
  udp: Find or create flows for datagrams from tap interface
  udp: Direct datagrams from host to guest via flow table
  udp: Remove obsolete socket tracking
  udp: Remove rdelta port forwarding maps
  udp: Rename UDP listening sockets

 Makefile       |    4 +-
 conf.c         |   14 +-
 epoll_type.h   |    6 +-
 flow.c         |  483 +++++++++++++++++++++-
 flow.h         |   47 +++
 flow_table.h   |   45 +-
 fwd.c          |  187 ++++++++-
 fwd.h          |    9 +
 icmp.c         |  105 ++---
 icmp_flow.h    |    2 -
 inany.h        |    2 -
 passt.c        |   10 +-
 passt.h        |    5 +-
 pif.c          |   40 ++
 pif.h          |   17 +
 tap.c          |   11 -
 tap.h          |    1 -
 tcp.c          |  522 ++++++-----------------
 tcp_buf.c      |    6 +-
 tcp_conn.h     |   47 +--
 tcp_internal.h |   10 +-
 tcp_splice.c   |   98 +----
 tcp_splice.h   |    5 +-
 udp.c          | 1079 ++++++++++++++++++++----------------------------
 udp.h          |   35 +-
 udp_flow.h     |   27 ++
 util.c         |    9 +-
 util.h         |    3 +
 28 files changed, 1563 insertions(+), 1266 deletions(-)
 create mode 100644 udp_flow.h

-- 
2.45.2
Handling of each protocol needs some degree of tracking of the
addresses and ports at the end of each connection or flow. Sometimes
that's explicit (as in the guest visible addresses for TCP
connections), sometimes implicit (the bound and connected addresses of
sockets).

To allow more consistent handling across protocols we want to
uniformly track the address and port at each end of the connection.
Furthermore, because we allow port remapping, and we sometimes need to
apply NAT, the addresses and ports can be different as seen by the
guest/namespace and as seen by the host.

Introduce 'struct flowside' to keep track of address and port
information related to one side of a flow. Store two of these in the
common fields of a flow to track that information for both sides.

For now we only populate the initiating side, requiring that
information be completed when a flow enters INI. Later patches will
populate the target side.

For now this leaves some information redundantly recorded in both
generic and type specific fields. We'll fix that in later patches.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++--- flow.h | 16 +++++++++ flow_table.h | 8 ++++- icmp.c | 9 +++-- passt.h | 3 ++ tcp.c | 6 ++-- 6 files changed, 127 insertions(+), 11 deletions(-) diff --git a/flow.c b/flow.c index d05aa495..223d0599 100644 --- a/flow.c +++ b/flow.c @@ -108,6 +108,31 @@ static const union flow *flow_new_entry; /* = NULL */ /* Last time the flow timers ran */ static struct timespec flow_timer_run; +/** flowside_from_af() - Initialise flowside from addresses + * @side: flowside to initialise + * @af: Address family (AF_INET or AF_INET6) + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + */ +static void flowside_from_af(struct flowside *side, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport) +{ + if (faddr) + inany_from_af(&side->faddr, af, faddr); + else + side->faddr = inany_any6; + side->fport = fport; + + if (eaddr) + inany_from_af(&side->eaddr, af, eaddr); + else + side->eaddr = inany_any6; + side->eport = eport; +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority @@ -140,6 +165,8 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) 
*/ static void flow_set_state(struct flow_common *f, enum flow_state state) { + char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + const struct flowside *ini = &f->side[INISIDE]; uint8_t oldstate = f->state; ASSERT(state < FLOW_NUM_STATES); @@ -150,18 +177,28 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f)); if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s => %s", pif_name(f->pif[INISIDE]), - pif_name(f->pif[TGTSIDE])); + flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport, + pif_name(f->pif[TGTSIDE])); else if (MAX(state, oldstate) >= FLOW_STATE_INI) - flow_log_(f, LOG_DEBUG, "%s => ?", pif_name(f->pif[INISIDE])); + flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport); } /** - * flow_initiate() - Move flow to INI, setting INISIDE details + * flow_initiate_() - Move flow to INI, setting pif[INISIDE] * @flow: Flow to change state * @pif: pif of the initiating side */ -void flow_initiate(union flow *flow, uint8_t pif) +static void flow_initiate_(union flow *flow, uint8_t pif) { struct flow_common *f = &flow->f; @@ -174,6 +211,55 @@ void flow_initiate(union flow *flow, uint8_t pif) flow_set_state(f, FLOW_STATE_INI); } +/** + * flow_initiate_af() - Move flow to INI, setting INISIDE details + * @flow: Flow to change state + * @pif: pif of the initiating side + * @af: Address family of @eaddr and @faddr + * @saddr: Source address (pointer to in_addr or in6_addr) + * @sport: Endpoint port + * @daddr: Destination address (pointer to in_addr or in6_addr) + * @dport: Destination port + * + * Return: pointer to the initiating flowside information + */ +const struct flowside *flow_initiate_af(union flow 
*flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport) +{ + struct flowside *ini = &flow->f.side[INISIDE]; + + flowside_from_af(ini, af, saddr, sport, daddr, dport); + flow_initiate_(flow, pif); + return ini; +} + +/** + * flow_initiate_sa() - Move flow to INI, setting INISIDE details + * @flow: Flow to change state + * @pif: pif of the initiating side + * @ssa: Source socket address + * @dport: Destination port + * + * Return: pointer to the initiating flowside information + */ +const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, + const union sockaddr_inany *ssa, + in_port_t dport) +{ + struct flowside *ini = &flow->f.side[INISIDE]; + + inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa); + if (inany_v4(&ini->eaddr)) + ini->faddr = inany_any4; + else + ini->faddr = inany_any6; + ini->fport = dport; + flow_initiate_(flow, pif); + return ini; +} + /** * flow_target() - Move flow to TGT, setting TGTSIDE details * @flow: Flow to change state diff --git a/flow.h b/flow.h index b5189043..8c6ba602 100644 --- a/flow.h +++ b/flow.h @@ -135,11 +135,26 @@ extern const uint8_t flow_proto[]; #define INISIDE 0 /* Initiating side index */ #define TGTSIDE 1 /* Target side index */ +/** + * struct flowside - Address information for one side of a flow + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport; + in_port_t eport; +}; + /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry * @type: Type of packet flow * @pif[]: Interface for each side of the flow + * @side[]: Information for each side of the flow */ struct flow_common { #ifdef __GNUC__ @@ -154,6 +169,7 @@ struct flow_common { "Not enough bits for type field"); #endif uint8_t 
pif[SIDES]; + struct flowside side[SIDES]; }; #define FLOW_INDEX_BITS 17 /* 128k - 1 */ diff --git a/flow_table.h b/flow_table.h index 4f5d041c..586d8579 100644 --- a/flow_table.h +++ b/flow_table.h @@ -127,7 +127,13 @@ static inline flow_sidx_t flow_sidx(const struct flow_common *f, union flow *flow_alloc(void); void flow_alloc_cancel(union flow *flow); -void flow_initiate(union flow *flow, uint8_t pif); +const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport); +const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, + const union sockaddr_inany *ssa, + in_port_t dport); void flow_target(union flow *flow, uint8_t pif); union flow *flow_set_type(union flow *flow, enum flow_type type); diff --git a/icmp.c b/icmp.c index 7cf31e60..9b95be23 100644 --- a/icmp.c +++ b/icmp.c @@ -162,12 +162,15 @@ static void icmp_ping_close(const struct ctx *c, * @id_sock: Pointer to ping flow entry slot in icmp_id_map[] to update * @af: Address family, AF_INET or AF_INET6 * @id: ICMP id for the new socket + * @saddr: Source address + * @daddr: Destination address * * Return: Newly opened ping flow, or NULL on failure */ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, struct icmp_ping_flow **id_sock, - sa_family_t af, uint16_t id) + sa_family_t af, uint16_t id, + const void *saddr, const void *daddr) { uint8_t flowtype = af == AF_INET ? 
FLOW_PING4 : FLOW_PING6; union epoll_ref ref = { .type = EPOLL_TYPE_PING }; @@ -179,7 +182,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, if (!flow) return NULL; - flow_initiate(flow, PIF_TAP); + flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); flow_target(flow, PIF_HOST); pingf = FLOW_SET_TYPE(flow, flowtype, ping); @@ -285,7 +288,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, } if (!(pingf = *id_sock)) - if (!(pingf = icmp_ping_new(c, id_sock, af, id))) + if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) return 1; pingf->ts = now->tv_sec; diff --git a/passt.h b/passt.h index 867e77b7..0d76b498 100644 --- a/passt.h +++ b/passt.h @@ -17,6 +17,9 @@ union epoll_ref; #include "pif.h" #include "packet.h" +#include "siphash.h" +#include "ip.h" +#include "inany.h" #include "flow.h" #include "icmp.h" #include "fwd.h" diff --git a/tcp.c b/tcp.c index c5431f18..286a4171 100644 --- a/tcp.c +++ b/tcp.c @@ -1666,7 +1666,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(flow = flow_alloc())) return; - flow_initiate(flow, PIF_TAP); + flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); flow_target(flow, PIF_HOST); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); @@ -2351,7 +2351,9 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) goto cancel; - flow_initiate(flow, ref.tcp_listen.pif); + /* FIXME: When listening port has a specific bound address, record that + * as the forwarding address */ + flow_initiate_sa(flow, ref.tcp_listen.pif, &sa, ref.tcp_listen.port); if (sa.sa_family == AF_INET) { const struct in_addr *addr = &sa.sa4.sin_addr; -- 2.45.2
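For readers following along without the tree at hand, here is a minimal
standalone sketch of what populating one side of a flow amounts to. It
is not the code from the patch: struct sketch_addr is a hypothetical
stand-in for union inany_addr, assuming every address is held as an
in6_addr with IPv4 stored in IPv4-mapped form, and all sketch_* names
are made up for illustration. As in flowside_from_af() above, a NULL
address leaves that half of the side unspecified.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical stand-in for union inany_addr: one in6_addr, with IPv4
 * addresses kept in IPv4-mapped form (::ffff:a.b.c.d) */
struct sketch_addr {
	struct in6_addr a6;
};

/* Same four fields as struct flowside in the patch above */
struct sketch_flowside {
	struct sketch_addr faddr;	/* Forwarding (local) address */
	struct sketch_addr eaddr;	/* Endpoint (remote) address */
	in_port_t fport;		/* Forwarding port */
	in_port_t eport;		/* Endpoint port */
};

/* Convert an in_addr or in6_addr (selected by @af) to the unified form;
 * NULL @src yields the unspecified address (::) */
static void sketch_addr_from_af(struct sketch_addr *dst, sa_family_t af,
				const void *src)
{
	memset(dst, 0, sizeof(*dst));

	if (!src)
		return;

	if (af == AF_INET) {
		dst->a6.s6_addr[10] = 0xff;
		dst->a6.s6_addr[11] = 0xff;
		memcpy(&dst->a6.s6_addr[12], src, sizeof(struct in_addr));
	} else {
		memcpy(&dst->a6, src, sizeof(dst->a6));
	}
}

/* Fill in one side of a flow from raw addresses and ports */
static void sketch_flowside_from_af(struct sketch_flowside *side,
				    sa_family_t af,
				    const void *eaddr, in_port_t eport,
				    const void *faddr, in_port_t fport)
{
	sketch_addr_from_af(&side->faddr, af, faddr);
	side->fport = fport;
	sketch_addr_from_af(&side->eaddr, af, eaddr);
	side->eport = eport;
}
```

For a connection initiated from the tap side, the guest's source address
and port would be passed as the endpoint pair and the packet's
destination as the forwarding pair, which is the argument order
flow_initiate_af() uses.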
Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect(). Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 38 ++++++++++++++++++------ flow_table.h | 5 +++- icmp.c | 3 +- inany.h | 1 - pif.c | 40 +++++++++++++++++++++++++ pif.h | 17 +++++++++++ tcp.c | 82 ++++++++++++++++++++++++++++------------------------ tcp_splice.c | 45 +++++++++++----------------- 8 files changed, 153 insertions(+), 78 deletions(-) diff --git a/flow.c b/flow.c index 223d0599..71c44d2f 100644 --- a/flow.c +++ b/flow.c @@ -165,8 +165,10 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ static void flow_set_state(struct flow_common *f, enum flow_state state) { - char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN]; + char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN]; const struct flowside *ini = &f->side[INISIDE]; + const struct flowside *tgt = &f->side[TGTSIDE]; uint8_t oldstate = f->state; ASSERT(state < FLOW_NUM_STATES); @@ -177,19 +179,24 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f)); if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + flow_log_(f, LOG_DEBUG, + "%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport, - pif_name(f->pif[TGTSIDE])); + pif_name(f->pif[TGTSIDE]), 
+ inany_ntop(&tgt->faddr, fstr1, sizeof(fstr1)), + tgt->fport, + inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)), + tgt->eport); else if (MAX(state, oldstate) >= FLOW_STATE_INI) flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport); } @@ -261,21 +268,34 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, } /** - * flow_target() - Move flow to TGT, setting TGTSIDE details + * flow_target_af() - Move flow to TGT, setting TGTSIDE details * @flow: Flow to change state * @pif: pif of the target side + * @af: Address family for @eaddr and @faddr + * @saddr: Source address (pointer to in_addr or in6_addr) + * @sport: Endpoint port + * @daddr: Destination address (pointer to in_addr or in6_addr) + * @dport: Destination port + * + * Return: pointer to the target flowside information */ -void flow_target(union flow *flow, uint8_t pif) +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport) { struct flow_common *f = &flow->f; + struct flowside *tgt = &f->side[TGTSIDE]; ASSERT(pif != PIF_NONE); ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); ASSERT(f->type == FLOW_TYPE_NONE); ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); + flowside_from_af(tgt, af, daddr, dport, saddr, sport); f->pif[TGTSIDE] = pif; flow_set_state(f, FLOW_STATE_TGT); + return tgt; } /** diff --git a/flow_table.h b/flow_table.h index 586d8579..aabdbb75 100644 --- a/flow_table.h +++ b/flow_table.h @@ -134,7 +134,10 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, const union sockaddr_inany *ssa, in_port_t dport); 
-void flow_target(union flow *flow, uint8_t pif); +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport); union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/icmp.c b/icmp.c index 9b95be23..ebfa6272 100644 --- a/icmp.c +++ b/icmp.c @@ -183,7 +183,8 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, return NULL; flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); - flow_target(flow, PIF_HOST); + /* FIXME: Record outbound source address when known */ + flow_target_af(flow, PIF_HOST, af, NULL, 0, daddr, 0); pingf = FLOW_SET_TYPE(flow, flowtype, ping); pingf->seq = -1; diff --git a/inany.h b/inany.h index 47b66fa9..8eaf5335 100644 --- a/inany.h +++ b/inany.h @@ -187,7 +187,6 @@ static inline bool inany_is_unspecified(const union inany_addr *a) * * Return: true if @a is in fe80::/10 (IPv6 link local unicast) */ -/* cppcheck-suppress unusedFunction */ static inline bool inany_is_linklocal6(const union inany_addr *a) { return IN6_IS_ADDR_LINKLOCAL(&a->a6); diff --git a/pif.c b/pif.c index ebf01cc8..a099e31b 100644 --- a/pif.c +++ b/pif.c @@ -7,9 +7,14 @@ #include <stdint.h> #include <assert.h> +#include <netinet/in.h> #include "util.h" #include "pif.h" +#include "siphash.h" +#include "ip.h" +#include "inany.h" +#include "passt.h" const char *pif_type_str[] = { [PIF_NONE] = "<none>", @@ -19,3 +24,38 @@ const char *pif_type_str[] = { }; static_assert(ARRAY_SIZE(pif_type_str) == PIF_NUM_TYPES, "pif_type_str[] doesn't match enum pif_type"); + + +/** pif_sockaddr() - Construct a socket address suitable for an interface + * @c: Execution context + * @sa: Pointer to sockaddr to fill in + * @sl: Updated to relevant length of initialised @sa + * @pif: Interface to create the socket address + * @addr: IPv[46] address + * @port: Port (host byte 
order) + */ +void pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl, + uint8_t pif, const union inany_addr *addr, in_port_t port) +{ + const struct in_addr *v4 = inany_v4(addr); + + ASSERT(pif_is_socket(pif)); + + if (v4) { + sa->sa_family = AF_INET; + sa->sa4.sin_addr = *v4; + sa->sa4.sin_port = htons(port); + memset(&sa->sa4.sin_zero, 0, sizeof(sa->sa4.sin_zero)); + *sl = sizeof(sa->sa4); + } else { + sa->sa_family = AF_INET6; + sa->sa6.sin6_addr = addr->a6; + sa->sa6.sin6_port = htons(port); + if (pif == PIF_HOST && IN6_IS_ADDR_LINKLOCAL(&addr->a6)) + sa->sa6.sin6_scope_id = c->ifi6; + else + sa->sa6.sin6_scope_id = 0; + sa->sa6.sin6_flowinfo = 0; + *sl = sizeof(sa->sa6); + } +} diff --git a/pif.h b/pif.h index ca85b349..8777bb5b 100644 --- a/pif.h +++ b/pif.h @@ -7,6 +7,9 @@ #ifndef PIF_H #define PIF_H +union inany_addr; +union sockaddr_inany; + /** * enum pif_type - Type of passt/pasta interface ("pif") * @@ -43,4 +46,18 @@ static inline const char *pif_name(uint8_t pif) return pif_type(pif); } +/** + * pif_is_socket() - Is interface implemented via L4 sockets? 
+ * @pif: pif to check + * + * Return: true if @pif is an L4 socket based interface, otherwise false + */ +static inline bool pif_is_socket(uint8_t pif) +{ + return pif == PIF_HOST || pif == PIF_SPLICE; +} + +void pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl, + uint8_t pif, const union inany_addr *addr, in_port_t port); + #endif /* PIF_H */ diff --git a/tcp.c b/tcp.c index 286a4171..914a0746 100644 --- a/tcp.c +++ b/tcp.c @@ -1647,18 +1647,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(dstport), - .sin_addr = *(struct in_addr *)daddr, - }; - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(dstport), - .sin6_addr = *(struct in6_addr *)daddr, - }; - const struct sockaddr *sa; + const struct flowside *ini, *tgt; struct tcp_tap_conn *conn; + union inany_addr dstaddr; /* FIXME: Avoid bulky temporary */ + union sockaddr_inany sa; union flow *flow; int s = -1, mss; socklen_t sl; @@ -1666,9 +1658,22 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(flow = flow_alloc())) return; - flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); + ini = flow_initiate_af(flow, PIF_TAP, + af, saddr, srcport, daddr, dstport); + + dstaddr = ini->faddr; + if (!c->no_map_gw) { + if (inany_equals4(&dstaddr, &c->ip4.gw)) + dstaddr = inany_loopback4; + else if (inany_equals6(&dstaddr, &c->ip6.gw)) + dstaddr = inany_loopback6; + + } - flow_target(flow, PIF_HOST); + /* FIXME: Record outbound source address when known */ + tgt = flow_target_af(flow, PIF_HOST, AF_INET6, + NULL, 0, /* Kernel decides source address */ + &dstaddr, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); if (af == AF_INET) { @@ -1687,9 +1692,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, dstport); goto cancel; } - - sa = (struct sockaddr
*)&addr4; - sl = sizeof(addr4); } else if (af == AF_INET6) { if (IN6_IS_ADDR_UNSPECIFIED(saddr) || IN6_IS_ADDR_MULTICAST(saddr) || srcport == 0 || @@ -1704,9 +1706,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, dstport); goto cancel; } - - sa = (struct sockaddr *)&addr6; - sl = sizeof(addr6); } else { ASSERT(0); } @@ -1714,12 +1713,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if ((s = tcp_conn_sock(c, af)) < 0) goto cancel; - if (!c->no_map_gw) { - if (af == AF_INET && IN4_ARE_ADDR_EQUAL(daddr, &c->ip4.gw)) - addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK); - if (af == AF_INET6 && IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw)) - addr6.sin6_addr = in6addr_loopback; - } + pif_sockaddr(c, &sa, &sl, PIF_HOST, &tgt->eaddr, tgt->eport); /* Use bind() to check if the target address is local (EADDRINUSE or * similar) and already bound, and set the LOCAL flag in that case. @@ -1731,7 +1725,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, * * So, if bind() succeeds, close the socket, get a new one, and proceed. 
*/ - if (bind(s, sa, sl)) { + if (bind(s, &sa.sa, sl)) { if (errno != EADDRNOTAVAIL && errno != EACCES) conn_flag(c, conn, LOCAL); } else { @@ -1741,7 +1735,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, goto cancel; } - if (af == AF_INET6 && IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr)) { + if (inany_is_linklocal6(&tgt->eaddr)) { struct sockaddr_in6 addr6_ll = { .sin6_family = AF_INET6, .sin6_addr = c->ip6.addr_ll, @@ -1749,6 +1743,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, }; if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) goto cancel; + } else if (!inany_is_loopback(&tgt->eaddr)) { + tcp_bind_outbound(c, s, af); } conn->sock = s; @@ -1784,12 +1780,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, tcp_hash_insert(c, conn); - if ((af == AF_INET && !IN4_IS_ADDR_LOOPBACK(&addr4.sin_addr)) || - (af == AF_INET6 && !IN6_IS_ADDR_LOOPBACK(&addr6.sin6_addr) && - !IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr))) - tcp_bind_outbound(c, s, af); - - if (connect(s, sa, sl)) { + if (connect(s, &sa.sa, sl)) { if (errno != EINPROGRESS) { tcp_rst(c, conn); goto cancel; @@ -2297,9 +2288,25 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, const union sockaddr_inany *sa, const struct timespec *now) { + union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */ struct tcp_tap_conn *conn; + in_port_t srcport; + + inany_from_sockaddr(&saddr, &srcport, sa); + tcp_snat_inbound(c, &saddr); - flow_target(flow, PIF_TAP); + if (inany_v4(&saddr)) { + daddr = inany_from_v4(c->ip4.addr_seen); + } else { + if (inany_is_linklocal6(&saddr)) + daddr.a6 = c->ip6.addr_ll_seen; + else + daddr.a6 = c->ip6.addr_seen; + } + dstport += c->tcp.fwd_in.delta[dstport]; + + flow_target_af(flow, PIF_TAP, AF_INET6, + &saddr, srcport, &daddr, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); conn->sock = s; @@ -2307,10 +2314,9 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = 
conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = dstport + c->tcp.fwd_in.delta[dstport]; - - tcp_snat_inbound(c, &conn->faddr); + conn->faddr = saddr; + conn->fport = srcport; + conn->eport = dstport; tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_splice.c b/tcp_splice.c index ffddc853..608e7f27 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -340,31 +340,20 @@ static int tcp_splice_connect_finish(const struct ctx *c, * tcp_splice_connect() - Create and connect socket for new spliced connection * @c: Execution context * @conn: Connection pointer - * @af: Address family - * @pif: pif on which to create socket - * @port: Destination port, host order * * Return: 0 for connect() succeeded or in progress, negative value on error */ -static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, - sa_family_t af, uint8_t pif, in_port_t port) +static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(port), - .sin6_addr = IN6ADDR_LOOPBACK_INIT, - }; - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(port), - .sin_addr = IN4ADDR_LOOPBACK_INIT, - }; - const struct sockaddr *sa; + const struct flowside *tgt = &conn->f.side[TGTSIDE]; + sa_family_t af = inany_v4(&tgt->eaddr) ? 
AF_INET : AF_INET6; + uint8_t tgtpif = conn->f.pif[TGTSIDE]; + union sockaddr_inany sa; socklen_t sl; - if (pif == PIF_HOST) + if (tgtpif == PIF_HOST) conn->s[1] = tcp_conn_sock(c, af); - else if (pif == PIF_SPLICE) + else if (tgtpif == PIF_SPLICE) conn->s[1] = tcp_conn_sock_ns(c, af); else ASSERT(0); @@ -378,15 +367,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, conn->s[1]); } - if (CONN_V6(conn)) { - sa = (struct sockaddr *)&addr6; - sl = sizeof(addr6); - } else { - sa = (struct sockaddr *)&addr4; - sl = sizeof(addr4); - } + pif_sockaddr(c, &sa, &sl, tgtpif, &tgt->eaddr, tgt->eport); - if (connect(conn->s[1], sa, sl)) { + if (connect(conn->s[1], &sa.sa, sl)) { if (errno != EINPROGRESS) { flow_trace(conn, "Couldn't connect socket for splice: %s", strerror(errno)); @@ -491,7 +474,13 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, return false; } - flow_target(flow, tgtpif); + /* FIXME: Record outbound source address when known */ + if (af == AF_INET) + flow_target_af(flow, tgtpif, AF_INET, + NULL, 0, &in4addr_loopback, dstport); + else + flow_target_af(flow, tgtpif, AF_INET6, + NULL, 0, &in6addr_loopback, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice); conn->flags = af == AF_INET ? 0 : SPLICE_V6; @@ -503,7 +492,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int))) flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0); - if (tcp_splice_connect(c, conn, af, tgtpif, dstport)) + if (tcp_splice_connect(c, conn)) conn_flag(c, conn, CLOSING); FLOW_ACTIVATE(conn); -- 2.45.2
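The pif_sockaddr() helper added in the patch above turns a unified
address back into something connect() or bind() can take. A rough
standalone sketch of that conversion follows, under simplifying
assumptions: the address is a plain in6_addr with IPv4 in IPv4-mapped
form, the scope id is passed in directly instead of being looked up
from the context and pif, and the length is returned rather than
written through a pointer. The sketch_* names are hypothetical and not
part of passt.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical equivalent of union sockaddr_inany */
union sketch_sockaddr {
	struct sockaddr sa;
	struct sockaddr_in sa4;
	struct sockaddr_in6 sa6;
};

/* Build a socket address from a unified address and a host order port.
 * @scope_id is only applied to IPv6 link-local addresses.
 * Returns the length of the initialised part of @sa. */
static socklen_t sketch_sockaddr(union sketch_sockaddr *sa,
				 const struct in6_addr *addr,
				 in_port_t port, unsigned int scope_id)
{
	/* Also clears sin_zero, sin6_flowinfo and sin6_scope_id */
	memset(sa, 0, sizeof(*sa));

	if (IN6_IS_ADDR_V4MAPPED(addr)) {
		sa->sa4.sin_family = AF_INET;
		memcpy(&sa->sa4.sin_addr, &addr->s6_addr[12],
		       sizeof(sa->sa4.sin_addr));
		sa->sa4.sin_port = htons(port);
		return sizeof(sa->sa4);
	}

	sa->sa6.sin6_family = AF_INET6;
	sa->sa6.sin6_addr = *addr;
	sa->sa6.sin6_port = htons(port);
	if (IN6_IS_ADDR_LINKLOCAL(addr))
		sa->sa6.sin6_scope_id = scope_id;
	return sizeof(sa->sa6);
}
```

Keeping the family decision inside one helper is what lets callers such
as tcp_splice_connect() drop their separate addr4/addr6 temporaries and
pass &sa.sa straight to connect().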
Some information we explicitly store in the TCP connection is now duplicated in the common flow structure. Access it from there instead, and remove it from the TCP specific structure. With that done we can reorder both the "tap" and "splice" TCP structures a bit to get better packing for the new combined flow table entries. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 52 ++++++++++++++++++++++++++------------------------ tcp_conn.h | 40 +++++++++++++++----------------------- tcp_internal.h | 6 +++++- 3 files changed, 47 insertions(+), 51 deletions(-) diff --git a/tcp.c b/tcp.c index 914a0746..3d3df4c9 100644 --- a/tcp.c +++ b/tcp.c @@ -333,8 +333,6 @@ #define ACK_IF_NEEDED 0 /* See tcp_send_flag() */ -#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) - #define CONN_IS_CLOSING(conn) \ (((conn)->events & ESTABLISHED) && \ ((conn)->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD))) @@ -673,10 +671,11 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, */ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) { + const struct flowside *tapside = TAPFLOW(conn); int i; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return 1; return 0; @@ -691,6 +690,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, const struct tcp_info *tinfo) { #ifdef HAS_MIN_RTT + const struct flowside *tapside = TAPFLOW(conn); int i, hole = -1; if (!tinfo->tcpi_min_rtt || @@ -698,7 +698,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -710,7 +710,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return; - low_rtt_dst[hole++] = conn->faddr; + 
low_rtt_dst[hole++] = tapside->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -865,8 +865,10 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->faddr, faddr) && - conn->eport == eport && conn->fport == fport) + const struct flowside *tapside = TAPFLOW(conn); + + if (inany_equals(&tapside->faddr, faddr) && + tapside->eport == eport && tapside->fport == fport) return 1; return 0; @@ -900,7 +902,10 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, static uint64_t tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + const struct flowside *tapside = TAPFLOW(conn); + + return tcp_hash(c, &tapside->faddr, tapside->eport, + tapside->fport); } /** @@ -1035,10 +1040,12 @@ void tcp_defer_handler(struct ctx *c) * @seq: Sequence number */ static void tcp_fill_header(struct tcphdr *th, - const struct tcp_tap_conn *conn, uint32_t seq) + const struct tcp_tap_conn *conn, uint32_t seq) { - th->source = htons(conn->fport); - th->dest = htons(conn->eport); + const struct flowside *tapside = TAPFLOW(conn); + + th->source = htons(tapside->fport); + th->dest = htons(tapside->eport); th->seq = htonl(seq); th->ack_seq = htonl(conn->seq_ack_to_tap); if (conn->events & ESTABLISHED) { @@ -1070,7 +1077,8 @@ static size_t tcp_fill_headers4(const struct ctx *c, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = TAPFLOW(conn); + const struct in_addr *a4 = inany_v4(&tapside->faddr); size_t l4len = dlen + sizeof(*th); size_t l3len = l4len + sizeof(*iph); @@ -1112,10 +1120,11 @@ static size_t tcp_fill_headers6(const struct ctx *c, struct ipv6hdr *ip6h, struct tcphdr *th, size_t dlen, uint32_t seq) { + const struct flowside *tapside = 
TAPFLOW(conn); size_t l4len = dlen + sizeof(*th); ip6h->payload_len = htons(l4len); - ip6h->saddr = conn->faddr.a6; + ip6h->saddr = tapside->faddr.a6; if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr)) ip6h->daddr = c->ip6.addr_ll_seen; else @@ -1154,7 +1163,8 @@ size_t tcp_l2_buf_fill_headers(const struct ctx *c, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = TAPFLOW(conn); + const struct in_addr *a4 = inany_v4(&tapside->faddr); if (a4) { return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base, @@ -1465,6 +1475,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, const struct timespec *now) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); + const struct flowside *tapside = TAPFLOW(conn); union inany_addr aany; uint64_t hash; uint32_t ns; @@ -1474,10 +1485,10 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, else inany_from_af(&aany, AF_INET6, &c->ip6.addr); - inany_siphash_feed(&state, &conn->faddr); + inany_siphash_feed(&state, &tapside->faddr); inany_siphash_feed(&state, &aany); hash = siphash_final(&state, 36, - (uint64_t)conn->fport << 16 | conn->eport); + (uint64_t)tapside->fport << 16 | tapside->eport); /* 32ns ticks, overflows 32 bits every 137s */ ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; @@ -1766,11 +1777,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1; - inany_from_af(&conn->faddr, af, daddr); - - conn->fport = dstport; - conn->eport = srcport; - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2314,10 +2320,6 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - conn->faddr = saddr; - 
conn->fport = srcport; - conn->eport = dstport; - tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_conn.h b/tcp_conn.h index f80ef67b..4e7c57a4 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -13,19 +13,16 @@ * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced) * @f: Generic flow information * @in_epoll: Is the connection in the epoll set? + * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT + * @ws_from_tap: Window scaling factor advertised from tap/guest + * @ws_to_tap: Window scaling factor advertised to tap/guest * @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS * @sock: Socket descriptor number * @events: Connection events, implying connection states * @timer: timerfd descriptor for timeout events * @flags: Connection flags representing internal attributes - * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT - * @ws_from_tap: Window scaling factor advertised from tap/guest - * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -49,6 +46,10 @@ struct tcp_tap_conn { unsigned int ws_from_tap :TCP_WS_BITS; unsigned int ws_to_tap :TCP_WS_BITS; +#define TCP_MSS_BITS 14 + unsigned int tap_mss :TCP_MSS_BITS; +#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS))) +#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS)) int sock :FD_REF_BITS; @@ -77,13 +78,6 @@ struct tcp_tap_conn { #define ACK_TO_TAP_DUE BIT(3) #define ACK_FROM_TAP_DUE BIT(4) - -#define TCP_MSS_BITS 
14 - unsigned int tap_mss :TCP_MSS_BITS; -#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS))) -#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS)) - - #define SNDBUF_BITS 24 unsigned int sndbuf :SNDBUF_BITS; #define SNDBUF_SET(conn, bytes) (conn->sndbuf = ((bytes) >> (32 - SNDBUF_BITS))) @@ -91,11 +85,6 @@ struct tcp_tap_conn { uint8_t seq_dup_ack_approx; - - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap; @@ -109,22 +98,24 @@ struct tcp_tap_conn { /** * struct tcp_splice_conn - Descriptor for a spliced TCP connection * @f: Generic flow information - * @in_epoll: Is the connection in the epoll set? * @s: File descriptor for sockets * @pipe: File descriptors for pipes - * @events: Events observed/actions performed on connection - * @flags: Connection flags (attributes, not events) * @read: Bytes read (not fully written to other side in one shot) * @written: Bytes written (not fully written from one other side read) -*/ + * @events: Events observed/actions performed on connection + * @flags: Connection flags (attributes, not events) + * @in_epoll: Is the connection in the epoll set? + */ struct tcp_splice_conn { /* Must be first element */ struct flow_common f; - bool in_epoll :1; int s[SIDES]; int pipe[SIDES][2]; + uint32_t read[SIDES]; + uint32_t written[SIDES]; + uint8_t events; #define SPLICE_CLOSED 0 #define SPLICE_CONNECT BIT(0) @@ -139,8 +130,7 @@ struct tcp_splice_conn { #define RCVLOWAT_ACT(sidei_) ((sidei_) ? 
BIT(4) : BIT(3)) #define CLOSING BIT(5) - uint32_t read[SIDES]; - uint32_t written[SIDES]; + bool in_epoll :1; }; /* Socket pools */ diff --git a/tcp_internal.h b/tcp_internal.h index 51aaa169..4f61e5c3 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -39,7 +39,11 @@ #define OPT_SACKP 4 #define OPT_SACK 5 #define OPT_TS 8 -#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) + +#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) +#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)])) + +#define CONN_V4(conn) (!!inany_v4(&TAPFLOW(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) /* -- 2.45.2
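On the reordering of the "tap" and "splice" structures for better packing: the effect can be sketched with two made-up layouts. This is only an illustration under assumptions. The struct and field names below are hypothetical and much smaller than the real struct tcp_splice_conn, and the exact sizes depend on the ABI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: small fields scattered between 4-byte aligned ones,
 * so the compiler has to insert padding after each of them */
struct toy_loose {
	bool in_epoll;
	int s[2];
	uint8_t events;
	uint32_t read[2];
	uint8_t flags;
};

/* Same fields, with the small ones grouped at the end so they can share
 * one aligned slot */
struct toy_tight {
	int s[2];
	uint32_t read[2];
	uint8_t events;
	uint8_t flags;
	bool in_epoll;
};

/* Grouping the small fields should never make the structure larger */
static_assert(sizeof(struct toy_tight) <= sizeof(struct toy_loose),
	      "reordered layout must not be larger");

/* Bytes recovered per entry by reordering (ABI dependent) */
static size_t toy_bytes_saved(void)
{
	return sizeof(struct toy_loose) - sizeof(struct toy_tight);
}
```

On common ABIs with a 4-byte int this typically comes out as 28 bytes against 20, but the sketch deliberately only asserts the inequality.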
Currently we always deliver inbound TCP packets to the guest's most recent observed IP address. This has the odd side effect that if the guest changes its IP address with active TCP connections we might deliver packets from old connections to the new address. That won't work; it will probably result in an RST from the guest. Worse, if the guest added a new address but also retains the old one, then we could break those old connections by redirecting them to the new address. Now that we maintain flowside information, we have a record of the correct guest side address and can just use it. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 41 +++++++++++++---------------------------- tcp_buf.c | 6 +++--- tcp_internal.h | 3 +-- 3 files changed, 17 insertions(+), 33 deletions(-) diff --git a/tcp.c b/tcp.c index 3d3df4c9..bac72c02 100644 --- a/tcp.c +++ b/tcp.c @@ -1059,7 +1059,6 @@ static void tcp_fill_header(struct tcphdr *th, /** * tcp_fill_headers4() - Fill 802.3, IPv4, TCP headers in pre-cooked buffers - * @c: Execution context * @conn: Connection pointer * @taph: tap backend specific header * @iph: Pointer to IPv4 header @@ -1070,27 +1069,26 @@ static void tcp_fill_header(struct tcphdr *th, * * Return: The IPv4 payload length, host order */ -static size_t tcp_fill_headers4(const struct ctx *c, - const struct tcp_tap_conn *conn, +static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn, struct tap_hdr *taph, struct iphdr *iph, struct tcphdr *th, size_t dlen, const uint16_t *check, uint32_t seq) { const struct flowside *tapside = TAPFLOW(conn); - const struct in_addr *a4 = inany_v4(&tapside->faddr); + const struct in_addr *src4 = inany_v4(&tapside->faddr); + const struct in_addr *dst4 = inany_v4(&tapside->eaddr); size_t l4len = dlen + sizeof(*th); size_t l3len = l4len + sizeof(*iph); - ASSERT(a4); + ASSERT(src4 && dst4); iph->tot_len = htons(l3len); - iph->saddr = a4->s_addr; - iph->daddr = c->ip4.addr_seen.s_addr; + iph->saddr = 
src4->s_addr; + iph->daddr = dst4->s_addr; iph->check = check ? *check : - csum_ip4_header(l3len, IPPROTO_TCP, - *a4, c->ip4.addr_seen); + csum_ip4_header(l3len, IPPROTO_TCP, *src4, *dst4); tcp_fill_header(th, conn, seq); @@ -1103,7 +1101,6 @@ static size_t tcp_fill_headers4(const struct ctx *c, /** * tcp_fill_headers6() - Fill 802.3, IPv6, TCP headers in pre-cooked buffers - * @c: Execution context * @conn: Connection pointer * @taph: tap backend specific header * @ip6h: Pointer to IPv6 header @@ -1114,8 +1111,7 @@ static size_t tcp_fill_headers4(const struct ctx *c, * * Return: The IPv6 payload length, host order */ -static size_t tcp_fill_headers6(const struct ctx *c, - const struct tcp_tap_conn *conn, +static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn, struct tap_hdr *taph, struct ipv6hdr *ip6h, struct tcphdr *th, size_t dlen, uint32_t seq) @@ -1125,10 +1121,7 @@ static size_t tcp_fill_headers6(const struct ctx *c, ip6h->payload_len = htons(l4len); ip6h->saddr = tapside->faddr.a6; - if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr)) - ip6h->daddr = c->ip6.addr_ll_seen; - else - ip6h->daddr = c->ip6.addr_seen; + ip6h->daddr = tapside->eaddr.a6; ip6h->hop_limit = 255; ip6h->version = 6; @@ -1149,7 +1142,6 @@ static size_t tcp_fill_headers6(const struct ctx *c, /** * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers - * @c: Execution context * @conn: Connection pointer * @iov: Pointer to an array of iovec of TCP pre-cooked buffers * @dlen: TCP payload length @@ -1158,8 +1150,7 @@ static size_t tcp_fill_headers6(const struct ctx *c, * * Return: IP payload length, host order */ -size_t tcp_l2_buf_fill_headers(const struct ctx *c, - const struct tcp_tap_conn *conn, +size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq) { @@ -1167,13 +1158,13 @@ size_t tcp_l2_buf_fill_headers(const struct ctx *c, const struct in_addr *a4 = inany_v4(&tapside->faddr); if (a4) { 
- return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base, + return tcp_fill_headers4(conn, iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_IP].iov_base, iov[TCP_IOV_PAYLOAD].iov_base, dlen, check, seq); } - return tcp_fill_headers6(c, conn, iov[TCP_IOV_TAP].iov_base, + return tcp_fill_headers6(conn, iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_IP].iov_base, iov[TCP_IOV_PAYLOAD].iov_base, dlen, seq); @@ -1476,17 +1467,11 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, { struct siphash_state state = SIPHASH_INIT(c->hash_secret); const struct flowside *tapside = TAPFLOW(conn); - union inany_addr aany; uint64_t hash; uint32_t ns; - if (CONN_V4(conn)) - inany_from_af(&aany, AF_INET, &c->ip4.addr); - else - inany_from_af(&aany, AF_INET6, &c->ip6.addr); - inany_siphash_feed(&state, &tapside->faddr); - inany_siphash_feed(&state, &aany); + inany_siphash_feed(&state, &tapside->eaddr); hash = siphash_final(&state, 36, (uint64_t)tapside->fport << 16 | tapside->eport); diff --git a/tcp_buf.c b/tcp_buf.c index 11dce021..9b198984 100644 --- a/tcp_buf.c +++ b/tcp_buf.c @@ -316,7 +316,7 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) return ret; } - l4len = tcp_l2_buf_fill_headers(c, conn, iov, optlen, NULL, seq); + l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL, seq); iov[TCP_IOV_PAYLOAD].iov_len = l4len; if (flags & DUP_ACK) { @@ -373,7 +373,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, tcp4_frame_conns[tcp4_payload_used] = conn; iov = tcp4_l2_iov[tcp4_payload_used++]; - l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, check, seq); + l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq); iov[TCP_IOV_PAYLOAD].iov_len = l4len; if (tcp4_payload_used > TCP_FRAMES_MEM - 1) tcp_payload_flush(c); @@ -381,7 +381,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, tcp6_frame_conns[tcp6_payload_used] = conn; iov = tcp6_l2_iov[tcp6_payload_used++]; - l4len = 
tcp_l2_buf_fill_headers(c, conn, iov, dlen, NULL, seq); + l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, NULL, seq); iov[TCP_IOV_PAYLOAD].iov_len = l4len; if (tcp6_payload_used > TCP_FRAMES_MEM - 1) tcp_payload_flush(c); diff --git a/tcp_internal.h b/tcp_internal.h index 4f61e5c3..ac6d4b21 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -88,8 +88,7 @@ void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn); tcp_rst_do(c, conn); \ } while (0) -size_t tcp_l2_buf_fill_headers(const struct ctx *c, - const struct tcp_tap_conn *conn, +size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq); int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, -- 2.45.2
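The behavioural difference described in the patch above (a single "last seen" guest address versus a per-flow recorded one) can be sketched as follows. This is a simplified illustration, not code from the series: struct toy_ctx, struct toy_flow and the helper names are hypothetical stand-ins, and only IPv4 is shown.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical stand-in for the global context */
struct toy_ctx {
	struct in_addr addr_seen;	/* most recently observed guest address */
};

/* Hypothetical stand-in for a flow table entry */
struct toy_flow {
	struct in_addr guest_addr;	/* guest address recorded at flow creation */
};

/* Parse a dotted-quad literal; aborts on malformed input */
static struct in_addr toy_parse4(const char *s)
{
	struct in_addr a = { 0 };
	int rc = inet_pton(AF_INET, s, &a);

	assert(rc == 1);
	(void)rc;
	return a;
}

/* Old behaviour: every connection follows the latest observed address */
static struct in_addr toy_dst_from_ctx(const struct toy_ctx *c)
{
	return c->addr_seen;
}

/* New behaviour: each flow keeps using the address it was created with */
static struct in_addr toy_dst_from_flow(const struct toy_flow *f)
{
	return f->guest_addr;
}
```

If the guest later starts using a second address, toy_dst_from_ctx() changes for every connection at once, while toy_dst_from_flow() keeps returning the address that the existing connection was established with.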
For now when we forward a connection to the host we leave the host side forwarding address and port blank since we don't necessarily know what source address and port will be used by the kernel. When the outbound address option is active, though, we do know the address at least, so we can record it in the flowside. Having done that, use it as the primary source of truth, binding the outgoing socket based on the information in there. This allows the possibility of more complex rules for what outbound address and/or port we use in future. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 93 ++++++++++++++++++++++++++++++++--------------------------- 1 file changed, 50 insertions(+), 43 deletions(-) diff --git a/tcp.c b/tcp.c index bac72c02..f0bf76bc 100644 --- a/tcp.c +++ b/tcp.c @@ -1581,46 +1581,48 @@ static uint16_t tcp_conn_tap_mss(const struct tcp_tap_conn *conn, /** * tcp_bind_outbound() - Bind socket to outbound address and interface if given * @c: Execution context + * @conn: Connection entry for socket to bind * @s: Outbound TCP socket - * @af: Address family */ -static void tcp_bind_outbound(const struct ctx *c, int s, sa_family_t af) +static void tcp_bind_outbound(const struct ctx *c, + const struct tcp_tap_conn *conn, int s) { - if (af == AF_INET) { - if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.addr_out)) { - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = 0, - .sin_addr = c->ip4.addr_out, - }; - - if (bind(s, (struct sockaddr *)&addr4, sizeof(addr4))) - debug_perror("IPv4 TCP socket address bind"); + const struct flowside *tgt = &conn->f.side[TGTSIDE]; + union sockaddr_inany bind_sa; + socklen_t sl; + + + pif_sockaddr(c, &bind_sa, &sl, PIF_HOST, &tgt->faddr, tgt->fport); + if (!inany_is_unspecified(&tgt->faddr) || tgt->fport) { + if (bind(s, &bind_sa.sa, sl)) { + char sstr[INANY_ADDRSTRLEN]; + + flow_dbg(conn, + "Can't bind TCP outbound socket to %s:%hu: %s", + inany_ntop(&tgt->faddr, sstr, sizeof(sstr)), + tgt->fport, 
strerror(errno)); } + } + if (bind_sa.sa_family == AF_INET) { if (*c->ip4.ifname_out) { if (setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE, c->ip4.ifname_out, - strlen(c->ip4.ifname_out))) - debug_perror("IPv4 TCP socket interface bind"); - } - } else if (af == AF_INET6) { - if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr_out)) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = 0, - .sin6_addr = c->ip6.addr_out, - }; - - if (bind(s, (struct sockaddr *)&addr6, sizeof(addr6))) - debug_perror("IPv6 TCP socket address bind"); + strlen(c->ip4.ifname_out))) { + flow_dbg(conn, "Can't bind IPv4 TCP socket to" + " interface %s: %s", c->ip4.ifname_out, + strerror(errno)); + } } - + } else if (bind_sa.sa_family == AF_INET6) { if (*c->ip6.ifname_out) { if (setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE, c->ip6.ifname_out, - strlen(c->ip6.ifname_out))) - debug_perror("IPv6 TCP socket interface bind"); + strlen(c->ip6.ifname_out))) { + flow_dbg(conn, "Can't bind IPv6 TCP socket to" + " interface %s: %s", c->ip6.ifname_out, + strerror(errno)); + } } } } @@ -1643,9 +1645,9 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); + union inany_addr srcaddr, dstaddr; /* FIXME: Avoid bulky temporaries */ const struct flowside *ini, *tgt; struct tcp_tap_conn *conn; - union inany_addr dstaddr; /* FIXME: Avoid bulky temporary */ union sockaddr_inany sa; union flow *flow; int s = -1, mss; @@ -1666,9 +1668,24 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, } - /* FIXME: Record outbound source address when known */ + if (inany_is_linklocal6(&dstaddr)) { + srcaddr.a6 = c->ip6.addr_ll; + } else if (inany_is_loopback(&dstaddr)) { + srcaddr = dstaddr; + } else if (inany_v4(&dstaddr)) { + if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.addr_out)) + srcaddr = inany_from_v4(c->ip4.addr_out); + else + srcaddr = inany_any4; + } else { + if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr_out)) + srcaddr.a6 = 
c->ip6.addr_out; + else + srcaddr = inany_any6; + } + tgt = flow_target_af(flow, PIF_HOST, AF_INET6, - NULL, 0, /* Kernel decides source address */ + &srcaddr, 0, /* Kernel decides source port */ &dstaddr, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); @@ -1731,18 +1748,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, goto cancel; } - if (inany_is_linklocal6(&tgt->eaddr)) { - struct sockaddr_in6 addr6_ll = { - .sin6_family = AF_INET6, - .sin6_addr = c->ip6.addr_ll, - .sin6_scope_id = c->ifi6, - }; - if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) - goto cancel; - } else if (!inany_is_loopback(&tgt->eaddr)) { - tcp_bind_outbound(c, s, af); - } - conn->sock = s; conn->timer = -1; conn_event(c, conn, TAP_SYN_RCVD); @@ -1771,6 +1776,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, tcp_hash_insert(c, conn); + tcp_bind_outbound(c, conn, s); + if (connect(s, &sa.sa, sl)) { if (errno != EINPROGRESS) { tcp_rst(c, conn); -- 2.45.2
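The "bind only if the flow records a source address or port" rule from the patch above can be sketched in isolation like this. It is a minimal IPv4-only illustration under assumptions: toy_src_specified() and toy_bind_source() are hypothetical names, and the real code goes through pif_sockaddr() and union sockaddr_inany instead.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <sys/socket.h>

/* True if there is anything to bind to: a specific address or a port */
static bool toy_src_specified(const struct sockaddr_in *src)
{
	return src->sin_addr.s_addr != htonl(INADDR_ANY) || src->sin_port != 0;
}

/* Bind @s to @src before connect(), or leave the choice to the kernel.
 * Return: 0 if there was nothing to do or bind() succeeded, -1 if bind()
 * failed (errno is set by bind())
 */
static int toy_bind_source(int s, const struct sockaddr_in *src)
{
	assert(src->sin_family == AF_INET);

	if (!toy_src_specified(src))
		return 0;	/* kernel picks both address and port */

	/* A zero port here still lets the kernel pick the source port */
	return bind(s, (const struct sockaddr *)src, sizeof(*src));
}
```

A caller would run this between socket() and connect(). As in the patch, a bind failure would be logged rather than treated as fatal.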
Now that we store all our endpoints in the flowside structure, use some
inany helpers to make validation of those endpoints simpler.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 inany.h |  1 -
 tcp.c   | 72 +++++++++++++++------------------------------------------
 2 files changed, 18 insertions(+), 55 deletions(-)

diff --git a/inany.h b/inany.h
index 8eaf5335..d2893cec 100644
--- a/inany.h
+++ b/inany.h
@@ -211,7 +211,6 @@ static inline bool inany_is_multicast(const union inany_addr *a)
  *
  * Return: true if @a is specified and a unicast address
  */
-/* cppcheck-suppress unusedFunction */
 static inline bool inany_is_unicast(const union inany_addr *a)
 {
 	return !inany_is_unspecified(a) && !inany_is_multicast(a);
diff --git a/tcp.c b/tcp.c
index f0bf76bc..b4c4f774 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1689,38 +1689,14 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 			     &dstaddr, dstport);
 	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
 
-	if (af == AF_INET) {
-		if (IN4_IS_ADDR_UNSPECIFIED(saddr) ||
-		    IN4_IS_ADDR_BROADCAST(saddr) ||
-		    IN4_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
-		    IN4_IS_ADDR_UNSPECIFIED(daddr) ||
-		    IN4_IS_ADDR_BROADCAST(daddr) ||
-		    IN4_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
-			char sstr[INET_ADDRSTRLEN], dstr[INET_ADDRSTRLEN];
-
-			debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
-			      inet_ntop(AF_INET, saddr, sstr, sizeof(sstr)),
-			      srcport,
-			      inet_ntop(AF_INET, daddr, dstr, sizeof(dstr)),
-			      dstport);
-			goto cancel;
-		}
-	} else if (af == AF_INET6) {
-		if (IN6_IS_ADDR_UNSPECIFIED(saddr) ||
-		    IN6_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
-		    IN6_IS_ADDR_UNSPECIFIED(daddr) ||
-		    IN6_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
-			char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN];
-
-			debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
-			      inet_ntop(AF_INET6, saddr, sstr, sizeof(sstr)),
-			      srcport,
-			      inet_ntop(AF_INET6, daddr, dstr, sizeof(dstr)),
-			      dstport);
-			goto cancel;
-		}
-	} else {
-		ASSERT(0);
+	if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0 ||
+	    !inany_is_unicast(&ini->faddr) || ini->fport == 0) {
+		char sstr[INANY_ADDRSTRLEN], dstr[INANY_ADDRSTRLEN];
+
+		debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
+		      inany_ntop(&ini->eaddr, sstr, sizeof(sstr)), ini->eport,
+		      inany_ntop(&ini->faddr, dstr, sizeof(dstr)), ini->fport);
+		goto cancel;
 	}
 
 	if ((s = tcp_conn_sock(c, af)) < 0)
@@ -2336,7 +2312,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 			const struct timespec *now)
 {
-	char sastr[SOCKADDR_STRLEN];
+	const struct flowside *ini;
 	union sockaddr_inany sa;
 	socklen_t sl = sizeof(sa);
 	union flow *flow;
@@ -2353,23 +2329,15 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 
 	/* FIXME: When listening port has a specific bound address, record that
 	 * as the forwarding address */
-	flow_initiate_sa(flow, ref.tcp_listen.pif, &sa, ref.tcp_listen.port);
-
-	if (sa.sa_family == AF_INET) {
-		const struct in_addr *addr = &sa.sa4.sin_addr;
-		in_port_t port = sa.sa4.sin_port;
-
-		if (IN4_IS_ADDR_UNSPECIFIED(addr) ||
-		    IN4_IS_ADDR_BROADCAST(addr) ||
-		    IN4_IS_ADDR_MULTICAST(addr) || port == 0)
-			goto bad_endpoint;
-	} else if (sa.sa_family == AF_INET6) {
-		const struct in6_addr *addr = &sa.sa6.sin6_addr;
-		in_port_t port = sa.sa6.sin6_port;
-
-		if (IN6_IS_ADDR_UNSPECIFIED(addr) ||
-		    IN6_IS_ADDR_MULTICAST(addr) || port == 0)
-			goto bad_endpoint;
+	ini = flow_initiate_sa(flow, ref.tcp_listen.pif, &sa,
+			       ref.tcp_listen.port);
+
+	if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) {
+		char sastr[SOCKADDR_STRLEN];
+
+		err("Invalid endpoint from TCP accept(): %s",
+		    sockaddr_ntop(&sa, sastr, sizeof(sastr)));
+		goto cancel;
 	}
 
 	if (tcp_splice_conn_from_sock(c, ref.tcp_listen.pif,
@@ -2379,10 +2347,6 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 	tcp_tap_conn_from_sock(c, ref.tcp_listen.port, flow, s, &sa, now);
 	return;
 
-bad_endpoint:
-	err("Invalid endpoint from TCP accept(): %s",
-	    sockaddr_ntop(&sa, sastr, sizeof(sastr)));
-
 cancel:
 	flow_alloc_cancel(flow);
 }
-- 
2.45.2
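The idea behind the simplified validation above, one family-independent check instead of separate IPv4 and IPv6 branches, can be sketched with standard macros only. This is an illustration under assumptions, not the real inany helpers: the toy_* names are hypothetical, IPv4 addresses are assumed to be held in IPv4-mapped IPv6 form, and the sketch does not treat the IPv4 broadcast address specially (whether the real helper does is not visible in this excerpt).

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Extract the embedded IPv4 address (network order) from a mapped address */
static uint32_t toy_mapped_v4(const struct in6_addr *a)
{
	uint32_t v4;

	memcpy(&v4, &a->s6_addr[12], sizeof(v4));
	return v4;
}

/* Unspecified means :: for IPv6, or 0.0.0.0 behind an IPv4-mapped prefix */
static bool toy_is_unspecified(const struct in6_addr *a)
{
	if (IN6_IS_ADDR_V4MAPPED(a))
		return toy_mapped_v4(a) == htonl(INADDR_ANY);
	return IN6_IS_ADDR_UNSPECIFIED(a);
}

/* Multicast means ff00::/8 for IPv6, or 224.0.0.0/4 for mapped IPv4 */
static bool toy_is_multicast(const struct in6_addr *a)
{
	if (IN6_IS_ADDR_V4MAPPED(a))
		return IN_MULTICAST(ntohl(toy_mapped_v4(a)));
	return IN6_IS_ADDR_MULTICAST(a);
}

/* A usable TCP endpoint: specified unicast address and non-zero port */
static bool toy_endpoint_ok(const struct in6_addr *a, in_port_t port)
{
	return !toy_is_unspecified(a) && !toy_is_multicast(a) && port != 0;
}

/* Convenience wrapper taking an IPv6 (or IPv4-mapped) address literal */
static bool toy_endpoint_ok_str(const char *s, in_port_t port)
{
	struct in6_addr a;

	if (inet_pton(AF_INET6, s, &a) != 1)
		return false;
	return toy_endpoint_ok(&a, port);
}
```

With a single representation, the caller needs one condition per endpoint rather than one per address family, which is what lets the patch drop the duplicated branches.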
Since we're now constructing socket addresses based on information in the
flowside, we no longer need an explicit flag to tell if we're dealing with
an IPv4 or IPv6 connection. Hence, drop the now unused SPLICE_V6 flag.

As well as just simplifying the code, this allows for possible future
extensions where we could splice an IPv4 connection to an IPv6 connection
or vice versa.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tcp_conn.h   | 7 +++----
 tcp_splice.c | 3 ---
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/tcp_conn.h b/tcp_conn.h
index 4e7c57a4..6ae05115 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -125,10 +125,9 @@ struct tcp_splice_conn {
 #define FIN_SENT(sidei_)		((sidei_) ? BIT(7) : BIT(6))
 
 	uint8_t flags;
-#define SPLICE_V6			BIT(0)
-#define RCVLOWAT_SET(sidei_)		((sidei_) ? BIT(2) : BIT(1))
-#define RCVLOWAT_ACT(sidei_)		((sidei_) ? BIT(4) : BIT(3))
-#define CLOSING				BIT(5)
+#define RCVLOWAT_SET(sidei_)		((sidei_) ? BIT(1) : BIT(0))
+#define RCVLOWAT_ACT(sidei_)		((sidei_) ? BIT(3) : BIT(2))
+#define CLOSING				BIT(4)
 
 	bool in_epoll	:1;
 };
diff --git a/tcp_splice.c b/tcp_splice.c
index 608e7f27..c81daeeb 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -73,8 +73,6 @@ static int ns_sock_pool6		[TCP_SOCK_POOL_SIZE];
 /* Pool of pre-opened pipes */
 static int splice_pipe_pool		[TCP_SPLICE_PIPE_POOL_SIZE][2];
 
-#define CONN_V6(x)			((x)->flags & SPLICE_V6)
-#define CONN_V4(x)			(!CONN_V6(x))
 #define CONN_HAS(conn, set)		(((conn)->events & (set)) == (set))
 
 /* Display strings for connection events */
@@ -483,7 +481,6 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
 				       NULL, 0, &in6addr_loopback, dstport);
 
 	conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice);
-	conn->flags = af == AF_INET ? 0 : SPLICE_V6;
 	conn->s[0] = s0;
 	conn->s[1] = -1;
 	conn->pipe[0][0] = conn->pipe[0][1] = -1;
-- 
2.45.2
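Why a separate per-connection "is IPv6" flag becomes redundant can be sketched as below: once the address itself is stored, the family can be derived from it. This is only an illustration under an assumption, namely that IPv4 addresses are kept in IPv4-mapped IPv6 form, as the inany_v4() helper used elsewhere in the series suggests. toy_family() and toy_family_str() are hypothetical names.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Derive the address family from the stored address, instead of from a flag */
static sa_family_t toy_family(const struct in6_addr *a)
{
	return IN6_IS_ADDR_V4MAPPED(a) ? AF_INET : AF_INET6;
}

/* Convenience wrapper for address literals; aborts on malformed input */
static sa_family_t toy_family_str(const char *s)
{
	struct in6_addr a;
	int rc = inet_pton(AF_INET6, s, &a);

	assert(rc == 1);
	(void)rc;
	return toy_family(&a);
}
```

Because each side of a flow carries its own addresses, the two sides could in principle report different families, which is the IPv4 to IPv6 splicing possibility the commit message mentions.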
Currently we match TCP packets received on the tap connection to a TCP connection via a hash table based on the forwarding address and both ports. We hope in future to allow for multiple guest side addresses, or for multiple interfaces which means we may need to distinguish based on the endpoint address and pif as well. We also want a unified hash table to cover multiple protocols, not just TCP. Replace the TCP specific hash function with one suitable for general flows, or rather for one side of a general flow. This includes all the information from struct flowside, plus the pif and the L4 protocol number. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 35 +++++++++++++++++++++++++++--- flow.h | 19 ++++++++++++++++ flow_table.h | 3 +++ tcp.c | 61 ++++++++++------------------------------------------ 4 files changed, 65 insertions(+), 53 deletions(-) diff --git a/flow.c b/flow.c index 71c44d2f..e70e3cbf 100644 --- a/flow.c +++ b/flow.c @@ -116,9 +116,9 @@ static struct timespec flow_timer_run; * @faddr: Forwarding address (pointer to in_addr or in6_addr) * @fport: Forwarding port */ -static void flowside_from_af(struct flowside *side, sa_family_t af, - const void *eaddr, in_port_t eport, - const void *faddr, in_port_t fport) +void flowside_from_af(struct flowside *side, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport) { if (faddr) inany_from_af(&side->faddr, af, faddr); @@ -401,6 +401,35 @@ void flow_alloc_cancel(union flow *flow) flow_new_entry = NULL; } +/** + * flow_hash() - Calculate hash value for one side of a flow + * @c: Execution context + * @proto: Protocol of this flow (IP L4 protocol number) + * @pif: pif of the side to hash + * @side: Flowside (must not have unspecified parts) + * + * Return: hash value + */ +uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, + const struct flowside *side) +{ + struct siphash_state state = SIPHASH_INIT(c->hash_secret); + + /* For the 
hash table to work, we need complete endpoint information, + * and at least a forwarding port. + */ + ASSERT(pif != PIF_NONE && !inany_is_unspecified(&side->eaddr) && + side->eport != 0 && side->fport != 0); + + inany_siphash_feed(&state, &side->faddr); + inany_siphash_feed(&state, &side->eaddr); + + return siphash_final(&state, 38, (uint64_t)proto << 40 | + (uint64_t)pif << 32 | + (uint64_t)side->fport << 16 | + (uint64_t)side->eport); +} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context diff --git a/flow.h b/flow.h index 8c6ba602..a0550dc7 100644 --- a/flow.h +++ b/flow.h @@ -149,6 +149,25 @@ struct flowside { in_port_t eport; }; +/** + * flowside_eq() - Check if two flowsides are equal + * @left, @right: Flowsides to compare + * + * Return: true if equal, false otherwise + */ +static inline bool flowside_eq(const struct flowside *left, + const struct flowside *right) +{ + return inany_equals(&left->eaddr, &right->eaddr) && + left->eport == right->eport && + inany_equals(&left->faddr, &right->faddr) && + left->fport == right->fport; +} + +void flowside_from_af(struct flowside *side, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport); + /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry diff --git a/flow_table.h b/flow_table.h index aabdbb75..b3cb5461 100644 --- a/flow_table.h +++ b/flow_table.h @@ -146,4 +146,7 @@ void flow_activate(struct flow_common *f); #define FLOW_ACTIVATE(flow_) \ (flow_activate(&(flow_)->f)) +uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, + const struct flowside *side); + #endif /* FLOW_TABLE_H */ diff --git a/tcp.c b/tcp.c index b4c4f774..5c8c8d12 100644 --- a/tcp.c +++ b/tcp.c @@ -377,7 +377,7 @@ bool peek_offset_cap; /* sendmsg() to socket */ static struct iovec tcp_iov [UIO_MAXIOV]; -/* Table for lookup from remote address, local port, remote port */ +/* Table for lookup from 
flowside information */ static flow_sidx_t tc_hash[TCP_HASH_TABLE_SIZE]; static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX, @@ -852,46 +852,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find, return -1; } -/** - * tcp_hash_match() - Check if a connection entry matches address and ports - * @conn: Connection entry to match against - * @faddr: Guest side forwarding address - * @eport: Guest side endpoint port - * @fport: Guest side forwarding port - * - * Return: 1 on match, 0 otherwise - */ -static int tcp_hash_match(const struct tcp_tap_conn *conn, - const union inany_addr *faddr, - in_port_t eport, in_port_t fport) -{ - const struct flowside *tapside = TAPFLOW(conn); - - if (inany_equals(&tapside->faddr, faddr) && - tapside->eport == eport && tapside->fport == fport) - return 1; - - return 0; -} - -/** - * tcp_hash() - Calculate hash value for connection given address and ports - * @c: Execution context - * @faddr: Guest side forwarding address - * @eport: Guest side endpoint port - * @fport: Guest side forwarding port - * - * Return: hash value, needs to be adjusted for table size - */ -static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, - in_port_t eport, in_port_t fport) -{ - struct siphash_state state = SIPHASH_INIT(c->hash_secret); - - inany_siphash_feed(&state, faddr); - return siphash_final(&state, 20, (uint64_t)eport << 16 | fport); -} - /** * tcp_conn_hash() - Calculate hash bucket of an existing connection * @c: Execution context @@ -904,8 +864,7 @@ static uint64_t tcp_conn_hash(const struct ctx *c, { const struct flowside *tapside = TAPFLOW(conn); - return tcp_hash(c, &tapside->faddr, tapside->eport, - tapside->fport); + return flow_hash(c, IPPROTO_TCP, conn->f.pif[TAPSIDE(conn)], tapside); } /** @@ -979,25 +938,26 @@ static void tcp_hash_remove(const struct ctx *c, * tcp_hash_lookup() - Look up connection given remote address and ports * @c: Execution context * @af: Address family, AF_INET or AF_INET6 + * 
@eaddr: Guest side endpoint address (guest local address) * @faddr: Guest side forwarding address (guest remote address) * @eport: Guest side endpoint port (guest local port) * @fport: Guest side forwarding port (guest remote port) * * Return: connection pointer, if found, -ENOENT otherwise */ -static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, - sa_family_t af, const void *faddr, +static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, sa_family_t af, + const void *eaddr, const void *faddr, in_port_t eport, in_port_t fport) { - union inany_addr aany; + struct flowside side; union flow *flow; unsigned b; - inany_from_af(&aany, af, faddr); + flowside_from_af(&side, af, eaddr, eport, faddr, fport); - b = tcp_hash(c, &aany, eport, fport) % TCP_HASH_TABLE_SIZE; + b = flow_hash(c, IPPROTO_TCP, PIF_TAP, &side) % TCP_HASH_TABLE_SIZE; while ((flow = flow_at_sidx(tc_hash[b])) && - !tcp_hash_match(&flow->tcp, &aany, eport, fport)) + !flowside_eq(&flow->f.side[TAPSIDE(flow)], &side)) b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); return &flow->tcp; @@ -2102,7 +2062,8 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL); opts = packet_get(p, idx, sizeof(*th), optlen, NULL); - conn = tcp_hash_lookup(c, af, daddr, ntohs(th->source), ntohs(th->dest)); + conn = tcp_hash_lookup(c, af, saddr, daddr, + ntohs(th->source), ntohs(th->dest)); /* New connection from tap */ if (!conn) { -- 2.45.2
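The lookup scheme in the patch above (hash one side of a flow together with its protocol and pif, then probe backwards through an open-addressed table) can be sketched as a small standalone table. This is a toy under stated assumptions: all toy_* names are hypothetical, addresses are plain 16-byte arrays, and a simple FNV-1a style mix stands in for the keyed SipHash the real code uses, so it is not suitable against hostile input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_TABLE_SIZE	16	/* must stay larger than the number of entries */

struct toy_side {
	uint8_t eaddr[16], faddr[16];
	uint16_t eport, fport;
};

struct toy_entry {
	bool used;
	uint8_t proto, pif;
	struct toy_side side;
};

static struct toy_entry toy_tab[TOY_TABLE_SIZE];
static unsigned toy_count;

/* FNV-1a style mixing step: a stand-in for SipHash, not a replacement */
static uint64_t toy_mix(uint64_t h, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		h ^= *p++;
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* Hash both addresses, then protocol, pif and ports packed into one word */
static uint64_t toy_hash(uint8_t proto, uint8_t pif, const struct toy_side *s)
{
	uint64_t tail = (uint64_t)proto << 40 | (uint64_t)pif << 32 |
			(uint64_t)s->fport << 16 | (uint64_t)s->eport;
	uint64_t h = 0xcbf29ce484222325ULL;

	h = toy_mix(h, s->faddr, sizeof(s->faddr));
	h = toy_mix(h, s->eaddr, sizeof(s->eaddr));
	return toy_mix(h, &tail, sizeof(tail));
}

static bool toy_match(const struct toy_entry *e, uint8_t proto, uint8_t pif,
		      const struct toy_side *s)
{
	return e->proto == proto && e->pif == pif &&
	       e->side.eport == s->eport && e->side.fport == s->fport &&
	       !memcmp(e->side.eaddr, s->eaddr, sizeof(s->eaddr)) &&
	       !memcmp(e->side.faddr, s->faddr, sizeof(s->faddr));
}

/* Return the bucket holding the key, or the free bucket where it belongs */
static unsigned toy_probe(uint8_t proto, uint8_t pif, const struct toy_side *s)
{
	unsigned b = toy_hash(proto, pif, s) % TOY_TABLE_SIZE;

	/* Terminates because at least one bucket is always left free */
	while (toy_tab[b].used && !toy_match(&toy_tab[b], proto, pif, s))
		b = (b + TOY_TABLE_SIZE - 1) % TOY_TABLE_SIZE;	/* step back */
	return b;
}

/* Insert the key if not already present; return its bucket either way */
static unsigned toy_insert(uint8_t proto, uint8_t pif, const struct toy_side *s)
{
	unsigned b = toy_probe(proto, pif, s);

	if (!toy_tab[b].used) {
		assert(toy_count < TOY_TABLE_SIZE - 1);
		toy_tab[b] = (struct toy_entry){ .used = true, .proto = proto,
						 .pif = pif, .side = *s };
		toy_count++;
	}
	return b;
}

/* Return the bucket index of the key, or -1 if it is not in the table */
static int toy_lookup(uint8_t proto, uint8_t pif, const struct toy_side *s)
{
	unsigned b = toy_probe(proto, pif, s);

	return toy_tab[b].used ? (int)b : -1;
}
```

Including the protocol and pif in both the hash and the comparison is what allows one table to serve several protocols: the same addresses and ports under a different protocol number simply do not match.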
Move the data structures and helper functions for the TCP hash table to flow.c, making it a general hash table indexing sides of flows. This is largely code motion and straightforward renames. There are two semantic changes: * flow_lookup_af() now needs to verify that the entry has a matching protocol and interface as well as matching addresses and ports. * We double the size of the hash table, because it's now at least theoretically possible for both sides of each flow to be hashed. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 155 +++++++++++++++++++++++++++++++++++++++++++++++-- flow.h | 11 ++-- flow_table.h | 3 - tcp.c | 147 +++++----------------------------------------- tcp_internal.h | 1 + 5 files changed, 172 insertions(+), 145 deletions(-) diff --git a/flow.c b/flow.c index e70e3cbf..b291f254 100644 --- a/flow.c +++ b/flow.c @@ -105,6 +105,16 @@ unsigned flow_first_free; union flow flowtab[FLOW_MAX]; static const union flow *flow_new_entry; /* = NULL */ +/* Hash table to index it */ +#define FLOW_HASH_LOAD 70 /* % */ +#define FLOW_HASH_SIZE ((2 * FLOW_MAX * 100 / FLOW_HASH_LOAD)) + +/* Table for lookup from flowside information */ +static flow_sidx_t flow_hashtab[FLOW_HASH_SIZE]; + +static_assert(ARRAY_SIZE(flow_hashtab) >= 2 * FLOW_MAX, +"Safe linear probing requires hash table with more entries than the number of sides in the flow table"); + /* Last time the flow timers ran */ static struct timespec flow_timer_run; @@ -116,9 +126,9 @@ static struct timespec flow_timer_run; * @faddr: Forwarding address (pointer to in_addr or in6_addr) * @fport: Forwarding port */ -void flowside_from_af(struct flowside *side, sa_family_t af, - const void *eaddr, in_port_t eport, - const void *faddr, in_port_t fport) +static void flowside_from_af(struct flowside *side, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport) { if (faddr) inany_from_af(&side->faddr, af, faddr); @@ -410,8 +420,8 @@ void 
flow_alloc_cancel(union flow *flow) * * Return: hash value */ -uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, - const struct flowside *side) +static uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, + const struct flowside *side) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); @@ -430,6 +440,136 @@ uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, (uint64_t)side->eport); } +/** + * flow_sidx_hash() - Calculate hash value for given side of a given flow + * @c: Execution context + * @sidx: Flow & side index to get hash for + * + * Return: hash value, of the flow & side represented by @sidx + */ +static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx) +{ + const struct flow_common *f = &flow_at_sidx(sidx)->f; + return flow_hash(c, FLOW_PROTO(f), + f->pif[sidx.sidei], &f->side[sidx.sidei]); +} + +/** + * flow_hash_probe() - Find hash bucket for a flow + * @c: Execution context + * @sidx: Flow and side to find bucket for + * + * Return: If @sidx is in the hash table, its current bucket, otherwise a + * suitable free bucket for it. 
+ */ +static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) +{ + unsigned b = flow_sidx_hash(c, sidx) % FLOW_HASH_SIZE; + + /* Linear probing */ + while (flow_sidx_valid(flow_hashtab[b]) && + !flow_sidx_eq(flow_hashtab[b], sidx)) + b = mod_sub(b, 1, FLOW_HASH_SIZE); + + return b; +} + +/** + * flow_hash_insert() - Insert side of a flow into into hash table + * @c: Execution context + * @sidx: Flow & side index + */ +void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx) +{ + unsigned b = flow_hash_probe(c, sidx); + + flow_hashtab[b] = sidx; + flow_dbg(flow_at_sidx(sidx), "Side %u hash table insert: bucket: %u", + sidx.sidei, b); +} + +/** + * flow_hash_remove() - Drop side of a flow from the hash table + * @c: Execution context + * @sidx: Side of flow to remove + */ +void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx) +{ + unsigned b = flow_hash_probe(c, sidx), s; + + if (!flow_sidx_valid(flow_hashtab[b])) + return; /* Redundant remove */ + + flow_dbg(flow_at_sidx(sidx), "Side %u hash table remove: bucket: %u", + sidx.sidei, b); + + /* Scan the remainder of the cluster */ + for (s = mod_sub(b, 1, FLOW_HASH_SIZE); + flow_sidx_valid(flow_hashtab[s]); + s = mod_sub(s, 1, FLOW_HASH_SIZE)) { + unsigned h = flow_sidx_hash(c, flow_hashtab[s]) % FLOW_HASH_SIZE; + + if (!mod_between(h, s, b, FLOW_HASH_SIZE)) { + /* flow_hashtab[s] can live in flow_hashtab[b]'s slot */ + debug("hash table remove: shuffle %u -> %u", s, b); + flow_hashtab[b] = flow_hashtab[s]; + b = s; + } + } + + flow_hashtab[b] = FLOW_SIDX_NONE; +} + +/** + * flowside_lookup() - Look for a matching flowside in the flow table + * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: pif to look for in the table + * @side: Flowside to look for in the table + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +static flow_sidx_t flowside_lookup(const struct ctx *c, uint8_t proto, + uint8_t pif, const struct 
flowside *side) +{ + flow_sidx_t sidx; + union flow *flow; + unsigned b; + + b = flow_hash(c, proto, pif, side) % FLOW_HASH_SIZE; + while ((sidx = flow_hashtab[b], flow = flow_at_sidx(sidx)) && + !(FLOW_PROTO(&flow->f) == proto && + flow->f.pif[sidx.sidei] == pif && + flowside_eq(&flow->f.side[sidx.sidei], side))) + b = (b + 1) % FLOW_HASH_SIZE; + + return flow_hashtab[b]; +} + +/** + * flow_lookup_af() - Look up a flow given addressing information + * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @af: Address family, AF_INET or AF_INET6 + * @eaddr: Guest side endpoint address (guest local address) + * @faddr: Guest side forwarding address (guest remote address) + * @eport: Guest side endpoint port (guest local port) + * @fport: Guest side forwarding port (guest remote port) + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_af(const struct ctx *c, + uint8_t proto, uint8_t pif, sa_family_t af, + const void *eaddr, const void *faddr, + in_port_t eport, in_port_t fport) +{ + struct flowside side; + + flowside_from_af(&side, af, eaddr, eport, faddr, fport); + return flowside_lookup(c, proto, pif, &side); +} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context @@ -543,7 +683,12 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now) */ void flow_init(void) { + unsigned b; + /* Initial state is a single free cluster containing the whole table */ flowtab[0].free.n = FLOW_MAX; flowtab[0].free.next = FLOW_MAX; + + for (b = 0; b < FLOW_HASH_SIZE; b++) + flow_hashtab[b] = FLOW_SIDX_NONE; } diff --git a/flow.h b/flow.h index a0550dc7..fcb4121d 100644 --- a/flow.h +++ b/flow.h @@ -164,10 +164,6 @@ static inline bool flowside_eq(const struct flowside *left, left->fport == right->fport; } -void flowside_from_af(struct flowside *side, sa_family_t af, - const void *eaddr, in_port_t eport, - 
const void *faddr, in_port_t fport); - /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry @@ -233,6 +229,13 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b) return (a.flowi == b.flowi) && (a.sidei == b.sidei); } +void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx); +void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx); +flow_sidx_t flow_lookup_af(const struct ctx *c, + uint8_t proto, uint8_t pif, sa_family_t af, + const void *eaddr, const void *faddr, + in_port_t eport, in_port_t fport); + union flow; void flow_init(void); diff --git a/flow_table.h b/flow_table.h index b3cb5461..aabdbb75 100644 --- a/flow_table.h +++ b/flow_table.h @@ -146,7 +146,4 @@ void flow_activate(struct flow_common *f); #define FLOW_ACTIVATE(flow_) \ (flow_activate(&(flow_)->f)) -uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, - const struct flowside *side); - #endif /* FLOW_TABLE_H */ diff --git a/tcp.c b/tcp.c index 5c8c8d12..09648551 100644 --- a/tcp.c +++ b/tcp.c @@ -305,9 +305,6 @@ #include "tcp_internal.h" #include "tcp_buf.h" -#define TCP_HASH_TABLE_LOAD 70 /* % */ -#define TCP_HASH_TABLE_SIZE (FLOW_MAX * 100 / TCP_HASH_TABLE_LOAD) - /* MSS rounding: see SET_MSS() */ #define MSS_DEFAULT 536 #define WINDOW_DEFAULT 14600 /* RFC 6928 */ @@ -377,12 +374,6 @@ bool peek_offset_cap; /* sendmsg() to socket */ static struct iovec tcp_iov [UIO_MAXIOV]; -/* Table for lookup from flowside information */ -static flow_sidx_t tc_hash[TCP_HASH_TABLE_SIZE]; - -static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX, - "Safe linear probing requires hash table larger than connection table"); - /* Pools for pre-opened sockets (in init) */ int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; @@ -605,9 +596,6 @@ void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn, tcp_timer_ctl(c, conn); } -static void tcp_hash_remove(const struct ctx *c, - const struct tcp_tap_conn *conn); 
- /** * conn_event_do() - Set and log connection events, update epoll state * @c: Execution context @@ -653,7 +641,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, num == -1 ? "CLOSED" : tcp_event_str[num]); if (event == CLOSED) - tcp_hash_remove(c, conn); + flow_hash_remove(c, TAP_SIDX(conn)); else if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD)) conn_flag(c, conn, ACTIVE_CLOSE); else @@ -852,117 +840,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find, return -1; } -/** - * tcp_conn_hash() - Calculate hash bucket of an existing connection - * @c: Execution context - * @conn: Connection - * - * Return: hash value, needs to be adjusted for table size - */ -static uint64_t tcp_conn_hash(const struct ctx *c, - const struct tcp_tap_conn *conn) -{ - const struct flowside *tapside = TAPFLOW(conn); - - return flow_hash(c, IPPROTO_TCP, conn->f.pif[TAPSIDE(conn)], tapside); -} - -/** - * tcp_hash_probe() - Find hash bucket for a connection - * @c: Execution context - * @conn: Connection to find bucket for - * - * Return: If @conn is in the table, its current bucket, otherwise a suitable - * free bucket for it. 
- */ -static inline unsigned tcp_hash_probe(const struct ctx *c, - const struct tcp_tap_conn *conn) -{ - unsigned b = tcp_conn_hash(c, conn) % TCP_HASH_TABLE_SIZE; - flow_sidx_t sidx = FLOW_SIDX(conn, TAPSIDE(conn)); - - /* Linear probing */ - while (flow_sidx_valid(tc_hash[b]) && !flow_sidx_eq(tc_hash[b], sidx)) - b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); - - return b; -} - -/** - * tcp_hash_insert() - Insert connection into hash table, chain link - * @c: Execution context - * @conn: Connection pointer - */ -static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn) -{ - unsigned b = tcp_hash_probe(c, conn); - - tc_hash[b] = FLOW_SIDX(conn, TAPSIDE(conn)); - flow_dbg(conn, "hash table insert: sock %i, bucket: %u", conn->sock, b); -} - -/** - * tcp_hash_remove() - Drop connection from hash table, chain unlink - * @c: Execution context - * @conn: Connection pointer - */ -static void tcp_hash_remove(const struct ctx *c, - const struct tcp_tap_conn *conn) -{ - unsigned b = tcp_hash_probe(c, conn), s; - union flow *flow; - - if (!flow_sidx_valid(tc_hash[b])) - return; /* Redundant remove */ - - flow_dbg(conn, "hash table remove: sock %i, bucket: %u", conn->sock, b); - - /* Scan the remainder of the cluster */ - for (s = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); - (flow = flow_at_sidx(tc_hash[s])); - s = mod_sub(s, 1, TCP_HASH_TABLE_SIZE)) { - unsigned h = tcp_conn_hash(c, &flow->tcp) % TCP_HASH_TABLE_SIZE; - - if (!mod_between(h, s, b, TCP_HASH_TABLE_SIZE)) { - /* tc_hash[s] can live in tc_hash[b]'s slot */ - debug("hash table remove: shuffle %u -> %u", s, b); - tc_hash[b] = tc_hash[s]; - b = s; - } - } - - tc_hash[b] = FLOW_SIDX_NONE; -} - -/** - * tcp_hash_lookup() - Look up connection given remote address and ports - * @c: Execution context - * @af: Address family, AF_INET or AF_INET6 - * @eaddr: Guest side endpoint address (guest local address) - * @faddr: Guest side forwarding address (guest remote address) - * @eport: Guest side endpoint port (guest 
local port) - * @fport: Guest side forwarding port (guest remote port) - * - * Return: connection pointer, if found, -ENOENT otherwise - */ -static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, sa_family_t af, - const void *eaddr, const void *faddr, - in_port_t eport, in_port_t fport) -{ - struct flowside side; - union flow *flow; - unsigned b; - - flowside_from_af(&side, af, eaddr, eport, faddr, fport); - - b = flow_hash(c, IPPROTO_TCP, PIF_TAP, &side) % TCP_HASH_TABLE_SIZE; - while ((flow = flow_at_sidx(tc_hash[b])) && - !flowside_eq(&flow->f.side[TAPSIDE(flow)], &side)) - b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); - - return &flow->tcp; -} - /** * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections) * @conn: Connection to handle @@ -1710,7 +1587,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, tcp_seq_init(c, conn, now); conn->seq_ack_from_tap = conn->seq_to_tap; - tcp_hash_insert(c, conn); + flow_hash_insert(c, TAP_SIDX(conn)); tcp_bind_outbound(c, conn, s); @@ -2047,6 +1924,8 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, const struct tcphdr *th; size_t optlen, len; const char *opts; + union flow *flow; + flow_sidx_t sidx; int ack_due = 0; int count; @@ -2062,17 +1941,22 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL); opts = packet_get(p, idx, sizeof(*th), optlen, NULL); - conn = tcp_hash_lookup(c, af, saddr, daddr, - ntohs(th->source), ntohs(th->dest)); + sidx = flow_lookup_af(c, IPPROTO_TCP, PIF_TAP, af, saddr, daddr, + ntohs(th->source), ntohs(th->dest)); + flow = flow_at_sidx(sidx); /* New connection from tap */ - if (!conn) { + if (!flow) { if (opts && th->syn && !th->ack) tcp_conn_from_tap(c, af, saddr, daddr, th, opts, optlen, now); return 1; } + ASSERT(flow->f.type == FLOW_TCP); + ASSERT(pif_at_sidx(sidx) == PIF_TAP); + conn = &flow->tcp; + flow_trace(conn, "packet length %zu from tap", len); if 
(th->rst) { @@ -2250,7 +2134,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn_event(c, conn, SOCK_ACCEPTED); tcp_seq_init(c, conn, now); - tcp_hash_insert(c, conn); + flow_hash_insert(c, TAP_SIDX(conn)); conn->seq_ack_from_tap = conn->seq_to_tap; @@ -2652,14 +2536,11 @@ static void tcp_sock_refill_init(const struct ctx *c) */ int tcp_init(struct ctx *c) { - unsigned int b, optv = 0; + unsigned int optv = 0; int s; ASSERT(!c->no_tcp); - for (b = 0; b < TCP_HASH_TABLE_SIZE; b++) - tc_hash[b] = FLOW_SIDX_NONE; - if (c->ifi4) tcp_sock4_iov_init(c); diff --git a/tcp_internal.h b/tcp_internal.h index ac6d4b21..8b60aabc 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -42,6 +42,7 @@ #define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) #define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)])) +#define TAP_SIDX(conn_) (FLOW_SIDX((conn_), TAPSIDE(conn_))) #define CONN_V4(conn) (!!inany_v4(&TAPFLOW(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) -- 2.45.2
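For anyone reviewing the cluster rescan in flow_hash_remove(), here is a standalone toy version of the same idea. It is a sketch under stated assumptions, not the passt code: tab[], TAB_SIZE, EMPTY and home() are made-up names for illustration, and mod_sub() / mod_between() are reimplemented here with the semantics I assume the real helpers have (mod_between(x, i, j, m) is true iff x lies in the cyclic range i..j, inclusive of i, exclusive of j). As in the patch, probing walks downwards from the home bucket, and removal moves later entries of the cluster back into the hole whenever their own probe sequence passes through it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TAB_SIZE 8u           /* toy size, hypothetical; not FLOW_HASH_SIZE */
#define EMPTY    UINT32_MAX   /* stands in for FLOW_SIDX_NONE; not a valid key */

static uint32_t tab[TAB_SIZE];

/* (a - b) mod m, for a, b < m (assumed semantics of passt's mod_sub()) */
static unsigned mod_sub(unsigned a, unsigned b, unsigned m)
{
	return (a + m - b) % m;
}

/* true iff x is in the cyclic range [i, j) mod m (assumed semantics) */
static bool mod_between(unsigned x, unsigned i, unsigned j, unsigned m)
{
	return mod_sub(x, i, m) < mod_sub(j, i, m);
}

/* Toy hash: the "home" bucket of a key */
static unsigned home(uint32_t key)
{
	return key % TAB_SIZE;
}

static void tab_init(void)
{
	unsigned b;

	for (b = 0; b < TAB_SIZE; b++)
		tab[b] = EMPTY;
}

/* Bucket holding key if present, otherwise the free bucket it would go in.
 * Terminates only while the table has at least one free bucket.
 */
static unsigned tab_probe(uint32_t key)
{
	unsigned b = home(key);

	while (tab[b] != EMPTY && tab[b] != key)
		b = mod_sub(b, 1, TAB_SIZE);	/* linear probing, downwards */

	return b;
}

static void tab_insert(uint32_t key)
{
	assert(key != EMPTY);
	tab[tab_probe(key)] = key;
}

static bool tab_contains(uint32_t key)
{
	return tab[tab_probe(key)] == key;
}

static void tab_remove(uint32_t key)
{
	unsigned b = tab_probe(key), s;

	if (tab[b] == EMPTY)
		return;	/* redundant remove */

	/* Scan the rest of the cluster, below the hole at b */
	for (s = mod_sub(b, 1, TAB_SIZE); tab[s] != EMPTY;
	     s = mod_sub(s, 1, TAB_SIZE)) {
		unsigned h = home(tab[s]);

		/* If the home bucket h is not in [s, b), the probe sequence
		 * h, h-1, ..., s passes through b, so the entry may move up.
		 */
		if (!mod_between(h, s, b, TAB_SIZE)) {
			tab[b] = tab[s];
			b = s;	/* the hole is now at s */
		}
	}

	tab[b] = EMPTY;
}
```

With keys 3, 11 and 19 all hashing to bucket 3, followed by key 2, removing key 3 shuffles the other three entries up by one bucket each, and all of them stay reachable from their home buckets. No tombstones are needed, which is why the static_assert on the table size matters: the probe loops rely on there always being a free bucket to stop at.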
We generate TCP initial sequence numbers, when we need them, from a hash of the source and destination addresses and ports, plus a timestamp. Moments later, we generate another hash of the same information plus some more to insert the connection into the flow hash table. With some tweaks to the flow_hash_insert() interface and changing the order we can re-use that hash table hash for the initial sequence number, rather than calculating another one. It won't generate identical results, but that doesn't matter as long as the sequence numbers are well scattered. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 30 ++++++++++++++++++++++++------ flow.h | 2 +- tcp.c | 33 +++++++++++---------------------- 3 files changed, 36 insertions(+), 29 deletions(-) diff --git a/flow.c b/flow.c index b291f254..8eea8615 100644 --- a/flow.c +++ b/flow.c @@ -455,16 +455,16 @@ static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx) } /** - * flow_hash_probe() - Find hash bucket for a flow - * @c: Execution context + * flow_hash_probe_() - Find hash bucket for a flow, given hash + * @hash: Raw hash value for flow & side * @sidx: Flow and side to find bucket for * * Return: If @sidx is in the hash table, its current bucket, otherwise a * suitable free bucket for it. */ -static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) +static inline unsigned flow_hash_probe_(uint64_t hash, flow_sidx_t sidx) { - unsigned b = flow_sidx_hash(c, sidx) % FLOW_HASH_SIZE; + unsigned b = hash % FLOW_HASH_SIZE; /* Linear probing */ while (flow_sidx_valid(flow_hashtab[b]) && @@ -474,18 +474,36 @@ static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) return b; } +/** + * flow_hash_probe() - Find hash bucket for a flow + * @c: Execution context + * @sidx: Flow and side to find bucket for + * + * Return: If @sidx is in the hash table, its current bucket, otherwise a + * suitable free bucket for it. 
+ */ +static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) +{ + return flow_hash_probe_(flow_sidx_hash(c, sidx), sidx); +} + /** * flow_hash_insert() - Insert side of a flow into into hash table * @c: Execution context * @sidx: Flow & side index + * + * Return: raw (un-modded) hash value of side of flow */ -void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx) +uint64_t flow_hash_insert(const struct ctx *c, flow_sidx_t sidx) { - unsigned b = flow_hash_probe(c, sidx); + uint64_t hash = flow_sidx_hash(c, sidx); + unsigned b = flow_hash_probe_(hash, sidx); flow_hashtab[b] = sidx; flow_dbg(flow_at_sidx(sidx), "Side %u hash table insert: bucket: %u", sidx.sidei, b); + + return hash; } /** diff --git a/flow.h b/flow.h index fcb4121d..e3a778a7 100644 --- a/flow.h +++ b/flow.h @@ -229,7 +229,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b) return (a.flowi == b.flowi) && (a.sidei == b.sidei); } -void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx); +uint64_t flow_hash_insert(const struct ctx *c, flow_sidx_t sidx); void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx); flow_sidx_t flow_lookup_af(const struct ctx *c, uint8_t proto, uint8_t pif, sa_family_t af, diff --git a/tcp.c b/tcp.c index 09648551..b6eca5d8 100644 --- a/tcp.c +++ b/tcp.c @@ -1294,28 +1294,16 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) } /** - * tcp_seq_init() - Calculate initial sequence number according to RFC 6528 - * @c: Execution context - * @conn: TCP connection, with faddr, fport and eport populated + * tcp_init_seq() - Calculate initial sequence number according to RFC 6528 + * @hash: Hash of connection details * @now: Current timestamp */ -static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, - const struct timespec *now) +static uint32_t tcp_init_seq(uint64_t hash, const struct timespec *now) { - struct siphash_state state = SIPHASH_INIT(c->hash_secret); - const struct flowside 
*tapside = TAPFLOW(conn); - uint64_t hash; - uint32_t ns; - - inany_siphash_feed(&state, &tapside->faddr); - inany_siphash_feed(&state, &tapside->eaddr); - hash = siphash_final(&state, 36, - (uint64_t)tapside->fport << 16 | tapside->eport); - /* 32ns ticks, overflows 32 bits every 137s */ - ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; + uint32_t ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; - conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns; + return ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns; } /** @@ -1488,6 +1476,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, union sockaddr_inany sa; union flow *flow; int s = -1, mss; + uint64_t hash; socklen_t sl; if (!(flow = flow_alloc())) @@ -1584,11 +1573,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; - tcp_seq_init(c, conn, now); + hash = flow_hash_insert(c, TAP_SIDX(conn)); + conn->seq_to_tap = tcp_init_seq(hash, now); conn->seq_ack_from_tap = conn->seq_to_tap; - flow_hash_insert(c, TAP_SIDX(conn)); - tcp_bind_outbound(c, conn, s); if (connect(s, &sa.sa, sl)) { @@ -2110,6 +2098,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */ struct tcp_tap_conn *conn; in_port_t srcport; + uint64_t hash; inany_from_sockaddr(&saddr, &srcport, sa); tcp_snat_inbound(c, &saddr); @@ -2133,8 +2122,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - tcp_seq_init(c, conn, now); - flow_hash_insert(c, TAP_SIDX(conn)); + hash = flow_hash_insert(c, TAP_SIDX(conn)); + conn->seq_to_tap = tcp_init_seq(hash, now); conn->seq_ack_from_tap = conn->seq_to_tap; -- 2.45.2
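To make the arithmetic in tcp_init_seq() easier to check in isolation, here is a minimal standalone sketch of the same calculation. The name isn_from_hash() is hypothetical, and unlike the patch it widens tv_sec to 64 bits explicitly before multiplying, which is my own assumption about what is wanted on platforms with a narrower time_t. The 64-bit hash is folded to 32 bits by XORing its halves, then a counter of 32ns ticks is added, wrapping modulo 2^32.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Fold a 64-bit hash to 32 bits and add a 32ns tick counter.
 * 2^32 ticks * 32ns is roughly 137s, so the counter wraps about that often.
 */
static uint32_t isn_from_hash(uint64_t hash, const struct timespec *now)
{
	uint64_t nsec;
	uint32_t ticks;

	assert(now);

	nsec = (uint64_t)now->tv_sec * 1000000000ull + (uint64_t)now->tv_nsec;
	ticks = (uint32_t)(nsec >> 5);	/* 32ns ticks, truncated to 32 bits */

	/* Unsigned addition wraps modulo 2^32, which is what we want here */
	return ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ticks;
}
```

Since only the scattering of the result matters, any well-mixed 64-bit value works as the hash input, which is what lets the flow table hash stand in for the dedicated one.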
struct icmp_ping_flow contains a field for the ICMP id of the ping, but this is now redundant, since the id is also stored as the "port" in the common flowsides. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- icmp.c | 10 +++++----- icmp_flow.h | 2 -- 2 files changed, 5 insertions(+), 7 deletions(-) diff --git a/icmp.c b/icmp.c index ebfa6272..f7bd9f8a 100644 --- a/icmp.c +++ b/icmp.c @@ -74,6 +74,7 @@ static struct icmp_ping_flow *ping_at_sidx(flow_sidx_t sidx) void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) { struct icmp_ping_flow *pingf = ping_at_sidx(ref.flowside); + const struct flowside *ini = &pingf->f.side[INISIDE]; union sockaddr_inany sr; socklen_t sl = sizeof(sr); char buf[USHRT_MAX]; @@ -99,7 +100,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) goto unexpected; /* Adjust packet back to guest-side ID */ - ih4->un.echo.id = htons(pingf->id); + ih4->un.echo.id = htons(ini->eport); seq = ntohs(ih4->un.echo.sequence); } else if (pingf->f.type == FLOW_PING6) { struct icmp6hdr *ih6 = (struct icmp6hdr *)buf; @@ -109,7 +110,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) goto unexpected; /* Adjust packet back to guest-side ID */ - ih6->icmp6_identifier = htons(pingf->id); + ih6->icmp6_identifier = htons(ini->eport); seq = ntohs(ih6->icmp6_sequence); } else { ASSERT(0); @@ -124,7 +125,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) } flow_dbg(pingf, "echo reply to tap, ID: %"PRIu16", seq: %"PRIu16, - pingf->id, seq); + ini->eport, seq); if (pingf->f.type == FLOW_PING4) tap_icmp4_send(c, sr.sa4.sin_addr, tap_ip4_daddr(c), buf, n); @@ -145,7 +146,7 @@ unexpected: static void icmp_ping_close(const struct ctx *c, const struct icmp_ping_flow *pingf) { - uint16_t id = pingf->id; + uint16_t id = pingf->f.side[INISIDE].eport; epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL); close(pingf->sock); @@ -188,7 +189,6 @@ static struct icmp_ping_flow *icmp_ping_new(const 
struct ctx *c, pingf = FLOW_SET_TYPE(flow, flowtype, ping); pingf->seq = -1; - pingf->id = id; if (af == AF_INET) { bind_addr = &c->ip4.addr_out; diff --git a/icmp_flow.h b/icmp_flow.h index c9847eae..fb93801d 100644 --- a/icmp_flow.h +++ b/icmp_flow.h @@ -13,7 +13,6 @@ * @seq: Last sequence number sent to tap, host order, -1: not sent yet * @sock: "ping" socket * @ts: Last associated activity from tap, seconds - * @id: ICMP id for the flow as seen by the guest */ struct icmp_ping_flow { /* Must be first element */ @@ -22,7 +21,6 @@ struct icmp_ping_flow { int seq; int sock; time_t ts; - uint16_t id; }; bool icmp_ping_timer(const struct ctx *c, const struct icmp_ping_flow *pingf, -- 2.45.2
icmp_sock_handler() obtains the guest address from its most recently observed IP. However, this can now be obtained from the common flowside information. icmp_tap_handler() builds its socket address for sendto() directly from the destination address supplied by the incoming tap packet. This can instead be generated from the flow. Using the flowsides as the common source of truth here prepares us for allowing more flexible NAT and forwarding by properly initialising that flowside information. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- icmp.c | 27 +++++++++++++++++---------- tap.c | 11 ----------- tap.h | 1 - 3 files changed, 17 insertions(+), 22 deletions(-) diff --git a/icmp.c b/icmp.c index f7bd9f8a..82a95e87 100644 --- a/icmp.c +++ b/icmp.c @@ -127,11 +127,18 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) flow_dbg(pingf, "echo reply to tap, ID: %"PRIu16", seq: %"PRIu16, ini->eport, seq); - if (pingf->f.type == FLOW_PING4) - tap_icmp4_send(c, sr.sa4.sin_addr, tap_ip4_daddr(c), buf, n); - else if (pingf->f.type == FLOW_PING6) - tap_icmp6_send(c, &sr.sa6.sin6_addr, - tap_ip6_daddr(c, &sr.sa6.sin6_addr), buf, n); + if (pingf->f.type == FLOW_PING4) { + const struct in_addr *saddr = inany_v4(&ini->faddr); + const struct in_addr *daddr = inany_v4(&ini->eaddr); + + ASSERT(saddr && daddr); /* Must have IPv4 addresses */ + tap_icmp4_send(c, *saddr, *daddr, buf, n); + } else if (pingf->f.type == FLOW_PING6) { + const struct in6_addr *saddr = &ini->faddr.a6; + const struct in6_addr *daddr = &ini->eaddr.a6; + + tap_icmp6_send(c, saddr, daddr, buf, n); + } return; unexpected: @@ -241,11 +248,12 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, const struct timespec *now) { - union sockaddr_inany sa = { .sa_family = af }; - const socklen_t sl = af == AF_INET ?
sizeof(sa.sa4) : sizeof(sa.sa6); struct icmp_ping_flow *pingf, **id_sock; + const struct flowside *tgt; + union sockaddr_inany sa; size_t dlen, l4len; uint16_t id, seq; + socklen_t sl; void *pkt; (void)saddr; @@ -266,7 +274,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, id = ntohs(ih->un.echo.id); id_sock = &icmp_id_map[V4][id]; seq = ntohs(ih->un.echo.sequence); - sa.sa4.sin_addr = *(struct in_addr *)daddr; } else if (af == AF_INET6) { const struct icmp6hdr *ih; @@ -282,8 +289,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, id = ntohs(ih->icmp6_identifier); id_sock = &icmp_id_map[V6][id]; seq = ntohs(ih->icmp6_sequence); - sa.sa6.sin6_addr = *(struct in6_addr *)daddr; - sa.sa6.sin6_scope_id = c->ifi6; } else { ASSERT(0); } @@ -292,8 +297,10 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) return 1; + tgt = &pingf->f.side[TGTSIDE]; pingf->ts = now->tv_sec; + pif_sockaddr(c, &sa, &sl, PIF_HOST, &tgt->eaddr, 0); if (sendto(pingf->sock, pkt, l4len, MSG_NOSIGNAL, &sa.sa, sl) < 0) { flow_dbg(pingf, "failed to relay request to socket: %s", strerror(errno)); diff --git a/tap.c b/tap.c index ec994a2e..32a7b09c 100644 --- a/tap.c +++ b/tap.c @@ -90,17 +90,6 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len) tap_send_frames(c, iov, iovcnt, 1); } -/** - * tap_ip4_daddr() - Normal IPv4 destination address for inbound packets - * @c: Execution context - * - * Return: IPv4 address - */ -struct in_addr tap_ip4_daddr(const struct ctx *c) -{ - return c->ip4.addr_seen; -} - /** * tap_ip6_daddr() - Normal IPv6 destination address for inbound packets * @c: Execution context diff --git a/tap.h b/tap.h index d496bd0e..ec9e2ace 100644 --- a/tap.h +++ b/tap.h @@ -43,7 +43,6 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len) thdr->vnet_len = htonl(l2len); } -struct in_addr tap_ip4_daddr(const struct 
ctx *c); void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport, struct in_addr dst, in_port_t dport, const void *in, size_t dlen); -- 2.45.2
When we receive a ping packet from the tap interface, we currently locate the correct flow entry (if present) using an ancillary data structure, the icmp_id_map[] tables. However, we can look this up using the flow hash table - that's what it's for. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- icmp.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/icmp.c b/icmp.c index 82a95e87..f640f931 100644 --- a/icmp.c +++ b/icmp.c @@ -157,6 +157,7 @@ static void icmp_ping_close(const struct ctx *c, epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL); close(pingf->sock); + flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE)); if (pingf->f.type == FLOW_PING4) icmp_id_map[V4][id] = NULL; @@ -221,6 +222,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, flow_dbg(pingf, "new socket %i for echo ID %"PRIu16, pingf->sock, id); + flow_hash_insert(c, FLOW_SIDX(pingf, INISIDE)); *id_sock = pingf; FLOW_ACTIVATE(pingf); @@ -253,6 +255,8 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, union sockaddr_inany sa; size_t dlen, l4len; uint16_t id, seq; + union flow *flow; + uint8_t proto; socklen_t sl; void *pkt; @@ -271,6 +275,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (ih->type != ICMP_ECHO) return 1; + proto = IPPROTO_ICMP; id = ntohs(ih->un.echo.id); id_sock = &icmp_id_map[V4][id]; seq = ntohs(ih->un.echo.sequence); @@ -286,6 +291,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (ih->icmp6_type != ICMPV6_ECHO_REQUEST) return 1; + proto = IPPROTO_ICMPV6; id = ntohs(ih->icmp6_identifier); id_sock = &icmp_id_map[V6][id]; seq = ntohs(ih->icmp6_sequence); @@ -293,11 +299,17 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, ASSERT(0); } - if (!(pingf = *id_sock)) - if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) - return 1; + flow = flow_at_sidx(flow_lookup_af(c, proto, PIF_TAP, + af, saddr,
daddr, id, id)); + + if (flow) + pingf = &flow->ping; + else if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) + return 1; tgt = &pingf->f.side[TGTSIDE]; + + ASSERT(flow_proto[pingf->f.type] == proto); pingf->ts = now->tv_sec; pif_sockaddr(c, &sa, &sl, PIF_HOST, &tgt->eaddr, 0); -- 2.45.2
With previous reworks the icmp_id_map data structure is now maintained, but never used for anything. Eliminate it. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- icmp.c | 19 ++----------------- 1 file changed, 2 insertions(+), 17 deletions(-) diff --git a/icmp.c b/icmp.c index f640f931..1a6f5d8a 100644 --- a/icmp.c +++ b/icmp.c @@ -45,9 +45,6 @@ #define ICMP_ECHO_TIMEOUT 60 /* s, timeout for ICMP socket activity */ #define ICMP_NUM_IDS (1U << 16) -/* Indexed by ICMP echo identifier */ -static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS]; - /** * ping_at_sidx() - Get ping specific flow at given sidx * @sidx: Flow and side to retrieve @@ -153,22 +150,14 @@ unexpected: static void icmp_ping_close(const struct ctx *c, const struct icmp_ping_flow *pingf) { - uint16_t id = pingf->f.side[INISIDE].eport; - epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL); close(pingf->sock); flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE)); - - if (pingf->f.type == FLOW_PING4) - icmp_id_map[V4][id] = NULL; - else - icmp_id_map[V6][id] = NULL; } /** * icmp_ping_new() - Prepare a new ping socket for a new id * @c: Execution context - * @id_sock: Pointer to ping flow entry slot in icmp_id_map[] to update * @af: Address family, AF_INET or AF_INET6 * @id: ICMP id for the new socket * @saddr: Source address @@ -177,7 +166,6 @@ static void icmp_ping_close(const struct ctx *c, * Return: Newly opened ping flow, or NULL on failure */ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, - struct icmp_ping_flow **id_sock, sa_family_t af, uint16_t id, const void *saddr, const void *daddr) { @@ -223,7 +211,6 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, flow_dbg(pingf, "new socket %i for echo ID %"PRIu16, pingf->sock, id); flow_hash_insert(c, FLOW_SIDX(pingf, INISIDE)); - *id_sock = pingf; FLOW_ACTIVATE(pingf); @@ -250,7 +237,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void 
*daddr, const struct pool *p, const struct timespec *now) { - struct icmp_ping_flow *pingf, **id_sock; + struct icmp_ping_flow *pingf; const struct flowside *tgt; union sockaddr_inany sa; size_t dlen, l4len; @@ -277,7 +264,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, proto = IPPROTO_ICMP; id = ntohs(ih->un.echo.id); - id_sock = &icmp_id_map[V4][id]; seq = ntohs(ih->un.echo.sequence); } else if (af == AF_INET6) { const struct icmp6hdr *ih; @@ -293,7 +279,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, proto = IPPROTO_ICMPV6; id = ntohs(ih->icmp6_identifier); - id_sock = &icmp_id_map[V6][id]; seq = ntohs(ih->icmp6_sequence); } else { ASSERT(0); @@ -304,7 +289,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (flow) pingf = &flow->ping; - else if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) + else if (!(pingf = icmp_ping_new(c, af, id, saddr, daddr))) return 1; tgt = &pingf->f.side[TGTSIDE]; -- 2.45.2
We have upcoming use cases where it's useful to create a new bound socket based on information from the flow table. Add flowside_sock_l4() to do this for either PIF_HOST or PIF_SPLICE sockets. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ flow.h | 3 ++ util.c | 6 ++-- util.h | 3 ++ 4 files changed, 102 insertions(+), 3 deletions(-) diff --git a/flow.c b/flow.c index 8eea8615..6ba8a62e 100644 --- a/flow.c +++ b/flow.c @@ -5,9 +5,11 @@ * Tracking for logical "flows" of packets. */ +#include <errno.h> #include <stdint.h> #include <stdio.h> #include <unistd.h> +#include <sched.h> #include <string.h> #include "util.h" @@ -143,6 +145,97 @@ static void flowside_from_af(struct flowside *side, sa_family_t af, side->eport = eport; } +/** + * struct flowside_sock_args - Parameters for flowside_sock_splice() + * @c: Execution context + * @fd: Filled in with new socket fd + * @err: Filled in with errno if something failed + * @type: Socket epoll type + * @sa: Socket address + * @sl: Length of @sa + * @data: epoll reference data + */ +struct flowside_sock_args { + const struct ctx *c; + int fd; + int err; + enum epoll_type type; + const struct sockaddr *sa; + socklen_t sl; + const char *path; + uint32_t data; +}; + +/** flowside_sock_splice() - Create and bind socket for PIF_SPLICE based on flowside + * @arg: Argument as a struct flowside_sock_args + * + * Return: 0 + */ +static int flowside_sock_splice(void *arg) +{ + struct flowside_sock_args *a = arg; + + ns_enter(a->c); + + a->fd = sock_l4_sa(a->c, a->type, a->sa, a->sl, NULL, + a->sa->sa_family == AF_INET6, a->data); + a->err = errno; + + return 0; +} + +/** flowside_sock_l4() - Create and bind socket based on flowside + * @c: Execution context + * @type: Socket epoll type + * @pif: Interface for this socket + * @tgt: Target flowside + * @data: epoll reference portion for protocol handlers + * + * Return: socket fd of protocol @proto 
bound to the forwarding address and port + * from @tgt (if specified). + */ +/* cppcheck-suppress unusedFunction */ +int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, + const struct flowside *tgt, uint32_t data) +{ + const char *ifname = NULL; + union sockaddr_inany sa; + socklen_t sl; + + ASSERT(pif_is_socket(pif)); + + pif_sockaddr(c, &sa, &sl, pif, &tgt->faddr, tgt->fport); + + switch (pif) { + case PIF_HOST: + if (inany_is_loopback(&tgt->faddr)) + ifname = NULL; + else if (sa.sa_family == AF_INET) + ifname = c->ip4.ifname_out; + else if (sa.sa_family == AF_INET6) + ifname = c->ip6.ifname_out; + + return sock_l4_sa(c, type, &sa, sl, ifname, + sa.sa_family == AF_INET6, data); + + case PIF_SPLICE: { + struct flowside_sock_args args = { + .c = c, .type = type, + .sa = &sa.sa, .sl = sl, .data = data, + }; + NS_CALL(flowside_sock_splice, &args); + errno = args.err; + return args.fd; + } + + default: + /* If we add new socket pifs, they'll need to be implemented + * here + */ + ASSERT(0); + } +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority diff --git a/flow.h b/flow.h index e3a778a7..bf6b8459 100644 --- a/flow.h +++ b/flow.h @@ -164,6 +164,9 @@ static inline bool flowside_eq(const struct flowside *left, left->fport == right->fport; } +int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, + const struct flowside *tgt, uint32_t data); + /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry diff --git a/util.c b/util.c index 1569f1c0..6b51fc51 100644 --- a/util.c +++ b/util.c @@ -45,9 +45,9 @@ * * Return: newly created socket, negative error code on failure */ -static int sock_l4_sa(const struct ctx *c, enum epoll_type type, - const void *sa, socklen_t sl, - const char *ifname, bool v6only, uint32_t data) +int sock_l4_sa(const struct ctx *c, enum epoll_type type, + const void *sa, socklen_t sl, + const char *ifname, bool 
v6only, uint32_t data) { sa_family_t af = ((const struct sockaddr *)sa)->sa_family; union epoll_ref ref = { .type = type, .data = data }; diff --git a/util.h b/util.h index 1d479ddf..826614cf 100644 --- a/util.h +++ b/util.h @@ -144,6 +144,9 @@ struct ctx; /* cppcheck-suppress funcArgNamesDifferent */ __attribute__ ((weak)) int ffsl(long int i) { return __builtin_ffsl(i); } +int sock_l4_sa(const struct ctx *c, enum epoll_type type, + const void *sa, socklen_t sl, + const char *ifname, bool v6only, uint32_t data); int sock_l4(const struct ctx *c, sa_family_t af, enum epoll_type type, const void *bind_addr, const char *ifname, uint16_t port, uint32_t data); -- 2.45.2
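The PIF_SPLICE path above passes its result back through the argument structure: the callback stores the new fd and the errno it saw, and the caller republishes that errno before returning. A minimal sketch of that pattern follows. Everything in it is illustrative rather than taken from passt: struct sock_args, open_sock_cb() and open_sock() are hypothetical names, and run_in_ns() is a stand-in for NS_CALL() that simply calls the function directly instead of entering a namespace.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

/* Hypothetical stand-in for struct flowside_sock_args: input plus result slots */
struct sock_args {
	int family;	/* Input: address family passed to socket() */
	int fd;		/* Output: new socket, or -1 on failure */
	int err;	/* Output: errno as seen right after socket() */
};

/* Callback with the int (*)(void *) shape used with NS_CALL(); always returns 0 */
static int open_sock_cb(void *arg)
{
	struct sock_args *a = arg;

	assert(a);

	errno = 0;
	a->fd = socket(a->family, SOCK_DGRAM, 0);
	a->err = errno;	/* Capture before anything else can overwrite it */

	return 0;
}

/* Hypothetical stand-in for NS_CALL(): this sketch just calls @fn directly */
static void run_in_ns(int (*fn)(void *), void *arg)
{
	fn(arg);
}

/* Caller side: run the callback, then republish its errno to our own caller */
static int open_sock(int family)
{
	struct sock_args args = { .family = family, .fd = -1 };

	run_in_ns(open_sock_cb, &args);
	errno = args.err;

	return args.fd;
}
```

The point of the design is that the caller of open_sock() sees the same fd/errno contract as a plain socket() call, regardless of where the callback actually ran.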
For now when we forward a ping to the host we leave the host side forwarding address and port blank since we don't necessarily know what source address and id will be used by the kernel. When the outbound address option is active, though, we do know the address at least, so we can record it in the flowside. Having done that, use it as the primary source of truth, binding the outgoing socket based on the information in there. This allows the possibility of more complex rules for what outbound address and/or id we use in future. To implement this we create a new helper which sets up a new socket based on information in a flowside, which will also have future uses. It behaves slightly differently from the existing ICMP code, in that it doesn't bind to a specific interface if given a loopback address. This is logically correct - the loopback address means we need to operate through the host's loopback interface, not ifname_out. We didn't need it in ICMP because ICMP will never generate a loopback address at this point, however we intend to change that in future. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 1 - icmp.c | 23 ++++++++++------------- 2 files changed, 10 insertions(+), 14 deletions(-) diff --git a/flow.c b/flow.c index 6ba8a62e..c4f12364 100644 --- a/flow.c +++ b/flow.c @@ -194,7 +194,6 @@ static int flowside_sock_splice(void *arg) * Return: socket fd of protocol @proto bound to the forwarding address and port * from @tgt (if specified). 
*/ -/* cppcheck-suppress unusedFunction */ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, const struct flowside *tgt, uint32_t data) { diff --git a/icmp.c b/icmp.c index 1a6f5d8a..22177475 100644 --- a/icmp.c +++ b/icmp.c @@ -173,30 +173,27 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, union epoll_ref ref = { .type = EPOLL_TYPE_PING }; union flow *flow = flow_alloc(); struct icmp_ping_flow *pingf; + const struct flowside *tgt; const void *bind_addr; - const char *bind_if; if (!flow) return NULL; flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); - /* FIXME: Record outbound source address when known */ - flow_target_af(flow, PIF_HOST, af, NULL, 0, daddr, 0); - pingf = FLOW_SET_TYPE(flow, flowtype, ping); - - pingf->seq = -1; - if (af == AF_INET) { + if (af == AF_INET) bind_addr = &c->ip4.addr_out; - bind_if = c->ip4.ifname_out; - } else { + else if (af == AF_INET6) bind_addr = &c->ip6.addr_out; - bind_if = c->ip6.ifname_out; - } + + tgt = flow_target_af(flow, PIF_HOST, af, bind_addr, 0, daddr, 0); + pingf = FLOW_SET_TYPE(flow, flowtype, ping); + + pingf->seq = -1; ref.flowside = FLOW_SIDX(flow, TGTSIDE); - pingf->sock = sock_l4(c, af, EPOLL_TYPE_PING, bind_addr, bind_if, - 0, ref.data); + pingf->sock = flowside_sock_l4(c, EPOLL_TYPE_PING, PIF_HOST, + tgt, ref.data); if (pingf->sock < 0) { warn("Cannot open \"ping\" socket. You might need to:"); -- 2.45.2
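The interface selection described in the commit message (no device binding for loopback addresses, otherwise the per-family outbound interface) can be sketched in isolation. This is a simplified model, not the passt code: pick_bind_ifname() and sa_is_loopback() are hypothetical helpers, the two interface names are passed in as plain strings rather than read from the execution context, and IPv4-mapped IPv6 addresses are not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* True if @sa holds an IPv4 (127.0.0.0/8) or IPv6 (::1) loopback address */
static bool sa_is_loopback(const struct sockaddr *sa)
{
	if (sa->sa_family == AF_INET) {
		const struct sockaddr_in *sa4 = (const struct sockaddr_in *)sa;

		return (ntohl(sa4->sin_addr.s_addr) >> 24) == 127;
	}

	if (sa->sa_family == AF_INET6) {
		const struct sockaddr_in6 *sa6 = (const struct sockaddr_in6 *)sa;

		return IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr);
	}

	return false;
}

/* Pick the interface to bind to for local address @sa, NULL for no binding.
 * @if4 and @if6 stand in for the IPv4 and IPv6 outbound interface names.
 */
static const char *pick_bind_ifname(const struct sockaddr *sa,
				    const char *if4, const char *if6)
{
	assert(sa);

	if (sa_is_loopback(sa))
		return NULL;	/* Must go via the host's loopback interface */

	if (sa->sa_family == AF_INET)
		return if4;

	if (sa->sa_family == AF_INET6)
		return if6;

	return NULL;
}
```

Returning NULL for loopback mirrors the reasoning in the commit message: traffic from a loopback source has to leave through the host's loopback interface, so pinning the socket to the outbound interface would be wrong.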
Currently the code to translate host side addresses and ports to guest side addresses and ports, and vice versa, is scattered across the TCP code. This includes both port redirection as controlled by the -t and -T options, and our special case NAT controlled by the --no-map-gw option. Gather this logic into fwd_nat_from_*() functions for each input interface in fwd.c which take protocol and address information for the initiating side and generate the pif and address information for the target side. This performs any NAT or port forwarding needed. We create a flow_target() helper which applies those forwarding functions as needed to automatically move a flow from INI to TGT state. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 53 ++++++++++++++++++ flow_table.h | 2 + fwd.c | 148 +++++++++++++++++++++++++++++++++++++++++++++++++++ fwd.h | 9 ++++ tcp.c | 103 ++++++++++------------------------- tcp_splice.c | 64 ++-------------------- tcp_splice.h | 5 +- 7 files changed, 245 insertions(+), 139 deletions(-) diff --git a/flow.c b/flow.c index c4f12364..c1af1369 100644 --- a/flow.c +++ b/flow.c @@ -400,6 +400,59 @@ const struct flowside *flow_target_af(union flow *flow, uint8_t pif, return tgt; } + +/** + * flow_target() - Determine where flow should forward to, and move to TGT + * @c: Execution context + * @flow: Flow to forward + * @proto: Protocol + * + * Return: pointer to the target flowside information + */ +const struct flowside *flow_target(const struct ctx *c, union flow *flow, + uint8_t proto) +{ + char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + struct flow_common *f = &flow->f; + const struct flowside *ini = &f->side[INISIDE]; + struct flowside *tgt = &f->side[TGTSIDE]; + uint8_t tgtpif = PIF_NONE; + + ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); + ASSERT(f->type == FLOW_TYPE_NONE); + ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); + ASSERT(flow->f.state == FLOW_STATE_INI); + + switch 
(f->pif[INISIDE]) { + case PIF_TAP: + tgtpif = fwd_nat_from_tap(c, proto, ini, tgt); + break; + + case PIF_SPLICE: + tgtpif = fwd_nat_from_splice(c, proto, ini, tgt); + break; + + case PIF_HOST: + tgtpif = fwd_nat_from_host(c, proto, ini, tgt); + break; + + default: + flow_err(flow, "No rules to forward %s [%s]:%hu -> [%s]:%hu", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport); + } + + if (tgtpif == PIF_NONE) + return NULL; + + f->pif[TGTSIDE] = tgtpif; + flow_set_state(f, FLOW_STATE_TGT); + return tgt; +} + /** * flow_set_type() - Set type and move to TYPED * @flow: Flow to change state diff --git a/flow_table.h b/flow_table.h index aabdbb75..9d912c83 100644 --- a/flow_table.h +++ b/flow_table.h @@ -138,6 +138,8 @@ const struct flowside *flow_target_af(union flow *flow, uint8_t pif, sa_family_t af, const void *saddr, in_port_t sport, const void *daddr, in_port_t dport); +const struct flowside *flow_target(const struct ctx *c, union flow *flow, + uint8_t proto); union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/fwd.c b/fwd.c index d3f17988..3288b0da 100644 --- a/fwd.c +++ b/fwd.c @@ -25,6 +25,7 @@ #include "fwd.h" #include "passt.h" #include "lineread.h" +#include "flow_table.h" /* See enum in kernel's include/net/tcp_states.h */ #define UDP_LISTEN 0x07 @@ -154,3 +155,150 @@ void fwd_scan_ports_init(struct ctx *c) &c->tcp.fwd_out, &c->tcp.fwd_in); } } + +/** + * fwd_nat_from_tap() - Determine to forward a flow from the tap interface + * @c: Execution context + * @proto: Protocol (IP L4 protocol number) + * @ini: Flow address information of the initiating side + * @tgt: Flow address information on the target side (updated) + * + * Return: pif of the target interface to forward the flow to, PIF_NONE if the + * flow cannot or should not be forwarded at all. 
+ */ +uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt) +{ + (void)proto; + + tgt->eaddr = ini->faddr; + tgt->eport = ini->fport; + + if (!c->no_map_gw) { + if (inany_equals4(&tgt->eaddr, &c->ip4.gw)) + tgt->eaddr = inany_loopback4; + else if (inany_equals6(&tgt->eaddr, &c->ip6.gw)) + tgt->eaddr = inany_loopback6; + } + + /* The relevant addr_out controls the host side source address. This + * may be unspecified, which allows the kernel to pick an address. + */ + if (inany_v4(&tgt->eaddr)) + tgt->faddr = inany_from_v4(c->ip4.addr_out); + else + tgt->faddr.a6 = c->ip6.addr_out; + + /* Let the kernel pick a host side source port */ + tgt->fport = 0; + + return PIF_HOST; +} + +/** + * fwd_nat_from_splice() - Determine to forward a flow from the splice interface + * @c: Execution context + * @proto: Protocol (IP L4 protocol number) + * @ini: Flow address information of the initiating side + * @tgt: Flow address information on the target side (updated) + * + * Return: pif of the target interface to forward the flow to, PIF_NONE if the + * flow cannot or should not be forwarded at all. 
+ */ +uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt) +{ + if (!inany_is_loopback(&ini->eaddr) || + (!inany_is_loopback(&ini->faddr) && !inany_is_unspecified(&ini->faddr))) { + char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + + debug("Non loopback address on %s: [%s]:%hu -> [%s]:%hu", + pif_name(PIF_SPLICE), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), ini->fport); + return PIF_NONE; + } + + if (inany_v4(&ini->eaddr)) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_loopback6; + + /* Preserve the specific loopback adddress used, but let the kernel pick + * a source port on the target side + */ + tgt->faddr = ini->eaddr; + tgt->fport = 0; + + tgt->eport = ini->fport; + if (proto == IPPROTO_TCP) + tgt->eport += c->tcp.fwd_out.delta[tgt->eport]; + + /* Let the kernel pick a host side source port */ + tgt->fport = 0; + + return PIF_HOST; +} + +/** + * fwd_nat_from_host() - Determine to forward a flow from the host interface + * @c: Execution context + * @proto: Protocol (IP L4 protocol number) + * @ini: Flow address information of the initiating side + * @tgt: Flow address information on the target side (updated) + * + * Return: pif of the target interface to forward the flow to, PIF_NONE if the + * flow cannot or should not be forwarded at all. 
+ */ +uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt) +{ + /* Common for spliced and non-spliced cases */ + tgt->eport = ini->fport; + if (proto == IPPROTO_TCP) + tgt->eport += c->tcp.fwd_in.delta[tgt->eport]; + + if (c->mode == MODE_PASTA && inany_is_loopback(&ini->eaddr) && + proto == IPPROTO_TCP) { + /* spliceable */ + + /* Preserve the specific loopback adddress used, but let the + * kernel pick a source port on the target side + */ + tgt->faddr = ini->eaddr; + tgt->fport = 0; + + if (inany_v4(&ini->eaddr)) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_loopback6; + return PIF_SPLICE; + } + + tgt->faddr = ini->eaddr; + tgt->fport = ini->eport; + + if (inany_is_loopback4(&tgt->faddr) || + inany_is_unspecified4(&tgt->faddr) || + inany_equals4(&tgt->faddr, &c->ip4.addr_seen)) { + tgt->faddr = inany_from_v4(c->ip4.gw); + } else if (inany_is_loopback6(&tgt->faddr) || + inany_equals6(&tgt->faddr, &c->ip6.addr_seen) || + inany_equals6(&tgt->faddr, &c->ip6.addr)) { + if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw)) + tgt->faddr.a6 = c->ip6.gw; + else + tgt->faddr.a6 = c->ip6.addr_ll; + } + + if (inany_v4(&tgt->faddr)) { + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + } else { + if (inany_is_linklocal6(&tgt->faddr)) + tgt->eaddr.a6 = c->ip6.addr_ll_seen; + else + tgt->eaddr.a6 = c->ip6.addr_seen; + } + + return PIF_TAP; +} diff --git a/fwd.h b/fwd.h index 41645d7f..b4aa8d57 100644 --- a/fwd.h +++ b/fwd.h @@ -7,6 +7,8 @@ #ifndef FWD_H #define FWD_H +struct flowside; + /* Number of ports for both TCP and UDP */ #define NUM_PORTS (1U << 16) @@ -42,4 +44,11 @@ void fwd_scan_ports_udp(struct fwd_ports *fwd, const struct fwd_ports *rev, const struct fwd_ports *tcp_rev); void fwd_scan_ports_init(struct ctx *c); +uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt); +uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, + const struct 
flowside *ini, struct flowside *tgt); +uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt); + #endif /* FWD_H */ diff --git a/tcp.c b/tcp.c index b6eca5d8..0c66ac84 100644 --- a/tcp.c +++ b/tcp.c @@ -1470,7 +1470,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); - union inany_addr srcaddr, dstaddr; /* FIXME: Avoid bulky temporaries */ const struct flowside *ini, *tgt; struct tcp_tap_conn *conn; union sockaddr_inany sa; @@ -1485,34 +1484,16 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, ini = flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); - dstaddr = ini->faddr; - if (!c->no_map_gw) { - if (inany_equals4(&dstaddr, &c->ip4.gw)) - dstaddr = inany_loopback4; - else if (inany_equals6(&dstaddr, &c->ip6.gw)) - dstaddr = inany_loopback6; - - } + if (!(tgt = flow_target(c, flow, IPPROTO_TCP))) + goto cancel; - if (inany_is_linklocal6(&dstaddr)) { - srcaddr.a6 = c->ip6.addr_ll; - } else if (inany_is_loopback(&dstaddr)) { - srcaddr = dstaddr; - } else if (inany_v4(&dstaddr)) { - if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.addr_out)) - srcaddr = inany_from_v4(c->ip4.addr_out); - else - srcaddr = inany_any4; - } else { - if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr_out)) - srcaddr.a6 = c->ip6.addr_out; - else - srcaddr = inany_any6; + if (flow->f.pif[TGTSIDE] != PIF_HOST) { + flow_err(flow, "No support for forwarding TCP from %s to %s", + pif_name(flow->f.pif[INISIDE]), + pif_name(flow->f.pif[TGTSIDE])); + goto cancel; } - tgt = flow_target_af(flow, PIF_HOST, AF_INET6, - &srcaddr, 0, /* Kernel decides source port */ - &dstaddr, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0 || @@ -2060,63 +2041,20 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn) conn_flag(c, conn, ACK_FROM_TAP_DUE); } -/** - * 
tcp_snat_inbound() - Translate source address for inbound data if needed - * @c: Execution context - * @addr: Source address of inbound packet/connection - */ -static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) -{ - if (inany_is_loopback4(addr) || - inany_is_unspecified4(addr) || - inany_equals4(addr, &c->ip4.addr_seen)) { - *addr = inany_from_v4(c->ip4.gw); - } else if (inany_is_loopback6(addr) || - inany_equals6(addr, &c->ip6.addr_seen) || - inany_equals6(addr, &c->ip6.addr)) { - if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw)) - addr->a6 = c->ip6.gw; - else - addr->a6 = c->ip6.addr_ll; - } -} - /** * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context - * @dstport: Destination port for connection (host side) * @flow: flow to initialise * @s: Accepted socket * @sa: Peer socket address (from accept()) * @now: Current timestamp */ -static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, - union flow *flow, int s, - const union sockaddr_inany *sa, +static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s, const struct timespec *now) { - union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */ - struct tcp_tap_conn *conn; - in_port_t srcport; + struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); uint64_t hash; - inany_from_sockaddr(&saddr, &srcport, sa); - tcp_snat_inbound(c, &saddr); - - if (inany_v4(&saddr)) { - daddr = inany_from_v4(c->ip4.addr_seen); - } else { - if (inany_is_linklocal6(&saddr)) - daddr.a6 = c->ip6.addr_ll_seen; - else - daddr.a6 = c->ip6.addr_seen; - } - dstport += c->tcp.fwd_in.delta[dstport]; - - flow_target_af(flow, PIF_TAP, AF_INET6, - &saddr, srcport, &daddr, dstport); - conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); - conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; @@ -2174,11 +2112,26 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, goto cancel; } - if (tcp_splice_conn_from_sock(c, 
ref.tcp_listen.pif, - ref.tcp_listen.port, flow, s, &sa)) - return; + if (!flow_target(c, flow, IPPROTO_TCP)) + goto cancel; + + switch (flow->f.pif[TGTSIDE]) { + case PIF_SPLICE: + case PIF_HOST: + tcp_splice_conn_from_sock(c, flow, s); + break; + + case PIF_TAP: + tcp_tap_conn_from_sock(c, flow, s, now); + break; + + default: + flow_err(flow, "No support for forwarding TCP from %s to %s", + pif_name(flow->f.pif[INISIDE]), + pif_name(flow->f.pif[TGTSIDE])); + goto cancel; + } - tcp_tap_conn_from_sock(c, ref.tcp_listen.port, flow, s, &sa, now); return; cancel: diff --git a/tcp_splice.c b/tcp_splice.c index c81daeeb..473562b5 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -414,72 +414,18 @@ static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af) /** * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection * @c: Execution context - * @pif0: pif id of side 0 - * @dstport: Side 0 destination port of connection * @flow: flow to initialise * @s0: Accepted (side 0) socket * @sa: Peer address of connection * - * Return: true if able to create a spliced connection, false otherwise * #syscalls:pasta setsockopt */ -bool tcp_splice_conn_from_sock(const struct ctx *c, - uint8_t pif0, in_port_t dstport, - union flow *flow, int s0, - const union sockaddr_inany *sa) +void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0) { - struct tcp_splice_conn *conn; - union inany_addr src; - in_port_t srcport; - sa_family_t af; - uint8_t tgtpif; + struct tcp_splice_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, + tcp_splice); - if (c->mode != MODE_PASTA) - return false; - - inany_from_sockaddr(&src, &srcport, sa); - af = inany_v4(&src) ? AF_INET : AF_INET6; - - switch (pif0) { - case PIF_SPLICE: - if (!inany_is_loopback(&src)) { - char str[INANY_ADDRSTRLEN]; - - /* We can't use flow_err() etc. 
because we haven't set - * the flow type yet - */ - warn("Bad source address %s for splice, closing", - inany_ntop(&src, str, sizeof(str))); - - /* We *don't* want to fall back to tap */ - flow_alloc_cancel(flow); - return true; - } - - tgtpif = PIF_HOST; - dstport += c->tcp.fwd_out.delta[dstport]; - break; - - case PIF_HOST: - if (!inany_is_loopback(&src)) - return false; - - tgtpif = PIF_SPLICE; - dstport += c->tcp.fwd_in.delta[dstport]; - break; - - default: - return false; - } - - /* FIXME: Record outbound source address when known */ - if (af == AF_INET) - flow_target_af(flow, tgtpif, AF_INET, - NULL, 0, &in4addr_loopback, dstport); - else - flow_target_af(flow, tgtpif, AF_INET6, - NULL, 0, &in6addr_loopback, dstport); - conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice); + ASSERT(c->mode == MODE_PASTA); conn->s[0] = s0; conn->s[1] = -1; @@ -493,8 +439,6 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, conn_flag(c, conn, CLOSING); FLOW_ACTIVATE(conn); - - return true; } /** diff --git a/tcp_splice.h b/tcp_splice.h index ed8f0c58..a20f3e21 100644 --- a/tcp_splice.h +++ b/tcp_splice.h @@ -11,10 +11,7 @@ union sockaddr_inany; void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events); -bool tcp_splice_conn_from_sock(const struct ctx *c, - uint8_t pif0, in_port_t dstport, - union flow *flow, int s0, - const union sockaddr_inany *sa); +void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0); void tcp_splice_init(struct ctx *c); #endif /* TCP_SPLICE_H */ -- 2.45.2
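To make the address handling in fwd_nat_from_tap() easier to follow, here is a reduced IPv4-only model of it. It is a sketch under stated assumptions, not the patch's code: struct side4, struct cfg4 and nat_from_tap4() are hypothetical, addresses are plain host-order uint32_t values instead of union inany_addr, and only the gateway mapping and outbound-address steps are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOOPBACK4	0x7f000001u	/* 127.0.0.1, host byte order */

/* Hypothetical IPv4-only analogue of struct flowside */
struct side4 {
	uint32_t eaddr;		/* Endpoint address */
	uint16_t eport;		/* Endpoint port */
	uint32_t faddr;		/* Forwarding address */
	uint16_t fport;		/* Forwarding port */
};

/* Hypothetical subset of the configuration consulted by the NAT logic */
struct cfg4 {
	uint32_t gw;		/* Guest-visible gateway address */
	uint32_t addr_out;	/* Outbound source address, 0 if unspecified */
	bool no_map_gw;		/* Disable gateway -> host loopback mapping */
};

/* Fill in target side @tgt for a flow initiated from the guest with @ini */
static void nat_from_tap4(const struct cfg4 *c, const struct side4 *ini,
			  struct side4 *tgt)
{
	assert(c && ini && tgt);

	/* The guest's destination becomes the host side's remote endpoint */
	tgt->eaddr = ini->faddr;
	tgt->eport = ini->fport;

	/* Unless disabled, the gateway address really means "the host" */
	if (!c->no_map_gw && tgt->eaddr == c->gw)
		tgt->eaddr = LOOPBACK4;

	/* Source address from configuration (0 lets the kernel choose),
	 * source port left for the kernel to pick
	 */
	tgt->faddr = c->addr_out;
	tgt->fport = 0;
}
```

With gw set to 10.0.2.2, a guest connection to 10.0.2.2:80 ends up with a target endpoint of 127.0.0.1:80, while the same connection with no_map_gw set keeps 10.0.2.2:80.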
Currently ICMP hard codes its forwarding rules, and never applies any translations. Change it to use the flow_target() function, so that it's translated the same as TCP (excluding TCP specific port redirection). This means that gw mapping now applies to ICMP so "ping <gw address>" will now ping the host's loopback instead of the actual gw machine. This removes the surprising behaviour that the target you ping might not be the same one you connect to with TCP. This removes the last user of flow_target_af(), so that's removed as well. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 32 -------------------------------- icmp.c | 16 ++++++++++------ 2 files changed, 10 insertions(+), 38 deletions(-) diff --git a/flow.c b/flow.c index c1af1369..27340df9 100644 --- a/flow.c +++ b/flow.c @@ -369,38 +369,6 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, return ini; } -/** - * flow_target_af() - Move flow to TGT, setting TGTSIDE details - * @flow: Flow to change state - * @pif: pif of the target side - * @af: Address family for @eaddr and @faddr - * @saddr: Source address (pointer to in_addr or in6_addr) - * @sport: Endpoint port - * @daddr: Destination address (pointer to in_addr or in6_addr) - * @dport: Destination port - * - * Return: pointer to the target flowside information - */ -const struct flowside *flow_target_af(union flow *flow, uint8_t pif, - sa_family_t af, - const void *saddr, in_port_t sport, - const void *daddr, in_port_t dport) -{ - struct flow_common *f = &flow->f; - struct flowside *tgt = &f->side[TGTSIDE]; - - ASSERT(pif != PIF_NONE); - ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); - ASSERT(f->type == FLOW_TYPE_NONE); - ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); - - flowside_from_af(tgt, af, daddr, dport, saddr, sport); - f->pif[TGTSIDE] = pif; - flow_set_state(f, FLOW_STATE_TGT); - return tgt; -} - - /** * flow_target() - Determine where flow should forward to, and 
move to TGT * @c: Execution context diff --git a/icmp.c b/icmp.c index 22177475..cb81c768 100644 --- a/icmp.c +++ b/icmp.c @@ -169,24 +169,28 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, sa_family_t af, uint16_t id, const void *saddr, const void *daddr) { + uint8_t proto = af == AF_INET ? IPPROTO_ICMP : IPPROTO_ICMPV6; uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6; union epoll_ref ref = { .type = EPOLL_TYPE_PING }; union flow *flow = flow_alloc(); struct icmp_ping_flow *pingf; const struct flowside *tgt; - const void *bind_addr; if (!flow) return NULL; flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); + if (!(tgt = flow_target(c, flow, proto))) + goto cancel; - if (af == AF_INET) - bind_addr = &c->ip4.addr_out; - else if (af == AF_INET6) - bind_addr = &c->ip6.addr_out; + if (flow->f.pif[TGTSIDE] != PIF_HOST) { + flow_err(flow, "No support for forwarding %s from %s to %s", + proto == IPPROTO_ICMP ? "ICMP" : "ICMPv6", + pif_name(flow->f.pif[INISIDE]), + pif_name(flow->f.pif[TGTSIDE])); + goto cancel; + } - tgt = flow_target_af(flow, PIF_HOST, af, bind_addr, 0, daddr, 0); pingf = FLOW_SET_TYPE(flow, flowtype, ping); pingf->seq = -1; -- 2.45.2
Add logic to the fwd_nat_from_*() functions to forward UDP packets. The logic here doesn't exactly match our current forwarding, since our current forwarding has some very strange and buggy edge cases. Instead it's attempting to replicate what appears to be the intended logic behind the current forwarding. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- fwd.c | 27 +++++++++++++++++++++++---- 1 file changed, 23 insertions(+), 4 deletions(-) diff --git a/fwd.c b/fwd.c index 3288b0da..a70ebfd8 100644 --- a/fwd.c +++ b/fwd.c @@ -169,12 +169,16 @@ void fwd_scan_ports_init(struct ctx *c) uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, const struct flowside *ini, struct flowside *tgt) { - (void)proto; - tgt->eaddr = ini->faddr; tgt->eport = ini->fport; - if (!c->no_map_gw) { + if (proto == IPPROTO_UDP && tgt->eport == 53 && + inany_equals4(&tgt->eaddr, &c->ip4.dns_match)) { + tgt->eaddr = inany_from_v4(c->ip4.dns_host); + } else if (proto == IPPROTO_UDP && tgt->eport == 53 && + inany_equals6(&tgt->eaddr, &c->ip6.dns_match)) { + tgt->eaddr.a6 = c->ip6.dns_host; + } else if (!c->no_map_gw) { if (inany_equals4(&tgt->eaddr, &c->ip4.gw)) tgt->eaddr = inany_loopback4; else if (inany_equals6(&tgt->eaddr, &c->ip6.gw)) @@ -191,6 +195,10 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, /* Let the kernel pick a host side source port */ tgt->fport = 0; + if (proto == IPPROTO_UDP) { + /* But for UDP we preserve the source port */ + tgt->fport = ini->eport; + } return PIF_HOST; } @@ -233,9 +241,14 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, tgt->eport = ini->fport; if (proto == IPPROTO_TCP) tgt->eport += c->tcp.fwd_out.delta[tgt->eport]; + else if (proto == IPPROTO_UDP) + tgt->eport += c->udp.fwd_out.f.delta[tgt->eport]; /* Let the kernel pick a host side source port */ tgt->fport = 0; + if (proto == IPPROTO_UDP) + /* But for UDP preserve the source port */ + tgt->fport = ini->eport; return PIF_HOST; } @@ -257,9 
+270,11 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, tgt->eport = ini->fport; if (proto == IPPROTO_TCP) tgt->eport += c->tcp.fwd_in.delta[tgt->eport]; + else if (proto == IPPROTO_UDP) + tgt->eport += c->udp.fwd_in.f.delta[tgt->eport]; if (c->mode == MODE_PASTA && inany_is_loopback(&ini->eaddr) && - proto == IPPROTO_TCP) { + (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */ /* Preserve the specific loopback adddress used, but let the @@ -267,11 +282,15 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, */ tgt->faddr = ini->eaddr; tgt->fport = 0; + if (proto == IPPROTO_UDP) + /* But for UDP preserve the source port */ + tgt->fport = ini->eport; if (inany_v4(&ini->eaddr)) tgt->eaddr = inany_loopback4; else tgt->eaddr = inany_loopback6; + return PIF_SPLICE; } -- 2.45.2
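The port handling this patch adds has two parts: the destination port is shifted through a per-protocol delta table, and the source port is preserved for UDP while TCP leaves it to the kernel. The sketch below models both, assuming a delta table of 16-bit entries indexed by port so that the addition wraps modulo 65536. struct port_fwd, fwd_dst_port() and fwd_src_port() are hypothetical names for illustration and are not part of the series.

```c
#include <assert.h>
#include <stdint.h>
#include <netinet/in.h>

#define NUM_PORTS	(1U << 16)

/* Hypothetical per-protocol, per-direction port remapping table */
struct port_fwd {
	uint16_t delta[NUM_PORTS];	/* Offset added to each original port */
};

/* Destination port after forwarding: original port plus its delta, with
 * wrap-around in 16 bits so that "negative" offsets work too
 */
static uint16_t fwd_dst_port(const struct port_fwd *fwd, uint16_t port)
{
	assert(fwd);

	return (uint16_t)(port + fwd->delta[port]);
}

/* Source port to request on the target side: UDP keeps the initiator's
 * port, anything else returns 0, meaning the kernel picks one
 */
static uint16_t fwd_src_port(uint8_t proto, uint16_t ini_eport)
{
	if (proto == IPPROTO_UDP)
		return ini_eport;

	return 0;
}
```

For example, mapping port 8080 to port 80 stores (uint16_t)(80 - 8080) in delta[8080]; ports with a zero delta pass through unchanged.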
This implements the first steps of tracking UDP packets with the flow table rather than its own (buggy) set of port maps. Specifically we create flow table entries for datagrams received from a socket (PIF_HOST or PIF_SPLICE). When splitting datagrams from sockets into batches, we group by the flow as well as splicesrc. This may result in smaller batches, but makes things easier down the line. We can re-optimise this later if necessary. For now we don't do anything else with the flow, not even match reply packets to the same flow. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- Makefile | 2 +- flow.c | 32 ++++++++++ flow.h | 4 ++ flow_table.h | 15 +++++ udp.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++++-- udp_flow.h | 25 ++++++++ 6 files changed, 242 insertions(+), 5 deletions(-) create mode 100644 udp_flow.h diff --git a/Makefile b/Makefile index 09fc461d..92cbd5a6 100644 --- a/Makefile +++ b/Makefile @@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ - udp.h util.h + udp.h udp_flow.h util.h HEADERS = $(PASST_HEADERS) seccomp.h C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 }; diff --git a/flow.c b/flow.c index 27340df9..4e337d42 100644 --- a/flow.c +++ b/flow.c @@ -37,6 +37,7 @@ const char *flow_type_str[] = { [FLOW_TCP_SPLICE] = "TCP connection (spliced)", [FLOW_PING4] = "ICMP ping sequence", [FLOW_PING6] = "ICMPv6 ping sequence", + [FLOW_UDP] = "UDP flow", }; static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES, "flow_type_str[] doesn't match enum flow_type"); @@ -46,6 +47,7 @@ const uint8_t flow_proto[] = { [FLOW_TCP_SPLICE] = IPPROTO_TCP, [FLOW_PING4] = IPPROTO_ICMP, [FLOW_PING6] = IPPROTO_ICMPV6, + [FLOW_UDP] = IPPROTO_UDP, }; 
static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES, "flow_proto[] doesn't match enum flow_type"); @@ -701,6 +703,32 @@ flow_sidx_t flow_lookup_af(const struct ctx *c, return flowside_lookup(c, proto, pif, &side); } +/** + * flow_lookup_sa() - Look up a flow given an endpoint socket address + * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @esa: Socket address of the endpoint + * @fport: Forwarding port number + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport) +{ + struct flowside side = { + .fport = fport, + }; + + inany_from_sockaddr(&side.eaddr, &side.eport, esa); + if (inany_v4(&side.eaddr)) + side.faddr = inany_any4; + else + side.faddr = inany_any6; + + return flowside_lookup(c, proto, pif, &side); +} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context @@ -780,6 +808,10 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now) if (timer) closed = icmp_ping_timer(c, &flow->ping, now); break; + case FLOW_UDP: + if (timer) + closed = udp_flow_timer(c, &flow->udp, now); + break; default: /* Assume other flow types don't need any handling */ ; diff --git a/flow.h b/flow.h index bf6b8459..7866477b 100644 --- a/flow.h +++ b/flow.h @@ -115,6 +115,8 @@ enum flow_type { FLOW_PING4, /* ICMPv6 echo requests from guest to host and matching replies back */ FLOW_PING6, + /* UDP pseudo-connection */ + FLOW_UDP, FLOW_NUM_TYPES, }; @@ -238,6 +240,8 @@ flow_sidx_t flow_lookup_af(const struct ctx *c, uint8_t proto, uint8_t pif, sa_family_t af, const void *eaddr, const void *faddr, in_port_t eport, in_port_t fport); +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport); union flow; diff --git a/flow_table.h b/flow_table.h index 
9d912c83..df253be4 100644 --- a/flow_table.h +++ b/flow_table.h @@ -9,6 +9,7 @@ #include "tcp_conn.h" #include "icmp_flow.h" +#include "udp_flow.h" /** * struct flow_free_cluster - Information about a cluster of free entries @@ -35,6 +36,7 @@ union flow { struct tcp_tap_conn tcp; struct tcp_splice_conn tcp_splice; struct icmp_ping_flow ping; + struct udp_flow udp; }; /* Global Flow Table */ @@ -98,6 +100,19 @@ static inline uint8_t pif_at_sidx(flow_sidx_t sidx) return flow->f.pif[sidx.sidei]; } +/** flow_sidx_opposite() - Get the other side of the same flow + * @sidx: Flow & side index + * + * Return: sidx for the other side of the same flow as @sidx + */ +static inline flow_sidx_t flow_sidx_opposite(flow_sidx_t sidx) +{ + if (!flow_sidx_valid(sidx)) + return FLOW_SIDX_NONE; + + return (flow_sidx_t){.flowi = sidx.flowi, .sidei = !sidx.sidei}; +} + /** flow_sidx() - Index of one side of a flow from common structure * @f: Common flow fields pointer * @sidei: Which side to refer to (0 or 1) diff --git a/udp.c b/udp.c index 150f970a..fdbe3968 100644 --- a/udp.c +++ b/udp.c @@ -15,6 +15,30 @@ /** * DOC: Theory of Operation * + * UDP Flows + * ========= + * + * UDP doesn't have true connections, but many protocols use a connection-like + * format. The flow is initiated by a client sending a datagram from a port of + * its choosing (usually ephemeral) to a specific port (usually well known) on a + * server. Both client and server address must be unicast. The server sends + * replies using the same addresses & ports with src/dest swapped. + * + * We track pseudo-connections of this type as flow table entries of type + * FLOW_UDP. We store the time of the last traffic on the flow in uflow->ts, + * and let the flow expire if there is no traffic for UDP_CONN_TIMEOUT seconds. + * + * NOTE: This won't handle multicast protocols, or some protocols with different + * port usage. We'll need specific logic if we want to handle those. 
+ * + * "Listening" sockets + * =================== + * + * UDP doesn't use listen(), but we consider long term sockets which are allowed + * to create new flows "listening" by analogy with TCP. + * + * Port tracking + * ============= * * For UDP, a reduced version of port-based connection tracking is implemented * with two purposes: @@ -122,6 +146,7 @@ #include "tap.h" #include "pcap.h" #include "log.h" +#include "flow_table.h" #define UDP_CONN_TIMEOUT 180 /* s, timeout for ephemeral or local bind */ #define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */ @@ -200,6 +225,7 @@ static struct ethhdr udp6_eth_hdr; * @taph: Tap backend specific header * @s_in: Source socket address, filled in by recvmmsg() * @splicesrc: Source port for splicing, or -1 if not spliceable + * @tosidx: sidx for the destination side of this datagram's flow */ static struct udp_meta_t { struct ipv6hdr ip6h; @@ -208,6 +234,7 @@ static struct udp_meta_t { union sockaddr_inany s_in; int splicesrc; + flow_sidx_t tosidx; } #ifdef __AVX2__ __attribute__ ((aligned(32))) @@ -491,6 +518,115 @@ static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsghdr *mmh) return -1; } +/** + * udp_at_sidx() - Get UDP specific flow at given sidx + * @sidx: Flow and side to retrieve + * + * Return: UDP specific flow at @sidx, or NULL if @sidx is invalid. Asserts if + * the flow at @sidx is not FLOW_UDP.
+ */ +struct udp_flow *udp_at_sidx(flow_sidx_t sidx) +{ + union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return NULL; + + ASSERT(flow->f.type == FLOW_UDP); + return &flow->udp; +} + +/* + * udp_flow_close() - Close and clean up UDP flow + * @c: Execution context + * @uflow: UDP flow + */ +static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) +{ + flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); +} + +/** + * udp_flow_new() - Common setup for a new UDP flow + * @c: Execution context + * @flow: Initiated flow + * @now: Timestamp + * + * Return: UDP specific flow, if successful, NULL on failure + */ +static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, + const struct timespec *now) +{ + const struct flowside *ini = &flow->f.side[INISIDE]; + struct udp_flow *uflow = NULL; + + if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) { + flow_trace(flow, "Invalid endpoint to initiate UDP flow"); + goto cancel; + } + + if (!flow_target(c, flow, IPPROTO_UDP)) + goto cancel; + + uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp); + uflow->ts = now->tv_sec; + + flow_hash_insert(c, FLOW_SIDX(uflow, INISIDE)); + FLOW_ACTIVATE(uflow); + + return FLOW_SIDX(uflow, TGTSIDE); + +cancel: + if (uflow) + udp_flow_close(c, uflow); + flow_alloc_cancel(flow); + return FLOW_SIDX_NONE; + +} + +/** + * udp_flow_from_sock() - Find or create UDP flow for "listening" socket + * @c: Execution context + * @ref: epoll reference of the receiving socket + * @meta: Metadata buffer for the datagram + * @now: Timestamp + * + * Return: sidx for the destination side of the flow for this packet, or + * FLOW_SIDX_NONE if we couldn't find or create a flow. 
+ */ +static flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref, + struct udp_meta_t *meta, + const struct timespec *now) +{ + struct udp_flow *uflow; + union flow *flow; + flow_sidx_t sidx; + + ASSERT(ref.type == EPOLL_TYPE_UDP); + + /* FIXME: Match reply packets to their flow as well */ + if (!ref.udp.orig) + return FLOW_SIDX_NONE; + + sidx = flow_lookup_sa(c, IPPROTO_UDP, ref.udp.pif, &meta->s_in, ref.udp.port); + if ((uflow = udp_at_sidx(sidx))) { + uflow->ts = now->tv_sec; + return flow_sidx_opposite(sidx); + } + + if (!(flow = flow_alloc())) { + char sastr[SOCKADDR_STRLEN]; + + debug("Couldn't allocate flow for UDP datagram from %s %s", + pif_name(ref.udp.pif), + sockaddr_ntop(&meta->s_in, sastr, sizeof(sastr))); + return FLOW_SIDX_NONE; + } + + flow_initiate_sa(flow, ref.udp.pif, &meta->s_in, ref.udp.port); + return udp_flow_new(c, flow, now); +} + /** * udp_splice_prepare() - Prepare one datagram for splicing * @mmh: Receiving mmsghdr array @@ -848,12 +984,15 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve dstport += c->udp.fwd_in.f.delta[dstport]; /* We divide datagrams into batches based on how we need to send them, - * determined by udp_meta[i].splicesrc. To avoid either two passes - * through the array, or recalculating splicesrc for a single entry, we - * have to populate it one entry *ahead* of the loop counter. + * determined by udp_meta[i].splicesrc and udp_meta[i].tosidx. To avoid + * either two passes through the array, or recalculating splicesrc and + * tosidxfor a single entry, we have to populate it one entry *ahead* of + * the loop counter. 
*/ udp_meta[0].splicesrc = udp_mmh_splice_port(ref, mmh_recv); + udp_meta[0].tosidx = udp_flow_from_sock(c, ref, &udp_meta[0], now); for (i = 0; i < n; ) { + flow_sidx_t batchsidx = udp_meta[i].tosidx; int batchsrc = udp_meta[i].splicesrc; int batchstart = i; @@ -870,7 +1009,11 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve udp_meta[i].splicesrc = udp_mmh_splice_port(ref, &mmh_recv[i]); - } while (udp_meta[i].splicesrc == batchsrc); + udp_meta[i].tosidx = udp_flow_from_sock(c, ref, + &udp_meta[i], + now); + } while (flow_sidx_eq(udp_meta[i].tosidx, batchsidx) && + udp_meta[i].splicesrc == batchsrc); if (batchsrc >= 0) { udp_splice_send(c, batchstart, i - batchstart, @@ -1268,6 +1411,24 @@ static int udp_port_rebind_outbound(void *arg) return 0; } +/** + * udp_flow_timer() - Handler for timed events related to a given flow + * @c: Execution context + * @uflow: UDP flow + * @now: Current timestamp + * + * Return: true if the flow is ready to free, false otherwise + */ +bool udp_flow_timer(const struct ctx *c, const struct udp_flow *uflow, + const struct timespec *now) +{ + if (now->tv_sec - uflow->ts <= UDP_CONN_TIMEOUT) + return false; + + udp_flow_close(c, uflow); + return true; +} + /** * udp_timer() - Scan activity bitmaps for ports with associated timed events * @c: Execution context diff --git a/udp_flow.h b/udp_flow.h new file mode 100644 index 00000000..18af9ac4 --- /dev/null +++ b/udp_flow.h @@ -0,0 +1,25 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright Red Hat + * Author: David Gibson <david(a)gibson.dropbear.id.au> + * + * UDP flow tracking data structures + */ +#ifndef UDP_FLOW_H +#define UDP_FLOW_H + +/** + * struct udp - Descriptor for a flow of UDP packets + * @f: Generic flow information + * @ts: Activity timestamp + */ +struct udp_flow { + /* Must be first element */ + struct flow_common f; + + time_t ts; +}; + +bool udp_flow_timer(const struct ctx *c, const struct udp_flow *uflow, + const struct 
timespec *now); + +#endif /* UDP_FLOW_H */ -- 2.45.2
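The "one entry ahead" batching in udp_buf_sock_handler() above is easier to follow in isolation. The following is a minimal standalone sketch of that loop shape only, under stated assumptions: sketch_key() is a hypothetical stand-in for the per-datagram lookup (udp_flow_from_sock() in the patch), an int key replaces flow_sidx_t, and sketch_batch and sketch_split() do not exist in passt. The counter sketch_lookups is there just to show that each datagram's key is computed exactly once, in a single pass.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one batch of consecutive datagrams */
struct sketch_batch {
	size_t start;	/* index of the first datagram in the batch */
	size_t len;	/* number of datagrams in the batch */
	int key;	/* key shared by every datagram in the batch */
};

unsigned sketch_lookups;	/* number of key lookups performed so far */

/* Hypothetical stand-in for the (notionally expensive) per-datagram lookup */
int sketch_key(const int *flow_of, size_t i)
{
	sketch_lookups++;
	return flow_of[i];
}

/* Split n datagrams into runs of consecutive entries with equal keys.
 * key[] is filled one entry *ahead* of the loop counter, so the entry that
 * ends a batch is already classified when the next batch starts.
 *
 * Returns the number of batches written to out[] (at most n).
 */
size_t sketch_split(const int *flow_of, size_t n, int *key,
		    struct sketch_batch *out)
{
	size_t nbatch = 0, i;

	assert(flow_of && key && out);
	if (!n)
		return 0;

	key[0] = sketch_key(flow_of, 0);
	for (i = 0; i < n; ) {
		int batchkey = key[i];
		size_t batchstart = i;

		do {
			/* Real code would prepare datagram i for sending here */
			if (++i >= n)
				break;

			key[i] = sketch_key(flow_of, i);
		} while (key[i] == batchkey);

		out[nbatch++] = (struct sketch_batch){
			.start = batchstart,
			.len = i - batchstart,
			.key = batchkey,
		};
	}

	return nbatch;
}
```

For keys { 7, 7, 3, 3, 3, 7 } this yields three batches (start 0 length 2, start 2 length 3, start 5 length 1) with six lookups in total. Note that only consecutive datagrams are grouped, so the trailing 7 forms its own batch; this is the "smaller batches" trade-off the commit message mentions.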
When forwarding a datagram to a socket, we need to find a socket with a suitable local address to send it. Currently we keep track of such sockets in an array indexed by local port, but this can't properly handle cases where we have multiple local addresses in active use. For "spliced" (socket to socket) cases, improve this by instead opening a socket specifically for the target side of the flow. We connect() as well as bind()ing that socket, so that it will only receive the flow's reply packets, not anything else. We direct datagrams sent via that socket using the addresses from the flow table, effectively replacing bespoke addressing logic with the unified logic in fwd.c. When we create the flow, we also take a duplicate of the originating socket, and use that to deliver reply datagrams back to the origin, again using addresses from the flow table entry. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- epoll_type.h | 2 + flow.c | 20 +++ flow.h | 2 + flow_table.h | 15 ++ passt.c | 4 + udp.c | 436 +++++++++++++++++++++------------------------------ udp.h | 6 +- udp_flow.h | 4 +- util.c | 1 + 9 files changed, 226 insertions(+), 264 deletions(-) diff --git a/epoll_type.h b/epoll_type.h index b6c04199..7a752ed1 100644 --- a/epoll_type.h +++ b/epoll_type.h @@ -22,6 +22,8 @@ enum epoll_type { EPOLL_TYPE_TCP_TIMER, /* UDP sockets */ EPOLL_TYPE_UDP, + /* UDP socket for replies on a specific flow */ + EPOLL_TYPE_UDP_REPLY, /* ICMP/ICMPv6 ping sockets */ EPOLL_TYPE_PING, /* inotify fd watching for end of netns (pasta) */ diff --git a/flow.c b/flow.c index 4e337d42..d7d548d6 100644 --- a/flow.c +++ b/flow.c @@ -237,6 +237,26 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, } } +/** flowside_connect() - Connect a socket based on flowside + * @c: Execution context + * @s: Socket to connect + * @pif: Target pif + * @tgt: Target flowside + * + * Connect @s to the endpoint address and port from @tgt.
+ * + * Return: 0 on success, negative on error + */ +int flowside_connect(const struct ctx *c, int s, + uint8_t pif, const struct flowside *tgt) +{ + union sockaddr_inany sa; + socklen_t sl; + + pif_sockaddr(c, &sa, &sl, pif, &tgt->eaddr, tgt->eport); + return connect(s, &sa.sa, sl); +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority diff --git a/flow.h b/flow.h index 7866477b..078fd605 100644 --- a/flow.h +++ b/flow.h @@ -168,6 +168,8 @@ static inline bool flowside_eq(const struct flowside *left, int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, const struct flowside *tgt, uint32_t data); +int flowside_connect(const struct ctx *c, int s, + uint8_t pif, const struct flowside *tgt); /** * struct flow_common - Common fields for packet flows diff --git a/flow_table.h b/flow_table.h index df253be4..a499e7b6 100644 --- a/flow_table.h +++ b/flow_table.h @@ -100,6 +100,21 @@ static inline uint8_t pif_at_sidx(flow_sidx_t sidx) return flow->f.pif[sidx.sidei]; } +/** flowside_at_sidx() - Retrieve a specific flowside + * @sidx: Flow & side index + * + * Return: Flowside for the flow & side given by @sidx + */ +static inline const struct flowside *flowside_at_sidx(flow_sidx_t sidx) +{ + const union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return NULL; + + return &flow->f.side[sidx.sidei]; +} + /** flow_sidx_opposite() - Get the other side of the same flow * @sidx: Flow & side index * diff --git a/passt.c b/passt.c index e4d45daa..f9405bee 100644 --- a/passt.c +++ b/passt.c @@ -67,6 +67,7 @@ char *epoll_type_str[] = { [EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket", [EPOLL_TYPE_TCP_TIMER] = "TCP timer", [EPOLL_TYPE_UDP] = "UDP socket", + [EPOLL_TYPE_UDP_REPLY] = "UDP reply socket", [EPOLL_TYPE_PING] = "ICMP/ICMPv6 ping socket", [EPOLL_TYPE_NSQUIT_INOTIFY] = "namespace inotify watch", [EPOLL_TYPE_NSQUIT_TIMER] = "namespace timer watch", @@ -349,6 +350,9 @@ loop: case EPOLL_TYPE_UDP:
udp_buf_sock_handler(&c, ref, eventmask, &now); break; + case EPOLL_TYPE_UDP_REPLY: + udp_reply_sock_handler(&c, ref, eventmask, &now); + break; case EPOLL_TYPE_PING: icmp_sock_handler(&c, ref); break; diff --git a/udp.c b/udp.c index fdbe3968..5543e614 100644 --- a/udp.c +++ b/udp.c @@ -35,7 +35,44 @@ * =================== * * UDP doesn't use listen(), but we consider long term sockets which are allowed - * to create new flows "listening" by analogy with TCP. + * to create new flows "listening" by analogy with TCP. This listening socket + * could receive packets from multiple flows, so we use a hash table match to + * find the specific flow for a datagram. + * + * When a UDP flow is initiated from a listening socket we take a duplicate of + * the socket and store it in uflow->s[INISIDE]. This will last for the + * lifetime of the flow, even if the original listening socket is closed due to + * port auto-probing. The duplicate is used to deliver replies back to the + * originating side. + * + * Reply sockets + * ============= + * + * When a UDP flow targets a socket, we create a "reply" socket in + * uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive + * replies on the target side. This socket is both bound and connected and has + * EPOLL_TYPE_UDP_REPLY. The connect() means it will only receive datagrams + * associated with this flow, so the epoll reference directly points to the flow + * and we don't need a hash lookup. + * + * NOTE: it's possible that the reply socket could have a bound address + * overlapping with an unrelated listening socket. We assume datagrams for the + * flow will come to the reply socket in preference to a listening socket. The + * sample program doc/platform-requirements/reuseaddr-priority.c documents and + * tests that assumption. 
+ * + * "Spliced" flows + * =============== + * + * In PASTA mode, L2-L4 translation is skipped for connections to ports bound + * between namespaces using the loopback interface, messages are directly + * transferred between L4 sockets instead. These are called spliced connections + * in analogy with the TCP implementation. The splice() syscall isn't + * actually used; it doesn't make sense for datagrams and instead a pair of + * recvmmsg() and sendmmsg() is used to forward the datagrams. + * + * Note that a spliced flow will have *both* a duplicated listening socket and a + * reply socket (see above). * * Port tracking * ============= @@ -56,62 +93,6 @@ * * Packets are forwarded back and forth, by prepending and stripping UDP headers * in the obvious way, with no port translation. - * - * In PASTA mode, the L2-L4 translation is skipped for connections to ports - * bound between namespaces using the loopback interface, messages are directly - * transferred between L4 sockets instead. These are called spliced connections - * for consistency with the TCP implementation, but the splice() syscall isn't - * actually used as it wouldn't make sense for datagram-based connections: a - * pair of recvmmsg() and sendmmsg() deals with this case.
- * - * The connection tracking for PASTA mode is slightly complicated by the absence - * of actual connections, see struct udp_splice_port, and these examples: - * - * - from init to namespace: - * - * - forward direction: 127.0.0.1:5000 -> 127.0.0.1:80 in init from socket s, - * with epoll reference: index = 80, splice = 1, orig = 1, ns = 0 - * - if udp_splice_ns[V4][5000].sock: - * - send packet to udp_splice_ns[V4][5000].sock, with destination port - * 80 - * - otherwise: - * - create new socket udp_splice_ns[V4][5000].sock - * - bind in namespace to 127.0.0.1:5000 - * - add to epoll with reference: index = 5000, splice = 1, orig = 0, - * ns = 1 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - * - reverse direction: 127.0.0.1:80 -> 127.0.0.1:5000 in namespace socket s, - * having epoll reference: index = 5000, splice = 1, orig = 0, ns = 1 - * - if udp_splice_init[V4][80].sock: - * - send to udp_splice_init[V4][80].sock, with destination port 5000 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - otherwise, discard - * - * - from namespace to init: - * - * - forward direction: 127.0.0.1:2000 -> 127.0.0.1:22 in namespace from - * socket s, with epoll reference: index = 22, splice = 1, orig = 1, ns = 1 - * - if udp4_splice_init[V4][2000].sock: - * - send packet to udp_splice_init[V4][2000].sock, with destination - * port 22 - * - otherwise: - * - create new socket udp_splice_init[V4][2000].sock - * - bind in init to 127.0.0.1:2000 - * - add to epoll with reference: index = 2000, splice = 1, orig = 0, - * ns = 0 - * - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - * - reverse direction: 127.0.0.1:22 -> 127.0.0.1:2000 in init from socket s, - * having epoll reference: index = 2000, splice = 1, orig = 0, ns = 0 - * - if udp_splice_ns[V4][22].sock: - * - send to udp_splice_ns[V4][22].sock, with destination port 2000 - * - 
update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - otherwise, discard */ #include <sched.h> @@ -134,6 +115,7 @@ #include <sys/socket.h> #include <sys/uio.h> #include <time.h> +#include <fcntl.h> #include <linux/errqueue.h> #include "checksum.h" @@ -224,7 +206,6 @@ static struct ethhdr udp6_eth_hdr; * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr) * @taph: Tap backend specific header * @s_in: Source socket address, filled in by recvmmsg() - * @splicesrc: Source port for splicing, or -1 if not spliceable * @tosidx: sidx for the destination side of this datagram's flow */ static struct udp_meta_t { @@ -233,7 +214,6 @@ static struct udp_meta_t { struct tap_hdr taph; union sockaddr_inany s_in; - int splicesrc; flow_sidx_t tosidx; } #ifdef __AVX2__ @@ -271,7 +251,6 @@ static struct mmsghdr udp_mh_splice [UDP_MAX_FRAMES]; /* IOVs for L2 frames */ static struct iovec udp_l2_iov [UDP_MAX_FRAMES][UDP_NUM_IOVS]; - /** * udp_portmap_clear() - Clear UDP port map before configuration */ @@ -384,140 +363,6 @@ static void udp_iov_init(const struct ctx *c) udp_iov_init_one(c, i); } -/** - * udp_splice_new() - Create and prepare socket for "spliced" binding - * @c: Execution context - * @v6: Set for IPv6 sockets - * @src: Source port of original connection, host order - * @ns: Does the splice originate in the ns or not - * - * Return: prepared socket, negative error code on failure - * - * #syscalls:pasta getsockname - */ -int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) -{ - struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP | EPOLLHUP }; - union epoll_ref ref = { .type = EPOLL_TYPE_UDP, - .udp = { .splice = true, .v6 = v6, .port = src } - }; - struct udp_splice_port *sp; - int act, s; - - if (ns) { - ref.udp.pif = PIF_SPLICE; - sp = &udp_splice_ns[v6 ? V6 : V4][src]; - act = UDP_ACT_SPLICE_NS; - } else { - ref.udp.pif = PIF_HOST; - sp = &udp_splice_init[v6 ? 
V6 : V4][src]; - act = UDP_ACT_SPLICE_INIT; - } - - s = socket(v6 ? AF_INET6 : AF_INET, SOCK_DGRAM | SOCK_NONBLOCK, - IPPROTO_UDP); - - if (s > FD_REF_MAX) { - close(s); - return -EIO; - } - - if (s < 0) - return s; - - ref.fd = s; - - if (v6) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(src), - .sin6_addr = IN6ADDR_LOOPBACK_INIT, - }; - if (bind(s, (struct sockaddr *)&addr6, sizeof(addr6))) - goto fail; - } else { - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(src), - .sin_addr = IN4ADDR_LOOPBACK_INIT, - }; - if (bind(s, (struct sockaddr *)&addr4, sizeof(addr4))) - goto fail; - } - - sp->sock = s; - bitmap_set(udp_act[v6 ? V6 : V4][act], src); - - ev.data.u64 = ref.u64; - epoll_ctl(c->epollfd, EPOLL_CTL_ADD, s, &ev); - return s; - -fail: - close(s); - return -1; -} - -/** - * struct udp_splice_new_ns_arg - Arguments for udp_splice_new_ns() - * @c: Execution context - * @v6: Set for IPv6 - * @src: Source port of originating datagram, host order - * @dst: Destination port of originating datagram, host order - * @s: Newly created socket or negative error code - */ -struct udp_splice_new_ns_arg { - const struct ctx *c; - int v6; - in_port_t src; - int s; -}; - -/** - * udp_splice_new_ns() - Enter namespace and call udp_splice_new() - * @arg: See struct udp_splice_new_ns_arg - * - * Return: 0 - */ -static int udp_splice_new_ns(void *arg) -{ - struct udp_splice_new_ns_arg *a; - - a = (struct udp_splice_new_ns_arg *)arg; - - ns_enter(a->c); - - a->s = udp_splice_new(a->c, a->v6, a->src, true); - - return 0; -} - -/** - * udp_mmh_splice_port() - Is source address of message suitable for splicing? - * @ref: epoll reference for incoming message's origin socket - * @mmh: mmsghdr of incoming message - * - * Return: if source address of message in @mmh refers to localhost (127.0.0.1 - * or ::1) its source port (host order), otherwise -1. 
- */ -static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsghdr *mmh) -{ - const struct sockaddr_in6 *sa6 = mmh->msg_hdr.msg_name; - const struct sockaddr_in *sa4 = mmh->msg_hdr.msg_name; - - ASSERT(ref.type == EPOLL_TYPE_UDP); - - if (!ref.udp.splice) - return -1; - - if (ref.udp.v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr)) - return ntohs(sa6->sin6_port); - - if (!ref.udp.v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr)) - return ntohs(sa4->sin_port); - - return -1; -} - /** * udp_at_sidx() - Get UDP specific flow at given sidx * @sidx: Flow and side to retrieve @@ -541,8 +386,20 @@ struct udp_flow *udp_at_sidx(flow_sidx_t sidx) * @c: Execution context * @uflow: UDP flow */ -static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) +static void udp_flow_close(const struct ctx *c, struct udp_flow *uflow) { + if (uflow->s[INISIDE] >= 0) { + /* The listening socket needs to stay in epoll */ + close(uflow->s[INISIDE]); + uflow->s[INISIDE] = -1; + } + + if (uflow->s[TGTSIDE] >= 0) { + /* But the flow specific one needs to be removed */ + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, uflow->s[TGTSIDE], NULL); + close(uflow->s[TGTSIDE]); + uflow->s[TGTSIDE] = -1; + } flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); } @@ -550,26 +407,92 @@ static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow) * udp_flow_new() - Common setup for a new UDP flow * @c: Execution context * @flow: Initiated flow + * @s_ini: Initiating socket (or -1) * @now: Timestamp * * Return: UDP specific flow, if successful, NULL on failure */ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, - const struct timespec *now) + int s_ini, const struct timespec *now) { const struct flowside *ini = &flow->f.side[INISIDE]; struct udp_flow *uflow = NULL; + const struct flowside *tgt; + uint8_t tgtpif; if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) { flow_trace(flow, "Invalid endpoint to initiate UDP flow"); goto cancel; } - if (!flow_target(c, 
flow, IPPROTO_UDP)) + if (!(tgt = flow_target(c, flow, IPPROTO_UDP))) goto cancel; + tgtpif = flow->f.pif[TGTSIDE]; uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp); uflow->ts = now->tv_sec; + uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1; + + if (s_ini >= 0) { + /* When using auto port-scanning the listening port could go + * away, so we need to duplicate the socket + */ + uflow->s[INISIDE] = fcntl(s_ini, F_DUPFD_CLOEXEC, 0); + if (uflow->s[INISIDE] < 0) { + flow_err(uflow, + "Couldn't duplicate listening socket: %s", + strerror(errno)); + goto cancel; + } + } + + if (pif_is_socket(tgtpif)) { + struct mmsghdr discard[UIO_MAXIOV] = { 0 }; + union { + flow_sidx_t sidx; + uint32_t data; + } fref = { + .sidx = FLOW_SIDX(flow, TGTSIDE), + }; + int rc; + + uflow->s[TGTSIDE] = flowside_sock_l4(c, EPOLL_TYPE_UDP_REPLY, + tgtpif, tgt, fref.data); + if (uflow->s[TGTSIDE] < 0) { + flow_dbg(uflow, + "Couldn't open socket for spliced flow: %s", + strerror(errno)); + goto cancel; + } + + if (flowside_connect(c, uflow->s[TGTSIDE], tgtpif, tgt) < 0) { + flow_dbg(uflow, + "Couldn't connect flow socket: %s", + strerror(errno)); + goto cancel; + } + + /* It's possible, if unlikely, that we could receive some + * unrelated packets in between the bind() and connect() of this + * socket. For now we just discard these. We could consider + * trying to redirect these to an appropriate handler, if we + * need to. 
+ */ + rc = recvmmsg(uflow->s[TGTSIDE], discard, ARRAY_SIZE(discard), + MSG_DONTWAIT, NULL); + if (rc >= ARRAY_SIZE(discard)) { + flow_dbg(uflow, + "Too many (%d) spurious reply datagrams", rc); + goto cancel; + } else if (rc > 0) { + flow_trace(uflow, + "Discarded %d spurious reply datagrams", rc); + } else if (errno != EAGAIN) { + flow_err(uflow, + "Unexpected error discarding datagrams: %s", + strerror(errno)); + } + } flow_hash_insert(c, FLOW_SIDX(uflow, INISIDE)); FLOW_ACTIVATE(uflow); @@ -581,7 +504,6 @@ cancel: udp_flow_close(c, uflow); flow_alloc_cancel(flow); return FLOW_SIDX_NONE; - } /** @@ -591,6 +513,8 @@ cancel: * @meta: Metadata buffer for the datagram * @now: Timestamp * + * #syscalls fcntl + * * Return: sidx for the destination side of the flow for this packet, or * FLOW_SIDX_NONE if we couldn't find or create a flow. */ @@ -624,7 +548,7 @@ static flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref, } flow_initiate_sa(flow, ref.udp.pif, &meta->s_in, ref.udp.port); - return udp_flow_new(c, flow, now); + return udp_flow_new(c, flow, ref.fd, now); } /** @@ -648,55 +572,16 @@ static void udp_splice_prepare(struct mmsghdr *mmh, unsigned idx) * @now: Timestamp */ static void udp_splice_send(const struct ctx *c, size_t start, size_t n, - in_port_t src, in_port_t dst, - union epoll_ref ref, - const struct timespec *now) + flow_sidx_t tosidx) { - int s; - - if (ref.udp.v6) { - udp_splice_to.sa6 = (struct sockaddr_in6) { - .sin6_family = AF_INET6, - .sin6_addr = in6addr_loopback, - .sin6_port = htons(dst), - }; - } else { - udp_splice_to.sa4 = (struct sockaddr_in) { - .sin_family = AF_INET, - .sin_addr = in4addr_loopback, - .sin_port = htons(dst), - }; - } - - if (ref.udp.pif == PIF_SPLICE) { - src += c->udp.fwd_in.rdelta[src]; - s = udp_splice_init[ref.udp.v6][src].sock; - if (s < 0 && ref.udp.orig) - s = udp_splice_new(c, ref.udp.v6, src, false); - - if (s < 0) - return; - - udp_splice_ns[ref.udp.v6][dst].ts = now->tv_sec; - 
udp_splice_init[ref.udp.v6][src].ts = now->tv_sec; - } else { - ASSERT(ref.udp.pif == PIF_HOST); - src += c->udp.fwd_out.rdelta[src]; - s = udp_splice_ns[ref.udp.v6][src].sock; - if (s < 0 && ref.udp.orig) { - struct udp_splice_new_ns_arg arg = { - c, ref.udp.v6, src, -1, - }; - - NS_CALL(udp_splice_new_ns, &arg); - s = arg.s; - } - if (s < 0) - return; + const struct flowside *toside = flowside_at_sidx(tosidx); + const struct udp_flow *uflow = udp_at_sidx(tosidx); + uint8_t topif = pif_at_sidx(tosidx); + int s = uflow->s[tosidx.sidei]; + socklen_t sl; - udp_splice_init[ref.udp.v6][dst].ts = now->tv_sec; - udp_splice_ns[ref.udp.v6][src].ts = now->tv_sec; - } + pif_sockaddr(c, &udp_splice_to, &sl, topif, + &toside->eaddr, toside->eport); sendmmsg(s, udp_mh_splice + start, n, MSG_NOSIGNAL); } @@ -984,20 +869,18 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve dstport += c->udp.fwd_in.f.delta[dstport]; /* We divide datagrams into batches based on how we need to send them, - * determined by udp_meta[i].splicesrc and udp_meta[i].tosidx. To avoid - * either two passes through the array, or recalculating splicesrc and - * tosidxfor a single entry, we have to populate it one entry *ahead* of - * the loop counter. + * determined by udp_meta[i].tosidx. To avoid either two passes through + * the array, or recalculating tosidx for a single entry, we have to + * populate it one entry *ahead* of the loop counter. 
*/ - udp_meta[0].splicesrc = udp_mmh_splice_port(ref, mmh_recv); udp_meta[0].tosidx = udp_flow_from_sock(c, ref, &udp_meta[0], now); for (i = 0; i < n; ) { flow_sidx_t batchsidx = udp_meta[i].tosidx; - int batchsrc = udp_meta[i].splicesrc; + uint8_t batchpif = pif_at_sidx(batchsidx); int batchstart = i; do { - if (batchsrc >= 0) { + if (pif_is_socket(batchpif)) { udp_splice_prepare(mmh_recv, i); } else { udp_tap_prepare(c, mmh_recv, i, dstport, @@ -1007,17 +890,14 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve if (++i >= n) break; - udp_meta[i].splicesrc = udp_mmh_splice_port(ref, - &mmh_recv[i]); udp_meta[i].tosidx = udp_flow_from_sock(c, ref, &udp_meta[i], now); - } while (flow_sidx_eq(udp_meta[i].tosidx, batchsidx) && - udp_meta[i].splicesrc == batchsrc); + } while (flow_sidx_eq(udp_meta[i].tosidx, batchsidx)); - if (batchsrc >= 0) { + if (pif_is_socket(batchpif)) { udp_splice_send(c, batchstart, i - batchstart, - batchsrc, dstport, ref, now); + batchsidx); } else { tap_send_frames(c, &udp_l2_iov[batchstart][0], UDP_NUM_IOVS, i - batchstart); @@ -1025,6 +905,40 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve } } +/** + * udp_reply_sock_handler() - Handle new data from flow specific socket + * @c: Execution context + * @ref: epoll reference + * @events: epoll events bitmap + * @now: Current timestamp + * + * #syscalls recvmmsg + */ +void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events, const struct timespec *now) +{ + const struct flowside *fromside = flowside_at_sidx(ref.flowside); + flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside); + struct udp_flow *uflow = udp_at_sidx(ref.flowside); + int from_s = uflow->s[ref.flowside.sidei]; + bool v6 = !inany_v4(&fromside->eaddr); + struct mmsghdr *mmh_recv = v6 ? 
udp6_mh_recv : udp4_mh_recv; + int n, i; + + ASSERT(!c->no_udp && uflow); + + if ((n = udp_sock_recv(c, from_s, events, mmh_recv)) <= 0) + return; + + flow_trace(uflow, "Received %d datagrams on reply socket", n); + uflow->ts = now->tv_sec; + + for (i = 0; i < n; i++) + udp_splice_prepare(mmh_recv, i); + + udp_splice_send(c, 0, n, tosidx); +} + /** * udp_tap_handler() - Handle packets from tap * @c: Execution context @@ -1419,8 +1333,8 @@ static int udp_port_rebind_outbound(void *arg) * * Return: true if the flow is ready to free, false otherwise */ -bool udp_flow_timer(const struct ctx *c, const struct udp_flow *uflow, - const struct timespec *now) +bool udp_flow_timer(const struct ctx *c, struct udp_flow *uflow, + const struct timespec *now) { if (now->tv_sec - uflow->ts <= UDP_CONN_TIMEOUT) return false; diff --git a/udp.h b/udp.h index 5865def2..db5e546e 100644 --- a/udp.h +++ b/udp.h @@ -9,8 +9,10 @@ #define UDP_TIMER_INTERVAL 1000 /* ms */ void udp_portmap_clear(void); -void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events, - const struct timespec *now); +void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events, const struct timespec *now); +void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events, const struct timespec *now); int udp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, int idx, const struct timespec *now); diff --git a/udp_flow.h b/udp_flow.h index 18af9ac4..e0736f8f 100644 --- a/udp_flow.h +++ b/udp_flow.h @@ -11,15 +11,17 @@ * struct udp - Descriptor for a flow of UDP packets * @f: Generic flow information * @ts: Activity timestamp + * @s: Socket fd (or -1) for each side of the flow */ struct udp_flow { /* Must be first element */ struct flow_common f; time_t ts; + int s[SIDES]; }; -bool udp_flow_timer(const struct ctx *c, const struct udp_flow *uflow, +bool udp_flow_timer(const struct ctx 
*c, struct udp_flow *uflow, const struct timespec *now); #endif /* UDP_FLOW_H */ diff --git a/util.c b/util.c index 6b51fc51..8dc8ff76 100644 --- a/util.c +++ b/util.c @@ -62,6 +62,7 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type, socktype = SOCK_STREAM | SOCK_NONBLOCK; break; case EPOLL_TYPE_UDP: + case EPOLL_TYPE_UDP_REPLY: proto = IPPROTO_UDP; socktype = SOCK_DGRAM | SOCK_NONBLOCK; break; -- 2.45.2
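A reading aid for the batching comment in udp_buf_sock_handler() above ("populate it one entry *ahead* of the loop counter"): the following is a minimal standalone sketch of that loop shape, not the code from the patch. It assumes each datagram's destination can be reduced to a single integer key (standing in for flow_sidx_t), and lookup_key(), struct batch and split_batches() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: one run of consecutive datagrams sharing a destination key */
struct batch {
	size_t start;	/* index of the first datagram in the batch */
	size_t count;	/* number of datagrams in the batch */
	int key;	/* key shared by all of them */
};

/* Stand-in for the per-datagram flow lookup; here it just reads an array */
static int lookup_key(const int *keys, size_t i)
{
	return keys[i];
}

/**
 * split_batches() - Group consecutive datagrams with equal keys
 * @keys:	Per-datagram keys (illustrative input)
 * @n:		Number of datagrams
 * @out:	Array of at least @n batches to fill in
 *
 * Return: number of batches written to @out
 */
static size_t split_batches(const int *keys, size_t n, struct batch *out)
{
	size_t i = 0, nb = 0;
	int next;

	if (!n)
		return 0;

	assert(keys && out);

	/* Look up entry 0 before the loop: always one entry ahead of i */
	next = lookup_key(keys, 0);

	while (i < n) {
		size_t start = i;
		int batchkey = next;

		do {
			/* a real handler would prepare datagram i here */
			if (++i >= n)
				break;

			/* each entry is looked up exactly once */
			next = lookup_key(keys, i);
		} while (next == batchkey);

		out[nb++] = (struct batch){ start, i - start, batchkey };
	}

	return nb;
}
```

The point of the look-ahead is that the key which terminates one batch is already available as the key of the next one, so no entry needs a second lookup and the array is walked only once.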
Now that spliced datagrams are managed via the flow table, remove UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used. With those removed, the 'ts' field in udp_splice_port is also no longer used. struct udp_splice_port now contains just a socket fd, so replace it with a plain int in udp_splice_ns[] and udp_splice_init[]. The latter are still used for tracking of automatic port forwarding. Finally, the 'splice' field of union udp_epoll_ref is no longer used so remove it as well. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 65 +++++++++++++++++------------------------------------------ udp.h | 5 +---- 2 files changed, 19 insertions(+), 51 deletions(-) diff --git a/udp.c b/udp.c index 5543e614..b459b109 100644 --- a/udp.c +++ b/udp.c @@ -150,27 +150,15 @@ struct udp_tap_port { time_t ts; }; -/** - * struct udp_splice_port - Bound socket for spliced communication - * @sock: Socket bound to index port - * @ts: Activity timestamp - */ -struct udp_splice_port { - int sock; - time_t ts; -}; - /* Port tracking, arrays indexed by packet source port (host order) */ static struct udp_tap_port udp_tap_map [IP_VERSIONS][NUM_PORTS]; /* "Spliced" sockets indexed by bound port (host order) */ -static struct udp_splice_port udp_splice_ns [IP_VERSIONS][NUM_PORTS]; -static struct udp_splice_port udp_splice_init[IP_VERSIONS][NUM_PORTS]; +static int udp_splice_ns [IP_VERSIONS][NUM_PORTS]; +static int udp_splice_init[IP_VERSIONS][NUM_PORTS]; enum udp_act_type { UDP_ACT_TAP, - UDP_ACT_SPLICE_NS, - UDP_ACT_SPLICE_INIT, UDP_ACT_TYPE_MAX, }; @@ -260,8 +248,8 @@ void udp_portmap_clear(void) for (i = 0; i < NUM_PORTS; i++) { udp_tap_map[V4][i].sock = udp_tap_map[V6][i].sock = -1; - udp_splice_ns[V4][i].sock = udp_splice_ns[V6][i].sock = -1; - udp_splice_init[V4][i].sock = udp_splice_init[V6][i].sock = -1; + udp_splice_ns[V4][i] = udp_splice_ns[V6][i] = -1; + udp_splice_init[V4][i] = udp_splice_init[V6][i] = -1; } } @@ -1142,8 +1130,7 @@ int 
udp_tap_handler(struct ctx *c, uint8_t pif, int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, const void *addr, const char *ifname, in_port_t port) { - union udp_epoll_ref uref = { .splice = (c->mode == MODE_PASTA), - .orig = true, .port = port }; + union udp_epoll_ref uref = { .orig = true, .port = port }; int s, r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1; ASSERT(!c->no_udp); @@ -1161,12 +1148,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, ifname, port, uref.u32); udp_tap_map[V4][port].sock = s < 0 ? -1 : s; - udp_splice_init[V4][port].sock = s < 0 ? -1 : s; + udp_splice_init[V4][port] = s < 0 ? -1 : s; } else { r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP, &in4addr_loopback, ifname, port, uref.u32); - udp_splice_ns[V4][port].sock = s < 0 ? -1 : s; + udp_splice_ns[V4][port] = s < 0 ? -1 : s; } } @@ -1178,12 +1165,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, ifname, port, uref.u32); udp_tap_map[V6][port].sock = s < 0 ? -1 : s; - udp_splice_init[V6][port].sock = s < 0 ? -1 : s; + udp_splice_init[V6][port] = s < 0 ? -1 : s; } else { r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP, &in6addr_loopback, ifname, port, uref.u32); - udp_splice_ns[V6][port].sock = s < 0 ? -1 : s; + udp_splice_ns[V6][port] = s < 0 ? -1 : s; } } @@ -1224,7 +1211,6 @@ static void udp_splice_iov_init(void) static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, in_port_t port, const struct timespec *now) { - struct udp_splice_port *sp; struct udp_tap_port *tp; int *sockp = NULL; @@ -1237,20 +1223,6 @@ static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, tp->flags = 0; } - break; - case UDP_ACT_SPLICE_INIT: - sp = &udp_splice_init[v6 ? V6 : V4][port]; - - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; - - break; - case UDP_ACT_SPLICE_NS: - sp = &udp_splice_ns[v6 ? 
V6 : V4][port]; - - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; - break; default: return; @@ -1274,24 +1246,23 @@ static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, */ static void udp_port_rebind(struct ctx *c, bool outbound) { + int (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init; const uint8_t *fmap = outbound ? c->udp.fwd_out.f.map : c->udp.fwd_in.f.map; const uint8_t *rmap = outbound ? c->udp.fwd_in.f.map : c->udp.fwd_out.f.map; - struct udp_splice_port (*socks)[NUM_PORTS] - = outbound ? udp_splice_ns : udp_splice_init; unsigned port; for (port = 0; port < NUM_PORTS; port++) { if (!bitmap_isset(fmap, port)) { - if (socks[V4][port].sock >= 0) { - close(socks[V4][port].sock); - socks[V4][port].sock = -1; + if (socks[V4][port] >= 0) { + close(socks[V4][port]); + socks[V4][port] = -1; } - if (socks[V6][port].sock >= 0) { - close(socks[V6][port].sock); - socks[V6][port].sock = -1; + if (socks[V6][port] >= 0) { + close(socks[V6][port]); + socks[V6][port] = -1; } continue; @@ -1301,8 +1272,8 @@ static void udp_port_rebind(struct ctx *c, bool outbound) if (bitmap_isset(rmap, port)) continue; - if ((c->ifi4 && socks[V4][port].sock == -1) || - (c->ifi6 && socks[V6][port].sock == -1)) + if ((c->ifi4 && socks[V4][port] == -1) || + (c->ifi6 && socks[V6][port] == -1)) udp_sock_init(c, outbound, AF_UNSPEC, NULL, NULL, port); } } diff --git a/udp.h b/udp.h index db5e546e..e133f1e7 100644 --- a/udp.h +++ b/udp.h @@ -26,8 +26,6 @@ void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s); * union udp_epoll_ref - epoll reference portion for TCP connections * @port: Source port for connected sockets, bound port otherwise * @pif: pif for this socket - * @bound: Set if this file descriptor is a bound socket - * @splice: Set if descriptor packets to be "spliced" * @orig: Set if a spliced socket which can originate "connections" * @v6: Set for IPv6 sockets or connections * @u32: Opaque u32 value of 
reference @@ -36,8 +34,7 @@ union udp_epoll_ref { struct { in_port_t port; uint8_t pif; - bool splice:1, - orig:1, + bool orig:1, v6:1; }; uint32_t u32; -- 2.45.2
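For readers less used to the pointer-to-array declaration that udp_port_rebind() switches to above, here is a small self-contained sketch of the same idiom. It is an illustration under assumptions, not passt code: the tables are shrunk to a handful of ports, port_forget() is a hypothetical helper, and where the real code calls close() the sketch only resets the entry.

```c
#include <assert.h>
#include <stdbool.h>

#define IP_VERSIONS	2
#define NUM_PORTS	8	/* tiny on purpose; illustration only */

enum { V4, V6 };

/* Plain fds indexed by [IP version][port], -1 meaning "no socket" */
static int splice_ns[IP_VERSIONS][NUM_PORTS];
static int splice_init[IP_VERSIONS][NUM_PORTS];

/* Mark every entry of both tables as empty */
static void portmap_clear(void)
{
	unsigned i;

	for (i = 0; i < NUM_PORTS; i++) {
		splice_ns[V4][i] = splice_ns[V6][i] = -1;
		splice_init[V4][i] = splice_init[V6][i] = -1;
	}
}

/**
 * port_forget() - Drop the fds recorded for one port (hypothetical helper)
 * @outbound:	Pick the namespace-side table if true, init-side otherwise
 * @port:	Port number used as index
 *
 * Return: number of entries that were reset
 */
static int port_forget(bool outbound, unsigned port)
{
	/* Pointer to a row of NUM_PORTS ints, so socks[v][port] still works */
	int (*socks)[NUM_PORTS] = outbound ? splice_ns : splice_init;
	int v, n = 0;

	assert(port < NUM_PORTS);

	for (v = V4; v <= V6; v++) {
		if (socks[v][port] >= 0) {
			/* the real code would close() the fd first */
			socks[v][port] = -1;
			n++;
		}
	}

	return n;
}
```

Once the struct is reduced to a bare int, the element type of the selected table changes but the two-level indexing at the call sites does not, which is why the diff above is mostly a mechanical removal of ".sock".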
Currently we create flows for datagrams from socket interfaces, and use them to direct "spliced" (socket to socket) datagrams. We don't yet match datagrams from the tap interface to existing flows, nor create new flows for them. Add that functionality, matching datagrams from tap to existing flows when they exist, or creating new ones. As with spliced flows, when creating a new flow from tap to socket, we create a new connected socket to receive reply datagrams attached to that flow specifically. We extend udp_reply_sock_handler() to handle reply packets bound for tap rather than another socket. For non-obvious reasons (perhaps increased stack usage?), this caused a failure for me when running under valgrind, because valgrind invoked rt_sigreturn which is not in our seccomp filter. Since we already allow rt_sigaction and others in the valgrind target, it seems reasonable to add rt_sigreturn as well. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- Makefile | 2 +- udp.c | 211 +++++++++++++++++++++++++------------------------------ udp.h | 4 +- 3 files changed, 100 insertions(+), 117 deletions(-) diff --git a/Makefile b/Makefile index 92cbd5a6..bd504d23 100644 --- a/Makefile +++ b/Makefile @@ -128,7 +128,7 @@ qrap: $(QRAP_SRCS) passt.h $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(QRAP_SRCS) -o qrap $(LDFLAGS) valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \ - getpid gettid kill clock_gettime mmap \ + rt_sigreturn getpid gettid kill clock_gettime mmap \ munmap open unlink gettimeofday futex valgrind: FLAGS += -g -DVALGRIND valgrind: all diff --git a/udp.c b/udp.c index b459b109..2407ca86 100644 --- a/udp.c +++ b/udp.c @@ -116,6 +116,7 @@ #include <sys/uio.h> #include <time.h> #include <fcntl.h> +#include <arpa/inet.h> #include <linux/errqueue.h> #include "checksum.h" @@ -389,6 +390,8 @@ static void udp_flow_close(const struct ctx *c, struct udp_flow *uflow) uflow->s[TGTSIDE] = -1; } flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); + if
(!pif_is_socket(uflow->f.pif[TGTSIDE])) + flow_hash_remove(c, FLOW_SIDX(uflow, TGTSIDE)); } /** @@ -483,6 +486,13 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, } flow_hash_insert(c, FLOW_SIDX(uflow, INISIDE)); + + /* If the target side is a socket, it will be a reply socket that knows + * its own flowside. But if it's tap, then we need to look it up by + * hash. + */ + if (!pif_is_socket(tgtpif)) + flow_hash_insert(c, FLOW_SIDX(uflow, TGTSIDE)); FLOW_ACTIVATE(uflow); return FLOW_SIDX(uflow, TGTSIDE); @@ -907,10 +917,12 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, { const struct flowside *fromside = flowside_at_sidx(ref.flowside); flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside); + const struct flowside *toside = flowside_at_sidx(tosidx); struct udp_flow *uflow = udp_at_sidx(ref.flowside); int from_s = uflow->s[ref.flowside.sidei]; bool v6 = !inany_v4(&fromside->eaddr); struct mmsghdr *mmh_recv = v6 ? udp6_mh_recv : udp4_mh_recv; + uint8_t topif = pif_at_sidx(tosidx); int n, i; ASSERT(!c->no_udp && uflow); @@ -921,10 +933,64 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, flow_trace(uflow, "Received %d datagrams on reply socket", n); uflow->ts = now->tv_sec; - for (i = 0; i < n; i++) - udp_splice_prepare(mmh_recv, i); + for (i = 0; i < n; i++) { + if (pif_is_socket(topif)) + udp_splice_prepare(mmh_recv, i); + else + udp_tap_prepare(c, mmh_recv, i, toside->eport, v6, now); + } - udp_splice_send(c, 0, n, tosidx); + if (pif_is_socket(topif)) + udp_splice_send(c, 0, n, tosidx); + else + tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n); +} + +/** + * udp_flow_from_tap() - Find or create UDP flow for tap packets + * @c: Execution context + * @pif: pif on which the packet is arriving + * @af: Address family, AF_INET or AF_INET6 + * @saddr: Source address on guest side + * @daddr: Destination address guest side + * @srcport: Source port on guest side + * @dstport: Destination port on 
guest side + * + * Return: sidx for the destination side of the flow for this packet, or + * FLOW_SIDX_NONE if we couldn't find or create a flow. + */ +static flow_sidx_t udp_flow_from_tap(const struct ctx *c, + uint8_t pif, sa_family_t af, + const void *saddr, const void *daddr, + in_port_t srcport, in_port_t dstport, + const struct timespec *now) +{ + struct udp_flow *uflow; + union flow *flow; + flow_sidx_t sidx; + + ASSERT(pif == PIF_TAP); + + sidx = flow_lookup_af(c, IPPROTO_UDP, pif, af, saddr, daddr, + srcport, dstport); + if ((uflow = udp_at_sidx(sidx))) { + uflow->ts = now->tv_sec; + return flow_sidx_opposite(sidx); + } + + if (!(flow = flow_alloc())) { + char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN]; + + debug("Couldn't allocate flow for UDP datagram from %s %s:%hu -> %s:%hu", + pif_name(pif), + inet_ntop(af, saddr, sstr, sizeof(sstr)), srcport, + inet_ntop(af, daddr, dstr, sizeof(dstr)), dstport); + return FLOW_SIDX_NONE; + } + + flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); + + return udp_flow_new(c, flow, -1, now); } /** @@ -942,23 +1008,22 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, * * #syscalls sendmmsg */ -int udp_tap_handler(struct ctx *c, uint8_t pif, +int udp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, int idx, const struct timespec *now) { + const struct flowside *toside; struct mmsghdr mm[UIO_MAXIOV]; + union sockaddr_inany to_sa; struct iovec m[UIO_MAXIOV]; - struct sockaddr_in6 s_in6; - struct sockaddr_in s_in; const struct udphdr *uh; - struct sockaddr *sa; + struct udp_flow *uflow; int i, s, count = 0; + flow_sidx_t tosidx; in_port_t src, dst; + uint8_t topif; socklen_t sl; - (void)saddr; - (void)pif; - ASSERT(!c->no_udp); uh = packet_get(p, idx, 0, sizeof(*uh), NULL); @@ -969,116 +1034,34 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, * and destination, so we can just take those from the first 
message. */ src = ntohs(uh->source); - src += c->udp.fwd_in.rdelta[src]; dst = ntohs(uh->dest); - if (af == AF_INET) { - s_in = (struct sockaddr_in) { - .sin_family = AF_INET, - .sin_port = uh->dest, - .sin_addr = *(struct in_addr *)daddr, - }; - - sa = (struct sockaddr *)&s_in; - sl = sizeof(s_in); - - if (IN4_ARE_ADDR_EQUAL(&s_in.sin_addr, &c->ip4.dns_match) && - ntohs(s_in.sin_port) == 53) { - s_in.sin_addr = c->ip4.dns_host; - udp_tap_map[V4][src].ts = now->tv_sec; - udp_tap_map[V4][src].flags |= PORT_DNS_FWD; - bitmap_set(udp_act[V4][UDP_ACT_TAP], src); - } else if (IN4_ARE_ADDR_EQUAL(&s_in.sin_addr, &c->ip4.gw) && - !c->no_map_gw) { - if (!(udp_tap_map[V4][dst].flags & PORT_LOCAL) || - (udp_tap_map[V4][dst].flags & PORT_LOOPBACK)) - s_in.sin_addr.s_addr = htonl(INADDR_LOOPBACK); - else - s_in.sin_addr = c->ip4.addr_seen; - } - - debug("UDP from tap src=%hu dst=%hu, s=%d", - src, dst, udp_tap_map[V4][src].sock); - if ((s = udp_tap_map[V4][src].sock) < 0) { - struct in_addr bind_addr = IN4ADDR_ANY_INIT; - union udp_epoll_ref uref = { - .port = src, - .pif = PIF_HOST, - }; - const char *bind_if = NULL; - - if (!IN4_IS_ADDR_LOOPBACK(&s_in.sin_addr)) - bind_if = c->ip4.ifname_out; - - if (!IN4_IS_ADDR_LOOPBACK(&s_in.sin_addr)) - bind_addr = c->ip4.addr_out; - - s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP, &bind_addr, - bind_if, src, uref.u32); - if (s < 0) - return p->count - idx; - - udp_tap_map[V4][src].sock = s; - bitmap_set(udp_act[V4][UDP_ACT_TAP], src); - } - - udp_tap_map[V4][src].ts = now->tv_sec; - } else { - s_in6 = (struct sockaddr_in6) { - .sin6_family = AF_INET6, - .sin6_port = uh->dest, - .sin6_addr = *(struct in6_addr *)daddr, - }; - const struct in6_addr *bind_addr = &in6addr_any; - - sa = (struct sockaddr *)&s_in6; - sl = sizeof(s_in6); - - if (IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.dns_match) && - ntohs(s_in6.sin6_port) == 53) { - s_in6.sin6_addr = c->ip6.dns_host; - udp_tap_map[V6][src].ts = now->tv_sec; - udp_tap_map[V6][src].flags |= PORT_DNS_FWD; - 
bitmap_set(udp_act[V6][UDP_ACT_TAP], src); - } else if (IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw) && - !c->no_map_gw) { - if (!(udp_tap_map[V6][dst].flags & PORT_LOCAL) || - (udp_tap_map[V6][dst].flags & PORT_LOOPBACK)) - s_in6.sin6_addr = in6addr_loopback; - else if (udp_tap_map[V6][dst].flags & PORT_GUA) - s_in6.sin6_addr = c->ip6.addr; - else - s_in6.sin6_addr = c->ip6.addr_seen; - } else if (IN6_IS_ADDR_LINKLOCAL(&s_in6.sin6_addr)) { - bind_addr = &c->ip6.addr_ll; - } - - if ((s = udp_tap_map[V6][src].sock) < 0) { - union udp_epoll_ref uref = { - .v6 = 1, - .port = src, - .pif = PIF_HOST, - }; - const char *bind_if = NULL; + tosidx = udp_flow_from_tap(c, pif, af, saddr, daddr, src, dst, now); + if (!(uflow = udp_at_sidx(tosidx))) { + char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN]; - if (!IN6_IS_ADDR_LOOPBACK(&s_in6.sin6_addr)) - bind_if = c->ip6.ifname_out; + debug("Dropping datagram with no flow %s %s:%hu -> %s:%hu", + pif_name(pif), + inet_ntop(af, saddr, sstr, sizeof(sstr)), src, + inet_ntop(af, daddr, dstr, sizeof(dstr)), dst); + return 1; + } - if (!IN6_IS_ADDR_LOOPBACK(&s_in6.sin6_addr) && - !IN6_IS_ADDR_LINKLOCAL(&s_in6.sin6_addr)) - bind_addr = &c->ip6.addr_out; + topif = pif_at_sidx(tosidx); + if (topif != PIF_HOST) { + flow_sidx_t fromsidx = flow_sidx_opposite(tosidx); + uint8_t frompif = pif_at_sidx(fromsidx); - s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP, bind_addr, - bind_if, src, uref.u32); - if (s < 0) - return p->count - idx; + flow_err(uflow, "No support for forwarding UDP from %s to %s", + pif_name(frompif), pif_name(topif)); + return 1; + } + toside = flowside_at_sidx(tosidx); - udp_tap_map[V6][src].sock = s; - bitmap_set(udp_act[V6][UDP_ACT_TAP], src); - } + s = udp_at_sidx(tosidx)->s[tosidx.sidei]; + ASSERT(s >= 0); - udp_tap_map[V6][src].ts = now->tv_sec; - } + pif_sockaddr(c, &to_sa, &sl, topif, &toside->eaddr, toside->eport); for (i = 0; i < (int)p->count - idx; i++) { struct udphdr *uh_send; @@ -1088,7 +1071,7 @@ int udp_tap_handler(struct 
ctx *c, uint8_t pif, if (!uh_send) return p->count - idx; - mm[i].msg_hdr.msg_name = sa; + mm[i].msg_hdr.msg_name = &to_sa; mm[i].msg_hdr.msg_namelen = sl; if (len) { diff --git a/udp.h b/udp.h index e133f1e7..ceaa8c54 100644 --- a/udp.h +++ b/udp.h @@ -13,8 +13,8 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events, const struct timespec *now); void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events, const struct timespec *now); -int udp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, - const void *saddr, const void *daddr, +int udp_tap_handler(const struct ctx *c, uint8_t pif, + sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, int idx, const struct timespec *now); int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, const void *addr, const char *ifname, in_port_t port); -- 2.45.2
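The find-or-create logic of udp_flow_from_tap() in the patch above can be illustrated with a toy model. This is only a sketch under loud assumptions: the real code looks flows up in a hash table keyed on the pif, both addresses and both ports, whereas this version does a linear scan over a tiny array keyed on the two ports alone. struct toy_flow, sidx() and toy_flow_from_tap() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define MAX_FLOWS	4
#define SIDX_NONE	(-1)

enum { INISIDE, TGTSIDE };

/* Hypothetical, heavily reduced flow entry */
struct toy_flow {
	int used;		/* non-zero if the slot is allocated */
	uint16_t sport;		/* source port seen on the initiating side */
	uint16_t dport;		/* destination port seen there */
	time_t ts;		/* last activity */
};

static struct toy_flow flows[MAX_FLOWS];

/* Encode (flow index, side) as a single int, loosely like a flow_sidx_t */
static int sidx(int idx, int side)
{
	return idx * 2 + side;
}

/**
 * toy_flow_from_tap() - Find or create a flow for a datagram from tap
 * @sport:	Source port of the datagram
 * @dport:	Destination port of the datagram
 * @now:	Current time in seconds
 *
 * Return: encoded index of the target side, SIDX_NONE if the table is full
 */
static int toy_flow_from_tap(uint16_t sport, uint16_t dport, time_t now)
{
	int i, free_idx = -1;

	for (i = 0; i < MAX_FLOWS; i++) {
		if (!flows[i].used) {
			if (free_idx < 0)
				free_idx = i;
			continue;
		}

		if (flows[i].sport == sport && flows[i].dport == dport) {
			/* Existing flow: refresh it, send to the other side */
			flows[i].ts = now;
			return sidx(i, TGTSIDE);
		}
	}

	if (free_idx < 0)
		return SIDX_NONE;	/* caller logs and drops the datagram */

	flows[free_idx] = (struct toy_flow){ 1, sport, dport, now };
	return sidx(free_idx, TGTSIDE);
}
```

In both the hit and the miss case the caller gets back the side the datagram should be sent to, which is what lets udp_tap_handler() take the socket and the destination address from the flow instead of recomputing them per packet.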
This replaces the last piece of existing UDP port tracking with the common flow table. Specifically use the flow table to direct datagrams from host sockets to the guest tap interface. Since this now requires a flow for every datagram, we add some logging if we encounter any datagrams for which we can't find or create a flow. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 185 ++++++++++++++++------------------------------------------ 1 file changed, 51 insertions(+), 134 deletions(-) diff --git a/udp.c b/udp.c index 2407ca86..d39acb91 100644 --- a/udp.c +++ b/udp.c @@ -73,26 +73,6 @@ * * Note that a spliced flow will have *both* a duplicated listening socket and a * reply socket (see above). - * - * Port tracking - * ============= - * - * For UDP, a reduced version of port-based connection tracking is implemented - * with two purposes: - * - binding ephemeral ports when they're used as source port by the guest, so - * that replies on those ports can be forwarded back to the guest, with a - * fixed timeout for this binding - * - packets received from the local host get their source changed to a local - * address (gateway address) so that they can be forwarded to the guest, and - * packets sent as replies by the guest need their destination address to - * be changed back to the address of the local host. This is dynamic to allow - * connections from the gateway as well, and uses the same fixed 180s timeout - * - * Sockets for bound ports are created at initialisation time, one set for IPv4 - * and one for IPv6. - * - * Packets are forwarded back and forth, by prepending and stripping UDP headers - * in the obvious way, with no port translation. 
*/ #include <sched.h> @@ -526,7 +506,6 @@ static flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref, ASSERT(ref.type == EPOLL_TYPE_UDP); - /* FIXME: Match reply packets to their flow as well */ if (!ref.udp.orig) return FLOW_SIDX_NONE; @@ -586,160 +565,87 @@ static void udp_splice_send(const struct ctx *c, size_t start, size_t n, /** * udp_update_hdr4() - Update headers for one IPv4 datagram - * @c: Execution context * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr) - * @s_in: Source socket address, filled in by recvmmsg() * @bp: Pointer to udp_payload_t to update - * @dstport: Destination port number + * @toside: Flowside for destination side * @dlen: Length of UDP payload - * @now: Current timestamp * * Return: size of IPv4 payload (UDP header + data) */ -static size_t udp_update_hdr4(const struct ctx *c, - struct iphdr *ip4h, const struct sockaddr_in *s_in, - struct udp_payload_t *bp, - in_port_t dstport, size_t dlen, - const struct timespec *now) +static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp, + const struct flowside *toside, size_t dlen) { - const struct in_addr dst = c->ip4.addr_seen; - in_port_t srcport = ntohs(s_in->sin_port); + const struct in_addr *src = inany_v4(&toside->faddr); + const struct in_addr *dst = inany_v4(&toside->eaddr); size_t l4len = dlen + sizeof(bp->uh); size_t l3len = l4len + sizeof(*ip4h); - struct in_addr src = s_in->sin_addr; - - if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match) && - IN4_ARE_ADDR_EQUAL(&src, &c->ip4.dns_host) && srcport == 53 && - (udp_tap_map[V4][dstport].flags & PORT_DNS_FWD)) { - src = c->ip4.dns_match; - } else if (IN4_IS_ADDR_LOOPBACK(&src) || - IN4_ARE_ADDR_EQUAL(&src, &c->ip4.addr_seen)) { - udp_tap_map[V4][srcport].ts = now->tv_sec; - udp_tap_map[V4][srcport].flags |= PORT_LOCAL; - if (IN4_IS_ADDR_LOOPBACK(&src)) - udp_tap_map[V4][srcport].flags |= PORT_LOOPBACK; - else - udp_tap_map[V4][srcport].flags &= ~PORT_LOOPBACK; - - 
bitmap_set(udp_act[V4][UDP_ACT_TAP], srcport); - - src = c->ip4.gw; - } + ASSERT(src && dst); ip4h->tot_len = htons(l3len); - ip4h->daddr = dst.s_addr; - ip4h->saddr = src.s_addr; - ip4h->check = csum_ip4_header(l3len, IPPROTO_UDP, src, dst); + ip4h->daddr = dst->s_addr; + ip4h->saddr = src->s_addr; + ip4h->check = csum_ip4_header(l3len, IPPROTO_UDP, *src, *dst); - bp->uh.source = s_in->sin_port; - bp->uh.dest = htons(dstport); + bp->uh.source = htons(toside->fport); + bp->uh.dest = htons(toside->eport); bp->uh.len = htons(l4len); - csum_udp4(&bp->uh, src, dst, bp->data, dlen); + csum_udp4(&bp->uh, *src, *dst, bp->data, dlen); return l4len; } /** * udp_update_hdr6() - Update headers for one IPv6 datagram - * @c: Execution context * @ip6h: Pre-filled IPv6 header (except for payload_len and addresses) - * @s_in: Source socket address, filled in by recvmmsg() * @bp: Pointer to udp_payload_t to update - * @dstport: Destination port number + * @toside: Flowside for destination side * @dlen: Length of UDP payload - * @now: Current timestamp * * Return: size of IPv6 payload (UDP header + data) */ -static size_t udp_update_hdr6(const struct ctx *c, - struct ipv6hdr *ip6h, struct sockaddr_in6 *s_in6, - struct udp_payload_t *bp, - in_port_t dstport, size_t dlen, - const struct timespec *now) +static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp, + const struct flowside *toside, size_t dlen) { - const struct in6_addr *src = &s_in6->sin6_addr; - const struct in6_addr *dst = &c->ip6.addr_seen; - in_port_t srcport = ntohs(s_in6->sin6_port); uint16_t l4len = dlen + sizeof(bp->uh); - if (IN6_IS_ADDR_LINKLOCAL(src)) { - dst = &c->ip6.addr_ll_seen; - } else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match) && - IN6_ARE_ADDR_EQUAL(src, &c->ip6.dns_host) && - srcport == 53 && - (udp_tap_map[V4][dstport].flags & PORT_DNS_FWD)) { - src = &c->ip6.dns_match; - } else if (IN6_IS_ADDR_LOOPBACK(src) || - IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr_seen) || - 
IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr)) { - udp_tap_map[V6][srcport].ts = now->tv_sec; - udp_tap_map[V6][srcport].flags |= PORT_LOCAL; - - if (IN6_IS_ADDR_LOOPBACK(src)) - udp_tap_map[V6][srcport].flags |= PORT_LOOPBACK; - else - udp_tap_map[V6][srcport].flags &= ~PORT_LOOPBACK; - - if (IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr)) - udp_tap_map[V6][srcport].flags |= PORT_GUA; - else - udp_tap_map[V6][srcport].flags &= ~PORT_GUA; - - bitmap_set(udp_act[V6][UDP_ACT_TAP], srcport); - - dst = &c->ip6.addr_ll_seen; - - if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw)) - src = &c->ip6.gw; - else - src = &c->ip6.addr_ll; - - } - ip6h->payload_len = htons(l4len); - ip6h->daddr = *dst; - ip6h->saddr = *src; + ip6h->daddr = toside->eaddr.a6; + ip6h->saddr = toside->faddr.a6; ip6h->version = 6; ip6h->nexthdr = IPPROTO_UDP; ip6h->hop_limit = 255; - bp->uh.source = s_in6->sin6_port; - bp->uh.dest = htons(dstport); + bp->uh.source = htons(toside->fport); + bp->uh.dest = htons(toside->eport); bp->uh.len = ip6h->payload_len; - csum_udp6(&bp->uh, src, dst, bp->data, dlen); + csum_udp6(&bp->uh, &toside->faddr.a6, &toside->eaddr.a6, bp->data, dlen); return l4len; } /** * udp_tap_prepare() - Convert one datagram into a tap frame - * @c: Execution context * @mmh: Receiving mmsghdr array * @idx: Index of the datagram to prepare - * @dstport: Destination port - * @v6: Prepare for IPv6? 
- * @now: Current timestamp + * @toside: Flowside for destination side */ -static void udp_tap_prepare(const struct ctx *c, const struct mmsghdr *mmh, - unsigned idx, in_port_t dstport, bool v6, - const struct timespec *now) +static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx, + const struct flowside *toside) { struct iovec (*tap_iov)[UDP_NUM_IOVS] = &udp_l2_iov[idx]; struct udp_payload_t *bp = &udp_payload[idx]; struct udp_meta_t *bm = &udp_meta[idx]; size_t l4len; - if (v6) { - l4len = udp_update_hdr6(c, &bm->ip6h, &bm->s_in.sa6, bp, - dstport, mmh[idx].msg_len, now); + if (!inany_v4(&toside->eaddr) || !inany_v4(&toside->faddr)) { + l4len = udp_update_hdr6(&bm->ip6h, bp, toside, mmh[idx].msg_len); tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) + sizeof(udp6_eth_hdr)); (*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp6_eth_hdr); (*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip6h); } else { - l4len = udp_update_hdr4(c, &bm->ip4h, &bm->s_in.sa4, bp, - dstport, mmh[idx].msg_len, now); + l4len = udp_update_hdr4(&bm->ip4h, bp, toside, mmh[idx].msg_len); tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) + sizeof(udp4_eth_hdr)); (*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp4_eth_hdr); @@ -855,17 +761,11 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve const struct timespec *now) { struct mmsghdr *mmh_recv = ref.udp.v6 ? udp6_mh_recv : udp4_mh_recv; - in_port_t dstport = ref.udp.port; int n, i; if ((n = udp_sock_recv(c, ref.fd, events, mmh_recv)) <= 0) return; - if (ref.udp.pif == PIF_SPLICE) - dstport += c->udp.fwd_out.f.delta[dstport]; - else if (ref.udp.pif == PIF_HOST) - dstport += c->udp.fwd_in.f.delta[dstport]; - /* We divide datagrams into batches based on how we need to send them, * determined by udp_meta[i].tosidx. 
To avoid either two passes through * the array, or recalculating tosidx for a single entry, we have to @@ -880,9 +780,9 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve do { if (pif_is_socket(batchpif)) { udp_splice_prepare(mmh_recv, i); - } else { - udp_tap_prepare(c, mmh_recv, i, dstport, - ref.udp.v6, now); + } else if (batchpif == PIF_TAP) { + udp_tap_prepare(mmh_recv, i, + flowside_at_sidx(batchsidx)); } if (++i >= n) @@ -896,9 +796,20 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve if (pif_is_socket(batchpif)) { udp_splice_send(c, batchstart, i - batchstart, batchsidx); - } else { + } else if (batchpif == PIF_TAP) { tap_send_frames(c, &udp_l2_iov[batchstart][0], UDP_NUM_IOVS, i - batchstart); + } else if (flow_sidx_valid(batchsidx)) { + flow_sidx_t fromsidx = flow_sidx_opposite(batchsidx); + struct udp_flow *uflow = udp_at_sidx(batchsidx); + + flow_err(uflow, + "No support for forwarding UDP from %s to %s", + pif_name(pif_at_sidx(fromsidx)), + pif_name(batchpif)); + } else { + debug("Discarding %d datagrams without flow", + i - batchstart); } } } @@ -936,14 +847,20 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, for (i = 0; i < n; i++) { if (pif_is_socket(topif)) udp_splice_prepare(mmh_recv, i); - else - udp_tap_prepare(c, mmh_recv, i, toside->eport, v6, now); + else if (topif == PIF_TAP) + udp_tap_prepare(mmh_recv, i, toside); } - if (pif_is_socket(topif)) + if (pif_is_socket(topif)) { udp_splice_send(c, 0, n, tosidx); - else + } else if (topif == PIF_TAP) { tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n); + } else { + uint8_t frompif = pif_at_sidx(ref.flowside); + + flow_err(uflow, "No support for forwarding UDP from %s to %s", + pif_name(frompif), pif_name(topif)); + } } /** -- 2.45.2
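To make the new udp_update_hdr4() in the patch above easier to follow: once a flow exists, everything needed for the headers of a frame sent to the guest comes from the destination flowside, with the forwarding address and port as source and the endpoint address and port as destination. Below is a minimal IPv4-only sketch of that mapping. The struct layouts are simplified stand-ins (passt's real flowside uses union inany_addr and the kernel header structs), checksums are left out, and all toy_* names are hypothetical.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, IPv4-only cut-down of a flowside */
struct toy_flowside {
	struct in_addr faddr;	/* forwarding address: source of emitted frames */
	struct in_addr eaddr;	/* endpoint address: destination of those frames */
	uint16_t fport;		/* forwarding port, host order */
	uint16_t eport;		/* endpoint port, host order */
};

/* Only the address fields of an IPv4 header, network order */
struct toy_ip4 {
	uint32_t saddr;
	uint32_t daddr;
};

/* Same field layout as a UDP header, network order */
struct toy_udphdr {
	uint16_t source;
	uint16_t dest;
	uint16_t len;
	uint16_t check;
};

/**
 * toy_update_hdr4() - Fill in addresses, ports and length for one datagram
 * @ip4h:	IPv4 address fields to fill in
 * @uh:		UDP header to fill in
 * @toside:	Flowside for the destination side
 * @dlen:	Length of UDP payload
 *
 * Return: size of IPv4 payload (UDP header + data)
 */
static size_t toy_update_hdr4(struct toy_ip4 *ip4h, struct toy_udphdr *uh,
			      const struct toy_flowside *toside, size_t dlen)
{
	size_t l4len = dlen + sizeof(*uh);

	assert(l4len <= UINT16_MAX);

	ip4h->saddr = toside->faddr.s_addr;
	ip4h->daddr = toside->eaddr.s_addr;

	uh->source = htons(toside->fport);
	uh->dest = htons(toside->eport);
	uh->len = htons((uint16_t)l4len);
	uh->check = 0;	/* checksum calculation omitted in this sketch */

	return l4len;
}
```

Note that nothing here depends on the receiving socket or on global configuration, which is why the patch can drop the @c, @s_in, @dstport and @now parameters.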
Now that UDP datagrams are all directed via the flow table, we no longer use the udp_tap_map[] or udp_act[] arrays. Remove them and connected code. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 92 +---------------------------------------------------------- 1 file changed, 1 insertion(+), 91 deletions(-) diff --git a/udp.c b/udp.c index d39acb91..0a17d1bb 100644 --- a/udp.c +++ b/udp.c @@ -114,38 +114,10 @@ #define UDP_CONN_TIMEOUT 180 /* s, timeout for ephemeral or local bind */ #define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */ -/** - * struct udp_tap_port - Port tracking based on tap-facing source port - * @sock: Socket bound to source port used as index - * @flags: Flags for recent activity type seen from/to port - * @ts: Activity timestamp from tap, used for socket aging - */ -struct udp_tap_port { - int sock; - uint8_t flags; -#define PORT_LOCAL BIT(0) /* Port was contacted from local address */ -#define PORT_LOOPBACK BIT(1) /* Port was contacted from loopback address */ -#define PORT_GUA BIT(2) /* Port was contacted from global unicast */ -#define PORT_DNS_FWD BIT(3) /* Port used as source for DNS remapped query */ - - time_t ts; -}; - -/* Port tracking, arrays indexed by packet source port (host order) */ -static struct udp_tap_port udp_tap_map [IP_VERSIONS][NUM_PORTS]; - /* "Spliced" sockets indexed by bound port (host order) */ static int udp_splice_ns [IP_VERSIONS][NUM_PORTS]; static int udp_splice_init[IP_VERSIONS][NUM_PORTS]; -enum udp_act_type { - UDP_ACT_TAP, - UDP_ACT_TYPE_MAX, -}; - -/* Activity-based aging for bindings */ -static uint8_t udp_act[IP_VERSIONS][UDP_ACT_TYPE_MAX][DIV_ROUND_UP(NUM_PORTS, 8)]; - /* Static buffers */ /** @@ -228,7 +200,6 @@ void udp_portmap_clear(void) unsigned i; for (i = 0; i < NUM_PORTS; i++) { - udp_tap_map[V4][i].sock = udp_tap_map[V6][i].sock = -1; udp_splice_ns[V4][i] = udp_splice_ns[V6][i] = -1; udp_splice_init[V4][i] = udp_splice_init[V6][i] = -1; } @@ -1047,7 +1018,6 @@
int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP, addr, ifname, port, uref.u32); - udp_tap_map[V4][port].sock = s < 0 ? -1 : s; udp_splice_init[V4][port] = s < 0 ? -1 : s; } else { r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP, @@ -1064,7 +1034,6 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP, addr, ifname, port, uref.u32); - udp_tap_map[V6][port].sock = s < 0 ? -1 : s; udp_splice_init[V6][port] = s < 0 ? -1 : s; } else { r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP, @@ -1100,43 +1069,6 @@ static void udp_splice_iov_init(void) } } -/** - * udp_timer_one() - Handler for timed events on one port - * @c: Execution context - * @v6: Set for IPv6 connections - * @type: Socket type - * @port: Port number, host order - * @now: Current timestamp - */ -static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, - in_port_t port, const struct timespec *now) -{ - struct udp_tap_port *tp; - int *sockp = NULL; - - switch (type) { - case UDP_ACT_TAP: - tp = &udp_tap_map[v6 ? V6 : V4][port]; - - if (now->tv_sec - tp->ts > UDP_CONN_TIMEOUT) { - sockp = &tp->sock; - tp->flags = 0; - } - - break; - default: - return; - } - - if (sockp && *sockp >= 0) { - int s = *sockp; - *sockp = -1; - epoll_ctl(c->epollfd, EPOLL_CTL_DEL, s, NULL); - close(s); - bitmap_clear(udp_act[v6 ? V6 : V4][type], port); - } -} - /** * udp_port_rebind() - Rebind ports to match forward maps * @c: Execution context @@ -1221,9 +1153,7 @@ bool udp_flow_timer(const struct ctx *c, struct udp_flow *uflow, */ void udp_timer(struct ctx *c, const struct timespec *now) { - int n, t, v6 = 0; - unsigned int i; - long *word, tmp; + (void)now; ASSERT(!c->no_udp); @@ -1240,26 +1170,6 @@ void udp_timer(struct ctx *c, const struct timespec *now) udp_port_rebind(c, false); } } - - if (!c->ifi4) - v6 = 1; -v6: - for (t = 0; t < UDP_ACT_TYPE_MAX; t++) { - word = (long *)udp_act[v6 ? 
V6 : V4][t]; - for (i = 0; i < ARRAY_SIZE(udp_act[0][0]); - i += sizeof(long), word++) { - tmp = *word; - while ((n = ffsl(tmp))) { - tmp &= ~(1UL << (n - 1)); - udp_timer_one(c, v6, t, i * 8 + n - 1, now); - } - } - } - - if (!v6 && c->ifi6) { - v6 = 1; - goto v6; - } } /** -- 2.45.2
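For readers who want the shape of what this patch deletes: the old udp_timer() walked an activity bitmap with one bit per port and ran a per-port handler for every set bit. The following is a minimal, simplified sketch of that pattern only. port_mark() and ports_expire() are hypothetical names, and unlike the removed udp_timer_one() this version clears each bit unconditionally instead of checking a timeout and closing a socket.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NUM_PORTS	65536
#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define MAP_WORDS	(NUM_PORTS / WORD_BITS)

/* Hypothetical helper: flag a port as recently active */
static void port_mark(unsigned long *map, unsigned port)
{
	map[port / WORD_BITS] |= 1UL << (port % WORD_BITS);
}

/* Hypothetical helper: clear every flagged port, storing up to @max of them
 * in @out, lowest port first.  Returns the number of bits cleared.
 */
static size_t ports_expire(unsigned long *map, unsigned *out, size_t max)
{
	size_t w, n = 0;

	for (w = 0; w < MAP_WORDS; w++) {
		unsigned bit;

		/* A word with no activity fails the loop test immediately */
		for (bit = 0; map[w]; bit++) {
			unsigned long mask = 1UL << bit;

			if (!(map[w] & mask))
				continue;

			map[w] &= ~mask;
			if (n < max)
				out[n] = (unsigned)(w * WORD_BITS + bit);
			n++;
		}
	}

	return n;
}
```

With per-flow timers in the flow table, nothing needs to scan all 65536 ports on every timer tick, which is why the bitmap and its loop can go.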
In addition to the struct fwd_ports used by both UDP and TCP to track
port forwarding, UDP also included an 'rdelta' field, which contained the
reverse mapping of the main port map.  This was used so that we could
properly direct reply packets to a forwarded packet where we change the
destination port.  This has now been taken over by the flow table: reply
packets will match the flow of the originating packet, and that gives the
correct ports on the originating side.

So, eliminate the rdelta field, and with it struct udp_fwd_ports, which
now has no additional information over struct fwd_ports.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 conf.c | 14 +++++++-------
 fwd.c  | 24 ++++++++++++------------
 udp.c  | 42 ++++++------------------------------------
 udp.h  | 14 ++------------
 4 files changed, 27 insertions(+), 67 deletions(-)

diff --git a/conf.c b/conf.c
index 629eb897..3cf9ed87 100644
--- a/conf.c
+++ b/conf.c
@@ -1248,7 +1248,7 @@ void conf(struct ctx *c, int argc, char **argv)
 	}

 	c->tcp.fwd_in.mode = c->tcp.fwd_out.mode = FWD_UNSET;
-	c->udp.fwd_in.f.mode = c->udp.fwd_out.f.mode = FWD_UNSET;
+	c->udp.fwd_in.mode = c->udp.fwd_out.mode = FWD_UNSET;

 	do {
 		name = getopt_long(argc, argv, optstring, options, NULL);
@@ -1664,7 +1664,7 @@ void conf(struct ctx *c, int argc, char **argv)
 		if (name == 't')
 			conf_ports(c, name, optarg, &c->tcp.fwd_in);
 		else if (name == 'u')
-			conf_ports(c, name, optarg, &c->udp.fwd_in.f);
+			conf_ports(c, name, optarg, &c->udp.fwd_in);
 	} while (name != -1);

 	if (c->mode == MODE_PASTA)
@@ -1699,7 +1699,7 @@ void conf(struct ctx *c, int argc, char **argv)
 		if (name == 'T')
 			conf_ports(c, name, optarg, &c->tcp.fwd_out);
 		else if (name == 'U')
-			conf_ports(c, name, optarg, &c->udp.fwd_out.f);
+			conf_ports(c, name, optarg, &c->udp.fwd_out);
 	} while (name != -1);

 	if (!c->ifi4)
@@ -1726,10 +1726,10 @@ void conf(struct ctx *c, int argc, char **argv)
 		c->tcp.fwd_in.mode = fwd_default;
 	if (!c->tcp.fwd_out.mode)
 		c->tcp.fwd_out.mode = fwd_default;
-	if (!c->udp.fwd_in.f.mode)
-		c->udp.fwd_in.f.mode = fwd_default;
-	if (!c->udp.fwd_out.f.mode)
-		c->udp.fwd_out.f.mode = fwd_default;
+	if (!c->udp.fwd_in.mode)
+		c->udp.fwd_in.mode = fwd_default;
+	if (!c->udp.fwd_out.mode)
+		c->udp.fwd_out.mode = fwd_default;

 	fwd_scan_ports_init(c);

diff --git a/fwd.c b/fwd.c
index a70ebfd8..8c1f3d91 100644
--- a/fwd.c
+++ b/fwd.c
@@ -129,18 +129,18 @@ void fwd_scan_ports_init(struct ctx *c)

 	c->tcp.fwd_in.scan4 = c->tcp.fwd_in.scan6 = -1;
 	c->tcp.fwd_out.scan4 = c->tcp.fwd_out.scan6 = -1;
-	c->udp.fwd_in.f.scan4 = c->udp.fwd_in.f.scan6 = -1;
-	c->udp.fwd_out.f.scan4 = c->udp.fwd_out.f.scan6 = -1;
+	c->udp.fwd_in.scan4 = c->udp.fwd_in.scan6 = -1;
+	c->udp.fwd_out.scan4 = c->udp.fwd_out.scan6 = -1;

 	if (c->tcp.fwd_in.mode == FWD_AUTO) {
 		c->tcp.fwd_in.scan4 = open_in_ns(c, "/proc/net/tcp", flags);
 		c->tcp.fwd_in.scan6 = open_in_ns(c, "/proc/net/tcp6", flags);
 		fwd_scan_ports_tcp(&c->tcp.fwd_in, &c->tcp.fwd_out);
 	}
-	if (c->udp.fwd_in.f.mode == FWD_AUTO) {
-		c->udp.fwd_in.f.scan4 = open_in_ns(c, "/proc/net/udp", flags);
-		c->udp.fwd_in.f.scan6 = open_in_ns(c, "/proc/net/udp6", flags);
-		fwd_scan_ports_udp(&c->udp.fwd_in.f, &c->udp.fwd_out.f,
+	if (c->udp.fwd_in.mode == FWD_AUTO) {
+		c->udp.fwd_in.scan4 = open_in_ns(c, "/proc/net/udp", flags);
+		c->udp.fwd_in.scan6 = open_in_ns(c, "/proc/net/udp6", flags);
+		fwd_scan_ports_udp(&c->udp.fwd_in, &c->udp.fwd_out,
 				   &c->tcp.fwd_in, &c->tcp.fwd_out);
 	}
 	if (c->tcp.fwd_out.mode == FWD_AUTO) {
@@ -148,10 +148,10 @@ void fwd_scan_ports_init(struct ctx *c)
 		c->tcp.fwd_out.scan6 = open("/proc/net/tcp6", flags);
 		fwd_scan_ports_tcp(&c->tcp.fwd_out, &c->tcp.fwd_in);
 	}
-	if (c->udp.fwd_out.f.mode == FWD_AUTO) {
-		c->udp.fwd_out.f.scan4 = open("/proc/net/udp", flags);
-		c->udp.fwd_out.f.scan6 = open("/proc/net/udp6", flags);
-		fwd_scan_ports_udp(&c->udp.fwd_out.f, &c->udp.fwd_in.f,
+	if (c->udp.fwd_out.mode == FWD_AUTO) {
+		c->udp.fwd_out.scan4 = open("/proc/net/udp", flags);
+		c->udp.fwd_out.scan6 = open("/proc/net/udp6", flags);
+		fwd_scan_ports_udp(&c->udp.fwd_out, &c->udp.fwd_in,
 				   &c->tcp.fwd_out, &c->tcp.fwd_in);
 	}
 }
@@ -242,7 +242,7 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
 	if (proto == IPPROTO_TCP)
 		tgt->eport += c->tcp.fwd_out.delta[tgt->eport];
 	else if (proto == IPPROTO_UDP)
-		tgt->eport += c->udp.fwd_out.f.delta[tgt->eport];
+		tgt->eport += c->udp.fwd_out.delta[tgt->eport];

 	/* Let the kernel pick a host side source port */
 	tgt->fport = 0;
@@ -271,7 +271,7 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
 	if (proto == IPPROTO_TCP)
 		tgt->eport += c->tcp.fwd_in.delta[tgt->eport];
 	else if (proto == IPPROTO_UDP)
-		tgt->eport += c->udp.fwd_in.f.delta[tgt->eport];
+		tgt->eport += c->udp.fwd_in.delta[tgt->eport];

 	if (c->mode == MODE_PASTA && inany_is_loopback(&ini->eaddr) &&
 	    (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
diff --git a/udp.c b/udp.c
index 0a17d1bb..4d612c31 100644
--- a/udp.c
+++ b/udp.c
@@ -205,33 +205,6 @@ void udp_portmap_clear(void)
 	}
 }

-/**
- * udp_invert_portmap() - Compute reverse port translations for return packets
- * @fwd:	Port forwarding configuration to compute reverse map for
- */
-static void udp_invert_portmap(struct udp_fwd_ports *fwd)
-{
-	unsigned int i;
-
-	static_assert(ARRAY_SIZE(fwd->f.delta) == ARRAY_SIZE(fwd->rdelta),
-		      "Forward and reverse delta arrays must have same size");
-	for (i = 0; i < ARRAY_SIZE(fwd->f.delta); i++) {
-		in_port_t delta = fwd->f.delta[i];
-
-		if (delta) {
-			/* Keep rport calculation separate from its usage: we
-			 * need to perform the sum in in_port_t width (that is,
-			 * modulo 65536), but C promotion rules would sum the
-			 * two terms as 'int', if we just open-coded the array
-			 * index as 'i + delta'.
-			 */
-			in_port_t rport = i + delta;
-
-			fwd->rdelta[rport] = NUM_PORTS - delta;
-		}
-	}
-}
-
 /**
  * udp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses
  * @eth_d:	Ethernet destination address, NULL if unchanged
@@ -1080,9 +1053,9 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
 {
 	int (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init;
 	const uint8_t *fmap
-		= outbound ? c->udp.fwd_out.f.map : c->udp.fwd_in.f.map;
+		= outbound ? c->udp.fwd_out.map : c->udp.fwd_in.map;
 	const uint8_t *rmap
-		= outbound ? c->udp.fwd_in.f.map : c->udp.fwd_out.f.map;
+		= outbound ? c->udp.fwd_in.map : c->udp.fwd_out.map;
 	unsigned port;

 	for (port = 0; port < NUM_PORTS; port++) {
@@ -1158,14 +1131,14 @@ void udp_timer(struct ctx *c, const struct timespec *now)
 	ASSERT(!c->no_udp);

 	if (c->mode == MODE_PASTA) {
-		if (c->udp.fwd_out.f.mode == FWD_AUTO) {
-			fwd_scan_ports_udp(&c->udp.fwd_out.f, &c->udp.fwd_in.f,
+		if (c->udp.fwd_out.mode == FWD_AUTO) {
+			fwd_scan_ports_udp(&c->udp.fwd_out, &c->udp.fwd_in,
 					   &c->tcp.fwd_out, &c->tcp.fwd_in);
 			NS_CALL(udp_port_rebind_outbound, c);
 		}

-		if (c->udp.fwd_in.f.mode == FWD_AUTO) {
-			fwd_scan_ports_udp(&c->udp.fwd_in.f, &c->udp.fwd_out.f,
+		if (c->udp.fwd_in.mode == FWD_AUTO) {
+			fwd_scan_ports_udp(&c->udp.fwd_in, &c->udp.fwd_out,
 					   &c->tcp.fwd_in, &c->tcp.fwd_out);
 			udp_port_rebind(c, false);
 		}
@@ -1184,9 +1157,6 @@ int udp_init(struct ctx *c)

 	udp_iov_init(c);

-	udp_invert_portmap(&c->udp.fwd_in);
-	udp_invert_portmap(&c->udp.fwd_out);
-
 	if (c->mode == MODE_PASTA) {
 		udp_splice_iov_init();
 		NS_CALL(udp_port_rebind_outbound, c);
diff --git a/udp.h b/udp.h
index ceaa8c54..c81ef290 100644
--- a/udp.h
+++ b/udp.h
@@ -41,16 +41,6 @@ union udp_epoll_ref {
 };


-/**
- * udp_fwd_ports - UDP specific port forwarding configuration
- * @f:		Generic forwarding configuration
- * @rdelta:	Reversed delta map to translate source ports on return packets
- */
-struct udp_fwd_ports {
-	struct fwd_ports f;
-	in_port_t rdelta[NUM_PORTS];
-};
-
 /**
  * struct udp_ctx - Execution context for UDP
  * @fwd_in:		Port forwarding configuration for inbound packets
@@ -58,8 +48,8 @@ struct udp_fwd_ports {
  * @timer_run:		Timestamp of most recent timer run
  */
 struct udp_ctx {
-	struct udp_fwd_ports fwd_in;
-	struct udp_fwd_ports fwd_out;
+	struct fwd_ports fwd_in;
+	struct fwd_ports fwd_out;
 	struct timespec timer_run;
 };

-- 
2.45.2
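To make the role of the removed 'rdelta' map concrete, here is a small standalone sketch of the inversion arithmetic. The names struct portmap_sketch, invert_portmap(), fwd_port() and rev_port() are hypothetical and not part of passt; uint16_t stands in for in_port_t. Only the modulo-65536 arithmetic mirrors the removed udp_invert_portmap().

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PORTS	(1U << 16)

/* Hypothetical stand-in for a forward map plus its reverse, not passt's */
struct portmap_sketch {
	uint16_t delta[NUM_PORTS];	/* offset added to a forwarded port */
	uint16_t rdelta[NUM_PORTS];	/* offset that undoes it on replies */
};

/* Fill in rdelta[] so that rev_port(fwd_port(p)) == p for mapped ports */
static void invert_portmap(struct portmap_sketch *m)
{
	unsigned int i;

	for (i = 0; i < NUM_PORTS; i++) {
		uint16_t delta = m->delta[i];

		if (delta) {
			/* Sum in 16-bit width, so it wraps modulo 65536 */
			uint16_t rport = (uint16_t)(i + delta);

			m->rdelta[rport] = (uint16_t)(NUM_PORTS - delta);
		}
	}
}

/* Translate a destination port on the way in */
static uint16_t fwd_port(const struct portmap_sketch *m, uint16_t port)
{
	return (uint16_t)(port + m->delta[port]);
}

/* Translate a reply's source port back to what the peer expects */
static uint16_t rev_port(const struct portmap_sketch *m, uint16_t port)
{
	return (uint16_t)(port + m->rdelta[port]);
}
```

For example, mapping port 8080 to port 80 means delta[8080] holds 80 - 8080 modulo 65536, and after invert_portmap() a reply from port 80 translates back to 8080. With the flow table, the reply simply matches the existing flow, which already records the ports on the originating side, so no second array is needed.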
EPOLL_TYPE_UDP is now only used for "listening" sockets: long-lived
sockets which can initiate new flows.  Rename to EPOLL_TYPE_UDP_LISTEN
and associated functions to match.  Along with that, remove the .orig
field from union udp_listen_epoll_ref, since it is now always true.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 epoll_type.h |  4 ++--
 passt.c      |  6 +++---
 passt.h      |  2 +-
 udp.c        | 25 +++++++++++--------------
 udp.h        | 12 +++++-------
 util.c       |  2 +-
 6 files changed, 23 insertions(+), 28 deletions(-)

diff --git a/epoll_type.h b/epoll_type.h
index 7a752ed1..0ad1efa0 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -20,8 +20,8 @@ enum epoll_type {
 	EPOLL_TYPE_TCP_LISTEN,
 	/* timerfds used for TCP timers */
 	EPOLL_TYPE_TCP_TIMER,
-	/* UDP sockets */
-	EPOLL_TYPE_UDP,
+	/* UDP "listening" sockets */
+	EPOLL_TYPE_UDP_LISTEN,
 	/* UDP socket for replies on a specific flow */
 	EPOLL_TYPE_UDP_REPLY,
 	/* ICMP/ICMPv6 ping sockets */
diff --git a/passt.c b/passt.c
index f9405bee..eed74ec9 100644
--- a/passt.c
+++ b/passt.c
@@ -66,7 +66,7 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_TCP_SPLICE]		= "connected spliced TCP socket",
 	[EPOLL_TYPE_TCP_LISTEN]		= "listening TCP socket",
 	[EPOLL_TYPE_TCP_TIMER]		= "TCP timer",
-	[EPOLL_TYPE_UDP]		= "UDP socket",
+	[EPOLL_TYPE_UDP_LISTEN]		= "listening UDP socket",
 	[EPOLL_TYPE_UDP_REPLY]		= "UDP reply socket",
 	[EPOLL_TYPE_PING]		= "ICMP/ICMPv6 ping socket",
 	[EPOLL_TYPE_NSQUIT_INOTIFY]	= "namespace inotify watch",
@@ -347,8 +347,8 @@ loop:
 		case EPOLL_TYPE_TCP_TIMER:
 			tcp_timer_handler(&c, ref);
 			break;
-		case EPOLL_TYPE_UDP:
-			udp_buf_sock_handler(&c, ref, eventmask, &now);
+		case EPOLL_TYPE_UDP_LISTEN:
+			udp_listen_sock_handler(&c, ref, eventmask, &now);
 			break;
 		case EPOLL_TYPE_UDP_REPLY:
 			udp_reply_sock_handler(&c, ref, eventmask, &now);
diff --git a/passt.h b/passt.h
index 0d76b498..4cc2b6f0 100644
--- a/passt.h
+++ b/passt.h
@@ -48,7 +48,7 @@ union epoll_ref {
 			uint32_t flow;
 			flow_sidx_t flowside;
 			union tcp_listen_epoll_ref tcp_listen;
-			union udp_epoll_ref udp;
+			union udp_listen_epoll_ref udp;
 			uint32_t data;
 			int nsdir_fd;
 		};
diff --git a/udp.c b/udp.c
index 4d612c31..a9201480 100644
--- a/udp.c
+++ b/udp.c
@@ -448,10 +448,7 @@ static flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref,
 	union flow *flow;
 	flow_sidx_t sidx;

-	ASSERT(ref.type == EPOLL_TYPE_UDP);
-
-	if (!ref.udp.orig)
-		return FLOW_SIDX_NONE;
+	ASSERT(ref.type == EPOLL_TYPE_UDP_LISTEN);

 	sidx = flow_lookup_sa(c, IPPROTO_UDP, ref.udp.pif, &meta->s_in, ref.udp.port);
 	if ((uflow = udp_at_sidx(sidx))) {
@@ -693,7 +690,7 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
 }

 /**
- * udp_buf_sock_handler() - Handle new data from socket
+ * udp_listen_sock_handler() - Handle new data from socket
  * @c:		Execution context
  * @ref:	epoll reference
  * @events:	epoll events bitmap
@@ -701,8 +698,8 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
  *
  * #syscalls recvmmsg
  */
-void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
-			  const struct timespec *now)
+void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
+			     uint32_t events, const struct timespec *now)
 {
 	struct mmsghdr *mmh_recv = ref.udp.v6 ? udp6_mh_recv : udp4_mh_recv;
 	int n, i;
@@ -974,7 +971,7 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
 int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
 		  const void *addr, const char *ifname, in_port_t port)
 {
-	union udp_epoll_ref uref = { .orig = true, .port = port };
+	union udp_listen_epoll_ref uref = { .port = port };
 	int s, r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;

 	ASSERT(!c->no_udp);
@@ -988,12 +985,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
 		uref.v6 = 0;

 		if (!ns) {
-			r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP, addr,
-					 ifname, port, uref.u32);
+			r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP_LISTEN,
+					 addr, ifname, port, uref.u32);

 			udp_splice_init[V4][port] = s < 0 ? -1 : s;
 		} else {
-			r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP,
+			r4 = s = sock_l4(c, AF_INET, EPOLL_TYPE_UDP_LISTEN,
 					 &in4addr_loopback,
 					 ifname, port, uref.u32);
 			udp_splice_ns[V4][port] = s < 0 ? -1 : s;
@@ -1004,12 +1001,12 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
 		uref.v6 = 1;

 		if (!ns) {
-			r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP, addr,
-					 ifname, port, uref.u32);
+			r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP_LISTEN,
+					 addr, ifname, port, uref.u32);

 			udp_splice_init[V6][port] = s < 0 ? -1 : s;
 		} else {
-			r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP,
+			r6 = s = sock_l4(c, AF_INET6, EPOLL_TYPE_UDP_LISTEN,
 					 &in6addr_loopback,
 					 ifname, port, uref.u32);
 			udp_splice_ns[V6][port] = s < 0 ? -1 : s;
diff --git a/udp.h b/udp.h
index c81ef290..fb42e1c5 100644
--- a/udp.h
+++ b/udp.h
@@ -9,8 +9,8 @@
 #define UDP_TIMER_INTERVAL		1000 /* ms */

 void udp_portmap_clear(void);
-void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref,
-			  uint32_t events, const struct timespec *now);
+void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
+			     uint32_t events, const struct timespec *now);
 void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 			    uint32_t events, const struct timespec *now);
 int udp_tap_handler(const struct ctx *c, uint8_t pif,
@@ -23,19 +23,17 @@ void udp_timer(struct ctx *c, const struct timespec *now);
 void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);

 /**
- * union udp_epoll_ref - epoll reference portion for TCP connections
+ * union udp_listen_epoll_ref - epoll reference for "listening" UDP sockets
  * @port:		Source port for connected sockets, bound port otherwise
  * @pif:		pif for this socket
- * @orig:		Set if a spliced socket which can originate "connections"
  * @v6:			Set for IPv6 sockets or connections
  * @u32:		Opaque u32 value of reference
  */
-union udp_epoll_ref {
+union udp_listen_epoll_ref {
 	struct {
 		in_port_t	port;
 		uint8_t		pif;
-		bool		orig:1,
-				v6:1;
+		bool		v6:1;
 	};
 	uint32_t u32;
 };
diff --git a/util.c b/util.c
index 8dc8ff76..c275b14e 100644
--- a/util.c
+++ b/util.c
@@ -61,7 +61,7 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 		proto = IPPROTO_TCP;
 		socktype = SOCK_STREAM | SOCK_NONBLOCK;
 		break;
-	case EPOLL_TYPE_UDP:
+	case EPOLL_TYPE_UDP_LISTEN:
 	case EPOLL_TYPE_UDP_REPLY:
 		proto = IPPROTO_UDP;
 		socktype = SOCK_DGRAM | SOCK_NONBLOCK;
-- 
2.45.2
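As an illustration of why the reference is a union with a u32 member: the per-socket fields are overlaid on a single 32-bit value, which is what gets handed to sock_l4() as uref.u32 in the diff above. The sketch below is a standalone approximation. The names listen_ref_sketch, listen_ref_pack() and listen_ref_unpack() are hypothetical, uint16_t stands in for in_port_t, and the 4-byte layout is an assumption about common GCC/Clang ABIs that the static_assert checks at compile time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the field layout after this patch (no 'orig') */
union listen_ref_sketch {
	struct {
		uint16_t	port;	/* bound port, host order */
		uint8_t		pif;	/* interface the socket belongs to */
		bool		v6:1;	/* set for IPv6 sockets */
	};
	uint32_t u32;			/* opaque view of the same bits */
};

static_assert(sizeof(union listen_ref_sketch) == sizeof(uint32_t),
	      "reference must fit the 32-bit slot");

/* Build the opaque 32-bit value from its fields */
static uint32_t listen_ref_pack(uint16_t port, uint8_t pif, bool v6)
{
	union listen_ref_sketch ref = { .u32 = 0 };

	ref.port = port;
	ref.pif = pif;
	ref.v6 = v6;

	return ref.u32;
}

/* Recover the fields from the opaque value */
static union listen_ref_sketch listen_ref_unpack(uint32_t u32)
{
	union listen_ref_sketch ref = { .u32 = u32 };

	return ref;
}
```

Dropping the 'orig' bit does not change the size of the reference; it only removes a flag that, per the commit message, would now always be true for this socket type.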
On Thu, 18 Jul 2024 15:26:26 +1000
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> This is the seventh draft of an implementation of more general
> "connection" tracking, as described at:
>     https://pad.passt.top/p/NewForwardingModel
>
> This series changes the TCP connection table and hash table into a
> more general flow table that can track other protocols as well.  Each
> flow uniformly keeps track of all the relevant addresses and ports,
> which will allow for more robust control of NAT and port forwarding.
>
> ICMP and UDP are converted to use the new flow table.
>
> This is based on the recent series of UDP flow table preliminaries.
>
> Caveats:
>  * We roughly double the size of a connection/flow entry
>  * We don't yet record the local address of flows initiated from a
>    socket, even in cases where it's bound to a specific address.
>
> Changes since v7:
>  * Rebase
>  * Fix unintended regression in forwarding logic (we weren't applying
>    map_gw logic to DNS packets, if they didn't hit explicit DNS
>    forwarding rules).
>  * Remove return value from pif_sockaddr(), in turned out not to be
>    very useful.
>  * More robust discarding of datagrams received between bind() and
>    connect() on UDP reply sockets.
>  * Avoid the name 'fside' for variables which was confusing in some
>    contexts
>  * Assorted minor changes based on feedback.

Applied (!)

-- 
Stefano
On Fri, Jul 19, 2024 at 09:20:27PM +0200, Stefano Brivio wrote:
> On Thu, 18 Jul 2024 15:26:26 +1000
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>
> > This is the seventh draft of an implementation of more general
> > "connection" tracking, as described at:
> >     https://pad.passt.top/p/NewForwardingModel
> >
> > This series changes the TCP connection table and hash table into a
> > more general flow table that can track other protocols as well.  Each
> > flow uniformly keeps track of all the relevant addresses and ports,
> > which will allow for more robust control of NAT and port forwarding.
> >
> > ICMP and UDP are converted to use the new flow table.
> >
> > This is based on the recent series of UDP flow table preliminaries.
> >
> > Caveats:
> >  * We roughly double the size of a connection/flow entry
> >  * We don't yet record the local address of flows initiated from a
> >    socket, even in cases where it's bound to a specific address.
> >
> > Changes since v7:
> >  * Rebase
> >  * Fix unintended regression in forwarding logic (we weren't applying
> >    map_gw logic to DNS packets, if they didn't hit explicit DNS
> >    forwarding rules).
> >  * Remove return value from pif_sockaddr(), in turned out not to be
> >    very useful.
> >  * More robust discarding of datagrams received between bind() and
> >    connect() on UDP reply sockets.
> >  * Avoid the name 'fside' for variables which was confusing in some
> >    contexts
> >  * Assorted minor changes based on feedback.
>
> Applied (!)

🎉

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson