This is a sixth draft of an implementation of more general "connection" tracking, as described at: https://pad.passt.top/p/NewForwardingModel This series changes the TCP connection table and hash table into a more general flow table that can track other protocols as well. Each flow uniformly keeps track of all the relevant addresses and ports, which will allow for more robust control of NAT and port forwarding. ICMP and UDP are converted to use the new flow table. This is based on the most recent series of flowtable preliminaries. NOTE: this fails the pasta UDP perf tests. This is one aspect of a curly issue I'm not sure how to deail with yet. So, this isn't ready for merge yet, but I'm posting because there have been a *lot* of changes since v5, and it could do with another round of review, Other caveats: * We roughly double the size of a connection/flow entry * UDP still has a number of forwarding bugs because we don't consider the local address when looking for a socket to use. There are at least some FIXMEs noted for this. Changes since v5: * flowside_from_af() is now static * Small fixes to state verification * Pass protocol specific types into deferred/timer callbacks * No longer require complete forwarding address info for the hash table (we won't have it for UDP) * Fix bugs with logging of flow addresses * Make sure to initialise sin_zero field sockaddr_from_inany * Added patch better typing parameters to flow type specific callbacks * Terminology change "forwarded side" to "target side" * Assorted wording and style tweaks based on Stefano's review * Fold introduction of struct flowside and populating the initiating side together * Manage outbound addresses via the flow table as well * Support for UDP * Correct type of 'b' in flowside_lookup() (was a signed int) Changes since v4: * flowside_from_af() no longer fills in unspecified addresses when passed NULL * Split and rename flow hash lookup function * Clarified flow state transitions, and enforced where practical * Made side 0 always the initiating side of a flow, rather than letting the protocol specific code decide * Separated pifs from flowside addresses to allow better structure packing Changes since v3: * Complex rebase on top of the many things that have happened upstream since v2. * Assorted other changes. * Replace TAPFSIDE() and SOCKFSIDE() macros with local variables. Changes since v2: * Cosmetic fixes based on review * Extra doc comments for enum flow_type * Rename flowside to flowaddrs which turns out to make more sense in light of future changes * Fix bug where the socket flowaddrs for tap initiated connections wasn't initialised to match the socket address we were using in the case of map-gw NAT * New flowaddrs_from_sock() helper used in most cases which is cleaner and should avoid bugs like the above * Using newer centralised workarounds for clang-tidy issue 58992 * Remove duplicate definition of FLOW_MAX as maximum flow type and maximum number of tracked flows * Rebased on newer versions of preliminary work (ICMP, flow based dispatch and allocation, bind/address cleanups) * Unified hash table as well as base flow table * Integrated ICMP Changes since v1: * Terminology changes - "Endpoint" address/port instead of "correspondent" address/port - "flowside" instead of "demiflow" * Actually move the connection table to a new flow table structure in new files * Significant rearrangement of earlier patchs on top of that new table, to reduce churn David Gibson (26): flow: Common address information for initiating side flow: Common address information for target side tcp, flow: Remove redundant information, repack connection structures tcp: Obtain guest address from flowside tcp: Manage outbound address via flow table tcp: Simplify endpoint validation using flowside information tcp_splice: Eliminate SPLICE_V6 flag tcp, flow: Replace TCP specific hash function with general flow hash flow, tcp: Generalise TCP hash table to general flow hash table tcp: Re-use flow hash for initial sequence number generation icmp: Remove redundant id field from flow table entry icmp: Obtain destination addresses from the flowsides icmp: Look up ping flows using flow hash icmp: Eliminate icmp_id_map icmp: Manage outbound socket address via flow table flow, tcp: Flow based NAT and port forwarding for TCP flow, icmp: Use general flow forwarding rules for ICMP fwd: Update flow forwarding logic for UDP udp: Create flow table entries for UDP udp: Direct traffic from tap according to flow table udp: Direct traffic from host to guest tap according to flow table udp: Direct spliced traffic according to flow table udp: Remove 'splicesrc' tracking udp: Remove tap port flags field udp: Remove rdelta port forwarding maps udp: Eliminate 'splice' flag from epoll reference Makefile | 2 +- conf.c | 14 +- flow.c | 402 +++++++++++++++++++++++++- flow.h | 45 +++ flow_table.h | 45 ++- fwd.c | 177 +++++++++++- fwd.h | 9 + icmp.c | 105 +++---- icmp_flow.h | 2 - inany.h | 31 +- passt.h | 3 + tap.c | 11 - tap.h | 1 - tcp.c | 524 +++++++++------------------------- tcp_buf.c | 6 +- tcp_conn.h | 51 ++-- tcp_internal.h | 10 +- tcp_splice.c | 98 +------ tcp_splice.h | 5 +- udp.c | 757 +++++++++++++++++++++++-------------------------- udp.h | 22 +- udp_flow.h | 25 ++ util.c | 6 +- util.h | 3 + 24 files changed, 1328 insertions(+), 1026 deletions(-) create mode 100644 udp_flow.h -- 2.45.2
Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets). To allow more consistent handling across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host. Introduce 'struct flowside' to keep track of address and port information related to one side of a flow. Store two of these in the common fields of a flow to track that information for both sides. For now we only populate the initiating side, requiring that information be completed when a flows enter INI. Later patches will populate the target side. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++--- flow.h | 16 +++++++++ flow_table.h | 8 ++++- icmp.c | 9 +++-- passt.h | 3 ++ tcp.c | 6 ++-- 6 files changed, 127 insertions(+), 11 deletions(-) diff --git a/flow.c b/flow.c index d05aa495..1819111d 100644 --- a/flow.c +++ b/flow.c @@ -108,6 +108,31 @@ static const union flow *flow_new_entry; /* = NULL */ /* Last time the flow timers ran */ static struct timespec flow_timer_run; +/** flowside_from_af() - Initialise flowside from addresses + * @fside: flowside to initialise + * @af: Address family (AF_INET or AF_INET6) + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + */ +static void flowside_from_af(struct flowside *fside, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport) +{ + if (faddr) + inany_from_af(&fside->faddr, af, faddr); + else + fside->faddr = inany_any6; + fside->fport = fport; + + if (eaddr) + inany_from_af(&fside->eaddr, af, eaddr); + else + fside->eaddr = inany_any6; + fside->eport = eport; +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority @@ -140,6 +165,8 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ static void flow_set_state(struct flow_common *f, enum flow_state state) { + char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + const struct flowside *ini = &f->side[INISIDE]; uint8_t oldstate = f->state; ASSERT(state < FLOW_NUM_STATES); @@ -150,18 +177,28 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f)); if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s => %s", pif_name(f->pif[INISIDE]), - pif_name(f->pif[TGTSIDE])); + flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport, + pif_name(f->pif[TGTSIDE])); else if (MAX(state, oldstate) >= FLOW_STATE_INI) - flow_log_(f, LOG_DEBUG, "%s => ?", pif_name(f->pif[INISIDE])); + flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport); } /** - * flow_initiate() - Move flow to INI, setting INISIDE details + * flow_initiate_() - Move flow to INI, setting setting pif[INISIDE] * @flow: Flow to change state * @pif: pif of the initiating side */ -void flow_initiate(union flow *flow, uint8_t pif) +static void flow_initiate_(union flow *flow, uint8_t pif) { struct flow_common *f = &flow->f; @@ -174,6 +211,55 @@ void flow_initiate(union flow *flow, uint8_t pif) flow_set_state(f, FLOW_STATE_INI); } +/** + * flow_initiate_af() - Move flow to INI, setting INISIDE details + * @flow: Flow to change state + * @pif: pif of the initiating side + * @af: Address family of @eaddr and @faddr + * @saddr: Source address (pointer to in_addr or in6_addr) + * @sport: Endpoint port + * @daddr: Destination address (pointer to in_addr or in6_addr) + * @dport: Destination port + * + * Return: pointer to the initiating flowside information + */ +const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport) +{ + struct flowside *ini = &flow->f.side[INISIDE]; + + flowside_from_af(ini, af, saddr, sport, daddr, dport); + flow_initiate_(flow, pif); + return ini; +} + +/** + * flow_initiate_sa() - Move flow to INI, setting INISIDE details + * @flow: Flow to change state + * @pif: pif of the initiating side + * @ssa: Source socket address + * @dport: Destination port + * + * Return: pointer to the initiating flowside information + */ +const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, + const union sockaddr_inany *ssa, + in_port_t dport) +{ + struct flowside *ini = &flow->f.side[INISIDE]; + + inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa); + if (inany_v4(&ini->eaddr)) + ini->faddr = inany_any4; + else + ini->faddr = inany_any6; + ini->fport = dport; + flow_initiate_(flow, pif); + return ini; +} + /** * flow_target() - Move flow to TGT, setting TGTSIDE details * @flow: Flow to change state diff --git a/flow.h b/flow.h index 29ef9f12..b873ea7f 100644 --- a/flow.h +++ b/flow.h @@ -135,11 +135,26 @@ extern const uint8_t flow_proto[]; #define INISIDE 0 /* Initiating side */ #define TGTSIDE 1 /* Target side */ +/** + * struct flowside - Address information for one side of a flow + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport; + in_port_t eport; +}; + /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry * @type: Type of packet flow * @pif[]: Interface for each side of the flow + * @side[]: Information for each side of the flow */ struct flow_common { #ifdef __GNUC__ @@ -154,6 +169,7 @@ struct flow_common { "Not enough bits for type field"); #endif uint8_t pif[SIDES]; + struct flowside side[SIDES]; }; #define FLOW_INDEX_BITS 17 /* 128k - 1 */ diff --git a/flow_table.h b/flow_table.h index 1b163491..2e912532 100644 --- a/flow_table.h +++ b/flow_table.h @@ -107,7 +107,13 @@ static inline flow_sidx_t flow_sidx(const struct flow_common *f, union flow *flow_alloc(void); void flow_alloc_cancel(union flow *flow); -void flow_initiate(union flow *flow, uint8_t pif); +const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport); +const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, + const union sockaddr_inany *ssa, + in_port_t dport); void flow_target(union flow *flow, uint8_t pif); union flow *flow_set_type(union flow *flow, enum flow_type type); diff --git a/icmp.c b/icmp.c index 80330f6f..38d1deeb 100644 --- a/icmp.c +++ b/icmp.c @@ -146,12 +146,15 @@ static void icmp_ping_close(const struct ctx *c, * @id_sock: Pointer to ping flow entry slot in icmp_id_map[] to update * @af: Address family, AF_INET or AF_INET6 * @id: ICMP id for the new socket + * @saddr: Source address + * @daddr: Destination address * * Return: Newly opened ping flow, or NULL on failure */ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, struct icmp_ping_flow **id_sock, - sa_family_t af, uint16_t id) + sa_family_t af, uint16_t id, + const void *saddr, const void *daddr) { uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6; union epoll_ref ref = { .type = EPOLL_TYPE_PING }; @@ -163,7 +166,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, if (!flow) return NULL; - flow_initiate(flow, PIF_TAP); + flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); flow_target(flow, PIF_HOST); pingf = FLOW_SET_TYPE(flow, flowtype, ping); @@ -269,7 +272,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, } if (!(pingf = *id_sock)) - if (!(pingf = icmp_ping_new(c, id_sock, af, id))) + if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) return 1; pingf->ts = now->tv_sec; diff --git a/passt.h b/passt.h index 46d073a2..76810867 100644 --- a/passt.h +++ b/passt.h @@ -17,6 +17,9 @@ union epoll_ref; #include "pif.h" #include "packet.h" +#include "siphash.h" +#include "ip.h" +#include "inany.h" #include "flow.h" #include "icmp.h" #include "fwd.h" diff --git a/tcp.c b/tcp.c index 68524235..07a2eb1c 100644 --- a/tcp.c +++ b/tcp.c @@ -1629,7 +1629,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(flow = flow_alloc())) return; - flow_initiate(flow, PIF_TAP); + flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); if (af == AF_INET) { if (IN4_IS_ADDR_UNSPECIFIED(saddr) || @@ -2288,7 +2288,9 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) goto cancel; - flow_initiate(flow, ref.tcp_listen.pif); + /* FIXME: When listening port has a specific bound address, record that + * as the forwarding address */ + flow_initiate_sa(flow, ref.tcp_listen.pif, &sa, ref.tcp_listen.port); if (sa.sa_family == AF_INET) { const struct in_addr *addr = &sa.sa4.sin_addr; -- 2.45.2
On Fri, 14 Jun 2024 16:13:23 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets). To allow more consistent handling across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host. Introduce 'struct flowside' to keep track of address and port information related to one side of a flow. Store two of these in the common fields of a flow to track that information for both sides. For now we only populate the initiating side, requiring that information be completed when a flows enter INI. Later patches will populate the target side. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++--- flow.h | 16 +++++++++ flow_table.h | 8 ++++- icmp.c | 9 +++-- passt.h | 3 ++ tcp.c | 6 ++-- 6 files changed, 127 insertions(+), 11 deletions(-) diff --git a/flow.c b/flow.c index d05aa495..1819111d 100644 --- a/flow.c +++ b/flow.c @@ -108,6 +108,31 @@ static const union flow *flow_new_entry; /* = NULL */ /* Last time the flow timers ran */ static struct timespec flow_timer_run; +/** flowside_from_af() - Initialise flowside from addresses + * @fside: flowside to initialise + * @af: Address family (AF_INET or AF_INET6) + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + */ +static void flowside_from_af(struct flowside *fside, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport) +{ + if (faddr) + inany_from_af(&fside->faddr, af, faddr); + else + fside->faddr = inany_any6;I kind of wonder if, for clarity, we should perhaps: #define inany_any inany_any6 ? And... I don't actually have in mind how IP versions are stored in the flow table for unspecified addresses, but I wonder if we're losing some information this way: if this is called with AF_INET as 'af', we're not saying anywhere in the flowside information that this is going to be an IPv4 flowside, right?+ fside->fport = fport; + + if (eaddr) + inany_from_af(&fside->eaddr, af, eaddr); + else + fside->eaddr = inany_any6; + fside->eport = eport; +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority @@ -140,6 +165,8 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ static void flow_set_state(struct flow_common *f, enum flow_state state) { + char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + const struct flowside *ini = &f->side[INISIDE]; uint8_t oldstate = f->state; ASSERT(state < FLOW_NUM_STATES); @@ -150,18 +177,28 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f)); if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s => %s", pif_name(f->pif[INISIDE]), - pif_name(f->pif[TGTSIDE])); + flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport, + pif_name(f->pif[TGTSIDE])); else if (MAX(state, oldstate) >= FLOW_STATE_INI) - flow_log_(f, LOG_DEBUG, "%s => ?", pif_name(f->pif[INISIDE])); + flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport); } /** - * flow_initiate() - Move flow to INI, setting INISIDE details + * flow_initiate_() - Move flow to INI, setting setting pif[INISIDE]s/setting setting/setting/* @flow: Flow to change state * @pif: pif of the initiating side */ -void flow_initiate(union flow *flow, uint8_t pif) +static void flow_initiate_(union flow *flow, uint8_t pif) { struct flow_common *f = &flow->f; @@ -174,6 +211,55 @@ void flow_initiate(union flow *flow, uint8_t pif) flow_set_state(f, FLOW_STATE_INI); } +/** + * flow_initiate_af() - Move flow to INI, setting INISIDE details + * @flow: Flow to change state + * @pif: pif of the initiating side + * @af: Address family of @eaddr and @faddr + * @saddr: Source address (pointer to in_addr or in6_addr) + * @sport: Endpoint port + * @daddr: Destination address (pointer to in_addr or in6_addr) + * @dport: Destination port + * + * Return: pointer to the initiating flowside information + */ +const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport) +{ + struct flowside *ini = &flow->f.side[INISIDE]; + + flowside_from_af(ini, af, saddr, sport, daddr, dport); + flow_initiate_(flow, pif); + return ini; +} + +/** + * flow_initiate_sa() - Move flow to INI, setting INISIDE details + * @flow: Flow to change state + * @pif: pif of the initiating side + * @ssa: Source socket address + * @dport: Destination port + * + * Return: pointer to the initiating flowside information + */ +const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, + const union sockaddr_inany *ssa, + in_port_t dport) +{ + struct flowside *ini = &flow->f.side[INISIDE]; + + inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa); + if (inany_v4(&ini->eaddr)) + ini->faddr = inany_any4; + else + ini->faddr = inany_any6; + ini->fport = dport; + flow_initiate_(flow, pif); + return ini; +} + /** * flow_target() - Move flow to TGT, setting TGTSIDE details * @flow: Flow to change state diff --git a/flow.h b/flow.h index 29ef9f12..b873ea7f 100644 --- a/flow.h +++ b/flow.h @@ -135,11 +135,26 @@ extern const uint8_t flow_proto[]; #define INISIDE 0 /* Initiating side */ #define TGTSIDE 1 /* Target side */ +/** + * struct flowside - Address information for one side of a flow + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport; + in_port_t eport; +}; + /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry * @type: Type of packet flow * @pif[]: Interface for each side of the flow + * @side[]: Information for each side of the flow */ struct flow_common { #ifdef __GNUC__ @@ -154,6 +169,7 @@ struct flow_common { "Not enough bits for type field"); #endif uint8_t pif[SIDES]; + struct flowside side[SIDES]; }; #define FLOW_INDEX_BITS 17 /* 128k - 1 */ diff --git a/flow_table.h b/flow_table.h index 1b163491..2e912532 100644 --- a/flow_table.h +++ b/flow_table.h @@ -107,7 +107,13 @@ static inline flow_sidx_t flow_sidx(const struct flow_common *f, union flow *flow_alloc(void); void flow_alloc_cancel(union flow *flow); -void flow_initiate(union flow *flow, uint8_t pif); +const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport); +const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, + const union sockaddr_inany *ssa, + in_port_t dport); void flow_target(union flow *flow, uint8_t pif); union flow *flow_set_type(union flow *flow, enum flow_type type); diff --git a/icmp.c b/icmp.c index 80330f6f..38d1deeb 100644 --- a/icmp.c +++ b/icmp.c @@ -146,12 +146,15 @@ static void icmp_ping_close(const struct ctx *c, * @id_sock: Pointer to ping flow entry slot in icmp_id_map[] to update * @af: Address family, AF_INET or AF_INET6 * @id: ICMP id for the new socket + * @saddr: Source address + * @daddr: Destination address * * Return: Newly opened ping flow, or NULL on failure */ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, struct icmp_ping_flow **id_sock, - sa_family_t af, uint16_t id) + sa_family_t af, uint16_t id, + const void *saddr, const void *daddr) { uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6; union epoll_ref ref = { .type = EPOLL_TYPE_PING }; @@ -163,7 +166,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, if (!flow) return NULL; - flow_initiate(flow, PIF_TAP); + flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); flow_target(flow, PIF_HOST); pingf = FLOW_SET_TYPE(flow, flowtype, ping); @@ -269,7 +272,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, } if (!(pingf = *id_sock)) - if (!(pingf = icmp_ping_new(c, id_sock, af, id))) + if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) return 1; pingf->ts = now->tv_sec; diff --git a/passt.h b/passt.h index 46d073a2..76810867 100644 --- a/passt.h +++ b/passt.h @@ -17,6 +17,9 @@ union epoll_ref; #include "pif.h" #include "packet.h" +#include "siphash.h" +#include "ip.h" +#include "inany.h" #include "flow.h" #include "icmp.h" #include "fwd.h" diff --git a/tcp.c b/tcp.c index 68524235..07a2eb1c 100644 --- a/tcp.c +++ b/tcp.c @@ -1629,7 +1629,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(flow = flow_alloc())) return; - flow_initiate(flow, PIF_TAP); + flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); if (af == AF_INET) { if (IN4_IS_ADDR_UNSPECIFIED(saddr) || @@ -2288,7 +2288,9 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) goto cancel; - flow_initiate(flow, ref.tcp_listen.pif); + /* FIXME: When listening port has a specific bound address, record that + * as the forwarding address */ + flow_initiate_sa(flow, ref.tcp_listen.pif, &sa, ref.tcp_listen.port); if (sa.sa_family == AF_INET) { const struct in_addr *addr = &sa.sa4.sin_addr;-- Stefano
On Wed, Jun 26, 2024 at 12:23:28AM +0200, Stefano Brivio wrote:On Fri, 14 Jun 2024 16:13:23 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:That might be a good idea.Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets). To allow more consistent handling across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host. Introduce 'struct flowside' to keep track of address and port information related to one side of a flow. Store two of these in the common fields of a flow to track that information for both sides. For now we only populate the initiating side, requiring that information be completed when a flows enter INI. Later patches will populate the target side. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++--- flow.h | 16 +++++++++ flow_table.h | 8 ++++- icmp.c | 9 +++-- passt.h | 3 ++ tcp.c | 6 ++-- 6 files changed, 127 insertions(+), 11 deletions(-) diff --git a/flow.c b/flow.c index d05aa495..1819111d 100644 --- a/flow.c +++ b/flow.c @@ -108,6 +108,31 @@ static const union flow *flow_new_entry; /* = NULL */ /* Last time the flow timers ran */ static struct timespec flow_timer_run; +/** flowside_from_af() - Initialise flowside from addresses + * @fside: flowside to initialise + * @af: Address family (AF_INET or AF_INET6) + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + */ +static void flowside_from_af(struct flowside *fside, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport) +{ + if (faddr) + inany_from_af(&fside->faddr, af, faddr); + else + fside->faddr = inany_any6;I kind of wonder if, for clarity, we should perhaps: #define inany_any inany_any6 ?And... I don't actually have in mind how IP versions are stored in the flow table for unspecified addresses, but I wonder if we're losing some information this way: if this is called with AF_INET as 'af', we're not saying anywhere in the flowside information that this is going to be an IPv4 flowside, right?So, I've gone back and forth on this a bit, and it's not entirely consistent in this version whether I always use :: for "blank", or whether I use 0.0.0.0 for IPv4 cases. Using :: everywhere does technically lose some information, although you can generally tell that a flowside is IPv4 from the other address. The argument for using :: everywhere is that in many of the cases this indicates "blank" or "missing" in a way that's not really dependent on the IP version, and sometimes it's simpler to do this. There are (or may be in future) also some cases where an otherwise IPv4 flowside might be routed through a dual-stack :: socket. The argument for using 0.0.0.0 where accurate is that it makes for more consistency across the flowside and makes matching up to addresses from elswhere easier in some cases. With the reworks for how I handle UDP, I'm going to need to revisit this anyway, so I'll see where I land.Fixed. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson+ fside->fport = fport; + + if (eaddr) + inany_from_af(&fside->eaddr, af, eaddr); + else + fside->eaddr = inany_any6; + fside->eport = eport; +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority @@ -140,6 +165,8 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ static void flow_set_state(struct flow_common *f, enum flow_state state) { + char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + const struct flowside *ini = &f->side[INISIDE]; uint8_t oldstate = f->state; ASSERT(state < FLOW_NUM_STATES); @@ -150,18 +177,28 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f)); if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s => %s", pif_name(f->pif[INISIDE]), - pif_name(f->pif[TGTSIDE])); + flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport, + pif_name(f->pif[TGTSIDE])); else if (MAX(state, oldstate) >= FLOW_STATE_INI) - flow_log_(f, LOG_DEBUG, "%s => ?", pif_name(f->pif[INISIDE])); + flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport); } /** - * flow_initiate() - Move flow to INI, setting INISIDE details + * flow_initiate_() - Move flow to INI, setting setting pif[INISIDE]s/setting setting/setting/
Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect(). Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 38 ++++++++++++++++++------ flow_table.h | 5 +++- icmp.c | 3 +- inany.h | 30 ++++++++++++++++++- tcp.c | 83 ++++++++++++++++++++++++++++------------------------ tcp_splice.c | 45 +++++++++++----------------- 6 files changed, 126 insertions(+), 78 deletions(-) diff --git a/flow.c b/flow.c index 1819111d..39e046bf 100644 --- a/flow.c +++ b/flow.c @@ -165,8 +165,10 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ static void flow_set_state(struct flow_common *f, enum flow_state state) { - char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN]; + char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN]; const struct flowside *ini = &f->side[INISIDE]; + const struct flowside *tgt = &f->side[TGTSIDE]; uint8_t oldstate = f->state; ASSERT(state < FLOW_NUM_STATES); @@ -177,19 +179,24 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f)); if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + flow_log_(f, LOG_DEBUG, + "%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport, - pif_name(f->pif[TGTSIDE])); + pif_name(f->pif[TGTSIDE]), + inany_ntop(&tgt->faddr, fstr1, sizeof(fstr1)), + tgt->fport, + inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)), + tgt->eport); else if (MAX(state, oldstate) >= FLOW_STATE_INI) flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport); } @@ -261,21 +268,34 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, } /** - * flow_target() - Move flow to TGT, setting TGTSIDE details + * flow_target_af() - Move flow to TGT, setting TGTSIDE details * @flow: Flow to change state * @pif: pif of the target side + * @af: Address family for @eaddr and @faddr + * @saddr: Source address (pointer to in_addr or in6_addr) + * @sport: Endpoint port + * @daddr: Destination address (pointer to in_addr or in6_addr) + * @dport: Destination port + * + * Return: pointer to the target flowside information */ -void flow_target(union flow *flow, uint8_t pif) +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport) { struct flow_common *f = &flow->f; + struct flowside *tgt = &f->side[TGTSIDE]; ASSERT(pif != PIF_NONE); ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); ASSERT(f->type == FLOW_TYPE_NONE); ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); + flowside_from_af(tgt, af, daddr, dport, saddr, sport); f->pif[TGTSIDE] = pif; flow_set_state(f, FLOW_STATE_TGT); + return tgt; } /** diff --git a/flow_table.h b/flow_table.h index 2e912532..7af32c6a 100644 --- a/flow_table.h +++ b/flow_table.h @@ -114,7 +114,10 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, const union sockaddr_inany *ssa, in_port_t dport); -void flow_target(union flow *flow, uint8_t pif); +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport); union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/icmp.c b/icmp.c index 38d1deeb..c86c54c3 100644 --- a/icmp.c +++ b/icmp.c @@ -167,7 +167,8 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, return NULL; flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); - flow_target(flow, PIF_HOST); + /* FIXME: Record outbound source address when known */ + flow_target_af(flow, PIF_HOST, af, NULL, 0, daddr, 0); pingf = FLOW_SET_TYPE(flow, flowtype, ping); pingf->seq = -1; diff --git a/inany.h b/inany.h index 47b66fa9..2bf3becf 100644 --- a/inany.h +++ b/inany.h @@ -187,7 +187,6 @@ static inline bool inany_is_unspecified(const union inany_addr *a) * * Return: true if @a is in fe80::/10 (IPv6 link local unicast) */ -/* cppcheck-suppress unusedFunction */ static inline bool inany_is_linklocal6(const union inany_addr *a) { return IN6_IS_ADDR_LINKLOCAL(&a->a6); @@ -273,4 +272,33 @@ static inline void inany_siphash_feed(struct siphash_state *state, const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size); +/** sockaddr_from_inany() - Construct a sockaddr from an inany + * @sa: Pointer to sockaddr to fill in + * @sl: Updated to relevant of length of initialised @sa + * @addr: IPv[46] address + * @port: Port (host byte order) + * @scope: Scope ID (ignored for IPv4 addresses) + */ +static inline void sockaddr_from_inany(union sockaddr_inany *sa, socklen_t *sl, + const union inany_addr *addr, + in_port_t port, uint32_t scope) +{ + const struct in_addr *v4 = inany_v4(addr); + + if (v4) { + sa->sa_family = AF_INET; + sa->sa4.sin_addr = *v4; + sa->sa4.sin_port = htons(port); + memset(&sa->sa4.sin_zero, 0, sizeof(sa->sa4.sin_zero)); + *sl = sizeof(sa->sa4); + } else { + sa->sa_family = AF_INET6; + sa->sa6.sin6_addr = addr->a6; + sa->sa6.sin6_port = htons(port); + sa->sa6.sin6_scope_id = scope; + sa->sa6.sin6_flowinfo = 0; + *sl = sizeof(sa->sa6); + } +} + #endif /* INANY_H */ diff --git a/tcp.c b/tcp.c index 07a2eb1c..c6cd0c72 100644 --- a/tcp.c +++ b/tcp.c @@ -1610,18 +1610,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(dstport), - .sin_addr = *(struct in_addr *)daddr, - }; - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(dstport), - .sin6_addr = *(struct in6_addr *)daddr, - }; - const struct sockaddr *sa; + const struct flowside *ini, *tgt; struct tcp_tap_conn *conn; + union inany_addr dstaddr; /* FIXME: Avoid bulky temporary */ + union sockaddr_inany sa; union flow *flow; int s = -1, mss; socklen_t sl; @@ -1629,7 +1621,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(flow = flow_alloc())) return; - flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); + ini = flow_initiate_af(flow, PIF_TAP, + af, saddr, srcport, daddr, dstport); if (af == AF_INET) { if (IN4_IS_ADDR_UNSPECIFIED(saddr) || @@ -1661,19 +1654,28 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, dstport); goto cancel; } + } else { + ASSERT(0); } if ((s = tcp_conn_sock(c, af)) < 0) goto cancel; + dstaddr = ini->faddr; + if (!c->no_map_gw) { - if (af == AF_INET && IN4_ARE_ADDR_EQUAL(daddr, &c->ip4.gw)) - addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK); - if (af == AF_INET6 && IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw)) - addr6.sin6_addr = in6addr_loopback; + if (inany_equals4(&dstaddr, &c->ip4.gw)) + dstaddr = inany_loopback4; + else if (inany_equals6(&dstaddr, &c->ip6.gw)) + dstaddr = inany_loopback6; } - if (af == AF_INET6 && IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr)) { + /* FIXME: Record outbound source address when known */ + tgt = flow_target_af(flow, PIF_HOST, AF_INET6, + NULL, 0, /* Kernel decides source address */ + &dstaddr, dstport); + + if (inany_is_linklocal6(&tgt->eaddr)) { struct sockaddr_in6 addr6_ll = { .sin6_family = AF_INET6, .sin6_addr = c->ip6.addr_ll, @@ -1681,9 +1683,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, }; if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) goto cancel; + } else if (!inany_is_loopback(&tgt->eaddr)) { + tcp_bind_outbound(c, s, af); } - flow_target(flow, PIF_HOST); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); conn->sock = s; conn->timer = -1; @@ -1706,14 +1709,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, inany_from_af(&conn->faddr, af, daddr); - if (af == AF_INET) { - sa = (struct sockaddr *)&addr4; - sl = sizeof(addr4); - } else { - sa = (struct sockaddr *)&addr6; - sl = sizeof(addr6); - } - conn->fport = dstport; conn->eport = srcport; @@ -1726,19 +1721,16 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, tcp_hash_insert(c, conn); - if (!bind(s, sa, sl)) { + sockaddr_from_inany(&sa, &sl, &tgt->eaddr, tgt->eport, c->ifi6); + + if (!bind(s, &sa.sa, sl)) { tcp_rst(c, conn); /* Nobody is listening then */ goto cancel; } if (errno != EADDRNOTAVAIL && errno != EACCES) conn_flag(c, conn, LOCAL); - if ((af == AF_INET && !IN4_IS_ADDR_LOOPBACK(&addr4.sin_addr)) || - (af == AF_INET6 && !IN6_IS_ADDR_LOOPBACK(&addr6.sin6_addr) && - !IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr))) - tcp_bind_outbound(c, s, af); - - if (connect(s, sa, sl)) { + if (connect(s, &sa.sa, sl)) { if (errno != EINPROGRESS) { tcp_rst(c, conn); goto cancel; @@ -2236,9 +2228,25 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, const union sockaddr_inany *sa, const struct timespec *now) { + union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */ struct tcp_tap_conn *conn; + in_port_t srcport; + + inany_from_sockaddr(&saddr, &srcport, sa); + tcp_snat_inbound(c, &saddr); - flow_target(flow, PIF_TAP); + if (inany_v4(&saddr)) { + daddr = inany_from_v4(c->ip4.addr_seen); + } else { + if (inany_is_linklocal6(&saddr)) + daddr.a6 = c->ip6.addr_ll_seen; + else + daddr.a6 = c->ip6.addr_seen; + } + dstport += c->tcp.fwd_in.delta[dstport]; + + flow_target_af(flow, PIF_TAP, AF_INET6, + &saddr, srcport, &daddr, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); conn->sock = s; @@ -2246,10 +2254,9 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = dstport + c->tcp.fwd_in.delta[dstport]; - - tcp_snat_inbound(c, &conn->faddr); + conn->faddr = saddr; + conn->fport = srcport; + conn->eport = dstport; tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_splice.c b/tcp_splice.c index 5a406c63..bcb42c97 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -320,31 +320,20 @@ static int tcp_splice_connect_finish(const struct ctx *c, * tcp_splice_connect() - Create and connect socket for new spliced connection * @c: Execution context * @conn: Connection pointer - * @af: Address family - * @pif: pif on which to create socket - * @port: Destination port, host order * * Return: 0 for connect() succeeded or in progress, negative value on error */ -static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, - sa_family_t af, uint8_t pif, in_port_t port) +static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(port), - .sin6_addr = IN6ADDR_LOOPBACK_INIT, - }; - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(port), - .sin_addr = IN4ADDR_LOOPBACK_INIT, - }; - const struct sockaddr *sa; + const struct flowside *tgt = &conn->f.side[TGTSIDE]; + sa_family_t af = inany_v4(&tgt->eaddr) ? AF_INET : AF_INET6; + uint8_t tgtpif = conn->f.pif[TGTSIDE]; + union sockaddr_inany sa; socklen_t sl; - if (pif == PIF_HOST) + if (tgtpif == PIF_HOST) conn->s[1] = tcp_conn_sock(c, af); - else if (pif == PIF_SPLICE) + else if (tgtpif == PIF_SPLICE) conn->s[1] = tcp_conn_sock_ns(c, af); else ASSERT(0); @@ -358,15 +347,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, conn->s[1]); } - if (CONN_V6(conn)) { - sa = (struct sockaddr *)&addr6; - sl = sizeof(addr6); - } else { - sa = (struct sockaddr *)&addr4; - sl = sizeof(addr4); - } + sockaddr_from_inany(&sa, &sl, &tgt->eaddr, tgt->eport, 0); - if (connect(conn->s[1], sa, sl)) { + if (connect(conn->s[1], &sa.sa, sl)) { if (errno != EINPROGRESS) { flow_trace(conn, "Couldn't connect socket for splice: %s", strerror(errno)); @@ -471,7 +454,13 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, return false; } - flow_target(flow, tgtpif); + /* FIXME: Record outbound source address when known */ + if (af == AF_INET) + flow_target_af(flow, tgtpif, AF_INET, + NULL, 0, &in4addr_loopback, dstport); + else + flow_target_af(flow, tgtpif, AF_INET6, + NULL, 0, &in6addr_loopback, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice); conn->flags = af == AF_INET ? 0 : SPLICE_V6; @@ -483,7 +472,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int))) flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0); - if (tcp_splice_connect(c, conn, af, tgtpif, dstport)) + if (tcp_splice_connect(c, conn)) conn_flag(c, conn, CLOSING); FLOW_ACTIVATE(conn); -- 2.45.2
On Fri, 14 Jun 2024 16:13:24 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect(). Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 38 ++++++++++++++++++------ flow_table.h | 5 +++- icmp.c | 3 +- inany.h | 30 ++++++++++++++++++- tcp.c | 83 ++++++++++++++++++++++++++++------------------------ tcp_splice.c | 45 +++++++++++----------------- 6 files changed, 126 insertions(+), 78 deletions(-) diff --git a/flow.c b/flow.c index 1819111d..39e046bf 100644 --- a/flow.c +++ b/flow.c @@ -165,8 +165,10 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ static void flow_set_state(struct flow_common *f, enum flow_state state) { - char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN]; + char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN]; const struct flowside *ini = &f->side[INISIDE]; + const struct flowside *tgt = &f->side[TGTSIDE]; uint8_t oldstate = f->state; ASSERT(state < FLOW_NUM_STATES); @@ -177,19 +179,24 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f)); if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + flow_log_(f, LOG_DEBUG, + "%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport, - pif_name(f->pif[TGTSIDE])); + pif_name(f->pif[TGTSIDE]), + inany_ntop(&tgt->faddr, fstr1, sizeof(fstr1)), + tgt->fport, + inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)), + tgt->eport); else if (MAX(state, oldstate) >= FLOW_STATE_INI) flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport); } @@ -261,21 +268,34 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, } /** - * flow_target() - Move flow to TGT, setting TGTSIDE details + * flow_target_af() - Move flow to TGT, setting TGTSIDE details * @flow: Flow to change state * @pif: pif of the target side + * @af: Address family for @eaddr and @faddr + * @saddr: Source address (pointer to in_addr or in6_addr) + * @sport: Endpoint port + * @daddr: Destination address (pointer to in_addr or in6_addr) + * @dport: Destination port + * + * Return: pointer to the target flowside information */ -void flow_target(union flow *flow, uint8_t pif) +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport) { struct flow_common *f = &flow->f; + struct flowside *tgt = &f->side[TGTSIDE]; ASSERT(pif != PIF_NONE); ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); ASSERT(f->type == FLOW_TYPE_NONE); ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); + flowside_from_af(tgt, af, daddr, dport, saddr, sport); f->pif[TGTSIDE] = pif; flow_set_state(f, FLOW_STATE_TGT); + return tgt; } /** diff --git a/flow_table.h b/flow_table.h index 2e912532..7af32c6a 100644 --- a/flow_table.h +++ b/flow_table.h @@ -114,7 +114,10 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, const union sockaddr_inany *ssa, in_port_t dport); -void flow_target(union flow *flow, uint8_t pif); +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport); union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/icmp.c b/icmp.c index 38d1deeb..c86c54c3 100644 --- a/icmp.c +++ b/icmp.c @@ -167,7 +167,8 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, return NULL; flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); - flow_target(flow, PIF_HOST); + /* FIXME: Record outbound source address when known */ + flow_target_af(flow, PIF_HOST, af, NULL, 0, daddr, 0); pingf = FLOW_SET_TYPE(flow, flowtype, ping); pingf->seq = -1; diff --git a/inany.h b/inany.h index 47b66fa9..2bf3becf 100644 --- a/inany.h +++ b/inany.h @@ -187,7 +187,6 @@ static inline bool inany_is_unspecified(const union inany_addr *a) * * Return: true if @a is in fe80::/10 (IPv6 link local unicast) */ -/* cppcheck-suppress unusedFunction */ static inline bool inany_is_linklocal6(const union inany_addr *a) { return IN6_IS_ADDR_LINKLOCAL(&a->a6); @@ -273,4 +272,33 @@ static inline void inany_siphash_feed(struct siphash_state *state, const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size); +/** sockaddr_from_inany() - Construct a sockaddr from an inany + * @sa: Pointer to sockaddr to fill in + * @sl: Updated to relevant of length of initialised @sa + * @addr: IPv[46] address + * @port: Port (host byte order) + * @scope: Scope ID (ignored for IPv4 addresses)Perhaps worth specifying that it's the interface index for IPv6 link-local addresses only (even if it's not the case as of this patch, more below): * @scope: Scope ID for link-local IPv6 addresses only: interface index+ */ +static inline void sockaddr_from_inany(union sockaddr_inany *sa, socklen_t *sl, + const union inany_addr *addr, + in_port_t port, uint32_t scope) +{ + const struct in_addr *v4 = inany_v4(addr); + + if (v4) { + sa->sa_family = AF_INET; + sa->sa4.sin_addr = *v4; + sa->sa4.sin_port = htons(port); + memset(&sa->sa4.sin_zero, 0, sizeof(sa->sa4.sin_zero)); + *sl = sizeof(sa->sa4); + } else { + sa->sa_family = AF_INET6; + sa->sa6.sin6_addr = addr->a6; + sa->sa6.sin6_port = htons(port); + sa->sa6.sin6_scope_id = scope; + sa->sa6.sin6_flowinfo = 0; + *sl = sizeof(sa->sa6); + } +} + #endif /* INANY_H */ diff --git a/tcp.c b/tcp.c index 07a2eb1c..c6cd0c72 100644 --- a/tcp.c +++ b/tcp.c @@ -1610,18 +1610,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(dstport), - .sin_addr = *(struct in_addr *)daddr, - }; - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(dstport), - .sin6_addr = *(struct in6_addr *)daddr, - }; - const struct sockaddr *sa; + const struct flowside *ini, *tgt; struct tcp_tap_conn *conn; + union inany_addr dstaddr; /* FIXME: Avoid bulky temporary */ + union sockaddr_inany sa; union flow *flow; int s = -1, mss; socklen_t sl; @@ -1629,7 +1621,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(flow = flow_alloc())) return; - flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); + ini = flow_initiate_af(flow, PIF_TAP, + af, saddr, srcport, daddr, dstport); if (af == AF_INET) { if (IN4_IS_ADDR_UNSPECIFIED(saddr) || @@ -1661,19 +1654,28 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, dstport); goto cancel; } + } else { + ASSERT(0); } if ((s = tcp_conn_sock(c, af)) < 0) goto cancel; + dstaddr = ini->faddr; + if (!c->no_map_gw) { - if (af == AF_INET && IN4_ARE_ADDR_EQUAL(daddr, &c->ip4.gw)) - addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK); - if (af == AF_INET6 && IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw)) - addr6.sin6_addr = in6addr_loopback; + if (inany_equals4(&dstaddr, &c->ip4.gw)) + dstaddr = inany_loopback4; + else if (inany_equals6(&dstaddr, &c->ip6.gw)) + dstaddr = inany_loopback6; } - if (af == AF_INET6 && IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr)) { + /* FIXME: Record outbound source address when known */ + tgt = flow_target_af(flow, PIF_HOST, AF_INET6, + NULL, 0, /* Kernel decides source address */ + &dstaddr, dstport); + + if (inany_is_linklocal6(&tgt->eaddr)) { struct sockaddr_in6 addr6_ll = { .sin6_family = AF_INET6, .sin6_addr = c->ip6.addr_ll, @@ -1681,9 +1683,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, }; if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) goto cancel; + } else if (!inany_is_loopback(&tgt->eaddr)) { + tcp_bind_outbound(c, s, af); } - flow_target(flow, PIF_HOST); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); conn->sock = s; conn->timer = -1; @@ -1706,14 +1709,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, inany_from_af(&conn->faddr, af, daddr); - if (af == AF_INET) { - sa = (struct sockaddr *)&addr4; - sl = sizeof(addr4); - } else { - sa = (struct sockaddr *)&addr6; - sl = sizeof(addr6); - } - conn->fport = dstport; conn->eport = srcport; @@ -1726,19 +1721,16 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, tcp_hash_insert(c, conn); - if (!bind(s, sa, sl)) { + sockaddr_from_inany(&sa, &sl, &tgt->eaddr, tgt->eport, c->ifi6);...so, I never tried to use a non-zero scope identifier for, say, global unicast IPv6 addresses. Perhaps it's harmless. But I still think that, logically, we should pass it as zero in that case.+ + if (!bind(s, &sa.sa, sl)) { tcp_rst(c, conn); /* Nobody is listening then */ goto cancel; } if (errno != EADDRNOTAVAIL && errno != EACCES) conn_flag(c, conn, LOCAL);This will have a semi-trivial conflict with 54a9d3801b95 ("tcp: Don't rely on bind() to fail to decide that connection target is valid").- if ((af == AF_INET && !IN4_IS_ADDR_LOOPBACK(&addr4.sin_addr)) || - (af == AF_INET6 && !IN6_IS_ADDR_LOOPBACK(&addr6.sin6_addr) && - !IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr))) - tcp_bind_outbound(c, s, af); - - if (connect(s, sa, sl)) { + if (connect(s, &sa.sa, sl)) { if (errno != EINPROGRESS) { tcp_rst(c, conn); goto cancel; @@ -2236,9 +2228,25 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, const union sockaddr_inany *sa, const struct timespec *now) { + union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */ struct tcp_tap_conn *conn; + in_port_t srcport; + + inany_from_sockaddr(&saddr, &srcport, sa); + tcp_snat_inbound(c, &saddr); - flow_target(flow, PIF_TAP); + if (inany_v4(&saddr)) { + daddr = inany_from_v4(c->ip4.addr_seen); + } else { + if (inany_is_linklocal6(&saddr)) + daddr.a6 = c->ip6.addr_ll_seen; + else + daddr.a6 = c->ip6.addr_seen; + } + dstport += c->tcp.fwd_in.delta[dstport]; + + flow_target_af(flow, PIF_TAP, AF_INET6, + &saddr, srcport, &daddr, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); conn->sock = s; @@ -2246,10 +2254,9 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = dstport + c->tcp.fwd_in.delta[dstport]; - - tcp_snat_inbound(c, &conn->faddr); + conn->faddr = saddr; + conn->fport = srcport; + conn->eport = dstport; tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_splice.c b/tcp_splice.c index 5a406c63..bcb42c97 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -320,31 +320,20 @@ static int tcp_splice_connect_finish(const struct ctx *c, * tcp_splice_connect() - Create and connect socket for new spliced connection * @c: Execution context * @conn: Connection pointer - * @af: Address family - * @pif: pif on which to create socket - * @port: Destination port, host order * * Return: 0 for connect() succeeded or in progress, negative value on error */ -static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, - sa_family_t af, uint8_t pif, in_port_t port) +static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(port), - .sin6_addr = IN6ADDR_LOOPBACK_INIT, - }; - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(port), - .sin_addr = IN4ADDR_LOOPBACK_INIT, - }; - const struct sockaddr *sa; + const struct flowside *tgt = &conn->f.side[TGTSIDE]; + sa_family_t af = inany_v4(&tgt->eaddr) ? AF_INET : AF_INET6; + uint8_t tgtpif = conn->f.pif[TGTSIDE]; + union sockaddr_inany sa; socklen_t sl; - if (pif == PIF_HOST) + if (tgtpif == PIF_HOST) conn->s[1] = tcp_conn_sock(c, af); - else if (pif == PIF_SPLICE) + else if (tgtpif == PIF_SPLICE) conn->s[1] = tcp_conn_sock_ns(c, af); else ASSERT(0); @@ -358,15 +347,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, conn->s[1]); } - if (CONN_V6(conn)) { - sa = (struct sockaddr *)&addr6; - sl = sizeof(addr6); - } else { - sa = (struct sockaddr *)&addr4; - sl = sizeof(addr4); - } + sockaddr_from_inany(&sa, &sl, &tgt->eaddr, tgt->eport, 0); - if (connect(conn->s[1], sa, sl)) { + if (connect(conn->s[1], &sa.sa, sl)) { if (errno != EINPROGRESS) { flow_trace(conn, "Couldn't connect socket for splice: %s", strerror(errno)); @@ -471,7 +454,13 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, return false; } - flow_target(flow, tgtpif); + /* FIXME: Record outbound source address when known */ + if (af == AF_INET) + flow_target_af(flow, tgtpif, AF_INET, + NULL, 0, &in4addr_loopback, dstport); + else + flow_target_af(flow, tgtpif, AF_INET6, + NULL, 0, &in6addr_loopback, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice); conn->flags = af == AF_INET ? 0 : SPLICE_V6; @@ -483,7 +472,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int))) flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0); - if (tcp_splice_connect(c, conn, af, tgtpif, dstport)) + if (tcp_splice_connect(c, conn)) conn_flag(c, conn, CLOSING); FLOW_ACTIVATE(conn);-- Stefano
On Wed, Jun 26, 2024 at 12:23:59AM +0200, Stefano Brivio wrote:On Fri, 14 Jun 2024 16:13:24 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Right.. as it happens, I'm reworking this in a way that will obsolete this parameter anyway.Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches. For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect(). Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 38 ++++++++++++++++++------ flow_table.h | 5 +++- icmp.c | 3 +- inany.h | 30 ++++++++++++++++++- tcp.c | 83 ++++++++++++++++++++++++++++------------------------ tcp_splice.c | 45 +++++++++++----------------- 6 files changed, 126 insertions(+), 78 deletions(-) diff --git a/flow.c b/flow.c index 1819111d..39e046bf 100644 --- a/flow.c +++ b/flow.c @@ -165,8 +165,10 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ static void flow_set_state(struct flow_common *f, enum flow_state state) { - char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN]; + char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN]; const struct flowside *ini = &f->side[INISIDE]; + const struct flowside *tgt = &f->side[TGTSIDE]; uint8_t oldstate = f->state; ASSERT(state < FLOW_NUM_STATES); @@ -177,19 +179,24 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f)); if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + flow_log_(f, LOG_DEBUG, + "%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport, - pif_name(f->pif[TGTSIDE])); + pif_name(f->pif[TGTSIDE]), + inany_ntop(&tgt->faddr, fstr1, sizeof(fstr1)), + tgt->fport, + inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)), + tgt->eport); else if (MAX(state, oldstate) >= FLOW_STATE_INI) flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport); } @@ -261,21 +268,34 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, } /** - * flow_target() - Move flow to TGT, setting TGTSIDE details + * flow_target_af() - Move flow to TGT, setting TGTSIDE details * @flow: Flow to change state * @pif: pif of the target side + * @af: Address family for @eaddr and @faddr + * @saddr: Source address (pointer to in_addr or in6_addr) + * @sport: Endpoint port + * @daddr: Destination address (pointer to in_addr or in6_addr) + * @dport: Destination port + * + * Return: pointer to the target flowside information */ -void flow_target(union flow *flow, uint8_t pif) +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport) { struct flow_common *f = &flow->f; + struct flowside *tgt = &f->side[TGTSIDE]; ASSERT(pif != PIF_NONE); ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); ASSERT(f->type == FLOW_TYPE_NONE); ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); + flowside_from_af(tgt, af, daddr, dport, saddr, sport); f->pif[TGTSIDE] = pif; flow_set_state(f, FLOW_STATE_TGT); + return tgt; } /** diff --git a/flow_table.h b/flow_table.h index 2e912532..7af32c6a 100644 --- a/flow_table.h +++ b/flow_table.h @@ -114,7 +114,10 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, const union sockaddr_inany *ssa, in_port_t dport); -void flow_target(union flow *flow, uint8_t pif); +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport); union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/icmp.c b/icmp.c index 38d1deeb..c86c54c3 100644 --- a/icmp.c +++ b/icmp.c @@ -167,7 +167,8 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, return NULL; flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); - flow_target(flow, PIF_HOST); + /* FIXME: Record outbound source address when known */ + flow_target_af(flow, PIF_HOST, af, NULL, 0, daddr, 0); pingf = FLOW_SET_TYPE(flow, flowtype, ping); pingf->seq = -1; diff --git a/inany.h b/inany.h index 47b66fa9..2bf3becf 100644 --- a/inany.h +++ b/inany.h @@ -187,7 +187,6 @@ static inline bool inany_is_unspecified(const union inany_addr *a) * * Return: true if @a is in fe80::/10 (IPv6 link local unicast) */ -/* cppcheck-suppress unusedFunction */ static inline bool inany_is_linklocal6(const union inany_addr *a) { return IN6_IS_ADDR_LINKLOCAL(&a->a6); @@ -273,4 +272,33 @@ static inline void inany_siphash_feed(struct siphash_state *state, const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size); +/** sockaddr_from_inany() - Construct a sockaddr from an inany + * @sa: Pointer to sockaddr to fill in + * @sl: Updated to relevant of length of initialised @sa + * @addr: IPv[46] address + * @port: Port (host byte order) + * @scope: Scope ID (ignored for IPv4 addresses)Perhaps worth specifying that it's the interface index for IPv6 link-local addresses only (even if it's not the case as of this patch, more below): * @scope: Scope ID for link-local IPv6 addresses only: interface indexRight, that will be changed by the rework I have in mind anyway.+ */ +static inline void sockaddr_from_inany(union sockaddr_inany *sa, socklen_t *sl, + const union inany_addr *addr, + in_port_t port, uint32_t scope) +{ + const struct in_addr *v4 = inany_v4(addr); + + if (v4) { + sa->sa_family = AF_INET; + sa->sa4.sin_addr = *v4; + sa->sa4.sin_port = htons(port); + memset(&sa->sa4.sin_zero, 0, sizeof(sa->sa4.sin_zero)); + *sl = sizeof(sa->sa4); + } else { + sa->sa_family = AF_INET6; + sa->sa6.sin6_addr = addr->a6; + sa->sa6.sin6_port = htons(port); + sa->sa6.sin6_scope_id = scope; + sa->sa6.sin6_flowinfo = 0; + *sl = sizeof(sa->sa6); + } +} + #endif /* INANY_H */ diff --git a/tcp.c b/tcp.c index 07a2eb1c..c6cd0c72 100644 --- a/tcp.c +++ b/tcp.c @@ -1610,18 +1610,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(dstport), - .sin_addr = *(struct in_addr *)daddr, - }; - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(dstport), - .sin6_addr = *(struct in6_addr *)daddr, - }; - const struct sockaddr *sa; + const struct flowside *ini, *tgt; struct tcp_tap_conn *conn; + union inany_addr dstaddr; /* FIXME: Avoid bulky temporary */ + union sockaddr_inany sa; union flow *flow; int s = -1, mss; socklen_t sl; @@ -1629,7 +1621,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(flow = flow_alloc())) return; - flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); + ini = flow_initiate_af(flow, PIF_TAP, + af, saddr, srcport, daddr, dstport); if (af == AF_INET) { if (IN4_IS_ADDR_UNSPECIFIED(saddr) || @@ -1661,19 +1654,28 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, dstport); goto cancel; } + } else { + ASSERT(0); } if ((s = tcp_conn_sock(c, af)) < 0) goto cancel; + dstaddr = ini->faddr; + if (!c->no_map_gw) { - if (af == AF_INET && IN4_ARE_ADDR_EQUAL(daddr, &c->ip4.gw)) - addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK); - if (af == AF_INET6 && IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw)) - addr6.sin6_addr = in6addr_loopback; + if (inany_equals4(&dstaddr, &c->ip4.gw)) + dstaddr = inany_loopback4; + else if (inany_equals6(&dstaddr, &c->ip6.gw)) + dstaddr = inany_loopback6; } - if (af == AF_INET6 && IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr)) { + /* FIXME: Record outbound source address when known */ + tgt = flow_target_af(flow, PIF_HOST, AF_INET6, + NULL, 0, /* Kernel decides source address */ + &dstaddr, dstport); + + if (inany_is_linklocal6(&tgt->eaddr)) { struct sockaddr_in6 addr6_ll = { .sin6_family = AF_INET6, .sin6_addr = c->ip6.addr_ll, @@ -1681,9 +1683,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, }; if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) goto cancel; + } else if (!inany_is_loopback(&tgt->eaddr)) { + tcp_bind_outbound(c, s, af); } - flow_target(flow, PIF_HOST); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); conn->sock = s; conn->timer = -1; @@ -1706,14 +1709,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, inany_from_af(&conn->faddr, af, daddr); - if (af == AF_INET) { - sa = (struct sockaddr *)&addr4; - sl = sizeof(addr4); - } else { - sa = (struct sockaddr *)&addr6; - sl = sizeof(addr6); - } - conn->fport = dstport; conn->eport = srcport; @@ -1726,19 +1721,16 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, tcp_hash_insert(c, conn); - if (!bind(s, sa, sl)) { + sockaddr_from_inany(&sa, &sl, &tgt->eaddr, tgt->eport, c->ifi6);...so, I never tried to use a non-zero scope identifier for, say, global unicast IPv6 addresses. Perhaps it's harmless. But I still think that, logically, we should pass it as zero in that case.Yeah, I already did that rebase, it was surprisingly fiddly.+ + if (!bind(s, &sa.sa, sl)) { tcp_rst(c, conn); /* Nobody is listening then */ goto cancel; } if (errno != EADDRNOTAVAIL && errno != EACCES) conn_flag(c, conn, LOCAL);This will have a semi-trivial conflict with 54a9d3801b95 ("tcp: Don't rely on bind() to fail to decide that connection target is valid").-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson- if ((af == AF_INET && !IN4_IS_ADDR_LOOPBACK(&addr4.sin_addr)) || - (af == AF_INET6 && !IN6_IS_ADDR_LOOPBACK(&addr6.sin6_addr) && - !IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr))) - tcp_bind_outbound(c, s, af); - - if (connect(s, sa, sl)) { + if (connect(s, &sa.sa, sl)) { if (errno != EINPROGRESS) { tcp_rst(c, conn); goto cancel; @@ -2236,9 +2228,25 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, const union sockaddr_inany *sa, const struct timespec *now) { + union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */ struct tcp_tap_conn *conn; + in_port_t srcport; + + inany_from_sockaddr(&saddr, &srcport, sa); + tcp_snat_inbound(c, &saddr); - flow_target(flow, PIF_TAP); + if (inany_v4(&saddr)) { + daddr = inany_from_v4(c->ip4.addr_seen); + } else { + if (inany_is_linklocal6(&saddr)) + daddr.a6 = c->ip6.addr_ll_seen; + else + daddr.a6 = c->ip6.addr_seen; + } + dstport += c->tcp.fwd_in.delta[dstport]; + + flow_target_af(flow, PIF_TAP, AF_INET6, + &saddr, srcport, &daddr, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); conn->sock = s; @@ -2246,10 +2254,9 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = dstport + c->tcp.fwd_in.delta[dstport]; - - tcp_snat_inbound(c, &conn->faddr); + conn->faddr = saddr; + conn->fport = srcport; + conn->eport = dstport; tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_splice.c b/tcp_splice.c index 5a406c63..bcb42c97 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -320,31 +320,20 @@ static int tcp_splice_connect_finish(const struct ctx *c, * tcp_splice_connect() - Create and connect socket for new spliced connection * @c: Execution context * @conn: Connection pointer - * @af: Address family - * @pif: pif on which to create socket - * @port: Destination port, host order * * Return: 0 for connect() succeeded or in progress, negative value on error */ -static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, - sa_family_t af, uint8_t pif, in_port_t port) +static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(port), - .sin6_addr = IN6ADDR_LOOPBACK_INIT, - }; - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(port), - .sin_addr = IN4ADDR_LOOPBACK_INIT, - }; - const struct sockaddr *sa; + const struct flowside *tgt = &conn->f.side[TGTSIDE]; + sa_family_t af = inany_v4(&tgt->eaddr) ? AF_INET : AF_INET6; + uint8_t tgtpif = conn->f.pif[TGTSIDE]; + union sockaddr_inany sa; socklen_t sl; - if (pif == PIF_HOST) + if (tgtpif == PIF_HOST) conn->s[1] = tcp_conn_sock(c, af); - else if (pif == PIF_SPLICE) + else if (tgtpif == PIF_SPLICE) conn->s[1] = tcp_conn_sock_ns(c, af); else ASSERT(0); @@ -358,15 +347,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, conn->s[1]); } - if (CONN_V6(conn)) { - sa = (struct sockaddr *)&addr6; - sl = sizeof(addr6); - } else { - sa = (struct sockaddr *)&addr4; - sl = sizeof(addr4); - } + sockaddr_from_inany(&sa, &sl, &tgt->eaddr, tgt->eport, 0); - if (connect(conn->s[1], sa, sl)) { + if (connect(conn->s[1], &sa.sa, sl)) { if (errno != EINPROGRESS) { flow_trace(conn, "Couldn't connect socket for splice: %s", strerror(errno)); @@ -471,7 +454,13 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, return false; } - flow_target(flow, tgtpif); + /* FIXME: Record outbound source address when known */ + if (af == AF_INET) + flow_target_af(flow, tgtpif, AF_INET, + NULL, 0, &in4addr_loopback, dstport); + else + flow_target_af(flow, tgtpif, AF_INET6, + NULL, 0, &in6addr_loopback, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice); conn->flags = af == AF_INET ? 0 : SPLICE_V6; @@ -483,7 +472,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int))) flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0); - if (tcp_splice_connect(c, conn, af, tgtpif, dstport)) + if (tcp_splice_connect(c, conn)) conn_flag(c, conn, CLOSING); FLOW_ACTIVATE(conn);
Some information we explicitly store in the TCP connection is now duplicated in the common flow structure. Access it from there instead, and remove it from the TCP specific structure. With that done we can reorder both the "tap" and "splice" TCP structures a bit to get better packing for the new combined flow table entries. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 52 ++++++++++++++++++++++++++------------------------ tcp_conn.h | 40 +++++++++++++++----------------------- tcp_internal.h | 6 +++++- 3 files changed, 47 insertions(+), 51 deletions(-) diff --git a/tcp.c b/tcp.c index c6cd0c72..30ad3dd4 100644 --- a/tcp.c +++ b/tcp.c @@ -333,8 +333,6 @@ #define ACK_IF_NEEDED 0 /* See tcp_send_flag() */ -#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) - #define CONN_IS_CLOSING(conn) \ (((conn)->events & ESTABLISHED) && \ ((conn)->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD))) @@ -635,10 +633,11 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, */ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) { + const struct flowside *tapside = TAPFLOW(conn); int i; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return 1; return 0; @@ -653,6 +652,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, const struct tcp_info *tinfo) { #ifdef HAS_MIN_RTT + const struct flowside *tapside = TAPFLOW(conn); int i, hole = -1; if (!tinfo->tcpi_min_rtt || @@ -660,7 +660,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -672,7 +672,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return; - low_rtt_dst[hole++] = conn->faddr; + low_rtt_dst[hole++] = tapside->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -827,8 +827,10 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->faddr, faddr) && - conn->eport == eport && conn->fport == fport) + const struct flowside *tapside = TAPFLOW(conn); + + if (inany_equals(&tapside->faddr, faddr) && + tapside->eport == eport && tapside->fport == fport) return 1; return 0; @@ -862,7 +864,10 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, static uint64_t tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + const struct flowside *tapside = TAPFLOW(conn); + + return tcp_hash(c, &tapside->faddr, tapside->eport, + tapside->fport); } /** @@ -998,10 +1003,12 @@ void tcp_defer_handler(struct ctx *c) * @seq: Sequence number */ static void tcp_fill_header(struct tcphdr *th, - const struct tcp_tap_conn *conn, uint32_t seq) + const struct tcp_tap_conn *conn, uint32_t seq) { - th->source = htons(conn->fport); - th->dest = htons(conn->eport); + const struct flowside *tapside = TAPFLOW(conn); + + th->source = htons(tapside->fport); + th->dest = htons(tapside->eport); th->seq = htonl(seq); th->ack_seq = htonl(conn->seq_ack_to_tap); if (conn->events & ESTABLISHED) { @@ -1033,7 +1040,8 @@ static size_t tcp_fill_headers4(const struct ctx *c, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = TAPFLOW(conn); + const struct in_addr *a4 = inany_v4(&tapside->faddr); size_t l4len = dlen + sizeof(*th); size_t l3len = l4len + sizeof(*iph); @@ -1075,10 +1083,11 @@ static size_t tcp_fill_headers6(const struct ctx *c, struct ipv6hdr *ip6h, struct tcphdr *th, size_t dlen, uint32_t seq) { + const struct flowside *tapside = TAPFLOW(conn); size_t l4len = dlen + sizeof(*th); ip6h->payload_len = htons(l4len); - ip6h->saddr = conn->faddr.a6; + ip6h->saddr = tapside->faddr.a6; if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr)) ip6h->daddr = c->ip6.addr_ll_seen; else @@ -1117,7 +1126,8 @@ size_t tcp_l2_buf_fill_headers(const struct ctx *c, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = TAPFLOW(conn); + const struct in_addr *a4 = inany_v4(&tapside->faddr); if (a4) { return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base, @@ -1420,6 +1430,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, const struct timespec *now) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); + const struct flowside *tapside = TAPFLOW(conn); union inany_addr aany; uint64_t hash; uint32_t ns; @@ -1429,10 +1440,10 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, else inany_from_af(&aany, AF_INET6, &c->ip6.addr); - inany_siphash_feed(&state, &conn->faddr); + inany_siphash_feed(&state, &tapside->faddr); inany_siphash_feed(&state, &aany); hash = siphash_final(&state, 36, - (uint64_t)conn->fport << 16 | conn->eport); + (uint64_t)tapside->fport << 16 | tapside->eport); /* 32ns ticks, overflows 32 bits every 137s */ ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; @@ -1707,11 +1718,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1; - inany_from_af(&conn->faddr, af, daddr); - - conn->fport = dstport; - conn->eport = srcport; - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2254,10 +2260,6 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - conn->faddr = saddr; - conn->fport = srcport; - conn->eport = dstport; - tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_conn.h b/tcp_conn.h index 5f8c8fb6..b741ce32 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -13,19 +13,16 @@ * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced) * @f: Generic flow information * @in_epoll: Is the connection in the epoll set? + * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT + * @ws_from_tap: Window scaling factor advertised from tap/guest + * @ws_to_tap: Window scaling factor advertised to tap/guest * @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS * @sock: Socket descriptor number * @events: Connection events, implying connection states * @timer: timerfd descriptor for timeout events * @flags: Connection flags representing internal attributes - * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT - * @ws_from_tap: Window scaling factor advertised from tap/guest - * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -49,6 +46,10 @@ struct tcp_tap_conn { unsigned int ws_from_tap :TCP_WS_BITS; unsigned int ws_to_tap :TCP_WS_BITS; +#define TCP_MSS_BITS 14 + unsigned int tap_mss :TCP_MSS_BITS; +#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS))) +#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS)) int sock :FD_REF_BITS; @@ -77,13 +78,6 @@ struct tcp_tap_conn { #define ACK_TO_TAP_DUE BIT(3) #define ACK_FROM_TAP_DUE BIT(4) - -#define TCP_MSS_BITS 14 - unsigned int tap_mss :TCP_MSS_BITS; -#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS))) -#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS)) - - #define SNDBUF_BITS 24 unsigned int sndbuf :SNDBUF_BITS; #define SNDBUF_SET(conn, bytes) (conn->sndbuf = ((bytes) >> (32 - SNDBUF_BITS))) @@ -91,11 +85,6 @@ struct tcp_tap_conn { uint8_t seq_dup_ack_approx; - - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap; @@ -109,22 +98,24 @@ struct tcp_tap_conn { /** * struct tcp_splice_conn - Descriptor for a spliced TCP connection * @f: Generic flow information - * @in_epoll: Is the connection in the epoll set? * @s: File descriptor for sockets * @pipe: File descriptors for pipes - * @events: Events observed/actions performed on connection - * @flags: Connection flags (attributes, not events) * @read: Bytes read (not fully written to other side in one shot) * @written: Bytes written (not fully written from one other side read) -*/ + * @events: Events observed/actions performed on connection + * @flags: Connection flags (attributes, not events) + * @in_epoll: Is the connection in the epoll set? + */ struct tcp_splice_conn { /* Must be first element */ struct flow_common f; - bool in_epoll :1; int s[SIDES]; int pipe[SIDES][2]; + uint32_t read[SIDES]; + uint32_t written[SIDES]; + uint8_t events; #define SPLICE_CLOSED 0 #define SPLICE_CONNECT BIT(0) @@ -144,8 +135,7 @@ struct tcp_splice_conn { #define RCVLOWAT_ACT_1 BIT(4) #define CLOSING BIT(5) - uint32_t read[SIDES]; - uint32_t written[SIDES]; + bool in_epoll :1; }; /* Socket pools */ diff --git a/tcp_internal.h b/tcp_internal.h index 51aaa169..4f61e5c3 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -39,7 +39,11 @@ #define OPT_SACKP 4 #define OPT_SACK 5 #define OPT_TS 8 -#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) + +#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) +#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)])) + +#define CONN_V4(conn) (!!inany_v4(&TAPFLOW(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) /* -- 2.45.2
On Fri, 14 Jun 2024 16:13:25 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Some information we explicitly store in the TCP connection is now duplicated in the common flow structure. Access it from there instead, and remove it from the TCP specific structure. With that done we can reorder both the "tap" and "splice" TCP structures a bit to get better packing for the new combined flow table entries. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 52 ++++++++++++++++++++++++++------------------------ tcp_conn.h | 40 +++++++++++++++----------------------- tcp_internal.h | 6 +++++- 3 files changed, 47 insertions(+), 51 deletions(-) diff --git a/tcp.c b/tcp.c index c6cd0c72..30ad3dd4 100644 --- a/tcp.c +++ b/tcp.c @@ -333,8 +333,6 @@ #define ACK_IF_NEEDED 0 /* See tcp_send_flag() */ -#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) - #define CONN_IS_CLOSING(conn) \ (((conn)->events & ESTABLISHED) && \ ((conn)->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD))) @@ -635,10 +633,11 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, */ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) { + const struct flowside *tapside = TAPFLOW(conn); int i; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return 1; return 0; @@ -653,6 +652,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, const struct tcp_info *tinfo) { #ifdef HAS_MIN_RTT + const struct flowside *tapside = TAPFLOW(conn); int i, hole = -1; if (!tinfo->tcpi_min_rtt || @@ -660,7 +660,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -672,7 +672,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return; - low_rtt_dst[hole++] = conn->faddr; + low_rtt_dst[hole++] = tapside->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -827,8 +827,10 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->faddr, faddr) && - conn->eport == eport && conn->fport == fport) + const struct flowside *tapside = TAPFLOW(conn); + + if (inany_equals(&tapside->faddr, faddr) && + tapside->eport == eport && tapside->fport == fport) return 1; return 0; @@ -862,7 +864,10 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, static uint64_t tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + const struct flowside *tapside = TAPFLOW(conn); + + return tcp_hash(c, &tapside->faddr, tapside->eport, + tapside->fport); } /** @@ -998,10 +1003,12 @@ void tcp_defer_handler(struct ctx *c) * @seq: Sequence number */ static void tcp_fill_header(struct tcphdr *th, - const struct tcp_tap_conn *conn, uint32_t seq) + const struct tcp_tap_conn *conn, uint32_t seq) { - th->source = htons(conn->fport); - th->dest = htons(conn->eport); + const struct flowside *tapside = TAPFLOW(conn); + + th->source = htons(tapside->fport); + th->dest = htons(tapside->eport); th->seq = htonl(seq); th->ack_seq = htonl(conn->seq_ack_to_tap); if (conn->events & ESTABLISHED) { @@ -1033,7 +1040,8 @@ static size_t tcp_fill_headers4(const struct ctx *c, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = TAPFLOW(conn); + const struct in_addr *a4 = inany_v4(&tapside->faddr); size_t l4len = dlen + sizeof(*th); size_t l3len = l4len + sizeof(*iph); @@ -1075,10 +1083,11 @@ static size_t tcp_fill_headers6(const struct ctx *c, struct ipv6hdr *ip6h, struct tcphdr *th, size_t dlen, uint32_t seq) { + const struct flowside *tapside = TAPFLOW(conn); size_t l4len = dlen + sizeof(*th); ip6h->payload_len = htons(l4len); - ip6h->saddr = conn->faddr.a6; + ip6h->saddr = tapside->faddr.a6; if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr)) ip6h->daddr = c->ip6.addr_ll_seen; else @@ -1117,7 +1126,8 @@ size_t tcp_l2_buf_fill_headers(const struct ctx *c, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = TAPFLOW(conn); + const struct in_addr *a4 = inany_v4(&tapside->faddr); if (a4) { return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base, @@ -1420,6 +1430,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, const struct timespec *now) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); + const struct flowside *tapside = TAPFLOW(conn); union inany_addr aany; uint64_t hash; uint32_t ns; @@ -1429,10 +1440,10 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, else inany_from_af(&aany, AF_INET6, &c->ip6.addr); - inany_siphash_feed(&state, &conn->faddr); + inany_siphash_feed(&state, &tapside->faddr); inany_siphash_feed(&state, &aany); hash = siphash_final(&state, 36, - (uint64_t)conn->fport << 16 | conn->eport); + (uint64_t)tapside->fport << 16 | tapside->eport); /* 32ns ticks, overflows 32 bits every 137s */ ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; @@ -1707,11 +1718,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1; - inany_from_af(&conn->faddr, af, daddr); - - conn->fport = dstport; - conn->eport = srcport; - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2254,10 +2260,6 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - conn->faddr = saddr; - conn->fport = srcport; - conn->eport = dstport; - tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_conn.h b/tcp_conn.h index 5f8c8fb6..b741ce32 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -13,19 +13,16 @@ * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced) * @f: Generic flow information * @in_epoll: Is the connection in the epoll set? + * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT + * @ws_from_tap: Window scaling factor advertised from tap/guest + * @ws_to_tap: Window scaling factor advertised to tap/guest * @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS * @sock: Socket descriptor number * @events: Connection events, implying connection states * @timer: timerfd descriptor for timeout events * @flags: Connection flags representing internal attributes - * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT - * @ws_from_tap: Window scaling factor advertised from tap/guest - * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -49,6 +46,10 @@ struct tcp_tap_conn { unsigned int ws_from_tap :TCP_WS_BITS; unsigned int ws_to_tap :TCP_WS_BITS; +#define TCP_MSS_BITS 14 + unsigned int tap_mss :TCP_MSS_BITS; +#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS))) +#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS)) int sock :FD_REF_BITS; @@ -77,13 +78,6 @@ struct tcp_tap_conn { #define ACK_TO_TAP_DUE BIT(3) #define ACK_FROM_TAP_DUE BIT(4) - -#define TCP_MSS_BITS 14 - unsigned int tap_mss :TCP_MSS_BITS; -#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS))) -#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS)) - - #define SNDBUF_BITS 24 unsigned int sndbuf :SNDBUF_BITS; #define SNDBUF_SET(conn, bytes) (conn->sndbuf = ((bytes) >> (32 - SNDBUF_BITS))) @@ -91,11 +85,6 @@ struct tcp_tap_conn { uint8_t seq_dup_ack_approx; - - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap; @@ -109,22 +98,24 @@ struct tcp_tap_conn { /** * struct tcp_splice_conn - Descriptor for a spliced TCP connection * @f: Generic flow information - * @in_epoll: Is the connection in the epoll set? * @s: File descriptor for sockets * @pipe: File descriptors for pipes - * @events: Events observed/actions performed on connection - * @flags: Connection flags (attributes, not events) * @read: Bytes read (not fully written to other side in one shot) * @written: Bytes written (not fully written from one other side read) -*/ + * @events: Events observed/actions performed on connection + * @flags: Connection flags (attributes, not events) + * @in_epoll: Is the connection in the epoll set? + */ struct tcp_splice_conn { /* Must be first element */ struct flow_common f; - bool in_epoll :1; int s[SIDES]; int pipe[SIDES][2]; + uint32_t read[SIDES]; + uint32_t written[SIDES]; + uint8_t events; #define SPLICE_CLOSED 0 #define SPLICE_CONNECT BIT(0) @@ -144,8 +135,7 @@ struct tcp_splice_conn { #define RCVLOWAT_ACT_1 BIT(4) #define CLOSING BIT(5) - uint32_t read[SIDES]; - uint32_t written[SIDES]; + bool in_epoll :1;Excess tab.}; /* Socket pools */ diff --git a/tcp_internal.h b/tcp_internal.h index 51aaa169..4f61e5c3 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -39,7 +39,11 @@ #define OPT_SACKP 4 #define OPT_SACK 5 #define OPT_TS 8 -#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) + +#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) +#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)])) + +#define CONN_V4(conn) (!!inany_v4(&TAPFLOW(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) /*I reviewed up to 7/26 by the way, no further comments until that point. -- Stefano
On Wed, Jun 26, 2024 at 12:25:05AM +0200, Stefano Brivio wrote:On Fri, 14 Jun 2024 16:13:25 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Oops, fixed.Some information we explicitly store in the TCP connection is now duplicated in the common flow structure. Access it from there instead, and remove it from the TCP specific structure. With that done we can reorder both the "tap" and "splice" TCP structures a bit to get better packing for the new combined flow table entries. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 52 ++++++++++++++++++++++++++------------------------ tcp_conn.h | 40 +++++++++++++++----------------------- tcp_internal.h | 6 +++++- 3 files changed, 47 insertions(+), 51 deletions(-) diff --git a/tcp.c b/tcp.c index c6cd0c72..30ad3dd4 100644 --- a/tcp.c +++ b/tcp.c @@ -333,8 +333,6 @@ #define ACK_IF_NEEDED 0 /* See tcp_send_flag() */ -#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) - #define CONN_IS_CLOSING(conn) \ (((conn)->events & ESTABLISHED) && \ ((conn)->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD))) @@ -635,10 +633,11 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, */ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) { + const struct flowside *tapside = TAPFLOW(conn); int i; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return 1; return 0; @@ -653,6 +652,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, const struct tcp_info *tinfo) { #ifdef HAS_MIN_RTT + const struct flowside *tapside = TAPFLOW(conn); int i, hole = -1; if (!tinfo->tcpi_min_rtt || @@ -660,7 +660,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -672,7 +672,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return; - low_rtt_dst[hole++] = conn->faddr; + low_rtt_dst[hole++] = tapside->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -827,8 +827,10 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->faddr, faddr) && - conn->eport == eport && conn->fport == fport) + const struct flowside *tapside = TAPFLOW(conn); + + if (inany_equals(&tapside->faddr, faddr) && + tapside->eport == eport && tapside->fport == fport) return 1; return 0; @@ -862,7 +864,10 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, static uint64_t tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + const struct flowside *tapside = TAPFLOW(conn); + + return tcp_hash(c, &tapside->faddr, tapside->eport, + tapside->fport); } /** @@ -998,10 +1003,12 @@ void tcp_defer_handler(struct ctx *c) * @seq: Sequence number */ static void tcp_fill_header(struct tcphdr *th, - const struct tcp_tap_conn *conn, uint32_t seq) + const struct tcp_tap_conn *conn, uint32_t seq) { - th->source = htons(conn->fport); - th->dest = htons(conn->eport); + const struct flowside *tapside = TAPFLOW(conn); + + th->source = htons(tapside->fport); + th->dest = htons(tapside->eport); th->seq = htonl(seq); th->ack_seq = htonl(conn->seq_ack_to_tap); if (conn->events & ESTABLISHED) { @@ -1033,7 +1040,8 @@ static size_t tcp_fill_headers4(const struct ctx *c, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = TAPFLOW(conn); + const struct in_addr *a4 = inany_v4(&tapside->faddr); size_t l4len = dlen + sizeof(*th); size_t l3len = l4len + sizeof(*iph); @@ -1075,10 +1083,11 @@ static size_t tcp_fill_headers6(const struct ctx *c, struct ipv6hdr *ip6h, struct tcphdr *th, size_t dlen, uint32_t seq) { + const struct flowside *tapside = TAPFLOW(conn); size_t l4len = dlen + sizeof(*th); ip6h->payload_len = htons(l4len); - ip6h->saddr = conn->faddr.a6; + ip6h->saddr = tapside->faddr.a6; if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr)) ip6h->daddr = c->ip6.addr_ll_seen; else @@ -1117,7 +1126,8 @@ size_t tcp_l2_buf_fill_headers(const struct ctx *c, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = TAPFLOW(conn); + const struct in_addr *a4 = inany_v4(&tapside->faddr); if (a4) { return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base, @@ -1420,6 +1430,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, const struct timespec *now) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); + const struct flowside *tapside = TAPFLOW(conn); union inany_addr aany; uint64_t hash; uint32_t ns; @@ -1429,10 +1440,10 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, else inany_from_af(&aany, AF_INET6, &c->ip6.addr); - inany_siphash_feed(&state, &conn->faddr); + inany_siphash_feed(&state, &tapside->faddr); inany_siphash_feed(&state, &aany); hash = siphash_final(&state, 36, - (uint64_t)conn->fport << 16 | conn->eport); + (uint64_t)tapside->fport << 16 | tapside->eport); /* 32ns ticks, overflows 32 bits every 137s */ ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; @@ -1707,11 +1718,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1; - inany_from_af(&conn->faddr, af, daddr); - - conn->fport = dstport; - conn->eport = srcport; - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2254,10 +2260,6 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - conn->faddr = saddr; - conn->fport = srcport; - conn->eport = dstport; - tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_conn.h b/tcp_conn.h index 5f8c8fb6..b741ce32 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -13,19 +13,16 @@ * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced) * @f: Generic flow information * @in_epoll: Is the connection in the epoll set? + * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT + * @ws_from_tap: Window scaling factor advertised from tap/guest + * @ws_to_tap: Window scaling factor advertised to tap/guest * @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS * @sock: Socket descriptor number * @events: Connection events, implying connection states * @timer: timerfd descriptor for timeout events * @flags: Connection flags representing internal attributes - * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT - * @ws_from_tap: Window scaling factor advertised from tap/guest - * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -49,6 +46,10 @@ struct tcp_tap_conn { unsigned int ws_from_tap :TCP_WS_BITS; unsigned int ws_to_tap :TCP_WS_BITS; +#define TCP_MSS_BITS 14 + unsigned int tap_mss :TCP_MSS_BITS; +#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS))) +#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS)) int sock :FD_REF_BITS; @@ -77,13 +78,6 @@ struct tcp_tap_conn { #define ACK_TO_TAP_DUE BIT(3) #define ACK_FROM_TAP_DUE BIT(4) - -#define TCP_MSS_BITS 14 - unsigned int tap_mss :TCP_MSS_BITS; -#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS))) -#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS)) - - #define SNDBUF_BITS 24 unsigned int sndbuf :SNDBUF_BITS; #define SNDBUF_SET(conn, bytes) (conn->sndbuf = ((bytes) >> (32 - SNDBUF_BITS))) @@ -91,11 +85,6 @@ struct tcp_tap_conn { uint8_t seq_dup_ack_approx; - - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap; @@ -109,22 +98,24 @@ struct tcp_tap_conn { /** * struct tcp_splice_conn - Descriptor for a spliced TCP connection * @f: Generic flow information - * @in_epoll: Is the connection in the epoll set? * @s: File descriptor for sockets * @pipe: File descriptors for pipes - * @events: Events observed/actions performed on connection - * @flags: Connection flags (attributes, not events) * @read: Bytes read (not fully written to other side in one shot) * @written: Bytes written (not fully written from one other side read) -*/ + * @events: Events observed/actions performed on connection + * @flags: Connection flags (attributes, not events) + * @in_epoll: Is the connection in the epoll set? + */ struct tcp_splice_conn { /* Must be first element */ struct flow_common f; - bool in_epoll :1; int s[SIDES]; int pipe[SIDES][2]; + uint32_t read[SIDES]; + uint32_t written[SIDES]; + uint8_t events; #define SPLICE_CLOSED 0 #define SPLICE_CONNECT BIT(0) @@ -144,8 +135,7 @@ struct tcp_splice_conn { #define RCVLOWAT_ACT_1 BIT(4) #define CLOSING BIT(5) - uint32_t read[SIDES]; - uint32_t written[SIDES]; + bool in_epoll :1;Excess tab.-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson}; /* Socket pools */ diff --git a/tcp_internal.h b/tcp_internal.h index 51aaa169..4f61e5c3 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -39,7 +39,11 @@ #define OPT_SACKP 4 #define OPT_SACK 5 #define OPT_TS 8 -#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) + +#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) +#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)])) + +#define CONN_V4(conn) (!!inany_v4(&TAPFLOW(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) /*I reviewed up to 7/26 by the way, no further comments until that point.
Currently we always deliver inbound TCP packets to the guest's most recent observed IP address. This has the odd side effect that if the guest changes its IP address with active TCP connections we might deliver packets from old connections to the new address. That won't work; it will probably result in an RST from the guest. Worse, if the guest added a new address but also retains the old one, then we could break those old connections by redirecting them to the new address. Now that we maintain flowside information, we have a record of the correct guest side address and can just use it. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 41 +++++++++++++---------------------------- tcp_buf.c | 6 +++--- tcp_internal.h | 3 +-- 3 files changed, 17 insertions(+), 33 deletions(-) diff --git a/tcp.c b/tcp.c index 30ad3dd4..df625751 100644 --- a/tcp.c +++ b/tcp.c @@ -1022,7 +1022,6 @@ static void tcp_fill_header(struct tcphdr *th, /** * tcp_fill_headers4() - Fill 802.3, IPv4, TCP headers in pre-cooked buffers - * @c: Execution context * @conn: Connection pointer * @taph: tap backend specific header * @iph: Pointer to IPv4 header @@ -1033,27 +1032,26 @@ static void tcp_fill_header(struct tcphdr *th, * * Return: The IPv4 payload length, host order */ -static size_t tcp_fill_headers4(const struct ctx *c, - const struct tcp_tap_conn *conn, +static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn, struct tap_hdr *taph, struct iphdr *iph, struct tcphdr *th, size_t dlen, const uint16_t *check, uint32_t seq) { const struct flowside *tapside = TAPFLOW(conn); - const struct in_addr *a4 = inany_v4(&tapside->faddr); + const struct in_addr *src4 = inany_v4(&tapside->faddr); + const struct in_addr *dst4 = inany_v4(&tapside->eaddr); size_t l4len = dlen + sizeof(*th); size_t l3len = l4len + sizeof(*iph); - ASSERT(a4); + ASSERT(src4 && dst4); iph->tot_len = htons(l3len); - iph->saddr = a4->s_addr; - iph->daddr = c->ip4.addr_seen.s_addr; + iph->saddr = src4->s_addr; + iph->daddr = dst4->s_addr; iph->check = check ? *check : - csum_ip4_header(l3len, IPPROTO_TCP, - *a4, c->ip4.addr_seen); + csum_ip4_header(l3len, IPPROTO_TCP, *src4, *dst4); tcp_fill_header(th, conn, seq); @@ -1066,7 +1064,6 @@ static size_t tcp_fill_headers4(const struct ctx *c, /** * tcp_fill_headers6() - Fill 802.3, IPv6, TCP headers in pre-cooked buffers - * @c: Execution context * @conn: Connection pointer * @taph: tap backend specific header * @ip6h: Pointer to IPv6 header @@ -1077,8 +1074,7 @@ static size_t tcp_fill_headers4(const struct ctx *c, * * Return: The IPv6 payload length, host order */ -static size_t tcp_fill_headers6(const struct ctx *c, - const struct tcp_tap_conn *conn, +static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn, struct tap_hdr *taph, struct ipv6hdr *ip6h, struct tcphdr *th, size_t dlen, uint32_t seq) @@ -1088,10 +1084,7 @@ static size_t tcp_fill_headers6(const struct ctx *c, ip6h->payload_len = htons(l4len); ip6h->saddr = tapside->faddr.a6; - if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr)) - ip6h->daddr = c->ip6.addr_ll_seen; - else - ip6h->daddr = c->ip6.addr_seen; + ip6h->daddr = tapside->eaddr.a6; ip6h->hop_limit = 255; ip6h->version = 6; @@ -1112,7 +1105,6 @@ static size_t tcp_fill_headers6(const struct ctx *c, /** * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers - * @c: Execution context * @conn: Connection pointer * @iov: Pointer to an array of iovec of TCP pre-cooked buffers * @dlen: TCP payload length @@ -1121,8 +1113,7 @@ static size_t tcp_fill_headers6(const struct ctx *c, * * Return: IP payload length, host order */ -size_t tcp_l2_buf_fill_headers(const struct ctx *c, - const struct tcp_tap_conn *conn, +size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq) { @@ -1130,13 +1121,13 @@ size_t tcp_l2_buf_fill_headers(const struct ctx *c, const struct in_addr *a4 = inany_v4(&tapside->faddr); if (a4) { - return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base, + return tcp_fill_headers4(conn, iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_IP].iov_base, iov[TCP_IOV_PAYLOAD].iov_base, dlen, check, seq); } - return tcp_fill_headers6(c, conn, iov[TCP_IOV_TAP].iov_base, + return tcp_fill_headers6(conn, iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_IP].iov_base, iov[TCP_IOV_PAYLOAD].iov_base, dlen, seq); @@ -1431,17 +1422,11 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, { struct siphash_state state = SIPHASH_INIT(c->hash_secret); const struct flowside *tapside = TAPFLOW(conn); - union inany_addr aany; uint64_t hash; uint32_t ns; - if (CONN_V4(conn)) - inany_from_af(&aany, AF_INET, &c->ip4.addr); - else - inany_from_af(&aany, AF_INET6, &c->ip6.addr); - inany_siphash_feed(&state, &tapside->faddr); - inany_siphash_feed(&state, &aany); + inany_siphash_feed(&state, &tapside->eaddr); hash = siphash_final(&state, 36, (uint64_t)tapside->fport << 16 | tapside->eport); diff --git a/tcp_buf.c b/tcp_buf.c index ff9f7970..9878e4b1 100644 --- a/tcp_buf.c +++ b/tcp_buf.c @@ -316,7 +316,7 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags) return ret; } - l4len = tcp_l2_buf_fill_headers(c, conn, iov, optlen, NULL, seq); + l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL, seq); iov[TCP_IOV_PAYLOAD].iov_len = l4len; if (flags & DUP_ACK) { @@ -373,7 +373,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, tcp4_frame_conns[tcp4_payload_used] = conn; iov = tcp4_l2_iov[tcp4_payload_used++]; - l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, check, seq); + l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq); iov[TCP_IOV_PAYLOAD].iov_len = l4len; if (tcp4_payload_used > TCP_FRAMES_MEM - 1) tcp_payload_flush(c); @@ -381,7 +381,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, tcp6_frame_conns[tcp6_payload_used] = conn; iov = tcp6_l2_iov[tcp6_payload_used++]; - l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, NULL, seq); + l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, NULL, seq); iov[TCP_IOV_PAYLOAD].iov_len = l4len; if (tcp6_payload_used > TCP_FRAMES_MEM - 1) tcp_payload_flush(c); diff --git a/tcp_internal.h b/tcp_internal.h index 4f61e5c3..ac6d4b21 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -88,8 +88,7 @@ void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn); tcp_rst_do(c, conn); \ } while (0) -size_t tcp_l2_buf_fill_headers(const struct ctx *c, - const struct tcp_tap_conn *conn, +size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq); int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, -- 2.45.2
For now when we forward a connection to the host we leave the host side forwarding address and port blank since we don't necessarily know what source address and port will be used by the kernel. When the outbound address option is active, though, we do know the address at least, so we can record it in the flowside. Having done that, use it as the primary source of truth, binding the outgoing socket based on the information in there. This allows the possibility of more complex rules for what outbound address and/or port we use in future. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 93 ++++++++++++++++++++++++++++++----------------------------- 1 file changed, 47 insertions(+), 46 deletions(-) diff --git a/tcp.c b/tcp.c index df625751..74883312 100644 --- a/tcp.c +++ b/tcp.c @@ -1536,53 +1536,48 @@ static uint16_t tcp_conn_tap_mss(const struct tcp_tap_conn *conn, /** * tcp_bind_outbound() - Bind socket to outbound address and interface if given * @c: Execution context + * @conn: Connection entry for socket to bind * @s: Outbound TCP socket - * @af: Address family */ -static void tcp_bind_outbound(const struct ctx *c, int s, sa_family_t af) +static void tcp_bind_outbound(const struct ctx *c, + const struct tcp_tap_conn *conn, int s) { - if (af == AF_INET) { - if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.addr_out)) { - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = 0, - .sin_addr = c->ip4.addr_out, - }; - - if (bind(s, (struct sockaddr *)&addr4, sizeof(addr4))) { - debug("Can't bind IPv4 TCP socket address: %s", - strerror(errno)); - } + const struct flowside *tgt = &conn->f.side[TGTSIDE]; + union sockaddr_inany bind_sa; + socklen_t sl; + + sockaddr_from_inany(&bind_sa, &sl, + &tgt->faddr, tgt->fport, c->ifi6); + + if (!inany_is_unspecified(&tgt->faddr) || tgt->fport) { + if (bind(s, &bind_sa.sa, sl)) { + char sstr[INANY_ADDRSTRLEN]; + + flow_dbg(conn, + "Can't bind TCP outbound socket to %s:%hu: %s", + inany_ntop(&tgt->faddr, sstr, sizeof(sstr)), + tgt->fport, strerror(errno)); } + } + if (bind_sa.sa_family == AF_INET) { if (*c->ip4.ifname_out) { if (setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE, c->ip4.ifname_out, strlen(c->ip4.ifname_out))) { - debug("Can't bind IPv4 TCP socket to interface:" - " %s", strerror(errno)); - } - } - } else if (af == AF_INET6) { - if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr_out)) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = 0, - .sin6_addr = c->ip6.addr_out, - }; - - if (bind(s, (struct sockaddr *)&addr6, sizeof(addr6))) { - debug("Can't bind IPv6 TCP socket address: %s", - strerror(errno)); + flow_dbg(conn, "Can't bind IPv4 TCP socket to" + " interface %s: %s", c->ip4.ifname_out, + strerror(errno)); } } - + } else if (bind_sa.sa_family == AF_INET6) { if (*c->ip6.ifname_out) { if (setsockopt(s, SOL_SOCKET, SO_BINDTODEVICE, c->ip6.ifname_out, strlen(c->ip6.ifname_out))) { - debug("Can't bind IPv6 TCP socket to interface:" - " %s", strerror(errno)); + flow_dbg(conn, "Can't bind IPv6 TCP socket to" + " interface %s: %s", c->ip6.ifname_out, + strerror(errno)); } } } @@ -1606,9 +1601,9 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); + union inany_addr srcaddr, dstaddr; /* FIXME: Avoid bulky temporaries */ const struct flowside *ini, *tgt; struct tcp_tap_conn *conn; - union inany_addr dstaddr; /* FIXME: Avoid bulky temporary */ union sockaddr_inany sa; union flow *flow; int s = -1, mss; @@ -1666,23 +1661,27 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, dstaddr = inany_loopback6; } + if (inany_is_linklocal6(&dstaddr)) { + srcaddr.a6 = c->ip6.addr_ll; + } else if (inany_is_loopback(&dstaddr)) { + srcaddr = dstaddr; + } else if (inany_v4(&dstaddr)) { + if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.addr_out)) + srcaddr = inany_from_v4(c->ip4.addr_out); + else + srcaddr = inany_any4; + } else { + if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr_out)) + srcaddr.a6 = c->ip6.addr_out; + else + srcaddr = inany_any6; + } + /* FIXME: Record outbound source address when known */ tgt = flow_target_af(flow, PIF_HOST, AF_INET6, - NULL, 0, /* Kernel decides source address */ + &srcaddr, 0, /* Kernel decides source port */ &dstaddr, dstport); - if (inany_is_linklocal6(&tgt->eaddr)) { - struct sockaddr_in6 addr6_ll = { - .sin6_family = AF_INET6, - .sin6_addr = c->ip6.addr_ll, - .sin6_scope_id = c->ifi6, - }; - if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) - goto cancel; - } else if (!inany_is_loopback(&tgt->eaddr)) { - tcp_bind_outbound(c, s, af); - } - conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); conn->sock = s; conn->timer = -1; @@ -1721,6 +1720,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (errno != EADDRNOTAVAIL && errno != EACCES) conn_flag(c, conn, LOCAL); + tcp_bind_outbound(c, conn, s); + if (connect(s, &sa.sa, sl)) { if (errno != EINPROGRESS) { tcp_rst(c, conn); -- 2.45.2
Now that we store all our endpoints in the flowside structure, use some inany helpers to make validation of those endpoints simpler. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- inany.h | 1 - tcp.c | 72 +++++++++++++++------------------------------------------ 2 files changed, 18 insertions(+), 55 deletions(-) diff --git a/inany.h b/inany.h index 2bf3becf..27b1c88f 100644 --- a/inany.h +++ b/inany.h @@ -211,7 +211,6 @@ static inline bool inany_is_multicast(const union inany_addr *a) * * Return: true if @a is specified and a unicast address */ -/* cppcheck-suppress unusedFunction */ static inline bool inany_is_unicast(const union inany_addr *a) { return !inany_is_unspecified(a) && !inany_is_multicast(a); diff --git a/tcp.c b/tcp.c index 74883312..09add999 100644 --- a/tcp.c +++ b/tcp.c @@ -1615,38 +1615,14 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, ini = flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); - if (af == AF_INET) { - if (IN4_IS_ADDR_UNSPECIFIED(saddr) || - IN4_IS_ADDR_BROADCAST(saddr) || - IN4_IS_ADDR_MULTICAST(saddr) || srcport == 0 || - IN4_IS_ADDR_UNSPECIFIED(daddr) || - IN4_IS_ADDR_BROADCAST(daddr) || - IN4_IS_ADDR_MULTICAST(daddr) || dstport == 0) { - char sstr[INET_ADDRSTRLEN], dstr[INET_ADDRSTRLEN]; - - debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu", - inet_ntop(AF_INET, saddr, sstr, sizeof(sstr)), - srcport, - inet_ntop(AF_INET, daddr, dstr, sizeof(dstr)), - dstport); - goto cancel; - } - } else if (af == AF_INET6) { - if (IN6_IS_ADDR_UNSPECIFIED(saddr) || - IN6_IS_ADDR_MULTICAST(saddr) || srcport == 0 || - IN6_IS_ADDR_UNSPECIFIED(daddr) || - IN6_IS_ADDR_MULTICAST(daddr) || dstport == 0) { - char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN]; - - debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu", - inet_ntop(AF_INET6, saddr, sstr, sizeof(sstr)), - srcport, - inet_ntop(AF_INET6, daddr, dstr, sizeof(dstr)), - dstport); - goto cancel; - } - } else { - ASSERT(0); + if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0 || + !inany_is_unicast(&ini->faddr) || ini->fport == 0) { + char sstr[INANY_ADDRSTRLEN], dstr[INANY_ADDRSTRLEN]; + + debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu", + inany_ntop(&ini->eaddr, sstr, sizeof(sstr)), ini->eport, + inany_ntop(&ini->faddr, dstr, sizeof(dstr)), ini->fport); + goto cancel; } if ((s = tcp_conn_sock(c, af)) < 0) @@ -2270,7 +2246,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, void tcp_listen_handler(struct ctx *c, union epoll_ref ref, const struct timespec *now) { - char sastr[SOCKADDR_STRLEN]; + const struct flowside *ini; union sockaddr_inany sa; socklen_t sl = sizeof(sa); union flow *flow; @@ -2285,23 +2261,15 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, /* FIXME: When listening port has a specific bound address, record that * as the forwarding address */ - flow_initiate_sa(flow, ref.tcp_listen.pif, &sa, ref.tcp_listen.port); - - if (sa.sa_family == AF_INET) { - const struct in_addr *addr = &sa.sa4.sin_addr; - in_port_t port = sa.sa4.sin_port; - - if (IN4_IS_ADDR_UNSPECIFIED(addr) || - IN4_IS_ADDR_BROADCAST(addr) || - IN4_IS_ADDR_MULTICAST(addr) || port == 0) - goto bad_endpoint; - } else if (sa.sa_family == AF_INET6) { - const struct in6_addr *addr = &sa.sa6.sin6_addr; - in_port_t port = sa.sa6.sin6_port; - - if (IN6_IS_ADDR_UNSPECIFIED(addr) || - IN6_IS_ADDR_MULTICAST(addr) || port == 0) - goto bad_endpoint; + ini = flow_initiate_sa(flow, ref.tcp_listen.pif, &sa, + ref.tcp_listen.port); + + if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) { + char sastr[SOCKADDR_STRLEN]; + + err("Invalid endpoint from TCP accept(): %s", + sockaddr_ntop(&sa, sastr, sizeof(sastr))); + goto cancel; } if (tcp_splice_conn_from_sock(c, ref.tcp_listen.pif, @@ -2311,10 +2279,6 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, tcp_tap_conn_from_sock(c, ref.tcp_listen.port, flow, s, &sa, now); return; -bad_endpoint: - err("Invalid endpoint from TCP accept(): %s", - sockaddr_ntop(&sa, sastr, sizeof(sastr))); - cancel: flow_alloc_cancel(flow); } -- 2.45.2
Since we're now constructing socket addresses based on information in the flowside, we no longer need an explicit flag to tell if we're dealing with an IPv4 or IPv6 connection. Hence, drop the now unused SPLICE_V6 flag. As well as just simplifying the code, this allows for possible future extensions where we could splice an IPv4 connection to an IPv6 connection or vice versa. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp_conn.h | 11 +++++------ tcp_splice.c | 3 --- 2 files changed, 5 insertions(+), 9 deletions(-) diff --git a/tcp_conn.h b/tcp_conn.h index b741ce32..08d58d6e 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -128,12 +128,11 @@ struct tcp_splice_conn { #define FIN_SENT_1 BIT(7) uint8_t flags; -#define SPLICE_V6 BIT(0) -#define RCVLOWAT_SET_0 BIT(1) -#define RCVLOWAT_SET_1 BIT(2) -#define RCVLOWAT_ACT_0 BIT(3) -#define RCVLOWAT_ACT_1 BIT(4) -#define CLOSING BIT(5) +#define RCVLOWAT_SET_0 BIT(0) +#define RCVLOWAT_SET_1 BIT(1) +#define RCVLOWAT_ACT_0 BIT(2) +#define RCVLOWAT_ACT_1 BIT(3) +#define CLOSING BIT(4) bool in_epoll :1; }; diff --git a/tcp_splice.c b/tcp_splice.c index bcb42c97..d829d753 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -73,8 +73,6 @@ static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE]; /* Pool of pre-opened pipes */ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2]; -#define CONN_V6(x) ((x)->flags & SPLICE_V6) -#define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) (((conn)->events & (set)) == (set)) #define CONN(idx) (&FLOW(idx)->tcp_splice) @@ -463,7 +461,6 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, NULL, 0, &in6addr_loopback, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice); - conn->flags = af == AF_INET ? 0 : SPLICE_V6; conn->s[0] = s0; conn->s[1] = -1; conn->pipe[0][0] = conn->pipe[0][1] = -1; -- 2.45.2
Currently we match TCP packets received on the tap connection to a TCP connection via a hash table based on the forwarding address and both ports. We hope in future to allow for multiple guest side addresses, or for multiple interfaces which means we may need to distinguish based on the endpoint address and pif as well. We also want a unified hash table to cover multiple protocols, not just TCP. Replace the TCP specific hash function with one suitable for general flows, or rather for one side of a general flow. This includes all the information from struct flowside, plus the pif and the L4 protocol number. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 35 +++++++++++++++++++++++++++--- flow.h | 19 ++++++++++++++++ flow_table.h | 3 +++ tcp.c | 61 ++++++++++------------------------------------------ 4 files changed, 65 insertions(+), 53 deletions(-) diff --git a/flow.c b/flow.c index 39e046bf..8ed9c1ae 100644 --- a/flow.c +++ b/flow.c @@ -116,9 +116,9 @@ static struct timespec flow_timer_run; * @faddr: Forwarding address (pointer to in_addr or in6_addr) * @fport: Forwarding port */ -static void flowside_from_af(struct flowside *fside, sa_family_t af, - const void *eaddr, in_port_t eport, - const void *faddr, in_port_t fport) +void flowside_from_af(struct flowside *fside, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport) { if (faddr) inany_from_af(&fside->faddr, af, faddr); @@ -401,6 +401,35 @@ void flow_alloc_cancel(union flow *flow) flow_new_entry = NULL; } +/** + * flow_hash() - Calculate hash value for one side of a flow + * @c: Execution context + * @proto: Protocol of this flow (IP L4 protocol number) + * @pif: pif of the side to hash + * @fside: Flowside (must not have unspecified parts) + * + * Return: hash value + */ +uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, + const struct flowside *fside) +{ + struct siphash_state state = SIPHASH_INIT(c->hash_secret); + + /* For the hash table to work, we need complete endpoint information, + * and at least a forwarding port. + */ + ASSERT(pif != PIF_NONE && !inany_is_unspecified(&fside->eaddr) && + fside->eport != 0 && fside->fport != 0); + + inany_siphash_feed(&state, &fside->faddr); + inany_siphash_feed(&state, &fside->eaddr); + + return siphash_final(&state, 38, (uint64_t)proto << 40 | + (uint64_t)pif << 32 | + (uint64_t)fside->fport << 16 | + (uint64_t)fside->eport); +} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context diff --git a/flow.h b/flow.h index b873ea7f..6c9eaf60 100644 --- a/flow.h +++ b/flow.h @@ -149,6 +149,25 @@ struct flowside { in_port_t eport; }; +/** + * flowside_eq() - Check if two flowsides are equal + * @left, @right: Flowsides to compare + * + * Return: true if equal, false otherwise + */ +static inline bool flowside_eq(const struct flowside *left, + const struct flowside *right) +{ + return inany_equals(&left->eaddr, &right->eaddr) && + left->eport == right->eport && + inany_equals(&left->faddr, &right->faddr) && + left->fport == right->fport; +} + +void flowside_from_af(struct flowside *fside, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport); + /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry diff --git a/flow_table.h b/flow_table.h index 7af32c6a..34f4e6cc 100644 --- a/flow_table.h +++ b/flow_table.h @@ -126,4 +126,7 @@ void flow_activate(struct flow_common *f); #define FLOW_ACTIVATE(flow_) \ (flow_activate(&(flow_)->f)) +uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, + const struct flowside *fside); + #endif /* FLOW_TABLE_H */ diff --git a/tcp.c b/tcp.c index 09add999..531cba2d 100644 --- a/tcp.c +++ b/tcp.c @@ -376,7 +376,7 @@ static struct iovec tcp_iov [UIO_MAXIOV]; #define CONN(idx) (&(FLOW(idx)->tcp)) -/* Table for lookup from remote address, local port, remote port */ +/* Table for lookup from flowside information */ static flow_sidx_t tc_hash[TCP_HASH_TABLE_SIZE]; static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX, @@ -814,46 +814,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find, return -1; } -/** - * tcp_hash_match() - Check if a connection entry matches address and ports - * @conn: Connection entry to match against - * @faddr: Guest side forwarding address - * @eport: Guest side endpoint port - * @fport: Guest side forwarding port - * - * Return: 1 on match, 0 otherwise - */ -static int tcp_hash_match(const struct tcp_tap_conn *conn, - const union inany_addr *faddr, - in_port_t eport, in_port_t fport) -{ - const struct flowside *tapside = TAPFLOW(conn); - - if (inany_equals(&tapside->faddr, faddr) && - tapside->eport == eport && tapside->fport == fport) - return 1; - - return 0; -} - -/** - * tcp_hash() - Calculate hash value for connection given address and ports - * @c: Execution context - * @faddr: Guest side forwarding address - * @eport: Guest side endpoint port - * @fport: Guest side forwarding port - * - * Return: hash value, needs to be adjusted for table size - */ -static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, - in_port_t eport, in_port_t fport) -{ - struct siphash_state state = SIPHASH_INIT(c->hash_secret); - - inany_siphash_feed(&state, faddr); - return siphash_final(&state, 20, (uint64_t)eport << 16 | fport); -} - /** * tcp_conn_hash() - Calculate hash bucket of an existing connection * @c: Execution context @@ -866,8 +826,7 @@ static uint64_t tcp_conn_hash(const struct ctx *c, { const struct flowside *tapside = TAPFLOW(conn); - return tcp_hash(c, &tapside->faddr, tapside->eport, - tapside->fport); + return flow_hash(c, IPPROTO_TCP, conn->f.pif[TAPSIDE(conn)], tapside); } /** @@ -942,25 +901,26 @@ static void tcp_hash_remove(const struct ctx *c, * tcp_hash_lookup() - Look up connection given remote address and ports * @c: Execution context * @af: Address family, AF_INET or AF_INET6 + * @eaddr: Guest side endpoint address (guest local address) * @faddr: Guest side forwarding address (guest remote address) * @eport: Guest side endpoint port (guest local port) * @fport: Guest side forwarding port (guest remote port) * * Return: connection pointer, if found, -ENOENT otherwise */ -static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, - sa_family_t af, const void *faddr, +static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, sa_family_t af, + const void *eaddr, const void *faddr, in_port_t eport, in_port_t fport) { - union inany_addr aany; + struct flowside fside; union flow *flow; unsigned b; - inany_from_af(&aany, af, faddr); + flowside_from_af(&fside, af, eaddr, eport, faddr, fport); - b = tcp_hash(c, &aany, eport, fport) % TCP_HASH_TABLE_SIZE; + b = flow_hash(c, IPPROTO_TCP, PIF_TAP, &fside) % TCP_HASH_TABLE_SIZE; while ((flow = flow_at_sidx(tc_hash[b])) && - !tcp_hash_match(&flow->tcp, &aany, eport, fport)) + !flowside_eq(&flow->f.side[TAPSIDE(flow)], &fside)) b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); return &flow->tcp; @@ -2038,7 +1998,8 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL); opts = packet_get(p, idx, sizeof(*th), optlen, NULL); - conn = tcp_hash_lookup(c, af, daddr, ntohs(th->source), ntohs(th->dest)); + conn = tcp_hash_lookup(c, af, saddr, daddr, + ntohs(th->source), ntohs(th->dest)); /* New connection from tap */ if (!conn) { -- 2.45.2
Move the data structures and helper functions for the TCP hash table to flow.c, making it a general hash table indexing sides of flows. This is largely code motion and straightforward renames. There are two semantic changes: * flow_lookup_af() now needs to verify that the entry has a matching protocol and interface as well as matching addresses and ports. * We double the size of the hash table, because it's now at least theoretically possible for both sides of each flow to be hashed. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 155 +++++++++++++++++++++++++++++++++++++++++++++++-- flow.h | 11 ++-- flow_table.h | 3 - tcp.c | 148 +++++----------------------------------------- tcp_internal.h | 1 + 5 files changed, 171 insertions(+), 147 deletions(-) diff --git a/flow.c b/flow.c index 8ed9c1ae..e7503dbd 100644 --- a/flow.c +++ b/flow.c @@ -105,6 +105,16 @@ unsigned flow_first_free; union flow flowtab[FLOW_MAX]; static const union flow *flow_new_entry; /* = NULL */ +/* Hash table to index it */ +#define FLOW_HASH_LOAD 70 /* % */ +#define FLOW_HASH_SIZE ((2 * FLOW_MAX * 100 / FLOW_HASH_LOAD)) + +/* Table for lookup from flowside information */ +static flow_sidx_t flow_hashtab[FLOW_HASH_SIZE]; + +static_assert(ARRAY_SIZE(flow_hashtab) >= 2 * FLOW_MAX, +"Safe linear probing requires hash table with more entries than the number of sides in the flow table"); + /* Last time the flow timers ran */ static struct timespec flow_timer_run; @@ -116,9 +126,9 @@ static struct timespec flow_timer_run; * @faddr: Forwarding address (pointer to in_addr or in6_addr) * @fport: Forwarding port */ -void flowside_from_af(struct flowside *fside, sa_family_t af, - const void *eaddr, in_port_t eport, - const void *faddr, in_port_t fport) +static void flowside_from_af(struct flowside *fside, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport) { if (faddr) inany_from_af(&fside->faddr, af, faddr); @@ -410,8 +420,8 @@ void flow_alloc_cancel(union flow *flow) * * Return: hash value */ -uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, - const struct flowside *fside) +static uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, + const struct flowside *fside) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); @@ -430,6 +440,136 @@ uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, (uint64_t)fside->eport); } +/** + * flow_sidx_hash() - Calculate hash value for given side of a given flow + * @c: Execution context + * @sidx: Flow & side index to get hash for + * + * Return: hash value, of the flow & side represented by @sidx + */ +static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx) +{ + const struct flow_common *f = &flow_at_sidx(sidx)->f; + return flow_hash(c, FLOW_PROTO(f), + f->pif[sidx.side], &f->side[sidx.side]); +} + +/** + * flow_hash_probe() - Find hash bucket for a flow + * @c: Execution context + * @sidx: Flow and side to find bucket for + * + * Return: If @sidx is in the hash table, its current bucket, otherwise a + * suitable free bucket for it. + */ +static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) +{ + unsigned b = flow_sidx_hash(c, sidx) % FLOW_HASH_SIZE; + + /* Linear probing */ + while (!flow_sidx_eq(flow_hashtab[b], FLOW_SIDX_NONE) && + !flow_sidx_eq(flow_hashtab[b], sidx)) + b = mod_sub(b, 1, FLOW_HASH_SIZE); + + return b; +} + +/** + * flow_hash_insert() - Insert side of a flow into into hash table + * @c: Execution context + * @sidx: Flow & side index + */ +void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx) +{ + unsigned b = flow_hash_probe(c, sidx); + + flow_hashtab[b] = sidx; + flow_dbg(flow_at_sidx(sidx), "Side %u hash table insert: bucket: %u", + sidx.side, b); +} + +/** + * flow_hash_remove() - Drop side of a flow from the hash table + * @c: Execution context + * @sidx: Side of flow to remove + */ +void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx) +{ + unsigned b = flow_hash_probe(c, sidx), s; + + if (flow_sidx_eq(flow_hashtab[b], FLOW_SIDX_NONE)) + return; /* Redundant remove */ + + flow_dbg(flow_at_sidx(sidx), "Side %u hash table remove: bucket: %u", + sidx.side, b); + + /* Scan the remainder of the cluster */ + for (s = mod_sub(b, 1, FLOW_HASH_SIZE); + !flow_sidx_eq(flow_hashtab[s], FLOW_SIDX_NONE); + s = mod_sub(s, 1, FLOW_HASH_SIZE)) { + unsigned h = flow_sidx_hash(c, flow_hashtab[s]) % FLOW_HASH_SIZE; + + if (!mod_between(h, s, b, FLOW_HASH_SIZE)) { + /* flow_hashtab[s] can live in flow_hashtab[b]'s slot */ + debug("hash table remove: shuffle %u -> %u", s, b); + flow_hashtab[b] = flow_hashtab[s]; + b = s; + } + } + + flow_hashtab[b] = FLOW_SIDX_NONE; +} + +/** + * flowside_lookup() - Look for a matching flowside in the flow table + * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: pif to look for in the table + * @fside: Flowside to look for in the table + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +static flow_sidx_t flowside_lookup(const struct ctx *c, uint8_t proto, + uint8_t pif, const struct flowside *fside) +{ + flow_sidx_t sidx; + union flow *flow; + unsigned b; + + b = flow_hash(c, proto, pif, fside) % FLOW_HASH_SIZE; + while ((sidx = flow_hashtab[b], flow = flow_at_sidx(sidx)) && + !(FLOW_PROTO(&flow->f) == proto && + flow->f.pif[sidx.side] == pif && + flowside_eq(&flow->f.side[sidx.side], fside))) + b = (b + 1) % FLOW_HASH_SIZE; + + return flow_hashtab[b]; +} + +/** + * flow_lookup_af() - Look up a flow given addressing information + * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @af: Address family, AF_INET or AF_INET6 + * @eaddr: Guest side endpoint address (guest local address) + * @faddr: Guest side forwarding address (guest remote address) + * @eport: Guest side endpoint port (guest local port) + * @fport: Guest side forwarding port (guest remote port) + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_af(const struct ctx *c, + uint8_t proto, uint8_t pif, sa_family_t af, + const void *eaddr, const void *faddr, + in_port_t eport, in_port_t fport) +{ + struct flowside fside; + + flowside_from_af(&fside, af, eaddr, eport, faddr, fport); + return flowside_lookup(c, proto, pif, &fside); +} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context @@ -543,7 +683,12 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now) */ void flow_init(void) { + unsigned b; + /* Initial state is a single free cluster containing the whole table */ flowtab[0].free.n = FLOW_MAX; flowtab[0].free.next = FLOW_MAX; + + for (b = 0; b < FLOW_HASH_SIZE; b++) + flow_hashtab[b] = FLOW_SIDX_NONE; } diff --git a/flow.h b/flow.h index 6c9eaf60..c29d0312 100644 --- a/flow.h +++ b/flow.h @@ -164,10 +164,6 @@ static inline bool flowside_eq(const struct flowside *left, left->fport == right->fport; } -void flowside_from_af(struct flowside *fside, sa_family_t af, - const void *eaddr, in_port_t eport, - const void *faddr, in_port_t fport); - /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry @@ -222,6 +218,13 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b) return (a.flow == b.flow) && (a.side == b.side); } +void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx); +void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx); +flow_sidx_t flow_lookup_af(const struct ctx *c, + uint8_t proto, uint8_t pif, sa_family_t af, + const void *eaddr, const void *faddr, + in_port_t eport, in_port_t fport); + union flow; void flow_init(void); diff --git a/flow_table.h b/flow_table.h index 34f4e6cc..7af32c6a 100644 --- a/flow_table.h +++ b/flow_table.h @@ -126,7 +126,4 @@ void flow_activate(struct flow_common *f); #define FLOW_ACTIVATE(flow_) \ (flow_activate(&(flow_)->f)) -uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif, - const struct flowside *fside); - #endif /* FLOW_TABLE_H */ diff --git a/tcp.c b/tcp.c index 531cba2d..e92f6850 100644 --- a/tcp.c +++ b/tcp.c @@ -305,9 +305,6 @@ #include "tcp_internal.h" #include "tcp_buf.h" -#define TCP_HASH_TABLE_LOAD 70 /* % */ -#define TCP_HASH_TABLE_SIZE (FLOW_MAX * 100 / TCP_HASH_TABLE_LOAD) - /* MSS rounding: see SET_MSS() */ #define MSS_DEFAULT 536 #define WINDOW_DEFAULT 14600 /* RFC 6928 */ @@ -376,12 +373,6 @@ static struct iovec tcp_iov [UIO_MAXIOV]; #define CONN(idx) (&(FLOW(idx)->tcp)) -/* Table for lookup from flowside information */ -static flow_sidx_t tc_hash[TCP_HASH_TABLE_SIZE]; - -static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX, - "Safe linear probing requires hash table larger than connection table"); - /* Pools for pre-opened sockets (in init) */ int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; @@ -567,9 +558,6 @@ void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn, tcp_timer_ctl(c, conn); } -static void tcp_hash_remove(const struct ctx *c, - const struct tcp_tap_conn *conn); - /** * conn_event_do() - Set and log connection events, update epoll state * @c: Execution context @@ -615,7 +603,7 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, num == -1 ? "CLOSED" : tcp_event_str[num]); if (event == CLOSED) - tcp_hash_remove(c, conn); + flow_hash_remove(c, TAP_SIDX(conn)); else if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD)) conn_flag(c, conn, ACTIVE_CLOSE); else @@ -814,118 +802,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find, return -1; } -/** - * tcp_conn_hash() - Calculate hash bucket of an existing connection - * @c: Execution context - * @conn: Connection - * - * Return: hash value, needs to be adjusted for table size - */ -static uint64_t tcp_conn_hash(const struct ctx *c, - const struct tcp_tap_conn *conn) -{ - const struct flowside *tapside = TAPFLOW(conn); - - return flow_hash(c, IPPROTO_TCP, conn->f.pif[TAPSIDE(conn)], tapside); -} - -/** - * tcp_hash_probe() - Find hash bucket for a connection - * @c: Execution context - * @conn: Connection to find bucket for - * - * Return: If @conn is in the table, its current bucket, otherwise a suitable - * free bucket for it. - */ -static inline unsigned tcp_hash_probe(const struct ctx *c, - const struct tcp_tap_conn *conn) -{ - unsigned b = tcp_conn_hash(c, conn) % TCP_HASH_TABLE_SIZE; - flow_sidx_t sidx = FLOW_SIDX(conn, TAPSIDE(conn)); - - /* Linear probing */ - while (!flow_sidx_eq(tc_hash[b], FLOW_SIDX_NONE) && - !flow_sidx_eq(tc_hash[b], sidx)) - b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); - - return b; -} - -/** - * tcp_hash_insert() - Insert connection into hash table, chain link - * @c: Execution context - * @conn: Connection pointer - */ -static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn) -{ - unsigned b = tcp_hash_probe(c, conn); - - tc_hash[b] = FLOW_SIDX(conn, TAPSIDE(conn)); - flow_dbg(conn, "hash table insert: sock %i, bucket: %u", conn->sock, b); -} - -/** - * tcp_hash_remove() - Drop connection from hash table, chain unlink - * @c: Execution context - * @conn: Connection pointer - */ -static void tcp_hash_remove(const struct ctx *c, - const struct tcp_tap_conn *conn) -{ - unsigned b = tcp_hash_probe(c, conn), s; - union flow *flow = flow_at_sidx(tc_hash[b]); - - if (!flow) - return; /* Redundant remove */ - - flow_dbg(conn, "hash table remove: sock %i, bucket: %u", conn->sock, b); - - /* Scan the remainder of the cluster */ - for (s = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); - (flow = flow_at_sidx(tc_hash[s])); - s = mod_sub(s, 1, TCP_HASH_TABLE_SIZE)) { - unsigned h = tcp_conn_hash(c, &flow->tcp) % TCP_HASH_TABLE_SIZE; - - if (!mod_between(h, s, b, TCP_HASH_TABLE_SIZE)) { - /* tc_hash[s] can live in tc_hash[b]'s slot */ - debug("hash table remove: shuffle %u -> %u", s, b); - tc_hash[b] = tc_hash[s]; - b = s; - } - } - - tc_hash[b] = FLOW_SIDX_NONE; -} - -/** - * tcp_hash_lookup() - Look up connection given remote address and ports - * @c: Execution context - * @af: Address family, AF_INET or AF_INET6 - * @eaddr: Guest side endpoint address (guest local address) - * @faddr: Guest side forwarding address (guest remote address) - * @eport: Guest side endpoint port (guest local port) - * @fport: Guest side forwarding port (guest remote port) - * - * Return: connection pointer, if found, -ENOENT otherwise - */ -static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, sa_family_t af, - const void *eaddr, const void *faddr, - in_port_t eport, in_port_t fport) -{ - struct flowside fside; - union flow *flow; - unsigned b; - - flowside_from_af(&fside, af, eaddr, eport, faddr, fport); - - b = flow_hash(c, IPPROTO_TCP, PIF_TAP, &fside) % TCP_HASH_TABLE_SIZE; - while ((flow = flow_at_sidx(tc_hash[b])) && - !flowside_eq(&flow->f.side[TAPSIDE(flow)], &fside)) - b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); - - return &flow->tcp; -} - /** * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections) * @conn: Connection to handle @@ -1645,7 +1521,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, tcp_seq_init(c, conn, now); conn->seq_ack_from_tap = conn->seq_to_tap; - tcp_hash_insert(c, conn); + flow_hash_insert(c, TAP_SIDX(conn)); sockaddr_from_inany(&sa, &sl, &tgt->eaddr, tgt->eport, c->ifi6); @@ -1983,6 +1859,8 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, const struct tcphdr *th; size_t optlen, len; const char *opts; + union flow *flow; + flow_sidx_t sidx; int ack_due = 0; int count; @@ -1998,17 +1876,22 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL); opts = packet_get(p, idx, sizeof(*th), optlen, NULL); - conn = tcp_hash_lookup(c, af, saddr, daddr, - ntohs(th->source), ntohs(th->dest)); + sidx = flow_lookup_af(c, IPPROTO_TCP, PIF_TAP, af, saddr, daddr, + ntohs(th->source), ntohs(th->dest)); + flow = flow_at_sidx(sidx); /* New connection from tap */ - if (!conn) { + if (!flow) { if (opts && th->syn && !th->ack) tcp_conn_from_tap(c, af, saddr, daddr, th, opts, optlen, now); return 1; } + ASSERT(flow->f.type == FLOW_TCP); + ASSERT(flow->f.pif[sidx.side] == PIF_TAP); + conn = &flow->tcp; + flow_trace(conn, "packet length %zu from tap", len); if (th->rst) { @@ -2184,7 +2067,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn_event(c, conn, SOCK_ACCEPTED); tcp_seq_init(c, conn, now); - tcp_hash_insert(c, conn); + flow_hash_insert(c, TAP_SIDX(conn)); conn->seq_ack_from_tap = conn->seq_to_tap; @@ -2576,11 +2459,6 @@ static void tcp_sock_refill_init(const struct ctx *c) */ int tcp_init(struct ctx *c) { - unsigned b; - - for (b = 0; b < TCP_HASH_TABLE_SIZE; b++) - tc_hash[b] = FLOW_SIDX_NONE; - if (c->ifi4) tcp_sock4_iov_init(c); diff --git a/tcp_internal.h b/tcp_internal.h index ac6d4b21..8b60aabc 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -42,6 +42,7 @@ #define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) #define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)])) +#define TAP_SIDX(conn_) (FLOW_SIDX((conn_), TAPSIDE(conn_))) #define CONN_V4(conn) (!!inany_v4(&TAPFLOW(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) -- 2.45.2
We generate TCP initial sequence numbers, when we need them, from a hash of the source and destination addresses and ports, plus a timestamp. Moments later, we generate another hash of the same information plus some more to insert the connection into the flow hash table. With some tweaks to the flow_hash_insert() interface and changing the order we can re-use that hash table hash for the initial sequence number, rather than calculating another one. It won't generate identical results, but that doesn't matter as long as the sequence numbers are well scattered. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 30 ++++++++++++++++++++++++------ flow.h | 2 +- tcp.c | 33 +++++++++++---------------------- 3 files changed, 36 insertions(+), 29 deletions(-) diff --git a/flow.c b/flow.c index e7503dbd..c94351d7 100644 --- a/flow.c +++ b/flow.c @@ -455,16 +455,16 @@ static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx) } /** - * flow_hash_probe() - Find hash bucket for a flow - * @c: Execution context + * flow_hash_probe_() - Find hash bucket for a flow, given hash + * @hash: Raw hash value for flow & side * @sidx: Flow and side to find bucket for * * Return: If @sidx is in the hash table, its current bucket, otherwise a * suitable free bucket for it. */ -static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) +static inline unsigned flow_hash_probe_(uint64_t hash, flow_sidx_t sidx) { - unsigned b = flow_sidx_hash(c, sidx) % FLOW_HASH_SIZE; + unsigned b = hash % FLOW_HASH_SIZE; /* Linear probing */ while (!flow_sidx_eq(flow_hashtab[b], FLOW_SIDX_NONE) && @@ -474,18 +474,36 @@ static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) return b; } +/** + * flow_hash_probe() - Find hash bucket for a flow + * @c: Execution context + * @sidx: Flow and side to find bucket for + * + * Return: If @sidx is in the hash table, its current bucket, otherwise a + * suitable free bucket for it. + */ +static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) +{ + return flow_hash_probe_(flow_sidx_hash(c, sidx), sidx); +} + /** * flow_hash_insert() - Insert side of a flow into into hash table * @c: Execution context * @sidx: Flow & side index + * + * Return: raw (un-modded) hash value of side of flow */ -void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx) +uint64_t flow_hash_insert(const struct ctx *c, flow_sidx_t sidx) { - unsigned b = flow_hash_probe(c, sidx); + uint64_t hash = flow_sidx_hash(c, sidx); + unsigned b = flow_hash_probe_(hash, sidx); flow_hashtab[b] = sidx; flow_dbg(flow_at_sidx(sidx), "Side %u hash table insert: bucket: %u", sidx.side, b); + + return hash; } /** diff --git a/flow.h b/flow.h index c29d0312..90389a5e 100644 --- a/flow.h +++ b/flow.h @@ -218,7 +218,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b) return (a.flow == b.flow) && (a.side == b.side); } -void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx); +uint64_t flow_hash_insert(const struct ctx *c, flow_sidx_t sidx); void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx); flow_sidx_t flow_lookup_af(const struct ctx *c, uint8_t proto, uint8_t pif, sa_family_t af, diff --git a/tcp.c b/tcp.c index e92f6850..a82d1dc6 100644 --- a/tcp.c +++ b/tcp.c @@ -1248,28 +1248,16 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) } /** - * tcp_seq_init() - Calculate initial sequence number according to RFC 6528 - * @c: Execution context - * @conn: TCP connection, with faddr, fport and eport populated + * tcp_init_seq() - Calculate initial sequence number according to RFC 6528 + * @hash: Hash of connection details * @now: Current timestamp */ -static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, - const struct timespec *now) +static uint32_t tcp_init_seq(uint64_t hash, const struct timespec *now) { - struct siphash_state state = SIPHASH_INIT(c->hash_secret); - const struct flowside *tapside = TAPFLOW(conn); - uint64_t hash; - uint32_t ns; - - inany_siphash_feed(&state, &tapside->faddr); - inany_siphash_feed(&state, &tapside->eaddr); - hash = siphash_final(&state, 36, - (uint64_t)tapside->fport << 16 | tapside->eport); - /* 32ns ticks, overflows 32 bits every 137s */ - ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; + uint32_t ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; - conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns; + return ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns; } /** @@ -1443,6 +1431,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, union sockaddr_inany sa; union flow *flow; int s = -1, mss; + uint64_t hash; socklen_t sl; if (!(flow = flow_alloc())) @@ -1518,11 +1507,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; - tcp_seq_init(c, conn, now); + hash = flow_hash_insert(c, TAP_SIDX(conn)); + conn->seq_to_tap = tcp_init_seq(hash, now); conn->seq_ack_from_tap = conn->seq_to_tap; - flow_hash_insert(c, TAP_SIDX(conn)); - sockaddr_from_inany(&sa, &sl, &tgt->eaddr, tgt->eport, c->ifi6); if (!bind(s, &sa.sa, sl)) { @@ -2043,6 +2031,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */ struct tcp_tap_conn *conn; in_port_t srcport; + uint64_t hash; inany_from_sockaddr(&saddr, &srcport, sa); tcp_snat_inbound(c, &saddr); @@ -2066,8 +2055,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - tcp_seq_init(c, conn, now); - flow_hash_insert(c, TAP_SIDX(conn)); + hash = flow_hash_insert(c, TAP_SIDX(conn)); + conn->seq_to_tap = tcp_init_seq(hash, now); conn->seq_ack_from_tap = conn->seq_to_tap; -- 2.45.2
struct icmp_ping_flow contains a field for the ICMP id of the ping, but this is now redundant, since the id is also stored as the "port" in the common flowsides. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- icmp.c | 10 +++++----- icmp_flow.h | 2 -- 2 files changed, 5 insertions(+), 7 deletions(-) diff --git a/icmp.c b/icmp.c index c86c54c3..deb13c96 100644 --- a/icmp.c +++ b/icmp.c @@ -58,6 +58,7 @@ static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS]; void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) { struct icmp_ping_flow *pingf = PINGF(ref.flowside.flow); + const struct flowside *ini = &pingf->f.side[INISIDE]; union sockaddr_inany sr; socklen_t sl = sizeof(sr); char buf[USHRT_MAX]; @@ -83,7 +84,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) goto unexpected; /* Adjust packet back to guest-side ID */ - ih4->un.echo.id = htons(pingf->id); + ih4->un.echo.id = htons(ini->eport); seq = ntohs(ih4->un.echo.sequence); } else if (pingf->f.type == FLOW_PING6) { struct icmp6hdr *ih6 = (struct icmp6hdr *)buf; @@ -93,7 +94,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) goto unexpected; /* Adjust packet back to guest-side ID */ - ih6->icmp6_identifier = htons(pingf->id); + ih6->icmp6_identifier = htons(ini->eport); seq = ntohs(ih6->icmp6_sequence); } else { ASSERT(0); @@ -108,7 +109,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) } flow_dbg(pingf, "echo reply to tap, ID: %"PRIu16", seq: %"PRIu16, - pingf->id, seq); + ini->eport, seq); if (pingf->f.type == FLOW_PING4) tap_icmp4_send(c, sr.sa4.sin_addr, tap_ip4_daddr(c), buf, n); @@ -129,7 +130,7 @@ unexpected: static void icmp_ping_close(const struct ctx *c, const struct icmp_ping_flow *pingf) { - uint16_t id = pingf->id; + uint16_t id = pingf->f.side[INISIDE].eport; epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL); close(pingf->sock); @@ -172,7 +173,6 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, pingf = FLOW_SET_TYPE(flow, flowtype, ping); pingf->seq = -1; - pingf->id = id; if (af == AF_INET) { bind_addr = &c->ip4.addr_out; diff --git a/icmp_flow.h b/icmp_flow.h index c9847eae..fb93801d 100644 --- a/icmp_flow.h +++ b/icmp_flow.h @@ -13,7 +13,6 @@ * @seq: Last sequence number sent to tap, host order, -1: not sent yet * @sock: "ping" socket * @ts: Last associated activity from tap, seconds - * @id: ICMP id for the flow as seen by the guest */ struct icmp_ping_flow { /* Must be first element */ @@ -22,7 +21,6 @@ struct icmp_ping_flow { int seq; int sock; time_t ts; - uint16_t id; }; bool icmp_ping_timer(const struct ctx *c, const struct icmp_ping_flow *pingf, -- 2.45.2
icmp_sock_handler() obtains the guest address from it's most recently observed IP. However, this can now be obtained from the common flowside information. icmp_tap_handler() builds its socket address for sendto() directly from the destination address supplied by the incoming tap packet. This can instead be generated from the flow. Using the flowsides as the common source of truth here prepares us for allowing more flexible NAT and forwarding by properly initialising that flowside information. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- icmp.c | 27 +++++++++++++++++---------- tap.c | 11 ----------- tap.h | 1 - 3 files changed, 17 insertions(+), 22 deletions(-) diff --git a/icmp.c b/icmp.c index deb13c96..88813fbc 100644 --- a/icmp.c +++ b/icmp.c @@ -111,11 +111,18 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) flow_dbg(pingf, "echo reply to tap, ID: %"PRIu16", seq: %"PRIu16, ini->eport, seq); - if (pingf->f.type == FLOW_PING4) - tap_icmp4_send(c, sr.sa4.sin_addr, tap_ip4_daddr(c), buf, n); - else if (pingf->f.type == FLOW_PING6) - tap_icmp6_send(c, &sr.sa6.sin6_addr, - tap_ip6_daddr(c, &sr.sa6.sin6_addr), buf, n); + if (pingf->f.type == FLOW_PING4) { + const struct in_addr *saddr = inany_v4(&ini->faddr); + const struct in_addr *daddr = inany_v4(&ini->eaddr); + + ASSERT(saddr && daddr); /* Must have IPv4 addresses */ + tap_icmp4_send(c, *saddr, *daddr, buf, n); + } else if (pingf->f.type == FLOW_PING6) { + const struct in6_addr *saddr = &ini->faddr.a6; + const struct in6_addr *daddr = &ini->eaddr.a6; + + tap_icmp6_send(c, saddr, daddr, buf, n); + } return; unexpected: @@ -225,11 +232,12 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, const struct timespec *now) { - union sockaddr_inany sa = { .sa_family = af }; - const socklen_t sl = af == AF_INET ? sizeof(sa.sa4) : sizeof(sa.sa6); struct icmp_ping_flow *pingf, **id_sock; + const struct flowside *tgt; + union sockaddr_inany sa; size_t dlen, l4len; uint16_t id, seq; + socklen_t sl; void *pkt; (void)saddr; @@ -250,7 +258,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, id = ntohs(ih->un.echo.id); id_sock = &icmp_id_map[V4][id]; seq = ntohs(ih->un.echo.sequence); - sa.sa4.sin_addr = *(struct in_addr *)daddr; } else if (af == AF_INET6) { const struct icmp6hdr *ih; @@ -266,8 +273,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, id = ntohs(ih->icmp6_identifier); id_sock = &icmp_id_map[V6][id]; seq = ntohs(ih->icmp6_sequence); - sa.sa6.sin6_addr = *(struct in6_addr *)daddr; - sa.sa6.sin6_scope_id = c->ifi6; } else { ASSERT(0); } @@ -276,8 +281,10 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) return 1; + tgt = &pingf->f.side[TGTSIDE]; pingf->ts = now->tv_sec; + sockaddr_from_inany(&sa, &sl, &tgt->eaddr, 0, c->ifi6); if (sendto(pingf->sock, pkt, l4len, MSG_NOSIGNAL, &sa.sa, sl) < 0) { flow_dbg(pingf, "failed to relay request to socket: %s", strerror(errno)); diff --git a/tap.c b/tap.c index c9aeff19..794e15a0 100644 --- a/tap.c +++ b/tap.c @@ -90,17 +90,6 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len) tap_send_frames(c, iov, iovcnt, 1); } -/** - * tap_ip4_daddr() - Normal IPv4 destination address for inbound packets - * @c: Execution context - * - * Return: IPv4 address - */ -struct in_addr tap_ip4_daddr(const struct ctx *c) -{ - return c->ip4.addr_seen; -} - /** * tap_ip6_daddr() - Normal IPv6 destination address for inbound packets * @c: Execution context diff --git a/tap.h b/tap.h index d496bd0e..ec9e2ace 100644 --- a/tap.h +++ b/tap.h @@ -43,7 +43,6 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len) thdr->vnet_len = htonl(l2len); } -struct in_addr tap_ip4_daddr(const struct ctx *c); void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport, struct in_addr dst, in_port_t dport, const void *in, size_t dlen); -- 2.45.2
When we receive a ping packet from the tap interface, we currently locate the correct flow entry (if present) using an anciliary data structure, the icmp_id_map[] tables. However, we can look this up using the flow hash table - that's what it's for. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- icmp.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/icmp.c b/icmp.c index 88813fbc..d11de1aa 100644 --- a/icmp.c +++ b/icmp.c @@ -141,6 +141,7 @@ static void icmp_ping_close(const struct ctx *c, epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL); close(pingf->sock); + flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE)); if (pingf->f.type == FLOW_PING4) icmp_id_map[V4][id] = NULL; @@ -205,6 +206,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, flow_dbg(pingf, "new socket %i for echo ID %"PRIu16, pingf->sock, id); + flow_hash_insert(c, FLOW_SIDX(pingf, INISIDE)); *id_sock = pingf; FLOW_ACTIVATE(pingf); @@ -237,6 +239,8 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, union sockaddr_inany sa; size_t dlen, l4len; uint16_t id, seq; + union flow *flow; + uint8_t proto; socklen_t sl; void *pkt; @@ -255,6 +259,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (ih->type != ICMP_ECHO) return 1; + proto = IPPROTO_ICMP; id = ntohs(ih->un.echo.id); id_sock = &icmp_id_map[V4][id]; seq = ntohs(ih->un.echo.sequence); @@ -270,6 +275,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (ih->icmp6_type != ICMPV6_ECHO_REQUEST) return 1; + proto = IPPROTO_ICMPV6; id = ntohs(ih->icmp6_identifier); id_sock = &icmp_id_map[V6][id]; seq = ntohs(ih->icmp6_sequence); @@ -277,11 +283,17 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, ASSERT(0); } - if (!(pingf = *id_sock)) - if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) - return 1; + flow = flow_at_sidx(flow_lookup_af(c, proto, PIF_TAP, + af, saddr, daddr, id, id)); + + if (flow) + pingf = &flow->ping; + else if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) + return 1; tgt = &pingf->f.side[TGTSIDE]; + + ASSERT(flow_proto[pingf->f.type] == proto); pingf->ts = now->tv_sec; sockaddr_from_inany(&sa, &sl, &tgt->eaddr, 0, c->ifi6); -- 2.45.2
With previous reworks the icmp_id_map data structure is now maintained, but never used for anything. Eliminate it. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- icmp.c | 19 ++----------------- 1 file changed, 2 insertions(+), 17 deletions(-) diff --git a/icmp.c b/icmp.c index d11de1aa..b297b9ac 100644 --- a/icmp.c +++ b/icmp.c @@ -47,9 +47,6 @@ #define PINGF(idx) (&(FLOW(idx)->ping)) -/* Indexed by ICMP echo identifier */ -static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS]; - /** * icmp_sock_handler() - Handle new data from ICMP or ICMPv6 socket * @c: Execution context @@ -137,22 +134,14 @@ unexpected: static void icmp_ping_close(const struct ctx *c, const struct icmp_ping_flow *pingf) { - uint16_t id = pingf->f.side[INISIDE].eport; - epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL); close(pingf->sock); flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE)); - - if (pingf->f.type == FLOW_PING4) - icmp_id_map[V4][id] = NULL; - else - icmp_id_map[V6][id] = NULL; } /** * icmp_ping_new() - Prepare a new ping socket for a new id * @c: Execution context - * @id_sock: Pointer to ping flow entry slot in icmp_id_map[] to update * @af: Address family, AF_INET or AF_INET6 * @id: ICMP id for the new socket * @saddr: Source address @@ -161,7 +150,6 @@ static void icmp_ping_close(const struct ctx *c, * Return: Newly opened ping flow, or NULL on failure */ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, - struct icmp_ping_flow **id_sock, sa_family_t af, uint16_t id, const void *saddr, const void *daddr) { @@ -207,7 +195,6 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, flow_dbg(pingf, "new socket %i for echo ID %"PRIu16, pingf->sock, id); flow_hash_insert(c, FLOW_SIDX(pingf, INISIDE)); - *id_sock = pingf; FLOW_ACTIVATE(pingf); @@ -234,7 +221,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, const struct timespec *now) { - struct icmp_ping_flow *pingf, **id_sock; + struct icmp_ping_flow *pingf; const struct flowside *tgt; union sockaddr_inany sa; size_t dlen, l4len; @@ -261,7 +248,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, proto = IPPROTO_ICMP; id = ntohs(ih->un.echo.id); - id_sock = &icmp_id_map[V4][id]; seq = ntohs(ih->un.echo.sequence); } else if (af == AF_INET6) { const struct icmp6hdr *ih; @@ -277,7 +263,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, proto = IPPROTO_ICMPV6; id = ntohs(ih->icmp6_identifier); - id_sock = &icmp_id_map[V6][id]; seq = ntohs(ih->icmp6_sequence); } else { ASSERT(0); @@ -288,7 +273,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (flow) pingf = &flow->ping; - else if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) + else if (!(pingf = icmp_ping_new(c, af, id, saddr, daddr))) return 1; tgt = &pingf->f.side[TGTSIDE]; -- 2.45.2
For now when we forward a ping to the host we leave the host side forwarding address and port blank since we don't necessarily know what source address and id will be used by the kernel. When the outbound address option is active, though, we do know the address at least, so we can record it in the flowside. Having done that, use it as the primary source of truth, binding the outgoing socket based on the information in there. This allows the possibility of more complex rules for what outbound address and/or id we use in future. To implement this we create a new helper which sets up a new socket based on information in a flowside, which will also have future uses. It behaves slightly differently from the existing ICMP code, in that it doesn't bind to a specific interface if given a loopback address. This is logically correct - the loopback address means we need to operate through the host's loopback interface, not ifname_out. We didn't need it in ICMP because ICMP will never generate a loopback address at this point, however we intend to change that in future. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 32 ++++++++++++++++++++++++++++++++ flow.h | 3 +++ icmp.c | 23 ++++++++++------------- util.c | 6 +++--- util.h | 3 +++ 5 files changed, 51 insertions(+), 16 deletions(-) diff --git a/flow.c b/flow.c index c94351d7..3770af37 100644 --- a/flow.c +++ b/flow.c @@ -143,6 +143,38 @@ static void flowside_from_af(struct flowside *fside, sa_family_t af, fside->eport = eport; } +/** flowside_sock_l4() - Create and bind socket based on flowside + * @c: Execution context + * @proto: Protocol number + * @pif: Interface for this socket + * @tgt: Target flowside + * @data: epoll reference portion for protocol handlers + * + * Return: socket fd of protocol @proto bound to the forwarding address and port + * from @tgt (if specified). + */ +int flowside_sock_l4(const struct ctx *c, uint8_t proto, uint8_t pif, + const struct flowside *tgt, uint32_t data) +{ + const char *ifname = NULL; + union sockaddr_inany sa; + socklen_t sl; + + ASSERT(pif == PIF_HOST); /* TODO: support other pifs */ + + sockaddr_from_inany(&sa, &sl, &tgt->faddr, tgt->fport, c->ifi6); + + if (inany_is_loopback(&tgt->faddr)) + ifname = NULL; + else if (sa.sa_family == AF_INET) + ifname = c->ip4.ifname_out; + else if (sa.sa_family == AF_INET6) + ifname = c->ip6.ifname_out; + + return sock_l4_sa(c, proto, &sa, sl, ifname, sa.sa_family == AF_INET6, + data); +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority diff --git a/flow.h b/flow.h index 90389a5e..948f2ea9 100644 --- a/flow.h +++ b/flow.h @@ -164,6 +164,9 @@ static inline bool flowside_eq(const struct flowside *left, left->fport == right->fport; } +int flowside_sock_l4(const struct ctx *c, uint8_t proto, uint8_t pif, + const struct flowside *tgt, uint32_t data); + /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry diff --git a/icmp.c b/icmp.c index b297b9ac..cb3278e9 100644 --- a/icmp.c +++ b/icmp.c @@ -157,30 +157,27 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, union epoll_ref ref = { .type = EPOLL_TYPE_PING }; union flow *flow = flow_alloc(); struct icmp_ping_flow *pingf; + const struct flowside *tgt; const void *bind_addr; - const char *bind_if; if (!flow) return NULL; flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); - /* FIXME: Record outbound source address when known */ - flow_target_af(flow, PIF_HOST, af, NULL, 0, daddr, 0); - pingf = FLOW_SET_TYPE(flow, flowtype, ping); - - pingf->seq = -1; - if (af == AF_INET) { + if (af == AF_INET) bind_addr = &c->ip4.addr_out; - bind_if = c->ip4.ifname_out; - } else { + else if (af == AF_INET6) bind_addr = &c->ip6.addr_out; - bind_if = c->ip6.ifname_out; - } + + tgt = flow_target_af(flow, PIF_HOST, af, bind_addr, 0, daddr, 0); + pingf = FLOW_SET_TYPE(flow, flowtype, ping); + + pingf->seq = -1; ref.flowside = FLOW_SIDX(flow, TGTSIDE); - pingf->sock = sock_l4(c, af, flow_proto[flowtype], bind_addr, bind_if, - 0, ref.data); + pingf->sock = flowside_sock_l4(c, flow_proto[flowtype], PIF_HOST, + tgt, ref.data); if (pingf->sock < 0) { warn("Cannot open \"ping\" socket. You might need to:"); diff --git a/util.c b/util.c index 4e3d84a1..9ba9908d 100644 --- a/util.c +++ b/util.c @@ -44,9 +44,9 @@ * * Return: newly created socket, negative error code on failure */ -static int sock_l4_sa(const struct ctx *c, uint8_t proto, - const void *sa, socklen_t sl, - const char *ifname, bool v6only, uint32_t data) +int sock_l4_sa(const struct ctx *c, uint8_t proto, + const void *sa, socklen_t sl, + const char *ifname, bool v6only, uint32_t data) { sa_family_t af = ((const struct sockaddr *)sa)->sa_family; union epoll_ref ref = { .data = data }; diff --git a/util.h b/util.h index eebb027b..bbf10778 100644 --- a/util.h +++ b/util.h @@ -143,6 +143,9 @@ struct ctx; /* cppcheck-suppress funcArgNamesDifferent */ __attribute__ ((weak)) int ffsl(long int i) { return __builtin_ffsl(i); } +int sock_l4_sa(const struct ctx *c, uint8_t proto, + const void *sa, socklen_t sl, + const char *ifname, bool v6only, uint32_t data); int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto, const void *bind_addr, const char *ifname, uint16_t port, uint32_t data); -- 2.45.2
Currently the code to translate host side addresses and ports to guest side addresses and ports, and vice versa, is scattered across the TCP code. This includes both port redirection as controlled by the -t and -T options, and our special case NAT controlled by the --no-map-gw option. Gather this logic into fwd_nat_from_*() functions for each input interface in fwd.c which take protocol and address information for the initiating side and generates the pif and address information for the forwarded side. This performs any NAT or port forwarding needed. We create a flow_target() helper which applies those forwarding functions as needed to automatically move a flow from INI to TGT state. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 53 ++++++++++++++++++++ flow_table.h | 2 + fwd.c | 139 +++++++++++++++++++++++++++++++++++++++++++++++++++ fwd.h | 9 ++++ tcp.c | 107 +++++++++++---------------------------- tcp_splice.c | 64 ++---------------------- tcp_splice.h | 5 +- 7 files changed, 237 insertions(+), 142 deletions(-) diff --git a/flow.c b/flow.c index 3770af37..dc600eca 100644 --- a/flow.c +++ b/flow.c @@ -340,6 +340,59 @@ const struct flowside *flow_target_af(union flow *flow, uint8_t pif, return tgt; } + +/** + * flow_target() - Determine where flow should forward to, and move to TGT + * @c: Execution context + * @flow: Flow to forward + * @proto: Protocol + * + * Return: pointer to the target flowside information + */ +const struct flowside *flow_target(const struct ctx *c, union flow *flow, + uint8_t proto) +{ + char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + struct flow_common *f = &flow->f; + const struct flowside *ini = &f->side[INISIDE]; + struct flowside *tgt = &f->side[TGTSIDE]; + uint8_t tgtpif = PIF_NONE; + + ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); + ASSERT(f->type == FLOW_TYPE_NONE); + ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); + ASSERT(flow->f.state == FLOW_STATE_INI); + + switch (f->pif[INISIDE]) { + case PIF_TAP: + tgtpif = fwd_nat_from_tap(c, proto, ini, tgt); + break; + + case PIF_SPLICE: + tgtpif = fwd_nat_from_splice(c, proto, ini, tgt); + break; + + case PIF_HOST: + tgtpif = fwd_nat_from_host(c, proto, ini, tgt); + break; + + default: + flow_err(flow, "No rules to forward %s [%s]:%hu -> [%s]:%hu", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport); + } + + if (tgtpif == PIF_NONE) + return NULL; + + f->pif[TGTSIDE] = tgtpif; + flow_set_state(f, FLOW_STATE_TGT); + return tgt; +} + /** * flow_set_type() - Set type and move to TYPED * @flow: Flow to change state diff --git a/flow_table.h b/flow_table.h index 7af32c6a..07c59041 100644 --- a/flow_table.h +++ b/flow_table.h @@ -118,6 +118,8 @@ const struct flowside *flow_target_af(union flow *flow, uint8_t pif, sa_family_t af, const void *saddr, in_port_t sport, const void *daddr, in_port_t dport); +const struct flowside *flow_target(const struct ctx *c, union flow *flow, + uint8_t proto); union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/fwd.c b/fwd.c index b3d5a370..7ad7f07b 100644 --- a/fwd.c +++ b/fwd.c @@ -25,6 +25,7 @@ #include "fwd.h" #include "passt.h" #include "lineread.h" +#include "flow_table.h" /* See enum in kernel's include/net/tcp_states.h */ #define UDP_LISTEN 0x07 @@ -154,3 +155,141 @@ void fwd_scan_ports_init(struct ctx *c) &c->tcp.fwd_out, &c->tcp.fwd_in); } } + +/** + * fwd_nat_from_tap() - Determine to forward a flow from the tap interface + * @c: Execution context + * @proto: Protocol (IP L4 protocol number) + * @ini: Flow address information of the initiating side + * @tgt: Flow address information on the target side (updated) + * + * Return: pif of the target interface to forward the flow to, PIF_NONE if the + * flow cannot or should not be forwarded at all. + */ +uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt) +{ + (void)proto; + + tgt->eaddr = ini->faddr; + tgt->eport = ini->fport; + + if (!c->no_map_gw) { + if (inany_equals4(&tgt->eaddr, &c->ip4.gw)) + tgt->eaddr = inany_loopback4; + else if (inany_equals6(&tgt->eaddr, &c->ip6.gw)) + tgt->eaddr = inany_loopback6; + } + + /* The relevant addr_out controls the host side source address. This + * may be unspecified, which allows the kernel to pick an address. + */ + if (inany_v4(&tgt->eaddr)) + tgt->faddr = inany_from_v4(c->ip4.addr_out); + else + tgt->faddr.a6 = c->ip6.addr_out; + + /* Let the kernel pick a host side source port */ + tgt->fport = 0; + + return PIF_HOST; +} + +/** + * fwd_nat_from_splice() - Determine to forward a flow from the splice interface + * @c: Execution context + * @proto: Protocol (IP L4 protocol number) + * @ini: Flow address information of the initiating side + * @tgt: Flow address information on the target side (updated) + * + * Return: pif of the target interface to forward the flow to, PIF_NONE if the + * flow cannot or should not be forwarded at all. + */ +uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt) +{ + if (!inany_is_loopback(&ini->eaddr) || + (!inany_is_loopback(&ini->faddr) && !inany_is_unspecified(&ini->faddr))) { + char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + + debug("Non loopback address on %s: [%s]:%hu -> [%s]:%hu", + pif_name(PIF_SPLICE), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), ini->fport); + return PIF_NONE; + } + + if (inany_v4(&ini->eaddr)) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_loopback6; + + tgt->eport = ini->fport; + if (proto == IPPROTO_TCP) + tgt->eport += c->tcp.fwd_out.delta[tgt->eport]; + + /* Let the kernel pick a host side source port */ + tgt->fport = 0; + + return PIF_HOST; +} + +/** + * fwd_nat_from_host() - Determine to forward a flow from the host interface + * @c: Execution context + * @proto: Protocol (IP L4 protocol number) + * @ini: Flow address information of the initiating side + * @tgt: Flow address information on the target side (updated) + * + * Return: pif of the target interface to forward the flow to, PIF_NONE if the + * flow cannot or should not be forwarded at all. + */ +uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt) +{ + /* Common for spliced and non-spliced cases */ + tgt->eport = ini->fport; + if (proto == IPPROTO_TCP) + tgt->eport += c->tcp.fwd_in.delta[tgt->eport]; + + if (c->mode == MODE_PASTA && inany_is_loopback(&ini->eaddr) && + proto == IPPROTO_TCP) { + /* spliceable */ + tgt->faddr = ini->eaddr; + /* Let the kernel pick a namespace side source port */ + tgt->fport = 0; + + if (inany_v4(&ini->eaddr)) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_loopback6; + return PIF_SPLICE; + } + + tgt->faddr = ini->eaddr; + tgt->fport = ini->eport; + + if (inany_is_loopback4(&tgt->faddr) || + inany_is_unspecified4(&tgt->faddr) || + inany_equals4(&tgt->faddr, &c->ip4.addr_seen)) { + tgt->faddr = inany_from_v4(c->ip4.gw); + } else if (inany_is_loopback6(&tgt->faddr) || + inany_equals6(&tgt->faddr, &c->ip6.addr_seen) || + inany_equals6(&tgt->faddr, &c->ip6.addr)) { + if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw)) + tgt->faddr.a6 = c->ip6.gw; + else + tgt->faddr.a6 = c->ip6.addr_ll; + } + + if (inany_v4(&tgt->faddr)) { + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + } else { + if (inany_is_linklocal6(&tgt->faddr)) + tgt->eaddr.a6 = c->ip6.addr_ll_seen; + else + tgt->eaddr.a6 = c->ip6.addr_seen; + } + + return PIF_TAP; +} diff --git a/fwd.h b/fwd.h index 41645d7f..b4aa8d57 100644 --- a/fwd.h +++ b/fwd.h @@ -7,6 +7,8 @@ #ifndef FWD_H #define FWD_H +struct flowside; + /* Number of ports for both TCP and UDP */ #define NUM_PORTS (1U << 16) @@ -42,4 +44,11 @@ void fwd_scan_ports_udp(struct fwd_ports *fwd, const struct fwd_ports *rev, const struct fwd_ports *tcp_rev); void fwd_scan_ports_init(struct ctx *c); +uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt); +uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt); +uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, + const struct flowside *ini, struct flowside *tgt); + #endif /* FWD_H */ diff --git a/tcp.c b/tcp.c index a82d1dc6..1dd4e4c1 100644 --- a/tcp.c +++ b/tcp.c @@ -1425,7 +1425,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); - union inany_addr srcaddr, dstaddr; /* FIXME: Avoid bulky temporaries */ const struct flowside *ini, *tgt; struct tcp_tap_conn *conn; union sockaddr_inany sa; @@ -1450,38 +1449,18 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, goto cancel; } - if ((s = tcp_conn_sock(c, af)) < 0) + if (!(tgt = flow_target(c, flow, IPPROTO_TCP))) goto cancel; - dstaddr = ini->faddr; - - if (!c->no_map_gw) { - if (inany_equals4(&dstaddr, &c->ip4.gw)) - dstaddr = inany_loopback4; - else if (inany_equals6(&dstaddr, &c->ip6.gw)) - dstaddr = inany_loopback6; - } - - if (inany_is_linklocal6(&dstaddr)) { - srcaddr.a6 = c->ip6.addr_ll; - } else if (inany_is_loopback(&dstaddr)) { - srcaddr = dstaddr; - } else if (inany_v4(&dstaddr)) { - if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.addr_out)) - srcaddr = inany_from_v4(c->ip4.addr_out); - else - srcaddr = inany_any4; - } else { - if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr_out)) - srcaddr.a6 = c->ip6.addr_out; - else - srcaddr = inany_any6; + if (flow->f.pif[TGTSIDE] != PIF_HOST) { + flow_err(flow, "No support for forwarding TCP from %s to %s", + pif_name(flow->f.pif[INISIDE]), + pif_name(flow->f.pif[TGTSIDE])); + goto cancel; } - /* FIXME: Record outbound source address when known */ - tgt = flow_target_af(flow, PIF_HOST, AF_INET6, - &srcaddr, 0, /* Kernel decides source port */ - &dstaddr, dstport); + if ((s = tcp_conn_sock(c, af)) < 0) + goto cancel; conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); conn->sock = s; @@ -1993,63 +1972,20 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn) conn_flag(c, conn, ACK_FROM_TAP_DUE); } -/** - * tcp_snat_inbound() - Translate source address for inbound data if needed - * @c: Execution context - * @addr: Source address of inbound packet/connection - */ -static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) -{ - if (inany_is_loopback4(addr) || - inany_is_unspecified4(addr) || - inany_equals4(addr, &c->ip4.addr_seen)) { - *addr = inany_from_v4(c->ip4.gw); - } else if (inany_is_loopback6(addr) || - inany_equals6(addr, &c->ip6.addr_seen) || - inany_equals6(addr, &c->ip6.addr)) { - if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw)) - addr->a6 = c->ip6.gw; - else - addr->a6 = c->ip6.addr_ll; - } -} - /** * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context - * @dstport: Destination port for connection (host side) * @flow: flow to initialise * @s: Accepted socket * @sa: Peer socket address (from accept()) * @now: Current timestamp */ -static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, - union flow *flow, int s, - const union sockaddr_inany *sa, +static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s, const struct timespec *now) { - union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */ - struct tcp_tap_conn *conn; - in_port_t srcport; + struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); uint64_t hash; - inany_from_sockaddr(&saddr, &srcport, sa); - tcp_snat_inbound(c, &saddr); - - if (inany_v4(&saddr)) { - daddr = inany_from_v4(c->ip4.addr_seen); - } else { - if (inany_is_linklocal6(&saddr)) - daddr.a6 = c->ip6.addr_ll_seen; - else - daddr.a6 = c->ip6.addr_seen; - } - dstport += c->tcp.fwd_in.delta[dstport]; - - flow_target_af(flow, PIF_TAP, AF_INET6, - &saddr, srcport, &daddr, dstport); - conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); - conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; @@ -2105,11 +2041,26 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, goto cancel; } - if (tcp_splice_conn_from_sock(c, ref.tcp_listen.pif, - ref.tcp_listen.port, flow, s, &sa)) - return; + if (!flow_target(c, flow, IPPROTO_TCP)) + goto cancel; + + switch (flow->f.pif[TGTSIDE]) { + case PIF_SPLICE: + case PIF_HOST: + tcp_splice_conn_from_sock(c, flow, s); + break; + + case PIF_TAP: + tcp_tap_conn_from_sock(c, flow, s, now); + break; + + default: + flow_err(flow, "No support for forwarding TCP from %s to %s", + pif_name(flow->f.pif[INISIDE]), + pif_name(flow->f.pif[TGTSIDE])); + goto cancel; + } - tcp_tap_conn_from_sock(c, ref.tcp_listen.port, flow, s, &sa, now); return; cancel: diff --git a/tcp_splice.c b/tcp_splice.c index d829d753..a04a51fd 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -394,72 +394,18 @@ static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af) /** * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection * @c: Execution context - * @pif0: pif id of side 0 - * @dstport: Side 0 destination port of connection * @flow: flow to initialise * @s0: Accepted (side 0) socket * @sa: Peer address of connection * - * Return: true if able to create a spliced connection, false otherwise * #syscalls:pasta setsockopt */ -bool tcp_splice_conn_from_sock(const struct ctx *c, - uint8_t pif0, in_port_t dstport, - union flow *flow, int s0, - const union sockaddr_inany *sa) +void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0) { - struct tcp_splice_conn *conn; - union inany_addr src; - in_port_t srcport; - sa_family_t af; - uint8_t tgtpif; + struct tcp_splice_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, + tcp_splice); - if (c->mode != MODE_PASTA) - return false; - - inany_from_sockaddr(&src, &srcport, sa); - af = inany_v4(&src) ? AF_INET : AF_INET6; - - switch (pif0) { - case PIF_SPLICE: - if (!inany_is_loopback(&src)) { - char str[INANY_ADDRSTRLEN]; - - /* We can't use flow_err() etc. because we haven't set - * the flow type yet - */ - warn("Bad source address %s for splice, closing", - inany_ntop(&src, str, sizeof(str))); - - /* We *don't* want to fall back to tap */ - flow_alloc_cancel(flow); - return true; - } - - tgtpif = PIF_HOST; - dstport += c->tcp.fwd_out.delta[dstport]; - break; - - case PIF_HOST: - if (!inany_is_loopback(&src)) - return false; - - tgtpif = PIF_SPLICE; - dstport += c->tcp.fwd_in.delta[dstport]; - break; - - default: - return false; - } - - /* FIXME: Record outbound source address when known */ - if (af == AF_INET) - flow_target_af(flow, tgtpif, AF_INET, - NULL, 0, &in4addr_loopback, dstport); - else - flow_target_af(flow, tgtpif, AF_INET6, - NULL, 0, &in6addr_loopback, dstport); - conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice); + ASSERT(c->mode == MODE_PASTA); conn->s[0] = s0; conn->s[1] = -1; @@ -473,8 +419,6 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, conn_flag(c, conn, CLOSING); FLOW_ACTIVATE(conn); - - return true; } /** diff --git a/tcp_splice.h b/tcp_splice.h index ed8f0c58..a20f3e21 100644 --- a/tcp_splice.h +++ b/tcp_splice.h @@ -11,10 +11,7 @@ union sockaddr_inany; void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events); -bool tcp_splice_conn_from_sock(const struct ctx *c, - uint8_t pif0, in_port_t dstport, - union flow *flow, int s0, - const union sockaddr_inany *sa); +void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0); void tcp_splice_init(struct ctx *c); #endif /* TCP_SPLICE_H */ -- 2.45.2
On Fri, 14 Jun 2024 16:13:38 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Currently the code to translate host side addresses and ports to guest side addresses and ports, and vice versa, is scattered across the TCP code. This includes both port redirection as controlled by the -t and -T options, and our special case NAT controlled by the --no-map-gw option. Gather this logic into fwd_nat_from_*() functions for each input interface in fwd.c which take protocol and address information for the initiating side and generates the pif and address information for the forwarded side. This performs any NAT or port forwarding needed. We create a flow_target() helper which applies those forwarding functions as needed to automatically move a flow from INI to TGT state.Given that you already added flow_target() in another series, I didn't really review that part of this patch as I guess it will change. The rest of the patches from 8/26 to 17/26 all look good to me: after all, changes from v5 look rather minimal for those. I didn't review patches starting from 18/26, as you mentioned they will change substantially. -- Stefano
On Thu, Jun 27, 2024 at 12:49:41AM +0200, Stefano Brivio wrote:On Fri, 14 Jun 2024 16:13:38 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Actually, I think this version is already on top of that, but the commit message is a bit out of date. The steps here are: 1. Add flow_target() which takes an explicit target pif (already merged) 2. Replace flow_target() with variants which also take explicit target addresses (patch #2 in this series) 3. Replace flow_target_*() variants with plain flow_target() which automatically determines the target addresses based on the forwarding logic (this patch)Currently the code to translate host side addresses and ports to guest side addresses and ports, and vice versa, is scattered across the TCP code. This includes both port redirection as controlled by the -t and -T options, and our special case NAT controlled by the --no-map-gw option. Gather this logic into fwd_nat_from_*() functions for each input interface in fwd.c which take protocol and address information for the initiating side and generates the pif and address information for the forwarded side. This performs any NAT or port forwarding needed. We create a flow_target() helper which applies those forwarding functions as needed to automatically move a flow from INI to TGT state.Given that you already added flow_target() in another series, I didn't really review that part of this patch as I guess it will change.The rest of the patches from 8/26 to 17/26 all look good to me: after all, changes from v5 look rather minimal for those. I didn't review patches starting from 18/26, as you mentioned they will change substantially.18/26 itself is probably fine, but the ones after that are being more or less entirely rewritten, yes. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Current ICMP hard codes its forwarding rules, and never applies any translations. Change it to use the flow_target() function, so that it's translated the same as TCP (excluding TCP specific port redirection). This means that gw mapping now applies to ICMP so "ping <gw address>" will now ping the host's loopback instead of the actual gw machine. This removes the surprising behaviour that the target you ping might not be the same as you connect to with TCP. This removes the last user of flow_target_af(), so that's removed as well. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 32 -------------------------------- icmp.c | 16 ++++++++++------ 2 files changed, 10 insertions(+), 38 deletions(-) diff --git a/flow.c b/flow.c index dc600eca..cf799082 100644 --- a/flow.c +++ b/flow.c @@ -309,38 +309,6 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, return ini; } -/** - * flow_target_af() - Move flow to TGT, setting TGTSIDE details - * @flow: Flow to change state - * @pif: pif of the target side - * @af: Address family for @eaddr and @faddr - * @saddr: Source address (pointer to in_addr or in6_addr) - * @sport: Endpoint port - * @daddr: Destination address (pointer to in_addr or in6_addr) - * @dport: Destination port - * - * Return: pointer to the target flowside information - */ -const struct flowside *flow_target_af(union flow *flow, uint8_t pif, - sa_family_t af, - const void *saddr, in_port_t sport, - const void *daddr, in_port_t dport) -{ - struct flow_common *f = &flow->f; - struct flowside *tgt = &f->side[TGTSIDE]; - - ASSERT(pif != PIF_NONE); - ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); - ASSERT(f->type == FLOW_TYPE_NONE); - ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); - - flowside_from_af(tgt, af, daddr, dport, saddr, sport); - f->pif[TGTSIDE] = pif; - flow_set_state(f, FLOW_STATE_TGT); - return tgt; -} - - /** * flow_target() - Determine where flow should forward to, and move to TGT * @c: Execution context diff --git a/icmp.c b/icmp.c index cb3278e9..45d71efd 100644 --- a/icmp.c +++ b/icmp.c @@ -153,24 +153,28 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, sa_family_t af, uint16_t id, const void *saddr, const void *daddr) { + uint8_t proto = af == AF_INET ? IPPROTO_ICMP : IPPROTO_ICMPV6; uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6; union epoll_ref ref = { .type = EPOLL_TYPE_PING }; union flow *flow = flow_alloc(); struct icmp_ping_flow *pingf; const struct flowside *tgt; - const void *bind_addr; if (!flow) return NULL; flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); + if (!(tgt = flow_target(c, flow, proto))) + goto cancel; - if (af == AF_INET) - bind_addr = &c->ip4.addr_out; - else if (af == AF_INET6) - bind_addr = &c->ip6.addr_out; + if (flow->f.pif[TGTSIDE] != PIF_HOST) { + flow_err(flow, "No support for forwarding %s from %s to %s", + proto == IPPROTO_ICMP ? "ICMP" : "ICMPv6", + pif_name(flow->f.pif[INISIDE]), + pif_name(flow->f.pif[TGTSIDE])); + goto cancel; + } - tgt = flow_target_af(flow, PIF_HOST, af, bind_addr, 0, daddr, 0); pingf = FLOW_SET_TYPE(flow, flowtype, ping); pingf->seq = -1; -- 2.45.2
Add logic to the fwd_nat_from_*() functions to forwarding UDP packets. The logic here doesn't exactly match our current forwarding, since our current forwarding has some very strange and buggy edge cases. Instead it's attempting to replicate what appears to be the intended logic behind the current forwarding. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- fwd.c | 26 ++++++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/fwd.c b/fwd.c index 7ad7f07b..cd66eaee 100644 --- a/fwd.c +++ b/fwd.c @@ -169,12 +169,15 @@ void fwd_scan_ports_init(struct ctx *c) uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, const struct flowside *ini, struct flowside *tgt) { - (void)proto; - tgt->eaddr = ini->faddr; tgt->eport = ini->fport; - if (!c->no_map_gw) { + if (proto == IPPROTO_UDP && tgt->eport == 53) { + if (inany_equals4(&tgt->eaddr, &c->ip4.dns_match)) + tgt->eaddr = inany_from_v4(c->ip4.dns_host); + else if (inany_equals6(&tgt->eaddr, &c->ip6.dns_match)) + tgt->eaddr.a6 = c->ip6.dns_host; + } else if (!c->no_map_gw) { if (inany_equals4(&tgt->eaddr, &c->ip4.gw)) tgt->eaddr = inany_loopback4; else if (inany_equals6(&tgt->eaddr, &c->ip6.gw)) @@ -191,6 +194,10 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, /* Let the kernel pick a host side source port */ tgt->fport = 0; + if (proto == IPPROTO_UDP) { + /* But for UDP we preserve the source port */ + tgt->fport = ini->eport; + } return PIF_HOST; } @@ -227,9 +234,14 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, tgt->eport = ini->fport; if (proto == IPPROTO_TCP) tgt->eport += c->tcp.fwd_out.delta[tgt->eport]; + else if (proto == IPPROTO_UDP) + tgt->eport += c->udp.fwd_out.f.delta[tgt->eport]; /* Let the kernel pick a host side source port */ tgt->fport = 0; + if (proto == IPPROTO_UDP) + /* But for UDP preserve the source port */ + tgt->fport = ini->eport; return PIF_HOST; } @@ -251,18 +263,24 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, tgt->eport = ini->fport; if (proto == IPPROTO_TCP) tgt->eport += c->tcp.fwd_in.delta[tgt->eport]; + else if (proto == IPPROTO_UDP) + tgt->eport += c->udp.fwd_in.f.delta[tgt->eport]; if (c->mode == MODE_PASTA && inany_is_loopback(&ini->eaddr) && - proto == IPPROTO_TCP) { + (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */ tgt->faddr = ini->eaddr; /* Let the kernel pick a namespace side source port */ tgt->fport = 0; + if (proto == IPPROTO_UDP) + /* But for UDP preserve the source port */ + tgt->fport = ini->eport; if (inany_v4(&ini->eaddr)) tgt->eaddr = inany_loopback4; else tgt->eaddr = inany_loopback6; + return PIF_SPLICE; } -- 2.45.2
Currently UDP only has a very rudimentary (and buggy) form of connection tracking implemented with per-port flags. Make a start on converting this to more robust tracking via the flow table. Start matching UDP packets to flow table entries, creating them when necessary. We also add a timer so that the flows will expire. For now don't actually use the information in the flow table, that will come later. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- Makefile | 2 +- flow.c | 31 +++++++ flow.h | 4 + flow_table.h | 2 + udp.c | 228 ++++++++++++++++++++++++++++++++++++++++++++++++--- udp_flow.h | 25 ++++++ 6 files changed, 278 insertions(+), 14 deletions(-) create mode 100644 udp_flow.h diff --git a/Makefile b/Makefile index 09fc461d..92cbd5a6 100644 --- a/Makefile +++ b/Makefile @@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ - udp.h util.h + udp.h udp_flow.h util.h HEADERS = $(PASST_HEADERS) seccomp.h C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 }; diff --git a/flow.c b/flow.c index cf799082..c3d54f6e 100644 --- a/flow.c +++ b/flow.c @@ -35,6 +35,7 @@ const char *flow_type_str[] = { [FLOW_TCP_SPLICE] = "TCP connection (spliced)", [FLOW_PING4] = "ICMP ping sequence", [FLOW_PING6] = "ICMPv6 ping sequence", + [FLOW_UDP] = "UDP flow", }; static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES, "flow_type_str[] doesn't match enum flow_type"); @@ -44,6 +45,7 @@ const uint8_t flow_proto[] = { [FLOW_TCP_SPLICE] = IPPROTO_TCP, [FLOW_PING4] = IPPROTO_ICMP, [FLOW_PING6] = IPPROTO_ICMPV6, + [FLOW_UDP] = IPPROTO_UDP, }; static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES, "flow_proto[] doesn't match enum flow_type"); @@ -641,6 +643,31 @@ flow_sidx_t flow_lookup_af(const struct ctx *c, return flowside_lookup(c, proto, pif, &fside); } +/** + * flow_lookup_sa() - Look up a flow given and endpoint socket address + * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @esa: Socket address of the endpoint + * @fport: Forwarding port number + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport) +{ + struct flowside fside = { + .fport = fport, + }; + + inany_from_sockaddr(&fside.eaddr, &fside.eport, esa); + if (inany_v4(&fside.eaddr)) + fside.faddr = inany_any4; + else + fside.faddr = inany_any6; + return flowside_lookup(c, proto, pif, &fside); +} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context @@ -720,6 +747,10 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now) if (timer) closed = icmp_ping_timer(c, &flow->ping, now); break; + case FLOW_UDP: + if (timer) + closed = udp_flow_timer(c, &flow->udp, now); + break; default: /* Assume other flow types don't need any handling */ ; diff --git a/flow.h b/flow.h index 948f2ea9..b5ca2792 100644 --- a/flow.h +++ b/flow.h @@ -115,6 +115,8 @@ enum flow_type { FLOW_PING4, /* ICMPv6 echo requests from guest to host and matching replies back */ FLOW_PING6, + /* UDP packets with the matching unicast endpoints */ + FLOW_UDP, FLOW_NUM_TYPES, }; @@ -227,6 +229,8 @@ flow_sidx_t flow_lookup_af(const struct ctx *c, uint8_t proto, uint8_t pif, sa_family_t af, const void *eaddr, const void *faddr, in_port_t eport, in_port_t fport); +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif, + const void *esa, in_port_t fport); union flow; diff --git a/flow_table.h b/flow_table.h index 07c59041..6cf4f2b7 100644 --- a/flow_table.h +++ b/flow_table.h @@ -9,6 +9,7 @@ #include "tcp_conn.h" #include "icmp_flow.h" +#include "udp_flow.h" /** * struct flow_free_cluster - Information about a cluster of free entries @@ -35,6 +36,7 @@ union flow { struct tcp_tap_conn tcp; struct tcp_splice_conn tcp_splice; struct icmp_ping_flow ping; + struct udp_flow udp; }; /* Global Flow Table */ diff --git a/udp.c b/udp.c index e79ca938..cb6db5c5 100644 --- a/udp.c +++ b/udp.c @@ -15,10 +15,49 @@ /** * DOC: Theory of Operation * + * Flow Table + * ========== * - * For UDP, a reduced version of port-based connection tracking is implemented - * with two purposes: - * - binding ephemeral ports when they're used as source port by the guest, so + * UDP does not have connections, but to reliably forward reply packets back to + * the original requested, we must keep track of pseudo-connections. We do this + * via the generic flow table. + * + * - Finding an existing flow + * + * When we receive a datagram we attempt to match it to an existing flow: one + * with matching interface, addresses and ports (both forwarding and + * endpoint). For socket interfaces, we treat the forwarding address as the + * bound address of the receiving socket, which may be unspecified, rather + * than the datagram's actual destination address (which is awkward to + * determine for unbound sockets). + * + * - Creating a new flow + * + * If no matching flow exists, and the datagram comes either from the tap + * interface, or from a socket with the 'orig' flag set we create a new one. + * The initiating side records the interface, endpoint and forwarding + * addresses and ports of this first datagram. Again, we treat the forwarding + * address for sockets as the socket's bound address, regardless of the + * datagram's actual destination. + * + * The target side interface and addresses are assigned by the general code in + * fwd.c. When the target is a socket interface, the target forwarding + * address may be left unspecified - in this case, the kernel will determine + * the source address when we send the datagram. + * + * - Flow expiry + * + * Every time a datagram is received that matches a flow (or creates a new + * one), we update the flow's timestamp to the current time. Periodically we + * scan flows and those which are older than UDP_CONN_TIMEOUT (180s) are + * removed. + * + * Port Tracking + * ============= + * + * For datagrams not handled by the flow table, a reduced version of port-based + * connection tracking is implemented with two purposes: + * - binding ephemeral ports when they're used as source port by the guest, so * that replies on those ports can be forwarded back to the guest, with a * fixed timeout for this binding * - packets received from the local host get their source changed to a local @@ -121,6 +160,7 @@ #include "tap.h" #include "pcap.h" #include "log.h" +#include "flow_table.h" #define UDP_CONN_TIMEOUT 180 /* s, timeout for ephemeral or local bind */ #define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */ @@ -199,6 +239,7 @@ static struct ethhdr udp6_eth_hdr; * @taph: Tap backend specific header * @s_in: Source socket address, filled in by recvmmsg() * @splicesrc: Source port for splicing, or -1 if not spliceable + * @tosidx: sidx for the destination side of this datagram's flow */ static struct udp_meta_t { struct ipv6hdr ip6h; @@ -207,6 +248,7 @@ static struct udp_meta_t { union sockaddr_inany s_in; int splicesrc; + flow_sidx_t tosidx; } #ifdef __AVX2__ __attribute__ ((aligned(32))) @@ -253,6 +295,17 @@ static struct sockaddr_in6 udp6_localname = { static struct mmsghdr udp4_mh_splice [UDP_MAX_FRAMES]; static struct mmsghdr udp6_mh_splice [UDP_MAX_FRAMES]; +struct udp_flow *udp_at_sidx(flow_sidx_t sidx) +{ + union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return NULL; + + ASSERT(flow->f.type == FLOW_UDP); + return &flow->udp; +} + /** * udp_portmap_clear() - Clear UDP port map before configuration */ @@ -492,6 +545,67 @@ static int udp_mmh_splice_port(union udp_epoll_ref uref, return -1; } +/** + * udp_flow_from_sock() - Find or create UDP flow for datagrams from socket + * @c: Execution context + * @uref: UDP epoll reference of the originating socket + * @meta: Metadata buffer for the datagram + * + * Return: sidx for the destination side of the flow for this packet, or + * FLOW_SIDX_NONE if we couldn't find or create a flow. + */ +flow_sidx_t udp_flow_from_sock(const struct ctx *c, union udp_epoll_ref uref, + struct udp_meta_t *meta) +{ + char sstr[INANY_ADDRSTRLEN]; + const struct flowside *ini; + struct udp_flow *uflow; + union flow *flow; + flow_sidx_t sidx; + + sidx = flow_lookup_sa(c, IPPROTO_UDP, uref.pif, &meta->s_in, uref.port); + if ((flow = flow_at_sidx(sidx))) + return FLOW_SIDX(flow, !sidx.side); + + if (!uref.orig) + return FLOW_SIDX_NONE; + + if (!(flow = flow_alloc())) { + char sastr[SOCKADDR_STRLEN]; + + debug("Couldn't allocate flow for UDP datagram from %s %s", + pif_name(uref.pif), + sockaddr_ntop(&meta->s_in, sastr, sizeof(sastr))); + return FLOW_SIDX_NONE; + } + + ini = flow_initiate_sa(flow, uref.pif, &meta->s_in, uref.port); + + if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) { + flow_dbg(flow, "Invalid endpoint on UDP recv()"); + /* Invalid endpoint */ + goto cancel; + } + + if (!flow_target(c, flow, IPPROTO_UDP)) + goto cancel; + + uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp); + flow_hash_insert(c, FLOW_SIDX(uflow, INISIDE)); + flow_hash_insert(c, FLOW_SIDX(uflow, TGTSIDE)); + FLOW_ACTIVATE(uflow); + + return FLOW_SIDX(uflow, TGTSIDE); + +cancel: + flow_dbg(flow, "Couldn't create UDP flow for %s [%s]:%hu -> ?:%hu", + pif_name(uref.pif), + inany_ntop(&ini->eaddr, sstr, sizeof(sstr)), + ini->eport, ini->fport); + flow_alloc_cancel(flow); + return FLOW_SIDX_NONE; +} + /** * udp_splice_send() - Send datagrams from socket to socket * @c: Execution context @@ -536,6 +650,7 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, break; udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]); + udp_meta[i].tosidx = udp_flow_from_sock(c, uref, &udp_meta[i]); } while (udp_meta[i].splicesrc == src); if (uref.pif == PIF_SPLICE) { @@ -758,6 +873,7 @@ static unsigned udp_tap_send(const struct ctx *c, size_t start, size_t n, break; udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]); + udp_meta[i].tosidx = udp_flow_from_sock(c, uref, &udp_meta[i]); } while (udp_meta[i].splicesrc == -1); tap_send_frames(c, &tap_iov[start][0], UDP_NUM_IOVS, i - start); @@ -786,8 +902,8 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve */ ssize_t n = (c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES); in_port_t dstport = ref.udp.port; - bool v6 = ref.udp.v6; struct mmsghdr *mmh_recv; + bool v6 = ref.udp.v6; int i, m; if (c->no_udp || !(events & EPOLLIN)) @@ -797,6 +913,8 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve dstport += c->udp.fwd_out.f.delta[dstport]; else if (ref.udp.pif == PIF_HOST) dstport += c->udp.fwd_in.f.delta[dstport]; + else + ASSERT(0); if (v6) mmh_recv = udp6_l2_mh_sock; @@ -809,12 +927,13 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve /* We divide things into batches based on how we need to send them, * determined by udp_meta[i].splicesrc. To avoid either two passes - * through the array, or recalculating splicesrc for a single entry, we - * have to populate it one entry *ahead* of the loop counter (if - * present). So we fill in entry 0 before the loop, then udp_*_send() - * populate one entry past where they consume. + * through the array, or recalculating splicesrc and tosidx for a single + * entry, we have to populate them one entry *ahead* of the loop counter + * (if present). So we fill in entry 0 before the loop, then + * udp_*_send() populate one entry past where they consume. */ udp_meta[0].splicesrc = udp_mmh_splice_port(ref.udp, mmh_recv); + udp_meta[0].tosidx = udp_flow_from_sock(c, ref.udp, &udp_meta[0]); for (i = 0; i < n; i += m) { if (udp_meta[i].splicesrc >= 0) m = udp_splice_send(c, i, n, dstport, ref.udp, now); @@ -823,6 +942,74 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve } } +/** + * udp_flow_from_tap() - Find or create UDP flow for tap packets + * @c: Execution context + * @pif: pif on which the packet is arriving + * @af: Address family, AF_INET or AF_INET6 + * @saddr: Source address on guest side + * @daddr: Destination address guest side + * @srcport: Source port on guest side + * @dstport: Destination port on guest side + * + * Return: sidx for the destination side of the flow for this packet, or + * FLOW_SIDX_NONE if we couldn't find or create a flow. + */ +flow_sidx_t udp_flow_from_tap(const struct ctx *c, + uint8_t pif, sa_family_t af, + const void *saddr, const void *daddr, + in_port_t srcport, in_port_t dstport) +{ + const struct flowside *ini; + struct udp_flow *uflow; + union flow *flow; + flow_sidx_t sidx; + + ASSERT(pif == PIF_TAP); + + sidx = flow_lookup_af(c, IPPROTO_UDP, pif, af, saddr, daddr, + srcport, dstport); + if ((flow = flow_at_sidx(sidx))) + return FLOW_SIDX(flow, !sidx.side); + + if (!(flow = flow_alloc())) + return FLOW_SIDX_NONE; + + ini = flow_initiate_af(flow, PIF_TAP, af, + saddr, srcport, daddr, dstport); + + if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0 || + !inany_is_unicast(&ini->faddr) || ini->fport == 0) { + char sstr[INANY_ADDRSTRLEN], dstr[INANY_ADDRSTRLEN]; + + debug("Invalid UDP endpoint from %s: %s:%hu -> %s:%hu", + pif_name(pif), + inany_ntop(&ini->eaddr, sstr, sizeof(sstr)), ini->eport, + inany_ntop(&ini->faddr, dstr, sizeof(dstr)), ini->fport); + goto cancel; + } + + if (!flow_target(c, flow, IPPROTO_UDP)) + goto cancel; + + if (flow->f.pif[TGTSIDE] != PIF_HOST) { + flow_err(flow, "No support for forwarding UDP from %s to %s", + pif_name(flow->f.pif[INISIDE]), + pif_name(flow->f.pif[TGTSIDE])); + goto cancel; + } + + uflow = FLOW_SET_TYPE(flow, FLOW_UDP, udp); + flow_hash_insert(c, FLOW_SIDX(uflow, INISIDE)); + flow_hash_insert(c, FLOW_SIDX(uflow, TGTSIDE)); + FLOW_ACTIVATE(uflow); + return FLOW_SIDX(uflow, TGTSIDE); + +cancel: + flow_alloc_cancel(flow); + return FLOW_SIDX_NONE; +} + /** * udp_tap_handler() - Handle packets from tap * @c: Execution context @@ -847,15 +1034,13 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, struct sockaddr_in6 s_in6; struct sockaddr_in s_in; const struct udphdr *uh; + struct udp_flow *uflow; struct sockaddr *sa; int i, s, count = 0; in_port_t src, dst; + flow_sidx_t sidx; socklen_t sl; - (void)c; - (void)saddr; - (void)pif; - uh = packet_get(p, idx, 0, sizeof(*uh), NULL); if (!uh) return 1; @@ -864,8 +1049,14 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, * and destination, so we can just take those from the first message. */ src = ntohs(uh->source); - src += c->udp.fwd_in.rdelta[src]; dst = ntohs(uh->dest); + sidx = udp_flow_from_tap(c, pif, af, saddr, daddr, src, dst); + if ((uflow = udp_at_sidx(sidx))) + uflow->ts = now->tv_sec; + else + debug("UDP from tap without flow"); + + src += c->udp.fwd_in.rdelta[src]; if (af == AF_INET) { s_in = (struct sockaddr_in) { @@ -1211,6 +1402,17 @@ static int udp_port_rebind_outbound(void *arg) return 0; } +bool udp_flow_timer(const struct ctx *c, const struct udp_flow *uflow, + const struct timespec *now) +{ + if (now->tv_sec - uflow->ts <= UDP_CONN_TIMEOUT) + return false; + + flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE)); + flow_hash_remove(c, FLOW_SIDX(uflow, TGTSIDE)); + return true; +} + /** * udp_timer() - Scan activity bitmaps for ports with associated timed events * @c: Execution context diff --git a/udp_flow.h b/udp_flow.h new file mode 100644 index 00000000..18af9ac4 --- /dev/null +++ b/udp_flow.h @@ -0,0 +1,25 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright Red Hat + * Author: David Gibson <david(a)gibson.dropbear.id.au> + * + * UDP flow tracking data structures + */ +#ifndef UDP_FLOW_H +#define UDP_FLOW_H + +/** + * struct udp - Descriptor for a flow of UDP packets + * @f: Generic flow information + * @ts: Activity timestamp + */ +struct udp_flow { + /* Must be first element */ + struct flow_common f; + + time_t ts; +}; + +bool udp_flow_timer(const struct ctx *c, const struct udp_flow *uflow, + const struct timespec *now); + +#endif /* UDP_FLOW_H */ -- 2.45.2
Although we construct flow entries for UDP packets, we don't yet actually direct traffic according to the information in there. Start fixing that by directing traffic originating from the tap device according to the flow table. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 153 +++++++++++++++++++++------------------------------------- udp.h | 4 +- 2 files changed, 58 insertions(+), 99 deletions(-) diff --git a/udp.c b/udp.c index cb6db5c5..4668690e 100644 --- a/udp.c +++ b/udp.c @@ -52,6 +52,27 @@ * scan flows and those which are older than UDP_CONN_TIMEOUT (180s) are * removed. * + * - Locating or creating an outgoing socket + * + * When forwarding to a socket based interface, we need to find a suitable + * socket to send via. Generally this should have a bound address and port + * matching the forwarding address and port of the flowside for the outgoing + * datagram. However, if we have an existing socket with a matching port and + * an "any" address, we need to use that (in that case a socket with a + * specific bound address would conflict). + * + * FIXME: currently we don't perform this lookup correctly. Instead we abuse + * the fact that it's rare to have multiple flows with the same forwarding + * address but different forwarding port. We store at most a single socket + * per per bound port number (and IP version). For datagrams forwarded from + * PIF_TAP to PIF_HOST these are in udp_tap_map[]. + * + * For ports where port forwarding is configured (-u option) a socket is + * opened during start up, bound to the specified forwarding address and + * stored in udp_tap_map[]. For other ports we open a socket when we first + * need to forward a datagram from that port, bound to the configured outbound + * address (which may be "any"). + * * Port Tracking * ============= * @@ -149,6 +170,7 @@ #include <sys/socket.h> #include <sys/uio.h> #include <time.h> +#include <arpa/inet.h> #include "checksum.h" #include "util.h" @@ -1025,20 +1047,21 @@ cancel: * * #syscalls sendmmsg */ -int udp_tap_handler(struct ctx *c, uint8_t pif, +int udp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, int idx, const struct timespec *now) { + const struct flowside *toside; struct mmsghdr mm[UIO_MAXIOV]; struct iovec m[UIO_MAXIOV]; - struct sockaddr_in6 s_in6; - struct sockaddr_in s_in; + union udp_epoll_ref uref; + union sockaddr_inany sa; const struct udphdr *uh; struct udp_flow *uflow; - struct sockaddr *sa; int i, s, count = 0; - in_port_t src, dst; flow_sidx_t sidx; + in_port_t src; + uint8_t topif; socklen_t sl; uh = packet_get(p, idx, 0, sizeof(*uh), NULL); @@ -1048,59 +1071,36 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, /* The caller already checks that all the messages have the same source * and destination, so we can just take those from the first message. */ - src = ntohs(uh->source); - dst = ntohs(uh->dest); - sidx = udp_flow_from_tap(c, pif, af, saddr, daddr, src, dst); - if ((uflow = udp_at_sidx(sidx))) - uflow->ts = now->tv_sec; - else - debug("UDP from tap without flow"); + sidx = udp_flow_from_tap(c, pif, af, saddr, daddr, + ntohs(uh->source), ntohs(uh->dest)); + if (!(uflow = udp_at_sidx(sidx))) { + char sstr[INANY_ADDRSTRLEN], dstr[INANY_ADDRSTRLEN]; - src += c->udp.fwd_in.rdelta[src]; + debug("Dropping UDP packet without flow %s %s:%hu -> %s:%hu", + pif_name(pif), + inet_ntop(af, saddr, sstr, sizeof(sstr)), + ntohs(uh->source), + inet_ntop(af, daddr, dstr, sizeof(dstr)), + ntohs(uh->dest)); + return 1; + } - if (af == AF_INET) { - s_in = (struct sockaddr_in) { - .sin_family = AF_INET, - .sin_port = uh->dest, - .sin_addr = *(struct in_addr *)daddr, - }; + topif = uflow->f.pif[sidx.side]; + toside = &uflow->f.side[sidx.side]; - sa = (struct sockaddr *)&s_in; - sl = sizeof(s_in); + ASSERT(topif == PIF_HOST); - if (IN4_ARE_ADDR_EQUAL(&s_in.sin_addr, &c->ip4.dns_match) && - ntohs(s_in.sin_port) == 53) { - s_in.sin_addr = c->ip4.dns_host; - udp_tap_map[V4][src].ts = now->tv_sec; - udp_tap_map[V4][src].flags |= PORT_DNS_FWD; - bitmap_set(udp_act[V4][UDP_ACT_TAP], src); - } else if (IN4_ARE_ADDR_EQUAL(&s_in.sin_addr, &c->ip4.gw) && - !c->no_map_gw) { - if (!(udp_tap_map[V4][dst].flags & PORT_LOCAL) || - (udp_tap_map[V4][dst].flags & PORT_LOOPBACK)) - s_in.sin_addr.s_addr = htonl(INADDR_LOOPBACK); - else - s_in.sin_addr = c->ip4.addr_seen; - } + uflow->ts = now->tv_sec; - debug("UDP from tap src=%hu dst=%hu, s=%d", - src, dst, udp_tap_map[V4][src].sock); - if ((s = udp_tap_map[V4][src].sock) < 0) { - struct in_addr bind_addr = IN4ADDR_ANY_INIT; - union udp_epoll_ref uref = { - .port = src, - .pif = PIF_HOST, - }; - const char *bind_if = NULL; - - if (!IN4_IS_ADDR_LOOPBACK(&s_in.sin_addr)) - bind_if = c->ip4.ifname_out; + sockaddr_from_inany(&sa, &sl, &toside->eaddr, toside->eport, c->ifi6); + src = toside->fport; + uref.port = src; + uref.pif = topif; - if (!IN4_IS_ADDR_LOOPBACK(&s_in.sin_addr)) - bind_addr = c->ip4.addr_out; - - s = sock_l4(c, AF_INET, IPPROTO_UDP, &bind_addr, - bind_if, src, uref.u32); + if (sa.sa_family == AF_INET) { + if ((s = udp_tap_map[V4][src].sock) < 0) { + s = flowside_sock_l4(c, IPPROTO_UDP, topif, toside, + uref.u32); if (s < 0) return p->count - idx; @@ -1110,52 +1110,11 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, udp_tap_map[V4][src].ts = now->tv_sec; } else { - s_in6 = (struct sockaddr_in6) { - .sin6_family = AF_INET6, - .sin6_port = uh->dest, - .sin6_addr = *(struct in6_addr *)daddr, - }; - const struct in6_addr *bind_addr = &in6addr_any; - - sa = (struct sockaddr *)&s_in6; - sl = sizeof(s_in6); - - if (IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.dns_match) && - ntohs(s_in6.sin6_port) == 53) { - s_in6.sin6_addr = c->ip6.dns_host; - udp_tap_map[V6][src].ts = now->tv_sec; - udp_tap_map[V6][src].flags |= PORT_DNS_FWD; - bitmap_set(udp_act[V6][UDP_ACT_TAP], src); - } else if (IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw) && - !c->no_map_gw) { - if (!(udp_tap_map[V6][dst].flags & PORT_LOCAL) || - (udp_tap_map[V6][dst].flags & PORT_LOOPBACK)) - s_in6.sin6_addr = in6addr_loopback; - else if (udp_tap_map[V6][dst].flags & PORT_GUA) - s_in6.sin6_addr = c->ip6.addr; - else - s_in6.sin6_addr = c->ip6.addr_seen; - } else if (IN6_IS_ADDR_LINKLOCAL(&s_in6.sin6_addr)) { - bind_addr = &c->ip6.addr_ll; - } - if ((s = udp_tap_map[V6][src].sock) < 0) { - union udp_epoll_ref uref = { - .v6 = 1, - .port = src, - .pif = PIF_HOST, - }; - const char *bind_if = NULL; - - if (!IN6_IS_ADDR_LOOPBACK(&s_in6.sin6_addr)) - bind_if = c->ip6.ifname_out; - - if (!IN6_IS_ADDR_LOOPBACK(&s_in6.sin6_addr) && - !IN6_IS_ADDR_LINKLOCAL(&s_in6.sin6_addr)) - bind_addr = &c->ip6.addr_out; + uref.v6 = 1; - s = sock_l4(c, AF_INET6, IPPROTO_UDP, bind_addr, - bind_if, src, uref.u32); + s = flowside_sock_l4(c, IPPROTO_UDP, topif, toside, + uref.u32); if (s < 0) return p->count - idx; @@ -1174,7 +1133,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, if (!uh_send) return p->count - idx; - mm[i].msg_hdr.msg_name = sa; + mm[i].msg_hdr.msg_name = &sa; mm[i].msg_hdr.msg_namelen = sl; if (len) { diff --git a/udp.h b/udp.h index 5865def2..d25e66cb 100644 --- a/udp.h +++ b/udp.h @@ -11,8 +11,8 @@ void udp_portmap_clear(void); void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events, const struct timespec *now); -int udp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af, - const void *saddr, const void *daddr, +int udp_tap_handler(const struct ctx *c, uint8_t pif, + sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, int idx, const struct timespec *now); int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, const void *addr, const char *ifname, in_port_t port); -- 2.45.2
Currently although we associate a flow with traffic coming from a socket, we don't use that flow to determine how to forward the traffic. Fix this for the case of traffic going from socket to tap (i.e. host to guest). Determine first that we should be sending to tap from the pif in the flow entry, rather than with separate logic. Also use the addresses and ports in the flow entry to construct the headers for the tap interface. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow_table.h | 28 +++++++++ udp.c | 158 +++++++++++++++++---------------------------------- 2 files changed, 81 insertions(+), 105 deletions(-) diff --git a/flow_table.h b/flow_table.h index 6cf4f2b7..3f1e81b9 100644 --- a/flow_table.h +++ b/flow_table.h @@ -80,6 +80,34 @@ static inline union flow *flow_at_sidx(flow_sidx_t sidx) return FLOW(sidx.flow); } +/** flowside_at_sidx - Flow side for a given sidx + * @sidx: Flow & side index + * + * Return: pointer to the corresponding flowside, or NULL + */ +static inline struct flowside *flowside_at_sidx(flow_sidx_t sidx) +{ + union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return NULL; + return &flow->f.side[sidx.side]; +} + +/** pif_at_sidx - Interface id for a given sidx + * @sidx: Flow & side index + * + * Return: pif for the given flow & side, or PIF_NONE if the sidx is invalid + */ +static inline uint8_t pif_at_sidx(flow_sidx_t sidx) +{ + union flow *flow = flow_at_sidx(sidx); + + if (!flow) + return PIF_NONE; + return flow->f.pif[sidx.side]; +} + /** flow_sidx_t - Index of one side of a flow from common structure * @f: Common flow fields pointer * @side: Which side to refer to (0 or 1) diff --git a/udp.c b/udp.c index 4668690e..29f3ba85 100644 --- a/udp.c +++ b/udp.c @@ -711,118 +711,53 @@ out: } /** - * udp_update_hdr4() - Update headers for one IPv4 datagram - * @c: Execution context + * udp_update_hdr4) - Update headers for one IPv4 datagram * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr) - * @s_in: Source socket address, filled in by recvmmsg() * @bp: Pointer to udp_payload_t to update - * @dstport: Destination port number + * @fside: Flowside with addresses to direct the datagram * @dlen: Length of UDP payload - * @now: Current timestamp * * Return: size of IPv4 payload (UDP header + data) */ -static size_t udp_update_hdr4(const struct ctx *c, - struct iphdr *ip4h, const struct sockaddr_in *s_in, - struct udp_payload_t *bp, - in_port_t dstport, size_t dlen, - const struct timespec *now) +static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp, + const struct flowside *fside, size_t dlen) { - const struct in_addr dst = c->ip4.addr_seen; - in_port_t srcport = ntohs(s_in->sin_port); + const struct in_addr *src = inany_v4(&fside->faddr); + const struct in_addr *dst = inany_v4(&fside->eaddr); size_t l4len = dlen + sizeof(bp->uh); size_t l3len = l4len + sizeof(*ip4h); - struct in_addr src = s_in->sin_addr; - - if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match) && - IN4_ARE_ADDR_EQUAL(&src, &c->ip4.dns_host) && srcport == 53 && - (udp_tap_map[V4][dstport].flags & PORT_DNS_FWD)) { - src = c->ip4.dns_match; - } else if (IN4_IS_ADDR_LOOPBACK(&src) || - IN4_ARE_ADDR_EQUAL(&src, &c->ip4.addr_seen)) { - udp_tap_map[V4][srcport].ts = now->tv_sec; - udp_tap_map[V4][srcport].flags |= PORT_LOCAL; - - if (IN4_IS_ADDR_LOOPBACK(&src)) - udp_tap_map[V4][srcport].flags |= PORT_LOOPBACK; - else - udp_tap_map[V4][srcport].flags &= ~PORT_LOOPBACK; - bitmap_set(udp_act[V4][UDP_ACT_TAP], srcport); - - src = c->ip4.gw; - } + ASSERT(src && dst); ip4h->tot_len = htons(l3len); - ip4h->daddr = dst.s_addr; - ip4h->saddr = src.s_addr; - ip4h->check = csum_ip4_header(l3len, IPPROTO_UDP, src, dst); + ip4h->daddr = dst->s_addr; + ip4h->saddr = src->s_addr; + ip4h->check = csum_ip4_header(l3len, IPPROTO_UDP, *src, *dst); - bp->uh.source = s_in->sin_port; - bp->uh.dest = htons(dstport); + bp->uh.source = htons(fside->fport); + bp->uh.dest = htons(fside->eport); bp->uh.len = htons(l4len); - csum_udp4(&bp->uh, src, dst, bp->data, dlen); + csum_udp4(&bp->uh, *src, *dst, bp->data, dlen); return l4len; } /** * udp_update_hdr6() - Update headers for one IPv6 datagram - * @c: Execution context * @ip6h: Pre-filled IPv6 header (except for payload_len and addresses) - * @s_in: Source socket address, filled in by recvmmsg() * @bp: Pointer to udp_payload_t to update - * @dstport: Destination port number + * @fside: Flowside with addresses to direct the datagram * @dlen: Length of UDP payload - * @now: Current timestamp * * Return: size of IPv6 payload (UDP header + data) */ -static size_t udp_update_hdr6(const struct ctx *c, - struct ipv6hdr *ip6h, struct sockaddr_in6 *s_in6, - struct udp_payload_t *bp, - in_port_t dstport, size_t dlen, - const struct timespec *now) +static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp, + const struct flowside *fside, size_t dlen) { - const struct in6_addr *src = &s_in6->sin6_addr; - const struct in6_addr *dst = &c->ip6.addr_seen; - in_port_t srcport = ntohs(s_in6->sin6_port); + const struct in6_addr *src = &fside->faddr.a6; + const struct in6_addr *dst = &fside->eaddr.a6; uint16_t l4len = dlen + sizeof(bp->uh); - if (IN6_IS_ADDR_LINKLOCAL(src)) { - dst = &c->ip6.addr_ll_seen; - } else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match) && - IN6_ARE_ADDR_EQUAL(src, &c->ip6.dns_host) && - srcport == 53 && - (udp_tap_map[V4][dstport].flags & PORT_DNS_FWD)) { - src = &c->ip6.dns_match; - } else if (IN6_IS_ADDR_LOOPBACK(src) || - IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr_seen) || - IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr)) { - udp_tap_map[V6][srcport].ts = now->tv_sec; - udp_tap_map[V6][srcport].flags |= PORT_LOCAL; - - if (IN6_IS_ADDR_LOOPBACK(src)) - udp_tap_map[V6][srcport].flags |= PORT_LOOPBACK; - else - udp_tap_map[V6][srcport].flags &= ~PORT_LOOPBACK; - - if (IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr)) - udp_tap_map[V6][srcport].flags |= PORT_GUA; - else - udp_tap_map[V6][srcport].flags &= ~PORT_GUA; - - bitmap_set(udp_act[V6][UDP_ACT_TAP], srcport); - - dst = &c->ip6.addr_ll_seen; - - if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw)) - src = &c->ip6.gw; - else - src = &c->ip6.addr_ll; - - } - ip6h->payload_len = htons(l4len); ip6h->daddr = *dst; ip6h->saddr = *src; @@ -830,8 +765,8 @@ static size_t udp_update_hdr6(const struct ctx *c, ip6h->nexthdr = IPPROTO_UDP; ip6h->hop_limit = 255; - bp->uh.source = s_in6->sin6_port; - bp->uh.dest = htons(dstport); + bp->uh.source = htons(fside->fport); + bp->uh.dest = htons(fside->eport); bp->uh.len = ip6h->payload_len; csum_udp6(&bp->uh, src, dst, bp->data, dlen); @@ -843,26 +778,23 @@ static size_t udp_update_hdr6(const struct ctx *c, * @c: Execution context * @start: Index of first datagram in udp[46]_l2_buf pool * @n: Total number of datagrams in udp[46]_l2_buf pool - * @dstport: Destination port number on destination side * @uref: UDP epoll reference for origin socket * @now: Current timestamp * * This consumes as many frames as are sendable via tap. It requires that - * udp_meta[(a)start].splicesrc is initialised, and will initialise - * udp_meta[].splicesrc for each frame it consumes *and one more* (if present). + * udp_meta[(a)start].tosidx is initialised, and will initialise udp_meta[].tosidx + * for each frame it consumes *and one more* (if present). * * Return: Number of frames sent via tap */ static unsigned udp_tap_send(const struct ctx *c, size_t start, size_t n, - in_port_t dstport, union udp_epoll_ref uref, + union udp_epoll_ref uref, const struct timespec *now) { struct iovec (*tap_iov)[UDP_NUM_IOVS]; struct mmsghdr *mmh_recv; size_t i = start; - ASSERT(udp_meta[start].splicesrc == -1); - if (uref.v6) { tap_iov = udp6_l2_iov_tap; mmh_recv = udp6_l2_mh_sock; @@ -874,18 +806,19 @@ static unsigned udp_tap_send(const struct ctx *c, size_t start, size_t n, do { struct udp_payload_t *bp = &udp_payload[i]; struct udp_meta_t *bm = &udp_meta[i]; + const struct flowside *toside = flowside_at_sidx(bm->tosidx); size_t l4len; if (uref.v6) { - l4len = udp_update_hdr6(c, &bm->ip6h, - &bm->s_in.sa6, bp, dstport, - udp6_l2_mh_sock[i].msg_len, now); + udp_tap_map[V6][toside->fport].ts = now->tv_sec; + l4len = udp_update_hdr6(&bm->ip6h, bp, toside, + udp6_l2_mh_sock[i].msg_len); tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) + sizeof(udp6_eth_hdr)); } else { - l4len = udp_update_hdr4(c, &bm->ip4h, - &bm->s_in.sa4, bp, dstport, - udp4_l2_mh_sock[i].msg_len, now); + udp_tap_map[V4][toside->fport].ts = now->tv_sec; + l4len = udp_update_hdr4(&bm->ip4h, bp, toside, + udp4_l2_mh_sock[i].msg_len); tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) + sizeof(udp4_eth_hdr)); } @@ -896,7 +829,7 @@ static unsigned udp_tap_send(const struct ctx *c, size_t start, size_t n, udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]); udp_meta[i].tosidx = udp_flow_from_sock(c, uref, &udp_meta[i]); - } while (udp_meta[i].splicesrc == -1); + } while (pif_at_sidx(udp_meta[i].tosidx) == PIF_TAP); tap_send_frames(c, &tap_iov[start][0], UDP_NUM_IOVS, i - start); return i - start; @@ -948,19 +881,34 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve return; /* We divide things into batches based on how we need to send them, - * determined by udp_meta[i].splicesrc. To avoid either two passes - * through the array, or recalculating splicesrc and tosidx for a single - * entry, we have to populate them one entry *ahead* of the loop counter - * (if present). So we fill in entry 0 before the loop, then + * determined by udp_meta[i].splicesrc or tosidx. To avoid either two + * passes through the array, or recalculating splicesrc and tosidx for a + * single entry, we have to populate them one entry *ahead* of the loop + * counter (if present). So we fill in entry 0 before the loop, then * udp_*_send() populate one entry past where they consume. */ udp_meta[0].splicesrc = udp_mmh_splice_port(ref.udp, mmh_recv); udp_meta[0].tosidx = udp_flow_from_sock(c, ref.udp, &udp_meta[0]); for (i = 0; i < n; i += m) { - if (udp_meta[i].splicesrc >= 0) + flow_sidx_t tosidx = udp_meta[i].tosidx; + uint8_t topif = pif_at_sidx(tosidx); + + if (topif == PIF_TAP) { + m = udp_tap_send(c, i, n, ref.udp, now); + } else if (udp_meta[i].splicesrc >= 0) { m = udp_splice_send(c, i, n, dstport, ref.udp, now); - else - m = udp_tap_send(c, i, n, dstport, ref.udp, now); + } else { + char sstr[SOCKADDR_STRLEN]; + + debug("Dropping UDP packet without usable flow from %s %s -> ?:%hu", + pif_name(ref.udp.pif), + sockaddr_ntop(&udp_meta[i].s_in, sstr, sizeof(sstr)), + dstport); + + m = 1; + udp_meta[i].splicesrc = udp_mmh_splice_port(ref.udp, mmh_recv); + udp_meta[i].tosidx = udp_flow_from_sock(c, ref.udp, &udp_meta[i]); + } } } -- 2.45.2
Although we associate a flow with traffic coming from a socket, for spliced traffic we don't use that flow entry to address it. Fix that, using the flow table as the source of truth for addressing "spliced" datagrams. As part of this we tweak how we batch datagrams. Previously we'd batch together contiguous datagrams as long as they have the same source port on the destination side. Now we batch together datagrams only if they belong to the same flow. Previously the structure required that datagrams with the same source port would also have the same destination port, and we relied on that fact. Although that will be true for flows at the moment it may not be true in future, so we need to ensure that everything in the batch has the same destination port. Similarly, although all datagrams will have loopback as both the source and destination address, these could be different addresses in the 127.0.0.1/8 subnet. To be sent as a single batch we need those addresses to be the same as well. Again, future changes to how we construct flow entries may make this a real concern. If both ports and addresses are the same, it must be the same flow, since that's how flows are looked up. So, we can simplify the logic simply by checking for the same flow. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 149 +++++++++++++++++----------------------------------------- 1 file changed, 43 insertions(+), 106 deletions(-) diff --git a/udp.c b/udp.c index 29f3ba85..89dc0307 100644 --- a/udp.c +++ b/udp.c @@ -73,81 +73,26 @@ * need to forward a datagram from that port, bound to the configured outbound * address (which may be "any"). * - * Port Tracking - * ============= + * - L4 to L4 ("spliced") traffic * - * For datagrams not handled by the flow table, a reduced version of port-based - * connection tracking is implemented with two purposes: - * - binding ephemeral ports when they're used as source port by the guest, so - * that replies on those ports can be forwarded back to the guest, with a - * fixed timeout for this binding - * - packets received from the local host get their source changed to a local - * address (gateway address) so that they can be forwarded to the guest, and - * packets sent as replies by the guest need their destination address to - * be changed back to the address of the local host. This is dynamic to allow - * connections from the gateway as well, and uses the same fixed 180s timeout - * - * Sockets for bound ports are created at initialisation time, one set for IPv4 - * and one for IPv6. + * In PASTA mode, the L2-L4 translation is skipped for datagrams between host + * and namespace with loopback addresses on both sides. Instead, messages are + * directly transferred between L4 sockets. These are called spliced flows by + * analogy with the TCP implementation, but the splice() syscall isn't + * actually used (that's only for streams); a pair of recvmmsg() and + * sendmmsg() deals with this case. * - * Packets are forwarded back and forth, by prepending and stripping UDP headers - * in the obvious way, with no port translation. + * To find a suitable sending socket for spliced datagrams, that is bound to + * localhost and the appropriate forwarding port, we use udp_splice_init[] for + * sockets in the initial host namespace and udp_splice_ns[] for sockets in + * the guest namespace. * - * In PASTA mode, the L2-L4 translation is skipped for connections to ports - * bound between namespaces using the loopback interface, messages are directly - * transferred between L4 sockets instead. These are called spliced connections - * for consistency with the TCP implementation, but the splice() syscall isn't - * actually used as it wouldn't make sense for datagram-based connections: a - * pair of recvmmsg() and sendmmsg() deals with this case. + * FIXME: sockets in udp_splice_init[] can conflict with sockets bound to the + * "any" address in udp_tap_map[]. * - * The connection tracking for PASTA mode is slightly complicated by the absence - * of actual connections, see struct udp_splice_port, and these examples: - * - * - from init to namespace: - * - * - forward direction: 127.0.0.1:5000 -> 127.0.0.1:80 in init from socket s, - * with epoll reference: index = 80, splice = 1, orig = 1, ns = 0 - * - if udp_splice_ns[V4][5000].sock: - * - send packet to udp_splice_ns[V4][5000].sock, with destination port - * 80 - * - otherwise: - * - create new socket udp_splice_ns[V4][5000].sock - * - bind in namespace to 127.0.0.1:5000 - * - add to epoll with reference: index = 5000, splice = 1, orig = 0, - * ns = 1 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - * - reverse direction: 127.0.0.1:80 -> 127.0.0.1:5000 in namespace socket s, - * having epoll reference: index = 5000, splice = 1, orig = 0, ns = 1 - * - if udp_splice_init[V4][80].sock: - * - send to udp_splice_init[V4][80].sock, with destination port 5000 - * - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with - * current time - * - otherwise, discard - * - * - from namespace to init: - * - * - forward direction: 127.0.0.1:2000 -> 127.0.0.1:22 in namespace from - * socket s, with epoll reference: index = 22, splice = 1, orig = 1, ns = 1 - * - if udp4_splice_init[V4][2000].sock: - * - send packet to udp_splice_init[V4][2000].sock, with destination - * port 22 - * - otherwise: - * - create new socket udp_splice_init[V4][2000].sock - * - bind in init to 127.0.0.1:2000 - * - add to epoll with reference: index = 2000, splice = 1, orig = 0, - * ns = 0 - * - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - * - reverse direction: 127.0.0.1:22 -> 127.0.0.1:2000 in init from socket s, - * having epoll reference: index = 2000, splice = 1, orig = 0, ns = 0 - * - if udp_splice_ns[V4][22].sock: - * - send to udp_splice_ns[V4][22].sock, with destination port 2000 - * - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with - * current time - * - otherwise, discard + * FIXME: We don't handle the case of needing multiple "splice" sockets for a + * single port, due to using different IPv4 loopback addresses (e.g. 127.0.0.1 + * and 127.0.0.2). */ #include <sched.h> @@ -305,14 +250,7 @@ static struct mmsghdr udp6_l2_mh_sock [UDP_MAX_FRAMES]; /* recvmmsg()/sendmmsg() data for "spliced" connections */ static struct iovec udp_iov_splice [UDP_MAX_FRAMES]; -static struct sockaddr_in udp4_localname = { - .sin_family = AF_INET, - .sin_addr = IN4ADDR_LOOPBACK_INIT, -}; -static struct sockaddr_in6 udp6_localname = { - .sin6_family = AF_INET6, - .sin6_addr = IN6ADDR_LOOPBACK_INIT, -}; +static union sockaddr_inany udp_splicename; static struct mmsghdr udp4_mh_splice [UDP_MAX_FRAMES]; static struct mmsghdr udp6_mh_splice [UDP_MAX_FRAMES]; @@ -633,37 +571,38 @@ cancel: * @c: Execution context * @start: Index of first datagram in udp[46]_l2_buf * @n: Total number of datagrams in udp[46]_l2_buf pool - * @dst: Datagrams will be sent to this port (on destination side) * @uref: UDP epoll reference for origin socket * @now: Timestamp * - * This consumes as many datagrams as are sendable via a single socket. It - * requires that udp_meta[(a)start].splicesrc is initialised, and will initialise - * udp_meta[].splicesrc for each datagram it consumes *and one more* (if - * present). + * This consumes as many contiguous datagrams as possible with the same flow. + * It requires that udp_meta[(a)start].tosidx is initialised, and will initialise + * udp_meta[].tosidx for each frame it consumes *and one more* (if present). * * Return: Number of datagrams forwarded */ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, - in_port_t dst, union udp_epoll_ref uref, + union udp_epoll_ref uref, const struct timespec *now) { - in_port_t src = udp_meta[start].splicesrc; + flow_sidx_t startsidx = udp_meta[start].tosidx; + const struct flowside *toside = flowside_at_sidx(startsidx); + in_port_t src = toside->fport, dst = toside->eport; + uint8_t topif = pif_at_sidx(startsidx); struct mmsghdr *mmh_recv, *mmh_send; unsigned int i = start; + socklen_t sl; int s; - ASSERT(udp_meta[start].splicesrc >= 0); - if (uref.v6) { mmh_recv = udp6_l2_mh_sock; mmh_send = udp6_mh_splice; - udp6_localname.sin6_port = htons(dst); } else { mmh_recv = udp4_l2_mh_sock; mmh_send = udp4_mh_splice; - udp4_localname.sin_port = htons(dst); } + /* We don't need a scope id, because it's always a loopback address */ + sockaddr_from_inany(&udp_splicename, &sl, + &toside->eaddr, toside->eport, 0); do { mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len; @@ -673,10 +612,9 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]); udp_meta[i].tosidx = udp_flow_from_sock(c, uref, &udp_meta[i]); - } while (udp_meta[i].splicesrc == src); + } while (flow_sidx_eq(udp_meta[i].tosidx, startsidx)); - if (uref.pif == PIF_SPLICE) { - src += c->udp.fwd_in.rdelta[src]; + if (topif == PIF_HOST) { s = udp_splice_init[uref.v6][src].sock; if (s < 0 && uref.orig) s = udp_splice_new(c, uref.v6, src, false); @@ -687,8 +625,7 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, udp_splice_ns[uref.v6][dst].ts = now->tv_sec; udp_splice_init[uref.v6][src].ts = now->tv_sec; } else { - ASSERT(uref.pif == PIF_HOST); - src += c->udp.fwd_out.rdelta[src]; + ASSERT(topif == PIF_SPLICE); s = udp_splice_ns[uref.v6][src].sock; if (s < 0 && uref.orig) { struct udp_splice_new_ns_arg arg = { @@ -881,11 +818,11 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve return; /* We divide things into batches based on how we need to send them, - * determined by udp_meta[i].splicesrc or tosidx. To avoid either two - * passes through the array, or recalculating splicesrc and tosidx for a - * single entry, we have to populate them one entry *ahead* of the loop - * counter (if present). So we fill in entry 0 before the loop, then - * udp_*_send() populate one entry past where they consume. + * determined by udp_meta[i].tosidx. To avoid either two passes through + * the array, or recalculating splicesrc and tosidx for a single entry, + * we have to populate them one entry *ahead* of the loop counter (if + * present). So we fill in entry 0 before the loop, then udp_*_send() + * populate one entry past where they consume. */ udp_meta[0].splicesrc = udp_mmh_splice_port(ref.udp, mmh_recv); udp_meta[0].tosidx = udp_flow_from_sock(c, ref.udp, &udp_meta[0]); @@ -895,8 +832,8 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve if (topif == PIF_TAP) { m = udp_tap_send(c, i, n, ref.udp, now); - } else if (udp_meta[i].splicesrc >= 0) { - m = udp_splice_send(c, i, n, dstport, ref.udp, now); + } else if (topif != PIF_NONE) { + m = udp_splice_send(c, i, n, ref.udp, now); } else { char sstr[SOCKADDR_STRLEN]; @@ -1183,11 +1120,11 @@ static void udp_splice_iov_init(void) struct msghdr *mh4 = &udp4_mh_splice[i].msg_hdr; struct msghdr *mh6 = &udp6_mh_splice[i].msg_hdr; - mh4->msg_name = &udp4_localname; - mh4->msg_namelen = sizeof(udp4_localname); + mh4->msg_name = &udp_splicename; + mh4->msg_namelen = sizeof(udp_splicename.sa4); - mh6->msg_name = &udp6_localname; - mh6->msg_namelen = sizeof(udp6_localname); + mh6->msg_name = &udp_splicename; + mh6->msg_namelen = sizeof(udp_splicename.sa6); udp_iov_splice[i].iov_base = udp_payload[i].data; -- 2.45.2
This field in udp_meta_t was used to work out whether and how we're splicing datagrams. However, that role has been taken over by the flow table, so this field is updated, but never used. Eliminate it. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 48 ++++++------------------------------------------ 1 file changed, 6 insertions(+), 42 deletions(-) diff --git a/udp.c b/udp.c index 89dc0307..3b30306a 100644 --- a/udp.c +++ b/udp.c @@ -205,7 +205,6 @@ static struct ethhdr udp6_eth_hdr; * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr) * @taph: Tap backend specific header * @s_in: Source socket address, filled in by recvmmsg() - * @splicesrc: Source port for splicing, or -1 if not spliceable * @tosidx: sidx for the destination side of this datagram's flow */ static struct udp_meta_t { @@ -214,7 +213,6 @@ static struct udp_meta_t { struct tap_hdr taph; union sockaddr_inany s_in; - int splicesrc; flow_sidx_t tosidx; } #ifdef __AVX2__ @@ -479,32 +477,6 @@ static int udp_splice_new_ns(void *arg) return 0; } -/** - * udp_mmh_splice_port() - Is source address of message suitable for splicing? - * @uref: UDP epoll reference for incoming message's origin socket - * @mmh: mmsghdr of incoming message - * - * Return: if source address of message in @mmh refers to localhost (127.0.0.1 - * or ::1) its source port (host order), otherwise -1. - */ -static int udp_mmh_splice_port(union udp_epoll_ref uref, - const struct mmsghdr *mmh) -{ - const struct sockaddr_in6 *sa6 = mmh->msg_hdr.msg_name; - const struct sockaddr_in *sa4 = mmh->msg_hdr.msg_name; - - if (!uref.splice) - return -1; - - if (uref.v6 && IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr)) - return ntohs(sa6->sin6_port); - - if (!uref.v6 && IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr)) - return ntohs(sa4->sin_port); - - return -1; -} - /** * udp_flow_from_sock() - Find or create UDP flow for datagrams from socket * @c: Execution context @@ -610,7 +582,6 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, if (++i >= n) break; - udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]); udp_meta[i].tosidx = udp_flow_from_sock(c, uref, &udp_meta[i]); } while (flow_sidx_eq(udp_meta[i].tosidx, startsidx)); @@ -729,16 +700,12 @@ static unsigned udp_tap_send(const struct ctx *c, size_t start, size_t n, const struct timespec *now) { struct iovec (*tap_iov)[UDP_NUM_IOVS]; - struct mmsghdr *mmh_recv; size_t i = start; - if (uref.v6) { + if (uref.v6) tap_iov = udp6_l2_iov_tap; - mmh_recv = udp6_l2_mh_sock; - } else { - mmh_recv = udp4_l2_mh_sock; + else tap_iov = udp4_l2_iov_tap; - } do { struct udp_payload_t *bp = &udp_payload[i]; @@ -764,7 +731,6 @@ static unsigned udp_tap_send(const struct ctx *c, size_t start, size_t n, if (++i >= n) break; - udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]); udp_meta[i].tosidx = udp_flow_from_sock(c, uref, &udp_meta[i]); } while (pif_at_sidx(udp_meta[i].tosidx) == PIF_TAP); @@ -819,12 +785,11 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve /* We divide things into batches based on how we need to send them, * determined by udp_meta[i].tosidx. To avoid either two passes through - * the array, or recalculating splicesrc and tosidx for a single entry, - * we have to populate them one entry *ahead* of the loop counter (if - * present). So we fill in entry 0 before the loop, then udp_*_send() - * populate one entry past where they consume. + * the array, or recalculating tosidx for a single entry, we have to + * populate it one entry *ahead* of the loop counter (if present). So + * we fill in entry 0 before the loop, then udp_*_send() populate one + * entry past where they consume. */ - udp_meta[0].splicesrc = udp_mmh_splice_port(ref.udp, mmh_recv); udp_meta[0].tosidx = udp_flow_from_sock(c, ref.udp, &udp_meta[0]); for (i = 0; i < n; i += m) { flow_sidx_t tosidx = udp_meta[i].tosidx; @@ -843,7 +808,6 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve dstport); m = 1; - udp_meta[i].splicesrc = udp_mmh_splice_port(ref.udp, mmh_recv); udp_meta[i].tosidx = udp_flow_from_sock(c, ref.udp, &udp_meta[i]); } } -- 2.45.2
The flags field in udp_tap_port was used to implement a rudimentary (and buggy) form of connection tracking. This has now been taken over by the flow table. So, eliminate the field. This in turn allows udp_tap_port and udp_splice_port to be merged into a single type. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 61 ++++++++++++++++++++--------------------------------------- 1 file changed, 20 insertions(+), 41 deletions(-) diff --git a/udp.c b/udp.c index 3b30306a..489e2095 100644 --- a/udp.c +++ b/udp.c @@ -133,38 +133,21 @@ #define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */ /** - * struct udp_tap_port - Port tracking based on tap-facing source port - * @sock: Socket bound to source port used as index - * @flags: Flags for recent activity type seen from/to port - * @ts: Activity timestamp from tap, used for socket aging - */ -struct udp_tap_port { - int sock; - uint8_t flags; -#define PORT_LOCAL BIT(0) /* Port was contacted from local address */ -#define PORT_LOOPBACK BIT(1) /* Port was contacted from loopback address */ -#define PORT_GUA BIT(2) /* Port was contacted from global unicast */ -#define PORT_DNS_FWD BIT(3) /* Port used as source for DNS remapped query */ - - time_t ts; -}; - -/** - * struct udp_splice_port - Bound socket for spliced communication + * struct udp_bound_port - Bound socket for host or ns communication * @sock: Socket bound to index port * @ts: Activity timestamp */ -struct udp_splice_port { +struct udp_bound_port { int sock; time_t ts; }; /* Port tracking, arrays indexed by packet source port (host order) */ -static struct udp_tap_port udp_tap_map [IP_VERSIONS][NUM_PORTS]; +static struct udp_bound_port udp_tap_map [IP_VERSIONS][NUM_PORTS]; /* "Spliced" sockets indexed by bound port (host order) */ -static struct udp_splice_port udp_splice_ns [IP_VERSIONS][NUM_PORTS]; -static struct udp_splice_port udp_splice_init[IP_VERSIONS][NUM_PORTS]; +static struct udp_bound_port udp_splice_ns [IP_VERSIONS][NUM_PORTS]; +static struct udp_bound_port udp_splice_init[IP_VERSIONS][NUM_PORTS]; enum udp_act_type { UDP_ACT_TAP, @@ -387,7 +370,7 @@ int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) union epoll_ref ref = { .type = EPOLL_TYPE_UDP, .udp = { .splice = true, .v6 = v6, .port = src } }; - struct udp_splice_port *sp; + struct udp_bound_port *sp; int act, s; if (ns) { @@ -1109,41 +1092,37 @@ static void udp_splice_iov_init(void) static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, in_port_t port, const struct timespec *now) { - struct udp_splice_port *sp; - struct udp_tap_port *tp; - int *sockp = NULL; + struct udp_bound_port *bp; switch (type) { case UDP_ACT_TAP: - tp = &udp_tap_map[v6 ? V6 : V4][port]; + bp = &udp_tap_map[v6 ? V6 : V4][port]; - if (now->tv_sec - tp->ts > UDP_CONN_TIMEOUT) { - sockp = &tp->sock; - tp->flags = 0; - } + if (now->tv_sec - bp->ts <= UDP_CONN_TIMEOUT) + return; /* nothing to do */ break; case UDP_ACT_SPLICE_INIT: - sp = &udp_splice_init[v6 ? V6 : V4][port]; + bp = &udp_splice_init[v6 ? V6 : V4][port]; - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; + if (now->tv_sec - bp->ts <= UDP_CONN_TIMEOUT) + return; /* nothing to do */ break; case UDP_ACT_SPLICE_NS: - sp = &udp_splice_ns[v6 ? V6 : V4][port]; + bp = &udp_splice_ns[v6 ? V6 : V4][port]; - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; + if (now->tv_sec - bp->ts <= UDP_CONN_TIMEOUT) + return; /* nothing to do */ break; default: return; } - if (sockp && *sockp >= 0) { - int s = *sockp; - *sockp = -1; + if (bp->sock >= 0) { + int s = bp->sock; + bp->sock = -1; epoll_ctl(c->epollfd, EPOLL_CTL_DEL, s, NULL); close(s); bitmap_clear(udp_act[v6 ? V6 : V4][type], port); @@ -1163,7 +1142,7 @@ static void udp_port_rebind(struct ctx *c, bool outbound) = outbound ? c->udp.fwd_out.f.map : c->udp.fwd_in.f.map; const uint8_t *rmap = outbound ? c->udp.fwd_in.f.map : c->udp.fwd_out.f.map; - struct udp_splice_port (*socks)[NUM_PORTS] + struct udp_bound_port (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init; unsigned port; -- 2.45.2
In addition to the struct fwd_ports used by both UDP and TCP to track port forwarding, UDP also included an 'rdelta' field, which contained the reverse mapping of the main port map. This was used so that we could properly direct reply packets to a forwarded packet where we change the destination port. This has now been taken over by the flow table: reply packets will match the flow of the originating packet, and that gives the correct ports on the originating side. So, eliminate the rdelta field, and with it struct udp_fwd_ports, which now has no additional information over struct fwd_ports. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- conf.c | 14 +++++++------- fwd.c | 24 ++++++++++++------------ udp.c | 37 ++++++++----------------------------- udp.h | 14 ++------------ 4 files changed, 29 insertions(+), 60 deletions(-) diff --git a/conf.c b/conf.c index 94b3ed6f..d1891479 100644 --- a/conf.c +++ b/conf.c @@ -1248,7 +1248,7 @@ void conf(struct ctx *c, int argc, char **argv) } c->tcp.fwd_in.mode = c->tcp.fwd_out.mode = FWD_UNSET; - c->udp.fwd_in.f.mode = c->udp.fwd_out.f.mode = FWD_UNSET; + c->udp.fwd_in.mode = c->udp.fwd_out.mode = FWD_UNSET; do { name = getopt_long(argc, argv, optstring, options, NULL); @@ -1733,7 +1733,7 @@ void conf(struct ctx *c, int argc, char **argv) if (name == 't') conf_ports(c, name, optarg, &c->tcp.fwd_in); else if (name == 'u') - conf_ports(c, name, optarg, &c->udp.fwd_in.f); + conf_ports(c, name, optarg, &c->udp.fwd_in); } while (name != -1); if (c->mode == MODE_PASTA) @@ -1768,7 +1768,7 @@ void conf(struct ctx *c, int argc, char **argv) if (name == 'T') conf_ports(c, name, optarg, &c->tcp.fwd_out); else if (name == 'U') - conf_ports(c, name, optarg, &c->udp.fwd_out.f); + conf_ports(c, name, optarg, &c->udp.fwd_out); } while (name != -1); if (!c->ifi4) @@ -1795,10 +1795,10 @@ void conf(struct ctx *c, int argc, char **argv) c->tcp.fwd_in.mode = fwd_default; if (!c->tcp.fwd_out.mode) c->tcp.fwd_out.mode = fwd_default; - if (!c->udp.fwd_in.f.mode) - c->udp.fwd_in.f.mode = fwd_default; - if (!c->udp.fwd_out.f.mode) - c->udp.fwd_out.f.mode = fwd_default; + if (!c->udp.fwd_in.mode) + c->udp.fwd_in.mode = fwd_default; + if (!c->udp.fwd_out.mode) + c->udp.fwd_out.mode = fwd_default; fwd_scan_ports_init(c); diff --git a/fwd.c b/fwd.c index cd66eaee..69b0f535 100644 --- a/fwd.c +++ b/fwd.c @@ -129,18 +129,18 @@ void fwd_scan_ports_init(struct ctx *c) c->tcp.fwd_in.scan4 = c->tcp.fwd_in.scan6 = -1; c->tcp.fwd_out.scan4 = c->tcp.fwd_out.scan6 = -1; - c->udp.fwd_in.f.scan4 = c->udp.fwd_in.f.scan6 = -1; - c->udp.fwd_out.f.scan4 = c->udp.fwd_out.f.scan6 = -1; + c->udp.fwd_in.scan4 = c->udp.fwd_in.scan6 = -1; + c->udp.fwd_out.scan4 = c->udp.fwd_out.scan6 = -1; if (c->tcp.fwd_in.mode == FWD_AUTO) { c->tcp.fwd_in.scan4 = open_in_ns(c, "/proc/net/tcp", flags); c->tcp.fwd_in.scan6 = open_in_ns(c, "/proc/net/tcp6", flags); fwd_scan_ports_tcp(&c->tcp.fwd_in, &c->tcp.fwd_out); } - if (c->udp.fwd_in.f.mode == FWD_AUTO) { - c->udp.fwd_in.f.scan4 = open_in_ns(c, "/proc/net/udp", flags); - c->udp.fwd_in.f.scan6 = open_in_ns(c, "/proc/net/udp6", flags); - fwd_scan_ports_udp(&c->udp.fwd_in.f, &c->udp.fwd_out.f, + if (c->udp.fwd_in.mode == FWD_AUTO) { + c->udp.fwd_in.scan4 = open_in_ns(c, "/proc/net/udp", flags); + c->udp.fwd_in.scan6 = open_in_ns(c, "/proc/net/udp6", flags); + fwd_scan_ports_udp(&c->udp.fwd_in, &c->udp.fwd_out, &c->tcp.fwd_in, &c->tcp.fwd_out); } if (c->tcp.fwd_out.mode == FWD_AUTO) { @@ -148,10 +148,10 @@ void fwd_scan_ports_init(struct ctx *c) c->tcp.fwd_out.scan6 = open("/proc/net/tcp6", flags); fwd_scan_ports_tcp(&c->tcp.fwd_out, &c->tcp.fwd_in); } - if (c->udp.fwd_out.f.mode == FWD_AUTO) { - c->udp.fwd_out.f.scan4 = open("/proc/net/udp", flags); - c->udp.fwd_out.f.scan6 = open("/proc/net/udp6", flags); - fwd_scan_ports_udp(&c->udp.fwd_out.f, &c->udp.fwd_in.f, + if (c->udp.fwd_out.mode == FWD_AUTO) { + c->udp.fwd_out.scan4 = open("/proc/net/udp", flags); + c->udp.fwd_out.scan6 = open("/proc/net/udp6", flags); + fwd_scan_ports_udp(&c->udp.fwd_out, &c->udp.fwd_in, &c->tcp.fwd_out, &c->tcp.fwd_in); } } @@ -235,7 +235,7 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, if (proto == IPPROTO_TCP) tgt->eport += c->tcp.fwd_out.delta[tgt->eport]; else if (proto == IPPROTO_UDP) - tgt->eport += c->udp.fwd_out.f.delta[tgt->eport]; + tgt->eport += c->udp.fwd_out.delta[tgt->eport]; /* Let the kernel pick a host side source port */ tgt->fport = 0; @@ -264,7 +264,7 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, if (proto == IPPROTO_TCP) tgt->eport += c->tcp.fwd_in.delta[tgt->eport]; else if (proto == IPPROTO_UDP) - tgt->eport += c->udp.fwd_in.f.delta[tgt->eport]; + tgt->eport += c->udp.fwd_in.delta[tgt->eport]; if (c->mode == MODE_PASTA && inany_is_loopback(&ini->eaddr) && (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { diff --git a/udp.c b/udp.c index 489e2095..c170b0be 100644 --- a/udp.c +++ b/udp.c @@ -261,24 +261,6 @@ void udp_portmap_clear(void) } } -/** - * udp_invert_portmap() - Compute reverse port translations for return packets - * @fwd: Port forwarding configuration to compute reverse map for - */ -static void udp_invert_portmap(struct udp_fwd_ports *fwd) -{ - unsigned int i; - - static_assert(ARRAY_SIZE(fwd->f.delta) == ARRAY_SIZE(fwd->rdelta), - "Forward and reverse delta arrays must have same size"); - for (i = 0; i < ARRAY_SIZE(fwd->f.delta); i++) { - in_port_t delta = fwd->f.delta[i]; - - if (delta) - fwd->rdelta[i + delta] = NUM_PORTS - delta; - } -} - /** * udp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses * @eth_d: Ethernet destination address, NULL if unchanged @@ -751,9 +733,9 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve return; if (ref.udp.pif == PIF_SPLICE) - dstport += c->udp.fwd_out.f.delta[dstport]; + dstport += c->udp.fwd_out.delta[dstport]; else if (ref.udp.pif == PIF_HOST) - dstport += c->udp.fwd_in.f.delta[dstport]; + dstport += c->udp.fwd_in.delta[dstport]; else ASSERT(0); @@ -1139,9 +1121,9 @@ static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, static void udp_port_rebind(struct ctx *c, bool outbound) { const uint8_t *fmap - = outbound ? c->udp.fwd_out.f.map : c->udp.fwd_in.f.map; + = outbound ? c->udp.fwd_out.map : c->udp.fwd_in.map; const uint8_t *rmap - = outbound ? c->udp.fwd_in.f.map : c->udp.fwd_out.f.map; + = outbound ? c->udp.fwd_in.map : c->udp.fwd_out.map; struct udp_bound_port (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init; unsigned port; @@ -1212,14 +1194,14 @@ void udp_timer(struct ctx *c, const struct timespec *now) long *word, tmp; if (c->mode == MODE_PASTA) { - if (c->udp.fwd_out.f.mode == FWD_AUTO) { - fwd_scan_ports_udp(&c->udp.fwd_out.f, &c->udp.fwd_in.f, + if (c->udp.fwd_out.mode == FWD_AUTO) { + fwd_scan_ports_udp(&c->udp.fwd_out, &c->udp.fwd_in, &c->tcp.fwd_out, &c->tcp.fwd_in); NS_CALL(udp_port_rebind_outbound, c); } - if (c->udp.fwd_in.f.mode == FWD_AUTO) { - fwd_scan_ports_udp(&c->udp.fwd_in.f, &c->udp.fwd_out.f, + if (c->udp.fwd_in.mode == FWD_AUTO) { + fwd_scan_ports_udp(&c->udp.fwd_in, &c->udp.fwd_out, &c->tcp.fwd_in, &c->tcp.fwd_out); udp_port_rebind(c, false); } @@ -1256,9 +1238,6 @@ int udp_init(struct ctx *c) { udp_iov_init(c); - udp_invert_portmap(&c->udp.fwd_in); - udp_invert_portmap(&c->udp.fwd_out); - if (c->mode == MODE_PASTA) { udp_splice_iov_init(); NS_CALL(udp_port_rebind_outbound, c); diff --git a/udp.h b/udp.h index d25e66cb..4ae65723 100644 --- a/udp.h +++ b/udp.h @@ -42,16 +42,6 @@ union udp_epoll_ref { }; -/** - * udp_fwd_ports - UDP specific port forwarding configuration - * @f: Generic forwarding configuration - * @rdelta: Reversed delta map to translate source ports on return packets - */ -struct udp_fwd_ports { - struct fwd_ports f; - in_port_t rdelta[NUM_PORTS]; -}; - /** * struct udp_ctx - Execution context for UDP * @fwd_in: Port forwarding configuration for inbound packets @@ -59,8 +49,8 @@ struct udp_fwd_ports { * @timer_run: Timestamp of most recent timer run */ struct udp_ctx { - struct udp_fwd_ports fwd_in; - struct udp_fwd_ports fwd_out; + struct fwd_ports fwd_in; + struct fwd_ports fwd_out; struct timespec timer_run; }; -- 2.45.2
We no longer check the 'splice' flag in the UDP epoll reference. Instead, we determine whether a datagram needs to be spliced from the pifs in the flow table. So, eliminate this field. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 5 ++--- udp.h | 4 +--- 2 files changed, 3 insertions(+), 6 deletions(-) diff --git a/udp.c b/udp.c index c170b0be..1789a572 100644 --- a/udp.c +++ b/udp.c @@ -350,7 +350,7 @@ int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) { struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP | EPOLLHUP }; union epoll_ref ref = { .type = EPOLL_TYPE_UDP, - .udp = { .splice = true, .v6 = v6, .port = src } + .udp = { .v6 = v6, .port = src } }; struct udp_bound_port *sp; int act, s; @@ -989,8 +989,7 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif, int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, const void *addr, const char *ifname, in_port_t port) { - union udp_epoll_ref uref = { .splice = (c->mode == MODE_PASTA), - .orig = true, .port = port }; + union udp_epoll_ref uref = { .orig = true, .port = port }; int s, r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1; if (ns) diff --git a/udp.h b/udp.h index 4ae65723..4b62b996 100644 --- a/udp.h +++ b/udp.h @@ -25,7 +25,6 @@ void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s); * @port: Source port for connected sockets, bound port otherwise * @pif: pif for this socket * @bound: Set if this file descriptor is a bound socket - * @splice: Set if descriptor packets to be "spliced" * @orig: Set if a spliced socket which can originate "connections" * @v6: Set for IPv6 sockets or connections * @u32: Opaque u32 value of reference @@ -34,8 +33,7 @@ union udp_epoll_ref { struct { in_port_t port; uint8_t pif; - bool splice:1, - orig:1, + bool orig:1, v6:1; }; uint32_t u32; -- 2.45.2