[PATCH v4 00/16] RFC: Unified flow table

David Gibson

3 May 2024

3:11 a.m.

This is a fourth draft of the first steps in implementing more general
"connection" tracking, as described at:
    https://pad.passt.top/p/NewForwardingModel

This series changes the TCP connection table and hash table into a
more general flow table that can track other protocols as well.  Each
flow uniformly keeps track of all the relevant addresses and ports,
which will allow for more robust control of NAT and port forwarding.

ICMP is converted to use the new flow table.

This doesn't include UDP, but I'm working on it right now and making
progress.  I'm posting this to give a head start on the review :)

Caveats:
 * We significantly increase the size of a connection/flow entry
   - Can probably be mitigated, but I haven't investigated much yet
 * We perform a number of extra getsockname() calls to know some of
   the socket endpoints
   - Haven't yet measured how much performance impact that has
   - Can be mitigated in at least some cases, but again, haven't
     tried yet

Changes since v4:
 * Complex rebase on top of the many things that have happened
   upstream since v3.
 * Assorted other changes.

Changes since v3:
 * Replace TAPFSIDE() and SOCKFSIDE() macros with local variables.

Changes since v2:
 * Cosmetic fixes based on review
 * Extra doc comments for enum flow_type
 * Rename flowside to flowaddrs which turns out to make more sense in
   light of future changes
 * Fix bug where the socket flowaddrs for tap initiated connections
   wasn't initialised to match the socket address we were using in the
   case of map-gw NAT
 * New flowaddrs_from_sock() helper used in most cases which is
   cleaner and should avoid bugs like the above
 * Using newer centralised workarounds for clang-tidy issue 58992
 * Remove duplicate definition of FLOW_MAX as maximum flow type and
   maximum number of tracked flows
 * Rebased on newer versions of preliminary work (ICMP, flow based
   dispatch and allocation, bind/address cleanups)
 * Unified hash table as well as base flow table
 * Integrated ICMP

Changes since v1:
 * Terminology changes
   - "Endpoint" address/port instead of "correspondent" address/port
   - "flowside" instead of "demiflow"
 * Actually move the connection table to a new flow table structure in
   new files
 * Significant rearrangement of earlier patches on top of that new
   table, to reduce churn

David Gibson (16):
  flow: Common data structures for tracking flow addresses
  tcp: Maintain flowside information for "tap" connections
  tcp_splice: Maintain flowside information for spliced connections
  tcp: Obtain guest address from flowside
  tcp: Simplify endpoint validation using flowside information
  tcp, tcp_splice: Construct sockaddrs for connect() from flowside
  tcp_splice: Eliminate SPLICE_V6 flag
  tcp, flow: Replace TCP specific hash function with general flow hash
  flow, tcp: Generalise TCP hash table to general flow hash table
  tcp: Re-use flow hash for initial sequence number generation
  icmp: Populate flowside information
  icmp: Use flowsides as the source of truth wherever possible
  icmp: Look up ping flows using flow hash
  icmp: Eliminate icmp_id_map
  flow, tcp: flow based NAT and port forwarding for TCP
  flow, icmp: Use general flow forwarding rules for ICMP

 flow.c       | 199 ++++++++++++++++++++-
 flow.h       |  97 +++++++++++
 fwd.c        | 139 +++++++++++++++
 fwd.h        |   5 +
 icmp.c       |  83 +++++----
 icmp_flow.h  |   1 -
 inany.h      |  29 +++-
 passt.h      |   3 +
 pif.h        |   1 -
 tap.c        |  11 --
 tap.h        |   1 -
 tcp.c        | 475 +++++++++++++--------------------------------------
 tcp_conn.h   |  20 +--
 tcp_splice.c |  93 ++--------
 tcp_splice.h |   5 +-
 15 files changed, 649 insertions(+), 513 deletions(-)

-- 
2.44.0
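For context on the getsockname() caveat in the cover letter: the call asks the kernel which local address and port a socket actually ended up with, which costs one system call per lookup. The following is a minimal sketch of that lookup, not code from the series; the function name `demo_bound_port()` is hypothetical, and it assumes a POSIX system where binding to the loopback address is permitted.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: bind a TCP socket to 127.0.0.1 with port 0, so the
 * kernel picks the port, then use getsockname() to learn what was chosen.
 * Returns the bound port in host byte order, or -1 on any error.
 */
int demo_bound_port(void)
{
	struct sockaddr_in req, got;
	socklen_t len = sizeof(got);
	int s, port = -1;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s < 0)
		return -1;

	memset(&req, 0, sizeof(req));
	req.sin_family = AF_INET;
	req.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	req.sin_port = htons(0);	/* 0: let the kernel choose */

	if (bind(s, (struct sockaddr *)&req, sizeof(req)) == 0 &&
	    getsockname(s, (struct sockaddr *)&got, &len) == 0) {
		assert(got.sin_family == AF_INET);
		port = ntohs(got.sin_port);
	}

	close(s);
	return port;
}
```

Each such lookup is an extra round trip into the kernel, which is presumably why the cover letter lists it as a cost still to be measured.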


David Gibson

3 May

3:11 a.m.

New subject: [PATCH v4 01/16] flow: Common data structures for tracking flow addresses

Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets).

To allow more general and robust handling, and more consistency across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host.

Introduce 'struct flowside' to keep track of common information related to one side of each flow. For now that's the addresses, ports and the pif id. Store two of these in the common fields of a flow to track that information for both sides.

For now we just introduce the structure itself, helpers to populate it, and logging of the contents when starting and ending flows. Later patches will actually put something useful there.

Signed-off-by: David Gibson
---
 flow.c     | 28 ++++++++++++++++++--
 flow.h     | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 passt.h    |  3 +++
 pif.h      |  1 -
 tcp_conn.h |  1 -
 5 files changed, 104 insertions(+), 4 deletions(-)

diff --git a/flow.c b/flow.c
index 80dd269..02d6008 100644
--- a/flow.c
+++ b/flow.c
@@ -51,10 +51,11 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
  *
  *    ALLOC - A tentatively allocated entry
  *        Operations:
+ *            - Common flow fields other than type may be accessed
  *            - flow_alloc_cancel() returns the entry to FREE state
  *            - FLOW_START() set the entry's type and moves to START state
  *        Caveats:
- *            - It's not safe to write fields in the flow entry
+ *            - It's not safe to write flow type specific fields in the entry
  *            - It's not safe to allocate further entries with flow_alloc()
  *            - It's not safe to return to the main epoll loop (use FLOW_START()
  *              to move to START state before doing so)
@@ -62,6 +63,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
  *
  *    START - An entry being prepared by flow type specific code
  *        Operations:
+ *            - Common flow fields other than type may be accessed
  *            - Flow type specific fields may be accessed
  *            - flow_*() logging functions
  *            - flow_alloc_cancel() returns the entry to FREE state
@@ -168,9 +170,21 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 union flow *flow_start(union flow *flow, enum flow_type type,
 		       unsigned iniside)
 {
-	(void)iniside;
+	char ebuf[INANY_ADDRSTRLEN], fbuf[INANY_ADDRSTRLEN];
+	const struct flowside *a = &flow->f.side[iniside];
+	const struct flowside *b = &flow->f.side[!iniside];
+
 	flow->f.type = type;
 	flow_dbg(flow, "START %s", flow_type_str[flow->f.type]);
+	flow_dbg(flow, " from side %u (%s): [%s]:%hu -> [%s]:%hu",
+		 iniside, pif_name(a->pif),
+		 inany_ntop(&a->eaddr, ebuf, sizeof(ebuf)), a->eport,
+		 inany_ntop(&a->faddr, fbuf, sizeof(fbuf)), a->fport);
+	flow_dbg(flow, " to side %u (%s): [%s]:%hu -> [%s]:%hu",
+		 !iniside, pif_name(b->pif),
+		 inany_ntop(&b->faddr, fbuf, sizeof(fbuf)), b->fport,
+		 inany_ntop(&b->eaddr, ebuf, sizeof(ebuf)), b->eport);
+
 	return flow;
 }
 
@@ -180,10 +194,20 @@ union flow *flow_start(union flow *flow, enum flow_type type,
  */
 static void flow_end(union flow *flow)
 {
+	char ebuf[INANY_ADDRSTRLEN], fbuf[INANY_ADDRSTRLEN];
+	const struct flowside *a = &flow->f.side[0];
+	const struct flowside *b = &flow->f.side[1];
+
 	if (flow->f.type == FLOW_TYPE_NONE)
 		return; /* Nothing to do */
 
 	flow_dbg(flow, "END %s", flow_type_str[flow->f.type]);
+	flow_dbg(flow, " side 0 (%s): [%s]:%hu <-> [%s]:%hu", pif_name(a->pif),
+		 inany_ntop(&a->faddr, fbuf, sizeof(fbuf)), a->fport,
+		 inany_ntop(&a->eaddr, ebuf, sizeof(ebuf)), a->eport);
+	flow_dbg(flow, " side 1 (%s): [%s]:%hu <-> [%s]:%hu", pif_name(b->pif),
+		 inany_ntop(&b->faddr, fbuf, sizeof(fbuf)), b->fport,
+		 inany_ntop(&b->eaddr, ebuf, sizeof(ebuf)), b->eport);
 	flow->f.type = FLOW_TYPE_NONE;
 }
 
diff --git a/flow.h b/flow.h
index c943c44..f7fb537 100644
--- a/flow.h
+++ b/flow.h
@@ -35,11 +35,86 @@ extern const uint8_t flow_proto[];
 #define FLOW_PROTO(f)				\
 	((f)->type < FLOW_NUM_TYPES ? flow_proto[(f)->type] : 0)
 
+/**
+ * struct flowside - Common information for one side of a flow
+ * @eaddr:	Endpoint address (remote address from passt's PoV)
+ * @faddr:	Forwarding address (local address from passt's PoV)
+ * @eport:	Endpoint port
+ * @fport:	Forwarding port
+ * @pif:	pif ID on which this side of the flow exists
+ */
+struct flowside {
+	union inany_addr	faddr;
+	union inany_addr	eaddr;
+	in_port_t		fport;
+	in_port_t		eport;
+	uint8_t			pif;
+};
+static_assert(_Alignof(struct flowside) == _Alignof(uint32_t),
+	      "Unexpected alignment for struct flowside");
+
+/** flowside_from_inany - Initialize flowside from inany addresses
+ * @fside:	flowside to initialize
+ * @pif:	pif id of this flowside
+ * @faddr:	Forwarding address (inany)
+ * @fport:	Forwarding port
+ * @eaddr:	Endpoint address (inany)
+ * @eport:	Endpoint port
+ */
+/* cppcheck-suppress unusedFunction */
+static inline void flowside_from_inany(struct flowside *fside, uint8_t pif,
+			const union inany_addr *faddr, in_port_t fport,
+			const union inany_addr *eaddr, in_port_t eport)
+{
+	fside->pif = pif;
+	fside->faddr = *faddr;
+	fside->eaddr = *eaddr;
+	fside->fport = fport;
+	fside->eport = eport;
+}
+
+/** flowside_from_af - Initialize flowside from addresses
+ * @fside:	flowside to initialize
+ * @pif:	pif id of this flowside
+ * @af:		Address family (AF_INET or AF_INET6)
+ * @faddr:	Forwarding address (pointer to in_addr or in6_addr, or NULL)
+ * @fport:	Forwarding port
+ * @eaddr:	Endpoint address (pointer to in_addr or in6_addr, or NULL)
+ * @eport:	Endpoint port
+ *
+ * If NULL is given for either address, the appropriate unspecified/any address
+ * for the address family is substituted.
+ */
+/* cppcheck-suppress unusedFunction */
+static inline void flowside_from_af(struct flowside *fside,
+				    uint8_t pif, sa_family_t af,
+				    const void *faddr, in_port_t fport,
+				    const void *eaddr, in_port_t eport)
+{
+	const union inany_addr *any = af == AF_INET ? &inany_any4 : &inany_any6;
+
+	fside->pif = pif;
+	if (faddr)
+		inany_from_af(&fside->faddr, af, faddr);
+	else
+		fside->faddr = *any;
+	if (eaddr)
+		inany_from_af(&fside->eaddr, af, eaddr);
+	else
+		fside->eaddr = *any;
+	fside->fport = fport;
+	fside->eport = eport;
+}
+
+#define SIDES			2
+
 /**
  * struct flow_common - Common fields for packet flows
+ * @side[]:	Information for each side of the flow
  * @type:	Type of packet flow
  */
 struct flow_common {
+	struct flowside	side[SIDES];
 	uint8_t		type;
 };
 
diff --git a/passt.h b/passt.h
index bc58d64..3db0b8e 100644
--- a/passt.h
+++ b/passt.h
@@ -17,6 +17,9 @@ union epoll_ref;
 
 #include "pif.h"
 #include "packet.h"
+#include "siphash.h"
+#include "ip.h"
+#include "inany.h"
 #include "flow.h"
 #include "icmp.h"
 #include "fwd.h"
diff --git a/pif.h b/pif.h
index bd52936..ca85b34 100644
--- a/pif.h
+++ b/pif.h
@@ -38,7 +38,6 @@ static inline const char *pif_type(enum pif_type pt)
 	return "?";
 }
 
-/* cppcheck-suppress unusedFunction */
 static inline const char *pif_name(uint8_t pif)
 {
 	return pif_type(pif);
diff --git a/tcp_conn.h b/tcp_conn.h
index d280b22..1a07dd5 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -106,7 +106,6 @@ struct tcp_tap_conn {
 	uint32_t	seq_init_from_tap;
 };
 
-#define SIDES			2
 /**
  * struct tcp_splice_conn - Descriptor for a spliced TCP connection
  * @f:			Generic flow information
-- 
2.44.0
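To make the data layout in this patch easier to picture, here is a standalone sketch of the same idea: two per-side records stored in the common part of a flow, filled in by a helper that accepts either address family. It is not the patch's code. All `demo_*` names are hypothetical, and a plain `struct in6_addr` holding IPv4 addresses in IPv4-mapped form stands in for passt's `union inany_addr`, which is an assumption made only to keep the example self-contained. The example addresses and ports are made up.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#define SIDES 2

enum demo_pif { DEMO_PIF_NONE, DEMO_PIF_HOST, DEMO_PIF_TAP };

/* Simplified stand-in for struct flowside */
struct demo_flowside {
	struct in6_addr faddr;	/* forwarding address (local to the forwarder) */
	struct in6_addr eaddr;	/* endpoint address (remote from the forwarder) */
	in_port_t fport;	/* forwarding port, host byte order */
	in_port_t eport;	/* endpoint port, host byte order */
	uint8_t pif;
};

struct demo_flow_common {
	struct demo_flowside side[SIDES];
	uint8_t type;
};

/* Store an IPv4 or IPv6 address; NULL means "unspecified" for that family */
void demo_addr_from_af(struct in6_addr *dst, sa_family_t af, const void *addr)
{
	assert(af == AF_INET || af == AF_INET6);

	memset(dst, 0, sizeof(*dst));
	if (af == AF_INET) {
		/* IPv4-mapped form, ::ffff:a.b.c.d */
		dst->s6_addr[10] = dst->s6_addr[11] = 0xff;
		if (addr)
			memcpy(&dst->s6_addr[12], addr, sizeof(struct in_addr));
	} else if (addr) {
		memcpy(dst, addr, sizeof(*dst));
	}
}

void demo_flowside_from_af(struct demo_flowside *fside, uint8_t pif,
			   sa_family_t af,
			   const void *faddr, in_port_t fport,
			   const void *eaddr, in_port_t eport)
{
	fside->pif = pif;
	demo_addr_from_af(&fside->faddr, af, faddr);
	demo_addr_from_af(&fside->eaddr, af, eaddr);
	fside->fport = fport;
	fside->eport = eport;
}

/* Record a guest at 192.0.2.1:40000 connecting to 198.51.100.7:80.
 * Returns the guest-side endpoint port, or -1 on error.
 */
int demo_flow_example(struct demo_flow_common *f)
{
	struct in_addr guest, dst;

	if (inet_pton(AF_INET, "192.0.2.1", &guest) != 1 ||
	    inet_pton(AF_INET, "198.51.100.7", &dst) != 1)
		return -1;

	memset(f, 0, sizeof(*f));
	/* Side 0: what the guest sees (it talks to dst:80 from guest:40000) */
	demo_flowside_from_af(&f->side[0], DEMO_PIF_TAP, AF_INET,
			      &dst, 80, &guest, 40000);
	/* Side 1: host side, local address and port not known yet */
	demo_flowside_from_af(&f->side[1], DEMO_PIF_HOST, AF_INET,
			      NULL, 0, &dst, 80);
	return f->side[0].eport;
}
```

Keeping both sides in one fixed-size array is what lets code index them with a side number, as the logging in `flow_start()` does with `side[iniside]` and `side[!iniside]`.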

Stefano Brivio

13 May

8:07 p.m.

New subject: [PATCH v4 01/16] flow: Common data structures for tracking flow addresses

Minor comments/nits only:

On Fri, 3 May 2024 11:11:20 +1000 David Gibson wrote:

...

Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets).

To allow more general and robust handling, and more consistency across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host.

Introduce 'struct flowside' to keep track of common information related to one side of each flow. For now that's the addresses, ports and the pif id. Store two of these in the common fields of a flow to track that information for both sides. For now we just introduce the structure itself, helpers to populate it, and logging of the contents when starting and ending flows. Later patches will actually put something useful there.

Signed-off-by: David Gibson --- flow.c | 28 ++++++++++++++++++-- flow.h | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ passt.h | 3 +++ pif.h | 1 - tcp_conn.h | 1 - 5 files changed, 104 insertions(+), 4 deletions(-)

diff --git a/flow.c b/flow.c index 80dd269..02d6008 100644 --- a/flow.c +++ b/flow.c @@ -51,10 +51,11 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES, * * ALLOC - A tentatively allocated entry * Operations: + * - Common flow fields other than type may be accessed * - flow_alloc_cancel() returns the entry to FREE state * - FLOW_START() set the entry's type and moves to START state * Caveats: - * - It's not safe to write fields in the flow entry + * - It's not safe to write flow type specific fields in the entry * - It's not safe to allocate further entries with flow_alloc() * - It's not safe to return to the main epoll loop (use FLOW_START() * to move to START state before doing so) @@ -62,6 +63,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES, * * START - An entry being prepared by flow type specific code * Operations: + * - Common flow fields other than type may be accessed * - Flow type specific fields may be accessed * - flow_*() logging functions * - flow_alloc_cancel() returns the entry to FREE state @@ -168,9 +170,21 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) union flow *flow_start(union flow *flow, enum flow_type type, unsigned iniside) { - (void)iniside; + char ebuf[INANY_ADDRSTRLEN], fbuf[INANY_ADDRSTRLEN]; + const struct flowside *a = &flow->f.side[iniside];

As long as iniside is used as a binary value (I guess it's unsigned because you have in mind that it could eventually be extended, right?), I think '!!iniside' would be clearer and perhaps more robust here.

...

+ const struct flowside *b = &flow->f.side[!iniside]; + flow->f.type = type; flow_dbg(flow, "START %s", flow_type_str[flow->f.type]); + flow_dbg(flow, " from side %u (%s): [%s]:%hu -> [%s]:%hu", + iniside, pif_name(a->pif), + inany_ntop(&a->eaddr, ebuf, sizeof(ebuf)), a->eport, + inany_ntop(&a->faddr, fbuf, sizeof(fbuf)), a->fport); + flow_dbg(flow, " to side %u (%s): [%s]:%hu -> [%s]:%hu", + !iniside, pif_name(b->pif), + inany_ntop(&b->faddr, fbuf, sizeof(fbuf)), b->fport, + inany_ntop(&b->eaddr, ebuf, sizeof(ebuf)), b->eport); + return flow; }

@@ -180,10 +194,20 @@ union flow *flow_start(union flow *flow, enum flow_type type, */ static void flow_end(union flow *flow) { + char ebuf[INANY_ADDRSTRLEN], fbuf[INANY_ADDRSTRLEN]; + const struct flowside *a = &flow->f.side[0]; + const struct flowside *b = &flow->f.side[1]; + if (flow->f.type == FLOW_TYPE_NONE) return; /* Nothing to do */

flow_dbg(flow, "END %s", flow_type_str[flow->f.type]); + flow_dbg(flow, " side 0 (%s): [%s]:%hu <-> [%s]:%hu", pif_name(a->pif), + inany_ntop(&a->faddr, fbuf, sizeof(fbuf)), a->fport, + inany_ntop(&a->eaddr, ebuf, sizeof(ebuf)), a->eport); + flow_dbg(flow, " side 1 (%s): [%s]:%hu <-> [%s]:%hu", pif_name(b->pif), + inany_ntop(&b->faddr, fbuf, sizeof(fbuf)), b->fport, + inany_ntop(&b->eaddr, ebuf, sizeof(ebuf)), b->eport); flow->f.type = FLOW_TYPE_NONE; }

diff --git a/flow.h b/flow.h index c943c44..f7fb537 100644 --- a/flow.h +++ b/flow.h @@ -35,11 +35,86 @@ extern const uint8_t flow_proto[]; #define FLOW_PROTO(f) \ ((f)->type < FLOW_NUM_TYPES ? flow_proto[(f)->type] : 0)

+/** + * struct flowside - Common information for one side of a flow + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + * @pif: pif ID on which this side of the flow exists + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport; + in_port_t eport; + uint8_t pif; +}; +static_assert(_Alignof(struct flowside) == _Alignof(uint32_t), + "Unexpected alignment for struct flowside");

I'm too thick to understand the reason behind this assert.

...

+ +/** flowside_from_inany - Initialize flowside from inany addresses

flowside_from_inany(), it's a function.

...

+ * @fside: flowside to initialize + * @pif: pif id of this flowside + * @faddr: Forwarding address (inany) + * @fport: Forwarding port + * @eaddr: Endpoint address (inany) + * @eport: Endpoint port + */ +/* cppcheck-suppress unusedFunction */ +static inline void flowside_from_inany(struct flowside *fside, uint8_t pif, + const union inany_addr *faddr, in_port_t fport, + const union inany_addr *eaddr, in_port_t eport) +{ + fside->pif = pif; + fside->faddr = *faddr; + fside->eaddr = *eaddr; + fside->fport = fport; + fside->eport = eport; +} + +/** flowside_from_af - Initialize flowside from addresses

flowside_from_af()

...

+ * @fside: flowside to initialize + * @pif: pif id of this flowside + * @af: Address family (AF_INET or AF_INET6) + * @faddr: Forwarding address (pointer to in_addr or in6_addr, or NULL) + * @fport: Forwarding port + * @eaddr: Endpoint address (pointer to in_addr or in6_addr, or NULL) + * @eport: Endpoint port + * + * If NULL is given for either address, the appropriate unspecified/any address

s/any/wildcard/ makes it a bit easier to follow, I guess.

...

+ * for the address family is substituted. + */ +/* cppcheck-suppress unusedFunction */ +static inline void flowside_from_af(struct flowside *fside, + uint8_t pif, sa_family_t af, + const void *faddr, in_port_t fport, + const void *eaddr, in_port_t eport) +{ + const union inany_addr *any = af == AF_INET ? &inany_any4 : &inany_any6; + + fside->pif = pif; + if (faddr) + inany_from_af(&fside->faddr, af, faddr); + else + fside->faddr = *any; + if (eaddr) + inany_from_af(&fside->eaddr, af, eaddr); + else + fside->eaddr = *any; + fside->fport = fport; + fside->eport = eport; +} + +#define SIDES 2 + /** * struct flow_common - Common fields for packet flows + * @side[]: Information for each side of the flow * @type: Type of packet flow */ struct flow_common { + struct flowside side[SIDES]; uint8_t type; };

diff --git a/passt.h b/passt.h index bc58d64..3db0b8e 100644 --- a/passt.h +++ b/passt.h @@ -17,6 +17,9 @@ union epoll_ref;

#include "pif.h" #include "packet.h" +#include "siphash.h" +#include "ip.h" +#include "inany.h" #include "flow.h" #include "icmp.h" #include "fwd.h" diff --git a/pif.h b/pif.h index bd52936..ca85b34 100644 --- a/pif.h +++ b/pif.h @@ -38,7 +38,6 @@ static inline const char *pif_type(enum pif_type pt) return "?"; }

-/* cppcheck-suppress unusedFunction */ static inline const char *pif_name(uint8_t pif) { return pif_type(pif); diff --git a/tcp_conn.h b/tcp_conn.h index d280b22..1a07dd5 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -106,7 +106,6 @@ struct tcp_tap_conn { uint32_t seq_init_from_tap; };

-#define SIDES 2 /** * struct tcp_splice_conn - Descriptor for a spliced TCP connection * @f: Generic flow information

-- Stefano

David Gibson

14 May

2:11 a.m.

New subject: [PATCH v4 01/16] flow: Common data structures for tracking flow addresses

On Mon, May 13, 2024 at 08:07:00PM +0200, Stefano Brivio wrote:

...

Minor comments/nits only:

On Fri, 3 May 2024 11:11:20 +1000 David Gibson wrote:

...
Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets).

To allow more general and robust handling, and more consistency across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host.

Introduce 'struct flowside' to keep track of common information related to one side of each flow. For now that's the addresses, ports and the pif id. Store two of these in the common fields of a flow to track that information for both sides. For now we just introduce the structure itself, helpers to populate it, and logging of the contents when starting and ending flows. Later patches will actually put something useful there.

Signed-off-by: David Gibson --- flow.c | 28 ++++++++++++++++++-- flow.h | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ passt.h | 3 +++ pif.h | 1 - tcp_conn.h | 1 - 5 files changed, 104 insertions(+), 4 deletions(-)

diff --git a/flow.c b/flow.c index 80dd269..02d6008 100644 --- a/flow.c +++ b/flow.c @@ -51,10 +51,11 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES, * * ALLOC - A tentatively allocated entry * Operations: + * - Common flow fields other than type may be accessed * - flow_alloc_cancel() returns the entry to FREE state * - FLOW_START() set the entry's type and moves to START state * Caveats: - * - It's not safe to write fields in the flow entry + * - It's not safe to write flow type specific fields in the entry * - It's not safe to allocate further entries with flow_alloc() * - It's not safe to return to the main epoll loop (use FLOW_START() * to move to START state before doing so) @@ -62,6 +63,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES, * * START - An entry being prepared by flow type specific code * Operations: + * - Common flow fields other than type may be accessed * - Flow type specific fields may be accessed * - flow_*() logging functions * - flow_alloc_cancel() returns the entry to FREE state @@ -168,9 +170,21 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) union flow *flow_start(union flow *flow, enum flow_type type, unsigned iniside) { - (void)iniside; + char ebuf[INANY_ADDRSTRLEN], fbuf[INANY_ADDRSTRLEN]; + const struct flowside *a = &flow->f.side[iniside];

As long as iniside is used as a binary value (I guess it's unsigned because you have in mind that it could eventually be extended, right?),

Not really. My intention is that it's fundamentally a two value variable. However, it's used as an array index and doesn't represent true/false values, so bool didn't seem right. Signs added extra complications in some cases, hence unsigned. I'd use uint1_t if that were a thing...

...

I think '!!iniside' would be clearer and perhaps more robust here.

Hm. I don't really like that. If iniside ever has a value other than 0 or 1, that's a bug. Fwiw, this particular instance is gone in the latest version and there are more places where we use just constants, but it's not all of them. I guess see what you think on the new version.
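The two positions in this exchange can be put side by side in a small sketch. The helper names are hypothetical and not part of the series: the first treats a side index outside 0..1 as a caller bug and checks for it, the second silently folds any non-zero value to 1, as `!!iniside` would.

```c
#include <assert.h>

#define SIDES 2

/* Index of the side opposite to @side, assuming @side is strictly 0 or 1.
 * Anything else is treated as a caller bug and caught, not normalised.
 */
unsigned demo_other_side(unsigned side)
{
	assert(side < SIDES);
	return !side;
}

/* Alternative suggested in review: fold any non-zero value to 1 first */
unsigned demo_normalise_side(unsigned side)
{
	return !!side;
}
```

With the normalising variant, a stray value such as 7 quietly becomes side 1; with the checking variant, the same value trips the assertion in a debug build, which matches the view that such a value is a bug rather than valid input.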

...

...
+ const struct flowside *b = &flow->f.side[!iniside]; + flow->f.type = type; flow_dbg(flow, "START %s", flow_type_str[flow->f.type]); + flow_dbg(flow, " from side %u (%s): [%s]:%hu -> [%s]:%hu", + iniside, pif_name(a->pif), + inany_ntop(&a->eaddr, ebuf, sizeof(ebuf)), a->eport, + inany_ntop(&a->faddr, fbuf, sizeof(fbuf)), a->fport); + flow_dbg(flow, " to side %u (%s): [%s]:%hu -> [%s]:%hu", + !iniside, pif_name(b->pif), + inany_ntop(&b->faddr, fbuf, sizeof(fbuf)), b->fport, + inany_ntop(&b->eaddr, ebuf, sizeof(ebuf)), b->eport); + return flow; }

@@ -180,10 +194,20 @@ union flow *flow_start(union flow *flow, enum flow_type type, */ static void flow_end(union flow *flow) { + char ebuf[INANY_ADDRSTRLEN], fbuf[INANY_ADDRSTRLEN]; + const struct flowside *a = &flow->f.side[0]; + const struct flowside *b = &flow->f.side[1]; + if (flow->f.type == FLOW_TYPE_NONE) return; /* Nothing to do */

flow_dbg(flow, "END %s", flow_type_str[flow->f.type]); + flow_dbg(flow, " side 0 (%s): [%s]:%hu <-> [%s]:%hu", pif_name(a->pif), + inany_ntop(&a->faddr, fbuf, sizeof(fbuf)), a->fport, + inany_ntop(&a->eaddr, ebuf, sizeof(ebuf)), a->eport); + flow_dbg(flow, " side 1 (%s): [%s]:%hu <-> [%s]:%hu", pif_name(b->pif), + inany_ntop(&b->faddr, fbuf, sizeof(fbuf)), b->fport, + inany_ntop(&b->eaddr, ebuf, sizeof(ebuf)), b->eport); flow->f.type = FLOW_TYPE_NONE; }

diff --git a/flow.h b/flow.h index c943c44..f7fb537 100644 --- a/flow.h +++ b/flow.h @@ -35,11 +35,86 @@ extern const uint8_t flow_proto[]; #define FLOW_PROTO(f) \ ((f)->type < FLOW_NUM_TYPES ? flow_proto[(f)->type] : 0)

+/** + * struct flowside - Common information for one side of a flow + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + * @pif: pif ID on which this side of the flow exists + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport; + in_port_t eport; + uint8_t pif; +}; +static_assert(_Alignof(struct flowside) == _Alignof(uint32_t), + "Unexpected alignment for struct flowside");

I'm too thick to understand the reason behind this assert.

I guess there isn't a particularly strong reason. This was mostly so I didn't get surprised by some weird alignment padding.
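As an illustration of the kind of surprise such an assert guards against, here is a sketch with a stand-in structure of the same shape. The struct and function names are hypothetical, and the two 16-byte arrays of `uint32_t` are an assumed substitute for `union inany_addr`. The compile-time check pins the alignment, and the helper reports how many padding bytes the compiler adds after the last member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in with the same member order as struct flowside */
struct demo_flowside {
	uint32_t faddr[4];	/* 16-byte address, 32-bit aligned */
	uint32_t eaddr[4];
	uint16_t fport;
	uint16_t eport;
	uint8_t pif;
};

/* Fails the build if a member change silently raises the alignment,
 * for example by adding a 64-bit field.
 */
static_assert(_Alignof(struct demo_flowside) == _Alignof(uint32_t),
	      "Unexpected alignment for struct demo_flowside");

/* Number of padding bytes after the last member */
size_t demo_flowside_tail_padding(void)
{
	size_t used = offsetof(struct demo_flowside, pif) + sizeof(uint8_t);

	return sizeof(struct demo_flowside) - used;
}
```

On common ABIs where `uint32_t` is 4-byte aligned, the members occupy 37 bytes and the structure is rounded up to 40, so the helper typically reports 3 bytes of tail padding. A stricter alignment would increase that, and the padding is multiplied by two sides for every entry in the flow table.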

...

...
+ +/** flowside_from_inany - Initialize flowside from inany addresses

flowside_from_inany(), it's a function.

Gone in the latest version anyway.

...

...
+ * @fside: flowside to initialize + * @pif: pif id of this flowside + * @faddr: Forwarding address (inany) + * @fport: Forwarding port + * @eaddr: Endpoint address (inany) + * @eport: Endpoint port + */ +/* cppcheck-suppress unusedFunction */ +static inline void flowside_from_inany(struct flowside *fside, uint8_t pif, + const union inany_addr *faddr, in_port_t fport, + const union inany_addr *eaddr, in_port_t eport) +{ + fside->pif = pif; + fside->faddr = *faddr; + fside->eaddr = *eaddr; + fside->fport = fport; + fside->eport = eport; +} + +/** flowside_from_af - Initialize flowside from addresses

flowside_from_af()

Fixed. I changed to the British spelling of initialise while I was at it.

...

...
+ * @fside: flowside to initialize + * @pif: pif id of this flowside + * @af: Address family (AF_INET or AF_INET6) + * @faddr: Forwarding address (pointer to in_addr or in6_addr, or NULL) + * @fport: Forwarding port + * @eaddr: Endpoint address (pointer to in_addr or in6_addr, or NULL) + * @eport: Endpoint port + * + * If NULL is given for either address, the appropriate unspecified/any address

s/any/wildcard/ makes it a bit easier to follow, I guess.

That behaviour and comment is gone in the latest version.

...

...
+ * for the address family is substituted. + */ +/* cppcheck-suppress unusedFunction */ +static inline void flowside_from_af(struct flowside *fside, + uint8_t pif, sa_family_t af, + const void *faddr, in_port_t fport, + const void *eaddr, in_port_t eport) +{ + const union inany_addr *any = af == AF_INET ? &inany_any4 : &inany_any6; + + fside->pif = pif; + if (faddr) + inany_from_af(&fside->faddr, af, faddr); + else + fside->faddr = *any; + if (eaddr) + inany_from_af(&fside->eaddr, af, eaddr); + else + fside->eaddr = *any; + fside->fport = fport; + fside->eport = eport; +} + +#define SIDES 2 + /** * struct flow_common - Common fields for packet flows + * @side[]: Information for each side of the flow * @type: Type of packet flow */ struct flow_common { + struct flowside side[SIDES]; uint8_t type; };

diff --git a/passt.h b/passt.h index bc58d64..3db0b8e 100644 --- a/passt.h +++ b/passt.h @@ -17,6 +17,9 @@ union epoll_ref;

#include "pif.h" #include "packet.h" +#include "siphash.h" +#include "ip.h" +#include "inany.h" #include "flow.h" #include "icmp.h" #include "fwd.h" diff --git a/pif.h b/pif.h index bd52936..ca85b34 100644 --- a/pif.h +++ b/pif.h @@ -38,7 +38,6 @@ static inline const char *pif_type(enum pif_type pt) return "?"; }

-/* cppcheck-suppress unusedFunction */ static inline const char *pif_name(uint8_t pif) { return pif_type(pif); diff --git a/tcp_conn.h b/tcp_conn.h index d280b22..1a07dd5 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -106,7 +106,6 @@ struct tcp_tap_conn { uint32_t seq_init_from_tap; };

-#define SIDES 2 /** * struct tcp_splice_conn - Descriptor for a spliced TCP connection * @f: Generic flow information

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

David Gibson

3 May

3:11 a.m.

New subject: [PATCH v4 02/16] tcp: Maintain flowside information for "tap" connections

tcp_tap_conn has several fields to track addresses and ports as seen by the guest/namespace. We now have general fields for this in the common flowside struct so use those instead of protocol specific fields. The flowside also has space for the guest side endpoint address (local address from the guest's PoV) so we fill that in as well.

We didn't previously store equivalent information for the connection as it appears to the host; that was implicit in the state of the host side socket. For future generalisations of flow/connection tracking, we're going to need that information, so populate the other flowside in each flow table entry with as much of this information as we can easily obtain.

For connections initiated by the guest that's the endpoint address and port. To get the forwarding address and port we'd need to call getsockname() in general, so leave that blank for now. For connections initiated from outside, we also have the endpoint address from accept(). We have the forwarding port from the epoll ref, but we leave the forwarding address blank.

For now we just fill the information in without really using it for anything.

Signed-off-by: David Gibson
---
 flow.h     |  1 -
 tcp.c      | 88 +++++++++++++++++++++++++++++++++++++-----------------
 tcp_conn.h |  8 -----
 3 files changed, 60 insertions(+), 37 deletions(-)

diff --git a/flow.h b/flow.h
index f7fb537..88caa76 100644
--- a/flow.h
+++ b/flow.h
@@ -85,7 +85,6 @@ static inline void flowside_from_inany(struct flowside *fside, uint8_t pif,
  * If NULL is given for either address, the appropriate unspecified/any address
  * for the address family is substituted.
  */
-/* cppcheck-suppress unusedFunction */
 static inline void flowside_from_af(struct flowside *fside,
 				    uint8_t pif, sa_family_t af,
 				    const void *faddr, in_port_t fport,
diff --git a/tcp.c b/tcp.c
index 21d0af0..1835b86 100644
--- a/tcp.c
+++ b/tcp.c
@@ -372,7 +372,7 @@
 #define OPT_SACK	5
 #define OPT_TS		8
 
-#define CONN_V4(conn)		(!!inany_v4(&(conn)->faddr))
+#define CONN_V4(conn)		(!!inany_v4(&conn->f.side[TAPSIDE].faddr))
 #define CONN_V6(conn)		(!CONN_V4(conn))
 #define CONN_IS_CLOSING(conn)						\
 	((conn->events & ESTABLISHED) &&				\
@@ -795,10 +795,11 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
  */
 static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn)
 {
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];
 	int i;
 
 	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++)
-		if (inany_equals(&conn->faddr, low_rtt_dst + i))
+		if (inany_equals(&tapside->faddr, low_rtt_dst + i))
 			return 1;
 
 	return 0;
@@ -813,6 +814,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 			      const struct tcp_info *tinfo)
 {
 #ifdef HAS_MIN_RTT
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];
 	int i, hole = -1;
 
 	if (!tinfo->tcpi_min_rtt ||
@@ -820,7 +822,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 		return;
 
 	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) {
-		if (inany_equals(&conn->faddr, low_rtt_dst + i))
+		if (inany_equals(&tapside->faddr, low_rtt_dst + i))
 			return;
 		if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i))
 			hole = i;
@@ -832,7 +834,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 	if (hole == -1)
 		return;
 
-	low_rtt_dst[hole++] = conn->faddr;
+	low_rtt_dst[hole++] = tapside->faddr;
 	if (hole == LOW_RTT_TABLE_SIZE)
 		hole = 0;
 	inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any);
@@ -1085,8 +1087,10 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn,
 			  const union inany_addr *faddr,
 			  in_port_t eport, in_port_t fport)
 {
-	if (inany_equals(&conn->faddr, faddr) &&
-	    conn->eport == eport && conn->fport == fport)
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];
+
+	if (inany_equals(&tapside->faddr, faddr) &&
+	    tapside->eport == eport && tapside->fport == fport)
 		return 1;
 
 	return 0;
@@ -1120,7 +1124,9 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr,
 static uint64_t tcp_conn_hash(const struct ctx *c,
 			      const struct tcp_tap_conn *conn)
 {
-	return tcp_hash(c, &conn->faddr, conn->eport, conn->fport);
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];
+
+	return tcp_hash(c, &tapside->faddr, tapside->eport, tapside->fport);
 }
 
 /**
@@ -1302,10 +1308,12 @@ void tcp_defer_handler(struct ctx *c)
  * @seq:	Sequence number
  */
 static void tcp_fill_header(struct tcphdr *th,
-			       const struct tcp_tap_conn *conn, uint32_t seq)
+			    const struct tcp_tap_conn *conn, uint32_t seq)
 {
-	th->source = htons(conn->fport);
-	th->dest = htons(conn->eport);
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];
+
+	th->source = htons(tapside->fport);
+	th->dest = htons(tapside->eport);
 	th->seq = htonl(seq);
 	th->ack_seq = htonl(conn->seq_ack_to_tap);
 	if (conn->events & ESTABLISHED)	{
@@ -1337,7 +1345,8 @@ static size_t tcp_fill_headers4(const struct ctx *c,
 				size_t dlen, const uint16_t *check,
 				uint32_t seq)
 {
-	const struct in_addr *a4 = inany_v4(&conn->faddr);
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];
+	const struct in_addr *a4 = inany_v4(&tapside->faddr);
 	size_t l4len = dlen + sizeof(*th);
 	size_t l3len = l4len + sizeof(*iph);
 
@@ -1379,10 +1388,11 @@ static size_t tcp_fill_headers6(const struct ctx *c,
 				struct ipv6hdr *ip6h, struct tcphdr *th,
 				size_t dlen, uint32_t seq)
 {
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];
 	size_t l4len = dlen + sizeof(*th);
 
 	ip6h->payload_len = htons(l4len);
-	ip6h->saddr = conn->faddr.a6;
+	ip6h->saddr = tapside->faddr.a6;
 	if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr))
 		ip6h->daddr = c->ip6.addr_ll_seen;
 	else
@@ -1421,9 +1431,7 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
 				      struct iovec *iov, size_t dlen,
 				      const uint16_t *check, uint32_t seq)
 {
-	const struct in_addr *a4 = inany_v4(&conn->faddr);
-
-	if (a4) {
+	if (CONN_V4(conn)) {
 		return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base,
 					 iov[TCP_IOV_IP].iov_base,
 					 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
@@ -1738,7 +1746,7 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
 /**
  * tcp_seq_init() - Calculate initial sequence number according to RFC 6528
  * @c:		Execution context
- * @conn:	TCP connection, with faddr, fport and eport populated
+ * @conn:	TCP connection, with tap flowside faddr, fport and eport
  * @now:	Current timestamp
  */
 static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
@@ -1746,6 +1754,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 {
 	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
 	union inany_addr aany;
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];
 	uint64_t hash;
 	uint32_t ns;
 
@@ -1754,10 +1763,10 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 	else
 		inany_from_af(&aany, AF_INET6, &c->ip6.addr);
 
-	inany_siphash_feed(&state, &conn->faddr);
+	inany_siphash_feed(&state, &tapside->faddr);
 	inany_siphash_feed(&state, &aany);
 	hash = siphash_final(&state, 36,
-			     (uint64_t)conn->fport << 16 | conn->eport);
+			     (uint64_t)tapside->fport << 16 | tapside->eport);
 
 	/* 32ns ticks, overflows 32 bits every 137s */
 	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
@@ -1945,6 +1954,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 		.sin6_port = htons(dstport),
 		.sin6_addr = *(struct in6_addr *)daddr,
 	};
+	struct flowside *tapside, *sockside;
 	const struct sockaddr *sa;
 	struct tcp_tap_conn *conn;
 	union flow *flow;
@@ -1954,6 +1964,11 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	if (!(flow = flow_alloc()))
 		return;
 
+	tapside = &flow->f.side[TAPSIDE];
+	sockside = &flow->f.side[SOCKSIDE];
+
+	flowside_from_af(tapside, PIF_TAP, af, daddr, dstport,
saddr, srcport); + if (af == AF_INET) { if (IN4_IS_ADDR_UNSPECIFIED(saddr) || IN4_IS_ADDR_BROADCAST(saddr) || @@ -2026,19 +2041,19 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1; - inany_from_af(&conn->faddr, af, daddr); + sockside->pif = PIF_HOST; + sockside->eport = dstport; if (af == AF_INET) { + inany_from_af(&sockside->eaddr, AF_INET, &addr4.sin_addr); sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); } else { + inany_from_af(&sockside->eaddr, AF_INET6, &addr6.sin6_addr); sa = (struct sockaddr *)&addr6; sl = sizeof(addr6); } - conn->fport = dstport; - conn->eport = srcport; - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2724,18 +2739,35 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, const union sockaddr_inany *sa, const struct timespec *now) { - struct tcp_tap_conn *conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE); + struct flowside *sockside = &flow->f.side[SOCKSIDE]; + struct flowside *tapside = &flow->f.side[TAPSIDE]; + struct tcp_tap_conn *conn; + + sockside->pif = PIF_HOST; + inany_from_sockaddr(&sockside->eaddr, &sockside->eport, sa); + sockside->fport = dstport; + + tapside->pif = PIF_TAP; + tapside->faddr = sockside->eaddr; + tapside->fport = sockside->eport; + tcp_snat_inbound(c, &tapside->faddr); + if (CONN_V4(flow)) { + inany_from_af(&tapside->eaddr, AF_INET, &c->ip4.addr_seen); + } else { + if (IN6_IS_ADDR_LINKLOCAL(&tapside->faddr.a6)) + tapside->eaddr.a6 = c->ip6.addr_ll_seen; + else + tapside->eaddr.a6 = c->ip6.addr_seen; + } + tapside->eport = dstport + c->tcp.fwd_in.delta[dstport]; + + conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE); conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = 
dstport + c->tcp.fwd_in.delta[dstport]; - - tcp_snat_inbound(c, &conn->faddr); - tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_conn.h b/tcp_conn.h index 1a07dd5..f55f144 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -23,9 +23,6 @@ * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -91,11 +88,6 @@ struct tcp_tap_conn { uint8_t seq_dup_ack_approx; - - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap; -- 2.44.0
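
As an illustration of the commit message above, here is a minimal sketch of how the two flowsides might be filled in for a guest-initiated IPv4 connection. The names flowside_sketch, flow_sketch, v4_to_mapped() and flow_from_tap4() are hypothetical simplifications, not the definitions from this series: a plain struct in6_addr stands in for union inany_addr, and NAT adjustments such as map-gw are ignored.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the types used in the series */
enum { PIF_NONE, PIF_HOST, PIF_TAP };
enum { TAPSIDE, SOCKSIDE, SIDES };

struct flowside_sketch {
	uint8_t pif;		/* Interface on which this side is seen */
	struct in6_addr eaddr;	/* Endpoint address (IPv4 as IPv4-mapped) */
	struct in6_addr faddr;	/* Forwarding address */
	in_port_t eport;	/* Endpoint port, host byte order */
	in_port_t fport;	/* Forwarding port, host byte order */
};

struct flow_sketch {
	struct flowside_sketch side[SIDES];
};

/* Store an IPv4 address as an IPv4-mapped IPv6 address (::ffff:a.b.c.d) */
static void v4_to_mapped(struct in6_addr *dst, const struct in_addr *src)
{
	memset(dst, 0, sizeof(*dst));
	dst->s6_addr[10] = 0xff;
	dst->s6_addr[11] = 0xff;
	memcpy(&dst->s6_addr[12], &src->s_addr, sizeof(src->s_addr));
}

/* Guest at saddr:srcport opens a connection to daddr:dstport */
static void flow_from_tap4(struct flow_sketch *f,
			   const struct in_addr *saddr, in_port_t srcport,
			   const struct in_addr *daddr, in_port_t dstport)
{
	struct flowside_sketch *tapside, *sockside;

	assert(f && saddr && daddr);
	memset(f, 0, sizeof(*f));

	/* Tap side: forwarding = what the guest talks to, endpoint = guest */
	tapside = &f->side[TAPSIDE];
	tapside->pif = PIF_TAP;
	v4_to_mapped(&tapside->faddr, daddr);
	tapside->fport = dstport;
	v4_to_mapped(&tapside->eaddr, saddr);
	tapside->eport = srcport;

	/* Socket side: only the remote endpoint is known without getsockname() */
	sockside = &f->side[SOCKSIDE];
	sockside->pif = PIF_HOST;
	v4_to_mapped(&sockside->eaddr, daddr);
	sockside->eport = dstport;
}
```

The host side forwarding address and port stay zeroed here, mirroring the "leave that blank for now" choice described in the commit message.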

Stefano Brivio

13 May 2024

8:07 p.m.

New subject: [PATCH v4 02/16] tcp: Maintain flowside information for "tap" connections

On Fri, 3 May 2024 11:11:21 +1000 David Gibson wrote:

tcp_tap_conn has several fields to track addresses and ports as seen by the guest/namespace. We now have general fields for this in the common flowside struct so use those instead of protocol specific fields. The flowside also has space for the guest side endpoint address (local address from the guest's PoV) so we fill that in as well.

We didn't previously store equivalent information for the connection as it appears to the host; that was implicit in the state of the host side socket. For future generalisations of flow/connection tracking, we're going to need that information, so populate the other flowside in each flow table entry with as much of this information as we can easily obtain. For connections initiated by the guest that's the endpoint address and port. To get the forwarding address and port we'd need to call getsockname() in general, so leave that blank for now. For connections initiated from outside, we also have the endpoint address from accept(). We have the forwarding port from the epoll ref, but we leave the forwarding address blank.

For now we just fill the information in without really using it for anything.

Signed-off-by: David Gibson --- flow.h | 1 - tcp.c | 88 +++++++++++++++++++++++++++++++++++++----------------- tcp_conn.h | 8 ----- 3 files changed, 60 insertions(+), 37 deletions(-)

diff --git a/flow.h b/flow.h index f7fb537..88caa76 100644 --- a/flow.h +++ b/flow.h @@ -85,7 +85,6 @@ static inline void flowside_from_inany(struct flowside *fside, uint8_t pif, * If NULL is given for either address, the appropriate unspecified/any address * for the address family is substituted. */ -/* cppcheck-suppress unusedFunction */ static inline void flowside_from_af(struct flowside *fside, uint8_t pif, sa_family_t af, const void *faddr, in_port_t fport, diff --git a/tcp.c b/tcp.c index 21d0af0..1835b86 100644 --- a/tcp.c +++ b/tcp.c @@ -372,7 +372,7 @@ #define OPT_SACK 5 #define OPT_TS 8

-#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) +#define CONN_V4(conn) (!!inany_v4(&conn->f.side[TAPSIDE].faddr))

...which reminds me: I guess CONN_V4() and CONN_V6() should eventually go away, just like SPLICE_V6 in 7/16.

#define CONN_V6(conn) (!CONN_V4(conn)) #define CONN_IS_CLOSING(conn) \ ((conn->events & ESTABLISHED) && \ @@ -795,10 +795,11 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, */ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) { + const struct flowside *tapside = &conn->f.side[TAPSIDE]; int i;

for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return 1;

return 0; @@ -813,6 +814,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, const struct tcp_info *tinfo) { #ifdef HAS_MIN_RTT + const struct flowside *tapside = &conn->f.side[TAPSIDE]; int i, hole = -1;

if (!tinfo->tcpi_min_rtt || @@ -820,7 +822,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return;

for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -832,7 +834,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return;

- low_rtt_dst[hole++] = conn->faddr; + low_rtt_dst[hole++] = tapside->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -1085,8 +1087,10 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->faddr, faddr) && - conn->eport == eport && conn->fport == fport) + const struct flowside *tapside = &conn->f.side[TAPSIDE]; + + if (inany_equals(&tapside->faddr, faddr) && + tapside->eport == eport && tapside->fport == fport) return 1;

return 0; @@ -1120,7 +1124,9 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, static uint64_t tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + const struct flowside *tapside = &conn->f.side[TAPSIDE]; + + return tcp_hash(c, &tapside->faddr, tapside->eport, tapside->fport); }

/** @@ -1302,10 +1308,12 @@ void tcp_defer_handler(struct ctx *c) * @seq: Sequence number */ static void tcp_fill_header(struct tcphdr *th, - const struct tcp_tap_conn *conn, uint32_t seq) + const struct tcp_tap_conn *conn, uint32_t seq) { - th->source = htons(conn->fport); - th->dest = htons(conn->eport); + const struct flowside *tapside = &conn->f.side[TAPSIDE]; + + th->source = htons(tapside->fport); + th->dest = htons(tapside->eport); th->seq = htonl(seq); th->ack_seq = htonl(conn->seq_ack_to_tap); if (conn->events & ESTABLISHED) { @@ -1337,7 +1345,8 @@ static size_t tcp_fill_headers4(const struct ctx *c, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = &conn->f.side[TAPSIDE]; + const struct in_addr *a4 = inany_v4(&tapside->faddr); size_t l4len = dlen + sizeof(*th); size_t l3len = l4len + sizeof(*iph);

@@ -1379,10 +1388,11 @@ static size_t tcp_fill_headers6(const struct ctx *c, struct ipv6hdr *ip6h, struct tcphdr *th, size_t dlen, uint32_t seq) { + const struct flowside *tapside = &conn->f.side[TAPSIDE]; size_t l4len = dlen + sizeof(*th);

ip6h->payload_len = htons(l4len); - ip6h->saddr = conn->faddr.a6; + ip6h->saddr = tapside->faddr.a6; if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr)) ip6h->daddr = c->ip6.addr_ll_seen; else @@ -1421,9 +1431,7 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); - - if (a4) { + if (CONN_V4(conn)) { return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_IP].iov_base, iov[TCP_IOV_PAYLOAD].iov_base, dlen, @@ -1738,7 +1746,7 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) /** * tcp_seq_init() - Calculate initial sequence number according to RFC 6528 * @c: Execution context - * @conn: TCP connection, with faddr, fport and eport populated + * @conn: TCP connection, with tap flowside faddr, fport and eport * @now: Current timestamp */ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, @@ -1746,6 +1754,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, { struct siphash_state state = SIPHASH_INIT(c->hash_secret); union inany_addr aany; + const struct flowside *tapside = &conn->f.side[TAPSIDE];

One line up.

uint64_t hash; uint32_t ns;

@@ -1754,10 +1763,10 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, else inany_from_af(&aany, AF_INET6, &c->ip6.addr);

- inany_siphash_feed(&state, &conn->faddr); + inany_siphash_feed(&state, &tapside->faddr); inany_siphash_feed(&state, &aany); hash = siphash_final(&state, 36, - (uint64_t)conn->fport << 16 | conn->eport); + (uint64_t)tapside->fport << 16 | tapside->eport);

/* 32ns ticks, overflows 32 bits every 137s */ ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; @@ -1945,6 +1954,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, .sin6_port = htons(dstport), .sin6_addr = *(struct in6_addr *)daddr, }; + struct flowside *tapside, *sockside; const struct sockaddr *sa; struct tcp_tap_conn *conn; union flow *flow; @@ -1954,6 +1964,11 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(flow = flow_alloc())) return;

+ tapside = &flow->f.side[TAPSIDE]; + sockside = &flow->f.side[SOCKSIDE]; + + flowside_from_af(tapside, PIF_TAP, af, daddr, dstport, saddr, srcport); + if (af == AF_INET) { if (IN4_IS_ADDR_UNSPECIFIED(saddr) || IN4_IS_ADDR_BROADCAST(saddr) || @@ -2026,19 +2041,19 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1;

- inany_from_af(&conn->faddr, af, daddr); + sockside->pif = PIF_HOST; + sockside->eport = dstport;

if (af == AF_INET) { + inany_from_af(&sockside->eaddr, AF_INET, &addr4.sin_addr); sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); } else { + inany_from_af(&sockside->eaddr, AF_INET6, &addr6.sin6_addr); sa = (struct sockaddr *)&addr6; sl = sizeof(addr6); }

- conn->fport = dstport; - conn->eport = srcport; - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2724,18 +2739,35 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, const union sockaddr_inany *sa, const struct timespec *now) { - struct tcp_tap_conn *conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE); + struct flowside *sockside = &flow->f.side[SOCKSIDE]; + struct flowside *tapside = &flow->f.side[TAPSIDE]; + struct tcp_tap_conn *conn; + + sockside->pif = PIF_HOST; + inany_from_sockaddr(&sockside->eaddr, &sockside->eport, sa); + sockside->fport = dstport; + + tapside->pif = PIF_TAP; + tapside->faddr = sockside->eaddr; + tapside->fport = sockside->eport; + tcp_snat_inbound(c, &tapside->faddr); + if (CONN_V4(flow)) { + inany_from_af(&tapside->eaddr, AF_INET, &c->ip4.addr_seen); + } else { + if (IN6_IS_ADDR_LINKLOCAL(&tapside->faddr.a6)) + tapside->eaddr.a6 = c->ip6.addr_ll_seen; + else + tapside->eaddr.a6 = c->ip6.addr_seen; + } + tapside->eport = dstport + c->tcp.fwd_in.delta[dstport];

Pre-existing, but I wonder: doesn't this port translation also belong to tcp_snat_inbound()?

+ + conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE);

conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);

- inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = dstport + c->tcp.fwd_in.delta[dstport]; - - tcp_snat_inbound(c, &conn->faddr); - tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn);

diff --git a/tcp_conn.h b/tcp_conn.h index 1a07dd5..f55f144 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -23,9 +23,6 @@ * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -91,11 +88,6 @@ struct tcp_tap_conn {

uint8_t seq_dup_ack_approx;

- - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap;

-- Stefano
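
For readers following the CONN_V4() remark above: the macro tests whether the tap side forwarding address is an IPv4 address. A minimal sketch of that kind of test, assuming addresses are stored as IPv6 with IPv4 represented as IPv4-mapped, could look like this. addr_v4_bytes() and addr_is_v4() are hypothetical names over a plain struct in6_addr, not the inany helpers from the tree.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical accessor: the four IPv4 bytes if @a is IPv4-mapped, else NULL */
static const unsigned char *addr_v4_bytes(const struct in6_addr *a)
{
	assert(a);

	if (!IN6_IS_ADDR_V4MAPPED(a))
		return NULL;

	return &a->s6_addr[12];
}

/* Hypothetical CONN_V4()-style test working directly on a stored address */
static bool addr_is_v4(const struct in6_addr *a)
{
	return addr_v4_bytes(a) != NULL;
}
```

With the address family recoverable from the stored address itself, a separate per-connection flag or macro carries no extra information, which is presumably why such helpers can eventually go away.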

David Gibson

14 May 2024

2:15 a.m.

New subject: [PATCH v4 02/16] tcp: Maintain flowside information for "tap" connections

On Mon, May 13, 2024 at 08:07:22PM +0200, Stefano Brivio wrote:

On Fri, 3 May 2024 11:11:21 +1000 David Gibson wrote:

tcp_tap_conn has several fields to track addresses and ports as seen by the guest/namespace. We now have general fields for this in the common flowside struct so use those instead of protocol specific fields. The flowside also has space for the guest side endpoint address (local address from the guest's PoV) so we fill that in as well.

We didn't previously store equivalent information for the connection as it appears to the host; that was implicit in the state of the host side socket. For future generalisations of flow/connection tracking, we're going to need that information, so populate the other flowside in each flow table entry with as much of this information as we can easily obtain. For connections initiated by the guest that's the endpoint address and port. To get the forwarding address and port we'd need to call getsockname() in general, so leave that blank for now. For connections initiated from outside, we also have the endpoint address from accept(). We have the forwarding port from the epoll ref, but we leave the forwarding address blank.

For now we just fill the information in without really using it for anything.

Signed-off-by: David Gibson --- flow.h | 1 - tcp.c | 88 +++++++++++++++++++++++++++++++++++++----------------- tcp_conn.h | 8 ----- 3 files changed, 60 insertions(+), 37 deletions(-)

diff --git a/flow.h b/flow.h index f7fb537..88caa76 100644 --- a/flow.h +++ b/flow.h @@ -85,7 +85,6 @@ static inline void flowside_from_inany(struct flowside *fside, uint8_t pif, * If NULL is given for either address, the appropriate unspecified/any address * for the address family is substituted. */ -/* cppcheck-suppress unusedFunction */ static inline void flowside_from_af(struct flowside *fside, uint8_t pif, sa_family_t af, const void *faddr, in_port_t fport, diff --git a/tcp.c b/tcp.c index 21d0af0..1835b86 100644 --- a/tcp.c +++ b/tcp.c @@ -372,7 +372,7 @@ #define OPT_SACK 5 #define OPT_TS 8

-#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) +#define CONN_V4(conn) (!!inany_v4(&conn->f.side[TAPSIDE].faddr))

...which reminds me: I guess CONN_V4() and CONN_V6() should eventually go away, just like SPLICE_V6 in 7/16.

Yes. I've thought about doing that, but haven't quite gotten there yet.

#define CONN_V6(conn) (!CONN_V4(conn)) #define CONN_IS_CLOSING(conn) \ ((conn->events & ESTABLISHED) && \ @@ -795,10 +795,11 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, */ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) { + const struct flowside *tapside = &conn->f.side[TAPSIDE]; int i;

for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return 1;

return 0; @@ -813,6 +814,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, const struct tcp_info *tinfo) { #ifdef HAS_MIN_RTT + const struct flowside *tapside = &conn->f.side[TAPSIDE]; int i, hole = -1;

if (!tinfo->tcpi_min_rtt || @@ -820,7 +822,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return;

for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -832,7 +834,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return;

- low_rtt_dst[hole++] = conn->faddr; + low_rtt_dst[hole++] = tapside->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -1085,8 +1087,10 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->faddr, faddr) && - conn->eport == eport && conn->fport == fport) + const struct flowside *tapside = &conn->f.side[TAPSIDE]; + + if (inany_equals(&tapside->faddr, faddr) && + tapside->eport == eport && tapside->fport == fport) return 1;

return 0; @@ -1120,7 +1124,9 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, static uint64_t tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + const struct flowside *tapside = &conn->f.side[TAPSIDE]; + + return tcp_hash(c, &tapside->faddr, tapside->eport, tapside->fport); }

/** @@ -1302,10 +1308,12 @@ void tcp_defer_handler(struct ctx *c) * @seq: Sequence number */ static void tcp_fill_header(struct tcphdr *th, - const struct tcp_tap_conn *conn, uint32_t seq) + const struct tcp_tap_conn *conn, uint32_t seq) { - th->source = htons(conn->fport); - th->dest = htons(conn->eport); + const struct flowside *tapside = &conn->f.side[TAPSIDE]; + + th->source = htons(tapside->fport); + th->dest = htons(tapside->eport); th->seq = htonl(seq); th->ack_seq = htonl(conn->seq_ack_to_tap); if (conn->events & ESTABLISHED) { @@ -1337,7 +1345,8 @@ static size_t tcp_fill_headers4(const struct ctx *c, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = &conn->f.side[TAPSIDE]; + const struct in_addr *a4 = inany_v4(&tapside->faddr); size_t l4len = dlen + sizeof(*th); size_t l3len = l4len + sizeof(*iph);

@@ -1379,10 +1388,11 @@ static size_t tcp_fill_headers6(const struct ctx *c, struct ipv6hdr *ip6h, struct tcphdr *th, size_t dlen, uint32_t seq) { + const struct flowside *tapside = &conn->f.side[TAPSIDE]; size_t l4len = dlen + sizeof(*th);

ip6h->payload_len = htons(l4len); - ip6h->saddr = conn->faddr.a6; + ip6h->saddr = tapside->faddr.a6; if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr)) ip6h->daddr = c->ip6.addr_ll_seen; else @@ -1421,9 +1431,7 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); - - if (a4) { + if (CONN_V4(conn)) { return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base, iov[TCP_IOV_IP].iov_base, iov[TCP_IOV_PAYLOAD].iov_base, dlen, @@ -1738,7 +1746,7 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) /** * tcp_seq_init() - Calculate initial sequence number according to RFC 6528 * @c: Execution context - * @conn: TCP connection, with faddr, fport and eport populated + * @conn: TCP connection, with tap flowside faddr, fport and eport * @now: Current timestamp */ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, @@ -1746,6 +1754,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, { struct siphash_state state = SIPHASH_INIT(c->hash_secret); union inany_addr aany; + const struct flowside *tapside = &conn->f.side[TAPSIDE];

One line up.

Already fixed in the latest equivalent code.

uint64_t hash; uint32_t ns;

@@ -1754,10 +1763,10 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, else inany_from_af(&aany, AF_INET6, &c->ip6.addr);

- inany_siphash_feed(&state, &conn->faddr); + inany_siphash_feed(&state, &tapside->faddr); inany_siphash_feed(&state, &aany); hash = siphash_final(&state, 36, - (uint64_t)conn->fport << 16 | conn->eport); + (uint64_t)tapside->fport << 16 | tapside->eport);

/* 32ns ticks, overflows 32 bits every 137s */ ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; @@ -1945,6 +1954,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, .sin6_port = htons(dstport), .sin6_addr = *(struct in6_addr *)daddr, }; + struct flowside *tapside, *sockside; const struct sockaddr *sa; struct tcp_tap_conn *conn; union flow *flow; @@ -1954,6 +1964,11 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(flow = flow_alloc())) return;

+ tapside = &flow->f.side[TAPSIDE]; + sockside = &flow->f.side[SOCKSIDE]; + + flowside_from_af(tapside, PIF_TAP, af, daddr, dstport, saddr, srcport); + if (af == AF_INET) { if (IN4_IS_ADDR_UNSPECIFIED(saddr) || IN4_IS_ADDR_BROADCAST(saddr) || @@ -2026,19 +2041,19 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1;

- inany_from_af(&conn->faddr, af, daddr); + sockside->pif = PIF_HOST; + sockside->eport = dstport;

if (af == AF_INET) { + inany_from_af(&sockside->eaddr, AF_INET, &addr4.sin_addr); sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); } else { + inany_from_af(&sockside->eaddr, AF_INET6, &addr6.sin6_addr); sa = (struct sockaddr *)&addr6; sl = sizeof(addr6); }

- conn->fport = dstport; - conn->eport = srcport; - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2724,18 +2739,35 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, const union sockaddr_inany *sa, const struct timespec *now) { - struct tcp_tap_conn *conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE); + struct flowside *sockside = &flow->f.side[SOCKSIDE]; + struct flowside *tapside = &flow->f.side[TAPSIDE]; + struct tcp_tap_conn *conn; + + sockside->pif = PIF_HOST; + inany_from_sockaddr(&sockside->eaddr, &sockside->eport, sa); + sockside->fport = dstport; + + tapside->pif = PIF_TAP; + tapside->faddr = sockside->eaddr; + tapside->fport = sockside->eport; + tcp_snat_inbound(c, &tapside->faddr); + if (CONN_V4(flow)) { + inany_from_af(&tapside->eaddr, AF_INET, &c->ip4.addr_seen); + } else { + if (IN6_IS_ADDR_LINKLOCAL(&tapside->faddr.a6)) + tapside->eaddr.a6 = c->ip6.addr_ll_seen; + else + tapside->eaddr.a6 = c->ip6.addr_seen; + } + tapside->eport = dstport + c->tcp.fwd_in.delta[dstport];

Pre-existing, but I wonder: doesn't this port translation also belong to tcp_snat_inbound()?

Not really, because "snat" here is for "source nat". But in any case both are subsumed into common NAT functions later in the series.

+ + conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE);

conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);

- inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = dstport + c->tcp.fwd_in.delta[dstport]; - - tcp_snat_inbound(c, &conn->faddr); - tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn);

diff --git a/tcp_conn.h b/tcp_conn.h index 1a07dd5..f55f144 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -23,9 +23,6 @@ * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -91,11 +88,6 @@ struct tcp_tap_conn {

uint8_t seq_dup_ack_approx;

- - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap;

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
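
To make the distinction in the exchange above concrete: the expression dstport + c->tcp.fwd_in.delta[dstport] changes the destination port, whereas source NAT concerns the source address. Below is a minimal sketch of the port half only, assuming a per-port delta table indexed by the original destination port. fwd_sketch and fwd_dst_port() are hypothetical names, and the table layout is an assumption rather than the tree's definition.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>

#define NUM_PORTS_SKETCH	(1U << 16)

/* Hypothetical forwarding table: one 16-bit offset per destination port */
struct fwd_sketch {
	in_port_t delta[NUM_PORTS_SKETCH];
};

/* Translate a destination port; the sum wraps modulo 2^16 by design */
static in_port_t fwd_dst_port(const struct fwd_sketch *fwd, in_port_t dstport)
{
	assert(fwd);

	return (in_port_t)(dstport + fwd->delta[dstport]);
}
```

Because the result is truncated to 16 bits, a "negative" offset can be stored as its two's complement: mapping port 8080 to 80 uses a delta of 57536.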

David Gibson

3 May 2024

3:11 a.m.

New subject: [PATCH v4 03/16] tcp_splice: Maintain flowside information for spliced connections

Every flow in the flow table now has space for the addresses as seen by both the host and guest side. We already fill that information in for regular "tap" TCP connections; now do it for spliced connections too.

Signed-off-by: David Gibson
---
tcp.c | 20 ++++++++++----------
tcp_splice.c | 31 +++++++++++++++++--------------
tcp_splice.h | 6 ++----
3 files changed, 29 insertions(+), 28 deletions(-)

diff --git a/tcp.c b/tcp.c index 1835b86..cd5bffe 100644 --- a/tcp.c +++ b/tcp.c @@ -2731,22 +2731,16 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) * @dstport: Destination port for connection (host side) * @flow: flow to initialise * @s: Accepted socket - * @sa: Peer socket address (from accept()) * @now: Current timestamp */ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, union flow *flow, int s, - const union sockaddr_inany *sa, const struct timespec *now) { - struct flowside *sockside = &flow->f.side[SOCKSIDE]; + const struct flowside *sockside = &flow->f.side[SOCKSIDE]; struct flowside *tapside = &flow->f.side[TAPSIDE]; struct tcp_tap_conn *conn; - sockside->pif = PIF_HOST; - inany_from_sockaddr(&sockside->eaddr, &sockside->eport, sa); - sockside->fport = dstport; - tapside->pif = PIF_TAP; tapside->faddr = sockside->eaddr; tapside->fport = sockside->eport; @@ -2792,16 +2786,23 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, { union sockaddr_inany sa; socklen_t sl = sizeof(sa); + struct flowside *side0; union flow *flow; int s; if (c->no_tcp || !(flow = flow_alloc())) return; + side0 = &flow->f.side[0]; + s = accept4(ref.fd, &sa.sa, &sl, SOCK_NONBLOCK); if (s < 0) goto cancel; + side0->pif = ref.tcp_listen.pif; + inany_from_sockaddr(&side0->eaddr, &side0->eport, &sa); + side0->fport = ref.tcp_listen.port; + if (sa.sa_family == AF_INET) { const struct in_addr *addr = &sa.sa4.sin_addr; in_port_t port = sa.sa4.sin_port; @@ -2829,11 +2830,10 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
} } - if (tcp_splice_conn_from_sock(c, ref.tcp_listen.pif, - ref.tcp_listen.port, flow, s, &sa)) + if (tcp_splice_conn_from_sock(c, ref.tcp_listen.port, flow, s)) return; - tcp_tap_conn_from_sock(c, ref.tcp_listen.port, flow, s, &sa, now); + tcp_tap_conn_from_sock(c, ref.tcp_listen.port, flow, s, now); return; cancel: diff --git a/tcp_splice.c b/tcp_splice.c index 42b7be0..462ed0c 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -414,42 +414,38 @@ static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af) /** * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection * @c: Execution context - * @pif0: pif id of side 0 * @dstport: Side 0 destination port of connection * @flow: flow to initialise * @s0: Accepted (side 0) socket - * @sa: Peer address of connection * * Return: true if able to create a spliced connection, false otherwise * #syscalls:pasta setsockopt */ -bool tcp_splice_conn_from_sock(const struct ctx *c, - uint8_t pif0, in_port_t dstport, - union flow *flow, int s0, - const union sockaddr_inany *sa) +bool tcp_splice_conn_from_sock(const struct ctx *c, in_port_t dstport, + union flow *flow, int s0) { + const struct flowside *side0 = &flow->f.side[0]; + const union inany_addr *src = &side0->eaddr; + struct flowside *side1 = &flow->f.side[1]; struct tcp_splice_conn *conn; - union inany_addr src; - in_port_t srcport; sa_family_t af; uint8_t pif1; if (c->mode != MODE_PASTA) return false; - inany_from_sockaddr(&src, &srcport, sa); - af = inany_v4(&src) ? AF_INET : AF_INET6; + af = inany_v4(src) ? AF_INET : AF_INET6; - switch (pif0) { + switch (side0->pif) { case PIF_SPLICE: - if (!inany_is_loopback(&src)) { + if (!inany_is_loopback(src)) { char str[INANY_ADDRSTRLEN]; /* We can't use flow_err() etc. 
because we haven't set * the flow type yet */ warn("Bad source address %s for splice, closing", - inany_ntop(&src, str, sizeof(str))); + inany_ntop(src, str, sizeof(str))); /* We *don't* want to fall back to tap */ flow_alloc_cancel(flow); @@ -461,7 +457,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, break; case PIF_HOST: - if (!inany_is_loopback(&src)) + if (!inany_is_loopback(src)) return false; pif1 = PIF_SPLICE; @@ -472,6 +468,13 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, return false; } + if (af == AF_INET) + flowside_from_af(side1, pif1, AF_INET, NULL, 0, + &in4addr_loopback, dstport); + else + flowside_from_af(side1, pif1, AF_INET6, NULL, 0, + &in6addr_loopback, dstport); + conn = FLOW_START(flow, FLOW_TCP_SPLICE, tcp_splice, 0); conn->flags = af == AF_INET ? 0 : SPLICE_V6; diff --git a/tcp_splice.h b/tcp_splice.h index ed8f0c5..e523c7e 100644 --- a/tcp_splice.h +++ b/tcp_splice.h @@ -11,10 +11,8 @@ union sockaddr_inany; void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events); -bool tcp_splice_conn_from_sock(const struct ctx *c, - uint8_t pif0, in_port_t dstport, - union flow *flow, int s0, - const union sockaddr_inany *sa); +bool tcp_splice_conn_from_sock(const struct ctx *c, in_port_t dstport, + union flow *flow, int s0); void tcp_splice_init(struct ctx *c); #endif /* TCP_SPLICE_H */ -- 2.44.0
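For readers skimming the diff, the accept()-time bookkeeping that moves into tcp_listen_handler() can be pictured with the minimal IPv4-only sketch below. It is not code from this series: struct side4 and side4_from_accept() are hypothetical stand-ins for struct flowside and the inany_from_sockaddr() based assignments, and keeping ports in host byte order is an assumption made only for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical, IPv4-only stand-in for one side of a flow */
struct side4 {
	uint8_t pif;		/* interface the connection arrived on */
	struct in_addr eaddr;	/* peer (endpoint) address, network order */
	in_port_t eport;	/* peer port, host order in this sketch */
	in_port_t fport;	/* port the peer connected to, host order */
};

/* Record everything known about side 0 as soon as accept() returns, so
 * later code can read it from the flow instead of taking a sockaddr. */
static void side4_from_accept(struct side4 *side, uint8_t pif,
			      const struct sockaddr_in *peer,
			      in_port_t listen_port)
{
	side->pif = pif;
	side->eaddr = peer->sin_addr;
	side->eport = ntohs(peer->sin_port);
	side->fport = listen_port;
}
```

With the side filled in once, both the splice path and the tap path can take just the flow pointer, which is the signature simplification the patch makes.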

David Gibson

3:11 a.m.

New subject: [PATCH v4 04/16] tcp: Obtain guest address from flowside

Currently we always deliver inbound TCP packets to the guest's most
recent observed IP address.  This has the odd side effect that if the
guest changes its IP address with active TCP connections we might
deliver packets from old connections to the new address.  That won't
work; it will will probably result in an RST from the guest.  Worse,
if the guest added a new address but also retains the old one, then we
could break those old connections by redirecting them to the new
address.

Now that we maintain flowside information, we have a record of the
correct guest side address and can just use it.

Signed-off-by: David Gibson
---
 tcp.c | 47 ++++++++++++++++-------------------------------
 1 file changed, 16 insertions(+), 31 deletions(-)

diff --git a/tcp.c b/tcp.c
index cd5bffe..5ff7480 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1327,7 +1327,6 @@ static void tcp_fill_header(struct tcphdr *th,

 /**
  * tcp_fill_headers4() - Fill 802.3, IPv4, TCP headers in pre-cooked buffers
- * @c:		Execution context
  * @conn:	Connection pointer
  * @taph:	tap backend specific header
  * @iph:	Pointer to IPv4 header
@@ -1338,27 +1337,26 @@ static void tcp_fill_header(struct tcphdr *th,
  *
  * Return: The IPv4 payload length, host order
  */
-static size_t tcp_fill_headers4(const struct ctx *c,
-				const struct tcp_tap_conn *conn,
+static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
 				struct tap_hdr *taph,
 				struct iphdr *iph, struct tcphdr *th,
 				size_t dlen, const uint16_t *check,
 				uint32_t seq)
 {
 	const struct flowside *tapside = &conn->f.side[TAPSIDE];
-	const struct in_addr *a4 = inany_v4(&tapside->faddr);
+	const struct in_addr *src4 = inany_v4(&tapside->faddr);
+	const struct in_addr *dst4 = inany_v4(&tapside->eaddr);
 	size_t l4len = dlen + sizeof(*th);
 	size_t l3len = l4len + sizeof(*iph);

-	ASSERT(a4);
+	ASSERT(src4 && dst4);

 	iph->tot_len = htons(l3len);
-	iph->saddr = a4->s_addr;
-	iph->daddr = c->ip4.addr_seen.s_addr;
+	iph->saddr = src4->s_addr;
+	iph->daddr = dst4->s_addr;

 	iph->check = check ? *check :
-			     csum_ip4_header(l3len, IPPROTO_TCP,
-					     *a4, c->ip4.addr_seen);
+			     csum_ip4_header(l3len, IPPROTO_TCP, *src4, *dst4);

 	tcp_fill_header(th, conn, seq);

@@ -1371,7 +1369,6 @@ static size_t tcp_fill_headers4(const struct ctx *c,

 /**
  * tcp_fill_headers6() - Fill 802.3, IPv6, TCP headers in pre-cooked buffers
- * @c:		Execution context
  * @conn:	Connection pointer
  * @taph:	tap backend specific header
  * @ip6h:	Pointer to IPv6 header
@@ -1382,8 +1379,7 @@ static size_t tcp_fill_headers4(const struct ctx *c,
  *
  * Return: The IPv6 payload length, host order
  */
-static size_t tcp_fill_headers6(const struct ctx *c,
-				const struct tcp_tap_conn *conn,
+static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
 				struct tap_hdr *taph,
 				struct ipv6hdr *ip6h, struct tcphdr *th,
 				size_t dlen, uint32_t seq)
@@ -1393,10 +1389,7 @@ static size_t tcp_fill_headers6(const struct ctx *c,

 	ip6h->payload_len = htons(l4len);
 	ip6h->saddr = tapside->faddr.a6;
-	if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr))
-		ip6h->daddr = c->ip6.addr_ll_seen;
-	else
-		ip6h->daddr = c->ip6.addr_seen;
+	ip6h->daddr = tapside->eaddr.a6;

 	ip6h->hop_limit = 255;
 	ip6h->version = 6;
@@ -1417,7 +1410,6 @@ static size_t tcp_fill_headers6(const struct ctx *c,

 /**
  * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
- * @c:		Execution context
  * @conn:	Connection pointer
  * @iov:	Pointer to an array of iovec of TCP pre-cooked buffers
  * @dlen:	TCP payload length
@@ -1426,19 +1418,18 @@ static size_t tcp_fill_headers6(const struct ctx *c,
  *
  * Return: IP payload length, host order
  */
-static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
-				      const struct tcp_tap_conn *conn,
+static size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
 				      struct iovec *iov, size_t dlen,
 				      const uint16_t *check, uint32_t seq)
 {
 	if (CONN_V4(conn)) {
-		return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base,
+		return tcp_fill_headers4(conn, iov[TCP_IOV_TAP].iov_base,
 					 iov[TCP_IOV_IP].iov_base,
 					 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
 					 check, seq);
 	}

-	return tcp_fill_headers6(c, conn, iov[TCP_IOV_TAP].iov_base,
+	return tcp_fill_headers6(conn, iov[TCP_IOV_TAP].iov_base,
 				 iov[TCP_IOV_IP].iov_base,
 				 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
 				 seq);
@@ -1654,7 +1645,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	th->syn = !!(flags & SYN);
 	th->fin = !!(flags & FIN);

-	l4len = tcp_l2_buf_fill_headers(c, conn, iov, optlen, NULL,
+	l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL,
 					conn->seq_to_tap);
 	iov[TCP_IOV_PAYLOAD].iov_len = l4len;

@@ -1753,18 +1744,12 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 			 const struct timespec *now)
 {
 	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
-	union inany_addr aany;
 	const struct flowside *tapside = &conn->f.side[TAPSIDE];
 	uint64_t hash;
 	uint32_t ns;

-	if (CONN_V4(conn))
-		inany_from_af(&aany, AF_INET, &c->ip4.addr);
-	else
-		inany_from_af(&aany, AF_INET6, &c->ip6.addr);
-
 	inany_siphash_feed(&state, &tapside->faddr);
-	inany_siphash_feed(&state, &aany);
+	inany_siphash_feed(&state, &tapside->eaddr);
 	hash = siphash_final(&state, 36,
 			     (uint64_t)tapside->fport << 16 | tapside->eport);

@@ -2161,7 +2146,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp4_seq_update[tcp4_payload_used].len = dlen;

 		iov = tcp4_l2_iov[tcp4_payload_used++];
-		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, check, seq);
+		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq);
 		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 		if (tcp4_payload_used > TCP_FRAMES_MEM - 1)
 			tcp_payload_flush(c);
@@ -2170,7 +2155,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp6_seq_update[tcp6_payload_used].len = dlen;

 		iov = tcp6_l2_iov[tcp6_payload_used++];
-		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, NULL, seq);
+		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, NULL, seq);
 		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 		if (tcp6_payload_used > TCP_FRAMES_MEM - 1)
 			tcp_payload_flush(c);
-- 
2.44.0
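The effect of the change on header construction can be shown with a small IPv4-only sketch. This is an illustration under simplifying assumptions, not the patch itself: struct side4, struct hdr4 and hdr4_from_side() are hypothetical names, and the real code works on struct flowside with union inany_addr.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical IPv4-only stand-in for the tap side of a flow */
struct side4 {
	struct in_addr faddr;	/* source for packets sent to the guest */
	struct in_addr eaddr;	/* guest address recorded for this flow */
};

/* Only the two header fields this sketch cares about */
struct hdr4 {
	uint32_t saddr;
	uint32_t daddr;
};

/* Address a packet using the flow's own record, rather than a single
 * global "most recently seen guest address". */
static void hdr4_from_side(struct hdr4 *h, const struct side4 *tapside)
{
	h->saddr = tapside->faddr.s_addr;
	h->daddr = tapside->eaddr.s_addr;
}
```

Because each flow carries its own guest address, two flows opened against different guest addresses keep distinct destinations, even if the guest later starts using a new address.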

Stefano Brivio

13 May 2024

8:07 p.m.

New subject: [PATCH v4 04/16] tcp: Obtain guest address from flowside

On Fri, 3 May 2024 11:11:23 +1000 David Gibson wrote:

> Currently we always deliver inbound TCP packets to the guest's most recent observed IP address. This has the odd side effect that if the guest changes its IP address with active TCP connections we might deliver packets from old connections to the new address. That won't work; it will will probably result in an RST from the guest. Worse,

s/will will/will/

...if I recall correctly, that was actually working, as long as we don't swap link-local with global unicast addresses (hence those conditions sprinkled all over the place).

But it doesn't matter in any case, this is surely the way forward.

-- 
Stefano

David Gibson

14 May 2024

2:18 a.m.

New subject: [PATCH v4 04/16] tcp: Obtain guest address from flowside

On Mon, May 13, 2024 at 08:07:43PM +0200, Stefano Brivio wrote:

> On Fri, 3 May 2024 11:11:23 +1000 David Gibson wrote:
>
> > Currently we always deliver inbound TCP packets to the guest's most recent observed IP address. This has the odd side effect that if the guest changes its IP address with active TCP connections we might deliver packets from old connections to the new address. That won't work; it will will probably result in an RST from the guest. Worse,
>
> s/will will/will/

Fixed.

> ...if I recall correctly, that was actually working, as long as we don't swap link-local with global unicast addresses (hence those conditions sprinkled all over the place).

Um.. I don't see how that's possible. Linux - and I imagine any peer - will index TCP connections by both endpoint addresses, so if we deliver packets from one connection to a different address, the peer won't recognize them as belonging to the old connection.
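To make the argument concrete, here is a minimal sketch of a lookup keyed on both endpoints. It is not Linux's actual implementation: struct conn_key and conn_lookup() are made up for illustration, addresses are IPv4-only in host byte order, and a linear search stands in for a real hash table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical connection key: all four values identify a connection */
struct conn_key {
	uint32_t laddr;		/* local address */
	uint16_t lport;		/* local port */
	uint32_t raddr;		/* remote address */
	uint16_t rport;		/* remote port */
};

static bool conn_key_eq(const struct conn_key *a, const struct conn_key *b)
{
	return a->laddr == b->laddr && a->lport == b->lport &&
	       a->raddr == b->raddr && a->rport == b->rport;
}

/* Return the index of the connection matching an incoming packet's
 * addresses and ports, or -1 if no connection matches. */
static int conn_lookup(const struct conn_key *table, size_t n,
		       const struct conn_key *pkt)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (conn_key_eq(&table[i], pkt))
			return (int)i;

	return -1;
}
```

A packet with the same ports but a different local address misses the lookup, so the receiver has no existing connection to attach it to.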

> But it doesn't matter in any case, this is surely the way forward.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
