[PATCH v2 00/10] RFC: Convert TCP connection table to generalisable flow table

newer
[PATCH] tap: fix uses of l3_len in...

older
[PATCH] fedora: Replace pasta hard...

David Gibson

28 Aug 2023 28 Aug '23

7:41 a.m.

This is a second draft of the first steps in implementing more general "connection" tracking, as described at: https://pad.passt.top/p/NewForwardingModel This series changes the TCP connection table into a more general flow table that can track other protocols as well (although none are implemented yet). Each flow uniformly keeps track of all the relevant addresses and ports, which will allow for more robust control of NAT and port forwarding. Caveats: * We significantly increase the size of a connection/flow entry - Can probably be mitigated, but I haven't investigated much yet * We perform a number of extra getsockname() calls to know some of the socket endpoints - Haven't yet measured how much performance impact that has - Can be mitigated in at least some cases, but again, haven't tried yet * Only TCP converted so far Changes since v1: * Terminology changes - "Endpoint" address/port instead of "correspondent" address/port - "flowside" instead of "demiflow" * Actually move the connection table to a new flow table structure in new files * Significant rearrangement of earlier patchs on top of that new table, to reduce churn David Gibson (10): flow, tcp: Generalise connection types flow, tcp: Move TCP connection table to unified flow table flow, tcp: Consolidate flow pointer<->index helpers flow: Make unified version of flow table compaction flow: Introduce struct flowside, space for uniform tracking of addresses tcp: Move guest side address tracking to flow/flowside tcp, flow: Perform TCP hash calculations based on flowside tcp: Re-use flowside_hash for initial sequence number generation tcp: Maintain host flowside for connections tcp_splice: Fill out flowside information for spliced connections Makefile | 14 +- flow.c | 111 ++++++++++++++++ flow.h | 115 +++++++++++++++++ flow_table.h | 45 +++++++ passt.h | 3 + siphash.c | 1 + tcp.c | 355 ++++++++++++++++++++++++--------------------------- tcp.h | 5 - tcp_conn.h | 54 ++------ tcp_splice.c | 78 ++++++----- tcp_splice.h | 3 +- 11 files changed, 505 insertions(+), 279 deletions(-) create mode 100644 flow.c create mode 100644 flow.h create mode 100644 flow_table.h -- 2.41.0

Show replies by date

David Gibson

28 Aug 28 Aug

7:41 a.m.

New subject: [PATCH v2 01/10] flow, tcp: Generalise connection types

Currently TCP connections use a 1-bit selector, 'spliced', to determine the rest of the contents of the structure. We want to generalise the TCP connection table to other types of flows in other protocols. Make a start on this by replacing the tcp_conn_common structure with a new flow_common structure with an enum rather than a simple boolean indicating the type of flow. Signed-off-by: David Gibson --- Makefile | 8 +++---- flow.c | 14 ++++++++++++ flow.h | 29 ++++++++++++++++++++++++ tcp.c | 63 +++++++++++++++++++++++++++++++++++++--------------- tcp_conn.h | 24 ++++++-------------- tcp_splice.c | 3 ++- 6 files changed, 101 insertions(+), 40 deletions(-) create mode 100644 flow.c create mode 100644 flow.h diff --git a/Makefile b/Makefile index 4435bd6..c5a3ce7 100644 --- a/Makefile +++ b/Makefile @@ -43,15 +43,15 @@ FLAGS += -DARCH=\"$(TARGET_ARCH)\" FLAGS += -DVERSION=\"$(VERSION)\" FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS) -PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c icmp.c igmp.c \ - isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c passt.c \ - pasta.c pcap.c siphash.c tap.c tcp.c tcp_splice.c udp.c util.c +PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \ + igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \ + passt.c pasta.c pcap.c siphash.c tap.c tcp.c tcp_splice.c udp.c util.c QRAP_SRCS = qrap.c SRCS = $(PASST_SRCS) $(QRAP_SRCS) MANPAGES = passt.1 pasta.1 qrap.1 -PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h icmp.h \ +PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h icmp.h \ inany.h isolation.h lineread.h log.h ndp.h netlink.h packet.h passt.h \ pasta.h pcap.h port_fwd.h siphash.h tap.h tcp.h tcp_conn.h \ tcp_splice.h udp.h util.h diff --git a/flow.c b/flow.c new file mode 100644 index 0000000..c3802ce --- /dev/null +++ b/flow.c @@ -0,0 +1,14 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright Red Hat + * Author: David Gibson + * + * Tracking for logical "flows" of packets. + */ + +#include "flow.h" + +const char *flow_type_str[] = { + [FLOW_NONE] = "<none>", + [FLOW_TCP] = "TCP connection", + [FLOW_TCP_SPLICE] = "TCP connection (spliced)", +}; diff --git a/flow.h b/flow.h new file mode 100644 index 0000000..1afc1e5 --- /dev/null +++ b/flow.h @@ -0,0 +1,29 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright Red Hat + * Author: David Gibson + * + * Tracking for logical "flows" of packets. + */ +#ifndef FLOW_H +#define FLOW_H + +enum flow_type { + FLOW_NONE = 0, + FLOW_TCP, + FLOW_TCP_SPLICE, + FLOW_MAX = FLOW_TCP_SPLICE, +}; + +extern const char *flow_type_str[]; +#define FLOW_TYPE(f) \ + ((f)->type <= FLOW_MAX ? flow_type_str[(f)->type] : "?") + +/** + * struct flow_common - Common fields for packet flows + * @type: Type of packet flow + */ +struct flow_common { + enum flow_type type; +}; + +#endif /* FLOW_H */ diff --git a/tcp.c b/tcp.c index c89e6e4..75930b1 100644 --- a/tcp.c +++ b/tcp.c @@ -302,6 +302,7 @@ #include "tcp_splice.h" #include "log.h" #include "inany.h" +#include "flow.h" #include "tcp_conn.h" @@ -575,7 +576,7 @@ static inline struct tcp_tap_conn *conn_at_idx(int index) { if ((index < 0) || (index >= TCP_MAX_CONNS)) return NULL; - ASSERT(!(CONN(index)->c.spliced)); + ASSERT(CONN(index)->f.type == FLOW_TCP); return CONN(index); } @@ -1313,14 +1314,21 @@ void tcp_table_compact(struct ctx *c, union tcp_conn *hole) from = tc + c->tcp.conn_count; memcpy(hole, from, sizeof(*hole)); - if (from->c.spliced) - tcp_splice_conn_update(c, &hole->splice); - else + switch (from->f.type) { + case FLOW_TCP: tcp_tap_conn_update(c, &from->tap, &hole->tap); + break; + case FLOW_TCP_SPLICE: + tcp_splice_conn_update(c, &hole->splice); + break; + default: + die("Unexpected %s in tcp_table_compact()", + FLOW_TYPE(&from->f)); + } - debug("TCP: table compaction (spliced=%d): old index %li, new index %li, " + debug("TCP: table compaction (%s): old index %li, new index %li, " "from: %p, to: %p", - from->c.spliced, CONN_IDX(from), CONN_IDX(hole), from, hole); + FLOW_TYPE(&from->f), CONN_IDX(from), CONN_IDX(hole), from, hole); memset(from, 0, sizeof(*from)); } @@ -1388,12 +1396,18 @@ void tcp_defer_handler(struct ctx *c) tcp_l2_data_buf_flush(c); for (conn = tc + c->tcp.conn_count - 1; conn >= tc; conn--) { - if (conn->c.spliced) { - if (conn->splice.flags & CLOSING) - tcp_splice_destroy(c, conn); - } else { + switch (conn->f.type) { + case FLOW_TCP: if (conn->tap.events == CLOSED) tcp_conn_destroy(c, conn); + break; + case FLOW_TCP_SPLICE: + if (conn->splice.flags & CLOSING) + tcp_splice_destroy(c, conn); + break; + default: + die("Unexpected %s in tcp_defer_handler()", + FLOW_TYPE(&conn->f)); } } } @@ -2029,7 +2043,7 @@ static void tcp_conn_from_tap(struct ctx *c, } conn = CONN(c->tcp.conn_count++); - conn->c.spliced = false; + conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; conn_event(c, conn, TAP_SYN_RCVD); @@ -2714,7 +2728,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, struct sockaddr *sa, const struct timespec *now) { - conn->c.spliced = false; + conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; @@ -2903,10 +2917,17 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events) { union tcp_conn *conn = tc + ref.tcp.index; - if (conn->c.spliced) - tcp_splice_sock_handler(c, &conn->splice, ref.fd, events); - else + switch (conn->f.type) { + case FLOW_TCP: tcp_tap_sock_handler(c, &conn->tap, events); + break; + case FLOW_TCP_SPLICE: + tcp_splice_sock_handler(c, &conn->splice, ref.fd, events); + break; + default: + die("Unexpected %s in tcp_sock_handler_compact()", + FLOW_TYPE(&conn->f)); + } } /** @@ -3294,11 +3315,17 @@ void tcp_timer(struct ctx *c, const struct timespec *ts) } for (conn = tc + c->tcp.conn_count - 1; conn >= tc; conn--) { - if (conn->c.spliced) { - tcp_splice_timer(c, conn); - } else { + switch (conn->f.type) { + case FLOW_TCP: if (conn->tap.events == CLOSED) tcp_conn_destroy(c, conn); + break; + case FLOW_TCP_SPLICE: + tcp_splice_timer(c, conn); + break; + default: + die("Unexpected %s in tcp_timer()", + FLOW_TYPE(&conn->f)); } } diff --git a/tcp_conn.h b/tcp_conn.h index d67ea62..0074a08 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -9,19 +9,9 @@ #ifndef TCP_CONN_H #define TCP_CONN_H -/** - * struct tcp_conn_common - Common fields for spliced and non-spliced - * @spliced: Is this a spliced connection? - */ -struct tcp_conn_common { - bool spliced :1; -}; - -extern const char *tcp_common_flag_str[]; - /** * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced) - * @c: Fields common with tcp_splice_conn + * @f: Generic flow information * @in_epoll: Is the connection in the epoll set? * @next_index: Connection index of next item in hash chain, -1 for none * @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS @@ -46,8 +36,8 @@ extern const char *tcp_common_flag_str[]; * @seq_init_from_tap: Initial sequence number from tap */ struct tcp_tap_conn { - /* Must be first element to match tcp_splice_conn */ - struct tcp_conn_common c; + /* Must be first element */ + struct flow_common f; bool in_epoll :1; int next_index :TCP_CONN_INDEX_BITS + 2; @@ -121,7 +111,7 @@ struct tcp_tap_conn { /** * struct tcp_splice_conn - Descriptor for a spliced TCP connection - * @c: Fields common with tcp_tap_conn + * @f: Generic flow information * @in_epoll: Is the connection in the epoll set? * @a: File descriptor number of socket for accepted connection * @pipe_a_b: Pipe ends for splice() from @a to @b @@ -135,8 +125,8 @@ struct tcp_tap_conn { * @b_written: Bytes written to @b (not fully written from one @a read) */ struct tcp_splice_conn { - /* Must be first element to match tcp_tap_conn */ - struct tcp_conn_common c; + /* Must be first element */ + struct flow_common f; bool in_epoll :1; int a; @@ -176,7 +166,7 @@ struct tcp_splice_conn { * @splice: Fields specific to spliced connections */ union tcp_conn { - struct tcp_conn_common c; + struct flow_common f; struct tcp_tap_conn tap; struct tcp_splice_conn splice; }; diff --git a/tcp_splice.c b/tcp_splice.c index 5b36975..840d639 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -53,6 +53,7 @@ #include "log.h" #include "tcp_splice.h" #include "inany.h" +#include "flow.h" #include "tcp_conn.h" @@ -511,7 +512,7 @@ bool tcp_splice_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int))) trace("TCP (spliced): failed to set TCP_QUICKACK on %i", s); - conn->c.spliced = true; + conn->f.type = FLOW_TCP_SPLICE; conn->a = s; if (tcp_splice_new(c, conn, ref.port, ref.ns)) -- 2.41.0

David Gibson

7:41 a.m.

New subject: [PATCH v2 02/10] flow, tcp: Move TCP connection table to unified flow table

We want to generalise "connection" tracking to things other than true TCP connections. Continue implenenting this by renaming the TCP connection table to the "flow table" and moving it to flow.c. The definitions are split between flow.h and flow_table.h - we need this separation to avoid circular dependencies: the definitions in flow.h will be needed by many headers using the flow mechanism, but flow_table.h needs all those protocol specific headers in order to define the full flow table entry. Signed-off-by: David Gibson --- Makefile | 8 ++--- flow.c | 11 +++++++ flow.h | 8 +++++ flow_table.h | 25 +++++++++++++++ passt.h | 3 ++ tcp.c | 87 +++++++++++++++++++++++++--------------------------- tcp.h | 5 --- tcp_conn.h | 23 +++----------- tcp_splice.c | 19 ++++++------ 9 files changed, 107 insertions(+), 82 deletions(-) create mode 100644 flow_table.h diff --git a/Makefile b/Makefile index c5a3ce7..73c0ef7 100644 --- a/Makefile +++ b/Makefile @@ -51,10 +51,10 @@ SRCS = $(PASST_SRCS) $(QRAP_SRCS) MANPAGES = passt.1 pasta.1 qrap.1 -PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h icmp.h \ - inany.h isolation.h lineread.h log.h ndp.h netlink.h packet.h passt.h \ - pasta.h pcap.h port_fwd.h siphash.h tap.h tcp.h tcp_conn.h \ - tcp_splice.h udp.h util.h +PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \ + flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \ + netlink.h packet.h passt.h pasta.h pcap.h port_fwd.h siphash.h tap.h \ + tcp.h tcp_conn.h tcp_splice.h udp.h util.h HEADERS = $(PASST_HEADERS) seccomp.h C := \#include \nstruct tcp_info x = { .tcpi_snd_wnd = 0 }; diff --git a/flow.c b/flow.c index c3802ce..864158a 100644 --- a/flow.c +++ b/flow.c @@ -5,10 +5,21 @@ * Tracking for logical "flows" of packets. */ +#include +#include + +#include "util.h" +#include "passt.h" +#include "inany.h" #include "flow.h" +#include "tcp_conn.h" +#include "flow_table.h" const char *flow_type_str[] = { [FLOW_NONE] = "<none>", [FLOW_TCP] = "TCP connection", [FLOW_TCP_SPLICE] = "TCP connection (spliced)", }; + +/* Global Flow Table */ +union flow flowtab[FLOW_MAX]; diff --git a/flow.h b/flow.h index 1afc1e5..ce497cf 100644 --- a/flow.h +++ b/flow.h @@ -26,4 +26,12 @@ struct flow_common { enum flow_type type; }; +#define FLOW_INDEX_BITS 17 /* 128k - 1 */ +#define FLOW_MAX MAX_FROM_BITS(FLOW_INDEX_BITS) + +#define FLOW_TABLE_PRESSURE 30 /* % of FLOW_MAX */ +#define FLOW_FILE_PRESSURE 30 /* % of c->nofile */ + +union flow; + #endif /* FLOW_H */ diff --git a/flow_table.h b/flow_table.h new file mode 100644 index 0000000..c4c646b --- /dev/null +++ b/flow_table.h @@ -0,0 +1,25 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright Red Hat + * Author: David Gibson + * + * Definitions for the global table of packet flows. + */ +#ifndef FLOW_TABLE_H +#define FLOW_TABLE_H + +/** + * union flow - Descriptor for a logical packet flow (e.g. connection) + * @f: Fields common between all variants + * @tcp: Fields for non-spliced TCP connections + * @tcp_splice: Fields for spliced TCP connections +*/ +union flow { + struct flow_common f; + struct tcp_tap_conn tcp; + struct tcp_splice_conn tcp_splice; +}; + +/* Global Flow Table */ +extern union flow flowtab[]; + +#endif /* FLOW_TABLE_H */ diff --git a/passt.h b/passt.h index 282bd1a..023b7e0 100644 --- a/passt.h +++ b/passt.h @@ -220,6 +220,7 @@ struct ip6_ctx { * @pasta_conf_ns: Configure namespace after creating it * @no_copy_routes: Don't copy all routes when configuring target namespace * @no_copy_addrs: Don't copy all addresses when configuring namespace + * @flow_count: Number of tracked packet flows (connections etc.) * @no_tcp: Disable TCP operation * @tcp: Context for TCP protocol handler * @no_tcp: Disable UDP operation @@ -281,6 +282,8 @@ struct ctx { int no_copy_routes; int no_copy_addrs; + unsigned flow_count; + int no_tcp; struct tcp_ctx tcp; int no_udp; diff --git a/tcp.c b/tcp.c index 75930b1..7994197 100644 --- a/tcp.c +++ b/tcp.c @@ -305,14 +305,14 @@ #include "flow.h" #include "tcp_conn.h" +#include "flow_table.h" #define TCP_FRAMES_MEM 128 #define TCP_FRAMES \ (c->mode == MODE_PASST ? TCP_FRAMES_MEM : 1) #define TCP_HASH_TABLE_LOAD 70 /* % */ -#define TCP_HASH_TABLE_SIZE (TCP_MAX_CONNS * 100 / \ - TCP_HASH_TABLE_LOAD) +#define TCP_HASH_TABLE_SIZE (FLOW_MAX * 100 / TCP_HASH_TABLE_LOAD) #define MAX_WS 8 #define MAX_WINDOW (1 << (16 + (MAX_WS))) @@ -561,11 +561,8 @@ tcp6_l2_flags_buf[TCP_FRAMES_MEM]; static unsigned int tcp6_l2_flags_buf_used; -/* TCP connections */ -union tcp_conn tc[TCP_MAX_CONNS]; - -#define CONN(index) (&tc[(index)].tap) -#define CONN_IDX(conn) ((union tcp_conn *)(conn) - tc) +#define CONN(index) (&flowtab[(index)].tcp) +#define CONN_IDX(conn) ((union flow *)(conn) - flowtab) /** conn_at_idx() - Find a connection by index, if present * @index: Index of connection to lookup @@ -574,7 +571,7 @@ union tcp_conn tc[TCP_MAX_CONNS]; */ static inline struct tcp_tap_conn *conn_at_idx(int index) { - if ((index < 0) || (index >= TCP_MAX_CONNS)) + if ((index < 0) || (index >= FLOW_MAX)) return NULL; ASSERT(CONN(index)->f.type == FLOW_TCP); return CONN(index); @@ -1300,26 +1297,26 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, * @c: Execution context * @hole: Pointer to recently closed connection */ -void tcp_table_compact(struct ctx *c, union tcp_conn *hole) +void tcp_table_compact(struct ctx *c, union flow *hole) { - union tcp_conn *from; + union flow *from; - if (CONN_IDX(hole) == --c->tcp.conn_count) { + if (CONN_IDX(hole) == --c->flow_count) { debug("TCP: table compaction: maximum index was %li (%p)", CONN_IDX(hole), hole); memset(hole, 0, sizeof(*hole)); return; } - from = tc + c->tcp.conn_count; + from = flowtab + c->flow_count; memcpy(hole, from, sizeof(*hole)); switch (from->f.type) { case FLOW_TCP: - tcp_tap_conn_update(c, &from->tap, &hole->tap); + tcp_tap_conn_update(c, &from->tcp, &hole->tcp); break; case FLOW_TCP_SPLICE: - tcp_splice_conn_update(c, &hole->splice); + tcp_splice_conn_update(c, &hole->tcp_splice); break; default: die("Unexpected %s in tcp_table_compact()", @@ -1336,18 +1333,18 @@ void tcp_table_compact(struct ctx *c, union tcp_conn *hole) /** * tcp_conn_destroy() - Close sockets, trigger hash table removal and compaction * @c: Execution context - * @conn_union: Connection pointer (container union) + * @flow: Flow table entry for this connection */ -static void tcp_conn_destroy(struct ctx *c, union tcp_conn *conn_union) +static void tcp_conn_destroy(struct ctx *c, union flow *flow) { - struct tcp_tap_conn *conn = &conn_union->tap; + struct tcp_tap_conn *conn = &flow->tcp; close(conn->sock); if (conn->timer != -1) close(conn->timer); tcp_hash_remove(c, conn); - tcp_table_compact(c, conn_union); + tcp_table_compact(c, flow); } static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn); @@ -1390,24 +1387,24 @@ static void tcp_l2_data_buf_flush(struct ctx *c) */ void tcp_defer_handler(struct ctx *c) { - union tcp_conn *conn; + union flow *flow; tcp_l2_flags_buf_flush(c); tcp_l2_data_buf_flush(c); - for (conn = tc + c->tcp.conn_count - 1; conn >= tc; conn--) { - switch (conn->f.type) { + for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) { + switch (flow->f.type) { case FLOW_TCP: - if (conn->tap.events == CLOSED) - tcp_conn_destroy(c, conn); + if (flow->tcp.events == CLOSED) + tcp_conn_destroy(c, flow); break; case FLOW_TCP_SPLICE: - if (conn->splice.flags & CLOSING) - tcp_splice_destroy(c, conn); + if (flow->tcp_splice.flags & CLOSING) + tcp_splice_destroy(c, flow); break; default: die("Unexpected %s in tcp_defer_handler()", - FLOW_TYPE(&conn->f)); + FLOW_TYPE(&flow->f)); } } } @@ -2016,7 +2013,7 @@ static void tcp_conn_from_tap(struct ctx *c, (void)saddr; - if (c->tcp.conn_count >= TCP_MAX_CONNS) + if (c->flow_count >= FLOW_MAX) return; if ((s = tcp_conn_pool_sock(pool)) < 0) @@ -2042,7 +2039,7 @@ static void tcp_conn_from_tap(struct ctx *c, } } - conn = CONN(c->tcp.conn_count++); + conn = CONN(c->flow_count++); conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; @@ -2762,11 +2759,11 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, const struct timespec *now) { struct sockaddr_storage sa; - union tcp_conn *conn; + union flow *flow; socklen_t sl; int s; - if (c->no_tcp || c->tcp.conn_count >= TCP_MAX_CONNS) + if (c->no_tcp || c->flow_count >= FLOW_MAX) return; sl = sizeof(sa); @@ -2779,14 +2776,14 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) return; - conn = tc + c->tcp.conn_count++; + flow = flowtab + c->flow_count++; if (c->mode == MODE_PASTA && - tcp_splice_conn_from_sock(c, ref.tcp_listen, &conn->splice, + tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, s, (struct sockaddr *)&sa)) return; - tcp_tap_conn_from_sock(c, ref.tcp_listen, &conn->tap, s, + tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, (struct sockaddr *)&sa, now); } @@ -2915,18 +2912,18 @@ static void tcp_tap_sock_handler(struct ctx *c, struct tcp_tap_conn *conn, */ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events) { - union tcp_conn *conn = tc + ref.tcp.index; + union flow *flow = flowtab + ref.tcp.index; - switch (conn->f.type) { + switch (flow->f.type) { case FLOW_TCP: - tcp_tap_sock_handler(c, &conn->tap, events); + tcp_tap_sock_handler(c, &flow->tcp, events); break; case FLOW_TCP_SPLICE: - tcp_splice_sock_handler(c, &conn->splice, ref.fd, events); + tcp_splice_sock_handler(c, &flow->tcp_splice, ref.fd, events); break; default: die("Unexpected %s in tcp_sock_handler_compact()", - FLOW_TYPE(&conn->f)); + FLOW_TYPE(&flow->f)); } } @@ -3291,7 +3288,7 @@ static int tcp_port_rebind(void *arg) */ void tcp_timer(struct ctx *c, const struct timespec *ts) { - union tcp_conn *conn; + union flow *flow; (void)ts; @@ -3314,18 +3311,18 @@ void tcp_timer(struct ctx *c, const struct timespec *ts) } } - for (conn = tc + c->tcp.conn_count - 1; conn >= tc; conn--) { - switch (conn->f.type) { + for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) { + switch (flow->f.type) { case FLOW_TCP: - if (conn->tap.events == CLOSED) - tcp_conn_destroy(c, conn); + if (flow->tcp.events == CLOSED) + tcp_conn_destroy(c, flow); break; case FLOW_TCP_SPLICE: - tcp_splice_timer(c, conn); + tcp_splice_timer(c, flow); break; default: die("Unexpected %s in tcp_timer()", - FLOW_TYPE(&conn->f)); + FLOW_TYPE(&flow->f)); } } diff --git a/tcp.h b/tcp.h index 9eaec3f..4c7b8a4 100644 --- a/tcp.h +++ b/tcp.h @@ -8,9 +8,6 @@ #define TCP_TIMER_INTERVAL 1000 /* ms */ -#define TCP_CONN_INDEX_BITS 17 /* 128k - 1 */ -#define TCP_MAX_CONNS MAX_FROM_BITS(TCP_CONN_INDEX_BITS) - struct ctx; void tcp_timer_handler(struct ctx *c, union epoll_ref ref); @@ -55,7 +52,6 @@ union tcp_listen_epoll_ref { /** * struct tcp_ctx - Execution context for TCP routines * @hash_secret: 128-bit secret for hash functions, ISN and hash table - * @conn_count: Count of total connections in connection table * @port_to_tap: Ports bound host-side, packets to tap or spliced * @fwd_in: Port forwarding configuration for inbound packets * @fwd_out: Port forwarding configuration for outbound packets @@ -65,7 +61,6 @@ union tcp_listen_epoll_ref { */ struct tcp_ctx { uint64_t hash_secret[2]; - int conn_count; struct port_fwd fwd_in; struct port_fwd fwd_out; struct timespec timer_run; diff --git a/tcp_conn.h b/tcp_conn.h index 0074a08..a7c7001 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -40,7 +40,7 @@ struct tcp_tap_conn { struct flow_common f; bool in_epoll :1; - int next_index :TCP_CONN_INDEX_BITS + 2; + int next_index :FLOW_INDEX_BITS + 2; #define TCP_RETRANS_BITS 3 unsigned int retrans :TCP_RETRANS_BITS; @@ -159,21 +159,6 @@ struct tcp_splice_conn { uint32_t b_written; }; -/** - * union tcp_conn - Descriptor for a TCP connection (spliced or non-spliced) - * @c: Fields common between all variants - * @tap: Fields specific to non-spliced connections - * @splice: Fields specific to spliced connections -*/ -union tcp_conn { - struct flow_common f; - struct tcp_tap_conn tap; - struct tcp_splice_conn splice; -}; - -/* TCP connections */ -extern union tcp_conn tc[]; - /* Socket pools */ #define TCP_SOCK_POOL_SIZE 32 @@ -181,9 +166,9 @@ extern int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; extern int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new); -void tcp_table_compact(struct ctx *c, union tcp_conn *hole); -void tcp_splice_destroy(struct ctx *c, union tcp_conn *conn_union); -void tcp_splice_timer(struct ctx *c, union tcp_conn *conn_union); +void tcp_table_compact(struct ctx *c, union flow *hole); +void tcp_splice_destroy(struct ctx *c, union flow *flow); +void tcp_splice_timer(struct ctx *c, union flow *flow); int tcp_conn_pool_sock(int pool[]); int tcp_conn_new_sock(const struct ctx *c, sa_family_t af); void tcp_sock_refill_pool(const struct ctx *c, int pool[], int af); diff --git a/tcp_splice.c b/tcp_splice.c index 840d639..72346b8 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -56,6 +56,7 @@ #include "flow.h" #include "tcp_conn.h" +#include "flow_table.h" #define MAX_PIPE_SIZE (8UL * 1024 * 1024) #define TCP_SPLICE_PIPE_POOL_SIZE 16 @@ -75,7 +76,7 @@ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2][2]; #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) #define CONN(index) (&tc[(index)].splice) -#define CONN_IDX(conn) ((union tcp_conn *)(conn) - tc) +#define CONN_IDX(conn) ((union flow *)(conn) - flowtab) /* Display strings for connection events */ static const char *tcp_splice_event_str[] __attribute((__unused__)) = { @@ -263,11 +264,11 @@ void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new) /** * tcp_splice_destroy() - Close spliced connection and pipes, clear * @c: Execution context - * @conn_union: Spliced connection (container union) + * @flow: Flow table entry */ -void tcp_splice_destroy(struct ctx *c, union tcp_conn *conn_union) +void tcp_splice_destroy(struct ctx *c, union flow *flow) { - struct tcp_splice_conn *conn = &conn_union->splice; + struct tcp_splice_conn *conn = &flow->tcp_splice; if (conn->events & SPLICE_ESTABLISHED) { /* Flushing might need to block: don't recycle them. */ @@ -296,7 +297,7 @@ void tcp_splice_destroy(struct ctx *c, union tcp_conn *conn_union) conn->flags = 0; debug("TCP (spliced): index %li, CLOSED", CONN_IDX(conn)); - tcp_table_compact(c, conn_union); + tcp_table_compact(c, flow); } /** @@ -835,14 +836,14 @@ void tcp_splice_init(struct ctx *c) /** * tcp_splice_timer() - Timer for spliced connections * @c: Execution context - * @conn_union: Spliced connection (container union) + * @flow: Flow table entry */ -void tcp_splice_timer(struct ctx *c, union tcp_conn *conn_union) +void tcp_splice_timer(struct ctx *c, union flow *flow) { - struct tcp_splice_conn *conn = &conn_union->splice; + struct tcp_splice_conn *conn = &flow->tcp_splice; if (conn->flags & CLOSING) { - tcp_splice_destroy(c, conn_union); + tcp_splice_destroy(c, flow); return; } -- 2.41.0

David Gibson

7:41 a.m.

New subject: [PATCH v2 03/10] flow, tcp: Consolidate flow pointer<->index helpers

Both tcp.c and tcp_splice.c define CONN_IDX() variants to find the index of their connection structures in the connection table, now become the unified flow table. We can easily combine these into a common helper. While we're there, add some trickery for some additional type safety. They also define their own CONN() versions, which aren't so easily combined since they need to return different types, but we can have them use a common helper. Signed-off-by: David Gibson --- flow_table.h | 20 ++++++++++++++++++++ tcp.c | 49 ++++++++++++++++++++++++------------------------- tcp_conn.h | 2 +- tcp_splice.c | 17 ++++++++--------- 4 files changed, 53 insertions(+), 35 deletions(-) diff --git a/flow_table.h b/flow_table.h index c4c646b..dd4875e 100644 --- a/flow_table.h +++ b/flow_table.h @@ -22,4 +22,24 @@ union flow { /* Global Flow Table */ extern union flow flowtab[]; + +/** flowk_idx - Index of flow from common structure + * @f: Common flow fields pointer + * + * Return: index of @f in the flow table + */ +static inline unsigned flow_idx(const struct flow_common *f) +{ + return (union flow *)f - flowtab; +} + +/** FLOW_IDX - Find the index of a flow + * @f_: Flow pointer, either union flow * or protocol specific + * + * Return: index of @f in the flow table + */ +#define FLOW_IDX(f_) (flow_idx(&(f_)->f)) + +#define FLOW(index) (&flowtab[(index)]) + #endif /* FLOW_TABLE_H */ diff --git a/tcp.c b/tcp.c index 7994197..7d2e89d 100644 --- a/tcp.c +++ b/tcp.c @@ -561,8 +561,7 @@ tcp6_l2_flags_buf[TCP_FRAMES_MEM]; static unsigned int tcp6_l2_flags_buf_used; -#define CONN(index) (&flowtab[(index)].tcp) -#define CONN_IDX(conn) ((union flow *)(conn) - flowtab) +#define CONN(index) (&FLOW(index)->tcp) /** conn_at_idx() - Find a connection by index, if present * @index: Index of connection to lookup @@ -631,7 +630,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn) { int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD; union epoll_ref ref = { .type = EPOLL_TYPE_TCP, .fd = conn->sock, - .tcp.index = CONN_IDX(conn) }; + .tcp.index = FLOW_IDX(conn) }; struct epoll_event ev = { .data.u64 = ref.u64 }; if (conn->events == CLOSED) { @@ -652,7 +651,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->timer != -1) { union epoll_ref ref_t = { .type = EPOLL_TYPE_TCP_TIMER, .fd = conn->sock, - .tcp.index = CONN_IDX(conn) }; + .tcp.index = FLOW_IDX(conn) }; struct epoll_event ev_t = { .data.u64 = ref_t.u64, .events = EPOLLIN | EPOLLET }; @@ -680,7 +679,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->timer == -1) { union epoll_ref ref = { .type = EPOLL_TYPE_TCP_TIMER, .fd = conn->sock, - .tcp.index = CONN_IDX(conn) }; + .tcp.index = FLOW_IDX(conn) }; struct epoll_event ev = { .data.u64 = ref.u64, .events = EPOLLIN | EPOLLET }; int fd; @@ -716,7 +715,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) it.it_value.tv_sec = ACT_TIMEOUT; } - debug("TCP: index %li, timer expires in %lu.%03lus", CONN_IDX(conn), + debug("TCP: index %li, timer expires in %lu.%03lus", FLOW_IDX(conn), it.it_value.tv_sec, it.it_value.tv_nsec / 1000 / 1000); timerfd_settime(conn->timer, 0, &it, NULL); @@ -739,7 +738,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn, conn->flags &= flag; if (flag_index >= 0) { - debug("TCP: index %li: %s dropped", CONN_IDX(conn), + debug("TCP: index %li: %s dropped", FLOW_IDX(conn), tcp_flag_str[flag_index]); } } else { @@ -760,7 +759,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn, conn->flags |= flag; if (flag_index >= 0) { - debug("TCP: index %li: %s", CONN_IDX(conn), + debug("TCP: index %li: %s", FLOW_IDX(conn), tcp_flag_str[flag_index]); } } @@ -810,12 +809,12 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, new += 5; if (prev != new) { - debug("TCP: index %li, %s: %s -> %s", CONN_IDX(conn), + debug("TCP: index %li, %s: %s -> %s", FLOW_IDX(conn), num == -1 ? "CLOSED" : tcp_event_str[num], prev == -1 ? "CLOSED" : tcp_state_str[prev], (new == -1 || num == -1) ? "CLOSED" : tcp_state_str[new]); } else { - debug("TCP: index %li, %s", CONN_IDX(conn), + debug("TCP: index %li, %s", FLOW_IDX(conn), num == -1 ? "CLOSED" : tcp_event_str[num]); } @@ -1200,11 +1199,11 @@ static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn) int b; b = tcp_hash(c, &conn->faddr, conn->eport, conn->fport); - conn->next_index = tc_hash[b] ? CONN_IDX(tc_hash[b]) : -1; + conn->next_index = tc_hash[b] ? FLOW_IDX(tc_hash[b]) : -1U; tc_hash[b] = conn; debug("TCP: hash table insert: index %li, sock %i, bucket: %i, next: " - "%p", CONN_IDX(conn), conn->sock, b, conn_at_idx(conn->next_index)); + "%p", FLOW_IDX(conn), conn->sock, b, conn_at_idx(conn->next_index)); } /** @@ -1230,7 +1229,7 @@ static void tcp_hash_remove(const struct ctx *c, } debug("TCP: hash table remove: index %li, sock %i, bucket: %i, new: %p", - CONN_IDX(conn), conn->sock, b, + FLOW_IDX(conn), conn->sock, b, prev ? conn_at_idx(prev->next_index) : tc_hash[b]); } @@ -1250,7 +1249,7 @@ static void tcp_tap_conn_update(struct ctx *c, struct tcp_tap_conn *old, prev = entry, entry = conn_at_idx(entry->next_index)) { if (entry == old) { if (prev) - prev->next_index = CONN_IDX(new); + prev->next_index = FLOW_IDX(new); else tc_hash[b] = new; break; @@ -1259,7 +1258,7 @@ static void tcp_tap_conn_update(struct ctx *c, struct tcp_tap_conn *old, debug("TCP: hash table update: old index %li, new index %li, sock %i, " "bucket: %i, old: %p, new: %p", - CONN_IDX(old), CONN_IDX(new), new->sock, b, old, new); + FLOW_IDX(old), FLOW_IDX(new), new->sock, b, old, new); tcp_epoll_ctl(c, new); } @@ -1301,9 +1300,9 @@ void tcp_table_compact(struct ctx *c, union flow *hole) { union flow *from; - if (CONN_IDX(hole) == --c->flow_count) { + if (FLOW_IDX(hole) == --c->flow_count) { debug("TCP: table compaction: maximum index was %li (%p)", - CONN_IDX(hole), hole); + FLOW_IDX(hole), hole); memset(hole, 0, sizeof(*hole)); return; } @@ -1325,7 +1324,7 @@ void tcp_table_compact(struct ctx *c, union flow *hole) debug("TCP: table compaction (%s): old index %li, new index %li, " "from: %p, to: %p", - FLOW_TYPE(&from->f), CONN_IDX(from), CONN_IDX(hole), from, hole); + FLOW_TYPE(&from->f), FLOW_IDX(from), FLOW_IDX(hole), from, hole); memset(from, 0, sizeof(*from)); } @@ -1350,7 +1349,7 @@ static void tcp_conn_destroy(struct ctx *c, union flow *flow) static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn); #define tcp_rst(c, conn) \ do { \ - debug("TCP: index %li, reset at %s:%i", CONN_IDX(conn), \ + debug("TCP: index %li, reset at %s:%i", FLOW_IDX(conn), \ __func__, __LINE__); \ tcp_rst_do(c, conn); \ } while (0) @@ -2576,7 +2575,7 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr, return 1; } - trace("TCP: packet length %lu from tap for index %lu", len, CONN_IDX(conn)); + trace("TCP: packet length %lu from tap for index %lu", len, FLOW_IDX(conn)); if (th->rst) { conn_event(c, conn, CLOSED); @@ -2815,17 +2814,17 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - debug("TCP: index %li, handshake timeout", CONN_IDX(conn)); + debug("TCP: index %li, handshake timeout", FLOW_IDX(conn)); tcp_rst(c, conn); } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { - debug("TCP: index %li, FIN timeout", CONN_IDX(conn)); + debug("TCP: index %li, FIN timeout", FLOW_IDX(conn)); tcp_rst(c, conn); } else if (conn->retrans == TCP_MAX_RETRANS) { debug("TCP: index %li, retransmissions count exceeded", - CONN_IDX(conn)); + FLOW_IDX(conn)); tcp_rst(c, conn); } else { - debug("TCP: index %li, ACK timeout, retry", CONN_IDX(conn)); + debug("TCP: index %li, ACK timeout, retry", FLOW_IDX(conn)); conn->retrans++; conn->seq_to_tap = conn->seq_ack_from_tap; tcp_data_from_sock(c, conn); @@ -2843,7 +2842,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref) */ timerfd_settime(conn->timer, 0, &new, &old); if (old.it_value.tv_sec == ACT_TIMEOUT) { - debug("TCP: index %li, activity timeout", CONN_IDX(conn)); + debug("TCP: index %li, activity timeout", FLOW_IDX(conn)); tcp_rst(c, conn); } } diff --git a/tcp_conn.h b/tcp_conn.h index a7c7001..a8c0ae9 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -40,7 +40,7 @@ struct tcp_tap_conn { struct flow_common f; bool in_epoll :1; - int next_index :FLOW_INDEX_BITS + 2; + unsigned next_index :FLOW_INDEX_BITS + 2; #define TCP_RETRANS_BITS 3 unsigned int retrans :TCP_RETRANS_BITS; diff --git a/tcp_splice.c b/tcp_splice.c index 72346b8..2794998 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -75,8 +75,7 @@ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2][2]; #define CONN_V6(x) (x->flags & SPLICE_V6) #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) -#define CONN(index) (&tc[(index)].splice) -#define CONN_IDX(conn) ((union flow *)(conn) - flowtab) +#define CONN(index) (&FLOW(index)->tcp_splice) /* Display strings for connection events */ static const char *tcp_splice_event_str[] __attribute((__unused__)) = { @@ -137,7 +136,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn, conn->flags &= flag; if (flag_index >= 0) { - debug("TCP (spliced): index %li: %s dropped", CONN_IDX(conn), + debug("TCP (spliced): index %li: %s dropped", FLOW_IDX(conn), tcp_splice_flag_str[flag_index]); } } else { @@ -148,7 +147,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn, conn->flags |= flag; if (flag_index >= 0) { - debug("TCP (spliced): index %li: %s", CONN_IDX(conn), + debug("TCP (spliced): index %li: %s", FLOW_IDX(conn), tcp_splice_flag_str[flag_index]); } } @@ -176,9 +175,9 @@ static int tcp_splice_epoll_ctl(const struct ctx *c, { int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD; union epoll_ref ref_a = { .type = EPOLL_TYPE_TCP, .fd = conn->a, - .tcp.index = CONN_IDX(conn) }; + .tcp.index = FLOW_IDX(conn) }; union epoll_ref ref_b = { .type = EPOLL_TYPE_TCP, .fd = conn->b, - .tcp.index = CONN_IDX(conn) }; + .tcp.index = FLOW_IDX(conn) }; struct epoll_event ev_a = { .data.u64 = ref_a.u64 }; struct epoll_event ev_b = { .data.u64 = ref_b.u64 }; uint32_t events_a, events_b; @@ -221,7 +220,7 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn, conn->events &= event; if (flag_index >= 0) { - debug("TCP (spliced): index %li, ~%s", CONN_IDX(conn), + debug("TCP (spliced): index %li, ~%s", FLOW_IDX(conn), tcp_splice_event_str[flag_index]); } } else { @@ -232,7 +231,7 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn, conn->events |= event; if (flag_index >= 0) { - debug("TCP (spliced): index %li, %s", CONN_IDX(conn), + debug("TCP (spliced): index %li, %s", FLOW_IDX(conn), tcp_splice_event_str[flag_index]); } } @@ -295,7 +294,7 @@ void tcp_splice_destroy(struct ctx *c, union flow *flow) conn->events = SPLICE_CLOSED; conn->flags = 0; - debug("TCP (spliced): index %li, CLOSED", CONN_IDX(conn)); + debug("TCP (spliced): index %li, CLOSED", FLOW_IDX(conn)); tcp_table_compact(c, flow); } -- 2.41.0

Stefano Brivio

7 Sep 7 Sep

3:01 a.m.

New subject: [PATCH v2 03/10] flow, tcp: Consolidate flow pointer<->index helpers

On Mon, 28 Aug 2023 15:41:39 +1000 David Gibson wrote:

...

Both tcp.c and tcp_splice.c define CONN_IDX() variants to find the index of their connection structures in the connection table, now become the unified flow table. We can easily combine these into a common helper. While we're there, add some trickery for some additional type safety.

They also define their own CONN() versions, which aren't so easily combined since they need to return different types, but we can have them use a common helper.

Signed-off-by: David Gibson --- flow_table.h | 20 ++++++++++++++++++++ tcp.c | 49 ++++++++++++++++++++++++------------------------- tcp_conn.h | 2 +- tcp_splice.c | 17 ++++++++--------- 4 files changed, 53 insertions(+), 35 deletions(-)

diff --git a/flow_table.h b/flow_table.h index c4c646b..dd4875e 100644 --- a/flow_table.h +++ b/flow_table.h @@ -22,4 +22,24 @@ union flow { /* Global Flow Table */ extern union flow flowtab[];

+ +/** flowk_idx - Index of flow from common structure

"flowk"

...

+ * @f: Common flow fields pointer + * + * Return: index of @f in the flow table + */ +static inline unsigned flow_idx(const struct flow_common *f) +{ + return (union flow *)f - flowtab; +} + +/** FLOW_IDX - Find the index of a flow + * @f_: Flow pointer, either union flow * or protocol specific + * + * Return: index of @f in the flow table + */ +#define FLOW_IDX(f_) (flow_idx(&(f_)->f)) + +#define FLOW(index) (&flowtab[(index)]) + #endif /* FLOW_TABLE_H */ diff --git a/tcp.c b/tcp.c index 7994197..7d2e89d 100644 --- a/tcp.c +++ b/tcp.c @@ -561,8 +561,7 @@ tcp6_l2_flags_buf[TCP_FRAMES_MEM];

static unsigned int tcp6_l2_flags_buf_used;

-#define CONN(index) (&flowtab[(index)].tcp) -#define CONN_IDX(conn) ((union flow *)(conn) - flowtab) +#define CONN(index) (&FLOW(index)->tcp)

Extra parentheses are not needed, but I've been wondering for a bit why you would use "&FLOW" here, as FLOW(x) already resolves to &flowtab[x]... perhaps (&(FLOW(index)->tcp)) is actually easier to read? -- Stefano

David Gibson

5:48 a.m.

New subject: [PATCH v2 03/10] flow, tcp: Consolidate flow pointer<->index helpers

On Thu, Sep 07, 2023 at 03:01:21AM +0200, Stefano Brivio wrote:

...

On Mon, 28 Aug 2023 15:41:39 +1000 David Gibson wrote:

...
Both tcp.c and tcp_splice.c define CONN_IDX() variants to find the index of their connection structures in the connection table, now become the unified flow table. We can easily combine these into a common helper. While we're there, add some trickery for some additional type safety.

They also define their own CONN() versions, which aren't so easily combined since they need to return different types, but we can have them use a common helper.

Signed-off-by: David Gibson --- flow_table.h | 20 ++++++++++++++++++++ tcp.c | 49 ++++++++++++++++++++++++------------------------- tcp_conn.h | 2 +- tcp_splice.c | 17 ++++++++--------- 4 files changed, 53 insertions(+), 35 deletions(-)

diff --git a/flow_table.h b/flow_table.h index c4c646b..dd4875e 100644 --- a/flow_table.h +++ b/flow_table.h @@ -22,4 +22,24 @@ union flow { /* Global Flow Table */ extern union flow flowtab[];

+ +/** flowk_idx - Index of flow from common structure

"flowk"

Oops, fixed.

...

...
+ * @f: Common flow fields pointer + * + * Return: index of @f in the flow table + */ +static inline unsigned flow_idx(const struct flow_common *f) +{ + return (union flow *)f - flowtab; +} + +/** FLOW_IDX - Find the index of a flow + * @f_: Flow pointer, either union flow * or protocol specific + * + * Return: index of @f in the flow table + */ +#define FLOW_IDX(f_) (flow_idx(&(f_)->f)) + +#define FLOW(index) (&flowtab[(index)]) + #endif /* FLOW_TABLE_H */ diff --git a/tcp.c b/tcp.c index 7994197..7d2e89d 100644 --- a/tcp.c +++ b/tcp.c @@ -561,8 +561,7 @@ tcp6_l2_flags_buf[TCP_FRAMES_MEM];

static unsigned int tcp6_l2_flags_buf_used;

-#define CONN(index) (&flowtab[(index)].tcp) -#define CONN_IDX(conn) ((union flow *)(conn) - flowtab) +#define CONN(index) (&FLOW(index)->tcp)

Extra parentheses are not needed, but I've been wondering for a bit why you would use "&FLOW" here, as FLOW(x) already resolves to &flowtab[x]... perhaps (&(FLOW(index)->tcp)) is actually easier to read?

Makes sense to me, done. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson

David Gibson

28 Aug 28 Aug

7:41 a.m.

New subject: [PATCH v2 04/10] flow: Make unified version of flow table compaction

tcp_table_compact() will move entries in the connection/flow table to keep it compact when other entries are removed. The moved entries need not have the same type as the flow removed, so it needs to be able to handle moving any type of flow. Therefore, move it to flow.c rather than being purportedly TCP specific. Signed-off-by: David Gibson --- flow.c | 38 ++++++++++++++++++++++++++++++++++++++ flow.h | 2 ++ tcp.c | 44 +++----------------------------------------- tcp_conn.h | 3 ++- tcp_splice.c | 2 +- 5 files changed, 46 insertions(+), 43 deletions(-) diff --git a/flow.c b/flow.c index 864158a..12ca8db 100644 --- a/flow.c +++ b/flow.c @@ -23,3 +23,41 @@ const char *flow_type_str[] = { /* Global Flow Table */ union flow flowtab[FLOW_MAX]; + +/** + * flow_table_compact() - Perform compaction on flow table + * @c: Execution context + * @hole: Pointer to recently closed flow + */ +void flow_table_compact(struct ctx *c, union flow *hole) +{ + union flow *from; + + if (FLOW_IDX(hole) == --c->flow_count) { + debug("flow: table compaction: maximum index was %li (%p)", + FLOW_IDX(hole), hole); + memset(hole, 0, sizeof(*hole)); + return; + } + + from = flowtab + c->flow_count; + memcpy(hole, from, sizeof(*hole)); + + switch (from->f.type) { + case FLOW_TCP: + tcp_tap_conn_update(c, &from->tcp, &hole->tcp); + break; + case FLOW_TCP_SPLICE: + tcp_splice_conn_update(c, &hole->tcp_splice); + break; + default: + die("Unexpected %s in tcp_table_compact()", + FLOW_TYPE(&from->f)); + } + + debug("flow: table compaction (%s): old index %li, new index %li, " + "from: %p, to: %p", + FLOW_TYPE(&from->f), FLOW_IDX(from), FLOW_IDX(hole), from, hole); + + memset(from, 0, sizeof(*from)); +} diff --git a/flow.h b/flow.h index ce497cf..e212796 100644 --- a/flow.h +++ b/flow.h @@ -34,4 +34,6 @@ struct flow_common { union flow; +void flow_table_compact(struct ctx *c, union flow *hole); + #endif /* FLOW_H */ diff --git a/tcp.c b/tcp.c index 7d2e89d..722a613 100644 --- a/tcp.c +++ b/tcp.c @@ -1239,8 +1239,8 @@ static void tcp_hash_remove(const struct ctx *c, * @old: Old location of tcp_tap_conn * @new: New location of tcp_tap_conn */ -static void tcp_tap_conn_update(struct ctx *c, struct tcp_tap_conn *old, - struct tcp_tap_conn *new) +void tcp_tap_conn_update(struct ctx *c, struct tcp_tap_conn *old, + struct tcp_tap_conn *new) { struct tcp_tap_conn *entry, *prev = NULL; int b = tcp_conn_hash(c, old); @@ -1291,44 +1291,6 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, return NULL; } -/** - * tcp_table_compact() - Perform compaction on connection table - * @c: Execution context - * @hole: Pointer to recently closed connection - */ -void tcp_table_compact(struct ctx *c, union flow *hole) -{ - union flow *from; - - if (FLOW_IDX(hole) == --c->flow_count) { - debug("TCP: table compaction: maximum index was %li (%p)", - FLOW_IDX(hole), hole); - memset(hole, 0, sizeof(*hole)); - return; - } - - from = flowtab + c->flow_count; - memcpy(hole, from, sizeof(*hole)); - - switch (from->f.type) { - case FLOW_TCP: - tcp_tap_conn_update(c, &from->tcp, &hole->tcp); - break; - case FLOW_TCP_SPLICE: - tcp_splice_conn_update(c, &hole->tcp_splice); - break; - default: - die("Unexpected %s in tcp_table_compact()", - FLOW_TYPE(&from->f)); - } - - debug("TCP: table compaction (%s): old index %li, new index %li, " - "from: %p, to: %p", - FLOW_TYPE(&from->f), FLOW_IDX(from), FLOW_IDX(hole), from, hole); - - memset(from, 0, sizeof(*from)); -} - /** * tcp_conn_destroy() - Close sockets, trigger hash table removal and compaction * @c: Execution context @@ -1343,7 +1305,7 @@ static void tcp_conn_destroy(struct ctx *c, union flow *flow) close(conn->timer); tcp_hash_remove(c, conn); - tcp_table_compact(c, flow); + flow_table_compact(c, flow); } static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn); diff --git a/tcp_conn.h b/tcp_conn.h index a8c0ae9..4e7c7fc 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -165,8 +165,9 @@ struct tcp_splice_conn { extern int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; extern int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; +void tcp_tap_conn_update(struct ctx *c, struct tcp_tap_conn *old, + struct tcp_tap_conn *new); void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new); -void tcp_table_compact(struct ctx *c, union flow *hole); void tcp_splice_destroy(struct ctx *c, union flow *flow); void tcp_splice_timer(struct ctx *c, union flow *flow); int tcp_conn_pool_sock(int pool[]); diff --git a/tcp_splice.c b/tcp_splice.c index 2794998..34cb774 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -296,7 +296,7 @@ void tcp_splice_destroy(struct ctx *c, union flow *flow) conn->flags = 0; debug("TCP (spliced): index %li, CLOSED", FLOW_IDX(conn)); - tcp_table_compact(c, flow); + flow_table_compact(c, flow); } /** -- 2.41.0

David Gibson

7:41 a.m.

New subject: [PATCH v2 05/10] flow: Introduce struct flowside, space for uniform tracking of addresses

Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets). To allow more general abd robust handling, and more consistency across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host. Introduce 'struct flowside' to keep track of the address and ports of a flow from a single "side" (guest or host). Store two of these in the common fields of a flow to track that information for both sides. For now we just introduce the structure and fields themselves, along with some simple helpers. Later patches will actually use these to store useful information. Signed-off-by: David Gibson --- flow.c | 22 ++++++++++++++++++++++ flow.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 70 insertions(+) diff --git a/flow.c b/flow.c index 12ca8db..a93cf8c 100644 --- a/flow.c +++ b/flow.c @@ -7,6 +7,7 @@ #include #include +#include #include "util.h" #include "passt.h" @@ -24,6 +25,27 @@ const char *flow_type_str[] = { /* Global Flow Table */ union flow flowtab[FLOW_MAX]; +/** flowside_fmt - Format a flowside as a string + * @fs: flowside to format + * @buf: Buffer into which to store the formatted version + * @size: Size of @buf + * + * Return: pointer to formatted string describing @fs, or NULL on error + */ +/* cppcheck-suppress unusedFunction */ +const char *flowside_fmt(const struct flowside *fs, char *buf, size_t size) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + + if (!inet_ntop(AF_INET6, &fs->eaddr, ebuf, sizeof(ebuf)) + || !inet_ntop(AF_INET6, &fs->faddr, fbuf, sizeof(fbuf))) + return NULL; + + snprintf(buf, size, "[%s]:%hu <-> [%s]:%hu", fbuf, fs->fport, + ebuf, fs->eport); + return (const char *)buf; +} + /** * flow_table_compact() - Perform compaction on flow table * @c: Execution context diff --git a/flow.h b/flow.h index e212796..9891fcb 100644 --- a/flow.h +++ b/flow.h @@ -18,11 +18,59 @@ extern const char *flow_type_str[]; #define FLOW_TYPE(f) \ ((f)->type <= FLOW_MAX ? flow_type_str[(f)->type] : "?") +/** + * struct flowside - Describes a logical packet flow as seen from one "side" + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport, eport; +}; + +/** flowside_from_af - Initialize a flowside from addresses + * @fs: flowside to initialize + * @af: Address family (AF_INET or AF_INET6) + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + */ +static inline void flowside_from_af(struct flowside *fs, int af, + const void *faddr, in_port_t fport, + const void *eaddr, in_port_t eport) +{ + inany_from_af(&fs->faddr, af, faddr); + inany_from_af(&fs->eaddr, af, eaddr); + fs->fport = fport; + fs->eport = eport; +} + +/** flowside_complete - Check if flowside is fully initialized + * @fs: flowside to check + */ +static inline bool flowside_complete(const struct flowside *fs) +{ + return !IN6_IS_ADDR_UNSPECIFIED(&fs->faddr) && + !IN6_IS_ADDR_UNSPECIFIED(&fs->eaddr) && + fs->fport != 0 && fs->eport != 0; +} + +#define FLOWSIDE_STRLEN (2*(INET6_ADDRSTRLEN+8) + 6) + +const char *flowside_fmt(const struct flowside *fs, char *buf, size_t size); + /** * struct flow_common - Common fields for packet flows + * @side[]: Information on the flow for each side. Flow types can have + * their own conventions about which side is which * @type: Type of packet flow */ struct flow_common { + struct flowside side[2]; enum flow_type type; }; -- 2.41.0

Stefano Brivio

7 Sep 7 Sep

3:01 a.m.

New subject: [PATCH v2 05/10] flow: Introduce struct flowside, space for uniform tracking of addresses

On Mon, 28 Aug 2023 15:41:41 +1000 David Gibson wrote:

...

Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets).

To allow more general abd robust handling, and more consistency across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host.

Introduce 'struct flowside' to keep track of the address and ports of a flow from a single "side" (guest or host). Store two of these in the common fields of a flow to track that information for both sides.

For now we just introduce the structure and fields themselves, along with some simple helpers. Later patches will actually use these to store useful information.

Signed-off-by: David Gibson --- flow.c | 22 ++++++++++++++++++++++ flow.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 70 insertions(+)

diff --git a/flow.c b/flow.c index 12ca8db..a93cf8c 100644 --- a/flow.c +++ b/flow.c @@ -7,6 +7,7 @@

#include #include +#include

#include "util.h" #include "passt.h" @@ -24,6 +25,27 @@ const char *flow_type_str[] = { /* Global Flow Table */ union flow flowtab[FLOW_MAX];

+/** flowside_fmt - Format a flowside as a string + * @fs: flowside to format + * @buf: Buffer into which to store the formatted version + * @size: Size of @buf + * + * Return: pointer to formatted string describing @fs, or NULL on error + */ +/* cppcheck-suppress unusedFunction */ +const char *flowside_fmt(const struct flowside *fs, char *buf, size_t size) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + + if (!inet_ntop(AF_INET6, &fs->eaddr, ebuf, sizeof(ebuf)) + || !inet_ntop(AF_INET6, &fs->faddr, fbuf, sizeof(fbuf)))

For consistency (also with flowside_complete() below): if (!inet_ntop(AF_INET6, &fs->eaddr, ebuf, sizeof(ebuf)) || !inet_ntop(AF_INET6, &fs->faddr, fbuf, sizeof(fbuf)))

...

+ return NULL; + + snprintf(buf, size, "[%s]:%hu <-> [%s]:%hu", fbuf, fs->fport, + ebuf, fs->eport); + return (const char *)buf; +} + /** * flow_table_compact() - Perform compaction on flow table * @c: Execution context diff --git a/flow.h b/flow.h index e212796..9891fcb 100644 --- a/flow.h +++ b/flow.h @@ -18,11 +18,59 @@ extern const char *flow_type_str[]; #define FLOW_TYPE(f) \ ((f)->type <= FLOW_MAX ? flow_type_str[(f)->type] : "?")

+/** + * struct flowside - Describes a logical packet flow as seen from one "side" + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport, eport;

I guess always valid, but uncommon (compared to in_port_t x; in_port_t y;)?

...

+}; + +/** flowside_from_af - Initialize a flowside from addresses + * @fs: flowside to initialize + * @af: Address family (AF_INET or AF_INET6) + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + */ +static inline void flowside_from_af(struct flowside *fs, int af, + const void *faddr, in_port_t fport, + const void *eaddr, in_port_t eport) +{ + inany_from_af(&fs->faddr, af, faddr); + inany_from_af(&fs->eaddr, af, eaddr); + fs->fport = fport; + fs->eport = eport; +} + +/** flowside_complete - Check if flowside is fully initialized + * @fs: flowside to check + */ +static inline bool flowside_complete(const struct flowside *fs) +{ + return !IN6_IS_ADDR_UNSPECIFIED(&fs->faddr) && + !IN6_IS_ADDR_UNSPECIFIED(&fs->eaddr) && + fs->fport != 0 && fs->eport != 0;

Zero is reserved for TCP (which could be problematic anyway if we try to match things), but for UDP a "zero" port can actually be used (in the probably desuete sense of "no port"). Maybe we should use -1 instead?

...

+} + +#define FLOWSIDE_STRLEN (2*(INET6_ADDRSTRLEN+8) + 6)

For consistency: "(2 * INET6_ADDRSTRLEN + 8) + 6)". Limited to the usage I've seen in 6/10 (maybe I'm ignoring something): is it worth it to have flowside_fmt() as a function forming a string, rather than something calling debug() with what we want...? At the moment we have tap_packet_debug(), admittedly not very elegant but perhaps more terse than this. -- Stefano

David Gibson

6:05 a.m.

New subject: [PATCH v2 05/10] flow: Introduce struct flowside, space for uniform tracking of addresses

On Thu, Sep 07, 2023 at 03:01:40AM +0200, Stefano Brivio wrote:

...

On Mon, 28 Aug 2023 15:41:41 +1000 David Gibson wrote:

...
Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets).

To allow more general abd robust handling, and more consistency across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host.

Introduce 'struct flowside' to keep track of the address and ports of a flow from a single "side" (guest or host). Store two of these in the common fields of a flow to track that information for both sides.

For now we just introduce the structure and fields themselves, along with some simple helpers. Later patches will actually use these to store useful information.

Signed-off-by: David Gibson --- flow.c | 22 ++++++++++++++++++++++ flow.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 70 insertions(+)

diff --git a/flow.c b/flow.c index 12ca8db..a93cf8c 100644 --- a/flow.c +++ b/flow.c @@ -7,6 +7,7 @@

#include #include +#include

#include "util.h" #include "passt.h" @@ -24,6 +25,27 @@ const char *flow_type_str[] = { /* Global Flow Table */ union flow flowtab[FLOW_MAX];

+/** flowside_fmt - Format a flowside as a string + * @fs: flowside to format + * @buf: Buffer into which to store the formatted version + * @size: Size of @buf + * + * Return: pointer to formatted string describing @fs, or NULL on error + */ +/* cppcheck-suppress unusedFunction */ +const char *flowside_fmt(const struct flowside *fs, char *buf, size_t size) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + + if (!inet_ntop(AF_INET6, &fs->eaddr, ebuf, sizeof(ebuf)) + || !inet_ntop(AF_INET6, &fs->faddr, fbuf, sizeof(fbuf)))

For consistency (also with flowside_complete() below):

if (!inet_ntop(AF_INET6, &fs->eaddr, ebuf, sizeof(ebuf)) || !inet_ntop(AF_INET6, &fs->faddr, fbuf, sizeof(fbuf)))

Done.

...

...
+ return NULL; + + snprintf(buf, size, "[%s]:%hu <-> [%s]:%hu", fbuf, fs->fport, + ebuf, fs->eport); + return (const char *)buf; +} + /** * flow_table_compact() - Perform compaction on flow table * @c: Execution context diff --git a/flow.h b/flow.h index e212796..9891fcb 100644 --- a/flow.h +++ b/flow.h @@ -18,11 +18,59 @@ extern const char *flow_type_str[]; #define FLOW_TYPE(f) \ ((f)->type <= FLOW_MAX ? flow_type_str[(f)->type] : "?")

+/** + * struct flowside - Describes a logical packet flow as seen from one "side" + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport, eport;

I guess always valid, but uncommon (compared to in_port_t x; in_port_t y;)?

Huh.. I didn't even think about that (and the fact that I did this for the ports, but not for the addresses). Changed it to the more conventional style.

...

...
+}; + +/** flowside_from_af - Initialize a flowside from addresses + * @fs: flowside to initialize + * @af: Address family (AF_INET or AF_INET6) + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + */ +static inline void flowside_from_af(struct flowside *fs, int af, + const void *faddr, in_port_t fport, + const void *eaddr, in_port_t eport) +{ + inany_from_af(&fs->faddr, af, faddr); + inany_from_af(&fs->eaddr, af, eaddr); + fs->fport = fport; + fs->eport = eport; +} + +/** flowside_complete - Check if flowside is fully initialized + * @fs: flowside to check + */ +static inline bool flowside_complete(const struct flowside *fs) +{ + return !IN6_IS_ADDR_UNSPECIFIED(&fs->faddr) && + !IN6_IS_ADDR_UNSPECIFIED(&fs->eaddr) && + fs->fport != 0 && fs->eport != 0;

Zero is reserved for TCP (which could be problematic anyway if we try to match things),

The more practical consideration here is that a 0 port is used in the sockaddr passed to bind() to represent "pick a port for me". The point of this is to check that we have a "fully specified" flowside, that provides sufficient information to both bind() and connect() a socket with no ambiguity on the endpoints.

...

but for UDP a "zero" port can actually be used (in the probably desuete sense of "no port").

[Aside: I've never encountered the word desuete before] By "no port" here are you meaning for UDP traffic that expects no response? If that's so we probably neither need or want to create a flow for it anyway. In any case, even if port 0 can be used at the protocol level, I don't think it can be used at the socket level: I'm pretty sure the bind() behaviour of treating 0 as "pick for me" is true for UDP as well as TCP - it's functionality that's basically necessary, and I can't see any other way to specify it.

...

Maybe we should use -1 instead?

That doesn't really help - port 65535 is itself valid, so unless we widen these fields - which I don't really want to do - it's still ambiguous (in fact, worse, because AFAICT port 65535 could actually be used in practice). Even if we did widen, it's still a problem, because if we got the port from, for example, getsockname() on a not-yet-connected socket, that will give us 0 and the point of this function is to tell us that's not a fully specified endpoint.

...

...
+} + +#define FLOWSIDE_STRLEN (2*(INET6_ADDRSTRLEN+8) + 6)

For consistency: "(2 * INET6_ADDRSTRLEN + 8) + 6)".

Done.

...

Limited to the usage I've seen in 6/10 (maybe I'm ignoring something): is it worth it to have flowside_fmt() as a function forming a string, rather than something calling debug() with what we want...? At the moment we have tap_packet_debug(), admittedly not very elegant but perhaps more terse than this.

There's at least one more use coming in the remainder of the series, and I'd expect to see more as other protocols are added to the flow mechanics. I also think it could be a very useful helper when adding ad-hoc debugging. So, yes, I think it's worth it. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson

Stefano Brivio

9:55 a.m.

New subject: [PATCH v2 05/10] flow: Introduce struct flowside, space for uniform tracking of addresses

On Thu, 7 Sep 2023 14:05:37 +1000 David Gibson wrote:

...

On Thu, Sep 07, 2023 at 03:01:40AM +0200, Stefano Brivio wrote:

...
On Mon, 28 Aug 2023 15:41:41 +1000 David Gibson wrote:

...
Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets).

To allow more general abd robust handling, and more consistency across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host.

Introduce 'struct flowside' to keep track of the address and ports of a flow from a single "side" (guest or host). Store two of these in the common fields of a flow to track that information for both sides.

For now we just introduce the structure and fields themselves, along with some simple helpers. Later patches will actually use these to store useful information.

Signed-off-by: David Gibson --- flow.c | 22 ++++++++++++++++++++++ flow.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 70 insertions(+)

diff --git a/flow.c b/flow.c index 12ca8db..a93cf8c 100644 --- a/flow.c +++ b/flow.c @@ -7,6 +7,7 @@

#include #include +#include

#include "util.h" #include "passt.h" @@ -24,6 +25,27 @@ const char *flow_type_str[] = { /* Global Flow Table */ union flow flowtab[FLOW_MAX];

+/** flowside_fmt - Format a flowside as a string + * @fs: flowside to format + * @buf: Buffer into which to store the formatted version + * @size: Size of @buf + * + * Return: pointer to formatted string describing @fs, or NULL on error + */ +/* cppcheck-suppress unusedFunction */ +const char *flowside_fmt(const struct flowside *fs, char *buf, size_t size) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + + if (!inet_ntop(AF_INET6, &fs->eaddr, ebuf, sizeof(ebuf)) + || !inet_ntop(AF_INET6, &fs->faddr, fbuf, sizeof(fbuf)))

For consistency (also with flowside_complete() below):

if (!inet_ntop(AF_INET6, &fs->eaddr, ebuf, sizeof(ebuf)) || !inet_ntop(AF_INET6, &fs->faddr, fbuf, sizeof(fbuf)))

Done.

...
...
+ return NULL; + + snprintf(buf, size, "[%s]:%hu <-> [%s]:%hu", fbuf, fs->fport, + ebuf, fs->eport); + return (const char *)buf; +} + /** * flow_table_compact() - Perform compaction on flow table * @c: Execution context diff --git a/flow.h b/flow.h index e212796..9891fcb 100644 --- a/flow.h +++ b/flow.h @@ -18,11 +18,59 @@ extern const char *flow_type_str[]; #define FLOW_TYPE(f) \ ((f)->type <= FLOW_MAX ? flow_type_str[(f)->type] : "?")

+/** + * struct flowside - Describes a logical packet flow as seen from one "side" + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport, eport;

I guess always valid, but uncommon (compared to in_port_t x; in_port_t y;)?

Huh.. I didn't even think about that (and the fact that I did this for the ports, but not for the addresses). Changed it to the more conventional style.

...
...
+}; + +/** flowside_from_af - Initialize a flowside from addresses + * @fs: flowside to initialize + * @af: Address family (AF_INET or AF_INET6) + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + */ +static inline void flowside_from_af(struct flowside *fs, int af, + const void *faddr, in_port_t fport, + const void *eaddr, in_port_t eport) +{ + inany_from_af(&fs->faddr, af, faddr); + inany_from_af(&fs->eaddr, af, eaddr); + fs->fport = fport; + fs->eport = eport; +} + +/** flowside_complete - Check if flowside is fully initialized + * @fs: flowside to check + */ +static inline bool flowside_complete(const struct flowside *fs) +{ + return !IN6_IS_ADDR_UNSPECIFIED(&fs->faddr) && + !IN6_IS_ADDR_UNSPECIFIED(&fs->eaddr) && + fs->fport != 0 && fs->eport != 0;

Zero is reserved for TCP (which could be problematic anyway if we try to match things),

The more practical consideration here is that a 0 port is used in the sockaddr passed to bind() to represent "pick a port for me". The point of this is to check that we have a "fully specified" flowside, that provides sufficient information to both bind() and connect() a socket with no ambiguity on the endpoints.

...
but for UDP a "zero" port can actually be used (in the probably desuete sense of "no port").

[Aside: I've never encountered the word desuete before]

By "no port" here are you meaning for UDP traffic that expects no response? If that's so we probably neither need or want to create a flow for it anyway.

The fact that no response is expected is probably a practical consequence of this... but RFC 768, "Fields", really says: Source Port is an optional field, when meaningful, it indicates the port of the sending process, and may be assumed to be the port to which a reply should be addressed in the absence of any other information. If not used, a value of zero is inserted.

...

In any case, even if port 0 can be used at the protocol level, I don't think it can be used at the socket level: I'm pretty sure the bind() behaviour of treating 0 as "pick for me" is true for UDP as well as TCP - it's functionality that's basically necessary, and I can't see any other way to specify it.

Right, yes.

...

...
Maybe we should use -1 instead?

That doesn't really help - port 65535 is itself valid, so unless we widen these fields - which I don't really want to do - it's still ambiguous (in fact, worse, because AFAICT port 65535 could actually be used in practice).

Oops, sorry, I didn't consider that. Of course.

...

Even if we did widen, it's still a problem, because if we got the port from, for example, getsockname() on a not-yet-connected socket, that will give us 0 and the point of this function is to tell us that's not a fully specified endpoint.

Right... forget about this, I didn't consider that.

...

...
...
+} + +#define FLOWSIDE_STRLEN (2*(INET6_ADDRSTRLEN+8) + 6)

For consistency: "(2 * INET6_ADDRSTRLEN + 8) + 6)".

Done.

...
Limited to the usage I've seen in 6/10 (maybe I'm ignoring something): is it worth it to have flowside_fmt() as a function forming a string, rather than something calling debug() with what we want...? At the moment we have tap_packet_debug(), admittedly not very elegant but perhaps more terse than this.

There's at least one more use coming in the remainder of the series, and I'd expect to see more as other protocols are added to the flow mechanics. I also think it could be a very useful helper when adding ad-hoc debugging. So, yes, I think it's worth it.

Ah, okay. -- Stefano

David Gibson

28 Aug 28 Aug

7:41 a.m.

New subject: [PATCH v2 06/10] tcp: Move guest side address tracking to flow/flowside

tcp_tap_conn has several fields to track addresses and ports as seen by the guest/namespace. However we now have general fields for this in the common flowside fields of struct flow. Use those instead of protocol specific fields. So far we've only explicitly tracked the guest side forwarding address in the TCP connection - the remote address from the guest's point of view. The tap side endpoint address - the local address from the guest's point of view - was assumed to always be one of the handful of guest addresses we track as addr_seen (one each for IPv4, IPv6 global and IPv6 link-local). struct flowside expects both addresses, and we will want to use the endpoint address in future. So, determine that address and store it as part of the flowside. Signed-off-by: David Gibson --- flow.c | 1 - tcp.c | 80 +++++++++++++++++++++++++++++------------------------- tcp_conn.h | 6 +--- 3 files changed, 44 insertions(+), 43 deletions(-) diff --git a/flow.c b/flow.c index a93cf8c..d7264f8 100644 --- a/flow.c +++ b/flow.c @@ -32,7 +32,6 @@ union flow flowtab[FLOW_MAX]; * * Return: pointer to formatted string describing @fs, or NULL on error */ -/* cppcheck-suppress unusedFunction */ const char *flowside_fmt(const struct flowside *fs, char *buf, size_t size) { char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; diff --git a/tcp.c b/tcp.c index 722a613..16b930e 100644 --- a/tcp.c +++ b/tcp.c @@ -397,7 +397,9 @@ struct tcp6_l2_head { /* For MSS6 macro: keep in sync with tcp6_l2_buf_t */ #define OPT_SACK 5 #define OPT_TS 8 -#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) +#define TAPSIDE(conn) (&(conn)->f.side[1]) + +#define CONN_V4(conn) (!!inany_v4(&TAPSIDE(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) #define CONN_IS_CLOSING(conn) \ ((conn->events & ESTABLISHED) && \ @@ -844,7 +846,7 @@ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) int i; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&TAPSIDE(conn)->faddr, low_rtt_dst + i)) return 1; return 0; @@ -866,7 +868,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&TAPSIDE(conn)->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -878,7 +880,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return; - low_rtt_dst[hole++] = conn->faddr; + low_rtt_dst[hole++] = TAPSIDE(conn)->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -1143,8 +1145,8 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->faddr, faddr) && - conn->eport == eport && conn->fport == fport) + if (inany_equals(&TAPSIDE(conn)->faddr, faddr) && + TAPSIDE(conn)->eport == eport && TAPSIDE(conn)->fport == fport) return 1; return 0; @@ -1186,7 +1188,8 @@ static unsigned int tcp_hash(const struct ctx *c, const union inany_addr *faddr, static unsigned int tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + return tcp_hash(c, &TAPSIDE(conn)->faddr, + TAPSIDE(conn)->eport, TAPSIDE(conn)->fport); } /** @@ -1198,7 +1201,8 @@ static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn) { int b; - b = tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + b = tcp_hash(c, &TAPSIDE(conn)->faddr, + TAPSIDE(conn)->eport, TAPSIDE(conn)->fport); conn->next_index = tc_hash[b] ? FLOW_IDX(tc_hash[b]) : -1U; tc_hash[b] = conn; @@ -1386,13 +1390,13 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c, void *p, size_t plen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct in_addr *a4 = inany_v4(&TAPSIDE(conn)->faddr); size_t ip_len, tlen; #define SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq) \ do { \ - b->th.source = htons(conn->fport); \ - b->th.dest = htons(conn->eport); \ + b->th.source = htons(TAPSIDE(conn)->fport); \ + b->th.dest = htons(TAPSIDE(conn)->eport); \ b->th.seq = htonl(seq); \ b->th.ack_seq = htonl(conn->seq_ack_to_tap); \ if (conn->events & ESTABLISHED) { \ @@ -1410,7 +1414,7 @@ do { \ ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr); b->iph.tot_len = htons(ip_len); b->iph.saddr = a4->s_addr; - b->iph.daddr = c->ip4.addr_seen.s_addr; + b->iph.daddr = inany_v4(&TAPSIDE(conn)->eaddr)->s_addr; if (check) b->iph.check = *check; @@ -1428,11 +1432,8 @@ do { \ ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr); b->ip6h.payload_len = htons(plen + sizeof(struct tcphdr)); - b->ip6h.saddr = conn->faddr.a6; - if (IN6_IS_ADDR_LINKLOCAL(&b->ip6h.saddr)) - b->ip6h.daddr = c->ip6.addr_ll_seen; - else - b->ip6h.daddr = c->ip6.addr_seen; + b->ip6h.saddr = TAPSIDE(conn)->faddr.a6; + b->ip6h.daddr = TAPSIDE(conn)->eaddr.a6; memset(b->ip6h.flow_lbl, 0, 3); @@ -1781,31 +1782,25 @@ static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn, /** * tcp_seq_init() - Calculate initial sequence number according to RFC 6528 * @c: Execution context - * @conn: TCP connection, with faddr, fport and eport populated + * @conn: TCP connection, with faddr, fport, eaddr, eport populated * @now: Current timestamp */ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, const struct timespec *now) { - union inany_addr aany; struct { union inany_addr src; in_port_t srcport; union inany_addr dst; in_port_t dstport; } __attribute__((__packed__)) in = { - .src = conn->faddr, - .srcport = conn->fport, - .dstport = conn->eport, + .src = TAPSIDE(conn)->faddr, + .srcport = TAPSIDE(conn)->fport, + .dst = TAPSIDE(conn)->eaddr, + .dstport = TAPSIDE(conn)->eport, }; uint32_t ns, seq = 0; - if (CONN_V4(conn)) - inany_from_af(&aany, AF_INET, &c->ip4.addr); - else - inany_from_af(&aany, AF_INET6, &c->ip6.addr); - in.dst = aany; - seq = siphash_36b((uint8_t *)&in, c->tcp.hash_secret); /* 32ns ticks, overflows 32 bits every 137s */ @@ -1967,13 +1962,12 @@ static void tcp_conn_from_tap(struct ctx *c, .sin6_port = th->dest, .sin6_addr = *(struct in6_addr *)daddr, }; + char fsstr[FLOWSIDE_STRLEN]; const struct sockaddr *sa; struct tcp_tap_conn *conn; socklen_t sl; int s, mss; - (void)saddr; - if (c->flow_count >= FLOW_MAX) return; @@ -2021,7 +2015,12 @@ static void tcp_conn_from_tap(struct ctx *c, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1; - inany_from_af(&conn->faddr, af, daddr); + flowside_from_af(TAPSIDE(conn), af, daddr, ntohs(th->dest), + saddr, ntohs(th->source)); + ASSERT(flowside_complete(TAPSIDE(conn))); + + debug("TCP: index %li, new connection from tap, %s", FLOW_IDX(conn), + flowside_fmt(TAPSIDE(conn), fsstr, sizeof(fsstr))); if (af == AF_INET) { sa = (struct sockaddr *)&addr4; @@ -2031,9 +2030,6 @@ static void tcp_conn_from_tap(struct ctx *c, sl = sizeof(addr6); } - conn->fport = ntohs(th->dest); - conn->eport = ntohs(th->source); - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2692,10 +2688,20 @@ static void tcp_tap_conn_from_sock(struct ctx *c, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = ref.port; + inany_from_sockaddr(&TAPSIDE(conn)->faddr, &TAPSIDE(conn)->fport, sa); + tcp_snat_inbound(c, &TAPSIDE(conn)->faddr); + + if (CONN_V4(conn)) { + inany_from_af(&TAPSIDE(conn)->eaddr, AF_INET, &c->ip4.addr_seen); + } else { + if (IN6_IS_ADDR_LINKLOCAL(&TAPSIDE(conn)->faddr.a6)) + TAPSIDE(conn)->eaddr.a6 = c->ip6.addr_ll_seen; + else + TAPSIDE(conn)->eaddr.a6 = c->ip6.addr_seen; + } + TAPSIDE(conn)->eport = ref.port; - tcp_snat_inbound(c, &conn->faddr); + ASSERT(flowside_complete(TAPSIDE(conn))); tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_conn.h b/tcp_conn.h index 4e7c7fc..3482759 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -24,6 +24,7 @@ * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap + * @eaddr: Guest side endpoint address (guest's local address) * @faddr: Guest side forwarding address (guest's remote address) * @eport: Guest side endpoint port (guest's local port) * @fport: Guest side forwarding port (guest's remote port) @@ -94,11 +95,6 @@ struct tcp_tap_conn { uint8_t seq_dup_ack_approx; - - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap; -- 2.41.0

David Gibson

7:41 a.m.

New subject: [PATCH v2 07/10] tcp, flow: Perform TCP hash calculations based on flowside

Currently we match TCP packets received on the tap connection to a TCP connection via a hash table based on the forwarding address and both ports. We hope in future to allow for multiple guest side addresses, which means we may need to distinguish based on the endpoint address as well. Extend the hash function to include this information. Since this now exactly matches the contents of the guest flowside, we can base our hash functions on that, rather than a group of individual parameters. We also put some of the helpers in flow.h, because we hope to be able to re-use the hashing logic for other cases in future as well. Signed-off-by: David Gibson --- flow.c | 1 + flow.h | 27 ++++++++++++++++++++++ siphash.c | 1 + tcp.c | 65 +++++++++++++--------------------------------------- tcp_splice.c | 1 + 5 files changed, 46 insertions(+), 49 deletions(-) diff --git a/flow.c b/flow.c index d7264f8..4521a43 100644 --- a/flow.c +++ b/flow.c @@ -12,6 +12,7 @@ #include "util.h" #include "passt.h" #include "inany.h" +#include "siphash.h" #include "flow.h" #include "tcp_conn.h" #include "flow_table.h" diff --git a/flow.h b/flow.h index 9891fcb..b4f042b 100644 --- a/flow.h +++ b/flow.h @@ -63,6 +63,33 @@ static inline bool flowside_complete(const struct flowside *fs) const char *flowside_fmt(const struct flowside *fs, char *buf, size_t size); +/** + * flowside_eq() - Check if two flowsides are equal + * @left, @right: Flowsides to compare + * + * Return: true if equal, false otherwise + */ +static inline bool flowside_eq(const struct flowside *left, + const struct flowside *right) +{ + return memcmp(left, right, sizeof(struct flowside)) == 0; +} + +/** + * flowside_hash() - Calculate hash value for a flowside + * @fs: Flowside + * @k: Hash secret (128-bits as array of 2 64-bit words) + * + * Return: hash value + */ +static inline unsigned int flowside_hash(const struct flowside *fs, + const uint64_t *k) +{ + ASSERT(flowside_complete(fs)); + return siphash_36b((uint8_t *)fs, k); +} + + /** * struct flow_common - Common fields for packet flows * @side[]: Information on the flow for each side. Flow types can have diff --git a/siphash.c b/siphash.c index e266e15..1f424d8 100644 --- a/siphash.c +++ b/siphash.c @@ -163,6 +163,7 @@ uint32_t siphash_12b(const uint8_t *in, const uint64_t *k) */ /* NOLINTNEXTLINE(clang-diagnostic-unknown-attributes) */ __attribute__((optimize("-fno-strict-aliasing"))) /* See siphash_8b() */ +/* cppcheck-suppress unusedFunction */ uint64_t siphash_20b(const uint8_t *in, const uint64_t *k) { uint32_t *in32 = (uint32_t *)in; diff --git a/tcp.c b/tcp.c index 16b930e..27cdd15 100644 --- a/tcp.c +++ b/tcp.c @@ -1133,49 +1133,15 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find, } /** - * tcp_hash_match() - Check if a connection entry matches address and ports - * @conn: Connection entry to match against - * @faddr: Guest side forwarding address - * @eport: Guest side endpoint port - * @fport: Guest side forwarding port - * - * Return: 1 on match, 0 otherwise - */ -static int tcp_hash_match(const struct tcp_tap_conn *conn, - const union inany_addr *faddr, - in_port_t eport, in_port_t fport) -{ - if (inany_equals(&TAPSIDE(conn)->faddr, faddr) && - TAPSIDE(conn)->eport == eport && TAPSIDE(conn)->fport == fport) - return 1; - - return 0; -} - -/** - * tcp_hash() - Calculate hash value for connection given address and ports + * tcp_hash() - Calculate hash value for a TCP guest flowside * @c: Execution context - * @faddr: Guest side forwarding address - * @eport: Guest side endpoint port - * @fport: Guest side forwarding port + * @fs: Guest flowside * * Return: hash value, already modulo size of the hash table */ -static unsigned int tcp_hash(const struct ctx *c, const union inany_addr *faddr, - in_port_t eport, in_port_t fport) +static unsigned int tcp_hash(const struct ctx *c, const struct flowside *fs) { - struct { - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - } __attribute__((__packed__)) in = { - *faddr, eport, fport - }; - uint64_t b = 0; - - b = siphash_20b((uint8_t *)&in, c->tcp.hash_secret); - - return (unsigned int)(b % TCP_HASH_TABLE_SIZE); + return flowside_hash(fs, c->tcp.hash_secret) % TCP_HASH_TABLE_SIZE; } /** @@ -1188,8 +1154,7 @@ static unsigned int tcp_hash(const struct ctx *c, const union inany_addr *faddr, static unsigned int tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &TAPSIDE(conn)->faddr, - TAPSIDE(conn)->eport, TAPSIDE(conn)->fport); + return tcp_hash(c, TAPSIDE(conn)); } /** @@ -1201,8 +1166,7 @@ static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn) { int b; - b = tcp_hash(c, &TAPSIDE(conn)->faddr, - TAPSIDE(conn)->eport, TAPSIDE(conn)->fport); + b = tcp_hash(c, TAPSIDE(conn)); conn->next_index = tc_hash[b] ? FLOW_IDX(tc_hash[b]) : -1U; tc_hash[b] = conn; @@ -1271,24 +1235,26 @@ void tcp_tap_conn_update(struct ctx *c, struct tcp_tap_conn *old, * tcp_hash_lookup() - Look up connection given remote address and ports * @c: Execution context * @af: Address family, AF_INET or AF_INET6 + * @eaddr: Guest side endpoint address (guest local address) * @faddr: Guest side forwarding address (guest remote address) * @eport: Guest side endpoint port (guest local port) * @fport: Guest side forwarding port (guest remote port) * * Return: connection pointer, if found, -ENOENT otherwise */ -static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, - int af, const void *faddr, +static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, int af, + const void *eaddr, const void *faddr, in_port_t eport, in_port_t fport) { - union inany_addr aany; + struct flowside fs; struct tcp_tap_conn *conn; int b; - inany_from_af(&aany, af, faddr); - b = tcp_hash(c, &aany, eport, fport); + flowside_from_af(&fs, af, faddr, fport, eaddr, eport); + + b = tcp_hash(c, &fs); for (conn = tc_hash[b]; conn; conn = conn_at_idx(conn->next_index)) { - if (tcp_hash_match(conn, &aany, eport, fport)) + if (flowside_eq(TAPSIDE(conn), &fs)) return conn; } @@ -2523,7 +2489,8 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr, optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL); opts = packet_get(p, 0, sizeof(*th), optlen, NULL); - conn = tcp_hash_lookup(c, af, daddr, htons(th->source), htons(th->dest)); + conn = tcp_hash_lookup(c, af, saddr, daddr, + htons(th->source), htons(th->dest)); /* New connection from tap */ if (!conn) { diff --git a/tcp_splice.c b/tcp_splice.c index 34cb774..676e7e8 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -51,6 +51,7 @@ #include "util.h" #include "passt.h" #include "log.h" +#include "siphash.h" #include "tcp_splice.h" #include "inany.h" #include "flow.h" -- 2.41.0

David Gibson

7:41 a.m.

New subject: [PATCH v2 08/10] tcp: Re-use flowside_hash for initial sequence number generation

We generate TCP initial sequence numbers, when we need them, from a hash of the source and destination addresses and ports, plus a timestamp. The contents of that hash are now exactly the same as the flowside_hash() we use elsewhere. The values won't be identical because we order the fields in the hash differently, but that doesn't matter for our purposes. Signed-off-by: David Gibson --- tcp.c | 15 +-------------- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/tcp.c b/tcp.c index 27cdd15..a9ddce6 100644 --- a/tcp.c +++ b/tcp.c @@ -1754,20 +1754,7 @@ static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn, static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, const struct timespec *now) { - struct { - union inany_addr src; - in_port_t srcport; - union inany_addr dst; - in_port_t dstport; - } __attribute__((__packed__)) in = { - .src = TAPSIDE(conn)->faddr, - .srcport = TAPSIDE(conn)->fport, - .dst = TAPSIDE(conn)->eaddr, - .dstport = TAPSIDE(conn)->eport, - }; - uint32_t ns, seq = 0; - - seq = siphash_36b((uint8_t *)&in, c->tcp.hash_secret); + uint32_t ns, seq = flowside_hash(TAPSIDE(conn), c->tcp.hash_secret); /* 32ns ticks, overflows 32 bits every 137s */ ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; -- 2.41.0

David Gibson

7:41 a.m.

New subject: [PATCH v2 09/10] tcp: Maintain host flowside for connections

We now maintain a struct flowside describing each TCP connection as it appears to the guest. We don't explicitly have the same information for the connections as they appear to the host, however. Rather, that information is implicit in the state of the host side socket. For future generalisations of flow/connection tracking, we're going to need to use this information more heavily, so properly populate the other flowside in each flow table entry. This does require an additional getsockname() call for each new connection. We hope to optimise that away for at least some cases in future. Signed-off-by: David Gibson --- flow.c | 26 ++++++++++++++++++++++++++ flow.h | 1 + tcp.c | 46 ++++++++++++++++++++++++++++++++++++++++++---- tcp_conn.h | 4 ---- 4 files changed, 69 insertions(+), 8 deletions(-) diff --git a/flow.c b/flow.c index 4521a43..f2a7377 100644 --- a/flow.c +++ b/flow.c @@ -7,6 +7,7 @@ #include #include +#include #include #include "util.h" @@ -83,3 +84,28 @@ void flow_table_compact(struct ctx *c, union flow *hole) memset(from, 0, sizeof(*from)); } + +/** flowside_getsockname - Initialize flowside f{addr,port} from a bound socket + * @fs: flowside to initialize + * @s: bound socket + * + * #syscalls getsockname + */ +int flowside_getsockname(struct flowside *fs, int s) +{ + struct sockaddr_storage sa; + socklen_t sl = sizeof(sa); + + /* FIXME: Workaround clang-tidy not realizing that getsockname() writes + * the socket address. See + * https://github.com/llvm/llvm-project/issues/58992 + */ + memset(&sa, 0, sizeof(struct sockaddr_in6)); + if (getsockname(s, (struct sockaddr *)&sa, &sl) < 0) + return -errno; + + inany_from_sockaddr(&fs->faddr, &fs->fport, + (const struct sockaddr *)&sa); + + return 0; +} diff --git a/flow.h b/flow.h index b4f042b..4a27303 100644 --- a/flow.h +++ b/flow.h @@ -61,6 +61,7 @@ static inline bool flowside_complete(const struct flowside *fs) #define FLOWSIDE_STRLEN (2*(INET6_ADDRSTRLEN+8) + 6) +int flowside_getsockname(struct flowside *fs, int s); const char *flowside_fmt(const struct flowside *fs, char *buf, size_t size); /** diff --git a/tcp.c b/tcp.c index a9ddce6..297134f 100644 --- a/tcp.c +++ b/tcp.c @@ -397,6 +397,7 @@ struct tcp6_l2_head { /* For MSS6 macro: keep in sync with tcp6_l2_buf_t */ #define OPT_SACK 5 #define OPT_TS 8 +#define SOCKSIDE(conn) (&(conn)->f.side[0]) #define TAPSIDE(conn) (&(conn)->f.side[1]) #define CONN_V4(conn) (!!inany_v4(&TAPSIDE(conn)->faddr)) @@ -2020,6 +2021,19 @@ static void tcp_conn_from_tap(struct ctx *c, conn_event(c, conn, TAP_SYN_ACK_SENT); } + /* Initialise sock-side demiflow */ + SOCKSIDE(conn)->eaddr = TAPSIDE(conn)->faddr; + SOCKSIDE(conn)->eport = TAPSIDE(conn)->fport; + if (flowside_getsockname(SOCKSIDE(conn), s) < 0) { + err("tcp: Failed to get local name for outgoing connection"); + tcp_rst(c, conn); + return; + } + + ASSERT(flowside_complete(SOCKSIDE(conn))); + debug("TCP: index %li, connection forwarded to socket, %s", FLOW_IDX(conn), + flowside_fmt(SOCKSIDE(conn), fsstr, sizeof(fsstr))); + tcp_epoll_ctl(c, conn); } @@ -2629,20 +2643,35 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) * @s: Accepted socket * @sa: Peer socket address (from accept()) * @now: Current timestamp + * + * Return: true if able to create a tap connection, false otherwise */ -static void tcp_tap_conn_from_sock(struct ctx *c, +static bool tcp_tap_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, struct tcp_tap_conn *conn, int s, struct sockaddr *sa, const struct timespec *now) { + char fsstr[FLOWSIDE_STRLEN]; + conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - inany_from_sockaddr(&TAPSIDE(conn)->faddr, &TAPSIDE(conn)->fport, sa); + if (flowside_getsockname(SOCKSIDE(conn), s) < 0) { + err("tcp: Failed to get local name, connection dropped"); + return false; + } + inany_from_sockaddr(&SOCKSIDE(conn)->eaddr, &SOCKSIDE(conn)->eport, sa); + + ASSERT(flowside_complete(SOCKSIDE(conn))); + debug("TCP: index %li, new connection from socket, %s", FLOW_IDX(conn), + flowside_fmt(SOCKSIDE(conn), fsstr, sizeof(fsstr))); + + TAPSIDE(conn)->faddr = SOCKSIDE(conn)->eaddr; + TAPSIDE(conn)->fport = SOCKSIDE(conn)->eport; tcp_snat_inbound(c, &TAPSIDE(conn)->faddr); if (CONN_V4(conn)) { @@ -2656,6 +2685,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, TAPSIDE(conn)->eport = ref.port; ASSERT(flowside_complete(TAPSIDE(conn))); + debug("TCP: index %li, connection forwarded to tap, %s", FLOW_IDX(conn), + flowside_fmt(TAPSIDE(conn), fsstr, sizeof(fsstr))); tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); @@ -2668,6 +2699,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, conn_flag(c, conn, ACK_FROM_TAP_DUE); tcp_get_sndbuf(conn); + + return true; } /** @@ -2704,8 +2737,13 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, s, (struct sockaddr *)&sa)) return; - tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, - (struct sockaddr *)&sa, now); + if (tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, + (struct sockaddr *)&sa, now)) + return; + + /* Failed to create the connection */ + close(s); + c->flow_count--; } /** diff --git a/tcp_conn.h b/tcp_conn.h index 3482759..2ef0130 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -24,10 +24,6 @@ * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @eaddr: Guest side endpoint address (guest's local address) - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap -- 2.41.0

David Gibson

7:41 a.m.

New subject: [PATCH v2 10/10] tcp_splice: Fill out flowside information for spliced connections

Every flow in the flow table now has space for the the addresses as seen by both the host and guest side. We fill that information in for regular "tap" TCP connections, but not for spliced connections. Fill in that information for spliced connections too, so it's now uniformly available for all flow types (that we've implemented so far). Signed-off-by: David Gibson --- tcp.c | 46 +++++++++++++++++++--------------------------- tcp_splice.c | 40 ++++++++++++++++++++++++++-------------- tcp_splice.h | 3 +-- 3 files changed, 46 insertions(+), 43 deletions(-) diff --git a/tcp.c b/tcp.c index 297134f..7459fc2 100644 --- a/tcp.c +++ b/tcp.c @@ -2639,37 +2639,25 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with TAPSIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer socket address (from accept()) * @now: Current timestamp - * - * Return: true if able to create a tap connection, false otherwise */ -static bool tcp_tap_conn_from_sock(struct ctx *c, +static void tcp_tap_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, struct tcp_tap_conn *conn, int s, - struct sockaddr *sa, const struct timespec *now) { char fsstr[FLOWSIDE_STRLEN]; + ASSERT(flowside_complete(SOCKSIDE(conn))); + conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - if (flowside_getsockname(SOCKSIDE(conn), s) < 0) { - err("tcp: Failed to get local name, connection dropped"); - return false; - } - inany_from_sockaddr(&SOCKSIDE(conn)->eaddr, &SOCKSIDE(conn)->eport, sa); - - ASSERT(flowside_complete(SOCKSIDE(conn))); - debug("TCP: index %li, new connection from socket, %s", FLOW_IDX(conn), - flowside_fmt(SOCKSIDE(conn), fsstr, sizeof(fsstr))); - TAPSIDE(conn)->faddr = SOCKSIDE(conn)->eaddr; TAPSIDE(conn)->fport = SOCKSIDE(conn)->eport; tcp_snat_inbound(c, &TAPSIDE(conn)->faddr); @@ -2699,8 +2687,6 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, conn_flag(c, conn, ACK_FROM_TAP_DUE); tcp_get_sndbuf(conn); - - return true; } /** @@ -2712,6 +2698,7 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, void tcp_listen_handler(struct ctx *c, union epoll_ref ref, const struct timespec *now) { + char fsstr[FLOWSIDE_STRLEN]; struct sockaddr_storage sa; union flow *flow; socklen_t sl; @@ -2730,20 +2717,25 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) return; - flow = flowtab + c->flow_count++; + flow = flowtab + c->flow_count; - if (c->mode == MODE_PASTA && - tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, - s, (struct sockaddr *)&sa)) + if (flowside_getsockname(&flow->f.side[0], s) < 0) { + err("tcp: Failed to get local name, connection dropped"); + close(s); return; + } + inany_from_sockaddr(&flow->f.side[0].eaddr, &flow->f.side[0].eport, + &sa); + c->flow_count++; - if (tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, - (struct sockaddr *)&sa, now)) + debug("TCP: index %li, new connection from socket, %s", FLOW_IDX(flow), + flowside_fmt(&flow->f.side[0], fsstr, sizeof(fsstr))); + + if (c->mode == MODE_PASTA && + tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, s)) return; - /* Failed to create the connection */ - close(s); - c->flow_count--; + tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, now); } /** diff --git a/tcp_splice.c b/tcp_splice.c index 676e7e8..018d095 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -73,6 +73,9 @@ static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE]; /* Pool of pre-opened pipes */ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2][2]; +#define ASIDE(conn) (&(conn)->f.side[0]) +#define BSIDE(conn) (&(conn)->f.side[1]) + #define CONN_V6(x) (x->flags & SPLICE_V6) #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) @@ -310,7 +313,16 @@ void tcp_splice_destroy(struct ctx *c, union flow *flow) static int tcp_splice_connect_finish(const struct ctx *c, struct tcp_splice_conn *conn) { - int i; + char fsstr[FLOWSIDE_STRLEN]; + int i, rc; + + rc = flowside_getsockname(BSIDE(conn), conn->b); + if (rc) + return rc; + + ASSERT(flowside_complete(BSIDE(conn))); + debug("TCP (splice): index %li, connection forwarded, %s", FLOW_IDX(conn), + flowside_fmt(BSIDE(conn), fsstr, sizeof(fsstr))); conn->pipe_a_b[0] = conn->pipe_b_a[0] = -1; conn->pipe_a_b[1] = conn->pipe_b_a[1] = -1; @@ -386,10 +398,13 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, if (CONN_V6(conn)) { sa = (struct sockaddr *)&addr6; sl = sizeof(addr6); + inany_from_af(&BSIDE(conn)->eaddr, AF_INET6, &addr6.sin6_addr); } else { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); + inany_from_af(&BSIDE(conn)->eaddr, AF_INET, &addr4.sin_addr); } + BSIDE(conn)->eport = port; if (connect(conn->b, sa, sl)) { if (errno != EINPROGRESS) { @@ -480,33 +495,30 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock, * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with ASIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer address of connection * * Return: true if able to create a spliced connection, false otherwise * #syscalls:pasta setsockopt */ bool tcp_splice_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, - struct tcp_splice_conn *conn, int s, - const struct sockaddr *sa) + struct tcp_splice_conn *conn, int s) { - const struct in_addr *a4; - union inany_addr aany; - in_port_t port; + const struct in_addr *e4 = inany_v4(&ASIDE(conn)->eaddr); + const struct in_addr *f4 = inany_v4(&ASIDE(conn)->faddr); ASSERT(c->mode == MODE_PASTA); + ASSERT(flowside_complete(ASIDE(conn))); - inany_from_sockaddr(&aany, &port, sa); - a4 = inany_v4(&aany); - - if (a4) { - if (!IN4_IS_ADDR_LOOPBACK(a4)) + if (e4) { + if (!IN4_IS_ADDR_LOOPBACK(e4)) return false; + ASSERT(f4 && IN4_IS_ADDR_LOOPBACK(f4)); conn->flags = 0; } else { - if (!IN6_IS_ADDR_LOOPBACK(&aany.a6)) + if (!IN6_IS_ADDR_LOOPBACK(&ASIDE(conn)->eaddr.a6)) return false; + ASSERT(IN6_IS_ADDR_LOOPBACK(&ASIDE(conn)->faddr.a6)); conn->flags = SPLICE_V6; } diff --git a/tcp_splice.h b/tcp_splice.h index e7a583a..fb00318 100644 --- a/tcp_splice.h +++ b/tcp_splice.h @@ -11,8 +11,7 @@ struct tcp_splice_conn; void tcp_splice_sock_handler(struct ctx *c, struct tcp_splice_conn *conn, int s, uint32_t events); bool tcp_splice_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, - struct tcp_splice_conn *conn, int s, - const struct sockaddr *sa); + struct tcp_splice_conn *conn, int s); void tcp_splice_init(struct ctx *c); #endif /* TCP_SPLICE_H */ -- 2.41.0

Stefano Brivio

7 Sep 7 Sep

3:02 a.m.

New subject: [PATCH v2 10/10] tcp_splice: Fill out flowside information for spliced connections

On Mon, 28 Aug 2023 15:41:46 +1000 David Gibson wrote:

...

Every flow in the flow table now has space for the the addresses as seen by both the host and guest side. We fill that information in for regular "tap" TCP connections, but not for spliced connections.

Fill in that information for spliced connections too, so it's now uniformly available for all flow types (that we've implemented so far).

Signed-off-by: David Gibson --- tcp.c | 46 +++++++++++++++++++--------------------------- tcp_splice.c | 40 ++++++++++++++++++++++++++-------------- tcp_splice.h | 3 +-- 3 files changed, 46 insertions(+), 43 deletions(-)

diff --git a/tcp.c b/tcp.c index 297134f..7459fc2 100644 --- a/tcp.c +++ b/tcp.c @@ -2639,37 +2639,25 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with TAPSIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer socket address (from accept()) * @now: Current timestamp - * - * Return: true if able to create a tap connection, false otherwise */ -static bool tcp_tap_conn_from_sock(struct ctx *c, +static void tcp_tap_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, struct tcp_tap_conn *conn, int s, - struct sockaddr *sa, const struct timespec *now) { char fsstr[FLOWSIDE_STRLEN];

+ ASSERT(flowside_complete(SOCKSIDE(conn))); + conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);

- if (flowside_getsockname(SOCKSIDE(conn), s) < 0) { - err("tcp: Failed to get local name, connection dropped"); - return false; - } - inany_from_sockaddr(&SOCKSIDE(conn)->eaddr, &SOCKSIDE(conn)->eport, sa); - - ASSERT(flowside_complete(SOCKSIDE(conn))); - debug("TCP: index %li, new connection from socket, %s", FLOW_IDX(conn), - flowside_fmt(SOCKSIDE(conn), fsstr, sizeof(fsstr))); - TAPSIDE(conn)->faddr = SOCKSIDE(conn)->eaddr; TAPSIDE(conn)->fport = SOCKSIDE(conn)->eport; tcp_snat_inbound(c, &TAPSIDE(conn)->faddr); @@ -2699,8 +2687,6 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, conn_flag(c, conn, ACK_FROM_TAP_DUE);

tcp_get_sndbuf(conn); - - return true; }

/** @@ -2712,6 +2698,7 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, void tcp_listen_handler(struct ctx *c, union epoll_ref ref, const struct timespec *now) { + char fsstr[FLOWSIDE_STRLEN]; struct sockaddr_storage sa; union flow *flow; socklen_t sl; @@ -2730,20 +2717,25 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) return;

- flow = flowtab + c->flow_count++; + flow = flowtab + c->flow_count;

- if (c->mode == MODE_PASTA && - tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, - s, (struct sockaddr *)&sa)) + if (flowside_getsockname(&flow->f.side[0], s) < 0) { + err("tcp: Failed to get local name, connection dropped"); + close(s); return; + } + inany_from_sockaddr(&flow->f.side[0].eaddr, &flow->f.side[0].eport, + &sa); + c->flow_count++;

- if (tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, - (struct sockaddr *)&sa, now)) + debug("TCP: index %li, new connection from socket, %s", FLOW_IDX(flow), + flowside_fmt(&flow->f.side[0], fsstr, sizeof(fsstr))); + + if (c->mode == MODE_PASTA && + tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, s)) return;

- /* Failed to create the connection */ - close(s); - c->flow_count--; + tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, now); }

/** diff --git a/tcp_splice.c b/tcp_splice.c index 676e7e8..018d095 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -73,6 +73,9 @@ static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE]; /* Pool of pre-opened pipes */ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2][2];

+#define ASIDE(conn) (&(conn)->f.side[0]) +#define BSIDE(conn) (&(conn)->f.side[1]) + #define CONN_V6(x) (x->flags & SPLICE_V6) #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) @@ -310,7 +313,16 @@ void tcp_splice_destroy(struct ctx *c, union flow *flow) static int tcp_splice_connect_finish(const struct ctx *c, struct tcp_splice_conn *conn) { - int i; + char fsstr[FLOWSIDE_STRLEN]; + int i, rc; + + rc = flowside_getsockname(BSIDE(conn), conn->b); + if (rc) + return rc; + + ASSERT(flowside_complete(BSIDE(conn))); + debug("TCP (splice): index %li, connection forwarded, %s", FLOW_IDX(conn), + flowside_fmt(BSIDE(conn), fsstr, sizeof(fsstr)));

conn->pipe_a_b[0] = conn->pipe_b_a[0] = -1; conn->pipe_a_b[1] = conn->pipe_b_a[1] = -1; @@ -386,10 +398,13 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, if (CONN_V6(conn)) { sa = (struct sockaddr *)&addr6; sl = sizeof(addr6); + inany_from_af(&BSIDE(conn)->eaddr, AF_INET6, &addr6.sin6_addr); } else { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); + inany_from_af(&BSIDE(conn)->eaddr, AF_INET, &addr4.sin_addr); } + BSIDE(conn)->eport = port;

if (connect(conn->b, sa, sl)) { if (errno != EINPROGRESS) { @@ -480,33 +495,30 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock, * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with ASIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer address of connection * * Return: true if able to create a spliced connection, false otherwise * #syscalls:pasta setsockopt */ bool tcp_splice_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, - struct tcp_splice_conn *conn, int s, - const struct sockaddr *sa) + struct tcp_splice_conn *conn, int s) { - const struct in_addr *a4; - union inany_addr aany; - in_port_t port; + const struct in_addr *e4 = inany_v4(&ASIDE(conn)->eaddr); + const struct in_addr *f4 = inany_v4(&ASIDE(conn)->faddr);

ASSERT(c->mode == MODE_PASTA); + ASSERT(flowside_complete(ASIDE(conn)));

- inany_from_sockaddr(&aany, &port, sa); - a4 = inany_v4(&aany); - - if (a4) { - if (!IN4_IS_ADDR_LOOPBACK(a4)) + if (e4) { + if (!IN4_IS_ADDR_LOOPBACK(e4)) return false; + ASSERT(f4 && IN4_IS_ADDR_LOOPBACK(f4));

I can't follow this: the test you're replacing is actually (still) a test used by tcp_listen_handler() unless I'm missing something. Returning false here should simply mean we can't use a spliced connection, not that something is wrong.

...

conn->flags = 0; } else { - if (!IN6_IS_ADDR_LOOPBACK(&aany.a6)) + if (!IN6_IS_ADDR_LOOPBACK(&ASIDE(conn)->eaddr.a6)) return false; + ASSERT(IN6_IS_ADDR_LOOPBACK(&ASIDE(conn)->faddr.a6));

...same here. Everything else in the series looks good to me! It looks simpler (so far) than I thought it would be. -- Stefano

David Gibson

6:14 a.m.

New subject: [PATCH v2 10/10] tcp_splice: Fill out flowside information for spliced connections

On Thu, Sep 07, 2023 at 03:02:53AM +0200, Stefano Brivio wrote:

...

On Mon, 28 Aug 2023 15:41:46 +1000 David Gibson wrote:

...
Every flow in the flow table now has space for the the addresses as seen by both the host and guest side. We fill that information in for regular "tap" TCP connections, but not for spliced connections.

Fill in that information for spliced connections too, so it's now uniformly available for all flow types (that we've implemented so far).

Signed-off-by: David Gibson --- tcp.c | 46 +++++++++++++++++++--------------------------- tcp_splice.c | 40 ++++++++++++++++++++++++++-------------- tcp_splice.h | 3 +-- 3 files changed, 46 insertions(+), 43 deletions(-)

diff --git a/tcp.c b/tcp.c index 297134f..7459fc2 100644 --- a/tcp.c +++ b/tcp.c @@ -2639,37 +2639,25 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with TAPSIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer socket address (from accept()) * @now: Current timestamp - * - * Return: true if able to create a tap connection, false otherwise */ -static bool tcp_tap_conn_from_sock(struct ctx *c, +static void tcp_tap_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, struct tcp_tap_conn *conn, int s, - struct sockaddr *sa, const struct timespec *now) { char fsstr[FLOWSIDE_STRLEN];

+ ASSERT(flowside_complete(SOCKSIDE(conn))); + conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);

- if (flowside_getsockname(SOCKSIDE(conn), s) < 0) { - err("tcp: Failed to get local name, connection dropped"); - return false; - } - inany_from_sockaddr(&SOCKSIDE(conn)->eaddr, &SOCKSIDE(conn)->eport, sa); - - ASSERT(flowside_complete(SOCKSIDE(conn))); - debug("TCP: index %li, new connection from socket, %s", FLOW_IDX(conn), - flowside_fmt(SOCKSIDE(conn), fsstr, sizeof(fsstr))); - TAPSIDE(conn)->faddr = SOCKSIDE(conn)->eaddr; TAPSIDE(conn)->fport = SOCKSIDE(conn)->eport; tcp_snat_inbound(c, &TAPSIDE(conn)->faddr); @@ -2699,8 +2687,6 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, conn_flag(c, conn, ACK_FROM_TAP_DUE);

tcp_get_sndbuf(conn); - - return true; }

/** @@ -2712,6 +2698,7 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, void tcp_listen_handler(struct ctx *c, union epoll_ref ref, const struct timespec *now) { + char fsstr[FLOWSIDE_STRLEN]; struct sockaddr_storage sa; union flow *flow; socklen_t sl; @@ -2730,20 +2717,25 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) return;

- flow = flowtab + c->flow_count++; + flow = flowtab + c->flow_count;

- if (c->mode == MODE_PASTA && - tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, - s, (struct sockaddr *)&sa)) + if (flowside_getsockname(&flow->f.side[0], s) < 0) { + err("tcp: Failed to get local name, connection dropped"); + close(s); return; + } + inany_from_sockaddr(&flow->f.side[0].eaddr, &flow->f.side[0].eport, + &sa); + c->flow_count++;

- if (tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, - (struct sockaddr *)&sa, now)) + debug("TCP: index %li, new connection from socket, %s", FLOW_IDX(flow), + flowside_fmt(&flow->f.side[0], fsstr, sizeof(fsstr))); + + if (c->mode == MODE_PASTA && + tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, s)) return;

- /* Failed to create the connection */ - close(s); - c->flow_count--; + tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, now); }

/** diff --git a/tcp_splice.c b/tcp_splice.c index 676e7e8..018d095 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -73,6 +73,9 @@ static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE]; /* Pool of pre-opened pipes */ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2][2];

+#define ASIDE(conn) (&(conn)->f.side[0]) +#define BSIDE(conn) (&(conn)->f.side[1]) + #define CONN_V6(x) (x->flags & SPLICE_V6) #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) @@ -310,7 +313,16 @@ void tcp_splice_destroy(struct ctx *c, union flow *flow) static int tcp_splice_connect_finish(const struct ctx *c, struct tcp_splice_conn *conn) { - int i; + char fsstr[FLOWSIDE_STRLEN]; + int i, rc; + + rc = flowside_getsockname(BSIDE(conn), conn->b); + if (rc) + return rc; + + ASSERT(flowside_complete(BSIDE(conn))); + debug("TCP (splice): index %li, connection forwarded, %s", FLOW_IDX(conn), + flowside_fmt(BSIDE(conn), fsstr, sizeof(fsstr)));

conn->pipe_a_b[0] = conn->pipe_b_a[0] = -1; conn->pipe_a_b[1] = conn->pipe_b_a[1] = -1; @@ -386,10 +398,13 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, if (CONN_V6(conn)) { sa = (struct sockaddr *)&addr6; sl = sizeof(addr6); + inany_from_af(&BSIDE(conn)->eaddr, AF_INET6, &addr6.sin6_addr); } else { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); + inany_from_af(&BSIDE(conn)->eaddr, AF_INET, &addr4.sin_addr); } + BSIDE(conn)->eport = port;

if (connect(conn->b, sa, sl)) { if (errno != EINPROGRESS) { @@ -480,33 +495,30 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock, * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with ASIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer address of connection * * Return: true if able to create a spliced connection, false otherwise * #syscalls:pasta setsockopt */ bool tcp_splice_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, - struct tcp_splice_conn *conn, int s, - const struct sockaddr *sa) + struct tcp_splice_conn *conn, int s) { - const struct in_addr *a4; - union inany_addr aany; - in_port_t port; + const struct in_addr *e4 = inany_v4(&ASIDE(conn)->eaddr); + const struct in_addr *f4 = inany_v4(&ASIDE(conn)->faddr);

ASSERT(c->mode == MODE_PASTA); + ASSERT(flowside_complete(ASIDE(conn)));

- inany_from_sockaddr(&aany, &port, sa); - a4 = inany_v4(&aany); - - if (a4) { - if (!IN4_IS_ADDR_LOOPBACK(a4)) + if (e4) { + if (!IN4_IS_ADDR_LOOPBACK(e4)) return false; + ASSERT(f4 && IN4_IS_ADDR_LOOPBACK(f4));

I can't follow this: the test you're replacing is actually (still) a test used by tcp_listen_handler() unless I'm missing something.

I'm not replacing a test - I'm just rewriting the same test, then adding an assert.

...

Returning false here should simply mean we can't use a spliced connection, not that something is wrong.

Yes. The 'if' checks if this is a spliceable connection - we have a loopback endpoint (remote to pasta) address. The assert is then checking that the forwarding address (local to pasta) is also loopback - i.e. that we don't have traffic that's going from 127.0.0.1 to a non-loopback address which would be nonsense. I guess this does indicate that the kernel has given us something weird, rather than an internal bug, so maybe it should be log an error and drop the connection rather than asserting.

...

...
conn->flags = 0; } else { - if (!IN6_IS_ADDR_LOOPBACK(&aany.a6)) + if (!IN6_IS_ADDR_LOOPBACK(&ASIDE(conn)->eaddr.a6)) return false; + ASSERT(IN6_IS_ADDR_LOOPBACK(&ASIDE(conn)->faddr.a6));

...same here.

Everything else in the series looks good to me! It looks simpler (so far) than I thought it would be.

-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson

Stefano Brivio

9:55 a.m.

New subject: [PATCH v2 10/10] tcp_splice: Fill out flowside information for spliced connections

On Thu, 7 Sep 2023 14:14:04 +1000 David Gibson wrote:

...

On Thu, Sep 07, 2023 at 03:02:53AM +0200, Stefano Brivio wrote:

...
On Mon, 28 Aug 2023 15:41:46 +1000 David Gibson wrote:

...
Every flow in the flow table now has space for the the addresses as seen by both the host and guest side. We fill that information in for regular "tap" TCP connections, but not for spliced connections.

Fill in that information for spliced connections too, so it's now uniformly available for all flow types (that we've implemented so far).

Signed-off-by: David Gibson --- tcp.c | 46 +++++++++++++++++++--------------------------- tcp_splice.c | 40 ++++++++++++++++++++++++++-------------- tcp_splice.h | 3 +-- 3 files changed, 46 insertions(+), 43 deletions(-)

diff --git a/tcp.c b/tcp.c index 297134f..7459fc2 100644 --- a/tcp.c +++ b/tcp.c @@ -2639,37 +2639,25 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with TAPSIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer socket address (from accept()) * @now: Current timestamp - * - * Return: true if able to create a tap connection, false otherwise */ -static bool tcp_tap_conn_from_sock(struct ctx *c, +static void tcp_tap_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, struct tcp_tap_conn *conn, int s, - struct sockaddr *sa, const struct timespec *now) { char fsstr[FLOWSIDE_STRLEN];

+ ASSERT(flowside_complete(SOCKSIDE(conn))); + conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);

- if (flowside_getsockname(SOCKSIDE(conn), s) < 0) { - err("tcp: Failed to get local name, connection dropped"); - return false; - } - inany_from_sockaddr(&SOCKSIDE(conn)->eaddr, &SOCKSIDE(conn)->eport, sa); - - ASSERT(flowside_complete(SOCKSIDE(conn))); - debug("TCP: index %li, new connection from socket, %s", FLOW_IDX(conn), - flowside_fmt(SOCKSIDE(conn), fsstr, sizeof(fsstr))); - TAPSIDE(conn)->faddr = SOCKSIDE(conn)->eaddr; TAPSIDE(conn)->fport = SOCKSIDE(conn)->eport; tcp_snat_inbound(c, &TAPSIDE(conn)->faddr); @@ -2699,8 +2687,6 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, conn_flag(c, conn, ACK_FROM_TAP_DUE);

tcp_get_sndbuf(conn); - - return true; }

/** @@ -2712,6 +2698,7 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, void tcp_listen_handler(struct ctx *c, union epoll_ref ref, const struct timespec *now) { + char fsstr[FLOWSIDE_STRLEN]; struct sockaddr_storage sa; union flow *flow; socklen_t sl; @@ -2730,20 +2717,25 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) return;

- flow = flowtab + c->flow_count++; + flow = flowtab + c->flow_count;

- if (c->mode == MODE_PASTA && - tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, - s, (struct sockaddr *)&sa)) + if (flowside_getsockname(&flow->f.side[0], s) < 0) { + err("tcp: Failed to get local name, connection dropped"); + close(s); return; + } + inany_from_sockaddr(&flow->f.side[0].eaddr, &flow->f.side[0].eport, + &sa); + c->flow_count++;

- if (tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, - (struct sockaddr *)&sa, now)) + debug("TCP: index %li, new connection from socket, %s", FLOW_IDX(flow), + flowside_fmt(&flow->f.side[0], fsstr, sizeof(fsstr))); + + if (c->mode == MODE_PASTA && + tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, s)) return;

- /* Failed to create the connection */ - close(s); - c->flow_count--; + tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, now); }

/** diff --git a/tcp_splice.c b/tcp_splice.c index 676e7e8..018d095 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -73,6 +73,9 @@ static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE]; /* Pool of pre-opened pipes */ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2][2];

+#define ASIDE(conn) (&(conn)->f.side[0]) +#define BSIDE(conn) (&(conn)->f.side[1]) + #define CONN_V6(x) (x->flags & SPLICE_V6) #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) @@ -310,7 +313,16 @@ void tcp_splice_destroy(struct ctx *c, union flow *flow) static int tcp_splice_connect_finish(const struct ctx *c, struct tcp_splice_conn *conn) { - int i; + char fsstr[FLOWSIDE_STRLEN]; + int i, rc; + + rc = flowside_getsockname(BSIDE(conn), conn->b); + if (rc) + return rc; + + ASSERT(flowside_complete(BSIDE(conn))); + debug("TCP (splice): index %li, connection forwarded, %s", FLOW_IDX(conn), + flowside_fmt(BSIDE(conn), fsstr, sizeof(fsstr)));

conn->pipe_a_b[0] = conn->pipe_b_a[0] = -1; conn->pipe_a_b[1] = conn->pipe_b_a[1] = -1; @@ -386,10 +398,13 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, if (CONN_V6(conn)) { sa = (struct sockaddr *)&addr6; sl = sizeof(addr6); + inany_from_af(&BSIDE(conn)->eaddr, AF_INET6, &addr6.sin6_addr); } else { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); + inany_from_af(&BSIDE(conn)->eaddr, AF_INET, &addr4.sin_addr); } + BSIDE(conn)->eport = port;

if (connect(conn->b, sa, sl)) { if (errno != EINPROGRESS) { @@ -480,33 +495,30 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock, * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with ASIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer address of connection * * Return: true if able to create a spliced connection, false otherwise * #syscalls:pasta setsockopt */ bool tcp_splice_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, - struct tcp_splice_conn *conn, int s, - const struct sockaddr *sa) + struct tcp_splice_conn *conn, int s) { - const struct in_addr *a4; - union inany_addr aany; - in_port_t port; + const struct in_addr *e4 = inany_v4(&ASIDE(conn)->eaddr); + const struct in_addr *f4 = inany_v4(&ASIDE(conn)->faddr);

ASSERT(c->mode == MODE_PASTA); + ASSERT(flowside_complete(ASIDE(conn)));

- inany_from_sockaddr(&aany, &port, sa); - a4 = inany_v4(&aany); - - if (a4) { - if (!IN4_IS_ADDR_LOOPBACK(a4)) + if (e4) { + if (!IN4_IS_ADDR_LOOPBACK(e4)) return false; + ASSERT(f4 && IN4_IS_ADDR_LOOPBACK(f4));

I can't follow this: the test you're replacing is actually (still) a test used by tcp_listen_handler() unless I'm missing something.

I'm not replacing a test - I'm just rewriting the same test, then adding an assert.

Gah, sorry, I misread!

...

...
Returning false here should simply mean we can't use a spliced connection, not that something is wrong.

Yes. The 'if' checks if this is a spliceable connection - we have a loopback endpoint (remote to pasta) address. The assert is then checking that the forwarding address (local to pasta) is also loopback - i.e. that we don't have traffic that's going from 127.0.0.1 to a non-loopback address which would be nonsense.

I guess this does indicate that the kernel has given us something weird, rather than an internal bug, so maybe it should be log an error and drop the connection rather than asserting.

Sure, I see now. -- Stefano

851

Age (days ago)

861

Last active (days ago)

List overview

Download

18 comments

2 participants

participants (2)

David Gibson
Stefano Brivio

[PATCH v2 00/10] RFC: Convert TCP connection table to generalisable flow table

tags

participants (2)