We can splice TCP connections in pasta mode if and only if they
originate from localhost.  Currently we separate the two cases by
having separate listening sockets: one listens on the host address for
non-spliceable connections, the other listens on the loopback address
for spliceable connections.

As well as requiring twice as many listening sockets, this has the
drawback that passt and pasta behaviour differ subtly but
surprisingly: by default passt inbound port forwards will listen on
the unspecified address, but for pasta they will be changed to listen
on the host's interface address.

In fact we don't need to do this.  We can defer the decision about
whether to splice a connection until after we've accepted it, by
testing the peer address to see if it is local.

At least, in principle we can.  This series deals with a number of
complications on the way to accomplishing that.

CAVEAT: The current draft increases the size of tcp_conn above its
current 64 bytes, which is likely to push it into a second cache line
on some machines.  This is fixable, but doing so is fiddly, and I'm
still working on it.

David Gibson (14):
  style: Minor corrections to function comments
  tcp: Remove unused TCP_MAX_SOCKS constant
  tcp: Better helpers for converting between connection pointer and index
  tcp_splice: Helpers for converting from index to/from tcp_splice_conn
  tcp: Move connection state structures into a shared header
  tcp: Add connection union type
  tcp: Improved helpers to update connections after moving
  tcp: Unify spliced and non-spliced connection tables
  tcp: Unify tcp_defer_handler and tcp_splice_defer_handler()
  tcp: Partially unify tcp_timer() and tcp_splice_timer()
  tcp: Unify the IN_EPOLL flag
  tcp: Separate helpers to create ns listening sockets
  tcp: Unify part of spliced and non-spliced conn_from_sock path
  tcp: Use the same sockets to listen for spliced and non-spliced connections

 Makefile     |   3 +-
 conf.c       |  12 +-
 tap.c        |   6 +-
 tcp.c        | 661 +++++++++++++++++++++++----------------------------
 tcp.h        |   7 +-
 tcp_conn.h   | 208 ++++++++++++++++
 tcp_splice.c | 292 +++++++++--------------
 tcp_splice.h |   4 +-
 8 files changed, 625 insertions(+), 568 deletions(-)
 create mode 100644 tcp_conn.h

-- 
2.38.1
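
To make the idea from the cover letter concrete, here is a minimal sketch of
testing a peer address after accept().  It is not code from this series:
peer_is_loopback() and addr_text_is_loopback() are hypothetical names, and
treating "local" as "loopback" (127.0.0.0/8, ::1, or an IPv4-mapped loopback
address) is an assumption; the series may well use a different test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper (not from the series): is the peer address that
 * accept() filled in a loopback address?
 */
bool peer_is_loopback(const struct sockaddr *sa)
{
	assert(sa != NULL);

	if (sa->sa_family == AF_INET) {
		const struct sockaddr_in *sa4 = (const struct sockaddr_in *)sa;

		/* 127.0.0.0/8: top byte of the host-order address is 127 */
		return (ntohl(sa4->sin_addr.s_addr) >> 24) == 127;
	}

	if (sa->sa_family == AF_INET6) {
		const struct sockaddr_in6 *sa6 = (const struct sockaddr_in6 *)sa;

		if (IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
			return true;

		/* IPv4-mapped: bytes 12..15 hold the IPv4 address */
		if (IN6_IS_ADDR_V4MAPPED(&sa6->sin6_addr))
			return sa6->sin6_addr.s6_addr[12] == 127;
	}

	return false;
}

/* Hypothetical convenience wrapper: parse @text as an @af address and
 * classify it.  Returns false if @text can't be parsed.
 */
bool addr_text_is_loopback(int af, const char *text)
{
	struct sockaddr_storage ss = { 0 };

	if (af == AF_INET) {
		struct sockaddr_in *sa4 = (struct sockaddr_in *)&ss;

		sa4->sin_family = AF_INET;
		if (inet_pton(AF_INET, text, &sa4->sin_addr) != 1)
			return false;
	} else if (af == AF_INET6) {
		struct sockaddr_in6 *sa6 = (struct sockaddr_in6 *)&ss;

		sa6->sin6_family = AF_INET6;
		if (inet_pton(AF_INET6, text, &sa6->sin6_addr) != 1)
			return false;
	} else {
		return false;
	}

	return peer_is_loopback((const struct sockaddr *)&ss);
}
```

In a listening-socket handler, the address passed to peer_is_loopback() would
be the one accept() wrote into its addr argument, so a single socket bound to
the unspecified address could feed both the spliced and non-spliced paths.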
Some style issues and a typo.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 conf.c | 6 +++---
 tap.c  | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/conf.c b/conf.c
index 1adcf83..3ad247e 100644
--- a/conf.c
+++ b/conf.c
@@ -112,9 +112,9 @@ static int get_bound_ports_ns(void *arg)
  * @s: String to search
  * @c: Delimiter character
  *
- * Returns: If another @c is found in @s, returns a pointer to the
- * character *after* the delimiter, if no further @c is in
- * @s, return NULL
+ * Return: If another @c is found in @s, returns a pointer to the
+ * character *after* the delimiter, if no further @c is in @s,
+ * return NULL
  */
 static char *next_chunk(const char *s, char c)
 {
diff --git a/tap.c b/tap.c
index abeff25..707660c 100644
--- a/tap.c
+++ b/tap.c
@@ -90,7 +90,7 @@ int tap_send(const struct ctx *c, const void *data, size_t len)
  * tap_ip4_daddr() - Normal IPv4 destination address for inbound packets
  * @c: Execution context
  *
- * Returns: IPv4 address, network order
+ * Return: IPv4 address, network order
  */
 struct in_addr tap_ip4_daddr(const struct ctx *c)
 {
@@ -98,11 +98,11 @@ struct in_addr tap_ip4_daddr(const struct ctx *c)
 }

 /**
- * tap_ip6_daddr() - Normal IPv4 destination address for inbound packets
+ * tap_ip6_daddr() - Normal IPv6 destination address for inbound packets
  * @c: Execution context
  * @src: Source address
  *
- * Returns: pointer to IPv6 address
+ * Return: pointer to IPv6 address
  */
 const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
 	const struct in6_addr *src)
-- 
2.38.1
Presumably it meant something in the past, but it's no longer used.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tcp.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tcp.h b/tcp.h
index 3fabb5a..bba0f38 100644
--- a/tcp.h
+++ b/tcp.h
@@ -10,7 +10,6 @@

 #define TCP_CONN_INDEX_BITS 17 /* 128k */
 #define TCP_MAX_CONNS (1 << TCP_CONN_INDEX_BITS)
-#define TCP_MAX_SOCKS (TCP_MAX_CONNS + USHRT_MAX * 2)

 #define TCP_SOCK_POOL_SIZE 32

-- 
2.38.1
The macro CONN_OR_NULL() is used to look up connections by index with
bounds checking.  Replace it with an inline function, which means:
    - Better type checking
    - No danger of multiple evaluation of an @index with side effects

Also add a helper to perform the reverse translation: from connection
pointer to index.  Introduce a macro for this which will make later
cleanups easier and safer.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tcp.c | 83 ++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 45 insertions(+), 38 deletions(-)

diff --git a/tcp.c b/tcp.c
index d043123..4e56a6c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -518,14 +518,6 @@ struct tcp_conn {
 	(conn->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD)))
 #define CONN_HAS(conn, set) ((conn->events & (set)) == (set))

-#define CONN(index) (tc + (index))
-
-/* We probably don't want to use gcc statement expressions (for portability), so
- * use this only after well-defined sequence points (no pre-/post-increments).
- */
-#define CONN_OR_NULL(index) \
-	(((int)(index) >= 0 && (index) < TCP_MAX_CONNS) ?
(tc + (index)) : NULL) - static const char *tcp_event_str[] __attribute((__unused__)) = { "SOCK_ACCEPTED", "TAP_SYN_RCVD", "ESTABLISHED", "TAP_SYN_ACK_SENT", @@ -705,6 +697,21 @@ static size_t tcp6_l2_flags_buf_bytes; /* TCP connections */ static struct tcp_conn tc[TCP_MAX_CONNS]; +#define CONN(index) (tc + (index)) +#define CONN_IDX(conn) ((conn) - tc) + +/** conn_at_idx() - Find a connection by index, if present + * @index: Index of connection to lookup + * + * Return: Pointer to connection, or NULL if @index is out of bounds + */ +static inline struct tcp_conn *conn_at_idx(int index) +{ + if ((index < 0) || (index >= TCP_MAX_CONNS)) + return NULL; + return CONN(index); +} + /* Table for lookup from remote address, local port, remote port */ static struct tcp_conn *tc_hash[TCP_HASH_TABLE_SIZE]; @@ -761,7 +768,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_conn *conn) { int m = (conn->flags & IN_EPOLL) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD; union epoll_ref ref = { .r.proto = IPPROTO_TCP, .r.s = conn->sock, - .r.p.tcp.tcp.index = conn - tc, + .r.p.tcp.tcp.index = CONN_IDX(conn), .r.p.tcp.tcp.v6 = CONN_V6(conn) }; struct epoll_event ev = { .data.u64 = ref.u64 }; @@ -784,7 +791,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_conn *conn) union epoll_ref ref_t = { .r.proto = IPPROTO_TCP, .r.s = conn->sock, .r.p.tcp.tcp.timer = 1, - .r.p.tcp.tcp.index = conn - tc }; + .r.p.tcp.tcp.index = CONN_IDX(conn) }; struct epoll_event ev_t = { .data.u64 = ref_t.u64, .events = EPOLLIN | EPOLLET }; @@ -813,7 +820,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_conn *conn) union epoll_ref ref = { .r.proto = IPPROTO_TCP, .r.s = conn->sock, .r.p.tcp.tcp.timer = 1, - .r.p.tcp.tcp.index = conn - tc }; + .r.p.tcp.tcp.index = CONN_IDX(conn) }; struct epoll_event ev = { .data.u64 = ref.u64, .events = EPOLLIN | EPOLLET }; int fd; @@ -846,7 +853,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_conn *conn) it.it_value.tv_sec = ACT_TIMEOUT; } 
- debug("TCP: index %li, timer expires in %lu.%03lus", conn - tc, + debug("TCP: index %li, timer expires in %lu.%03lus", CONN_IDX(conn), it.it_value.tv_sec, it.it_value.tv_nsec / 1000 / 1000); timerfd_settime(conn->timer, 0, &it, NULL); @@ -867,7 +874,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_conn *conn, conn->flags &= flag; if (fls(~flag) >= 0) { - debug("TCP: index %li: %s dropped", conn - tc, + debug("TCP: index %li: %s dropped", CONN_IDX(conn), tcp_flag_str[fls(~flag)]); } } else { @@ -876,7 +883,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_conn *conn, conn->flags |= flag; if (fls(flag) >= 0) { - debug("TCP: index %li: %s", conn - tc, + debug("TCP: index %li: %s", CONN_IDX(conn), tcp_flag_str[fls(flag)]); } } @@ -926,12 +933,12 @@ static void conn_event_do(const struct ctx *c, struct tcp_conn *conn, new += 5; if (prev != new) { - debug("TCP: index %li, %s: %s -> %s", conn - tc, + debug("TCP: index %li, %s: %s -> %s", CONN_IDX(conn), num == -1 ? "CLOSED" : tcp_event_str[num], prev == -1 ? "CLOSED" : tcp_state_str[prev], (new == -1 || num == -1) ? "CLOSED" : tcp_state_str[new]); } else { - debug("TCP: index %li, %s", conn - tc, + debug("TCP: index %li, %s", CONN_IDX(conn), num == -1 ? "CLOSED" : tcp_event_str[num]); } @@ -1355,12 +1362,12 @@ static void tcp_hash_insert(const struct ctx *c, struct tcp_conn *conn, int b; b = tcp_hash(c, af, addr, conn->tap_port, conn->sock_port); - conn->next_index = tc_hash[b] ? tc_hash[b] - tc : -1; + conn->next_index = tc_hash[b] ? 
CONN_IDX(tc_hash[b]) : -1; tc_hash[b] = conn; conn->hash_bucket = b; debug("TCP: hash table insert: index %li, sock %i, bucket: %i, next: " - "%p", conn - tc, conn->sock, b, CONN_OR_NULL(conn->next_index)); + "%p", CONN_IDX(conn), conn->sock, b, conn_at_idx(conn->next_index)); } /** @@ -1373,19 +1380,19 @@ static void tcp_hash_remove(const struct tcp_conn *conn) int b = conn->hash_bucket; for (entry = tc_hash[b]; entry; - prev = entry, entry = CONN_OR_NULL(entry->next_index)) { + prev = entry, entry = conn_at_idx(entry->next_index)) { if (entry == conn) { if (prev) prev->next_index = conn->next_index; else - tc_hash[b] = CONN_OR_NULL(conn->next_index); + tc_hash[b] = conn_at_idx(conn->next_index); break; } } debug("TCP: hash table remove: index %li, sock %i, bucket: %i, new: %p", - conn - tc, conn->sock, b, - prev ? CONN_OR_NULL(prev->next_index) : tc_hash[b]); + CONN_IDX(conn), conn->sock, b, + prev ? conn_at_idx(prev->next_index) : tc_hash[b]); } /** @@ -1399,10 +1406,10 @@ static void tcp_hash_update(struct tcp_conn *old, struct tcp_conn *new) int b = old->hash_bucket; for (entry = tc_hash[b]; entry; - prev = entry, entry = CONN_OR_NULL(entry->next_index)) { + prev = entry, entry = conn_at_idx(entry->next_index)) { if (entry == old) { if (prev) - prev->next_index = new - tc; + prev->next_index = CONN_IDX(new); else tc_hash[b] = new; break; @@ -1411,7 +1418,7 @@ static void tcp_hash_update(struct tcp_conn *old, struct tcp_conn *new) debug("TCP: hash table update: old index %li, new index %li, sock %i, " "bucket: %i, old: %p, new: %p", - old - tc, new - tc, new->sock, b, old, new); + CONN_IDX(old), CONN_IDX(new), new->sock, b, old, new); } /** @@ -1431,7 +1438,7 @@ static struct tcp_conn *tcp_hash_lookup(const struct ctx *c, int af, int b = tcp_hash(c, af, addr, tap_port, sock_port); struct tcp_conn *conn; - for (conn = tc_hash[b]; conn; conn = CONN_OR_NULL(conn->next_index)) { + for (conn = tc_hash[b]; conn; conn = conn_at_idx(conn->next_index)) { if 
(tcp_hash_match(conn, af, addr, tap_port, sock_port)) return conn; } @@ -1448,9 +1455,9 @@ static void tcp_table_compact(struct ctx *c, struct tcp_conn *hole) { struct tcp_conn *from, *to; - if ((hole - tc) == --c->tcp.conn_count) { + if (CONN_IDX(hole) == --c->tcp.conn_count) { debug("TCP: hash table compaction: maximum index was %li (%p)", - hole - tc, hole); + CONN_IDX(hole), hole); memset(hole, 0, sizeof(*hole)); return; } @@ -1465,7 +1472,7 @@ static void tcp_table_compact(struct ctx *c, struct tcp_conn *hole) debug("TCP: hash table compaction: old index %li, new index %li, " "sock %i, from: %p, to: %p", - from - tc, to - tc, from->sock, from, to); + CONN_IDX(from), CONN_IDX(to), from->sock, from, to); memset(from, 0, sizeof(*from)); } @@ -1488,7 +1495,7 @@ static void tcp_conn_destroy(struct ctx *c, struct tcp_conn *conn) static void tcp_rst_do(struct ctx *c, struct tcp_conn *conn); #define tcp_rst(c, conn) \ do { \ - debug("TCP: index %li, reset at %s:%i", conn - tc, \ + debug("TCP: index %li, reset at %s:%i", CONN_IDX(conn), \ __func__, __LINE__); \ tcp_rst_do(c, conn); \ } while (0) @@ -2734,7 +2741,7 @@ int tcp_tap_handler(struct ctx *c, int af, const void *addr, return 1; } - trace("TCP: packet length %lu from tap for index %lu", len, conn - tc); + trace("TCP: packet length %lu from tap for index %lu", len, CONN_IDX(conn)); if (th->rst) { conn_event(c, conn, CLOSED); @@ -2942,7 +2949,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref, */ static void tcp_timer_handler(struct ctx *c, union epoll_ref ref) { - struct tcp_conn *conn = CONN_OR_NULL(ref.r.p.tcp.tcp.index); + struct tcp_conn *conn = conn_at_idx(ref.r.p.tcp.tcp.index); struct itimerspec check_armed = { { 0 }, { 0 } }; if (!conn) @@ -2961,17 +2968,17 @@ static void tcp_timer_handler(struct ctx *c, union epoll_ref ref) conn_flag(c, conn, ~ACK_TO_TAP_DUE); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - debug("TCP: index %li, handshake 
timeout", conn - tc); + debug("TCP: index %li, handshake timeout", CONN_IDX(conn)); tcp_rst(c, conn); } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { - debug("TCP: index %li, FIN timeout", conn - tc); + debug("TCP: index %li, FIN timeout", CONN_IDX(conn)); tcp_rst(c, conn); } else if (conn->retrans == TCP_MAX_RETRANS) { debug("TCP: index %li, retransmissions count exceeded", - conn - tc); + CONN_IDX(conn)); tcp_rst(c, conn); } else { - debug("TCP: index %li, ACK timeout, retry", conn - tc); + debug("TCP: index %li, ACK timeout, retry", CONN_IDX(conn)); conn->retrans++; conn->seq_to_tap = conn->seq_ack_from_tap; tcp_data_from_sock(c, conn); @@ -2989,7 +2996,7 @@ static void tcp_timer_handler(struct ctx *c, union epoll_ref ref) */ timerfd_settime(conn->timer, 0, &new, &old); if (old.it_value.tv_sec == ACT_TIMEOUT) { - debug("TCP: index %li, activity timeout", conn - tc); + debug("TCP: index %li, activity timeout", CONN_IDX(conn)); tcp_rst(c, conn); } } @@ -3022,7 +3029,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events, return; } - if (!(conn = CONN_OR_NULL(ref.r.p.tcp.tcp.index))) + if (!(conn = conn_at_idx(ref.r.p.tcp.tcp.index))) return; if (conn->events == CLOSED) -- 2.38.1
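
For reference, the multiple-evaluation hazard mentioned in the commit message
above can be reproduced with a small standalone sketch.  CONN_OR_NULL(),
CONN(), CONN_IDX() and conn_at_idx() follow the shapes in the patch;
everything prefixed toy_ or TOY_, plus next_index() and the evals_with_*()
helpers, is hypothetical scaffolding that exists only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CONNS	8	/* hypothetical stand-in for TCP_MAX_CONNS */

struct toy_conn {		/* hypothetical stand-in for struct tcp_conn */
	int sock;
};

static struct toy_conn toy_tc[TOY_MAX_CONNS];

/* Old style: @index appears three times in the expansion */
#define CONN_OR_NULL(index)						\
	(((int)(index) >= 0 && (index) < TOY_MAX_CONNS) ?		\
	 (toy_tc + (index)) : NULL)

#define CONN(index)	(toy_tc + (index))
#define CONN_IDX(conn)	((conn) - toy_tc)

/* New style: @index is evaluated once, as a function argument */
static inline struct toy_conn *conn_at_idx(int index)
{
	if ((index < 0) || (index >= TOY_MAX_CONNS))
		return NULL;
	return CONN(index);
}

static int index_evals;		/* number of next_index() calls so far */

/* Hypothetical index source with a visible side effect */
static int next_index(void)
{
	index_evals++;
	return 5;
}

/* Count how often the macro evaluates its argument */
int evals_with_macro(void)
{
	struct toy_conn *conn;

	index_evals = 0;
	conn = CONN_OR_NULL(next_index());
	(void)conn;
	return index_evals;
}

/* Count how often the inline function evaluates its argument */
int evals_with_inline(void)
{
	struct toy_conn *conn;

	index_evals = 0;
	conn = conn_at_idx(next_index());
	(void)conn;
	return index_evals;
}
```

With this scaffolding, evals_with_macro() reports three calls to next_index()
(one per appearance of @index in the expansion) while evals_with_inline()
reports one, and CONN_IDX(conn_at_idx(i)) gives back i for any in-range i.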
Like we already have for non-spliced connections, create a CONN_IDX()
macro for looking up the index of spliced connection structures.

Change the name of the array of spliced connections to be different
from that for non-spliced connections (even though they're in
different modules).  This will make subsequent changes a bit safer.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tcp_splice.c | 43 +++++++++++++++++++++++++------------------
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index 99c5fa7..9186760 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -13,10 +13,11 @@
  * DOC: Theory of Operation
  *
  *
- * For local traffic directed to TCP ports configured for direct mapping between
- * namespaces, packets are directly translated between L4 sockets using a pair
- * of splice() syscalls. These connections are tracked in the @tc array of
- * struct tcp_splice_conn, using these events:
+ * For local traffic directed to TCP ports configured for direct
+ * mapping between namespaces, packets are directly translated between
+ * L4 sockets using a pair of splice() syscalls.
These connections are + * tracked in the @tc_splice array of struct tcp_splice_conn, using + * these events: * * - SPLICE_CONNECT: connection accepted, connecting to target * - SPLICE_ESTABLISHED: connection to target established @@ -113,10 +114,11 @@ struct tcp_splice_conn { #define CONN_V6(x) (x->flags & SOCK_V6) #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) -#define CONN(index) (tc + (index)) +#define CONN(index) (tc_splice + (index)) +#define CONN_IDX(conn) ((conn) - tc_splice) /* Spliced connections */ -static struct tcp_splice_conn tc[TCP_SPLICE_MAX_CONNS]; +static struct tcp_splice_conn tc_splice[TCP_SPLICE_MAX_CONNS]; /* Display strings for connection events */ static const char *tcp_splice_event_str[] __attribute((__unused__)) = { @@ -173,7 +175,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn, conn->flags &= flag; if (fls(~flag) >= 0) { - debug("TCP (spliced): index %li: %s dropped", conn - tc, + debug("TCP (spliced): index %li: %s dropped", CONN_IDX(conn), tcp_splice_flag_str[fls(~flag)]); } } else { @@ -182,7 +184,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn, conn->flags |= flag; if (fls(flag) >= 0) { - debug("TCP (spliced): index %li: %s", conn - tc, + debug("TCP (spliced): index %li: %s", CONN_IDX(conn), tcp_splice_flag_str[fls(flag)]); } } @@ -211,11 +213,11 @@ static int tcp_splice_epoll_ctl(const struct ctx *c, int m = (conn->flags & IN_EPOLL) ? 
EPOLL_CTL_MOD : EPOLL_CTL_ADD; union epoll_ref ref_a = { .r.proto = IPPROTO_TCP, .r.s = conn->a, .r.p.tcp.tcp.splice = 1, - .r.p.tcp.tcp.index = conn - tc, + .r.p.tcp.tcp.index = CONN_IDX(conn), .r.p.tcp.tcp.v6 = CONN_V6(conn) }; union epoll_ref ref_b = { .r.proto = IPPROTO_TCP, .r.s = conn->b, .r.p.tcp.tcp.splice = 1, - .r.p.tcp.tcp.index = conn - tc, + .r.p.tcp.tcp.index = CONN_IDX(conn), .r.p.tcp.tcp.v6 = CONN_V6(conn) }; struct epoll_event ev_a = { .data.u64 = ref_a.u64 }; struct epoll_event ev_b = { .data.u64 = ref_b.u64 }; @@ -257,7 +259,7 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn, conn->events &= event; if (fls(~event) >= 0) { - debug("TCP (spliced): index %li, ~%s", conn - tc, + debug("TCP (spliced): index %li, ~%s", CONN_IDX(conn), tcp_splice_event_str[fls(~event)]); } } else { @@ -266,7 +268,7 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn, conn->events |= event; if (fls(event) >= 0) { - debug("TCP (spliced): index %li, %s", conn - tc, + debug("TCP (spliced): index %li, %s", CONN_IDX(conn), tcp_splice_event_str[fls(event)]); } } @@ -292,8 +294,8 @@ static void tcp_table_splice_compact(struct ctx *c, { struct tcp_splice_conn *move; - if ((hole - tc) == --c->tcp.splice_conn_count) { - debug("TCP (spliced): index %li (max) removed", hole - tc); + if (CONN_IDX(hole) == --c->tcp.splice_conn_count) { + debug("TCP (spliced): index %li (max) removed", CONN_IDX(hole)); return; } @@ -307,7 +309,8 @@ static void tcp_table_splice_compact(struct ctx *c, move->pipe_b_a[0] = move->pipe_b_a[1] = -1; move->flags = move->events = 0; - debug("TCP (spliced): index %li moved to %li", move - tc, hole - tc); + debug("TCP (spliced): index %li moved to %li", + CONN_IDX(move), CONN_IDX(hole)); tcp_splice_epoll_ctl(c, hole); if (tcp_splice_epoll_ctl(c, hole)) conn_flag(c, hole, CLOSING); @@ -345,7 +348,7 @@ static void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn) conn->events = CLOSED; 
conn->flags = 0; - debug("TCP (spliced): index %li, CLOSED", conn - tc); + debug("TCP (spliced): index %li, CLOSED", CONN_IDX(conn)); tcp_table_splice_compact(c, conn); } @@ -872,7 +875,9 @@ void tcp_splice_timer(struct ctx *c) { struct tcp_splice_conn *conn; - for (conn = CONN(c->tcp.splice_conn_count - 1); conn >= tc; conn--) { + for (conn = CONN(c->tcp.splice_conn_count - 1); + conn >= tc_splice; + conn--) { if (conn->flags & CLOSING) { tcp_splice_destroy(c, conn); return; @@ -918,7 +923,9 @@ void tcp_splice_defer_handler(struct ctx *c) if (c->tcp.splice_conn_count < MIN(max_files / 6, max_conns)) return; - for (conn = CONN(c->tcp.splice_conn_count - 1); conn >= tc; conn--) { + for (conn = CONN(c->tcp.splice_conn_count - 1); + conn >= tc_splice; + conn--) { if (conn->flags & CLOSING) tcp_splice_destroy(c, conn); } -- 2.38.1
Currently spliced and non-spliced connections use completely
independent tracking structures.  We want to unify these, so as a
preliminary step move the definitions for both variants into a new
tcp_conn.h header, shared by tcp.c and tcp_splice.c.

This requires renaming some #defines with the same name but different
meanings between the two cases.  In the process we correct some places
that are slightly out of sync between the comments and the code for
various event bit names.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 Makefile     |   3 +-
 tcp.c        | 201 ++++++++++++---------------------------------------
 tcp_conn.h   | 168 ++++++++++++++++++++++++++++++++++++++++++
 tcp_splice.c |  93 +++++++-----------------
 4 files changed, 242 insertions(+), 223 deletions(-)
 create mode 100644 tcp_conn.h

diff --git a/Makefile b/Makefile
index 6b22408..80213d1 100644
--- a/Makefile
+++ b/Makefile
@@ -45,7 +45,8 @@ MANPAGES = passt.1 pasta.1 qrap.1

 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h icmp.h \
 	isolation.h lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h \
-	pcap.h port_fwd.h siphash.h tap.h tcp.h tcp_splice.h udp.h util.h
+	pcap.h port_fwd.h siphash.h tap.h tcp.h tcp_conn.h tcp_splice.h udp.h \
+	util.h
 HEADERS = $(PASST_HEADERS) seccomp.h

 # On gcc 11 and 12, with -O2 and -flto, tcp_hash() and siphash_20b(), if
diff --git a/tcp.c b/tcp.c
index 4e56a6c..628b3d9 100644
--- a/tcp.c
+++ b/tcp.c
@@ -98,7 +98,7 @@
  * Connection tracking and storage
  * -------------------------------
  *
- * Connections are tracked by the @tc array of struct tcp_conn, containing
+ * Connections are tracked by the @tc array of struct tcp_tap_conn, containing
  * addresses, ports, TCP states and parameters. This is statically allocated and
  * indexed by an arbitrary connection number.
The array is compacted whenever a * connection is closed, by remapping the highest connection index in use to the @@ -301,6 +301,8 @@ #include "tcp_splice.h" #include "log.h" +#include "tcp_conn.h" + #define TCP_FRAMES_MEM 128 #define TCP_FRAMES \ (c->mode == MODE_PASST ? TCP_FRAMES_MEM : 1) @@ -308,7 +310,6 @@ #define TCP_FILE_PRESSURE 30 /* % of c->nofile */ #define TCP_CONN_PRESSURE 30 /* % of c->tcp.conn_count */ -#define TCP_HASH_BUCKET_BITS (TCP_CONN_INDEX_BITS + 1) #define TCP_HASH_TABLE_LOAD 70 /* % */ #define TCP_HASH_TABLE_SIZE (TCP_MAX_CONNS * 100 / \ TCP_HASH_TABLE_LOAD) @@ -402,117 +403,8 @@ struct tcp6_l2_head { /* For MSS6 macro: keep in sync with tcp6_l2_buf_t */ #define OPT_SACK 5 #define OPT_TS 8 -/** - * struct tcp_conn - Descriptor for a TCP connection (not spliced) - * @next_index: Connection index of next item in hash chain, -1 for none - * @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS - * @sock: Socket descriptor number - * @events: Connection events, implying connection states - * @timer: timerfd descriptor for timeout events - * @flags: Connection flags representing internal attributes - * @hash_bucket: Bucket index in connection lookup hash table - * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT - * @ws_from_tap: Window scaling factor advertised from tap/guest - * @ws_to_tap: Window scaling factor advertised to tap/guest - * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS - * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @a.a6: IPv6 remote address, can be IPv4-mapped - * @a.a4.zero: Zero prefix for IPv4-mapped, see RFC 6890, Table 20 - * @a.a4.one: Ones prefix for IPv4-mapped - * @a.a4.a: IPv4 address - * @tap_port: Guest-facing tap port - * @sock_port: Remote, socket-facing port - * @wnd_from_tap: Last window size from tap, unscaled (as received) - * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) - * @seq_to_tap: Next sequence for packets to tap - * 
@seq_ack_from_tap: Last ACK number received from tap - * @seq_from_tap: Next sequence for packets from tap (not actually sent) - * @seq_ack_to_tap: Last ACK number sent to tap - * @seq_init_from_tap: Initial sequence number from tap - */ -struct tcp_conn { - int next_index :TCP_CONN_INDEX_BITS + 2; - -#define TCP_RETRANS_BITS 3 - unsigned int retrans :TCP_RETRANS_BITS; -#define TCP_MAX_RETRANS ((1U << TCP_RETRANS_BITS) - 1) - -#define TCP_WS_BITS 4 /* RFC 7323 */ -#define TCP_WS_MAX 14 - unsigned int ws_from_tap :TCP_WS_BITS; - unsigned int ws_to_tap :TCP_WS_BITS; - - - int sock :SOCKET_REF_BITS; - - uint8_t events; -#define CLOSED 0 -#define SOCK_ACCEPTED BIT(0) /* implies SYN sent to tap */ -#define TAP_SYN_RCVD BIT(1) /* implies socket connecting */ -#define TAP_SYN_ACK_SENT BIT( 3) /* implies socket connected */ -#define ESTABLISHED BIT(2) -#define SOCK_FIN_RCVD BIT( 3) -#define SOCK_FIN_SENT BIT( 4) -#define TAP_FIN_RCVD BIT( 5) -#define TAP_FIN_SENT BIT( 6) -#define TAP_FIN_ACKED BIT( 7) - -#define CONN_STATE_BITS /* Setting these clears other flags */ \ - (SOCK_ACCEPTED | TAP_SYN_RCVD | ESTABLISHED) - - - int timer :SOCKET_REF_BITS; - - uint8_t flags; -#define STALLED BIT(0) -#define LOCAL BIT(1) -#define WND_CLAMPED BIT(2) -#define IN_EPOLL BIT(3) -#define ACTIVE_CLOSE BIT(4) -#define ACK_TO_TAP_DUE BIT(5) -#define ACK_FROM_TAP_DUE BIT(6) - - - unsigned int hash_bucket :TCP_HASH_BUCKET_BITS; - -#define TCP_MSS_BITS 14 - unsigned int tap_mss :TCP_MSS_BITS; -#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS))) -#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS)) - - -#define SNDBUF_BITS 24 - unsigned int sndbuf :SNDBUF_BITS; -#define SNDBUF_SET(conn, bytes) (conn->sndbuf = ((bytes) >> (32 - SNDBUF_BITS))) -#define SNDBUF_GET(conn) (conn->sndbuf << (32 - SNDBUF_BITS)) - - uint8_t seq_dup_ack_approx; - - - union { - struct in6_addr a6; - struct { - uint8_t zero[10]; - uint8_t one[2]; - struct in_addr a; - } a4; - } a; #define 
CONN_V4(conn) IN6_IS_ADDR_V4MAPPED(&conn->a.a6) #define CONN_V6(conn) (!CONN_V4(conn)) - - in_port_t tap_port; - in_port_t sock_port; - - uint16_t wnd_from_tap; - uint16_t wnd_to_tap; - - uint32_t seq_to_tap; - uint32_t seq_ack_from_tap; - uint32_t seq_from_tap; - uint32_t seq_ack_to_tap; - uint32_t seq_init_from_tap; -}; - #define CONN_IS_CLOSING(conn) \ ((conn->events & ESTABLISHED) && \ (conn->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD))) @@ -695,7 +587,7 @@ static unsigned int tcp6_l2_flags_buf_used; static size_t tcp6_l2_flags_buf_bytes; /* TCP connections */ -static struct tcp_conn tc[TCP_MAX_CONNS]; +static struct tcp_tap_conn tc[TCP_MAX_CONNS]; #define CONN(index) (tc + (index)) #define CONN_IDX(conn) ((conn) - tc) @@ -705,7 +597,7 @@ static struct tcp_conn tc[TCP_MAX_CONNS]; * * Return: Pointer to connection, or NULL if @index is out of bounds */ -static inline struct tcp_conn *conn_at_idx(int index) +static inline struct tcp_tap_conn *conn_at_idx(int index) { if ((index < 0) || (index >= TCP_MAX_CONNS)) return NULL; @@ -713,7 +605,7 @@ static inline struct tcp_conn *conn_at_idx(int index) } /* Table for lookup from remote address, local port, remote port */ -static struct tcp_conn *tc_hash[TCP_HASH_TABLE_SIZE]; +static struct tcp_tap_conn *tc_hash[TCP_HASH_TABLE_SIZE]; /* Pools for pre-opened sockets */ int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; @@ -749,7 +641,7 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags) return EPOLLRDHUP; } -static void conn_flag_do(const struct ctx *c, struct tcp_conn *conn, +static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn, unsigned long flag); #define conn_flag(c, conn, flag) \ do { \ @@ -764,7 +656,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_conn *conn, * * Return: 0 on success, negative error code on failure (not on deletion) */ -static int tcp_epoll_ctl(const struct ctx *c, struct tcp_conn *conn) +static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn 
*conn) { int m = (conn->flags & IN_EPOLL) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD; union epoll_ref ref = { .r.proto = IPPROTO_TCP, .r.s = conn->sock, @@ -809,7 +701,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_conn *conn) * * #syscalls timerfd_create timerfd_settime */ -static void tcp_timer_ctl(const struct ctx *c, struct tcp_conn *conn) +static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) { struct itimerspec it = { { 0 }, { 0 } }; @@ -865,7 +757,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_conn *conn) * @conn: Connection pointer * @flag: Flag to set, or ~flag to unset */ -static void conn_flag_do(const struct ctx *c, struct tcp_conn *conn, +static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn, unsigned long flag) { if (flag & (flag - 1)) { @@ -903,7 +795,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_conn *conn, * @conn: Connection pointer * @event: Connection event */ -static void conn_event_do(const struct ctx *c, struct tcp_conn *conn, +static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, unsigned long event) { int prev, new, num = fls(event); @@ -963,7 +855,7 @@ static void conn_event_do(const struct ctx *c, struct tcp_conn *conn, * * Return: 1 if destination is in low RTT table, 0 otherwise */ -static int tcp_rtt_dst_low(const struct tcp_conn *conn) +static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) { int i; @@ -979,7 +871,7 @@ static int tcp_rtt_dst_low(const struct tcp_conn *conn) * @conn: Connection pointer * @tinfo: Pointer to struct tcp_info for socket */ -static void tcp_rtt_dst_check(const struct tcp_conn *conn, +static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, const struct tcp_info *tinfo) { #ifdef HAS_MIN_RTT @@ -1016,7 +908,7 @@ static void tcp_rtt_dst_check(const struct tcp_conn *conn, * tcp_get_sndbuf() - Get, scale SO_SNDBUF between thresholds (1 to 0.5 usage) * @conn: Connection pointer */ -static void 
tcp_get_sndbuf(struct tcp_conn *conn)
+static void tcp_get_sndbuf(struct tcp_tap_conn *conn)
 {
 	int s = conn->sock, sndbuf;
 	socklen_t sl;
@@ -1290,7 +1182,8 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find,
  *
  * Return: 1 on match, 0 otherwise
  */
-static int tcp_hash_match(const struct tcp_conn *conn, int af, const void *addr,
+static int tcp_hash_match(const struct tcp_tap_conn *conn,
+			  int af, const void *addr,
 			  in_port_t tap_port, in_port_t sock_port)
 {
 	if (af == AF_INET && CONN_V4(conn) &&
@@ -1356,7 +1249,7 @@ static unsigned int tcp_hash(const struct ctx *c, int af, const void *addr,
  * @af:		Address family, AF_INET or AF_INET6
  * @addr:	Remote address, pointer to in_addr or in6_addr
  */
-static void tcp_hash_insert(const struct ctx *c, struct tcp_conn *conn,
+static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn,
 			    int af, const void *addr)
 {
 	int b;
@@ -1374,9 +1267,9 @@ static void tcp_hash_insert(const struct ctx *c, struct tcp_conn *conn,
  * tcp_hash_remove() - Drop connection from hash table, chain unlink
  * @conn:	Connection pointer
  */
-static void tcp_hash_remove(const struct tcp_conn *conn)
+static void tcp_hash_remove(const struct tcp_tap_conn *conn)
 {
-	struct tcp_conn *entry, *prev = NULL;
+	struct tcp_tap_conn *entry, *prev = NULL;
 	int b = conn->hash_bucket;
 
 	for (entry = tc_hash[b]; entry;
@@ -1400,9 +1293,9 @@ static void tcp_hash_remove(const struct tcp_conn *conn)
  * @old:	Old connection pointer
  * @new:	New connection pointer
  */
-static void tcp_hash_update(struct tcp_conn *old, struct tcp_conn *new)
+static void tcp_hash_update(struct tcp_tap_conn *old, struct tcp_tap_conn *new)
 {
-	struct tcp_conn *entry, *prev = NULL;
+	struct tcp_tap_conn *entry, *prev = NULL;
 	int b = old->hash_bucket;
 
 	for (entry = tc_hash[b]; entry;
@@ -1431,12 +1324,12 @@ static void tcp_hash_update(struct tcp_conn *old, struct tcp_conn *new)
  *
  * Return: connection pointer, if found, -ENOENT otherwise
  */
-static struct tcp_conn *tcp_hash_lookup(const struct ctx *c, int af,
+static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, int af,
 					const void *addr, in_port_t tap_port,
 					in_port_t sock_port)
 {
 	int b = tcp_hash(c, af, addr, tap_port, sock_port);
-	struct tcp_conn *conn;
+	struct tcp_tap_conn *conn;
 
 	for (conn = tc_hash[b]; conn; conn = conn_at_idx(conn->next_index)) {
 		if (tcp_hash_match(conn, af, addr, tap_port, sock_port))
@@ -1451,9 +1344,9 @@ static struct tcp_conn *tcp_hash_lookup(const struct ctx *c, int af,
  * @c:		Execution context
  * @hole:	Pointer to recently closed connection
  */
-static void tcp_table_compact(struct ctx *c, struct tcp_conn *hole)
+static void tcp_table_compact(struct ctx *c, struct tcp_tap_conn *hole)
 {
-	struct tcp_conn *from, *to;
+	struct tcp_tap_conn *from, *to;
 
 	if (CONN_IDX(hole) == --c->tcp.conn_count) {
 		debug("TCP: hash table compaction: maximum index was %li (%p)",
@@ -1482,7 +1375,7 @@ static void tcp_table_compact(struct ctx *c, struct tcp_conn *hole)
  * @c:		Execution context
  * @conn:	Connection pointer
  */
-static void tcp_conn_destroy(struct ctx *c, struct tcp_conn *conn)
+static void tcp_conn_destroy(struct ctx *c, struct tcp_tap_conn *conn)
 {
 	close(conn->sock);
 	if (conn->timer != -1)
@@ -1492,7 +1385,7 @@ static void tcp_conn_destroy(struct ctx *c, struct tcp_conn *conn)
 	tcp_table_compact(c, conn);
 }
 
-static void tcp_rst_do(struct ctx *c, struct tcp_conn *conn);
+static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
 #define tcp_rst(c, conn)						\
 	do {								\
 		debug("TCP: index %li, reset at %s:%i", CONN_IDX(conn), \
@@ -1627,7 +1520,7 @@ void tcp_defer_handler(struct ctx *c)
 {
 	int max_conns = c->tcp.conn_count / 100 * TCP_CONN_PRESSURE;
 	int max_files = c->nofile / 100 * TCP_FILE_PRESSURE;
-	struct tcp_conn *conn;
+	struct tcp_tap_conn *conn;
 
 	tcp_l2_flags_buf_flush(c);
 	tcp_l2_data_buf_flush(c);
@@ -1656,7 +1549,7 @@ void tcp_defer_handler(struct ctx *c)
  * Return: 802.3 length, host order
  */
 static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
-				      const struct tcp_conn *conn,
+				      const struct tcp_tap_conn *conn,
 				      void *p, size_t plen,
 				      const uint16_t *check, uint32_t seq)
 {
@@ -1738,7 +1631,7 @@ do {									\
  *
  * Return: 1 if sequence or window were updated, 0 otherwise
  */
-static int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_conn *conn,
+static int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 				 int force_seq, struct tcp_info *tinfo)
 {
 	uint32_t prev_wnd_to_tap = conn->wnd_to_tap << conn->ws_to_tap;
@@ -1824,7 +1717,7 @@ out:
  *
  * Return: negative error code on connection reset, 0 otherwise
  */
-static int tcp_send_flag(struct ctx *c, struct tcp_conn *conn, int flags)
+static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 {
 	uint32_t prev_ack_to_tap = conn->seq_ack_to_tap;
 	uint32_t prev_wnd_to_tap = conn->wnd_to_tap;
@@ -1971,7 +1864,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_conn *conn, int flags)
  * @c:		Execution context
  * @conn:	Connection pointer
  */
-static void tcp_rst_do(struct ctx *c, struct tcp_conn *conn)
+static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn)
 {
 	if (conn->events == CLOSED)
 		return;
@@ -1986,7 +1879,7 @@ static void tcp_rst_do(struct ctx *c, struct tcp_conn *conn)
  * @opts:	Pointer to start of TCP options
  * @optlen:	Bytes in options: caller MUST ensure available length
  */
-static void tcp_get_tap_ws(struct tcp_conn *conn,
+static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
 			   const char *opts, size_t optlen)
 {
 	int ws = tcp_opt_get(opts, optlen, OPT_WS, NULL, NULL);
@@ -2003,7 +1896,7 @@ static void tcp_get_tap_ws(struct tcp_conn *conn,
  * @conn:	Connection pointer
  * @window:	Window value, host order, unscaled
  */
-static void tcp_clamp_window(const struct ctx *c, struct tcp_conn *conn,
+static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn,
 			     unsigned wnd)
 {
 	uint32_t prev_scaled = conn->wnd_from_tap << conn->ws_from_tap;
@@ -2125,7 +2018,7 @@ static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
  * Return: clamped MSS value
  */
 static uint16_t tcp_conn_tap_mss(const struct ctx *c,
-				 const struct tcp_conn *conn,
+				 const struct tcp_tap_conn *conn,
 				 const char *opts, size_t optlen)
 {
 	unsigned int mss;
@@ -2172,7 +2065,7 @@ static void tcp_conn_from_tap(struct ctx *c, int af, const void *addr,
 		.sin6_addr = *(struct in6_addr *)addr,
 	};
 	const struct sockaddr *sa;
-	struct tcp_conn *conn;
+	struct tcp_tap_conn *conn;
 	socklen_t sl;
 	int s, mss;
 
@@ -2280,7 +2173,7 @@ static void tcp_conn_from_tap(struct ctx *c, int af, const void *addr,
  *
  * Return: 0 on success, negative error code from recv() on failure
  */
-static int tcp_sock_consume(struct tcp_conn *conn, uint32_t ack_seq)
+static int tcp_sock_consume(struct tcp_tap_conn *conn, uint32_t ack_seq)
 {
 	/* Simply ignore out-of-order ACKs: we already consumed the data we
 	 * needed from the buffer, and we won't rewind back to a lower ACK
@@ -2307,7 +2200,7 @@ static int tcp_sock_consume(struct tcp_conn *conn, uint32_t ack_seq)
  * @seq:	Sequence number to be sent
  * @now:	Current timestamp
  */
-static void tcp_data_to_tap(struct ctx *c, struct tcp_conn *conn,
+static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
 			    ssize_t plen, int no_csum, uint32_t seq)
 {
 	struct iovec *iov;
@@ -2344,7 +2237,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_conn *conn,
  *
  * #syscalls recvmsg
  */
-static int tcp_data_from_sock(struct ctx *c, struct tcp_conn *conn)
+static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 {
 	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
 	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
@@ -2475,7 +2368,7 @@ zero_len:
  *
  * #syscalls sendmsg
  */
-static void tcp_data_from_tap(struct ctx *c, struct tcp_conn *conn,
+static void tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 			      const struct pool *p)
 {
 	int i, iov_i, ack = 0, fin = 0, retr = 0, keep = -1, partial_send = 0;
@@ -2675,7 +2568,7 @@ out:
  * @opts:	Pointer to start of options
  * @optlen:	Bytes in options: caller MUST ensure available length
  */
-static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_conn *conn,
+static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
 				      const struct tcphdr *th,
 				      const char *opts, size_t optlen)
 {
@@ -2714,7 +2607,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_conn *conn,
 int tcp_tap_handler(struct ctx *c, int af, const void *addr,
 		    const struct pool *p, const struct timespec *now)
 {
-	struct tcp_conn *conn;
+	struct tcp_tap_conn *conn;
 	size_t optlen, len;
 	struct tcphdr *th;
 	int ack_due = 0;
@@ -2829,7 +2722,7 @@ int tcp_tap_handler(struct ctx *c, int af, const void *addr,
  * @c:		Execution context
  * @conn:	Connection pointer
  */
-static void tcp_connect_finish(struct ctx *c, struct tcp_conn *conn)
+static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
 {
 	socklen_t sl;
 	int so;
@@ -2857,7 +2750,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 			       const struct timespec *now)
 {
 	struct sockaddr_storage sa;
-	struct tcp_conn *conn;
+	struct tcp_tap_conn *conn;
 	socklen_t sl;
 	int s;
 
@@ -2949,7 +2842,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
  */
 static void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 {
-	struct tcp_conn *conn = conn_at_idx(ref.r.p.tcp.tcp.index);
+	struct tcp_tap_conn *conn = conn_at_idx(ref.r.p.tcp.tcp.index);
 	struct itimerspec check_armed = { { 0 }, { 0 } };
 
 	if (!conn)
@@ -3012,7 +2905,7 @@ static void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
 		      const struct timespec *now)
 {
-	struct tcp_conn *conn;
+	struct tcp_tap_conn *conn;
 
 	if (ref.r.p.tcp.tcp.timer) {
 		tcp_timer_handler(c, ref);
@@ -3510,7 +3403,7 @@ static int tcp_port_rebind(void *arg)
 void tcp_timer(struct ctx *c, const struct timespec *ts)
 {
 	struct tcp_sock_refill_arg refill_arg = { c, 0 };
-	struct tcp_conn *conn;
+	struct tcp_tap_conn *conn;
 
 	(void)ts;
 
diff --git a/tcp_conn.h b/tcp_conn.h
new file mode 100644
index 0000000..db4c2d9
--- /dev/null
+++ b/tcp_conn.h
@@ -0,0 +1,168 @@
+/* SPDX-License-Identifier: AGPL-3.0-or-later
+ * Copyright Red Hat
+ * Author: Stefano Brivio <sbrivio(a)redhat.com>
+ * Author: David Gibson <david(a)gibson.dropbear.id.au>
+ *
+ * TCP connection tracking data structures, used by tcp.c and
+ * tcp_splice.c. Shouldn't be included in non-TCP code.
+ */
+#ifndef TCP_CONN_H
+#define TCP_CONN_H
+
+#define TCP_HASH_BUCKET_BITS		(TCP_CONN_INDEX_BITS + 1)
+
+/**
+ * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
+ * @next_index:		Connection index of next item in hash chain, -1 for none
+ * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
+ * @sock:		Socket descriptor number
+ * @events:		Connection events, implying connection states
+ * @timer:		timerfd descriptor for timeout events
+ * @flags:		Connection flags representing internal attributes
+ * @hash_bucket:	Bucket index in connection lookup hash table
+ * @retrans:		Number of retransmissions occurred due to ACK_TIMEOUT
+ * @ws_from_tap:	Window scaling factor advertised from tap/guest
+ * @ws_to_tap:		Window scaling factor advertised to tap/guest
+ * @sndbuf:		Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
+ * @seq_dup_ack_approx:	Last duplicate ACK number sent to tap
+ * @a.a6:		IPv6 remote address, can be IPv4-mapped
+ * @a.a4.zero:		Zero prefix for IPv4-mapped, see RFC 6890, Table 20
+ * @a.a4.one:		Ones prefix for IPv4-mapped
+ * @a.a4.a:		IPv4 address
+ * @tap_port:		Guest-facing tap port
+ * @sock_port:		Remote, socket-facing port
+ * @wnd_from_tap:	Last window size from tap, unscaled (as received)
+ * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
+ * @seq_to_tap:		Next sequence for packets to tap
+ * @seq_ack_from_tap:	Last ACK number received from tap
+ * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
+ * @seq_ack_to_tap:	Last ACK number sent to tap
+ * @seq_init_from_tap:	Initial sequence number from tap
+ */
+struct tcp_tap_conn {
+	int		next_index	:TCP_CONN_INDEX_BITS + 2;
+
+#define TCP_RETRANS_BITS		3
+	unsigned int	retrans		:TCP_RETRANS_BITS;
+#define TCP_MAX_RETRANS			((1U << TCP_RETRANS_BITS) - 1)
+
+#define TCP_WS_BITS			4	/* RFC 7323 */
+#define TCP_WS_MAX			14
+	unsigned int	ws_from_tap	:TCP_WS_BITS;
+	unsigned int	ws_to_tap	:TCP_WS_BITS;
+
+
+	int		sock		:SOCKET_REF_BITS;
+
+	uint8_t		events;
+#define CLOSED			0
+#define SOCK_ACCEPTED		BIT(0)	/* implies SYN sent to tap */
+#define TAP_SYN_RCVD		BIT(1)	/* implies socket connecting */
+#define TAP_SYN_ACK_SENT	BIT( 3)	/* implies socket connected */
+#define ESTABLISHED		BIT(2)
+#define SOCK_FIN_RCVD		BIT( 3)
+#define SOCK_FIN_SENT		BIT( 4)
+#define TAP_FIN_RCVD		BIT( 5)
+#define TAP_FIN_SENT		BIT( 6)
+#define TAP_FIN_ACKED		BIT( 7)
+
+#define	CONN_STATE_BITS		/* Setting these clears other flags */	\
+	(SOCK_ACCEPTED | TAP_SYN_RCVD | ESTABLISHED)
+
+
+	int		timer		:SOCKET_REF_BITS;
+
+	uint8_t		flags;
+#define STALLED			BIT(0)
+#define LOCAL			BIT(1)
+#define WND_CLAMPED		BIT(2)
+#define IN_EPOLL		BIT(3)
+#define ACTIVE_CLOSE		BIT(4)
+#define ACK_TO_TAP_DUE		BIT(5)
+#define ACK_FROM_TAP_DUE	BIT(6)
+
+
+	unsigned int	hash_bucket	:TCP_HASH_BUCKET_BITS;
+
+#define TCP_MSS_BITS			14
+	unsigned int	tap_mss		:TCP_MSS_BITS;
+#define MSS_SET(conn, mss)	(conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
+#define MSS_GET(conn)		(conn->tap_mss << (16 - TCP_MSS_BITS))
+
+
+#define SNDBUF_BITS		24
+	unsigned int	sndbuf		:SNDBUF_BITS;
+#define SNDBUF_SET(conn, bytes)	(conn->sndbuf = ((bytes) >> (32 - SNDBUF_BITS)))
+#define SNDBUF_GET(conn)	(conn->sndbuf << (32 - SNDBUF_BITS))
+
+	uint8_t		seq_dup_ack_approx;
+
+
+	union {
+		struct in6_addr a6;
+		struct {
+			uint8_t zero[10];
+			uint8_t one[2];
+			struct in_addr a;
+		} a4;
+	} a;
+
+	in_port_t	tap_port;
+	in_port_t	sock_port;
+
+	uint16_t	wnd_from_tap;
+	uint16_t	wnd_to_tap;
+
+	uint32_t	seq_to_tap;
+	uint32_t	seq_ack_from_tap;
+	uint32_t	seq_from_tap;
+	uint32_t	seq_ack_to_tap;
+	uint32_t	seq_init_from_tap;
+};
+
+/**
+ * struct tcp_splice_conn - Descriptor for a spliced TCP connection
+ * @a:			File descriptor number of socket for accepted connection
+ * @pipe_a_b:		Pipe ends for splice() from @a to @b
+ * @b:			File descriptor number of peer connected socket
+ * @pipe_b_a:		Pipe ends for splice() from @b to @a
+ * @events:		Events observed/actions performed on connection
+ * @flags:		Connection flags (attributes, not events)
+ * @a_read:		Bytes read from @a (not fully written to @b in one shot)
+ * @a_written:		Bytes written to @a (not fully written from one @b read)
+ * @b_read:		Bytes read from @b (not fully written to @a in one shot)
+ * @b_written:		Bytes written to @b (not fully written from one @a read)
+*/
+struct tcp_splice_conn {
+	int a;
+	int pipe_a_b[2];
+	int b;
+	int pipe_b_a[2];
+
+	uint8_t events;
+#define SPLICE_CLOSED			0
+#define SPLICE_CONNECT			BIT(0)
+#define SPLICE_ESTABLISHED		BIT(1)
+#define A_OUT_WAIT			BIT(2)
+#define B_OUT_WAIT			BIT(3)
+#define A_FIN_RCVD			BIT(4)
+#define B_FIN_RCVD			BIT(5)
+#define A_FIN_SENT			BIT(6)
+#define B_FIN_SENT			BIT(7)
+
+	uint8_t flags;
+#define SPLICE_V6			BIT(0)
+#define SPLICE_IN_EPOLL			BIT(1)
+#define RCVLOWAT_SET_A			BIT(2)
+#define RCVLOWAT_SET_B			BIT(3)
+#define RCVLOWAT_ACT_A			BIT(4)
+#define RCVLOWAT_ACT_B			BIT(5)
+#define CLOSING				BIT(6)
+
+	uint32_t a_read;
+	uint32_t a_written;
+	uint32_t b_read;
+	uint32_t b_written;
+};
+
+#endif /* TCP_CONN_H */
diff --git a/tcp_splice.c b/tcp_splice.c
index 9186760..515805c 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -21,12 +21,12 @@
  *
  * - SPLICE_CONNECT:		connection accepted, connecting to target
  * - SPLICE_ESTABLISHED:	connection to target established
- * - SPLICE_A_OUT_WAIT:		pipe to accepted socket full, wait for EPOLLOUT
- * - SPLICE_B_OUT_WAIT:		pipe to target socket full, wait for EPOLLOUT
- * - SPLICE_A_FIN_RCVD:		FIN (EPOLLRDHUP) seen from accepted socket
- * - SPLICE_B_FIN_RCVD:		FIN (EPOLLRDHUP) seen from target socket
- * - SPLICE_A_FIN_RCVD:		FIN (write shutdown) sent to accepted socket
- * - SPLICE_B_FIN_RCVD:		FIN (write shutdown) sent to target socket
+ * - A_OUT_WAIT:		pipe to accepted socket full, wait for EPOLLOUT
+ * - B_OUT_WAIT:		pipe to target socket full, wait for EPOLLOUT
+ * - A_FIN_RCVD:		FIN (EPOLLRDHUP) seen from accepted socket
+ * - B_FIN_RCVD:		FIN (EPOLLRDHUP) seen from target socket
+ * - A_FIN_RCVD:		FIN (write shutdown) sent to accepted socket
+ * - B_FIN_RCVD:		FIN (write shutdown) sent to target socket
  *
  * #syscalls:pasta pipe2|pipe fcntl armv6l:fcntl64 armv7l:fcntl64 ppc64:fcntl64
  */
@@ -51,6 +51,8 @@
 #include "passt.h"
 #include "log.h"
 
+#include "tcp_conn.h"
+
 #define MAX_PIPE_SIZE			(8UL * 1024 * 1024)
 #define TCP_SPLICE_MAX_CONNS		(128 * 1024)
 #define TCP_SPLICE_PIPE_POOL_SIZE	16
@@ -66,52 +68,7 @@ extern int ns_sock_pool6	[TCP_SOCK_POOL_SIZE];
 /* Pool of pre-opened pipes */
 static int splice_pipe_pool		[TCP_SPLICE_PIPE_POOL_SIZE][2][2];
 
-/**
- * struct tcp_splice_conn - Descriptor for a spliced TCP connection
- * @a:			File descriptor number of socket for accepted connection
- * @pipe_a_b:		Pipe ends for splice() from @a to @b
- * @b:			File descriptor number of peer connected socket
- * @pipe_b_a:		Pipe ends for splice() from @b to @a
- * @events:		Events observed/actions performed on connection
- * @flags:		Connection flags (attributes, not events)
- * @a_read:		Bytes read from @a (not fully written to @b in one shot)
- * @a_written:		Bytes written to @a (not fully written from one @b read)
- * @b_read:		Bytes read from @b (not fully written to @a in one shot)
- * @b_written:		Bytes written to @b (not fully written from one @a read)
-*/
-struct tcp_splice_conn {
-	int a;
-	int pipe_a_b[2];
-	int b;
-	int pipe_b_a[2];
-
-	uint8_t events;
-#define CLOSED				0
-#define CONNECT				BIT(0)
-#define ESTABLISHED			BIT(1)
-#define A_OUT_WAIT			BIT(2)
-#define B_OUT_WAIT			BIT(3)
-#define A_FIN_RCVD			BIT(4)
-#define B_FIN_RCVD			BIT(5)
-#define A_FIN_SENT			BIT(6)
-#define B_FIN_SENT			BIT(7)
-
-	uint8_t flags;
-#define SOCK_V6				BIT(0)
-#define IN_EPOLL			BIT(1)
-#define RCVLOWAT_SET_A			BIT(2)
-#define RCVLOWAT_SET_B			BIT(3)
-#define RCVLOWAT_ACT_A			BIT(4)
-#define RCVLOWAT_ACT_B			BIT(5)
-#define CLOSING				BIT(6)
-
-	uint32_t a_read;
-	uint32_t a_written;
-	uint32_t b_read;
-	uint32_t b_written;
-};
-
-#define CONN_V6(x)			(x->flags & SOCK_V6)
+#define CONN_V6(x)			(x->flags & SPLICE_V6)
 #define CONN_V4(x)			(!CONN_V6(x))
 #define CONN_HAS(conn, set)		((conn->events & (set)) == (set))
 #define CONN(index)			(tc_splice + (index))
@@ -122,13 +79,13 @@ static struct tcp_splice_conn tc_splice[TCP_SPLICE_MAX_CONNS];
 
 /* Display strings for connection events */
 static const char *tcp_splice_event_str[] __attribute((__unused__)) = {
-	"CONNECT", "ESTABLISHED", "A_OUT_WAIT", "B_OUT_WAIT",
+	"SPLICE_CONNECT", "SPLICE_ESTABLISHED", "A_OUT_WAIT", "B_OUT_WAIT",
 	"A_FIN_RCVD", "B_FIN_RCVD", "A_FIN_SENT", "B_FIN_SENT",
 };
 
 /* Display strings for connection flags */
 static const char *tcp_splice_flag_str[] __attribute((__unused__)) = {
-	"SOCK_V6", "IN_EPOLL", "RCVLOWAT_SET_A", "RCVLOWAT_SET_B",
+	"SPLICE_V6", "SPLICE_IN_EPOLL", "RCVLOWAT_SET_A", "RCVLOWAT_SET_B",
 	"RCVLOWAT_ACT_A", "RCVLOWAT_ACT_B", "CLOSING",
 };
 
@@ -143,12 +100,12 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
 {
 	*a = *b = 0;
 
-	if (events & ESTABLISHED) {
+	if (events & SPLICE_ESTABLISHED) {
 		if (!(events & B_FIN_SENT))
 			*a = EPOLLIN | EPOLLRDHUP;
 		if (!(events & A_FIN_SENT))
 			*b = EPOLLIN | EPOLLRDHUP;
-	} else if (events & CONNECT) {
+	} else if (events & SPLICE_CONNECT) {
 		*b = EPOLLOUT;
 	}
 
@@ -210,7 +167,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
 static int tcp_splice_epoll_ctl(const struct ctx *c,
 				struct tcp_splice_conn *conn)
 {
-	int m = (conn->flags & IN_EPOLL) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
+	int m = (conn->flags & SPLICE_IN_EPOLL) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref_a = { .r.proto = IPPROTO_TCP, .r.s = conn->a,
 				  .r.p.tcp.tcp.splice = 1,
 				  .r.p.tcp.tcp.index = CONN_IDX(conn),
@@ -234,7 +191,7 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
 	    epoll_ctl(c->epollfd, m, conn->b, &ev_b))
 		goto delete;
 
-	conn->flags |= IN_EPOLL;	/* No need to log this */
+	conn->flags |= SPLICE_IN_EPOLL;	/* No need to log this */
 
 	return 0;
 
@@ -323,7 +280,7 @@ static void tcp_table_splice_compact(struct ctx *c,
  */
 static void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn)
 {
-	if (conn->events & ESTABLISHED) {
+	if (conn->events & SPLICE_ESTABLISHED) {
 		/* Flushing might need to block: don't recycle them. */
 		if (conn->pipe_a_b[0] != -1) {
 			close(conn->pipe_a_b[0]);
@@ -337,7 +294,7 @@ static void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn)
 		}
 	}
 
-	if (conn->events & CONNECT) {
+	if (conn->events & SPLICE_CONNECT) {
 		close(conn->b);
 		conn->b = -1;
 	}
@@ -346,7 +303,7 @@ static void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn)
 	conn->a = -1;
 	conn->a_read = conn->a_written = conn->b_read = conn->b_written = 0;
 
-	conn->events = CLOSED;
+	conn->events = SPLICE_CLOSED;
 	conn->flags = 0;
 	debug("TCP (spliced): index %li, CLOSED", CONN_IDX(conn));
 
@@ -397,8 +354,8 @@ static int tcp_splice_connect_finish(const struct ctx *c,
 		}
 	}
 
-	if (!(conn->events & ESTABLISHED))
-		conn_event(c, conn, ESTABLISHED);
+	if (!(conn->events & SPLICE_ESTABLISHED))
+		conn_event(c, conn, SPLICE_ESTABLISHED);
 
 	return 0;
 }
@@ -466,9 +423,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
 			close(sock_conn);
 			return ret;
 		}
-		conn_event(c, conn, CONNECT);
+		conn_event(c, conn, SPLICE_CONNECT);
 	} else {
-		conn_event(c, conn, ESTABLISHED);
+		conn_event(c, conn, SPLICE_ESTABLISHED);
 		return tcp_splice_connect_finish(c, conn);
 	}
 
@@ -598,7 +555,7 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 
 		conn = CONN(c->tcp.splice_conn_count++);
 		conn->a = s;
-		conn->flags = ref.r.p.tcp.tcp.v6 ? SOCK_V6 : 0;
+		conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0;
 
 		if (tcp_splice_new(c, conn, ref.r.p.tcp.tcp.index,
 				   ref.r.p.tcp.tcp.outbound))
@@ -609,13 +566,13 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 
 	conn = CONN(ref.r.p.tcp.tcp.index);
 
-	if (conn->events == CLOSED)
+	if (conn->events == SPLICE_CLOSED)
 		return;
 
 	if (events & EPOLLERR)
 		goto close;
 
-	if (conn->events == CONNECT) {
+	if (conn->events == SPLICE_CONNECT) {
 		if (!(events & EPOLLOUT))
 			goto close;
 		if (tcp_splice_connect_finish(c, conn))
-- 
2.38.1
Currently, the tables for spliced and non-spliced connections are entirely
separate, with different types in different arrays. We want to unify them.
As a first step, create a union type which can represent either a spliced
or non-spliced connection. For them to be distinguishable, the individual
types need to have a common header added, with a bit indicating which type
this structure is.

This comes at the cost of increasing the size of tcp_tap_conn to over one
(64 byte) cacheline. This isn't ideal, but it makes things simpler for now
and we'll re-optimize this later.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tcp.c        |  4 ++++
 tcp_conn.h   | 30 ++++++++++++++++++++++++++++++
 tcp_splice.c |  2 ++
 3 files changed, 36 insertions(+)

diff --git a/tcp.c b/tcp.c
index 628b3d9..1acb0c6 100644
--- a/tcp.c
+++ b/tcp.c
@@ -288,6 +288,7 @@
 #include <sys/uio.h>
 #include <unistd.h>
 #include <time.h>
+#include <assert.h>
 
 #include <linux/tcp.h> /* For struct tcp_info */
 
@@ -601,6 +602,7 @@ static inline struct tcp_tap_conn *conn_at_idx(int index)
 {
 	if ((index < 0) || (index >= TCP_MAX_CONNS))
 		return NULL;
+	assert(!(CONN(index)->c.spliced));
 	return CONN(index);
 }
 
@@ -2095,6 +2097,7 @@ static void tcp_conn_from_tap(struct ctx *c, int af, const void *addr,
 	}
 
 	conn = CONN(c->tcp.conn_count++);
+	conn->c.spliced = false;
 	conn->sock = s;
 	conn->timer = -1;
 	conn_event(c, conn, TAP_SYN_RCVD);
@@ -2763,6 +2766,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 		return;
 
 	conn = CONN(c->tcp.conn_count++);
+	conn->c.spliced = false;
 	conn->sock = s;
 	conn->timer = -1;
 	conn->ws_to_tap = conn->ws_from_tap = 0;
diff --git a/tcp_conn.h b/tcp_conn.h
index db4c2d9..39d104a 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -11,8 +11,19 @@
 
 #define TCP_HASH_BUCKET_BITS		(TCP_CONN_INDEX_BITS + 1)
 
+/**
+ * struct tcp_conn_common - Common fields for spliced and non-spliced
+ * @spliced:		Is this a spliced connection?
+ */
+struct tcp_conn_common {
+	bool spliced	:1;
+};
+
+extern const char *tcp_common_flag_str[];
+
 /**
  * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
+ * @c:			Fields common with tcp_splice_conn
  * @next_index:		Connection index of next item in hash chain, -1 for none
  * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
  * @sock:		Socket descriptor number
@@ -40,6 +51,9 @@
  * @seq_init_from_tap:	Initial sequence number from tap
  */
 struct tcp_tap_conn {
+	/* Must be first element to match tcp_splice_conn */
+	struct tcp_conn_common c;
+
 	int		next_index	:TCP_CONN_INDEX_BITS + 2;
 
 #define TCP_RETRANS_BITS		3
@@ -122,6 +136,7 @@ struct tcp_tap_conn {
 
 /**
  * struct tcp_splice_conn - Descriptor for a spliced TCP connection
+ * @c:			Fields common with tcp_tap_conn
  * @a:			File descriptor number of socket for accepted connection
  * @pipe_a_b:		Pipe ends for splice() from @a to @b
  * @b:			File descriptor number of peer connected socket
@@ -134,6 +149,9 @@ struct tcp_tap_conn {
  * @b_written:		Bytes written to @b (not fully written from one @a read)
 */
 struct tcp_splice_conn {
+	/* Must be first element to match tcp_tap_conn */
+	struct tcp_conn_common c;
+
 	int a;
 	int pipe_a_b[2];
 	int b;
@@ -165,4 +183,16 @@ struct tcp_splice_conn {
 	uint32_t b_written;
 };
 
+/**
+ * union tcp_conn - Descriptor for a TCP connection (spliced or non-spliced)
+ * @c:			Fields common between all variants
+ * @tap:		Fields specific to non-spliced connections
+ * @splice:		Fields specific to spliced connections
+*/
+union tcp_conn {
+	struct tcp_conn_common c;
+	struct tcp_tap_conn tap;
+	struct tcp_splice_conn splice;
+};
+
 #endif /* TCP_CONN_H */
diff --git a/tcp_splice.c b/tcp_splice.c
index 515805c..c4d4e6f 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -46,6 +46,7 @@
 #include <sys/epoll.h>
 #include <sys/types.h>
 #include <sys/socket.h>
+#include <assert.h>
 
 #include "util.h"
 #include "passt.h"
@@ -554,6 +555,7 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 		}
 
 		conn = CONN(c->tcp.splice_conn_count++);
+		conn->c.spliced = true;
 		conn->a = s;
 		conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0;
 
-- 
2.38.1
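For readers following along, the common-header idea in the patch above can be tried out in isolation. The following is only a minimal sketch, not the passt code: the names conn_common, tap_conn, splice_conn, conn, conn_init_tap(), conn_init_splice() and conn_primary_fd() are hypothetical stand-ins, and the structures are cut down to a couple of fields. It relies on the C rule that structures sharing a common initial sequence may be inspected through any member of a union containing them.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the types in tcp_conn.h */
struct conn_common {
	bool spliced	:1;	/* which variant does this slot hold? */
};

struct tap_conn {
	struct conn_common c;	/* must be first, matches splice_conn */
	int sock;
};

struct splice_conn {
	struct conn_common c;	/* must be first, matches tap_conn */
	int a;
	int b;
};

union conn {
	struct conn_common c;
	struct tap_conn tap;
	struct splice_conn splice;
};

/* Initialise a slot as a non-spliced connection */
static void conn_init_tap(union conn *conn, int sock)
{
	memset(conn, 0, sizeof(*conn));
	conn->tap.c.spliced = false;
	conn->tap.sock = sock;
}

/* Initialise a slot as a spliced connection */
static void conn_init_splice(union conn *conn, int a, int b)
{
	memset(conn, 0, sizeof(*conn));
	conn->splice.c.spliced = true;
	conn->splice.a = a;
	conn->splice.b = b;
}

/* Pick the accepted/primary socket without knowing the variant up front */
static int conn_primary_fd(const union conn *conn)
{
	if (conn->c.spliced)	/* common header is valid for both variants */
		return conn->splice.a;

	return conn->tap.sock;
}
```

Because the header sits at offset zero in both structures, code holding only a union pointer can test c.spliced before touching any variant-specific field, which is what the assert() added to conn_at_idx() depends on.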
When we compact the connection tables (both spliced and non-spliced) we
need to move entries from one slot to another. That requires some updates
in the entries themselves. Add helpers to make all the necessary updates
for the spliced and non-spliced cases. This will simplify later cleanups.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tcp.c        | 16 +++++++++-------
 tcp_splice.c | 17 ++++++++++++++---
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/tcp.c b/tcp.c
index 1acb0c6..44e1640 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1291,11 +1291,13 @@ static void tcp_hash_remove(const struct tcp_tap_conn *conn)
 }
 
 /**
- * tcp_hash_update() - Update pointer for given connection
- * @old:	Old connection pointer
- * @new:	New connection pointer
+ * tcp_tap_conn_update() - Update tcp_tap_conn when being moved in the table
+ * @c:		Execution context
+ * @old:	Old location of tcp_tap_conn
+ * @new:	New location of tcp_tap_conn
  */
-static void tcp_hash_update(struct tcp_tap_conn *old, struct tcp_tap_conn *new)
+static void tcp_tap_conn_update(struct ctx *c, struct tcp_tap_conn *old,
+				struct tcp_tap_conn *new)
 {
 	struct tcp_tap_conn *entry, *prev = NULL;
 	int b = old->hash_bucket;
@@ -1314,6 +1316,8 @@ static void tcp_hash_update(struct tcp_tap_conn *old, struct tcp_tap_conn *new)
 	debug("TCP: hash table update: old index %li, new index %li, sock %i, "
 	      "bucket: %i, old: %p, new: %p",
 	      CONN_IDX(old), CONN_IDX(new), new->sock, b, old, new);
+
+	tcp_epoll_ctl(c, new);
 }
 
 /**
@@ -1361,9 +1365,7 @@ static void tcp_table_compact(struct ctx *c, struct tcp_tap_conn *hole)
 	memcpy(hole, from, sizeof(*hole));
 
 	to = hole;
-	tcp_hash_update(from, to);
-
-	tcp_epoll_ctl(c, to);
+	tcp_tap_conn_update(c, from, to);
 
 	debug("TCP: hash table compaction: old index %li, new index %li, "
 	      "sock %i, from: %p, to: %p",
diff --git a/tcp_splice.c b/tcp_splice.c
index c4d4e6f..42133af 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -242,6 +242,19 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
 			conn_event_do(c, conn, event);			\
 	} while (0)
 
+
+/**
+ * tcp_splice_conn_update() - Update tcp_splice_conn when being moved in the table
+ * @c:		Execution context
+ * @new:	New location of tcp_splice_conn
+ */
+static void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new)
+{
+	tcp_splice_epoll_ctl(c, new);
+	if (tcp_splice_epoll_ctl(c, new))
+		conn_flag(c, new, CLOSING);
+}
+
 /**
  * tcp_table_splice_compact - Compact spliced connection table
  * @c:		Execution context
@@ -269,9 +282,7 @@ static void tcp_table_splice_compact(struct ctx *c,
 
 	debug("TCP (spliced): index %li moved to %li",
 	      CONN_IDX(move), CONN_IDX(hole));
-	tcp_splice_epoll_ctl(c, hole);
-	if (tcp_splice_epoll_ctl(c, hole))
-		conn_flag(c, hole, CLOSING);
+	tcp_splice_conn_update(c, hole);
 }
 
 /**
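To illustrate why a moved entry needs fixing up at all, here is a rough sketch of the hash-chain half of that job, under simplifying assumptions: the table, bucket array and every name below (entry, table, bucket_head, entry_moved() and so on) are hypothetical, chains are linked purely by index, and the epoll re-registration that the real helpers also perform is left out.

```c
#include <assert.h>
#include <string.h>

#define NCONN	8
#define NBUCKET	4

/* Hypothetical table entry: chained into a bucket by index, not by pointer */
struct entry {
	int next_index;	/* index of next entry in chain, -1 for none */
	int bucket;	/* bucket this entry is chained into */
	int key;	/* non-negative lookup key */
};

static struct entry table[NCONN];
static int bucket_head[NBUCKET];	/* index of first entry, -1 if empty */

static void table_init(void)
{
	int b;

	memset(table, 0, sizeof(table));
	for (b = 0; b < NBUCKET; b++)
		bucket_head[b] = -1;
}

/* Put the entry at @idx at the head of the chain for @key */
static void chain_insert(int idx, int key)
{
	int b = key % NBUCKET;

	table[idx].key = key;
	table[idx].bucket = b;
	table[idx].next_index = bucket_head[b];
	bucket_head[b] = idx;
}

/* After copying slot @old to slot @new, re-point whatever referenced @old */
static void entry_moved(int old, int new)
{
	int *link = &bucket_head[table[new].bucket];

	while (*link != -1 && *link != old)
		link = &table[*link].next_index;

	if (*link == old)
		*link = new;
}

/* Move an entry between slots, keeping its chain consistent */
static void move_entry(int old, int new)
{
	memcpy(&table[new], &table[old], sizeof(table[new]));
	entry_moved(old, new);
	memset(&table[old], 0, sizeof(table[old]));
}

/* Return the index of the entry with @key, -1 if not found */
static int chain_lookup(int key)
{
	int i;

	for (i = bucket_head[key % NBUCKET]; i != -1; i = table[i].next_index)
		if (table[i].key == key)
			return i;

	return -1;
}
```

The copy itself is a plain memcpy(); the work is in finding the one link (bucket head or predecessor) that still names the old slot. Bundling that with the epoll update in a single helper means callers that move entries only have one function to remember.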
Currently spliced and non-spliced connections are stored in completely
separate tables, so there are completely independent limits on the number
of spliced and non-spliced connections. This is a bit counter-intuitive.
More importantly, the fact that the tables are separate prevents us from
unifying some other logic between the two cases.

So, merge these two tables into one, using the 'c.spliced' common field to
distinguish between them when necessary.

For now we keep a common limit of 128k connections, whether they're spliced
or non-spliced, which means we save memory overall. If necessary we could
increase this to a 256k or higher total, which would cost memory but give
some more flexibility.

For now, the code paths which need to step through all extant connections
are still separate for the two cases, just skipping over entries which
aren't for them. We'll improve that in later patches.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tcp.c        | 46 ++++++++++++++++++++----------------
 tcp.h        |  2 +-
 tcp_conn.h   |  6 +++++
 tcp_splice.c | 66 ++++++++++++++--------------------------------------
 4 files changed, 51 insertions(+), 69 deletions(-)

diff --git a/tcp.c b/tcp.c
index 44e1640..ffc030e 100644
--- a/tcp.c
+++ b/tcp.c
@@ -98,11 +98,11 @@
  * Connection tracking and storage
  * -------------------------------
  *
- * Connections are tracked by the @tc array of struct tcp_tap_conn, containing
- * addresses, ports, TCP states and parameters. This is statically allocated and
- * indexed by an arbitrary connection number. The array is compacted whenever a
- * connection is closed, by remapping the highest connection index in use to the
- * one freed up.
+ * Connections are tracked by struct tcp_tap_conn entries in the @tc
+ * array, containing addresses, ports, TCP states and parameters. This
+ * is statically allocated and indexed by an arbitrary connection
+ * number. The array is compacted whenever a connection is closed, by
+ * remapping the highest connection index in use to the one freed up.
  *
  * References used for the epoll interface report the connection index used for
  * the @tc array.
@@ -588,10 +588,10 @@ static unsigned int tcp6_l2_flags_buf_used;
 static size_t tcp6_l2_flags_buf_bytes;
 
 /* TCP connections */
-static struct tcp_tap_conn tc[TCP_MAX_CONNS];
+union tcp_conn tc[TCP_MAX_CONNS];
 
-#define CONN(index)		(tc + (index))
-#define CONN_IDX(conn)		((conn) - tc)
+#define CONN(index)		(&tc[(index)].tap)
+#define CONN_IDX(conn)		((union tcp_conn *)(conn) - tc)
 
 /** conn_at_idx() - Find a connection by index, if present
  * @index:	Index of connection to lookup
@@ -1350,26 +1350,28 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, int af,
  * @c:		Execution context
  * @hole:	Pointer to recently closed connection
  */
-static void tcp_table_compact(struct ctx *c, struct tcp_tap_conn *hole)
+void tcp_table_compact(struct ctx *c, union tcp_conn *hole)
 {
-	struct tcp_tap_conn *from, *to;
+	union tcp_conn *from;
 
 	if (CONN_IDX(hole) == --c->tcp.conn_count) {
-		debug("TCP: hash table compaction: maximum index was %li (%p)",
+		debug("TCP: table compaction: maximum index was %li (%p)",
 		      CONN_IDX(hole), hole);
 		memset(hole, 0, sizeof(*hole));
 		return;
 	}
 
-	from = CONN(c->tcp.conn_count);
+	from = tc + c->tcp.conn_count;
 	memcpy(hole, from, sizeof(*hole));
 
-	to = hole;
-	tcp_tap_conn_update(c, from, to);
+	if (from->c.spliced)
+		tcp_splice_conn_update(c, &hole->splice);
+	else
+		tcp_tap_conn_update(c, &from->tap, &hole->tap);
 
-	debug("TCP: hash table compaction: old index %li, new index %li, "
-	      "sock %i, from: %p, to: %p",
-	      CONN_IDX(from), CONN_IDX(to), from->sock, from, to);
+	debug("TCP: table compaction (spliced=%d): old index %li, new index %li, "
+	      "from: %p, to: %p",
+	      from->c.spliced, CONN_IDX(from), CONN_IDX(hole), from, hole);
 
 	memset(from, 0, sizeof(*from));
 }
@@ -1386,7 +1388,7 @@ static void tcp_conn_destroy(struct ctx *c, struct tcp_tap_conn *conn)
 		close(conn->timer);
 
 	tcp_hash_remove(conn);
-	tcp_table_compact(c, conn);
+	tcp_table_compact(c, (union tcp_conn *)conn);
 }
 
 static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
@@ -1534,7 +1536,9 @@ void tcp_defer_handler(struct ctx *c)
 	if (c->tcp.conn_count < MIN(max_files, max_conns))
 		return;
 
-	for (conn = CONN(c->tcp.conn_count - 1); conn >= tc; conn--) {
+	for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) {
+		if (conn->c.spliced)
+			continue;
 		if (conn->events == CLOSED)
 			tcp_conn_destroy(c, conn);
 	}
@@ -3432,7 +3436,9 @@ void tcp_timer(struct ctx *c, const struct timespec *ts)
 		}
 	}
 
-	for (conn = CONN(c->tcp.conn_count - 1); conn >= tc; conn--) {
+	for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) {
+		if (conn->c.spliced)
+			continue;
 		if (conn->events == CLOSED)
 			tcp_conn_destroy(c, conn);
 	}
diff --git a/tcp.h b/tcp.h
index bba0f38..49738ef 100644
--- a/tcp.h
+++ b/tcp.h
@@ -54,7 +54,7 @@ union tcp_epoll_ref {
 /**
  * struct tcp_ctx - Execution context for TCP routines
  * @hash_secret:	128-bit secret for hash functions, ISN and hash table
- * @conn_count:		Count of connections (not spliced) in connection table
+ * @conn_count:		Count of total connections in connection table
  * @splice_conn_count:	Count of spliced connections in connection table
  * @port_to_tap:	Ports bound host-side, packets to tap or spliced
  * @fwd_in:		Port forwarding configuration for inbound packets
diff --git a/tcp_conn.h b/tcp_conn.h
index 39d104a..4295f7d 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -195,4 +195,10 @@ union tcp_conn {
 	struct tcp_splice_conn splice;
 };
 
+/* TCP connections */
+extern union tcp_conn tc[];
+
+void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new);
+void tcp_table_compact(struct ctx *c, union tcp_conn *hole);
+
 #endif /* TCP_CONN_H */
diff --git a/tcp_splice.c b/tcp_splice.c
index 42133af..f12dc2b 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -16,7 +16,7 @@
  * For local traffic directed to TCP ports configured for direct
  * mapping between namespaces, packets are directly translated between
  * L4 sockets using a pair of splice() syscalls. These connections are
- * tracked in the @tc_splice array of struct tcp_splice_conn, using
+ * tracked by struct tcp_splice_conn entries in the @tc array, using
  * these events:
  *
  * - SPLICE_CONNECT:		connection accepted, connecting to target
@@ -57,7 +57,7 @@
 #define MAX_PIPE_SIZE			(8UL * 1024 * 1024)
 #define TCP_SPLICE_MAX_CONNS		(128 * 1024)
 #define TCP_SPLICE_PIPE_POOL_SIZE	16
-#define TCP_SPLICE_CONN_PRESSURE	30	/* % of splice_conn_count */
+#define TCP_SPLICE_CONN_PRESSURE	30	/* % of conn_count */
 #define TCP_SPLICE_FILE_PRESSURE	30	/* % of c->nofile */
 
 /* From tcp.c */
@@ -72,11 +72,8 @@ static int splice_pipe_pool		[TCP_SPLICE_PIPE_POOL_SIZE][2][2];
 #define CONN_V6(x)			(x->flags & SPLICE_V6)
 #define CONN_V4(x)			(!CONN_V6(x))
 #define CONN_HAS(conn, set)		((conn->events & (set)) == (set))
-#define CONN(index)			(tc_splice + (index))
-#define CONN_IDX(conn)			((conn) - tc_splice)
-
-/* Spliced connections */
-static struct tcp_splice_conn tc_splice[TCP_SPLICE_MAX_CONNS];
+#define CONN(index)			(&tc[(index)].splice)
+#define CONN_IDX(conn)			((union tcp_conn *)(conn) - tc)
 
 /* Display strings for connection events */
 static const char *tcp_splice_event_str[] __attribute((__unused__)) = {
@@ -248,43 +245,13 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
  * @c:		Execution context
  * @new:	New location of tcp_splice_conn
  */
-static void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new)
+void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new)
 {
 	tcp_splice_epoll_ctl(c, new);
 	if (tcp_splice_epoll_ctl(c, new))
 		conn_flag(c, new, CLOSING);
 }
 
-/**
- * tcp_table_splice_compact - Compact spliced connection table
- * @c:		Execution context
- * @hole:	Pointer to recently closed connection
- */
-static void tcp_table_splice_compact(struct ctx *c,
-				     struct tcp_splice_conn *hole)
-{
-	struct tcp_splice_conn *move;
-
-	if (CONN_IDX(hole) == --c->tcp.splice_conn_count) {
-		debug("TCP (spliced): index %li (max) removed", CONN_IDX(hole));
-		return;
-	}
-
-	move = CONN(c->tcp.splice_conn_count);
-
-	memcpy(hole, move, sizeof(*hole));
-
-	move->a = move->b = -1;
-	move->a_read = move->a_written = move->b_read = move->b_written = 0;
-	move->pipe_a_b[0] = move->pipe_a_b[1] = -1;
-	move->pipe_b_a[0] = move->pipe_b_a[1] = -1;
-	move->flags = move->events = 0;
-
-	debug("TCP (spliced): index %li moved to %li",
-	      CONN_IDX(move), CONN_IDX(hole));
-	tcp_splice_conn_update(c, hole);
-}
-
 /**
  * tcp_splice_destroy() - Close spliced connection and pipes, clear
  * @c:		Execution context
@@ -319,7 +286,8 @@ static void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn)
 	conn->flags = 0;
 	debug("TCP (spliced): index %li, CLOSED", CONN_IDX(conn));
 
-	tcp_table_splice_compact(c, conn);
+	c->tcp.splice_conn_count--;
+	tcp_table_compact(c, (union tcp_conn *)conn);
 }
 
 /**
@@ -553,7 +521,7 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 	if (ref.r.p.tcp.tcp.listen) {
 		int s;
 
-		if (c->tcp.splice_conn_count >= TCP_SPLICE_MAX_CONNS)
+		if (c->tcp.conn_count >= TCP_MAX_CONNS)
 			return;
 
 		if ((s = accept4(ref.r.s, NULL, NULL, SOCK_NONBLOCK)) < 0)
@@ -565,8 +533,9 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 			     s);
 		}
 
-		conn = CONN(c->tcp.splice_conn_count++);
+		conn = CONN(c->tcp.conn_count++);
 		conn->c.spliced = true;
+		c->tcp.splice_conn_count++;
 		conn->a = s;
 		conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0;
 
@@ -845,9 +814,10 @@ void tcp_splice_timer(struct ctx *c)
 {
 	struct tcp_splice_conn *conn;
 
-	for (conn = CONN(c->tcp.splice_conn_count - 1);
-	     conn >= tc_splice;
-	     conn--) {
+	for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) {
+		if (!conn->c.spliced)
+			continue;
+
 		if (conn->flags & CLOSING) {
 			tcp_splice_destroy(c, conn);
 			return;
@@ -890,12 +860,12 @@ void tcp_splice_defer_handler(struct ctx *c)
 	int max_files = c->nofile / 100 * TCP_SPLICE_FILE_PRESSURE;
 	struct tcp_splice_conn *conn;
 
-	if (c->tcp.splice_conn_count < MIN(max_files / 6, max_conns))
+	if (c->tcp.conn_count < MIN(max_files / 6, max_conns))
 		return;
 
-	for (conn = CONN(c->tcp.splice_conn_count - 1);
-	     conn >= tc_splice;
-	     conn--) {
+	for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) {
+		if (!conn->c.spliced)
+			continue;
 		if (conn->flags & CLOSING)
 			tcp_splice_destroy(c, conn);
 	}
-- 
2.38.1
These two functions each step through non-spliced and spliced connections respectively and clean up entries for closed connections. To avoid scanning the connection table twice, we merge these into a single function which scans the unified table and performs the appropriate sort of cleanup action on each one. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 20 +++++++++++--------- tcp_conn.h | 1 + tcp_splice.c | 24 +----------------------- tcp_splice.h | 1 - 4 files changed, 13 insertions(+), 33 deletions(-) diff --git a/tcp.c b/tcp.c index ffc030e..e6c05f3 100644 --- a/tcp.c +++ b/tcp.c @@ -1526,21 +1526,23 @@ void tcp_defer_handler(struct ctx *c) { int max_conns = c->tcp.conn_count / 100 * TCP_CONN_PRESSURE; int max_files = c->nofile / 100 * TCP_FILE_PRESSURE; - struct tcp_tap_conn *conn; + union tcp_conn *conn; tcp_l2_flags_buf_flush(c); tcp_l2_data_buf_flush(c); - tcp_splice_defer_handler(c); - - if (c->tcp.conn_count < MIN(max_files, max_conns)) + if ((c->tcp.conn_count < MIN(max_files, max_conns)) && + (c->tcp.splice_conn_count < MIN(max_files / 6, max_conns))) return; - for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) { - if (conn->c.spliced) - continue; - if (conn->events == CLOSED) - tcp_conn_destroy(c, conn); + for (conn = tc + c->tcp.conn_count - 1; conn >= tc; conn--) { + if (conn->c.spliced) { + if (conn->splice.flags & CLOSING) + tcp_splice_destroy(c, &conn->splice); + } else { + if (conn->tap.events == CLOSED) + tcp_conn_destroy(c, &conn->tap); + } } } diff --git a/tcp_conn.h b/tcp_conn.h index 4295f7d..634e259 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -200,5 +200,6 @@ extern union tcp_conn tc[]; void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new); void tcp_table_compact(struct ctx *c, union tcp_conn *hole); +void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn); #endif /* TCP_CONN_H */ diff --git a/tcp_splice.c b/tcp_splice.c index f12dc2b..8db7760 100644 --- a/tcp_splice.c 
+++ b/tcp_splice.c @@ -111,7 +111,6 @@ static void tcp_splice_conn_epoll_events(uint16_t events, *b |= (events & B_OUT_WAIT) ? EPOLLOUT : 0; } -static void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn); static int tcp_splice_epoll_ctl(const struct ctx *c, struct tcp_splice_conn *conn); @@ -257,7 +256,7 @@ void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new) * @c: Execution context * @conn: Connection pointer */ -static void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn) +void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn) { if (conn->events & SPLICE_ESTABLISHED) { /* Flushing might need to block: don't recycle them. */ @@ -849,24 +848,3 @@ void tcp_splice_timer(struct ctx *c) tcp_splice_pipe_refill(c); } - -/** - * tcp_splice_defer_handler() - Close connections without timer on file pressure - * @c: Execution context - */ -void tcp_splice_defer_handler(struct ctx *c) -{ - int max_conns = c->tcp.conn_count / 100 * TCP_SPLICE_CONN_PRESSURE; - int max_files = c->nofile / 100 * TCP_SPLICE_FILE_PRESSURE; - struct tcp_splice_conn *conn; - - if (c->tcp.conn_count < MIN(max_files / 6, max_conns)) - return; - - for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) { - if (!conn->c.spliced) - continue; - if (conn->flags & CLOSING) - tcp_splice_destroy(c, conn); - } -} diff --git a/tcp_splice.h b/tcp_splice.h index 63ffc68..b222600 100644 --- a/tcp_splice.h +++ b/tcp_splice.h @@ -15,6 +15,5 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref, void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn); void tcp_splice_init(struct ctx *c); void tcp_splice_timer(struct ctx *c); -void tcp_splice_defer_handler(struct ctx *c); #endif /* TCP_SPLICE_H */ -- 2.38.1
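The merged cleanup pass above destroys entries while it walks the table, which only works because the scan runs from the last entry down to the first and compaction fills the hole with the last live entry. A minimal sketch of that interaction follows. Every name in it (union conn, table_compact(), defer_scan(), TAP_CLOSED, SPLICE_CLOSING) is a hypothetical stand-in rather than passt's real definition, and it iterates by index instead of by pointer:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CONNS	8
#define TAP_CLOSED	0		/* hypothetical: tap events value for "closed" */
#define SPLICE_CLOSING	(1 << 0)	/* hypothetical: spliced "closing" flag */

/* Hypothetical stand-ins for the structures in tcp_conn.h */
struct conn_common { bool spliced :1; };
struct tap_conn { struct conn_common c; int events; };
struct splice_conn { struct conn_common c; int flags; };

union conn {
	struct conn_common c;
	struct tap_conn tap;
	struct splice_conn splice;
};

static union conn table[MAX_CONNS];
static int conn_count;

/* Append an entry of either type; returns its index */
static int conn_add(bool spliced, bool dead)
{
	union conn *conn;

	assert(conn_count < MAX_CONNS);
	conn = &table[conn_count];
	memset(conn, 0, sizeof(*conn));
	if (spliced) {
		conn->splice.c.spliced = true;
		conn->splice.flags = dead ? SPLICE_CLOSING : 0;
	} else {
		conn->tap.c.spliced = false;
		conn->tap.events = dead ? TAP_CLOSED : 1;
	}
	return conn_count++;
}

/* Fill the hole left by a destroyed entry with the last entry in the table */
static void table_compact(union conn *hole)
{
	union conn *last = &table[--conn_count];

	if (hole != last)
		memcpy(hole, last, sizeof(*hole));
	memset(last, 0, sizeof(*last));
}

/* Each connection type has its own notion of "ready to be destroyed" */
static bool conn_is_dead(const union conn *conn)
{
	if (conn->c.spliced)
		return conn->splice.flags & SPLICE_CLOSING;
	return conn->tap.events == TAP_CLOSED;
}

/* Single backwards pass over the unified table; returns entries destroyed.
 * Any entry moved into a hole comes from a higher index, so it has already
 * been visited and found live: nothing is skipped or checked twice. */
static int defer_scan(void)
{
	int i, destroyed = 0;

	for (i = conn_count - 1; i >= 0; i--) {
		if (conn_is_dead(&table[i])) {
			table_compact(&table[i]);
			destroyed++;
		}
	}
	return destroyed;
}
```

A forward scan would need to re-examine each slot after compaction, since the entry moved into the hole would not have been checked yet.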
These two functions scan all the non-spliced and spliced connections respectively and perform timed updates on them. Avoid scanning the now-unified table twice by having tcp_timer() scan it once, calling the relevant per-connection function for each entry. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 18 ++++++++--------- tcp_conn.h | 3 +++ tcp_splice.c | 57 +++++++++++++++++++++++----------------------------- tcp_splice.h | 1 - 4 files changed, 37 insertions(+), 42 deletions(-) diff --git a/tcp.c b/tcp.c index e6c05f3..b1ddc6f 100644 --- a/tcp.c +++ b/tcp.c @@ -3282,8 +3282,6 @@ int tcp_init(struct ctx *c) refill_arg.ns = 1; NS_CALL(tcp_sock_refill, &refill_arg); - - tcp_splice_timer(c); } return 0; @@ -3415,7 +3413,7 @@ static int tcp_port_rebind(void *arg) void tcp_timer(struct ctx *c, const struct timespec *ts) { struct tcp_sock_refill_arg refill_arg = { c, 0 }; - struct tcp_tap_conn *conn; + union tcp_conn *conn; (void)ts; @@ -3438,11 +3436,13 @@ void tcp_timer(struct ctx *c, const struct timespec *ts) } } - for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) { - if (conn->c.spliced) - continue; - if (conn->events == CLOSED) - tcp_conn_destroy(c, conn); + for (conn = tc + c->tcp.conn_count - 1; conn >= tc; conn--) { + if (conn->c.spliced) { + tcp_splice_timer(c, &conn->splice); + } else { + if (conn->tap.events == CLOSED) + tcp_conn_destroy(c, &conn->tap); + } } tcp_sock_refill(&refill_arg); @@ -3452,6 +3452,6 @@ void tcp_timer(struct ctx *c, const struct timespec *ts) (c->ifi6 && ns_sock_pool6[TCP_SOCK_POOL_TSH] < 0)) NS_CALL(tcp_sock_refill, &refill_arg); - tcp_splice_timer(c); + tcp_splice_pipe_refill(c); } } diff --git a/tcp_conn.h b/tcp_conn.h index 634e259..7c450a0 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -201,5 +201,8 @@ extern union tcp_conn tc[]; void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new); void tcp_table_compact(struct ctx *c, union tcp_conn *hole); void tcp_splice_destroy(struct ctx
*c, struct tcp_splice_conn *conn); +void tcp_splice_timer(struct ctx *c, struct tcp_splice_conn *conn); +void tcp_splice_pipe_refill(const struct ctx *c); + #endif /* TCP_CONN_H */ diff --git a/tcp_splice.c b/tcp_splice.c index 8db7760..7244c5d 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -766,7 +766,7 @@ smaller: * tcp_splice_pipe_refill() - Refill pool of pre-opened pipes * @c: Execution context */ -static void tcp_splice_pipe_refill(const struct ctx *c) +void tcp_splice_pipe_refill(const struct ctx *c) { int i; @@ -803,48 +803,41 @@ void tcp_splice_init(struct ctx *c) { memset(splice_pipe_pool, 0xff, sizeof(splice_pipe_pool)); tcp_set_pipe_size(c); + tcp_splice_pipe_refill(c); } /** * tcp_splice_timer() - Timer for spliced connections * @c: Execution context + * @conn: Spliced connection */ -void tcp_splice_timer(struct ctx *c) +void tcp_splice_timer(struct ctx *c, struct tcp_splice_conn *conn) { - struct tcp_splice_conn *conn; - - for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) { - if (!conn->c.spliced) - continue; - - if (conn->flags & CLOSING) { - tcp_splice_destroy(c, conn); - return; - } + if (conn->flags & CLOSING) { + tcp_splice_destroy(c, conn); + return; + } - if ( (conn->flags & RCVLOWAT_SET_A) && - !(conn->flags & RCVLOWAT_ACT_A)) { - if (setsockopt(conn->a, SOL_SOCKET, SO_RCVLOWAT, - &((int){ 1 }), sizeof(int))) { - trace("TCP (spliced): can't set SO_RCVLOWAT on " - "%i", conn->a); - } - conn_flag(c, conn, ~RCVLOWAT_SET_A); + if ( (conn->flags & RCVLOWAT_SET_A) && + !(conn->flags & RCVLOWAT_ACT_A)) { + if (setsockopt(conn->a, SOL_SOCKET, SO_RCVLOWAT, + &((int){ 1 }), sizeof(int))) { + trace("TCP (spliced): can't set SO_RCVLOWAT on " + "%i", conn->a); } + conn_flag(c, conn, ~RCVLOWAT_SET_A); + } - if ( (conn->flags & RCVLOWAT_SET_B) && - !(conn->flags & RCVLOWAT_ACT_B)) { - if (setsockopt(conn->b, SOL_SOCKET, SO_RCVLOWAT, - &((int){ 1 }), sizeof(int))) { - trace("TCP (spliced): can't set SO_RCVLOWAT on " - "%i", conn->b); - } - 
conn_flag(c, conn, ~RCVLOWAT_SET_B); + if ( (conn->flags & RCVLOWAT_SET_B) && + !(conn->flags & RCVLOWAT_ACT_B)) { + if (setsockopt(conn->b, SOL_SOCKET, SO_RCVLOWAT, + &((int){ 1 }), sizeof(int))) { + trace("TCP (spliced): can't set SO_RCVLOWAT on " + "%i", conn->b); } - - conn_flag(c, conn, ~RCVLOWAT_ACT_A); - conn_flag(c, conn, ~RCVLOWAT_ACT_B); + conn_flag(c, conn, ~RCVLOWAT_SET_B); } - tcp_splice_pipe_refill(c); + conn_flag(c, conn, ~RCVLOWAT_ACT_A); + conn_flag(c, conn, ~RCVLOWAT_ACT_B); } diff --git a/tcp_splice.h b/tcp_splice.h index b222600..c7895d2 100644 --- a/tcp_splice.h +++ b/tcp_splice.h @@ -14,6 +14,5 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref, uint32_t events); void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn); void tcp_splice_init(struct ctx *c); -void tcp_splice_timer(struct ctx *c); #endif /* TCP_SPLICE_H */ -- 2.38.1
There is very little in common between the tcp_tap_conn and tcp_splice_conn structures. However, both do have an IN_EPOLL flag which has the same meaning in each case, though it's stored in a different location. Simplify things slightly by moving this bit into the common header of the two structures. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 10 +++++----- tcp_conn.h | 20 ++++++++++---------- tcp_splice.c | 8 ++++---- 3 files changed, 19 insertions(+), 19 deletions(-) diff --git a/tcp.c b/tcp.c index b1ddc6f..c9bcbfb 100644 --- a/tcp.c +++ b/tcp.c @@ -429,8 +429,8 @@ static const char *tcp_state_str[] __attribute((__unused__)) = { }; static const char *tcp_flag_str[] __attribute((__unused__)) = { - "STALLED", "LOCAL", "WND_CLAMPED", "IN_EPOLL", "ACTIVE_CLOSE", - "ACK_TO_TAP_DUE", "ACK_FROM_TAP_DUE", + "STALLED", "LOCAL", "WND_CLAMPED", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE", + "ACK_FROM_TAP_DUE", }; /* Listening sockets, used for automatic port forwarding in pasta mode only */ @@ -660,14 +660,14 @@ static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn, */ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn) { - int m = (conn->flags & IN_EPOLL) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD; + int m = conn->c.in_epoll ? 
EPOLL_CTL_MOD : EPOLL_CTL_ADD; union epoll_ref ref = { .r.proto = IPPROTO_TCP, .r.s = conn->sock, .r.p.tcp.tcp.index = CONN_IDX(conn), .r.p.tcp.tcp.v6 = CONN_V6(conn) }; struct epoll_event ev = { .data.u64 = ref.u64 }; if (conn->events == CLOSED) { - if (conn->flags & IN_EPOLL) + if (conn->c.in_epoll) epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->sock, &ev); if (conn->timer != -1) epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->timer, &ev); @@ -679,7 +679,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (epoll_ctl(c->epollfd, m, conn->sock, &ev)) return -errno; - conn->flags |= IN_EPOLL; /* No need to log this */ + conn->c.in_epoll = true; if (conn->timer != -1) { union epoll_ref ref_t = { .r.proto = IPPROTO_TCP, diff --git a/tcp_conn.h b/tcp_conn.h index 7c450a0..faa63dc 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -14,9 +14,11 @@ /** * struct tcp_conn_common - Common fields for spliced and non-spliced * @spliced: Is this a spliced connection? + * @in_epoll: Is the connection in the epoll set? 
*/ struct tcp_conn_common { bool spliced :1; + bool in_epoll :1; }; extern const char *tcp_common_flag_str[]; @@ -90,10 +92,9 @@ struct tcp_tap_conn { #define STALLED BIT(0) #define LOCAL BIT(1) #define WND_CLAMPED BIT(2) -#define IN_EPOLL BIT(3) -#define ACTIVE_CLOSE BIT(4) -#define ACK_TO_TAP_DUE BIT(5) -#define ACK_FROM_TAP_DUE BIT(6) +#define ACTIVE_CLOSE BIT(3) +#define ACK_TO_TAP_DUE BIT(4) +#define ACK_FROM_TAP_DUE BIT(5) unsigned int hash_bucket :TCP_HASH_BUCKET_BITS; @@ -170,12 +171,11 @@ struct tcp_splice_conn { uint8_t flags; #define SPLICE_V6 BIT(0) -#define SPLICE_IN_EPOLL BIT(1) -#define RCVLOWAT_SET_A BIT(2) -#define RCVLOWAT_SET_B BIT(3) -#define RCVLOWAT_ACT_A BIT(4) -#define RCVLOWAT_ACT_B BIT(5) -#define CLOSING BIT(6) +#define RCVLOWAT_SET_A BIT(1) +#define RCVLOWAT_SET_B BIT(2) +#define RCVLOWAT_ACT_A BIT(3) +#define RCVLOWAT_ACT_B BIT(4) +#define CLOSING BIT(5) uint32_t a_read; uint32_t a_written; diff --git a/tcp_splice.c b/tcp_splice.c index 7244c5d..685aa18 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -83,8 +83,8 @@ static const char *tcp_splice_event_str[] __attribute((__unused__)) = { /* Display strings for connection flags */ static const char *tcp_splice_flag_str[] __attribute((__unused__)) = { - "SPLICE_V6", "SPLICE_IN_EPOLL", "RCVLOWAT_SET_A", "RCVLOWAT_SET_B", - "RCVLOWAT_ACT_A", "RCVLOWAT_ACT_B", "CLOSING", + "SPLICE_V6", "RCVLOWAT_SET_A", "RCVLOWAT_SET_B", "RCVLOWAT_ACT_A", + "RCVLOWAT_ACT_B", "CLOSING", }; /** @@ -164,7 +164,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn, static int tcp_splice_epoll_ctl(const struct ctx *c, struct tcp_splice_conn *conn) { - int m = (conn->flags & SPLICE_IN_EPOLL) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD; + int m = conn->c.in_epoll ? 
EPOLL_CTL_MOD : EPOLL_CTL_ADD; union epoll_ref ref_a = { .r.proto = IPPROTO_TCP, .r.s = conn->a, .r.p.tcp.tcp.splice = 1, .r.p.tcp.tcp.index = CONN_IDX(conn), @@ -188,7 +188,7 @@ static int tcp_splice_epoll_ctl(const struct ctx *c, epoll_ctl(c->epollfd, m, conn->b, &ev_b)) goto delete; - conn->flags |= SPLICE_IN_EPOLL; /* No need to log this */ + conn->c.in_epoll = true; return 0; -- 2.38.1
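Once the "in epoll" bit lives in the shared header, the choice between EPOLL_CTL_ADD and EPOLL_CTL_MOD no longer depends on the connection type. The sketch below illustrates the idea under stated assumptions: struct conn_common here is a simplified stand-in for the one in tcp_conn.h, and conn_epoll_update() is a hypothetical helper that does not exist in passt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical stand-ins for the structures in tcp_conn.h: both connection
 * types start with the same header, so the flag is found the same way. */
struct conn_common {
	bool spliced	:1;
	bool in_epoll	:1;
};

struct tap_conn {
	struct conn_common c;
	int sock;
};

struct splice_conn {
	struct conn_common c;
	int a, b;
};

/* Hypothetical helper: add @fd to the epoll set the first time it is seen,
 * modify its event mask afterwards.  Returns 0 on success, -1 on failure
 * with errno set by epoll_ctl().  A spliced connection would compute the
 * operation once and apply it to both of its sockets before setting the
 * flag; this sketch handles a single descriptor only. */
static int conn_epoll_update(int epollfd, struct conn_common *common,
			     int fd, uint32_t events)
{
	int op = common->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
	struct epoll_event ev = { .events = events, .data.fd = fd };

	if (epoll_ctl(epollfd, op, fd, &ev))
		return -1;

	common->in_epoll = true;
	return 0;
}
```

Calling the helper twice on the same descriptor succeeds both times only because the second call sees the flag and switches to EPOLL_CTL_MOD; a second EPOLL_CTL_ADD would fail with EEXIST.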
tcp_sock_init*() can create sockets listening either on the host or in the pasta network namespace (with @ns==1). There are, however, a number of differences in how these two cases work in practice. "ns" sockets are only used in pasta mode, and they only ever lead to spliced connections. The functions are also only ever called in "ns" mode with a NULL address and interface name, and it doesn't really make sense for them to be called any other way. Later changes will introduce further differences in behaviour between these two cases, so it makes more sense to use separate functions for creating the ns listening sockets and the regular external/host listening sockets. --- conf.c | 6 +-- tcp.c | 130 ++++++++++++++++++++++++++++++++++++++------------------- tcp.h | 4 +- 3 files changed, 92 insertions(+), 48 deletions(-) diff --git a/conf.c b/conf.c index 3ad247e..2b39d18 100644 --- a/conf.c +++ b/conf.c @@ -209,7 +209,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg, for (i = 0; i < PORT_EPHEMERAL_MIN; i++) { if (optname == 't') - tcp_sock_init(c, 0, AF_UNSPEC, NULL, NULL, i); + tcp_sock_init(c, AF_UNSPEC, NULL, NULL, i); else if (optname == 'u') udp_sock_init(c, 0, AF_UNSPEC, NULL, NULL, i); } @@ -287,7 +287,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg, bitmap_set(fwd->map, i); if (optname == 't') - tcp_sock_init(c, 0, af, addr, ifname, i); + tcp_sock_init(c, af, addr, ifname, i); else if (optname == 'u') udp_sock_init(c, 0, af, addr, ifname, i); } @@ -333,7 +333,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg, fwd->delta[i] = mapped_range.first - orig_range.first; if (optname == 't') - tcp_sock_init(c, 0, af, addr, ifname, i); + tcp_sock_init(c, af, addr, ifname, i); else if (optname == 'u') udp_sock_init(c, 0, af, addr, ifname, i); } diff --git a/tcp.c b/tcp.c index c9bcbfb..47c025b 100644 --- a/tcp.c +++ b/tcp.c @@ -2986,15 +2986,15 @@ void
tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events, /** * tcp_sock_init4() - Initialise listening sockets for a given IPv4 port * @c: Execution context - * @ns: In pasta mode, if set, bind with loopback address in namespace * @addr: Pointer to address for binding, NULL if not configured * @ifname: Name of interface to bind to, NULL if not configured * @port: Port, host order */ -static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *addr, +static void tcp_sock_init4(const struct ctx *c, const struct in_addr *addr, const char *ifname, in_port_t port) { - union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = ns }; + in_port_t idx = port + c->tcp.fwd_in.delta[port]; + union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.index = idx }; bool spliced = false, tap = true; int s; @@ -3005,14 +3005,9 @@ static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *ad if (!addr) addr = &c->ip4.addr; - tap = !ns && !IN4_IS_ADDR_LOOPBACK(addr); + tap = !IN4_IS_ADDR_LOOPBACK(addr); } - if (ns) - tref.tcp.index = (in_port_t)(port + c->tcp.fwd_out.delta[port]); - else - tref.tcp.index = (in_port_t)(port + c->tcp.fwd_in.delta[port]); - if (tap) { s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port, tref.u32); @@ -3038,29 +3033,25 @@ static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *ad else s = -1; - if (c->tcp.fwd_out.mode == FWD_AUTO) { - if (ns) - tcp_sock_ns[port][V4] = s; - else - tcp_sock_init_lo[port][V4] = s; - } + if (c->tcp.fwd_out.mode == FWD_AUTO) + tcp_sock_init_lo[port][V4] = s; } } /** * tcp_sock_init6() - Initialise listening sockets for a given IPv6 port * @c: Execution context - * @ns: In pasta mode, if set, bind with loopback address in namespace * @addr: Pointer to address for binding, NULL if not configured * @ifname: Name of interface to bind to, NULL if not configured * @port: Port, host order */ -static void tcp_sock_init6(const struct ctx *c, int ns, +static void 
tcp_sock_init6(const struct ctx *c, const struct in6_addr *addr, const char *ifname, in_port_t port) { - union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = ns, - .tcp.v6 = 1 }; + in_port_t idx = port + c->tcp.fwd_in.delta[port]; + union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.v6 = 1, + .tcp.index = idx }; bool spliced = false, tap = true; int s; @@ -3072,14 +3063,9 @@ static void tcp_sock_init6(const struct ctx *c, int ns, if (!addr) addr = &c->ip6.addr; - tap = !ns && !IN6_IS_ADDR_LOOPBACK(addr); + tap = !IN6_IS_ADDR_LOOPBACK(addr); } - if (ns) - tref.tcp.index = (in_port_t)(port + c->tcp.fwd_out.delta[port]); - else - tref.tcp.index = (in_port_t)(port + c->tcp.fwd_in.delta[port]); - if (tap) { s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port, tref.u32); @@ -3104,40 +3090,99 @@ static void tcp_sock_init6(const struct ctx *c, int ns, else s = -1; - if (c->tcp.fwd_out.mode == FWD_AUTO) { - if (ns) - tcp_sock_ns[port][V6] = s; - else - tcp_sock_init_lo[port][V6] = s; - } + if (c->tcp.fwd_out.mode == FWD_AUTO) + tcp_sock_init_lo[port][V6] = s; } } /** * tcp_sock_init() - Initialise listening sockets for a given port * @c: Execution context - * @ns: In pasta mode, if set, bind with loopback address in namespace * @af: Address family to select a specific IP version, or AF_UNSPEC * @addr: Pointer to address for binding, NULL if not configured * @ifname: Name of interface to bind to, NULL if not configured * @port: Port, host order */ -void tcp_sock_init(const struct ctx *c, int ns, sa_family_t af, - const void *addr, const char *ifname, in_port_t port) +void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr, + const char *ifname, in_port_t port) { if ((af == AF_INET || af == AF_UNSPEC) && c->ifi4) - tcp_sock_init4(c, ns, addr, ifname, port); + tcp_sock_init4(c, addr, ifname, port); if ((af == AF_INET6 || af == AF_UNSPEC) && c->ifi6) - tcp_sock_init6(c, ns, addr, ifname, port); + tcp_sock_init6(c, addr, ifname, port); +} + +/** + 
* tcp_ns_sock_init4() - Init socket to listen for outbound IPv4 connections + * @c: Execution context + * @port: Port, host order + */ +static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port) +{ + in_port_t idx = port + c->tcp.fwd_out.delta[port]; + union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1, + .tcp.splice = 1, .tcp.index = idx }; + struct in_addr loopback = { htonl(INADDR_LOOPBACK) }; + int s; + + assert(c->mode == MODE_PASTA); + + s = sock_l4(c, AF_INET, IPPROTO_TCP, &loopback, NULL, port, tref.u32); + if (s >= 0) + tcp_sock_set_bufsize(c, s); + else + s = -1; + + if (c->tcp.fwd_out.mode == FWD_AUTO) + tcp_sock_ns[port][V4] = s; } /** - * tcp_sock_init_ns() - Bind sockets in namespace for outbound connections + * tcp_ns_sock_init6() - Init socket to listen for outbound IPv6 connections + * @c: Execution context + * @port: Port, host order + */ +static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port) +{ + in_port_t idx = port + c->tcp.fwd_out.delta[port]; + union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1, + .tcp.splice = 1, .tcp.v6 = 1, + .tcp.index = idx}; + int s; + + assert(c->mode == MODE_PASTA); + + s = sock_l4(c, AF_INET6, IPPROTO_TCP, &in6addr_loopback, NULL, port, + tref.u32); + if (s >= 0) + tcp_sock_set_bufsize(c, s); + else + s = -1; + + if (c->tcp.fwd_out.mode == FWD_AUTO) + tcp_sock_ns[port][V6] = s; +} + +/** + * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections + * @c: Execution context + * @port: Port, host order + */ +void tcp_ns_sock_init(const struct ctx *c, in_port_t port) +{ + if (c->ifi4) + tcp_ns_sock_init4(c, port); + if (c->ifi6) + tcp_ns_sock_init6(c, port); +} + +/** + * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections * @arg: Execution context * * Return: 0 */ -static int tcp_sock_init_ns(void *arg) +static int tcp_ns_socks_init(void *arg) { struct ctx *c = (struct ctx *)arg; unsigned port; @@ -3148,7 +3193,7 @@ static int 
tcp_sock_init_ns(void *arg) if (!bitmap_isset(c->tcp.fwd_out.map, port)) continue; - tcp_sock_init(c, 1, AF_UNSPEC, NULL, NULL, port); + tcp_ns_sock_init(c, port); } return 0; @@ -3278,7 +3323,7 @@ int tcp_init(struct ctx *c) if (c->mode == MODE_PASTA) { tcp_splice_init(c); - NS_CALL(tcp_sock_init_ns, c); + NS_CALL(tcp_ns_socks_init, c); refill_arg.ns = 1; NS_CALL(tcp_sock_refill, &refill_arg); @@ -3363,8 +3408,7 @@ static int tcp_port_rebind(void *arg) if ((a->c->ifi4 && tcp_sock_ns[port][V4] == -1) || (a->c->ifi6 && tcp_sock_ns[port][V6] == -1)) - tcp_sock_init(a->c, 1, AF_UNSPEC, NULL, NULL, - port); + tcp_ns_sock_init(a->c, port); } } else { for (port = 0; port < NUM_PORTS; port++) { @@ -3397,7 +3441,7 @@ static int tcp_port_rebind(void *arg) if ((a->c->ifi4 && tcp_sock_init_ext[port][V4] == -1) || (a->c->ifi6 && tcp_sock_init_ext[port][V6] == -1)) - tcp_sock_init(a->c, 0, AF_UNSPEC, NULL, NULL, + tcp_sock_init(a->c, AF_UNSPEC, NULL, NULL, port); } } diff --git a/tcp.h b/tcp.h index 49738ef..f4ed298 100644 --- a/tcp.h +++ b/tcp.h @@ -19,8 +19,8 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events, const struct timespec *now); int tcp_tap_handler(struct ctx *c, int af, const void *addr, const struct pool *p, const struct timespec *now); -void tcp_sock_init(const struct ctx *c, int ns, sa_family_t af, - const void *addr, const char *ifname, in_port_t port); +void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr, + const char *ifname, in_port_t port); int tcp_init(struct ctx *c); void tcp_timer(struct ctx *c, const struct timespec *ts); void tcp_defer_handler(struct ctx *c); -- 2.38.1
In tcp_sock_handler() we split off to handle spliced sockets before checking anything else. However the first steps of the "new connection" path for each case are the same: allocate a connection entry and accept() the connection. Remove this duplication by making tcp_conn_from_sock() handle both spliced and non-spliced cases, with help from more specific tcp_tap_conn_from_sock and tcp_splice_conn_from_sock functions for the later stages which differ. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- tcp.c | 68 ++++++++++++++++++++++++++++++++++------------------ tcp_splice.c | 58 +++++++++++++++++++++++--------------------- tcp_splice.h | 2 ++ 3 files changed, 78 insertions(+), 50 deletions(-) diff --git a/tcp.c b/tcp.c index 47c025b..e7bfc8c 100644 --- a/tcp.c +++ b/tcp.c @@ -2752,28 +2752,19 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn) } /** - * tcp_conn_from_sock() - Handle new connection request from listening socket + * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context * @ref: epoll reference of listening socket + * @conn: connection structure to initialize + * @s: Accepted socket + * @sa: Peer socket address (from accept()) * @now: Current timestamp */ -static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref, - const struct timespec *now) +static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref, + struct tcp_tap_conn *conn, int s, + struct sockaddr *sa, + const struct timespec *now) { - struct sockaddr_storage sa; - struct tcp_tap_conn *conn; - socklen_t sl; - int s; - - if (c->tcp.conn_count >= TCP_MAX_CONNS) - return; - - sl = sizeof(sa); - s = accept4(ref.r.s, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK); - if (s < 0) - return; - - conn = CONN(c->tcp.conn_count++); conn->c.spliced = false; conn->sock = s; conn->timer = -1; @@ -2783,7 +2774,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref, if (ref.r.p.tcp.tcp.v6) { struct 
sockaddr_in6 sa6; - memcpy(&sa6, &sa, sizeof(sa6)); + memcpy(&sa6, sa, sizeof(sa6)); if (IN6_IS_ADDR_LOOPBACK(&sa6.sin6_addr) || IN6_ARE_ADDR_EQUAL(&sa6.sin6_addr, &c->ip6.addr_seen) || @@ -2812,7 +2803,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref, } else { struct sockaddr_in sa4; - memcpy(&sa4, &sa, sizeof(sa4)); + memcpy(&sa4, sa, sizeof(sa4)); memset(&conn->a.a4.zero, 0, sizeof(conn->a.a4.zero)); memset(&conn->a.a4.one, 0xff, sizeof(conn->a.a4.one)); @@ -2845,6 +2836,37 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref, tcp_get_sndbuf(conn); } +/** + * tcp_conn_from_sock() - Handle new connection request from listening socket + * @c: Execution context + * @ref: epoll reference of listening socket + * @now: Current timestamp + */ +static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref, + const struct timespec *now) +{ + struct sockaddr_storage sa; + union tcp_conn *conn; + socklen_t sl; + int s; + + if (c->tcp.conn_count >= TCP_MAX_CONNS) + return; + + sl = sizeof(sa); + s = accept4(ref.r.s, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK); + if (s < 0) + return; + + conn = tc + c->tcp.conn_count++; + + if (ref.r.p.tcp.tcp.splice) + tcp_splice_conn_from_sock(c, ref, &conn->splice, s); + else + tcp_tap_conn_from_sock(c, ref, &conn->tap, s, + (struct sockaddr *)&sa, now); +} + /** * tcp_timer_handler() - timerfd events: close, send ACK, retransmit, or reset * @c: Execution context @@ -2924,13 +2946,13 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events, return; } - if (ref.r.p.tcp.tcp.splice) { - tcp_sock_handler_splice(c, ref, events); + if (ref.r.p.tcp.tcp.listen) { + tcp_conn_from_sock(c, ref, now); return; } - if (ref.r.p.tcp.tcp.listen) { - tcp_conn_from_sock(c, ref, now); + if (ref.r.p.tcp.tcp.splice) { + tcp_sock_handler_splice(c, ref, events); return; } diff --git a/tcp_splice.c b/tcp_splice.c index 685aa18..4f6fa40 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -501,6 +501,36 @@ 
static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock, *pipes = *from == conn->a ? conn->pipe_a_b : conn->pipe_b_a; } +/** + * tcp_splice_conn_from_sock() - Initialize state for spliced connection + * @c: Execution context + * @ref: epoll reference of listening socket + * @conn: connection structure to initialize + * @s: Accepted socket + * + * #syscalls:pasta setsockopt + */ +void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref, + struct tcp_splice_conn *conn, int s) +{ + assert(c->mode == MODE_PASTA); + + if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), + sizeof(int))) { + trace("TCP (spliced): failed to set TCP_QUICKACK on %i", + s); + } + + conn->c.spliced = true; + c->tcp.splice_conn_count++; + conn->a = s; + conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0; + + if (tcp_splice_new(c, conn, ref.r.p.tcp.tcp.index, + ref.r.p.tcp.tcp.outbound)) + conn_flag(c, conn, CLOSING); +} + /** * tcp_sock_handler_splice() - Handler for socket mapped to spliced connection * @c: Execution context @@ -517,33 +547,7 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref, uint32_t *seq_read, *seq_write; struct tcp_splice_conn *conn; - if (ref.r.p.tcp.tcp.listen) { - int s; - - if (c->tcp.conn_count >= TCP_MAX_CONNS) - return; - - if ((s = accept4(ref.r.s, NULL, NULL, SOCK_NONBLOCK)) < 0) - return; - - if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), - sizeof(int))) { - trace("TCP (spliced): failed to set TCP_QUICKACK on %i", - s); - } - - conn = CONN(c->tcp.conn_count++); - conn->c.spliced = true; - c->tcp.splice_conn_count++; - conn->a = s; - conn->flags = ref.r.p.tcp.tcp.v6 ? 
SPLICE_V6 : 0; - - if (tcp_splice_new(c, conn, ref.r.p.tcp.tcp.index, - ref.r.p.tcp.tcp.outbound)) - conn_flag(c, conn, CLOSING); - - return; - } + assert(!ref.r.p.tcp.tcp.listen); conn = CONN(ref.r.p.tcp.tcp.index); diff --git a/tcp_splice.h b/tcp_splice.h index c7895d2..053221e 100644 --- a/tcp_splice.h +++ b/tcp_splice.h @@ -12,6 +12,8 @@ struct tcp_splice_conn; void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref, uint32_t events); +void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref, + struct tcp_splice_conn *conn, int s); void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn); void tcp_splice_init(struct ctx *c); -- 2.38.1
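The shape of the refactoring in this patch is: one common entry point checks for table space and claims a slot, then hands the slot to a type-specific initializer. A minimal sketch of that shape follows, with no real sockets involved. All names are hypothetical stand-ins, and the already-accepted descriptor is passed in as a plain integer where the real code calls accept4().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CONNS	4

/* Hypothetical stand-ins for the structures in tcp_conn.h */
struct conn_common { bool spliced :1; };
struct tap_conn { struct conn_common c; int sock; int timer; };
struct splice_conn { struct conn_common c; int a; int b; };

union conn {
	struct conn_common c;
	struct tap_conn tap;
	struct splice_conn splice;
};

static union conn table[MAX_CONNS];
static int conn_count, splice_conn_count;

/* Later, type-specific stage for a non-spliced connection */
static void tap_conn_init(struct tap_conn *conn, int s)
{
	conn->c.spliced = false;
	conn->sock = s;
	conn->timer = -1;
}

/* Later, type-specific stage for a spliced connection: it also maintains
 * the separate count of spliced entries */
static void splice_conn_init(struct splice_conn *conn, int s)
{
	conn->c.spliced = true;
	splice_conn_count++;
	conn->a = s;
	conn->b = -1;
}

/* Common first stage: check for room, claim the next slot, then dispatch.
 * @s stands in for the descriptor the real code gets from accept4().
 * Returns the new entry, or NULL if the table is full. */
static union conn *conn_from_sock(bool listener_is_splice, int s)
{
	union conn *conn;

	if (conn_count >= MAX_CONNS)
		return NULL;

	conn = &table[conn_count++];

	if (listener_is_splice)
		splice_conn_init(&conn->splice, s);
	else
		tap_conn_init(&conn->tap, s);

	return conn;
}
```

Because the slot is claimed before the type is known, the dispatch condition can later be replaced by any per-connection test without touching the common stage.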
In pasta mode, tcp_sock_init[46]() create separate sockets to listen for
spliced connections (these are bound to localhost) and non-spliced
connections (these are bound to the host address).  This introduces a
subtle behavioural difference between pasta and passt: by default, pasta
will listen only on a single host address, whereas passt will listen on
all addresses (0.0.0.0 or ::).  This also prevents us using some additional
optimizations that only work with the unspecified (0.0.0.0 or ::) address.

However, it turns out we don't need to do this.  We can splice a
connection if and only if it originates from the loopback address.
Currently we ensure this by having the "spliced" listening sockets
listening only on loopback.  However, we can defer the decision about
whether to splice a connection until after accept(), by checking if the
connection was made from localhost.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tcp.c | 131 +++++++++++++++++-----------------------------------------
 1 file changed, 39 insertions(+), 92 deletions(-)

diff --git a/tcp.c b/tcp.c
index e7bfc8c..ac70d4e 100644
--- a/tcp.c
+++ b/tcp.c
@@ -434,7 +434,6 @@ static const char *tcp_flag_str[] __attribute((__unused__)) = {
 };
 
 /* Listening sockets, used for automatic port forwarding in pasta mode only */
-static int tcp_sock_init_lo	[NUM_PORTS][IP_VERSIONS];
 static int tcp_sock_init_ext	[NUM_PORTS][IP_VERSIONS];
 static int tcp_sock_ns		[NUM_PORTS][IP_VERSIONS];
 
@@ -2847,9 +2846,13 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 {
 	struct sockaddr_storage sa;
 	union tcp_conn *conn;
+	bool can_splice = false;
 	socklen_t sl;
 	int s;
 
+	assert(ref.r.p.tcp.tcp.listen);
+	assert(!ref.r.p.tcp.tcp.splice);
+
 	if (c->tcp.conn_count >= TCP_MAX_CONNS)
 		return;
 
@@ -2860,7 +2863,25 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 
 	conn = tc + c->tcp.conn_count++;
 
-	if (ref.r.p.tcp.tcp.splice)
+	if (c->mode == MODE_PASTA) {
+		if (ref.r.p.tcp.tcp.v6) {
+			const struct sockaddr_in6 *sa6
+				= (const struct sockaddr_in6 *)&sa;
+			/* clang-tidy doesn't realize accept() initializes sa/sa6 */
+			/* NOLINTNEXTLINE(clang-analyzer-core.UndefinedBinaryOperatorResult) */
+			if (IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
+				can_splice = true;
+		} else {
+			const struct sockaddr_in *sa4 =
+				(const struct sockaddr_in *)&sa;
+			/* clang-tidy doesn't realize accept() initializes sa/sa4 */
+			/* NOLINTNEXTLINE(clang-analyzer-core.CallAndMessage) */
+			if (htonl(sa4->sin_addr.s_addr) == INADDR_LOOPBACK)
+				can_splice = true;
+		}
+	}
+
+	if (can_splice)
 		tcp_splice_conn_from_sock(c, ref, &conn->splice, s);
 	else
 		tcp_tap_conn_from_sock(c, ref, &conn->tap, s,
@@ -3017,47 +3038,16 @@ static void tcp_sock_init4(const struct ctx *c, const struct in_addr *addr,
 {
 	in_port_t idx = port + c->tcp.fwd_in.delta[port];
 	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.index = idx };
-	bool spliced = false, tap = true;
 	int s;
 
-	if (c->mode == MODE_PASTA) {
-		spliced = !addr || IN4_IS_ADDR_UNSPECIFIED(addr) ||
-			  IN4_IS_ADDR_LOOPBACK(addr);
-
-		if (!addr)
-			addr = &c->ip4.addr;
-
-		tap = !IN4_IS_ADDR_LOOPBACK(addr);
-	}
-
-	if (tap) {
-		s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port,
-			    tref.u32);
-		if (s >= 0)
-			tcp_sock_set_bufsize(c, s);
-		else
-			s = -1;
-
-		if (c->tcp.fwd_in.mode == FWD_AUTO)
-			tcp_sock_init_ext[port][V4] = s;
-	}
-
-	if (spliced) {
-		struct in_addr loopback = { htonl(INADDR_LOOPBACK) };
-		tref.tcp.splice = 1;
-
-		addr = &loopback;
-
-		s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port,
-			    tref.u32);
-		if (s >= 0)
-			tcp_sock_set_bufsize(c, s);
-		else
-			s = -1;
+	s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port, tref.u32);
+	if (s >= 0)
+		tcp_sock_set_bufsize(c, s);
+	else
+		s = -1;
 
-		if (c->tcp.fwd_out.mode == FWD_AUTO)
-			tcp_sock_init_lo[port][V4] = s;
-	}
+	if (c->tcp.fwd_in.mode == FWD_AUTO)
+		tcp_sock_init_ext[port][V4] = s;
 }
 
 /**
@@ -3074,47 +3064,16 @@ static void tcp_sock_init6(const struct ctx *c,
 	in_port_t idx = port + c->tcp.fwd_in.delta[port];
 	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.v6 = 1,
 				     .tcp.index = idx };
-	bool spliced = false, tap = true;
 	int s;
 
-	if (c->mode == MODE_PASTA) {
-		spliced = !addr ||
-			  IN6_IS_ADDR_UNSPECIFIED(addr) ||
-			  IN6_IS_ADDR_LOOPBACK(addr);
-
-		if (!addr)
-			addr = &c->ip6.addr;
-
-		tap = !IN6_IS_ADDR_LOOPBACK(addr);
-	}
-
-	if (tap) {
-		s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port,
-			    tref.u32);
-		if (s >= 0)
-			tcp_sock_set_bufsize(c, s);
-		else
-			s = -1;
-
-		if (c->tcp.fwd_in.mode == FWD_AUTO)
-			tcp_sock_init_ext[port][V6] = s;
-	}
-
-	if (spliced) {
-		tref.tcp.splice = 1;
-
-		addr = &in6addr_loopback;
-
-		s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port,
-			    tref.u32);
-		if (s >= 0)
-			tcp_sock_set_bufsize(c, s);
-		else
-			s = -1;
+	s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port, tref.u32);
+	if (s >= 0)
+		tcp_sock_set_bufsize(c, s);
+	else
+		s = -1;
 
-		if (c->tcp.fwd_out.mode == FWD_AUTO)
-			tcp_sock_init_lo[port][V6] = s;
-	}
+	if (c->tcp.fwd_in.mode == FWD_AUTO)
+		tcp_sock_init_ext[port][V6] = s;
 }
 
 /**
@@ -3143,7 +3102,7 @@ static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port)
 {
 	in_port_t idx = port + c->tcp.fwd_out.delta[port];
 	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
-				     .tcp.splice = 1, .tcp.index = idx };
+				     .tcp.index = idx };
 	struct in_addr loopback = { htonl(INADDR_LOOPBACK) };
 	int s;
 
@@ -3168,8 +3127,7 @@ static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
 {
 	in_port_t idx = port + c->tcp.fwd_out.delta[port];
 	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
-				     .tcp.splice = 1, .tcp.v6 = 1,
-				     .tcp.index = idx};
+				     .tcp.v6 = 1, .tcp.index = idx};
 	int s;
 
 	assert(c->mode == MODE_PASTA);
@@ -3336,7 +3294,6 @@ int tcp_init(struct ctx *c)
 	memset(init_sock_pool6,		0xff,	sizeof(init_sock_pool6));
 	memset(ns_sock_pool4,		0xff,	sizeof(ns_sock_pool4));
 	memset(ns_sock_pool6,		0xff,	sizeof(ns_sock_pool6));
-	memset(tcp_sock_init_lo,	0xff,	sizeof(tcp_sock_init_lo));
 	memset(tcp_sock_init_ext,	0xff,	sizeof(tcp_sock_init_ext));
 	memset(tcp_sock_ns,		0xff,	sizeof(tcp_sock_ns));
 
@@ -3444,16 +3401,6 @@ static int tcp_port_rebind(void *arg)
 				close(tcp_sock_init_ext[port][V6]);
 				tcp_sock_init_ext[port][V6] = -1;
 			}
-
-			if (tcp_sock_init_lo[port][V4] >= 0) {
-				close(tcp_sock_init_lo[port][V4]);
-				tcp_sock_init_lo[port][V4] = -1;
-			}
-
-			if (tcp_sock_init_lo[port][V6] >= 0) {
-				close(tcp_sock_init_lo[port][V6]);
-				tcp_sock_init_lo[port][V6] = -1;
-			}
 			continue;
 		}
 
-- 
2.38.1
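
For readers following the thread, the post-accept() test described in the commit message can be sketched in isolation. This is only an illustration under stated assumptions, not code from the series: peer_is_loopback() is a hypothetical helper name, and it picks the address family from ss_family, whereas the patch relies on the v6 flag stored in the listening socket's epoll reference. It writes ntohl() for the IPv4 comparison; the patch writes htonl(), which performs the same byte swap.

```c
#include <assert.h>
#include <stdbool.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper (not in the patch): report whether the peer address
 * filled in by accept() is the loopback address, i.e. whether the
 * connection would be a candidate for splicing.
 */
static bool peer_is_loopback(const struct sockaddr_storage *sa)
{
	assert(sa);

	if (sa->ss_family == AF_INET6) {
		const struct sockaddr_in6 *sa6 =
			(const struct sockaddr_in6 *)sa;

		return IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr);
	}

	if (sa->ss_family == AF_INET) {
		const struct sockaddr_in *sa4 =
			(const struct sockaddr_in *)sa;

		/* Like the patch, match 127.0.0.1 only, not all of 127.0.0.0/8 */
		return ntohl(sa4->sin_addr.s_addr) == INADDR_LOOPBACK;
	}

	/* Any other address family is never treated as spliceable here */
	return false;
}
```

A caller would pass the sockaddr_storage it handed to accept() and choose the spliced or tap path from the result, which is roughly the role can_splice plays in tcp_conn_from_sock().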

On Mon, 14 Nov 2022 17:16:57 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> We can splice TCP connections in pasta mode if and only if they
> originate from localhost. Currently we separate the two cases by
> having separate listening sockets: one listens on the host address
> for non-spliceable connections, the other listens on the loopback
> address for spliceable connections.

I couldn't finish reviewing the whole series in detail yet, but I had a
look at all the patches and, except for a couple of minor style issues
(those are the ones I still need to finish checking), I couldn't see
any flaw -- all the single patches look fine to me.

-- 
Stefano

On Tue, Nov 15, 2022 at 02:22:41AM +0100, Stefano Brivio wrote:
> On Mon, 14 Nov 2022 17:16:57 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>
> > We can splice TCP connections in pasta mode if and only if they
> > originate from localhost. Currently we separate the two cases by
> > having separate listening sockets: one listens on the host address
> > for non-spliceable connections, the other listens on the loopback
> > address for spliceable connections.
>
> I couldn't finish reviewing the whole series in detail yet, but I had a
> look at all the patches and, except for a couple of minor style issues
> (those are the ones I still need to finish checking), I couldn't see
> any flaw -- all the single patches look fine to me.

Ok.  Probably not worth looking deeper at this stage.  I'm pretty close
to sending a new series which contains a slightly revised version of
this, plus the rest of the stuff necessary to do dual stack TCP
sockets.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson