November 2022 - dev

[PATCH 0/4] udp: Fix some confusion of IPv4 and IPv6 control structures
by David Gibson 06 Dec '22

06 Dec '22

It turns out a couple of places on the IPv4 specific inbound path accidentally use control structures that are supposed to be for IPv6. That could lead to weird behaviour in a rather complex set of circumstances. Path 1/4 here is the actual fix, the rest makes some clean ups to the code that should make similar mistakes harder errors harder to commit in future. This is based on my earlier cleanup of the UDP splicing code, although I think it will rebase trivially. David Gibson (4): udp: Fix inorrect use of IPv6 mh buffers in IPv4 path udp: Better factor IPv4 and IPv6 paths in udp_sock_handler() udp: Preadjust udp[46]_l2_iov_tap[].iov_base for pasta mode udp: Factor out control structure management from udp_sock_fill_data_v[46] udp.c | 184 ++++++++++++++++++++++++++-------------------------------- 1 file changed, 81 insertions(+), 103 deletions(-) -- 2.38.1

2 7

[PATCH v4 00/16] Simplify and correct handling of "spliced" UDP forwarding
by David Gibson 06 Dec '22

06 Dec '22

The UDP "splicing" (forwarding packets from one L4 socket to another, rather than via the tuntap device) code assumes that any given UDP port in the init namespace will only communicate with a single port on the ns side at a time, and vice versa. This will often be the case, but since UDP is a connectionless protocol, it need not be. In fact it is not the case in our existing UDP bandwidth checks, although the specific configuration there means it's not harmful in that case. The failure mode in this case can be quite bad: we don't just fall back to an unoptimized oath, or drop packets, we will misdirect packets to the wrong destination. This series make some substantial simplifications to how we handle the splice forwarding, then corrects it to handle the case of multiple source ports sending to a single destination. This does come at a performance cost. It's not as large as I feared, and shouldn't affect the most common case where there is a 1 to 1 mapping between source and destination ports. I haven't yet been able to confirm the latter because the iperf3 bandwidth test we use *does* have interleaved streams with a common destination port. Based on the earlier series for dual stack TCP sockets. Changes since v3: * Changed interface of udp_splice_sendfrom() to slightly better separate concerns and to make some future cleanups simpler * Fixed a serious buffer overrun bug where we weren't bounds checking as we scanned for additional datagrams with the same source address. Changes since v2: * Minor style and comment revisions Changes since v1: * Added patches 12..16/16 fixing the delivery of packets, as well as just simplifying the mechanics David Gibson (16): udp: Also bind() connected ports for "splice" forwarding udp: Separate tracking of inbound and outbound packet flows udp: Always use sendto() rather than send() for forwarding spliced packets udp: Don't connect "forward" sockets for spliced flows udp: Remove the @bound field from union udp_epoll_ref udp: Split splice field in udp_epoll_ref into (mostly) independent bits udp: Don't create double sockets for -U port udp: Re-use fixed bound sockets for packet forwarding when possible udp: Don't explicitly track originating socket for spliced "connections" udp: Update UDP "connection" timestamps in both directions udp: Simplify udp_sock_handler_splice udp: Make UDP_SPLICE_FRAMES and UDP_TAP_FRAMES_MEM the same thing udp: Add helper to extract port from a sockaddr_in or sockaddr_in6 udp: Unify buffers for tap and splice paths udp: Split send half of udp_sock_handler_splice() from the receive half udp: Correct splice forwarding when receiving from multiple sources passt.h | 2 + udp.c | 522 ++++++++++++++++++++++++++------------------------------ udp.h | 16 +- 3 files changed, 248 insertions(+), 292 deletions(-) -- 2.38.1

2 17

[PATCH v2 00/16] Simplify and correct handling of "spliced" UDP forwarding
by David Gibson 01 Dec '22

01 Dec '22

The UDP "splicing" (forwarding packets from one L4 socket to another, rather than via the tuntap device) code assumes that any given UDP port in the init namespace will only communicate with a single port on the ns side at a time, and vice versa. This will often be the case, but since UDP is a connectionless protocol, it need not be. In fact it is not the case in our existing UDP bandwidth checks, although the specific configuration there means it's not harmful in that case. The failure mode in this case can be quite bad: we don't just fall back to an unoptimized oath, or drop packets, we will misdirect packets to the wrong destination. This series make some substantial simplifications to how we handle the splice forwarding, then corrects it to handle the case of multiple source ports sending to a single destination. This does come at a performance cost. It's not as large as I feared, and shouldn't affect the most common case where there is a 1 to 1 mapping between source and destination ports. I haven't yet been able to confirm the latter because the iperf3 bandwidth test we use *does* have interleaved streams with a common destination port. Based on the earlier series for dual stack TCP sockets. Changes since v1: * Added patches 12..16/16 fixing the delivery of packets, as well as just simplifying the mechanics David Gibson (16): udp: Also bind() connected ports for "splice" forwarding udp: Separate tracking of inbound and outbound packet flows udp: Always use sendto() rather than send() for forwarding spliced packets udp: Don't connect "forward" sockets for spliced flows udp: Remove the @bound field from union udp_epoll_ref udp: Split splice field in udp_epoll_ref into (mostly) independent bits udp: Don't create double sockets for -U port udp: Re-use fixed bound sockets for packet forwarding when possible udp: Don't explicitly track originating socket for spliced "connections" udp: Update UDP "connection" timestamps in both directions udp: Simplify udp_sock_handler_splice udp: Make UDP_SPLICE_FRAMES and UDP_TAP_FRAMES_MEM the same thing udp: Add helper to extract port from a sockaddr_in or sockaddr_in6 udp: Unify buffers for tap and splice paths udp: Split send half of udp_sock_handler_splice() from the receive half udp: Correct splice forwarding when receiving from multiple sources passt.h | 2 + udp.c | 518 +++++++++++++++++++++++++------------------------------- udp.h | 16 +- 3 files changed, 244 insertions(+), 292 deletions(-) -- 2.38.1

2 28

[PATCH passt v2 0/7] Add fuzzing
by Richard W.M. Jones 30 Nov '22

30 Nov '22

With this series, fuzzing actually works, albeit slowly. More on that below. Patches 1 & 2 are the same as before. Patch 3 is Stefano Brivio's modified patch (with some changes that we discussed together on IRC but otherwise unchanged). Patch 4 is new, but discussed already upstream: It changes the order in which EPOLLIN and EPOLLRDHUP events are processed, so that we don't drop packets when the socket is closed. Patches 5 & 6 are the hacks that were needed to get fuzzing to work. Patch 6 removes all seccomp and other isolation stuff because for unclear reasons that breaks AFL instrumentation. AFL appears to fork off a second process, and somehow strace cannot follow that process, but the second process fails, and that breaks AFL completely. Without strace data it's rather hard to see what's going on so I didn't investigate this further. Patch 7 adds the fuzzing wrapper and is not greatly changed from before. However I did have to disable the AFL "fork server" optimization which somehow doesn't work with passt (it does work fine with libnbd & nbdkit). Speed is not great. I'm getting ~ 75-80 execs/second. Really we want this to be much higher, since that ultimately governs how fast we can explore new code paths and find bugs. Ideally well over 1000 execs/s (per fuzzing process) would be a good target to aim for. (Of course this depends on the hardware as well.) We could try to find out why the fork server is not compatible with passt, but I don't think we'd gain very much performance there. To explore this I ran a dummy fuzzed process from the same wrapper, and it was hardly any faster. I think the real gains are going to come from reducing the overhead of starting passt. There seem to be some netlink messages sent during start up and maybe if those could be reduced or eliminated we might see better performance. The other factor is fuzzing stability, which hovers around 87-90%. While this isn't necessarily bad, I wonder where the non-determinism is coming from [ideal figures would be 95-100%]. Passt doesn't appear to use threads. It does call getrandom (for TCP sequence numbers) so it'd be good to have a way to bypass that. However I don't fully understand what's going on here. Rich.

3 18

[PATCH v3 00/16] Simplify and correct handling of "spliced" UDP forwarding
by David Gibson 29 Nov '22

29 Nov '22

The UDP "splicing" (forwarding packets from one L4 socket to another, rather than via the tuntap device) code assumes that any given UDP port in the init namespace will only communicate with a single port on the ns side at a time, and vice versa. This will often be the case, but since UDP is a connectionless protocol, it need not be. In fact it is not the case in our existing UDP bandwidth checks, although the specific configuration there means it's not harmful in that case. The failure mode in this case can be quite bad: we don't just fall back to an unoptimized oath, or drop packets, we will misdirect packets to the wrong destination. This series make some substantial simplifications to how we handle the splice forwarding, then corrects it to handle the case of multiple source ports sending to a single destination. This does come at a performance cost. It's not as large as I feared, and shouldn't affect the most common case where there is a 1 to 1 mapping between source and destination ports. I haven't yet been able to confirm the latter because the iperf3 bandwidth test we use *does* have interleaved streams with a common destination port. Based on the earlier series for dual stack TCP sockets. Changes since v2: * Minor style and comment revisions Changes since v1: * Added patches 12..16/16 fixing the delivery of packets, as well as just simplifying the mechanics David Gibson (16): udp: Also bind() connected ports for "splice" forwarding udp: Separate tracking of inbound and outbound packet flows udp: Always use sendto() rather than send() for forwarding spliced packets udp: Don't connect "forward" sockets for spliced flows udp: Remove the @bound field from union udp_epoll_ref udp: Split splice field in udp_epoll_ref into (mostly) independent bits udp: Don't create double sockets for -U port udp: Re-use fixed bound sockets for packet forwarding when possible udp: Don't explicitly track originating socket for spliced "connections" udp: Update UDP "connection" timestamps in both directions udp: Simplify udp_sock_handler_splice udp: Make UDP_SPLICE_FRAMES and UDP_TAP_FRAMES_MEM the same thing udp: Add helper to extract port from a sockaddr_in or sockaddr_in6 udp: Unify buffers for tap and splice paths udp: Split send half of udp_sock_handler_splice() from the receive half udp: Correct splice forwarding when receiving from multiple sources passt.h | 2 + udp.c | 519 +++++++++++++++++++++++++------------------------------- udp.h | 16 +- 3 files changed, 244 insertions(+), 293 deletions(-) -- 2.38.1

1 17

[PATCH v2 00/32] Use dual stack sockets to listen for inbound TCP connections
by David Gibson 25 Nov '22

25 Nov '22

When forwarding many ports, passt can consume a lot of kernel memory because of the many listening sockets it opens. There are not a lot of ways we can reduce that, but here's one. Currently we create separate listening sockets for each port for both IPv4 and IPv6. However in Linux (and probably other platforms), it's possible to listen for both IPv4 and IPv6 connections on an IPv6 socket. This series uses such dual stack sockets to halve the number of listening sockets needed for TCP. When forwarding all TCP and UDP ports, this reduces the kernel memory used from around 677 MiB to around 487 MiB (kernel 6.0.8 on an x86_64 Fedora 37 machine). This should also be possible for UDP, but that will require a mostly separate implementation. Changes since v2: * Assorted minor polishing based on Stefano's review David Gibson (32): clang-tidy: Suppress warning about assignments in if statements style: Minor corrections to function comments tcp_splice: #include tcp_splice.h in tcp_splice.c tcp: Remove unused TCP_MAX_SOCKS constant tcp: Better helpers for converting between connection pointer and index tcp_splice: Helpers for converting from index to/from tcp_splice_conn tcp: Move connection state structures into a shared header tcp: Add connection union type tcp: Improved helpers to update connections after moving tcp: Unify spliced and non-spliced connection tables tcp: Unify tcp_defer_handler and tcp_splice_defer_handler() tcp: Partially unify tcp_timer() and tcp_splice_timer() tcp: Unify the IN_EPOLL flag tcp: Separate helpers to create ns listening sockets tcp: Unify part of spliced and non-spliced conn_from_sock path tcp: Use the same sockets to listen for spliced and non-spliced connections tcp: Remove splice from tcp_epoll_ref tcp: Don't store hash bucket in connection structures inany: Helper functions for handling addresses which could be IPv4 or IPv6 tcp: Hash IPv4 and IPv4-mapped-IPv6 addresses the same tcp: Take tcp_hash_insert() address from struct tcp_conn tcp: Simplify tcp_hash_match() to take an inany_addr tcp: Unify initial sequence number calculation for IPv4 and IPv6 tcp: Have tcp_seq_init() take its parameters from struct tcp_conn tcp: Fix small errors in tcp_seq_init() time handling tcp: Remove v6 flag from tcp_epoll_ref tcp: NAT IPv4-mapped IPv6 addresses like IPv4 addresses tcp_splice: Allow splicing of connections from IPv4-mapped loopback tcp: Consolidate tcp_sock_init[46] util: Allow sock_l4() to open dual stack sockets util: Always return -1 on error in sock_l4() tcp: Use dual stack sockets for port forwarding when possible Makefile | 15 +- conf.c | 12 +- inany.h | 94 +++++ siphash.c | 2 + tap.c | 6 +- tcp.c | 981 ++++++++++++++++++++++----------------------------- tcp.h | 11 +- tcp_conn.h | 192 ++++++++++ tcp_splice.c | 333 +++++++---------- tcp_splice.h | 12 +- util.c | 19 +- 11 files changed, 891 insertions(+), 786 deletions(-) create mode 100644 inany.h create mode 100644 tcp_conn.h -- 2.38.1

2 36

[PATCH 00/11] Simplify handling of "spliced" UDP forwarding
by David Gibson 22 Nov '22

22 Nov '22

This series make some substantial simplifications to how we handle the forwarding of "spliced" UDP packets (those that don't go over the tuntap device, but instead are forwarded from one L4 socket to another). This doesn't yet change the existing (arguably broken) assumption that UDP communications are from one port to one port within the pasta namespace, not one to many or many to one. However, the simplifications made here will make it easier to correct that in future. Based on the earlier series for dual stack TCP sockets. David Gibson (11): udp: Also bind() connected ports for "splice" forwarding udp: Separate tracking of inbound and outbound packet flows udp: Always use sendto() rather than send() for forwarding spliced packets udp: Don't connect "forward" sockets for spliced flows udp: Remove the @bound field from union udp_epoll_ref udp: Split splice field in udp_epoll_ref into (mostly) independent bits udp: Don't create double sockets for -U port udp: Re-use fixed bound sockets for packet forwarding when possible udp: Don't explicitly track originating socket for spliced "connections" udp: Update UDP "connection" timestamps in both directions udp: Simplify udp_sock_handler_splice passt.h | 2 + udp.c | 344 ++++++++++++++++++++++---------------------------------- udp.h | 16 ++- 3 files changed, 144 insertions(+), 218 deletions(-) -- 2.38.1

2 12

[PATCH] README: Add link to weekly development meeting
by Stefano Brivio 21 Nov '22

21 Nov '22

Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/README.md b/README.md index 33e4dcf..46cbd9f 100644 --- a/README.md +++ b/README.md @@ -657,6 +657,10 @@ See also the [test logs](/builds/latest/test/). ### [Chat](/passt/chat) * Somebody might be available on [IRC](https://irc.passt.top) +### Weekly development [meeting](https://pad.passt.top/p/weekly) +* Open to everybody! Feel free to join and propose a different time directly on + the agenda. + ## Security and Vulnerability Reports * Please send an email to [passt-sec](mailto:passt-sec@passt.top), private list, -- 2.35.1

2 1

[PATCH] tcp: Pass union tcp_conn pointer to destroy and splice timer functions
by Stefano Brivio 21 Nov '22

21 Nov '22

The pointers are actually the same, but we later pass the container union to tcp_table_compact(), which might zero the size of the whole union, and this confuses Coverity Scan. Given that we have pointers to the container union to start with, just pass those instead, all the way down to tcp_table_compact(). Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- tcp.c | 17 +++++++++-------- tcp_conn.h | 4 ++-- tcp_splice.c | 16 ++++++++++------ 3 files changed, 21 insertions(+), 16 deletions(-) diff --git a/tcp.c b/tcp.c index 8874789..ff12dfa 100644 --- a/tcp.c +++ b/tcp.c @@ -1376,16 +1376,18 @@ void tcp_table_compact(struct ctx *c, union tcp_conn *hole) /** * tcp_conn_destroy() - Close sockets, trigger hash table removal and compaction * @c: Execution context - * @conn: Connection pointer + * @conn_union: Connection pointer (container union) */ -static void tcp_conn_destroy(struct ctx *c, struct tcp_tap_conn *conn) +static void tcp_conn_destroy(struct ctx *c, union tcp_conn *conn_union) { + struct tcp_tap_conn *conn = &conn_union->tap; + close(conn->sock); if (conn->timer != -1) close(conn->timer); tcp_hash_remove(c, conn); - tcp_table_compact(c, (union tcp_conn *)conn); + tcp_table_compact(c, conn_union); } static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn); @@ -1535,13 +1537,12 @@ void tcp_defer_handler(struct ctx *c) for (conn = tc + c->tcp.conn_count - 1; conn >= tc; conn--) { if (conn->c.spliced) { if (conn->splice.flags & CLOSING) - tcp_splice_destroy(c, &conn->splice); + tcp_splice_destroy(c, conn); } else { if (conn->tap.events == CLOSED) - tcp_conn_destroy(c, &conn->tap); + tcp_conn_destroy(c, conn); } } - } /** @@ -3395,10 +3396,10 @@ void tcp_timer(struct ctx *c, const struct timespec *ts) for (conn = tc + c->tcp.conn_count - 1; conn >= tc; conn--) { if (conn->c.spliced) { - tcp_splice_timer(c, &conn->splice); + tcp_splice_timer(c, conn); } else { if (conn->tap.events == CLOSED) - tcp_conn_destroy(c, &conn->tap); + tcp_conn_destroy(c, conn); } } diff --git a/tcp_conn.h b/tcp_conn.h index 4a8be29..a146276 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -181,8 +181,8 @@ extern union tcp_conn tc[]; void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new); void tcp_table_compact(struct ctx *c, union tcp_conn *hole); -void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn); -void tcp_splice_timer(struct ctx *c, struct tcp_splice_conn *conn); +void tcp_splice_destroy(struct ctx *c, union tcp_conn *conn_union); +void tcp_splice_timer(struct ctx *c, union tcp_conn *conn_union); void tcp_splice_pipe_refill(const struct ctx *c); diff --git a/tcp_splice.c b/tcp_splice.c index e2f0ce1..72b1672 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -251,10 +251,12 @@ void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new) /** * tcp_splice_destroy() - Close spliced connection and pipes, clear * @c: Execution context - * @conn: Connection pointer + * @conn_union: Spliced connection (container union) */ -void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn) +void tcp_splice_destroy(struct ctx *c, union tcp_conn *conn_union) { + struct tcp_splice_conn *conn = &conn_union->splice; + if (conn->events & SPLICE_ESTABLISHED) { /* Flushing might need to block: don't recycle them. */ if (conn->pipe_a_b[0] != -1) { @@ -283,7 +285,7 @@ void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn) debug("TCP (spliced): index %li, CLOSED", CONN_IDX(conn)); c->tcp.splice_conn_count--; - tcp_table_compact(c, (union tcp_conn *)conn); + tcp_table_compact(c, conn_union); } /** @@ -824,12 +826,14 @@ void tcp_splice_init(struct ctx *c) /** * tcp_splice_timer() - Timer for spliced connections * @c: Execution context - * @conn: Spliced connection + * @conn_union: Spliced connection (container union) */ -void tcp_splice_timer(struct ctx *c, struct tcp_splice_conn *conn) +void tcp_splice_timer(struct ctx *c, union tcp_conn *conn_union) { + struct tcp_splice_conn *conn = &conn_union->splice; + if (conn->flags & CLOSING) { - tcp_splice_destroy(c, conn); + tcp_splice_destroy(c, conn_union); return; } -- 2.35.1

2 1

UDP "splicing" fairly broken :(
by David Gibson 18 Nov '22

18 Nov '22

In preparation for trying to implement dual stack sockets for UDP, I've been getting my head around how the UDP splicing works. Alas, I'm pretty sure that it's broken if there's not a one-to-one correspondence between init side source ports and ns side destination ports. That will typically be the case, but given its UDP there's no guarantee. In addition, UDP servers in the ns will not see the correct port numbers with getpeername(). That's also true of spliced TCP connections (see https://bugs.passt.top/show_bug.cgi?id=39), but it's more likely to matter for UDP (I don't know of any TCP protocols that care about source port number on the server side, but there are some common UDP protocols that have at least port number conventions on both sides). I can expand on the details later, but pasta will do the wrong thing in at least some circumstances for both a single init side socket sendto()ing packets to multiple different ports in the ns/guest and multiple init side sockets send()ing to the same port in the ns/guest. I think I know how to fix it, but it's not a trivial job. So, the question is do I embark on this now, or do I just remove UDP "splicing" entirely for the time being (other than a minimum required to make -U work)? That would unblock dual stack UDP sockets and we can attempt to reoptimize this later. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson

2 4