On Wed, 2 Oct 2024 15:48:26 +1000
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> In pasta mode, where addressing permits we "splice" connections,
> forwarding directly from host socket to guest/container socket without
> any L2 or L3 processing. This gives us a very large performance
> improvement when it's possible.
>
> Since the traffic is from a local socket within the guest, it will go
> over the guest's 'lo' interface, and accordingly we set the guest side
> address to be the loopback address. However this has a surprising side
> effect: sometimes guests will run services that are only supposed to be
> used within the guest and are therefore bound to only 127.0.0.1 and/or
> ::1. pasta's forwarding exposes those services to the host, which isn't
> generally what we want.
>
> Correct this by instead forwarding inbound "splice" flows to the guest's
> external address.
>
> Link: https://github.com/containers/podman/issues/24045
>
> Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
> ---
>  conf.c  |  9 +++++++++
>  fwd.c   | 31 +++++++++++++++++++++++--------
>  passt.1 | 23 +++++++++++++++++++----
>  passt.h |  2 ++
>  4 files changed, 53 insertions(+), 12 deletions(-)
>
> diff --git a/conf.c b/conf.c
> index 6e62510..b5318f3 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -908,6 +908,9 @@ pasta_opts:
>  		" -U, --udp-ns SPEC UDP port forwarding to init namespace\n"
>  		" SPEC is as described above\n"
>  		" default: auto\n"
> +		" --host-lo-to-ns-lo DEPRECATED:\n"
> +		" Translate host-loopback forwards to\n"
> +		" namespace loopback\n"
>  		" --userns NSPATH Target user namespace to join\n"
>  		" --netns PATH|NAME Target network namespace to join\n"
>  		" --netns-only Don't join existing user namespace\n"
> @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"netns-only", no_argument, NULL, 20 },
>  		{"map-host-loopback", required_argument, NULL, 21 },
>  		{"map-guest-addr", required_argument, NULL, 22 },
> +		{"host-lo-to-ns-lo", no_argument, NULL, 23 },
>  		{ 0 },
>  	};
>  	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
> @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv)
>  			conf_nat(optarg, &c->ip4.map_guest_addr,
>  				 &c->ip6.map_guest_addr, NULL);
>  			break;
> +		case 23:
> +			if (c->mode != MODE_PASTA)
> +				die("--host-lo-to-ns-lo is for pasta mode only");
> +			c->host_lo_to_ns_lo = 1;
> +			break;
>  		case 'd':
>  			c->debug = 1;
>  			c->quiet = 0;
> diff --git a/fwd.c b/fwd.c
> index a505098..c71f5e1 100644
> --- a/fwd.c
> +++ b/fwd.c
> @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
>  	    (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
>  		/* spliceable */
>
> -		/* Preserve the specific loopback adddress used, but let the
> -		 * kernel pick a source port on the target side
> +		/* The traffic will go over the guest's 'lo' interface, but by
> +		 * default use its external address, so we don't inadvertently
> +		 * expose services that listen only on the guest's loopback
> +		 * address. That can be overridden by --host-lo-to-ns-lo which
> +		 * will instead forward to the loopback address in the guest.
> +		 *
> +		 * In either case, let the kernel pick the source address to
> +		 * match.
>  		 */
> -		tgt->oaddr = ini->eaddr;
> +		if (inany_v4(&ini->eaddr)) {
> +			if (c->host_lo_to_ns_lo)
> +				tgt->eaddr = inany_loopback4;
> +			else
> +				tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
> +			tgt->oaddr = inany_any4;
> +		} else {
> +			if (c->host_lo_to_ns_lo)
> +				tgt->eaddr = inany_loopback6;
> +			else
> +				tgt->eaddr.a6 = c->ip6.addr_seen;

Either this...

> +			tgt->oaddr = inany_any6;

or this (and not something before this patch, up to 3/4) make the
"TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang,
sometimes (about one in three/four runs), that's what I mistakenly
reported as coming from Laurent's series at:

  https://archives.passt.top/passt-dev/20241002163238.1778ed19@elisabeth/

It hangs like this (display with >= 240 columns):

ns$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │...passed.
88.198.0.164 │
ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): big transfer
88.198.0.161 │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin
ns$ ip -j link show | jq -rM '.[] | select(.ifname == "enp9s0").mtu' │...passed.
65520 │
ns$ /sbin/dhclient -6 --no-pid enp9s0 │Starting test: TCP/IPv4: ns to host (via tap): big transfer
ns$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.prefixlen == 128).local] | .[0]' │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin
2a01:4f8:222:904::2 │...passed.
ns$ ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' │
fe80::1 │Starting test: TCP/IPv4: host to ns (spliced): small transfer
ns$ which socat ip jq >/dev/null │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin
ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc │...passed.
ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10003 │
ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): small transfer
88.198.0.161 │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin
ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:88.198.0.161:10003 │...passed.
ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin,create,trunc │
ns$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10003 │Starting test: TCP/IPv4: ns to host (via tap): small transfer
ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin
88.198.0.161 │...passed.
ns$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP4:88.198.0.161:10003 │
ns$ strace socat -u TCP6-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc 2>/tmp/socat_server.strace │Starting test: TCP/IPv6: host to ns (spliced): big transfer
 │
──namespace────────────────────────────┬──────────────────┴──pasta/tcp [7/12] - TCP/IPv6: host to ns (spliced): big transfer──────
host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │ router: 88.198.0.161
fe80::1 │DNS:
host$ which ip jq >/dev/null │ 185.12.64.1
host$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │ 185.12.64.2
88.198.0.164 │ NAT to host ::1: fe80::1
host$ ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP/DHCPv6:
88.198.0.161 │ assign: 2a01:4f8:222:904::2
host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]' │ router: fe80::1
enp9s0 │ our link-local: fe80::1
host$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.scope == "global" and .depreca│DNS:
ted != true).local] | .[0]' │ 2a01:4ff:ff00::add:2
2a01:4f8:222:904::2 │ 2a01:4ff:ff00::add:1
host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP: received RS, sending RA
fe80::1 │DHCP: offer to discover
host$ which socat ip jq >/dev/null │ from 1e:48:6f:6e:b6:50
host$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10002 │DHCP: ack to request
host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │ from 1e:48:6f:6e:b6:50
host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │DHCPv6: received SOLICIT, sending ADVERTISE
host$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10002 │DHCPv6: received REQUEST/RENEW/CONFIRM, sending REPLY
host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA
host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA
host$ strace socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP6:[::1]:10002 2>/tmp/socat_client.strace │NDP: received NS, sending NA
host$ │
──host─────────────────────────────────┴──pasta───────────────────────────────────────────────────────────────────────────────────
Testing commit: a056cfc fwd: Direct inbound spliced forwards to the guest's external address    PASS: 23 | FAIL: 0 | 2024-10-04T16:16:28+00:00

...even without strace. The client is done, the server hangs. If I
unblock this manually by re-running the same client command, the server
wakes up, writes the file, and terminates, and the test continues
normally.

Those three "received NS, sending NA" messages in the pasta pane are
printed in a short time after the test starts.

If I run this with TRACE=1 (which needs the patch I just sent), this is
pasta's debugging output for this test:

--
6.1401: pasta: epoll event on listening TCP socket 6 (events: 0x00000001)
6.1402: Flow 0 (NEW): FREE -> NEW
6.1402: Flow 0 (INI): NEW -> INI
6.1402: Flow 0 (INI): HOST [::1]:48910 -> [::]:10002 => ?
6.1402: Flow 0 (TGT): INI -> TGT
6.1402: Flow 0 (TGT): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002
6.1402: Flow 0 (TCP connection (spliced)): TGT -> TYPED
6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002
6.1402: Flow 0 (TCP connection (spliced)): event at tcp_splice_connect:377
6.1402: Flow 0 (TCP connection (spliced)): SPLICE_CONNECT
6.1402: Flow 0 (TCP connection (spliced)): TYPED -> ACTIVE
6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002
6.1402: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001)
6.1402: NDP: received NS, sending NA
7.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
7.0007: TCP (spliced): cannot set pool pipe size to 524288
7.0007: TCP (spliced): cannot set pool pipe size to 524288
7.0007: TCP (spliced): cannot set pool pipe size to 524288
7.0007: TCP (spliced): cannot set pool pipe size to 524288
7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766
7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766
7.1585: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001)
7.1585: NDP: received NS, sending NA
8.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766
8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766
8.1825: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001)
8.1825: NDP: received NS, sending NA
9.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
9.2065: pasta: epoll event on connected spliced TCP socket 118 (events: 0x0000001c)
9.2065: Flow 0 (TCP connection (spliced)): Error event on socket: No route to host
9.2065: Flow 0 (TCP connection (spliced)): flag at tcp_splice_sock_handler:624
9.2065: Flow 0 (TCP connection (spliced)): RCVLOWAT_ACT_1
9.2068: Flow 0 (TCP connection (spliced)): CLOSED
9.2068: Flow 0 (FREE): ACTIVE -> FREE
9.2068: Flow 0 (FREE): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002
10.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
11.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
12.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
13.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001)
[...]
--

Relevant parts of strace output from the client:

--
openat(AT_FDCWD, "/home/sbrivio/passt/test/big.bin", O_RDONLY) = 5
ioctl(5, TCGETS, 0x7ffd600ae4a0) = -1 ENOTTY (Inappropriate ioctl for device)
fcntl(5, F_SETFD, FD_CLOEXEC) = 0
socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6
fcntl(6, F_SETFD, FD_CLOEXEC) = 0
connect(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
getsockname(6, {sa_family=AF_INET6, sin6_port=htons(39038), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [112 => 28]) = 0
pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6])
read(5, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192
write(6, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192
pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6])
read(5, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192
write(6, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192
pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6])
[...]
pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6])
read(5, "", 8192) = 0
shutdown(6, SHUT_WR) = 0
shutdown(6, SHUT_RDWR) = 0
exit_group(0) = ?
+++ exited with 0 +++
--

and from the server:

--
socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6
fcntl(6, F_SETFD, FD_CLOEXEC) = 0
setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0
listen(6, 5) = 0
getsockname(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, [28]) = 0
pselect6(7, [4 6], NULL, NULL, NULL, NULL
--

If I connect from the host without a server in the namespace (but with
the port forwarded by pasta), I get a connection reset, and if the port
is not forwarded by pasta, connection refused.

But this is another case: we start connecting and accept the connection
(probably we shouldn't). Note the "No route to host" error on the
socket.

It looks somehow similar to the race I fixed with commit f4e9f26480ef
("pasta: Disable neighbour solicitations on device up to prevent DAD"),
but it doesn't look like an invalid c->ip6.addr_seen, because otherwise
pasta would reset the connection, I suppose.

I haven't debugged further yet. This looks like an existing issue in
pasta rather than in this series or in the tests, but it blocks tests,
so I haven't applied this yet.

-- 
Stefano