[PATCH v3 0/4] Don't expose container loopback services to the host
podman issue #24045 pointed out that pasta's spliced forwarding logic can expose services within the namespace bound only to 127.0.0.1 or ::1 to the host. However, the namespace probably expects those to only be accessible to itself, so that's probably not what we want. This changes our forwarding logic so as not to do this. Note that the podman tests will currently fail with this series, I've submitted podman PR https://github.com/containers/podman/pull/24064 to fix that. Link: https://github.com/containers/podman/issues/24045 Changes since v2: * Add new field do structure comment Changes since v1: * Add --host-lo-to-ns-lo option to preserve the old behaviour * Clarify the new behaviour in the man page * Add some extra patches making some other corrections to the man page David Gibson (4): passt.1: Mark --stderr as deprecated more prominently passt.1: Clarify and update "Handling of local addresses" section test: Clarify test for spliced inbound transfers fwd: Direct inbound spliced forwards to the guest's external address conf.c | 9 ++++++ fwd.c | 31 +++++++++++++----- passt.1 | 75 +++++++++++++++++++++++++++----------------- passt.h | 2 ++ test/passt_in_ns/tcp | 8 ++--- test/passt_in_ns/udp | 4 +-- test/pasta/tcp | 16 +++++----- test/pasta/udp | 8 ++--- 8 files changed, 98 insertions(+), 55 deletions(-) -- 2.46.2
The description of this option says that it's deprecated, but unlike
--no-copy-addrs and --no-copy-routes it doesn't have a clear label. Add
one to make it easier to spot.
Signed-off-by: David Gibson
This section didn't mention the effect of the --map-host-loopback option
which now alters this behaviour. Update it accordingly.
It used "local addresses" to mean specifically 127.0.0.0/8 and ::1.
However, "local" could also refer to link-local addresses or to addresses
of any scope which happen to be configured on the host. Use "loopback
address" to be more precise about this.
Signed-off-by: David Gibson
The tests in pasta/tcp and pasta/udp for inbound transfers have the server
listening within the namespace explicitly bound to 127.0.0.1 or ::1. This
only works because of the behaviour of inbound splice connections, which
always appear with both source and destination addresses as loopback in
the namespace. That's not an inherent property for "spliced" connections
and arguably an undesirable one. Also update the test names to make it
clearer that these tests are expecting to exercise the "splice" path.
Interestingly this was already correct for the equivalent passt_in_ns/*,
although we also update the test names for clarity there.
Note that there are similar issues in some of the podman tests, addressed
in https://github.com/containers/podman/pull/24064
Signed-off-by: David Gibson
In pasta mode, where addressing permits we "splice" connections, forwarding
directly from host socket to guest/container socket without any L2 or L3
processing. This gives us a very large performance improvement when it's
possible.
Since the traffic is from a local socket within the guest, it will go over
the guest's 'lo' interface, and accordingly we set the guest side address
to be the loopback address. However this has a surprising side effect:
sometimes guests will run services that are only supposed to be used within
the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's
forwarding exposes those services to the host, which isn't generally what
we want.
Correct this by instead forwarding inbound "splice" flows to the guest's
external address.
Link: https://github.com/containers/podman/issues/24045
Signed-off-by: David Gibson
On Wed, 2 Oct 2024 15:48:26 +1000
David Gibson
In pasta mode, where addressing permits we "splice" connections, forwarding directly from host socket to guest/container socket without any L2 or L3 processing. This gives us a very large performance improvement when it's possible.
Since the traffic is from a local socket within the guest, it will go over the guest's 'lo' interface, and accordingly we set the guest side address to be the loopback address. However this has a surprising side effect: sometimes guests will run services that are only supposed to be used within the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's forwarding exposes those services to the host, which isn't generally what we want.
Correct this by instead forwarding inbound "splice" flows to the guest's external address.
Link: https://github.com/containers/podman/issues/24045
Signed-off-by: David Gibson
--- conf.c | 9 +++++++++ fwd.c | 31 +++++++++++++++++++++++-------- passt.1 | 23 +++++++++++++++++++---- passt.h | 2 ++ 4 files changed, 53 insertions(+), 12 deletions(-) diff --git a/conf.c b/conf.c index 6e62510..b5318f3 100644 --- a/conf.c +++ b/conf.c @@ -908,6 +908,9 @@ pasta_opts: " -U, --udp-ns SPEC UDP port forwarding to init namespace\n" " SPEC is as described above\n" " default: auto\n" + " --host-lo-to-ns-lo DEPRECATED:\n" + " Translate host-loopback forwards to\n" + " namespace loopback\n" " --userns NSPATH Target user namespace to join\n" " --netns PATH|NAME Target network namespace to join\n" " --netns-only Don't join existing user namespace\n" @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv) {"netns-only", no_argument, NULL, 20 }, {"map-host-loopback", required_argument, NULL, 21 }, {"map-guest-addr", required_argument, NULL, 22 }, + {"host-lo-to-ns-lo", no_argument, NULL, 23 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv) conf_nat(optarg, &c->ip4.map_guest_addr, &c->ip6.map_guest_addr, NULL); break; + case 23: + if (c->mode != MODE_PASTA) + die("--host-lo-to-ns-lo is for pasta mode only"); + c->host_lo_to_ns_lo = 1; + break; case 'd': c->debug = 1; c->quiet = 0; diff --git a/fwd.c b/fwd.c index a505098..c71f5e1 100644 --- a/fwd.c +++ b/fwd.c @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */
- /* Preserve the specific loopback adddress used, but let the - * kernel pick a source port on the target side + /* The traffic will go over the guest's 'lo' interface, but by + * default use its external address, so we don't inadvertently + * expose services that listen only on the guest's loopback + * address. That can be overridden by --host-lo-to-ns-lo which + * will instead forward to the loopback address in the guest. + * + * In either case, let the kernel pick the source address to + * match. */ - tgt->oaddr = ini->eaddr; + if (inany_v4(&ini->eaddr)) { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + tgt->oaddr = inany_any4; + } else { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback6; + else + tgt->eaddr.a6 = c->ip6.addr_seen;
Either this...
+ tgt->oaddr = inany_any6;
or this (and not something before this patch, up to 3/4) make the "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, sometimes (about one in three/four runs), that's what I mistakenly reported as coming from Laurent's series at: https://archives.passt.top/passt-dev/20241002163238.1778ed19@elisabeth/ It hangs like this (display with >= 240 columns): ns$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │...passed. 88.198.0.164 │ ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): big transfer 88.198.0.161 │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin ns$ ip -j link show | jq -rM '.[] | select(.ifname == "enp9s0").mtu' │...passed. 65520 │ ns$ /sbin/dhclient -6 --no-pid enp9s0 │Starting test: TCP/IPv4: ns to host (via tap): big transfer ns$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.prefixlen == 128).local] | .[0]' │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin 2a01:4f8:222:904::2 │...passed. ns$ ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' │ fe80::1 │Starting test: TCP/IPv4: host to ns (spliced): small transfer ns$ which socat ip jq >/dev/null │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc │...passed. ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10003 │ ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): small transfer 88.198.0.161 │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:88.198.0.161:10003 │...passed. ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin,create,trunc │ ns$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10003 │Starting test: TCP/IPv4: ns to host (via tap): small transfer ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin 88.198.0.161 │...passed. ns$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP4:88.198.0.161:10003 │ ns$ strace socat -u TCP6-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc 2>/tmp/socat_server.strace │Starting test: TCP/IPv6: host to ns (spliced): big transfer │ ──namespace─────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────┴──pasta/tcp [7/12] - TCP/IPv6: host to ns (spliced): big transfer─────────────────────────────────── host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │ router: 88.198.0.161 fe80::1 │DNS: host$ which ip jq >/dev/null │ 185.12.64.1 host$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │ 185.12.64.2 88.198.0.164 │ NAT to host ::1: fe80::1 host$ ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP/DHCPv6: 88.198.0.161 │ assign: 2a01:4f8:222:904::2 host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]' │ router: fe80::1 enp9s0 │ our link-local: fe80::1 host$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.scope == "global" and .depreca│DNS: ted != true).local] | .[0]' │ 2a01:4ff:ff00::add:2 2a01:4f8:222:904::2 │ 2a01:4ff:ff00::add:1 host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP: received RS, sending RA fe80::1 │DHCP: offer to discover host$ which socat ip jq >/dev/null │ from 1e:48:6f:6e:b6:50 host$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10002 │DHCP: ack to request host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │ from 1e:48:6f:6e:b6:50 host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │DHCPv6: received SOLICIT, sending ADVERTISE host$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10002 │DHCPv6: received REQUEST/RENEW/CONFIRM, sending REPLY host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA host$ strace socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP6:[::1]:10002 2>/tmp/socat_client.strace │NDP: received NS, sending NA host$ │ ──host──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──pasta──────────────────────────────────────────────────────────────────────────────────────────────────────────────── Testing commit: a056cfc fwd: Direct inbound spliced forwards to the guest's external address PASS: 23 | FAIL: 0 | 2024-10-04T16:16:28+00:00 ...even without strace. The client is done, the server hangs. If I unblock this manually by re-running the same client command, the server wakes up, writes the file, and terminates, and the test continues normally. Those three "received NS, sending NA" messages in the pasta pane are printed in a short time after the test starts. If I run this with TRACE=1 (which needs the patch I just sent), this is pasta's debugging output for this test: -- 6.1401: pasta: epoll event on listening TCP socket 6 (events: 0x00000001) 6.1402: Flow 0 (NEW): FREE -> NEW 6.1402: Flow 0 (INI): NEW -> INI 6.1402: Flow 0 (INI): HOST [::1]:48910 -> [::]:10002 => ? 6.1402: Flow 0 (TGT): INI -> TGT 6.1402: Flow 0 (TGT): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): TGT -> TYPED 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): event at tcp_splice_connect:377 6.1402: Flow 0 (TCP connection (spliced)): SPLICE_CONNECT 6.1402: Flow 0 (TCP connection (spliced)): TYPED -> ACTIVE 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 6.1402: NDP: received NS, sending NA 7.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.1585: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 7.1585: NDP: received NS, sending NA 8.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.1825: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 8.1825: NDP: received NS, sending NA 9.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 9.2065: pasta: epoll event on connected spliced TCP socket 118 (events: 0x0000001c) 9.2065: Flow 0 (TCP connection (spliced)): Error event on socket: No route to host 9.2065: Flow 0 (TCP connection (spliced)): flag at tcp_splice_sock_handler:624 9.2065: Flow 0 (TCP connection (spliced)): RCVLOWAT_ACT_1 9.2068: Flow 0 (TCP connection (spliced)): CLOSED 9.2068: Flow 0 (FREE): ACTIVE -> FREE 9.2068: Flow 0 (FREE): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 10.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 11.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 12.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 13.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) [...] -- Relevant parts of strace output from the client: -- openat(AT_FDCWD, "/home/sbrivio/passt/test/big.bin", O_RDONLY) = 5 ioctl(5, TCGETS, 0x7ffd600ae4a0) = -1 ENOTTY (Inappropriate ioctl for device) fcntl(5, F_SETFD, FD_CLOEXEC) = 0 socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 connect(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(39038), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [112 => 28]) = 0 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 write(6, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 write(6, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) [...] pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "", 8192) = 0 shutdown(6, SHUT_WR) = 0 shutdown(6, SHUT_RDWR) = 0 exit_group(0) = ? +++ exited with 0 +++ -- and from the server: -- socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0 listen(6, 5) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, [28]) = 0 pselect6(7, [4 6], NULL, NULL, NULL, NULL -- If I connect from the host without a server in the namespace (but with the port forwarded by pasta), I get a connection reset, and if the port is not forwarded by pasta, connection refused. But this is another case: we start connecting and accept the connection (probably we shouldn't). Note the "No route to host" error on the socket. It looks somehow similar to the race I fixed with commit f4e9f26480ef ("pasta: Disable neighbour solicitations on device up to prevent DAD"), but it doesn't look like an invalid c->ip6.addr_seen, because otherwise pasta would reset the connection, I suppose. I haven't debugged further yet. This looks like an existing issue in pasta rather than in this series or in the tests, but it blocks tests, so I haven't applied this yet. -- Stefano
On Wed, 9 Oct 2024 15:07:21 +0200
Stefano Brivio
On Wed, 2 Oct 2024 15:48:26 +1000 David Gibson
wrote: In pasta mode, where addressing permits we "splice" connections, forwarding directly from host socket to guest/container socket without any L2 or L3 processing. This gives us a very large performance improvement when it's possible.
Since the traffic is from a local socket within the guest, it will go over the guest's 'lo' interface, and accordingly we set the guest side address to be the loopback address. However this has a surprising side effect: sometimes guests will run services that are only supposed to be used within the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's forwarding exposes those services to the host, which isn't generally what we want.
Correct this by instead forwarding inbound "splice" flows to the guest's external address.
Link: https://github.com/containers/podman/issues/24045
Signed-off-by: David Gibson
--- conf.c | 9 +++++++++ fwd.c | 31 +++++++++++++++++++++++-------- passt.1 | 23 +++++++++++++++++++---- passt.h | 2 ++ 4 files changed, 53 insertions(+), 12 deletions(-) diff --git a/conf.c b/conf.c index 6e62510..b5318f3 100644 --- a/conf.c +++ b/conf.c @@ -908,6 +908,9 @@ pasta_opts: " -U, --udp-ns SPEC UDP port forwarding to init namespace\n" " SPEC is as described above\n" " default: auto\n" + " --host-lo-to-ns-lo DEPRECATED:\n" + " Translate host-loopback forwards to\n" + " namespace loopback\n" " --userns NSPATH Target user namespace to join\n" " --netns PATH|NAME Target network namespace to join\n" " --netns-only Don't join existing user namespace\n" @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv) {"netns-only", no_argument, NULL, 20 }, {"map-host-loopback", required_argument, NULL, 21 }, {"map-guest-addr", required_argument, NULL, 22 }, + {"host-lo-to-ns-lo", no_argument, NULL, 23 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv) conf_nat(optarg, &c->ip4.map_guest_addr, &c->ip6.map_guest_addr, NULL); break; + case 23: + if (c->mode != MODE_PASTA) + die("--host-lo-to-ns-lo is for pasta mode only"); + c->host_lo_to_ns_lo = 1; + break; case 'd': c->debug = 1; c->quiet = 0; diff --git a/fwd.c b/fwd.c index a505098..c71f5e1 100644 --- a/fwd.c +++ b/fwd.c @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */
- /* Preserve the specific loopback adddress used, but let the - * kernel pick a source port on the target side + /* The traffic will go over the guest's 'lo' interface, but by + * default use its external address, so we don't inadvertently + * expose services that listen only on the guest's loopback + * address. That can be overridden by --host-lo-to-ns-lo which + * will instead forward to the loopback address in the guest. + * + * In either case, let the kernel pick the source address to + * match. */ - tgt->oaddr = ini->eaddr; + if (inany_v4(&ini->eaddr)) { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + tgt->oaddr = inany_any4; + } else { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback6; + else + tgt->eaddr.a6 = c->ip6.addr_seen;
Either this...
+ tgt->oaddr = inany_any6;
or this (and not something before this patch, up to 3/4) make the "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, sometimes (about one in three/four runs), that's what I mistakenly reported as coming from Laurent's series at:
https://archives.passt.top/passt-dev/20241002163238.1778ed19@elisabeth/
It hangs like this (display with >= 240 columns):
Ouch, sorry, it looks like saving something in claws-mail as draft and sending it later means lines will be forcefully wrapped. Here's the original test output: ns$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │...passed. 88.198.0.164 │ ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): big transfer 88.198.0.161 │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin ns$ ip -j link show | jq -rM '.[] | select(.ifname == "enp9s0").mtu' │...passed. 65520 │ ns$ /sbin/dhclient -6 --no-pid enp9s0 │Starting test: TCP/IPv4: ns to host (via tap): big transfer ns$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.prefixlen == 128).local] | .[0]' │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin 2a01:4f8:222:904::2 │...passed. ns$ ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' │ fe80::1 │Starting test: TCP/IPv4: host to ns (spliced): small transfer ns$ which socat ip jq >/dev/null │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc │...passed. ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10003 │ ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): small transfer 88.198.0.161 │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:88.198.0.161:10003 │...passed. ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin,create,trunc │ ns$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10003 │Starting test: TCP/IPv4: ns to host (via tap): small transfer ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin 88.198.0.161 │...passed. ns$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP4:88.198.0.161:10003 │ ns$ strace socat -u TCP6-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc 2>/tmp/socat_server.strace │Starting test: TCP/IPv6: host to ns (spliced): big transfer │ ──namespace─────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────┴──pasta/tcp [7/12] - TCP/IPv6: host to ns (spliced): big transfer─────────────────────────────────── host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │ router: 88.198.0.161 fe80::1 │DNS: host$ which ip jq >/dev/null │ 185.12.64.1 host$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │ 185.12.64.2 88.198.0.164 │ NAT to host ::1: fe80::1 host$ ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP/DHCPv6: 88.198.0.161 │ assign: 2a01:4f8:222:904::2 host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]' │ router: fe80::1 enp9s0 │ our link-local: fe80::1 host$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.scope == "global" and .depreca│DNS: ted != true).local] | .[0]' │ 2a01:4ff:ff00::add:2 2a01:4f8:222:904::2 │ 2a01:4ff:ff00::add:1 host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP: received RS, sending RA fe80::1 │DHCP: offer to discover host$ which socat ip jq >/dev/null │ from 1e:48:6f:6e:b6:50 host$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10002 │DHCP: ack to request host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │ from 1e:48:6f:6e:b6:50 host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │DHCPv6: received SOLICIT, sending ADVERTISE host$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10002 │DHCPv6: received REQUEST/RENEW/CONFIRM, sending REPLY host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA host$ strace socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP6:[::1]:10002 2>/tmp/socat_client.strace │NDP: received NS, sending NA host$ │ ──host──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──pasta──────────────────────────────────────────────────────────────────────────────────────────────────────────────── Testing commit: a056cfc fwd: Direct inbound spliced forwards to the guest's external address PASS: 23 | FAIL: 0 | 2024-10-04T16:16:28+00:00
...even without strace. The client is done, the server hangs.
If I unblock this manually by re-running the same client command, the server wakes up, writes the file, and terminates, and the test continues normally.
Those three "received NS, sending NA" messages in the pasta pane are printed in a short time after the test starts.
If I run this with TRACE=1 (which needs the patch I just sent), this is pasta's debugging output for this test:
-- 6.1401: pasta: epoll event on listening TCP socket 6 (events: 0x00000001) 6.1402: Flow 0 (NEW): FREE -> NEW 6.1402: Flow 0 (INI): NEW -> INI 6.1402: Flow 0 (INI): HOST [::1]:48910 -> [::]:10002 => ? 6.1402: Flow 0 (TGT): INI -> TGT 6.1402: Flow 0 (TGT): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): TGT -> TYPED 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): event at tcp_splice_connect:377 6.1402: Flow 0 (TCP connection (spliced)): SPLICE_CONNECT 6.1402: Flow 0 (TCP connection (spliced)): TYPED -> ACTIVE 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 6.1402: NDP: received NS, sending NA 7.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.1585: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 7.1585: NDP: received NS, sending NA 8.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.1825: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 8.1825: NDP: received NS, sending NA 9.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 9.2065: pasta: epoll event on connected spliced TCP socket 118 (events: 0x0000001c) 9.2065: Flow 0 (TCP connection (spliced)): Error event on socket: No route to host 9.2065: Flow 0 (TCP connection (spliced)): flag at tcp_splice_sock_handler:624 9.2065: Flow 0 (TCP connection (spliced)): RCVLOWAT_ACT_1 9.2068: Flow 0 (TCP connection (spliced)): CLOSED 9.2068: Flow 0 (FREE): ACTIVE -> FREE 9.2068: Flow 0 (FREE): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 10.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 11.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 12.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 13.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) [...] --
This was: 6.1401: pasta: epoll event on listening TCP socket 6 (events: 0x00000001) 6.1402: Flow 0 (NEW): FREE -> NEW 6.1402: Flow 0 (INI): NEW -> INI 6.1402: Flow 0 (INI): HOST [::1]:48910 -> [::]:10002 => ? 6.1402: Flow 0 (TGT): INI -> TGT 6.1402: Flow 0 (TGT): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): TGT -> TYPED 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): event at tcp_splice_connect:377 6.1402: Flow 0 (TCP connection (spliced)): SPLICE_CONNECT 6.1402: Flow 0 (TCP connection (spliced)): TYPED -> ACTIVE 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 6.1402: NDP: received NS, sending NA 7.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.1585: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 7.1585: NDP: received NS, sending NA 8.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.1825: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 8.1825: NDP: received NS, sending NA 9.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 9.2065: pasta: epoll event on connected spliced TCP socket 118 (events: 0x0000001c) 9.2065: Flow 0 (TCP connection (spliced)): Error event on socket: No route to host 9.2065: Flow 0 (TCP connection (spliced)): flag at tcp_splice_sock_handler:624 9.2065: Flow 0 (TCP connection (spliced)): RCVLOWAT_ACT_1 9.2068: Flow 0 (TCP connection (spliced)): CLOSED 9.2068: Flow 0 (FREE): ACTIVE -> FREE 9.2068: Flow 0 (FREE): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 10.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 11.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 12.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 13.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) [...]
Relevant parts of strace output from the client:
-- openat(AT_FDCWD, "/home/sbrivio/passt/test/big.bin", O_RDONLY) = 5 ioctl(5, TCGETS, 0x7ffd600ae4a0) = -1 ENOTTY (Inappropriate ioctl for device) fcntl(5, F_SETFD, FD_CLOEXEC) = 0 socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 connect(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(39038), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [112 => 28]) = 0 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 write(6, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 write(6, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6])
This was: openat(AT_FDCWD, "/home/sbrivio/passt/test/big.bin", O_RDONLY) = 5 ioctl(5, TCGETS, 0x7ffd600ae4a0) = -1 ENOTTY (Inappropriate ioctl for device) fcntl(5, F_SETFD, FD_CLOEXEC) = 0 socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 connect(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(39038), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [112 => 28]) = 0 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 write(6, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 write(6, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6])
[...]
pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "", 8192) = 0 shutdown(6, SHUT_WR) = 0 shutdown(6, SHUT_RDWR) = 0 exit_group(0) = ? +++ exited with 0 +++ --
and from the server:
-- socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0 listen(6, 5) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, [28]) = 0 pselect6(7, [4 6], NULL, NULL, NULL, NULL --
And this was: socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0 listen(6, 5) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, [28]) = 0 pselect6(7, [4 6], NULL, NULL, NULL, NULL
If I connect from the host without a server in the namespace (but with the port forwarded by pasta), I get a connection reset, and if the port is not forwarded by pasta, connection refused.
But this is another case: we start connecting and accept the connection (probably we shouldn't). Note the "No route to host" error on the socket.
It looks somehow similar to the race I fixed with commit f4e9f26480ef ("pasta: Disable neighbour solicitations on device up to prevent DAD"), but it doesn't look like an invalid c->ip6.addr_seen, because otherwise pasta would reset the connection, I suppose.
I haven't debugged further yet. This looks like an existing issue in pasta rather than in this series or in the tests, but it blocks tests, so I haven't applied this yet.
-- Stefano
On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:
On Wed, 9 Oct 2024 15:07:21 +0200 Stefano Brivio
wrote: On Wed, 2 Oct 2024 15:48:26 +1000 David Gibson
wrote: In pasta mode, where addressing permits we "splice" connections, forwarding directly from host socket to guest/container socket without any L2 or L3 processing. This gives us a very large performance improvement when it's possible.
Since the traffic is from a local socket within the guest, it will go over the guest's 'lo' interface, and accordingly we set the guest side address to be the loopback address. However this has a surprising side effect: sometimes guests will run services that are only supposed to be used within the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's forwarding exposes those services to the host, which isn't generally what we want.
Correct this by instead forwarding inbound "splice" flows to the guest's external address.
Link: https://github.com/containers/podman/issues/24045
Signed-off-by: David Gibson
--- conf.c | 9 +++++++++ fwd.c | 31 +++++++++++++++++++++++-------- passt.1 | 23 +++++++++++++++++++---- passt.h | 2 ++ 4 files changed, 53 insertions(+), 12 deletions(-) diff --git a/conf.c b/conf.c index 6e62510..b5318f3 100644 --- a/conf.c +++ b/conf.c @@ -908,6 +908,9 @@ pasta_opts: " -U, --udp-ns SPEC UDP port forwarding to init namespace\n" " SPEC is as described above\n" " default: auto\n" + " --host-lo-to-ns-lo DEPRECATED:\n" + " Translate host-loopback forwards to\n" + " namespace loopback\n" " --userns NSPATH Target user namespace to join\n" " --netns PATH|NAME Target network namespace to join\n" " --netns-only Don't join existing user namespace\n" @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv) {"netns-only", no_argument, NULL, 20 }, {"map-host-loopback", required_argument, NULL, 21 }, {"map-guest-addr", required_argument, NULL, 22 }, + {"host-lo-to-ns-lo", no_argument, NULL, 23 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv) conf_nat(optarg, &c->ip4.map_guest_addr, &c->ip6.map_guest_addr, NULL); break; + case 23: + if (c->mode != MODE_PASTA) + die("--host-lo-to-ns-lo is for pasta mode only"); + c->host_lo_to_ns_lo = 1; + break; case 'd': c->debug = 1; c->quiet = 0; diff --git a/fwd.c b/fwd.c index a505098..c71f5e1 100644 --- a/fwd.c +++ b/fwd.c @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */
- /* Preserve the specific loopback adddress used, but let the - * kernel pick a source port on the target side + /* The traffic will go over the guest's 'lo' interface, but by + * default use its external address, so we don't inadvertently + * expose services that listen only on the guest's loopback + * address. That can be overridden by --host-lo-to-ns-lo which + * will instead forward to the loopback address in the guest. + * + * In either case, let the kernel pick the source address to + * match. */ - tgt->oaddr = ini->eaddr; + if (inany_v4(&ini->eaddr)) { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + tgt->oaddr = inany_any4; + } else { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback6; + else + tgt->eaddr.a6 = c->ip6.addr_seen;
Either this...
+ tgt->oaddr = inany_any6;
or this (and not something before this patch, up to 3/4) make the "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, sometimes (about one in three/four runs), that's what I mistakenly reported as coming from Laurent's series at:
Huh, interesting. Just got back from my leave and ran that group of tests in a loop this afternoon, but didn't manage to reproduce. I have administrivia that will probably fill the rest of this week, but I'll look into this as soon as I can. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:
On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:
On Wed, 9 Oct 2024 15:07:21 +0200 Stefano Brivio
wrote: [snip] @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */
- /* Preserve the specific loopback adddress used, but let the - * kernel pick a source port on the target side + /* The traffic will go over the guest's 'lo' interface, but by + * default use its external address, so we don't inadvertently + * expose services that listen only on the guest's loopback + * address. That can be overridden by --host-lo-to-ns-lo which + * will instead forward to the loopback address in the guest. + * + * In either case, let the kernel pick the source address to + * match. */ - tgt->oaddr = ini->eaddr; + if (inany_v4(&ini->eaddr)) { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + tgt->oaddr = inany_any4; + } else { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback6; + else + tgt->eaddr.a6 = c->ip6.addr_seen;
Either this...
+ tgt->oaddr = inany_any6;
or this (and not something before this patch, up to 3/4) make the "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, sometimes (about one in three/four runs), that's what I mistakenly reported as coming from Laurent's series at:
Huh, interesting. Just got back from my leave and ran that group of tests in a loop this afternoon, but didn't manage to reproduce. I have administrivia that will probably fill the rest of this week, but I'll look into this as soon as I can.
I reproduced the problem on passt.top, and I have a partial idea what's going on. As you say it's seeming like the address (addr_seen == addr in this case) isn't properly ready. This is over splice, but on the tap interface, I see the container sending NS messages for its own address - seems like it's doing DAD. But more importantly, we're answering those NS messages with NA messages, because we answer all NS. i.e. we're making the DAD fail. What I'm not sure of is how this ever worked at all. --config-net makes sense, since we disable DAD, but our test suite has always been using NDP+DHCP instead of --config-net. So, AFACT, we'll always fail guest DAD attempts, both IPv6, which happens most of the time and for IPv4 via ARP, which is used much more rarely. I think we need to be more selective in what NS or ARP lookups we resopnd to. The question is what approach to take: Option 1: Answer everything apart from specific exceptions ======== Basically we'd explicitly exclude the guest/container's assigned address but continue answering everything else. This is a bit ugly and means we'd probably still have a similar problem if the guest just picks an address instead of taking the assigned one. On the other hand it means things will work in at least some cases where the guest tries to contact remote hosts directly instead of via the default gateway. Option 2: Only answer for specific addresses ========= Reverse the logic above, usually *don't* answer NS queries, unless they have a specific address that we know is our responsibility. That would be, afaict, the gateway address, the "our_tap" addresses and the -map--host-loopback and --map-guest-addr addresses (usually all the same). This might be less robust against certain guests that don't use a default gateway when we expect. I guess that could happen when we have non-gateway routes copied from the host, so we'd have to handle that too. It would remove one more barrier towards having multiple bridged guests behind a single passt/pasta instance (they can't reasonably communicate with each other if passt answers all their NS / ARP requests). -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:
On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:
On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:
On Wed, 9 Oct 2024 15:07:21 +0200 Stefano Brivio
wrote: [snip] @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */
- /* Preserve the specific loopback adddress used, but let the - * kernel pick a source port on the target side + /* The traffic will go over the guest's 'lo' interface, but by + * default use its external address, so we don't inadvertently + * expose services that listen only on the guest's loopback + * address. That can be overridden by --host-lo-to-ns-lo which + * will instead forward to the loopback address in the guest. + * + * In either case, let the kernel pick the source address to + * match. */ - tgt->oaddr = ini->eaddr; + if (inany_v4(&ini->eaddr)) { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + tgt->oaddr = inany_any4; + } else { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback6; + else + tgt->eaddr.a6 = c->ip6.addr_seen;
Either this...
+ tgt->oaddr = inany_any6;
or this (and not something before this patch, up to 3/4) make the "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, sometimes (about one in three/four runs), that's what I mistakenly reported as coming from Laurent's series at:
Huh, interesting. Just got back from my leave and ran that group of tests in a loop this afternoon, but didn't manage to reproduce. I have administrivia that will probably fill the rest of this week, but I'll look into this as soon as I can.
I reproduced the problem on passt.top, and I have a partial idea what's going on. As you say it's seeming like the address (addr_seen == addr in this case) isn't properly ready. This is over splice, but on the tap interface, I see the container sending NS messages for its own address - seems like it's doing DAD. But more importantly, we're answering those NS messages with NA messages, because we answer all NS. i.e. we're making the DAD fail. What I'm not sure of is how this ever worked at all. --config-net makes sense, since we disable DAD, but our test suite has always been using NDP+DHCP instead of --config-net.
So, AFACT, we'll always fail guest DAD attempts, both IPv6, which happens most of the time and for IPv4 via ARP, which is used much more rarely. I think we need to be more selective in what NS or ARP lookups we resopnd to. The question is what approach to take:
Hmm... no.. there's more to this. Usually DAD requests have :: as the source address, and we *do* exclude those from getting replies. In this case though, we're getting NS requests for the assigned address from what looks like the SLAAC address. So, I do think it would be wise to explicitly exclude these: we shouldn't be giving NA responses for an address that ought to belong to the guest, even if it doesn't look like a DAD. But, I'm not sure what's triggering this. Is for some reason the DHCP address not "taking", so the container is trying to locate it on the network instead? Or _is_ this DAD, but under some circumstances rather than using :: as the source address it uses another configured address. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:
On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:
On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:
On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:
On Wed, 9 Oct 2024 15:07:21 +0200 Stefano Brivio
wrote: [snip] @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */
- /* Preserve the specific loopback adddress used, but let the - * kernel pick a source port on the target side + /* The traffic will go over the guest's 'lo' interface, but by + * default use its external address, so we don't inadvertently + * expose services that listen only on the guest's loopback + * address. That can be overridden by --host-lo-to-ns-lo which + * will instead forward to the loopback address in the guest. + * + * In either case, let the kernel pick the source address to + * match. */ - tgt->oaddr = ini->eaddr; + if (inany_v4(&ini->eaddr)) { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + tgt->oaddr = inany_any4; + } else { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback6; + else + tgt->eaddr.a6 = c->ip6.addr_seen;
Either this...
+ tgt->oaddr = inany_any6;
or this (and not something before this patch, up to 3/4) make the "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, sometimes (about one in three/four runs), that's what I mistakenly reported as coming from Laurent's series at:
Huh, interesting. Just got back from my leave and ran that group of tests in a loop this afternoon, but didn't manage to reproduce. I have administrivia that will probably fill the rest of this week, but I'll look into this as soon as I can.
I reproduced the problem on passt.top, and I have a partial idea what's going on. As you say it's seeming like the address (addr_seen == addr in this case) isn't properly ready. This is over splice, but on the tap interface, I see the container sending NS messages for its own address - seems like it's doing DAD. But more importantly, we're answering those NS messages with NA messages, because we answer all NS. i.e. we're making the DAD fail. What I'm not sure of is how this ever worked at all. --config-net makes sense, since we disable DAD, but our test suite has always been using NDP+DHCP instead of --config-net.
So, AFACT, we'll always fail guest DAD attempts, both IPv6, which happens most of the time and for IPv4 via ARP, which is used much more rarely. I think we need to be more selective in what NS or ARP lookups we resopnd to. The question is what approach to take:
Hmm... no.. there's more to this.
Usually DAD requests have :: as the source address, and we *do* exclude those from getting replies. In this case though, we're getting NS requests for the assigned address from what looks like the SLAAC address. So, I do think it would be wise to explicitly exclude these: we shouldn't be giving NA responses for an address that ought to belong to the guest, even if it doesn't look like a DAD.
But, I'm not sure what's triggering this. Is for some reason the DHCP address not "taking", so the container is trying to locate it on the network instead? Or _is_ this DAD, but under some circumstances rather than using :: as the source address it uses another configured address.
Ok.. I've understood a bit more. While timing is a factor here, it looks like the main reason I wasn't seeing it on my machine is what I'd consider a bug in the Debian version of the dhclient-script: when adding an IPv6 address, it returns without waiting for DAD to complete (i.e. for the address to be non-tentative). There's also an additional bug, which doesn't cause this problem, I think, but caused some problems when I was investigating. DHCPv6 needs the link-local SLAAC address already configured and non-tentative. The Fedora dhclient-script waits for that too at the PREINIT6 stage, but the Debian one doesn't, meaning if you attempt dhclient -6 immediately after starting the namespace it will fail to bind the UDP address it needs. I still think it's a good idea not to give NA messages for the guest assigned address, but we'll need a different workaround for this issue. I guess we'll have to manually wait for DAD to complete in the DHCP tests, which will be kind of mucky :/ -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Wed, 16 Oct 2024 19:39:40 +1100
David Gibson
On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:
On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:
On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:
On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:
On Wed, 9 Oct 2024 15:07:21 +0200 Stefano Brivio
wrote: [snip] > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > /* spliceable */ > > - /* Preserve the specific loopback adddress used, but let the > - * kernel pick a source port on the target side > + /* The traffic will go over the guest's 'lo' interface, but by > + * default use its external address, so we don't inadvertently > + * expose services that listen only on the guest's loopback > + * address. That can be overridden by --host-lo-to-ns-lo which > + * will instead forward to the loopback address in the guest. > + * > + * In either case, let the kernel pick the source address to > + * match. > */ > - tgt->oaddr = ini->eaddr; > + if (inany_v4(&ini->eaddr)) { > + if (c->host_lo_to_ns_lo) > + tgt->eaddr = inany_loopback4; > + else > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > + tgt->oaddr = inany_any4; > + } else { > + if (c->host_lo_to_ns_lo) > + tgt->eaddr = inany_loopback6; > + else > + tgt->eaddr.a6 = c->ip6.addr_seen;
Either this...
> + tgt->oaddr = inany_any6;
or this (and not something before this patch, up to 3/4) make the "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, sometimes (about one in three/four runs), that's what I mistakenly reported as coming from Laurent's series at:
Huh, interesting. Just got back from my leave and ran that group of tests in a loop this afternoon, but didn't manage to reproduce. I have administrivia that will probably fill the rest of this week, but I'll look into this as soon as I can.
I reproduced the problem on passt.top, and I have a partial idea what's going on. As you say it's seeming like the address (addr_seen == addr in this case) isn't properly ready. This is over splice, but on the tap interface, I see the container sending NS messages for its own address - seems like it's doing DAD. But more importantly, we're answering those NS messages with NA messages, because we answer all NS. i.e. we're making the DAD fail. What I'm not sure of is how this ever worked at all. --config-net makes sense, since we disable DAD, but our test suite has always been using NDP+DHCP instead of --config-net.
So, AFACT, we'll always fail guest DAD attempts, both IPv6, which happens most of the time and for IPv4 via ARP, which is used much more rarely. I think we need to be more selective in what NS or ARP lookups we resopnd to. The question is what approach to take:
Hmm... no.. there's more to this.
Usually DAD requests have :: as the source address, and we *do* exclude those from getting replies. In this case though, we're getting NS requests for the assigned address from what looks like the SLAAC address. So, I do think it would be wise to explicitly exclude these: we shouldn't be giving NA responses for an address that ought to belong to the guest, even if it doesn't look like a DAD.
But, I'm not sure what's triggering this. Is for some reason the DHCP address not "taking", so the container is trying to locate it on the network instead? Or _is_ this DAD, but under some circumstances rather than using :: as the source address it uses another configured address.
Ok.. I've understood a bit more. While timing is a factor here, it looks like the main reason I wasn't seeing it on my machine is what I'd consider a bug in the Debian version of the dhclient-script: when adding an IPv6 address, it returns without waiting for DAD to complete (i.e. for the address to be non-tentative).
Oops. On one hand, I would feel inclined to propose a fix for the Debian and Ubuntu packages. On the other hand, I wonder if it's universally considered a bug: the DHCPv6 client did its job at that point, and it's debatable whether dhclient should wait for the address to be usable before forking to background. That is, arguably, the job of dhclient's is to request and configure an address. It's not a network configuration daemon. There might be many other reasons why that address is unusable, and yet dhclient is not responsible for them. By the way, I guess it's just an issue for test scripts like this one.
There's also an additional bug, which doesn't cause this problem, I think, but caused some problems when I was investigating. DHCPv6 needs the link-local SLAAC address already configured and non-tentative. The Fedora dhclient-script waits for that too at the PREINIT6 stage, but the Debian one doesn't, meaning if you attempt dhclient -6 immediately after starting the namespace it will fail to bind the UDP address it needs.
Right, and that's fine for us because we have a 2-second delay after SLAAC. This looks to me a bit more like a real bug, but again, there might be many other reasons why dhclient can't use a link-local address. One might argue that it's the responsibility of the user/caller to invoke dhclient when appropriate. In that sense, you might be wondering why there's a 2-second delay after SLAAC, but no delay after invoking dhclient -6: the reason is that I was convinced that one wouldn't need DAD once a DHCPv6 client configures an address. The server is already checking that, I thought. Well, no. RFC 8415 18.2.10.1: https://datatracker.ietf.org/doc/html/rfc8415#section-18.2.10.1 says: If the client can operate with the addresses and/or prefixes obtained from the server: [...] - The client MUST perform duplicate address detection as per Section 5.4 of [RFC4862], which does list some exceptions, on each of the received addresses in any IAs on which it has not performed duplicate address detection during processing of any of the previous Reply messages from the server. The client performs the duplicate address detection before using the received addresses for any traffic. If any of the addresses are found to be in use on the link, the client sends a Decline message to the server for those addresses as described in Section 18.2.8.
I still think it's a good idea not to give NA messages for the guest assigned address, but we'll need a different workaround for this issue.
I read the rest of your reasoning about it, but the nice thing of the current behaviour (and that's why I added that single check on the source address in ndp()) is that the guest can really use whatever address it wants, regardless of what we tried to configure, and we'll resolve any other address. If we receive a neighbour solicitation for the guest assigned address, and the source address is not unspecified, it means that the guest is _not_ using the assigned address, and it's actually trying to reach it.
I guess we'll have to manually wait for DAD to complete in the DHCP tests, which will be kind of mucky :/
Alternatively, we could use the same trick as added by commit f4e9f26480ef ("pasta: Disable neighbour solicitations on device up to prevent DAD"): disable neighbour solicitations, run dhclient -6, set 'nodad' on the address, and re-enable neighbour solicitations. This works for me: -- diff --git a/test/pasta/dhcp b/test/pasta/dhcp index 41556b8..76aa723 100644 --- a/test/pasta/dhcp +++ b/test/pasta/dhcp @@ -34,9 +34,12 @@ nsout MTU ip -j link show | jq -rM '.[] | select(.ifname == "__IFNAME__").mtu' check [ __MTU__ = 65520 ] test DHCPv6: address +ns ip link set dev __IFNAME__ arp off ns /sbin/dhclient -6 --no-pid __IFNAME__ hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]' nsout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.prefixlen == 128).local] | .[0]' +ns ip addr change __ADDR6__ dev __IFNAME__ nodad +ns ip link set dev __IFNAME__ arp on hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]' check [ __ADDR6__ = __HOST_ADDR6__ ] -- Adding 2-second delays as we have them for NDP doesn't look that bad: $ grep --exclude-dir=demo -rn "dhclient -6" pasta/dhcp:37:ns /sbin/dhclient -6 --no-pid __IFNAME__ passt_in_ns/dhcp:54:guest /sbin/dhclient -6 __IFNAME__ passt/dhcp:51:guest /sbin/dhclient -6 __IFNAME__ perf/passt_tcp:117:guest dhclient -6 -x perf/passt_tcp:118:guest dhclient -6 __IFNAME__ two_guests/basic:40:guest1 /sbin/dhclient -6 __IFNAME1__ two_guests/basic:41:guest2 /sbin/dhclient -6 __IFNAME2__ given that we don't need it on dhclient -x, tests would take about 12 seconds longer. Or we could switch to the arp off / nodad / arp on approach for everything, including SLAAC: $ grep --exclude-dir=demo --exclude-dir=mbuto --exclude-dir=distro --exclude-dir=memory -rn "sleep[ $(printf '\t')].*2" pasta/ndp:21:sleep 2 passt_in_ns/icmp:29:ns ip addr add 2001:db8::1 dev __IFNAME_NS__ && sleep 2 # DAD passt/ndp:19:guest ip link set dev __IFNAME__ up && sleep 2 two_guests/basic:39:sleep 2 and save slightly less than 8 seconds. If you ask me, I would have a slight preference for the nodad approach. -- Stefano
On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote:
On Wed, 16 Oct 2024 19:39:40 +1100 David Gibson
wrote: On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:
On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:
On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:
On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:
On Wed, 9 Oct 2024 15:07:21 +0200 Stefano Brivio
wrote: [snip] > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > /* spliceable */ > > > > - /* Preserve the specific loopback adddress used, but let the > > - * kernel pick a source port on the target side > > + /* The traffic will go over the guest's 'lo' interface, but by > > + * default use its external address, so we don't inadvertently > > + * expose services that listen only on the guest's loopback > > + * address. That can be overridden by --host-lo-to-ns-lo which > > + * will instead forward to the loopback address in the guest. > > + * > > + * In either case, let the kernel pick the source address to > > + * match. > > */ > > - tgt->oaddr = ini->eaddr; > > + if (inany_v4(&ini->eaddr)) { > > + if (c->host_lo_to_ns_lo) > > + tgt->eaddr = inany_loopback4; > > + else > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > + tgt->oaddr = inany_any4; > > + } else { > > + if (c->host_lo_to_ns_lo) > > + tgt->eaddr = inany_loopback6; > > + else > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > Either this... > > > + tgt->oaddr = inany_any6; > > or this (and not something before this patch, up to 3/4) make the > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > sometimes (about one in three/four runs), that's what I mistakenly > reported as coming from Laurent's series at: Huh, interesting. Just got back from my leave and ran that group of tests in a loop this afternoon, but didn't manage to reproduce. I have administrivia that will probably fill the rest of this week, but I'll look into this as soon as I can.
I reproduced the problem on passt.top, and I have a partial idea what's going on. As you say it's seeming like the address (addr_seen == addr in this case) isn't properly ready. This is over splice, but on the tap interface, I see the container sending NS messages for its own address - seems like it's doing DAD. But more importantly, we're answering those NS messages with NA messages, because we answer all NS. i.e. we're making the DAD fail. What I'm not sure of is how this ever worked at all. --config-net makes sense, since we disable DAD, but our test suite has always been using NDP+DHCP instead of --config-net.
So, AFACT, we'll always fail guest DAD attempts, both IPv6, which happens most of the time and for IPv4 via ARP, which is used much more rarely. I think we need to be more selective in what NS or ARP lookups we resopnd to. The question is what approach to take:
Hmm... no.. there's more to this.
Usually DAD requests have :: as the source address, and we *do* exclude those from getting replies. In this case though, we're getting NS requests for the assigned address from what looks like the SLAAC address. So, I do think it would be wise to explicitly exclude these: we shouldn't be giving NA responses for an address that ought to belong to the guest, even if it doesn't look like a DAD.
But, I'm not sure what's triggering this. Is for some reason the DHCP address not "taking", so the container is trying to locate it on the network instead? Or _is_ this DAD, but under some circumstances rather than using :: as the source address it uses another configured address.
Ok.. I've understood a bit more. While timing is a factor here, it looks like the main reason I wasn't seeing it on my machine is what I'd consider a bug in the Debian version of the dhclient-script: when adding an IPv6 address, it returns without waiting for DAD to complete (i.e. for the address to be non-tentative).
Oops. On one hand, I would feel inclined to propose a fix for the Debian and Ubuntu packages. On the other hand, I wonder if it's universally considered a bug: the DHCPv6 client did its job at that point, and it's debatable whether dhclient should wait for the address to be usable before forking to background.
That is, arguably, the job of dhclient's is to request and configure an address. It's not a network configuration daemon. There might be many other reasons why that address is unusable, and yet dhclient is not responsible for them.
Hrm... I guess. Counterpoints.. - Most other failures to get a usable address will result in a visible error - dhclient has a --dad-wait-time option which seems to imply that the script should wait for DAD - The upstream script version waits for DAD In any case I filed a report for it https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231
By the way, I guess it's just an issue for test scripts like this one.
Why do you guess that?
There's also an additional bug, which doesn't cause this problem, I think, but caused some problems when I was investigating. DHCPv6 needs the link-local SLAAC address already configured and non-tentative. The Fedora dhclient-script waits for that too at the PREINIT6 stage, but the Debian one doesn't, meaning if you attempt dhclient -6 immediately after starting the namespace it will fail to bind the UDP address it needs.
Right, and that's fine for us because we have a 2-second delay after SLAAC. This looks to me a bit more like a real bug, but again, there might be many other reasons why dhclient can't use a link-local address. One might argue that it's the responsibility of the user/caller to invoke dhclient when appropriate.
Here I think it's a much clearer argument that it's a real bug. We play fast and loose with it for mbuto, but dhclient can typically be called on an interface that isn't even up: the PREINIT/PREINIT6 script actions are specifically for this, they'll bring the interface up ready for the client to do its thing. I'd say the script is failing to do its job for PREINIT6 if there isn't a usable link-local address at the end. I filed a report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085229
In that sense, you might be wondering why there's a 2-second delay after SLAAC, but no delay after invoking dhclient -6: the reason is that I was convinced that one wouldn't need DAD once a DHCPv6 client configures an address. The server is already checking that, I thought.
Well, no. RFC 8415 18.2.10.1:
https://datatracker.ietf.org/doc/html/rfc8415#section-18.2.10.1
says:
If the client can operate with the addresses and/or prefixes obtained from the server:
[...]
- The client MUST perform duplicate address detection as per Section 5.4 of [RFC4862], which does list some exceptions, on each of the received addresses in any IAs on which it has not performed duplicate address detection during processing of any of the previous Reply messages from the server. The client performs the duplicate address detection before using the received addresses for any traffic. If any of the addresses are found to be in use on the link, the client sends a Decline message to the server for those addresses as described in Section 18.2.8.
Indeed.
I still think it's a good idea not to give NA messages for the guest assigned address, but we'll need a different workaround for this issue.
I read the rest of your reasoning about it, but the nice thing of the current behaviour (and that's why I added that single check on the source address in ndp()) is that the guest can really use whatever address it wants, regardless of what we tried to configure, and we'll resolve any other address.
Hrm. I suppose. Fwiw we already make the equivalent exclusion for ARP
If we receive a neighbour solicitation for the guest assigned address, and the source address is not unspecified, it means that the guest is _not_ using the assigned address, and it's actually trying to reach it.
I guess we'll have to manually wait for DAD to complete in the DHCP tests, which will be kind of mucky :/
Alternatively, we could use the same trick as added by commit f4e9f26480ef ("pasta: Disable neighbour solicitations on device up to prevent DAD"): disable neighbour solicitations, run dhclient -6, set 'nodad' on the address, and re-enable neighbour solicitations.
This works for me:
Ok. More complex, but faster, I guess. I'll try implementing this. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, 17 Oct 2024 12:19:58 +1100
David Gibson
On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote:
On Wed, 16 Oct 2024 19:39:40 +1100 David Gibson
wrote: On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:
On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:
On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:
On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > On Wed, 9 Oct 2024 15:07:21 +0200 > Stefano Brivio
wrote: [snip] > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > > /* spliceable */ > > > > > > - /* Preserve the specific loopback adddress used, but let the > > > - * kernel pick a source port on the target side > > > + /* The traffic will go over the guest's 'lo' interface, but by > > > + * default use its external address, so we don't inadvertently > > > + * expose services that listen only on the guest's loopback > > > + * address. That can be overridden by --host-lo-to-ns-lo which > > > + * will instead forward to the loopback address in the guest. > > > + * > > > + * In either case, let the kernel pick the source address to > > > + * match. > > > */ > > > - tgt->oaddr = ini->eaddr; > > > + if (inany_v4(&ini->eaddr)) { > > > + if (c->host_lo_to_ns_lo) > > > + tgt->eaddr = inany_loopback4; > > > + else > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > > + tgt->oaddr = inany_any4; > > > + } else { > > > + if (c->host_lo_to_ns_lo) > > > + tgt->eaddr = inany_loopback6; > > > + else > > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > > > Either this... > > > > > + tgt->oaddr = inany_any6; > > > > or this (and not something before this patch, up to 3/4) make the > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > > sometimes (about one in three/four runs), that's what I mistakenly > > reported as coming from Laurent's series at: Huh, interesting. Just got back from my leave and ran that group of tests in a loop this afternoon, but didn't manage to reproduce. I have administrivia that will probably fill the rest of this week, but I'll look into this as soon as I can.
I reproduced the problem on passt.top, and I have a partial idea what's going on. As you say it's seeming like the address (addr_seen == addr in this case) isn't properly ready. This is over splice, but on the tap interface, I see the container sending NS messages for its own address - seems like it's doing DAD. But more importantly, we're answering those NS messages with NA messages, because we answer all NS. i.e. we're making the DAD fail. What I'm not sure of is how this ever worked at all. --config-net makes sense, since we disable DAD, but our test suite has always been using NDP+DHCP instead of --config-net.
So, AFACT, we'll always fail guest DAD attempts, both IPv6, which happens most of the time and for IPv4 via ARP, which is used much more rarely. I think we need to be more selective in what NS or ARP lookups we resopnd to. The question is what approach to take:
Hmm... no.. there's more to this.
Usually DAD requests have :: as the source address, and we *do* exclude those from getting replies. In this case though, we're getting NS requests for the assigned address from what looks like the SLAAC address. So, I do think it would be wise to explicitly exclude these: we shouldn't be giving NA responses for an address that ought to belong to the guest, even if it doesn't look like a DAD.
But, I'm not sure what's triggering this. Is for some reason the DHCP address not "taking", so the container is trying to locate it on the network instead? Or _is_ this DAD, but under some circumstances rather than using :: as the source address it uses another configured address.
Ok.. I've understood a bit more. While timing is a factor here, it looks like the main reason I wasn't seeing it on my machine is what I'd consider a bug in the Debian version of the dhclient-script: when adding an IPv6 address, it returns without waiting for DAD to complete (i.e. for the address to be non-tentative).
Oops. On one hand, I would feel inclined to propose a fix for the Debian and Ubuntu packages. On the other hand, I wonder if it's universally considered a bug: the DHCPv6 client did its job at that point, and it's debatable whether dhclient should wait for the address to be usable before forking to background.
That is, arguably, the job of dhclient's is to request and configure an address. It's not a network configuration daemon. There might be many other reasons why that address is unusable, and yet dhclient is not responsible for them.
Hrm... I guess. Counterpoints.. - Most other failures to get a usable address will result in a visible error - dhclient has a --dad-wait-time option which seems to imply that the script should wait for DAD - The upstream script version waits for DAD
In any case I filed a report for it https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231
By the way, I guess it's just an issue for test scripts like this one.
Why do you guess that?
Because it's kind of rare that your address changes if you use DHCPv6, I guess, so this would be relevant almost exclusively at boot. And, at boot, if a remote peer/client happens to try to connect to the machine where the client is running right after an address was assigned, it must have a retry mechanism almost for sure. -- Stefano
On Thu, Oct 17, 2024 at 10:31:22AM +0200, Stefano Brivio wrote:
On Thu, 17 Oct 2024 12:19:58 +1100 David Gibson
wrote: On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote:
On Wed, 16 Oct 2024 19:39:40 +1100 David Gibson
wrote: On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:
On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:
On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote: > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > > On Wed, 9 Oct 2024 15:07:21 +0200 > > Stefano Brivio
wrote: [snip] > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > > > /* spliceable */ > > > > > > > > - /* Preserve the specific loopback adddress used, but let the > > > > - * kernel pick a source port on the target side > > > > + /* The traffic will go over the guest's 'lo' interface, but by > > > > + * default use its external address, so we don't inadvertently > > > > + * expose services that listen only on the guest's loopback > > > > + * address. That can be overridden by --host-lo-to-ns-lo which > > > > + * will instead forward to the loopback address in the guest. > > > > + * > > > > + * In either case, let the kernel pick the source address to > > > > + * match. > > > > */ > > > > - tgt->oaddr = ini->eaddr; > > > > + if (inany_v4(&ini->eaddr)) { > > > > + if (c->host_lo_to_ns_lo) > > > > + tgt->eaddr = inany_loopback4; > > > > + else > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > > > + tgt->oaddr = inany_any4; > > > > + } else { > > > > + if (c->host_lo_to_ns_lo) > > > > + tgt->eaddr = inany_loopback6; > > > > + else > > > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > > > > > Either this... > > > > > > > + tgt->oaddr = inany_any6; > > > > > > or this (and not something before this patch, up to 3/4) make the > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > > > sometimes (about one in three/four runs), that's what I mistakenly > > > reported as coming from Laurent's series at: > > Huh, interesting. Just got back from my leave and ran that group of > tests in a loop this afternoon, but didn't manage to reproduce. I > have administrivia that will probably fill the rest of this week, but > I'll look into this as soon as I can. I reproduced the problem on passt.top, and I have a partial idea what's going on. As you say it's seeming like the address (addr_seen == addr in this case) isn't properly ready. This is over splice, but on the tap interface, I see the container sending NS messages for its own address - seems like it's doing DAD. But more importantly, we're answering those NS messages with NA messages, because we answer all NS. i.e. we're making the DAD fail. What I'm not sure of is how this ever worked at all. --config-net makes sense, since we disable DAD, but our test suite has always been using NDP+DHCP instead of --config-net.
So, AFACT, we'll always fail guest DAD attempts, both IPv6, which happens most of the time and for IPv4 via ARP, which is used much more rarely. I think we need to be more selective in what NS or ARP lookups we resopnd to. The question is what approach to take:
Hmm... no.. there's more to this.
Usually DAD requests have :: as the source address, and we *do* exclude those from getting replies. In this case though, we're getting NS requests for the assigned address from what looks like the SLAAC address. So, I do think it would be wise to explicitly exclude these: we shouldn't be giving NA responses for an address that ought to belong to the guest, even if it doesn't look like a DAD.
But, I'm not sure what's triggering this. Is for some reason the DHCP address not "taking", so the container is trying to locate it on the network instead? Or _is_ this DAD, but under some circumstances rather than using :: as the source address it uses another configured address.
Ok.. I've understood a bit more. While timing is a factor here, it looks like the main reason I wasn't seeing it on my machine is what I'd consider a bug in the Debian version of the dhclient-script: when adding an IPv6 address, it returns without waiting for DAD to complete (i.e. for the address to be non-tentative).
Oops. On one hand, I would feel inclined to propose a fix for the Debian and Ubuntu packages. On the other hand, I wonder if it's universally considered a bug: the DHCPv6 client did its job at that point, and it's debatable whether dhclient should wait for the address to be usable before forking to background.
That is, arguably, the job of dhclient's is to request and configure an address. It's not a network configuration daemon. There might be many other reasons why that address is unusable, and yet dhclient is not responsible for them.
Hrm... I guess. Counterpoints.. - Most other failures to get a usable address will result in a visible error - dhclient has a --dad-wait-time option which seems to imply that the script should wait for DAD - The upstream script version waits for DAD
In any case I filed a report for it https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231
By the way, I guess it's just an issue for test scripts like this one.
Why do you guess that?
Because it's kind of rare that your address changes if you use DHCPv6, I guess, so this would be relevant almost exclusively at boot.
Hm, true. I was thinking of ephemeral containers where "boot" could be quite common.. but those will most likely use --config-net.
And, at boot, if a remote peer/client happens to try to connect to the machine where the client is running right after an address was assigned, it must have a retry mechanism almost for sure.
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote: [snip]
Adding 2-second delays as we have them for NDP doesn't look that bad:
$ grep --exclude-dir=demo -rn "dhclient -6" pasta/dhcp:37:ns /sbin/dhclient -6 --no-pid __IFNAME__ passt_in_ns/dhcp:54:guest /sbin/dhclient -6 __IFNAME__ passt/dhcp:51:guest /sbin/dhclient -6 __IFNAME__ perf/passt_tcp:117:guest dhclient -6 -x perf/passt_tcp:118:guest dhclient -6 __IFNAME__ two_guests/basic:40:guest1 /sbin/dhclient -6 __IFNAME1__ two_guests/basic:41:guest2 /sbin/dhclient -6 __IFNAME2__
given that we don't need it on dhclient -x, tests would take about 12 seconds longer.
Or we could switch to the arp off / nodad / arp on approach for everything, including SLAAC:
$ grep --exclude-dir=demo --exclude-dir=mbuto --exclude-dir=distro --exclude-dir=memory -rn "sleep[ $(printf '\t')].*2" pasta/ndp:21:sleep 2 passt_in_ns/icmp:29:ns ip addr add 2001:db8::1 dev __IFNAME_NS__ && sleep 2 # DAD passt/ndp:19:guest ip link set dev __IFNAME__ up && sleep 2 two_guests/basic:39:sleep 2
and save slightly less than 8 seconds. If you ask me, I would have a slight preference for the nodad approach.
Much as I don't like making the tests slower, I don't think the nodad approach is a good idea. It's fine if we think of the NDP/DHCP only as setting things up for the transfer tests. But they also serve the purpose of testing the NDP and DHCP paths themselves. For that purpose we should perform a full "normal" configuration, including DAD. [e.g. if we really did manage to break our NS responses so that we totally broke DAD, we'd want our tests to catch that]. I have a draft implementation of explicitly waiting for DAD to complete both for SLAAC and for DHCPv6, patches coming shortly. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
participants (2)
-
David Gibson
-
Stefano Brivio