podman issue #24045 pointed out that pasta's spliced forwarding logic can expose services within the namespace bound only to 127.0.0.1 or ::1 to the host. However, the namespace probably expects those to only be accessible to itself, so that's probably not what we want. This changes our forwarding logic so as not to do this. Note that the podman tests will currently fail with this series, I've submitted podman PR https://github.com/containers/podman/pull/24064 to fix that. Link: https://github.com/containers/podman/issues/24045 Changes since v2: * Add new field do structure comment Changes since v1: * Add --host-lo-to-ns-lo option to preserve the old behaviour * Clarify the new behaviour in the man page * Add some extra patches making some other corrections to the man page David Gibson (4): passt.1: Mark --stderr as deprecated more prominently passt.1: Clarify and update "Handling of local addresses" section test: Clarify test for spliced inbound transfers fwd: Direct inbound spliced forwards to the guest's external address conf.c | 9 ++++++ fwd.c | 31 +++++++++++++----- passt.1 | 75 +++++++++++++++++++++++++++----------------- passt.h | 2 ++ test/passt_in_ns/tcp | 8 ++--- test/passt_in_ns/udp | 4 +-- test/pasta/tcp | 16 +++++----- test/pasta/udp | 8 ++--- 8 files changed, 98 insertions(+), 55 deletions(-) -- 2.46.2
The description of this option says that it's deprecated, but unlike --no-copy-addrs and --no-copy-routes it doesn't have a clear label. Add one to make it easier to spot. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- passt.1 | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/passt.1 b/passt.1 index 79d134d..acc1f92 100644 --- a/passt.1 +++ b/passt.1 @@ -95,7 +95,7 @@ detached PID namespace after starting, because the PID itself cannot change. Default is to fork into background. .TP -.BR \-e ", " \-\-stderr +.BR \-e ", " \-\-stderr " " (DEPRECATED) This option has no effect, and is maintained for compatibility purposes only. Note that this configuration option is \fBdeprecated\fR and will be removed in a -- 2.46.2
This section didn't mention the effect of the --map-host-loopback option which now alters this behaviour. Update it accordingly. It used "local addresses" to mean specifically 127.0.0.0/8 and ::1. However, "local" could also refer to link-local addresses or to addresses of any scope which happen to be configured on the host. Use "loopback address" to be more precise about this. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- passt.1 | 54 ++++++++++++++++++++++++++++-------------------------- 1 file changed, 28 insertions(+), 26 deletions(-) diff --git a/passt.1 b/passt.1 index acc1f92..11104e1 100644 --- a/passt.1 +++ b/passt.1 @@ -863,38 +863,40 @@ root@localhost's password: .SH NOTES -.SS Handling of traffic with local destination and source addresses - -Both \fBpasst\fR and \fBpasta\fR can bind on ports with a local address, -depending on the configuration. Local destination or source addresses need to be -changed before packets are delivered to the guest or target namespace: most -operating systems would drop packets received from non-loopback interfaces with -local addresses, and it would also be impossible for guest or target namespace -to route answers back. - -For convenience, and somewhat arbitrarily, the source address on these packets -is translated to the address of the default IPv4 or IPv6 gateway (if any) -- -this is known to be an existing, valid address on the same subnet. - -Loopback destination addresses are instead translated to the observed external -address of the guest or target namespace. For IPv6 packets, if usage of a -link-local address by guest or namespace has ever been observed, and the -original destination address is also a link-local address, the observed -link-local address is used. Otherwise, the observed global address is used. For -both IPv4 and IPv6, if no addresses have been seen yet, the configured addresses -will be used instead. +.SS Handling of traffic with loopback destination and source addresses + +Both \fBpasst\fR and \fBpasta\fR can bind on ports with a loopback +address (127.0.0.0/8 or ::1), depending on the configuration. Loopback +destination or source addresses need to be changed before packets are +delivered to the guest or target namespace: most operating systems +would drop packets received with loopback addresses on non-loopback +interfaces, and it would also be impossible for guest or target +namespace to route answers back. + +For convenience, the source address on these packets is translated to +the address specified by the \fB\-\-map-host-loopback\fR option. If +not specified this defaults, somewhat arbitrarily, to the address of +default IPv4 or IPv6 gateway (if any) -- this is known to be an +existing, valid address on the same subnet. If \fB\-\-no-map-gw\fR or +\fB\-\-map-host-loopback none\fR are specified this translation is +disabled and packets with loopback addresses are simply dropped. + +Loopback destination addresses are translated to the observed external +address of the guest or target namespace. For IPv6, the observed +link-local address is used if the translated source address is +link-local, otherwise the observed global address is used. For both +IPv4 and IPv6, if no addresses have been seen yet, the configured +addresses will be used instead. For example, if \fBpasst\fR or \fBpasta\fR receive a connection from 127.0.0.1, with destination 127.0.0.10, and the default IPv4 gateway is 192.0.2.1, while the last observed source address from guest or namespace is 192.0.2.2, this will be translated to a connection from 192.0.2.1 to 192.0.2.2. -Similarly, for traffic coming from guest or namespace, packets with destination -address corresponding to the default gateway will have their destination address -translated to a loopback address, if and only if a packet, in the opposite -direction, with a loopback destination or source address, port-wise matching for -UDP, or connection-wise for TCP, has been recently forwarded to guest or -namespace. This behaviour can be disabled with \-\-no\-map\-gw. +Similarly, for traffic coming from guest or namespace, packets with +destination address corresponding to the \fB\-\-map-host-loopback\fR +address will have their destination address translated to a loopback +address. .SS Handling of local traffic in pasta -- 2.46.2
The tests in pasta/tcp and pasta/udp for inbound transfers have the server listening within the namespace explicitly bound to 127.0.0.1 or ::1. This only works because of the behaviour of inbound splice connections, which always appear with both source and destination addresses as loopback in the namespace. That's not an inherent property for "spliced" connections and arguably an undesirable one. Also update the test names to make it clearer that these tests are expecting to exercise the "splice" path. Interestingly this was already correct for the equivalent passt_in_ns/*, although we also update the test names for clarity there. Note that there are similar issues in some of the podman tests, addressed in https://github.com/containers/podman/pull/24064 Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- test/passt_in_ns/tcp | 8 ++++---- test/passt_in_ns/udp | 4 ++-- test/pasta/tcp | 16 ++++++++-------- test/pasta/udp | 8 ++++---- 4 files changed, 18 insertions(+), 18 deletions(-) diff --git a/test/passt_in_ns/tcp b/test/passt_in_ns/tcp index aaf340e..319880b 100644 --- a/test/passt_in_ns/tcp +++ b/test/passt_in_ns/tcp @@ -32,7 +32,7 @@ host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10001 guestw guest cmp test_big.bin /root/big.bin -test TCP/IPv4: host to ns: big transfer +test TCP/IPv4: host to ns (spliced): big transfer nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10002 @@ -90,7 +90,7 @@ host socat -u OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10001 guestw guest cmp test_small.bin /root/small.bin -test TCP/IPv4: host to ns: small transfer +test TCP/IPv4: host to ns (spliced): small transfer nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10002 @@ -146,7 +146,7 @@ host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10001 guestw guest cmp test_big.bin /root/big.bin -test TCP/IPv6: host to ns: big transfer +test TCP/IPv6: host to ns (spliced): big transfer nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10002 @@ -204,7 +204,7 @@ host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10001 guestw guest cmp test_small.bin /root/small.bin -test TCP/IPv6: host to ns: small transfer +test TCP/IPv6: host to ns (spliced): small transfer nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10002 diff --git a/test/passt_in_ns/udp b/test/passt_in_ns/udp index 3426ab9..791511c 100644 --- a/test/passt_in_ns/udp +++ b/test/passt_in_ns/udp @@ -30,7 +30,7 @@ host socat -u OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10001,shut-null guestw guest cmp test.bin /root/medium.bin -test UDP/IPv4: host to ns +test UDP/IPv4: host to ns (recvmmsg/sendmmsg) nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10002,shut-null @@ -88,7 +88,7 @@ host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10001,shut-null guestw guest cmp test.bin /root/medium.bin -test UDP/IPv6: host to ns +test UDP/IPv6: host to ns (recvmmsg/sendmmsg) nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10002,shut-null diff --git a/test/pasta/tcp b/test/pasta/tcp index 6ab18c5..53b6f25 100644 --- a/test/pasta/tcp +++ b/test/pasta/tcp @@ -19,8 +19,8 @@ set TEMP_NS_BIG __STATEDIR__/test_ns_big.bin set TEMP_SMALL __STATEDIR__/test_small.bin set TEMP_NS_SMALL __STATEDIR__/test_ns_small.bin -test TCP/IPv4: host to ns: big transfer -nsb socat -u TCP4-LISTEN:10002,bind=127.0.0.1 OPEN:__TEMP_NS_BIG__,create,trunc +test TCP/IPv4: host to ns (spliced): big transfer +nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10002 nsw check cmp __BASEPATH__/big.bin __TEMP_NS_BIG__ @@ -38,8 +38,8 @@ ns socat -u OPEN:__BASEPATH__/big.bin TCP4:__GW__:10003 hostw check cmp __BASEPATH__/big.bin __TEMP_BIG__ -test TCP/IPv4: host to ns: small transfer -nsb socat -u TCP4-LISTEN:10002,bind=127.0.0.1 OPEN:__TEMP_NS_SMALL__,create,trunc +test TCP/IPv4: host to ns (spliced): small transfer +nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc host socat OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10002 nsw check cmp __BASEPATH__/small.bin __TEMP_NS_SMALL__ @@ -57,8 +57,8 @@ ns socat -u OPEN:__BASEPATH__/small.bin TCP4:__GW__:10003 hostw check cmp __BASEPATH__/small.bin __TEMP_SMALL__ -test TCP/IPv6: host to ns: big transfer -nsb socat -u TCP6-LISTEN:10002,bind=[::1] OPEN:__TEMP_NS_BIG__,create,trunc +test TCP/IPv6: host to ns (spliced): big transfer +nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10002 nsw check cmp __BASEPATH__/big.bin __TEMP_NS_BIG__ @@ -77,8 +77,8 @@ ns socat -u OPEN:__BASEPATH__/big.bin TCP6:[__GW6__%__IFNAME__]:10003 hostw check cmp __BASEPATH__/big.bin __TEMP_BIG__ -test TCP/IPv6: host to ns: small transfer -nsb socat -u TCP6-LISTEN:10002,bind=[::1] OPEN:__TEMP_NS_SMALL__,create,trunc +test TCP/IPv6: host to ns (spliced): small transfer +nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10002 nsw check cmp __BASEPATH__/small.bin __TEMP_NS_SMALL__ diff --git a/test/pasta/udp b/test/pasta/udp index 30e3a85..7734d02 100644 --- a/test/pasta/udp +++ b/test/pasta/udp @@ -17,8 +17,8 @@ htools dd socat ip jq set TEMP __STATEDIR__/test.bin set TEMP_NS __STATEDIR__/test_ns.bin -test UDP/IPv4: host to ns -nsb socat -u UDP4-LISTEN:10002,bind=127.0.0.1,null-eof OPEN:__TEMP_NS__,create,trunc +test UDP/IPv4: host to ns (recvmmsg/sendmmsg) +nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc host socat OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10002,shut-null nsw check cmp __BASEPATH__/medium.bin __TEMP_NS__ @@ -37,8 +37,8 @@ ns socat -u OPEN:__BASEPATH__/medium.bin UDP4:__GW__:10003,shut-null hostw check cmp __BASEPATH__/medium.bin __TEMP__ -test UDP/IPv6: host to ns -nsb socat -u UDP6-LISTEN:10002,bind=[::1],null-eof OPEN:__TEMP_NS__,create,trunc +test UDP/IPv6: host to ns (recvmmsg/sendmmsg) +nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10002,shut-null nsw check cmp __BASEPATH__/medium.bin __TEMP_NS__ -- 2.46.2
In pasta mode, where addressing permits we "splice" connections, forwarding directly from host socket to guest/container socket without any L2 or L3 processing. This gives us a very large performance improvement when it's possible. Since the traffic is from a local socket within the guest, it will go over the guest's 'lo' interface, and accordingly we set the guest side address to be the loopback address. However this has a surprising side effect: sometimes guests will run services that are only supposed to be used within the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's forwarding exposes those services to the host, which isn't generally what we want. Correct this by instead forwarding inbound "splice" flows to the guest's external address. Link: https://github.com/containers/podman/issues/24045 Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- conf.c | 9 +++++++++ fwd.c | 31 +++++++++++++++++++++++-------- passt.1 | 23 +++++++++++++++++++---- passt.h | 2 ++ 4 files changed, 53 insertions(+), 12 deletions(-) diff --git a/conf.c b/conf.c index 6e62510..b5318f3 100644 --- a/conf.c +++ b/conf.c @@ -908,6 +908,9 @@ pasta_opts: " -U, --udp-ns SPEC UDP port forwarding to init namespace\n" " SPEC is as described above\n" " default: auto\n" + " --host-lo-to-ns-lo DEPRECATED:\n" + " Translate host-loopback forwards to\n" + " namespace loopback\n" " --userns NSPATH Target user namespace to join\n" " --netns PATH|NAME Target network namespace to join\n" " --netns-only Don't join existing user namespace\n" @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv) {"netns-only", no_argument, NULL, 20 }, {"map-host-loopback", required_argument, NULL, 21 }, {"map-guest-addr", required_argument, NULL, 22 }, + {"host-lo-to-ns-lo", no_argument, NULL, 23 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv) conf_nat(optarg, &c->ip4.map_guest_addr, &c->ip6.map_guest_addr, NULL); break; + case 23: + if (c->mode != MODE_PASTA) + die("--host-lo-to-ns-lo is for pasta mode only"); + c->host_lo_to_ns_lo = 1; + break; case 'd': c->debug = 1; c->quiet = 0; diff --git a/fwd.c b/fwd.c index a505098..c71f5e1 100644 --- a/fwd.c +++ b/fwd.c @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */ - /* Preserve the specific loopback adddress used, but let the - * kernel pick a source port on the target side + /* The traffic will go over the guest's 'lo' interface, but by + * default use its external address, so we don't inadvertently + * expose services that listen only on the guest's loopback + * address. That can be overridden by --host-lo-to-ns-lo which + * will instead forward to the loopback address in the guest. + * + * In either case, let the kernel pick the source address to + * match. */ - tgt->oaddr = ini->eaddr; + if (inany_v4(&ini->eaddr)) { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + tgt->oaddr = inany_any4; + } else { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback6; + else + tgt->eaddr.a6 = c->ip6.addr_seen; + tgt->oaddr = inany_any6; + } + + /* Let the kernel pick source port */ tgt->oport = 0; if (proto == IPPROTO_UDP) /* But for UDP preserve the source port */ tgt->oport = ini->eport; - if (inany_v4(&ini->eaddr)) - tgt->eaddr = inany_loopback4; - else - tgt->eaddr = inany_loopback6; - return PIF_SPLICE; } diff --git a/passt.1 b/passt.1 index 11104e1..332384c 100644 --- a/passt.1 +++ b/passt.1 @@ -586,6 +586,13 @@ Configure UDP port forwarding from target namespace to init namespace. Default is \fBauto\fR. +.TP +.BR \-\-host-lo-to-ns-lo " " (DEPRECATED) +If specified, connections forwarded with \fB\-t\fR and \fB\-u\fR from +the host's loopback address will appear on the loopback address in the +guest as well. Without this option such forwarded packets will appear +to come from the guest's public address. + .TP .BR \-\-userns " " \fIspec Target user namespace to join, as a path. If PID is given, without this option, @@ -874,8 +881,9 @@ interfaces, and it would also be impossible for guest or target namespace to route answers back. For convenience, the source address on these packets is translated to -the address specified by the \fB\-\-map-host-loopback\fR option. If -not specified this defaults, somewhat arbitrarily, to the address of +the address specified by the \fB\-\-map-host-loopback\fR option (with +some exceptions in pasta mode, see next section below). If not +specified this defaults, somewhat arbitrarily, to the address of default IPv4 or IPv6 gateway (if any) -- this is known to be an existing, valid address on the same subnet. If \fB\-\-no-map-gw\fR or \fB\-\-map-host-loopback none\fR are specified this translation is @@ -912,8 +920,15 @@ and the new socket using the \fBsplice\fR(2) system call, and for UDP, a pair of \fBrecvmmsg\fR(2) and \fBsendmmsg\fR(2) system calls deals with packet transfers. -This bypass only applies to local connections and traffic, because it's not -possible to bind sockets to foreign addresses. +Because it's not possible to bind sockets to foreign addresses, this +bypass only applies to local connections and traffic. It also means +that the address translation differs slightly from passt mode. +Connections from loopback to loopback on the host will appear to come +from the target namespace's public address within the guest, unless +\fB\-\-host-lo-to-ns-lo\fR is specified, in which case they will +appear to come from loopback in the namespace as well. The latter +behaviour used to be the default, but is usually undesirable, since it +can unintentionally expose namespace local services to the host. .SS Binding to low numbered ports (well-known or system ports, up to 1023) diff --git a/passt.h b/passt.h index 031c9b6..f7b7a58 100644 --- a/passt.h +++ b/passt.h @@ -225,6 +225,7 @@ struct ip6_ctx { * @no_dhcpv6: Disable DHCPv6 server * @no_ndp: Disable NDP handler altogether * @no_ra: Disable router advertisements + * @host_lo_to_ns_lo: Map host loopback addresses to ns loopback addresses * @low_wmem: Low probed net.core.wmem_max * @low_rmem: Low probed net.core.rmem_max */ @@ -284,6 +285,7 @@ struct ctx { int no_dhcpv6; int no_ndp; int no_ra; + int host_lo_to_ns_lo; int low_wmem; int low_rmem; -- 2.46.2
On Wed, 2 Oct 2024 15:48:26 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:In pasta mode, where addressing permits we "splice" connections, forwarding directly from host socket to guest/container socket without any L2 or L3 processing. This gives us a very large performance improvement when it's possible. Since the traffic is from a local socket within the guest, it will go over the guest's 'lo' interface, and accordingly we set the guest side address to be the loopback address. However this has a surprising side effect: sometimes guests will run services that are only supposed to be used within the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's forwarding exposes those services to the host, which isn't generally what we want. Correct this by instead forwarding inbound "splice" flows to the guest's external address. Link: https://github.com/containers/podman/issues/24045 Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- conf.c | 9 +++++++++ fwd.c | 31 +++++++++++++++++++++++-------- passt.1 | 23 +++++++++++++++++++---- passt.h | 2 ++ 4 files changed, 53 insertions(+), 12 deletions(-) diff --git a/conf.c b/conf.c index 6e62510..b5318f3 100644 --- a/conf.c +++ b/conf.c @@ -908,6 +908,9 @@ pasta_opts: " -U, --udp-ns SPEC UDP port forwarding to init namespace\n" " SPEC is as described above\n" " default: auto\n" + " --host-lo-to-ns-lo DEPRECATED:\n" + " Translate host-loopback forwards to\n" + " namespace loopback\n" " --userns NSPATH Target user namespace to join\n" " --netns PATH|NAME Target network namespace to join\n" " --netns-only Don't join existing user namespace\n" @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv) {"netns-only", no_argument, NULL, 20 }, {"map-host-loopback", required_argument, NULL, 21 }, {"map-guest-addr", required_argument, NULL, 22 }, + {"host-lo-to-ns-lo", no_argument, NULL, 23 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv) conf_nat(optarg, &c->ip4.map_guest_addr, &c->ip6.map_guest_addr, NULL); break; + case 23: + if (c->mode != MODE_PASTA) + die("--host-lo-to-ns-lo is for pasta mode only"); + c->host_lo_to_ns_lo = 1; + break; case 'd': c->debug = 1; c->quiet = 0; diff --git a/fwd.c b/fwd.c index a505098..c71f5e1 100644 --- a/fwd.c +++ b/fwd.c @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */ - /* Preserve the specific loopback adddress used, but let the - * kernel pick a source port on the target side + /* The traffic will go over the guest's 'lo' interface, but by + * default use its external address, so we don't inadvertently + * expose services that listen only on the guest's loopback + * address. That can be overridden by --host-lo-to-ns-lo which + * will instead forward to the loopback address in the guest. + * + * In either case, let the kernel pick the source address to + * match. */ - tgt->oaddr = ini->eaddr; + if (inany_v4(&ini->eaddr)) { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + tgt->oaddr = inany_any4; + } else { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback6; + else + tgt->eaddr.a6 = c->ip6.addr_seen;Either this...+ tgt->oaddr = inany_any6;or this (and not something before this patch, up to 3/4) make the "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, sometimes (about one in three/four runs), that's what I mistakenly reported as coming from Laurent's series at: https://archives.passt.top/passt-dev/20241002163238.1778ed19@elisabeth/ It hangs like this (display with >= 240 columns): ns$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │...passed. 88.198.0.164 │ ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): big transfer 88.198.0.161 │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin ns$ ip -j link show | jq -rM '.[] | select(.ifname == "enp9s0").mtu' │...passed. 65520 │ ns$ /sbin/dhclient -6 --no-pid enp9s0 │Starting test: TCP/IPv4: ns to host (via tap): big transfer ns$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.prefixlen == 128).local] | .[0]' │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin 2a01:4f8:222:904::2 │...passed. ns$ ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' │ fe80::1 │Starting test: TCP/IPv4: host to ns (spliced): small transfer ns$ which socat ip jq >/dev/null │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc │...passed. ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10003 │ ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): small transfer 88.198.0.161 │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:88.198.0.161:10003 │...passed. ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin,create,trunc │ ns$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10003 │Starting test: TCP/IPv4: ns to host (via tap): small transfer ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin 88.198.0.161 │...passed. ns$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP4:88.198.0.161:10003 │ ns$ strace socat -u TCP6-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc 2>/tmp/socat_server.strace │Starting test: TCP/IPv6: host to ns (spliced): big transfer │ ──namespace─────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────┴──pasta/tcp [7/12] - TCP/IPv6: host to ns (spliced): big transfer─────────────────────────────────── host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │ router: 88.198.0.161 fe80::1 │DNS: host$ which ip jq >/dev/null │ 185.12.64.1 host$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │ 185.12.64.2 88.198.0.164 │ NAT to host ::1: fe80::1 host$ ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP/DHCPv6: 88.198.0.161 │ assign: 2a01:4f8:222:904::2 host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]' │ router: fe80::1 enp9s0 │ our link-local: fe80::1 host$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.scope == "global" and .depreca│DNS: ted != true).local] | .[0]' │ 2a01:4ff:ff00::add:2 2a01:4f8:222:904::2 │ 2a01:4ff:ff00::add:1 host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP: received RS, sending RA fe80::1 │DHCP: offer to discover host$ which socat ip jq >/dev/null │ from 1e:48:6f:6e:b6:50 host$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10002 │DHCP: ack to request host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │ from 1e:48:6f:6e:b6:50 host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │DHCPv6: received SOLICIT, sending ADVERTISE host$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10002 │DHCPv6: received REQUEST/RENEW/CONFIRM, sending REPLY host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA host$ strace socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP6:[::1]:10002 2>/tmp/socat_client.strace │NDP: received NS, sending NA host$ │ ──host──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──pasta──────────────────────────────────────────────────────────────────────────────────────────────────────────────── Testing commit: a056cfc fwd: Direct inbound spliced forwards to the guest's external address PASS: 23 | FAIL: 0 | 2024-10-04T16:16:28+00:00 ...even without strace. The client is done, the server hangs. If I unblock this manually by re-running the same client command, the server wakes up, writes the file, and terminates, and the test continues normally. Those three "received NS, sending NA" messages in the pasta pane are printed in a short time after the test starts. If I run this with TRACE=1 (which needs the patch I just sent), this is pasta's debugging output for this test: -- 6.1401: pasta: epoll event on listening TCP socket 6 (events: 0x00000001) 6.1402: Flow 0 (NEW): FREE -> NEW 6.1402: Flow 0 (INI): NEW -> INI 6.1402: Flow 0 (INI): HOST [::1]:48910 -> [::]:10002 => ? 6.1402: Flow 0 (TGT): INI -> TGT 6.1402: Flow 0 (TGT): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): TGT -> TYPED 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): event at tcp_splice_connect:377 6.1402: Flow 0 (TCP connection (spliced)): SPLICE_CONNECT 6.1402: Flow 0 (TCP connection (spliced)): TYPED -> ACTIVE 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 6.1402: NDP: received NS, sending NA 7.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.1585: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 7.1585: NDP: received NS, sending NA 8.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.1825: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 8.1825: NDP: received NS, sending NA 9.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 9.2065: pasta: epoll event on connected spliced TCP socket 118 (events: 0x0000001c) 9.2065: Flow 0 (TCP connection (spliced)): Error event on socket: No route to host 9.2065: Flow 0 (TCP connection (spliced)): flag at tcp_splice_sock_handler:624 9.2065: Flow 0 (TCP connection (spliced)): RCVLOWAT_ACT_1 9.2068: Flow 0 (TCP connection (spliced)): CLOSED 9.2068: Flow 0 (FREE): ACTIVE -> FREE 9.2068: Flow 0 (FREE): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 10.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 11.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 12.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 13.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) [...] -- Relevant parts of strace output from the client: -- openat(AT_FDCWD, "/home/sbrivio/passt/test/big.bin", O_RDONLY) = 5 ioctl(5, TCGETS, 0x7ffd600ae4a0) = -1 ENOTTY (Inappropriate ioctl for device) fcntl(5, F_SETFD, FD_CLOEXEC) = 0 socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 connect(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(39038), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [112 => 28]) = 0 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 write(6, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 write(6, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) [...] pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "", 8192) = 0 shutdown(6, SHUT_WR) = 0 shutdown(6, SHUT_RDWR) = 0 exit_group(0) = ? +++ exited with 0 +++ -- and from the server: -- socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0 listen(6, 5) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, [28]) = 0 pselect6(7, [4 6], NULL, NULL, NULL, NULL -- If I connect from the host without a server in the namespace (but with the port forwarded by pasta), I get a connection reset, and if the port is not forwarded by pasta, connection refused. But this is another case: we start connecting and accept the connection (probably we shouldn't). Note the "No route to host" error on the socket. It looks somehow similar to the race I fixed with commit f4e9f26480ef ("pasta: Disable neighbour solicitations on device up to prevent DAD"), but it doesn't look like an invalid c->ip6.addr_seen, because otherwise pasta would reset the connection, I suppose. I haven't debugged further yet. This looks like an existing issue in pasta rather than in this series or in the tests, but it blocks tests, so I haven't applied this yet. -- Stefano
On Wed, 9 Oct 2024 15:07:21 +0200 Stefano Brivio <sbrivio(a)redhat.com> wrote:On Wed, 2 Oct 2024 15:48:26 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Ouch, sorry, it looks like saving something in claws-mail as draft and sending it later means lines will be forcefully wrapped. Here's the original test output: ns$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │...passed. 88.198.0.164 │ ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): big transfer 88.198.0.161 │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin ns$ ip -j link show | jq -rM '.[] | select(.ifname == "enp9s0").mtu' │...passed. 65520 │ ns$ /sbin/dhclient -6 --no-pid enp9s0 │Starting test: TCP/IPv4: ns to host (via tap): big transfer ns$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.prefixlen == 128).local] | .[0]' │? cmp /home/sbrivio/passt/test/big.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin 2a01:4f8:222:904::2 │...passed. ns$ ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' │ fe80::1 │Starting test: TCP/IPv4: host to ns (spliced): small transfer ns$ which socat ip jq >/dev/null │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc │...passed. ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10003 │ ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │Starting test: TCP/IPv4: ns to host (spliced): small transfer 88.198.0.161 │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:88.198.0.161:10003 │...passed. ns$ socat -u TCP4-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_small.bin,create,trunc │ ns$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10003 │Starting test: TCP/IPv4: ns to host (via tap): small transfer ns$ ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' │? cmp /home/sbrivio/passt/test/small.bin /tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin 88.198.0.161 │...passed. ns$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP4:88.198.0.161:10003 │ ns$ strace socat -u TCP6-LISTEN:10002 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_ns_big.bin,create,trunc 2>/tmp/socat_server.strace │Starting test: TCP/IPv6: host to ns (spliced): big transfer │ ──namespace─────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────┴──pasta/tcp [7/12] - TCP/IPv6: host to ns (spliced): big transfer─────────────────────────────────── host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │ router: 88.198.0.161 fe80::1 │DNS: host$ which ip jq >/dev/null │ 185.12.64.1 host$ ip -j -4 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │ 185.12.64.2 88.198.0.164 │ NAT to host ::1: fe80::1 host$ ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP/DHCPv6: 88.198.0.161 │ assign: 2a01:4f8:222:904::2 host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]' │ router: fe80::1 enp9s0 │ our link-local: fe80::1 host$ ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "enp9s0").addr_info[] | select(.scope == "global" and .depreca│DNS: ted != true).local] | .[0]' │ 2a01:4ff:ff00::add:2 2a01:4f8:222:904::2 │ 2a01:4ff:ff00::add:1 host$ ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]' │NDP: received RS, sending RA fe80::1 │DHCP: offer to discover host$ which socat ip jq >/dev/null │ from 1e:48:6f:6e:b6:50 host$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10002 │DHCP: ack to request host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │ from 1e:48:6f:6e:b6:50 host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_big.bin,create,trunc │DHCPv6: received SOLICIT, sending ADVERTISE host$ socat OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10002 │DHCPv6: received REQUEST/RENEW/CONFIRM, sending REPLY host$ socat -u TCP4-LISTEN:10003,bind=127.0.0.1 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-EsDdjG/pasta/tcp/test_small.bin,create,trunc │NDP: received NS, sending NA host$ strace socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP6:[::1]:10002 2>/tmp/socat_client.strace │NDP: received NS, sending NA host$ │ ──host──────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──pasta──────────────────────────────────────────────────────────────────────────────────────────────────────────────── Testing commit: a056cfc fwd: Direct inbound spliced forwards to the guest's external address PASS: 23 | FAIL: 0 | 2024-10-04T16:16:28+00:00In pasta mode, where addressing permits we "splice" connections, forwarding directly from host socket to guest/container socket without any L2 or L3 processing. This gives us a very large performance improvement when it's possible. Since the traffic is from a local socket within the guest, it will go over the guest's 'lo' interface, and accordingly we set the guest side address to be the loopback address. However this has a surprising side effect: sometimes guests will run services that are only supposed to be used within the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's forwarding exposes those services to the host, which isn't generally what we want. Correct this by instead forwarding inbound "splice" flows to the guest's external address. Link: https://github.com/containers/podman/issues/24045 Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- conf.c | 9 +++++++++ fwd.c | 31 +++++++++++++++++++++++-------- passt.1 | 23 +++++++++++++++++++---- passt.h | 2 ++ 4 files changed, 53 insertions(+), 12 deletions(-) diff --git a/conf.c b/conf.c index 6e62510..b5318f3 100644 --- a/conf.c +++ b/conf.c @@ -908,6 +908,9 @@ pasta_opts: " -U, --udp-ns SPEC UDP port forwarding to init namespace\n" " SPEC is as described above\n" " default: auto\n" + " --host-lo-to-ns-lo DEPRECATED:\n" + " Translate host-loopback forwards to\n" + " namespace loopback\n" " --userns NSPATH Target user namespace to join\n" " --netns PATH|NAME Target network namespace to join\n" " --netns-only Don't join existing user namespace\n" @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv) {"netns-only", no_argument, NULL, 20 }, {"map-host-loopback", required_argument, NULL, 21 }, {"map-guest-addr", required_argument, NULL, 22 }, + {"host-lo-to-ns-lo", no_argument, NULL, 23 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv) conf_nat(optarg, &c->ip4.map_guest_addr, &c->ip6.map_guest_addr, NULL); break; + case 23: + if (c->mode != MODE_PASTA) + die("--host-lo-to-ns-lo is for pasta mode only"); + c->host_lo_to_ns_lo = 1; + break; case 'd': c->debug = 1; c->quiet = 0; diff --git a/fwd.c b/fwd.c index a505098..c71f5e1 100644 --- a/fwd.c +++ b/fwd.c @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */ - /* Preserve the specific loopback adddress used, but let the - * kernel pick a source port on the target side + /* The traffic will go over the guest's 'lo' interface, but by + * default use its external address, so we don't inadvertently + * expose services that listen only on the guest's loopback + * address. That can be overridden by --host-lo-to-ns-lo which + * will instead forward to the loopback address in the guest. + * + * In either case, let the kernel pick the source address to + * match. */ - tgt->oaddr = ini->eaddr; + if (inany_v4(&ini->eaddr)) { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback4; + else + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + tgt->oaddr = inany_any4; + } else { + if (c->host_lo_to_ns_lo) + tgt->eaddr = inany_loopback6; + else + tgt->eaddr.a6 = c->ip6.addr_seen;Either this...+ tgt->oaddr = inany_any6;or this (and not something before this patch, up to 3/4) make the "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, sometimes (about one in three/four runs), that's what I mistakenly reported as coming from Laurent's series at: https://archives.passt.top/passt-dev/20241002163238.1778ed19@elisabeth/ It hangs like this (display with >= 240 columns):...even without strace. The client is done, the server hangs. If I unblock this manually by re-running the same client command, the server wakes up, writes the file, and terminates, and the test continues normally. Those three "received NS, sending NA" messages in the pasta pane are printed in a short time after the test starts. If I run this with TRACE=1 (which needs the patch I just sent), this is pasta's debugging output for this test: -- 6.1401: pasta: epoll event on listening TCP socket 6 (events: 0x00000001) 6.1402: Flow 0 (NEW): FREE -> NEW 6.1402: Flow 0 (INI): NEW -> INI 6.1402: Flow 0 (INI): HOST [::1]:48910 -> [::]:10002 => ? 6.1402: Flow 0 (TGT): INI -> TGT 6.1402: Flow 0 (TGT): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): TGT -> TYPED 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): event at tcp_splice_connect:377 6.1402: Flow 0 (TCP connection (spliced)): SPLICE_CONNECT 6.1402: Flow 0 (TCP connection (spliced)): TYPED -> ACTIVE 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 6.1402: NDP: received NS, sending NA 7.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.1585: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 7.1585: NDP: received NS, sending NA 8.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.1825: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 8.1825: NDP: received NS, sending NA 9.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 9.2065: pasta: epoll event on connected spliced TCP socket 118 (events: 0x0000001c) 9.2065: Flow 0 (TCP connection (spliced)): Error event on socket: No route to host 9.2065: Flow 0 (TCP connection (spliced)): flag at tcp_splice_sock_handler:624 9.2065: Flow 0 (TCP connection (spliced)): RCVLOWAT_ACT_1 9.2068: Flow 0 (TCP connection (spliced)): CLOSED 9.2068: Flow 0 (FREE): ACTIVE -> FREE 9.2068: Flow 0 (FREE): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 10.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 11.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 12.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 13.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) [...] --This was: 6.1401: pasta: epoll event on listening TCP socket 6 (events: 0x00000001) 6.1402: Flow 0 (NEW): FREE -> NEW 6.1402: Flow 0 (INI): NEW -> INI 6.1402: Flow 0 (INI): HOST [::1]:48910 -> [::]:10002 => ? 6.1402: Flow 0 (TGT): INI -> TGT 6.1402: Flow 0 (TGT): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): TGT -> TYPED 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: Flow 0 (TCP connection (spliced)): event at tcp_splice_connect:377 6.1402: Flow 0 (TCP connection (spliced)): SPLICE_CONNECT 6.1402: Flow 0 (TCP connection (spliced)): TYPED -> ACTIVE 6.1402: Flow 0 (TCP connection (spliced)): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 6.1402: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 6.1402: NDP: received NS, sending NA 7.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: TCP (spliced): cannot set pool pipe size to 524288 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.0007: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 7.1585: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 7.1585: NDP: received NS, sending NA 8.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.0006: Flow 0 (TCP connection (spliced)): flag at tcp_splice_timer:766 8.1825: pasta: epoll event on /dev/net/tun device 13 (events: 0x00000001) 8.1825: NDP: received NS, sending NA 9.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 9.2065: pasta: epoll event on connected spliced TCP socket 118 (events: 0x0000001c) 9.2065: Flow 0 (TCP connection (spliced)): Error event on socket: No route to host 9.2065: Flow 0 (TCP connection (spliced)): flag at tcp_splice_sock_handler:624 9.2065: Flow 0 (TCP connection (spliced)): RCVLOWAT_ACT_1 9.2068: Flow 0 (TCP connection (spliced)): CLOSED 9.2068: Flow 0 (FREE): ACTIVE -> FREE 9.2068: Flow 0 (FREE): HOST [::1]:48910 -> [::]:10002 => SPLICE [::]:0 -> [2a01:4f8:222:904::2]:10002 10.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 11.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 12.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) 13.0006: pasta: epoll event on namespace timer watch 12 (events: 0x00000001) [...]Relevant parts of strace output from the client: -- openat(AT_FDCWD, "/home/sbrivio/passt/test/big.bin", O_RDONLY) = 5 ioctl(5, TCGETS, 0x7ffd600ae4a0) = -1 ENOTTY (Inappropriate ioctl for device) fcntl(5, F_SETFD, FD_CLOEXEC) = 0 socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 connect(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(39038), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [112 => 28]) = 0 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 write(6, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 write(6, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6])This was: openat(AT_FDCWD, "/home/sbrivio/passt/test/big.bin", O_RDONLY) = 5 ioctl(5, TCGETS, 0x7ffd600ae4a0) = -1 ENOTTY (Inappropriate ioctl for device) fcntl(5, F_SETFD, FD_CLOEXEC) = 0 socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 connect(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(39038), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [112 => 28]) = 0 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 write(6, "\335>\210#\264\331\273\276\257['\357\365\361\2\262\\\255O\5L\302Q\231\16\234\266\307\32\362\206\333"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 write(6, "\343;H\320\177\323\245^\321%\\l\224\341R\235\337\33s\236\232\265\2608\312\257D\204\375\324\313\5"..., 8192) = 8192 pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6])[...] pselect6(7, [5], [6], [], NULL, NULL) = 2 (in [5], out [6]) read(5, "", 8192) = 0 shutdown(6, SHUT_WR) = 0 shutdown(6, SHUT_RDWR) = 0 exit_group(0) = ? +++ exited with 0 +++ -- and from the server: -- socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0 listen(6, 5) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, [28]) = 0 pselect6(7, [4 6], NULL, NULL, NULL, NULL --And this was: socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP) = 6 fcntl(6, F_SETFD, FD_CLOEXEC) = 0 setsockopt(6, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = 0 listen(6, 5) = 0 getsockname(6, {sa_family=AF_INET6, sin6_port=htons(10002), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, [28]) = 0 pselect6(7, [4 6], NULL, NULL, NULL, NULLIf I connect from the host without a server in the namespace (but with the port forwarded by pasta), I get a connection reset, and if the port is not forwarded by pasta, connection refused. But this is another case: we start connecting and accept the connection (probably we shouldn't). Note the "No route to host" error on the socket. It looks somehow similar to the race I fixed with commit f4e9f26480ef ("pasta: Disable neighbour solicitations on device up to prevent DAD"), but it doesn't look like an invalid c->ip6.addr_seen, because otherwise pasta would reset the connection, I suppose. I haven't debugged further yet. This looks like an existing issue in pasta rather than in this series or in the tests, but it blocks tests, so I haven't applied this yet.-- Stefano
On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:On Wed, 9 Oct 2024 15:07:21 +0200 Stefano Brivio <sbrivio(a)redhat.com> wrote: > On Wed, 2 Oct 2024 15:48:26 +1000 > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > In pasta mode, where addressing permits we "splice" connections, forwarding > > directly from host socket to guest/container socket without any L2 or L3 > > processing. This gives us a very large performance improvement when it's > > possible. > > > > Since the traffic is from a local socket within the guest, it will go over > > the guest's 'lo' interface, and accordingly we set the guest side address > > to be the loopback address. However this has a surprising side effect: > > sometimes guests will run services that are only supposed to be used within > > the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's > > forwarding exposes those services to the host, which isn't generally what > > we want. > > > > Correct this by instead forwarding inbound "splice" flows to the guest's > > external address. > > > > Link: https://github.com/containers/podman/issues/24045 > > > > Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> > > --- > > conf.c | 9 +++++++++ > > fwd.c | 31 +++++++++++++++++++++++-------- > > passt.1 | 23 +++++++++++++++++++---- > > passt.h | 2 ++ > > 4 files changed, 53 insertions(+), 12 deletions(-) > > > > diff --git a/conf.c b/conf.c > > index 6e62510..b5318f3 100644 > > --- a/conf.c > > +++ b/conf.c > > @@ -908,6 +908,9 @@ pasta_opts: > > " -U, --udp-ns SPEC UDP port forwarding to init namespace\n" > > " SPEC is as described above\n" > > " default: auto\n" > > + " --host-lo-to-ns-lo DEPRECATED:\n" > > + " Translate host-loopback forwards to\n" > > + " namespace loopback\n" > > " --userns NSPATH Target user namespace to join\n" > > " --netns PATH|NAME Target network namespace to join\n" > > " --netns-only Don't join existing user namespace\n" > > @@ -1284,6 +1287,7 @@ void conf(struct ctx *c, int argc, char **argv) > > {"netns-only", no_argument, NULL, 20 }, > > {"map-host-loopback", required_argument, NULL, 21 }, > > {"map-guest-addr", required_argument, NULL, 22 }, > > + {"host-lo-to-ns-lo", no_argument, NULL, 23 }, > > { 0 }, > > }; > > const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; > > @@ -1461,6 +1465,11 @@ void conf(struct ctx *c, int argc, char **argv) > > conf_nat(optarg, &c->ip4.map_guest_addr, > > &c->ip6.map_guest_addr, NULL); > > break; > > + case 23: > > + if (c->mode != MODE_PASTA) > > + die("--host-lo-to-ns-lo is for pasta mode only"); > > + c->host_lo_to_ns_lo = 1; > > + break; > > case 'd': > > c->debug = 1; > > c->quiet = 0; > > diff --git a/fwd.c b/fwd.c > > index a505098..c71f5e1 100644 > > --- a/fwd.c > > +++ b/fwd.c > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > /* spliceable */ > > > > - /* Preserve the specific loopback adddress used, but let the > > - * kernel pick a source port on the target side > > + /* The traffic will go over the guest's 'lo' interface, but by > > + * default use its external address, so we don't inadvertently > > + * expose services that listen only on the guest's loopback > > + * address. That can be overridden by --host-lo-to-ns-lo which > > + * will instead forward to the loopback address in the guest. > > + * > > + * In either case, let the kernel pick the source address to > > + * match. > > */ > > - tgt->oaddr = ini->eaddr; > > + if (inany_v4(&ini->eaddr)) { > > + if (c->host_lo_to_ns_lo) > > + tgt->eaddr = inany_loopback4; > > + else > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > + tgt->oaddr = inany_any4; > > + } else { > > + if (c->host_lo_to_ns_lo) > > + tgt->eaddr = inany_loopback6; > > + else > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > Either this... > > > + tgt->oaddr = inany_any6; > > or this (and not something before this patch, up to 3/4) make the > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > sometimes (about one in three/four runs), that's what I mistakenly > reported as coming from Laurent's series at:Huh, interesting. Just got back from my leave and ran that group of tests in a loop this afternoon, but didn't manage to reproduce. I have administrivia that will probably fill the rest of this week, but I'll look into this as soon as I can. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > On Wed, 9 Oct 2024 15:07:21 +0200 > Stefano Brivio <sbrivio(a)redhat.com> wrote:[snip]I reproduced the problem on passt.top, and I have a partial idea what's going on. As you say it's seeming like the address (addr_seen == addr in this case) isn't properly ready. This is over splice, but on the tap interface, I see the container sending NS messages for its own address - seems like it's doing DAD. But more importantly, we're answering those NS messages with NA messages, because we answer all NS. i.e. we're making the DAD fail. What I'm not sure of is how this ever worked at all. --config-net makes sense, since we disable DAD, but our test suite has always been using NDP+DHCP instead of --config-net. So, AFACT, we'll always fail guest DAD attempts, both IPv6, which happens most of the time and for IPv4 via ARP, which is used much more rarely. I think we need to be more selective in what NS or ARP lookups we resopnd to. The question is what approach to take: Option 1: Answer everything apart from specific exceptions ======== Basically we'd explicitly exclude the guest/container's assigned address but continue answering everything else. This is a bit ugly and means we'd probably still have a similar problem if the guest just picks an address instead of taking the assigned one. On the other hand it means things will work in at least some cases where the guest tries to contact remote hosts directly instead of via the default gateway. Option 2: Only answer for specific addresses ========= Reverse the logic above, usually *don't* answer NS queries, unless they have a specific address that we know is our responsibility. That would be, afaict, the gateway address, the "our_tap" addresses and the -map--host-loopback and --map-guest-addr addresses (usually all the same). This might be less robust against certain guests that don't use a default gateway when we expect. I guess that could happen when we have non-gateway routes copied from the host, so we'd have to handle that too. It would remove one more barrier towards having multiple bridged guests behind a single passt/pasta instance (they can't reasonably communicate with each other if passt answers all their NS / ARP requests). -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson> > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > /* spliceable */ > > > > - /* Preserve the specific loopback adddress used, but let the > > - * kernel pick a source port on the target side > > + /* The traffic will go over the guest's 'lo' interface, but by > > + * default use its external address, so we don't inadvertently > > + * expose services that listen only on the guest's loopback > > + * address. That can be overridden by --host-lo-to-ns-lo which > > + * will instead forward to the loopback address in the guest. > > + * > > + * In either case, let the kernel pick the source address to > > + * match. > > */ > > - tgt->oaddr = ini->eaddr; > > + if (inany_v4(&ini->eaddr)) { > > + if (c->host_lo_to_ns_lo) > > + tgt->eaddr = inany_loopback4; > > + else > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > + tgt->oaddr = inany_any4; > > + } else { > > + if (c->host_lo_to_ns_lo) > > + tgt->eaddr = inany_loopback6; > > + else > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > Either this... > > > + tgt->oaddr = inany_any6; > > or this (and not something before this patch, up to 3/4) make the > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > sometimes (about one in three/four runs), that's what I mistakenly > reported as coming from Laurent's series at:Huh, interesting. Just got back from my leave and ran that group of tests in a loop this afternoon, but didn't manage to reproduce. I have administrivia that will probably fill the rest of this week, but I'll look into this as soon as I can.
On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:Hmm... no.. there's more to this. Usually DAD requests have :: as the source address, and we *do* exclude those from getting replies. In this case though, we're getting NS requests for the assigned address from what looks like the SLAAC address. So, I do think it would be wise to explicitly exclude these: we shouldn't be giving NA responses for an address that ought to belong to the guest, even if it doesn't look like a DAD. But, I'm not sure what's triggering this. Is for some reason the DHCP address not "taking", so the container is trying to locate it on the network instead? Or _is_ this DAD, but under some circumstances rather than using :: as the source address it uses another configured address. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibsonOn Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > On Wed, 9 Oct 2024 15:07:21 +0200 > Stefano Brivio <sbrivio(a)redhat.com> wrote:[snip]I reproduced the problem on passt.top, and I have a partial idea what's going on. As you say it's seeming like the address (addr_seen == addr in this case) isn't properly ready. This is over splice, but on the tap interface, I see the container sending NS messages for its own address - seems like it's doing DAD. But more importantly, we're answering those NS messages with NA messages, because we answer all NS. i.e. we're making the DAD fail. What I'm not sure of is how this ever worked at all. --config-net makes sense, since we disable DAD, but our test suite has always been using NDP+DHCP instead of --config-net. So, AFACT, we'll always fail guest DAD attempts, both IPv6, which happens most of the time and for IPv4 via ARP, which is used much more rarely. I think we need to be more selective in what NS or ARP lookups we resopnd to. The question is what approach to take:> > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > /* spliceable */ > > > > - /* Preserve the specific loopback adddress used, but let the > > - * kernel pick a source port on the target side > > + /* The traffic will go over the guest's 'lo' interface, but by > > + * default use its external address, so we don't inadvertently > > + * expose services that listen only on the guest's loopback > > + * address. That can be overridden by --host-lo-to-ns-lo which > > + * will instead forward to the loopback address in the guest. > > + * > > + * In either case, let the kernel pick the source address to > > + * match. > > */ > > - tgt->oaddr = ini->eaddr; > > + if (inany_v4(&ini->eaddr)) { > > + if (c->host_lo_to_ns_lo) > > + tgt->eaddr = inany_loopback4; > > + else > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > + tgt->oaddr = inany_any4; > > + } else { > > + if (c->host_lo_to_ns_lo) > > + tgt->eaddr = inany_loopback6; > > + else > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > Either this... > > > + tgt->oaddr = inany_any6; > > or this (and not something before this patch, up to 3/4) make the > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > sometimes (about one in three/four runs), that's what I mistakenly > reported as coming from Laurent's series at:Huh, interesting. Just got back from my leave and ran that group of tests in a loop this afternoon, but didn't manage to reproduce. I have administrivia that will probably fill the rest of this week, but I'll look into this as soon as I can.
On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:Ok.. I've understood a bit more. While timing is a factor here, it looks like the main reason I wasn't seeing it on my machine is what I'd consider a bug in the Debian version of the dhclient-script: when adding an IPv6 address, it returns without waiting for DAD to complete (i.e. for the address to be non-tentative). There's also an additional bug, which doesn't cause this problem, I think, but caused some problems when I was investigating. DHCPv6 needs the link-local SLAAC address already configured and non-tentative. The Fedora dhclient-script waits for that too at the PREINIT6 stage, but the Debian one doesn't, meaning if you attempt dhclient -6 immediately after starting the namespace it will fail to bind the UDP address it needs. I still think it's a good idea not to give NA messages for the guest assigned address, but we'll need a different workaround for this issue. I guess we'll have to manually wait for DAD to complete in the DHCP tests, which will be kind of mucky :/ -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibsonOn Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:Hmm... no.. there's more to this. Usually DAD requests have :: as the source address, and we *do* exclude those from getting replies. In this case though, we're getting NS requests for the assigned address from what looks like the SLAAC address. So, I do think it would be wise to explicitly exclude these: we shouldn't be giving NA responses for an address that ought to belong to the guest, even if it doesn't look like a DAD. But, I'm not sure what's triggering this. Is for some reason the DHCP address not "taking", so the container is trying to locate it on the network instead? Or _is_ this DAD, but under some circumstances rather than using :: as the source address it uses another configured address.On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > On Wed, 9 Oct 2024 15:07:21 +0200 > Stefano Brivio <sbrivio(a)redhat.com> wrote:[snip]I reproduced the problem on passt.top, and I have a partial idea what's going on. As you say it's seeming like the address (addr_seen == addr in this case) isn't properly ready. This is over splice, but on the tap interface, I see the container sending NS messages for its own address - seems like it's doing DAD. But more importantly, we're answering those NS messages with NA messages, because we answer all NS. i.e. we're making the DAD fail. What I'm not sure of is how this ever worked at all. --config-net makes sense, since we disable DAD, but our test suite has always been using NDP+DHCP instead of --config-net. So, AFACT, we'll always fail guest DAD attempts, both IPv6, which happens most of the time and for IPv4 via ARP, which is used much more rarely. I think we need to be more selective in what NS or ARP lookups we resopnd to. The question is what approach to take:> > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > /* spliceable */ > > > > - /* Preserve the specific loopback adddress used, but let the > > - * kernel pick a source port on the target side > > + /* The traffic will go over the guest's 'lo' interface, but by > > + * default use its external address, so we don't inadvertently > > + * expose services that listen only on the guest's loopback > > + * address. That can be overridden by --host-lo-to-ns-lo which > > + * will instead forward to the loopback address in the guest. > > + * > > + * In either case, let the kernel pick the source address to > > + * match. > > */ > > - tgt->oaddr = ini->eaddr; > > + if (inany_v4(&ini->eaddr)) { > > + if (c->host_lo_to_ns_lo) > > + tgt->eaddr = inany_loopback4; > > + else > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > + tgt->oaddr = inany_any4; > > + } else { > > + if (c->host_lo_to_ns_lo) > > + tgt->eaddr = inany_loopback6; > > + else > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > Either this... > > > + tgt->oaddr = inany_any6; > > or this (and not something before this patch, up to 3/4) make the > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > sometimes (about one in three/four runs), that's what I mistakenly > reported as coming from Laurent's series at:Huh, interesting. Just got back from my leave and ran that group of tests in a loop this afternoon, but didn't manage to reproduce. I have administrivia that will probably fill the rest of this week, but I'll look into this as soon as I can.
On Wed, 16 Oct 2024 19:39:40 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:Oops. On one hand, I would feel inclined to propose a fix for the Debian and Ubuntu packages. On the other hand, I wonder if it's universally considered a bug: the DHCPv6 client did its job at that point, and it's debatable whether dhclient should wait for the address to be usable before forking to background. That is, arguably, the job of dhclient's is to request and configure an address. It's not a network configuration daemon. There might be many other reasons why that address is unusable, and yet dhclient is not responsible for them. By the way, I guess it's just an issue for test scripts like this one.On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:Ok.. I've understood a bit more. While timing is a factor here, it looks like the main reason I wasn't seeing it on my machine is what I'd consider a bug in the Debian version of the dhclient-script: when adding an IPv6 address, it returns without waiting for DAD to complete (i.e. for the address to be non-tentative).On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:Hmm... no.. there's more to this. Usually DAD requests have :: as the source address, and we *do* exclude those from getting replies. In this case though, we're getting NS requests for the assigned address from what looks like the SLAAC address. So, I do think it would be wise to explicitly exclude these: we shouldn't be giving NA responses for an address that ought to belong to the guest, even if it doesn't look like a DAD. But, I'm not sure what's triggering this. Is for some reason the DHCP address not "taking", so the container is trying to locate it on the network instead? Or _is_ this DAD, but under some circumstances rather than using :: as the source address it uses another configured address.On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > On Wed, 9 Oct 2024 15:07:21 +0200 > Stefano Brivio <sbrivio(a)redhat.com> wrote:[snip]> > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > > /* spliceable */ > > > > > > - /* Preserve the specific loopback adddress used, but let the > > > - * kernel pick a source port on the target side > > > + /* The traffic will go over the guest's 'lo' interface, but by > > > + * default use its external address, so we don't inadvertently > > > + * expose services that listen only on the guest's loopback > > > + * address. That can be overridden by --host-lo-to-ns-lo which > > > + * will instead forward to the loopback address in the guest. > > > + * > > > + * In either case, let the kernel pick the source address to > > > + * match. > > > */ > > > - tgt->oaddr = ini->eaddr; > > > + if (inany_v4(&ini->eaddr)) { > > > + if (c->host_lo_to_ns_lo) > > > + tgt->eaddr = inany_loopback4; > > > + else > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > > + tgt->oaddr = inany_any4; > > > + } else { > > > + if (c->host_lo_to_ns_lo) > > > + tgt->eaddr = inany_loopback6; > > > + else > > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > > > Either this... > > > > > + tgt->oaddr = inany_any6; > > > > or this (and not something before this patch, up to 3/4) make the > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > > sometimes (about one in three/four runs), that's what I mistakenly > > reported as coming from Laurent's series at: Huh, interesting. Just got back from my leave and ran that group of tests in a loop this afternoon, but didn't manage to reproduce. I have administrivia that will probably fill the rest of this week, but I'll look into this as soon as I can.I reproduced the problem on passt.top, and I have a partial idea what's going on. As you say it's seeming like the address (addr_seen == addr in this case) isn't properly ready. This is over splice, but on the tap interface, I see the container sending NS messages for its own address - seems like it's doing DAD. But more importantly, we're answering those NS messages with NA messages, because we answer all NS. i.e. we're making the DAD fail. What I'm not sure of is how this ever worked at all. --config-net makes sense, since we disable DAD, but our test suite has always been using NDP+DHCP instead of --config-net. So, AFACT, we'll always fail guest DAD attempts, both IPv6, which happens most of the time and for IPv4 via ARP, which is used much more rarely. I think we need to be more selective in what NS or ARP lookups we resopnd to. The question is what approach to take:There's also an additional bug, which doesn't cause this problem, I think, but caused some problems when I was investigating. DHCPv6 needs the link-local SLAAC address already configured and non-tentative. The Fedora dhclient-script waits for that too at the PREINIT6 stage, but the Debian one doesn't, meaning if you attempt dhclient -6 immediately after starting the namespace it will fail to bind the UDP address it needs.Right, and that's fine for us because we have a 2-second delay after SLAAC. This looks to me a bit more like a real bug, but again, there might be many other reasons why dhclient can't use a link-local address. One might argue that it's the responsibility of the user/caller to invoke dhclient when appropriate. In that sense, you might be wondering why there's a 2-second delay after SLAAC, but no delay after invoking dhclient -6: the reason is that I was convinced that one wouldn't need DAD once a DHCPv6 client configures an address. The server is already checking that, I thought. Well, no. RFC 8415 18.2.10.1: https://datatracker.ietf.org/doc/html/rfc8415#section-18.2.10.1 says: If the client can operate with the addresses and/or prefixes obtained from the server: [...] - The client MUST perform duplicate address detection as per Section 5.4 of [RFC4862], which does list some exceptions, on each of the received addresses in any IAs on which it has not performed duplicate address detection during processing of any of the previous Reply messages from the server. The client performs the duplicate address detection before using the received addresses for any traffic. If any of the addresses are found to be in use on the link, the client sends a Decline message to the server for those addresses as described in Section 18.2.8.I still think it's a good idea not to give NA messages for the guest assigned address, but we'll need a different workaround for this issue.I read the rest of your reasoning about it, but the nice thing of the current behaviour (and that's why I added that single check on the source address in ndp()) is that the guest can really use whatever address it wants, regardless of what we tried to configure, and we'll resolve any other address. If we receive a neighbour solicitation for the guest assigned address, and the source address is not unspecified, it means that the guest is _not_ using the assigned address, and it's actually trying to reach it.I guess we'll have to manually wait for DAD to complete in the DHCP tests, which will be kind of mucky :/Alternatively, we could use the same trick as added by commit f4e9f26480ef ("pasta: Disable neighbour solicitations on device up to prevent DAD"): disable neighbour solicitations, run dhclient -6, set 'nodad' on the address, and re-enable neighbour solicitations. This works for me: -- diff --git a/test/pasta/dhcp b/test/pasta/dhcp index 41556b8..76aa723 100644 --- a/test/pasta/dhcp +++ b/test/pasta/dhcp @@ -34,9 +34,12 @@ nsout MTU ip -j link show | jq -rM '.[] | select(.ifname == "__IFNAME__").mtu' check [ __MTU__ = 65520 ] test DHCPv6: address +ns ip link set dev __IFNAME__ arp off ns /sbin/dhclient -6 --no-pid __IFNAME__ hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]' nsout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.prefixlen == 128).local] | .[0]' +ns ip addr change __ADDR6__ dev __IFNAME__ nodad +ns ip link set dev __IFNAME__ arp on hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]' check [ __ADDR6__ = __HOST_ADDR6__ ] -- Adding 2-second delays as we have them for NDP doesn't look that bad: $ grep --exclude-dir=demo -rn "dhclient -6" pasta/dhcp:37:ns /sbin/dhclient -6 --no-pid __IFNAME__ passt_in_ns/dhcp:54:guest /sbin/dhclient -6 __IFNAME__ passt/dhcp:51:guest /sbin/dhclient -6 __IFNAME__ perf/passt_tcp:117:guest dhclient -6 -x perf/passt_tcp:118:guest dhclient -6 __IFNAME__ two_guests/basic:40:guest1 /sbin/dhclient -6 __IFNAME1__ two_guests/basic:41:guest2 /sbin/dhclient -6 __IFNAME2__ given that we don't need it on dhclient -x, tests would take about 12 seconds longer. Or we could switch to the arp off / nodad / arp on approach for everything, including SLAAC: $ grep --exclude-dir=demo --exclude-dir=mbuto --exclude-dir=distro --exclude-dir=memory -rn "sleep[ $(printf '\t')].*2" pasta/ndp:21:sleep 2 passt_in_ns/icmp:29:ns ip addr add 2001:db8::1 dev __IFNAME_NS__ && sleep 2 # DAD passt/ndp:19:guest ip link set dev __IFNAME__ up && sleep 2 two_guests/basic:39:sleep 2 and save slightly less than 8 seconds. If you ask me, I would have a slight preference for the nodad approach. -- Stefano
On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote:On Wed, 16 Oct 2024 19:39:40 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Hrm... I guess. Counterpoints.. - Most other failures to get a usable address will result in a visible error - dhclient has a --dad-wait-time option which seems to imply that the script should wait for DAD - The upstream script version waits for DAD In any case I filed a report for it https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:Oops. On one hand, I would feel inclined to propose a fix for the Debian and Ubuntu packages. On the other hand, I wonder if it's universally considered a bug: the DHCPv6 client did its job at that point, and it's debatable whether dhclient should wait for the address to be usable before forking to background. That is, arguably, the job of dhclient's is to request and configure an address. It's not a network configuration daemon. There might be many other reasons why that address is unusable, and yet dhclient is not responsible for them.On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:Ok.. I've understood a bit more. While timing is a factor here, it looks like the main reason I wasn't seeing it on my machine is what I'd consider a bug in the Debian version of the dhclient-script: when adding an IPv6 address, it returns without waiting for DAD to complete (i.e. for the address to be non-tentative).On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote: > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > > On Wed, 9 Oct 2024 15:07:21 +0200 > > Stefano Brivio <sbrivio(a)redhat.com> wrote: [snip] > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > > > /* spliceable */ > > > > > > > > - /* Preserve the specific loopback adddress used, but let the > > > > - * kernel pick a source port on the target side > > > > + /* The traffic will go over the guest's 'lo' interface, but by > > > > + * default use its external address, so we don't inadvertently > > > > + * expose services that listen only on the guest's loopback > > > > + * address. That can be overridden by --host-lo-to-ns-lo which > > > > + * will instead forward to the loopback address in the guest. > > > > + * > > > > + * In either case, let the kernel pick the source address to > > > > + * match. > > > > */ > > > > - tgt->oaddr = ini->eaddr; > > > > + if (inany_v4(&ini->eaddr)) { > > > > + if (c->host_lo_to_ns_lo) > > > > + tgt->eaddr = inany_loopback4; > > > > + else > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > > > + tgt->oaddr = inany_any4; > > > > + } else { > > > > + if (c->host_lo_to_ns_lo) > > > > + tgt->eaddr = inany_loopback6; > > > > + else > > > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > > > > > Either this... > > > > > > > + tgt->oaddr = inany_any6; > > > > > > or this (and not something before this patch, up to 3/4) make the > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > > > sometimes (about one in three/four runs), that's what I mistakenly > > > reported as coming from Laurent's series at: > > Huh, interesting. Just got back from my leave and ran that group of > tests in a loop this afternoon, but didn't manage to reproduce. I > have administrivia that will probably fill the rest of this week, but > I'll look into this as soon as I can. I reproduced the problem on passt.top, and I have a partial idea what's going on. As you say it's seeming like the address (addr_seen == addr in this case) isn't properly ready. This is over splice, but on the tap interface, I see the container sending NS messages for its own address - seems like it's doing DAD. But more importantly, we're answering those NS messages with NA messages, because we answer all NS. i.e. we're making the DAD fail. What I'm not sure of is how this ever worked at all. --config-net makes sense, since we disable DAD, but our test suite has always been using NDP+DHCP instead of --config-net. So, AFACT, we'll always fail guest DAD attempts, both IPv6, which happens most of the time and for IPv4 via ARP, which is used much more rarely. I think we need to be more selective in what NS or ARP lookups we resopnd to. The question is what approach to take:Hmm... no.. there's more to this. Usually DAD requests have :: as the source address, and we *do* exclude those from getting replies. In this case though, we're getting NS requests for the assigned address from what looks like the SLAAC address. So, I do think it would be wise to explicitly exclude these: we shouldn't be giving NA responses for an address that ought to belong to the guest, even if it doesn't look like a DAD. But, I'm not sure what's triggering this. Is for some reason the DHCP address not "taking", so the container is trying to locate it on the network instead? Or _is_ this DAD, but under some circumstances rather than using :: as the source address it uses another configured address.By the way, I guess it's just an issue for test scripts like this one.Why do you guess that?Here I think it's a much clearer argument that it's a real bug. We play fast and loose with it for mbuto, but dhclient can typically be called on an interface that isn't even up: the PREINIT/PREINIT6 script actions are specifically for this, they'll bring the interface up ready for the client to do its thing. I'd say the script is failing to do its job for PREINIT6 if there isn't a usable link-local address at the end. I filed a report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085229There's also an additional bug, which doesn't cause this problem, I think, but caused some problems when I was investigating. DHCPv6 needs the link-local SLAAC address already configured and non-tentative. The Fedora dhclient-script waits for that too at the PREINIT6 stage, but the Debian one doesn't, meaning if you attempt dhclient -6 immediately after starting the namespace it will fail to bind the UDP address it needs.Right, and that's fine for us because we have a 2-second delay after SLAAC. This looks to me a bit more like a real bug, but again, there might be many other reasons why dhclient can't use a link-local address. One might argue that it's the responsibility of the user/caller to invoke dhclient when appropriate.In that sense, you might be wondering why there's a 2-second delay after SLAAC, but no delay after invoking dhclient -6: the reason is that I was convinced that one wouldn't need DAD once a DHCPv6 client configures an address. The server is already checking that, I thought. Well, no. RFC 8415 18.2.10.1: https://datatracker.ietf.org/doc/html/rfc8415#section-18.2.10.1 says: If the client can operate with the addresses and/or prefixes obtained from the server: [...] - The client MUST perform duplicate address detection as per Section 5.4 of [RFC4862], which does list some exceptions, on each of the received addresses in any IAs on which it has not performed duplicate address detection during processing of any of the previous Reply messages from the server. The client performs the duplicate address detection before using the received addresses for any traffic. If any of the addresses are found to be in use on the link, the client sends a Decline message to the server for those addresses as described in Section 18.2.8.Indeed.Hrm. I suppose. Fwiw we already make the equivalent exclusion for ARPI still think it's a good idea not to give NA messages for the guest assigned address, but we'll need a different workaround for this issue.I read the rest of your reasoning about it, but the nice thing of the current behaviour (and that's why I added that single check on the source address in ndp()) is that the guest can really use whatever address it wants, regardless of what we tried to configure, and we'll resolve any other address.If we receive a neighbour solicitation for the guest assigned address, and the source address is not unspecified, it means that the guest is _not_ using the assigned address, and it's actually trying to reach it.Ok. More complex, but faster, I guess. I'll try implementing this. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibsonI guess we'll have to manually wait for DAD to complete in the DHCP tests, which will be kind of mucky :/Alternatively, we could use the same trick as added by commit f4e9f26480ef ("pasta: Disable neighbour solicitations on device up to prevent DAD"): disable neighbour solicitations, run dhclient -6, set 'nodad' on the address, and re-enable neighbour solicitations. This works for me:
On Thu, 17 Oct 2024 12:19:58 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote:Because it's kind of rare that your address changes if you use DHCPv6, I guess, so this would be relevant almost exclusively at boot. And, at boot, if a remote peer/client happens to try to connect to the machine where the client is running right after an address was assigned, it must have a retry mechanism almost for sure. -- StefanoOn Wed, 16 Oct 2024 19:39:40 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Hrm... I guess. Counterpoints.. - Most other failures to get a usable address will result in a visible error - dhclient has a --dad-wait-time option which seems to imply that the script should wait for DAD - The upstream script version waits for DAD In any case I filed a report for it https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:Oops. On one hand, I would feel inclined to propose a fix for the Debian and Ubuntu packages. On the other hand, I wonder if it's universally considered a bug: the DHCPv6 client did its job at that point, and it's debatable whether dhclient should wait for the address to be usable before forking to background. That is, arguably, the job of dhclient's is to request and configure an address. It's not a network configuration daemon. There might be many other reasons why that address is unusable, and yet dhclient is not responsible for them.On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote: > On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote: > > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > > > On Wed, 9 Oct 2024 15:07:21 +0200 > > > Stefano Brivio <sbrivio(a)redhat.com> wrote: > [snip] > > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > > > > /* spliceable */ > > > > > > > > > > - /* Preserve the specific loopback adddress used, but let the > > > > > - * kernel pick a source port on the target side > > > > > + /* The traffic will go over the guest's 'lo' interface, but by > > > > > + * default use its external address, so we don't inadvertently > > > > > + * expose services that listen only on the guest's loopback > > > > > + * address. That can be overridden by --host-lo-to-ns-lo which > > > > > + * will instead forward to the loopback address in the guest. > > > > > + * > > > > > + * In either case, let the kernel pick the source address to > > > > > + * match. > > > > > */ > > > > > - tgt->oaddr = ini->eaddr; > > > > > + if (inany_v4(&ini->eaddr)) { > > > > > + if (c->host_lo_to_ns_lo) > > > > > + tgt->eaddr = inany_loopback4; > > > > > + else > > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > > > > + tgt->oaddr = inany_any4; > > > > > + } else { > > > > > + if (c->host_lo_to_ns_lo) > > > > > + tgt->eaddr = inany_loopback6; > > > > > + else > > > > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > > > > > > > Either this... > > > > > > > > > + tgt->oaddr = inany_any6; > > > > > > > > or this (and not something before this patch, up to 3/4) make the > > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > > > > sometimes (about one in three/four runs), that's what I mistakenly > > > > reported as coming from Laurent's series at: > > > > Huh, interesting. Just got back from my leave and ran that group of > > tests in a loop this afternoon, but didn't manage to reproduce. I > > have administrivia that will probably fill the rest of this week, but > > I'll look into this as soon as I can. > > I reproduced the problem on passt.top, and I have a partial idea > what's going on. As you say it's seeming like the address (addr_seen > == addr in this case) isn't properly ready. This is over splice, but > on the tap interface, I see the container sending NS messages for its > own address - seems like it's doing DAD. But more importantly, we're > answering those NS messages with NA messages, because we answer all > NS. i.e. we're making the DAD fail. What I'm not sure of is how this > ever worked at all. --config-net makes sense, since we disable DAD, > but our test suite has always been using NDP+DHCP instead of > --config-net. > > So, AFACT, we'll always fail guest DAD attempts, both IPv6, which > happens most of the time and for IPv4 via ARP, which is used much more > rarely. I think we need to be more selective in what NS or ARP > lookups we resopnd to. The question is what approach to take: Hmm... no.. there's more to this. Usually DAD requests have :: as the source address, and we *do* exclude those from getting replies. In this case though, we're getting NS requests for the assigned address from what looks like the SLAAC address. So, I do think it would be wise to explicitly exclude these: we shouldn't be giving NA responses for an address that ought to belong to the guest, even if it doesn't look like a DAD. But, I'm not sure what's triggering this. Is for some reason the DHCP address not "taking", so the container is trying to locate it on the network instead? Or _is_ this DAD, but under some circumstances rather than using :: as the source address it uses another configured address.Ok.. I've understood a bit more. While timing is a factor here, it looks like the main reason I wasn't seeing it on my machine is what I'd consider a bug in the Debian version of the dhclient-script: when adding an IPv6 address, it returns without waiting for DAD to complete (i.e. for the address to be non-tentative).By the way, I guess it's just an issue for test scripts like this one.Why do you guess that?
On Thu, Oct 17, 2024 at 10:31:22AM +0200, Stefano Brivio wrote:On Thu, 17 Oct 2024 12:19:58 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Hm, true. I was thinking of ephemeral containers where "boot" could be quite common.. but those will most likely use --config-net.On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote:Because it's kind of rare that your address changes if you use DHCPv6, I guess, so this would be relevant almost exclusively at boot.On Wed, 16 Oct 2024 19:39:40 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Hrm... I guess. Counterpoints.. - Most other failures to get a usable address will result in a visible error - dhclient has a --dad-wait-time option which seems to imply that the script should wait for DAD - The upstream script version waits for DAD In any case I filed a report for it https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote: > On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote: > > On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote: > > > On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote: > > > > On Wed, 9 Oct 2024 15:07:21 +0200 > > > > Stefano Brivio <sbrivio(a)redhat.com> wrote: > > [snip] > > > > > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, > > > > > > (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { > > > > > > /* spliceable */ > > > > > > > > > > > > - /* Preserve the specific loopback adddress used, but let the > > > > > > - * kernel pick a source port on the target side > > > > > > + /* The traffic will go over the guest's 'lo' interface, but by > > > > > > + * default use its external address, so we don't inadvertently > > > > > > + * expose services that listen only on the guest's loopback > > > > > > + * address. That can be overridden by --host-lo-to-ns-lo which > > > > > > + * will instead forward to the loopback address in the guest. > > > > > > + * > > > > > > + * In either case, let the kernel pick the source address to > > > > > > + * match. > > > > > > */ > > > > > > - tgt->oaddr = ini->eaddr; > > > > > > + if (inany_v4(&ini->eaddr)) { > > > > > > + if (c->host_lo_to_ns_lo) > > > > > > + tgt->eaddr = inany_loopback4; > > > > > > + else > > > > > > + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); > > > > > > + tgt->oaddr = inany_any4; > > > > > > + } else { > > > > > > + if (c->host_lo_to_ns_lo) > > > > > > + tgt->eaddr = inany_loopback6; > > > > > > + else > > > > > > + tgt->eaddr.a6 = c->ip6.addr_seen; > > > > > > > > > > Either this... > > > > > > > > > > > + tgt->oaddr = inany_any6; > > > > > > > > > > or this (and not something before this patch, up to 3/4) make the > > > > > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang, > > > > > sometimes (about one in three/four runs), that's what I mistakenly > > > > > reported as coming from Laurent's series at: > > > > > > Huh, interesting. Just got back from my leave and ran that group of > > > tests in a loop this afternoon, but didn't manage to reproduce. I > > > have administrivia that will probably fill the rest of this week, but > > > I'll look into this as soon as I can. > > > > I reproduced the problem on passt.top, and I have a partial idea > > what's going on. As you say it's seeming like the address (addr_seen > > == addr in this case) isn't properly ready. This is over splice, but > > on the tap interface, I see the container sending NS messages for its > > own address - seems like it's doing DAD. But more importantly, we're > > answering those NS messages with NA messages, because we answer all > > NS. i.e. we're making the DAD fail. What I'm not sure of is how this > > ever worked at all. --config-net makes sense, since we disable DAD, > > but our test suite has always been using NDP+DHCP instead of > > --config-net. > > > > So, AFACT, we'll always fail guest DAD attempts, both IPv6, which > > happens most of the time and for IPv4 via ARP, which is used much more > > rarely. I think we need to be more selective in what NS or ARP > > lookups we resopnd to. The question is what approach to take: > > Hmm... no.. there's more to this. > > Usually DAD requests have :: as the source address, and we *do* > exclude those from getting replies. In this case though, we're > getting NS requests for the assigned address from what looks like the > SLAAC address. So, I do think it would be wise to explicitly exclude > these: we shouldn't be giving NA responses for an address that ought > to belong to the guest, even if it doesn't look like a DAD. > > But, I'm not sure what's triggering this. Is for some reason the DHCP > address not "taking", so the container is trying to locate it on the > network instead? Or _is_ this DAD, but under some circumstances > rather than using :: as the source address it uses another configured > address. Ok.. I've understood a bit more. While timing is a factor here, it looks like the main reason I wasn't seeing it on my machine is what I'd consider a bug in the Debian version of the dhclient-script: when adding an IPv6 address, it returns without waiting for DAD to complete (i.e. for the address to be non-tentative).Oops. On one hand, I would feel inclined to propose a fix for the Debian and Ubuntu packages. On the other hand, I wonder if it's universally considered a bug: the DHCPv6 client did its job at that point, and it's debatable whether dhclient should wait for the address to be usable before forking to background. That is, arguably, the job of dhclient's is to request and configure an address. It's not a network configuration daemon. There might be many other reasons why that address is unusable, and yet dhclient is not responsible for them.By the way, I guess it's just an issue for test scripts like this one.Why do you guess that?And, at boot, if a remote peer/client happens to try to connect to the machine where the client is running right after an address was assigned, it must have a retry mechanism almost for sure.-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote: [snip]Adding 2-second delays as we have them for NDP doesn't look that bad:$ grep --exclude-dir=demo -rn "dhclient -6" pasta/dhcp:37:ns /sbin/dhclient -6 --no-pid __IFNAME__ passt_in_ns/dhcp:54:guest /sbin/dhclient -6 __IFNAME__ passt/dhcp:51:guest /sbin/dhclient -6 __IFNAME__ perf/passt_tcp:117:guest dhclient -6 -x perf/passt_tcp:118:guest dhclient -6 __IFNAME__ two_guests/basic:40:guest1 /sbin/dhclient -6 __IFNAME1__ two_guests/basic:41:guest2 /sbin/dhclient -6 __IFNAME2__given that we don't need it on dhclient -x, tests would take about 12 seconds longer.Or we could switch to the arp off / nodad / arp on approach for everything, including SLAAC:$ grep --exclude-dir=demo --exclude-dir=mbuto --exclude-dir=distro --exclude-dir=memory -rn "sleep[ $(printf '\t')].*2" pasta/ndp:21:sleep 2 passt_in_ns/icmp:29:ns ip addr add 2001:db8::1 dev __IFNAME_NS__ && sleep 2 # DAD passt/ndp:19:guest ip link set dev __IFNAME__ up && sleep 2 two_guests/basic:39:sleep 2and save slightly less than 8 seconds. If you ask me, I would have a slight preference for the nodad approach.Much as I don't like making the tests slower, I don't think the nodad approach is a good idea. It's fine if we think of the NDP/DHCP only as setting things up for the transfer tests. But they also serve the purpose of testing the NDP and DHCP paths themselves. For that purpose we should perform a full "normal" configuration, including DAD. [e.g. if we really did manage to break our NS responses so that we totally broke DAD, we'd want our tests to catch that]. I have a draft implementation of explicitly waiting for DAD to complete both for SLAAC and for DHCPv6, patches coming shortly. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson