podman issue #24045 pointed out that pasta's spliced forwarding logic can expose services within the namespace bound only to 127.0.0.1 or ::1 to the host. However, the namespace probably expects those to only be accessible to itself, so that's probably not what we want. This changes our forwarding logic so as not to do this. Note that the podman tests will currently fail with this series, I've submitted podman PR https://github.com/containers/podman/pull/24064 to fix that. Link: https://github.com/containers/podman/issues/24045 David Gibson (2): test: Clarify test for spliced inbound transfers fwd: Direct inbound spliced forwards to the guest's external address fwd.c | 24 +++++++++++++++--------- test/passt_in_ns/tcp | 8 ++++---- test/passt_in_ns/udp | 4 ++-- test/pasta/tcp | 16 ++++++++-------- test/pasta/udp | 8 ++++---- 5 files changed, 33 insertions(+), 27 deletions(-) -- 2.46.1
The tests in pasta/tcp and pasta/udp for inbound transfers have the server listening within the namespace explicitly bound to 127.0.0.1 or ::1. This only works because of the behaviour of inbound splice connections, which always appear with both source and destination addresses as loopback in the namespace. That's not an inherent property for "spliced" connections and arguably an undesirable one. Also update the test names to make it clearer that these tests are expecting to exercise the "splice" path. Interestingly this was already correct for the equivalent passt_in_ns/*, although we also update the test names for clarity there. Note that there are similar issues in some of the podman tests, addressed in https://github.com/containers/podman/pull/24064 Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- test/passt_in_ns/tcp | 8 ++++---- test/passt_in_ns/udp | 4 ++-- test/pasta/tcp | 16 ++++++++-------- test/pasta/udp | 8 ++++---- 4 files changed, 18 insertions(+), 18 deletions(-) diff --git a/test/passt_in_ns/tcp b/test/passt_in_ns/tcp index aaf340e..319880b 100644 --- a/test/passt_in_ns/tcp +++ b/test/passt_in_ns/tcp @@ -32,7 +32,7 @@ host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10001 guestw guest cmp test_big.bin /root/big.bin -test TCP/IPv4: host to ns: big transfer +test TCP/IPv4: host to ns (spliced): big transfer nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10002 @@ -90,7 +90,7 @@ host socat -u OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10001 guestw guest cmp test_small.bin /root/small.bin -test TCP/IPv4: host to ns: small transfer +test TCP/IPv4: host to ns (spliced): small transfer nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10002 @@ -146,7 +146,7 @@ host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10001 guestw guest cmp test_big.bin /root/big.bin -test TCP/IPv6: host to ns: big transfer +test TCP/IPv6: host to ns (spliced): big transfer nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10002 @@ -204,7 +204,7 @@ host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10001 guestw guest cmp test_small.bin /root/small.bin -test TCP/IPv6: host to ns: small transfer +test TCP/IPv6: host to ns (spliced): small transfer nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10002 diff --git a/test/passt_in_ns/udp b/test/passt_in_ns/udp index 3426ab9..791511c 100644 --- a/test/passt_in_ns/udp +++ b/test/passt_in_ns/udp @@ -30,7 +30,7 @@ host socat -u OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10001,shut-null guestw guest cmp test.bin /root/medium.bin -test UDP/IPv4: host to ns +test UDP/IPv4: host to ns (recvmmsg/sendmmsg) nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10002,shut-null @@ -88,7 +88,7 @@ host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10001,shut-null guestw guest cmp test.bin /root/medium.bin -test UDP/IPv6: host to ns +test UDP/IPv6: host to ns (recvmmsg/sendmmsg) nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc sleep 1 host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10002,shut-null diff --git a/test/pasta/tcp b/test/pasta/tcp index 6ab18c5..53b6f25 100644 --- a/test/pasta/tcp +++ b/test/pasta/tcp @@ -19,8 +19,8 @@ set TEMP_NS_BIG __STATEDIR__/test_ns_big.bin set TEMP_SMALL __STATEDIR__/test_small.bin set TEMP_NS_SMALL __STATEDIR__/test_ns_small.bin -test TCP/IPv4: host to ns: big transfer -nsb socat -u TCP4-LISTEN:10002,bind=127.0.0.1 OPEN:__TEMP_NS_BIG__,create,trunc +test TCP/IPv4: host to ns (spliced): big transfer +nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc host socat -u OPEN:__BASEPATH__/big.bin TCP4:127.0.0.1:10002 nsw check cmp __BASEPATH__/big.bin __TEMP_NS_BIG__ @@ -38,8 +38,8 @@ ns socat -u OPEN:__BASEPATH__/big.bin TCP4:__GW__:10003 hostw check cmp __BASEPATH__/big.bin __TEMP_BIG__ -test TCP/IPv4: host to ns: small transfer -nsb socat -u TCP4-LISTEN:10002,bind=127.0.0.1 OPEN:__TEMP_NS_SMALL__,create,trunc +test TCP/IPv4: host to ns (spliced): small transfer +nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc host socat OPEN:__BASEPATH__/small.bin TCP4:127.0.0.1:10002 nsw check cmp __BASEPATH__/small.bin __TEMP_NS_SMALL__ @@ -57,8 +57,8 @@ ns socat -u OPEN:__BASEPATH__/small.bin TCP4:__GW__:10003 hostw check cmp __BASEPATH__/small.bin __TEMP_SMALL__ -test TCP/IPv6: host to ns: big transfer -nsb socat -u TCP6-LISTEN:10002,bind=[::1] OPEN:__TEMP_NS_BIG__,create,trunc +test TCP/IPv6: host to ns (spliced): big transfer +nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc host socat -u OPEN:__BASEPATH__/big.bin TCP6:[::1]:10002 nsw check cmp __BASEPATH__/big.bin __TEMP_NS_BIG__ @@ -77,8 +77,8 @@ ns socat -u OPEN:__BASEPATH__/big.bin TCP6:[__GW6__%__IFNAME__]:10003 hostw check cmp __BASEPATH__/big.bin __TEMP_BIG__ -test TCP/IPv6: host to ns: small transfer -nsb socat -u TCP6-LISTEN:10002,bind=[::1] OPEN:__TEMP_NS_SMALL__,create,trunc +test TCP/IPv6: host to ns (spliced): small transfer +nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc host socat -u OPEN:__BASEPATH__/small.bin TCP6:[::1]:10002 nsw check cmp __BASEPATH__/small.bin __TEMP_NS_SMALL__ diff --git a/test/pasta/udp b/test/pasta/udp index 30e3a85..7734d02 100644 --- a/test/pasta/udp +++ b/test/pasta/udp @@ -17,8 +17,8 @@ htools dd socat ip jq set TEMP __STATEDIR__/test.bin set TEMP_NS __STATEDIR__/test_ns.bin -test UDP/IPv4: host to ns -nsb socat -u UDP4-LISTEN:10002,bind=127.0.0.1,null-eof OPEN:__TEMP_NS__,create,trunc +test UDP/IPv4: host to ns (recvmmsg/sendmmsg) +nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc host socat OPEN:__BASEPATH__/medium.bin UDP4:127.0.0.1:10002,shut-null nsw check cmp __BASEPATH__/medium.bin __TEMP_NS__ @@ -37,8 +37,8 @@ ns socat -u OPEN:__BASEPATH__/medium.bin UDP4:__GW__:10003,shut-null hostw check cmp __BASEPATH__/medium.bin __TEMP__ -test UDP/IPv6: host to ns -nsb socat -u UDP6-LISTEN:10002,bind=[::1],null-eof OPEN:__TEMP_NS__,create,trunc +test UDP/IPv6: host to ns (recvmmsg/sendmmsg) +nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc host socat -u OPEN:__BASEPATH__/medium.bin UDP6:[::1]:10002,shut-null nsw check cmp __BASEPATH__/medium.bin __TEMP_NS__ -- 2.46.1
In pasta mode, where addressing permits we "splice" connections, forwarding directly from host socket to guest/container socket without any L2 or L3 processing. This gives us a very large performance improvement when it's possible. Since the traffic is from a local socket within the guest, it will go over the guest's 'lo' interface, and accordingly we set the guest side address to be the loopback address. However this has a surprising side effect: sometimes guests will run services that are only supposed to be used within the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's forwarding exposes those services to the host, which isn't generally what we want. Correct this by instead forwarding inbound "splice" flows to the guest's external address. Link: https://github.com/containers/podman/issues/24045 Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- fwd.c | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/fwd.c b/fwd.c index a505098..d5149db 100644 --- a/fwd.c +++ b/fwd.c @@ -447,20 +447,26 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */ - /* Preserve the specific loopback adddress used, but let the - * kernel pick a source port on the target side + /* The traffic will go over the guest's 'lo' interface, but use + * its external address, so we don't inadvertendly expose + * services that listen only on the guest's loopback address. + * + * Let the kernel pick our address on PIF_SPLICE */ - tgt->oaddr = ini->eaddr; + if (inany_v4(&ini->eaddr)) { + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + tgt->oaddr = inany_any4; + } else { + tgt->eaddr.a6 = c->ip6.addr_seen; + tgt->oaddr = inany_any6; + } + + /* Let the kernel pick port */ tgt->oport = 0; if (proto == IPPROTO_UDP) - /* But for UDP preserve the source port */ + /* Except for UDP, preserve the source port */ tgt->oport = ini->eport; - if (inany_v4(&ini->eaddr)) - tgt->eaddr = inany_loopback4; - else - tgt->eaddr = inany_loopback6; - return PIF_SPLICE; } -- 2.46.1
On Wed, 25 Sep 2024 16:54:36 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:In pasta mode, where addressing permits we "splice" connections, forwarding directly from host socket to guest/container socket without any L2 or L3 processing. This gives us a very large performance improvement when it's possible. Since the traffic is from a local socket within the guest, it will go over the guest's 'lo' interface, and accordingly we set the guest side address to be the loopback address. However this has a surprising side effect: sometimes guests will run services that are only supposed to be used within the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's forwarding exposes those services to the host, which isn't generally what we want. Correct this by instead forwarding inbound "splice" flows to the guest's external address. Link: https://github.com/containers/podman/issues/24045 Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- fwd.c | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/fwd.c b/fwd.c index a505098..d5149db 100644 --- a/fwd.c +++ b/fwd.c @@ -447,20 +447,26 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */ - /* Preserve the specific loopback adddress used, but let the - * kernel pick a source port on the target side + /* The traffic will go over the guest's 'lo' interface, but use + * its external address, so we don't inadvertendly exposeinadvertently+ * services that listen only on the guest's loopback address. + * + * Let the kernel pick our address on PIF_SPLICE */ - tgt->oaddr = ini->eaddr; + if (inany_v4(&ini->eaddr)) { + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + tgt->oaddr = inany_any4; + } else { + tgt->eaddr.a6 = c->ip6.addr_seen; + tgt->oaddr = inany_any6; + } + + /* Let the kernel pick port */ tgt->oport = 0; if (proto == IPPROTO_UDP) - /* But for UDP preserve the source port */ + /* Except for UDP, preserve the source port */It means the same thing, but it's less clear now: does it mean "Except for UDP: in that case preserve the source port" or "Except for UDP: in all other cases preserve the source port"? I would keep the original version unless I'm missing something subtle that this patch would change regarding ports.tgt->oport = ini->eport; - if (inany_v4(&ini->eaddr)) - tgt->eaddr = inany_loopback4; - else - tgt->eaddr = inany_loopback6; - return PIF_SPLICE; }The series looks good to me (yes, I know I still have to review one pending series from you), except for the concern I mentioned as a comment to the cover letter. -- Stefano
On Wed, Sep 25, 2024 at 10:29:44AM +0200, Stefano Brivio wrote:On Wed, 25 Sep 2024 16:54:36 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Oops, fixed.In pasta mode, where addressing permits we "splice" connections, forwarding directly from host socket to guest/container socket without any L2 or L3 processing. This gives us a very large performance improvement when it's possible. Since the traffic is from a local socket within the guest, it will go over the guest's 'lo' interface, and accordingly we set the guest side address to be the loopback address. However this has a surprising side effect: sometimes guests will run services that are only supposed to be used within the guest and are therefore bound to only 127.0.0.1 and/or ::1. pasta's forwarding exposes those services to the host, which isn't generally what we want. Correct this by instead forwarding inbound "splice" flows to the guest's external address. Link: https://github.com/containers/podman/issues/24045 Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- fwd.c | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/fwd.c b/fwd.c index a505098..d5149db 100644 --- a/fwd.c +++ b/fwd.c @@ -447,20 +447,26 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) { /* spliceable */ - /* Preserve the specific loopback adddress used, but let the - * kernel pick a source port on the target side + /* The traffic will go over the guest's 'lo' interface, but use + * its external address, so we don't inadvertendly exposeinadvertentlyGood point. I was trying to make it a follow on from the earlier "Let the kernel pick port" comment, but you're right the old version was still clearer. Reverted.+ * services that listen only on the guest's loopback address. + * + * Let the kernel pick our address on PIF_SPLICE */ - tgt->oaddr = ini->eaddr; + if (inany_v4(&ini->eaddr)) { + tgt->eaddr = inany_from_v4(c->ip4.addr_seen); + tgt->oaddr = inany_any4; + } else { + tgt->eaddr.a6 = c->ip6.addr_seen; + tgt->oaddr = inany_any6; + } + + /* Let the kernel pick port */ tgt->oport = 0; if (proto == IPPROTO_UDP) - /* But for UDP preserve the source port */ + /* Except for UDP, preserve the source port */It means the same thing, but it's less clear now: does it mean "Except for UDP: in that case preserve the source port" or "Except for UDP: in all other cases preserve the source port"?I would keep the original version unless I'm missing something subtle that this patch would change regarding ports.-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibsontgt->oport = ini->eport; - if (inany_v4(&ini->eaddr)) - tgt->eaddr = inany_loopback4; - else - tgt->eaddr = inany_loopback6; - return PIF_SPLICE; }The series looks good to me (yes, I know I still have to review one pending series from you), except for the concern I mentioned as a comment to the cover letter.
On Wed, 25 Sep 2024 16:54:34 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:podman issue #24045 pointed out that pasta's spliced forwarding logic can expose services within the namespace bound only to 127.0.0.1 or ::1 to the host. However, the namespace probably expects those to only be accessible to itself, so that's probably not what we want....that's what I thought would be desirable as you see from patch 1/2 and https://github.com/containers/podman/pull/24064. I think you're right in general but I would feel more confident applying this if 1. we briefly documented this in the man page and 2. we added an option to enable the current behaviour back (1. can be documented as part of documenting 2., then). The new fwd_nat_from_host() implementation seems to make this relatively trivial, but I'm not really familiar with it yet so I might be wrong. -- Stefano