[PATCH 00/22] RFC: Allow configuration of special case NATs
Based on Stefano's recent patch for faster tests.

Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely, the address assigned to the guest) rather than just host loopback.

Suggestions for better names for the new options in patches 20 & 22 are most welcome.

Along the way to implementing that, make many changes to clarify what the various addresses we track mean, fixing a number of small bugs as well.

NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.

Paul, amongst other things, I think this will allow podman to (finally) nicely address #19213, picking an address to remap to the host's external address with --nat-guest-addr, much like it already uses --dns-forward.
David Gibson (22):
  treewide: Use "our address" instead of "forwarding address"
  util: Helper for formatting MAC addresses
  treewide: Rename MAC address fields for clarity
  treewide: Use struct assignment instead of memcpy() for IP addresses
  conf: Use array indices rather than pointers for DNS array slots
  conf: More accurately count entries added in get_dns()
  conf: Move DNS array bounds checks into add_dns[46]
  conf: Move adding of a nameserver from resolv.conf into subfunction
  conf: Correct setting of dns_match address in add_dns6()
  conf: Treat --dns addresses as guest visible addresses
  conf: Remove incorrect initialisation of addr_ll_seen
  util: Correct sock_l4() binding for link local addresses
  treewide: Change misleading 'addr_ll' name
  Clarify which addresses in ip[46]_ctx are meaningful where
  Initialise our_tap_ll to ip6.gw when suitable
  fwd: Helpers to clarify what host addresses aren't guest accessible
  fwd: Split notion of "our tap address" from gateway for IPv4
  Don't take "our" MAC address from the host
  conf, fwd: Split notion of gateway/router from guest-visible host address
  conf: Allow address remapped to host to be configured
  fwd: Distinguish translatable from untranslatable addresses on inbound
  fwd, conf: Allow NAT of the guest's assigned address

 arp.c                 |   4 +-
 conf.c                | 328 +++++++++++++++++++++++++-----------------
 dhcp.c                |  19 +--
 dhcpv6.c              |  21 +--
 flow.c                |  72 +++++-----
 flow.h                |  18 +--
 fwd.c                 | 170 +++++++++++++++++-----
 icmp.c                |   4 +-
 ndp.c                 |   9 +-
 passt.1               |  45 +++++-
 passt.c               |   2 +-
 passt.h               |  53 +++++--
 pasta.c               |  14 +-
 tap.c                 |  12 +-
 tcp.c                 |  33 ++---
 tcp_internal.h        |   2 +-
 test/lib/setup        |  11 +-
 test/passt_in_ns/dhcp |  73 ++++++++++
 test/passt_in_ns/tcp  |  38 +++--
 test/passt_in_ns/udp  |  22 +--
 test/perf/passt_tcp   |  33 ++---
 test/perf/passt_udp   |  31 ++--
 test/perf/pasta_tcp   |  29 ++--
 test/perf/pasta_udp   |  25 ++--
 test/run              |   4 +-
 udp.c                 |  12 +-
 util.c                |  22 ++-
 util.h                |   4 +-
 28 files changed, 719 insertions(+), 391 deletions(-)
 create mode 100644 test/passt_in_ns/dhcp

--
2.46.0
The term "forwarding address" to indicate the local-to-passt address was
well-intentioned, but ends up being kinda confusing. As discussed on a
recent call, let's try "our" instead.
Signed-off-by: David Gibson
On Fri, 16 Aug 2024 15:39:42 +1000
David Gibson wrote:
The term "forwarding address" to indicate the local-to-passt address was well-intentioned, but ends up being kinda confusing. As discussed on a recent call, let's try "our" instead.
Signed-off-by: David Gibson
---
 flow.c         | 72 +++++++++++++++++++++++++-------------------------
 flow.h         | 18 ++++++-------
 fwd.c          | 70 ++++++++++++++++++++++++------------------------
 icmp.c         |  4 +--
 tcp.c          | 33 ++++++++++++-----------
 tcp_internal.h |  2 +-
 udp.c          | 12 ++++-----
 7 files changed, 106 insertions(+), 105 deletions(-)

diff --git a/flow.c b/flow.c
index 93b687dc..8915e366 100644
--- a/flow.c
+++ b/flow.c
@@ -127,18 +127,18 @@ static struct timespec flow_timer_run;
  * @af:		Address family (AF_INET or AF_INET6)
  * @eaddr:	Endpoint address (pointer to in_addr or in6_addr)
  * @eport:	Endpoint port
- * @faddr:	Forwarding address (pointer to in_addr or in6_addr)
- * @fport:	Forwarding port
+ * @oaddr:	Our address (pointer to in_addr or in6_addr)
+ * @oport:	Our port
  */
 static void flowside_from_af(struct flowside *side, sa_family_t af,
 			     const void *eaddr, in_port_t eport,
-			     const void *faddr, in_port_t fport)
+			     const void *oaddr, in_port_t oport)
 {
-	if (faddr)
-		inany_from_af(&side->faddr, af, faddr);
+	if (oaddr)
+		inany_from_af(&side->oaddr, af, oaddr);
 	else
-		side->faddr = inany_any6;
-	side->fport = fport;
+		side->oaddr = inany_any6;
+	side->oport = oport;

 	if (eaddr)
 		inany_from_af(&side->eaddr, af, eaddr);
@@ -193,8 +193,8 @@ static int flowside_sock_splice(void *arg)
  * @tgt:	Target flowside
  * @data:	epoll reference portion for protocol handlers
  *
- * Return: socket fd of protocol @proto bound to the forwarding address and port
- *         from @tgt (if specified).
+ * Return: socket fd of protocol @proto bound to our address and port from @tgt
+ *         (if specified).
  */
 int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 		     const struct flowside *tgt, uint32_t data)
@@ -205,11 +205,11 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,

 	ASSERT(pif_is_socket(pif));

-	pif_sockaddr(c, &sa, &sl, pif, &tgt->faddr, tgt->fport);
+	pif_sockaddr(c, &sa, &sl, pif, &tgt->oaddr, tgt->oport);

 	switch (pif) {
 	case PIF_HOST:
-		if (inany_is_loopback(&tgt->faddr))
+		if (inany_is_loopback(&tgt->oaddr))
 			ifname = NULL;
 		else if (sa.sa_family == AF_INET)
 			ifname = c->ip4.ifname_out;
@@ -309,11 +309,11 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
 			   pif_name(f->pif[INISIDE]),
 			   inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
 			   ini->eport,
-			   inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)),
-			   ini->fport,
+			   inany_ntop(&ini->oaddr, fstr0, sizeof(fstr0)),
+			   ini->oport,
 			   pif_name(f->pif[TGTSIDE]),
-			   inany_ntop(&tgt->faddr, fstr1, sizeof(fstr1)),
-			   tgt->fport,
+			   inany_ntop(&tgt->oaddr, fstr1, sizeof(fstr1)),
+			   tgt->oport,
 			   inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)),
 			   tgt->eport);
 	else if (MAX(state, oldstate) >= FLOW_STATE_INI)
@@ -321,8 +321,8 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
 			   pif_name(f->pif[INISIDE]),
 			   inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
 			   ini->eport,
-			   inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)),
-			   ini->fport);
+			   inany_ntop(&ini->oaddr, fstr0, sizeof(fstr0)),
+			   ini->oport);
 }

 /**
@@ -347,7 +347,7 @@ static void flow_initiate_(union flow *flow, uint8_t pif)
  * flow_initiate_af() - Move flow to INI, setting INISIDE details
  * @flow:	Flow to change state
  * @pif:	pif of the initiating side
- * @af:		Address family of @eaddr and @faddr
+ * @af:		Address family of @eaddr and @oaddr
Pre-existing, but this made me realise that flow_initiate_af() doesn't actually take @eaddr and @faddr at all (it's @saddr and @daddr instead). -- Stefano
On Sun, Aug 18, 2024 at 05:44:51PM +0200, Stefano Brivio wrote:
Pre-existing, but this made me realise that flow_initiate_af() doesn't actually take @eaddr and @faddr at all (it's @saddr and @daddr instead).
Oops, yes. I've folded a fix for that into this patch. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
There are a couple of places where we somewhat messily open code formatting
an Ethernet like MAC address for display. Add an eth_ntop() helper for
this.
Signed-off-by: David Gibson
On Fri, 16 Aug 2024 15:39:43 +1000
David Gibson wrote:
There are a couple of places where we somewhat messily open code formatting an Ethernet like MAC address for display. Add an eth_ntop() helper for this.
Signed-off-by: David Gibson
---
 conf.c |  7 +++----
 dhcp.c |  5 ++---
 util.c | 19 +++++++++++++++++++
 util.h |  3 +++
 4 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/conf.c b/conf.c
index ed097bdc..830f91a6 100644
--- a/conf.c
+++ b/conf.c
@@ -921,7 +921,8 @@ pasta_opts:
  */
 static void conf_print(const struct ctx *c)
 {
-	char buf4[INET_ADDRSTRLEN], buf6[INET6_ADDRSTRLEN], ifn[IFNAMSIZ];
+	char buf4[INET_ADDRSTRLEN], buf6[INET6_ADDRSTRLEN];
+	char bufmac[ETH_ADDRSTRLEN], ifn[IFNAMSIZ];
 	int i;

 	info("Template interface: %s%s%s%s%s",
@@ -955,9 +956,7 @@ static void conf_print(const struct ctx *c)
 		info("Namespace interface: %s", c->pasta_ifn);

 	info("MAC:");
-	info(" host: %02x:%02x:%02x:%02x:%02x:%02x",
-	     c->mac[0], c->mac[1], c->mac[2],
-	     c->mac[3], c->mac[4], c->mac[5]);
+	info(" host: %s", eth_ntop(c->mac, bufmac, sizeof(bufmac)));

 	if (c->ifi4) {
 		if (!c->no_dhcp) {
diff --git a/dhcp.c b/dhcp.c
index aa9f59da..acc5b03e 100644
--- a/dhcp.c
+++ b/dhcp.c
@@ -276,6 +276,7 @@ static void opt_set_dns_search(const struct ctx *c, size_t max_len)
 int dhcp(const struct ctx *c, const struct pool *p)
 {
 	size_t mlen, dlen, offset = 0, opt_len, opt_off = 0;
+	char macstr[ETH_ADDRSTRLEN];
 	const struct ethhdr *eh;
 	const struct iphdr *iph;
 	const struct udphdr *uh;
@@ -340,9 +341,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
 		return -1;
 	}

-	info(" from %02x:%02x:%02x:%02x:%02x:%02x",
-	     m->chaddr[0], m->chaddr[1], m->chaddr[2],
-	     m->chaddr[3], m->chaddr[4], m->chaddr[5]);
+	info(" from %s", eth_ntop(m->chaddr, macstr, sizeof(macstr)));

 	m->yiaddr = c->ip4.addr;
 	mask.s_addr = htonl(0xffffffff << (32 - c->ip4.prefix_len));
diff --git a/util.c b/util.c
index 0b414045..892358b1 100644
--- a/util.c
+++ b/util.c
@@ -676,6 +676,25 @@ const char *sockaddr_ntop(const void *sa, char *dst, socklen_t size)
 	return dst;
 }

+/** eth_ntop() - Convert an Ethernet MAC address to text format
+ * @mac:	MAC address
+ * @dst:	output buffer, minimum ETH_ADDRSTRLEN bytes
+ * @size:	size of buffer at @dst
Nit: s/output/Output, s/size/Size
+ *
+ * Return: On success, a non-null pointer to @dst, NULL on failure
+ */
+const char *eth_ntop(const unsigned char *mac, char *dst, size_t size)
+{
+	int len;
+
+	len = snprintf(dst, size, "%02x:%02x:%02x:%02x:%02x:%02x",
+		       mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
+	if (len < 0 || (size_t)len >= size)
+		return NULL;
+
+	return dst;
+}
+
 /** str_ee_origin() - Convert socket extended error origin to a string
  * @ee:		Socket extended error structure
  *
diff --git a/util.h b/util.h
index cb4d181c..c1748074 100644
--- a/util.h
+++ b/util.h
@@ -215,9 +215,12 @@ static inline const char *af_name(sa_family_t af)

 #define SOCKADDR_STRLEN	MAX(SOCKADDR_INET_STRLEN, SOCKADDR_INET6_STRLEN)

+#define ETH_ADDRSTRLEN	(ETH_ALEN * 3)
The fact that this includes two digits plus separator for all non-last octets of a MAC address, and two digits plus NULL terminator for the last octet, looks a bit subtle to me. Defining this as sizeof("00:11:22:33:44:55") wouldn't scream "off-by-one" as much, to me. Not a strong preference. -- Stefano
On Sun, Aug 18, 2024 at 05:44:55PM +0200, Stefano Brivio wrote:
+/** eth_ntop() - Convert an Ethernet MAC address to text format
+ * @mac:	MAC address
+ * @dst:	output buffer, minimum ETH_ADDRSTRLEN bytes
+ * @size:	size of buffer at @dst
Nit: s/output/Output, s/size/Size
Fixed.
+ *
+ * Return: On success, a non-null pointer to @dst, NULL on failure
+ */
+const char *eth_ntop(const unsigned char *mac, char *dst, size_t size)
+{
+	int len;
+
+	len = snprintf(dst, size, "%02x:%02x:%02x:%02x:%02x:%02x",
+		       mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
+	if (len < 0 || (size_t)len >= size)
+		return NULL;
+
+	return dst;
+}
+
 /** str_ee_origin() - Convert socket extended error origin to a string
  * @ee:		Socket extended error structure
  *
diff --git a/util.h b/util.h
index cb4d181c..c1748074 100644
--- a/util.h
+++ b/util.h
@@ -215,9 +215,12 @@ static inline const char *af_name(sa_family_t af)

 #define SOCKADDR_STRLEN	MAX(SOCKADDR_INET_STRLEN, SOCKADDR_INET6_STRLEN)

+#define ETH_ADDRSTRLEN	(ETH_ALEN * 3)
The fact that this includes two digits plus separator for all non-last octets of a MAC address, and two digits plus NULL terminator for the last octet, looks a bit subtle to me.
Defining this as sizeof("00:11:22:33:44:55") wouldn't scream "off-by-one" as much, to me. Not a strong preference.
Yeah, that makes sense. Done.
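For reference, the sizing discussed in this sub-thread can be sketched as follows. This is only an illustration: the function body is the one from the quoted patch, while the sizeof()-based definition of ETH_ADDRSTRLEN follows the review suggestion and may not match the final committed form.

```c
#include <stddef.h>
#include <stdio.h>

/* Sized from a sample string: 17 visible characters plus the terminating
 * NUL, which works out the same as ETH_ALEN * 3 for a 6-byte address.
 */
#define ETH_ADDRSTRLEN	(sizeof("00:11:22:33:44:55"))

/* Format a 6-byte MAC address as text in @dst.
 * Returns @dst on success, NULL if the buffer is too small.
 */
const char *eth_ntop(const unsigned char *mac, char *dst, size_t size)
{
	int len;

	len = snprintf(dst, size, "%02x:%02x:%02x:%02x:%02x:%02x",
		       mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
	if (len < 0 || (size_t)len >= size)
		return NULL;

	return dst;
}
```

A caller declares `char buf[ETH_ADDRSTRLEN];` and passes `sizeof(buf)` as the size, as conf_print() and dhcp() do in the patch.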
c->mac isn't a great name, because it doesn't say whose mac address it is
and it's not necessarily obvious in all the contexts we use it. Since this
is specifically the address that we (passt/pasta) use on the tap interface,
rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac"
to be grammatically consistent.
Signed-off-by: David Gibson
On Fri, 16 Aug 2024 15:39:44 +1000
David Gibson wrote:
c->mac isn't a great name, because it doesn't say whose mac address it is and it's not necessarily obvious in all the contexts we use it. Since this is specifically the address that we (passt/pasta) use on the tap interface, rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac" to be grammatically consistent.
Wouldn't "our_mac" suffice? Even the day we get to support other types of link (well, "tap" for a guest is already not a tap, I know...), or especially multiple links at the same time, I guess we will still want to use a single MAC address. -- Stefano
On Sun, Aug 18, 2024 at 05:45:00PM +0200, Stefano Brivio wrote:
On Fri, 16 Aug 2024 15:39:44 +1000 David Gibson
wrote: c->mac isn't a great name, because it doesn't say whose mac address it is and it's not necessarily obvious in all the contexts we use it. Since this is specifically the address that we (passt/pasta) use on the tap interface, rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac" to be grammatically consistent.
Wouldn't "our_mac" suffice?
Maybe. This is supposed to emphasise that this is used on PIF_TAP - we also (usually) have a MAC address on the host interfaces, though we don't really need to care about it.
Even the day we get to support other types of link (well, "tap" for a guest is already not a tap, I know...), or especially multiple links at the same time, I guess we will still want to use a single MAC address.
We rely on C11 already, so we can use clearer and more type-checkable
struct assignment instead of memcpy() for copying IP addresses around.
This exposes some "pointer could be const" warnings from cppcheck, so
address those too.
Signed-off-by: David Gibson
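A minimal sketch of the difference, using a hypothetical cut-down structure (ip6_demo is not a passt type). Both functions copy the same bytes, but with assignment the compiler checks that both sides are a struct in6_addr, and there is no size argument to get wrong.

```c
#include <netinet/in.h>
#include <string.h>

/* Hypothetical stand-in for the relevant fields of the IPv6 context */
struct ip6_demo {
	struct in6_addr addr;
	struct in6_addr addr_seen;
};

/* Old style: nothing ties the size argument to the types being copied */
void seen_init_memcpy(struct ip6_demo *ip6)
{
	memcpy(&ip6->addr_seen, &ip6->addr, sizeof(ip6->addr_seen));
}

/* New style: plain struct assignment, type-checked by the compiler */
void seen_init_assign(struct ip6_demo *ip6)
{
	ip6->addr_seen = ip6->addr;
}
```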
On Fri, 16 Aug 2024 15:39:45 +1000
David Gibson wrote:
We rely on C11 already, so we can use clearer and more type-checkable struct assignment instead of memcpy() for copying IP addresses around.
This exposes some "pointer could be const" warnings from cppcheck, so address those too.
Signed-off-by: David Gibson
---
 conf.c   | 12 ++++++------
 dhcpv6.c | 10 ++++++----
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/conf.c b/conf.c
index 750fdc86..9b05afeb 100644
--- a/conf.c
+++ b/conf.c
@@ -389,14 +389,14 @@ static void add_dns6(struct ctx *c,
 	/* Guest or container can only access local addresses via redirect */
 	if (IN6_IS_ADDR_LOOPBACK(addr)) {
 		if (!c->no_map_gw) {
-			memcpy(*conf, &c->ip6.gw, sizeof(**conf));
+			**conf = c->ip6.gw;
 			(*conf)++;

 			if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match))
-				memcpy(&c->ip6.dns_match, addr, sizeof(*addr));
+				c->ip6.dns_match = *addr;
 		}
 	} else {
-		memcpy(*conf, addr, sizeof(**conf));
+		**conf = *addr;
 		(*conf)++;
 	}

@@ -632,7 +632,7 @@ static unsigned int conf_ip4(unsigned int ifi,
 		ip4->prefix_len = 32;
 	}

-	memcpy(&ip4->addr_seen, &ip4->addr, sizeof(ip4->addr_seen));
+	ip4->addr_seen = ip4->addr;

 	if (MAC_IS_ZERO(mac)) {
 		int rc = nl_link_get_mac(nl_sock, ifi, mac);
@@ -693,8 +693,8 @@ static unsigned int conf_ip6(unsigned int ifi,
 		return 0;
 	}

-	memcpy(&ip6->addr_seen, &ip6->addr, sizeof(ip6->addr));
-	memcpy(&ip6->addr_ll_seen, &ip6->addr_ll, sizeof(ip6->addr_ll));
+	ip6->addr_seen = ip6->addr;
+	ip6->addr_ll_seen = ip6->addr_ll;

 	if (MAC_IS_ZERO(mac)) {
 		rc = nl_link_get_mac(nl_sock, ifi, mac);
diff --git a/dhcpv6.c b/dhcpv6.c
index bbed41dc..87b3c3eb 100644
--- a/dhcpv6.c
+++ b/dhcpv6.c
@@ -298,7 +298,8 @@ static struct opt_hdr *dhcpv6_ia_notonlink(const struct pool *p,
 {
 	char buf[INET6_ADDRSTRLEN];
 	struct in6_addr req_addr;
-	struct opt_hdr *ia, *h;
+	const struct opt_hdr *h;
+	struct opt_hdr *ia;
 	size_t offset;
 	int ia_type;

@@ -312,12 +313,13 @@ ia_ta:
 		offset += sizeof(struct opt_ia_na);

 		while ((h = dhcpv6_opt(p, &offset, OPT_IAAADR))) {
-			struct opt_ia_addr *opt_addr = (struct opt_ia_addr *)h;
+			const struct opt_ia_addr *opt_addr
+				= (const struct opt_ia_addr *)h;
Nit: the assignment could go on its own line, then?
 			if (ntohs(h->l) != OPT_VSIZE(ia_addr))
 				return NULL;

-			memcpy(&req_addr, &opt_addr->addr, sizeof(req_addr));
+			req_addr = opt_addr->addr;
 			if (!IN6_ARE_ADDR_EQUAL(la, &req_addr)) {
 				info("DHCPv6: requested address %s not on link",
 				     inet_ntop(AF_INET6, &req_addr,
@@ -363,7 +365,7 @@ static size_t dhcpv6_dns_fill(const struct ctx *c, char *buf, int offset)
 			srv->hdr.l = 0;
 		}

-		memcpy(&srv->addr[i], &c->ip6.dns[i], sizeof(srv->addr[i]));
+		srv->addr[i] = c->ip6.dns[i];
 		srv->hdr.l += sizeof(srv->addr[i]);
 		offset += sizeof(srv->addr[i]);
 	}
I only reviewed up to this patch so far. -- Stefano
On Sun, Aug 18, 2024 at 05:45:03PM +0200, Stefano Brivio wrote:
 		while ((h = dhcpv6_opt(p, &offset, OPT_IAAADR))) {
-			struct opt_ia_addr *opt_addr = (struct opt_ia_addr *)h;
+			const struct opt_ia_addr *opt_addr
+				= (const struct opt_ia_addr *)h;
Nit: the assignment could go on its own line, then?
Good point, done.
I only reviewed up to this patch so far.
Currently add_dns[46]() take a somewhat awkward double pointer to the
entry in the c->ip[46].dns array to update. It turns out to be easier to
work with indices into that array instead.
This diff does add some lines, but it's comments, and will allow some
future code reductions.
Signed-off-by: David Gibson
get_dns() counts the number of guest DNS servers it adds, and gives an
error if it couldn't add any. However, this count ignores the fact that
add_dns[46]() may in some cases *not* add an entry. Use the array indices
we're already tracking to get an accurate count.
Signed-off-by: David Gibson
Every time we call add_dns[46] we need to first check if there's space in
the c->ip[46].dns array for the new entry. We might as well make that
check in add_dns[46]() itself.
In fact it looks like the calls in get_dns() had an off by one error, not
allowing the last entry of the array to be filled. So, that bug is also
fixed by the change.
Signed-off-by: David Gibson
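Taken together, the index-based interface, the accurate count and the callee-side bounds check might look roughly like the sketch below. All names here (dns_demo, DNS_SLOTS_DEMO, add_dns4_demo) are hypothetical, and skipping loopback addresses is only a stand-in for the cases where an entry is not added; the real add_dns[46]() differ in detail.

```c
#include <arpa/inet.h>
#include <netinet/in.h>

#define DNS_SLOTS_DEMO	3	/* hypothetical size of the DNS array */

/* Hypothetical stand-in for the c->ip4.dns array */
struct dns_demo {
	struct in_addr dns[DNS_SLOTS_DEMO];
};

/* Possibly add @addr at slot @idx.
 * Returns the next free index: @idx + 1 if the entry was stored, @idx
 * unchanged if it was skipped, so the final index is an accurate count.
 */
unsigned add_dns4_demo(struct dns_demo *d, const struct in_addr *addr,
		       unsigned idx)
{
	/* Bounds check in the callee: a caller-side "idx < size - 1" test
	 * would never let the last slot be filled.
	 */
	if (idx >= DNS_SLOTS_DEMO)
		return idx;

	/* Stand-in for an entry we choose not to add (assumption) */
	if ((ntohl(addr->s_addr) >> 24) == 127)
		return idx;

	d->dns[idx] = *addr;
	return idx + 1;
}
```

The caller threads the index through each call (`idx = add_dns4_demo(&d, &addr, idx);`) and can report an error if it is still zero at the end.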
get_dns() is already quite deeply nested, and future changes I have in
mind will add more complexity. Prepare for this by splitting out the
adding of a single nameserver to the configuration into its own function.
Signed-off-by: David Gibson
add_dns6() (but not add_dns4()) has a bug setting dns_match: it sets it to
the given address, rather than the gateway address. This is doubly wrong:
- We've just established the given address is a host loopback address
the guest can't access
- We've just set ip6.dns[] to tell the guest to use the gateway address,
so it won't use the dns_match address we're setting
Correct this to use the gateway address, like IPv4.
Signed-off-by: David Gibson
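The corrected behaviour can be illustrated with a hypothetical, cut-down structure (ip6_dns_demo and add_loopback_ns_demo are not passt names, and only a single DNS slot is modelled). The point is that the address advertised to the guest and the address matched for forwarding are the same one, the gateway.

```c
#include <netinet/in.h>

/* Hypothetical stand-in for the relevant fields of the IPv6 context */
struct ip6_dns_demo {
	struct in6_addr gw;		/* gateway address seen by the guest */
	struct in6_addr dns0;		/* nameserver advertised to the guest */
	struct in6_addr dns_match;	/* guest-side address to match queries */
};

/* Handle a host nameserver that is a loopback address: the guest can't
 * reach it directly, so advertise the gateway instead, and match on that
 * same gateway address rather than on the host loopback address.
 */
void add_loopback_ns_demo(struct ip6_dns_demo *ip6)
{
	ip6->dns0 = ip6->gw;

	if (IN6_IS_ADDR_UNSPECIFIED(&ip6->dns_match))
		ip6->dns_match = ip6->gw;
}
```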
Although it's not 100% explicit in the man page, addresses given to the
--dns option are intended to be addresses as seen by the guest. This
differs from addresses taken from the host's /etc/resolv.conf, which must
be translated to guest accessible versions in some cases.
Our implementation is currently inconsistent on this: when using
--dns-forward, you must usually also give --dns with the matching address,
which is meaningful only in the guest's address view. However if you give
--dns with a loopback address, it will be translated like a host view
address.
Move the remapping logic for DNS addresses out of add_dns4() and add_dns6()
into add_dns_resolv() so that it is only applied for host nameserver
addresses, not for nameservers given explicitly with --dns.
Signed-off-by: David Gibson
Despite the names, addr_ll_seen does not relate to addr_ll the same way
addr_seen relates to addr. addr_ll_seen is an observed address from the
guest, whereas addr_ll is *our* link-local address for use on the tap link
when we can't use an external endpoint address. It's used both for
passt provided services (DHCPv6, NDP) and in some cases for connections
from addresses the guest can't access.
Signed-off-by: David Gibson
When binding an IPv6 socket in sock_l4() we need to supply a scope id if
the address is link-local. We check for this by comparing the given
address to c->ip6.addr_ll. This is correct only by accident: while
c->ip6.addr_ll is typically set to the hsot interface's link local
address, the actually purpose of it is to provide a link local address
for passt's private use on the tap interface.
Instead set the scope id for any link-local address we're binding to.
We're going to need something and this is what makes sense for sockets
on the host. It doesn't make sense for PIF_SPLICE sockets, but those
should always have loopback, not link-local addresses.
Signed-off-by: David Gibson
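As a rough sketch of the new rule, the scope id is keyed on the address being link-local rather than on it matching one particular stored address. The helper name and the way the interface index is passed in are assumptions for illustration; sock_l4() itself is structured differently.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Fill @sa for binding to @addr and @port. Any link-local address needs a
 * scope id naming the interface; other addresses leave it at zero.
 */
void sockaddr6_init_demo(struct sockaddr_in6 *sa, const struct in6_addr *addr,
			 in_port_t port, unsigned int ifindex)
{
	memset(sa, 0, sizeof(*sa));
	sa->sin6_family = AF_INET6;
	sa->sin6_port = htons(port);
	sa->sin6_addr = *addr;

	if (IN6_IS_ADDR_LINKLOCAL(addr))
		sa->sin6_scope_id = ifindex;
}
```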
On Fri, 16 Aug 2024 15:39:53 +1000
David Gibson wrote:
When binding an IPv6 socket in sock_l4() we need to supply a scope id if the address is link-local. We check for this by comparing the given address to c->ip6.addr_ll. This is correct only by accident: while c->ip6.addr_ll is typically set to the hsot interface's link local address, the actually purpose of it is to provide a link local address
Nits: host, actual -- Stefano
On Tue, Aug 20, 2024 at 02:14:59AM +0200, Stefano Brivio wrote:
On Fri, 16 Aug 2024 15:39:53 +1000 David Gibson
wrote: When binding an IPv6 socket in sock_l4() we need to supply a scope id if the address is link-local. We check for this by comparing the given address to c->ip6.addr_ll. This is correct only by accident: while c->ip6.addr_ll is typically set to the hsot interface's link local address, the actually purpose of it is to provide a link local address
Nits: host, actual
Fixed.
c->ip6.addr_ll is not like c->ip6.addr. The latter is an address for the
guest, but the former is an address for our use on the tap link. Rename it
accordingly, to 'our_tap_ll'.
Signed-off-by: David Gibson
On Fri, 16 Aug 2024 15:39:54 +1000
David Gibson wrote:
c->ip6.addr_ll is not like c->ip6.addr. The latter is an address for the guest, but the former is an address for our use on the tap link. Rename it accordingly, to 'our_tap_ll'.
Same as 3/22: could this be "our_ll"? Same here, not a strong preference. I reviewed only up to 16/22 so far. -- Stefano
On Tue, Aug 20, 2024 at 02:15:03AM +0200, Stefano Brivio wrote:
On Fri, 16 Aug 2024 15:39:54 +1000 David Gibson
wrote: c->ip6.addr_ll is not like c->ip6.addr. The latter is an address for the guest, but the former is an address for our use on the tap link. Rename it accordingly, to 'our_tap_ll'.
Same as 3/22: could this be "our_ll"? Same here, not a strong preference.
Same answer here. Maybe, but I want to emphasise that it's our address as used on PIF_TAP. Obviously we may use host LL addresses if we contact external hosts on the same link as the host.
I reviewed only up to 16/22 so far.
Some are guest visible addresses and may not be valid on the host, others
are host visible addresses and may not be valid on the guest. Rearrange
and comment the ip[46]_ctx definitions to make it clearer which is which.
Signed-off-by: David Gibson
In every place we use our_tap_ll, we only use it as a fallback if the
IPv6 gateway address is not link-local. We can avoid that conditional at
use time by doing it at initialisation of our_tap_ll instead.
Signed-off-by: David Gibson
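For readers following along, the change described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `struct ip6_sketch`, `init_our_tap_ll()` and the `host_ll` parameter are hypothetical names, not the ones used in the patch, and the fallback is assumed to be the host interface's link-local address.

```c
#include <assert.h>
#include <netinet/in.h>

/* Hypothetical, simplified stand-in for the IPv6 part of the context */
struct ip6_sketch {
	struct in6_addr gw;		/* gateway address */
	struct in6_addr our_tap_ll;	/* our link-local address on the tap link */
};

/* Decide once, at initialisation: use the gateway address if it is
 * link-local, otherwise fall back to host_ll (assumed to be the host
 * interface's link-local address). Users of our_tap_ll then need no
 * conditional of their own.
 */
void init_our_tap_ll(struct ip6_sketch *ip6, const struct in6_addr *host_ll)
{
	assert(ip6 && host_ll);

	if (IN6_IS_ADDR_LINKLOCAL(&ip6->gw))
		ip6->our_tap_ll = ip6->gw;
	else
		ip6->our_tap_ll = *host_ll;
}
```

With this shape, code that previously checked whether the gateway was link-local before falling back can simply read `our_tap_ll`.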
We usually avoid NAT, but in a few cases we need to apply address
translations. For inbound connections that happens for addresses which
make sense to the host but are either inaccessible, or mean a different
location from the guest's point of view.
Add some helper functions to determine such addresses, and use them in
fwd_nat_from_host(). In doing so clarify some of the reasons for the
logic. We'll also have further use for these helpers in future.
While we're there fix one unnecessary inconsistency between IPv4 and IPv6.
We always translated the guest's observed address, but for IPv4 we didn't
translate the guest's assigned address, whereas for IPv6 we did. Change
this to translate both in all cases for consistency.
Signed-off-by: David Gibson
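As a rough illustration of the kind of helper described above, here is a minimal IPv4 sketch. It is not the code from the patch: `struct ip4_sketch` and `guest_accessible4_sketch()` are hypothetical names, and the exact set of excluded addresses is an assumption based on the description (loopback, unspecified, and the guest's assigned and observed addresses).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical, simplified stand-in for the IPv4 part of the context */
struct ip4_sketch {
	struct in_addr addr;		/* address assigned to the guest */
	struct in_addr addr_seen;	/* address we've observed the guest using */
};

/* Can the guest reach a host-side address @a under that same address?
 * Returns false for addresses that are inaccessible to the guest, or
 * that mean a different location from the guest's point of view.
 */
bool guest_accessible4_sketch(const struct ip4_sketch *ip4,
			      const struct in_addr *a)
{
	uint32_t h;

	assert(ip4 && a);
	h = ntohl(a->s_addr);

	if (h == INADDR_ANY)		/* unspecified: not a usable address */
		return false;
	if ((h >> 24) == 127)		/* 127.0.0.0/8 is the guest itself, as seen by the guest */
		return false;
	if (a->s_addr == ip4->addr.s_addr)	/* guest's assigned address */
		return false;
	if (a->s_addr == ip4->addr_seen.s_addr)	/* guest's observed address */
		return false;

	return true;
}
```

Checking both the assigned and the observed address mirrors the consistency fix mentioned in the commit message.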
On Fri, 16 Aug 2024 15:39:57 +1000 David Gibson wrote:
We usually avoid NAT, but in a few cases we need to apply address translations. For inbound connections that happens for addresses which make sense to the host but are either inaccessible, or mean a different location from the guest's point of view.
Add some helper functions to determine such addresses, and use them in fwd_nat_from_host(). In doing so clarify some of the reasons for the logic. We'll also have further use for these helpers in future.
While we're there fix one unnecessary inconsistency between IPv4 and IPv6. We always translated the guest's observed address, but for IPv4 we didn't translate the guest's assigned address, whereas for IPv6 we did. Change this to translate both in all cases for consistency.
Signed-off-by: David Gibson
---
 fwd.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 87 insertions(+), 11 deletions(-)

diff --git a/fwd.c b/fwd.c
index 75dc0151..1baae338 100644
--- a/fwd.c
+++ b/fwd.c
@@ -170,6 +170,85 @@ static bool is_dns_flow(uint8_t proto, const struct flowside *ini)
 		((ini->oport == 53) || (ini->oport == 853));
 }
 
+/**
+ * fwd_guest_accessible4() - Is IPv4 address guest accessible
Nit: I wonder if we should say "guest-accessible" in all these cases, it's a bit easier for me to decode, but not necessarily more correct. It's fine by me either way. -- Stefano
On Tue, Aug 20, 2024 at 09:56:18PM +0200, Stefano Brivio wrote:
On Fri, 16 Aug 2024 15:39:57 +1000 David Gibson wrote:
We usually avoid NAT, but in a few cases we need to apply address translations. For inbound connections that happens for addresses which make sense to the host but are either inaccessible, or mean a different location from the guest's point of view.
Add some helper functions to determine such addresses, and use them in fwd_nat_from_host(). In doing so clarify some of the reasons for the logic. We'll also have further use for these helpers in future.
While we're there fix one unnecessary inconsistency between IPv4 and IPv6. We always translated the guest's observed address, but for IPv4 we didn't translate the guest's assigned address, whereas for IPv6 we did. Change this to translate both in all cases for consistency.
Signed-off-by: David Gibson
---
 fwd.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 87 insertions(+), 11 deletions(-)

diff --git a/fwd.c b/fwd.c
index 75dc0151..1baae338 100644
--- a/fwd.c
+++ b/fwd.c
@@ -170,6 +170,85 @@ static bool is_dns_flow(uint8_t proto, const struct flowside *ini)
 		((ini->oport == 53) || (ini->oport == 853));
 }
 
+/**
+ * fwd_guest_accessible4() - Is IPv4 address guest accessible
Nit: I wonder if we should say "guest-accessible" in all these cases, it's a bit easier for me to decode, but not necessarily more correct. It's fine by me either way.
Just adding the hyphen? Sure, done.
ip4.gw conflates 3 conceptually different things, which (for now) have the
same value:
1. The router/gateway address as seen by the guest
2. An address to NAT to the host if --no-map-gw isn't specified
3. An address to use as source when nothing else makes sense
Case 3 occurs in two situations:
a) for our DHCP responses - since they come from passt internally there's
no naturally meaningful address for them to come from
b) for forwarded connections coming from an address that isn't guest
accessible (localhost or the guest's own address).
(b) occurs even with --no-map-gw, and the expected behaviour of forwarding
local connections requires it.
For IPv6 role (3) is now taken by ip6.our_tap_ll (which usually has the
same value as ip6.gw). For future flexibility we may want to make this
"address of last resort" different from the gateway address, so split them
logically for IPv4 as well.
Specifically, add a new ip4.our_tap_addr field for the address with this
role, and initialise it to ip4.gw for now. Unlike IPv6 where we can always
get a link-local address, we might not be able to get a (non 0.0.0.0)
address here. In that case we have to disable DHCP and forwarding of
inbound connections with guest-inaccessible source addresses.
Signed-off-by: David Gibson
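To make role (3) concrete, here is a minimal sketch of how a separate "address of last resort" might be used when picking the source address for a forwarded inbound connection. All names (`struct ip4_sketch`, `init_our_tap_addr()`, `tap_source4_sketch()`) are hypothetical, and the caller is assumed to have already decided whether the host-side source is guest-accessible.

```c
#include <assert.h>
#include <stdbool.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical, simplified stand-in for the IPv4 part of the context */
struct ip4_sketch {
	struct in_addr gw;		/* gateway, as seen by the guest */
	struct in_addr our_tap_addr;	/* source address when nothing else makes sense */
};

/* For now the address of last resort is simply the gateway address */
void init_our_tap_addr(struct ip4_sketch *ip4)
{
	assert(ip4);
	ip4->our_tap_addr = ip4->gw;
}

/* Pick the source address the guest will see for an inbound connection.
 * Returns false if the connection can't be forwarded, because the real
 * source isn't guest-accessible and we have no address to substitute.
 */
bool tap_source4_sketch(const struct ip4_sketch *ip4, struct in_addr host_src,
			bool src_guest_accessible, struct in_addr *tap_src)
{
	assert(ip4 && tap_src);

	if (src_guest_accessible) {
		*tap_src = host_src;	/* no translation needed */
		return true;
	}

	if (ip4->our_tap_addr.s_addr == htonl(INADDR_ANY))
		return false;		/* nothing sensible to use as source */

	*tap_src = ip4->our_tap_addr;
	return true;
}
```

Keeping `our_tap_addr` separate from `gw` means a later change can give it a different value without touching the callers.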
On Fri, 16 Aug 2024 15:39:58 +1000 David Gibson wrote:
ip4.gw conflates 3 conceptually different things, which (for now) have the same value:
1. The router/gateway address as seen by the guest
2. An address to NAT to the host if --no-map-gw isn't specified
3. An address to use as source when nothing else makes sense
Case 3 occurs in two situations:
a) for our DHCP responses - since they come from passt internally there's no naturally meaningful address for them to come from
b) for forwarded connections coming from an address that isn't guest accessible (localhost or the guest's own address).
(b) occurs even with --no-map-gw, and the expected behaviour of forwarding local connections requires it.
For IPv6 role (3) is now taken by ip6.our_tap_ll (which usually has the same value as ip6.gw). For future flexibility we may want to make this "address of last resort" different from the gateway address, so split them logically for IPv4 as well.
Specifically, add a new ip4.our_tap_addr field for the address with this role, and initialise it to ip4.gw for now. Unlike IPv6 where we can always get a link-local address, we might not be able to get a (non 0.0.0.0) address here. In that case we have to disable DHCP
It's not entirely clear to me in which case we would not be able to get any address, but at least RFC 2131 doesn't have a problem with this:

diff --git a/dhcp.c b/dhcp.c
index aa9f59d..3de8a6e 100644
--- a/dhcp.c
+++ b/dhcp.c
@@ -282,6 +282,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
 	struct in_addr mask;
 	unsigned int i;
 	struct msg *m;
+	struct in_addr zeroes = { 0 };
 
 	eh = packet_get(p, 0, offset, sizeof(*eh), NULL);
 	offset += sizeof(*eh);
@@ -378,7 +379,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
 		opt_set_dns_search(c, sizeof(m->o));
 
 	dlen = offsetof(struct msg, o) + fill(m);
-	tap_udp4_send(c, c->ip4.gw, 67, c->ip4.addr, 68, m, dlen);
+	tap_udp4_send(c, zeroes, 67, c->ip4.addr, 68, m, dlen);
 
 	return 1;
 }

and:

$ ./pasta -p dhcp.pcap
Saving packet capture to dhcp.pcap
# dhclient
# tshark -r dhcp.pcap
Running as user "root" and group "root". This could be dangerous.
    1 0.000000 :: → ff02::16 ICMPv6 90 Multicast Listener Report Message v2
    2 0.016265 0.0.0.0 → 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0x75759d11
    3 0.016361 0.0.0.0 → 88.198.0.164 DHCP 342 DHCP Offer - Transaction ID 0x75759d11
    4 0.016479 0.0.0.0 → 255.255.255.255 DHCP 342 DHCP Request - Transaction ID 0x75759d11
    5 0.016493 0.0.0.0 → 88.198.0.164 DHCP 342 DHCP ACK - Transaction ID 0x75759d11
[...]

so this could be a reasonable fallback.
and forwarding of inbound connections with guest-inaccessible source addresses.
Signed-off-by: David Gibson
---
 conf.c  |  7 ++++++-
 dhcp.c  |  4 ++--
 fwd.c   | 10 +++++++---
 passt.h |  2 ++
 4 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/conf.c b/conf.c
index 954f20ea..9f962fc8 100644
--- a/conf.c
+++ b/conf.c
@@ -660,6 +660,8 @@ static unsigned int conf_ip4(unsigned int ifi,
 
 	ip4->addr_seen = ip4->addr;
 
+	ip4->our_tap_addr = ip4->gw;
+
 	if (MAC_IS_ZERO(mac)) {
 		int rc = nl_link_get_mac(nl_sock, ifi, mac);
 		if (rc < 0) {
@@ -1666,7 +1668,10 @@ void conf(struct ctx *c, int argc, char **argv)
 		die("External interface not usable");
 
 	if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.gw))
-		c->no_map_gw = c->no_dhcp = 1;
+		c->no_map_gw = 1;
+
+	if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr))
+		c->no_dhcp = 1;
 
 	if (c->ifi6 && IN6_IS_ADDR_UNSPECIFIED(&c->ip6.gw))
 		c->no_map_gw = 1;
diff --git a/dhcp.c b/dhcp.c
index acc5b03e..a935dc94 100644
--- a/dhcp.c
+++ b/dhcp.c
@@ -347,7 +347,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
 	mask.s_addr = htonl(0xffffffff << (32 - c->ip4.prefix_len));
 	memcpy(opts[1].s, &mask, sizeof(mask));
 	memcpy(opts[3].s, &c->ip4.gw, sizeof(c->ip4.gw));
-	memcpy(opts[54].s, &c->ip4.gw, sizeof(c->ip4.gw));
+	memcpy(opts[54].s, &c->ip4.our_tap_addr, sizeof(c->ip4.our_tap_addr));
Nit: this was supposed to look like a table, so it would be nice to add extra whitespace in the lines above this one. -- Stefano
On Tue, Aug 20, 2024 at 09:56:24PM +0200, Stefano Brivio wrote:
On Fri, 16 Aug 2024 15:39:58 +1000 David Gibson wrote:
ip4.gw conflates 3 conceptually different things, which (for now) have the same value:
1. The router/gateway address as seen by the guest
2. An address to NAT to the host if --no-map-gw isn't specified
3. An address to use as source when nothing else makes sense
Case 3 occurs in two situations:
a) for our DHCP responses - since they come from passt internally there's no naturally meaningful address for them to come from
b) for forwarded connections coming from an address that isn't guest accessible (localhost or the guest's own address).
(b) occurs even with --no-map-gw, and the expected behaviour of forwarding local connections requires it.
For IPv6 role (3) is now taken by ip6.our_tap_ll (which usually has the same value as ip6.gw). For future flexibility we may want to make this "address of last resort" different from the gateway address, so split them logically for IPv4 as well.
Specifically, add a new ip4.our_tap_addr field for the address with this role, and initialise it to ip4.gw for now. Unlike IPv6 where we can always get a link-local address, we might not be able to get a (non 0.0.0.0) address here. In that case we have to disable DHCP
It's not entirely clear to me in which case we would not be able to get any address,
Currently, when we don't have a gateway address on the host: no connectivity, or a point-to-point link with no gateway, or the like. We used to absolutely require it, but that restriction has been eased and may ease further in future.
but at least RFC 2131 doesn't have a problem with this:
diff --git a/dhcp.c b/dhcp.c
index aa9f59d..3de8a6e 100644
--- a/dhcp.c
+++ b/dhcp.c
@@ -282,6 +282,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
 	struct in_addr mask;
 	unsigned int i;
 	struct msg *m;
+	struct in_addr zeroes = { 0 };
 
 	eh = packet_get(p, 0, offset, sizeof(*eh), NULL);
 	offset += sizeof(*eh);
@@ -378,7 +379,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
 		opt_set_dns_search(c, sizeof(m->o));
 
 	dlen = offsetof(struct msg, o) + fill(m);
-	tap_udp4_send(c, c->ip4.gw, 67, c->ip4.addr, 68, m, dlen);
+	tap_udp4_send(c, zeroes, 67, c->ip4.addr, 68, m, dlen);
 
 	return 1;
 }
and:
$ ./pasta -p dhcp.pcap
Saving packet capture to dhcp.pcap
# dhclient
# tshark -r dhcp.pcap
Running as user "root" and group "root". This could be dangerous.
    1 0.000000 :: → ff02::16 ICMPv6 90 Multicast Listener Report Message v2
    2 0.016265 0.0.0.0 → 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0x75759d11
    3 0.016361 0.0.0.0 → 88.198.0.164 DHCP 342 DHCP Offer - Transaction ID 0x75759d11
    4 0.016479 0.0.0.0 → 255.255.255.255 DHCP 342 DHCP Request - Transaction ID 0x75759d11
    5 0.016493 0.0.0.0 → 88.198.0.164 DHCP 342 DHCP ACK - Transaction ID 0x75759d11
[...]
so this could be a reasonable fallback.
Fair point. I've removed the disabling of DHCP in this case.
and forwarding of inbound connections with guest-inaccessible source addresses.
Signed-off-by: David Gibson
---
 conf.c  |  7 ++++++-
 dhcp.c  |  4 ++--
 fwd.c   | 10 +++++++---
 passt.h |  2 ++
 4 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/conf.c b/conf.c
index 954f20ea..9f962fc8 100644
--- a/conf.c
+++ b/conf.c
@@ -660,6 +660,8 @@ static unsigned int conf_ip4(unsigned int ifi,
 
 	ip4->addr_seen = ip4->addr;
 
+	ip4->our_tap_addr = ip4->gw;
+
 	if (MAC_IS_ZERO(mac)) {
 		int rc = nl_link_get_mac(nl_sock, ifi, mac);
 		if (rc < 0) {
@@ -1666,7 +1668,10 @@ void conf(struct ctx *c, int argc, char **argv)
 		die("External interface not usable");
 
 	if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.gw))
-		c->no_map_gw = c->no_dhcp = 1;
+		c->no_map_gw = 1;
+
+	if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr))
+		c->no_dhcp = 1;
 
 	if (c->ifi6 && IN6_IS_ADDR_UNSPECIFIED(&c->ip6.gw))
 		c->no_map_gw = 1;
diff --git a/dhcp.c b/dhcp.c
index acc5b03e..a935dc94 100644
--- a/dhcp.c
+++ b/dhcp.c
@@ -347,7 +347,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
 	mask.s_addr = htonl(0xffffffff << (32 - c->ip4.prefix_len));
 	memcpy(opts[1].s, &mask, sizeof(mask));
 	memcpy(opts[3].s, &c->ip4.gw, sizeof(c->ip4.gw));
-	memcpy(opts[54].s, &c->ip4.gw, sizeof(c->ip4.gw));
+	memcpy(opts[54].s, &c->ip4.our_tap_addr, sizeof(c->ip4.our_tap_addr));
Nit: this was supposed to look like a table, so it would be nice to add extra whitespace in the lines above this one.
Makes sense, done.
When sending frames to the guest over the tap link, we need a source MAC
address. Currently we take that from the MAC address of the main interface
on the host, but that doesn't actually make much sense:
* We can't preserve the real MAC address of packets from anywhere
external so there's no transparency case here
* In fact, it's confusingly different from how we handle IP addresses:
whereas we give the guest the same IP as the host, we're making the
host's MAC the one MAC that the guest *can't* use for itself.
* We already need a fallback case if the host doesn't have an Ethernet
like MAC (e.g. if it's connected via a point to point interface, such
as a wireguard VPN).
Change to just use an arbitrary fixed MAC address - I've picked
9a:55:9a:55:9a:55. It's simpler and has the small advantage of making
the fact that passt/pasta is in use typically obvious from guest side
packet dumps. This can still, of course, be overridden with the -M option.
Signed-off-by: David Gibson
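A minimal sketch of the resulting behaviour, for illustration only: the fixed address is the one quoted above, while `init_our_tap_mac()` and its `override` parameter (standing in for a MAC given with -M) are hypothetical names, not the patch's code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN_SKETCH	6

/* The arbitrary fixed address picked in the commit message */
static const uint8_t default_tap_mac[MAC_LEN_SKETCH] = {
	0x9a, 0x55, 0x9a, 0x55, 0x9a, 0x55
};

/* Set the source MAC we use on the tap link: the user-supplied
 * @override if there is one (NULL otherwise), else the fixed default.
 * The host interface's MAC is never consulted.
 */
void init_our_tap_mac(uint8_t mac[MAC_LEN_SKETCH], const uint8_t *override)
{
	assert(mac);
	memcpy(mac, override ? override : default_tap_mac, MAC_LEN_SKETCH);
}
```

Because no host lookup is involved, there is no separate fallback path for hosts without an Ethernet-like MAC.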
The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default
gateway. We use this for two quite distinct things: advertising the
gateway that the guest should use (via DHCP, NDP and/or --config-net)
and for a limited form of NAT. So that the guest can access services
on the host, we map the gateway address within the guest to the
loopback address on the host.
Using the gateway address for this isn't necessarily the best choice
for this purpose, certainly not for all circumstances. So, start off
by splitting the notion of these into two different values: @guest_gw
which is the gateway address the guest should use and @nat_host_loopback,
which is the guest visible address to remap to the host's loopback.
Usually nat_host_loopback will have the same value as guest_gw. However
when --no-map-gw is specified we leave them unspecified instead. This
means when we use nat_host_loopback, we don't need to separately check
c->no_map_gw to see if it's relevant.
Signed-off-by: David Gibson
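The last point can be sketched as below, IPv4 only. This is an illustration under assumptions rather than the patch itself: the field names follow the commit message, but `struct ip4_sketch`, `init_nat_host_loopback4()` and `dst_maps_to_host4()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical, simplified stand-in for the IPv4 part of the context */
struct ip4_sketch {
	struct in_addr guest_gw;		/* gateway the guest should use */
	struct in_addr nat_host_loopback;	/* guest-visible address remapped to
						 * host loopback, 0.0.0.0 for none */
};

/* Fold the --no-map-gw decision into the address itself */
void init_nat_host_loopback4(struct ip4_sketch *ip4, bool no_map_gw)
{
	assert(ip4);

	if (no_map_gw)
		ip4->nat_host_loopback.s_addr = htonl(INADDR_ANY);
	else
		ip4->nat_host_loopback = ip4->guest_gw;
}

/* Should a guest packet to @dst be redirected to the host's loopback?
 * No separate no_map_gw check is needed: an unspecified
 * nat_host_loopback means nothing is mapped.
 */
bool dst_maps_to_host4(const struct ip4_sketch *ip4, struct in_addr dst)
{
	assert(ip4);

	if (ip4->nat_host_loopback.s_addr == htonl(INADDR_ANY))
		return false;

	return dst.s_addr == ip4->nat_host_loopback.s_addr;
}
```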
On Fri, 16 Aug 2024 15:40:00 +1000 David Gibson wrote:
The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default gateway. We use this for two quite distinct things: advertising the gateway that the guest should use (via DHCP, NDP and/or --config-net) and for a limited form of NAT. So that the guest can access services on the host, we map the gateway address within the guest to the loopback address on the host.
Using the gateway address for this isn't necessarily the best choice for this purpose, certainly not for all circumstances. So, start off by splitting the notion of these into two different values: @guest_gw which is the gateway address the guest should use and @nat_host_loopback, which is the guest visible address to remap to the host's loopback.
Usually nat_host_loopback will have the same value as guest_gw. However when --no-map-gw is specified we leave them unspecified instead. This means when we use nat_host_loopback, we don't need to separately check c->no_map_gw to see if it's relevant.
Signed-off-by: David Gibson
---
 conf.c  | 60 +++++++++++++++++++++++++++++----------------------------
 dhcp.c  | 10 ++++++----
 fwd.c   |  4 ++--
 passt.h | 16 +++++++++------
 pasta.c |  6 ++++--
 5 files changed, 53 insertions(+), 43 deletions(-)

diff --git a/conf.c b/conf.c
index b1c58d5b..26373584 100644
--- a/conf.c
+++ b/conf.c
@@ -410,12 +410,12 @@ static void add_dns_resolv(struct ctx *c, const char *nameserver,
 		 * redirect
 		 */
 		if (IN4_IS_ADDR_LOOPBACK(&ns4)) {
-			if (c->no_map_gw)
+			if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback))
If you change the command-line option name to use "map", it would be good to also change these names. -- Stefano
On Tue, Aug 20, 2024 at 09:56:31PM +0200, Stefano Brivio wrote:
On Fri, 16 Aug 2024 15:40:00 +1000 David Gibson wrote:
The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default gateway. We use this for two quite distinct things: advertising the gateway that the guest should use (via DHCP, NDP and/or --config-net) and for a limited form of NAT. So that the guest can access services on the host, we map the gateway address within the guest to the loopback address on the host.
Using the gateway address for this isn't necessarily the best choice for this purpose, certainly not for all circumstances. So, start off by splitting the notion of these into two different values: @guest_gw which is the gateway address the guest should use and @nat_host_loopback, which is the guest visible address to remap to the host's loopback.
Usually nat_host_loopback will have the same value as guest_gw. However when --no-map-gw is specified we leave them unspecified instead. This means when we use nat_host_loopback, we don't need to separately check c->no_map_gw to see if it's relevant.
Signed-off-by: David Gibson
---
 conf.c  | 60 +++++++++++++++++++++++++++++----------------------------
 dhcp.c  | 10 ++++++----
 fwd.c   |  4 ++--
 passt.h | 16 +++++++++------
 pasta.c |  6 ++++--
 5 files changed, 53 insertions(+), 43 deletions(-)

diff --git a/conf.c b/conf.c
index b1c58d5b..26373584 100644
--- a/conf.c
+++ b/conf.c
@@ -410,12 +410,12 @@ static void add_dns_resolv(struct ctx *c, const char *nameserver,
 		 * redirect
 		 */
 		if (IN4_IS_ADDR_LOOPBACK(&ns4)) {
-			if (c->no_map_gw)
+			if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback))
If you change the command-line option name to use "map", it would be good to also change these names.
Will do.
Because the host and guest share the same IP address with passt/pasta, it's
not possible for the guest to directly address the host. Therefore we
allow packets from the guest going to a special "NAT to host" address to be
redirected to the host, appearing there as though they have both source and
destination address of loopback.
Currently that special address is always the address of the default
gateway (or none). That can be a problem if we want that gateway to be
addressable by the guest. Therefore, allow the special "NAT to host"
address to be overridden on the command line with a new --nat-host-loopback
option.
In order to exercise and test it, update the passt_in_ns and perf
tests to use this option and give different mapping addresses for the
two layers of the environment.
Signed-off-by: David Gibson
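For illustration, the argument handling for such an option might look roughly like the sketch below. It is not the patch's conf_nat(): `struct nat_sketch` and `parse_nat_host_sketch()` are hypothetical, errors are reported through the return value instead of die(), and the set of rejected addresses (unspecified, loopback, multicast) is taken from the description in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical holder for the parsed option state */
struct nat_sketch {
	struct in_addr nat4;	/* IPv4 address remapped to the host, or 0.0.0.0 */
	struct in6_addr nat6;	/* IPv6 address remapped to the host, or :: */
	bool no_map_gw;		/* set when "none" is given */
};

/* Parse one option argument: "none", an IPv6 address or an IPv4 address.
 * Returns 0 on success, -1 if @arg isn't usable as a remapped address.
 */
int parse_nat_host_sketch(struct nat_sketch *n, const char *arg)
{
	struct in6_addr a6;
	struct in_addr a4;

	assert(n && arg);

	if (strcmp(arg, "none") == 0) {
		n->nat4.s_addr = htonl(INADDR_ANY);
		n->nat6 = in6addr_any;
		n->no_map_gw = true;
		return 0;	/* nothing more to parse */
	}

	if (inet_pton(AF_INET6, arg, &a6) == 1) {
		if (IN6_IS_ADDR_UNSPECIFIED(&a6) || IN6_IS_ADDR_LOOPBACK(&a6) ||
		    IN6_IS_ADDR_MULTICAST(&a6))
			return -1;
		n->nat6 = a6;
		return 0;
	}

	if (inet_pton(AF_INET, arg, &a4) == 1) {
		uint32_t h = ntohl(a4.s_addr);

		/* Reject 0.0.0.0, 127.0.0.0/8 and 224.0.0.0/4 */
		if (h == INADDR_ANY || (h >> 24) == 127 || (h >> 28) == 0xe)
			return -1;
		n->nat4 = a4;
		return 0;
	}

	return -1;
}
```

Calling it once per occurrence of the option gives the "one IPv4 and one IPv6 address" behaviour: a later argument of the same family simply overwrites the earlier one.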
On Fri, 16 Aug 2024 15:40:01 +1000 David Gibson wrote:
Because the host and guest share the same IP address with passt/pasta, it's not possible for the guest to directly address the host. Therefore we allow packets from the guest going to a special "NAT to host" address to be redirected to the host, appearing there as though they have both source and destination address of loopback.
Currently that special address is always the address of the default gateway (or none). That can be a problem if we want that gateway to be addressable by the guest. Therefore, allow the special "NAT to host" address to be overridden on the command line with a new --nat-host-loopback option.
In order to exercise and test it, update the passt_in_ns and perf tests to use this option and give different mapping addresses for the two layers of the environment.
Signed-off-by: David Gibson
--- conf.c | 57 +++++++++++++++++++++++++++++++-- passt.1 | 16 ++++++++++ test/lib/setup | 11 +++++-- test/passt_in_ns/dhcp | 73 +++++++++++++++++++++++++++++++++++++++++++ test/passt_in_ns/tcp | 38 +++++++++++----------- test/passt_in_ns/udp | 22 +++++++------ test/perf/passt_tcp | 33 +++++++++---------- test/perf/passt_udp | 31 +++++++++--------- test/perf/pasta_tcp | 29 ++++++++--------- test/perf/pasta_udp | 25 ++++++++------- test/run | 4 +-- 11 files changed, 244 insertions(+), 95 deletions(-) create mode 100644 test/passt_in_ns/dhcp diff --git a/conf.c b/conf.c index 26373584..c5831e82 100644 --- a/conf.c +++ b/conf.c @@ -817,6 +817,14 @@ static void usage(const char *name, FILE *f, int status) fprintf(f, " --no-dhcp-search No list in DHCP/DHCPv6/NDP\n");
fprintf(f, + " --nat-host-loopback ADDR NAT ADDR to refer to host\n" + " Packets from the guest to ADDR will be redirected to the\n" + " host. On the host such packets will appear to have both\n" + " source and destination of loopback (127.0.0.1 or ::1).\n"
I would leave these three lines to the man page. The help message is already 90 lines long. This should be a quick guide/reminder, not a full description. This reminds me that 127.0.0.1 isn't the only IPv4 loopback address. I don't know if anybody will ever have a use case where they would need a different, specific, loopback source address, but, together with --nat-guest-addr from 22/22, I start wondering: what if we had a single option taking, optionally, an arbitrary (within limits) source address? Now, given that we plan to add a configurable flow table at some point in the future, it makes no sense to make this exceedingly flexible. But I just wanted to bring this up for consideration, in case it's doable at a small cost (I'm really not sure): --map-host [source,]address where "source" would default to 127.0.0.1, but it could also be another loopback address, or another address altogether (and we'll fail if it's not local, of course). If we want (can?) go that way and keep equivalent functionality as you have now, we would have the additional problem that this option could be given up to two times (one for loopback, one for non-loopback), and not more (we don't have a data structure ready for an arbitrary number of those), so it's not as generic as it might look like, and I'm not sure if it's a good idea. But we could also expand on it in the future.
+ " ADDR can be 'none', in which case nothing is mapped\n"
This is a nice feature by the way as it should eventually allow us to get consistent options in Podman instead of "--map-gw": Podman could add by default '--map-host-loopback none', unless the user overrides that with an actual address.
+ " Can be specified zero to two (for IPv4 and IPv6)\n"
"can" (for consistency, but also because the subject is still the option, this is not a separate sentence). ...times.
+ " default: gateway address, or none if --no-map-gw is also\n" + " specified\n"
I don't think we need to mention here that --no-map-gw implies none, doing it in the man page is enough.
" --dns-forward ADDR Forward DNS queries sent to ADDR\n" " can be specified zero to two times (for IPv4 and IPv6)\n" " default: don't forward DNS queries\n" @@ -959,6 +967,11 @@ static void conf_print(const struct ctx *c) info(" host: %s", eth_ntop(c->our_tap_mac, bufmac, sizeof(bufmac)));
if (c->ifi4) { + if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback)) + info(" NAT to host 127.0.0.1: %s", + inet_ntop(AF_INET, &c->ip4.nat_host_loopback, + buf4, sizeof(buf4))); + if (!c->no_dhcp) { uint32_t mask;
@@ -989,6 +1002,11 @@ static void conf_print(const struct ctx *c) }
if (c->ifi6) { + if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback)) + info(" NAT to host ::1: %s", + inet_ntop(AF_INET6, &c->ip6.nat_host_loopback, + buf6, sizeof(buf6))); + if (!c->no_ndp && !c->no_dhcpv6) info("NDP/DHCPv6:"); else if (!c->no_ndp) @@ -1122,6 +1140,35 @@ static void conf_ugid(char *runas, uid_t *uid, gid_t *gid) } }
+/** + * conf_nat() - Parse --nat-host-loopback option + * @c: Execution context + * @arg: String argument to --nat-host-loopback + * @no_map_gw: --no-map-gw flag, updated for "none" argument + */ +static void conf_nat(struct ctx *c, const char *arg, int *no_map_gw) +{ + if (strcmp(arg, "none") == 0) { + c->ip4.nat_host_loopback = in4addr_any; + c->ip6.nat_host_loopback = in6addr_any; + *no_map_gw = 1; + } + + if (inet_pton(AF_INET6, arg, &c->ip6.nat_host_loopback) && + !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) && + !IN6_IS_ADDR_LOOPBACK(&c->ip6.nat_host_loopback) && + !IN6_IS_ADDR_MULTICAST(&c->ip6.nat_host_loopback)) + return; + + if (inet_pton(AF_INET, arg, &c->ip4.nat_host_loopback) && + !IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback) && + !IN4_IS_ADDR_LOOPBACK(&c->ip4.nat_host_loopback) && + !IN4_IS_ADDR_MULTICAST(&c->ip4.nat_host_loopback)) + return; + + die("Invalid address to remap to host: %s", optarg); +} + /** * conf_open_files() - Open files as requested by configuration * @c: Execution context @@ -1231,6 +1278,7 @@ void conf(struct ctx *c, int argc, char **argv) {"no-copy-routes", no_argument, NULL, 18 }, {"no-copy-addrs", no_argument, NULL, 19 }, {"netns-only", no_argument, NULL, 20 }, + {"nat-host-loopback", required_argument, NULL, 21 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1400,6 +1448,9 @@ void conf(struct ctx *c, int argc, char **argv) netns_only = 1; *userns = 0; break; + case 21: + conf_nat(c, optarg, &no_map_gw); + break; case 'd': c->debug = 1; c->quiet = 0; @@ -1639,10 +1690,12 @@ void conf(struct ctx *c, int argc, char **argv) (*c->ip6.ifname_out && !c->ifi6)) die("External interface not usable");
- if (c->ifi4 && !no_map_gw) + if (c->ifi4 && !no_map_gw && + IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback)) c->ip4.nat_host_loopback = c->ip4.guest_gw;
- if (c->ifi6 && !no_map_gw) + if (c->ifi6 && !no_map_gw && + IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback)) c->ip6.nat_host_loopback = c->ip6.guest_gw;
if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr)) diff --git a/passt.1 b/passt.1 index dca433b6..3680056a 100644 --- a/passt.1 +++ b/passt.1 @@ -327,6 +327,22 @@ namespace will be silently dropped. Disable Router Advertisements. Router Solicitations coming from guest or target namespace will be ignored.
+.TP +.BR \-\-nat-host-loopback " " \fIaddr +Translate \fIaddr\fR to refer to the host. Packets from the guest to +\fIaddr\fR will be redirected to the host. On the host such packets +will appear to have both source and destination of loopback (127.0.0.1
I would skip "of loopback" and just say "127.0.0.1 or ::1", to avoid implying that there's a single loopback address for IPv4.
+or ::1). + +If \fIaddr\fR is 'none', no address is mapped (this implies +\fB--no-map-gw\fR). Only one IPv4 and one IPv6 address can be +translated, if the option is specified multiple times, the last one +takes effect. + +Default is to translate the guest's default gateway address, unless +\fB--no-map-gw\fR is also given, in which case no address is mapped by
Why "also"? You're describing the default, so I guess this option is not actually given in that case.
+default. + .TP .BR \-\-no-map-gw Don't remap TCP connections and untracked UDP traffic, with the gateway address diff --git a/test/lib/setup b/test/lib/setup index 9b39b9fe..061bf997 100755 --- a/test/lib/setup +++ b/test/lib/setup @@ -124,7 +124,12 @@ setup_passt_in_ns() { [ ${DEBUG} -eq 1 ] && __opts="${__opts} -d" [ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
- context_run_bg pasta "./pasta ${__opts} -t 10001,10002,10011,10012 -T 10003,10013 -u 10001,10002,10011,10012 -U 10003,10013 -P ${STATESETUP}/pasta.pid --config-net ${NSTOOL} hold ${STATESETUP}/ns.hold" + __nat_host4=192.0.2.1 + __nat_host6=2001:db8:9a55::1 + __nat_ns4=192.0.2.2 + __nat_ns6=2001:db8:9a55::2 + + context_run_bg pasta "./pasta ${__opts} -t 10001,10002,10011,10012 -T 10003,10013 -u 10001,10002,10011,10012 -U 10003,10013 -P ${STATESETUP}/pasta.pid --nat-host-loopback ${__nat_host4} --nat-host-loopback ${__nat_host6} --config-net ${NSTOOL} hold ${STATESETUP}/ns.hold" wait_for [ -f "${STATESETUP}/pasta.pid" ]
context_setup_nstool qemu ${STATESETUP}/ns.hold @@ -139,11 +144,11 @@ setup_passt_in_ns() { if [ ${VALGRIND} -eq 1 ]; then context_run passt "make clean" context_run passt "make valgrind" - context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid" + context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --nat-host-loopback ${__nat_ns4} --nat-host-loopback ${__nat_ns6}" else context_run passt "make clean" context_run passt "make" - context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid" + context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --nat-host-loopback ${__nat_ns4} --nat-host-loopback ${__nat_ns6}" fi wait_for [ -f "${STATESETUP}/passt.pid" ]
diff --git a/test/passt_in_ns/dhcp b/test/passt_in_ns/dhcp new file mode 100644 index 00000000..48c7d197 --- /dev/null +++ b/test/passt_in_ns/dhcp
...how did this happen? This file already exists. -- Stefano
On Tue, Aug 20, 2024 at 09:56:34PM +0200, Stefano Brivio wrote:
On Fri, 16 Aug 2024 15:40:01 +1000 David Gibson wrote:
Because the host and guest share the same IP address with passt/pasta, it's not possible for the guest to directly address the host. Therefore we allow packets from the guest going to a special "NAT to host" address to be redirected to the host, appearing there as though they have both source and destination address of loopback.
Currently that special address is always the address of the default gateway (or none). That can be a problem if we want that gateway to be addressable by the guest. Therefore, allow the special "NAT to host" address to be overridden on the command line with a new --nat-host-loopback option.
In order to exercise and test it, update the passt_in_ns and perf tests to use this option and give different mapping addresses for the two layers of the environment.
Signed-off-by: David Gibson
--- conf.c | 57 +++++++++++++++++++++++++++++++-- passt.1 | 16 ++++++++++ test/lib/setup | 11 +++++-- test/passt_in_ns/dhcp | 73 +++++++++++++++++++++++++++++++++++++++++++ test/passt_in_ns/tcp | 38 +++++++++++----------- test/passt_in_ns/udp | 22 +++++++------ test/perf/passt_tcp | 33 +++++++++---------- test/perf/passt_udp | 31 +++++++++--------- test/perf/pasta_tcp | 29 ++++++++--------- test/perf/pasta_udp | 25 ++++++++------- test/run | 4 +-- 11 files changed, 244 insertions(+), 95 deletions(-) create mode 100644 test/passt_in_ns/dhcp diff --git a/conf.c b/conf.c index 26373584..c5831e82 100644 --- a/conf.c +++ b/conf.c @@ -817,6 +817,14 @@ static void usage(const char *name, FILE *f, int status) fprintf(f, " --no-dhcp-search No list in DHCP/DHCPv6/NDP\n");
fprintf(f, + " --nat-host-loopback ADDR NAT ADDR to refer to host\n" + " Packets from the guest to ADDR will be redirected to the\n" + " host. On the host such packets will appear to have both\n" + " source and destination of loopback (127.0.0.1 or ::1).\n"
I would leave these three lines to the man page. The help message is already 90 lines long. This should be a quick guide/reminder, not a full description.
Good idea, done.
This reminds me that 127.0.0.1 isn't the only IPv4 loopback address. I don't know if anybody will ever have a use case where they would need a different, specific, loopback source address, but, together with
This is primarily about translation of outbound connections, so loopback is more the destination address than the source here.
--nat-guest-addr from 22/22, I start wondering: what if we had a single option taking, optionally, an arbitrary (within limits) source address?
I'd like to see that, but it's a more complex exercise - we'd need a table of NATs to step through. This series is just aiming to handle the most common cases for now.
Now, given that we plan to add a configurable flow table at some point in the future, it makes no sense to make this exceedingly flexible. But I just wanted to bring this up for consideration, in case it's doable at a small cost (I'm really not sure):
--map-host [source,]address
where "source" would default to 127.0.0.1, but it could also be another loopback address, or another address altogether (and we'll fail if it's not local, of course).
There's no particular reason it has to fail if non-local. Even if we have this in future, I think --map-guest-addr would still be useful because it avoids the user having to spell out what host address they expect the guest to take.
If we want to (can?) go that way and keep functionality equivalent to what you have now, we would have the additional problem that this option could be given up to two times (one for loopback, one for non-loopback), and not more (we don't have a data structure ready for an arbitrary number of those), so it's not as generic as it might look, and I'm not sure if it's a good idea. But we could also expand on it in the future.
Yeah, I see this more as a future extension.
+ " ADDR can be 'none', in which case nothing is mapped\n"
This is a nice feature by the way as it should eventually allow us to get consistent options in Podman instead of "--map-gw": Podman could add by default '--map-host-loopback none', unless the user overrides that with an actual address.
Exactly. The idea here is that we can eventually deprecate --no-map-gw in favour of --map-host-loopback=none.
+ " Can be specified zero to two (for IPv4 and IPv6)\n"
"can" (for consistency, but also because the subject is still the option, this is not a separate sentence).
Done.
...times.
And done.
+ " default: gateway address, or none if --no-map-gw is also\n" + " specified\n"
I don't think we need to mention here that --no-map-gw implies none, doing it in the man page is enough.
Done.
" --dns-forward ADDR Forward DNS queries sent to ADDR\n" " can be specified zero to two times (for IPv4 and IPv6)\n" " default: don't forward DNS queries\n" @@ -959,6 +967,11 @@ static void conf_print(const struct ctx *c) info(" host: %s", eth_ntop(c->our_tap_mac, bufmac, sizeof(bufmac)));
if (c->ifi4) { + if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback)) + info(" NAT to host 127.0.0.1: %s", + inet_ntop(AF_INET, &c->ip4.nat_host_loopback, + buf4, sizeof(buf4))); + if (!c->no_dhcp) { uint32_t mask;
@@ -989,6 +1002,11 @@ static void conf_print(const struct ctx *c) }
if (c->ifi6) { + if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback)) + info(" NAT to host ::1: %s", + inet_ntop(AF_INET6, &c->ip6.nat_host_loopback, + buf6, sizeof(buf6))); + if (!c->no_ndp && !c->no_dhcpv6) info("NDP/DHCPv6:"); else if (!c->no_ndp) @@ -1122,6 +1140,35 @@ static void conf_ugid(char *runas, uid_t *uid, gid_t *gid) } }
+/** + * conf_nat() - Parse --nat-host-loopback option + * @c: Execution context + * @arg: String argument to --nat-host-loopback + * @no_map_gw: --no-map-gw flag, updated for "none" argument + */ +static void conf_nat(struct ctx *c, const char *arg, int *no_map_gw) +{ + if (strcmp(arg, "none") == 0) { + c->ip4.nat_host_loopback = in4addr_any; + c->ip6.nat_host_loopback = in6addr_any; + *no_map_gw = 1; + } + + if (inet_pton(AF_INET6, arg, &c->ip6.nat_host_loopback) && + !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) && + !IN6_IS_ADDR_LOOPBACK(&c->ip6.nat_host_loopback) && + !IN6_IS_ADDR_MULTICAST(&c->ip6.nat_host_loopback)) + return; + + if (inet_pton(AF_INET, arg, &c->ip4.nat_host_loopback) && + !IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback) && + !IN4_IS_ADDR_LOOPBACK(&c->ip4.nat_host_loopback) && + !IN4_IS_ADDR_MULTICAST(&c->ip4.nat_host_loopback)) + return; + + die("Invalid address to remap to host: %s", optarg); +} + /** * conf_open_files() - Open files as requested by configuration * @c: Execution context @@ -1231,6 +1278,7 @@ void conf(struct ctx *c, int argc, char **argv) {"no-copy-routes", no_argument, NULL, 18 }, {"no-copy-addrs", no_argument, NULL, 19 }, {"netns-only", no_argument, NULL, 20 }, + {"nat-host-loopback", required_argument, NULL, 21 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1400,6 +1448,9 @@ void conf(struct ctx *c, int argc, char **argv) netns_only = 1; *userns = 0; break; + case 21: + conf_nat(c, optarg, &no_map_gw); + break; case 'd': c->debug = 1; c->quiet = 0; @@ -1639,10 +1690,12 @@ void conf(struct ctx *c, int argc, char **argv) (*c->ip6.ifname_out && !c->ifi6)) die("External interface not usable");
- if (c->ifi4 && !no_map_gw) + if (c->ifi4 && !no_map_gw && + IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback)) c->ip4.nat_host_loopback = c->ip4.guest_gw;
- if (c->ifi6 && !no_map_gw) + if (c->ifi6 && !no_map_gw && + IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback)) c->ip6.nat_host_loopback = c->ip6.guest_gw;
if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr)) diff --git a/passt.1 b/passt.1 index dca433b6..3680056a 100644 --- a/passt.1 +++ b/passt.1 @@ -327,6 +327,22 @@ namespace will be silently dropped. Disable Router Advertisements. Router Solicitations coming from guest or target namespace will be ignored.
+.TP +.BR \-\-nat-host-loopback " " \fIaddr +Translate \fIaddr\fR to refer to the host. Packets from the guest to +\fIaddr\fR will be redirected to the host. On the host such packets +will appear to have both source and destination of loopback (127.0.0.1
I would skip "of loopback" and just say "127.0.0.1 or ::1", to avoid implying that there's a single loopback address for IPv4.
Done.
+or ::1). + +If \fIaddr\fR is 'none', no address is mapped (this implies +\fB--no-map-gw\fR). Only one IPv4 and one IPv6 address can be +translated, if the option is specified multiple times, the last one +takes effect. + +Default is to translate the guest's default gateway address, unless +\fB--no-map-gw\fR is also given, in which case no address is mapped by
Why "also"? You're describing the default, so I guess this option is not actually given in that case.
Good point, fixed.
+default. + .TP .BR \-\-no-map-gw Don't remap TCP connections and untracked UDP traffic, with the gateway address diff --git a/test/lib/setup b/test/lib/setup index 9b39b9fe..061bf997 100755 --- a/test/lib/setup +++ b/test/lib/setup @@ -124,7 +124,12 @@ setup_passt_in_ns() { [ ${DEBUG} -eq 1 ] && __opts="${__opts} -d" [ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
- context_run_bg pasta "./pasta ${__opts} -t 10001,10002,10011,10012 -T 10003,10013 -u 10001,10002,10011,10012 -U 10003,10013 -P ${STATESETUP}/pasta.pid --config-net ${NSTOOL} hold ${STATESETUP}/ns.hold" + __nat_host4=192.0.2.1 + __nat_host6=2001:db8:9a55::1 + __nat_ns4=192.0.2.2 + __nat_ns6=2001:db8:9a55::2 + + context_run_bg pasta "./pasta ${__opts} -t 10001,10002,10011,10012 -T 10003,10013 -u 10001,10002,10011,10012 -U 10003,10013 -P ${STATESETUP}/pasta.pid --nat-host-loopback ${__nat_host4} --nat-host-loopback ${__nat_host6} --config-net ${NSTOOL} hold ${STATESETUP}/ns.hold" wait_for [ -f "${STATESETUP}/pasta.pid" ]
context_setup_nstool qemu ${STATESETUP}/ns.hold @@ -139,11 +144,11 @@ setup_passt_in_ns() { if [ ${VALGRIND} -eq 1 ]; then context_run passt "make clean" context_run passt "make valgrind" - context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid" + context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --nat-host-loopback ${__nat_ns4} --nat-host-loopback ${__nat_ns6}" else context_run passt "make clean" context_run passt "make" - context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid" + context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --nat-host-loopback ${__nat_ns4} --nat-host-loopback ${__nat_ns6}" fi wait_for [ -f "${STATESETUP}/passt.pid" ]
diff --git a/test/passt_in_ns/dhcp b/test/passt_in_ns/dhcp new file mode 100644 index 00000000..48c7d197 --- /dev/null +++ b/test/passt_in_ns/dhcp
...how did this happen? This file already exists.
No, it didn't. Previously we reused passt/dhcp for the passt_in_ns tests. With the change to the tests exercising the new option that doesn't work any more, because we need slightly different checks for DHCP to match what we expect when --map-host-loopback is used.
--
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
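As an aside on the option parsing reviewed in this patch, here is a minimal standalone sketch of the behaviour the commit message and man page describe: 'none' clears the mapping, and anything else has to be a usable (not unspecified, loopback or multicast) IPv4 or IPv6 address. struct nat_opt and nat_opt_parse() are hypothetical names made up for illustration, they are not passt's. The sketch returns right after handling 'none', so that string is never handed to inet_pton(); the hunk as quoted appears to continue into the address checks after that branch. It also parses into temporaries, so a rejected argument leaves the stored addresses untouched, which is a choice of this sketch rather than something taken from the patch.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the parsed option state, not passt's struct ctx */
struct nat_opt {
	struct in_addr addr4;	/* all-zero: no IPv4 address mapped */
	struct in6_addr addr6;	/* all-zero: no IPv6 address mapped */
	bool none;		/* "none" was given explicitly */
};

/* Return true if @arg was accepted, false if it isn't a usable address */
static bool nat_opt_parse(struct nat_opt *o, const char *arg)
{
	struct in6_addr a6;
	struct in_addr a4;

	assert(o && arg);

	if (strcmp(arg, "none") == 0) {
		memset(&o->addr4, 0, sizeof(o->addr4));
		memset(&o->addr6, 0, sizeof(o->addr6));
		o->none = true;
		return true;	/* "none" is not an address: stop here */
	}

	if (inet_pton(AF_INET6, arg, &a6) == 1) {
		if (IN6_IS_ADDR_UNSPECIFIED(&a6) ||
		    IN6_IS_ADDR_LOOPBACK(&a6) ||
		    IN6_IS_ADDR_MULTICAST(&a6))
			return false;
		o->addr6 = a6;
		return true;
	}

	if (inet_pton(AF_INET, arg, &a4) == 1) {
		uint32_t h = ntohl(a4.s_addr);

		/* Reject 0.0.0.0, 127.0.0.0/8 and 224.0.0.0/4 */
		if (h == 0 || (h >> 24) == 127 || (h >> 28) == 0xe)
			return false;
		o->addr4 = a4;
		return true;
	}

	return false;
}
```

A caller would invoke this once per occurrence of the option and fail with an error message when it returns false, so giving the option twice (once per address family) fills in both fields.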
fwd_nat_from_host() needs to adjust the source address for new flows coming
from an address which is not accessible to the guest. Currently we always
use our_tap_addr or our_tap_ll. However in cases where the address is
accessible to the guest via translation (i.e. via --nat-host-loopback) then
it makes more sense to use that translation, rather than the fallback
mapping of our_tap_*.
Signed-off-by: David Gibson
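The source address choice described in this commit message can be sketched roughly as below, for IPv4 only. This is an illustration under assumptions, not the code from fwd.c: struct nat4_in, inaccessible4() and src_for_guest4() are hypothetical names, and "not accessible to the guest" is reduced here to "unspecified or in 127.0.0.0/8", which is simpler than whatever the real helper checks.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, IPv4-only view of the configuration involved */
struct nat4_in {
	struct in_addr nat_host_loopback; /* guest-side alias of 127.0.0.1,
					   * all-zero if no mapping is set */
	struct in_addr our_tap_addr;	  /* fallback source address */
};

/* Simplified assumption: only unspecified and loopback are inaccessible */
static bool inaccessible4(struct in_addr a)
{
	uint32_t h = ntohl(a.s_addr);

	return h == 0 || (h >> 24) == 127;
}

/* Source address the guest should see for a new flow from @host_src */
static struct in_addr src_for_guest4(const struct nat4_in *c,
				     struct in_addr host_src)
{
	assert(c);

	/* Translatable: the guest can reach this address via its alias */
	if (c->nat_host_loopback.s_addr != htonl(INADDR_ANY) &&
	    host_src.s_addr == htonl(INADDR_LOOPBACK))
		return c->nat_host_loopback;

	/* Untranslatable: fall back to our own address on the tap side */
	if (inaccessible4(host_src))
		return c->our_tap_addr;

	/* Already something the guest can address directly */
	return host_src;
}
```

The point of the ordering is that replies from the guest then go to an address which maps back to where the flow actually came from, and the fallback is used only when no such mapping exists.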
The guest is usually assigned one of the host's IP addresses. That means
it can't access the host itself via its usual address. The
--nat-host-loopback option (enabled by default with the gateway address)
allows the guest to contact the host. However, connections forwarded this
way appear on the host to have originated from the loopback interface,
which isn't always desirable.
Add a new --nat-guest-addr option, which acts similarly but forwarded
connections will go to the host's external address, instead of loopback.
If '-a' is used, so the guest's address is not the same as the host's, this
will instead forward to whatever host-visible site is shadowed by the
guest's assigned address.
Signed-off-by: David Gibson
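To make the difference between the two options concrete, here is a small IPv4-only sketch of the outbound (guest to host) destination mapping described above. The names struct nat4_out and dst_on_host4() are hypothetical, and the explicit "is a mapping configured" checks are an assumption of this sketch; it is meant to show the idea, not to reproduce fwd_nat_from_tap().

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>

/* Hypothetical, IPv4-only view of the configuration involved */
struct nat4_out {
	struct in_addr nat_host_loopback; /* alias for the host's 127.0.0.1 */
	struct in_addr nat_guest_addr;	  /* alias for the guest's own address */
	struct in_addr addr;		  /* address assigned to the guest */
};

static bool mapping_set4(struct in_addr a)
{
	return a.s_addr != htonl(INADDR_ANY);
}

/* Destination to use on the host side for a guest packet sent to @dst */
static struct in_addr dst_on_host4(const struct nat4_out *c,
				   struct in_addr dst)
{
	struct in_addr lo = { .s_addr = htonl(INADDR_LOOPBACK) };

	assert(c);

	/* --nat-host-loopback: arrives on the host via loopback */
	if (mapping_set4(c->nat_host_loopback) &&
	    dst.s_addr == c->nat_host_loopback.s_addr)
		return lo;

	/* --nat-guest-addr: goes to whatever the guest's address shadows,
	 * usually the host's own external address
	 */
	if (mapping_set4(c->nat_guest_addr) &&
	    dst.s_addr == c->nat_guest_addr.s_addr)
		return c->addr;

	/* Anything else is forwarded unchanged */
	return dst;
}
```

With '-a' unset, addr is the host's address copied to the guest, so the second branch reaches services bound to the host's external address without the connection appearing to come from loopback.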
On Fri, 16 Aug 2024 15:40:03 +1000 David Gibson wrote:
The guest is usually assigned one of the host's IP addresses. That means it can't access the host itself via its usual address. The --nat-host-loopback option (enabled by default with the gateway address) allows the guest to contact the host. However, connections forwarded this way appear on the host to have originated from the loopback interface, which isn't always desirable.
Add a new --nat-guest-addr option, which acts similarly but forwarded connections will go to the host's external address, instead of loopback.
If '-a' is used, so the guest's address is not the same as the host's, this will instead forward to whatever host-visible site is shadowed by the guest's assigned address.
Signed-off-by: David Gibson
--- conf.c | 51 ++++++++++++++++++++++++++++++++++----------------- fwd.c | 10 ++++++++++ passt.1 | 15 +++++++++++++++ passt.h | 6 ++++++ 4 files changed, 65 insertions(+), 17 deletions(-) diff --git a/conf.c b/conf.c index c5831e82..d14abc63 100644 --- a/conf.c +++ b/conf.c @@ -825,6 +825,14 @@ static void usage(const char *name, FILE *f, int status) " Can be specified zero to two (for IPv4 and IPv6)\n" " default: gateway address, or none if --no-map-gw is also\n" " specified\n" + " --nat-guest-addr ADDR NAT ADDR to guest's address\n" + " Packets from the guest to ADDR will be redirected to the\n" + " address on the host that's the same as the guest's\n" + " assigned address. Usually that means (one of) the host's\n" + " global addresses.\n"
Same as 20/22, it's probably enough to have this in the man page.
+ " ADDR can be 'none', in which case nothing is mapped\n" + " Can be specified zero to two (for IPv4 and IPv6)\n"
"can", times
+ " default: none\n" " --dns-forward ADDR Forward DNS queries sent to ADDR\n" " can be specified zero to two times (for IPv4 and IPv6)\n" " default: don't forward DNS queries\n" @@ -1141,29 +1149,32 @@ static void conf_ugid(char *runas, uid_t *uid, gid_t *gid) }
/** - * conf_nat() - Parse --nat-host-loopback option - * @c: Execution context - * @arg: String argument to --nat-host-loopback - * @no_map_gw: --no-map-gw flag, updated for "none" argument + * conf_nat() - Parse --nat-host-loopback or --nat-guest-addr option + * @arg: String argument to option + * @addr4: IPv4 to update with parsed address + * @addr6: IPv6 to update with parsed address + * @no_map_gw: --no-map-gw flag, or NULL, updated for "none" argument */ -static void conf_nat(struct ctx *c, const char *arg, int *no_map_gw) +static void conf_nat(const char *arg, struct in_addr *addr4, + struct in6_addr *addr6, int *no_map_gw) { if (strcmp(arg, "none") == 0) { - c->ip4.nat_host_loopback = in4addr_any; - c->ip6.nat_host_loopback = in6addr_any; - *no_map_gw = 1; + *addr4 = in4addr_any; + *addr6 = in6addr_any; + if (no_map_gw) + *no_map_gw = 1; }
- if (inet_pton(AF_INET6, arg, &c->ip6.nat_host_loopback) && - !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) && - !IN6_IS_ADDR_LOOPBACK(&c->ip6.nat_host_loopback) && - !IN6_IS_ADDR_MULTICAST(&c->ip6.nat_host_loopback)) + if (inet_pton(AF_INET6, arg, addr6) && + !IN6_IS_ADDR_UNSPECIFIED(addr6) && + !IN6_IS_ADDR_LOOPBACK(addr6) && + !IN6_IS_ADDR_MULTICAST(addr6)) return;
- if (inet_pton(AF_INET, arg, &c->ip4.nat_host_loopback) && - !IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback) && - !IN4_IS_ADDR_LOOPBACK(&c->ip4.nat_host_loopback) && - !IN4_IS_ADDR_MULTICAST(&c->ip4.nat_host_loopback)) + if (inet_pton(AF_INET, arg, addr4) && + !IN4_IS_ADDR_UNSPECIFIED(addr4) && + !IN4_IS_ADDR_LOOPBACK(addr4) && + !IN4_IS_ADDR_MULTICAST(addr4)) return;
die("Invalid address to remap to host: %s", optarg); @@ -1279,6 +1290,7 @@ void conf(struct ctx *c, int argc, char **argv) {"no-copy-addrs", no_argument, NULL, 19 }, {"netns-only", no_argument, NULL, 20 }, {"nat-host-loopback", required_argument, NULL, 21 }, + {"nat-guest-addr", required_argument, NULL, 22 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1449,7 +1461,12 @@ void conf(struct ctx *c, int argc, char **argv) *userns = 0; break; case 21: - conf_nat(c, optarg, &no_map_gw); + conf_nat(optarg, &c->ip4.nat_host_loopback, + &c->ip6.nat_host_loopback, &no_map_gw); + break; + case 22: + conf_nat(optarg, &c->ip4.nat_guest_addr, + &c->ip6.nat_guest_addr, NULL); break; case 'd': c->debug = 1; diff --git a/fwd.c b/fwd.c index 7718f7e2..ff4789a2 100644 --- a/fwd.c +++ b/fwd.c @@ -272,6 +272,10 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, tgt->eaddr = inany_loopback4; else if (inany_equals6(&ini->oaddr, &c->ip6.nat_host_loopback)) tgt->eaddr = inany_loopback6; + else if (inany_equals4(&ini->oaddr, &c->ip4.nat_guest_addr)) + tgt->eaddr = inany_from_v4(c->ip4.addr); + else if (inany_equals6(&ini->oaddr, &c->ip6.nat_guest_addr)) + tgt->eaddr.a6 = c->ip6.addr; else tgt->eaddr = ini->oaddr;
@@ -393,6 +397,12 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, } else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) && inany_equals6(&ini->eaddr, &in6addr_loopback)) { tgt->oaddr.a6 = c->ip6.nat_host_loopback; + } else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_guest_addr) && + inany_equals4(&ini->eaddr, &c->ip4.addr)) { + tgt->oaddr = inany_from_v4(c->ip4.nat_guest_addr); + } else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_guest_addr) && + inany_equals6(&ini->eaddr, &c->ip6.addr)) { + tgt->oaddr.a6 = c->ip6.nat_guest_addr; } else if (!fwd_guest_accessible(c, &ini->eaddr)) { if (inany_v4(&ini->eaddr)) { if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr)) diff --git a/passt.1 b/passt.1 index 3680056a..7cf553cf 100644 --- a/passt.1 +++ b/passt.1 @@ -350,6 +350,21 @@ as destination, to the host. Implied if there is no gateway on the selected default route, or if there is no default route, for any of the enabled address families.
+.TP +.BR \-\-nat-guest-loopback " " \fIaddr +Translate \fIaddr\fR in the guest to be equal to the guest's assigned +address on the host. That is, packets from the guest to \fIaddr\fR +will be redirected to the address assigned to the guest with \fB-a\fR, +or by default the host's global address. This allows the guest to +access services available on the host's global address, even though its +own address shadows that of the host. + +If \fIaddr\fR is 'none', no address is mapped. Only one IPv4 and one +IPv6 address can be translated, if the option is specified multiple
, and if
+times, the last one for each address type takes effect. + +Default is no mapping. + .TP .BR \-4 ", " \-\-ipv4-only Enable IPv4-only operation. IPv6 traffic will be ignored. diff --git a/passt.h b/passt.h index 20a5904a..586c1d05 100644 --- a/passt.h +++ b/passt.h @@ -104,6 +104,8 @@ enum passt_modes { * @guest_gw: IPv4 gateway as seen by the guest * @nat_host_loopback: Outbound connections to this address are NATted to the * host's 127.0.0.1 + * @nat_guest_addr: Outbound connections to this address are NATted to the + * guest's assigned address * @dns: DNS addresses for DHCP, zero-terminated * @dns_match: Forward DNS query if sent to this address * @our_tap_addr: IPv4 address for passt's use on tap @@ -120,6 +122,7 @@ struct ip4_ctx { int prefix_len; struct in_addr guest_gw; struct in_addr nat_host_loopback; + struct in_addr nat_guest_addr; struct in_addr dns[MAXNS + 1]; struct in_addr dns_match; struct in_addr our_tap_addr; @@ -142,6 +145,8 @@ struct ip4_ctx { * @guest_gw: IPv6 gateway as seen by the guest * @nat_host_loopback: Outbound connections to this address are NATted to the * host's [::1] + * @nat_guest_addr: Outbound connections to this address are NATted to the + * guest's assigned address * @dns: DNS addresses for DHCPv6 and NDP, zero-terminated * @dns_match: Forward DNS query if sent to this address * @our_tap_ll: Link-local IPv6 address for passt's use on tap @@ -158,6 +163,7 @@ struct ip6_ctx { struct in6_addr addr_ll_seen; struct in6_addr guest_gw; struct in6_addr nat_host_loopback; + struct in6_addr nat_guest_addr; struct in6_addr dns[MAXNS + 1]; struct in6_addr dns_match; struct in6_addr our_tap_ll;
-- Stefano
On Tue, Aug 20, 2024 at 09:56:40PM +0200, Stefano Brivio wrote:
On Fri, 16 Aug 2024 15:40:03 +1000 David Gibson wrote:
The guest is usually assigned one of the host's IP addresses. That means it can't access the host itself via its usual address. The --nat-host-loopback option (enabled by default with the gateway address) allows the guest to contact the host. However, connections forwarded this way appear on the host to have originated from the loopback interface, which isn't always desirable.
Add a new --nat-guest-addr option, which acts similarly but forwarded connections will go to the host's external address, instead of loopback.
If '-a' is used, so the guest's address is not the same as the host's, this will instead forward to whatever host-visible site is shadowed by the guest's assigned address.
Signed-off-by: David Gibson
--- conf.c | 51 ++++++++++++++++++++++++++++++++++----------------- fwd.c | 10 ++++++++++ passt.1 | 15 +++++++++++++++ passt.h | 6 ++++++ 4 files changed, 65 insertions(+), 17 deletions(-) diff --git a/conf.c b/conf.c index c5831e82..d14abc63 100644 --- a/conf.c +++ b/conf.c @@ -825,6 +825,14 @@ static void usage(const char *name, FILE *f, int status) " Can be specified zero to two (for IPv4 and IPv6)\n" " default: gateway address, or none if --no-map-gw is also\n" " specified\n" + " --nat-guest-addr ADDR NAT ADDR to guest's address\n" + " Packets from the guest to ADDR will be redirected to the\n" + " address on the host that's the same as the guest's\n" + " assigned address. Usually that means (one of) the host's\n" + " global addresses.\n"
Same as 20/22, it's probably enough to have this in the man page.
+ " ADDR can be 'none', in which case nothing is mapped\n" + " Can be specified zero to two (for IPv4 and IPv6)\n"
"can", times
Done.
+ " default: none\n" " --dns-forward ADDR Forward DNS queries sent to ADDR\n" " can be specified zero to two times (for IPv4 and IPv6)\n" " default: don't forward DNS queries\n" @@ -1141,29 +1149,32 @@ static void conf_ugid(char *runas, uid_t *uid, gid_t *gid) }
/** - * conf_nat() - Parse --nat-host-loopback option - * @c: Execution context - * @arg: String argument to --nat-host-loopback - * @no_map_gw: --no-map-gw flag, updated for "none" argument + * conf_nat() - Parse --nat-host-loopback or --nat-guest-addr option + * @arg: String argument to option + * @addr4: IPv4 to update with parsed address + * @addr6: IPv6 to update with parsed address + * @no_map_gw: --no-map-gw flag, or NULL, updated for "none" argument */ -static void conf_nat(struct ctx *c, const char *arg, int *no_map_gw) +static void conf_nat(const char *arg, struct in_addr *addr4, + struct in6_addr *addr6, int *no_map_gw) { if (strcmp(arg, "none") == 0) { - c->ip4.nat_host_loopback = in4addr_any; - c->ip6.nat_host_loopback = in6addr_any; - *no_map_gw = 1; + *addr4 = in4addr_any; + *addr6 = in6addr_any; + if (no_map_gw) + *no_map_gw = 1; }
- if (inet_pton(AF_INET6, arg, &c->ip6.nat_host_loopback) && - !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) && - !IN6_IS_ADDR_LOOPBACK(&c->ip6.nat_host_loopback) && - !IN6_IS_ADDR_MULTICAST(&c->ip6.nat_host_loopback)) + if (inet_pton(AF_INET6, arg, addr6) && + !IN6_IS_ADDR_UNSPECIFIED(addr6) && + !IN6_IS_ADDR_LOOPBACK(addr6) && + !IN6_IS_ADDR_MULTICAST(addr6)) return;
- if (inet_pton(AF_INET, arg, &c->ip4.nat_host_loopback) && - !IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback) && - !IN4_IS_ADDR_LOOPBACK(&c->ip4.nat_host_loopback) && - !IN4_IS_ADDR_MULTICAST(&c->ip4.nat_host_loopback)) + if (inet_pton(AF_INET, arg, addr4) && + !IN4_IS_ADDR_UNSPECIFIED(addr4) && + !IN4_IS_ADDR_LOOPBACK(addr4) && + !IN4_IS_ADDR_MULTICAST(addr4)) return;
die("Invalid address to remap to host: %s", optarg); @@ -1279,6 +1290,7 @@ void conf(struct ctx *c, int argc, char **argv) {"no-copy-addrs", no_argument, NULL, 19 }, {"netns-only", no_argument, NULL, 20 }, {"nat-host-loopback", required_argument, NULL, 21 }, + {"nat-guest-addr", required_argument, NULL, 22 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1449,7 +1461,12 @@ void conf(struct ctx *c, int argc, char **argv) *userns = 0; break; case 21: - conf_nat(c, optarg, &no_map_gw); + conf_nat(optarg, &c->ip4.nat_host_loopback, + &c->ip6.nat_host_loopback, &no_map_gw); + break; + case 22: + conf_nat(optarg, &c->ip4.nat_guest_addr, + &c->ip6.nat_guest_addr, NULL); break; case 'd': c->debug = 1; diff --git a/fwd.c b/fwd.c index 7718f7e2..ff4789a2 100644 --- a/fwd.c +++ b/fwd.c @@ -272,6 +272,10 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, tgt->eaddr = inany_loopback4; else if (inany_equals6(&ini->oaddr, &c->ip6.nat_host_loopback)) tgt->eaddr = inany_loopback6; + else if (inany_equals4(&ini->oaddr, &c->ip4.nat_guest_addr)) + tgt->eaddr = inany_from_v4(c->ip4.addr); + else if (inany_equals6(&ini->oaddr, &c->ip6.nat_guest_addr)) + tgt->eaddr.a6 = c->ip6.addr; else tgt->eaddr = ini->oaddr;
@@ -393,6 +397,12 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, } else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) && inany_equals6(&ini->eaddr, &in6addr_loopback)) { tgt->oaddr.a6 = c->ip6.nat_host_loopback; + } else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_guest_addr) && + inany_equals4(&ini->eaddr, &c->ip4.addr)) { + tgt->oaddr = inany_from_v4(c->ip4.nat_guest_addr); + } else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_guest_addr) && + inany_equals6(&ini->eaddr, &c->ip6.addr)) { + tgt->oaddr.a6 = c->ip6.nat_guest_addr; } else if (!fwd_guest_accessible(c, &ini->eaddr)) { if (inany_v4(&ini->eaddr)) { if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr)) diff --git a/passt.1 b/passt.1 index 3680056a..7cf553cf 100644 --- a/passt.1 +++ b/passt.1 @@ -350,6 +350,21 @@ as destination, to the host. Implied if there is no gateway on the selected default route, or if there is no default route, for any of the enabled address families.
+.TP +.BR \-\-nat-guest-loopback " " \fIaddr +Translate \fIaddr\fR in the guest to be equal to the guest's assigned +address on the host. That is, packets from the guest to \fIaddr\fR +will be redirected to the address assigned to the guest with \fB-a\fR, +or by default the host's global address. This allows the guest to +access services available on the host's global address, even though its +own address shadows that of the host. + +If \fIaddr\fR is 'none', no address is mapped. Only one IPv4 and one +IPv6 address can be translated, if the option is specified multiple
, and if
Done. Also fixed the fact I incorrectly called it --nat-guest-loopback instead of --map-guest-addr above.
+times, the last one for each address type takes effect. + +Default is no mapping. + .TP .BR \-4 ", " \-\-ipv4-only Enable IPv4-only operation. IPv6 traffic will be ignored. diff --git a/passt.h b/passt.h index 20a5904a..586c1d05 100644 --- a/passt.h +++ b/passt.h @@ -104,6 +104,8 @@ enum passt_modes { * @guest_gw: IPv4 gateway as seen by the guest * @nat_host_loopback: Outbound connections to this address are NATted to the * host's 127.0.0.1 + * @nat_guest_addr: Outbound connections to this address are NATted to the + * guest's assigned address * @dns: DNS addresses for DHCP, zero-terminated * @dns_match: Forward DNS query if sent to this address * @our_tap_addr: IPv4 address for passt's use on tap @@ -120,6 +122,7 @@ struct ip4_ctx { int prefix_len; struct in_addr guest_gw; struct in_addr nat_host_loopback; + struct in_addr nat_guest_addr; struct in_addr dns[MAXNS + 1]; struct in_addr dns_match; struct in_addr our_tap_addr; @@ -142,6 +145,8 @@ struct ip4_ctx { * @guest_gw: IPv6 gateway as seen by the guest * @nat_host_loopback: Outbound connections to this address are NATted to the * host's [::1] + * @nat_guest_addr: Outbound connections to this address are NATted to the + * guest's assigned address * @dns: DNS addresses for DHCPv6 and NDP, zero-terminated * @dns_match: Forward DNS query if sent to this address * @our_tap_ll: Link-local IPv6 address for passt's use on tap @@ -158,6 +163,7 @@ struct ip6_ctx { struct in6_addr addr_ll_seen; struct in6_addr guest_gw; struct in6_addr nat_host_loopback; + struct in6_addr nat_guest_addr; struct in6_addr dns[MAXNS + 1]; struct in6_addr dns_match; struct in6_addr our_tap_ll;
--
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
Hi,
On 16/08/2024 07:39, David Gibson wrote:
Based on Stefano's recent patch for faster tests.
Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.
Suggestions for better names for the new options in patches 20 & 22 are most welcome.
Along the way to implementing that, make many changes to clarify what the various addresses we track mean, fixing a number of small bugs as well.
NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.
Paul, amongst other things, I think this will allow podman to (finally) nicely address #19213, picking an address to remap to the host's external address with --nat-guest-addr, much like it already uses --dns-forward.
Thanks, this looks promising. I will try to test it out next week.
No strong feelings about the naming, but how about s/--nat/--map/ for the options?
David Gibson (22): treewide: Use "our address" instead of "forwarding address" util: Helper for formatting MAC addresses treewide: Rename MAC address fields for clarity treewide: Use struct assignment instead of memcpy() for IP addresses conf: Use array indices rather than pointers for DNS array slots conf: More accurately count entries added in get_dns() conf: Move DNS array bounds checks into add_dns[46] conf: Move adding of a nameserver from resolv.conf into subfunction conf: Correct setting of dns_match address in add_dns6() conf: Treat --dns addresses as guest visible addresses conf: Remove incorrect initialisation of addr_ll_seen util: Correct sock_l4() binding for link local addresses treewide: Change misleading 'addr_ll' name Clarify which addresses in ip[46]_ctx are meaningful where Initialise our_tap_ll to ip6.gw when suitable fwd: Helpers to clarify what host addresses aren't guest accessible fwd: Split notion of "our tap address" from gateway for IPv4 Don't take "our" MAC address from the host conf, fwd: Split notion of gateway/router from guest-visible host address conf: Allow address remapped to host to be configured fwd: Distinguish translatable from untranslatable addresses on inbound fwd, conf: Allow NAT of the guest's assigned address
arp.c | 4 +- conf.c | 328 +++++++++++++++++++++++++----------------- dhcp.c | 19 +-- dhcpv6.c | 21 +-- flow.c | 72 +++++----- flow.h | 18 +-- fwd.c | 170 +++++++++++++++++----- icmp.c | 4 +- ndp.c | 9 +- passt.1 | 45 +++++- passt.c | 2 +- passt.h | 53 +++++-- pasta.c | 14 +- tap.c | 12 +- tcp.c | 33 ++--- tcp_internal.h | 2 +- test/lib/setup | 11 +- test/passt_in_ns/dhcp | 73 ++++++++++ test/passt_in_ns/tcp | 38 +++-- test/passt_in_ns/udp | 22 +-- test/perf/passt_tcp | 33 ++--- test/perf/passt_udp | 31 ++-- test/perf/pasta_tcp | 29 ++-- test/perf/pasta_udp | 25 ++-- test/run | 4 +- udp.c | 12 +- util.c | 22 ++- util.h | 4 +- 28 files changed, 719 insertions(+), 391 deletions(-) create mode 100644 test/passt_in_ns/dhcp
-- Paul Holzinger
On Fri, 16 Aug 2024 16:45:14 +0200 Paul Holzinger wrote:
Hi,
On 16/08/2024 07:39, David Gibson wrote:
Based on Stefano's recent patch for faster tests.
Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.
Suggestions for better names for the new options in patches 20 & 22 are most welcome.
Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.
NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.
Paul, amongst other things, I think this will allow podman to (finally) nicely address #19213, picking an address to remap to the host's external address with --nat-guest-addr, much like it already uses --dns-forward.
Thanks this looks promising. I will try to test it out next week.
No strong feelings about the naming but how about s/--nat/--map/ for the options?
Exactly the same as I suggested offline a while ago. :) I think it's easier to understand what it does, that way.
-- Stefano
On Fri, Aug 16, 2024 at 05:03:22PM +0200, Stefano Brivio wrote:
On Fri, 16 Aug 2024 16:45:14 +0200 Paul Holzinger wrote:
Hi,
On 16/08/2024 07:39, David Gibson wrote:
Based on Stefano's recent patch for faster tests.
Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.
Suggestions for better names for the new options in patches 20 & 22 are most welcome.
Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.
NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.
Paul, amongst other things, I think this will allow podman to (finally) nicely address #19213, picking an address to remap to the host's external address with --nat-guest-addr, much like it already uses --dns-forward.
Thanks this looks promising. I will try to test it out next week.
No strong feelings about the naming but how about s/--nat/--map/ for the options?
Exactly the same as I suggested offline a while ago. :) I think it's easier to understand what it does, that way.
Ok. I think I was going to do that originally but changed it for reasons that I've now forgotten. --map is more consistent with --no-map-gw too, so I'll change this.
-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:
Based on Stefano's recent patch for faster tests.
Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.
Suggestions for better names for the new options in patches 20 & 22 are most welcome.
Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.
NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.
I've identified the bug here. IMO, it's a pre-existing problem that only works by accident at the moment. The immediate fix is pretty obvious, but it raises some broader questions.
The problem arises because of the MTU changes we make in order to test throughput with different packet sizes. Specifically we change the MTU to values < 1280, which implicitly disables IPv6 since it requires an MTU >= 1280. When we change the MTU back to a larger value IPv6 is re-enabled, but some configuration has been lost in the meantime.
After the MTU is restored the guest reconfigures with NDP, but does not re-DHCPv6. That means the guest gets a SLAAC address in the right prefix but not the exact /128 address we've tried to assign to it. However, at least with the sequence of things we have in the tests, the guest never sends any packets with the new address, so passt doesn't update addr_seen. When the inbound connection comes we send it to the assigned address instead of the guest's actual address and the guest rejects it.
This "worked" previously, because before this patch, passt would translate the inbound connection to have source/dest as link-local addresses. We *do* have a current addr_ll_seen because (a) it won't change if the guest doesn't change MAC and (b) when IPv6 is re-enabled the NDP traffic the guest generates will have link-local addresses that update addr_ll_seen. With this patch, and a global address for --map-host-loopback, we now need to send to addr_seen instead of addr_ll_seen, hence exposing the bug.
In the short term, the obvious fix would be to re-run dhclient -6 in the guest after we twiddle MTU but before running IPv6 tests.
This kind of opens a question about how hard we should try to accommodate guests which don't configure themselves how we told them. Personally I'd be ok with saying that nothing works if the guest doesn't configure itself properly, thereby removing addr_seen and addr_ll_seen entirely. But I think, Stefano, you've been against that idea in the past.
-- David Gibson
On Mon, 19 Aug 2024 18:46:31 +1000 David Gibson wrote:
On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:
Based on Stefano's recent patch for faster tests.
Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.
Suggestions for better names for the new options in patches 20 & 22 are most welcome.
Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.
NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.
I've identified the bug here. IMO, it's a pre-existing problem that only works by accident at the moment. The immediate fix is pretty obvious, but it raises some broader questions.
The problem arises because of the MTU changes we make in order to test throughput with different packet sizes. Specifically we change the MTU to values < 1280, which implicitly disables IPv6 since it requires an MTU >= 1280. When we change the MTU back to a larger value IPv6 is re-enabled, but some configuration has been lost in the meantime.
After the MTU is restored the guest reconfigures with NDP, but does not re-DHCPv6. That means the guest gets a SLAAC address in the right prefix but not the exact /128 address we've tried to assign to it. However, at least with the sequence of things we have in the tests, the guest never sends any packets with the new address, so passt doesn't update addr_seen. When the inbound connection comes we send it to the assigned address instead of the guest's actual address and the guest rejects it.
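The tracking described in the paragraph above can be sketched roughly as follows. This is a minimal illustration, not passt's actual code: struct guest_ip6 and guest_ip6_observe() are hypothetical names, and only the field names addr_seen and addr_ll_seen come from this thread. The point it shows is that the "seen" fields only move when the guest actually sends a packet from a new address, so a silently reconfigured guest leaves addr_seen stale.

```c
#include <netinet/in.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-guest IPv6 state discussed in this
 * thread; the layout is an assumption, not passt's real structure. */
struct guest_ip6 {
	struct in6_addr addr;		/* address we tried to assign */
	struct in6_addr addr_seen;	/* last non-link-local source seen */
	struct in6_addr addr_ll_seen;	/* last link-local source seen */
};

/* Record the source address of a packet received from the guest.
 * Returns true if one of the "seen" fields changed. */
bool guest_ip6_observe(struct guest_ip6 *g, const struct in6_addr *src)
{
	struct in6_addr *slot;

	/* Unspecified and multicast sources can't be used as destinations */
	if (IN6_IS_ADDR_UNSPECIFIED(src) || IN6_IS_ADDR_MULTICAST(src))
		return false;

	slot = IN6_IS_ADDR_LINKLOCAL(src) ? &g->addr_ll_seen : &g->addr_seen;
	if (IN6_ARE_ADDR_EQUAL(slot, src))
		return false;

	*slot = *src;
	return true;
}
```

In the failing test sequence, the NDP traffic after the MTU is restored would take the link-local branch, while nothing reaches the other branch until the guest sends from its new SLAAC address.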
I still have to take a closer look, but I'm fairly sure I hit a similar issue while I was writing these tests originally. I pondered reconfiguring the address via DHCPv6, or using the keep_addr_on_down sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added around that time. Then:
This "worked" previously, because before this patch, passt would translate the inbound connection to have source/dest as link-local addresses.
...I realised that this worked and forgot about the whole issue.
We *do* have a current addr_ll_seen because (a) it won't change if the guest doesn't change MAC and (b) when IPv6 is re-enabled the NDP traffic the guest generates will have link-local addresses that update addr_ll_seen. With this patch, and a global address for --map-host-loopback, we now need to send to addr_seen instead of addr_ll_seen, hence exposing the bug.
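As a rough sketch of the choice described above (an assumption about the shape of the logic, not the code from the series): the destination used towards the guest follows the scope of the source address the guest will see. guest_dst_for() is a hypothetical helper name.

```c
#include <netinet/in.h>

/* Pick the guest address to use as destination for an inbound flow,
 * given the source address the guest will see on that flow.
 * Hypothetical helper, for illustration only. */
const struct in6_addr *guest_dst_for(const struct in6_addr *our_src,
				     const struct in6_addr *addr_seen,
				     const struct in6_addr *addr_ll_seen)
{
	/* A link-local source is paired with a link-local destination */
	if (IN6_IS_ADDR_LINKLOCAL(our_src))
		return addr_ll_seen;

	/* A global source (e.g. a global --map-host-loopback address)
	 * needs the guest's global address, which may be stale */
	return addr_seen;
}
```

With a link-local source the lookup lands on addr_ll_seen, which the NDP traffic keeps current; with a global source it lands on addr_seen, which is the field that goes stale in the test sequence.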
In the short term, the obvious fix would be to re-run dhclient -6 in the guest after we twiddle MTU but before running IPv6 tests.
I guess setting keep_addr_on_down (even for "all" interfaces) should work as well.
This kind of opens a question about how hard we should try to accommodate guests which don't configure themselves how we told them.
There's a notable distinction between guests temporarily diverging (in different ways) and guests we don't configure at all. It's probably more important to ensure we use the right type of address (security) rather than ensuring we somehow manage to deliver packets at any time (minor glitch otherwise), also because the one you describe is something we're unlikely to hit outside of tests.
Personally I'd be ok with saying that nothing works if the guest doesn't configure itself properly, thereby removing addr_seen and addr_ll_seen entirely. But I think, Stefano, you've been against that idea in the past.
Yes, I still think we should support guests that don't use DHCPv6 or NDP at all, or where related exchanges fail for any reason. It improves reliability and compatibility at a small cost. In this case, I think it's a nice feature that we would resume communicating as soon as the guest shows its global unicast address.
If the cost is using the wrong type of address, then no, I'm not suggesting we do that, so I think the change from this series is desirable, but in a general case, things just work and we don't break anything, as far as I know.
-- Stefano
On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote:
On Mon, 19 Aug 2024 18:46:31 +1000 David Gibson wrote:
On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:
Based on Stefano's recent patch for faster tests.
Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.
Suggestions for better names for the new options in patches 20 & 22 are most welcome.
Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.
NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.
I've identified the bug here. IMO, it's a pre-existing problem that only works by accident at the moment. The immediate fix is pretty obvious, but it raises some broader questions.
The problem arises because of the MTU changes we make in order to test throughput with different packet sizes. Specifically we change the MTU to values < 1280, which implicitly disables IPv6 since it requires an MTU >= 1280. When we change the MTU back to a larger value IPv6 is re-enabled, but some configuration has been lost in the meantime.
After the MTU is restored the guest reconfigures with NDP, but does not re-DHCPv6. That means the guest gets a SLAAC address in the right prefix but not the exact /128 address we've tried to assign to it. However, at least with the sequence of things we have in the tests, the guest never sends any packets with the new address, so passt doesn't update addr_seen. When the inbound connection comes we send it to the assigned address instead of the guest's actual address and the guest rejects it.
I still have to take a closer look, but I'm fairly sure I hit a similar issue while I was writing these tests originally. I pondered reconfiguring the address via DHCPv6, or using the keep_addr_on_down sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added around that time.
Then:
This "worked" previously, because before this patch, passt would translate the inbound connection to have source/dest as link-local addresses.
...I realised that this worked and forgot about the whole issue.
We *do* have a current addr_ll_seen because (a) it won't change if the guest doesn't change MAC and (b) when IPv6 is re-enabled the NDP traffic the guest generates will have link-local addresses that update addr_ll_seen. With this patch, and a global address for --map-host-loopback, we now need to send to addr_seen instead of addr_ll_seen, hence exposing the bug.
In the short term, the obvious fix would be to re-run dhclient -6 in the guest after we twiddle MTU but before running IPv6 tests.
I guess setting keep_addr_on_down (even for "all" interfaces) should work as well.
Sounds like it. I wasn't aware of that one.
/me tests.. actually, no it doesn't work..
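# sysctl -a | grep keep_addr_on_down
net.ipv6.conf.all.keep_addr_on_down = 1
net.ipv6.conf.default.keep_addr_on_down = 1
net.ipv6.conf.dummy0.keep_addr_on_down = 1
net.ipv6.conf.lo.keep_addr_on_down = 0
# ip addr add 2001:db8::1 dev dummy0
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8::1/128 scope global
       valid_lft forever preferred_lft forever
# ip link set dummy0 mtu 1200
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1200 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
# ip link set dummy0 mtu 1500
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
My guess is that IPv6 being deconfigured because of an unsuitable MTU is considered a different event from a mere "down".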
This kind of opens a question about how hard we should try to accommodate guests which don't configure themselves how we told them.
There's a notable distinction between guests temporarily diverging (in different ways) and guests we don't configure at all.
I'm not really sure what you're getting at here.
It's probably more important to ensure we use the right type of address
"type" in what sense here?
(security) rather than ensuring we somehow manage to deliver packets at any time (minor glitch otherwise), also because the one you describe is something we're unlikely to hit outside of tests.
Personally I'd be ok with saying that nothing works if the guest doesn't configure itself properly, thereby removing addr_seen and addr_ll_seen entirely. But I think, Stefano, you've been against that idea in the past.
Yes, I still think we should support guests that don't use DHCPv6 or NDP at all,
Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to manually configure the interface in the guest to match the address you've configured with -a. Just like you'd expect to have to correctly configure your address on a real network.
or where related exchanges fail for any reason. It improves reliability and compatibility at a small cost. In this case, I think it's a nice feature that we would resume communicating as soon as the guest shows its global unicast address.
Hm, maybe. I'm not entirely convinced the cost is so small long term. It's pretty badly incompatible with having multiple guests behind the same passt instance: such as the initial guest bridging or routing to nested guests.
I'm actually not sure if encountering this bug makes me more or less in favour of addr_seen. On the one hand I think it highlights the flakiness of this approach; there are situations where we just won't know the right address. On the other hand it shows a relatively plausible case where the guest won't get exactly the address we want it to (it uses NDP but not DHCPv6).
Hrm... actually this also shows a potential danger in the recent patches to disable DAD in the guest. With DAD enabled, when the guest grabs a new address, we'd expect it to emit DAD messages, which would have the side effect of updating our addr_seen (although I'm pretty sure I hit this patch before the nodad patches were applied, so that doesn't seem to be foolproof).
We could maybe update addr_seen when we send RA messages to the guest - assuming that it will use the same host part (low 64-bits) for both link-local and global addresses. Not sure if that's a widely safe assumption or not.
If the cost is using the wrong type of address, then no, I'm not suggesting we do that, so I think the change from this series is desirable, but in a general case, things just work and we don't break anything, as far as I know.
-- David Gibson
On Mon, 19 Aug 2024 19:52:49 +1000 David Gibson wrote:
On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote:
On Mon, 19 Aug 2024 18:46:31 +1000 David Gibson wrote:
On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:
Based on Stefano's recent patch for faster tests.
Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.
Suggestions for better names for the new options in patches 20 & 22 are most welcome.
Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.
NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.
I've identified the bug here. IMO, it's a pre-existing problem that only works by accident at the moment. The immediate fix is pretty obvious, but it raises some broader questions.
The problem arises because of the MTU changes we make in order to test throughput with different packet sizes. Specifically we change the MTU to values < 1280, which implicitly disables IPv6 since it requires an MTU >= 1280. When we change the MTU back to a larger value IPv6 is re-enabled, but some configuration has been lost in the meantime.
After the MTU is restored the guest reconfigures with NDP, but does not re-DHCPv6. That means the guest gets a SLAAC address in the right prefix but not the exact /128 address we've tried to assign to it. However, at least with the sequence of things we have in the tests, the guest never sends any packets with the new address, so passt doesn't update addr_seen. When the inbound connection comes we send it to the assigned address instead of the guest's actual address and the guest rejects it.
I still have to take a closer look, but I'm fairly sure I hit a similar issue while I was writing these tests originally. I pondered reconfiguring the address via DHCPv6, or using the keep_addr_on_down sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added around that time.
Then:
This "worked" previously, because before this patch, passt would translate the inbound connection to have source/dest as link-local addresses.
...I realised that this worked and forgot about the whole issue.
We *do* have a current addr_ll_seen because (a) it won't change if the guest doesn't change MAC and (b) when IPv6 is re-enabled the NDP traffic the guest generates will have link-local addresses that update addr_ll_seen. With this patch, and a global address for --map-host-loopback, we now need to send to addr_seen instead of addr_ll_seen, hence exposing the bug.
In the short term, the obvious fix would be to re-run dhclient -6 in the guest after we twiddle MTU but before running IPv6 tests.
I guess setting keep_addr_on_down (even for "all" interfaces) should work as well.
Sounds like it. I wasn't aware of that one.
/me tests.. actually, no it doesn't work..
# sysctl -a | grep keep_addr_on_down
net.ipv6.conf.all.keep_addr_on_down = 1
net.ipv6.conf.default.keep_addr_on_down = 1
net.ipv6.conf.dummy0.keep_addr_on_down = 1
net.ipv6.conf.lo.keep_addr_on_down = 0
# ip addr add 2001:db8::1 dev dummy0
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8::1/128 scope global
       valid_lft forever preferred_lft forever
# ip link set dummy0 mtu 1200
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1200 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
# ip link set dummy0 mtu 1500
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
My guess is that IPv6 being deconfigured because of an unsuitable MTU is considered a different event from a mere "down".
I guess it's because they're not IFA_F_PERMANENT, because addrconf_permanent_addr() has:
	case NETDEV_CHANGEMTU:
		/* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */
		if (dev->mtu < IPV6_MIN_MTU) {
			addrconf_ifdown(dev, dev != net->loopback_dev);
			break;
		}
but addrconf_ifdown() does:
	if (!keep_addr ||
	    !(ifa->flags & IFA_F_PERMANENT) ||
	    addr_is_local(&ifa->addr)) {
		hlist_del_init_rcu(&ifa->addr_lst);
		goto restart;
	}
I'm not sure about the logic behind that. We could actually set those addresses as permanent once the DHCPv6 client configures them, if it's cleaner.
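To make the quoted condition easier to reason about, here is a small user-space model of it. It mirrors only the three-way test quoted from addrconf_ifdown(); how the kernel derives keep_addr from the sysctls and from the arguments of addrconf_ifdown() is deliberately not modelled, and model_addr_dropped() and MODEL_IFA_F_PERMANENT are hypothetical names introduced for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as IFA_F_PERMANENT in <linux/if_addr.h>; redefined here so
 * that the sketch stays self-contained. */
#define MODEL_IFA_F_PERMANENT	0x80u

/* Model of the quoted check: returns true if the address would be
 * removed from the interface. */
bool model_addr_dropped(bool keep_addr, uint32_t ifa_flags,
			bool addr_is_local)
{
	return !keep_addr ||
	       !(ifa_flags & MODEL_IFA_F_PERMANENT) ||
	       addr_is_local;
}
```

Under this model an address without the permanent flag is dropped whatever keep_addr is, which is the case suggested above for addresses set up by a DHCPv6 client.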
This kind of opens a question about how hard we should try to accommodate guests which don't configure themselves how we told them.
There's a notable distinction between guests temporarily diverging (in different ways) and guests we don't configure at all.
I'm not really sure what you're getting at here.
In this case, it's not true that the guest doesn't configure itself in the way we requested -- it's just a temporary diversion from that configuration. Those are different cases that we can handle in different ways, I think. If it's a glitch that will only happen during testing, let's work around that. But if the guest really ignores DHCPv6 information, I think we should keep that working.
It's probably more important to ensure we use the right type of address
"type" in what sense here?
Global unicast instead of link-local.
(security) rather than ensuring we somehow manage to deliver packets at any time (minor glitch otherwise), also because the one you describe is something we're unlikely to hit outside of tests.
Personally I'd be ok with saying that nothing works if the guest doesn't configure itself properly, thereby removing addr_seen and addr_ll_seen entirely. But I think, Stefano, you've been against that idea in the past.
Yes, I still think we should support guests that don't use DHCPv6 or NDP at all,
Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to manually configure the interface in the guest to match the address you've configured with -a. Just like you'd expect to have to correctly configure your address on a real network.
True, but if we make correctness as optional as possible, we'll be more compatible (less time spent by users fixing situations that don't necessarily need fixing, less time spent by developers to look into reports, no matter who's at fault).
or where related exchanges fail for any reason. It improves reliability and compatibility at a small cost. In this case, I think it's a nice feature that we would resume communicating as soon as the guest shows its global unicast address.
Hm, maybe. I'm not entirely convinced the cost is so small long term. It's pretty badly incompatible with having multiple guests behind the same passt instance: such as the initial guest bridging or routing to nested guests.
Why? We will need to hash the interface/guest index anyway, for outbound flows. And for inbound flows, if a guest steals the address of another guest, we'll give priority to the normal 'addr' versions instead of the '_seen' ones, to decide how to direct traffic.
I'm actually not sure if encountering this bug makes me more or less in favour of addr_seen. On the one hand I think it highlights the flakiness of this approach; there are situations where we just won't know the right address.
I don't understand this argument: indeed, there are such situations, and they are annoying. Why should we make them more common?
On the other hand it shows a relatively plausible case where the guest won't get exactly the address we want it to (it uses NDP but not DHCPv6).
Hrm... actually this also shows a potential danger in the recent patches to disable DAD in the guest. With DAD enabled, when the guest grabs a new address, we'd expect it to emit DAD messages, which would have the side effect of updating our addr_seen (although I'm pretty sure I hit this patch before the nodad patches were applied, so that doesn't seem to be foolproof).
Well, but we do that for containers with --config-net only. In that case, the addresses we configure have infinite lifetime anyway. Besides, I don't think we need to have addr_seen updated as quickly and correctly as possible just for the sake of it, we can also update it when we get any other neighbour solicitation because the guest is actually using the network. It's not meant to be perfect.
We could maybe update addr_seen when we send RA messages to the guest - assuming that it will use the same host part (low 64-bits) for both link-local and global addresses. Not sure if that's a widely safe assumption or not.
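The guess proposed above amounts to combining the advertised /64 prefix with the low 64 bits of the guest's link-local address. A minimal sketch, assuming a /64 prefix, with guess_global_addr() as a hypothetical helper name:

```c
#include <netinet/in.h>
#include <string.h>

/* Guess the guest's global address: high 64 bits from the prefix we
 * advertise in the RA, low 64 bits (interface identifier) from the
 * guest's link-local address. Hypothetical helper, /64 prefix assumed. */
struct in6_addr guess_global_addr(const struct in6_addr *prefix,
				  const struct in6_addr *link_local)
{
	struct in6_addr guess;

	memcpy(&guess.s6_addr[0], &prefix->s6_addr[0], 8);
	memcpy(&guess.s6_addr[8], &link_local->s6_addr[8], 8);
	return guess;
}
```

The guess holds for guests that derive one interface identifier (for example EUI-64 from the MAC address) and reuse it for every prefix. It would not hold for guests that generate per-prefix or temporary identifiers, as with RFC 7217 stable-privacy or temporary addresses, so it can only be a hint.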
I don't understand: what case are you trying to cover with this?
-- Stefano
On Mon, Aug 19, 2024 at 03:01:00PM +0200, Stefano Brivio wrote:
On Mon, 19 Aug 2024 19:52:49 +1000 David Gibson wrote:
On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote:
On Mon, 19 Aug 2024 18:46:31 +1000 David Gibson wrote:
On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:
Based on Stefano's recent patch for faster tests.
Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.
Suggestions for better names for the new options in patches 20 & 22 are most welcome.
Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.
NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.
I've identified the bug here. IMO, it's a pre-existing problem that only works by accident at the moment. The immediate fix is pretty obvious, but it raises some broader questions.
The problem arises because of the MTU changes we make in order to test throughput with different packet sizes. Specifically we change the MTU to values < 1280, which implicitly disables IPv6 since it requires an MTU >= 1280. When we change the MTU back to a larger value IPv6 is re-enabled, but some configuration has been lost in the meantime.
After the MTU is restored the guest reconfigures with NDP, but does not re-DHCPv6. That means the guest gets a SLAAC address in the right prefix but not the exact /128 address we've tried to assign to it. However, at least with the sequence of things we have in the tests, the guest never sends any packets with the new address, so passt doesn't update addr_seen. When the inbound connection comes we send it to the assigned address instead of the guest's actual address and the guest rejects it.
I still have to take a closer look, but I'm fairly sure I hit a similar issue while I was writing these tests originally. I pondered reconfiguring the address via DHCPv6, or using the keep_addr_on_down sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added around that time.
Then:
This "worked" previously, because before this patch, passt would translate the inbound connection to have source/dest as link-local addresses.
...I realised that this worked and forgot about the whole issue.
We *do* have a current addr_ll_seen because (a) it won't change if the guest doesn't change MAC and (b) when IPv6 is re-enabled the NDP traffic the guest generates will have link-local addresses that update addr_ll_seen. With this patch, and a global address for --map-host-loopback, we now need to send to addr_seen instead of addr_ll_seen, hence exposing the bug.
In the short term, the obvious fix would be to re-run dhclient -6 in the guest after we twiddle MTU but before running IPv6 tests.
I guess setting keep_addr_on_down (even for "all" interfaces) should work as well.
Sounds like it. I wasn't aware of that one.
/me tests.. actually, no it doesn't work..
# sysctl -a | grep keep_addr_on_down
net.ipv6.conf.all.keep_addr_on_down = 1
net.ipv6.conf.default.keep_addr_on_down = 1
net.ipv6.conf.dummy0.keep_addr_on_down = 1
net.ipv6.conf.lo.keep_addr_on_down = 0
# ip addr add 2001:db8::1 dev dummy0
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8::1/128 scope global
       valid_lft forever preferred_lft forever
# ip link set dummy0 mtu 1200
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1200 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
# ip link set dummy0 mtu 1500
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
My guess is that IPv6 being deconfigured because of an unsuitable MTU is considered a different event from a mere "down".
I guess it's because they're not IFA_F_PERMANENT, because addrconf_permanent_addr() has:
	case NETDEV_CHANGEMTU:
		/* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */
		if (dev->mtu < IPV6_MIN_MTU) {
			addrconf_ifdown(dev, dev != net->loopback_dev);
			break;
		}
but addrconf_ifdown() does:
	if (!keep_addr ||
	    !(ifa->flags & IFA_F_PERMANENT) ||
	    addr_is_local(&ifa->addr)) {
		hlist_del_init_rcu(&ifa->addr_lst);
		goto restart;
	}
I'm not sure about the logic behind that. We could actually set those addresses as permanent once the DHCPv6 client configures them, if it's cleaner.
Huh. Not in the passt/VM case, though, which is where I actually encountered this.
This kind of opens a question about how hard we should try to accommodate guests which don't configure themselves how we told them.
There's a notable distinction between guests temporarily diverging (in different ways) and guests we don't configure at all.
I'm not really sure what you're getting at here.
In this case, it's not true that the guest doesn't configure itself in the way we requested -- it's just a temporary diversion from that configuration.
Oh, I see. Assuming that at some point the DHCP client will re-run.
Those are different cases that we can handle in different ways, I think. If it's a glitch that will only happen during testing, let's work around that.
But if the guest really ignores DHCPv6 information, I think we should keep that working.
It's probably more important to ensure we use the right type of address
"type" in what sense here?
Global unicast instead of link-local.
Ok.
(security) rather than ensuring we somehow manage to deliver packets at any time (minor glitch otherwise), also because the one you describe is something we're unlikely to hit outside of tests.
Personally I'd be ok with saying that nothing works if the guest doesn't configure itself properly, thereby removing addr_seen and addr_ll_seen entirely. But I think, Stefano, you've been against that idea in the past.
Yes, I still think we should support guests that don't use DHCPv6 or NDP at all,
Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to manually configure the interface in the guest to match the address you've configured with -a. Just like you'd expect to have to correctly configure your address on a real network.
True, but if we make correctness as optional as possible, we'll be more compatible (less time spent by users fixing situations that don't necessarily need fixing, less time spent by developers to look into reports, no matter who's at fault).
Eh, maybe. Unless us trying to make sense of a nonsense situation causes some unpredictable behaviour that breaks something else.
or where related exchanges fail for any reason. It improves reliability and compatibility at a small cost. In this case, I think it's a nice feature that we would resume communicating as soon as the guest shows its global unicast address.
Hm, maybe. I'm not entirely convinced the cost is so small long term. It's pretty badly incompatible with having multiple guests behind the same passt instance: such as the initial guest bridging or routing to nested guests.
Why? We will need to hash the interface/guest index anyway, for outbound flows.
If we have separate interfaces for each guest, yes. But not if we have multiple guests behind a single tap because the initial guest sets up a bridge or routing. Then we have nothing but the address.
And for inbound flows, if a guest steals the address of another guest, we'll give priority to the normal 'addr' versions instead of the '_seen' ones, to decide how to direct traffic.
I don't see how we'd know we're in this situation, so when to prioritise which address over the other.
I'm actually not sure if encountering this bug makes me more or less in favour of addr_seen. On the one hand I think it highlights the flakiness of this approach; there are situations where we just won't know the right address.
I don't understand this argument: indeed, there are such situations, and they are annoying. Why should we make them more common?
Because predictability is good, and working _most_ of the time is a failure of predictability.
On the other hand it shows a relatively plausible case where the guest won't get exactly the address we want it to (it uses NDP but not DHCPv6)
Hrm... actually this also shows a potential danger in the recent patches to disable DAD in the guest. With DAD enabled, when the guest grabs a new address, we'd expect it to emit DAD messages, which would have the side effect of updating our addr_seen (although I'm pretty sure I hit this patch before the nodad patches were applied, so that doesn't seem to be foolproof).
Well, but we do that for containers with --config-net only. In that case, the addresses we configure have infinite lifetime anyway.
Oh, good point. Hrm... then I'm unsure why the guest wasn't re-DADing its new address.
Besides, I don't think we need to have addr_seen updated as quickly and correctly as possible just for the sake of it, we can also update it when we get any other neighbour solicitation because the guest is actually using the network. It's not meant to be perfect.
If the guest is a pure server (a common case for containers AFAICT), then I don't know that we can expect NS messages for anything other than the default gateway, which is (typically) link-local and so won't help us to learn the new global address.
We could maybe update addr_seen when we send RA messages to the guest - assuming that it will use the same host part (low 64-bits) for both link-local and global addresses. Not sure if that's a widely safe assumption or not.
I don't understand: what case are you trying to cover with this?
A case just like the one in the tests: the interface bounces, and we get NDP traffic on the link-local address, but nothing on the global address before an inbound connection.
-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
On Tue, 20 Aug 2024 10:42:17 +1000 David Gibson wrote:
On Mon, Aug 19, 2024 at 03:01:00PM +0200, Stefano Brivio wrote:
On Mon, 19 Aug 2024 19:52:49 +1000 David Gibson wrote:
On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote:
On Mon, 19 Aug 2024 18:46:31 +1000 David Gibson wrote:
On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:
Based on Stefano's recent patch for faster tests.
Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.
Suggestions for better names for the new options in patches 20 & 22 are most welcome.
Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.
NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.
I've identified the bug here. IMO, it's a pre-existing problem that only works by accident at the moment. The immediate fix is pretty obvious, but it raises some broader questions.
The problem arises because of the MTU changes we make in order to test throughput with different packet sizes. Specifically we change the MTU to values < 1280, which implicitly disables IPv6 since it requires an MTU >= 1280. When we change the MTU back to a larger value IPv6 is re-enabled, but some configuration has been lost in the meantime.
After the MTU is restored the guest reconfigures with NDP, but does not re-DHCPv6. That means the guest gets a SLAAC address in the right prefix but not the exact /128 address we've tried to assign to it. However, at least with the sequence of things we have in the tests, the guest never sends any packets with the new address, so passt doesn't update addr_seen. When the inbound connection comes we send it to the assigned address instead of the guest's actual address and the guest rejects it.
I still have to take a closer look, but I'm fairly sure I hit a similar issue while I was writing these tests originally. I pondered reconfiguring the address via DHCPv6, or using the keep_addr_on_down sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added around that time.
Then:
This "worked" previously, because before this patch, passt would translate the inbound connection to have source/dest as link-local addresses.
...I realised that this worked and forgot about the whole issue.
We *do* have a current addr_ll_seen because (a) it won't change if the guest doesn't change MAC and (b) when IPv6 is re-enabled the NDP traffic the guest generates will have link-local addresses that update addr_ll_seen. With this patch, and a global address for --map-host-loopback, we now need to send to addr_seen instead of addr_ll_seen, hence exposing the bug.
In the short term, the obvious fix would be to re-run dhclient -6 in the guest after we twiddle MTU but before running IPv6 tests.
I guess setting keep_addr_on_down (even for "all" interfaces) should work as well.
Sounds like it. I wasn't aware of that one.
/me tests.. actually, no it doesn't work..
# sysctl -a | grep keep_addr_on_down
net.ipv6.conf.all.keep_addr_on_down = 1
net.ipv6.conf.default.keep_addr_on_down = 1
net.ipv6.conf.dummy0.keep_addr_on_down = 1
net.ipv6.conf.lo.keep_addr_on_down = 0
# ip addr add 2001:db8::1 dev dummy0
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8::1/128 scope global
       valid_lft forever preferred_lft forever
# ip link set dummy0 mtu 1200
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1200 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
# ip link set dummy0 mtu 1500
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
My guess is that IPv6 being deconfigured because of an unsuitable MTU is considered a different event from a mere "down".
I guess it's because they're not IFA_F_PERMANENT, because addrconf_permanent_addr() has:
    case NETDEV_CHANGEMTU:
        /* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */
        if (dev->mtu < IPV6_MIN_MTU) {
            addrconf_ifdown(dev, dev != net->loopback_dev);
            break;
        }
but addrconf_ifdown() does:
    if (!keep_addr ||
        !(ifa->flags & IFA_F_PERMANENT) ||
        addr_is_local(&ifa->addr)) {
        hlist_del_init_rcu(&ifa->addr_lst);
        goto restart;
    }
I'm not sure about the logic behind that. We could actually set those addresses as permanent once the DHCPv6 client configures them, if it's cleaner.
Huh. Not in the passt/VM case, though, which is where I actually encountered this.
I meant using ip(8) from the test script itself, but it doesn't
actually make sense:
# ip address change 2a01:4f8:222:904:c800:94ff:fe29:a8d/64 permanent dev eth0
Warning: permanent option is not mutable from userspace
because (RFC 3549):
IFA_F_PERMANENT For a permanent address set by the user.
When this is not set, it means the address
was dynamically created (e.g., by stateless
autoconfiguration).
So the address you used in your test _should_ have IFA_F_PERMANENT. The
plot thickens.
I just tried this, which confirms your hypothesis that bringing the
link down is a different event:
# ip addr add 2001:db8::1 dev dummy0
# ip link set dummy0 down
# ip addr show dev dummy0
5: dummy0:
This kind of opens a question about how hard we should try to accommodate guests which don't configure themselves how we told them.
There's a notable distinction between guests temporarily diverging (in different ways) and guests we don't configure at all.
I'm not really sure what you're getting at here.
In this case, it's not true that the guest doesn't configure itself in the way we requested -- it's just a temporary diversion from that configuration.
Oh, I see. Assuming that at some point the DHCP client will re-run.
Those are different cases that we can handle in different ways, I think. If it's a glitch that will only happen during testing, let's work around that.
But if the guest really ignores DHCPv6 information, I think we should keep that working.
It's probably more important to ensure we use the right type of address
"type" in what sense here?
Global unicast instead of link-local.
Ok.
(security) rather than ensuring we somehow manage to deliver packets at any time (minor glitch otherwise), also because the one you describe is something we're unlikely to hit outside of tests.
Personally I'd be ok with saying that nothing works if the guest doesn't configure itself properly, thereby removing addr_seen and addr_ll_seen entirely. But I think, Stefano, you've been against that idea in the past.
Yes, I still think we should support guests that don't use DHCPv6 or NDP at all,
Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to manually configure the interface in the guest to match the address you've configured with -a. Just like you'd expect to have to correctly configure your address on a real network.
True, but if we make correctness as optional as possible, we'll be more compatible (less time spent by users fixing situations that don't necessarily need fixing, less time spent by developers to look into reports, no matter who's at fault).
Eh, maybe. Unless us trying to make sense of a nonsense situation causes some unpredictable behaviour that breaks something else.
or where related exchanges fail for any reason. It improves reliability and compatibility at a small cost. In this case, I think it's a nice feature that we would resume communicating as soon as the guest shows its global unicast address.
Hm, maybe. I'm not entirely convinced the cost is so small long term. It's pretty badly incompatible with having multiple guests behind the same passt instance: such as the initial guest bridging or routing to nested guests.
Why? We will need to hash the interface/guest index anyway, for outbound flows.
If we have separate interfaces for each guest, yes. But not if we have multiple guests behind a single tap because the initial guest sets up a bridge or routing. Then we have nothing but the address.
...but then we should have multiple addresses anyway. By the way, I'm not sure we'll ever be able to support that kind of configuration. How does a guest set up a bridge and use passt at the same time?
And for inbound flows, if a guest steals the address of another guest, we'll give priority to the normal 'addr' versions instead of the '_seen' ones, to decide how to direct traffic.
I don't see how we'd know we're in this situation, so when to prioritise which address over the other.
In the set of all addr_seen and addr, we would have at least a non-unique value. Or, practically speaking, we should refuse to set addr_seen if it matches addr for another guest.
I'm actually not sure if encountering this bug makes me more or less in favour of addr_seen. On the one hand I think it highlights the flakiness of this approach; there are situations where we just won't know the right address.
I don't understand this argument: indeed, there are such situations, and they are annoying. Why should we make them more common?
Because predictability is good, and working _most_ of the time is a failure of predictability.
It avoids substantial effort and frustration for everybody involved though. The practical problem with lacking predictability is if it makes things harder to debug, I guess, which shouldn't be the case here.
On the other hand it shows a relatively plausible case where the guest won't get exactly the address we want it to (it uses NDP but not DHCPv6)
Hrm... actually this also shows a potential danger in the recent patches to disable DAD in the guest. With DAD enabled, when the guest grabs a new address, we'd expect it to emit DAD messages, which would have the side effect of updating our addr_seen (although I'm pretty sure I hit this patch before the nodad patches were applied, so that doesn't seem to be foolproof).
Well, but we do that for containers with --config-net only. In that case, the addresses we configure have infinite lifetime anyway.
Oh, good point. Hrm... then I'm unsure why the guest wasn't re-DADing its new address.
It probably did, but we ignored that anyway because DAD is done by sending neighbour solicitations with an unspecified address as source, for example (the "change" here drops "nodad"):
$ ./pasta --config-net -p dad.pcap
Saving packet capture to dad.pcap
# ip addr change dev enp9s0 fe80::3882:b5ff:fe01:e9a1/64
# tshark -r dad.pcap |grep Neigh
Running as user "root" and group "root". This could be dangerous.
   10   2.642467           :: → ff02::1:ff01:e9a1 ICMPv6 86 Neighbor Solicitation for fe80::3882:b5ff:fe01:e9a1
and in tap6_handler() we do:
    } else if (!IN6_IS_ADDR_UNSPECIFIED(saddr)){
        c->ip6.addr_seen = *saddr;
    }
...then, in ndp():
    if (IN6_IS_ADDR_UNSPECIFIED(saddr))
        return 1;
we could set addr_seen by looking at the *target* address of the neighbour solicitation when the source address is ::, but it's not implemented yet.
Besides, I don't think we need to have addr_seen updated as quickly and correctly as possible just for the sake of it, we can also update it when we get any other neighbour solicitation because the guest is actually using the network. It's not meant to be perfect.
If the guest is a pure server (a common case for containers AFAICT), then I don't know that we can expect NS messages for anything other than the default gateway, which is (typically) link-local and so won't help us to learn the new global address.
Containers running actual applications are noisy. I've only seen this kind of problem (addr_seen not set/matching) in particularly crafted test environments.
We could maybe update addr_seen when we send RA messages to the guest - assuming that it will use the same host part (low 64-bits) for both link-local and global addresses. Not sure if that's a widely safe assumption or not.
I don't understand: what case are you trying to cover with this?
A case just like the one in the tests: the interface bounces, and we get NDP traffic on the link-local address, but nothing on the global address before an inbound connection.
Oh, I see. I think it makes sense, even though we'll set addr_seen a bit too early, but not enough to be a practical issue, I think. -- Stefano
On Tue, Aug 20, 2024 at 10:39:26PM +0200, Stefano Brivio wrote:
On Tue, 20 Aug 2024 10:42:17 +1000 David Gibson wrote:
On Mon, Aug 19, 2024 at 03:01:00PM +0200, Stefano Brivio wrote:
On Mon, 19 Aug 2024 19:52:49 +1000 David Gibson wrote:
On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote:
On Mon, 19 Aug 2024 18:46:31 +1000 David Gibson wrote:
On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:
Based on Stefano's recent patch for faster tests.
Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.
Suggestions for better names for the new options in patches 20 & 22 are most welcome.
Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.
NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.
I've identified the bug here. IMO, it's a pre-existing problem that only works by accident at the moment. The immediate fix is pretty obvious, but it raises some broader questions.
The problem arises because of the MTU changes we make in order to test throughput with different packet sizes. Specifically we change the MTU to values < 1280, which implicitly disables IPv6 since it requires an MTU >= 1280. When we change the MTU back to a larger value IPv6 is re-enabled, but some configuration has been lost in the meantime.
After the MTU is restored the guest reconfigures with NDP, but does not re-DHCPv6. That means the guest gets a SLAAC address in the right prefix but not the exact /128 address we've tried to assign to it. However, at least with the sequence of things we have in the tests, the guest never sends any packets with the new address, so passt doesn't update addr_seen. When the inbound connection comes we send it to the assigned address instead of the guest's actual address and the guest rejects it.
I still have to take a closer look, but I'm fairly sure I hit a similar issue while I was writing these tests originally. I pondered reconfiguring the address via DHCPv6, or using the keep_addr_on_down sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added around that time.
Then:
This "worked" previously, because before this patch, passt would translate the inbound connection to have source/dest as link-local addresses.
...I realised that this worked and forgot about the whole issue.
We *do* have a current addr_ll_seen because (a) it won't change if the guest doesn't change MAC and (b) when IPv6 is re-enabled the NDP traffic the guest generates will have link-local addresses that update addr_ll_seen. With this patch, and a global address for --map-host-loopback, we now need to send to addr_seen instead of addr_ll_seen, hence exposing the bug.
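To make the addr_seen versus addr_ll_seen choice described above concrete, here is a minimal sketch. It assumes the guest-side destination is simply picked to match the scope of the source address we present to the guest; `struct ip6_seen` and `guest_dst_for()` are hypothetical names for illustration, not passt's actual structures or helpers.

```c
#include <netinet/in.h>

/* Hypothetical, cut-down stand-in for the IPv6 part of passt's context */
struct ip6_seen {
	struct in6_addr addr_seen;	/* last global address seen from guest */
	struct in6_addr addr_ll_seen;	/* last link-local address seen from guest */
};

/* Pick the guest-side destination whose scope matches the source address
 * we present to the guest (illustrative name and signature only).
 */
static const struct in6_addr *guest_dst_for(const struct ip6_seen *ip6,
					    const struct in6_addr *our_src)
{
	/* Link-local source: only a link-local destination makes sense */
	if (IN6_IS_ADDR_LINKLOCAL(our_src))
		return &ip6->addr_ll_seen;

	/* Global source: we depend on addr_seen being up to date */
	return &ip6->addr_seen;
}
```

With a link-local source the stale addr_seen is never consulted, which is consistent with the tests passing by accident before a global address was used for --map-host-loopback.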
In the short term, the obvious fix would be to re-run dhclient -6 in the guest after we twiddle MTU but before running IPv6 tests.
I guess setting keep_addr_on_down (even for "all" interfaces) should work as well.
Sounds like it. I wasn't aware of that one.
/me tests.. actually, no it doesn't work..
# sysctl -a | grep keep_addr_on_down
net.ipv6.conf.all.keep_addr_on_down = 1
net.ipv6.conf.default.keep_addr_on_down = 1
net.ipv6.conf.dummy0.keep_addr_on_down = 1
net.ipv6.conf.lo.keep_addr_on_down = 0
# ip addr add 2001:db8::1 dev dummy0
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8::1/128 scope global
       valid_lft forever preferred_lft forever
# ip link set dummy0 mtu 1200
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1200 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
# ip link set dummy0 mtu 1500
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
My guess is that IPv6 being deconfigured because of an unsuitable MTU is considered a different event from a mere "down".
I guess it's because they're not IFA_F_PERMANENT, because addrconf_permanent_addr() has:
    case NETDEV_CHANGEMTU:
        /* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */
        if (dev->mtu < IPV6_MIN_MTU) {
            addrconf_ifdown(dev, dev != net->loopback_dev);
            break;
        }
but addrconf_ifdown() does:
    if (!keep_addr ||
        !(ifa->flags & IFA_F_PERMANENT) ||
        addr_is_local(&ifa->addr)) {
        hlist_del_init_rcu(&ifa->addr_lst);
        goto restart;
    }
I'm not sure about the logic behind that. We could actually set those addresses as permanent once the DHCPv6 client configures them, if it's cleaner.
Huh. Not in the passt/VM case, though, which is where I actually encountered this.
I meant using ip(8) from the test script itself, but it doesn't actually make sense:
# ip address change 2a01:4f8:222:904:c800:94ff:fe29:a8d/64 permanent dev eth0
Warning: permanent option is not mutable from userspace
because (RFC 3549):
IFA_F_PERMANENT For a permanent address set by the user. When this is not set, it means the address was dynamically created (e.g., by stateless autoconfiguration).
So the address you used in your test _should_ have IFA_F_PERMANENT. The plot thickens.
I just tried this, which confirms your hypothesis that bringing the link down is a different event:
# ip addr add 2001:db8::1 dev dummy0
# ip link set dummy0 down
# ip addr show dev dummy0
5: dummy0: mtu 1280 qdisc noqueue state DOWN group default qlen 1000
    link/ether 02:59:00:28:1b:5f brd ff:ff:ff:ff:ff:ff
    inet 1.2.3.1/24 scope global dummy0
       valid_lft forever preferred_lft forever
    inet6 2001:db8::1/128 scope global
       valid_lft forever preferred_lft forever
# ip link set dummy0 mtu 1279
# ip addr show dev dummy0
5: dummy0: mtu 1279 qdisc noqueue state DOWN group default qlen 1000
    link/ether 02:59:00:28:1b:5f brd ff:ff:ff:ff:ff:ff
    inet 1.2.3.1/24 scope global dummy0
       valid_lft forever preferred_lft forever
...I just can't see that from the code.
Ok.
This kind of opens a question about how hard we should try to accommodate guests which don't configure themselves how we told them.
There's a notable distinction between guests temporarily diverging (in different ways) and guests we don't configure at all.
I'm not really sure what you're getting at here.
In this case, it's not true that the guest doesn't configure itself in the way we requested -- it's just a temporary diversion from that configuration.
Oh, I see. Assuming that at some point the DHCP client will re-run.
Those are different cases that we can handle in different ways, I think. If it's a glitch that will only happen during testing, let's work around that.
But if the guest really ignores DHCPv6 information, I think we should keep that working.
It's probably more important to ensure we use the right type of address
"type" in what sense here?
Global unicast instead of link-local.
Ok.
(security) rather than ensuring we somehow manage to deliver packets at any time (minor glitch otherwise), also because the one you describe is something we're unlikely to hit outside of tests.
Personally I'd be ok with saying that nothing works if the guest doesn't configure itself properly, thereby removing addr_seen and addr_ll_seen entirely. But I think, Stefano, you've been against that idea in the past.
Yes, I still think we should support guests that don't use DHCPv6 or NDP at all,
Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to manually configure the interface in the guest to match the address you've configured with -a. Just like you'd expect to have to correctly configure your address on a real network.
True, but if we make correctness as optional as possible, we'll be more compatible (less time spent by users fixing situations that don't necessarily need fixing, less time spent by developers to look into reports, no matter who's at fault).
Eh, maybe. Unless us trying to make sense of a nonsense situation causes some unpredictable behaviour that breaks something else.
or where related exchanges fail for any reason. It improves reliability and compatibility at a small cost. In this case, I think it's a nice feature that we would resume communicating as soon as the guest shows its global unicast address.
Hm, maybe. I'm not entirely convinced the cost is so small long term. It's pretty badly incompatible with having multiple guests behind the same passt instance: such as the initial guest bridging or routing to nested guests.
Why? We will need to hash the interface/guest index anyway, for outbound flows.
If we have separate interfaces for each guest, yes. But not if we have multiple guests behind a single tap because the initial guest sets up a bridge or routing. Then we have nothing but the address.
...but then we should have multiple addresses anyway.
Yes.. that's kind of my point.
By the way, I'm not sure we'll ever be able to support that kind of configuration.
I don't see why not. It would require configuration so that it's clear what each inbound forward targets. But I don't see any inherent problem here, though there are a number of current implementation details which prevent it (addr_seen is one, replying to all arps is another).
How does a guest set up a bridge and use passt at the same time?
I'm not thinking of a bridge shared with the host, but a bridge (or routing) between nested guests or namespaces. This is essentially the "private switch with pasta uplink" case we've discussed occasionally before. It doesn't technically have to be nested guests - the guest could bridge between its uplink and a tunnel, but nested guests is the likely use case.
And for inbound flows, if a guest steals the address of another guest, we'll give priority to the normal 'addr' versions instead of the '_seen' ones, to decide how to direct traffic.
I don't see how we'd know we're in this situation, so when to prioritise which address over the other.
In the set of all addr_seen and addr, we would have at least a non-unique value. Or, practically speaking, we should refuse to set addr_seen if it matches addr for another guest.
Ah, ok. So again, assuming a static configuration of known guests, rather than a local bridge established by a guest at runtime.
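For illustration, the "refuse to set addr_seen if it matches addr for another guest" rule could look roughly like the sketch below. It assumes a static table of known guests, which passt does not currently have; `struct guest` and `guest_update_seen()` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <netinet/in.h>

/* Hypothetical per-guest state: passt currently tracks a single guest */
struct guest {
	struct in6_addr addr;		/* address we assigned to this guest */
	struct in6_addr addr_seen;	/* address last observed in its traffic */
};

/* Record @saddr as the observed address of guest @idx, unless it is the
 * assigned address of a different guest. Returns true if addr_seen was
 * updated, false if the update was refused.
 */
static bool guest_update_seen(struct guest *guests, size_t n, size_t idx,
			      const struct in6_addr *saddr)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (i != idx && IN6_ARE_ADDR_EQUAL(&guests[i].addr, saddr))
			return false;	/* would steal another guest's address */
	}

	guests[idx].addr_seen = *saddr;
	return true;
}
```

This only helps when every guest is known in advance; it cannot tell guests apart when they sit behind a bridge that one guest sets up at runtime.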
I'm actually not sure if encountering this bug makes me more or less in favour of addr_seen. On the one hand I think it highlights the flakiness of this approach; there are situations where we just won't know the right address.
I don't understand this argument: indeed, there are such situations, and they are annoying. Why should we make them more common?
Because predictability is good, and working _most_ of the time is a failure of predictability.
It avoids substantial effort and frustration for everybody involved though. The practical problem with lacking predictability is if it makes things harder to debug, I guess, which shouldn't be the case here.
On the other hand it shows a relatively plausible case where the guest won't get exactly the address we want it to (it uses NDP but not DHCPv6)
Hrm... actually this also shows a potential danger in the recent patches to disable DAD in the guest. With DAD enabled, when the guest grabs a new address, we'd expect it to emit DAD messages, which would have the side effect of updating our addr_seen (although I'm pretty sure I hit this patch before the nodad patches were applied, so that doesn't seem to be foolproof).
Well, but we do that for containers with --config-net only. In that case, the addresses we configure have infinite lifetime anyway.
Oh, good point. Hrm... then I'm unsure why the guest wasn't re-DADing its new address.
It probably did, but we ignored that anyway because DAD is done by sending neighbour solicitations with an unspecified address as source, for example (the "change" here drops "nodad"):
$ ./pasta --config-net -p dad.pcap
Saving packet capture to dad.pcap
# ip addr change dev enp9s0 fe80::3882:b5ff:fe01:e9a1/64
# tshark -r dad.pcap |grep Neigh
Running as user "root" and group "root". This could be dangerous.
   10   2.642467           :: → ff02::1:ff01:e9a1 ICMPv6 86 Neighbor Solicitation for fe80::3882:b5ff:fe01:e9a1
and in tap6_handler() we do:
    } else if (!IN6_IS_ADDR_UNSPECIFIED(saddr)){
        c->ip6.addr_seen = *saddr;
    }
...then, in ndp():
    if (IN6_IS_ADDR_UNSPECIFIED(saddr))
        return 1;
we could set addr_seen by looking at the *target* address of the neighbour solicitation when the source address is ::, but it's not implemented yet.
Right. I forgot the NS went out with :: as source. Snooping the NS that way again assumes that there's only one logical machine on the guest side. But since this is for addr_seen which fundamentally assumes that anyway, I guess it doesn't make anything worse.
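A rough sketch of the not-yet-implemented idea discussed above: snoop the target address of a neighbour solicitation whose source is ::. It uses only the standard `struct nd_neighbor_solicit` from <netinet/icmp6.h>; the function name `ndp_snoop_dad()` and its signature are assumptions for illustration and are not part of passt. Like addr_seen itself, it assumes a single logical machine on the guest side.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <netinet/in.h>
#include <netinet/icmp6.h>

/* If @msg is a neighbour solicitation sent from the unspecified address
 * (that is, a DAD probe), record its target as the guest's address.
 * Returns true if one of the "seen" addresses was updated.
 */
static bool ndp_snoop_dad(const struct in6_addr *saddr,
			  const void *msg, size_t len,
			  struct in6_addr *addr_seen,
			  struct in6_addr *addr_ll_seen)
{
	struct nd_neighbor_solicit ns;

	if (len < sizeof(ns))
		return false;

	/* Copy out of the packet buffer to avoid alignment assumptions */
	memcpy(&ns, msg, sizeof(ns));

	if (ns.nd_ns_type != ND_NEIGHBOR_SOLICIT)
		return false;

	/* Only solicitations from :: are DAD probes */
	if (!IN6_IS_ADDR_UNSPECIFIED(saddr))
		return false;

	/* A multicast target is never a valid address to learn */
	if (IN6_IS_ADDR_MULTICAST(&ns.nd_ns_target))
		return false;

	if (IN6_IS_ADDR_LINKLOCAL(&ns.nd_ns_target))
		*addr_ll_seen = ns.nd_ns_target;
	else
		*addr_seen = ns.nd_ns_target;

	return true;
}
```

Note that the address is learned while it is still tentative, so this sketch records it slightly before the guest can actually use it.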
Besides, I don't think we need to have addr_seen updated as quickly and correctly as possible just for the sake of it, we can also update it when we get any other neighbour solicitation because the guest is actually using the network. It's not meant to be perfect.
If the guest is a pure server (a common case for containers AFAICT), then I don't know that we can expect NS messages for anything other than the default gateway, which is (typically) link-local and so won't help us to learn the new global address.
Containers running actual applications are noisy. I've only seen this kind of problem (addr_seen not set/matching) in particularly crafted test environments.
We could maybe update addr_seen when we send RA messages to the guest - assuming that it will use the same host part (low 64-bits) for both link-local and global addresses. Not sure if that's a widely safe assumption or not.
I don't understand: what case are you trying to cover with this?
A case just like the one in the tests: the interface bounces, and we get NDP traffic on the link-local address, but nothing on the global address before an inbound connection.
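As a sketch of the RA-based idea quoted above: combine the advertised /64 prefix with the low 64 bits of addr_ll_seen to guess the guest's SLAAC address. `slaac_guess()` is a hypothetical helper, and the guess is only right if the guest reuses its link-local interface identifier for the global address, which, as noted, may not be a widely safe assumption (guests using temporary or otherwise randomised identifiers would not match).

```c
#include <string.h>
#include <netinet/in.h>

/* Guess the guest's SLAAC address: upper 64 bits from the advertised /64
 * prefix, lower 64 bits (interface identifier) from the link-local address
 * we last saw from the guest.
 */
static struct in6_addr slaac_guess(const struct in6_addr *prefix,
				   const struct in6_addr *addr_ll_seen)
{
	struct in6_addr guess;

	memcpy(&guess.s6_addr[0], &prefix->s6_addr[0], 8);
	memcpy(&guess.s6_addr[8], &addr_ll_seen->s6_addr[8], 8);

	return guess;
}
```

For example, prefix 2001:db8::/64 combined with fe80::3882:b5ff:fe01:e9a1 gives 2001:db8::3882:b5ff:fe01:e9a1.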
Oh, I see. I think it makes sense, even though we'll set addr_seen a bit too early, but not enough to be a practical issue, I think.
Yes, but I think snooping the NS from DAD is probably a better idea.
-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
participants (3)
- David Gibson
- Paul Holzinger
- Stefano Brivio