[PATCH 00/22] RFC: Allow configuration of special case NATs

David Gibson

16 Aug 2024

7:39 a.m.

Based on Stefano's recent patch for faster tests.

Allow the user to specify which addresses are translated when used by
the guest, rather than always being the gateway address or nothing.
We also allow this remapping to go to the host's global address (more
precisely the address assigned to the guest) rather than just host
loopback.

Suggestions for better names for the new options in patches 20 & 22
are most welcome.

Along the way to implementing that make many changes to clarify what
various addresses we track mean, fixing a number of small bugs as
well.

NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf
tests.  I haven't managed to figure out why it's causing the problem,
or even what the exact triggering conditions are (running the single
stalling iperf alone doesn't do it).  Have to wrap up for today, so I
thought I'd get this out for review anyway.

Paul, amongst other things, I think this will allow podman to
(finally) nicely address #19213, picking an address to remap to the
host's external address with --nat-guest-addr, much like it already
uses --dns-forward.

David Gibson (22):
  treewide: Use "our address" instead of "forwarding address"
  util: Helper for formatting MAC addresses
  treewide: Rename MAC address fields for clarity
  treewide: Use struct assignment instead of memcpy() for IP addresses
  conf: Use array indices rather than pointers for DNS array slots
  conf: More accurately count entries added in get_dns()
  conf: Move DNS array bounds checks into add_dns[46]
  conf: Move adding of a nameserver from resolv.conf into subfunction
  conf: Correct setting of dns_match address in add_dns6()
  conf: Treat --dns addresses as guest visible addresses
  conf: Remove incorrect initialisation of addr_ll_seen
  util: Correct sock_l4() binding for link local addresses
  treewide: Change misleading 'addr_ll' name
  Clarify which addresses in ip[46]_ctx are meaningful where
  Initialise our_tap_ll to ip6.gw when suitable
  fwd: Helpers to clarify what host addresses aren't guest accessible
  fwd: Split notion of "our tap address" from gateway for IPv4
  Don't take "our" MAC address from the host
  conf, fwd: Split notion of gateway/router from guest-visible host address
  conf: Allow address remapped to host to be configured
  fwd: Distinguish translatable from untranslatable addresses on inbound
  fwd, conf: Allow NAT of the guest's assigned address

 arp.c                 |   4 +-
 conf.c                | 328 +++++++++++++++++++++++++-----------------
 dhcp.c                |  19 +--
 dhcpv6.c              |  21 +--
 flow.c                |  72 +++++-----
 flow.h                |  18 +--
 fwd.c                 | 170 +++++++++++++++++-----
 icmp.c                |   4 +-
 ndp.c                 |   9 +-
 passt.1               |  45 +++++-
 passt.c               |   2 +-
 passt.h               |  53 +++++--
 pasta.c               |  14 +-
 tap.c                 |  12 +-
 tcp.c                 |  33 ++---
 tcp_internal.h        |   2 +-
 test/lib/setup        |  11 +-
 test/passt_in_ns/dhcp |  73 ++++++++++
 test/passt_in_ns/tcp  |  38 +++--
 test/passt_in_ns/udp  |  22 +--
 test/perf/passt_tcp   |  33 ++---
 test/perf/passt_udp   |  31 ++--
 test/perf/pasta_tcp   |  29 ++--
 test/perf/pasta_udp   |  25 ++--
 test/run              |   4 +-
 udp.c                 |  12 +-
 util.c                |  22 ++-
 util.h                |   4 +-
 28 files changed, 719 insertions(+), 391 deletions(-)
 create mode 100644 test/passt_in_ns/dhcp

-- 
2.46.0
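
To make the remapping described in the cover letter concrete, here is a minimal IPv4-only sketch in C. It is not code from this series: the structure, its field names and nat_outbound() are hypothetical, addresses are kept in host byte order for brevity, and 0 stands for "not configured". It only illustrates the idea that one configurable guest-side address can be translated to host loopback and another to the host's own address (the one also assigned to the guest), while everything else passes through untranslated.

```c
#include <stdint.h>

/* Hypothetical, IPv4-only configuration.  Field names are illustrative and
 * are not taken from passt.  Addresses are in host byte order, 0 is "unset".
 */
struct nat_conf {
	uint32_t map_to_loopback;	/* guest-side address sent to host loopback */
	uint32_t map_to_host_addr;	/* guest-side address sent to the host's address */
	uint32_t host_addr;		/* host address, also assigned to the guest */
};

/* Build a host byte order IPv4 address from dotted-quad components */
#define IP4(a, b, c, d)							\
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) |		\
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Return the host-side destination for guest-side destination @dst */
static uint32_t nat_outbound(const struct nat_conf *conf, uint32_t dst)
{
	if (conf->map_to_loopback && dst == conf->map_to_loopback)
		return IP4(127, 0, 0, 1);

	if (conf->map_to_host_addr && dst == conf->map_to_host_addr)
		return conf->host_addr;

	return dst;	/* anything else is passed through untranslated */
}
```

With map_to_loopback set to 192.0.2.1 and map_to_host_addr set to 192.0.2.2, a guest connection to 192.0.2.1 would land on 127.0.0.1 and one to 192.0.2.2 on host_addr. How the real options in patches 20 and 22 are parsed and applied is defined by those patches, not by this sketch.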

David Gibson

16 Aug

7:39 a.m.

New subject: [PATCH 01/22] treewide: Use "our address" instead of "forwarding address"

The term "forwarding address" to indicate the local-to-passt address was
well-intentioned, but ends up being kinda confusing.  As discussed on a
recent call, let's try "our" instead.

Signed-off-by: David Gibson
---
 flow.c         | 72 +++++++++++++++++++++++++-------------------------
 flow.h         | 18 ++++++-------
 fwd.c          | 70 ++++++++++++++++++++++++------------------------
 icmp.c         |  4 +--
 tcp.c          | 33 ++++++++++++-----------
 tcp_internal.h |  2 +-
 udp.c          | 12 ++++-----
 7 files changed, 106 insertions(+), 105 deletions(-)

diff --git a/flow.c b/flow.c
index 93b687dc..8915e366 100644
--- a/flow.c
+++ b/flow.c
@@ -127,18 +127,18 @@ static struct timespec flow_timer_run;
  * @af:		Address family (AF_INET or AF_INET6)
  * @eaddr:	Endpoint address (pointer to in_addr or in6_addr)
  * @eport:	Endpoint port
- * @faddr:	Forwarding address (pointer to in_addr or in6_addr)
- * @fport:	Forwarding port
+ * @oaddr:	Our address (pointer to in_addr or in6_addr)
+ * @oport:	Our port
  */
 static void flowside_from_af(struct flowside *side, sa_family_t af,
 			     const void *eaddr, in_port_t eport,
-			     const void *faddr, in_port_t fport)
+			     const void *oaddr, in_port_t oport)
 {
-	if (faddr)
-		inany_from_af(&side->faddr, af, faddr);
+	if (oaddr)
+		inany_from_af(&side->oaddr, af, oaddr);
 	else
-		side->faddr = inany_any6;
-	side->fport = fport;
+		side->oaddr = inany_any6;
+	side->oport = oport;
 
 	if (eaddr)
 		inany_from_af(&side->eaddr, af, eaddr);
@@ -193,8 +193,8 @@ static int flowside_sock_splice(void *arg)
  * @tgt:	Target flowside
  * @data:	epoll reference portion for protocol handlers
  *
- * Return: socket fd of protocol @proto bound to the forwarding address and port
- *         from @tgt (if specified).
+ * Return: socket fd of protocol @proto bound to our address and port from @tgt
+ *         (if specified).
  */
 int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 		     const struct flowside *tgt, uint32_t data)
@@ -205,11 +205,11 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 
 	ASSERT(pif_is_socket(pif));
 
-	pif_sockaddr(c, &sa, &sl, pif, &tgt->faddr, tgt->fport);
+	pif_sockaddr(c, &sa, &sl, pif, &tgt->oaddr, tgt->oport);
 
 	switch (pif) {
 	case PIF_HOST:
-		if (inany_is_loopback(&tgt->faddr))
+		if (inany_is_loopback(&tgt->oaddr))
 			ifname = NULL;
 		else if (sa.sa_family == AF_INET)
 			ifname = c->ip4.ifname_out;
@@ -309,11 +309,11 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
 			 pif_name(f->pif[INISIDE]),
 			 inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
 			 ini->eport,
-			 inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)),
-			 ini->fport,
+			 inany_ntop(&ini->oaddr, fstr0, sizeof(fstr0)),
+			 ini->oport,
 			 pif_name(f->pif[TGTSIDE]),
-			 inany_ntop(&tgt->faddr, fstr1, sizeof(fstr1)),
-			 tgt->fport,
+			 inany_ntop(&tgt->oaddr, fstr1, sizeof(fstr1)),
+			 tgt->oport,
 			 inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)),
 			 tgt->eport);
 	else if (MAX(state, oldstate) >= FLOW_STATE_INI)
@@ -321,8 +321,8 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
 			 pif_name(f->pif[INISIDE]),
 			 inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
 			 ini->eport,
-			 inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)),
-			 ini->fport);
+			 inany_ntop(&ini->oaddr, fstr0, sizeof(fstr0)),
+			 ini->oport);
 }
 
 /**
@@ -347,7 +347,7 @@ static void flow_initiate_(union flow *flow, uint8_t pif)
  * flow_initiate_af() - Move flow to INI, setting INISIDE details
  * @flow:	Flow to change state
  * @pif:	pif of the initiating side
- * @af:		Address family of @eaddr and @faddr
+ * @af:		Address family of @eaddr and @oaddr
  * @saddr:	Source address (pointer to in_addr or in6_addr)
  * @sport:	Endpoint port
  * @daddr:	Destination address (pointer to in_addr or in6_addr)
@@ -384,10 +384,10 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
 
 	inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa);
 	if (inany_v4(&ini->eaddr))
-		ini->faddr = inany_any4;
+		ini->oaddr = inany_any4;
 	else
-		ini->faddr = inany_any6;
-	ini->fport = dport;
+		ini->oaddr = inany_any6;
+	ini->oport = dport;
 	flow_initiate_(flow, pif);
 	return ini;
 }
@@ -432,8 +432,8 @@ const struct flowside *flow_target(const struct ctx *c, union flow *flow,
 		     pif_name(f->pif[INISIDE]),
 		     inany_ntop(&ini->eaddr, estr, sizeof(estr)),
 		     ini->eport,
-		     inany_ntop(&ini->faddr, fstr, sizeof(fstr)),
-		     ini->fport);
+		     inany_ntop(&ini->oaddr, fstr, sizeof(fstr)),
+		     ini->oport);
 	}
 
 	if (tgtpif == PIF_NONE)
@@ -561,12 +561,12 @@ static uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif,
 {
 	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
 
-	inany_siphash_feed(&state, &side->faddr);
+	inany_siphash_feed(&state, &side->oaddr);
 	inany_siphash_feed(&state, &side->eaddr);
 
 	return siphash_final(&state, 38, (uint64_t)proto << 40 |
 					 (uint64_t)pif << 32 |
-					 (uint64_t)side->fport << 16 |
+					 (uint64_t)side->oport << 16 |
 					 (uint64_t)side->eport);
 }
 
@@ -587,7 +587,7 @@ static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx)
 	 * information, and at least a forwarding port.
 	 */
 	ASSERT(pif != PIF_NONE && !inany_is_unspecified(&side->eaddr) &&
-	       side->eport != 0 && side->fport != 0);
+	       side->eport != 0 && side->oport != 0);
 
 	return flow_hash(c, FLOW_PROTO(f), pif, side);
 }
@@ -709,20 +709,20 @@ static flow_sidx_t flowside_lookup(const struct ctx *c, uint8_t proto,
  * @pif:	Interface of the flow
  * @af:		Address family, AF_INET or AF_INET6
  * @eaddr:	Guest side endpoint address (guest local address)
- * @faddr:	Guest side forwarding address (guest remote address)
+ * @oaddr:	Our guest side address (guest remote address)
  * @eport:	Guest side endpoint port (guest local port)
- * @fport:	Guest side forwarding port (guest remote port)
+ * @oport:	Our guest side port (guest remote port)
  *
  * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found
  */
 flow_sidx_t flow_lookup_af(const struct ctx *c,
 			   uint8_t proto, uint8_t pif, sa_family_t af,
-			   const void *eaddr, const void *faddr,
-			   in_port_t eport, in_port_t fport)
+			   const void *eaddr, const void *oaddr,
+			   in_port_t eport, in_port_t oport)
 {
 	struct flowside side;
 
-	flowside_from_af(&side, af, eaddr, eport, faddr, fport);
+	flowside_from_af(&side, af, eaddr, eport, oaddr, oport);
 
 	return flowside_lookup(c, proto, pif, &side);
 }
@@ -732,22 +732,22 @@ flow_sidx_t flow_lookup_af(const struct ctx *c,
  * @proto:	Protocol of the flow (IP L4 protocol number)
  * @pif:	Interface of the flow
  * @esa:	Socket address of the endpoint
- * @fport:	Forwarding port number
+ * @oport:	Our port number
  *
  * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found
  */
 flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
-			   const void *esa, in_port_t fport)
+			   const void *esa, in_port_t oport)
 {
 	struct flowside side = {
-		.fport = fport,
+		.oport = oport,
 	};
 
 	inany_from_sockaddr(&side.eaddr, &side.eport, esa);
 	if (inany_v4(&side.eaddr))
-		side.faddr = inany_any4;
+		side.oaddr = inany_any4;
 	else
-		side.faddr = inany_any6;
+		side.oaddr = inany_any6;
 
 	return flowside_lookup(c, proto, pif, &side);
 }
diff --git a/flow.h b/flow.h
index 078fd605..d167b654 100644
--- a/flow.h
+++ b/flow.h
@@ -140,14 +140,14 @@ extern const uint8_t flow_proto[];
 /**
  * struct flowside - Address information for one side of a flow
  * @eaddr:	Endpoint address (remote address from passt's PoV)
- * @faddr:	Forwarding address (local address from passt's PoV)
+ * @oaddr:	Our address (local address from passt's PoV)
  * @eport:	Endpoint port
- * @fport:	Forwarding port
+ * @oport:	Our port
  */
 struct flowside {
-	union inany_addr	faddr;
+	union inany_addr	oaddr;
 	union inany_addr	eaddr;
-	in_port_t		fport;
+	in_port_t		oport;
 	in_port_t		eport;
 };
 
@@ -162,8 +162,8 @@ static inline bool flowside_eq(const struct flowside *left,
 {
 	return inany_equals(&left->eaddr, &right->eaddr) &&
 	       left->eport == right->eport &&
-	       inany_equals(&left->faddr, &right->faddr) &&
-	       left->fport == right->fport;
+	       inany_equals(&left->oaddr, &right->oaddr) &&
+	       left->oport == right->oport;
 }
 
 int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
@@ -240,10 +240,10 @@ uint64_t flow_hash_insert(const struct ctx *c, flow_sidx_t sidx);
 void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx);
 flow_sidx_t flow_lookup_af(const struct ctx *c,
 			   uint8_t proto, uint8_t pif, sa_family_t af,
-			   const void *eaddr, const void *faddr,
-			   in_port_t eport, in_port_t fport);
+			   const void *eaddr, const void *oaddr,
+			   in_port_t eport, in_port_t oport);
 flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
-			   const void *esa, in_port_t fport);
+			   const void *esa, in_port_t oport);
 
 union flow;
 
diff --git a/fwd.c b/fwd.c
index dea36f6c..b546bc41 100644
--- a/fwd.c
+++ b/fwd.c
@@ -167,7 +167,7 @@ void fwd_scan_ports_init(struct ctx *c)
 static bool is_dns_flow(uint8_t proto, const struct flowside *ini)
 {
 	return ((proto == IPPROTO_UDP) || (proto == IPPROTO_TCP)) &&
-		((ini->fport == 53) || (ini->fport == 853));
+		((ini->oport == 53) || (ini->oport == 853));
 }
 
 /**
@@ -184,33 +184,33 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
 			 const struct flowside *ini, struct flowside *tgt)
 {
 	if (is_dns_flow(proto, ini) &&
-	    inany_equals4(&ini->faddr, &c->ip4.dns_match))
+	    inany_equals4(&ini->oaddr, &c->ip4.dns_match))
 		tgt->eaddr = inany_from_v4(c->ip4.dns_host);
 	else if (is_dns_flow(proto, ini) &&
-		   inany_equals6(&ini->faddr, &c->ip6.dns_match))
+		   inany_equals6(&ini->oaddr, &c->ip6.dns_match))
 		tgt->eaddr.a6 = c->ip6.dns_host;
-	else if (!c->no_map_gw && inany_equals4(&ini->faddr, &c->ip4.gw))
+	else if (!c->no_map_gw && inany_equals4(&ini->oaddr, &c->ip4.gw))
 		tgt->eaddr = inany_loopback4;
-	else if (!c->no_map_gw && inany_equals6(&ini->faddr, &c->ip6.gw))
+	else if (!c->no_map_gw && inany_equals6(&ini->oaddr, &c->ip6.gw))
 		tgt->eaddr = inany_loopback6;
 	else
-		tgt->eaddr = ini->faddr;
+		tgt->eaddr = ini->oaddr;
 
-	tgt->eport = ini->fport;
+	tgt->eport = ini->oport;
 
 	/* The relevant addr_out controls the host side source address.  This
 	 * may be unspecified, which allows the kernel to pick an address.
 	 */
 	if (inany_v4(&tgt->eaddr))
-		tgt->faddr = inany_from_v4(c->ip4.addr_out);
+		tgt->oaddr = inany_from_v4(c->ip4.addr_out);
 	else
-		tgt->faddr.a6 = c->ip6.addr_out;
+		tgt->oaddr.a6 = c->ip6.addr_out;
 
 	/* Let the kernel pick a host side source port */
-	tgt->fport = 0;
+	tgt->oport = 0;
 	if (proto == IPPROTO_UDP) {
 		/* But for UDP we preserve the source port */
-		tgt->fport = ini->eport;
+		tgt->oport = ini->eport;
 	}
 
 	return PIF_HOST;
@@ -230,13 +230,13 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
 			    const struct flowside *ini, struct flowside *tgt)
 {
 	if (!inany_is_loopback(&ini->eaddr) ||
-	    (!inany_is_loopback(&ini->faddr) && !inany_is_unspecified(&ini->faddr))) {
+	    (!inany_is_loopback(&ini->oaddr) && !inany_is_unspecified(&ini->oaddr))) {
 		char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
 
 		debug("Non loopback address on %s: [%s]:%hu -> [%s]:%hu",
 		      pif_name(PIF_SPLICE),
 		      inany_ntop(&ini->eaddr, estr, sizeof(estr)), ini->eport,
-		      inany_ntop(&ini->faddr, fstr, sizeof(fstr)), ini->fport);
+		      inany_ntop(&ini->oaddr, fstr, sizeof(fstr)), ini->oport);
 		return PIF_NONE;
 	}
 
@@ -248,20 +248,20 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto,
 	/* Preserve the specific loopback adddress used, but let the kernel pick
 	 * a source port on the target side
 	 */
-	tgt->faddr = ini->eaddr;
-	tgt->fport = 0;
+	tgt->oaddr = ini->eaddr;
+	tgt->oport = 0;
 
-	tgt->eport = ini->fport;
+	tgt->eport = ini->oport;
 	if (proto == IPPROTO_TCP)
 		tgt->eport += c->tcp.fwd_out.delta[tgt->eport];
 	else if (proto == IPPROTO_UDP)
 		tgt->eport += c->udp.fwd_out.delta[tgt->eport];
 
 	/* Let the kernel pick a host side source port */
-	tgt->fport = 0;
+	tgt->oport = 0;
 	if (proto == IPPROTO_UDP)
 		/* But for UDP preserve the source port */
-		tgt->fport = ini->eport;
+		tgt->oport = ini->eport;
 
 	return PIF_HOST;
 }
@@ -280,7 +280,7 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
 			  const struct flowside *ini, struct flowside *tgt)
 {
 	/* Common for spliced and non-spliced cases */
-	tgt->eport = ini->fport;
+	tgt->eport = ini->oport;
 	if (proto == IPPROTO_TCP)
 		tgt->eport += c->tcp.fwd_in.delta[tgt->eport];
 	else if (proto == IPPROTO_UDP)
@@ -293,11 +293,11 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
 		/* Preserve the specific loopback adddress used, but let the
 		 * kernel pick a source port on the target side
 		 */
-		tgt->faddr = ini->eaddr;
-		tgt->fport = 0;
+		tgt->oaddr = ini->eaddr;
+		tgt->oport = 0;
 		if (proto == IPPROTO_UDP)
 			/* But for UDP preserve the source port */
-			tgt->fport = ini->eport;
+			tgt->oport = ini->eport;
 
 		if (inany_v4(&ini->eaddr))
 			tgt->eaddr = inany_loopback4;
@@ -307,26 +307,26 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
 		return PIF_SPLICE;
 	}
 
-	tgt->faddr = ini->eaddr;
-	tgt->fport = ini->eport;
+	tgt->oaddr = ini->eaddr;
+	tgt->oport = ini->eport;
 
-	if (inany_is_loopback4(&tgt->faddr) ||
-	    inany_is_unspecified4(&tgt->faddr) ||
-	    inany_equals4(&tgt->faddr, &c->ip4.addr_seen)) {
-		tgt->faddr = inany_from_v4(c->ip4.gw);
-	} else if (inany_is_loopback6(&tgt->faddr) ||
-		   inany_equals6(&tgt->faddr, &c->ip6.addr_seen) ||
-		   inany_equals6(&tgt->faddr, &c->ip6.addr)) {
+	if (inany_is_loopback4(&tgt->oaddr) ||
+	    inany_is_unspecified4(&tgt->oaddr) ||
+	    inany_equals4(&tgt->oaddr, &c->ip4.addr_seen)) {
+		tgt->oaddr = inany_from_v4(c->ip4.gw);
+	} else if (inany_is_loopback6(&tgt->oaddr) ||
+		   inany_equals6(&tgt->oaddr, &c->ip6.addr_seen) ||
+		   inany_equals6(&tgt->oaddr, &c->ip6.addr)) {
 		if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
-			tgt->faddr.a6 = c->ip6.gw;
+			tgt->oaddr.a6 = c->ip6.gw;
 		else
-			tgt->faddr.a6 = c->ip6.addr_ll;
+			tgt->oaddr.a6 = c->ip6.addr_ll;
 	}
 
-	if (inany_v4(&tgt->faddr)) {
+	if (inany_v4(&tgt->oaddr)) {
 		tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
 	} else {
-		if (inany_is_linklocal6(&tgt->faddr))
+		if (inany_is_linklocal6(&tgt->oaddr))
 			tgt->eaddr.a6 = c->ip6.addr_ll_seen;
 		else
 			tgt->eaddr.a6 = c->ip6.addr_seen;
diff --git a/icmp.c b/icmp.c
index cb81c768..f514dbc9 100644
--- a/icmp.c
+++ b/icmp.c
@@ -125,13 +125,13 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
 		 ini->eport, seq);
 
 	if (pingf->f.type == FLOW_PING4) {
-		const struct in_addr *saddr = inany_v4(&ini->faddr);
+		const struct in_addr *saddr = inany_v4(&ini->oaddr);
 		const struct in_addr *daddr = inany_v4(&ini->eaddr);
 
 		ASSERT(saddr && daddr); /* Must have IPv4 addresses */
 		tap_icmp4_send(c, *saddr, *daddr, buf, n);
 	} else if (pingf->f.type == FLOW_PING6) {
-		const struct in6_addr *saddr = &ini->faddr.a6;
+		const struct in6_addr *saddr = &ini->oaddr.a6;
 		const struct in6_addr *daddr = &ini->eaddr.a6;
 
 		tap_icmp6_send(c, saddr, daddr, buf, n);
diff --git a/tcp.c b/tcp.c
index c0820ce7..f01fe8f9 100644
--- a/tcp.c
+++ b/tcp.c
@@ -361,8 +361,8 @@ static const char *tcp_flag_str[] __attribute((__unused__)) = {
 static int tcp_sock_init_ext	[NUM_PORTS][IP_VERSIONS];
 static int tcp_sock_ns		[NUM_PORTS][IP_VERSIONS];
 
-/* Table of guest side forwarding addresses with very low RTT (assumed
- * to be local to the host), LRU
+/* Table of our guest side addresses with very low RTT (assumed to be local to
+ * the host), LRU
  */
 static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
 
@@ -663,7 +663,7 @@ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn)
 	int i;
 
 	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++)
-		if (inany_equals(&tapside->faddr, low_rtt_dst + i))
+		if (inany_equals(&tapside->oaddr, low_rtt_dst + i))
 			return 1;
 
 	return 0;
@@ -686,7 +686,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 		return;
 
 	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) {
-		if (inany_equals(&tapside->faddr, low_rtt_dst + i))
+		if (inany_equals(&tapside->oaddr, low_rtt_dst + i))
 			return;
 		if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i))
 			hole = i;
@@ -698,7 +698,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 	if (hole == -1)
 		return;
 
-	low_rtt_dst[hole++] = tapside->faddr;
+	low_rtt_dst[hole++] = tapside->oaddr;
 	if (hole == LOW_RTT_TABLE_SIZE)
 		hole = 0;
 	inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any);
@@ -881,7 +881,7 @@ static void tcp_fill_header(struct tcphdr *th,
 {
 	const struct flowside *tapside = TAPFLOW(conn);
 
-	th->source = htons(tapside->fport);
+	th->source = htons(tapside->oport);
 	th->dest = htons(tapside->eport);
 	th->seq = htonl(seq);
 	th->ack_seq = htonl(conn->seq_ack_to_tap);
@@ -913,7 +913,7 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
 				uint32_t seq)
 {
 	const struct flowside *tapside = TAPFLOW(conn);
-	const struct in_addr *src4 = inany_v4(&tapside->faddr);
+	const struct in_addr *src4 = inany_v4(&tapside->oaddr);
 	const struct in_addr *dst4 = inany_v4(&tapside->eaddr);
 	size_t l4len = dlen + sizeof(*th);
 	size_t l3len = l4len + sizeof(*iph);
@@ -957,7 +957,7 @@ static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
 	size_t l4len = dlen + sizeof(*th);
 
 	ip6h->payload_len = htons(l4len);
-	ip6h->saddr = tapside->faddr.a6;
+	ip6h->saddr = tapside->oaddr.a6;
 	ip6h->daddr = tapside->eaddr.a6;
 
 	ip6h->hop_limit = 255;
@@ -992,7 +992,7 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
 			       const uint16_t *check, uint32_t seq)
 {
 	const struct flowside *tapside = TAPFLOW(conn);
-	const struct in_addr *a4 = inany_v4(&tapside->faddr);
+	const struct in_addr *a4 = inany_v4(&tapside->oaddr);
 
 	if (a4) {
 		return tcp_fill_headers4(conn, iov[TCP_IOV_TAP].iov_base,
@@ -1417,15 +1417,15 @@ static void tcp_bind_outbound(const struct ctx *c,
 	socklen_t sl;
 
 
-	pif_sockaddr(c, &bind_sa, &sl, PIF_HOST, &tgt->faddr, tgt->fport);
-	if (!inany_is_unspecified(&tgt->faddr) || tgt->fport) {
+	pif_sockaddr(c, &bind_sa, &sl, PIF_HOST, &tgt->oaddr, tgt->oport);
+	if (!inany_is_unspecified(&tgt->oaddr) || tgt->oport) {
 		if (bind(s, &bind_sa.sa, sl)) {
 			char sstr[INANY_ADDRSTRLEN];
 
 			flow_dbg(conn,
 				 "Can't bind TCP outbound socket to %s:%hu: %s",
-				 inany_ntop(&tgt->faddr, sstr, sizeof(sstr)),
-				 tgt->fport, strerror(errno));
+				 inany_ntop(&tgt->oaddr, sstr, sizeof(sstr)),
+				 tgt->oport, strerror(errno));
 		}
 	}
 
@@ -1497,12 +1497,12 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
 
 	if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0 ||
-	    !inany_is_unicast(&ini->faddr) || ini->fport == 0) {
+	    !inany_is_unicast(&ini->oaddr) || ini->oport == 0) {
 		char sstr[INANY_ADDRSTRLEN], dstr[INANY_ADDRSTRLEN];
 
 		debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
 		      inany_ntop(&ini->eaddr, sstr, sizeof(sstr)), ini->eport,
-		      inany_ntop(&ini->faddr, dstr, sizeof(dstr)), ini->fport);
+		      inany_ntop(&ini->oaddr, dstr, sizeof(dstr)), ini->oport);
 		goto cancel;
 	}
 
@@ -2100,7 +2100,8 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 		goto cancel;
 
 	/* FIXME: When listening port has a specific bound address, record that
-	 * as the forwarding address */
+	 * as our address
+	 */
 	ini = flow_initiate_sa(flow, ref.tcp_listen.pif, &sa,
 			       ref.tcp_listen.port);
 
diff --git a/tcp_internal.h b/tcp_internal.h
index 8b60aabc..aa8bb64f 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -44,7 +44,7 @@
 #define TAPFLOW(conn_)	(&((conn_)->f.side[TAPSIDE(conn_)]))
 #define TAP_SIDX(conn_)	(FLOW_SIDX((conn_), TAPSIDE(conn_)))
 
-#define CONN_V4(conn)		(!!inany_v4(&TAPFLOW(conn)->faddr))
+#define CONN_V4(conn)		(!!inany_v4(&TAPFLOW(conn)->oaddr))
 #define CONN_V6(conn)		(!CONN_V4(conn))
 
 /*
diff --git a/udp.c b/udp.c
index 77312572..57dcc667 100644
--- a/udp.c
+++ b/udp.c
@@ -321,7 +321,7 @@ static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
 static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
 			      const struct flowside *toside, size_t dlen)
 {
-	const struct in_addr *src = inany_v4(&toside->faddr);
+	const struct in_addr *src = inany_v4(&toside->oaddr);
 	const struct in_addr *dst = inany_v4(&toside->eaddr);
 	size_t l4len = dlen + sizeof(bp->uh);
 	size_t l3len = l4len + sizeof(*ip4h);
@@ -333,7 +333,7 @@ static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
 	ip4h->saddr = src->s_addr;
 	ip4h->check = csum_ip4_header(l3len, IPPROTO_UDP, *src, *dst);
 
-	bp->uh.source = htons(toside->fport);
+	bp->uh.source = htons(toside->oport);
 	bp->uh.dest = htons(toside->eport);
 	bp->uh.len = htons(l4len);
 	csum_udp4(&bp->uh, *src, *dst, bp->data, dlen);
@@ -357,15 +357,15 @@ static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
 
 	ip6h->payload_len = htons(l4len);
 	ip6h->daddr = toside->eaddr.a6;
-	ip6h->saddr = toside->faddr.a6;
+	ip6h->saddr = toside->oaddr.a6;
 	ip6h->version = 6;
 	ip6h->nexthdr = IPPROTO_UDP;
 	ip6h->hop_limit = 255;
 
-	bp->uh.source = htons(toside->fport);
+	bp->uh.source = htons(toside->oport);
 	bp->uh.dest = htons(toside->eport);
 	bp->uh.len = ip6h->payload_len;
-	csum_udp6(&bp->uh, &toside->faddr.a6, &toside->eaddr.a6, bp->data, dlen);
+	csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6, bp->data, dlen);
 
 	return l4len;
 }
@@ -384,7 +384,7 @@ static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
 	struct udp_meta_t *bm = &udp_meta[idx];
 	size_t l4len;
 
-	if (!inany_v4(&toside->eaddr) || !inany_v4(&toside->faddr)) {
+	if (!inany_v4(&toside->eaddr) || !inany_v4(&toside->oaddr)) {
 		l4len = udp_update_hdr6(&bm->ip6h, bp, toside, mmh[idx].msg_len);
 		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) +
 					  sizeof(udp6_eth_hdr));
-- 
2.46.0
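
As an aside on the naming this patch settles on, the following is a minimal sketch in C of how the "endpoint" and "our" halves of a flow side relate to packet source and destination. It is not code from passt: struct flowside4, side_from_rx() and side_to_tx() are hypothetical, IPv4-only stand-ins (the real struct flowside uses union inany_addr and in_port_t), and addresses are plain host byte order integers. The direction convention follows what the diff shows, for example tcp_fill_header() using oport as the TCP source and eport as the destination for packets sent towards the tap side.

```c
#include <stdint.h>

/* Simplified, hypothetical stand-in for struct flowside: IPv4 only */
struct flowside4 {
	uint32_t oaddr;		/* our address (local from passt's PoV) */
	uint32_t eaddr;		/* endpoint address (remote from passt's PoV) */
	uint16_t oport;		/* our port */
	uint16_t eport;		/* endpoint port */
};

/* Fill a side from a packet received from the endpoint: the packet's source
 * is the endpoint, its destination is the address we are answering as.
 */
static struct flowside4 side_from_rx(uint32_t saddr, uint16_t sport,
				     uint32_t daddr, uint16_t dport)
{
	struct flowside4 side = {
		.oaddr = daddr, .eaddr = saddr,
		.oport = dport, .eport = sport,
	};

	return side;
}

/* Derive header fields for a packet sent to the endpoint: roles swap, so
 * "our" address and port become the source.
 */
static void side_to_tx(const struct flowside4 *side,
		       uint32_t *saddr, uint16_t *sport,
		       uint32_t *daddr, uint16_t *dport)
{
	*saddr = side->oaddr;
	*sport = side->oport;
	*daddr = side->eaddr;
	*dport = side->eport;
}
```

Seen this way, "our address" is simply whichever address passt presents on that interface, regardless of whether anything is being forwarded, which is the confusion the old "forwarding address" name invited.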

Stefano Brivio

18 Aug

5:44 p.m.

New subject: [PATCH 01/22] treewide: Use "our address" instead of "forwarding address"

On Fri, 16 Aug 2024 15:39:42 +1000 David Gibson wrote:

...

 /**
@@ -347,7 +347,7 @@ static void flow_initiate_(union flow *flow, uint8_t pif)
  * flow_initiate_af() - Move flow to INI, setting INISIDE details
  * @flow:	Flow to change state
  * @pif:	pif of the initiating side
- * @af:		Address family of @eaddr and @faddr
+ * @af:		Address family of @eaddr and @oaddr

Pre-existing, but this made me realise that flow_initiate_af() doesn't
actually take @eaddr and @faddr at all (it's @saddr and @daddr instead).

-- 
Stefano

David Gibson

19 Aug

3:28 a.m.

New subject: [PATCH 01/22] treewide: Use "our address" instead of "forwarding address"

On Sun, Aug 18, 2024 at 05:44:51PM +0200, Stefano Brivio wrote:

...

On Fri, 16 Aug 2024 15:39:42 +1000 David Gibson wrote:

...

 /**
@@ -347,7 +347,7 @@ static void flow_initiate_(union flow *flow, uint8_t pif)
  * flow_initiate_af() - Move flow to INI, setting INISIDE details
  * @flow:	Flow to change state
  * @pif:	pif of the initiating side
- * @af:		Address family of @eaddr and @faddr
+ * @af:		Address family of @eaddr and @oaddr

Pre-existing, but this made me realise that flow_initiate_af() doesn't actually take @eaddr and @faddr at all (it's @saddr and @daddr instead).

Oops, yes. I've folded a fix for that into this patch.

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson

David Gibson

16 Aug

7:39 a.m.

New subject: [PATCH 02/22] util: Helper for formatting MAC addresses

There are a couple of places where we somewhat messily open code formatting
an Ethernet like MAC address for display. Add an eth_ntop() helper for this.

Signed-off-by: David Gibson
---
 conf.c |  7 +++----
 dhcp.c |  5 ++---
 util.c | 19 +++++++++++++++++++
 util.h |  3 +++
 4 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/conf.c b/conf.c
index ed097bdc..830f91a6 100644
--- a/conf.c
+++ b/conf.c
@@ -921,7 +921,8 @@ pasta_opts:
  */
 static void conf_print(const struct ctx *c)
 {
-	char buf4[INET_ADDRSTRLEN], buf6[INET6_ADDRSTRLEN], ifn[IFNAMSIZ];
+	char buf4[INET_ADDRSTRLEN], buf6[INET6_ADDRSTRLEN];
+	char bufmac[ETH_ADDRSTRLEN], ifn[IFNAMSIZ];
 	int i;

 	info("Template interface: %s%s%s%s%s",
@@ -955,9 +956,7 @@ static void conf_print(const struct ctx *c)
 		info("Namespace interface: %s", c->pasta_ifn);

 	info("MAC:");
-	info(" host: %02x:%02x:%02x:%02x:%02x:%02x",
-	     c->mac[0], c->mac[1], c->mac[2],
-	     c->mac[3], c->mac[4], c->mac[5]);
+	info(" host: %s", eth_ntop(c->mac, bufmac, sizeof(bufmac)));

 	if (c->ifi4) {
 		if (!c->no_dhcp) {
diff --git a/dhcp.c b/dhcp.c
index aa9f59da..acc5b03e 100644
--- a/dhcp.c
+++ b/dhcp.c
@@ -276,6 +276,7 @@ static void opt_set_dns_search(const struct ctx *c, size_t max_len)
 int dhcp(const struct ctx *c, const struct pool *p)
 {
 	size_t mlen, dlen, offset = 0, opt_len, opt_off = 0;
+	char macstr[ETH_ADDRSTRLEN];
 	const struct ethhdr *eh;
 	const struct iphdr *iph;
 	const struct udphdr *uh;
@@ -340,9 +341,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
 		return -1;
 	}

-	info(" from %02x:%02x:%02x:%02x:%02x:%02x",
-	     m->chaddr[0], m->chaddr[1], m->chaddr[2],
-	     m->chaddr[3], m->chaddr[4], m->chaddr[5]);
+	info(" from %s", eth_ntop(m->chaddr, macstr, sizeof(macstr)));

 	m->yiaddr = c->ip4.addr;
 	mask.s_addr = htonl(0xffffffff << (32 - c->ip4.prefix_len));
diff --git a/util.c b/util.c
index 0b414045..892358b1 100644
--- a/util.c
+++ b/util.c
@@ -676,6 +676,25 @@ const char *sockaddr_ntop(const void *sa, char *dst, socklen_t size)
 	return dst;
 }

+/** eth_ntop() - Convert an Ethernet MAC address to text format
+ * @mac: MAC address
+ * @dst: output buffer, minimum ETH_ADDRSTRLEN bytes
+ * @size: size of buffer at @dst
+ *
+ * Return: On success, a non-null pointer to @dst, NULL on failure
+ */
+const char *eth_ntop(const unsigned char *mac, char *dst, size_t size)
+{
+	int len;
+
+	len = snprintf(dst, size, "%02x:%02x:%02x:%02x:%02x:%02x",
+		       mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
+	if (len < 0 || (size_t)len >= size)
+		return NULL;
+
+	return dst;
+}
+
 /** str_ee_origin() - Convert socket extended error origin to a string
  * @ee: Socket extended error structure
  *
diff --git a/util.h b/util.h
index cb4d181c..c1748074 100644
--- a/util.h
+++ b/util.h
@@ -215,9 +215,12 @@ static inline const char *af_name(sa_family_t af)

 #define SOCKADDR_STRLEN MAX(SOCKADDR_INET_STRLEN, SOCKADDR_INET6_STRLEN)

+#define ETH_ADDRSTRLEN (ETH_ALEN * 3)
+
 struct sock_extended_err;

 const char *sockaddr_ntop(const void *sa, char *dst, socklen_t size);
+const char *eth_ntop(const unsigned char *mac, char *dst, size_t size);
 const char *str_ee_origin(const struct sock_extended_err *ee);

 /**
-- 
2.46.0
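For readers who want to try the helper outside the passt tree, here is a minimal standalone sketch. The body of eth_ntop() follows the patch; the local ETH_ALEN definition and the eth_ntop_demo() function are assumptions added for illustration and are not part of the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define ETH_ALEN	6			/* octets in a MAC address */
#define ETH_ADDRSTRLEN	(ETH_ALEN * 3)		/* as defined in the patch */

/* Same logic as the eth_ntop() added to util.c by the patch */
const char *eth_ntop(const unsigned char *mac, char *dst, size_t size)
{
	int len;

	len = snprintf(dst, size, "%02x:%02x:%02x:%02x:%02x:%02x",
		       mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
	if (len < 0 || (size_t)len >= size)
		return NULL;	/* output error, or output truncated */

	return dst;
}

/* Hypothetical demo, not part of the patch */
void eth_ntop_demo(void)
{
	const unsigned char mac[ETH_ALEN] = { 0x52, 0x54, 0x00,
					      0x12, 0x34, 0x56 };
	char buf[ETH_ADDRSTRLEN], small[8];

	/* Large enough buffer: returns dst, holding the formatted address */
	assert(eth_ntop(mac, buf, sizeof(buf)) == buf);
	assert(!strcmp(buf, "52:54:00:12:34:56"));

	/* Too small: snprintf() truncates, and the helper reports NULL */
	assert(!eth_ntop(mac, small, sizeof(small)));
}
```

Callers in the patch pass the result straight to info(); with a buffer of ETH_ADDRSTRLEN bytes the NULL case should not arise.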

Stefano Brivio

18 Aug

5:44 p.m.

New subject: [PATCH 02/22] util: Helper for formatting MAC addresses

On Fri, 16 Aug 2024 15:39:43 +1000 David Gibson wrote:

There are a couple of places where we somewhat messily open code formatting an Ethernet like MAC address for display. Add an eth_ntop() helper for this.

Signed-off-by: David Gibson --- conf.c | 7 +++---- dhcp.c | 5 ++--- util.c | 19 +++++++++++++++++++ util.h | 3 +++ 4 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/conf.c b/conf.c index ed097bdc..830f91a6 100644 --- a/conf.c +++ b/conf.c @@ -921,7 +921,8 @@ pasta_opts: */ static void conf_print(const struct ctx *c) { - char buf4[INET_ADDRSTRLEN], buf6[INET6_ADDRSTRLEN], ifn[IFNAMSIZ]; + char buf4[INET_ADDRSTRLEN], buf6[INET6_ADDRSTRLEN]; + char bufmac[ETH_ADDRSTRLEN], ifn[IFNAMSIZ]; int i;

info("Template interface: %s%s%s%s%s", @@ -955,9 +956,7 @@ static void conf_print(const struct ctx *c) info("Namespace interface: %s", c->pasta_ifn);

info("MAC:"); - info(" host: %02x:%02x:%02x:%02x:%02x:%02x", - c->mac[0], c->mac[1], c->mac[2], - c->mac[3], c->mac[4], c->mac[5]); + info(" host: %s", eth_ntop(c->mac, bufmac, sizeof(bufmac)));

if (c->ifi4) { if (!c->no_dhcp) { diff --git a/dhcp.c b/dhcp.c index aa9f59da..acc5b03e 100644 --- a/dhcp.c +++ b/dhcp.c @@ -276,6 +276,7 @@ static void opt_set_dns_search(const struct ctx *c, size_t max_len) int dhcp(const struct ctx *c, const struct pool *p) { size_t mlen, dlen, offset = 0, opt_len, opt_off = 0; + char macstr[ETH_ADDRSTRLEN]; const struct ethhdr *eh; const struct iphdr *iph; const struct udphdr *uh; @@ -340,9 +341,7 @@ int dhcp(const struct ctx *c, const struct pool *p) return -1; }

- info(" from %02x:%02x:%02x:%02x:%02x:%02x", - m->chaddr[0], m->chaddr[1], m->chaddr[2], - m->chaddr[3], m->chaddr[4], m->chaddr[5]); + info(" from %s", eth_ntop(m->chaddr, macstr, sizeof(macstr)));

m->yiaddr = c->ip4.addr; mask.s_addr = htonl(0xffffffff << (32 - c->ip4.prefix_len)); diff --git a/util.c b/util.c index 0b414045..892358b1 100644 --- a/util.c +++ b/util.c @@ -676,6 +676,25 @@ const char *sockaddr_ntop(const void *sa, char *dst, socklen_t size) return dst; }

+/** eth_ntop() - Convert an Ethernet MAC address to text format + * @mac: MAC address + * @dst: output buffer, minimum ETH_ADDRSTRLEN bytes + * @size: size of buffer at @dst

Nit: s/output/Output, s/size/Size

+ * + * Return: On success, a non-null pointer to @dst, NULL on failure + */ +const char *eth_ntop(const unsigned char *mac, char *dst, size_t size) +{ + int len; + + len = snprintf(dst, size, "%02x:%02x:%02x:%02x:%02x:%02x", + mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]); + if (len < 0 || (size_t)len >= size) + return NULL; + + return dst; +} + /** str_ee_origin() - Convert socket extended error origin to a string * @ee: Socket extended error structure * diff --git a/util.h b/util.h index cb4d181c..c1748074 100644 --- a/util.h +++ b/util.h @@ -215,9 +215,12 @@ static inline const char *af_name(sa_family_t af)

#define SOCKADDR_STRLEN MAX(SOCKADDR_INET_STRLEN, SOCKADDR_INET6_STRLEN)

+#define ETH_ADDRSTRLEN (ETH_ALEN * 3)

The fact that this includes two digits plus separator for all non-last octets of a MAC address, and two digits plus NUL terminator for the last octet, looks a bit subtle to me. Defining this as sizeof("00:11:22:33:44:55") wouldn't scream "off-by-one" as much, to me. Not a strong preference.

-- 
Stefano
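The arithmetic behind the two spellings can be checked at compile time. This is only a sketch: it assumes ETH_ALEN is 6, as in <linux/if_ether.h>, and the names ETH_ADDRSTRLEN_MUL and ETH_ADDRSTRLEN_LIT are made up here to hold the two candidate definitions side by side.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_ALEN 6	/* octets in a MAC address, as in <linux/if_ether.h> */

/* Hypothetical names for the two candidate definitions */
#define ETH_ADDRSTRLEN_MUL	(ETH_ALEN * 3)
#define ETH_ADDRSTRLEN_LIT	sizeof("00:11:22:33:44:55")

/* 6 octets * 2 hex digits + 5 ':' separators + 1 terminating NUL = 18 */
static_assert(ETH_ADDRSTRLEN_MUL == 18, "5 * (2 + 1) + (2 + 1)");
static_assert(ETH_ADDRSTRLEN_LIT == 18, "17 characters plus terminator");
static_assert(ETH_ADDRSTRLEN_MUL == ETH_ADDRSTRLEN_LIT, "both forms agree");
```

Both forms give the same size; the difference is only in how obvious the terminator is to someone reading the header.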

David Gibson

19 Aug

3:29 a.m.

New subject: [PATCH 02/22] util: Helper for formatting MAC addresses

On Sun, Aug 18, 2024 at 05:44:55PM +0200, Stefano Brivio wrote:

On Fri, 16 Aug 2024 15:39:43 +1000 David Gibson wrote:

There are a couple of places where we somewhat messily open code formatting an Ethernet like MAC address for display. Add an eth_ntop() helper for this.

Signed-off-by: David Gibson --- conf.c | 7 +++---- dhcp.c | 5 ++--- util.c | 19 +++++++++++++++++++ util.h | 3 +++ 4 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/conf.c b/conf.c index ed097bdc..830f91a6 100644 --- a/conf.c +++ b/conf.c @@ -921,7 +921,8 @@ pasta_opts: */ static void conf_print(const struct ctx *c) { - char buf4[INET_ADDRSTRLEN], buf6[INET6_ADDRSTRLEN], ifn[IFNAMSIZ]; + char buf4[INET_ADDRSTRLEN], buf6[INET6_ADDRSTRLEN]; + char bufmac[ETH_ADDRSTRLEN], ifn[IFNAMSIZ]; int i;

info("Template interface: %s%s%s%s%s", @@ -955,9 +956,7 @@ static void conf_print(const struct ctx *c) info("Namespace interface: %s", c->pasta_ifn);

info("MAC:"); - info(" host: %02x:%02x:%02x:%02x:%02x:%02x", - c->mac[0], c->mac[1], c->mac[2], - c->mac[3], c->mac[4], c->mac[5]); + info(" host: %s", eth_ntop(c->mac, bufmac, sizeof(bufmac)));

if (c->ifi4) { if (!c->no_dhcp) { diff --git a/dhcp.c b/dhcp.c index aa9f59da..acc5b03e 100644 --- a/dhcp.c +++ b/dhcp.c @@ -276,6 +276,7 @@ static void opt_set_dns_search(const struct ctx *c, size_t max_len) int dhcp(const struct ctx *c, const struct pool *p) { size_t mlen, dlen, offset = 0, opt_len, opt_off = 0; + char macstr[ETH_ADDRSTRLEN]; const struct ethhdr *eh; const struct iphdr *iph; const struct udphdr *uh; @@ -340,9 +341,7 @@ int dhcp(const struct ctx *c, const struct pool *p) return -1; }

- info(" from %02x:%02x:%02x:%02x:%02x:%02x", - m->chaddr[0], m->chaddr[1], m->chaddr[2], - m->chaddr[3], m->chaddr[4], m->chaddr[5]); + info(" from %s", eth_ntop(m->chaddr, macstr, sizeof(macstr)));

m->yiaddr = c->ip4.addr; mask.s_addr = htonl(0xffffffff << (32 - c->ip4.prefix_len)); diff --git a/util.c b/util.c index 0b414045..892358b1 100644 --- a/util.c +++ b/util.c @@ -676,6 +676,25 @@ const char *sockaddr_ntop(const void *sa, char *dst, socklen_t size) return dst; }

+/** eth_ntop() - Convert an Ethernet MAC address to text format + * @mac: MAC address + * @dst: output buffer, minimum ETH_ADDRSTRLEN bytes + * @size: size of buffer at @dst

Nit: s/output/Output, s/size/Size

Fixed.

+ * + * Return: On success, a non-null pointer to @dst, NULL on failure + */ +const char *eth_ntop(const unsigned char *mac, char *dst, size_t size) +{ + int len; + + len = snprintf(dst, size, "%02x:%02x:%02x:%02x:%02x:%02x", + mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]); + if (len < 0 || (size_t)len >= size) + return NULL; + + return dst; +} + /** str_ee_origin() - Convert socket extended error origin to a string * @ee: Socket extended error structure * diff --git a/util.h b/util.h index cb4d181c..c1748074 100644 --- a/util.h +++ b/util.h @@ -215,9 +215,12 @@ static inline const char *af_name(sa_family_t af)

#define SOCKADDR_STRLEN MAX(SOCKADDR_INET_STRLEN, SOCKADDR_INET6_STRLEN)

+#define ETH_ADDRSTRLEN (ETH_ALEN * 3)

The fact that this includes two digits plus separator for all non-last octets of a MAC address, and two digits plus NUL terminator for the last octet, looks a bit subtle to me.

Defining this as sizeof("00:11:22:33:44:55") wouldn't scream "off-by-one" as much, to me. Not a strong preference.

Yeah, that makes sense. Done.

David Gibson

16 Aug

7:39 a.m.

New subject: [PATCH 03/22] treewide: Rename MAC address fields for clarity

c->mac isn't a great name, because it doesn't say whose mac address it is and it's not necessarily obvious in all the contexts we use it. Since this is specifically the address that we (passt/pasta) use on the tap interface, rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac" to be grammatically consistent. Signed-off-by: David Gibson --- arp.c | 4 ++-- conf.c | 10 +++++----- dhcpv6.c | 6 ++++-- ndp.c | 4 ++-- passt.c | 2 +- passt.h | 8 ++++---- pasta.c | 8 ++++---- tap.c | 12 ++++++------ 8 files changed, 28 insertions(+), 26 deletions(-) diff --git a/arp.c b/arp.c index 93b22c5d..53334dac 100644 --- a/arp.c +++ b/arp.c @@ -72,7 +72,7 @@ int arp(const struct ctx *c, const struct pool *p) ah->ar_op = htons(ARPOP_REPLY); memcpy(am->tha, am->sha, sizeof(am->tha)); - memcpy(am->sha, c->mac, sizeof(am->sha)); + memcpy(am->sha, c->our_tap_mac, sizeof(am->sha)); memcpy(swap, am->tip, sizeof(am->tip)); memcpy(am->tip, am->sip, sizeof(am->tip)); @@ -80,7 +80,7 @@ int arp(const struct ctx *c, const struct pool *p) l2len = sizeof(*eh) + sizeof(*ah) + sizeof(*am); memcpy(eh->h_dest, eh->h_source, sizeof(eh->h_dest)); - memcpy(eh->h_source, c->mac, sizeof(eh->h_source)); + memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source)); tap_send_single(c, eh, l2len); diff --git a/conf.c b/conf.c index 830f91a6..750fdc86 100644 --- a/conf.c +++ b/conf.c @@ -956,7 +956,7 @@ static void conf_print(const struct ctx *c) info("Namespace interface: %s", c->pasta_ifn); info("MAC:"); - info(" host: %s", eth_ntop(c->mac, bufmac, sizeof(bufmac))); + info(" host: %s", eth_ntop(c->our_tap_mac, bufmac, sizeof(bufmac))); if (c->ifi4) { if (!c->no_dhcp) { @@ -1289,7 +1289,7 @@ void conf(struct ctx *c, int argc, char **argv) if (c->mode != MODE_PASTA) die("--ns-mac-addr is for pasta mode only"); - parse_mac(c->mac_guest, optarg); + parse_mac(c->guest_mac, optarg); break; case 5: if (c->mode != MODE_PASTA) @@ -1500,7 +1500,7 @@ void conf(struct ctx *c, int argc, char **argv) 
break; case 'M': - parse_mac(c->mac, optarg); + parse_mac(c->our_tap_mac, optarg); break; case 'g': if (inet_pton(AF_INET6, optarg, &c->ip6.gw) && @@ -1629,9 +1629,9 @@ void conf(struct ctx *c, int argc, char **argv) nl_sock_init(c, false); if (!v6_only) - c->ifi4 = conf_ip4(ifi4, &c->ip4, c->mac); + c->ifi4 = conf_ip4(ifi4, &c->ip4, c->our_tap_mac); if (!v4_only) - c->ifi6 = conf_ip6(ifi6, &c->ip6, c->mac); + c->ifi6 = conf_ip6(ifi6, &c->ip6, c->our_tap_mac); if ((!c->ifi4 && !c->ifi6) || (*c->ip4.ifname_out && !c->ifi4) || (*c->ip6.ifname_out && !c->ifi6)) diff --git a/dhcpv6.c b/dhcpv6.c index 7dcca2a7..bbed41dc 100644 --- a/dhcpv6.c +++ b/dhcpv6.c @@ -574,8 +574,10 @@ void dhcpv6_init(const struct ctx *c) resp.server_id.duid_time = duid_time; resp_not_on_link.server_id.duid_time = duid_time; - memcpy(resp.server_id.duid_lladdr, c->mac, sizeof(c->mac)); - memcpy(resp_not_on_link.server_id.duid_lladdr, c->mac, sizeof(c->mac)); + memcpy(resp.server_id.duid_lladdr, + c->our_tap_mac, sizeof(c->our_tap_mac)); + memcpy(resp_not_on_link.server_id.duid_lladdr, + c->our_tap_mac, sizeof(c->our_tap_mac)); resp.ia_addr.addr = c->ip6.addr; } diff --git a/ndp.c b/ndp.c index 6dcb4872..9c0fef4a 100644 --- a/ndp.c +++ b/ndp.c @@ -247,7 +247,7 @@ int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr, memcpy(&na.target_addr, &ns->target_addr, sizeof(na.target_addr)); - memcpy(na.target_l2_addr.mac, c->mac, ETH_ALEN); + memcpy(na.target_l2_addr.mac, c->our_tap_mac, ETH_ALEN); } else if (ih->icmp6_type == RS) { size_t dns_s_len = 0; @@ -331,7 +331,7 @@ int ndp(struct ctx *c, const struct icmp6hdr *ih, const struct in6_addr *saddr, } dns_done: - memcpy(&ra.source_ll.mac, c->mac, ETH_ALEN); + memcpy(&ra.source_ll.mac, c->our_tap_mac, ETH_ALEN); } else { return 1; } diff --git a/passt.c b/passt.c index 4b3c306e..96374831 100644 --- a/passt.c +++ b/passt.c @@ -272,7 +272,7 @@ int main(int argc, char **argv) if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && 
tcp_init(&c))) exit(EXIT_FAILURE); - proto_update_l2_buf(c.mac_guest, c.mac); + proto_update_l2_buf(c.guest_mac, c.our_tap_mac); if (c.ifi4 && !c.no_dhcp) dhcp_init(); diff --git a/passt.h b/passt.h index ef684037..fe3e47d2 100644 --- a/passt.h +++ b/passt.h @@ -172,8 +172,8 @@ struct ip6_ctx { * @epollfd: File descriptor for epoll instance * @fd_tap_listen: File descriptor for listening AF_UNIX socket, if any * @fd_tap: AF_UNIX socket, tuntap device, or pre-opened socket - * @mac: Host MAC address - * @mac_guest: MAC address of guest or namespace, seen or configured + * @our_tap_mac: Pasta/passt's MAC on the tap link + * @guest_mac: MAC address of guest or namespace, seen or configured * @hash_secret: 128-bit secret for siphash functions * @ifi4: Index of template interface for IPv4, 0 if IPv4 disabled * @ip: IPv4 configuration @@ -226,8 +226,8 @@ struct ctx { int epollfd; int fd_tap_listen; int fd_tap; - unsigned char mac[ETH_ALEN]; - unsigned char mac_guest[ETH_ALEN]; + unsigned char our_tap_mac[ETH_ALEN]; + unsigned char guest_mac[ETH_ALEN]; uint64_t hash_secret[2]; unsigned int ifi4; diff --git a/pasta.c b/pasta.c index 615ff7b3..3b4e8ead 100644 --- a/pasta.c +++ b/pasta.c @@ -294,10 +294,10 @@ void pasta_ns_conf(struct ctx *c) strerror(-rc)); /* Get or set MAC in target namespace */ - if (MAC_IS_ZERO(c->mac_guest)) - nl_link_get_mac(nl_sock_ns, c->pasta_ifi, c->mac_guest); + if (MAC_IS_ZERO(c->guest_mac)) + nl_link_get_mac(nl_sock_ns, c->pasta_ifi, c->guest_mac); else - rc = nl_link_set_mac(nl_sock_ns, c->pasta_ifi, c->mac_guest); + rc = nl_link_set_mac(nl_sock_ns, c->pasta_ifi, c->guest_mac); if (rc < 0) die("Couldn't set MAC address in namespace: %s", strerror(-rc)); @@ -367,7 +367,7 @@ void pasta_ns_conf(struct ctx *c) } } - proto_update_l2_buf(c->mac_guest, NULL); + proto_update_l2_buf(c->guest_mac, NULL); } /** diff --git a/tap.c b/tap.c index 87be3a6b..852d8376 100644 --- a/tap.c +++ b/tap.c @@ -118,8 +118,8 @@ static void *tap_push_l2h(const struct ctx 
*c, void *buf, uint16_t proto) struct ethhdr *eh = (struct ethhdr *)buf; /* TODO: ARP table lookup */ - memcpy(eh->h_dest, c->mac_guest, ETH_ALEN); - memcpy(eh->h_source, c->mac, ETH_ALEN); + memcpy(eh->h_dest, c->guest_mac, ETH_ALEN); + memcpy(eh->h_source, c->our_tap_mac, ETH_ALEN); eh->h_proto = ntohs(proto); return eh + 1; } @@ -946,9 +946,9 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p) eh = (struct ethhdr *)p; - if (memcmp(c->mac_guest, eh->h_source, ETH_ALEN)) { - memcpy(c->mac_guest, eh->h_source, ETH_ALEN); - proto_update_l2_buf(c->mac_guest, NULL); + if (memcmp(c->guest_mac, eh->h_source, ETH_ALEN)) { + memcpy(c->guest_mac, eh->h_source, ETH_ALEN); + proto_update_l2_buf(c->guest_mac, NULL); } switch (ntohs(eh->h_proto)) { @@ -1337,6 +1337,6 @@ void tap_sock_init(struct ctx *c) * sends us packets. Use the broadcast address so that our * first packets will reach it. */ - memset(&c->mac_guest, 0xff, sizeof(c->mac_guest)); + memset(&c->guest_mac, 0xff, sizeof(c->guest_mac)); } } -- 2.46.0

Stefano Brivio

18 Aug

5:45 p.m.

New subject: [PATCH 03/22] treewide: Rename MAC address fields for clarity

On Fri, 16 Aug 2024 15:39:44 +1000 David Gibson wrote:

c->mac isn't a great name, because it doesn't say whose mac address it is and it's not necessarily obvious in all the contexts we use it. Since this is specifically the address that we (passt/pasta) use on the tap interface, rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac" to be grammatically consistent.

Wouldn't "our_mac" suffice? Even the day we get to support other types of link (well, "tap" for a guest is already not a tap, I know...), or especially multiple links at the same time, I guess we will still want to use a single MAC address. -- Stefano

David Gibson

19 Aug

3:36 a.m.

New subject: [PATCH 03/22] treewide: Rename MAC address fields for clarity

On Sun, Aug 18, 2024 at 05:45:00PM +0200, Stefano Brivio wrote:

On Fri, 16 Aug 2024 15:39:44 +1000 David Gibson wrote:

c->mac isn't a great name, because it doesn't say whose mac address it is and it's not necessarily obvious in all the contexts we use it. Since this is specifically the address that we (passt/pasta) use on the tap interface, rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac" to be grammatically consistent.

Wouldn't "our_mac" suffice?

Maybe. This is supposed to emphasise that this is used on PIF_TAP - we also (usually) have a MAC address on the host interfaces, though we don't really need to care about it.

Even the day we get to support other types of link (well, "tap" for a guest is already not a tap, I know...), or especially multiple links at the same time, I guess we will still want to use a single MAC address.


David Gibson

16 Aug

7:39 a.m.

New subject: [PATCH 04/22] treewide: Use struct assignment instead of memcpy() for IP addresses

We rely on C11 already, so we can use clearer and more type-checkable struct
assignment instead of memcpy() for copying IP addresses around.

This exposes some "pointer could be const" warnings from cppcheck, so address
those too.

Signed-off-by: David Gibson
---
 conf.c   | 12 ++++++------
 dhcpv6.c | 10 ++++++----
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/conf.c b/conf.c
index 750fdc86..9b05afeb 100644
--- a/conf.c
+++ b/conf.c
@@ -389,14 +389,14 @@ static void add_dns6(struct ctx *c,
 	/* Guest or container can only access local addresses via redirect */
 	if (IN6_IS_ADDR_LOOPBACK(addr)) {
 		if (!c->no_map_gw) {
-			memcpy(*conf, &c->ip6.gw, sizeof(**conf));
+			**conf = c->ip6.gw;
 			(*conf)++;

 			if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match))
-				memcpy(&c->ip6.dns_match, addr, sizeof(*addr));
+				c->ip6.dns_match = *addr;
 		}
 	} else {
-		memcpy(*conf, addr, sizeof(**conf));
+		**conf = *addr;
 		(*conf)++;
 	}

@@ -632,7 +632,7 @@ static unsigned int conf_ip4(unsigned int ifi,
 		ip4->prefix_len = 32;
 	}

-	memcpy(&ip4->addr_seen, &ip4->addr, sizeof(ip4->addr_seen));
+	ip4->addr_seen = ip4->addr;

 	if (MAC_IS_ZERO(mac)) {
 		int rc = nl_link_get_mac(nl_sock, ifi, mac);
@@ -693,8 +693,8 @@ static unsigned int conf_ip6(unsigned int ifi,
 		return 0;
 	}

-	memcpy(&ip6->addr_seen, &ip6->addr, sizeof(ip6->addr));
-	memcpy(&ip6->addr_ll_seen, &ip6->addr_ll, sizeof(ip6->addr_ll));
+	ip6->addr_seen = ip6->addr;
+	ip6->addr_ll_seen = ip6->addr_ll;

 	if (MAC_IS_ZERO(mac)) {
 		rc = nl_link_get_mac(nl_sock, ifi, mac);
diff --git a/dhcpv6.c b/dhcpv6.c
index bbed41dc..87b3c3eb 100644
--- a/dhcpv6.c
+++ b/dhcpv6.c
@@ -298,7 +298,8 @@ static struct opt_hdr *dhcpv6_ia_notonlink(const struct pool *p,
 {
 	char buf[INET6_ADDRSTRLEN];
 	struct in6_addr req_addr;
-	struct opt_hdr *ia, *h;
+	const struct opt_hdr *h;
+	struct opt_hdr *ia;
 	size_t offset;
 	int ia_type;

@@ -312,12 +313,13 @@ ia_ta:
 		offset += sizeof(struct opt_ia_na);

 		while ((h = dhcpv6_opt(p, &offset, OPT_IAAADR))) {
-			struct opt_ia_addr *opt_addr = (struct opt_ia_addr *)h;
+			const struct opt_ia_addr *opt_addr
+				= (const struct opt_ia_addr *)h;

 			if (ntohs(h->l) != OPT_VSIZE(ia_addr))
 				return NULL;

-			memcpy(&req_addr, &opt_addr->addr, sizeof(req_addr));
+			req_addr = opt_addr->addr;
 			if (!IN6_ARE_ADDR_EQUAL(la, &req_addr)) {
 				info("DHCPv6: requested address %s not on link",
 				     inet_ntop(AF_INET6, &req_addr,
@@ -363,7 +365,7 @@ static size_t dhcpv6_dns_fill(const struct ctx *c, char *buf, int offset)
 			srv->hdr.l = 0;
 		}

-		memcpy(&srv->addr[i], &c->ip6.dns[i], sizeof(srv->addr[i]));
+		srv->addr[i] = c->ip6.dns[i];
 		srv->hdr.l += sizeof(srv->addr[i]);
 		offset += sizeof(srv->addr[i]);
 	}
-- 
2.46.0
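To illustrate the "more type-checkable" point outside the passt tree, here is a small sketch. The struct ip6_demo type and the function names are hypothetical stand-ins, not passt code; only struct in6_addr and memcpy() are real.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for passt's IPv6 context */
struct ip6_demo {
	struct in6_addr addr;
	struct in6_addr addr_seen;
};

/* memcpy() style: still compiles if the size or a pointer type is wrong */
void copy_with_memcpy(struct ip6_demo *ip6)
{
	memcpy(&ip6->addr_seen, &ip6->addr, sizeof(ip6->addr_seen));
}

/* Assignment style: the compiler checks both sides are struct in6_addr */
void copy_with_assignment(struct ip6_demo *ip6)
{
	ip6->addr_seen = ip6->addr;
}

/* Hypothetical demo: both styles copy the same 16 bytes */
void struct_copy_demo(void)
{
	struct ip6_demo a = { .addr = IN6ADDR_LOOPBACK_INIT }, b = a;

	copy_with_memcpy(&a);
	copy_with_assignment(&b);

	assert(!memcmp(&a.addr_seen, &a.addr, sizeof(a.addr)));
	assert(!memcmp(&a.addr_seen, &b.addr_seen, sizeof(a.addr_seen)));
}
```

With the assignment form, accidentally mixing a struct in_addr and a struct in6_addr becomes a compile error, whereas memcpy() would copy whatever size it was given.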

Stefano Brivio

18 Aug

5:45 p.m.

New subject: [PATCH 04/22] treewide: Use struct assignment instead of memcpy() for IP addresses

On Fri, 16 Aug 2024 15:39:45 +1000 David Gibson wrote:

We rely on C11 already, so we can use clearer and more type-checkable struct assignment instead of memcpy() for copying IP addresses around.

This exposes some "pointer could be const" warnings from cppcheck, so address those too.

Signed-off-by: David Gibson --- conf.c | 12 ++++++------ dhcpv6.c | 10 ++++++---- 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/conf.c b/conf.c index 750fdc86..9b05afeb 100644 --- a/conf.c +++ b/conf.c @@ -389,14 +389,14 @@ static void add_dns6(struct ctx *c, /* Guest or container can only access local addresses via redirect */ if (IN6_IS_ADDR_LOOPBACK(addr)) { if (!c->no_map_gw) { - memcpy(*conf, &c->ip6.gw, sizeof(**conf)); + **conf = c->ip6.gw; (*conf)++;

if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match)) - memcpy(&c->ip6.dns_match, addr, sizeof(*addr)); + c->ip6.dns_match = *addr; } } else { - memcpy(*conf, addr, sizeof(**conf)); + **conf = *addr; (*conf)++; }

@@ -632,7 +632,7 @@ static unsigned int conf_ip4(unsigned int ifi, ip4->prefix_len = 32; }

- memcpy(&ip4->addr_seen, &ip4->addr, sizeof(ip4->addr_seen)); + ip4->addr_seen = ip4->addr;

if (MAC_IS_ZERO(mac)) { int rc = nl_link_get_mac(nl_sock, ifi, mac); @@ -693,8 +693,8 @@ static unsigned int conf_ip6(unsigned int ifi, return 0; }

- memcpy(&ip6->addr_seen, &ip6->addr, sizeof(ip6->addr)); - memcpy(&ip6->addr_ll_seen, &ip6->addr_ll, sizeof(ip6->addr_ll)); + ip6->addr_seen = ip6->addr; + ip6->addr_ll_seen = ip6->addr_ll;

if (MAC_IS_ZERO(mac)) { rc = nl_link_get_mac(nl_sock, ifi, mac); diff --git a/dhcpv6.c b/dhcpv6.c index bbed41dc..87b3c3eb 100644 --- a/dhcpv6.c +++ b/dhcpv6.c @@ -298,7 +298,8 @@ static struct opt_hdr *dhcpv6_ia_notonlink(const struct pool *p, { char buf[INET6_ADDRSTRLEN]; struct in6_addr req_addr; - struct opt_hdr *ia, *h; + const struct opt_hdr *h; + struct opt_hdr *ia; size_t offset; int ia_type;

@@ -312,12 +313,13 @@ ia_ta: offset += sizeof(struct opt_ia_na);

while ((h = dhcpv6_opt(p, &offset, OPT_IAAADR))) { - struct opt_ia_addr *opt_addr = (struct opt_ia_addr *)h; + const struct opt_ia_addr *opt_addr + = (const struct opt_ia_addr *)h;

Nit: the assignment could go on its own line, then?

if (ntohs(h->l) != OPT_VSIZE(ia_addr)) return NULL;

- memcpy(&req_addr, &opt_addr->addr, sizeof(req_addr)); + req_addr = opt_addr->addr; if (!IN6_ARE_ADDR_EQUAL(la, &req_addr)) { info("DHCPv6: requested address %s not on link", inet_ntop(AF_INET6, &req_addr, @@ -363,7 +365,7 @@ static size_t dhcpv6_dns_fill(const struct ctx *c, char *buf, int offset) srv->hdr.l = 0; }

- memcpy(&srv->addr[i], &c->ip6.dns[i], sizeof(srv->addr[i])); + srv->addr[i] = c->ip6.dns[i]; srv->hdr.l += sizeof(srv->addr[i]); offset += sizeof(srv->addr[i]); }

I only reviewed up to this patch so far. -- Stefano

David Gibson

19 Aug

3:38 a.m.

New subject: [PATCH 04/22] treewide: Use struct assignment instead of memcpy() for IP addresses

On Sun, Aug 18, 2024 at 05:45:03PM +0200, Stefano Brivio wrote:

On Fri, 16 Aug 2024 15:39:45 +1000 David Gibson wrote:

We rely on C11 already, so we can use clearer and more type-checkable struct assignment instead of memcpy() for copying IP addresses around.

This exposes some "pointer could be const" warnings from cppcheck, so address those too.

Signed-off-by: David Gibson --- conf.c | 12 ++++++------ dhcpv6.c | 10 ++++++---- 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/conf.c b/conf.c index 750fdc86..9b05afeb 100644 --- a/conf.c +++ b/conf.c @@ -389,14 +389,14 @@ static void add_dns6(struct ctx *c, /* Guest or container can only access local addresses via redirect */ if (IN6_IS_ADDR_LOOPBACK(addr)) { if (!c->no_map_gw) { - memcpy(*conf, &c->ip6.gw, sizeof(**conf)); + **conf = c->ip6.gw; (*conf)++;

if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match)) - memcpy(&c->ip6.dns_match, addr, sizeof(*addr)); + c->ip6.dns_match = *addr; } } else { - memcpy(*conf, addr, sizeof(**conf)); + **conf = *addr; (*conf)++; }

@@ -632,7 +632,7 @@ static unsigned int conf_ip4(unsigned int ifi, ip4->prefix_len = 32; }

- memcpy(&ip4->addr_seen, &ip4->addr, sizeof(ip4->addr_seen)); + ip4->addr_seen = ip4->addr;

if (MAC_IS_ZERO(mac)) { int rc = nl_link_get_mac(nl_sock, ifi, mac); @@ -693,8 +693,8 @@ static unsigned int conf_ip6(unsigned int ifi, return 0; }

- memcpy(&ip6->addr_seen, &ip6->addr, sizeof(ip6->addr)); - memcpy(&ip6->addr_ll_seen, &ip6->addr_ll, sizeof(ip6->addr_ll)); + ip6->addr_seen = ip6->addr; + ip6->addr_ll_seen = ip6->addr_ll;

if (MAC_IS_ZERO(mac)) { rc = nl_link_get_mac(nl_sock, ifi, mac); diff --git a/dhcpv6.c b/dhcpv6.c index bbed41dc..87b3c3eb 100644 --- a/dhcpv6.c +++ b/dhcpv6.c @@ -298,7 +298,8 @@ static struct opt_hdr *dhcpv6_ia_notonlink(const struct pool *p, { char buf[INET6_ADDRSTRLEN]; struct in6_addr req_addr; - struct opt_hdr *ia, *h; + const struct opt_hdr *h; + struct opt_hdr *ia; size_t offset; int ia_type;

@@ -312,12 +313,13 @@ ia_ta: offset += sizeof(struct opt_ia_na);

while ((h = dhcpv6_opt(p, &offset, OPT_IAAADR))) { - struct opt_ia_addr *opt_addr = (struct opt_ia_addr *)h; + const struct opt_ia_addr *opt_addr + = (const struct opt_ia_addr *)h;

Nit: the assignment could go on its own line, then?

Good point, done.

if (ntohs(h->l) != OPT_VSIZE(ia_addr)) return NULL;

- memcpy(&req_addr, &opt_addr->addr, sizeof(req_addr)); + req_addr = opt_addr->addr; if (!IN6_ARE_ADDR_EQUAL(la, &req_addr)) { info("DHCPv6: requested address %s not on link", inet_ntop(AF_INET6, &req_addr, @@ -363,7 +365,7 @@ static size_t dhcpv6_dns_fill(const struct ctx *c, char *buf, int offset) srv->hdr.l = 0; }

- memcpy(&srv->addr[i], &c->ip6.dns[i], sizeof(srv->addr[i])); + srv->addr[i] = c->ip6.dns[i]; srv->hdr.l += sizeof(srv->addr[i]); offset += sizeof(srv->addr[i]); }

I only reviewed up to this patch so far.


David Gibson

16 Aug

7:39 a.m.

New subject: [PATCH 05/22] conf: Use array indices rather than pointers for DNS array slots

Currently add_dns[46]() take a somewhat awkward double pointer to the entry in the c->ip[46].dns array to update. It turns out to be easier to work with indices into that array instead.

This diff does add some lines, but it's comments, and will allow some future code reductions.

Signed-off-by: David Gibson
---
 conf.c | 73 +++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 41 insertions(+), 32 deletions(-)

diff --git a/conf.c b/conf.c
index 9b05afeb..2a52bc32 100644
--- a/conf.c
+++ b/conf.c
@@ -354,54 +354,65 @@ bind_all_fail:
  * add_dns4() - Possibly add the IPv4 address of a DNS resolver to configuration
  * @c:		Execution context
  * @addr:	Address found in /etc/resolv.conf
- * @conf:	Pointer to reference of current entry in array of IPv4 resolvers
+ * @idx:	Index of free entry in array of IPv4 resolvers
+ *
+ * Return: Number of entries added (0 or 1)
  */
-static void add_dns4(struct ctx *c, const struct in_addr *addr,
-		     struct in_addr **conf)
+static unsigned add_dns4(struct ctx *c, const struct in_addr *addr,
+			 unsigned idx)
 {
+	unsigned added = 0;
+
 	/* Guest or container can only access local addresses via redirect */
 	if (IN4_IS_ADDR_LOOPBACK(addr)) {
 		if (!c->no_map_gw) {
-			**conf = c->ip4.gw;
-			(*conf)++;
+			c->ip4.dns[idx] = c->ip4.gw;
+			added++;
 
 			if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match))
 				c->ip4.dns_match = c->ip4.gw;
 		}
 	} else {
-		**conf = *addr;
-		(*conf)++;
+		c->ip4.dns[idx] = *addr;
+		added++;
 	}
 
 	if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_host))
 		c->ip4.dns_host = *addr;
+
+	return added;
 }
 
 /**
  * add_dns6() - Possibly add the IPv6 address of a DNS resolver to configuration
  * @c:		Execution context
  * @addr:	Address found in /etc/resolv.conf
- * @conf:	Pointer to reference of current entry in array of IPv6 resolvers
+ * @idx:	Index of free entry in array of IPv6 resolvers
+ *
+ * Return: Number of entries added (0 or 1)
  */
-static void add_dns6(struct ctx *c,
-		     struct in6_addr *addr, struct in6_addr **conf)
+static unsigned add_dns6(struct ctx *c, struct in6_addr *addr, unsigned idx)
 {
+	unsigned added = 0;
+
 	/* Guest or container can only access local addresses via redirect */
 	if (IN6_IS_ADDR_LOOPBACK(addr)) {
 		if (!c->no_map_gw) {
-			**conf = c->ip6.gw;
-			(*conf)++;
+			c->ip6.dns[idx] = c->ip6.gw;
+			added++;
 
 			if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match))
 				c->ip6.dns_match = *addr;
 		}
 	} else {
-		**conf = *addr;
-		(*conf)++;
+		c->ip6.dns[idx] = *addr;
+		added++;
 	}
 
 	if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_host))
 		c->ip6.dns_host = *addr;
+
+	return added;
 }
 
 /**
@@ -410,18 +421,19 @@ static void add_dns6(struct ctx *c,
  */
 static void get_dns(struct ctx *c)
 {
-	struct in6_addr *dns6 = &c->ip6.dns[0], dns6_tmp;
-	struct in_addr *dns4 = &c->ip4.dns[0], dns4_tmp;
 	int dns4_set, dns6_set, dnss_set, dns_set, fd;
+	unsigned dns4_idx = 0, dns6_idx = 0;
 	struct fqdn *s = c->dns_search;
 	struct lineread resolvconf;
+	struct in6_addr dns6_tmp;
+	struct in_addr dns4_tmp;
 	unsigned int added = 0;
 	ssize_t line_len;
 	char *line, *end;
 	const char *p;
 
-	dns4_set = !c->ifi4 || !IN4_IS_ADDR_UNSPECIFIED(dns4);
-	dns6_set = !c->ifi6 || !IN6_IS_ADDR_UNSPECIFIED(dns6);
+	dns4_set = !c->ifi4 || !IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns[0]);
+	dns6_set = !c->ifi6 || !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns[0]);
 	dnss_set = !!*s->n || c->no_dns_search;
 	dns_set = (dns4_set && dns6_set) || c->no_dns;
 
@@ -442,17 +454,15 @@ static void get_dns(struct ctx *c)
 			if (end)
 				*end = 0;
 
-			if (!dns4_set &&
-			    dns4 - &c->ip4.dns[0] < ARRAY_SIZE(c->ip4.dns) - 1
+			if (!dns4_set && dns4_idx < ARRAY_SIZE(c->ip4.dns) - 1
 			    && inet_pton(AF_INET, p + 1, &dns4_tmp)) {
-				add_dns4(c, &dns4_tmp, &dns4);
+				dns4_idx += add_dns4(c, &dns4_tmp, dns4_idx);
 				added++;
 			}
 
-			if (!dns6_set &&
-			    dns6 - &c->ip6.dns[0] < ARRAY_SIZE(c->ip6.dns) - 1
+			if (!dns6_set && dns6_idx < ARRAY_SIZE(c->ip6.dns) - 1
 			    && inet_pton(AF_INET6, p + 1, &dns6_tmp)) {
-				add_dns6(c, &dns6_tmp, &dns6);
+				dns6_idx += add_dns6(c, &dns6_tmp, dns6_idx);
 				added++;
 			}
 		} else if (!dnss_set && strstr(line, "search ") == line &&
@@ -1236,8 +1246,7 @@ void conf(struct ctx *c, int argc, char **argv)
 	bool copy_addrs_opt = false, copy_routes_opt = false;
 	enum fwd_ports_mode fwd_default = FWD_NONE;
 	bool v4_only = false, v6_only = false;
-	struct in6_addr *dns6 = c->ip6.dns;
-	struct in_addr *dns4 = c->ip4.dns;
+	unsigned dns4_idx = 0, dns6_idx = 0;
 	struct fqdn *dnss = c->dns_search;
 	unsigned int ifi4 = 0, ifi6 = 0;
 	const char *logfile = NULL;
@@ -1662,13 +1671,13 @@ void conf(struct ctx *c, int argc, char **argv)
 			if (!strcmp(optarg, "none")) {
 				c->no_dns = 1;
 
-				dns4 = &c->ip4.dns[0];
+				dns4_idx = 0;
 				memset(c->ip4.dns, 0, sizeof(c->ip4.dns));
 				c->ip4.dns[0] = (struct in_addr){ 0 };
 				c->ip4.dns_match = (struct in_addr){ 0 };
 				c->ip4.dns_host = (struct in_addr){ 0 };
 
-				dns6 = &c->ip6.dns[0];
+				dns6_idx = 0;
 				memset(c->ip6.dns, 0, sizeof(c->ip6.dns));
 				c->ip6.dns_match = (struct in6_addr){ 0 };
 				c->ip6.dns_host = (struct in6_addr){ 0 };
@@ -1678,15 +1687,15 @@ void conf(struct ctx *c, int argc, char **argv)
 
 			c->no_dns = 0;
 
-			if (dns4 - &c->ip4.dns[0] < ARRAY_SIZE(c->ip4.dns) &&
+			if (dns4_idx < ARRAY_SIZE(c->ip4.dns) &&
 			    inet_pton(AF_INET, optarg, &dns4_tmp)) {
-				add_dns4(c, &dns4_tmp, &dns4);
+				dns4_idx += add_dns4(c, &dns4_tmp, dns4_idx);
 				continue;
 			}
 
-			if (dns6 - &c->ip6.dns[0] < ARRAY_SIZE(c->ip6.dns) &&
+			if (dns6_idx < ARRAY_SIZE(c->ip6.dns) &&
 			    inet_pton(AF_INET6, optarg, &dns6_tmp)) {
-				add_dns6(c, &dns6_tmp, &dns6);
+				dns6_idx += add_dns6(c, &dns6_tmp, dns6_idx);
 				continue;
 			}
 
-- 
2.46.0
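For readers skimming the diff, the index-based calling convention can be sketched in isolation. This is a minimal standalone illustration, not passt code: struct dns4_table, table_add4() and MAXNS_SKETCH are hypothetical names, and the skip_loopback flag merely stands in for the cases where the real helper declines to add an entry.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>

#define MAXNS_SKETCH	3	/* assumed capacity, illustration only */

/* Hypothetical stand-in for the resolver array kept in the context */
struct dns4_table {
	struct in_addr dns[MAXNS_SKETCH + 1];
};

/* Store @addr at slot @idx, unless it is a loopback address we were asked to
 * skip. Returns the number of entries added (0 or 1), so the caller can
 * advance its own index with "idx += ...".
 */
static unsigned table_add4(struct dns4_table *t, const struct in_addr *addr,
			   unsigned idx, bool skip_loopback)
{
	bool loopback = (ntohl(addr->s_addr) >> 24) == 127;

	if (skip_loopback && loopback)
		return 0;

	t->dns[idx] = *addr;
	return 1;
}
```

As in this patch, the caller of such a helper is still responsible for keeping idx within the array. The returned count also gives the caller an accurate tally of entries actually stored.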

David Gibson

7:39 a.m.

New subject: [PATCH 06/22] conf: More accurately count entries added in get_dns()

get_dns() counts the number of guest DNS servers it adds, and gives an error if it couldn't add any. However, this count ignores the fact that add_dns[46]() may in some cases *not* add an entry. Use the array indices we're already tracking to get an accurate count.

Signed-off-by: David Gibson
---
 conf.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/conf.c b/conf.c
index 2a52bc32..d19013c1 100644
--- a/conf.c
+++ b/conf.c
@@ -427,7 +427,6 @@ static void get_dns(struct ctx *c)
 	struct lineread resolvconf;
 	struct in6_addr dns6_tmp;
 	struct in_addr dns4_tmp;
-	unsigned int added = 0;
 	ssize_t line_len;
 	char *line, *end;
 	const char *p;
@@ -455,16 +454,12 @@ static void get_dns(struct ctx *c)
 				*end = 0;
 
 			if (!dns4_set && dns4_idx < ARRAY_SIZE(c->ip4.dns) - 1
-			    && inet_pton(AF_INET, p + 1, &dns4_tmp)) {
+			    && inet_pton(AF_INET, p + 1, &dns4_tmp))
 				dns4_idx += add_dns4(c, &dns4_tmp, dns4_idx);
-				added++;
-			}
 
 			if (!dns6_set && dns6_idx < ARRAY_SIZE(c->ip6.dns) - 1
-			    && inet_pton(AF_INET6, p + 1, &dns6_tmp)) {
+			    && inet_pton(AF_INET6, p + 1, &dns6_tmp))
 				dns6_idx += add_dns6(c, &dns6_tmp, dns6_idx);
-				added++;
-			}
 		} else if (!dnss_set && strstr(line, "search ") == line &&
 			   s == c->dns_search) {
 			end = strpbrk(line, "\n");
@@ -491,7 +486,7 @@ static void get_dns(struct ctx *c)
 
 out:
 	if (!dns_set) {
-		if (!added)
+		if (!(dns4_idx + dns6_idx))
 			warn("Couldn't get any nameserver address");
 
 		if (c->no_dhcp_dns)
-- 
2.46.0

David Gibson

7:39 a.m.

New subject: [PATCH 07/22] conf: Move DNS array bounds checks into add_dns[46]

Every time we call add_dns[46] we need to first check if there's space in the c->ip[46].dns array for the new entry. We might as well make that check in add_dns[46]() itself.

In fact it looks like the calls in get_dns() had an off by one error, not allowing the last entry of the array to be filled. So, that bug is also fixed by the change.

Signed-off-by: David Gibson
---
 conf.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/conf.c b/conf.c
index d19013c1..dfe417b4 100644
--- a/conf.c
+++ b/conf.c
@@ -363,6 +363,9 @@ static unsigned add_dns4(struct ctx *c, const struct in_addr *addr,
 {
 	unsigned added = 0;
 
+	if (idx >= ARRAY_SIZE(c->ip4.dns))
+		return 0;
+
 	/* Guest or container can only access local addresses via redirect */
 	if (IN4_IS_ADDR_LOOPBACK(addr)) {
 		if (!c->no_map_gw) {
@@ -395,6 +398,9 @@ static unsigned add_dns6(struct ctx *c, struct in6_addr *addr, unsigned idx)
 {
 	unsigned added = 0;
 
+	if (idx >= ARRAY_SIZE(c->ip6.dns))
+		return 0;
+
 	/* Guest or container can only access local addresses via redirect */
 	if (IN6_IS_ADDR_LOOPBACK(addr)) {
 		if (!c->no_map_gw) {
@@ -453,12 +459,10 @@ static void get_dns(struct ctx *c)
 			if (end)
 				*end = 0;
 
-			if (!dns4_set && dns4_idx < ARRAY_SIZE(c->ip4.dns) - 1
-			    && inet_pton(AF_INET, p + 1, &dns4_tmp))
+			if (!dns4_set && inet_pton(AF_INET, p + 1, &dns4_tmp))
 				dns4_idx += add_dns4(c, &dns4_tmp, dns4_idx);
 
-			if (!dns6_set && dns6_idx < ARRAY_SIZE(c->ip6.dns) - 1
-			    && inet_pton(AF_INET6, p + 1, &dns6_tmp))
+			if (!dns6_set && inet_pton(AF_INET6, p + 1, &dns6_tmp))
 				dns6_idx += add_dns6(c, &dns6_tmp, dns6_idx);
 		} else if (!dnss_set && strstr(line, "search ") == line &&
 			   s == c->dns_search) {
@@ -1682,14 +1686,12 @@ void conf(struct ctx *c, int argc, char **argv)
 
 			c->no_dns = 0;
 
-			if (dns4_idx < ARRAY_SIZE(c->ip4.dns) &&
-			    inet_pton(AF_INET, optarg, &dns4_tmp)) {
+			if (inet_pton(AF_INET, optarg, &dns4_tmp)) {
 				dns4_idx += add_dns4(c, &dns4_tmp, dns4_idx);
 				continue;
 			}
 
-			if (dns6_idx < ARRAY_SIZE(c->ip6.dns) &&
-			    inet_pton(AF_INET6, optarg, &dns6_tmp)) {
+			if (inet_pton(AF_INET6, optarg, &dns6_tmp)) {
 				dns6_idx += add_dns6(c, &dns6_tmp, dns6_idx);
 				continue;
 			}
-- 
2.46.0
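The effect of moving the bounds check into the helper can be shown with a small standalone sketch. The names here (struct dns4_slots, slots_add4(), NSLOTS) are hypothetical and the capacity is arbitrary; only the shape of the check mirrors the patch.

```c
#include <netinet/in.h>
#include <stddef.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))
#define NSLOTS		3	/* assumed capacity, illustration only */

/* Hypothetical fixed-size resolver table */
struct dns4_slots {
	struct in_addr dns[NSLOTS];
};

/* Add @addr at @idx if there is room. Returns entries added (0 or 1), so a
 * full table simply stops the caller's index from advancing.
 */
static unsigned slots_add4(struct dns4_slots *s, const struct in_addr *addr,
			   unsigned idx)
{
	if (idx >= ARRAY_SIZE(s->dns))
		return 0;

	s->dns[idx] = *addr;
	return 1;
}
```

With the check in one place, no call site can write past the end of the array, whatever it passes as idx. Note that passt.h documents the dns arrays as zero-terminated; whether one slot should stay reserved for that terminator is a separate question this sketch does not address.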

David Gibson

7:39 a.m.

New subject: [PATCH 08/22] conf: Move adding of a nameserver from resolv.conf into subfunction

get_dns() is already quite deeply nested, and future changes I have in mind will add more complexity. Prepare for this by splitting out the adding of a single nameserver to the configuration into its own function.

Signed-off-by: David Gibson
---
 conf.c | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/conf.c b/conf.c
index dfe417b4..4bb06b78 100644
--- a/conf.c
+++ b/conf.c
@@ -421,6 +421,29 @@ static unsigned add_dns6(struct ctx *c, struct in6_addr *addr, unsigned idx)
 	return added;
 }
 
+/**
+ * add_dns_resolv() - Possibly add ns from host resolv.conf to configuration
+ * @c:		Execution context
+ * @nameserver:	Nameserver address string from /etc/resolv.conf
+ * @idx4:	Pointer to index of current entry in array of IPv4 resolvers
+ * @idx6:	Pointer to index of current entry in array of IPv6 resolvers
+ *
+ * @idx4 or @idx6 may be NULL, in which case resolvers of the corresponding type
+ * are ignored.
+ */
+static void add_dns_resolv(struct ctx *c, const char *nameserver,
+			   unsigned *idx4, unsigned *idx6)
+{
+	struct in6_addr ns6;
+	struct in_addr ns4;
+
+	if (idx4 && inet_pton(AF_INET, nameserver, &ns4))
+		*idx4 += add_dns4(c, &ns4, *idx4);
+
+	if (idx6 && inet_pton(AF_INET6, nameserver, &ns6))
+		*idx6 += add_dns6(c, &ns6, *idx6);
+}
+
 /**
  * get_dns() - Get nameserver addresses from local /etc/resolv.conf
  * @c:		Execution context
@@ -431,8 +454,6 @@ static void get_dns(struct ctx *c)
 	unsigned dns4_idx = 0, dns6_idx = 0;
 	struct fqdn *s = c->dns_search;
 	struct lineread resolvconf;
-	struct in6_addr dns6_tmp;
-	struct in_addr dns4_tmp;
 	ssize_t line_len;
 	char *line, *end;
 	const char *p;
@@ -459,11 +480,9 @@ static void get_dns(struct ctx *c)
 			if (end)
 				*end = 0;
 
-			if (!dns4_set && inet_pton(AF_INET, p + 1, &dns4_tmp))
-				dns4_idx += add_dns4(c, &dns4_tmp, dns4_idx);
-
-			if (!dns6_set && inet_pton(AF_INET6, p + 1, &dns6_tmp))
-				dns6_idx += add_dns6(c, &dns6_tmp, dns6_idx);
+			add_dns_resolv(c, p + 1,
+				       dns4_set ? NULL : &dns4_idx,
+				       dns6_set ? NULL : &dns6_idx);
 		} else if (!dnss_set && strstr(line, "search ") == line &&
 			   s == c->dns_search) {
 			end = strpbrk(line, "\n");
-- 
2.46.0
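The "NULL means ignore this address family" convention used by add_dns_resolv() can be tried out on its own. The sketch below is an assumption-laden illustration: count_ns() is a hypothetical function that only counts parsed addresses instead of storing them, but it uses the same inet_pton() dispatch.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical: parse @ns and bump the counter of the matching family.
 * A NULL counter means addresses of that family are ignored.
 */
static void count_ns(const char *ns, unsigned *n4, unsigned *n6)
{
	struct in6_addr a6;
	struct in_addr a4;

	if (n4 && inet_pton(AF_INET, ns, &a4) == 1)
		(*n4)++;

	if (n6 && inet_pton(AF_INET6, ns, &a6) == 1)
		(*n6)++;
}
```

A caller that already has, say, IPv4 resolvers configured would pass NULL for n4, which is how get_dns() uses dns4_set and dns6_set after this patch.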

David Gibson

7:39 a.m.

New subject: [PATCH 09/22] conf: Correct setting of dns_match address in add_dns6()

add_dns6() (but not add_dns4()) has a bug setting dns_match: it sets it to the given address, rather than the gateway address. This is doubly wrong:
 - We've just established the given address is a host loopback address the guest can't access
 - We've just set ip6.dns[] to tell the guest to use the gateway address, so it won't use the dns_match address we're setting

Correct this to use the gateway address, like IPv4.

Signed-off-by: David Gibson
---
 conf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/conf.c b/conf.c
index 4bb06b78..bf61c143 100644
--- a/conf.c
+++ b/conf.c
@@ -408,7 +408,7 @@ static unsigned add_dns6(struct ctx *c, struct in6_addr *addr, unsigned idx)
 			added++;
 
 			if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match))
-				c->ip6.dns_match = *addr;
+				c->ip6.dns_match = c->ip6.gw;
 		}
 	} else {
 		c->ip6.dns[idx] = *addr;
-- 
2.46.0

David Gibson

7:39 a.m.

New subject: [PATCH 10/22] conf: Treat --dns addresses as guest visible addresses

Although it's not 100% explicit in the man page, addresses given to the --dns option are intended to be addresses as seen by the guest. This differs from addresses taken from the host's /etc/resolv.conf, which must be translated to guest-accessible versions in some cases.

Our implementation is currently inconsistent on this: when using --dns-forward, you must usually also give --dns with the matching address, which is meaningful only in the guest's address view. However if you give --dns with a loopback address, it will be translated like a host view address.

Move the remapping logic for DNS addresses out of add_dns4() and add_dns6() into add_dns_resolv() so that it is only applied for host nameserver addresses, not for nameservers given explicitly with --dns.

Signed-off-by: David Gibson
---
 conf.c  | 88 ++++++++++++++++++++++++++++-----------------------------
 passt.1 | 14 +++++----
 2 files changed, 52 insertions(+), 50 deletions(-)

diff --git a/conf.c b/conf.c
index bf61c143..3c102bcf 100644
--- a/conf.c
+++ b/conf.c
@@ -353,7 +353,7 @@ bind_all_fail:
 /**
  * add_dns4() - Possibly add the IPv4 address of a DNS resolver to configuration
  * @c:		Execution context
- * @addr:	Address found in /etc/resolv.conf
+ * @addr:	Guest nameserver IPv4 address
  * @idx:	Index of free entry in array of IPv4 resolvers
  *
  * Return: Number of entries added (0 or 1)
@@ -361,64 +361,29 @@ bind_all_fail:
 static unsigned add_dns4(struct ctx *c, const struct in_addr *addr,
 			 unsigned idx)
 {
-	unsigned added = 0;
-
 	if (idx >= ARRAY_SIZE(c->ip4.dns))
 		return 0;
 
-	/* Guest or container can only access local addresses via redirect */
-	if (IN4_IS_ADDR_LOOPBACK(addr)) {
-		if (!c->no_map_gw) {
-			c->ip4.dns[idx] = c->ip4.gw;
-			added++;
-
-			if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match))
-				c->ip4.dns_match = c->ip4.gw;
-		}
-	} else {
-		c->ip4.dns[idx] = *addr;
-		added++;
-	}
-
-	if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_host))
-		c->ip4.dns_host = *addr;
-
-	return added;
+	c->ip4.dns[idx] = *addr;
+	return 1;
 }
 
 /**
  * add_dns6() - Possibly add the IPv6 address of a DNS resolver to configuration
  * @c:		Execution context
- * @addr:	Address found in /etc/resolv.conf
+ * @addr:	Guest nameserver IPv6 address
  * @idx:	Index of free entry in array of IPv6 resolvers
  *
  * Return: Number of entries added (0 or 1)
  */
-static unsigned add_dns6(struct ctx *c, struct in6_addr *addr, unsigned idx)
+static unsigned add_dns6(struct ctx *c, const struct in6_addr *addr,
+			 unsigned idx)
 {
-	unsigned added = 0;
-
 	if (idx >= ARRAY_SIZE(c->ip6.dns))
 		return 0;
 
-	/* Guest or container can only access local addresses via redirect */
-	if (IN6_IS_ADDR_LOOPBACK(addr)) {
-		if (!c->no_map_gw) {
-			c->ip6.dns[idx] = c->ip6.gw;
-			added++;
-
-			if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match))
-				c->ip6.dns_match = c->ip6.gw;
-		}
-	} else {
-		c->ip6.dns[idx] = *addr;
-		added++;
-	}
-
-	if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_host))
-		c->ip6.dns_host = *addr;
-
-	return added;
+	c->ip6.dns[idx] = *addr;
+	return 1;
 }
 
 /**
@@ -437,11 +402,44 @@ static void add_dns_resolv(struct ctx *c, const char *nameserver,
 	struct in6_addr ns6;
 	struct in_addr ns4;
 
-	if (idx4 && inet_pton(AF_INET, nameserver, &ns4))
+	if (idx4 && inet_pton(AF_INET, nameserver, &ns4)) {
+		if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_host))
+			c->ip4.dns_host = ns4;
+
+		/* Guest or container can only access local addresses via
+		 * redirect
+		 */
+		if (IN4_IS_ADDR_LOOPBACK(&ns4)) {
+			if (c->no_map_gw)
+				return;
+
+			ns4 = c->ip4.gw;
+			if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match))
+				c->ip4.dns_match = c->ip4.gw;
+		}
+
 		*idx4 += add_dns4(c, &ns4, *idx4);
+	}
+
+	if (idx6 && inet_pton(AF_INET6, nameserver, &ns6)) {
+		if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_host))
+			c->ip6.dns_host = ns6;
+
+		/* Guest or container can only access local addresses via
+		 * redirect
+		 */
+		if (IN6_IS_ADDR_LOOPBACK(&ns6)) {
+			if (c->no_map_gw)
+				return;
+
+			ns6 = c->ip6.gw;
+
+			if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match))
+				c->ip6.dns_match = c->ip6.gw;
+		}
 
-	if (idx6 && inet_pton(AF_INET6, nameserver, &ns6))
 		*idx6 += add_dns6(c, &ns6, *idx6);
+	}
 }
 
 /**
diff --git a/passt.1 b/passt.1
index 3062b719..dca433b6 100644
--- a/passt.1
+++ b/passt.1
@@ -236,11 +236,15 @@ interface will be chosen instead.
 
 .TP
 .BR \-D ", " \-\-dns " " \fIaddr
-Use \fIaddr\fR (IPv4 or IPv6) for DHCP, DHCPv6, NDP or DNS forwarding, as
-configured (see options \fB--no-dhcp-dns\fR, \fB--dhcp-dns\fR,
-\fB--dns-forward\fR) instead of reading addresses from \fI/etc/resolv.conf\fR.
-This option can be specified multiple times. Specifying \fB-D none\fR disables
-usage of DNS addresses altogether.
+Instruct the guest (via DHCP, DHVPv6 or NDP) to use \fIaddr\fR (IPv4
+or IPv6) as a nameserver, as configured (see options
+\fB--no-dhcp-dns\fR, \fB--dhcp-dns\fR) instead of reading addresses
+from \fI/etc/resolv.conf\fR. This option can be specified multiple
+times. Specifying \fB-D none\fR disables usage of DNS addresses
+altogether. Unlike addresses from \fI/etc/resolv.conf\fR, \fIaddr\fR
+is given to the guest without remapping. For example \fB--dns
+127.0.0.1\fR will instruct the guest to use itself as nameserver, not
+the host.
 
 .TP
 .BR \-\-dns-forward " " \fIaddr
-- 
2.46.0
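The distinction this patch draws, namely that only host-view nameserver addresses get remapped, can be reduced to a small translation step. The following is a sketch under assumptions: host_ns_to_guest4() is a hypothetical helper, and the gw and no_map_gw parameters stand in for c->ip4.gw and c->no_map_gw.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>

/* Hypothetical: convert a nameserver address taken from the host's
 * resolv.conf into the address the guest should be told about.
 *
 * Returns false if the guest has no way to reach this nameserver, in which
 * case @ns is left untouched and should not be advertised.
 */
static bool host_ns_to_guest4(struct in_addr *ns, const struct in_addr *gw,
			      bool no_map_gw)
{
	/* Host loopback (127.0.0.0/8) is only reachable via the gateway map */
	if ((ntohl(ns->s_addr) >> 24) == 127) {
		if (no_map_gw)
			return false;

		*ns = *gw;
	}

	return true;
}
```

An address given with --dns would bypass a step like this entirely, which is why --dns 127.0.0.1 ends up pointing the guest at itself.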

David Gibson

7:39 a.m.

New subject: [PATCH 11/22] conf: Remove incorrect initialisation of addr_ll_seen

Despite the names, addr_ll_seen does not relate to addr_ll the same way addr_seen relates to addr. addr_ll_seen is an observed address from the guest, whereas addr_ll is *our* link-local address for use on the tap link when we can't use an external endpoint address. It's used both for passt provided services (DHCPv6, NDP) and in some cases for connections from addresses the guest can't access.

Signed-off-by: David Gibson
---
 conf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/conf.c b/conf.c
index 3c102bcf..e5b5263f 100644
--- a/conf.c
+++ b/conf.c
@@ -720,7 +720,6 @@ static unsigned int conf_ip6(unsigned int ifi,
 	}
 
 	ip6->addr_seen = ip6->addr;
-	ip6->addr_ll_seen = ip6->addr_ll;
 
 	if (MAC_IS_ZERO(mac)) {
 		rc = nl_link_get_mac(nl_sock, ifi, mac);
-- 
2.46.0

David Gibson

7:39 a.m.

New subject: [PATCH 12/22] util: Correct sock_l4() binding for link local addresses

When binding an IPv6 socket in sock_l4() we need to supply a scope id if the address is link-local. We check for this by comparing the given address to c->ip6.addr_ll. This is correct only by accident: while c->ip6.addr_ll is typically set to the hsot interface's link local address, the actually purpose of it is to provide a link local address for passt's private use on the tap interface.

Instead set the scope id for any link-local address we're binding to. We're going to need something and this is what makes sense for sockets on the host. It doesn't make sense for PIF_SPLICE sockets, but those should always have loopback, not link-local addresses.

Signed-off-by: David Gibson
---
 util.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/util.c b/util.c
index 892358b1..9682e3ce 100644
--- a/util.c
+++ b/util.c
@@ -199,8 +199,7 @@ int sock_l4(const struct ctx *c, sa_family_t af, enum epoll_type type,
 		if (bind_addr) {
 			addr6.sin6_addr = *(struct in6_addr *)bind_addr;
 
-			if (!memcmp(bind_addr, &c->ip6.addr_ll,
-			    sizeof(c->ip6.addr_ll)))
+			if (IN6_IS_ADDR_LINKLOCAL(bind_addr))
 				addr6.sin6_scope_id = c->ifi6;
 		}
 		return sock_l4_sa(c, type, &addr6, sizeof(addr6), ifname,
-- 
2.46.0
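The rule the patch adopts, setting a scope id for any link-local bind address, can be sketched with standard socket definitions. fill_bind_addr6() is a hypothetical helper written for this illustration, and ifi is assumed to be a valid interface index supplied by the caller.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical: prepare @sa for bind() to @addr and @port. Link-local
 * addresses are ambiguous without an interface, so they get @ifi as scope id;
 * all other addresses leave the scope id at zero.
 */
static void fill_bind_addr6(struct sockaddr_in6 *sa,
			    const struct in6_addr *addr, in_port_t port,
			    unsigned ifi)
{
	memset(sa, 0, sizeof(*sa));
	sa->sin6_family = AF_INET6;
	sa->sin6_port = htons(port);
	sa->sin6_addr = *addr;

	if (IN6_IS_ADDR_LINKLOCAL(addr))
		sa->sin6_scope_id = ifi;
}
```

Testing the address class rather than comparing against one stored address means the behaviour no longer depends on what a particular context field happens to contain.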

Stefano Brivio

20 Aug

2:14 a.m.

New subject: [PATCH 12/22] util: Correct sock_l4() binding for link local addresses

On Fri, 16 Aug 2024 15:39:53 +1000 David Gibson wrote:

...

When binding an IPv6 socket in sock_l4() we need to supply a scope id if the address is link-local. We check for this by comparing the given address to c->ip6.addr_ll. This is correct only by accident: while c->ip6.addr_ll is typically set to the hsot interface's link local address, the actually purpose of it is to provide a link local address

Nits: host, actual

-- 
Stefano

David Gibson

3:29 a.m.

New subject: [PATCH 12/22] util: Correct sock_l4() binding for link local addresses

On Tue, Aug 20, 2024 at 02:14:59AM +0200, Stefano Brivio wrote:

...

On Fri, 16 Aug 2024 15:39:53 +1000 David Gibson wrote:

...
When binding an IPv6 socket in sock_l4() we need to supply a scope id if the address is link-local. We check for this by comparing the given address to c->ip6.addr_ll. This is correct only by accident: while c->ip6.addr_ll is typically set to the hsot interface's link local address, the actually purpose of it is to provide a link local address

Nits: host, actual

Fixed.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

David Gibson

16 Aug

7:39 a.m.

New subject: [PATCH 13/22] treewide: Change misleading 'addr_ll' name

c->ip6.addr_ll is not like c->ip6.addr. The latter is an address for the guest, but the former is an address for our use on the tap link. Rename it accordingly, to 'our_tap_ll'.

Signed-off-by: David Gibson
---
 conf.c   | 7 ++++---
 dhcpv6.c | 2 +-
 fwd.c    | 2 +-
 ndp.c    | 2 +-
 passt.h  | 4 ++--
 5 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/conf.c b/conf.c
index e5b5263f..1130ce5d 100644
--- a/conf.c
+++ b/conf.c
@@ -713,7 +713,7 @@ static unsigned int conf_ip6(unsigned int ifi,
 
 	rc = nl_addr_get(nl_sock, ifi, AF_INET6,
 			 IN6_IS_ADDR_UNSPECIFIED(&ip6->addr) ? &ip6->addr : NULL,
-			 &prefix_len, &ip6->addr_ll);
+			 &prefix_len, &ip6->our_tap_ll);
 	if (rc < 0) {
 		err("Couldn't discover IPv6 address: %s", strerror(-rc));
 		return 0;
@@ -735,7 +735,7 @@ static unsigned int conf_ip6(unsigned int ifi,
 	}
 
 	if (IN6_IS_ADDR_UNSPECIFIED(&ip6->addr) ||
-	    IN6_IS_ADDR_UNSPECIFIED(&ip6->addr_ll))
+	    IN6_IS_ADDR_UNSPECIFIED(&ip6->our_tap_ll))
 		return 0;
 
 	return ifi;
@@ -1027,7 +1027,8 @@ static void conf_print(const struct ctx *c)
 		info(" router: %s",
 		     inet_ntop(AF_INET6, &c->ip6.gw, buf6, sizeof(buf6)));
 		info(" our link-local: %s",
-		     inet_ntop(AF_INET6, &c->ip6.addr_ll, buf6, sizeof(buf6)));
+		     inet_ntop(AF_INET6, &c->ip6.our_tap_ll,
+			       buf6, sizeof(buf6)));
 
 dns6:
 		for (i = 0; !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns[i]); i++) {
diff --git a/dhcpv6.c b/dhcpv6.c
index 87b3c3eb..44e954e7 100644
--- a/dhcpv6.c
+++ b/dhcpv6.c
@@ -456,7 +456,7 @@ int dhcpv6(struct ctx *c, const struct pool *p,
 	if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
 		src = &c->ip6.gw;
 	else
-		src = &c->ip6.addr_ll;
+		src = &c->ip6.our_tap_ll;
 
 	mh = packet_get(p, 0, sizeof(*uh), sizeof(*mh), NULL);
 	if (!mh)
diff --git a/fwd.c b/fwd.c
index b546bc41..dccc947d 100644
--- a/fwd.c
+++ b/fwd.c
@@ -320,7 +320,7 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
 		if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
 			tgt->oaddr.a6 = c->ip6.gw;
 		else
-			tgt->oaddr.a6 = c->ip6.addr_ll;
+			tgt->oaddr.a6 = c->ip6.our_tap_ll;
 	}
 
 	if (inany_v4(&tgt->oaddr)) {
diff --git a/ndp.c b/ndp.c
index 9c0fef4a..3a76b00a 100644
--- a/ndp.c
+++ b/ndp.c
@@ -344,7 +344,7 @@ dns_done:
 	if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
 		rsaddr = &c->ip6.gw;
 	else
-		rsaddr = &c->ip6.addr_ll;
+		rsaddr = &c->ip6.our_tap_ll;
 
 	if (ih->icmp6_type == NS) {
 		dlen = sizeof(struct ndp_na);
diff --git a/passt.h b/passt.h
index fe3e47d2..5e7e6a04 100644
--- a/passt.h
+++ b/passt.h
@@ -122,7 +122,7 @@ struct ip4_ctx {
 /**
  * struct ip6_ctx - IPv6 execution context
  * @addr:		IPv6 address assigned to guest
- * @addr_ll:		Link-local IPv6 address on external, routable interface
+ * @our_tap_ll:		Link-local IPv6 address for passt's use on tap
  * @addr_seen:		Latest IPv6 global/site address seen as source from tap
  * @addr_ll_seen:	Latest IPv6 link-local address seen as source from tap
  * @gw:			Default IPv6 gateway
@@ -136,7 +136,7 @@ struct ip4_ctx {
  */
 struct ip6_ctx {
 	struct in6_addr addr;
-	struct in6_addr addr_ll;
+	struct in6_addr our_tap_ll;
 	struct in6_addr addr_seen;
 	struct in6_addr addr_ll_seen;
 	struct in6_addr gw;
-- 
2.46.0

Stefano Brivio

20 Aug

2:15 a.m.

New subject: [PATCH 13/22] treewide: Change misleading 'addr_ll' name

On Fri, 16 Aug 2024 15:39:54 +1000 David Gibson wrote:

...

c->ip6.addr_ll is not like c->ip6.addr. The latter is an address for the guest, but the former is an address for our use on the tap link. Rename it accordingly, to 'our_tap_ll'.

Same as 3/22: could this be "our_ll"? Same here, not a strong preference.

I reviewed only up to 16/22 so far.

-- 
Stefano

David Gibson

3:30 a.m.

New subject: [PATCH 13/22] treewide: Change misleading 'addr_ll' name

On Tue, Aug 20, 2024 at 02:15:03AM +0200, Stefano Brivio wrote:

...

On Fri, 16 Aug 2024 15:39:54 +1000 David Gibson wrote:

...
c->ip6.addr_ll is not like c->ip6.addr. The latter is an address for the guest, but the former is an address for our use on the tap link. Rename it accordingly, to 'our_tap_ll'.

Same as 3/22: could this be "our_ll"? Same here, not a strong preference.

Same answer here. Maybe, but I want to emphasise that it's our address as used on PIF_TAP. Obviously we may use host LL addresses if we contact external hosts on the same link as the host.

...

I reviewed only up to 16/22 so far.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

David Gibson

16 Aug

7:39 a.m.

New subject: [PATCH 14/22] Clarify which addresses in ip[46]_ctx are meaningful where

Some are guest visible addresses and may not be valid on the host, others are host visible addresses and may not be valid on the guest. Rearrange and comment the ip[46]_ctx definitions to make it clearer which is which.

Signed-off-by: David Gibson
---
 passt.h | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/passt.h b/passt.h
index 5e7e6a04..3b8a6283 100644
--- a/passt.h
+++ b/passt.h
@@ -104,15 +104,18 @@ enum passt_modes {
  * @no_copy_addrs:	Don't copy all addresses when configuring namespace
  */
 struct ip4_ctx {
+	/* PIF_TAP addresses */
 	struct in_addr addr;
 	struct in_addr addr_seen;
 	int prefix_len;
 	struct in_addr gw;
 	struct in_addr dns[MAXNS + 1];
 	struct in_addr dns_match;
-	struct in_addr dns_host;
 
+	/* PIF_HOST addresses */
+	struct in_addr dns_host;
 	struct in_addr addr_out;
+
 	char ifname_out[IFNAMSIZ];
 
 	bool no_copy_routes;
@@ -122,12 +125,12 @@ struct ip4_ctx {
 /**
  * struct ip6_ctx - IPv6 execution context
  * @addr:		IPv6 address assigned to guest
- * @our_tap_ll:		Link-local IPv6 address for passt's use on tap
  * @addr_seen:		Latest IPv6 global/site address seen as source from tap
  * @addr_ll_seen:	Latest IPv6 link-local address seen as source from tap
  * @gw:			Default IPv6 gateway
  * @dns:		DNS addresses for DHCPv6 and NDP, zero-terminated
  * @dns_match:		Forward DNS query if sent to this address
+ * @our_tap_ll:		Link-local IPv6 address for passt's use on tap
  * @dns_host:		Use this DNS on the host for forwarding
  * @addr_out:		Optional source address for outbound traffic
  * @ifname_out:		Optional interface name to bind outbound sockets to
@@ -135,16 +138,19 @@ struct ip4_ctx {
  * @no_copy_addrs:	Don't copy all addresses when configuring namespace
  */
 struct ip6_ctx {
+	/* PIF_TAP addresses */
 	struct in6_addr addr;
-	struct in6_addr our_tap_ll;
 	struct in6_addr addr_seen;
 	struct in6_addr addr_ll_seen;
 	struct in6_addr gw;
 	struct in6_addr dns[MAXNS + 1];
 	struct in6_addr dns_match;
-	struct in6_addr dns_host;
+	struct in6_addr our_tap_ll;
 
+	/* PIF_HOST addresses */
+	struct in6_addr dns_host;
 	struct in6_addr addr_out;
+
 	char ifname_out[IFNAMSIZ];
 
 	bool no_copy_routes;
-- 
2.46.0

David Gibson

7:39 a.m.

New subject: [PATCH 15/22] Initialise our_tap_ll to ip6.gw when suitable

In every place we use our_tap_ll, we only use it as a fallback if the IPv6 gateway address is not link-local. We can avoid that conditional at use time by doing it at initialisation of our_tap_ll instead.

Signed-off-by: David Gibson
---
 conf.c   | 3 +++
 dhcpv6.c | 5 +----
 fwd.c    | 5 +----
 ndp.c    | 5 +----
 4 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/conf.c b/conf.c
index 1130ce5d..954f20ea 100644
--- a/conf.c
+++ b/conf.c
@@ -721,6 +721,9 @@ static unsigned int conf_ip6(unsigned int ifi,
 
 	ip6->addr_seen = ip6->addr;
 
+	if (IN6_IS_ADDR_LINKLOCAL(&ip6->gw))
+		ip6->our_tap_ll = ip6->gw;
+
 	if (MAC_IS_ZERO(mac)) {
 		rc = nl_link_get_mac(nl_sock, ifi, mac);
 		if (rc < 0) {
diff --git a/dhcpv6.c b/dhcpv6.c
index 44e954e7..69841abc 100644
--- a/dhcpv6.c
+++ b/dhcpv6.c
@@ -453,10 +453,7 @@ int dhcpv6(struct ctx *c, const struct pool *p,
 
 	c->ip6.addr_ll_seen = *saddr;
 
-	if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
-		src = &c->ip6.gw;
-	else
-		src = &c->ip6.our_tap_ll;
+	src = &c->ip6.our_tap_ll;
 
 	mh = packet_get(p, 0, sizeof(*uh), sizeof(*mh), NULL);
 	if (!mh)
diff --git a/fwd.c b/fwd.c
index dccc947d..75dc0151 100644
--- a/fwd.c
+++ b/fwd.c
@@ -317,10 +317,7 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
 	} else if (inany_is_loopback6(&tgt->oaddr) ||
 		   inany_equals6(&tgt->oaddr, &c->ip6.addr_seen) ||
 		   inany_equals6(&tgt->oaddr, &c->ip6.addr)) {
-		if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
-			tgt->oaddr.a6 = c->ip6.gw;
-		else
-			tgt->oaddr.a6 = c->ip6.our_tap_ll;
+		tgt->oaddr.a6 = c->ip6.our_tap_ll;
 	}
 
 	if (inany_v4(&tgt->oaddr)) {
diff --git a/ndp.c b/ndp.c
index 3a76b00a..a1ee8349 100644
--- a/ndp.c
+++ b/ndp.c
@@ -341,10 +341,7 @@ dns_done:
 	else
 		c->ip6.addr_seen = *saddr;
 
-	if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
-		rsaddr = &c->ip6.gw;
-	else
-		rsaddr = &c->ip6.our_tap_ll;
+	rsaddr = &c->ip6.our_tap_ll;
 
 	if (ih->icmp6_type == NS) {
 		dlen = sizeof(struct ndp_na);
-- 
2.46.0
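The idea of resolving the fallback once, at initialisation, can be sketched as below. struct ip6_sketch and init_our_tap_ll() are hypothetical names for this illustration, and host_ll is assumed to be the link-local address discovered from the host interface.

```c
#include <netinet/in.h>

/* Hypothetical, cut-down version of the IPv6 context */
struct ip6_sketch {
	struct in6_addr gw;
	struct in6_addr our_tap_ll;
};

/* Pick our link-local address for the tap side once: use the gateway
 * address when it is itself link-local, otherwise fall back to @host_ll.
 * Users of our_tap_ll then need no conditional of their own.
 */
static void init_our_tap_ll(struct ip6_sketch *ip6,
			    const struct in6_addr *host_ll)
{
	ip6->our_tap_ll = *host_ll;

	if (IN6_IS_ADDR_LINKLOCAL(&ip6->gw))
		ip6->our_tap_ll = ip6->gw;
}
```

Each former call site that tested the gateway before falling back can then read our_tap_ll directly, which is what the dhcpv6.c, fwd.c and ndp.c hunks do.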

David Gibson

7:39 a.m.

New subject: [PATCH 16/22] fwd: Helpers to clarify what host addresses aren't guest accessible

We usually avoid NAT, but in a few cases we need to apply address translations. For inbound connections that happens for addresses which make sense to the host but are either inaccessible, or mean a different location from the guest's point of view.

Add some helper functions to determine such addresses, and use them in fwd_nat_from_host(). In doing so clarify some of the reasons for the logic. We'll also have further use for these helpers in future.

While we're there fix one unnecessary inconsistency between IPv4 and IPv6. We always translated the guest's observed address, but for IPv4 we didn't translate the guest's assigned address, whereas for IPv6 we did. Change this to translate both in all cases for consistency.

Signed-off-by: David Gibson
---
 fwd.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 87 insertions(+), 11 deletions(-)

diff --git a/fwd.c b/fwd.c
index 75dc0151..1baae338 100644
--- a/fwd.c
+++ b/fwd.c
@@ -170,6 +170,85 @@ static bool is_dns_flow(uint8_t proto, const struct flowside *ini)
 		((ini->oport == 53) || (ini->oport == 853));
 }
 
+/**
+ * fwd_guest_accessible4() - Is IPv4 address guest accessible
+ * @c:		Execution context
+ * @addr:	Host visible IPv4 address
+ *
+ * Return: true if @addr on the host is accessible to the guest without
+ *         translation, false otherwise
+ */
+static bool fwd_guest_accessible4(const struct ctx *c,
+				  const struct in_addr *addr)
+{
+	if (IN4_IS_ADDR_LOOPBACK(addr))
+		return false;
+
+	/* In socket interfaces 0.0.0.0 generally means "any" or unspecified,
+	 * however on the wire it can mean "this host on this network". Since
+	 * that has a different meaning for host and guest, we can't let it
+	 * through untranslated.
+	 */
+	if (IN4_IS_ADDR_UNSPECIFIED(addr))
+		return false;
+
+	/* For IPv4, addr_seen is initialised to addr, so is always a valid
+	 * address
+	 */
+	if (IN4_ARE_ADDR_EQUAL(addr, &c->ip4.addr) ||
+	    IN4_ARE_ADDR_EQUAL(addr, &c->ip4.addr_seen))
+		return false;
+
+	return true;
+}
+
+/**
+ * fwd_guest_accessible6() - Is IPv6 address guest accessible
+ * @c:		Execution context
+ * @addr:	Host visible IPv6 address
+ *
+ * Return: true if @addr on the host is accessible to the guest without
+ *         translation, false otherwise
+ */
+static bool fwd_guest_accessible6(const struct ctx *c,
+				  const struct in6_addr *addr)
+{
+	if (IN6_IS_ADDR_LOOPBACK(addr))
+		return false;
+
+	if (IN6_ARE_ADDR_EQUAL(addr, &c->ip6.addr))
+		return false;
+
+	/* For IPv6, addr_seen starts unspecified, because we don't know what LL
+	 * address the guest will take until we see it. Only check against it
+	 * if it has been set to a real address.
+	 */
+	if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr_seen) &&
+	    IN6_ARE_ADDR_EQUAL(addr, &c->ip6.addr_seen))
+		return false;
+
+	return true;
+}
+
+/**
+ * fwd_guest_accessible() - Is IPv[46] address guest accessible
+ * @c:		Execution context
+ * @addr:	Host visible IPv[46] address
+ *
+ * Return: true if @addr on the host is accessible to the guest without
+ *         translation, false otherwise
+ */
+static bool fwd_guest_accessible(const struct ctx *c,
+				 const union inany_addr *addr)
+{
+	const struct in_addr *a4 = inany_v4(addr);
+
+	if (a4)
+		return fwd_guest_accessible4(c, a4);
+
+	return fwd_guest_accessible6(c, &addr->a6);
+}
+
 /**
  * fwd_nat_from_tap() - Determine to forward a flow from the tap interface
  * @c:		Execution context
@@ -307,18 +386,15 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
 		return PIF_SPLICE;
 	}
 
-	tgt->oaddr = ini->eaddr;
-	tgt->oport = ini->eport;
-
-	if (inany_is_loopback4(&tgt->oaddr) ||
-	    inany_is_unspecified4(&tgt->oaddr) ||
-	    inany_equals4(&tgt->oaddr, &c->ip4.addr_seen)) {
-		tgt->oaddr = inany_from_v4(c->ip4.gw);
-	} else if (inany_is_loopback6(&tgt->oaddr) ||
-		   inany_equals6(&tgt->oaddr, &c->ip6.addr_seen) ||
-		   inany_equals6(&tgt->oaddr, &c->ip6.addr)) {
-		tgt->oaddr.a6 = c->ip6.our_tap_ll;
+	if (!fwd_guest_accessible(c, &ini->eaddr)) {
+		if (inany_v4(&ini->eaddr))
+			tgt->oaddr = inany_from_v4(c->ip4.gw);
+		else
+			tgt->oaddr.a6 = c->ip6.our_tap_ll;
+	} else {
+		tgt->oaddr = ini->eaddr;
 	}
+	tgt->oport = ini->eport;
 
 	if (inany_v4(&tgt->oaddr)) {
 		tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
-- 
2.46.0
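The IPv4 half of the accessibility test depends only on a few address comparisons, so it can be exercised outside passt. This is a sketch under assumptions: guest_accessible4() is a hypothetical free-standing variant that takes the guest's assigned and observed addresses as parameters instead of reading them from the context.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>

/* Hypothetical: can the guest reach host-side address @addr as is?
 * @guest_addr and @guest_seen stand in for ip4.addr and ip4.addr_seen.
 */
static bool guest_accessible4(const struct in_addr *addr,
			      const struct in_addr *guest_addr,
			      const struct in_addr *guest_seen)
{
	/* Host loopback, 127.0.0.0/8, means something else to the guest */
	if ((ntohl(addr->s_addr) >> 24) == 127)
		return false;

	/* 0.0.0.0 has different meanings on host and guest */
	if (addr->s_addr == htonl(INADDR_ANY))
		return false;

	/* The guest's own addresses would loop back inside the guest */
	if (addr->s_addr == guest_addr->s_addr ||
	    addr->s_addr == guest_seen->s_addr)
		return false;

	return true;
}
```

Any source address for which this returns false would need to be replaced before the flow is presented to the guest, which is the decision fwd_nat_from_host() makes with the real helpers.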

Stefano Brivio

20 Aug

9:56 p.m.

New subject: [PATCH 16/22] fwd: Helpers to clarify what host addresses aren't guest accessible

On Fri, 16 Aug 2024 15:39:57 +1000 David Gibson wrote:

...

We usually avoid NAT, but in a few cases we need to apply address translations. For inbound connections that happens for addresses which make sense to the host but are either inaccessible, or mean a different location from the guest's point of view.

Add some helper functions to determine such addresses, and use them in fwd_nat_from_host(). In doing so clarify some of the reasons for the logic. We'll also have further use for these helpers in future.

While we're there fix one unnecessary inconsistency between IPv4 and IPv6. We always translated the guest's observed address, but for IPv4 we didn't translate the guest's assigned address, whereas for IPv6 we did. Change this to translate both in all cases for consistency.

Signed-off-by: David Gibson
---
 fwd.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 87 insertions(+), 11 deletions(-)

diff --git a/fwd.c b/fwd.c
index 75dc0151..1baae338 100644
--- a/fwd.c
+++ b/fwd.c
@@ -170,6 +170,85 @@ static bool is_dns_flow(uint8_t proto, const struct flowside *ini)
 		((ini->oport == 53) || (ini->oport == 853));
 }

+/**
+ * fwd_guest_accessible4() - Is IPv4 address guest accessible

Nit: I wonder if we should say "guest-accessible" in all these cases, it's a bit easier for me to decode, but not necessarily more correct. It's fine by me either way.

-- 
Stefano

David Gibson

21 Aug

3:40 a.m.

New subject: [PATCH 16/22] fwd: Helpers to clarify what host addresses aren't guest accessible

On Tue, Aug 20, 2024 at 09:56:18PM +0200, Stefano Brivio wrote:

...

On Fri, 16 Aug 2024 15:39:57 +1000 David Gibson wrote:

...
We usually avoid NAT, but in a few cases we need to apply address translations. For inbound connections that happens for addresses which make sense to the host but are either inaccessible, or mean a different location from the guest's point of view.

Add some helper functions to determine such addresses, and use them in fwd_nat_from_host(). In doing so clarify some of the reasons for the logic. We'll also have further use for these helpers in future.

While we're there fix one unnecessary inconsistency between IPv4 and IPv6. We always translated the guest's observed address, but for IPv4 we didn't translate the guest's assigned address, whereas for IPv6 we did. Change this to translate both in all cases for consistency.

Signed-off-by: David Gibson --- fwd.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 87 insertions(+), 11 deletions(-)

diff --git a/fwd.c b/fwd.c index 75dc0151..1baae338 100644 --- a/fwd.c +++ b/fwd.c @@ -170,6 +170,85 @@ static bool is_dns_flow(uint8_t proto, const struct flowside *ini) ((ini->oport == 53) || (ini->oport == 853)); }

+/** + * fwd_guest_accessible4() - Is IPv4 address guest accessible

Nit: I wonder if we should say "guest-accessible" in all these cases, it's a bit easier for me to decode, but not necessarily more correct. It's fine by me either way.

Just adding the hyphen? Sure, done.

--
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
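As a rough illustration of the check this patch discusses, here is a minimal IPv4-only sketch. It covers only the cases named in the thread (host loopback, plus the guest's assigned and observed addresses). The struct and function names are hypothetical stand-ins, not the helpers from fwd.c.

```c
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical, cut-down stand-in for the IPv4 part of the context */
struct guest_ip4 {
	struct in_addr addr;		/* address assigned to the guest */
	struct in_addr addr_seen;	/* latest source address seen from guest */
};

/* Sketch: could the guest sensibly see @a as the peer of an inbound flow? */
static bool guest_accessible4_sketch(const struct guest_ip4 *ip4,
				     struct in_addr a)
{
	/* 127.0.0.0/8 would mean the guest itself, from the guest's side */
	if ((ntohl(a.s_addr) >> 24) == 127)
		return false;

	/* The guest's own addresses (assigned or observed) also point back
	 * at the guest rather than at the host
	 */
	if (a.s_addr == ip4->addr.s_addr || a.s_addr == ip4->addr_seen.s_addr)
		return false;

	return true;
}
```

Treating the assigned and the observed address the same way mirrors the consistency fix described in the commit message.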

David Gibson

16 Aug

7:39 a.m.

New subject: [PATCH 17/22] fwd: Split notion of "our tap address" from gateway for IPv4

ip4.gw conflates 3 conceptually different things, which (for now) have the same value:

1. The router/gateway address as seen by the guest
2. An address to NAT to the host if --no-map-gw isn't specified
3. An address to use as source when nothing else makes sense

Case 3 occurs in two situations:

a) for our DHCP responses - since they come from passt internally there's no naturally meaningful address for them to come from
b) for forwarded connections coming from an address that isn't guest accessible (localhost or the guest's own address).

(b) occurs even with --no-map-gw, and the expected behaviour of forwarding local connections requires it.

For IPv6 role (3) is now taken by ip6.our_tap_ll (which usually has the same value as ip6.gw). For future flexibility we may want to make this "address of last resort" different from the gateway address, so split them logically for IPv4 as well.

Specifically, add a new ip4.our_tap_addr field for the address with this role, and initialise it to ip4.gw for now. Unlike IPv6 where we can always get a link-local address, we might not be able to get a (non 0.0.0.0) address here. In that case we have to disable DHCP and forwarding of inbound connections with guest-inaccessible source addresses.

Signed-off-by: David Gibson --- conf.c | 7 ++++++- dhcp.c | 4 ++-- fwd.c | 10 +++++++--- passt.h | 2 ++ 4 files changed, 17 insertions(+), 6 deletions(-) diff --git a/conf.c b/conf.c index 954f20ea..9f962fc8 100644 --- a/conf.c +++ b/conf.c @@ -660,6 +660,8 @@ static unsigned int conf_ip4(unsigned int ifi, ip4->addr_seen = ip4->addr; + ip4->our_tap_addr = ip4->gw; + if (MAC_IS_ZERO(mac)) { int rc = nl_link_get_mac(nl_sock, ifi, mac); if (rc < 0) { @@ -1666,7 +1668,10 @@ void conf(struct ctx *c, int argc, char **argv) die("External interface not usable"); if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.gw)) - c->no_map_gw = c->no_dhcp = 1; + c->no_map_gw = 1; + + if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr)) + c->no_dhcp = 1; if (c->ifi6 && IN6_IS_ADDR_UNSPECIFIED(&c->ip6.gw)) c->no_map_gw = 1; diff --git a/dhcp.c b/dhcp.c index acc5b03e..a935dc94 100644 --- a/dhcp.c +++ b/dhcp.c @@ -347,7 +347,7 @@ int dhcp(const struct ctx *c, const struct pool *p) mask.s_addr = htonl(0xffffffff << (32 - c->ip4.prefix_len)); memcpy(opts[1].s, &mask, sizeof(mask)); memcpy(opts[3].s, &c->ip4.gw, sizeof(c->ip4.gw)); - memcpy(opts[54].s, &c->ip4.gw, sizeof(c->ip4.gw)); + memcpy(opts[54].s, &c->ip4.our_tap_addr, sizeof(c->ip4.our_tap_addr)); /* If the gateway is not on the assigned subnet, send an option 121 * (Classless Static Routing) adding a dummy route to it. 
@@ -377,7 +377,7 @@ int dhcp(const struct ctx *c, const struct pool *p) opt_set_dns_search(c, sizeof(m->o)); dlen = offsetof(struct msg, o) + fill(m); - tap_udp4_send(c, c->ip4.gw, 67, c->ip4.addr, 68, m, dlen); + tap_udp4_send(c, c->ip4.our_tap_addr, 67, c->ip4.addr, 68, m, dlen); return 1; } diff --git a/fwd.c b/fwd.c index 1baae338..fe618742 100644 --- a/fwd.c +++ b/fwd.c @@ -387,10 +387,14 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, } if (!fwd_guest_accessible(c, &ini->eaddr)) { - if (inany_v4(&ini->eaddr)) - tgt->oaddr = inany_from_v4(c->ip4.gw); - else + if (inany_v4(&ini->eaddr)) { + if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr)) + /* No source address we can use */ + return PIF_NONE; + tgt->oaddr = inany_from_v4(c->ip4.our_tap_addr); + } else { tgt->oaddr.a6 = c->ip6.our_tap_ll; + } } else { tgt->oaddr = ini->eaddr; } diff --git a/passt.h b/passt.h index 3b8a6283..ecfed1e7 100644 --- a/passt.h +++ b/passt.h @@ -97,6 +97,7 @@ enum passt_modes { * @gw: Default IPv4 gateway * @dns: DNS addresses for DHCP, zero-terminated * @dns_match: Forward DNS query if sent to this address + * @our_tap_addr: IPv4 address for passt's use on tap * @dns_host: Use this DNS on the host for forwarding * @addr_out: Optional source address for outbound traffic * @ifname_out: Optional interface name to bind outbound sockets to @@ -111,6 +112,7 @@ struct ip4_ctx { struct in_addr gw; struct in_addr dns[MAXNS + 1]; struct in_addr dns_match; + struct in_addr our_tap_addr; /* PIF_HOST addresses */ struct in_addr dns_host; -- 2.46.0
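The fwd.c hunk above boils down to a small decision: keep the host-side source if the guest can use it, otherwise fall back to the "address of last resort", and give up if there is none. A minimal sketch of that decision follows. The function name and the boolean parameter are assumptions made for illustration and do not appear in passt.

```c
#include <stdbool.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Sketch: choose the IPv4 source address the guest will see for an inbound
 * flow. Returns false when no usable source exists, which corresponds to the
 * PIF_NONE return in the hunk above.
 */
static bool pick_guest_side_src4(struct in_addr host_src, bool src_accessible,
				 struct in_addr our_tap_addr,
				 struct in_addr *out)
{
	if (src_accessible) {
		/* No translation needed, pass the original source through */
		*out = host_src;
		return true;
	}

	/* 0.0.0.0 means we never obtained an address of last resort */
	if (our_tap_addr.s_addr == htonl(INADDR_ANY))
		return false;

	*out = our_tap_addr;
	return true;
}
```

Returning a failure instead of sending from 0.0.0.0 keeps the guest from seeing a connection whose peer address it could never reply to.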

Stefano Brivio

20 Aug

9:56 p.m.

New subject: [PATCH 17/22] fwd: Split notion of "our tap address" from gateway for IPv4

On Fri, 16 Aug 2024 15:39:58 +1000 David Gibson wrote:

...

ip4.gw conflates 3 conceptually different things, which (for now) have the same value:

1. The router/gateway address as seen by the guest
2. An address to NAT to the host if --no-map-gw isn't specified
3. An address to use as source when nothing else makes sense

Case 3 occurs in two situations:

a) for our DHCP responses - since they come from passt internally there's no naturally meaningful address for them to come from
b) for forwarded connections coming from an address that isn't guest accessible (localhost or the guest's own address).

(b) occurs even with --no-map-gw, and the expected behaviour of forwarding local connections requires it.

For IPv6 role (3) is now taken by ip6.our_tap_ll (which usually has the same value as ip6.gw). For future flexibility we may want to make this "address of last resort" different from the gateway address, so split them logically for IPv4 as well.

Specifically, add a new ip4.our_tap_addr field for the address with this role, and initialise it to ip4.gw for now. Unlike IPv6 where we can always get a link-local address, we might not be able to get a (non 0.0.0.0) address here. In that case we have to disable DHCP

It's not entirely clear to me in which case we would not be able to get any address, but at least RFC 2131 doesn't have a problem with this:

diff --git a/dhcp.c b/dhcp.c
index aa9f59d..3de8a6e 100644
--- a/dhcp.c
+++ b/dhcp.c
@@ -282,6 +282,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
 	struct in_addr mask;
 	unsigned int i;
 	struct msg *m;
+	struct in_addr zeroes = { 0 };
 
 	eh = packet_get(p, 0, offset, sizeof(*eh), NULL);
 	offset += sizeof(*eh);
@@ -378,7 +379,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
 	opt_set_dns_search(c, sizeof(m->o));
 
 	dlen = offsetof(struct msg, o) + fill(m);
-	tap_udp4_send(c, c->ip4.gw, 67, c->ip4.addr, 68, m, dlen);
+	tap_udp4_send(c, zeroes, 67, c->ip4.addr, 68, m, dlen);
 
 	return 1;
 }

and:

$ ./pasta -p dhcp.pcap
Saving packet capture to dhcp.pcap
# dhclient
# tshark -r dhcp.pcap
Running as user "root" and group "root". This could be dangerous.
    1 0.000000 :: → ff02::16 ICMPv6 90 Multicast Listener Report Message v2
    2 0.016265 0.0.0.0 → 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0x75759d11
    3 0.016361 0.0.0.0 → 88.198.0.164 DHCP 342 DHCP Offer - Transaction ID 0x75759d11
    4 0.016479 0.0.0.0 → 255.255.255.255 DHCP 342 DHCP Request - Transaction ID 0x75759d11
    5 0.016493 0.0.0.0 → 88.198.0.164 DHCP 342 DHCP ACK - Transaction ID 0x75759d11
[...]

so this could be a reasonable fallback.

...

and forwarding of inbound connections with guest-inaccessible source addresses.

Signed-off-by: David Gibson --- conf.c | 7 ++++++- dhcp.c | 4 ++-- fwd.c | 10 +++++++--- passt.h | 2 ++ 4 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/conf.c b/conf.c index 954f20ea..9f962fc8 100644 --- a/conf.c +++ b/conf.c @@ -660,6 +660,8 @@ static unsigned int conf_ip4(unsigned int ifi,

ip4->addr_seen = ip4->addr;

+ ip4->our_tap_addr = ip4->gw; + if (MAC_IS_ZERO(mac)) { int rc = nl_link_get_mac(nl_sock, ifi, mac); if (rc < 0) { @@ -1666,7 +1668,10 @@ void conf(struct ctx *c, int argc, char **argv) die("External interface not usable");

if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.gw)) - c->no_map_gw = c->no_dhcp = 1; + c->no_map_gw = 1; + + if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr)) + c->no_dhcp = 1;

if (c->ifi6 && IN6_IS_ADDR_UNSPECIFIED(&c->ip6.gw)) c->no_map_gw = 1; diff --git a/dhcp.c b/dhcp.c index acc5b03e..a935dc94 100644 --- a/dhcp.c +++ b/dhcp.c @@ -347,7 +347,7 @@ int dhcp(const struct ctx *c, const struct pool *p) mask.s_addr = htonl(0xffffffff << (32 - c->ip4.prefix_len)); memcpy(opts[1].s, &mask, sizeof(mask)); memcpy(opts[3].s, &c->ip4.gw, sizeof(c->ip4.gw)); - memcpy(opts[54].s, &c->ip4.gw, sizeof(c->ip4.gw)); + memcpy(opts[54].s, &c->ip4.our_tap_addr, sizeof(c->ip4.our_tap_addr));

Nit: this was supposed to look like a table, so it would be nice to add extra whitespace in the lines above this one. -- Stefano

David Gibson

21 Aug

3:56 a.m.

New subject: [PATCH 17/22] fwd: Split notion of "our tap address" from gateway for IPv4

On Tue, Aug 20, 2024 at 09:56:24PM +0200, Stefano Brivio wrote:

...

On Fri, 16 Aug 2024 15:39:58 +1000 David Gibson wrote:

...
ip4.gw conflates 3 conceptually different things, which (for now) have the same value:

1. The router/gateway address as seen by the guest
2. An address to NAT to the host if --no-map-gw isn't specified
3. An address to use as source when nothing else makes sense

Case 3 occurs in two situations:

a) for our DHCP responses - since they come from passt internally there's no naturally meaningful address for them to come from
b) for forwarded connections coming from an address that isn't guest accessible (localhost or the guest's own address).

(b) occurs even with --no-map-gw, and the expected behaviour of forwarding local connections requires it.

For IPv6 role (3) is now taken by ip6.our_tap_ll (which usually has the same value as ip6.gw). For future flexibility we may want to make this "address of last resort" different from the gateway address, so split them logically for IPv4 as well.

Specifically, add a new ip4.our_tap_addr field for the address with this role, and initialise it to ip4.gw for now. Unlike IPv6 where we can always get a link-local address, we might not be able to get a (non 0.0.0.0) address here. In that case we have to disable DHCP

It's not entirely clear to me in which case we would not be able to get any address,

Currently, when we don't have a gateway address on the host: no connectivity, or a point-to-point link with no gateway, or the like. We used to absolutely require it, but that restriction has been eased and may ease further in future.

...

but at least RFC 2131 doesn't have a problem with this:

diff --git a/dhcp.c b/dhcp.c
index aa9f59d..3de8a6e 100644
--- a/dhcp.c
+++ b/dhcp.c
@@ -282,6 +282,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
 	struct in_addr mask;
 	unsigned int i;
 	struct msg *m;
+	struct in_addr zeroes = { 0 };
 
 	eh = packet_get(p, 0, offset, sizeof(*eh), NULL);
 	offset += sizeof(*eh);
@@ -378,7 +379,7 @@ int dhcp(const struct ctx *c, const struct pool *p)
 	opt_set_dns_search(c, sizeof(m->o));
 
 	dlen = offsetof(struct msg, o) + fill(m);
-	tap_udp4_send(c, c->ip4.gw, 67, c->ip4.addr, 68, m, dlen);
+	tap_udp4_send(c, zeroes, 67, c->ip4.addr, 68, m, dlen);
 
 	return 1;
 }

and:

$ ./pasta -p dhcp.pcap
Saving packet capture to dhcp.pcap
# dhclient
# tshark -r dhcp.pcap
Running as user "root" and group "root". This could be dangerous.
    1 0.000000 :: → ff02::16 ICMPv6 90 Multicast Listener Report Message v2
    2 0.016265 0.0.0.0 → 255.255.255.255 DHCP 342 DHCP Discover - Transaction ID 0x75759d11
    3 0.016361 0.0.0.0 → 88.198.0.164 DHCP 342 DHCP Offer - Transaction ID 0x75759d11
    4 0.016479 0.0.0.0 → 255.255.255.255 DHCP 342 DHCP Request - Transaction ID 0x75759d11
    5 0.016493 0.0.0.0 → 88.198.0.164 DHCP 342 DHCP ACK - Transaction ID 0x75759d11
[...]

so this could be a reasonable fallback.

Fair point. I've removed the disabling of DHCP in this case.

...

...
and forwarding of inbound connections with guest-inaccessible source addresses.

Signed-off-by: David Gibson --- conf.c | 7 ++++++- dhcp.c | 4 ++-- fwd.c | 10 +++++++--- passt.h | 2 ++ 4 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/conf.c b/conf.c index 954f20ea..9f962fc8 100644 --- a/conf.c +++ b/conf.c @@ -660,6 +660,8 @@ static unsigned int conf_ip4(unsigned int ifi,

ip4->addr_seen = ip4->addr;

+ ip4->our_tap_addr = ip4->gw; + if (MAC_IS_ZERO(mac)) { int rc = nl_link_get_mac(nl_sock, ifi, mac); if (rc < 0) { @@ -1666,7 +1668,10 @@ void conf(struct ctx *c, int argc, char **argv) die("External interface not usable");

if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.gw)) - c->no_map_gw = c->no_dhcp = 1; + c->no_map_gw = 1; + + if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr)) + c->no_dhcp = 1;

if (c->ifi6 && IN6_IS_ADDR_UNSPECIFIED(&c->ip6.gw)) c->no_map_gw = 1; diff --git a/dhcp.c b/dhcp.c index acc5b03e..a935dc94 100644 --- a/dhcp.c +++ b/dhcp.c @@ -347,7 +347,7 @@ int dhcp(const struct ctx *c, const struct pool *p) mask.s_addr = htonl(0xffffffff << (32 - c->ip4.prefix_len)); memcpy(opts[1].s, &mask, sizeof(mask)); memcpy(opts[3].s, &c->ip4.gw, sizeof(c->ip4.gw)); - memcpy(opts[54].s, &c->ip4.gw, sizeof(c->ip4.gw)); + memcpy(opts[54].s, &c->ip4.our_tap_addr, sizeof(c->ip4.our_tap_addr));

Nit: this was supposed to look like a table, so it would be nice to add extra whitespace in the lines above this one.

Makes sense, done.

--
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson

David Gibson

16 Aug

7:39 a.m.

New subject: [PATCH 18/22] Don't take "our" MAC address from the host

When sending frames to the guest over the tap link, we need a source MAC address. Currently we take that from the MAC address of the main interface on the host, but that doesn't actually make much sense:

 * We can't preserve the real MAC address of packets from anywhere external so there's no transparency case here
 * In fact, it's confusingly different from how we handle IP addresses: whereas we give the guest the same IP as the host, we're making the host's MAC the one MAC that the guest *can't* use for itself.
 * We already need a fallback case if the host doesn't have an Ethernet like MAC (e.g. if it's connected via a point to point interface, such as a wireguard VPN).

Change to just use an arbitrary fixed MAC address - I've picked 9a:55:9a:55:9a:55. It's simpler and has the small advantage of making the fact that passt/pasta is in use typically obvious from guest side packet dumps. This can still, of course, be overridden with the -M option.

Signed-off-by: David Gibson --- conf.c | 40 +++++----------------------------------- passt.h | 7 +++++++ util.h | 1 - 3 files changed, 12 insertions(+), 36 deletions(-) diff --git a/conf.c b/conf.c index 9f962fc8..b1c58d5b 100644 --- a/conf.c +++ b/conf.c @@ -612,12 +612,10 @@ static int conf_ip4_prefix(const char *arg) * conf_ip4() - Verify or detect IPv4 support, get relevant addresses * @ifi: Host interface to attempt (0 to determine one) * @ip4: IPv4 context (will be written) - * @mac: MAC address to use (written if unset) * * Return: Interface index for IPv4, or 0 on failure. 
*/ -static unsigned int conf_ip4(unsigned int ifi, - struct ip4_ctx *ip4, unsigned char *mac) +static unsigned int conf_ip4(unsigned int ifi, struct ip4_ctx *ip4) { if (!ifi) ifi = nl_get_ext_if(nl_sock, AF_INET); @@ -662,20 +660,6 @@ static unsigned int conf_ip4(unsigned int ifi, ip4->our_tap_addr = ip4->gw; - if (MAC_IS_ZERO(mac)) { - int rc = nl_link_get_mac(nl_sock, ifi, mac); - if (rc < 0) { - char ifname[IFNAMSIZ]; - - err("Couldn't discover MAC address for %s: %s", - if_indextoname(ifi, ifname), strerror(-rc)); - return 0; - } - - if (MAC_IS_ZERO(mac)) - memcpy(mac, MAC_LAA, ETH_ALEN); - } - if (IN4_IS_ADDR_UNSPECIFIED(&ip4->addr)) return 0; @@ -686,12 +670,10 @@ static unsigned int conf_ip4(unsigned int ifi, * conf_ip6() - Verify or detect IPv6 support, get relevant addresses * @ifi: Host interface to attempt (0 to determine one) * @ip6: IPv6 context (will be written) - * @mac: MAC address to use (written if unset) * * Return: Interface index for IPv6, or 0 on failure. */ -static unsigned int conf_ip6(unsigned int ifi, - struct ip6_ctx *ip6, unsigned char *mac) +static unsigned int conf_ip6(unsigned int ifi, struct ip6_ctx *ip6) { int prefix_len = 0; int rc; @@ -726,19 +708,6 @@ static unsigned int conf_ip6(unsigned int ifi, if (IN6_IS_ADDR_LINKLOCAL(&ip6->gw)) ip6->our_tap_ll = ip6->gw; - if (MAC_IS_ZERO(mac)) { - rc = nl_link_get_mac(nl_sock, ifi, mac); - if (rc < 0) { - char ifname[IFNAMSIZ]; - err("Couldn't discover MAC address for %s: %s", - if_indextoname(ifi, ifname), strerror(-rc)); - return 0; - } - - if (MAC_IS_ZERO(mac)) - memcpy(mac, MAC_LAA, ETH_ALEN); - } - if (IN6_IS_ADDR_UNSPECIFIED(&ip6->addr) || IN6_IS_ADDR_UNSPECIFIED(&ip6->our_tap_ll)) return 0; @@ -1289,6 +1258,7 @@ void conf(struct ctx *c, int argc, char **argv) c->tcp.fwd_in.mode = c->tcp.fwd_out.mode = FWD_UNSET; c->udp.fwd_in.mode = c->udp.fwd_out.mode = FWD_UNSET; + memcpy(c->our_tap_mac, MAC_OUR_LAA, ETH_ALEN); optind = 1; do { @@ -1659,9 +1629,9 @@ void conf(struct ctx *c, int 
argc, char **argv) nl_sock_init(c, false); if (!v6_only) - c->ifi4 = conf_ip4(ifi4, &c->ip4, c->our_tap_mac); + c->ifi4 = conf_ip4(ifi4, &c->ip4); if (!v4_only) - c->ifi6 = conf_ip6(ifi6, &c->ip6, c->our_tap_mac); + c->ifi6 = conf_ip6(ifi6, &c->ip6); if ((!c->ifi4 && !c->ifi6) || (*c->ip4.ifname_out && !c->ifi4) || (*c->ip6.ifname_out && !c->ifi6)) diff --git a/passt.h b/passt.h index ecfed1e7..c6c67ffc 100644 --- a/passt.h +++ b/passt.h @@ -26,6 +26,13 @@ union epoll_ref; #include "tcp.h" #include "udp.h" +/* Default address for our end on the tap interface. Bit 0 of byte 0 must be 0 + * (unicast) and bit 1 of byte 1 must be 1 (locally administered). Otherwise + * it's arbitrary. + */ +#define MAC_OUR_LAA \ + ((uint8_t [ETH_ALEN]){0x9a, 0x55, 0x9a, 0x55, 0x9a, 0x55}) + /** * union epoll_ref - Breakdown of reference for epoll fd bookkeeping * @type: Type of fd (tells us what to do with events) diff --git a/util.h b/util.h index c1748074..899496e3 100644 --- a/util.h +++ b/util.h @@ -96,7 +96,6 @@ #define PORT_IS_EPHEMERAL(port) ((port) >= PORT_EPHEMERAL_MIN) #define MAC_ZERO ((uint8_t [ETH_ALEN]){ 0 }) -#define MAC_LAA ((uint8_t [ETH_ALEN]){ BIT(1), 0, 0, 0, 0, 0 }) #define MAC_IS_ZERO(addr) (!memcmp((addr), MAC_ZERO, ETH_ALEN)) #ifndef __bswap_constant_16 -- 2.46.0
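To make the constraint on the fixed MAC concrete, here is a small sketch of the two checks. It assumes the usual Ethernet layout, where both flag bits sit in the first octet (byte 0): bit 0 is the group/multicast bit and bit 1 is the locally administered bit. The comment in the patch places the second bit in "byte 1", but 0x9a in byte 0 is what actually satisfies the rule. The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: bit 0 of the first octet clear means an individual (unicast) MAC */
static bool mac_is_unicast(const uint8_t *mac)
{
	return (mac[0] & 0x01) == 0;
}

/* Sketch: bit 1 of the first octet set means locally administered */
static bool mac_is_locally_administered(const uint8_t *mac)
{
	return (mac[0] & 0x02) != 0;
}

/* Sketch: both properties the patch comment asks of the default MAC */
static bool mac_ok_for_our_tap(const uint8_t *mac)
{
	return mac_is_unicast(mac) && mac_is_locally_administered(mac);
}
```

With these checks, 9a:55:9a:55:9a:55 passes (0x9a is 1001 1010 in binary), and so does the old MAC_LAA fallback of 02:00:00:00:00:00.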

David Gibson

7:40 a.m.

New subject: [PATCH 19/22] conf, fwd: Split notion of gateway/router from guest-visible host address

The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default gateway. We use this for two quite distinct things: advertising the gateway that the guest should use (via DHCP, NDP and/or --config-net) and for a limited form of NAT. So that the guest can access services on the host, we map the gateway address within the guest to the loopback address on the host.

Using the gateway address for this isn't necessarily the best choice for this purpose, certainly not for all circumstances. So, start off by splitting the notion of these into two different values: @guest_gw which is the gateway address the guest should use and @nat_host_loopback, which is the guest visible address to remap to the host's loopback.

Usually nat_host_loopback will have the same value as guest_gw. However when --no-map-gw is specified we leave them unspecified instead. This means when we use nat_host_loopback, we don't need to separately check c->no_map_gw to see if it's relevant.

Signed-off-by: David Gibson --- conf.c | 60 +++++++++++++++++++++++++++++---------------------------- dhcp.c | 10 ++++++---- fwd.c | 4 ++-- passt.h | 16 +++++++++------ pasta.c | 6 ++++-- 5 files changed, 53 insertions(+), 43 deletions(-) diff --git a/conf.c b/conf.c index b1c58d5b..26373584 100644 --- a/conf.c +++ b/conf.c @@ -410,12 +410,12 @@ static void add_dns_resolv(struct ctx *c, const char *nameserver, * redirect */ if (IN4_IS_ADDR_LOOPBACK(&ns4)) { - if (c->no_map_gw) + if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback)) return; - ns4 = c->ip4.gw; + ns4 = c->ip4.nat_host_loopback; if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match)) - c->ip4.dns_match = c->ip4.nat_host_loopback; } *idx4 += add_dns4(c, &ns4, *idx4); @@ -429,13 +429,13 @@ static void add_dns_resolv(struct ctx *c, const char *nameserver, * redirect */ if (IN6_IS_ADDR_LOOPBACK(&ns6)) { - if (c->no_map_gw) + if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback)) return; - ns6 = c->ip6.gw; + ns6 = 
c->ip6.nat_host_loopback; if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_match)) - c->ip6.dns_match = c->ip6.gw; + c->ip6.dns_match = c->ip6.nat_host_loopback; } *idx6 += add_dns6(c, &ns6, *idx6); @@ -625,8 +625,9 @@ static unsigned int conf_ip4(unsigned int ifi, struct ip4_ctx *ip4) return 0; } - if (IN4_IS_ADDR_UNSPECIFIED(&ip4->gw)) { - int rc = nl_route_get_def(nl_sock, ifi, AF_INET, &ip4->gw); + if (IN4_IS_ADDR_UNSPECIFIED(&ip4->guest_gw)) { + int rc = nl_route_get_def(nl_sock, ifi, AF_INET, + &ip4->guest_gw); if (rc < 0) { err("Couldn't discover IPv4 gateway address: %s", strerror(-rc)); @@ -658,7 +659,7 @@ static unsigned int conf_ip4(unsigned int ifi, struct ip4_ctx *ip4) ip4->addr_seen = ip4->addr; - ip4->our_tap_addr = ip4->gw; + ip4->our_tap_addr = ip4->guest_gw; if (IN4_IS_ADDR_UNSPECIFIED(&ip4->addr)) return 0; @@ -686,8 +687,8 @@ static unsigned int conf_ip6(unsigned int ifi, struct ip6_ctx *ip6) return 0; } - if (IN6_IS_ADDR_UNSPECIFIED(&ip6->gw)) { - rc = nl_route_get_def(nl_sock, ifi, AF_INET6, &ip6->gw); + if (IN6_IS_ADDR_UNSPECIFIED(&ip6->guest_gw)) { + rc = nl_route_get_def(nl_sock, ifi, AF_INET6, &ip6->guest_gw); if (rc < 0) { err("Couldn't discover IPv6 gateway address: %s", strerror(-rc)); @@ -705,8 +706,8 @@ static unsigned int conf_ip6(unsigned int ifi, struct ip6_ctx *ip6) ip6->addr_seen = ip6->addr; - if (IN6_IS_ADDR_LINKLOCAL(&ip6->gw)) - ip6->our_tap_ll = ip6->gw; + if (IN6_IS_ADDR_LINKLOCAL(&ip6->guest_gw)) + ip6->our_tap_ll = ip6->guest_gw; if (IN6_IS_ADDR_UNSPECIFIED(&ip6->addr) || IN6_IS_ADDR_UNSPECIFIED(&ip6->our_tap_ll)) @@ -969,7 +970,8 @@ static void conf_print(const struct ctx *c) info(" mask: %s", inet_ntop(AF_INET, &mask, buf4, sizeof(buf4))); info(" router: %s", - inet_ntop(AF_INET, &c->ip4.gw, buf4, sizeof(buf4))); + inet_ntop(AF_INET, &c->ip4.guest_gw, + buf4, sizeof(buf4))); } for (i = 0; !IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns[i]); i++) { @@ -999,7 +1001,7 @@ static void conf_print(const struct ctx *c) info(" assign: %s", 
inet_ntop(AF_INET6, &c->ip6.addr, buf6, sizeof(buf6))); info(" router: %s", - inet_ntop(AF_INET6, &c->ip6.gw, buf6, sizeof(buf6))); + inet_ntop(AF_INET6, &c->ip6.guest_gw, buf6, sizeof(buf6))); info(" our link-local: %s", inet_ntop(AF_INET6, &c->ip6.our_tap_ll, buf6, sizeof(buf6))); @@ -1173,7 +1175,7 @@ fail: */ void conf(struct ctx *c, int argc, char **argv) { - int netns_only = 0; + int netns_only = 0, no_map_gw = 0; const struct option options[] = { {"debug", no_argument, NULL, 'd' }, {"quiet", no_argument, NULL, 'q' }, @@ -1202,7 +1204,7 @@ void conf(struct ctx *c, int argc, char **argv) {"no-dhcpv6", no_argument, &c->no_dhcpv6, 1 }, {"no-ndp", no_argument, &c->no_ndp, 1 }, {"no-ra", no_argument, &c->no_ra, 1 }, - {"no-map-gw", no_argument, &c->no_map_gw, 1 }, + {"no-map-gw", no_argument, &no_map_gw, 1 }, {"ipv4-only", no_argument, NULL, '4' }, {"ipv6-only", no_argument, NULL, '6' }, {"one-off", no_argument, NULL, '1' }, @@ -1503,18 +1505,18 @@ void conf(struct ctx *c, int argc, char **argv) parse_mac(c->our_tap_mac, optarg); break; case 'g': - if (inet_pton(AF_INET6, optarg, &c->ip6.gw) && - !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.gw) && - !IN6_IS_ADDR_LOOPBACK(&c->ip6.gw)) { + if (inet_pton(AF_INET6, optarg, &c->ip6.guest_gw) && + !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.guest_gw) && + !IN6_IS_ADDR_LOOPBACK(&c->ip6.guest_gw)) { if (c->mode == MODE_PASTA) c->ip6.no_copy_routes = true; break; } - if (inet_pton(AF_INET, optarg, &c->ip4.gw) && - !IN4_IS_ADDR_UNSPECIFIED(&c->ip4.gw) && - !IN4_IS_ADDR_BROADCAST(&c->ip4.gw) && - !IN4_IS_ADDR_LOOPBACK(&c->ip4.gw)) { + if (inet_pton(AF_INET, optarg, &c->ip4.guest_gw) && + !IN4_IS_ADDR_UNSPECIFIED(&c->ip4.guest_gw) && + !IN4_IS_ADDR_BROADCAST(&c->ip4.guest_gw) && + !IN4_IS_ADDR_LOOPBACK(&c->ip4.guest_gw)) { if (c->mode == MODE_PASTA) c->ip4.no_copy_routes = true; break; @@ -1637,15 +1639,15 @@ void conf(struct ctx *c, int argc, char **argv) (*c->ip6.ifname_out && !c->ifi6)) die("External interface not usable"); - if (c->ifi4 && 
IN4_IS_ADDR_UNSPECIFIED(&c->ip4.gw)) - c->no_map_gw = 1; + if (c->ifi4 && !no_map_gw) + c->ip4.nat_host_loopback = c->ip4.guest_gw; + + if (c->ifi6 && !no_map_gw) + c->ip6.nat_host_loopback = c->ip6.guest_gw; if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr)) c->no_dhcp = 1; - if (c->ifi6 && IN6_IS_ADDR_UNSPECIFIED(&c->ip6.gw)) - c->no_map_gw = 1; - /* Inbound port options & DNS can be parsed now (after IPv4/IPv6 * settings) */ diff --git a/dhcp.c b/dhcp.c index a935dc94..43585888 100644 --- a/dhcp.c +++ b/dhcp.c @@ -346,19 +346,21 @@ int dhcp(const struct ctx *c, const struct pool *p) m->yiaddr = c->ip4.addr; mask.s_addr = htonl(0xffffffff << (32 - c->ip4.prefix_len)); memcpy(opts[1].s, &mask, sizeof(mask)); - memcpy(opts[3].s, &c->ip4.gw, sizeof(c->ip4.gw)); + memcpy(opts[3].s, &c->ip4.guest_gw, sizeof(c->ip4.guest_gw)); memcpy(opts[54].s, &c->ip4.our_tap_addr, sizeof(c->ip4.our_tap_addr)); /* If the gateway is not on the assigned subnet, send an option 121 * (Classless Static Routing) adding a dummy route to it. 
*/ if ((c->ip4.addr.s_addr & mask.s_addr) - != (c->ip4.gw.s_addr & mask.s_addr)) { + != (c->ip4.guest_gw.s_addr & mask.s_addr)) { /* a.b.c.d/32:0.0.0.0, 0:a.b.c.d */ opts[121].slen = 14; opts[121].s[0] = 32; - memcpy(opts[121].s + 1, &c->ip4.gw, sizeof(c->ip4.gw)); - memcpy(opts[121].s + 10, &c->ip4.gw, sizeof(c->ip4.gw)); + memcpy(opts[121].s + 1, + &c->ip4.guest_gw, sizeof(c->ip4.guest_gw)); + memcpy(opts[121].s + 10, + &c->ip4.guest_gw, sizeof(c->ip4.guest_gw)); } if (c->mtu != -1) { diff --git a/fwd.c b/fwd.c index fe618742..779278a9 100644 --- a/fwd.c +++ b/fwd.c @@ -268,9 +268,9 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, else if (is_dns_flow(proto, ini) && inany_equals6(&ini->oaddr, &c->ip6.dns_match)) tgt->eaddr.a6 = c->ip6.dns_host; - else if (!c->no_map_gw && inany_equals4(&ini->oaddr, &c->ip4.gw)) + else if (inany_equals4(&ini->oaddr, &c->ip4.nat_host_loopback)) tgt->eaddr = inany_loopback4; - else if (!c->no_map_gw && inany_equals6(&ini->oaddr, &c->ip6.gw)) + else if (inany_equals6(&ini->oaddr, &c->ip6.nat_host_loopback)) tgt->eaddr = inany_loopback6; else tgt->eaddr = ini->oaddr; diff --git a/passt.h b/passt.h index c6c67ffc..20a5904a 100644 --- a/passt.h +++ b/passt.h @@ -101,7 +101,9 @@ enum passt_modes { * @addr: IPv4 address assigned to guest * @addr_seen: Latest IPv4 address seen as source from tap * @prefixlen: IPv4 prefix length (netmask) - * @gw: Default IPv4 gateway + * @guest_gw: IPv4 gateway as seen by the guest + * @nat_host_loopback: Outbound connections to this address are NATted to the + * host's 127.0.0.1 * @dns: DNS addresses for DHCP, zero-terminated * @dns_match: Forward DNS query if sent to this address * @our_tap_addr: IPv4 address for passt's use on tap @@ -116,7 +118,8 @@ struct ip4_ctx { struct in_addr addr; struct in_addr addr_seen; int prefix_len; - struct in_addr gw; + struct in_addr guest_gw; + struct in_addr nat_host_loopback; struct in_addr dns[MAXNS + 1]; struct in_addr dns_match; struct in_addr 
our_tap_addr; @@ -136,7 +139,9 @@ struct ip4_ctx { * @addr: IPv6 address assigned to guest * @addr_seen: Latest IPv6 global/site address seen as source from tap * @addr_ll_seen: Latest IPv6 link-local address seen as source from tap - * @gw: Default IPv6 gateway + * @guest_gw: IPv6 gateway as seen by the guest + * @nat_host_loopback: Outbound connections to this address are NATted to the + * host's [::1] * @dns: DNS addresses for DHCPv6 and NDP, zero-terminated * @dns_match: Forward DNS query if sent to this address * @our_tap_ll: Link-local IPv6 address for passt's use on tap @@ -151,7 +156,8 @@ struct ip6_ctx { struct in6_addr addr; struct in6_addr addr_seen; struct in6_addr addr_ll_seen; - struct in6_addr gw; + struct in6_addr guest_gw; + struct in6_addr nat_host_loopback; struct in6_addr dns[MAXNS + 1]; struct in6_addr dns_match; struct in6_addr our_tap_ll; @@ -213,7 +219,6 @@ struct ip6_ctx { * @no_dhcpv6: Disable DHCPv6 server * @no_ndp: Disable NDP handler altogether * @no_ra: Disable router advertisements - * @no_map_gw: Don't map connections, untracked UDP to gateway to host * @low_wmem: Low probed net.core.wmem_max * @low_rmem: Low probed net.core.rmem_max */ @@ -273,7 +278,6 @@ struct ctx { int no_dhcpv6; int no_ndp; int no_ra; - int no_map_gw; int low_wmem; int low_rmem; diff --git a/pasta.c b/pasta.c index 3b4e8ead..2aeaf388 100644 --- a/pasta.c +++ b/pasta.c @@ -324,7 +324,8 @@ void pasta_ns_conf(struct ctx *c) if (c->ip4.no_copy_routes) { rc = nl_route_set_def(nl_sock_ns, c->pasta_ifi, - AF_INET, &c->ip4.gw); + AF_INET, + &c->ip4.guest_gw); } else { rc = nl_route_dup(nl_sock, c->ifi4, nl_sock_ns, c->pasta_ifi, AF_INET); @@ -353,7 +354,8 @@ void pasta_ns_conf(struct ctx *c) if (c->ip6.no_copy_routes) { rc = nl_route_set_def(nl_sock_ns, c->pasta_ifi, - AF_INET6, &c->ip6.gw); + AF_INET6, + &c->ip6.guest_gw); } else { rc = nl_route_dup(nl_sock, c->ifi6, nl_sock_ns, c->pasta_ifi, -- 2.46.0
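The effect of the fwd_nat_from_tap() change can be sketched as a single function for IPv4: an outbound destination equal to nat_host_loopback is rewritten to the host's loopback, and an unspecified nat_host_loopback means the mapping is off. The function name is hypothetical, and the explicit 0.0.0.0 check is an assumption of this sketch; the patch itself relies on the address comparison alone.

```c
#include <arpa/inet.h>
#include <netinet/in.h>

/* Sketch: translate the destination of a guest-originated IPv4 flow.
 * @guest_dst:		destination address as written by the guest
 * @nat_host_loopback:	guest-visible address mapped to host loopback,
 *			0.0.0.0 if the mapping is disabled (--no-map-gw)
 */
static struct in_addr nat_outbound_dst4(struct in_addr guest_dst,
					struct in_addr nat_host_loopback)
{
	struct in_addr lo = { .s_addr = htonl(INADDR_LOOPBACK) };

	/* Explicit guard added in this sketch: never match when disabled */
	if (nat_host_loopback.s_addr != htonl(INADDR_ANY) &&
	    guest_dst.s_addr == nat_host_loopback.s_addr)
		return lo;

	/* Everything else goes out unchanged */
	return guest_dst;
}
```

Encoding "disabled" as the unspecified address is what lets the callers drop the separate c->no_map_gw flag.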

Stefano Brivio

20 Aug

9:56 p.m.

New subject: [PATCH 19/22] conf, fwd: Split notion of gateway/router from guest-visible host address

On Fri, 16 Aug 2024 15:40:00 +1000 David Gibson wrote:

...

The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default gateway. We use this for two quite distinct things: advertising the gateway that the guest should use (via DHCP, NDP and/or --config-net) and for a limited form of NAT. So that the guest can access services on the host, we map the gateway address within the guest to the loopback address on the host.

The gateway address isn't necessarily the best choice for this purpose, and certainly not in all circumstances. So, start off by splitting these into two different values: @guest_gw, which is the gateway address the guest should use, and @nat_host_loopback, which is the guest-visible address to remap to the host's loopback.

Usually nat_host_loopback will have the same value as guest_gw. However, when --no-map-gw is specified we leave nat_host_loopback unspecified instead. This means that wherever we use nat_host_loopback, we don't need to separately check c->no_map_gw to see if it's relevant.

Signed-off-by: David Gibson
---
 conf.c  | 60 +++++++++++++++++++++++++++++----------------------------
 dhcp.c  | 10 ++++++----
 fwd.c   | 4 ++--
 passt.h | 16 +++++++++------
 pasta.c | 6 ++++--
 5 files changed, 53 insertions(+), 43 deletions(-)

diff --git a/conf.c b/conf.c
index b1c58d5b..26373584 100644
--- a/conf.c
+++ b/conf.c
@@ -410,12 +410,12 @@ static void add_dns_resolv(struct ctx *c, const char *nameserver,
 	 * redirect
 	 */
 	if (IN4_IS_ADDR_LOOPBACK(&ns4)) {
-		if (c->no_map_gw)
+		if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback))

If you change the command-line option name to use "map", it would be good to also change these names.

--
Stefano
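A minimal sketch of the idea behind the replaced check, not the actual passt code: once the mapping address itself can be unspecified, an all-zero value doubles as the "mapping disabled" flag, so callers test the address instead of c->no_map_gw. The struct ip4_sketch type and the helper names are hypothetical, and treating a loopback resolver as "advertise the mapped address to the guest" is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical, cut-down stand-in for the IPv4 part of the context */
struct ip4_sketch {
	struct in_addr guest_gw;		/* gateway advertised to the guest */
	struct in_addr nat_host_loopback;	/* guest-visible address mapped to
						 * the host's 127.0.0.1 */
};

/* True if @a is 0.0.0.0, used here as the "no mapping" sentinel */
static bool in4_unspecified(const struct in_addr *a)
{
	return a->s_addr == htonl(INADDR_ANY);
}

/* Default as described in the commit message: mirror the gateway, unless
 * mapping was disabled, in which case leave the field unspecified.
 */
static void nat_default(struct ip4_sketch *ip4, bool no_map_gw)
{
	assert(ip4 != NULL);

	if (no_map_gw)
		ip4->nat_host_loopback.s_addr = htonl(INADDR_ANY);
	else
		ip4->nat_host_loopback = ip4->guest_gw;
}

/* For a host resolver on loopback: report the address to hand to the guest,
 * or false if nothing is mapped (assumed behaviour, see the note above).
 */
static bool dns_for_loopback(const struct ip4_sketch *ip4, struct in_addr *out)
{
	if (in4_unspecified(&ip4->nat_host_loopback))
		return false;

	*out = ip4->nat_host_loopback;
	return true;
}
```

With this shape, the flag only matters while options are parsed; later code needs just the one field.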

David Gibson

21 Aug

3:59 a.m.

New subject: [PATCH 19/22] conf, fwd: Split notion of gateway/router from guest-visible host address

On Tue, Aug 20, 2024 at 09:56:31PM +0200, Stefano Brivio wrote:

...

On Fri, 16 Aug 2024 15:40:00 +1000 David Gibson wrote:

...
The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default gateway. We use this for two quite distinct things: advertising the gateway that the guest should use (via DHCP, NDP and/or --config-net) and for a limited form of NAT. So that the guest can access services on the host, we map the gateway address within the guest to the loopback address on the host.

The gateway address isn't necessarily the best choice for this purpose, and certainly not in all circumstances. So, start off by splitting these into two different values: @guest_gw, which is the gateway address the guest should use, and @nat_host_loopback, which is the guest-visible address to remap to the host's loopback.

Usually nat_host_loopback will have the same value as guest_gw. However, when --no-map-gw is specified we leave nat_host_loopback unspecified instead. This means that wherever we use nat_host_loopback, we don't need to separately check c->no_map_gw to see if it's relevant.

Signed-off-by: David Gibson
---
 conf.c  | 60 +++++++++++++++++++++++++++++----------------------------
 dhcp.c  | 10 ++++++----
 fwd.c   | 4 ++--
 passt.h | 16 +++++++++------
 pasta.c | 6 ++++--
 5 files changed, 53 insertions(+), 43 deletions(-)

diff --git a/conf.c b/conf.c
index b1c58d5b..26373584 100644
--- a/conf.c
+++ b/conf.c
@@ -410,12 +410,12 @@ static void add_dns_resolv(struct ctx *c, const char *nameserver,
 	 * redirect
 	 */
 	if (IN4_IS_ADDR_LOOPBACK(&ns4)) {
-		if (c->no_map_gw)
+		if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback))

If you change the command-line option name to use "map", it would be good to also change these names.

Will do.

--
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
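For readers following the thread, the split described in the quoted commit message can be pictured roughly as below. This is an illustrative sketch under stated assumptions, not the forwarding code from the series: struct ip6_sketch and nat_outbound6() are hypothetical names, and only the destination rewrite is shown (the real translation also deals with source addresses and flow state).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <netinet/in.h>

/* Hypothetical, cut-down stand-in for the IPv6 part of the context */
struct ip6_sketch {
	struct in6_addr guest_gw;		/* what DHCPv6/NDP would advertise */
	struct in6_addr nat_host_loopback;	/* guest-visible alias for the
						 * host's ::1, or :: for none */
};

/* Rewrite the destination of a guest-originated flow if it targets the
 * mapped address. Returns true if @dst was translated to ::1.
 */
static bool nat_outbound6(const struct ip6_sketch *ip6, struct in6_addr *dst)
{
	assert(ip6 && dst);

	/* Unspecified means no mapping: no separate flag to consult */
	if (IN6_IS_ADDR_UNSPECIFIED(&ip6->nat_host_loopback))
		return false;

	if (memcmp(dst, &ip6->nat_host_loopback, sizeof(*dst)))
		return false;

	*dst = in6addr_loopback;
	return true;
}
```

Because guest_gw is never consulted here, the gateway can stay directly addressable by the guest whenever nat_host_loopback is set to some other address.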

David Gibson

16 Aug

7:40 a.m.

New subject: [PATCH 20/22] conf: Allow address remapped to host to be configured

Because the host and guest share the same IP address with passt/pasta, it's not possible for the guest to directly address the host. Therefore we allow packets from the guest going to a special "NAT to host" address to be redirected to the host, appearing there as though they have both source and destination address of loopback.

Currently that special address is always the address of the default gateway (or none). That can be a problem if we want that gateway to be addressable by the guest. Therefore, allow the special "NAT to host" address to be overridden on the command line with a new --nat-host-loopback option.

In order to exercise and test it, update the passt_in_ns and perf tests to use this option and give different mapping addresses for the two layers of the environment.

Signed-off-by: David Gibson
---
 conf.c                | 57 +++++++++++++++++++++++++++++++--
 passt.1               | 16 ++++++++++
 test/lib/setup        | 11 +++++--
 test/passt_in_ns/dhcp | 73 +++++++++++++++++++++++++++++++++++++++++++
 test/passt_in_ns/tcp  | 38 +++++++++++-----------
 test/passt_in_ns/udp  | 22 +++++++------
 test/perf/passt_tcp   | 33 +++++++++----------
 test/perf/passt_udp   | 31 +++++++++---------
 test/perf/pasta_tcp   | 29 ++++++++---------
 test/perf/pasta_udp   | 25 ++++++++-------
 test/run              | 4 +--
 11 files changed, 244 insertions(+), 95 deletions(-)
 create mode 100644 test/passt_in_ns/dhcp

diff --git a/conf.c b/conf.c
index 26373584..c5831e82 100644
--- a/conf.c
+++ b/conf.c
@@ -817,6 +817,14 @@ static void usage(const char *name, FILE *f, int status)
 	fprintf(f, " --no-dhcp-search No list in DHCP/DHCPv6/NDP\n");
 
 	fprintf(f,
+		" --nat-host-loopback ADDR NAT ADDR to refer to host\n"
+		" Packets from the guest to ADDR will be redirected to the\n"
+		" host. 
On the host such packets will appear to have both\n" + " source and destination of loopback (127.0.0.1 or ::1).\n" + " ADDR can be 'none', in which case nothing is mapped\n" + " Can be specified zero to two (for IPv4 and IPv6)\n" + " default: gateway address, or none if --no-map-gw is also\n" + " specified\n" " --dns-forward ADDR Forward DNS queries sent to ADDR\n" " can be specified zero to two times (for IPv4 and IPv6)\n" " default: don't forward DNS queries\n" @@ -959,6 +967,11 @@ static void conf_print(const struct ctx *c) info(" host: %s", eth_ntop(c->our_tap_mac, bufmac, sizeof(bufmac))); if (c->ifi4) { + if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback)) + info(" NAT to host 127.0.0.1: %s", + inet_ntop(AF_INET, &c->ip4.nat_host_loopback, + buf4, sizeof(buf4))); + if (!c->no_dhcp) { uint32_t mask; @@ -989,6 +1002,11 @@ static void conf_print(const struct ctx *c) } if (c->ifi6) { + if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback)) + info(" NAT to host ::1: %s", + inet_ntop(AF_INET6, &c->ip6.nat_host_loopback, + buf6, sizeof(buf6))); + if (!c->no_ndp && !c->no_dhcpv6) info("NDP/DHCPv6:"); else if (!c->no_ndp) @@ -1122,6 +1140,35 @@ static void conf_ugid(char *runas, uid_t *uid, gid_t *gid) } } +/** + * conf_nat() - Parse --nat-host-loopback option + * @c: Execution context + * @arg: String argument to --nat-host-loopback + * @no_map_gw: --no-map-gw flag, updated for "none" argument + */ +static void conf_nat(struct ctx *c, const char *arg, int *no_map_gw) +{ + if (strcmp(arg, "none") == 0) { + c->ip4.nat_host_loopback = in4addr_any; + c->ip6.nat_host_loopback = in6addr_any; + *no_map_gw = 1; + } + + if (inet_pton(AF_INET6, arg, &c->ip6.nat_host_loopback) && + !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) && + !IN6_IS_ADDR_LOOPBACK(&c->ip6.nat_host_loopback) && + !IN6_IS_ADDR_MULTICAST(&c->ip6.nat_host_loopback)) + return; + + if (inet_pton(AF_INET, arg, &c->ip4.nat_host_loopback) && + !IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback) && + 
!IN4_IS_ADDR_LOOPBACK(&c->ip4.nat_host_loopback) && + !IN4_IS_ADDR_MULTICAST(&c->ip4.nat_host_loopback)) + return; + + die("Invalid address to remap to host: %s", optarg); +} + /** * conf_open_files() - Open files as requested by configuration * @c: Execution context @@ -1231,6 +1278,7 @@ void conf(struct ctx *c, int argc, char **argv) {"no-copy-routes", no_argument, NULL, 18 }, {"no-copy-addrs", no_argument, NULL, 19 }, {"netns-only", no_argument, NULL, 20 }, + {"nat-host-loopback", required_argument, NULL, 21 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1400,6 +1448,9 @@ void conf(struct ctx *c, int argc, char **argv) netns_only = 1; *userns = 0; break; + case 21: + conf_nat(c, optarg, &no_map_gw); + break; case 'd': c->debug = 1; c->quiet = 0; @@ -1639,10 +1690,12 @@ void conf(struct ctx *c, int argc, char **argv) (*c->ip6.ifname_out && !c->ifi6)) die("External interface not usable"); - if (c->ifi4 && !no_map_gw) + if (c->ifi4 && !no_map_gw && + IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback)) c->ip4.nat_host_loopback = c->ip4.guest_gw; - if (c->ifi6 && !no_map_gw) + if (c->ifi6 && !no_map_gw && + IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback)) c->ip6.nat_host_loopback = c->ip6.guest_gw; if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr)) diff --git a/passt.1 b/passt.1 index dca433b6..3680056a 100644 --- a/passt.1 +++ b/passt.1 @@ -327,6 +327,22 @@ namespace will be silently dropped. Disable Router Advertisements. Router Solicitations coming from guest or target namespace will be ignored. +.TP +.BR \-\-nat-host-loopback " " \fIaddr +Translate \fIaddr\fR to refer to the host. Packets from the guest to +\fIaddr\fR will be redirected to the host. On the host such packets +will appear to have both source and destination of loopback (127.0.0.1 +or ::1). + +If \fIaddr\fR is 'none', no address is mapped (this implies +\fB--no-map-gw\fR). 
Only one IPv4 and one IPv6 address can be +translated, if the option is specified multiple times, the last one +takes effect. + +Default is to translate the guest's default gateway address, unless +\fB--no-map-gw\fR is also given, in which case no address is mapped by +default. + .TP .BR \-\-no-map-gw Don't remap TCP connections and untracked UDP traffic, with the gateway address diff --git a/test/lib/setup b/test/lib/setup index 9b39b9fe..061bf997 100755 --- a/test/lib/setup +++ b/test/lib/setup @@ -124,7 +124,12 @@ setup_passt_in_ns() { [ ${DEBUG} -eq 1 ] && __opts="${__opts} -d" [ ${TRACE} -eq 1 ] && __opts="${__opts} --trace" - context_run_bg pasta "./pasta ${__opts} -t 10001,10002,10011,10012 -T 10003,10013 -u 10001,10002,10011,10012 -U 10003,10013 -P ${STATESETUP}/pasta.pid --config-net ${NSTOOL} hold ${STATESETUP}/ns.hold" + __nat_host4=192.0.2.1 + __nat_host6=2001:db8:9a55::1 + __nat_ns4=192.0.2.2 + __nat_ns6=2001:db8:9a55::2 + + context_run_bg pasta "./pasta ${__opts} -t 10001,10002,10011,10012 -T 10003,10013 -u 10001,10002,10011,10012 -U 10003,10013 -P ${STATESETUP}/pasta.pid --nat-host-loopback ${__nat_host4} --nat-host-loopback ${__nat_host6} --config-net ${NSTOOL} hold ${STATESETUP}/ns.hold" wait_for [ -f "${STATESETUP}/pasta.pid" ] context_setup_nstool qemu ${STATESETUP}/ns.hold @@ -139,11 +144,11 @@ setup_passt_in_ns() { if [ ${VALGRIND} -eq 1 ]; then context_run passt "make clean" context_run passt "make valgrind" - context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid" + context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 
10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --nat-host-loopback ${__nat_ns4} --nat-host-loopback ${__nat_ns6}" else context_run passt "make clean" context_run passt "make" - context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid" + context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --nat-host-loopback ${__nat_ns4} --nat-host-loopback ${__nat_ns6}" fi wait_for [ -f "${STATESETUP}/passt.pid" ] diff --git a/test/passt_in_ns/dhcp b/test/passt_in_ns/dhcp new file mode 100644 index 00000000..48c7d197 --- /dev/null +++ b/test/passt_in_ns/dhcp @@ -0,0 +1,73 @@ +# SPDX-License-Identifier: GPL-2.0-or-later +# +# PASST - Plug A Simple Socket Transport +# for qemu/UNIX domain socket mode +# +# PASTA - Pack A Subtle Tap Abstraction +# for network namespace/tap device mode +# +# test/passt/dhcp - Check DHCP and DHCPv6 functionality in passt mode +# +# Copyright (c) 2021 Red Hat GmbH +# Author: Stefano Brivio + +gtools ip jq dhclient sed tr +htools ip jq sed tr head + +set NAT_NS4 192.0.2.2 +set NAT_NS6 2001:db8:9a55::2 + +test Interface name +gout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' +hout HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]' +hout HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]' +check [ -n "__IFNAME__" ] + +test DHCP: address +guest /sbin/dhclient -4 __IFNAME__ +gout ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[0].local' +hout HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local' +check [ "__ADDR__" = "__HOST_ADDR__" ] + +test DHCP: route +gout GW ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' +hout HOST_GW ip -j -4 route show|jq -rM '[.[] | 
select(.dst == "default").gateway] | .[0]'
+check [ "__GW__" = "__HOST_GW__" ]
+
+test DHCP: MTU
+gout MTU ip -j link show | jq -rM '.[] | select(.ifname == "__IFNAME__").mtu'
+check [ __MTU__ = 65520 ]
+
+test DHCP: DNS
+gout DNS sed -n 's/^nameserver \([0-9]*\.\)\(.*\)/\1\2/p' /etc/resolv.conf | tr '\n' ',' | sed 's/,$//;s/$/\n/'
+hout HOST_DNS sed -n 's/^nameserver \([0-9]*\.\)\(.*\)/\1\2/p' /etc/resolv.conf | head -n3 | tr '\n' ',' | sed 's/,$//;s/$/\n/'
+check [ "__DNS__" = "__HOST_DNS__" ] || ( [ "__DNS__" = "__NAT_NS4__" ] && expr "__HOST_DNS__" : "127[.]" )
+
+# FQDNs should be terminated by dots, but the guest DHCP client might omit them:
+# strip them first
+test DHCP: search list
+gout SEARCH sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search \(.*\)/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/'
+hout HOST_SEARCH sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search \(.*\)/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/'
+check [ "__SEARCH__" = "__HOST_SEARCH__" ]
+
+test DHCPv6: address
+guest /sbin/dhclient -6 __IFNAME__
+gout ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
+hout HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
+check [ "__ADDR6__" = "__HOST_ADDR6__" ]
+
+test DHCPv6: route
+gout GW6 ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway'
+hout HOST_GW6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").gateway] | .[0]'
+check [ "__GW6__" = "__HOST_GW6__" ]
+
+# Strip interface specifier: interface names might differ between host and guest
+test DHCPv6: DNS
+gout DNS6 sed -n 's/^nameserver \([^:]*:\)\([^%]*\).*/\1\2/p' /etc/resolv.conf | tr '\n' ',' | sed 's/,$//;s/$/\n/'
+hout HOST_DNS6 sed -n 's/^nameserver \([^:]*:\)\([^%]*\).*/\1\2/p' /etc/resolv.conf | tr '\n' ',' | sed 's/,$//;s/$/\n/'
+check [ "__DNS6__" = 
"__HOST_DNS6__" ] || [ "__DNS6__" = "__NAT_NS6__" -a "__HOST_DNS6__" = "::1" ] + +test DHCPv6: search list +gout SEARCH6 sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search $.*$/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/' +hout HOST_SEARCH6 sed 's/\. / /g' /etc/resolv.conf | sed 's/\.$//g' | sed -n 's/^search $.*$/\1/p' | tr ' \n' ',' | sed 's/,$//;s/$/\n/' +check [ "__SEARCH6__" = "__HOST_SEARCH6__" ] diff --git a/test/passt_in_ns/tcp b/test/passt_in_ns/tcp index cdb7060c..919333ca 100644 --- a/test/passt_in_ns/tcp +++ b/test/passt_in_ns/tcp @@ -15,6 +15,11 @@ gtools socat ip jq htools socat ip jq nstools socat ip jq +set NAT_HOST4 192.0.2.1 +set NAT_HOST6 2001:db8:9a55::1 +set NAT_NS4 192.0.2.2 +set NAT_NS6 2001:db8:9a55::2 + set TEMP_BIG __STATEDIR__/test_big.bin set TEMP_SMALL __STATEDIR__/test_small.bin set TEMP_NS_BIG __STATEDIR__/test_ns_big.bin @@ -36,16 +41,15 @@ check cmp __TEMP_NS_BIG__ __BASEPATH__/big.bin test TCP/IPv4: guest to host: big transfer hostb socat -u TCP4-LISTEN:10003 OPEN:__TEMP_BIG__,create,trunc -gout GW ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' sleep 1 -guest socat -u OPEN:/root/big.bin TCP4:__GW__:10003 +guest socat -u OPEN:/root/big.bin TCP4:__NAT_HOST4__:10003 hostw check cmp __TEMP_BIG__ __BASEPATH__/big.bin test TCP/IPv4: guest to ns: big transfer nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc sleep 1 -guest socat -u OPEN:/root/big.bin TCP4:__GW__:10002 +guest socat -u OPEN:/root/big.bin TCP4:__NAT_NS4__:10002 nsw check cmp __TEMP_NS_BIG__ __BASEPATH__/big.bin @@ -59,7 +63,7 @@ check cmp __TEMP_BIG__ __BASEPATH__/big.bin test TCP/IPv4: ns to host (via tap): big transfer hostb socat -u TCP4-LISTEN:10003 OPEN:__TEMP_BIG__,create,trunc sleep 1 -ns socat -u OPEN:__BASEPATH__/big.bin TCP4:__GW__:10003 +ns socat -u OPEN:__BASEPATH__/big.bin TCP4:__NAT_HOST4__:10003 hostw check cmp __TEMP_BIG__ __BASEPATH__/big.bin @@ -95,16 +99,15 @@ check cmp __TEMP_NS_SMALL__ 
__BASEPATH__/small.bin test TCP/IPv4: guest to host: small transfer hostb socat -u TCP4-LISTEN:10003 OPEN:__TEMP_SMALL__,create,trunc -gout GW ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' sleep 1 -guest socat -u OPEN:/root/small.bin TCP4:__GW__:10003 +guest socat -u OPEN:/root/small.bin TCP4:__NAT_HOST4__:10003 hostw check cmp __TEMP_SMALL__ __BASEPATH__/small.bin test TCP/IPv4: guest to ns: small transfer nsb socat -u TCP4-LISTEN:10002 OPEN:__TEMP_NS_SMALL__,create,trunc sleep 1 -guest socat -u OPEN:/root/small.bin TCP4:__GW__:10002 +guest socat -u OPEN:/root/small.bin TCP4:__NAT_NS4__:10002 nsw check cmp __TEMP_NS_SMALL__ __BASEPATH__/small.bin @@ -118,7 +121,7 @@ check cmp __TEMP_SMALL__ __BASEPATH__/small.bin test TCP/IPv4: ns to host (via tap): small transfer hostb socat -u TCP4-LISTEN:10003 OPEN:__TEMP_SMALL__,create,trunc sleep 1 -ns socat -u OPEN:__BASEPATH__/small.bin TCP4:__GW__:10003 +ns socat -u OPEN:__BASEPATH__/small.bin TCP4:__NAT_HOST4__:10003 hostw check cmp __TEMP_SMALL__ __BASEPATH__/small.bin @@ -152,17 +155,15 @@ check cmp __TEMP_NS_BIG__ __BASEPATH__/big.bin test TCP/IPv6: guest to host: big transfer hostb socat -u TCP6-LISTEN:10003 OPEN:__TEMP_BIG__,create,trunc -gout GW6 ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' -gout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' sleep 1 -guest socat -u OPEN:/root/big.bin TCP6:[__GW6__%__IFNAME__]:10003 +guest socat -u OPEN:/root/big.bin TCP6:[__NAT_HOST6__]:10003 hostw check cmp __TEMP_BIG__ __BASEPATH__/big.bin test TCP/IPv6: guest to ns: big transfer nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_BIG__,create,trunc sleep 1 -guest socat -u OPEN:/root/big.bin TCP6:[__GW6__%__IFNAME__]:10002 +guest socat -u OPEN:/root/big.bin TCP6:[__NAT_NS6__]:10002 nsw check cmp __TEMP_NS_BIG__ __BASEPATH__/big.bin @@ -175,9 +176,8 @@ check cmp __TEMP_BIG__ __BASEPATH__/big.bin test TCP/IPv6: ns to host (via tap): big transfer hostb socat -u 
TCP6-LISTEN:10003 OPEN:__TEMP_BIG__,create,trunc -nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' sleep 1 -ns socat -u OPEN:__BASEPATH__/big.bin TCP6:[__GW6__%__IFNAME__]:10003 +ns socat -u OPEN:__BASEPATH__/big.bin TCP6:[__NAT_HOST6__]:10003 hostw check cmp __TEMP_BIG__ __BASEPATH__/big.bin @@ -190,6 +190,7 @@ guest cmp test_big.bin /root/big.bin test TCP/IPv6: ns to guest (using namespace address): big transfer guestb socat -u TCP6-LISTEN:10001 OPEN:test_big.bin,create,trunc +nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' nsout ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[0].local' sleep 1 ns socat -u OPEN:__BASEPATH__/big.bin TCP6:[__ADDR6__]:10001 @@ -212,17 +213,15 @@ check cmp __TEMP_NS_SMALL__ __BASEPATH__/small.bin test TCP/IPv6: guest to host: small transfer hostb socat -u TCP6-LISTEN:10003 OPEN:__TEMP_SMALL__,create,trunc -gout GW6 ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' -gout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' sleep 1 -guest socat -u OPEN:/root/small.bin TCP6:[__GW6__%__IFNAME__]:10003 +guest socat -u OPEN:/root/small.bin TCP6:[__NAT_HOST6__]:10003 hostw check cmp __TEMP_SMALL__ __BASEPATH__/small.bin test TCP/IPv6: guest to ns: small transfer nsb socat -u TCP6-LISTEN:10002 OPEN:__TEMP_NS_SMALL__ sleep 1 -guest socat -u OPEN:/root/small.bin TCP6:[__GW6__%__IFNAME__]:10002 +guest socat -u OPEN:/root/small.bin TCP6:[__NAT_NS6__]:10002 nsw check cmp __TEMP_NS_SMALL__ __BASEPATH__/small.bin @@ -235,9 +234,8 @@ check cmp __TEMP_SMALL__ __BASEPATH__/small.bin test TCP/IPv6: ns to host (via tap): small transfer hostb socat -u TCP6-LISTEN:10003 OPEN:__TEMP_SMALL__,create,trunc -nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' sleep 1 -ns socat -u OPEN:__BASEPATH__/small.bin TCP6:[__GW6__%__IFNAME__]:10003 +ns socat -u OPEN:__BASEPATH__/small.bin 
TCP6:[__NAT_HOST6__]:10003 hostw check cmp __TEMP_SMALL__ __BASEPATH__/small.bin diff --git a/test/passt_in_ns/udp b/test/passt_in_ns/udp index 8a025131..0e3574f5 100644 --- a/test/passt_in_ns/udp +++ b/test/passt_in_ns/udp @@ -15,6 +15,11 @@ gtools socat ip jq nstools socat ip jq htools socat ip jq +set NAT_HOST4 192.0.2.1 +set NAT_HOST6 2001:db8:9a55::1 +set NAT_NS4 192.0.2.2 +set NAT_NS6 2001:db8:9a55::2 + set TEMP __STATEDIR__/test.bin set TEMP_NS __STATEDIR__/test_ns.bin @@ -34,16 +39,15 @@ check cmp __TEMP_NS__ __BASEPATH__/medium.bin test UDP/IPv4: guest to host hostb socat -u UDP4-LISTEN:10003,null-eof OPEN:__TEMP__,create,trunc -gout GW ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' sleep 1 -guest socat -u OPEN:/root/medium.bin UDP4:__GW__:10003,shut-null +guest socat -u OPEN:/root/medium.bin UDP4:__NAT_HOST4__:10003,shut-null hostw check cmp __TEMP__ __BASEPATH__/medium.bin test UDP/IPv4: guest to ns nsb socat -u UDP4-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc sleep 1 -guest socat -u OPEN:/root/medium.bin UDP4:__GW__:10002,shut-null +guest socat -u OPEN:/root/medium.bin UDP4:__NAT_NS4__:10002,shut-null nsw check cmp __TEMP_NS__ __BASEPATH__/medium.bin @@ -57,7 +61,7 @@ check cmp __TEMP__ __BASEPATH__/medium.bin test UDP/IPv4: ns to host (via tap) hostb socat -u UDP4-LISTEN:10003,null-eof OPEN:__TEMP__,create,trunc sleep 1 -ns socat -u OPEN:__BASEPATH__/medium.bin UDP4:__GW__:10003,shut-null +ns socat -u OPEN:__BASEPATH__/medium.bin UDP4:__NAT_HOST4__:10003,shut-null hostw check cmp __TEMP__ __BASEPATH__/medium.bin @@ -93,17 +97,15 @@ check cmp __TEMP_NS__ __BASEPATH__/medium.bin test UDP/IPv6: guest to host hostb socat -u UDP6-LISTEN:10003,null-eof OPEN:__TEMP__,create,trunc -gout GW6 ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' -gout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' sleep 1 -guest socat -u OPEN:/root/medium.bin UDP6:[__GW6__%__IFNAME__]:10003,shut-null 
+guest socat -u OPEN:/root/medium.bin UDP6:[__NAT_HOST6__]:10003,shut-null hostw check cmp __TEMP__ __BASEPATH__/medium.bin test UDP/IPv6: guest to ns nsb socat -u UDP6-LISTEN:10002,null-eof OPEN:__TEMP_NS__,create,trunc sleep 1 -guest socat -u OPEN:/root/medium.bin UDP6:[__GW6__%__IFNAME__]:10002,shut-null +guest socat -u OPEN:/root/medium.bin UDP6:[__NAT_NS6__]:10002,shut-null nsw check cmp __TEMP_NS__ __BASEPATH__/medium.bin @@ -116,9 +118,8 @@ check cmp __TEMP__ __BASEPATH__/medium.bin test UDP/IPv6: ns to host (via tap) hostb socat -u UDP6-LISTEN:10003,null-eof OPEN:__TEMP__,create,trunc -nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' sleep 1 -ns socat -u OPEN:__BASEPATH__/medium.bin UDP6:[__GW6__%__IFNAME__]:10003,shut-null +ns socat -u OPEN:__BASEPATH__/medium.bin UDP6:[__NAT_HOST6__]:10003,shut-null hostw check cmp __TEMP__ __BASEPATH__/medium.bin @@ -131,6 +132,7 @@ guest cmp test.bin /root/medium.bin test UDP/IPv6: ns to guest (using namespace address) guestb socat -u UDP6-LISTEN:10001,null-eof OPEN:test.bin,create,trunc +nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' nsout ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[0].local' sleep 1 ns socat -u OPEN:__BASEPATH__/medium.bin UDP6:[__ADDR6__]:10001,shut-null diff --git a/test/perf/passt_tcp b/test/perf/passt_tcp index 695479f3..ae03c7df 100644 --- a/test/perf/passt_tcp +++ b/test/perf/passt_tcp @@ -15,6 +15,9 @@ gtools /sbin/sysctl ip jq nproc seq sleep iperf3 tcp_rr tcp_crr # From neper nstools /sbin/sysctl ip jq nproc seq sleep iperf3 tcp_rr tcp_crr htools bc head sed seq +set NAT_NS4 192.0.2.2 +set NAT_NS6 2001:db8:9a55::2 + test passt: throughput and latency guest /sbin/sysctl -w net.core.rmem_max=536870912 @@ -29,8 +32,6 @@ ns /sbin/sysctl -w net.ipv4.tcp_rmem="4096 524288 134217728" ns /sbin/sysctl -w net.ipv4.tcp_wmem="4096 524288 134217728" ns /sbin/sysctl -w net.ipv4.tcp_timestamps=0 -gout 
GW ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' -gout GW6 ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' gout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' hout FREQ_PROCFS (echo "scale=1"; sed -n 's/cpu MHz.*: $[0-9]*$\..*$/(\1+10^2\/2)\/10^3/p' /proc/cpuinfo) | bc -l | head -n1 @@ -54,16 +55,16 @@ iperf3s ns 10002 bw - bw - guest ip link set dev __IFNAME__ mtu 1280 -iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -w 4M +iperf3 BW guest __NAT_NS6__ 10002 __TIME__ __OPTS__ -w 4M bw __BW__ 1.2 1.5 guest ip link set dev __IFNAME__ mtu 1500 -iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -w 4M +iperf3 BW guest __NAT_NS6__ 10002 __TIME__ __OPTS__ -w 4M bw __BW__ 1.6 1.8 guest ip link set dev __IFNAME__ mtu 9000 -iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -w 8M +iperf3 BW guest __NAT_NS6__ 10002 __TIME__ __OPTS__ -w 8M bw __BW__ 4.0 5.0 guest ip link set dev __IFNAME__ mtu 65520 -iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -w 16M +iperf3 BW guest __NAT_NS6__ 10002 __TIME__ __OPTS__ -w 16M bw __BW__ 7.0 8.0 iperf3k ns @@ -75,7 +76,7 @@ lat - lat - lat - nsb tcp_rr --nolog -6 -gout LAT tcp_rr --nolog -l1 -6 -c -H __GW6__%__IFNAME__ | sed -n 's/^throughput=$.*$/\1/p' +gout LAT tcp_rr --nolog -l1 -6 -c -H __NAT_NS6__ | sed -n 's/^throughput=$.*$/\1/p' lat __LAT__ 200 150 tl TCP CRR latency over IPv6: guest to host @@ -85,29 +86,29 @@ lat - lat - lat - nsb tcp_crr --nolog -6 -gout LAT tcp_crr --nolog -l1 -6 -c -H __GW6__%__IFNAME__ | sed -n 's/^throughput=$.*$/\1/p' +gout LAT tcp_crr --nolog -l1 -6 -c -H __NAT_NS6__ | sed -n 's/^throughput=$.*$/\1/p' lat __LAT__ 500 400 tr TCP throughput over IPv4: guest to host iperf3s ns 10002 guest ip link set dev __IFNAME__ mtu 256 -iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -w 1M +iperf3 BW guest __NAT_NS4__ 10002 __TIME__ __OPTS__ -w 1M bw __BW__ 0.2 0.3 guest ip link set dev __IFNAME__ mtu 576 -iperf3 
BW guest __GW__ 10002 __TIME__ __OPTS__ -w 1M +iperf3 BW guest __NAT_NS4__ 10002 __TIME__ __OPTS__ -w 1M bw __BW__ 0.5 0.8 guest ip link set dev __IFNAME__ mtu 1280 -iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -w 4M +iperf3 BW guest __NAT_NS4__ 10002 __TIME__ __OPTS__ -w 4M bw __BW__ 1.2 1.5 guest ip link set dev __IFNAME__ mtu 1500 -iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -w 4M +iperf3 BW guest __NAT_NS4__ 10002 __TIME__ __OPTS__ -w 4M bw __BW__ 1.6 1.8 guest ip link set dev __IFNAME__ mtu 9000 -iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -w 8M +iperf3 BW guest __NAT_NS4__ 10002 __TIME__ __OPTS__ -w 8M bw __BW__ 4.0 5.0 guest ip link set dev __IFNAME__ mtu 65520 -iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -w 16M +iperf3 BW guest __NAT_NS4__ 10002 __TIME__ __OPTS__ -w 16M bw __BW__ 7.0 8.0 iperf3k ns @@ -119,7 +120,7 @@ lat - lat - lat - nsb tcp_rr --nolog -4 -gout LAT tcp_rr --nolog -l1 -4 -c -H __GW__ | sed -n 's/^throughput=$.*$/\1/p' +gout LAT tcp_rr --nolog -l1 -4 -c -H __NAT_NS4__ | sed -n 's/^throughput=$.*$/\1/p' lat __LAT__ 200 150 tl TCP CRR latency over IPv4: guest to host @@ -129,7 +130,7 @@ lat - lat - lat - nsb tcp_crr --nolog -4 -gout LAT tcp_crr --nolog -l1 -4 -c -H __GW__ | sed -n 's/^throughput=$.*$/\1/p' +gout LAT tcp_crr --nolog -l1 -4 -c -H __NAT_NS4__ | sed -n 's/^throughput=$.*$/\1/p' lat __LAT__ 500 400 tr TCP throughput over IPv6: host to guest diff --git a/test/perf/passt_udp b/test/perf/passt_udp index f25c9033..2160797c 100644 --- a/test/perf/passt_udp +++ b/test/perf/passt_udp @@ -15,6 +15,9 @@ gtools /sbin/sysctl ip jq nproc sleep iperf3 udp_rr # From neper nstools ip jq sleep iperf3 udp_rr htools bc head sed +set NAT_NS4 192.0.2.2 +set NAT_NS6 2001:db8:9a55::2 + test passt: throughput and latency guest /sbin/sysctl -w net.core.rmem_max=16777216 @@ -22,10 +25,6 @@ guest /sbin/sysctl -w net.core.wmem_max=16777216 guest /sbin/sysctl -w net.core.rmem_default=16777216 guest /sbin/sysctl -w net.core.wmem_default=16777216 
-gout GW ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' -gout GW6 ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' -gout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' - hout FREQ_PROCFS (echo "scale=1"; sed -n 's/cpu MHz.*: $[0-9]*$\..*$/(\1+10^2\/2)\/10^3/p' /proc/cpuinfo) | bc -l | head -n1 hout FREQ_CPUFREQ (echo "scale=1"; printf '( %i + 10^5 / 2 ) / 10^6\n' $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq) ) | bc -l hout FREQ [ -n "__FREQ_CPUFREQ__" ] && echo __FREQ_CPUFREQ__ || echo __FREQ_PROCFS__ @@ -46,13 +45,13 @@ iperf3s ns 10002 bw - bw - -iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -b 3G -l 1232 +iperf3 BW guest __NAT_NS6__ 10002 __TIME__ __OPTS__ -b 3G -l 1232 bw __BW__ 0.8 1.2 -iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -b 4G -l 1452 +iperf3 BW guest __NAT_NS6__ 10002 __TIME__ __OPTS__ -b 4G -l 1452 bw __BW__ 1.0 1.5 -iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -b 8G -l 8952 +iperf3 BW guest __NAT_NS6__ 10002 __TIME__ __OPTS__ -b 8G -l 8952 bw __BW__ 4.0 5.0 -iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -b 15G -l 64372 +iperf3 BW guest __NAT_NS6__ 10002 __TIME__ __OPTS__ -b 15G -l 64372 bw __BW__ 4.0 5.0 iperf3k ns @@ -64,7 +63,7 @@ lat - lat - lat - nsb udp_rr --nolog -6 -gout LAT udp_rr --nolog -6 -c -H __GW6__%__IFNAME__ | sed -n 's/^throughput=$.*$/\1/p' +gout LAT udp_rr --nolog -6 -c -H __NAT_NS6__ | sed -n 's/^throughput=$.*$/\1/p' lat __LAT__ 200 150 @@ -72,17 +71,17 @@ tr UDP throughput over IPv4: guest to host iperf3s ns 10002 # (datagram size) = (packet size) - 28: 20 bytes of IPv4 header, 8 of UDP header -iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -b 1G -l 228 +iperf3 BW guest __NAT_NS4__ 10002 __TIME__ __OPTS__ -b 1G -l 228 bw __BW__ 0.0 0.0 -iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -b 2G -l 548 +iperf3 BW guest __NAT_NS4__ 10002 __TIME__ __OPTS__ -b 2G -l 548 bw __BW__ 0.4 0.6 -iperf3 
BW guest __GW__ 10002 __TIME__ __OPTS__ -b 3G -l 1252 +iperf3 BW guest __NAT_NS4__ 10002 __TIME__ __OPTS__ -b 3G -l 1252 bw __BW__ 0.8 1.2 -iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -b 4G -l 1472 +iperf3 BW guest __NAT_NS4__ 10002 __TIME__ __OPTS__ -b 4G -l 1472 bw __BW__ 1.0 1.5 -iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -b 8G -l 8972 +iperf3 BW guest __NAT_NS4__ 10002 __TIME__ __OPTS__ -b 8G -l 8972 bw __BW__ 4.0 5.0 -iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -b 15G -l 65492 +iperf3 BW guest __NAT_NS4__ 10002 __TIME__ __OPTS__ -b 15G -l 65492 bw __BW__ 4.0 5.0 iperf3k ns @@ -94,7 +93,7 @@ lat - lat - lat - nsb udp_rr --nolog -4 -gout LAT udp_rr --nolog -4 -c -H __GW__ | sed -n 's/^throughput=$.*$/\1/p' +gout LAT udp_rr --nolog -4 -c -H __NAT_NS4__ | sed -n 's/^throughput=$.*$/\1/p' lat __LAT__ 200 150 diff --git a/test/perf/pasta_tcp b/test/perf/pasta_tcp index a443f5a9..a6ea062c 100644 --- a/test/perf/pasta_tcp +++ b/test/perf/pasta_tcp @@ -14,6 +14,9 @@ htools head ip seq bc sleep iperf3 tcp_rr tcp_crr jq sed nstools /sbin/sysctl nproc ip seq sleep iperf3 tcp_rr tcp_crr jq sed +set NAT_HOST4 192.0.2.1 +set NAT_HOST6 2001:db8:9a55::1 + test pasta: throughput and latency (local connections) ns /sbin/sysctl -w net.ipv4.tcp_rmem="131072 524288 134217728" @@ -122,8 +125,6 @@ te test pasta: throughput and latency (connections via tap) -nsout GW ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' -nsout GW6 ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' set THREADS 2 set OPTS -Z -P __THREADS__ -i1 -O__OMIT__ @@ -137,16 +138,16 @@ tr TCP throughput over IPv6: ns to host iperf3s host 10003 ns ip link set dev __IFNAME__ mtu 1500 -iperf3 BW ns __GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -w 512k +iperf3 BW ns __NAT_HOST6__ 10003 __TIME__ __OPTS__ -w 512k bw __BW__ 0.2 0.4 ns ip link set dev __IFNAME__ mtu 4000 -iperf3 BW ns 
__GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -w 1M +iperf3 BW ns __NAT_HOST6__ 10003 __TIME__ __OPTS__ -w 1M bw __BW__ 0.3 0.5 ns ip link set dev __IFNAME__ mtu 16384 -iperf3 BW ns __GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -w 8M +iperf3 BW ns __NAT_HOST6__ 10003 __TIME__ __OPTS__ -w 8M bw __BW__ 1.5 2.0 ns ip link set dev __IFNAME__ mtu 65520 -iperf3 BW ns __GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -w 8M +iperf3 BW ns __NAT_HOST6__ 10003 __TIME__ __OPTS__ -w 8M bw __BW__ 2.0 2.5 iperf3k host @@ -156,7 +157,7 @@ lat - lat - lat - hostb tcp_rr --nolog -P 10003 -C 10013 -6 -nsout LAT tcp_rr --nolog -l1 -P 10003 -C 10013 -6 -c -H __GW6__%__IFNAME__ | sed -n 's/^throughput=$.*$/\1/p' +nsout LAT tcp_rr --nolog -l1 -P 10003 -C 10013 -6 -c -H __NAT_HOST6__ | sed -n 's/^throughput=$.*$/\1/p' hostw lat __LAT__ 150 100 @@ -165,7 +166,7 @@ lat - lat - lat - hostb tcp_crr --nolog -P 10003 -C 10013 -6 -nsout LAT tcp_crr --nolog -l1 -P 10003 -C 10013 -6 -c -H __GW6__%__IFNAME__ | sed -n 's/^throughput=$.*$/\1/p' +nsout LAT tcp_crr --nolog -l1 -P 10003 -C 10013 -6 -c -H __NAT_HOST6__ | sed -n 's/^throughput=$.*$/\1/p' hostw lat __LAT__ 1500 500 @@ -174,16 +175,16 @@ tr TCP throughput over IPv4: ns to host iperf3s host 10003 ns ip link set dev __IFNAME__ mtu 1500 -iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -w 512k +iperf3 BW ns __NAT_HOST4__ 10003 __TIME__ __OPTS__ -w 512k bw __BW__ 0.2 0.4 ns ip link set dev __IFNAME__ mtu 4000 -iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -w 1M +iperf3 BW ns __NAT_HOST4__ 10003 __TIME__ __OPTS__ -w 1M bw __BW__ 0.3 0.5 ns ip link set dev __IFNAME__ mtu 16384 -iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -w 8M +iperf3 BW ns __NAT_HOST4__ 10003 __TIME__ __OPTS__ -w 8M bw __BW__ 1.5 2.0 ns ip link set dev __IFNAME__ mtu 65520 -iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -w 8M +iperf3 BW ns __NAT_HOST4__ 10003 __TIME__ __OPTS__ -w 8M bw __BW__ 2.0 2.5 iperf3k host @@ -193,7 +194,7 @@ lat - lat - lat - hostb tcp_rr --nolog -P 10003 -C 10013 -4 -nsout 
LAT tcp_rr --nolog -l1 -P 10003 -C 10013 -4 -c -H __GW__ | sed -n 's/^throughput=$.*$/\1/p' +nsout LAT tcp_rr --nolog -l1 -P 10003 -C 10013 -4 -c -H __NAT_HOST4__ | sed -n 's/^throughput=$.*$/\1/p' hostw lat __LAT__ 150 100 @@ -202,7 +203,7 @@ lat - lat - lat - hostb tcp_crr --nolog -P 10003 -C 10013 -4 -nsout LAT tcp_crr --nolog -l1 -P 10003 -C 10013 -4 -c -H __GW__ | sed -n 's/^throughput=$.*$/\1/p' +nsout LAT tcp_crr --nolog -l1 -P 10003 -C 10013 -4 -c -H __NAT_HOST4__ | sed -n 's/^throughput=$.*$/\1/p' hostw lat __LAT__ 1500 500 diff --git a/test/perf/pasta_udp b/test/perf/pasta_udp index 9fed62e4..146e41b8 100644 --- a/test/perf/pasta_udp +++ b/test/perf/pasta_udp @@ -14,6 +14,9 @@ htools bc head ip sleep iperf3 udp_rr jq sed nstools ip sleep iperf3 udp_rr jq sed +set NAT_HOST4 192.0.2.1 +set NAT_HOST6 2001:db8:9a55::1 + test pasta: throughput and latency (local traffic) hout FREQ_PROCFS (echo "scale=1"; sed -n 's/cpu MHz.*: $[0-9]*$\..*$/(\1+10^2\/2)\/10^3/p' /proc/cpuinfo) | bc -l | head -n1 @@ -133,8 +136,6 @@ te test pasta: throughput and latency (traffic via tap) -nsout GW ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' -nsout GW6 ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' info Throughput in Gbps, latency in µs, one thread at __FREQ__ GHz @@ -146,13 +147,13 @@ tr UDP throughput over IPv6: ns to host iperf3s host 10003 # (datagram size) = (packet size) - 48: 40 bytes of IPv6 header, 8 of UDP header -iperf3 BW ns __GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -b 8G -l 1472 +iperf3 BW ns __NAT_HOST6__ 10003 __TIME__ __OPTS__ -b 8G -l 1472 bw __BW__ 0.3 0.5 -iperf3 BW ns __GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -b 12G -l 3972 +iperf3 BW ns __NAT_HOST6__ 10003 __TIME__ __OPTS__ -b 12G -l 3972 bw __BW__ 0.5 0.8 -iperf3 BW ns __GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -b 20G -l 16356 +iperf3 BW ns __NAT_HOST6__ 10003 __TIME__ __OPTS__ 
-b 20G -l 16356 bw __BW__ 3.0 4.0 -iperf3 BW ns __GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -b 30G -l 65472 +iperf3 BW ns __NAT_HOST6__ 10003 __TIME__ __OPTS__ -b 30G -l 65472 bw __BW__ 6.0 7.0 iperf3k host @@ -162,7 +163,7 @@ lat - lat - lat - hostb udp_rr --nolog -P 10003 -C 10013 -6 -nsout LAT udp_rr --nolog -P 10003 -C 10013 -6 -c -H __GW6__%__IFNAME__ | sed -n 's/^throughput=$.*$/\1/p' +nsout LAT udp_rr --nolog -P 10003 -C 10013 -6 -c -H __NAT_HOST6__ | sed -n 's/^throughput=$.*$/\1/p' hostw lat __LAT__ 200 150 @@ -171,13 +172,13 @@ tr UDP throughput over IPv4: ns to host iperf3s host 10003 # (datagram size) = (packet size) - 28: 20 bytes of IPv4 header, 8 of UDP header -iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -b 8G -l 1472 +iperf3 BW ns __NAT_HOST4__ 10003 __TIME__ __OPTS__ -b 8G -l 1472 bw __BW__ 0.3 0.5 -iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -b 12G -l 3972 +iperf3 BW ns __NAT_HOST4__ 10003 __TIME__ __OPTS__ -b 12G -l 3972 bw __BW__ 0.5 0.8 -iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -b 20G -l 16356 +iperf3 BW ns __NAT_HOST4__ 10003 __TIME__ __OPTS__ -b 20G -l 16356 bw __BW__ 3.0 4.0 -iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -b 30G -l 65492 +iperf3 BW ns __NAT_HOST4__ 10003 __TIME__ __OPTS__ -b 30G -l 65492 bw __BW__ 6.0 7.0 iperf3k host @@ -187,7 +188,7 @@ lat - lat - lat - hostb udp_rr --nolog -P 10003 -C 10013 -4 -nsout LAT udp_rr --nolog -P 10003 -C 10013 -4 -c -H __GW__ | sed -n 's/^throughput=$.*$/\1/p' +nsout LAT udp_rr --nolog -P 10003 -C 10013 -4 -c -H __NAT_HOST4__ | sed -n 's/^throughput=$.*$/\1/p' hostw lat __LAT__ 200 150 diff --git a/test/run b/test/run index 3b376639..cd6d7076 100755 --- a/test/run +++ b/test/run @@ -101,7 +101,7 @@ run() { VALGRIND=1 setup passt_in_ns test passt/ndp - test passt/dhcp + test passt_in_ns/dhcp test passt_in_ns/icmp test passt_in_ns/tcp test passt_in_ns/udp @@ -115,7 +115,7 @@ run() { VALGRIND=0 setup passt_in_ns test passt/ndp - test passt/dhcp + test passt_in_ns/dhcp test perf/passt_tcp test 
perf/passt_udp test perf/pasta_tcp -- 2.46.0

Stefano Brivio

20 Aug

9:56 p.m.

New subject: [PATCH 20/22] conf: Allow address remapped to host to be configured

On Fri, 16 Aug 2024 15:40:01 +1000 David Gibson wrote:

...

Because the host and guest share the same IP address with passt/pasta, it's not possible for the guest to directly address the host. Therefore we allow packets from the guest going to a special "NAT to host" address to be redirected to the host, appearing there as though they have both source and destination address of loopback.
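The outbound redirection described above can be sketched for IPv4 as follows. This is only an illustration under assumptions: `nat4_dest_from_guest()` and `addr4_unset()` are hypothetical names, not passt functions, and the sketch treats 0.0.0.0 as "no mapping configured".

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>

/* Hypothetical helper: 0.0.0.0 stands for "no NAT-to-host address set" */
static bool addr4_unset(struct in_addr a)
{
	return a.s_addr == htonl(INADDR_ANY);
}

/*
 * Hypothetical sketch: pick the host-side destination for a packet the
 * guest sent to guest_dst. Only the special "NAT to host" address is
 * rewritten (to 127.0.0.1); everything else passes through unchanged.
 */
struct in_addr nat4_dest_from_guest(struct in_addr nat_host_loopback,
				    struct in_addr guest_dst)
{
	struct in_addr host_dst = guest_dst;

	if (!addr4_unset(nat_host_loopback) &&
	    guest_dst.s_addr == nat_host_loopback.s_addr)
		host_dst.s_addr = htonl(INADDR_LOOPBACK);

	return host_dst;
}
```

With the mapping set to 192.0.2.1, a guest packet to 192.0.2.1 would be delivered to 127.0.0.1 on the host, while a packet to any other address keeps its destination.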

Currently that special address is always the address of the default gateway (or none). That can be a problem if we want that gateway to be addressable by the guest. Therefore, allow the special "NAT to host" address to be overridden on the command line with a new --nat-host-loopback option.

In order to exercise and test it, update the passt_in_ns and perf tests to use this option and give different mapping addresses for the two layers of the environment.
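As a rough sketch of how an argument to such an option could be validated: accept 'none', or one IPv6 or IPv4 address that is not unspecified, loopback or multicast. The names `parse_nat_arg()`, `enum nat_arg` and the `v4_*()` helpers are assumptions made for this example and do not come from the patch. Unlike conf_nat() as posted, which appears to fall through to address parsing after handling "none", the sketch returns early in that case and only writes the outputs once validation has passed.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Outcome of parsing one option argument (illustrative names, not from passt) */
enum nat_arg { NAT_ARG_NONE, NAT_ARG_V4, NAT_ARG_V6, NAT_ARG_INVALID };

/* IPv4 counterparts of the IN6_IS_ADDR_*() tests, on network-order addresses */
static int v4_unspecified(const struct in_addr *a)
{
	return a->s_addr == htonl(INADDR_ANY);
}

static int v4_loopback(const struct in_addr *a)
{
	return (ntohl(a->s_addr) >> 24) == 127;		/* 127.0.0.0/8 */
}

static int v4_multicast(const struct in_addr *a)
{
	return (ntohl(a->s_addr) >> 28) == 0xe;		/* 224.0.0.0/4 */
}

/* Parse "none", an IPv6 address or an IPv4 address */
enum nat_arg parse_nat_arg(const char *arg, struct in_addr *addr4,
			   struct in6_addr *addr6)
{
	struct in6_addr a6;
	struct in_addr a4;

	if (strcmp(arg, "none") == 0) {
		memset(addr4, 0, sizeof(*addr4));	/* 0.0.0.0: nothing mapped */
		memset(addr6, 0, sizeof(*addr6));	/* ::, nothing mapped */
		return NAT_ARG_NONE;	/* don't try to parse "none" as an address */
	}

	if (inet_pton(AF_INET6, arg, &a6) == 1 &&
	    !IN6_IS_ADDR_UNSPECIFIED(&a6) &&
	    !IN6_IS_ADDR_LOOPBACK(&a6) &&
	    !IN6_IS_ADDR_MULTICAST(&a6)) {
		*addr6 = a6;
		return NAT_ARG_V6;
	}

	if (inet_pton(AF_INET, arg, &a4) == 1 &&
	    !v4_unspecified(&a4) && !v4_loopback(&a4) && !v4_multicast(&a4)) {
		*addr4 = a4;
		return NAT_ARG_V4;
	}

	return NAT_ARG_INVALID;	/* caller decides whether to die() */
}
```

A caller would invoke this once per occurrence of the option, so giving it twice (one IPv4, one IPv6 address) fills in both outputs.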

Signed-off-by: David Gibson
---
 conf.c                | 57 +++++++++++++++++++++++++++++++--
 passt.1               | 16 ++++++++++
 test/lib/setup        | 11 +++++--
 test/passt_in_ns/dhcp | 73 +++++++++++++++++++++++++++++++++++++++++++
 test/passt_in_ns/tcp  | 38 +++++++++++-----------
 test/passt_in_ns/udp  | 22 +++++++------
 test/perf/passt_tcp   | 33 +++++++++----------
 test/perf/passt_udp   | 31 +++++++++---------
 test/perf/pasta_tcp   | 29 ++++++++---------
 test/perf/pasta_udp   | 25 ++++++++-------
 test/run              |  4 +--
 11 files changed, 244 insertions(+), 95 deletions(-)
 create mode 100644 test/passt_in_ns/dhcp

diff --git a/conf.c b/conf.c index 26373584..c5831e82 100644 --- a/conf.c +++ b/conf.c @@ -817,6 +817,14 @@ static void usage(const char *name, FILE *f, int status) fprintf(f, " --no-dhcp-search No list in DHCP/DHCPv6/NDP\n");

fprintf(f, + " --nat-host-loopback ADDR NAT ADDR to refer to host\n" + " Packets from the guest to ADDR will be redirected to the\n" + " host. On the host such packets will appear to have both\n" + " source and destination of loopback (127.0.0.1 or ::1).\n"

I would leave these three lines to the man page. The help message is already 90 lines long. This should be a quick guide/reminder, not a full description.

This reminds me that 127.0.0.1 isn't the only IPv4 loopback address. I don't know if anybody will ever have a use case where they would need a different, specific, loopback source address, but, together with --nat-guest-addr from 22/22, I start wondering: what if we had a single option taking, optionally, an arbitrary (within limits) source address?

Now, given that we plan to add a configurable flow table at some point in the future, it makes no sense to make this exceedingly flexible. But I just wanted to bring this up for consideration, in case it's doable at a small cost (I'm really not sure):

--map-host [source,]address

where "source" would default to 127.0.0.1, but it could also be another loopback address, or another address altogether (and we'll fail if it's not local, of course).

If we want (can?) go that way and keep equivalent functionality as you have now, we would have the additional problem that this option could be given up to two times (one for loopback, one for non-loopback), and not more (we don't have a data structure ready for an arbitrary number of those), so it's not as generic as it might look like, and I'm not sure if it's a good idea. But we could also expand on it in the future.
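To make the proposed "[source,]address" syntax concrete, here is a minimal IPv4-only sketch of how such an argument might be split and parsed. It is purely hypothetical: the option is only a suggestion in this thread, and `struct map_host4` and `parse_map_host4()` are invented for the example. The sketch does not check whether "source" is local to the host.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical result of parsing "[source,]address" for IPv4 */
struct map_host4 {
	struct in_addr source;	/* address the traffic appears to come from */
	struct in_addr address;	/* guest-visible address to be remapped */
};

/* Returns 0 on success, -1 on a malformed argument; *out untouched on error */
int parse_map_host4(const char *arg, struct map_host4 *out)
{
	const char *comma = strchr(arg, ',');
	char buf[INET_ADDRSTRLEN];
	struct map_host4 m;

	/* "source" is optional and defaults to 127.0.0.1 */
	m.source.s_addr = htonl(INADDR_LOOPBACK);

	if (comma) {
		size_t len = (size_t)(comma - arg);

		if (len == 0 || len >= sizeof(buf))
			return -1;

		memcpy(buf, arg, len);
		buf[len] = '\0';

		if (inet_pton(AF_INET, buf, &m.source) != 1)
			return -1;

		arg = comma + 1;
	}

	if (inet_pton(AF_INET, arg, &m.address) != 1)
		return -1;

	*out = m;
	return 0;
}
```

Under these assumptions "192.0.2.1" would map that address with the default 127.0.0.1 source, while "127.0.0.2,192.0.2.1" would select a different loopback source.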

...

+ " ADDR can be 'none', in which case nothing is mapped\n"

This is a nice feature by the way as it should eventually allow us to get consistent options in Podman instead of "--map-gw": Podman could add by default '--map-host-loopback none', unless the user overrides that with an actual address.

...

+ " Can be specified zero to two (for IPv4 and IPv6)\n"

"can" (for consistency, but also because the subject is still the option, this is not a separate sentence).

...times.

...

+ " default: gateway address, or none if --no-map-gw is also\n" + " specified\n"

I don't think we need to mention here that --no-map-gw implies none, doing it in the man page is enough.

...

" --dns-forward ADDR Forward DNS queries sent to ADDR\n" " can be specified zero to two times (for IPv4 and IPv6)\n" " default: don't forward DNS queries\n" @@ -959,6 +967,11 @@ static void conf_print(const struct ctx *c) info(" host: %s", eth_ntop(c->our_tap_mac, bufmac, sizeof(bufmac)));

if (c->ifi4) { + if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback)) + info(" NAT to host 127.0.0.1: %s", + inet_ntop(AF_INET, &c->ip4.nat_host_loopback, + buf4, sizeof(buf4))); + if (!c->no_dhcp) { uint32_t mask;

@@ -989,6 +1002,11 @@ static void conf_print(const struct ctx *c) }

if (c->ifi6) { + if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback)) + info(" NAT to host ::1: %s", + inet_ntop(AF_INET6, &c->ip6.nat_host_loopback, + buf6, sizeof(buf6))); + if (!c->no_ndp && !c->no_dhcpv6) info("NDP/DHCPv6:"); else if (!c->no_ndp) @@ -1122,6 +1140,35 @@ static void conf_ugid(char *runas, uid_t *uid, gid_t *gid) } }

+/** + * conf_nat() - Parse --nat-host-loopback option + * @c: Execution context + * @arg: String argument to --nat-host-loopback + * @no_map_gw: --no-map-gw flag, updated for "none" argument + */ +static void conf_nat(struct ctx *c, const char *arg, int *no_map_gw) +{ + if (strcmp(arg, "none") == 0) { + c->ip4.nat_host_loopback = in4addr_any; + c->ip6.nat_host_loopback = in6addr_any; + *no_map_gw = 1; + } + + if (inet_pton(AF_INET6, arg, &c->ip6.nat_host_loopback) && + !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) && + !IN6_IS_ADDR_LOOPBACK(&c->ip6.nat_host_loopback) && + !IN6_IS_ADDR_MULTICAST(&c->ip6.nat_host_loopback)) + return; + + if (inet_pton(AF_INET, arg, &c->ip4.nat_host_loopback) && + !IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback) && + !IN4_IS_ADDR_LOOPBACK(&c->ip4.nat_host_loopback) && + !IN4_IS_ADDR_MULTICAST(&c->ip4.nat_host_loopback)) + return; + + die("Invalid address to remap to host: %s", optarg); +} + /** * conf_open_files() - Open files as requested by configuration * @c: Execution context @@ -1231,6 +1278,7 @@ void conf(struct ctx *c, int argc, char **argv) {"no-copy-routes", no_argument, NULL, 18 }, {"no-copy-addrs", no_argument, NULL, 19 }, {"netns-only", no_argument, NULL, 20 }, + {"nat-host-loopback", required_argument, NULL, 21 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1400,6 +1448,9 @@ void conf(struct ctx *c, int argc, char **argv) netns_only = 1; *userns = 0; break; + case 21: + conf_nat(c, optarg, &no_map_gw); + break; case 'd': c->debug = 1; c->quiet = 0; @@ -1639,10 +1690,12 @@ void conf(struct ctx *c, int argc, char **argv) (*c->ip6.ifname_out && !c->ifi6)) die("External interface not usable");

- if (c->ifi4 && !no_map_gw) + if (c->ifi4 && !no_map_gw && + IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback)) c->ip4.nat_host_loopback = c->ip4.guest_gw;

- if (c->ifi6 && !no_map_gw) + if (c->ifi6 && !no_map_gw && + IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback)) c->ip6.nat_host_loopback = c->ip6.guest_gw;

if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr)) diff --git a/passt.1 b/passt.1 index dca433b6..3680056a 100644 --- a/passt.1 +++ b/passt.1 @@ -327,6 +327,22 @@ namespace will be silently dropped. Disable Router Advertisements. Router Solicitations coming from guest or target namespace will be ignored.

+.TP +.BR \-\-nat-host-loopback " " \fIaddr +Translate \fIaddr\fR to refer to the host. Packets from the guest to +\fIaddr\fR will be redirected to the host. On the host such packets +will appear to have both source and destination of loopback (127.0.0.1

I would skip "of loopback" and just say "127.0.0.1 or ::1", to avoid implying that there's a single loopback address for IPv4.

...

+or ::1). + +If \fIaddr\fR is 'none', no address is mapped (this implies +\fB--no-map-gw\fR). Only one IPv4 and one IPv6 address can be +translated, if the option is specified multiple times, the last one +takes effect. + +Default is to translate the guest's default gateway address, unless +\fB--no-map-gw\fR is also given, in which case no address is mapped by

Why "also"? You're describing the default, so I guess this option is not actually given in that case.

...

+default. + .TP .BR \-\-no-map-gw Don't remap TCP connections and untracked UDP traffic, with the gateway address diff --git a/test/lib/setup b/test/lib/setup index 9b39b9fe..061bf997 100755 --- a/test/lib/setup +++ b/test/lib/setup @@ -124,7 +124,12 @@ setup_passt_in_ns() { [ ${DEBUG} -eq 1 ] && __opts="${__opts} -d" [ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"

- context_run_bg pasta "./pasta ${__opts} -t 10001,10002,10011,10012 -T 10003,10013 -u 10001,10002,10011,10012 -U 10003,10013 -P ${STATESETUP}/pasta.pid --config-net ${NSTOOL} hold ${STATESETUP}/ns.hold" + __nat_host4=192.0.2.1 + __nat_host6=2001:db8:9a55::1 + __nat_ns4=192.0.2.2 + __nat_ns6=2001:db8:9a55::2 + + context_run_bg pasta "./pasta ${__opts} -t 10001,10002,10011,10012 -T 10003,10013 -u 10001,10002,10011,10012 -U 10003,10013 -P ${STATESETUP}/pasta.pid --nat-host-loopback ${__nat_host4} --nat-host-loopback ${__nat_host6} --config-net ${NSTOOL} hold ${STATESETUP}/ns.hold" wait_for [ -f "${STATESETUP}/pasta.pid" ]

context_setup_nstool qemu ${STATESETUP}/ns.hold @@ -139,11 +144,11 @@ setup_passt_in_ns() { if [ ${VALGRIND} -eq 1 ]; then context_run passt "make clean" context_run passt "make valgrind" - context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid" + context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --nat-host-loopback ${__nat_ns4} --nat-host-loopback ${__nat_ns6}" else context_run passt "make clean" context_run passt "make" - context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid" + context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --nat-host-loopback ${__nat_ns4} --nat-host-loopback ${__nat_ns6}" fi wait_for [ -f "${STATESETUP}/passt.pid" ]

diff --git a/test/passt_in_ns/dhcp b/test/passt_in_ns/dhcp new file mode 100644 index 00000000..48c7d197 --- /dev/null +++ b/test/passt_in_ns/dhcp

...how did this happen? This file already exists.

-- 
Stefano

David Gibson

21 Aug

4:23 a.m.

New subject: [PATCH 20/22] conf: Allow address remapped to host to be configured

On Tue, Aug 20, 2024 at 09:56:34PM +0200, Stefano Brivio wrote:

...

On Fri, 16 Aug 2024 15:40:01 +1000 David Gibson wrote:

...
Because the host and guest share the same IP address with passt/pasta, it's not possible for the guest to directly address the host. Therefore we allow packets from the guest going to a special "NAT to host" address to be redirected to the host, appearing there as though they have both source and destination address of loopback.

Currently that special address is always the address of the default gateway (or none). That can be a problem if we want that gateway to be addressable by the guest. Therefore, allow the special "NAT to host" address to be overridden on the command line with a new --nat-host-loopback option.

In order to exercise and test it, update the passt_in_ns and perf tests to use this option and give different mapping addresses for the two layers of the environment.

Signed-off-by: David Gibson
---
 conf.c                | 57 +++++++++++++++++++++++++++++++--
 passt.1               | 16 ++++++++++
 test/lib/setup        | 11 +++++--
 test/passt_in_ns/dhcp | 73 +++++++++++++++++++++++++++++++++++++++++++
 test/passt_in_ns/tcp  | 38 +++++++++++-----------
 test/passt_in_ns/udp  | 22 +++++++------
 test/perf/passt_tcp   | 33 +++++++++----------
 test/perf/passt_udp   | 31 +++++++++---------
 test/perf/pasta_tcp   | 29 ++++++++---------
 test/perf/pasta_udp   | 25 ++++++++-------
 test/run              |  4 +--
 11 files changed, 244 insertions(+), 95 deletions(-)
 create mode 100644 test/passt_in_ns/dhcp

diff --git a/conf.c b/conf.c index 26373584..c5831e82 100644 --- a/conf.c +++ b/conf.c @@ -817,6 +817,14 @@ static void usage(const char *name, FILE *f, int status) fprintf(f, " --no-dhcp-search No list in DHCP/DHCPv6/NDP\n");

fprintf(f, + " --nat-host-loopback ADDR NAT ADDR to refer to host\n" + " Packets from the guest to ADDR will be redirected to the\n" + " host. On the host such packets will appear to have both\n" + " source and destination of loopback (127.0.0.1 or ::1).\n"

I would leave these three lines to the man page. The help message is already 90 lines long. This should be a quick guide/reminder, not a full description.

Good idea, done.

...

This reminds me that 127.0.0.1 isn't the only IPv4 loopback address. I don't know if anybody will ever have a use case where they would need a different, specific, loopback source address, but, together with

This is primarily about translation of outbound connections, so loopback is more the destination address than the source here.

...

--nat-guest-addr from 22/22, I start wondering: what if we had a single option taking, optionally, an arbitrary (within limits) source address?

I'd like to see that, but it's a more complex exercise - we'd need a table of NATs to step through. This series is just aiming to handle the most common cases for now.

...

Now, given that we plan to add a configurable flow table at some point in the future, it makes no sense to make this exceedingly flexible. But I just wanted to bring this up for consideration, in case it's doable at a small cost (I'm really not sure):

--map-host [source,]address

where "source" would default to 127.0.0.1, but it could also be another loopback address, or another address altogether (and we'll fail if it's not local, of course).

There's no particular reason it has to fail if non-local. Even if we have this in future, I think --map-guest-addr would still be useful because it avoids the user having to spell out what host address they expect the guest to take.

...

If we want (can?) go that way and keep equivalent functionality as you have now, we would have the additional problem that this option could be given up to two times (one for loopback, one for non-loopback), and not more (we don't have a data structure ready for an arbitrary number of those), so it's not as generic as it might look like, and I'm not sure if it's a good idea. But we could also expand on it in the future.

Yeah, I see this more as a future extension.

...

...
+ " ADDR can be 'none', in which case nothing is mapped\n"

This is a nice feature by the way as it should eventually allow us to get consistent options in Podman instead of "--map-gw": Podman could add by default '--map-host-loopback none', unless the user overrides that with an actual address.

Exactly. The idea here is that we can eventually deprecate --no-map-gw in favour of --map-host-loopback=none.

...

...
+ " Can be specified zero to two (for IPv4 and IPv6)\n"

"can" (for consistency, but also because the subject is still the option, this is not a separate sentence).

Done.

...

...times.

And done.

...

...
+ " default: gateway address, or none if --no-map-gw is also\n" + " specified\n"

I don't think we need to mention here that --no-map-gw implies none, doing it in the man page is enough.

Done.

...

...
" --dns-forward ADDR Forward DNS queries sent to ADDR\n" " can be specified zero to two times (for IPv4 and IPv6)\n" " default: don't forward DNS queries\n" @@ -959,6 +967,11 @@ static void conf_print(const struct ctx *c) info(" host: %s", eth_ntop(c->our_tap_mac, bufmac, sizeof(bufmac)));

if (c->ifi4) { + if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback)) + info(" NAT to host 127.0.0.1: %s", + inet_ntop(AF_INET, &c->ip4.nat_host_loopback, + buf4, sizeof(buf4))); + if (!c->no_dhcp) { uint32_t mask;

@@ -989,6 +1002,11 @@ static void conf_print(const struct ctx *c) }

if (c->ifi6) { + if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback)) + info(" NAT to host ::1: %s", + inet_ntop(AF_INET6, &c->ip6.nat_host_loopback, + buf6, sizeof(buf6))); + if (!c->no_ndp && !c->no_dhcpv6) info("NDP/DHCPv6:"); else if (!c->no_ndp) @@ -1122,6 +1140,35 @@ static void conf_ugid(char *runas, uid_t *uid, gid_t *gid) } }

+/** + * conf_nat() - Parse --nat-host-loopback option + * @c: Execution context + * @arg: String argument to --nat-host-loopback + * @no_map_gw: --no-map-gw flag, updated for "none" argument + */ +static void conf_nat(struct ctx *c, const char *arg, int *no_map_gw) +{ + if (strcmp(arg, "none") == 0) { + c->ip4.nat_host_loopback = in4addr_any; + c->ip6.nat_host_loopback = in6addr_any; + *no_map_gw = 1; + } + + if (inet_pton(AF_INET6, arg, &c->ip6.nat_host_loopback) && + !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) && + !IN6_IS_ADDR_LOOPBACK(&c->ip6.nat_host_loopback) && + !IN6_IS_ADDR_MULTICAST(&c->ip6.nat_host_loopback)) + return; + + if (inet_pton(AF_INET, arg, &c->ip4.nat_host_loopback) && + !IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback) && + !IN4_IS_ADDR_LOOPBACK(&c->ip4.nat_host_loopback) && + !IN4_IS_ADDR_MULTICAST(&c->ip4.nat_host_loopback)) + return; + + die("Invalid address to remap to host: %s", optarg); +} + /** * conf_open_files() - Open files as requested by configuration * @c: Execution context @@ -1231,6 +1278,7 @@ void conf(struct ctx *c, int argc, char **argv) {"no-copy-routes", no_argument, NULL, 18 }, {"no-copy-addrs", no_argument, NULL, 19 }, {"netns-only", no_argument, NULL, 20 }, + {"nat-host-loopback", required_argument, NULL, 21 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1400,6 +1448,9 @@ void conf(struct ctx *c, int argc, char **argv) netns_only = 1; *userns = 0; break; + case 21: + conf_nat(c, optarg, &no_map_gw); + break; case 'd': c->debug = 1; c->quiet = 0; @@ -1639,10 +1690,12 @@ void conf(struct ctx *c, int argc, char **argv) (*c->ip6.ifname_out && !c->ifi6)) die("External interface not usable");

- if (c->ifi4 && !no_map_gw) + if (c->ifi4 && !no_map_gw && + IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback)) c->ip4.nat_host_loopback = c->ip4.guest_gw;

- if (c->ifi6 && !no_map_gw) + if (c->ifi6 && !no_map_gw && + IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback)) c->ip6.nat_host_loopback = c->ip6.guest_gw;

if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr)) diff --git a/passt.1 b/passt.1 index dca433b6..3680056a 100644 --- a/passt.1 +++ b/passt.1 @@ -327,6 +327,22 @@ namespace will be silently dropped. Disable Router Advertisements. Router Solicitations coming from guest or target namespace will be ignored.

+.TP +.BR \-\-nat-host-loopback " " \fIaddr +Translate \fIaddr\fR to refer to the host. Packets from the guest to +\fIaddr\fR will be redirected to the host. On the host such packets +will appear to have both source and destination of loopback (127.0.0.1

I would skip "of loopback" and just say "127.0.0.1 or ::1", to avoid implying that there's a single loopback address for IPv4.

Done.

...

...
+or ::1). + +If \fIaddr\fR is 'none', no address is mapped (this implies +\fB--no-map-gw\fR). Only one IPv4 and one IPv6 address can be +translated, if the option is specified multiple times, the last one +takes effect. + +Default is to translate the guest's default gateway address, unless +\fB--no-map-gw\fR is also given, in which case no address is mapped by

Why "also"? You're describing the default, so I guess this option is not actually given in that case.

Good point, fixed.

...

...
+default. + .TP .BR \-\-no-map-gw Don't remap TCP connections and untracked UDP traffic, with the gateway address diff --git a/test/lib/setup b/test/lib/setup index 9b39b9fe..061bf997 100755 --- a/test/lib/setup +++ b/test/lib/setup @@ -124,7 +124,12 @@ setup_passt_in_ns() { [ ${DEBUG} -eq 1 ] && __opts="${__opts} -d" [ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"

- context_run_bg pasta "./pasta ${__opts} -t 10001,10002,10011,10012 -T 10003,10013 -u 10001,10002,10011,10012 -U 10003,10013 -P ${STATESETUP}/pasta.pid --config-net ${NSTOOL} hold ${STATESETUP}/ns.hold" + __nat_host4=192.0.2.1 + __nat_host6=2001:db8:9a55::1 + __nat_ns4=192.0.2.2 + __nat_ns6=2001:db8:9a55::2 + + context_run_bg pasta "./pasta ${__opts} -t 10001,10002,10011,10012 -T 10003,10013 -u 10001,10002,10011,10012 -U 10003,10013 -P ${STATESETUP}/pasta.pid --nat-host-loopback ${__nat_host4} --nat-host-loopback ${__nat_host6} --config-net ${NSTOOL} hold ${STATESETUP}/ns.hold" wait_for [ -f "${STATESETUP}/pasta.pid" ]

context_setup_nstool qemu ${STATESETUP}/ns.hold @@ -139,11 +144,11 @@ setup_passt_in_ns() { if [ ${VALGRIND} -eq 1 ]; then context_run passt "make clean" context_run passt "make valgrind" - context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid" + context_run_bg passt "valgrind --max-stackframe=$((4 * 1024 * 1024)) --trace-children=yes --vgdb=no --error-exitcode=1 --suppressions=test/valgrind.supp ./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --nat-host-loopback ${__nat_ns4} --nat-host-loopback ${__nat_ns6}" else context_run passt "make clean" context_run passt "make" - context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid" + context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --nat-host-loopback ${__nat_ns4} --nat-host-loopback ${__nat_ns6}" fi wait_for [ -f "${STATESETUP}/passt.pid" ]

diff --git a/test/passt_in_ns/dhcp b/test/passt_in_ns/dhcp new file mode 100644 index 00000000..48c7d197 --- /dev/null +++ b/test/passt_in_ns/dhcp

...how did this happen? This file already exists.

No, it didn't. Previously we reused passt/dhcp for the passt_in_ns tests. With the change to the tests exercising the new option that doesn't work any more, because we need slightly different checks for DHCP to match what we expect when --map-host-loopback is used.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

David Gibson

16 Aug

7:40 a.m.

New subject: [PATCH 21/22] fwd: Distinguish translatable from untranslatable addresses on inbound

fwd_nat_from_host() needs to adjust the source address for new flows coming from an address which is not accessible to the guest. Currently we always use our_tap_addr or our_tap_ll. However in cases where the address is accessible to the guest via translation (i.e. via --nat-host-loopback) then it makes more sense to use that translation, rather than the fallback mapping of our_tap_*.

Signed-off-by: David Gibson
---
 fwd.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fwd.c b/fwd.c
index 779278a9..7718f7e2 100644
--- a/fwd.c
+++ b/fwd.c
@@ -386,7 +386,14 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
 		return PIF_SPLICE;
 	}
 
-	if (!fwd_guest_accessible(c, &ini->eaddr)) {
+	if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback) &&
+	    inany_equals4(&ini->eaddr, &in4addr_loopback)) {
+		/* Specifically 127.0.0.1, not 127.0.0.0/8 */
+		tgt->oaddr = inany_from_v4(c->ip4.nat_host_loopback);
+	} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) &&
+		   inany_equals6(&ini->eaddr, &in6addr_loopback)) {
+		tgt->oaddr.a6 = c->ip6.nat_host_loopback;
+	} else if (!fwd_guest_accessible(c, &ini->eaddr)) {
 		if (inany_v4(&ini->eaddr)) {
 			if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr))
 				/* No source address we can use */
-- 
2.46.0
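The selection order this patch introduces can be summarised, for IPv4 only, in a standalone sketch. It rests on several assumptions: `struct nat4_conf`, `is_unspec4()` and `nat4_source_for_guest()` are hypothetical names, and the result of fwd_guest_accessible() is passed in as a plain flag rather than computed.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for the IPv4 part of the context */
struct nat4_conf {
	struct in_addr nat_host_loopback;	/* 0.0.0.0: no mapping set */
	struct in_addr our_tap_addr;		/* fallback source, 0.0.0.0 if unset */
};

static bool is_unspec4(struct in_addr a)
{
	return a.s_addr == htonl(INADDR_ANY);
}

/*
 * Pick the source address the guest should see for a flow arriving on the
 * host from host_src. guest_accessible stands in for the result of
 * fwd_guest_accessible(). Returns false if no usable source exists.
 */
bool nat4_source_for_guest(const struct nat4_conf *c, struct in_addr host_src,
			   bool guest_accessible, struct in_addr *guest_src)
{
	if (!is_unspec4(c->nat_host_loopback) &&
	    host_src.s_addr == htonl(INADDR_LOOPBACK)) {
		/* Exactly 127.0.0.1: use the configured translation */
		*guest_src = c->nat_host_loopback;
	} else if (!guest_accessible) {
		/* Anything else the guest can't reach: fallback mapping */
		if (is_unspec4(c->our_tap_addr))
			return false;
		*guest_src = c->our_tap_addr;
	} else {
		/* Directly reachable by the guest: keep as is */
		*guest_src = host_src;
	}

	return true;
}
```

In this sketch only 127.0.0.1 itself takes the translated address; other addresses in 127.0.0.0/8 still go through the fallback branch, mirroring the comment in the diff.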

David Gibson

7:40 a.m.

New subject: [PATCH 22/22] fwd, conf: Allow NAT of the guest's assigned address

The guest is usually assigned one of the host's IP addresses. That means it can't access the host itself via its usual address. The --nat-host-loopback option (enabled by default with the gateway address) allows the guest to contact the host. However, connections forwarded this way appear on the host to have originated from the loopback interface, which isn't always desirable.

Add a new --nat-guest-addr option, which acts similarly but forwarded connections will go to the host's external address, instead of loopback.

If '-a' is used, so the guest's address is not the same as the host's, this will instead forward to whatever host-visible site is shadowed by the guest's assigned address.

Signed-off-by: David Gibson
---
 conf.c  | 51 ++++++++++++++++++++++++++++++++++-----------------
 fwd.c   | 10 ++++++++++
 passt.1 | 15 +++++++++++++++
 passt.h |  6 ++++++
 4 files changed, 65 insertions(+), 17 deletions(-)

diff --git a/conf.c b/conf.c
index c5831e82..d14abc63 100644
--- a/conf.c
+++ b/conf.c
@@ -825,6 +825,14 @@ static void usage(const char *name, FILE *f, int status)
 		" Can be specified zero to two (for IPv4 and IPv6)\n"
 		" default: gateway address, or none if --no-map-gw is also\n"
 		" specified\n"
+		" --nat-guest-addr ADDR NAT ADDR to guest's address\n"
+		" Packets from the guest to ADDR will be redirected to the\n"
+		" adress on the host that's the same as the guest's\n"
+		" assigned address. Usually that means (one of) the host's\n"
+		" global address.\n"
+		" ADDR can be 'none', in which case nothing is mapped\n"
+		" Can be specified zero to two (for IPv4 and IPv6)\n"
+		" default: none\n"
 		" --dns-forward ADDR Forward DNS queries sent to ADDR\n"
 		" can be specified zero to two times (for IPv4 and IPv6)\n"
 		" default: don't forward DNS queries\n"
@@ -1141,29 +1149,32 @@ static void conf_ugid(char *runas, uid_t *uid, gid_t *gid)
 }
 
 /**
- * conf_nat() - Parse --nat-host-loopback option
- * @c:		Execution context
- * @arg:	String argument to --nat-host-loopback
- * @no_map_gw:	--no-map-gw flag, updated for "none" argument
+ * conf_nat() - Parse --nat-host-loopback or --nat-guest-addr option
+ * @arg:	String argument to option
+ * @addr4:	IPv4 to update with parsed address
+ * @addr6:	IPv6 to update with parsed address
+ * @no_map_gw:	--no-map-gw flag, or NULL, updated for "none" argument
  */
-static void conf_nat(struct ctx *c, const char *arg, int *no_map_gw)
+static void conf_nat(const char *arg, struct in_addr *addr4,
+		     struct in6_addr *addr6, int *no_map_gw)
 {
 	if (strcmp(arg, "none") == 0) {
-		c->ip4.nat_host_loopback = in4addr_any;
-		c->ip6.nat_host_loopback = in6addr_any;
-		*no_map_gw = 1;
+		*addr4 = in4addr_any;
+		*addr6 = in6addr_any;
+		if (no_map_gw)
+			*no_map_gw = 1;
 	}
 
-	if (inet_pton(AF_INET6, arg, &c->ip6.nat_host_loopback) &&
-	    !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) &&
-	    !IN6_IS_ADDR_LOOPBACK(&c->ip6.nat_host_loopback) &&
-	    !IN6_IS_ADDR_MULTICAST(&c->ip6.nat_host_loopback))
+	if (inet_pton(AF_INET6, arg, addr6) &&
+	    !IN6_IS_ADDR_UNSPECIFIED(addr6) &&
+	    !IN6_IS_ADDR_LOOPBACK(addr6) &&
+	    !IN6_IS_ADDR_MULTICAST(addr6))
 		return;
 
-	if (inet_pton(AF_INET, arg, &c->ip4.nat_host_loopback) &&
-	    !IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback) &&
-	    !IN4_IS_ADDR_LOOPBACK(&c->ip4.nat_host_loopback) &&
-	    !IN4_IS_ADDR_MULTICAST(&c->ip4.nat_host_loopback))
+	if (inet_pton(AF_INET, arg, addr4) &&
+	    !IN4_IS_ADDR_UNSPECIFIED(addr4) &&
+	    !IN4_IS_ADDR_LOOPBACK(addr4) &&
+	    !IN4_IS_ADDR_MULTICAST(addr4))
 		return;
 
 	die("Invalid address to remap to host: %s", optarg);
@@ -1279,6 +1290,7 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"no-copy-addrs", no_argument, NULL, 19 },
 		{"netns-only", no_argument, NULL, 20 },
 		{"nat-host-loopback", required_argument, NULL, 21 },
+		{"nat-guest-addr", required_argument, NULL, 22 },
 		{ 0 },
 	};
 	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
@@ -1449,7 +1461,12 @@ void conf(struct ctx *c, int argc, char **argv)
 			*userns = 0;
 			break;
 		case 21:
-			conf_nat(c, optarg, &no_map_gw);
+			conf_nat(optarg, &c->ip4.nat_host_loopback,
+				 &c->ip6.nat_host_loopback, &no_map_gw);
+			break;
+		case 22:
+			conf_nat(optarg, &c->ip4.nat_guest_addr,
+				 &c->ip6.nat_guest_addr, NULL);
 			break;
 		case 'd':
 			c->debug = 1;
diff --git a/fwd.c b/fwd.c
index 7718f7e2..ff4789a2 100644
--- a/fwd.c
+++ b/fwd.c
@@ -272,6 +272,10 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto,
 		tgt->eaddr = inany_loopback4;
 	else if (inany_equals6(&ini->oaddr, &c->ip6.nat_host_loopback))
 		tgt->eaddr = inany_loopback6;
+	else if (inany_equals4(&ini->oaddr, &c->ip4.nat_guest_addr))
+		tgt->eaddr = inany_from_v4(c->ip4.addr);
+	else if (inany_equals6(&ini->oaddr, &c->ip6.nat_guest_addr))
+		tgt->eaddr.a6 = c->ip6.addr;
 	else
 		tgt->eaddr = ini->oaddr;
 
@@ -393,6 +397,12 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
 	} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) &&
 		   inany_equals6(&ini->eaddr, &in6addr_loopback)) {
 		tgt->oaddr.a6 = c->ip6.nat_host_loopback;
+	} else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_guest_addr) &&
+		   inany_equals4(&ini->eaddr, &c->ip4.addr)) {
+		tgt->oaddr = inany_from_v4(c->ip4.nat_guest_addr);
+	} else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_guest_addr) &&
+		   inany_equals6(&ini->eaddr, &c->ip6.addr)) {
+		tgt->oaddr.a6 = c->ip6.nat_guest_addr;
 	} else if (!fwd_guest_accessible(c, &ini->eaddr)) {
 		if (inany_v4(&ini->eaddr)) {
 			if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr))
diff --git a/passt.1 b/passt.1
index 3680056a..7cf553cf 100644
--- a/passt.1
+++ b/passt.1
@@ -350,6 +350,21 @@ as destination, to the host. Implied if there is no gateway on the selected
 default route, or if there is no default route, for any of the enabled address
 families.
 
+.TP
+.BR \-\-nat-guest-loopback " " \fIaddr
+Translate \fIaddr\fR in the guest to be equal to the guest's assigned
+address on the host. That is, packets from the guest to \fIaddr\fR
+will be redirected to the address assigned to the guest with \fB-a\fR,
+or by default the host's global address. This allows the guest to
+access services availble on the host's global address, even though its
+own address shadows that of the host.
+
+If \fIaddr\fR is 'none', no address is mapped. Only one IPv4 and one
+IPv6 address can be translated, if the option is specified multiple
+times, the last one for each address type takes effect.
+
+Default is no mapping.
+
 .TP
 .BR \-4 ", " \-\-ipv4-only
 Enable IPv4-only operation. IPv6 traffic will be ignored.
diff --git a/passt.h b/passt.h
index 20a5904a..586c1d05 100644
--- a/passt.h
+++ b/passt.h
@@ -104,6 +104,8 @@ enum passt_modes {
  * @guest_gw:		IPv4 gateway as seen by the guest
  * @nat_host_loopback:	Outbound connections to this address are NATted to the
  *			host's 127.0.0.1
+ * @nat_guest_addr:	Outbound connections to this address are NATted to the
+ *			guest's assigned address
  * @dns:		DNS addresses for DHCP, zero-terminated
  * @dns_match:		Forward DNS query if sent to this address
  * @our_tap_addr:	IPv4 address for passt's use on tap
@@ -120,6 +122,7 @@ struct ip4_ctx {
 	int prefix_len;
 	struct in_addr guest_gw;
 	struct in_addr nat_host_loopback;
+	struct in_addr nat_guest_addr;
 	struct in_addr dns[MAXNS + 1];
 	struct in_addr dns_match;
 	struct in_addr our_tap_addr;
@@ -142,6 +145,8 @@ struct ip4_ctx {
  * @guest_gw:		IPv6 gateway as seen by the guest
  * @nat_host_loopback:	Outbound connections to this address are NATted to the
  *			host's [::1]
+ * @nat_guest_addr:	Outbound connections to this address are NATted to the
+ *			guest's assigned address
  * @dns:		DNS addresses for DHCPv6 and NDP, zero-terminated
  * @dns_match:		Forward DNS query if sent to this address
  * @our_tap_ll:		Link-local IPv6 address for passt's use on tap
@@ -158,6 +163,7 @@ struct ip6_ctx {
 	struct in6_addr addr_ll_seen;
 	struct in6_addr guest_gw;
 	struct in6_addr nat_host_loopback;
+	struct in6_addr nat_guest_addr;
 	struct in6_addr dns[MAXNS + 1];
 	struct in6_addr dns_match;
 	struct in6_addr our_tap_ll;
-- 
2.46.0
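
The outbound half of the mapping described in this patch can be pictured with a small IPv4-only model. This is only a sketch under stated assumptions: struct nat4_cfg and nat4_from_guest() are hypothetical names standing in for the relevant fields of struct ip4_ctx and for the branch added to fwd_nat_from_tap(), and the explicit check for an unset (0.0.0.0) mapping is an addition of the sketch rather than something the patch does on this path.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>

/* Hypothetical IPv4-only stand-in for a few fields of struct ip4_ctx */
struct nat4_cfg {
	struct in_addr addr;              /* address assigned to the guest */
	struct in_addr nat_host_loopback; /* 0.0.0.0 means "nothing mapped" */
	struct in_addr nat_guest_addr;    /* 0.0.0.0 means "nothing mapped" */
};

/* Return the host-side destination for a guest packet addressed to @dst */
static struct in_addr nat4_from_guest(const struct nat4_cfg *cfg,
				      struct in_addr dst)
{
	struct in_addr lo = { .s_addr = htonl(INADDR_LOOPBACK) };

	assert(cfg != NULL);

	/* Address mapped to host loopback: deliver to 127.0.0.1 */
	if (cfg->nat_host_loopback.s_addr != htonl(INADDR_ANY) &&
	    dst.s_addr == cfg->nat_host_loopback.s_addr)
		return lo;

	/* Address mapped to the guest's assigned address, which on the host
	 * side is usually one of the host's own global addresses */
	if (cfg->nat_guest_addr.s_addr != htonl(INADDR_ANY) &&
	    dst.s_addr == cfg->nat_guest_addr.s_addr)
		return cfg->addr;

	/* Anything else is forwarded untranslated */
	return dst;
}
```

With the guest assigned 192.0.2.10 and 192.0.2.2 given as the mapped address, a guest connection to 192.0.2.2 would leave the host addressed to 192.0.2.10, so the peer sees it arrive on the external address rather than on loopback.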

Stefano Brivio

20 Aug

9:56 p.m.

New subject: [PATCH 22/22] fwd, conf: Allow NAT of the guest's assigned address

On Fri, 16 Aug 2024 15:40:03 +1000 David Gibson wrote:

...

The guest is usually assigned one of the host's IP addresses. That means it can't access the host itself via its usual address. The --nat-host-loopback option (enabled by default with the gateway address) allows the guest to contact the host. However, connections forwarded this way appear on the host to have originated from the loopback interface, which isn't always desirable.

Add a new --nat-guest-addr option, which acts similarly but forwarded connections will go to the host's external address, instead of loopback.

If '-a' is used, so the guest's address is not the same as the host's, this will instead forward to whatever host-visible site is shadowed by the guest's assigned address.

Signed-off-by: David Gibson --- conf.c | 51 ++++++++++++++++++++++++++++++++++----------------- fwd.c | 10 ++++++++++ passt.1 | 15 +++++++++++++++ passt.h | 6 ++++++ 4 files changed, 65 insertions(+), 17 deletions(-)

diff --git a/conf.c b/conf.c index c5831e82..d14abc63 100644 --- a/conf.c +++ b/conf.c @@ -825,6 +825,14 @@ static void usage(const char *name, FILE *f, int status) " Can be specified zero to two (for IPv4 and IPv6)\n" " default: gateway address, or none if --no-map-gw is also\n" " specified\n" + " --nat-guest-addr ADDR NAT ADDR to guest's address\n" + " Packets from the guest to ADDR will be redirected to the\n" + " adress on the host that's the same as the guest's\n" + " assigned address. Usually that means (one of) the host's\n" + " global address.\n"

Same as 20/22, it's probably enough to have this in the man page.

...

+ " ADDR can be 'none', in which case nothing is mapped\n" + " Can be specified zero to two (for IPv4 and IPv6)\n"

Lower-case "can", and add "times" after "two".

...

+ " default: none\n" " --dns-forward ADDR Forward DNS queries sent to ADDR\n" " can be specified zero to two times (for IPv4 and IPv6)\n" " default: don't forward DNS queries\n" @@ -1141,29 +1149,32 @@ static void conf_ugid(char *runas, uid_t *uid, gid_t *gid) }

/** - * conf_nat() - Parse --nat-host-loopback option - * @c: Execution context - * @arg: String argument to --nat-host-loopback - * @no_map_gw: --no-map-gw flag, updated for "none" argument + * conf_nat() - Parse --nat-host-loopback or --nat-guest-addr option + * @arg: String argument to option + * @addr4: IPv4 to update with parsed address + * @addr6: IPv6 to update with parsed address + * @no_map_gw: --no-map-gw flag, or NULL, updated for "none" argument */ -static void conf_nat(struct ctx *c, const char *arg, int *no_map_gw) +static void conf_nat(const char *arg, struct in_addr *addr4, + struct in6_addr *addr6, int *no_map_gw) { if (strcmp(arg, "none") == 0) { - c->ip4.nat_host_loopback = in4addr_any; - c->ip6.nat_host_loopback = in6addr_any; - *no_map_gw = 1; + *addr4 = in4addr_any; + *addr6 = in6addr_any; + if (no_map_gw) + *no_map_gw = 1; }

- if (inet_pton(AF_INET6, arg, &c->ip6.nat_host_loopback) && - !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) && - !IN6_IS_ADDR_LOOPBACK(&c->ip6.nat_host_loopback) && - !IN6_IS_ADDR_MULTICAST(&c->ip6.nat_host_loopback)) + if (inet_pton(AF_INET6, arg, addr6) && + !IN6_IS_ADDR_UNSPECIFIED(addr6) && + !IN6_IS_ADDR_LOOPBACK(addr6) && + !IN6_IS_ADDR_MULTICAST(addr6)) return;

- if (inet_pton(AF_INET, arg, &c->ip4.nat_host_loopback) && - !IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback) && - !IN4_IS_ADDR_LOOPBACK(&c->ip4.nat_host_loopback) && - !IN4_IS_ADDR_MULTICAST(&c->ip4.nat_host_loopback)) + if (inet_pton(AF_INET, arg, addr4) && + !IN4_IS_ADDR_UNSPECIFIED(addr4) && + !IN4_IS_ADDR_LOOPBACK(addr4) && + !IN4_IS_ADDR_MULTICAST(addr4)) return;

die("Invalid address to remap to host: %s", optarg); @@ -1279,6 +1290,7 @@ void conf(struct ctx *c, int argc, char **argv) {"no-copy-addrs", no_argument, NULL, 19 }, {"netns-only", no_argument, NULL, 20 }, {"nat-host-loopback", required_argument, NULL, 21 }, + {"nat-guest-addr", required_argument, NULL, 22 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1449,7 +1461,12 @@ void conf(struct ctx *c, int argc, char **argv) *userns = 0; break; case 21: - conf_nat(c, optarg, &no_map_gw); + conf_nat(optarg, &c->ip4.nat_host_loopback, + &c->ip6.nat_host_loopback, &no_map_gw); + break; + case 22: + conf_nat(optarg, &c->ip4.nat_guest_addr, + &c->ip6.nat_guest_addr, NULL); break; case 'd': c->debug = 1; diff --git a/fwd.c b/fwd.c index 7718f7e2..ff4789a2 100644 --- a/fwd.c +++ b/fwd.c @@ -272,6 +272,10 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, tgt->eaddr = inany_loopback4; else if (inany_equals6(&ini->oaddr, &c->ip6.nat_host_loopback)) tgt->eaddr = inany_loopback6; + else if (inany_equals4(&ini->oaddr, &c->ip4.nat_guest_addr)) + tgt->eaddr = inany_from_v4(c->ip4.addr); + else if (inany_equals6(&ini->oaddr, &c->ip6.nat_guest_addr)) + tgt->eaddr.a6 = c->ip6.addr; else tgt->eaddr = ini->oaddr;

@@ -393,6 +397,12 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, } else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) && inany_equals6(&ini->eaddr, &in6addr_loopback)) { tgt->oaddr.a6 = c->ip6.nat_host_loopback; + } else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_guest_addr) && + inany_equals4(&ini->eaddr, &c->ip4.addr)) { + tgt->oaddr = inany_from_v4(c->ip4.nat_guest_addr); + } else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_guest_addr) && + inany_equals6(&ini->eaddr, &c->ip6.addr)) { + tgt->oaddr.a6 = c->ip6.nat_guest_addr; } else if (!fwd_guest_accessible(c, &ini->eaddr)) { if (inany_v4(&ini->eaddr)) { if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr)) diff --git a/passt.1 b/passt.1 index 3680056a..7cf553cf 100644 --- a/passt.1 +++ b/passt.1 @@ -350,6 +350,21 @@ as destination, to the host. Implied if there is no gateway on the selected default route, or if there is no default route, for any of the enabled address families.

+.TP +.BR \-\-nat-guest-loopback " " \fIaddr +Translate \fIaddr\fR in the guest to be equal to the guest's assigned +address on the host. That is, packets from the guest to \fIaddr\fR +will be redirected to the address assigned to the guest with \fB-a\fR, +or by default the host's global address. This allows the guest to +access services availble on the host's global address, even though its +own address shadows that of the host. + +If \fIaddr\fR is 'none', no address is mapped. Only one IPv4 and one +IPv6 address can be translated, if the option is specified multiple

, and if

...

+times, the last one for each address type takes effect. + +Default is no mapping. + .TP .BR \-4 ", " \-\-ipv4-only Enable IPv4-only operation. IPv6 traffic will be ignored. diff --git a/passt.h b/passt.h index 20a5904a..586c1d05 100644 --- a/passt.h +++ b/passt.h @@ -104,6 +104,8 @@ enum passt_modes { * @guest_gw: IPv4 gateway as seen by the guest * @nat_host_loopback: Outbound connections to this address are NATted to the * host's 127.0.0.1 + * @nat_guest_addr: Outbound connections to this address are NATted to the + * guest's assigned address * @dns: DNS addresses for DHCP, zero-terminated * @dns_match: Forward DNS query if sent to this address * @our_tap_addr: IPv4 address for passt's use on tap @@ -120,6 +122,7 @@ struct ip4_ctx { int prefix_len; struct in_addr guest_gw; struct in_addr nat_host_loopback; + struct in_addr nat_guest_addr; struct in_addr dns[MAXNS + 1]; struct in_addr dns_match; struct in_addr our_tap_addr; @@ -142,6 +145,8 @@ struct ip4_ctx { * @guest_gw: IPv6 gateway as seen by the guest * @nat_host_loopback: Outbound connections to this address are NATted to the * host's [::1] + * @nat_guest_addr: Outbound connections to this address are NATted to the + * guest's assigned address * @dns: DNS addresses for DHCPv6 and NDP, zero-terminated * @dns_match: Forward DNS query if sent to this address * @our_tap_ll: Link-local IPv6 address for passt's use on tap @@ -158,6 +163,7 @@ struct ip6_ctx { struct in6_addr addr_ll_seen; struct in6_addr guest_gw; struct in6_addr nat_host_loopback; + struct in6_addr nat_guest_addr; struct in6_addr dns[MAXNS + 1]; struct in6_addr dns_match; struct in6_addr our_tap_ll;

-- Stefano
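
The "one IPv4 and one IPv6 address, last one wins" behaviour discussed for the man page text can be sketched as below. This is a hedged illustration, not the patch's conf_nat(): struct map_addrs and map_addr_parse() are hypothetical names, errors are reported through the return value instead of die(), the "none" case returns early, and parsing goes through temporaries so that a rejected argument leaves earlier settings untouched.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical holder: one mapped address per family, all-zero = unmapped */
struct map_addrs {
	struct in_addr a4;
	struct in6_addr a6;
};

/* Parse one option argument into @m.
 * Returns 0 on success, -1 if @arg is neither "none" nor a usable address. */
static int map_addr_parse(const char *arg, struct map_addrs *m)
{
	struct in6_addr a6;
	struct in_addr a4;

	assert(arg != NULL && m != NULL);

	/* "none" clears the mapping for both address families */
	if (strcmp(arg, "none") == 0) {
		memset(m, 0, sizeof(*m));
		return 0;
	}

	/* Try IPv6 first, rejecting unspecified, loopback and multicast */
	if (inet_pton(AF_INET6, arg, &a6) == 1 &&
	    !IN6_IS_ADDR_UNSPECIFIED(&a6) &&
	    !IN6_IS_ADDR_LOOPBACK(&a6) &&
	    !IN6_IS_ADDR_MULTICAST(&a6)) {
		m->a6 = a6;	/* replaces any earlier IPv6 mapping */
		return 0;
	}

	/* Then IPv4, with the equivalent checks */
	if (inet_pton(AF_INET, arg, &a4) == 1 &&
	    a4.s_addr != htonl(INADDR_ANY) &&
	    (ntohl(a4.s_addr) >> 24) != 127 &&
	    !IN_MULTICAST(ntohl(a4.s_addr))) {
		m->a4 = a4;	/* replaces any earlier IPv4 mapping */
		return 0;
	}

	return -1;
}
```

Calling the function once per occurrence of the option gives the documented semantics: a second IPv4 argument overwrites the first, while an IPv6 argument given in between is kept.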

David Gibson

21 Aug

4:28 a.m.

New subject: [PATCH 22/22] fwd, conf: Allow NAT of the guest's assigned address

On Tue, Aug 20, 2024 at 09:56:40PM +0200, Stefano Brivio wrote:

...

On Fri, 16 Aug 2024 15:40:03 +1000 David Gibson wrote:

...
The guest is usually assigned one of the host's IP addresses. That means it can't access the host itself via its usual address. The --nat-host-loopback option (enabled by default with the gateway address) allows the guest to contact the host. However, connections forwarded this way appear on the host to have originated from the loopback interface, which isn't always desirable.

Add a new --nat-guest-addr option, which acts similarly but forwarded connections will go to the host's external address, instead of loopback.

If '-a' is used, so the guest's address is not the same as the host's, this will instead forward to whatever host-visible site is shadowed by the guest's assigned address.

Signed-off-by: David Gibson --- conf.c | 51 ++++++++++++++++++++++++++++++++++----------------- fwd.c | 10 ++++++++++ passt.1 | 15 +++++++++++++++ passt.h | 6 ++++++ 4 files changed, 65 insertions(+), 17 deletions(-)

diff --git a/conf.c b/conf.c index c5831e82..d14abc63 100644 --- a/conf.c +++ b/conf.c @@ -825,6 +825,14 @@ static void usage(const char *name, FILE *f, int status) " Can be specified zero to two (for IPv4 and IPv6)\n" " default: gateway address, or none if --no-map-gw is also\n" " specified\n" + " --nat-guest-addr ADDR NAT ADDR to guest's address\n" + " Packets from the guest to ADDR will be redirected to the\n" + " adress on the host that's the same as the guest's\n" + " assigned address. Usually that means (one of) the host's\n" + " global address.\n"

Same as 20/22, it's probably enough to have this in the man page.

...
+ " ADDR can be 'none', in which case nothing is mapped\n" + " Can be specified zero to two (for IPv4 and IPv6)\n"

"can", times

Done.

...

...
+ " default: none\n" " --dns-forward ADDR Forward DNS queries sent to ADDR\n" " can be specified zero to two times (for IPv4 and IPv6)\n" " default: don't forward DNS queries\n" @@ -1141,29 +1149,32 @@ static void conf_ugid(char *runas, uid_t *uid, gid_t *gid) }

/** - * conf_nat() - Parse --nat-host-loopback option - * @c: Execution context - * @arg: String argument to --nat-host-loopback - * @no_map_gw: --no-map-gw flag, updated for "none" argument + * conf_nat() - Parse --nat-host-loopback or --nat-guest-addr option + * @arg: String argument to option + * @addr4: IPv4 to update with parsed address + * @addr6: IPv6 to update with parsed address + * @no_map_gw: --no-map-gw flag, or NULL, updated for "none" argument */ -static void conf_nat(struct ctx *c, const char *arg, int *no_map_gw) +static void conf_nat(const char *arg, struct in_addr *addr4, + struct in6_addr *addr6, int *no_map_gw) { if (strcmp(arg, "none") == 0) { - c->ip4.nat_host_loopback = in4addr_any; - c->ip6.nat_host_loopback = in6addr_any; - *no_map_gw = 1; + *addr4 = in4addr_any; + *addr6 = in6addr_any; + if (no_map_gw) + *no_map_gw = 1; }

- if (inet_pton(AF_INET6, arg, &c->ip6.nat_host_loopback) && - !IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) && - !IN6_IS_ADDR_LOOPBACK(&c->ip6.nat_host_loopback) && - !IN6_IS_ADDR_MULTICAST(&c->ip6.nat_host_loopback)) + if (inet_pton(AF_INET6, arg, addr6) && + !IN6_IS_ADDR_UNSPECIFIED(addr6) && + !IN6_IS_ADDR_LOOPBACK(addr6) && + !IN6_IS_ADDR_MULTICAST(addr6)) return;

- if (inet_pton(AF_INET, arg, &c->ip4.nat_host_loopback) && - !IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_host_loopback) && - !IN4_IS_ADDR_LOOPBACK(&c->ip4.nat_host_loopback) && - !IN4_IS_ADDR_MULTICAST(&c->ip4.nat_host_loopback)) + if (inet_pton(AF_INET, arg, addr4) && + !IN4_IS_ADDR_UNSPECIFIED(addr4) && + !IN4_IS_ADDR_LOOPBACK(addr4) && + !IN4_IS_ADDR_MULTICAST(addr4)) return;

die("Invalid address to remap to host: %s", optarg); @@ -1279,6 +1290,7 @@ void conf(struct ctx *c, int argc, char **argv) {"no-copy-addrs", no_argument, NULL, 19 }, {"netns-only", no_argument, NULL, 20 }, {"nat-host-loopback", required_argument, NULL, 21 }, + {"nat-guest-addr", required_argument, NULL, 22 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1449,7 +1461,12 @@ void conf(struct ctx *c, int argc, char **argv) *userns = 0; break; case 21: - conf_nat(c, optarg, &no_map_gw); + conf_nat(optarg, &c->ip4.nat_host_loopback, + &c->ip6.nat_host_loopback, &no_map_gw); + break; + case 22: + conf_nat(optarg, &c->ip4.nat_guest_addr, + &c->ip6.nat_guest_addr, NULL); break; case 'd': c->debug = 1; diff --git a/fwd.c b/fwd.c index 7718f7e2..ff4789a2 100644 --- a/fwd.c +++ b/fwd.c @@ -272,6 +272,10 @@ uint8_t fwd_nat_from_tap(const struct ctx *c, uint8_t proto, tgt->eaddr = inany_loopback4; else if (inany_equals6(&ini->oaddr, &c->ip6.nat_host_loopback)) tgt->eaddr = inany_loopback6; + else if (inany_equals4(&ini->oaddr, &c->ip4.nat_guest_addr)) + tgt->eaddr = inany_from_v4(c->ip4.addr); + else if (inany_equals6(&ini->oaddr, &c->ip6.nat_guest_addr)) + tgt->eaddr.a6 = c->ip6.addr; else tgt->eaddr = ini->oaddr;

@@ -393,6 +397,12 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, } else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_host_loopback) && inany_equals6(&ini->eaddr, &in6addr_loopback)) { tgt->oaddr.a6 = c->ip6.nat_host_loopback; + } else if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.nat_guest_addr) && + inany_equals4(&ini->eaddr, &c->ip4.addr)) { + tgt->oaddr = inany_from_v4(c->ip4.nat_guest_addr); + } else if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.nat_guest_addr) && + inany_equals6(&ini->eaddr, &c->ip6.addr)) { + tgt->oaddr.a6 = c->ip6.nat_guest_addr; } else if (!fwd_guest_accessible(c, &ini->eaddr)) { if (inany_v4(&ini->eaddr)) { if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.our_tap_addr)) diff --git a/passt.1 b/passt.1 index 3680056a..7cf553cf 100644 --- a/passt.1 +++ b/passt.1 @@ -350,6 +350,21 @@ as destination, to the host. Implied if there is no gateway on the selected default route, or if there is no default route, for any of the enabled address families.

+.TP +.BR \-\-nat-guest-loopback " " \fIaddr +Translate \fIaddr\fR in the guest to be equal to the guest's assigned +address on the host. That is, packets from the guest to \fIaddr\fR +will be redirected to the address assigned to the guest with \fB-a\fR, +or by default the host's global address. This allows the guest to +access services availble on the host's global address, even though its +own address shadows that of the host. + +If \fIaddr\fR is 'none', no address is mapped. Only one IPv4 and one +IPv6 address can be translated, if the option is specified multiple

, and if

Done. Also fixed the fact I incorrectly called it --nat-guest-loopback instead of --map-guest-addr above.

...

...
+times, the last one for each address type takes effect. + +Default is no mapping. + .TP .BR \-4 ", " \-\-ipv4-only Enable IPv4-only operation. IPv6 traffic will be ignored. diff --git a/passt.h b/passt.h index 20a5904a..586c1d05 100644 --- a/passt.h +++ b/passt.h @@ -104,6 +104,8 @@ enum passt_modes { * @guest_gw: IPv4 gateway as seen by the guest * @nat_host_loopback: Outbound connections to this address are NATted to the * host's 127.0.0.1 + * @nat_guest_addr: Outbound connections to this address are NATted to the + * guest's assigned address * @dns: DNS addresses for DHCP, zero-terminated * @dns_match: Forward DNS query if sent to this address * @our_tap_addr: IPv4 address for passt's use on tap @@ -120,6 +122,7 @@ struct ip4_ctx { int prefix_len; struct in_addr guest_gw; struct in_addr nat_host_loopback; + struct in_addr nat_guest_addr; struct in_addr dns[MAXNS + 1]; struct in_addr dns_match; struct in_addr our_tap_addr; @@ -142,6 +145,8 @@ struct ip4_ctx { * @guest_gw: IPv6 gateway as seen by the guest * @nat_host_loopback: Outbound connections to this address are NATted to the * host's [::1] + * @nat_guest_addr: Outbound connections to this address are NATted to the + * guest's assigned address * @dns: DNS addresses for DHCPv6 and NDP, zero-terminated * @dns_match: Forward DNS query if sent to this address * @our_tap_ll: Link-local IPv6 address for passt's use on tap @@ -158,6 +163,7 @@ struct ip6_ctx { struct in6_addr addr_ll_seen; struct in6_addr guest_gw; struct in6_addr nat_host_loopback; + struct in6_addr nat_guest_addr; struct in6_addr dns[MAXNS + 1]; struct in6_addr dns_match; struct in6_addr our_tap_ll;

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson

Paul Holzinger

16 Aug

4:45 p.m.

Hi, On 16/08/2024 07:39, David Gibson wrote:

...

Based on Stefano's recent patch for faster tests.

Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.

Suggestions for better names for the new options in patches 20 & 22 are most welcome.

Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.

NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.

Paul, amongst other things, I think this will allow podman to (finally) nicely address #19213, picking an address to remap to the host's external address with --nat-guest-addr, much like it already uses --dns-forward.

Thanks, this looks promising. I will try to test it out next week.

No strong feelings about the naming, but how about s/--nat/--map/ for the options?

...

David Gibson (22):
  treewide: Use "our address" instead of "forwarding address"
  util: Helper for formatting MAC addresses
  treewide: Rename MAC address fields for clarity
  treewide: Use struct assignment instead of memcpy() for IP addresses
  conf: Use array indices rather than pointers for DNS array slots
  conf: More accurately count entries added in get_dns()
  conf: Move DNS array bounds checks into add_dns[46]
  conf: Move adding of a nameserver from resolv.conf into subfunction
  conf: Correct setting of dns_match address in add_dns6()
  conf: Treat --dns addresses as guest visible addresses
  conf: Remove incorrect initialisation of addr_ll_seen
  util: Correct sock_l4() binding for link local addresses
  treewide: Change misleading 'addr_ll' name
  Clarify which addresses in ip[46]_ctx are meaningful where
  Initialise our_tap_ll to ip6.gw when suitable
  fwd: Helpers to clarify what host addresses aren't guest accessible
  fwd: Split notion of "our tap address" from gateway for IPv4
  Don't take "our" MAC address from the host
  conf, fwd: Split notion of gateway/router from guest-visible host address
  conf: Allow address remapped to host to be configured
  fwd: Distinguish translatable from untranslatable addresses on inbound
  fwd, conf: Allow NAT of the guest's assigned address

 arp.c                 |   4 +-
 conf.c                | 328 +++++++++++++++++++++++++-----------------
 dhcp.c                |  19 +--
 dhcpv6.c              |  21 +--
 flow.c                |  72 +++++-----
 flow.h                |  18 +--
 fwd.c                 | 170 +++++++++++++++++-----
 icmp.c                |   4 +-
 ndp.c                 |   9 +-
 passt.1               |  45 +++++-
 passt.c               |   2 +-
 passt.h               |  53 +++++--
 pasta.c               |  14 +-
 tap.c                 |  12 +-
 tcp.c                 |  33 ++---
 tcp_internal.h        |   2 +-
 test/lib/setup        |  11 +-
 test/passt_in_ns/dhcp |  73 ++++++++++
 test/passt_in_ns/tcp  |  38 +++--
 test/passt_in_ns/udp  |  22 +--
 test/perf/passt_tcp   |  33 ++---
 test/perf/passt_udp   |  31 ++--
 test/perf/pasta_tcp   |  29 ++--
 test/perf/pasta_udp   |  25 ++--
 test/run              |   4 +-
 udp.c                 |  12 +-
 util.c                |  22 ++-
 util.h                |   4 +-
 28 files changed, 719 insertions(+), 391 deletions(-)
 create mode 100644 test/passt_in_ns/dhcp

-- Paul Holzinger

Stefano Brivio

5:03 p.m.

On Fri, 16 Aug 2024 16:45:14 +0200 Paul Holzinger wrote:

...

Hi,

On 16/08/2024 07:39, David Gibson wrote:

...
Based on Stefano's recent patch for faster tests.

Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.

Suggestions for better names for the new options in patches 20 & 22 are most welcome.

Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.

NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.

Paul, amongst other things, I think this will allow podman to (finally) nicely address #19213, picking an address to remap to the host's external address with --nat-guest-addr, much like it already uses --dns-forward.

Thanks this looks promising. I will try to test it out next week.

No strong feelings about the naming but how about s/--nat/--map/ for the options?

Exactly the same as I suggested offline a while ago. :) I think it's easier to understand what it does, that way.

-- Stefano

David Gibson

17 Aug

10:01 a.m.

On Fri, Aug 16, 2024 at 05:03:22PM +0200, Stefano Brivio wrote:

...

On Fri, 16 Aug 2024 16:45:14 +0200 Paul Holzinger wrote:

...
Hi,

On 16/08/2024 07:39, David Gibson wrote:

...
Based on Stefano's recent patch for faster tests.

Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.

Suggestions for better names for the new options in patches 20 & 22 are most welcome.

Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.

NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.

Paul, amongst other things, I think this will allow podman to (finally) nicely address #19213, picking an address to remap to the host's external address with --nat-guest-addr, much like it already uses --dns-forward.

Thanks this looks promising. I will try to test it out next week.

No strong feelings about the naming but how about s/--nat/--map/ for the options?

Exactly the same as I suggested offline a while ago. :) I think it's easier to understand what it does, that way.

Ok. I think I was going to do that originally but changed it for reasons that I've now forgotten. --map is more consistent with --no-map-gw too, so I'll change this.

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson

David Gibson

19 Aug

10:46 a.m.

On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:

...

Based on Stefano's recent patch for faster tests.

Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.

Suggestions for better names for the new options in patches 20 & 22 are most welcome.

Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.

NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.

I've identified the bug here. IMO, it's a pre-existing problem that only works by accident at the moment. The immediate fix is pretty obvious, but it raises some broader questions.

The problem arises because of the MTU changes we make in order to test throughput with different packet sizes. Specifically we change the MTU to values < 1280, which implicitly disables IPv6 since it requires an MTU >= 1280. When we change the MTU back to a larger value IPv6 is re-enabled, but some configuration has been lost in the meantime.

After the MTU is restored the guest reconfigures with NDP, but does not re-DHCPv6. That means the guest gets a SLAAC address in the right prefix but not the exact /128 address we've tried to assign to it. However, at least with the sequence of things we have in the tests, the guest never sends any packets with the new address, so passt doesn't update addr_seen. When the inbound connection comes we send it to the assigned address instead of the guest's actual address and the guest rejects it.

This "worked" previously, because before this patch, passt would translate the inbound connection to have source/dest as link-local addresses.

We *do* have a current addr_ll_seen because (a) it won't change if the guest doesn't change MAC and (b) when IPv6 is re-enabled the NDP traffic the guest generates will have link-local addresses that update addr_ll_seen. With this patch, and a global address for --map-host-loopback, we now need to send to addr_seen instead of addr_ll_seen, hence exposing the bug.

In the short term, the obvious fix would be to re-run dhclient -6 in the guest after we twiddle MTU but before running IPv6 tests.

This kind of opens a question about how hard we should try to accommodate guests which don't configure themselves how we told them. Personally I'd be ok with saying that nothing works if the guest doesn't configure itself properly, thereby removing addr_seen and addr_ll_seen entirely. But I think, Stefano, you've been against that idea in the past.

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
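
The two conditions in this analysis can be written down as a small sketch. It is an illustration only, under stated assumptions: struct guest6_seen, mtu_allows_ipv6(), guest6_dst() and IPV6_MIN_LINK_MTU are hypothetical names, and the selection rule (link-local source goes to addr_ll_seen, anything else to addr_seen) is a reading of the description above rather than a copy of passt's code.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimum link MTU that IPv6 requires (RFC 8200, section 5) */
#define IPV6_MIN_LINK_MTU 1280

/* True if a link with this MTU can carry IPv6 at all */
static bool mtu_allows_ipv6(unsigned int mtu)
{
	return mtu >= IPV6_MIN_LINK_MTU;
}

/* Hypothetical subset of the guest addresses learned from its traffic */
struct guest6_seen {
	struct in6_addr addr_seen;    /* last global address seen in use */
	struct in6_addr addr_ll_seen; /* last link-local address seen in use */
};

/* Pick the guest address an inbound connection is sent to, given the
 * source address the connection will appear to come from on the tap side */
static const struct in6_addr *guest6_dst(const struct guest6_seen *g,
					 const struct in6_addr *our_src)
{
	assert(g != NULL && our_src != NULL);

	/* A link-local source can only be paired with the guest's
	 * link-local address; otherwise use the global one */
	if (IN6_IS_ADDR_LINKLOCAL(our_src))
		return &g->addr_ll_seen;

	return &g->addr_seen;
}
```

Under this model, the failure follows once the source becomes a global address: the lookup lands on addr_seen, which was never refreshed after the guest picked up its new SLAAC address.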

Stefano Brivio

11:27 a.m.

On Mon, 19 Aug 2024 18:46:31 +1000 David Gibson wrote:

...

On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:

...
Based on Stefano's recent patch for faster tests.

Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.

Suggestions for better names for the new options in patches 20 & 22 are most welcome.

Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.

NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.

I've identified the bug here. IMO, it's a pre-existing problem that only works by accident at the moment. The immediate fix is pretty obvious, but it raises some broader questions

The problem arises because of the MTU changes we make in order to test throughput with different packet sizes. Specifically we change the MTU to values < 1280, which implicitly disables IPv6 since it requires an MTU >= 1280. When we change the MTU back to a larger value IPv6 is re-enabled, but some configuration has been lost in the meantime.

After the MTU is restored the guest reconfigures with NDP, but does not re-DHCPv6. That means the guest gets a SLAAC address in the right prefix but not the exact /128 address we've tried to assign to it. However, at least with the sequence of things we have in the tests, the guest never sends any packets with the new address, so passt doesn't update addr_seen. When the inbound connection comes we send it to the assigned address instead of the guest's actual address and the guest rejects it.

I still have to take a closer look, but I'm fairly sure I hit a similar issue while I was writing these tests originally. I pondered reconfiguring the address via DHCPv6, or using the keep_addr_on_down sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added around that time. Then:

...

This "worked" previously, because before this patch, passt would translate the inbound connection to have source/dest as link-local addresses.

...I realised that this worked and forgot about the whole issue.

...

We *do* have a current addr_ll_seen because (a) it won't change if the guest doesn't change MAC and (b) when IPv6 is re-enabled the NDP traffic the guest generates will have link-local addresses that update addr_ll_seen. With this patch, and a global address for --map-host-loopback, we now need to send to addr_seen instead of addr_ll_seen, hence exposing the bug.
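
To make the selection described above concrete, here is a minimal sketch in C. It is an illustration only: struct guest_addrs6 and guest_dst() are hypothetical names, not passt's actual data structures or functions, and the only behaviour modelled is "link-local source goes to addr_ll_seen, anything else goes to addr_seen".

```c
#include <netinet/in.h>

/* Hypothetical model of the guest addresses discussed in this thread.
 * The field names follow the thread; this is not passt's real context. */
struct guest_addrs6 {
	struct in6_addr addr;         /* address assigned to the guest */
	struct in6_addr addr_seen;    /* last global source address seen */
	struct in6_addr addr_ll_seen; /* last link-local source address seen */
};

/* Pick the guest-side destination for an inbound flow, based on the
 * source address the guest will see after translation. */
static const struct in6_addr *guest_dst(const struct guest_addrs6 *g,
					const struct in6_addr *src)
{
	if (IN6_IS_ADDR_LINKLOCAL(src))
		return &g->addr_ll_seen;

	return &g->addr_seen;
}
```

In this model, a stale addr_seen (still the DHCPv6 address while the guest now holds only a SLAAC address) only hurts on the global branch, which would match the failure appearing once the translated source became a global address.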

In the short term, the obvious fix would be to re-run dhclient -6 in the guest after we twiddle MTU but before running IPv6 tests.

I guess setting keep_addr_on_down (even for "all" interfaces) should work as well.

...

This kind of opens a question about how hard we should try to accommodate guests which don't configure themselves how we told them.

There's a notable distinction between guests temporarily diverging (in different ways) and guests we don't configure at all. It's probably more important to ensure we use the right type of address (security) rather than ensuring we somehow manage to deliver packets at any time (minor glitch otherwise), also because the one you describe is something we're unlikely to hit outside of tests.

...

Personally I'd be ok with saying that nothing works if the guest doesn't configure itself properly, thereby removing addr_seen and addr_ll_seen entirely. But I think, Stefano, you've been against that idea in the past.

Yes, I still think we should support guests that don't use DHCPv6 or NDP at all, or where related exchanges fail for any reason. It improves reliability and compatibility at a small cost. In this case, I think it's a nice feature that we would resume communicating as soon as the guest shows its global unicast address.

If the cost is using the wrong type of address, then no, I'm not suggesting we do that, so I think the change from this series is desirable, but in a general case, things just work and we don't break anything, as far as I know.

-- 
Stefano

David Gibson

11:52 a.m.

On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote:

...

On Mon, 19 Aug 2024 18:46:31 +1000 David Gibson wrote:

...
On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:

...
Based on Stefano's recent patch for faster tests.

Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.

Suggestions for better names for the new options in patches 20 & 22 are most welcome.

Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.

NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.

I've identified the bug here. IMO, it's a pre-existing problem that only works by accident at the moment. The immediate fix is pretty obvious, but it raises some broader questions

The problem arises because of the MTU changes we make in order to test throughput with different packet sizes. Specifically we change the MTU to values < 1280, which implicitly disables IPv6 since it requires an MTU >= 1280. When we change the MTU back to a larger value IPv6 is re-enabled, but some configuration has been lost in the meantime.

After the MTU is restored the guest reconfigures with NDP, but does not re-DHCPv6. That means the guest gets a SLAAC address in the right prefix but not the exact /128 address we've tried to assign to it. However, at least with the sequence of things we have in the tests, the guest never sends any packets with the new address, so passt doesn't update addr_seen. When the inbound connection comes we send it to the assigned address instead of the guest's actual address and the guest rejects it.

I still have to take a closer look, but I'm fairly sure I hit a similar issue while I was writing these tests originally. I pondered reconfiguring the address via DHCPv6, or using the keep_addr_on_down sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added around that time.

Then:

...
This "worked" previously, because before this patch, passt would translate the inbound connection to have source/dest as link-local addresses.

...I realised that this worked and forgot about the whole issue.

...
We *do* have a current addr_ll_seen because (a) it won't change if the guest doesn't change MAC and (b) when IPv6 is re-enabled the NDP traffic the guest generates will have link-local addresses that update addr_ll_seen. With this patch, and a global address for --map-host-loopback, we now need to send to addr_seen instead of addr_ll_seen, hence exposing the bug.

In the short term, the obvious fix would be to re-run dhclient -6 in the guest after we twiddle MTU but before running IPv6 tests.

I guess setting keep_addr_on_down (even for "all" interfaces) should work as well.

Sounds like it. I wasn't aware of that one.

/me tests.. actually, no it doesn't work..

# sysctl -a | grep keep_addr_on_down
net.ipv6.conf.all.keep_addr_on_down = 1
net.ipv6.conf.default.keep_addr_on_down = 1
net.ipv6.conf.dummy0.keep_addr_on_down = 1
net.ipv6.conf.lo.keep_addr_on_down = 0
# ip addr add 2001:db8::1 dev dummy0
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8::1/128 scope global
       valid_lft forever preferred_lft forever
# ip link set dummy0 mtu 1200
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1200 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
# ip link set dummy0 mtu 1500
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff

My guess is that IPv6 being deconfigured because of an unsuitable MTU is considered a different event from a mere "down".

...

...
This kind of opens a question about how hard we should try to accommodate guests which don't configure themselves how we told them.

There's a notable distinction between guests temporarily diverging (in different ways) and guests we don't configure at all.

I'm not really sure what you're getting at here.

...

It's probably more important to ensure we use the right type of address

"type" in what sense here?

...

(security) rather than ensuring we somehow manage to deliver packets at any time (minor glitch otherwise), also because the one you describe is something we're unlikely to hit outside of tests.

...
Personally I'd be ok with saying that nothing works if the guest doesn't configure itself properly, thereby removing addr_seen and addr_ll_seen entirely. But I think, Stefano, you've been against that idea in the past.

Yes, I still think we should support guests that don't use DHCPv6 or NDP at all,

Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to manually configure the interface in the guest to match the address you've configured with -a. Just like you'd expect to have to correctly configure your address on a real network.

...

or where related exchanges fail for any reason. It improves reliability and compatibility at a small cost. In this case, I think it's a nice feature that we would resume communicating as soon as the guest shows its global unicast address.

Hm, maybe. I'm not entirely convinced the cost is so small long term. It's pretty badly incompatible with having multiple guests behind the same passt instance: such as the initial guest bridging or routing to nested guests.

I'm actually not sure if encountering this bug makes me more or less in favour of addr_seen. On the one hand I think it highlights the flakiness of this approach; there are situations where we just won't know the right address. On the other hand it shows a relatively plausible case where the guest won't get exactly the address we want it to (it uses NDP but not DHCPv6).

Hrm... actually this also shows a potential danger in the recent patches to disable DAD in the guest. With DAD enabled, when the guest grabs a new address, we'd expect it to emit DAD messages, which would have the side effect of updating our addr_seen (although I'm pretty sure I hit this patch before the nodad patches were applied, so that doesn't seem to be foolproof).

We could maybe update addr_seen when we send RA messages to the guest - assuming that it will use the same host part (low 64-bits) for both link-local and global addresses. Not sure if that's a widely safe assumption or not.
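
As a rough illustration of the RA idea above, the guess would amount to combining the advertised /64 prefix with the low 64 bits of the link-local address already seen. This is a sketch only: guess_global_addr() is a hypothetical helper, not an existing passt function, and it assumes the guest derives both addresses from the same interface identifier. A guest using temporary or otherwise randomised interface identifiers for its global address would not satisfy that assumption.

```c
#include <string.h>
#include <netinet/in.h>

/* Hypothetical helper: guess the guest's global address by taking the
 * high 64 bits from the advertised prefix and the low 64 bits (the
 * interface identifier) from the guest's observed link-local address. */
static void guess_global_addr(struct in6_addr *out,
			      const struct in6_addr *prefix,
			      const struct in6_addr *ll)
{
	memcpy(&out->s6_addr[0], &prefix->s6_addr[0], 8);
	memcpy(&out->s6_addr[8], &ll->s6_addr[8], 8);
}
```

For example, with prefix 2001:db8::/64 and link-local address fe80::c002:f2ff:fe79:f994, the guess would be 2001:db8::c002:f2ff:fe79:f994.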

...

If the cost is using the wrong type of address, then no, I'm not suggesting we do that, so I think the change from this series is desirable, but in a general case, things just work and we don't break anything, as far as I know.

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson

Stefano Brivio

3:01 p.m.

On Mon, 19 Aug 2024 19:52:49 +1000 David Gibson wrote:

...

On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote:

...
On Mon, 19 Aug 2024 18:46:31 +1000 David Gibson wrote:

...
On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:

...
Based on Stefano's recent patch for faster tests.

Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.

Suggestions for better names for the new options in patches 20 & 22 are most welcome.

Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.

NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.

I've identified the bug here. IMO, it's a pre-existing problem that only works by accident at the moment. The immediate fix is pretty obvious, but it raises some broader questions

The problem arises because of the MTU changes we make in order to test throughput with different packet sizes. Specifically we change the MTU to values < 1280, which implicitly disables IPv6 since it requires an MTU >= 1280. When we change the MTU back to a larger value IPv6 is re-enabled, but some configuration has been lost in the meantime.

After the MTU is restored the guest reconfigures with NDP, but does not re-DHCPv6. That means the guest gets a SLAAC address in the right prefix but not the exact /128 address we've tried to assign to it. However, at least with the sequence of things we have in the tests, the guest never sends any packets with the new address, so passt doesn't update addr_seen. When the inbound connection comes we send it to the assigned address instead of the guest's actual address and the guest rejects it.

I still have to take a closer look, but I'm fairly sure I hit a similar issue while I was writing these tests originally. I pondered reconfiguring the address via DHCPv6, or using the keep_addr_on_down sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added around that time.

Then:

...
This "worked" previously, because before this patch, passt would translate the inbound connection to have source/dest as link-local addresses.

...I realised that this worked and forgot about the whole issue.

...
We *do* have a current addr_ll_seen because (a) it won't change if the guest doesn't change MAC and (b) when IPv6 is re-enabled the NDP traffic the guest generates will have link-local addresses that update addr_ll_seen. With this patch, and a global address for --map-host-loopback, we now need to send to addr_seen instead of addr_ll_seen, hence exposing the bug.

In the short term, the obvious fix would be to re-run dhclient -6 in the guest after we twiddle MTU but before running IPv6 tests.

I guess setting keep_addr_on_down (even for "all" interfaces) should work as well.

Sounds like it. I wasn't aware of that one.

/me tests.. actually, no it doesn't work..

# sysctl -a | grep keep_addr_on_down
net.ipv6.conf.all.keep_addr_on_down = 1
net.ipv6.conf.default.keep_addr_on_down = 1
net.ipv6.conf.dummy0.keep_addr_on_down = 1
net.ipv6.conf.lo.keep_addr_on_down = 0
# ip addr add 2001:db8::1 dev dummy0
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8::1/128 scope global
       valid_lft forever preferred_lft forever
# ip link set dummy0 mtu 1200
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1200 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
# ip link set dummy0 mtu 1500
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff

My guess is that IPv6 being deconfigured because of an unsuitable MTU is considered a different event from a mere "down".

I guess it's because they're not IFA_F_PERMANENT, because addrconf_permanent_addr() has:

	case NETDEV_CHANGEMTU:
		/* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */
		if (dev->mtu < IPV6_MIN_MTU) {
			addrconf_ifdown(dev, dev != net->loopback_dev);
			break;
		}

but addrconf_ifdown() does:

		if (!keep_addr ||
		    !(ifa->flags & IFA_F_PERMANENT) ||
		    addr_is_local(&ifa->addr)) {
			hlist_del_init_rcu(&ifa->addr_lst);
			goto restart;
		}

I'm not sure about the logic behind that. We could actually set those addresses as permanent once the DHCPv6 client configures them, if it's cleaner.
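
The two quoted conditions can be restated as a small userspace model in C, which may make it easier to see which combinations keep an address. This is a sketch under stated assumptions: the function names and MODEL_ constants are made up for illustration, the constant values mirror the kernel's IPV6_MIN_MTU (1280) and IFA_F_PERMANENT (0x80), and how the kernel derives keep_addr is not shown in the quoted snippets, so it is simply taken as an input here.

```c
#include <stdbool.h>

#define MODEL_IPV6_MIN_MTU	1280	/* mirrors the kernel's IPV6_MIN_MTU */
#define MODEL_IFA_F_PERMANENT	0x80	/* mirrors IFA_F_PERMANENT */

/* True if changing the MTU to this value stops IPv6 on the interface,
 * as in the quoted NETDEV_CHANGEMTU case. */
static bool mtu_stops_ipv6(unsigned int mtu)
{
	return mtu < MODEL_IPV6_MIN_MTU;
}

/* True if the address is dropped, restating the quoted condition from
 * addrconf_ifdown(): an address survives only if keep_addr is set, the
 * address is permanent, and it is not a "local" address. */
static bool addr_dropped_on_ifdown(bool keep_addr, unsigned int ifa_flags,
				   bool addr_is_local)
{
	return !keep_addr ||
	       !(ifa_flags & MODEL_IFA_F_PERMANENT) ||
	       addr_is_local;
}
```

Under this model, an MTU of 1200 stops IPv6, and a non-permanent address is dropped even when keep_addr is true.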

...

...
...
This kind of opens a question about how hard we should try to accommodate guests which don't configure themselves how we told them.

There's a notable distinction between guests temporarily diverging (in different ways) and guests we don't configure at all.

I'm not really sure what you're getting at here.

In this case, it's not true that the guest doesn't configure itself in the way we requested -- it's just a temporary diversion from that configuration. Those are different cases that we can handle in different ways, I think. If it's a glitch that will only happen during testing, let's work around that. But if the guest really ignores DHCPv6 information, I think we should keep that working.

...

...
It's probably more important to ensure we use the right type of address

"type" in what sense here?

Global unicast instead of link-local.

...

...
(security) rather than ensuring we somehow manage to deliver packets at any time (minor glitch otherwise), also because the one you describe is something we're unlikely to hit outside of tests.

...
Personally I'd be ok with saying that nothing works if the guest doesn't configure itself properly, thereby removing addr_seen and addr_ll_seen entirely. But I think, Stefano, you've been against that idea in the past.

Yes, I still think we should support guests that don't use DHCPv6 or NDP at all,

Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to manually configure the interface in the guest to match the address you've configured with -a. Just like you'd expect to have to correctly configure your address on a real network.

True, but if we make correctness as optional as possible, we'll be more compatible (less time spent by users fixing situations that don't necessarily need fixing, less time spent by developers to look into reports, no matter who's at fault).

...

...
or where related exchanges fail for any reason. It improves reliability and compatibility at a small cost. In this case, I think it's a nice feature that we would resume communicating as soon as the guest shows its global unicast address.

Hm, maybe. I'm not entirely convinced the cost is so small long term. It's pretty badly incompatible with having multiple guests behind the same passt instance: such as the initial guest bridging or routing to nested guests.

Why? We will need to hash the interface/guest index anyway, for outbound flows. And for inbound flows, if a guest steals the address of another guest, we'll give priority to the normal 'addr' versions instead of the '_seen' ones, to decide how to direct traffic.

...

I'm actually not sure if encountering this bug makes me more or less in favour of addr_seen. On the one hand I think it highlights the flakiness of this approach; there are situations where we just won't know the right address.

I don't understand this argument: indeed, there are such situations, and they are annoying. Why should we make them more common?

...

On the other hand it shows a relatively plausible case where the guest won't get exactly the address we want it to (it uses NDP but not DHCPv6).

Hrm... actually this also shows a potential danger in the recent patches to disable DAD in the guest. With DAD enabled, when the guest grabs a new address, we'd expect it to emit DAD messages, which would have the side effect of updating our addr_seen (although I'm pretty sure I hit this patch before the nodad patches were applied, so that doesn't seem to be foolproof).

Well, but we do that for containers with --config-net only. In that case, the addresses we configure have infinite lifetime anyway. Besides, I don't think we need to have addr_seen updated as quickly and correctly as possible just for the sake of it, we can also update it when we get any other neighbour solicitation because the guest is actually using the network. It's not meant to be perfect.

...

We could maybe update addr_seen when we send RA messages to the guest - assuming that it will use the same host part (low 64-bits) for both link-local and global addresses. Not sure if that's a widely safe assumption or not.

I don't understand: what case are you trying to cover with this?

-- 
Stefano

David Gibson

20 Aug

2:42 a.m.

On Mon, Aug 19, 2024 at 03:01:00PM +0200, Stefano Brivio wrote:

...

On Mon, 19 Aug 2024 19:52:49 +1000 David Gibson wrote:

...
On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote:

...
On Mon, 19 Aug 2024 18:46:31 +1000 David Gibson wrote:

...
On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:

...
Based on Stefano's recent patch for faster tests.

Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.

Suggestions for better names for the new options in patches 20 & 22 are most welcome.

Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.

NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.

I've identified the bug here. IMO, it's a pre-existing problem that only works by accident at the moment. The immediate fix is pretty obvious, but it raises some broader questions

The problem arises because of the MTU changes we make in order to test throughput with different packet sizes. Specifically we change the MTU to values < 1280, which implicitly disables IPv6 since it requires an MTU >= 1280. When we change the MTU back to a larger value IPv6 is re-enabled, but some configuration has been lost in the meantime.

After the MTU is restored the guest reconfigures with NDP, but does not re-DHCPv6. That means the guest gets a SLAAC address in the right prefix but not the exact /128 address we've tried to assign to it. However, at least with the sequence of things we have in the tests, the guest never sends any packets with the new address, so passt doesn't update addr_seen. When the inbound connection comes we send it to the assigned address instead of the guest's actual address and the guest rejects it.

I still have to take a closer look, but I'm fairly sure I hit a similar issue while I was writing these tests originally. I pondered reconfiguring the address via DHCPv6, or using the keep_addr_on_down sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added around that time.

Then:

...
This "worked" previously, because before this patch, passt would translate the inbound connection to have source/dest as link-local addresses.

...I realised that this worked and forgot about the whole issue.

...
We *do* have a current addr_ll_seen because (a) it won't change if the guest doesn't change MAC and (b) when IPv6 is re-enabled the NDP traffic the guest generates will have link-local addresses that update addr_ll_seen. With this patch, and a global address for --map-host-loopback, we now need to send to addr_seen instead of addr_ll_seen, hence exposing the bug.

In the short term, the obvious fix would be to re-run dhclient -6 in the guest after we twiddle MTU but before running IPv6 tests.

I guess setting keep_addr_on_down (even for "all" interfaces) should work as well.

Sounds like it. I wasn't aware of that one.

/me tests.. actually, no it doesn't work..

# sysctl -a | grep keep_addr_on_down
net.ipv6.conf.all.keep_addr_on_down = 1
net.ipv6.conf.default.keep_addr_on_down = 1
net.ipv6.conf.dummy0.keep_addr_on_down = 1
net.ipv6.conf.lo.keep_addr_on_down = 0
# ip addr add 2001:db8::1 dev dummy0
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8::1/128 scope global
       valid_lft forever preferred_lft forever
# ip link set dummy0 mtu 1200
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1200 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
# ip link set dummy0 mtu 1500
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff

My guess is that IPv6 being deconfigured because of an unsuitable MTU is considered a different event from a mere "down".

I guess it's because they're not IFA_F_PERMANENT, because addrconf_permanent_addr() has:

	case NETDEV_CHANGEMTU:
		/* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */
		if (dev->mtu < IPV6_MIN_MTU) {
			addrconf_ifdown(dev, dev != net->loopback_dev);
			break;
		}

but addrconf_ifdown() does:

		if (!keep_addr ||
		    !(ifa->flags & IFA_F_PERMANENT) ||
		    addr_is_local(&ifa->addr)) {
			hlist_del_init_rcu(&ifa->addr_lst);
			goto restart;
		}

I'm not sure about the logic behind that. We could actually set those addresses as permanent once the DHCPv6 client configures them, if it's cleaner.

Huh. Not in the passt/VM case, though, which is where I actually encountered this.

...

...
...
...
This kind of opens a question about how hard we should try to accommodate guests which don't configure themselves how we told them.

There's a notable distinction between guests temporarily diverging (in different ways) and guests we don't configure at all.

I'm not really sure what you're getting at here.

In this case, it's not true that the guest doesn't configure itself in the way we requested -- it's just a temporary diversion from that configuration.

Oh, I see. Assuming that at some point the DHCP client will re-run.

...

Those are different cases that we can handle in different ways, I think. If it's a glitch that will only happen during testing, let's work around that.

But if the guest really ignores DHCPv6 information, I think we should keep that working.

...
...
It's probably more important to ensure we use the right type of address

"type" in what sense here?

Global unicast instead of link-local.

Ok.

...

...
...
(security) rather than ensuring we somehow manage to deliver packets at any time (minor glitch otherwise), also because the one you describe is something we're unlikely to hit outside of tests.

...
Personally I'd be ok with saying that nothing works if the guest doesn't configure itself properly, thereby removing addr_seen and addr_ll_seen entirely. But I think, Stefano, you've been against that idea in the past.

Yes, I still think we should support guests that don't use DHCPv6 or NDP at all,

Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to manually configure the interface in the guest to match the address you've configured with -a. Just like you'd expect to have to correctly configure your address on a real network.

True, but if we make correctness as optional as possible, we'll be more compatible (less time spent by users fixing situations that don't necessarily need fixing, less time spent by developers to look into reports, no matter who's at fault).

Eh, maybe. Unless us trying to make sense of a nonsense situation causes some unpredictable behaviour that breaks something else.

...

...
...
or where related exchanges fail for any reason. It improves reliability and compatibility at a small cost. In this case, I think it's a nice feature that we would resume communicating as soon as the guest shows its global unicast address.

Hm, maybe. I'm not entirely convinced the cost is so small long term. It's pretty badly incompatible with having multiple guests behind the same passt instance: such as the initial guest bridging or routing to nested guests.

Why? We will need to hash the interface/guest index anyway, for outbound flows.

If we have separate interfaces for each guest, yes. But not if we have multiple guests behind a single tap because the initial guest sets up a bridge or routing. Then we have nothing but the address.

...

And for inbound flows, if a guest steals the address of another guest, we'll give priority to the normal 'addr' versions instead of the '_seen' ones, to decide how to direct traffic.

I don't see how we'd know we're in this situation, so when to prioritise which address over the other.

...

...
I'm actually not sure if encountering this bug makes me more or less in favour of addr_seen. On the one hand I think it highlights the flakiness of this approach; there are situations where we just won't know the right address.

I don't understand this argument: indeed, there are such situations, and they are annoying. Why should we make them more common?

Because predictability is good, and working _most_ of the time is a failure of predictability.

...

...
On the other hand it shows a relatively plausible case where the guest won't get exactly the address we want it to (it uses NDP but not DHCPv6).

Hrm... actually this also shows a potential danger in the recent patches to disable DAD in the guest. With DAD enabled, when the guest grabs a new address, we'd expect it to emit DAD messages, which would have the side effect of updating our addr_seen (although I'm pretty sure I hit this patch before the nodad patches were applied, so that doesn't seem to be foolproof).

Well, but we do that for containers with --config-net only. In that case, the addresses we configure have infinite lifetime anyway.

Oh, good point. Hrm... then I'm unsure why the guest wasn't re-DADing its new address.

...

Besides, I don't think we need to have addr_seen updated as quickly and correctly as possible just for the sake of it, we can also update it when we get any other neighbour solicitation because the guest is actually using the network. It's not meant to be perfect.

If the guest is a pure server (a common case for containers AFAICT), then I don't know that we can expect NS messages for anything other than the default gateway, which is (typically) link-local and so won't help us to learn the new global address.

...

...
We could maybe update addr_seen when we send RA messages to the guest - assuming that it will use the same host part (low 64-bits) for both link-local and global addresses. Not sure if that's a widely safe assumption or not.

I don't understand: what case are you trying to cover with this?

A case just like the one in the tests: the interface bounces, and we get NDP traffic on the link-local address, but nothing on the global address before an inbound connection.

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson

Stefano Brivio

10:39 p.m.

On Tue, 20 Aug 2024 10:42:17 +1000 David Gibson wrote:

...

On Mon, Aug 19, 2024 at 03:01:00PM +0200, Stefano Brivio wrote:

...
On Mon, 19 Aug 2024 19:52:49 +1000 David Gibson wrote:

...
On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote:

...
On Mon, 19 Aug 2024 18:46:31 +1000 David Gibson wrote:

...
On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:

...
Based on Stefano's recent patch for faster tests.

Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.

Suggestions for better names for the new options in patches 20 & 22 are most welcome.

Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.

NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.

I've identified the bug here. IMO, it's a pre-existing problem that only works by accident at the moment. The immediate fix is pretty obvious, but it raises some broader questions.

The problem arises because of the MTU changes we make in order to test throughput with different packet sizes. Specifically we change the MTU to values < 1280, which implicitly disables IPv6 since it requires an MTU >= 1280. When we change the MTU back to a larger value IPv6 is re-enabled, but some configuration has been lost in the meantime.

After the MTU is restored the guest reconfigures with NDP, but does not re-DHCPv6. That means the guest gets a SLAAC address in the right prefix but not the exact /128 address we've tried to assign to it. However, at least with the sequence of things we have in the tests, the guest never sends any packets with the new address, so passt doesn't update addr_seen. When the inbound connection comes we send it to the assigned address instead of the guest's actual address and the guest rejects it.

I still have to take a closer look, but I'm fairly sure I hit a similar issue while I was writing these tests originally. I pondered reconfiguring the address via DHCPv6, or using the keep_addr_on_down sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added around that time.

Then:

...
This "worked" previously, because before this patch, passt would translate the inbound connection to have source/dest as link-local addresses.

...I realised that this worked and forgot about the whole issue.

...
We *do* have a current addr_ll_seen because (a) it won't change if the guest doesn't change MAC and (b) when IPv6 is re-enabled the NDP traffic the guest generates will have link-local addresses that update addr_ll_seen. With this patch, and a global address for --map-host-loopback, we now need to send to addr_seen instead of addr_ll_seen, hence exposing the bug.

In the short term, the obvious fix would be to re-run dhclient -6 in the guest after we twiddle MTU but before running IPv6 tests.

I guess setting keep_addr_on_down (even for "all" interfaces) should work as well.

Sounds like it. I wasn't aware of that one.

/me tests.. actually, no it doesn't work..

# sysctl -a | grep keep_addr_on_down
net.ipv6.conf.all.keep_addr_on_down = 1
net.ipv6.conf.default.keep_addr_on_down = 1
net.ipv6.conf.dummy0.keep_addr_on_down = 1
net.ipv6.conf.lo.keep_addr_on_down = 0
# ip addr add 2001:db8::1 dev dummy0
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8::1/128 scope global
       valid_lft forever preferred_lft forever
# ip link set dummy0 mtu 1200
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1200 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
# ip link set dummy0 mtu 1500
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff

My guess is that IPv6 being deconfigured because of an unsuitable MTU is considered a different event from a mere "down".

I guess it's because they're not IFA_F_PERMANENT, because addrconf_permanent_addr() has:

	case NETDEV_CHANGEMTU:
		/* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */
		if (dev->mtu < IPV6_MIN_MTU) {
			addrconf_ifdown(dev, dev != net->loopback_dev);
			break;
		}

but addrconf_ifdown() does:

		if (!keep_addr ||
		    !(ifa->flags & IFA_F_PERMANENT) ||
		    addr_is_local(&ifa->addr)) {
			hlist_del_init_rcu(&ifa->addr_lst);
			goto restart;
		}

I'm not sure about the logic behind that. We could actually set those addresses as permanent once the DHCPv6 client configures them, if it's cleaner.

Huh. Not in the passt/VM case, though, which is where I actually encountered this.

I meant using ip(8) from the test script itself, but it doesn't actually make sense:

# ip address change 2a01:4f8:222:904:c800:94ff:fe29:a8d/64 permanent dev eth0
Warning: permanent option is not mutable from userspace

because (RFC 3549):

IFA_F_PERMANENT For a permanent address set by the user. When this is not set, it means the address was dynamically created (e.g., by stateless autoconfiguration).

So the address you used in your test _should_ have IFA_F_PERMANENT. The plot thickens.

I just tried this, which confirms your hypothesis that bringing the link down is a different event:

# ip addr add 2001:db8::1 dev dummy0
# ip link set dummy0 down
# ip addr show dev dummy0
5: dummy0: mtu 1280 qdisc noqueue state DOWN group default qlen 1000
    link/ether 02:59:00:28:1b:5f brd ff:ff:ff:ff:ff:ff
    inet 1.2.3.1/24 scope global dummy0
       valid_lft forever preferred_lft forever
    inet6 2001:db8::1/128 scope global
       valid_lft forever preferred_lft forever
# ip link set dummy0 mtu 1279
# ip addr show dev dummy0
5: dummy0: mtu 1279 qdisc noqueue state DOWN group default qlen 1000
    link/ether 02:59:00:28:1b:5f brd ff:ff:ff:ff:ff:ff
    inet 1.2.3.1/24 scope global dummy0
       valid_lft forever preferred_lft forever

...I just can't see that from the code.

...

...
...
...
...
This kind of opens a question about how hard we should try to accommodate guests which don't configure themselves how we told them.

There's a notable distinction between guests temporarily diverging (in different ways) and guests we don't configure at all.

I'm not really sure what you're getting at here.

In this case, it's not true that the guest doesn't configure itself in the way we requested -- it's just a temporary diversion from that configuration.

Oh, I see. Assuming that at some point the DHCP client will re-run.

...
Those are different cases that we can handle in different ways, I think. If it's a glitch that will only happen during testing, let's work around that.

But if the guest really ignores DHCPv6 information, I think we should keep that working.

...
...
It's probably more important to ensure we use the right type of address

"type" in what sense here?

Global unicast instead of link-local.

Ok.

...
...
...
(security) rather than ensuring we somehow manage to deliver packets at any time (minor glitch otherwise), also because the one you describe is something we're unlikely to hit outside of tests.

...
Personally I'd be ok with saying that nothing works if the guest doesn't configure itself properly, thereby removing addr_seen and addr_ll_seen entirely. But I think, Stefano, you've been against that idea in the past.

Yes, I still think we should support guests that don't use DHCPv6 or NDP at all,

Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to manually configure the interface in the guest to match the address you've configured with -a. Just like you'd expect to have to correctly configure your address on a real network.

True, but if we make correctness as optional as possible, we'll be more compatible (less time spent by users fixing situations that don't necessarily need fixing, less time spent by developers to look into reports, no matter who's at fault).

Eh, maybe. Unless us trying to make sense of a nonsense situation causes some unpredictable behaviour that breaks something else.

...
...
...
or where related exchanges fail for any reason. It improves reliability and compatibility at a small cost. In this case, I think it's a nice feature that we would resume communicating as soon as the guest shows its global unicast address.

Hm, maybe. I'm not entirely convinced the cost is so small long term. It's pretty badly incompatible with having multiple guests behind the same passt instance: such as the initial guest bridging or routing to nested guests.

Why? We will need to hash the interface/guest index anyway, for outbound flows.

If we have separate interfaces for each guest, yes. But not if we have multiple guests behind a single tap because the initial guest sets up a bridge or routing. Then we have nothing but the address.

...but then we should have multiple addresses anyway. By the way, I'm not sure we'll ever be able to support that kind of configuration. How does a guest set up a bridge and use passt at the same time?

...

...
And for inbound flows, if a guest steals the address of another guest, we'll give priority to the normal 'addr' versions instead of the '_seen' ones, to decide how to direct traffic.

I don't see how we'd know we're in this situation, so when to prioritise which address over the other.

In the set of all addr_seen and addr, we would have at least a non-unique value. Or, practically speaking, we should refuse to set addr_seen if it matches addr for another guest.

...

...
...
I'm actually not sure if encountering this bug makes me more or less in favour of addr_seen. On the one hand I think it highlights the flakiness of this approach; there are situations where we just won't know the right address.

I don't understand this argument: indeed, there are such situations, and they are annoying. Why should we make them more common?

Because predictability is good, and working _most_ of the time is a failure of predictability.

It avoids substantial effort and frustration for everybody involved though. The practical problem with lacking predictability is if it makes things harder to debug, I guess, which shouldn't be the case here.

...

...
...
On the other hand it shows a relatively plausible case where the guest won't get exactly the address we want it to (it uses NDP but not DHCPv6)

Hrm... actually this also shows a potential danger in the recent patches to disable DAD in the guest. With DAD enabled, when the guest grabs a new address, we'd expect it to emit DAD messages, which would have the side effect of updating our addr_seen (although I'm pretty sure I hit this patch before the nodad patches were applied, so that doesn't seem to be foolproof).

Well, but we do that for containers with --config-net only. In that case, the addresses we configure have infinite lifetime anyway.

Oh, good point. Hrm... then I'm unsure why the guest wasn't re-DADing its new address.

It probably did, but we ignored that anyway because DAD is done by sending neighbour solicitations with an unspecified address as source, for example (the "change" here drops "nodad"):

$ ./pasta --config-net -p dad.pcap
Saving packet capture to dad.pcap
# ip addr change dev enp9s0 fe80::3882:b5ff:fe01:e9a1/64
# tshark -r dad.pcap |grep Neigh
Running as user "root" and group "root". This could be dangerous.
   10 2.642467 :: → ff02::1:ff01:e9a1 ICMPv6 86 Neighbor Solicitation for fe80::3882:b5ff:fe01:e9a1

and in tap6_handler() we do:

		} else if (!IN6_IS_ADDR_UNSPECIFIED(saddr)){
			c->ip6.addr_seen = *saddr;
		}

...then, in ndp():

	if (IN6_IS_ADDR_UNSPECIFIED(saddr))
		return 1;

we could set addr_seen by looking at the *target* address of the neighbour solicitation when the source address is ::, but it's not implemented yet.

...

...
Besides, I don't think we need to have addr_seen updated as quickly and correctly as possible just for the sake of it, we can also update it when we get any other neighbour solicitation because the guest is actually using the network. It's not meant to be perfect.

If the guest is a pure server (a common case for containers AFAICT), then I don't know that we can expect NS messages for anything other than the default gateway, which is (typically) link-local and so won't help us to learn the new global address.

Containers running actual applications are noisy. I've only seen this kind of problem (addr_seen not set/matching) in particularly crafted test environments.

...

...
...
We could maybe update addr_seen when we send RA messages to the guest - assuming that it will use the same host part (low 64-bits) for both link-local and global addresses. Not sure if that's a widely safe assumption or not.

I don't understand: what case are you trying to cover with this?

A case just like the one in the tests: the interface bounces, and we get NDP traffic on the link-local address, but nothing on the global address before an inbound connection.

Oh, I see. I think it makes sense, even though we'll set addr_seen a bit too early, but not enough to be a practical issue, I think.

-- 
Stefano

David Gibson

21 Aug

4:51 a.m.

On Tue, Aug 20, 2024 at 10:39:26PM +0200, Stefano Brivio wrote:

...

On Tue, 20 Aug 2024 10:42:17 +1000 David Gibson wrote:

...
On Mon, Aug 19, 2024 at 03:01:00PM +0200, Stefano Brivio wrote:

...
On Mon, 19 Aug 2024 19:52:49 +1000 David Gibson wrote:

...
On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote:

...
On Mon, 19 Aug 2024 18:46:31 +1000 David Gibson wrote:

...
On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:

Based on Stefano's recent patch for faster tests.

Allow the user to specify which addresses are translated when used by the guest, rather than always being the gateway address or nothing. We also allow this remapping to go to the host's global address (more precisely the address assigned to the guest) rather than just host loopback.

Suggestions for better names for the new options in patches 20 & 22 are most welcome.

Along the way to implementing that make many changes to clarify what various addresses we track mean, fixing a number of small bugs as well.

NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf tests. I haven't managed to figure out why it's causing the problem, or even what the exact triggering conditions are (running the single stalling iperf alone doesn't do it). Have to wrap up for today, so I thought I'd get this out for review anyway.

I've identified the bug here. IMO, it's a pre-existing problem that only works by accident at the moment. The immediate fix is pretty obvious, but it raises some broader questions.

The problem arises because of the MTU changes we make in order to test throughput with different packet sizes. Specifically we change the MTU to values < 1280, which implicitly disables IPv6 since it requires an MTU >= 1280. When we change the MTU back to a larger value IPv6 is re-enabled, but some configuration has been lost in the meantime.
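To make the threshold concrete, here is a minimal sketch of the check described above. The helper name mtu_allows_ipv6() and the SKETCH_ prefix are made up for illustration; only the 1280 byte minimum comes from the message.

```c
#include <stdbool.h>

/* Minimum link MTU for IPv6, as stated above. The kernel snippet quoted
 * in this thread calls it IPV6_MIN_MTU; a different name is used here so
 * the sketch can't clash with system headers.
 */
#define SKETCH_IPV6_MIN_MTU	1280

/* Hypothetical helper: can an interface with this MTU carry IPv6? */
static bool mtu_allows_ipv6(unsigned int mtu)
{
	return mtu >= SKETCH_IPV6_MIN_MTU;
}
```

With the values from the tests, an MTU of 1200 fails the check (IPv6 goes away) and 1500 passes it again, but passing again doesn't bring back the addresses lost in between.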

After the MTU is restored the guest reconfigures with NDP, but does not re-DHCPv6. That means the guest gets a SLAAC address in the right prefix but not the exact /128 address we've tried to assign to it. However, at least with the sequence of things we have in the tests, the guest never sends any packets with the new address, so passt doesn't update addr_seen. When the inbound connection comes we send it to the assigned address instead of the guest's actual address and the guest rejects it.

I still have to take a closer look, but I'm fairly sure I hit a similar issue while I was writing these tests originally. I pondered reconfiguring the address via DHCPv6, or using the keep_addr_on_down sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added around that time.

Then:

...
This "worked" previously, because before this patch, passt would translate the inbound connection to have source/dest as link-local addresses.

...I realised that this worked and forgot about the whole issue.

...
We *do* have a current addr_ll_seen because (a) it won't change if the guest doesn't change MAC and (b) when IPv6 is re-enabled the NDP traffic the guest generates will have link-local addresses that update addr_ll_seen. With this patch, and a global address for --map-host-loopback, we now need to send to addr_seen instead of addr_ll_seen, hence exposing the bug.
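The choice between the two "seen" addresses can be sketched roughly as below. This is an illustration under assumptions, not the code from the series: struct ip6_seen is a cut-down stand-in for the real context, and guest_dst() is a hypothetical name.

```c
#include <netinet/in.h>

/* Hypothetical, cut-down stand-in for the IPv6 context discussed in this
 * thread; only the two "seen" fields are modelled.
 */
struct ip6_seen {
	struct in6_addr addr_seen;	/* last global source seen from guest */
	struct in6_addr addr_ll_seen;	/* last link-local source seen from guest */
};

/* Pick the guest-side destination for an inbound flow: a link-local
 * source has to pair with the guest's link-local address, anything else
 * with its global one.
 */
static const struct in6_addr *guest_dst(const struct ip6_seen *ip6,
					const struct in6_addr *src)
{
	if (IN6_IS_ADDR_LINKLOCAL(src))
		return &ip6->addr_ll_seen;

	return &ip6->addr_seen;
}
```

Once the translated source becomes a global address, the second branch is taken, and a stale addr_seen is what the guest ends up being addressed with.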

In the short term, the obvious fix would be to re-run dhclient -6 in the guest after we twiddle MTU but before running IPv6 tests.

I guess setting keep_addr_on_down (even for "all" interfaces) should work as well.

Sounds like it. I wasn't aware of that one.

/me tests.. actually, no it doesn't work..

# sysctl -a | grep keep_addr_on_down
net.ipv6.conf.all.keep_addr_on_down = 1
net.ipv6.conf.default.keep_addr_on_down = 1
net.ipv6.conf.dummy0.keep_addr_on_down = 1
net.ipv6.conf.lo.keep_addr_on_down = 0
# ip addr add 2001:db8::1 dev dummy0
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8::1/128 scope global
       valid_lft forever preferred_lft forever
# ip link set dummy0 mtu 1200
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1200 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
# ip link set dummy0 mtu 1500
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0: mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff

My guess is that IPv6 being deconfigured because of an unsuitable MTU is considered a different event from a mere "down".

I guess it's because they're not IFA_F_PERMANENT, because addrconf_permanent_addr() has:

	case NETDEV_CHANGEMTU:
		/* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */
		if (dev->mtu < IPV6_MIN_MTU) {
			addrconf_ifdown(dev, dev != net->loopback_dev);
			break;
		}

but addrconf_ifdown() does:

		if (!keep_addr ||
		    !(ifa->flags & IFA_F_PERMANENT) ||
		    addr_is_local(&ifa->addr)) {
			hlist_del_init_rcu(&ifa->addr_lst);
			goto restart;
		}

I'm not sure about the logic behind that. We could actually set those addresses as permanent once the DHCPv6 client configures them, if it's cleaner.

Huh. Not in the passt/VM case, though, which is where I actually encountered this.

I meant using ip(8) from the test script itself, but it doesn't actually make sense:

# ip address change 2a01:4f8:222:904:c800:94ff:fe29:a8d/64 permanent dev eth0
Warning: permanent option is not mutable from userspace

because (RFC 3549):

IFA_F_PERMANENT For a permanent address set by the user. When this is not set, it means the address was dynamically created (e.g., by stateless autoconfiguration).

So the address you used in your test _should_ have IFA_F_PERMANENT. The plot thickens.

I just tried this, which confirms your hypothesis that bringing the link down is a different event:

# ip addr add 2001:db8::1 dev dummy0
# ip link set dummy0 down
# ip addr show dev dummy0
5: dummy0: mtu 1280 qdisc noqueue state DOWN group default qlen 1000
    link/ether 02:59:00:28:1b:5f brd ff:ff:ff:ff:ff:ff
    inet 1.2.3.1/24 scope global dummy0
       valid_lft forever preferred_lft forever
    inet6 2001:db8::1/128 scope global
       valid_lft forever preferred_lft forever
# ip link set dummy0 mtu 1279
# ip addr show dev dummy0
5: dummy0: mtu 1279 qdisc noqueue state DOWN group default qlen 1000
    link/ether 02:59:00:28:1b:5f brd ff:ff:ff:ff:ff:ff
    inet 1.2.3.1/24 scope global dummy0
       valid_lft forever preferred_lft forever

...I just can't see that from the code.

Ok.

...

...
...
...
...
...
This kind of opens a question about how hard we should try to accommodate guests which don't configure themselves how we told them.

There's a notable distinction between guests temporarily diverging (in different ways) and guests we don't configure at all.

I'm not really sure what you're getting at here.

In this case, it's not true that the guest doesn't configure itself in the way we requested -- it's just a temporary diversion from that configuration.

Oh, I see. Assuming that at some point the DHCP client will re-run.

...
Those are different cases that we can handle in different ways, I think. If it's a glitch that will only happen during testing, let's work around that.

But if the guest really ignores DHCPv6 information, I think we should keep that working.

...
...
It's probably more important to ensure we use the right type of address

"type" in what sense here?

Global unicast instead of link-local.

Ok.

...
...
...
(security) rather than ensuring we somehow manage to deliver packets at any time (minor glitch otherwise), also because the one you describe is something we're unlikely to hit outside of tests.

...
Personally I'd be ok with saying that nothing works if the guest doesn't configure itself properly, thereby removing addr_seen and addr_ll_seen entirely. But I think, Stefano, you've been against that idea in the past.

Yes, I still think we should support guests that don't use DHCPv6 or NDP at all,

Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to manually configure the interface in the guest to match the address you've configured with -a. Just like you'd expect to have to correctly configure your address on a real network.

True, but if we make correctness as optional as possible, we'll be more compatible (less time spent by users fixing situations that don't necessarily need fixing, less time spent by developers to look into reports, no matter who's at fault).

Eh, maybe. Unless us trying to make sense of a nonsense situation causes some unpredictable behaviour that breaks something else.

...
...
...
or where related exchanges fail for any reason. It improves reliability and compatibility at a small cost. In this case, I think it's a nice feature that we would resume communicating as soon as the guest shows its global unicast address.

Hm, maybe. I'm not entirely convinced the cost is so small long term. It's pretty badly incompatible with having multiple guests behind the same passt instance: such as the initial guest bridging or routing to nested guests.

Why? We will need to hash the interface/guest index anyway, for outbound flows.

If we have separate interfaces for each guest, yes. But not if we have multiple guests behind a single tap because the initial guest sets up a bridge or routing. Then we have nothing but the address.

...but then we should have multiple addresses anyway.

Yes.. that's kind of my point.

...

By the way, I'm not sure we'll ever be able to support that kind of configuration.

I don't see why not. It would require configuration so that it's clear what each inbound forward targets. But I don't see any inherent problem here, though there are a number of current implementation details which prevent it (addr_seen is one, replying to all arps is another).

...

How does a guest set up a bridge and use passt at the same time?

I'm not thinking of a bridge shared with the host, but a bridge (or routing) between nested guests or namespaces. This is essentially the "private switch with pasta uplink" case we've discussed occasionally before. It doesn't technically have to be nested guests - the guest could bridge between its uplink and a tunnel, but nested guests is the likely use case.

...

...
...
And for inbound flows, if a guest steals the address of another guest, we'll give priority to the normal 'addr' versions instead of the '_seen' ones, to decide how to direct traffic.

I don't see how we'd know we're in this situation, so when to prioritise which address over the other.

In the set of all addr_seen and addr, we would have at least a non-unique value. Or, practically speaking, we should refuse to set addr_seen if it matches addr for another guest.

Ah, ok. So again, assuming a static configuration of known guests, rather than a local bridge established by a guest at runtime.
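The "refuse to set addr_seen if it matches addr for another guest" idea could look something like this. It is only a sketch: passt tracks a single guest today, so struct guest, the array of guests and guest_set_addr_seen() are all hypothetical, and it assumes the static set of known guests mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <netinet/in.h>

/* Hypothetical per-guest state, for illustration only */
struct guest {
	struct in6_addr addr;		/* address we assigned */
	struct in6_addr addr_seen;	/* address observed in guest traffic */
};

/* Record @saddr as addr_seen for guest @idx unless it is the assigned
 * address of a different guest. Returns true if addr_seen was updated.
 */
static bool guest_set_addr_seen(struct guest *guests, size_t n, size_t idx,
				const struct in6_addr *saddr)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (i != idx && IN6_ARE_ADDR_EQUAL(&guests[i].addr, saddr))
			return false;	/* would steal another guest's address */
	}

	guests[idx].addr_seen = *saddr;
	return true;
}
```

This does nothing for the runtime bridge case, where the extra machines behind the tap aren't known in advance and there is no table to check against.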

...

...
...
...
I'm actually not sure if encountering this bug makes me more or less in favour of addr_seen. On the one hand I think it highlights the flakiness of this approach; there are situations where we just won't know the right address.

I don't understand this argument: indeed, there are such situations, and they are annoying. Why should we make them more common?

Because predictability is good, and working _most_ of the time is a failure of predictability.

It avoids substantial effort and frustration for everybody involved though. The practical problem with lacking predictability is if it makes things harder to debug, I guess, which shouldn't be the case here.

...
...
...
On the other hand it shows a relatively plausible case where the guest won't get exactly the address we want it to (it uses NDP but not DHCPv6)

Hrm... actually this also shows a potential danger in the recent patches to disable DAD in the guest. With DAD enabled, when the guest grabs a new address, we'd expect it to emit DAD messages, which would have the side effect of updating our addr_seen (although I'm pretty sure I hit this patch before the nodad patches were applied, so that doesn't seem to be foolproof).

Well, but we do that for containers with --config-net only. In that case, the addresses we configure have infinite lifetime anyway.

Oh, good point. Hrm... then I'm unsure why the guest wasn't re-DADing its new address.

It probably did, but we ignored that anyway because DAD is done by sending neighbour solicitations with an unspecified address as source, for example (the "change" here drops "nodad"):

$ ./pasta --config-net -p dad.pcap
Saving packet capture to dad.pcap
# ip addr change dev enp9s0 fe80::3882:b5ff:fe01:e9a1/64
# tshark -r dad.pcap |grep Neigh
Running as user "root" and group "root". This could be dangerous.
   10 2.642467 :: → ff02::1:ff01:e9a1 ICMPv6 86 Neighbor Solicitation for fe80::3882:b5ff:fe01:e9a1

and in tap6_handler() we do:

		} else if (!IN6_IS_ADDR_UNSPECIFIED(saddr)){
			c->ip6.addr_seen = *saddr;
		}

...then, in ndp():

	if (IN6_IS_ADDR_UNSPECIFIED(saddr))
		return 1;

we could set addr_seen by looking at the *target* address of the neighbour solicitation when the source address is ::, but it's not implemented yet.

Right. I forgot the NS went out with :: as source. Snooping the NS that way again assumes that there's only one logical machine on the guest side. But since this is for addr_seen which fundamentally assumes that anyway, I guess it doesn't make anything worse.
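A rough sketch of what snooping the DAD probe could look like, given that this isn't implemented yet. ns_guest_addr() is a hypothetical helper, not existing passt code, and it inherits the "one logical machine on the guest side" assumption just mentioned.

```c
#include <stddef.h>
#include <netinet/in.h>

/* Hypothetical helper: given the source address of a neighbour
 * solicitation and its target, work out which address, if any, tells us
 * something about the guest.
 *
 * A regular NS carries the guest's own address as source. A DAD probe
 * has the unspecified address (::) as source, and the address the guest
 * is about to claim as target.
 */
static const struct in6_addr *ns_guest_addr(const struct in6_addr *saddr,
					    const struct in6_addr *target)
{
	if (!IN6_IS_ADDR_UNSPECIFIED(saddr))
		return saddr;

	if (IN6_IS_ADDR_UNSPECIFIED(target) || IN6_IS_ADDR_MULTICAST(target))
		return NULL;	/* nothing sensible to learn from this probe */

	return target;
}
```

For the capture above (source ::, target fe80::3882:b5ff:fe01:e9a1) this returns the target. The caller would still need to decide whether the result updates addr_seen or addr_ll_seen, depending on whether the address is link-local.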

...

...
...
Besides, I don't think we need to have addr_seen updated as quickly and correctly as possible just for the sake of it, we can also update it when we get any other neighbour solicitation because the guest is actually using the network. It's not meant to be perfect.

If the guest is a pure server (a common case for containers AFAICT), then I don't know that we can expect NS messages for anything other than the default gateway, which is (typically) link-local and so won't help us to learn the new global address.

Containers running actual applications are noisy. I've only seen this kind of problem (addr_seen not set/matching) in particularly crafted test environments.

...
...
...
We could maybe update addr_seen when we send RA messages to the guest - assuming that it will use the same host part (low 64-bits) for both link-local and global addresses. Not sure if that's a widely safe assumption or not.

I don't understand: what case are you trying to cover with this?

A case just like the one in the tests: the interface bounces, and we get NDP traffic on the link-local address, but nothing on the global address before an inbound connection.

Oh, I see. I think it makes sense, even though we'll set addr_seen a bit too early, but not enough to be a practical issue, I think.
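For reference, the RA-based guess amounts to something like the sketch below. guess_global() is a hypothetical name, a /64 prefix is assumed, and it relies on exactly the assumption questioned above: that the guest reuses the low 64 bits of its link-local address for its global one.

```c
#include <string.h>
#include <netinet/in.h>

/* Hypothetical helper: guess the guest's global address from the /64
 * prefix we advertise and the link-local address we've seen, assuming
 * the guest uses the same host part (low 64 bits) for both.
 */
static struct in6_addr guess_global(const struct in6_addr *prefix,
				    const struct in6_addr *ll)
{
	struct in6_addr guess;

	memcpy(&guess.s6_addr[0], &prefix->s6_addr[0], 8);	/* network part */
	memcpy(&guess.s6_addr[8], &ll->s6_addr[8], 8);		/* host part */

	return guess;
}
```

If the guest picks its host part some other way, the guess is simply wrong until real traffic from the global address corrects addr_seen.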

Yes, but I think snooping the NS from DAD is probably a better idea.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson


participants (3)

David Gibson
Paul Holzinger
Stefano Brivio
