[PATCH v4 0/9] Use true MAC address of LAN local remote hosts
Bug #120 asks us to use the true MAC addresses of LAN local remote hosts, since some programs need this information. These commits introduces this for ARP, NDP, UDP, TCP and ICMP. --- v3: Updated according to feedback from Stefano and David: - Made the ARP/NDP lookup call filter out the requested address by itself, qualified by the index if the template interface - Moved the flow specific MAC address from struct flowside to struct flow_common. v4: - Updated according to feedback from David and Stefan - Added a cache table for ARP/NDP table contents Jon Maloy (9): netlink: add function to extract MAC addresses from NDP/ARP table arp/ndp: respond with true MAC address of LAN local remote hosts flow: add MAC address of LAN local remote hosts to flow udp: forward external source MAC address through tap interface tcp: forward external source MAC address through tap interface tap: change signature of function tap_push_l2h() tcp: make tcp_rst_no_conn() respond with correct MAC address icmp: let icmp use mac address from flowside structure fwd: Added cache table for ARP/NDP contents arp.c | 8 ++ conf.c | 2 + flow.c | 20 +++- flow.h | 2 + fwd.c | 244 +++++++++++++++++++++++++++++++++++++++++++++++-- fwd.h | 5 + icmp.c | 4 +- inany.c | 1 + ndp.c | 10 +- netlink.c | 79 ++++++++++++++++ netlink.h | 2 + passt.c | 9 +- passt.h | 3 +- pasta.c | 2 +- tap.c | 23 +++-- tap.h | 7 +- tcp.c | 21 ++++- tcp.h | 2 +- tcp_buf.c | 29 +++--- tcp_internal.h | 2 +- tcp_vu.c | 5 +- udp.c | 52 ++++++----- udp.h | 2 +- 23 files changed, 452 insertions(+), 82 deletions(-) -- 2.50.1
The solution to bug https://bugs.passt.top/show_bug.cgi?id=120
requires the ability to translate from an IP address to its
corresponding MAC address in cases where those are present in
the ARP/NDP table.
We add this feature here.
Signed-off-by: Jon Maloy
On Tue, Aug 19, 2025 at 11:09:57PM -0400, Jon Maloy wrote:
The solution to bug https://bugs.passt.top/show_bug.cgi?id=120 requires the ability to translate from an IP address to its corresponding MAC address in cases where those are present in the ARP/NDP table.
We add this feature here.
Signed-off-by: Jon Maloy
--- netlink.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ netlink.h | 2 ++ 2 files changed, 81 insertions(+) diff --git a/netlink.c b/netlink.c index 8f82e73..cf7debc 100644 --- a/netlink.c +++ b/netlink.c @@ -800,6 +800,85 @@ int nl_addr_get(int s, unsigned int ifi, sa_family_t af, return status; }
+/** + * nl_neigh_mac_get() - Get neighbor MAC address from the kernel neigh table + * @s: Netlink socket fd + * @addr: IPv4 or IPv6 address + * @ifi: Interface index + * @mac: Buffer for Ethernet MAC, left unchanged if not found/usable + * + * Return: <0 on netlink error; 0 otherwise (drains all replies for @seq). + */ +int nl_neigh_mac_get(int s, const union inany_addr *addr, + int ifi, unsigned char *mac) +{ + const void *ip = inany_v4(addr); + struct req_t { + struct nlmsghdr nlh; + struct ndmsg ndm; + struct rtattr rta; + char ip[RTA_ALIGN(sizeof(struct in6_addr))]; + } req; + struct nlmsghdr *nh; + char buf[NLBUFSIZ]; + ssize_t status; + uint32_t seq; + int msglen; + int iplen; + + memset(&req, 0, sizeof(req)); + req.ndm.ndm_ifindex = ifi; + req.rta.rta_type = NDA_DST; + + if (ip) { + req.ndm.ndm_family = AF_INET; + iplen = sizeof(struct in_addr); + } else { + req.ndm.ndm_family = AF_INET6; + ip = &addr; + iplen = sizeof(struct in6_addr); + } + + req.rta.rta_len = RTA_LENGTH(iplen); + memcpy(RTA_DATA(&req.rta), ip, iplen); + msglen = NLMSG_ALIGN(sizeof(req.nlh) + sizeof(req.ndm) + RTA_LENGTH(iplen)); + seq = nl_send(s, &req, RTM_GETNEIGH, 0, msglen); + + /* Drain all RTM_NEWNEIGH replies for this seq */ + nl_foreach_oftype(nh, status, s, buf, seq, RTM_NEWNEIGH) { + struct ndmsg *ndm = NLMSG_DATA(nh); + struct rtattr *rta = (struct rtattr *)(ndm + 1); + const uint8_t *lladdr = NULL; + size_t na = RTM_PAYLOAD(nh); + const void *dst = NULL; + size_t lladdr_len = 0; + size_t dstlen = 0; + + for (; RTA_OK(rta, na); rta = RTA_NEXT(rta, na)) { + switch (rta->rta_type) { + case NDA_DST: + dst = RTA_DATA(rta); + dstlen = RTA_PAYLOAD(rta); + break; + case NDA_LLADDR: + lladdr = RTA_DATA(rta); + lladdr_len = RTA_PAYLOAD(rta); + break; + default: + break; + } + } + + if (dst && dstlen == (size_t)iplen && memcmp(dst, ip, iplen) == 0) { + /* Only copy Ethernet-style addresses; leave unchanged otherwise */ + if (lladdr && lladdr_len == ETH_ALEN)
You need to return an error code in the case that the MAC is missing or doesn't have the required form; otherwise you'll silently return garbage in @mac.
+ memcpy(mac, lladdr, ETH_ALEN); + } + } + + return status; +} + /** * nl_addr_get_ll() - Get first IPv6 link-local address for a given interface * @s: Netlink socket diff --git a/netlink.h b/netlink.h index b51e99c..026c64f 100644 --- a/netlink.h +++ b/netlink.h @@ -17,6 +17,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src, int s_dst, unsigned int ifi_dst, sa_family_t af); int nl_addr_get(int s, unsigned int ifi, sa_family_t af, void *addr, int *prefix_len, void *addr_l); +int nl_neigh_mac_get(int s, const union inany_addr *addr, int ifi, + unsigned char *mac); int nl_addr_set(int s, unsigned int ifi, sa_family_t af, const void *addr, int prefix_len); int nl_addr_get_ll(int s, unsigned int ifi, struct in6_addr *addr);
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
When we receive an ARP request or NDP neigbor solicitation over
the tap interface for a host on the local network segment attached
to the template interface, we respond with that host's real MAC
address.
The local host, which is acting as a proxy for the default gateway,
is still exempted from this rule.
Signed-off-by: Jon Maloy
On Tue, Aug 19, 2025 at 11:09:58PM -0400, Jon Maloy wrote:
When we receive an ARP request or NDP neigbor solicitation over the tap interface for a host on the local network segment attached to the template interface, we respond with that host's real MAC address.
The local host, which is acting as a proxy for the default gateway, is still exempted from this rule.
Signed-off-by: Jon Maloy
--- arp.c | 9 +++++++++ fwd.c | 38 ++++++++++++++++++++++++++++++-------- fwd.h | 1 + inany.c | 1 + ndp.c | 9 +++++++++ 5 files changed, 50 insertions(+), 8 deletions(-) diff --git a/arp.c b/arp.c index fc482bb..c37867a 100644 --- a/arp.c +++ b/arp.c @@ -29,6 +29,7 @@ #include "dhcp.h" #include "passt.h" #include "tap.h" +#include "netlink.h"
/** * arp() - Check if this is a supported ARP message, reply as needed @@ -39,6 +40,7 @@ */ int arp(const struct ctx *c, const struct pool *p) { + union inany_addr tgt; unsigned char swap[4]; struct ethhdr *eh; struct arphdr *ah; @@ -72,6 +74,13 @@ int arp(const struct ctx *c, const struct pool *p) memcpy(am->tha, am->sha, sizeof(am->tha)); memcpy(am->sha, c->our_tap_mac, sizeof(am->sha));
+ /* Respond with true MAC address if remote host is on + * the template interface's network segment + */ + inany_from_af(&tgt, AF_INET, am->tip); + if (!fwd_inany_nat(c, &tgt)) + nl_neigh_mac_get(nl_sock, &tgt, c->ifi4, am->sha);
You're not checking for errors from nl_neigh_mac_get().
memcpy(swap, am->tip, sizeof(am->tip)); memcpy(am->tip, am->sip, sizeof(am->tip)); memcpy(am->sip, swap, sizeof(am->sip)); diff --git a/fwd.c b/fwd.c index 250cf56..55bf5f2 100644 --- a/fwd.c +++ b/fwd.c @@ -331,20 +331,29 @@ static bool fwd_guest_accessible(const struct ctx *c, * * Only handles translations that depend *only* on the address. Anything * related to specific ports or flows is handled elsewhere. + * + * Return: true if there was a translation, otherwise false */ -static void nat_outbound(const struct ctx *c, const union inany_addr *addr, - union inany_addr *translated) +static bool nat_outbound(const struct ctx *c, const union inany_addr *addr, + union inany_addr *translated) {
I'm having trouble convincing myself that explicitly excluding MAC preservation in NAT cases is strictly correct, as opposed to doing a MAC lookup on the translated address. I'm reasonably confident it will be good enough for most purposes.
- if (inany_equals4(addr, &c->ip4.map_host_loopback)) + if (inany_equals4(addr, &c->ip4.map_host_loopback)) {
For this case, we'll certainly want our_tap_mac. We should get that indirectly anyway, though, since 127.0.0.1 will never be in the host interface's neighbour table.
*translated = inany_loopback4; - else if (inany_equals6(addr, &c->ip6.map_host_loopback)) + return true; + } else if (inany_equals6(addr, &c->ip6.map_host_loopback)) { *translated = inany_loopback6;
Ditto, but for ::.
- else if (inany_equals4(addr, &c->ip4.map_guest_addr)) + return true; + } else if (inany_equals4(addr, &c->ip4.map_guest_addr)) { *translated = inany_from_v4(c->ip4.addr); - else if (inany_equals6(addr, &c->ip6.map_guest_addr)) + return true; + } else if (inany_equals6(addr, &c->ip6.map_guest_addr)) { translated->a6 = c->ip6.addr;
Arguably for these cases, using the MAC of the translated address is what we want. There is a real peer on the host side (usually the host itself, but maybe something else), which might have a presence on the template interface. We have the NAT in place because it conflicts with the guest's IP, but the MAC should still be unique.
- else - *translated = *addr; + return true; + } + + *translated = *addr; + return false; + }
/** @@ -554,3 +563,16 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
return PIF_TAP; } + +/** fwd_inany_nat - Find if a remote IPv[46] address is subject to NAT + * @c: Execution context + * @addr: IPv[46] address + * + * Return: true if translated, false otherwise + */ +bool fwd_inany_nat(const struct ctx *c, const union inany_addr *addr) +{ + union inany_addr addr_nat; + + return nat_outbound(c, addr, &addr_nat); +} diff --git a/fwd.h b/fwd.h index 65c7c96..c8d485d 100644 --- a/fwd.h +++ b/fwd.h @@ -56,5 +56,6 @@ uint8_t fwd_nat_from_splice(const struct ctx *c, uint8_t proto, const struct flowside *ini, struct flowside *tgt); uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, const struct flowside *ini, struct flowside *tgt); +bool fwd_inany_nat(const struct ctx *c, const union inany_addr *addr);
#endif /* FWD_H */ diff --git a/inany.c b/inany.c index 65a39f9..7680439 100644 --- a/inany.c +++ b/inany.c @@ -16,6 +16,7 @@ #include "ip.h" #include "siphash.h" #include "inany.h" +#include "fwd.h"
const union inany_addr inany_loopback4 = INANY_INIT4(IN4ADDR_LOOPBACK_INIT); const union inany_addr inany_any4 = INANY_INIT4(IN4ADDR_ANY_INIT); diff --git a/ndp.c b/ndp.c index 3e15494..9912f80 100644 --- a/ndp.c +++ b/ndp.c @@ -32,6 +32,7 @@ #include "passt.h" #include "tap.h" #include "log.h" +#include "netlink.h"
#define RT_LIFETIME 65535
@@ -196,6 +197,7 @@ static void ndp_send(const struct ctx *c, const struct in6_addr *dst, static void ndp_na(const struct ctx *c, const struct in6_addr *dst, const struct in6_addr *addr) { + union inany_addr tgt; struct ndp_na na = { .ih = { .icmp6_type = NA, @@ -215,6 +217,13 @@ static void ndp_na(const struct ctx *c, const struct in6_addr *dst,
memcpy(na.target_l2_addr.mac, c->our_tap_mac, ETH_ALEN);
+ /* Respond with true link-layer address if remote host is on + * the template interface's network segment + */ + inany_from_af(&tgt, AF_INET6, addr); + if (!fwd_inany_nat(c, &tgt)) + nl_neigh_mac_get(nl_sock, &tgt, c->ifi6, na.target_l2_addr.mac); + ndp_send(c, dst, &na, sizeof(na)); }
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
When communicating with remote hosts on the local network, some guest
applications want to see the real MAC address of that host instead
of PASST/PASTA's own tap address. The flow_common structure is a
convenient location for storing that address, so we do that in this
commit.
Note that we don´t add actual usage of this address here, that will
be done in later commits.
Signed-off-by: Jon Maloy
On Tue, Aug 19, 2025 at 11:09:59PM -0400, Jon Maloy wrote:
When communicating with remote hosts on the local network, some guest applications want to see the real MAC address of that host instead of PASST/PASTA's own tap address. The flow_common structure is a convenient location for storing that address, so we do that in this commit.
Note that we don´t add actual usage of this address here, that will be done in later commits.
Signed-off-by: Jon Maloy
--- flow.c | 19 ++++++++++++++++++- flow.h | 2 ++ 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/flow.c b/flow.c index feefda3..d7b3fd1 100644 --- a/flow.c +++ b/flow.c @@ -20,6 +20,7 @@ #include "flow.h" #include "flow_table.h" #include "repair.h" +#include "netlink.h"
const char *flow_state_str[] = { [FLOW_STATE_FREE] = "FREE", @@ -438,18 +439,27 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow, { char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; struct flow_common *f = &flow->f; - const struct flowside *ini = &f->side[INISIDE]; + struct flowside *ini = &f->side[INISIDE];
I don't see anywhere you're modifying *ini. Leftover from an older draft?
struct flowside *tgt = &f->side[TGTSIDE]; uint8_t tgtpif = PIF_NONE; + int ifi;
ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); ASSERT(f->type == FLOW_TYPE_NONE); ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE); ASSERT(flow->f.state == FLOW_STATE_INI); + memcpy(f->tap_omac, c->our_tap_mac, ETH_ALEN);
switch (f->pif[INISIDE]) { case PIF_TAP: tgtpif = fwd_nat_from_tap(c, proto, ini, tgt);
Attempting to preserve MAC probably only makes sense if tgtpif == PIF_HOST (which I think is always true for now, but maybe not forever).
+ + /* If there is no NAT, the remote host might be on the template
As before, I'm not sure if conditioning this no no NAT makes sense.
+ * interface's local network segment. If so, insert its MAC address + */ + ifi = inany_v4(&ini->oaddr) ? c->ifi4 : c->ifi6; + if (!fwd_inany_nat(c, &ini->oaddr)) + nl_neigh_mac_get(nl_sock, &ini->oaddr, ifi, f->tap_omac);
It should be tgt->eaddr, rather than ini->oaddr. They'll usually be equal, but logically we want to look up based on the host side address. As noted in general comments on earlier versions, this lookup will fail (give our_tap_mac) if the host hasn't contacted this peer before (so it's not in the neigh table). Can we defer this lookup until we get a reply, to avoid that problem?
break;
case PIF_SPLICE: @@ -458,6 +468,13 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow,
case PIF_HOST: tgtpif = fwd_nat_from_host(c, proto, ini, tgt); + + /* If there is no NAT, the remote host might be on the template + * interface's local network segment. If so, insert its MAC address + */ + ifi = inany_v4(&ini->eaddr) ? c->ifi4 : c->ifi6; + if (!fwd_inany_nat(c, &ini->eaddr)) + nl_neigh_mac_get(nl_sock, &ini->eaddr, ifi, f->tap_omac); break;
default: diff --git a/flow.h b/flow.h index cac618a..29c8bc6 100644 --- a/flow.h +++ b/flow.h @@ -177,6 +177,7 @@ int flowside_connect(const struct ctx *c, int s, * @type: Type of packet flow * @pif[]: Interface for each side of the flow * @side[]: Information for each side of the flow + * @tap_omac: MAC address of remote endpoint as seen from the guest */ struct flow_common { #ifdef __GNUC__ @@ -192,6 +193,7 @@ struct flow_common { #endif uint8_t pif[SIDES]; struct flowside side[SIDES]; + unsigned char tap_omac[6]; };
#define FLOW_INDEX_BITS 17 /* 128k - 1 */
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
We forward the incoming MAC address through the tap interface when
receiving incoming packets from network local hosts. Packets from
the own host are excepted from this rule, and are still forwarded
with the default PASST/PASTA MAC address as source.
This is a part of the solution to bug
https://bugs.passt.top/show_bug.cgi?id=120
Signed-off-by: Jon Maloy
On Tue, Aug 19, 2025 at 11:10:00PM -0400, Jon Maloy wrote:
We forward the incoming MAC address through the tap interface when receiving incoming packets from network local hosts. Packets from the own host are excepted from this rule, and are still forwarded with the default PASST/PASTA MAC address as source.
This is a part of the solution to bug https://bugs.passt.top/show_bug.cgi?id=120
Signed-off-by: Jon Maloy
Reviewed-by: David Gibson
--- passt.c | 2 +- udp.c | 39 ++++++++++++++++++++------------------- udp.h | 2 +- 3 files changed, 22 insertions(+), 21 deletions(-)
diff --git a/passt.c b/passt.c index 388d10f..477a01f 100644 --- a/passt.c +++ b/passt.c @@ -154,7 +154,7 @@ static void timer_init(struct ctx *c, const struct timespec *now) void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s) { tcp_update_l2_buf(eth_d, eth_s); - udp_update_l2_buf(eth_d, eth_s); + udp_update_l2_buf(eth_d); }
/** diff --git a/udp.c b/udp.c index 75edc20..35e3603 100644 --- a/udp.c +++ b/udp.c @@ -133,11 +133,8 @@ static int udp_splice_init[IP_VERSIONS][NUM_PORTS]; /* UDP header and data for inbound messages */ static struct udp_payload_t udp_payload[UDP_MAX_FRAMES];
-/* Ethernet header for IPv4 frames */ -static struct ethhdr udp4_eth_hdr; - -/* Ethernet header for IPv6 frames */ -static struct ethhdr udp6_eth_hdr; +/* Ethernet headers for IPv4 and IPv6 frames */ +static struct ethhdr udp_eth_hdr[UDP_MAX_FRAMES];
/** * struct udp_meta_t - Pre-cooked headers for UDP packets @@ -212,10 +209,12 @@ void udp_portmap_clear(void) * @eth_d: Ethernet destination address, NULL if unchanged * @eth_s: Ethernet source address, NULL if unchanged */ -void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s) +void udp_update_l2_buf(const unsigned char *eth_d) { - eth_update_mac(&udp4_eth_hdr, eth_d, eth_s); - eth_update_mac(&udp6_eth_hdr, eth_d, eth_s); + int i; + + for (i = 0; i < UDP_MAX_FRAMES; i++) + eth_update_mac(&udp_eth_hdr[i], eth_d, NULL); }
/** @@ -238,6 +237,7 @@ static void udp_iov_init_one(const struct ctx *c, size_t i)
*siov = IOV_OF_LVALUE(payload->data);
+ tiov[UDP_IOV_ETH] = IOV_OF_LVALUE(udp_eth_hdr[i]); tiov[UDP_IOV_TAP] = tap_hdr_iov(c, &meta->taph); tiov[UDP_IOV_PAYLOAD].iov_base = payload;
@@ -253,9 +253,6 @@ static void udp_iov_init(const struct ctx *c) { size_t i;
- udp4_eth_hdr.h_proto = htons_constant(ETH_P_IP); - udp6_eth_hdr.h_proto = htons_constant(ETH_P_IPV6); - for (i = 0; i < UDP_MAX_FRAMES; i++) udp_iov_init_one(c, i); } @@ -352,31 +349,34 @@ size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp, * udp_tap_prepare() - Convert one datagram into a tap frame * @mmh: Receiving mmsghdr array * @idx: Index of the datagram to prepare + * @tap_omac: MAC address of remote endpoint as seen from the guest * @toside: Flowside for destination side * @no_udp_csum: Do not set UDP checksum */ static void udp_tap_prepare(const struct mmsghdr *mmh, - unsigned idx, const struct flowside *toside, + unsigned int idx, + const unsigned char *tap_omac, + const struct flowside *toside, bool no_udp_csum) { struct iovec (*tap_iov)[UDP_NUM_IOVS] = &udp_l2_iov[idx]; struct udp_payload_t *bp = &udp_payload[idx]; struct udp_meta_t *bm = &udp_meta[idx]; + struct ethhdr *eh = (*tap_iov)[UDP_IOV_ETH].iov_base; size_t l4len;
+ eth_update_mac(eh, 0, tap_omac);
s/0/NULL/ please.
if (!inany_v4(&toside->eaddr) || !inany_v4(&toside->oaddr)) { l4len = udp_update_hdr6(&bm->ip6h, bp, toside, mmh[idx].msg_len, no_udp_csum); - tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) + - sizeof(udp6_eth_hdr)); - (*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp6_eth_hdr); + tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) + ETH_HLEN);
s/ETH_HLEN/sizeof(*eh)/
+ eh->h_proto = htons_constant(ETH_P_IPV6); (*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip6h); } else { l4len = udp_update_hdr4(&bm->ip4h, bp, toside, mmh[idx].msg_len, no_udp_csum); - tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) + - sizeof(udp4_eth_hdr)); - (*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp4_eth_hdr); + tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) + ETH_HLEN);
Ditto.
+ eh->h_proto = htons_constant(ETH_P_IP); (*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip4h); } (*tap_iov)[UDP_IOV_PAYLOAD].iov_len = l4len; @@ -801,13 +801,14 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx) { const struct flowside *toside = flowside_at_sidx(tosidx); + const struct udp_flow *uflow = udp_at_sidx(tosidx); int i;
if ((n = udp_sock_recv(c, s, udp_mh_recv, n)) <= 0) return;
for (i = 0; i < n; i++) - udp_tap_prepare(udp_mh_recv, i, toside, false); + udp_tap_prepare(udp_mh_recv, i, uflow->f.tap_omac, toside, false);
tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n); } diff --git a/udp.h b/udp.h index 8f8531a..dd6e5ad 100644 --- a/udp.h +++ b/udp.h @@ -21,7 +21,7 @@ int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr, const char *ifname, in_port_t port); int udp_init(struct ctx *c); void udp_timer(struct ctx *c, const struct timespec *now); -void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s); +void udp_update_l2_buf(const unsigned char *eth_d);
/** * union udp_listen_epoll_ref - epoll reference for "listening" UDP sockets
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
We forward the incoming mac address through the tap interface when
receiving incoming packets from network local hosts. Packets from
the own host are excepted from this rule, and are still forwarded
with the default PASST/PASTA MAC address as source.
This is a part of the solution to bug
https://bugs.passt.top/show_bug.cgi?id=120
Signed-off-by: Jon Maloy
On Tue, Aug 19, 2025 at 11:10:01PM -0400, Jon Maloy wrote:
We forward the incoming mac address through the tap interface when receiving incoming packets from network local hosts. Packets from the own host are excepted from this rule, and are still forwarded with the default PASST/PASTA MAC address as source.
This is a part of the solution to bug https://bugs.passt.top/show_bug.cgi?id=120
Signed-off-by: Jon Maloy
Reviewed-by: David Gibson
--- passt.c | 7 +++---- passt.h | 3 +-- pasta.c | 2 +- tap.c | 2 +- tcp.c | 5 ++++- tcp.h | 2 +- tcp_buf.c | 29 ++++++++++++----------------- tcp_internal.h | 2 +- tcp_vu.c | 5 ++--- udp.c | 1 - 10 files changed, 26 insertions(+), 32 deletions(-)
diff --git a/passt.c b/passt.c index 477a01f..746c9ff 100644 --- a/passt.c +++ b/passt.c @@ -149,11 +149,10 @@ static void timer_init(struct ctx *c, const struct timespec *now) /** * proto_update_l2_buf() - Update scatter-gather L2 buffers in protocol handlers * @eth_d: Ethernet destination address, NULL if unchanged - * @eth_s: Ethernet source address, NULL if unchanged */ -void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s) +void proto_update_l2_buf(const unsigned char *eth_d) { - tcp_update_l2_buf(eth_d, eth_s); + tcp_update_l2_buf(eth_d); udp_update_l2_buf(eth_d); }
@@ -255,7 +254,7 @@ int main(int argc, char **argv) if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c))) _exit(EXIT_FAILURE);
- proto_update_l2_buf(c.guest_mac, c.our_tap_mac); + proto_update_l2_buf(c.guest_mac);
if (c.ifi4 && !c.no_dhcp) dhcp_init(); diff --git a/passt.h b/passt.h index 4cfd6eb..2c5b3e1 100644 --- a/passt.h +++ b/passt.h @@ -324,7 +324,6 @@ struct ctx { bool migrate_exit; };
-void proto_update_l2_buf(const unsigned char *eth_d, - const unsigned char *eth_s); +void proto_update_l2_buf(const unsigned char *eth_d);
#endif /* PASST_H */ diff --git a/pasta.c b/pasta.c index c207692..f251331 100644 --- a/pasta.c +++ b/pasta.c @@ -409,7 +409,7 @@ void pasta_ns_conf(struct ctx *c) } }
- proto_update_l2_buf(c->guest_mac, NULL); + proto_update_l2_buf(c->guest_mac); }
/** diff --git a/tap.c b/tap.c index 6db5d88..a5de8b7 100644 --- a/tap.c +++ b/tap.c @@ -1085,7 +1085,7 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p,
if (memcmp(c->guest_mac, eh->h_source, ETH_ALEN)) { memcpy(c->guest_mac, eh->h_source, ETH_ALEN); - proto_update_l2_buf(c->guest_mac, NULL); + proto_update_l2_buf(c->guest_mac); }
switch (ntohs(eh->h_proto)) { diff --git a/tcp.c b/tcp.c index 957b498..1dd2e24 100644 --- a/tcp.c +++ b/tcp.c @@ -920,7 +920,7 @@ static void tcp_fill_header(struct tcphdr *th, * @no_tcp_csum: Do not set TCP checksum */ void tcp_fill_headers(const struct tcp_tap_conn *conn, - struct tap_hdr *taph, + struct tap_hdr *taph, struct ethhdr *eh, struct iphdr *ip4h, struct ipv6hdr *ip6h, struct tcphdr *th, struct iov_tail *payload, const uint16_t *ip4_check, uint32_t seq, bool no_tcp_csum) @@ -952,6 +952,7 @@ void tcp_fill_headers(const struct tcp_tap_conn *conn, psum = proto_ipv4_header_psum(l4len, IPPROTO_TCP, *src4, *dst4); } + eh->h_proto = htons_constant(ETH_P_IP); }
if (ip6h) { @@ -972,7 +973,9 @@ void tcp_fill_headers(const struct tcp_tap_conn *conn, &ip6h->saddr, &ip6h->daddr); } + eh->h_proto = htons_constant(ETH_P_IPV6); } + eth_update_mac(eh, NULL, conn->f.tap_omac);
tcp_fill_header(th, conn, seq);
diff --git a/tcp.h b/tcp.h index 234a803..c1b8385 100644 --- a/tcp.h +++ b/tcp.h @@ -24,7 +24,7 @@ int tcp_init(struct ctx *c); void tcp_timer(struct ctx *c, const struct timespec *now); void tcp_defer_handler(struct ctx *c);
-void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s); +void tcp_update_l2_buf(const unsigned char *eth_d);
extern bool peek_offset_cap;
diff --git a/tcp_buf.c b/tcp_buf.c index d1fca67..dac5788 100644 --- a/tcp_buf.c +++ b/tcp_buf.c @@ -40,8 +40,7 @@ /* Static buffers */
/* Ethernet header for IPv4 and IPv6 frames */ -static struct ethhdr tcp4_eth_src; -static struct ethhdr tcp6_eth_src; +static struct ethhdr tcp_eth_hdr[TCP_FRAMES_MEM];
static struct tap_hdr tcp_payload_tap_hdr[TCP_FRAMES_MEM];
@@ -67,12 +66,13 @@ static struct iovec tcp_l2_iov[TCP_FRAMES_MEM][TCP_NUM_IOVS]; /** * tcp_update_l2_buf() - Update Ethernet header buffers with addresses * @eth_d: Ethernet destination address, NULL if unchanged - * @eth_s: Ethernet source address, NULL if unchanged */ -void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s) +void tcp_update_l2_buf(const unsigned char *eth_d) { - eth_update_mac(&tcp4_eth_src, eth_d, eth_s); - eth_update_mac(&tcp6_eth_src, eth_d, eth_s); + int i; + + for (i = 0; i < TCP_FRAMES_MEM; i++) + eth_update_mac(&tcp_eth_hdr[i], eth_d, NULL); }
/** @@ -85,9 +85,6 @@ void tcp_sock_iov_init(const struct ctx *c) struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP); int i;
- tcp6_eth_src.h_proto = htons_constant(ETH_P_IPV6); - tcp4_eth_src.h_proto = htons_constant(ETH_P_IP); - for (i = 0; i < ARRAY_SIZE(tcp_payload); i++) { tcp6_payload_ip[i] = ip6; tcp4_payload_ip[i] = iph; @@ -164,6 +161,7 @@ static void tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn, struct tap_hdr *taph = iov[TCP_IOV_TAP].iov_base; const struct flowside *tapside = TAPFLOW(conn); const struct in_addr *a4 = inany_v4(&tapside->oaddr); + struct ethhdr *eh = iov[TCP_IOV_ETH].iov_base; struct ipv6hdr *ip6h = NULL; struct iphdr *ip4h = NULL;
@@ -172,7 +170,7 @@ static void tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn, else ip6h = iov[TCP_IOV_IP].iov_base;
- tcp_fill_headers(conn, taph, ip4h, ip6h, th, &tail, + tcp_fill_headers(conn, taph, eh, ip4h, ip6h, th, &tail, check, seq, no_tcp_csum); }
@@ -194,14 +192,12 @@ int tcp_buf_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags) int ret;
iov = tcp_l2_iov[tcp_payload_used]; - if (CONN_V4(conn)) { + if (CONN_V4(conn)) iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]); - iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src; - } else { + else iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]); - iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src; - }
+ iov[TCP_IOV_ETH] = IOV_OF_LVALUE(tcp_eth_hdr[tcp_payload_used]); payload = iov[TCP_IOV_PAYLOAD].iov_base; seq = conn->seq_to_tap; ret = tcp_prepare_flags(c, conn, flags, &payload->th, @@ -259,11 +255,10 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, check = &iph->check; } iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp4_payload_ip[tcp_payload_used]); - iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src; } else if (CONN_V6(conn)) { iov[TCP_IOV_IP] = IOV_OF_LVALUE(tcp6_payload_ip[tcp_payload_used]); - iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src; } + iov[TCP_IOV_ETH].iov_base = &tcp_eth_hdr[tcp_payload_used]; payload = iov[TCP_IOV_PAYLOAD].iov_base; payload->th.th_off = sizeof(struct tcphdr) / 4; payload->th.th_x2 = 0; diff --git a/tcp_internal.h b/tcp_internal.h index 36c6533..6c2d1ef 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -167,7 +167,7 @@ void tcp_rst_do(const struct ctx *c, struct tcp_tap_conn *conn); struct tcp_info_linux;
void tcp_fill_headers(const struct tcp_tap_conn *conn, - struct tap_hdr *taph, + struct tap_hdr *taph, struct ethhdr *eh, struct iphdr *ip4h, struct ipv6hdr *ip6h, struct tcphdr *th, struct iov_tail *payload, const uint16_t *ip4_check, uint32_t seq, bool no_tcp_csum); diff --git a/tcp_vu.c b/tcp_vu.c index cb39bc2..a895d6c 100644 --- a/tcp_vu.c +++ b/tcp_vu.c @@ -135,7 +135,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags) flags_elem[0].in_sg[0].iov_len = hdrlen + optlen; payload = IOV_TAIL(flags_elem[0].in_sg, 1, hdrlen);
- tcp_fill_headers(conn, NULL, ip4h, ip6h, th, &payload, + tcp_fill_headers(conn, NULL, eh, ip4h, ip6h, th, &payload, NULL, seq, !*c->pcap);
if (*c->pcap) { @@ -315,7 +315,6 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn, eh = vu_eth(base);
memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest)); - memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
/* initialize header */
@@ -339,7 +338,7 @@ static void tcp_vu_prepare(const struct ctx *c, struct tcp_tap_conn *conn, th->ack = 1; th->psh = push;
- tcp_fill_headers(conn, NULL, ip4h, ip6h, th, &payload, + tcp_fill_headers(conn, NULL, eh, ip4h, ip6h, th, &payload, *check, conn->seq_to_tap, no_tcp_csum); if (ip4h) *check = &ip4h->check; diff --git a/udp.c b/udp.c index 35e3603..30937dd 100644 --- a/udp.c +++ b/udp.c @@ -207,7 +207,6 @@ void udp_portmap_clear(void) /** * udp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses * @eth_d: Ethernet destination address, NULL if unchanged - * @eth_s: Ethernet source address, NULL if unchanged */ void udp_update_l2_buf(const unsigned char *eth_d) {
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
In the following commits it must be possible for the callers of
function tap_push_l2h() to specify which source MAC address should
be added to the ethernet header sent over the tap interface. As
a preparation, we now add a new argument to that function, still
without actually using it.
Signed-off-by: Jon Maloy
On Tue, Aug 19, 2025 at 11:10:02PM -0400, Jon Maloy wrote:
In the following commits it must be possible for the callers of function tap_push_l2h() to specify which source MAC address should be added to the ethernet header sent over the tap interface. As a preparation, we now add a new argument to that function, still without actually using it.
Signed-off-by: Jon Maloy
--- tap.c | 15 +++++++++------ tap.h | 3 ++- tcp.c | 4 ++-- 3 files changed, 13 insertions(+), 9 deletions(-) diff --git a/tap.c b/tap.c index a5de8b7..0349835 100644 --- a/tap.c +++ b/tap.c @@ -171,17 +171,20 @@ const struct in6_addr *tap_ip6_daddr(const struct ctx *c, * tap_push_l2h() - Build an L2 header for an inbound packet * @c: Execution context * @buf: Buffer address at which to generate header + * @src_mac: MAC address to be used as source for message. * @proto: Ethernet protocol number for L3 * * Return: pointer at which to write the packet's payload */ -void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto) +void *tap_push_l2h(const struct ctx *c, void *buf, + const void *src_mac, uint16_t proto) { struct ethhdr *eh = (struct ethhdr *)buf;
/* TODO: ARP table lookup */ + memcpy(eh->h_dest, c->guest_mac, ETH_ALEN); - memcpy(eh->h_source, c->our_tap_mac, ETH_ALEN); + memcpy(eh->h_source, src_mac, ETH_ALEN);
This assumes that src_mac is non-NULL...
eh->h_proto = ntohs(proto); return eh + 1; } @@ -261,7 +264,7 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport, { size_t l4len = dlen + sizeof(struct udphdr); char buf[USHRT_MAX]; - struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP); + struct iphdr *ip4h = tap_push_l2h(c, buf, NULL, ETH_P_IP);
... but then you pass NULL.
struct udphdr *uh = tap_push_ip4h(ip4h, src, dst, l4len, IPPROTO_UDP); char *data = tap_push_uh4(uh, src, sport, dst, dport, in, dlen);
@@ -281,7 +284,7 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst, const void *in, size_t l4len) { char buf[USHRT_MAX]; - struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP); + struct iphdr *ip4h = tap_push_l2h(c, buf, NULL, ETH_P_IP); struct icmphdr *icmp4h = tap_push_ip4h(ip4h, src, dst, l4len, IPPROTO_ICMP);
@@ -367,7 +370,7 @@ void tap_udp6_send(const struct ctx *c, { size_t l4len = dlen + sizeof(struct udphdr); char buf[USHRT_MAX]; - struct ipv6hdr *ip6h = tap_push_l2h(c, buf, ETH_P_IPV6); + struct ipv6hdr *ip6h = tap_push_l2h(c, buf, NULL, ETH_P_IPV6); struct udphdr *uh = tap_push_ip6h(ip6h, src, dst, l4len, IPPROTO_UDP, flow); char *data = tap_push_uh6(uh, src, sport, dst, dport, in, dlen); @@ -389,7 +392,7 @@ void tap_icmp6_send(const struct ctx *c, const void *in, size_t l4len) { char buf[USHRT_MAX]; - struct ipv6hdr *ip6h = tap_push_l2h(c, buf, ETH_P_IPV6); + struct ipv6hdr *ip6h = tap_push_l2h(c, buf, NULL, ETH_P_IPV6); struct icmp6hdr *icmp6h = tap_push_ip6h(ip6h, src, dst, l4len, IPPROTO_ICMPV6, 0);
diff --git a/tap.h b/tap.h index 936ae93..6ec171d 100644 --- a/tap.h +++ b/tap.h @@ -70,7 +70,8 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len) }
unsigned long tap_l2_max_len(const struct ctx *c); -void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto); +void *tap_push_l2h(const struct ctx *c, void *buf, + const void *src_mac, uint16_t proto); void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src, struct in_addr dst, size_t l4len, uint8_t proto); void *tap_push_uh4(struct udphdr *uh, struct in_addr src, in_port_t sport, diff --git a/tcp.c b/tcp.c index 1dd2e24..bdcd477 100644 --- a/tcp.c +++ b/tcp.c @@ -1898,7 +1898,7 @@ static void tcp_rst_no_conn(const struct ctx *c, int af, return;
if (af == AF_INET) { - struct iphdr *ip4h = tap_push_l2h(c, buf, ETH_P_IP); + struct iphdr *ip4h = tap_push_l2h(c, buf, NULL, ETH_P_IP); const struct in_addr *rst_src = daddr; const struct in_addr *rst_dst = saddr;
@@ -1908,7 +1908,7 @@ static void tcp_rst_no_conn(const struct ctx *c, int af, *rst_src, *rst_dst);
} else { - struct ipv6hdr *ip6h = tap_push_l2h(c, buf, ETH_P_IPV6); + struct ipv6hdr *ip6h = tap_push_l2h(c, buf, NULL, ETH_P_IPV6); const struct in6_addr *rst_src = daddr; const struct in6_addr *rst_dst = saddr;
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
tcp_rst_no_conn() needs to identify and specify which source MAC
address to use when sending an RST to the guest. This is because
it doesn't have access to any flow structure where this address
could be fetched.
Signed-off-by: Jon Maloy
On Tue, Aug 19, 2025 at 11:10:03PM -0400, Jon Maloy wrote:
tcp_rst_no_conn() needs to identify and specify which source MAC address to use when sending an RST to the guest. This is because it doesn't have access to any flow structure where this address could be fetched.
I'm not sure we actually need to preserve MAC in this case. This happens when the guest sends a TCP packet that doesn't appear to belong to any existing flow. That means that the guest is somehow already out of sync with what's going on in the outside world. So, it's hard to imagine any scenario where the guest needs the correct MAC of this peer that it already has a wrong idea of the state of.
Signed-off-by: Jon Maloy
--- tcp.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/tcp.c b/tcp.c index bdcd477..28f9ef5 100644 --- a/tcp.c +++ b/tcp.c @@ -309,6 +309,7 @@ #include "tcp_internal.h" #include "tcp_buf.h" #include "tcp_vu.h" +#include "netlink.h"
#ifndef __USE_MISC /* From Linux UAPI, missing in netinet/tcp.h provided by musl */ @@ -1888,17 +1889,29 @@ static void tcp_rst_no_conn(const struct ctx *c, int af, const struct tcphdr *th, size_t l4len) { struct iov_tail payload = IOV_TAIL(NULL, 0, 0); + unsigned char src_mac[ETH_ALEN]; + union inany_addr tgt; struct tcphdr *rsth; char buf[USHRT_MAX]; uint32_t psum = 0; size_t rst_l2len; + int ifi;
/* Don't respond to RSTs without a connection */ if (th->rst) return;
+ /* Respond with true MAC address if remote host is on + * the template interface's network segment + */ + ifi = af == AF_INET ? c->ifi4 : c->ifi6; + memcpy(src_mac, c->our_tap_mac, ETH_ALEN); + inany_from_af(&tgt, af, daddr); + if (!fwd_inany_nat(c, &tgt)) + nl_neigh_mac_get(nl_sock, &tgt, ifi, src_mac); + if (af == AF_INET) { - struct iphdr *ip4h = tap_push_l2h(c, buf, NULL, ETH_P_IP); + struct iphdr *ip4h = tap_push_l2h(c, buf, src_mac, ETH_P_IP); const struct in_addr *rst_src = daddr; const struct in_addr *rst_dst = saddr;
@@ -1908,7 +1921,7 @@ static void tcp_rst_no_conn(const struct ctx *c, int af, *rst_src, *rst_dst);
} else { - struct ipv6hdr *ip6h = tap_push_l2h(c, buf, NULL, ETH_P_IPV6); + struct ipv6hdr *ip6h = tap_push_l2h(c, buf, src_mac, ETH_P_IPV6); const struct in6_addr *rst_src = daddr; const struct in6_addr *rst_dst = saddr;
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Even ICMP needs to be updated to use the external MAC address instead
of just the own tap address when applicable. We do that here.
Signed-off-by: Jon Maloy
On Tue, Aug 19, 2025 at 11:10:04PM -0400, Jon Maloy wrote:
Even ICMP needs to be updated to use the external MAC address instead of just the own tap address when applicable. We do that here.
Signed-off-by: Jon Maloy
--- icmp.c | 4 ++-- ndp.c | 2 +- tap.c | 10 ++++++---- tap.h | 4 ++-- udp.c | 12 ++++++++---- 5 files changed, 19 insertions(+), 13 deletions(-) diff --git a/icmp.c b/icmp.c index 95f38c1..aa42cc3 100644 --- a/icmp.c +++ b/icmp.c @@ -129,12 +129,12 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) const struct in_addr *daddr = inany_v4(&ini->eaddr);
ASSERT(saddr && daddr); /* Must have IPv4 addresses */ - tap_icmp4_send(c, *saddr, *daddr, buf, n); + tap_icmp4_send(c, *saddr, *daddr, buf, pingf->f.tap_omac, n); } else if (pingf->f.type == FLOW_PING6) { const struct in6_addr *saddr = &ini->oaddr.a6; const struct in6_addr *daddr = &ini->eaddr.a6;
- tap_icmp6_send(c, saddr, daddr, buf, n); + tap_icmp6_send(c, saddr, daddr, buf, pingf->f.tap_omac, n); } return;
diff --git a/ndp.c b/ndp.c index 9912f80..19b9b28 100644 --- a/ndp.c +++ b/ndp.c @@ -185,7 +185,7 @@ static void ndp_send(const struct ctx *c, const struct in6_addr *dst, { const struct in6_addr *src = &c->ip6.our_tap_ll;
- tap_icmp6_send(c, src, dst, buf, l4len); + tap_icmp6_send(c, src, dst, buf, c->our_tap_mac, l4len); }
/** diff --git a/tap.c b/tap.c index 0349835..443928a 100644 --- a/tap.c +++ b/tap.c @@ -278,13 +278,14 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport, * @src: IPv4 source address * @dst: IPv4 destination address * @in: ICMP packet, including ICMP header + * @src_mac: MAC address to be used as source for message * @l4len: ICMP packet length, including ICMP header */ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst, - const void *in, size_t l4len) + const void *in, const void *src_mac, size_t l4len) { char buf[USHRT_MAX]; - struct iphdr *ip4h = tap_push_l2h(c, buf, NULL, ETH_P_IP); + struct iphdr *ip4h = tap_push_l2h(c, buf, src_mac, ETH_P_IP); struct icmphdr *icmp4h = tap_push_ip4h(ip4h, src, dst, l4len, IPPROTO_ICMP);
@@ -385,14 +386,15 @@ void tap_udp6_send(const struct ctx *c, * @src: IPv6 source address * @dst: IPv6 destination address * @in: ICMP packet, including ICMP header + * @src_mac: MAC address to be used as source for message * @l4len: ICMP packet length, including ICMP header */ void tap_icmp6_send(const struct ctx *c, const struct in6_addr *src, const struct in6_addr *dst, - const void *in, size_t l4len) + const void *in, const void *src_mac, size_t l4len) { char buf[USHRT_MAX]; - struct ipv6hdr *ip6h = tap_push_l2h(c, buf, NULL, ETH_P_IPV6); + struct ipv6hdr *ip6h = tap_push_l2h(c, buf, src_mac, ETH_P_IPV6); struct icmp6hdr *icmp6h = tap_push_ip6h(ip6h, src, dst, l4len, IPPROTO_ICMPV6, 0);
diff --git a/tap.h b/tap.h index 6ec171d..ce0b9a6 100644 --- a/tap.h +++ b/tap.h @@ -91,7 +91,7 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport, struct in_addr dst, in_port_t dport, const void *in, size_t dlen); void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst, - const void *in, size_t l4len); + const void *in, const void *src_mac, size_t l4len); const struct in6_addr *tap_ip6_daddr(const struct ctx *c, const struct in6_addr *src); void *tap_push_ip6h(struct ipv6hdr *ip6h, @@ -103,7 +103,7 @@ void tap_udp6_send(const struct ctx *c, uint32_t flow, void *in, size_t dlen); void tap_icmp6_send(const struct ctx *c, const struct in6_addr *src, const struct in6_addr *dst, - const void *in, size_t l4len); + const void *in, const void *src_mac, size_t l4len); void tap_send_single(const struct ctx *c, const void *data, size_t l2len); size_t tap_send_frames(const struct ctx *c, const struct iovec *iov, size_t bufs_per_frame, size_t nframes); diff --git a/udp.c b/udp.c index 30937dd..8d43646 100644 --- a/udp.c +++ b/udp.c @@ -385,6 +385,7 @@ static void udp_tap_prepare(const struct mmsghdr *mmh, * udp_send_tap_icmp4() - Construct and send ICMPv4 to local peer * @c: Execution context * @ee: Extended error descriptor + * @uflow: UDP flow * @toside: Destination side of flow * @saddr: Address of ICMP generating node * @in: First bytes (max 8) of original UDP message body @@ -392,6 +393,7 @@ static void udp_tap_prepare(const struct mmsghdr *mmh, */ static void udp_send_tap_icmp4(const struct ctx *c, const struct sock_extended_err *ee, + const struct udp_flow *uflow, const struct flowside *toside, struct in_addr saddr, const void *in, size_t dlen) @@ -421,7 +423,7 @@ static void udp_send_tap_icmp4(const struct ctx *c, tap_push_uh4(&msg.uh, eaddr, eport, oaddr, oport, in, dlen); memcpy(&msg.data, in, dlen);
- tap_icmp4_send(c, saddr, eaddr, &msg, msglen); + tap_icmp4_send(c, saddr, eaddr, &msg, uflow->f.tap_omac, msglen);
This is only correct when saddr == toside->oaddr; i.e. when the ICMP is coming from the flow peer. An ICMP can come from an intermediate node, which we'd need a separate MAC lookup for.
}
@@ -429,6 +431,7 @@ static void udp_send_tap_icmp4(const struct ctx *c, * udp_send_tap_icmp6() - Construct and send ICMPv6 to local peer * @c: Execution context * @ee: Extended error descriptor + * @uflow: UDP flow * @toside: Destination side of flow * @saddr: Address of ICMP generating node * @in: First bytes (max 1232) of original UDP message body @@ -437,6 +440,7 @@ static void udp_send_tap_icmp4(const struct ctx *c, */ static void udp_send_tap_icmp6(const struct ctx *c, const struct sock_extended_err *ee, + const struct udp_flow *uflow, const struct flowside *toside, const struct in6_addr *saddr, void *in, size_t dlen, uint32_t flow) @@ -466,7 +470,7 @@ static void udp_send_tap_icmp6(const struct ctx *c, tap_push_uh6(&msg.uh, eaddr, eport, oaddr, oport, in, dlen); memcpy(&msg.data, in, dlen);
- tap_icmp6_send(c, saddr, eaddr, &msg, msglen); + tap_icmp6_send(c, saddr, eaddr, &msg, uflow->f.tap_omac, msglen);
Ditto.
}
/** @@ -626,12 +630,12 @@ static int udp_sock_recverr(const struct ctx *c, int s, flow_sidx_t sidx, if (hdr->cmsg_level == IPPROTO_IP && (o4 = inany_v4(&otap)) && inany_v4(&toside->eaddr)) { dlen = MIN(dlen, ICMP4_MAX_DLEN); - udp_send_tap_icmp4(c, ee, toside, *o4, data, dlen); + udp_send_tap_icmp4(c, ee, uflow, toside, *o4, data, dlen); return 1; }
if (hdr->cmsg_level == IPPROTO_IPV6 && !inany_v4(&toside->eaddr)) { - udp_send_tap_icmp6(c, ee, toside, &otap.a6, data, dlen, + udp_send_tap_icmp6(c, ee, uflow, toside, &otap.a6, data, dlen, FLOW_IDX(uflow)); return 1; }
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
We add a cache table to keep partial contents of the kernel ARP/NDP
tables. This way, we drastically reduce the number of netlink calls
to read those tables.
We create dummy cache entries representing non-(not-yet)-existing
ARP/NDP entries when needed. We add a short expiration time to each
such entry, so that we can know when to make repeated calls to the
kernel tables in the beginning. We also add an access counter to the
entries, to ensure that the timer becomes longer and the call frequency
abates over time if no ARP/NDP entry is created.
For regular entries we use a much longer timer, with the purpose to
update the entry in the rare case that a remote host changes its
MAC address.
Signed-off-by: Jon Maloy
On Tue, Aug 19, 2025 at 11:10:05PM -0400, Jon Maloy wrote:
We add a cache table to keep partial contents of the kernel ARP/NDP tables. This way, we drastically reduce the number of netlink calls to read those tables.
Do you have data to suggest this is necessary? It's a lot of code to optimise something only needed for some pretty uncommon cases.
We create dummy cache entries representing non-(not-yet)-existing ARP/NDP entries when needed. We add a short expiration time to each such entry, so that we can know when to make repeated calls to the kernel tables in the beginning. We also add an access counter to the entries, to ensure that the timer becomes longer and the call frequency abates over time if no ARP/NDP entry is created.
For regular entries we use a much longer timer, with the purpose to update the entry in the rare case that a remote host changes its MAC address.
Signed-off-by: Jon Maloy
--- arp.c | 3 +- conf.c | 2 + flow.c | 5 +- fwd.c | 206 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ fwd.h | 4 ++ ndp.c | 3 +- tcp.c | 3 +- 7 files changed, 218 insertions(+), 8 deletions(-) diff --git a/arp.c b/arp.c index c37867a..040d4fe 100644 --- a/arp.c +++ b/arp.c @@ -29,7 +29,6 @@ #include "dhcp.h" #include "passt.h" #include "tap.h" -#include "netlink.h"
/** * arp() - Check if this is a supported ARP message, reply as needed @@ -79,7 +78,7 @@ int arp(const struct ctx *c, const struct pool *p) */ inany_from_af(&tgt, AF_INET, am->tip); if (!fwd_inany_nat(c, &tgt)) - nl_neigh_mac_get(nl_sock, &tgt, c->ifi4, am->sha); + fwd_neigh_mac_get(c, &tgt, c->ifi4, am->sha);
memcpy(swap, am->tip, sizeof(am->tip)); memcpy(am->tip, am->sip, sizeof(am->tip)); diff --git a/conf.c b/conf.c index f47f48e..0abdbf7 100644 --- a/conf.c +++ b/conf.c @@ -2122,6 +2122,8 @@ void conf(struct ctx *c, int argc, char **argv) c->udp.fwd_out.mode = fwd_default;
fwd_scan_ports_init(c); + if (fwd_mac_cache_init()) + die("Failed to initiate neighnor MAC cache");
"neighnor"
if (!c->quiet) conf_print(c); diff --git a/flow.c b/flow.c index d7b3fd1..3ca30f1 100644 --- a/flow.c +++ b/flow.c @@ -459,7 +459,7 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow, */ ifi = inany_v4(&ini->oaddr) ? c->ifi4 : c->ifi6; if (!fwd_inany_nat(c, &ini->oaddr)) - nl_neigh_mac_get(nl_sock, &ini->oaddr, ifi, f->tap_omac); + fwd_neigh_mac_get(c, &ini->oaddr, ifi, f->tap_omac); break;
case PIF_SPLICE: @@ -474,7 +474,7 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow, */ ifi = inany_v4(&ini->eaddr) ? c->ifi4 : c->ifi6; if (!fwd_inany_nat(c, &ini->eaddr)) - nl_neigh_mac_get(nl_sock, &ini->eaddr, ifi, f->tap_omac); + fwd_neigh_mac_get(c, &ini->eaddr, ifi, f->tap_omac); break;
default: @@ -491,6 +491,7 @@ struct flowside *flow_target(const struct ctx *c, union flow *flow,
f->pif[TGTSIDE] = tgtpif; flow_set_state(f, FLOW_STATE_TGT); + return tgt; }
diff --git a/fwd.c b/fwd.c index 55bf5f2..6647764 100644 --- a/fwd.c +++ b/fwd.c @@ -19,6 +19,8 @@ #include
#include #include +#include +#include #include "util.h" #include "ip.h" @@ -26,6 +28,8 @@ #include "passt.h" #include "lineread.h" #include "flow_table.h" +#include "inany.h" +#include "netlink.h"
/* Empheral port range: values from RFC 6335 */ static in_port_t fwd_ephemeral_min = (1 << 15) + (1 << 14); @@ -33,6 +37,208 @@ static in_port_t fwd_ephemeral_max = NUM_PORTS - 1;
#define PORT_RANGE_SYSCTL "/proc/sys/net/ipv4/ip_local_port_range"
+#define MAC_CACHE_BUCKETS 1024 +#define MAC_CACHE_RENEWAL 3600 /* Refresh entry from ARP/NDP every hour */ + +/* Partial cache of ARP/NDP table contents */ +struct mac_cache_entry { + union inany_addr key; + unsigned char mac[ETH_ALEN]; + struct timespec expiry; + uint32_t count; + struct mac_cache_entry *next; +}; + +struct mac_cache_table { + struct mac_cache_entry **buckets; + size_t nbuckets; +}; + +static struct mac_cache_table mac_cache; + +/** + * timespec_before() - Check the relation between two pints in time + * @a: Point in time to be tested + * @b: Point in time test a against + * Return: True if a comes before b, otherwise b + */ +static inline bool timespec_before(const struct timespec *a, + const struct timespec *b) +{ + return (a->tv_sec < b->tv_sec) || + (a->tv_sec == b->tv_sec && a->tv_nsec < b->tv_nsec); +} + +/** + * mac_entry_is_dummy() - Check if a cache entry is a placeholder + * @c: Execution context + * @e: Cache entry + * + * Return: True if the entry is a placeholder, false otherwise + */ +bool mac_entry_is_dummy(const struct ctx *c, const struct mac_cache_entry *e) +{ + return !memcmp(c->our_tap_mac, e->mac, ETH_ALEN); +} + +/** + * mac_entry_expired() - Check if a cache entry has expired + * @e: Cache entry + * + * Return: True if the entry has expired, false otherwise + */ +static bool mac_entry_expired(const struct mac_cache_entry *e) +{ + struct timespec now; + + clock_gettime(CLOCK_MONOTONIC, &now); + return timespec_before(&e->expiry, &now); +} + +/** + * mac_entry_set_expiry() - Set the time for a cache entry to expire + * @e: Cache entry + * @expiry: Expiration time, in seconds from current moment. + * + * Return: The result of the hash + */ +static void mac_entry_set_expiry(struct mac_cache_entry *e, int expiry) +{ + clock_gettime(CLOCK_MONOTONIC, &e->expiry); + e->expiry.tv_sec += expiry; +} + +/** + * inany_hash32() - Hash the contenst of key into an integer + * @key: IPv4 or IPv6 address, used as key + * + * Return: The result of the hash + */ +static inline uint32_t inany_hash32(const struct ctx *c, + const union inany_addr *key) +{ + struct siphash_state st = SIPHASH_INIT(c->hash_secret); + + inany_siphash_feed(&st, key); + + return (uint32_t)siphash_final(&st, sizeof(*key), 0); +} + +/** + * fwd_mac_cache_bucket_idx() - Find the table index of an entry + * @c: Execution context + * @key: IPv4 or IPv6 address, used as key for the hash lookup + * @nbuckets: Number of buckets in the table + * + * Return: The index found + */ +static inline size_t fwd_mac_cache_bucket_idx(const struct ctx *c, + const union inany_addr *key, + size_t nbuckets) +{ + uint32_t h = inany_hash32(c, key); + + return (nbuckets & (nbuckets - 1)) ? (h % nbuckets) : (h & (nbuckets - 1));
I don't think this is worth the complexity: I wouldn't be confident the conditional branch is less expensive than an integer divmod. And if it is, it wouldn't suprise me if the compiler did that optimisation itself.
+} + +/** + * fwd_mac_cache_find() - Find an entry in the ARP/NDP cache table + * @c: Execution context + * @key: IPv4 or IPv6 address, used as key for the hash lookup + * + * Return: Pointer to the entry on success, NULL on failure. + */ +static struct mac_cache_entry *fwd_mac_cache_find(const struct ctx *c, + const union inany_addr *key) +{ + const struct mac_cache_table *t = &mac_cache; + struct mac_cache_entry *e; + size_t idx; + + idx = fwd_mac_cache_bucket_idx(c, key, t->nbuckets); + for (e = t->buckets[idx]; e; e = e->next) + if (inany_equals(&e->key, key)) + return e; + return NULL; +} + +/** + * fwd_mac_cache_add() - Add a new entry to the ARP/NDP cache table + * @c: Execution context + * @key: IPv4 or IPv6 address, used as key for the hash lookup + * @mac: Buffer for Ethernet MAC, left unchanged if not found/usable + * + * Return: Pointer to the new entry on success, NULL on failure. + */ +static struct mac_cache_entry *fwd_mac_cache_add(const struct ctx *c, + const union inany_addr *key, + const unsigned char *mac) +{ + struct mac_cache_table *t = &mac_cache; + size_t idx = fwd_mac_cache_bucket_idx(c, key, t->nbuckets); + struct mac_cache_entry *e; + + e = calloc(1, sizeof(*e)); + if (!e) + return NULL; + e->key = *key; + memcpy(e->mac, mac, ETH_ALEN); + e->count = 0; + e->next = t->buckets[idx]; + t->buckets[idx] = e; + + return e; +} + +/** + * fwd_neigh_mac_get() - Find a MAC address the ARP/NDP cache table + * @c: Execution context + * @addr: IPv4 or IPv6 address + * @ifi: Interface index + * @mac: Buffer for Ethernet MAC, left unchanged if not found/usable + * + * Return: 0 on success, -1 on failure. + */ +int fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr, + int ifi, unsigned char *mac) +{ + struct mac_cache_entry *e = fwd_mac_cache_find(c, addr); + bool refresh = false; + + if (e) + refresh = mac_entry_expired(e); + else if ((e = fwd_mac_cache_add(c, addr, mac))) + refresh = true; + else + return -1; + + if (refresh) { + nl_neigh_mac_get(nl_sock, addr, ifi, e->mac); + if (mac_entry_is_dummy(c, e)) + mac_entry_set_expiry(e, e->count++); + else + mac_entry_set_expiry(e, MAC_CACHE_RENEWAL); + } + memcpy(mac, e->mac, ETH_ALEN); + + return 0; +} + +/** + * fwd_mac_cache_init() - Initiate ARP/NDP cache table + * + * Return: 0 on success, -1 on failure. + */ +int fwd_mac_cache_init(void) +{ + struct mac_cache_table *t = &mac_cache; + + t->nbuckets = MAC_CACHE_BUCKETS; + t->buckets = calloc(t->nbuckets, sizeof(*t->buckets)); + + return t->buckets ? 0 : -1; +} + /** fwd_probe_ephemeral() - Determine what ports this host considers ephemeral * * Work out what ports the host thinks are emphemeral and record it for later diff --git a/fwd.h b/fwd.h index c8d485d..ebfc114 100644 --- a/fwd.h +++ b/fwd.h @@ -58,4 +58,8 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto, const struct flowside *ini, struct flowside *tgt); bool fwd_inany_nat(const struct ctx *c, const union inany_addr *addr);
+int fwd_neigh_mac_get(const struct ctx *c, const union inany_addr *addr, + int ifi, unsigned char *mac); +int fwd_mac_cache_init(void); + #endif /* FWD_H */ diff --git a/ndp.c b/ndp.c index 19b9b28..af957c9 100644 --- a/ndp.c +++ b/ndp.c @@ -32,7 +32,6 @@ #include "passt.h" #include "tap.h" #include "log.h" -#include "netlink.h"
#define RT_LIFETIME 65535
@@ -222,7 +221,7 @@ static void ndp_na(const struct ctx *c, const struct in6_addr *dst, */ inany_from_af(&tgt, AF_INET6, addr); if (!fwd_inany_nat(c, &tgt)) - nl_neigh_mac_get(nl_sock, &tgt, c->ifi6, na.target_l2_addr.mac); + fwd_neigh_mac_get(c, &tgt, c->ifi6, na.target_l2_addr.mac);
ndp_send(c, dst, &na, sizeof(na)); } diff --git a/tcp.c b/tcp.c index 28f9ef5..6b5b4ff 100644 --- a/tcp.c +++ b/tcp.c @@ -309,7 +309,6 @@ #include "tcp_internal.h" #include "tcp_buf.h" #include "tcp_vu.h" -#include "netlink.h"
#ifndef __USE_MISC /* From Linux UAPI, missing in netinet/tcp.h provided by musl */ @@ -1908,7 +1907,7 @@ static void tcp_rst_no_conn(const struct ctx *c, int af, memcpy(src_mac, c->our_tap_mac, ETH_ALEN); inany_from_af(&tgt, af, daddr); if (!fwd_inany_nat(c, &tgt)) - nl_neigh_mac_get(nl_sock, &tgt, ifi, src_mac); + fwd_neigh_mac_get(c, &tgt, ifi, src_mac);
if (af == AF_INET) { struct iphdr *ip4h = tap_push_l2h(c, buf, src_mac, ETH_P_IP);
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, 21 Aug 2025 12:03:47 +1000
David Gibson
On Tue, Aug 19, 2025 at 11:10:05PM -0400, Jon Maloy wrote:
We add a cache table to keep partial contents of the kernel ARP/NDP tables. This way, we drastically reduce the number of netlink calls to read those tables.
Do you have data to suggest this is necessary?
I'll chime in as I originally suggested that we need this cache. Without it, we'll have one netlink query for each local, non-loopback flow being established, which sounds rather absurd (...am I missing something?). I haven't tested these changes yet but I suppose the usual tcp_crr test should show the issue. It's not just about TCP CRR latency though. We're adding this mostly for use cases where some kind of LAN service is implemented by a container (say, Pi-hole), and we can probably expect a ton of short-lived TCP flows in those cases (say, DNS requests over TCP).
It's a lot of code to optimise something only needed for some pretty uncommon cases.
Actually, it's a bit less code than I expected, but I don't understand why you're assuming those cases are uncommon. By the way, note that we should be able to get rid of most of this once we implement a netlink monitor (which we need for other purposes), because at that point we can also subscribe to ARP / neighbour table changes.
We create dummy cache entries representing non-(not-yet)-existing ARP/NDP entries when needed. We add a short expiration time to each such entry, so that we can know when to make repeated calls to the kernel tables in the beginning. We also add an access counter to the entries, to ensure that the timer becomes longer and the call frequency abates over time if no ARP/NDP entry is created.
For regular entries we use a much longer timer, with the purpose to update the entry in the rare case that a remote host changes its MAC address.
Signed-off-by: Jon Maloy
--- arp.c | 3 +- conf.c | 2 + flow.c | 5 +- fwd.c | 206 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ fwd.h | 4 ++ ndp.c | 3 +- tcp.c | 3 +- 7 files changed, 218 insertions(+), 8 deletions(-) diff --git a/arp.c b/arp.c index c37867a..040d4fe 100644 --- a/arp.c +++ b/arp.c @@ -29,7 +29,6 @@ #include "dhcp.h" #include "passt.h" #include "tap.h" -#include "netlink.h"
/** * arp() - Check if this is a supported ARP message, reply as needed @@ -79,7 +78,7 @@ int arp(const struct ctx *c, const struct pool *p) */ inany_from_af(&tgt, AF_INET, am->tip); if (!fwd_inany_nat(c, &tgt)) - nl_neigh_mac_get(nl_sock, &tgt, c->ifi4, am->sha); + fwd_neigh_mac_get(c, &tgt, c->ifi4, am->sha);
memcpy(swap, am->tip, sizeof(am->tip)); memcpy(am->tip, am->sip, sizeof(am->tip)); diff --git a/conf.c b/conf.c index f47f48e..0abdbf7 100644 --- a/conf.c +++ b/conf.c @@ -2122,6 +2122,8 @@ void conf(struct ctx *c, int argc, char **argv) c->udp.fwd_out.mode = fwd_default;
fwd_scan_ports_init(c); + if (fwd_mac_cache_init()) + die("Failed to initiate neighnor MAC cache");
"neighnor"
Jon, by the way, we've been using BrE quite consistently throughout the codebase, so the two occurrences of "neighbour" (code comments only) have, well, a 'u' in them. I'd try to keep that consistency if it doesn't bother anybody. -- Stefano
participants (3)
-
David Gibson
-
Jon Maloy
-
Stefano Brivio