On Thu, 21 Aug 2025 12:03:47 +1000
David Gibson
On Tue, Aug 19, 2025 at 11:10:05PM -0400, Jon Maloy wrote:
We add a cache table to keep partial contents of the kernel ARP/NDP tables. This way, we drastically reduce the number of netlink calls to read those tables.
Do you have data to suggest this is necessary?
I'll chime in as I originally suggested that we need this cache. Without it, we'll have one netlink query for each local, non-loopback flow being established, which sounds rather absurd (...am I missing something?). I haven't tested these changes yet but I suppose the usual tcp_crr test should show the issue. It's not just about TCP CRR latency though. We're adding this mostly for use cases where some kind of LAN service is implemented by a container (say, Pi-hole), and we can probably expect a ton of short-lived TCP flows in those cases (say, DNS requests over TCP).
It's a lot of code to optimise something only needed for some pretty uncommon cases.
Actually, it's a bit less code than I expected, but I don't understand why you're assuming those cases are uncommon. By the way, note that we should be able to get rid of most of this once we implement a netlink monitor (which we need for other purposes), because at that point we can also subscribe to ARP / neighbour table changes.
We create dummy cache entries representing non-(not-yet)-existing ARP/NDP entries when needed. We add a short expiration time to each such entry, so that we can know when to make repeated calls to the kernel tables in the beginning. We also add an access counter to the entries, to ensure that the timer becomes longer and the call frequency abates over time if no ARP/NDP entry is created.
For regular entries we use a much longer timer, with the purpose to update the entry in the rare case that a remote host changes its MAC address.
Signed-off-by: Jon Maloy
--- arp.c | 3 +- conf.c | 2 + flow.c | 5 +- fwd.c | 206 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ fwd.h | 4 ++ ndp.c | 3 +- tcp.c | 3 +- 7 files changed, 218 insertions(+), 8 deletions(-) diff --git a/arp.c b/arp.c index c37867a..040d4fe 100644 --- a/arp.c +++ b/arp.c @@ -29,7 +29,6 @@ #include "dhcp.h" #include "passt.h" #include "tap.h" -#include "netlink.h"
/** * arp() - Check if this is a supported ARP message, reply as needed @@ -79,7 +78,7 @@ int arp(const struct ctx *c, const struct pool *p) */ inany_from_af(&tgt, AF_INET, am->tip); if (!fwd_inany_nat(c, &tgt)) - nl_neigh_mac_get(nl_sock, &tgt, c->ifi4, am->sha); + fwd_neigh_mac_get(c, &tgt, c->ifi4, am->sha);
memcpy(swap, am->tip, sizeof(am->tip)); memcpy(am->tip, am->sip, sizeof(am->tip)); diff --git a/conf.c b/conf.c index f47f48e..0abdbf7 100644 --- a/conf.c +++ b/conf.c @@ -2122,6 +2122,8 @@ void conf(struct ctx *c, int argc, char **argv) c->udp.fwd_out.mode = fwd_default;
fwd_scan_ports_init(c); + if (fwd_mac_cache_init()) + die("Failed to initiate neighnor MAC cache");
"neighnor"
Jon, by the way, we've been using BrE quite consistently throughout the codebase, so the two occurrences of "neighbour" (code comments only) have, well, a 'u' in them. I'd try to keep that consistency if it doesn't bother anybody. -- Stefano