Partly in looking at flow table support, partly for other reasons, I've spotted a number of small fixes that can be made to the UDP code (in addition to the large fixes I'm still working on). These are ready to go now, so here they are. They'll go before the flow related addressing cleanups. David Gibson (7): udp: Don't attempt to translate a 0.0.0.0 source address udp: Set pif in epoll reference for ephemeral host sockets udp: Unconditionally clear act flag in udp_timer_one() udp: Separate bound sockets from flags udp: Small streamline to udp_update_hdr4() udp: Fix incorrect usage of IPv6 state in IPv4 path udp: Remove unnecessary test for unspecified addr_out udp.c | 170 +++++++++++++++++++++++++++++++--------------------------- 1 file changed, 90 insertions(+), 80 deletions(-) -- 2.43.2
If an incoming packet has a source address of 0.0.0.0 we translate that to the gateway address. This doesn't really make sense, because we have no way to do a reverse translation for reply packets. Certain UDP protocols do use an unspecified source address in some circumstances (e.g. DHCP). These generally either require no reply, a multicast reply, or provide a suitable reply address by other means. In none of those cases does translating it in passt/pasta make sense. The best we can really do here is just leave it as is. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 1 - 1 file changed, 1 deletion(-) diff --git a/udp.c b/udp.c index a3961bfd..d2f8027c 100644 --- a/udp.c +++ b/udp.c @@ -599,7 +599,6 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport, src_port == 53) { b->iph.saddr = c->ip4.dns_match.s_addr; } else if (IN4_IS_ADDR_LOOPBACK(&b->s_in.sin_addr) || - IN4_IS_ADDR_UNSPECIFIED(&b->s_in.sin_addr)|| IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.addr_seen)) { b->iph.saddr = c->ip4.gw.s_addr; udp_tap_map[V4][src_port].ts = now->tv_sec; -- 2.43.2
On Thu, 22 Feb 2024 10:21:09 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:If an incoming packet has a source address of 0.0.0.0 we translate that to the gateway address. This doesn't really make sense, because we have no way to do a reverse translation for reply packets.Well, we would translate that back to a loopback address, which is fine if we take 0.0.0.0 as "This host on this network". Actually, after my previous note based on RFC 6890, I went and had a look at RFC 1122, section 3.2.1.3, which also states that 0.0.0.0: MUST NOT be sent, except as a source address as part of an initialization procedure by which the host learns its own IP address. ...so I guess dropping it here is fine. By the way, I added this originally as part of commit 6488c3e8489d ("tcp, udp: Replace loopback source address by gateway address") on the basis that 0.0.0.0 could be used in lieu of a loopback address, but sure, we shouldn't even get it from the kernel to start with.Certain UDP protocols do use an unspecified source address in some circumstances (e.g. DHCP). These generally either require no reply, a multicast reply, or provide a suitable reply address by other means. In none of those cases does translating it in passt/pasta make sense. The best we can really do here is just leave it as is. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 1 - 1 file changed, 1 deletion(-) diff --git a/udp.c b/udp.c index a3961bfd..d2f8027c 100644 --- a/udp.c +++ b/udp.c @@ -599,7 +599,6 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport, src_port == 53) { b->iph.saddr = c->ip4.dns_match.s_addr; } else if (IN4_IS_ADDR_LOOPBACK(&b->s_in.sin_addr) || - IN4_IS_ADDR_UNSPECIFIED(&b->s_in.sin_addr)|| IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.addr_seen)) { b->iph.saddr = c->ip4.gw.s_addr; udp_tap_map[V4][src_port].ts = now->tv_sec;-- Stefano
On Thu, Feb 22, 2024 at 06:46:02PM +0100, Stefano Brivio wrote:On Thu, 22 Feb 2024 10:21:09 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Not really, because "this host" has a different meaning to the sender than it does to us. For example, returning replies to DHCP broadcasts to localhost would absolutely not be correct. Of course, attempting to run a DHCP server within passt/pasta sounds problematic, but my point is that localhost is not a reasonable translation.If an incoming packet has a source address of 0.0.0.0 we translate that to the gateway address. This doesn't really make sense, because we have no way to do a reverse translation for reply packets.Well, we would translate that back to a loopback address, which is fine if we take 0.0.0.0 as "This host on this network".Actually, after my previous note based on RFC 6890, I went and had a look at RFC 1122, section 3.2.1.3, which also states that 0.0.0.0: MUST NOT be sent, except as a source address as part of an initialization procedure by which the host learns its own IP address. ...so I guess dropping it here is fine.I'm not dropping it: I'm leaving it untranslated.By the way, I added this originally as part of commit 6488c3e8489d ("tcp, udp: Replace loopback source address by gateway address") on the basis that 0.0.0.0 could be used in lieu of a loopback address, but sure, we shouldn't even get it from the kernel to start with.Again, that's true for certain API calls, but AFAICT it's not true on the wire. At least it seems to be that way in practice, although I haven't located an RFC to say that explicitly.-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibsonCertain UDP protocols do use an unspecified source address in some circumstances (e.g. DHCP). These generally either require no reply, a multicast reply, or provide a suitable reply address by other means. In none of those cases does translating it in passt/pasta make sense. The best we can really do here is just leave it as is. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 1 - 1 file changed, 1 deletion(-) diff --git a/udp.c b/udp.c index a3961bfd..d2f8027c 100644 --- a/udp.c +++ b/udp.c @@ -599,7 +599,6 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport, src_port == 53) { b->iph.saddr = c->ip4.dns_match.s_addr; } else if (IN4_IS_ADDR_LOOPBACK(&b->s_in.sin_addr) || - IN4_IS_ADDR_UNSPECIFIED(&b->s_in.sin_addr)|| IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.addr_seen)) { b->iph.saddr = c->ip4.gw.s_addr; udp_tap_map[V4][src_port].ts = now->tv_sec;
The udp_epoll_ref contains a field for the pif to which the socket belongs. We fill this in for permanent sockets created with udp_sock_init() and for spliced sockets, however, we omit it for ephemeral sockets created for tap originated flows. This is a bug, although we currently get away with it, because we don't consult that field for such flows. Correctly fill it in. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/udp.c b/udp.c index d2f8027c..03a5936e 100644 --- a/udp.c +++ b/udp.c @@ -868,7 +868,10 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, src, dst, udp_tap_map[V4][src].sock); if ((s = udp_tap_map[V4][src].sock) < 0) { struct in_addr bind_addr = IN4ADDR_ANY_INIT; - union udp_epoll_ref uref = { .port = src }; + union udp_epoll_ref uref = { + .port = src, + .pif = PIF_HOST, + }; const char *bind_if = NULL; if (!IN6_IS_ADDR_LOOPBACK(&s_in.sin_addr)) @@ -916,7 +919,11 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, } if ((s = udp_tap_map[V6][src].sock) < 0) { - union udp_epoll_ref uref = { .v6 = 1, .port = src }; + union udp_epoll_ref uref = { + .v6 = 1, + .port = src, + .pif = PIF_HOST, + }; const char *bind_if = NULL; if (!IN6_IS_ADDR_LOOPBACK(&s_in6.sin6_addr)) -- 2.43.2
udp_timer_one() first checks for expiries of various sorts of sockets and if necessary sets the 'sockp' variable to trigger a cleanup at the end. If sockp is *not* set then, correctly, we don't attempt to close a non-existent socket. However, we also don't clear the flag in the udp_act[] map, in which case we'll come back here and there will, again, be nothing to be done. So, clear the udp_act[] flag, even if we don't need to clean up a socket. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/udp.c b/udp.c index 03a5936e..4200f849 100644 --- a/udp.c +++ b/udp.c @@ -1123,8 +1123,9 @@ static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, *sockp = -1; epoll_ctl(c->epollfd, EPOLL_CTL_DEL, s, NULL); close(s); - bitmap_clear(udp_act[v6 ? V6 : V4][type], port); } + + bitmap_clear(udp_act[v6 ? V6 : V4][type], port); } /** -- 2.43.2
[Sorry, I wrote this comment a while ago and just now realised I didn't send this out...] On Thu, 22 Feb 2024 10:21:11 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:udp_timer_one() first checks for expiries of various sorts of sockets and if necessary sets the 'sockp' variable to trigger a cleanup at the end. If sockp is *not* set then, correctly, we don't attempt to close a non-existent socket. However, we also don't clear the flag in the udp_act[] map, in which case we'll come back here and there will, again, be nothing to be done.Well, but that's intended: if we didn't reach UDP_CONN_TIMEOUT for this port, the socket should still be in udp_act[], meaning we'll check it against activity timeouts and close on timeout. Otherwise, we'll never check that port for the activity timeout, right?So, clear the udp_act[] flag, even if we don't need to clean up a socket. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/udp.c b/udp.c index 03a5936e..4200f849 100644 --- a/udp.c +++ b/udp.c @@ -1123,8 +1123,9 @@ static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, *sockp = -1; epoll_ctl(c->epollfd, EPOLL_CTL_DEL, s, NULL); close(s); - bitmap_clear(udp_act[v6 ? V6 : V4][type], port); } + + bitmap_clear(udp_act[v6 ? V6 : V4][type], port); } /**-- Stefano
On Tue, Feb 27, 2024 at 12:56:32PM +0100, Stefano Brivio wrote:[Sorry, I wrote this comment a while ago and just now realised I didn't send this out...] On Thu, 22 Feb 2024 10:21:11 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Ah, bother. Yes, my patch is entirely wrong, I'll drop.udp_timer_one() first checks for expiries of various sorts of sockets and if necessary sets the 'sockp' variable to trigger a cleanup at the end. If sockp is *not* set then, correctly, we don't attempt to close a non-existent socket. However, we also don't clear the flag in the udp_act[] map, in which case we'll come back here and there will, again, be nothing to be done.Well, but that's intended: if we didn't reach UDP_CONN_TIMEOUT for this port, the socket should still be in udp_act[], meaning we'll check it against activity timeouts and close on timeout. Otherwise, we'll never check that port for the activity timeout, right?-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibsonSo, clear the udp_act[] flag, even if we don't need to clean up a socket. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/udp.c b/udp.c index 03a5936e..4200f849 100644 --- a/udp.c +++ b/udp.c @@ -1123,8 +1123,9 @@ static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, *sockp = -1; epoll_ctl(c->epollfd, EPOLL_CTL_DEL, s, NULL); close(s); - bitmap_clear(udp_act[v6 ? V6 : V4][type], port); } + + bitmap_clear(udp_act[v6 ? V6 : V4][type], port); } /**
struct udp_tap_port contains both a socket fd and a set of flags. But, confusingly, these fields are not actually related to each other at all. udp_tap_map[?][port].sock contains a host UDP socket bound to 'port' (or -1 if we haven't opened such a socket). That is, it is indexed by the bound port == the host side forwarding port == destination for host->guest packets == source for guest->host packets. udp_tap_map[?][port].flags, however, contains flags which control how we direct datagrams bound for 'port' from the tap interface. That is, it is indexed by the destination for guest->host packets == source for host->guest packets == host side endpoint port. Worse both aspects of this share the ts field used to control expiry. When we update this because we've used the socket in slot 'port' we could be clobbering the flags timestamp for an unrelated flow of packets that happens to have the same endpoint port as our forwarding port. Clarify this by moving the flags out into a separate array, with its own timestamp and expiry path. As a bonus this means that we can now use the same data structure for tracking the socket fds of "tap" and "splice" path sockets. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 132 ++++++++++++++++++++++++++++++---------------------------- 1 file changed, 68 insertions(+), 64 deletions(-) diff --git a/udp.c b/udp.c index 4200f849..97518d92 100644 --- a/udp.c +++ b/udp.c @@ -121,40 +121,39 @@ #define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */ /** - * struct udp_tap_port - Port tracking based on tap-facing source port - * @sock: Socket bound to source port used as index + * struct udp_port_flags - Translation options by remote Port tracking based on tap-facing source port * @flags: Flags for local bind, loopback address/unicast address as source - * @ts: Activity timestamp from tap, used for socket aging + * @ts: Activity timestamp */ -struct udp_tap_port { - int sock; - uint8_t flags; +struct udp_port_flags { #define PORT_LOCAL BIT(0) #define PORT_LOOPBACK BIT(1) #define PORT_GUA BIT(2) - + uint8_t flags; time_t ts; }; +static struct udp_port_flags udp_port_flags[IP_VERSIONS][NUM_PORTS]; /** - * struct udp_splice_port - Bound socket for spliced communication - * @sock: Socket bound to index port + * struct udp_bound_port - Track bound UDP sockets + * @sock: Socket bound to index port (or -1) * @ts: Activity timestamp */ -struct udp_splice_port { +struct udp_bound_port { int sock; time_t ts; }; -/* Port tracking, arrays indexed by packet source port (host order) */ -static struct udp_tap_port udp_tap_map [IP_VERSIONS][NUM_PORTS]; +/* Bound sockets indexed by bound port (host order) */ +static struct udp_bound_port udp_tap_sock [IP_VERSIONS][NUM_PORTS]; -/* "Spliced" sockets indexed by bound port (host order) */ -static struct udp_splice_port udp_splice_ns [IP_VERSIONS][NUM_PORTS]; -static struct udp_splice_port udp_splice_init[IP_VERSIONS][NUM_PORTS]; +/* Bound sockets for splicing indexed by bound port (host order) */ +static struct udp_bound_port udp_splice_ns [IP_VERSIONS][NUM_PORTS]; +static struct udp_bound_port udp_splice_init [IP_VERSIONS][NUM_PORTS]; enum udp_act_type { - UDP_ACT_TAP, + UDP_ACT_TAP_SOCK, + UDP_ACT_FLAGS, UDP_ACT_SPLICE_NS, UDP_ACT_SPLICE_INIT, UDP_ACT_TYPE_MAX, @@ -246,7 +245,7 @@ void udp_portmap_clear(void) unsigned i; for (i = 0; i < NUM_PORTS; i++) { - udp_tap_map[V4][i].sock = udp_tap_map[V6][i].sock = -1; + udp_tap_sock[V4][i].sock = udp_tap_sock[V6][i].sock = -1; udp_splice_ns[V4][i].sock = udp_splice_ns[V6][i].sock = -1; udp_splice_init[V4][i].sock = udp_splice_init[V6][i].sock = -1; } @@ -393,16 +392,16 @@ int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) union epoll_ref ref = { .type = EPOLL_TYPE_UDP, .udp = { .splice = true, .v6 = v6, .port = src } }; - struct udp_splice_port *sp; + struct udp_bound_port *bp; int act, s; if (ns) { ref.udp.pif = PIF_SPLICE; - sp = &udp_splice_ns[v6 ? V6 : V4][src]; + bp = &udp_splice_ns[v6 ? V6 : V4][src]; act = UDP_ACT_SPLICE_NS; } else { ref.udp.pif = PIF_HOST; - sp = &udp_splice_init[v6 ? V6 : V4][src]; + bp = &udp_splice_init[v6 ? V6 : V4][src]; act = UDP_ACT_SPLICE_INIT; } @@ -437,7 +436,7 @@ int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) goto fail; } - sp->sock = s; + bp->sock = s; bitmap_set(udp_act[v6 ? V6 : V4][act], src); ev.data.u64 = ref.u64; @@ -601,15 +600,15 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport, } else if (IN4_IS_ADDR_LOOPBACK(&b->s_in.sin_addr) || IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.addr_seen)) { b->iph.saddr = c->ip4.gw.s_addr; - udp_tap_map[V4][src_port].ts = now->tv_sec; - udp_tap_map[V4][src_port].flags |= PORT_LOCAL; + udp_port_flags[V4][src_port].ts = now->tv_sec; + udp_port_flags[V4][src_port].flags |= PORT_LOCAL; if (IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr.s_addr, &c->ip4.addr_seen)) - udp_tap_map[V4][src_port].flags &= ~PORT_LOOPBACK; + udp_port_flags[V4][src_port].flags &= ~PORT_LOOPBACK; else - udp_tap_map[V4][src_port].flags |= PORT_LOOPBACK; + udp_port_flags[V4][src_port].flags |= PORT_LOOPBACK; - bitmap_set(udp_act[V4][UDP_ACT_TAP], src_port); + bitmap_set(udp_act[V4][UDP_ACT_FLAGS], src_port); } else { b->iph.saddr = b->s_in.sin_addr.s_addr; } @@ -664,20 +663,20 @@ static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport, else b->ip6h.saddr = c->ip6.addr_ll; - udp_tap_map[V6][src_port].ts = now->tv_sec; - udp_tap_map[V6][src_port].flags |= PORT_LOCAL; + udp_port_flags[V6][src_port].ts = now->tv_sec; + udp_port_flags[V6][src_port].flags |= PORT_LOCAL; if (IN6_IS_ADDR_LOOPBACK(src)) - udp_tap_map[V6][src_port].flags |= PORT_LOOPBACK; + udp_port_flags[V6][src_port].flags |= PORT_LOOPBACK; else - udp_tap_map[V6][src_port].flags &= ~PORT_LOOPBACK; + udp_port_flags[V6][src_port].flags &= ~PORT_LOOPBACK; if (IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr)) - udp_tap_map[V6][src_port].flags |= PORT_GUA; + udp_port_flags[V6][src_port].flags |= PORT_GUA; else - udp_tap_map[V6][src_port].flags &= ~PORT_GUA; + udp_port_flags[V6][src_port].flags &= ~PORT_GUA; - bitmap_set(udp_act[V6][UDP_ACT_TAP], src_port); + bitmap_set(udp_act[V6][UDP_ACT_FLAGS], src_port); } else { b->ip6h.daddr = c->ip6.addr_seen; b->ip6h.saddr = b->s_in6.sin6_addr; @@ -857,16 +856,16 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, s_in.sin_addr = c->ip4.dns_host; } else if (IN4_ARE_ADDR_EQUAL(&s_in.sin_addr, &c->ip4.gw) && !c->no_map_gw) { - if (!(udp_tap_map[V4][dst].flags & PORT_LOCAL) || - (udp_tap_map[V4][dst].flags & PORT_LOOPBACK)) + if (!(udp_port_flags[V4][dst].flags & PORT_LOCAL) || + (udp_port_flags[V4][dst].flags & PORT_LOOPBACK)) s_in.sin_addr.s_addr = htonl(INADDR_LOOPBACK); else s_in.sin_addr = c->ip4.addr_seen; } debug("UDP from tap src=%hu dst=%hu, s=%d", - src, dst, udp_tap_map[V4][src].sock); - if ((s = udp_tap_map[V4][src].sock) < 0) { + src, dst, udp_tap_sock[V4][src].sock); + if ((s = udp_tap_sock[V4][src].sock) < 0) { struct in_addr bind_addr = IN4ADDR_ANY_INIT; union udp_epoll_ref uref = { .port = src, @@ -886,11 +885,11 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, if (s < 0) return p->count - idx; - udp_tap_map[V4][src].sock = s; - bitmap_set(udp_act[V4][UDP_ACT_TAP], src); + udp_tap_sock[V4][src].sock = s; + bitmap_set(udp_act[V4][UDP_ACT_TAP_SOCK], src); } - udp_tap_map[V4][src].ts = now->tv_sec; + udp_tap_sock[V4][src].ts = now->tv_sec; } else { s_in6 = (struct sockaddr_in6) { .sin6_family = AF_INET6, @@ -907,10 +906,10 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, s_in6.sin6_addr = c->ip6.dns_host; } else if (IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw) && !c->no_map_gw) { - if (!(udp_tap_map[V6][dst].flags & PORT_LOCAL) || - (udp_tap_map[V6][dst].flags & PORT_LOOPBACK)) + if (!(udp_port_flags[V6][dst].flags & PORT_LOCAL) || + (udp_port_flags[V6][dst].flags & PORT_LOOPBACK)) s_in6.sin6_addr = in6addr_loopback; - else if (udp_tap_map[V6][dst].flags & PORT_GUA) + else if (udp_port_flags[V6][dst].flags & PORT_GUA) s_in6.sin6_addr = c->ip6.addr; else s_in6.sin6_addr = c->ip6.addr_seen; @@ -918,7 +917,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, bind_addr = &c->ip6.addr_ll; } - if ((s = udp_tap_map[V6][src].sock) < 0) { + if ((s = udp_tap_sock[V6][src].sock) < 0) { union udp_epoll_ref uref = { .v6 = 1, .port = src, @@ -939,11 +938,11 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, if (s < 0) return p->count - idx; - udp_tap_map[V6][src].sock = s; - bitmap_set(udp_act[V6][UDP_ACT_TAP], src); + udp_tap_sock[V6][src].sock = s; + bitmap_set(udp_act[V6][UDP_ACT_TAP_SOCK], src); } - udp_tap_map[V6][src].ts = now->tv_sec; + udp_tap_sock[V6][src].ts = now->tv_sec; } for (i = 0; i < (int)p->count - idx; i++) { @@ -1015,7 +1014,7 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, r4 = s = sock_l4(c, AF_INET, IPPROTO_UDP, addr, ifname, port, uref.u32); - udp_tap_map[V4][uref.port].sock = s < 0 ? -1 : s; + udp_tap_sock[V4][uref.port].sock = s < 0 ? -1 : s; udp_splice_init[V4][port].sock = s < 0 ? -1 : s; } else { struct in_addr loopback = IN4ADDR_LOOPBACK_INIT; @@ -1033,7 +1032,7 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, r6 = s = sock_l4(c, AF_INET6, IPPROTO_UDP, addr, ifname, port, uref.u32); - udp_tap_map[V6][uref.port].sock = s < 0 ? -1 : s; + udp_tap_sock[V6][uref.port].sock = s < 0 ? -1 : s; udp_splice_init[V6][port].sock = s < 0 ? -1 : s; } else { r6 = s = sock_l4(c, AF_INET6, IPPROTO_UDP, @@ -1086,32 +1085,37 @@ static void udp_splice_iov_init(void) static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, in_port_t port, const struct timespec *now) { - struct udp_splice_port *sp; - struct udp_tap_port *tp; + struct udp_bound_port *bp; + struct udp_port_flags *pf; int *sockp = NULL; switch (type) { - case UDP_ACT_TAP: - tp = &udp_tap_map[v6 ? V6 : V4][port]; + case UDP_ACT_TAP_SOCK: + bp = &udp_tap_sock[v6 ? V6 : V4][port]; - if (now->tv_sec - tp->ts > UDP_CONN_TIMEOUT) { - sockp = &tp->sock; - tp->flags = 0; - } + if (now->tv_sec - bp->ts > UDP_CONN_TIMEOUT) + sockp = &bp->sock; + + break; + case UDP_ACT_FLAGS: + pf = &udp_port_flags[v6 ? V6 : V4][port]; + + if (now->tv_sec - pf->ts > UDP_CONN_TIMEOUT) + pf->flags = 0; break; case UDP_ACT_SPLICE_INIT: - sp = &udp_splice_init[v6 ? V6 : V4][port]; + bp = &udp_splice_init[v6 ? V6 : V4][port]; - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; + if (now->tv_sec - bp->ts > UDP_CONN_TIMEOUT) + sockp = &bp->sock; break; case UDP_ACT_SPLICE_NS: - sp = &udp_splice_ns[v6 ? V6 : V4][port]; + bp = &udp_splice_ns[v6 ? V6 : V4][port]; - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; + if (now->tv_sec - bp->ts > UDP_CONN_TIMEOUT) + sockp = &bp->sock; break; default: @@ -1141,7 +1145,7 @@ static void udp_port_rebind(struct ctx *c, bool outbound) = outbound ? c->udp.fwd_out.f.map : c->udp.fwd_in.f.map; const uint8_t *rmap = outbound ? c->udp.fwd_in.f.map : c->udp.fwd_out.f.map; - struct udp_splice_port (*socks)[NUM_PORTS] + struct udp_bound_port (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init; unsigned port; -- 2.43.2
On Thu, 22 Feb 2024 10:21:12 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:struct udp_tap_port contains both a socket fd and a set of flags. But, confusingly, these fields are not actually related to each other at all. udp_tap_map[?][port].sock contains a host UDP socket bound to 'port' (or -1 if we haven't opened such a socket). That is, it is indexed by the bound port == the host side forwarding port == destination for host->guest packets == source for guest->host packets. udp_tap_map[?][port].flags, however, contains flags which control how we direct datagrams bound for 'port' from the tap interface. That is, it is indexed by the destination for guest->host packets == source for host->guest packets == host side endpoint port. Worse both aspects of this share the ts field used to control expiry. When we update this because we've used the socket in slot 'port' we could be clobbering the flags timestamp for an unrelated flow of packets that happens to have the same endpoint port as our forwarding port. Clarify this by moving the flags out into a separate array, with its own timestamp and expiry path. As a bonus this means that we can now use the same data structure for tracking the socket fds of "tap" and "splice" path sockets. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 132 ++++++++++++++++++++++++++++++---------------------------- 1 file changed, 68 insertions(+), 64 deletions(-) diff --git a/udp.c b/udp.c index 4200f849..97518d92 100644 --- a/udp.c +++ b/udp.c @@ -121,40 +121,39 @@ #define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */ /** - * struct udp_tap_port - Port tracking based on tap-facing source port - * @sock: Socket bound to source port used as index + * struct udp_port_flags - Translation options by remote Port tracking based on tap-facing source portI'm not sure what you mean by this -- perhaps "Address translation options for bound source ports"?* @flags: Flags for local bind, loopback address/unicast address as source - * @ts: Activity timestamp from tap, used for socket aging + * @ts: Activity timestamp */ -struct udp_tap_port { - int sock; - uint8_t flags; +struct udp_port_flags { #define PORT_LOCAL BIT(0) #define PORT_LOOPBACK BIT(1) #define PORT_GUA BIT(2) - + uint8_t flags; time_t ts; }; +static struct udp_port_flags udp_port_flags[IP_VERSIONS][NUM_PORTS]; /** - * struct udp_splice_port - Bound socket for spliced communication - * @sock: Socket bound to index port + * struct udp_bound_port - Track bound UDP sockets + * @sock: Socket bound to index port (or -1) * @ts: Activity timestamp */ -struct udp_splice_port { +struct udp_bound_port { int sock; time_t ts; }; -/* Port tracking, arrays indexed by packet source port (host order) */ -static struct udp_tap_port udp_tap_map [IP_VERSIONS][NUM_PORTS]; +/* Bound sockets indexed by bound port (host order) */ +static struct udp_bound_port udp_tap_sock [IP_VERSIONS][NUM_PORTS]; -/* "Spliced" sockets indexed by bound port (host order) */ -static struct udp_splice_port udp_splice_ns [IP_VERSIONS][NUM_PORTS]; -static struct udp_splice_port udp_splice_init[IP_VERSIONS][NUM_PORTS]; +/* Bound sockets for splicing indexed by bound port (host order) */ +static struct udp_bound_port udp_splice_ns [IP_VERSIONS][NUM_PORTS]; +static struct udp_bound_port udp_splice_init [IP_VERSIONS][NUM_PORTS]; enum udp_act_type { - UDP_ACT_TAP, + UDP_ACT_TAP_SOCK, + UDP_ACT_FLAGS,I'm having a hard time reviewing the rest because of this, I think. Earlier, UDP_ACT_TAP meant that the activity timestamp was observed on a non-spliced port. Now, we have UDP_ACT_FLAGS which seems to mean that activity was seen in the direction of (to) the "tap" side, and UDP_ACT_TAP_SOCK meaning that activity was seen in the direction of (to) the socket. But that's somehow the opposite of UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT -- even though they don't actually indicate activity, rather the fact that activity timestamps refer to sockets that originated in (*from*) the target namespace, or in the initial namespace. Anyway, I'm not quite sure: why do we need two separate flags for "tap" (from/to) activity? I mean, as long as we observe activity on a non-spliced flow, we'll update the related timestamp, and we make sure the activity flag is set. Can't it simply be UDP_ACT_TAP? The rest of the series looks good to me.UDP_ACT_SPLICE_NS, UDP_ACT_SPLICE_INIT, UDP_ACT_TYPE_MAX, @@ -246,7 +245,7 @@ void udp_portmap_clear(void) unsigned i; for (i = 0; i < NUM_PORTS; i++) { - udp_tap_map[V4][i].sock = udp_tap_map[V6][i].sock = -1; + udp_tap_sock[V4][i].sock = udp_tap_sock[V6][i].sock = -1; udp_splice_ns[V4][i].sock = udp_splice_ns[V6][i].sock = -1; udp_splice_init[V4][i].sock = udp_splice_init[V6][i].sock = -1; } @@ -393,16 +392,16 @@ int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) union epoll_ref ref = { .type = EPOLL_TYPE_UDP, .udp = { .splice = true, .v6 = v6, .port = src } }; - struct udp_splice_port *sp; + struct udp_bound_port *bp; int act, s; if (ns) { ref.udp.pif = PIF_SPLICE; - sp = &udp_splice_ns[v6 ? V6 : V4][src]; + bp = &udp_splice_ns[v6 ? V6 : V4][src]; act = UDP_ACT_SPLICE_NS; } else { ref.udp.pif = PIF_HOST; - sp = &udp_splice_init[v6 ? V6 : V4][src]; + bp = &udp_splice_init[v6 ? V6 : V4][src]; act = UDP_ACT_SPLICE_INIT; } @@ -437,7 +436,7 @@ int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) goto fail; } - sp->sock = s; + bp->sock = s; bitmap_set(udp_act[v6 ? V6 : V4][act], src); ev.data.u64 = ref.u64; @@ -601,15 +600,15 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport, } else if (IN4_IS_ADDR_LOOPBACK(&b->s_in.sin_addr) || IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.addr_seen)) { b->iph.saddr = c->ip4.gw.s_addr; - udp_tap_map[V4][src_port].ts = now->tv_sec; - udp_tap_map[V4][src_port].flags |= PORT_LOCAL; + udp_port_flags[V4][src_port].ts = now->tv_sec; + udp_port_flags[V4][src_port].flags |= PORT_LOCAL; if (IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr.s_addr, &c->ip4.addr_seen)) - udp_tap_map[V4][src_port].flags &= ~PORT_LOOPBACK; + udp_port_flags[V4][src_port].flags &= ~PORT_LOOPBACK; else - udp_tap_map[V4][src_port].flags |= PORT_LOOPBACK; + udp_port_flags[V4][src_port].flags |= PORT_LOOPBACK; - bitmap_set(udp_act[V4][UDP_ACT_TAP], src_port); + bitmap_set(udp_act[V4][UDP_ACT_FLAGS], src_port); } else { b->iph.saddr = b->s_in.sin_addr.s_addr; } @@ -664,20 +663,20 @@ static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport, else b->ip6h.saddr = c->ip6.addr_ll; - udp_tap_map[V6][src_port].ts = now->tv_sec; - udp_tap_map[V6][src_port].flags |= PORT_LOCAL; + udp_port_flags[V6][src_port].ts = now->tv_sec; + udp_port_flags[V6][src_port].flags |= PORT_LOCAL; if (IN6_IS_ADDR_LOOPBACK(src)) - udp_tap_map[V6][src_port].flags |= PORT_LOOPBACK; + udp_port_flags[V6][src_port].flags |= PORT_LOOPBACK; else - udp_tap_map[V6][src_port].flags &= ~PORT_LOOPBACK; + udp_port_flags[V6][src_port].flags &= ~PORT_LOOPBACK; if (IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr)) - udp_tap_map[V6][src_port].flags |= PORT_GUA; + udp_port_flags[V6][src_port].flags |= PORT_GUA; else - udp_tap_map[V6][src_port].flags &= ~PORT_GUA; + udp_port_flags[V6][src_port].flags &= ~PORT_GUA; - bitmap_set(udp_act[V6][UDP_ACT_TAP], src_port); + bitmap_set(udp_act[V6][UDP_ACT_FLAGS], src_port); } else { b->ip6h.daddr = c->ip6.addr_seen; b->ip6h.saddr = b->s_in6.sin6_addr; @@ -857,16 +856,16 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, s_in.sin_addr = c->ip4.dns_host; } else if (IN4_ARE_ADDR_EQUAL(&s_in.sin_addr, &c->ip4.gw) && !c->no_map_gw) { - if (!(udp_tap_map[V4][dst].flags & PORT_LOCAL) || - (udp_tap_map[V4][dst].flags & PORT_LOOPBACK)) + if (!(udp_port_flags[V4][dst].flags & PORT_LOCAL) || + (udp_port_flags[V4][dst].flags & PORT_LOOPBACK)) s_in.sin_addr.s_addr = htonl(INADDR_LOOPBACK); else s_in.sin_addr = c->ip4.addr_seen; } debug("UDP from tap src=%hu dst=%hu, s=%d", - src, dst, udp_tap_map[V4][src].sock); - if ((s = udp_tap_map[V4][src].sock) < 0) { + src, dst, udp_tap_sock[V4][src].sock); + if ((s = udp_tap_sock[V4][src].sock) < 0) { struct in_addr bind_addr = IN4ADDR_ANY_INIT; union udp_epoll_ref uref = { .port = src, @@ -886,11 +885,11 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, if (s < 0) return p->count - idx; - udp_tap_map[V4][src].sock = s; - bitmap_set(udp_act[V4][UDP_ACT_TAP], src); + udp_tap_sock[V4][src].sock = s; + bitmap_set(udp_act[V4][UDP_ACT_TAP_SOCK], src); } - udp_tap_map[V4][src].ts = now->tv_sec; + udp_tap_sock[V4][src].ts = now->tv_sec; } else { s_in6 = (struct sockaddr_in6) { .sin6_family = AF_INET6, @@ -907,10 +906,10 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, s_in6.sin6_addr = c->ip6.dns_host; } else if (IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw) && !c->no_map_gw) { - if (!(udp_tap_map[V6][dst].flags & PORT_LOCAL) || - (udp_tap_map[V6][dst].flags & PORT_LOOPBACK)) + if (!(udp_port_flags[V6][dst].flags & PORT_LOCAL) || + (udp_port_flags[V6][dst].flags & PORT_LOOPBACK)) s_in6.sin6_addr = in6addr_loopback; - else if (udp_tap_map[V6][dst].flags & PORT_GUA) + else if (udp_port_flags[V6][dst].flags & PORT_GUA) s_in6.sin6_addr = c->ip6.addr; else s_in6.sin6_addr = c->ip6.addr_seen; @@ -918,7 +917,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, bind_addr = &c->ip6.addr_ll; } - if ((s = udp_tap_map[V6][src].sock) < 0) { + if ((s = udp_tap_sock[V6][src].sock) < 0) { union udp_epoll_ref uref = { .v6 = 1, .port = src, @@ -939,11 +938,11 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, if (s < 0) return p->count - idx; - udp_tap_map[V6][src].sock = s; - bitmap_set(udp_act[V6][UDP_ACT_TAP], src); + udp_tap_sock[V6][src].sock = s; + bitmap_set(udp_act[V6][UDP_ACT_TAP_SOCK], src); } - udp_tap_map[V6][src].ts = now->tv_sec; + udp_tap_sock[V6][src].ts = now->tv_sec; } for (i = 0; i < (int)p->count - idx; i++) { @@ -1015,7 +1014,7 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, r4 = s = sock_l4(c, AF_INET, IPPROTO_UDP, addr, ifname, port, uref.u32); - udp_tap_map[V4][uref.port].sock = s < 0 ? -1 : s; + udp_tap_sock[V4][uref.port].sock = s < 0 ? -1 : s; udp_splice_init[V4][port].sock = s < 0 ? -1 : s; } else { struct in_addr loopback = IN4ADDR_LOOPBACK_INIT; @@ -1033,7 +1032,7 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, r6 = s = sock_l4(c, AF_INET6, IPPROTO_UDP, addr, ifname, port, uref.u32); - udp_tap_map[V6][uref.port].sock = s < 0 ? -1 : s; + udp_tap_sock[V6][uref.port].sock = s < 0 ? -1 : s; udp_splice_init[V6][port].sock = s < 0 ? -1 : s; } else { r6 = s = sock_l4(c, AF_INET6, IPPROTO_UDP, @@ -1086,32 +1085,37 @@ static void udp_splice_iov_init(void) static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, in_port_t port, const struct timespec *now) { - struct udp_splice_port *sp; - struct udp_tap_port *tp; + struct udp_bound_port *bp; + struct udp_port_flags *pf; int *sockp = NULL; switch (type) { - case UDP_ACT_TAP: - tp = &udp_tap_map[v6 ? V6 : V4][port]; + case UDP_ACT_TAP_SOCK: + bp = &udp_tap_sock[v6 ? V6 : V4][port]; - if (now->tv_sec - tp->ts > UDP_CONN_TIMEOUT) { - sockp = &tp->sock; - tp->flags = 0; - } + if (now->tv_sec - bp->ts > UDP_CONN_TIMEOUT) + sockp = &bp->sock; + + break; + case UDP_ACT_FLAGS: + pf = &udp_port_flags[v6 ? V6 : V4][port]; + + if (now->tv_sec - pf->ts > UDP_CONN_TIMEOUT) + pf->flags = 0; break; case UDP_ACT_SPLICE_INIT: - sp = &udp_splice_init[v6 ? V6 : V4][port]; + bp = &udp_splice_init[v6 ? V6 : V4][port]; - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; + if (now->tv_sec - bp->ts > UDP_CONN_TIMEOUT) + sockp = &bp->sock; break; case UDP_ACT_SPLICE_NS: - sp = &udp_splice_ns[v6 ? V6 : V4][port]; + bp = &udp_splice_ns[v6 ? V6 : V4][port]; - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; + if (now->tv_sec - bp->ts > UDP_CONN_TIMEOUT) + sockp = &bp->sock; break; default: @@ -1141,7 +1145,7 @@ static void udp_port_rebind(struct ctx *c, bool outbound) = outbound ? c->udp.fwd_out.f.map : c->udp.fwd_in.f.map; const uint8_t *rmap = outbound ? c->udp.fwd_in.f.map : c->udp.fwd_out.f.map; - struct udp_splice_port (*socks)[NUM_PORTS] + struct udp_bound_port (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init; unsigned port;-- Stefano
On Tue, Feb 27, 2024 at 12:56:59PM +0100, Stefano Brivio wrote:On Thu, 22 Feb 2024 10:21:12 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Oops, yes. I think I partially rewrote an existing comment, leaving an incoherent fusion of two different sentences.struct udp_tap_port contains both a socket fd and a set of flags. But, confusingly, these fields are not actually related to each other at all. udp_tap_map[?][port].sock contains a host UDP socket bound to 'port' (or -1 if we haven't opened such a socket). That is, it is indexed by the bound port == the host side forwarding port == destination for host->guest packets == source for guest->host packets. udp_tap_map[?][port].flags, however, contains flags which control how we direct datagrams bound for 'port' from the tap interface. That is, it is indexed by the destination for guest->host packets == source for host->guest packets == host side endpoint port. Worse both aspects of this share the ts field used to control expiry. When we update this because we've used the socket in slot 'port' we could be clobbering the flags timestamp for an unrelated flow of packets that happens to have the same endpoint port as our forwarding port. Clarify this by moving the flags out into a separate array, with its own timestamp and expiry path. As a bonus this means that we can now use the same data structure for tracking the socket fds of "tap" and "splice" path sockets. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 132 ++++++++++++++++++++++++++++++---------------------------- 1 file changed, 68 insertions(+), 64 deletions(-) diff --git a/udp.c b/udp.c index 4200f849..97518d92 100644 --- a/udp.c +++ b/udp.c @@ -121,40 +121,39 @@ #define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */ /** - * struct udp_tap_port - Port tracking based on tap-facing source port - * @sock: Socket bound to source port used as index + * struct udp_port_flags - Translation options by remote Port tracking based on tap-facing source portI'm not sure what you mean by this -- perhaps "Address translation options for bound source ports"?Ah! I'd read those as "ACTions we needed to take" in the expiry timers, rather than tying them to specific activity. It amounts to the same thing pre-patch, but it makes more sense to me with that new lens.* @flags: Flags for local bind, loopback address/unicast address as source - * @ts: Activity timestamp from tap, used for socket aging + * @ts: Activity timestamp */ -struct udp_tap_port { - int sock; - uint8_t flags; +struct udp_port_flags { #define PORT_LOCAL BIT(0) #define PORT_LOOPBACK BIT(1) #define PORT_GUA BIT(2) - + uint8_t flags; time_t ts; }; +static struct udp_port_flags udp_port_flags[IP_VERSIONS][NUM_PORTS]; /** - * struct udp_splice_port - Bound socket for spliced communication - * @sock: Socket bound to index port + * struct udp_bound_port - Track bound UDP sockets + * @sock: Socket bound to index port (or -1) * @ts: Activity timestamp */ -struct udp_splice_port { +struct udp_bound_port { int sock; time_t ts; }; -/* Port tracking, arrays indexed by packet source port (host order) */ -static struct udp_tap_port udp_tap_map [IP_VERSIONS][NUM_PORTS]; +/* Bound sockets indexed by bound port (host order) */ +static struct udp_bound_port udp_tap_sock [IP_VERSIONS][NUM_PORTS]; -/* "Spliced" sockets indexed by bound port (host order) */ -static struct udp_splice_port udp_splice_ns [IP_VERSIONS][NUM_PORTS]; -static struct udp_splice_port udp_splice_init[IP_VERSIONS][NUM_PORTS]; +/* Bound sockets for splicing indexed by bound port (host order) */ +static struct udp_bound_port udp_splice_ns [IP_VERSIONS][NUM_PORTS]; +static struct udp_bound_port udp_splice_init [IP_VERSIONS][NUM_PORTS]; enum udp_act_type { - UDP_ACT_TAP, + UDP_ACT_TAP_SOCK, + UDP_ACT_FLAGS,I'm having a hard time reviewing the rest because of this, I think. Earlier, UDP_ACT_TAP meant that the activity timestamp was observed on a non-spliced port.Now, we have UDP_ACT_FLAGS which seems to mean that activity was seen in the direction of (to) the "tap" side, and UDP_ACT_TAP_SOCK meaning that activity was seen in the direction of (to) the socket. But that's somehow the opposite of UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT -- even though they don't actually indicate activity, rather the fact that activity timestamps refer to sockets that originated in (*from*) the target namespace, or in the initial namespace. Anyway, I'm not quite sure: why do we need two separate flags for "tap" (from/to) activity?We don't, this is an ugly side effect. The problem is that in scanning for expiries we only have a single port number. But the socket is associated with a particular forwarding port, while the flags are associated with a particular endpoint port. Without flow tracking, we don't have a way to match one to the other. The compromise here is - or is supposed to be - that we have independent timestamps for the socket and the flags. The two timestamps associated with a single flow are supposed to be updated at basically the same time - looks like I missed some things to do that as well.I mean, as long as we observe activity on a non-spliced flow, we'll update the related timestamp, and we make sure the activity flag is set. Can't it simply be UDP_ACT_TAP? The rest of the series looks good to me.-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibsonUDP_ACT_SPLICE_NS, UDP_ACT_SPLICE_INIT, UDP_ACT_TYPE_MAX, @@ -246,7 +245,7 @@ void udp_portmap_clear(void) unsigned i; for (i = 0; i < NUM_PORTS; i++) { - udp_tap_map[V4][i].sock = udp_tap_map[V6][i].sock = -1; + udp_tap_sock[V4][i].sock = udp_tap_sock[V6][i].sock = -1; udp_splice_ns[V4][i].sock = udp_splice_ns[V6][i].sock = -1; udp_splice_init[V4][i].sock = udp_splice_init[V6][i].sock = -1; } @@ -393,16 +392,16 @@ int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) union epoll_ref ref = { .type = EPOLL_TYPE_UDP, .udp = { .splice = true, .v6 = v6, .port = src } }; - struct udp_splice_port *sp; + struct udp_bound_port *bp; int act, s; if (ns) { ref.udp.pif = PIF_SPLICE; - sp = &udp_splice_ns[v6 ? V6 : V4][src]; + bp = &udp_splice_ns[v6 ? V6 : V4][src]; act = UDP_ACT_SPLICE_NS; } else { ref.udp.pif = PIF_HOST; - sp = &udp_splice_init[v6 ? V6 : V4][src]; + bp = &udp_splice_init[v6 ? V6 : V4][src]; act = UDP_ACT_SPLICE_INIT; } @@ -437,7 +436,7 @@ int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) goto fail; } - sp->sock = s; + bp->sock = s; bitmap_set(udp_act[v6 ? V6 : V4][act], src); ev.data.u64 = ref.u64; @@ -601,15 +600,15 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport, } else if (IN4_IS_ADDR_LOOPBACK(&b->s_in.sin_addr) || IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.addr_seen)) { b->iph.saddr = c->ip4.gw.s_addr; - udp_tap_map[V4][src_port].ts = now->tv_sec; - udp_tap_map[V4][src_port].flags |= PORT_LOCAL; + udp_port_flags[V4][src_port].ts = now->tv_sec; + udp_port_flags[V4][src_port].flags |= PORT_LOCAL; if (IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr.s_addr, &c->ip4.addr_seen)) - udp_tap_map[V4][src_port].flags &= ~PORT_LOOPBACK; + udp_port_flags[V4][src_port].flags &= ~PORT_LOOPBACK; else - udp_tap_map[V4][src_port].flags |= PORT_LOOPBACK; + udp_port_flags[V4][src_port].flags |= PORT_LOOPBACK; - bitmap_set(udp_act[V4][UDP_ACT_TAP], src_port); + bitmap_set(udp_act[V4][UDP_ACT_FLAGS], src_port); } else { b->iph.saddr = b->s_in.sin_addr.s_addr; } @@ -664,20 +663,20 @@ static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport, else b->ip6h.saddr = c->ip6.addr_ll; - udp_tap_map[V6][src_port].ts = now->tv_sec; - udp_tap_map[V6][src_port].flags |= PORT_LOCAL; + udp_port_flags[V6][src_port].ts = now->tv_sec; + udp_port_flags[V6][src_port].flags |= PORT_LOCAL; if (IN6_IS_ADDR_LOOPBACK(src)) - udp_tap_map[V6][src_port].flags |= PORT_LOOPBACK; + udp_port_flags[V6][src_port].flags |= PORT_LOOPBACK; else - udp_tap_map[V6][src_port].flags &= ~PORT_LOOPBACK; + udp_port_flags[V6][src_port].flags &= ~PORT_LOOPBACK; if (IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr)) - udp_tap_map[V6][src_port].flags |= PORT_GUA; + udp_port_flags[V6][src_port].flags |= PORT_GUA; else - udp_tap_map[V6][src_port].flags &= ~PORT_GUA; + udp_port_flags[V6][src_port].flags &= ~PORT_GUA; - bitmap_set(udp_act[V6][UDP_ACT_TAP], src_port); + bitmap_set(udp_act[V6][UDP_ACT_FLAGS], src_port); } else { b->ip6h.daddr = c->ip6.addr_seen; b->ip6h.saddr = b->s_in6.sin6_addr; @@ -857,16 +856,16 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, s_in.sin_addr = c->ip4.dns_host; } else if (IN4_ARE_ADDR_EQUAL(&s_in.sin_addr, &c->ip4.gw) && !c->no_map_gw) { - if (!(udp_tap_map[V4][dst].flags & PORT_LOCAL) || - (udp_tap_map[V4][dst].flags & PORT_LOOPBACK)) + if (!(udp_port_flags[V4][dst].flags & PORT_LOCAL) || + (udp_port_flags[V4][dst].flags & PORT_LOOPBACK)) s_in.sin_addr.s_addr = htonl(INADDR_LOOPBACK); else s_in.sin_addr = c->ip4.addr_seen; } debug("UDP from tap src=%hu dst=%hu, s=%d", - src, dst, udp_tap_map[V4][src].sock); - if ((s = udp_tap_map[V4][src].sock) < 0) { + src, dst, udp_tap_sock[V4][src].sock); + if ((s = udp_tap_sock[V4][src].sock) < 0) { struct in_addr bind_addr = IN4ADDR_ANY_INIT; union udp_epoll_ref uref = { .port = src, @@ -886,11 +885,11 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, if (s < 0) return p->count - idx; - udp_tap_map[V4][src].sock = s; - bitmap_set(udp_act[V4][UDP_ACT_TAP], src); + udp_tap_sock[V4][src].sock = s; + bitmap_set(udp_act[V4][UDP_ACT_TAP_SOCK], src); } - udp_tap_map[V4][src].ts = now->tv_sec; + udp_tap_sock[V4][src].ts = now->tv_sec; } else { s_in6 = (struct sockaddr_in6) { .sin6_family = AF_INET6, @@ -907,10 +906,10 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, s_in6.sin6_addr = c->ip6.dns_host; } else if (IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw) && !c->no_map_gw) { - if (!(udp_tap_map[V6][dst].flags & PORT_LOCAL) || - (udp_tap_map[V6][dst].flags & PORT_LOOPBACK)) + if (!(udp_port_flags[V6][dst].flags & PORT_LOCAL) || + (udp_port_flags[V6][dst].flags & PORT_LOOPBACK)) s_in6.sin6_addr = in6addr_loopback; - else if (udp_tap_map[V6][dst].flags & PORT_GUA) + else if (udp_port_flags[V6][dst].flags & PORT_GUA) s_in6.sin6_addr = c->ip6.addr; else s_in6.sin6_addr = c->ip6.addr_seen; @@ -918,7 +917,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, bind_addr = &c->ip6.addr_ll; } - if ((s = udp_tap_map[V6][src].sock) < 0) { + if ((s = udp_tap_sock[V6][src].sock) < 0) { union udp_epoll_ref uref = { .v6 = 1, .port = src, @@ -939,11 +938,11 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, if (s < 0) return p->count - idx; - udp_tap_map[V6][src].sock = s; - bitmap_set(udp_act[V6][UDP_ACT_TAP], src); + udp_tap_sock[V6][src].sock = s; + bitmap_set(udp_act[V6][UDP_ACT_TAP_SOCK], src); } - udp_tap_map[V6][src].ts = now->tv_sec; + udp_tap_sock[V6][src].ts = now->tv_sec; } for (i = 0; i < (int)p->count - idx; i++) { @@ -1015,7 +1014,7 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, r4 = s = sock_l4(c, AF_INET, IPPROTO_UDP, addr, ifname, port, uref.u32); - udp_tap_map[V4][uref.port].sock = s < 0 ? -1 : s; + udp_tap_sock[V4][uref.port].sock = s < 0 ? -1 : s; udp_splice_init[V4][port].sock = s < 0 ? -1 : s; } else { struct in_addr loopback = IN4ADDR_LOOPBACK_INIT; @@ -1033,7 +1032,7 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, r6 = s = sock_l4(c, AF_INET6, IPPROTO_UDP, addr, ifname, port, uref.u32); - udp_tap_map[V6][uref.port].sock = s < 0 ? -1 : s; + udp_tap_sock[V6][uref.port].sock = s < 0 ? -1 : s; udp_splice_init[V6][port].sock = s < 0 ? -1 : s; } else { r6 = s = sock_l4(c, AF_INET6, IPPROTO_UDP, @@ -1086,32 +1085,37 @@ static void udp_splice_iov_init(void) static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, in_port_t port, const struct timespec *now) { - struct udp_splice_port *sp; - struct udp_tap_port *tp; + struct udp_bound_port *bp; + struct udp_port_flags *pf; int *sockp = NULL; switch (type) { - case UDP_ACT_TAP: - tp = &udp_tap_map[v6 ? V6 : V4][port]; + case UDP_ACT_TAP_SOCK: + bp = &udp_tap_sock[v6 ? V6 : V4][port]; - if (now->tv_sec - tp->ts > UDP_CONN_TIMEOUT) { - sockp = &tp->sock; - tp->flags = 0; - } + if (now->tv_sec - bp->ts > UDP_CONN_TIMEOUT) + sockp = &bp->sock; + + break; + case UDP_ACT_FLAGS: + pf = &udp_port_flags[v6 ? V6 : V4][port]; + + if (now->tv_sec - pf->ts > UDP_CONN_TIMEOUT) + pf->flags = 0; break; case UDP_ACT_SPLICE_INIT: - sp = &udp_splice_init[v6 ? V6 : V4][port]; + bp = &udp_splice_init[v6 ? V6 : V4][port]; - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; + if (now->tv_sec - bp->ts > UDP_CONN_TIMEOUT) + sockp = &bp->sock; break; case UDP_ACT_SPLICE_NS: - sp = &udp_splice_ns[v6 ? V6 : V4][port]; + bp = &udp_splice_ns[v6 ? V6 : V4][port]; - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; + if (now->tv_sec - bp->ts > UDP_CONN_TIMEOUT) + sockp = &bp->sock; break; default: @@ -1141,7 +1145,7 @@ static void udp_port_rebind(struct ctx *c, bool outbound) = outbound ? c->udp.fwd_out.f.map : c->udp.fwd_in.f.map; const uint8_t *rmap = outbound ? c->udp.fwd_in.f.map : c->udp.fwd_out.f.map; - struct udp_splice_port (*socks)[NUM_PORTS] + struct udp_bound_port (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init; unsigned port;
On Wed, Feb 28, 2024 at 01:30:05PM +1100, David Gibson wrote:On Tue, Feb 27, 2024 at 12:56:59PM +0100, Stefano Brivio wrote: > On Thu, 22 Feb 2024 10:21:12 +1100 > David Gibson <david(a)gibson.dropbear.id.au> wrote:[snip]Urgh. So it seems like every time I re-examine this, I find more pre-existing bugs. So, part of what this patch was attempting to fix originally is that previously, although we updated the timestamp host->guest packets, we updated the wrong one for expiring the socket (based on endpoint rather than forwarding port). In trying to fix it up, I additionally realised, we currently don't update the timestamp for host to guest packets at all, except in the when remapping the src to gateway address, which as noted previously itself doesn't honour the no_map_gw flag. *Then* I realised that if we do update the correct timestamp and ACT flag, we could now set an ACT flag for a socket created by -u, and therefore cause a port that should remain open to be expired. I don't think this any longer counts as a "small" fix, so I will remove it from this series and tackle it elsewhere. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibsonI'm having a hard time reviewing the rest because of this, I think. Earlier, UDP_ACT_TAP meant that the activity timestamp was observed on a non-spliced port.Ah! I'd read those as "ACTions we needed to take" in the expiry timers, rather than tying them to specific activity. It amounts to the same thing pre-patch, but it makes more sense to me with that new lens.Now, we have UDP_ACT_FLAGS which seems to mean that activity was seen in the direction of (to) the "tap" side, and UDP_ACT_TAP_SOCK meaning that activity was seen in the direction of (to) the socket. But that's somehow the opposite of UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT -- even though they don't actually indicate activity, rather the fact that activity timestamps refer to sockets that originated in (*from*) the target namespace, or in the initial namespace. Anyway, I'm not quite sure: why do we need two separate flags for "tap" (from/to) activity?We don't, this is an ugly side effect. The problem is that in scanning for expiries we only have a single port number. But the socket is associated with a particular forwarding port, while the flags are associated with a particular endpoint port. Without flow tracking, we don't have a way to match one to the other. The compromise here is - or is supposed to be - that we have independent timestamps for the socket and the flags. The two timestamps associated with a single flow are supposed to be updated at basically the same time - looks like I missed some things to do that as well. > I mean, as long as we observe activity on a non-spliced flow, we'll > update the related timestamp, and we make sure the activity flag is > set. Can't it simply be UDP_ACT_TAP?
Streamline the logic here slightly, by introducing a 'src' temporary for brevity. We also transform the logic for setting/clearing PORT_LOOPBACK. This makes udp_update_hdr4() more closely match the corresponding logic from udp_update_udp6(). Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/udp.c b/udp.c index 97518d92..87f440f5 100644 --- a/udp.c +++ b/udp.c @@ -583,6 +583,7 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport, const struct timespec *now) { struct udp4_l2_buf_t *b = &udp4_l2_buf[n]; + struct in_addr *src; in_port_t src_port; size_t ip_len; @@ -591,26 +592,26 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport, b->iph.tot_len = htons(ip_len); b->iph.daddr = c->ip4.addr_seen.s_addr; + src = &b->s_in.sin_addr; src_port = ntohs(b->s_in.sin_port); if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_match) && - IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.dns_host) && - src_port == 53) { + IN4_ARE_ADDR_EQUAL(src, &c->ip4.dns_host) && src_port == 53) { b->iph.saddr = c->ip4.dns_match.s_addr; - } else if (IN4_IS_ADDR_LOOPBACK(&b->s_in.sin_addr) || - IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.addr_seen)) { + } else if (IN4_IS_ADDR_LOOPBACK(src) || + IN4_ARE_ADDR_EQUAL(src, &c->ip4.addr_seen)) { b->iph.saddr = c->ip4.gw.s_addr; udp_port_flags[V4][src_port].ts = now->tv_sec; udp_port_flags[V4][src_port].flags |= PORT_LOCAL; - if (IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr.s_addr, &c->ip4.addr_seen)) - udp_port_flags[V4][src_port].flags &= ~PORT_LOOPBACK; - else + if (IN4_IS_ADDR_LOOPBACK(src)) udp_port_flags[V4][src_port].flags |= PORT_LOOPBACK; + else + udp_port_flags[V4][src_port].flags &= ~PORT_LOOPBACK; bitmap_set(udp_act[V4][UDP_ACT_FLAGS], src_port); } else { - b->iph.saddr = b->s_in.sin_addr.s_addr; + b->iph.saddr = src->s_addr; } udp_update_check4(b); -- 2.43.2
When forwarding IPv4 packets in udp_tap_handler(), we incorrectly use an IPv6 address test on our IPv4 address (which could cause an out of bounds access), and possibly set our bind interface to the IPv6 interface based on it. Adjust to correctly look at the IPv4 address and IPv4 interface. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/udp.c b/udp.c index 87f440f5..0204f1d2 100644 --- a/udp.c +++ b/udp.c @@ -874,8 +874,8 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, }; const char *bind_if = NULL; - if (!IN6_IS_ADDR_LOOPBACK(&s_in.sin_addr)) - bind_if = c->ip6.ifname_out; + if (!IN4_IS_ADDR_LOOPBACK(&s_in.sin_addr)) + bind_if = c->ip4.ifname_out; if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.addr_out) && !IN4_IS_ADDR_LOOPBACK(&s_in.sin_addr)) -- 2.43.2
If the configured output address is unspecified, we don't set the bind address to it when creating a new socket in udp_tap_handler(). That sounds sensible, but what we're leaving the bind address as is, exactly, the unspecified address, so this test makes no difference. Remove it. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/udp.c b/udp.c index 0204f1d2..304ba206 100644 --- a/udp.c +++ b/udp.c @@ -877,8 +877,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, if (!IN4_IS_ADDR_LOOPBACK(&s_in.sin_addr)) bind_if = c->ip4.ifname_out; - if (!IN4_IS_ADDR_UNSPECIFIED(&c->ip4.addr_out) && - !IN4_IS_ADDR_LOOPBACK(&s_in.sin_addr)) + if (!IN4_IS_ADDR_LOOPBACK(&s_in.sin_addr)) bind_addr = c->ip4.addr_out; s = sock_l4(c, AF_INET, IPPROTO_UDP, &bind_addr, @@ -929,8 +928,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, if (!IN6_IS_ADDR_LOOPBACK(&s_in6.sin6_addr)) bind_if = c->ip6.ifname_out; - if (!IN6_IS_ADDR_UNSPECIFIED(&c->ip6.addr_out) && - !IN6_IS_ADDR_LOOPBACK(&s_in6.sin6_addr) && + if (!IN6_IS_ADDR_LOOPBACK(&s_in6.sin6_addr) && !IN6_IS_ADDR_LINKLOCAL(&s_in6.sin6_addr)) bind_addr = &c->ip6.addr_out; -- 2.43.2