On Tue, Feb 27, 2024 at 12:56:59PM +0100, Stefano Brivio wrote:On Thu, 22 Feb 2024 10:21:12 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Oops, yes. I think I partially rewrote an existing comment, leaving an incoherent fusion of two different sentences.struct udp_tap_port contains both a socket fd and a set of flags. But, confusingly, these fields are not actually related to each other at all. udp_tap_map[?][port].sock contains a host UDP socket bound to 'port' (or -1 if we haven't opened such a socket). That is, it is indexed by the bound port == the host side forwarding port == destination for host->guest packets == source for guest->host packets. udp_tap_map[?][port].flags, however, contains flags which control how we direct datagrams bound for 'port' from the tap interface. That is, it is indexed by the destination for guest->host packets == source for host->guest packets == host side endpoint port. Worse both aspects of this share the ts field used to control expiry. When we update this because we've used the socket in slot 'port' we could be clobbering the flags timestamp for an unrelated flow of packets that happens to have the same endpoint port as our forwarding port. Clarify this by moving the flags out into a separate array, with its own timestamp and expiry path. As a bonus this means that we can now use the same data structure for tracking the socket fds of "tap" and "splice" path sockets. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- udp.c | 132 ++++++++++++++++++++++++++++++---------------------------- 1 file changed, 68 insertions(+), 64 deletions(-) diff --git a/udp.c b/udp.c index 4200f849..97518d92 100644 --- a/udp.c +++ b/udp.c @@ -121,40 +121,39 @@ #define UDP_MAX_FRAMES 32 /* max # of frames to receive at once */ /** - * struct udp_tap_port - Port tracking based on tap-facing source port - * @sock: Socket bound to source port used as index + * struct udp_port_flags - Translation options by remote Port tracking based on tap-facing source portI'm not sure what you mean by this -- perhaps "Address translation options for bound source ports"?Ah! I'd read those as "ACTions we needed to take" in the expiry timers, rather than tying them to specific activity. It amounts to the same thing pre-patch, but it makes more sense to me with that new lens.* @flags: Flags for local bind, loopback address/unicast address as source - * @ts: Activity timestamp from tap, used for socket aging + * @ts: Activity timestamp */ -struct udp_tap_port { - int sock; - uint8_t flags; +struct udp_port_flags { #define PORT_LOCAL BIT(0) #define PORT_LOOPBACK BIT(1) #define PORT_GUA BIT(2) - + uint8_t flags; time_t ts; }; +static struct udp_port_flags udp_port_flags[IP_VERSIONS][NUM_PORTS]; /** - * struct udp_splice_port - Bound socket for spliced communication - * @sock: Socket bound to index port + * struct udp_bound_port - Track bound UDP sockets + * @sock: Socket bound to index port (or -1) * @ts: Activity timestamp */ -struct udp_splice_port { +struct udp_bound_port { int sock; time_t ts; }; -/* Port tracking, arrays indexed by packet source port (host order) */ -static struct udp_tap_port udp_tap_map [IP_VERSIONS][NUM_PORTS]; +/* Bound sockets indexed by bound port (host order) */ +static struct udp_bound_port udp_tap_sock [IP_VERSIONS][NUM_PORTS]; -/* "Spliced" sockets indexed by bound port (host order) */ -static struct udp_splice_port udp_splice_ns [IP_VERSIONS][NUM_PORTS]; -static struct udp_splice_port udp_splice_init[IP_VERSIONS][NUM_PORTS]; +/* Bound sockets for splicing indexed by bound port (host order) */ +static struct udp_bound_port udp_splice_ns [IP_VERSIONS][NUM_PORTS]; +static struct udp_bound_port udp_splice_init [IP_VERSIONS][NUM_PORTS]; enum udp_act_type { - UDP_ACT_TAP, + UDP_ACT_TAP_SOCK, + UDP_ACT_FLAGS,I'm having a hard time reviewing the rest because of this, I think. Earlier, UDP_ACT_TAP meant that the activity timestamp was observed on a non-spliced port.Now, we have UDP_ACT_FLAGS which seems to mean that activity was seen in the direction of (to) the "tap" side, and UDP_ACT_TAP_SOCK meaning that activity was seen in the direction of (to) the socket. But that's somehow the opposite of UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT -- even though they don't actually indicate activity, rather the fact that activity timestamps refer to sockets that originated in (*from*) the target namespace, or in the initial namespace. Anyway, I'm not quite sure: why do we need two separate flags for "tap" (from/to) activity?We don't, this is an ugly side effect. The problem is that in scanning for expiries we only have a single port number. But the socket is associated with a particular forwarding port, while the flags are associated with a particular endpoint port. Without flow tracking, we don't have a way to match one to the other. The compromise here is - or is supposed to be - that we have independent timestamps for the socket and the flags. The two timestamps associated with a single flow are supposed to be updated at basically the same time - looks like I missed some things to do that as well.I mean, as long as we observe activity on a non-spliced flow, we'll update the related timestamp, and we make sure the activity flag is set. Can't it simply be UDP_ACT_TAP? The rest of the series looks good to me.-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibsonUDP_ACT_SPLICE_NS, UDP_ACT_SPLICE_INIT, UDP_ACT_TYPE_MAX, @@ -246,7 +245,7 @@ void udp_portmap_clear(void) unsigned i; for (i = 0; i < NUM_PORTS; i++) { - udp_tap_map[V4][i].sock = udp_tap_map[V6][i].sock = -1; + udp_tap_sock[V4][i].sock = udp_tap_sock[V6][i].sock = -1; udp_splice_ns[V4][i].sock = udp_splice_ns[V6][i].sock = -1; udp_splice_init[V4][i].sock = udp_splice_init[V6][i].sock = -1; } @@ -393,16 +392,16 @@ int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) union epoll_ref ref = { .type = EPOLL_TYPE_UDP, .udp = { .splice = true, .v6 = v6, .port = src } }; - struct udp_splice_port *sp; + struct udp_bound_port *bp; int act, s; if (ns) { ref.udp.pif = PIF_SPLICE; - sp = &udp_splice_ns[v6 ? V6 : V4][src]; + bp = &udp_splice_ns[v6 ? V6 : V4][src]; act = UDP_ACT_SPLICE_NS; } else { ref.udp.pif = PIF_HOST; - sp = &udp_splice_init[v6 ? V6 : V4][src]; + bp = &udp_splice_init[v6 ? V6 : V4][src]; act = UDP_ACT_SPLICE_INIT; } @@ -437,7 +436,7 @@ int udp_splice_new(const struct ctx *c, int v6, in_port_t src, bool ns) goto fail; } - sp->sock = s; + bp->sock = s; bitmap_set(udp_act[v6 ? V6 : V4][act], src); ev.data.u64 = ref.u64; @@ -601,15 +600,15 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport, } else if (IN4_IS_ADDR_LOOPBACK(&b->s_in.sin_addr) || IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr, &c->ip4.addr_seen)) { b->iph.saddr = c->ip4.gw.s_addr; - udp_tap_map[V4][src_port].ts = now->tv_sec; - udp_tap_map[V4][src_port].flags |= PORT_LOCAL; + udp_port_flags[V4][src_port].ts = now->tv_sec; + udp_port_flags[V4][src_port].flags |= PORT_LOCAL; if (IN4_ARE_ADDR_EQUAL(&b->s_in.sin_addr.s_addr, &c->ip4.addr_seen)) - udp_tap_map[V4][src_port].flags &= ~PORT_LOOPBACK; + udp_port_flags[V4][src_port].flags &= ~PORT_LOOPBACK; else - udp_tap_map[V4][src_port].flags |= PORT_LOOPBACK; + udp_port_flags[V4][src_port].flags |= PORT_LOOPBACK; - bitmap_set(udp_act[V4][UDP_ACT_TAP], src_port); + bitmap_set(udp_act[V4][UDP_ACT_FLAGS], src_port); } else { b->iph.saddr = b->s_in.sin_addr.s_addr; } @@ -664,20 +663,20 @@ static size_t udp_update_hdr6(const struct ctx *c, int n, in_port_t dstport, else b->ip6h.saddr = c->ip6.addr_ll; - udp_tap_map[V6][src_port].ts = now->tv_sec; - udp_tap_map[V6][src_port].flags |= PORT_LOCAL; + udp_port_flags[V6][src_port].ts = now->tv_sec; + udp_port_flags[V6][src_port].flags |= PORT_LOCAL; if (IN6_IS_ADDR_LOOPBACK(src)) - udp_tap_map[V6][src_port].flags |= PORT_LOOPBACK; + udp_port_flags[V6][src_port].flags |= PORT_LOOPBACK; else - udp_tap_map[V6][src_port].flags &= ~PORT_LOOPBACK; + udp_port_flags[V6][src_port].flags &= ~PORT_LOOPBACK; if (IN6_ARE_ADDR_EQUAL(src, &c->ip6.addr)) - udp_tap_map[V6][src_port].flags |= PORT_GUA; + udp_port_flags[V6][src_port].flags |= PORT_GUA; else - udp_tap_map[V6][src_port].flags &= ~PORT_GUA; + udp_port_flags[V6][src_port].flags &= ~PORT_GUA; - bitmap_set(udp_act[V6][UDP_ACT_TAP], src_port); + bitmap_set(udp_act[V6][UDP_ACT_FLAGS], src_port); } else { b->ip6h.daddr = c->ip6.addr_seen; b->ip6h.saddr = b->s_in6.sin6_addr; @@ -857,16 +856,16 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, s_in.sin_addr = c->ip4.dns_host; } else if (IN4_ARE_ADDR_EQUAL(&s_in.sin_addr, &c->ip4.gw) && !c->no_map_gw) { - if (!(udp_tap_map[V4][dst].flags & PORT_LOCAL) || - (udp_tap_map[V4][dst].flags & PORT_LOOPBACK)) + if (!(udp_port_flags[V4][dst].flags & PORT_LOCAL) || + (udp_port_flags[V4][dst].flags & PORT_LOOPBACK)) s_in.sin_addr.s_addr = htonl(INADDR_LOOPBACK); else s_in.sin_addr = c->ip4.addr_seen; } debug("UDP from tap src=%hu dst=%hu, s=%d", - src, dst, udp_tap_map[V4][src].sock); - if ((s = udp_tap_map[V4][src].sock) < 0) { + src, dst, udp_tap_sock[V4][src].sock); + if ((s = udp_tap_sock[V4][src].sock) < 0) { struct in_addr bind_addr = IN4ADDR_ANY_INIT; union udp_epoll_ref uref = { .port = src, @@ -886,11 +885,11 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, if (s < 0) return p->count - idx; - udp_tap_map[V4][src].sock = s; - bitmap_set(udp_act[V4][UDP_ACT_TAP], src); + udp_tap_sock[V4][src].sock = s; + bitmap_set(udp_act[V4][UDP_ACT_TAP_SOCK], src); } - udp_tap_map[V4][src].ts = now->tv_sec; + udp_tap_sock[V4][src].ts = now->tv_sec; } else { s_in6 = (struct sockaddr_in6) { .sin6_family = AF_INET6, @@ -907,10 +906,10 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, s_in6.sin6_addr = c->ip6.dns_host; } else if (IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw) && !c->no_map_gw) { - if (!(udp_tap_map[V6][dst].flags & PORT_LOCAL) || - (udp_tap_map[V6][dst].flags & PORT_LOOPBACK)) + if (!(udp_port_flags[V6][dst].flags & PORT_LOCAL) || + (udp_port_flags[V6][dst].flags & PORT_LOOPBACK)) s_in6.sin6_addr = in6addr_loopback; - else if (udp_tap_map[V6][dst].flags & PORT_GUA) + else if (udp_port_flags[V6][dst].flags & PORT_GUA) s_in6.sin6_addr = c->ip6.addr; else s_in6.sin6_addr = c->ip6.addr_seen; @@ -918,7 +917,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, bind_addr = &c->ip6.addr_ll; } - if ((s = udp_tap_map[V6][src].sock) < 0) { + if ((s = udp_tap_sock[V6][src].sock) < 0) { union udp_epoll_ref uref = { .v6 = 1, .port = src, @@ -939,11 +938,11 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, if (s < 0) return p->count - idx; - udp_tap_map[V6][src].sock = s; - bitmap_set(udp_act[V6][UDP_ACT_TAP], src); + udp_tap_sock[V6][src].sock = s; + bitmap_set(udp_act[V6][UDP_ACT_TAP_SOCK], src); } - udp_tap_map[V6][src].ts = now->tv_sec; + udp_tap_sock[V6][src].ts = now->tv_sec; } for (i = 0; i < (int)p->count - idx; i++) { @@ -1015,7 +1014,7 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, r4 = s = sock_l4(c, AF_INET, IPPROTO_UDP, addr, ifname, port, uref.u32); - udp_tap_map[V4][uref.port].sock = s < 0 ? -1 : s; + udp_tap_sock[V4][uref.port].sock = s < 0 ? -1 : s; udp_splice_init[V4][port].sock = s < 0 ? -1 : s; } else { struct in_addr loopback = IN4ADDR_LOOPBACK_INIT; @@ -1033,7 +1032,7 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, r6 = s = sock_l4(c, AF_INET6, IPPROTO_UDP, addr, ifname, port, uref.u32); - udp_tap_map[V6][uref.port].sock = s < 0 ? -1 : s; + udp_tap_sock[V6][uref.port].sock = s < 0 ? -1 : s; udp_splice_init[V6][port].sock = s < 0 ? -1 : s; } else { r6 = s = sock_l4(c, AF_INET6, IPPROTO_UDP, @@ -1086,32 +1085,37 @@ static void udp_splice_iov_init(void) static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type, in_port_t port, const struct timespec *now) { - struct udp_splice_port *sp; - struct udp_tap_port *tp; + struct udp_bound_port *bp; + struct udp_port_flags *pf; int *sockp = NULL; switch (type) { - case UDP_ACT_TAP: - tp = &udp_tap_map[v6 ? V6 : V4][port]; + case UDP_ACT_TAP_SOCK: + bp = &udp_tap_sock[v6 ? V6 : V4][port]; - if (now->tv_sec - tp->ts > UDP_CONN_TIMEOUT) { - sockp = &tp->sock; - tp->flags = 0; - } + if (now->tv_sec - bp->ts > UDP_CONN_TIMEOUT) + sockp = &bp->sock; + + break; + case UDP_ACT_FLAGS: + pf = &udp_port_flags[v6 ? V6 : V4][port]; + + if (now->tv_sec - pf->ts > UDP_CONN_TIMEOUT) + pf->flags = 0; break; case UDP_ACT_SPLICE_INIT: - sp = &udp_splice_init[v6 ? V6 : V4][port]; + bp = &udp_splice_init[v6 ? V6 : V4][port]; - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; + if (now->tv_sec - bp->ts > UDP_CONN_TIMEOUT) + sockp = &bp->sock; break; case UDP_ACT_SPLICE_NS: - sp = &udp_splice_ns[v6 ? V6 : V4][port]; + bp = &udp_splice_ns[v6 ? V6 : V4][port]; - if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT) - sockp = &sp->sock; + if (now->tv_sec - bp->ts > UDP_CONN_TIMEOUT) + sockp = &bp->sock; break; default: @@ -1141,7 +1145,7 @@ static void udp_port_rebind(struct ctx *c, bool outbound) = outbound ? c->udp.fwd_out.f.map : c->udp.fwd_in.f.map; const uint8_t *rmap = outbound ? c->udp.fwd_in.f.map : c->udp.fwd_out.f.map; - struct udp_splice_port (*socks)[NUM_PORTS] + struct udp_bound_port (*socks)[NUM_PORTS] = outbound ? udp_splice_ns : udp_splice_init; unsigned port;