Although we associate a flow with traffic coming from a socket, for
spliced traffic we don't use that flow entry to address it. Fix that,
using the flow table as the source of truth for addressing "spliced"
datagrams.

As part of this we tweak how we batch datagrams. Previously we'd batch
together contiguous datagrams as long as they had the same source port
on the destination side. Now we batch together datagrams only if they
belong to the same flow.

Previously the structure required that datagrams with the same source
port would also have the same destination port, and we relied on that
fact. Although that will be true for flows at the moment, it may not be
true in future, so we need to ensure that everything in the batch has
the same destination port.

Similarly, although all datagrams will have loopback as both the source
and destination address, these could be different addresses in the
127.0.0.0/8 subnet. To be sent as a single batch we need those
addresses to be the same as well. Again, future changes to how we
construct flow entries may make this a real concern.

If both ports and addresses are the same, it must be the same flow,
since that's how flows are looked up. So, we can simplify the logic by
checking for the same flow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 udp.c | 149 +++++++++++++++++----------------------------------------
 1 file changed, 43 insertions(+), 106 deletions(-)

diff --git a/udp.c b/udp.c
index 29f3ba85..89dc0307 100644
--- a/udp.c
+++ b/udp.c
@@ -73,81 +73,26 @@
  * need to forward a datagram from that port, bound to the configured outbound
  * address (which may be "any").
  *
- * Port Tracking
- * =============
+ * - L4 to L4 ("spliced") traffic
  *
- * For datagrams not handled by the flow table, a reduced version of port-based
- * connection tracking is implemented with two purposes:
- * - binding ephemeral ports when they're used as source port by the guest, so
- *   that replies on those ports can be forwarded back to the guest, with a
- *   fixed timeout for this binding
- * - packets received from the local host get their source changed to a local
- *   address (gateway address) so that they can be forwarded to the guest, and
- *   packets sent as replies by the guest need their destination address to
- *   be changed back to the address of the local host. This is dynamic to allow
- *   connections from the gateway as well, and uses the same fixed 180s timeout
- *
- * Sockets for bound ports are created at initialisation time, one set for IPv4
- * and one for IPv6.
+ * In PASTA mode, the L2-L4 translation is skipped for datagrams between host
+ * and namespace with loopback addresses on both sides. Instead, messages are
+ * directly transferred between L4 sockets. These are called spliced flows by
+ * analogy with the TCP implementation, but the splice() syscall isn't
+ * actually used (that's only for streams); a pair of recvmmsg() and
+ * sendmmsg() deals with this case.
  *
- * Packets are forwarded back and forth, by prepending and stripping UDP headers
- * in the obvious way, with no port translation.
+ * To find a suitable sending socket for spliced datagrams, that is bound to
+ * localhost and the appropriate forwarding port, we use udp_splice_init[] for
+ * sockets in the initial host namespace and udp_splice_ns[] for sockets in
+ * the guest namespace.
  *
- * In PASTA mode, the L2-L4 translation is skipped for connections to ports
- * bound between namespaces using the loopback interface, messages are directly
- * transferred between L4 sockets instead. These are called spliced connections
- * for consistency with the TCP implementation, but the splice() syscall isn't
- * actually used as it wouldn't make sense for datagram-based connections: a
- * pair of recvmmsg() and sendmmsg() deals with this case.
+ * FIXME: sockets in udp_splice_init[] can conflict with sockets bound to the
+ * "any" address in udp_tap_map[].
  *
- * The connection tracking for PASTA mode is slightly complicated by the absence
- * of actual connections, see struct udp_splice_port, and these examples:
- *
- * - from init to namespace:
- *
- *   - forward direction: 127.0.0.1:5000 -> 127.0.0.1:80 in init from socket s,
- *     with epoll reference: index = 80, splice = 1, orig = 1, ns = 0
- *     - if udp_splice_ns[V4][5000].sock:
- *       - send packet to udp_splice_ns[V4][5000].sock, with destination port
- *         80
- *     - otherwise:
- *       - create new socket udp_splice_ns[V4][5000].sock
- *       - bind in namespace to 127.0.0.1:5000
- *       - add to epoll with reference: index = 5000, splice = 1, orig = 0,
- *         ns = 1
- *     - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with
- *       current time
- *
- *   - reverse direction: 127.0.0.1:80 -> 127.0.0.1:5000 in namespace socket s,
- *     having epoll reference: index = 5000, splice = 1, orig = 0, ns = 1
- *     - if udp_splice_init[V4][80].sock:
- *       - send to udp_splice_init[V4][80].sock, with destination port 5000
- *       - update udp_splice_init[V4][80].ts and udp_splice_ns[V4][5000].ts with
- *         current time
- *     - otherwise, discard
- *
- * - from namespace to init:
- *
- *   - forward direction: 127.0.0.1:2000 -> 127.0.0.1:22 in namespace from
- *     socket s, with epoll reference: index = 22, splice = 1, orig = 1, ns = 1
- *     - if udp4_splice_init[V4][2000].sock:
- *       - send packet to udp_splice_init[V4][2000].sock, with destination
- *         port 22
- *     - otherwise:
- *       - create new socket udp_splice_init[V4][2000].sock
- *       - bind in init to 127.0.0.1:2000
- *       - add to epoll with reference: index = 2000, splice = 1, orig = 0,
- *         ns = 0
- *     - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with
- *       current time
- *
- *   - reverse direction: 127.0.0.1:22 -> 127.0.0.1:2000 in init from socket s,
- *     having epoll reference: index = 2000, splice = 1, orig = 0, ns = 0
- *     - if udp_splice_ns[V4][22].sock:
- *       - send to udp_splice_ns[V4][22].sock, with destination port 2000
- *       - update udp_splice_ns[V4][22].ts and udp_splice_init[V4][2000].ts with
- *         current time
- *     - otherwise, discard
+ * FIXME: We don't handle the case of needing multiple "splice" sockets for a
+ * single port, due to using different IPv4 loopback addresses (e.g. 127.0.0.1
+ * and 127.0.0.2).
  */
 
 #include <sched.h>
@@ -305,14 +250,7 @@ static struct mmsghdr	udp6_l2_mh_sock		[UDP_MAX_FRAMES];
 /* recvmmsg()/sendmmsg() data for "spliced" connections */
 static struct iovec	udp_iov_splice		[UDP_MAX_FRAMES];
 
-static struct sockaddr_in udp4_localname = {
-	.sin_family = AF_INET,
-	.sin_addr = IN4ADDR_LOOPBACK_INIT,
-};
-static struct sockaddr_in6 udp6_localname = {
-	.sin6_family = AF_INET6,
-	.sin6_addr = IN6ADDR_LOOPBACK_INIT,
-};
+static union sockaddr_inany udp_splicename;
 
 static struct mmsghdr	udp4_mh_splice		[UDP_MAX_FRAMES];
 static struct mmsghdr	udp6_mh_splice		[UDP_MAX_FRAMES];
@@ -633,37 +571,38 @@ cancel:
  * @c:		Execution context
  * @start:	Index of first datagram in udp[46]_l2_buf
  * @n:		Total number of datagrams in udp[46]_l2_buf pool
- * @dst:	Datagrams will be sent to this port (on destination side)
  * @uref:	UDP epoll reference for origin socket
  * @now:	Timestamp
  *
- * This consumes as many datagrams as are sendable via a single socket. It
- * requires that udp_meta[@start].splicesrc is initialised, and will initialise
- * udp_meta[].splicesrc for each datagram it consumes *and one more* (if
- * present).
+ * This consumes as many contiguous datagrams as possible with the same flow.
+ * It requires that udp_meta[@start].tosidx is initialised, and will initialise
+ * udp_meta[].tosidx for each frame it consumes *and one more* (if present).
  *
  * Return: Number of datagrams forwarded
  */
 static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n,
-				in_port_t dst, union udp_epoll_ref uref,
+				union udp_epoll_ref uref,
 				const struct timespec *now)
 {
-	in_port_t src = udp_meta[start].splicesrc;
+	flow_sidx_t startsidx = udp_meta[start].tosidx;
+	const struct flowside *toside = flowside_at_sidx(startsidx);
+	in_port_t src = toside->fport, dst = toside->eport;
+	uint8_t topif = pif_at_sidx(startsidx);
 	struct mmsghdr *mmh_recv, *mmh_send;
 	unsigned int i = start;
+	socklen_t sl;
 	int s;
 
-	ASSERT(udp_meta[start].splicesrc >= 0);
-
 	if (uref.v6) {
 		mmh_recv = udp6_l2_mh_sock;
 		mmh_send = udp6_mh_splice;
-		udp6_localname.sin6_port = htons(dst);
 	} else {
 		mmh_recv = udp4_l2_mh_sock;
 		mmh_send = udp4_mh_splice;
-		udp4_localname.sin_port = htons(dst);
 	}
+	/* We don't need a scope id, because it's always a loopback address */
+	sockaddr_from_inany(&udp_splicename, &sl,
+			    &toside->eaddr, toside->eport, 0);
 
 	do {
 		mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len;
@@ -673,10 +612,9 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n,
 
 		udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]);
 		udp_meta[i].tosidx = udp_flow_from_sock(c, uref, &udp_meta[i]);
-	} while (udp_meta[i].splicesrc == src);
+	} while (flow_sidx_eq(udp_meta[i].tosidx, startsidx));
 
-	if (uref.pif == PIF_SPLICE) {
-		src += c->udp.fwd_in.rdelta[src];
+	if (topif == PIF_HOST) {
 		s = udp_splice_init[uref.v6][src].sock;
 		if (s < 0 && uref.orig)
 			s = udp_splice_new(c, uref.v6, src, false);
@@ -687,8 +625,7 @@ static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n,
 		udp_splice_ns[uref.v6][dst].ts = now->tv_sec;
 		udp_splice_init[uref.v6][src].ts = now->tv_sec;
 	} else {
-		ASSERT(uref.pif == PIF_HOST);
-		src += c->udp.fwd_out.rdelta[src];
+		ASSERT(topif == PIF_SPLICE);
 		s = udp_splice_ns[uref.v6][src].sock;
 		if (s < 0 && uref.orig) {
 			struct udp_splice_new_ns_arg arg = {
@@ -881,11 +818,11 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve
 		return;
 
 	/* We divide things into batches based on how we need to send them,
-	 * determined by udp_meta[i].splicesrc or tosidx. To avoid either two
-	 * passes through the array, or recalculating splicesrc and tosidx for a
-	 * single entry, we have to populate them one entry *ahead* of the loop
-	 * counter (if present). So we fill in entry 0 before the loop, then
-	 * udp_*_send() populate one entry past where they consume.
+	 * determined by udp_meta[i].tosidx. To avoid either two passes through
+	 * the array, or recalculating splicesrc and tosidx for a single entry,
+	 * we have to populate them one entry *ahead* of the loop counter (if
+	 * present). So we fill in entry 0 before the loop, then udp_*_send()
+	 * populate one entry past where they consume.
 	 */
 	udp_meta[0].splicesrc = udp_mmh_splice_port(ref.udp, mmh_recv);
 	udp_meta[0].tosidx = udp_flow_from_sock(c, ref.udp, &udp_meta[0]);
@@ -895,8 +832,8 @@ void udp_buf_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t eve
 
 		if (topif == PIF_TAP) {
 			m = udp_tap_send(c, i, n, ref.udp, now);
-		} else if (udp_meta[i].splicesrc >= 0) {
-			m = udp_splice_send(c, i, n, dstport, ref.udp, now);
+		} else if (topif != PIF_NONE) {
+			m = udp_splice_send(c, i, n, ref.udp, now);
 		} else {
 			char sstr[SOCKADDR_STRLEN];
 
@@ -1183,11 +1120,11 @@ static void udp_splice_iov_init(void)
 		struct msghdr *mh4 = &udp4_mh_splice[i].msg_hdr;
 		struct msghdr *mh6 = &udp6_mh_splice[i].msg_hdr;
 
-		mh4->msg_name = &udp4_localname;
-		mh4->msg_namelen = sizeof(udp4_localname);
+		mh4->msg_name = &udp_splicename;
+		mh4->msg_namelen = sizeof(udp_splicename.sa4);
 
-		mh6->msg_name = &udp6_localname;
-		mh6->msg_namelen = sizeof(udp6_localname);
+		mh6->msg_name = &udp_splicename;
+		mh6->msg_namelen = sizeof(udp_splicename.sa6);
 
 		udp_iov_splice[i].iov_base = udp_payload[i].data;
-- 
2.45.2
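
The batching rule described in the commit message (contiguous datagrams are sent together only while they resolve to the same flow side) can be looked at in isolation. What follows is a minimal standalone sketch and not passt code: struct sidx, sidx_eq(), batch_len() and split_batches() are hypothetical stand-ins for flow_sidx_t, flow_sidx_eq() and the loops in udp_splice_send() and udp_buf_sock_handler(), and the sketch assumes the flow side of every entry is already known rather than being filled in one entry ahead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for flow_sidx_t: a flow index plus a side bit */
struct sidx {
	uint32_t flow;
	uint8_t side;
};

/* Two entries belong to the same batch only if flow AND side match */
static bool sidx_eq(struct sidx a, struct sidx b)
{
	return a.flow == b.flow && a.side == b.side;
}

/* Length of the contiguous run starting at @start sharing one flow side */
static size_t batch_len(const struct sidx *meta, size_t start, size_t n)
{
	size_t i = start;

	assert(start < n);
	do {
		i++;
	} while (i < n && sidx_eq(meta[i], meta[start]));

	return i - start;
}

/* Split @n entries into batches, storing each batch length in @lens
 * (which needs room for up to @n entries). Returns the batch count.
 */
static size_t split_batches(const struct sidx *meta, size_t n, size_t *lens)
{
	size_t i = 0, nb = 0;

	while (i < n) {
		lens[nb] = batch_len(meta, i, n);
		i += lens[nb++];
	}

	return nb;
}
```

Comparing the flow side rather than a single port means a batch can't mix datagrams that happen to share a source port but differ in destination port or loopback address, which is the case the commit message is guarding against.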