vhost-net is a kernel device that allows reading packets from a tap
device using virtio queues instead of regular read(2) and write(2).
This enables more efficient packet processing: the kernel can write
the memory directly to userspace and back, instead of wasting
bandwidth on copies, and many packets can be batched in a single
notification (through eventfds), both tx and rx.

Namespace tx performance improves from ~26.3Gbit/s to ~36.9Gbit/s.
Namespace rx performance improves from ~16Gbit/s to ~17.26Gbit/s.

RFC: At this moment only these are supported:
* Receive l2 packets from the vhost kernel to pasta
* Send l4 tcp socket received data through vhost-kernel to the
  namespace

TODO: Add vhost zerocopy in the tests, and compare with veth.
TODO: Implement at least UDP tx. Or maybe we want UDP to use write(2)
      because of latency?
TODO: Check style for variable declarations in for loops and use of
      curly brackets when they wrap more than a line.
TODO: kerneldoc style function header comments

--
v2: Added TCP tx, and integrated some comments from the previous
    series. Please check each patch message for details.

Eugenio Pérez (11):
  tap: implement vhost_call_cb
  tap: add die() on vhost error
  Replace tx tap hdr with virtio_nethdr_mrg_rxbuf
  tcp: export memory regions to vhost
  virtio: Fill .next in tx queue
  tap: move static iov_sock to tcp_buf_data_from_sock
  tap: support tx through vhost
  tap: add tap_free_old_xmit
  tcp: start conversion to circular buffer
  Add poll(2) to used_idx
  tcp_buf: adding TCP tx circular buffer

 arp.c        |   2 +-
 epoll_type.h |   4 +
 passt.c      |  12 +-
 passt.h      |  11 +-
 tap.c        | 489 +++++++++++++++++++++++++++++++++++++++++++++++++--
 tap.h        |  13 +-
 tcp.c        |   2 +-
 tcp_buf.c    | 179 +++++++++++++++----
 tcp_buf.h    |  19 ++
 udp.c        |   2 +-
 10 files changed, 675 insertions(+), 58 deletions(-)

--
2.50.0
Implement the rx side of vhost-kernel, where the namespace sends
packets through its tap device and the kernel passes them to
pasta using a virtqueue. The virtqueue is built using a
virtual ring (vring), which includes:
* A descriptor table (buffer info)
* An available ring (buffers ready for the kernel)
* A used ring (buffers that the kernel has filled)
The descriptor table holds an array of buffers defined by address and
length. The kernel writes the packets transmitted by the namespace into
these buffers. The number of descriptors (vq size) is set by
VHOST_NDESCS. Pasta fills this table using pkt_buf, splitting it
evenly across all descriptors. This table is read-only for the kernel.
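The even split of pkt_buf across the descriptor table can be sketched
as follows. This is a minimal, self-contained sketch: vring_desc_sketch
mirrors struct vring_desc from <linux/virtio_ring.h>, and VHOST_NDESCS
and the 2048-byte per-descriptor chunk are illustrative assumptions,
not pasta's actual values.

```c
#include <assert.h>
#include <stdint.h>

#define VHOST_NDESCS 256       /* assumed vq size */
#define VHOST_BUF_BYTES 2048UL /* assumed per-descriptor chunk */
#define VRING_DESC_F_WRITE 2   /* device (kernel) writes this buffer */

/* Minimal stand-in for struct vring_desc from <linux/virtio_ring.h> */
struct vring_desc_sketch {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Split pkt_buf evenly across all rx descriptors: descriptor i covers
 * bytes [i * chunk, (i + 1) * chunk) of the buffer.  The kernel will
 * write received frames into these buffers, so F_WRITE is set. */
static void vring_fill_rx_descs(struct vring_desc_sketch *desc,
				const unsigned char *pkt_buf)
{
	for (unsigned i = 0; i < VHOST_NDESCS; i++) {
		desc[i].addr =
			(uint64_t)(uintptr_t)(pkt_buf + i * VHOST_BUF_BYTES);
		desc[i].len = VHOST_BUF_BYTES;
		desc[i].flags = VRING_DESC_F_WRITE;
		desc[i].next = 0;
	}
}
```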
The available ring is where pasta marks which buffers the kernel can
use. It is read-only for the kernel. It includes a ring[] array with
the descriptor indexes and an avail->idx index. Pasta increments
avail->idx when a new buffer is added, modulo the size of the
virtqueue. As pasta writes the rx buffers sequentially, ring[] is
always [0, 1, 2...] and only avail->idx is incremented when new buffers
are available for the kernel. avail->idx can be incremented by more
than one at a time.
Pasta also notifies the kernel of new available buffers by writing to
the kick eventfd.
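Publishing new buffers and kicking the kernel can be sketched like
this, assuming a Linux eventfd as the kick mechanism (as vhost uses).
The struct is a stand-in for the avail ring layout; the barrier note
marks what real code must add.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

#define VHOST_NDESCS 256 /* assumed vq size */

/* Minimal stand-in for struct vring_avail from <linux/virtio_ring.h> */
struct vring_avail_sketch {
	uint16_t flags;
	uint16_t idx;
	uint16_t ring[VHOST_NDESCS];
};

/* Make n more rx buffers visible to the kernel and kick it.  As the
 * text explains, ring[] stays [0, 1, 2...] forever, so only idx moves,
 * possibly by more than one at a time. */
static int vring_avail_publish_and_kick(struct vring_avail_sketch *avail,
					uint16_t n, int kick_fd)
{
	uint64_t one = 1;

	/* Real code needs a memory barrier before this store, so the
	 * kernel never observes the new idx before the descriptors. */
	avail->idx = (uint16_t)(avail->idx + n);

	/* eventfd writes are 8-byte counters; any nonzero value kicks */
	return write(kick_fd, &one, sizeof(one)) == sizeof(one) ? 0 : -1;
}
```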
Once the kernel has written a frame into a descriptor, it writes the
descriptor's index into used_ring->ring[] and increments
used_ring->idx accordingly. Like avail->idx, the kernel can increment
it by more than one at a time. Pasta gets a notification on the call
eventfd, so we add it to the epoll context.
Pasta assumes buffers are used in order. QEMU also assumes this in the
virtio-net migration code, so it is safe.
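Draining the used ring on a call-eventfd wakeup can be sketched as
below. The structs mirror the used ring layout from
<linux/virtio_ring.h>; vring_used_drain and the shadow last_used index
are illustrative names, and the in-order assumption from the text is
what lets a single free-running index track progress.

```c
#include <assert.h>
#include <stdint.h>

#define VHOST_NDESCS 256 /* assumed vq size */

/* Minimal stand-ins for the used ring in <linux/virtio_ring.h> */
struct vring_used_elem_sketch {
	uint32_t id;  /* head descriptor index of the frame */
	uint32_t len; /* bytes the kernel wrote */
};

struct vring_used_sketch {
	uint16_t flags;
	uint16_t idx; /* free-running, incremented by the kernel */
	struct vring_used_elem_sketch ring[VHOST_NDESCS];
};

/* Consume frames completed since *last_used: record their lengths and
 * advance the shadow index.  idx may be ahead by more than one, so we
 * loop until we catch up (or run out of room in lens[]). */
static unsigned vring_used_drain(const struct vring_used_sketch *used,
				 uint16_t *last_used,
				 uint32_t *lens, unsigned max)
{
	unsigned n = 0;

	while (*last_used != used->idx && n < max) {
		lens[n++] = used->ring[*last_used % VHOST_NDESCS].len;
		(*last_used)++;
	}
	return n;
}
```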
Now, vhost-kernel is designed to read the virtqueues and the buffers
as *guest* physical addresses (GPA), not process virtual addresses
(HVA). QEMU communicates the translations through memory regions.
Since we don't have GPAs, let's just send memory regions that are a
1:1 translation of the HVAs.
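The 1:1 trick amounts to describing a region whose "guest physical"
base equals its userspace base, so every address vhost translates maps
to itself. A sketch, with a stand-in for struct vhost_memory_region
from <linux/vhost.h> (vhost_region_identity is an illustrative name):

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors struct vhost_memory_region from <linux/vhost.h> */
struct vhost_memory_region_sketch {
	uint64_t guest_phys_addr;
	uint64_t memory_size;
	uint64_t userspace_addr;
	uint64_t flags_padding;
};

/* Build an identity region: GPA == HVA, so vhost's GPA->HVA lookup
 * returns the address unchanged for anything inside [base, base+size) */
static struct vhost_memory_region_sketch
vhost_region_identity(const void *base, uint64_t size)
{
	struct vhost_memory_region_sketch r;

	r.guest_phys_addr = (uint64_t)(uintptr_t)base;
	r.memory_size = size;
	r.userspace_addr = (uint64_t)(uintptr_t)base;
	r.flags_padding = 0;
	return r;
}
```

In the real code, regions like this would be packed into a struct
vhost_memory and handed to the kernel with the VHOST_SET_MEM_TABLE
ioctl.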
TODO: Evaluate if we can reuse the tap fd code instead of making a new
epoll event type.
TODO: Split a new file for vhost (Stefano)
Signed-off-by: Eugenio Pérez
In case the kernel needs to signal an error.
Signed-off-by: Eugenio Pérez
vhost kernel expects this as the first data of the frame.
Signed-off-by: Eugenio Pérez
So vhost kernel is able to access the TCP buffers.
Signed-off-by: Eugenio Pérez
This way we can send one frame split across many buffers. The TCP
stack already works this way in pasta.
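Chaining descriptors into one frame is done with the .next field and
the F_NEXT flag. A minimal sketch (struct and macro names mirror
<linux/virtio_ring.h>; the wrap-around at VHOST_NDESCS is an
assumption about how consecutive slots are used):

```c
#include <assert.h>
#include <stdint.h>

#define VHOST_NDESCS 256      /* assumed vq size */
#define VRING_DESC_F_NEXT 1   /* from <linux/virtio_ring.h> */

/* Minimal stand-in for struct vring_desc */
struct vring_desc_sketch {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Link n consecutive descriptors into one frame: every slot except the
 * last sets F_NEXT and points .next at the following slot. */
static void vring_chain_frame(struct vring_desc_sketch *desc,
			      uint16_t first, uint16_t n)
{
	for (uint16_t i = 0; i < n; i++) {
		uint16_t d = (uint16_t)((first + i) % VHOST_NDESCS);

		if (i + 1 < n) {
			desc[d].flags = VRING_DESC_F_NEXT;
			desc[d].next = (uint16_t)((d + 1) % VHOST_NDESCS);
		} else {
			desc[d].flags = 0; /* end of chain */
			desc[d].next = 0;
		}
	}
}
```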
Signed-off-by: Eugenio Pérez
As it is the only function using it. I keep confusing it with
tcp_l2_iov; moving it here avoids that.
Signed-off-by: Eugenio Pérez
No users enable vhost right now, just defining the functions.
The use of the virtqueue is similar to the rx case: pasta fills the
descriptor table with the packet data it wants to send to the
namespace. Each descriptor points to a buffer in memory, with an
address and a length. The number of descriptors is again defined by
VHOST_NDESCS.
Afterwards it writes the descriptor index into the avail->ring[]
array, then increments avail->idx to make it visible to the kernel,
and then kicks the virtqueue 1 eventfd.
When the kernel does not need the buffer anymore, it writes its id
into used_ring->ring[] and increments used_ring->idx. Normally, the
kernel also notifies pasta through the call eventfd of virtqueue 1,
but we don't monitor that eventfd. Instead, we check whether we can
reuse the buffers just when we produce, making the code simpler and
more performant.
Like on the rx path, we assume descriptors are used in the same order
they were made available. This is also consistent with behavior seen in
QEMU's virtio-net implementation.
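Because both indexes are free-running 16-bit counters and buffers are
used in order, checking for reusable tx slots at produce time reduces
to index arithmetic. A sketch (vq_tx_free_slots is an illustrative
name; VHOST_NDESCS an assumed size):

```c
#include <assert.h>
#include <stdint.h>

#define VHOST_NDESCS 256 /* assumed vq size */

/* avail_idx and used_idx are free-running uint16_t counters; their
 * difference (mod 2^16) is the number of descriptors the kernel still
 * owns, so the remainder of the ring is free for pasta to refill.
 * This is what lets the tx path skip the call eventfd entirely. */
static uint16_t vq_tx_free_slots(uint16_t avail_idx, uint16_t used_idx)
{
	return (uint16_t)(VHOST_NDESCS - (uint16_t)(avail_idx - used_idx));
}
```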
Signed-off-by: Eugenio Pérez
As pasta cannot modify the TCP sent buffers until vhost-kernel is done
with them, we need a way to report to the caller which buffers can be
overwritten.
Let's start by following the same pattern as in tap write(2): wait
until pasta can overwrite the buffers. We can add async cleaning on
top.
Signed-off-by: Eugenio Pérez
The vhost-kernel module is async by nature: the driver (pasta) places
a few buffers in the virtqueue and the device (vhost-kernel) trusts
that the driver will not modify them until it has used them. This is
not possible to implement with TCP at the moment, as tcp_buf trusts it
can reuse the buffers as soon as tcp_payload_flush() finishes.
To achieve async behavior, let's make tcp_buf work with a circular
ring, so vhost can transmit at the same time pasta is queueing more
data. When a buffer is received from a TCP socket, the element is
placed in the ring and sock_head is moved:
[][][][]
^ ^
| |
| sock_head
|
tail
tap_head
When the data is sent to vhost through the tx queue, tap_head is moved
forward:
[][][][]
^ ^
| |
| sock_head
| tap_head
|
tail
Finally, the tail moves forward when vhost has used the tx buffers, so
tcp_payload (and all lower protocol buffers) can be reused.
[][][][]
^
|
sock_head
tap_head
tail
If queueing to the vhost virtqueue fails, sock_head moves backwards.
The only possible error is that the queue is full, as virtio-net does
not report success on packet sending.
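The three indexes above can be kept as free-running counters, with the
occupancy of each region derived by subtraction. A sketch under
assumed names (tcp_tx_ring_sketch, TCP_FRAMES_MEM as the ring size;
the real patch integrates these counts into tcp_buf):

```c
#include <assert.h>

#define TCP_FRAMES_MEM 128 /* assumed ring size */

/* Free-running counters; the gaps between them are the ring regions:
 *   tail .. tap_head       queued to vhost, not yet used
 *   tap_head .. sock_head  filled from sockets, not yet queued   */
struct tcp_tx_ring_sketch {
	unsigned sock_head;
	unsigned tap_head;
	unsigned tail;
};

/* Entries filled from sockets but not yet sent to the tx queue */
static unsigned ring_to_send(const struct tcp_tx_ring_sketch *r)
{
	return r->sock_head - r->tap_head;
}

/* Entries sitting in the vhost tx queue, not yet used */
static unsigned ring_in_flight(const struct tcp_tx_ring_sketch *r)
{
	return r->tap_head - r->tail;
}

/* Slots free for producing new data from sockets */
static unsigned ring_free(const struct tcp_tx_ring_sketch *r)
{
	return TCP_FRAMES_MEM - (r->sock_head - r->tail);
}

/* On a queue-full error, un-produce: sock_head moves backwards */
static void ring_unproduce(struct tcp_tx_ring_sketch *r, unsigned n)
{
	r->sock_head -= n;
}
```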
Starting as simple as possible, only implementing the count variables
in this patch so it keeps working as before. The circular behavior
will be added on top.
From ~16Gbit/s to ~13Gbit/s, compared with write(2) to the tap.
Signed-off-by: Eugenio Pérez
From ~13Gbit/s to ~11.5Gbit/s.
TODO: Maybe we can reuse epoll for this, not needing to introduce a new
syscall.
Signed-off-by: Eugenio Pérez
Now both tcp_sock and tap use the circular buffer as intended.
Very lightly tested. Especially paths like ring full or almost full,
which are checked before producing, like
tcp_payload_sock_used + fill_bufs > TCP_FRAMES_MEM.
Processing the tx buffers in a circular buffer makes namespace rx go
from ~11.5Gbit/s to ~17.26Gbit/s.
TODO: Increase the tx queue length, as we spend a lot of descriptors
on each request. Ideally, tx size should be at least
bufs_per_frame*TCP_FRAMES_MEM, but maybe we get more performance with
bigger queues.
TODO: Sometimes we call tcp_buf_free_old_tap_xmit twice: once to free
at least N used tx buffers, and again in tcp_payload_flush. Maybe we
can optimize it.
Signed-off-by: Eugenio Pérez
For reference, iperf without namespaces and over veth both reach
~47.8Gbit/s in both directions.