vhost-net is a kernel device that allows reading packets from a tap
device using virtio queues instead of regular read(2) and write(2).
This enables more efficient packet processing: the kernel can write
the memory directly to userspace and back, instead of wasting
bandwidth on copies, and many packets can be batched in a single
notification (through eventfds), both tx and rx.

Namespace tx performance improves from ~26.3Gbit/s to ~36.9Gbit/s.
Namespace rx performance improves from ~16Gbit/s to ~17.26Gbit/s.

RFC: At this moment only these are supported:
* Receive l2 packets from the vhost kernel to pasta
* Send l4 tcp socket received data through vhost-kernel to the
  namespace

TODO: Add vhost zerocopy in the tests, and compare with veth.
TODO: Implement at least UDP tx. Or maybe we want UDP to use write(2)
      because of latency?
TODO: Check style for variable declarations in for loops and use of
      curly brackets when they wrap more than a line.
TODO: kerneldoc style function header comments

--
v2: Added TCP tx, and integrated some comments from the previous
    series. Please check each patch message for details.

Eugenio Pérez (11):
  tap: implement vhost_call_cb
  tap: add die() on vhost error
  Replace tx tap hdr with virtio_nethdr_mrg_rxbuf
  tcp: export memory regions to vhost
  virtio: Fill .next in tx queue
  tap: move static iov_sock to tcp_buf_data_from_sock
  tap: support tx through vhost
  tap: add tap_free_old_xmit
  tcp: start conversion to circular buffer
  Add poll(2) to used_idx
  tcp_buf: adding TCP tx circular buffer

 arp.c        |   2 +-
 epoll_type.h |   4 +
 passt.c      |  12 +-
 passt.h      |  11 +-
 tap.c        | 489 +++++++++++++++++++++++++++++++++++++++++++++++++--
 tap.h        |  13 +-
 tcp.c        |   2 +-
 tcp_buf.c    | 179 +++++++++++++++----
 tcp_buf.h    |  19 ++
 udp.c        |   2 +-
 10 files changed, 675 insertions(+), 58 deletions(-)

--
2.50.0
Implement the rx side of vhost-kernel, where the namespace sends
packets through its tap device and the kernel passes them to
pasta using a virtqueue. The virtqueue is built using a
virtual ring (vring), which includes:
* A descriptor table (buffer info)
* An available ring (buffers ready for the kernel)
* A used ring (buffers that the kernel has filled)
The descriptor table holds an array of buffers defined by address and
length. The kernel writes the packets transmitted by the namespace into
these buffers. The number of descriptors (vq size) is set by
VHOST_NDESCS. Pasta fills this table using pkt_buf, splitting it
evenly across all descriptors. This table is read-only for the kernel.
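The even split of pkt_buf across the descriptor table can be sketched
as follows. This is a minimal, self-contained sketch: vring_desc_sketch
mirrors struct vring_desc from <linux/virtio_ring.h>, and VHOST_NDESCS
and the 2048-byte per-descriptor chunk are illustrative assumptions,
not pasta's actual values.

```c
#include <assert.h>
#include <stdint.h>

#define VHOST_NDESCS 256       /* assumed vq size */
#define VHOST_BUF_BYTES 2048UL /* assumed per-descriptor chunk */
#define VRING_DESC_F_WRITE 2   /* device (kernel) writes this buffer */

/* Minimal stand-in for struct vring_desc from <linux/virtio_ring.h> */
struct vring_desc_sketch {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Split pkt_buf evenly across all rx descriptors: descriptor i covers
 * bytes [i * chunk, (i + 1) * chunk) of the buffer.  The kernel will
 * write received frames into these buffers, so F_WRITE is set. */
static void vring_fill_rx_descs(struct vring_desc_sketch *desc,
				const unsigned char *pkt_buf)
{
	for (unsigned i = 0; i < VHOST_NDESCS; i++) {
		desc[i].addr =
			(uint64_t)(uintptr_t)(pkt_buf + i * VHOST_BUF_BYTES);
		desc[i].len = VHOST_BUF_BYTES;
		desc[i].flags = VRING_DESC_F_WRITE;
		desc[i].next = 0;
	}
}
```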
The available ring is where pasta marks which buffers the kernel can
use. It is read-only for the kernel. It includes a ring[] array with
the descriptor indexes and an avail->idx index. Pasta increments
avail->idx when a new buffer is added, modulo the size of the
virtqueue. As pasta writes the rx buffers sequentially, ring[] is
always [0, 1, 2...] and only avail->idx is incremented when new buffers
are available for the kernel. avail->idx can be incremented by more
than one at a time.
Pasta also notifies the kernel of new available buffers by writing to
the kick eventfd.
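Publishing new buffers and kicking the kernel can be sketched like
this, assuming a Linux eventfd as the kick mechanism (as vhost uses).
The struct is a stand-in for the avail ring layout; the barrier note
marks what real code must add.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

#define VHOST_NDESCS 256 /* assumed vq size */

/* Minimal stand-in for struct vring_avail from <linux/virtio_ring.h> */
struct vring_avail_sketch {
	uint16_t flags;
	uint16_t idx;
	uint16_t ring[VHOST_NDESCS];
};

/* Make n more rx buffers visible to the kernel and kick it.  As the
 * text explains, ring[] stays [0, 1, 2...] forever, so only idx moves,
 * possibly by more than one at a time. */
static int vring_avail_publish_and_kick(struct vring_avail_sketch *avail,
					uint16_t n, int kick_fd)
{
	uint64_t one = 1;

	/* Real code needs a memory barrier before this store, so the
	 * kernel never observes the new idx before the descriptors. */
	avail->idx = (uint16_t)(avail->idx + n);

	/* eventfd writes are 8-byte counters; any nonzero value kicks */
	return write(kick_fd, &one, sizeof(one)) == sizeof(one) ? 0 : -1;
}
```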
Once the kernel has written a frame into a descriptor, it writes the
descriptor's index into used_ring->ring[] and increments
used_ring->idx accordingly. Like avail->idx, the kernel can increment
it by more than one at a time. Pasta gets a notification on the call
eventfd, so we add it to the epoll context.
Pasta assumes buffers are used in order. QEMU also assumes this in the
virtio-net migration code, so it is safe.
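Draining the used ring on a call-eventfd wakeup can be sketched as
below. The structs mirror the used ring layout from
<linux/virtio_ring.h>; vring_used_drain and the shadow last_used index
are illustrative names, and the in-order assumption from the text is
what lets a single free-running index track progress.

```c
#include <assert.h>
#include <stdint.h>

#define VHOST_NDESCS 256 /* assumed vq size */

/* Minimal stand-ins for the used ring in <linux/virtio_ring.h> */
struct vring_used_elem_sketch {
	uint32_t id;  /* head descriptor index of the frame */
	uint32_t len; /* bytes the kernel wrote */
};

struct vring_used_sketch {
	uint16_t flags;
	uint16_t idx; /* free-running, incremented by the kernel */
	struct vring_used_elem_sketch ring[VHOST_NDESCS];
};

/* Consume frames completed since *last_used: record their lengths and
 * advance the shadow index.  idx may be ahead by more than one, so we
 * loop until we catch up (or run out of room in lens[]). */
static unsigned vring_used_drain(const struct vring_used_sketch *used,
				 uint16_t *last_used,
				 uint32_t *lens, unsigned max)
{
	unsigned n = 0;

	while (*last_used != used->idx && n < max) {
		lens[n++] = used->ring[*last_used % VHOST_NDESCS].len;
		(*last_used)++;
	}
	return n;
}
```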
Now, vhost-kernel is designed to read the virtqueues and the buffers
as *guest* physical addresses (GPA), not process virtual addresses
(HVA). QEMU communicates the translations through memory regions.
Since we don't have GPAs, let's just send memory regions that are a
1:1 translation of the HVAs.
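The 1:1 trick amounts to describing a region whose "guest physical"
base equals its userspace base, so every address vhost translates maps
to itself. A sketch, with a stand-in for struct vhost_memory_region
from <linux/vhost.h> (vhost_region_identity is an illustrative name):

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors struct vhost_memory_region from <linux/vhost.h> */
struct vhost_memory_region_sketch {
	uint64_t guest_phys_addr;
	uint64_t memory_size;
	uint64_t userspace_addr;
	uint64_t flags_padding;
};

/* Build an identity region: GPA == HVA, so vhost's GPA->HVA lookup
 * returns the address unchanged for anything inside [base, base+size) */
static struct vhost_memory_region_sketch
vhost_region_identity(const void *base, uint64_t size)
{
	struct vhost_memory_region_sketch r;

	r.guest_phys_addr = (uint64_t)(uintptr_t)base;
	r.memory_size = size;
	r.userspace_addr = (uint64_t)(uintptr_t)base;
	r.flags_padding = 0;
	return r;
}
```

In the real code, regions like this would be packed into a struct
vhost_memory and handed to the kernel with the VHOST_SET_MEM_TABLE
ioctl.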
TODO: Evaluate if we can reuse the tap fd code instead of making a new
epoll event type.
TODO: Split a new file for vhost (Stefano)
Signed-off-by: Eugenio Pérez
In case the kernel needs to signal an error.
Signed-off-by: Eugenio Pérez
vhost kernel expects this as the first data of the frame.
Signed-off-by: Eugenio Pérez
So vhost kernel is able to access the TCP buffers.
Signed-off-by: Eugenio Pérez
This way we can send one frame split across many buffers. The TCP
stack already works this way in pasta.
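Chaining descriptors into one frame is done with the .next field and
the F_NEXT flag. A minimal sketch (struct and macro names mirror
<linux/virtio_ring.h>; the wrap-around at VHOST_NDESCS is an
assumption about how consecutive slots are used):

```c
#include <assert.h>
#include <stdint.h>

#define VHOST_NDESCS 256      /* assumed vq size */
#define VRING_DESC_F_NEXT 1   /* from <linux/virtio_ring.h> */

/* Minimal stand-in for struct vring_desc */
struct vring_desc_sketch {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/* Link n consecutive descriptors into one frame: every slot except the
 * last sets F_NEXT and points .next at the following slot. */
static void vring_chain_frame(struct vring_desc_sketch *desc,
			      uint16_t first, uint16_t n)
{
	for (uint16_t i = 0; i < n; i++) {
		uint16_t d = (uint16_t)((first + i) % VHOST_NDESCS);

		if (i + 1 < n) {
			desc[d].flags = VRING_DESC_F_NEXT;
			desc[d].next = (uint16_t)((d + 1) % VHOST_NDESCS);
		} else {
			desc[d].flags = 0; /* end of chain */
			desc[d].next = 0;
		}
	}
}
```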
Signed-off-by: Eugenio Pérez
As it is the only function using it. I keep confusing it with
tcp_l2_iov; moving it here avoids that.
Signed-off-by: Eugenio Pérez
No users enable vhost right now, just defining the functions.
The use of the virtqueue is similar to the rx case: pasta fills the
descriptor table with the packet data it wants to send to the
namespace. Each descriptor points to a buffer in memory, with an
address and a length. The number of descriptors is again defined by
VHOST_NDESCS.
Afterwards it writes the descriptor index into the avail->ring[]
array, then increments avail->idx to make it visible to the kernel,
and then kicks the virtqueue 1 eventfd.
When the kernel does not need the buffer anymore, it writes its id
into used_ring->ring[] and increments used_ring->idx. Normally, the
kernel also notifies pasta through the call eventfd of virtqueue 1,
but we don't monitor that eventfd. Instead, we check whether we can
reuse the buffers just when we produce, making the code simpler and
more performant.
Like on the rx path, we assume descriptors are used in the same order
they were made available. This is also consistent with behavior seen in
QEMU's virtio-net implementation.
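Because both indexes are free-running 16-bit counters and buffers are
used in order, checking for reusable tx slots at produce time reduces
to index arithmetic. A sketch (vq_tx_free_slots is an illustrative
name; VHOST_NDESCS an assumed size):

```c
#include <assert.h>
#include <stdint.h>

#define VHOST_NDESCS 256 /* assumed vq size */

/* avail_idx and used_idx are free-running uint16_t counters; their
 * difference (mod 2^16) is the number of descriptors the kernel still
 * owns, so the remainder of the ring is free for pasta to refill.
 * This is what lets the tx path skip the call eventfd entirely. */
static uint16_t vq_tx_free_slots(uint16_t avail_idx, uint16_t used_idx)
{
	return (uint16_t)(VHOST_NDESCS - (uint16_t)(avail_idx - used_idx));
}
```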
Signed-off-by: Eugenio Pérez
As pasta cannot modify the TCP sent buffers until vhost-kernel is done
with them, we need a way to report to the caller which buffers can be
overwritten.
Let's start by following the same pattern as in tap write(2): wait
until pasta can overwrite the buffers. We can add async cleaning on
top.
Signed-off-by: Eugenio Pérez
The vhost-kernel module is async by nature: the driver (pasta) places
a few buffers in the virtqueue and the device (vhost-kernel) trusts
that the driver will not modify them until it has used them. This is
not possible to implement with TCP at the moment, as tcp_buf trusts it
can reuse the buffers as soon as tcp_payload_flush() finishes.
To achieve async behavior, let's make tcp_buf work with a circular
ring, so vhost can transmit at the same time pasta is queueing more
data. When a buffer is received from a TCP socket, the element is
placed in the ring and sock_head is moved:
[][][][]
^ ^
| |
| sock_head
|
tail
tap_head
When the data is sent to vhost through the tx queue, tap_head is moved
forward:
[][][][]
^ ^
| |
| sock_head
| tap_head
|
tail
Finally, the tail moves forward when vhost has used the tx buffers, so
tcp_payload (and all lower protocol buffers) can be reused.
[][][][]
^
|
sock_head
tap_head
tail
If queueing to the vhost virtqueue fails, sock_head moves backwards.
The only possible error is that the queue is full, as virtio-net does
not report success on packet sending.
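The three indexes above can be kept as free-running counters, with the
occupancy of each region derived by subtraction. A sketch under
assumed names (tcp_tx_ring_sketch, TCP_FRAMES_MEM as the ring size;
the real patch integrates these counts into tcp_buf):

```c
#include <assert.h>

#define TCP_FRAMES_MEM 128 /* assumed ring size */

/* Free-running counters; the gaps between them are the ring regions:
 *   tail .. tap_head       queued to vhost, not yet used
 *   tap_head .. sock_head  filled from sockets, not yet queued   */
struct tcp_tx_ring_sketch {
	unsigned sock_head;
	unsigned tap_head;
	unsigned tail;
};

/* Entries filled from sockets but not yet sent to the tx queue */
static unsigned ring_to_send(const struct tcp_tx_ring_sketch *r)
{
	return r->sock_head - r->tap_head;
}

/* Entries sitting in the vhost tx queue, not yet used */
static unsigned ring_in_flight(const struct tcp_tx_ring_sketch *r)
{
	return r->tap_head - r->tail;
}

/* Slots free for producing new data from sockets */
static unsigned ring_free(const struct tcp_tx_ring_sketch *r)
{
	return TCP_FRAMES_MEM - (r->sock_head - r->tail);
}

/* On a queue-full error, un-produce: sock_head moves backwards */
static void ring_unproduce(struct tcp_tx_ring_sketch *r, unsigned n)
{
	r->sock_head -= n;
}
```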
Starting as simple as possible, only implementing the count variables
in this patch so it keeps working as before. The circular behavior
will be added on top.
From ~16Gbit/s to ~13Gbit/s, compared with write(2) to the tap.
Signed-off-by: Eugenio Pérez
From ~13Gbit/s to ~11.5Gbit/s.
TODO: Maybe we can reuse epoll for this, not needing to introduce a new
syscall.
Signed-off-by: Eugenio Pérez
Now both tcp_sock and tap use the circular buffer as intended.
Very lightly tested. Especially paths like ring full or almost full,
which are checked before producing, like
tcp_payload_sock_used + fill_bufs > TCP_FRAMES_MEM.
Processing the tx buffers in a circular buffer makes namespace rx go
from ~11.5Gbit/s to ~17.26Gbit/s.
TODO: Increase the tx queue length, as we spend a lot of descriptors
on each request. Ideally, tx size should be at least
bufs_per_frame*TCP_FRAMES_MEM, but maybe we get more performance with
bigger queues.
TODO: Sometimes we call tcp_buf_free_old_tap_xmit twice: once to free
at least N used tx buffers, and again in tcp_payload_flush. Maybe we
can optimize it.
Signed-off-by: Eugenio Pérez
For reference, iperf without namespaces and over veth both reach
~47.8Gbit/s in both directions.