[PATCH 00/10] vhost-user: Preparatory series for multiple iovec entries per virtqueue element
Currently, the vhost-user path assumes each virtqueue element contains exactly one iovec entry covering the entire frame. This assumption breaks as some virtio-net drivers (notably iPXE) provide descriptors where the vnet header and the frame payload are in separate buffers, resulting in two iovec entries per virtqueue element.

This series refactors the vhost-user data path so that frame lengths, header sizes, and padding are tracked and passed explicitly rather than derived from iovec sizes. This decoupling is a prerequisite for correctly handling padding of multi-buffer frames.

The changes in this series can be split into three groups:

- New iov helpers (patches 1-2): iov_memset() and iov_memcopy() operate
  across iovec boundaries. These are needed by the final patch to pad
  and copy frame data when a frame spans multiple iovec entries.

- Structural refactoring (patches 3-5): Move vnethdr setup into
  vu_flush(), separate virtqueue management from socket I/O in the UDP
  path, and pass iov arrays explicitly instead of using file-scoped
  state. These changes make it possible to pass explicit frame lengths
  through the stack, which is required to pad frames independently of
  iovec layout.

- Explicit length passing throughout the stack (patches 6-10): Thread
  explicit L4, L2, frame, and data lengths through checksum, pcap,
  vu_flush(), and tcp_fill_headers(), replacing lengths that were
  previously derived from iovec sizes. With lengths tracked explicitly,
  the final patch can centralise Ethernet frame padding into
  vu_collect() and a new vu_pad() helper that correctly pads frames
  spanning multiple iovec entries.
Laurent Vivier (10):
  iov: Introduce iov_memset()
  iov: Add iov_memcopy() to copy data between iovec arrays
  vu_common: Move vnethdr setup into vu_flush()
  udp_vu: Move virtqueue management from udp_vu_sock_recv() to its caller
  udp_vu: Pass iov explicitly to helpers instead of using file-scoped array
  checksum: Pass explicit L4 length to checksum functions
  pcap: Pass explicit L2 length to pcap_iov()
  vu_common: Pass explicit frame length to vu_flush()
  tcp: Pass explicit data length to tcp_fill_headers()
  vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()

 checksum.c     |  35 ++++++-----
 checksum.h     |   6 +-
 iov.c          |  78 ++++++++++++++++++++++++
 iov.h          |   5 ++
 pcap.c         |  37 ++++++++---
 pcap.h         |   2 +-
 tap.c          |   6 +-
 tcp.c          |  14 +++--
 tcp_buf.c      |   3 +-
 tcp_internal.h |   2 +-
 tcp_vu.c       |  62 +++++++++----------
 udp.c          |   5 +-
 udp_vu.c       | 162 ++++++++++++++++++++++++++-----------------------
 vu_common.c    |  58 ++++++++++--------
 vu_common.h    |   5 +-
 15 files changed, 304 insertions(+), 176 deletions(-)

-- 
2.53.0
Add a helper to set a range of bytes across an IO vector to a given
value, similar to memset() but operating over scatter-gather buffers.
It skips to the given offset and fills across iovec entries up to the
requested length.
Signed-off-by: Laurent Vivier
Add a helper to copy data from a source iovec array to a destination
iovec array, each starting at an arbitrary byte offset, iterating
through both arrays simultaneously and copying in chunks matching the
smaller of the two current segments.
Signed-off-by: Laurent Vivier
Every caller of vu_flush() was calling vu_set_vnethdr() beforehand with
the same pattern. Move it into vu_flush().
Remove vu_queue_notify() from vu_flush() and let callers invoke it
explicitly. This allows paths that perform multiple flushes, such as
tcp_vu_send_flag() and tcp_vu_data_from_sock(), to issue a single guest
notification at the end.
Signed-off-by: Laurent Vivier
udp_vu_sock_recv() currently mixes two concerns: receiving data from the
socket and managing virtqueue buffers (collecting, rewinding, releasing).
This makes the function harder to reason about and couples socket I/O
with virtqueue state.
Move all virtqueue operations, vu_collect(), vu_init_elem(),
vu_queue_rewind(), vu_set_vnethdr(), and the queue-readiness check, into
udp_vu_sock_to_tap(), which is the only caller. This turns
udp_vu_sock_recv() into a pure socket receive function that simply reads
into the provided iov array and adjusts its length.
Signed-off-by: Laurent Vivier
udp_vu_sock_recv(), udp_vu_prepare(), and udp_vu_csum() all operated on
the file-scoped iov_vu[] array directly. Pass iov and count as explicit
parameters instead, and move iov_vu[] and elem[] to function-local
statics in udp_vu_sock_to_tap(), the only function that needs them.
Signed-off-by: Laurent Vivier
The iov_tail passed to csum_iov_tail() may contain padding or trailing
data beyond the actual L4 payload. Rather than relying on
iov_tail_size() to determine how many bytes to checksum, pass the
length explicitly so that only the relevant payload bytes are included
in the checksum computation.
Signed-off-by: Laurent Vivier
With vhost-user multibuffer frames, the iov can be larger than the
actual L2 frame. The previous approach of computing L2 length as
iov_size() - offset would overcount and write extra bytes into the
pcap file.
Pass the L2 frame length explicitly to pcap_frame() and pcap_iov(),
and write exactly that many bytes instead of the full iov remainder.
Signed-off-by: Laurent Vivier
Currently vu_flush() derives the frame size from the iov, but in
preparation for iov arrays that may be larger than the actual frame,
pass the total length (including vnet header) explicitly so that only
the relevant portion is reported to the virtqueue.
Ensure a minimum frame size of ETH_ZLEN + VNET_HLEN to handle short
frames. All elements are still flushed to avoid descriptor leaks,
but trailing elements beyond frame_len will report a zero length.
Signed-off-by: Laurent Vivier
tcp_fill_headers() computed the TCP payload length from iov_tail_size(),
but with vhost-user multibuffer frames, the iov_tail will be larger than
the actual data. Pass the data length explicitly so that IP total
length, pseudo-header, and checksum computations use the correct value.
Signed-off-by: Laurent Vivier
The previous per-protocol padding done by vu_pad() in tcp_vu.c and
udp_vu.c was only correct for single-buffer frames: it assumed the
padding area always fell within the first iov, so its plain memset()
wrote past that buffer's end whenever it did not.
It also required each caller to compute MAX(..., ETH_ZLEN + VNET_HLEN)
for vu_collect() and to call vu_pad() at the right point, duplicating
the minimum-size logic across protocols.
Move the Ethernet minimum size enforcement into vu_collect() itself, so
that enough buffer space is always reserved for padding regardless of
the requested frame size.
Rewrite vu_pad() to take a full iovec array and use iov_memset(),
making it safe for multi-buffer (mergeable rx buffer) frames.
In tcp_vu_sock_recv(), replace iov_truncate() with iov_skip_bytes():
now that all consumers receive explicit data lengths, truncating the
iovecs is no longer needed. In tcp_vu_data_from_sock(), cap each
frame's data length against the remaining bytes actually received from
the socket, so that the last partial frame gets correct headers and
sequence number advancement.
Signed-off-by: Laurent Vivier
Nits, all about names:
On Wed, 1 Apr 2026 21:18:18 +0200
Laurent Vivier
Add a helper to copy data from a source iovec array to a destination iovec array, each starting at an arbitrary byte offset, iterating through both arrays simultaneously and copying in chunks matching the smaller of the two current segments.
Signed-off-by: Laurent Vivier
---
 iov.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 iov.h |  3 +++
 2 files changed, 55 insertions(+)

diff --git a/iov.c b/iov.c
index 0188acdf5eba..83b683f3976a 100644
--- a/iov.c
+++ b/iov.c
@@ -197,6 +197,58 @@ void iov_memset(const struct iovec *iov, size_t iov_cnt, size_t offset, int c,
 	}
 }
+/**
+ * iov_memcopy() - Copy data between two iovec arrays
Wouldn't it be less surprising to call this iov_memcpy(), like memcpy()?
+ * @dst_iov:	Destination iovec array
+ * @dst_iov_cnt: Number of elements in destination iovec array
+ * @dst_offs:	Destination offset
+ * @iov:	Source iovec array
+ * @iov_cnt:	Number of elements in source iovec array
I think @src_iov and @src_iov_cnt would make the whole function easier to follow and look more symmetric.
+ * @offs: Source offset
What about @dst_offset and @src_offset? "offs" as a short-form for "offset" isn't really obvious (to me at least).
+ * @length:	Number of bytes to copy
+ *
+ * Return: total number of bytes copied
+ */
+/* cppcheck-suppress unusedFunction */
+size_t iov_memcopy(struct iovec *dst_iov, size_t dst_iov_cnt, size_t dst_offs,
+		   const struct iovec *iov, size_t iov_cnt, size_t offs,
+		   size_t length)
+{
+	unsigned int i, j;
+	size_t total = 0;
+
+	i = iov_skip_bytes(iov, iov_cnt, offs, &offs);
+	j = iov_skip_bytes(dst_iov, dst_iov_cnt, dst_offs, &dst_offs);
+
+	/* copying data */
+	while (length && i < iov_cnt && j < dst_iov_cnt) {
+		size_t n = MIN(dst_iov[j].iov_len - dst_offs,
+			       iov[i].iov_len - offs);
+
+		if (n > length)
+			n = length;
+
+		memcpy((char *)dst_iov[j].iov_base + dst_offs,
+		       (const char *)iov[i].iov_base + offs, n);
+
+		dst_offs += n;
+		offs += n;
+		total += n;
+		length -= n;
+
+		if (dst_offs == dst_iov[j].iov_len) {
+			dst_offs = 0;
+			j++;
+		}
+		if (offs == iov[i].iov_len) {
+			offs = 0;
+			i++;
+		}
+	}
+
+	return total;
+}
+
 /**
  * iov_tail_prune() - Remove any unneeded buffers from an IOV tail
  * @tail:	IO vector tail (modified)
diff --git a/iov.h b/iov.h
index d295d05b3bab..074266e127ef 100644
--- a/iov.h
+++ b/iov.h
@@ -32,6 +32,9 @@ size_t iov_size(const struct iovec *iov, size_t iov_cnt);
 size_t iov_truncate(struct iovec *iov, size_t iov_cnt, size_t size);
 void iov_memset(const struct iovec *iov, size_t iov_cnt, size_t offset, int c,
 		size_t length);
+size_t iov_memcopy(struct iovec *dst_iov, size_t dst_iov_cnt, size_t dst_offs,
+		   const struct iovec *iov, size_t iov_cnt, size_t offs,
+		   size_t length);

 /*
  * DOC: Theory of Operation, struct iov_tail

-- 
Stefano
On Wed, 1 Apr 2026 21:18:19 +0200
Laurent Vivier
Every caller of vu_flush() was calling vu_set_vnethdr() beforehand with the same pattern. Move it into vu_flush().
Remove vu_queue_notify() from vu_flush() and let callers invoke it explicitly. This allows paths that perform multiple flushes, such as tcp_vu_send_flag() and tcp_vu_data_from_sock(), to issue a single guest notification at the end.
Signed-off-by: Laurent Vivier
---
 tcp_vu.c    | 19 ++++++++-----------
 udp_vu.c    |  3 +--
 vu_common.c |  9 +++++----
 vu_common.h |  1 -
 4 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/tcp_vu.c b/tcp_vu.c
index dc0e17c0f03f..0cd01190d612 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -82,7 +82,6 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	struct ethhdr *eh;
 	uint32_t seq;
 	int elem_cnt;
-	int nb_ack;
 	int ret;

 	hdrlen = tcp_vu_hdrlen(CONN_V6(conn));
@@ -97,8 +96,6 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	assert(flags_elem[0].in_sg[0].iov_len >=
 	       MAX(hdrlen + sizeof(*opts), ETH_ZLEN + VNET_HLEN));

-	vu_set_vnethdr(flags_elem[0].in_sg[0].iov_base, 1);
-
 	eh = vu_eth(flags_elem[0].in_sg[0].iov_base);

 	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
@@ -143,9 +140,10 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	l2len = optlen + hdrlen - VNET_HLEN;
 	vu_pad(&flags_elem[0].in_sg[0], l2len);

+	vu_flush(vdev, vq, flags_elem, 1);
+
 	if (*c->pcap)
 		pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN);
-	nb_ack = 1;

 	if (flags & DUP_ACK) {
 		elem_cnt = vu_collect(vdev, vq, &flags_elem[1], 1,
@@ -157,14 +155,14 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 			memcpy(flags_elem[1].in_sg[0].iov_base,
 			       flags_elem[0].in_sg[0].iov_base,
 			       flags_elem[0].in_sg[0].iov_len);
-			nb_ack++;
+
+			vu_flush(vdev, vq, &flags_elem[1], 1);

 			if (*c->pcap)
 				pcap_iov(&flags_elem[1].in_sg[0], 1, VNET_HLEN);
 		}
 	}
-
-	vu_flush(vdev, vq, flags_elem, nb_ack);
+	vu_queue_notify(vdev, vq);

 	return 0;
 }
@@ -451,7 +449,6 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 		assert(frame_size >= hdrlen);

 		dlen = frame_size - hdrlen;
-		vu_set_vnethdr(iov->iov_base, buf_cnt);

 		/* The IPv4 header checksum varies only with dlen */
 		if (previous_dlen != dlen)
@@ -464,14 +461,14 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 		l2len = dlen + hdrlen - VNET_HLEN;
 		vu_pad(iov, l2len);

+		vu_flush(vdev, vq, &elem[head[i]], buf_cnt);
+
 		if (*c->pcap)
 			pcap_iov(iov, buf_cnt, VNET_HLEN);

 		conn->seq_to_tap += dlen;
 	}
-
-	/* send packets */
-	vu_flush(vdev, vq, elem, iov_cnt);
+	vu_queue_notify(vdev, vq);
If I understand correctly, this makes iov_cnt set by tcp_vu_sock_recv() redundant because you're now calling vu_flush() from inside the loop that iterates up to head_cnt (which is the only count we need to know). Maybe it would be clearer to have a patch in this series dropping iov_cnt as argument altogether (especially the commit message of that kind of change might make things clearer).
conn_flag(c, conn, ACK_FROM_TAP_DUE);
diff --git a/udp_vu.c b/udp_vu.c
index cc69654398f0..f8629af58ab5 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -124,8 +124,6 @@ static int udp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq, int s,
 	l2len = *dlen + hdrlen - VNET_HLEN;
 	vu_pad(&iov_vu[0], l2len);

-	vu_set_vnethdr(iov_vu[0].iov_base, elem_used);
-
 	/* release unused buffers */
 	vu_queue_rewind(vq, elem_cnt - elem_used);

@@ -230,6 +228,7 @@ void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx)
 				pcap_iov(iov_vu, iov_used, VNET_HLEN);
 			}
 			vu_flush(vdev, vq, elem, iov_used);
+			vu_queue_notify(vdev, vq);
 		}
 	}
 }
diff --git a/vu_common.c b/vu_common.c
index 13b1e51001d4..57949ca32309 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -118,7 +118,8 @@ int vu_collect(const struct vu_dev *vdev, struct vu_virtq *vq,
  * @vnethdr:	Address of the header to set
  * @num_buffers: Number of guest buffers of the frame
  */
-void vu_set_vnethdr(struct virtio_net_hdr_mrg_rxbuf *vnethdr, int num_buffers)
+static void vu_set_vnethdr(struct virtio_net_hdr_mrg_rxbuf *vnethdr,
+			   int num_buffers)
 {
 	vnethdr->hdr = VU_HEADER;
 	/* Note: if VIRTIO_NET_F_MRG_RXBUF is not negotiated,
@@ -139,6 +140,8 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 {
 	int i;

+	vu_set_vnethdr(elem[0].in_sg[0].iov_base, elem_cnt);
+
 	for (i = 0; i < elem_cnt; i++) {
 		size_t elem_size = iov_size(elem[i].in_sg, elem[i].in_num);

@@ -146,7 +149,6 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 	}

 	vu_queue_flush(vdev, vq, elem_cnt);
-	vu_queue_notify(vdev, vq);
 }

 /**
@@ -260,8 +262,6 @@ int vu_send_single(const struct ctx *c, const void *buf, size_t size)
 		goto err;
 	}

-	vu_set_vnethdr(in_sg[0].iov_base, elem_cnt);
-
 	total -= VNET_HLEN;

 	/* copy data from the buffer to the iovec */
@@ -271,6 +271,7 @@ int vu_send_single(const struct ctx *c, const void *buf, size_t size)
 		pcap_iov(in_sg, in_total, VNET_HLEN);

 	vu_flush(vdev, vq, elem, elem_cnt);
+	vu_queue_notify(vdev, vq);

 	trace("vhost-user sent %zu", total);

diff --git a/vu_common.h b/vu_common.h
index 7b060eb6184f..4037ab765b7d 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -39,7 +39,6 @@ int vu_collect(const struct vu_dev *vdev, struct vu_virtq *vq,
 	       struct vu_virtq_element *elem, int max_elem,
 	       struct iovec *in_sg, size_t max_in_sg, size_t *in_total,
 	       size_t size, size_t *collected);
-void vu_set_vnethdr(struct virtio_net_hdr_mrg_rxbuf *vnethdr, int num_buffers);
 void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 	      struct vu_virtq_element *elem, int elem_cnt);
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,

-- 
Stefano
On Wed, 1 Apr 2026 21:18:23 +0200
Laurent Vivier
With vhost-user multibuffer frames, the iov can be larger than the actual L2 frame. The previous approach of computing L2 length as iov_size() - offset would overcount and write extra bytes into the pcap file.
Pass the L2 frame length explicitly to pcap_frame() and pcap_iov(), and write exactly that many bytes instead of the full iov remainder.
Signed-off-by: Laurent Vivier
---
 pcap.c      | 37 ++++++++++++++++++++++++++++---------
 pcap.h      |  2 +-
 tap.c       |  2 +-
 tcp_vu.c    |  9 ++++++---
 udp_vu.c    |  4 +++-
 vu_common.c |  2 +-
 6 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/pcap.c b/pcap.c
index a026f17e7974..dfe1c61add9a 100644
--- a/pcap.c
+++ b/pcap.c
@@ -52,22 +52,38 @@ struct pcap_pkthdr {
  * @iov:	IO vector containing frame (with L2 headers and tap headers)
  * @iovcnt:	Number of buffers (@iov entries) in frame
  * @offset:	Byte offset of the L2 headers within @iov
+ * @l2len:	Length of L2 frame data to capture
  * @now:	Timestamp
  */
 static void pcap_frame(const struct iovec *iov, size_t iovcnt,
-		       size_t offset, const struct timespec *now)
+		       size_t offset, size_t l2len, const struct timespec *now)
 {
-	size_t l2len = iov_size(iov, iovcnt) - offset;
 	struct pcap_pkthdr h = {
 		.tv_sec = now->tv_sec,
 		.tv_usec = DIV_ROUND_CLOSEST(now->tv_nsec, 1000),
 		.caplen = l2len,
 		.len = l2len
 	};
+	size_t i;

-	if (write_all_buf(pcap_fd, &h, sizeof(h)) < 0 ||
-	    write_remainder(pcap_fd, iov, iovcnt, offset) < 0)
-		debug_perror("Cannot log packet, length %zu", l2len);
+	if (write_all_buf(pcap_fd, &h, sizeof(h)) < 0) {
+		debug_perror("Cannot log packet, packet header error");
+		return;
+	}
+
+	for (i = iov_skip_bytes(iov, iovcnt, offset, &offset);
+	     i < iovcnt && l2len; i++) {
+		size_t n = MIN(l2len, iov[i].iov_len - offset);
+
+		if (write_all_buf(pcap_fd, (char *)iov[i].iov_base + offset,
+				  n) < 0) {
This could be written as:

		size_t l = MIN(l2len, iov[i].iov_len - offset), n;

		n = write_all_buf(pcap_fd, (char *)iov[i].iov_base + offset, l);
		if (n < 0) {

same number of lines, but slightly clearer I think.
+ debug_perror("Cannot log packet");
Should we add back the ", length %zu" part, perhaps, using this 'l' variable? I think it makes a difference if we see we're failing to capture a reasonably sized 64k packet (disk full or something) compared to a 100G-length buffer coming from some calculation gone wrong.
+			return;
+		}
+
+		offset = 0;
+		l2len -= n;
+	}
 }
 /**
@@ -87,7 +103,7 @@ void pcap(const char *pkt, size_t l2len)
 	if (clock_gettime(CLOCK_REALTIME, &now))
 		err_perror("Failed to get CLOCK_REALTIME time");

-	pcap_frame(&iov, 1, 0, &now);
+	pcap_frame(&iov, 1, 0, l2len, &now);
 }

 /**
@@ -110,7 +126,9 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
 		err_perror("Failed to get CLOCK_REALTIME time");

 	for (i = 0; i < n; i++)
-		pcap_frame(iov + i * frame_parts, frame_parts, offset, &now);
+		pcap_frame(iov + i * frame_parts, frame_parts, offset,
Nit: curly brackets for coding style.
+			   iov_size(iov + i * frame_parts, frame_parts) - offset,
+			   &now);
 }
 /**
@@ -120,8 +138,9 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
  *		containing packet data to write, including L2 header
  * @iovcnt:	Number of buffers (@iov entries)
  * @offset:	Offset of the L2 frame within the full data length
+ * @l2len:	Length of L2 frame data to capture
  */
-void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset)
+void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset, size_t l2len)
 {
 	struct timespec now = { 0 };

@@ -131,7 +150,7 @@ void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset)
 	if (clock_gettime(CLOCK_REALTIME, &now))
 		err_perror("Failed to get CLOCK_REALTIME time");

-	pcap_frame(iov, iovcnt, offset, &now);
+	pcap_frame(iov, iovcnt, offset, l2len, &now);
 }

 /**
diff --git a/pcap.h b/pcap.h
index dface5df4ee6..c171257cbd73 100644
--- a/pcap.h
+++ b/pcap.h
@@ -13,7 +13,7 @@ extern int pcap_fd;
 void pcap(const char *pkt, size_t l2len);
 void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
 		   size_t offset);
-void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset);
+void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset, size_t l2len);
 void pcap_init(struct ctx *c);

 #endif /* PCAP_H */
diff --git a/tap.c b/tap.c
index b61199dd699d..007c91864b4e 100644
--- a/tap.c
+++ b/tap.c
@@ -1105,7 +1105,7 @@ void tap_add_packet(struct ctx *c, struct iov_tail *data,
 	struct ethhdr eh_storage;
 	const struct ethhdr *eh;

-	pcap_iov(data->iov, data->cnt, data->off);
+	pcap_iov(data->iov, data->cnt, data->off, iov_tail_size(data));
This leads to a warning from Coverity Scan:

/home/sbrivio/passt/tap.c:1108:2: Type: Evaluation order violation (EVALUATION_ORDER)
/home/sbrivio/passt/tap.c:1108:2: read_write_order: In argument #4 of "pcap_iov(data->iov, data->cnt, data->off, iov_tail_size(data))", a call is made to "iov_tail_size(data)". In argument #1 of this function, the object "data->off" is modified. This object is also used in "data->off", the argument #3 of the outer function call. The order in which these arguments are evaluated is not specified, and will vary between platforms.

I'm not sure if we can ever reach this point with a non-pruned iov_tail so that iov_tail_prune() will actually modify 'off', but just to be sure I guess we should call iov_tail_size() first, assign its result to a temporary variable, and use that.
 	eh = IOV_PEEK_HEADER(data, eh_storage);
 	if (!eh)
diff --git a/tcp_vu.c b/tcp_vu.c
index 0cd01190d612..329fa969fca1 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -143,7 +143,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	vu_flush(vdev, vq, flags_elem, 1);

 	if (*c->pcap)
-		pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN);
+		pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN,
+			 hdrlen + optlen - VNET_HLEN);
Same here, curly brackets.
 	if (flags & DUP_ACK) {
 		elem_cnt = vu_collect(vdev, vq, &flags_elem[1], 1,
@@ -159,7 +160,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 			vu_flush(vdev, vq, &flags_elem[1], 1);

 			if (*c->pcap)
-				pcap_iov(&flags_elem[1].in_sg[0], 1, VNET_HLEN);
+				pcap_iov(&flags_elem[1].in_sg[0], 1, VNET_HLEN,
+					 hdrlen + optlen - VNET_HLEN);
And here.
 		}
 	}
 	vu_queue_notify(vdev, vq);
@@ -464,7 +466,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 		vu_flush(vdev, vq, &elem[head[i]], buf_cnt);

 		if (*c->pcap)
-			pcap_iov(iov, buf_cnt, VNET_HLEN);
+			pcap_iov(iov, buf_cnt, VNET_HLEN,
+				 dlen + hdrlen - VNET_HLEN);
And here too.
 		conn->seq_to_tap += dlen;
 	}
diff --git a/udp_vu.c b/udp_vu.c
index 5421a7d71a19..81491afa7e6a 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -185,6 +185,7 @@ void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx)
 	static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE];
 	struct vu_dev *vdev = c->vdev;
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	size_t hdrlen = udp_vu_hdrlen(v6);
 	int i;

 	assert(!c->no_udp);
@@ -230,7 +231,8 @@ void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx)
 			size_t l4len = udp_vu_prepare(c, iov_vu, toside, dlen);
 			if (*c->pcap) {
 				udp_vu_csum(toside, iov_vu, iov_cnt, l4len);
-				pcap_iov(iov_vu, iov_cnt, VNET_HLEN);
+				pcap_iov(iov_vu, iov_cnt, VNET_HLEN,
+					 hdrlen + dlen - VNET_HLEN);
And here as well. Or maybe, in all those cases, we could have a 'size' temporary variable which should make things a bit more obvious.
 			}
 			vu_flush(vdev, vq, elem, elem_used);
 			vu_queue_notify(vdev, vq);
diff --git a/vu_common.c b/vu_common.c
index 57949ca32309..f254cb67ec78 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -268,7 +268,7 @@ int vu_send_single(const struct ctx *c, const void *buf, size_t size)
 	iov_from_buf(in_sg, in_total, VNET_HLEN, buf, total);

 	if (*c->pcap)
-		pcap_iov(in_sg, in_total, VNET_HLEN);
+		pcap_iov(in_sg, in_total, VNET_HLEN, size);

 	vu_flush(vdev, vq, elem, elem_cnt);
 	vu_queue_notify(vdev, vq);

-- 
Stefano
On Wed, 1 Apr 2026 21:18:24 +0200
Laurent Vivier
Currently vu_flush() derives the frame size from the iov, but in preparation for iov arrays that may be larger than the actual frame, pass the total length (including vnet header) explicitly so that only the relevant portion is reported to the virtqueue.
Ensure a minimum frame size of ETH_ZLEN + VNET_HLEN to handle short frames. All elements are still flushed to avoid descriptor leaks, but trailing elements beyond frame_len will report a zero length.
Signed-off-by: Laurent Vivier
---
 tcp_vu.c    |  6 +++---
 udp_vu.c    |  2 +-
 vu_common.c | 15 ++++++++++++---
 vu_common.h |  2 +-
 4 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/tcp_vu.c b/tcp_vu.c
index 329fa969fca1..105bca41c6de 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -140,7 +140,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	l2len = optlen + hdrlen - VNET_HLEN;
 	vu_pad(&flags_elem[0].in_sg[0], l2len);

-	vu_flush(vdev, vq, flags_elem, 1);
+	vu_flush(vdev, vq, flags_elem, 1, hdrlen + optlen);

 	if (*c->pcap)
 		pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN,
@@ -157,7 +157,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 			       flags_elem[0].in_sg[0].iov_base,
 			       flags_elem[0].in_sg[0].iov_len);

-			vu_flush(vdev, vq, &flags_elem[1], 1);
+			vu_flush(vdev, vq, &flags_elem[1], 1, hdrlen + optlen);

 			if (*c->pcap)
 				pcap_iov(&flags_elem[1].in_sg[0], 1, VNET_HLEN,
@@ -463,7 +463,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 		l2len = dlen + hdrlen - VNET_HLEN;
 		vu_pad(iov, l2len);

-		vu_flush(vdev, vq, &elem[head[i]], buf_cnt);
+		vu_flush(vdev, vq, &elem[head[i]], buf_cnt, dlen + hdrlen);

 		if (*c->pcap)
 			pcap_iov(iov, buf_cnt, VNET_HLEN,
diff --git a/udp_vu.c b/udp_vu.c
index 81491afa7e6a..4641f42eb5c4 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -234,7 +234,7 @@ void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx)
 				pcap_iov(iov_vu, iov_cnt, VNET_HLEN,
 					 hdrlen + dlen - VNET_HLEN);
 			}
-			vu_flush(vdev, vq, elem, elem_used);
+			vu_flush(vdev, vq, elem, elem_used, hdrlen + dlen);
 			vu_queue_notify(vdev, vq);
 		}
 	}
diff --git a/vu_common.c b/vu_common.c
index f254cb67ec78..d371a59a1813 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -134,18 +134,27 @@ static void vu_set_vnethdr(struct virtio_net_hdr_mrg_rxbuf *vnethdr,
  * @vq:		vhost-user virtqueue
  * @elem:	virtqueue elements array to send back to the virtqueue
  * @elem_cnt:	Length of the array
+ * @frame_len:	Total frame length including vnet header
  */
 void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
-	      struct vu_virtq_element *elem, int elem_cnt)
+	      struct vu_virtq_element *elem, int elem_cnt, size_t frame_len)
 {
+	size_t len;
 	int i;

 	vu_set_vnethdr(elem[0].in_sg[0].iov_base, elem_cnt);

+	len = MAX(ETH_ZLEN + VNET_HLEN, frame_len);
+
 	for (i = 0; i < elem_cnt; i++) {
-		size_t elem_size = iov_size(elem[i].in_sg, elem[i].in_num);
+		size_t elem_size;
+
+		elem_size = iov_size(elem[i].in_sg, elem[i].in_num);
+		if (elem_size > len)
+			elem_size = len;
Convenient, but this doesn't really represent 'elem_size' anymore and it's a bit confusing because we later subtract it from 'len' (at a first glance, it looks like one might underflow it). What about:

		size_t elem_size, fill_size;

		elem_size = iov_size(elem[i].in_sg, elem[i].in_num);
		fill_size = MIN(len, elem_size);

		vu_queue_fill(vdev, vq, &elem[i], fill_size, i);
		len -= fill_size;

?
 		vu_queue_fill(vdev, vq, &elem[i], elem_size, i);
+
+		len -= elem_size;
 	}
Should we now add a debug message or warning here if we happen to have any residual 'len'? Or it can never happen by design? I'm not quite sure.
 	vu_queue_flush(vdev, vq, elem_cnt);
@@ -270,7 +279,7 @@ int vu_send_single(const struct ctx *c, const void *buf, size_t size)
 	if (*c->pcap)
 		pcap_iov(in_sg, in_total, VNET_HLEN, size);

-	vu_flush(vdev, vq, elem, elem_cnt);
+	vu_flush(vdev, vq, elem, elem_cnt, VNET_HLEN + size);
 	vu_queue_notify(vdev, vq);

 	trace("vhost-user sent %zu", total);
diff --git a/vu_common.h b/vu_common.h
index 4037ab765b7d..77d1849e6115 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -40,7 +40,7 @@ int vu_collect(const struct vu_dev *vdev, struct vu_virtq *vq,
 	       struct iovec *in_sg, size_t max_in_sg, size_t *in_total,
 	       size_t size, size_t *collected);
 void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
-	      struct vu_virtq_element *elem, int elem_cnt);
+	      struct vu_virtq_element *elem, int elem_cnt, size_t frame_len);
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 		const struct timespec *now);
 int vu_send_single(const struct ctx *c, const void *buf, size_t size);

-- 
Stefano
On Wed, 1 Apr 2026 21:18:26 +0200
Laurent Vivier
The previous per-protocol padding done by vu_pad() in tcp_vu.c and udp_vu.c was only correct for single-buffer frames: it assumed the padding area always fell within the first iov, writing past its end with a plain memset().
It also required each caller to compute MAX(..., ETH_ZLEN + VNET_HLEN) for vu_collect() and to call vu_pad() at the right point, duplicating the minimum-size logic across protocols.
Move the Ethernet minimum size enforcement into vu_collect() itself, so that enough buffer space is always reserved for padding regardless of the requested frame size.
Rewrite vu_pad() to take a full iovec array and use iov_memset(), making it safe for multi-buffer (mergeable rx buffer) frames.
In tcp_vu_sock_recv(), replace iov_truncate() with iov_skip_bytes(): now that all consumers receive explicit data lengths, truncating the iovecs is no longer needed. In tcp_vu_data_from_sock(), cap each frame's data length against the remaining bytes actually received from the socket, so that the last partial frame gets correct headers and sequence number advancement.
Signed-off-by: Laurent Vivier
---
 iov.c       |  1 -
 tcp_vu.c    | 29 ++++++++++++++---------------
 udp_vu.c    | 14 ++++++++------
 vu_common.c | 32 +++++++++++++++-----------------
 vu_common.h |  2 +-
 5 files changed, 38 insertions(+), 40 deletions(-)

diff --git a/iov.c b/iov.c
index 83b683f3976a..2289b425529e 100644
--- a/iov.c
+++ b/iov.c
@@ -180,7 +180,6 @@ size_t iov_truncate(struct iovec *iov, size_t iov_cnt, size_t size)
  * Will write less than @length bytes if it runs out of space in
  * the iov
  */
-/* cppcheck-suppress unusedFunction */
 void iov_memset(const struct iovec *iov, size_t iov_cnt, size_t offset, int c,
 		size_t length)
 {
diff --git a/tcp_vu.c b/tcp_vu.c
index ae79a6d856b0..cae6926334b9 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -72,12 +72,12 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	struct vu_dev *vdev = c->vdev;
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
 	struct vu_virtq_element flags_elem[2];
-	size_t optlen, hdrlen, l2len;
 	struct ipv6hdr *ip6h = NULL;
 	struct iphdr *ip4h = NULL;
 	struct iovec flags_iov[2];
 	struct tcp_syn_opts *opts;
 	struct iov_tail payload;
+	size_t optlen, hdrlen;
 	struct tcphdr *th;
 	struct ethhdr *eh;
 	uint32_t seq;
@@ -88,7 +88,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)

 	elem_cnt = vu_collect(vdev, vq, &flags_elem[0], 1,
 			      &flags_iov[0], 1, NULL,
-			      MAX(hdrlen + sizeof(*opts), ETH_ZLEN + VNET_HLEN), NULL);
+			      hdrlen + sizeof(*opts), NULL);
 	if (elem_cnt != 1)
 		return -1;

@@ -128,7 +128,6 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 		return ret;
 	}

-	iov_truncate(&flags_iov[0], 1, hdrlen + optlen);
 	payload = IOV_TAIL(flags_elem[0].in_sg, 1, hdrlen);

 	if (flags & KEEPALIVE)
@@ -137,9 +136,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload,
 			 optlen, NULL, seq, !*c->pcap);

-	l2len = optlen + hdrlen - VNET_HLEN;
-	vu_pad(&flags_elem[0].in_sg[0], l2len);
-
+	vu_pad(flags_elem[0].in_sg, 1, hdrlen + optlen);
 	vu_flush(vdev, vq, flags_elem, 1, hdrlen + optlen);

 	if (*c->pcap)
@@ -149,7 +146,7 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	if (flags & DUP_ACK) {
 		elem_cnt = vu_collect(vdev, vq, &flags_elem[1], 1,
 				      &flags_iov[1], 1, NULL,
-				      flags_elem[0].in_sg[0].iov_len, NULL);
+				      hdrlen + optlen, NULL);
 		if (elem_cnt == 1 &&
 		    flags_elem[1].in_sg[0].iov_len >=
 		    flags_elem[0].in_sg[0].iov_len) {
@@ -213,7 +210,7 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
 				ARRAY_SIZE(elem) - elem_cnt,
 				&iov_vu[DISCARD_IOV_NUM + iov_used],
 				VIRTQUEUE_MAX_SIZE - iov_used, &in_total,
-				MAX(MIN(mss, fillsize) + hdrlen, ETH_ZLEN + VNET_HLEN),
+				MIN(mss, fillsize) + hdrlen,
 				&frame_size);
 		if (cnt == 0)
 			break;
@@ -249,8 +246,11 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
 	if (!peek_offset_cap)
 		ret -= already_sent;

-	/* adjust iov number and length of the last iov */
-	i = iov_truncate(&iov_vu[DISCARD_IOV_NUM], iov_used, ret);
+	i = iov_skip_bytes(&iov_vu[DISCARD_IOV_NUM], iov_used,
+		MAX(hdrlen + ret, VNET_HLEN + ETH_ZLEN),
+		NULL);
Nit: this should be aligned like this:

	i = iov_skip_bytes(&iov_vu[DISCARD_IOV_NUM], iov_used,
			   MAX(hdrlen + ret, VNET_HLEN + ETH_ZLEN),
			   NULL);
+ if ((size_t)i < iov_used) + i++;
I'm a bit lost here. I see that this increment restores the iov_truncate() convention of returning the number of iov items (which we need later), but... what happens if we have i >= iov_used (even though my assumption is that it should never happen)? We're throwing away data?
/* adjust head count */ while (*head_cnt > 0 && head[*head_cnt - 1] >= i) @@ -447,11 +447,13 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) size_t frame_size = iov_size(iov, buf_cnt); bool push = i == head_cnt - 1; ssize_t dlen; - size_t l2len;
assert(frame_size >= hdrlen);
dlen = frame_size - hdrlen; + if (dlen > len) + dlen = len; + len -= dlen;
/* The IPv4 header checksum varies only with dlen */ if (previous_dlen != dlen) @@ -460,10 +462,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
tcp_vu_prepare(c, conn, iov, buf_cnt, dlen, &check, !*c->pcap, push);
- /* Pad first/single buffer only, it's at least ETH_ZLEN long */ - l2len = dlen + hdrlen - VNET_HLEN; - vu_pad(iov, l2len); - + vu_pad(elem[head[i]].in_sg, buf_cnt, dlen + hdrlen); vu_flush(vdev, vq, &elem[head[i]], buf_cnt, dlen + hdrlen);
if (*c->pcap) diff --git a/udp_vu.c b/udp_vu.c index 4641f42eb5c4..30af64034516 100644 --- a/udp_vu.c +++ b/udp_vu.c @@ -65,7 +65,7 @@ static size_t udp_vu_hdrlen(bool v6) static ssize_t udp_vu_sock_recv(struct iovec *iov, size_t *cnt, int s, bool v6) { struct msghdr msg = { 0 }; - size_t hdrlen, l2len; + size_t hdrlen, iov_used; ssize_t dlen;
/* compute L2 header length */ @@ -88,11 +88,12 @@ static ssize_t udp_vu_sock_recv(struct iovec *iov, size_t *cnt, int s, bool v6) iov[0].iov_base = (char *)iov[0].iov_base - hdrlen; iov[0].iov_len += hdrlen;
- *cnt = iov_truncate(iov, *cnt, dlen + hdrlen); - - /* pad frame to 60 bytes: first buffer is at least ETH_ZLEN long */ - l2len = dlen + hdrlen - VNET_HLEN; - vu_pad(&iov[0], l2len); + iov_used = iov_skip_bytes(iov, *cnt, + MAX(dlen + hdrlen, VNET_HLEN + ETH_ZLEN), + NULL); + if (iov_used < *cnt) + iov_used++;
(I would have the same question here)
+ *cnt = iov_used; /* one iovec per element */
return dlen; } @@ -234,6 +235,7 @@ void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx) pcap_iov(iov_vu, iov_cnt, VNET_HLEN, hdrlen + dlen - VNET_HLEN); } + vu_pad(iov_vu, iov_cnt, hdrlen + dlen); vu_flush(vdev, vq, elem, elem_used, hdrlen + dlen); vu_queue_notify(vdev, vq); } diff --git a/vu_common.c b/vu_common.c index d371a59a1813..ca0aab369d3c 100644 --- a/vu_common.c +++ b/vu_common.c @@ -74,6 +74,7 @@ int vu_collect(const struct vu_dev *vdev, struct vu_virtq *vq, size_t current_iov = 0; int elem_cnt = 0;
+ size = MAX(size, ETH_ZLEN + VNET_HLEN); /* Ethernet minimum size */
I think this (if needed):

	size = MAX(size, ETH_ZLEN /* Ethernet minimum size */ + VNET_HLEN);

would be more accurate.
while (current_size < size && elem_cnt < max_elem && current_iov < max_in_sg) { int ret; @@ -262,29 +263,27 @@ int vu_send_single(const struct ctx *c, const void *buf, size_t size) return -1; }
- size += VNET_HLEN; elem_cnt = vu_collect(vdev, vq, elem, ARRAY_SIZE(elem), in_sg, - ARRAY_SIZE(in_sg), &in_total, size, &total); - if (elem_cnt == 0 || total < size) { + ARRAY_SIZE(in_sg), &in_total, VNET_HLEN + size, &total); + if (elem_cnt == 0 || total < VNET_HLEN + size) { debug("vu_send_single: no space to send the data " "elem_cnt %d size %zu", elem_cnt, total); goto err; }
- total -= VNET_HLEN; - /* copy data from the buffer to the iovec */ - iov_from_buf(in_sg, in_total, VNET_HLEN, buf, total); + iov_from_buf(in_sg, in_total, VNET_HLEN, buf, size);
if (*c->pcap) pcap_iov(in_sg, in_total, VNET_HLEN, size);
+ vu_pad(in_sg, in_total, VNET_HLEN + size); vu_flush(vdev, vq, elem, elem_cnt, VNET_HLEN + size); vu_queue_notify(vdev, vq);
- trace("vhost-user sent %zu", total); + trace("vhost-user sent %zu", size);
- return total; + return size; err: for (i = 0; i < elem_cnt; i++) vu_queue_detach_element(vq); @@ -293,15 +292,14 @@ err: }
/** - * vu_pad() - Pad 802.3 frame to minimum length (60 bytes) if needed - * @iov: Buffer in iovec array where end of 802.3 frame is stored - * @l2len: Layer-2 length already filled in frame + * vu_pad() - Pad short frames to minimum Ethernet length and truncate iovec + * @iov: Pointer to iovec array + * @cnt: Number of entries in @iov + * @frame_len: Data length in @iov (including virtio-net header) */ -void vu_pad(struct iovec *iov, size_t l2len) +void vu_pad(const struct iovec *iov, size_t cnt, size_t frame_len) { - if (l2len >= ETH_ZLEN) - return; - - memset((char *)iov->iov_base + iov->iov_len, 0, ETH_ZLEN - l2len); - iov->iov_len += ETH_ZLEN - l2len; + if (frame_len < ETH_ZLEN + VNET_HLEN)
Nit: curly brackets. Perhaps better: use a temporary variable for ETH_ZLEN + VNET_HLEN - frame_len ("padding"?).
+ iov_memset(iov, cnt, frame_len, 0, + ETH_ZLEN + VNET_HLEN - frame_len); } diff --git a/vu_common.h b/vu_common.h index 77d1849e6115..51f70084a7cb 100644 --- a/vu_common.h +++ b/vu_common.h @@ -44,6 +44,6 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq, void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref, const struct timespec *now); int vu_send_single(const struct ctx *c, const void *buf, size_t size); -void vu_pad(struct iovec *iov, size_t l2len); +void vu_pad(const struct iovec *iov, size_t cnt, size_t frame_len);
#endif /* VU_COMMON_H */
-- Stefano
On 4/3/26 08:20, Stefano Brivio wrote:
On Wed, 1 Apr 2026 21:18:19 +0200 Laurent Vivier wrote:

Every caller of vu_flush() was calling vu_set_vnethdr() beforehand with the same pattern. Move it into vu_flush().
Remove vu_queue_notify() from vu_flush() and let callers invoke it explicitly. This allows paths that perform multiple flushes, such as tcp_vu_send_flag() and tcp_vu_data_from_sock(), to issue a single guest notification at the end.
Signed-off-by: Laurent Vivier
--- tcp_vu.c | 19 ++++++++----------- udp_vu.c | 3 +-- vu_common.c | 9 +++++---- vu_common.h | 1 - 4 files changed, 14 insertions(+), 18 deletions(-) diff --git a/tcp_vu.c b/tcp_vu.c index dc0e17c0f03f..0cd01190d612 100644 --- a/tcp_vu.c +++ b/tcp_vu.c @@ -82,7 +82,6 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags) struct ethhdr *eh; uint32_t seq; int elem_cnt; - int nb_ack; int ret;
hdrlen = tcp_vu_hdrlen(CONN_V6(conn)); @@ -97,8 +96,6 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags) assert(flags_elem[0].in_sg[0].iov_len >= MAX(hdrlen + sizeof(*opts), ETH_ZLEN + VNET_HLEN));
- vu_set_vnethdr(flags_elem[0].in_sg[0].iov_base, 1); - eh = vu_eth(flags_elem[0].in_sg[0].iov_base);
memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest)); @@ -143,9 +140,10 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags) l2len = optlen + hdrlen - VNET_HLEN; vu_pad(&flags_elem[0].in_sg[0], l2len);
+ vu_flush(vdev, vq, flags_elem, 1); + if (*c->pcap) pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN); - nb_ack = 1;
if (flags & DUP_ACK) { elem_cnt = vu_collect(vdev, vq, &flags_elem[1], 1, @@ -157,14 +155,14 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags) memcpy(flags_elem[1].in_sg[0].iov_base, flags_elem[0].in_sg[0].iov_base, flags_elem[0].in_sg[0].iov_len); - nb_ack++; + + vu_flush(vdev, vq, &flags_elem[1], 1);
if (*c->pcap) pcap_iov(&flags_elem[1].in_sg[0], 1, VNET_HLEN); } } - - vu_flush(vdev, vq, flags_elem, nb_ack); + vu_queue_notify(vdev, vq);
return 0; } @@ -451,7 +449,6 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) assert(frame_size >= hdrlen);
dlen = frame_size - hdrlen; - vu_set_vnethdr(iov->iov_base, buf_cnt);
/* The IPv4 header checksum varies only with dlen */ if (previous_dlen != dlen) @@ -464,14 +461,14 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) l2len = dlen + hdrlen - VNET_HLEN; vu_pad(iov, l2len);
+ vu_flush(vdev, vq, &elem[head[i]], buf_cnt); + if (*c->pcap) pcap_iov(iov, buf_cnt, VNET_HLEN);
conn->seq_to_tap += dlen; } - - /* send packets */ - vu_flush(vdev, vq, elem, iov_cnt); + vu_queue_notify(vdev, vq);
If I understand correctly, this makes iov_cnt set by tcp_vu_sock_recv() redundant because you're now calling vu_flush() from inside the loop that iterates up to head_cnt (which is the only count we need to know).
Maybe it would be clearer to have a patch in this series dropping iov_cnt as argument altogether (especially the commit message of that kind of change might make things clearer).
I rework this part in the TCP series; this is one of the reasons the multibuffer series is needed.

Here we have a simplification, as we have one iovec per element: the number of iovecs is equal to the number of elements, and head points to an element that is also a single iovec.

iov_cnt is the total number of iovecs for a frame (spread across several elements), but as we have one iovec per element it's also the number of elements. buf_cnt is the number of elements for a frame, but also the number of iovecs for a frame (again, one iovec per element). So:

- vu_flush() and vu_set_vnethdr() need the number of elements per frame (buf_cnt as an element count)
- pcap_iov() needs the number of iovecs in a frame (buf_cnt as an iovec count)

In this patch I only want to move vu_set_vnethdr(), nothing more.

Thanks,
Laurent
On 4/3/26 08:20, Stefano Brivio wrote:
On Wed, 1 Apr 2026 21:18:23 +0200 Laurent Vivier wrote:

With vhost-user multibuffer frames, the iov can be larger than the actual L2 frame. The previous approach of computing the L2 length as iov_size() - offset would overcount and write extra bytes into the pcap file.
Pass the L2 frame length explicitly to pcap_frame() and pcap_iov(), and write exactly that many bytes instead of the full iov remainder.
Signed-off-by: Laurent Vivier
--- pcap.c | 37 ++++++++++++++++++++++++++++--------- pcap.h | 2 +- tap.c | 2 +- tcp_vu.c | 9 ++++++--- udp_vu.c | 4 +++- vu_common.c | 2 +- 6 files changed, 40 insertions(+), 16 deletions(-) diff --git a/pcap.c b/pcap.c index a026f17e7974..dfe1c61add9a 100644 --- a/pcap.c +++ b/pcap.c @@ -52,22 +52,38 @@ struct pcap_pkthdr { * @iov: IO vector containing frame (with L2 headers and tap headers) * @iovcnt: Number of buffers (@iov entries) in frame * @offset: Byte offset of the L2 headers within @iov + * @l2len: Length of L2 frame data to capture * @now: Timestamp */ static void pcap_frame(const struct iovec *iov, size_t iovcnt, - size_t offset, const struct timespec *now) + size_t offset, size_t l2len, const struct timespec *now) { - size_t l2len = iov_size(iov, iovcnt) - offset; struct pcap_pkthdr h = { .tv_sec = now->tv_sec, .tv_usec = DIV_ROUND_CLOSEST(now->tv_nsec, 1000), .caplen = l2len, .len = l2len }; + size_t i;
- if (write_all_buf(pcap_fd, &h, sizeof(h)) < 0 || - write_remainder(pcap_fd, iov, iovcnt, offset) < 0) - debug_perror("Cannot log packet, length %zu", l2len); + if (write_all_buf(pcap_fd, &h, sizeof(h)) < 0) { + debug_perror("Cannot log packet, packet header error"); + return; + } + + for (i = iov_skip_bytes(iov, iovcnt, offset, &offset); + i < iovcnt && l2len; i++) { + size_t n = MIN(l2len, iov[i].iov_len - offset); + + if (write_all_buf(pcap_fd, (char *)iov[i].iov_base + offset, + n) < 0) {
This could be written as:
	size_t l = MIN(l2len, iov[i].iov_len - offset);
	ssize_t n;

	n = write_all_buf(pcap_fd, (char *)iov[i].iov_base + offset, l);
	if (n < 0) {

(note that n needs to be signed here, otherwise the n < 0 check can never be true)
same number of lines, but slightly clearer I think.
+ debug_perror("Cannot log packet");
Should we add back the ", length %zu" part, perhaps, using this 'l' variable? I think it makes a difference if we see we're failing to capture a reasonably sized 64k packet (disk full or something) compared to a 100G-length buffer coming from some calculation gone wrong.
+ return; + } + + offset = 0; + l2len -= n; + } }
/** @@ -87,7 +103,7 @@ void pcap(const char *pkt, size_t l2len) if (clock_gettime(CLOCK_REALTIME, &now)) err_perror("Failed to get CLOCK_REALTIME time");
- pcap_frame(&iov, 1, 0, &now); + pcap_frame(&iov, 1, 0, l2len, &now); }
/** @@ -110,7 +126,9 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n, err_perror("Failed to get CLOCK_REALTIME time");
for (i = 0; i < n; i++) - pcap_frame(iov + i * frame_parts, frame_parts, offset, &now); + pcap_frame(iov + i * frame_parts, frame_parts, offset,
Nit: curly brackets for coding style.
+ iov_size(iov + i * frame_parts, frame_parts) - offset, + &now); }
/** @@ -120,8 +138,9 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n, * containing packet data to write, including L2 header * @iovcnt: Number of buffers (@iov entries) * @offset: Offset of the L2 frame within the full data length + * @l2len: Length of L2 frame data to capture */ -void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset) +void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset, size_t l2len) { struct timespec now = { 0 };
@@ -131,7 +150,7 @@ void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset) if (clock_gettime(CLOCK_REALTIME, &now)) err_perror("Failed to get CLOCK_REALTIME time");
- pcap_frame(iov, iovcnt, offset, &now); + pcap_frame(iov, iovcnt, offset, l2len, &now); }
/** diff --git a/pcap.h b/pcap.h index dface5df4ee6..c171257cbd73 100644 --- a/pcap.h +++ b/pcap.h @@ -13,7 +13,7 @@ extern int pcap_fd; void pcap(const char *pkt, size_t l2len); void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n, size_t offset); -void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset); +void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset, size_t l2len); void pcap_init(struct ctx *c);
#endif /* PCAP_H */ diff --git a/tap.c b/tap.c index b61199dd699d..007c91864b4e 100644 --- a/tap.c +++ b/tap.c @@ -1105,7 +1105,7 @@ void tap_add_packet(struct ctx *c, struct iov_tail *data, struct ethhdr eh_storage; const struct ethhdr *eh;
- pcap_iov(data->iov, data->cnt, data->off); + pcap_iov(data->iov, data->cnt, data->off, iov_tail_size(data));
This leads to a warning from Coverity Scan:
/home/sbrivio/passt/tap.c:1108:2: Type: Evaluation order violation (EVALUATION_ORDER)
/home/sbrivio/passt/tap.c:1108:2: read_write_order: In argument #4 of "pcap_iov(data->iov, data->cnt, data->off, iov_tail_size(data))", a call is made to "iov_tail_size(data)". In argument #1 of this function, the object "data->off" is modified. This object is also used in "data->off", the argument #3 of the outer function call. The order in which these arguments are evaluated is not specified, and will vary between platforms.
I'm not sure if we can ever reach this point with a non-pruned iov_tail so that iov_tail_prune() will actually modify 'off', but just to be sure I guess we should call iov_tail_size() first, assign its result to a temporary variable, and use that.
In fact, iov_size() will not change after pruning. The pair (cnt, off) will always point to the same offset (but cnt can be higher and off lower).
eh = IOV_PEEK_HEADER(data, eh_storage); if (!eh) diff --git a/tcp_vu.c b/tcp_vu.c index 0cd01190d612..329fa969fca1 100644 --- a/tcp_vu.c +++ b/tcp_vu.c @@ -143,7 +143,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags) vu_flush(vdev, vq, flags_elem, 1);
if (*c->pcap) - pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN); + pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN, + hdrlen + optlen - VNET_HLEN);
Same here, curly brackets.
if (flags & DUP_ACK) { elem_cnt = vu_collect(vdev, vq, &flags_elem[1], 1, @@ -159,7 +160,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags) vu_flush(vdev, vq, &flags_elem[1], 1);
if (*c->pcap) - pcap_iov(&flags_elem[1].in_sg[0], 1, VNET_HLEN); + pcap_iov(&flags_elem[1].in_sg[0], 1, VNET_HLEN, + hdrlen + optlen - VNET_HLEN);
And here.
} } vu_queue_notify(vdev, vq); @@ -464,7 +466,8 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) vu_flush(vdev, vq, &elem[head[i]], buf_cnt);
if (*c->pcap) - pcap_iov(iov, buf_cnt, VNET_HLEN); + pcap_iov(iov, buf_cnt, VNET_HLEN, + dlen + hdrlen - VNET_HLEN);
And here too.
conn->seq_to_tap += dlen; } diff --git a/udp_vu.c b/udp_vu.c index 5421a7d71a19..81491afa7e6a 100644 --- a/udp_vu.c +++ b/udp_vu.c @@ -185,6 +185,7 @@ void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx) static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE]; struct vu_dev *vdev = c->vdev; struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE]; + size_t hdrlen = udp_vu_hdrlen(v6); int i;
assert(!c->no_udp); @@ -230,7 +231,8 @@ void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx) size_t l4len = udp_vu_prepare(c, iov_vu, toside, dlen); if (*c->pcap) { udp_vu_csum(toside, iov_vu, iov_cnt, l4len); - pcap_iov(iov_vu, iov_cnt, VNET_HLEN); + pcap_iov(iov_vu, iov_cnt, VNET_HLEN, + hdrlen + dlen - VNET_HLEN);
And here as well. Or maybe, in all those cases, we could have a 'size' temporary variable which should make things a bit more obvious.
I agree.

Thanks,
Laurent
On 4/3/26 08:20, Stefano Brivio wrote:
On Wed, 1 Apr 2026 21:18:26 +0200 Laurent Vivier wrote:

The previous per-protocol padding done by vu_pad() in tcp_vu.c and udp_vu.c was only correct for single-buffer frames: it assumed the padding area always fell within the first iov, writing past its end with a plain memset().
It also required each caller to compute MAX(..., ETH_ZLEN + VNET_HLEN) for vu_collect() and to call vu_pad() at the right point, duplicating the minimum-size logic across protocols.
Move the Ethernet minimum size enforcement into vu_collect() itself, so that enough buffer space is always reserved for padding regardless of the requested frame size.
Rewrite vu_pad() to take a full iovec array and use iov_memset(), making it safe for multi-buffer (mergeable rx buffer) frames.
In tcp_vu_sock_recv(), replace iov_truncate() with iov_skip_bytes(): now that all consumers receive explicit data lengths, truncating the iovecs is no longer needed. In tcp_vu_data_from_sock(), cap each frame's data length against the remaining bytes actually received from the socket, so that the last partial frame gets correct headers and sequence number advancement.
Signed-off-by: Laurent Vivier
[...]
- /* adjust iov number and length of the last iov */ - i = iov_truncate(&iov_vu[DISCARD_IOV_NUM], iov_used, ret); + i = iov_skip_bytes(&iov_vu[DISCARD_IOV_NUM], iov_used, + MAX(hdrlen + ret, VNET_HLEN + ETH_ZLEN), + NULL);
Nit: this should be aligned like this:

	i = iov_skip_bytes(&iov_vu[DISCARD_IOV_NUM], iov_used,
			   MAX(hdrlen + ret, VNET_HLEN + ETH_ZLEN),
			   NULL);
+ if ((size_t)i < iov_used) + i++;
I'm a bit lost here. I see that this increment restores the iov_truncate() convention of returning the number of iov items (which
iov_truncate() was truncating the iovec array (reducing cnt and the iov_len of the last iovec) to fit the actual size of the data. Here we are counting the number of elements: we have collected more elements than needed to store the data, so we need to know how many we actually use, to give the unused ones back to the virtqueue. Again, the confusing point is that we have the same number of elements as iovecs. This is fixed in the following series.
we need later), but... what happens if we have i >= iov_used (even though my assumption is that it should never happen)? We're throwing away data?

i cannot be greater than iov_used. If i == iov_used, it means we need all the elements.
Thanks, Laurent
On Wed, Apr 01, 2026 at 09:18:17PM +0200, Laurent Vivier wrote:
Add a helper to set a range of bytes across an IO vector to a given value, similar to memset() but operating over scatter-gather buffers. It skips to the given offset and fills across iovec entries up to the requested length.
Signed-off-by: Laurent Vivier
Reviewed-by: David Gibson
--- iov.c | 27 +++++++++++++++++++++++++++ iov.h | 2 ++ 2 files changed, 29 insertions(+)
diff --git a/iov.c b/iov.c index ae0743931d18..0188acdf5eba 100644 --- a/iov.c +++ b/iov.c @@ -170,6 +170,33 @@ size_t iov_truncate(struct iovec *iov, size_t iov_cnt, size_t size) return i; }
+/** + * iov_memset() - Set bytes of an IO vector to a given value + * @iov: IO vector + * @iov_cnt: Number of elements in @iov + * @offset: Byte offset in the iovec at which to start + * @c: Byte value to fill with + * @length: Number of bytes to set + * Will write less than @length bytes if it runs out of space in + * the iov + */ +/* cppcheck-suppress unusedFunction */ +void iov_memset(const struct iovec *iov, size_t iov_cnt, size_t offset, int c, + size_t length) +{ + size_t i; + + i = iov_skip_bytes(iov, iov_cnt, offset, &offset); + + for ( ; i < iov_cnt && length; i++) { + size_t n = MIN(iov[i].iov_len - offset, length); + + memset((char *)iov[i].iov_base + offset, c, n); + offset = 0; + length -= n; + } +} + /** * iov_tail_prune() - Remove any unneeded buffers from an IOV tail * @tail: IO vector tail (modified) diff --git a/iov.h b/iov.h index b4e50b0fca5a..d295d05b3bab 100644 --- a/iov.h +++ b/iov.h @@ -30,6 +30,8 @@ size_t iov_to_buf(const struct iovec *iov, size_t iov_cnt, size_t offset, void *buf, size_t bytes); size_t iov_size(const struct iovec *iov, size_t iov_cnt); size_t iov_truncate(struct iovec *iov, size_t iov_cnt, size_t size); +void iov_memset(const struct iovec *iov, size_t iov_cnt, size_t offset, int c, + size_t length);
/* * DOC: Theory of Operation, struct iov_tail -- 2.53.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
participants (3): David Gibson, Laurent Vivier, Stefano Brivio