[PATCH v3 00/10] vhost-user: Preparatory series for multiple iovec entries per virtqueue element
Currently, the vhost-user path assumes each virtqueue element contains
exactly one iovec entry covering the entire frame. This assumption
breaks as some virtio-net drivers (notably iPXE) provide descriptors
where the vnet header and the frame payload are in separate buffers,
resulting in two iovec entries per virtqueue element.

This series refactors the vhost-user data path so that frame lengths,
header sizes, and padding are tracked and passed explicitly rather than
being derived from iovec sizes. This decoupling is a prerequisite for
correctly handling padding of multi-buffer frames.

The changes in this series can be split into three groups:

- New iov helpers (patches 1-2): iov_memset() and iov_memcpy() operate
  across iovec boundaries. These are needed by the final patch to pad
  and copy frame data when a frame spans multiple iovec entries.

- Structural refactoring (patches 3-5): Move vnethdr setup into
  vu_flush(), separate virtqueue management from socket I/O in the UDP
  path, and pass iov arrays explicitly instead of using file-scoped
  state. These changes make it possible to pass explicit frame lengths
  through the stack, which is required to pad frames independently of
  iovec layout.

- Explicit length passing throughout the stack (patches 6-10): Thread
  explicit L4, L2, frame, and data lengths through checksum, pcap,
  vu_flush(), and tcp_fill_headers(), replacing lengths that were
  previously derived from iovec sizes. With lengths tracked explicitly,
  the final patch can centralise Ethernet frame padding into
  vu_collect() and a new vu_pad() helper that correctly pads frames
  spanning multiple iovec entries.
v3:
- csum_udp4()/csum_udp6()/udp_vu_csum() receive payload length (dlen)
  rather than l4len
- Add a length parameter to write_remainder() and use it in pcap_frame()

v2:
- Rename iov_memcopy() to iov_memcpy() and use clearer parameter names
- Use clearer code in pcap_frame()
- Add braces around bodies in pcap.c and tcp_vu.c for style consistency
- Extract l2len variable in tap_add_packet() and tcp_vu_send_flag() to
  avoid repeating the same expression
- Fix indentation alignment of iov_skip_bytes() arguments in tcp_vu.c
- Introduce fill_size variable in vu_flush()
- Reposition comment for ETH_ZLEN in vu_collect()

Laurent Vivier (10):
  iov: Introduce iov_memset()
  iov: Add iov_memcpy() to copy data between iovec arrays
  vu_common: Move vnethdr setup into vu_flush()
  udp_vu: Move virtqueue management from udp_vu_sock_recv() to its caller
  udp_vu: Pass iov explicitly to helpers instead of using file-scoped array
  checksum: Pass explicit L4 length to checksum functions
  pcap: Pass explicit L2 length to pcap_iov()
  vu_common: Pass explicit frame length to vu_flush()
  tcp: Pass explicit data length to tcp_fill_headers()
  vhost-user: Centralise Ethernet frame padding in vu_collect() and vu_pad()

 checksum.c     |  43 +++++++-----
 checksum.h     |   6 +-
 iov.c          |  78 ++++++++++++++++++++++
 iov.h          |   5 ++
 pcap.c         |  28 +++++---
 pcap.h         |   2 +-
 tap.c          |  10 +--
 tcp.c          |  14 ++--
 tcp_buf.c      |   3 +-
 tcp_internal.h |   2 +-
 tcp_vu.c       |  66 ++++++++++---------
 udp.c          |   5 +-
 udp_vu.c       | 173 +++++++++++++++++++++++++------------------------
 util.c         |  31 +++++++--
 util.h         |   3 +-
 vu_common.c    |  58 ++++++++++-------
 vu_common.h    |   5 +-
 17 files changed, 338 insertions(+), 194 deletions(-)

-- 
2.53.0
Add a helper to set a range of bytes across an IO vector to a given
value, similar to memset() but operating over scatter-gather buffers.
It skips to the given offset and fills across iovec entries up to the
requested length.
Signed-off-by: Laurent Vivier
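[Not part of the patch: a minimal self-contained sketch of what such a helper
looks like, following the description above. The body here is illustrative
and may differ in detail from the actual iov_memset() in the patch.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

#ifndef MIN
#define MIN(a, b)	((a) < (b) ? (a) : (b))
#endif

/* Sketch of iov_memset(): fill @length bytes with @c, starting @offset
 * bytes into the scatter-gather buffer described by @iov / @iov_cnt.
 * Entries wholly before the offset are skipped; the fill then continues
 * across entry boundaries until @length bytes are written or the vector
 * runs out.
 */
static void iov_memset(const struct iovec *iov, size_t iov_cnt,
		       size_t offset, int c, size_t length)
{
	size_t i;

	for (i = 0; i < iov_cnt && length; i++) {
		size_t n;

		if (offset >= iov[i].iov_len) {	/* Entry before the offset */
			offset -= iov[i].iov_len;
			continue;
		}

		n = MIN(length, iov[i].iov_len - offset);
		memset((char *)iov[i].iov_base + offset, c, n);
		offset = 0;
		length -= n;
	}
}
```

The interesting case is a fill that starts inside one entry and ends inside
the next, which a plain memset() on a single buffer cannot do.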
Add a helper to copy data from a source iovec array to a destination
iovec array, each starting at an arbitrary byte offset, iterating
through both arrays simultaneously and copying in chunks matching the
smaller of the two current segments.
Signed-off-by: Laurent Vivier
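[Not part of the patch: a condensed, self-contained sketch of the copy loop
described above. The real iov_memcpy() (quoted later in this thread) uses
iov_skip_bytes() for the initial offsets; here the skip is inlined so the
example compiles on its own.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

#ifndef MIN
#define MIN(a, b)	((a) < (b) ? (a) : (b))
#endif

/* Sketch of iov_memcpy(): walk source and destination iovec arrays in
 * lockstep, copying chunks bounded by the smaller of the two current
 * segments (and by the remaining @length).  Returns bytes copied.
 */
static size_t iov_memcpy(struct iovec *dst_iov, size_t dst_iov_cnt,
			 size_t dst_offset,
			 const struct iovec *src_iov, size_t src_iov_cnt,
			 size_t src_offset, size_t length)
{
	size_t i = 0, j = 0, total = 0;

	/* Skip whole entries lying before the starting offsets */
	while (i < src_iov_cnt && src_offset >= src_iov[i].iov_len)
		src_offset -= src_iov[i++].iov_len;
	while (j < dst_iov_cnt && dst_offset >= dst_iov[j].iov_len)
		dst_offset -= dst_iov[j++].iov_len;

	while (length && i < src_iov_cnt && j < dst_iov_cnt) {
		size_t n = MIN(dst_iov[j].iov_len - dst_offset,
			       src_iov[i].iov_len - src_offset);

		n = MIN(n, length);
		memcpy((char *)dst_iov[j].iov_base + dst_offset,
		       (const char *)src_iov[i].iov_base + src_offset, n);

		dst_offset += n;
		src_offset += n;
		total += n;
		length -= n;

		if (dst_offset == dst_iov[j].iov_len) {	/* Next dst entry */
			dst_offset = 0;
			j++;
		}
		if (src_offset == src_iov[i].iov_len) {	/* Next src entry */
			src_offset = 0;
			i++;
		}
	}

	return total;
}
```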
Every caller of vu_flush() was calling vu_set_vnethdr() beforehand with
the same pattern. Move it into vu_flush().
Remove vu_queue_notify() from vu_flush() and let callers invoke it
explicitly. This allows paths that perform multiple flushes, such as
tcp_vu_send_flag() and tcp_vu_data_from_sock(), to issue a single guest
notification at the end.
Signed-off-by: Laurent Vivier
udp_vu_sock_recv() currently mixes two concerns: receiving data from the
socket and managing virtqueue buffers (collecting, rewinding, releasing).
This makes the function harder to reason about and couples socket I/O
with virtqueue state.
Move all virtqueue operations, vu_collect(), vu_init_elem(),
vu_queue_rewind(), vu_set_vnethdr(), and the queue-readiness check, into
udp_vu_sock_to_tap(), which is the only caller. This turns
udp_vu_sock_recv() into a pure socket receive function that simply reads
into the provided iov array and adjusts its length.
Signed-off-by: Laurent Vivier
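[Not part of the patch: a hypothetical, self-contained sketch of what a
"pure socket receive into an iov array" looks like, as described above. The
name sock_recv_iov() and the trimming logic are illustrative, not the
series' udp_vu_sock_recv().]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef MIN
#define MIN(a, b)	((a) < (b) ? (a) : (b))
#endif

/* Hypothetical helper: receive one datagram into @iov, then shrink *cnt
 * to the number of entries that actually hold data, so the caller can
 * release unused virtqueue buffers.  Returns bytes received, -1 on error.
 */
static ssize_t sock_recv_iov(int s, struct iovec *iov, size_t *cnt)
{
	struct msghdr mh = { .msg_iov = iov, .msg_iovlen = *cnt };
	ssize_t n = recvmsg(s, &mh, MSG_DONTWAIT);
	size_t used = 0, left;

	if (n < 0)
		return -1;

	for (left = n; used < *cnt && left; used++)
		left -= MIN(left, iov[used].iov_len);

	*cnt = used;	/* Entries actually holding data */
	return n;
}
```

All virtqueue bookkeeping stays with the caller; the helper only touches the
socket and the vector it was handed.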
The iov_tail passed to csum_iov_tail() may contain padding or trailing
data beyond the actual L4 payload. Rather than relying on
iov_tail_size() to determine how many bytes to checksum, pass the
length explicitly so that only the relevant payload bytes are included
in the checksum computation.
Signed-off-by: Laurent Vivier
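[Not part of the patch: an illustration of the principle. This is not the
series' csum_iov_tail(); it is a simplified byte-at-a-time Internet checksum
over an iovec array, showing how an explicit length cap stops the sum before
any trailing bytes the vector may contain.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

#ifndef MIN
#define MIN(a, b)	((a) < (b) ? (a) : (b))
#endif

/* Hypothetical helper: Internet checksum over the first @len bytes of
 * @iov.  Byte-at-a-time for clarity (the real code sums larger words);
 * even/odd position is tracked across entries so 16-bit word pairing
 * survives segment boundaries.
 */
static uint16_t csum_iov_len(const struct iovec *iov, size_t cnt, size_t len)
{
	uint32_t sum = 0;
	size_t pos = 0, i;

	for (i = 0; i < cnt && pos < len; i++) {
		const uint8_t *p = iov[i].iov_base;
		size_t n = MIN(iov[i].iov_len, len - pos);
		size_t k;

		for (k = 0; k < n; k++, pos++)
			sum += (pos & 1) ? p[k] : (uint32_t)p[k] << 8;
	}

	while (sum >> 16)		/* Fold carries */
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

With the cap, two vectors that agree on the first @len bytes always produce
the same checksum, no matter how much padding follows.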
udp_vu_sock_recv(), udp_vu_prepare(), and udp_vu_csum() all operated on
the file-scoped iov_vu[] array directly. Pass iov and count as explicit
parameters instead, and move iov_vu[] and elem[] to function-local
statics in udp_vu_sock_to_tap(), the only function that needs them.
Signed-off-by: Laurent Vivier
With vhost-user multibuffer frames, the iov can be larger than the
actual L2 frame. The previous approach of computing L2 length as
iov_size() - offset would overcount and write extra bytes into the
pcap file.
Pass the L2 frame length explicitly to pcap_frame() and pcap_iov(),
and write exactly that many bytes instead of the full iov remainder.
Signed-off-by: Laurent Vivier
Currently vu_flush() derives the frame size from the iov, but in
preparation for iov arrays that may be larger than the actual frame,
pass the total length (including vnet header) explicitly so that only
the relevant portion is reported to the virtqueue.
Ensure a minimum frame size of ETH_ZLEN + VNET_HLEN to handle short
frames. All elements are still flushed to avoid descriptor leaks,
but trailing elements beyond frame_len will report a zero length.
Signed-off-by: Laurent Vivier
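[Not part of the patch: a hypothetical sketch of the per-element accounting
described above. The real vu_flush() writes these lengths into the
virtqueue's used ring; here plain arrays stand in for the element buffers.]

```c
#include <assert.h>
#include <stddef.h>

#ifndef MIN
#define MIN(a, b)	((a) < (b) ? (a) : (b))
#endif

/* Distribute @frame_len across elements: element i reports the portion
 * of the frame it holds; trailing elements beyond frame_len report a
 * zero length but are still returned to the guest, so no descriptor
 * leaks.  @elem_cap holds each element's buffer capacity.
 */
static void fill_used_lengths(const size_t *elem_cap, size_t cnt,
			      size_t frame_len, size_t *used_len)
{
	size_t i;

	for (i = 0; i < cnt; i++) {
		size_t n = MIN(frame_len, elem_cap[i]);

		used_len[i] = n;
		frame_len -= n;
	}
}
```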
The previous per-protocol padding done by vu_pad() in tcp_vu.c and
udp_vu.c was only correct for single-buffer frames: it assumed the
padding area always fell within the first iov, writing past its end
with a plain memset().
It also required each caller to compute MAX(..., ETH_ZLEN + VNET_HLEN)
for vu_collect() and to call vu_pad() at the right point, duplicating
the minimum-size logic across protocols.
Move the Ethernet minimum size enforcement into vu_collect() itself, so
that enough buffer space is always reserved for padding regardless of
the requested frame size.
Rewrite vu_pad() to take a full iovec array and use iov_memset(),
making it safe for multi-buffer (mergeable rx buffer) frames.
In tcp_vu_sock_recv(), replace iov_truncate() with iov_skip_bytes():
now that all consumers receive explicit data lengths, truncating the
iovecs is no longer needed. In tcp_vu_data_from_sock(), cap each
frame's data length against the remaining bytes actually received from
the socket, so that the last partial frame gets correct headers and
sequence number advancement.
Signed-off-by: Laurent Vivier
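[Not part of the patch: a self-contained sketch of the reworked vu_pad()
behaviour on a frame split across two buffers. iov_memset() is re-sketched
as in patch 1; the ETH_ZLEN and VNET_HLEN values are assumptions for the
example, not taken from the patch.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

#define ETH_ZLEN	60	/* Minimum Ethernet frame length */
#define VNET_HLEN	12	/* Assumed virtio-net header length */
#define MIN(a, b)	((a) < (b) ? (a) : (b))

/* Sketch of iov_memset() (patch 1): fill across iovec boundaries */
static void iov_memset(const struct iovec *iov, size_t iov_cnt,
		       size_t offset, int c, size_t length)
{
	size_t i;

	for (i = 0; i < iov_cnt && length; i++) {
		size_t n;

		if (offset >= iov[i].iov_len) {
			offset -= iov[i].iov_len;
			continue;
		}

		n = MIN(length, iov[i].iov_len - offset);
		memset((char *)iov[i].iov_base + offset, c, n);
		offset = 0;
		length -= n;
	}
}

/* Sketch of the reworked vu_pad(): zero-fill from frame_len up to the
 * minimum frame length, across however many iovec entries that spans.
 */
static void vu_pad(const struct iovec *iov, size_t cnt, size_t frame_len)
{
	size_t min_frame_len = ETH_ZLEN + VNET_HLEN;

	if (frame_len < min_frame_len)
		iov_memset(iov, cnt, frame_len, 0, min_frame_len - frame_len);
}
```

Unlike the old single-buffer memset(), the padding here continues into the
second buffer instead of writing past the end of the first.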
tcp_fill_headers() computed the TCP payload length from iov_tail_size(),
but with vhost-user multibuffer frames, the iov_tail will be larger than
the actual data. Pass the data length explicitly so that IP total
length, pseudo-header, and checksum computations use the correct value.
Signed-off-by: Laurent Vivier
On Thu, Apr 16, 2026 at 05:57:18PM +0200, Laurent Vivier wrote:
With vhost-user multibuffer frames, the iov can be larger than the actual L2 frame. The previous approach of computing L2 length as iov_size() - offset would overcount and write extra bytes into the pcap file.
Pass the L2 frame length explicitly to pcap_frame() and pcap_iov(), and write exactly that many bytes instead of the full iov remainder.
Signed-off-by: Laurent Vivier
Reviewed-by: David Gibson
@@ -500,7 +500,8 @@ static size_t tap_send_frames_passt(const struct ctx *c,
 		/* Number of unsent or partially sent buffers for the frame */
 		size_t rembufs = bufs_per_frame - (i % bufs_per_frame);
 
-		if (write_remainder(c->fd_tap, &iov[i], rembufs, buf_offset) < 0) {
+		if (write_remainder(c->fd_tap, &iov[i], rembufs, buf_offset,
+				    SIZE_MAX) < 0) {
 			err_perror("tap: partial frame send");
 			return i;

[snip]

@@ -722,31 +722,54 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
  * @iov:	IO vector
  * @iovcnt:	Number of entries in @iov
  * @skip:	Number of bytes of the vector to skip writing
+ * @length:	Number of bytes of the vector to write
Nit: This is the maximum number of bytes to write, not the exact number,
which we rely on in tap_send_frames_passt() above. That makes the
comment slightly misleading.

-- 
David Gibson (he or they)		| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au		| minimalist, thank you, not the other way
					| around.
http://www.ozlabs.org/~dgibson
On Thu, Apr 16, 2026 at 05:57:17PM +0200, Laurent Vivier wrote:
The iov_tail passed to csum_iov_tail() may contain padding or trailing data beyond the actual L4 payload. Rather than relying on iov_tail_size() to determine how many bytes to checksum, pass the length explicitly so that only the relevant payload bytes are included in the checksum computation.
Signed-off-by: Laurent Vivier
Reviewed-by: David Gibson
---
 checksum.c | 43 +++++++++++++++++++++++++------------------
 checksum.h |  6 +++---
 tap.c      |  4 ++--
 tcp.c      | 12 +++++++-----
 udp.c      |  5 +++--
 udp_vu.c   | 21 +++++++++------------
 6 files changed, 49 insertions(+), 42 deletions(-)

diff --git a/checksum.c b/checksum.c
index 828f9ecc9c02..7c62e42d6d4c 100644
--- a/checksum.c
+++ b/checksum.c
@@ -182,21 +182,22 @@ static uint16_t csum(const void *buf, size_t len, uint32_t init)
  * @saddr:	IPv4 source address
  * @daddr:	IPv4 destination address
  * @data:	UDP payload (as IO vector tail)
+ * @dlen:	UDP payload length
  */
 void csum_udp4(struct udphdr *udp4hr,
 	       struct in_addr saddr, struct in_addr daddr,
-	       struct iov_tail *data)
+	       struct iov_tail *data, size_t dlen)
 {
 	/* UDP checksums are optional, so don't bother */
 	udp4hr->check = 0;
 
 	if (UDP4_REAL_CHECKSUMS) {
-		uint16_t l4len = iov_tail_size(data) + sizeof(struct udphdr);
-		uint32_t psum = proto_ipv4_header_psum(l4len, IPPROTO_UDP,
-						       saddr, daddr);
+		uint32_t psum = proto_ipv4_header_psum(sizeof(*udp4hr) + dlen,
+						       IPPROTO_UDP, saddr,
+						       daddr);
 
-		psum = csum_unfolded(udp4hr, sizeof(struct udphdr), psum);
-		udp4hr->check = csum_iov_tail(data, psum);
+		psum = csum_unfolded(udp4hr, sizeof(*udp4hr), psum);
+		udp4hr->check = csum_iov_tail(data, psum, dlen);
 	}
 }
 
@@ -245,19 +246,19 @@ uint32_t proto_ipv6_header_psum(uint16_t payload_len, uint8_t protocol,
  * @saddr:	Source address
  * @daddr:	Destination address
  * @data:	UDP payload (as IO vector tail)
+ * @dlen:	UDP payload length
  */
 void csum_udp6(struct udphdr *udp6hr,
 	       const struct in6_addr *saddr, const struct in6_addr *daddr,
-	       struct iov_tail *data)
+	       struct iov_tail *data, size_t dlen)
 {
-	uint16_t l4len = iov_tail_size(data) + sizeof(struct udphdr);
-	uint32_t psum = proto_ipv6_header_psum(l4len, IPPROTO_UDP,
-					       saddr, daddr);
+	uint32_t psum = proto_ipv6_header_psum(dlen + sizeof(*udp6hr),
+					       IPPROTO_UDP, saddr, daddr);
 
 	udp6hr->check = 0;
 
-	psum = csum_unfolded(udp6hr, sizeof(struct udphdr), psum);
-	udp6hr->check = csum_iov_tail(data, psum);
+	psum = csum_unfolded(udp6hr, sizeof(*udp6hr), psum);
+	udp6hr->check = csum_iov_tail(data, psum, dlen);
 }
 
 /**
@@ -604,20 +605,26 @@ uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init)
 /**
  * csum_iov_tail() - Calculate unfolded checksum for the tail of an IO vector
  * @tail:	IO vector tail to checksum
- * @init	Initial 32-bit checksum, 0 for no pre-computed checksum
+ * @init:	Initial 32-bit checksum, 0 for no pre-computed checksum
+ * @len:	Number of bytes to checksum from @tail
  *
  * Return: 16-bit folded, complemented checksum
  */
-uint16_t csum_iov_tail(struct iov_tail *tail, uint32_t init)
+uint16_t csum_iov_tail(struct iov_tail *tail, uint32_t init, size_t len)
 {
 	if (iov_tail_prune(tail)) {
-		size_t i;
+		size_t i, n;
 
+		n = MIN(len, tail->iov[0].iov_len - tail->off);
 		init = csum_unfolded((char *)tail->iov[0].iov_base + tail->off,
-				     tail->iov[0].iov_len - tail->off, init);
-		for (i = 1; i < tail->cnt; i++) {
+				     n, init);
+		len -= n;
+
+		for (i = 1; len && i < tail->cnt; i++) {
 			const struct iovec *iov = &tail->iov[i];
-			init = csum_unfolded(iov->iov_base, iov->iov_len, init);
+
+			n = MIN(len, iov->iov_len);
+			init = csum_unfolded(iov->iov_base, n, init);
+			len -= n;
 		}
 	}
 	return (uint16_t)~csum_fold(init);
diff --git a/checksum.h b/checksum.h
index 4e3b098db072..6270f1457a73 100644
--- a/checksum.h
+++ b/checksum.h
@@ -21,18 +21,18 @@ uint32_t proto_ipv4_header_psum(uint16_t l4len, uint8_t protocol,
 				struct in_addr saddr, struct in_addr daddr);
 void csum_udp4(struct udphdr *udp4hr,
 	       struct in_addr saddr, struct in_addr daddr,
-	       struct iov_tail *data);
+	       struct iov_tail *data, size_t dlen);
 void csum_icmp4(struct icmphdr *icmp4hr, const void *payload, size_t dlen);
 uint32_t proto_ipv6_header_psum(uint16_t payload_len, uint8_t protocol,
 				const struct in6_addr *saddr,
 				const struct in6_addr *daddr);
 void csum_udp6(struct udphdr *udp6hr,
 	       const struct in6_addr *saddr, const struct in6_addr *daddr,
-	       struct iov_tail *data);
+	       struct iov_tail *data, size_t dlen);
 void csum_icmp6(struct icmp6hdr *icmp6hr,
 		const struct in6_addr *saddr, const struct in6_addr *daddr,
 		const void *payload, size_t dlen);
 uint32_t csum_unfolded(const void *buf, size_t len, uint32_t init);
-uint16_t csum_iov_tail(struct iov_tail *tail, uint32_t init);
+uint16_t csum_iov_tail(struct iov_tail *tail, uint32_t init, size_t len);
 
 #endif /* CHECKSUM_H */
diff --git a/tap.c b/tap.c
index 1049e023bcd2..41a61a36c279 100644
--- a/tap.c
+++ b/tap.c
@@ -252,7 +252,7 @@ void *tap_push_uh4(struct udphdr *uh, struct in_addr src, in_port_t sport,
 	uh->source = htons(sport);
 	uh->dest = htons(dport);
 	uh->len = htons(l4len);
-	csum_udp4(uh, src, dst, &payload);
+	csum_udp4(uh, src, dst, &payload, dlen);
 
 	return (char *)uh + sizeof(*uh);
 }
 
@@ -357,7 +357,7 @@ void *tap_push_uh6(struct udphdr *uh,
 	uh->source = htons(sport);
 	uh->dest = htons(dport);
 	uh->len = htons(l4len);
-	csum_udp6(uh, src, dst, &payload);
+	csum_udp6(uh, src, dst, &payload, dlen);
 
 	return (char *)uh + sizeof(*uh);
 }
 
diff --git a/tcp.c b/tcp.c
index 8ea9be84a9f3..5b2a732cf26d 100644
--- a/tcp.c
+++ b/tcp.c
@@ -815,13 +815,14 @@ static void tcp_sock_set_nodelay(int s)
  * @psum:	Unfolded partial checksum of the IPv4 or IPv6 pseudo-header
  * @th:		TCP header (updated)
  * @payload:	TCP payload
+ * @dlen:	TCP payload length
  */
 static void tcp_update_csum(uint32_t psum, struct tcphdr *th,
-			    struct iov_tail *payload)
+			    struct iov_tail *payload, size_t dlen)
 {
 	th->check = 0;
 	psum = csum_unfolded(th, sizeof(*th), psum);
-	th->check = csum_iov_tail(payload, psum);
+	th->check = csum_iov_tail(payload, psum, dlen);
 }
 
 /**
@@ -958,7 +959,8 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
 			bool no_tcp_csum)
 {
 	const struct flowside *tapside = TAPFLOW(conn);
-	size_t l4len = iov_tail_size(payload) + sizeof(*th);
+	size_t dlen = iov_tail_size(payload);
+	size_t l4len = dlen + sizeof(*th);
 	uint8_t *omac = conn->f.tap_omac;
 	size_t l3len = l4len;
 	uint32_t psum = 0;
@@ -1019,7 +1021,7 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
 	if (no_tcp_csum)
 		th->check = 0;
 	else
-		tcp_update_csum(psum, th, payload);
+		tcp_update_csum(psum, th, payload, dlen);
 
 	return MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN);
 }
@@ -2196,7 +2198,7 @@ static void tcp_rst_no_conn(const struct ctx *c, int af,
 		rsth->ack = 1;
 	}
 
-	tcp_update_csum(psum, rsth, &payload);
+	tcp_update_csum(psum, rsth, &payload, 0);
 
 	rst_l2len = ((char *)rsth - buf) + sizeof(*rsth);
 	tap_send_single(c, buf, rst_l2len);
 }
diff --git a/udp.c b/udp.c
index 1fc5a42c5ca7..4eef10854d8a 100644
--- a/udp.c
+++ b/udp.c
@@ -289,7 +289,7 @@ size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
 			.iov_len = dlen
 		};
 		struct iov_tail data = IOV_TAIL(&iov, 1, 0);
 
-		csum_udp4(&bp->uh, *src, *dst, &data);
+		csum_udp4(&bp->uh, *src, *dst, &data, dlen);
 	}
 
 	return l4len;
@@ -334,7 +334,8 @@ size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
 			.iov_len = dlen
 		};
 		struct iov_tail data = IOV_TAIL(&iov, 1, 0);
 
-		csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6, &data);
+		csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6, &data,
+			  dlen);
 	}
 
 	return l4len;
diff --git a/udp_vu.c b/udp_vu.c
index 277bd5688907..1a73d997f683 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -105,14 +105,11 @@ static ssize_t udp_vu_sock_recv(struct iovec *iov, size_t *cnt, int s, bool v6)
  * @iov:	IO vector for the frame (including vnet header)
  * @toside:	Address information for one side of the flow
  * @dlen:	Packet data length
- *
- * Return: Layer-4 length
  */
-static size_t udp_vu_prepare(const struct ctx *c, const struct iovec *iov,
+static void udp_vu_prepare(const struct ctx *c, const struct iovec *iov,
 			   const struct flowside *toside, ssize_t dlen)
 {
 	struct ethhdr *eh;
-	size_t l4len;
 
 	/* ethernet header */
 	eh = vu_eth(iov[0].iov_base);
@@ -129,7 +126,7 @@ static size_t udp_vu_prepare(const struct ctx *c, const struct iovec *iov,
 
 		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_UDP);
 
-		l4len = udp_update_hdr4(iph, bp, toside, dlen, true);
+		udp_update_hdr4(iph, bp, toside, dlen, true);
 	} else {
 		struct ipv6hdr *ip6h = vu_ip(iov[0].iov_base);
 		struct udp_payload_t *bp = vu_payloadv6(iov[0].iov_base);
@@ -138,10 +135,8 @@ static size_t udp_vu_prepare(const struct ctx *c, const struct iovec *iov,
 
 		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_UDP);
 
-		l4len = udp_update_hdr6(ip6h, bp, toside, dlen, true);
+		udp_update_hdr6(ip6h, bp, toside, dlen, true);
 	}
-
-	return l4len;
 }
 
 /**
@@ -149,9 +144,10 @@ static size_t udp_vu_prepare(const struct ctx *c, const struct iovec *iov,
  * @toside:	Address information for one side of the flow
  * @iov:	IO vector for the frame
  * @cnt:	Number of IO vector entries
+ * @dlen:	Data length
  */
 static void udp_vu_csum(const struct flowside *toside, const struct iovec *iov,
-			size_t cnt)
+			size_t cnt, size_t dlen)
 {
 	const struct in_addr *src4 = inany_v4(&toside->oaddr);
 	const struct in_addr *dst4 = inany_v4(&toside->eaddr);
@@ -162,11 +158,12 @@ static void udp_vu_csum(const struct flowside *toside, const struct iovec *iov,
 	if (src4 && dst4) {
 		bp = vu_payloadv4(base);
 		data = IOV_TAIL(iov, cnt, (char *)&bp->data - base);
-		csum_udp4(&bp->uh, *src4, *dst4, &data);
+		csum_udp4(&bp->uh, *src4, *dst4, &data, dlen);
 	} else {
 		bp = vu_payloadv6(base);
 		data = IOV_TAIL(iov, cnt, (char *)&bp->data - base);
-		csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6, &data);
+		csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6, &data,
+			  dlen);
 	}
 }
 
@@ -229,7 +226,7 @@ void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx)
 		if (iov_cnt > 0) {
 			udp_vu_prepare(c, iov_vu, toside, dlen);
 			if (*c->pcap) {
-				udp_vu_csum(toside, iov_vu, iov_cnt);
+				udp_vu_csum(toside, iov_vu, iov_cnt, dlen);
 				pcap_iov(iov_vu, iov_cnt, VNET_HLEN);
 			}
 			vu_flush(vdev, vq, elem, elem_used);
-- 
2.53.0
On Thu, Apr 16, 2026 at 05:57:21PM +0200, Laurent Vivier wrote:
The previous per-protocol padding done by vu_pad() in tcp_vu.c and udp_vu.c was only correct for single-buffer frames: it assumed the padding area always fell within the first iov, writing past its end with a plain memset().
It also required each caller to compute MAX(..., ETH_ZLEN + VNET_HLEN) for vu_collect() and to call vu_pad() at the right point, duplicating the minimum-size logic across protocols.
Move the Ethernet minimum size enforcement into vu_collect() itself, so that enough buffer space is always reserved for padding regardless of the requested frame size.
Rewrite vu_pad() to take a full iovec array and use iov_memset(), making it safe for multi-buffer (mergeable rx buffer) frames.
In tcp_vu_sock_recv(), replace iov_truncate() with iov_skip_bytes(): now that all consumers receive explicit data lengths, truncating the iovecs is no longer needed. In tcp_vu_data_from_sock(), cap each frame's data length against the remaining bytes actually received from the socket, so that the last partial frame gets correct headers and sequence number advancement.
Signed-off-by: Laurent Vivier
LGTM, except for what looks like one minor bug.

[snip]
index 704e908aa02c..d07f584f228a 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -74,6 +74,7 @@ int vu_collect(const struct vu_dev *vdev, struct vu_virtq *vq,
 	size_t current_iov = 0;
 	int elem_cnt = 0;
 
+	size = MAX(size, ETH_ZLEN /* Ethernet minimum size */ + VNET_HLEN);
This seems to imply size should include the vnet header...
 	while (current_size < size && elem_cnt < max_elem &&
 	       current_iov < max_in_sg) {
 		int ret;
@@ -261,29 +262,27 @@ int vu_send_single(const struct ctx *c, const void *buf, size_t size)
 		return -1;
 	}
 
-	size += VNET_HLEN;
 	elem_cnt = vu_collect(vdev, vq, elem, ARRAY_SIZE(elem), in_sg,
-			      ARRAY_SIZE(in_sg), &in_total, size, &total);
-	if (elem_cnt == 0 || total < size) {
+			      ARRAY_SIZE(in_sg), &in_total, VNET_HLEN + size, &total);
...but this seems to imply it doesn't.
+	if (elem_cnt == 0 || total < VNET_HLEN + size) {
 		debug("vu_send_single: no space to send the data "
 		      "elem_cnt %d size %zu", elem_cnt, total);
 		goto err;
 	}
 
-	total -= VNET_HLEN;
-
 	/* copy data from the buffer to the iovec */
-	iov_from_buf(in_sg, in_total, VNET_HLEN, buf, total);
+	iov_from_buf(in_sg, in_total, VNET_HLEN, buf, size);
 
 	if (*c->pcap)
 		pcap_iov(in_sg, in_total, VNET_HLEN, size);
 
+	vu_pad(in_sg, in_total, VNET_HLEN + size);
As does this.
vu_flush(vdev, vq, elem, elem_cnt, VNET_HLEN + size);
And this.
vu_queue_notify(vdev, vq);
-	trace("vhost-user sent %zu", total);
+	trace("vhost-user sent %zu", size);
 
-	return total;
+	return size;
 err:
 	for (i = 0; i < elem_cnt; i++)
 		vu_queue_detach_element(vq);
@@ -292,15 +291,15 @@ err:
 }
 /**
- * vu_pad() - Pad 802.3 frame to minimum length (60 bytes) if needed
- * @iov:	Buffer in iovec array where end of 802.3 frame is stored
- * @l2len:	Layer-2 length already filled in frame
+ * vu_pad() - Pad short frames to minimum Ethernet length and truncate iovec
+ * @iov:	Pointer to iovec array
+ * @cnt:	Number of entries in @iov
+ * @frame_len:	Data length in @iov (including virtio-net header)
  */
-void vu_pad(struct iovec *iov, size_t l2len)
+void vu_pad(const struct iovec *iov, size_t cnt, size_t frame_len)
 {
-	if (l2len >= ETH_ZLEN)
-		return;
+	size_t min_frame_len = ETH_ZLEN + VNET_HLEN;
 
-	memset((char *)iov->iov_base + iov->iov_len, 0, ETH_ZLEN - l2len);
-	iov->iov_len += ETH_ZLEN - l2len;
+	if (frame_len < min_frame_len)
+		iov_memset(iov, cnt, frame_len, 0, min_frame_len - frame_len);
 }
diff --git a/vu_common.h b/vu_common.h
index 77d1849e6115..51f70084a7cb 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -44,6 +44,6 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 		const struct timespec *now);
 int vu_send_single(const struct ctx *c, const void *buf, size_t size);
-void vu_pad(struct iovec *iov, size_t l2len);
+void vu_pad(const struct iovec *iov, size_t cnt, size_t frame_len);
 
 #endif /* VU_COMMON_H */
-- 
2.53.0
On 2026-04-16 11:57, Laurent Vivier wrote:
Add a helper to copy data from a source iovec array to a destination iovec array, each starting at an arbitrary byte offset, iterating through both arrays simultaneously and copying in chunks matching the smaller of the two current segments.
Signed-off-by: Laurent Vivier
Reviewed-by: David Gibson
Reviewed-by: Jon Maloy
---
 iov.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 iov.h |  3 +++
 2 files changed, 55 insertions(+)

diff --git a/iov.c b/iov.c
index 0188acdf5eba..dabc4f1ceea3 100644
--- a/iov.c
+++ b/iov.c
@@ -197,6 +197,58 @@ void iov_memset(const struct iovec *iov, size_t iov_cnt, size_t offset, int c,
 	}
 }
 
+/**
+ * iov_memcpy() - Copy data between two iovec arrays
+ * @dst_iov:		Destination iovec array
+ * @dst_iov_cnt:	Number of elements in destination iovec array
+ * @dst_offset:		Destination offset
+ * @src_iov:		Source iovec array
+ * @src_iov_cnt:	Number of elements in source iovec array
+ * @offs:		Source offset
+ * @length:		Number of bytes to copy
+ *
+ * Return: total number of bytes copied
+ */
+/* cppcheck-suppress unusedFunction */
+size_t iov_memcpy(struct iovec *dst_iov, size_t dst_iov_cnt, size_t dst_offset,
+		  const struct iovec *src_iov, size_t src_iov_cnt,
+		  size_t src_offset, size_t length)
+{
+	unsigned int i, j;
These could be size_t for consistency with the in parameters. /jon
+	size_t total = 0;
+
+	i = iov_skip_bytes(src_iov, src_iov_cnt, src_offset, &src_offset);
+	j = iov_skip_bytes(dst_iov, dst_iov_cnt, dst_offset, &dst_offset);
+
+	/* copying data */
+	while (length && i < src_iov_cnt && j < dst_iov_cnt) {
+		size_t n = MIN(dst_iov[j].iov_len - dst_offset,
+			       src_iov[i].iov_len - src_offset);
+
+		if (n > length)
+			n = length;
+
+		memcpy((char *)dst_iov[j].iov_base + dst_offset,
+		       (const char *)src_iov[i].iov_base + src_offset, n);
+
+		dst_offset += n;
+		src_offset += n;
+		total += n;
+		length -= n;
+
+		if (dst_offset == dst_iov[j].iov_len) {
+			dst_offset = 0;
+			j++;
+		}
+		if (src_offset == src_iov[i].iov_len) {
+			src_offset = 0;
+			i++;
+		}
+	}
+
+	return total;
+}
+
 /**
  * iov_tail_prune() - Remove any unneeded buffers from an IOV tail
  * @tail:	IO vector tail (modified)
diff --git a/iov.h b/iov.h
index d295d05b3bab..3c63308e554f 100644
--- a/iov.h
+++ b/iov.h
@@ -32,6 +32,9 @@ size_t iov_size(const struct iovec *iov, size_t iov_cnt);
 size_t iov_truncate(struct iovec *iov, size_t iov_cnt, size_t size);
 void iov_memset(const struct iovec *iov, size_t iov_cnt, size_t offset, int c,
 		size_t length);
+size_t iov_memcpy(struct iovec *dst_iov, size_t dst_iov_cnt, size_t dst_offset,
+		  const struct iovec *src_iov, size_t src_iov_cnt,
+		  size_t src_offset, size_t length);
 
 /*
  * DOC: Theory of Operation, struct iov_tail
On 2026-04-16 11:57, Laurent Vivier wrote:
udp_vu_sock_recv() currently mixes two concerns: receiving data from the socket and managing virtqueue buffers (collecting, rewinding, releasing). This makes the function harder to reason about and couples socket I/O with virtqueue state.
Move all virtqueue operations, vu_collect(), vu_init_elem(), vu_queue_rewind(), vu_set_vnethdr(), and the queue-readiness check, into udp_vu_sock_to_tap(), which is the only caller. This turns udp_vu_sock_recv() into a pure socket receive function that simply reads into the provided iov array and adjusts its length.
Signed-off-by: Laurent Vivier
Reviewed-by: David Gibson
Reviewed-by: Jon Maloy
---
 udp_vu.c | 98 +++++++++++++++++++++++++++++---------------------------
 1 file changed, 50 insertions(+), 48 deletions(-)

diff --git a/udp_vu.c b/udp_vu.c
index f8629af58ab5..bd9fd5abb971 100644
--- a/udp_vu.c

[...]

+
+	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
+		struct msghdr msg = { 0 };
+
+		debug("Got UDP packet, but RX virtqueue not usable yet");
+
+		for (i = 0; i < n; i++) {
+			if (recvmsg(s, &msg, MSG_DONTWAIT) < 0)
+				debug_perror("Failed to discard datagram");
+		}
+
+		return;
+	}
+
 	for (i = 0; i < n; i++) {
+		unsigned elem_cnt, elem_used;
unsigned int is standard in our code, I think. /jon
+		size_t iov_cnt;
 		ssize_t dlen;
-		int iov_used;
 
-		iov_used = udp_vu_sock_recv(c, vq, s, v6, &dlen);
-		if (iov_used < 0)
+		elem_cnt = vu_collect(vdev, vq, elem, ARRAY_SIZE(elem),
+				      iov_vu, ARRAY_SIZE(iov_vu), &iov_cnt,
+				      IP_MAX_MTU + ETH_HLEN + VNET_HLEN, NULL);
+		if (elem_cnt == 0)
+			break;
+
+		assert((size_t)elem_cnt == iov_cnt); /* one iovec per element */
+
+		dlen = udp_vu_sock_recv(s, v6, &iov_cnt);
+		if (dlen < 0) {
+			vu_queue_rewind(vq, iov_cnt);
 			break;
+		}
+
+		elem_used = iov_cnt; /* one iovec per element */
+
+		/* release unused buffers */
+		vu_queue_rewind(vq, elem_cnt - elem_used);
 
-		if (iov_used > 0) {
+		if (iov_cnt > 0) {
 			udp_vu_prepare(c, toside, dlen);
 			if (*c->pcap) {
-				udp_vu_csum(toside, iov_used);
-				pcap_iov(iov_vu, iov_used, VNET_HLEN);
+				udp_vu_csum(toside, iov_cnt);
+				pcap_iov(iov_vu, iov_cnt, VNET_HLEN);
 			}
-			vu_flush(vdev, vq, elem, iov_used);
+			vu_flush(vdev, vq, elem, iov_cnt);
 			vu_queue_notify(vdev, vq);
 		}
 	}
On 2026-04-16 11:57, Laurent Vivier wrote:
udp_vu_sock_recv(), udp_vu_prepare(), and udp_vu_csum() all operated on the file-scoped iov_vu[] array directly. Pass iov and count as explicit parameters instead, and move iov_vu[] and elem[] to function-local statics in udp_vu_sock_to_tap(), the only function that needs them.
Signed-off-by: Laurent Vivier
Reviewed-by: David Gibson
Reviewed-by: Jon Maloy
---
 udp_vu.c | 67 +++++++++++++++++++++++++++++---------------------------
 1 file changed, 35 insertions(+), 32 deletions(-)

[...] oside, int iov_used)
+static void udp_vu_csum(const struct flowside *toside, const struct iovec *iov,
+			size_t cnt)
 {
 	const struct in_addr *src4 = inany_v4(&toside->oaddr);
 	const struct in_addr *dst4 = inany_v4(&toside->eaddr);
-	char *base = iov_vu[0].iov_base;
+	char *base = iov[0].iov_base;
 	struct udp_payload_t *bp;
 	struct iov_tail data;
 
 	if (src4 && dst4) {
 		bp = vu_payloadv4(base);
-		data = IOV_TAIL(iov_vu, iov_used, (char *)&bp->data - base);
+		data = IOV_TAIL(iov, cnt, (char *)&bp->data - base);
 		csum_udp4(&bp->uh, *src4, *dst4, &data);
 	} else {
 		bp = vu_payloadv6(base);
-		data = IOV_TAIL(iov_vu, iov_used, (char *)&bp->data - base);
+		data = IOV_TAIL(iov, cnt, (char *)&bp->data - base);
 		csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6, &data);
 	}
 }
@@ -179,7 +180,9 @@ static void udp_vu_csum(const struct flowside *toside, int iov_used)
 void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx)
 {
 	const struct flowside *toside = flowside_at_sidx(tosidx);
+	static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
I think this declaration should move one line down. /jon
 	bool v6 = !(inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr));
+	static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE];
 	struct vu_dev *vdev = c->vdev;
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
 	int i;
@@ -212,9 +215,9 @@ void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx)
 
 		assert((size_t)elem_cnt == iov_cnt); /* one iovec per element */
 
-		dlen = udp_vu_sock_recv(s, v6, &iov_cnt);
+		dlen = udp_vu_sock_recv(iov_vu, &iov_cnt, s, v6);
 		if (dlen < 0) {
-			vu_queue_rewind(vq, iov_cnt);
+			vu_queue_rewind(vq, elem_cnt);
 			break;
 		}
 
@@ -224,12 +227,12 @@ void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx)
 		vu_queue_rewind(vq, elem_cnt - elem_used);
 
 		if (iov_cnt > 0) {
-			udp_vu_prepare(c, toside, dlen);
+			udp_vu_prepare(c, iov_vu, toside, dlen);
 			if (*c->pcap) {
-				udp_vu_csum(toside, iov_cnt);
+				udp_vu_csum(toside, iov_vu, iov_cnt);
 				pcap_iov(iov_vu, iov_cnt, VNET_HLEN);
 			}
-			vu_flush(vdev, vq, elem, iov_cnt);
+			vu_flush(vdev, vq, elem, elem_used);
 			vu_queue_notify(vdev, vq);
 		}
 	}
On 2026-04-16 11:57, Laurent Vivier wrote:
With vhost-user multibuffer frames, the iov can be larger than the actual L2 frame. The previous approach of computing L2 length as iov_size() - offset would overcount and write extra bytes into the pcap file.
Pass the L2 frame length explicitly to pcap_frame() and pcap_iov(), and write exactly that many bytes instead of the full iov remainder.
Signed-off-by: Laurent Vivier
Reviewed-by: Jon Maloy
---
 pcap.c   | 28 +++++++++++++++++++---------
 pcap.h   |  2 +-
 tap.c    |  6 ++++--
 tcp_vu.c | 14 ++++++++------
[...]
@@ -109,8 +115,11 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
 	if (clock_gettime(CLOCK_REALTIME, &now))
 		err_perror("Failed to get CLOCK_REALTIME time");
-	for (i = 0; i < n; i++)
-		pcap_frame(iov + i * frame_parts, frame_parts, offset, &now);
+	for (i = 0; i < n; i++) {
+		pcap_frame(iov + i * frame_parts, frame_parts, offset,
+			   iov_size(iov + i * frame_parts, frame_parts) - offset,
+			   &now);
+	}
 }
 /**
@@ -120,8 +129,9 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
  *		containing packet data to write, including L2 header
  * @iovcnt:	Number of buffers (@iov entries)
  * @offset:	Offset of the L2 frame within the full data length
+ * @l2len:	Length of L2 frame data to capture
  */
-void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset)
+void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset, size_t l2len)
This line should be wrapped. /jon
 {
 	struct timespec now = { 0 };
@@ -131,7 +141,7 @@ void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset)
 	if (clock_gettime(CLOCK_REALTIME, &now))
 		err_perror("Failed to get CLOCK_REALTIME time");
-	pcap_frame(iov, iovcnt, offset, &now);
+	pcap_frame(iov, iovcnt, offset, l2len, &now);
 }
 /**

diff --git a/pcap.h b/pcap.h
index dface5df4ee6..c171257cbd73 100644
--- a/pcap.h
+++ b/pcap.h
@@ -13,7 +13,7 @@ extern int pcap_fd;
 void pcap(const char *pkt, size_t l2len);
 void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
 		   size_t offset);
-void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset);
+void pcap_iov(const struct iovec *iov, size_t iovcnt, size_t offset, size_t l2len);
 void pcap_init(struct ctx *c);
 #endif /* PCAP_H */

diff --git a/tap.c b/tap.c
index 41a61a36c279..41ba2b8666a5 100644
--- a/tap.c
+++ b/tap.c
@@ -500,7 +500,8 @@ static size_t tap_send_frames_passt(const struct ctx *c,
 		/* Number of unsent or partially sent buffers for the frame */
 		size_t rembufs = bufs_per_frame - (i % bufs_per_frame);
-		if (write_remainder(c->fd_tap, &iov[i], rembufs, buf_offset) < 0) {
+		if (write_remainder(c->fd_tap, &iov[i], rembufs, buf_offset,
+				    SIZE_MAX) < 0) {
 			err_perror("tap: partial frame send");
 			return i;
 		}
@@ -1102,10 +1103,11 @@ void tap_handler(struct ctx *c, const struct timespec *now)
 void tap_add_packet(struct ctx *c, struct iov_tail *data,
 		    const struct timespec *now)
 {
+	size_t l2len = iov_tail_size(data);
 	struct ethhdr eh_storage;
 	const struct ethhdr *eh;
-	pcap_iov(data->iov, data->cnt, data->off);
+	pcap_iov(data->iov, data->cnt, data->off, l2len);
 	eh = IOV_PEEK_HEADER(data, eh_storage);
 	if (!eh)

diff --git a/tcp_vu.c b/tcp_vu.c
index 0cd01190d612..7b7ea9c789b1 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -128,7 +128,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 		return ret;
 	}
-	iov_truncate(&flags_iov[0], 1, hdrlen + optlen);
+	l2len = hdrlen + optlen - VNET_HLEN;
+	iov_truncate(&flags_iov[0], 1, l2len + VNET_HLEN);
 	payload = IOV_TAIL(flags_elem[0].in_sg, 1, hdrlen);
 	if (flags & KEEPALIVE)
@@ -137,13 +138,12 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	tcp_fill_headers(c, conn, eh, ip4h, ip6h, th, &payload, NULL, seq,
 			 !*c->pcap);
-	l2len = optlen + hdrlen - VNET_HLEN;
 	vu_pad(&flags_elem[0].in_sg[0], l2len);
vu_flush(vdev, vq, flags_elem, 1);
 	if (*c->pcap)
-		pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN);
+		pcap_iov(&flags_elem[0].in_sg[0], 1, VNET_HLEN, l2len);
 	if (flags & DUP_ACK) {
 		elem_cnt = vu_collect(vdev, vq, &flags_elem[1], 1,
@@ -158,8 +158,10 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
vu_flush(vdev, vq, &flags_elem[1], 1);
-		if (*c->pcap)
-			pcap_iov(&flags_elem[1].in_sg[0], 1, VNET_HLEN);
+		if (*c->pcap) {
+			pcap_iov(&flags_elem[1].in_sg[0], 1, VNET_HLEN,
+				 l2len);
+		}
 		}
 	}
 	vu_queue_notify(vdev, vq);
@@ -464,7 +466,7 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
 		vu_flush(vdev, vq, &elem[head[i]], buf_cnt);
 		if (*c->pcap)
-			pcap_iov(iov, buf_cnt, VNET_HLEN);
+			pcap_iov(iov, buf_cnt, VNET_HLEN, l2len);
 		conn->seq_to_tap += dlen;
 	}

diff --git a/udp_vu.c b/udp_vu.c
index 1a73d997f683..76242d423778 100644
--- a/udp_vu.c
+++ b/udp_vu.c
@@ -182,6 +182,7 @@ void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx)
 	static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE];
 	struct vu_dev *vdev = c->vdev;
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	size_t hdrlen = udp_vu_hdrlen(v6);
 	int i;
 	assert(!c->no_udp);
@@ -227,7 +228,8 @@ void udp_vu_sock_to_tap(const struct ctx *c, int s, int n, flow_sidx_t tosidx)
 			udp_vu_prepare(c, iov_vu, toside, dlen);
 			if (*c->pcap) {
 				udp_vu_csum(toside, iov_vu, iov_cnt, dlen);
-				pcap_iov(iov_vu, iov_cnt, VNET_HLEN);
+				pcap_iov(iov_vu, iov_cnt, VNET_HLEN,
+					 hdrlen + dlen - VNET_HLEN);
 			}
 			vu_flush(vdev, vq, elem, elem_used);
 			vu_queue_notify(vdev, vq);

diff --git a/util.c b/util.c
index 73c9d51d7b4a..141aad275869 100644
--- a/util.c
+++ b/util.c
@@ -722,31 +722,54 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
  * @iov:	IO vector
  * @iovcnt:	Number of entries in @iov
  * @skip:	Number of bytes of the vector to skip writing
+ * @length:	Number of bytes of the vector to write
  *
  * Return: 0 on success, -1 on error (with errno set)
  *
  * #syscalls writev
  */
-int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip)
+int write_remainder(int fd, const struct iovec *iov, size_t iovcnt,
+		    size_t skip, size_t length)
 {
 	size_t i = 0, offset;
-	while ((i += iov_skip_bytes(iov + i, iovcnt - i, skip, &offset)) < iovcnt) {
+	while (length &&
+	       (i += iov_skip_bytes(iov + i, iovcnt - i, skip, &offset)) < iovcnt) {
 		ssize_t rc;
+		size_t end;
 		if (offset) {
+			size_t len = MIN(length, iov[i].iov_len - offset);
+
 			/* Write the remainder of the partially written buffer */
 			if (write_all_buf(fd, (char *)iov[i].iov_base + offset,
-					  iov[i].iov_len - offset) < 0)
+					  len) < 0)
 				return -1;
+
+			length -= len;
 			i++;
+
+			if (!length || i >= iovcnt)
+				break;
+		}
+
+		end = iov_skip_bytes(iov + i, iovcnt - i, length, NULL);
+
+		/* Write a trailing partial buffer */
+		if (!end) {
+			size_t len = MIN(length, iov[i].iov_len);
+
+			if (write_all_buf(fd, iov[i].iov_base, len) < 0)
+				return -1;
+			break;
 		}
 		/* Write as much of the remaining whole buffers as we can */
-		rc = writev(fd, &iov[i], iovcnt - i);
+		rc = writev(fd, &iov[i], end);
 		if (rc < 0)
 			return -1;
+		length -= rc;
 		skip = rc;
 	}
 	return 0;
 }

diff --git a/util.h b/util.h
index 92aeabc86b52..888093277896 100644
--- a/util.h
+++ b/util.h
@@ -235,7 +235,8 @@ int fls(unsigned long x);
 int ilog2(unsigned long x);
 int write_file(const char *path, const char *buf);
 intmax_t read_file_integer(const char *path, intmax_t fallback);
-int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
+int write_remainder(int fd, const struct iovec *iov, size_t iovcnt,
+		    size_t skip, size_t length);
 int read_remainder(int fd, const struct iovec *iov, size_t cnt, size_t skip);
 void close_open_files(int argc, char **argv);
 bool snprintf_check(char *str, size_t size, const char *format, ...);

diff --git a/vu_common.c b/vu_common.c
index 57949ca32309..f254cb67ec78 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -268,7 +268,7 @@ int vu_send_single(const struct ctx *c, const void *buf, size_t size)
 	iov_from_buf(in_sg, in_total, VNET_HLEN, buf, total);
 	if (*c->pcap)
-		pcap_iov(in_sg, in_total, VNET_HLEN);
+		pcap_iov(in_sg, in_total, VNET_HLEN, size);
 	vu_flush(vdev, vq, elem, elem_cnt);
 	vu_queue_notify(vdev, vq);
Nice series. I responded directly on the patches where I saw some very
minor issues.
For the rest:
Reviewed-by: Jon Maloy
On Mon, May 11, 2026 at 05:30:33AM -0400, Jon Maloy wrote:
On 2026-04-16 11:57, Laurent Vivier wrote:
udp_vu_sock_recv() currently mixes two concerns: receiving data from the socket and managing virtqueue buffers (collecting, rewinding, releasing). This makes the function harder to reason about and couples socket I/O with virtqueue state.
Move all virtqueue operations, vu_collect(), vu_init_elem(), vu_queue_rewind(), vu_set_vnethdr(), and the queue-readiness check, into udp_vu_sock_to_tap(), which is the only caller. This turns udp_vu_sock_recv() into a pure socket receive function that simply reads into the provided iov array and adjusts its length.
Signed-off-by: Laurent Vivier
Reviewed-by: David Gibson
Reviewed-by: Jon Maloy
see below.
---
 udp_vu.c | 98 +++++++++++++++++++++++++++++---------------------------
 1 file changed, 50 insertions(+), 48 deletions(-)
diff --git a/udp_vu.c b/udp_vu.c
index f8629af58ab5..bd9fd5abb971 100644
--- a/udp_vu.c
[...]
+
+	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
+		struct msghdr msg = { 0 };
+
+		debug("Got UDP packet, but RX virtqueue not usable yet");
+
+		for (i = 0; i < n; i++) {
+			if (recvmsg(s, &msg, MSG_DONTWAIT) < 0)
+				debug_perror("Failed to discard datagram");
+		}
+
+		return;
+	}
+
+	for (i = 0; i < n; i++) {
+		unsigned elem_cnt, elem_used;
unsigned int is standard in our code, I think.
I think Laurent has usually used 'unsigned int', but I've usually used
'unsigned'.

--
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
On 5/11/26 04:01, David Gibson wrote:
On Thu, Apr 16, 2026 at 05:57:21PM +0200, Laurent Vivier wrote:
The previous per-protocol padding done by vu_pad() in tcp_vu.c and udp_vu.c was only correct for single-buffer frames: it assumed the padding area always fell within the first iov, writing past its end with a plain memset().
It also required each caller to compute MAX(..., ETH_ZLEN + VNET_HLEN) for vu_collect() and to call vu_pad() at the right point, duplicating the minimum-size logic across protocols.
Move the Ethernet minimum size enforcement into vu_collect() itself, so that enough buffer space is always reserved for padding regardless of the requested frame size.
Rewrite vu_pad() to take a full iovec array and use iov_memset(), making it safe for multi-buffer (mergeable rx buffer) frames.
In tcp_vu_sock_recv(), replace iov_truncate() with iov_skip_bytes(): now that all consumers receive explicit data lengths, truncating the iovecs is no longer needed. In tcp_vu_data_from_sock(), cap each frame's data length against the remaining bytes actually received from the socket, so that the last partial frame gets correct headers and sequence number advancement.
Signed-off-by: Laurent Vivier
LGTM, except for what looks like one minor bug.
[snip]
index 704e908aa02c..d07f584f228a 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -74,6 +74,7 @@ int vu_collect(const struct vu_dev *vdev, struct vu_virtq *vq,
 	size_t current_iov = 0;
 	int elem_cnt = 0;
+ size = MAX(size, ETH_ZLEN /* Ethernet minimum size */ + VNET_HLEN);
This seems to imply size should include the vnet header...
size is the max of the "size" provided below by vu_send_single() to vu_collect() (which, as you noted, includes the vnet header) and the minimum frame size (Ethernet minimum + vnet header).
 	while (current_size < size && elem_cnt < max_elem &&
 	       current_iov < max_in_sg) {
 		int ret;

@@ -261,29 +262,27 @@ int vu_send_single(const struct ctx *c, const void *buf, size_t size)
 		return -1;
 	}
-	size += VNET_HLEN;
 	elem_cnt = vu_collect(vdev, vq, elem, ARRAY_SIZE(elem), in_sg,
-			      ARRAY_SIZE(in_sg), &in_total, size, &total);
-	if (elem_cnt == 0 || total < size) {
+			      ARRAY_SIZE(in_sg), &in_total, VNET_HLEN + size, &total);
...but this seems to imply it doesn't.
This is not the same "size". Here "size" is without vnet header, but we need to provide a size with vnet header to vu_collect().
+	if (elem_cnt == 0 || total < VNET_HLEN + size) {
 		debug("vu_send_single: no space to send the data "
 		      "elem_cnt %d size %zu", elem_cnt, total);
 		goto err;
 	}
-	total -= VNET_HLEN;
-
 	/* copy data from the buffer to the iovec */
-	iov_from_buf(in_sg, in_total, VNET_HLEN, buf, total);
+	iov_from_buf(in_sg, in_total, VNET_HLEN, buf, size);
 	if (*c->pcap)
 		pcap_iov(in_sg, in_total, VNET_HLEN, size);
+ vu_pad(in_sg, in_total, VNET_HLEN + size);
As does this.
Same here (see vu_pad() comment header)
vu_flush(vdev, vq, elem, elem_cnt, VNET_HLEN + size);
And this.
See vu_flush() comment header
vu_queue_notify(vdev, vq);
-	trace("vhost-user sent %zu", total);
+	trace("vhost-user sent %zu", size);
-	return total;
+	return size;
 err:
 	for (i = 0; i < elem_cnt; i++)
 		vu_queue_detach_element(vq);
@@ -292,15 +291,15 @@ err:
 }
 /**
- * vu_pad() - Pad 802.3 frame to minimum length (60 bytes) if needed
- * @iov:	Buffer in iovec array where end of 802.3 frame is stored
- * @l2len:	Layer-2 length already filled in frame
+ * vu_pad() - Pad short frames to minimum Ethernet length and truncate iovec
+ * @iov:	Pointer to iovec array
+ * @cnt:	Number of entries in @iov
+ * @frame_len:	Data length in @iov (including virtio-net header)
  */
-void vu_pad(struct iovec *iov, size_t l2len)
+void vu_pad(const struct iovec *iov, size_t cnt, size_t frame_len)
 {
-	if (l2len >= ETH_ZLEN)
-		return;
+	size_t min_frame_len = ETH_ZLEN + VNET_HLEN;
-	memset((char *)iov->iov_base + iov->iov_len, 0, ETH_ZLEN - l2len);
-	iov->iov_len += ETH_ZLEN - l2len;
+	if (frame_len < min_frame_len)
+		iov_memset(iov, cnt, frame_len, 0, min_frame_len - frame_len);
 }

diff --git a/vu_common.h b/vu_common.h
index 77d1849e6115..51f70084a7cb 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -44,6 +44,6 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 		const struct timespec *now);
 int vu_send_single(const struct ctx *c, const void *buf, size_t size);
-void vu_pad(struct iovec *iov, size_t l2len);
+void vu_pad(const struct iovec *iov, size_t cnt, size_t frame_len);
 #endif /* VU_COMMON_H */
--
2.53.0
Thanks, Laurent