[PATCH v6 0/4] vhost-user,tcp: Handle multiple iovec entries per virtqueue element
This is the TCP counterpart to the UDP multi-iov series. It converts the
TCP vhost-user receive path from direct pointer arithmetic (via vu_eth(),
vu_ip(), etc.) to the iov_tail abstraction, removing the assumption that
all headers reside in a single contiguous buffer.

With this series applied, the TCP path correctly handles virtio-net
drivers that provide multiple buffers per virtqueue element (e.g. iPXE
provides the vnet header in the first buffer and the frame payload in a
second one), matching the support already present in the UDP path.

Based-on: 20260416160926.3822963-1-lvivier@redhat.com

v6:
  - Rebase on v8 of UDP series (tcp_update_csum() takes dlen rather
    than l4len)

v5:
  - Use l2len variable for pcap_iov() length in tcp_vu_send_flag()
  - Add braces
  - Move pcap_iov() before vu_flush()
  - Remove vu_flush() from tcp_vu_send_dup(), let the caller handle it

v4:
  - fix error during rebase, s/vu_pad_len/vu_pad/

v3:
  - Rebased on top of
    [PATCH 00/10] vhost-user: Preparatory series for multiple iovec entries per virtqueue element

v2:
  - add "tcp: Encode checksum computation flags in a single parameter"
  - remove IOV_PUT_HEADER()/with_header() and use IOV_PUSH_HEADER()
  - don't use the iov_tail to provide the headers to the functions

Laurent Vivier (4):
  tcp: Encode checksum computation flags in a single parameter
  tcp_vu: Build headers on the stack and write them into the iovec
  tcp_vu: Support multibuffer frames in tcp_vu_sock_recv()
  tcp_vu: Support multibuffer frames in tcp_vu_send_flag()

 iov.c          |   1 -
 tcp.c          |  25 +--
 tcp_buf.c      |  23 +--
 tcp_internal.h |   7 +-
 tcp_vu.c       | 403 +++++++++++++++++++++++++++++--------------------
 vu_common.h    |  20 ---
 6 files changed, 270 insertions(+), 209 deletions(-)

-- 
2.53.0
tcp_vu_prepare() currently assumes the first iovec element provided by
the guest is large enough to hold all L2-L4 headers, and builds them
in place via pointer casts into iov[0].iov_base. This assumption is
enforced by an assert().
Since the headers in the buffer are uninitialized anyway, we can just
as well build the Ethernet, IP, and TCP headers on the stack instead,
and write them into the iovec with IOV_PUSH_HEADER(). This mirrors the
approach already used in udp_vu_prepare(), and prepares for support of
elements with multiple iovecs.
Signed-off-by: Laurent Vivier
tcp_fill_headers() takes a pointer to a previously computed IPv4 header
checksum to avoid recalculating it when the payload length doesn't
change, and a separate bool to skip TCP checksum computation.
Replace both parameters with a single uint32_t csum_flags that encodes:
- IP4_CSUM (bit 31): compute IPv4 header checksum from scratch
- TCP_CSUM (bit 30): compute TCP checksum
- IP4_CMASK (low 16 bits): cached IPv4 header checksum value
When IP4_CSUM is not set, the cached checksum is extracted from the low
16 bits. This is cleaner than the pointer-based approach, and also
avoids a potential dangling pointer issue: a subsequent patch makes
tcp_fill_headers() access ip4h via with_header(), which scopes it to a
temporary variable, so a pointer to ip4h->check would become invalid
after the with_header() block.
Suggested-by: David Gibson
Previously, tcp_vu_sock_recv() assumed a 1:1 mapping between virtqueue
elements and iovecs (one iovec per element), enforced by an ASSERT.
This prevented the use of virtqueue elements with multiple buffers
(e.g. when mergeable rx buffers are not negotiated and headers are
provided in a separate buffer).
Introduce a struct vu_frame to track per-frame metadata: the range of
elements and iovecs that make up each frame, and the frame's total size.
This replaces the head[] array which only tracked element indices.
A separate iov_msg[] array is built for recvmsg() by cloning the data
portions (after stripping headers) using iov_tail helpers.
Frame truncation after recvmsg() then properly walks the frame and
element arrays to adjust the iovec and element counts.
Signed-off-by: Laurent Vivier
Build the Ethernet, IP, and TCP headers on the stack instead of
directly in the buffer via pointer casts, then write them into the
iovec with IOV_PUSH_HEADER(). This mirrors the approach already used
in tcp_vu_prepare() and udp_vu_prepare().
Remove the vu_eth(), vu_ip(), vu_payloadv4() and vu_payloadv6() helpers
from vu_common.h, as they are no longer used anywhere.
Introduce tcp_vu_send_dup() to handle DUP_ACK duplication using
vu_collect() and iov_memcopy() instead of a plain memcpy(), so that
the duplicated frame is also properly scattered across multiple iovecs.
Signed-off-by: Laurent Vivier
On 4/16/26 18:16, Laurent Vivier wrote:
Previously, tcp_vu_sock_recv() assumed a 1:1 mapping between virtqueue elements and iovecs (one iovec per element), enforced by an ASSERT. This prevented the use of virtqueue elements with multiple buffers (e.g. when mergeable rx buffers are not negotiated and headers are provided in a separate buffer).
Introduce a struct vu_frame to track per-frame metadata: the range of elements and iovecs that make up each frame, and the frame's total size. This replaces the head[] array which only tracked element indices.
A separate iov_msg[] array is built for recvmsg() by cloning the data portions (after stripping headers) using iov_tail helpers.
Then a frame truncation after recvmsg() properly walks the frame and element arrays to adjust iovec counts and element counts.
Signed-off-by: Laurent Vivier
---
 tcp_vu.c | 174 ++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 113 insertions(+), 61 deletions(-)

diff --git a/tcp_vu.c b/tcp_vu.c
index 2017aec90342..96b16007701d 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -35,9 +35,24 @@
 #include "vu_common.h"
 #include

-static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE + DISCARD_IOV_NUM];
+static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE];
 static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
-static int head[VIRTQUEUE_MAX_SIZE + 1];
+
+/**
+ * struct vu_frame - Descriptor for a TCP frame mapped to virtqueue elements
+ * @idx_element:	Index of first element in elem[] for this frame
+ * @num_element:	Number of virtqueue elements used by this frame
+ * @idx_iovec:		Index of first iovec in iov_vu[] for this frame
+ * @num_iovec:		Number of iovecs covering this frame's buffers
+ * @size:		Total frame size including all headers
+ */
+static struct vu_frame {
+	int idx_element;
+	int num_element;
+	int idx_iovec;
+	int num_iovec;
+	size_t size;
+} frame[VIRTQUEUE_MAX_SIZE];

 /**
  * tcp_vu_hdrlen() - Sum size of all headers, from TCP to virtio-net
@@ -174,8 +189,8 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
  * @v6:			Set for IPv6 connections
  * @already_sent:	Number of bytes already sent
  * @fillsize:		Maximum bytes to fill in guest-side receiving window
- * @iov_cnt:		number of iov (output)
- * @head_cnt:		Pointer to store the count of head iov entries (output)
+ * @elem_used:		number of element (output)
+ * @frame_cnt:		Pointer to store the number of frames (output)
  *
  * Return: number of bytes received from the socket, or a negative error code
  * on failure.
@@ -183,57 +198,77 @@ int tcp_vu_send_flag(const struct ctx *c, struct tcp_tap_conn *conn, int flags)
 static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,
 				const struct tcp_tap_conn *conn, bool v6,
 				uint32_t already_sent, size_t fillsize,
-				int *iov_cnt, int *head_cnt)
+				int *elem_used, int *frame_cnt)
 {
+	static struct iovec iov_msg[VIRTQUEUE_MAX_SIZE + DISCARD_IOV_NUM];
 	const struct vu_dev *vdev = c->vdev;
 	struct msghdr mh_sock = { 0 };
 	uint16_t mss = MSS_GET(conn);
 	size_t hdrlen, iov_used;
 	int s = conn->sock;
+	ssize_t ret, dlen;
 	int elem_cnt;
-	ssize_t ret;
-	int i;
-
-	*iov_cnt = 0;
+	int i, j;

 	hdrlen = tcp_vu_hdrlen(v6);

+	*elem_used = 0;
+	iov_used = 0;
 	elem_cnt = 0;
-	*head_cnt = 0;
+	*frame_cnt = 0;

 	while (fillsize > 0 && elem_cnt < ARRAY_SIZE(elem) &&
-	       iov_used < VIRTQUEUE_MAX_SIZE) {
-		size_t frame_size, dlen, in_total;
-		struct iovec *iov;
+	       iov_used < ARRAY_SIZE(iov_vu) &&
+	       *frame_cnt < ARRAY_SIZE(frame)) {
+		size_t frame_size, in_total;
 		int cnt;

 		cnt = vu_collect(vdev, vq, &elem[elem_cnt],
 				 ARRAY_SIZE(elem) - elem_cnt,
-				 &iov_vu[DISCARD_IOV_NUM + iov_used],
-				 VIRTQUEUE_MAX_SIZE - iov_used, &in_total,
+				 &iov_vu[iov_used],
+				 ARRAY_SIZE(iov_vu) - iov_used, &in_total,
 				 MIN(mss, fillsize) + hdrlen, &frame_size);
 		if (cnt == 0)
 			break;
-		assert((size_t)cnt == in_total); /* one iovec per element */
+
+		frame[*frame_cnt].idx_element = elem_cnt;
+		frame[*frame_cnt].num_element = cnt;
+		frame[*frame_cnt].idx_iovec = iov_used;
+		frame[*frame_cnt].num_iovec = in_total;
+		frame[*frame_cnt].size = frame_size;
+		(*frame_cnt)++;

 		iov_used += in_total;
-		dlen = frame_size - hdrlen;
+		elem_cnt += cnt;

-		/* reserve space for headers in iov */
-		iov = &elem[elem_cnt].in_sg[0];
-		assert(iov->iov_len >= hdrlen);
-		iov->iov_base = (char *)iov->iov_base + hdrlen;
-		iov->iov_len -= hdrlen;
-		head[(*head_cnt)++] = elem_cnt;
+		fillsize -= frame_size - hdrlen;
+	}

-		fillsize -= dlen;
-		elem_cnt += cnt;
+	/* build an iov array without headers */
+	for (i = 0, j = DISCARD_IOV_NUM; i < *frame_cnt &&
+	     j < ARRAY_SIZE(iov_msg); i++) {
+		struct iov_tail data;
+		ssize_t cnt;
+
+		data = IOV_TAIL(&iov_vu[frame[i].idx_iovec],
+				frame[i].num_iovec, 0);
+		iov_drop_header(&data, hdrlen);
+
+		cnt = iov_tail_clone(&iov_msg[j], ARRAY_SIZE(iov_msg) - j,
+				     &data);
+		if (cnt == -1)
+			die("Missing entries in iov_msg");
We need this to avoid a false positive Coverity overflow error:

diff --git a/tcp_vu.c b/tcp_vu.c
index 7f7e43860b10..c483350bff8f 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -284,7 +284,8 @@ static ssize_t tcp_vu_sock_recv(const struct ctx *c, struct vu_virtq *vq,

 		cnt = iov_tail_clone(&iov_msg[j], ARRAY_SIZE(iov_msg) - j,
 				     &data);
-		if (cnt == -1)
+		assert(cnt < ARRAY_SIZE(iov_msg) - j); /* for Coverity */
+		if (cnt < 0)
 			die("Missing entries in iov_msg");
 		j += cnt;

The assert cannot fail: in iov_tail_clone() the return value (cnt) is
always < dst_iov_cnt (ARRAY_SIZE(iov_msg) - j), and the only negative
value returned by iov_tail_clone() is -1.

Thanks,
Laurent