[PATCH] vhost_user: fix multibuffer from linux
Under some conditions, linux can provide several buffers
in the same element (multiple entries in the iovec array).
I didn't identify what changed between the kernel guest that
provides one buffer and the one that provides several
(doesn't seem to be a kernel change or a configuration change).
Fix the following assert:
ASSERTION FAILED in virtqueue_map_desc (virtio.c:402): num_sg < max_num_sg
What I can see is the buffer can be splitted in two iovecs:
- vnet header
- packet data
This change manages this special case but the real fix will be to allow
tap_add_packet() to manage iovec array.
Signed-off-by: Laurent Vivier
On Wed, 15 Jan 2025 17:22:30 +0100
Laurent Vivier
Under some conditions, linux can provide several buffers in the same element (multiple entries in the iovec array).
I didn't identify what changed between the kernel guest that provides one buffer and the one that provides several (doesn't seem to be a kernel change or a configuration change).
Perhaps memory pressure, or different page accounting between kernels?
Fix the following assert:
ASSERTION FAILED in virtqueue_map_desc (virtio.c:402): num_sg < max_num_sg
What I can see is the buffer can be splitted in two iovecs: - vnet header - packet data
This change manages this special case but the real fix will be to allow tap_add_packet() to manage iovec array.
Signed-off-by: Laurent Vivier
Applied. I just wonder, if it makes sense as a follow-up:
--- vu_common.c | 28 ++++++++++++++++++++++------ 1 file changed, 22 insertions(+), 6 deletions(-)
diff --git a/vu_common.c b/vu_common.c index 6d365bea5fe2..431fba6be0c0 100644 --- a/vu_common.c +++ b/vu_common.c @@ -18,6 +18,8 @@ #include "pcap.h" #include "vu_common.h"
+#define VU_MAX_TX_BUFFER_NB 2 + /** * vu_packet_check_range() - Check if a given memory zone is contained in * a mapped guest memory region @@ -168,10 +170,15 @@ static void vu_handle_tx(struct vu_dev *vdev, int index,
count = 0; out_sg_count = 0; - while (count < VIRTQUEUE_MAX_SIZE) { + while (count < VIRTQUEUE_MAX_SIZE && + out_sg_count + VU_MAX_TX_BUFFER_NB <= VIRTQUEUE_MAX_SIZE) { int ret;
- vu_set_element(&elem[count], &out_sg[out_sg_count], NULL); + elem[count].out_num = VU_MAX_TX_BUFFER_NB; + elem[count].out_sg = &out_sg[out_sg_count]; + elem[count].in_num = 0; + elem[count].in_sg = NULL; + ret = vu_queue_pop(vdev, vq, &elem[count]); if (ret < 0) break; @@ -181,11 +188,20 @@ static void vu_handle_tx(struct vu_dev *vdev, int index, warn("virtio-net transmit queue contains no out buffers"); break; } - ASSERT(elem[count].out_num == 1); + if (elem[count].out_num == 1) { + tap_add_packet(vdev->context, + elem[count].out_sg[0].iov_len - hdrlen, + (char *)elem[count].out_sg[0].iov_base + + hdrlen); + } else { + /* vnet header can be in a separate iovec */ + ASSERT(elem[count].out_num == 2);
I suppose we don't have strong guarantees about this. What about discarding the packet with a debug() message, at least until we have a more elegant solution, if this happens? For UDP and ICMP, that's the best thing we can do. For TCP, we could just discard a part of it, and the peer would tell our guest, but it's surely not practical to look into the packet here, so dropping it altogether would look reasonable.
+ ASSERT(elem[count].out_sg[0].iov_len == (size_t)hdrlen);
And similarly here (with an err() message), even though there's probably an issue in the hypervisor if this happens, but it doesn't mean we're doomed.
+ tap_add_packet(vdev->context, + elem[count].out_sg[1].iov_len, + (char *)elem[count].out_sg[1].iov_base); + }
- tap_add_packet(vdev->context, - elem[count].out_sg[0].iov_len - hdrlen, - (char *)elem[count].out_sg[0].iov_base + hdrlen); count++; } tap_handler(vdev->context, now);
-- Stefano
On Wed, Jan 15, 2025 at 11:33:02PM +0100, Stefano Brivio wrote:
On Wed, 15 Jan 2025 17:22:30 +0100 Laurent Vivier
wrote: Under some conditions, linux can provide several buffers in the same element (multiple entries in the iovec array).
I didn't identify what changed between the kernel guest that provides one buffer and the one that provides several (doesn't seem to be a kernel change or a configuration change).
Perhaps memory pressure, or different page accounting between kernels?
Fix the following assert:
ASSERTION FAILED in virtqueue_map_desc (virtio.c:402): num_sg < max_num_sg
What I can see is the buffer can be splitted in two iovecs: - vnet header - packet data
This change manages this special case but the real fix will be to allow tap_add_packet() to manage iovec array.
Signed-off-by: Laurent Vivier
Applied.
So, we obviously want this as a short term fix. However, this seems like it's still not very robust. IIUC, kernel in theory could arbitrarily split the packet, not just breaking it neatly at the header boundary. I think we should try to handle this more generally.
I just wonder, if it makes sense as a follow-up:
--- vu_common.c | 28 ++++++++++++++++++++++------ 1 file changed, 22 insertions(+), 6 deletions(-)
diff --git a/vu_common.c b/vu_common.c index 6d365bea5fe2..431fba6be0c0 100644 --- a/vu_common.c +++ b/vu_common.c @@ -18,6 +18,8 @@ #include "pcap.h" #include "vu_common.h"
+#define VU_MAX_TX_BUFFER_NB 2 + /** * vu_packet_check_range() - Check if a given memory zone is contained in * a mapped guest memory region @@ -168,10 +170,15 @@ static void vu_handle_tx(struct vu_dev *vdev, int index,
count = 0; out_sg_count = 0; - while (count < VIRTQUEUE_MAX_SIZE) { + while (count < VIRTQUEUE_MAX_SIZE && + out_sg_count + VU_MAX_TX_BUFFER_NB <= VIRTQUEUE_MAX_SIZE) { int ret;
- vu_set_element(&elem[count], &out_sg[out_sg_count], NULL); + elem[count].out_num = VU_MAX_TX_BUFFER_NB; + elem[count].out_sg = &out_sg[out_sg_count]; + elem[count].in_num = 0; + elem[count].in_sg = NULL; + ret = vu_queue_pop(vdev, vq, &elem[count]); if (ret < 0) break; @@ -181,11 +188,20 @@ static void vu_handle_tx(struct vu_dev *vdev, int index, warn("virtio-net transmit queue contains no out buffers"); break; } - ASSERT(elem[count].out_num == 1); + if (elem[count].out_num == 1) { + tap_add_packet(vdev->context, + elem[count].out_sg[0].iov_len - hdrlen, + (char *)elem[count].out_sg[0].iov_base + + hdrlen); + } else { + /* vnet header can be in a separate iovec */ + ASSERT(elem[count].out_num == 2);
I suppose we don't have strong guarantees about this. What about discarding the packet with a debug() message, at least until we have a more elegant solution, if this happens?
For UDP and ICMP, that's the best thing we can do.
For TCP, we could just discard a part of it, and the peer would tell our guest, but it's surely not practical to look into the packet here, so dropping it altogether would look reasonable.
+ ASSERT(elem[count].out_sg[0].iov_len == (size_t)hdrlen);
And similarly here (with an err() message), even though there's probably an issue in the hypervisor if this happens, but it doesn't mean we're doomed.
Right. If the kernel is splitting every packet this way, then we probably are doomed, but we don't know it for certain. In general ASSERT()s should indicate a definite bug in *our* code, not unexpected behaviour from something we interact with, so err() and dropping the packet would be more appropriate IMO.
+ tap_add_packet(vdev->context, + elem[count].out_sg[1].iov_len, + (char *)elem[count].out_sg[1].iov_base); + }
- tap_add_packet(vdev->context, - elem[count].out_sg[0].iov_len - hdrlen, - (char *)elem[count].out_sg[0].iov_base + hdrlen); count++; } tap_handler(vdev->context, now);
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
participants (3)
-
David Gibson
-
Laurent Vivier
-
Stefano Brivio