This... is not any of the things I said I would be working on.  I can
only say that a herd of very hairy yaks led me astray.  Looking at bug
66 I spotted some problems with our handling of MTUs / maximum frame
sizes.  Looking at that I found some weirdness and some real, if
minor, bugs in the sizing and handling of the packet pools.

David Gibson (3):
  packet: Use flexible array member in struct pool
  packet: Don't have struct pool specify its buffer
  tap: Don't size pool_tap[46] for the maximum number of packets

 packet.c     | 63 ++++++----------------------------------------------
 packet.h     | 41 ++++++++++++----------------------
 passt.h      |  2 --
 tap.c        | 63 +++++++++++++++++++++++++---------------------------
 tap.h        |  4 ++--
 vhost_user.c |  2 --
 vu_common.c  | 31 ++------------------------
 7 files changed, 55 insertions(+), 151 deletions(-)

-- 
2.47.1
Currently we have a dummy pkt[1] array, which we alias with an array
of a different size via various macros.  However, we already require
C11, which includes flexible array members, so we can do better.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 packet.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/packet.h b/packet.h
index 3f70e949..85ee5508 100644
--- a/packet.h
+++ b/packet.h
@@ -21,7 +21,7 @@ struct pool {
 	size_t buf_size;
 	size_t size;
 	size_t count;
-	struct iovec pkt[1];
+	struct iovec pkt[];
 };
 
 int vu_packet_check_range(void *buf, size_t offset, size_t len,
-- 
2.47.1
struct pool, which represents a batch of packets, includes values
giving the buffer in which all the packets lie - or, for vhost-user, a
link to the vu_dev_region array in which the packets sit.  Originally
that made sense because we stored each packet as an offset and length
within that buffer.  However, dd143e389 ("packet: replace struct desc
by struct iovec") replaced the offset and length with a struct iovec
which can directly reference a packet anywhere in memory.

This means we no longer need the buffer reference to interpret packets
from the pool, so there's really no need to check where the packet
sits.  We can remove the buf reference and all checks associated with
it.  As a bonus this removes the special case for vhost-user.

Similarly, the old representation used a 16-bit length, so there were
some checks that packets didn't exceed that.  That's also no longer
necessary with the struct iovec, which uses a size_t length.

I think under an unlikely set of circumstances it might have been
possible to hit that 16-bit limit for a legitimate packet: other parts
of the code place a limit of 65535 bytes on the L2 frame, but that
doesn't include the length tag used by the qemu socket protocol.  That
tag *is* included in the packet as stored in the pool, however,
meaning we could get a 65539 byte packet at this level.
Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 packet.c     | 60 ----------------------------------------------------
 packet.h     | 36 +++++++++----------------------
 tap.c        | 45 +++++++++++++--------------------------
 tap.h        |  1 -
 vhost_user.c |  2 --
 vu_common.c  | 28 ------------------------
 6 files changed, 25 insertions(+), 147 deletions(-)

diff --git a/packet.c b/packet.c
index 03a11e6a..5bfa7304 100644
--- a/packet.c
+++ b/packet.c
@@ -22,46 +22,6 @@
 #include "util.h"
 #include "log.h"
 
-/**
- * packet_check_range() - Check if a packet memory range is valid
- * @p:		Packet pool
- * @offset:	Offset of data range in packet descriptor
- * @len:	Length of desired data range
- * @start:	Start of the packet descriptor
- * @func:	For tracing: name of calling function
- * @line:	For tracing: caller line of function call
- *
- * Return: 0 if the range is valid, -1 otherwise
- */
-static int packet_check_range(const struct pool *p, size_t offset, size_t len,
-			      const char *start, const char *func, int line)
-{
-	if (p->buf_size == 0) {
-		int ret;
-
-		ret = vu_packet_check_range((void *)p->buf, offset, len, start);
-
-		if (ret == -1)
-			trace("cannot find region, %s:%i", func, line);
-
-		return ret;
-	}
-
-	if (start < p->buf) {
-		trace("packet start %p before buffer start %p, "
-		      "%s:%i", (void *)start, (void *)p->buf, func, line);
-		return -1;
-	}
-
-	if (start + len + offset > p->buf + p->buf_size) {
-		trace("packet offset plus length %zu from size %zu, "
-		      "%s:%i", start - p->buf + len + offset,
-		      p->buf_size, func, line);
-		return -1;
-	}
-
-	return 0;
-}
 /**
  * packet_add_do() - Add data as packet descriptor to given pool
  * @p:		Existing pool
@@ -81,14 +41,6 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 		return;
 	}
 
-	if (packet_check_range(p, 0, len, start, func, line))
-		return;
-
-	if (len > UINT16_MAX) {
-		trace("add packet length %zu, %s:%i", len, func, line);
-		return;
-	}
-
 	p->pkt[idx].iov_base = (void *)start;
 	p->pkt[idx].iov_len = len;
 
@@ -118,14 +70,6 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 		return NULL;
 	}
 
-	if (len > UINT16_MAX) {
-		if (func) {
-			trace("packet data length %zu, %s:%i",
-			      len, func, line);
-		}
-		return NULL;
-	}
-
 	if (len + offset > p->pkt[idx].iov_len) {
 		if (func) {
 			trace("data length %zu, offset %zu from length %zu, "
@@ -135,10 +79,6 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 		return NULL;
 	}
 
-	if (packet_check_range(p, offset, len, p->pkt[idx].iov_base,
-			       func, line))
-		return NULL;
-
 	if (left)
 		*left = p->pkt[idx].iov_len - offset - len;
 
diff --git a/packet.h b/packet.h
index 85ee5508..98eb8812 100644
--- a/packet.h
+++ b/packet.h
@@ -7,25 +7,17 @@
 #define PACKET_H
 
 /**
- * struct pool - Generic pool of packets stored in a buffer
- * @buf:	Buffer storing packet descriptors,
- *		a struct vu_dev_region array for passt vhost-user mode
- * @buf_size:	Total size of buffer,
- *		0 for passt vhost-user mode
+ * struct pool - Generic pool of packets stored in memory
  * @size:	Number of usable descriptors for the pool
  * @count:	Number of used descriptors for the pool
  * @pkt:	Descriptors: see macros below
  */
 struct pool {
-	char *buf;
-	size_t buf_size;
 	size_t size;
 	size_t count;
 	struct iovec pkt[];
 };
 
-int vu_packet_check_range(void *buf, size_t offset, size_t len,
-			  const char *start);
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
 void *packet_get_do(const struct pool *p, const size_t idx,
@@ -42,35 +34,27 @@ void pool_flush(struct pool *p);
 #define packet_get_try(p, idx, offset, len, left)			\
 	packet_get_do(p, idx, offset, len, left, NULL, 0)
 
-#define PACKET_POOL_DECL(_name, _size, _buf)				\
+#define PACKET_POOL_DECL(_name, _size)					\
 struct _name ## _t {							\
-	char *buf;							\
-	size_t buf_size;						\
 	size_t size;							\
 	size_t count;							\
 	struct iovec pkt[_size];					\
 }
 
-#define PACKET_POOL_INIT_NOCAST(_size, _buf, _buf_size)			\
+#define PACKET_POOL_INIT_NOCAST(_size)					\
 {									\
-	.buf_size = _buf_size,						\
-	.buf = _buf,							\
 	.size = _size,							\
 }
 
-#define PACKET_POOL(name, size, buf, buf_size)				\
-	PACKET_POOL_DECL(name, size, buf) name =			\
-		PACKET_POOL_INIT_NOCAST(size, buf, buf_size)
+#define PACKET_POOL(name, size)						\
+	PACKET_POOL_DECL(name, size) name =				\
+		PACKET_POOL_INIT_NOCAST(size)
 
-#define PACKET_INIT(name, size, buf, buf_size)				\
-	(struct name ## _t) PACKET_POOL_INIT_NOCAST(size, buf, buf_size)
+#define PACKET_INIT(name, size)						\
+	(struct name ## _t) PACKET_POOL_INIT_NOCAST(size)
 
-#define PACKET_POOL_NOINIT(name, size, buf)				\
-	PACKET_POOL_DECL(name, size, buf) name ## _storage;		\
-	static struct pool *name = (struct pool *)&name ## _storage
-
-#define PACKET_POOL_P(name, size, buf, buf_size)			\
-	PACKET_POOL(name ## _storage, size, buf, buf_size);		\
+#define PACKET_POOL_P(name, size)					\
+	PACKET_POOL(name ## _storage, size);				\
 	struct pool *name = (struct pool *)&name ## _storage
 
 #endif /* PACKET_H */
diff --git a/tap.c b/tap.c
index cd32a901..68231f09 100644
--- a/tap.c
+++ b/tap.c
@@ -62,8 +62,8 @@
 #include "vu_common.h"
 
 /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
-static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
-static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
+static PACKET_POOL_P(pool_tap4, TAP_MSGS);
+static PACKET_POOL_P(pool_tap6, TAP_MSGS);
 
 #define TAP_SEQS		128 /* Different L4 tuples in one batch */
 #define FRAGMENT_MSG_RATE	10  /* # seconds between fragment warnings */
@@ -462,7 +462,7 @@ void eth_update_mac(struct ethhdr *eh,
 	memcpy(eh->h_source, eth_s, sizeof(eh->h_source));
 }
 
-PACKET_POOL_DECL(pool_l4, UIO_MAXIOV, pkt_buf);
+PACKET_POOL_DECL(pool_l4, UIO_MAXIOV);
 
 /**
  * struct l4_seq4_t - Message sequence for one protocol handler call, IPv4
@@ -618,7 +618,7 @@ resume:
 		if (!eh)
 			continue;
 		if (ntohs(eh->h_proto) == ETH_P_ARP) {
-			PACKET_POOL_P(pkt, 1, in->buf, in->buf_size);
+			PACKET_POOL_P(pkt, 1);
 
 			packet_add(pkt, l2len, (char *)eh);
 			arp(c, pkt);
@@ -658,7 +658,7 @@ resume:
 			continue;
 
 		if (iph->protocol == IPPROTO_ICMP) {
-			PACKET_POOL_P(pkt, 1, in->buf, in->buf_size);
+			PACKET_POOL_P(pkt, 1);
 
 			if (c->no_icmp)
 				continue;
@@ -677,7 +677,7 @@ resume:
 			continue;
 
 		if (iph->protocol == IPPROTO_UDP) {
-			PACKET_POOL_P(pkt, 1, in->buf, in->buf_size);
+			PACKET_POOL_P(pkt, 1);
 
 			packet_add(pkt, l2len, (char *)eh);
 			if (dhcp(c, pkt))
@@ -829,7 +829,7 @@ resume:
 		}
 
 		if (proto == IPPROTO_ICMPV6) {
-			PACKET_POOL_P(pkt, 1, in->buf, in->buf_size);
+			PACKET_POOL_P(pkt, 1);
 
 			if (c->no_icmp)
 				continue;
@@ -854,7 +854,7 @@ resume:
 		uh = (struct udphdr *)l4h;
 
 		if (proto == IPPROTO_UDP) {
-			PACKET_POOL_P(pkt, 1, in->buf, in->buf_size);
+			PACKET_POOL_P(pkt, 1);
 
 			packet_add(pkt, l4len, l4h);
 
@@ -1381,36 +1381,21 @@ static void tap_sock_tun_init(struct ctx *c)
 }
 
 /**
- * tap_sock_update_pool() - Set the buffer base and size for the pool of packets
- * @base:	Buffer base
- * @size	Buffer size
+ * tap_backend_init() - Create and set up AF_UNIX socket or
+ *			tuntap file descriptor
+ * @c:		Execution context
  */
-void tap_sock_update_pool(void *base, size_t size)
+void tap_backend_init(struct ctx *c)
 {
 	int i;
 
-	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, base, size);
-	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, base, size);
-
 	for (i = 0; i < TAP_SEQS; i++) {
-		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, base, size);
-		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, base, size);
+		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV);
+		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV);
 	}
-}
 
-/**
- * tap_backend_init() - Create and set up AF_UNIX socket or
- *			tuntap file descriptor
- * @c:		Execution context
- */
-void tap_backend_init(struct ctx *c)
-{
-	if (c->mode == MODE_VU) {
-		tap_sock_update_pool(NULL, 0);
+	if (c->mode == MODE_VU)
 		vu_init(c);
-	} else {
-		tap_sock_update_pool(pkt_buf, sizeof(pkt_buf));
-	}
 
 	if (c->fd_tap != -1) { /* Passed as --fd */
 		ASSERT(c->one_off);
diff --git a/tap.h b/tap.h
index dfbd8b9e..9731f70c 100644
--- a/tap.h
+++ b/tap.h
@@ -70,7 +70,6 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now);
 int tap_sock_unix_open(char *sock_path);
 void tap_sock_reset(struct ctx *c);
-void tap_sock_update_pool(void *base, size_t size);
 void tap_backend_init(struct ctx *c);
 void tap_flush_pools(void);
 void tap_handler(struct ctx *c, const struct timespec *now);
diff --git a/vhost_user.c b/vhost_user.c
index 4b8558fa..77d36766 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -494,8 +494,6 @@ static bool vu_set_mem_table_exec(struct vu_dev *vdev,
 	ASSERT(vdev->nregions < VHOST_USER_MAX_RAM_SLOTS - 1);
 	vdev->regions[vdev->nregions].mmap_addr = 0;
 
-	tap_sock_update_pool(vdev->regions, 0);
-
 	return false;
 }
 
diff --git a/vu_common.c b/vu_common.c
index 299b5a32..bb794193 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -18,34 +18,6 @@
 #include "pcap.h"
 #include "vu_common.h"
 
-/**
- * vu_packet_check_range() - Check if a given memory zone is contained in
- *			     a mapped guest memory region
- * @buf:	Array of the available memory regions
- * @offset:	Offset of data range in packet descriptor
- * @size:	Length of desired data range
- * @start:	Start of the packet descriptor
- *
- * Return: 0 if the zone is in a mapped memory region, -1 otherwise
- */
-int vu_packet_check_range(void *buf, size_t offset, size_t len,
-			  const char *start)
-{
-	struct vu_dev_region *dev_region;
-
-	for (dev_region = buf; dev_region->mmap_addr; dev_region++) {
-		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
-		char *m = (char *)(uintptr_t)dev_region->mmap_addr;
-
-		if (m <= start &&
-		    start + offset + len <= m + dev_region->mmap_offset +
-					    dev_region->size)
-			return 0;
-	}
-
-	return -1;
-}
-
 /**
  * vu_init_elem() - initialize an array of virtqueue elements with 1 iov in each
  * @elem:	Array of virtqueue elements to initialize
-- 
2.47.1
On Fri, 13 Dec 2024 23:01:55 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> struct pool, which represents a batch of packets includes values
> giving the buffer in which all the packets lie - or for vhost_user a
> link to the vu_dev_region array in which the packets sit.
> [...]
> That tag *is* included in the packet as stored in the pool, however,
> meaning we could get a 65539 byte packet at this level.

As I mentioned in the call on Monday: sure, we need to fix this, but
at the same time I'm not quite convinced that it's a good idea to drop
all these sanity checks. Even if they're not based on offsets anymore,
I think it's still valuable to ensure that the packets are not exactly
_anywhere_ in memory, but only where we expect them to be.

If it's doable, I would rather keep these checks, and change the ones
on the length to allow a maximum value of 65539 bytes. I mean, there's
a big difference between 65539 and, say, 4294967296.

By the way, I haven't checked what happens with MTUs slightly bigger
than 65520 bytes: virtio-net (at least with QEMU) doesn't budge if I
set more than 65520, but I didn't actually send big packets. I'll try
to have a look (also with muvm) unless you already checked.

-- 
Stefano
On Thu, Dec 19, 2024 at 10:00:11AM +0100, Stefano Brivio wrote:
> On Fri, 13 Dec 2024 23:01:55 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
> 
> > struct pool, which represents a batch of packets includes values
> > giving the buffer in which all the packets lie [...]
> 
> As I mentioned in the call on Monday: sure, we need to fix this, but
> at the same time I'm not quite convinced that it's a good idea to
> drop all these sanity checks. Even if they're not based on offsets
> anymore, I think it's still valuable to ensure that the packets are
> not exactly _anywhere_ in memory, but only where we expect them to
> be.
> 
> If it's doable, I would rather keep these checks, and change the ones
> on the length to allow a maximum value of 65539 bytes.

Right, I have draft patches that do basically this.

> I mean, there's a big difference between 65539 and, say, 4294967296.
> 
> By the way, I haven't checked what happens with MTUs slightly bigger
> than 65520 bytes: virtio-net (at least with QEMU) doesn't budge if I
> set more than 65520, but I didn't actually send big packets. I'll
> try to have a look (also with muvm) unless you already checked.

I'm not sure what you mean by "doesn't budge". No, I haven't checked
with either qemu or muvm. There could of course be limits applied by
either VMM, or by the guest virtio-net driver.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
On Fri, 20 Dec 2024 11:59:09 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Thu, Dec 19, 2024 at 10:00:11AM +0100, Stefano Brivio wrote:
> > [...]
> > If it's doable, I would rather keep these checks, and change the
> > ones on the length to allow a maximum value of 65539 bytes.
> 
> Right, I have draft patches that do basically this.
> 
> > By the way, I haven't checked what happens with MTUs slightly
> > bigger than 65520 bytes: virtio-net (at least with QEMU) doesn't
> > budge if I set more than 65520, but I didn't actually send big
> > packets.
> 
> I'm not sure what you mean by "doesn't budge". No, I haven't checked
> with either qemu or muvm. There could of course be limits applied by
> either VMM, or by the guest virtio-net driver.

Oh, sorry, I was deep in the perspective of trying to make things
crash... and it didn't do anything, just accepted the setting and kept
sending packets out.

Let me try that then, with and without your new series...

-- 
Stefano
On Fri, Dec 20, 2024 at 10:51:33AM +0100, Stefano Brivio wrote:
> On Fri, 20 Dec 2024 11:59:09 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
> > [...]
> > I'm not sure what you mean by "doesn't budge". No, I haven't
> > checked with either qemu or muvm.
> 
> Oh, sorry, I was deep in the perspective of trying to make things
> crash... and it didn't do anything, just accepted the setting and
> kept sending packets out.

Right. Even without my packet pool changes, I'm not aware of a way to
make it crash, just ways to cause packets to be dropped when they
shouldn't.

> Let me try that then, with and without your new series...

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
Currently we attempt to size pool_tap[46] so they have room for the
maximum possible number of packets that could fit in pkt_buf,
TAP_MSGS.  However, the calculation isn't quite correct: TAP_MSGS is
based on ETH_ZLEN (60) as the minimum possible L2 frame size.  But we
don't enforce that L2 frames are at least ETH_ZLEN when we receive
them from the tap backend, and since we're dealing with virtual
interfaces we don't have the physical Ethernet limitations requiring
that length.  Indeed it is possible to generate a legitimate frame
smaller than that (e.g. a zero-payload UDP/IPv4 frame on the 'pasta'
backend is only 42 bytes long).

It's also unclear if this limit is sufficient for vhost-user, which
isn't limited by the size of pkt_buf as the other modes are.

We could attempt to correct the calculation, but that would leave us
with even larger arrays, which in practice rarely accumulate more than
a handful of packets.  So, instead, put an arbitrary cap on the number
of packets we can put in a batch, and if we run out of space, process
and flush the batch.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 packet.c    | 13 ++++++++++++-
 packet.h    |  3 +++
 passt.h     |  2 --
 tap.c       | 18 +++++++++++++++---
 tap.h       |  3 ++-
 vu_common.c |  3 ++-
 6 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/packet.c b/packet.c
index 5bfa7304..b68580cc 100644
--- a/packet.c
+++ b/packet.c
@@ -22,6 +22,17 @@
 #include "util.h"
 #include "log.h"
 
+/**
+ * pool_full() - Is a packet pool full?
+ * @p:	Pointer to packet pool
+ *
+ * Return: true if the pool is full, false if more packets can be added
+ */
+bool pool_full(const struct pool *p)
+{
+	return p->count >= p->size;
+}
+
 /**
  * packet_add_do() - Add data as packet descriptor to given pool
  * @p:		Existing pool
@@ -35,7 +46,7 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 {
 	size_t idx = p->count;
 
-	if (idx >= p->size) {
+	if (pool_full(p)) {
 		trace("add packet index %zu to pool with size %zu, %s:%i",
 		      idx, p->size, func, line);
 		return;
diff --git a/packet.h b/packet.h
index 98eb8812..3618f213 100644
--- a/packet.h
+++ b/packet.h
@@ -6,6 +6,8 @@
 #ifndef PACKET_H
 #define PACKET_H
 
+#include <stdbool.h>
+
 /**
  * struct pool - Generic pool of packets stored in memory
  * @size:	Number of usable descriptors for the pool
@@ -23,6 +25,7 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 void *packet_get_do(const struct pool *p, const size_t idx,
 		    size_t offset, size_t len, size_t *left,
 		    const char *func, int line);
+bool pool_full(const struct pool *p);
 void pool_flush(struct pool *p);
 
 #define packet_add(p, len, start)					\
diff --git a/passt.h b/passt.h
index 0dd4efa0..81b2787f 100644
--- a/passt.h
+++ b/passt.h
@@ -70,8 +70,6 @@ static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
 
 #define TAP_BUF_BYTES							\
 	ROUND_DOWN(((ETH_MAX_MTU + sizeof(uint32_t)) * 128), PAGE_SIZE)
-#define TAP_MSGS							\
-	DIV_ROUND_UP(TAP_BUF_BYTES, ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t))
 #define PKT_BUF_BYTES		MAX(TAP_BUF_BYTES, 0)
 
 extern char pkt_buf	[PKT_BUF_BYTES];
diff --git a/tap.c b/tap.c
index 68231f09..42370a26 100644
--- a/tap.c
+++ b/tap.c
@@ -61,6 +61,8 @@
 #include "vhost_user.h"
 #include "vu_common.h"
 
+#define TAP_MSGS	256
+
 /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
 static PACKET_POOL_P(pool_tap4, TAP_MSGS);
 static PACKET_POOL_P(pool_tap6, TAP_MSGS);
@@ -966,8 +968,10 @@ void tap_handler(struct ctx *c, const struct timespec *now)
  * @c:		Execution context
  * @l2len:	Total L2 packet length
  * @p:		Packet buffer
+ * @now:	Current timestamp
  */
-void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
+void tap_add_packet(struct ctx *c, ssize_t l2len, char *p,
+		    const struct timespec *now)
 {
 	const struct ethhdr *eh;
 
@@ -983,9 +987,17 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
 	switch (ntohs(eh->h_proto)) {
 	case ETH_P_ARP:
 	case ETH_P_IP:
+		if (pool_full(pool_tap4)) {
+			tap4_handler(c, pool_tap4, now);
+			pool_flush(pool_tap4);
+		}
 		packet_add(pool_tap4, l2len, p);
 		break;
 	case ETH_P_IPV6:
+		if (pool_full(pool_tap6)) {
+			tap6_handler(c, pool_tap6, now);
+			pool_flush(pool_tap6);
+		}
 		packet_add(pool_tap6, l2len, p);
 		break;
 	default:
@@ -1066,7 +1078,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 		p += sizeof(uint32_t);
 		n -= sizeof(uint32_t);
 
-		tap_add_packet(c, l2len, p);
+		tap_add_packet(c, l2len, p, now);
 
 		p += l2len;
 		n -= l2len;
@@ -1129,7 +1141,7 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 		    len > (ssize_t)ETH_MAX_MTU)
 			continue;
 
-		tap_add_packet(c, len, pkt_buf + n);
+		tap_add_packet(c, len, pkt_buf + n, now);
 	}
 
 	tap_handler(c, now);
diff --git a/tap.h b/tap.h
index 9731f70c..c9acf860 100644
--- a/tap.h
+++ b/tap.h
@@ -73,6 +73,7 @@ void tap_sock_reset(struct ctx *c);
 void tap_backend_init(struct ctx *c);
 void tap_flush_pools(void);
 void tap_handler(struct ctx *c, const struct timespec *now);
-void tap_add_packet(struct ctx *c, ssize_t l2len, char *p);
+void tap_add_packet(struct ctx *c, ssize_t l2len, char *p,
+		    const struct timespec *now);
 
 #endif /* TAP_H */
diff --git a/vu_common.c b/vu_common.c
index bb794193..38511230 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -157,7 +157,8 @@ static void vu_handle_tx(struct vu_dev *vdev, int index,
 
 			tap_add_packet(vdev->context,
 				       elem[count].out_sg[0].iov_len - hdrlen,
-				       (char *)elem[count].out_sg[0].iov_base + hdrlen);
+				       (char *)elem[count].out_sg[0].iov_base + hdrlen,
+				       now);
 			count++;
 		}
 
 	tap_handler(vdev->context, now);
-- 
2.47.1
On Fri, 13 Dec 2024 23:01:56 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> Currently we attempt to size pool_tap[46] so they have room for the
> maximum possible number of packets that could fit in pkt_buf,
> TAP_MSGS.
> [...]
> diff --git a/tap.c b/tap.c
> index 68231f09..42370a26 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -61,6 +61,8 @@
>  #include "vhost_user.h"
>  #include "vu_common.h"
>  
> +#define TAP_MSGS	256

Sorry, I stopped at 2/3, had just a quick look at this one, and I
missed this. Assuming 4 KiB pages, this changes from 161319 to 256.

You mention that in practice we never have more than a handful of
messages, which is probably almost always the case, but I wonder if
that's also the case with UDP "real-time" streams, where we could have
bursts of a few hundred (thousand?) messages at a time.

I wonder: how bad would it be to correct the calculation, instead? We
wouldn't actually use more memory, right?

-- 
Stefano
On Thu, Dec 19, 2024 at 10:00:15AM +0100, Stefano Brivio wrote:
> On Fri, 13 Dec 2024 23:01:56 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>
> [...]
>
> > +#define TAP_MSGS	256
>
> Sorry, I stopped at 2/3, had just a quick look at this one, and I
> missed this. Assuming 4 KiB pages, this changes from 161319 to 256.

Yes.  I'm certainly open to arguments on what the number should be.

> You mention that in practice we never have more than a handful of
> messages, which is probably almost always the case, but I wonder if
> that's also the case with UDP "real-time" streams, where we could have
> bursts of a few hundred (thousand?) messages at a time.

Maybe.  If we are getting them in large bursts, then we're no longer
really succeeding at the streams being "real-time", but sure, we should
try to catch up as best we can.

> I wonder: how bad would it be to correct the calculation, instead? We
> wouldn't actually use more memory, right?

It was pretty painful when I tried, and it would use more memory.  The
safe option would be to use ETH_HLEN as the minimum size (which is
pretty much all we enforce in the tap layer), which would expand the
iovec array here by 2-3x.  It's not enormous, but it's not nothing.

Or do you mean the unused pages of the array would never be
instantiated?  In which case, yeah, I guess not.

Remember that with the changes in this patch if we exceed TAP_MSGS,
nothing particularly bad happens: we don't crash, and we don't drop
packets; we just process things in batches of TAP_MSGS frames at a
time.  So this doesn't need to be large enough to handle any burst we
could ever get, just large enough to adequately mitigate the per-batch
costs, which I don't think are _that_ large.

256 was a first guess at that.  Maybe it's not enough, but I'd be
pretty surprised if it needed to be greater than ~1000 to make the
per-batch costs negligible compared to the per-frame costs.
UDP_MAX_FRAMES, which is on the reverse path but serves a similar
function, is only 32.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
On Fri, 20 Dec 2024 12:13:23 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Thu, Dec 19, 2024 at 10:00:15AM +0100, Stefano Brivio wrote:
> [...]
> > Sorry, I stopped at 2/3, had just a quick look at this one, and I
> > missed this. Assuming 4 KiB pages, this changes from 161319 to 256.
>
> Yes.  I'm certainly open to arguments on what the number should be.

No idea, until now I just thought we'd have a limit that we can't hit
in practice. Let me have a look at what happens with 256 (your new
series) and iperf3 or udp_stream from neper.

It could even be that flushing more frequently actually improves
things. I'm not sure.

> > You mention that in practice we never have more than a handful of
> > messages, which is probably almost always the case, but I wonder if
> > that's also the case with UDP "real-time" streams, where we could
> > have bursts of a few hundred (thousand?) messages at a time.
>
> Maybe.  If we are getting them in large bursts, then we're no longer
> really succeeding at the streams being "real-time", but sure, we
> should try to catch up as best we can.

I was just pointing out that, quite likely, we can actually hit the new
limit.

> > I wonder: how bad would it be to correct the calculation, instead?
> > We wouldn't actually use more memory, right?
>
> It was pretty painful when I tried, and it would use more memory.  The
> safe option would be to use ETH_HLEN as the minimum size (which is
> pretty much all we enforce in the tap layer), which would expand the
> iovec array here by 2-3x.  It's not enormous, but it's not nothing.
>
> Or do you mean the unused pages of the array would never be
> instantiated?  In which case, yeah, I guess not.

Yes, that's what I meant.

> Remember that with the changes in this patch if we exceed TAP_MSGS,
> nothing particularly bad happens: we don't crash, and we don't drop
> packets; we just process things in batches of TAP_MSGS frames at a
> time.  So this doesn't need to be large enough to handle any burst we
> could ever get, just large enough to adequately mitigate the per-batch
> costs, which I don't think are _that_ large.
>
> 256 was a first guess at that.  Maybe it's not enough, but I'd be
> pretty surprised if it needed to be greater than ~1000 to make the
> per-batch costs negligible compared to the per-frame costs.
> UDP_MAX_FRAMES, which is on the reverse path but serves a similar
> function, is only 32.

Okay, let me give that a try. I guess you didn't see a change in UDP
throughput tests... but there we use fairly large messages.

-- 
Stefano
On Fri, Dec 20, 2024 at 10:51:36AM +0100, Stefano Brivio wrote:
> On Fri, 20 Dec 2024 12:13:23 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
> [...]
> > 256 was a first guess at that.  Maybe it's not enough, but I'd be
> > pretty surprised if it needed to be greater than ~1000 to make the
> > per-batch costs negligible compared to the per-frame costs.
> > UDP_MAX_FRAMES, which is on the reverse path but serves a similar
> > function, is only 32.
>
> Okay, let me give that a try. I guess you didn't see a change in UDP
> throughput tests... but there we use fairly large messages.

No, but TBH I didn't look that closely.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson