[PATCH 0/4] Improve robustness of calculations related to frame size limits

older
[PATCH v2] flow, repair: Wait for...

David Gibson

13 Mar 2025 13 Mar '25

6:40 a.m.

There are a number of places where we make calculations and checks around how large frames can be and where they sit in memory. Several of these are roughly correct, but can be wrong in certain edge cases. Improve robustness by clarifying what we're doing and being more careful about the edge cases. David Gibson (4): vu_common: Tighten vu_packet_check_range() packet: More cautious checks to avoid pointer arithmetic UB tap: Make size of pool_tap[46] purely a tuning parameter tap: Clarify calculation of TAP_MSGS packet.c | 25 +++++++++++++++++++++---- packet.h | 3 +++ passt.h | 2 -- tap.c | 43 ++++++++++++++++++++++++++++++++++++------- tap.h | 3 ++- vu_common.c | 15 ++++++++++----- 6 files changed, 72 insertions(+), 19 deletions(-) -- 2.48.1

Show replies by date

David Gibson

13 Mar 13 Mar

6:40 a.m.

New subject: [PATCH 1/4] vu_common: Tighten vu_packet_check_range()

This function verifies that the given packet is within the mmap()ed memory region of the vhost-user device. We can do better, however. The packet should be not only within the mmap()ed range, but specifically in the subsection of that range set aside for shared buffers, which starts at dev_region->mmap_offset within there. Signed-off-by: David Gibson --- vu_common.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/vu_common.c b/vu_common.c index 686a09b2..9eea4f2f 100644 --- a/vu_common.c +++ b/vu_common.c @@ -37,10 +37,10 @@ int vu_packet_check_range(void *buf, const char *ptr, size_t len) for (dev_region = buf; dev_region->mmap_addr; dev_region++) { /* NOLINTNEXTLINE(performance-no-int-to-ptr) */ - char *m = (char *)(uintptr_t)dev_region->mmap_addr; + char *m = (char *)(uintptr_t)dev_region->mmap_addr + + dev_region->mmap_offset; - if (m <= ptr && - ptr + len <= m + dev_region->mmap_offset + dev_region->size) + if (m <= ptr && ptr + len <= m + dev_region->size) return 0; } -- 2.48.1

David Gibson

6:40 a.m.

New subject: [PATCH 2/4] packet: More cautious checks to avoid pointer arithmetic UB

packet_check_range and vu_packet_check_range() verify that the packet or section of packet we're interested in lies in the packet buffer pool we expect it to. However, in doing so it doesn't avoid the possibility of an integer overflow while performing pointer arithmetic, with is UB. In fact, AFAICT it's UB even to use arbitrary pointer arithmetic to construct a pointer outside of a known valid buffer. To do this safely, we can't calculate the end of a memory region with pointer addition when then the length as untrusted. Instead we must work out the offset of one memory region within another using pointer subtraction, then do integer checks against the length of the outer region. We then need to be careful about the order of checks so that those integer checks can't themselves overflow. Signed-off-by: David Gibson --- packet.c | 12 +++++++++--- vu_common.c | 10 +++++++--- 2 files changed, 16 insertions(+), 6 deletions(-) diff --git a/packet.c b/packet.c index bcac0375..d1a51a5b 100644 --- a/packet.c +++ b/packet.c @@ -52,9 +52,15 @@ static int packet_check_range(const struct pool *p, const char *ptr, size_t len, return -1; } - if (ptr + len > p->buf + p->buf_size) { - trace("packet range end %p after buffer end %p, %s:%i", - (void *)(ptr + len), (void *)(p->buf + p->buf_size), + if (len > p->buf_size) { + trace("packet range length %zu larger than buffer %zu, %s:%i", + len, p->buf_size, func, line); + return -1; + } + + if ((size_t)(ptr - p->buf) > p->buf_size - len) { + trace("packet range %p, len %zu after buffer end %p, %s:%i", + (void *)ptr, len, (void *)(p->buf + p->buf_size), func, line); return -1; } diff --git a/vu_common.c b/vu_common.c index 9eea4f2f..cefe5e20 100644 --- a/vu_common.c +++ b/vu_common.c @@ -36,11 +36,15 @@ int vu_packet_check_range(void *buf, const char *ptr, size_t len) struct vu_dev_region *dev_region; for (dev_region = buf; dev_region->mmap_addr; dev_region++) { - /* NOLINTNEXTLINE(performance-no-int-to-ptr) */ - char *m = (char *)(uintptr_t)dev_region->mmap_addr + + uintptr_t base_addr = dev_region->mmap_addr + dev_region->mmap_offset; + /* NOLINTNEXTLINE(performance-no-int-to-ptr) */ + const char *base = (const char *)base_addr; + + ASSERT(base_addr >= dev_region->mmap_addr); - if (m <= ptr && ptr + len <= m + dev_region->size) + if (len <= dev_region->size && base <= ptr && + (size_t)(ptr - base) <= dev_region->size - len) return 0; } -- 2.48.1

David Gibson

6:40 a.m.

New subject: [PATCH 3/4] tap: Make size of pool_tap[46] purely a tuning parameter

Currently we attempt to size pool_tap[46] so they have room for the maximum possible number of packets that could fit in pkt_buf (TAP_MSGS). However, the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as the minimum possible L2 frame size. But ETH_ZLEN is based on physical constraints of Ethernet, which don't apply to our virtual devices. It is possible to generate a legitimate frame smaller than this, for example an empty payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long. Further more, the same limit applies for vhost-user, which is not limited by the size of pkt_buf like the other backends. In that case we don't even have full control of the maximum buffer size, so we can't really calculate how many packets could fit in there. If we exceed do TAP_MSGS we'll drop packets, not just use more batches, which is moderately bad. The fact that this needs to be sized just so for correctness not merely for tuning is a fairly non-obvious coupling between different parts of the code. To make this more robust, alter the tap code so it doesn't rely on everything fitting in a single batch of TAP_MSGS packets, instead breaking into multiple batches as necessary. This leaves TAP_MSGS as purely a tuning parameter, which we can freely adjust based on performance measures. Signed-off-by: David Gibson --- packet.c | 13 ++++++++++++- packet.h | 3 +++ passt.h | 2 -- tap.c | 19 ++++++++++++++++--- tap.h | 3 ++- vu_common.c | 5 +++-- 6 files changed, 36 insertions(+), 9 deletions(-) diff --git a/packet.c b/packet.c index d1a51a5b..08076d57 100644 --- a/packet.c +++ b/packet.c @@ -67,6 +67,17 @@ static int packet_check_range(const struct pool *p, const char *ptr, size_t len, return 0; } +/** + * pool_full() - Is a packet pool full? + * @p: Pointer to packet pool + * + * Return: true if the pool is full, false if more packets can be added + */ +bool pool_full(const struct pool *p) +{ + return p->count >= p->size; +} + /** * packet_add_do() - Add data as packet descriptor to given pool * @p: Existing pool @@ -80,7 +91,7 @@ void packet_add_do(struct pool *p, size_t len, const char *start, { size_t idx = p->count; - if (idx >= p->size) { + if (pool_full(p)) { trace("add packet index %zu to pool with size %zu, %s:%i", idx, p->size, func, line); return; diff --git a/packet.h b/packet.h index d099f026..dd18461b 100644 --- a/packet.h +++ b/packet.h @@ -6,6 +6,8 @@ #ifndef PACKET_H #define PACKET_H +#include + /* Maximum size of a single packet stored in pool, including headers */ #define PACKET_MAX_LEN UINT16_MAX @@ -33,6 +35,7 @@ void packet_add_do(struct pool *p, size_t len, const char *start, void *packet_get_do(const struct pool *p, const size_t idx, size_t offset, size_t len, size_t *left, const char *func, int line); +bool pool_full(const struct pool *p); void pool_flush(struct pool *p); #define packet_add(p, len, start) \ diff --git a/passt.h b/passt.h index 8f450912..8693794b 100644 --- a/passt.h +++ b/passt.h @@ -71,8 +71,6 @@ static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data), /* Large enough for ~128 maximum size frames */ #define PKT_BUF_BYTES (8UL << 20) -#define TAP_MSGS \ - DIV_ROUND_UP(PKT_BUF_BYTES, ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t)) extern char pkt_buf [PKT_BUF_BYTES]; diff --git a/tap.c b/tap.c index 182a1151..34e6774f 100644 --- a/tap.c +++ b/tap.c @@ -75,6 +75,9 @@ CHECK_FRAME_LEN(L2_MAX_LEN_PASTA); CHECK_FRAME_LEN(L2_MAX_LEN_PASST); CHECK_FRAME_LEN(L2_MAX_LEN_VU); +#define TAP_MSGS \ + DIV_ROUND_UP(sizeof(pkt_buf), ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t)) + /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */ static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf); static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf); @@ -1042,8 +1045,10 @@ void tap_handler(struct ctx *c, const struct timespec *now) * @c: Execution context * @l2len: Total L2 packet length * @p: Packet buffer + * @now: Current timestamp */ -void tap_add_packet(struct ctx *c, ssize_t l2len, char *p) +void tap_add_packet(struct ctx *c, ssize_t l2len, char *p, + const struct timespec *now) { const struct ethhdr *eh; @@ -1059,9 +1064,17 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p) switch (ntohs(eh->h_proto)) { case ETH_P_ARP: case ETH_P_IP: + if (pool_full(pool_tap4)) { + tap4_handler(c, pool_tap4, now); + pool_flush(pool_tap4); + } packet_add(pool_tap4, l2len, p); break; case ETH_P_IPV6: + if (pool_full(pool_tap6)) { + tap6_handler(c, pool_tap6, now); + pool_flush(pool_tap6); + } packet_add(pool_tap6, l2len, p); break; default: @@ -1142,7 +1155,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) p += sizeof(uint32_t); n -= sizeof(uint32_t); - tap_add_packet(c, l2len, p); + tap_add_packet(c, l2len, p, now); p += l2len; n -= l2len; @@ -1207,7 +1220,7 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now) len > (ssize_t)L2_MAX_LEN_PASTA) continue; - tap_add_packet(c, len, pkt_buf + n); + tap_add_packet(c, len, pkt_buf + n, now); } tap_handler(c, now); diff --git a/tap.h b/tap.h index dd39fd89..6fe3d15d 100644 --- a/tap.h +++ b/tap.h @@ -119,6 +119,7 @@ void tap_sock_update_pool(void *base, size_t size); void tap_backend_init(struct ctx *c); void tap_flush_pools(void); void tap_handler(struct ctx *c, const struct timespec *now); -void tap_add_packet(struct ctx *c, ssize_t l2len, char *p); +void tap_add_packet(struct ctx *c, ssize_t l2len, char *p, + const struct timespec *now); #endif /* TAP_H */ diff --git a/vu_common.c b/vu_common.c index cefe5e20..5e6fd4a8 100644 --- a/vu_common.c +++ b/vu_common.c @@ -195,7 +195,7 @@ static void vu_handle_tx(struct vu_dev *vdev, int index, tap_add_packet(vdev->context, elem[count].out_sg[0].iov_len - hdrlen, (char *)elem[count].out_sg[0].iov_base + - hdrlen); + hdrlen, now); } else { /* vnet header can be in a separate iovec */ if (elem[count].out_num != 2) { @@ -207,7 +207,8 @@ static void vu_handle_tx(struct vu_dev *vdev, int index, } else { tap_add_packet(vdev->context, elem[count].out_sg[1].iov_len, - (char *)elem[count].out_sg[1].iov_base); + (char *)elem[count].out_sg[1].iov_base, + now); } } -- 2.48.1

David Gibson

6:40 a.m.

New subject: [PATCH 4/4] tap: Clarify calculation of TAP_MSGS

The rationale behind the calculation of TAP_MSGS isn't necessarily obvious. It's supposed to be the maximum number of packets that can fit in pkt_buf. However, the calculation is wrong in several ways: * It's based on ETH_ZLEN which isn't meaningful for virtual devices * It always includes the qemu socket header which isn't used for pasta * The size of pkt_buf isn't relevant for vhost-user We've already made sure this is just a tuning parameter, not a hard limit. Clarify what we're calculating here and why. Signed-off-by: David Gibson --- tap.c | 28 ++++++++++++++++++++++------ 1 file changed, 22 insertions(+), 6 deletions(-) diff --git a/tap.c b/tap.c index 34e6774f..3a6fcbe8 100644 --- a/tap.c +++ b/tap.c @@ -75,12 +75,28 @@ CHECK_FRAME_LEN(L2_MAX_LEN_PASTA); CHECK_FRAME_LEN(L2_MAX_LEN_PASST); CHECK_FRAME_LEN(L2_MAX_LEN_VU); -#define TAP_MSGS \ - DIV_ROUND_UP(sizeof(pkt_buf), ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t)) +/* We try size the packet pools so that we can use a single batch for the entire + * packet buffer. This might be exceeded for vhost-user, though, which uses its + * own buffers rather than pkt_buf. + * + * This is just a tuning parameter, the code will work with slightly more + * overhead if it's incorrect. So, we estimate based on the minimum practical + * frame size - an empty UDP datagram - rather than the minimum theoretical + * frame size. + * + * FIXME: Profile to work out how big this actually needs to be to amortise + * per-batch syscall overheads + */ +#define TAP_MSGS_IP4 \ + DIV_ROUND_UP(sizeof(pkt_buf), \ + ETH_HLEN + sizeof(struct iphdr) + sizeof(struct udphdr)) +#define TAP_MSGS_IP6 \ + DIV_ROUND_UP(sizeof(pkt_buf), \ + ETH_HLEN + sizeof(struct ipv6hdr) + sizeof(struct udphdr)) /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */ -static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf); -static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf); +static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS_IP4, pkt_buf); +static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, pkt_buf); #define TAP_SEQS 128 /* Different L4 tuples in one batch */ #define FRAGMENT_MSG_RATE 10 /* # seconds between fragment warnings */ @@ -1418,8 +1434,8 @@ void tap_sock_update_pool(void *base, size_t size) { int i; - pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, base, size); - pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, base, size); + pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS_IP4, base, size); + pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS_IP6, base, size); for (i = 0; i < TAP_SEQS; i++) { tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, base, size); -- 2.48.1

214

Age (days ago)

214

Last active (days ago)

List overview

Download

4 comments

1 participants

participants (1)

David Gibson