A long time ago, Matej Hrica pointed out a possible buffer overrun when receiving data from the qemu socket. Stefano recently proposed a fix for this, but I don't think it's quite right. This series is a different approach to fixing that problem and a few adjacent ones.

David Gibson (5):
  tap: Better report errors receiving from QEMU socket
  tap: Don't attempt to carry on if we get a bad frame length from qemu
  tap: Don't use EPOLLET on Qemu sockets
  tap: Correctly handle frames of odd length
  tap: Improve handling of partially received frames on qemu socket

 passt.h |  1 -
 tap.c   | 72 ++++++++++++++++++++++++++++++---------------------------
 util.h  | 16 +++++++++++++
 3 files changed, 54 insertions(+), 35 deletions(-)

--
2.45.2
If we get an error on recv() from the QEMU socket, we currently don't print any kind of error. Although this can happen in a non-fatal situation such as a guest restarting, it's unusual enough that we really should report something for debuggability. Add an error message in this case.

Also always report when the qemu connection closes for any reason, not just when it will cause us to exit (--one-off).

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tap.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tap.c b/tap.c
index 44bd4445..a2ae7c2a 100644
--- a/tap.c
+++ b/tap.c
@@ -969,10 +969,10 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
  */
 static void tap_sock_reset(struct ctx *c)
 {
-	if (c->one_off) {
-		info("Client closed connection, exiting");
+	info("Client connection closed%s", c->one_off ? ", exiting" : "");
+
+	if (c->one_off)
 		exit(EXIT_SUCCESS);
-	}
 
 	/* Close the connected socket, wait for a new connection */
 	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
@@ -1005,8 +1005,10 @@ redo:
 	n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);
 
 	if (n < 0) {
-		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK)
+		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
+			err_perror("Error receiving from QEMU socket");
 			tap_sock_reset(c);
+		}
 		return;
 	}
 
--
2.45.2
On Fri, 26 Jul 2024 17:20:27 +1000
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> If we get an error on recv() from the QEMU socket, we currently don't print any kind of error.
> [...]
> @@ -1005,8 +1005,10 @@ redo:
>  	n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);
>  
>  	if (n < 0) {
> -		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK)
> +		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
> +			err_perror("Error receiving from QEMU socket");

Nit I'm fixing up on merge: it's not necessarily QEMU, because libkrun and perhaps others also use this path (in fact, the whole problem was reported as part of the libkrun integration).

Maybe it's obvious to users anyway, but this might seriously confuse somebody who's using e.g. krun on Asahi Linux (is QEMU running, one might wonder):

  https://github.com/slp/krun#motivation
  https://github.com/slp/krun/blob/main/crates/krun/src/net.rs

On top of that, now that we have an error message, I guess it would be nice to state we're resetting the connection, because it's not really obvious: the subsequent message from tap_sock_reset() makes it look like the client decided to close the connection on its own. So I'm changing this to:

  err("Error receiving from guest, resetting connection");

...if you see an issue with it, I'll send another patch.

--
Stefano
On Fri, 26 Jul 2024 13:25:08 +0200
Stefano Brivio <sbrivio(a)redhat.com> wrote:

> On top of that, now that we have an error message, I guess it would be nice to state we're resetting the connection [...] So I'm changing this to:
>
>   err("Error receiving from guest, resetting connection");

...or rather:

  err_perror("Receive error on guest connection, reset");

--
Stefano
On Fri, Jul 26, 2024 at 01:25:08PM +0200, Stefano Brivio wrote:
> Nit I'm fixing up on merge: it's not necessarily QEMU, because libkrun and perhaps others also use this path (in fact, the whole problem was reported as part of the libkrun integration).

Fair point. I'm not sure what term to use to describe specifically a socket using this qemu-originated protocol, though.

> Maybe it's obvious to users anyway, but this might seriously confuse somebody who's using e.g. krun on Asahi Linux (is QEMU running, one might wonder):
>
>   https://github.com/slp/krun#motivation
>   https://github.com/slp/krun/blob/main/crates/krun/src/net.rs
>
> On top of that, now that we have an error message, I guess it would be nice to state we're resetting the connection, because it's not really obvious: the subsequent message from tap_sock_reset() makes it look like the client decided to close the connection on its own.

That's why I changed the message in tap_sock_reset() from "client closed connection" to "client connection closed".

> So I'm changing this to:
>
>   err("Error receiving from guest, resetting connection");
>
> ...if you see an issue with it, I'll send another patch.

Ok.

--
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
If we receive a too-short or too-long frame from the QEMU socket, currently we try to skip it and carry on. That sounds sensible at first blush, but probably isn't wise in practice. If this happens, either (a) qemu has done something seriously unexpected, or (b) we've received corrupt data over a Unix socket. Or, more likely (c), we have a bug elsewhere which has put us out of sync with the stream, so we're trying to read something that's not a frame length as a frame length.

Neither (b) nor (c) is really salvageable with the same stream. Case (a) might be ok, but we can no longer be confident qemu won't do something else we can't cope with. So, instead of just skipping the frame and trying to carry on, log an error and close the socket.

As a bonus, establishing firm bounds on l2len early will allow simplifications to how we deal with the case where a partial frame is recv()ed.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tap.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/tap.c b/tap.c
index a2ae7c2a..08700f65 100644
--- a/tap.c
+++ b/tap.c
@@ -1013,7 +1013,13 @@ redo:
 	}
 
 	while (n > (ssize_t)sizeof(uint32_t)) {
-		ssize_t l2len = ntohl(*(uint32_t *)p);
+		uint32_t l2len = ntohl(*(uint32_t *)p);
+
+		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
+			err("Invalid frame size from QEMU (corrupt stream?)");
+			tap_sock_reset(c);
+			return;
+		}
 
 		p += sizeof(uint32_t);
 		n -= sizeof(uint32_t);
@@ -1027,16 +1033,8 @@ redo:
 			return;
 		}
 
-		/* Complete the partial read above before discarding a malformed
-		 * frame, otherwise the stream will be inconsistent.
-		 */
-		if (l2len < (ssize_t)sizeof(struct ethhdr) ||
-		    l2len > (ssize_t)ETH_MAX_MTU)
-			goto next;
-
 		tap_add_packet(c, l2len, p);
 
-next:
 		p += l2len;
 		n -= l2len;
 	}
--
2.45.2
On Fri, 26 Jul 2024 17:20:28 +1000
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> @@ -1013,7 +1013,13 @@ redo:
>  	}
>  
>  	while (n > (ssize_t)sizeof(uint32_t)) {
> -		ssize_t l2len = ntohl(*(uint32_t *)p);
> +		uint32_t l2len = ntohl(*(uint32_t *)p);
> +
> +		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
> +			err("Invalid frame size from QEMU (corrupt stream?)");

Same as my comment to 1/5, let me change this to:

  err("Bad frame size from guest, resetting connection");

--
Stefano
Currently we set EPOLLET (edge trigger) on the epoll flags for the connected Qemu Unix socket. It's not clear that there's a reason for doing this: for TCP sockets we need to use EPOLLET, because we leave data in the socket buffers for our flow control handling. That consideration doesn't apply to the way we handle the qemu socket, however.

Furthermore, using EPOLLET causes additional complications:

1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however we *do* set it when using pasta mode with --fd. This inconsistency doesn't seem to have broken anything, but it's odd.

2) EPOLLET requires that tap_handler_passt() loop until all data available is read (otherwise we may have data in the buffer but never get an event causing us to read it). We do that with a rather ugly goto.

Worse, our condition for that goto appears to be incorrect. We'll only loop if rem is non-zero, which will only happen if we perform a blocking recv() for a partially received frame. We'll only perform that second recv() if the original recv() resulted in a partially read frame. As far as I can tell the original recv() could end on a frame boundary (never triggering the second recv()) even if there is additional data in the socket buffer. In that circumstance we wouldn't goto redo and could leave unprocessed frames in the qemu socket buffer indefinitely.

This doesn't seem to have caused any problems in practice, but since there's no obvious reason to use EPOLLET here anyway, we might as well get rid of it.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tap.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/tap.c b/tap.c
index 08700f65..df1287ad 100644
--- a/tap.c
+++ b/tap.c
@@ -989,7 +989,7 @@ static void tap_sock_reset(struct ctx *c)
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now)
 {
-	ssize_t n, rem;
+	ssize_t n;
 	char *p;
 
 	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
@@ -997,9 +997,7 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 		return;
 	}
 
-redo:
 	p = pkt_buf;
-	rem = 0;
 
 	tap_flush_pools();
 
@@ -1028,7 +1026,7 @@ redo:
 		 * needs to be blocking.
 		 */
 		if (l2len > n) {
-			rem = recv(c->fd_tap, p + n, l2len - n, 0);
+			ssize_t rem = recv(c->fd_tap, p + n, l2len - n, 0);
 			if ((n += rem) != l2len)
 				return;
 		}
@@ -1040,10 +1038,6 @@ redo:
 	}
 
 	tap_handler(c, now);
-
-	/* We can't use EPOLLET otherwise. */
-	if (rem)
-		goto redo;
 }
 
 /**
@@ -1228,7 +1222,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
 		trace("tap: failed to set SO_SNDBUF to %i", v);
 
 	ref.fd = c->fd_tap;
-	ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
+	ev.events = EPOLLIN | EPOLLRDHUP;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
@@ -1317,7 +1311,7 @@ void tap_sock_init(struct ctx *c)
 	else
 		ref.type = EPOLL_TYPE_TAP_PASTA;
 
-	ev.events = EPOLLIN | EPOLLET | EPOLLRDHUP;
+	ev.events = EPOLLIN | EPOLLRDHUP;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 	return;
--
2.45.2
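[Point (2) above refers to the usual edge-triggered contract: with EPOLLET, an event only fires when new data arrives, so the handler must keep reading until recv() reports EAGAIN, or queued data may never trigger another wakeup. A minimal sketch of that contract, not code from passt; fd and buf stand in for the connected socket and receive buffer:

	/* With EPOLLET, drain the socket fully; a single recv() risks
	 * leaving data queued with no further event to prompt a read.
	 */
	for (;;) {
		ssize_t n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);

		if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
			break;	/* drained: safe to return to epoll */
		if (n <= 0)
			break;	/* error or EOF: reset handled elsewhere */

		/* ...process n bytes... */
	}

With level-triggered epoll, leftover data simply re-arms the event, so no such loop is required for correctness.]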
On Fri, 26 Jul 2024 17:20:29 +1000
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> Currently we set EPOLLET (edge trigger) on the epoll flags for the connected Qemu Unix socket. It's not clear that there's a reason for doing this: for TCP sockets we need to use EPOLLET, because we leave data in the socket buffers for our flow control handling. That consideration doesn't apply to the way we handle the qemu socket, however.

It significantly decreases epoll_wait() overhead on sustained data transfers, because we can read multiple TAP_BUF_SIZE buffers at a time instead of just one.

I can check that now again with current QEMU and kernel versions, plus several fundamental changes in buffer handling, but I don't see a real reason why this shouldn't have changed meanwhile.

--
Stefano
On Fri, 26 Jul 2024 10:00:56 +0200
Stefano Brivio <sbrivio(a)redhat.com> wrote:

> It significantly decreases epoll_wait() overhead on sustained data transfers, because we can read multiple TAP_BUF_SIZE buffers at a time instead of just one.
>
> I can check that now again with current QEMU and kernel versions, plus several fundamental changes in buffer handling, but I don't see a real reason why this shouldn't have changed meanwhile.

Surprisingly, this doesn't affect throughput at all on my setup, no matter the packet size. I didn't check with perf(1), though, and we probably should give all the recent substantial changes a pass with it (it's been a while since I last checked where overhead typically is...).

--
Stefano
On Fri, Jul 26, 2024 at 10:00:56AM +0200, Stefano Brivio wrote:
> It significantly decreases epoll_wait() overhead on sustained data transfers, because we can read multiple TAP_BUF_SIZE buffers at a time instead of just one.

That's a reason to keep the loop, but not EPOLLET itself, AFAICT. I'd be happy enough to put the loop back in as an optimization (although I'd prefer to avoid the goto).

> I can check that now again with current QEMU and kernel versions, plus several fundamental changes in buffer handling, but I don't see a real reason why this shouldn't have changed meanwhile.

--
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
On Fri, 26 Jul 2024 22:12:27 +1000
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> That's a reason to keep the loop, but not EPOLLET itself, AFAICT. I'd be happy enough to put the loop back in as an optimization (although I'd prefer to avoid the goto).

True, we could actually have that loop back without EPOLLET. But the reason why I added EPOLLET, despite the resulting complexity, was surely increased overhead without it. I can't remember (and unfortunately I didn't write this in that commit message from 2021) exactly how that looked, whether we had spurious or more frequent wake-ups or something else.

Maybe that was a side effect of something that's fixed or otherwise changed now, but still we should give this a pass with perf(1) before we try to optimise this again (if it even needs to be optimised, that is).

--
Stefano
On Fri, Jul 26, 2024 at 03:25:38PM +0200, Stefano Brivio wrote:
> True, we could actually have that loop back without EPOLLET. But the reason why I added EPOLLET, despite the resulting complexity, was surely increased overhead without it. I can't remember (and unfortunately I didn't write this in that commit message from 2021) exactly how that looked, whether we had spurious or more frequent wake-ups or something else.
>
> Maybe that was a side effect of something that's fixed or otherwise changed now, but still we should give this a pass with perf(1) before we try to optimise this again (if it even needs to be optimised, that is).

I think we need to understand if and why that's still the case before putting this back in. I can see an obvious reason why the loop might reduce overhead, but not why the EPOLLET flag itself would. If anything, I'd expect level-triggered events to more accurately give us wakeups only exactly when we need them.

Note also that even looping without EPOLLET does have its own cost here: it potentially allows a heavy burst of traffic from qemu to starve processing of events on other sockets.

--
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
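[For illustration, the goto-free, bounded loop being discussed might look something like the sketch below: level-triggered epoll with a cap on reads per wakeup, so leftover data re-arms the event instead of starving other sockets. The cap value and loop structure are assumptions for this sketch, not code from the series; c, pkt_buf and TAP_BUF_BYTES are the existing tap.c names:

	/* Sketch only: drain at most TAP_MAX_READS buffers per wakeup.
	 * TAP_MAX_READS is an invented tuning knob. With level-triggered
	 * epoll, anything still queued after the cap keeps EPOLLIN armed,
	 * so other sockets are serviced between wakeups.
	 */
	#define TAP_MAX_READS	8

	for (int i = 0; i < TAP_MAX_READS; i++) {
		ssize_t n = recv(c->fd_tap, pkt_buf, TAP_BUF_BYTES,
				 MSG_DONTWAIT);

		if (n <= 0)
			break;	/* EAGAIN, error or EOF: handled as before */

		/* ...parse frames and call tap_add_packet() as in
		 * tap_handler_passt()...
		 */
	}
]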
The Qemu socket protocol consists of a 32-bit frame length in network (BE) order, followed by the Ethernet frame itself. As far as I can tell, frames can be any length, with no particular alignment requirement. This means that although pkt_buf itself is aligned, if we have a frame of odd length, frames after it will have their frame length at an unaligned address.

Currently we load the frame length by just casting a char pointer to (uint32_t *) and loading. Some platforms will generate a fatal trap on such an unaligned load. Even if they don't, casting an incorrectly aligned pointer to (uint32_t *) is undefined behaviour, strictly speaking.

Introduce a new helper to safely load a possibly unaligned value here. We assume that the compiler is smart enough to optimize this into nothing on platforms that provide performant unaligned loads. If that turns out not to be the case, we can look at improvements then.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 tap.c  |  2 +-
 util.h | 16 ++++++++++++++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/tap.c b/tap.c
index df1287ad..aca27bb9 100644
--- a/tap.c
+++ b/tap.c
@@ -1011,7 +1011,7 @@
 	}
 
 	while (n > (ssize_t)sizeof(uint32_t)) {
-		uint32_t l2len = ntohl(*(uint32_t *)p);
+		uint32_t l2len = ntohl_unaligned(p);
 
 		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
 			err("Invalid frame size from QEMU (corrupt stream?)");
diff --git a/util.h b/util.h
index 826614cf..5a038e29 100644
--- a/util.h
+++ b/util.h
@@ -10,8 +10,10 @@
 #include <stdarg.h>
 #include <stdbool.h>
 #include <stddef.h>
+#include <stdint.h>
 #include <string.h>
 #include <signal.h>
+#include <arpa/inet.h>
 
 #include "log.h"
 
@@ -116,6 +118,20 @@
 #define htonl_constant(x) (__bswap_constant_32(x))
 #endif
 
+/**
+ * ntohl_unaligned() - Read 32-bit BE value from a possibly unaligned address
+ * @p:		Pointer to the BE value in memory
+ *
+ * Returns: Host-order value of 32-bit BE quantity at @p
+ */
+static inline uint32_t ntohl_unaligned(const void *p)
+{
+	uint32_t val;
+
+	memcpy(&val, p, sizeof(val));
+	return ntohl(val);
+}
+
 #define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 8)
 int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size,
 	     int flags, void *arg);
--
2.45.2
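[To make the alignment hazard concrete, a hypothetical layout, with the offsets invented for illustration: given the 4-byte length prefix, one frame of odd length 61 places the next frame's length field at offset 65 in pkt_buf, which is not 4-byte aligned:

	/* Hypothetical buffer contents after one odd-length frame:
	 *
	 *   offset  0: uint32_t length = 61   (4-byte aligned)
	 *   offset  4: 61 bytes of frame data
	 *   offset 65: next uint32_t length   (odd address)
	 *
	 * ntohl(*(uint32_t *)(pkt_buf + 65)) is undefined behaviour and may
	 * trap on some platforms; the memcpy()-based helper is always
	 * well-defined, and compilers typically fold it into a single load
	 * where unaligned access is cheap.
	 */
	uint32_t next_len = ntohl_unaligned(pkt_buf + 65);
]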
Because the Unix socket to qemu is a stream socket, we have no guarantee of where the boundaries between recv() calls will lie. Typically they will lie on frame boundaries, because that's how qemu will send them, but we can't rely on it.

Currently we handle this case by detecting when we have received a partial frame and performing a blocking recv() to get the remainder, and only then processing the frames. Change it so instead we save the partial frame persistently and include it as the first thing processed next time we receive data from the socket.

This handles a number of (unlikely) cases which previously would not be dealt with correctly:

* If qemu sent a partial frame then waited some time before sending the remainder, previously we could block here for an unacceptably long time

* If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without doing the partial frame handling, which would put us out of sync with the stream from qemu

* If the blocking recv() only received some of the remainder of the frame, not all of it, we'd return, leaving us out of sync with the stream again

Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU). This is probably acceptable because it's an unlikely case in practice. If necessary we could mitigate this by using a true ring buffer.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 passt.h |  1 -
 tap.c   | 36 +++++++++++++++++++++++-------------
 2 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/passt.h b/passt.h
index 4cc2b6f0..d0f31a23 100644
--- a/passt.h
+++ b/passt.h
@@ -60,7 +60,6 @@ static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
 
 #define TAP_BUF_BYTES \
 	ROUND_DOWN(((ETH_MAX_MTU + sizeof(uint32_t)) * 128), PAGE_SIZE)
-#define TAP_BUF_FILL	(TAP_BUF_BYTES - ETH_MAX_MTU - sizeof(uint32_t))
 #define TAP_MSGS \
 	DIV_ROUND_UP(TAP_BUF_BYTES, ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t))
 
diff --git a/tap.c b/tap.c
index aca27bb9..18eec279 100644
--- a/tap.c
+++ b/tap.c
@@ -989,6 +989,8 @@ static void tap_sock_reset(struct ctx *c)
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now)
 {
+	static const char *partial_frame;
+	static ssize_t partial_len = 0;
 	ssize_t n;
 	char *p;
 
@@ -997,11 +999,18 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 		return;
 	}
 
-	p = pkt_buf;
-
 	tap_flush_pools();
 
-	n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);
+	if (partial_len) {
+		/* We have a partial frame from an earlier pass. Move it to the
+		 * start of the buffer, top up with new data, then process all
+		 * of it.
+		 */
+		memmove(pkt_buf, partial_frame, partial_len);
+	}
+
+	n = recv(c->fd_tap, pkt_buf + partial_len, TAP_BUF_BYTES - partial_len,
+		 MSG_DONTWAIT);
 	if (n < 0) {
 		if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
 			err_perror("Error receiving from QEMU socket");
@@ -1010,7 +1019,10 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 		return;
 	}
 
-	while (n > (ssize_t)sizeof(uint32_t)) {
+	p = pkt_buf;
+	n += partial_len;
+
+	while (n >= (ssize_t)sizeof(uint32_t)) {
 		uint32_t l2len = ntohl_unaligned(p);
 
 		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
@@ -1019,24 +1031,22 @@ void tap_handler_passt(struct ctx *c, uint32_t events,
 			return;
 		}
 
+		if (l2len + sizeof(uint32_t) > (size_t)n)
+			/* Leave this incomplete frame for later */
+			break;
+
 		p += sizeof(uint32_t);
 		n -= sizeof(uint32_t);
 
-		/* At most one packet might not fit in a single read, and this
-		 * needs to be blocking.
-		 */
-		if (l2len > n) {
-			ssize_t rem = recv(c->fd_tap, p + n, l2len - n, 0);
-			if ((n += rem) != l2len)
-				return;
-		}
-
 		tap_add_packet(c, l2len, p);
 
 		p += l2len;
 		n -= l2len;
 	}
 
+	partial_len = n;
+	partial_frame = p;
+
 	tap_handler(c, now);
 }
 
--
2.45.2
On Fri, 26 Jul 2024 17:20:31 +1000
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU). This is probably acceptable because it's an unlikely case in practice. If necessary we could mitigate this by using a true ring buffer.

I don't think that that memmove() is a problem if we have a single recv(), even if it happens to be one memmove() for every recv() (guest filling up the buffer, common in throughput tests and bulk transfers), because it's very small in relative terms anyway.

I think the ring buffer would be worth it with multiple recv() calls per epoll wakeup, with EPOLLET.

--
Stefano
On Fri, Jul 26, 2024 at 01:39:13PM +0200, Stefano Brivio wrote:
> I don't think that that memmove() is a problem if we have a single recv(), even if it happens to be one memmove() for every recv() (guest filling up the buffer, common in throughput tests and bulk transfers), because it's very small in relative terms anyway.
>
> I think the ring buffer would be worth it with multiple recv() calls per epoll wakeup, with EPOLLET.

So first, as noted on the earlier patch, I don't think multiple recv()s per wakeup requires EPOLLET, though the reverse is true. Regardless, AFAICT the proportion of memmove()s to data received would not vary regardless of whether we do multiple recv()s per wakeup or the same number of recv()s split across multiple wakeups.

Of course, if the recv()s line up with frame boundaries, as we expect, then it doesn't matter anyway, since we won't get partial frames and we won't need memmove()s.

--
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
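[For reference, one shape the "true ring buffer" mentioned above could take, as a minimal sketch: all names are invented, and the bounce-copy compromise is an assumption, not a design from this thread. Frames are parsed in place in the ring; only a frame that wraps past the end of the buffer is linearized into a small bounce area, so packet handling still sees contiguous bytes and no per-recv() memmove() is needed:

	/* Sketch of a ring buffer replacing the memmove(); names invented. */
	static char ring[TAP_BUF_BYTES];
	static char bounce[ETH_MAX_MTU];

	/* Return a contiguous view of @len bytes starting at ring offset
	 * @off, copying into bounce[] only when the span wraps around.
	 */
	static const char *ring_frame(size_t off, size_t len)
	{
		size_t first = sizeof(ring) - off;

		if (len <= first)
			return ring + off;	/* contiguous: zero-copy */

		memcpy(bounce, ring + off, first);
		memcpy(bounce + first, ring, len - first);
		return bounce;
	}

At most ETH_MAX_MTU bytes are copied, and only in the rare wrap case rather than on every partial recv().]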
On Fri, 26 Jul 2024 17:20:26 +1000
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> David Gibson (5):
>   tap: Better report errors receiving from QEMU socket
>   tap: Don't attempt to carry on if we get a bad frame length from qemu
>   tap: Don't use EPOLLET on Qemu sockets
>   tap: Correctly handle frames of odd length
>   tap: Improve handling of partially received frames on qemu socket

Applied. I'm a bit nervous about 3/5, but anyway the series as a whole is clearly better than the alternative.

--
Stefano