[PATCH v3 00/18] RFC: Unify and simplify tap send path
Although we have an abstraction for the "slow path" (DHCP, NDP) guest
bound packets, the TCP and UDP forwarding paths write directly to the
tap fd. However, it turns out how they send frames to the tap device is
more similar than it originally appears. This series unifies the
low-level tap send functions for TCP and UDP, and makes some clean ups
along the way.

This is based on my earlier outstanding series.

Changes since v2:
 * Rebase on the latest version of the UDP tap/splice socket unification
 * Rework pcap_frame() to return an error rather than printing itself,
   restoring less verbose behaviour for one error in a group of frames
 * Assorted typo fixes in comments and commit messages

Changes since v1:
 * Abstract tap header construction as well as frame send (a number of
   new patches)
 * Remove unneeded flags buf_bytes globals as well
 * Fix bug where we weren't correctly setting iov_len after the move to
   giving variable sized iovecs to send_frames()

David Gibson (18):
  pcap: Introduce pcap_frame() helper
  pcap: Replace pcapm() with pcap_multiple()
  tcp: Combine two parts of passt tap send path together
  tcp: Don't compute total bytes in a message until we need it
  tcp: Improve interface to tcp_l2_buf_flush()
  tcp: Combine two parts of pasta tap send path together
  tap, tcp: Move tap send path to tap.c
  util: Introduce hton*_constant() in place of #ifdefs
  tcp, udp: Use named field initializers in iov_init functions
  util: Parameterize ethernet header initializer macro
  tcp: Remove redundant and incorrect initialization from *_iov_init()
  tcp: Consolidate calculation of total frame size
  tap: Add "tap headers" abstraction
  tcp: Use abstracted tap header
  tap: Use different io vector bases depending on tap type
  udp: Use abstracted tap header
  tap: Improve handling of partial frame sends
  udp: Use tap_send_frames()

 dhcpv6.c |  50 +++--------
 pcap.c   |  86 ++++++------------
 pcap.h   |   3 +-
 tap.c    | 121 ++++++++++++++++++++++++++
 tap.h    |  54 ++++++++++++
 tcp.c    | 254 +++++++++++++------------------------------------
 udp.c    | 213 ++++++----------------------------------
 udp.h    |   2 +-
 util.h   |  47 ++--------
 9 files changed, 310 insertions(+), 520 deletions(-)

-- 
2.39.0
pcap(), pcapm() and pcapmm() duplicate some code for the actual writing
to the capture file. The purpose of pcapm() and pcapmm() not calling
pcap() seems to be to avoid repeatedly calling gettimeofday() and to avoid
printing errors for every packet in a batch if there's a problem. We can
accomplish that while still sharing code by adding a new helper which
takes the packet timestamp as a parameter.
Signed-off-by: David Gibson
pcapm() captures multiple frames from a msghdr; however, the only thing it
cares about in the msghdr is the list of buffers, where it assumes there is
one frame to capture per buffer. That's what we want for its single caller,
but it's not the only obvious choice here (one frame per msghdr would
arguably make more sense in isolation). In addition, pcapm() has logic
that only makes sense in the context of the passt specific path it's called
from: it skips the first 4 bytes of each buffer, because those have the
qemu vnet_len rather than the frame proper.
Make this clearer by replacing pcapm() with pcap_multiple() which more
explicitly takes one struct iovec per frame, and parameterizes how much of
each buffer to skip (i.e. the offset of the frame within the buffer).
Signed-off-by: David Gibson
tcp_l2_buf_flush() open codes the "primary" send of message to the passt
tap interface, but calls tcp_l2_buf_flush_part() to handle the case of a
short send. Combine these two passt-specific operations into
tcp_l2_buf_flush_passt() which is a little cleaner and will enable further
cleanups.
Signed-off-by: David Gibson
tcp[46]_l2_buf_bytes keep track of the total number of bytes we have
queued to send to the tap interface. tcp_l2_buf_flush_passt() uses this
to determine if sendmsg() has sent all the data we requested, or whether
we need to resend a trailing portion.
However, the logic for finding where we're up to in the case of a short
sendmsg() can equally well tell whether we've had one at all, without
knowing the total number in advance. This does require an extra loop after
each sendmsg(), but it's doing simple arithmetic on values we've already
been accessing, and it leads to overall simpler code.
tcp[46]_l2_flags_buf_bytes were being calculated, but never used for
anything, so simply remove them.
Signed-off-by: David Gibson
Currently this takes a msghdr, but the only thing we actually care
about in there is the io vector. Make it take an io vector directly.
We also have a weird side effect of zeroing @buf_used. Just pass this
by value and zero it in the caller instead.
Signed-off-by: David Gibson
tcp_l2_buf_flush() open codes the loop across each frame in a group, but
calls tcp_l2_buf_write_one() to send each frame to the pasta tuntap
device. Combine these two pasta-specific operations into
tcp_l2_buf_flush_pasta() which is a little cleaner and will enable further
cleanups.
Signed-off-by: David Gibson
On Fri, 6 Jan 2023 11:43:10 +1100
David Gibson
tcp_l2_buf_flush() open codes the loop across each frame in a group, but calls tcp_l2_buf_write_one() to send each frame to the pasta tuntap device. Combine these two pasta-specific operations into tcp_l2_buf_flush_pasta() which is a little cleaner and will enable further cleanups.
Signed-off-by: David Gibson
---
 tcp.c | 40 ++++++++++++++++++----------------------
 1 file changed, 18 insertions(+), 22 deletions(-)

diff --git a/tcp.c b/tcp.c
index d96122d..9960a35 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1391,23 +1391,25 @@ static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
 	} while (0)
 
 /**
- * tcp_l2_buf_write_one() - Write a single buffer to tap file descriptor
+ * tcp_l2_buf_flush_pasta() - Send frames on the pasta tap interface
  * @c:		Execution context
- * @iov:	struct iovec item pointing to buffer
- * @ts:	Current timestamp
- *
- * Return: 0 on success, negative error code on failure (tap reset possible)
+ * @iov:	Pointer to array of buffers, one per frame
+ * @n:		Number of buffers/frames to flush
  */
-static int tcp_l2_buf_write_one(struct ctx *c, const struct iovec *iov)
+static void tcp_l2_buf_flush_pasta(struct ctx *c,
+				   const struct iovec *iov, size_t n)
 {
-	if (write(c->fd_tap, (char *)iov->iov_base + 4, iov->iov_len - 4) < 0) {
-		debug("tap write: %s", strerror(errno));
-		if (errno != EAGAIN && errno != EWOULDBLOCK)
-			tap_handler(c, c->fd_tap, EPOLLERR, NULL);
-		return -errno;
-	}
+	size_t i;
 
-	return 0;
+	for (i = 0; i < n; i++) {
+		if (write(c->fd_tap, (char *)iov->iov_base + 4,
+			  iov->iov_len - 4) < 0) {
It took me a moment to miss this during review, but a very long time to figure out later. :( This always sends the first frame in 'iov'.

Surprisingly, pasta_tcp performance tests work just fine most of the time: for data connections we usually end up moving a single frame at a time, and retransmissions hide the issue for control messages.

I just posted a patch on top of this, you don't have to respin, and it's actually more convenient for me to apply this with a fix at this point.

-- 
Stefano
The functions which do the final steps of sending TCP packets on through
the tap interface - tcp_l2_buf_flush*() - no longer have anything that's
actually specific to TCP in them, other than comments and names. Move them
all to tap.c.
Signed-off-by: David Gibson
We have several places where we have fairly ugly #ifdefs on __BYTE_ORDER
where we need network order values in a constant expression (so we can't
use htons() or htonl()). We can do this more cleanly by using a single
__BYTE_ORDER ifdef to define htons_constant() and htonl_constant()
macros, then using those in all the other places.
Signed-off-by: David Gibson
Both the TCP and UDP iov_init functions have some large structure literals
defined in "field order" style. These are pretty hard to read since it's
not obvious what value corresponds to what field. Use named field style
initializers instead to make this clearer.
Signed-off-by: David Gibson
We have separate IPv4 and IPv6 versions of a macro to construct an
initializer for ethernet headers. However, now that we have htons_constant
it's easy to simply parameterize this with the ethernet protocol number.
Signed-off-by: David Gibson
tcp_sock[46]_iov_init() initialize the length of each iovec buffer to
MSS_DEFAULT. That will always be overwritten before use in
tcp_data_to_tap(), so it's redundant. It also wasn't correct, because it
didn't correctly account for the header lengths in all cases.
Signed-off-by: David Gibson
tcp_l2_buf_fill_headers() returns the size of the generated frame including
the ethernet header. The caller then adds on the size of the vnet_len
field to get the total frame size to be passed to the tap device.
Outside the tap code, though, we never care about the ethernet header size,
only the final total size we need to put into an iovec. So, consolidate
the total frame size calculation within tcp_l2_buf_fill_headers().
Signed-off-by: David Gibson
Currently both the TCP and UDP code need to deal in various places with the
details of the L2 headers, and also the tap-specific "vnet_len" header.
This makes abstracting the tap interface to new backends (e.g. vhost-user
or tun) more difficult.
To improve this abstraction, create a new 'tap_hdr' structure which
represents both L2 (always Ethernet at the moment, but might vary in
future) and any additional tap specific headers (such as the qemu socket's
vnet_len field). Provide helper functions and macros to initialize, update
and use it.
Signed-off-by: David Gibson
Update the TCP code to use the tap layer abstractions for initializing and
updating the L2 and lower headers. This will make adding other tap
backends in future easier.
Signed-off-by: David Gibson
Currently tap_send_frames() expects the frames it is given to include the
vnet_len field, even in pasta mode which doesn't use it (although it need
not be initialized in that case). To match, tap_iov_base() and
tap_iov_len() construct the frame in that way.
This will inconvenience future changes, so alter things to set the buffers
to include just the frame needed by the tap backend type.
Signed-off-by: David Gibson
Update the UDP code to use the tap layer abstractions for initializing and
updating the L2 and lower headers. This will make adding other tap
backends in future easier.
Signed-off-by: David Gibson
In passt mode, when writing frames to the qemu socket, we might get a short
send. If we ignored this and carried on, the qemu socket would get out of
sync, because the bytes we actually sent wouldn't correspond to the length
header we already sent. tap_send_frames_passt() handles that by doing a
blocking send to complete the message, but it has a few flaws:
* We only attempt to resend once: although it's unlikely in practice,
nothing prevents the blocking send() from also being short
* We print a debug error if send() returns non-zero, but send() returns
the number of bytes sent, so on success we actually expect it to return
the length of the remaining data, not zero.
Correct those flaws and also be a bit more thorough about reporting
problems here.
Signed-off-by: David Gibson
To send frames on the tap interface, the UDP code uses a fairly complicated
two-level batching. First multiple frames are gathered into a single
"message" for the qemu stream socket, then multiple messages are sent with
sendmmsg(). We now have tap_send_frames() which already deals with sending
a number of frames, including batching and handling partial sends. Use
that to considerably simplify things.
This does make a couple of behavioural changes:
* We used to split messages to keep them under 32kiB (except when a
single frame was longer than that). The comments claim this is
needed to stop qemu from closing the connection, but we don't have any
equivalent logic for TCP. I wasn't able to reproduce the problem with
this series, although it was apparently easy to reproduce earlier.
My suspicion is that there was never an inherent need to keep messages
small, however with larger messages (and default kernel buffer sizes)
the chances of needing more than one resend for partial send()s is
greatly increased. We previously didn't handle that case of multiple
resends correctly, but now we do.
* Previously when we got a partial send on UDP, we would resend the
remainder of the entire "message", including multiple frames. The
common code now only resends the remainder of a single frame, simply
dropping any frames which weren't even partially sent. This is what
TCP always did and is probably a better idea for UDP too.
Signed-off-by: David Gibson
On Fri, 6 Jan 2023 11:43:04 +1100
David Gibson
[...]
For some reason, performance tests consistently get stuck (both TCP and UDP, sometimes throughput, sometimes latency tests) with this series, and not without it, but I don't see any possible relationship with that. I checked debug output and I couldn't find anything obviously wrong there. I just started checking packet captures now... -- Stefano
On Tue, Jan 24, 2023 at 10:20:43PM +0100, Stefano Brivio wrote:
On Fri, 6 Jan 2023 11:43:04 +1100 David Gibson
wrote:
[...]
For some reason, performance tests consistently get stuck (both TCP and UDP, sometimes throughput, sometimes latency tests) with this series, and not without it, but I don't see any possible relationship with that.
Drat, I didn't encounter that. Any chance you could bisect to figure out which patch specifically seems to trigger it?

I wonder if this could be related to the stalls I'm debugging, although those didn't appear on the perf tests and also occur on main. I have now discovered they seem to be masked by large socket buffer sizes - more info at https://bugs.passt.top/show_bug.cgi?id=41
I checked debug output and I couldn't find anything obviously wrong there. I just started checking packet captures now...
Hrm, probably not then. The stalls I'm seeing are associated with lots of partial sends.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Wed, 25 Jan 2023 14:13:44 +1100
David Gibson
On Tue, Jan 24, 2023 at 10:20:43PM +0100, Stefano Brivio wrote:
On Fri, 6 Jan 2023 11:43:04 +1100 David Gibson
wrote:
[...]
For some reason, performance tests consistently get stuck (both TCP and UDP, sometimes throughput, sometimes latency tests) with this series, and not without it, but I don't see any possible relationship with that.
Drat, I didn't encounter that. Any chance you could bisect to figure out which patch specifically seems to trigger it?
I couldn't do it conclusively, yet. :/

Before "tcp: Combine two parts of passt tap send path together", no stalls at all. After that, I'm routinely getting a stall on the perf/passt_udp test, IPv4 host-to-guest with 256B MTU. I know, that test is probably meaningless as a performance figure, but it helps find issues like this, at least. :)

Yes, UDP -- the iperf3 client doesn't connect to the server, passt doesn't crash, but it's gone (zombie) by the time I get to it. I think it's the test scripts terminating it (even though I don't see anything on the terminal), and script.log ends with:

  2023/01/25 21:27:14 socat[3432381] E connect(5, AF=40 cid:94557 port:22, 16): Connection reset by peer
  kex_exchange_identification: Connection closed by remote host
  Connection closed by UNKNOWN port 65535
  ssh-keygen: generating new host keys: RSA
  2023/01/25 21:27:14 socat[3432390] E connect(5, AF=40 cid:94557 port:22, 16): Connection reset by peer
  kex_exchange_identification: Connection closed by remote host
  Connection closed by UNKNOWN port 65535
  2023/01/25 21:27:14 socat[3432393] E connect(5, AF=40 cid:94557 port:22, 16): Connection reset by peer
  kex_exchange_identification: Connection closed by remote host
  Connection closed by UNKNOWN port 65535
  2023/01/25 21:27:14 socat[3432396] E connect(5, AF=40 cid:94557 port:22, 16): Connection reset by peer
  kex_exchange_identification: Connection closed by remote host
  Connection closed by UNKNOWN port 65535
  2023/01/25 21:27:14 socat[3432399] E connect(5, AF=40 cid:94557 port:22, 16): Connection reset by peer
  kex_exchange_identification: Connection closed by remote host
  Connection closed by UNKNOWN port 65535
  DSA ECDSA ED25519
  # Warning: Permanently added 'guest' (ED25519) to the list of known hosts.

which looks like fairly normal retries.

If I run the tests with DEBUG=1, they get stuck during UDP functional testing, so I'm letting that aside for a moment. If I apply the whole series, other tests get stuck (including TCP ones).
There might be something going wrong with iperf3's (TCP) control message exchange. I'm going to run this single test next, and add some debugging prints here and there.
I wonder if this could be related to the stalls I'm debugging, although those didn't appear on the perf tests and also occur on main. I have now discovered they seem to be masked by large socket buffer sizes - more info at https://bugs.passt.top/show_bug.cgi?id=41
Maybe the subsequent failures (or even this one) could actually be related, and triggered somehow by some change in timing. I'm still clueless at the moment. -- Stefano
On Thu, 26 Jan 2023 00:21:33 +0100
Stefano Brivio
On Wed, 25 Jan 2023 14:13:44 +1100 David Gibson
wrote: On Tue, Jan 24, 2023 at 10:20:43PM +0100, Stefano Brivio wrote:
On Fri, 6 Jan 2023 11:43:04 +1100 David Gibson
wrote:
[...]
For some reason, performance tests consistently get stuck (both TCP and UDP, sometimes throughput, sometimes latency tests) with this series, and not without it, but I don't see any possible relationship with that.
Drat, I didn't encounter that. Any chance you could bisect to figure out which patch specifically seems to trigger it?
[...]
I wonder if this could be related to the stalls I'm debugging, although those didn't appear on the perf tests and also occur on main. I have now discovered they seem to be masked by large socket buffer sizes - more info at https://bugs.passt.top/show_bug.cgi?id=41
Maybe the subsequent failures (or even this one) could actually be related, and triggered somehow by some change in timing. I'm still clueless at the moment.
This turned out to be a combination of three different issues:

- left-over patches in my local qemu tree (and build) trying to address the virtio-net TX hang ultimately fixed by kernel commit d71ebe8114b4 ("virtio-net: correctly enable callback during start_xmit"). I'm using the latest upstream now, clean

- the issue you reported at https://bugs.passt.top/show_bug.cgi?id=41, I just posted a patch for it

- the issue introduced by "tcp: Combine two parts of pasta tap send path together", patch also posted

With these three sorted, finally I could apply this series! Apologies for the delay.

-- 
Stefano