[PATCH 0/4] Some improvements to the tap send path
This series has a handful of small improvements to the tap send path. See individual commit messages for the details. I expect this will conflict with Laurent's upcoming work. I hope the conflicts won't be too bad, and indeed will set us up for less duplication there in the end. This is based on Laurent's patch fixing pcap_multiple() not to capture frames we failed to send. David Gibson (4): tap: Extend tap_send_frames() to allow multi-buffer frames tap: Simplify some casts in the tap "slow path" functions tap: Implement tap_send() "slow path" in terms of fast path tap: Rename tap_iov_{base,len} arp.c | 4 +- tap.c | 158 +++++++++++++++++++++++++++++++--------------------------- tap.h | 19 +++---- tcp.c | 20 ++++---- udp.c | 10 ++-- 5 files changed, 111 insertions(+), 100 deletions(-) -- 2.44.0
tap_send_frames() takes a vector of buffers and requires exactly one frame
per buffer. We have future plans where we want to have multiple buffers
per frame in some circumstances, so extend tap_send_frames() to take the
number of buffers per frame as a parameter.
Signed-off-by: David Gibson
On Fri, 8 Mar 2024 17:53:22 +1100
David Gibson
tap_send_frames() takes a vector of buffers and requires exactly one frame per buffer. We have future plans where we want to have multiple buffers per frame in some circumstances, so extend tap_send_frames() to take the number of buffers per frame as a parameter.
Signed-off-by: David Gibson
--- tap.c | 83 +++++++++++++++++++++++++++++++++++++---------------------- tap.h | 3 ++- tcp.c | 8 +++--- udp.c | 2 +- 4 files changed, 59 insertions(+), 37 deletions(-) diff --git a/tap.c b/tap.c index f4051cec..f9e2a8d9 100644 --- a/tap.c +++ b/tap.c @@ -309,21 +309,28 @@ void tap_icmp6_send(const struct ctx *c,
/** * tap_send_frames_pasta() - Send multiple frames to the pasta tap - * @c: Execution context - * @iov: Array of buffers, each containing one frame - * @n: Number of buffers/frames in @iov + * @c: Execution context + * @iov: Array of buffers + * @bufs_per_frame: Number of buffers (iovec entries) per frame + * @nframes: Number of frames to send * + * @iov must have total length @bufs_per_frame * @nframes, with each set of + * @bufs_per_frame contiguous buffers representing a single frame.
Oh, this does pretty much what I was suggesting as a comment to Laurent's "tcp: Replace TCP buffer structure by an iovec array" -- I should have reviewed this first.
+ * * Return: number of frames successfully sent * * #syscalls:pasta write */ static size_t tap_send_frames_pasta(const struct ctx *c, - const struct iovec *iov, size_t n) + const struct iovec *iov, + size_t bufs_per_frame, size_t nframes) { + size_t nbufs = bufs_per_frame * nframes; size_t i;
- for (i = 0; i < n; i++) { - ssize_t rc = write(c->fd_tap, iov[i].iov_base, iov[i].iov_len); + for (i = 0; i < nbufs; i += bufs_per_frame) { + ssize_t rc = writev(c->fd_tap, iov + i, bufs_per_frame); + size_t framelen = iov_size(iov + i, bufs_per_frame);
if (rc < 0) { debug("tap write: %s", strerror(errno)); @@ -340,32 +347,37 @@ static size_t tap_send_frames_pasta(const struct ctx *c, default: die("Write error on tap device, exiting"); } - } else if ((size_t)rc < iov[i].iov_len) { - debug("short write on tuntap: %zd/%zu", - rc, iov[i].iov_len); + } else if ((size_t)rc < framelen) { + debug("short write on tuntap: %zd/%zu", rc, framelen); break; } }
- return i; + return i / bufs_per_frame; }
/** * tap_send_frames_passt() - Send multiple frames to the passt tap - * @c: Execution context - * @iov: Array of buffers, each containing one frame - * @n: Number of buffers/frames in @iov + * @c: Execution context + * @iov: Array of buffers, each containing one frame + * @bufs_per_frame: Number of buffers (iovec entries) per frame + * @nframes: Number of frames to send * + * @iov must have total length @bufs_per_frame * @nframes, with each set of + * @bufs_per_frame contiguous buffers representing a single frame. + * * Return: number of frames successfully sent * * #syscalls:passt sendmsg */ static size_t tap_send_frames_passt(const struct ctx *c, - const struct iovec *iov, size_t n) + const struct iovec *iov, + size_t bufs_per_frame, size_t nframes) { + size_t nbufs = bufs_per_frame * nframes; struct msghdr mh = { .msg_iov = (void *)iov, - .msg_iovlen = n, + .msg_iovlen = nbufs, }; size_t buf_offset; unsigned int i; @@ -376,44 +388,53 @@ static size_t tap_send_frames_passt(const struct ctx *c, return 0;
/* Check for any partial frames due to short send */ - i = iov_skip_bytes(iov, n, sent, &buf_offset); + i = iov_skip_bytes(iov, nbufs, sent, &buf_offset); + + if (i < nbufs && (buf_offset || (i % bufs_per_frame))) { + /* Number of not-fully-sent buffers in the frame */
Strictly speaking, this comment is correct, but "not-fully-sent" seems to imply that rembufs only counts partially sent buffers. It also counts the ones that weren't sent at all. What about: /* Number of partially sent or not sent buffers for the frame */ ? The rest of the series looks good to me, I can change this text on merge if you like the proposal, or even apply it as it is (well, it's correct, after all). -- Stefano
On Thu, Mar 14, 2024 at 08:02:17AM +0100, Stefano Brivio wrote:
On Fri, 8 Mar 2024 17:53:22 +1100 David Gibson
wrote: tap_send_frames() takes a vector of buffers and requires exactly one frame per buffer. We have future plans where we want to have multiple buffers per frame in some circumstances, so extend tap_send_frames() to take the number of buffers per frame as a parameter.
Signed-off-by: David Gibson
--- tap.c | 83 +++++++++++++++++++++++++++++++++++++---------------------- tap.h | 3 ++- tcp.c | 8 +++--- udp.c | 2 +- 4 files changed, 59 insertions(+), 37 deletions(-) diff --git a/tap.c b/tap.c index f4051cec..f9e2a8d9 100644 --- a/tap.c +++ b/tap.c @@ -309,21 +309,28 @@ void tap_icmp6_send(const struct ctx *c,
/** * tap_send_frames_pasta() - Send multiple frames to the pasta tap - * @c: Execution context - * @iov: Array of buffers, each containing one frame - * @n: Number of buffers/frames in @iov + * @c: Execution context + * @iov: Array of buffers + * @bufs_per_frame: Number of buffers (iovec entries) per frame + * @nframes: Number of frames to send * + * @iov must have total length @bufs_per_frame * @nframes, with each set of + * @bufs_per_frame contiguous buffers representing a single frame.
Oh, this does pretty much what I was suggesting as a comment to Laurent's "tcp: Replace TCP buffer structure by an iovec array" -- I should have reviewed this first.
Right.
+ * * Return: number of frames successfully sent * * #syscalls:pasta write */ static size_t tap_send_frames_pasta(const struct ctx *c, - const struct iovec *iov, size_t n) + const struct iovec *iov, + size_t bufs_per_frame, size_t nframes) { + size_t nbufs = bufs_per_frame * nframes; size_t i;
- for (i = 0; i < n; i++) { - ssize_t rc = write(c->fd_tap, iov[i].iov_base, iov[i].iov_len); + for (i = 0; i < nbufs; i += bufs_per_frame) { + ssize_t rc = writev(c->fd_tap, iov + i, bufs_per_frame); + size_t framelen = iov_size(iov + i, bufs_per_frame);
if (rc < 0) { debug("tap write: %s", strerror(errno)); @@ -340,32 +347,37 @@ static size_t tap_send_frames_pasta(const struct ctx *c, default: die("Write error on tap device, exiting"); } - } else if ((size_t)rc < iov[i].iov_len) { - debug("short write on tuntap: %zd/%zu", - rc, iov[i].iov_len); + } else if ((size_t)rc < framelen) { + debug("short write on tuntap: %zd/%zu", rc, framelen); break; } }
- return i; + return i / bufs_per_frame; }
/** * tap_send_frames_passt() - Send multiple frames to the passt tap - * @c: Execution context - * @iov: Array of buffers, each containing one frame - * @n: Number of buffers/frames in @iov + * @c: Execution context + * @iov: Array of buffers, each containing one frame + * @bufs_per_frame: Number of buffers (iovec entries) per frame + * @nframes: Number of frames to send * + * @iov must have total length @bufs_per_frame * @nframes, with each set of + * @bufs_per_frame contiguous buffers representing a single frame. + * * Return: number of frames successfully sent * * #syscalls:passt sendmsg */ static size_t tap_send_frames_passt(const struct ctx *c, - const struct iovec *iov, size_t n) + const struct iovec *iov, + size_t bufs_per_frame, size_t nframes) { + size_t nbufs = bufs_per_frame * nframes; struct msghdr mh = { .msg_iov = (void *)iov, - .msg_iovlen = n, + .msg_iovlen = nbufs, }; size_t buf_offset; unsigned int i; @@ -376,44 +388,53 @@ static size_t tap_send_frames_passt(const struct ctx *c, return 0;
/* Check for any partial frames due to short send */ - i = iov_skip_bytes(iov, n, sent, &buf_offset); + i = iov_skip_bytes(iov, nbufs, sent, &buf_offset); + + if (i < nbufs && (buf_offset || (i % bufs_per_frame))) { + /* Number of not-fully-sent buffers in the frame */
Strictly speaking, this comment is correct, but "not-fully-sent" seems to imply that rembufs only counts partially sent buffers. It also counts the ones that weren't sent at all. What about:
/* Number of partially sent or not sent buffers for the frame */
?
Yeah, the original is technically correct, but easy to misread. Maybe Number of unsent or partially sent buffers for the frame for slightly more brevity than your suggestion?
The rest of the series looks good to me, I can change this text on merge if you like the proposal, or even apply it as it is (well, it's correct, after all).
Yes, please adjust and go ahead. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
We can both remove some variables which differ from others only in type,
and slightly improve type safety.
Signed-off-by: David Gibson
Most times we send frames to the guest it goes via tap_send_frames().
However "slow path" protocols - ARP, ICMP, ICMPv6, DHCP and DHCPv6 - go
via tap_send().
As well as being a semantic duplication, tap_send() contains at least one
serious problem: it doesn't properly handle short sends, which can be fatal
on the qemu socket connection, since frame boundaries will get out of sync.
Rewrite tap_send() to call tap_send_frames(). While we're there, rename it
tap_send_single() for clarity.
Signed-off-by: David Gibson
These two functions are typically used to calculate values to go into the
iov_base and iov_len fields of a struct iovec. They don't have to be used
for that, though. Rename them in terms of what they actually do: calculate
the base address and total length of the complete frame, including both L2
and tap specific headers.
Signed-off-by: David Gibson
On 3/8/24 07:53, David Gibson wrote:
This series has a handful of small improvements to the tap send path. See individual commit messages for the details.
I expect this will conflict with Laurent's upcoming work. I hope the conflicts won't be too bad, and indeed will set us up for less duplication there in the end.
I'm working on patch that devides TCP buffers in several buffers pointed out by an IOV arrays and then provided to tap_send_frames(). I'm going to base my patch on this series. The idea is: A frame is made with 4 iovecs: #define TCP_IOV_VNET 0 #define TCP_IOV_ETH 1 #define TCP_IOV_IP 2 #define TCP_IOV_PAYLOAD 3 #define TCP_IOV_NUM 4 typedef struct iovec tap_iovec_t[TCP_IOV_NUM]; The payload can be TCP + data or TCP + flags: struct tcp_l2_flags_t { struct tcphdr th; char opts[OPT_MSS_LEN + OPT_WS_LEN + 1]; }; struct tcp_l2_payload_t { struct tcphdr th; /* 20 bytes */ uint8_t data[MSS]; /* 65516 bytes */ #ifdef __AVX2__ } __attribute__ ((packed, aligned(32))); #else } __attribute__ ((packed, aligned(__alignof__(unsigned int)))); #endif static struct ethhdr tcp4_eth_src; static struct iphdr tcp4_l2_ip[TCP_FRAMES_MEM]; static struct tcp_l2_payload_t tcp4_l2_payload[TCP_FRAMES_MEM]; static struct tcp_buf_seq_update tcp4_l2_buf_seq_update[TCP_FRAMES_MEM]; static unsigned int tcp4_l2_buf_used; static tap_iovec_t tcp4_l2_iov [TCP_FRAMES_MEM]; Initialization looks like: tcp4_eth_src.h_proto = htons_constant(ETH_P_IP); for (i = 0; i < TCP_FRAMES_MEM; i++) { ... /* headers */ tcp4_l2_ip[i] = iph; tcp4_l2_payload[i].th = (struct tcphdr){ .doff = sizeof(struct tcphdr) / 4, .ack = 1 }; ... /* iovecs */ iov = tcp4_l2_iov[i]; iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src; iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr); iov[TCP_IOV_IP].iov_base = &tcp4_l2_ip[i]; iov[TCP_IOV_IP].iov_len = sizeof(struct iphdr); iov[TCP_IOV_PAYLOAD].iov_base = &tcp4_l2_payload[i]; ... } Then to fill the payload header (data are received by tcp_data_from_sock()): iov = tcp4_l2_iov[tcp4_l2_buf_used++]; iov[TCP_IOV_PAYLOAD].iov_len = tcp_l2_buf_fill_headers(c, conn, iov, plen, check, seq); And the frame is sent using: m = tap_send_iov(c, tcp4_l2_iov, tcp4_l2_buf_used); For the moment (I'll rebase on your series), tap_send_iov_passt() looks like: static size_t tap_send_iov_passt(const struct ctx *c, tap_iovec_t *tap_iov, size_t n) { unsigned int i; for (i = 0; i < n; i++) { uint32_t vnet_len; int j; vnet_len = 0; for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++) vnet_len += tap_iov[i][j].iov_len; tap_iov[i][TCP_IOV_VNET].iov_base = &vnet_len; tap_iov[i][TCP_IOV_VNET].iov_len = sizeof(vnet_len); if (!tap_send_frames_passt(c, tap_iov[i], TCP_IOV_NUM)) break; } return i; } Thanks, Laurent
This is based on Laurent's patch fixing pcap_multiple() not to capture frames we failed to send.
David Gibson (4): tap: Extend tap_send_frames() to allow multi-buffer frames tap: Simplify some casts in the tap "slow path" functions tap: Implement tap_send() "slow path" in terms of fast path tap: Rename tap_iov_{base,len}
arp.c | 4 +- tap.c | 158 +++++++++++++++++++++++++++++++--------------------------- tap.h | 19 +++---- tcp.c | 20 ++++---- udp.c | 10 ++-- 5 files changed, 111 insertions(+), 100 deletions(-)
On Fri, 8 Mar 2024 09:18:48 +0100
Laurent Vivier
On 3/8/24 07:53, David Gibson wrote:
This series has a handful of small improvements to the tap send path. See individual commit messages for the details.
I expect this will conflict with Laurent's upcoming work. I hope the conflicts won't be too bad, and indeed will set us up for less duplication there in the end.
I'm working on patch that devides TCP buffers in several buffers pointed out by an IOV arrays and then provided to tap_send_frames(). I'm going to base my patch on this series.
The idea is:
A frame is made with 4 iovecs:
#define TCP_IOV_VNET 0 #define TCP_IOV_ETH 1 #define TCP_IOV_IP 2 #define TCP_IOV_PAYLOAD 3 #define TCP_IOV_NUM 4 typedef struct iovec tap_iovec_t[TCP_IOV_NUM];
Except for the typedef :) (I'm actively trying to discourage them) ...this looks very neat. I would suggest that as soon as you have something barely spitting out pseudo-correct bytes, you give it a try with perf(1). I'm quite concerned that we might end up killing throughput, especially without vhost-user, even though in theory it all sounds nice and byte-saving. -- Stefano
On 3/8/24 09:34, Stefano Brivio wrote:
On Fri, 8 Mar 2024 09:18:48 +0100 Laurent Vivier
wrote: On 3/8/24 07:53, David Gibson wrote:
This series has a handful of small improvements to the tap send path. See individual commit messages for the details.
I expect this will conflict with Laurent's upcoming work. I hope the conflicts won't be too bad, and indeed will set us up for less duplication there in the end.
I'm working on patch that devides TCP buffers in several buffers pointed out by an IOV arrays and then provided to tap_send_frames(). I'm going to base my patch on this series.
The idea is:
A frame is made with 4 iovecs:
#define TCP_IOV_VNET 0 #define TCP_IOV_ETH 1 #define TCP_IOV_IP 2 #define TCP_IOV_PAYLOAD 3 #define TCP_IOV_NUM 4 typedef struct iovec tap_iovec_t[TCP_IOV_NUM];
Except for the typedef :) (I'm actively trying to discourage them)
ok...
...this looks very neat.
I would suggest that as soon as you have something barely spitting out pseudo-correct bytes, you give it a try with perf(1). I'm quite concerned that we might end up killing throughput, especially without vhost-user, even though in theory it all sounds nice and byte-saving.
I hope to have something to test today... Thanks, Laurent
On 3/8/24 09:34, Stefano Brivio wrote:
On Fri, 8 Mar 2024 09:18:48 +0100 Laurent Vivier
wrote: On 3/8/24 07:53, David Gibson wrote:
This series has a handful of small improvements to the tap send path. See individual commit messages for the details.
I expect this will conflict with Laurent's upcoming work. I hope the conflicts won't be too bad, and indeed will set us up for less duplication there in the end.
I'm working on patch that devides TCP buffers in several buffers pointed out by an IOV arrays and then provided to tap_send_frames(). I'm going to base my patch on this series.
The idea is:
A frame is made with 4 iovecs:
#define TCP_IOV_VNET 0 #define TCP_IOV_ETH 1 #define TCP_IOV_IP 2 #define TCP_IOV_PAYLOAD 3 #define TCP_IOV_NUM 4 typedef struct iovec tap_iovec_t[TCP_IOV_NUM];
Except for the typedef :) (I'm actively trying to discourage them) ...this looks very neat.
I would suggest that as soon as you have something barely spitting out pseudo-correct bytes, you give it a try with perf(1). I'm quite concerned that we might end up killing throughput, especially without vhost-user, even though in theory it all sounds nice and byte-saving.
Using iovec improves performance: iperf3 -c localhost -p 10001 -t 60 -4 iovec [ 5] 0.00-60.00 sec 32.9 GBytes 4.72 Gbits/sec 0 sender [ 5] 0.00-60.06 sec 32.9 GBytes 4.71 Gbits/sec receiver buf_t [ 5] 0.00-60.00 sec 15.9 GBytes 2.27 Gbits/sec 0 sender [ 5] 0.00-60.06 sec 15.9 GBytes 2.27 Gbits/sec receiver iperf3 -c localhost -p 10001 -t 60 -6 iovec [ 5] 0.00-60.00 sec 26.0 GBytes 3.73 Gbits/sec 0 sender [ 5] 0.00-60.06 sec 26.0 GBytes 3.72 Gbits/sec receiver buf_t [ 5] 0.00-60.00 sec 15.7 GBytes 2.25 Gbits/sec 0 sender [ 5] 0.00-60.07 sec 15.7 GBytes 2.25 Gbits/sec receiver Thanks, Laurent
On Fri, 8 Mar 2024 16:49:20 +0100
Laurent Vivier
On 3/8/24 09:34, Stefano Brivio wrote:
On Fri, 8 Mar 2024 09:18:48 +0100 Laurent Vivier
wrote: On 3/8/24 07:53, David Gibson wrote:
This series has a handful of small improvements to the tap send path. See individual commit messages for the details.
I expect this will conflict with Laurent's upcoming work. I hope the conflicts won't be too bad, and indeed will set us up for less duplication there in the end.
I'm working on patch that devides TCP buffers in several buffers pointed out by an IOV arrays and then provided to tap_send_frames(). I'm going to base my patch on this series.
The idea is:
A frame is made with 4 iovecs:
#define TCP_IOV_VNET 0 #define TCP_IOV_ETH 1 #define TCP_IOV_IP 2 #define TCP_IOV_PAYLOAD 3 #define TCP_IOV_NUM 4 typedef struct iovec tap_iovec_t[TCP_IOV_NUM];
Except for the typedef :) (I'm actively trying to discourage them) ...this looks very neat.
I would suggest that as soon as you have something barely spitting out pseudo-correct bytes, you give it a try with perf(1). I'm quite concerned that we might end up killing throughput, especially without vhost-user, even though in theory it all sounds nice and byte-saving.
Using iovec improves performance:
iperf3 -c localhost -p 10001 -t 60 -4
iovec
[ 5] 0.00-60.00 sec 32.9 GBytes 4.72 Gbits/sec 0 sender [ 5] 0.00-60.06 sec 32.9 GBytes 4.71 Gbits/sec receiver
buf_t
[ 5] 0.00-60.00 sec 15.9 GBytes 2.27 Gbits/sec 0 sender [ 5] 0.00-60.06 sec 15.9 GBytes 2.27 Gbits/sec receiver
! ...it must be some alignment matter. Nice. I guess it doesn't happen with large windows (-w ...) or messages (-l 1M), but it looks promising enough with small ones. -- Stefano
On Fri, Mar 08, 2024 at 09:18:48AM +0100, Laurent Vivier wrote:
On 3/8/24 07:53, David Gibson wrote:
This series has a handful of small improvements to the tap send path. See individual commit messages for the details.
I expect this will conflict with Laurent's upcoming work. I hope the conflicts won't be too bad, and indeed will set us up for less duplication there in the end.
I'm working on patch that devides TCP buffers in several buffers pointed out by an IOV arrays and then provided to tap_send_frames(). I'm going to base my patch on this series.
The idea is:
A frame is made with 4 iovecs:
#define TCP_IOV_VNET 0 #define TCP_IOV_ETH 1 #define TCP_IOV_IP 2 #define TCP_IOV_PAYLOAD 3 #define TCP_IOV_NUM 4 typedef struct iovec tap_iovec_t[TCP_IOV_NUM];
General concept seems good. Unless you have a specific reason to do so, I'd suggest keeping VNET and ETH - i.e. L2 and everything below it - together. As well as just making one less buffer for each frame, I think that will make life easier if we want to add an L2 interface with non-Ethernet framing (e.g. "tun" instead of "tap").
The payload can be TCP + data or TCP + flags:
struct tcp_l2_flags_t { struct tcphdr th; char opts[OPT_MSS_LEN + OPT_WS_LEN + 1]; };
struct tcp_l2_payload_t { struct tcphdr th; /* 20 bytes */ uint8_t data[MSS]; /* 65516 bytes */ #ifdef __AVX2__ } __attribute__ ((packed, aligned(32))); #else } __attribute__ ((packed, aligned(__alignof__(unsigned int)))); #endif
Not sure if we'd be better off with this approach, or having both IP and L4 headers together, and the L4 payload in another. The latter would mean duplicating the TCP or UDP headers between the IPv4 and IPv6 cases, but it allows the data buffers - which will be used directly on the socket side. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On 3/8/24 13:42, David Gibson wrote: ...
The payload can be TCP + data or TCP + flags:
struct tcp_l2_flags_t { struct tcphdr th; char opts[OPT_MSS_LEN + OPT_WS_LEN + 1]; };
struct tcp_l2_payload_t { struct tcphdr th; /* 20 bytes */ uint8_t data[MSS]; /* 65516 bytes */ #ifdef __AVX2__ } __attribute__ ((packed, aligned(32))); #else } __attribute__ ((packed, aligned(__alignof__(unsigned int)))); #endif
Not sure if we'd be better off with this approach, or having both IP and L4 headers together, and the L4 payload in another. The latter would mean duplicating the TCP or UDP headers between the IPv4 and IPv6 cases, but it allows the data buffers - which will be used directly on the socket side.
The main idea behind using iovec is to separate tcphdr and iphdr structures, allowing for direct access to the pointers of tcphdr and iphdr without concerns about pointer alignment. Moreover, having tcphdr and TCP payload in the same vector can make sense when computing the TCP checksum, as the checksum includes tcphdr and the payload. Thanks, Laurent
On Fri, Mar 08, 2024 at 05:49:15PM +0100, Laurent Vivier wrote:
On 3/8/24 13:42, David Gibson wrote: ...
The payload can be TCP + data or TCP + flags:
struct tcp_l2_flags_t { struct tcphdr th; char opts[OPT_MSS_LEN + OPT_WS_LEN + 1]; };
struct tcp_l2_payload_t { struct tcphdr th; /* 20 bytes */ uint8_t data[MSS]; /* 65516 bytes */ #ifdef __AVX2__ } __attribute__ ((packed, aligned(32))); #else } __attribute__ ((packed, aligned(__alignof__(unsigned int)))); #endif
Not sure if we'd be better off with this approach, or having both IP and L4 headers together, and the L4 payload in another. The latter would mean duplicating the TCP or UDP headers between the IPv4 and IPv6 cases, but it allows the data buffers - which will be used directly on the socket side.
The main idea behind using iovec is to separate tcphdr and iphdr structures, allowing for direct access to the pointers of tcphdr and iphdr without concerns about pointer alignment. Moreover, having tcphdr and TCP payload in the same vector can make sense when computing the TCP checksum, as the checksum includes tcphdr and the payload.
Ah, yes. That makes sense. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On 3/8/24 13:42, David Gibson wrote:
On Fri, Mar 08, 2024 at 09:18:48AM +0100, Laurent Vivier wrote:
On 3/8/24 07:53, David Gibson wrote:
This series has a handful of small improvements to the tap send path. See individual commit messages for the details.
I expect this will conflict with Laurent's upcoming work. I hope the conflicts won't be too bad, and indeed will set us up for less duplication there in the end.
I'm working on patch that devides TCP buffers in several buffers pointed out by an IOV arrays and then provided to tap_send_frames(). I'm going to base my patch on this series.
The idea is:
A frame is made with 4 iovecs:
#define TCP_IOV_VNET 0 #define TCP_IOV_ETH 1 #define TCP_IOV_IP 2 #define TCP_IOV_PAYLOAD 3 #define TCP_IOV_NUM 4 typedef struct iovec tap_iovec_t[TCP_IOV_NUM];
General concept seems good. Unless you have a specific reason to do so, I'd suggest keeping VNET and ETH - i.e. L2 and everything below it - together. As well as just making one less buffer for each frame, I think that will make life easier if we want to add an L2 interface with non-Ethernet framing (e.g. "tun" instead of "tap").
In fact keeping vnet header separated from eth header makes easier to remove it from the list to pass the iovec array to pcap functions and to pasta send function (can use iovec[1] and iovcount - 1). Thanks, Laurent
On Mon, Mar 11, 2024 at 12:02:08PM +0100, Laurent Vivier wrote:
On 3/8/24 13:42, David Gibson wrote:
On Fri, Mar 08, 2024 at 09:18:48AM +0100, Laurent Vivier wrote:
On 3/8/24 07:53, David Gibson wrote:
This series has a handful of small improvements to the tap send path. See individual commit messages for the details.
I expect this will conflict with Laurent's upcoming work. I hope the conflicts won't be too bad, and indeed will set us up for less duplication there in the end.
I'm working on patch that devides TCP buffers in several buffers pointed out by an IOV arrays and then provided to tap_send_frames(). I'm going to base my patch on this series.
The idea is:
A frame is made with 4 iovecs:
#define TCP_IOV_VNET 0 #define TCP_IOV_ETH 1 #define TCP_IOV_IP 2 #define TCP_IOV_PAYLOAD 3 #define TCP_IOV_NUM 4 typedef struct iovec tap_iovec_t[TCP_IOV_NUM];
General concept seems good. Unless you have a specific reason to do so, I'd suggest keeping VNET and ETH - i.e. L2 and everything below it - together. As well as just making one less buffer for each frame, I think that will make life easier if we want to add an L2 interface with non-Ethernet framing (e.g. "tun" instead of "tap").
In fact keeping vnet header separated from eth header makes easier to remove it from the list to pass the iovec array to pcap functions and to pasta send function (can use iovec[1] and iovcount - 1).
Ok, fair point. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Fri, 8 Mar 2024 17:53:21 +1100
David Gibson
This series has a handful of small improvements to the tap send path. See individual commit messages for the details.
I expect this will conflict with Laurent's upcoming work. I hope the conflicts won't be too bad, and indeed will set us up for less duplication there in the end.
This is based on Laurent's patch fixing pcap_multiple() not to capture frames we failed to send.
David Gibson (4): tap: Extend tap_send_frames() to allow multi-buffer frames tap: Simplify some casts in the tap "slow path" functions tap: Implement tap_send() "slow path" in terms of fast path tap: Rename tap_iov_{base,len}
Applied, with your new version of the comment to 'rembufs' in 1/4. -- Stefano
participants (3)
-
David Gibson
-
Laurent Vivier
-
Stefano Brivio