On 3/14/24 16:47, Stefano Brivio wrote:
On Thu, 14 Mar 2024 15:07:48 +0100, Laurent Vivier wrote:
On 3/13/24 12:37, Stefano Brivio wrote:
...
@@ -390,6 +414,42 @@ static size_t tap_send_frames_passt(const struct ctx *c,
 	return i;
 }
+/**
+ * tap_send_iov_passt() - Send out multiple prepared frames
...I would argue that this function prepares frames as well. Maybe:
* tap_send_iov_passt() - Prepare TCP_IOV_VNET parts and send multiple frames
+ * @c:		Execution context
+ * @iov:	Array of frames, each frames is divided in an array of iovecs.
+ *		The first entry of the iovec is updated to point to an
+ *		uint32_t storing the frame length.
* @iov: Array of frames, each one a vector of parts, TCP_IOV_VNET blank
+ * @n:		Number of frames in @iov
+ *
+ * Return: number of frames actually sent
+ */
+static size_t tap_send_iov_passt(const struct ctx *c,
+				 struct iovec iov[][TCP_IOV_NUM],
+				 size_t n)
+{
+	unsigned int i;
+
+	for (i = 0; i < n; i++) {
+		uint32_t vnet_len;
+		int j;
+
+		vnet_len = 0;
This could be initialised in the declaration (yes, it's "reset" at every loop iteration).
+		for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
+			vnet_len += iov[i][j].iov_len;
+
+		vnet_len = htonl(vnet_len);
+		iov[i][TCP_IOV_VNET].iov_base = &vnet_len;
+		iov[i][TCP_IOV_VNET].iov_len = sizeof(vnet_len);
+
+		if (!tap_send_frames_passt(c, iov[i], TCP_IOV_NUM))
...which would now send a single frame at a time. But since it uses sendmsg(), it can already send everything in one shot if you move the call outside of the loop and do something like (untested):
return tap_send_frames_passt(c, iov, TCP_IOV_NUM * n);
+			break;
+	}
+
+	return i;
+
+}
+
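One subtlety with hoisting the sendmsg() out of the loop: each frame's TCP_IOV_VNET entry then has to point at its own copy of the length, otherwise every entry aliases the same variable and only the last value written survives. A minimal standalone sketch of the pitfall (hypothetical code, not from the passt tree), which is presumably why the version below switches to a per-frame vnet_len[] array:

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/uio.h>

int main(void)
{
	struct iovec iov[3];
	uint32_t vnet_len;	/* single variable shared by all frames */
	int i;

	for (i = 0; i < 3; i++) {
		vnet_len = htonl(100 * (i + 1));
		iov[i].iov_base = &vnet_len;	/* every entry aliases it */
		iov[i].iov_len = sizeof(vnet_len);
	}

	/* A single batched sendmsg() would only read the vector now,
	 * when all three entries point at the last value written:
	 */
	for (i = 0; i < 3; i++)
		printf("frame %d length: %u\n", i,
		       ntohl(*(uint32_t *)iov[i].iov_base));

	return 0;
}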
I tried to do something like that, but I see a performance drop:
static size_t tap_send_iov_passt(const struct ctx *c,
				 struct iovec iov[][TCP_IOV_NUM],
				 size_t n)
{
	unsigned int i;
	uint32_t vnet_len[n];	/* one length per frame: each TCP_IOV_VNET
				 * entry needs its own storage once the
				 * sendmsg() moves out of the loop
				 */

	for (i = 0; i < n; i++) {
		int j;

		vnet_len[i] = 0;
		for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
			vnet_len[i] += iov[i][j].iov_len;

		vnet_len[i] = htonl(vnet_len[i]);
		iov[i][TCP_IOV_VNET].iov_base = &vnet_len[i];
		iov[i][TCP_IOV_VNET].iov_len = sizeof(uint32_t);
	}

	/* convert the iovec count returned back to a frame count */
	return tap_send_frames_passt(c, &iov[0][0], TCP_IOV_NUM * n) / TCP_IOV_NUM;
}
iperf3 -c localhost -p 10001 -t 60 -4
before:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00 sec  33.0 GBytes  4.72 Gbits/sec    1             sender
[  5]   0.00-60.06 sec  33.0 GBytes  4.72 Gbits/sec                  receiver
after:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00 sec  18.2 GBytes  2.60 Gbits/sec    0             sender
[  5]   0.00-60.07 sec  18.2 GBytes  2.60 Gbits/sec                  receiver
Weird, it looks like doing one sendmsg() per frame results in a higher throughput than one sendmsg() per multiple frames, which sounds rather absurd. Perhaps we should start looking into what perf(1) reports, in terms of both syscall overhead and cache misses.
I'll have a look later today or tomorrow -- unless you have other ideas as to why this might happen...
Perhaps in the first case we only update one vnet_len, while in the second
case we have to update a whole array of vnet_len entries, so more cache
lines are touched?

Thanks,
Laurent
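To put rough numbers on that hypothesis: a single reused vnet_len stays in one hot cache line, while an array of per-frame lengths is spread over several, all written in the fill loop and read back by the one sendmsg() at the end. A back-of-the-envelope sketch (the 64-byte line size and the batch size are assumptions for illustration, not values from the passt code):

#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE	64	/* assumed line size, typical on x86-64 */

int main(void)
{
	size_t n = 32;		/* hypothetical frames per batch */
	size_t lines = (n * sizeof(uint32_t) + CACHE_LINE - 1) / CACHE_LINE;

	/* single reused vnet_len: the same line is written on every
	 * iteration and read back right away by sendmsg()
	 */
	printf("one vnet_len:  1 cache line\n");

	/* vnet_len[n]: the fill loop dirties every line of the array,
	 * and the single sendmsg() reads them all back at the end
	 */
	printf("vnet_len[%zu]:  %zu cache lines\n", n, lines);

	return 0;
}

With these assumptions the array only spans a couple of cache lines, so this alone may not explain the drop; perf(1) data, as Stefano suggests, would be needed to confirm or rule it out.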