On Thu, 14 Mar 2024 16:54:02 +0100
Laurent Vivier <lvivier(a)redhat.com> wrote:

On 3/14/24 16:47, Stefano Brivio wrote:

Yes, I'm wondering if for example this:

	iov[i][TCP_IOV_VNET].iov_base = &vnet_len[i];

causes a prefetch of everything pointed to by iov[i][...], so we would
prefetch (and throw away) each buffer, one by one.

Another interesting experiment to verify whether this is the case could
be to "flush" a few frames at a time (say, 4), with something like this
on top of your original change (completely untested):

	[...]

		if (!((i + 1) % 4) &&
		    !tap_send_frames_passt(c, iov[i - 3], TCP_IOV_NUM * 4))
			break;
	}

	if (i % 4)
		tap_send_frames_passt(c, iov[i - i % 4],
				      TCP_IOV_NUM * (i % 4));

Or maybe we could set vnet_len right after we receive data in the
buffers.

--
Stefano

On Thu, 14 Mar 2024 15:07:48 +0100
Laurent Vivier <lvivier(a)redhat.com> wrote:

Perhaps in the first case we only update a single vnet_len, while in
the second case we have to update a whole array of vnet_len, so more
cache lines are touched?

On 3/13/24 12:37, Stefano Brivio wrote:

...

Weird, it looks like doing one sendmsg() per frame results in a higher
throughput than one sendmsg() per multiple frames, which sounds rather
absurd.

Perhaps we should start looking into what perf(1) reports, in terms of
both syscall overhead and cache misses. I'll have a look later today or
tomorrow -- unless you have other ideas as to why this might happen...

> @@ -390,6 +414,42 @@ static size_t tap_send_frames_passt(const struct ctx *c,
> 	return i;
> }
>
> +/**
> + * tap_send_iov_passt() - Send out multiple prepared frames

...I would argue that this function prepares frames as well. Maybe:

 * tap_send_iov_passt() - Prepare TCP_IOV_VNET parts and send multiple frames

> + * @c:		Execution context
> + * @iov:	Array of frames, each frames is divided in an array of iovecs.
> + *		The first entry of the iovec is updated to point to an
> + *		uint32_t storing the frame length.

 * @iov:	Array of frames, each one a vector of parts, TCP_IOV_VNET blank

> + * @n:		Number of frames in @iov
> + *
> + * Return: number of frames actually sent
> + */
> +static size_t tap_send_iov_passt(const struct ctx *c,
> +				 struct iovec iov[][TCP_IOV_NUM],
> +				 size_t n)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < n; i++) {
> +		uint32_t vnet_len;
> +		int j;
> +
> +		vnet_len = 0;

This could be initialised in the declaration (yes, it's "reset" at
every loop iteration).
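That is, just to illustrate (a sketch, untested like the rest of these
suggestions), the top of the loop would become:

	for (i = 0; i < n; i++) {
		uint32_t vnet_len = 0;	/* initialised, and thus reset,
					 * at each iteration */
		int j;

		for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
			vnet_len += iov[i][j].iov_len;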
> +		for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
> +			vnet_len += iov[i][j].iov_len;
> +
> +		vnet_len = htonl(vnet_len);
> +		iov[i][TCP_IOV_VNET].iov_base = &vnet_len;
> +		iov[i][TCP_IOV_VNET].iov_len = sizeof(vnet_len);
> +
> +		if (!tap_send_frames_passt(c, iov[i], TCP_IOV_NUM))

...which would now send a single frame at a time, but actually it can
already send everything in one shot, because it's using sendmsg(), if
you move it outside of the loop and do something like (untested):

	return tap_send_frames_passt(c, iov, TCP_IOV_NUM * n);

> +			break;
> +	}
> +
> +	return i;
> +
> +}
> +

I tried to do something like that, but I see a performance drop:

static size_t tap_send_iov_passt(const struct ctx *c,
				 struct iovec iov[][TCP_IOV_NUM],
				 size_t n)
{
	unsigned int i;
	uint32_t vnet_len[n];

	for (i = 0; i < n; i++) {
		int j;

		vnet_len[i] = 0;
		for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
			vnet_len[i] += iov[i][j].iov_len;

		vnet_len[i] = htonl(vnet_len[i]);
		iov[i][TCP_IOV_VNET].iov_base = &vnet_len[i];
		iov[i][TCP_IOV_VNET].iov_len = sizeof(uint32_t);
	}

	return tap_send_frames_passt(c, &iov[0][0],
				     TCP_IOV_NUM * n) / TCP_IOV_NUM;
}

iperf3 -c localhost -p 10001 -t 60 -4

before:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  33.0 GBytes  4.72 Gbits/sec    1        sender
[  5]   0.00-60.06  sec  33.0 GBytes  4.72 Gbits/sec             receiver

after:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  18.2 GBytes  2.60 Gbits/sec    0        sender
[  5]   0.00-60.07  sec  18.2 GBytes  2.60 Gbits/sec             receiver
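For completeness, combining the "flush every 4 frames" experiment
sketched earlier in the thread with this array variant might look like
the function below. This is only an illustrative sketch, untested, and
it assumes, as the variant above does, that tap_send_frames_passt()
returns the number of iovec entries sent, and zero on failure:

static size_t tap_send_iov_passt(const struct ctx *c,
				 struct iovec iov[][TCP_IOV_NUM],
				 size_t n)
{
	uint32_t vnet_len[4];	/* frame lengths for the current batch */
	unsigned int i;

	for (i = 0; i < n; i++) {
		uint32_t *len = &vnet_len[i % 4];
		int j;

		*len = 0;
		for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
			*len += iov[i][j].iov_len;
		*len = htonl(*len);

		iov[i][TCP_IOV_VNET].iov_base = len;
		iov[i][TCP_IOV_VNET].iov_len = sizeof(*len);

		/* Every fourth frame, flush the whole batch (rows i - 3
		 * up to i of @iov) with a single sendmsg()
		 */
		if (!((i + 1) % 4) &&
		    !tap_send_frames_passt(c, iov[i - 3], TCP_IOV_NUM * 4))
			return i - 3;	/* frames flushed so far */
	}

	if (i % 4)	/* flush the final, partial batch, if any */
		tap_send_frames_passt(c, iov[i - i % 4],
				      TCP_IOV_NUM * (i % 4));

	return i;
}

The four uint32_t values fit in 16 bytes, that is, within a single
cache line, and each slot is only reused once the batch containing it
has been flushed (sendmsg() copies the data out before returning), so
this would still amortise the syscall cost over four frames while
touching far fewer cache lines than one vnet_len per frame.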