On Thu, Mar 14, 2024 at 05:26:17PM +0100, Stefano Brivio wrote:
> On Thu, 14 Mar 2024 16:54:02 +0100
> Laurent Vivier <lvivier(a)redhat.com> wrote:
> 
> > On 3/14/24 16:47, Stefano Brivio wrote:
> > > On Thu, 14 Mar 2024 15:07:48 +0100
> > > Laurent Vivier <lvivier(a)redhat.com> wrote:
> > > 
> > >> On 3/13/24 12:37, Stefano Brivio wrote:
> > >> ...
> > >>>> @@ -390,6 +414,42 @@ static size_t tap_send_frames_passt(const struct ctx *c,
> > >>>>  	return i;
> > >>>>  }
> > >>>>  
> > >>>> +/**
> > >>>> + * tap_send_iov_passt() - Send out multiple prepared frames
> > >>> 
> > >>> ...I would argue that this function prepares frames as well. Maybe:
> > >>> 
> > >>>  * tap_send_iov_passt() - Prepare TCP_IOV_VNET parts and send multiple frames
> > >>> 
> > >>>> + * @c:		Execution context
> > >>>> + * @iov:	Array of frames, each frames is divided in an array of iovecs.
> > >>>> + *		The first entry of the iovec is updated to point to an
> > >>>> + *		uint32_t storing the frame length.
> > >>> 
> > >>>  * @iov:	Array of frames, each one a vector of parts, TCP_IOV_VNET blank
> > >>> 
> > >>>> + * @n:		Number of frames in @iov
> > >>>> + *
> > >>>> + * Return: number of frames actually sent
> > >>>> + */
> > >>>> +static size_t tap_send_iov_passt(const struct ctx *c,
> > >>>> +				 struct iovec iov[][TCP_IOV_NUM],
> > >>>> +				 size_t n)
> > >>>> +{
> > >>>> +	unsigned int i;
> > >>>> +
> > >>>> +	for (i = 0; i < n; i++) {
> > >>>> +		uint32_t vnet_len;
> > >>>> +		int j;
> > >>>> +
> > >>>> +		vnet_len = 0;
> > >>> 
> > >>> This could be initialised in the declaration (yes, it's "reset" at
> > >>> every loop iteration).
> > >>> 
> > >>>> +		for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
> > >>>> +			vnet_len += iov[i][j].iov_len;
> > >>>> +
> > >>>> +		vnet_len = htonl(vnet_len);
> > >>>> +		iov[i][TCP_IOV_VNET].iov_base = &vnet_len;
> > >>>> +		iov[i][TCP_IOV_VNET].iov_len = sizeof(vnet_len);
> > >>>> +
> > >>>> +		if (!tap_send_frames_passt(c, iov[i], TCP_IOV_NUM))
> > >>> 
> > >>> ...which would now send a single frame at a time, but actually it can
> > >>> already send everything in one shot because it's using sendmsg(), if you
> > >>> move it outside of the loop and do something like (untested):
> > >>> 
> > >>> 	return tap_send_frames_passt(c, iov, TCP_IOV_NUM * n);
> > >>> 
> > >>>> +			break;
> > >>>> +	}
> > >>>> +
> > >>>> +	return i;
> > >>>> +
> > >>>> +}
> > >>>> +
> > >> 
> > >> I tried to do something like that but I have a performance drop:
> > >> 
> > >> static size_t tap_send_iov_passt(const struct ctx *c,
> > >>                                  struct iovec iov[][TCP_IOV_NUM],
> > >>                                  size_t n)
> > >> {
> > >>         unsigned int i;
> > >>         uint32_t vnet_len[n];
> > >> 
> > >>         for (i = 0; i < n; i++) {
> > >>                 int j;
> > >> 
> > >>                 vnet_len[i] = 0;
> > >>                 for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
> > >>                         vnet_len[i] += iov[i][j].iov_len;
> > >> 
> > >>                 vnet_len[i] = htonl(vnet_len[i]);
> > >>                 iov[i][TCP_IOV_VNET].iov_base = &vnet_len[i];
> > >>                 iov[i][TCP_IOV_VNET].iov_len = sizeof(uint32_t);
> > >>         }
> > >> 
> > >>         return tap_send_frames_passt(c, &iov[0][0], TCP_IOV_NUM * n) / TCP_IOV_NUM;
> > >> }
> > >> 
> > >> iperf3 -c localhost -p 10001 -t 60 -4
> > >> 
> > >> before:
> > >> [ ID] Interval           Transfer     Bitrate         Retr
> > >> [  5]   0.00-60.00  sec  33.0 GBytes  4.72 Gbits/sec    1             sender
> > >> [  5]   0.00-60.06  sec  33.0 GBytes  4.72 Gbits/sec                  receiver
> > >> 
> > >> after:
> > >> [ ID] Interval           Transfer     Bitrate         Retr
> > >> [  5]   0.00-60.00  sec  18.2 GBytes  2.60 Gbits/sec    0             sender
> > >> [  5]   0.00-60.07  sec  18.2 GBytes  2.60 Gbits/sec                  receiver
> > > 
> > > Weird, it looks like doing one sendmsg() per frame results in a higher
> > > throughput than one sendmsg() per multiple frames, which sounds rather
> > > absurd.
> > > Perhaps we should start looking into what perf(1) reports, in
> > > terms of both syscall overhead and cache misses.
> > > 
> > > I'll have a look later today or tomorrow -- unless you have other
> > > ideas as to why this might happen...
> > 
> > Perhaps in first case we only update one vnet_len and in the second case we have to update
> > an array of vnet_len, so there is an use of more cache lines?

We should be able to test this relatively easily, yes? By updating all
the vnet_len then using a single sendmsg().

> Yes, I'm wondering if for example this:
> 
> 	iov[i][TCP_IOV_VNET].iov_base = &vnet_len[i];
> 
> causes a prefetch of everything pointed by iov[i][...], so we would
> prefetch (and throw away) each buffer, one by one.
> 
> Another interesting experiment to verify if this is the case could be
> to "flush" a few frames at a time (say, 4), with something like this
> on top of your original change (completely untested):
> 
> 	[...]
> 
> 		if (!((i + 1) % 4) &&
> 		    !tap_send_frames_passt(c, iov[i / 4], TCP_IOV_NUM * 4))
> 			break;
> 	}
> 
> 	if ((i + 1) % 4) {
> 		tap_send_frames_passt(c, iov[i / 4],
> 				      TCP_IOV_NUM * ((i + 1) % 4));
> 	}
> 
> Or maybe we could set vnet_len right after we receive data in the
> buffers.

I really hope we can avoid this. If we want to allow IPv4<->IPv6
translation, then we can't know the vnet_len until after we've done our
routing / translation.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you. NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
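
As a point of reference for the batched-flush idea sketched above, here is a
minimal, self-contained illustration (an assumption-laden sketch, not passt
code): each frame carries a 32-bit big-endian length prefix playing the role
of vnet_len, frames are built as groups of iovec entries, and a single
sendmsg() flushes a small batch of them. FRAME_PARTS, BATCH and send_batch()
are made-up stand-ins for TCP_IOV_NUM and tap_send_frames_passt(), and the
batch start index is taken as i + 1 - pending, i.e. the first frame that has
not been flushed yet.

/* Standalone illustration only: FRAME_PARTS, BATCH and send_batch() are
 * made-up names, not passt's TCP_IOV_NUM / tap_send_frames_passt(). */
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define FRAME_PARTS	2	/* parts per frame: length prefix + payload */
#define BATCH		4	/* frames flushed per sendmsg() */

/* Send @n prepared frames (@n * FRAME_PARTS iovec entries) in one sendmsg() */
static int send_batch(int fd, struct iovec (*iov)[FRAME_PARTS], size_t n)
{
	struct msghdr mh = { .msg_iov = iov[0], .msg_iovlen = n * FRAME_PARTS };

	return sendmsg(fd, &mh, MSG_NOSIGNAL) < 0 ? -1 : 0;
}

int main(void)
{
	struct iovec iov[8][FRAME_PARTS];
	uint32_t vnet_len[8];
	char payload[8][64];
	size_t i, pending = 0;
	int sv[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return 1;

	for (i = 0; i < 8; i++) {
		size_t len = snprintf(payload[i], sizeof(payload[i]),
				      "frame %zu", i) + 1;

		/* 32-bit big-endian length prefix, one slot per frame so the
		 * pointer stays valid until the whole batch is flushed */
		vnet_len[i] = htonl(len);
		iov[i][0].iov_base = &vnet_len[i];
		iov[i][0].iov_len = sizeof(vnet_len[i]);
		iov[i][1].iov_base = payload[i];
		iov[i][1].iov_len = len;

		/* Flush once BATCH frames are prepared: the batch starts at
		 * the first still-pending frame, i + 1 - pending */
		if (++pending == BATCH) {
			if (send_batch(sv[0], &iov[i + 1 - pending], pending))
				return 1;
			pending = 0;
		}
	}

	/* Flush whatever is left over */
	if (pending && send_batch(sv[0], &iov[i - pending], pending))
		return 1;

	close(sv[0]);
	close(sv[1]);
	return 0;
}

Whether flushing in batches actually helps is exactly the open question in
the thread; the sketch only shows the bookkeeping that keeping one length
slot per frame requires, since every prefix must stay addressable until the
iovec array pointing at it has been handed to sendmsg().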