On Thu, Mar 14, 2024 at 05:26:17PM +0100, Stefano Brivio wrote:
> On Thu, 14 Mar 2024 16:54:02 +0100
> Laurent Vivier <lvivier(a)redhat.com> wrote:
> 
> > On 3/14/24 16:47, Stefano Brivio wrote:
> > > On Thu, 14 Mar 2024 15:07:48 +0100
> > > Laurent Vivier <lvivier(a)redhat.com> wrote:
> > > 
> > >> On 3/13/24 12:37, Stefano Brivio wrote:
> > >> ...
> > >>>> @@ -390,6 +414,42 @@ static size_t tap_send_frames_passt(const struct ctx *c,
> > >>>>  	return i;
> > >>>>  }
> > >>>>  
> > >>>> +/**
> > >>>> + * tap_send_iov_passt() - Send out multiple prepared frames
> > >>> 
> > >>> ...I would argue that this function prepares frames as well. Maybe:
> > >>> 
> > >>>  * tap_send_iov_passt() - Prepare TCP_IOV_VNET parts and send multiple frames
> > >>> 
> > >>>> + * @c:		Execution context
> > >>>> + * @iov:	Array of frames, each frames is divided in an array of iovecs.
> > >>>> + *		The first entry of the iovec is updated to point to an
> > >>>> + *		uint32_t storing the frame length.
> > >>> 
> > >>>  * @iov:	Array of frames, each one a vector of parts, TCP_IOV_VNET blank
> > >>> 
> > >>>> + * @n:		Number of frames in @iov
> > >>>> + *
> > >>>> + * Return: number of frames actually sent
> > >>>> + */
> > >>>> +static size_t tap_send_iov_passt(const struct ctx *c,
> > >>>> +				 struct iovec iov[][TCP_IOV_NUM],
> > >>>> +				 size_t n)
> > >>>> +{
> > >>>> +	unsigned int i;
> > >>>> +
> > >>>> +	for (i = 0; i < n; i++) {
> > >>>> +		uint32_t vnet_len;
> > >>>> +		int j;
> > >>>> +
> > >>>> +		vnet_len = 0;
> > >>> 
> > >>> This could be initialised in the declaration (yes, it's "reset" at
> > >>> every loop iteration).
> > >>> 
> > >>>> +		for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
> > >>>> +			vnet_len += iov[i][j].iov_len;
> > >>>> +
> > >>>> +		vnet_len = htonl(vnet_len);
> > >>>> +		iov[i][TCP_IOV_VNET].iov_base = &vnet_len;
> > >>>> +		iov[i][TCP_IOV_VNET].iov_len = sizeof(vnet_len);
> > >>>> +
> > >>>> +		if (!tap_send_frames_passt(c, iov[i], TCP_IOV_NUM))
> > >>> 
> > >>> ...which would now send a single frame at a time, but actually it can
> > >>> already send everything in one shot because it's using sendmsg(), if you
> > >>> move it outside of the loop and do something like (untested):
> > >>> 
> > >>> 	return tap_send_frames_passt(c, iov, TCP_IOV_NUM * n);
> > >>> 
> > >>>> +			break;
> > >>>> +	}
> > >>>> +
> > >>>> +	return i;
> > >>>> +
> > >>>> +}
> > >>>> +
> > >> 
> > >> I tried to do something like that but I have a performance drop:
> > >> 
> > >> static size_t tap_send_iov_passt(const struct ctx *c,
> > >>                                  struct iovec iov[][TCP_IOV_NUM],
> > >>                                  size_t n)
> > >> {
> > >>         unsigned int i;
> > >>         uint32_t vnet_len[n];
> > >> 
> > >>         for (i = 0; i < n; i++) {
> > >>                 int j;
> > >> 
> > >>                 vnet_len[i] = 0;
> > >>                 for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
> > >>                         vnet_len[i] += iov[i][j].iov_len;
> > >> 
> > >>                 vnet_len[i] = htonl(vnet_len[i]);
> > >>                 iov[i][TCP_IOV_VNET].iov_base = &vnet_len[i];
> > >>                 iov[i][TCP_IOV_VNET].iov_len = sizeof(uint32_t);
> > >>         }
> > >> 
> > >>         return tap_send_frames_passt(c, &iov[0][0], TCP_IOV_NUM * n) / TCP_IOV_NUM;
> > >> }
> > >> 
> > >> iperf3 -c localhost -p 10001 -t 60 -4
> > >> 
> > >> before:
> > >> [ ID] Interval           Transfer     Bitrate         Retr
> > >> [  5]   0.00-60.00  sec  33.0 GBytes  4.72 Gbits/sec    1             sender
> > >> [  5]   0.00-60.06  sec  33.0 GBytes  4.72 Gbits/sec                  receiver
> > >> 
> > >> after:
> > >> [ ID] Interval           Transfer     Bitrate         Retr
> > >> [  5]   0.00-60.00  sec  18.2 GBytes  2.60 Gbits/sec    0             sender
> > >> [  5]   0.00-60.07  sec  18.2 GBytes  2.60 Gbits/sec                  receiver
> > > 
> > > Weird, it looks like doing one sendmsg() per frame results in a higher
> > > throughput than one sendmsg() per multiple frames, which sounds rather
> > > absurd.
> > > Perhaps we should start looking into what perf(1) reports, in
> > > terms of both syscall overhead and cache misses.
> > > 
> > > I'll have a look later today or tomorrow -- unless you have other
> > > ideas as to why this might happen...
> > 
> > Perhaps in first case we only update one vnet_len and in the second case we have to update
> > an array of vnet_len, so there is an use of more cache lines?

We should be able to test this relatively easily, yes? By updating all
the vnet_len then using a single sendmsg().

> Yes, I'm wondering if for example this:
> 
> 	iov[i][TCP_IOV_VNET].iov_base = &vnet_len[i];
> 
> causes a prefetch of everything pointed by iov[i][...], so we would
> prefetch (and throw away) each buffer, one by one.
> 
> Another interesting experiment to verify if this is the case could be
> to "flush" a few frames at a time (say, 4), with something like this
> on top of your original change (completely untested):
> 
> 	[...]
> 
> 		if (!((i + 1) % 4) &&
> 		    !tap_send_frames_passt(c, iov[i / 4], TCP_IOV_NUM * 4))
> 			break;
> 	}
> 
> 	if ((i + 1) % 4) {
> 		tap_send_frames_passt(c, iov[i / 4],
> 				      TCP_IOV_NUM * ((i + 1) % 4));
> 	}
> 
> Or maybe we could set vnet_len right after we receive data in the
> buffers.

I really hope we can avoid this. If we want to allow IPv4<->IPv6
translation, then we can't know the vnet_len until after we've done our
routing / translation.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you. NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
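
As a point of reference for the batched-flush idea sketched above, here is a
minimal, self-contained illustration (an assumption-laden sketch, not passt
code): each frame carries a 32-bit big-endian length prefix playing the role
of vnet_len, frames are built as groups of iovec entries, and a single
sendmsg() flushes a small batch of them. FRAME_PARTS, BATCH and send_batch()
are made-up stand-ins for TCP_IOV_NUM and tap_send_frames_passt(), and the
batch start index is taken as i + 1 - pending, i.e. the first frame that has
not been flushed yet.

/* Standalone illustration only: FRAME_PARTS, BATCH and send_batch() are
 * made-up names, not passt's TCP_IOV_NUM / tap_send_frames_passt(). */
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define FRAME_PARTS	2	/* parts per frame: length prefix + payload */
#define BATCH		4	/* frames flushed per sendmsg() */

/* Send @n prepared frames (@n * FRAME_PARTS iovec entries) in one sendmsg() */
static int send_batch(int fd, struct iovec (*iov)[FRAME_PARTS], size_t n)
{
	struct msghdr mh = { .msg_iov = iov[0], .msg_iovlen = n * FRAME_PARTS };

	return sendmsg(fd, &mh, MSG_NOSIGNAL) < 0 ? -1 : 0;
}

int main(void)
{
	struct iovec iov[8][FRAME_PARTS];
	uint32_t vnet_len[8];
	char payload[8][64];
	size_t i, pending = 0;
	int sv[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return 1;

	for (i = 0; i < 8; i++) {
		size_t len = snprintf(payload[i], sizeof(payload[i]),
				      "frame %zu", i) + 1;

		/* 32-bit big-endian length prefix, one slot per frame so the
		 * pointer stays valid until the whole batch is flushed */
		vnet_len[i] = htonl(len);
		iov[i][0].iov_base = &vnet_len[i];
		iov[i][0].iov_len = sizeof(vnet_len[i]);
		iov[i][1].iov_base = payload[i];
		iov[i][1].iov_len = len;

		/* Flush once BATCH frames are prepared: the batch starts at
		 * the first still-pending frame, i + 1 - pending */
		if (++pending == BATCH) {
			if (send_batch(sv[0], &iov[i + 1 - pending], pending))
				return 1;
			pending = 0;
		}
	}

	/* Flush whatever is left over */
	if (pending && send_batch(sv[0], &iov[i - pending], pending))
		return 1;

	close(sv[0]);
	close(sv[1]);
	return 0;
}

Whether flushing in batches actually helps is exactly the open question in
the thread; the sketch only shows the bookkeeping that keeping one length
slot per frame requires, since every prefix must stay addressable until the
iovec array pointing at it has been handed to sendmsg().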