On 3/14/24 16:47, Stefano Brivio wrote:
On Thu, 14 Mar 2024 15:07:48 +0100, Laurent Vivier wrote:
On 3/13/24 12:37, Stefano Brivio wrote:
...
@@ -390,6 +414,42 @@ static size_t tap_send_frames_passt(const struct ctx *c,
 	return i;
 }
+/**
+ * tap_send_iov_passt() - Send out multiple prepared frames
...I would argue that this function prepares frames as well. Maybe:
* tap_send_iov_passt() - Prepare TCP_IOV_VNET parts and send multiple frames
+ * @c:		Execution context
+ * @iov:	Array of frames, each frames is divided in an array of iovecs.
+ *		The first entry of the iovec is updated to point to an
+ *		uint32_t storing the frame length.
* @iov: Array of frames, each one a vector of parts, TCP_IOV_VNET blank
+ * @n:		Number of frames in @iov
+ *
+ * Return: number of frames actually sent
+ */
+static size_t tap_send_iov_passt(const struct ctx *c,
+				 struct iovec iov[][TCP_IOV_NUM],
+				 size_t n)
+{
+	unsigned int i;
+
+	for (i = 0; i < n; i++) {
+		uint32_t vnet_len;
+		int j;
+
+		vnet_len = 0;
This could be initialised in the declaration (yes, it's "reset" at every loop iteration).
+		for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
+			vnet_len += iov[i][j].iov_len;
+
+		vnet_len = htonl(vnet_len);
+		iov[i][TCP_IOV_VNET].iov_base = &vnet_len;
+		iov[i][TCP_IOV_VNET].iov_len = sizeof(vnet_len);
+
+		if (!tap_send_frames_passt(c, iov[i], TCP_IOV_NUM))
...which would now send a single frame at a time. But since it uses sendmsg(), it can already send everything in one shot if you move the call outside of the loop and do something like (untested):
return tap_send_frames_passt(c, iov, TCP_IOV_NUM * n);
+			break;
+	}
+
+	return i;
+
+}
+
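One subtlety with hoisting the sendmsg() out of the loop: each frame's TCP_IOV_VNET entry then has to point at its own copy of the length, otherwise every entry aliases the same variable and only the last value written survives. A minimal standalone sketch of the pitfall (hypothetical code, not from the passt tree), which is presumably why the version below switches to a per-frame vnet_len[] array:

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/uio.h>

int main(void)
{
	struct iovec iov[3];
	uint32_t vnet_len;	/* single variable shared by all frames */
	int i;

	for (i = 0; i < 3; i++) {
		vnet_len = htonl(100 * (i + 1));
		iov[i].iov_base = &vnet_len;	/* every entry aliases it */
		iov[i].iov_len = sizeof(vnet_len);
	}

	/* A single batched sendmsg() would only read the vector now,
	 * when all three entries point at the last value written:
	 */
	for (i = 0; i < 3; i++)
		printf("frame %d length: %u\n", i,
		       ntohl(*(uint32_t *)iov[i].iov_base));

	return 0;
}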
I tried to do something like that, but I see a performance drop:
static size_t tap_send_iov_passt(const struct ctx *c,
				 struct iovec iov[][TCP_IOV_NUM],
				 size_t n)
{
	unsigned int i;
	uint32_t vnet_len[n];	/* one length per frame: each TCP_IOV_VNET
				 * entry needs its own storage once the
				 * sendmsg() moves out of the loop
				 */

	for (i = 0; i < n; i++) {
		int j;

		vnet_len[i] = 0;
		for (j = TCP_IOV_ETH; j < TCP_IOV_NUM; j++)
			vnet_len[i] += iov[i][j].iov_len;

		vnet_len[i] = htonl(vnet_len[i]);
		iov[i][TCP_IOV_VNET].iov_base = &vnet_len[i];
		iov[i][TCP_IOV_VNET].iov_len = sizeof(uint32_t);
	}

	/* convert the iovec count returned back to a frame count */
	return tap_send_frames_passt(c, &iov[0][0], TCP_IOV_NUM * n) / TCP_IOV_NUM;
}
iperf3 -c localhost -p 10001 -t 60 -4
before:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00 sec  33.0 GBytes  4.72 Gbits/sec    1             sender
[  5]   0.00-60.06 sec  33.0 GBytes  4.72 Gbits/sec                  receiver
after:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00 sec  18.2 GBytes  2.60 Gbits/sec    0             sender
[  5]   0.00-60.07 sec  18.2 GBytes  2.60 Gbits/sec                  receiver
Weird, it looks like doing one sendmsg() per frame results in a higher throughput than one sendmsg() per multiple frames, which sounds rather absurd. Perhaps we should start looking into what perf(1) reports, in terms of both syscall overhead and cache misses.
I'll have a look later today or tomorrow -- unless you have other ideas as to why this might happen...
Perhaps in the first case we only update one vnet_len, while in the second
case we have to update a whole array of vnet_len entries, so more cache
lines are touched?

Thanks,
Laurent
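To put rough numbers on that hypothesis: a single reused vnet_len stays in one hot cache line, while an array of per-frame lengths is spread over several, all written in the fill loop and read back by the one sendmsg() at the end. A back-of-the-envelope sketch (the 64-byte line size and the batch size are assumptions for illustration, not values from the passt code):

#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE	64	/* assumed line size, typical on x86-64 */

int main(void)
{
	size_t n = 32;		/* hypothetical frames per batch */
	size_t lines = (n * sizeof(uint32_t) + CACHE_LINE - 1) / CACHE_LINE;

	/* single reused vnet_len: the same line is written on every
	 * iteration and read back right away by sendmsg()
	 */
	printf("one vnet_len:  1 cache line\n");

	/* vnet_len[n]: the fill loop dirties every line of the array,
	 * and the single sendmsg() reads them all back at the end
	 */
	printf("vnet_len[%zu]:  %zu cache lines\n", n, lines);

	return 0;
}

With these assumptions the array only spans a couple of cache lines, so this alone may not explain the drop; perf(1) data, as Stefano suggests, would be needed to confirm or rule it out.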