Re: [PATCH 3/4] udp: Rework how we divide queued datagrams between sending methods

14 Jun 2024

On Thu, Jun 13, 2024 at 08:21:12PM +0200, Stefano Brivio wrote:
...
On Wed,  5 Jun 2024 11:39:02 +1000
David Gibson  wrote:
...
udp_sock_handler() takes a number of datagrams from sockets that depending
on their addresses could be forwarded either to the L2 interface ("tap")
or to another socket ("spliced").  In the latter case we can also only
send packets together if they have the same source port, and therefore
are sent via the same socket.
To reduce the total number of system calls we gather contiguous batches of
datagrams with the same destination interface and socket where applicable.
The determination of what the target is is made by udp_mmh_splice_port().
It returns the source port for splice packets and -1 for "tap" packets.
We find batches by looking ahead in our queue until we find a datagram
whose "splicefrom" port doesn't match the first in our current batch.
udp_mmh_splice_port() is moderately expensive, since it must examine IPv6
addresses.  But unfortunately we can call it twice on the same datagram:
once as the (last + 1) entry in one batch (showing that it's not in that
match, then again as the first entry in the next batch.
This paragraph took me an embarrassingly long time to grasp, if you
re-spin it would be nice to fix it:
- "And unfortunately [...]", I guess: otherwise it looks like we're
  lucky that udp_mmh_splice_port() is expensive or something like that
  (because of the "But" implying contrast).
I initially assumed "unfortunately" was a typo and tried to
  understand why it was a good thing we'd call udp_mmh_splice_port()
  twice on the same datagram (faster than calling it on two
  datagrams!), then started reading the change and got even more
  confused...
- "(to check that it's not that batch)" ?
Yeah, didn't help that there were some other typos in there.  Updated
referring to these suggestions.
...
...
Avoid this by keeping track of the "splice port" in the metadata structure,
and filling it in one entry ahead of the one we're currently considering.
This is a bit subtle, but not that hard.  It will also generalise better
when we have more complex possibilities based on the flow table.
I guess this is the actual, main reason for this change. :) I should
have read this paragraph first.
Yes.  In the flow table we'll replace this splicesrc array with an
array of flow sidxs, updated in the same way.
...
...
Signed-off-by: David Gibson 
---
 udp.c | 147 ++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 86 insertions(+), 61 deletions(-)

diff --git a/udp.c b/udp.c
index 7487d2b2..757c10ab 100644
--- a/udp.c
+++ b/udp.c
@@ -198,6 +198,7 @@ static struct ethhdr udp6_eth_hdr;
  * @ip4h:	Pre-filled IPv4 header (except for tot_len and saddr)
  * @taph:	Tap backend specific header
  * @s_in:	Source socket address, filled in by recvmmsg()
+ * @splicesrc:	Source port for splicing, or -1 if not spliceable
  */
 static struct udp_meta_t {
  struct ipv6hdr ip6h;
@@ -205,6 +206,7 @@ static struct udp_meta_t {
  struct tap_hdr taph;
union sockaddr_inany s_in;
+	int splicesrc;
 }
 #ifdef __AVX2__
 __attribute__ ((aligned(32)))
@@ -492,28 +494,32 @@ static int udp_mmh_splice_port(union udp_epoll_ref uref,
 }
/**
- * udp_splice_sendfrom() - Send datagrams from given port to given port
+ * udp_splice_send() - Send datagrams from socket to socket
  * @c:		Execution context
  * @start:	Index of first datagram in udp[46]_l2_buf
- * @n:		Number of datagrams to send
- * @src:	Datagrams will be sent from this port (on origin side)
- * @dst:	Datagrams will be send to this port (on destination side)
- * @from_pif:	pif from which the packet originated
- * @v6:		Send as IPv6?
- * @allow_new:	If true create sending socket if needed, if false discard
- *              if no sending socket is available
+ * @n:		Total number of datagrams in udp[46]_l2_buf pool
+ * @dst:	Datagrams will be sent to this port (on destination side)
+ * @uref:	UDP epoll reference for origin socket
  * @now:	Timestamp
+ *
+ * This consumes as many frames as are sendable via a single socket.  It
s/frames/datagrams/ ...or messages.
Good point, adjusted.
...
...
+ * requires that udp_meta[@start].splicesrc is initialised, and will initialise
+ * udp_meta[].splicesrc for each frame it consumes *and one more* (if present).
+ *
+ * Return: Number of frames sent
I'd say it's rather the number of datagrams (not frames) we tried to
send.
In some sense, it's also the number of frames sent _by us_ (well, after
calling sendmmsg(), messages were sent), but we call sendmmsg()
ignoring the result, so this comment might look a bit misleading.
Right... I've gone with "Number of datagrams forwarded" how's that?
...
...
*/
-static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n,
-				in_port_t src, in_port_t dst, uint8_t from_pif,
-				bool v6, bool allow_new,
+static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n,
+				in_port_t dst, union udp_epoll_ref uref,
      		const struct timespec *now)
 {
+	in_port_t src = udp_meta[start].splicesrc;
  struct mmsghdr *mmh_recv, *mmh_send;
-	unsigned int i;
+	unsigned int i = start;
  int s;
-	if (v6) {
+	ASSERT(udp_meta[start].splicesrc >= 0);
+
+	if (uref.v6) {
      mmh_recv = udp6_l2_mh_sock;
      mmh_send = udp6_mh_splice;
  } else {
@@ -521,40 +527,48 @@ static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n,
      mmh_send = udp4_mh_splice;
  }
-	if (from_pif == PIF_SPLICE) {
+	do {
+		mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len;
+
+		if (++i >= n)
+			break;
+
+		udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]);
+	} while (udp_meta[i].splicesrc == src);
I don't have a strong preference, but a for loop like this:
for (; i < n && udp_meta[i].splicesrc == src; i++) {
      mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len;
      udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]);
This needs to update udp_meta[i+1], not udp_meta[i], and therefore
also be conditional on i+1 < n.
...
}
if (i++ < n)	/* Set splicesrc for first mismatching entry, too */
This needs to be ++i, not i++.
...
udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]);
looks a bit more readable to me. Same for udp_tap_send().
At which point I'm not sure it's more readable after all.  It also
redundantly checks that udp_meta[start].splicesrc == src.

Like I said, subtle.  Putting the increment in the middle of the loop
body, rather than beginning or end, while unusual, was the sanest way
I could see to do this.  Well, other than havine one pass to set
splicesrc[], then another using it, which works but I'm concerned
about the effect on cache locality.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson