[PATCH v2 00/11] Preliminaries for UDP flow support
The redesign of UDP flows required (or at least, suggested) a new batch of
preliminary changes that don't rely on the core of the flow table rework.

Changes since v1:
 * Assorted minor fixes based on Stefano's feedback
 * Moved test programs from contrib/ to doc/

David Gibson (11):
  util: sock_l4() determine protocol from epoll type rather than the reverse
  flow: Add flow_sidx_valid() helper
  udp: Pass full epoll reference through more of sock handler path
  udp: Rename IOV and mmsghdr arrays
  udp: Unify udp[46]_mh_splice
  udp: Unify udp[46]_l2_iov
  udp: Don't repeatedly initialise udp[46]_eth_hdr
  udp: Move some more of sock_handler tasks into sub-functions
  udp: Consolidate datagram batching
  doc: Add program to document and test assumptions about SO_REUSEADDR
  doc: Test behaviour of zero length datagram recv()s

 doc/platform-requirements/.gitignore          |   2 +
 doc/platform-requirements/Makefile            |  45 +++
 doc/platform-requirements/README              |  18 +
 doc/platform-requirements/common.c            |  66 ++++
 doc/platform-requirements/common.h            |  47 +++
 doc/platform-requirements/recv-zero.c         |  74 ++++
 .../reuseaddr-priority.c                      | 240 ++++++++++++
 epoll_type.h                                  |  41 ++
 flow.h                                        |  11 +
 flow_table.h                                  |   2 +-
 icmp.c                                        |   2 +-
 passt.h                                       |  32 --
 tcp.c                                         |  17 +-
 udp.c                                         | 365 +++++++++---------
 util.c                                        |  48 +--
 util.h                                        |   3 +-
 16 files changed, 756 insertions(+), 257 deletions(-)
 create mode 100644 doc/platform-requirements/.gitignore
 create mode 100644 doc/platform-requirements/Makefile
 create mode 100644 doc/platform-requirements/README
 create mode 100644 doc/platform-requirements/common.c
 create mode 100644 doc/platform-requirements/common.h
 create mode 100644 doc/platform-requirements/recv-zero.c
 create mode 100644 doc/platform-requirements/reuseaddr-priority.c
 create mode 100644 epoll_type.h

-- 
2.45.2
sock_l4() creates a socket of the given IP protocol number, and adds it to
the epoll state. Currently it determines the correct tag for the epoll
data based on the protocol. However, we have some future cases where we
might want different semantics, and therefore epoll types, for sockets of
the same protocol. So, change sock_l4() to take the epoll type as an
explicit parameter, and determine the protocol from that.
Signed-off-by: David Gibson
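To illustrate the direction of the sock_l4() change described above, here is a minimal sketch of deriving the protocol from the epoll type. The enum values and the helper name proto_from_epoll_type() are hypothetical and chosen only for illustration; the actual patch may structure this differently.

```c
#include <assert.h>
#include <netinet/in.h>

/* Hypothetical epoll types: illustrative names, not the project's list */
enum epoll_type {
	EPOLL_TYPE_NONE = 0,
	EPOLL_TYPE_TCP_LISTEN,
	EPOLL_TYPE_UDP,
	EPOLL_TYPE_PING,
};

/* Derive the IP protocol number from the epoll type, or -1 if unknown */
static int proto_from_epoll_type(enum epoll_type type)
{
	switch (type) {
	case EPOLL_TYPE_TCP_LISTEN:
		return IPPROTO_TCP;
	case EPOLL_TYPE_UDP:
		return IPPROTO_UDP;
	case EPOLL_TYPE_PING:
		/* A fuller mapping would pick ICMP or ICMPv6 by address family */
		return IPPROTO_ICMP;
	default:
		return -1;
	}
}
```

With the mapping in this direction, two epoll types can share one protocol (for example, several UDP socket kinds), which the reverse mapping cannot express.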
To implement the TCP hash table, we need an invalid (NULL-like) value for
flow_sidx_t. We use FLOW_SIDX_NONE for that, but to be defensive, we
(usually) treat anything with an out-of-bounds flow index the same way.
That's not always done consistently though. In flow_at_sidx() we open code
a check on the flow index. In tcp_hash_probe() we instead compare against
FLOW_SIDX_NONE, and in some other places we use the fact that
flow_at_sidx() will return NULL in this case, even if we don't otherwise
need the flow it returns.
Clean this up a bit, by adding an explicit flow_sidx_valid() test function.
Signed-off-by: David Gibson
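A rough sketch of what such a helper could look like follows. The names flow_sidx_t, FLOW_SIDX_NONE and flow_sidx_valid() come from the message above, but the field layout, the FLOW_MAX value and the function body are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define FLOW_MAX	1024U	/* assumed flow table size */

/* Assumed layout: which side of the flow, plus the flow's table index */
typedef struct {
	unsigned side:1;
	unsigned flow:31;
} flow_sidx_t;

/* NULL-like value: the flow index is deliberately out of bounds */
#define FLOW_SIDX_NONE	((flow_sidx_t){ .flow = FLOW_MAX })

/* Any out-of-bounds index, not just FLOW_SIDX_NONE, counts as invalid */
static inline bool flow_sidx_valid(flow_sidx_t sidx)
{
	return sidx.flow < FLOW_MAX;
}
```

Callers that previously compared against FLOW_SIDX_NONE, or called flow_at_sidx() only to check for NULL, could then share this one test.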
udp_buf_sock_handler() takes the epoll reference from the receiving socket,
and passes the UDP relevant part on to several other functions. Future
changes are going to need several different epoll types for UDP, and to
pass that information through to some of those functions. To avoid extra
noise in the patches making the real changes, change those functions now
to take the full epoll reference, rather than just the UDP part.
Signed-off-by: David Gibson
Make the salient points about these various arrays clearer with renames:
* udp_l2_iov_sock and udp[46]_l2_mh_sock don't really have anything to do
with L2. They are, however, specific to receiving not sending. Rename
to udp_iov_recv and udp[46]_mh_recv.
* udp[46]_l2_iov_tap is redundant - "tap" implies L2 and vice versa.
Rename to udp[46]_l2_iov
* udp[46]_localname are (for now) pre-populated with the local address but
the more salient point is that these are the destination address for the
splice arrays. Rename to udp[46]_splice_to
Signed-off-by: David Gibson
We have separate mmsghdr arrays for splicing IPv4 and IPv6 packets, where
the only difference is that they point to different sockaddr buffers for
the destination address.
Unify these by having the common array point at a sockaddr_inany as the
address. This does mean slightly more work when we're about to splice,
because we need to write the whole socket address, rather than just the
port. However it removes 32 mmsghdr structures and we're going to need
more flexibility constructing that target address for the flow table.
Because future changes might mean that the address isn't always loopback,
change the name of the common address from *_localname to udp_splicename.
Signed-off-by: David Gibson
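As a sketch of the "slightly more work" mentioned above, the following writes a complete destination address into a single buffer that can hold either address family. The union layout and the helper name splice_dst_set() are assumptions for illustration, not the code from the patch.

```c
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Either kind of IP socket address (layout assumed for illustration) */
union sockaddr_inany {
	sa_family_t		sa_family;
	struct sockaddr_in	sa4;
	struct sockaddr_in6	sa6;
};

/* Write a whole loopback destination, not just the port.
 * Returns the address length, suitable for msg_namelen.
 */
static socklen_t splice_dst_set(union sockaddr_inany *sa, int af,
				in_port_t port)
{
	memset(sa, 0, sizeof(*sa));

	if (af == AF_INET) {
		sa->sa4.sin_family = AF_INET;
		sa->sa4.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
		sa->sa4.sin_port = htons(port);
		return sizeof(sa->sa4);
	}

	sa->sa6.sin6_family = AF_INET6;
	sa->sa6.sin6_addr = in6addr_loopback;
	sa->sa6.sin6_port = htons(port);
	return sizeof(sa->sa6);
}
```

Because every mmsghdr can point msg_name at the same union, one array serves both IP versions, at the cost of rewriting the address before each splice.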
The only differences between these arrays are that udp4_l2_iov is
pre-initialised to point to the IPv4 ethernet header and IPv4 per-frame
header, while udp6_l2_iov points to the IPv6 versions.
We already have to set up a bunch of headers per-frame, including updating
udp[46]_l2_iov[i][UDP_IOV_PAYLOAD].iov_len. It makes more sense to adjust
the IOV entries to point at the correct headers for the frame than to have
two complete sets of iovecs.
Signed-off-by: David Gibson
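The idea above can be sketched as follows: one set of iovecs per frame, with the header entries repointed according to the frame's IP version. The slot names, struct frame_hdrs and frame_iov_set() are hypothetical, and the header sizes are the fixed minimum sizes, used only to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

enum { IOV_ETH, IOV_IP, IOV_PAYLOAD, IOV_NUM };	/* hypothetical slots */

struct frame_hdrs {			/* hypothetical per-frame headers */
	unsigned char ip4[20];
	unsigned char ip6[40];
};

/* One Ethernet header buffer per IP version, shared by all frames */
static unsigned char eth4[14], eth6[14];

/* Point one frame's IOV entries at the headers matching its IP version */
static void frame_iov_set(struct iovec *iov, struct frame_hdrs *h,
			  int v6, void *payload, size_t len)
{
	iov[IOV_ETH].iov_base = v6 ? eth6 : eth4;
	iov[IOV_ETH].iov_len = sizeof(eth4);

	iov[IOV_IP].iov_base = v6 ? h->ip6 : h->ip4;
	iov[IOV_IP].iov_len = v6 ? sizeof(h->ip6) : sizeof(h->ip4);

	/* The payload length had to be updated per frame anyway */
	iov[IOV_PAYLOAD].iov_base = payload;
	iov[IOV_PAYLOAD].iov_len = len;
}
```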
Since we split our packet frame buffers into different pieces, we have
a single buffer per IP version for the ethernet header, rather than one
per frame. This makes sense since our ethernet header is always the same.
However, we initialise those buffers, udp[46]_eth_hdr, inside a per-frame
loop. Pull that outside the loop so we just initialise them once.
Signed-off-by: David Gibson
udp_buf_sock_handler(), udp_splice_send() and udp_tap_send() loosely do
four things between them:
1. Receive some datagrams from a socket
2. Split those datagrams into batches depending on how they need to be
sent (via tap or via a specific splice socket)
3. Prepare buffers for each datagram to send it onwards
4. Actually send it onwards
Split (1) and (3) into specific helper functions. This isn't
immediately useful (udp_splice_prepare(), in particular, is trivial),
but it will make further reworks clearer.
Signed-off-by: David Gibson
When we receive datagrams on a socket, we need to split them into batches
depending on how they need to be forwarded (either via a specific splice
socket, or via tap). The logic to do this is somewhat awkwardly split
between udp_buf_sock_handler() itself, udp_splice_send() and
udp_tap_send().
Move all the batching logic into udp_buf_sock_handler(), leaving
udp_splice_send() to just send the prepared batch. udp_tap_send() reduces
to just a call to tap_send_frames() so open-code that call in
udp_buf_sock_handler().
This will allow separating the batching logic from the rest of the datagram
forwarding logic, which we'll need for upcoming flow table support.
Signed-off-by: David Gibson
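To make the batching step concrete, here is a minimal sketch that splits a run of received datagrams into consecutive batches sharing the same forwarding destination. The key type, struct batch and both function names are hypothetical; the real handler works on its own buffers and reference types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical forwarding key: -1 for "via tap", else a splice port */
typedef int dst_key_t;

struct batch {
	dst_key_t key;		/* how this batch must be forwarded */
	size_t start, count;	/* range of datagram indices it covers */
};

/* Length of the run of datagrams from @start sharing the same key */
static size_t batch_len(const dst_key_t *keys, size_t start, size_t n)
{
	size_t i = start + 1;

	while (i < n && keys[i] == keys[start])
		i++;

	return i - start;
}

/* Split @n datagrams into batches; @out needs room for @n entries.
 * Returns the number of batches written.
 */
static size_t split_batches(const dst_key_t *keys, size_t n,
			    struct batch *out)
{
	size_t i, m, nb = 0;

	for (i = 0; i < n; i += m) {
		m = batch_len(keys, i, n);
		out[nb++] = (struct batch){ keys[i], i, m };
	}

	return nb;
}
```

Keeping the split in one place means the sending side only ever sees one already-uniform batch at a time.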
For the approach we intend to use for handling UDP flows, we have some
pretty specific requirements about how SO_REUSEADDR works with UDP sockets.
Specifically SO_REUSEADDR allows multiple sockets with overlapping bind()s,
and therefore there can be multiple sockets which are eligible to receive
the same datagram. Which one will actually receive it is important to us.
Add a test program which verifies that things work the way we expect, and
documents those expectations in the process.
Signed-off-by: David Gibson
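The kind of situation described above can be sketched as below: two UDP sockets with SO_REUSEADDR, one bound to the wildcard address and one to loopback on the same port, so that both are eligible for a datagram sent to loopback. This is an illustrative sketch with hypothetical function names, not the test program added by the patch, and it deliberately reports which socket received the datagram rather than assuming an answer.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Create a UDP socket with SO_REUSEADDR, bound to @sa. Returns fd or -1 */
static int reuse_sock(const struct sockaddr_in *sa)
{
	int one = 1, s = socket(AF_INET, SOCK_DGRAM, 0);

	if (s < 0)
		return -1;

	if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) ||
	    bind(s, (const struct sockaddr *)sa, sizeof(*sa))) {
		close(s);
		return -1;
	}

	return s;
}

/* Returns a bitmask of receivers (bit 0: wildcard socket, bit 1: loopback
 * socket) for one datagram sent to loopback, or -1 on setup failure.
 */
static int reuse_receivers(void)
{
	struct sockaddr_in sa = { .sin_family = AF_INET };  /* any, port 0 */
	struct pollfd pfd[2] = { { .fd = -1 }, { .fd = -1 } };
	socklen_t sl = sizeof(sa);
	int tx = -1, mask = -1, i;

	pfd[0].fd = reuse_sock(&sa);
	if (pfd[0].fd < 0 ||
	    getsockname(pfd[0].fd, (struct sockaddr *)&sa, &sl))
		goto out;

	sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* same port, overlapping */
	pfd[1].fd = reuse_sock(&sa);
	tx = socket(AF_INET, SOCK_DGRAM, 0);
	if (pfd[1].fd < 0 || tx < 0 ||
	    sendto(tx, "x", 1, 0, (struct sockaddr *)&sa, sizeof(sa)) != 1)
		goto out;

	pfd[0].events = pfd[1].events = POLLIN;
	mask = 0;
	if (poll(pfd, 2, 1000) > 0)
		for (i = 0; i < 2; i++)
			if (pfd[i].revents & POLLIN)
				mask |= 1 << i;
out:
	for (i = 0; i < 2; i++)
		if (pfd[i].fd >= 0)
			close(pfd[i].fd);
	if (tx >= 0)
		close(tx);

	return mask;
}
```

A result with a single bit set identifies the one socket the platform chose for this particular pair of bind()s.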
On Fri, 05 Jul 2024, David Gibson wrote:

I may be missing something subtle, but is j intended to be used twice here, rather than k?

+
+static void check_all_orders(void)
+{
+	int norders = sizeof(orders) / sizeof(orders[0]);
+	int i, j, k, l;
+
+	for (i = 0; i < norders; i++)
+		for (j = 0; j < norders; j++)
+			for (k = 0; k < norders; k++)
+				for (l = 0; l < norders; l++)
+					check_one_order(orders[i], orders[j],
+							orders[j], orders[l]);
--------------------------------------------------------^^^^^^^^^
+}
-- David Taylor
On Fri, Jul 12, 2024 at 12:42:57PM +0100, David Taylor wrote:
On Fri, 05 Jul 2024, David Gibson wrote:
I may be missing something subtle, but is j intended to be used twice here, rather than k?
Indeed not, good catch, thanks.
-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
Add a test program verifying that we're able to discard datagrams from a
socket without needing a big discard buffer, by using a zero length recv().
Signed-off-by: David Gibson
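The technique named above can be sketched as follows. discard_datagram() issues a zero length recv(); discard_demo() queues two datagrams on a loopback UDP socket, discards the first, and reads the next. Both function names are hypothetical, and whether a zero length recv() really dequeues a datagram is exactly the platform behaviour the test program is meant to confirm, so treat this as an illustration rather than a guarantee.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Drop the next queued datagram without copying any payload.
 * Returns 0 on success, -1 if nothing was queued or on error.
 */
static int discard_datagram(int s)
{
	return recv(s, NULL, 0, MSG_DONTWAIT) < 0 ? -1 : 0;
}

/* Queue "a" then "b", discard one datagram, and return the first byte of
 * the next one, or -1 on any failure.
 */
static int discard_demo(void)
{
	struct sockaddr_in sa = { .sin_family = AF_INET,
				  .sin_addr.s_addr = htonl(INADDR_LOOPBACK) };
	struct pollfd pfd = { .events = POLLIN };
	socklen_t sl = sizeof(sa);
	int rx = socket(AF_INET, SOCK_DGRAM, 0);
	int tx = socket(AF_INET, SOCK_DGRAM, 0);
	int ret = -1;
	char c;

	if (rx < 0 || tx < 0 ||
	    bind(rx, (struct sockaddr *)&sa, sizeof(sa)) ||
	    getsockname(rx, (struct sockaddr *)&sa, &sl) ||
	    connect(tx, (struct sockaddr *)&sa, sizeof(sa)) ||
	    send(tx, "a", 1, 0) != 1 || send(tx, "b", 1, 0) != 1)
		goto out;

	pfd.fd = rx;
	if (poll(&pfd, 1, 1000) == 1 && !discard_datagram(rx) &&
	    poll(&pfd, 1, 1000) == 1 && recv(rx, &c, 1, MSG_DONTWAIT) == 1)
		ret = c;
out:
	if (rx >= 0)
		close(rx);
	if (tx >= 0)
		close(tx);

	return ret;
}
```

If the zero length recv() consumed the first datagram, discard_demo() returns 'b'; a return of 'a' would mean the datagram was left queued.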
On Fri, 5 Jul 2024 20:43:58 +1000
David Gibson
The redesign of UDP flows required (or at least, suggested) a new batch of preliminary changes that don't rely on the core of the flow table rework.
Applied. -- Stefano
participants (3)
- David Gibson
- David Taylor
- Stefano Brivio