[PATCH v3 0/8] Don't use additional sockets for receiving "spliced" UDP communications
At present, the UDP "splice" and "tap" paths are quite separate. We have separate sockets to receive packets bound for the tap and splice paths. This leads to some code duplication, and extra open sockets. This series partially unifies the two paths, allowing us to use a single (host side) socket, bound to 0.0.0.0 or :: to receive packets for both cases. Changes since v2: * Don't receive multiple packets at once for pasta mode - seems to hurt throughput on balance. * Add some comments clarifying reasoning here. Changes since v1: * Renamed udp_localname[46] to udp[46]_localname * Allow handling of UDP port 0 * Fix a bug which could misidentify certain v6 packets as v4-spliceable * Some minor cosmetic fixes to code and commit messages David Gibson (8): udp: Move sending pasta tap frames to the end of udp_sock_handler() udp: Split sending to passt tap interface into separate function udp: Split receive from preparation and send in udp_sock_handler() udp: Don't handle tap receive batch size calculation within a #define udp: Pre-populate msg_names with local address udp: Unify udp_sock_handler_splice() with udp_sock_handler() udp: Decide whether to "splice" per datagram rather than per socket udp: Don't use separate sockets to listen for spliced packets udp.c | 389 ++++++++++++++++++++++++++++++--------------------------- udp.h | 2 +- util.h | 7 ++ 3 files changed, 214 insertions(+), 184 deletions(-) -- 2.39.0
udp_sock_handler() has a surprising difference in flow between pasta and
passt mode: For pasta we send each frame to the tap interface as we prepare
it. For passt, though, we prepare all the frames, then send them with a
single sendmmsg().
Alter the pasta path to also prepare all the frames, then send them at the
end. We already have a suitable data structure for the passt case. This
will make it easier to abstract out the tap backend difference in future.
Signed-off-by: David Gibson
The last part of udp_sock_handler() does the actual sending of frames
to the tap interface. For pasta that's just a call to
udp_tap_send_pasta() but for passt, it's moderately complex and open
coded.
For symmetry, move the passt send path into its own function,
udp_tap_send_passt(). This will make it easier to abstract the tap
interface in future (e.g. when we want to add vhost-user).
Signed-off-by: David Gibson
The receive part of udp_sock_handler() and udp_sock_handler_splice() is now
almost identical. In preparation for merging that, split the receive part
of udp_sock_handler() from the part preparing and sending the frames for
sending on the tap interface. The latter goes into a new udp_tap_send()
function.
Signed-off-by: David Gibson
UDP_MAX_FRAMES gives the maximum number of datagrams we'll ever handle as a
batch for sizing our buffers and control structures. The subtly different
UDP_TAP_FRAMES gives the maximum number of datagrams we'll actually try to
receive at once for tap packets in the current configuration.
This depends on the mode, meaning that the macro has a non-obvious
dependency on the usual 'c' context variable being available. We only use
it in one place, so it makes more sense to open code this. Add an
explanatory comment while we're there.
Signed-off-by: David Gibson
udp_splice_namebuf is now used only for spliced sending, and so it is
only ever populated with the localhost address, either IPv4 or IPv6.
So, replace the awkward initialization in udp_sock_handler_splice()
with statically initialized versions for IPv4 and IPv6. We then just
need to update the port number in udp_sock_handler_splice().
Signed-off-by: David Gibson
These two functions now have a very similar structure, and their first
part (calling recvmmsg()) is functionally identical. So, merge the two
functions into one.
This does have the side effect of meaning we no longer receive multiple
packets at once for splice (we already didn't for tap). This does hurt
throughput for small spliced packets, but improves it for large spliced
packets and tap packets.
Signed-off-by: David Gibson
On Wed, 4 Jan 2023 16:44:24 +1100
David Gibson
These two functions now have a very similar structure, and their first part (calling recvmmsg()) is functionally identical. So, merge the two functions into one.
This does have the side effect of meaning we no longer receive multiple packets at once for splice (we already didn't for tap). This does hurt throughput for small spliced packets, but improves it for large spliced packets and tap packets.
Signed-off-by: David Gibson
--- udp.c | 95 ++++++++++++++++++++++------------------------------------- 1 file changed, 35 insertions(+), 60 deletions(-) diff --git a/udp.c b/udp.c index f6c07a5..32ca83e 100644 --- a/udp.c +++ b/udp.c @@ -590,52 +590,6 @@ static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n, sendmmsg(s, mmh_send + start, n, MSG_NOSIGNAL); }
-/** - * udp_sock_handler_splice() - Handler for socket mapped to "spliced" connection - * @c: Execution context - * @ref: epoll reference - * @events: epoll events bitmap - * @now: Current timestamp - */ -static void udp_sock_handler_splice(const struct ctx *c, union epoll_ref ref, - uint32_t events, const struct timespec *now) -{ - in_port_t dst = ref.r.p.udp.udp.port; - int v6 = ref.r.p.udp.udp.v6, n, i, m; - struct mmsghdr *mmh_recv; - - if (!(events & EPOLLIN)) - return; - - if (v6) - mmh_recv = udp6_l2_mh_sock; - else - mmh_recv = udp4_l2_mh_sock; - - n = recvmmsg(ref.r.s, mmh_recv, UDP_MAX_FRAMES, 0, NULL); - - if (n <= 0) - return; - - if (v6) - udp6_localname.sin6_port = htons(dst); - else - udp4_localname.sin_port = htons(dst); - - for (i = 0; i < n; i += m) { - in_port_t src = sa_port(v6, mmh_recv[i].msg_hdr.msg_name); - - for (m = 1; i + m < n; m++) { - void *mname = mmh_recv[i + m].msg_hdr.msg_name; - if (sa_port(v6, mname) != src) - break; - } - - udp_splice_sendfrom(c, i, m, src, dst, v6, ref.r.p.udp.udp.ns, - ref.r.p.udp.udp.orig, now); - } -} - /** * udp_update_hdr4() - Update headers for one IPv4 datagram * @c: Execution context @@ -944,32 +898,53 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events, const struct timespec *now) { /* For not entirely clear reasons (data locality?) pasta gets - * better throughput if we receive the datagrams one at a - * time. + * better throughput if we receive tap datagrams one at a + * atime. For small splice datagrams throughput is slightly + * better if we do batch, but it's slightly worse for large + * splice datagrams. Since we don't know before we receive + * whether we'll use tap or splice, always go one at a time + * for pasta mode. */ ssize_t n = (c->mode == MODE_PASST ? UDP_MAX_FRAMES : 1); in_port_t dstport = ref.r.p.udp.udp.port; bool v6 = ref.r.p.udp.udp.v6; - struct mmsghdr *sock_mmh; + struct mmsghdr *mmh_recv; + unsigned int i, m; + ssize_t n;
This doesn't build, you're redefining 'n' after the new version of 4/8. I could drop this on merge (the rest of the series would be ready to merge) but as you usually prefer to respin anyway, I'll wait for that. -- Stefano
On Wed, Jan 04, 2023 at 06:44:38PM +0100, Stefano Brivio wrote:
On Wed, 4 Jan 2023 16:44:24 +1100 David Gibson
wrote: [snip] /** * udp_update_hdr4() - Update headers for one IPv4 datagram * @c: Execution context @@ -944,32 +898,53 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events, const struct timespec *now) { /* For not entirely clear reasons (data locality?) pasta gets - * better throughput if we receive the datagrams one at a - * time. + * better throughput if we receive tap datagrams one at a + * atime. For small splice datagrams throughput is slightly + * better if we do batch, but it's slightly worse for large + * splice datagrams. Since we don't know before we receive + * whether we'll use tap or splice, always go one at a time + * for pasta mode. */ ssize_t n = (c->mode == MODE_PASST ? UDP_MAX_FRAMES : 1); in_port_t dstport = ref.r.p.udp.udp.port; bool v6 = ref.r.p.udp.udp.v6; - struct mmsghdr *sock_mmh; + struct mmsghdr *mmh_recv; + unsigned int i, m; + ssize_t n;
This doesn't build, you're redefining 'n' after the new version of 4/8.
Wow, that was super sloppy of me, sorry. I'm going to blame not being quite back into it after the holiday. For whatever reason my brain just skipped over even the most rudimentary testing here.
I could drop this on merge (the rest of the series would be ready to merge) but as you usually prefer to respin anyway, I'll wait for that.
Yeah, I'll send a respin short. Tested, this time :). -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Currently we have special sockets for receiving datagrams from locahost
which can use the optimized "splice" path rather than going across the tap
interface.
We want to loosen this so that sockets can receive sockets that will be
forwarded by both the spliced and non-spliced paths. To do this, we alter
the meaning of the @splice bit in the reference to mean that packets
receieved on this socket *can* be spliced, not that they *will* be spliced.
They'll only actually be spliced if they come from 127.0.0.1 or ::1.
We can't (for now) remove the splice bit entirely, unlike with TCP. Our
gateway mapping means that if the ns initiates communication to the gw
address, we'll translate that to target 127.0.0.1 on the host side. Reply
packets will therefore have source address 127.0.0.1 when received on the
host, but these need to go via the tap path where that will be translated
back to the gateway address. We need the @splice bit to distinguish that
case from packets going from localhost to a port mapped explicitly with
-u which should be spliced.
Signed-off-by: David Gibson
Currently, when ports are forwarded inbound in pasta mode, we open two
sockets for incoming traffic: one listens on the public IP address and will
forward packets to the tuntap interface. The other listens on localhost
and forwards via "splicing" (resending directly via sockets in the ns).
Now that we've improved the logic about whether we "splice" any individual
packet, we don't need this. Instead we can have a single socket bound to
0.0.0.0 or ::, marked as able to splice and udp_sock_handler() will deal
with each packet as appropriate.
Signed-off-by: David Gibson
participants (2)
-
David Gibson
-
Stefano Brivio