On Thu, Nov 17, 2022 at 08:21:00AM +0100, Stefano Brivio wrote:
> On Thu, 17 Nov 2022 16:07:32 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>
> > In preparation for trying to implement dual stack sockets for UDP,
> > I've been getting my head around how the UDP splicing works. Alas,
> > I'm pretty sure that it's broken if there's not a one-to-one
> > correspondence between init side source ports and ns side
> > destination ports. That will typically be the case, but given it's
> > UDP there's no guarantee.
>
> I understand the concern below, but I don't understand this part,
> that is: in which other way is it broken?

Uh.. other than?

> > In addition, UDP servers in the ns will not see the correct port
> > numbers with getpeername(). That's also true of spliced TCP
> > connections (see https://bugs.passt.top/show_bug.cgi?id=39), but
> > it's more likely to matter for UDP (I don't know of any TCP
> > protocols that care about source port number on the server side,
> > but there are some common UDP protocols that have at least port
> > number conventions on both sides).
>
> I can think of DHCP and DNS, for which we offer special handling
> somehow. Still, if the flow is started by the guest or container,
> replies should really come with a source port matching the
> destination port used initially.

I'm not concerned so much about replies coming from a different port
as a server which expects initial requests from a particular port.
Still not that likely, but more likely than with TCP.

> For TCP, I don't see this as an issue at all.

I largely agree.

> > I can expand on the details later, but pasta will do the wrong
> > thing in at least some circumstances for both a single init side
> > socket sendto()ing packets to multiple different ports in the
> > ns/guest and multiple init side sockets send()ing to the same port
> > in the ns/guest. I think I know how to fix it, but it's not a
> > trivial job.

I'll fill in the details below.

> > So, the question is do I embark on this now, or do I just remove
> > UDP "splicing" entirely for the time being (other than a minimum
> > required to make -U work)? That would unblock dual stack UDP
> > sockets and we can attempt to reoptimize this later.
>
> So, I'm not really sure what's broken here, but in any case, UDP
> "splicing" doesn't offer as much value as the TCP one does, the
> difference in packet rate is not that big. I don't see a problem if
> we want to remove it temporarily.

Ok, good to know.

> The only real concern I have is how easy it would be to add it back
> after a rework not taking that functionality into account.

I actually think this will fit better with the tap path once I've
made the dual stack socket changes.

Ok, for the details of the problem. I'm only considering the case
where the host side initiates the communication. I think there are
similar cases the other way, but I haven't thought them through.

Scenario 1: one source port, multiple destination ports

Here pasta is running with -u 200 -u 300

  1. Client on the host opens UDP socket A and binds it to
     localhost:100
  2. Client sends datagram 1 on socket A to localhost:200 with
     sendto()
  3. Datagram 1 is received by pasta on splice socket B bound to
     localhost:200
  4. Because of the -u 200, pasta handles this in
     udp_sock_handler_splice(), ref has splice==UDP_TO_NS
  5. recvmmsg() gets a single datagram, from source localhost:100, so
     src==100
  6. udp_splice_map[v6][100].ns_conn_sock is empty, so we call
     udp_splice_connect_ns()
     6.1. udp_splice_connect() creates socket B*, and connects it to
          localhost:200 in the namespace
     6.2. udp_splice_map[v6][100].ns_conn_sock is populated with
          socket B*
  7. sendmmsg() forwards the datagram to socket B*
  8. Datagram 1 correctly reaches port 200 within the ns
  9. Client sends datagram 2 on socket A to localhost:300 with
     sendto()
 10. Datagram 2 is received by pasta on socket C bound to
     localhost:300
 11. Again, pasta handles this in udp_sock_handler_splice() with
     UDP_TO_NS. Again, src==100
 12. udp_splice_map[v6][100] is populated with socket B* from above
 13. sendmmsg() forwards datagram 2 to socket B*
*14. Datagram 2 is incorrectly delivered to port 200 within the ns,
     instead of port 300

Scenario 2: multiple source ports, one destination port

Here pasta is running with -u 1000

  1. Client on the host opens socket A bound to localhost:2000
  2. Client on the host opens socket B bound to localhost:3000
  3. Client sends datagram 1 from socket A to localhost:1000 with
     sendto()
  4. Client sends datagram 2 from socket B to localhost:1000 with
     sendto()
  5. Datagram 1 and 2 are both received by pasta on socket C bound to
     localhost:1000, with UDP_TO_NS
  6. Datagram 1 and 2 happen to both be received by the same
     recvmmsg(), in that order
  7. udp_sock_handler_splice() only examines udp_mmh_recv[0] and so
     sets src==2000
  8. udp_splice_map[v6][2000].ns_conn_sock is unpopulated, so
     udp_splice_connect_ns() is called
     8.1. udp_splice_connect() creates socket C* and connects it to
          localhost:1000 within the guest, let's say it gets
          ephemeral bound port 50000. It's tagged with
          UDP_BACK_TO_INIT
     8.2. udp_splice_map[v6][2000].ns_conn_sock is populated with
          socket C*
     8.3. udp_splice_map[v6][50000].init_bound_sock is populated with
          socket C
     8.4. udp_splice_map[v6][50000].init_dst_port is populated with
          2000
  9. sendmmsg() forwards datagrams 1 & 2 to socket C*
 10. Datagrams 1 & 2 correctly delivered to port 1000 in the
     namespace
 11. Server within the namespace receives datagram 1 with recvfrom().
     From address is localhost:50000 (socket C*)
 12. Server sends reply datagram 1* to localhost:50000 within the ns
 13. Server receives datagram 2 with recvfrom(). From address is
     again localhost:50000 (socket C*)
 14. Server sends reply datagram 2* to localhost:50000 within the ns
 15. pasta receives datagrams 1* and 2* on socket C*.
     UDP_BACK_TO_INIT and dst==50000 from the epoll ref
 16. udp_sock_handler_splice() sets s to socket C from
     udp_splice_map[v6][50000].init_bound_sock, and send_dst to 2000
     from udp_splice_map[v6][50000]
 17. sendmmsg() forwards datagram 1* on socket C to localhost:2000
 18. Datagram 1* correctly received by socket A on localhost:2000
 19. sendmmsg() forwards datagram 2* on socket C to localhost:2000
*20. Datagram 2* incorrectly received by socket A on localhost:2000
     instead of socket B on localhost:3000

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
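
The two scenarios above can be reproduced with a toy model. This is only
a sketch under stated assumptions, not pasta's code: every toy_* name is
hypothetical, plain port numbers stand in for sockets, and the table
keeps just the two pieces of state the walkthrough relies on (the
"connected" ns-side destination cached per init-side source port, and
the init-side reply port cached per ns-side ephemeral port).

```c
#include <assert.h>
#include <string.h>

#define TOY_NUM_PORTS 65536

/* Hypothetical stand-in for one udp_splice_map entry, indexed by port.
 * Ports replace the real sockets; 0 means "not populated". */
struct toy_splice_port {
	int ns_conn_dst;	/* ns port the cached connected socket targets */
	int init_dst_port;	/* init-side port that replies are sent to */
};

static struct toy_splice_port toy_map[TOY_NUM_PORTS];

/* Drop all cached state, as if no datagram had been seen yet. */
static void toy_reset(void)
{
	memset(toy_map, 0, sizeof(toy_map));
}

/* Scenario 1: the lookup uses the source port only. The first datagram
 * from src fixes the destination; later dst values are ignored.
 * Returns the ns port the datagram actually reaches in this model. */
static int toy_forward_to_ns(int src, int dst)
{
	assert(src > 0 && src < TOY_NUM_PORTS);
	assert(dst > 0 && dst < TOY_NUM_PORTS);

	if (!toy_map[src].ns_conn_dst)
		toy_map[src].ns_conn_dst = dst;	/* "connect" on first use */

	return toy_map[src].ns_conn_dst;
}

/* Scenario 2: one received batch, of which only srcs[0] is examined.
 * The whole batch leaves through one ns-side socket whose bound port
 * is given by the caller as 'ephemeral'. Returns that port. */
static int toy_forward_batch(const int *srcs, int n, int dst, int ephemeral)
{
	assert(n > 0);
	assert(ephemeral > 0 && ephemeral < TOY_NUM_PORTS);

	toy_forward_to_ns(srcs[0], dst);
	toy_map[ephemeral].init_dst_port = srcs[0];	/* one reply target */

	return ephemeral;
}

/* Init-side port a reply arriving on 'ephemeral' is forwarded to. */
static int toy_reply_dst(int ephemeral)
{
	assert(ephemeral > 0 && ephemeral < TOY_NUM_PORTS);

	return toy_map[ephemeral].init_dst_port;
}
```

With this model, toy_forward_to_ns(100, 200) returns 200 and a following
toy_forward_to_ns(100, 300) also returns 200, which is the misdelivery
in scenario 1. For scenario 2, forwarding a batch with sources
{ 2000, 3000 } to port 1000 via ephemeral port 50000 leaves
toy_reply_dst(50000) at 2000, so both replies go back to socket A.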