In preparation for trying to implement dual stack sockets for UDP, I've been getting my head around how the UDP splicing works. Alas, I'm pretty sure that it's broken if there's not a one-to-one correspondence between init side source ports and ns side destination ports. That will typically be the case, but given its UDP there's no guarantee. In addition, UDP servers in the ns will not see the correct port numbers with getpeername(). That's also true of spliced TCP connections (see https://bugs.passt.top/show_bug.cgi?id=39), but it's more likely to matter for UDP (I don't know of any TCP protocols that care about source port number on the server side, but there are some common UDP protocols that have at least port number conventions on both sides). I can expand on the details later, but pasta will do the wrong thing in at least some circumstances for both a single init side socket sendto()ing packets to multiple different ports in the ns/guest and multiple init side sockets send()ing to the same port in the ns/guest. I think I know how to fix it, but it's not a trivial job. So, the question is do I embark on this now, or do I just remove UDP "splicing" entirely for the time being (other than a minimum required to make -U work)? That would unblock dual stack UDP sockets and we can attempt to reoptimize this later. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Thu, 17 Nov 2022 16:07:32 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:In preparation for trying to implement dual stack sockets for UDP, I've been getting my head around how the UDP splicing works. Alas, I'm pretty sure that it's broken if there's not a one-to-one correspondence between init side source ports and ns side destination ports. That will typically be the case, but given its UDP there's no guarantee.I understand the concern below, but I don't understand this part, that is: in which other way is it broken?In addition, UDP servers in the ns will not see the correct port numbers with getpeername(). That's also true of spliced TCP connections (see https://bugs.passt.top/show_bug.cgi?id=39), but it's more likely to matter for UDP (I don't know of any TCP protocols that care about source port number on the server side, but there are some common UDP protocols that have at least port number conventions on both sides).I can think of DHCP and DNS, for which we offer special handling somehow. Still, if the flow is started by the guest or container, replies should really come with a source port matching the destination port used initially. For TCP, I don't see this is as an issue at all.I can expand on the details later, but pasta will do the wrong thing in at least some circumstances for both a single init side socket sendto()ing packets to multiple different ports in the ns/guest and multiple init side sockets send()ing to the same port in the ns/guest. I think I know how to fix it, but it's not a trivial job. So, the question is do I embark on this now, or do I just remove UDP "splicing" entirely for the time being (other than a minimum required to make -U work)? That would unblock dual stack UDP sockets and we can attempt to reoptimize this later.So, I'm not really sure what's broken here, but in any case, UDP "splicing" doesn't offer as much value as the TCP one does, the difference in packet rate is not that big. I don't see a problem if we want to remove it temporarily. The only real concern I have is how easy it would be to add it back after a rework not taking that functionality into account. -- Stefano
On Thu, Nov 17, 2022 at 08:21:00AM +0100, Stefano Brivio wrote:On Thu, 17 Nov 2022 16:07:32 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Uh.. other than?In preparation for trying to implement dual stack sockets for UDP, I've been getting my head around how the UDP splicing works. Alas, I'm pretty sure that it's broken if there's not a one-to-one correspondence between init side source ports and ns side destination ports. That will typically be the case, but given its UDP there's no guarantee.I understand the concern below, but I don't understand this part, that is: in which other way is it broken?I'm not concerned so much about replies coming from a different port as a server which expects initial requests from a particular port. Still not that likely, but more likely than with TCP.In addition, UDP servers in the ns will not see the correct port numbers with getpeername(). That's also true of spliced TCP connections (see https://bugs.passt.top/show_bug.cgi?id=39), but it's more likely to matter for UDP (I don't know of any TCP protocols that care about source port number on the server side, but there are some common UDP protocols that have at least port number conventions on both sides).I can think of DHCP and DNS, for which we offer special handling somehow. Still, if the flow is started by the guest or container, replies should really come with a source port matching the destination port used initially.For TCP, I don't see this is as an issue at all.I largely agree.I'll fill in the details below.I can expand on the details later, but pasta will do the wrong thing in at least some circumstances for both a single init side socket sendto()ing packets to multiple different ports in the ns/guest and multiple init side sockets send()ing to the same port in the ns/guest. I think I know how to fix it, but it's not a trivial job. So, the question is do I embark on this now, or do I just remove UDP "splicing" entirely for the time being (other than a minimum required to make -U work)? That would unblock dual stack UDP sockets and we can attempt to reoptimize this later.So, I'm not really sure what's broken here, but in any case, UDP"splicing" doesn't offer as much value as the TCP one does, the difference in packet rate is not that big. I don't see a problem if we want to remove it temporarily.Ok, good to know.The only real concern I have is how easy it would be to add it back after a rework not taking that functionality into account.I actually think this will fit better with the tap path once I've made the dual stack socket changes. Ok, for the details of the problem. I'm only considering the case where the host side initiates the communication. I think there are similar cases the other way, but I haven't thought them through. Scenario 1: one source port, multiple destination ports Here pasta is running with -u 200 -u 300 1. Client on the host opens UDP socket A and binds it to localhost:100 2. Client sends datagram 1 on socket A to localhost:200 with sendto() 3. Datagram 1 is received by pasta on splice socket B bound to localhost:200 4. Because of the -U 200, pasta handles this in udp_sock_handler_splice(), ref has splice==UDP_TO_NS 5, recvmmsg() gets a single datagram, from source localhost:100, so src==100 6. udp_splice_map[v6][100].ns_conn_sock is empty, so we call udp_splice_connect_ns() 6.1. udp_splice_connect() creates socket B*, and connects it to localhost:200 in the namespace 6.2. udp_splice_map[v6][100].ns_conn_sock is populated with socket B* 7. sendmmsg() forwards the datagram to socket B* 8. Datagram 1 correctly reaches port 200 within the ns 9. Client sends datagram 2 on socket A to localhost:300 with sendto() 11. Datagram 2 is received by pasta on socket C bound to localhost:300 10. Again, pasta handles this in udp_sock_handler_splice() with UDP_TO_NS. Again, src==100 11. udp_splice_map[v6][100] is populated with socket B* from above 12. sendmmsg() forwads datagram 2 to socket B* * 13, Datagram 2 is incorrectly delivered to port 200 within the ns, instead of port 300 Scenario 2: multiple source ports, one destination port Here pasta is running with -u 1000 1. Client on the host opens socket A bound to localhost:2000 2. Client on the host opens socket B bound to localhost:3000 2. Client sends datagram 1 from socket A to localhost:1000 with sendto() 3. Client sends datagram 2 from socket B to localhost:1000 with sendto() 4. Datagram 1 and 2 are both received by pasta on socket C bound to localhost:1000, with UDP_TO_NS 5. Datagram 1 and 2 happen to both be received by the same recvmmsg(), in that order 6. udp_sock_handler_splice() only examines udp_mmh_recv[0] and so sets src==2000 7. udp_splice_map[v6][2000].ns_conn_sock is unpopulated, so udp_splice_connect_ns() is called 7.1 udp_splice_connect creates socket C* and connects it to localhost:1000 within the guest, let's say it gets ephemeral bound port 50000. It's tagged with UDP_BACK_TO_INIT 7.2 udp_splice_map[v6][2000].ns_conn_sock is populated with socket C* 7.3 udp_splice_map[v6][50000].init_bound_sock is populated with socket C 7.4 udp_splice_map[v6][50000].init_dst_port is populated with 2000 8. sendmmsg() forwads datagrams 1 & 2 to socket C* 9. Datagrams 1 & 2 correctly delivered to port 1000 in the namespace 10. Server within the namespace receives datagram 1 with recvfrom(). From address is localhost:50000 (socket C*) 11. Server sends reply datagram 1* to localhost:50000 within the ns 12. Server receives datagram 2 with recvfrom(). From address is again localhost:50000 (socket C*) 13. Server sends reply datagram 2* to localhost:50000 within the ns 14. pasta receives datagrams 1* and 2* on socket C*. UDP_BACK_TO_INIT and dst==50000 from the epoll ref 15. udp_sock_handler_splice() sets s to socket C from udp_splice_map[v6][50000].init_bound_sock, and send_dst to 2000 from udp_splice_map[v6][50000] 16. sendmmsg() forwards datagram 1* on socket C to localhost:2000 17. Datagram 1* correctly received by socket A on localhost:2000 18. sendmmsg() forwards datagram 2* on socket C to localhost:2000 * 19. Datagram 2* incorrectly received by socket A on localhost:2000 instead of socket B on localhost:3000 -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Thu, 17 Nov 2022 21:30:03 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Thu, Nov 17, 2022 at 08:21:00AM +0100, Stefano Brivio wrote:Sorry, I didn't understand it's just broken for the cases below, but not in case one source port corresponds one destination port (you wrote it, I misread). So, well, I disagree on it being fairly broken: it works in the most common case. This doesn't really matter though: - it's an obstacle to unify IPv4 and IPv6 sockets - it's definitely not pretty -- it was more of a sketch I wanted to rewrite for a long time - it doesn't make a huge difference in packet rates (far from having the same impact as the TCP spliced connections) - it's broken in those (albeit probably uncommon) cases below ...let's drop it for now, and add back a saner version later, now that we know in much better detail how it should work and where the problems might be.On Thu, 17 Nov 2022 16:07:32 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Uh.. other than?In preparation for trying to implement dual stack sockets for UDP, I've been getting my head around how the UDP splicing works. Alas, I'm pretty sure that it's broken if there's not a one-to-one correspondence between init side source ports and ns side destination ports. That will typically be the case, but given its UDP there's no guarantee.I understand the concern below, but I don't understand this part, that is: in which other way is it broken?-u 200 hereI'm not concerned so much about replies coming from a different port as a server which expects initial requests from a particular port. Still not that likely, but more likely than with TCP.In addition, UDP servers in the ns will not see the correct port numbers with getpeername(). That's also true of spliced TCP connections (see https://bugs.passt.top/show_bug.cgi?id=39), but it's more likely to matter for UDP (I don't know of any TCP protocols that care about source port number on the server side, but there are some common UDP protocols that have at least port number conventions on both sides).I can think of DHCP and DNS, for which we offer special handling somehow. Still, if the flow is started by the guest or container, replies should really come with a source port matching the destination port used initially.For TCP, I don't see this is as an issue at all.I largely agree.I'll fill in the details below.I can expand on the details later, but pasta will do the wrong thing in at least some circumstances for both a single init side socket sendto()ing packets to multiple different ports in the ns/guest and multiple init side sockets send()ing to the same port in the ns/guest. I think I know how to fix it, but it's not a trivial job. So, the question is do I embark on this now, or do I just remove UDP "splicing" entirely for the time being (other than a minimum required to make -U work)? That would unblock dual stack UDP sockets and we can attempt to reoptimize this later.So, I'm not really sure what's broken here, but in any case, UDP"splicing" doesn't offer as much value as the TCP one does, the difference in packet rate is not that big. I don't see a problem if we want to remove it temporarily.Ok, good to know.The only real concern I have is how easy it would be to add it back after a rework not taking that functionality into account.I actually think this will fit better with the tap path once I've made the dual stack socket changes. Ok, for the details of the problem. I'm only considering the case where the host side initiates the communication. I think there are similar cases the other way, but I haven't thought them through. Scenario 1: one source port, multiple destination ports Here pasta is running with -u 200 -u 300 1. Client on the host opens UDP socket A and binds it to localhost:100 2. Client sends datagram 1 on socket A to localhost:200 with sendto() 3. Datagram 1 is received by pasta on splice socket B bound to localhost:200 4. Because of the -U 200, pasta handles this inudp_sock_handler_splice(), ref has splice==UDP_TO_NS 5, recvmmsg() gets a single datagram, from source localhost:100, so src==100 6. udp_splice_map[v6][100].ns_conn_sock is empty, so we call udp_splice_connect_ns() 6.1. udp_splice_connect() creates socket B*, and connects it to localhost:200 in the namespace 6.2. udp_splice_map[v6][100].ns_conn_sock is populated with socket B* 7. sendmmsg() forwards the datagram to socket B* 8. Datagram 1 correctly reaches port 200 within the ns 9. Client sends datagram 2 on socket A to localhost:300 with sendto() 11. Datagram 2 is received by pasta on socket C bound to localhost:300 10. Again, pasta handles this in udp_sock_handler_splice() with UDP_TO_NS. Again, src==100 11. udp_splice_map[v6][100] is populated with socket B* from above 12. sendmmsg() forwads datagram 2 to socket B* * 13, Datagram 2 is incorrectly delivered to port 200 within the ns, instead of port 300 Scenario 2: multiple source ports, one destination port Here pasta is running with -u 1000 1. Client on the host opens socket A bound to localhost:2000 2. Client on the host opens socket B bound to localhost:3000 2. Client sends datagram 1 from socket A to localhost:1000 with sendto() 3. Client sends datagram 2 from socket B to localhost:1000 with sendto() 4. Datagram 1 and 2 are both received by pasta on socket C bound to localhost:1000, with UDP_TO_NS 5. Datagram 1 and 2 happen to both be received by the same recvmmsg(), in that order 6. udp_sock_handler_splice() only examines udp_mmh_recv[0] and so sets src==2000 7. udp_splice_map[v6][2000].ns_conn_sock is unpopulated, so udp_splice_connect_ns() is called 7.1 udp_splice_connect creates socket C* and connects it to localhost:1000 within the guest, let's say it gets ephemeral bound port 50000. It's tagged with UDP_BACK_TO_INIT 7.2 udp_splice_map[v6][2000].ns_conn_sock is populated with socket C* 7.3 udp_splice_map[v6][50000].init_bound_sock is populated with socket C 7.4 udp_splice_map[v6][50000].init_dst_port is populated with 2000 8. sendmmsg() forwads datagrams 1 & 2 to socket C* 9. Datagrams 1 & 2 correctly delivered to port 1000 in the namespace 10. Server within the namespace receives datagram 1 with recvfrom(). From address is localhost:50000 (socket C*) 11. Server sends reply datagram 1* to localhost:50000 within the ns 12. Server receives datagram 2 with recvfrom(). From address is again localhost:50000 (socket C*) 13. Server sends reply datagram 2* to localhost:50000 within the ns 14. pasta receives datagrams 1* and 2* on socket C*. UDP_BACK_TO_INIT and dst==50000 from the epoll ref 15. udp_sock_handler_splice() sets s to socket C from udp_splice_map[v6][50000].init_bound_sock, and send_dst to 2000 from udp_splice_map[v6][50000] 16. sendmmsg() forwards datagram 1* on socket C to localhost:2000 17. Datagram 1* correctly received by socket A on localhost:2000 18. sendmmsg() forwards datagram 2* on socket C to localhost:2000 * 19. Datagram 2* incorrectly received by socket A on localhost:2000 instead of socket B on localhost:3000Thanks a lot for the details, issues are clear to me now. -- Stefano
On Thu, Nov 17, 2022 at 09:02:35PM +0100, Stefano Brivio wrote:On Thu, 17 Nov 2022 21:30:03 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Ah, right.On Thu, Nov 17, 2022 at 08:21:00AM +0100, Stefano Brivio wrote:Sorry, I didn't understand it's just broken for the cases below, but not in case one source port corresponds one destination port (you wrote it, I misread).On Thu, 17 Nov 2022 16:07:32 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Uh.. other than?In preparation for trying to implement dual stack sockets for UDP, I've been getting my head around how the UDP splicing works. Alas, I'm pretty sure that it's broken if there's not a one-to-one correspondence between init side source ports and ns side destination ports. That will typically be the case, but given its UDP there's no guarantee.I understand the concern below, but I don't understand this part, that is: in which other way is it broken?So, well, I disagree on it being fairly broken: it works in the most common case. This doesn't really matter though:So, I'm less sanguine about this, as you probably realize. Yes, it's not a common case, but the failure mode is nasty. It's not just that it falls back to a slower path, nor even that it just drops packets. At best it's going to lead to very hard to debug failures. At worst, it's a security problem: the second scenario potentially lets an unprivileged host process intercept replies to another user's requests to a server in the ns, or interpose incorrect reponses to that other user's requests.- it's an obstacle to unify IPv4 and IPv6 sockets - it's definitely not pretty -- it was more of a sketch I wanted to rewrite for a long time - it doesn't make a huge difference in packet rates (far from having the same impact as the TCP spliced connections) - it's broken in those (albeit probably uncommon) cases below ...let's drop it for now, and add back a saner version later, now that we know in much better detail how it should work and where the problems might be.I had a look at this today, and unfortunately it's less straightforward than I first thought. The difficulty is that the splicing isn't just a performance hack, it's also the only way that the host can access servers within the ns bound explicitly to 127.0.0.1 or ::1. I'm not particularly convinced allowing that's a good idea (it's a behavioural difference between passt and pasta for one). But, removing it now would be a big semantic change, and if we do so we should make it consistent for tcp as well. Still looking and thinking through options.-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson-u 200 hereI'm not concerned so much about replies coming from a different port as a server which expects initial requests from a particular port. Still not that likely, but more likely than with TCP.In addition, UDP servers in the ns will not see the correct port numbers with getpeername(). That's also true of spliced TCP connections (see https://bugs.passt.top/show_bug.cgi?id=39), but it's more likely to matter for UDP (I don't know of any TCP protocols that care about source port number on the server side, but there are some common UDP protocols that have at least port number conventions on both sides).I can think of DHCP and DNS, for which we offer special handling somehow. Still, if the flow is started by the guest or container, replies should really come with a source port matching the destination port used initially.For TCP, I don't see this is as an issue at all.I largely agree.I'll fill in the details below.I can expand on the details later, but pasta will do the wrong thing in at least some circumstances for both a single init side socket sendto()ing packets to multiple different ports in the ns/guest and multiple init side sockets send()ing to the same port in the ns/guest. I think I know how to fix it, but it's not a trivial job. So, the question is do I embark on this now, or do I just remove UDP "splicing" entirely for the time being (other than a minimum required to make -U work)? That would unblock dual stack UDP sockets and we can attempt to reoptimize this later.So, I'm not really sure what's broken here, but in any case, UDP"splicing" doesn't offer as much value as the TCP one does, the difference in packet rate is not that big. I don't see a problem if we want to remove it temporarily.Ok, good to know.The only real concern I have is how easy it would be to add it back after a rework not taking that functionality into account.I actually think this will fit better with the tap path once I've made the dual stack socket changes. Ok, for the details of the problem. I'm only considering the case where the host side initiates the communication. I think there are similar cases the other way, but I haven't thought them through. Scenario 1: one source port, multiple destination ports Here pasta is running with -u 200 -u 300 1. Client on the host opens UDP socket A and binds it to localhost:100 2. Client sends datagram 1 on socket A to localhost:200 with sendto() 3. Datagram 1 is received by pasta on splice socket B bound to localhost:200 4. Because of the -U 200, pasta handles this inudp_sock_handler_splice(), ref has splice==UDP_TO_NS 5, recvmmsg() gets a single datagram, from source localhost:100, so src==100 6. udp_splice_map[v6][100].ns_conn_sock is empty, so we call udp_splice_connect_ns() 6.1. udp_splice_connect() creates socket B*, and connects it to localhost:200 in the namespace 6.2. udp_splice_map[v6][100].ns_conn_sock is populated with socket B* 7. sendmmsg() forwards the datagram to socket B* 8. Datagram 1 correctly reaches port 200 within the ns 9. Client sends datagram 2 on socket A to localhost:300 with sendto() 11. Datagram 2 is received by pasta on socket C bound to localhost:300 10. Again, pasta handles this in udp_sock_handler_splice() with UDP_TO_NS. Again, src==100 11. udp_splice_map[v6][100] is populated with socket B* from above 12. sendmmsg() forwads datagram 2 to socket B* * 13, Datagram 2 is incorrectly delivered to port 200 within the ns, instead of port 300 Scenario 2: multiple source ports, one destination port Here pasta is running with -u 1000 1. Client on the host opens socket A bound to localhost:2000 2. Client on the host opens socket B bound to localhost:3000 2. Client sends datagram 1 from socket A to localhost:1000 with sendto() 3. Client sends datagram 2 from socket B to localhost:1000 with sendto() 4. Datagram 1 and 2 are both received by pasta on socket C bound to localhost:1000, with UDP_TO_NS 5. Datagram 1 and 2 happen to both be received by the same recvmmsg(), in that order 6. udp_sock_handler_splice() only examines udp_mmh_recv[0] and so sets src==2000 7. udp_splice_map[v6][2000].ns_conn_sock is unpopulated, so udp_splice_connect_ns() is called 7.1 udp_splice_connect creates socket C* and connects it to localhost:1000 within the guest, let's say it gets ephemeral bound port 50000. It's tagged with UDP_BACK_TO_INIT 7.2 udp_splice_map[v6][2000].ns_conn_sock is populated with socket C* 7.3 udp_splice_map[v6][50000].init_bound_sock is populated with socket C 7.4 udp_splice_map[v6][50000].init_dst_port is populated with 2000 8. sendmmsg() forwads datagrams 1 & 2 to socket C* 9. Datagrams 1 & 2 correctly delivered to port 1000 in the namespace 10. Server within the namespace receives datagram 1 with recvfrom(). From address is localhost:50000 (socket C*) 11. Server sends reply datagram 1* to localhost:50000 within the ns 12. Server receives datagram 2 with recvfrom(). From address is again localhost:50000 (socket C*) 13. Server sends reply datagram 2* to localhost:50000 within the ns 14. pasta receives datagrams 1* and 2* on socket C*. UDP_BACK_TO_INIT and dst==50000 from the epoll ref 15. udp_sock_handler_splice() sets s to socket C from udp_splice_map[v6][50000].init_bound_sock, and send_dst to 2000 from udp_splice_map[v6][50000] 16. sendmmsg() forwards datagram 1* on socket C to localhost:2000 17. Datagram 1* correctly received by socket A on localhost:2000 18. sendmmsg() forwards datagram 2* on socket C to localhost:2000 * 19. Datagram 2* incorrectly received by socket A on localhost:2000 instead of socket B on localhost:3000Thanks a lot for the details, issues are clear to me now.