On Mon, 25 Sep 2023 14:21:47 +1000
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Mon, Sep 25, 2023 at 02:09:41PM +1000, David Gibson wrote:
> > On Sat, Sep 23, 2023 at 12:06:08AM +0200, Stefano Brivio wrote:
> > > It looks like we need it as a workaround for this situation, readily
> > > reproducible at least with a 6.5 Linux kernel, with default rmem_max
> > > and wmem_max values:
> > >
> > > - an iperf3 client on the host sends about 160 KiB, typically
> > >   segmented into five frames by passt. We read this data using
> > >   MSG_PEEK
> > >
> > > - the iperf3 server on the guest starts receiving
> > >
> > > - meanwhile, the host kernel advertised a zero-sized window to the
> > >   sender, as expected
> > >
> > > - eventually, the guest acknowledges all the data sent so far, and we
> > >   drop it from the buffer, courtesy of tcp_sock_consume(), using
> > >   recv() with MSG_TRUNC
> > >
> > > - the client, however, doesn't get an updated window value, and even
> > >   keepalive packets are answered with zero-window segments, until the
> > >   connection is closed
> > >
> > > It looks like dropping data from a socket using MSG_TRUNC doesn't
> > > cause a recalculation of the window, which would be expected as a
> > > result of any receiving operation that invalidates data on a buffer
> > > (that is, not with MSG_PEEK).
> > >
> > > Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
> > > the previous value we clamped to, forces a recalculation of the
> > > window which is advertised to the guest.
> > >
> > > I couldn't quite confirm this issue by following all the possible
> > > code paths in the kernel yet. If confirmed, this should be fixed in
> > > the kernel, but meanwhile this workaround looks robust to me (and it
> > > will be needed for backward compatibility anyway).
> >
> > So, I tested this, and things got a bit complicated.
> >
> > First, I reproduced the "read side" problem by setting
> > net.core.rmem_max to 256 KiB while setting net.core.wmem_max to
> > 16 MiB. The "160 KiB" stall happened almost every time. Applying this
> > patch appears to fix it completely, getting GiB/s throughput
> > consistently. So, yah.
> >
> > Then I tried reproducing it differently: by setting both
> > net.core.rmem_max and net.core.wmem_max to 16 MiB, but setting
> > SO_RCVBUF to 128 KiB explicitly in tcp_sock_set_bufsize() (which
> > actually results in a 256 KiB buffer, because of the kernel's weird
> > interpretation).
> >
> > With the SO_RCVBUF clamp and without this patch, I don't get the
> > 160 KiB stall consistently any more. What I *do* get nearly every
> > time - but not *every* time - is slow transfers: ~40 Mbps vs.
> > ~12 Gbps. Sometimes it stalls after several seconds. The stall is
> > slightly different from the 160 KiB stall, though: the 160 KiB stall
> > seems to be 0 bytes transferred on both sides, while with the RCVBUF
> > stall I get a trickle of bytes (620 bytes/s) on the receiver/guest
> > side, with mostly 0 bytes per interval on the sender but occasionally
> > an interval with several hundred KB. That is, it seems like there's a
> > buffer somewhere that's very slowly draining into the receiver, then
> > getting topped up in an instant once it gets low enough.
> >
> > When I have both this patch and the RCVBUF clamp, I don't seem to be
> > able to reproduce the trickle-stall anymore, but I still get the slow
> > transfer speeds most, but not every, time. Sometimes, but only
> > rarely, I do still seem to get a complete stall (0 bytes on both
> > sides).
>
> I noted another oddity. With this patch, _no_ RCVBUF clamp and 16 MiB
> wmem_max fixed, things seem to behave much better with a small
> rmem_max than with a large one. With rmem_max=256 KiB I get pretty
> consistent 37 Gbps throughput and iperf3 -c reports 0 retransmits.
> With rmem_max=16 MiB, the throughput fluctuates from second to second
> between ~3 Gbps and ~30 Gbps. The client reports retransmits in some
> intervals, which is pretty weird over lo.
>
> Urgh... so many variables.

This is probably due to the receive buffer getting bigger than
TCP_FRAMES_MEM * MSS4 (or MSS6), so the amount of data we can read in
one shot from the sockets isn't optimally sized anymore.

We should have a look at the difference between not clamping at all
(and if that yields the same throughput, great), and clamping to, I
guess, TCP_FRAMES_MEM * MIN(MSS4, MSS6).

--
Stefano
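For reference, a minimal sketch of the workaround described in the quoted
commit message: consume already-acknowledged data with recv() and
MSG_TRUNC, then re-set TCP_WINDOW_CLAMP to force the kernel to recompute
the advertised window. This is an illustration only, not passt's actual
tcp_sock_consume() path; the function name and the 'window' parameter are
made up for the example, and error handling is kept to a minimum.

	#include <errno.h>
	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/socket.h>

	/*
	 * Discard 'len' acknowledged bytes from the socket buffer, then
	 * force a window recalculation by (re)setting TCP_WINDOW_CLAMP,
	 * even if the clamp value itself doesn't change.
	 */
	static int sock_consume_and_reclamp(int s, size_t len,
					    unsigned int window)
	{
		/* MSG_TRUNC on a TCP socket discards data without copying */
		if (recv(s, NULL, len, MSG_DONTWAIT | MSG_TRUNC) < 0 &&
		    errno != EAGAIN && errno != EWOULDBLOCK)
			return -errno;

		/*
		 * Dropping data with MSG_TRUNC alone doesn't appear to
		 * trigger an updated window advertisement; setting
		 * TCP_WINDOW_CLAMP does.
		 */
		if (setsockopt(s, IPPROTO_TCP, TCP_WINDOW_CLAMP,
			       &window, sizeof(window)))
			return -errno;

		return 0;
	}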
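And a small sketch of the SO_RCVBUF behaviour mentioned in the test above
(again illustrative, not tcp_sock_set_bufsize() from passt): requesting a
128 KiB receive buffer yields a 256 KiB one, because the kernel first caps
the request at net.core.rmem_max and then doubles it to leave room for
bookkeeping overhead, and getsockopt() reports the doubled value.

	#include <stdio.h>
	#include <sys/socket.h>

	/* Clamp the receive buffer of socket 's' to a requested 128 KiB */
	static void clamp_rcvbuf(int s)
	{
		int v = 128 * 1024;
		socklen_t sl = sizeof(v);

		if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &v, sizeof(v)))
			perror("SO_RCVBUF");

		/* The kernel doubles the requested size: prints ~262144 */
		if (!getsockopt(s, SOL_SOCKET, SO_RCVBUF, &v, &sl))
			printf("effective receive buffer: %d bytes\n", v);
	}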