On Sat, Sep 23, 2023 at 12:06:05AM +0200, Stefano Brivio wrote:
> The fundamental patch here is 3/5, which is a workaround for a rather
> surprising kernel behaviour we seem to be hitting. This all comes from
> the investigation around https://bugs.passt.top/show_bug.cgi?id=74.
>
> I can't hit stalls anymore and throughput looks finally good to me
> (~3.5gbps with 208 KiB rmem_max and wmem_max), but... please test.

Write side issue, testing results

I replied with a bunch of test information already, but that was all
related specifically to the read-side issue: I used 16MiB wmem_max
throughout, but limited the read side buffer either with rmem_max or
SO_RCVBUF.

I've now done some tests looking specifically for write side issues. I
basically reversed the setup, with rmem_max set to 4MiB throughout, but
wmem_max limited to 256KiB.

With no patches applied, I easily get a stall, although the exact
details are a little different from the read-side stall: rather than
throughput being consistently 0, there are a few small bursts of
traffic on both sides.

With 2/5 applied, there doesn't appear to be much difference in
behaviour.

With 3/5 applied, I can no longer reproduce stalls, but throughput
isn't very good.

With 4/5 applied, throughput seems to improve notably (from ~300Mbps to
~2.5Gbps, though it's not surprisingly variable from second to second).

Tentative conclusions:

* The primary cause of the stalls appears to be the kernel bug
  identified, where the window isn't properly recalculated after
  MSG_TRUNC. 3/5 appears to successfully work around that bug. I
  think getting that merged is our top priority.

* 2/5 makes logical sense to me, but I don't see a lot of evidence of
  it changing the behaviour here much. I think we hold it back for
  now, polish it a bit, maybe reconsider it as part of a broader
  rethink of the STALLED flag.

* 4/5 doesn't appear to be linked to the stalls per se, but does
  appear to generally improve behaviour with limited wmem_max. I
  think we can improve the implementation a bit, then look at merging
  it as the second priority.

Open questions:

* Even with the fixes, why does very large rmem_max seem to cause
  wildly variable and not great throughput?

* Why does explicitly limiting RCVBUF usually, but not always, cause
  very poor throughput but without stalling?

* Given the above oddities, is there any value to us setting RCVBUF
  for TCP sockets, rather than just letting the kernel adapt it?

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
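
For reference, limiting the read side per socket with SO_RCVBUF (as
opposed to system-wide via rmem_max) can be sketched roughly as below.
This is a minimal sketch and not passt code: the helper name
limit_rcvbuf() is hypothetical, and it only shows the standard
setsockopt()/getsockopt() pair used to request a size and read back
what the kernel actually applied.

```c
#include <sys/socket.h>

/* Hypothetical helper, not taken from passt: request a receive buffer of
 * 'bytes' on socket 'fd', then return the size the kernel reports for it,
 * or -1 if either call fails.
 */
int limit_rcvbuf(int fd, int bytes)
{
	int effective;
	socklen_t len = sizeof(effective);

	/* Request the new size; the kernel may cap or adjust it */
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
		return -1;

	/* Read back the value actually in effect */
	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &len) < 0)
		return -1;

	return effective;
}
```

Per socket(7), Linux doubles the requested value to allow for
bookkeeping overhead and caps the request at rmem_max, so the value
read back will usually differ from the one passed in. As far as I know,
setting SO_RCVBUF explicitly also turns off receive buffer autotuning
for that socket, which is relevant to the last open question above.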