On Mon, 18 Aug 2025 19:40:39 +0200
Paul Holzinger
Hi,
Starting from Linux kernel commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks"), window limits are enforced more aggressively with a bigger amount of zero-window updates compared to what happened with e2142825c120 ("net: tcp: send zero-window ACK when no memory") alone, and occasional duplicate ACKs can now be seen also for local transfers with default (208 KiB) socket buffer sizes.
Paul reports that, with 6.17-rc1-ish kernels, Podman tests for the pasta integration occasionally fail on the "TCP/IPv4 large transfer, tap" case.
While playing with a reproducer that seems to be matching those failures:
while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 OPEN:/tmp/large.rcv,trunc & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done
and a kernel including that commit, I hit a few different failures, that should be fixed by this series.
Stefano Brivio (6): tcp: FIN flags have to be retransmitted as well tcp: Factor sequence rewind for retransmissions into a new function tcp: Rewind sequence when guest shrinks window to zero tcp: Fix closing logic for half-closed connections tcp: Don't try to transmit right after the peer shrank the window to zero tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent
tcp.c | 170 ++++++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 118 insertions(+), 52 deletions(-) I applied the series in my test VM and run the reproducer from the
On 15/08/2025 18:10, Stefano Brivio wrote: podman tests where I noticed the flake.
It failed after about 10mins and then when I enabled pasta trace logs and tcpdump captures then it took 208 mins for me to reproduce so it doesn't seem to fix the issue I am seeing.
I did send the pasta log/pcap files in private to Stefano for further debugging.
Here's a quick summary of the issue by the way, if somebody is curious. Excerpt of packet capture and pasta.log attached as well. Commented capture: $ tshark -r pasta-container.pcap -td -Y 'frame.number in { 345 .. 369 }' 345 0.000018 192.168.122.100 → 169.254.1.2 60 TCP 5978 → 51626 [ACK] Seq=1 Ack=10086371 Win=140288 Len=0 346 0.000014 169.254.1.2 → 192.168.122.100 65540 TCP 51626 → 5978 [PSH, ACK] Seq=10086371 Ack=1 Win=65536 Len=65480 347 0.000024 169.254.1.2 → 192.168.122.100 740 TCP 51626 → 5978 [PSH, ACK] Seq=10151851 Ack=1 Win=65536 Len=680 348 0.000031 169.254.1.2 → 192.168.122.100 65540 TCP 51626 → 5978 [PSH, ACK] Seq=10152531 Ack=1 Win=65536 Len=65480 349 0.000032 169.254.1.2 → 192.168.122.100 8708 TCP [TCP Window Full] 51626 → 5978 [PSH, ACK] Seq=10218011 Ack=1 Win=65536 Len=8648 350 0.000005 192.168.122.100 → 169.254.1.2 60 TCP 5978 → 51626 [ACK] Seq=1 Ack=10218011 Win=8704 Len=0 351 0.000128 192.168.122.100 → 169.254.1.2 60 TCP 5978 → 51626 [ACK] Seq=1 Ack=10226659 Win=65536 Len=0 352 0.000051 169.254.1.2 → 192.168.122.100 116 TCP 51626 → 5978 [PSH, ACK] Seq=10226659 Ack=1 Win=65536 Len=56 353 0.000038 169.254.1.2 → 192.168.122.100 65540 TCP [TCP Window Full] 51626 → 5978 [PSH, ACK] Seq=10226715 Ack=1 Win=65536 Len=65480 354 0.000021 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5978 → 51626 [ACK] Seq=1 Ack=10292195 Win=0 Len=0 355 0.000122 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5978 → 51626 [ACK] Seq=1 Ack=10292195 Win=110592 Len=0 356 0.000067 169.254.1.2 → 192.168.122.100 65540 TCP 51626 → 5978 [PSH, ACK] Seq=10292195 Ack=1 Win=65536 Len=65480 357 0.000031 169.254.1.2 → 192.168.122.100 45172 TCP [TCP Window Full] 51626 → 5978 [PSH, ACK] Seq=10357675 Ack=1 Win=65536 Len=45112 358 0.000038 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5978 → 51626 [ACK] Seq=1 Ack=10357675 Win=0 Len=0 359 0.000023 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5978 → 51626 [ACK] Seq=1 Ack=10357675 Win=69632 Len=0 360 0.000037 169.254.1.2 → 192.168.122.100 65540 TCP [TCP Out-Of-Order] 51626 → 5978 [PSH, ACK] Seq=10357675 Ack=1 Win=65536 Len=65480 ...so far so good. What tshark marks as out-of-order is a result of the kernel shrinking the window and us retransmitting from there (that's patch 3/6). Then this happens: -- 361 0.000018 169.254.1.2 → 192.168.122.100 4212 TCP [TCP Window Full] 51626 → 5978 [PSH, ACK] Seq=10423155 Ack=1 Win=65536 Len=4152 - we send some data -- 362 0.000036 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5978 → 51626 [ACK] Seq=1 Ack=10423155 Win=0 Len=0 - the kernel shrinks the window: according to frame #359, we could send up to sequence byte 10357675 + 69632 = 10427307, and so we did that in frame #361 (10423155 + 4152 = 10427307), but now that's "too much" (this looks pathological and maybe we should have a second / third / fourth attempt at trying to fix the kernel...) -- 363 0.000066 169.254.1.2 → 192.168.122.100 62666 TCP [TCP Out-Of-Order] 51626 → 5978 [PSH, ACK] Seq=10423155 Ack=1 Win=65536 Len=62606 - we have more data from the socket, corresponding pasta.log excerpt: 0.8967: pasta: epoll event on connected TCP socket 81 (events: 0x00002001) 0.8967: Flow 0 (TCP connection): event at tcp_sock_handler:2416 0.8968: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:393 0.8968: Flow 0 (TCP connection): ACK_FROM_TAP_BLOCKS dropped 0.8968: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:394 0.8968: Flow 0 (TCP connection): flag at tcp_buf_data_from_sock:418 0.8968: Flow 0 (TCP connection): timer expires in 2.000s and we decide to send some, out of window. This is intended. However: -- 364 0.000006 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5978 → 51626 [ACK] Seq=1 Ack=10423155 Win=0 Len=0 ...the kernel didn't reopen the window yet -- 365 0.000058 169.254.1.2 → 192.168.122.100 60 TCP 51626 → 5978 [FIN, PSH, ACK] Seq=10485761 Ack=1 Win=65536 Len=0 but we send a FIN with an updated sequence (this looks wrong): 0.8968: pasta: epoll event on connected TCP socket 81 (events: 0x00002001) 0.8968: Flow 0 (TCP connection): event at tcp_sock_handler:2416 0.8968: Flow 0 (TCP connection): flag at tcp_prepare_flags:1195 0.8968: Flow 0 (TCP connection): flag at tcp_prepare_flags:1201 0.8968: Flow 0 (TCP connection): timer expires in 2.000s 0.8968: Flow 0 (TCP connection): event at tcp_buf_data_from_sock:379 0.8968: Flow 0 (TCP connection): TAP_FIN_SENT: CLOSE_WAIT -> LAST_ACK -- 366 0.000003 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5978 → 51626 [ACK] Seq=1 Ack=10423155 Win=0 Len=0 same as before... -- 367 0.000021 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5978 → 51626 [ACK] Seq=1 Ack=10423155 Win=73216 Len=0 368 0.000119 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5978 → 51626 [ACK] Seq=1 Ack=10423155 Win=155136 Len=0 and finally we have window updates, but we're in LAST_ACK and for whatever reason we never send the missing data. Things time out after two minutes. -- I guess I missed to rewind the sequence in some cases where the window shrinks, or something else to that effect. -- Stefano