On Wed, 10 Sep 2025 12:29:48 +1000
David Gibson
On Tue, Sep 09, 2025 at 08:16:55PM +0200, Stefano Brivio wrote:
For some reason, tcp_vu_data_from_sock() already takes care of this, but the non-vhost-user version ignores this possibility and just sends out a FIN segment whenever we infer we received one host-side, regardless of the fact that we might have unacknowledged data still to send.
Somewhat surprisingly, this didn't cause any issue to be reported yet, until 6.17-rc1 and 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") came around, leading to the following report from Paul, who hit this running Podman tests:
439 0.033032 169.254.1.2 → 192.168.122.100 65540 TCP 56602 → 5789 [PSH, ACK] Seq=10336361 Ack=1 Win=65536 Len=65480 440 0.033055 169.254.1.2 → 192.168.122.100 30324 TCP [TCP Window Full] 56602 → 5789 [PSH, ACK] Seq=10401841 Ack=1 Win=65536 Len=30264
we're sending data to the container, up to the edge of the window
441 0.033059 192.168.122.100 → 169.254.1.2 60 TCP 5789 → 56602 [ACK] Seq=1 Ack=10401841 Win=83968 Len=0
and the container acknowledges it
442 0.033091 169.254.1.2 → 192.168.122.100 53716 TCP 56602 → 5789 [PSH, ACK] Seq=10432105 Ack=1 Win=65536 Len=53656
we send more data, all we possibly can, in window
443 0.033126 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0
and the container shrinks the window due to the issue introduced by kernel commit e2142825c120 ("net: tcp: send zero-window ACK when no memory"). With a previous patch from this series, we rewind the sequence, meaning that we assign conn->seq_to_tap from conn->seq_ack_from_tap, so that we'll retransmit this segment, by reading again from the socket, and increasing conn->seq_to_tap once more.
However:
444 0.033144 169.254.1.2 → 192.168.122.100 60 TCP 56602 → 5789 [FIN, PSH, ACK] Seq=10485761 Ack=1 Win=65536 Len=0
we eventually get a zero-length read from the socket and we miss the fact that conn->seq_to_tap != conn->seq_ack_from_tap, so we send a FIN flag with the most recent sequence. The kernel insists:
445 0.033147 192.168.122.100 → 169.254.1.2 60 TCP [TCP ZeroWindow] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=0 Len=0
with its buggy zero-window update, but:
446 0.033152 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=69120 Len=0 447 0.033202 192.168.122.100 → 169.254.1.2 60 TCP [TCP Window Update] 5789 → 56602 [ACK] Seq=1 Ack=10432105 Win=142848 Len=0
we don't reset the TAP_FIN_SENT flag anymore, and don't resend the FIN segment (nor data), as we already rewound the sequence earlier.
To solve this, hold off the FIN segment until we get a zero-length read from the socket *and* we know that there's no unacknowledged pending data, also without vhost-user, in tcp_buf_data_from_sock().
Reported-by: Paul Holzinger
Signed-off-by: Stefano Brivio Reviewed-by: David Gibson
At least, I think this makes sense. As always, I find the semantics of the STALLED flag difficult to wrap my head around.
After this fix I start thinking that we should probably split STALLED into several flags explicitly indicating what we are waiting for, because it's rather complicated to make sure that some code path will eventually take care of doing something on a later trigger/condition (in this case, sending the FIN flag when we can). I don't have a concrete list of possible flags in mind but something like WAITING_ON_(x), or consistent sets of BLOCKED_BY_(x) (we can't proceed with anything otherwise) and EXPECTING_(x) (x is expected but we can proceed with something else). Maybe we'll need to go the full (y)_BLOCKED_BY_(x) route, though (FIN_BLOCKED_BY_TAP_ACK?). I hope not. -- Stefano