If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and
send a FIN segment to the guest / container acknowledging a sequence
number that's behind what we received so far, we won't have any
further trigger to send an updated ACK segment, as we are now
switching the epoll socket monitoring to edge-triggered mode.
To avoid this situation, in tcp_update_seqack_wnd(), we set the next
acknowledgement sequence to the current observed sequence, regardless
of what was acknowledged socket-side.
However, we don't necessarily call tcp_update_seqack_wnd() before
sending the FIN segment, which might lead to this exchange, where
192.168.10.102 is an HTTP server in a Podman container, and
192.168.10.44 is a client fetching approximately 160 KB of data
from it:
  82 2.026811 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [FIN, ACK] Seq=121441 Ack=143 Win=65536 Len=0

the server is done sending

  83 2.026898 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [ACK] Seq=143 Ack=114394 Win=216192 Len=0

pasta (client) acknowledges a previous sequence

  84 2.027324 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [FIN, ACK] Seq=143 Ack=114394 Win=216192 Len=0

pasta (client) sends FIN, ACK as the client has no more data to
send (a single GET request), while still acknowledging a previous
sequence (because that's what the client acknowledged so far)

  85 2.027349 192.168.10.102 → 192.168.10.44 54 TCP 55414 → 44992 [ACK] Seq=121442 Ack=144 Win=65536 Len=0

the server acknowledges the FIN, ACK

  86 2.224125 192.168.10.102 → 192.168.10.44 4150 TCP [TCP Retransmission] 55414 → 44992 [ACK] Seq=114394 Ack=144 Win=65536 Len=4096 [TCP segment of a reassembled PDU]

...and nothing happens for a while, until, approximately 200 ms
later, the server concludes that it needs to retransmit some data
as it wasn't acknowledged.
This retransmission is totally unnecessary, so avoid it by setting
the ACK sequence to whatever we received from the container / guest,
before sending a FIN segment and switching to EPOLLET.
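
As an illustration only, the ordering described above might be sketched
like this. The struct, field and function names are hypothetical
stand-ins, not the actual passt code: the point is just that the ACK
sequence is forced to the observed one before the FIN goes out, instead
of relying on what the socket side acknowledged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified connection state (not the passt structures) */
struct conn_sketch {
	uint32_t seq_init_from_tap;	/* initial sequence from guest */
	uint32_t seq_from_tap;		/* next sequence expected from guest */
	uint32_t seq_ack_to_tap;	/* last sequence we acknowledged to guest */
	bool edge_triggered;		/* stand-in for the switch to EPOLLET */
};

/* Wrap-safe check: is sequence a before sequence b? */
static bool seq_lt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Regular update: follow what the socket side acknowledged, unless forced */
static void update_seq_ack(struct conn_sketch *c, uint32_t sock_bytes_acked,
			   bool force_seq)
{
	uint32_t ack = force_seq ? c->seq_from_tap
				 : c->seq_init_from_tap + sock_bytes_acked;

	if (seq_lt(c->seq_ack_to_tap, ack))
		c->seq_ack_to_tap = ack;
}

/* Before sending FIN: acknowledge everything received from the guest,
 * as no further event will trigger an updated ACK once edge-triggered.
 * Returns the acknowledgement number to use in the FIN segment.
 */
static uint32_t prepare_fin(struct conn_sketch *c)
{
	/* We should never have acknowledged more than we received */
	assert(!seq_lt(c->seq_from_tap, c->seq_ack_to_tap));

	c->seq_ack_to_tap = c->seq_from_tap;
	c->edge_triggered = true;

	return c->seq_ack_to_tap;
}
```

With the numbers from the capture above, the FIN would carry
Ack=121441 rather than Ack=114394, so the server would have nothing
left to retransmit.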
Further on:

  87 2.224202 192.168.10.44 → 192.168.10.102 54 TCP 44992 → 55414 [RST] Seq=144 Win=0 Len=0

as we see the retransmission, we conclude that this was the final
ACK and close the connection. But the retransmission should instead
tell us that the server is waiting for an ACK segment we never sent.
To fix this, revert the TAP_FIN_RCVD event if we get a retransmission,
because the peer is rewinding the sequence, which implies that the FIN
will be retransmitted as well, eventually. And, until then, we
shouldn't handle that side of the connection as closed. We need to
process (silently discard) this data, instead.
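
A minimal sketch of that check, under simplifying assumptions:
FIN_RCVD_SKETCH is a hypothetical stand-in for the TAP_FIN_RCVD event,
the struct and function names are made up for illustration, and any
segment with payload starting before the expected sequence is treated
as a plain retransmission (partially new data is not handled here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIN_RCVD_SKETCH	(1U << 0)	/* stand-in for TAP_FIN_RCVD */

/* Hypothetical, simplified receive-side state (not the passt structures) */
struct rx_conn {
	uint32_t seq_from_tap;	/* next sequence expected from guest */
	unsigned int events;	/* event flags for this connection */
};

/* Wrap-safe check: is sequence a before sequence b? */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Look at a segment from the guest. If it carries data we already saw,
 * the peer rewound its sequence and will send FIN again, eventually: drop
 * the FIN event and report the segment as retransmitted, so that the
 * caller discards the payload instead of taking it as the final ACK.
 */
static bool rx_is_retransmission(struct rx_conn *c, uint32_t seq, uint32_t len)
{
	assert(c);

	if (len && seq_before(seq, c->seq_from_tap)) {
		c->events &= ~FIN_RCVD_SKETCH;
		return true;
	}

	return false;
}
```

In the exchange above, segment 86 (Seq=114394, Len=4096) arrives when
the expected sequence is 121442, so it would be flagged as a
retransmission, while the empty ACK in segment 85 would not.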
Link: https://github.com/containers/podman/issues/27179
Signed-off-by: Stefano Brivio