[PATCH v2 0/6] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels
Starting from Linux kernel commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks"), window limits are enforced more aggressively with a bigger amount of zero-window updates compared to what happened with e2142825c120 ("net: tcp: send zero-window ACK when no memory") alone, and occasional duplicate ACKs can now be seen also for local transfers with default (208 KiB) socket buffer sizes. Paul reports that, with 6.17-rc1-ish kernels, Podman tests for the pasta integration occasionally fail on the "TCP/IPv4 large transfer, tap" case. While playing with a reproducer that seems to be matching those failures: while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 OPEN:/tmp/large.rcv,trunc & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done and a kernel including that commit, I hit a few different failures, that should be fixed by this series. Paul tested v1 of this series and found an additional failure (transfer timeout), which I could reproduce with a slightly different command: while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 EXEC:./write.sh & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done where write.sh is simply: #!/bin/sh cat > /tmp/large.rcv so that the connection is not half-closed starting from the beginning, because socat can't make assumptions about the unidirectional nature of the traffic. This should now be fixed as well by the new version of patch 3/6. v2: in 3/6, rewind sequence also if the zero-window update comes in the middle of a batch with non-zero window updates Stefano Brivio (6): tcp: FIN flags have to be retransmitted as well tcp: Factor sequence rewind for retransmissions into a new function tcp: Rewind sequence when guest shrinks window to zero tcp: Fix closing logic for half-closed connections tcp: Don't try to transmit right after the peer shrank the window to zero tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent tcp.c | 179 +++++++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 127 insertions(+), 52 deletions(-) -- 2.43.0
If we're retransmitting any data, and we sent a FIN segment to our
peer, regardless of whether it was received, we obviously have to
retransmit it as well, given that it can only come with the last data
segment, or after it.
Unconditionally clear the internal TAP_FIN_SENT flag whenever we
re-transmit, so that we know we have to send it again, in case.
Reported-by: Paul Holzinger
...as I'm going to need a third occurrence of this in the next change.
This introduces a small functional change in tcp_data_from_tap(): the
sequence was previously rewound to the highest ACK number we found in
the current packet batch, and not to the current value of
seq_ack_from_tap.
The two might differ in case tcp_sock_consume() failed, because in
that case we're ignoring that ACK altogether. But if we're ignoring
it, it looks more correct to me to start retransmitting from an
earlier sequence anyway.
Signed-off-by: Stefano Brivio
A window shrunk to zero means by definition that anything else that
might be in flight is now out of window. Restart from the currently
acknowledged sequence.
We need to do that both in tcp_tap_window_update(), where we already
check for zero-window updates, as well as in tcp_data_from_tap(),
because we might get one of those updates in a batch of packets that
also contains a non-zero window update.
Suggested-by: Jon Maloy
First off, don't close connections half-closed by the guest before
our own FIN is acknowledged by the guest itself.
That is, after we receive a FIN from the guest (TAP_FIN_RCVD), if we
don't have any data left to send from the socket (SOCK_FIN_RCVD, or
EPOLLHUP), we send a FIN segment to the guest (TAP_FIN_SENT), but we
need to actually have it acknowledged (and have no pending
retransmissions) before we can close the connection: check for
TAP_FIN_ACKED, first.
Then, if we set TAP_FIN_SENT, and we receive an ACK segment from the
guest, set TAP_FIN_ACKED. This was entirely missing for the
TAP_FIN_RCVD case, and as we fix the problem described above, this
becomes relevant as well.
Signed-off-by: Stefano Brivio
If the peer shrinks the window to zero, we'll skip storing the new
window, as a convenient way to cause window probes (which exceed any
zero-sized window, strictly speaking) if we don't get window updates
in a while.
As we do so, though, we need to ensure we don't try to queue more data
from the socket right after we process this window update, as the
entire point of a zero-window advertisement is to keep us from sending
more data.
Signed-off-by: Stefano Brivio
We currently have a number of discrepancies in the tcp_tap_handler()
path between the half-closed connection path and the regular one, and
they are mostly a result of code duplication, which comes in turn from
the fact that tcp_data_from_tap() deals with data transfers as well as
general connection bookkeeping, so we can't use it for half-closed
connections.
This suggests that we should probably rework it into two or more
functions, in the long term, but for the moment being I'm just fixing
one obvious issue, which is the lack of fast retransmissions in the
TAP_FIN_RCVD path, and a potential one, which is the fact we don't
handle socket flush failures.
Add fast re-transmit for half-closed connections, and extract the
logic to determine the TCP payload length from tcp_data_from_tap()
into the new tcp_packet_data_len() helper to decrease a bit the amount
of resulting code duplication.
Handle the case of socket flush (tcp_sock_consume()) flush failure in
the same way as tcp_data_from_tap() handles it.
Signed-off-by: Stefano Brivio
On Wed, 20 Aug 2025 18:51:31 +0200
Stefano Brivio
Starting from Linux kernel commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks"), window limits are enforced more aggressively with a bigger amount of zero-window updates compared to what happened with e2142825c120 ("net: tcp: send zero-window ACK when no memory") alone, and occasional duplicate ACKs can now be seen also for local transfers with default (208 KiB) socket buffer sizes.
Paul reports that, with 6.17-rc1-ish kernels, Podman tests for the pasta integration occasionally fail on the "TCP/IPv4 large transfer, tap" case.
While playing with a reproducer that seems to be matching those failures:
while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 OPEN:/tmp/large.rcv,trunc & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done
and a kernel including that commit, I hit a few different failures, that should be fixed by this series.
Paul tested v1 of this series and found an additional failure (transfer timeout), which I could reproduce with a slightly different command:
while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 EXEC:./write.sh & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done
where write.sh is simply:
#!/bin/sh
cat > /tmp/large.rcv
so that the connection is not half-closed starting from the beginning, because socat can't make assumptions about the unidirectional nature of the traffic. This should now be fixed as well by the new version of patch 3/6.
v2: in 3/6, rewind sequence also if the zero-window update comes in the middle of a batch with non-zero window updates
Hmpf, never mind, this broke (on the second run): TCP/IPv4: ns to host (via tap): big transfer from the main tests (not Podman tests). I still need to look into it. -- Stefano
participants (1)
-
Stefano Brivio