Re: [PATCH 1/4] tcp: Fix ACK sequence on FIN to tap

7 Oct 2025

On Fri, 3 Oct 2025 13:19:17 +1000
David Gibson  wrote:
...
On Thu, Oct 02, 2025 at 01:58:41PM +0200, Stefano Brivio wrote:
...
On Thu, 2 Oct 2025 12:41:08 +1000
David Gibson  wrote:
...
On Thu, Oct 02, 2025 at 02:06:43AM +0200, Stefano Brivio wrote:
...
If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and
send a FIN segment to the guest / container acknowledging a sequence
number that's behind what we received so far, we won't have any
further trigger to send an updated ACK segment, as we are now
switching the epoll socket monitoring to edge-triggered mode.
To avoid this situation, in tcp_update_seqack_wnd(), we set the next
acknowledgement sequence to the current observed sequence, regardless
of what was acknowledged socket-side.
To double check my understanding: things should work if we always
acknowledged everything we've received.  Acknowledging only what the
peer has acked is a refinement to give the guest a view that's closer
to what it would be end-to-end with the peer (which might improve the
operation of flow control).
Right.
...
We can't use that refined mechanism when the socket is closing
(amongst other cases), because while we can get the peer acked bytes
from TCP_INFO, we can't get events when that changes, so we have no
mechanism to provide updates to the guest at the right time.  So we
fall back to the simpler method.
Is that correct?
Also correct, yes. If you have a better idea to summarise this in the
comment in tcp_buf_data_from_sock() let me know.
Hm, I might.  Or actually a way to reorganise the code that I think
will be a bit clearer and probably allow a clearer comment too.
I would keep reworks for a later moment. Right now, it's already taking
me long enough to find a moment to investigate these issues, write these
fixes, and test them.
...
...
Maybe I could mention
EPOLLET explicitly there?
I don't think EPOLLET is actually relevant.  Even if we had level
triggered events, a change in bytes_acked doesn't count as an event
(AFAIK).
It doesn't count, but with level-triggered events, we would be busy
polling bytes_acked, as you noted. I was mentioning EPOLLET because it
could be taken, intuitively, as a "stop listening for events" (almost)
step. I'll leave that out then.
...
So either some other event is on, in which case we'd
effectively be busy polling bytes_acked, or it's not in which case we
don't get updates, just like now.
I principle we could implement some sort of timer based polling, but
that sounds like way more trouble than it's worth.
We already have something similar, based on ACK_INTERVAL and
ACK_TO_TAP_DUE, and it shouldn't be overly complicated to extend that
to a new FIN_TO_TAP_DUE flag. But indeed beyond the scope of this
series.
...
...
...
...
However, we don't necessarily call tcp_update_seqack_wnd() before
sending the FIN segment, which might potentially lead to a situation,
not observed in practice, where we unnecessarily cause a
retransmission at some point after our FIN segment.
Avoid that by setting the ACK sequence to whatever we received from
the container / guest, before sending a FIN segment and switching to
EPOLLET.
Signed-off-by: Stefano Brivio 
Based on my understanding above, this looks correct to me, so,
Reviewed-by: David Gibson 
My only concern is whether we could instead insert an extra call to
tcp_update_seqack_wnd() to reduce duplicated logic.
Hmm, maybe, but on the other hand we're closing the connection. Should
we really spend time querying TCP_INFO to recalculate the window at
this point? I wouldn't.
Good point.  I mean tcp_update_seqack_wnd() could skip the TCP_INFO in
that case, but that does look a bit fiddly.
On the other hand, in favour of not duplicating logic...
[snip]
...
...
...
@@ -368,7 +368,19 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
      	conn_flag(c, conn, STALLED);
      } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
      	   SOCK_FIN_RCVD) {
-			int ret = tcp_buf_send_flag(c, conn, FIN | ACK);
+			int ret;
+
+			/* On TAP_FIN_SENT, we won't get further data events
+			 * from the socket, and this might be the last ACK
+			 * segment we send to the tap, so update its sequence to
+			 * include everything we received until now.
+			 *
+			 * See also the special handling on CONN_IS_CLOSING() in
+			 * tcp_update_seqack_wnd().
+			 */
+			conn->seq_ack_to_tap = conn->seq_from_tap;
... the equivalent bits in tcp_update_seqack_wnd() have after them:
  if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
      conn->seq_ack_to_tap = prev_ack_to_tap;
Don't we need that here as well, in case the guest is retransmitting
when we get the sock side FIN?
Not really, because we don't rewind conn->seq_from_tap, so we don't
risk jumping back here.

In tcp_update_seqack_wnd(), we might jump back (that should be double
checked eventually, I'm not sure it's still the case) if we happened to
acknowledge more than acknowledged socket-side while handling some
particular condition, and then we switch back to acknowledging only
bytes_acked.

It should happen if the destination is/was a low RTT one, but we run
out of slots in low_rtt_dst in favour of other entries. I don't
remember any other case.
...
...
...
...
+
+			ret = tcp_buf_send_flag(c, conn, FIN | ACK);
      	if (ret) {
      		tcp_rst(c, conn);
      		return ret;

diff --git a/tcp_vu.c b/tcp_vu.c
index ebd3a1e..3ec3538 100644
--- a/tcp_vu.c
+++ b/tcp_vu.c
@@ -410,7 +410,12 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)
      	conn_flag(c, conn, STALLED);
      } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) ==
      	   SOCK_FIN_RCVD) {
-			int ret = tcp_vu_send_flag(c, conn, FIN | ACK);
+			int ret;
+
+			/* See tcp_buf_data_from_sock() */
+			conn->seq_ack_to_tap = conn->seq_from_tap;
+
+			ret = tcp_vu_send_flag(c, conn, FIN | ACK);
      	if (ret) {
      		tcp_rst(c, conn);
      		return ret;
-- 
2.43.0
-- 
Stefano