On Fri, 3 Oct 2025 13:19:17 +1000
David Gibson
On Thu, Oct 02, 2025 at 01:58:41PM +0200, Stefano Brivio wrote:
On Thu, 2 Oct 2025 12:41:08 +1000 David Gibson
wrote: On Thu, Oct 02, 2025 at 02:06:43AM +0200, Stefano Brivio wrote:
If we reach end-of-file on a socket (or get EPOLLRDHUP / EPOLLHUP) and send a FIN segment to the guest / container acknowledging a sequence number that's behind what we received so far, we won't have any further trigger to send an updated ACK segment, as we are now switching the epoll socket monitoring to edge-triggered mode.
To avoid this situation, in tcp_update_seqack_wnd(), we set the next acknowledgement sequence to the current observed sequence, regardless of what was acknowledged socket-side.
To double check my understanding: things should work if we always acknowledged everything we've received. Acknowledging only what the peer has acked is a refinement to give the guest a view that's closer to what it would be end-to-end with the peer (which might improve the operation of flow control).
Right.
We can't use that refined mechanism when the socket is closing (amongst other cases), because while we can get the peer acked bytes from TCP_INFO, we can't get events when that changes, so we have no mechanism to provide updates to the guest at the right time. So we fall back to the simpler method.
Is that correct?
Also correct, yes. If you have a better idea to summarise this in the comment in tcp_buf_data_from_sock() let me know.
Hm, I might. Or actually a way to reorganise the code that I think will be a bit clearer and probably allow a clearer comment too.
I would keep reworks for a later moment. Right now, it's already taking me long enough to find a moment to investigate these issues, write these fixes, and test them.
Maybe I could mention EPOLLET explicitly there?
I don't think EPOLLET is actually relevant. Even if we had level triggered events, a change in bytes_acked doesn't count as an event (AFAIK).
It doesn't count, but with level-triggered events, we would be busy polling bytes_acked, as you noted. I was mentioning EPOLLET because it could be taken, intuitively, as a "stop listening for events" (almost) step. I'll leave that out then.
So either some other event is on, in which case we'd effectively be busy polling bytes_acked, or it's not in which case we don't get updates, just like now.
I principle we could implement some sort of timer based polling, but that sounds like way more trouble than it's worth.
We already have something similar, based on ACK_INTERVAL and ACK_TO_TAP_DUE, and it shouldn't be overly complicated to extend that to a new FIN_TO_TAP_DUE flag. But indeed beyond the scope of this series.
However, we don't necessarily call tcp_update_seqack_wnd() before sending the FIN segment, which might potentially lead to a situation, not observed in practice, where we unnecessarily cause a retransmission at some point after our FIN segment.
Avoid that by setting the ACK sequence to whatever we received from the container / guest, before sending a FIN segment and switching to EPOLLET.
Signed-off-by: Stefano Brivio
Based on my understanding above, this looks correct to me, so,
Reviewed-by: David Gibson
My only concern is whether we could instead insert an extra call to tcp_update_seqack_wnd() to reduce duplicated logic.
Hmm, maybe, but on the other hand we're closing the connection. Should we really spend time querying TCP_INFO to recalculate the window at this point? I wouldn't.
Good point. I mean tcp_update_seqack_wnd() could skip the TCP_INFO in that case, but that does look a bit fiddly.
On the other hand, in favour of not duplicating logic...
[snip]
@@ -368,7 +368,19 @@ int tcp_buf_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) conn_flag(c, conn, STALLED); } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) { - int ret = tcp_buf_send_flag(c, conn, FIN | ACK); + int ret; + + /* On TAP_FIN_SENT, we won't get further data events + * from the socket, and this might be the last ACK + * segment we send to the tap, so update its sequence to + * include everything we received until now. + * + * See also the special handling on CONN_IS_CLOSING() in + * tcp_update_seqack_wnd(). + */ + conn->seq_ack_to_tap = conn->seq_from_tap;
... the equivalent bits in tcp_update_seqack_wnd() have after them: if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap)) conn->seq_ack_to_tap = prev_ack_to_tap;
Don't we need that here as well, in case the guest is retransmitting when we get the sock side FIN?
Not really, because we don't rewind conn->seq_from_tap, so we don't risk jumping back here. In tcp_update_seqack_wnd(), we might jump back (that should be double checked eventually, I'm not sure it's still the case) if we happened to acknowledge more than acknowledged socket-side while handling some particular condition, and then we switch back to acknowledging only bytes_acked. It should happen if the destination is/was a low RTT one, but we run out of slots in low_rtt_dst in favour of other entries. I don't remember any other case.
+ + ret = tcp_buf_send_flag(c, conn, FIN | ACK); if (ret) { tcp_rst(c, conn); return ret; diff --git a/tcp_vu.c b/tcp_vu.c index ebd3a1e..3ec3538 100644 --- a/tcp_vu.c +++ b/tcp_vu.c @@ -410,7 +410,12 @@ int tcp_vu_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) conn_flag(c, conn, STALLED); } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) { - int ret = tcp_vu_send_flag(c, conn, FIN | ACK); + int ret; + + /* See tcp_buf_data_from_sock() */ + conn->seq_ack_to_tap = conn->seq_from_tap; + + ret = tcp_vu_send_flag(c, conn, FIN | ACK); if (ret) { tcp_rst(c, conn); return ret; -- 2.43.0
-- Stefano