On Tue, 9 Dec 2025 16:10:36 +1100
David Gibson
On Mon, Dec 08, 2025 at 08:20:17AM +0100, Stefano Brivio wrote:
A fixed 10 ms ACK_INTERVAL timer value served us relatively well until the previous change, because we would generally cause retransmissions for non-local outbound transfers with relatively high (> 100 Mbps) bandwidth and non-local but low (< 5 ms) RTT.
Now that retransmissions are less frequent, we don't have a proper trigger to check for acknowledged bytes on the socket, and will generally block the sender for a significant amount of time while we could acknowledge more data, instead.
Store the RTT reported by the kernel using an approximation (exponent), to keep flow storage size within two (typical) cachelines. Check for socket updates when half of this time elapses: it should be a good indication of the one-way delay we're interested in (peer to us).
Representable values are between 100 us and 3.2768 s, and any value outside this range is clamped to these bounds. This choice appears to be a good trade-off between additional overhead and throughput.
This mechanism partially overlaps with the "low RTT" destinations, which we use to infer that a socket is connected to an endpoint to the same machine (while possibly in a different namespace) if the RTT is reported as 10 us or less.
This change doesn't, however, conflict with it: we are reading TCP_INFO parameters for local connections anyway, so we can always store the RTT approximation opportunistically.
Then, if the RTT is "low", we don't really need a timer to acknowledge data as we'll always acknowledge everything to the sender right away. However, we have limited space in the array where we store addresses of local destination, so the low RTT property of a connection might toggle frequently. Because of this, it's actually helpful to always have the RTT approximation stored.
This could probably benefit from a future rework, though, introducing a more integrated approach between these two mechanisms.
Signed-off-by: Stefano Brivio
--- tcp.c | 30 +++++++++++++++++++++++------- tcp_conn.h | 9 +++++++++ util.c | 14 ++++++++++++++ util.h | 1 + 4 files changed, 47 insertions(+), 7 deletions(-) diff --git a/tcp.c b/tcp.c index 28d3304..0167121 100644 --- a/tcp.c +++ b/tcp.c @@ -202,9 +202,13 @@ * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on * either side, the connection is reset * - * - ACK_INTERVAL elapsed after data segment received from tap without having + * - RTT / 2 elapsed after data segment received from tap without having * sent an ACK segment, or zero-sized window advertised to tap/guest (flag - * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent + * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent. + * + * RTT, here, is an approximation of the RTT value reported by the kernel via + * TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to + * RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly. * * * Summary of data flows (with ESTABLISHED event) @@ -341,7 +345,6 @@ enum { #define MSS_DEFAULT 536 #define WINDOW_DEFAULT 14600 /* RFC 6928 */
-#define ACK_INTERVAL 10 /* ms */ #define RTO_INIT 1 /* s, RFC 6298 */ #define RTO_INIT_AFTER_SYN_RETRIES 3 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 @@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) }
if (conn->flags & ACK_TO_TAP_DUE) { - it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; + it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000); + it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { int exp = conn->retries, timeout = RTO_INIT; if (!(conn->events & ESTABLISHED)) @@ -608,9 +612,17 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) it.it_value.tv_sec = ACT_TIMEOUT; }
- flow_dbg(conn, "timer expires in %llu.%03llus", - (unsigned long long)it.it_value.tv_sec, - (unsigned long long)it.it_value.tv_nsec / 1000 / 1000); + if (conn->flags & ACK_TO_TAP_DUE) { + flow_trace(conn, "timer expires in %llu.%03llums", + (unsigned long)it.it_value.tv_sec * 1000 + + (unsigned long long)it.it_value.tv_nsec % + (1000 * 1000), + (unsigned long long)it.it_value.tv_nsec / 1000);
This is the wrong way around. The ms part needs to be:
Ouch, I... half swapped them. Almost. This risks haunting us for a bit.
Luckily we have discrete values there so I could make this small table
for --trace logs from the broken implementation:
#include
tv_sec * 1000 + tv_nsec / 1000000 and the fractional (us) part: (tv_nsec / 1000) % 1000
(or if you did just want a single digit after the ., then: tv_nsec / 100000 % 10 )
I guess I'll display two digits instead, I never had a ~100 ns RTT in my tests so I didn't notice until now that the resulting timeout would be displayed as 0.0ms with one fractional digit. Thanks for the snippet, I'll post a patch soon (of course, feel free to do so meanwhile if you already have a local change...). -- Stefano