Re: [PATCH v3 04/10] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks

9 Dec 2025

On Tue, 9 Dec 2025 16:10:36 +1100
David Gibson  wrote:
...
On Mon, Dec 08, 2025 at 08:20:17AM +0100, Stefano Brivio wrote:
...
A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
the previous change, because we would generally cause retransmissions
for non-local outbound transfers with relatively high (> 100 Mbps)
bandwidth and non-local but low (< 5 ms) RTT.
Now that retransmissions are less frequent, we don't have a proper
trigger to check for acknowledged bytes on the socket, and will
generally block the sender for a significant amount of time while
we could acknowledge more data, instead.
Store the RTT reported by the kernel using an approximation (exponent),
to keep flow storage size within two (typical) cachelines. Check for
socket updates when half of this time elapses: it should be a good
indication of the one-way delay we're interested in (peer to us).
Representable values are between 100 us and 3.2768 s, and any value
outside this range is clamped to these bounds. This choice appears
to be a good trade-off between additional overhead and throughput.
This mechanism partially overlaps with the "low RTT" destinations,
which we use to infer that a socket is connected to an endpoint to
the same machine (while possibly in a different namespace) if the
RTT is reported as 10 us or less.
This change doesn't, however, conflict with it: we are reading
TCP_INFO parameters for local connections anyway, so we can always
store the RTT approximation opportunistically.
Then, if the RTT is "low", we don't really need a timer to
acknowledge data as we'll always acknowledge everything to the
sender right away. However, we have limited space in the array where
we store addresses of local destination, so the low RTT property of a
connection might toggle frequently. Because of this, it's actually
helpful to always have the RTT approximation stored.
This could probably benefit from a future rework, though, introducing
a more integrated approach between these two mechanisms.
Signed-off-by: Stefano Brivio 
---
 tcp.c      | 30 +++++++++++++++++++++++-------
 tcp_conn.h |  9 +++++++++
 util.c     | 14 ++++++++++++++
 util.h     |  1 +
 4 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/tcp.c b/tcp.c
index 28d3304..0167121 100644
--- a/tcp.c
+++ b/tcp.c
@@ -202,9 +202,13 @@
  * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
  *   either side, the connection is reset
  *
- * - ACK_INTERVAL elapsed after data segment received from tap without having
+ * - RTT / 2 elapsed after data segment received from tap without having
  *   sent an ACK segment, or zero-sized window advertised to tap/guest (flag
- *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
+ *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
+ *
+ *   RTT, here, is an approximation of the RTT value reported by the kernel via
+ *   TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
+ *   RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly.
  *
  *
  * Summary of data flows (with ESTABLISHED event)
@@ -341,7 +345,6 @@ enum {
 #define MSS_DEFAULT			536
 #define WINDOW_DEFAULT			14600		/* RFC 6928 */
-#define ACK_INTERVAL			10		/* ms */
 #define RTO_INIT			1		/* s, RFC 6298 */
 #define RTO_INIT_AFTER_SYN_RETRIES	3		/* s, RFC 6298 */
 #define FIN_TIMEOUT			60
@@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
  }
if (conn->flags & ACK_TO_TAP_DUE) {
-		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
+		it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000);
+		it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000;
  } else if (conn->flags & ACK_FROM_TAP_DUE) {
      int exp = conn->retries, timeout = RTO_INIT;
      if (!(conn->events & ESTABLISHED))
@@ -608,9 +612,17 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
      it.it_value.tv_sec = ACT_TIMEOUT;
  }
-	flow_dbg(conn, "timer expires in %llu.%03llus",
-		 (unsigned long long)it.it_value.tv_sec,
-		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
+	if (conn->flags & ACK_TO_TAP_DUE) {
+		flow_trace(conn, "timer expires in %llu.%03llums",
+			   (unsigned long)it.it_value.tv_sec * 1000 +
+			   (unsigned long long)it.it_value.tv_nsec %
+					       (1000 * 1000),
+			   (unsigned long long)it.it_value.tv_nsec / 1000);
This is the wrong way around.  The ms part needs to be:
Ouch, I... half swapped them. Almost. This risks haunting us for a bit.

Luckily we have discrete values there so I could make this small table
for --trace logs from the broken implementation:

#include 

int main()
{
	unsigned long rtt, n, s;

	printf("before fix\t\tafter fix\tRTT\n");
	for (rtt = 100; rtt < 100 << 16; rtt <<= 1) {
		s = rtt / 2 / (1000 * 1000);
		n = rtt / 2 % (1000 * 1000) * 1000;

		printf("%llu.%03llu\t\t%lu.%02lu\t\t%lu\n",
		       s * 1000 + n % (1000 * 1000), n / 1000,
		       s * 1000 + n / (1000 * 1000), (n / 10000) % 100,
		       rtt);
	}
}

before fix		after fix	RTT
50000.050		0.05		100
100000.100		0.10		200
200000.200		0.20		400
400000.400		0.40		800
800000.800		0.80		1600
600000.1600		1.60		3200
200000.3200		3.20		6400
400000.6400		6.40		12800
800000.12800		12.80		25600
600000.25600		25.60		51200
200000.51200		51.20		102400
400000.102400		102.40		204800
800000.204800		204.80		409600
600000.409600		409.60		819200
200000.819200		819.20		1638400
401000.638400		1638.40		3276800
...
tv_sec * 1000 + tv_nsec / 1000000
and the fractional (us) part:
  (tv_nsec / 1000) % 1000
(or if you did just want a single digit after the ., then:
  tv_nsec / 100000 % 10
)
I guess I'll display two digits instead, I never had a ~100 ns RTT in
my tests so I didn't notice until now that the resulting timeout would
be displayed as 0.0ms with one fractional digit.

Thanks for the snippet, I'll post a patch soon (of course, feel free to
do so meanwhile if you already have a local change...).

-- 
Stefano