[PATCH v2 0/9] tcp: Fix throughput issues with non-local peers
Patch 2/9 is the most relevant fix here, as we currently advertise a window that might be too big for what we can write to the socket, causing retransmissions right away and occasional high latency on short transfers to non-local peers.

Mostly as a consequence of fixing that, we now need several improvements and small fixes, including, most notably, an adaptive approach to pick the interval between checks for socket-side ACKs (patch 3/9), and several tricks to reliably trigger TCP buffer size auto-tuning as implemented by the Linux kernel (patches 5/9 and 7/9). These changes make some existing issues more relevant; those are fixed by the other patches.

With this series, I'm getting the expected (wirespeed) throughput for transfers between peers with varying non-local RTTs: I checked different guests bridged on the same machine (~500 us) and hosts at increasing distances, using iperf3, as well as HTTP transfers, though only for hosts I have control over (the 500 us and 5 ms cases). With increasing RTTs, I can finally see the throughput converging to the available bandwidth reasonably fast:

* 500 us RTT, ~10 Gbps available:

  [ ID] Interval        Transfer     Bitrate         Retr  Cwnd
  [ 5]   0.00-1.00 sec   785 MBytes  6.57 Gbits/sec   13   1.84 MBytes
  [ 5]   1.00-2.00 sec   790 MBytes  6.64 Gbits/sec    0   1.92 MBytes

* 5 ms, 1 Gbps:

  [ ID] Interval        Transfer     Bitrate         Retr  Cwnd
  [ 5]   0.00-1.00 sec  4.88 MBytes  40.9 Mbits/sec   22    240 KBytes
  [ 5]   1.00-2.00 sec  46.2 MBytes   388 Mbits/sec   34    900 KBytes
  [ 5]   2.00-3.00 sec   110 MBytes   923 Mbits/sec    0   1.11 MBytes

* 10 ms, 1 Gbps:

  [ ID] Interval        Transfer     Bitrate         Retr  Cwnd
  [ 5]   0.00-1.00 sec  67.9 MBytes   569 Mbits/sec    2    960 KBytes
  [ 5]   1.00-2.00 sec   110 MBytes   926 Mbits/sec    0   1.05 MBytes
  [ 5]   2.00-3.00 sec   111 MBytes   934 Mbits/sec    0   1.17 MBytes

* 24 ms, 1 Gbps:

  [ ID] Interval        Transfer     Bitrate         Retr  Cwnd
  [ 5]   0.00-1.00 sec  2.50 MBytes  20.9 Mbits/sec   16    240 KBytes
  [ 5]   1.00-2.00 sec  1.50 MBytes  12.6 Mbits/sec    9    120 KBytes
  [ 5]   2.00-3.00 sec  99.2 MBytes   832 Mbits/sec    4   2.40 MBytes
  [ 5]   3.00-4.00 sec   122 MBytes  1.03 Gbits/sec    1   3.16 MBytes
  [ 5]   4.00-5.00 sec   119 MBytes  1.00 Gbits/sec    0   4.16 MBytes

* 40 ms, ~600 Mbps:

  [ ID] Interval        Transfer     Bitrate         Retr  Cwnd
  [ 5]   0.00-1.00 sec  2.12 MBytes  17.8 Mbits/sec    0    600 KBytes
  [ 5]   1.00-2.00 sec  3.25 MBytes  27.3 Mbits/sec   14    420 KBytes
  [ 5]   2.00-3.00 sec  31.5 MBytes   264 Mbits/sec   11   1.29 MBytes
  [ 5]   3.00-4.00 sec  72.5 MBytes   608 Mbits/sec    0   1.46 MBytes

* 100 ms, 1 Gbps:

  [ ID] Interval        Transfer     Bitrate         Retr  Cwnd
  [ 5]   0.00-1.00 sec  4.88 MBytes  40.9 Mbits/sec    1    840 KBytes
  [ 5]   1.00-2.00 sec  1.62 MBytes  13.6 Mbits/sec    9    240 KBytes
  [ 5]   2.00-3.00 sec  5.25 MBytes  44.0 Mbits/sec    5    780 KBytes
  [ 5]   3.00-4.00 sec  9.75 MBytes  81.8 Mbits/sec    0   1.29 MBytes
  [ 5]   4.00-5.00 sec  15.8 MBytes   132 Mbits/sec    0   1.99 MBytes
  [ 5]   5.00-6.00 sec  22.9 MBytes   192 Mbits/sec    0   3.05 MBytes
  [ 5]   6.00-7.00 sec   132 MBytes  1.11 Gbits/sec    0   7.62 MBytes

* 114 ms, 1 Gbps:

  [ ID] Interval        Transfer     Bitrate         Retr  Cwnd
  [ 5]   0.00-1.00 sec  1.62 MBytes  13.6 Mbits/sec    8    420 KBytes
  [ 5]   1.00-2.00 sec  2.12 MBytes  17.8 Mbits/sec   15    120 KBytes
  [ 5]   2.00-3.00 sec  26.0 MBytes   218 Mbits/sec   50   1.82 MBytes
  [ 5]   3.00-4.00 sec   103 MBytes   865 Mbits/sec   31   5.10 MBytes
  [ 5]   4.00-5.00 sec   111 MBytes   930 Mbits/sec    0   5.92 MBytes

* 153 ms, 1 Gbps:

  [ ID] Interval        Transfer     Bitrate         Retr  Cwnd
  [ 5]   0.00-1.00 sec  1.12 MBytes  9.43 Mbits/sec    2    180 KBytes
  [ 5]   1.00-2.00 sec  2.12 MBytes  17.8 Mbits/sec   11    180 KBytes
  [ 5]   2.00-3.00 sec  12.6 MBytes   106 Mbits/sec   40   1.29 MBytes
  [ 5]   3.00-4.00 sec  44.5 MBytes   373 Mbits/sec   22   2.75 MBytes
  [ 5]   4.00-5.00 sec  86.0 MBytes   721 Mbits/sec    0   6.68 MBytes
  [ 5]   5.00-6.00 sec   120 MBytes  1.01 Gbits/sec  119   6.97 MBytes
  [ 5]   6.00-7.00 sec   110 MBytes   927 Mbits/sec    0   6.97 MBytes

* 186 ms, 1 Gbps:

  [ ID] Interval        Transfer     Bitrate         Retr  Cwnd
  [ 5]   0.00-1.00 sec  1.12 MBytes  9.43 Mbits/sec    3    180 KBytes
  [ 5]   1.00-2.00 sec   512 KBytes  4.19 Mbits/sec    4    120 KBytes
  [ 5]   2.00-3.00 sec  2.12 MBytes  17.8 Mbits/sec    6    360 KBytes
  [ 5]   3.00-4.00 sec  27.1 MBytes   228 Mbits/sec    6   1.11 MBytes
  [ 5]   4.00-5.00 sec  38.2 MBytes   321 Mbits/sec    0   1.99 MBytes
  [ 5]   5.00-6.00 sec  38.2 MBytes   321 Mbits/sec    0   2.46 MBytes
  [ 5]   6.00-7.00 sec  69.2 MBytes   581 Mbits/sec   71   3.63 MBytes
  [ 5]   7.00-8.00 sec   110 MBytes   919 Mbits/sec    0   5.92 MBytes

* 271 ms, 1 Gbps:

  [ ID] Interval        Transfer     Bitrate         Retr  Cwnd
  [ 5]   0.00-1.00 sec  1.12 MBytes  9.43 Mbits/sec    0    320 KBytes
  [ 5]   1.00-2.00 sec   0.00 Bytes  0.00 bits/sec     0    600 KBytes
  [ 5]   2.00-3.00 sec   512 KBytes  4.19 Mbits/sec    7    420 KBytes
  [ 5]   3.00-4.00 sec   896 KBytes  7.34 Mbits/sec    8   60.0 KBytes
  [ 5]   4.00-5.00 sec  2.62 MBytes  22.0 Mbits/sec   13    420 KBytes
  [ 5]   5.00-6.00 sec  12.1 MBytes   102 Mbits/sec    7   1020 KBytes
  [ 5]   6.00-7.00 sec  19.9 MBytes   167 Mbits/sec    0   1.82 MBytes
  [ 5]   7.00-8.00 sec  17.9 MBytes   150 Mbits/sec   44   1.76 MBytes
  [ 5]   8.00-9.00 sec  57.4 MBytes   481 Mbits/sec   30   2.70 MBytes
  [ 5]   9.00-10.00 sec 88.0 MBytes   738 Mbits/sec    0   6.45 MBytes

* 292 ms, ~600 Mbps:

  [ ID] Interval        Transfer     Bitrate         Retr  Cwnd
  [ 5]   0.00-1.00 sec  1.12 MBytes  9.43 Mbits/sec    0    450 KBytes
  [ 5]   1.00-2.00 sec   0.00 Bytes  0.00 bits/sec     3    180 KBytes
  [ 5]   2.00-3.00 sec   640 KBytes  5.24 Mbits/sec    4    120 KBytes
  [ 5]   3.00-4.00 sec   384 KBytes  3.15 Mbits/sec    2    120 KBytes
  [ 5]   4.00-5.00 sec  4.50 MBytes  37.7 Mbits/sec    1    660 KBytes
  [ 5]   5.00-6.00 sec  13.0 MBytes   109 Mbits/sec    3    960 KBytes
  [ 5]   6.00-7.00 sec  64.5 MBytes   541 Mbits/sec    0   2.17 MBytes

* 327 ms, 1 Gbps:

  [ ID] Interval        Transfer     Bitrate         Retr  Cwnd
  [ 5]   0.00-1.00 sec  1.12 MBytes  9.43 Mbits/sec    0    600 KBytes
  [ 5]   1.00-2.00 sec   0.00 Bytes  0.00 bits/sec     4    240 KBytes
  [ 5]   2.00-3.00 sec   768 KBytes  6.29 Mbits/sec    4    120 KBytes
  [ 5]   3.00-4.00 sec  1.62 MBytes  13.6 Mbits/sec    5    120 KBytes
  [ 5]   4.00-5.00 sec  1.88 MBytes  15.7 Mbits/sec    0    480 KBytes
  [ 5]   5.00-6.00 sec  17.6 MBytes   148 Mbits/sec   14   1.05 MBytes
  [ 5]   6.00-7.00 sec  35.1 MBytes   295 Mbits/sec    0   2.58 MBytes
  [ 5]   7.00-8.00 sec  45.2 MBytes   380 Mbits/sec    0   4.63 MBytes
  [ 5]   8.00-9.00 sec  27.0 MBytes   226 Mbits/sec   96   3.93 MBytes
  [ 5]   9.00-10.00 sec 85.9 MBytes   720 Mbits/sec   67   4.22 MBytes
  [ 5]  10.00-11.00 sec  118 MBytes   986 Mbits/sec    0   9.67 MBytes
  [ 5]  11.00-12.00 sec  124 MBytes  1.04 Gbits/sec    0   15.9 MBytes

For short transfers, we strictly stick to the available sending buffer size to (almost) make sure we avoid local retransmissions, and significantly decrease transfer time as a result: from 1.2 s to 60 ms for a 5 MB HTTP transfer from a container hosted in a virtual machine to another guest.
v2:
  - Add 1/9, factoring out a generic version of the scaling function we
    already use for tcp_get_sndbuf(), as I'm now using it in 7/9 as well
  - in 3/9, use 4 bits instead of 3 to represent the RTT, which is
    important as we now use RTT values for more than just the ACK checks
  - in 5/9, instead of just comparing the sending buffer to SNDBUF_BIG to
    decide when to acknowledge data, use an adaptive approach based on
    the bandwidth-delay product
  - in 6/9, clarify the relationship between SWS avoidance and Nagle's
    algorithm, and introduce a reference to RFC 813, Section 4
  - in 7/9, use an adaptive approach based on the product of bytes sent
    (and acknowledged) so far and RTT, instead of the previous approach
    based on bytes sent only, as it allows us to converge to the expected
    throughput much quicker with high RTT destinations

Stefano Brivio (9):
  tcp, util: Add function for scaling to linearly interpolated factor, use it
  tcp: Limit advertised window to available, not total sending buffer size
  tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
  tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window
  tcp: Acknowledge everything if it looks like bulk traffic, not interactive
  tcp: Don't limit window to less-than-MSS values, use zero instead
  tcp: Allow exceeding the available sending buffer size in window advertisements
  tcp: Send a duplicate ACK also on complete sendmsg() failure
  tcp: Skip redundant ACK on partial sendmsg() failure

 README.md  |   2 +-
 tcp.c      | 168 ++++++++++++++++++++++++++++++++++++++++++-----------
 tcp_conn.h |   9 +++
 util.c     |  52 +++++++++++++++++
 util.h     |   2 +
 5 files changed, 197 insertions(+), 36 deletions(-)

-- 
2.43.0
Right now, the only need for this kind of function comes from
tcp_get_sndbuf(), which calculates the amount of sending buffer we
want to use depending on its own size: we want to use more of it
if it's smaller, as bookkeeping overhead is usually lower and we rely
on auto-tuning there, and use less of it when it's bigger.
For this purpose, the new function is overly generic and its name is
a mouthful: @x is the same as @y, that is, we want to use more or
less of the buffer depending on the size of the buffer itself.
However, an upcoming change will need that generality, as we'll want
to scale the amount of sending buffer we use depending on another
(scaled) factor.
While at it, now that we have this new function, which makes it simple
to specify a precise usage factor, change the amount of sending buffer
we want to use at and above 4 MiB: 75% looks perfectly safe.
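In short, the helper has this shape (the full version, with its
documentation, is in the diff):

	long scale_x_to_y_slope(long x, long y, long lo, long hi, long f)
	{
		if (y < lo)
			return x;		/* flat segment: 100% of @x */

		if (y > hi)
			return x * f / 100;	/* flat segment: f% of @x */

		/* In between: factor interpolated linearly from 100% to f% */
		return x - (x * (y - lo) / (hi - lo)) * (100 - f) / 100;
	}

so that tcp_get_sndbuf() reduces to a single call, with @x and @y both
set to the reported buffer size:

	v = scale_x_to_y_slope(sndbuf, sndbuf, SNDBUF_SMALL, SNDBUF_BIG, 75);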
Signed-off-by: Stefano Brivio
For non-local connections, we advertise the same window size as what
the peer in turn advertises to us, and limit it to the buffer size
reported via SO_SNDBUF.
That's not quite correct: in order to later avoid failures while
queueing data to the socket, we need to limit the window to the
available buffer size, not the total one.
Use the SIOCOUTQ ioctl and subtract the number of outbound queued
bytes from the total buffer size, then clamp to this value.
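A minimal sketch of the idea (the helper name and the fallback on
ioctl() failure are illustrative here, not the actual patch code):

	#include <sys/ioctl.h>
	#include <linux/sockios.h>

	/* Return the sending buffer space we can still fill without
	 * queueing failures: total size minus bytes not yet sent or
	 * acknowledged, as reported by SIOCOUTQ.
	 */
	static int sndbuf_available(int s, int sndbuf_total)
	{
		int queued = 0;

		if (ioctl(s, SIOCOUTQ, &queued))
			return sndbuf_total;	/* assume all available */

		return queued < sndbuf_total ? sndbuf_total - queued : 0;
	}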
Signed-off-by: Stefano Brivio
A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
the previous change, because we would generally cause retransmissions
for non-local outbound transfers with relatively high (> 100 Mbps)
bandwidth and non-local but low (< 5 ms) RTT.
Now that retransmissions are less frequent, we don't have a proper
trigger to check for acknowledged bytes on the socket, and will
generally block the sender for a significant amount of time when we
could instead acknowledge more data.
Store the RTT reported by the kernel using an approximation (exponent),
to keep flow storage size within two (typical) cachelines. Check for
socket updates when half of this time elapses: it should be a good
indication of the one-way delay we're interested in (peer to us).
Representable values are between 100 us and 3.2768 s, and any value
outside this range is clamped to these bounds. This choice appears
to be a good trade-off between additional overhead and throughput.
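In other words, with the macros this patch adds to tcp_conn.h:

	RTT_SET(): conn->rtt_exp = MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MIN)))
	RTT_GET(): RTT_STORE_MIN << conn->rtt_exp

and, as a worked example with made-up numbers: a reported RTT of 25 ms
stores ilog2(25000 / 100) = 7 in the four available bits, reads back as
100 us << 7 = 12.8 ms, and the ACK check timer then fires after half of
that, about 6.4 ms.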
This mechanism partially overlaps with the "low RTT" destinations,
which we use to infer that a socket is connected to an endpoint on the
same machine (while possibly in a different namespace) if the RTT is
reported as 10 us or less.
This change doesn't, however, conflict with it: we are reading
TCP_INFO parameters for local connections anyway, so we can always
store the RTT approximation opportunistically.
Then, if the RTT is "low", we don't really need a timer to
acknowledge data as we'll always acknowledge everything to the
sender right away. However, we have limited space in the array where
we store addresses of local destinations, so the low RTT property of a
connection might toggle frequently. Because of this, it's actually
helpful to always have the RTT approximation stored.
This could probably benefit from a future rework, though, introducing
a more integrated approach between these two mechanisms.
Signed-off-by: Stefano Brivio
We correctly avoid doing that at the beginning of tcp_prepare_flags(),
but we might clear the flag later on if we actually end up sending a
"flag" segment.
Make sure we don't, otherwise we might significantly delay window
updates after a zero-window condition, hurting throughput. In some
cases, we're forcing peers to send zero-window probes or keep-alive
segments.
Signed-off-by: Stefano Brivio
...instead of checking if the current sending buffer is less than
SNDBUF_SMALL, because this isn't simply an optimisation to coalesce
ACK segments: we rely on having enough data at once from the sender
to make the buffer grow by means of TCP buffer size tuning
implemented in the Linux kernel.
This is important if we're trying to maximise throughput, but not
desirable for interactive traffic, where we want to be as transparent
as possible and avoid introducing unnecessary latency.
Use the tcpi_delivery_rate field reported by the Linux kernel, if
available, to calculate the current bandwidth-delay product: if it's
significantly smaller than the available sending buffer, conclude that
we're not bandwidth-bound and this is likely to be interactive
traffic, so acknowledge data only as it's acknowledged by the peer.
Conversely, if the bandwidth-delay product is comparable to the size
of the sending buffer (more than 5%), we're probably bandwidth-bound
or... bound to be: acknowledge everything in that case.
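Condensed into a sketch (the helper wrapping is illustrative; the
fields and the factor of 20 are the ones the patch actually uses):

	/* Traffic looks interactive if the sending buffer dwarfs the
	 * bandwidth-delay product: tcpi_delivery_rate is in B/s and
	 * tcpi_rtt in us, so dividing their product by 10^6 gives bytes.
	 */
	static bool looks_interactive(unsigned sndbuf,
				      const struct tcp_info_linux *tinfo)
	{
		unsigned long long bdp;

		bdp = (unsigned long long)tinfo->tcpi_delivery_rate *
		      tinfo->tcpi_rtt / 1000 / 1000;

		return sndbuf > bdp * SNDBUF_TO_BW_DELAY_INTERACTIVE;
	}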
Signed-off-by: Stefano Brivio
If the sender uses data clumping (including Nagle's algorithm) for
Silly Window Syndrome (SWS) avoidance, advertising less than one MSS
means the sender might stop sending altogether, and window updates
after a low-window condition are just as important as they are after
a zero-window condition.
For simplicity, approximate that limit to zero, as we have an
implementation forcing window updates after zero-sized windows.
This matches the suggestion from RFC 813, section 4.
Signed-off-by: Stefano Brivio
If the remote peer is advertising a bigger value than our current
sending buffer, it means that a bigger sending buffer is likely to
benefit throughput.
We can get a bigger sending buffer by means of the buffer size
auto-tuning performed by the Linux kernel, which is triggered by
aggressively filling the sending buffer.
Use an adaptive boost factor, up to 150%, depending on:
- how much data we sent so far: we don't want to risk retransmissions
for short-lived connections, as the latency cost would be
unacceptable, and
- the current RTT value, as we need a bigger buffer for higher
transmission delays
The factor we use is not quite a bandwidth-delay product, as we're
missing the time component of the bandwidth, which is not interesting
here: we are trying to make the buffer grow at the beginning of a
connection, progressively, as more data is sent.
The tuning of the amount of boost factor we want to apply was done
somewhat empirically, but it appears to yield the available throughput
in rather different scenarios (from ~10 Gbps bandwidth at 500 us RTT to
~1 Gbps at 300 ms RTT), and it allows getting there rather quickly,
within a few seconds for the 300 ms case.
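As a worked example, using the SNDBUF_BOOST_BYTES_RTT_{LO,HI} thresholds
from this patch (2500 and 6000 B*s): 600 kB acknowledged on a 10 ms RTT
path gives 600000 * 0.01 = 6000 B*s, so the full 150% boost applies;
250 kB acknowledged gives 2500 B*s, so no boost is applied yet; values
in between get a factor interpolated linearly between 100% and 150%.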
Note that we want to apply this factor only if the window advertised
by the peer is bigger than the current sending buffer, as we only need
this for auto-tuning, and we absolutely don't want to incur
unnecessary retransmissions otherwise.
The related condition in tcp_update_seqack_wnd() is not redundant, as
there's a subtractive factor, sendq, in the calculation of the window
limit. If the sending buffer is smaller than the peer's advertised
window, the additional limit we apply might be lower than the one we
would use otherwise.
Assuming that the sending buffer is reported as 100k and sendq is 20k,
we could have these example cases:
1. tinfo->tcpi_snd_wnd is 120k, which is bigger than the sending
buffer, so we boost its size to 150k, and we limit the window
to 120k
2. tinfo->tcpi_snd_wnd is 90k, which is smaller than the sending
buffer, so we aren't trying to trigger buffer auto-tuning and
we'll stick to the existing, more conservative calculation,
by limiting the window to 100 - 20 = 80k
If we omitted the new condition, we would always use the boosted
value, that is, 120k, even if potentially causing unnecessary
retransmissions.
Signed-off-by: Stefano Brivio
...in order to trigger a fast retransmit as soon as possible. There's
no benefit in forcing the sender to wait for a longer time than that.
We already do this on partial failures (short socket writes) but, for
historical reasons, not on complete failures. Make these two cases
consistent with each other.
Signed-off-by: Stefano Brivio
...we'll send a duplicate ACK right away in this case, and this
redundant, earlier check is not just useless, but it might actually
be harmful as we'll now send a triple ACK which might cause two
retransmissions.
Signed-off-by: Stefano Brivio
On Mon, Dec 08, 2025 at 01:22:13AM +0100, Stefano Brivio wrote:
...instead of checking if the current sending buffer is less than SNDBUF_SMALL, because this isn't simply an optimisation to coalesce ACK segments: we rely on having enough data at once from the sender to make the buffer grow by means of TCP buffer size tuning implemented in the Linux kernel.
This is important if we're trying to maximise throughput, but not desirable for interactive traffic, where we want to be as transparent as possible and avoid introducing unnecessary latency.
Use the tcpi_delivery_rate field reported by the Linux kernel, if available, to calculate the current bandwidth-delay product: if it's significantly smaller than the available sending buffer, conclude that we're not bandwidth-bound and this is likely to be interactive traffic, so acknowledge data only as it's acknowledged by the peer.
Conversely, if the bandwidth-delay product is comparable to the size of the sending buffer (more than 5%), we're probably bandwidth-bound or... bound to be: acknowledge everything in that case.
Ah, nice. This reasoning is much clearer to me than the previous spin.
Signed-off-by: Stefano Brivio
---
 tcp.c | 45 +++++++++++++++++++++++++++++++++------------
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/tcp.c b/tcp.c
index 9bf7b8b..533c8a7 100644
--- a/tcp.c
+++ b/tcp.c
@@ -353,6 +353,9 @@ enum {
 #define LOW_RTT_TABLE_SIZE	8
 #define LOW_RTT_THRESHOLD	10 /* us */

+/* Ratio of buffer to bandwidth * delay product implying interactive traffic */
+#define SNDBUF_TO_BW_DELAY_INTERACTIVE	/* > */ 20 /* (i.e. < 5% of buffer) */
+
 #define ACK_IF_NEEDED	0	/* See tcp_send_flag() */

 #define CONN_IS_CLOSING(conn)						\
@@ -426,11 +429,13 @@ socklen_t tcp_info_size;
	 sizeof(((struct tcp_info_linux *)NULL)->tcpi_##f_)) <= tcp_info_size)

 /* Kernel reports sending window in TCP_INFO (kernel commit 8f7baad7f035) */
-#define snd_wnd_cap	tcp_info_cap(snd_wnd)
+#define snd_wnd_cap		tcp_info_cap(snd_wnd)
 /* Kernel reports bytes acked in TCP_INFO (kernel commit 0df48c26d84) */
-#define bytes_acked_cap	tcp_info_cap(bytes_acked)
+#define bytes_acked_cap		tcp_info_cap(bytes_acked)
 /* Kernel reports minimum RTT in TCP_INFO (kernel commit cd9b266095f4) */
-#define min_rtt_cap	tcp_info_cap(min_rtt)
+#define min_rtt_cap		tcp_info_cap(min_rtt)
+/* Kernel reports delivery rate in TCP_INFO (kernel commit eb8329e0a04d) */
+#define delivery_rate_cap	tcp_info_cap(delivery_rate)

 /* sendmsg() to socket */
 static struct iovec	tcp_iov		[UIO_MAXIOV];
@@ -1048,6 +1053,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 	socklen_t sl = sizeof(*tinfo);
 	struct tcp_info_linux tinfo_new;
 	uint32_t new_wnd_to_tap = prev_wnd_to_tap;
+	bool ack_everything = true;
 	int s = conn->sock;

 	/* At this point we could ack all the data we've accepted for forwarding
@@ -1057,7 +1063,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
	 * control behaviour.
	 *
	 * For it to be possible and worth it we need:
-	 * - The TCP_INFO Linux extension which gives us the peer acked bytes
+	 * - The TCP_INFO Linux extensions which give us the peer acked bytes
+	 *   and the delivery rate (outbound bandwidth at receiver)
	 * - Not to be told not to (force_seq)
	 * - Not half-closed in the peer->guest direction
	 *   With no data coming from the peer, we might not get events which
@@ -1067,19 +1074,36 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
	 *   Data goes from socket to socket, with nothing meaningfully "in
	 *   flight".
	 * - Not a pseudo-local connection (e.g. to a VM on the same host)
-	 * - Large enough send buffer
-	 *   In these cases, there's not enough in flight to bother.
+	 *   If it is, there's not enough in flight to bother.
+	 * - Sending buffer significantly larger than bandwidth * delay product
+	 *   Meaning we're not bandwidth-bound and this is likely to be
+	 *   interactive traffic where we want to preserve transparent
+	 *   connection behaviour and latency.
Do we actually want the sending buffer size here? Or the amount of buffer that's actually in use (SIOCOUTQ)? If we had a burst transfer followed by interactive traffic, the kernel could still have a large send buffer allocated, no?
+	 *
+	 * Otherwise, we probably want to maximise throughput, which needs
+	 * sending buffer auto-tuning, triggered in turn by filling up the
+	 * outbound socket queue.
	 */
-	if (bytes_acked_cap && !force_seq &&
+	if (bytes_acked_cap && delivery_rate_cap && !force_seq &&
	    !CONN_IS_CLOSING(conn) &&
-	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
-	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
+	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn)) {
		if (!tinfo) {
			tinfo = &tinfo_new;
			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
				return 0;
		}

+		if ((unsigned)SNDBUF_GET(conn) > (long long)RTT_GET(conn) *
Using RTT_GET seems odd here, since we just got a more up to date and precise RTT estimate in tinfo.
+						 tinfo->tcpi_delivery_rate /
+						 1000 / 1000 *
+						 SNDBUF_TO_BW_DELAY_INTERACTIVE)
+			ack_everything = false;
+	}
+
+	if (ack_everything) {
+		/* Fall back to acknowledging everything we got */
+		conn->seq_ack_to_tap = conn->seq_from_tap;
+	} else {
		/* This trips a cppcheck bug in some versions, including
		 * cppcheck 2.18.3.
		 * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/
@@ -1087,9 +1111,6 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
		/* cppcheck-suppress [uninitvar,unmatchedSuppression] */
		conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
				       conn->seq_init_from_tap;
-	} else {
-		/* Fall back to acknowledging everything we got */
-		conn->seq_ack_to_tap = conn->seq_from_tap;
	}

 	/* It's occasionally possible for us to go from using the fallback above

-- 
2.43.0
-- 
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.
http://www.ozlabs.org/~dgibson
On Mon, Dec 08, 2025 at 01:22:11AM +0100, Stefano Brivio wrote:
A fixed 10 ms ACK_INTERVAL timer value served us relatively well until the previous change, because we would generally cause retransmissions for non-local outbound transfers with relatively high (> 100 Mbps) bandwidth and non-local but low (< 5 ms) RTT.
Now that retransmissions are less frequent, we don't have a proper trigger to check for acknowledged bytes on the socket, and will generally block the sender for a significant amount of time while we could acknowledge more data, instead.
Store the RTT reported by the kernel using an approximation (exponent), to keep flow storage size within two (typical) cachelines. Check for socket updates when half of this time elapses: it should be a good indication of the one-way delay we're interested in (peer to us).
Representable values are between 100 us and 3.2768 s, and any value outside this range is clamped to these bounds. This choice appears to be a good trade-off between additional overhead and throughput.
This mechanism partially overlaps with the "low RTT" destinations, which we use to infer that a socket is connected to an endpoint on the same machine (while possibly in a different namespace) if the RTT is reported as 10 us or less.
This change doesn't, however, conflict with it: we are reading TCP_INFO parameters for local connections anyway, so we can always store the RTT approximation opportunistically.
Then, if the RTT is "low", we don't really need a timer to acknowledge data as we'll always acknowledge everything to the sender right away. However, we have limited space in the array where we store addresses of local destinations, so the low RTT property of a connection might toggle frequently. Because of this, it's actually helpful to always have the RTT approximation stored.
This could probably benefit from a future rework, though, introducing a more integrated approach between these two mechanisms.
Signed-off-by: Stefano Brivio
---
 tcp.c      | 28 +++++++++++++++++++++-------
 tcp_conn.h |  9 +++++++++
 util.c     | 14 ++++++++++++++
 util.h     |  1 +
 4 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/tcp.c b/tcp.c
index 951f434..8eeef4c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -202,9 +202,13 @@
  * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
  *   either side, the connection is reset
  *
- * - ACK_INTERVAL elapsed after data segment received from tap without having
+ * - RTT / 2 elapsed after data segment received from tap without having
  *   sent an ACK segment, or zero-sized window advertised to tap/guest (flag
- *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
+ *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
+ *
+ * RTT, here, is an approximation of the RTT value reported by the kernel via
+ * TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
+ * RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly.
  *
  *
  * Summary of data flows (with ESTABLISHED event)
@@ -341,7 +345,6 @@ enum {
 #define MSS_DEFAULT			536
 #define WINDOW_DEFAULT			14600	/* RFC 6928 */

-#define ACK_INTERVAL			10	/* ms */
 #define RTO_INIT			1	/* s, RFC 6298 */
 #define RTO_INIT_AFTER_SYN_RETRIES	3	/* s, RFC 6298 */
 #define FIN_TIMEOUT			60
@@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 	}

 	if (conn->flags & ACK_TO_TAP_DUE) {
-		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
+		it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000);
+		it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000;
 	} else if (conn->flags & ACK_FROM_TAP_DUE) {
 		int exp = conn->retries, timeout = RTO_INIT;

 		if (!(conn->events & ESTABLISHED))
@@ -608,9 +612,15 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 		it.it_value.tv_sec = ACT_TIMEOUT;
 	}

-	flow_dbg(conn, "timer expires in %llu.%03llus",
-		 (unsigned long long)it.it_value.tv_sec,
-		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
+	if (conn->flags & ACK_TO_TAP_DUE) {
+		flow_trace(conn, "timer expires in %lu.%01llums",
+			   (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
+			   (unsigned long long)it.it_value.tv_nsec / 1000);
This doesn't look right - you need a % to exclude the whole milliseconds here for the fractional part. Plus, it looks like this is trying to compute microseconds, which would be 3 digits after the '.' in ms, but the format string accommodates only one.
+	} else {
+		flow_dbg(conn, "timer expires in %llu.%03llus",
+			 (unsigned long long)it.it_value.tv_sec,
+			 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
+	}

 	if (timerfd_settime(conn->timer, 0, &it, NULL))
 		flow_perror(conn, "failed to set timer");
@@ -1144,6 +1154,10 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 			conn_flag(c, conn, ACK_TO_TAP_DUE);

 out:
+	/* Opportunistically store RTT approximation on valid TCP_INFO data */
+	if (tinfo)
+		RTT_SET(conn, tinfo->tcpi_rtt);
+
 	return new_wnd_to_tap != prev_wnd_to_tap ||
 	       conn->seq_ack_to_tap != prev_ack_to_tap;
 }

diff --git a/tcp_conn.h b/tcp_conn.h
index e36910c..9c6ff9e 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -49,6 +49,15 @@ struct tcp_tap_conn {
 #define MSS_SET(conn, mss)	(conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
 #define MSS_GET(conn)		(conn->tap_mss << (16 - TCP_MSS_BITS))

+#define RTT_EXP_BITS		4
+	unsigned int	rtt_exp		:RTT_EXP_BITS;
+#define RTT_EXP_MAX		MAX_FROM_BITS(RTT_EXP_BITS)
+#define RTT_STORE_MIN		100 /* us, minimum representable */
+#define RTT_STORE_MAX		((long)(RTT_STORE_MIN << RTT_EXP_MAX))
+#define RTT_SET(conn, rtt)						\
+	(conn->rtt_exp = MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MIN))))
+#define RTT_GET(conn)		(RTT_STORE_MIN << conn->rtt_exp)
+
 	int		sock		:FD_REF_BITS;

 	uint8_t		events;

diff --git a/util.c b/util.c
index ff0ba01..e78e10d 100644
--- a/util.c
+++ b/util.c
@@ -614,6 +614,9 @@ int __daemon(int pidfile_fd, int devnull_fd)
  * fls() - Find last (most significant) bit set in word
  * @x:		Word
  *
+ * Note: unlike ffs() and other implementations of fls(), notably the one from
+ * the Linux kernel, the starting position is 0 and not 1, that is, fls(1) = 0.
+ *
  * Return: position of most significant bit set, starting from 0, -1 if none
  */
 int fls(unsigned long x)
@@ -629,6 +632,17 @@ int fls(unsigned long x)
 	return y;
 }

+/**
+ * ilog2() - Integral part (floor) of binary logarithm (logarithm to the base 2)
+ * @x:		Argument
+ *
+ * Return: integral part of binary logarithm of @x, -1 if undefined (if @x is 0)
+ */
+int ilog2(unsigned long x)
+{
+	return fls(x);
+}
+
 /**
  * write_file() - Replace contents of file with a string
  * @path:	File to write

diff --git a/util.h b/util.h
index ec75453..5c205a5 100644
--- a/util.h
+++ b/util.h
@@ -233,6 +233,7 @@ int output_file_open(const char *path, int flags);
 void pidfile_write(int fd, pid_t pid);
 int __daemon(int pidfile_fd, int devnull_fd);
 int fls(unsigned long x);
+int ilog2(unsigned long x);
 int write_file(const char *path, const char *buf);
 intmax_t read_file_integer(const char *path, intmax_t fallback);
 int write_all_buf(int fd, const void *buf, size_t len);

-- 
2.43.0
-- 
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.
http://www.ozlabs.org/~dgibson
On Mon, Dec 08, 2025 at 01:22:09AM +0100, Stefano Brivio wrote:
Right now, the only need for this kind of function comes from tcp_get_sndbuf(), which calculates the amount of sending buffer we want to use depending on its own size: we want to use more of it if it's smaller, as bookkeeping overhead is usually lower and we rely on auto-tuning there, and use less of it when it's bigger.
For this purpose, the new function is overly generic and its name is a mouthful: @x is the same as @y, that is, we want to use more or less of the buffer depending on the size of the buffer itself.
However, an upcoming change will need that generality, as we'll want to scale the amount of sending buffer we use depending on another (scaled) factor.
While at it, now that we have this new function, which makes it simple to specify a precise usage factor, change the amount of sending buffer we want to use at and above 4 MiB: 75% looks perfectly safe.
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
---
 tcp.c  |  8 ++------
 util.c | 38 ++++++++++++++++++++++++++++++++++++++
 util.h |  1 +
 3 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/tcp.c b/tcp.c
index bb661ee..37012cc 100644
--- a/tcp.c
+++ b/tcp.c
@@ -773,7 +773,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 }

 /**
- * tcp_get_sndbuf() - Get, scale SO_SNDBUF between thresholds (1 to 0.5 usage)
+ * tcp_get_sndbuf() - Get, scale SO_SNDBUF between thresholds (1 to 0.75 usage)
I'd slightly prefer the change to 0.75 to be in a separate patch, just so it's easier to tell that the change to the helper function itself doesn't change behaviour here.
  * @conn:	Connection pointer
  */
 static void tcp_get_sndbuf(struct tcp_tap_conn *conn)
@@ -788,11 +788,7 @@ static void tcp_get_sndbuf(struct tcp_tap_conn *conn)
 		return;
 	}

-	v = sndbuf;
-	if (v >= SNDBUF_BIG)
-		v /= 2;
-	else if (v > SNDBUF_SMALL)
-		v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
+	v = scale_x_to_y_slope(sndbuf, sndbuf, SNDBUF_SMALL, SNDBUF_BIG, 75);

 	SNDBUF_SET(conn, MIN(INT_MAX, v));
 }

diff --git a/util.c b/util.c
index f32c9cb..ff0ba01 100644
--- a/util.c
+++ b/util.c
@@ -1223,3 +1223,41 @@ void fsync_pcap_and_log(void)
 	if (log_file != -1)
 		(void)fsync(log_file);
 }
+
+/**
+ * scale_x_to_y_slope() - Scale @x from 100% to f% depending on @y's value
Would "clamped_scale" work as a more descriptive name?
+ * @x:		Value to scale
+ * @y:		Value determining scaling
+ * @lo:		Lower bound for @y (start of y-axis slope)
+ * @hi:		Upper bound for @y (end of y-axis slope)
+ * @f:		Scaling factor, percent
Maybe worth clarifying that this can be less than or more than 100% - description below uses >100%, but the usage above is <100%.
+ *
+ * Return: @x scaled by @f * linear interpolation of @y between @lo and @hi
+ *
+ * In pictures:
+ *
+ *             f % ->           ,----    * If @y < lo (for example, @y is y0),
+ *                             /|   |      return @x
+ *                            / |   |
+ *                           /  |   |    * If @lo < @y < @hi (for example, @y
+ * (100 + f) / 2 % ->       /   |   |      is y1), return @x scaled by a factor
+ *                         /|   |   |      linearly interpolated between 100%
+ *                        / |   |   |      and f% depending on @y's position
+ *                       /  |   |   |      between @lo (100%) and @hi (f%)
+ *      100 % -> ------'    |   |   |
+ *                 |   |    |   |   |    * If @y > @hi (for example, @y is y2),
+ *                y0   lo   y1  hi  y2     return @x * @f / 100
+ *
+ * Example: @f = 150, @lo = 10, @hi = 20, @y = 15, @x = 1000
+ *          -> interpolated factor is 125%
+ *          -> return 1250
+ */
+long scale_x_to_y_slope(long x, long y, long lo, long hi, long f)
+{
+	if (y < lo)
+		return x;
+
+	if (y > hi)
+		return x * f / 100;
+
+	return x - (x * (y - lo) / (hi - lo)) * (100 - f) / 100;
There's a subtle tradeoff here. Dividing by (hi - lo) before multiplying by the factor loses some precision in the final result. On the other hand, doing all the multiplies first would increase the risk of an overflow.

A possible different way of organising this that _might_ be slightly easier to describe: rather than including a scaling factor, instead give upper and lower bounds for the output, so something like:

	long clamped_scale(long a, long b, long s, long sa, long sb)

=> returns a value between @a and @b, matching where @s lies between @sa and @sb.
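A sketch of what that alternative could look like (illustrative only,
not part of the series):

	/* Map @s from the [sa, sb] range onto [a, b], clamping outside */
	static long clamped_scale(long a, long b, long s, long sa, long sb)
	{
		if (s <= sa)
			return a;

		if (s >= sb)
			return b;

		return a + (b - a) * (s - sa) / (sb - sa);
	}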
+}

diff --git a/util.h b/util.h
index 17f5ae0..ec75453 100644
--- a/util.h
+++ b/util.h
@@ -242,6 +242,7 @@ int read_remainder(int fd, const struct iovec *iov, size_t cnt, size_t skip);
 void close_open_files(int argc, char **argv);
 bool snprintf_check(char *str, size_t size, const char *format, ...);
 void fsync_pcap_and_log(void);
+long scale_x_to_y_slope(long x, long y, long lo, long hi, long f);

 /**
  * af_name() - Return name of an address family

-- 
2.43.0
-- 
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.
http://www.ozlabs.org/~dgibson
On Mon, Dec 08, 2025 at 01:22:15AM +0100, Stefano Brivio wrote:
If the remote peer is advertising a bigger value than our current sending buffer, it means that a bigger sending buffer is likely to benefit throughput.
We can get a bigger sending buffer by means of the buffer size auto-tuning performed by the Linux kernel, which is triggered by aggressively filling the sending buffer.
Use an adaptive boost factor, up to 150%, depending on:
- how much data we sent so far: we don't want to risk retransmissions for short-lived connections, as the latency cost would be unacceptable, and
- the current RTT value, as we need a bigger buffer for higher transmission delays
The factor we use is not quite a bandwidth-delay product, as we're missing the time component of the bandwidth, which is not interesting here: we are trying to make the buffer grow at the beginning of a connection, progressively, as more data is sent.
The tuning of the amount of boost factor we want to apply was done somewhat empirically, but it appears to yield the available throughput in rather different scenarios (from ~10 Gbps bandwidth at 500 us RTT to ~1 Gbps at 300 ms RTT), and it allows getting there rather quickly, within a few seconds for the 300 ms case.
Note that we want to apply this factor only if the window advertised by the peer is bigger than the current sending buffer, as we only need this for auto-tuning, and we absolutely don't want to incur unnecessary retransmissions otherwise.
The related condition in tcp_update_seqack_wnd() is not redundant, as there's a subtractive factor, sendq, in the calculation of the window limit. If the sending buffer is smaller than the peer's advertised window, the additional limit we apply might be lower than the one we would use otherwise.
Assuming that the sending buffer is reported as 100k, sendq is 20k, we could have these example cases:
1. tinfo->tcpi_snd_wnd is 120k, which is bigger than the sending buffer, so we boost its size to 150k, and we limit the window to 120k
2. tinfo->tcpi_snd_wnd is 90k, which is smaller than the sending buffer, so we aren't trying to trigger buffer auto-tuning and we'll stick to the existing, more conservative calculation, by limiting the window to 100 - 20 = 80k
If we omitted the new condition, we would always use the boosted value, that is, 120k, even if potentially causing unnecessary retransmissions.
Signed-off-by: Stefano Brivio
---
 tcp.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/tcp.c b/tcp.c
index 3c046a5..60a9687 100644
--- a/tcp.c
+++ b/tcp.c
@@ -353,6 +353,13 @@ enum {
 #define LOW_RTT_TABLE_SIZE	8
 #define LOW_RTT_THRESHOLD	10 /* us */

+/* Parameters to temporarily exceed sending buffer to force TCP auto-tuning */
+#define SNDBUF_BOOST_BYTES_RTT_LO	2500	/* B * s: no boost until here */
+/* ...examples: 5 MB sent * 500 us RTT, 250 kB * 10 ms, 8 kB * 300 ms */
+#define SNDBUF_BOOST_FACTOR		150	/* % */
+#define SNDBUF_BOOST_BYTES_RTT_HI	6000	/* apply full boost factor */
+/* 12 MB sent * 500 us RTT, 600 kB * 10 ms, 20 kB * 300 ms */
+
 /* Ratio of buffer to bandwidth * delay product implying interactive traffic */
 #define SNDBUF_TO_BW_DELAY_INTERACTIVE	/* > */ 20 /* (i.e. < 5% of buffer) */

@@ -1033,6 +1040,35 @@ void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
 	tap_hdr_update(taph, MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN));
 }

+/**
+ * tcp_sndbuf_boost() - Calculate limit of sending buffer to force auto-tuning
+ * @conn:	Connection pointer
+ * @tinfo:	tcp_info from kernel, must be pre-fetched
+ *
+ * Return: increased sending buffer to use as a limit for advertised window
+ */
+static unsigned long tcp_sndbuf_boost(struct tcp_tap_conn *conn,
+				      struct tcp_info_linux *tinfo)
+{
+	unsigned long bytes_rtt_product;
+
+	if (!bytes_acked_cap)
+		return SNDBUF_GET(conn);
+
+	/* This is *not* a bandwidth-delay product, but it's somewhat related:
+	 * as we send more data (usually at the beginning of a connection), we
+	 * try to make the sending buffer progressively grow, with the RTT as a
+	 * factor (longer delay, bigger buffer needed).
+	 */
+	bytes_rtt_product = (long long)tinfo->tcpi_bytes_acked *
+			    tinfo->tcpi_rtt / 1000 / 1000;
I only half follow the reasoning in the commit message, but this doesn't seem quite right to me. Assuming the RTT is roughly fixed, as you'd expect, this will always trend to infinity for long-lived connections - regardless of whether they're high throughput or interactive. So, we'll always trend towards using 150% of the send buffer size.
+	return scale_x_to_y_slope(SNDBUF_GET(conn), bytes_rtt_product,
+				  SNDBUF_BOOST_BYTES_RTT_LO,
+				  SNDBUF_BOOST_BYTES_RTT_HI,
+				  SNDBUF_BOOST_FACTOR);
+}
+
 /**
  * tcp_update_seqack_wnd() - Update ACK sequence and window to guest/tap
  * @c:		Execution context
@@ -1152,6 +1188,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,

 		if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
 			limit = 0;
+		else if ((int)tinfo->tcpi_snd_wnd > SNDBUF_GET(conn))
+			limit = tcp_sndbuf_boost(conn, tinfo) - (int)sendq;
Now that 5/9 has pointed out the existence of tcpi_delivery_rate, would it make more sense to do:

	limit += tcpi_delivery_rate * rtt;

The idea being to allow the guest to send as much as the receiver can accommodate itself, plus as much as we can fit "in the air" between us and the peer.
 		else
 			limit = SNDBUF_GET(conn) - (int)sendq;

-- 
2.43.0
-- 
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.
http://www.ozlabs.org/~dgibson
On Mon, Dec 08, 2025 at 01:22:14AM +0100, Stefano Brivio wrote:
If the sender uses data clumping (including Nagle's algorithm) for Silly Window Syndrome (SWS) avoidance, advertising less than one MSS means the sender might stop sending altogether, and window updates after a low-window condition are just as important as they are after a zero-window condition.
For simplicity, approximate that limit to zero, as we have an implementation forcing window updates after zero-sized windows. This matches the suggestion from RFC 813, section 4.
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
Looking at this again, I'm worried that it might allow a pathological case here: unlikely to hit, but very bad if it did. Suppose we have:

1. A receiver that wants to consume its input in fixed largish (~64 KiB) records
2. The receiver has locked its SO_RCVBUF to that record length, or only slightly more
3. The receive buffer is near full - but not quite a full record's worth

The receiver doesn't consume anything, because it doesn't have a full record. Its rcvbuf is near full, so its kernel advertises only a small window. We approximate that to zero, so the sender can't send anything. So, the record never gets completed and we stall completely.

The solution would be to "time out" this limitation of the advertised window; I'm not sure how complicated that would be in practice.
---
 tcp.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/tcp.c b/tcp.c
index 533c8a7..3c046a5 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1155,6 +1155,23 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 		else
 			limit = SNDBUF_GET(conn) - (int)sendq;

+		/* If the sender uses mechanisms to prevent Silly Window
+		 * Syndrome (SWS, described in RFC 813 Section 3) it's critical
+		 * that, should the window ever become less than the MSS, we
+		 * advertise a new value once it increases again to be above it.
+		 *
+		 * The mechanism to avoid SWS in the kernel is, implicitly,
+		 * implemented by Nagle's algorithm (which was proposed after
+		 * RFC 813).
+		 *
+		 * To this end, for simplicity, approximate a window value below
+		 * the MSS to zero, as we already have mechanisms in place to
+		 * force updates after the window becomes zero. This matches the
+		 * suggestion from RFC 813, Section 4.
+		 */
+		if (limit < MSS_GET(conn))
+			limit = 0;
+
 		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
 	}

-- 
2.43.0
-- 
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.
http://www.ozlabs.org/~dgibson
On Mon, Dec 08, 2025 at 01:22:08AM +0100, Stefano Brivio wrote:
Patch 2/9 is the most relevant fix here, as we currently advertise a window that might be too big for what we can write to the socket, causing retransmissions right away and occasional high latency on short transfers to non-local peers.
Mostly as a consequence of fixing that, we now need several improvements and small fixes, including, most notably, an adaptive approach to pick the interval between checks for socket-side ACKs (patch 3/9), and several tricks to reliably trigger TCP buffer size auto-tuning as implemented by the Linux kernel (patches 5/9 and 7/9).
These changes make some existing issues more relevant; those are fixed by the other patches.
I've made a number of comments through the series. I think they do want some consideration. But given this works to address a real problem empirically, I'm fine for this to be merged as is, with more polishing later.

-- 
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.
http://www.ozlabs.org/~dgibson
On Mon, 8 Dec 2025 16:41:21 +1100, David Gibson wrote:
On Mon, Dec 08, 2025 at 01:22:11AM +0100, Stefano Brivio wrote:
A fixed 10 ms ACK_INTERVAL timer value served us relatively well until the previous change, because we would generally cause retransmissions for non-local outbound transfers with relatively high (> 100 Mbps) bandwidth and non-local but low (< 5 ms) RTT.
Now that retransmissions are less frequent, we don't have a proper trigger to check for acknowledged bytes on the socket, and will generally block the sender for a significant amount of time while we could acknowledge more data, instead.
Store the RTT reported by the kernel using an approximation (exponent), to keep flow storage size within two (typical) cachelines. Check for socket updates when half of this time elapses: it should be a good indication of the one-way delay we're interested in (peer to us).
Representable values are between 100 us and 3.2768 s, and any value outside this range is clamped to these bounds. This choice appears to be a good trade-off between additional overhead and throughput.
This mechanism partially overlaps with the "low RTT" destinations, which we use to infer that a socket is connected to an endpoint on the same machine (while possibly in a different namespace) if the RTT is reported as 10 us or less.
This change doesn't, however, conflict with it: we are reading TCP_INFO parameters for local connections anyway, so we can always store the RTT approximation opportunistically.
Then, if the RTT is "low", we don't really need a timer to acknowledge data as we'll always acknowledge everything to the sender right away. However, we have limited space in the array where we store addresses of local destinations, so the low RTT property of a connection might toggle frequently. Because of this, it's actually helpful to always have the RTT approximation stored.
This could probably benefit from a future rework, though, introducing a more integrated approach between these two mechanisms.
Signed-off-by: Stefano Brivio
---
 tcp.c      | 28 +++++++++++++++++++++-------
 tcp_conn.h |  9 +++++++++
 util.c     | 14 ++++++++++++++
 util.h     |  1 +
 4 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/tcp.c b/tcp.c
index 951f434..8eeef4c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -202,9 +202,13 @@
  * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
  *   either side, the connection is reset
  *
- * - ACK_INTERVAL elapsed after data segment received from tap without having
+ * - RTT / 2 elapsed after data segment received from tap without having
  *   sent an ACK segment, or zero-sized window advertised to tap/guest (flag
- *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
+ *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
+ *
+ * RTT, here, is an approximation of the RTT value reported by the kernel via
+ * TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
+ * RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly.
  *
  *
  * Summary of data flows (with ESTABLISHED event)
@@ -341,7 +345,6 @@ enum {
 #define MSS_DEFAULT			536
 #define WINDOW_DEFAULT			14600	/* RFC 6928 */

-#define ACK_INTERVAL			10	/* ms */
 #define RTO_INIT			1	/* s, RFC 6298 */
 #define RTO_INIT_AFTER_SYN_RETRIES	3	/* s, RFC 6298 */
 #define FIN_TIMEOUT			60
@@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 	}

 	if (conn->flags & ACK_TO_TAP_DUE) {
-		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
+		it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000);
+		it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000;
 	} else if (conn->flags & ACK_FROM_TAP_DUE) {
 		int exp = conn->retries, timeout = RTO_INIT;

 		if (!(conn->events & ESTABLISHED))
@@ -608,9 +612,15 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 		it.it_value.tv_sec = ACT_TIMEOUT;
 	}

-	flow_dbg(conn, "timer expires in %llu.%03llus",
-		 (unsigned long long)it.it_value.tv_sec,
-		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
+	if (conn->flags & ACK_TO_TAP_DUE) {
+		flow_trace(conn, "timer expires in %lu.%01llums",
+			   (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
+			   (unsigned long long)it.it_value.tv_nsec / 1000);
This doesn't look right - you need a % to exclude the whole milliseconds here for the fractional part.
Ah, oops, right, and on top of that this can be more than one second but I forgot to add it. Fixed in v3.
Plus, it looks like this is trying to compute microseconds, which would be 3 digits after the '.' in ms, but the format string accommodates only one.
That was intended: I wanted to show only the first digit of microseconds, given that the smallest values are hundreds of microseconds, but I changed it anyway given the possible confusion.

-- 
Stefano
On Mon, 8 Dec 2025 16:54:55 +1100, David Gibson wrote:
On Mon, Dec 08, 2025 at 01:22:13AM +0100, Stefano Brivio wrote:
...instead of checking if the current sending buffer is less than SNDBUF_SMALL, because this isn't simply an optimisation to coalesce ACK segments: we rely on having enough data at once from the sender to make the buffer grow by means of TCP buffer size tuning implemented in the Linux kernel.
This is important if we're trying to maximise throughput, but not desirable for interactive traffic, where we want to be as transparent as possible and avoid introducing unnecessary latency.
Use the tcpi_delivery_rate field reported by the Linux kernel, if available, to calculate the current bandwidth-delay product: if it's significantly smaller than the available sending buffer, conclude that we're not bandwidth-bound and this is likely to be interactive traffic, so acknowledge data only as it's acknowledged by the peer.
Conversely, if the bandwidth-delay product is comparable to the size of the sending buffer (more than 5%), we're probably bandwidth-bound or... bound to be: acknowledge everything in that case.
Ah, nice. This reasoning is much clearer to me than the previous spin.
Signed-off-by: Stefano Brivio
---
 tcp.c | 45 +++++++++++++++++++++++++++++++++------------
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/tcp.c b/tcp.c
index 9bf7b8b..533c8a7 100644
--- a/tcp.c
+++ b/tcp.c
@@ -353,6 +353,9 @@ enum {
 #define LOW_RTT_TABLE_SIZE	8
 #define LOW_RTT_THRESHOLD	10 /* us */

+/* Ratio of buffer to bandwidth * delay product implying interactive traffic */
+#define SNDBUF_TO_BW_DELAY_INTERACTIVE	/* > */ 20 /* (i.e. < 5% of buffer) */
+
 #define ACK_IF_NEEDED	0	/* See tcp_send_flag() */

 #define CONN_IS_CLOSING(conn)						\
@@ -426,11 +429,13 @@ socklen_t tcp_info_size;
	 sizeof(((struct tcp_info_linux *)NULL)->tcpi_##f_)) <= tcp_info_size)

 /* Kernel reports sending window in TCP_INFO (kernel commit 8f7baad7f035) */
-#define snd_wnd_cap	tcp_info_cap(snd_wnd)
+#define snd_wnd_cap		tcp_info_cap(snd_wnd)
 /* Kernel reports bytes acked in TCP_INFO (kernel commit 0df48c26d84) */
-#define bytes_acked_cap	tcp_info_cap(bytes_acked)
+#define bytes_acked_cap		tcp_info_cap(bytes_acked)
 /* Kernel reports minimum RTT in TCP_INFO (kernel commit cd9b266095f4) */
-#define min_rtt_cap	tcp_info_cap(min_rtt)
+#define min_rtt_cap		tcp_info_cap(min_rtt)
+/* Kernel reports delivery rate in TCP_INFO (kernel commit eb8329e0a04d) */
+#define delivery_rate_cap	tcp_info_cap(delivery_rate)

 /* sendmsg() to socket */
 static struct iovec	tcp_iov		[UIO_MAXIOV];
@@ -1048,6 +1053,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 	socklen_t sl = sizeof(*tinfo);
 	struct tcp_info_linux tinfo_new;
 	uint32_t new_wnd_to_tap = prev_wnd_to_tap;
+	bool ack_everything = true;
 	int s = conn->sock;

 	/* At this point we could ack all the data we've accepted for forwarding
@@ -1057,7 +1063,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
	 * control behaviour.
	 *
	 * For it to be possible and worth it we need:
-	 * - The TCP_INFO Linux extension which gives us the peer acked bytes
+	 * - The TCP_INFO Linux extensions which give us the peer acked bytes
+	 *   and the delivery rate (outbound bandwidth at receiver)
	 * - Not to be told not to (force_seq)
	 * - Not half-closed in the peer->guest direction
	 *   With no data coming from the peer, we might not get events which
@@ -1067,19 +1074,36 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
	 *   Data goes from socket to socket, with nothing meaningfully "in
	 *   flight".
	 * - Not a pseudo-local connection (e.g. to a VM on the same host)
-	 * - Large enough send buffer
-	 *   In these cases, there's not enough in flight to bother.
+	 *   If it is, there's not enough in flight to bother.
+	 * - Sending buffer significantly larger than bandwidth * delay product
+	 *   Meaning we're not bandwidth-bound and this is likely to be
+	 *   interactive traffic where we want to preserve transparent
+	 *   connection behaviour and latency.
Do we actually want the sending buffer size here? Or the amount of buffer that's actually in use (SIOCOUTQ)? If we had a burst transfer followed by interactive traffic, the kernel could still have a large send buffer allocated, no?
The kernel shrinks it rather fast, and if it's not fast enough, then it still looks like bulk traffic. I tried several metrics (including something based on the data just sent, which approximates SIOCOUTQ), but they are not as good as the current buffer size.
+	 *
+	 * Otherwise, we probably want to maximise throughput, which needs
+	 * sending buffer auto-tuning, triggered in turn by filling up the
+	 * outbound socket queue.
	 */
-	if (bytes_acked_cap && !force_seq &&
+	if (bytes_acked_cap && delivery_rate_cap && !force_seq &&
	    !CONN_IS_CLOSING(conn) &&
-	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
-	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
+	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn)) {
		if (!tinfo) {
			tinfo = &tinfo_new;
			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
				return 0;
		}

+		if ((unsigned)SNDBUF_GET(conn) > (long long)RTT_GET(conn) *
Using RTT_GET seems odd here, since we just got a more up to date and precise RTT estimate in tinfo.
Oops, right, fixed.

-- 
Stefano
On Mon, 8 Dec 2025 17:25:02 +1100, David Gibson wrote:
On Mon, Dec 08, 2025 at 01:22:15AM +0100, Stefano Brivio wrote:
If the remote peer is advertising a bigger value than our current sending buffer, it means that a bigger sending buffer is likely to benefit throughput.
We can get a bigger sending buffer by means of the buffer size auto-tuning performed by the Linux kernel, which is triggered by aggressively filling the sending buffer.
Use an adaptive boost factor, up to 150%, depending on:
- how much data we sent so far: we don't want to risk retransmissions for short-lived connections, as the latency cost would be unacceptable, and
- the current RTT value, as we need a bigger buffer for higher transmission delays
The factor we use is not quite a bandwidth-delay product, as we're missing the time component of the bandwidth, which is not interesting here: we are trying to make the buffer grow at the beginning of a connection, progressively, as more data is sent.
The tuning of the amount of boost factor we want to apply was done somewhat empirically, but it appears to yield the available throughput in rather different scenarios (from ~10 Gbps bandwidth at 500 us RTT to ~1 Gbps at 300 ms RTT), and it allows getting there rather quickly, within a few seconds for the 300 ms case.
Note that we want to apply this factor only if the window advertised by the peer is bigger than the current sending buffer, as we only need this for auto-tuning, and we absolutely don't want to incur unnecessary retransmissions otherwise.
The related condition in tcp_update_seqack_wnd() is not redundant, as there's a subtractive factor, sendq, in the calculation of the window limit. If the sending buffer is smaller than the peer's advertised window, the additional limit we apply might be lower than the one we would use otherwise.
Assuming that the sending buffer is reported as 100k, sendq is 20k, we could have these example cases:
1. tinfo->tcpi_snd_wnd is 120k, which is bigger than the sending buffer, so we boost its size to 150k, and we limit the window to 120k
2. tinfo->tcpi_snd_wnd is 90k, which is smaller than the sending buffer, so we aren't trying to trigger buffer auto-tuning and we'll stick to the existing, more conservative calculation, by limiting the window to 100 - 20 = 80k
If we omitted the new condition, we would always use the boosted value, that is, 120k, even if potentially causing unnecessary retransmissions.
Signed-off-by: Stefano Brivio
--- tcp.c | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/tcp.c b/tcp.c index 3c046a5..60a9687 100644 --- a/tcp.c +++ b/tcp.c @@ -353,6 +353,13 @@ enum { #define LOW_RTT_TABLE_SIZE 8 #define LOW_RTT_THRESHOLD 10 /* us */
+/* Parameters to temporarily exceed sending buffer to force TCP auto-tuning */ +#define SNDBUF_BOOST_BYTES_RTT_LO 2500 /* B * s: no boost until here */ +/* ...examples: 5 MB sent * 500 ns RTT, 250 kB * 10 ms, 8 kB * 300 ms */ +#define SNDBUF_BOOST_FACTOR 150 /* % */ +#define SNDBUF_BOOST_BYTES_RTT_HI 6000 /* apply full boost factor */ +/* 12 MB sent * 500 ns RTT, 600 kB * 10 ms, 20 kB * 300 ms */ + /* Ratio of buffer to bandwidth * delay product implying interactive traffic */ #define SNDBUF_TO_BW_DELAY_INTERACTIVE /* > */ 20 /* (i.e. < 5% of buffer) */
@@ -1033,6 +1040,35 @@ void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
 	tap_hdr_update(taph, MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN));
 }
+/**
+ * tcp_sndbuf_boost() - Calculate limit of sending buffer to force auto-tuning
+ * @conn:	Connection pointer
+ * @tinfo:	tcp_info from kernel, must be pre-fetched
+ *
+ * Return: increased sending buffer to use as a limit for advertised window
+ */
+static unsigned long tcp_sndbuf_boost(struct tcp_tap_conn *conn,
+				      struct tcp_info_linux *tinfo)
+{
+	unsigned long bytes_rtt_product;
+
+	if (!bytes_acked_cap)
+		return SNDBUF_GET(conn);
+
+	/* This is *not* a bandwidth-delay product, but it's somewhat related:
+	 * as we send more data (usually at the beginning of a connection), we
+	 * try to make the sending buffer progressively grow, with the RTT as a
+	 * factor (longer delay, bigger buffer needed).
+	 */
+	bytes_rtt_product = (long long)tinfo->tcpi_bytes_acked *
+			    tinfo->tcpi_rtt / 1000 / 1000;
I only half follow the reasoning in the commit message, but this doesn't seem quite right to me. Assuming the RTT is roughly fixed, as you'd expect, this will always trend to infinity for long-lived connections, regardless of whether they're high-throughput or interactive. So we'll always trend towards using 150% of the send buffer size.
Yes, that's intended: we want to keep that 150% in the unlikely case that we don't switch to having a buffer exceeding the peer's advertised window.
+	return scale_x_to_y_slope(SNDBUF_GET(conn), bytes_rtt_product,
+				  SNDBUF_BOOST_BYTES_RTT_LO,
+				  SNDBUF_BOOST_BYTES_RTT_HI,
+				  SNDBUF_BOOST_FACTOR);
+}
+
 /**
  * tcp_update_seqack_wnd() - Update ACK sequence and window to guest/tap
  * @c:		Execution context
@@ -1152,6 +1188,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 	if ((int)sendq > SNDBUF_GET(conn))	/* Due to memory pressure? */
 		limit = 0;
+	else if ((int)tinfo->tcpi_snd_wnd > SNDBUF_GET(conn))
+		limit = tcp_sndbuf_boost(conn, tinfo) - (int)sendq;
Now that 5/9 has pointed out the existence of tcpi_delivery_rate, would it make more sense to do: limit += tcpi_delivery_rate * rtt;
The idea being to allow the guest to send as much as the receiver can accommodate itself, plus as much as we can fit "in the air" between us and the peer.
I tried using the bandwidth-delay product (what you're suggesting to add), but it turned out that we clearly need to skip the time component of the bandwidth as we really need to use "data so far" rather than "data in flight". -- Stefano
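For reference, scale_x_to_y_slope() isn't shown in the quoted hunks (and, as noted elsewhere in the thread, it gets renamed later in the series); a plausible sketch only, assuming it linearly ramps the scaling factor from 100% at the low threshold to the full factor at the high one:

/* Plausible sketch, not the function from the series: scale @x by a factor
 * ramping linearly from 100% (at @y == @lo) to @pct (at @y == @hi), clamped
 * outside that range.
 */
static unsigned long scale_x_to_y_slope(unsigned long x, unsigned long y,
					unsigned long lo, unsigned long hi,
					unsigned long pct)
{
	if (y <= lo)
		return x;		/* no boost yet */

	if (y >= hi)
		return x * pct / 100;	/* full boost */

	/* Linear interpolation between 100% and @pct over [@lo, @hi] */
	return x * (100 + (pct - 100) * (y - lo) / (hi - lo)) / 100;
}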
On Mon, 8 Dec 2025 17:43:14 +1100
David Gibson
On Mon, Dec 08, 2025 at 01:22:14AM +0100, Stefano Brivio wrote:
If the sender uses data clumping (including Nagle's algorithm) for Silly Window Syndrome (SWS) avoidance, advertising less than an MSS means the sender might stop sending altogether, and window updates after a low-window condition are just as important as they are in a zero-window condition.
For simplicity, approximate that limit to zero, as we have an implementation forcing window updates after zero-sized windows. This matches the suggestion from RFC 813, section 4.
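A minimal sketch of that approximation (hypothetical helper and names, not a hunk from the patch):

/* Hypothetical helper, for illustration only: treat any sub-MSS window as
 * zero, so that the existing machinery forcing window updates after a
 * zero-sized window also covers the low-window case (RFC 813, section 4).
 */
static unsigned wnd_to_advertise(unsigned wnd, unsigned mss)
{
	if (wnd < mss)	/* an SWS-avoiding sender might stall here anyway */
		return 0;

	return wnd;
}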
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson

Looking at this again, I'm worried it might allow a pathological case here: unlikely to hit, but very bad if it did.
Suppose we have:
1. A receiver that wants to consume its input in fixed, largish (~64 kiB) records
2. The receiver has locked its SO_RCVBUF to that record length, or only slightly more
3. The receive buffer is near full, but not quite a full record's worth
The receiver doesn't consume anything, because it doesn't have a full record. Its rcvbuf is near full, so its kernel advertises only a small window. We approximate that to zero, so the sender can't send anything. So, the record never gets completed and we stall completely.
I don't think it can be a problem because the receiver shouldn't advertise less than a MSS in that case anyway, but I need to look up normative references for this. -- Stefano
On Mon, 8 Dec 2025 17:46:43 +1100
David Gibson
On Mon, Dec 08, 2025 at 01:22:08AM +0100, Stefano Brivio wrote:
Patch 2/9 is the most relevant fix here, as we currently advertise a window that might be too big for what we can write to the socket, causing retransmissions right away and occasional high latency on short transfers to non-local peers.
Mostly as a consequence of fixing that, we now need several improvements and small fixes, including, most notably, an adaptive approach to pick the interval between checks for socket-side ACKs (patch 3/9), and several tricks to reliably trigger TCP buffer size auto-tuning as implemented by the Linux kernel (patches 5/9 and 7/9).
These changes make some existing issues more relevant, fixed by the other patches.
I've made a number of comments through the series. I think they do want some consideration. But given this works to address a real problem empirically, I'm fine for this to be merged as is, with more polishing later.
I just posted v3 addressing a few items that actually look wrong / problematic to me. Note that I forgot to change the "scaling" function name later in the series (as posted), I'm fixing it up on merge. -- Stefano
On Mon, Dec 08, 2025 at 08:25:29AM +0100, Stefano Brivio wrote:
On Mon, 8 Dec 2025 16:54:55 +1100, David Gibson wrote:
On Mon, Dec 08, 2025 at 01:22:13AM +0100, Stefano Brivio wrote:
...instead of checking if the current sending buffer is less than SNDBUF_SMALL, because this isn't simply an optimisation to coalesce ACK segments: we rely on having enough data at once from the sender to make the buffer grow by means of TCP buffer size tuning implemented in the Linux kernel.
This is important if we're trying to maximise throughput, but not desirable for interactive traffic, where we want to be as transparent as possible and avoid introducing unnecessary latency.
Use the tcpi_delivery_rate field reported by the Linux kernel, if available, to calculate the current bandwidth-delay product: if it's significantly smaller than the available sending buffer, conclude that we're not bandwidth-bound and this is likely to be interactive traffic, so acknowledge data only as it's acknowledged by the peer.
Conversely, if the bandwidth-delay product is comparable to the size of the sending buffer (more than 5%), we're probably bandwidth-bound or... bound to be: acknowledge everything in that case.
Ah, nice. This reasoning is much clearer to me than the previous spin.
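As a sketch of that heuristic (hypothetical, self-contained helper; the actual condition lives in tcp_update_seqack_wnd(), as the hunks below show):

#include <stdbool.h>

/* From the hunk below */
#define SNDBUF_TO_BW_DELAY_INTERACTIVE	20	/* i.e. < 5% of buffer */

/* Hypothetical helper for illustration: tcpi_delivery_rate is in B/s and
 * tcpi_rtt in us, so rate * rtt / 10^6 is the bandwidth-delay product in
 * bytes. Traffic looks interactive if that product is less than 1/20th
 * (5%) of the sending buffer.
 */
static bool traffic_is_interactive(unsigned long long delivery_rate,
				   unsigned long long rtt_us,
				   unsigned long long sndbuf)
{
	unsigned long long bw_delay = delivery_rate * rtt_us / 1000 / 1000;

	return bw_delay * SNDBUF_TO_BW_DELAY_INTERACTIVE < sndbuf;
}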
Signed-off-by: Stefano Brivio
---
 tcp.c | 45 +++++++++++++++++++++++++++++++++------------
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/tcp.c b/tcp.c
index 9bf7b8b..533c8a7 100644
--- a/tcp.c
+++ b/tcp.c
@@ -353,6 +353,9 @@ enum {
 #define LOW_RTT_TABLE_SIZE	8
 #define LOW_RTT_THRESHOLD	10 /* us */
+/* Ratio of buffer to bandwidth * delay product implying interactive traffic */
+#define SNDBUF_TO_BW_DELAY_INTERACTIVE /* > */	20 /* (i.e. < 5% of buffer) */
+
 #define ACK_IF_NEEDED	0	/* See tcp_send_flag() */
 #define CONN_IS_CLOSING(conn)						\
@@ -426,11 +429,13 @@ socklen_t tcp_info_size;
 	sizeof(((struct tcp_info_linux *)NULL)->tcpi_##f_)) <= tcp_info_size)
 /* Kernel reports sending window in TCP_INFO (kernel commit 8f7baad7f035) */
-#define snd_wnd_cap	tcp_info_cap(snd_wnd)
+#define snd_wnd_cap		tcp_info_cap(snd_wnd)
 /* Kernel reports bytes acked in TCP_INFO (kernel commit 0df48c26d84) */
-#define bytes_acked_cap	tcp_info_cap(bytes_acked)
+#define bytes_acked_cap		tcp_info_cap(bytes_acked)
 /* Kernel reports minimum RTT in TCP_INFO (kernel commit cd9b266095f4) */
-#define min_rtt_cap	tcp_info_cap(min_rtt)
+#define min_rtt_cap		tcp_info_cap(min_rtt)
+/* Kernel reports delivery rate in TCP_INFO (kernel commit eb8329e0a04d) */
+#define delivery_rate_cap	tcp_info_cap(delivery_rate)
 /* sendmsg() to socket */
 static struct iovec	tcp_iov			[UIO_MAXIOV];
@@ -1048,6 +1053,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 	socklen_t sl = sizeof(*tinfo);
 	struct tcp_info_linux tinfo_new;
 	uint32_t new_wnd_to_tap = prev_wnd_to_tap;
+	bool ack_everything = true;
 	int s = conn->sock;
 	/* At this point we could ack all the data we've accepted for forwarding
@@ -1057,7 +1063,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 	 * control behaviour.
 	 *
 	 * For it to be possible and worth it we need:
-	 * - The TCP_INFO Linux extension which gives us the peer acked bytes
+	 * - The TCP_INFO Linux extensions which give us the peer acked bytes
+	 *   and the delivery rate (outbound bandwidth at receiver)
 	 * - Not to be told not to (force_seq)
 	 * - Not half-closed in the peer->guest direction
 	 *   With no data coming from the peer, we might not get events which
@@ -1067,19 +1074,36 @@
 	 *   Data goes from socket to socket, with nothing meaningfully "in
 	 *   flight".
 	 * - Not a pseudo-local connection (e.g. to a VM on the same host)
-	 * - Large enough send buffer
-	 *   In these cases, there's not enough in flight to bother.
+	 *   If it is, there's not enough in flight to bother.
+	 * - Sending buffer significantly larger than bandwidth * delay product
+	 *   Meaning we're not bandwidth-bound and this is likely to be
+	 *   interactive traffic where we want to preserve transparent
+	 *   connection behaviour and latency.
Do we actually want the sending buffer size here? Or the amount of buffer that's actually in use (SIOCOUTQ)? If we had a burst transfer followed by interactive traffic, the kernel could still have a large send buffer allocated, no?
The kernel shrinks it rather fast, and if it's not fast enough, then it still looks like bulk traffic. I tried several metrics (including something based on the data just sent, which approximates SIOCOUTQ); they are not as good as the current buffer size.
Ok. Thinking about this, I guess the kernel has had quite some time to tweak its heuristics here, so making indirect use of that experience would be a good idea.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
On Mon, Dec 08, 2025 at 08:22:12AM +0100, Stefano Brivio wrote:
On Mon, 8 Dec 2025 16:41:21 +1100, David Gibson wrote:
On Mon, Dec 08, 2025 at 01:22:11AM +0100, Stefano Brivio wrote:
[snip]
-	flow_dbg(conn, "timer expires in %llu.%03llus",
-		 (unsigned long long)it.it_value.tv_sec,
-		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
+	if (conn->flags & ACK_TO_TAP_DUE) {
+		flow_trace(conn, "timer expires in %lu.%01llums",
+			   (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
+			   (unsigned long long)it.it_value.tv_nsec / 1000);
This doesn't look right - you need a % to exclude the whole milliseconds here for the fractional part.
Ah, oops, right, and on top of that this can be more than one second but I forgot to add it. Fixed in v3.
Plus, it looks like this is trying to compute microseconds, which would be 3 digits after the "." in ms, but the format string accommodates only one.
That was intended, I wanted to show only the first digit of microseconds given that the smallest values are hundreds of microseconds, but changed anyway given the possible confusion.
One digit is fine, but then you need tv_nsec / 100000, rather than nsec / 1000.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
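To make the arithmetic being discussed concrete, a tiny standalone example (assuming the fixed format keeps whole milliseconds plus one fractional digit; the exact v3 format may differ):

#include <stdio.h>
#include <time.h>

int main(void)
{
	/* Example timer value: 1 s + 2.35 ms */
	struct timespec it = { .tv_sec = 1, .tv_nsec = 2350000 };
	unsigned long long ms, frac;

	/* Whole milliseconds, including whole seconds */
	ms = (unsigned long long)it.tv_sec * 1000 + it.tv_nsec / 1000000;
	/* One fractional digit (hundreds of us): drop whole ms with % first */
	frac = ((unsigned long long)it.tv_nsec % 1000000) / 100000;

	printf("timer expires in %llu.%01llums\n", ms, frac); /* 1002.3ms */
	return 0;
}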
participants (2)
- David Gibson
- Stefano Brivio