[PATCH v3 00/10] tcp: Fix throughput issues with non-local peers
Patch 3/10 is the most relevant fix here: we currently advertise a window
that might be too big for what we can actually write to the socket, causing
immediate retransmissions and occasional high latency on short transfers to
non-local peers.

Mostly as a consequence of fixing that, we now need several improvements and
small fixes, including, most notably, an adaptive approach to pick the
interval between checks for socket-side ACKs (patch 4/10), and several tricks
to reliably trigger the TCP buffer size auto-tuning implemented by the Linux
kernel (patches 6/10 and 8/10). These changes make some existing issues more
relevant; those are fixed by the remaining patches.

With this series, I'm getting the expected (wirespeed) throughput for
transfers between peers with varying non-local RTTs: I checked different
guests bridged on the same machine (~500 us) and hosts at increasing
distances using iperf3, as well as HTTP transfers for some hosts I have
control over (the 500 us and 5 ms cases).

With increasing RTTs, I can finally see the throughput converging to the
available bandwidth reasonably fast:

* 500 us RTT, ~10 Gbps available:

  [ ID] Interval         Transfer     Bitrate          Retr  Cwnd
  [ 5]   0.00-1.00  sec   785 MBytes  6.57 Gbits/sec   13    1.84 MBytes
  [ 5]   1.00-2.00  sec   790 MBytes  6.64 Gbits/sec    0    1.92 MBytes

* 5 ms, 1 Gbps:

  [ ID] Interval         Transfer     Bitrate          Retr  Cwnd
  [ 5]   0.00-1.00  sec  4.88 MBytes  40.9 Mbits/sec   22     240 KBytes
  [ 5]   1.00-2.00  sec  46.2 MBytes   388 Mbits/sec   34     900 KBytes
  [ 5]   2.00-3.00  sec   110 MBytes   923 Mbits/sec    0    1.11 MBytes

* 10 ms, 1 Gbps:

  [ ID] Interval         Transfer     Bitrate          Retr  Cwnd
  [ 5]   0.00-1.00  sec  67.9 MBytes   569 Mbits/sec    2     960 KBytes
  [ 5]   1.00-2.00  sec   110 MBytes   926 Mbits/sec    0    1.05 MBytes
  [ 5]   2.00-3.00  sec   111 MBytes   934 Mbits/sec    0    1.17 MBytes

* 24 ms, 1 Gbps:

  [ ID] Interval         Transfer     Bitrate          Retr  Cwnd
  [ 5]   0.00-1.00  sec  2.50 MBytes  20.9 Mbits/sec   16     240 KBytes
  [ 5]   1.00-2.00  sec  1.50 MBytes  12.6 Mbits/sec    9     120 KBytes
  [ 5]   2.00-3.00  sec  99.2 MBytes   832 Mbits/sec    4    2.40 MBytes
  [ 5]   3.00-4.00  sec   122 MBytes  1.03 Gbits/sec    1    3.16 MBytes
  [ 5]   4.00-5.00  sec   119 MBytes  1.00 Gbits/sec    0    4.16 MBytes

* 40 ms, ~600 Mbps:

  [ ID] Interval         Transfer     Bitrate          Retr  Cwnd
  [ 5]   0.00-1.00  sec  2.12 MBytes  17.8 Mbits/sec    0     600 KBytes
  [ 5]   1.00-2.00  sec  3.25 MBytes  27.3 Mbits/sec   14     420 KBytes
  [ 5]   2.00-3.00  sec  31.5 MBytes   264 Mbits/sec   11    1.29 MBytes
  [ 5]   3.00-4.00  sec  72.5 MBytes   608 Mbits/sec    0    1.46 MBytes

* 100 ms, 1 Gbps:

  [ ID] Interval         Transfer     Bitrate          Retr  Cwnd
  [ 5]   0.00-1.00  sec  4.88 MBytes  40.9 Mbits/sec    1     840 KBytes
  [ 5]   1.00-2.00  sec  1.62 MBytes  13.6 Mbits/sec    9     240 KBytes
  [ 5]   2.00-3.00  sec  5.25 MBytes  44.0 Mbits/sec    5     780 KBytes
  [ 5]   3.00-4.00  sec  9.75 MBytes  81.8 Mbits/sec    0    1.29 MBytes
  [ 5]   4.00-5.00  sec  15.8 MBytes   132 Mbits/sec    0    1.99 MBytes
  [ 5]   5.00-6.00  sec  22.9 MBytes   192 Mbits/sec    0    3.05 MBytes
  [ 5]   6.00-7.00  sec   132 MBytes  1.11 Gbits/sec    0    7.62 MBytes

* 114 ms, 1 Gbps:

  [ ID] Interval         Transfer     Bitrate          Retr  Cwnd
  [ 5]   0.00-1.00  sec  1.62 MBytes  13.6 Mbits/sec    8     420 KBytes
  [ 5]   1.00-2.00  sec  2.12 MBytes  17.8 Mbits/sec   15     120 KBytes
  [ 5]   2.00-3.00  sec  26.0 MBytes   218 Mbits/sec   50    1.82 MBytes
  [ 5]   3.00-4.00  sec   103 MBytes   865 Mbits/sec   31    5.10 MBytes
  [ 5]   4.00-5.00  sec   111 MBytes   930 Mbits/sec    0    5.92 MBytes

* 153 ms, 1 Gbps:

  [ ID] Interval         Transfer     Bitrate          Retr  Cwnd
  [ 5]   0.00-1.00  sec  1.12 MBytes  9.43 Mbits/sec    2     180 KBytes
  [ 5]   1.00-2.00  sec  2.12 MBytes  17.8 Mbits/sec   11     180 KBytes
  [ 5]   2.00-3.00  sec  12.6 MBytes   106 Mbits/sec   40    1.29 MBytes
  [ 5]   3.00-4.00  sec  44.5 MBytes   373 Mbits/sec   22    2.75 MBytes
  [ 5]   4.00-5.00  sec  86.0 MBytes   721 Mbits/sec    0    6.68 MBytes
  [ 5]   5.00-6.00  sec   120 MBytes  1.01 Gbits/sec  119    6.97 MBytes
  [ 5]   6.00-7.00  sec   110 MBytes   927 Mbits/sec    0    6.97 MBytes

* 186 ms, 1 Gbps:

  [ ID] Interval         Transfer     Bitrate          Retr  Cwnd
  [ 5]   0.00-1.00  sec  1.12 MBytes  9.43 Mbits/sec    3     180 KBytes
  [ 5]   1.00-2.00  sec   512 KBytes  4.19 Mbits/sec    4     120 KBytes
  [ 5]   2.00-3.00  sec  2.12 MBytes  17.8 Mbits/sec    6     360 KBytes
  [ 5]   3.00-4.00  sec  27.1 MBytes   228 Mbits/sec    6    1.11 MBytes
  [ 5]   4.00-5.00  sec  38.2 MBytes   321 Mbits/sec    0    1.99 MBytes
  [ 5]   5.00-6.00  sec  38.2 MBytes   321 Mbits/sec    0    2.46 MBytes
  [ 5]   6.00-7.00  sec  69.2 MBytes   581 Mbits/sec   71    3.63 MBytes
  [ 5]   7.00-8.00  sec   110 MBytes   919 Mbits/sec    0    5.92 MBytes

* 271 ms, 1 Gbps:

  [ ID] Interval         Transfer     Bitrate          Retr  Cwnd
  [ 5]   0.00-1.00  sec  1.12 MBytes  9.43 Mbits/sec    0     320 KBytes
  [ 5]   1.00-2.00  sec   0.00 Bytes  0.00 bits/sec     0     600 KBytes
  [ 5]   2.00-3.00  sec   512 KBytes  4.19 Mbits/sec    7     420 KBytes
  [ 5]   3.00-4.00  sec   896 KBytes  7.34 Mbits/sec    8    60.0 KBytes
  [ 5]   4.00-5.00  sec  2.62 MBytes  22.0 Mbits/sec   13     420 KBytes
  [ 5]   5.00-6.00  sec  12.1 MBytes   102 Mbits/sec    7    1020 KBytes
  [ 5]   6.00-7.00  sec  19.9 MBytes   167 Mbits/sec    0    1.82 MBytes
  [ 5]   7.00-8.00  sec  17.9 MBytes   150 Mbits/sec   44    1.76 MBytes
  [ 5]   8.00-9.00  sec  57.4 MBytes   481 Mbits/sec   30    2.70 MBytes
  [ 5]   9.00-10.00 sec  88.0 MBytes   738 Mbits/sec    0    6.45 MBytes

* 292 ms, ~600 Mbps:

  [ ID] Interval         Transfer     Bitrate          Retr  Cwnd
  [ 5]   0.00-1.00  sec  1.12 MBytes  9.43 Mbits/sec    0     450 KBytes
  [ 5]   1.00-2.00  sec   0.00 Bytes  0.00 bits/sec     3     180 KBytes
  [ 5]   2.00-3.00  sec   640 KBytes  5.24 Mbits/sec    4     120 KBytes
  [ 5]   3.00-4.00  sec   384 KBytes  3.15 Mbits/sec    2     120 KBytes
  [ 5]   4.00-5.00  sec  4.50 MBytes  37.7 Mbits/sec    1     660 KBytes
  [ 5]   5.00-6.00  sec  13.0 MBytes   109 Mbits/sec    3     960 KBytes
  [ 5]   6.00-7.00  sec  64.5 MBytes   541 Mbits/sec    0    2.17 MBytes

* 327 ms, 1 Gbps:

  [ ID] Interval         Transfer     Bitrate          Retr  Cwnd
  [ 5]   0.00-1.00  sec  1.12 MBytes  9.43 Mbits/sec    0     600 KBytes
  [ 5]   1.00-2.00  sec   0.00 Bytes  0.00 bits/sec     4     240 KBytes
  [ 5]   2.00-3.00  sec   768 KBytes  6.29 Mbits/sec    4     120 KBytes
  [ 5]   3.00-4.00  sec  1.62 MBytes  13.6 Mbits/sec    5     120 KBytes
  [ 5]   4.00-5.00  sec  1.88 MBytes  15.7 Mbits/sec    0     480 KBytes
  [ 5]   5.00-6.00  sec  17.6 MBytes   148 Mbits/sec   14    1.05 MBytes
  [ 5]   6.00-7.00  sec  35.1 MBytes   295 Mbits/sec    0    2.58 MBytes
  [ 5]   7.00-8.00  sec  45.2 MBytes   380 Mbits/sec    0    4.63 MBytes
  [ 5]   8.00-9.00  sec  27.0 MBytes   226 Mbits/sec   96    3.93 MBytes
  [ 5]   9.00-10.00 sec  85.9 MBytes   720 Mbits/sec   67    4.22 MBytes
  [ 5]  10.00-11.00 sec   118 MBytes   986 Mbits/sec    0    9.67 MBytes
  [ 5]  11.00-12.00 sec   124 MBytes  1.04 Gbits/sec    0    15.9 MBytes

For short transfers, we now stick strictly to the available sending buffer
size, which (almost) guarantees that we avoid local retransmissions and
significantly decreases transfer time: from 1.2 s to 60 ms for a 5 MB HTTP
transfer from a container hosted in a virtual machine to another guest.
v3:
- split 1/9 into 1/10 and 2/10 (one adding the function, one changing
  the usage factor)
- in 1/10, rename function and clarify that the factor can be less than
  100%
- fix timer expiration formatting in 4/10
- in 6/10, use tcpi_rtt directly instead of its stored approximation

v2:
- Add 1/9, factoring out a generic version of the scaling function we
  already use for tcp_get_sndbuf(), as I'm now using it in 7/9 as well
- in 3/9, use 4 bits instead of 3 to represent the RTT, which is
  important as we now use RTT values for more than just the ACK checks
- in 5/9, instead of just comparing the sending buffer to SNDBUF_BIG to
  decide when to acknowledge data, use an adaptive approach based on the
  bandwidth-delay product
- in 6/9, clarify the relationship between SWS avoidance and Nagle's
  algorithm, and introduce a reference to RFC 813, Section 4
- in 7/9, use an adaptive approach based on the product of bytes sent
  (and acknowledged) so far and RTT, instead of the previous approach
  based on bytes sent only, as it allows us to converge to the expected
  throughput much quicker with high RTT destinations

*** BLURB HERE ***

Stefano Brivio (10):
  tcp, util: Add function for scaling to linearly interpolated factor, use it
  tcp: Change usage factor of sending buffer in tcp_get_sndbuf() to 75%
  tcp: Limit advertised window to available, not total sending buffer size
  tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
  tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window
  tcp: Acknowledge everything if it looks like bulk traffic, not interactive
  tcp: Don't limit window to less-than-MSS values, use zero instead
  tcp: Allow exceeding the available sending buffer size in window advertisements
  tcp: Send a duplicate ACK also on complete sendmsg() failure
  tcp: Skip redundant ACK on partial sendmsg() failure

 README.md  |   2 +-
 tcp.c      | 170 ++++++++++++++++++++++++++++++++++++++++++-----------
 tcp_conn.h |   9 +++
 util.c     |  52 ++++
 util.h     |   2 +
 5 files changed, 199 insertions(+), 36 deletions(-)

-- 
2.43.0
Right now, the only user of this kind of function is tcp_get_sndbuf(),
which calculates the amount of sending buffer we want to use depending
on the buffer's own size: we use more of it when it's smaller, as
bookkeeping overhead is usually lower and we rely on auto-tuning there,
and less of it when it's bigger.
For this purpose, the new function is overly generic: @x is the same
as @y, that is, we want to use more or less of the buffer depending
on the size of the buffer itself.
However, an upcoming change will need that generality, as we'll want
to scale the amount of sending buffer we use depending on another
(scaled) factor.
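For reference, the helper boils down to the following (doc comment and
exact formatting omitted):

    /* Scale @x by a factor going linearly from 100% (y below lo) to f%
     * (y above hi), interpolating in between */
    long clamped_scale(long x, long y, long lo, long hi, long f)
    {
            if (y < lo)
                    return x;

            if (y > hi)
                    return x * f / 100;

            return x - (x * (y - lo) / (hi - lo)) * (100 - f) / 100;
    }

tcp_get_sndbuf() can then call clamped_scale(sndbuf, sndbuf,
SNDBUF_SMALL, SNDBUF_BIG, 50) to keep using the whole buffer below
SNDBUF_SMALL and half of it at SNDBUF_BIG and above.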
Signed-off-by: Stefano Brivio
Now that we have a new clamped_scale() function, which makes it simple
to specify a precise usage factor, change the fraction of the sending
buffer we want to use at and above 4 MiB to 75%: that looks perfectly
safe.
Signed-off-by: Stefano Brivio
For non-local connections, we advertise the same window size as what
the peer in turn advertises to us, and limit it to the buffer size
reported via SO_SNDBUF.
That's not quite correct: to avoid later failures while queueing data
to the socket, we need to limit the window to the available buffer
size, not the total one.
Use the SIOCOUTQ ioctl and subtract the number of outbound queued
bytes from the total buffer size, then clamp to this value.
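A minimal sketch of that calculation (the helper and its fallback are
illustrative, not necessarily how tcp.c structures it):

    #include <sys/ioctl.h>
    #include <linux/sockios.h>

    /* Window limit from the available, not total, sending buffer size */
    static int tcp_sndbuf_available(int s, int sndbuf_total)
    {
            int sendq = 0;

            /* Bytes still queued on the socket (sent but not yet
             * acknowledged, plus unsent) */
            if (ioctl(s, SIOCOUTQ, &sendq) < 0)
                    return sndbuf_total;    /* fall back to the total size */

            return sendq < sndbuf_total ? sndbuf_total - sendq : 0;
    }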
Signed-off-by: Stefano Brivio
A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
the previous change, because we would generally cause retransmissions
for non-local outbound transfers with relatively high (> 100 Mbps)
bandwidth and non-local but low (< 5 ms) RTT.
Now that retransmissions are less frequent, we don't have a proper
trigger to check for acknowledged bytes on the socket, and we'll
generally block the sender for a significant amount of time even
though we could already acknowledge more data.
Store the RTT reported by the kernel using an approximation (exponent),
to keep flow storage size within two (typical) cachelines. Check for
socket updates when half of this time elapses: it should be a good
indication of the one-way delay we're interested in (peer to us).
Representable values are between 100 us and 3.2768 s, and any value
outside this range is clamped to these bounds. This choice appears
to be a good trade-off between additional overhead and throughput.
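A simplified, standalone version of the encoding (the patch implements
this with RTT_SET()/RTT_GET() macros on top of ilog2(); the loop here
is just for illustration):

    /* Store only an exponent so the RTT approximation fits in 4 bits */
    #define RTT_EXP_BITS    4
    #define RTT_EXP_MAX     ((1 << RTT_EXP_BITS) - 1)       /* 15 */
    #define RTT_STORE_MIN   100     /* us, minimum representable */

    /* rtt_exp = floor(log2(rtt / 100 us)), clamped to 0..15 */
    static unsigned int rtt_to_exp(unsigned long rtt_us)
    {
            unsigned int exp = 0;

            for (rtt_us /= RTT_STORE_MIN; rtt_us > 1 && exp < RTT_EXP_MAX;
                 rtt_us /= 2)
                    exp++;

            return exp;
    }

    /* Recovered approximation: 100 us << exp, that is, 100 us to 3.2768 s */
    static unsigned long exp_to_rtt(unsigned int exp)
    {
            return (unsigned long)RTT_STORE_MIN << exp;
    }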
This mechanism partially overlaps with the "low RTT" destinations,
which we use to infer that a socket is connected to an endpoint on
the same machine (possibly in a different namespace) when the RTT is
reported as 10 us or less.
This change doesn't, however, conflict with it: we are reading
TCP_INFO parameters for local connections anyway, so we can always
store the RTT approximation opportunistically.
Then, if the RTT is "low", we don't really need a timer to
acknowledge data as we'll always acknowledge everything to the
sender right away. However, we have limited space in the array where
we store addresses of local destinations, so the low RTT property of a
connection might toggle frequently. Because of this, it's actually
helpful to always have the RTT approximation stored.
This could probably benefit from a future rework, though, introducing
a more integrated approach between these two mechanisms.
Signed-off-by: Stefano Brivio
We correctly avoid doing that at the beginning of tcp_prepare_flags(),
but we might clear the flag later on if we actually end up sending a
"flag" segment.
Make sure we don't: otherwise we might significantly delay window
updates after a zero-window condition and affect throughput as a
result. In some cases, this forces peers to send zero-window probes
or keep-alive segments.
Signed-off-by: Stefano Brivio
...instead of checking if the current sending buffer is less than
SNDBUF_SMALL, because this isn't simply an optimisation to coalesce
ACK segments: we rely on having enough data at once from the sender
to make the buffer grow by means of TCP buffer size tuning
implemented in the Linux kernel.
This is important if we're trying to maximise throughput, but it's not
desirable for interactive traffic, where we want to be as transparent
as possible and avoid introducing unnecessary latency.
Use the tcpi_delivery_rate field reported by the Linux kernel, if
available, to calculate the current bandwidth-delay product: if it's
significantly smaller than the available sending buffer, conclude that
we're not bandwidth-bound and this is likely to be interactive
traffic, so acknowledge data only as it's acknowledged by the peer.
Conversely, if the bandwidth-delay product is comparable to the size
of the sending buffer (more than 5%), we're probably bandwidth-bound
or... bound to be: acknowledge everything in that case.
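Concretely, the check boils down to something like this (struct
tcp_info_linux is the project's extended TCP_INFO view; the function
and variable names here are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* Decide whether a connection looks interactive (not bandwidth-bound) */
    static bool looks_interactive(const struct tcp_info_linux *tinfo,
                                  uint64_t sndbuf)
    {
            /* Bandwidth-delay product in bytes: rate [B/s] * RTT [us] / 10^6 */
            uint64_t bdp = (uint64_t)tinfo->tcpi_delivery_rate *
                           tinfo->tcpi_rtt / 1000 / 1000;

            /* BDP below 5% of the sending buffer (factor 20): not
             * bandwidth-bound, only ACK what the peer acknowledged */
            return sndbuf > bdp * 20;
    }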
Signed-off-by: Stefano Brivio
If the sender uses data clumping (including Nagle's algorithm) for
Silly Window Syndrome (SWS) avoidance, advertising less than one MSS
means the sender might stop sending altogether, and window updates
after a low window condition are just as important as they are in
a zero-window condition.
For simplicity, approximate that limit to zero, as we already have a
mechanism forcing window updates after zero-sized windows.
This matches the suggestion from RFC 813, section 4.
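In code, the clamping amounts to something like this (variable names
are illustrative, not necessarily the ones used in tcp.c):

    /* Advertising less than one MSS might stall a sender doing SWS
     * avoidance: advertise zero instead, so the existing zero-window
     * update logic kicks in once enough space is available again */
    if (new_wnd_to_tap < mss)
            new_wnd_to_tap = 0;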
Signed-off-by: Stefano Brivio
If the remote peer is advertising a bigger value than our current
sending buffer, it means that a bigger sending buffer is likely to
benefit throughput.
We can get a bigger sending buffer by means of the buffer size
auto-tuning performed by the Linux kernel, which is triggered by
aggressively filling the sending buffer.
Use an adaptive boost factor, up to 150%, depending on:
- how much data we sent so far: we don't want to risk retransmissions
for short-lived connections, as the latency cost would be
unacceptable, and
- the current RTT value, as we need a bigger buffer for higher
transmission delays
The factor we use is not quite a bandwidth-delay product, as we're
missing the time component of the bandwidth, which is not interesting
here: we are trying to make the buffer grow at the beginning of a
connection, progressively, as more data is sent.
The amount of boost we apply was tuned somewhat empirically, but it
appears to yield the available throughput in rather different scenarios
(from ~10 Gbps bandwidth with 500 us RTT to ~1 Gbps with 300 ms RTT),
and it allows getting there rather quickly, within a few seconds for
the 300 ms case.
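Roughly, and with placeholder thresholds (BOOST_LO/BOOST_HI, bytes_acked
and rtt_us below are not the names used in the patch), the calculation
has this shape:

    /* Product of data sent and acknowledged so far and current RTT:
     * small for short-lived or low-RTT connections, large otherwise */
    long bytes_rtt_product = bytes_acked * rtt_us;

    /* Scale the usable sending buffer from 100% up to 150% of its size
     * as bytes_rtt_product grows between the two thresholds, so kernel
     * auto-tuning has something to chew on for long, high-RTT flows */
    long limit = clamped_scale(SNDBUF_GET(conn), bytes_rtt_product,
                               BOOST_LO, BOOST_HI, 150);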
Note that we want to apply this factor only if the window advertised
by the peer is bigger than the current sending buffer, as we only need
this for auto-tuning, and we absolutely don't want to incur
unnecessary retransmissions otherwise.
The related condition in tcp_update_seqack_wnd() is not redundant, as
there's a subtractive term, sendq, in the calculation of the window
limit. If the sending buffer is smaller than the peer's advertised
window, the additional limit we apply might be lower than the one we
would use otherwise.
Assuming that the sending buffer is reported as 100k, sendq is
20k, we could have these example cases:
1. tinfo->tcpi_snd_wnd is 120k, which is bigger than the sending
buffer, so we boost its size to 150k, and we limit the window
to 120k
2. tinfo->tcpi_snd_wnd is 90k, which is smaller than the sending
buffer, so we aren't trying to trigger buffer auto-tuning and
we'll stick to the existing, more conservative calculation,
by limiting the window to 100 - 20 = 80k
If we omitted the new condition, we would always use the boosted
value, that is, 120k, even if potentially causing unnecessary
retransmissions.
Signed-off-by: Stefano Brivio
...in order to trigger a fast retransmit as soon as possible. There's
no benefit in forcing the sender to wait for a longer time than that.
We already do this on partial failures (short socket writes) but, for
historical reasons, not on complete failures. Make the two cases
consistent with each other.
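A sketch of the resulting write path (variable names and flag constants
are approximate; the point is the call on the error branch):

    rc = sendmsg(conn->sock, &mh_sock, MSG_NOSIGNAL);
    if (rc < 0) {
            /* Nothing was written: instead of staying silent and forcing
             * the guest to wait for its retransmission timer, send a
             * duplicate ACK, as the partial-write path already does, so
             * it can fast-retransmit right away */
            tcp_send_flag(c, conn, ACK | DUP_ACK);
            return 0;
    }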
Signed-off-by: Stefano Brivio
...we'll send a duplicate ACK right away in this case, and this
redundant, earlier check is not just useless, but it might actually
be harmful as we'll now send a triple ACK which might cause two
retransmissions.
Signed-off-by: Stefano Brivio
Hi Stefano,

On Mon, 2025-12-08 at 08:20 +0100, Stefano Brivio wrote:
+ return scale_x_to_y_slope(SNDBUF_GET(conn), bytes_rtt_product,
I had to change this to "clamped_scale", otherwise the build failed.

Thanks,
-- Max
Hi Stefano,

On Mon, 2025-12-08 at 08:20 +0100, Stefano Brivio wrote:
@@ -1035,6 +1042,35 @@ void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
 	tap_hdr_update(taph, MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN));
 }
This chunk fails to apply because master currently has the following
line instead:

	tap_hdr_update(taph, l3len + sizeof(struct ethhdr));

The patch applied cleanly after I fixed the context though.

Thanks,
-- Max
Hi Stefano,

On Mon, 2025-12-08 at 08:20 +0100, Stefano Brivio wrote:
With this series, I'm getting the expected (wirespeed) throughput for transfers between peers with varying non-local RTTs: I checked different guests bridged on the same machine (~500 us) and hosts with increasing distance using iperf3, as well as HTTP transfers only for some hosts I have control over (500 us and 5 ms case).
With increasing RTTs, I can finally see the throughput converging to the available bandwidth reasonably fast:
Thanks for the patch, but this unfortunately seems to make things worse
in my testing (with curl/https). Using my benchmarking script from the
earlier thread, with a 10MB file size and a 30s timeout:

  $ pasta --version  # Using the stock pasta in F43
  pasta 0^20250919.g623dbf6-1.fc43.x86_64
  $ ./pasta-upload-test.sh
  network  ping_time  wmem_max  rmem_max  tcp_notsent_lowat  tcp_congestion_control  default_qdisc  download_time  upload_time
  host     50ms       custom    custom    custom             custom                  custom         1.508751       1.656876
  pasta    50ms       custom    custom    custom             custom                  custom         1.548367       2.184099
  host     170ms      custom    custom    custom             custom                  custom         9.313611       3.055348
  pasta    170ms      custom    custom    custom             custom                  custom         13.300405      25.046154

  $ sudo dnf install <freshly built pasta rpms>
  $ pasta --version
  pasta 0^20251208.g5943ea4-1.fc43.x86_64
  $ ./pasta-upload-test.sh
  network  ping_time  wmem_max  rmem_max  tcp_notsent_lowat  tcp_congestion_control  default_qdisc  download_time  upload_time
  host     50ms       custom    custom    custom             custom                  custom         1.490700       1.666525
  pasta    50ms       custom    custom    custom             custom                  custom         1.474725       30.000000
  host     170ms      custom    custom    custom             custom                  custom         9.618929       3.221314
  pasta    170ms      custom    custom    custom             custom                  custom         10.475894      30.000000

  $ sudo dnf downgrade pasta
  $ pasta --version  # Back to the stock pasta in F43
  pasta 0^20250919.g623dbf6-1.fc43.x86_64
  $ ./pasta-upload-test.sh
  network  ping_time  wmem_max  rmem_max  tcp_notsent_lowat  tcp_congestion_control  default_qdisc  download_time  upload_time
  host     50ms       custom    custom    custom             custom                  custom         1.407653       1.686541
  pasta    50ms       custom    custom    custom             custom                  custom         1.558330       2.481097
  host     170ms      custom    custom    custom             custom                  custom         8.951508       3.191743
  pasta    170ms      custom    custom    custom             custom                  custom         9.891349       30.000000

  $ sudo dnf install <freshly built pasta rpms>
  $ pasta --version  # Try the patched version again in case the last test was an anomaly
  pasta 0^20251208.g5943ea4-1.fc43.x86_64
  $ ./pasta-upload-test.sh
  network  ping_time  wmem_max  rmem_max  tcp_notsent_lowat  tcp_congestion_control  default_qdisc  download_time  upload_time
  host     50ms       custom    custom    custom             custom                  custom         1.450695       1.689421
  pasta    50ms       custom    custom    custom             custom                  custom         1.605941       30.000000
  host     170ms      custom    custom    custom             custom                  custom         5.610433       3.034058
  pasta    170ms      custom    custom    custom             custom                  custom         5.544638       30.000000

Thanks,
-- Max
Hi Max,
On Mon, 08 Dec 2025 01:11:56 -0700
Max Chernoff
Hi Stefano,
On Mon, 2025-12-08 at 08:20 +0100, Stefano Brivio wrote:
With this series, I'm getting the expected (wirespeed) throughput for transfers between peers with varying non-local RTTs: I checked different guests bridged on the same machine (~500 us) and hosts with increasing distance using iperf3, as well as HTTP transfers only for some hosts I have control over (500 us and 5 ms case).
With increasing RTTs, I can finally see the throughput converging to the available bandwidth reasonably fast:
Thanks for the patch, but this unfortunately seems to make things worse in my testing (with curl/https). Using my benchmarking script from the earlier thread, with a 10MB file size and a 30s timeout:
Thanks for re-testing. I actually wanted to get back to you about your
sysctl values, but, in general, I don't think things can work reliably
with the values you shared for tcp_notsent_lowat.

Does this (upload now taking longer/timing out with 50 ms RTT) also
happen without "custom" values for tcp_notsent_lowat?

I tested things quite extensively in that RTT region (without custom
sysctl values) and the improvement looks rather consistent to me.

-- 
Stefano
On Mon, 08 Dec 2025 01:15:49 -0700
Max Chernoff
Hi Stefano,
On Mon, 2025-12-08 at 08:20 +0100, Stefano Brivio wrote:
+ return scale_x_to_y_slope(SNDBUF_GET(conn), bytes_rtt_product,
I had to change this to "clamped_scale", otherwise the build failed.
Right, thanks for pointing that out, I went back and edited 1/10
without re-rebasing again. If it's just this (it was, in the tests I'm
running now) I won't re-spin.

-- 
Stefano
Hi Stefano,

On Mon, 2025-12-08 at 09:25 +0100, Stefano Brivio wrote:
but, in general, I don't think things can work reliably with the values you shared for tcp_notsent_lowat.
Ok, that works for me. I know very little about TCP, so I just blindly
copied that value for tcp_notsent_lowat from

  https://blog.cloudflare.com/http-2-prioritization-with-nginx/

but if that's incompatible with pasta, then I have no problem resetting
tcp_notsent_lowat back to the kernel default.

A random web search makes it look like changing tcp_notsent_lowat is
somewhat common

  https://www.google.com/search?q=tcp_notsent_lowat%3D131072
  https://github.com/search?q=tcp_notsent_lowat%3D131072+NOT+is%3Afork&type=code

so maybe it would be a good idea for pasta to either use setsockopt to
override it, or to print a warning on startup if the sysctl is set too
low?
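For reference, a per-socket override would look something like this
(just a sketch of the idea, not something pasta currently does):

    #include <limits.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* Undo a low net.ipv4.tcp_notsent_lowat for this socket only: the
     * kernel default is effectively "unlimited" (UINT_MAX) */
    static void tcp_override_notsent_lowat(int s)
    {
            unsigned int lowat = UINT_MAX;

            if (setsockopt(s, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                           &lowat, sizeof(lowat)))
                    perror("setsockopt(TCP_NOTSENT_LOWAT)");
    }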
Does this (upload now taking longer/timing out with 50 ms RTT) also happen without "custom" values for tcp_notsent_lowat?
I tested things quite extensively in that RTT region (without custom sysctl values) and the improvement looks rather consistent to me.
Ok, with tcp_notsent_lowat reset to the Fedora defaults, the upload
speeds with large RTTs do indeed look *much* better

  $ sudo dnf install <freshly built pasta rpms>
  $ pasta --version
  pasta 0^20251208.g5943ea4-1.fc43.x86_64
  $ ./pasta-upload-test.sh
  network  ping_time  wmem_max  rmem_max  tcp_notsent_lowat  tcp_congestion_control  default_qdisc  download_time  upload_time
  host     50ms       custom    custom    default            custom                  custom         1.561761       2.045501
  pasta    50ms       custom    custom    default            custom                  custom         1.575290       1.707500
  host     170ms      custom    custom    default            custom                  custom         9.147689       3.220591
  pasta    170ms      custom    custom    default            custom                  custom         13.351799      3.411078

  $ sudo dnf downgrade pasta
  $ pasta --version  # Back to the stock pasta in F43
  pasta 0^20250919.g623dbf6-1.fc43.x86_64
  $ ./pasta-upload-test.sh
  network  ping_time  wmem_max  rmem_max  tcp_notsent_lowat  tcp_congestion_control  default_qdisc  download_time  upload_time
  host     50ms       custom    custom    default            custom                  custom         1.429540       1.674165
  pasta    50ms       custom    custom    default            custom                  custom         1.503907       2.025471
  host     170ms      custom    custom    default            custom                  custom         8.891267       3.039416
  pasta    170ms      custom    custom    default            custom                  custom         11.056843      18.704653

Thanks again for all your help,
-- Max
On Mon, 08 Dec 2025 01:51:48 -0700
Max Chernoff
Hi Stefano,
On Mon, 2025-12-08 at 09:25 +0100, Stefano Brivio wrote:
but, in general, I don't think things can work reliably with the values you shared for tcp_notsent_lowat.
Ok, that works for me. I know very little about TCP, so I just blindly copied that value for tcp_notsent_lowat from
https://blog.cloudflare.com/http-2-prioritization-with-nginx/
Right, I guess you mentioned it already. I still need to find some time to actually look into it in better detail and understand what's still true now (note that that article is from 2018 and there have been very relevant kernel changes against "bufferbloat" in between). At that point:
but if that's incompatible with pasta, then I have no problem resetting tcp_notsent_lowat back to the kernel default.
A random web search makes it look like changing tcp_notsent_lowat is somewhat common
https://www.google.com/search?q=tcp_notsent_lowat%3D131072
https://github.com/search?q=tcp_notsent_lowat%3D131072+NOT+is%3Afork&type=code
so maybe it would be a good idea for pasta to either use setsockopt to override it, or to print a warning on startup if the sysctl is set too low?
...I would also get an opinion on this. I should be able to get back to you in a few days.
Does this (upload now taking longer/timing out with 50 ms RTT) also happen without "custom" values for tcp_notsent_lowat?
I tested things quite extensively in that RTT region (without custom sysctl values) and the improvement looks rather consistent to me.
Ok, with tcp_notsent_lowat reset to the Fedora defaults, the upload speeds with large RTTs do indeed look *much* better
Ok, great, thanks for confirming. That's all I'm trying to fix for the
moment. :)

-- 
Stefano
On Mon, Dec 08, 2025 at 08:20:15AM +0100, Stefano Brivio wrote:
Now that we have a new clamped_scale() function, which makes it simple to specify a precise usage factor, change the amount of sending buffer we want to use at and above 4 MiB: 75% looks perfectly safe.
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
---
 tcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tcp.c b/tcp.c
index 026546a..37aceed 100644
--- a/tcp.c
+++ b/tcp.c
@@ -773,7 +773,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 }

 /**
- * tcp_get_sndbuf() - Get, scale SO_SNDBUF between thresholds (1 to 0.5 usage)
+ * tcp_get_sndbuf() - Get, scale SO_SNDBUF between thresholds (1 to 0.75 usage)
  * @conn:	Connection pointer
  */
 static void tcp_get_sndbuf(struct tcp_tap_conn *conn)
@@ -788,7 +788,7 @@ static void tcp_get_sndbuf(struct tcp_tap_conn *conn)
 		return;
 	}

-	v = clamped_scale(sndbuf, sndbuf, SNDBUF_SMALL, SNDBUF_BIG, 50);
+	v = clamped_scale(sndbuf, sndbuf, SNDBUF_SMALL, SNDBUF_BIG, 75);

 	SNDBUF_SET(conn, MIN(INT_MAX, v));
 }

-- 
2.43.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Mon, Dec 08, 2025 at 08:20:17AM +0100, Stefano Brivio wrote:
A fixed 10 ms ACK_INTERVAL timer value served us relatively well until the previous change, because we would generally cause retransmissions for non-local outbound transfers with relatively high (> 100 Mbps) bandwidth and non-local but low (< 5 ms) RTT.
Now that retransmissions are less frequent, we don't have a proper trigger to check for acknowledged bytes on the socket, and will generally block the sender for a significant amount of time while we could acknowledge more data, instead.
Store the RTT reported by the kernel using an approximation (exponent), to keep flow storage size within two (typical) cachelines. Check for socket updates when half of this time elapses: it should be a good indication of the one-way delay we're interested in (peer to us).
Representable values are between 100 us and 3.2768 s, and any value outside this range is clamped to these bounds. This choice appears to be a good trade-off between additional overhead and throughput.
This mechanism partially overlaps with the "low RTT" destinations, which we use to infer that a socket is connected to an endpoint to the same machine (while possibly in a different namespace) if the RTT is reported as 10 us or less.
This change doesn't, however, conflict with it: we are reading TCP_INFO parameters for local connections anyway, so we can always store the RTT approximation opportunistically.
Then, if the RTT is "low", we don't really need a timer to acknowledge data as we'll always acknowledge everything to the sender right away. However, we have limited space in the array where we store addresses of local destination, so the low RTT property of a connection might toggle frequently. Because of this, it's actually helpful to always have the RTT approximation stored.
This could probably benefit from a future rework, though, introducing a more integrated approach between these two mechanisms.
Signed-off-by: Stefano Brivio
---
 tcp.c      | 30 +++++++++++++++++++++++-------
 tcp_conn.h |  9 +++++++++
 util.c     | 14 ++++++++++++++
 util.h     |  1 +
 4 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/tcp.c b/tcp.c
index 28d3304..0167121 100644
--- a/tcp.c
+++ b/tcp.c
@@ -202,9 +202,13 @@
  * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
  *   either side, the connection is reset
  *
- * - ACK_INTERVAL elapsed after data segment received from tap without having
+ * - RTT / 2 elapsed after data segment received from tap without having
  *   sent an ACK segment, or zero-sized window advertised to tap/guest (flag
- *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
+ *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
+ *
+ * RTT, here, is an approximation of the RTT value reported by the kernel via
+ * TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
+ * RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly.
  *
  *
  * Summary of data flows (with ESTABLISHED event)
@@ -341,7 +345,6 @@ enum {
 #define MSS_DEFAULT			536
 #define WINDOW_DEFAULT			14600		/* RFC 6928 */

-#define ACK_INTERVAL			10		/* ms */
 #define RTO_INIT			1		/* s, RFC 6298 */
 #define RTO_INIT_AFTER_SYN_RETRIES	3		/* s, RFC 6298 */
 #define FIN_TIMEOUT			60
@@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 	}

 	if (conn->flags & ACK_TO_TAP_DUE) {
-		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
+		it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000);
+		it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000;
 	} else if (conn->flags & ACK_FROM_TAP_DUE) {
 		int exp = conn->retries, timeout = RTO_INIT;
 		if (!(conn->events & ESTABLISHED))
@@ -608,9 +612,17 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 		it.it_value.tv_sec = ACT_TIMEOUT;
 	}

-	flow_dbg(conn, "timer expires in %llu.%03llus",
-		 (unsigned long long)it.it_value.tv_sec,
-		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
+	if (conn->flags & ACK_TO_TAP_DUE) {
+		flow_trace(conn, "timer expires in %llu.%03llums",
+			   (unsigned long)it.it_value.tv_sec * 1000 +
+			   (unsigned long long)it.it_value.tv_nsec %
+			   (1000 * 1000),
+			   (unsigned long long)it.it_value.tv_nsec / 1000);
This is the wrong way around. The ms part needs to be:

	tv_sec * 1000 + tv_nsec / 1000000

and the fractional (us) part:

	(tv_nsec / 1000) % 1000

(or if you did just want a single digit after the ., then:

	tv_nsec / 100000 % 10
)
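Put together, the corrected call would look something like this
(keeping the variable names from the patch):

	flow_trace(conn, "timer expires in %llu.%03llums",
		   (unsigned long long)it.it_value.tv_sec * 1000 +
		   (unsigned long long)it.it_value.tv_nsec / 1000 / 1000,
		   ((unsigned long long)it.it_value.tv_nsec / 1000) % 1000);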
+	} else {
+		flow_dbg(conn, "timer expires in %llu.%03llus",
+			 (unsigned long long)it.it_value.tv_sec,
+			 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
+	}

 	if (timerfd_settime(conn->timer, 0, &it, NULL))
 		flow_perror(conn, "failed to set timer");
@@ -1144,6 +1156,10 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 		conn_flag(c, conn, ACK_TO_TAP_DUE);

 out:
+	/* Opportunistically store RTT approximation on valid TCP_INFO data */
+	if (tinfo)
+		RTT_SET(conn, tinfo->tcpi_rtt);
+
 	return new_wnd_to_tap != prev_wnd_to_tap ||
 	       conn->seq_ack_to_tap != prev_ack_to_tap;
 }
diff --git a/tcp_conn.h b/tcp_conn.h
index e36910c..9c6ff9e 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -49,6 +49,15 @@ struct tcp_tap_conn {
 #define MSS_SET(conn, mss)	(conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
 #define MSS_GET(conn)		(conn->tap_mss << (16 - TCP_MSS_BITS))

+#define RTT_EXP_BITS		4
+	unsigned int	rtt_exp		:RTT_EXP_BITS;
+#define RTT_EXP_MAX		MAX_FROM_BITS(RTT_EXP_BITS)
+#define RTT_STORE_MIN		100	/* us, minimum representable */
+#define RTT_STORE_MAX		((long)(RTT_STORE_MIN << RTT_EXP_MAX))
+#define RTT_SET(conn, rtt)						\
+	(conn->rtt_exp = MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MIN))))
+#define RTT_GET(conn)		(RTT_STORE_MIN << conn->rtt_exp)
+
 	int		sock		:FD_REF_BITS;

 	uint8_t		events;
diff --git a/util.c b/util.c
index 2232a24..bfeb619 100644
--- a/util.c
+++ b/util.c
@@ -614,6 +614,9 @@ int __daemon(int pidfile_fd, int devnull_fd)
  * fls() - Find last (most significant) bit set in word
  * @x:		Word
  *
+ * Note: unlike ffs() and other implementations of fls(), notably the one from
+ * the Linux kernel, the starting position is 0 and not 1, that is, fls(1) = 0.
+ *
  * Return: position of most significant bit set, starting from 0, -1 if none
  */
 int fls(unsigned long x)
@@ -629,6 +632,17 @@ int fls(unsigned long x)
 	return y;
 }

+/**
+ * ilog2() - Integral part (floor) of binary logarithm (logarithm to the base 2)
+ * @x:		Argument
+ *
+ * Return: integral part of binary logarithm of @x, -1 if undefined (if @x is 0)
+ */
+int ilog2(unsigned long x)
+{
+	return fls(x);
+}
+
 /**
  * write_file() - Replace contents of file with a string
  * @path:	File to write
diff --git a/util.h b/util.h
index 744880b..f7a941f 100644
--- a/util.h
+++ b/util.h
@@ -233,6 +233,7 @@ int output_file_open(const char *path, int flags);
 void pidfile_write(int fd, pid_t pid);
 int __daemon(int pidfile_fd, int devnull_fd);
 int fls(unsigned long x);
+int ilog2(unsigned long x);
 int write_file(const char *path, const char *buf);
 intmax_t read_file_integer(const char *path, intmax_t fallback);
 int write_all_buf(int fd, const void *buf, size_t len);

-- 
2.43.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Mon, Dec 08, 2025 at 08:20:19AM +0100, Stefano Brivio wrote:
...instead of checking if the current sending buffer is less than SNDBUF_SMALL, because this isn't simply an optimisation to coalesce ACK segments: we rely on having enough data at once from the sender to make the buffer grow by means of TCP buffer size tuning implemented in the Linux kernel.
This is important if we're trying to maximise throughput, but not desirable for interactive traffic, where we want to be transparent as possible and avoid introducing unnecessary latency.
Use the tcpi_delivery_rate field reported by the Linux kernel, if available, to calculate the current bandwidth-delay product: if it's significantly smaller than the available sending buffer, conclude that we're not bandwidth-bound and this is likely to be interactive traffic, so acknowledge data only as it's acknowledged by the peer.
Conversely, if the bandwidth-delay product is comparable to the size of the sending buffer (more than 5%), we're probably bandwidth-bound or... bound to be: acknowledge everything in that case.
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
--- tcp.c | 45 +++++++++++++++++++++++++++++++++------------ 1 file changed, 33 insertions(+), 12 deletions(-)
diff --git a/tcp.c b/tcp.c index b2e4174..923c1f2 100644 --- a/tcp.c +++ b/tcp.c @@ -353,6 +353,9 @@ enum { #define LOW_RTT_TABLE_SIZE 8 #define LOW_RTT_THRESHOLD 10 /* us */
+/* Ratio of buffer to bandwidth * delay product implying interactive traffic */ +#define SNDBUF_TO_BW_DELAY_INTERACTIVE /* > */ 20 /* (i.e. < 5% of buffer) */ + #define ACK_IF_NEEDED 0 /* See tcp_send_flag() */
#define CONN_IS_CLOSING(conn) \ @@ -426,11 +429,13 @@ socklen_t tcp_info_size; sizeof(((struct tcp_info_linux *)NULL)->tcpi_##f_)) <= tcp_info_size)
/* Kernel reports sending window in TCP_INFO (kernel commit 8f7baad7f035) */ -#define snd_wnd_cap tcp_info_cap(snd_wnd) +#define snd_wnd_cap tcp_info_cap(snd_wnd) /* Kernel reports bytes acked in TCP_INFO (kernel commit 0df48c26d84) */ -#define bytes_acked_cap tcp_info_cap(bytes_acked) +#define bytes_acked_cap tcp_info_cap(bytes_acked) /* Kernel reports minimum RTT in TCP_INFO (kernel commit cd9b266095f4) */ -#define min_rtt_cap tcp_info_cap(min_rtt) +#define min_rtt_cap tcp_info_cap(min_rtt) +/* Kernel reports delivery rate in TCP_INFO (kernel commit eb8329e0a04d) */ +#define delivery_rate_cap tcp_info_cap(delivery_rate)
/* sendmsg() to socket */ static struct iovec tcp_iov [UIO_MAXIOV]; @@ -1050,6 +1055,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, socklen_t sl = sizeof(*tinfo); struct tcp_info_linux tinfo_new; uint32_t new_wnd_to_tap = prev_wnd_to_tap; + bool ack_everything = true; int s = conn->sock;
/* At this point we could ack all the data we've accepted for forwarding @@ -1059,7 +1065,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, * control behaviour. * * For it to be possible and worth it we need: - * - The TCP_INFO Linux extension which gives us the peer acked bytes + * - The TCP_INFO Linux extensions which give us the peer acked bytes + * and the delivery rate (outbound bandwidth at receiver) * - Not to be told not to (force_seq) * - Not half-closed in the peer->guest direction * With no data coming from the peer, we might not get events which @@ -1069,19 +1076,36 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, * Data goes from socket to socket, with nothing meaningfully "in * flight". * - Not a pseudo-local connection (e.g. to a VM on the same host) - * - Large enough send buffer - * In these cases, there's not enough in flight to bother. + * If it is, there's not enough in flight to bother. + * - Sending buffer significantly larger than bandwidth * delay product + * Meaning we're not bandwidth-bound and this is likely to be + * interactive traffic where we want to preserve transparent + * connection behaviour and latency. + * + * Otherwise, we probably want to maximise throughput, which needs + * sending buffer auto-tuning, triggered in turn by filling up the + * outbound socket queue. */ - if (bytes_acked_cap && !force_seq && + if (bytes_acked_cap && delivery_rate_cap && !force_seq && !CONN_IS_CLOSING(conn) && - !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) && - (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) { + !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn)) { if (!tinfo) { tinfo = &tinfo_new; if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl)) return 0; }
+ if ((unsigned)SNDBUF_GET(conn) > (long long)tinfo->tcpi_rtt * + tinfo->tcpi_delivery_rate / + 1000 / 1000 * + SNDBUF_TO_BW_DELAY_INTERACTIVE) + ack_everything = false; + } + + if (ack_everything) { + /* Fall back to acknowledging everything we got */ + conn->seq_ack_to_tap = conn->seq_from_tap; + } else { /* This trips a cppcheck bug in some versions, including * cppcheck 2.18.3. * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/ @@ -1089,9 +1113,6 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn, /* cppcheck-suppress [uninitvar,unmatchedSuppression] */ conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked + conn->seq_init_from_tap; - } else { - /* Fall back to acknowledging everything we got */ - conn->seq_ack_to_tap = conn->seq_from_tap; }
/* It's occasionally possible for us to go from using the fallback above -- 2.43.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Mon, Dec 08, 2025 at 08:20:14AM +0100, Stefano Brivio wrote:
Right now, the only need for this kind of function comes from tcp_get_sndbuf(), which calculates the amount of sending buffer we want to use depending on its own size: we want to use more of it if it's smaller, as bookkeeping overhead is usually lower and we rely on auto-tuning there, and use less of it when it's bigger.
For this purpose, the new function is overly generic: @x is the same as @y, that is, we want to use more or less of the buffer depending on the size of the buffer itself.
However, an upcoming change will need that generality, as we'll want to scale the amount of sending buffer we use depending on another (scaled) factor.
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
--- tcp.c | 6 +----- util.c | 38 ++++++++++++++++++++++++++++++++++++++ util.h | 1 + 3 files changed, 40 insertions(+), 5 deletions(-)
diff --git a/tcp.c b/tcp.c index bb661ee..026546a 100644 --- a/tcp.c +++ b/tcp.c @@ -788,11 +788,7 @@ static void tcp_get_sndbuf(struct tcp_tap_conn *conn) return; }
- v = sndbuf; - if (v >= SNDBUF_BIG) - v /= 2; - else if (v > SNDBUF_SMALL) - v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2; + v = clamped_scale(sndbuf, sndbuf, SNDBUF_SMALL, SNDBUF_BIG, 50);
SNDBUF_SET(conn, MIN(INT_MAX, v)); } diff --git a/util.c b/util.c index f32c9cb..2232a24 100644 --- a/util.c +++ b/util.c @@ -1223,3 +1223,41 @@ void fsync_pcap_and_log(void) if (log_file != -1) (void)fsync(log_file); } + +/** + * clamped_scale() - Scale @x from 100% to f% depending on @y's value + * @x: Value to scale + * @y: Value determining scaling + * @lo: Lower bound for @y (start of y-axis slope) + * @hi: Upper bound for @y (end of y-axis slope) + * @f: Scaling factor, percent (might be less or more than 100) + * + * Return: @x scaled by @f * linear interpolation of @y between @lo and @hi + * + * In pictures: + * + * f % -> ,---- * If @y < lo (for example, @y is y0), return @x + * /| | + * / | | * If @lo < @y < @hi (for example, @y is y1), + * / | | return @x scaled by a factor linearly + * (100 + f) / 2 % ->/ | | interpolated between 100% and f% depending on + * /| | | @y's position between @lo (100%) and @hi (f%) + * / | | | + * / | | | * If @y > @hi (for example, @y is y2), return + * 100 % -> -----' | | | @x * @f / 100 + * | | | | | + * y0 lo y1 hi y2 Example: @f = 150, @lo = 10, @hi = 20, @y = 15, + * @x = 1000 + * -> interpolated factor is 125% + * -> return 1250 + */ +long clamped_scale(long x, long y, long lo, long hi, long f) +{ + if (y < lo) + return x; + + if (y > hi) + return x * f / 100; + + return x - (x * (y - lo) / (hi - lo)) * (100 - f) / 100; +} diff --git a/util.h b/util.h index 17f5ae0..744880b 100644 --- a/util.h +++ b/util.h @@ -242,6 +242,7 @@ int read_remainder(int fd, const struct iovec *iov, size_t cnt, size_t skip); void close_open_files(int argc, char **argv); bool snprintf_check(char *str, size_t size, const char *format, ...); void fsync_pcap_and_log(void); +long clamped_scale(long x, long y, long lo, long hi, long f);
/** * af_name() - Return name of an address family -- 2.43.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Tue, 9 Dec 2025 16:10:36 +1100
David Gibson
On Mon, Dec 08, 2025 at 08:20:17AM +0100, Stefano Brivio wrote:
A fixed 10 ms ACK_INTERVAL timer value served us relatively well until the previous change, because we would generally cause retransmissions for non-local outbound transfers with relatively high (> 100 Mbps) bandwidth and non-local but low (< 5 ms) RTT.
Now that retransmissions are less frequent, we don't have a proper trigger to check for acknowledged bytes on the socket, and will generally block the sender for a significant amount of time while we could acknowledge more data, instead.
Store the RTT reported by the kernel using an approximation (exponent), to keep flow storage size within two (typical) cachelines. Check for socket updates when half of this time elapses: it should be a good indication of the one-way delay we're interested in (peer to us).
Representable values are between 100 us and 3.2768 s, and any value outside this range is clamped to these bounds. This choice appears to be a good trade-off between additional overhead and throughput.
This mechanism partially overlaps with the "low RTT" destinations, which we use to infer that a socket is connected to an endpoint to the same machine (while possibly in a different namespace) if the RTT is reported as 10 us or less.
This change doesn't, however, conflict with it: we are reading TCP_INFO parameters for local connections anyway, so we can always store the RTT approximation opportunistically.
Then, if the RTT is "low", we don't really need a timer to acknowledge data as we'll always acknowledge everything to the sender right away. However, we have limited space in the array where we store addresses of local destination, so the low RTT property of a connection might toggle frequently. Because of this, it's actually helpful to always have the RTT approximation stored.
This could probably benefit from a future rework, though, introducing a more integrated approach between these two mechanisms.
Signed-off-by: Stefano Brivio
--- tcp.c | 30 +++++++++++++++++++++++------- tcp_conn.h | 9 +++++++++ util.c | 14 ++++++++++++++ util.h | 1 + 4 files changed, 47 insertions(+), 7 deletions(-) diff --git a/tcp.c b/tcp.c index 28d3304..0167121 100644 --- a/tcp.c +++ b/tcp.c @@ -202,9 +202,13 @@ * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on * either side, the connection is reset * - * - ACK_INTERVAL elapsed after data segment received from tap without having + * - RTT / 2 elapsed after data segment received from tap without having * sent an ACK segment, or zero-sized window advertised to tap/guest (flag - * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent + * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent. + * + * RTT, here, is an approximation of the RTT value reported by the kernel via + * TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to + * RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly. * * * Summary of data flows (with ESTABLISHED event) @@ -341,7 +345,6 @@ enum { #define MSS_DEFAULT 536 #define WINDOW_DEFAULT 14600 /* RFC 6928 */
-#define ACK_INTERVAL 10 /* ms */ #define RTO_INIT 1 /* s, RFC 6298 */ #define RTO_INIT_AFTER_SYN_RETRIES 3 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 @@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) }
if (conn->flags & ACK_TO_TAP_DUE) { - it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; + it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000); + it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { int exp = conn->retries, timeout = RTO_INIT; if (!(conn->events & ESTABLISHED)) @@ -608,9 +612,17 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) it.it_value.tv_sec = ACT_TIMEOUT; }
- flow_dbg(conn, "timer expires in %llu.%03llus", - (unsigned long long)it.it_value.tv_sec, - (unsigned long long)it.it_value.tv_nsec / 1000 / 1000); + if (conn->flags & ACK_TO_TAP_DUE) { + flow_trace(conn, "timer expires in %llu.%03llums", + (unsigned long)it.it_value.tv_sec * 1000 + + (unsigned long long)it.it_value.tv_nsec % + (1000 * 1000), + (unsigned long long)it.it_value.tv_nsec / 1000);
This is the wrong way around. The ms part needs to be:
Ouch, I... half swapped them. Almost. This risks haunting us for a bit.
Luckily we have discrete values there so I could make this small table
for --trace logs from the broken implementation:
#include
tv_sec * 1000 + tv_nsec / 1000000 and the fractional (us) part: (tv_nsec / 1000) % 1000
(or if you did just want a single digit after the ., then: tv_nsec / 100000 % 10 )
I guess I'll display two digits instead: I never had a ~100 us RTT in
my tests, so I didn't notice until now that the resulting timeout would
be displayed as 0.0ms with one fractional digit.

Thanks for the snippet, I'll post a patch soon (of course, feel free to
do so meanwhile if you already have a local change...).

-- 
Stefano