[PATCH] tcp: Use SO_MEMINFO for accurate send buffer overhead accounting
The TCP window advertised to the guest/container must balance two
competing needs: large enough to trigger kernel socket buffer
auto-tuning, but not so large that sendmsg() partially fails, causing
retransmissions.
The current approach uses the difference (SNDBUF_GET() - SIOCOUTQ), but
these values in reality represent different units: SO_SNDBUF includes
the buffer overhead (sk_buff head, alignment, skb_shared_info), while
SIOCOUTQ only returns the actual payload bytes. The clamped_scale
value of 75% is a rough approximation of this overhead, but it is
inaccurate: too generous for large buffers, causing retransmissions at
higher RTTs, and too conservative for small ones, hence inhibiting
auto-tuning.
We now introduce the use of SO_MEMINFO to obtain SK_MEMINFO_SNDBUF and
SK_MEMINFO_WMEM_QUEUED from the kernel. Both are expressed in the
kernel's own accounting units, i.e. including the per-skb overhead,
and match exactly what the kernel's own sk_stream_memory_free()
function uses.
When we combine the above with the payload bytes indicated by SIOCOUTQ,
the observed overhead ratio self-calibrates to whatever gso_segs, cache
line size, and sk_buff layout the kernel may use, and is even
architecture-agnostic.
When data is queued and the overhead ratio is observable
(wmem_queued > sendq), the available payload window is calculated as:
(sk_sndbuf - wmem_queued) * sendq / wmem_queued
When the ratio cannot be observed, e.g. because the queue is empty or
we are in a transient state, we fall back to 75% of remaining buffer
capacity, like before.
If SO_MEMINFO is unavailable, we fall back to the pre-existing
SNDBUF_GET() - SIOCOUTQ calculation.
Link: https://bugs.passt.top/show_bug.cgi?id=138
Signed-off-by: Jon Maloy
On Tue, 21 Apr 2026 22:23:42 -0400, Jon Maloy wrote:
> The TCP window advertised to the guest/container must balance two
> competing needs: large enough to trigger kernel socket buffer
> auto-tuning, but not so large that sendmsg() partially fails, causing
> retransmissions.
>
> The current approach uses the difference (SNDBUF_GET() - SIOCOUTQ), but
> these values are in reality representing different units: SO_SNDBUF
> includes the buffer overhead (sk_buff head, alignment, skb_shared_info),
> while SIOCOUTQ only returns the actual payload bytes.
They're not really different units, because SNDBUF_GET() already returns
a scaled value that (very roughly) tries to take the overhead into
account.
> The clamped_scale value of 75% is a rough approximation of this
> overhead, but it is inaccurate: too generous for large buffers, causing
> retransmissions at higher RTTs, and too conservative for small ones,
> hence inhibiting auto-tuning.
It actually works the other way around (we use 100% for small buffers,
gradually going towards 75% for large buffers), and auto-tuning works
pretty well with it. Example, before your patch, with an iperf3 test at
15 ms RTT:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   125 MBytes  1.05 Gbits/sec    0   9.55 MBytes
[  5]   1.00-2.00   sec   110 MBytes   922 Mbits/sec    0   9.55 MBytes
[  5]   2.00-3.00   sec   111 MBytes   934 Mbits/sec    0   9.55 MBytes
[  5]   3.00-4.00   sec   104 MBytes   877 Mbits/sec    0   9.55 MBytes
[  5]   4.00-5.00   sec   110 MBytes   927 Mbits/sec    0   9.55 MBytes
[  5]   5.00-6.00   sec   111 MBytes   928 Mbits/sec    0   9.55 MBytes
[  5]   6.00-7.00   sec   112 MBytes   944 Mbits/sec    0   9.55 MBytes
[  5]   7.00-8.00   sec   110 MBytes   919 Mbits/sec    0   9.55 MBytes
[  5]   8.00-9.00   sec   112 MBytes   942 Mbits/sec    0   9.55 MBytes
[  5]   9.00-10.00  sec   110 MBytes   918 Mbits/sec    0   9.55 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec    0          sender
[  5]   0.00-10.02  sec  1.07 GBytes   918 Mbits/sec               receiver

and after your patch:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.50 MBytes  21.0 Mbits/sec    0    320 KBytes
[  5]   1.00-2.00   sec  1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
[  5]   2.00-3.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   3.00-4.00   sec  1.38 MBytes  11.5 Mbits/sec    0    320 KBytes
[  5]   4.00-5.00   sec  1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
[  5]   5.00-6.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   6.00-7.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   7.00-8.00   sec  1.38 MBytes  11.5 Mbits/sec    0    320 KBytes
[  5]   8.00-9.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   9.00-10.00  sec  1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  16.0 MBytes  13.4 Mbits/sec    0          sender
[  5]   0.00-10.02  sec  15.1 MBytes  12.7 Mbits/sec               receiver

It's similar in a test with 285 ms RTT.
Before your patch:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.12 MBytes  9.43 Mbits/sec    0    320 KBytes
[  5]   1.00-2.00   sec  2.00 MBytes  16.8 Mbits/sec    0   1.17 MBytes
[  5]   2.00-3.00   sec  0.00 Bytes   0.00 bits/sec    11    660 KBytes
[  5]   3.00-4.00   sec  3.12 MBytes  26.2 Mbits/sec   11    540 KBytes
[  5]   4.00-5.00   sec  31.5 MBytes   264 Mbits/sec    0   1.93 MBytes
[  5]   5.00-6.00   sec  83.9 MBytes   704 Mbits/sec    0   4.10 MBytes
[  5]   6.00-7.00   sec   112 MBytes   941 Mbits/sec    0   7.38 MBytes
[  5]   7.00-8.00   sec   126 MBytes  1.06 Gbits/sec    0   11.9 MBytes
[  5]   8.00-9.00   sec   114 MBytes   952 Mbits/sec    0   11.9 MBytes
[  5]   9.00-10.00  sec   110 MBytes   925 Mbits/sec    0   11.9 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   584 MBytes   490 Mbits/sec   22          sender
[  5]   0.00-10.31  sec   548 MBytes   445 Mbits/sec               receiver

and after your patch:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.50 MBytes  21.0 Mbits/sec    0    320 KBytes
[  5]   1.00-2.00   sec  1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
[  5]   2.00-3.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   3.00-4.00   sec  1.38 MBytes  11.5 Mbits/sec    0    320 KBytes
[  5]   4.00-5.00   sec  1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
[  5]   5.00-6.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   6.00-7.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   7.00-8.00   sec  1.38 MBytes  11.5 Mbits/sec    0    320 KBytes
[  5]   8.00-9.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   9.00-10.00  sec  1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  16.0 MBytes  13.4 Mbits/sec    0          sender
[  5]   0.00-10.02  sec  15.1 MBytes  12.7 Mbits/sec               receiver
> We now introduce the use of SO_MEMINFO to obtain SK_MEMINFO_SNDBUF and
> SK_MEMINFO_WMEM_QUEUED from the kernel. Those are both presented in the
> kernel's own accounting units, i.e. including the per-skb overhead, and
> match exactly what the kernel's own sk_stream_memory_free() function is
> using.
Using SK_MEMINFO_WMEM_QUEUED might be helpful, because that actually
tells us the memory used for pending outgoing segments *as currently
stored*, but SK_MEMINFO_SNDBUF returns the same value as we get via the
SO_SNDBUF socket option (in the kernel, that's sk->sk_sndbuf), so we're
still missing the information of how much overhead we'll have for data
*we haven't written yet*.

That is, the approach we were considering so far was something like:

- divide the value returned via SIOCOUTQ into MSS-sized segments,
  calculate and add overhead for each of those. This shouldn't be needed
  if we use SK_MEMINFO_WMEM_QUEUED (while I wanted to avoid SO_MEMINFO
  because it's a rather large copy_to_user(), maybe it's actually fine)

- divide the remaining space into MSS-sized segments, calculate overhead
  for each of them... and this is the tricky part that you're
  approximating here.
> When we combine the above with the payload bytes indicated by SIOCOUTQ,
> the observed overhead ratio self-calibrates to whatever gso_segs, cache
> line size, and sk_buff layout the kernel may use, and is even
> architecture agnostic.
I hope we can get something like this to work, but this patch, as it is, applies a linear factor to a non-linear overhead, which *I think* is what results in the underestimation of the available buffer size that's visible from tests.
> When data is queued and the overhead ratio is observable
> (wmem_queued > sendq), the available payload window is calculated as:
>
>     (sk_sndbuf - wmem_queued) * sendq / wmem_queued
I still think we should try a bit harder to accurately reverse the
calculation done by the kernel, including gso_segs. If we can't, a
variation I would try on this patch is to consider segments as discrete
quantities, because that should be slightly more accurate.

Example: this patch right now would calculate, say:

  (200000 - 87500 (50 segments)) * 73000 (those 50 segments) / 87500

and give us an 83.428% payload factor, which we apply flat over those
112500 bytes remaining, giving us 93857 bytes of window. Instead, I
think we could do this:

* 87500 bytes per 50 segments -> 290 bytes overhead per segment
* 112500 bytes left: 64 segments
* advertise 93440 bytes
> When the ratio cannot be observed, e.g. because the queue is empty or
> we are in a transient state, we fall back to 75% of remaining buffer
> capacity, like before.
That's not generally the case. We use between 75% and 100%.
> If SO_MEMINFO is unavailable, we fall back to the pre-existing
> SNDBUF_GET() - SIOCOUTQ calculation.
>
> Link: https://bugs.passt.top/show_bug.cgi?id=138
> Signed-off-by: Jon Maloy
> ---
>  tcp.c  | 33 ++++++++++++++++++++++++++-------
>  util.c |  1 +
>  2 files changed, 27 insertions(+), 7 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 43b8fdb..3b47a3b 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -295,6 +295,7 @@
>  #include
>  #include
> +#include
>
>  #include "checksum.h"
>  #include "util.h"
>
> @@ -1128,19 +1129,37 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  		new_wnd_to_tap = tinfo->tcpi_snd_wnd;
>  	} else {
>  		unsigned rtt_ms_ceiling = DIV_ROUND_UP(tinfo->tcpi_rtt, 1000);
> +		uint32_t mem[SK_MEMINFO_VARS];
> +		socklen_t mem_sl;
>  		uint32_t sendq;
> -		int limit;
> +		uint32_t sndbuf;
> +		uint32_t limit;
>
>  		if (ioctl(s, SIOCOUTQ, &sendq)) {
>  			debug_perror("SIOCOUTQ on socket %i, assuming 0", s);
>  			sendq = 0;
>  		}
>
>  		tcp_get_sndbuf(conn);
> +		sndbuf = SNDBUF_GET(conn);
>
> -		if ((int)sendq > SNDBUF_GET(conn))	/* Due to memory pressure? */
> -			limit = 0;
> -		else
> -			limit = SNDBUF_GET(conn) - (int)sendq;
> +		mem_sl = sizeof(mem);
> +		if (getsockopt(s, SOL_SOCKET, SO_MEMINFO, &mem, &mem_sl)) {
If we are already fetching this, we don't need to fetch SO_SNDBUF (same as SK_MEMINFO_SNDBUF).
> +			if (sendq > sndbuf)
> +				limit = 0;
> +			else
> +				limit = sndbuf - sendq;
> +		} else {
> +			uint32_t sb = mem[SK_MEMINFO_SNDBUF];
> +			uint32_t wq = mem[SK_MEMINFO_WMEM_QUEUED];
> +
> +			if (wq > sb)
> +				limit = 0;
> +			else if (!sendq || wq <= sendq)
> +				limit = (sb - wq) * 3 / 4;
Note that SNDBUF_GET() is already scaled. Maybe, actually, that's the problem? Let me try to fix that and see what happens...
> +			else
> +				limit = (uint64_t)(sb - wq) *
> +					sendq / wq;
> +		}
>  		/* If the sender uses mechanisms to prevent Silly Window
>  		 * Syndrome (SWS, described in RFC 813 Section 3) it's critical
>
> @@ -1168,11 +1187,11 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  		 * but we won't send enough to fill one because we're stuck
>  		 * with pending data in the outbound queue */
> -		if (limit < MSS_GET(conn) && sendq &&
> +		if (limit < (unsigned int)MSS_GET(conn) && sendq &&
>  		    tinfo->tcpi_last_data_sent < rtt_ms_ceiling * 10)
>  			limit = 0;
> -		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
> +		new_wnd_to_tap = MIN(tinfo->tcpi_snd_wnd, limit);
>  	}
>  	new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);
>
> diff --git a/util.c b/util.c
> index 73c9d51..036fac1 100644
> --- a/util.c
> +++ b/util.c
> @@ -1137,3 +1137,4 @@ long clamped_scale(long x, long y, long lo, long hi, long f)
>  	return x - (x * (y - lo) / (hi - lo)) * (100 - f) / 100;
>  }
> +
Nit: three unrelated changes.

-- 
Stefano
participants (2):
- Jon Maloy
- Stefano Brivio