Hi Max,

On Tue, 11 Nov 2025 23:11:17 -0700, Max Chernoff wrote:

> Hi,
>
> For the past few months, I've noticed that my HTTP uploads from containers are really slow. Reproduction:
>
> [...]
>
> - The host that I ran this from is a VM in a datacenter, supposedly with a symmetrical 1Gb/s connection. Pinging github.com shows an RTT of 2ms, and pinging www.ctan.org shows an RTT of 100ms.

Thanks for reporting this. Coincidentally, I'm currently debugging an issue that looks similar to this, and I should have some patches for you to test in a couple of days. I'll keep you posted.

-- Stefano
On Wed, 12 Nov 2025 07:55:48 +0100, Stefano Brivio wrote:

> Hi Max,
>
> On Tue, 11 Nov 2025 23:11:17 -0700, Max Chernoff wrote:
>
> > Hi,
> >
> > For the past few months, I've noticed that my HTTP uploads from containers are really slow. Reproduction:
> >
> > [...]
> >
> > - The host that I ran this from is a VM in a datacenter, supposedly with a symmetrical 1Gb/s connection. Pinging github.com shows an RTT of 2ms, and pinging www.ctan.org shows an RTT of 100ms.
>
> Thanks for reporting this. Coincidentally, I'm currently debugging an issue that looks similar to this, and I should have some patches for you to test in a couple of days. I'll keep you posted.

Hmm, actually, I have a hack that's not quite correct (we should make ACK_INTERVAL adaptive instead, which is one of the other bits I'm working on), but if it fixes the issue for you, it should at least mean that we're talking about the same issue. Patch attached. Can you give that a try?

-- Stefano
Hi Stefano,

On Wed, 2025-11-12 at 11:32 +0100, Stefano Brivio wrote:

> Hmm, actually, I have a hack that's not quite correct (we should make ACK_INTERVAL adaptive instead, which is one of the other bits I'm working on), but if it fixes the issue for you, it should at least mean that we're talking about the same issue.
>
> Patch attached. Can you give that a try?

That seemed to help quite a bit: it's now 200x faster than before, but still 10x slower than --network=host:

  $ curl --output /dev/null --progress-meter --form file=@./test.tar.gz "https://www.ctan.org/submit/validate"
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
  100  100M    0   345  100  100M     63 18.4M  0:00:05  0:00:05 --:--:-- 20.3M

(With the original pasta, stopped early)

  $ podman run --rm --pull=newer --volume="$(realpath .):/srv/:Z" --workdir=/srv/ --network=pasta quay.io/fedora/fedora-minimal curl --output /dev/null --progress-meter --form file=@./test.tar.gz "https://www.ctan.org/submit/validate"
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
    4  100M    0     0    4 5056k      0  78152  0:22:21  0:01:06  0:21:15 39298

(With the patch applied)

  $ podman run --rm --pull=newer --volume="$(realpath .):/srv/:Z" --workdir=/srv/ --network=pasta quay.io/fedora/fedora-minimal curl --output /dev/null --progress-meter --form file=@./test.tar.gz "https://www.ctan.org/submit/validate"
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
  100  100M    0   345  100  100M      8 2393k  0:00:42  0:00:42 --:--:-- 4729k

  $ podman run --rm --pull=newer --volume="$(realpath .):/srv/:Z" --workdir=/srv/ --network=host quay.io/fedora/fedora-minimal curl --output /dev/null --progress-meter --form file=@./test.tar.gz "https://www.ctan.org/submit/validate"
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
  100  100M    0   345  100  100M     69 20.0M  0:00:04  0:00:04 --:--:-- 20.4M

Also, I should mention that I'm using the following networking-related sysctls:

  net.core.wmem_max=7500000
  net.core.rmem_max=7500000
  net.ipv4.tcp_notsent_lowat=131072
  net.core.default_qdisc=cake
  net.ipv4.tcp_congestion_control=bbr

I read some articles that suggested that those were a good idea, and I've been using them for about a year now, but I can disable those for testing if you want. I'm also using systemd's IPAddressAllow/IPAddressDeny/RestrictAddressFamilies and some SELinux port restrictions; I can easily disable those too.

Thanks,
-- Max
On Wed, 12 Nov 2025 04:22:32 -0700, Max Chernoff wrote:

> Hi Stefano,
>
> On Wed, 2025-11-12 at 11:32 +0100, Stefano Brivio wrote:
>
> > Hmm, actually, I have a hack that's not quite correct (we should make ACK_INTERVAL adaptive instead, which is one of the other bits I'm working on), but if it fixes the issue for you, it should at least mean that we're talking about the same issue.
> >
> > Patch attached. Can you give that a try?
>
> That seemed to help quite a bit: it's now 200x faster than before, but still 10x slower than --network=host:

Thanks for testing. I'm fairly sure it's that problem, then. Setting 1 ms as the interval between checks for socket-side ACKs (reported by the kernel), as my hack does, is not the appropriate solution; I'm implementing something based on the reported round-trip time (RTT) instead.

As a further hack, you could probably do something like this on top:

---
diff --git a/tcp.c b/tcp.c
index 697f80d..8c50ee0 100644
--- a/tcp.c
+++ b/tcp.c
@@ -339,7 +339,7 @@ enum {
 #define MSS_DEFAULT		536
 #define WINDOW_DEFAULT		14600		/* RFC 6928 */
 
-#define ACK_INTERVAL		1		/* ms */
+#define ACK_INTERVAL		200		/* us */
 #define SYN_TIMEOUT		10		/* s */
 #define ACK_TIMEOUT		2
 #define FIN_TIMEOUT		60
@@ -582,7 +582,7 @@ static void tcp_timer_ctl(struct tcp_tap_conn *conn)
 	}
 
 	if (conn->flags & ACK_TO_TAP_DUE) {
-		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
+		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000;
 	} else if (conn->flags & ACK_FROM_TAP_DUE) {
 		if (!(conn->events & ESTABLISHED))
 			it.it_value.tv_sec = SYN_TIMEOUT;
---

...but again, I'm going to fix that properly in a bit.
> Also, I should mention that I'm using the following networking-related sysctls:
>
> net.core.wmem_max=7500000
> net.core.rmem_max=7500000

Those were settings we recommended for KubeVirt until https://github.com/kubevirt/user-guide/pull/933, but they no longer seem necessary, as we've since made peace with the TCP auto-tuning mechanism in Linux. See also https://bugs.passt.top/show_bug.cgi?id=138 and commit 71249ef3f9bc ("tcp, tcp_splice: Don't set SO_SNDBUF and SO_RCVBUF to maximum values").

As the issue here is about socket (kernel) buffers being "too small" for a while, I guess that those settings plus reverting that commit would "fix" the issue entirely for you. But it's impractical to rely on users to set those, which is why I'm looking for something adaptive that still plays nicely with TCP auto-tuning instead.
> net.ipv4.tcp_notsent_lowat=131072
> net.core.default_qdisc=cake
> net.ipv4.tcp_congestion_control=bbr

I'm not sure if those really matter for pasta, but I haven't really thought about them.

> I read some articles that suggested that those were a good idea, and I've been using them for about a year now, but I can disable those for testing if you want. I'm also using systemd's IPAddressAllow/IPAddressDeny/RestrictAddressFamilies and some SELinux port restrictions; I can easily disable those too.

No need, I don't think those make a difference.

-- Stefano
Hi Stefano,

On Wed, 2025-11-12 at 13:53 +0100, Stefano Brivio wrote:

> As a further hack, you could probably do something like this on top:
>
> ---
> diff --git a/tcp.c b/tcp.c
> index 697f80d..8c50ee0 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -339,7 +339,7 @@ enum {
>  #define MSS_DEFAULT		536
>  #define WINDOW_DEFAULT		14600		/* RFC 6928 */
>  
> -#define ACK_INTERVAL		1		/* ms */
> +#define ACK_INTERVAL		200		/* us */
>  #define SYN_TIMEOUT		10		/* s */
>  #define ACK_TIMEOUT		2
>  #define FIN_TIMEOUT		60
> @@ -582,7 +582,7 @@ static void tcp_timer_ctl(struct tcp_tap_conn *conn)
>  	}
>  
>  	if (conn->flags & ACK_TO_TAP_DUE) {
> -		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> +		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000;
>  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
>  		if (!(conn->events & ESTABLISHED))
>  			it.it_value.tv_sec = SYN_TIMEOUT;
> ---
That actually makes it worse again, about as bad as before the patch. But I've just tried rebuilding with the original patch again, and also with the exact same binary that I used yesterday, and that's slow now too. I've verified with pgrep that Podman is using the correct pasta version, so I have no idea what's happening. However, I do remember that for the past few months, some uploads would randomly go really quickly, so maybe the problem happens sporadically, and when I was testing the patched version I just happened to get (un)lucky?
> > net.core.wmem_max=7500000
> > net.core.rmem_max=7500000
>
> Those were settings we recommended for KubeVirt until https://github.com/kubevirt/user-guide/pull/933, but they no longer seem necessary, as we've since made peace with the TCP auto-tuning mechanism in Linux.
>
> See also https://bugs.passt.top/show_bug.cgi?id=138 and commit 71249ef3f9bc ("tcp, tcp_splice: Don't set SO_SNDBUF and SO_RCVBUF to maximum values").
>
> As the issue here is about socket (kernel) buffers being "too small" for a while, I guess that those settings plus reverting that commit would "fix" the issue entirely for you. But it's impractical to rely on users to set those, which is why I'm looking for something adaptive that still plays nicely with TCP auto-tuning instead.
Ah, I didn't know that those were (or used to be) recommended for pasta; I set them for Caddy, since it complains on startup if they aren't set: https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes
> > net.ipv4.tcp_notsent_lowat=131072
> > net.core.default_qdisc=cake
> > net.ipv4.tcp_congestion_control=bbr

I set those as a general performance-tuning thing (not for pasta specifically), based on:

  https://blog.cloudflare.com/optimizing-tcp-for-high-throughput-and-low-laten...
  https://grapheneos.org/articles/server-traffic-shaping

> I'm not sure if those really matter for pasta, but I haven't really thought about them.
Aha, those do seem to be the issue. Using the original (unpatched) pasta:

(Set everything to my original settings)

  $ sudo sysctl -w net.core.wmem_max=7500000 net.core.rmem_max=7500000 net.ipv4.tcp_notsent_lowat=131072 net.core.default_qdisc=cake net.ipv4.tcp_congestion_control=bbr

(Test with --network=host)

  $ podman run --rm --pull=newer --volume="$(realpath .):/srv/:Z" --workdir=/srv/ --network=host quay.io/fedora/fedora-minimal curl --output /dev/null --progress-meter --form file=@./test.tar.gz "https://www.ctan.org/submit/validate"
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
  100  100M    0   345  100  100M     69 20.0M  0:00:04  0:00:04 --:--:-- 20.4M

(Test with --network=pasta)

  $ podman run --rm --pull=newer --volume="$(realpath .):/srv/:Z" --workdir=/srv/ --network=pasta quay.io/fedora/fedora-minimal curl --output /dev/null --progress-meter --form file=@./test.tar.gz "https://www.ctan.org/submit/validate"
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
    4  100M    0     0    4 5120k      0  42440  0:41:10  0:02:03  0:39:07 26188

(Stopped early since I got sick of waiting)

(Set everything to the kernel defaults)

  $ sudo sysctl -w net.core.wmem_max=212992 net.core.rmem_max=212992 net.ipv4.tcp_notsent_lowat=4294967295 net.core.default_qdisc=fq_codel net.ipv4.tcp_congestion_control=cubic
  $ podman run --rm --pull=newer --volume="$(realpath .):/srv/:Z" --workdir=/srv/ --network=pasta quay.io/fedora/fedora-minimal curl --output /dev/null --progress-meter --form file=@./test.tar.gz "https://www.ctan.org/submit/validate"
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
  100  100M    0   345  100  100M      4 1431k  0:01:11  0:01:11 --:--:-- 1399k

(tcp_congestion_control=default, tcp_notsent_lowat=custom, [rw]mem_max=default)

  $ sudo sysctl -w net.ipv4.tcp_congestion_control=cubic net.ipv4.tcp_notsent_lowat=131072 net.core.rmem_max=212992 net.core.wmem_max=212992
  $ podman run --rm --pull=newer --volume="$(realpath .):/srv/:Z" --workdir=/srv/ --network=pasta quay.io/fedora/fedora-minimal curl --output /dev/null --progress-meter --form file=@./test.tar.gz "https://www.ctan.org/submit/validate"
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
  100  100M    0   345  100  100M      5 1484k  0:01:08  0:01:08 --:--:-- 1400k

(tcp_congestion_control=custom, tcp_notsent_lowat=default, [rw]mem_max=default)

  $ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr net.ipv4.tcp_notsent_lowat=4294967295 net.core.rmem_max=212992 net.core.wmem_max=212992
  $ podman run --rm --pull=newer --volume="$(realpath .):/srv/:Z" --workdir=/srv/ --network=pasta quay.io/fedora/fedora-minimal curl --output /dev/null --progress-meter --form file=@./test.tar.gz "https://www.ctan.org/submit/validate"
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
  100  100M    0   345  100  100M     12 3576k  0:00:28  0:00:28 --:--:-- 14.9M

(tcp_congestion_control=custom, tcp_notsent_lowat=custom, [rw]mem_max=default)

  $ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr net.ipv4.tcp_notsent_lowat=131072 net.core.rmem_max=212992 net.core.wmem_max=212992
  $ podman run --rm --pull=newer --volume="$(realpath .):/srv/:Z" --workdir=/srv/ --network=pasta quay.io/fedora/fedora-minimal curl --output /dev/null --progress-meter --form file=@./test.tar.gz "https://www.ctan.org/submit/validate"
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
  100  100M    0   345  100  100M      8 2595k  0:00:39  0:00:39 --:--:-- 17.1M

(tcp_congestion_control=default, tcp_notsent_lowat=default, [rw]mem_max=custom)

  $ sudo sysctl -w net.ipv4.tcp_congestion_control=cubic net.ipv4.tcp_notsent_lowat=4294967295 net.core.rmem_max=7500000 net.core.wmem_max=7500000
  $ podman run --rm --pull=newer --volume="$(realpath .):/srv/:Z" --workdir=/srv/ --network=pasta quay.io/fedora/fedora-minimal curl --output /dev/null --progress-meter --form file=@./test.tar.gz "https://www.ctan.org/submit/validate"
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
  100  100M    0   345  100  100M      7 2107k  0:00:48  0:00:48 --:--:-- 8310k

(tcp_congestion_control=custom, tcp_notsent_lowat=default, [rw]mem_max=custom)

  $ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr net.ipv4.tcp_notsent_lowat=4294967295 net.core.rmem_max=7500000 net.core.wmem_max=7500000
  $ podman run --rm --pull=newer --volume="$(realpath .):/srv/:Z" --workdir=/srv/ --network=pasta quay.io/fedora/fedora-minimal curl --output /dev/null --progress-meter --form file=@./test.tar.gz "https://www.ctan.org/submit/validate"
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
  100  100M    0   345  100  100M     15 4620k  0:00:22  0:00:22 --:--:-- 12.1M

(tcp_congestion_control=custom, tcp_notsent_lowat=custom, [rw]mem_max=custom)

  $ sudo sysctl -w net.ipv4.tcp_congestion_control=bbr net.ipv4.tcp_notsent_lowat=131072 net.core.rmem_max=7500000 net.core.wmem_max=7500000
  $ podman run --rm --pull=newer --volume="$(realpath .):/srv/:Z" --workdir=/srv/ --network=pasta quay.io/fedora/fedora-minimal curl --output /dev/null --progress-meter --form file=@./test.tar.gz "https://www.ctan.org/submit/validate"
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
    5  100M    0     0    5 5376k      0  29899  0:58:27  0:03:04  0:55:23     0

(Stopped early since I got sick of waiting)

(Test with --network=host again)

  $ podman run --rm --pull=newer --volume="$(realpath .):/srv/:Z" --workdir=/srv/ --network=host quay.io/fedora/fedora-minimal curl --output /dev/null --progress-meter --form file=@./test.tar.gz "https://www.ctan.org/submit/validate"
    % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                   Dload  Upload   Total   Spent    Left  Speed
  100  100M    0   345  100  100M     69 20.0M  0:00:04  0:00:04 --:--:-- 24.3M

I know fairly little about networking and the kernel, so if the answer is just "don't set those sysctls together", I'd be okay with that. But I haven't changed these sysctls since February, my upload speeds via pasta were fine up until a few months ago, and the upload speeds are still okay with --network=host, so I suspect that this is a bug somewhere. I also find it quite interesting that setting any of the sysctls individually or in pairs improves the upload speeds, but setting all three at once slows it down drastically. I bisected a kernel a few weeks ago, so I can try that here if you think this is a kernel bug and not a pasta bug.

Thanks,
-- Max
participants (2)
- Max Chernoff
- Stefano Brivio