On Tue, Jun 10, 2025 at 5:29 PM Stefano Brivio wrote:

[Adding Paul as Podman developer]

On Mon, 9 Jun 2025 11:59:21 +0200 Eugenio Perez Martin wrote:
On Fri, Jun 6, 2025 at 6:37 PM Stefano Brivio wrote:
On Fri, 6 Jun 2025 16:32:38 +0200 Eugenio Perez Martin wrote:
On Wed, May 21, 2025 at 12:35 PM Eugenio Perez Martin wrote:
On Wed, May 21, 2025 at 12:09 PM Stefano Brivio wrote:
On Tue, 20 May 2025 17:09:44 +0200 Eugenio Perez Martin wrote:
> [...]
>
> Now if I isolate the vhost kernel thread [1] I get way more
> performance as expected:
>
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  43.1 GBytes  37.1 Gbits/sec    0   sender
> [  5]   0.00-10.04  sec  43.1 GBytes  36.9 Gbits/sec        receiver
>
> After analyzing perf output, rep_movs_alternative is the most called
> function in all three: iperf3 (~20% Self), passt.avx2 (~15% Self) and
> vhost (~15% Self)
Interesting... s/most called function/function using the most cycles/, I suppose.
Right!
So it looks somewhat similar to
https://archives.passt.top/passt-dev/20241017021027.2ac9ea53@elisabeth/
now?
Kind of. Below tcp_sendmsg_locked I don't see sk_page_frag_refill, but skb_do_copy_data_nocache instead. Not sure if that means something, as it should not be affected by vhost.
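(For anyone who wants to reproduce this kind of view, a rough recipe, not necessarily the exact commands used here:

  perf record -g -p "$(pgrep -d, passt.avx2)" -- sleep 10
  perf report --no-children

which should show rep_movs_alternative by Self% and the call chains below tcp_sendmsg_locked.)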
> But I don't see any of them consuming 100% of CPU in top: pasta
> consumes ~85% CPU, both the iperf3 client and server consume ~60%,
> and vhost consumes ~53%.
>
> So... I have mixed feelings about this :). By "default" it seems to
> have less performance, but my test is maybe too synthetic.
Well, surely we can't ask Podman users to pin specific stuff to given CPU threads. :)
Yes, but maybe the result changes under the right scheduling? I'm isolating the CPUs entirely, which is certainly not the usual case for pasta :).
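(To make the setup concrete: isolation of this kind can be done roughly as below. The vhost-<pid> thread name matches what top shows further down in this thread, but the CPU numbers are just illustrative:

  # find the vhost kernel thread spawned for this pasta instance
  ps -eo pid,comm | grep vhost-
  # pin the vhost thread and pasta to dedicated CPUs
  taskset -pc 2 <vhost thread pid>
  taskset -pc 3 <pasta pid>

with isolcpus= on the kernel command line so nothing else is scheduled there.)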
> There is room for improvement with the mentioned optimizations so I'd
> continue applying them, continuing with UDP and TCP zerocopy, and
> developing zerocopy vhost rx.
That definitely makes sense to me.
Good!
> With these numbers I think the series should not be merged at the
> moment. I could send it as RFC if you want but I've not applied the
> comments the first one received, POC style :).
I don't think it's really needed for you to spend time on semi-polishing something just to have an RFC if you're still working on it. I guess the implementation will change substantially anyway once you factor in further optimisations.
Agree! I'll keep iterating on this then.
Actually, if I remove all the taskset etc. and trust the kernel scheduler, vanilla pasta gives me:

[pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
Connecting to host 10.6.68.254, port 5201
[  5] local 10.6.68.20 port 40408 connected to 10.6.68.254 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
[  5]   1.00-2.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
[  5]   2.00-3.00   sec  3.12 GBytes  26.8 Gbits/sec    0   25.4 MBytes
[  5]   3.00-4.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
[  5]   4.00-5.00   sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
[  5]   5.00-6.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
[  5]   6.00-7.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
[  5]   7.00-8.00   sec  3.09 GBytes  26.6 Gbits/sec    0   25.4 MBytes
[  5]   8.00-9.00   sec  3.08 GBytes  26.5 Gbits/sec    0   25.4 MBytes
[  5]   9.00-10.00  sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  31.0 GBytes  26.7 Gbits/sec    0   sender
[  5]   0.00-10.04  sec  31.0 GBytes  26.5 Gbits/sec        receiver
And with vhost-net:

[pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
...
Connecting to host 10.6.68.254, port 5201
[  5] local 10.6.68.20 port 46720 connected to 10.6.68.254 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.17 GBytes  35.8 Gbits/sec    0   11.9 MBytes
[  5]   1.00-2.00   sec  4.17 GBytes  35.9 Gbits/sec    0   11.9 MBytes
[  5]   2.00-3.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
[  5]   3.00-4.00   sec  4.14 GBytes  35.6 Gbits/sec    0   11.9 MBytes
[  5]   4.00-5.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
[  5]   5.00-6.00   sec  4.16 GBytes  35.8 Gbits/sec    0   11.9 MBytes
[  5]   6.00-7.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
[  5]   7.00-8.00   sec  4.19 GBytes  35.9 Gbits/sec    0   11.9 MBytes
[  5]   8.00-9.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
[  5]   9.00-10.00  sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  41.7 GBytes  35.8 Gbits/sec    0   sender
[  5]   0.00-10.04  sec  41.7 GBytes  35.7 Gbits/sec        receiver
If I go the extra mile and disable notifications (it might be just noise, but...):

[pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
...
Connecting to host 10.6.68.254, port 5201
[  5] local 10.6.68.20 port 56590 connected to 10.6.68.254 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.19 GBytes  36.0 Gbits/sec    0   12.4 MBytes
[  5]   1.00-2.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
[  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
[  5]   3.00-4.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
[  5]   4.00-5.00   sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
[  5]   5.00-6.00   sec  4.21 GBytes  36.1 Gbits/sec    0   12.4 MBytes
[  5]   6.00-7.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
[  5]   7.00-8.00   sec  4.23 GBytes  36.4 Gbits/sec    0   12.4 MBytes
[  5]   8.00-9.00   sec  4.24 GBytes  36.4 Gbits/sec    0   12.4 MBytes
[  5]   9.00-10.00  sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  42.1 GBytes  36.1 Gbits/sec    0   sender
[  5]   0.00-10.04  sec  42.1 GBytes  36.0 Gbits/sec        receiver
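To spell out what "disable notifications" refers to: a minimal sketch of the split-ring mechanics as I understand them, using the flag names from <linux/virtio_ring.h>. This is not the actual patch, and memory barriers and endianness handling are omitted:

  #include <stdbool.h>
  #include <linux/virtio_ring.h>

  /* Ask the device (vhost-net) not to signal the call eventfd for
   * used buffers; we poll the used ring instead. */
  static void vq_suppress_call(struct vring *vr)
  {
          vr->avail->flags |= VRING_AVAIL_F_NO_INTERRUPT;
  }

  /* Skip the kick eventfd write if the device set NO_NOTIFY. */
  static bool vq_need_kick(const struct vring *vr)
  {
          return !(vr->used->flags & VRING_USED_F_NO_NOTIFY);
  }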
So I guess the best thing is to actually run performance tests closer to real-world workloads against the new version and see if it works better?
Well, that's certainly a possibility.
I'd say the biggest value for vhost-net usage in pasta is reaching throughput figures that are comparable with veth, with or without multithreading (keeping an eye on bytes per cycle, of course), with or without kernel changes, so that users won't need to choose between rootless and performance anymore.
It would also simplify things in Podman quite a lot (and to some extent in rootlesskit / Docker as well). We're pretty much there with virtual machines, just not quite with containers (which is somewhat ironic, but of course there's a good reason for that).
If we're clearly wasting cycles in vhost-net (because of the bounce buffer, plus something else perhaps?) *and* there's a somewhat possible solution for that in sight *and* the interface would change anyway, running throughput tests and polishing up the current version with a half-baked solution at the moment sounds a bit wasteful to me.
My point is that I'm testing a very synthetic scenario. If everybody agrees this is close enough to real-world ones, I'm ok with continuing to improve the edges we see. If not, maybe we're picking the wrong fruit even if it is low-hanging?
Getting a table like [1] would shed light on this, especially if it is just a matter of running "make performance" or similar. Maybe we need to include longer queues? Focus on a given scenario? What if UDP gets better but TCP doesn't?
Well, it's a matter of running ./run under test/ (or 'make' there). Have you tried that with your patch? It's kind of representative in the sense that it uses several message sizes and different values for the sending window.
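(That is, with a checkout like the one in your logs:

  cd /home/passt/test && ./run

or just 'make' in that directory.)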
Yes but it freezes in my env. Copying the different windows,

top-left:

# tail -f --retry /home/passt/test/test_logs/context_unshare.log /home/passt/test/test_logs/context_ns.log
tail: warning: --retry only effective for the initial open

==> /home/passt/test/test_logs/context_unshare.log <==
unshare$
tail: cannot open '/home/passt/test/test_logs/context_ns.log' for reading: No such file or directory

---

top-right:

# while cat /tmp/passt-tests-7HpbEm/log_pipe; do :; done
Test layout: single pasta instance with namespace.

---

bottom-left:

# tail -f --retry /home/passt/test/test_logs/context_host.log
tail: warning: --retry only effective for the initial open
host$

---

bottom-right:

# tail -f --retry /home/passt/test/test_logs/context_passt.log
tail: warning: --retry only effective for the initial open
passt$

---

And test/test_logs/test.log:

=== build/all
Build passt ? ! [ -e passt ] ? [ -f passt ] ...passed.
Build pasta ? ! [ -e pasta ] ? [ -h pasta ] ...passed.
Build qrap ? ! [ -e qrap ] ? [ -f qrap ] ...passed.
Build all ? ! [ -e passt ] ? ! [ -e pasta ] ? ! [ -e qrap ] ? [ -f passt ] ? [ -h pasta ] ? [ -f qrap ] ...passed.
Install ? [ -f /tmp/passt-tests-7HpbEm/build/all/prefix/bin/passt ] ? [ -h /tmp/passt-tests-7HpbEm/build/all/prefix/bin/pasta ] ? [ -f /tmp/passt-tests-7HpbEm/build/all/prefix/bin/qrap ] ? man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W passt ? man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W pasta ? man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W qrap ...passed.
Uninstall ? ! [ -f /tmp/passt-tests-7HpbEm/build/all/prefix/bin/passt ] ? ! [ -h /tmp/passt-tests-7HpbEm/build/all/prefix/bin/pasta ] ? ! [ -f /tmp/passt-tests-7HpbEm/build/all/prefix/bin/qrap ] ? ! man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W passt 2>/dev/null ? ! man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W pasta 2>/dev/null ? ! man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W qrap 2>/dev/null ...passed.
=== build/cppcheck
...skipped.
=== build/clang_tidy
...skipped.
---
Now more points about this scenario:

1) I don't see 100% CPU usage in any element:

     CPU%
     84.2  passt.avx2
     57.9  iperf3
     57.2  iperf3
     50.7  vhost-1805109
Still, I bet we're using an awful lot of cycles compared to veth.
2) The most used (Self%) function in vhost is rep_movs_alternative, called from skb_copy_datagram_iter, so yes, ZeroCopy should help a lot here.
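For what it's worth, the "TCP zerocopy" item mentioned earlier presumably maps to the standard SO_ZEROCOPY / MSG_ZEROCOPY socket interface; a minimal sketch of that side, not passt code, with completion handling on the error queue left out:

  #include <sys/socket.h>

  /* Opt the socket into zero-copy transmission (Linux >= 4.14). */
  static int sock_enable_zerocopy(int s)
  {
          int one = 1;

          return setsockopt(s, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
  }

  /* Pages backing 'buf' are pinned instead of copied; the buffer must
   * not be reused until the completion shows up on the socket's error
   * queue (MSG_ERRQUEUE). */
  static ssize_t sock_send_zerocopy(int s, const void *buf, size_t len)
  {
          return send(s, buf, len, MSG_ZEROCOPY);
  }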
Now, is "iperf3 -w 8M" representative? I'm sure ZC helps in this scenario, but does it make things worse if we have small packets? Do we care?
We don't care _a lot_ about small packets because we can typically use large packets, inbound and outbound, at least for TCP (bulk) transfers. But users are doing all sorts of things with containers, including bulk transfers and VPN traffic over UDP, so we do care, a bit.
Again, the main value of using vhost-net, I think, is making "rootful" networking essentially unnecessary, or necessary just for niche use cases (say, non-TCP, non-UDP traffic, or macvlan-like cases). If there are relatively common use cases where pasta performs pretty badly compared to veth, we'll still need rootful networking.
So, yes, it is representative, but not necessarily universal.
I'm totally ok with continuing trying with ZC, I just want to make sure we're not missing anything :).
In any case, it looks like vhost-net zero-copy is a bigger task than we thought, so, even if we don't reach a universal solution that makes rootful networking essentially unnecessary but we do have a big improvement ready, there's of course a lot of value in it. Your call...
Got it, thanks!