On Tue, May 20, 2025 at 11:10 PM Eugenio Perez Martin wrote:
Hi!
Some updates on the integration. The main culprit was that pasta was still allowed to keep reading packets using the regular read() on the tap device. I thought that path was completely disabled, but I guess the kernel is able to omit the notification on tap as long as userspace does not read from it.
My scenario: everything runs on different CPUs, all in the same NUMA node. I run the iperf3 server on CPU 11 with "iperf3 -A 11 -s". All odd CPUs are isolated with isolcpus=1,3,... nohz=on nohz_full=1,3,...
With vanilla pasta isolated to CPUs 1,3 with taskset, and just the --config-net option, running iperf3 with "iperf3 -A 5 -c 10.6.68.254 -w 8M":

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  30.7 GBytes  26.4 Gbits/sec    0            sender
[  5]   0.00-10.04  sec  30.7 GBytes  26.3 Gbits/sec                 receiver
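For easier reproduction, here are the commands above collected in one place, as a config-fragment sketch. The exact taskset invocation for pasta is my assumption of the form used; the elided CPU lists stay elided as in my notes:

```shell
# Kernel command line (odd CPUs isolated, list elided):
#   isolcpus=1,3,... nohz=on nohz_full=1,3,...
# iperf3 server on isolated CPU 11:
#   iperf3 -A 11 -s
# pasta pinned to isolated CPUs 1,3 (assumed taskset form):
#   taskset -c 1,3 pasta --config-net
# client, 8M window, pinned to CPU 5:
#   iperf3 -A 5 -c 10.6.68.254 -w 8M
```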
Now, trying with the vhost patches, we get slightly worse performance:

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  25.5 GBytes  21.9 Gbits/sec    0            sender
[  5]   0.00-10.04  sec  25.5 GBytes  21.8 Gbits/sec                 receiver
The vhost patch still lacks optimizations like disabling notifications or batching the rx available-buffer notifications. At the moment it refills the rx buffers on each iteration, and it does not set the no-notify bit that makes the kernel skip used-buffer notifications while pasta is actively checking the queue, which is not optimal.
Now, if I isolate the vhost kernel thread [1], I get way more performance, as expected:

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  43.1 GBytes  37.1 Gbits/sec    0            sender
[  5]   0.00-10.04  sec  43.1 GBytes  36.9 Gbits/sec                 receiver
After analyzing the perf output, rep_movs_alternative is the most called function in all three: iperf3 (~20% self), passt.avx2 (~15% self) and vhost (~15% self). But I don't see any of them consuming 100% of a CPU in top: pasta consumes ~85%, both the iperf3 client and server consume ~60%, and vhost consumes ~53%.
So... I have mixed feelings about this :). By "default" it seems to perform worse, but maybe my test is too synthetic. There is room for improvement with the mentioned optimizations, so I'd continue applying them, then continue with UDP and TCP zerocopy, and develop zerocopy vhost rx. With these numbers, I think the series should not be merged at the moment. I could send it as an RFC if you want, but I've not applied the comments the first version received; it's POC style :).
Have you pinned pasta to a specific CPU? Note that vhost will inherit the affinity, so there could be some contention if you do that.
Thanks!
[1] Notes to reproduce it: I'm able to see the thread with "top -H" and then set its affinity with taskset. Either the latest changes in the module or the way pasta behaves does not let me see it in classical ps output.
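As an alternative to top, a sketch of finding the thread by scanning /proc comm files. The "vhost-<pid>" thread naming and the target CPU are my assumptions, not something verified against this exact setup:

```shell
# Sketch: find vhost kernel threads by comm name and pin each one to an
# isolated CPU (CPU 3 here). Does nothing if no vhost thread exists.
for comm in /proc/[0-9]*/comm; do
    read -r name < "$comm" 2>/dev/null || continue
    case "$name" in
    vhost-*)
        # Extract the tid from the /proc/<tid>/comm path.
        tid=${comm#/proc/}
        tid=${tid%/comm}
        echo "pinning $name (tid $tid) to CPU 3"
        taskset -cp 3 "$tid"
        ;;
    esac
done
```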