On Wed, 9 Oct 2024 19:37:26 +0200
Stefano Brivio <sbrivio(a)redhat.com> wrote:

On Wed, 9 Oct 2024 11:07:07 +0200
Laurent Vivier <lvivier(a)redhat.com> wrote:

...

I had a closer look with perf(1). For inbound traffic (I checked with
IPv6), a reasonably expanded output (20 seconds of iperf3 with those
parameters):

--
Samples: 80K of event 'cycles', Event count (approx.): 81191130978
  Children      Self  Command     Shared Object      Symbol
-   89.20%     0.07%  passt.avx2  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
     89.13% entry_SYSCALL_64_after_hwframe
      - do_syscall_64
         - 69.37% __sys_recvmsg
            - 69.26% ___sys_recvmsg
               - 68.11% ____sys_recvmsg
                  - 67.80% inet6_recvmsg
                     - tcp_recvmsg
                        - 67.56% tcp_recvmsg_locked
                           - 64.14% skb_copy_datagram_iter
                              - __skb_datagram_iter
                                 - 51.42% _copy_to_iter
                                      48.74% copy_user_generic_string
                                 - 6.96% simple_copy_to_iter
                                    + __check_object_size
                           + 1.09% __tcp_transmit_skb
                           + 0.75% tcp_rcv_space_adjust
                           + 0.61% __tcp_cleanup_rbuf
               + 1.01% copy_msghdr_from_user
         - 16.60% __x64_sys_recvfrom
            - __sys_recvfrom
               - 16.43% inet6_recvmsg
                  - tcp_recvmsg
                     - 14.27% tcp_recvmsg_locked
                        - 12.67% __tcp_transmit_skb
                           - 12.64% __ip_queue_xmit
                              - 12.56% ip_finish_output2
                                 - 12.51% __local_bh_enable_ip
                                    - do_softirq.part.0
                                       - 12.50% __softirqentry_text_start
                                          - 12.49% net_rx_action
                                             - 12.47% __napi_poll
                                                - process_backlog
                                                   - 12.41% __netif_receive_skb_one_core
                                                      - 9.75% ip_local_deliver_finish
                                                         - 9.52% ip_protocol_deliver_rcu
                                                            + 9.28% tcp_v4_rcv
                                                      - 2.02% ip_local_deliver
                                                         - 1.92% nf_hook_slow
                                                            + 1.84% nft_do_chain_ipv4
                          0.70% __sk_mem_reduce_allocated
         + 2.10% release_sock
         + 1.16% __x64_sys_timerfd_settime
         + 0.56% ksys_write
         + 0.54% __x64_sys_epoll_wait
+   89.13%     0.06%  passt.avx2  [kernel.kallsyms]  [k] do_syscall_64
+   84.23%     0.02%  passt.avx2  [kernel.kallsyms]  [k] inet6_recvmsg
+   84.21%     0.04%  passt.avx2  [kernel.kallsyms]  [k] tcp_recvmsg
+   81.84%     0.97%  passt.avx2  [kernel.kallsyms]  [k] tcp_recvmsg_locked
+   74.96%     0.00%  passt.avx2  [unknown]          [k] 0000000000000000
+   69.78%     0.07%  passt.avx2  libc.so.6          [.] __libc_recvmsg
+   69.37%     0.02%  passt.avx2  [kernel.kallsyms]  [k] __sys_recvmsg
+   69.26%     0.03%  passt.avx2  [kernel.kallsyms]  [k] ___sys_recvmsg
+   68.11%     0.12%  passt.avx2  [kernel.kallsyms]  [k] ____sys_recvmsg
+   64.14%     0.06%  passt.avx2  [kernel.kallsyms]  [k] skb_copy_datagram_iter
+   64.08%     5.60%  passt.avx2  [kernel.kallsyms]  [k] __skb_datagram_iter
+   51.44%     2.68%  passt.avx2  [kernel.kallsyms]  [k] _copy_to_iter
+   49.16%    49.08%  passt.avx2  [kernel.kallsyms]  [k] copy_user_generic_string
+   16.84%     0.00%  passt.avx2  [unknown]          [k] 0xffff000000000000
+   16.77%     0.02%  passt.avx2  libc.so.6          [.] __libc_recv
+   16.60%     0.00%  passt.avx2  [kernel.kallsyms]  [k] __x64_sys_recvfrom
+   16.60%     0.07%  passt.avx2  [kernel.kallsyms]  [k] __sys_recvfrom
+   13.81%     1.28%  passt.avx2  [kernel.kallsyms]  [k] __tcp_transmit_skb
+   13.76%     0.06%  passt.avx2  [kernel.kallsyms]  [k] __ip_queue_xmit
+   13.64%     0.10%  passt.avx2  [kernel.kallsyms]  [k] ip_finish_output2
+   13.62%     0.14%  passt.avx2  [kernel.kallsyms]  [k] __local_bh_enable_ip
+   13.57%     0.01%  passt.avx2  [kernel.kallsyms]  [k] __softirqentry_text_start
+   13.56%     0.01%  passt.avx2  [kernel.kallsyms]  [k] do_softirq.part.0
+   13.53%     0.01%  passt.avx2  [kernel.kallsyms]  [k] net_rx_action
+   13.51%     0.00%  passt.avx2  [kernel.kallsyms]  [k] __napi_poll
+   13.51%     0.04%  passt.avx2  [kernel.kallsyms]  [k] process_backlog
+   13.45%     0.02%  passt.avx2  [kernel.kallsyms]  [k] __netif_receive_skb_one_core
+   11.27%     0.04%  passt.avx2  [kernel.kallsyms]  [k] tcp_v4_do_rcv
+   10.96%     0.06%  passt.avx2  [kernel.kallsyms]  [k] tcp_rcv_established
+   10.74%     0.02%  passt.avx2  [kernel.kallsyms]  [k] ip_local_deliver_finish
+   10.51%     0.04%  passt.avx2  [kernel.kallsyms]  [k] ip_protocol_deliver_rcu
+   10.26%     0.17%  passt.avx2  [kernel.kallsyms]  [k] tcp_v4_rcv
+    8.14%     0.01%  passt.avx2  [kernel.kallsyms]  [k] __tcp_push_pending_frames
+    8.13%     0.73%  passt.avx2  [kernel.kallsyms]  [k] tcp_write_xmit
+    6.96%     0.26%  passt.avx2  [kernel.kallsyms]  [k] simple_copy_to_iter
+    6.79%     4.72%  passt.avx2  [kernel.kallsyms]  [k] __check_object_size
+    3.33%     0.16%  passt.avx2  [kernel.kallsyms]  [k] nf_hook_slow
+    3.09%     0.09%  passt.avx2  [nf_tables]        [k] nft_do_chain_ipv4
+    3.00%     2.28%  passt.avx2  [nf_tables]        [k] nft_do_chain
+    2.85%     2.84%  passt.avx2  passt.avx2         [.] vu_init_elem
+    2.22%     0.02%  passt.avx2  [kernel.kallsyms]  [k] release_sock
+    2.15%     0.02%  passt.avx2  [kernel.kallsyms]  [k] __release_sock
+    2.04%     0.08%  passt.avx2  [kernel.kallsyms]  [k] ip_local_deliver
+    1.80%     1.79%  passt.avx2  [kernel.kallsyms]  [k] __virt_addr_valid
+    1.57%     0.03%  passt.avx2  libc.so.6          [.] timerfd_settime
     1.53%     1.53%  passt.avx2  passt.avx2         [.] vu_queue_map_desc.isra.0
--

not much we can improve here (and throughput is anyway very close to
iperf3-to-iperf3 on the host's loopback, ~50 Gbps vs. ~70 Gbps): the
bulk of it is copy_user_generic_string() reading from sockets into the
queue, plus the related bookkeeping. The only users of more than 1% of
cycles in passt itself are vu_init_elem() and vu_queue_map_desc();
perhaps we could try to speed those up... one day.
Full perf output (you can load it with perf report -i ...), if you're
curious, at: https://passt.top/static/vu_tcp_ipv6_inbound.perf

For outbound traffic (I tried with IPv4), which is much slower for some
reason (~25 Gbps):

--
Samples: 79K of event 'cycles', Event count (approx.): 73661070737
  Children      Self  Command     Shared Object      Symbol
-   91.00%     0.23%  passt.avx2  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
     90.78% entry_SYSCALL_64_after_hwframe
      - do_syscall_64
         - 78.75% __sys_sendmsg
            - 78.58% ___sys_sendmsg
               - 78.06% ____sys_sendmsg
                  - sock_sendmsg
                     - 77.58% tcp_sendmsg
                        - 68.63% tcp_sendmsg_locked
                           - 26.24% sk_page_frag_refill
                              - skb_page_frag_refill
                                 - 25.87% __alloc_pages
                                    - 25.61% get_page_from_freelist
                                         24.51% clear_page_rep
                           - 23.08% _copy_from_iter
                                22.88% copy_user_generic_string
                           - 8.77% tcp_write_xmit
                              - 8.19% __tcp_transmit_skb
                                 - 7.86% __ip_queue_xmit
                                    - 7.13% ip_finish_output2
                                       - 6.65% __local_bh_enable_ip
                                          - 6.60% do_softirq.part.0
                                             - 6.51% __softirqentry_text_start
                                                - 6.40% net_rx_action
                                                   - 5.43% __napi_poll
                                                      + process_backlog
                                                     0.50% napi_consume_skb
                           + 5.39% __tcp_push_pending_frames
                           + 2.03% tcp_stream_alloc_skb
                           + 1.48% tcp_wmem_schedule
                        + 8.58% release_sock
         - 4.57% ksys_write
            - 4.41% vfs_write
               - 3.96% eventfd_write
                  - 3.46% __wake_up_common
                     - irqfd_wakeup
                        - 3.15% kvm_arch_set_irq_inatomic
                           - 3.11% kvm_irq_delivery_to_apic_fast
                              - 2.01% __apic_accept_irq
                                   0.93% svm_complete_interrupt_delivery
         + 3.91% __x64_sys_epoll_wait
         + 1.20% __x64_sys_getsockopt
         + 0.78% syscall_trace_enter.constprop.0
           0.71% syscall_exit_to_user_mode
         + 0.61% ksys_read
--

...there are no users of more than 1% of cycles in passt itself. The
bulk of it is sendmsg(), as expected; one notable thing is that the
kernel spends an awful amount of cycles zeroing pages so that we can
fill them.
I looked into that "issue" a long time ago:

  https://github.com/netoptimizer/prototype-kernel/pull/39/commits/2c8223c30d…

...maybe I can try out a kernel with a version of that as
clear_page_rep() and see what happens. Anyway, same here: I don't see
anything we can really improve in passt.

Full output at: https://passt.top/static/vu_tcp_ipv4_outbound.perf

--
Stefano

This series of patches adds vhost-user support to passt, and then
allows passt to connect to the QEMU network backend using virtqueues
rather than a socket.

With QEMU, rather than connecting with:

  -netdev stream,id=s,server=off,addr.type=unix,addr.path=/tmp/passt_1.socket

we will use:

  -chardev socket,id=chr0,path=/tmp/passt_1.socket
  -netdev vhost-user,id=netdev0,chardev=chr0
  -device virtio-net,netdev=netdev0
  -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE
  -numa node,memdev=memfd0

The memory backend is needed to share data between passt and QEMU.

Performance comparison between "-netdev stream" and "-netdev vhost-user":

On my setup, with a few tweaks (don't ask me why... we should figure
that out eventually):