On Thu, 10 Oct 2024 09:08:01 +0200 Stefano Brivio <sbrivio(a)redhat.com> wrote:

For outbound traffic (I tried with IPv4), which is much slower for some
reason (~25 Gbps):

--
Samples: 79K of event 'cycles', Event count (approx.): 73661070737
  Children      Self  Command     Shared Object      Symbol
-   91.00%     0.23%  passt.avx2  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
     90.78% entry_SYSCALL_64_after_hwframe
      - do_syscall_64
         - 78.75% __sys_sendmsg
            - 78.58% ___sys_sendmsg
               - 78.06% ____sys_sendmsg
                  - sock_sendmsg
                     - 77.58% tcp_sendmsg
                        - 68.63% tcp_sendmsg_locked
                           - 26.24% sk_page_frag_refill
                              - skb_page_frag_refill
                                 - 25.87% __alloc_pages
                                    - 25.61% get_page_from_freelist
                                         24.51% clear_page_rep
                           - 23.08% _copy_from_iter
                                22.88% copy_user_generic_string
                           - 8.77% tcp_write_xmit
                              - 8.19% __tcp_transmit_skb
                                 - 7.86% __ip_queue_xmit
                                    - 7.13% ip_finish_output2
                                       - 6.65% __local_bh_enable_ip
                                          - 6.60% do_softirq.part.0
                                             - 6.51% __softirqentry_text_start
                                                - 6.40% net_rx_action
                                                   - 5.43% __napi_poll
                                                      + process_backlog
                                                     0.50% napi_consume_skb
                           + 5.39% __tcp_push_pending_frames
                           + 2.03% tcp_stream_alloc_skb
                           + 1.48% tcp_wmem_schedule
                        + 8.58% release_sock
         - 4.57% ksys_write
            - 4.41% vfs_write
               - 3.96% eventfd_write
                  - 3.46% __wake_up_common
                     - irqfd_wakeup
                        - 3.15% kvm_arch_set_irq_inatomic
                           - 3.11% kvm_irq_delivery_to_apic_fast
                              - 2.01% __apic_accept_irq
                                   0.93% svm_complete_interrupt_delivery
         + 3.91% __x64_sys_epoll_wait
         + 1.20% __x64_sys_getsockopt
         + 0.78% syscall_trace_enter.constprop.0
           0.71% syscall_exit_to_user_mode
         + 0.61% ksys_read
--

...there are no users of more than 1% of cycles in passt itself. The bulk
of it is sendmsg() as expected. One notable thing is that the kernel
spends an awful number of cycles zeroing pages just so that we can fill
them. I looked into that "issue" a long time ago:

  https://github.com/netoptimizer/prototype-kernel/pull/39/commits/2c8223c30d…

...maybe I can try out a kernel with a version of that as
clear_page_rep() and see what happens.

...so I tried, it looks like this, but it doesn't boot for some reason:

--
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index f3d257c45225..4079012ce765 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -44,6 +44,17 @@ void clear_page_orig(void *page);
 void clear_page_rep(void *page);
 void clear_page_erms(void *page);
 
+#define MEMSET_AVX2_ZERO(reg) \
+	asm volatile("vpxor %ymm" #reg ", %ymm" #reg ", %ymm" #reg)
+#define MEMSET_AVX2_STORE(loc, reg) \
+	asm volatile("vmovdqa %%ymm" #reg ", %0" : "=m" (loc))
+
+#define YMM_BYTES	(256 / 8)
+#define BYTES_TO_YMM(x)	((x) / YMM_BYTES)
+extern void kernel_fpu_begin_mask(unsigned int kfpu_mask);
+extern void kernel_fpu_end(void);
+extern bool irq_fpu_usable(void);
+
 static inline void clear_page(void *page)
 {
 	/*
@@ -51,6 +62,18 @@ static inline void clear_page(void *page)
 	 * below clobbers @page, so we perform unpoisoning before it.
 	 */
 	kmsan_unpoison_memory(page, PAGE_SIZE);
+
+	if (irq_fpu_usable()) {
+		int i;
+
+		kernel_fpu_begin();
+		MEMSET_AVX2_ZERO(0);
+		for (i = 0; i < BYTES_TO_YMM(PAGE_SIZE); i++)
+			MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * i], 0);
+		kernel_fpu_end();
+		return;
+	}
+
 	alternative_call_2(clear_page_orig,
 			   clear_page_rep, X86_FEATURE_REP_GOOD,
 			   clear_page_erms, X86_FEATURE_ERMS,
--

...I'm not sure if that's something we can do at early boot, so perhaps I
should add something specific in skb_page_frag_refill() instead. But
that's for another day/week/month...

-- 
Stefano
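
[Editor's note: for readers who want to play with the idea outside the
kernel, here is a minimal userspace sketch of the same zeroing loop the
patch above implements with MEMSET_AVX2_ZERO()/MEMSET_AVX2_STORE(): one
register cleared with vpxor, then PAGE_SIZE / 32 aligned 32-byte stores.
It uses AVX2 intrinsics instead of the kernel's inline asm and
kernel_fpu_begin()/kernel_fpu_end(); clear_page_avx2() and the build
command are illustrative only and not part of the patch.]

--
/* Build with: cc -O2 -mavx2 avx2_zero.c */
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE	4096
#define YMM_BYTES	(256 / 8)
#define BYTES_TO_YMM(x)	((x) / YMM_BYTES)

/* Zero one page with 32-byte aligned AVX2 stores (vmovdqa) */
static void clear_page_avx2(void *page)
{
	__m256i zero = _mm256_setzero_si256();	/* vpxor ymm, ymm, ymm */
	unsigned char *p = page;
	int i;

	for (i = 0; i < BYTES_TO_YMM(PAGE_SIZE); i++)
		_mm256_store_si256((__m256i *)(p + YMM_BYTES * i), zero);
}

int main(void)
{
	/* Page-aligned buffer, which also satisfies vmovdqa's
	 * 32-byte alignment requirement */
	unsigned char *page = aligned_alloc(PAGE_SIZE, PAGE_SIZE);

	if (!page)
		return 1;

	clear_page_avx2(page);
	printf("page[0] = %u, page[%d] = %u\n",
	       page[0], PAGE_SIZE - 1, page[PAGE_SIZE - 1]);
	free(page);
	return 0;
}
--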