[PATCH v7 0/8] Add vhost-user support to passt. (part 3)
This series of patches adds vhost-user support to passt and then allows
passt to connect to the QEMU network backend using virtqueues rather than
a socket.

With QEMU, rather than connecting with:

  -netdev stream,id=s,server=off,addr.type=unix,addr.path=/tmp/passt_1.socket

we will use:

  -chardev socket,id=chr0,path=/tmp/passt_1.socket
  -netdev vhost-user,id=netdev0,chardev=chr0
  -device virtio-net,netdev=netdev0
  -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE
  -numa node,memdev=memfd0

The memory backend is needed to share data between passt and QEMU.

Performance comparison between "-netdev stream" and "-netdev vhost-user":

$ iperf3 -c localhost -p 10001 -t 60 -6 -u -b 50G

  socket:
  [ 5]  0.00-60.05 sec  95.6 GBytes  13.7 Gbits/sec  0.017 ms  6998988/10132413 (69%)    receiver
  vhost-user:
  [ 5]  0.00-60.04 sec   237 GBytes  33.9 Gbits/sec  0.006 ms  53673/7813770 (0.69%)     receiver

$ iperf3 -c localhost -p 10001 -t 60 -4 -u -b 50G

  socket:
  [ 5]  0.00-60.05 sec  98.9 GBytes  14.1 Gbits/sec  0.018 ms  6260735/9501832 (66%)     receiver
  vhost-user:
  [ 5]  0.00-60.05 sec   235 GBytes  33.7 Gbits/sec  0.008 ms  37581/7752699 (0.48%)     receiver

$ iperf3 -c localhost -p 10001 -t 60 -6

  socket:
  [ 5]  0.00-60.00 sec  17.3 GBytes  2.48 Gbits/sec    0  sender
  [ 5]  0.00-60.06 sec  17.3 GBytes  2.48 Gbits/sec       receiver
  vhost-user:
  [ 5]  0.00-60.00 sec   191 GBytes  27.4 Gbits/sec    0  sender
  [ 5]  0.00-60.05 sec   191 GBytes  27.3 Gbits/sec       receiver

$ iperf3 -c localhost -p 10001 -t 60 -4

  socket:
  [ 5]  0.00-60.00 sec  15.6 GBytes  2.24 Gbits/sec    0  sender
  [ 5]  0.00-60.06 sec  15.6 GBytes  2.24 Gbits/sec       receiver
  vhost-user:
  [ 5]  0.00-60.00 sec   189 GBytes  27.1 Gbits/sec    0  sender
  [ 5]  0.00-60.04 sec   189 GBytes  27.0 Gbits/sec       receiver

v7:
  - rebase
  - use vu_collect_one_frame() to implement vu_collect() (collect multiple frames)
  - add vhost-user tests from Stefano

v6:
  - rebase
  - extract 3 patches from "vhost-user: add vhost-user":
      passt: rename tap_sock_init() to tap_backend_init()
      tcp: Export headers functions
      udp: Prepare udp.c to be shared with vhost-user
  - introduce new functions vu_collect_one_frame(), vu_collect(),
    vu_set_vnethdr(), vu_flush(), vu_send_single() to be called from
    tcp_vu.c, udp_vu.c and ICMP/DHCP where vhost-user code was duplicated

v5:
  - rebase on top of 2024_09_06.6b38f07
  - rework udp_vu.c as ref.udp.v6 has been removed and we need to know
    whether we receive an IPv4 or IPv6 frame when we prepare the guest
    buffers for recvmsg()
  - remove vnet->hdrlen as the size is always the same with virtio-net v1
  - address comments from David and Stefano

v4:
  - rebase on top of 2024_08_21.1d6142f
    (rebasing on top of 620e19a1b48a ("udp: Merge udp[46]_mh_recv arrays")
    introduces a regression in the UDP latency measurement, because I
    think I don't correctly replace ref.udp.v6, which is removed by this
    commit)
  - addressed most of the comments from David and Stefano (I didn't want
    to postpone this version to next week, so I'll address the remaining
    comments in the next version)
v3:
  - rebase on top of flow table
  - update tcp_vu.c to look like udp_vu.c (recv()/prepare()/send_frame())
  - address comments from Stefano and David on version 2

v2:
  - remove PATCH 4
  - rewrite PATCH 2 and 3 to follow passt coding style
  - move some code from PATCH 3 to PATCH 4 (previously PATCH 5)
  - partially addressed David's comment on PATCH 5

Laurent Vivier (7):
  packet: replace struct desc by struct iovec
  vhost-user: introduce virtio API
  vhost-user: introduce vhost-user API
  udp: Prepare udp.c to be shared with vhost-user
  tcp: Export headers functions
  passt: rename tap_sock_init() to tap_backend_init()
  vhost-user: add vhost-user

Stefano Brivio (1):
  test: Add tests for passt in vhost-user mode

 Makefile               |    9 +-
 conf.c                 |   21 +-
 epoll_type.h           |    4 +
 iov.c                  |    1 -
 isolation.c            |   15 +-
 packet.c               |   91 ++--
 packet.h               |   22 +-
 passt.1                |   10 +-
 passt.c                |   11 +-
 passt.h                |    6 +
 pcap.c                 |    1 -
 tap.c                  |  128 +++--
 tap.h                  |    7 +-
 tcp.c                  |   37 +-
 tcp_internal.h         |   15 +
 tcp_vu.c               |  476 +++++++++++++++++++
 tcp_vu.h               |   12 +
 test/lib/perf_report   |   15 +
 test/lib/setup         |   77 ++-
 test/lib/setup_ugly    |    2 +-
 test/passt_vu          |    1 +
 test/passt_vu_in_ns    |    1 +
 test/perf/passt_vu_tcp |  211 +++++++++
 test/perf/passt_vu_udp |  159 +++++++
 test/run               |   25 +
 test/two_guests_vu     |    1 +
 udp.c                  |   84 ++--
 udp_internal.h         |   34 ++
 udp_vu.c               |  332 +++++++++++++
 udp_vu.h               |   13 +
 util.h                 |    8 +
 vhost_user.c           | 1005 ++++++++++++++++++++++++++++++++++++++++
 vhost_user.h           |  206 ++++++++
 virtio.c               |  660 ++++++++++++++++++++++++++
 virtio.h               |  184 ++++++++
 vu_common.c            |  358 ++++++++++++++
 vu_common.h            |   47 ++
 37 files changed, 4135 insertions(+), 154 deletions(-)
 create mode 100644 tcp_vu.c
 create mode 100644 tcp_vu.h
 create mode 120000 test/passt_vu
 create mode 120000 test/passt_vu_in_ns
 create mode 100644 test/perf/passt_vu_tcp
 create mode 100644 test/perf/passt_vu_udp
 create mode 120000 test/two_guests_vu
 create mode 100644 udp_internal.h
 create mode 100644 udp_vu.c
 create mode 100644 udp_vu.h
 create mode 100644 vhost_user.c
 create mode 100644 vhost_user.h
 create mode 100644 virtio.c
 create mode 100644 virtio.h
 create mode 100644 vu_common.c
 create mode 100644 vu_common.h

-- 
2.46.2
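As a side note on the memory-backend requirement mentioned in the cover
letter: the back-end only ever sees guest buffers through memory regions
that QEMU shares over the vhost-user socket. A minimal, hedged sketch of
what handling such a region typically involves (field names follow the
vhost-user VHOST_USER_SET_MEM_TABLE message; the struct and helpers below
are illustrative only, not the series' actual code):

--
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

/* One guest memory region, as described by the VHOST_USER_SET_MEM_TABLE
 * payload (illustrative definition only) */
struct vu_region_sketch {
	uint64_t gpa;		/* guest-physical base address */
	uint64_t size;		/* region size in bytes */
	uint64_t qva;		/* QEMU userspace address of the region */
	uint64_t mmap_offset;	/* offset of the region within the memfd */
	void *mmap_base;	/* where the back-end maps it locally */
};

/* Map one region from the memfd received over SCM_RIGHTS; one common
 * approach is to map mmap_offset + size bytes and skip mmap_offset when
 * translating addresses */
static int vu_map_region_sketch(struct vu_region_sketch *r, int memfd)
{
	r->mmap_base = mmap(NULL, r->size + r->mmap_offset,
			    PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
	return r->mmap_base == MAP_FAILED ? -1 : 0;
}

/* Translate a guest-physical address into a pointer usable by passt */
static void *vu_gpa_to_ptr_sketch(const struct vu_region_sketch *r,
				  uint64_t gpa)
{
	if (gpa < r->gpa || gpa >= r->gpa + r->size)
		return NULL;	/* not covered by this region */

	return (uint8_t *)r->mmap_base + r->mmap_offset + (gpa - r->gpa);
}
--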
To be able to manage buffers inside a shared memory provided
by a VM via a vhost-user interface, we cannot rely on the fact
that buffers are located in a pre-defined memory area and use
a base address and a 32-bit offset to address them.
We need a 64-bit address, so replace struct desc with struct iovec
and update range checking.
Signed-off-by: Laurent Vivier
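As an illustration of the change described above (a sketch only: the
"before" struct layout and the range-check helper are hypothetical, not
the actual passt definitions), the point is that an iovec can describe a
buffer anywhere in mapped guest memory, and range checking then happens
against the known memory regions instead of a single pool:

--
#include <stdint.h>
#include <stddef.h>
#include <sys/uio.h>	/* struct iovec: { void *iov_base; size_t iov_len; } */

/* Hypothetical "before": a packet addressed as base + 32-bit offset,
 * which only works while every buffer lives in one pre-defined area */
struct desc_sketch {
	uint32_t offset;	/* offset from the pool's base address */
	uint16_t len;
};

/* "After": each packet is an iovec, so it can point anywhere, including
 * into guest memory mapped through vhost-user; check that the whole
 * buffer fits inside a given memory area */
static int packet_in_range_sketch(const struct iovec *pkt,
				  const void *area, size_t area_size)
{
	const char *base = area, *p = pkt->iov_base;

	return p >= base && pkt->iov_len <= area_size &&
	       (size_t)(p - base) <= area_size - pkt->iov_len;
}
--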
Add virtio.c and virtio.h that define the functions needed
to manage virtqueues.
Signed-off-by: Laurent Vivier
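For reference, the descriptors a virtqueue implementation has to walk are
laid out as in the virtio 1.x split-ring specification (this is the
standard layout, not code from the patch):

--
#include <stdint.h>

/* Split-virtqueue descriptor as defined by the virtio specification */
struct vring_desc_sketch {
	uint64_t addr;	/* guest address of the buffer */
	uint32_t len;	/* buffer length */
	uint16_t flags;	/* e.g. NEXT (chained), WRITE (device-writable) */
	uint16_t next;	/* index of the next descriptor in the chain */
};

/* virtio.c's job is essentially: pop available descriptors, translate
 * addr through the mapped memory regions, build iovecs for the rest of
 * passt, and push used descriptors back with the bytes written. */
--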
Add vhost_user.c and vhost_user.h that define the functions needed
to implement the vhost-user backend.
Signed-off-by: Laurent Vivier
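For context (following the vhost-user protocol as documented in QEMU, not
code from this patch): every message on the UNIX socket starts with a
small fixed header, and some requests carry file descriptors as
SCM_RIGHTS ancillary data, which is how the kick/call eventfds and the
memory-backend memfd reach the back-end.

--
#include <stdint.h>

/* vhost-user message header (little-endian on the wire) */
struct vhost_user_header_sketch {
	uint32_t request;	/* e.g. VHOST_USER_SET_MEM_TABLE,
				 * VHOST_USER_SET_VRING_KICK, ... */
	uint32_t flags;		/* protocol version, REPLY/NEED_REPLY bits */
	uint32_t size;		/* size of the request-specific payload */
	/* followed by 'size' bytes of payload */
};
--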
Export udp_payload_t, udp_update_hdr4(), udp_update_hdr6() and
udp_sock_errs().
Rename udp_listen_sock_handler() to udp_buf_listen_sock_handler() and
udp_reply_sock_handler() to udp_buf_reply_sock_handler().
Signed-off-by: Laurent Vivier
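The rename frees the generic names for a dispatch layer. A rough sketch
of the idea (the vhost-user handler name, the context type and all
signatures below are hypothetical; only the _buf_ names come from this
patch):

--
/* Illustrative dispatch only: names and signatures are assumptions,
 * not the actual passt prototypes. */
struct ctx_sketch { int vhost_user; };

void udp_buf_listen_sock_handler_sketch(struct ctx_sketch *c, int fd);
void udp_vu_listen_sock_handler_sketch(struct ctx_sketch *c, int fd);

static void udp_listen_dispatch_sketch(struct ctx_sketch *c, int fd)
{
	if (c->vhost_user)	/* virtqueue path added by this series */
		udp_vu_listen_sock_handler_sketch(c, fd);
	else			/* existing pre-allocated buffer path */
		udp_buf_listen_sock_handler_sketch(c, fd);
}
--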
Export tcp_fill_headers[4|6]() and tcp_update_check_tcp[4|6]().
They'll be needed by vhost-user.
Signed-off-by: Laurent Vivier
Extract pool storage initialization loop to tap_sock_update_pool(),
extract QEMU hints to tap_backend_show_hints().
Signed-off-by: Laurent Vivier
Add virtio and vhost-user functions to connect with QEMU.
$ ./passt --vhost-user
and
# qemu-system-x86_64 ... -m 4G \
-object memory-backend-memfd,id=memfd0,share=on,size=4G \
-numa node,memdev=memfd0 \
-chardev socket,id=chr0,path=/tmp/passt_1.socket \
-netdev vhost-user,id=netdev0,chardev=chr0 \
-device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
...
Signed-off-by: Laurent Vivier
From: Stefano Brivio
On Wed, 9 Oct 2024 11:07:07 +0200
Laurent Vivier wrote:
This series of patches adds vhost-user support to passt and then allows passt to connect to the QEMU network backend using virtqueues rather than a socket.
With QEMU, rather than connecting with:
-netdev stream,id=s,server=off,addr.type=unix,addr.path=/tmp/passt_1.socket
we will use:
-chardev socket,id=chr0,path=/tmp/passt_1.socket -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0
The memory backend is needed to share data between passt and QEMU.
I just got the tests from 8/8 hanging like this (display with >= 212 columns): guest$ socat -u TCP6-LISTEN:10001 OPEN:test_big.bin,create,trunc │Starting test: TCP/IPv6: host to ns: small transfer guest$ cmp test_big.bin /root/big.bin │? cmp /tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_ns_small.bin /home/sbrivio/passt/test/small.bin guest$ socat -u TCP6-LISTEN:10001 OPEN:test_big.bin,create,trunc │...passed. guest$ cmp test_big.bin /root/big.bin │ guest$ socat -u TCP6-LISTEN:10001 OPEN:test_small.bin,create,trunc │Starting test: TCP/IPv6: guest to host: small transfer guest$ cmp test_small.bin /root/small.bin │? cmp /tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_small.bin /home/sbrivio/passt/test/small.bin guest$ socat -u OPEN:/root/small.bin TCP6:[2001:db8:9a55::1]:10003 │...passed. guest$ socat -u OPEN:/root/small.bin TCP6:[2001:db8:9a55::2]:10002 │ guest$ socat -u TCP6-LISTEN:10001 OPEN:test_small.bin,create,trunc │Starting test: TCP/IPv6: guest to ns: small transfer guest$ cmp test_small.bin /root/small.bin │? cmp /tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_ns_small.bin /home/sbrivio/passt/test/small.bin guest$ socat -u TCP6-LISTEN:10001 OPEN:test_small.bin,create,trunc │...passed. guest$ cmp test_small.bin /root/small.bin │ guest$ which socat ip jq >/dev/null │Starting test: TCP/IPv6: ns to host (spliced): small transfer guest$ socat -u UDP4-LISTEN:10001,null-eof OPEN:test.bin,create,trunc │? cmp /tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_small.bin /home/sbrivio/passt/test/small.bin │...passed. ──guest──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP6:[2001:db8:9a55::1]:10003 │Starting test: TCP/IPv6: ns to host (via tap): small transfer ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP6:[::1]:10001 │? cmp /tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_small.bin /home/sbrivio/passt/test/small.bin ns$ ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' │...passed. enp9s0 │ ns$ ip -j -6 addr show|jq -rM '.[] | select(.ifname == "enp9s0").addr_info[0].local' │Starting test: TCP/IPv6: ns to guest (using loopback address): small transfer 2a01:4f8:222:904::2 │...passed. ns$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP6:[2a01:4f8:222:904::2]:10001 │ ns$ socat -u TCP6-LISTEN:10002 OPEN:/tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_ns_small.bin,create,trunc │Starting test: TCP/IPv6: ns to guest (using namespace address): small transfer ns$ socat -u TCP6-LISTEN:10002 OPEN:/tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_ns_small.bin │...passed. 
ns$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP6:[::1]:10003 │ ns$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP6:[2001:db8:9a55::1]:10003 │================================================================================================================== ns$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP6:[::1]:10001 │Starting tests in file: passt_vu_in_ns/udp ns$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP6:[2a01:4f8:222:904::2]:10001 │ ns$ which socat ip jq >/dev/null │Starting test: UDP/IPv4: host to guest ns$ │ ──namespace─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────┴──passt_vu_in_ns/udp [1/16] - UDP/IPv4: host to guest───────────────────────────────────────────────────────────── host$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10001 │NDP/DHCPv6: host$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP4:127.0.0.1:10002 │ assign: 2a01:4f8:222:904::2 host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_big.bin,create,trunc │ router: fe80::1 host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_big.bin,create,trunc │ our link-local: fe80::1 host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_big.bin,create,trunc │DNS: host$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10001 │ 2a01:4ff:ff00::add:2 host$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP4:127.0.0.1:10002 │ 2a01:4ff:ff00::add:1 host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_small.bin,create,trunc │You can start qemu with: host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_small.bin,create,trunc │ kvm ... 
-chardev socket,id=chr0,path=/tmp/passt-tests-KLvyGO/passt_in_ns/passt.socket -netdev vhost-user,id=netdev0,chardev=chr0 -d host$ socat -u TCP4-LISTEN:10003 OPEN:/tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_small.bin,create,trunc │evice virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0 host$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP6:[::1]:10001 │ host$ socat -u OPEN:/home/sbrivio/passt/test/big.bin TCP6:[::1]:10002 │accepted connection from PID 4848 host$ socat -u TCP6-LISTEN:10003 OPEN:/tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_big.bin,create,trunc │==4846== Warning: set address range perms: large range [0x59c8f000, 0x119c8f000) (defined) host$ socat -u TCP6-LISTEN:10003 OPEN:/tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_big.bin,create,trunc │==4846== Warning: set address range perms: large range [0x119c8f000, 0x519c8f000) (defined) host$ socat -u TCP6-LISTEN:10003 OPEN:/tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_big.bin,create,trunc │NDP: received RS, sending RA host$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP6:[::1]:10001 │DHCP: offer to discover host$ socat -u OPEN:/home/sbrivio/passt/test/small.bin TCP6:[::1]:10002 │ from 52:54:00:12:34:56 host$ socat -u TCP6-LISTEN:10003 OPEN:/tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_small.bin,create,trunc │DHCP: ack to request host$ socat -u TCP6-LISTEN:10003 OPEN:/tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_small.bin,create,trunc │ from 52:54:00:12:34:56 host$ socat -u TCP6-LISTEN:10003 OPEN:/tmp/passt-tests-KLvyGO/passt_vu_in_ns/tcp/test_small.bin,create,trunc │DHCPv6: received SOLICIT, sending ADVERTISE host$ which socat ip jq >/dev/null │DHCPv6: received REQUEST/RENEW/CONFIRM, sending REPLY host$ socat -u OPEN:/home/sbrivio/passt/test/medium.bin UDP4:127.0.0.1:10001,shut-null │NDP: received NS, sending NA host$ │ ──host──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──passt in pasta (namespace)─────────────────────────────────────────────────────────────────────────────────────────────────────────── Testing commit: 529d9fd test: Add tests for passt in vhost-user mode PASS: 192 | FAIL: 0 | 2024-10-09T12:39:37+00:00 ...that is, in passt_vu_in_ns/udp, on the basic "UDP/IPv4: host to guest" test, the client is already done, but the server gets nothing. It doesn't look like a race condition in the test itself, because if I re-run the client manually the server is still stuck, but I didn't really investigate, yet. The server is just waiting for data: $ ssh -F /tmp/passt-tests-KLvyGO/passt_in_ns/context_guest.ssh guest sh cd /proc cat 562/cmdline socat-uUDP6-LISTEN:10001,null-eofOPEN:test.bin,create,trunc strace -p 562 strace: Process 562 attached pselect6(6, [5], [], [], NULL, NULL -- Stefano
On 09/10/2024 15:07, Stefano Brivio wrote:
On Wed, 9 Oct 2024 11:07:07 +0200 Laurent Vivier wrote:
This series of patches adds vhost-user support to passt and then allows passt to connect to the QEMU network backend using virtqueues rather than a socket.
With QEMU, rather than connecting with:
-netdev stream,id=s,server=off,addr.type=unix,addr.path=/tmp/passt_1.socket
we will use:
-chardev socket,id=chr0,path=/tmp/passt_1.socket -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0
The memory backend is needed to share data between passt and QEMU. ...
...that is, in passt_vu_in_ns/udp, on the basic "UDP/IPv4: host to guest" test, the client is already done, but the server gets nothing. It doesn't look like a race condition in the test itself, because if I re-run the client manually the server is still stuck, but I didn't really investigate, yet.
Yes, I've seen that. I didn't know if the problem was in the test or in my code. I found it's a regression introduced by my new function vu_collect_one_frame(). I'm on it.

Thanks,
Laurent
On Wed, 9 Oct 2024 11:07:07 +0200
Laurent Vivier wrote:
This series of patches adds vhost-user support to passt and then allows passt to connect to the QEMU network backend using virtqueues rather than a socket.
With QEMU, rather than connecting with:
-netdev stream,id=s,server=off,addr.type=unix,addr.path=/tmp/passt_1.socket
we will use:
-chardev socket,id=chr0,path=/tmp/passt_1.socket -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0
The memory backend is needed to share data between passt and QEMU.
Performance comparison between "-netdev stream" and "-netdev vhost-user":
On my setup, with a few tweaks (don't ask me why... we should figure out
eventually):

--
diff --git a/test/perf/passt_vu_tcp b/test/perf/passt_vu_tcp
index b434008..76bdd48 100644
--- a/test/perf/passt_vu_tcp
+++ b/test/perf/passt_vu_tcp
@@ -38,10 +38,10 @@ hout FREQ_PROCFS (echo "scale=1"; sed -n 's/cpu MHz.*: \([0-9]*\)\..*$/(\1+10^2\
 hout FREQ_CPUFREQ (echo "scale=1"; printf '( %i + 10^5 / 2 ) / 10^6\n' $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq) ) | bc -l
 hout FREQ [ -n "__FREQ_CPUFREQ__" ] && echo __FREQ_CPUFREQ__ || echo __FREQ_PROCFS__
-set THREADS 4
+set THREADS 4-6
 set TIME 5
 set OMIT 0.1
-set OPTS -Z -P __THREADS__ -l 1M -O__OMIT__ -N
+set OPTS -Z -O__OMIT__ -N
 info Throughput in Gbps, latency in µs, __THREADS__ threads at __FREQ__ GHz
 report passt_vu tcp __THREADS__ __FREQ__
@@ -55,16 +55,16 @@ iperf3s ns 10002
 bw -
 bw -
 guest ip link set dev __IFNAME__ mtu 1280
-iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 16M
+iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 16M -l 1M -P 4
 bw __BW__ 1.2 1.5
 guest ip link set dev __IFNAME__ mtu 1500
-iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 32M
+iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 32M -l 1M -P 4
 bw __BW__ 1.6 1.8
 guest ip link set dev __IFNAME__ mtu 9000
-iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 64M
+iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 64M -l 1M -P 4
 bw __BW__ 4.0 5.0
 guest ip link set dev __IFNAME__ mtu 65520
-iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 64M
+iperf3 BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 256M -l 1M -P 4
 bw __BW__ 7.0 8.0
 iperf3k ns
@@ -93,22 +93,22 @@ tr TCP throughput over IPv4: guest to host
 iperf3s ns 10002
 guest ip link set dev __IFNAME__ mtu 256
-iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 2M
+iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 2M -l 1M -P 4
 bw __BW__ 0.2 0.3
 guest ip link set dev __IFNAME__ mtu 576
-iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 4M
+iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 4M -l 1M -P 4
 bw __BW__ 0.5 0.8
 guest ip link set dev __IFNAME__ mtu 1280
-iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 8M
+iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 8M -l 1M -P 4
 bw __BW__ 1.2 1.5
 guest ip link set dev __IFNAME__ mtu 1500
-iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 16M
+iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 16M -l 1M -P 4
 bw __BW__ 1.6 1.8
 guest ip link set dev __IFNAME__ mtu 9000
-iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 64M
+iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 64M -l 1M -P 4
 bw __BW__ 4.0 5.0
 guest ip link set dev __IFNAME__ mtu 65520
-iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 64M
+iperf3 BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 256M -l 1M -P 4
 bw __BW__ 7.0 8.0
 iperf3k ns
@@ -145,7 +145,7 @@ bw -
 bw -
 bw -
 bw -
-iperf3 BW ns ::1 10001 __TIME__ __OPTS__ -w 32M
+iperf3 BW ns ::1 10001 __TIME__ __OPTS__ -w 256M -l 16k -P 6
 bw __BW__ 6.0 6.8
 iperf3k guest
@@ -181,7 +181,7 @@ bw -
 bw -
 bw -
 bw -
-iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -w 32M
+iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -w 256M -l 16k -P 6
 bw __BW__ 6.0 6.8
 iperf3k guest
--

I'm getting an even bigger improvement in throughput (and also
significantly lower latency). Non-vhost-user first:

--
=== perf/passt_tcp
passt: throughput and latency
Throughput in Gbps, latency in µs, 4 threads at 3.6 GHz

MTU:                                      |  256B  |  576B  | 1280B  | 1500B  | 9000B  | 65520B |
                                          |--------|--------|--------|--------|--------|--------|
TCP throughput over IPv6: guest to host   |    -   |    -   |   6.3  |   6.8  |  18.4  |  21.4  |
TCP RR latency over IPv6: guest to host   |    -   |    -   |    -   |    -   |    -   |   52   |
TCP CRR latency over IPv6: guest to host  |    -   |    -   |    -   |    -   |    -   |  141   |
                                          |--------|--------|--------|--------|--------|--------|
TCP throughput over IPv4: guest to host   |   0.8  |   3.0  |   5.6  |   7.4  |  19.6  |  21.3  |
TCP RR latency over IPv4: guest to host   |    -   |    -   |    -   |    -   |    -   |   58   |
TCP CRR latency over IPv4: guest to host  |    -   |    -   |    -   |    -   |    -   |  132   |
                                          |--------|--------|--------|--------|--------|--------|
TCP throughput over IPv6: host to guest   |    -   |    -   |    -   |    -   |    -   |  18.0  |
TCP RR latency over IPv6: host to guest   |    -   |    -   |    -   |    -   |    -   |   50   |
TCP CRR latency over IPv6: host to guest  |    -   |    -   |    -   |    -   |    -   |  115   |
                                          |--------|--------|--------|--------|--------|--------|
TCP throughput over IPv4: host to guest   |    -   |    -   |    -   |    -   |    -   |  17.8  |
TCP RR latency over IPv4: host to guest   |    -   |    -   |    -   |    -   |    -   |   60   |
TCP CRR latency over IPv6: host to guest  |    -   |    -   |    -   |    -   |    -   |   94   |
                                          '--------'--------'--------'--------'--------'--------'
...passed.
=== perf/passt_udp
passt: throughput and latency
Throughput in Gbps, latency in µs, 2 threads at 3.6 GHz

pktlen:                                   |  256B  |  576B  | 1280B  | 1500B  | 9000B  | 65520B |
                                          |--------|--------|--------|--------|--------|--------|
UDP throughput over IPv6: guest to host   |    -   |    -   |   3.4  |   4.1  |  12.3  |  18.2  |
UDP RR latency over IPv6: guest to host   |    -   |    -   |    -   |    -   |    -   |   49   |
                                          |--------|--------|--------|--------|--------|--------|
UDP throughput over IPv4: guest to host   |   0.8  |   1.9  |   3.7  |   4.0  |  11.1  |  17.2  |
UDP RR latency over IPv4: guest to host   |    -   |    -   |    -   |    -   |    -   |   52   |
                                          |--------|--------|--------|--------|--------|--------|
UDP throughput over IPv6: host to guest   |    -   |    -   |   2.6  |   3.1  |   5.4  |  17.9  |
UDP RR latency over IPv6: host to guest   |    -   |    -   |    -   |    -   |    -   |   48   |
                                          |--------|--------|--------|--------|--------|--------|
UDP throughput over IPv4: host to guest   |   0.9  |   2.3  |   5.6  |   7.4  |  12.8  |  16.6  |
UDP RR latency over IPv4: host to guest   |    -   |    -   |    -   |    -   |    -   |   48   |
                                          '--------'--------'--------'--------'--------'--------'
...passed.
[...]

=== perf/passt_vu_tcp
passt: throughput and latency
Throughput in Gbps, latency in µs, 4-6 threads at 3.6 GHz

MTU:                                      |  256B  |  576B  | 1280B  | 1500B  | 9000B  | 65520B |
                                          |--------|--------|--------|--------|--------|--------|
TCP throughput over IPv6: guest to host   |    -   |    -   |   8.2  |  10.1  |  14.7  |  22.3  |
TCP RR latency over IPv6: guest to host   |    -   |    -   |    -   |    -   |    -   |   30   |
TCP CRR latency over IPv6: guest to host  |    -   |    -   |    -   |    -   |    -   |   88   |
                                          |--------|--------|--------|--------|--------|--------|
TCP throughput over IPv4: guest to host   |   1.2  |   5.3  |   9.2  |  10.1  |  18.5  |  23.7  |
TCP RR latency over IPv4: guest to host   |    -   |    -   |    -   |    -   |    -   |   31   |
TCP CRR latency over IPv4: guest to host  |    -   |    -   |    -   |    -   |    -   |   93   |
                                          |--------|--------|--------|--------|--------|--------|
TCP throughput over IPv6: host to guest   |    -   |    -   |    -   |    -   |    -   |  42.1  |
TCP RR latency over IPv6: host to guest   |    -   |    -   |    -   |    -   |    -   |   30   |
TCP CRR latency over IPv6: host to guest  |    -   |    -   |    -   |    -   |    -   |   88   |
                                          |--------|--------|--------|--------|--------|--------|
TCP throughput over IPv4: host to guest   |    -   |    -   |    -   |    -   |    -   |  48.8  |
TCP RR latency over IPv4: host to guest   |    -   |    -   |    -   |    -   |    -   |   35   |
TCP CRR latency over IPv6: host to guest  |    -   |    -   |    -   |    -   |    -   |   79   |
                                          '--------'--------'--------'--------'--------'--------'
...passed.
=== perf/passt_vu_udp
passt: throughput and latency
Throughput in Gbps, latency in µs, 2 threads at 3.6 GHz

pktlen:                                   |  256B  |  576B  | 1280B  | 1500B  | 9000B  | 65520B |
                                          |--------|--------|--------|--------|--------|--------|
UDP throughput over IPv6: guest to host   |    -   |    -   |   2.2  |   2.6  |  14.1  |  33.4  |
UDP RR latency over IPv6: guest to host   |    -   |    -   |    -   |    -   |    -   |   32   |
                                          |--------|--------|--------|--------|--------|--------|
UDP throughput over IPv4: guest to host   |   0.4  |   1.1  |   2.6  |   2.3  |  13.6  |  28.9  |
UDP RR latency over IPv4: guest to host   |    -   |    -   |    -   |    -   |    -   |   31   |
                                          |--------|--------|--------|--------|--------|--------|
UDP throughput over IPv6: host to guest   |    -   |    -   |   3.1  |   3.7  |  18.7  |  29.4  |
UDP RR latency over IPv6: host to guest   |    -   |    -   |    -   |    -   |    -   |   35   |
                                          |--------|--------|--------|--------|--------|--------|
UDP throughput over IPv4: host to guest   |   0.5  |   1.3  |   3.3  |   3.8  |  18.6  |  37.5  |
UDP RR latency over IPv4: host to guest   |    -   |    -   |    -   |    -   |    -   |   35   |
                                          '--------'--------'--------'--------'--------'--------'
...passed.
--
passt is CPU-bound only on host-to-guest tests. But there, iperf3 seems to actually use more CPU time than passt itself.

-- 
Stefano
On Wed, 9 Oct 2024 19:37:26 +0200
Stefano Brivio wrote:
On Wed, 9 Oct 2024 11:07:07 +0200 Laurent Vivier wrote:
This series of patches adds vhost-user support to passt and then allows passt to connect to the QEMU network backend using virtqueues rather than a socket.
With QEMU, rather than connecting with:
-netdev stream,id=s,server=off,addr.type=unix,addr.path=/tmp/passt_1.socket
we will use:
-chardev socket,id=chr0,path=/tmp/passt_1.socket -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0
The memory backend is needed to share data between passt and QEMU.
Performance comparison between "-netdev stream" and "-netdev vhost-user":
On my setup, with a few tweaks (don't ask me why... we should figure out eventually):
...I had a closer look with perf(1). For outbound traffic (I checked with IPv6), a reasonably expanded output (20 seconds of iperf3 with those parameters): -- Samples: 80K of event 'cycles', Event count (approx.): 81191130978 Children Self Command Shared Object Symbol - 89.20% 0.07% passt.avx2 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe ▒ 89.13% entry_SYSCALL_64_after_hwframe ◆ - do_syscall_64 ▒ - 69.37% __sys_recvmsg ▒ - 69.26% ___sys_recvmsg ▒ - 68.11% ____sys_recvmsg ▒ - 67.80% inet6_recvmsg ▒ - tcp_recvmsg ▒ - 67.56% tcp_recvmsg_locked ▒ - 64.14% skb_copy_datagram_iter ▒ - __skb_datagram_iter ▒ - 51.42% _copy_to_iter ▒ 48.74% copy_user_generic_string ▒ - 6.96% simple_copy_to_iter ▒ + __check_object_size ▒ + 1.09% __tcp_transmit_skb ▒ + 0.75% tcp_rcv_space_adjust ▒ + 0.61% __tcp_cleanup_rbuf ▒ + 1.01% copy_msghdr_from_user ▒ - 16.60% __x64_sys_recvfrom ▒ - __sys_recvfrom ▒ - 16.43% inet6_recvmsg ▒ - tcp_recvmsg ▒ - 14.27% tcp_recvmsg_locked ▒ - 12.67% __tcp_transmit_skb ▒ - 12.64% __ip_queue_xmit ▒ - 12.56% ip_finish_output2 ▒ - 12.51% __local_bh_enable_ip ▒ - do_softirq.part.0 ▒ - 12.50% __softirqentry_text_start ▒ - 12.49% net_rx_action ▒ - 12.47% __napi_poll ▒ - process_backlog ▒ - 12.41% __netif_receive_skb_one_core ▒ - 9.75% ip_local_deliver_finish ▒ - 9.52% ip_protocol_deliver_rcu ▒ + 9.28% tcp_v4_rcv ▒ - 2.02% ip_local_deliver ▒ - 1.92% nf_hook_slow ▒ + 1.84% nft_do_chain_ipv4 ▒ 0.70% __sk_mem_reduce_allocated ▒ + 2.10% release_sock ▒ + 1.16% __x64_sys_timerfd_settime ▒ + 0.56% ksys_write ▒ + 0.54% __x64_sys_epoll_wait ▒ + 89.13% 0.06% passt.avx2 [kernel.kallsyms] [k] do_syscall_64 ▒ + 84.23% 0.02% passt.avx2 [kernel.kallsyms] [k] inet6_recvmsg ▒ + 84.21% 0.04% passt.avx2 [kernel.kallsyms] [k] tcp_recvmsg ▒ + 81.84% 0.97% passt.avx2 [kernel.kallsyms] [k] tcp_recvmsg_locked ▒ + 74.96% 0.00% passt.avx2 [unknown] [k] 0000000000000000 ▒ + 69.78% 0.07% passt.avx2 libc.so.6 [.] __libc_recvmsg ▒ + 69.37% 0.02% passt.avx2 [kernel.kallsyms] [k] __sys_recvmsg ▒ + 69.26% 0.03% passt.avx2 [kernel.kallsyms] [k] ___sys_recvmsg ▒ + 68.11% 0.12% passt.avx2 [kernel.kallsyms] [k] ____sys_recvmsg ▒ + 64.14% 0.06% passt.avx2 [kernel.kallsyms] [k] skb_copy_datagram_iter ▒ + 64.08% 5.60% passt.avx2 [kernel.kallsyms] [k] __skb_datagram_iter ▒ + 51.44% 2.68% passt.avx2 [kernel.kallsyms] [k] _copy_to_iter ▒ + 49.16% 49.08% passt.avx2 [kernel.kallsyms] [k] copy_user_generic_string ◆ + 16.84% 0.00% passt.avx2 [unknown] [k] 0xffff000000000000 ▒ + 16.77% 0.02% passt.avx2 libc.so.6 [.] 
__libc_recv ▒ + 16.60% 0.00% passt.avx2 [kernel.kallsyms] [k] __x64_sys_recvfrom ▒ + 16.60% 0.07% passt.avx2 [kernel.kallsyms] [k] __sys_recvfrom ▒ + 13.81% 1.28% passt.avx2 [kernel.kallsyms] [k] __tcp_transmit_skb ▒ + 13.76% 0.06% passt.avx2 [kernel.kallsyms] [k] __ip_queue_xmit ▒ + 13.64% 0.10% passt.avx2 [kernel.kallsyms] [k] ip_finish_output2 ▒ + 13.62% 0.14% passt.avx2 [kernel.kallsyms] [k] __local_bh_enable_ip ▒ + 13.57% 0.01% passt.avx2 [kernel.kallsyms] [k] __softirqentry_text_start ▒ + 13.56% 0.01% passt.avx2 [kernel.kallsyms] [k] do_softirq.part.0 ▒ + 13.53% 0.01% passt.avx2 [kernel.kallsyms] [k] net_rx_action ▒ + 13.51% 0.00% passt.avx2 [kernel.kallsyms] [k] __napi_poll ▒ + 13.51% 0.04% passt.avx2 [kernel.kallsyms] [k] process_backlog ▒ + 13.45% 0.02% passt.avx2 [kernel.kallsyms] [k] __netif_receive_skb_one_core ▒ + 11.27% 0.04% passt.avx2 [kernel.kallsyms] [k] tcp_v4_do_rcv ▒ + 10.96% 0.06% passt.avx2 [kernel.kallsyms] [k] tcp_rcv_established ▒ + 10.74% 0.02% passt.avx2 [kernel.kallsyms] [k] ip_local_deliver_finish ▒ + 10.51% 0.04% passt.avx2 [kernel.kallsyms] [k] ip_protocol_deliver_rcu ▒ + 10.26% 0.17% passt.avx2 [kernel.kallsyms] [k] tcp_v4_rcv ▒ + 8.14% 0.01% passt.avx2 [kernel.kallsyms] [k] __tcp_push_pending_frames ▒ + 8.13% 0.73% passt.avx2 [kernel.kallsyms] [k] tcp_write_xmit ▒ + 6.96% 0.26% passt.avx2 [kernel.kallsyms] [k] simple_copy_to_iter ▒ + 6.79% 4.72% passt.avx2 [kernel.kallsyms] [k] __check_object_size ▒ + 3.33% 0.16% passt.avx2 [kernel.kallsyms] [k] nf_hook_slow ▒ + 3.09% 0.09% passt.avx2 [nf_tables] [k] nft_do_chain_ipv4 ▒ + 3.00% 2.28% passt.avx2 [nf_tables] [k] nft_do_chain ▒ + 2.85% 2.84% passt.avx2 passt.avx2 [.] vu_init_elem ▒ + 2.22% 0.02% passt.avx2 [kernel.kallsyms] [k] release_sock ▒ + 2.15% 0.02% passt.avx2 [kernel.kallsyms] [k] __release_sock ▒ + 2.04% 0.08% passt.avx2 [kernel.kallsyms] [k] ip_local_deliver ▒ + 1.80% 1.79% passt.avx2 [kernel.kallsyms] [k] __virt_addr_valid ▒ + 1.57% 0.03% passt.avx2 libc.so.6 [.] timerfd_settime ▒ 1.53% 1.53% passt.avx2 passt.avx2 [.] vu_queue_map_desc.isra.0 ▒ -- not much we can improve (and the throughput is anyway very close to iperf3 to iperf3 on host's loopback, ~50 Gbps vs. ~70): the bulk of it is copy_user_generic_string() reading from sockets into the queue and related bookkeeping. The only users of more than 1% of cycles are vu_init_elem() and vu_queue_map_desc(), perhaps we could try to speed those up... one day. 
Full perf output (you can load it with perf -i ...), if you're curious, at: https://passt.top/static/vu_tcp_ipv6_inbound.perf For outbound traffic (I tried with IPv4), which is much slower for some reason (~25 Gbps): -- Samples: 79K of event 'cycles', Event count (approx.): 73661070737 Children Self Command Shared Object Symbol - 91.00% 0.23% passt.avx2 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe ◆ 90.78% entry_SYSCALL_64_after_hwframe ▒ - do_syscall_64 ▒ - 78.75% __sys_sendmsg ▒ - 78.58% ___sys_sendmsg ▒ - 78.06% ____sys_sendmsg ▒ - sock_sendmsg ▒ - 77.58% tcp_sendmsg ▒ - 68.63% tcp_sendmsg_locked ▒ - 26.24% sk_page_frag_refill ▒ - skb_page_frag_refill ▒ - 25.87% __alloc_pages ▒ - 25.61% get_page_from_freelist ▒ 24.51% clear_page_rep ▒ - 23.08% _copy_from_iter ▒ 22.88% copy_user_generic_string ▒ - 8.77% tcp_write_xmit ▒ - 8.19% __tcp_transmit_skb ▒ - 7.86% __ip_queue_xmit ▒ - 7.13% ip_finish_output2 ▒ - 6.65% __local_bh_enable_ip ▒ - 6.60% do_softirq.part.0 ▒ - 6.51% __softirqentry_text_start ▒ - 6.40% net_rx_action ▒ - 5.43% __napi_poll ▒ + process_backlog ▒ 0.50% napi_consume_skb ▒ + 5.39% __tcp_push_pending_frames ▒ + 2.03% tcp_stream_alloc_skb ▒ + 1.48% tcp_wmem_schedule ▒ + 8.58% release_sock ▒ - 4.57% ksys_write ▒ - 4.41% vfs_write ▒ - 3.96% eventfd_write ▒ - 3.46% __wake_up_common ▒ - irqfd_wakeup ▒ - 3.15% kvm_arch_set_irq_inatomic ▒ - 3.11% kvm_irq_delivery_to_apic_fast ▒ - 2.01% __apic_accept_irq ▒ 0.93% svm_complete_interrupt_delivery ▒ + 3.91% __x64_sys_epoll_wait ▒ + 1.20% __x64_sys_getsockopt ▒ + 0.78% syscall_trace_enter.constprop.0 ▒ 0.71% syscall_exit_to_user_mode ▒ + 0.61% ksys_read ▒ -- ...there are no users of more than 1% cycles in passt itself. The bulk of it is sendmsg() as expected, one notable thing is that the kernel spends an awful amount of cycles zeroing pages so that we can fill them. I looked into that "issue" a long time ago, https://github.com/netoptimizer/prototype-kernel/pull/39/commits/2c8223c30d7... ...maybe I can try out a kernel with a version of that as clear_page_rep() and see what happens. Anyway, same here, I don't see anything we can really improve in passt. Full output at: https://passt.top/static/vu_tcp_ipv4_outbound.perf -- Stefano
On 10/10/2024 09:08, Stefano Brivio wrote:
The only users of more than 1% of cycles are vu_init_elem() and vu_queue_map_desc(), perhaps we could try to speed those up... one day.
I think vu_init_elem() could be executed only once at startup. I need to check how to do that. It's an easy improvement; it's on my list.

Thanks,
Laurent
On Thu, 10 Oct 2024 09:45:36 +0200
Laurent Vivier wrote:
On 10/10/2024 09:08, Stefano Brivio wrote:
For outbound traffic (I tried with IPv4), which is much slower for some reason (~25 Gbps):
Perhaps because we can't bypass the IPv4 header checksum, whereas there is no IPv6 header checksum?
Ah, no, sorry, I meant slower than inbound traffic. For outbound traffic, I get very similar numbers between IPv4 and IPv6. And judging from the perf output, it's all about copying and zeroing and stuff. As far as I know the kernel doesn't actually compute the checksum there, as delivery is local.

-- 
Stefano
On Thu, 10 Oct 2024 09:08:01 +0200
Stefano Brivio wrote:
For outbound traffic (I tried with IPv4), which is much slower for some reason (~25 Gbps):
-- Samples: 79K of event 'cycles', Event count (approx.): 73661070737 Children Self Command Shared Object Symbol - 91.00% 0.23% passt.avx2 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe ◆ 90.78% entry_SYSCALL_64_after_hwframe ▒ - do_syscall_64 ▒ - 78.75% __sys_sendmsg ▒ - 78.58% ___sys_sendmsg ▒ - 78.06% ____sys_sendmsg ▒ - sock_sendmsg ▒ - 77.58% tcp_sendmsg ▒ - 68.63% tcp_sendmsg_locked ▒ - 26.24% sk_page_frag_refill ▒ - skb_page_frag_refill ▒ - 25.87% __alloc_pages ▒ - 25.61% get_page_from_freelist ▒ 24.51% clear_page_rep ▒ - 23.08% _copy_from_iter ▒ 22.88% copy_user_generic_string ▒ - 8.77% tcp_write_xmit ▒ - 8.19% __tcp_transmit_skb ▒ - 7.86% __ip_queue_xmit ▒ - 7.13% ip_finish_output2 ▒ - 6.65% __local_bh_enable_ip ▒ - 6.60% do_softirq.part.0 ▒ - 6.51% __softirqentry_text_start ▒ - 6.40% net_rx_action ▒ - 5.43% __napi_poll ▒ + process_backlog ▒ 0.50% napi_consume_skb ▒ + 5.39% __tcp_push_pending_frames ▒ + 2.03% tcp_stream_alloc_skb ▒ + 1.48% tcp_wmem_schedule ▒ + 8.58% release_sock ▒ - 4.57% ksys_write ▒ - 4.41% vfs_write ▒ - 3.96% eventfd_write ▒ - 3.46% __wake_up_common ▒ - irqfd_wakeup ▒ - 3.15% kvm_arch_set_irq_inatomic ▒ - 3.11% kvm_irq_delivery_to_apic_fast ▒ - 2.01% __apic_accept_irq ▒ 0.93% svm_complete_interrupt_delivery ▒ + 3.91% __x64_sys_epoll_wait ▒ + 1.20% __x64_sys_getsockopt ▒ + 0.78% syscall_trace_enter.constprop.0 ▒ 0.71% syscall_exit_to_user_mode ▒ + 0.61% ksys_read ▒ --
...there are no users of more than 1% cycles in passt itself. The bulk of it is sendmsg() as expected, one notable thing is that the kernel spends an awful amount of cycles zeroing pages so that we can fill them. I looked into that "issue" a long time ago,
https://github.com/netoptimizer/prototype-kernel/pull/39/commits/2c8223c30d7...
...maybe I can try out a kernel with a version of that as clear_page_rep() and see what happens.
...so I tried, it looks like this, but it doesn't boot for some reason:

--
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index f3d257c45225..4079012ce765 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -44,6 +44,17 @@ void clear_page_orig(void *page);
 void clear_page_rep(void *page);
 void clear_page_erms(void *page);
 
+#define MEMSET_AVX2_ZERO(reg) \
+	asm volatile("vpxor %ymm" #reg ", %ymm" #reg ", %ymm" #reg)
+#define MEMSET_AVX2_STORE(loc, reg) \
+	asm volatile("vmovdqa %%ymm" #reg ", %0" : "=m" (loc))
+
+#define YMM_BYTES	(256 / 8)
+#define BYTES_TO_YMM(x)	((x) / YMM_BYTES)
+extern void kernel_fpu_begin_mask(unsigned int kfpu_mask);
+extern void kernel_fpu_end(void);
+extern bool irq_fpu_usable(void);
+
 static inline void clear_page(void *page)
 {
 	/*
@@ -51,6 +62,18 @@ static inline void clear_page(void *page)
 	 * below clobbers @page, so we perform unpoisoning before it.
 	 */
 	kmsan_unpoison_memory(page, PAGE_SIZE);
+
+	if (irq_fpu_usable()) {
+		int i;
+
+		kernel_fpu_begin();
+		MEMSET_AVX2_ZERO(0);
+		for (i = 0; i < BYTES_TO_YMM(PAGE_SIZE); i++)
+			MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * i], 0);
+		kernel_fpu_end();
+		return;
+	}
+
 	alternative_call_2(clear_page_orig,
 			   clear_page_rep, X86_FEATURE_REP_GOOD,
 			   clear_page_erms, X86_FEATURE_ERMS,
--

...I'm not sure if that's something we can do at early boot, so perhaps I
should add something specific in skb_page_frag_refill() instead. But
that's for another day/week/month...

-- 
Stefano
On Fri, 11 Oct 2024 20:07:30 +0200
Stefano Brivio wrote:
[...]
...maybe I can try out a kernel with a version of that as clear_page_rep() and see what happens.
...so I tried, it looks like this, but it doesn't boot for some reason:
I played with this a bit more. If I select the AVX2-based page clearing
with:
if (system_state >= SYSTEM_RUNNING && irq_fpu_usable()) {
instead of just irq_fpu_usable(), the kernel boots, and everything
works (also after init).
I tested this in a VM where I can't really get a baseline throughput
that's comparable to the host: iperf3 to iperf3 via loopback gives me
about 50 Gbps (instead of 70 as I get on the host), and the same iperf3
vhost-user test with outbound traffic from the nested, L2 guest yields
about 20 Gbps (instead of 25).
1. The VMOVDQA version I was originally trying looks like this:
Samples: 39K of event 'cycles:P', Event count (approx.): 34909065261
Children Self Command Shared Object Symbol
- 94.32% 0.87% passt.avx2 [kernel.kallsyms] [k] entry_SYSCALL_64 ◆
- 93.45% entry_SYSCALL_64 ▒
- 93.29% do_syscall_64 ▒
- 79.66% __sys_sendmsg ▒
- 79.52% ___sys_sendmsg ▒
- 78.88% ____sys_sendmsg ▒
- 78.46% tcp_sendmsg ▒
- 66.75% tcp_sendmsg_locked ▒
- 25.81% sk_page_frag_refill ▒
- 25.73% skb_page_frag_refill ▒
- 25.34% alloc_pages_mpol_noprof ▒
- 25.17% __alloc_pages_noprof ▒
- 24.91% get_page_from_freelist ▒
- 23.38% kernel_init_pages ▒
0.88% kernel_fpu_begin_mask ▒
- 15.37% tcp_write_xmit ▒
- 14.14% __tcp_transmit_skb ▒
- 13.31% __ip_queue_xmit ▒
- 11.06% ip_finish_output2 ▒
- 10.89% __dev_queue_xmit ▒
- 10.00% __local_bh_enable_ip ▒
- do_softirq.part.0 ▒
- handle_softirqs ▒
- 9.86% net_rx_action ▒
- 7.95% __napi_poll ▒
+ process_backlog ▒
+ 1.17% napi_consume_skb ▒
+ 0.61% dev_hard_start_xmit ▒
- 1.56% ip_local_out ▒
- __ip_local_out ▒
- 1.29% nf_hook_slow ▒
1.00% nf_conntrack_in ▒
+ 14.60% _copy_from_iter ▒
+ 3.97% __tcp_push_pending_frames ▒
+ 2.42% tcp_stream_alloc_skb ▒
+ 2.08% tcp_wmem_schedule ▒
0.64% __check_object_size ▒
+ 11.08% release_sock ▒
+ 4.48% ksys_write ▒
+ 3.57% __x64_sys_epoll_wait ▒
+ 2.26% __x64_sys_getsockopt ▒
1.09% syscall_exit_to_user_mode ▒
+ 0.90% ksys_read ▒
0.64% syscall_trace_enter ▒
...that's 24.91% clock cycles spent on get_page_from_freelist() instead of
25.61% I was getting with the original clear_page() implementation. Checking
the annotated output, it doesn't look very... superscalar:
Samples: 39K of event 'cycles:P', 4000 Hz, Event count (approx.): 34909065261
Percent│ { ▒
│ return page_to_virt(page); ▒
│32: mov %r12,%rbx ▒
│ sub vmemmap_base,%rbx ▒
0.32 │ sar $0x6,%rbx ▒
0.02 │ shl $0xc,%rbx ▒
0.02 │ add page_offset_base,%rbx ▒
│ clear_page(): ▒
│ if (system_state >= SYSTEM_RUNNING && irq_fpu_usable()) { ▒
0.05 │ cmpl $0x2,system_state ▒
0.47 │ jbe 21 ▒
0.01 │ call irq_fpu_usable ▒
0.20 │ test %al,%al ▒
0.56 │ je 21 ▒
│ kernel_fpu_begin_mask(0); ▒
0.07 │ xor %edi,%edi ▒
│ call kernel_fpu_begin_mask ▒
│ MEMSET_AVX2_ZERO(0); ▒
0.06 │ vpxor %ymm0,%ymm0,%ymm0 ▒
│ for (i = 0; i < BYTES_TO_YMM(PAGE_SIZE); i++) ▒
0.58 │ lea 0x1000(%rbx),%rax ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * i], 0); ▒
4.96 │6f: vmovdqa %ymm0,(%rbx) ▒
71.38 │ vmovdqa %ymm0,0x20(%rbx) ▒
│ for (i = 0; i < BYTES_TO_YMM(PAGE_SIZE); i++) ▒
2.81 │ add $0x40,%rbx ▒
0.06 │ cmp %rbx,%rax ▒
17.22 │ jne 6f ▒
│ kernel_fpu_end(); ▒
│ call kernel_fpu_end ▒
│ kernel_init_pages(): ▒
0.44 │ add $0x40,%r12 ▒
0.55 │ cmp %r12,%rbp ▒
0.07 │ jne 32 ▒
│ clear_highpage_kasan_tagged(page + i); ▒
│ kasan_enable_current(); ▒
│ } ◆
│8f: pop %rbx ▒
0.11 │ pop %rbp ▒
│ pop %r12 ▒
0.01 │ jmp __x86_return_thunk ▒
│98: jmp __x86_return_thunk ▒
2. Let's try to unroll it:
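(The unrolled source itself isn't quoted in the thread; as a hedged
sketch reusing the MEMSET_AVX2_* macros from the patch above, the unroll
could be written as below. The annotated disassembly further down
suggests the build actually ends up fully unrolled, 128 stores in a row.)

--
/* Emit eight consecutive 32-byte stores starting at byte offset (off) */
#define MEMSET_AVX2_STORE8(page, off) do {				\
	MEMSET_AVX2_STORE(((unsigned char *)(page))[(off) + YMM_BYTES * 0], 0); \
	MEMSET_AVX2_STORE(((unsigned char *)(page))[(off) + YMM_BYTES * 1], 0); \
	MEMSET_AVX2_STORE(((unsigned char *)(page))[(off) + YMM_BYTES * 2], 0); \
	MEMSET_AVX2_STORE(((unsigned char *)(page))[(off) + YMM_BYTES * 3], 0); \
	MEMSET_AVX2_STORE(((unsigned char *)(page))[(off) + YMM_BYTES * 4], 0); \
	MEMSET_AVX2_STORE(((unsigned char *)(page))[(off) + YMM_BYTES * 5], 0); \
	MEMSET_AVX2_STORE(((unsigned char *)(page))[(off) + YMM_BYTES * 6], 0); \
	MEMSET_AVX2_STORE(((unsigned char *)(page))[(off) + YMM_BYTES * 7], 0); \
} while (0)

	/* ...replacing the per-32-byte loop in clear_page() above: */
	MEMSET_AVX2_ZERO(0);
	for (i = 0; i < BYTES_TO_YMM(PAGE_SIZE); i += 8)
		MEMSET_AVX2_STORE8(page, YMM_BYTES * i);
--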
Samples: 39K of event 'cycles:P', Event count (approx.): 33598124504
Children Self Command Shared Object Symbol
+ 92.49% 0.33% passt.avx2 [kernel.kallsyms] [k] entry_SYSCALL_64_a◆
- 92.01% 0.47% passt.avx2 [kernel.kallsyms] [k] do_syscall_64 ▒
- 91.54% do_syscall_64 ▒
- 75.04% __sys_sendmsg ▒
- 74.85% ___sys_sendmsg ▒
- 74.26% ____sys_sendmsg ▒
- 73.68% tcp_sendmsg ▒
- 62.69% tcp_sendmsg_locked ▒
- 22.26% sk_page_frag_refill ▒
- 22.14% skb_page_frag_refill ▒
- 21.74% alloc_pages_mpol_noprof ▒
- 21.52% __alloc_pages_noprof ▒
- 21.25% get_page_from_freelist ▒
- 20.04% prep_new_page ▒
- 19.57% clear_page ▒
0.55% kernel_fpu_begin_mask ▒
+ 15.04% tcp_write_xmit ▒
+ 13.77% _copy_from_iter ▒
+ 5.12% __tcp_push_pending_frames ▒
+ 2.05% tcp_wmem_schedule ▒
+ 1.86% tcp_stream_alloc_skb ▒
0.73% __check_object_size ▒
+ 10.15% release_sock ▒
+ 0.62% lock_sock_nested ▒
+ 5.63% ksys_write ▒
+ 4.65% __x64_sys_epoll_wait ▒
+ 2.61% __x64_sys_getsockopt ▒
1.21% syscall_exit_to_user_mode ▒
+ 1.16% ksys_read ▒
+ 0.84% syscall_trace_enter ▒
annotated:
Samples: 39K of event 'cycles:P', 4000 Hz, Event count (approx.): 33598124504
clear_page /proc/kcore [Percent: local period]
Percent│
│ ffffffffb5198480 <load0>:
0.06 │ push %rbx
0.27 │ cmpl $0x2,0x1ab243c(%rip)
0.07 │ mov %rdi,%rbx
│ ja 1b
│ d: mov %rbx,%rdi
│ call clear_page_rep
│ pop %rbx
│ jmp srso_return_thunk
0.03 │ 1b: call irq_fpu_usable
0.14 │ test %al,%al
0.64 │ je d
0.04 │ xor %edi,%edi
│ call kernel_fpu_begin_mask
0.05 │ vpxor %ymm0,%ymm0,%ymm0
0.80 │ vmovdqa %ymm0,(%rbx)
1.12 │ vmovdqa %ymm0,0x20(%rbx)
0.06 │ vmovdqa %ymm0,0x40(%rbx)
1.39 │ vmovdqa %ymm0,0x60(%rbx)
0.24 │ vmovdqa %ymm0,0x80(%rbx)
0.58 │ vmovdqa %ymm0,0xa0(%rbx)
0.21 │ vmovdqa %ymm0,0xc0(%rbx)
0.77 │ vmovdqa %ymm0,0xe0(%rbx)
0.38 │ vmovdqa %ymm0,0x100(%rbx)
7.60 │ vmovdqa %ymm0,0x120(%rbx)
0.26 │ vmovdqa %ymm0,0x140(%rbx)
1.38 │ vmovdqa %ymm0,0x160(%rbx)
0.42 │ vmovdqa %ymm0,0x180(%rbx)
1.25 │ vmovdqa %ymm0,0x1a0(%rbx)
0.26 │ vmovdqa %ymm0,0x1c0(%rbx)
0.73 │ vmovdqa %ymm0,0x1e0(%rbx)
0.33 │ vmovdqa %ymm0,0x200(%rbx)
1.72 │ vmovdqa %ymm0,0x220(%rbx)
0.16 │ vmovdqa %ymm0,0x240(%rbx)
0.61 │ vmovdqa %ymm0,0x260(%rbx)
0.19 │ vmovdqa %ymm0,0x280(%rbx)
0.68 │ vmovdqa %ymm0,0x2a0(%rbx)
0.22 │ vmovdqa %ymm0,0x2c0(%rbx)
0.66 │ vmovdqa %ymm0,0x2e0(%rbx)
0.50 │ vmovdqa %ymm0,0x300(%rbx)
0.67 │ vmovdqa %ymm0,0x320(%rbx)
0.29 │ vmovdqa %ymm0,0x340(%rbx)
0.31 │ vmovdqa %ymm0,0x360(%rbx)
0.14 │ vmovdqa %ymm0,0x380(%rbx)
0.55 │ vmovdqa %ymm0,0x3a0(%rbx)
0.35 │ vmovdqa %ymm0,0x3c0(%rbx)
0.82 │ vmovdqa %ymm0,0x3e0(%rbx)
0.25 │ vmovdqa %ymm0,0x400(%rbx)
0.49 │ vmovdqa %ymm0,0x420(%rbx) ▒
0.18 │ vmovdqa %ymm0,0x440(%rbx) ▒
1.05 │ vmovdqa %ymm0,0x460(%rbx) ▒
0.08 │ vmovdqa %ymm0,0x480(%rbx) ▒
2.22 │ vmovdqa %ymm0,0x4a0(%rbx) ▒
0.20 │ vmovdqa %ymm0,0x4c0(%rbx) ▒
2.33 │ vmovdqa %ymm0,0x4e0(%rbx) ▒
0.03 │ vmovdqa %ymm0,0x500(%rbx) ▒
2.87 │ vmovdqa %ymm0,0x520(%rbx) ▒
0.08 │ vmovdqa %ymm0,0x540(%rbx) ▒
1.60 │ vmovdqa %ymm0,0x560(%rbx) ▒
0.01 │ vmovdqa %ymm0,0x580(%rbx) ▒
7.03 │ vmovdqa %ymm0,0x5a0(%rbx) ▒
0.42 │ vmovdqa %ymm0,0x5c0(%rbx) ▒
2.74 │ vmovdqa %ymm0,0x5e0(%rbx) ▒
0.69 │ vmovdqa %ymm0,0x600(%rbx) ▒
2.34 │ vmovdqa %ymm0,0x620(%rbx) ▒
0.37 │ vmovdqa %ymm0,0x640(%rbx) ▒
1.21 │ vmovdqa %ymm0,0x660(%rbx) ▒
0.22 │ vmovdqa %ymm0,0x680(%rbx) ▒
1.16 │ vmovdqa %ymm0,0x6a0(%rbx) ▒
0.29 │ vmovdqa %ymm0,0x6c0(%rbx) ▒
0.98 │ vmovdqa %ymm0,0x6e0(%rbx) ▒
0.19 │ vmovdqa %ymm0,0x700(%rbx) ▒
0.81 │ vmovdqa %ymm0,0x720(%rbx) ▒
0.47 │ vmovdqa %ymm0,0x740(%rbx) ▒
0.69 │ vmovdqa %ymm0,0x760(%rbx) ▒
0.23 │ vmovdqa %ymm0,0x780(%rbx) ▒
0.68 │ vmovdqa %ymm0,0x7a0(%rbx) ▒
0.30 │ vmovdqa %ymm0,0x7c0(%rbx) ▒
0.68 │ vmovdqa %ymm0,0x7e0(%rbx) ▒
0.25 │ vmovdqa %ymm0,0x800(%rbx) ◆
0.58 │ vmovdqa %ymm0,0x820(%rbx) ▒
0.19 │ vmovdqa %ymm0,0x840(%rbx) ▒
0.83 │ vmovdqa %ymm0,0x860(%rbx) ▒
0.27 │ vmovdqa %ymm0,0x880(%rbx) ▒
1.01 │ vmovdqa %ymm0,0x8a0(%rbx) ▒
0.16 │ vmovdqa %ymm0,0x8c0(%rbx) ▒
0.89 │ vmovdqa %ymm0,0x8e0(%rbx) ▒
0.24 │ vmovdqa %ymm0,0x900(%rbx) ▒
0.98 │ vmovdqa %ymm0,0x920(%rbx) ▒
0.28 │ vmovdqa %ymm0,0x940(%rbx) ▒
0.86 │ vmovdqa %ymm0,0x960(%rbx) ▒
0.23 │ vmovdqa %ymm0,0x980(%rbx) ▒
1.19 │ vmovdqa %ymm0,0x9a0(%rbx) ▒
0.28 │ vmovdqa %ymm0,0x9c0(%rbx) ▒
1.04 │ vmovdqa %ymm0,0x9e0(%rbx) ▒
0.33 │ vmovdqa %ymm0,0xa00(%rbx) ▒
0.90 │ vmovdqa %ymm0,0xa20(%rbx) ▒
0.35 │ vmovdqa %ymm0,0xa40(%rbx) ▒
0.87 │ vmovdqa %ymm0,0xa60(%rbx) ▒
0.25 │ vmovdqa %ymm0,0xa80(%rbx) ▒
0.89 │ vmovdqa %ymm0,0xaa0(%rbx) ▒
0.28 │ vmovdqa %ymm0,0xac0(%rbx) ▒
0.92 │ vmovdqa %ymm0,0xae0(%rbx) ▒
0.23 │ vmovdqa %ymm0,0xb00(%rbx) ▒
1.39 │ vmovdqa %ymm0,0xb20(%rbx) ▒
0.29 │ vmovdqa %ymm0,0xb40(%rbx) ▒
1.15 │ vmovdqa %ymm0,0xb60(%rbx) ▒
0.26 │ vmovdqa %ymm0,0xb80(%rbx) ▒
1.33 │ vmovdqa %ymm0,0xba0(%rbx) ▒
0.29 │ vmovdqa %ymm0,0xbc0(%rbx) ▒
1.05 │ vmovdqa %ymm0,0xbe0(%rbx) ▒
0.25 │ vmovdqa %ymm0,0xc00(%rbx) ▒
0.89 │ vmovdqa %ymm0,0xc20(%rbx) ▒
0.34 │ vmovdqa %ymm0,0xc40(%rbx) ▒
0.78 │ vmovdqa %ymm0,0xc60(%rbx) ▒
0.40 │ vmovdqa %ymm0,0xc80(%rbx) ▒
0.99 │ vmovdqa %ymm0,0xca0(%rbx) ▒
0.44 │ vmovdqa %ymm0,0xcc0(%rbx) ▒
1.06 │ vmovdqa %ymm0,0xce0(%rbx) ▒
0.35 │ vmovdqa %ymm0,0xd00(%rbx) ▒
0.85 │ vmovdqa %ymm0,0xd20(%rbx) ▒
0.46 │ vmovdqa %ymm0,0xd40(%rbx) ▒
0.88 │ vmovdqa %ymm0,0xd60(%rbx) ▒
0.38 │ vmovdqa %ymm0,0xd80(%rbx) ▒
0.82 │ vmovdqa %ymm0,0xda0(%rbx) ▒
0.40 │ vmovdqa %ymm0,0xdc0(%rbx) ▒
0.98 │ vmovdqa %ymm0,0xde0(%rbx) ▒
0.27 │ vmovdqa %ymm0,0xe00(%rbx) ▒
1.10 │ vmovdqa %ymm0,0xe20(%rbx) ▒
0.25 │ vmovdqa %ymm0,0xe40(%rbx) ▒
0.89 │ vmovdqa %ymm0,0xe60(%rbx) ▒
0.32 │ vmovdqa %ymm0,0xe80(%rbx) ▒
0.87 │ vmovdqa %ymm0,0xea0(%rbx) ▒
0.22 │ vmovdqa %ymm0,0xec0(%rbx) ▒
0.94 │ vmovdqa %ymm0,0xee0(%rbx) ▒
0.27 │ vmovdqa %ymm0,0xf00(%rbx) ▒
0.90 │ vmovdqa %ymm0,0xf20(%rbx) ▒
0.28 │ vmovdqa %ymm0,0xf40(%rbx) ▒
0.79 │ vmovdqa %ymm0,0xf60(%rbx) ▒
0.31 │ vmovdqa %ymm0,0xf80(%rbx) ▒
1.11 │ vmovdqa %ymm0,0xfa0(%rbx) ▒
0.25 │ vmovdqa %ymm0,0xfc0(%rbx) ▒
0.99 │ vmovdqa %ymm0,0xfe0(%rbx) ▒
0.10 │ pop %rbx ▒
│ jmp 0xffffffffb4e4b050 ▒
...that looks like progress: we now spend 21.25% of the clock cycles on
get_page_from_freelist() (non-AVX: 25.61%). But still, there seem to be
(somewhat unexpected) stalls. For example, after 8 VMOVDQA instructions:
7.60 │ vmovdqa %ymm0,0x120(%rbx)
we have one where we spend/wait a long time, and there are more later.
3. ...what if we use a non-temporal hint, that is, if we clear the page
without making it cache hot ("stream" instead of "store")?
That's vmovntdq m256, ymm ("nt" meaning non-temporal). It's not vmovntdqa
(where "a" stands for "aligned"), as one could expect from the vmovdqa above,
because there's no unaligned version.
The only vmovntdq_a_ instruction is vmovntdqa ymm, m256 (memory to register,
"stream load"), because there's an unaligned equivalent in that case.
Anyway, perf output:
Samples: 39K of event 'cycles:P', Event count (approx.): 33890710610
Children Self Command Shared Object Symbol
- 92.62% 0.88% passt.avx2 [kernel.vmlinux] [k] entry_SYSCALL◆
- 91.74% entry_SYSCALL_64 ▒
- 91.60% do_syscall_64 ▒
- 75.05% __sys_sendmsg ▒
- 74.88% ___sys_sendmsg ▒
- 74.22% ____sys_sendmsg ▒
- 73.65% tcp_sendmsg ▒
- 61.71% tcp_sendmsg_locked ▒
- 24.82% _copy_from_iter ▒
24.40% rep_movs_alternative ▒
- 14.69% sk_page_frag_refill ▒
- 14.57% skb_page_frag_refill ▒
- 14.19% alloc_pages_mpol_noprof ▒
- 14.03% __alloc_pages_noprof ▒
- 13.77% get_page_from_freelist ▒
- 12.56% prep_new_page ▒
- 12.19% clear_page ▒
0.68% kernel_fpu_begin_mask ▒
- 11.12% tcp_write_xmit ▒
- 10.17% __tcp_transmit_skb ▒
- 9.62% __ip_queue_xmit ▒
- 8.08% ip_finish_output2 ▒
- 7.96% __dev_queue_xmit ▒
- 7.26% __local_bh_enable_ip ▒
- 7.24% do_softirq.part.0 ▒
- handle_softirqs ▒
- net_rx_action ▒
+ 5.80% __napi_poll ▒
+ 0.87% napi_consume_skb ▒
- 1.06% ip_local_out ▒
- 1.05% __ip_local_out ▒
- 0.90% nf_hook_slow ▒
0.66% nf_conntrack_in ▒
+ 4.22% __tcp_push_pending_frames ▒
+ 2.51% tcp_wmem_schedule ▒
+ 1.99% tcp_stream_alloc_skb ▒
0.59% __check_object_size ▒
+ 11.21% release_sock ▒
0.52% lock_sock_nested ▒
+ 5.32% ksys_write ▒
+ 4.75% __x64_sys_epoll_wait ▒
+ 2.45% __x64_sys_getsockopt ▒
1.29% syscall_exit_to_user_mode ▒
+ 1.25% ksys_read ▒
+ 0.70% syscall_trace_enter ▒
...finally we cut down significantly on cycles spent to clear pages, with
get_page_from_freelist() taking 13.77% of the cycles instead of 25.61%.
That's about half the overhead.
This makes _copy_from_iter() the biggest consumer of cycles under
tcp_sendmsg_locked(), which is what I expected. Does this mean that we
increase the overhead there because we're increasing the amount of cache
misses there, or are we simply more efficient? I'm not sure yet.
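For reference, a minimal sketch of what the non-temporal variant of the
store macro could look like (an assumption mirroring MEMSET_AVX2_STORE
from the patch earlier in the thread; the actual change isn't quoted).
Non-temporal stores are weakly ordered, so the clearing loop would
normally be followed by an sfence:

--
/* Non-temporal ("streaming") 32-byte store: writes around the cache */
#define MEMSET_AVX2_STREAM(loc, reg) \
	asm volatile("vmovntdq %%ymm" #reg ", %0" : "=m" (loc))

/* ...used in place of MEMSET_AVX2_STORE in clear_page(), e.g.:
 *
 *	MEMSET_AVX2_ZERO(0);
 *	for (i = 0; i < BYTES_TO_YMM(PAGE_SIZE); i++)
 *		MEMSET_AVX2_STREAM(((unsigned char *)page)[YMM_BYTES * i], 0);
 *	asm volatile("sfence" ::: "memory");
 */
--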
For completeness, annotated version of clear_page():
│ if (system_state >= SYSTEM_RUNNING && irq_fpu_usable()) {
0.09 │ 1b: call irq_fpu_usable
0.11 │ test %al,%al
0.51 │ je d
│ kernel_fpu_begin_mask(0);
0.16 │ xor %edi,%edi
│ call kernel_fpu_begin_mask
│ MEMSET_AVX2_ZERO(0);
0.05 │ vpxor %ymm0,%ymm0,%ymm0
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x00], 0);
0.79 │ vmovntdq %ymm0,(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x01], 0);
2.46 │ vmovntdq %ymm0,0x20(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x02], 0);
0.07 │ vmovntdq %ymm0,0x40(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x03], 0);
1.35 │ vmovntdq %ymm0,0x60(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x04], 0);
0.18 │ vmovntdq %ymm0,0x80(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x05], 0);
1.40 │ vmovntdq %ymm0,0xa0(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x06], 0);
0.11 │ vmovntdq %ymm0,0xc0(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x07], 0);
0.81 │ vmovntdq %ymm0,0xe0(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x08], 0);
0.07 │ vmovntdq %ymm0,0x100(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x09], 0);
1.25 │ vmovntdq %ymm0,0x120(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x0a], 0);
0.08 │ vmovntdq %ymm0,0x140(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x0b], 0);
1.36 │ vmovntdq %ymm0,0x160(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x0c], 0);
0.11 │ vmovntdq %ymm0,0x180(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x0d], 0);
1.73 │ vmovntdq %ymm0,0x1a0(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x0e], 0);
0.09 │ vmovntdq %ymm0,0x1c0(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x0f], 0);
0.97 │ vmovntdq %ymm0,0x1e0(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x10], 0);
0.07 │ vmovntdq %ymm0,0x200(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x11], 0);
1.25 │ vmovntdq %ymm0,0x220(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x12], 0);
0.14 │ vmovntdq %ymm0,0x240(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x13], 0);
0.79 │ vmovntdq %ymm0,0x260(%rbx)
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x14], 0); ▒
0.09 │ vmovntdq %ymm0,0x280(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x15], 0); ▒
1.19 │ vmovntdq %ymm0,0x2a0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x16], 0); ▒
0.07 │ vmovntdq %ymm0,0x2c0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x17], 0); ▒
1.45 │ vmovntdq %ymm0,0x2e0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x18], 0); ▒
│ vmovntdq %ymm0,0x300(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x19], 0); ▒
1.45 │ vmovntdq %ymm0,0x320(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x1a], 0); ▒
0.05 │ vmovntdq %ymm0,0x340(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x1b], 0); ▒
1.49 │ vmovntdq %ymm0,0x360(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x1c], 0); ▒
0.14 │ vmovntdq %ymm0,0x380(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x1d], 0); ▒
1.34 │ vmovntdq %ymm0,0x3a0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x1e], 0); ▒
0.09 │ vmovntdq %ymm0,0x3c0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x1f], 0); ▒
1.69 │ vmovntdq %ymm0,0x3e0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x20], 0); ▒
0.16 │ vmovntdq %ymm0,0x400(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x21], 0); ▒
1.15 │ vmovntdq %ymm0,0x420(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x22], 0); ▒
0.13 │ vmovntdq %ymm0,0x440(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x23], 0); ▒
1.36 │ vmovntdq %ymm0,0x460(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x24], 0); ▒
0.07 │ vmovntdq %ymm0,0x480(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x25], 0); ▒
1.01 │ vmovntdq %ymm0,0x4a0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x26], 0); ▒
0.09 │ vmovntdq %ymm0,0x4c0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x27], 0); ◆
1.53 │ vmovntdq %ymm0,0x4e0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x28], 0); ▒
0.12 │ vmovntdq %ymm0,0x500(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x29], 0); ▒
1.45 │ vmovntdq %ymm0,0x520(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x2a], 0); ▒
0.13 │ vmovntdq %ymm0,0x540(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x2b], 0); ▒
0.97 │ vmovntdq %ymm0,0x560(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x2c], 0); ▒
0.12 │ vmovntdq %ymm0,0x580(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x2d], 0); ▒
1.21 │ vmovntdq %ymm0,0x5a0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x2e], 0); ▒
0.15 │ vmovntdq %ymm0,0x5c0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x2f], 0); ▒
1.42 │ vmovntdq %ymm0,0x5e0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x30], 0); ▒
0.19 │ vmovntdq %ymm0,0x600(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x31], 0); ▒
1.12 │ vmovntdq %ymm0,0x620(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x32], 0); ▒
0.04 │ vmovntdq %ymm0,0x640(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x33], 0); ▒
1.59 │ vmovntdq %ymm0,0x660(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x34], 0); ▒
0.07 │ vmovntdq %ymm0,0x680(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x35], 0); ▒
1.65 │ vmovntdq %ymm0,0x6a0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x36], 0); ▒
0.14 │ vmovntdq %ymm0,0x6c0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x37], 0); ▒
1.00 │ vmovntdq %ymm0,0x6e0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x38], 0); ▒
0.14 │ vmovntdq %ymm0,0x700(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x39], 0); ▒
1.31 │ vmovntdq %ymm0,0x720(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x3a], 0); ▒
0.10 │ vmovntdq %ymm0,0x740(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x3b], 0); ▒
1.21 │ vmovntdq %ymm0,0x760(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x3c], 0); ▒
0.07 │ vmovntdq %ymm0,0x780(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x3d], 0); ▒
1.27 │ vmovntdq %ymm0,0x7a0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x3e], 0); ▒
0.09 │ vmovntdq %ymm0,0x7c0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x3f], 0); ▒
1.28 │ vmovntdq %ymm0,0x7e0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x40], 0); ▒
0.11 │ vmovntdq %ymm0,0x800(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x41], 0); ▒
1.32 │ vmovntdq %ymm0,0x820(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x42], 0); ▒
0.09 │ vmovntdq %ymm0,0x840(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x43], 0); ▒
1.43 │ vmovntdq %ymm0,0x860(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x44], 0); ▒
0.11 │ vmovntdq %ymm0,0x880(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x45], 0); ▒
1.21 │ vmovntdq %ymm0,0x8a0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x46], 0); ▒
0.11 │ vmovntdq %ymm0,0x8c0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x47], 0); ▒
1.09 │ vmovntdq %ymm0,0x8e0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x48], 0); ▒
0.07 │ vmovntdq %ymm0,0x900(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x49], 0); ▒
1.26 │ vmovntdq %ymm0,0x920(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x4a], 0); ▒
0.16 │ vmovntdq %ymm0,0x940(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x4b], 0); ▒
1.58 │ vmovntdq %ymm0,0x960(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x4c], 0); ▒
0.05 │ vmovntdq %ymm0,0x980(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x4d], 0); ▒
1.54 │ vmovntdq %ymm0,0x9a0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x4e], 0); ▒
0.07 │ vmovntdq %ymm0,0x9c0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x4f], 0); ▒
1.66 │ vmovntdq %ymm0,0x9e0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x50], 0); ▒
0.16 │ vmovntdq %ymm0,0xa00(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x51], 0); ▒
1.31 │ vmovntdq %ymm0,0xa20(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x52], 0); ▒
0.20 │ vmovntdq %ymm0,0xa40(%rbx) ◆
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x53], 0); ▒
1.44 │ vmovntdq %ymm0,0xa60(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x54], 0); ▒
0.05 │ vmovntdq %ymm0,0xa80(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x55], 0); ▒
1.52 │ vmovntdq %ymm0,0xaa0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x56], 0); ▒
0.21 │ vmovntdq %ymm0,0xac0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x57], 0); ▒
1.09 │ vmovntdq %ymm0,0xae0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x58], 0); ▒
0.22 │ vmovntdq %ymm0,0xb00(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x59], 0); ▒
1.58 │ vmovntdq %ymm0,0xb20(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x5a], 0); ▒
0.12 │ vmovntdq %ymm0,0xb40(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x5b], 0); ▒
1.46 │ vmovntdq %ymm0,0xb60(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x5c], 0); ▒
0.04 │ vmovntdq %ymm0,0xb80(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x5d], 0); ▒
1.62 │ vmovntdq %ymm0,0xba0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x5e], 0); ▒
0.07 │ vmovntdq %ymm0,0xbc0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x5f], 0); ▒
1.71 │ vmovntdq %ymm0,0xbe0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x60], 0); ▒
0.19 │ vmovntdq %ymm0,0xc00(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x61], 0); ▒
1.89 │ vmovntdq %ymm0,0xc20(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x62], 0); ▒
0.11 │ vmovntdq %ymm0,0xc40(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x63], 0); ▒
1.98 │ vmovntdq %ymm0,0xc60(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x64], 0); ▒
0.16 │ vmovntdq %ymm0,0xc80(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x65], 0); ▒
1.58 │ vmovntdq %ymm0,0xca0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x66], 0); ▒
0.13 │ vmovntdq %ymm0,0xcc0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x67], 0); ▒
1.16 │ vmovntdq %ymm0,0xce0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x68], 0); ▒
0.09 │ vmovntdq %ymm0,0xd00(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x69], 0); ▒
1.67 │ vmovntdq %ymm0,0xd20(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x6a], 0); ▒
0.11 │ vmovntdq %ymm0,0xd40(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x6b], 0); ▒
1.82 │ vmovntdq %ymm0,0xd60(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x6c], 0); ▒
0.07 │ vmovntdq %ymm0,0xd80(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x6d], 0); ▒
1.57 │ vmovntdq %ymm0,0xda0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x6e], 0); ▒
0.02 │ vmovntdq %ymm0,0xdc0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x6f], 0); ▒
1.27 │ vmovntdq %ymm0,0xde0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x70], 0); ▒
│ vmovntdq %ymm0,0xe00(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x71], 0); ▒
1.48 │ vmovntdq %ymm0,0xe20(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x72], 0); ▒
0.11 │ vmovntdq %ymm0,0xe40(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x73], 0); ▒
1.87 │ vmovntdq %ymm0,0xe60(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x74], 0); ▒
0.16 │ vmovntdq %ymm0,0xe80(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x75], 0); ▒
1.45 │ vmovntdq %ymm0,0xea0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x76], 0); ▒
0.07 │ vmovntdq %ymm0,0xec0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x77], 0); ▒
1.65 │ vmovntdq %ymm0,0xee0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x78], 0); ▒
0.10 │ vmovntdq %ymm0,0xf00(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x79], 0); ▒
1.53 │ vmovntdq %ymm0,0xf20(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x7a], 0); ▒
0.07 │ vmovntdq %ymm0,0xf40(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x7b], 0); ▒
1.51 │ vmovntdq %ymm0,0xf60(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x7c], 0); ▒
0.12 │ vmovntdq %ymm0,0xf80(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x7d], 0); ▒
1.62 │ vmovntdq %ymm0,0xfa0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x7e], 0); ▒
0.08 │ vmovntdq %ymm0,0xfc0(%rbx) ▒
│ MEMSET_AVX2_STORE(((unsigned char *)page)[YMM_BYTES * 0x7f], 0); ▒
1.62 │ vmovntdq %ymm0,0xfe0(%rbx) ▒
│ } ▒
0.13 │ pop %rbx ▒
│ kernel_fpu_end(); ▒
│ jmp ffffffff8104b050