On Wed, Jan 04, 2023 at 01:08:52AM +0100, Stefano Brivio wrote:
On Wed, 21 Dec 2022 17:00:24 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:
On Tue, Dec 20, 2022 at 11:42:46AM +0100, Stefano Brivio wrote:
Sorry for the further delay,
On Wed, 14 Dec 2022 11:35:46 +0100
Stefano Brivio <sbrivio(a)redhat.com> wrote:
On Wed, 14 Dec 2022 12:42:14 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:
> On Tue, Dec 13, 2022 at 11:48:47PM +0100, Stefano Brivio wrote:
> > Sorry for the long delay here,
> >
> > On Mon, 5 Dec 2022 19:14:21 +1100
> > David Gibson <david(a)gibson.dropbear.id.au> wrote:
> >
> > > Usually udp_sock_handler() will receive multiple (up to 32) datagrams
> > > at once, then forward them all to the tap interface. For unclear
> > > reasons, though, when in pasta mode we will only receive and forward a
> > > single datagram at a time. Change it to receive multiple datagrams at
> > > once, like the other paths.
> >
> > This is explained in the commit message of 6c931118643c ("tcp, udp:
> > Receive batching doesn't pay off when writing single frames to tap").
> >
> > I think it's worth re-checking the throughput now as this path is a bit
> > different, but unfortunately I didn't include this in the "perf" tests :(
> > because at the time I introduced those I wasn't sure it even made sense to
> > have traffic from the same host being directed to the tap device.
> >
> > The iperf3 runs where I observed this are actually the ones from the Podman
> > demo. Ideally that case should be also checked in the perf/pasta_udp tests.
>
> Hm, ok.
>
> > How fundamental is this for the rest of the series? I couldn't find any
> > actual dependency on this but I might be missing something.
>
> So the issue is that prior to this change in pasta we receive multiple
> frames at once on the splice path, but one frame at a time on the tap
> path. By the end of this series we can't do that any more, because we
> don't know before the recvmmsg() which one we'll be doing.
Oh, right, I see. Then let me add this path to the perf/pasta_udp test
and check how relevant this is now; I'll get back to you in a bit.
I was checking the wrong path. With this:
diff --git a/test/perf/pasta_udp b/test/perf/pasta_udp
index 27ea724..973c2f4 100644
--- a/test/perf/pasta_udp
+++ b/test/perf/pasta_udp
@@ -31,6 +31,14 @@ report pasta lo_udp 1 __FREQ__
th MTU 1500B 4000B 16384B 65535B
+tr UDP throughput over IPv6: host to ns
+nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+nsout ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .prefixlen == 64).local'
+bw -
+bw -
+bw -
+iperf3 BW host ns __ADDR6__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 15G
+bw __BW__ 7.0 9.0
tr UDP throughput over IPv6: ns to host
ns ip link set dev lo mtu 1500
diff --git a/test/run b/test/run
index e07513f..b53182b 100755
--- a/test/run
+++ b/test/run
@@ -67,6 +67,14 @@ run() {
test build/clang_tidy
teardown build
+ VALGRIND=0
+ setup passt_in_ns
+ test passt/ndp
+ test passt/dhcp
+ test perf/pasta_udp
+ test passt_in_ns/shutdown
+ teardown passt_in_ns
+
setup pasta
test pasta/ndp
test pasta/dhcp
Ah, ok. Can we add that to the standard set of tests ASAP, please.
I get 21.6 Gbps after this series, and 29.7 Gbps before -- it's quite
significant.
Drat.
And there's nothing strange in perf's output, really: the distribution
of overhead per function is pretty much the same, but writing multiple
messages to the tap device just takes more cycles per message compared
to a single message.
That's so weird. It should be basically an identical set of write()s,
except that they happen in a batch, rather than a bit spread out. I
guess it has to be some kind of cache locality thing. I wonder if the
difference would go away or reverse if we had a way to submit multiple
frames with a single syscall.
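For what it's worth, I don't think the tun fd gives us that (each
write() is one frame, as far as I know), but on the passt side we could
in principle gather the whole batch into a single sendmsg() on the qemu
socket. Purely as a sketch of the idea -- the frame layout and names
below are made up for illustration, not the actual passt code:

/* Sketch only: send a batch of already-built, length-prefixed frames
 * with one syscall on a stream socket.  struct frame, tap_send_batch()
 * and the 32-frame cap are made-up names for this example.
 */
#include <arpa/inet.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

struct frame {
	uint32_t len;		/* frame length prefix, network order */
	char buf[65536];	/* frame data */
};

static ssize_t tap_send_batch(int tap_fd, struct frame *f, int n)
{
	struct iovec iov[2 * 32];	/* one prefix + one payload per frame */
	struct msghdr mh = { 0 };
	int i;

	if (n > 32)
		n = 32;

	for (i = 0; i < n; i++) {
		iov[2 * i].iov_base = &f[i].len;
		iov[2 * i].iov_len  = sizeof(f[i].len);
		iov[2 * i + 1].iov_base = f[i].buf;
		iov[2 * i + 1].iov_len  = ntohl(f[i].len);
	}

	mh.msg_iov = iov;
	mh.msg_iovlen = 2 * n;

	return sendmsg(tap_fd, &mh, MSG_NOSIGNAL);
}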
I'm a bit ashamed to propose this, but what do you think about
something like:
	if (c->mode == MODE_PASTA) {
		if (recvmmsg(ref.r.s, mmh_recv, 1, 0, NULL) <= 0)
			return;

		if (udp_mmh_splice_port(v6, mmh_recv)) {
			n = recvmmsg(ref.r.s, mmh_recv + 1,
				     UDP_MAX_FRAMES - 1, 0, NULL);
		}

		if (n > 0)
			n++;
		else
			n = 1;
	} else {
		n = recvmmsg(ref.r.s, mmh_recv, UDP_MAX_FRAMES, 0, NULL);
		if (n <= 0)
			return;
	}
? Other than the inherent ugliness, it looks like a good approximation
to me.
Hmm. Well, the first question is how much impact does going 1 message
at a time have on the spliced throughput. If it's not too bad, then
we could just always go one at a time for pasta, regardless of
splicing. And we could even abstract that difference into the tap
backend with a callback like tap_batch_size(c).
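Roughly something like this, just to sketch the shape of it (not a
patch against the tree, and the splice special-casing would go away):

/* Sketch only: how many datagrams are worth receiving in one go for the
 * current tap backend -- one at a time for the tun fd (pasta), a full
 * batch for the qemu socket (passt).
 */
static int tap_batch_size(const struct ctx *c)
{
	if (c->mode == MODE_PASTA)
		return 1;

	return UDP_MAX_FRAMES;
}

...and then the receive side is just
n = recvmmsg(ref.r.s, mmh_recv, tap_batch_size(c), 0, NULL), regardless
of whether the datagrams end up spliced or going via tap.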
So, finally I had the chance to try this out.
First off, baseline with the patch adding the new tests I just sent,
and the series you posted:
=== perf/pasta_udp
pasta: throughput and latency (local traffic)
Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
                                    MTU: |  1500B |  4000B | 16384B | 65535B |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv6: ns to host |    4.4 |    8.5 |   19.5 |   23.0 |
    UDP RR latency over IPv6: ns to host |      - |      - |      - |     27 |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv4: ns to host |    4.3 |    8.8 |   18.5 |   24.4 |
    UDP RR latency over IPv4: ns to host |      - |      - |      - |     26 |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv6: host to ns |      - |      - |      - |   22.5 |
    UDP RR latency over IPv6: host to ns |      - |      - |      - |     30 |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv4: host to ns |      - |      - |      - |   24.5 |
    UDP RR latency over IPv4: host to ns |      - |      - |      - |     25 |
                                         '--------'--------'--------'--------'
...passed.
pasta: throughput and latency (traffic via tap)
Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
                                    MTU: |  1500B |  4000B | 16384B | 65520B |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv6: ns to host |    4.4 |   10.4 |   16.0 |   23.4 |
    UDP RR latency over IPv6: ns to host |      - |      - |      - |     27 |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv4: ns to host |    5.2 |   10.8 |   16.0 |   24.0 |
    UDP RR latency over IPv4: ns to host |      - |      - |      - |     28 |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv6: host to ns |      - |      - |      - |   21.5 |
    UDP RR latency over IPv6: host to ns |      - |      - |      - |     29 |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv4: host to ns |      - |      - |      - |   26.3 |
    UDP RR latency over IPv4: host to ns |      - |      - |      - |     26 |
                                         '--------'--------'--------'--------'
which seems to indicate the whole "splicing" thing is pretty much
useless for UDP (except for that 16 KiB MTU case, but I wonder how
relevant that is).
If I set UDP_MAX_FRAMES to 1, with a quick workaround for the resulting
warning in udp_tap_send() (single frame to send, hence single message),
it gets somewhat weird:
=== perf/pasta_udp
pasta: throughput and latency (local traffic)
Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
                                    MTU: |  1500B |  4000B | 16384B | 65535B |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv6: ns to host |    3.4 |    7.0 |   21.6 |   31.6 |
    UDP RR latency over IPv6: ns to host |      - |      - |      - |     30 |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv4: ns to host |    3.8 |    7.0 |   22.0 |   32.4 |
    UDP RR latency over IPv4: ns to host |      - |      - |      - |     26 |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv6: host to ns |      - |      - |      - |   29.3 |
    UDP RR latency over IPv6: host to ns |      - |      - |      - |     31 |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv4: host to ns |      - |      - |      - |   33.8 |
    UDP RR latency over IPv4: host to ns |      - |      - |      - |     25 |
                                         '--------'--------'--------'--------'
...passed.
pasta: throughput and latency (traffic via tap)
Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
                                    MTU: |  1500B |  4000B | 16384B | 65520B |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv6: ns to host |    4.7 |   10.3 |   16.0 |   24.0 |
    UDP RR latency over IPv6: ns to host |      - |      - |      - |     27 |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv4: ns to host |    5.6 |   11.4 |   16.0 |   24.0 |
    UDP RR latency over IPv4: ns to host |      - |      - |      - |     26 |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv6: host to ns |      - |      - |      - |   21.5 |
    UDP RR latency over IPv6: host to ns |      - |      - |      - |     29 |
                                         |--------|--------|--------|--------|
    UDP throughput over IPv4: host to ns |      - |      - |      - |   28.7 |
    UDP RR latency over IPv4: host to ns |      - |      - |      - |     29 |
                                         '--------'--------'--------'--------'
...except for the cases with low MTUs, throughput is significantly
higher if we read and send one message at a time on the "spliced" path.
Next, I would like to:
- bisect between 32 and 1 for UDP_MAX_FRAMES: maybe 32 affects data
locality too much, but some lower value would still be beneficial by
lowering syscall overhead
Ok.
- try with sendmsg() instead of sendmmsg(), at this point. Looking at
the kernel, that doesn't seem to make a real difference.
Which sendmmsg() specifically are you looking at changing?
About this series: should we just go ahead and apply it with
UDP_MAX_FRAMES set to 1 for the time being? It's better than the
existing situation anyway.
I think that's a good idea - or rather, not setting UDP_MAX_FRAMES to
1, but clamping the batch size to 1 for pasta - I'm pretty sure we
still want the batching for passt. We lose a little bit on
small-packet spliced, but we gain on both tap and large-packet
spliced. This will unblock the dual stack udp stuff and we can
further tune it later.
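To be clear, by clamping I mean something along these lines (sketch
only, not a tested change):

	/* Sketch: leave UDP_MAX_FRAMES at 32 for passt, but only ask for
	 * one datagram per recvmmsg() in pasta mode.
	 */
	int max_frames = c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES;

	n = recvmmsg(ref.r.s, mmh_recv, max_frames, 0, NULL);
	if (n <= 0)
		return;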
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson