On Thu, Jan 02, 2025 at 10:59:48PM +0100, Stefano Brivio wrote:

On Thu, 2 Jan 2025 12:00:30 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:

Hm, yes. Well, one problem at a time. Well, ok, 2 or 3 problems at a time.

On Wed, Jan 01, 2025 at 10:54:33PM +0100, Stefano Brivio wrote:

Theoretically speaking, it could actually be IPv6 with Jumbograms. They never really gained traction (because of Ethernet, I think) and we don't support them, but the only attempt to deprecate them I'm aware of didn't succeed (yet): https://datatracker.ietf.org/doc/draft-jones-6man-historic-rfc2675/ ...and I actually wonder if somebody will ever try to revive them for virtual networks, where they might make somewhat more sense (you could transfer filesystems with one packet per file or similar silly tricks).

On Fri, 20 Dec 2024 19:35:29 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:

So, it's a separate calculation for each backend type, and some of them are pretty tricky. For anything based on the kernel tap device it is 65535, because it has an internal frame size limit of 65535, already including any L2 headers (it explicitly limits the MTU to 65535 - hard_header_len). There is no "hardware" header.

For the qemu stream protocol it gets pretty complicated, because there are multiple layers which could clamp the maximum size. It doesn't look like the socket protocol code itself imposes a limit beyond the structural one of (2^32 - 1 + 4) (well, and putting it into an ssize_t, which could be less on 32-bit systems). AFAICT, it's not theoretically impossible to have gigabyte frames with a weird virtual NIC model... though obviously that wouldn't be IP, and probably not even Ethernet.

We verify that every packet we store in a pool - and every partial packet we retrieve from it - has a length no longer than UINT16_MAX. This originated in the older packet pool implementation, which stored packet lengths in a uint16_t. Now that packets are represented by a struct iovec with its size_t length, this check serves only as a sanity / security check that we don't have some wildly out of range length due to a bug elsewhere.

However, UINT16_MAX (65535) isn't quite enough, because the "packet" as stored in the pool is in fact an entire frame, including both L2 and any backend specific headers. We can exceed this in passt mode, even with the default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header + 4 bytes of qemu stream length header = 65538 bytes.

Introduce our own define for the maximum length of a packet in the pool and set it slightly larger, allowing 128 bytes for L2 and/or other backend specific headers. We'll use different amounts of that depending on the tap backend, but since this is just a sanity check, the bound doesn't need to be 100% tight.

I couldn't find the time to check what's the maximum number of bytes we can get here depending on hypervisor and interface, but if this patch

No. Or if so, that's a guest side bug. The MTU set with ip-link is the maximum L3 size - at the default 1500, the L2 frame can be 1514 bytes. If the driver can't send an L2 frame greater than 65535 bytes, then it should clamp the MTU to (65535 - hard_header_len), like tuntap already does.

I do think ETH_MAX_MTU is a confusing name: is it the maximum (IP) MTU which can be had in an ethernet frame (that's longer), or is it the maximum ethernet frame size (leading to an IP MTU of 65521)? I plan to eliminate use of ETH_MAX_MTU in favour of clearer things in my MTU series. I should merge that with this one; it might make the context clearer.
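(For concreteness, the clamp being discussed here is just arithmetic on the frame cap; below is a minimal C sketch of it, with made-up FRAME_CAP_EX / ETH_HLEN_EX / clamp_mtu_ex names, not the actual tuntap or passt code:)

/* Sketch of the clamp discussed above: if the device can't pass an L2
 * frame bigger than some cap, the advertised IP MTU has to shrink by
 * the link-layer header size.  Names are made up for this example. */
#define FRAME_CAP_EX    65535   /* largest full L2 frame the device accepts */
#define ETH_HLEN_EX     14      /* Ethernet header, no 802.1q tag */

static int clamp_mtu_ex(int requested_mtu)
{
        int max_mtu = FRAME_CAP_EX - ETH_HLEN_EX;       /* 65521 */

        return requested_mtu > max_mtu ? max_mtu : requested_mtu;
}

With a 65535-byte frame cap this gives the 65521 figure above; read the other way, a 65535-byte MTU gives a 65549-byte Ethernet frame, or 65553 once qemu's 4-byte stream length header is added.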
AFAICT there is *no* structural limit on the size of an ethernet frame; the length isn't in any header, it's just assumed to be reported out of band by the hardware. There's no theoretical reason that reporting mechanism couldn't allow lengths > 65535, whether slightly (65535 bytes of payload + header & FCS) or vastly.

Each virtual NIC could have its own limit. I suspect that's going to be in the vicinity of 64k. But I'm really struggling to figure out what it is just for virtio-net, so I really don't want to try to figure it out for all of them. With a virtio-net NIC, I seem to be able to set the MTU all the way up to 65535 successfully, which implies a maximum frame size of 65535 + 14 (L2 header) + 4 (stream protocol header) = 65553 at least.

The Layer-2 header is included (because that also happens to be ETH_MAX_MTU, on Linux); it wouldn't be on top.

We're not dropping it, just raising the limit, fairly slightly.

Similar situation for vhost-user, where I'm finding it even more inscrutable to figure out what limits are imposed at the sub-IP levels. At the moment the "hardware" header (virtio_net_hdr_mrg_rxbuf) doesn't count towards what we store in the packet.c layer, but we might have reasons to change that.

So, any sub-IP limits for qemu, I'm basically not managing to find. However, we (more or less) only care about IP, which imposes a more practical limit of 65535 + L2 header size + "hardware" header size. At present that maxes out at 65553, as above, but if we ever support other L2 encapsulations, or other backend protocols with larger "hardware" headers, that could change.

Okay. I was thinking of a more practical approach, based on the fact that we only support Ethernet anyway, with essentially four types of adapters (three virtio-net implementations, and tap), plus rather rare reports with e1000e (https://bugs.passt.top/show_bug.cgi?id=107), and we could actually test things.

Well, I did that anyway, because I'm not quite comfortable dropping the UINT16_MAX check in packet_add_do() without testing... and yes, with this patch we can trigger a buffer overflow with vhost-user. In detail:

- muvm, virtio-net with 65535 bytes MTU:
  - ping -s 65492 works (65534 bytes "on wire" from pcap)
  - ping -s 65493 doesn't, because the guest fragments: 65530 plus 39 bytes on wire (that's with a newer kernel, probably that's the reason why it's not the same limit as QEMU, below)

That's a guest driver bug. If the MTU is 65535, it should be able to send a 65535-byte IP datagram without fragmentation. Looks like it needs to clamp the MTU based on L2 limitations.

- QEMU, virtio-net without vhost-user, 65535 bytes MTU:
  - ping -s 65493 works (65535 bytes "on wire" from pcap)
  - with -s 65494: "Bad frame size from guest, resetting connection"

That's our check, which I plan to fix in the MTU series.

- QEMU, virtio-net with vhost-user, 65535 bytes MTU:
  - ping -s 65493 works (65535 bytes "on wire" from pcap)
  - ping -s 65494: *** buffer overflow detected ***: terminated

Without this patch, we catch that in packet_add_do() (and without 9/12 we don't crash).

Ouch. That's a bug in our vhost-user code. The easy fix would be to clamp the MTU to 65521; arguably more correct would be to decouple its notion of maximum frame size from ETH_MAX_MTU.
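(For illustration, the guard that turns this kind of overflow into a dropped frame is just a length check against the receive buffer before anything is copied; a minimal sketch under assumed names, not the actual passt vhost-user code:)

/* Sketch only: not the passt vhost-user code, just the shape of the
 * check.  A frame whose length exceeds what the receive buffer (and
 * the packet pool) can hold is dropped instead of being copied. */
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define POOL_FRAME_MAX_EX       (65535 + 128)   /* hypothetical pool bound */

static char rx_buf_ex[POOL_FRAME_MAX_EX];

/* Return 0 and copy the frame on success, -EMSGSIZE if it can't fit */
static int store_frame_ex(const void *frame, size_t len)
{
        if (len > sizeof(rx_buf_ex))
                return -EMSGSIZE;       /* would overflow: drop the frame */

        memcpy(rx_buf_ex, frame, len);
        return 0;
}

That's essentially the role the packet_add_do() check plays today: without the patch it catches the oversized frame, while with the relaxed limit the vhost-user path overflows instead.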
- tap, 65521 bytes MTU (maximum allowed, I think 65520 would be the correct maximum though):

No, 65521 is correct: 65521 + 14 = 65535, which is the maximum allowed tap frame size. Seems like there are other drivers that should also be clamping their max MTU similarly, but aren't.

  - ping -s 65493 works (65535 bytes on wire)
  - ping -s 65494 doesn't (65530 bytes + 40 bytes fragments)

This time the fragmentation is correct, because the MTU is only 65521.

So, I guess we should find out the issue with vhost-user first.

Yeah. I definitely need to intermingle the MTU series with this one to get the order of operations right.

Other than that, it looks like we can reach at least 65539 bytes if we add the "virtio-net" length descriptors, but still the frame size itself (which is actually what matters for the functions in packet.c)

Well.. sort of. The packet.c functions really care nothing about any of the layers; it's just a blob of data to them. I did miss that tap_passt_input() excludes the qemu header before inserting the frame into the pool, so indeed, the pool layer won't currently see a length greater than 65535. Of course, since that frame header *is* stored in pkt_buf with all the rest, that's arguably not using the pool layer's buffer bound checks to the full extent we could.

can't exceed 65535 bytes, at least from my tests. Then, yes, I agree that it's not actually correct, even though it fits all the use cases we have, because we *could* have an implementation exceeding that value (at the moment, it looks like we don't).

So, it seems like the Linux drivers might not actually generate ethernet frames > 65535 bytes - although they don't all correctly reduce their maximum MTU to reflect that. I don't think we should rely on that; AFAICT it would be reasonable for a driver + VMM implementation to allow ethernet frames that are at least 65535 bytes + L2 header. That might also allow for 16 byte 802.1q vlan L2 headers.

That's fair, but I don't think the pool layer should impose that limit on the backends, because I think it's equally reasonable for another backend to allow slightly larger frames with a 65535-byte L3 payload. Hence setting the limit to 64k + "a little bit".

It might still be convenient for some back-ends to define "sane" as 64 KiB. I'm really not sure if it is; I didn't look into the matching part of vhost-user in detail.

fixes an actual issue as you seem to imply, actually checking that with QEMU and muvm would be nice. By the way, as you mention a specific calculation, does it really make sense to use a "good enough" value here? Can we ever exceed 65538 bytes, or can we use that as the limit? It would be good to find out, while at it.

So, yes, I think we can exceed 65538. But more significantly, trying to make the limit tight here feels like a borderline layering violation. The packet layer doesn't really care about the frame size as long as it's "sane".

If it's buggy because we have a user that can exceed that, sure, let's fix it. If not... also fine by me as it's some kind of theoretical flaw, but we shouldn't crash.

-- 
David Gibson (he or they)          | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au     | minimalist, thank you, not the other way
                                   | around.
http://www.ozlabs.org/~dgibson

Fwiw, in the draft changes I have improving MTU handling, it's my intention that individual backends calculate and/or enforce tighter limits of their own where practical, and BUILD_ASSERT() that those fit within the packet layer's frame size limit.

Ah, nice, that definitely sounds like an improvement.
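(For reference, a rough sketch of what such a per-backend compile-time check could look like, using C11 static_assert in place of BUILD_ASSERT() and made-up constant names; the actual names and values in the series may differ:)

/* Sketch only: hypothetical names and values, not the actual passt
 * defines.  Each backend states its worst-case per-frame overhead and
 * the build fails if that stops fitting the pool's frame size limit. */
#include <assert.h>
#include <stdint.h>

#define PKT_POOL_MAX_EX         (UINT16_MAX + 128)      /* pool-level bound */
#define ETH_HLEN_EX             14                      /* Ethernet header */
#define QEMU_STREAM_HDR_EX      4                       /* qemu length field */
#define IP_MAX_DGRAM_EX         65535                   /* largest non-jumbo IP datagram */

/* qemu stream backend: full IP datagram + Ethernet + length header */
static_assert(IP_MAX_DGRAM_EX + ETH_HLEN_EX + QEMU_STREAM_HDR_EX <= PKT_POOL_MAX_EX,
              "qemu stream frame must fit the packet pool limit");

/* tap backend: the kernel caps the whole frame, L2 included, at 65535 */
static_assert(65535 <= PKT_POOL_MAX_EX,
              "tap frame must fit the packet pool limit");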