[PATCH 0/9] Improve handling of MTU limits
After considerable lead up, this corrects the handling of the --mtu option so
that it respects the limits imposed by the backend tap layers, correctly
accounting for L2 headers.

This incorporates the earlier mode setting patches, on which the rest depend.
There's been no change in those patches, they're just included here for
self-containedness.

David Gibson (9):
  conf: Use the same optstring for passt and pasta modes
  conf: Move mode detection into helper function
  conf: Detect vhost-user mode earlier
  packet: Give explicit name to maximum packet size
  packet: Remove redundant TAP_BUF_BYTES define
  tap: Use explicit defines for maximum length of L2 frame
  Simplify sizing of pkt_buf
  pcap: Correctly set snaplen based on tap backend type
  conf: Limit maximum MTU based on backend frame size

 conf.c   | 98 +++++++++++++++++++++++++++++++++++++++++---------------
 conf.h   |  1 +
 packet.c |  4 +--
 packet.h |  3 ++
 passt.c  | 16 ++-------
 passt.h  |  7 ++--
 pcap.c   | 46 +++++++++++++-------------
 tap.c    | 39 +++++++++++++++++++---
 tap.h    | 26 +++++++++++++++
 util.h   |  3 --
 10 files changed, 168 insertions(+), 75 deletions(-)

-- 
2.48.1
Currently we rely on detecting our mode first and use different sets of
(single character) options for each. This means that if you use an option
that's valid only in the other mode, you just get the generic usage() message.
We can give more helpful errors with little extra effort by combining all
the options into a single value of the option string and giving bespoke
messages if an option for the wrong mode is used; in fact we already did
this for some single mode options like '-1'.
Signed-off-by: David Gibson
One of the first things we need to do is determine if we're in passt mode
or pasta mode. Currently this is open-coded in main(), by examining
argv[0]. We want to complexify this a bit in future to cover vhost-user
mode as well. Prepare for this, by moving the mode detection into a new
conf_mode() function.
Signed-off-by: David Gibson
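As a rough sketch of what mode detection from argv[0] can look like when pulled out of main(): the helper name mode_from_argv0() and the MODE_UNKNOWN value are invented for this example, and the matching rule (substring of the base name) is an assumption rather than a copy of conf_mode().

```c
#include <assert.h>
#include <string.h>

/* MODE_UNKNOWN is an addition for this sketch only */
enum passt_modes { MODE_PASST, MODE_PASTA, MODE_UNKNOWN };

/* mode_from_argv0() - Hypothetical helper: pick mode from program name
 * @argv0:	Program name as invoked, possibly with a leading path
 *
 * Return: mode implied by the base name, MODE_UNKNOWN if neither matches
 */
static enum passt_modes mode_from_argv0(const char *argv0)
{
	const char *base = strrchr(argv0, '/');

	/* Skip the directory part, if any */
	base = base ? base + 1 : argv0;

	if (strstr(base, "pasta"))
		return MODE_PASTA;
	if (strstr(base, "passt"))
		return MODE_PASST;

	return MODE_UNKNOWN;
}
```

Keeping this in one function gives later patches a single place to add further ways of selecting the mode.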
We detect our operating mode in conf_mode(), unless we're using vhost-user
mode, in which case we change it later when we parse the --vhost-user
option. That means we need to delay parsing the --repair-path option (for
vhost-user only) until still later.
However, there are many other places in the main option parsing loop which
also rely on mode. We get away with those, because they happen to be able
to treat passt and vhost-user modes identically. This is potentially
confusing, though. So, move setting of MODE_VU into conf_mode() so
c->mode always has its final value from that point onwards.
To match, we move the parsing of --repair-path back into the main option
parsing loop.
Signed-off-by: David Gibson
On Tue, 11 Mar 2025 17:03:12 +1100 David Gibson wrote:
We detect our operating mode in conf_mode(), unless we're using vhost-user mode, in which case we change it later when we parse the --vhost-user option. That means we need to delay parsing the --repair-path option (for vhost-user only) until still later.
However, there are many other places in the main option parsing loop which also rely on mode. We get away with those, because they happen to be able to treat passt and vhost-user modes identically. This is potentially confusing, though. So, move setting of MODE_VU into conf_mode() so c->mode always has its final value from that point onwards.
To match, we move the parsing of --repair-path back into the main option parsing loop.
Signed-off-by: David Gibson
---
 conf.c | 43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/conf.c b/conf.c
index 2022ea1d..b58e2a6e 100644
--- a/conf.c
+++ b/conf.c
@@ -998,10 +998,23 @@ pasta_opts:
  *
  * Return: mode to operate in, PASTA or PASST
  */
-/* cppcheck-suppress constParameter */
 enum passt_modes conf_mode(int argc, char *argv[])
 {
+	int vhost_user = 0;
+	const struct option optvu[] = {
+		{"vhost-user",	no_argument,	&vhost_user,	1 },
+		{ 0 },
+	};
 	char argv0[PATH_MAX], *basearg0;
+	int name;
+
+	optind = 0;
+	do {
+		name = getopt_long(argc, argv, "-:", optvu, NULL);
+	} while (name != -1);
+
+	if (vhost_user)
+		return MODE_VU;
 
 	if (argc < 1)
 		die("Cannot determine argv[0]");
@@ -1604,9 +1617,8 @@ void conf(struct ctx *c, int argc, char **argv)
 
 			die("Invalid host nameserver address: %s", optarg);
 		case 25:
-			if (c->mode == MODE_PASTA)
-				die("--vhost-user is for passt mode only");

This check should now be moved to conf_mode() instead of being dropped, otherwise you can do:

  $ ./pasta -f --vhost-user

and at this point, the mode is MODE_VU, so it's all fine, but I don't think it's intended (...or is it?).

-			c->mode = MODE_VU;
+			/* Already handled in conf_mode() */
+			ASSERT(c->mode == MODE_VU);
 			break;
 		case 26:
 			vu_print_capabilities();

Pre-existing, but now we can fix this: case 26 (--print-capabilities) should only be accepted if (c->mode == MODE_VU). It can also be done in another patch I would say, if you don't want to re-spin this.

-- 
Stefano
On Tue, Mar 11, 2025 at 11:45:03PM +0100, Stefano Brivio wrote:
On Tue, 11 Mar 2025 17:03:12 +1100 David Gibson wrote:

We detect our operating mode in conf_mode(), unless we're using vhost-user mode, in which case we change it later when we parse the --vhost-user option. That means we need to delay parsing the --repair-path option (for vhost-user only) until still later.
However, there are many other places in the main option parsing loop which also rely on mode. We get away with those, because they happen to be able to treat passt and vhost-user modes identically. This is potentially confusing, though. So, move setting of MODE_VU into conf_mode() so c->mode always has its final value from that point onwards.
To match, we move the parsing of --repair-path back into the main option parsing loop.
Signed-off-by: David Gibson
---
 conf.c | 43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/conf.c b/conf.c
index 2022ea1d..b58e2a6e 100644
--- a/conf.c
+++ b/conf.c
@@ -998,10 +998,23 @@ pasta_opts:
  *
  * Return: mode to operate in, PASTA or PASST
  */
-/* cppcheck-suppress constParameter */
 enum passt_modes conf_mode(int argc, char *argv[])
 {
+	int vhost_user = 0;
+	const struct option optvu[] = {
+		{"vhost-user",	no_argument,	&vhost_user,	1 },
+		{ 0 },
+	};
 	char argv0[PATH_MAX], *basearg0;
+	int name;
+
+	optind = 0;
+	do {
+		name = getopt_long(argc, argv, "-:", optvu, NULL);
+	} while (name != -1);
+
+	if (vhost_user)
+		return MODE_VU;
 
 	if (argc < 1)
 		die("Cannot determine argv[0]");
@@ -1604,9 +1617,8 @@ void conf(struct ctx *c, int argc, char **argv)
 
 			die("Invalid host nameserver address: %s", optarg);
 		case 25:
-			if (c->mode == MODE_PASTA)
-				die("--vhost-user is for passt mode only");
This check should now be moved to conf_mode() instead of being dropped, otherwise you can do:
$ ./pasta -f --vhost-user
and at this point, the mode is MODE_VU, so it's all fine, but I don't think it's intended (...or is it?).
It's more or less intended. To me it seemed simpler to treat "vhost-user mode" as co-equal with "/dev/net/tun mode" (pasta) or "qemu -net stream mode" (passt), rather than having vu be sort of a sub-mode of passt. It's true that vu mode has slightly more in common with passt mode than pasta at the moment, but I don't see that as really inherent. I also saw this as a precursor to a "--mode whatever" option which would override the mode regardless of argv[0], in case there are circumstances where manipulating argv[0] is inconvenient. But if you'd really prefer I can reinstate the check.
-			c->mode = MODE_VU;
+			/* Already handled in conf_mode() */
+			ASSERT(c->mode == MODE_VU);
 			break;
 		case 26:
 			vu_print_capabilities();
Pre-existing, but now we can fix this: case 26 (--print-capabilities) should only be accepted if (c->mode == MODE_VU).
I was unsure about this, because I wasn't certain if --vhost-user was passed when we were invoked just to probe capabilities.
It can also be done in another patch I would say, if you don't want to re-spin this.
-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
We verify that every packet we store in a pool (and every partial packet
we retrieve from it) has a length no longer than UINT16_MAX. This
originated in the older packet pool implementation which stored packet
lengths in a uint16_t. Now that packets are represented by a struct
iovec with its size_t length, this check serves only as a sanity / security
check that we don't have some wildly out of range length due to a bug
elsewhere.
We may have reasons to (slightly) increase this limit in future, so in
preparation, give this quantity an explicit name - PACKET_MAX_LEN.
Signed-off-by: David Gibson
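For illustration, a minimal sketch of the kind of sanity check described here, with the limit given its new name. The helper packet_len_ok() is hypothetical; only PACKET_MAX_LEN and the UINT16_MAX value come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Named limit, still UINT16_MAX for now */
#define PACKET_MAX_LEN	UINT16_MAX

/* packet_len_ok() - Hypothetical helper: sanity check a packet's length
 * @pkt:	Packet described by a struct iovec
 *
 * Return: true if the length is within PACKET_MAX_LEN
 */
static bool packet_len_ok(const struct iovec *pkt)
{
	/* iov_len is a size_t, so it could hold wildly large values if
	 * something went wrong elsewhere: catch that here.
	 */
	return pkt->iov_len <= PACKET_MAX_LEN;
}
```

Raising the limit later then only means changing the one define.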
Currently we define both TAP_BUF_BYTES and PKT_BUF_BYTES as essentially
the same thing. They'll be different only if TAP_BUF_BYTES is negative,
which makes no sense. So, remove TAP_BUF_BYTES and just use PKT_BUF_BYTES.
In addition, in most places we use this just to mean the size of the main
packet buffer (pkt_buf), for which we can directly use sizeof.
Signed-off-by: David Gibson
Currently in tap.c we (mostly) use ETH_MAX_MTU as the maximum length of
an L2 frame. This define comes from the kernel, but it's badly named and
used confusingly.
First, it doesn't really have anything to do with Ethernet, which has no
structural limit on frame lengths. It comes more from either a) IP which
imposes a 64k datagram limit or b) from internal buffers used in various
places in the kernel (and in passt).
Worse, MTU generally means the maximum size of the IP (L3) datagram which
may be transferred, _not_ counting the L2 headers. In the kernel
ETH_MAX_MTU is sometimes used that way, but sometimes seems to be used as
a maximum frame length, _including_ L2 headers. In tap.c we're mostly
using it in the second way.
Finally, each of our tap backends could have different limits on the frame
size imposed by the mechanisms they're using.
Start clearing up this confusion by replacing it in tap.c with new
L2_MAX_LEN_* defines which specifically refer to the maximum L2 frame
length for each backend.
Signed-off-by: David Gibson
Nits only (I can fix it all up on merge):
On Tue, 11 Mar 2025 17:03:15 +1100 David Gibson wrote:
Currently in tap.c we (mostly) use ETH_MAX_MTU as the maximum length of an L2 frame. This define comes from the kernel, but it's badly named and used confusingly.
First, it doesn't really have anything to do with Ethernet, which has no structural limit on frame lengths. It comes more from either a) IP which imposes a 64k datagram limit or b) from internal buffers used in various places in the kernel (and in passt).
Worse, MTU generally means the maximum size of the IP (L3) datagram which may be transferred, _not_ counting the L2 headers. In the kernel ETH_MAX_MTU is sometimes used that way, but sometimes seems to be used as a maximum frame length, _including_ L2 headers. In tap.c we're mostly using it in the second way.
Finally, each of our tap backends could have different limits on the frame size imposed by the mechanisms they're using.
Start clearing up this confusion by replacing it in tap.c with new L2_MAX_LEN_* defines which specifically refer to the maximum L2 frame length for each backend.
Signed-off-by: David Gibson
---
 tap.c | 18 ++++++++++++++----
 tap.h | 25 +++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/tap.c b/tap.c
index fb306e75..4840dcfa 100644
--- a/tap.c
+++ b/tap.c
@@ -62,6 +62,15 @@
 #include "vhost_user.h"
 #include "vu_common.h"
 
+/* Maximum allowed frame lengths (including L2 header) */
+
+static_assert(L2_MAX_LEN_PASTA <= PACKET_MAX_LEN,
+	      "packet pool can't store maximum size pasta frame");
+static_assert(L2_MAX_LEN_PASST <= PACKET_MAX_LEN,
+	      "packet pool can't store maximum size qemu socket frame");
+static_assert(L2_MAX_LEN_VU <= PACKET_MAX_LEN,
+	      "packet pool can't store maximum size vhost-user frame");
+
 /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
 static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
 static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
@@ -1097,7 +1106,8 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now)
 	while (n >= (ssize_t)sizeof(uint32_t)) {
 		uint32_t l2len = ntohl_unaligned(p);
 
-		if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
+		if (l2len < sizeof(struct ethhdr) ||
+		    l2len > L2_MAX_LEN_PASST) {

No need to wrap this.

 			err("Bad frame size from guest, resetting connection");
 			tap_sock_reset(c);
 			return;
@@ -1151,8 +1161,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 
 	tap_flush_pools();
 
-	for (n = 0; n <= (ssize_t)(sizeof(pkt_buf) - ETH_MAX_MTU); n += len) {
-		len = read(c->fd_tap, pkt_buf + n, ETH_MAX_MTU);
+	for (n = 0; n <= (ssize_t)(sizeof(pkt_buf) - L2_MAX_LEN_PASTA); n += len) {

n += len should go on its own line now.

+		len = read(c->fd_tap, pkt_buf + n, L2_MAX_LEN_PASTA);
 
 		if (len == 0) {
 			die("EOF on tap device, exiting");
@@ -1170,7 +1180,7 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
 
 		/* Ignore frames of bad length */
 		if (len < (ssize_t)sizeof(struct ethhdr) ||
-		    len > (ssize_t)ETH_MAX_MTU)
+		    len > (ssize_t)L2_MAX_LEN_PASTA)
 			continue;
 
 		tap_add_packet(c, len, pkt_buf + n);
diff --git a/tap.h b/tap.h
index a2c3b87d..140e3305 100644
--- a/tap.h
+++ b/tap.h
@@ -6,6 +6,31 @@
 #ifndef TAP_H
 #define TAP_H
 
+/** L2_MAX_LEN_PASTA - Maximum frame length for pasta mode (with L2 header)
+ *
+ * The kernel tuntap device imposes a maximum frame size of 65535 including
+ * 'hard_header_len' (14 bytes for L2 Ethernet in the case of "tap" mode).

Extra whitespaces in indentation.

+ */
+#define L2_MAX_LEN_PASTA	USHRT_MAX
+
+/** L2_MAX_LEN_PASST - Maximum frame length for passt mode (with L2 header)
+ *
+ * The only structural limit the Qemu socket protocol imposes on frames is

QEMU

+ * (2^32-1) bytes, but that would be ludicrously long in practice. For now,
+ * limit it somewhat arbitrarily to 65535 bytes. FIXME: Work out an appropriate
+ * limit with more precision.
+ */
+#define L2_MAX_LEN_PASST	USHRT_MAX
+
+/** L2_MAX_LEN_VU - Maximum frame length for vhost-user mode (with L2 header)
+ *
+ * VU allows multiple buffers per frame, each of which can be quite large, so

vhost-user

+ * the inherent frame size limit is rather large. Much larger than is actually
+ * useful for IP. For now limit arbitrarily to 65535 bytes. FIXME: Work out an
+ * appropriate limit with more precision.
+ */
+#define L2_MAX_LEN_VU		USHRT_MAX
+
 struct udphdr;
 
 /**
-- Stefano
On Tue, Mar 11, 2025 at 11:45:09PM +0100, Stefano Brivio wrote:
Nits only (I can fix it all up on merge):
Actually, I spotted one other small change I'd like to make here, so I might as well respin.
On Tue, 11 Mar 2025 17:03:15 +1100 David Gibson wrote:

[snip]

+ */
+#define L2_MAX_LEN_PASTA	USHRT_MAX
+
+/** L2_MAX_LEN_PASST - Maximum frame length for passt mode (with L2 header)
+ *
+ * The only structural limit the Qemu socket protocol imposes on frames is

QEMU

For some reason, the standard style for capitalizing QEMU never sticks in my head.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
We define the size of pkt_buf as large enough to hold 128 maximum size
packets. Well, approximately, since we round down to the page size. We
don't have any specific reliance on how many packets can fit in the buffer,
we just want it to be big enough to allow reasonable batching. The
current definition relies on the confusingly named ETH_MAX_MTU and adds
in sizeof(uint32_t) rather non-obviously for the pseudo-physical header
used by the qemu socket (passt mode) protocol.
Instead, just define it to be 8MiB, which is what that complex calculation
works out to.
Signed-off-by: David Gibson
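The arithmetic behind the 8MiB figure can be checked with a small sketch. It assumes ETH_MAX_MTU is 65535 (0xFFFF) as in the kernel headers; the function name old_pkt_buf_bytes() and the *_ASSUMED names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_MAX_MTU_ASSUMED	0xFFFFU		/* 65535, assumed kernel value */
#define NEW_PKT_BUF_BYTES	(8UL * 1024 * 1024)

/* old_pkt_buf_bytes() - Hypothetical helper reproducing the old sizing
 * @page_size:	Page size to round down to
 *
 * Return: room for 128 maximum size frames, each with the 4-byte length
 *	   header of the qemu socket protocol, rounded down to the page size
 */
static size_t old_pkt_buf_bytes(size_t page_size)
{
	/* (65535 + 4) * 128 = 8388992 */
	size_t raw = (ETH_MAX_MTU_ASSUMED + sizeof(uint32_t)) * 128;

	/* 8388992 is 384 bytes past 8MiB, so rounding down lands on 8MiB */
	return raw / page_size * page_size;
}
```

Since the unrounded value exceeds 8MiB by only 384 bytes, rounding down to 4KiB, 16KiB or 64KiB pages all gives exactly 8388608 bytes.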
The pcap header includes a value indicating how much of each frame is
captured. We always capture the entire frame, so we want to set this to
the maximum possible frame size. Currently we do that by setting it to
ETH_MAX_MTU, but that's a confusingly named constant which might not always
be correct depending on the details of our tap backend.
Instead add a tap_l2_max_len() function that explicitly returns the maximum
frame size for the current mode and use that to set snaplen. While we're
there, there's no particular need for the pcap header to be defined in a
global; make it local to pcap_init() instead.
Signed-off-by: David Gibson
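A rough sketch of the approach follows, under a few assumptions: struct ctx is reduced to just the mode, the signature of tap_l2_max_len() is guessed, and pcap_hdr_init() is a hypothetical stand-in for the part of pcap_init() that fills in the header. The header layout and constants follow the classic pcap file format.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

enum passt_modes { MODE_PASST, MODE_PASTA, MODE_VU };

struct ctx { enum passt_modes mode; };	/* reduced for this sketch */

#define L2_MAX_LEN_PASTA	USHRT_MAX
#define L2_MAX_LEN_PASST	USHRT_MAX
#define L2_MAX_LEN_VU		USHRT_MAX

/* Maximum L2 frame length for the current mode (signature assumed) */
static size_t tap_l2_max_len(const struct ctx *c)
{
	switch (c->mode) {
	case MODE_PASST:
		return L2_MAX_LEN_PASST;
	case MODE_PASTA:
		return L2_MAX_LEN_PASTA;
	case MODE_VU:
		return L2_MAX_LEN_VU;
	}
	return 0;	/* not reached for valid modes */
}

/* Global header of a pcap capture file */
struct pcap_hdr {
	uint32_t magic;
	uint16_t major;
	uint16_t minor;
	int32_t thiszone;
	uint32_t sigfigs;
	uint32_t snaplen;
	uint32_t linktype;
};

/* Build the header locally rather than keeping it in a global */
static struct pcap_hdr pcap_hdr_init(const struct ctx *c)
{
	struct pcap_hdr hdr = {
		.magic = 0xa1b2c3d4,
		.major = 2,
		.minor = 4,
		.snaplen = (uint32_t)tap_l2_max_len(c),
		.linktype = 1,	/* Ethernet */
	};

	return hdr;
}
```

If a backend's frame limit changes later, snaplen follows automatically because it is derived from the per-mode define.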
The -m option controls the MTU, that is the maximum transmissible L3
datagram, not including L2 headers. We currently limit it to ETH_MAX_MTU
which sounds like it makes sense. But ETH_MAX_MTU is confusing: it's not
consistently used as to whether it means the maximum L3 datagram size or
the maximum L2 frame size. Even within conf() we explicitly account for
the L2 header size when computing the default --mtu value, but not when
we compute the maximum --mtu value.
Clean this up by reworking the maximum MTU computation to be the minimum of
IP_MAX_MTU (65535) and the maximum sized IP datagram which can fit into
our L2 frames when we account for the L2 header. The latter can vary
depending on our tap backend, although it doesn't right now.
Link: https://bugs.passt.top/show_bug.cgi?id=66
Signed-off-by: David Gibson
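The computation described above can be sketched as follows. The helper name max_mtu() is hypothetical, and the 14-byte Ethernet header length is written out as a local define rather than taken from a system header.

```c
#include <assert.h>
#include <stddef.h>

#define IP_MAX_MTU		65535U	/* maximum IP datagram size */
#define ETH_HLEN_ASSUMED	14U	/* Ethernet (L2) header length */

/* max_mtu() - Hypothetical helper: largest MTU allowed for a backend
 * @l2_max_len:	Maximum L2 frame length for the backend, header included
 *
 * Return: minimum of IP_MAX_MTU and the L3 room left in a maximum frame
 */
static size_t max_mtu(size_t l2_max_len)
{
	size_t l3_room;

	if (l2_max_len < ETH_HLEN_ASSUMED)
		return 0;

	/* The MTU counts L3 bytes only, so take off the L2 header */
	l3_room = l2_max_len - ETH_HLEN_ASSUMED;

	return l3_room < IP_MAX_MTU ? l3_room : IP_MAX_MTU;
}
```

With the current 65535-byte frame limit on every backend this gives 65535 - 14 = 65521, while a backend allowing larger frames would be capped by IP_MAX_MTU instead.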