On Wed, 17 May 2023 11:15:06 +1000
David Gibson <david(a)gibson.dropbear.id.au> wrote:
On Tue, May 16, 2023 at 11:42:09PM +0200, Stefano Brivio wrote:
On Tue, 16 May 2023 15:06:29 +1000
David Gibson <david(a)gibson.dropbear.id.au> wrote:
On Sun, May 14, 2023 at 08:14:05PM +0200, Stefano Brivio wrote:
This series, along with pseudo-related fixes, enables:
- optional copy of all routes from selected interface in outer
namespace, to (hopefully!) fix the issue reported by Callum at:
https://github.com/containers/podman/issues/18539
- optional copy of all addresses, mostly for consistency. It doesn't,
however, enable assignment of multiple addresses in the sense
requested at:
https://bugs.passt.top/show_bug.cgi?id=47
because the addresses still need to be present on the host, and
the "outer" address isn't selected depending on the address used
inside the container
- operation without a gateway address, to (again, hopefully) support
usage of Wireguard endpoints established outside the container,
https://bugs.passt.top/show_bug.cgi?id=49
I tested the single functionalities introduced here, but I didn't
try to reproduce the setups where the issues were reported, so some
help with testing is definitely fundamental here. Thanks.
I've sent reviews for some of the simpler patches in this series which
make sense even without the context of the overall aim. I think those
can be applied immediately.
Those are actually the least important patches for users
Well, granted.
-- and I can't
apply 6/10 without breaking Podman's CI plus probably a number of
deployments (that's why it comes after 5/10)... so, no, I would rather
not apply the rest for the moment.
Uh.. true, 6/10 is problematic, but I think the other easy ones could
be applied safely enough.
With two hands and (worryingly close to) just 24 hours in a day, I
honestly can't picture even a quick rebase and retest for those being a
priority, while happily keeping around the issue that 5/10 fixes.
For the rest of the series, I want to address the generalities before
doing detailed review of the implementation.
I think the basic idea here is sound: we want to expose anything
routable to the host as routable to the guest, even when the host has
a more complex routing setup than just a netmask on the "main"
interface and a default gateway within that prefix.
The intentions behind this series are actually slightly different:
- we have a complete breakage in a seemingly common use case (I would
even say cloud-init setups in general), and I'd like to fix that
sooner rather than later
Well, sure, but we should at least think about where we're going with
this longer term, so we don't box ourselves in.
I don't think this is going to "box us in" -- I'm just proposing to
change this after about two years, and we can definitely change it
again, as long as things keep working. If keeping things working is
boxing us in, well, I can't see that as a bad thing.
That is, I don't see people doing screen-scraping of 'ip route show'
and writing applications around that, and surely not adapting
applications to what pasta does. Not at the moment, and surely not for
a long while.
- this concerns only the direct configuration pasta does, with
--config-net. What we advertise is definitely related, but not the
same topic... to the point that the issues fixed by this series don't
even occur with a DHCP client:
https://github.com/containers/podman/issues/18539#issuecomment-1545023424
Ah, interesting. It looks like dhclient (or rather dhclient-script, I
expect) is adding an explicit /32 route to the default gateway. It
seems to me the best quick fix for --config-net is to do the same
thing. Basically rather than expanding the netmask as we did in 6/10,
if the gateway address is not in the interface's netmask add a /32 or
/128 route to the gateway.
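To make the check described above concrete, here is a minimal sketch of the "is the gateway on-link?" test, assuming we only have the interface address, its prefix length and the gateway address. The function name and the 172.31.1.100 address are made up for illustration (only 172.31.1.1 comes from the report); this is not passt code.

```python
import ipaddress


def gateway_host_route(address, prefix_len, gateway):
    """Return the /32 or /128 prefix to add for the gateway, or None if on-link.

    Hypothetical helper: names and signature are illustrative only.
    """
    iface = ipaddress.ip_interface(f"{address}/{prefix_len}")
    gw = ipaddress.ip_address(gateway)
    if gw.version != iface.version:
        raise ValueError("address family mismatch")
    if gw in iface.network:
        # Gateway already covered by the interface prefix: nothing to add
        return None
    # Single-address route: /32 for IPv4, /128 for IPv6
    return ipaddress.ip_network(f"{gw}/{gw.max_prefixlen}")
```

With a /32 address on the interface, any gateway other than the address itself falls outside the prefix, so the helper returns a host route for it.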
That's the first option I considered (of course!):
https://github.com/containers/podman/issues/18539#issuecomment-1545260780
https://github.com/containers/podman/issues/18539#issuecomment-1546967377
and only as I started implementing it, I realised that we can have
anyway chained dependencies which aren't that easy to handle, especially
if we admit an arbitrary number of routes and we need to sort them.
Plus it's going to be 1. more code 2. actually "complicated" code. This
is stupidly simple instead.
I have some experience of fixing IPv6 FIB code in kernel with
consequences on sorting/selection, and there are just so many hidden
details involved in interpretation of Linux-style routes and ways of
shadowing them.
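As an illustration of the ordering problem only (a sketch under stated assumptions, not code from the series and not how the kernel validates next hops): one way to avoid sorting routes is to retry the ones that fail until a pass makes no progress. `try_install` is a hypothetical callback, and `make_model_table` is a toy model in which a gateway is accepted only if an already installed gateway-less route covers it.

```python
import ipaddress


def install_with_retries(routes, try_install):
    """Install routes in any order, retrying until a pass makes no progress.

    try_install(route) is a hypothetical callback returning True on success.
    Returns the list of routes that could never be installed.
    """
    pending = list(routes)
    while pending:
        failed = [r for r in pending if not try_install(r)]
        if len(failed) == len(pending):
            # Nothing was installed in this pass: give up on the rest
            return failed
        pending = failed
    return []


def make_model_table():
    """Toy routing table: a gateway must sit on an installed on-link route."""
    onlink = []

    def try_install(route):
        dst, gw = route
        if gw is None:
            onlink.append(ipaddress.ip_network(dst))
            return True
        return any(ipaddress.ip_address(gw) in net for net in onlink)

    return try_install
```

In this model, a default route via 172.31.1.1 listed before the 172.31.1.1/32 on-link route fails on the first pass and succeeds on the second, without any explicit dependency sorting.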
And, in general, we can't advertise everything we can configure (say,
a route without router over DHCP).
Ah, true. The DHCP options for static routes are even more limited
than I realized. Ok, that nixes option B.3.
I'd be much more careful about what we advertise. We have direct
control of what we configure via netlink, but for DHCP, NDP, DHCPv6,
we need to think of possible interpretations and common half-bugs as
well.
But I think we want to think a bit more deeply about exactly what we
need/want to expose here.
Even with the current code, the default gateway address we advertise
to the guest is kind of meaningless: the guest cannot directly access
that gateway, everything really goes through passt on the host.
In the simplest, probably most common network setups, that's actually
the gateway that connects our guest to other nodes.
I don't understand what you mean by this. Yes, we have the same IP
for the gateway that the host sees, but the NAT to host means that we
can't even talk to the gateway at L4.
It's disabled by default in Podman. It's the default behaviour in
passt because this started from KubeVirt and that's what they expect,
but that's about it. Once the address is configurable, this is not a
valid point, in general.
A gateway doesn't need to be a host, and it's very often, functionally,
not a host. This is by design: RFC 791, 2.2:
In a gateway the higher level protocols need not be implemented and
the GGP functions are added to the IP module.
Literally the only thing the guest kernel will do with that gateway
address is put it into ARP and neighbour discovery packets, which passt
will resolve to its own MAC, like nearly every other IP.
No, the guest kernel might also have netfilter rules, specifying that
gateway address, that were originally designed for the host, as if guest
and passt didn't exist. Those might happily use the gateway address to
represent the notion of gateway.
For other cases, I think we should eventually implement
https://bugs.passt.top/show_bug.cgi?id=47 anyway, and it goes without
saying that, then, we can't just use the same host route no matter what
the container chooses. We'll need to match them.
Oh.. I'm wondering if I've been confusing things by using "host route" in two
different ways: one being "a route taken from the passt host system"
and the other meaning "a route to a single network host, that is /32
or /128".
I agree that we should move to allowing multiple IPs on the guest
side, but I don't see how that conflicts with the routing issue here.
It's related because, once we allow them, different host routes should
actually be used, so having them in the guest/container too,
aiming at a 1:1 mapping, should simplify things rather than become
misleading.
I mean, I'm not saying that the behaviour from this series is complete
and self-consistent, just that it works around obvious, urgent issues
and at the same time it looks like we'll probably need something
similar to support further use cases.
Adding a /32 or /128 route to the gateway seems a simpler way to do
that to me. Plus it matches the behaviour that DHCP seems to be doing
anyway.
For the reason why it's not really simple, see above. About what DHCP
clients do: that's in general not the case for udhcpc, dhcpcd, pump, or
NetworkManager. See also e1c94637ad50 ("dhcp: Send option 121 if the
default gateway is not on the assigned subnet").
As far as I know, that's just what the script used in conjunction with
ISC's dhclient does on _some_ distributions.
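For reference, this is roughly what an option 121 payload looks like on the wire, following the classless static route encoding from RFC 3442: prefix length, then only the significant octets of the destination, then the router address. It's a minimal sketch with a made-up function name, not the code from the commit above, and it produces the option payload only (no option code or length byte).

```python
import ipaddress


def encode_option_121(routes):
    """Encode (destination, router) pairs as a DHCP option 121 payload.

    Illustrative helper following the RFC 3442 layout; the name is made up.
    """
    payload = bytearray()
    for dst, router in routes:
        net = ipaddress.IPv4Network(dst)
        # Only ceil(prefixlen / 8) octets of the destination are sent
        significant = (net.prefixlen + 7) // 8
        payload.append(net.prefixlen)
        payload += net.network_address.packed[:significant]
        payload += ipaddress.IPv4Address(router).packed
    return bytes(payload)
```

A default route via 172.31.1.1 encodes to five bytes: a zero prefix length, no destination octets, and the four octets of the router. Every entry carries a router field, which is part of why this option is less expressive than what we can configure via netlink.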
This works because the gateway address (like everything) will ARP/NDP
to passt's host side MAC address and once the packets hit passt it
doesn't matter what the guest thought the routing was going to be.
I think we have a few choices in two more-or-less orthogonal
categories.
A) What routable prefixes do we advertise to the guest?
A.1) Always a default route (0.0.0.0/0 and ::/0)
We tell the guest that every address is routable via the passt
interface, regardless of routing setup on the host. This essentially
tells the guest to delegate all routing responsibility to passt.
Advantages:
* Simple
* No need to update anything if routing configuration on the host
changes
Disadvantages:
* If addresses are unroutable from the host, the guest will only
know via ICMP/ICMPv6, rather than statically, which may be a worse
UX on the guest side. Plus we might need to actually implement
those host unreachable ICMPs.
* Might be messy if the guest has multiple interfaces - e.g. if we
allow passt to be configured to attach to a specific host
interface only, then we have multiple passts attached to a single
guest: they'd all be advertising a default route.
A.2) Copy routable prefixes from the host to the guest
I'm having a hard time figuring out the definition of this point. How
would you define that? Strictly speaking, in the case at hand, nothing
is routable: we have a /32 address.
Right.. which means that if the host is working, it must have an
additional static route - also probably /32 - telling it how to get to
the gateway. Indeed I can see it in the bug, initial comment:
172.31.1.1 dev ens3 proto static scope link metric 100
With A.2 we'd copy that route to the guest - or at least one with the
same prefix (which is a single address in this case).
Blindly copying routes is one thing. Figuring out what subnets are
routable and which ones aren't is a different matter, and _that_ is
what I don't consider generally feasible.
We just advertise those prefixes routable to the host to the guest
(which might include an empty prefix == default route).
Advantages:
* Guest statically knows what addresses are routable via the passt
interface
Disadvantages:
* What do we do with overlapping prefixes? On the host we might
have more specific routes pointing to a specific interface. For
the guest they all point to the passt interface, so what's the
point?
* Can we advertise an arbitrary set of static routes via all our
mechanisms (--config-net, DHCP, NDP+DHCPv6)? Even if we can it
adds more complexity to that code
* How do we update things if the host routing configuration changes?
* What do we do if the host has source-based routing or other
advanced stuff set up?
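To illustrate the overlapping prefixes point: if every prefix ends up pointing at the same (passt) interface, and gateways and metrics are ignored, more specific prefixes carry no extra information for the guest. A small sketch, using made-up prefixes and a made-up function name, and assuming all prefixes belong to one address family:

```python
import ipaddress


def collapse_for_single_interface(prefixes):
    """Drop prefixes already covered by a shorter one and merge adjacent ones.

    Illustrative only: this ignores gateways, metrics and source-based rules,
    and expects prefixes of a single address family.
    """
    nets = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]
```

For example, 10.1.0.0/16 disappears once 10.0.0.0/8 is present, and as soon as the host has a default route, the whole set collapses to that default route, which is effectively A.1.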
B) What gateway, if any, do we advertise for each route?
B.1) Copy it from the host
Advantages:
* Guest L3 configuration resembles that of the host
...which is a fundamental design goal of passt: transparency, and
pretending it doesn't exist. Otherwise we can have a route, a bridge,
an interface, etc.
Well... we want to be transparent for anything visible at L4. For
things only visible at L3 - like routes, it's not possible for things
to look 100% identical, so I think we have some wiggle room in exactly
what we do.
With this series, in most cases, things will actually be 100% identical.
But no, the transparency design goal applies especially to L3:
https://passt.top/passt/about/#motivation
let alone ALGs, which are probably less common nowadays -- but with
service meshes (increasingly common), L3 transparency is very helpful
to have. Otherwise libslirp would be absolutely enough from that
perspective.
Now, while there are use cases that rely on different aspects of this
transparency (KubeVirt and service mesh integration) I understand this
might sound a bit dogmatic, because you might say there are more
important use cases (which I'm not aware of) or supposed benefits.
What's far less dogmatic, though, is how many issues we happily and
automatically avoid by relying on the sanity of the host networking
configuration.
By trying to copy it as close as possible, we avoid one very important
source of issues, which is our interpretation or possible lack of
knowledge about how applications we don't know about chose to interact
with kernel and network setups. The main case fixed by this series
shows exactly that: I think it's broken, but it works, and users
expect it to work.
And by trusting the host configuration we don't lose much: if that's
broken, almost everything else is broken anyway.
It's not a question of "trust" in the host configuration, it's the
fact that parts of the host configuration don't make sense in the
guest's context. Most obviously the interface names from the host
routes can't be used in the guest.
By default, with containers, even interface names are copied. Indices,
of course, aren't, but that's not something users or applications
typically try to fiddle with.
We can and do use the same
addresses for the routers, but what does it really mean? The guest
can't actually contact them as neighbours - when it tries they just
ARP to passt's fake MAC and the packets get routed by the host kernel
regardless of what router the guest was trying to send them to - in
fact neither passt nor the host kernel will even know what router the
guest thought it was using.
This is an L2 matter, which was never a problem for any project using
this.
Disadvantages:
* If the host route doesn't have a gateway we have to fall back on
B.2 or B.3 anyway
Well, they are a particular case of B.1 then: what's the disadvantage?
Two cases is more complex than one.
...but, with this series, we don't implement two different cases...?
This is consistent (especially with this series, and especially if we
start adapting the *default* behaviours in this sense).
* Misleading: in fact everything is routed by passt and the host
before it reaches any gateway we're listing here
But passt isn't supposed to be a router...? Let's say we have multiple
routes on the host, we configure or advertise multiple routes to the
guest. Does that make passt a router? I don't think so: we're just
associating them as closely as possible, without fancy interpretations.
A router has its own routing table, passt's would simply be a copy.
Right now it has essentially none.
Sorry, by "passt" here I really meant the host kernel, which
absolutely will route the packets. There's no guarantee they'll even
go next to the router the guest thought it was using, although it's
likely.
Right. I'd just try to make it as likely as possible. That doesn't come
with this series, and, for instance, me(a)yawnt.com already checked and
told me this series isn't enough for the "regular" Wireguard case (with
the endpoint in the outer namespace), but this clearly appears to be
getting closer to what we'll need to "naturally" support that.
B.2) Pick an address to represent passt as gateway
Advantages:
* Accurately represents that everything is routed by passt
This is configurable, actually, but no, I insist that passt isn't
*functionally* routing anything, or at least that we should get as
close as possible to that.
Again, the host kernel definitely will, and there's no avoiding that.
* We can make this the same as the NAT-to-host address, so we only
have one "magic" address (per AF)
Not really, if it's configurable.
I mean one per passt instance, not one globally. As opposed to the
gateway address and the NAT-to-host address being potentially
different magic addresses in a single instance.
Disadvantages:
* Have to allocate an address that's safe, which is tricky (but we
usually want this for NAT-to-host anyway)
There's a difference between picking an address by default and letting
the user configure one. Besides, at least for IPv4, I don't think such
an address exists.
There certainly isn't one we can use everywhere. I think we have some
options for probing one that will be safe in a particular case.
Well... that doesn't sound great.
* Do we want just one address, or one for each distinct gateway from
the host?
* If we can't pick something in the interface's "natural" prefix, we
will also need to advertise a static route to reach it.
B.3) Don't advertise a gateway for any route
passt essentially proxy ARPs for the entire internet.
Advantages:
* No need to allocate an address - in fact passt need not have any
guest facing IP at all
* Extends naturally if we ever have a guest<->passt transport that's
point-to-point rather than pseudo-ethernet
Disadvantages:
* Guest ARP / neighbour tables could get real big
...it would also break a number of applications that peek at netlink
(or do ioctl()s) to check they are in fact online.
Uh.. what exactly are they looking at? We'd still have at least one
route, they just wouldn't have gateways attached to them. But as you
pointed out above I don't think we can do this with DHCP, which pretty
much kills it anyway.
The status quo is, roughly, A.1+B.1, except that we also enforce that
the host must have a default route, which sidesteps one of the
complications of B.1. IIUC, this series is implementing A.2+B.1.
Thinking about it, I'm moderately convinced that B.1 is a bad idea.
I'm leaning towards B.2 - combining it with the NAT-to-host cleanups
to have a more concrete guest-visible address for passt itself - but
I'm also open to B.3.
...that, especially B.3, sounds like another tool, or at least like
another mode, because it conflicts quite a bit with design goals.
They're different from design _choices_ in the sense that that's what
I've been "selling" to users and what I and others have been
implementing in integrations so far.
So the ways L4 transparency is valuable (including guest address) are
pretty clear to me. Are there also cases where the (partial) L3
transparency matters? They're certainly not obvious to me.
Absolutely, and I thought I explained this a number of times, but...
service meshes using netfilter. Applications being moved from "host" to
containers, or from containers to VMs. That's where we want to pretend
nothing changes, and that usually causes more L3 headaches rather than
L4.
I'm not sure about A.1 vs. A.2. I was leaning towards A.2, but on
further consideration, I feel like the fact that A.1 automatically
works for routing changes on the host might outweigh the fact that the
guest only gets limited information (ICMP) about what's routable.
I don't think A.2 is doable,
?? AFAICT this series is doing A.2
Not really, we're not figuring out routable prefixes, just blindly
copying routing entries. It's closer to A.2 than A.1, but it's not
that, either.
--
Stefano