Re: [PATCH 00/22] RFC: Allow configuration of special case NATs

20 Aug 2024


      On Mon, Aug 19, 2024 at 03:01:00PM +0200, Stefano Brivio wrote:
...
On Mon, 19 Aug 2024 19:52:49 +1000
David Gibson  wrote:
...
On Mon, Aug 19, 2024 at 11:27:49AM +0200, Stefano Brivio wrote:
...
On Mon, 19 Aug 2024 18:46:31 +1000
David Gibson  wrote:
...
On Fri, Aug 16, 2024 at 03:39:41PM +1000, David Gibson wrote:
...
Based on Stefano's recent patch for faster tests.
Allow the user to specify which addresses are translated when used by
the guest, rather than always being the gateway address or nothing.
We also allow this remapping to go to the host's global address (more
precisely the address assigned to the guest) rather than just host
loopback.
Suggestions for better names for the new options in patches 20 & 22
are most welcome.
Along the way to implementing that make many changes to clarify what
various addresses we track mean, fixing a number of small bugs as
well.
NOTE: there is a bug in 21/22 which breaks some of the passt_tcp perf
tests.  I haven't managed to figure out why it's causing the problem,
or even what the exact triggering conditions are (running the single
stalling iperf alone doesn't do it).  Have to wrap up for today, so I
thought I'd get this out for review anyway.
I've identified the bug here.  IMO, it's a pre-existing problem that
only works by accident at the moment.  The immediate fix is pretty
obvious, but it raises some broader questions
The problem arises because of the MTU changes we make in order to test
throughput with different packet sizes.  Specifically we change the
MTU to values < 1280, which implicitly disables IPv6 since it requires
an MTU >= 1280.  When we change the MTU back to a larger value IPv6 is
re-enabled, but some configuration has been lost in the meantime.
After the MTU is restored the guest reconfigures with NDP, but does
not re-DHCPv6.  That means the guest gets a SLAAC address in the right
prefix but not the exact /128 address we've tried to assign to it.
However, at least with the sequence of things we have in the tests,
the guest never sends any packets with the new address, so passt
doesn't update addr_seen.  When the inbound connection comes we send
it to the assigned address instead of the guest's actual address and
the guest rejects it.
I still have to take a closer look, but I'm fairly sure I hit a similar
issue while I was writing these tests originally. I pondered
reconfiguring the address via DHCPv6, or using the keep_addr_on_down
sysctl (net.ipv6.conf.<interface>.keep_addr_on_down), which was added
around that time.
Then:
...
This "worked" previously, because before this patch, passt would
translate the inbound connection to have source/dest as link-local
addresses.
...I realised that this worked and forgot about the whole issue.
...
We *do* have a current addr_ll_seen because (a) it won't
change if the guest doesn't change MAC and (b) when IPv6 is re-enabled
the NDP traffic the guest generates will have link-local addresses
that update addr_ll_seen.  With this patch, and a global address for
--map-host-loopback, we now need to send to addr_seen instead of
addr_ll_seen, hence exposing the bug.
In the short term, the obvious fix would be to re-run dhclient -6 in
the guest after we twiddle MTU but before running IPv6 tests.
I guess setting keep_addr_on_down (even for "all" interfaces) should
work as well.
Sounds like it.  I wasn't aware of that one.
/me tests..  actually, no it doesn't work..
# sysctl -a | grep keep_addr_on_down
net.ipv6.conf.all.keep_addr_on_down = 1
net.ipv6.conf.default.keep_addr_on_down = 1
net.ipv6.conf.dummy0.keep_addr_on_down = 1
net.ipv6.conf.lo.keep_addr_on_down = 0
# ip addr add 2001:db8::1 dev dummy0
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0:  mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8::1/128 scope global 
       valid_lft forever preferred_lft forever
# ip link set dummy0 mtu 1200
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0:  mtu 1200 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
# ip link set dummy0 mtu 1500
# ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: dummy0:  mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether c2:02:f2:79:f9:94 brd ff:ff:ff:ff:ff:ff
My guess is that IPv6 being deconfigured because of an unsuitable MTU
is considered a different event from a mere "down".
I guess it's because they're not IFA_F_PERMANENT, because
addrconf_permanent_addr() has:
case NETDEV_CHANGEMTU:
                /* if MTU under IPV6_MIN_MTU stop IPv6 on this interface. */
                if (dev->mtu < IPV6_MIN_MTU) {
                        addrconf_ifdown(dev, dev != net->loopback_dev);
                        break;
                }
but addrconf_ifdown() does:
if (!keep_addr ||
                                    !(ifa->flags & IFA_F_PERMANENT) ||
                                    addr_is_local(&ifa->addr)) {
                                        hlist_del_init_rcu(&ifa->addr_lst);
                                        goto restart;
                                }
I'm not sure about the logic behind that. We could actually set those
addresses as permanent once the DHCPv6 client configures them, if it's
cleaner.
Huh.  Not in the passt/VM case, though, which is where I actually
encountered this.
...
...
...
...
This kind of opens a question about how hard we should try to
accomodate guests which don't configure themselves how we told them.
There's a notable distinction between guests temporarily diverging (in
different ways) and guests we don't configure at all.
I'm not really sure what you're getting at here.
In this case, it's not true that the guest doesn't configure itself in
the way we requested -- it's just a temporary diversion from that
configuration.
Oh, I see.  Assuming that at some point the DHCP client will re-run.
...
Those are different cases that we can handle in different ways, I
think. If it's a glitch that will only happen during testing, let's
work around that.
But if the guest really ignores DHCPv6 information, I think we should
keep that working.
...
...
It's probably more important to ensure we use the right type of address
"type" in what sense here?
Global unicast instead of link-local.
Ok.
...
...
...
(security) rather than ensuring we somehow manage to deliver packets at
any time (minor glitch otherwise), also because the one you describe is
something we're unlikely to hit outside of tests.
...
Personally I'd be ok with saying that nothing works if the guest
doesn't configure itself properly, thereby removing addr_seen and
addr_ll_seen entirely.  But I think, Stefano, you've been against that
idea in the past.
Yes, I still think we should support guests that don't use DHCPv6 or
NDP at all,
Well, you still wouldn't *need* DHCPv6 or NDP, but you'd have to
manually configure the interface in the guest to match the address
you've configured with -a.  Just like you'd expect to have to
correctly configure your address on a real network.
True, but if we make correctness as optional as possible, we'll be more
compatible (less time spent by users fixing situations that don't
necessarily need fixing, less time spent by developers to look into
reports, no matter who's at fault).
Eh, maybe.  Unless us trying to make sense of a nonsense situation
causes some unpredictable behaviour that breaks something else.
...
...
...
or where related exchanges fail for any reason. It improves
reliability and compatibility at a small cost. In this case, I think
it's a nice feature that we would resume communicating as soon as the
guest shows its global unicast address.
Hm, maybe.  I'm not entirely convinced the cost is so small long term.
It's pretty badly incompatible with having multiple guests behind the
same passt instance: such as the initial guest bridging or routing to
nested guests.
Why? We will need to hash the interface/guest index anyway, for
outbound flows.
If we have separate interfaces for each guest, yes.  But not if we
have multiple guests behind a single tap because the initial guest
sets up a bridge or routing.  Then we have nothing but the address.
...
And for inbound flows, if a guest steals the address of another guest,
we'll give priority to the normal 'addr' versions instead of the
'_seen' ones, to decide how to direct traffic.
I don't see how we'd know we're in this situation, so when to
prioritise which address over the other.
...
...
I'm actually not sure if encountering this bug makes me more or less
in favour of addr_seen.  On the one hand I think it highlights the
flakiness of this approach; there are situations where we just won't
know the right address.
I don't understand this argument: indeed, there are such situations,
and they are annoying. Why should we make them more common?
Because predictability is good, and working _most_ of the time is a
failure of predictability.
...
...
On the other hand if shows a relatively
plausible case where the guest won't get exactly the address we want
it to (it uses NDP but not DHCPv6)
Hrm... actually this also shows a potential danger in the recent
patches to disable DAD in the guest.  With DAD enabled, when the guest
grabs a new address, we'd expect it to emit DAD messages, which would
have the side effect of updating our addr_seen (although I'm pretty
sure I hit this patch before the nodad patches were applied, so that
doesn't seem to be foolproof).
Well, but we do that for containers with --config-net only. In that
case, the addresses we configure have infinite lifetime anyway.
Oh, good point.  Hrm... then I'm unsure why the guest wasn't re-DADing
its new address.
...
Besides, I don't think we need to have addr_seen updated as quickly and
correctly as possible just for the sake of it, we can also update it
when we get any other neighbour solicitation because the guest is
actually using the network. It's not meant to be perfect.
If the guest is a pure server (a common case for containers AFAICT),
then I don't know that we can expect NS messages for anything other
than the default gateway, which is (typically) link-local and so won't
help us to learn the new global address.
...
...
We could maybe update addr_seen when we send RA messages to the guest
- assuming that it will use the same host part (low 64-bits) for both
link-local and global addresses.  Not sure if that's a widely safe
assumption or not.
I don't understand: what case are you trying to cover with this?
A case just like the one in the tests: the interface bounces, and we
get NDP traffic on the link-local address, but nothing on the global
address before an inbound connection.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson