Re: [PATCH v3 4/4] fwd: Direct inbound spliced forwards to the guest's external address

17 Oct 2024

On Thu, 17 Oct 2024 12:19:58 +1100
David Gibson  wrote:
...
On Wed, Oct 16, 2024 at 05:26:48PM +0200, Stefano Brivio wrote:
...
On Wed, 16 Oct 2024 19:39:40 +1100
David Gibson  wrote:
...
On Wed, Oct 16, 2024 at 04:46:52PM +1100, David Gibson wrote:
...
On Wed, Oct 16, 2024 at 02:15:19PM +1100, David Gibson wrote:
...
On Thu, Oct 10, 2024 at 04:57:32PM +1100, David Gibson wrote:
...
On Wed, Oct 09, 2024 at 10:44:33PM +0200, Stefano Brivio wrote:    
> On Wed, 9 Oct 2024 15:07:21 +0200
> Stefano Brivio  wrote:    
[snip]    
> > > @@ -447,20 +447,35 @@ uint8_t fwd_nat_from_host(const struct ctx *c, uint8_t proto,
> > >  	    (proto == IPPROTO_TCP || proto == IPPROTO_UDP)) {
> > >  		/* spliceable */
> > >  
> > > -		/* Preserve the specific loopback adddress used, but let the
> > > -		 * kernel pick a source port on the target side
> > > +		/* The traffic will go over the guest's 'lo' interface, but by
> > > +		 * default use its external address, so we don't inadvertently
> > > +		 * expose services that listen only on the guest's loopback
> > > +		 * address.  That can be overridden by --host-lo-to-ns-lo which
> > > +		 * will instead forward to the loopback address in the guest.
> > > +		 *
> > > +		 * In either case, let the kernel pick the source address to
> > > +		 * match.
> > >  		 */
> > > -		tgt->oaddr = ini->eaddr;
> > > +		if (inany_v4(&ini->eaddr)) {
> > > +			if (c->host_lo_to_ns_lo)
> > > +				tgt->eaddr = inany_loopback4;
> > > +			else
> > > +				tgt->eaddr = inany_from_v4(c->ip4.addr_seen);
> > > +			tgt->oaddr = inany_any4;
> > > +		} else {
> > > +			if (c->host_lo_to_ns_lo)
> > > +				tgt->eaddr = inany_loopback6;
> > > +			else
> > > +				tgt->eaddr.a6 = c->ip6.addr_seen;      
> > 
> > Either this...
> >     
> > > +			tgt->oaddr = inany_any6;      
> > 
> > or this (and not something before this patch, up to 3/4) make the
> > "TCP/IPv6: host to ns (spliced): big transfer" test in pasta/tcp hang,
> > sometimes (about one in three/four runs), that's what I mistakenly
> > reported as coming from Laurent's series at:
Huh, interesting.  Just got back from my leave and ran that group of
tests in a loop this afternoon, but didn't manage to reproduce.  I
have administrivia that will probably fill the rest of this week, but
I'll look into this as soon as I can.
I reproduced the problem on passt.top, and I have a partial idea
what's going on.  As you say it's seeming like the address (addr_seen
== addr in this case) isn't properly ready.  This is over splice, but
on the tap interface, I see the container sending NS messages for its
own address - seems like it's doing DAD.  But more importantly, we're
answering those NS messages with NA messages, because we answer all
NS.  i.e. we're making the DAD fail.  What I'm not sure of is how this
ever worked at all.  --config-net makes sense, since we disable DAD,
but our test suite has always been using NDP+DHCP instead of
--config-net.
So, AFAICT, we'll always fail guest DAD attempts, both for IPv6, which
happens most of the time, and for IPv4 via ARP, which is used much more
rarely.  I think we need to be more selective in what NS or ARP
lookups we respond to.  The question is what approach to take:
Hmm... no.. there's more to this.
Usually DAD requests have :: as the source address, and we *do*
exclude those from getting replies.  In this case though, we're
getting NS requests for the assigned address from what looks like the
SLAAC address.  So, I do think it would be wise to explicitly exclude
these: we shouldn't be giving NA responses for an address that ought
to belong to the guest, even if it doesn't look like a DAD.
But, I'm not sure what's triggering this.  Is the DHCP address for
some reason not "taking", so the container is trying to locate it on
the network instead?  Or _is_ this DAD, but under some circumstances
it uses another configured address as the source rather than ::?
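
To make the proposed NS filtering concrete, here is a minimal sketch,
not the actual passt code.  The function name ndp_should_answer_ns()
and its parameters are hypothetical, and it assumes the guest's
assigned address is known to the caller.  It skips the NA reply both
for classic DAD probes (source ::) and for any NS whose target is the
guest's own address, whatever the source:

```c
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, for illustration only (not part of passt):
 * decide whether a Neighbour Solicitation should get an NA reply.
 *
 * src:        source address of the NS
 * target:     target address carried in the NS
 * guest_addr: address assigned to the guest (assumed known here)
 */
bool ndp_should_answer_ns(const struct in6_addr *src,
			  const struct in6_addr *target,
			  const struct in6_addr *guest_addr)
{
	/* Classic DAD probe: unspecified source, never answer */
	if (IN6_IS_ADDR_UNSPECIFIED(src))
		return false;

	/* Target belongs to the guest: answering would claim the
	 * guest's own address, even if this doesn't look like DAD
	 */
	if (!memcmp(target, guest_addr, sizeof(*target)))
		return false;

	return true;
}
```

With a check along these lines, an NS sent from the SLAAC address for
the DHCPv6-assigned address would go unanswered, while lookups for any
other address would still get a reply.
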
Ok.. I've understood a bit more.  While timing is a factor here, it
looks like the main reason I wasn't seeing it on my machine is what
I'd consider a bug in the Debian version of the dhclient-script:
when adding an IPv6 address, it returns without waiting for DAD to
complete (i.e. for the address to be non-tentative).
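
As a rough illustration of what "waiting for the address to be
non-tentative" involves on Linux, the sketch below checks one line of
/proc/net/if_inet6 for the tentative flag.  This is not what any
dhclient-script does; the helper name if_inet6_line_tentative() is
made up for this example, and it assumes the usual six-field line
format (address, ifindex, prefix length, scope, flags, device name,
numeric fields in hex).  A caller would poll until the flag clears:

```c
#include <stdio.h>
#include <string.h>

/* Same value as IFA_F_TENTATIVE in <linux/if_addr.h> */
#define TENTATIVE_FLAG	0x40

/* Hypothetical helper, for illustration only.
 *
 * line:     one line as read from /proc/net/if_inet6
 * addr_hex: address to look for, 32 lowercase hex digits, no colons
 *
 * Return: 1 if the line is for addr_hex and the address is tentative,
 *         0 if it is for addr_hex and DAD is no longer pending,
 *         -1 if the line is malformed or for another address
 */
int if_inet6_line_tentative(const char *line, const char *addr_hex)
{
	unsigned int ifindex, plen, scope, flags;
	char addr[33], dev[32];

	if (sscanf(line, "%32s %x %x %x %x %31s",
		   addr, &ifindex, &plen, &scope, &flags, dev) != 6)
		return -1;

	if (strcmp(addr, addr_hex))
		return -1;

	return !!(flags & TENTATIVE_FLAG);
}
```
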
Oops. On one hand, I would feel inclined to propose a fix for the
Debian and Ubuntu packages. On the other hand, I wonder if it's
universally considered a bug: the DHCPv6 client did its job at that
point, and it's debatable whether dhclient should wait for the address
to be usable before forking to background.
That is, arguably, dhclient's job is to request and configure an
address. It's not a network configuration daemon. There might be many
other reasons why that address is unusable, and yet dhclient is not
responsible for them.
Hrm... I guess.  Counterpoints..
 - Most other failures to get a usable address will result in a
   visible error
 - dhclient has a --dad-wait-time option which seems to imply that the
   script should wait for DAD
 - The upstream script version waits for DAD
In any case I filed a report for it
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231
...
By the way, I guess it's just an issue for test scripts like this one.
Why do you guess that?
Because it's kind of rare that your address changes if you use DHCPv6,
I guess, so this would be relevant almost exclusively at boot.

And, at boot, if a remote peer/client happens to try to connect to the
machine where the client is running right after an address was
assigned, it must have a retry mechanism almost for sure.

-- 
Stefano