Re: Pesto protocol proposals

5 Mar 2026

      On Thu, Mar 05, 2026 at 02:19:53AM +0100, Stefano Brivio wrote:
...
On Wed, 4 Mar 2026 15:28:30 +1100
David Gibson  wrote:
...
Most of today and yesterday I've spent thinking about the dynamic
update model and protocol.  I certainly don't have all the details
pinned down, let alone any implementation, but I have come to some
conclusions.
# Shadow forward table
On further consideration, I think this is a bad idea.  To avoid peer
visible disruption, we don't want to destroy and recreate listening
sockets
(Side note: if it's just *listening* sockets, is this actually that
bad?)
Well, it's obviously much less bad than interrupting existing
connections.  It does mean a peer attempting to connect at the wrong
moment might get an ECONNREFUSED which is, as far as it knows, a
permanent error.
...
...
that are associated with a forward rule that's not being altered.
After reading the rest of your proposal, as long as:
...
Doing that with a shadow table would mean we'd need to essentially
diff the two tables as we switch.  That seems moderately complex,
...this is the only downside (I can't think of others though), and I
don't think it's *that* complex: as I mentioned, it would be an O(n^2)
step that can probably be optimised (via sorting) to O(n * log(m)) with
n new rules and m old rules, cycling on new rules and creating listening
sockets (we need this part anyway) unless we find (marking it
somewhere temporarily) a matching one...
I wasn't particularly concerned about the computational cost.  It was
more that I couldn't quickly see a clear approach with unambiguous
semantics.  But, I think I came up with one now, see later.
...
...
and
kind of silly when the client will almost certainly have created the
shadow table using specific adds/removes from the original table.
...even though this is true conceptually, at least at a first glance
(why would I send 11 rules to add a single rule to a table of 10?), I
think the other details of the implementation, and conceptual matters
(such as rollback and two-step activation) make this apparent silliness
much less relevant, and I'm more and more convinced that a shadow table
is actually the simplest, most robust, least bug-prone approach.
Especially:
...
# Rule states / active bit
I think we *do* still want two stage activation of new rules:
...this part, which led to a huge number of bugs over the years in nft
/ nftables updates, which also use separate insert / activate / commit
/ deactivate / delete operations.
Huh, interesting.  I wasn't aware of that, and it's pretty persuasive.
...
It's extremely complicated to grasp and implement properly, and you end
up with a lot of quasi-diffing anyway (to check for duplicates in
ranges, for example).
It makes much more sense in nftables because you can have hundreds of
megabytes of data stored in tables, but any usage that was ever
mentioned for passt in the past ~5 years would seem to imply at most
hundreds of kilobytes per table.
Shifting complexity to the client is also a relevant topic for me, as we
decided to have a binary client to avoid anything complicated (parsing)
in the server. A shadow table allows us to shift even more complexity
to the client, which is important for security.
I definitely agree in principle - what I wasn't convinced about was
that the overall balance actually favoured the client, because of my
concern over the complexity of that "diff"ing.  But
...
I haven't finished drafting a proposal based on this idea, but I plan to
do it within one day or so.
Actually, you convinced me already, so I can do that.
...
It won't be as detailed, because I don't think it's realistic to come
up with all the details before writing any of the code (what's the
point if you then have to throw away 70% of it?) but I hope it will be
complete enough to provide a comparison.
By the way, at least at a first approximation, closing and reopening
listening sockets will mostly do the trick for anything our users
(mostly via Podman) will ever reasonably want, so I have half a mind to
keep it like that in a first proposal, but indeed we should make
sure there's a way around it, which is what is taking me a bit more
time to demonstrate.
With some more thought I saw a way of doing the "diff" that looks
pretty straightforward and reasonable.  Moreover it's less churn of
the existing code, and works nicely with close-and-reopen as an
interim step.  It even provides socket continuity for arbitrarily
overlapping ranges in the old and new tables.

For close and re-open, we can implement COMMIT as:
	1. fwd_listen_close() on old table
	2. fwd_listen_sync() on new table

I think we can get socket continuity by swapping the order of those
steps and extending fwd_sync_one() to do:
	for each port:
	    if <already opened>:
	        nothing to do
<new>	    else if <matching open socket in old table>:
<new>	        steal socket for new table
            else:
	        open/bind/listen new socket

The "steal" would mark the fd as -1 in the old table so
fwd_listen_close() won't get rid of it.

I think the check for a matching socket in the old table will be
moderately expensive, O(n), but not so much as to be a problem in
practice.
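
As a rough illustration of the swapped order (a sketch only, not actual
passt code), the block below models a table as a flat array of
(port, fd) entries.  Every name in it (sketch_table, sketch_sync(),
sketch_close(), sketch_commit()) is a hypothetical stand-in for the
real forward table and for fwd_listen_sync() / fwd_listen_close(), and
a counter replaces the real socket()/bind()/listen() and close() calls
so the logic can be followed without touching the network.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX 32	/* arbitrary table size, for illustration only */

struct sketch_entry {
	unsigned short port;
	int fd;		/* listening socket, or -1 if not open */
};

struct sketch_table {
	struct sketch_entry e[SKETCH_MAX];
	size_t count;
};

static int sketch_next_fd = 100;		/* fake fd allocator */
static int sketch_opened, sketch_closed;	/* make the effect observable */

/* Linear scan of the old table: this is the O(n) matching step */
static struct sketch_entry *sketch_find_open(struct sketch_table *t,
					     unsigned short port)
{
	for (size_t i = 0; i < t->count; i++)
		if (t->e[i].port == port && t->e[i].fd >= 0)
			return &t->e[i];
	return NULL;
}

/* Open what the table needs, stealing matching sockets from old_tbl */
static void sketch_sync(struct sketch_table *tbl, struct sketch_table *old_tbl)
{
	assert(tbl->count <= SKETCH_MAX);

	for (size_t i = 0; i < tbl->count; i++) {
		struct sketch_entry *match = NULL;

		if (tbl->e[i].fd >= 0)
			continue;		/* already opened */

		if (old_tbl)
			match = sketch_find_open(old_tbl, tbl->e[i].port);

		if (match) {
			tbl->e[i].fd = match->fd;	/* steal the socket */
			match->fd = -1;			/* close step skips it */
		} else {
			tbl->e[i].fd = sketch_next_fd++; /* "open/bind/listen" */
			sketch_opened++;
		}
	}
}

/* Close whatever is still open in the table */
static void sketch_close(struct sketch_table *tbl)
{
	for (size_t i = 0; i < tbl->count; i++) {
		if (tbl->e[i].fd < 0)
			continue;
		tbl->e[i].fd = -1;		/* real code would close() */
		sketch_closed++;
	}
}

/* COMMIT with the steps swapped: sync the new table first, then close */
static void sketch_commit(struct sketch_table *new_tbl,
			  struct sketch_table *old_tbl)
{
	sketch_sync(new_tbl, old_tbl);
	sketch_close(old_tbl);
}
```

With an old table listening on ports 80 and 443 and a new table that
wants 443 and 8080, sketch_commit() should open one socket (8080),
close one (80), and carry the 443 socket over untouched.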
...
...
[...]
# Suggested client workflow
I suggest the client should:
1. Parse all rule modifications
2. INSERT all new rules
   -> On error, DELETE them again
3. DEACTIVATE all removed rules
   -> Should only fail if the client has done something wrong
4. ACTIVATE all new rules
   -> On error (rule conflict):
      DEACTIVATE rules we already ACTIVATEd
      ACTIVATE rules we already DEACTIVATEd
      DELETE rules we INSERTed
5. Check for bind errors (see details later)
   If there are failures we can't tolerate:
      DEACTIVATE rules we already ACTIVATEd
      ACTIVATE rules we already DEACTIVATEd
      DELETE rules we INSERTed
6. DELETE rules we DEACTIVATEd
   -> Should only fail if the client has done something wrong
DEACTIVATE comes before ACTIVATE to avoid spurious conflicts between
new rules and rules we're deleting.
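
To make the rollback paths in that workflow easier to follow, here is a
minimal sketch under heavy assumptions: the four operations are
modelled as local functions on an in-memory array rather than protocol
messages, a rule is reduced to a port plus a three-value state, and a
"conflict" simply means two active rules share a port.  All of the
wf_* names and the state enum are hypothetical.  Step 5 (bind errors)
is left out.

```c
#include <stdbool.h>
#include <stddef.h>

enum wf_state { WF_NONE, WF_INACTIVE, WF_ACTIVE };

struct wf_rule {
	unsigned short port;
	enum wf_state state;
};

/* Hypothetical server-side operations; each returns false on error */
static bool wf_insert(struct wf_rule *r)
{
	if (r->state != WF_NONE)
		return false;
	r->state = WF_INACTIVE;
	return true;
}

static bool wf_delete(struct wf_rule *r)
{
	if (r->state != WF_INACTIVE)
		return false;
	r->state = WF_NONE;
	return true;
}

static bool wf_deactivate(struct wf_rule *r)
{
	if (r->state != WF_ACTIVE)
		return false;
	r->state = WF_INACTIVE;
	return true;
}

static bool wf_activate(struct wf_rule *r, const struct wf_rule *tbl, size_t n)
{
	if (r->state != WF_INACTIVE)
		return false;
	for (size_t i = 0; i < n; i++)	/* conflict: same port already active */
		if (&tbl[i] != r && tbl[i].state == WF_ACTIVE &&
		    tbl[i].port == r->port)
			return false;
	r->state = WF_ACTIVE;
	return true;
}

/* Client side: steps 2, 3, 4 and 6, with rollback on any failure */
static bool wf_update(struct wf_rule *tbl, size_t n,
		      struct wf_rule **add, size_t n_add,
		      struct wf_rule **del, size_t n_del)
{
	size_t ins = 0, deact = 0, act = 0, i;

	for (ins = 0; ins < n_add; ins++)		/* 2. INSERT */
		if (!wf_insert(add[ins]))
			goto rollback;
	for (deact = 0; deact < n_del; deact++)		/* 3. DEACTIVATE */
		if (!wf_deactivate(del[deact]))
			goto rollback;
	for (act = 0; act < n_add; act++)		/* 4. ACTIVATE */
		if (!wf_activate(add[act], tbl, n))
			goto rollback;
	for (i = 0; i < n_del; i++)			/* 6. DELETE */
		wf_delete(del[i]);
	return true;

rollback:
	for (i = 0; i < act; i++)	/* undo ACTIVATEs done so far */
		wf_deactivate(add[i]);
	for (i = 0; i < deact; i++)	/* restore rules we DEACTIVATEd */
		wf_activate(del[i], tbl, n);
	for (i = 0; i < ins; i++)	/* drop rules we INSERTed */
		wf_delete(add[i]);
	return false;
}
```

In this model, replacing an active rule on port 443 with a new rule on
the same port succeeds only because the old rule is deactivated first.
Adding a second 443 rule without removing the first fails at step 4
and leaves the table as it was.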
I think that gets us closeish to "as atomic as we can be", at least
from the perspective of peers.  The main case it doesn't catch is that
we don't detect rule conflicts until after we might have removed some
rules.  Is that good enough?
I think it is absolutely fine as an outcome, but the complexity of error
handling in this case is a bit worrying. This is exactly the kind of
thing (and we discussed it already a couple of times) that made and
makes me think that a shadow table is a better approach instead.
I'll work on a more concrete proposal based on the shadow table
approach.  There are still some wrinkles with how to report bind()
errors with this scheme to figure out.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson