On Thu, Feb 20, 2025 at 11:38:00AM +0100, Stefano Brivio wrote:
> On Thu, 20 Feb 2025 21:18:06 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Feb 20, 2025 at 09:07:26AM +0100, Stefano Brivio wrote:
> > > On Thu, 20 Feb 2025 17:03:18 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > 
> > > > Migrating TCP flows requires passt-repair in order to use
> > > > TCP_REPAIR.  If passt-repair is not started, our failure mode is
> > > > pretty ugly though: we'll attempt the migration, hitting various
> > > > problems when we can't enter repair mode.  In some cases we may
> > > > not roll back these changes properly, meaning we break network
> > > > connections on the source.
> > > > 
> > > > Our general approach is not to completely block migration if
> > > > there are problems, but simply to break any flows we can't
> > > > migrate.  So, if we have no connection from passt-repair, carry
> > > > on with the migration, but don't attempt to migrate any TCP
> > > > connections.
> > > > 
> > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > ---
> > > >  flow.c | 11 +++++++++--
> > > >  1 file changed, 9 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/flow.c b/flow.c
> > > > index 6cf96c26..749c4984 100644
> > > > --- a/flow.c
> > > > +++ b/flow.c
> > > > @@ -923,6 +923,10 @@ static int flow_migrate_repair_all(struct ctx *c, bool enable)
> > > >  	union flow *flow;
> > > >  	int rc;
> > > >  
> > > > +	/* If we don't have a repair helper, there's nothing we can do */
> > > > +	if (c->fd_repair < 0)
> > > > +		return 0;
> > > > +
> > > 
> > > This doesn't fix the behaviour in a relatively likely failure
> > > mode: passt-repair is there, but we can't communicate to it (LSM
> > > policy issues or similar).
> > 
> > Ah... true.  Although it shouldn't make it any worse for that case,
> > right, so that could be a separate fix.
> 
> Sure.
> 
> > > In that case, unconditionally terminating on failure in the
> > > rollback function:
> > > 
> > > 	if (tcp_flow_repair_off(c, &flow->tcp))
> > > 		die("Failed to roll back TCP_REPAIR mode");
> > > 
> > > 	if (repair_flush(c))
> > > 		die("Failed to roll back TCP_REPAIR mode");
> > > 
> > > isn't a very productive thing to do: we go from an uneventful
> > > failure where flows were not affected at all to a guest left
> > > without connectivity.
> > 
> > So, the issue is that leaving sockets in repair mode after we leave
> > the migration path would be very bad.
> 
> Why? I really can't see anything catastrophic happening as a result
> of that (hence my v12 version of this). Surely not as bad as the
> guest losing connectivity without any possible recovery.

I was meaning specifically trying to carry on using a repair mode
socket as if it wasn't in repair mode, not closing it out.

> > We can't easily close sockets/flows for which that's the case,
> > because the batching means if there's a failure we don't actually
> > know which sockets are in which mode, hence the die().
> 
> They can be closed (via tcp_rst()) anyway. If they're in repair mode,
> no RST will reach the peer, and if they aren't, it will.

Ah, yes, I guess we can just close anything that might be affected.
Brutal, but effective.  It will disrupt connectivity, of course, but
not as badly as dying completely.

Ok.  I'll work on something to do that.
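Something like this, maybe — a rough, untested sketch only, where
flow_migrate_repair_abort() and the foreach_tcp_flow() iterator are
placeholders for whatever we actually end up with:

	/* On rollback failure, reset every TCP flow we might have
	 * switched, instead of die()ing.  tcp_rst() is safe in either
	 * mode: if the socket is still in repair mode the RST never
	 * reaches the peer, but the flow is closed out locally, so we
	 * never carry on using a repair mode socket as a normal one.
	 */
	static void flow_migrate_repair_abort(struct ctx *c)
	{
		union flow *flow;

		foreach_tcp_flow(flow) {
			/* Best effort, errors don't matter any more */
			tcp_flow_repair_off(c, &flow->tcp);
			tcp_rst(c, &flow->tcp);
		}
		repair_flush(c);
	}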
> That starts looking less robust than the alternative (e.g. what I
> implemented in v12: silently fail and continue) at least without
> https://patchew.org/QEMU/20250217092550.1172055-1-lvivier@redhat.com/
> in a general case as well: if we continue, we'll have hanging flows
> that will expire on timeout, but if we don't, again, we'll have a
> guest without connectivity.
> 
> I understand that leaving flows around for that long might present a
> relevant inconsistency, though.
> 
> So I'm wondering about some alternatives: actually, the rollback
> function shouldn't be called at all in this case. Or it could just
> (indirectly) call tcp_rst() on all the flows that were possibly
> affected.

Right, yes, that's what I'm suggesting.

Making it be a safe no-op if we never managed to turn repair on for
anything would make sense to me.  Unfortunately, in this situation we
won't see an error until we do a repair_flush() which means we now
don't know the state of any sockets we already passed to repair_set().

We could, I suppose, close all flows that we passed to repair_set() by
the time we see the error.  If we have < one batch's worth of
connections that will kill connectivity almost as much as die()ing,
though.  I guess it will come back without needing qemu to restart us,
though, so that's something.

> > This sort of thing is, incidentally, why I did way back suggest the
> > possibility of passt-repair reporting failures per-fd, rather than
> > just per-batch.
> 
> Sorry, I somehow missed that proposal, and I can't find any trace of
> it.

It may have just been on IRC somewhere.

> But anyway, the problem is that if we fail to read a batch for any
> reason (invalid ancillary data... maybe always implying a kernel
> issue, but I'm not sure), you can't _reliably_ report per-fd
> failures. *Usually*, you can. Worth it?

Ah, I see.  We could handle that by being able to report both per-fd
and "whole batch" failure (equivalent to failure on every fd), but
that would complexify the protocol, of course.

> In any case, if it's simple, we can still do it, because passt and
> passt-repair are distributed together. You can't pass back the file
> descriptors via SCM_RIGHTS though, because we want to close() them
> before we reply.
> 
> Another alternative could be that passt-repair reverts back the
> state of the file descriptors that were already switched, on
> failure.

That might help a bit, we'd still need to rework the passt-side
interface to know what needs reverting at the right stage.

I'll take those ideas and see what I can come up with today.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson