On Thu, 20 Feb 2025 17:03:18 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Migrating TCP flows requires passt-repair in order to use TCP_REPAIR. If passt-repair is not started, our failure mode is pretty ugly though: we'll attempt the migration, hitting various problems when we can't enter repair mode. In some cases we may not roll back these changes properly, meaning we break network connections on the source. Our general approach is not to completely block migration if there are problems, but simply to break any flows we can't migrate. So, if we have no connection from passt-repair carry on with the migration, but don't attempt to migrate any TCP connections. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- flow.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/flow.c b/flow.c index 6cf96c26..749c4984 100644 --- a/flow.c +++ b/flow.c @@ -923,6 +923,10 @@ static int flow_migrate_repair_all(struct ctx *c, bool enable) union flow *flow; int rc; + /* If we don't have a repair helper, there's nothing we can do */ + if (c->fd_repair < 0) + return 0; +This doesn't fix the behaviour in a relatively likely failure mode: passt-repair is there, but we can't communicate to it (LSM policy issues or similar). In that case, unconditionally terminating on failure in the rollback function: if (tcp_flow_repair_off(c, &flow->tcp)) die("Failed to roll back TCP_REPAIR mode"); if (repair_flush(c)) die("Failed to roll back TCP_REPAIR mode"); isn't a very productive thing to do: we go from an uneventful failure where flows were not affected at all to a guest left without connectivity. That starts looking less robust than the alternative (e.g. what I implemented in v12: silently fail and continue) at least without https://patchew.org/QEMU/20250217092550.1172055-1-lvivier@redhat.com/ in a general case as well: if we continue, we'll have hanging flows that will expire on timeout, but if we don't, again, we'll have a guest without connectivity. I understand that leaving flows around for that long might present a relevant inconsistency, though. So I'm wondering about some alternatives: actually, the rollback function shouldn't be called at all in this case. Or it could just (indirectly) call tcp_rst() on all the flows that were possibly affected. -- Stefano