Re: [PATCH v3 17/20] vhost_user: Make source quit after reporting migration state

3 Feb 2025


      On Mon, 3 Feb 2025 19:52:37 +1100
David Gibson  wrote:
...
On Mon, Feb 03, 2025 at 07:09:32AM +0100, Stefano Brivio wrote:
...
On Mon, 3 Feb 2025 12:55:47 +1100
David Gibson  wrote:
...
On Fri, Jan 31, 2025 at 08:39:50PM +0100, Stefano Brivio wrote:
...
On migration, the source process asks passt-helper to set TCP sockets
in repair mode, dumps the information we need to migrate connections,
and closes them.
At this point, we can't pass them back to passt-helper using
SCM_RIGHTS, because they are closed, from that perspective, and
sendmsg() will give us EBADF. But if we don't clear repair mode, the
port they are bound to will not be available for binding in the
target.
Terminate once we're done with the migration and we reported the
state. This is equivalent to clearing repair mode on the sockets we
just closed.
As noted on the passt-repair patch, I think this is based on a
misinterpreation of the situation.  I think the problem is that the
sockets aren't closed in passt-repair, so the additional handle copy
is keeping the underlying socket open.  This appears to work, because
it is causing passt-repair to also terminate.
Right, exactly that.
...
That said, we probably want to terminate on the source side after a
succesful migrate anyway.  At the very least we need to close() all
our sockets, and delete the corresponding flows, because we don't own
them any more.  Quitting is probably the simplest way to do that.
I'm not sure if there's an established behaviour for helpers supporting
state migration.
By "helper" do you mean passt as a device helper to qemu, or
passt-repair as a helper to passt.  For the latter I wouldn't expect
so - it's only a weirdness of our situation that we need passt-repair
at all.  If the former, I'm not really sure what you're after.
I meant passt and similar. Is there any convention we should adopt?
...
...
We could probably close sockets, delete flows, and keep things up and
running for the rest (restart from a clean situation), but at that
point we already the guest networking is already broken in a number of
ways. So, yeah, maybe let's keep this instead.
So, I realised it's a bit more complicated than that.  We need to
identify exactly where the "point of no return" is.  I'll discuss in
our call tonight.
I think it's simply where we close sockets, by the way.

-- 
Stefano