On Wed, Jul 23, 2025 at 11:17:08AM +0200, Stefano Brivio wrote:
On Wed, 23 Jul 2025 10:27:30 +1000 David Gibson
wrote: On Tue, Jul 22, 2025 at 11:12:49PM +0200, Stefano Brivio wrote: [snip]
I think a preferable approach would be to EPOLL_CTL_DEL each socket at the point we would close() them with --migrate-no-linger. I'd be fine with that as a follow up improvement, though.
I was pondering about adding this on top of the ignore_data_events trick, but, actually, the whole thing as of this patch is somewhat bogus because
- we're ignoring events on TCP sockets (intentional),
- we're ignoring events on the tap device (who cares, migration is only supported with vhost-user)
- but *not* ignoring events on the vhost-user kick descriptor (oversight).
On a second thought, it doesn't look safe to ignore events on the kick descriptor, and in any case, with this change, we don't want to prevent the guest to send out further packets. It's not expected anyway.
So I just replaced the whole thing with EPOLL_CTL_DEL (epoll_del()) as we go through the sockets. It's simpler and arguably safer.
Yes, that's what I had in mind as well (I thought I put that in the mail, but it looks like I didn't). Just one additional concern that I don't think need hold up merge: do we also need to epoll_del() our listening sockets?
Right, I had that thought as well, and this was somewhat covered by the first version because we'd ignore events on those. But that would still come with the risk of epoll_wait() loops.
Exactly.
In the perspective of a simple implementation / fix mostly intended for KubeVirt such as this one, we know that the guest is suspended at that point, so we'll send a SYN to it, nothing comes back, we'll eventually time out in 10 seconds and try to reset the connection (in the unlikely case we're still running *and* getting traffic at that point).
Hrm. Maybe. Wouldn't surprise me if there are still some weird edge cases in her, but, yes, I suspect they're not a huge deal.
And I guess that's fine because if we're still running and getting traffic after that timeout something must have gone wrong, so sending a RST segment doesn't look that bad to me. If the connection was meanwhile established with the target node, I think our RST segment won't actually reset the connection because sequence numbers won't match.
But surely it's not elegant and I think we should eventually have an explicit implementation of the whole thing, perhaps with a new socket state ("MIGRATED"?) and going through the whole list of listening sockets and... closing them, I suppose?
Right. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson