[PATCH v3] treewide: By default, don't quit source after migration, keep sockets open
We are hitting an issue in the KubeVirt integration where some data is
still sent to the source instance even after migration is complete. As
we exit, the kernel closes our sockets and resets connections. The
resulting RST segments are sent to peers, effectively terminating
connections that were meanwhile migrated.
At the moment, this is not done intentionally, but in the future
KubeVirt might enable OVN-Kubernetes features where source and
destination nodes are explicitly getting mirrored traffic for a while,
in order to decrease migration downtime.
By default, don't quit after migration is completed on the source: the
previous behaviour can be enabled with the new, but deprecated,
--migrate-exit option. After migration (as source), the -1 / --one-off
option has no effect.
Also, by default, keep migrated TCP sockets open (in repair mode) as
long as we're running, and ignore events on any epoll descriptor
representing data channels. The previous behaviour can be enabled with
the new, equally deprecated, --migrate-no-linger option.
By keeping sockets open, and not exiting, we prevent the kernel
running on the source node from sending out RST segments if further
data reaches us.
Reported-by: Nir Dothan
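For illustration only (this is not passt's actual code): a minimal sketch of the mechanism described above, assuming a connection socket s that has just been migrated out and the main epoll descriptor epollfd; the function name and the exact error handling are made up for the example.

#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>

#ifndef TCP_REPAIR
#define TCP_REPAIR 19	/* from linux/tcp.h, in case libc headers lack it */
#endif

/* Keep a migrated-out connection socket open but silent: leave it in
 * repair mode (needs CAP_NET_ADMIN) so it doesn't transmit anything on
 * its own, and stop watching it so further events on it are ignored.
 */
int linger_migrated(int epollfd, int s)
{
	int one = 1;

	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &one, sizeof(one)) < 0) {
		perror("setsockopt TCP_REPAIR");
		return -1;
	}

	if (epoll_ctl(epollfd, EPOLL_CTL_DEL, s, NULL) < 0) {
		perror("epoll_ctl EPOLL_CTL_DEL");
		return -1;
	}

	/* Note: no close(s) here, the descriptor stays open until exit */
	return 0;
}

The point is what the function does not do: it never calls close(s), so the source kernel keeps a socket bound to the connection's 4-tuple and, as described above, doesn't answer late segments with RST.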
On Thu, Jul 24, 2025 at 07:28:58PM +0200, Stefano Brivio wrote:
[snip]
Signed-off-by: Stefano Brivio
---
v2:
  - assorted changes in commit message
  - context variable ignore_linger becomes ignore_no_linger
  - new options are deprecated
  - don't ignore events on some descriptors, drop them from epoll

v3:
  - Nir reported occasional failures (connections being reset) with both
    v1 and v2, because, in KubeVirt's usage, we quit as QEMU exits.
    Disable --one-off after migration as source, and document this
    exception
This seems like an awful, awful hack. We're abandoning consistent semantics on a wild guess as to what the layers above us need.

Specifically, --once-off used to mean that the layer above us didn't need to manage passt's lifetime; it was tied to qemu's. Now it still needs to manually manage passt's lifetime, so what's the point. So, if it needs passt to outlive qemu it should actually manage that and not use --once-off.

Requiring passt to outlive qemu already seems pretty dubious to me: having the source still connected when passt was quitting is one thing - indeed it's arguably hard to avoid. Having it still connected when *qemu* quits is much less defensible.

--
David Gibson (he or they)        | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au   | minimalist, thank you, not the other way
                                 | around.
http://www.ozlabs.org/~dgibson
On Fri, 25 Jul 2025 14:04:17 +1000 David Gibson wrote:
On Thu, Jul 24, 2025 at 07:28:58PM +0200, Stefano Brivio wrote:
[snip]
This seems like an awful, awful hack.
Well, of course, it is, and long term it should be fixed in either KubeVirt or libvirt (even though I'm not sure how, see below) instead.
We're abandoning consistent semantics on a wild guess as to what the layers above us need.
No, not really, we tested this and tested the alternative.
Specifically, --once-off used to mean that the layer above us didn't
--one-off
need to manage passt's lifetime; it was tied to qemu's. Now it still needs to manually manage passt's lifetime, so what's the point. So, if it needs passt to outlive qemu it should actually manage that and not use --once-off.
The main point is that it does *not* manually manage passt's lifetime if there's no migration (which is the general case for libvirt and all other users). We don't have any other user with an implementation of the migration workflow anyway (libvirt itself doesn't do that, yet). It's otherwise unusable for KubeVirt. So I'd say let's fix it for the only user we have.
Requiring passt to outlive qemu already seems pretty dubious to me: having the source still connected when passt was quitting is one thing - indeed it's arguably hard to avoid. Having it still connected when *qemu* quits is much less defensible.
The fundamental problem here is that there's an issue in KubeVirt (and working around it is the whole point of this patch) which implies that packets are sent to the source pod *for a while* after migration. We found out that the guest is generally suspended during that while, but sometimes it might even have already exited. The pod remains, though, as long as it's needed. That's the only certainty we have.

So, do we want to drop --one-off from the libvirt integration, and have libvirt manage passt's lifecycle entirely (note that all users outside KubeVirt don't use migration, so we would make the general case vastly more complicated for the sake of correctness for a single usage...)?

Well, we can try to do that. Except that libvirt doesn't know either for how long this traffic will reach the source pod (that's a KubeVirt concept). So it should implement the same hack: let it outlive QEMU on migration... as long as we have that issue in KubeVirt.

But I asked KubeVirt people, and it turns out that it's extremely complicated to fix this in KubeVirt. So, actually, I don't see another way to fix this in the short term. And without KubeVirt using this we could also drop the whole feature...

--
Stefano
On Fri, Jul 25, 2025 at 07:10:58AM +0200, Stefano Brivio wrote:
On Fri, 25 Jul 2025 14:04:17 +1000 David Gibson wrote:
On Thu, Jul 24, 2025 at 07:28:58PM +0200, Stefano Brivio wrote:
[snip]
This seems like an awful, awful hack.
Well, of course, it is, and long term it should be fixed in either KubeVirt or libvirt (even though I'm not sure how, see below) instead.
But this hack means that even when it's fixed we'll still have this wildly counterintuitive behaviour that every future user will have to work around. There's no sensible internal reason for out-migration to affect lifetime; it's a workaround for problems that are quite specific to this stack of layers above.
We're abandoning consistent semantics on a wild guess as to what the layers above us need.
No, not really, we tested this and tested the alternative.
With just one use case. Creating semantics to work with exactly how something is used now, without thought to whether they make sense in general, is the definition of fragile software.
Specifically, --once-off used to mean that the layer above us didn't
--one-off
need to manage passt's lifetime; it was tied to qemu's. Now it still needs to manually manage passt's lifetime, so what's the point. So, if it needs passt to outlive qemu it should actually manage that and not use --once-off.
The main point is that it does *not* manually manage passt's lifetime if there's no migration (which is the general case for libvirt and all other users).
That's exactly my point. With this hack it's neither one model nor the other so you have to be aware of both.
We don't have any other user with an implementation of the migration workflow anyway (libvirt itself doesn't do that, yet). It's otherwise unusable for KubeVirt. So I'd say let's fix it for the only user we have.
Please not at the expense of forcing every future user to deal with this suckage.
Requiring passt to outlive qemu already seems pretty dubious to me: having the source still connected when passt was quitting is one thing - indeed it's arguably hard to avoid. Having it still connected when *qemu* quits is much less defensible.
The fundamental problem here is that there's an issue in KubeVirt (and working around it is the whole point of this patch) which implies that packets are sent to the source pod *for a while* after migration.
We found out that the guest is generally suspended during that while, but sometimes it might even have already exited. The pod remains, though, as long as it's needed. That's the only certainty we have.
Keeping the pod around is fine. What needs to change is that the guest's IP(s) needs to be removed from the source host before qemu (and therefore passt) is terminated. The pod must have at least one other IP, or it would be impossible to perform the migration in the first place. This essentially matches the situation for bridged networking: with the source guest suspended, the source host will no longer respond to the guest IP.
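For reference only, and not something this patch or the current KubeVirt/libvirt integration does: removing an address at this level is a single rtnetlink request. A rough, self-contained sketch, roughly equivalent to "ip addr del 192.0.2.1/24 dev eth0", with address, prefix length and interface name invented for the example (it needs CAP_NET_ADMIN and skips reading the netlink ACK):

#include <arpa/inet.h>
#include <linux/if_addr.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };
	struct {
		struct nlmsghdr nlh;
		struct ifaddrmsg ifa;
		char attrs[RTA_SPACE(sizeof(struct in_addr))];
	} req;
	struct rtattr *rta;
	int fd;

	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	memset(&req, 0, sizeof(req));
	req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifaddrmsg));
	req.nlh.nlmsg_type = RTM_DELADDR;
	req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	req.ifa.ifa_family = AF_INET;
	req.ifa.ifa_prefixlen = 24;			/* /24: made up */
	req.ifa.ifa_index = if_nametoindex("eth0");	/* name: made up */

	/* IFA_LOCAL attribute carrying the address to remove */
	rta = (struct rtattr *)((char *)&req + NLMSG_ALIGN(req.nlh.nlmsg_len));
	rta->rta_type = IFA_LOCAL;
	rta->rta_len = RTA_LENGTH(sizeof(struct in_addr));
	inet_pton(AF_INET, "192.0.2.1", RTA_DATA(rta));	/* address: made up */
	req.nlh.nlmsg_len = NLMSG_ALIGN(req.nlh.nlmsg_len) + rta->rta_len;

	if (sendto(fd, &req, req.nlh.nlmsg_len, 0,
		   (struct sockaddr *)&kernel, sizeof(kernel)) < 0)
		perror("sendto");

	close(fd);
	return 0;
}

Which component such a step would belong to (OVN-Kubernetes, virt-launcher, libvirt) is exactly what's discussed below.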
So, do we want to drop --one-off from the libvirt integration, and have libvirt manage passt's lifecycle entirely (note that all users outside KubeVirt don't use migration, so we would make the general case vastly more complicated for the sake of correctness for a single usage...)?
Hmm.. if I understand correctly the network swizzling is handled by KubeVirt, not libvirt. I'm hoping that means there's a suitable point at which it can remove the IP without having to alter libvirt.
Well, we can try to do that. Except that libvirt doesn't know either for how long this traffic will reach the source pod (that's a KubeVirt concept). So it should implement the same hack: let it outlive QEMU on migration... as long as we have that issue in KubeVirt.
But I asked KubeVirt people, and it turns out that it's extremely complicated to fix this in KubeVirt. So, actually, I don't see another way to fix this in the short term. And without KubeVirt using this we could also drop the whole feature...
--
David Gibson (he or they)        | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au   | minimalist, thank you, not the other way
                                 | around.
http://www.ozlabs.org/~dgibson
On Fri, 25 Jul 2025 16:50:23 +1000 David Gibson wrote:
On Fri, Jul 25, 2025 at 07:10:58AM +0200, Stefano Brivio wrote:
On Fri, 25 Jul 2025 14:04:17 +1000 David Gibson wrote:
On Thu, Jul 24, 2025 at 07:28:58PM +0200, Stefano Brivio wrote:
[snip]
This seems like an awful, awful hack.
Well, of course, it is, and long term it should be fixed in either KubeVirt or libvirt (even though I'm not sure how, see below) instead.
But this hack means that even when it's fixed we'll still have this wildly counterintuitive behaviour that every future user will have to work around.
No, why? We can change that as well. We changed semantics of options in the past and I don't see an issue doing that as long as we coordinate things to a reasonable extent (like we do with Podman and rootlesskit and with distributions and LSMs...). This is just to get things working properly in KubeVirt 1.6 as far as I'm concerned. Otherwise they might as well drop the whole feature (at least, that would be my recommendation).
There's no sensible internal reason for out-migration to affect lifetime; it's a workaround for problems that are quite specific to this stack of layers above.
We're abandoning consistent semantics on a wild guess as to what the layers above us need.
No, not really, we tested this and tested the alternative.
With just one use case.
...better than zero?
Creating semantics to work with exactly how something is used now, without thought to whether they make sense in general, is the definition of fragile software.
...better than useless?
Specifically, --once-off used to mean that the layer above us didn't
--one-off
need to manage passt's lifetime; it was tied to qemu's. Now it still needs to manually manage passt's lifetime, so what's the point. So, if it needs passt to outlive qemu it should actually manage that and not use --once-off.
The main point is that it does *not* manually manage passt's lifetime if there's no migration (which is the general case for libvirt and all other users).
That's exactly my point. With this hack it's neither one model nor the other so you have to be aware of both.
Current users except for KubeVirt use --one-off with that model, and we surely want and need to keep that. Now it turns out that there's an issue with KubeVirt and that (obvious) model, so here's a workaround for the only documented user of the migration feature, because it *currently* *needs* the other (obviously wrong) model.
We don't have any other user with an implementation of the migration workflow anyway (libvirt itself doesn't do that, yet). It's otherwise unusable for KubeVirt. So I'd say let's fix it for the only user we have.
Please not at the expense of forcing every future user to deal with this suckage.
That's not the case. We can (and really should) fix this in passt later. In any case we need to rework a fair amount of code here because, for example, as you mentioned, listening sockets are still there.
Requiring passt to outlive qemu already seems pretty dubious to me: having the source still connected when passt was quitting is one thing - indeed it's arguably hard to avoid. Having it still connected when *qemu* quits is much less defensible.
The fundamental problem here is that there's an issue in KubeVirt (and working around it is the whole point of this patch) which implies that packets are sent to the source pod *for a while* after migration.
We found out that the guest is generally suspended during that while, but sometimes it might even have already exited. The pod remains, though, as long as it's needed. That's the only certainty we have.
Keeping the pod around is fine. What needs to change is that the guest's IP(s) needs to be removed from the source host before qemu (and therefore passt) is terminated. The pod must have at least one other IP, or it would be impossible to perform the migration in the first place.
Maybe, yes. I'm not sure if it's doable.
This essentially matches the situation for bridged networking: with the source guest suspended the source host will no longer respond to the guest IP
So, do we want to drop --one-off from the libvirt integration, and have libvirt manage passt's lifecycle entirely (note that all users outside KubeVirt don't use migration, so we would make the general case vastly more complicated for the sake of correctness for a single usage...)?
Hmm.. if I understand correctly the network swizzling is handled by KubeVirt, not libvirt.
That's OVN-Kubernetes in KubeVirt's case.
I'm hoping that means there's a suitable point at which it can remove the IP without having to alter libvirt.
I hope so too, eventually. Or we could make sure that QEMU is alive as long as needed; this is probably easier to ensure from virt-launcher. I haven't looked at the details yet, but in passt it's one line and we can drop it later as needed, while in KubeVirt it's probably much more complicated than that.
Well, we can try to do that. Except that libvirt doesn't know either for how long this traffic will reach the source pod (that's a KubeVirt concept). So it should implement the same hack: let it outlive QEMU on migration... as long as we have that issue in KubeVirt.
But I asked KubeVirt people, and it turns out that it's extremely complicated to fix this in KubeVirt. So, actually, I don't see another way to fix this in the short term. And without KubeVirt using this we could also drop the whole feature...
-- Stefano
On Fri, Jul 25, 2025 at 10:21:12AM +0200, Stefano Brivio wrote:
On Fri, 25 Jul 2025 16:50:23 +1000 David Gibson wrote:
On Fri, Jul 25, 2025 at 07:10:58AM +0200, Stefano Brivio wrote:
On Fri, 25 Jul 2025 14:04:17 +1000 David Gibson wrote:

[snip]

v3:
  - Nir reported occasional failures (connections being reset) with both
    v1 and v2, because, in KubeVirt's usage, we quit as QEMU exits.
    Disable --one-off after migration as source, and document this
    exception
This seems like an awful, awful hack.
Well, of course, it is, and long term it should be fixed in either KubeVirt or libvirt (even though I'm not sure how, see below) instead.
But this hack means that even when it's fixed we'll still have this wildly counterintuitive behaviour that every future user will have to work around.
No, why? We can change that as well. We changed semantics of options in the past and I don't see an issue doing that as long as we coordinate things to a reasonable extent (like we do with Podman and rootlesskit and with distributions and LSMs...).
Ah, ok. That kind of changes everything. I thought this indicated committing to these semantics indefinitely. I think the co-ordination you're suggesting may be messier than you think, but if we're explicitly willing to (technically) break compatibility to remove this again, then, sure, ok. It's still gross, but fine.

[snip]
We found out that the guest is generally suspended during that while, but sometimes it might even have already exited. The pod remains, though, as long as it's needed. That's the only certainty we have.
Keeping the pod around is fine. What needs to change is that the guest's IP(s) needs to be removed from the source host before qemu (and therefore passt) is terminated. The pod must have at least one other IP, or it would be impossible to perform the migration in the first place.
Maybe, yes. I'm not sure if it's doable.
KubeVirt and/or libvirt really need to figure out how to make it doable, because having two simultaneously active things with the same IP on the same network is bound to cause trouble. At some point it's likely to cause trouble that we can't hack our way around in passt.
This essentially matches the situation for bridged networking: with the source guest suspended the source host will no longer respond to the guest IP
So, do we want to drop --one-off from the libvirt integration, and have libvirt manage passt's lifecycle entirely (note that all users outside KubeVirt don't use migration, so we would make the general case vastly more complicated for the sake of correctness for a single usage...)?
Hmm.. if I understand correctly the network swizzling is handled by KubeVirt, not libvirt.
That's OVN-Kubernetes in KubeVirt's case.
To clarify, I'm not talking so much about what actually makes the network arrangements, but about which component initiates the changeover.
I'm hoping that means there's a suitable point at which it can remove the IP without having to alter libvirt.
I hope so too, eventually. Or we could make sure that QEMU is alive as long as needed, this is probably easier to ensure from virt-launcher.
Right. There's kind of a basic mismatch here - libvirt aims to manage the whole migration, end-to-end, so you just tell it to migrate and it does. But in the k8s context, something else needs to do network jiggery-pokery co-ordinated with that, so it kind of breaks the model.

passt adds a further complication in that the host owns the IP, to share it with the guest. That means suspending the guest is no longer sufficient to prevent the source from responding on that IP.

So, maybe that means libvirt logically ought to be handling the IP assignment/de-assignment when migrating with passt. But... no doubt that would conflict with OVN-Kubernetes wanting to own the network configuration. Yet again, libvirt's wish to manage everything (in the manner of a Xen machine circa 2010) does not play well with the kinds of circumstances in which people try to use it.
I haven't looked at the details yet, but in passt it's one line and we can drop it later as needed, in KubeVirt it's probably much more complicated than that.
Well, we can try to do that. Except that libvirt doesn't know either for how long this traffic will reach the source pod (that's a KubeVirt concept). So it should implement the same hack: let it outlive QEMU on migration... as long as we have that issue in KubeVirt.
But I asked KubeVirt people, and it turns out that it's extremely complicated to fix this in KubeVirt. So, actually, I don't see another way to fix this in the short term. And without KubeVirt using this we could also drop the whole feature...
--
David Gibson (he or they)        | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au   | minimalist, thank you, not the other way
                                 | around.
http://www.ozlabs.org/~dgibson
Participants (2):
  - David Gibson
  - Stefano Brivio