On Tue, 11 Mar 2025 12:13:46 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Fri, Mar 07, 2025 at 11:41:29PM +0100, Stefano Brivio wrote:
> > ...and time out after that. This will be needed because of an
> > upcoming change to passt-repair enabling it to start before passt is
> > started, on both source and target, by means of an inotify watch.
> >
> > Once the inotify watch triggers, passt-repair will connect right
> > away, but we have no guarantees that the connection completes before
> > we start the migration process, so wait for it (for a reasonable
> > amount of time).
> >
> > Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
>
> I still think it's ugly, of course, but I don't see a better way, so:
>
> Reviewed-by: David Gibson <david(a)gibson.dropbear.id.au>
>
> > ---
> > v2:
> > - Use 10 ms as timeout instead of 100 ms. Given that I'm unable to
> >   migrate a simple guest with 256 MiB of memory and no storage other
> >   than an initramfs in less than 4 milliseconds, at least on my test
> >   system (rather fast CPU threads and memory interface), I think
> >   that 10 ms shouldn't make a big difference in case passt-repair
> >   is not available for whatever reason
>
> So, IIUC, that 4ms is the *total* migration time.

Ah, no, that's passt-to-passt in the migrate/basic test, to have a fair
comparison. That is:

$ git diff
diff --git a/migrate.c b/migrate.c
index 0fca77b..3d36843 100644
--- a/migrate.c
+++ b/migrate.c
@@ -286,6 +286,13 @@ void migrate_handler(struct ctx *c)
 	if (c->device_state_fd < 0)
 		return;
 
+#include <time.h>
+	{
+		struct timespec now;
+		clock_gettime(CLOCK_REALTIME, &now);
+		err("tv: %li.%li", now.tv_sec, now.tv_nsec);
+	}
+
 	debug("Handling migration request from fd: %d, target: %d",
 	      c->device_state_fd, c->migrate_target);

$ grep tv\: test/test_logs/context_passt_*.log
test/test_logs/context_passt_1.log:tv: 1741729630.368652064
test/test_logs/context_passt_2.log:tv: 1741729630.378664420

In this case it's 10 ms, but I can sometimes get 7 ms. This is with
512 MiB, but with 256 MiB I typically get 5 to 6 ms, and sometimes
slightly more than 4 ms. One flow or zero flows seem to make little
difference.

> The concern here is not that we add to the total migration time, but
> that we add to the migration downtime, that is, the time the guest is
> not running anywhere. The downtime can be much smaller than the total
> migration time.
>
> Furthermore qemu has no way to account for this delay in its estimate
> of what the downtime will be - the time for transferring device state
> is pretty much assumed to be negligible in comparison to transferring
> guest memory contents. So, if qemu stops the guest at the point that
> the remaining memory transfer will just fit in the downtime limit,
> any delays we add will likely cause the downtime limit to be missed
> by that much.
>
> Now, as it happens, the default downtime limit is 300 ms, so an
> additional 10 ms is probably fine (though 100 ms really wasn't).
> Nonetheless the reasoning above isn't valid.

~50 ms is actually quite easy to get with a few (8) gigabytes of
memory, that's why 100 ms also looked fine to me, but sure, 10 ms
sounds more reasonable.

-- 
Stefano
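P.S. In case the thread reads a bit abstract otherwise: the wait itself
is nothing fancy. Here is a rough sketch of the idea only, not the
actual patch (wait_for_repair_conn(), REPAIR_WAIT_MS and the
listening-socket handling are made up here for illustration):

#include <poll.h>
#include <sys/socket.h>

#define REPAIR_WAIT_MS	10	/* the timeout discussed above */

/* Wait briefly for a client (passt-repair) to connect on an already
 * listening UNIX domain socket. Return the accepted socket, or -1 if
 * nothing showed up within REPAIR_WAIT_MS, in which case we simply
 * proceed with the migration without it.
 */
static int wait_for_repair_conn(int listen_fd)
{
	struct pollfd pfd = { .fd = listen_fd, .events = POLLIN };

	if (poll(&pfd, 1, REPAIR_WAIT_MS) <= 0 || !(pfd.revents & POLLIN))
		return -1;

	return accept(listen_fd, NULL, NULL);
}

That is, the wait is bounded by a single poll() timeout, so the worst
case it can add to the downtime is REPAIR_WAIT_MS, whether or not
passt-repair ever connects.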