Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure

31 Jan 2025

      On Thu, 30 Jan 2025 19:54:17 +1100
David Gibson  wrote:
...
On Thu, Jan 30, 2025 at 09:32:36AM +0100, Stefano Brivio wrote:
...
On Thu, 30 Jan 2025 18:38:22 +1100
David Gibson  wrote:
...
Right, but in the present draft you pay that cost whether or not
you're actually using the flows.  Unfortunately a busy server with
heaps of active connections is exactly the case that's likely to be
most sensitve to additional downtime, but there's not really any
getting around that.  A machine with a lot of state will need either
high downtime or high migration bandwidth.
It's... sixteen megabytes. A KubeVirt node is only allowed to perform up
to _four_ migrations in parallel, and that's our main use case at the
moment. "High downtime" is kind of relative.
Certainly.  But I believe it's typical to aim for downtimes in the
~100ms range.
...
...
But, I'm really hoping we can move relatively quickly to a model where
a guest with only a handful of connections _doesn't_ have to pay that
128k flow cost - and can consequently migrate ok even with quite
constrained migration bandwidth.  In that scenario the size of the
header could become significant.
I think the biggest cost of the full flow table transfer is rather code
that's a bit quicker to write (I just managed to properly set sequences
on the target, connections don't quite "flow" yet) but relatively high
maintenance (as you mentioned, we need to be careful about every single
field) and easy to break.
Right.  And with this draft we can't even change the size of the flow
table without breaking migration.  That seems like a thing we might
well want to change.
Why? The size of the flow table hasn't changed since it was added. I
don't see a reason to improve this if we don't want to transfer the
flow table anyway.
...
...
I would like to quickly complete the whole flow first, because I think
we can inform design and implementation decisions much better at that
point, and we can be sure it's feasible,
That's fair.
...
but I'm not particularly keen
to merge this patch like it is, if we can switch it relatively swiftly
to an implementation where we model a smaller fixed-endian structure
with just the stuff we need.
So, there are kind of two parts to this:
1) Only transferring active flow entries, and not transferring the
   hash table
I think this is pretty easy.  It could be done with or without
preserving flow indicies.  Preserving makes for debug log continuity
between the ends, but not preserving lets us change the size of the
flow table without breaking migration.
I would just add prints on migration showing how old flow indices map
to new ones.
...
2) Only transferring the necessary pieces of each entry, and using a
   fixed representation of each piece
This is harder.  Not *super* hard, I think, but definitely trickier
than (1)
...
And again, to be a bit more sure of which stuff we need in it, the full
flow is useful to have implemented.
Actually the biggest complications I see in switching to that approach,
from the current point, are that we need to, I guess:
1. model arrays (not really complicated by itself)
So here, I actually think this is simpler if we don't attempt to have
a declarative approach to defining the protocol, but just write
functions to implement it.
...
2. have a temporary structure where we store flows instead of using the
   flow table directly (meaning that the "data model" needs to logically
   decouple source and destination of the copy)
Right.. I'd really prefer to "stream" in the entries one by one,
rather than having a big staging area.  That's even harder to do
declaratively, but I think the other advantages are worth it.
...
3. batch stuff to some extent. We'll call socket() and connect() once
   for each socket anyway, obviously, but sending one message to the
   TCP_REPAIR helper for each socket looks like a rather substantial
   and avoidable overhead
I don't think this actually has a lot of bearing on the protocol.  I'd
envisage migrate_target() decodes all the information into the
target's flow table, then migrate_target_post() steps through all the
flows re-establishing the connections.  Since we've already parsed the
protocol at that point, we can make multiple passes: one to gather
batches and set TCP_REPAIR, another through each entry to set the
values, and a final one to clear TCP_REPAIR in batches.
Ah, right, I didn't think of using the target flow table directly. That
has the advantage that the current code I'm writing to reactivate flows
from the flow table can be recycled as it is.
...
...
...
...
...
It's both easier to do
and a bigger win in most cases.  That would dramatically reduce the
size sent here.
Yep, feel free.
It's on my queue for the next few days.
To me this part actually looks like the biggest priority after/while
getting the whole thing to work, because we can start right with a 'v1'
which looks more sustainable.
And I would just get stuff working on x86_64 in that case, without even
implementing conversions and endianness switches etc.
Right.  Given the number of options here, I think it would be safest
to go in expecting to go through a few throwaway protocol versions
before reaching one we're happy enough to support long term.
To ease that process, I'm wondering if we should, add a default-off
command line option to enable migration.  For now, enabling it would
print some sort of "migration is experimental!" warning.  Once we have
a stream format we're ok with, we can flip it to on-by-default, but we
don't maintain receive compatibility for the experimental versions
leading up to that.
It looks like unnecessary code churn to me. It doesn't need to be
merged if it's work in progress. You can also push stuff to a temporary
branch if needed.

It can also be merged and not documented for a while, as long as it
doesn't break existing functionality.

-- 
Stefano