On Thu, 30 Jan 2025 18:38:22 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> Right, but in the present draft you pay that cost whether or not
> you're actually using the flows. Unfortunately a busy server with
> heaps of active connections is exactly the case that's likely to be
> most sensitive to additional downtime, but there's not really any
> getting around that. A machine with a lot of state will need either
> high downtime or high migration bandwidth.

It's... sixteen megabytes. A KubeVirt node is only allowed to perform
up to _four_ migrations in parallel, and that's our main use case at
the moment. "High downtime" is kind of relative.

> But, I'm really hoping we can move relatively quickly to a model
> where a guest with only a handful of connections _doesn't_ have to
> pay that 128k flow cost - and can consequently migrate ok even with
> quite constrained migration bandwidth. In that scenario the size of
> the header could become significant.

I think the biggest cost of the full flow table transfer is rather
code that's a bit quicker to write (I just managed to properly set
sequences on the target, connections don't quite "flow" yet) but
relatively high maintenance (as you mentioned, we need to be careful
about every single field) and easy to break.

I would like to quickly complete the whole flow first, because I
think we can inform design and implementation decisions much better
at that point, and we can be sure it's feasible. But I'm not
particularly keen to merge this patch as it is if we can switch
relatively swiftly to an implementation where we model a smaller
fixed-endian structure with just the stuff we need. And again, to be
a bit more sure of which stuff we need in it, having the full flow
implemented is useful.

Actually, the biggest complications I see in switching to that
approach, from the current point, are that we need to, I guess:

1. model arrays (not really complicated by itself)

2. have a temporary structure where we store flows, instead of using
   the flow table directly (meaning that the "data model" needs to
   logically decouple the source and the destination of the copy)

3. batch stuff to some extent. We'll call socket() and connect() once
   for each socket anyway, obviously, but sending one message to the
   TCP_REPAIR helper for each socket looks like a rather substantial
   and avoidable overhead.

To me this part actually looks like the biggest priority after (or
while) getting the whole thing to work, because we could then start
right away with a 'v1' that looks more sustainable. And in that case
I would just get things working on x86_64, without even implementing
conversions, endianness switches, and so on.

-- 
Stefano

It's on my queue for the next few days.

It's both easier to do and a bigger win in most cases. That would
dramatically reduce the size sent here.

Yep, feel free.
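
For reference, a minimal sketch of the sequence-setting step mentioned
above (setting sequences on the target with TCP_REPAIR). This is a
standalone illustration of the kernel interface, not the actual passt
code; error reporting is omitted, the sequence values are whatever was
carried over from the source, and it needs CAP_NET_ADMIN (hence the
helper) plus a reasonably recent glibc/kernel for the constants:

    #include <stdint.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>	/* TCP_REPAIR, TCP_REPAIR_QUEUE, TCP_QUEUE_SEQ */

    /* Set sequence numbers on a fresh socket before connect(): with
     * TCP_REPAIR enabled, connect() doesn't send a SYN and the socket
     * goes straight to ESTABLISHED using the queued sequences.
     */
    static int restore_seqs(int s, uint32_t snd_seq, uint32_t rcv_seq)
    {
    	int yes = 1, q;

    	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &yes, sizeof(yes)))
    		return -1;

    	q = TCP_SEND_QUEUE;
    	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q)) ||
    	    setsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ, &snd_seq, sizeof(snd_seq)))
    		return -1;

    	q = TCP_RECV_QUEUE;
    	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q)) ||
    	    setsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ, &rcv_seq, sizeof(rcv_seq)))
    		return -1;

    	/* bind() and connect() follow, then TCP_REPAIR is turned off */
    	return 0;
    }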
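
And just to make the "smaller fixed-endian structure with just the
stuff we need" idea a bit more concrete, a rough sketch of what one
per-flow entry might look like. Field names and the exact selection
are purely illustrative, not a proposal for the actual layout:

    #include <stdint.h>

    /* One migrated TCP flow: fixed layout regardless of host
     * architecture, all multi-byte fields in network order (the types
     * don't enforce that, conversion happens when filling the entry).
     * Only state that can't be re-derived on the target is included.
     */
    struct tcp_flow_migrate {
    	uint8_t		af;		/* AF_INET or AF_INET6 */
    	uint8_t		addr[2][16];	/* source, destination
    					 * (IPv4-mapped if af == AF_INET) */
    	uint16_t	port[2];	/* source, destination port */

    	uint32_t	snd_seq;	/* sequence for the send queue */
    	uint32_t	rcv_seq;	/* sequence for the receive queue */
    	uint16_t	mss;
    	uint8_t		wnd_scale[2];	/* send, receive window shift */
    } __attribute__((packed));

A small header with a count, followed by that many entries, would then
be the "array" from point 1. above.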
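
About point 3., I'm not pinning down the helper's interface here, but
assuming it takes socket descriptors over a UNIX domain socket, one
generic way to batch would be a single sendmsg() carrying a whole
SCM_RIGHTS batch instead of one message per socket (sketch only, the
payload byte and the function are made up for illustration):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define FD_BATCH_MAX	253	/* kernel limit (SCM_MAX_FD) per message */

    /* Pass up to FD_BATCH_MAX socket descriptors to the helper in one
     * sendmsg(); larger sets need to be split into multiple batches.
     */
    static int send_fd_batch(int helper, const int *fds, int n)
    {
    	union {
    		char buf[CMSG_SPACE(FD_BATCH_MAX * sizeof(int))];
    		struct cmsghdr align;
    	} u;
    	char byte = 0;			/* dummy one-byte payload */
    	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    	struct msghdr msg;
    	struct cmsghdr *cmsg;

    	if (n < 1 || n > FD_BATCH_MAX)
    		return -1;

    	memset(&u, 0, sizeof(u));
    	memset(&msg, 0, sizeof(msg));
    	msg.msg_iov = &iov;
    	msg.msg_iovlen = 1;
    	msg.msg_control = u.buf;
    	msg.msg_controllen = CMSG_SPACE(n * sizeof(int));

    	cmsg = CMSG_FIRSTHDR(&msg);
    	cmsg->cmsg_level = SOL_SOCKET;
    	cmsg->cmsg_type = SCM_RIGHTS;
    	cmsg->cmsg_len = CMSG_LEN(n * sizeof(int));
    	memcpy(CMSG_DATA(cmsg), fds, n * sizeof(int));

    	return sendmsg(helper, &msg, 0) < 0 ? -1 : 0;
    }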