On Thu, Jan 30, 2025 at 05:55:22AM +0100, Stefano Brivio wrote:On Thu, 30 Jan 2025 11:48:19 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Very well.On Wed, Jan 29, 2025 at 08:33:50AM +0100, Stefano Brivio wrote:...which doesn't exist on musl. On current Alpine Edge: util.h:131:34: error: implicit declaration of function '__bswap_constant_64' [-Wimplicit-function-declaration] 131 | #define htonll_constant(x) (__bswap_constant_64(x)) | ^~~~~~~~~~~~~~~~~~~ ...so rather than adding an implementation for this single usage, which makes it actually less clear to me, I would keep it like it is.On Wed, 29 Jan 2025 12:16:58 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Oh, yes, we'd need to add a __bswap_constant_64() for this.On Tue, Jan 28, 2025 at 07:50:01AM +0100, Stefano Brivio wrote: > On Tue, 28 Jan 2025 12:40:12 +1100 > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > On Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote: > > > Add two sets (source or target) of three functions each for passt in > > > vhost-user mode, triggered by activity on the file descriptor passed > > > via VHOST_USER_PROTOCOL_F_DEVICE_STATE: > > > > > > - migrate_source_pre() and migrate_target_pre() are called to prepare > > > for migration, before data is transferred > > > > > > - migrate_source() sends, and migrate_target() receives migration data > > > > > > - migrate_source_post() and migrate_target_post() are responsible for > > > any post-migration task > > > > > > Callbacks are added to these functions with arrays of function > > > pointers in migrate.c. Migration handlers are versioned. > > > > > > Versioned descriptions of data sections will be added to the > > > data_versions array, which points to versioned iovec arrays. Version > > > 1 is currently empty and will be filled in in subsequent patches. > > > > > > The source announces the data version to be used and informs the peer > > > about endianness, and the size of void *, time_t, flow entries and > > > flow hash table entries. > > > > > > The target checks if the version of the source is still supported. If > > > it's not, it aborts the migration. > > > > > > Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> > > > --- > > > Makefile | 12 +-- > > > migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > migrate.h | 90 ++++++++++++++++++ > > > passt.c | 2 +- > > > vu_common.c | 122 ++++++++++++++++--------- > > > vu_common.h | 2 +- > > > 6 files changed, 438 insertions(+), 49 deletions(-) > > > create mode 100644 migrate.c > > > create mode 100644 migrate.h > > > > > > diff --git a/Makefile b/Makefile > > > index 464eef1..1383875 100644 > > > --- a/Makefile > > > +++ b/Makefile > > > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS) > > > > > > PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ > > > icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \ > > > - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \ > > > - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > > > + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \ > > > + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > > > vhost_user.c virtio.c vu_common.c > > > QRAP_SRCS = qrap.c > > > SRCS = $(PASST_SRCS) $(QRAP_SRCS) > > > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1 > > > > > > PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ > > > flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ > > > - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ > > > - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ > > > - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \ > > > - virtio.h vu_common.h > > > + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \ > > > + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \ > > > + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \ > > > + vhost_user.h virtio.h vu_common.h > > > HEADERS = $(PASST_HEADERS) seccomp.h > > > > > > C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);} > > > diff --git a/migrate.c b/migrate.c > > > new file mode 100644 > > > index 0000000..bee9653 > > > --- /dev/null > > > +++ b/migrate.c > > > @@ -0,0 +1,259 @@ > > > +// SPDX-License-Identifier: GPL-2.0-or-later > > > + > > > +/* PASST - Plug A Simple Socket Transport > > > + * for qemu/UNIX domain socket mode > > > + * > > > + * PASTA - Pack A Subtle Tap Abstraction > > > + * for network namespace/tap device mode > > > + * > > > + * migrate.c - Migration sections, layout, and routines > > > + * > > > + * Copyright (c) 2025 Red Hat GmbH > > > + * Author: Stefano Brivio <sbrivio(a)redhat.com> > > > + */ > > > + > > > +#include <errno.h> > > > +#include <sys/uio.h> > > > + > > > +#include "util.h" > > > +#include "ip.h" > > > +#include "passt.h" > > > +#include "inany.h" > > > +#include "flow.h" > > > +#include "flow_table.h" > > > + > > > +#include "migrate.h" > > > + > > > +/* Current version of migration data */ > > > +#define MIGRATE_VERSION 1 > > > + > > > +/* Magic as we see it and as seen with reverse endianness */ > > > +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0 > > > +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1 > > > > As noted, I'm hoping we can get rid of "either endian" migration. But > > if this stays, we should define it using __bswap_constant_32() to > > avoid embarrassing mistakes. > > Those always give me issues on musl, What sort of issues? We're already using them, and have fallback versions defined in util.hThe very issues that brought me to introduce those fallback versions, so I'm instinctively reluctant to use them. Actually, I think it's even clearer to have this spelt out (I always need to stop for a moment and think: what happens when I cross the 32-bit boundary?).Right, but in the present draft you pay that cost whether or not you're actually using the flows. Unfortunately a busy server with heaps of active connections is exactly the case that's likely to be most sensitve to additional downtime, but there's not really any getting around that. A machine with a lot of state will need either high downtime or high migration bandwidth. But, I'm really hoping we can move relatively quickly to a model where a guest with only a handful of connections _doesn't_ have to pay that 128k flow cost - and can consequently migrate ok even with quite constrained migration bandwidth. In that scenario the size of the header could become significant.[snip]Well, of course, I meant that we would only transfer used flows at that point, because it's not about transferring the flow table as a whole, with none of the advantages and disadvantages of it. But still one can have 128k flows at the moment.Just transferring the in-use flows would be higher priority than being selective about what we send within each flow.> > > +/** > > > + * union migrate_header - Migration header from source > > > + * @magic: 0xB1BB1D1B0BB1D1B0, host order > > > + * @version: Source sends highest known, target aborts if unsupported > > > + * @voidp_size: sizeof(void *), network order > > > + * @time_t_size: sizeof(time_t), network order > > > + * @flow_size: sizeof(union flow), network order > > > + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order > > > + * @unused: Go figure > > > + */ > > > +union migrate_header { > > > + struct { > > > + uint64_t magic; > > > + uint32_t version; > > > + uint32_t voidp_size; > > > + uint32_t time_t_size; > > > + uint32_t flow_size; > > > + uint32_t flow_sidx_size; > > > + }; > > > + uint8_t unused[65536]; > > > > So, having looked at this, I no longer think padding the header to 64kiB > > is a good idea. The structure means we're basically stuck always > > having that chunky header. Instead, I think the header should be > > absolutely minimal: basically magic and version only. v1 (and maybe > > others) can add a "metadata" or whatever section for additional > > information like this they need. > > The header is processed by the target in a separate, preliminary step, > though. > > That's why I added metadata right in the header: if the target needs to > abort the migration because, say, the size of a flow entry is too big > to handle for a particular version, then we should know that before > migrate_target_pre(). Ah, yes, I missed that, we'd need a more complex design to do additional transfers and checks before making the target_pre callbacks. > As long as we check the version first, we can always shrink the header > later on. *thinks*.. I guess so, though it's kind of awkward; a future version would have to read the "header of the header", check the version, then if it's the old one, read the remainder of the 64kiB block. I still think we should clearly separate the part that we're committing to being in every future version (which I think should just be magic and version), from the stuff that's just v1.Sure, I can add a comment.> But having 64 KiB reserved looks more robust because it's a > safe place to add this kind of metadata. > > Note that 64 KiB is typically transferred in a single read/write > from/to the vhost-user back-end. Ok, but it also has to go over the qemu migration channel, which will often be a physical link, not a super-fast local/virtual one, and may be bandwidth capped as well. I'm not actually certain if 64kiB is likely to be a problem there, but it *is* large compared to the state blobs of most qemu devices (usually only a few hundred bytes).Even if we transfer just what we need of a flow, it's still something well in excess of 50 bytes each. 100k flows would be 5 megs.It's on my queue for the next few days.It's both easier to do and a bigger win in most cases. That would dramatically reduce the size sent here.Yep, feel free.-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibsonSame here.Well, anyway, let's cut this down to 4k, which should be enough, so that it's not a topic anymore.I still think it's ugly, but whatever.