On Wed, Jan 29, 2025 at 08:33:50AM +0100, Stefano Brivio wrote:On Wed, 29 Jan 2025 12:16:58 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Oh, yes, we'd need to add a __bswap_constant_64() for this. [snip]On Tue, Jan 28, 2025 at 07:50:01AM +0100, Stefano Brivio wrote:The very issues that brought me to introduce those fallback versions, so I'm instinctively reluctant to use them. Actually, I think it's even clearer to have this spelt out (I always need to stop for a moment and think: what happens when I cross the 32-bit boundary?).On Tue, 28 Jan 2025 12:40:12 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:What sort of issues? We're already using them, and have fallback versions defined in util.hOn Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote: > Add two sets (source or target) of three functions each for passt in > vhost-user mode, triggered by activity on the file descriptor passed > via VHOST_USER_PROTOCOL_F_DEVICE_STATE: > > - migrate_source_pre() and migrate_target_pre() are called to prepare > for migration, before data is transferred > > - migrate_source() sends, and migrate_target() receives migration data > > - migrate_source_post() and migrate_target_post() are responsible for > any post-migration task > > Callbacks are added to these functions with arrays of function > pointers in migrate.c. Migration handlers are versioned. > > Versioned descriptions of data sections will be added to the > data_versions array, which points to versioned iovec arrays. Version > 1 is currently empty and will be filled in in subsequent patches. > > The source announces the data version to be used and informs the peer > about endianness, and the size of void *, time_t, flow entries and > flow hash table entries. > > The target checks if the version of the source is still supported. If > it's not, it aborts the migration. > > Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> > --- > Makefile | 12 +-- > migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > migrate.h | 90 ++++++++++++++++++ > passt.c | 2 +- > vu_common.c | 122 ++++++++++++++++--------- > vu_common.h | 2 +- > 6 files changed, 438 insertions(+), 49 deletions(-) > create mode 100644 migrate.c > create mode 100644 migrate.h > > diff --git a/Makefile b/Makefile > index 464eef1..1383875 100644 > --- a/Makefile > +++ b/Makefile > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS) > > PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ > icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \ > - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \ > - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \ > + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > vhost_user.c virtio.c vu_common.c > QRAP_SRCS = qrap.c > SRCS = $(PASST_SRCS) $(QRAP_SRCS) > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1 > > PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ > flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ > - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ > - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ > - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \ > - virtio.h vu_common.h > + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \ > + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \ > + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \ > + vhost_user.h virtio.h vu_common.h > HEADERS = $(PASST_HEADERS) seccomp.h > > C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);} > diff --git a/migrate.c b/migrate.c > new file mode 100644 > index 0000000..bee9653 > --- /dev/null > +++ b/migrate.c > @@ -0,0 +1,259 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > + > +/* PASST - Plug A Simple Socket Transport > + * for qemu/UNIX domain socket mode > + * > + * PASTA - Pack A Subtle Tap Abstraction > + * for network namespace/tap device mode > + * > + * migrate.c - Migration sections, layout, and routines > + * > + * Copyright (c) 2025 Red Hat GmbH > + * Author: Stefano Brivio <sbrivio(a)redhat.com> > + */ > + > +#include <errno.h> > +#include <sys/uio.h> > + > +#include "util.h" > +#include "ip.h" > +#include "passt.h" > +#include "inany.h" > +#include "flow.h" > +#include "flow_table.h" > + > +#include "migrate.h" > + > +/* Current version of migration data */ > +#define MIGRATE_VERSION 1 > + > +/* Magic as we see it and as seen with reverse endianness */ > +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0 > +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1 As noted, I'm hoping we can get rid of "either endian" migration. But if this stays, we should define it using __bswap_constant_32() to avoid embarrassing mistakes.Those always give me issues on musl,Just transferring the in-use flows would be higher priority than being selective about what we send within each flow. It's both easier to do and a bigger win in most cases. That would dramatically reduce the size sent here.Sure, I can add a comment.Ah, yes, I missed that, we'd need a more complex design to do additional transfers and checks before making the target_pre callbacks.> +/** > + * union migrate_header - Migration header from source > + * @magic: 0xB1BB1D1B0BB1D1B0, host order > + * @version: Source sends highest known, target aborts if unsupported > + * @voidp_size: sizeof(void *), network order > + * @time_t_size: sizeof(time_t), network order > + * @flow_size: sizeof(union flow), network order > + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order > + * @unused: Go figure > + */ > +union migrate_header { > + struct { > + uint64_t magic; > + uint32_t version; > + uint32_t voidp_size; > + uint32_t time_t_size; > + uint32_t flow_size; > + uint32_t flow_sidx_size; > + }; > + uint8_t unused[65536]; So, having looked at this, I no longer think padding the header to 64kiB is a good idea. The structure means we're basically stuck always having that chunky header. Instead, I think the header should be absolutely minimal: basically magic and version only. v1 (and maybe others) can add a "metadata" or whatever section for additional information like this they need.The header is processed by the target in a separate, preliminary step, though. That's why I added metadata right in the header: if the target needs to abort the migration because, say, the size of a flow entry is too big to handle for a particular version, then we should know that before migrate_target_pre().As long as we check the version first, we can always shrink the header later on.*thinks*.. I guess so, though it's kind of awkward; a future version would have to read the "header of the header", check the version, then if it's the old one, read the remainder of the 64kiB block. I still think we should clearly separate the part that we're committing to being in every future version (which I think should just be magic and version), from the stuff that's just v1.Even if we transfer just what we need of a flow, it's still something well in excess of 50 bytes each. 100k flows would be 5 megs.But having 64 KiB reserved looks more robust because it's a safe place to add this kind of metadata. Note that 64 KiB is typically transferred in a single read/write from/to the vhost-user back-end.Ok, but it also has to go over the qemu migration channel, which will often be a physical link, not a super-fast local/virtual one, and may be bandwidth capped as well. I'm not actually certain if 64kiB is likely to be a problem there, but it *is* large compared to the state blobs of most qemu devices (usually only a few hundred bytes).Well, anyway, let's cut this down to 4k, which should be enough, so that it's not a topic anymore.I still think it's ugly, but whatever. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson