[PATCH v3 00/20] Draft, incomplete series introducing state migration
...and finally connections survive migration from source to target, at least the ones originating from the (source) guest. I didn't try the other way around, small tweaks might be needed. Tested as follows, roughly as instructed by Laurent: Source: $ ./passt --vhost-user $ qemu-system-x86_64 -machine accel=kvm -cpu host -kernel ... \ -initrd mbuto.img -nographic -serial mon:stdio -nodefaults \ -append "console=ttyS0" \ -chardev socket,id=chr0,path=/tmp/passt_1.socket \ -netdev vhost-user,id=netdev0,chardev=chr0 \ -device virtio-net,netdev=netdev0 \ -object memory-backend-memfd,id=memfd0,share=on,size=$((2 * 1024 * 1024 * 1024)) \ -numa node,memdev=memfd0 -m 2G # ./passt-repair /tmp/passt_1.socket.repair Target (same host): $ ./passt --vhost-user $ qemu-system-x86_64 -machine accel=kvm -cpu host -kernel ... \ -initrd mbuto.img -nographic -serial mon:stdio -nodefaults \ -append "console=ttyS0" \ -chardev socket,id=chr0,path=/tmp/passt_2.socket \ -netdev vhost-user,id=netdev0,chardev=chr0 \ -device virtio-net,netdev=netdev0 \ -object memory-backend-memfd,id=memfd0,share=on,size=$((2 * 1024 * 1024 * 1024)) \ -numa node,memdev=memfd0 -m 2G \ -incoming tcp:0:4444 # ./passt-repair /tmp/passt_2.socket.repair Test server: $ nc -l 9091 Once the guest boots: # ip link set dev eth0 up # dhclient eth0 # socat STDIN TCP:$DEFAULT_GW:9091 abcd ^a-c migrate tcp:0:4444 Then continue typing in the target guest: efgh The purpose of this is mostly to show the complete flow, but it needs a number of reworks. What's missing (letting aside pending packet queues for a moment, those are not strictly needed): 1. tests based on the two_guests layout/setup. Even with reverse-search in the shell, this is getting quite hard on wrists. I guess we can start QEMU with -monitor unix:mon.sock,server,nowait and send the 'migrate' command via socat STDIN UNIX-CONNECT:mon.sock 2. dump and transfer of *socket-side* MSS and window scale (I used hardcoded values): this needs more storage, so it needs to be transferred outside the flow table 3. dump, transfer and restore of TCP_REPAIR_WINDOW parameters (not strictly needed, but easy to add once we have appropriate storage) 4. perhaps some small bits of implementation for socket-originated connections (I tested only guest-originated ones so far) 5. UDP and ICMP flows (ping already happens to "survive" nicely, by the way) 6. man page for passt-repair, and man page changes for everything 7. packaging and Linux Security Module changes for passt-repair 8. error handling here and there, and repair rollback/migration abort 9. setting original receive/send buffer sizes and socket options (TCP_NODELAY) What clearly needs changes: a. we can't dump more stuff to the flow table, because we would exceed 128 bytes. We need to copy everything from tcp_tap_conn except for: - state in flow_common - in_epoll - sock - timer and on top of this we need: - values for TCPOPT_WINDOW and TCPOPT_MAXSEG - struct tcp_repair_window somewhat unexpectedly, this is actually bigger than a flow table entry. In any case, we need to implement a stream/per-entry migration right away. b. at this point, I guess we can throw the header away, and just keep a magic (0xB1BB1D1B0BB1D1B0 has a missing 0 at the end but, well, https://en.wikipedia.org/wiki/Bibbidi-Bobbidi-Boo is the Magic Song: can we keep it?) and a version number. The rest, let's go with big/network endianness I'd say, and 64-bit time_t c. the declarative data thing is very convenient but we need to fetch stuff from struct ctx, as shown by the hash_secret example. What's very convenient of this approach is the iovec / writev() / readv() idea. I'm not sure if we can maintain that convenience, though Patches that could be applied regardless of this series to make it more manageable: 1/20 tcp: Always pass NULL event with EPOLL_CTL_DEL 2/20 util: Rename and make global vu_remove_watch() 6/20 util: Add read_remainder() and read_all_buf() 8/20 Introduce passt-repair 16/20 vhost_user: Turn vhost-user message reports to trace() 17/20 vhost_user: Make source quit after reporting migration state 18/20 tcp: Get our socket port using getsockname() when connecting from guest 19/20 tcp: Add HOSTSIDE(x), HOSTFLOW(x) macros Patches that we can throw away with the changes outlined above: 3/20 icmp, udp: Pad time_t timestamp to 64-bit to ease state migration 4/20 flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits 15/20 flow, flow_table: Export declaration of hash table David Gibson (6): tcp: Always pass NULL event with EPOLL_CTL_DEL util: Rename and make global vu_remove_watch() migrate: vu_migrate_{source,target}() aren't actually vu speciic migrate: Move repair_sock_init() to vu_init() migrate: Make more handling common rather than vhost-user specific migrate: Don't handle the migration channel through epoll Stefano Brivio (14): icmp, udp: Pad time_t timestamp to 64-bit to ease state migration flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits flow_table: Use size in extern declaration for flowtab util: Add read_remainder() and read_all_buf() Introduce facilities for guest migration on top of vhost-user infrastructure Introduce passt-repair Add interfaces and configuration bits for passt-repair flow, tcp: Basic pre-migration source handler to dump sequence numbers flow, flow_table: Export declaration of hash table vhost_user: Turn vhost-user message reports to trace() vhost_user: Make source quit after reporting migration state tcp: Get our socket port using getsockname() when connecting from guest tcp: Add HOSTSIDE(x), HOSTFLOW(x) macros Implement target side of migration .gitignore | 1 + Makefile | 24 +-- conf.c | 44 +++++- epoll_type.h | 6 +- flow.c | 97 +++++++++++- flow.h | 20 ++- flow_table.h | 22 ++- icmp.c | 2 +- icmp_flow.h | 6 +- migrate.c | 408 +++++++++++++++++++++++++++++++++++++++++++++++++ migrate.h | 84 ++++++++++ passt-repair.c | 117 ++++++++++++++ passt.1 | 11 ++ passt.c | 17 ++- passt.h | 17 +++ repair.c | 193 +++++++++++++++++++++++ repair.h | 16 ++ tap.c | 64 +------- tcp.c | 198 +++++++++++++++++++++++- tcp_conn.h | 7 + tcp_internal.h | 10 +- tcp_splice.c | 4 +- udp_flow.c | 2 +- udp_flow.h | 6 +- util.c | 155 +++++++++++++++++++ util.h | 4 + vhost_user.c | 94 +++--------- virtio.h | 4 - vu_common.c | 62 +++----- vu_common.h | 2 +- 30 files changed, 1469 insertions(+), 228 deletions(-) create mode 100644 migrate.c create mode 100644 migrate.h create mode 100644 passt-repair.c create mode 100644 repair.c create mode 100644 repair.h -- 2.43.0
From: David Gibson
From: David Gibson
That's the only field in flows with different storage sizes depending
on the architecture: it's usually 4-byte wide on 32-bit architectures,
except for arc and x32 where it's 8 bytes, and 8-byte wide on 64-bit
machines.
By keeping flow entries the same size across architectures, we avoid
having to expand or shrink table entries upon migration.
Signed-off-by: Stefano Brivio
...to keep migration sane. Right now, the biggest struct in union flow
is struct tcp_splice_conn with 120 bytes on x86_64, which should also
have the biggest storage and alignment requirements of any
architecture we might run on.
Signed-off-by: Stefano Brivio
...so that we can use sizeof() on it.
Signed-off-by: Stefano Brivio
These are symmetric to write_remainder() and write_all_buf() and
almost a copy and paste of them, with the most notable differences
being reversed reads/writes and a couple of better-safe-than-sorry
asserts to keep Coverity happy.
I'll use them in the next patch. At least for the moment, they're
going to be used for vhost-user mode only, so I'm not unconditionally
enabling readv() in the seccomp profile: the caller has to ensure it's
there.
Signed-off-by: Stefano Brivio
Add two sets (source or target) of three functions each for passt in
vhost-user mode, triggered by activity on the file descriptor passed
via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
- migrate_source_pre() and migrate_target_pre() are called to prepare
for migration, before data is transferred
- migrate_source() sends, and migrate_target() receives migration data
- migrate_source_post() and migrate_target_post() are responsible for
any post-migration task
Callbacks are added to these functions with arrays of function
pointers in migrate.c. Migration handlers are versioned.
Versioned descriptions of data sections will be added to the
data_versions array, which points to versioned iovec arrays. Version
1 is currently empty and will be filled in in subsequent patches.
The source announces the data version to be used and informs the peer
about endianness, and the size of void *, time_t, flow entries and
flow hash table entries.
The target checks if the version of the source is still supported. If
it's not, it aborts the migration.
Signed-off-by: Stefano Brivio
A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
passt. Not used yet.
From David's patch: add it to .gitignore, like our other executable
targets.
Co-authored-by: David Gibson
On Fri, Jan 31, 2025 at 08:39:41PM +0100, Stefano Brivio wrote:
A privileged helper to set/clear TCP_REPAIR on sockets on behalf of passt. Not used yet.
From David's patch: add it to .gitignore, like our other executable targets.
Co-authored-by: David Gibson
I don't think a trivial change like the .gitignore really needs to be commented and credited.
Signed-off-by: Stefano Brivio
--- .gitignore | 1 + Makefile | 10 +++-- passt-repair.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 125 insertions(+), 3 deletions(-) create mode 100644 passt-repair.c diff --git a/.gitignore b/.gitignore index d1c8be9..5824a71 100644 --- a/.gitignore +++ b/.gitignore @@ -3,6 +3,7 @@ /passt.avx2 /pasta /pasta.avx2 +/passt-repair /qrap /pasta.1 /seccomp.h diff --git a/Makefile b/Makefile index 1383875..1b71cb0 100644 --- a/Makefile +++ b/Makefile @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ vhost_user.c virtio.c vu_common.c QRAP_SRCS = qrap.c -SRCS = $(PASST_SRCS) $(QRAP_SRCS) +PASST_REPAIR_SRCS = passt-repair.c +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
MANPAGES = passt.1 pasta.1 qrap.1
@@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man man1dir ?= $(mandir)/man1
ifeq ($(TARGET_ARCH),x86_64) -BIN := passt passt.avx2 pasta pasta.avx2 qrap +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair else -BIN := passt pasta qrap +BIN := passt pasta qrap passt-repair endif
all: $(BIN) $(MANPAGES) docs @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt% qrap: $(QRAP_SRCS) passt.h $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
+passt-repair: $(PASST_REPAIR_SRCS) + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS) + valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \ rt_sigreturn getpid gettid kill clock_gettime mmap \ mmap2 munmap open unlink gettimeofday futex statx \ diff --git a/passt-repair.c b/passt-repair.c new file mode 100644 index 0000000..988a52c --- /dev/null +++ b/passt-repair.c @@ -0,0 +1,117 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +/* PASST - Plug A Simple Socket Transport + * for qemu/UNIX domain socket mode + * + * PASTA - Pack A Subtle Tap Abstraction + * for network namespace/tap device mode + * + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets + * + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio
+ * + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along + * with byte commands mapping to TCP_REPAIR values, and switch repair mode on or + * off. Reply by echoing the command. Exit on EOF. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */ + +int main(int argc, char **argv) +{ + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)] + __attribute__ ((aligned(__alignof__(struct cmsghdr)))); + struct sockaddr_un a = { AF_UNIX, "" }; + int fds[SCM_MAX_FD], s, ret, i, n; + int8_t cmd = INT8_MAX; + struct cmsghdr *cmsg; + struct msghdr msg; + struct iovec iov; + + iov = (struct iovec){ &cmd, sizeof(cmd) }; + msg = (struct msghdr){ NULL, 0, &iov, 1, buf, sizeof(buf), 0 }; + cmsg = CMSG_FIRSTHDR(&msg); + + if (argc != 2) { + fprintf(stderr, "Usage: %s PATH\n", argv[0]); + return -1; + } + + ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]); + if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) { + fprintf(stderr, "Invalid socket path: %s\n", argv[1]); + return -1; + } + + if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) { + perror("Failed to create AF_UNIX socket"); + return -1; + } + + if (connect(s, (struct sockaddr *)&a, sizeof(a))) { + fprintf(stderr, "Failed to connect to %s: %s\n", argv[1], + strerror(errno)); + return -1; + } + +loop: + ret = recvmsg(s, &msg, 0); + if (ret < 0) { + perror("Failed to receive message"); + return -1; + } + + if (!ret) /* Done */ + return 0; + + if (!cmsg || + cmsg->cmsg_len < CMSG_LEN(sizeof(int)) || + cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) || + cmsg->cmsg_type != SCM_RIGHTS) + return -1; + + n = cmsg->cmsg_len / CMSG_LEN(sizeof(int)); + memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n); + + if (cmd != TCP_REPAIR_ON && cmd != TCP_REPAIR_OFF && + cmd != TCP_REPAIR_OFF_NO_WP) { + fprintf(stderr, "Unsupported command 0x%04x\n", cmd); + return -1; + } + + for (i = 0; i < n; i++) { + int o = cmd; + + if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR, &o, sizeof(o))) { + fprintf(stderr, + "Setting TCP_REPAIR to %i on socket %i: %s", o, + fds[i], strerror(errno)); + return -1; + }
So, I was thinking about this: I think we need to close() the fd, after calling TCP_REPAIR. If we don't, that's essentially an extra reference to the underlying kernel file object. That means: * When we close() the fd in passt, the socket won't actually go away. I think this is probably the cause of the in use ports you encountered. The current approach of exiting after the migrate is causing passt-repair to also exit, freeing up the additional references. * For incoming migrations, there's: when a migrated connection comes to a proper close on the target, the socket will be held open by the extra fd in the target side passt-repair. * At the moment, I don't think we expect more than two migrations for a single passt-repair instance (one in, and one out). But, particularly for the case of multiple failed migration attempts, I don't think we want to count on that. We're essentially leaking fd slots here, so passt-repair could run out of fds. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
In vhost-user mode, by default, create a second UNIX domain socket
accepting connections from passt-repair, with the usual listener
socket.
When we need to set or clear TCP_REPAIR on sockets, we'll send them
via SCM_RIGHTS to passt-repair, who sets the socket option values we
ask for.
To that end, introduce batched functions to request TCP_REPAIR
settings on sockets, so that we don't have to send a single message
for each socket, on migration. When needed, repair_flush() will
send the message and check for the reply.
Signed-off-by: Stefano Brivio
On Fri, Jan 31, 2025 at 08:39:42PM +0100, Stefano Brivio wrote:
In vhost-user mode, by default, create a second UNIX domain socket accepting connections from passt-repair, with the usual listener socket.
When we need to set or clear TCP_REPAIR on sockets, we'll send them via SCM_RIGHTS to passt-repair, who sets the socket option values we ask for.
To that end, introduce batched functions to request TCP_REPAIR settings on sockets, so that we don't have to send a single message for each socket, on migration. When needed, repair_flush() will send the message and check for the reply.
Signed-off-by: Stefano Brivio
[snip]
+/** + * repair_close() - Close connection to TCP_REPAIR helper + * @c: Execution context + */ +void repair_close(struct ctx *c) +{ + debug("Closing TCP_REPAIR helper socket"); + + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_repair, NULL);
This can be an epoll_del(), since you put the patch introducing that before this one in the series. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Very much draft quality, but it works. Ask passt-repair to switch
TCP sockets to repair mode and dump their current sequence numbers to
the flow table, which will be transferred and used by the target in
the next step.
Signed-off-by: Stefano Brivio
From: David Gibson
From: David Gibson
From: David Gibson
From: David Gibson
On Fri, Jan 31, 2025 at 08:39:47PM +0100, Stefano Brivio wrote:
From: David Gibson
Currently, once a migration device state fd is assigned, we wait for EPOLLIN or EPOLLOUT events on it to actually perform the migration. Change it so that once a migration is requested it we complete it synchronously at the end of the current epoll cycle. This has several advantages:
1. It makes it clear that everything about the migration must be dealt with at once, not split between multiple epoll events on the channel 2. It ensures the migration always takes place between epoll cycles, rather than, for example, between handling TCP events and their deferred handling in post_handler(). 3. It reduces code setting up the epoll watch on the fd.
Signed-off-by: David Gibson
--- epoll_type.h | 2 -- migrate.c | 44 +++++++++++--------------------------------- migrate.h | 2 +- passt.c | 6 ++---- passt.h | 2 ++ vu_common.c | 27 +++++++++++++++++++++++++++ 6 files changed, 43 insertions(+), 40 deletions(-) diff --git a/epoll_type.h b/epoll_type.h index b981d30..7f2a121 100644 --- a/epoll_type.h +++ b/epoll_type.h @@ -40,8 +40,6 @@ enum epoll_type { EPOLL_TYPE_VHOST_CMD, /* vhost-user kick event socket */ EPOLL_TYPE_VHOST_KICK, - /* migration device state channel */ - EPOLL_TYPE_DEVICE_STATE, /* TCP_REPAIR helper listening socket */ EPOLL_TYPE_REPAIR_LISTEN, /* TCP_REPAIR helper socket */ diff --git a/migrate.c b/migrate.c index fc6a043..faa7841 100644 --- a/migrate.c +++ b/migrate.c @@ -50,7 +50,6 @@ static union migrate_header header = {
/* Data sections for version 1 */ static struct iovec sections_v1[] = { - { &header, sizeof(header) },
This hunk seems to have migrated in during rebase somehow, it wasn't in my original patch, and it breaks compile.
};
/* Set of data versions */ @@ -333,26 +332,6 @@ static int migrate_target(struct ctx *c, int fd) return rc; }
-/** - * set_migration_watch() - Add the migration file descriptor to epoll - * @c: Execution context - * @fd: File descriptor to add - * @target: Are we the target of the migration? - */ -static void set_migration_watch(const struct ctx *c, int fd, bool target) -{ - union epoll_ref ref = { - .type = EPOLL_TYPE_DEVICE_STATE, - .fd = fd, - }; - struct epoll_event ev = { 0 }; - - ev.data.u64 = ref.u64; - ev.events = target ? EPOLLIN : EPOLLOUT; - - epoll_ctl(c->epollfd, EPOLL_CTL_ADD, ref.fd, &ev); -} - /** * migrate_init() - Set up things necessary for migration * @c: Execution context @@ -372,7 +351,6 @@ void migrate_close(struct ctx *c) { if (c->device_state_fd != -1) { debug("Closing migration channel, fd: %d", c->device_state_fd); - epoll_del(c, c->device_state_fd); close(c->device_state_fd); c->device_state_fd = -1; c->device_state_result = -1; @@ -393,27 +371,27 @@ void migrate_request(struct ctx *c, int fd, bool target) migrate_close(c);
c->device_state_fd = fd; - set_migration_watch(c, c->device_state_fd, target); - + c->migrate_target = target; }
/** * migrate_handler() - Send/receive passt internal state to/from QEMU * @c: Execution context - * @events: epoll events */ -void migrate_handler(struct ctx *c, uint32_t events) +void migrate_handler(struct ctx *c) { - int rc = EIO; + int rc;
- debug("migrate_handler fd %d events %x", c->device_state_fd, events); + if (c->device_state_fd < 0) + return;
- if (events & EPOLLOUT) - rc = migrate_source(c, c->device_state_fd); - else if (events & EPOLLIN) - rc = migrate_target(c, c->device_state_fd); + debug("migrate_handler fd %d target %d", + c->device_state_fd, c->migrate_target);
- /* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */ + if (c->migrate_target) + rc = migrate_target(c, c->device_state_fd); + else + rc = migrate_source(c, c->device_state_fd);
migrate_close(c);
diff --git a/migrate.h b/migrate.h index a222c48..158241f 100644 --- a/migrate.h +++ b/migrate.h @@ -79,6 +79,6 @@ struct migrate_target_handlers { void migrate_init(struct ctx *c); void migrate_close(struct ctx *c); void migrate_request(struct ctx *c, int fd, bool target); -void migrate_handler(struct ctx *c, uint32_t events); +void migrate_handler(struct ctx *c);
#endif /* MIGRATE_H */ diff --git a/passt.c b/passt.c index 3c3a331..1938290 100644 --- a/passt.c +++ b/passt.c @@ -76,7 +76,6 @@ char *epoll_type_str[] = { [EPOLL_TYPE_TAP_LISTEN] = "listening qemu socket", [EPOLL_TYPE_VHOST_CMD] = "vhost-user command socket", [EPOLL_TYPE_VHOST_KICK] = "vhost-user kick socket", - [EPOLL_TYPE_DEVICE_STATE] = "migration device state channel", [EPOLL_TYPE_REPAIR_LISTEN] = "TCP_REPAIR helper listening socket", [EPOLL_TYPE_REPAIR] = "TCP_REPAIR helper socket", }; @@ -360,9 +359,6 @@ loop: case EPOLL_TYPE_VHOST_KICK: vu_kick_cb(c.vdev, ref, &now); break; - case EPOLL_TYPE_DEVICE_STATE: - migrate_handler(&c, eventmask); - break; case EPOLL_TYPE_REPAIR_LISTEN: repair_listen_handler(&c, eventmask); break; @@ -377,5 +373,7 @@ loop:
post_handler(&c, &now);
+ migrate_handler(&c); + goto loop; } diff --git a/passt.h b/passt.h index 5992cbe..4189a4a 100644 --- a/passt.h +++ b/passt.h @@ -241,6 +241,7 @@ struct ip6_ctx { * @vdev: vhost-user device * @device_state_fd: Device state migration channel * @device_state_result: Device state migration result + * @migrate_target: Is this the target for next migration? */ struct ctx { enum passt_modes mode; @@ -313,6 +314,7 @@ struct ctx { /* Migration */ int device_state_fd; int device_state_result; + bool migrate_target; };
void proto_update_l2_buf(const unsigned char *eth_d, diff --git a/vu_common.c b/vu_common.c index 78d1c1b..4797ef9 100644 --- a/vu_common.c +++ b/vu_common.c @@ -305,3 +305,30 @@ err:
return -1; } + +/** + * vu_migrate() - Send/receive passt internal state to/from QEMU + * @c: Execution context + * @events: epoll events + */ +void vu_migrate(struct ctx *c, uint32_t events) +{ + struct vu_dev *vdev = c->vdev; + int rc = EIO; + + debug("vu_migrate fd %d events %x", vdev->device_state_fd, events); + + if (events & EPOLLOUT) + rc = migrate_source(c, vdev->device_state_fd); + else if (events & EPOLLIN) + rc = migrate_target(c, vdev->device_state_fd); + + /* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */ + + vdev->device_state_result = rc; + + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL); + debug("Closing migration channel"); + close(vdev->device_state_fd); + vdev->device_state_fd = -1; +}
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Mon, 3 Feb 2025 12:50:20 +1100
David Gibson
On Fri, Jan 31, 2025 at 08:39:47PM +0100, Stefano Brivio wrote:
From: David Gibson
Currently, once a migration device state fd is assigned, we wait for EPOLLIN or EPOLLOUT events on it to actually perform the migration. Change it so that once a migration is requested it we complete it synchronously at the end of the current epoll cycle. This has several advantages:
1. It makes it clear that everything about the migration must be dealt with at once, not split between multiple epoll events on the channel 2. It ensures the migration always takes place between epoll cycles, rather than, for example, between handling TCP events and their deferred handling in post_handler(). 3. It reduces code setting up the epoll watch on the fd.
Signed-off-by: David Gibson
--- epoll_type.h | 2 -- migrate.c | 44 +++++++++++--------------------------------- migrate.h | 2 +- passt.c | 6 ++---- passt.h | 2 ++ vu_common.c | 27 +++++++++++++++++++++++++++ 6 files changed, 43 insertions(+), 40 deletions(-) diff --git a/epoll_type.h b/epoll_type.h index b981d30..7f2a121 100644 --- a/epoll_type.h +++ b/epoll_type.h @@ -40,8 +40,6 @@ enum epoll_type { EPOLL_TYPE_VHOST_CMD, /* vhost-user kick event socket */ EPOLL_TYPE_VHOST_KICK, - /* migration device state channel */ - EPOLL_TYPE_DEVICE_STATE, /* TCP_REPAIR helper listening socket */ EPOLL_TYPE_REPAIR_LISTEN, /* TCP_REPAIR helper socket */ diff --git a/migrate.c b/migrate.c index fc6a043..faa7841 100644 --- a/migrate.c +++ b/migrate.c @@ -50,7 +50,6 @@ static union migrate_header header = {
/* Data sections for version 1 */ static struct iovec sections_v1[] = { - { &header, sizeof(header) },
This hunk seems to have migrated in during rebase somehow, it wasn't in my original patch, and it breaks compile.
Oh, oops, yes. I did one last rebase round before posting, and I was convinced I was just rewording messages, so I didn't even try to rebuild, but obviously I moved some code here and there as well. Do you prefer I post a v4 fixing this up and addressing your other comments (and some of my open points), or should I rather wait, if you're working on patches based on v3? It makes no difference to me. -- Stefano
On Mon, Feb 03, 2025 at 06:38:56AM +0100, Stefano Brivio wrote:
On Mon, 3 Feb 2025 12:50:20 +1100 David Gibson
wrote: On Fri, Jan 31, 2025 at 08:39:47PM +0100, Stefano Brivio wrote:
From: David Gibson
Currently, once a migration device state fd is assigned, we wait for EPOLLIN or EPOLLOUT events on it to actually perform the migration. Change it so that once a migration is requested it we complete it synchronously at the end of the current epoll cycle. This has several advantages:
1. It makes it clear that everything about the migration must be dealt with at once, not split between multiple epoll events on the channel 2. It ensures the migration always takes place between epoll cycles, rather than, for example, between handling TCP events and their deferred handling in post_handler(). 3. It reduces code setting up the epoll watch on the fd.
Signed-off-by: David Gibson
--- epoll_type.h | 2 -- migrate.c | 44 +++++++++++--------------------------------- migrate.h | 2 +- passt.c | 6 ++---- passt.h | 2 ++ vu_common.c | 27 +++++++++++++++++++++++++++ 6 files changed, 43 insertions(+), 40 deletions(-) diff --git a/epoll_type.h b/epoll_type.h index b981d30..7f2a121 100644 --- a/epoll_type.h +++ b/epoll_type.h @@ -40,8 +40,6 @@ enum epoll_type { EPOLL_TYPE_VHOST_CMD, /* vhost-user kick event socket */ EPOLL_TYPE_VHOST_KICK, - /* migration device state channel */ - EPOLL_TYPE_DEVICE_STATE, /* TCP_REPAIR helper listening socket */ EPOLL_TYPE_REPAIR_LISTEN, /* TCP_REPAIR helper socket */ diff --git a/migrate.c b/migrate.c index fc6a043..faa7841 100644 --- a/migrate.c +++ b/migrate.c @@ -50,7 +50,6 @@ static union migrate_header header = {
/* Data sections for version 1 */ static struct iovec sections_v1[] = { - { &header, sizeof(header) },
This hunk seems to have migrated in during rebase somehow, it wasn't in my original patch, and it breaks compile.
Oh, oops, yes. I did one last rebase round before posting, and I was convinced I was just rewording messages, so I didn't even try to rebuild, but obviously I moved some code here and there as well.
Do you prefer I post a v4 fixing this up and addressing your other comments (and some of my open points), or should I rather wait, if you're working on patches based on v3? It makes no difference to me.
Wait, I have some patches in the queue. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Fri, Jan 31, 2025 at 08:39:47PM +0100, Stefano Brivio wrote:
From: David Gibson
Currently, once a migration device state fd is assigned, we wait for EPOLLIN or EPOLLOUT events on it to actually perform the migration. Change it so that once a migration is requested it we complete it synchronously at the end of the current epoll cycle. This has several advantages:
1. It makes it clear that everything about the migration must be dealt with at once, not split between multiple epoll events on the channel 2. It ensures the migration always takes place between epoll cycles, rather than, for example, between handling TCP events and their deferred handling in post_handler(). 3. It reduces code setting up the epoll watch on the fd.
Signed-off-by: David Gibson
--- epoll_type.h | 2 -- migrate.c | 44 +++++++++++--------------------------------- migrate.h | 2 +- passt.c | 6 ++---- passt.h | 2 ++ vu_common.c | 27 +++++++++++++++++++++++++++ 6 files changed, 43 insertions(+), 40 deletions(-) diff --git a/epoll_type.h b/epoll_type.h index b981d30..7f2a121 100644 --- a/epoll_type.h +++ b/epoll_type.h @@ -40,8 +40,6 @@ enum epoll_type { EPOLL_TYPE_VHOST_CMD, /* vhost-user kick event socket */ EPOLL_TYPE_VHOST_KICK, - /* migration device state channel */ - EPOLL_TYPE_DEVICE_STATE, /* TCP_REPAIR helper listening socket */ EPOLL_TYPE_REPAIR_LISTEN, /* TCP_REPAIR helper socket */ diff --git a/migrate.c b/migrate.c index fc6a043..faa7841 100644 --- a/migrate.c +++ b/migrate.c @@ -50,7 +50,6 @@ static union migrate_header header = {
/* Data sections for version 1 */ static struct iovec sections_v1[] = { - { &header, sizeof(header) }, };
/* Set of data versions */ @@ -333,26 +332,6 @@ static int migrate_target(struct ctx *c, int fd) return rc; }
-/** - * set_migration_watch() - Add the migration file descriptor to epoll - * @c: Execution context - * @fd: File descriptor to add - * @target: Are we the target of the migration? - */ -static void set_migration_watch(const struct ctx *c, int fd, bool target) -{ - union epoll_ref ref = { - .type = EPOLL_TYPE_DEVICE_STATE, - .fd = fd, - }; - struct epoll_event ev = { 0 }; - - ev.data.u64 = ref.u64; - ev.events = target ? EPOLLIN : EPOLLOUT; - - epoll_ctl(c->epollfd, EPOLL_CTL_ADD, ref.fd, &ev); -} - /** * migrate_init() - Set up things necessary for migration * @c: Execution context @@ -372,7 +351,6 @@ void migrate_close(struct ctx *c) { if (c->device_state_fd != -1) { debug("Closing migration channel, fd: %d", c->device_state_fd); - epoll_del(c, c->device_state_fd); close(c->device_state_fd); c->device_state_fd = -1; c->device_state_result = -1; @@ -393,27 +371,27 @@ void migrate_request(struct ctx *c, int fd, bool target) migrate_close(c);
c->device_state_fd = fd; - set_migration_watch(c, c->device_state_fd, target); - + c->migrate_target = target; }
/** * migrate_handler() - Send/receive passt internal state to/from QEMU * @c: Execution context - * @events: epoll events */ -void migrate_handler(struct ctx *c, uint32_t events) +void migrate_handler(struct ctx *c) { - int rc = EIO; + int rc;
- debug("migrate_handler fd %d events %x", c->device_state_fd, events); + if (c->device_state_fd < 0) + return;
- if (events & EPOLLOUT) - rc = migrate_source(c, c->device_state_fd); - else if (events & EPOLLIN) - rc = migrate_target(c, c->device_state_fd); + debug("migrate_handler fd %d target %d", + c->device_state_fd, c->migrate_target);
- /* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */ + if (c->migrate_target) + rc = migrate_target(c, c->device_state_fd); + else + rc = migrate_source(c, c->device_state_fd);
migrate_close(c);
diff --git a/migrate.h b/migrate.h index a222c48..158241f 100644 --- a/migrate.h +++ b/migrate.h @@ -79,6 +79,6 @@ struct migrate_target_handlers { void migrate_init(struct ctx *c); void migrate_close(struct ctx *c); void migrate_request(struct ctx *c, int fd, bool target); -void migrate_handler(struct ctx *c, uint32_t events); +void migrate_handler(struct ctx *c);
#endif /* MIGRATE_H */ diff --git a/passt.c b/passt.c index 3c3a331..1938290 100644 --- a/passt.c +++ b/passt.c @@ -76,7 +76,6 @@ char *epoll_type_str[] = { [EPOLL_TYPE_TAP_LISTEN] = "listening qemu socket", [EPOLL_TYPE_VHOST_CMD] = "vhost-user command socket", [EPOLL_TYPE_VHOST_KICK] = "vhost-user kick socket", - [EPOLL_TYPE_DEVICE_STATE] = "migration device state channel", [EPOLL_TYPE_REPAIR_LISTEN] = "TCP_REPAIR helper listening socket", [EPOLL_TYPE_REPAIR] = "TCP_REPAIR helper socket", }; @@ -360,9 +359,6 @@ loop: case EPOLL_TYPE_VHOST_KICK: vu_kick_cb(c.vdev, ref, &now); break; - case EPOLL_TYPE_DEVICE_STATE: - migrate_handler(&c, eventmask); - break; case EPOLL_TYPE_REPAIR_LISTEN: repair_listen_handler(&c, eventmask); break; @@ -377,5 +373,7 @@ loop:
post_handler(&c, &now);
+ migrate_handler(&c); + goto loop; } diff --git a/passt.h b/passt.h index 5992cbe..4189a4a 100644 --- a/passt.h +++ b/passt.h @@ -241,6 +241,7 @@ struct ip6_ctx { * @vdev: vhost-user device * @device_state_fd: Device state migration channel * @device_state_result: Device state migration result + * @migrate_target: Is this the target for next migration? */ struct ctx { enum passt_modes mode; @@ -313,6 +314,7 @@ struct ctx { /* Migration */ int device_state_fd; int device_state_result; + bool migrate_target; };
void proto_update_l2_buf(const unsigned char *eth_d, diff --git a/vu_common.c b/vu_common.c index 78d1c1b..4797ef9 100644 --- a/vu_common.c +++ b/vu_common.c @@ -305,3 +305,30 @@ err:
return -1; } + +/** + * vu_migrate() - Send/receive passt internal state to/from QEMU + * @c: Execution context + * @events: epoll events + */ +void vu_migrate(struct ctx *c, uint32_t events) +{ + struct vu_dev *vdev = c->vdev; + int rc = EIO; + + debug("vu_migrate fd %d events %x", vdev->device_state_fd, events); + + if (events & EPOLLOUT) + rc = migrate_source(c, vdev->device_state_fd); + else if (events & EPOLLIN) + rc = migrate_target(c, vdev->device_state_fd); + + /* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */ + + vdev->device_state_result = rc; + + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL); + debug("Closing migration channel"); + close(vdev->device_state_fd); + vdev->device_state_fd = -1; +}
Also, this hunk, restoring the function removed in an earlier patch seems to have crept in here, which also breaks build. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
We need this to transfer it, so we can throw away this change soon,
I guess.
Signed-off-by: Stefano Brivio
Having every vhost-user message printed as part of debug output makes
debugging anything else a bit complicated.
Signed-off-by: Stefano Brivio
On Fri, Jan 31, 2025 at 08:39:49PM +0100, Stefano Brivio wrote:
Having every vhost-user message printed as part of debug output makes debugging anything else a bit complicated.
Signed-off-by: Stefano Brivio
I'm a little bit baffled by this. You're changing to trace() a couple of relatively rare messages, that I think belong in debug() category, but *not* changing some things that definitely should be trace() - such as the ones in vu_send_single() and vu_kick_cb().
--- vhost_user.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/vhost_user.c b/vhost_user.c index 2dde405..1092387 100644 --- a/vhost_user.c +++ b/vhost_user.c @@ -640,8 +640,8 @@ static bool vu_set_vring_num_exec(struct vu_dev *vdev, unsigned int idx = msg->payload.state.index; unsigned int num = msg->payload.state.num;
- debug("State.index: %u", idx); - debug("State.num: %u", num); + trace("State.index: %u", idx); + trace("State.num: %u", num); vdev->vq[idx].vring.num = num;
return false; @@ -1176,11 +1176,11 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events) vu_sock_reset(vdev); return; } - debug("================ Vhost user message ================"); - debug("Request: %s (%d)", vu_request_to_string(msg.hdr.request), + trace("================ Vhost user message ================"); + trace("Request: %s (%d)", vu_request_to_string(msg.hdr.request), msg.hdr.request); - debug("Flags: 0x%x", msg.hdr.flags); - debug("Size: %u", msg.hdr.size); + trace("Flags: 0x%x", msg.hdr.flags); + trace("Size: %u", msg.hdr.size);
need_reply = msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK;
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Mon, 3 Feb 2025 14:11:10 +1100
David Gibson
On Fri, Jan 31, 2025 at 08:39:49PM +0100, Stefano Brivio wrote:
Having every vhost-user message printed as part of debug output makes debugging anything else a bit complicated.
Signed-off-by: Stefano Brivio
I'm a little bit baffled by this. You're changing to trace() a couple of relatively rare messages
Well, "rare" is relative, if you're debugging state migration transfers. :) But...
that I think belong in debug() category,
actually, yes, they're not so frequent. Probably we should improve on the reporting instead, because good luck finding your little message with transferred sizes in a rain of: ================ Vhost user message ================ we probably need a simple facility in vhost-user code allowing to start and continue some messages, so that if we need two lines here for "vhost-user: req %s (%d)\nflags: ..." we can do that easily. As long as we're single-threaded, by the way, this could be simply two calls to debug().
but *not* changing some things that definitely should be trace() - such as the ones in vu_send_single() and vu_kick_cb().
Right, also noted for follow-ups.
--- vhost_user.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/vhost_user.c b/vhost_user.c index 2dde405..1092387 100644 --- a/vhost_user.c +++ b/vhost_user.c @@ -640,8 +640,8 @@ static bool vu_set_vring_num_exec(struct vu_dev *vdev, unsigned int idx = msg->payload.state.index; unsigned int num = msg->payload.state.num;
- debug("State.index: %u", idx); - debug("State.num: %u", num); + trace("State.index: %u", idx); + trace("State.num: %u", num); vdev->vq[idx].vring.num = num;
return false; @@ -1176,11 +1176,11 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events) vu_sock_reset(vdev); return; } - debug("================ Vhost user message ================"); - debug("Request: %s (%d)", vu_request_to_string(msg.hdr.request), + trace("================ Vhost user message ================"); + trace("Request: %s (%d)", vu_request_to_string(msg.hdr.request), msg.hdr.request); - debug("Flags: 0x%x", msg.hdr.flags); - debug("Size: %u", msg.hdr.size); + trace("Flags: 0x%x", msg.hdr.flags); + trace("Size: %u", msg.hdr.size);
need_reply = msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK;
-- Stefano
On Mon, Feb 03, 2025 at 07:10:25AM +0100, Stefano Brivio wrote:
On Mon, 3 Feb 2025 14:11:10 +1100 David Gibson
wrote: On Fri, Jan 31, 2025 at 08:39:49PM +0100, Stefano Brivio wrote:
Having every vhost-user message printed as part of debug output makes debugging anything else a bit complicated.
Signed-off-by: Stefano Brivio
I'm a little bit baffled by this. You're changing to trace() a couple of relatively rare messages
Well, "rare" is relative, if you're debugging state migration transfers. :) But...
Eh, I mean it's like 4ish events per migration. And they are actually related to the migration, rather than being unrelated async noise like the stuff in vu_kick_cb(). Admittedly they are pretty verbose messages.
that I think belong in debug() category,
actually, yes, they're not so frequent. Probably we should improve on the reporting instead, because good luck finding your little message with transferred sizes in a rain of:
================ Vhost user message ================
we probably need a simple facility in vhost-user code allowing to start and continue some messages, so that if we need two lines here for "vhost-user: req %s (%d)\nflags: ..." we can do that easily.
Yeah, I'd be fine with making those vhost-user messages a little bit less individually noisy. Which presumably they were to make those stand out against the other debug messages. We perhaps don't want to get into a shouting war with ourselves.
As long as we're single-threaded, by the way, this could be simply two calls to debug().
but *not* changing some things that definitely should be trace() - such as the ones in vu_send_single() and vu_kick_cb().
Right, also noted for follow-ups.
--- vhost_user.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/vhost_user.c b/vhost_user.c index 2dde405..1092387 100644 --- a/vhost_user.c +++ b/vhost_user.c @@ -640,8 +640,8 @@ static bool vu_set_vring_num_exec(struct vu_dev *vdev, unsigned int idx = msg->payload.state.index; unsigned int num = msg->payload.state.num;
- debug("State.index: %u", idx); - debug("State.num: %u", num); + trace("State.index: %u", idx); + trace("State.num: %u", num); vdev->vq[idx].vring.num = num;
return false; @@ -1176,11 +1176,11 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events) vu_sock_reset(vdev); return; } - debug("================ Vhost user message ================"); - debug("Request: %s (%d)", vu_request_to_string(msg.hdr.request), + trace("================ Vhost user message ================"); + trace("Request: %s (%d)", vu_request_to_string(msg.hdr.request), msg.hdr.request); - debug("Flags: 0x%x", msg.hdr.flags); - debug("Size: %u", msg.hdr.size); + trace("Flags: 0x%x", msg.hdr.flags); + trace("Size: %u", msg.hdr.size);
need_reply = msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK;
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On migration, the source process asks passt-helper to set TCP sockets
in repair mode, dumps the information we need to migrate connections,
and closes them.
At this point, we can't pass them back to passt-helper using
SCM_RIGHTS, because they are closed, from that perspective, and
sendmsg() will give us EBADF. But if we don't clear repair mode, the
port they are bound to will not be available for binding in the
target.
Terminate once we're done with the migration and we reported the
state. This is equivalent to clearing repair mode on the sockets we
just closed.
Signed-off-by: Stefano Brivio
On Fri, Jan 31, 2025 at 08:39:50PM +0100, Stefano Brivio wrote:
On migration, the source process asks passt-helper to set TCP sockets in repair mode, dumps the information we need to migrate connections, and closes them.
At this point, we can't pass them back to passt-helper using SCM_RIGHTS, because they are closed, from that perspective, and sendmsg() will give us EBADF. But if we don't clear repair mode, the port they are bound to will not be available for binding in the target.
Terminate once we're done with the migration and we reported the state. This is equivalent to clearing repair mode on the sockets we just closed.
As noted on the passt-repair patch, I think this is based on a misinterpreation of the situation. I think the problem is that the sockets aren't closed in passt-repair, so the additional handle copy is keeping the underlying socket open. This appears to work, because it is causing passt-repair to also terminate. That said, we probably want to terminate on the source side after a succesful migrate anyway. At the very least we need to close() all our sockets, and delete the corresponding flows, because we don't own them any more. Quitting is probably the simplest way to do that. Which also makes me realise, on a *failed* outbound migration, we _do_ need to turn repair mode off on everything again. Is that implemented yet?
Signed-off-by: Stefano Brivio
--- vhost_user.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/vhost_user.c b/vhost_user.c index 1092387..19ede8a 100644 --- a/vhost_user.c +++ b/vhost_user.c @@ -997,6 +997,8 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev, return false; }
+static bool quit_on_device_state = false; + /** * vu_set_device_state_fd_exec() - Set the device state migration channel * @vdev: vhost-user device @@ -1024,6 +1026,9 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev, migrate_request(vdev->context, msg->fds[0], direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);
+ if (direction == VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE) + quit_on_device_state = true; + /* We don't provide a new fd for the data transfer */ vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);
@@ -1201,4 +1206,10 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
if (reply_requested) vu_send_reply(fd, &msg); + + if (quit_on_device_state && + msg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE) { + info("Migration complete, exiting"); + exit(EXIT_SUCCESS); + } }
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Mon, 3 Feb 2025 12:55:47 +1100
David Gibson
On Fri, Jan 31, 2025 at 08:39:50PM +0100, Stefano Brivio wrote:
On migration, the source process asks passt-helper to set TCP sockets in repair mode, dumps the information we need to migrate connections, and closes them.
At this point, we can't pass them back to passt-helper using SCM_RIGHTS, because they are closed, from that perspective, and sendmsg() will give us EBADF. But if we don't clear repair mode, the port they are bound to will not be available for binding in the target.
Terminate once we're done with the migration and we reported the state. This is equivalent to clearing repair mode on the sockets we just closed.
As noted on the passt-repair patch, I think this is based on a misinterpreation of the situation. I think the problem is that the sockets aren't closed in passt-repair, so the additional handle copy is keeping the underlying socket open. This appears to work, because it is causing passt-repair to also terminate.
Right, exactly that.
That said, we probably want to terminate on the source side after a succesful migrate anyway. At the very least we need to close() all our sockets, and delete the corresponding flows, because we don't own them any more. Quitting is probably the simplest way to do that.
I'm not sure if there's an established behaviour for helpers supporting state migration. We could probably close sockets, delete flows, and keep things up and running for the rest (restart from a clean situation), but at that point we already the guest networking is already broken in a number of ways. So, yeah, maybe let's keep this instead.
Which also makes me realise, on a *failed* outbound migration, we _do_ need to turn repair mode off on everything again. Is that implemented yet?
No, sorry, that's the "/* TODO: rollback */" comments in flow_migrate_source_pre(), flow.c. But other than that, it should be pretty much implemented, at a migrate.c level. -- Stefano
On Mon, Feb 03, 2025 at 07:09:32AM +0100, Stefano Brivio wrote:
On Mon, 3 Feb 2025 12:55:47 +1100 David Gibson
wrote: On Fri, Jan 31, 2025 at 08:39:50PM +0100, Stefano Brivio wrote:
On migration, the source process asks passt-helper to set TCP sockets in repair mode, dumps the information we need to migrate connections, and closes them.
At this point, we can't pass them back to passt-helper using SCM_RIGHTS, because they are closed, from that perspective, and sendmsg() will give us EBADF. But if we don't clear repair mode, the port they are bound to will not be available for binding in the target.
Terminate once we're done with the migration and we reported the state. This is equivalent to clearing repair mode on the sockets we just closed.
As noted on the passt-repair patch, I think this is based on a misinterpreation of the situation. I think the problem is that the sockets aren't closed in passt-repair, so the additional handle copy is keeping the underlying socket open. This appears to work, because it is causing passt-repair to also terminate.
Right, exactly that.
That said, we probably want to terminate on the source side after a succesful migrate anyway. At the very least we need to close() all our sockets, and delete the corresponding flows, because we don't own them any more. Quitting is probably the simplest way to do that.
I'm not sure if there's an established behaviour for helpers supporting state migration.
By "helper" do you mean passt as a device helper to qemu, or passt-repair as a helper to passt. For the latter I wouldn't expect so - it's only a weirdness of our situation that we need passt-repair at all. If the former, I'm not really sure what you're after.
We could probably close sockets, delete flows, and keep things up and running for the rest (restart from a clean situation), but at that point we already the guest networking is already broken in a number of ways. So, yeah, maybe let's keep this instead.
So, I realised it's a bit more complicated than that. We need to identify exactly where the "point of no return" is. I'll discuss in our call tonight.
Which also makes me realise, on a *failed* outbound migration, we _do_ need to turn repair mode off on everything again. Is that implemented yet?
No, sorry, that's the "/* TODO: rollback */" comments in flow_migrate_source_pre(), flow.c. But other than that, it should be pretty much implemented, at a migrate.c level.
Right, I realised that a bit after I wrote this. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Mon, 3 Feb 2025 19:52:37 +1100
David Gibson
On Mon, Feb 03, 2025 at 07:09:32AM +0100, Stefano Brivio wrote:
On Mon, 3 Feb 2025 12:55:47 +1100 David Gibson
wrote: On Fri, Jan 31, 2025 at 08:39:50PM +0100, Stefano Brivio wrote:
On migration, the source process asks passt-helper to set TCP sockets in repair mode, dumps the information we need to migrate connections, and closes them.
At this point, we can't pass them back to passt-helper using SCM_RIGHTS, because they are closed, from that perspective, and sendmsg() will give us EBADF. But if we don't clear repair mode, the port they are bound to will not be available for binding in the target.
Terminate once we're done with the migration and we reported the state. This is equivalent to clearing repair mode on the sockets we just closed.
As noted on the passt-repair patch, I think this is based on a misinterpreation of the situation. I think the problem is that the sockets aren't closed in passt-repair, so the additional handle copy is keeping the underlying socket open. This appears to work, because it is causing passt-repair to also terminate.
Right, exactly that.
That said, we probably want to terminate on the source side after a succesful migrate anyway. At the very least we need to close() all our sockets, and delete the corresponding flows, because we don't own them any more. Quitting is probably the simplest way to do that.
I'm not sure if there's an established behaviour for helpers supporting state migration.
By "helper" do you mean passt as a device helper to qemu, or passt-repair as a helper to passt. For the latter I wouldn't expect so - it's only a weirdness of our situation that we need passt-repair at all. If the former, I'm not really sure what you're after.
I meant passt and similar. Is there any convention we should adopt?
We could probably close sockets, delete flows, and keep things up and running for the rest (restart from a clean situation), but at that point we already the guest networking is already broken in a number of ways. So, yeah, maybe let's keep this instead.
So, I realised it's a bit more complicated than that. We need to identify exactly where the "point of no return" is. I'll discuss in our call tonight.
I think it's simply where we close sockets, by the way. -- Stefano
For migration only: we need to store 'oport', our socket-side port,
as we establish a connection from the guest, so that we can bind the
same oport as source port in the migration target.
Use getsockname() to fetch that.
Signed-off-by: Stefano Brivio
On Fri, Jan 31, 2025 at 08:39:51PM +0100, Stefano Brivio wrote:
For migration only: we need to store 'oport', our socket-side port, as we establish a connection from the guest, so that we can bind the same oport as source port in the migration target.
Use getsockname() to fetch that.
Signed-off-by: Stefano Brivio
--- flow.c | 4 ++-- flow_table.h | 4 ++-- tcp.c | 24 +++++++++++++++++++++++- 3 files changed, 27 insertions(+), 5 deletions(-) diff --git a/flow.c b/flow.c index 5638ff1..506cbac 100644 --- a/flow.c +++ b/flow.c @@ -411,8 +411,8 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, * * Return: pointer to the target flowside information */ -const struct flowside *flow_target(const struct ctx *c, union flow *flow, - uint8_t proto) +struct flowside *flow_target(const struct ctx *c, union flow *flow, + uint8_t proto) { char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; struct flow_common *f = &flow->f; diff --git a/flow_table.h b/flow_table.h index 633805d..b107107 100644 --- a/flow_table.h +++ b/flow_table.h @@ -178,8 +178,8 @@ const struct flowside *flow_target_af(union flow *flow, uint8_t pif, sa_family_t af, const void *saddr, in_port_t sport, const void *daddr, in_port_t dport); -const struct flowside *flow_target(const struct ctx *c, union flow *flow, - uint8_t proto); +struct flowside *flow_target(const struct ctx *c, union flow *flow, + uint8_t proto);
union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/tcp.c b/tcp.c index 0bd2a02..4fd405b 100644 --- a/tcp.c +++ b/tcp.c @@ -1471,6 +1471,8 @@ static void tcp_bind_outbound(const struct ctx *c, * @opts: Pointer to start of options * @optlen: Bytes in options: caller MUST ensure available length * @now: Current timestamp + * + * #syscalls:vu getsockname */ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, const void *saddr, const void *daddr, @@ -1479,9 +1481,10 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); - const struct flowside *ini, *tgt; + const struct flowside *ini; struct tcp_tap_conn *conn; union sockaddr_inany sa; + struct flowside *tgt; union flow *flow; int s = -1, mss; uint64_t hash; @@ -1586,6 +1589,25 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, }
tcp_epoll_ctl(c, conn); + + if (c->mode == MODE_VU) { /* To rebind to same oport after migration */
I suspect we'll want this local side information in more places in future, but this is fine for now.
+ if (af == AF_INET) { + struct sockaddr_in s_in; + socklen_t sl; + + sl = sizeof(s_in); + getsockname(s, (struct sockaddr *)&s_in, &sl); + tgt->oport = ntohs(s_in.sin_port);
Since we're already doing the getsockname() we should also update tgt->oaddr, and that might matter in cases where the host has multiple local addresses.
+ } else { + struct sockaddr_in6 s_in6; + socklen_t sl; + + sl = sizeof(s_in6); + getsockname(s, (struct sockaddr *)&s_in6, &sl); + tgt->oport = ntohs(s_in6.sin6_port); + }
We should add an inany_getsockname() or something helper for this. In fact I'm pretty sure I wrote one at some point, but it was lost in the shuffles of various flow table iterations.
+ } + FLOW_ACTIVATE(conn); return;
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Mon, 3 Feb 2025 13:05:33 +1100
David Gibson
On Fri, Jan 31, 2025 at 08:39:51PM +0100, Stefano Brivio wrote:
For migration only: we need to store 'oport', our socket-side port, as we establish a connection from the guest, so that we can bind the same oport as source port in the migration target.
Use getsockname() to fetch that.
Signed-off-by: Stefano Brivio
--- flow.c | 4 ++-- flow_table.h | 4 ++-- tcp.c | 24 +++++++++++++++++++++++- 3 files changed, 27 insertions(+), 5 deletions(-) diff --git a/flow.c b/flow.c index 5638ff1..506cbac 100644 --- a/flow.c +++ b/flow.c @@ -411,8 +411,8 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, * * Return: pointer to the target flowside information */ -const struct flowside *flow_target(const struct ctx *c, union flow *flow, - uint8_t proto) +struct flowside *flow_target(const struct ctx *c, union flow *flow, + uint8_t proto) { char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; struct flow_common *f = &flow->f; diff --git a/flow_table.h b/flow_table.h index 633805d..b107107 100644 --- a/flow_table.h +++ b/flow_table.h @@ -178,8 +178,8 @@ const struct flowside *flow_target_af(union flow *flow, uint8_t pif, sa_family_t af, const void *saddr, in_port_t sport, const void *daddr, in_port_t dport); -const struct flowside *flow_target(const struct ctx *c, union flow *flow, - uint8_t proto); +struct flowside *flow_target(const struct ctx *c, union flow *flow, + uint8_t proto);
union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/tcp.c b/tcp.c index 0bd2a02..4fd405b 100644 --- a/tcp.c +++ b/tcp.c @@ -1471,6 +1471,8 @@ static void tcp_bind_outbound(const struct ctx *c, * @opts: Pointer to start of options * @optlen: Bytes in options: caller MUST ensure available length * @now: Current timestamp + * + * #syscalls:vu getsockname */ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, const void *saddr, const void *daddr, @@ -1479,9 +1481,10 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); - const struct flowside *ini, *tgt; + const struct flowside *ini; struct tcp_tap_conn *conn; union sockaddr_inany sa; + struct flowside *tgt; union flow *flow; int s = -1, mss; uint64_t hash; @@ -1586,6 +1589,25 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, }
tcp_epoll_ctl(c, conn); + + if (c->mode == MODE_VU) { /* To rebind to same oport after migration */
I suspect we'll want this local side information in more places in future, but this is fine for now.
+ if (af == AF_INET) { + struct sockaddr_in s_in; + socklen_t sl; + + sl = sizeof(s_in); + getsockname(s, (struct sockaddr *)&s_in, &sl); + tgt->oport = ntohs(s_in.sin_port);
Since we're already doing the getsockname() we should also update tgt->oaddr, and that might matter in cases where the host has multiple local addresses.
I had that in a previous version, because I was actually restoring it as I thought it was needed, then I dropped it. We expect the configuration of the target to be the same as the source, so the same connect() should yield to the same source address being used (minus the fact that we don't set socket options yet (point 9. of the to-do list from cover letter). So should we really bind() to a specific source address just because we happen to know it? I'm not quite sure.
+ } else { + struct sockaddr_in6 s_in6; + socklen_t sl; + + sl = sizeof(s_in6); + getsockname(s, (struct sockaddr *)&s_in6, &sl); + tgt->oport = ntohs(s_in6.sin6_port); + }
We should add an inany_getsockname() or something helper for this. In fact I'm pretty sure I wrote one at some point, but it was lost in the shuffles of various flow table iterations.
I guess it can be a follow-up. Noted, anyway. -- Stefano
On Mon, Feb 03, 2025 at 07:09:37AM +0100, Stefano Brivio wrote:
On Mon, 3 Feb 2025 13:05:33 +1100 David Gibson
wrote: On Fri, Jan 31, 2025 at 08:39:51PM +0100, Stefano Brivio wrote:
For migration only: we need to store 'oport', our socket-side port, as we establish a connection from the guest, so that we can bind the same oport as source port in the migration target.
Use getsockname() to fetch that.
Signed-off-by: Stefano Brivio
--- flow.c | 4 ++-- flow_table.h | 4 ++-- tcp.c | 24 +++++++++++++++++++++++- 3 files changed, 27 insertions(+), 5 deletions(-) diff --git a/flow.c b/flow.c index 5638ff1..506cbac 100644 --- a/flow.c +++ b/flow.c @@ -411,8 +411,8 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, * * Return: pointer to the target flowside information */ -const struct flowside *flow_target(const struct ctx *c, union flow *flow, - uint8_t proto) +struct flowside *flow_target(const struct ctx *c, union flow *flow, + uint8_t proto) { char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; struct flow_common *f = &flow->f; diff --git a/flow_table.h b/flow_table.h index 633805d..b107107 100644 --- a/flow_table.h +++ b/flow_table.h @@ -178,8 +178,8 @@ const struct flowside *flow_target_af(union flow *flow, uint8_t pif, sa_family_t af, const void *saddr, in_port_t sport, const void *daddr, in_port_t dport); -const struct flowside *flow_target(const struct ctx *c, union flow *flow, - uint8_t proto); +struct flowside *flow_target(const struct ctx *c, union flow *flow, + uint8_t proto);
union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/tcp.c b/tcp.c index 0bd2a02..4fd405b 100644 --- a/tcp.c +++ b/tcp.c @@ -1471,6 +1471,8 @@ static void tcp_bind_outbound(const struct ctx *c, * @opts: Pointer to start of options * @optlen: Bytes in options: caller MUST ensure available length * @now: Current timestamp + * + * #syscalls:vu getsockname */ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, const void *saddr, const void *daddr, @@ -1479,9 +1481,10 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); - const struct flowside *ini, *tgt; + const struct flowside *ini; struct tcp_tap_conn *conn; union sockaddr_inany sa; + struct flowside *tgt; union flow *flow; int s = -1, mss; uint64_t hash; @@ -1586,6 +1589,25 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, }
tcp_epoll_ctl(c, conn); + + if (c->mode == MODE_VU) { /* To rebind to same oport after migration */
I suspect we'll want this local side information in more places in future, but this is fine for now.
+ if (af == AF_INET) { + struct sockaddr_in s_in; + socklen_t sl; + + sl = sizeof(s_in); + getsockname(s, (struct sockaddr *)&s_in, &sl); + tgt->oport = ntohs(s_in.sin_port);
Since we're already doing the getsockname() we should also update tgt->oaddr, and that might matter in cases where the host has multiple local addresses.
I had that in a previous version, because I was actually restoring it as I thought it was needed, then I dropped it.
We expect the configuration of the target to be the same as the source,
Eh... up to a point. I'm not sure about the kubevirt case, but in general when migrating amongst managed hosts it would not surprise me to find that the destination has multiple IPs. One is the "public" IP used by the workload and will indeed be the same between the ends. However, there could also be a "management" IP that's different between them - after all the management system will need to talk to both source and destination simultaneously for a little while.
so the same connect() should yield to the same source address being used (minus the fact that we don't set socket options yet (point 9. of the to-do list from cover letter). So should we really bind() to a specific source address just because we happen to know it? I'm not quite sure.
Yes, we should. Note that right now, the outbound socket already has a fixed bound address, assigned at connect() time. We don't track it in our internal data structures, it's still there as part of the socket state, and our behaviour reflects that. That forms part of the socket state which we should maintain across migration. We could do this lazily - one of the pre-migrate steps could be a getsockname() on any flows which have an unassigned oaddr/oport.
+ } else { + struct sockaddr_in6 s_in6; + socklen_t sl; + + sl = sizeof(s_in6); + getsockname(s, (struct sockaddr *)&s_in6, &sl); + tgt->oport = ntohs(s_in6.sin6_port); + }
We should add an inany_getsockname() or something helper for this. In fact I'm pretty sure I wrote one at some point, but it was lost in the shuffles of various flow table iterations.
I guess it can be a follow-up. Noted, anyway.
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Mon, 3 Feb 2025 19:59:12 +1100
David Gibson
On Mon, Feb 03, 2025 at 07:09:37AM +0100, Stefano Brivio wrote:
On Mon, 3 Feb 2025 13:05:33 +1100 David Gibson
wrote: On Fri, Jan 31, 2025 at 08:39:51PM +0100, Stefano Brivio wrote:
For migration only: we need to store 'oport', our socket-side port, as we establish a connection from the guest, so that we can bind the same oport as source port in the migration target.
Use getsockname() to fetch that.
Signed-off-by: Stefano Brivio
--- flow.c | 4 ++-- flow_table.h | 4 ++-- tcp.c | 24 +++++++++++++++++++++++- 3 files changed, 27 insertions(+), 5 deletions(-) diff --git a/flow.c b/flow.c index 5638ff1..506cbac 100644 --- a/flow.c +++ b/flow.c @@ -411,8 +411,8 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, * * Return: pointer to the target flowside information */ -const struct flowside *flow_target(const struct ctx *c, union flow *flow, - uint8_t proto) +struct flowside *flow_target(const struct ctx *c, union flow *flow, + uint8_t proto) { char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; struct flow_common *f = &flow->f; diff --git a/flow_table.h b/flow_table.h index 633805d..b107107 100644 --- a/flow_table.h +++ b/flow_table.h @@ -178,8 +178,8 @@ const struct flowside *flow_target_af(union flow *flow, uint8_t pif, sa_family_t af, const void *saddr, in_port_t sport, const void *daddr, in_port_t dport); -const struct flowside *flow_target(const struct ctx *c, union flow *flow, - uint8_t proto); +struct flowside *flow_target(const struct ctx *c, union flow *flow, + uint8_t proto);
union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/tcp.c b/tcp.c index 0bd2a02..4fd405b 100644 --- a/tcp.c +++ b/tcp.c @@ -1471,6 +1471,8 @@ static void tcp_bind_outbound(const struct ctx *c, * @opts: Pointer to start of options * @optlen: Bytes in options: caller MUST ensure available length * @now: Current timestamp + * + * #syscalls:vu getsockname */ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, const void *saddr, const void *daddr, @@ -1479,9 +1481,10 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); - const struct flowside *ini, *tgt; + const struct flowside *ini; struct tcp_tap_conn *conn; union sockaddr_inany sa; + struct flowside *tgt; union flow *flow; int s = -1, mss; uint64_t hash; @@ -1586,6 +1589,25 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af, }
tcp_epoll_ctl(c, conn); + + if (c->mode == MODE_VU) { /* To rebind to same oport after migration */
I suspect we'll want this local side information in more places in future, but this is fine for now.
+ if (af == AF_INET) { + struct sockaddr_in s_in; + socklen_t sl; + + sl = sizeof(s_in); + getsockname(s, (struct sockaddr *)&s_in, &sl); + tgt->oport = ntohs(s_in.sin_port);
Since we're already doing the getsockname() we should also update tgt->oaddr, and that might matter in cases where the host has multiple local addresses.
I had that in a previous version, because I was actually restoring it as I thought it was needed, then I dropped it.
We expect the configuration of the target to be the same as the source,
Eh... up to a point. I'm not sure about the kubevirt case, but in general when migrating amongst managed hosts it would not surprise me to find that the destination has multiple IPs. One is the "public" IP used by the workload and will indeed be the same between the ends. However, there could also be a "management" IP that's different between them - after all the management system will need to talk to both source and destination simultaneously for a little while.
so the same connect() should yield to the same source address being used (minus the fact that we don't set socket options yet (point 9. of the to-do list from cover letter). So should we really bind() to a specific source address just because we happen to know it? I'm not quite sure.
Yes, we should. Note that right now, the outbound socket already has a fixed bound address, assigned at connect() time. We don't track it in our internal data structures, it's still there as part of the socket state, and our behaviour reflects that. That forms part of the socket state which we should maintain across migration.
Okay, fine, then let me add that back.
We could do this lazily - one of the pre-migrate steps could be a getsockname() on any flows which have an unassigned oaddr/oport.
Sure, on the other hand that's an optimisation, and I already have code ready for this. -- Stefano
Those are symmetric to TAPSIDE(x)/TAPFLOW(x) and I'll use them in
the next patch to extract 'oport' in order to re-bind sockets to
the original socket-side local port.
Signed-off-by: Stefano Brivio
On Fri, Jan 31, 2025 at 08:39:52PM +0100, Stefano Brivio wrote:
Those are symmetric to TAPSIDE(x)/TAPFLOW(x) and I'll use them in the next patch to extract 'oport' in order to re-bind sockets to the original socket-side local port.
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
--- tcp_internal.h | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/tcp_internal.h b/tcp_internal.h index 94e5780..9cf31f5 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -38,9 +38,13 @@ #define OPT_SACK 5 #define OPT_TS 8
-#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) -#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)])) -#define TAP_SIDX(conn_) (FLOW_SIDX((conn_), TAPSIDE(conn_))) +#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) +#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)])) +#define TAP_SIDX(conn_) (FLOW_SIDX((conn_), TAPSIDE(conn_))) + +#define HOSTSIDE(conn_) ((conn_)->f.pif[1] == PIF_HOST) +#define HOSTFLOW(conn_) (&((conn_)->f.side[HOSTSIDE(conn_)])) +#define HOST_SIDX(conn_) (FLOW_SIDX((conn_), TAPSIDE(conn_)))
#define CONN_V4(conn) (!!inany_v4(&TAPFLOW(conn)->oaddr)) #define CONN_V6(conn) (!CONN_V4(conn))
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
It's draft quality, with a number of hacks, and it will need a partial
rewrite. Add:
- flow_migrate_target_post(), to open target-side sockets and bind
them, switch them to repair mode, connect them, and make them leave
repair mode again
- copies of flow table, 'flow_first_free' pointer, related hash table,
and hash secret. The copy of the hash secret shows that the current
declarative approach to data sections has some drawbacks
Change tcp_flow_dump_seq() into tcp_flow_repair_seq(), which can dump
as well as restore sequences (used before connecting sockets).
Once we connect sockets, before we take them out of repair mode, we
need to restore MSS and window scaling information (what would be
determined by TCP options on handshake). I'm using hardcoded values as
we don't have a way to transfer these bits of socket-side information.
Before we turn repair mode off, add sockets to the epoll list and set
up per-socket timerfd descriptors, with initial timer scheduling.
Signed-off-by: Stefano Brivio
On Fri, 31 Jan 2025 20:39:33 +0100
Stefano Brivio
What clearly needs changes:
Maybe something less elegant but actually functional like: --- static int migrate_data_v1(int fd, struct ctx *c, bool target) { struct iovec context[] { /* All these need to be network endian */ c->hash_secret, sizeof(c->hash_secret), c->ipv4.addr_seen, sizeof(c->ipv4.addr_seen), c->ipv6.addr_seen, sizeof(c->ipv6.addr_seen), c->ipv6.addr_ll_seen, sizeof(c->ipv6.addr_ll_seen), /* Leave NDP's next_ra alone, start from 0 */ &log_written /* Drop static */ sizeof(log_written), }; int rc; if (target) rc = read_remainder(fd, context, ARRAY_SIZE(context), 0); else rc = write_remainder(fd, context, ARRAY_SIZE(context), 0); if (rc) return errno; if (target) return flow_migrate_data_target_v1(fd, c); return flow_migrate_data_source_v1(fd, c); } --- and: --- struct tcp_flow_transfer_v1 { struct flow_common f; uint8_t retrans; uint16_t ws_from_tap; /* All BE */ uint16_t ws_to_tap; uint16_t tap_mss; ... } __attribute__((packed)); union flow_transfer_v1 { struct tcp_flow_transfer_v1 tcp; ... }; int flow_migrate_data_source_v1(int fd, struct ctx *c) { for_each_active_flow(flowtab) { union flow_transfer_v1 d; switch (flow->f.type) { case FLOW_TCP: d.tcp.f = flow->f; d.tcp.retrans = flow->tcp.retrans; d.tcp.ws_from_tap = htons(d.tcp.ws_from_tap); ...; /* Fetch window stuff, socket must be in repair mode */ send(...); } int flow_migrate_data_target_v1(int fd, struct ctx *c) { ... } And I'm not sure about "[PATCH 0/2] Fancier version handling for migration": perhaps we could switch to something radically easier from the beginning. I mean, as we have to drop the declarative approach altogether, at least let's minimise LoCs... -- Stefano
On Sat, Feb 01, 2025 at 08:45:18AM +0100, Stefano Brivio wrote:
On Fri, 31 Jan 2025 20:39:33 +0100 Stefano Brivio
wrote: What clearly needs changes:
Maybe something less elegant but actually functional like:
--- static int migrate_data_v1(int fd, struct ctx *c, bool target) { struct iovec context[] { /* All these need to be network endian */ c->hash_secret, sizeof(c->hash_secret), c->ipv4.addr_seen, sizeof(c->ipv4.addr_seen), c->ipv6.addr_seen, sizeof(c->ipv6.addr_seen), c->ipv6.addr_ll_seen, sizeof(c->ipv6.addr_ll_seen), /* Leave NDP's next_ra alone, start from 0 */ &log_written /* Drop static */ sizeof(log_written), }; int rc;
if (target) rc = read_remainder(fd, context, ARRAY_SIZE(context), 0); else rc = write_remainder(fd, context, ARRAY_SIZE(context), 0);
if (rc) return errno;
if (target) return flow_migrate_data_target_v1(fd, c); return flow_migrate_data_source_v1(fd, c); } ---
and:
--- struct tcp_flow_transfer_v1 { struct flow_common f;
uint8_t retrans;
uint16_t ws_from_tap; /* All BE */ uint16_t ws_to_tap; uint16_t tap_mss;
... } __attribute__((packed));
union flow_transfer_v1 { struct tcp_flow_transfer_v1 tcp; ... };
int flow_migrate_data_source_v1(int fd, struct ctx *c) { for_each_active_flow(flowtab) { union flow_transfer_v1 d;
switch (flow->f.type) { case FLOW_TCP: d.tcp.f = flow->f; d.tcp.retrans = flow->tcp.retrans; d.tcp.ws_from_tap = htons(d.tcp.ws_from_tap); ...;
/* Fetch window stuff, socket must be in repair mode */
send(...); }
int flow_migrate_data_target_v1(int fd, struct ctx *c) { ... }
And I'm not sure about "[PATCH 0/2] Fancier version handling for migration": perhaps we could switch to something radically easier from the beginning. I mean, as we have to drop the declarative approach altogether, at least let's minimise LoCs...
Yeah, I think I have a better idea that kind of covers both of these. Patches shortly, I hope. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
participants (2)
-
David Gibson
-
Stefano Brivio