This switches to the new facility with stages and introduces per-flow
migration, with a simple structure roughly based on struct tcp_tap_conn.

I tested this: a simple connection between socat and nc with some text
works (again) for me. Sequence numbers (of course), window parameters,
and MSS all match.

However, it *is* draft quality and *it might contain bugs*. It also
contains a lot of TODO items, missing comments, probably some
left-overs, and so on.

David Gibson (2):
  migrate: Make more handling common rather than vhost-user specific
  migrate: Don't handle the migration channel through epoll

Stefano Brivio (4):
  Introduce facilities for guest migration on top of vhost-user
    infrastructure
  Add interfaces and configuration bits for passt-repair
  vhost_user: Make source quit after reporting migration state
  Implement source and target sides of migration

 Makefile     |  14 +-
 conf.c       |  44 +++++-
 epoll_type.h |   6 +-
 flow.c       | 226 +++++++++++++++++++++++++++++++
 flow.h       |   7 +
 isolation.c  |   2 +-
 migrate.c    | 291 ++++++++++++++++++++++++++++++++++++++++
 migrate.h    |  54 ++++++++
 passt.1      |  11 ++
 passt.c      |  17 ++-
 passt.h      |  17 +++
 repair.c     | 193 ++++++++++++++++++++++++++
 repair.h     |  16 +++
 tap.c        |  65 +--------
 tcp.c        | 372 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tcp_conn.h   |  59 ++++++++
 util.c       |  62 +++++++++
 util.h       |  27 ++++
 vhost_user.c |  69 +++------
 virtio.h     |   4 -
 vu_common.c  |  49 +------
 vu_common.h  |   2 +-
 22 files changed, 1427 insertions(+), 180 deletions(-)
 create mode 100644 migrate.c
 create mode 100644 migrate.h
 create mode 100644 repair.c
 create mode 100644 repair.h

-- 
2.43.0
Add migration facilities based on top of the current vhost-user
infrastructure, moving vu_migrate() to migrate.c.

Versioned migration stages define function pointers to be called on
source or target, or data sections that need to be transferred.

The migration header consists of a magic number and a version
identifier.

Co-authored-by: David Gibson <david(a)gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
 Makefile    |  12 +--
 migrate.c   | 210 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 migrate.h   |  51 +++++++++++++
 passt.c     |   2 +-
 util.h      |  26 +++++++
 vu_common.c |  58 +++++----------
 vu_common.h |   2 +-
 7 files changed, 315 insertions(+), 46 deletions(-)
 create mode 100644 migrate.c
 create mode 100644 migrate.h

diff --git a/Makefile b/Makefile
index d3d4b78..be89b07 100644
--- a/Makefile
+++ b/Makefile
@@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
-	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
-	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
+	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
+	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
 	vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
 PASST_REPAIR_SRCS = passt-repair.c
@@ -49,10 +49,10 @@ MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
 
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
-	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
-	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
-	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
-	virtio.h vu_common.h
+	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
+	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
+	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
+	vhost_user.h virtio.h vu_common.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
diff --git a/migrate.c b/migrate.c
new file mode 100644
index 0000000..a7031f9
--- /dev/null
+++ b/migrate.c
@@ -0,0 +1,210 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * migrate.c - Migration sections, layout, and routines
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio(a)redhat.com>
+ */
+
+#include <errno.h>
+#include <sys/uio.h>
+
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "inany.h"
+#include "flow.h"
+#include "flow_table.h"
+
+#include "migrate.h"
+
+/* Current version of migration data */
+#define MIGRATE_VERSION		1
+
+/* Magic identifier for migration data */
+#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
+
+/* Migration header to send from source */
+static struct migrate_header header = {
+	.magic		= htonll_constant(MIGRATE_MAGIC),
+	.version	= htonl_constant(MIGRATE_VERSION),
+};
+
+/**
+ * migrate_send_block() - Migration stage handler to send verbatim data
+ * @c:		Execution context
+ * @stage:	Migration stage
+ * @fd:		Migration fd
+ *
+ * Sends the buffer in @stage->iov over the migration channel.
+ */
+__attribute__((__unused__))
+static int migrate_send_block(struct ctx *c,
+			      const struct migrate_stage *stage, int fd)
+{
+	(void)c;
+
+	if (write_remainder(fd, &stage->iov, 1, 0) < 0)
+		return errno;
+
+	return 0;
+}
+
+/**
+ * migrate_recv_block() - Migration stage handler to receive verbatim data
+ * @c:		Execution context
+ * @stage:	Migration stage
+ * @fd:		Migration fd
+ *
+ * Reads the buffer in @stage->iov from the migration channel.
+ *
+ * #syscalls:vu readv
+ */
+__attribute__((__unused__))
+static int migrate_recv_block(struct ctx *c,
+			      const struct migrate_stage *stage, int fd)
+{
+	(void)c;
+
+	if (read_remainder(fd, &stage->iov, 1, 0) < 0)
+		return errno;
+
+	return 0;
+}
+
+#define DATA_STAGE(v)						\
+	{							\
+		.name	= #v,					\
+		.source	= migrate_send_block,			\
+		.target	= migrate_recv_block,			\
+		.iov	= { &(v), sizeof(v) },			\
+	}
+
+/* Stages for version 1 */
+static const struct migrate_stage stages_v1[] = {
+	{
+		.name = "flow pre",
+		.target = NULL,
+	},
+	{
+		.name = "flow post",
+		.source = NULL,
+	},
+	{ 0 },
+};
+
+/* Set of data versions */
+static const struct migrate_version versions[] = {
+	{
+		1, stages_v1,
+	},
+	{ 0 },
+};
+
+/**
+ * migrate_source() - Migration as source, send state to hypervisor
+ * @c:		Execution context
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+int migrate_source(struct ctx *c, int fd)
+{
+	const struct migrate_version *v = versions + ARRAY_SIZE(versions) - 1;
+	const struct migrate_stage *s;
+	int ret;
+
+	ret = write_all_buf(fd, &header, sizeof(header));
+	if (ret) {
+		err("Can't send migration header: %s, abort", strerror_(ret));
+		return ret;
+	}
+
+	for (s = v->s; *s->name; s++) {
+		if (!s->source)
+			continue;
+
+		debug("Source side migration: %s", s->name);
+
+		if ((ret = s->source(c, s, fd))) {
+			err("Source migration stage %s: %s, abort", s->name,
+			    strerror_(ret));
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * migrate_target_read_header() - Read header in target
+ * @fd:		Descriptor for state transfer
+ *
+ * Return: version number on success, 0 on failure with errno set
+ */
+static uint32_t migrate_target_read_header(int fd)
+{
+	struct migrate_header h;
+
+	if (read_all_buf(fd, &h, sizeof(h)))
+		return 0;
+
+	debug("Source magic: 0x%016" PRIx64 ", version: %u",
+	      be64toh(h.magic), ntohl_constant(h.version));
+
+	if (ntohll_constant(h.magic) != MIGRATE_MAGIC || !ntohl(h.version)) {
+		errno = EINVAL;
+		return 0;
+	}
+
+	return ntohl(h.version);
+}
+
+/**
+ * migrate_target() - Migration as target, receive state from hypervisor
+ * @c:		Execution context
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+int migrate_target(struct ctx *c, int fd)
+{
+	const struct migrate_version *v;
+	const struct migrate_stage *s;
+	uint32_t id;
+	int ret;
+
+	id = migrate_target_read_header(fd);
+	if (!id) {
+		ret = errno;
+		err("Migration header check failed: %s, abort", strerror_(ret));
+		return ret;
+	}
+
+	for (v = versions; v->id && v->id == id; v++);
+	if (!v->id) {
+		err("Unsupported version: %u", id);
+		return -ENOTSUP;
+	}
+
+	for (s = v->s; *s->name; s++) {
+		if (!s->target)
+			continue;
+
+		debug("Target side migration: %s", s->name);
+
+		if ((ret = s->target(c, s, fd))) {
+			err("Target migration stage %s: %s, abort", s->name,
+			    strerror_(ret));
+			return ret;
+		}
+	}
+
+	return 0;
+}
diff --git a/migrate.h b/migrate.h
new file mode 100644
index 0000000..3093b6e
--- /dev/null
+++ b/migrate.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio(a)redhat.com>
+ */
+
+#ifndef MIGRATE_H
+#define MIGRATE_H
+
+/**
+ * struct migrate_header - Migration header from source
+ * @magic:	0xB1BB1D1B0BB1D1B0, network order
+ * @version:	Highest known, target aborts if too old, network order
+ */
+struct migrate_header {
+	uint64_t magic;
+	uint32_t version;
+} __attribute__((packed));
+
+/**
+ * struct migrate_stage - Callbacks and parameters for one stage of migration
+ * @name:	Stage name (for debugging)
+ * @source:	Callback to implement this stage on the source
+ * @target:	Callback to implement this stage on the target
+ * @iov:	Optional data section to transfer
+ */
+struct migrate_stage {
+	const char *name;
+	int (*source)(struct ctx *c,
+		      const struct migrate_stage *stage, int fd);
+	int (*target)(struct ctx *c,
+		      const struct migrate_stage *stage, int fd);
+
+	/* FIXME: rollback callbacks? */
+
+	struct iovec iov;
+};
+
+/**
+ * struct migrate_version - Stages for a particular protocol version
+ * @id:		Version number, host order
+ * @s:		Ordered array of stages, NULL-terminated
+ */
+struct migrate_version {
+	uint32_t id;
+	const struct migrate_stage *s;
+};
+
+int migrate_source(struct ctx *c, int fd);
+int migrate_target(struct ctx *c, int fd);
+
+#endif /* MIGRATE_H */
diff --git a/passt.c b/passt.c
index b1c8ab6..184d4e5 100644
--- a/passt.c
+++ b/passt.c
@@ -358,7 +358,7 @@ loop:
 			vu_kick_cb(c.vdev, ref, &now);
 			break;
 		case EPOLL_TYPE_VHOST_MIGRATION:
-			vu_migrate(c.vdev, eventmask);
+			vu_migrate(&c, eventmask);
 			break;
 		default:
 			/* Can't happen */
diff --git a/util.h b/util.h
index 23b165c..1aed629 100644
--- a/util.h
+++ b/util.h
@@ -122,12 +122,38 @@
 	 (((x) & 0x0000ff00) <<  8) | (((x) & 0x000000ff) << 24))
 #endif
 
+#ifndef __bswap_constant_32
+#define __bswap_constant_32(x)						\
+	((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >>  8) |	\
+	 (((x) & 0x0000ff00) <<  8) | (((x) & 0x000000ff) << 24))
+#endif
+
+#ifndef __bswap_constant_64
+#define __bswap_constant_64(x)						\
+	((((x) & 0xff00000000000000ULL) >> 56) |			\
+	 (((x) & 0x00ff000000000000ULL) >> 40) |			\
+	 (((x) & 0x0000ff0000000000ULL) >> 24) |			\
+	 (((x) & 0x000000ff00000000ULL) >>  8) |			\
+	 (((x) & 0x00000000ff000000ULL) <<  8) |			\
+	 (((x) & 0x0000000000ff0000ULL) << 24) |			\
+	 (((x) & 0x000000000000ff00ULL) << 40) |			\
+	 (((x) & 0x00000000000000ffULL) << 56))
+#endif
+
 #if __BYTE_ORDER == __BIG_ENDIAN
 #define	htons_constant(x)	(x)
 #define	htonl_constant(x)	(x)
+#define	htonll_constant(x)	(x)
+#define	ntohs_constant(x)	(x)
+#define	ntohl_constant(x)	(x)
+#define	ntohll_constant(x)	(x)
 #else
 #define	htons_constant(x)	(__bswap_constant_16(x))
 #define	htonl_constant(x)	(__bswap_constant_32(x))
+#define	htonll_constant(x)	(__bswap_constant_64(x))
+#define	ntohs_constant(x)	(__bswap_constant_16(x))
+#define	ntohl_constant(x)	(__bswap_constant_32(x))
+#define	ntohll_constant(x)	(__bswap_constant_64(x))
 #endif
 
 /**
diff --git a/vu_common.c b/vu_common.c
index ab04d31..3d41824 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -5,6 +5,7 @@
  * common_vu.c - vhost-user common UDP and TCP functions
  */
 
+#include <errno.h>
 #include <unistd.h>
 #include <sys/uio.h>
 #include <sys/eventfd.h>
@@ -17,6 +18,7 @@
 #include "vhost_user.h"
 #include "pcap.h"
 #include "vu_common.h"
+#include "migrate.h"
 
 #define VU_MAX_TX_BUFFER_NB	2
 
@@ -305,48 +307,28 @@ err:
 }
 
 /**
- * vu_migrate() - Send/receive passt insternal state to/from QEMU
- * @vdev:	vhost-user device
+ * vu_migrate() - Send/receive passt internal state to/from QEMU
+ * @c:		Execution context
  * @events:	epoll events
  */
-void vu_migrate(struct vu_dev *vdev, uint32_t events)
+void vu_migrate(struct ctx *c, uint32_t events)
 {
-	int ret;
+	struct vu_dev *vdev = c->vdev;
+	int rc = EIO;
 
-	/* TODO: collect/set passt internal state
-	 * and use vdev->device_state_fd to send/receive it
-	 */
 	debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
-	if (events & EPOLLOUT) {
-		debug("Saving backend state");
-
-		/* send some stuff */
-		ret = write(vdev->device_state_fd, "PASST", 6);
-		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
-		vdev->device_state_result = ret == -1 ? -1 : 0;
-		/* Closing the file descriptor signals the end of transfer */
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-		vdev->device_state_fd = -1;
-	} else if (events & EPOLLIN) {
-		char buf[6];
-
-		debug("Loading backend state");
-		/* read some stuff */
-		ret = read(vdev->device_state_fd, buf, sizeof(buf));
-		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
-		if (ret != sizeof(buf)) {
-			vdev->device_state_result = -1;
-		} else {
-			ret = strncmp(buf, "PASST", sizeof(buf));
-			vdev->device_state_result = ret == 0 ? 0 : -1;
-		}
-	} else if (events & EPOLLHUP) {
-		debug("Closing migration channel");
-		/* The end of file signals the end of the transfer. */
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-		vdev->device_state_fd = -1;
-	}
+	if (events & EPOLLOUT)
+		rc = migrate_source(c, vdev->device_state_fd);
+	else if (events & EPOLLIN)
+		rc = migrate_target(c, vdev->device_state_fd);
+
+	/* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */
+
+	vdev->device_state_result = rc;
+
+	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL);
+	debug("Closing migration channel");
+	close(vdev->device_state_fd);
+	vdev->device_state_fd = -1;
 }
diff --git a/vu_common.h b/vu_common.h
index d56c021..69c4006 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 		const struct timespec *now);
 int vu_send_single(const struct ctx *c, const void *buf, size_t size);
-void vu_migrate(struct vu_dev *vdev, uint32_t events);
+void vu_migrate(struct ctx *c, uint32_t events);
 
 #endif /* VU_COMMON_H */
-- 
2.43.0
On Wed, Feb 05, 2025 at 01:38:59AM +0100, Stefano Brivio wrote:
> Add migration facilities based on top of the current vhost-user
> infrastructure, moving vu_migrate() to migrate.c.
> 
> Versioned migration stages define function pointers to be called on
> source or target, or data sections that need to be transferred.
> 
> The migration header consists of a magic number and a version
> identifier.
> 
> Co-authored-by: David Gibson <david(a)gibson.dropbear.id.au>

Given this, it should also have my S-o-b,

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>

And, given that we already have an awkward co-authorship situation, it
probably makes sense to fold patches 2 & 3 into this one.

> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
> ---

[...]

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
From: David Gibson <david(a)gibson.dropbear.id.au>

A lot of the migration logic in vhost_user.c and vu_common.c isn't
really specific to vhost-user, but matches the overall structure of
migration.  This applies to vu_migrate() and to the parts of
vu_set_device_state_fd_exec() which aren't related to parsing the
specific vhost-user control request.

Move this logic to migrate.c, with matching renames.

Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
 epoll_type.h |  4 +--
 migrate.c    | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 migrate.h    |  6 ++--
 passt.c      |  6 ++--
 passt.h      |  6 ++++
 vhost_user.c | 58 ++++-----------------------
 virtio.h     |  4 ---
 vu_common.c  | 27 ---------------
 vu_common.h  |  2 +-
 9 files changed, 113 insertions(+), 92 deletions(-)

diff --git a/epoll_type.h b/epoll_type.h
index fd9eac3..b1d3a1d 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -40,8 +40,8 @@ enum epoll_type {
 	EPOLL_TYPE_VHOST_CMD,
 	/* vhost-user kick event socket */
 	EPOLL_TYPE_VHOST_KICK,
-	/* vhost-user migration socket */
-	EPOLL_TYPE_VHOST_MIGRATION,
+	/* migration device state channel */
+	EPOLL_TYPE_DEVICE_STATE,
 
 	EPOLL_NUM_TYPES,
 };
diff --git a/migrate.c b/migrate.c
index a7031f9..86a76fc 100644
--- a/migrate.c
+++ b/migrate.c
@@ -21,6 +21,7 @@
 #include "inany.h"
 #include "flow.h"
 #include "flow_table.h"
+#include "repair.h"
 
 #include "migrate.h"
 
@@ -114,7 +115,7 @@ static const struct migrate_version versions[] = {
  *
  * Return: 0 on success, positive error code on failure
  */
-int migrate_source(struct ctx *c, int fd)
+static int migrate_source(struct ctx *c, int fd)
 {
 	const struct migrate_version *v = versions + ARRAY_SIZE(versions) - 1;
 	const struct migrate_stage *s;
@@ -173,7 +174,7 @@ static uint32_t migrate_target_read_header(int fd)
 *
 * Return: 0 on success, positive error code on failure
 */
-int migrate_target(struct ctx *c, int fd)
+static int migrate_target(struct ctx *c, int fd)
 {
 	const struct migrate_version *v;
 	const struct migrate_stage *s;
@@ -208,3 +209,90 @@ int migrate_target(struct ctx *c, int fd)
 
 	return 0;
 }
+
+/**
+ * set_migration_watch() - Add the migration file descriptor to epoll
+ * @c:		Execution context
+ * @fd:		File descriptor to add
+ * @target:	Are we the target of the migration?
+ */
+static void set_migration_watch(const struct ctx *c, int fd, bool target)
+{
+	union epoll_ref ref = {
+		.type = EPOLL_TYPE_DEVICE_STATE,
+		.fd = fd,
+	};
+	struct epoll_event ev = { 0 };
+
+	ev.data.u64 = ref.u64;
+	ev.events = target ? EPOLLIN : EPOLLOUT;
+
+	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
+}
+
+/**
+ * migrate_init() - Set up things necessary for migration
+ * @c:		Execution context
+ */
+void migrate_init(struct ctx *c)
+{
+	c->device_state_fd = -1;
+	c->device_state_result = -1;
+}
+
+/**
+ * migrate_close() - Close migration channel
+ * @c:		Execution context
+ */
+void migrate_close(struct ctx *c)
+{
+	if (c->device_state_fd != -1) {
+		debug("Closing migration channel, fd: %d", c->device_state_fd);
+		epoll_del(c, c->device_state_fd);
+		close(c->device_state_fd);
+		c->device_state_fd = -1;
+		c->device_state_result = -1;
+	}
+}
+
+/**
+ * migrate_request() - Request a migration of device state
+ * @c:		Execution context
+ * @fd:		fd to transfer state
+ * @target:	Are we the target of the migration?
+ */
+void migrate_request(struct ctx *c, int fd, bool target)
+{
+	debug("Migration requested, fd: %d (was %d)",
+	      fd, c->device_state_fd);
+
+	if (c->device_state_fd != -1)
+		migrate_close(c);
+
+	c->device_state_fd = fd;
+	set_migration_watch(c, c->device_state_fd, target);
+
+}
+
+/**
+ * migrate_handler() - Send/receive passt internal state to/from QEMU
+ * @c:		Execution context
+ * @events:	epoll events
+ */
+void migrate_handler(struct ctx *c, uint32_t events)
+{
+	int rc = EIO;
+
+	debug("migrate_handler fd %d events %x", c->device_state_fd, events);
+
+	if (events & EPOLLOUT)
+		rc = migrate_source(c, c->device_state_fd);
+	else if (events & EPOLLIN)
+		rc = migrate_target(c, c->device_state_fd);
+
+	/* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */
+
+	migrate_close(c);
+
+	c->device_state_result = rc;
+}
diff --git a/migrate.h b/migrate.h
index 3093b6e..6be311a 100644
--- a/migrate.h
+++ b/migrate.h
@@ -45,7 +45,9 @@ struct migrate_version {
 	const struct migrate_stage *s;
 };
 
-int migrate_source(struct ctx *c, int fd);
-int migrate_target(struct ctx *c, int fd);
+void migrate_init(struct ctx *c);
+void migrate_close(struct ctx *c);
+void migrate_request(struct ctx *c, int fd, bool target);
+void migrate_handler(struct ctx *c, uint32_t events);
 
 #endif /* MIGRATE_H */
diff --git a/passt.c b/passt.c
index 184d4e5..afd63c2 100644
--- a/passt.c
+++ b/passt.c
@@ -75,7 +75,7 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
 	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
 	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
-	[EPOLL_TYPE_VHOST_MIGRATION]	= "vhost-user migration socket",
+	[EPOLL_TYPE_DEVICE_STATE]	= "migration device state channel",
 };
 static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
 	      "epoll_type_str[] doesn't match enum epoll_type");
@@ -357,8 +357,8 @@ loop:
 		case EPOLL_TYPE_VHOST_KICK:
 			vu_kick_cb(c.vdev, ref, &now);
 			break;
-		case EPOLL_TYPE_VHOST_MIGRATION:
-			vu_migrate(&c, eventmask);
+		case EPOLL_TYPE_DEVICE_STATE:
+			migrate_handler(&c, eventmask);
 			break;
 		default:
 			/* Can't happen */
diff --git a/passt.h b/passt.h
index 0dd4efa..0b8a26c 100644
--- a/passt.h
+++ b/passt.h
@@ -235,6 +235,8 @@ struct ip6_ctx {
  * @low_wmem:		Low probed net.core.wmem_max
  * @low_rmem:		Low probed net.core.rmem_max
  * @vdev:		vhost-user device
+ * @device_state_fd:	Device state migration channel
+ * @device_state_result: Device state migration result
  */
 struct ctx {
 	enum passt_modes mode;
@@ -300,6 +302,10 @@ struct ctx {
 	int low_rmem;
 
 	struct vu_dev *vdev;
+
+	/* Migration */
+	int device_state_fd;
+	int device_state_result;
 };
 
 void proto_update_l2_buf(const unsigned char *eth_d,
diff --git a/vhost_user.c b/vhost_user.c
index 9e38cfd..b107d0f 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -997,36 +997,6 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev,
 	return false;
 }
 
-/**
- * vu_set_migration_watch() - Add the migration file descriptor to epoll
- * @vdev:	vhost-user device
- * @fd:		File descriptor to add
- * @direction:	Direction of the migration (save or load backend state)
- */
-static void vu_set_migration_watch(const struct vu_dev *vdev, int fd,
-				   uint32_t direction)
-{
-	union epoll_ref ref = {
-		.type = EPOLL_TYPE_VHOST_MIGRATION,
-		.fd = fd,
-	};
-	struct epoll_event ev = { 0 };
-
-	ev.data.u64 = ref.u64;
-	switch (direction) {
-	case VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE:
-		ev.events = EPOLLOUT;
-		break;
-	case VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD:
-		ev.events = EPOLLIN;
-		break;
-	default:
-		ASSERT(0);
-	}
-
-	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
-}
-
 /**
  * vu_set_device_state_fd_exec() - Set the device state migration channel
  * @vdev:	vhost-user device
@@ -1051,16 +1021,8 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
 	    direction != VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD)
 		die("Invalide device_state_fd direction: %d", direction);
 
-	if (vdev->device_state_fd != -1) {
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-	}
-
-	vdev->device_state_fd = msg->fds[0];
-	vdev->device_state_result = -1;
-	vu_set_migration_watch(vdev, vdev->device_state_fd, direction);
-
-	debug("Got device_state_fd: %d", vdev->device_state_fd);
+	migrate_request(vdev->context, msg->fds[0],
+			direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);
 
 	/* We don't provide a new fd for the data transfer */
 	vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);
@@ -1078,9 +1040,7 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
 static bool vu_check_device_state_exec(struct vu_dev *vdev,
 				       struct vhost_user_msg *msg)
 {
-	(void)vdev;
-
-	vmsg_set_reply_u64(msg, vdev->device_state_result);
+	vmsg_set_reply_u64(msg, vdev->context->device_state_result);
 
 	return true;
 }
@@ -1106,8 +1066,8 @@ void vu_init(struct ctx *c)
 	}
 	c->vdev->log_table = NULL;
 	c->vdev->log_call_fd = -1;
-	c->vdev->device_state_fd = -1;
-	c->vdev->device_state_result = -1;
+
+	migrate_init(c);
 }
 
 @@ -1157,12 +1117,8 @@ void vu_cleanup(struct vu_dev *vdev)
 
 	vu_close_log(vdev);
 
-	if (vdev->device_state_fd != -1) {
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-		vdev->device_state_fd = -1;
-		vdev->device_state_result = -1;
-	}
+	/* If we lose the VU dev, we also lose our migration channel */
+	migrate_close(vdev->context);
 }
 
 /**
diff --git a/virtio.h b/virtio.h
index 7bef2d2..0a59441 100644
--- a/virtio.h
+++ b/virtio.h
@@ -106,8 +106,6 @@ struct vu_dev_region {
  * @log_call_fd:	Eventfd to report logging update
  * @log_size:		Size of the logging memory region
  * @log_table:		Base of the logging memory region
- * @device_state_fd:	Device state migration channel
- * @device_state_result: Device state migration result
  */
 struct vu_dev {
 	struct ctx *context;
@@ -119,8 +117,6 @@ struct vu_dev {
 	int log_call_fd;
 	uint64_t log_size;
 	uint8_t *log_table;
-	int device_state_fd;
-	int device_state_result;
 };
 
 /**
diff --git a/vu_common.c b/vu_common.c
index 
3d41824..48826b1 100644 --- a/vu_common.c +++ b/vu_common.c @@ -305,30 +305,3 @@ err: return -1; } - -/** - * vu_migrate() - Send/receive passt internal state to/from QEMU - * @c: Execution context - * @events: epoll events - */ -void vu_migrate(struct ctx *c, uint32_t events) -{ - struct vu_dev *vdev = c->vdev; - int rc = EIO; - - debug("vu_migrate fd %d events %x", vdev->device_state_fd, events); - - if (events & EPOLLOUT) - rc = migrate_source(c, vdev->device_state_fd); - else if (events & EPOLLIN) - rc = migrate_target(c, vdev->device_state_fd); - - /* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */ - - vdev->device_state_result = rc; - - epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL); - debug("Closing migration channel"); - close(vdev->device_state_fd); - vdev->device_state_fd = -1; -} diff --git a/vu_common.h b/vu_common.h index 69c4006..f538f23 100644 --- a/vu_common.h +++ b/vu_common.h @@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq, void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref, const struct timespec *now); int vu_send_single(const struct ctx *c, const void *buf, size_t size); -void vu_migrate(struct ctx *c, uint32_t events); + #endif /* VU_COMMON_H */ -- 2.43.0
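As a side note for reviewers: the stage mechanism the commit message describes (a header with magic number and version, plus versioned stage tables carrying per-side function pointers) can be sketched roughly as below. All names, values, and signatures here are illustrative only, not the actual migrate.h definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header layout: magic plus version, as the commit
 * message describes; the real fields in migrate.h may differ */
struct migrate_header {
	uint32_t magic;		/* identifies a passt state stream */
	uint32_t version;	/* selects the stage table to run */
};

struct migrate_stage {
	const char *name;
	int (*source)(void *c, int fd);	/* NULL: nothing to do on source */
	int (*target)(void *c, int fd);	/* NULL: nothing to do on target */
};

static int calls;

static int stage_dummy(void *c, int fd)
{
	(void)c; (void)fd;
	calls++;
	return 0;
}

/* Walk a NULL-terminated stage table, calling this side's handlers in
 * order; the first failure aborts the whole migration */
static int run_stages(const struct migrate_stage *s, void *c, int fd,
		      bool target)
{
	for (; s->name; s++) {
		int (*fn)(void *, int) = target ? s->target : s->source;

		if (fn && fn(c, fd))
			return -1;
	}
	return 0;
}

static const struct migrate_stage stages_v1[] = {
	{ "flows",	stage_dummy,	stage_dummy },
	{ "TCP state",	stage_dummy,	NULL },
	{ NULL,		NULL,		NULL },
};
```

The point of the table layout is that a stage can be meaningful on one side only, and that adding a protocol change means adding a new versioned table rather than touching the transfer loop.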
From: David Gibson <david(a)gibson.dropbear.id.au> Currently, once a migration device state fd is assigned, we wait for EPOLLIN or EPOLLOUT events on it to actually perform the migration. Change it so that once a migration is requested, we complete it synchronously at the end of the current epoll cycle. This has several advantages: 1. It makes it clear that everything about the migration must be dealt with at once, not split between multiple epoll events on the channel 2. It ensures the migration always takes place between epoll cycles, rather than, for example, between handling TCP events and their deferred handling in post_handler(). 3. It reduces code setting up the epoll watch on the fd. Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au> --- epoll_type.h | 2 -- migrate.c | 43 +++++++++++-------------------------------- migrate.h | 2 +- passt.c | 6 ++---- passt.h | 2 ++ 5 files changed, 16 insertions(+), 39 deletions(-) diff --git a/epoll_type.h b/epoll_type.h index b1d3a1d..f3ef415 100644 --- a/epoll_type.h +++ b/epoll_type.h @@ -40,8 +40,6 @@ enum epoll_type { EPOLL_TYPE_VHOST_CMD, /* vhost-user kick event socket */ EPOLL_TYPE_VHOST_KICK, - /* migration device state channel */ - EPOLL_TYPE_DEVICE_STATE, EPOLL_NUM_TYPES, }; diff --git a/migrate.c b/migrate.c index 86a76fc..9948be0 100644 --- a/migrate.c +++ b/migrate.c @@ -210,26 +210,6 @@ static int migrate_target(struct ctx *c, int fd) return 0; } -/** - * set_migration_watch() - Add the migration file descriptor to epoll - * @c: Execution context - * @fd: File descriptor to add - * @target: Are we the target of the migration? - */ -static void set_migration_watch(const struct ctx *c, int fd, bool target) -{ - union epoll_ref ref = { - .type = EPOLL_TYPE_DEVICE_STATE, - .fd = fd, - }; - struct epoll_event ev = { 0 }; - - ev.data.u64 = ref.u64; - ev.events = target ?
EPOLLIN : EPOLLOUT; - - epoll_ctl(c->epollfd, EPOLL_CTL_ADD, ref.fd, &ev); -} - /** * migrate_init() - Set up things necessary for migration * @c: Execution context @@ -248,7 +228,6 @@ void migrate_close(struct ctx *c) { if (c->device_state_fd != -1) { debug("Closing migration channel, fd: %d", c->device_state_fd); - epoll_del(c, c->device_state_fd); close(c->device_state_fd); c->device_state_fd = -1; c->device_state_result = -1; @@ -270,27 +249,27 @@ void migrate_request(struct ctx *c, int fd, bool target) migrate_close(c); c->device_state_fd = fd; - set_migration_watch(c, c->device_state_fd, target); - + c->migrate_target = target; } /** * migrate_handler() - Send/receive passt internal state to/from QEMU * @c: Execution context - * @events: epoll events */ -void migrate_handler(struct ctx *c, uint32_t events) +void migrate_handler(struct ctx *c) { - int rc = EIO; + int rc; - debug("migrate_handler fd %d events %x", c->device_state_fd, events); + if (c->device_state_fd < 0) + return; - if (events & EPOLLOUT) - rc = migrate_source(c, c->device_state_fd); - else if (events & EPOLLIN) - rc = migrate_target(c, c->device_state_fd); + debug("migrate_handler fd %d target %d", + c->device_state_fd, c->migrate_target); - /* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? 
Migration failed */ + if (c->migrate_target) + rc = migrate_target(c, c->device_state_fd); + else + rc = migrate_source(c, c->device_state_fd); migrate_close(c); diff --git a/migrate.h b/migrate.h index 6be311a..80d78b8 100644 --- a/migrate.h +++ b/migrate.h @@ -48,6 +48,6 @@ struct migrate_version { void migrate_init(struct ctx *c); void migrate_close(struct ctx *c); void migrate_request(struct ctx *c, int fd, bool target); -void migrate_handler(struct ctx *c, uint32_t events); +void migrate_handler(struct ctx *c); #endif /* MIGRATE_H */ diff --git a/passt.c b/passt.c index afd63c2..c5a99db 100644 --- a/passt.c +++ b/passt.c @@ -75,7 +75,6 @@ char *epoll_type_str[] = { [EPOLL_TYPE_TAP_LISTEN] = "listening qemu socket", [EPOLL_TYPE_VHOST_CMD] = "vhost-user command socket", [EPOLL_TYPE_VHOST_KICK] = "vhost-user kick socket", - [EPOLL_TYPE_DEVICE_STATE] = "migration device state channel", }; static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES, "epoll_type_str[] doesn't match enum epoll_type"); @@ -357,9 +356,6 @@ loop: case EPOLL_TYPE_VHOST_KICK: vu_kick_cb(c.vdev, ref, &now); break; - case EPOLL_TYPE_DEVICE_STATE: - migrate_handler(&c, eventmask); - break; default: /* Can't happen */ ASSERT(0); @@ -368,5 +364,7 @@ loop: post_handler(&c, &now); + migrate_handler(&c); + goto loop; } diff --git a/passt.h b/passt.h index 0b8a26c..2255f18 100644 --- a/passt.h +++ b/passt.h @@ -237,6 +237,7 @@ struct ip6_ctx { * @vdev: vhost-user device * @device_state_fd: Device state migration channel * @device_state_result: Device state migration result + * @migrate_target: Is this the target for next migration? */ struct ctx { enum passt_modes mode; @@ -306,6 +307,7 @@ struct ctx { /* Migration */ int device_state_fd; int device_state_result; + bool migrate_target; }; void proto_update_l2_buf(const unsigned char *eth_d, -- 2.43.0
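The ordering guarantee argued for in point 2 of the commit message can be modelled with a toy main loop; the names below are illustrative, not the actual passt.c code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the reworked main loop: a migration requested while event
 * callbacks run is serviced only after post_handler(), at the end of the
 * epoll cycle, never between two callbacks of the same cycle */

static int pending_fd = -1;
static int trace[16];
static int n_trace;

static void handle_event(int id)
{
	trace[n_trace++] = id;
}

static void post_handler(void)
{
	trace[n_trace++] = 100;	/* deferred per-cycle work */
}

static void migrate_handler(void)
{
	if (pending_fd < 0)
		return;		/* common case: nothing pending */

	trace[n_trace++] = 200;	/* migration runs here, exactly once */
	pending_fd = -1;
}

static void one_cycle(bool request)
{
	handle_event(1);
	if (request)
		pending_fd = 42;	/* request arrives mid-cycle */
	handle_event(2);
	post_handler();
	migrate_handler();
}
```

That is, even if the request lands between two event callbacks, nothing happens until the deferred handling of the cycle is complete, which is exactly what the TCP code needs.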
In vhost-user mode, by default, create a second UNIX domain socket accepting connections from passt-repair, with the usual listener socket. When we need to set or clear TCP_REPAIR on sockets, we'll send them via SCM_RIGHTS to passt-repair, who sets the socket option values we ask for. To that end, introduce batched functions to request TCP_REPAIR settings on sockets, so that we don't have to send a single message for each socket, on migration. When needed, repair_flush() will send the message and check for the reply. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- Makefile | 12 ++-- conf.c | 44 ++++++++++-- epoll_type.h | 4 ++ passt.1 | 11 +++ passt.c | 9 +++ passt.h | 7 ++ repair.c | 193 +++++++++++++++++++++++++++++++++++++++++++++++++++ repair.h | 16 +++++ tap.c | 65 +---------------- util.c | 62 +++++++++++++++++ util.h | 1 + 11 files changed, 352 insertions(+), 72 deletions(-) create mode 100644 repair.c create mode 100644 repair.h diff --git a/Makefile b/Makefile index be89b07..d4e1096 100644 --- a/Makefile +++ b/Makefile @@ -38,9 +38,9 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS) PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \ - ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \ - tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ - vhost_user.c virtio.c vu_common.c + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \ + repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \ + udp_vu.c util.c vhost_user.c virtio.c vu_common.c QRAP_SRCS = qrap.c PASST_REPAIR_SRCS = passt-repair.c SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS) @@ -50,9 +50,9 @@ MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ lineread.h log.h migrate.h ndp.h 
netlink.h packet.h passt.h pasta.h \ - pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \ - tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \ - vhost_user.h virtio.h vu_common.h + pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \ + tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \ + udp_vu.h util.h vhost_user.h virtio.h vu_common.h HEADERS = $(PASST_HEADERS) seccomp.h C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);} diff --git a/conf.c b/conf.c index df2b016..9ef79a4 100644 --- a/conf.c +++ b/conf.c @@ -816,6 +816,9 @@ static void usage(const char *name, FILE *f, int status) " UNIX domain socket is provided by -s option\n" " --print-capabilities print back-end capabilities in JSON format,\n" " only meaningful for vhost-user mode\n"); + FPRINTF(f, + " --repair-path PATH path for passt-repair(1)\n" + " default: append '.repair' to UNIX domain path\n"); } FPRINTF(f, @@ -1240,8 +1243,25 @@ static void conf_nat(const char *arg, struct in_addr *addr4, */ static void conf_open_files(struct ctx *c) { - if (c->mode != MODE_PASTA && c->fd_tap == -1) - c->fd_tap_listen = tap_sock_unix_open(c->sock_path); + if (c->mode != MODE_PASTA && c->fd_tap == -1) { + c->fd_tap_listen = sock_unix(c->sock_path); + + if (c->mode == MODE_VU && strcmp(c->repair_path, "none")) { + if (!*c->repair_path && + snprintf_check(c->repair_path, + sizeof(c->repair_path), "%s.repair", + c->sock_path)) { + warn("passt-repair path %s not usable", + c->repair_path); + c->fd_repair_listen = -1; + } else { + c->fd_repair_listen = sock_unix(c->repair_path); + } + } else { + c->fd_repair_listen = -1; + } + c->fd_repair = -1; + } if (*c->pidfile) { c->pidfile_fd = output_file_open(c->pidfile, O_WRONLY); @@ -1354,9 +1374,12 @@ void conf(struct ctx *c, int argc, char **argv) {"host-lo-to-ns-lo", no_argument, NULL, 23 }, {"dns-host", required_argument, NULL, 24 }, {"vhost-user", no_argument, NULL, 25 }, + /* 
vhost-user backend program convention */ {"print-capabilities", no_argument, NULL, 26 }, {"socket-path", required_argument, NULL, 's' }, + + {"repair-path", required_argument, NULL, 27 }, { 0 }, }; const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt"; @@ -1748,6 +1771,9 @@ void conf(struct ctx *c, int argc, char **argv) case 'D': /* Handle these later, once addresses are configured */ break; + case 27: + /* Handle this once we checked --vhost-user */ + break; case 'h': usage(argv[0], stdout, EXIT_SUCCESS); break; @@ -1824,8 +1850,8 @@ void conf(struct ctx *c, int argc, char **argv) if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.guest_gw)) c->no_dhcp = 1; - /* Inbound port options & DNS can be parsed now (after IPv4/IPv6 - * settings) + /* Inbound port options, DNS, and --repair-path can be parsed now, after + * IPv4/IPv6 settings and --vhost-user. */ fwd_probe_ephemeral(); udp_portmap_clear(); @@ -1871,6 +1897,16 @@ void conf(struct ctx *c, int argc, char **argv) } die("Cannot use DNS address %s", optarg); + } else if (name == 27) { + if (c->mode != MODE_VU && strcmp(optarg, "none")) + die("--repair-path is for vhost-user mode only"); + + if (snprintf_check(c->repair_path, + sizeof(c->repair_path), "%s", + optarg)) + die("Invalid passt-repair path: %s", optarg); + + break; } } while (name != -1); diff --git a/epoll_type.h b/epoll_type.h index f3ef415..7f2a121 100644 --- a/epoll_type.h +++ b/epoll_type.h @@ -40,6 +40,10 @@ enum epoll_type { EPOLL_TYPE_VHOST_CMD, /* vhost-user kick event socket */ EPOLL_TYPE_VHOST_KICK, + /* TCP_REPAIR helper listening socket */ + EPOLL_TYPE_REPAIR_LISTEN, + /* TCP_REPAIR helper socket */ + EPOLL_TYPE_REPAIR, EPOLL_NUM_TYPES, }; diff --git a/passt.1 b/passt.1 index d9cd33e..63a3a01 100644 --- a/passt.1 +++ b/passt.1 @@ -418,6 +418,17 @@ Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR. .BR \-\-print-capabilities Print back-end capabilities in JSON format, only meaningful for vhost-user mode. 
+.TP +.BR \-\-repair-path " " \fIpath +Path for UNIX domain socket used by the \fBpasst-repair\fR(1) helper to connect +to \fBpasst\fR in order to set or clear the TCP_REPAIR option on sockets, during +migration. \fB--repair-path none\fR disables this interface (if you need to +specify a socket path called "none" you can prefix the path by \fI./\fR). + +Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path +chosen for the hypervisor UNIX domain socket. No socket is created if not in +\-\-vhost-user mode. + .TP .BR \-F ", " \-\-fd " " \fIFD Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened diff --git a/passt.c b/passt.c index c5a99db..1938290 100644 --- a/passt.c +++ b/passt.c @@ -51,6 +51,7 @@ #include "tcp_splice.h" #include "ndp.h" #include "vu_common.h" +#include "repair.h" #define EPOLL_EVENTS 8 @@ -75,6 +76,8 @@ char *epoll_type_str[] = { [EPOLL_TYPE_TAP_LISTEN] = "listening qemu socket", [EPOLL_TYPE_VHOST_CMD] = "vhost-user command socket", [EPOLL_TYPE_VHOST_KICK] = "vhost-user kick socket", + [EPOLL_TYPE_REPAIR_LISTEN] = "TCP_REPAIR helper listening socket", + [EPOLL_TYPE_REPAIR] = "TCP_REPAIR helper socket", }; static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES, "epoll_type_str[] doesn't match enum epoll_type"); @@ -356,6 +359,12 @@ loop: case EPOLL_TYPE_VHOST_KICK: vu_kick_cb(c.vdev, ref, &now); break; + case EPOLL_TYPE_REPAIR_LISTEN: + repair_listen_handler(&c, eventmask); + break; + case EPOLL_TYPE_REPAIR: + repair_handler(&c, eventmask); + break; default: /* Can't happen */ ASSERT(0); diff --git a/passt.h b/passt.h index 2255f18..4189a4a 100644 --- a/passt.h +++ b/passt.h @@ -20,6 +20,7 @@ union epoll_ref; #include "siphash.h" #include "ip.h" #include "inany.h" +#include "migrate.h" #include "flow.h" #include "icmp.h" #include "fwd.h" @@ -193,6 +194,7 @@ struct ip6_ctx { * @foreground: Run in foreground, don't log to stderr by default * @nofile: Maximum number of open files (ulimit -n) * 
@sock_path: Path for UNIX domain socket + * @repair_path: TCP_REPAIR helper path, can be "none", empty for default * @pcap: Path for packet capture file * @pidfile: Path to PID file, empty string if not configured * @pidfile_fd: File descriptor for PID file, -1 if none @@ -203,6 +205,8 @@ struct ip6_ctx { * @epollfd: File descriptor for epoll instance * @fd_tap_listen: File descriptor for listening AF_UNIX socket, if any * @fd_tap: AF_UNIX socket, tuntap device, or pre-opened socket + * @fd_repair_listen: File descriptor for listening TCP_REPAIR socket, if any + * @fd_repair: Connected AF_UNIX socket for TCP_REPAIR helper * @our_tap_mac: Pasta/passt's MAC on the tap link * @guest_mac: MAC address of guest or namespace, seen or configured * @hash_secret: 128-bit secret for siphash functions @@ -247,6 +251,7 @@ struct ctx { int foreground; int nofile; char sock_path[UNIX_PATH_MAX]; + char repair_path[UNIX_PATH_MAX]; char pcap[PATH_MAX]; char pidfile[PATH_MAX]; @@ -263,6 +268,8 @@ struct ctx { int epollfd; int fd_tap_listen; int fd_tap; + int fd_repair_listen; + int fd_repair; unsigned char our_tap_mac[ETH_ALEN]; unsigned char guest_mac[ETH_ALEN]; uint64_t hash_secret[2]; diff --git a/repair.c b/repair.c new file mode 100644 index 0000000..6151927 --- /dev/null +++ b/repair.c @@ -0,0 +1,193 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +/* PASST - Plug A Simple Socket Transport + * for qemu/UNIX domain socket mode + * + * PASTA - Pack A Subtle Tap Abstraction + * for network namespace/tap device mode + * + * repair.c - Interface (server) for passt-repair, set/clear TCP_REPAIR + * + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio <sbrivio(a)redhat.com> + */ + +#include <errno.h> +#include <sys/uio.h> + +#include "util.h" +#include "ip.h" +#include "passt.h" +#include "inany.h" +#include "flow.h" +#include "flow_table.h" + +#include "repair.h" + +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */ + +static int 
repair_fds[SCM_MAX_FD]; +static int repair_cmd; +static int repair_nfds; + +/** + * repair_sock_init() - Start listening for connections on helper socket + * @c: Execution context + */ +void repair_sock_init(const struct ctx *c) +{ + union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN }; + struct epoll_event ev = { 0 }; + + listen(c->fd_repair_listen, 0); + + ref.fd = c->fd_repair_listen; + ev.events = EPOLLIN | EPOLLHUP | EPOLLET; + ev.data.u64 = ref.u64; + epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev); +} + +/** + * repair_listen_handler() - Handle events on TCP_REPAIR helper listening socket + * @c: Execution context + * @events: epoll events + */ +void repair_listen_handler(struct ctx *c, uint32_t events) +{ + union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR }; + struct epoll_event ev = { 0 }; + struct ucred ucred; + socklen_t len; + + if (events != EPOLLIN) { + debug("Spurious event 0x%04x on TCP_REPAIR helper socket", + events); + return; + } + + len = sizeof(ucred); + + /* Another client is already connected: accept and close right away. 
*/ + if (c->fd_repair != -1) { + int discard = accept4(c->fd_repair_listen, NULL, NULL, + SOCK_NONBLOCK); + + if (discard == -1) + return; + + if (!getsockopt(discard, SOL_SOCKET, SO_PEERCRED, &ucred, &len)) + info("Discarding TCP_REPAIR helper, PID %i", ucred.pid); + + close(discard); + return; + } + + c->fd_repair = accept4(c->fd_repair_listen, NULL, NULL, 0); + + if (!getsockopt(c->fd_repair, SOL_SOCKET, SO_PEERCRED, &ucred, &len)) + info("Accepted TCP_REPAIR helper, PID %i", ucred.pid); + + ref.fd = c->fd_repair; + ev.events = EPOLLHUP | EPOLLET; + ev.data.u64 = ref.u64; + epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev); +} + +/** + * repair_close() - Close connection to TCP_REPAIR helper + * @c: Execution context + */ +void repair_close(struct ctx *c) +{ + debug("Closing TCP_REPAIR helper socket"); + + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_repair, NULL); + close(c->fd_repair); + c->fd_repair = -1; +} + +/** + * repair_handler() - Handle EPOLLHUP and EPOLLERR on TCP_REPAIR helper socket + * @c: Execution context + * @events: epoll events + */ +void repair_handler(struct ctx *c, uint32_t events) +{ + (void)events; + + repair_close(c); +} + +/** + * repair_flush() - Flush current set of sockets to helper, with current command + * @c: Execution context + * + * Return: 0 on success, negative error code on failure + */ +int repair_flush(struct ctx *c) +{ + struct iovec iov = { &((int8_t){ repair_cmd }), sizeof(int8_t) }; + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)] + __attribute__ ((aligned(__alignof__(struct cmsghdr)))); + struct cmsghdr *cmsg; + struct msghdr msg; + + if (!repair_nfds) + return 0; + + msg = (struct msghdr){ NULL, 0, &iov, 1, + buf, CMSG_SPACE(sizeof(int) * repair_nfds), 0 }; + cmsg = CMSG_FIRSTHDR(&msg); + + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SCM_RIGHTS; + cmsg->cmsg_len = CMSG_LEN(sizeof(int) * repair_nfds); + memcpy(CMSG_DATA(cmsg), repair_fds, sizeof(int) * repair_nfds); + + repair_nfds = 0; + + if 
(sendmsg(c->fd_repair, &msg, 0) < 0) { + int ret = -errno; + err_perror("Failed to send sockets to TCP_REPAIR helper"); + repair_close(c); + return ret; + } + + if (recv(c->fd_repair, &((int8_t){ 0 }), 1, 0) < 0) { + int ret = -errno; + err_perror("Failed to receive reply from TCP_REPAIR helper"); + repair_close(c); + return ret; + } + + return 0; +} + +/** + * repair_set() - Add socket to TCP_REPAIR set with given command + * @c: Execution context + * @s: Socket to add + * @cmd: TCP_REPAIR_ON, TCP_REPAIR_OFF, or TCP_REPAIR_OFF_NO_WP + * + * Return: 0 on success, negative error code on failure + */ +/* cppcheck-suppress unusedFunction */ +int repair_set(struct ctx *c, int s, int cmd) +{ + int rc; + + if (repair_nfds && repair_cmd != cmd) { + if ((rc = repair_flush(c))) + return rc; + } + + repair_cmd = cmd; + repair_fds[repair_nfds++] = s; + + if (repair_nfds >= SCM_MAX_FD) { + if ((rc = repair_flush(c))) + return rc; + } + + return 0; +} diff --git a/repair.h b/repair.h new file mode 100644 index 0000000..de279d6 --- /dev/null +++ b/repair.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio <sbrivio(a)redhat.com> + */ + +#ifndef REPAIR_H +#define REPAIR_H + +void repair_sock_init(const struct ctx *c); +void repair_listen_handler(struct ctx *c, uint32_t events); +void repair_handler(struct ctx *c, uint32_t events); +void repair_close(struct ctx *c); +int repair_flush(struct ctx *c); +int repair_set(struct ctx *c, int s, int cmd); + +#endif /* REPAIR_H */ diff --git a/tap.c b/tap.c index 772648f..3659aab 100644 --- a/tap.c +++ b/tap.c @@ -56,6 +56,7 @@ #include "netlink.h" #include "pasta.h" #include "packet.h" +#include "repair.h" #include "tap.h" #include "log.h" #include "vhost_user.h" @@ -1151,68 +1152,6 @@ void tap_handler_pasta(struct ctx *c, uint32_t events, tap_pasta_input(c, now); } -/** - * tap_sock_unix_open() - Create and bind AF_UNIX socket - * @sock_path: Socket path.
If empty, set on return (UNIX_SOCK_PATH as prefix) - * - * Return: socket descriptor on success, won't return on failure - */ -int tap_sock_unix_open(char *sock_path) -{ - int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0); - struct sockaddr_un addr = { - .sun_family = AF_UNIX, - }; - int i; - - if (fd < 0) - die_perror("Failed to open UNIX domain socket"); - - for (i = 1; i < UNIX_SOCK_MAX; i++) { - char *path = addr.sun_path; - int ex, ret; - - if (*sock_path) - memcpy(path, sock_path, UNIX_PATH_MAX); - else if (snprintf_check(path, UNIX_PATH_MAX - 1, - UNIX_SOCK_PATH, i)) - die_perror("Can't build UNIX domain socket path"); - - ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC, - 0); - if (ex < 0) - die_perror("Failed to check for UNIX domain conflicts"); - - ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr)); - if (!ret || (errno != ENOENT && errno != ECONNREFUSED && - errno != EACCES)) { - if (*sock_path) - die("Socket path %s already in use", path); - - close(ex); - continue; - } - close(ex); - - unlink(path); - ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr)); - if (*sock_path && ret) - die_perror("Failed to bind UNIX domain socket"); - - if (!ret) - break; - } - - if (i == UNIX_SOCK_MAX) - die_perror("Failed to bind UNIX domain socket"); - - info("UNIX domain socket bound at %s", addr.sun_path); - if (!*sock_path) - memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX); - - return fd; -} - /** * tap_backend_show_hints() - Give help information to start QEMU * @c: Execution context @@ -1423,6 +1362,8 @@ void tap_backend_init(struct ctx *c) tap_sock_tun_init(c); break; case MODE_VU: + repair_sock_init(c); + /* fall through */ case MODE_PASST: tap_sock_unix_init(c); diff --git a/util.c b/util.c index 800c6b5..77c8612 100644 --- a/util.c +++ b/util.c @@ -178,6 +178,68 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type, return fd; } +/** + * sock_unix() - Create and bind AF_UNIX socket + * @sock_path: Socket path. 
If empty, set on return (UNIX_SOCK_PATH as prefix) + * + * Return: socket descriptor on success, won't return on failure + */ +int sock_unix(char *sock_path) +{ + int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0); + struct sockaddr_un addr = { + .sun_family = AF_UNIX, + }; + int i; + + if (fd < 0) + die_perror("Failed to open UNIX domain socket"); + + for (i = 1; i < UNIX_SOCK_MAX; i++) { + char *path = addr.sun_path; + int ex, ret; + + if (*sock_path) + memcpy(path, sock_path, UNIX_PATH_MAX); + else if (snprintf_check(path, UNIX_PATH_MAX - 1, + UNIX_SOCK_PATH, i)) + die_perror("Can't build UNIX domain socket path"); + + ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC, + 0); + if (ex < 0) + die_perror("Failed to check for UNIX domain conflicts"); + + ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr)); + if (!ret || (errno != ENOENT && errno != ECONNREFUSED && + errno != EACCES)) { + if (*sock_path) + die("Socket path %s already in use", path); + + close(ex); + continue; + } + close(ex); + + unlink(path); + ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr)); + if (*sock_path && ret) + die_perror("Failed to bind UNIX domain socket"); + + if (!ret) + break; + } + + if (i == UNIX_SOCK_MAX) + die_perror("Failed to bind UNIX domain socket"); + + info("UNIX domain socket bound at %s", addr.sun_path); + if (!*sock_path) + memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX); + + return fd; +} + /** * sock_probe_mem() - Check if setting high SO_SNDBUF and SO_RCVBUF is allowed * @c: Execution context diff --git a/util.h b/util.h index 1aed629..3f48a2b 100644 --- a/util.h +++ b/util.h @@ -211,6 +211,7 @@ struct ctx; int sock_l4_sa(const struct ctx *c, enum epoll_type type, const void *sa, socklen_t sl, const char *ifname, bool v6only, uint32_t data); +int sock_unix(char *sock_path); void sock_probe_mem(struct ctx *c); long timespec_diff_ms(const struct timespec *a, const struct timespec *b); int64_t timespec_diff_us(const struct 
timespec *a, const struct timespec *b); -- 2.43.0
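For reviewers, the batching contract of repair_set()/repair_flush() above, accumulate while the requested command stays the same, flush on a command change or a full batch, can be modelled standalone. The mock_* names are hypothetical and no real sockets or SCM_RIGHTS messages are involved:

```c
#include <assert.h>

/* Standalone model of the batching rule in repair_set()/repair_flush() */

#define SCM_MAX_FD 253	/* from Linux (include/net/scm.h), not in UAPI */

static int batch_fds[SCM_MAX_FD];
static int batch_cmd;
static int batch_nfds;
static int flush_count;	/* how many messages the helper would receive */

static int mock_flush(void)
{
	if (!batch_nfds)
		return 0;

	flush_count++;		/* one sendmsg() carrying all batched fds */
	batch_nfds = 0;
	return 0;
}

static int mock_set(int s, int cmd)
{
	int rc;

	/* Command changed: flush what we have before switching */
	if (batch_nfds && batch_cmd != cmd && (rc = mock_flush()))
		return rc;

	batch_cmd = cmd;
	batch_fds[batch_nfds++] = s;

	/* Batch full: flush right away */
	if (batch_nfds >= SCM_MAX_FD && (rc = mock_flush()))
		return rc;

	return 0;
}
```

On migration this keeps the number of round-trips to passt-repair proportional to the number of command changes, not to the number of sockets.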
On migration, the source process asks passt-repair to set TCP sockets in repair mode, dumps the information we need to migrate connections, and closes them. At this point, we can't pass them back to passt-repair using SCM_RIGHTS, because they are closed, from that perspective, and sendmsg() will give us EBADF. But if we don't clear repair mode, the port they are bound to will not be available for binding in the target. Terminate once we're done with the migration and we've reported the state. This is equivalent to clearing repair mode on the sockets we just closed. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- vhost_user.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/vhost_user.c b/vhost_user.c index b107d0f..70773d6 100644 --- a/vhost_user.c +++ b/vhost_user.c @@ -997,6 +997,8 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev, return false; } +static bool quit_on_device_state = false; + /** * vu_set_device_state_fd_exec() - Set the device state migration channel * @vdev: vhost-user device @@ -1024,6 +1026,9 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev, migrate_request(vdev->context, msg->fds[0], direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD); + if (direction == VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE) + quit_on_device_state = true; + /* We don't provide a new fd for the data transfer */ vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK); @@ -1201,4 +1206,10 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events) if (reply_requested) vu_send_reply(fd, &msg); + + if (quit_on_device_state && + msg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE) { + info("Migration complete, exiting"); + exit(EXIT_SUCCESS); + } } -- 2.43.0
On Wed, Feb 05, 2025 at 01:39:03AM +0100, Stefano Brivio wrote:On migration, the source process asks passt-repair to set TCP sockets in repair mode, dumps the information we need to migrate connections, and closes them. At this point, we can't pass them back to passt-repair using SCM_RIGHTS, because they are closed, from that perspective, and sendmsg() will give us EBADF. But if we don't clear repair mode, the port they are bound to will not be available for binding in the target. Terminate once we're done with the migration and we've reported the state. This is equivalent to clearing repair mode on the sockets we just closed.As we've discussed, quitting still makes sense, but the description above is not really accurate. Perhaps, === Once we've passed the migration's "point of no return", there's no way to resume the guest on the source side, because we no longer own the connections. There's not really anything we can do except exit. === Except.. thinking about it, I'm not sure that's technically true. After migration, the source qemu enters a kind of limbo state. I suppose for the case of to-disk migration (savevm) the guest can actually be resumed. Which for us is not really compatible with completing at least a local migration properly. Not really sure what to do about that. I think it's also technically possible to use monitor commands to boot up essentially an entirely new guest instance in the original qemu, in which case for us it would make sense to basically reset ourselves (flush the flow table). Hrm.. we really need to know the sequence of events in a bit more detail to get this right (not that this stops improving the guts of the logic in the meantime). 
I'm asking around to see if I can find who did the migration stuff for
virtiofsd, so we can compare notes.

> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
> ---
>  vhost_user.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/vhost_user.c b/vhost_user.c
> index b107d0f..70773d6 100644
> --- a/vhost_user.c
> +++ b/vhost_user.c
> @@ -997,6 +997,8 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev,
>  	return false;
>  }
>  
> +static bool quit_on_device_state = false;
> +
>  /**
>   * vu_set_device_state_fd_exec() - Set the device state migration channel
>   * @vdev:	vhost-user device
> @@ -1024,6 +1026,9 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
>  	migrate_request(vdev->context, msg->fds[0],
>  			direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);
>  
> +	if (direction == VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE)
> +		quit_on_device_state = true;
> +
>  	/* We don't provide a new fd for the data transfer */
>  	vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);
>  
> @@ -1201,4 +1206,10 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
>  
>  	if (reply_requested)
>  		vu_send_reply(fd, &msg);
> +
> +	if (quit_on_device_state &&
> +	    msg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE) {
> +		info("Migration complete, exiting");
> +		exit(EXIT_SUCCESS);
> +	}
>  }

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
To: German whom I feel okay to To: now that
https://github.com/kubevirt/kubevirt/pull/13756 is merged, and Hanna
who knows a thing or two about vhost-user based migration.

This is Stefano from your neighbouring mailing list, passt-dev. David
is wondering:

On Wed, 5 Feb 2025 13:09:42 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Wed, Feb 05, 2025 at 01:39:03AM +0100, Stefano Brivio wrote:

...what we should be doing in the source passt at different stages of
moving our TCP connections (or failing to move them) over to the
target, which might be inspired by what you're doing with your...
filesystem things in virtiofsd.

We're taking for granted that as long as we have a chance to detect
failure (e.g. we can't dump sequence numbers from a TCP socket in the
source), we should use that to abort the whole thing.

Once we're past that point, we have several options. And actually,
that's regardless of failure: because we also have the question of
what to do if we see that nothing went wrong.

We can exit in the source, for example (this is what this patch
implements): wait for VHOST_USER_CHECK_DEVICE_STATE, report that, and
quit. Or we can just clear up all our connections and resume (start
from a blank state). Or do that, only if we didn't smell failure.

Would you have some pointers, general guidelines, ideas? I know that
the topic is a bit broad, but I'm hopeful that you have a lot of clear
answers for us. :)

Thanks.

> > On migration, the source process asks passt-helper to set TCP
> > sockets in repair mode, dumps the information we need to migrate
> > connections, and closes them. At this point, we can't pass them
> > back to passt-helper using SCM_RIGHTS, because they are closed,
> > from that perspective, and sendmsg() will give us EBADF. But if we
> > don't clear repair mode, the port they are bound to will not be
> > available for binding in the target.
> >
> > Terminate once we're done with the migration and we reported the
> > state.
> > This is equivalent to clearing repair mode on the sockets we just
> > closed.
> 
> As we've discussed, quitting still makes sense, but the description
> above is not really accurate. Perhaps,
> 
> ===
> Once we've passed the migration's "point of no return", there's no
> way to resume the guest on the source side, because we no longer own
> the connections.  There's not really anything we can do except exit.
> ===
> 
> Except.. thinking about it, I'm not sure that's technically true.
> After migration, the source qemu enters a kind of limbo state. I
> suppose for the case of to-disk migration (savevm) the guest can
> actually be resumed. Which for us is not really compatible with
> completing at least a local migration properly. Not really sure what
> to do about that.
> 
> I think it's also technically possible to use monitor commands to
> boot up essentially an entirely new guest instance in the original
> qemu, in which case for us it would make sense to basically reset
> ourselves (flush the flow table).
> 
> Hrm.. we really need to know the sequence of events in a bit more
> detail to get this right (not that this stops improving the guts of
> the logic in the meantime).
> I'm asking around to see if I can find who did the migration stuff
> for virtiofsd, so we can compare notes.
> 
> > Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
> > ---
> >  vhost_user.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/vhost_user.c b/vhost_user.c
> > index b107d0f..70773d6 100644
> > --- a/vhost_user.c
> > +++ b/vhost_user.c
> > @@ -997,6 +997,8 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev,
> >  	return false;
> >  }
> >  
> > +static bool quit_on_device_state = false;
> > +
> >  /**
> >   * vu_set_device_state_fd_exec() - Set the device state migration channel
> >   * @vdev:	vhost-user device
> > @@ -1024,6 +1026,9 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
> >  	migrate_request(vdev->context, msg->fds[0],
> >  			direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);
> >  
> > +	if (direction == VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE)
> > +		quit_on_device_state = true;
> > +
> >  	/* We don't provide a new fd for the data transfer */
> >  	vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);
> >  
> > @@ -1201,4 +1206,10 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
> >  
> >  	if (reply_requested)
> >  		vu_send_reply(fd, &msg);
> > +
> > +	if (quit_on_device_state &&
> > +	    msg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE) {
> > +		info("Migration complete, exiting");
> > +		exit(EXIT_SUCCESS);
> > +	}
> >  }

-- 
Stefano
This implements flow preparation on the source, transfer of data with a format roughly inspired by struct tcp_tap_conn, and flow insertion on the target, with all the appropriate window options, window scaling, MSS, etc. The target side is rather convoluted because we first need to create sockets and switch them to repair mode, before we can apply options that are *not* stored in the flow table. However, we don't want to request repair mode for sockets one by one. So we need to do this in several steps. A hack in order to connect() on the "RARP" message should be easy to enable, I left a couple of comments in that sense. This is very much draft quality, but I tested the whole flow, and it works for me. Window parameters and MSS match, too. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- flow.c | 226 +++++++++++++++++++++++++++++++ flow.h | 7 + isolation.c | 2 +- migrate.c | 32 +++-- migrate.h | 1 + passt.c | 4 + passt.h | 2 + tcp.c | 372 +++++++++++++++++++++++++++++++++++++++++++++++++++ tcp_conn.h | 59 ++++++++ vhost_user.c | 4 + 10 files changed, 699 insertions(+), 10 deletions(-) diff --git a/flow.c b/flow.c index a6fe6d1..fcdd2b6 100644 --- a/flow.c +++ b/flow.c @@ -19,6 +19,7 @@ #include "inany.h" #include "flow.h" #include "flow_table.h" +#include "repair.h" const char *flow_state_str[] = { [FLOW_STATE_FREE] = "FREE", @@ -874,6 +875,231 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now) *last_next = FLOW_MAX; } +/** + * flow_migrate_source_pre_do() - Prepare/"unprepare" source flows for migration + * @c: Execution context + * @stage: Migration stage information (unused) + * @fd: Migration fd (unused) + * @rollback: If true, undo preparation + * + * Return: 0 on success, error code on failure + */ +static int flow_migrate_source_pre_do(struct ctx *c, + const struct migrate_stage *stage, int fd, + bool rollback) +{ + unsigned i, max_i; + int rc; + + (void)stage; + (void)fd; + + if (rollback) { + rc = 0; + i = FLOW_MAX; + goto 
rollback; + } + + for (i = 0; i < FLOW_MAX; i++) { /* TODO: iterator with skip */ + union flow *flow = &flowtab[i]; + + if (flow->f.state == FLOW_STATE_FREE) + i += flow->free.n - 1; + else if (flow->f.state == FLOW_STATE_ACTIVE && + flow->f.type == FLOW_TCP) + rc = tcp_flow_repair_on(c, &flow->tcp); + + if (rc) { + debug("Can't set repair mode for TCP flows, roll back"); + goto rollback; + } + } + + if ((rc = repair_flush(c))) { /* TODO: move to TCP logic */ + debug("Can't set repair mode for TCP flows, roll back"); + goto rollback; + } + + return 0; + +rollback: + max_i = i; + + for (i = 0; i < max_i; i++) { /* TODO: iterator with skip */ + union flow *flow = &flowtab[i]; + + if (flow->f.state == FLOW_STATE_FREE) + i += flow->free.n - 1; + else if (flow->f.state == FLOW_STATE_ACTIVE && + flow->f.type == FLOW_TCP) + tcp_flow_repair_off(c, &flow->tcp); + } + + repair_flush(c); + + return rc; +} + +/** + * flow_migrate_source_pre() - Prepare source flows for migration + * @c: Execution context + * @stage: Migration stage information (unused) + * @fd: Migration fd (unused) + * @rollback: If true, undo preparation + * + * Return: 0 on success, error code on failure + */ +int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage, + int fd) +{ + return flow_migrate_source_pre_do(c, stage, fd, false); +} + +/** + * flow_migrate_source() - Dump additional information and send data + * @c: Execution context + * @stage: Migration stage information (unused) + * @fd: Migration fd + * + * Return: 0 on success + */ +int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage, + int fd) +{ + uint32_t count = 0; + unsigned i; + int rc; + + for (i = 0; i < FLOW_MAX; i++) { /* TODO: iterator with skip */ + union flow *flow = &flowtab[i]; + + if (flow->f.state == FLOW_STATE_FREE) + i += flow->free.n - 1; + else if (flow->f.state == FLOW_STATE_ACTIVE && + flow->f.type == FLOW_TCP) + count++; + } + + count = htonl(count); + rc = write_all_buf(fd, 
&count, sizeof(count)); + if (rc) { + rc = errno; + err("Can't send flow count (%u): %s, abort", + ntohl(count), strerror_(errno)); + return rc; + } + debug("Sending %u flows", ntohl(count)); + + /* Send information that can be stored in the flow table, first */ + for (i = 0; i < FLOW_MAX; i++) { /* TODO: iterator with skip */ + union flow *flow = &flowtab[i]; + + if (flow->f.state == FLOW_STATE_FREE) { + i += flow->free.n - 1; + } else if (flow->f.state == FLOW_STATE_ACTIVE && + flow->f.type == FLOW_TCP) { + rc = tcp_flow_migrate_source(fd, &flow->tcp); + if (rc) + goto rollback; + } + /* TODO: other protocols */ + } + + /* And then "extended" data: the target needs to set repair mode on + * sockets before it can set this stuff, but it needs sockets (and + * flows) for that. + */ + for (i = 0; i < FLOW_MAX; i++) { /* TODO: iterator with skip */ + union flow *flow = &flowtab[i]; + + if (flow->f.state == FLOW_STATE_FREE) { + i += flow->free.n - 1; + } else if (flow->f.state == FLOW_STATE_ACTIVE && + flow->f.type == FLOW_TCP) { + rc = tcp_flow_migrate_source_ext(fd, &flow->tcp); + if (rc) + goto rollback; + } + /* TODO: other protocols */ + } + + return 0; + +rollback: + flow_migrate_source_pre_do(c, stage, fd, true); + return rc; +} + +/** + * flow_migrate_target() - Receive flows and insert in flow table + * @c: Execution context + * @stage: Migration stage information (unused) + * @fd: Migration fd + * + * Return: 0 on success + */ +int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage, + int fd) +{ + uint32_t count; + unsigned i; + int rc; + + (void)stage; + + /* TODO: error handling */ + + if (read_all_buf(fd, &count, sizeof(count))) + return errno; + + count = ntohl(count); + debug("Receiving %u flows", count); + + /* TODO: flow header with type, instead? 
*/ + for (i = 0; i < count; i++) { + rc = tcp_flow_migrate_target(c, fd); + if (rc) + return rc; + } + + repair_flush(c); + + for (i = 0; i < count; i++) { + rc = tcp_flow_migrate_target_ext(c, flowtab + i, fd); + if (rc) + return rc; + } + + repair_flush(c); + + return 0; +} + +/** + * flow_migrate_target_post() - connect() sockets after migration + * @c: Execution context + * + * Return: 0 on success + */ +int flow_migrate_target_post(struct ctx *c) +{ + unsigned i; + + for (i = 0; i < FLOW_MAX; i++) { /* TODO: iterator with skip */ + union flow *flow = &flowtab[i]; + + if (flow->f.state == FLOW_STATE_FREE) + i += flow->free.n - 1; + else if (flow->f.state == FLOW_STATE_ACTIVE && + flow->f.type == FLOW_TCP) + tcp_flow_repair_connect(c, &flow->tcp); + } + + repair_flush(c); /* TODO: move to TCP logic */ + + return 0; +} + /** * flow_init() - Initialise flow related data structures */ diff --git a/flow.h b/flow.h index 24ba3ef..4c28235 100644 --- a/flow.h +++ b/flow.h @@ -249,6 +249,13 @@ union flow; void flow_init(void); void flow_defer_handler(const struct ctx *c, const struct timespec *now); +int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage, + int fd); +int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage, + int fd); +int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage, + int fd); +int flow_migrate_target_post(struct ctx *c); void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) 
__attribute__((format(printf, 3, 4))); diff --git a/isolation.c b/isolation.c index c944fb3..df58bb8 100644 --- a/isolation.c +++ b/isolation.c @@ -377,7 +377,7 @@ void isolate_postfork(const struct ctx *c) { struct sock_fprog prog; - prctl(PR_SET_DUMPABLE, 0); +// prctl(PR_SET_DUMPABLE, 0); switch (c->mode) { case MODE_PASST: diff --git a/migrate.c b/migrate.c index 9948be0..f91a138 100644 --- a/migrate.c +++ b/migrate.c @@ -21,9 +21,9 @@ #include "inany.h" #include "flow.h" #include "flow_table.h" -#include "repair.h" #include "migrate.h" +#include "repair.h" /* Current version of migration data */ #define MIGRATE_VERSION 1 @@ -91,11 +91,13 @@ static int migrate_recv_block(struct ctx *c, static const struct migrate_stage stages_v1[] = { { .name = "flow pre", + .source = flow_migrate_source_pre, .target = NULL, }, { - .name = "flow post", - .source = NULL, + .name = "flow", + .source = flow_migrate_source, + .target = flow_migrate_target, }, { 0 }, }; @@ -115,9 +117,9 @@ static const struct migrate_version versions[] = { * * Return: 0 on success, positive error code on failure */ -static int migrate_source(struct ctx *c, int fd) +int migrate_source(struct ctx *c, int fd) { - const struct migrate_version *v = versions + ARRAY_SIZE(versions) - 1; + const struct migrate_version *v = versions + ARRAY_SIZE(versions) - 2; const struct migrate_stage *s; int ret; @@ -127,7 +129,7 @@ static int migrate_source(struct ctx *c, int fd) return ret; } - for (s = v->s; *s->name; s++) { + for (s = v->s; s->name; s++) { if (!s->source) continue; @@ -174,7 +176,7 @@ static uint32_t migrate_target_read_header(int fd) * * Return: 0 on success, positive error code on failure */ -static int migrate_target(struct ctx *c, int fd) +int migrate_target(struct ctx *c, int fd) { const struct migrate_version *v; const struct migrate_stage *s; @@ -188,13 +190,13 @@ static int migrate_target(struct ctx *c, int fd) return ret; } - for (v = versions; v->id && v->id == id; v++); + for (v = versions; 
v->id && v->id != id; v++); if (!v->id) { err("Unsupported version: %u", id); return -ENOTSUP; } - for (s = v->s; *s->name; s++) { + for (s = v->s; s->name; s++) { if (!s->target) continue; @@ -218,6 +220,7 @@ void migrate_init(struct ctx *c) { c->device_state_fd = -1; c->device_state_result = -1; + repair_sock_init(c); } /** @@ -275,3 +278,14 @@ void migrate_handler(struct ctx *c) c->device_state_result = rc; } + +/** + * migrate_finish() - Hack to connect() migrated sockets from "RARP" trigger + * @c: Execution context + */ +void migrate_finish(struct ctx *c) +{ + (void)c; + + /* HACK RARP: flow_migrate_target_post(c); */ +} diff --git a/migrate.h b/migrate.h index 80d78b8..9694af6 100644 --- a/migrate.h +++ b/migrate.h @@ -49,5 +49,6 @@ void migrate_init(struct ctx *c); void migrate_close(struct ctx *c); void migrate_request(struct ctx *c, int fd, bool target); void migrate_handler(struct ctx *c); +void migrate_finish(struct ctx *c); #endif /* MIGRATE_H */ diff --git a/passt.c b/passt.c index 1938290..65e9126 100644 --- a/passt.c +++ b/passt.c @@ -119,6 +119,8 @@ static void post_handler(struct ctx *c, const struct timespec *now) ndp_timer(c, now); } +uint64_t g_hash_secret[2]; + /** * random_init() - Initialise things based on random data * @c: Execution context @@ -130,6 +132,8 @@ static void random_init(struct ctx *c) /* Create secret value for SipHash calculations */ raw_random(&c->hash_secret, sizeof(c->hash_secret)); + memcpy(g_hash_secret, c->hash_secret, sizeof(g_hash_secret)); + /* Seed pseudo-RNG for things that need non-cryptographic random */ raw_random(&seed, sizeof(seed)); srandom(seed); diff --git a/passt.h b/passt.h index 4189a4a..6010f92 100644 --- a/passt.h +++ b/passt.h @@ -317,6 +317,8 @@ struct ctx { bool migrate_target; }; +extern uint64_t g_hash_secret[2]; + void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s); diff --git a/tcp.c b/tcp.c index af6bd95..71775f1 100644 --- a/tcp.c +++ b/tcp.c @@ -299,6 +299,7 @@ 
#include "log.h" #include "inany.h" #include "flow.h" +#include "repair.h" #include "linux_dep.h" #include "flow_table.h" @@ -2645,3 +2646,374 @@ void tcp_timer(struct ctx *c, const struct timespec *now) if (c->mode == MODE_PASTA) tcp_splice_refill(c); } + +/** + * tcp_flow_repair_on() - Enable repair mode for a single TCP flow + * @c: Execution context + * @conn: Pointer to the TCP connection structure + * + * Return: 0 on success, negative error code on failure + */ +int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn) +{ + int rc = 0; + + if ((rc = repair_set(c, conn->sock, TCP_REPAIR_ON))) + err("Failed to set TCP_REPAIR"); + + return rc; +} + +/** + * tcp_flow_repair_off() - Clear repair mode for a single TCP flow + * @c: Execution context + * @conn: Pointer to the TCP connection structure + * + * Return: 0 on success, negative error code on failure + */ +int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn) +{ + int rc = 0; + + if ((rc = repair_set(c, conn->sock, TCP_REPAIR_OFF))) + err("Failed to clear TCP_REPAIR"); + + return rc; +} + +/** + * tcp_flow_repair_seq() - Dump or set sequences for socket queues + * @s: Socket + * @snd: Sending sequence, set on return if @set == false, network order + * @rcv: Receive sequence, set on return if @set == false, network order + * @set: Set if true, dump if false + * + * Return: 0 on success, negative error code on failure + */ +static int tcp_flow_repair_seq(int s, uint32_t *snd, uint32_t *rcv, bool set) +{ + socklen_t vlen = sizeof(uint32_t); + int v; + + /* TODO: proper error management and prints */ + + v = TCP_SEND_QUEUE; + if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v))) + return -errno; + + if (set) { + *snd = ntohl(*snd); + if (setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, snd, vlen)) + return -errno; + debug("Set sending sequence for socket %i to %u", s, *snd); + } else { + if (getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, snd, &vlen)) + return -errno; + debug("Dumped sending 
sequence for socket %i: %u", s, *snd); + *snd = htonl(*snd); + } + + v = TCP_RECV_QUEUE; + if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v))) + return -errno; + + if (set) { + *rcv = ntohl(*rcv); + if (setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, rcv, vlen)) + return -errno; + debug("Set receive sequence for socket %i to %u", s, *rcv); + } else { + if (getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, rcv, &vlen)) + return -errno; + debug("Dumped receive sequence for socket %i: %u", s, *rcv); + *rcv = htonl(*rcv); + } + + return 0; +} + +/** + * tcp_flow_repair_opt() - Dump or set repair "options" (MSS and window scale) + * @s: Socket + * @ws_to_sock: Window scaling factor from us, network order + * @ws_from_sock: Window scaling factor from peer, network order + * @mss: Maximum Segment Size, socket side, network order + * @set: Set if true, dump if false + * + * Return: 0 on success, TODO: negative error code on failure + */ +int tcp_flow_repair_opt(int s, uint8_t *ws_to_sock, uint8_t *ws_from_sock, + uint32_t *mss, bool set) +{ + struct tcp_info_linux tinfo; + struct tcp_repair_opt opts[2]; + socklen_t sl; + + opts[0].opt_code = TCPOPT_WINDOW; + opts[1].opt_code = TCPOPT_MAXSEG; + + if (set) { + *ws_to_sock = ntohs(*ws_to_sock); + *ws_from_sock = ntohs(*ws_from_sock); + + opts[0].opt_val = *ws_to_sock + (*ws_from_sock << 16); + opts[1].opt_val = ntohl(*mss); + + sl = sizeof(opts); + setsockopt(s, SOL_TCP, TCP_REPAIR_OPTIONS, opts, sl); + } else { + sl = sizeof(tinfo); + getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl); + + *ws_to_sock = tinfo.tcpi_snd_wscale; + *ws_from_sock = tinfo.tcpi_rcv_wscale; + *mss = htonl(tinfo.tcpi_snd_mss); + } + + return 0; +} + +/** + * tcp_flow_repair_wnd() - Dump or set window parameters + * @snd_wl1: See struct tcp_repair_window + * @snd_wnd: Socket-side sending window, network order + * @max_window: Window clamp, network order + * @rcv_wnd: Socket-side receive window, network order + * @rcv_wup: See struct tcp_repair_window + * @set: Set if 
true, dump if false + * + * Return: 0 on success, TODO: negative error code on failure + */ +int tcp_flow_repair_wnd(int s, uint32_t *snd_wl1, uint32_t *snd_wnd, + uint32_t *max_window, uint32_t *rcv_wnd, + uint32_t *rcv_wup, bool set) +{ + struct tcp_repair_window wnd; + socklen_t sl = sizeof(wnd); + + if (set) { + wnd.snd_wl1 = ntohl(*snd_wl1); + wnd.snd_wnd = ntohl(*snd_wnd); + wnd.max_window = ntohl(*max_window); + wnd.rcv_wnd = ntohl(*rcv_wnd); + wnd.rcv_wup = ntohl(*rcv_wup); + + setsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, &wnd, sl); + } else { + getsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, &wnd, &sl); + + *snd_wl1 = htonl(wnd.snd_wl1); + *snd_wnd = htonl(wnd.snd_wnd); + *max_window = htonl(wnd.max_window); + *rcv_wnd = htonl(wnd.rcv_wnd); + *rcv_wup = htonl(wnd.rcv_wup); + } + + return 0; +} + +/** + * tcp_flow_migrate_source() - Send data (flow table part) for a single flow + * @c: Execution context + * @fd: Descriptor for state migration + * @conn: Pointer to the TCP connection structure + */ +int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn) +{ + struct tcp_tap_transfer t = { + .retrans = conn->retrans, + .ws_from_tap = conn->ws_from_tap, + .ws_to_tap = conn->ws_to_tap, + .events = conn->events, + + .sndbuf = htonl(conn->sndbuf), + + .flags = conn->flags, + .seq_dup_ack_approx = conn->seq_dup_ack_approx, + + .wnd_from_tap = htons(conn->wnd_from_tap), + .wnd_to_tap = htons(conn->wnd_to_tap), + + .seq_to_tap = htonl(conn->seq_to_tap), + .seq_ack_from_tap = htonl(conn->seq_ack_from_tap), + .seq_from_tap = htonl(conn->seq_from_tap), + .seq_ack_to_tap = htonl(conn->seq_ack_to_tap), + .seq_init_from_tap = htonl(conn->seq_init_from_tap), + }; + + memcpy(&t.pif, conn->f.pif, sizeof(t.pif)); + memcpy(&t.side, conn->f.side, sizeof(t.side)); + + if (write_all_buf(fd, &t, sizeof(t))) + return errno; + + return 0; +} + +/** + * tcp_flow_migrate_source_ext() - Send extended data for a single flow + * @fd: Descriptor for state migration + * @conn: 
Pointer to the TCP connection structure + */ +int tcp_flow_migrate_source_ext(int fd, struct tcp_tap_conn *conn) +{ + struct tcp_tap_transfer_ext t; + int s = conn->sock; + + tcp_flow_repair_seq(s, &t.sock_seq_snd, &t.sock_seq_rcv, false); + + tcp_flow_repair_opt(s, &t.ws_to_sock, &t.ws_from_sock, &t.sock_mss, + false); + + tcp_flow_repair_wnd(s, &t.sock_snd_wl1, &t.sock_snd_wnd, + &t.sock_max_window, &t.sock_rcv_wnd, + &t.sock_rcv_wup, false); + + if (write_all_buf(fd, &t, sizeof(t))) + return errno; + + return 0; +} + +/** + * tcp_flow_repair_socket() - Open and bind socket, request repair mode + * @c: Execution context + * @conn: Pointer to the TCP connection structure + * + * Return: 0 on success, negative error code on failure + */ +int tcp_flow_repair_socket(struct ctx *c, struct tcp_tap_conn *conn) +{ + sa_family_t af = CONN_V4(conn) ? AF_INET : AF_INET6; + const struct flowside *sockside = HOSTFLOW(conn); + struct sockaddr_in a; + int rc; + + a = (struct sockaddr_in){ af, htons(sockside->oport), { 0 }, { 0 } }; + + if ((conn->sock = socket(af, SOCK_STREAM, IPPROTO_TCP)) < 0) + return -errno; + + /* On the same host, source socket can be in TIME_WAIT */ + setsockopt(conn->sock, SOL_SOCKET, SO_REUSEADDR, + &((int){ 1 }), sizeof(int)); + + /* TODO: switch to tcp_bind_outbound(c, conn, conn->sock); ...? 
*/ + if (bind(conn->sock, (struct sockaddr *)&a, sizeof(a)) < 0) { + close(conn->sock); + conn->sock = -1; + return -errno; + } + + rc = tcp_flow_repair_on(c, conn); + if (rc) { + close(conn->sock); + conn->sock = -1; + return rc; + } + + return 0; +} + +/** + * tcp_flow_repair_connect() - Connect socket in repair mode, then turn it off + * @c: Execution context + * @conn: Pointer to the TCP connection structure + * + * Return: 0 on success, negative error code on failure + */ +int tcp_flow_repair_connect(struct ctx *c, struct tcp_tap_conn *conn) +{ + struct flowside *tgt = &conn->f.side[TGTSIDE]; + + flowside_connect(c, conn->sock, PIF_HOST, tgt); + + conn->in_epoll = 0; + conn->timer = -1; + tcp_epoll_ctl(c, conn); + + return 0; + + /* HACK RARP: return tcp_flow_repair_off(c, conn); */ +} + +/** + * tcp_flow_migrate_target() - Receive data (flow table part) for flow, insert + * @c: Execution context + * @fd: Descriptor for state migration + */ +int tcp_flow_migrate_target(struct ctx *c, int fd) +{ + struct tcp_tap_transfer t; + struct tcp_tap_conn *conn; + union flow *flow; + int rc; + + if (!(flow = flow_alloc())) + return -ENOMEM; + + if ((rc = read_all_buf(fd, &t, sizeof(t)))) + return errno; + + flow->f.state = FLOW_STATE_TGT; + memcpy(&flow->f.pif, &t.pif, sizeof(flow->f.pif)); + memcpy(&flow->f.side, &t.side, sizeof(flow->f.side)); + conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); + + conn->retrans = t.retrans; + conn->ws_from_tap = t.ws_from_tap; + conn->ws_to_tap = t.ws_to_tap; + conn->events = t.events; + + conn->sndbuf = htonl(t.sndbuf); + + conn->flags = t.flags; + conn->seq_dup_ack_approx = t.seq_dup_ack_approx; + + conn->wnd_from_tap = ntohs(t.wnd_from_tap); + conn->wnd_to_tap = ntohs(t.wnd_to_tap); + + conn->seq_to_tap = ntohl(t.seq_to_tap); + conn->seq_ack_from_tap = ntohl(t.seq_ack_from_tap); + conn->seq_from_tap = ntohl(t.seq_from_tap); + conn->seq_ack_to_tap = ntohl(t.seq_ack_to_tap); + conn->seq_init_from_tap = ntohl(t.seq_init_from_tap); + + 
tcp_flow_repair_socket(c, conn); + + flow_hash_insert(c, TAP_SIDX(conn)); + FLOW_ACTIVATE(conn); + + return 0; +} + +/** + * tcp_flow_migrate_target_ext() - Receive extended data for flow, set, connect + * @c: Execution context + * @flow: Existing flow for this connection data + * @fd: Descriptor for state migration + */ +int tcp_flow_migrate_target_ext(struct ctx *c, union flow *flow, int fd) +{ + struct tcp_tap_conn *conn = &flow->tcp; + struct tcp_tap_transfer_ext t; + int s = conn->sock, rc; + + if ((rc = read_all_buf(fd, &t, sizeof(t)))) + return errno; + + tcp_flow_repair_seq(s, &t.sock_seq_snd, &t.sock_seq_rcv, true); + + tcp_flow_repair_connect(c, conn); + + tcp_flow_repair_opt(s, &t.ws_to_sock, &t.ws_from_sock, &t.sock_mss, + true); + + tcp_flow_repair_wnd(s, &t.sock_snd_wl1, &t.sock_snd_wnd, + &t.sock_max_window, &t.sock_rcv_wnd, + &t.sock_rcv_wup, true); + + tcp_flow_repair_off(c, conn); + + return 0; +} diff --git a/tcp_conn.h b/tcp_conn.h index d342680..c05a94f 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -94,8 +94,60 @@ struct tcp_tap_conn { uint32_t seq_from_tap; uint32_t seq_ack_to_tap; uint32_t seq_init_from_tap; + + uint32_t sock_seq_snd; + uint32_t sock_seq_rcv; }; +/** + * struct tcp_tap_transfer - TCP data to migrate (flow table part only) + * TODO + */ +struct tcp_tap_transfer { + uint8_t pif[SIDES]; + struct flowside side[SIDES]; + + uint8_t retrans; + uint8_t ws_from_tap; + uint8_t ws_to_tap; + uint8_t events; + + uint32_t sndbuf; + + uint8_t flags; + uint8_t seq_dup_ack_approx; + + uint16_t wnd_from_tap; + uint16_t wnd_to_tap; + + uint32_t seq_to_tap; + uint32_t seq_ack_from_tap; + uint32_t seq_from_tap; + uint32_t seq_ack_to_tap; + uint32_t seq_init_from_tap; +} __attribute__((packed, aligned(__alignof__(uint32_t)))); + +/** + * struct tcp_tap_transfer_ext - TCP data to migrate (not stored in flow table) + * TODO + */ +struct tcp_tap_transfer_ext { + uint32_t sock_seq_snd; + uint32_t sock_seq_rcv; + + uint32_t sock_mss; + + /* We can't just 
use struct tcp_repair_window: we need network order */ + uint32_t sock_snd_wl1; + uint32_t sock_snd_wnd; + uint32_t sock_max_window; + uint32_t sock_rcv_wnd; + uint32_t sock_rcv_wup; + + uint8_t ws_to_sock; + uint8_t ws_from_sock; +} __attribute__((packed, aligned(__alignof__(uint32_t)))); + /** * struct tcp_splice_conn - Descriptor for a spliced TCP connection * @f: Generic flow information @@ -140,6 +192,13 @@ extern int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; extern int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; bool tcp_flow_defer(const struct tcp_tap_conn *conn); +int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn); +int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn); +int tcp_flow_repair_connect(struct ctx *c, struct tcp_tap_conn *conn); +int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn); +int tcp_flow_migrate_source_ext(int fd, struct tcp_tap_conn *conn); +int tcp_flow_migrate_target(struct ctx *c, int fd); +int tcp_flow_migrate_target_ext(struct ctx *c, union flow *flow, int fd); bool tcp_splice_flow_defer(struct tcp_splice_conn *conn); void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn); int tcp_conn_pool_sock(int pool[]); diff --git a/vhost_user.c b/vhost_user.c index 70773d6..afc977b 100644 --- a/vhost_user.c +++ b/vhost_user.c @@ -44,6 +44,7 @@ #include "tap.h" #include "vhost_user.h" #include "pcap.h" +#include "migrate.h" /* vhost-user version we are compatible with */ #define VHOST_USER_VERSION 1 @@ -994,6 +995,9 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev, eth_ntop((unsigned char *)&msg->payload.u64, macstr, sizeof(macstr))); + /* Abuse this as trigger to finally connect() migrated sockets */ + migrate_finish(vdev->context); + return false; } -- 2.43.0
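The flow count exchange in flow_migrate_source() above leans on the write_all_buf()/read_all_buf() helpers from util.c to cope with short writes and reads on the migration channel. A minimal, self-contained sketch of that pattern follows — write_all()/read_all() are illustrative names, and treating early EOF as ENODATA is an assumption, not necessarily what the real helpers do:

```c
#include <assert.h>
#include <arpa/inet.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* write_all() - Sketch: keep writing until the whole buffer went out
 *
 * Return: 0 on success, -1 with errno set on failure
 */
static int write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}

	return 0;
}

/* read_all() - Sketch: keep reading until the whole buffer is filled
 *
 * Return: 0 on success, -1 with errno set on failure
 */
static int read_all(int fd, void *buf, size_t len)
{
	char *p = buf;

	while (len) {
		ssize_t n = read(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (!n) {
			errno = ENODATA;	/* assumption: early EOF is an error */
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}

	return 0;
}
```

Fixed-width fields such as the flow count cross the channel in network order, so the source does htonl() before write_all() and the target ntohl() after read_all(), exactly as in the patch.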
On Wed, Feb 05, 2025 at 01:39:04AM +0100, Stefano Brivio wrote:
> This implements flow preparation on the source, transfer of data with
> a format roughly inspired by struct tcp_tap_conn, and flow insertion
> on the target, with all the appropriate window options, window
> scaling, MSS, etc.
> 
> The target side is rather convoluted because we first need to create
> sockets and switch them to repair mode, before we can apply options
> that are *not* stored in the flow table. However, we don't want to
> request repair mode for sockets one by one. So we need to do this in
> several steps.
> 
> A hack in order to connect() on the "RARP" message should be easy to
> enable, I left a couple of comments in that sense.
> 
> This is very much draft quality, but I tested the whole flow, and it
> works for me. Window parameters and MSS match, too.
> 
> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>

[snip]

> diff --git a/isolation.c b/isolation.c
> index c944fb3..df58bb8 100644
> --- a/isolation.c
> +++ b/isolation.c
> @@ -377,7 +377,7 @@ void isolate_postfork(const struct ctx *c)
>  {
>  	struct sock_fprog prog;
>  
> -	prctl(PR_SET_DUMPABLE, 0);
> +//	prctl(PR_SET_DUMPABLE, 0);

Looks like a stray debugging change made it in here.

Fwiw, I keep a branch around with debugging hacks including exactly
this. To make it harder to accidentally leak debug hacks into "real"
series, I usually rebase between my debugging branch and main. In this
case it conflicted with the patch from my debugging branch, of course,
which is why I spotted it.

>  	switch (c->mode) {
>  	case MODE_PASST:

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
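For reference, the `/* TODO: iterator with skip */` loops repeated throughout flow.c in the series all share one shape: active entries are visited one by one, while a FREE entry carries the length of the free cluster it starts, so the whole cluster can be jumped in a single step. A self-contained sketch of that traversal, with simplified stand-in types rather than the real flow table:

```c
#include <assert.h>

#define FLOW_MAX 16

enum flow_state { FLOW_STATE_FREE, FLOW_STATE_ACTIVE };

/* Simplified stand-in for union flow: a FREE entry records the length
 * of the free cluster it starts, like flow->free.n in the real table.
 */
struct toy_flow {
	enum flow_state state;
	unsigned free_n;
};

/* flow_count_active() - Count active entries, skipping free clusters
 *
 * Return: number of FLOW_STATE_ACTIVE entries in @tab
 */
static unsigned flow_count_active(const struct toy_flow *tab)
{
	unsigned i, count = 0;

	for (i = 0; i < FLOW_MAX; i++) {
		if (tab[i].state == FLOW_STATE_FREE)
			i += tab[i].free_n - 1;	/* jump the whole free cluster */
		else
			count++;
	}

	return count;
}
```

Factoring this shape into a proper iterator, as the TODOs suggest, would remove the copy-pasted skip logic from flow_migrate_source_pre_do(), flow_migrate_source() and flow_migrate_target_post().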