This is obviously incomplete. I have code on top of this, not really
working yet, with a loop on transferred flows, and an implementation
matching passt-repair, requesting to enable/disable the TCP_REPAIR
option as needed, as well as setting/receiving sequences.

I'm sending this for early review/rework/rewrite/whatever. What's here
should all be tested and working.

Adding:

	{ &flow_first_free,	sizeof(flow_first_free)	},
	{ flowtab,		sizeof(flowtab)		},

to data version 1 in 6/7 will properly transfer those sections.

Declaring functions and assigning pointers such as:

	{ flow_migrate_source_pre,	NULL },
	{ flow_migrate_source_post,	NULL },
	{ flow_migrate_target_post_v1,	NULL },

also executes them.

The passt-repair helper in 7/7 is (lightly) tested against a
stand-alone source/target implementation which I'll share in a bit.

Stefano Brivio (7):
  icmp, udp: Pad time_t timestamp to 64-bit to ease state migration
  flow, flow_table: Pad flow table entries to 128 bytes, hash entries
    to 32 bits
  tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn
  flow_table: Use size in extern declaration for flowtab
  util: Add read_remainder() and read_all_buf()
  Introduce facilities for guest migration on top of vhost-user
    infrastructure
  Introduce passt-repair

 Makefile       |  22 +++--
 flow.h         |  18 ++--
 flow_table.h   |  15 ++-
 icmp_flow.h    |   6 +-
 migrate.c      | 259 +++++++++++++++++++++++++++++++++++++++++++++++++
 migrate.h      |  90 +++++++++++++++++
 passt-repair.c | 111 +++++++++++++++++++++
 passt.c        |   2 +-
 tcp_conn.h     |   2 +-
 udp_flow.h     |   6 +-
 util.c         |  70 +++++++++++++
 util.h         |   2 +
 vu_common.c    | 122 +++++++++++++++--------
 vu_common.h    |   2 +-
 14 files changed, 662 insertions(+), 65 deletions(-)
 create mode 100644 migrate.c
 create mode 100644 migrate.h
 create mode 100644 passt-repair.c

-- 
2.43.0
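For orientation, a minimal sketch (not part of the series) of what the
registration described above could look like, reusing the section and
handler tables that 6/7 introduces in migrate.c. The flow_migrate_*()
handlers are placeholder names taken from this cover letter, not
functions implemented in these patches:

	/* Sketch only: extend the version-1 data sections and handler
	 * tables from 6/7 to cover the flow table. flow_migrate_source_pre()
	 * and friends are placeholders from the cover letter, not
	 * implemented in this series.
	 */
	static struct iovec sections_v1[] = {
		{ &header,		sizeof(header)		},
		{ &flow_first_free,	sizeof(flow_first_free)	},
		{ flowtab,		sizeof(flowtab)		},
		{ NULL,			0			},	/* terminator: count loops stop at iov_len == 0 */
	};

	struct migrate_handler handlers_source_pre[] = {
		{ flow_migrate_source_pre,	NULL },
		{ 0 },
	};

	struct migrate_handler handlers_target_post_v1[] = {
		{ flow_migrate_target_post_v1,	NULL },
		{ 0 },
	};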
That's the only field in flows with different storage sizes depending on the architecture: it's usually 4-byte wide on 32-bit architectures, except for arc and x32 where it's 8 bytes, and 8-byte wide on 64-bit machines. By keeping flow entries the same size across architectures, we avoid having to expand or shrink table entries upon migration. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- icmp_flow.h | 6 +++++- udp_flow.h | 6 +++++- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/icmp_flow.h b/icmp_flow.h index fb93801..da7e255 100644 --- a/icmp_flow.h +++ b/icmp_flow.h @@ -13,6 +13,7 @@ * @seq: Last sequence number sent to tap, host order, -1: not sent yet * @sock: "ping" socket * @ts: Last associated activity from tap, seconds + * @ts_storage: Pad @ts to 64-bit storage to keep state migration sane */ struct icmp_ping_flow { /* Must be first element */ @@ -20,7 +21,10 @@ struct icmp_ping_flow { int seq; int sock; - time_t ts; + union { + time_t ts; + uint64_t ts_storage; + }; }; bool icmp_ping_timer(const struct ctx *c, const struct icmp_ping_flow *pingf, diff --git a/udp_flow.h b/udp_flow.h index 9a1b059..9cb79a0 100644 --- a/udp_flow.h +++ b/udp_flow.h @@ -12,6 +12,7 @@ * @f: Generic flow information * @closed: Flow is already closed * @ts: Activity timestamp + * @ts_storage: Pad @ts to 64-bit storage to keep state migration sane * @s: Socket fd (or -1) for each side of the flow */ struct udp_flow { @@ -19,7 +20,10 @@ struct udp_flow { struct flow_common f; bool closed :1; - time_t ts; + union { + time_t ts; + uint64_t ts_storage; + }; int s[SIDES]; }; -- 2.43.0
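As an aside, a stand-alone way to check that the union actually fixes
the field width regardless of how wide time_t is on the build (a
sketch, not part of the patch):

	/* Sketch: compile-time checks mirroring the union used in the
	 * patch. A union is as large as its largest member, so it stays
	 * 8 bytes whether time_t is 4 or 8 bytes wide.
	 */
	#include <assert.h>	/* static_assert, C11 */
	#include <stdint.h>
	#include <time.h>

	union ts_pad {
		time_t   ts;
		uint64_t ts_storage;
	};

	static_assert(sizeof(union ts_pad) == sizeof(uint64_t),
		      "padded timestamp must be 64 bits");
	static_assert(sizeof(time_t) <= sizeof(uint64_t),
		      "time_t must fit in the 64-bit padding");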
On Tue, Jan 28, 2025 at 12:15:26AM +0100, Stefano Brivio wrote:
> That's the only field in flows with different storage sizes depending
> on the architecture: it's usually 4-byte wide on 32-bit architectures,
> except for arc and x32 where it's 8 bytes, and 8-byte wide on 64-bit
> machines.

As discussed on the call, I think there are broader problems with
transferring timestamps than just the structure size. So I'm hoping we
can work out how to not transfer them at all and avoid this change.

> By keeping flow entries the same size across architectures, we avoid
> having to expand or shrink table entries upon migration.
>
> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>

[...]

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
On Tue, 28 Jan 2025 11:49:16 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Tue, Jan 28, 2025 at 12:15:26AM +0100, Stefano Brivio wrote:
> > That's the only field in flows with different storage sizes
> > depending on the architecture: it's usually 4-byte wide on 32-bit
> > architectures, except for arc and x32 where it's 8 bytes, and 8-byte
> > wide on 64-bit machines.
>
> As discussed on the call, I think there are broader problems with
> transferring timestamps than just the structure size. So I'm hoping we
> can work out how to not transfer them at all and avoid this change.

This change is not related to the fact that we ignore or use them. It's
about making the flow entries the same size, which we need, at least
with this implementation.

-- 
Stefano
...to keep migration sane. Right now, the biggest struct in union flow is struct tcp_splice_conn with 120 bytes on x86_64, which should also have the biggest storage and alignment requirements of any architecture we might run on. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- flow.h | 18 ++++++++++++------ flow_table.h | 13 ++++++++++--- 2 files changed, 22 insertions(+), 9 deletions(-) diff --git a/flow.h b/flow.h index 24ba3ef..8eb5964 100644 --- a/flow.h +++ b/flow.h @@ -202,15 +202,21 @@ struct flow_common { /** * struct flow_sidx - ID for one side of a specific flow - * @sidei: Index of side referenced (0 or 1) - * @flowi: Index of flow referenced + * @sidei: Index of side referenced (0 or 1) + * @flowi: Index of flow referenced + * @flow_sidx_storage: Pad to 32 bits */ typedef struct flow_sidx { - unsigned sidei :1; - unsigned flowi :FLOW_INDEX_BITS; + union { + struct { + unsigned sidei :1; + unsigned flowi :FLOW_INDEX_BITS; + }; + uint32_t flow_sidx_storage; + }; } flow_sidx_t; -static_assert(sizeof(flow_sidx_t) <= sizeof(uint32_t), - "flow_sidx_t must fit within 32 bits"); +static_assert(sizeof(flow_sidx_t) == sizeof(uint32_t), + "flow_sidx_t must be 32-bit wide"); #define FLOW_SIDX_NONE ((flow_sidx_t){ .flowi = FLOW_MAX }) diff --git a/flow_table.h b/flow_table.h index f15db53..007f4dd 100644 --- a/flow_table.h +++ b/flow_table.h @@ -26,9 +26,13 @@ struct flow_free_cluster { /** * union flow - Descriptor for a logical packet flow (e.g. connection) - * @f: Fields common between all variants - * @tcp: Fields for non-spliced TCP connections - * @tcp_splice: Fields for spliced TCP connections + * @f: Fields common between all variants + * @free: Entry in a cluster of free entries + * @tcp: Fields for non-spliced TCP connections + * @tcp_splice: Fields for spliced TCP connections + * @ping: Tracking for ping flows + * @udp: Tracking for UDP flows + * @flow_storage: Pad flow entries to 128 bytes to ease state migration */ union flow { struct flow_common f; @@ -37,8 +41,11 @@ union flow { struct tcp_splice_conn tcp_splice; struct icmp_ping_flow ping; struct udp_flow udp; + char flow_storage[128]; }; +static_assert(sizeof(union flow) == 128, "union flow should be 128-byte wide"); + /* Global Flow Table */ extern unsigned flow_first_free; extern union flow flowtab[]; -- 2.43.0
On Tue, Jan 28, 2025 at 12:15:27AM +0100, Stefano Brivio wrote:
> ...to keep migration sane. Right now, the biggest struct in union flow
> is struct tcp_splice_conn with 120 bytes on x86_64, which should also
> have the biggest storage and alignment requirements of any
> architecture we might run on.

Necessary for the current "copy the entire table as a blob" approach.
As I've noted, I think that will be fragile, but we can revisit this
change when/if we figure out a different way to handle the table as a
whole.

> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>

[...]

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
Moving in_epoll out of the common flow data created a 7-bit hole in
struct tcp_splice_conn: repack by shrinking @flags by one (otherwise
unused) bit.

Fixes: b60fa33eeafb ("tcp: Move in_epoll flag out of common connection structure")
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
 tcp_conn.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp_conn.h b/tcp_conn.h
index d342680..3d06e2c 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -125,7 +125,7 @@ struct tcp_splice_conn {
 #define FIN_RCVD(sidei_)	((sidei_) ? BIT(5) : BIT(4))
 #define FIN_SENT(sidei_)	((sidei_) ? BIT(7) : BIT(6))
 
-	uint8_t flags;
+	uint8_t flags :7;
 #define RCVLOWAT_SET(sidei_)	((sidei_) ? BIT(1) : BIT(0))
 #define RCVLOWAT_ACT(sidei_)	((sidei_) ? BIT(3) : BIT(2))
 #define CLOSING			BIT(4)
-- 
2.43.0
On Tue, Jan 28, 2025 at 12:15:28AM +0100, Stefano Brivio wrote:
> Moving in_epoll out of the common flow data created a 7-bit hole in
> struct tcp_splice_conn: repack by shrinking @flags by one (otherwise
> unused) bit.

Is this actually necessary for the migration stuff? Or just a cleanup
you spotted along the way?

> Fixes: b60fa33eeafb ("tcp: Move in_epoll flag out of common connection structure")
> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>

[...]

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
On Tue, 28 Jan 2025 11:53:09 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Tue, Jan 28, 2025 at 12:15:28AM +0100, Stefano Brivio wrote:
> > Moving in_epoll out of the common flow data created a 7-bit hole in
> > struct tcp_splice_conn: repack by shrinking @flags by one (otherwise
> > unused) bit.
>
> Is this actually necessary for the migration stuff? Or just a cleanup
> you spotted along the way?

I thought it was helpful to keep the same size on 32-bit, but it looks
like it's not actually needed.

Let me drop it from this series as it's just noise and I'm trying to
keep this slim. If we are all happy with it I can apply it. If not I'll
forget about it.

-- 
Stefano
On Tue, Jan 28, 2025 at 07:48:33AM +0100, Stefano Brivio wrote:
> On Tue, 28 Jan 2025 11:53:09 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>
> > Is this actually necessary for the migration stuff? Or just a
> > cleanup you spotted along the way?
>
> I thought it was helpful to keep the same size on 32-bit, but it looks
> like it's not actually needed.
>
> Let me drop it from this series as it's just noise and I'm trying to
> keep this slim. If we are all happy with it I can apply it. If not
> I'll forget about it.

Eh, I don't care that much either way.

Note, btw, that bit-field packing is another way source and destination
could potentially have mismatching data structures. IIUC bit field
packing is described by the ABI and doesn't necessarily match the byte
endianness.

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
On Wed, 29 Jan 2025 12:02:09 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> Eh, I don't care that much either way.
>
> Note, btw, that bit-field packing is another way source and
> destination could potentially have mismatching data structures. IIUC
> bit field packing is described by the ABI and doesn't necessarily
> match the byte endianness.

Right, that's actually the reason that brought me to this change: I was
comparing stuff between x86_64 and armv6l.

On the other hand, this part of the specific ABI is generally
considered stable so I can rely on it.

-- 
Stefano
On Wed, Jan 29, 2025 at 08:33:40AM +0100, Stefano Brivio wrote:
> On Wed, 29 Jan 2025 12:02:09 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>
> > Note, btw, that bit-field packing is another way source and
> > destination could potentially have mismatching data structures. IIUC
> > bit field packing is described by the ABI and doesn't necessarily
> > match the byte endianness.
>
> Right, that's actually the reason that brought me to this change: I
> was comparing stuff between x86_64 and armv6l. On the other hand, this
> part of the specific ABI is generally considered stable so I can rely
> on it.

Uhh.. a specific ABI is stable, yes, but IIUC the whole point of these
endian, word size etc. checks is that you're not counting on it being
an identical ABI at each end.

I'm saying the bit field packing is another way the ABIs at each end
could differ, which is not currently accounted for.

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
On Thu, 30 Jan 2025 11:44:19 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> Uhh.. a specific ABI is stable, yes, but IIUC the whole point of these
> endian, word size etc. checks is that you're not counting on it being
> an identical ABI at each end.

Of course. I'm just saying that I can *rely on ABIs*. Not on them being
the same.

> I'm saying the bit field packing is another way the ABIs at each end
> could differ

It does.

> which is not currently accounted for.

That's because I have two hands, but obviously if I'm comparing ABIs...

-- 
Stefano
On Thu, Jan 30, 2025 at 05:55:00AM +0100, Stefano Brivio wrote:
> On Thu, 30 Jan 2025 11:44:19 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>
> > Uhh.. a specific ABI is stable, yes, but IIUC the whole point of
> > these endian, word size etc. checks is that you're not counting on
> > it being an identical ABI at each end.
>
> Of course. I'm just saying that I can *rely on ABIs*. Not on them
> being the same.

Ok, fair enough.

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
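To make the packing concern above concrete, a small stand-alone probe
(illustrative only, not part of the series; the field widths are
assumptions, not the real FLOW_INDEX_BITS) that dumps the raw bytes a
bit-field declaration like flow_sidx_t produces. Comparing its output
on the two ends shows whether the ABIs pack the fields the same way:

	/* Sketch: the byte pattern below depends on how the ABI packs
	 * bit-fields, not only on byte endianness, which is exactly the
	 * mismatch discussed in the thread.
	 */
	#include <stdio.h>
	#include <stdint.h>
	#include <string.h>

	struct probe {
		unsigned sidei :1;
		unsigned flowi :31;	/* illustrative width */
	};

	int main(void)
	{
		struct probe p = { .sidei = 1, .flowi = 0x123456 };
		uint8_t raw[sizeof(p)];
		size_t i;

		memcpy(raw, &p, sizeof(p));
		for (i = 0; i < sizeof(raw); i++)
			printf("%02x ", raw[i]);
		printf("\n");

		return 0;
	}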
...so that we can use sizeof() on it.

Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
 flow_table.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/flow_table.h b/flow_table.h
index 007f4dd..a85cab5 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -48,7 +48,7 @@ static_assert(sizeof(union flow) == 128, "union flow should be 128-byte wide");
 
 /* Global Flow Table */
 extern unsigned flow_first_free;
-extern union flow flowtab[];
+extern union flow flowtab[FLOW_MAX];
 
 /**
  * flow_foreach_sidei() - 'for' type macro to step through each side of flow
-- 
2.43.0
These are symmetric to write_remainder() and write_all_buf() and almost a copy and paste of them, with the most notable differences being reversed reads/writes and a couple of better-safe-than-sorry asserts to keep Coverity happy. I'll use them in the next patch. At least for the moment, they're going to be used for vhost-user mode only, so I'm not unconditionally enabling readv() in the seccomp profile: the caller has to ensure it's there. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- util.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ util.h | 2 ++ 2 files changed, 72 insertions(+) diff --git a/util.c b/util.c index 11973c4..085937b 100644 --- a/util.c +++ b/util.c @@ -606,6 +606,76 @@ int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip) return 0; } +/** + * read_all_buf() - Fill a whole buffer from a file descriptor + * @fd: File descriptor + * @buf: Pointer to base of buffer + * @len: Length of buffer + * + * Return: 0 on success, -1 on error (with errno set) + * + * #syscalls read + */ +int read_all_buf(int fd, void *buf, size_t len) +{ + size_t left = len; + char *p = buf; + + while (left) { + ssize_t rc; + + ASSERT(left <= len); + + do + rc = read(fd, p, left); + while ((rc < 0) && errno == EINTR); + + if (rc < 0) + return -1; + + p += rc; + left -= rc; + } + return 0; +} + +/** + * read_remainder() - Read the tail of an IO vector from a file descriptor + * @fd: File descriptor + * @iov: IO vector + * @cnt: Number of entries in @iov + * @skip: Number of bytes of the vector to skip reading + * + * Return: 0 on success, -1 on error (with errno set) + * + * Note: mode-specific seccomp profiles need to enable readv() to use this. + */ +int read_remainder(int fd, struct iovec *iov, size_t cnt, size_t skip) +{ + size_t i = 0, offset; + + while ((i += iov_skip_bytes(iov + i, cnt - i, skip, &offset)) < cnt) { + ssize_t rc; + + if (offset) { + ASSERT(offset < iov[i].iov_len); + /* Read the remainder of the partially read buffer */ + if (read_all_buf(fd, (char *)iov[i].iov_base + offset, + iov[i].iov_len - offset) < 0) + return -1; + i++; + } + + /* Fill as many of the remaining buffers as we can */ + rc = readv(fd, &iov[i], cnt - i); + if (rc < 0) + return -1; + + skip = rc; + } + return 0; +} + /** sockaddr_ntop() - Convert a socket address to text format * @sa: Socket address * @dst: output buffer, minimum SOCKADDR_STRLEN bytes diff --git a/util.h b/util.h index d02333d..73a7a33 100644 --- a/util.h +++ b/util.h @@ -203,6 +203,8 @@ int fls(unsigned long x); int write_file(const char *path, const char *buf); int write_all_buf(int fd, const void *buf, size_t len); int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip); +int read_all_buf(int fd, void *buf, size_t len); +int read_remainder(int fd, struct iovec *iov, size_t cnt, size_t skip); void close_open_files(int argc, char **argv); bool snprintf_check(char *str, size_t size, const char *format, ...); -- 2.43.0
On Tue, Jan 28, 2025 at 12:15:30AM +0100, Stefano Brivio wrote:
> These are symmetric to write_remainder() and write_all_buf() and
> almost a copy and paste of them, with the most notable differences
> being reversed reads/writes and a couple of better-safe-than-sorry
> asserts to keep Coverity happy.

So, there's one thing that needs to be not quite symmetric for the
read() version: we need to handle EOF. At present, I believe these will
enter an infinite loop on EOF, which is not a graceful failure mode.

> I'll use them in the next patch. At least for the moment, they're
> going to be used for vhost-user mode only, so I'm not unconditionally
> enabling readv() in the seccomp profile: the caller has to ensure it's
> there.
>
> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>

[...]

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
On Tue, 28 Jan 2025 11:59:28 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> So, there's one thing that needs to be not quite symmetric for the
> read() version: we need to handle EOF. At present, I believe these
> will enter an infinite loop on EOF, which is not a graceful failure
> mode.

It doesn't happen in our current usage where we close the socket once
we're done, but sure, if we use it for something else, boom. Let me add
a rc == 0 case (which gets EIO or EINVAL, I'm not sure yet).

Or feel free to re-post this if you have clearer ideas how to fix this
up (but only if tested).

-- 
Stefano
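One possible shape of that rc == 0 case, as a sketch only: it adds an
explicit EOF branch to the read_all_buf() from the patch above. The
errno value (EIO) is a placeholder, since the thread leaves the choice
between EIO and EINVAL open:

	/* Sketch: read_all_buf() with an explicit EOF case, roughly as
	 * discussed above. EIO is a placeholder errno.
	 */
	#include <errno.h>
	#include <unistd.h>

	int read_all_buf(int fd, void *buf, size_t len)
	{
		size_t left = len;
		char *p = buf;

		while (left) {
			ssize_t rc;

			do
				rc = read(fd, p, left);
			while ((rc < 0) && errno == EINTR);

			if (rc < 0)
				return -1;

			if (!rc) {	/* EOF before the buffer was filled */
				errno = EIO;
				return -1;
			}

			p += rc;
			left -= rc;
		}
		return 0;
	}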
On Tue, Jan 28, 2025 at 07:48:49AM +0100, Stefano Brivio wrote:
> On Tue, 28 Jan 2025 11:59:28 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>
> > So, there's one thing that needs to be not quite symmetric for the
> > read() version: we need to handle EOF. At present, I believe these
> > will enter an infinite loop on EOF, which is not a graceful failure
> > mode.
>
> It doesn't happen in our current usage where we close the socket once
> we're done,

I don't see how what we do with the socket is relevant. Couldn't we hit
this case if qemu unexpectedly closed the socket or died?

> but sure, if we use it for something else, boom. Let me add a rc == 0
> case (which gets EIO or EINVAL, I'm not sure yet).
>
> Or feel free to re-post this if you have clearer ideas how to fix this
> up (but only if tested).

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
On Wed, 29 Jan 2025 12:03:30 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> I don't see how what we do with the socket is relevant. Couldn't we
> hit this case if qemu unexpectedly closed the socket or died?

Yes, sure. I just mentioned that it's not the intended usage, and
rather an error case we need to handle.

-- 
Stefano
On Wed, Jan 29, 2025 at 08:33:47AM +0100, Stefano Brivio wrote:
> On Wed, 29 Jan 2025 12:03:30 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>
> > I don't see how what we do with the socket is relevant. Couldn't we
> > hit this case if qemu unexpectedly closed the socket or died?
>
> Yes, sure. I just mentioned that it's not the intended usage, and
> rather an error case we need to handle.

Oh, sure, no argument there.

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
Add two sets (source or target) of three functions each for passt in vhost-user mode, triggered by activity on the file descriptor passed via VHOST_USER_PROTOCOL_F_DEVICE_STATE: - migrate_source_pre() and migrate_target_pre() are called to prepare for migration, before data is transferred - migrate_source() sends, and migrate_target() receives migration data - migrate_source_post() and migrate_target_post() are responsible for any post-migration task Callbacks are added to these functions with arrays of function pointers in migrate.c. Migration handlers are versioned. Versioned descriptions of data sections will be added to the data_versions array, which points to versioned iovec arrays. Version 1 is currently empty and will be filled in in subsequent patches. The source announces the data version to be used and informs the peer about endianness, and the size of void *, time_t, flow entries and flow hash table entries. The target checks if the version of the source is still supported. If it's not, it aborts the migration. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- Makefile | 12 +-- migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++ migrate.h | 90 ++++++++++++++++++ passt.c | 2 +- vu_common.c | 122 ++++++++++++++++--------- vu_common.h | 2 +- 6 files changed, 438 insertions(+), 49 deletions(-) create mode 100644 migrate.c create mode 100644 migrate.h diff --git a/Makefile b/Makefile index 464eef1..1383875 100644 --- a/Makefile +++ b/Makefile @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS) PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \ - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \ - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \ + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ vhost_user.c virtio.c vu_common.c QRAP_SRCS = qrap.c SRCS = $(PASST_SRCS) $(QRAP_SRCS) @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \ - virtio.h vu_common.h + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \ + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \ + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \ + vhost_user.h virtio.h vu_common.h HEADERS = $(PASST_HEADERS) seccomp.h C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);} diff --git a/migrate.c b/migrate.c new file mode 100644 index 0000000..bee9653 --- /dev/null +++ b/migrate.c @@ -0,0 +1,259 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +/* PASST - Plug A Simple Socket Transport + * for qemu/UNIX domain socket mode + * + * PASTA - Pack A Subtle Tap Abstraction + * for network namespace/tap device mode + * + * migrate.c - Migration sections, layout, and routines + * + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio <sbrivio(a)redhat.com> + */ + +#include <errno.h> +#include <sys/uio.h> + +#include "util.h" +#include "ip.h" +#include "passt.h" +#include "inany.h" +#include "flow.h" +#include 
"flow_table.h" + +#include "migrate.h" + +/* Current version of migration data */ +#define MIGRATE_VERSION 1 + +/* Magic as we see it and as seen with reverse endianness */ +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0 +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1 + +/* Migration header to send from source */ +static union migrate_header header = { + .magic = MIGRATE_MAGIC, + .version = htonl_constant(MIGRATE_VERSION), + .time_t_size = htonl_constant(sizeof(time_t)), + .flow_size = htonl_constant(sizeof(union flow)), + .flow_sidx_size = htonl_constant(sizeof(struct flow_sidx)), + .voidp_size = htonl_constant(sizeof(void *)), +}; + +/* Data sections for version 1 */ +static struct iovec sections_v1[] = { + { &header, sizeof(header) }, +}; + +/* Set of data versions */ +static struct migrate_data data_versions[] = { + { + 1, sections_v1, + }, + { 0 }, +}; + +/* Handlers to call in source before sending data */ +struct migrate_handler handlers_source_pre[] = { + { 0 }, +}; + +/* Handlers to call in source after sending data */ +struct migrate_handler handlers_source_post[] = { + { 0 }, +}; + +/* Handlers to call in target before receiving data with version 1 */ +struct migrate_handler handlers_target_pre_v1[] = { + { 0 }, +}; + +/* Handlers to call in target after receiving data with version 1 */ +struct migrate_handler handlers_target_post_v1[] = { + { 0 }, +}; + +/* Versioned sets of migration handlers */ +struct migrate_target_handlers target_handlers[] = { + { + 1, + handlers_target_pre_v1, + handlers_target_post_v1, + }, + { 0 }, +}; + +/** + * migrate_source_pre() - Pre-migration tasks as source + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +int migrate_source_pre(struct migrate_meta *m) +{ + struct migrate_handler *h; + + for (h = handlers_source_pre; h->fn; h++) { + int rc; + + if ((rc = h->fn(m, h->data))) + return rc; + } + + return 0; +} + +/** + * migrate_source() - Perform migration as source: send state to hypervisor + * @fd: Descriptor for state transfer + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +int migrate_source(int fd, const struct migrate_meta *m) +{ + static struct migrate_data *d; + unsigned count; + int rc; + + for (d = data_versions; d->v != MIGRATE_VERSION; d++); + + for (count = 0; d->sections[count].iov_len; count++); + + debug("Writing %u migration sections", count - 1 /* minus header */); + rc = write_remainder(fd, d->sections, count, 0); + if (rc < 0) + return errno; + + return 0; +} + +/** + * migrate_source_post() - Post-migration tasks as source + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +void migrate_source_post(struct migrate_meta *m) +{ + struct migrate_handler *h; + + for (h = handlers_source_post; h->fn; h++) + h->fn(m, h->data); +} + +/** + * migrate_target_read_header() - Set metadata in target from source header + * @fd: Descriptor for state transfer + * @m: Migration metadata, filled on return + * + * Return: 0 on success, error code on failure + */ +int migrate_target_read_header(int fd, struct migrate_meta *m) +{ + static struct migrate_data *d; + union migrate_header h; + + if (read_all_buf(fd, &h, sizeof(h))) + return errno; + + debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u", + h.magic, ntohl(h.voidp_size), ntohl(h.version)); + + for (d = data_versions; d->v != ntohl(h.version); d++); + if (!d->v) + return ENOTSUP; + m->v = d->v; + + if (h.magic == MIGRATE_MAGIC) + m->bswap = false; + else if (h.magic == 
MIGRATE_MAGIC_SWAPPED) + m->bswap = true; + else + return ENOTSUP; + + if (ntohl(h.voidp_size) == 4) + m->source_64b = false; + else if (ntohl(h.voidp_size) == 8) + m->source_64b = true; + else + return ENOTSUP; + + if (ntohl(h.time_t_size) == 4) + m->time_64b = false; + else if (ntohl(h.time_t_size) == 8) + m->time_64b = true; + else + return ENOTSUP; + + m->flow_size = ntohl(h.flow_size); + m->flow_sidx_size = ntohl(h.flow_sidx_size); + + return 0; +} + +/** + * migrate_target_pre() - Pre-migration tasks as target + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +int migrate_target_pre(struct migrate_meta *m) +{ + struct migrate_target_handlers *th; + struct migrate_handler *h; + + for (th = target_handlers; th->v != m->v && th->v; th++); + + for (h = th->pre; h->fn; h++) { + int rc; + + if ((rc = h->fn(m, h->data))) + return rc; + } + + return 0; +} + +/** + * migrate_target() - Perform migration as target: receive state from hypervisor + * @fd: Descriptor for state transfer + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + * + * #syscalls:vu readv + */ +int migrate_target(int fd, const struct migrate_meta *m) +{ + static struct migrate_data *d; + unsigned cnt; + int rc; + + for (d = data_versions; d->v != m->v && d->v; d++); + + for (cnt = 0; d->sections[cnt + 1 /* skip header */].iov_len; cnt++); + + debug("Reading %u migration sections", cnt); + rc = read_remainder(fd, d->sections + 1, cnt, 0); + if (rc < 0) + return errno; + + return 0; +} + +/** + * migrate_target_post() - Post-migration tasks as target + * @m: Migration metadata + */ +void migrate_target_post(struct migrate_meta *m) +{ + struct migrate_target_handlers *th; + struct migrate_handler *h; + + for (th = target_handlers; th->v != m->v && th->v; th++); + + for (h = th->post; h->fn; h++) + h->fn(m, h->data); +} diff --git a/migrate.h b/migrate.h new file mode 100644 index 0000000..5582f75 --- /dev/null +++ b/migrate.h @@ -0,0 +1,90 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio <sbrivio(a)redhat.com> + */ + +#ifndef MIGRATE_H +#define MIGRATE_H + +/** + * struct migrate_meta - Migration metadata + * @v: Chosen migration data version, host order + * @bswap: Source has opposite endianness + * @peer_64b: Source uses 64-bit void * + * @time_64b: Source uses 64-bit time_t + * @flow_size: Size of union flow in source + * @flow_sidx_size: Size of struct flow_sidx in source + */ +struct migrate_meta { + uint32_t v; + bool bswap; + bool source_64b; + bool time_64b; + size_t flow_size; + size_t flow_sidx_size; +}; + +/** + * union migrate_header - Migration header from source + * @magic: 0xB1BB1D1B0BB1D1B0, host order + * @version: Source sends highest known, target aborts if unsupported + * @voidp_size: sizeof(void *), network order + * @time_t_size: sizeof(time_t), network order + * @flow_size: sizeof(union flow), network order + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order + * @unused: Go figure + */ +union migrate_header { + struct { + uint64_t magic; + uint32_t version; + uint32_t voidp_size; + uint32_t time_t_size; + uint32_t flow_size; + uint32_t flow_sidx_size; + }; + uint8_t unused[65536]; +}; + +/** + * struct migrate_data - Data sections for given source version + * @v: Source version this applies to, host order + * @sections: Array of data sections, NULL-terminated + */ +struct migrate_data { + uint32_t v; + struct iovec *sections; +}; + +/** + * struct migrate_handler - 
Function to handle a specific data section + * @fn: Function pointer taking pointer to data section + * @data: Associated data section + */ +struct migrate_handler { + int (*fn)(struct migrate_meta *m, void *data); + void *data; +}; + +/** + * struct migrate_target_handlers - Versioned sets of migration target handlers + * @v: Source version this applies to, host order + * @pre: Set of functions to execute in target before data copy + * @post: Set of functions to execute in target after data copy + */ +struct migrate_target_handlers { + uint32_t v; + struct migrate_handler *pre; + struct migrate_handler *post; +}; + +int migrate_source_pre(struct migrate_meta *m); +int migrate_source(int fd, const struct migrate_meta *m); +void migrate_source_post(struct migrate_meta *m); + +int migrate_target_read_header(int fd, struct migrate_meta *m); +int migrate_target_pre(struct migrate_meta *m); +int migrate_target(int fd, const struct migrate_meta *m); +void migrate_target_post(struct migrate_meta *m); + +#endif /* MIGRATE_H */ diff --git a/passt.c b/passt.c index b1c8ab6..184d4e5 100644 --- a/passt.c +++ b/passt.c @@ -358,7 +358,7 @@ loop: vu_kick_cb(c.vdev, ref, &now); break; case EPOLL_TYPE_VHOST_MIGRATION: - vu_migrate(c.vdev, eventmask); + vu_migrate(&c, eventmask); break; default: /* Can't happen */ diff --git a/vu_common.c b/vu_common.c index f43d8ac..0c67bd0 100644 --- a/vu_common.c +++ b/vu_common.c @@ -5,6 +5,7 @@ * common_vu.c - vhost-user common UDP and TCP functions */ +#include <errno.h> #include <unistd.h> #include <sys/uio.h> #include <sys/eventfd.h> @@ -17,6 +18,7 @@ #include "vhost_user.h" #include "pcap.h" #include "vu_common.h" +#include "migrate.h" #define VU_MAX_TX_BUFFER_NB 2 @@ -305,50 +307,88 @@ err: } /** - * vu_migrate() - Send/receive passt insternal state to/from QEMU - * @vdev: vhost-user device + * vu_migrate_source() - Migration as source, send state to hypervisor + * @fd: File descriptor for state transfer + * + * Return: 0 on success, positive error code on failure + */ +static int vu_migrate_source(int fd) +{ + struct migrate_meta m; + int rc; + + if ((rc = migrate_source_pre(&m))) { + err("Source pre-migration failed: %s, abort", strerror_(rc)); + return rc; + } + + debug("Saving backend state"); + + rc = migrate_source(fd, &m); + if (rc) + err("Source migration failed: %s", strerror_(rc)); + else + migrate_source_post(&m); + + return rc; +} + +/** + * vu_migrate_target() - Migration as target, receive state from hypervisor + * @fd: File descriptor for state transfer + * + * Return: 0 on success, positive error code on failure + */ +static int vu_migrate_target(int fd) +{ + struct migrate_meta m; + int rc; + + rc = migrate_target_read_header(fd, &m); + if (rc) { + err("Migration header check failed: %s, abort", strerror_(rc)); + return rc; + } + + if ((rc = migrate_target_pre(&m))) { + err("Target pre-migration failed: %s, abort", strerror_(rc)); + return rc; + } + + debug("Loading backend state"); + + rc = migrate_target(fd, &m); + if (rc) + err("Target migration failed: %s", strerror_(rc)); + else + migrate_target_post(&m); + + return rc; +} + +/** + * vu_migrate() - Send/receive passt internal state to/from QEMU + * @c: Execution context * @events: epoll events */ -void vu_migrate(struct vu_dev *vdev, uint32_t events) +void vu_migrate(struct ctx *c, uint32_t events) { - int ret; + struct vu_dev *vdev = c->vdev; + int rc = EIO; - /* TODO: collect/set passt internal state - * and use vdev->device_state_fd to send/receive it - */ debug("vu_migrate fd %d events 
%x", vdev->device_state_fd, events); - if (events & EPOLLOUT) { - debug("Saving backend state"); - - /* send some stuff */ - ret = write(vdev->device_state_fd, "PASST", 6); - /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */ - vdev->device_state_result = ret == -1 ? -1 : 0; - /* Closing the file descriptor signals the end of transfer */ - epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, - vdev->device_state_fd, NULL); - close(vdev->device_state_fd); - vdev->device_state_fd = -1; - } else if (events & EPOLLIN) { - char buf[6]; - - debug("Loading backend state"); - /* read some stuff */ - ret = read(vdev->device_state_fd, buf, sizeof(buf)); - /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */ - if (ret != sizeof(buf)) { - vdev->device_state_result = -1; - } else { - ret = strncmp(buf, "PASST", sizeof(buf)); - vdev->device_state_result = ret == 0 ? 0 : -1; - } - } else if (events & EPOLLHUP) { - debug("Closing migration channel"); - - /* The end of file signals the end of the transfer. */ - epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, - vdev->device_state_fd, NULL); - close(vdev->device_state_fd); - vdev->device_state_fd = -1; - } + + if (events & EPOLLOUT) + rc = vu_migrate_source(vdev->device_state_fd); + else if (events & EPOLLIN) + rc = vu_migrate_target(vdev->device_state_fd); + + /* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */ + + vdev->device_state_result = rc; + + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL); + debug("Closing migration channel"); + close(vdev->device_state_fd); + vdev->device_state_fd = -1; } diff --git a/vu_common.h b/vu_common.h index d56c021..69c4006 100644 --- a/vu_common.h +++ b/vu_common.h @@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq, void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref, const struct timespec *now); int vu_send_single(const struct ctx *c, const void *buf, size_t size); -void vu_migrate(struct vu_dev *vdev, uint32_t events); +void vu_migrate(struct ctx *c, uint32_t events); #endif /* VU_COMMON_H */ -- 2.43.0
On Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote:
> Add two sets (source or target) of three functions each for passt in
> vhost-user mode, triggered by activity on the file descriptor passed
> via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
>
> - migrate_source_pre() and migrate_target_pre() are called to prepare
>   for migration, before data is transferred
>
> - migrate_source() sends, and migrate_target() receives migration data
>
> - migrate_source_post() and migrate_target_post() are responsible for
>   any post-migration task
>
> Callbacks are added to these functions with arrays of function
> pointers in migrate.c. Migration handlers are versioned.
>
> Versioned descriptions of data sections will be added to the
> data_versions array, which points to versioned iovec arrays. Version 1
> is currently empty and will be filled in in subsequent patches.
>
> The source announces the data version to be used and informs the peer
> about endianness, and the size of void *, time_t, flow entries and
> flow hash table entries.
>
> The target checks if the version of the source is still supported. If
> it's not, it aborts the migration.
>
> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>

[...]

> +/* Magic as we see it and as seen with reverse endianness */
> +#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
> +#define MIGRATE_MAGIC_SWAPPED	0xB0D1B10B1B1DBBB1

As noted, I'm hoping we can get rid of "either endian" migration. But
if this stays, we should define it using __bswap_constant_32() to avoid
embarrassing mistakes.

[...]

> +int migrate_source(int fd, const struct migrate_meta *m)
> +{
> +	static struct migrate_data *d;
> +	unsigned count;
> +	int rc;
> +
> +	for (d = data_versions; d->v != MIGRATE_VERSION; d++);

Should ASSERT() if we don't find the version within the array.

[...]

> +void migrate_source_post(struct migrate_meta *m)
> +{
> +	struct migrate_handler *h;
> +
> +	for (h = handlers_source_post; h->fn; h++)
> +		h->fn(m, h->data);

Is there actually anything we might need to do on the source after a
successful migration, other than exit?

[...]

> + * migrate_target_read_header() - Set metadata in target from source header
> + * @fd:		Descriptor for state transfer
> + * @m:		Migration metadata, filled on return
> + *
> + * Return: 0 on success, error code on failure

We nearly always use negative error codes. Why not here?

[...]

> +	for (d = data_versions; d->v != ntohl(h.version); d++);
> +	if (!d->v)
> +		return ENOTSUP;

This is too late. The loop doesn't check it, so you've already overrun
the data_versions table if the version wasn't in there. Easier to use
an ARRAY_SIZE() limit in the loop, I think.

[...]

> +union migrate_header {
> +	struct {
> +		uint64_t magic;
> +		uint32_t version;
> +		uint32_t voidp_size;
> +		uint32_t time_t_size;
> +		uint32_t flow_size;
> +		uint32_t flow_sidx_size;
> +	};
> +	uint8_t unused[65536];

So, having looked at this, I no longer think padding the header to
64kiB is a good idea. The structure means we're basically stuck always
having that chunky header.

Instead, I think the header should be absolutely minimal: basically
magic and version only. v1 (and maybe others) can add a "metadata" or
whatever section for additional information like this they need.

[...]

> +	rc = migrate_source(fd, &m);
> +	if (rc)
> +		err("Source migration failed: %s", strerror_(rc));
> +	else
> +		migrate_source_post(&m);
> +
> +	return rc;

After a successful source migration shouldn't we exit, or at least
quiesce ourselves so
we don't accidentally mess with anything the target is now doing?+} + +/** + * vu_migrate_target() - Migration as target, receive state from hypervisor + * @fd: File descriptor for state transfer + * + * Return: 0 on success, positive error code on failure + */ +static int vu_migrate_target(int fd) +{ + struct migrate_meta m; + int rc; + + rc = migrate_target_read_header(fd, &m); + if (rc) { + err("Migration header check failed: %s, abort", strerror_(rc)); + return rc; + } + + if ((rc = migrate_target_pre(&m))) { + err("Target pre-migration failed: %s, abort", strerror_(rc)); + return rc; + } + + debug("Loading backend state"); + + rc = migrate_target(fd, &m); + if (rc) + err("Target migration failed: %s", strerror_(rc)); + else + migrate_target_post(&m); + + return rc; +} + +/** + * vu_migrate() - Send/receive passt internal state to/from QEMU + * @c: Execution context * @events: epoll events */ -void vu_migrate(struct vu_dev *vdev, uint32_t events) +void vu_migrate(struct ctx *c, uint32_t events) { - int ret; + struct vu_dev *vdev = c->vdev; + int rc = EIO; - /* TODO: collect/set passt internal state - * and use vdev->device_state_fd to send/receive it - */ debug("vu_migrate fd %d events %x", vdev->device_state_fd, events); - if (events & EPOLLOUT) { - debug("Saving backend state"); - - /* send some stuff */ - ret = write(vdev->device_state_fd, "PASST", 6); - /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */ - vdev->device_state_result = ret == -1 ? -1 : 0; - /* Closing the file descriptor signals the end of transfer */ - epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, - vdev->device_state_fd, NULL); - close(vdev->device_state_fd); - vdev->device_state_fd = -1; - } else if (events & EPOLLIN) { - char buf[6]; - - debug("Loading backend state"); - /* read some stuff */ - ret = read(vdev->device_state_fd, buf, sizeof(buf)); - /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */ - if (ret != sizeof(buf)) { - vdev->device_state_result = -1; - } else { - ret = strncmp(buf, "PASST", sizeof(buf)); - vdev->device_state_result = ret == 0 ? 0 : -1; - } - } else if (events & EPOLLHUP) { - debug("Closing migration channel"); - - /* The end of file signals the end of the transfer. */ - epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, - vdev->device_state_fd, NULL); - close(vdev->device_state_fd); - vdev->device_state_fd = -1; - } + + if (events & EPOLLOUT) + rc = vu_migrate_source(vdev->device_state_fd); + else if (events & EPOLLIN) + rc = vu_migrate_target(vdev->device_state_fd); + + /* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */ + + vdev->device_state_result = rc; + + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL); + debug("Closing migration channel"); + close(vdev->device_state_fd); + vdev->device_state_fd = -1; } diff --git a/vu_common.h b/vu_common.h index d56c021..69c4006 100644 --- a/vu_common.h +++ b/vu_common.h @@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq, void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref, const struct timespec *now); int vu_send_single(const struct ctx *c, const void *buf, size_t size); -void vu_migrate(struct vu_dev *vdev, uint32_t events); +void vu_migrate(struct ctx *c, uint32_t events); #endif /* VU_COMMON_H */-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
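As an aside on the data_versions lookup discussed above: a bounded lookup, roughly along the lines suggested, could look like the sketch below. This assumes the usual ARRAY_SIZE() macro (sizeof(x) / sizeof((x)[0])) and the data_versions[] layout from the patch; it's only an illustration, not tested code.

static struct migrate_data *migrate_data_lookup(uint32_t v)
{
	unsigned i;

	/* Bound the scan so an unknown version can't run off the table,
	 * instead of relying on the terminating { 0 } entry alone */
	for (i = 0; i < ARRAY_SIZE(data_versions); i++) {
		if (data_versions[i].v == v)
			return &data_versions[i];
	}

	return NULL;	/* unsupported version: caller reports ENOTSUP */
}

The same helper could back migrate_source(), migrate_target_read_header() and migrate_target(), so the termination check lives in one place.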
On Tue, 28 Jan 2025 12:40:12 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote:Those always give me issues on musl, so I'd rather test things on big-endian and realise it's actually 0xB0D1B1B01B1DBBB1 (0x0b bitswap). Feel free to post a different proposal if tested.Add two sets (source or target) of three functions each for passt in vhost-user mode, triggered by activity on the file descriptor passed via VHOST_USER_PROTOCOL_F_DEVICE_STATE: - migrate_source_pre() and migrate_target_pre() are called to prepare for migration, before data is transferred - migrate_source() sends, and migrate_target() receives migration data - migrate_source_post() and migrate_target_post() are responsible for any post-migration task Callbacks are added to these functions with arrays of function pointers in migrate.c. Migration handlers are versioned. Versioned descriptions of data sections will be added to the data_versions array, which points to versioned iovec arrays. Version 1 is currently empty and will be filled in in subsequent patches. The source announces the data version to be used and informs the peer about endianness, and the size of void *, time_t, flow entries and flow hash table entries. The target checks if the version of the source is still supported. If it's not, it aborts the migration. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- Makefile | 12 +-- migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++ migrate.h | 90 ++++++++++++++++++ passt.c | 2 +- vu_common.c | 122 ++++++++++++++++--------- vu_common.h | 2 +- 6 files changed, 438 insertions(+), 49 deletions(-) create mode 100644 migrate.c create mode 100644 migrate.h diff --git a/Makefile b/Makefile index 464eef1..1383875 100644 --- a/Makefile +++ b/Makefile @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS) PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \ - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \ - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \ + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ vhost_user.c virtio.c vu_common.c QRAP_SRCS = qrap.c SRCS = $(PASST_SRCS) $(QRAP_SRCS) @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \ - virtio.h vu_common.h + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \ + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \ + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \ + vhost_user.h virtio.h vu_common.h HEADERS = $(PASST_HEADERS) seccomp.h C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);} diff --git a/migrate.c b/migrate.c new file mode 100644 index 0000000..bee9653 --- /dev/null +++ b/migrate.c @@ -0,0 +1,259 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +/* PASST - Plug A Simple Socket Transport + * for qemu/UNIX domain socket mode + * + * PASTA - Pack A Subtle Tap Abstraction + * for network 
namespace/tap device mode + * + * migrate.c - Migration sections, layout, and routines + * + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio <sbrivio(a)redhat.com> + */ + +#include <errno.h> +#include <sys/uio.h> + +#include "util.h" +#include "ip.h" +#include "passt.h" +#include "inany.h" +#include "flow.h" +#include "flow_table.h" + +#include "migrate.h" + +/* Current version of migration data */ +#define MIGRATE_VERSION 1 + +/* Magic as we see it and as seen with reverse endianness */ +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0 +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1As noted, I'm hoping we can get rid of "either endian" migration. But if this stays, we should define it using __bswap_constant_32() to avoid embarrassing mistakes.This looks a bit unnecessary, MIGRATE_VERSION is defined just above... it's just a readability killer to me.+ +/* Migration header to send from source */ +static union migrate_header header = { + .magic = MIGRATE_MAGIC, + .version = htonl_constant(MIGRATE_VERSION), + .time_t_size = htonl_constant(sizeof(time_t)), + .flow_size = htonl_constant(sizeof(union flow)), + .flow_sidx_size = htonl_constant(sizeof(struct flow_sidx)), + .voidp_size = htonl_constant(sizeof(void *)), +}; + +/* Data sections for version 1 */ +static struct iovec sections_v1[] = { + { &header, sizeof(header) }, +}; + +/* Set of data versions */ +static struct migrate_data data_versions[] = { + { + 1, sections_v1, + }, + { 0 }, +}; + +/* Handlers to call in source before sending data */ +struct migrate_handler handlers_source_pre[] = { + { 0 }, +}; + +/* Handlers to call in source after sending data */ +struct migrate_handler handlers_source_post[] = { + { 0 }, +}; + +/* Handlers to call in target before receiving data with version 1 */ +struct migrate_handler handlers_target_pre_v1[] = { + { 0 }, +}; + +/* Handlers to call in target after receiving data with version 1 */ +struct migrate_handler handlers_target_post_v1[] = { + { 0 }, +}; + +/* Versioned sets of migration handlers */ +struct migrate_target_handlers target_handlers[] = { + { + 1, + handlers_target_pre_v1, + handlers_target_post_v1, + }, + { 0 }, +}; + +/** + * migrate_source_pre() - Pre-migration tasks as source + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +int migrate_source_pre(struct migrate_meta *m) +{ + struct migrate_handler *h; + + for (h = handlers_source_pre; h->fn; h++) { + int rc; + + if ((rc = h->fn(m, h->data))) + return rc; + } + + return 0; +} + +/** + * migrate_source() - Perform migration as source: send state to hypervisor + * @fd: Descriptor for state transfer + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +int migrate_source(int fd, const struct migrate_meta *m) +{ + static struct migrate_data *d; + unsigned count; + int rc; + + for (d = data_versions; d->v != MIGRATE_VERSION; d++);Should ASSERT() if we don't find the version within the array.We might want to log a couple of things, which would warrant these handlers. But let's say we need to do something *similar* to "updating the network" such as the RARP announcement that QEMU is requesting (this is intended for OVN-Kubernetes, so go figure), or that we need a workaround for a kernel issue with implicit close() with TCP_REPAIR on... 
I would leave this in for completeness.+ for (count = 0; d->sections[count].iov_len; count++); + + debug("Writing %u migration sections", count - 1 /* minus header */); + rc = write_remainder(fd, d->sections, count, 0); + if (rc < 0) + return errno; + + return 0; +} + +/** + * migrate_source_post() - Post-migration tasks as source + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +void migrate_source_post(struct migrate_meta *m) +{ + struct migrate_handler *h; + + for (h = handlers_source_post; h->fn; h++) + h->fn(m, h->data);Is there actually anything we might need to do on the source after a successful migration, other than exit?Because the reply to VHOST_USER_SET_DEVICE_STATE_FD is unsigned: https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#front-end-messa… and I want to keep this consistent/untranslated.+} + +/** + * migrate_target_read_header() - Set metadata in target from source header + * @fd: Descriptor for state transfer + * @m: Migration metadata, filled on return + * + * Return: 0 on success, error code on failureWe nearly always use negative error codes. Why not here?Ah, yes, I forgot the '&& d->v' part (see migrate_target()).+ */ +int migrate_target_read_header(int fd, struct migrate_meta *m) +{ + static struct migrate_data *d; + union migrate_header h; + + if (read_all_buf(fd, &h, sizeof(h))) + return errno; + + debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u", + h.magic, ntohl(h.voidp_size), ntohl(h.version)); + + for (d = data_versions; d->v != ntohl(h.version); d++); + if (!d->v) + return ENOTSUP;This is too late. The loop doesn't check it, so you've already overrun the data_versions table if the version wasn't in there.Easier to use an ARRAY_SIZE() limit in the loop, I think.I'd rather keep that as a one-liner, and NULL-terminate the arrays.The header is processed by the target in a separate, preliminary step, though. That's why I added metadata right in the header: if the target needs to abort the migration because, say, the size of a flow entry is too big to handle for a particular version, then we should know that before migrate_target_pre(). As long as we check the version first, we can always shrink the header later on. But having 64 KiB reserved looks more robust because it's a safe place to add this kind of metadata. 
Note that 64 KiB is typically transferred in a single read/write from/to the vhost-user back-end.+ m->v = d->v; + + if (h.magic == MIGRATE_MAGIC) + m->bswap = false; + else if (h.magic == MIGRATE_MAGIC_SWAPPED) + m->bswap = true; + else + return ENOTSUP; + + if (ntohl(h.voidp_size) == 4) + m->source_64b = false; + else if (ntohl(h.voidp_size) == 8) + m->source_64b = true; + else + return ENOTSUP; + + if (ntohl(h.time_t_size) == 4) + m->time_64b = false; + else if (ntohl(h.time_t_size) == 8) + m->time_64b = true; + else + return ENOTSUP; + + m->flow_size = ntohl(h.flow_size); + m->flow_sidx_size = ntohl(h.flow_sidx_size); + + return 0; +} + +/** + * migrate_target_pre() - Pre-migration tasks as target + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +int migrate_target_pre(struct migrate_meta *m) +{ + struct migrate_target_handlers *th; + struct migrate_handler *h; + + for (th = target_handlers; th->v != m->v && th->v; th++); + + for (h = th->pre; h->fn; h++) { + int rc; + + if ((rc = h->fn(m, h->data))) + return rc; + } + + return 0; +} + +/** + * migrate_target() - Perform migration as target: receive state from hypervisor + * @fd: Descriptor for state transfer + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + * + * #syscalls:vu readv + */ +int migrate_target(int fd, const struct migrate_meta *m) +{ + static struct migrate_data *d; + unsigned cnt; + int rc; + + for (d = data_versions; d->v != m->v && d->v; d++); + + for (cnt = 0; d->sections[cnt + 1 /* skip header */].iov_len; cnt++); + + debug("Reading %u migration sections", cnt); + rc = read_remainder(fd, d->sections + 1, cnt, 0); + if (rc < 0) + return errno; + + return 0; +} + +/** + * migrate_target_post() - Post-migration tasks as target + * @m: Migration metadata + */ +void migrate_target_post(struct migrate_meta *m) +{ + struct migrate_target_handlers *th; + struct migrate_handler *h; + + for (th = target_handlers; th->v != m->v && th->v; th++); + + for (h = th->post; h->fn; h++) + h->fn(m, h->data); +} diff --git a/migrate.h b/migrate.h new file mode 100644 index 0000000..5582f75 --- /dev/null +++ b/migrate.h @@ -0,0 +1,90 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio <sbrivio(a)redhat.com> + */ + +#ifndef MIGRATE_H +#define MIGRATE_H + +/** + * struct migrate_meta - Migration metadata + * @v: Chosen migration data version, host order + * @bswap: Source has opposite endianness + * @peer_64b: Source uses 64-bit void * + * @time_64b: Source uses 64-bit time_t + * @flow_size: Size of union flow in source + * @flow_sidx_size: Size of struct flow_sidx in source + */ +struct migrate_meta { + uint32_t v; + bool bswap; + bool source_64b; + bool time_64b; + size_t flow_size; + size_t flow_sidx_size; +}; + +/** + * union migrate_header - Migration header from source + * @magic: 0xB1BB1D1B0BB1D1B0, host order + * @version: Source sends highest known, target aborts if unsupported + * @voidp_size: sizeof(void *), network order + * @time_t_size: sizeof(time_t), network order + * @flow_size: sizeof(union flow), network order + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order + * @unused: Go figure + */ +union migrate_header { + struct { + uint64_t magic; + uint32_t version; + uint32_t voidp_size; + uint32_t time_t_size; + uint32_t flow_size; + uint32_t flow_sidx_size; + }; + uint8_t unused[65536];So, having looked at this, I no longer think padding the header to 64kiB is a good idea. 
The structure means we're basically stuck always having that chunky header. Instead, I think the header should be absolutely minimal: basically magic and version only. v1 (and maybe others) can add a "metadata" or whatever section for additional information like this they need.Maybe, yes. Pending TCP connections should be safe because with TCP_REPAIR they're already quiesced, but we don't close listening sockets (yet). Perhaps a reasonable approach for the moment would be to declare a single migrate_source_post handler logging a info() message and exiting.+}; + +/** + * struct migrate_data - Data sections for given source version + * @v: Source version this applies to, host order + * @sections: Array of data sections, NULL-terminated + */ +struct migrate_data { + uint32_t v; + struct iovec *sections; +}; + +/** + * struct migrate_handler - Function to handle a specific data section + * @fn: Function pointer taking pointer to data section + * @data: Associated data section + */ +struct migrate_handler { + int (*fn)(struct migrate_meta *m, void *data); + void *data; +}; + +/** + * struct migrate_target_handlers - Versioned sets of migration target handlers + * @v: Source version this applies to, host order + * @pre: Set of functions to execute in target before data copy + * @post: Set of functions to execute in target after data copy + */ +struct migrate_target_handlers { + uint32_t v; + struct migrate_handler *pre; + struct migrate_handler *post; +}; + +int migrate_source_pre(struct migrate_meta *m); +int migrate_source(int fd, const struct migrate_meta *m); +void migrate_source_post(struct migrate_meta *m); + +int migrate_target_read_header(int fd, struct migrate_meta *m); +int migrate_target_pre(struct migrate_meta *m); +int migrate_target(int fd, const struct migrate_meta *m); +void migrate_target_post(struct migrate_meta *m); + +#endif /* MIGRATE_H */ diff --git a/passt.c b/passt.c index b1c8ab6..184d4e5 100644 --- a/passt.c +++ b/passt.c @@ -358,7 +358,7 @@ loop: vu_kick_cb(c.vdev, ref, &now); break; case EPOLL_TYPE_VHOST_MIGRATION: - vu_migrate(c.vdev, eventmask); + vu_migrate(&c, eventmask); break; default: /* Can't happen */ diff --git a/vu_common.c b/vu_common.c index f43d8ac..0c67bd0 100644 --- a/vu_common.c +++ b/vu_common.c @@ -5,6 +5,7 @@ * common_vu.c - vhost-user common UDP and TCP functions */ +#include <errno.h> #include <unistd.h> #include <sys/uio.h> #include <sys/eventfd.h> @@ -17,6 +18,7 @@ #include "vhost_user.h" #include "pcap.h" #include "vu_common.h" +#include "migrate.h" #define VU_MAX_TX_BUFFER_NB 2 @@ -305,50 +307,88 @@ err: } /** - * vu_migrate() - Send/receive passt insternal state to/from QEMU - * @vdev: vhost-user device + * vu_migrate_source() - Migration as source, send state to hypervisor + * @fd: File descriptor for state transfer + * + * Return: 0 on success, positive error code on failure + */ +static int vu_migrate_source(int fd) +{ + struct migrate_meta m; + int rc; + + if ((rc = migrate_source_pre(&m))) { + err("Source pre-migration failed: %s, abort", strerror_(rc)); + return rc; + } + + debug("Saving backend state"); + + rc = migrate_source(fd, &m); + if (rc) + err("Source migration failed: %s", strerror_(rc)); + else + migrate_source_post(&m); + + return rc;After a successful source migration shouldn't we exit, or at least quiesce ourselves so we don't accidentally mess with anything the target is now doing?-- Stefano+} + +/** + * vu_migrate_target() - Migration as target, receive state from hypervisor + * @fd: File descriptor for state transfer + 
* + * Return: 0 on success, positive error code on failure + */ +static int vu_migrate_target(int fd) +{ + struct migrate_meta m; + int rc; + + rc = migrate_target_read_header(fd, &m); + if (rc) { + err("Migration header check failed: %s, abort", strerror_(rc)); + return rc; + } + + if ((rc = migrate_target_pre(&m))) { + err("Target pre-migration failed: %s, abort", strerror_(rc)); + return rc; + } + + debug("Loading backend state"); + + rc = migrate_target(fd, &m); + if (rc) + err("Target migration failed: %s", strerror_(rc)); + else + migrate_target_post(&m); + + return rc; +} + +/** + * vu_migrate() - Send/receive passt internal state to/from QEMU + * @c: Execution context * @events: epoll events */ -void vu_migrate(struct vu_dev *vdev, uint32_t events) +void vu_migrate(struct ctx *c, uint32_t events) { - int ret; + struct vu_dev *vdev = c->vdev; + int rc = EIO; - /* TODO: collect/set passt internal state - * and use vdev->device_state_fd to send/receive it - */ debug("vu_migrate fd %d events %x", vdev->device_state_fd, events); - if (events & EPOLLOUT) { - debug("Saving backend state"); - - /* send some stuff */ - ret = write(vdev->device_state_fd, "PASST", 6); - /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */ - vdev->device_state_result = ret == -1 ? -1 : 0; - /* Closing the file descriptor signals the end of transfer */ - epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, - vdev->device_state_fd, NULL); - close(vdev->device_state_fd); - vdev->device_state_fd = -1; - } else if (events & EPOLLIN) { - char buf[6]; - - debug("Loading backend state"); - /* read some stuff */ - ret = read(vdev->device_state_fd, buf, sizeof(buf)); - /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */ - if (ret != sizeof(buf)) { - vdev->device_state_result = -1; - } else { - ret = strncmp(buf, "PASST", sizeof(buf)); - vdev->device_state_result = ret == 0 ? 0 : -1; - } - } else if (events & EPOLLHUP) { - debug("Closing migration channel"); - - /* The end of file signals the end of the transfer. */ - epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, - vdev->device_state_fd, NULL); - close(vdev->device_state_fd); - vdev->device_state_fd = -1; - } + + if (events & EPOLLOUT) + rc = vu_migrate_source(vdev->device_state_fd); + else if (events & EPOLLIN) + rc = vu_migrate_target(vdev->device_state_fd); + + /* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */ + + vdev->device_state_result = rc; + + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL); + debug("Closing migration channel"); + close(vdev->device_state_fd); + vdev->device_state_fd = -1; } diff --git a/vu_common.h b/vu_common.h index d56c021..69c4006 100644 --- a/vu_common.h +++ b/vu_common.h @@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq, void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref, const struct timespec *now); int vu_send_single(const struct ctx *c, const void *buf, size_t size); -void vu_migrate(struct vu_dev *vdev, uint32_t events); +void vu_migrate(struct ctx *c, uint32_t events); #endif /* VU_COMMON_H */
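On the magic constants: one way to avoid maintaining the swapped value by hand is to derive it at compile time, as in the sketch below. Since the magic is 64-bit, the 64-bit variant of the helper is the one that applies; this assumes __bswap_constant_64() is available (glibc has it, musl would need a fallback, which is the concern raised above), and it is only an illustration.

/* Canonical magic, and the same value as seen with swapped endianness,
 * derived instead of hand-written */
#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
#define MIGRATE_MAGIC_SWAPPED	__bswap_constant_64(MIGRATE_MAGIC)
/* evaluates to 0xB0D1B10B1B1DBBB1, matching the constant in the patch */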
On Tue, Jan 28, 2025 at 07:50:01AM +0100, Stefano Brivio wrote:On Tue, 28 Jan 2025 12:40:12 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:What sort of issues? We're already using them, and have fallback versions defined in util.hOn Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote:Those always give me issues on musl,Add two sets (source or target) of three functions each for passt in vhost-user mode, triggered by activity on the file descriptor passed via VHOST_USER_PROTOCOL_F_DEVICE_STATE: - migrate_source_pre() and migrate_target_pre() are called to prepare for migration, before data is transferred - migrate_source() sends, and migrate_target() receives migration data - migrate_source_post() and migrate_target_post() are responsible for any post-migration task Callbacks are added to these functions with arrays of function pointers in migrate.c. Migration handlers are versioned. Versioned descriptions of data sections will be added to the data_versions array, which points to versioned iovec arrays. Version 1 is currently empty and will be filled in in subsequent patches. The source announces the data version to be used and informs the peer about endianness, and the size of void *, time_t, flow entries and flow hash table entries. The target checks if the version of the source is still supported. If it's not, it aborts the migration. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- Makefile | 12 +-- migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++ migrate.h | 90 ++++++++++++++++++ passt.c | 2 +- vu_common.c | 122 ++++++++++++++++--------- vu_common.h | 2 +- 6 files changed, 438 insertions(+), 49 deletions(-) create mode 100644 migrate.c create mode 100644 migrate.h diff --git a/Makefile b/Makefile index 464eef1..1383875 100644 --- a/Makefile +++ b/Makefile @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS) PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \ - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \ - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \ + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ vhost_user.c virtio.c vu_common.c QRAP_SRCS = qrap.c SRCS = $(PASST_SRCS) $(QRAP_SRCS) @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \ - virtio.h vu_common.h + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \ + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \ + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \ + vhost_user.h virtio.h vu_common.h HEADERS = $(PASST_HEADERS) seccomp.h C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);} diff --git a/migrate.c b/migrate.c new file mode 100644 index 0000000..bee9653 --- /dev/null +++ b/migrate.c @@ -0,0 +1,259 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +/* PASST - Plug A Simple Socket Transport + * for qemu/UNIX domain socket mode + * + * PASTA - Pack A Subtle Tap Abstraction + * for network 
namespace/tap device mode + * + * migrate.c - Migration sections, layout, and routines + * + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio <sbrivio(a)redhat.com> + */ + +#include <errno.h> +#include <sys/uio.h> + +#include "util.h" +#include "ip.h" +#include "passt.h" +#include "inany.h" +#include "flow.h" +#include "flow_table.h" + +#include "migrate.h" + +/* Current version of migration data */ +#define MIGRATE_VERSION 1 + +/* Magic as we see it and as seen with reverse endianness */ +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0 +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1As noted, I'm hoping we can get rid of "either endian" migration. But if this stays, we should define it using __bswap_constant_32() to avoid embarrassing mistakes.so I'd rather test things on big-endian and realise it's actually 0xB0D1B1B01B1DBBB1 (0x0b bitswap). Feel free to post a different proposal if tested.IIUC, that's on the target end, not the source end...This looks a bit unnecessary, MIGRATE_VERSION is defined just above... it's just a readability killer to me.+ +/* Migration header to send from source */ +static union migrate_header header = { + .magic = MIGRATE_MAGIC, + .version = htonl_constant(MIGRATE_VERSION), + .time_t_size = htonl_constant(sizeof(time_t)), + .flow_size = htonl_constant(sizeof(union flow)), + .flow_sidx_size = htonl_constant(sizeof(struct flow_sidx)), + .voidp_size = htonl_constant(sizeof(void *)), +}; + +/* Data sections for version 1 */ +static struct iovec sections_v1[] = { + { &header, sizeof(header) }, +}; + +/* Set of data versions */ +static struct migrate_data data_versions[] = { + { + 1, sections_v1, + }, + { 0 }, +}; + +/* Handlers to call in source before sending data */ +struct migrate_handler handlers_source_pre[] = { + { 0 }, +}; + +/* Handlers to call in source after sending data */ +struct migrate_handler handlers_source_post[] = { + { 0 }, +}; + +/* Handlers to call in target before receiving data with version 1 */ +struct migrate_handler handlers_target_pre_v1[] = { + { 0 }, +}; + +/* Handlers to call in target after receiving data with version 1 */ +struct migrate_handler handlers_target_post_v1[] = { + { 0 }, +}; + +/* Versioned sets of migration handlers */ +struct migrate_target_handlers target_handlers[] = { + { + 1, + handlers_target_pre_v1, + handlers_target_post_v1, + }, + { 0 }, +}; + +/** + * migrate_source_pre() - Pre-migration tasks as source + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +int migrate_source_pre(struct migrate_meta *m) +{ + struct migrate_handler *h; + + for (h = handlers_source_pre; h->fn; h++) { + int rc; + + if ((rc = h->fn(m, h->data))) + return rc; + } + + return 0; +} + +/** + * migrate_source() - Perform migration as source: send state to hypervisor + * @fd: Descriptor for state transfer + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +int migrate_source(int fd, const struct migrate_meta *m) +{ + static struct migrate_data *d; + unsigned count; + int rc; + + for (d = data_versions; d->v != MIGRATE_VERSION; d++);Should ASSERT() if we don't find the version within the array.We might want to log a couple of things, which would warrant these handlers. 
But let's say we need to do something *similar* to "updating the network" such as the RARP announcement that QEMU is requesting (this is+ for (count = 0; d->sections[count].iov_len; count++); + + debug("Writing %u migration sections", count - 1 /* minus header */); + rc = write_remainder(fd, d->sections, count, 0); + if (rc < 0) + return errno; + + return 0; +} + +/** + * migrate_source_post() - Post-migration tasks as source + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +void migrate_source_post(struct migrate_meta *m) +{ + struct migrate_handler *h; + + for (h = handlers_source_post; h->fn; h++) + h->fn(m, h->data);Is there actually anything we might need to do on the source after a successful migration, other than exit?intended for OVN-Kubernetes, so go figure), or that we need a workaround for a kernel issue with implicit close() with TCP_REPAIR on... I would leave this in for completeness....but sure, point taken.Ok.Because the reply to VHOST_USER_SET_DEVICE_STATE_FD is unsigned: https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#front-end-messa… and I want to keep this consistent/untranslated.+} + +/** + * migrate_target_read_header() - Set metadata in target from source header + * @fd: Descriptor for state transfer + * @m: Migration metadata, filled on return + * + * Return: 0 on success, error code on failureWe nearly always use negative error codes. Why not here?Ah, yes, I missed that, we'd need a more complex design to do additional transfers and checks before making the target_pre callbacks.Ah, yes, I forgot the '&& d->v' part (see migrate_target()).+ */ +int migrate_target_read_header(int fd, struct migrate_meta *m) +{ + static struct migrate_data *d; + union migrate_header h; + + if (read_all_buf(fd, &h, sizeof(h))) + return errno; + + debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u", + h.magic, ntohl(h.voidp_size), ntohl(h.version)); + + for (d = data_versions; d->v != ntohl(h.version); d++); + if (!d->v) + return ENOTSUP;This is too late. The loop doesn't check it, so you've already overrun the data_versions table if the version wasn't in there.Easier to use an ARRAY_SIZE() limit in the loop, I think.I'd rather keep that as a one-liner, and NULL-terminate the arrays.The header is processed by the target in a separate, preliminary step, though. 
That's why I added metadata right in the header: if the target needs to abort the migration because, say, the size of a flow entry is too big to handle for a particular version, then we should know that before migrate_target_pre().+ m->v = d->v; + + if (h.magic == MIGRATE_MAGIC) + m->bswap = false; + else if (h.magic == MIGRATE_MAGIC_SWAPPED) + m->bswap = true; + else + return ENOTSUP; + + if (ntohl(h.voidp_size) == 4) + m->source_64b = false; + else if (ntohl(h.voidp_size) == 8) + m->source_64b = true; + else + return ENOTSUP; + + if (ntohl(h.time_t_size) == 4) + m->time_64b = false; + else if (ntohl(h.time_t_size) == 8) + m->time_64b = true; + else + return ENOTSUP; + + m->flow_size = ntohl(h.flow_size); + m->flow_sidx_size = ntohl(h.flow_sidx_size); + + return 0; +} + +/** + * migrate_target_pre() - Pre-migration tasks as target + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +int migrate_target_pre(struct migrate_meta *m) +{ + struct migrate_target_handlers *th; + struct migrate_handler *h; + + for (th = target_handlers; th->v != m->v && th->v; th++); + + for (h = th->pre; h->fn; h++) { + int rc; + + if ((rc = h->fn(m, h->data))) + return rc; + } + + return 0; +} + +/** + * migrate_target() - Perform migration as target: receive state from hypervisor + * @fd: Descriptor for state transfer + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + * + * #syscalls:vu readv + */ +int migrate_target(int fd, const struct migrate_meta *m) +{ + static struct migrate_data *d; + unsigned cnt; + int rc; + + for (d = data_versions; d->v != m->v && d->v; d++); + + for (cnt = 0; d->sections[cnt + 1 /* skip header */].iov_len; cnt++); + + debug("Reading %u migration sections", cnt); + rc = read_remainder(fd, d->sections + 1, cnt, 0); + if (rc < 0) + return errno; + + return 0; +} + +/** + * migrate_target_post() - Post-migration tasks as target + * @m: Migration metadata + */ +void migrate_target_post(struct migrate_meta *m) +{ + struct migrate_target_handlers *th; + struct migrate_handler *h; + + for (th = target_handlers; th->v != m->v && th->v; th++); + + for (h = th->post; h->fn; h++) + h->fn(m, h->data); +} diff --git a/migrate.h b/migrate.h new file mode 100644 index 0000000..5582f75 --- /dev/null +++ b/migrate.h @@ -0,0 +1,90 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio <sbrivio(a)redhat.com> + */ + +#ifndef MIGRATE_H +#define MIGRATE_H + +/** + * struct migrate_meta - Migration metadata + * @v: Chosen migration data version, host order + * @bswap: Source has opposite endianness + * @peer_64b: Source uses 64-bit void * + * @time_64b: Source uses 64-bit time_t + * @flow_size: Size of union flow in source + * @flow_sidx_size: Size of struct flow_sidx in source + */ +struct migrate_meta { + uint32_t v; + bool bswap; + bool source_64b; + bool time_64b; + size_t flow_size; + size_t flow_sidx_size; +}; + +/** + * union migrate_header - Migration header from source + * @magic: 0xB1BB1D1B0BB1D1B0, host order + * @version: Source sends highest known, target aborts if unsupported + * @voidp_size: sizeof(void *), network order + * @time_t_size: sizeof(time_t), network order + * @flow_size: sizeof(union flow), network order + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order + * @unused: Go figure + */ +union migrate_header { + struct { + uint64_t magic; + uint32_t version; + uint32_t voidp_size; + uint32_t time_t_size; + uint32_t flow_size; + uint32_t flow_sidx_size; + 
}; + uint8_t unused[65536];So, having looked at this, I no longer think padding the header to 64kiB is a good idea. The structure means we're basically stuck always having that chunky header. Instead, I think the header should be absolutely minimal: basically magic and version only. v1 (and maybe others) can add a "metadata" or whatever section for additional information like this they need.As long as we check the version first, we can always shrink the header later on.*thinks*.. I guess so, though it's kind of awkward; a future version would have to read the "header of the header", check the version, then if it's the old one, read the remainder of the 64kiB block. I still think we should clearly separate the part that we're committing to being in every future version (which I think should just be magic and version), from the stuff that's just v1.But having 64 KiB reserved looks more robust because it's a safe place to add this kind of metadata. Note that 64 KiB is typically transferred in a single read/write from/to the vhost-user back-end.Ok, but it also has to go over the qemu migration channel, which will often be a physical link, not a super-fast local/virtual one, and may be bandwidth capped as well. I'm not actually certain if 64kiB is likely to be a problem there, but it *is* large compared to the state blobs of most qemu devices (usually only a few hundred bytes).Seems sensible for now.Maybe, yes. Pending TCP connections should be safe because with TCP_REPAIR they're already quiesced, but we don't close listening sockets (yet). Perhaps a reasonable approach for the moment would be to declare a single migrate_source_post handler logging a info() message and exiting.+}; + +/** + * struct migrate_data - Data sections for given source version + * @v: Source version this applies to, host order + * @sections: Array of data sections, NULL-terminated + */ +struct migrate_data { + uint32_t v; + struct iovec *sections; +}; + +/** + * struct migrate_handler - Function to handle a specific data section + * @fn: Function pointer taking pointer to data section + * @data: Associated data section + */ +struct migrate_handler { + int (*fn)(struct migrate_meta *m, void *data); + void *data; +}; + +/** + * struct migrate_target_handlers - Versioned sets of migration target handlers + * @v: Source version this applies to, host order + * @pre: Set of functions to execute in target before data copy + * @post: Set of functions to execute in target after data copy + */ +struct migrate_target_handlers { + uint32_t v; + struct migrate_handler *pre; + struct migrate_handler *post; +}; + +int migrate_source_pre(struct migrate_meta *m); +int migrate_source(int fd, const struct migrate_meta *m); +void migrate_source_post(struct migrate_meta *m); + +int migrate_target_read_header(int fd, struct migrate_meta *m); +int migrate_target_pre(struct migrate_meta *m); +int migrate_target(int fd, const struct migrate_meta *m); +void migrate_target_post(struct migrate_meta *m); + +#endif /* MIGRATE_H */ diff --git a/passt.c b/passt.c index b1c8ab6..184d4e5 100644 --- a/passt.c +++ b/passt.c @@ -358,7 +358,7 @@ loop: vu_kick_cb(c.vdev, ref, &now); break; case EPOLL_TYPE_VHOST_MIGRATION: - vu_migrate(c.vdev, eventmask); + vu_migrate(&c, eventmask); break; default: /* Can't happen */ diff --git a/vu_common.c b/vu_common.c index f43d8ac..0c67bd0 100644 --- a/vu_common.c +++ b/vu_common.c @@ -5,6 +5,7 @@ * common_vu.c - vhost-user common UDP and TCP functions */ +#include <errno.h> #include <unistd.h> #include <sys/uio.h> 
#include <sys/eventfd.h> @@ -17,6 +18,7 @@ #include "vhost_user.h" #include "pcap.h" #include "vu_common.h" +#include "migrate.h" #define VU_MAX_TX_BUFFER_NB 2 @@ -305,50 +307,88 @@ err: } /** - * vu_migrate() - Send/receive passt insternal state to/from QEMU - * @vdev: vhost-user device + * vu_migrate_source() - Migration as source, send state to hypervisor + * @fd: File descriptor for state transfer + * + * Return: 0 on success, positive error code on failure + */ +static int vu_migrate_source(int fd) +{ + struct migrate_meta m; + int rc; + + if ((rc = migrate_source_pre(&m))) { + err("Source pre-migration failed: %s, abort", strerror_(rc)); + return rc; + } + + debug("Saving backend state"); + + rc = migrate_source(fd, &m); + if (rc) + err("Source migration failed: %s", strerror_(rc)); + else + migrate_source_post(&m); + + return rc;After a successful source migration shouldn't we exit, or at least quiesce ourselves so we don't accidentally mess with anything the target is now doing?-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson+} + +/** + * vu_migrate_target() - Migration as target, receive state from hypervisor + * @fd: File descriptor for state transfer + * + * Return: 0 on success, positive error code on failure + */ +static int vu_migrate_target(int fd) +{ + struct migrate_meta m; + int rc; + + rc = migrate_target_read_header(fd, &m); + if (rc) { + err("Migration header check failed: %s, abort", strerror_(rc)); + return rc; + } + + if ((rc = migrate_target_pre(&m))) { + err("Target pre-migration failed: %s, abort", strerror_(rc)); + return rc; + } + + debug("Loading backend state"); + + rc = migrate_target(fd, &m); + if (rc) + err("Target migration failed: %s", strerror_(rc)); + else + migrate_target_post(&m); + + return rc; +} + +/** + * vu_migrate() - Send/receive passt internal state to/from QEMU + * @c: Execution context * @events: epoll events */ -void vu_migrate(struct vu_dev *vdev, uint32_t events) +void vu_migrate(struct ctx *c, uint32_t events) { - int ret; + struct vu_dev *vdev = c->vdev; + int rc = EIO; - /* TODO: collect/set passt internal state - * and use vdev->device_state_fd to send/receive it - */ debug("vu_migrate fd %d events %x", vdev->device_state_fd, events); - if (events & EPOLLOUT) { - debug("Saving backend state"); - - /* send some stuff */ - ret = write(vdev->device_state_fd, "PASST", 6); - /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */ - vdev->device_state_result = ret == -1 ? -1 : 0; - /* Closing the file descriptor signals the end of transfer */ - epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, - vdev->device_state_fd, NULL); - close(vdev->device_state_fd); - vdev->device_state_fd = -1; - } else if (events & EPOLLIN) { - char buf[6]; - - debug("Loading backend state"); - /* read some stuff */ - ret = read(vdev->device_state_fd, buf, sizeof(buf)); - /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */ - if (ret != sizeof(buf)) { - vdev->device_state_result = -1; - } else { - ret = strncmp(buf, "PASST", sizeof(buf)); - vdev->device_state_result = ret == 0 ? 0 : -1; - } - } else if (events & EPOLLHUP) { - debug("Closing migration channel"); - - /* The end of file signals the end of the transfer. 
*/ - epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, - vdev->device_state_fd, NULL); - close(vdev->device_state_fd); - vdev->device_state_fd = -1; - } + + if (events & EPOLLOUT) + rc = vu_migrate_source(vdev->device_state_fd); + else if (events & EPOLLIN) + rc = vu_migrate_target(vdev->device_state_fd); + + /* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */ + + vdev->device_state_result = rc; + + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL); + debug("Closing migration channel"); + close(vdev->device_state_fd); + vdev->device_state_fd = -1; } diff --git a/vu_common.h b/vu_common.h index d56c021..69c4006 100644 --- a/vu_common.h +++ b/vu_common.h @@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq, void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref, const struct timespec *now); int vu_send_single(const struct ctx *c, const void *buf, size_t size); -void vu_migrate(struct vu_dev *vdev, uint32_t events); +void vu_migrate(struct ctx *c, uint32_t events); #endif /* VU_COMMON_H */
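To make the minimal-header split discussed above concrete: a fixed header plus a version-1 metadata section could look roughly like the sketch below. Field names are illustrative and not taken from the posted patch, and packing/padding details are left aside.

/* Minimal header every future version keeps: magic and version only */
struct migrate_header {
	uint64_t magic;		/* host order, lets the target detect a byte swap */
	uint32_t version;	/* highest version known to the source */
};

/* Version-1 data section carrying what the 64 KiB header holds today */
struct migrate_v1_meta {
	uint32_t voidp_size;	/* sizeof(void *) on the source */
	uint32_t time_t_size;	/* sizeof(time_t) on the source */
	uint32_t flow_size;	/* sizeof(union flow) on the source */
	uint32_t flow_sidx_size;	/* sizeof(struct flow_sidx) on the source */
};

The target would read the fixed header first, pick the version, and only then read and validate the per-version metadata, which keeps the part committed to for all future versions as small as possible.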
On Wed, 29 Jan 2025 12:16:58 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Tue, Jan 28, 2025 at 07:50:01AM +0100, Stefano Brivio wrote:The very issues that brought me to introduce those fallback versions, so I'm instinctively reluctant to use them. Actually, I think it's even clearer to have this spelt out (I always need to stop for a moment and think: what happens when I cross the 32-bit boundary?).On Tue, 28 Jan 2025 12:40:12 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:What sort of issues? We're already using them, and have fallback versions defined in util.hOn Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote:Those always give me issues on musl,Add two sets (source or target) of three functions each for passt in vhost-user mode, triggered by activity on the file descriptor passed via VHOST_USER_PROTOCOL_F_DEVICE_STATE: - migrate_source_pre() and migrate_target_pre() are called to prepare for migration, before data is transferred - migrate_source() sends, and migrate_target() receives migration data - migrate_source_post() and migrate_target_post() are responsible for any post-migration task Callbacks are added to these functions with arrays of function pointers in migrate.c. Migration handlers are versioned. Versioned descriptions of data sections will be added to the data_versions array, which points to versioned iovec arrays. Version 1 is currently empty and will be filled in in subsequent patches. The source announces the data version to be used and informs the peer about endianness, and the size of void *, time_t, flow entries and flow hash table entries. The target checks if the version of the source is still supported. If it's not, it aborts the migration. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- Makefile | 12 +-- migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++ migrate.h | 90 ++++++++++++++++++ passt.c | 2 +- vu_common.c | 122 ++++++++++++++++--------- vu_common.h | 2 +- 6 files changed, 438 insertions(+), 49 deletions(-) create mode 100644 migrate.c create mode 100644 migrate.h diff --git a/Makefile b/Makefile index 464eef1..1383875 100644 --- a/Makefile +++ b/Makefile @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS) PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \ - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \ - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \ + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ vhost_user.c virtio.c vu_common.c QRAP_SRCS = qrap.c SRCS = $(PASST_SRCS) $(QRAP_SRCS) @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \ - virtio.h vu_common.h + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \ + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \ + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \ + vhost_user.h virtio.h vu_common.h HEADERS = $(PASST_HEADERS) seccomp.h C := \#include <sys/random.h>\nint 
main(){int a=getrandom(0, 0, 0);} diff --git a/migrate.c b/migrate.c new file mode 100644 index 0000000..bee9653 --- /dev/null +++ b/migrate.c @@ -0,0 +1,259 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +/* PASST - Plug A Simple Socket Transport + * for qemu/UNIX domain socket mode + * + * PASTA - Pack A Subtle Tap Abstraction + * for network namespace/tap device mode + * + * migrate.c - Migration sections, layout, and routines + * + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio <sbrivio(a)redhat.com> + */ + +#include <errno.h> +#include <sys/uio.h> + +#include "util.h" +#include "ip.h" +#include "passt.h" +#include "inany.h" +#include "flow.h" +#include "flow_table.h" + +#include "migrate.h" + +/* Current version of migration data */ +#define MIGRATE_VERSION 1 + +/* Magic as we see it and as seen with reverse endianness */ +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0 +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1As noted, I'm hoping we can get rid of "either endian" migration. But if this stays, we should define it using __bswap_constant_32() to avoid embarrassing mistakes.The RARP announcement yes, but something *similar* to it, not necessarily.so I'd rather test things on big-endian and realise it's actually 0xB0D1B1B01B1DBBB1 (0x0b bitswap). Feel free to post a different proposal if tested.IIUC, that's on the target end, not the source end...This looks a bit unnecessary, MIGRATE_VERSION is defined just above... it's just a readability killer to me.+ +/* Migration header to send from source */ +static union migrate_header header = { + .magic = MIGRATE_MAGIC, + .version = htonl_constant(MIGRATE_VERSION), + .time_t_size = htonl_constant(sizeof(time_t)), + .flow_size = htonl_constant(sizeof(union flow)), + .flow_sidx_size = htonl_constant(sizeof(struct flow_sidx)), + .voidp_size = htonl_constant(sizeof(void *)), +}; + +/* Data sections for version 1 */ +static struct iovec sections_v1[] = { + { &header, sizeof(header) }, +}; + +/* Set of data versions */ +static struct migrate_data data_versions[] = { + { + 1, sections_v1, + }, + { 0 }, +}; + +/* Handlers to call in source before sending data */ +struct migrate_handler handlers_source_pre[] = { + { 0 }, +}; + +/* Handlers to call in source after sending data */ +struct migrate_handler handlers_source_post[] = { + { 0 }, +}; + +/* Handlers to call in target before receiving data with version 1 */ +struct migrate_handler handlers_target_pre_v1[] = { + { 0 }, +}; + +/* Handlers to call in target after receiving data with version 1 */ +struct migrate_handler handlers_target_post_v1[] = { + { 0 }, +}; + +/* Versioned sets of migration handlers */ +struct migrate_target_handlers target_handlers[] = { + { + 1, + handlers_target_pre_v1, + handlers_target_post_v1, + }, + { 0 }, +}; + +/** + * migrate_source_pre() - Pre-migration tasks as source + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +int migrate_source_pre(struct migrate_meta *m) +{ + struct migrate_handler *h; + + for (h = handlers_source_pre; h->fn; h++) { + int rc; + + if ((rc = h->fn(m, h->data))) + return rc; + } + + return 0; +} + +/** + * migrate_source() - Perform migration as source: send state to hypervisor + * @fd: Descriptor for state transfer + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +int migrate_source(int fd, const struct migrate_meta *m) +{ + static struct migrate_data *d; + unsigned count; + int rc; + + for (d = data_versions; d->v != MIGRATE_VERSION; d++);Should ASSERT() if 
we don't find the version within the array.We might want to log a couple of things, which would warrant these handlers. But let's say we need to do something *similar* to "updating the network" such as the RARP announcement that QEMU is requesting (this is+ for (count = 0; d->sections[count].iov_len; count++); + + debug("Writing %u migration sections", count - 1 /* minus header */); + rc = write_remainder(fd, d->sections, count, 0); + if (rc < 0) + return errno; + + return 0; +} + +/** + * migrate_source_post() - Post-migration tasks as source + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +void migrate_source_post(struct migrate_meta *m) +{ + struct migrate_handler *h; + + for (h = handlers_source_post; h->fn; h++) + h->fn(m, h->data);Is there actually anything we might need to do on the source after a successful migration, other than exit?Sure, I can add a comment.intended for OVN-Kubernetes, so go figure), or that we need a workaround for a kernel issue with implicit close() with TCP_REPAIR on... I would leave this in for completeness....but sure, point taken.Ok.Because the reply to VHOST_USER_SET_DEVICE_STATE_FD is unsigned: https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#front-end-messa… and I want to keep this consistent/untranslated.+} + +/** + * migrate_target_read_header() - Set metadata in target from source header + * @fd: Descriptor for state transfer + * @m: Migration metadata, filled on return + * + * Return: 0 on success, error code on failureWe nearly always use negative error codes. Why not here?Ah, yes, I missed that, we'd need a more complex design to do additional transfers and checks before making the target_pre callbacks.Ah, yes, I forgot the '&& d->v' part (see migrate_target()).+ */ +int migrate_target_read_header(int fd, struct migrate_meta *m) +{ + static struct migrate_data *d; + union migrate_header h; + + if (read_all_buf(fd, &h, sizeof(h))) + return errno; + + debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u", + h.magic, ntohl(h.voidp_size), ntohl(h.version)); + + for (d = data_versions; d->v != ntohl(h.version); d++); + if (!d->v) + return ENOTSUP;This is too late. The loop doesn't check it, so you've already overrun the data_versions table if the version wasn't in there.Easier to use an ARRAY_SIZE() limit in the loop, I think.I'd rather keep that as a one-liner, and NULL-terminate the arrays.The header is processed by the target in a separate, preliminary step, though. 
That's why I added metadata right in the header: if the target needs to abort the migration because, say, the size of a flow entry is too big to handle for a particular version, then we should know that before migrate_target_pre().+ m->v = d->v; + + if (h.magic == MIGRATE_MAGIC) + m->bswap = false; + else if (h.magic == MIGRATE_MAGIC_SWAPPED) + m->bswap = true; + else + return ENOTSUP; + + if (ntohl(h.voidp_size) == 4) + m->source_64b = false; + else if (ntohl(h.voidp_size) == 8) + m->source_64b = true; + else + return ENOTSUP; + + if (ntohl(h.time_t_size) == 4) + m->time_64b = false; + else if (ntohl(h.time_t_size) == 8) + m->time_64b = true; + else + return ENOTSUP; + + m->flow_size = ntohl(h.flow_size); + m->flow_sidx_size = ntohl(h.flow_sidx_size); + + return 0; +} + +/** + * migrate_target_pre() - Pre-migration tasks as target + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + */ +int migrate_target_pre(struct migrate_meta *m) +{ + struct migrate_target_handlers *th; + struct migrate_handler *h; + + for (th = target_handlers; th->v != m->v && th->v; th++); + + for (h = th->pre; h->fn; h++) { + int rc; + + if ((rc = h->fn(m, h->data))) + return rc; + } + + return 0; +} + +/** + * migrate_target() - Perform migration as target: receive state from hypervisor + * @fd: Descriptor for state transfer + * @m: Migration metadata + * + * Return: 0 on success, error code on failure + * + * #syscalls:vu readv + */ +int migrate_target(int fd, const struct migrate_meta *m) +{ + static struct migrate_data *d; + unsigned cnt; + int rc; + + for (d = data_versions; d->v != m->v && d->v; d++); + + for (cnt = 0; d->sections[cnt + 1 /* skip header */].iov_len; cnt++); + + debug("Reading %u migration sections", cnt); + rc = read_remainder(fd, d->sections + 1, cnt, 0); + if (rc < 0) + return errno; + + return 0; +} + +/** + * migrate_target_post() - Post-migration tasks as target + * @m: Migration metadata + */ +void migrate_target_post(struct migrate_meta *m) +{ + struct migrate_target_handlers *th; + struct migrate_handler *h; + + for (th = target_handlers; th->v != m->v && th->v; th++); + + for (h = th->post; h->fn; h++) + h->fn(m, h->data); +} diff --git a/migrate.h b/migrate.h new file mode 100644 index 0000000..5582f75 --- /dev/null +++ b/migrate.h @@ -0,0 +1,90 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio <sbrivio(a)redhat.com> + */ + +#ifndef MIGRATE_H +#define MIGRATE_H + +/** + * struct migrate_meta - Migration metadata + * @v: Chosen migration data version, host order + * @bswap: Source has opposite endianness + * @peer_64b: Source uses 64-bit void * + * @time_64b: Source uses 64-bit time_t + * @flow_size: Size of union flow in source + * @flow_sidx_size: Size of struct flow_sidx in source + */ +struct migrate_meta { + uint32_t v; + bool bswap; + bool source_64b; + bool time_64b; + size_t flow_size; + size_t flow_sidx_size; +}; + +/** + * union migrate_header - Migration header from source + * @magic: 0xB1BB1D1B0BB1D1B0, host order + * @version: Source sends highest known, target aborts if unsupported + * @voidp_size: sizeof(void *), network order + * @time_t_size: sizeof(time_t), network order + * @flow_size: sizeof(union flow), network order + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order + * @unused: Go figure + */ +union migrate_header { + struct { + uint64_t magic; + uint32_t version; + uint32_t voidp_size; + uint32_t time_t_size; + uint32_t flow_size; + uint32_t flow_sidx_size; + 
}; + uint8_t unused[65536];So, having looked at this, I no longer think padding the header to 64kiB is a good idea. The structure means we're basically stuck always having that chunky header. Instead, I think the header should be absolutely minimal: basically magic and version only. v1 (and maybe others) can add a "metadata" or whatever section for additional information like this they need.As long as we check the version first, we can always shrink the header later on.*thinks*.. I guess so, though it's kind of awkward; a future version would have to read the "header of the header", check the version, then if it's the old one, read the remainder of the 64kiB block. I still think we should clearly separate the part that we're committing to being in every future version (which I think should just be magic and version), from the stuff that's just v1.Even if we transfer just what we need of a flow, it's still something well in excess of 50 bytes each. 100k flows would be 5 megs. Well, anyway, let's cut this down to 4k, which should be enough, so that it's not a topic anymore. -- StefanoBut having 64 KiB reserved looks more robust because it's a safe place to add this kind of metadata. Note that 64 KiB is typically transferred in a single read/write from/to the vhost-user back-end.Ok, but it also has to go over the qemu migration channel, which will often be a physical link, not a super-fast local/virtual one, and may be bandwidth capped as well. I'm not actually certain if 64kiB is likely to be a problem there, but it *is* large compared to the state blobs of most qemu devices (usually only a few hundred bytes).
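For reference, the magic-and-version-only split being discussed could look roughly like this. It is only a sketch: the names are illustrative, not taken from the patch, and the point is just that the version-independent part carries nothing but magic and version, while the sizes travel in an ordinary v1 data section that the target can still check before any flow data arrives:

#include <assert.h>
#include <stdint.h>

/* Version-independent preamble: the only part every future version
 * commits to parsing */
struct migrate_preamble {
	uint64_t magic;		/* endianness and sanity check */
	uint32_t version;	/* network order, highest the source knows */
	uint32_t reserved;	/* avoid implicit trailing padding */
};
static_assert(sizeof(struct migrate_preamble) == 16, "no implicit padding");

/* v1-only metadata, sent as a regular v1 section right after the
 * preamble, so the target can still reject oversized flow entries
 * before any flow data is read */
struct migrate_meta_v1 {
	uint32_t voidp_size;	/* network order */
	uint32_t time_t_size;	/* network order */
	uint32_t flow_size;	/* network order */
	uint32_t flow_sidx_size;	/* network order */
};
static_assert(sizeof(struct migrate_meta_v1) == 16, "no implicit padding");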
On Wed, Jan 29, 2025 at 08:33:50AM +0100, Stefano Brivio wrote:On Wed, 29 Jan 2025 12:16:58 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Oh, yes, we'd need to add a __bswap_constant_64() for this. [snip]On Tue, Jan 28, 2025 at 07:50:01AM +0100, Stefano Brivio wrote:The very issues that brought me to introduce those fallback versions, so I'm instinctively reluctant to use them. Actually, I think it's even clearer to have this spelt out (I always need to stop for a moment and think: what happens when I cross the 32-bit boundary?).On Tue, 28 Jan 2025 12:40:12 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:What sort of issues? We're already using them, and have fallback versions defined in util.hOn Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote: > Add two sets (source or target) of three functions each for passt in > vhost-user mode, triggered by activity on the file descriptor passed > via VHOST_USER_PROTOCOL_F_DEVICE_STATE: > > - migrate_source_pre() and migrate_target_pre() are called to prepare > for migration, before data is transferred > > - migrate_source() sends, and migrate_target() receives migration data > > - migrate_source_post() and migrate_target_post() are responsible for > any post-migration task > > Callbacks are added to these functions with arrays of function > pointers in migrate.c. Migration handlers are versioned. > > Versioned descriptions of data sections will be added to the > data_versions array, which points to versioned iovec arrays. Version > 1 is currently empty and will be filled in in subsequent patches. > > The source announces the data version to be used and informs the peer > about endianness, and the size of void *, time_t, flow entries and > flow hash table entries. > > The target checks if the version of the source is still supported. If > it's not, it aborts the migration. 
> > Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> > --- > Makefile | 12 +-- > migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > migrate.h | 90 ++++++++++++++++++ > passt.c | 2 +- > vu_common.c | 122 ++++++++++++++++--------- > vu_common.h | 2 +- > 6 files changed, 438 insertions(+), 49 deletions(-) > create mode 100644 migrate.c > create mode 100644 migrate.h > > diff --git a/Makefile b/Makefile > index 464eef1..1383875 100644 > --- a/Makefile > +++ b/Makefile > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS) > > PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ > icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \ > - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \ > - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \ > + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > vhost_user.c virtio.c vu_common.c > QRAP_SRCS = qrap.c > SRCS = $(PASST_SRCS) $(QRAP_SRCS) > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1 > > PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ > flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ > - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ > - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ > - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \ > - virtio.h vu_common.h > + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \ > + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \ > + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \ > + vhost_user.h virtio.h vu_common.h > HEADERS = $(PASST_HEADERS) seccomp.h > > C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);} > diff --git a/migrate.c b/migrate.c > new file mode 100644 > index 0000000..bee9653 > --- /dev/null > +++ b/migrate.c > @@ -0,0 +1,259 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > + > +/* PASST - Plug A Simple Socket Transport > + * for qemu/UNIX domain socket mode > + * > + * PASTA - Pack A Subtle Tap Abstraction > + * for network namespace/tap device mode > + * > + * migrate.c - Migration sections, layout, and routines > + * > + * Copyright (c) 2025 Red Hat GmbH > + * Author: Stefano Brivio <sbrivio(a)redhat.com> > + */ > + > +#include <errno.h> > +#include <sys/uio.h> > + > +#include "util.h" > +#include "ip.h" > +#include "passt.h" > +#include "inany.h" > +#include "flow.h" > +#include "flow_table.h" > + > +#include "migrate.h" > + > +/* Current version of migration data */ > +#define MIGRATE_VERSION 1 > + > +/* Magic as we see it and as seen with reverse endianness */ > +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0 > +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1 As noted, I'm hoping we can get rid of "either endian" migration. But if this stays, we should define it using __bswap_constant_32() to avoid embarrassing mistakes.Those always give me issues on musl,Just transferring the in-use flows would be higher priority than being selective about what we send within each flow. It's both easier to do and a bigger win in most cases. 
That would dramatically reduce the size sent here.Sure, I can add a comment.Ah, yes, I missed that, we'd need a more complex design to do additional transfers and checks before making the target_pre callbacks.> +/** > + * union migrate_header - Migration header from source > + * @magic: 0xB1BB1D1B0BB1D1B0, host order > + * @version: Source sends highest known, target aborts if unsupported > + * @voidp_size: sizeof(void *), network order > + * @time_t_size: sizeof(time_t), network order > + * @flow_size: sizeof(union flow), network order > + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order > + * @unused: Go figure > + */ > +union migrate_header { > + struct { > + uint64_t magic; > + uint32_t version; > + uint32_t voidp_size; > + uint32_t time_t_size; > + uint32_t flow_size; > + uint32_t flow_sidx_size; > + }; > + uint8_t unused[65536]; So, having looked at this, I no longer think padding the header to 64kiB is a good idea. The structure means we're basically stuck always having that chunky header. Instead, I think the header should be absolutely minimal: basically magic and version only. v1 (and maybe others) can add a "metadata" or whatever section for additional information like this they need.The header is processed by the target in a separate, preliminary step, though. That's why I added metadata right in the header: if the target needs to abort the migration because, say, the size of a flow entry is too big to handle for a particular version, then we should know that before migrate_target_pre().As long as we check the version first, we can always shrink the header later on.*thinks*.. I guess so, though it's kind of awkward; a future version would have to read the "header of the header", check the version, then if it's the old one, read the remainder of the 64kiB block. I still think we should clearly separate the part that we're committing to being in every future version (which I think should just be magic and version), from the stuff that's just v1.Even if we transfer just what we need of a flow, it's still something well in excess of 50 bytes each. 100k flows would be 5 megs.But having 64 KiB reserved looks more robust because it's a safe place to add this kind of metadata. Note that 64 KiB is typically transferred in a single read/write from/to the vhost-user back-end.Ok, but it also has to go over the qemu migration channel, which will often be a physical link, not a super-fast local/virtual one, and may be bandwidth capped as well. I'm not actually certain if 64kiB is likely to be a problem there, but it *is* large compared to the state blobs of most qemu devices (usually only a few hundred bytes).Well, anyway, let's cut this down to 4k, which should be enough, so that it's not a topic anymore.I still think it's ugly, but whatever. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
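To make the "only transfer the flows in use" idea concrete, the source side boils down to a count followed by fixed-size records. This is a toy sketch only, nothing in it is passt code (table, record and helper are stand-ins), and short-write handling is left out since write_remainder() already exists for that:

#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

#define TOY_FLOW_MAX 128		/* stand-in for the real table size */

struct toy_flow {
	bool in_use;
	uint32_t seq_to_tap;		/* placeholder per-flow state */
};

static struct toy_flow toy_flowtab[TOY_FLOW_MAX];

/* Send a count, then one record per active flow, never the whole table */
static int toy_send_active(int fd)
{
	uint32_t i, count = 0, rec;

	for (i = 0; i < TOY_FLOW_MAX; i++)
		count += toy_flowtab[i].in_use;

	count = htonl(count);
	if (write(fd, &count, sizeof(count)) < 0)
		return -1;

	for (i = 0; i < TOY_FLOW_MAX; i++) {
		if (!toy_flowtab[i].in_use)
			continue;

		rec = htonl(toy_flowtab[i].seq_to_tap);
		if (write(fd, &rec, sizeof(rec)) < 0)
			return -1;
	}

	return 0;
}

int main(void)
{
	toy_flowtab[3].in_use = true;
	toy_flowtab[3].seq_to_tap = 1234;

	return toy_send_active(STDOUT_FILENO) ? 1 : 0;
}

With records in the 50-byte range, a guest with only a handful of connections would then send a few hundred bytes of flow data instead of the full table.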
On Thu, 30 Jan 2025 11:48:19 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Wed, Jan 29, 2025 at 08:33:50AM +0100, Stefano Brivio wrote:...which doesn't exist on musl. On current Alpine Edge: util.h:131:34: error: implicit declaration of function '__bswap_constant_64' [-Wimplicit-function-declaration] 131 | #define htonll_constant(x) (__bswap_constant_64(x)) | ^~~~~~~~~~~~~~~~~~~ ...so rather than adding an implementation for this single usage, which makes it actually less clear to me, I would keep it like it is.On Wed, 29 Jan 2025 12:16:58 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Oh, yes, we'd need to add a __bswap_constant_64() for this.On Tue, Jan 28, 2025 at 07:50:01AM +0100, Stefano Brivio wrote:The very issues that brought me to introduce those fallback versions, so I'm instinctively reluctant to use them. Actually, I think it's even clearer to have this spelt out (I always need to stop for a moment and think: what happens when I cross the 32-bit boundary?).On Tue, 28 Jan 2025 12:40:12 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote: > On Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote: > > Add two sets (source or target) of three functions each for passt in > > vhost-user mode, triggered by activity on the file descriptor passed > > via VHOST_USER_PROTOCOL_F_DEVICE_STATE: > > > > - migrate_source_pre() and migrate_target_pre() are called to prepare > > for migration, before data is transferred > > > > - migrate_source() sends, and migrate_target() receives migration data > > > > - migrate_source_post() and migrate_target_post() are responsible for > > any post-migration task > > > > Callbacks are added to these functions with arrays of function > > pointers in migrate.c. Migration handlers are versioned. > > > > Versioned descriptions of data sections will be added to the > > data_versions array, which points to versioned iovec arrays. Version > > 1 is currently empty and will be filled in in subsequent patches. > > > > The source announces the data version to be used and informs the peer > > about endianness, and the size of void *, time_t, flow entries and > > flow hash table entries. > > > > The target checks if the version of the source is still supported. If > > it's not, it aborts the migration. 
> > > > Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> > > --- > > Makefile | 12 +-- > > migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > > migrate.h | 90 ++++++++++++++++++ > > passt.c | 2 +- > > vu_common.c | 122 ++++++++++++++++--------- > > vu_common.h | 2 +- > > 6 files changed, 438 insertions(+), 49 deletions(-) > > create mode 100644 migrate.c > > create mode 100644 migrate.h > > > > diff --git a/Makefile b/Makefile > > index 464eef1..1383875 100644 > > --- a/Makefile > > +++ b/Makefile > > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS) > > > > PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ > > icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \ > > - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \ > > - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > > + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \ > > + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > > vhost_user.c virtio.c vu_common.c > > QRAP_SRCS = qrap.c > > SRCS = $(PASST_SRCS) $(QRAP_SRCS) > > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1 > > > > PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ > > flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ > > - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ > > - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ > > - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \ > > - virtio.h vu_common.h > > + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \ > > + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \ > > + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \ > > + vhost_user.h virtio.h vu_common.h > > HEADERS = $(PASST_HEADERS) seccomp.h > > > > C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);} > > diff --git a/migrate.c b/migrate.c > > new file mode 100644 > > index 0000000..bee9653 > > --- /dev/null > > +++ b/migrate.c > > @@ -0,0 +1,259 @@ > > +// SPDX-License-Identifier: GPL-2.0-or-later > > + > > +/* PASST - Plug A Simple Socket Transport > > + * for qemu/UNIX domain socket mode > > + * > > + * PASTA - Pack A Subtle Tap Abstraction > > + * for network namespace/tap device mode > > + * > > + * migrate.c - Migration sections, layout, and routines > > + * > > + * Copyright (c) 2025 Red Hat GmbH > > + * Author: Stefano Brivio <sbrivio(a)redhat.com> > > + */ > > + > > +#include <errno.h> > > +#include <sys/uio.h> > > + > > +#include "util.h" > > +#include "ip.h" > > +#include "passt.h" > > +#include "inany.h" > > +#include "flow.h" > > +#include "flow_table.h" > > + > > +#include "migrate.h" > > + > > +/* Current version of migration data */ > > +#define MIGRATE_VERSION 1 > > + > > +/* Magic as we see it and as seen with reverse endianness */ > > +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0 > > +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1 > > As noted, I'm hoping we can get rid of "either endian" migration. But > if this stays, we should define it using __bswap_constant_32() to > avoid embarrassing mistakes. Those always give me issues on musl,What sort of issues? 
We're already using them, and have fallback versions defined in util.h[snip]Well, of course, I meant that we would only transfer used flows at that point, because it's not about transferring the flow table as a whole, with none of the advantages and disadvantages of it. But still one can have 128k flows at the moment.Just transferring the in-use flows would be higher priority than being selective about what we send within each flow.Sure, I can add a comment.> > +/** > > + * union migrate_header - Migration header from source > > + * @magic: 0xB1BB1D1B0BB1D1B0, host order > > + * @version: Source sends highest known, target aborts if unsupported > > + * @voidp_size: sizeof(void *), network order > > + * @time_t_size: sizeof(time_t), network order > > + * @flow_size: sizeof(union flow), network order > > + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order > > + * @unused: Go figure > > + */ > > +union migrate_header { > > + struct { > > + uint64_t magic; > > + uint32_t version; > > + uint32_t voidp_size; > > + uint32_t time_t_size; > > + uint32_t flow_size; > > + uint32_t flow_sidx_size; > > + }; > > + uint8_t unused[65536]; > > So, having looked at this, I no longer think padding the header to 64kiB > is a good idea. The structure means we're basically stuck always > having that chunky header. Instead, I think the header should be > absolutely minimal: basically magic and version only. v1 (and maybe > others) can add a "metadata" or whatever section for additional > information like this they need. The header is processed by the target in a separate, preliminary step, though. That's why I added metadata right in the header: if the target needs to abort the migration because, say, the size of a flow entry is too big to handle for a particular version, then we should know that before migrate_target_pre().Ah, yes, I missed that, we'd need a more complex design to do additional transfers and checks before making the target_pre callbacks.As long as we check the version first, we can always shrink the header later on.*thinks*.. I guess so, though it's kind of awkward; a future version would have to read the "header of the header", check the version, then if it's the old one, read the remainder of the 64kiB block. I still think we should clearly separate the part that we're committing to being in every future version (which I think should just be magic and version), from the stuff that's just v1.Even if we transfer just what we need of a flow, it's still something well in excess of 50 bytes each. 100k flows would be 5 megs.But having 64 KiB reserved looks more robust because it's a safe place to add this kind of metadata. Note that 64 KiB is typically transferred in a single read/write from/to the vhost-user back-end.Ok, but it also has to go over the qemu migration channel, which will often be a physical link, not a super-fast local/virtual one, and may be bandwidth capped as well. I'm not actually certain if 64kiB is likely to be a problem there, but it *is* large compared to the state blobs of most qemu devices (usually only a few hundred bytes).It's both easier to do and a bigger win in most cases. That would dramatically reduce the size sent here.Yep, feel free.Same here. -- StefanoWell, anyway, let's cut this down to 4k, which should be enough, so that it's not a topic anymore.I still think it's ugly, but whatever.
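Since the only missing piece on musl is the constant-folding helper, a local 64-bit constant swap is also an option and stays readable. Sketch below, with made-up macro names, just to show it reproduces the swapped literal:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fallback for libcs without __bswap_constant_64() */
#define BSWAP_CONST_64(x)						\
	((((uint64_t)(x) & 0x00000000000000ffULL) << 56) |		\
	 (((uint64_t)(x) & 0x000000000000ff00ULL) << 40) |		\
	 (((uint64_t)(x) & 0x0000000000ff0000ULL) << 24) |		\
	 (((uint64_t)(x) & 0x00000000ff000000ULL) <<  8) |		\
	 (((uint64_t)(x) & 0x000000ff00000000ULL) >>  8) |		\
	 (((uint64_t)(x) & 0x0000ff0000000000ULL) >> 24) |		\
	 (((uint64_t)(x) & 0x00ff000000000000ULL) >> 40) |		\
	 (((uint64_t)(x) & 0xff00000000000000ULL) >> 56))

#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0ULL
#define MIGRATE_MAGIC_SWAPPED	BSWAP_CONST_64(MIGRATE_MAGIC)

int main(void)
{
	/* Prints b0d1b10b1b1dbbb1, matching the literal in the patch */
	printf("%016" PRIx64 "\n", (uint64_t)MIGRATE_MAGIC_SWAPPED);
	return 0;
}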
On Thu, Jan 30, 2025 at 05:55:22AM +0100, Stefano Brivio wrote:On Thu, 30 Jan 2025 11:48:19 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Very well.On Wed, Jan 29, 2025 at 08:33:50AM +0100, Stefano Brivio wrote:...which doesn't exist on musl. On current Alpine Edge: util.h:131:34: error: implicit declaration of function '__bswap_constant_64' [-Wimplicit-function-declaration] 131 | #define htonll_constant(x) (__bswap_constant_64(x)) | ^~~~~~~~~~~~~~~~~~~ ...so rather than adding an implementation for this single usage, which makes it actually less clear to me, I would keep it like it is.On Wed, 29 Jan 2025 12:16:58 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Oh, yes, we'd need to add a __bswap_constant_64() for this.On Tue, Jan 28, 2025 at 07:50:01AM +0100, Stefano Brivio wrote: > On Tue, 28 Jan 2025 12:40:12 +1100 > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > On Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote: > > > Add two sets (source or target) of three functions each for passt in > > > vhost-user mode, triggered by activity on the file descriptor passed > > > via VHOST_USER_PROTOCOL_F_DEVICE_STATE: > > > > > > - migrate_source_pre() and migrate_target_pre() are called to prepare > > > for migration, before data is transferred > > > > > > - migrate_source() sends, and migrate_target() receives migration data > > > > > > - migrate_source_post() and migrate_target_post() are responsible for > > > any post-migration task > > > > > > Callbacks are added to these functions with arrays of function > > > pointers in migrate.c. Migration handlers are versioned. > > > > > > Versioned descriptions of data sections will be added to the > > > data_versions array, which points to versioned iovec arrays. Version > > > 1 is currently empty and will be filled in in subsequent patches. > > > > > > The source announces the data version to be used and informs the peer > > > about endianness, and the size of void *, time_t, flow entries and > > > flow hash table entries. > > > > > > The target checks if the version of the source is still supported. If > > > it's not, it aborts the migration. 
> > > > > > Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> > > > --- > > > Makefile | 12 +-- > > > migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > migrate.h | 90 ++++++++++++++++++ > > > passt.c | 2 +- > > > vu_common.c | 122 ++++++++++++++++--------- > > > vu_common.h | 2 +- > > > 6 files changed, 438 insertions(+), 49 deletions(-) > > > create mode 100644 migrate.c > > > create mode 100644 migrate.h > > > > > > diff --git a/Makefile b/Makefile > > > index 464eef1..1383875 100644 > > > --- a/Makefile > > > +++ b/Makefile > > > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS) > > > > > > PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ > > > icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \ > > > - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \ > > > - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > > > + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \ > > > + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > > > vhost_user.c virtio.c vu_common.c > > > QRAP_SRCS = qrap.c > > > SRCS = $(PASST_SRCS) $(QRAP_SRCS) > > > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1 > > > > > > PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ > > > flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ > > > - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ > > > - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ > > > - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \ > > > - virtio.h vu_common.h > > > + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \ > > > + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \ > > > + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \ > > > + vhost_user.h virtio.h vu_common.h > > > HEADERS = $(PASST_HEADERS) seccomp.h > > > > > > C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);} > > > diff --git a/migrate.c b/migrate.c > > > new file mode 100644 > > > index 0000000..bee9653 > > > --- /dev/null > > > +++ b/migrate.c > > > @@ -0,0 +1,259 @@ > > > +// SPDX-License-Identifier: GPL-2.0-or-later > > > + > > > +/* PASST - Plug A Simple Socket Transport > > > + * for qemu/UNIX domain socket mode > > > + * > > > + * PASTA - Pack A Subtle Tap Abstraction > > > + * for network namespace/tap device mode > > > + * > > > + * migrate.c - Migration sections, layout, and routines > > > + * > > > + * Copyright (c) 2025 Red Hat GmbH > > > + * Author: Stefano Brivio <sbrivio(a)redhat.com> > > > + */ > > > + > > > +#include <errno.h> > > > +#include <sys/uio.h> > > > + > > > +#include "util.h" > > > +#include "ip.h" > > > +#include "passt.h" > > > +#include "inany.h" > > > +#include "flow.h" > > > +#include "flow_table.h" > > > + > > > +#include "migrate.h" > > > + > > > +/* Current version of migration data */ > > > +#define MIGRATE_VERSION 1 > > > + > > > +/* Magic as we see it and as seen with reverse endianness */ > > > +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0 > > > +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1 > > > > As noted, I'm hoping we can get rid of "either endian" migration. But > > if this stays, we should define it using __bswap_constant_32() to > > avoid embarrassing mistakes. > > Those always give me issues on musl, What sort of issues? 
We're already using them, and have fallback versions defined in util.hThe very issues that brought me to introduce those fallback versions, so I'm instinctively reluctant to use them. Actually, I think it's even clearer to have this spelt out (I always need to stop for a moment and think: what happens when I cross the 32-bit boundary?).Right, but in the present draft you pay that cost whether or not you're actually using the flows. Unfortunately a busy server with heaps of active connections is exactly the case that's likely to be most sensitve to additional downtime, but there's not really any getting around that. A machine with a lot of state will need either high downtime or high migration bandwidth. But, I'm really hoping we can move relatively quickly to a model where a guest with only a handful of connections _doesn't_ have to pay that 128k flow cost - and can consequently migrate ok even with quite constrained migration bandwidth. In that scenario the size of the header could become significant.[snip]Well, of course, I meant that we would only transfer used flows at that point, because it's not about transferring the flow table as a whole, with none of the advantages and disadvantages of it. But still one can have 128k flows at the moment.Just transferring the in-use flows would be higher priority than being selective about what we send within each flow.> > > +/** > > > + * union migrate_header - Migration header from source > > > + * @magic: 0xB1BB1D1B0BB1D1B0, host order > > > + * @version: Source sends highest known, target aborts if unsupported > > > + * @voidp_size: sizeof(void *), network order > > > + * @time_t_size: sizeof(time_t), network order > > > + * @flow_size: sizeof(union flow), network order > > > + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order > > > + * @unused: Go figure > > > + */ > > > +union migrate_header { > > > + struct { > > > + uint64_t magic; > > > + uint32_t version; > > > + uint32_t voidp_size; > > > + uint32_t time_t_size; > > > + uint32_t flow_size; > > > + uint32_t flow_sidx_size; > > > + }; > > > + uint8_t unused[65536]; > > > > So, having looked at this, I no longer think padding the header to 64kiB > > is a good idea. The structure means we're basically stuck always > > having that chunky header. Instead, I think the header should be > > absolutely minimal: basically magic and version only. v1 (and maybe > > others) can add a "metadata" or whatever section for additional > > information like this they need. > > The header is processed by the target in a separate, preliminary step, > though. > > That's why I added metadata right in the header: if the target needs to > abort the migration because, say, the size of a flow entry is too big > to handle for a particular version, then we should know that before > migrate_target_pre(). Ah, yes, I missed that, we'd need a more complex design to do additional transfers and checks before making the target_pre callbacks. > As long as we check the version first, we can always shrink the header > later on. *thinks*.. I guess so, though it's kind of awkward; a future version would have to read the "header of the header", check the version, then if it's the old one, read the remainder of the 64kiB block. 
I still think we should clearly separate the part that we're committing to being in every future version (which I think should just be magic and version), from the stuff that's just v1.Sure, I can add a comment.> But having 64 KiB reserved looks more robust because it's a > safe place to add this kind of metadata. > > Note that 64 KiB is typically transferred in a single read/write > from/to the vhost-user back-end. Ok, but it also has to go over the qemu migration channel, which will often be a physical link, not a super-fast local/virtual one, and may be bandwidth capped as well. I'm not actually certain if 64kiB is likely to be a problem there, but it *is* large compared to the state blobs of most qemu devices (usually only a few hundred bytes).Even if we transfer just what we need of a flow, it's still something well in excess of 50 bytes each. 100k flows would be 5 megs.It's on my queue for the next few days.It's both easier to do and a bigger win in most cases. That would dramatically reduce the size sent here.Yep, feel free.-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibsonSame here.Well, anyway, let's cut this down to 4k, which should be enough, so that it's not a topic anymore.I still think it's ugly, but whatever.
On Thu, 30 Jan 2025 18:38:22 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Right, but in the present draft you pay that cost whether or not you're actually using the flows. Unfortunately a busy server with heaps of active connections is exactly the case that's likely to be most sensitve to additional downtime, but there's not really any getting around that. A machine with a lot of state will need either high downtime or high migration bandwidth.It's... sixteen megabytes. A KubeVirt node is only allowed to perform up to _four_ migrations in parallel, and that's our main use case at the moment. "High downtime" is kind of relative.But, I'm really hoping we can move relatively quickly to a model where a guest with only a handful of connections _doesn't_ have to pay that 128k flow cost - and can consequently migrate ok even with quite constrained migration bandwidth. In that scenario the size of the header could become significant.I think the biggest cost of the full flow table transfer is rather code that's a bit quicker to write (I just managed to properly set sequences on the target, connections don't quite "flow" yet) but relatively high maintenance (as you mentioned, we need to be careful about every single field) and easy to break. I would like to quickly complete the whole flow first, because I think we can inform design and implementation decisions much better at that point, and we can be sure it's feasible, but I'm not particularly keen to merge this patch like it is, if we can switch it relatively swiftly to an implementation where we model a smaller fixed-endian structure with just the stuff we need. And again, to be a bit more sure of which stuff we need in it, the full flow is useful to have implemented. Actually the biggest complications I see in switching to that approach, from the current point, are that we need to, I guess: 1. model arrays (not really complicated by itself) 2. have a temporary structure where we store flows instead of using the flow table directly (meaning that the "data model" needs to logically decouple source and destination of the copy) 3. batch stuff to some extent. We'll call socket() and connect() once for each socket anyway, obviously, but sending one message to the TCP_REPAIR helper for each socket looks like a rather substantial and avoidable overheadTo me this part actually looks like the biggest priority after/while getting the whole thing to work, because we can start right with a 'v1' which looks more sustainable. And I would just get stuff working on x86_64 in that case, without even implementing conversions and endianness switches etc. -- StefanoIt's on my queue for the next few days.It's both easier to do and a bigger win in most cases. That would dramatically reduce the size sent here.Yep, feel free.
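To give the "smaller fixed-endian structure with just the stuff we need" a concrete shape, a v1 per-flow wire record might start out along these lines. Everything below is a guess at what ends up being needed, not an actual layout; per-type extras (further TCP sequence details, window state, and so on) would follow as their own sections:

#include <stdint.h>

/* One flow on the wire: fixed width, network order, no pointers, no
 * time_t, no hash table state */
struct migrate_flow_v1 {
	uint8_t		type;		/* TCP, TCP splice, UDP, ping, ... */
	uint8_t		pif[2];		/* interface for each side */
	uint8_t		reserved;
	uint8_t		addr[2][16];	/* IPv4-mapped or IPv6, per side */
	uint16_t	port[2];	/* network order */
	uint32_t	seq_to_tap;	/* TCP only, network order */
	uint32_t	seq_from_tap;	/* TCP only, network order */
} __attribute__((packed));

_Static_assert(sizeof(struct migrate_flow_v1) == 48, "fixed wire size");

Modelling the transfer as a count plus an array of such records would also remove the dependency on the flow table size, since table indices are no longer part of the format.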
On Thu, Jan 30, 2025 at 09:32:36AM +0100, Stefano Brivio wrote:On Thu, 30 Jan 2025 18:38:22 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Certainly. But I believe it's typical to aim for downtimes in the ~100ms range.Right, but in the present draft you pay that cost whether or not you're actually using the flows. Unfortunately a busy server with heaps of active connections is exactly the case that's likely to be most sensitve to additional downtime, but there's not really any getting around that. A machine with a lot of state will need either high downtime or high migration bandwidth.It's... sixteen megabytes. A KubeVirt node is only allowed to perform up to _four_ migrations in parallel, and that's our main use case at the moment. "High downtime" is kind of relative.Right. And with this draft we can't even change the size of the flow table without breaking migration. That seems like a thing we might well want to change.But, I'm really hoping we can move relatively quickly to a model where a guest with only a handful of connections _doesn't_ have to pay that 128k flow cost - and can consequently migrate ok even with quite constrained migration bandwidth. In that scenario the size of the header could become significant.I think the biggest cost of the full flow table transfer is rather code that's a bit quicker to write (I just managed to properly set sequences on the target, connections don't quite "flow" yet) but relatively high maintenance (as you mentioned, we need to be careful about every single field) and easy to break.I would like to quickly complete the whole flow first, because I think we can inform design and implementation decisions much better at that point, and we can be sure it's feasible,That's fair.but I'm not particularly keen to merge this patch like it is, if we can switch it relatively swiftly to an implementation where we model a smaller fixed-endian structure with just the stuff we need.So, there are kind of two parts to this: 1) Only transferring active flow entries, and not transferring the hash table I think this is pretty easy. It could be done with or without preserving flow indicies. Preserving makes for debug log continuity between the ends, but not preserving lets us change the size of the flow table without breaking migration. 2) Only transferring the necessary pieces of each entry, and using a fixed representation of each piece This is harder. Not *super* hard, I think, but definitely trickier than (1)And again, to be a bit more sure of which stuff we need in it, the full flow is useful to have implemented. Actually the biggest complications I see in switching to that approach, from the current point, are that we need to, I guess: 1. model arrays (not really complicated by itself)So here, I actually think this is simpler if we don't attempt to have a declarative approach to defining the protocol, but just write functions to implement it.2. have a temporary structure where we store flows instead of using the flow table directly (meaning that the "data model" needs to logically decouple source and destination of the copy)Right.. I'd really prefer to "stream" in the entries one by one, rather than having a big staging area. That's even harder to do declaratively, but I think the other advantages are worth it.3. batch stuff to some extent. 
We'll call socket() and connect() once for each socket anyway, obviously, but sending one message to the TCP_REPAIR helper for each socket looks like a rather substantial and avoidable overheadI don't think this actually has a lot of bearing on the protocol. I'd envisage migrate_target() decodes all the information into the target's flow table, then migrate_target_post() steps through all the flows re-establishing the connections. Since we've already parsed the protocol at that point, we can make multiple passes: one to gather batches and set TCP_REPAIR, another through each entry to set the values, and a final one to clear TCP_REPAIR in batches.Right. Given the number of options here, I think it would be safest to go in expecting to go through a few throwaway protocol versions before reaching one we're happy enough to support long term. To ease that process, I'm wondering if we should, add a default-off command line option to enable migration. For now, enabling it would print some sort of "migration is experimental!" warning. Once we have a stream format we're ok with, we can flip it to on-by-default, but we don't maintain receive compatibility for the experimental versions leading up to that. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibsonTo me this part actually looks like the biggest priority after/while getting the whole thing to work, because we can start right with a 'v1' which looks more sustainable. And I would just get stuff working on x86_64 in that case, without even implementing conversions and endianness switches etc.It's on my queue for the next few days.It's both easier to do and a bigger win in most cases. That would dramatically reduce the size sent here.Yep, feel free.
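The batching part of that is easy to sketch in isolation. Assuming, purely for illustration, a process that may call setsockopt(TCP_REPAIR) itself (in passt's case that is what the helper exists for, since the option needs CAP_NET_ADMIN), the first and last passes collapse into one small function:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef TCP_REPAIR
#define TCP_REPAIR 19			/* value from linux/tcp.h */
#endif

/* Passes 1 and 3: flip repair mode for a whole batch of sockets; the
 * per-socket restore (sequences, window, ...) happens in a separate
 * pass over the decoded flow table in between */
static int repair_batch(const int *fds, size_t n, int on)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (setsockopt(fds[i], IPPROTO_TCP, TCP_REPAIR,
			       &on, sizeof(on)))
			return -1;
	}

	return 0;
}

migrate_target_post() could then amount to repair_batch(fds, n, 1), the per-flow restore loop, and a final repair_batch(fds, n, 0), which is also the natural point to batch requests to the helper instead of sending one message per socket.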
On Thu, 30 Jan 2025 19:54:17 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Thu, Jan 30, 2025 at 09:32:36AM +0100, Stefano Brivio wrote:Why? The size of the flow table hasn't changed since it was added. I don't see a reason to improve this if we don't want to transfer the flow table anyway.On Thu, 30 Jan 2025 18:38:22 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Certainly. But I believe it's typical to aim for downtimes in the ~100ms range.Right, but in the present draft you pay that cost whether or not you're actually using the flows. Unfortunately a busy server with heaps of active connections is exactly the case that's likely to be most sensitve to additional downtime, but there's not really any getting around that. A machine with a lot of state will need either high downtime or high migration bandwidth.It's... sixteen megabytes. A KubeVirt node is only allowed to perform up to _four_ migrations in parallel, and that's our main use case at the moment. "High downtime" is kind of relative.Right. And with this draft we can't even change the size of the flow table without breaking migration. That seems like a thing we might well want to change.But, I'm really hoping we can move relatively quickly to a model where a guest with only a handful of connections _doesn't_ have to pay that 128k flow cost - and can consequently migrate ok even with quite constrained migration bandwidth. In that scenario the size of the header could become significant.I think the biggest cost of the full flow table transfer is rather code that's a bit quicker to write (I just managed to properly set sequences on the target, connections don't quite "flow" yet) but relatively high maintenance (as you mentioned, we need to be careful about every single field) and easy to break.I would just add prints on migration showing how old flow indices map to new ones.I would like to quickly complete the whole flow first, because I think we can inform design and implementation decisions much better at that point, and we can be sure it's feasible,That's fair.but I'm not particularly keen to merge this patch like it is, if we can switch it relatively swiftly to an implementation where we model a smaller fixed-endian structure with just the stuff we need.So, there are kind of two parts to this: 1) Only transferring active flow entries, and not transferring the hash table I think this is pretty easy. It could be done with or without preserving flow indicies. Preserving makes for debug log continuity between the ends, but not preserving lets us change the size of the flow table without breaking migration.2) Only transferring the necessary pieces of each entry, and using a fixed representation of each piece This is harder. Not *super* hard, I think, but definitely trickier than (1)Ah, right, I didn't think of using the target flow table directly. That has the advantage that the current code I'm writing to reactivate flows from the flow table can be recycled as it is.And again, to be a bit more sure of which stuff we need in it, the full flow is useful to have implemented. Actually the biggest complications I see in switching to that approach, from the current point, are that we need to, I guess: 1. model arrays (not really complicated by itself)So here, I actually think this is simpler if we don't attempt to have a declarative approach to defining the protocol, but just write functions to implement it.2. 
have a temporary structure where we store flows instead of using the flow table directly (meaning that the "data model" needs to logically decouple source and destination of the copy)Right.. I'd really prefer to "stream" in the entries one by one, rather than having a big staging area. That's even harder to do declaratively, but I think the other advantages are worth it.3. batch stuff to some extent. We'll call socket() and connect() once for each socket anyway, obviously, but sending one message to the TCP_REPAIR helper for each socket looks like a rather substantial and avoidable overheadI don't think this actually has a lot of bearing on the protocol. I'd envisage migrate_target() decodes all the information into the target's flow table, then migrate_target_post() steps through all the flows re-establishing the connections. Since we've already parsed the protocol at that point, we can make multiple passes: one to gather batches and set TCP_REPAIR, another through each entry to set the values, and a final one to clear TCP_REPAIR in batches.It looks like unnecessary code churn to me. It doesn't need to be merged if it's work in progress. You can also push stuff to a temporary branch if needed. It can also be merged and not documented for a while, as long as it doesn't break existing functionality. -- StefanoRight. Given the number of options here, I think it would be safest to go in expecting to go through a few throwaway protocol versions before reaching one we're happy enough to support long term. To ease that process, I'm wondering if we should, add a default-off command line option to enable migration. For now, enabling it would print some sort of "migration is experimental!" warning. Once we have a stream format we're ok with, we can flip it to on-by-default, but we don't maintain receive compatibility for the experimental versions leading up to that.To me this part actually looks like the biggest priority after/while getting the whole thing to work, because we can start right with a 'v1' which looks more sustainable. And I would just get stuff working on x86_64 in that case, without even implementing conversions and endianness switches etc.> It's both easier to do > and a bigger win in most cases. That would dramatically reduce the > size sent here. Yep, feel free.It's on my queue for the next few days.
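Streaming records straight into the target's flow table, with no staging area, needs little more than a per-record read loop. Toy sketch (record type and function are invented for illustration, error handling trimmed):

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

struct toy_record {			/* stand-in for a v1 wire record */
	uint32_t dummy;
};

/* Read and decode one record at a time, filling the target's own flow
 * table as each record arrives, so the whole table never exists twice */
static int toy_recv_flows(int fd, uint32_t count)
{
	struct toy_record rec;
	uint32_t i;

	for (i = 0; i < count; i++) {
		size_t got;
		ssize_t n;

		for (got = 0; got < sizeof(rec); got += n) {
			n = read(fd, (char *)&rec + got, sizeof(rec) - got);
			if (n <= 0)
				return n ? errno : EIO;
		}

		/* allocate a free flow table slot here and fill it from
		 * rec, converting fields back to host order */
	}

	return 0;
}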
On Fri, Jan 31, 2025 at 06:46:21AM +0100, Stefano Brivio wrote:On Thu, 30 Jan 2025 19:54:17 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Which wasn't _that_ long ago. It just seems like a really obvious constant to tune to me, and one which it would be surprising if it broken migration.On Thu, Jan 30, 2025 at 09:32:36AM +0100, Stefano Brivio wrote:Why? The size of the flow table hasn't changed since it was added. IOn Thu, 30 Jan 2025 18:38:22 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Certainly. But I believe it's typical to aim for downtimes in the ~100ms range.Right, but in the present draft you pay that cost whether or not you're actually using the flows. Unfortunately a busy server with heaps of active connections is exactly the case that's likely to be most sensitve to additional downtime, but there's not really any getting around that. A machine with a lot of state will need either high downtime or high migration bandwidth.It's... sixteen megabytes. A KubeVirt node is only allowed to perform up to _four_ migrations in parallel, and that's our main use case at the moment. "High downtime" is kind of relative.Right. And with this draft we can't even change the size of the flow table without breaking migration. That seems like a thing we might well want to change.But, I'm really hoping we can move relatively quickly to a model where a guest with only a handful of connections _doesn't_ have to pay that 128k flow cost - and can consequently migrate ok even with quite constrained migration bandwidth. In that scenario the size of the header could become significant.I think the biggest cost of the full flow table transfer is rather code that's a bit quicker to write (I just managed to properly set sequences on the target, connections don't quite "flow" yet) but relatively high maintenance (as you mentioned, we need to be careful about every single field) and easy to break.don't see a reason to improve this if we don't want to transfer the flow table anyway.I don't follow. Do you mean not transferring the hash table? This is not relevant to that, I'm talking about the size of the base flow table, not the hash table. Or do you mean not transferring the flow table as a whole, but rather entry by entry? In that case I'm seeing it as exactly the mechanism to improve this.That's possible, although it would mean transferring the old indices, which is not otherwise strictly necessary. What we could do easily is a debug log similar to the "new flow" logs but for "immigrated flow".I would just add prints on migration showing how old flow indices map to new ones.I would like to quickly complete the whole flow first, because I think we can inform design and implementation decisions much better at that point, and we can be sure it's feasible,That's fair.but I'm not particularly keen to merge this patch like it is, if we can switch it relatively swiftly to an implementation where we model a smaller fixed-endian structure with just the stuff we need.So, there are kind of two parts to this: 1) Only transferring active flow entries, and not transferring the hash table I think this is pretty easy. It could be done with or without preserving flow indicies. Preserving makes for debug log continuity between the ends, but not preserving lets us change the size of the flow table without breaking migration.Possibly - it might need to do some slightly different things: regenerating some fields from redundant data maybe, and/or re-hashing the entries. 
But certainly the structure should be similar, yes.2) Only transferring the necessary pieces of each entry, and using a fixed representation of each piece This is harder. Not *super* hard, I think, but definitely trickier than (1)Ah, right, I didn't think of using the target flow table directly. That has the advantage that the current code I'm writing to reactivate flows from the flow table can be recycled as it is.And again, to be a bit more sure of which stuff we need in it, the full flow is useful to have implemented. Actually the biggest complications I see in switching to that approach, from the current point, are that we need to, I guess: 1. model arrays (not really complicated by itself)So here, I actually think this is simpler if we don't attempt to have a declarative approach to defining the protocol, but just write functions to implement it.2. have a temporary structure where we store flows instead of using the flow table directly (meaning that the "data model" needs to logically decouple source and destination of the copy)Right.. I'd really prefer to "stream" in the entries one by one, rather than having a big staging area. That's even harder to do declaratively, but I think the other advantages are worth it.3. batch stuff to some extent. We'll call socket() and connect() once for each socket anyway, obviously, but sending one message to the TCP_REPAIR helper for each socket looks like a rather substantial and avoidable overheadI don't think this actually has a lot of bearing on the protocol. I'd envisage migrate_target() decodes all the information into the target's flow table, then migrate_target_post() steps through all the flows re-establishing the connections. Since we've already parsed the protocol at that point, we can make multiple passes: one to gather batches and set TCP_REPAIR, another through each entry to set the values, and a final one to clear TCP_REPAIR in batches.Eh, just thought merging might save us some rebase work against any other pressing changes we need.It looks like unnecessary code churn to me. It doesn't need to be merged if it's work in progress. You can also push stuff to a temporary branch if needed.Right. Given the number of options here, I think it would be safest to go in expecting to go through a few throwaway protocol versions before reaching one we're happy enough to support long term. To ease that process, I'm wondering if we should, add a default-off command line option to enable migration. For now, enabling it would print some sort of "migration is experimental!" warning. Once we have a stream format we're ok with, we can flip it to on-by-default, but we don't maintain receive compatibility for the experimental versions leading up to that.> > It's both easier to do > > and a bigger win in most cases. That would dramatically reduce the > > size sent here. > > Yep, feel free. It's on my queue for the next few days.To me this part actually looks like the biggest priority after/while getting the whole thing to work, because we can start right with a 'v1' which looks more sustainable. And I would just get stuff working on x86_64 in that case, without even implementing conversions and endianness switches etc.It can also be merged and not documented for a while, as long as it doesn't break existing functionality.I'd be a bit cautious about this. AIUI, right now if you attempt to migrate, qemu will simply fail it because we don't respond to the migration commands. 
Having enough merged that qemu won't outright fail the migration, but it won't reliably work seems like a bad idea. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Fri, 31 Jan 2025 17:32:05 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Fri, Jan 31, 2025 at 06:46:21AM +0100, Stefano Brivio wrote:I mean transferring the flow table entry by entry. I think we need to jump to that directly, because of something else I just found: if we don't transfer one struct tcp_repair_window, the connection works at a basic level but we don't really comply with RFC 9293 anymore, I think, because we might shrink the window. And that struct is 20 bytes. We just reached 124 bytes with the socket-side sequence numbers, so now we would exceed 128 bytes.On Thu, 30 Jan 2025 19:54:17 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Which wasn't _that_ long ago. It just seems like a really obvious constant to tune to me, and one which it would be surprising if it broken migration.On Thu, Jan 30, 2025 at 09:32:36AM +0100, Stefano Brivio wrote:Why? The size of the flow table hasn't changed since it was added. IOn Thu, 30 Jan 2025 18:38:22 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote: > Right, but in the present draft you pay that cost whether or not > you're actually using the flows. Unfortunately a busy server with > heaps of active connections is exactly the case that's likely to be > most sensitve to additional downtime, but there's not really any > getting around that. A machine with a lot of state will need either > high downtime or high migration bandwidth. It's... sixteen megabytes. A KubeVirt node is only allowed to perform up to _four_ migrations in parallel, and that's our main use case at the moment. "High downtime" is kind of relative.Certainly. But I believe it's typical to aim for downtimes in the ~100ms range.> But, I'm really hoping we can move relatively quickly to a model where > a guest with only a handful of connections _doesn't_ have to pay that > 128k flow cost - and can consequently migrate ok even with quite > constrained migration bandwidth. In that scenario the size of the > header could become significant. I think the biggest cost of the full flow table transfer is rather code that's a bit quicker to write (I just managed to properly set sequences on the target, connections don't quite "flow" yet) but relatively high maintenance (as you mentioned, we need to be careful about every single field) and easy to break.Right. And with this draft we can't even change the size of the flow table without breaking migration. That seems like a thing we might well want to change.don't see a reason to improve this if we don't want to transfer the flow table anyway.I don't follow. Do you mean not transferring the hash table? This is not relevant to that, I'm talking about the size of the base flow table, not the hash table. Or do you mean not transferring the flow table as a whole, but rather entry by entry? In that case I'm seeing it as exactly the mechanism to improve this.Transferring them if we have a dedicated structure shouldn't be that bad: we don't need to store them.That's possible, although it would mean transferring the old indices, which is not otherwise strictly necessary. 
What we could do easily is a debug log similar to the "new flow" logs but for "immigrated flow".I would just add prints on migration showing how old flow indices map to new ones.I would like to quickly complete the whole flow first, because I think we can inform design and implementation decisions much better at that point, and we can be sure it's feasible,That's fair.but I'm not particularly keen to merge this patch like it is, if we can switch it relatively swiftly to an implementation where we model a smaller fixed-endian structure with just the stuff we need.So, there are kind of two parts to this: 1) Only transferring active flow entries, and not transferring the hash table I think this is pretty easy. It could be done with or without preserving flow indicies. Preserving makes for debug log continuity between the ends, but not preserving lets us change the size of the flow table without breaking migration.Not really: we transfer "PASST" at the moment, and the migration succeeds. You can start 'ping' in the source and see it continue flawlessly in the target. Sure, TCP connections break altogether instead of the okayish migration implemented by v3 I'm posting in a bit... -- StefanoPossibly - it might need to do some slightly different things: regenerating some fields from redundant data maybe, and/or re-hashing the entries. But certainly the structure should be similar, yes.2) Only transferring the necessary pieces of each entry, and using a fixed representation of each piece This is harder. Not *super* hard, I think, but definitely trickier than (1)Ah, right, I didn't think of using the target flow table directly. That has the advantage that the current code I'm writing to reactivate flows from the flow table can be recycled as it is.And again, to be a bit more sure of which stuff we need in it, the full flow is useful to have implemented. Actually the biggest complications I see in switching to that approach, from the current point, are that we need to, I guess: 1. model arrays (not really complicated by itself)So here, I actually think this is simpler if we don't attempt to have a declarative approach to defining the protocol, but just write functions to implement it.2. have a temporary structure where we store flows instead of using the flow table directly (meaning that the "data model" needs to logically decouple source and destination of the copy)Right.. I'd really prefer to "stream" in the entries one by one, rather than having a big staging area. That's even harder to do declaratively, but I think the other advantages are worth it.3. batch stuff to some extent. We'll call socket() and connect() once for each socket anyway, obviously, but sending one message to the TCP_REPAIR helper for each socket looks like a rather substantial and avoidable overheadI don't think this actually has a lot of bearing on the protocol. I'd envisage migrate_target() decodes all the information into the target's flow table, then migrate_target_post() steps through all the flows re-establishing the connections. Since we've already parsed the protocol at that point, we can make multiple passes: one to gather batches and set TCP_REPAIR, another through each entry to set the values, and a final one to clear TCP_REPAIR in batches.Eh, just thought merging might save us some rebase work against any other pressing changes we need.It looks like unnecessary code churn to me. It doesn't need to be merged if it's work in progress. 
You can also push stuff to a temporary branch if needed.

> > > It's both easier to do
> > > and a bigger win in most cases. That would dramatically reduce the
> > > size sent here.
> >
> > Yep, feel free.
>
> It's on my queue for the next few days.

To me this part actually looks like the biggest priority after/while getting the whole thing to work, because we can start right with a 'v1' which looks more sustainable.

And I would just get stuff working on x86_64 in that case, without even implementing conversions and endianness switches etc.

Right. Given the number of options here, I think it would be safest to go in expecting to go through a few throwaway protocol versions before reaching one we're happy enough to support long term. To ease that process, I'm wondering if we should add a default-off command line option to enable migration. For now, enabling it would print some sort of "migration is experimental!" warning. Once we have a stream format we're ok with, we can flip it to on-by-default, but we don't maintain receive compatibility for the experimental versions leading up to that.

It can also be merged and not documented for a while, as long as it doesn't break existing functionality.

I'd be a bit cautious about this. AIUI, right now if you attempt to migrate, qemu will simply fail it because we don't respond to the migration commands. Having enough merged that qemu won't outright fail the migration, but it won't reliably work, seems like a bad idea.
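For reference, the struct tcp_repair_window mentioned earlier in this exchange is the five-u32 (20-byte) window state from linux/tcp.h. A minimal sketch of saving it on the source and restoring it on the target could look like the following, assuming the socket is already in repair mode and leaving error handling out:

---
/* Sketch: save struct tcp_repair_window on the source and restore it
 * on the target. Assumes the socket is already in TCP_REPAIR mode;
 * error handling omitted.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>	/* struct tcp_repair_window, TCP_REPAIR_WINDOW */

int tcp_window_save(int s, struct tcp_repair_window *wnd)
{
	socklen_t len = sizeof(*wnd);	/* snd_wl1, snd_wnd, max_window,
					 * rcv_wnd, rcv_wup: 20 bytes */

	return getsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, wnd, &len);
}

int tcp_window_restore(int s, const struct tcp_repair_window *wnd)
{
	return setsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, wnd,
			  sizeof(*wnd));
}
---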
On Fri, Jan 31, 2025 at 10:09:31AM +0100, Stefano Brivio wrote:On Fri, 31 Jan 2025 17:32:05 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Ok. This is next on my list.On Fri, Jan 31, 2025 at 06:46:21AM +0100, Stefano Brivio wrote:I mean transferring the flow table entry by entry. I think we need to jump to that directly, because of something else I just found: if we don't transfer one struct tcp_repair_window, the connection works at a basic level but we don't really comply with RFC 9293 anymore, I think, because we might shrink the window. And that struct is 20 bytes. We just reached 124 bytes with the socket-side sequence numbers, so now we would exceed 128 bytes.On Thu, 30 Jan 2025 19:54:17 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Which wasn't _that_ long ago. It just seems like a really obvious constant to tune to me, and one which it would be surprising if it broken migration.On Thu, Jan 30, 2025 at 09:32:36AM +0100, Stefano Brivio wrote: > On Thu, 30 Jan 2025 18:38:22 +1100 > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > Right, but in the present draft you pay that cost whether or not > > you're actually using the flows. Unfortunately a busy server with > > heaps of active connections is exactly the case that's likely to be > > most sensitve to additional downtime, but there's not really any > > getting around that. A machine with a lot of state will need either > > high downtime or high migration bandwidth. > > It's... sixteen megabytes. A KubeVirt node is only allowed to perform up > to _four_ migrations in parallel, and that's our main use case at the > moment. "High downtime" is kind of relative. Certainly. But I believe it's typical to aim for downtimes in the ~100ms range. > > But, I'm really hoping we can move relatively quickly to a model where > > a guest with only a handful of connections _doesn't_ have to pay that > > 128k flow cost - and can consequently migrate ok even with quite > > constrained migration bandwidth. In that scenario the size of the > > header could become significant. > > I think the biggest cost of the full flow table transfer is rather code > that's a bit quicker to write (I just managed to properly set sequences > on the target, connections don't quite "flow" yet) but relatively high > maintenance (as you mentioned, we need to be careful about every single > field) and easy to break. Right. And with this draft we can't even change the size of the flow table without breaking migration. That seems like a thing we might well want to change.Why? The size of the flow table hasn't changed since it was added. Idon't see a reason to improve this if we don't want to transfer the flow table anyway.I don't follow. Do you mean not transferring the hash table? This is not relevant to that, I'm talking about the size of the base flow table, not the hash table. Or do you mean not transferring the flow table as a whole, but rather entry by entry? In that case I'm seeing it as exactly the mechanism to improve this.True.Transferring them if we have a dedicated structure shouldn't be that bad: we don't need to store them.That's possible, although it would mean transferring the old indices, which is not otherwise strictly necessary. What we could do easily is a debug log similar to the "new flow" logs but for "immigrated flow".> I would like to quickly complete the whole flow first, because I think > we can inform design and implementation decisions much better at that > point, and we can be sure it's feasible, That's fair. 
> but I'm not particularly keen > to merge this patch like it is, if we can switch it relatively swiftly > to an implementation where we model a smaller fixed-endian structure > with just the stuff we need. So, there are kind of two parts to this: 1) Only transferring active flow entries, and not transferring the hash table I think this is pretty easy. It could be done with or without preserving flow indicies. Preserving makes for debug log continuity between the ends, but not preserving lets us change the size of the flow table without breaking migration.I would just add prints on migration showing how old flow indices map to new ones.Oh... right. We kind of already made that mistake.Not really: we transfer "PASST" at the moment, and the migration succeeds.Possibly - it might need to do some slightly different things: regenerating some fields from redundant data maybe, and/or re-hashing the entries. But certainly the structure should be similar, yes.2) Only transferring the necessary pieces of each entry, and using a fixed representation of each piece This is harder. Not *super* hard, I think, but definitely trickier than (1) > And again, to be a bit more sure of which stuff we need in it, the full > flow is useful to have implemented. > > Actually the biggest complications I see in switching to that approach, > from the current point, are that we need to, I guess: > > 1. model arrays (not really complicated by itself) So here, I actually think this is simpler if we don't attempt to have a declarative approach to defining the protocol, but just write functions to implement it. > 2. have a temporary structure where we store flows instead of using the > flow table directly (meaning that the "data model" needs to logically > decouple source and destination of the copy) Right.. I'd really prefer to "stream" in the entries one by one, rather than having a big staging area. That's even harder to do declaratively, but I think the other advantages are worth it. > 3. batch stuff to some extent. We'll call socket() and connect() once > for each socket anyway, obviously, but sending one message to the > TCP_REPAIR helper for each socket looks like a rather substantial > and avoidable overhead I don't think this actually has a lot of bearing on the protocol. I'd envisage migrate_target() decodes all the information into the target's flow table, then migrate_target_post() steps through all the flows re-establishing the connections. Since we've already parsed the protocol at that point, we can make multiple passes: one to gather batches and set TCP_REPAIR, another through each entry to set the values, and a final one to clear TCP_REPAIR in batches.Ah, right, I didn't think of using the target flow table directly. That has the advantage that the current code I'm writing to reactivate flows from the flow table can be recycled as it is.Eh, just thought merging might save us some rebase work against any other pressing changes we need.> > > > It's both easier to do > > > > and a bigger win in most cases. That would dramatically reduce the > > > > size sent here. > > > > > > Yep, feel free. > > > > It's on my queue for the next few days. > > To me this part actually looks like the biggest priority after/while > getting the whole thing to work, because we can start right with a 'v1' > which looks more sustainable. > > And I would just get stuff working on x86_64 in that case, without even > implementing conversions and endianness switches etc. Right. 
Given the number of options here, I think it would be safest to go in expecting to go through a few throwaway protocol versions before reaching one we're happy enough to support long term. To ease that process, I'm wondering if we should add a default-off command line option to enable migration. For now, enabling it would print some sort of "migration is experimental!" warning. Once we have a stream format we're ok with, we can flip it to on-by-default, but we don't maintain receive compatibility for the experimental versions leading up to that.

It looks like unnecessary code churn to me. It doesn't need to be merged if it's work in progress. You can also push stuff to a temporary branch if needed.

It can also be merged and not documented for a while, as long as it doesn't break existing functionality.

I'd be a bit cautious about this. AIUI, right now if you attempt to migrate, qemu will simply fail it because we don't respond to the migration commands. Having enough merged that qemu won't outright fail the migration, but it won't reliably work, seems like a bad idea.

You can start 'ping' in the source and see it continue flawlessly in the target. Sure, TCP connections break altogether instead of the okayish migration implemented by v3 I'm posting in a bit...

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
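Purely to illustrate what a smaller fixed-endian structure, streamed entry by entry, might look like on the wire: the field selection, names and sizes below are hypothetical, not a proposal for the actual protocol, which would follow from what TCP_REPAIR actually needs per flow.

---
/* Hypothetical per-flow record with fixed widths and network byte
 * order, streamed one entry at a time. Field names and selection are
 * illustrative only.
 */
#include <arpa/inet.h>
#include <stdint.h>

struct flow_wire_v1 {
	uint8_t		proto;		/* IPPROTO_TCP, IPPROTO_UDP, ... */
	uint8_t		pif[2];		/* interface (pif) for each side */
	uint8_t		pad;
	uint8_t		addr[2][16];	/* IPv6 or IPv4-mapped, each side */
	uint16_t	port[2];	/* network order */
	uint32_t	seq_snd;	/* network order */
	uint32_t	seq_rcv;	/* network order */
} __attribute__((packed));

static inline void flow_wire_seqs(struct flow_wire_v1 *w,
				  uint32_t snd, uint32_t rcv)
{
	w->seq_snd = htonl(snd);	/* fixed representation regardless */
	w->seq_rcv = htonl(rcv);	/* of host endianness */
}
---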
On Thu, 30 Jan 2025 09:32:36 +0100 Stefano Brivio <sbrivio(a)redhat.com> wrote:I would like to quickly complete the whole flow first, because I think we can inform design and implementation decisions much better at that pointSo, there seems to be a problem with (testing?) this. I couldn't quite understand the root cause yet, and it doesn't happen with the reference source.c and target.c implementations I shared. Let's assume I have a connection in the source guest to 127.0.0.1:9091, from 127.0.0.1:56350. After the migration, in the target, I get: --- socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79 setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(79, {sa_family=AF_INET, sin_port=htons(56350), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1 recvfrom(72, "\1", 1, 0, NULL, NULL) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [1788468535], 4) = 0 write(2, "77.6923: ", 977.6923: ) = 9 write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 1788468535) = 51 write(2, "\n", 1 ) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [115288604], 4) = 0 write(2, "77.6924: ", 977.6924: ) = 9 write(2, "Set receive queue sequence for s"..., 53Set receive queue sequence for socket 79 to 115288604) = 53 write(2, "\n", 1 ) = 1 connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) --- EADDRNOTAVAIL, according to the documentation, which seems to be consistent with a glance at the implementation (that is, I must be missing some issue in the kernel), should be returned on connect() if: EADDRNOTAVAIL (Internet domain sockets) The socket referred to by sockfd had not previously been bound to an address and, upon attempting to bind it to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use. See the discussion of /proc/sys/net/ipv4/ip_local_port_range in ip(7). but well, of course it was bound. To a port, indeed, not a full address, that is, any (0.0.0.0) and address port, but I think for the purposes of this description that bind() call is enough. Is this related to SO_REUSEADDR? I need it (on both source and target) because, at least in my tests, source and target are on the same machine, in the same namespace. If I drop it: --- bind(79, {sa_family=AF_INET, sin_port=htons(46280), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use) --- as expected. 
However, in my reference implementation, with a connection from 127.0.0.1:9998 to 127.0.0.1:9091, this is what the target does: --- socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(3, {sa_family=AF_INET, sin_port=htons(9998), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 socket(AF_UNIX, SOCK_STREAM, 0) = 4 unlink("/tmp/repair.sock") = 0 bind(4, {sa_family=AF_UNIX, sun_path="/tmp/repair.sock"}, 110) = 0 listen(4, 1) = 0 accept(4, NULL, NULL) = 5 sendmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[3]}], msg_controllen=24, msg_flags=0}, 0) = 1 recvfrom(5, "\1", 1, 0, NULL, NULL) = 1 setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [1612504019], 4) = 0 setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [1756508956], 4) = 0 connect(3, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 --- The only obvious difference is that, here, I'm not binding to an ephemeral port: the source port (in both source and target "guests") is 9998. Fine, so I tried forcing a lower port in passt (source) as well, and this is what I get in the target now: --- socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79 setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(79, {sa_family=AF_INET, sin_port=htons(9000), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1 recvfrom(72, "\1", 1, 0, NULL, NULL) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [-348109334], 4) = 0 write(2, "46.9751: ", 946.9751: ) = 9 write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 3946857962) = 51 write(2, "\n", 1 ) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [-1820322671], 4) = 0 write(2, "46.9752: ", 946.9752: ) = 9 write(2, "Set receive queue sequence for s"..., 54Set receive queue sequence for socket 79 to 2474644625) = 54 write(2, "\n", 1 ) = 1 connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) --- no obvious difference. I'll try binding to an explicit address, next, but I have no idea why 1. we get EADDRNOTAVAIL after a bind() and 2. it works with the reference implementation. Yes, I explicitly close() the socket in the source passt now, but that doesn't change things. This is presumably just an issue with testing, because in real use cases source and target guests would be on different machines. Another idea could be separating the namespaces. I can't just run source and target passt in two instances of pasta --config-net, because pasta would run into the same issue, but I could isolate one namespace with it, then add two network namespaces inside that, and connect them with veth pairs. -- Stefano
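For comparison with the traces above, the target-side sequence for one migrated connection boils down to something like the stand-alone sketch below. Here TCP_REPAIR is set directly for clarity (passt delegates that to passt-repair), addresses are simplified and error handling is omitted:

---
/* What the target does for one migrated connection, reduced to a
 * stand-alone sketch: bind the original source port, enter repair
 * mode, set both queue sequences, then connect() without any packets
 * on the wire.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <sys/socket.h>

#ifndef TCP_RECV_QUEUE			/* from linux/tcp.h */
#define TCP_RECV_QUEUE	1
#define TCP_SEND_QUEUE	2
#endif

int repair_connect(in_port_t src_port, const char *dst, in_port_t dst_port,
		   uint32_t seq_snd, uint32_t seq_rcv)
{
	struct sockaddr_in a = { .sin_family = AF_INET };
	int one = 1, q, s;

	s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

	a.sin_port = htons(src_port);
	bind(s, (struct sockaddr *)&a, sizeof(a));	/* 0.0.0.0:src_port */

	setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &one, sizeof(one));

	q = TCP_SEND_QUEUE;
	setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
	setsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ, &seq_snd, sizeof(seq_snd));

	q = TCP_RECV_QUEUE;
	setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
	setsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ, &seq_rcv, sizeof(seq_rcv));

	a.sin_port = htons(dst_port);
	inet_pton(AF_INET, dst, &a.sin_addr);
	return connect(s, (struct sockaddr *)&a, sizeof(a)) ? -1 : s;
}
---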
On Fri, Jan 31, 2025 at 06:36:55AM +0100, Stefano Brivio wrote:On Thu, 30 Jan 2025 09:32:36 +0100 Stefano Brivio <sbrivio(a)redhat.com> wrote:So, I was wondering if binding to 0.0.0.0 is sufficient for a repaired socket. Usually, of course, that 0.0.0.0 would be resolved to a real address at connect() time. But TCP_REPAIR's version of connect() bypasses a bunch of the usual connect logic, so maybe we need an explicit address here. ...but that doesn't explain the difference between passt and your test implementation.I would like to quickly complete the whole flow first, because I think we can inform design and implementation decisions much better at that pointSo, there seems to be a problem with (testing?) this. I couldn't quite understand the root cause yet, and it doesn't happen with the reference source.c and target.c implementations I shared. Let's assume I have a connection in the source guest to 127.0.0.1:9091, from 127.0.0.1:56350. After the migration, in the target, I get: --- socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79 setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(79, {sa_family=AF_INET, sin_port=htons(56350), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1 recvfrom(72, "\1", 1, 0, NULL, NULL) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [1788468535], 4) = 0 write(2, "77.6923: ", 977.6923: ) = 9 write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 1788468535) = 51 write(2, "\n", 1 ) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [115288604], 4) = 0 write(2, "77.6924: ", 977.6924: ) = 9 write(2, "Set receive queue sequence for s"..., 53Set receive queue sequence for socket 79 to 115288604) = 53 write(2, "\n", 1 ) = 1 connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) --- EADDRNOTAVAIL, according to the documentation, which seems to be consistent with a glance at the implementation (that is, I must be missing some issue in the kernel), should be returned on connect() if: EADDRNOTAVAIL (Internet domain sockets) The socket referred to by sockfd had not previously been bound to an address and, upon attempting to bind it to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use. See the discussion of /proc/sys/net/ipv4/ip_local_port_range in ip(7). but well, of course it was bound. To a port, indeed, not a full address, that is, any (0.0.0.0) and address port, but I think for the purposes of this description that bind() call is enough.Is this related to SO_REUSEADDR? I need it (on both source and target) because, at least in my tests, source and target are on the same machine, in the same namespace. If I drop it:Again, I can think of various problems that not having the same address available on source and dest might have, but not any which explain the difference between passt and the experimental impl.--- bind(79, {sa_family=AF_INET, sin_port=htons(46280), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use) --- as expected. 
However, in my reference implementation, with a connection from 127.0.0.1:9998 to 127.0.0.1:9091, this is what the target does: --- socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(3, {sa_family=AF_INET, sin_port=htons(9998), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 socket(AF_UNIX, SOCK_STREAM, 0) = 4 unlink("/tmp/repair.sock") = 0 bind(4, {sa_family=AF_UNIX, sun_path="/tmp/repair.sock"}, 110) = 0 listen(4, 1) = 0 accept(4, NULL, NULL) = 5 sendmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[3]}], msg_controllen=24, msg_flags=0}, 0) = 1 recvfrom(5, "\1", 1, 0, NULL, NULL) = 1 setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [1612504019], 4) = 0 setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [1756508956], 4) = 0 connect(3, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 --- The only obvious difference is that, here, I'm not binding to an ephemeral port: the source port (in both source and target "guests") is 9998. Fine, so I tried forcing a lower port in passt (source) as well, and this is what I get in the target now: --- socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79 setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(79, {sa_family=AF_INET, sin_port=htons(9000), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1 recvfrom(72, "\1", 1, 0, NULL, NULL) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [-348109334], 4) = 0 write(2, "46.9751: ", 946.9751: ) = 9 write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 3946857962) = 51 write(2, "\n", 1 ) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [-1820322671], 4) = 0 write(2, "46.9752: ", 946.9752: ) = 9 write(2, "Set receive queue sequence for s"..., 54Set receive queue sequence for socket 79 to 2474644625) = 54 write(2, "\n", 1 ) = 1 connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) --- no obvious difference. I'll try binding to an explicit address, next, but I have no idea why 1. we get EADDRNOTAVAIL after a bind() and 2. it works with the reference implementation.I have no ideas yet :(.Yes, I explicitly close() the socket in the source passt now, but that doesn't change things. This is presumably just an issue with testing, because in real use cases source and target guests would be on different machines. Another idea could be separating the namespaces.Well, if that's relevant to the problem which isn't clear yet. I mean, I guess it's worth trying with source and dest in different namespaces.I can't just run source and target passt in two instances of pasta --config-net, because pasta would run into the same issue,Uh.. which same issue? 
pasta's not trying to do any TCP_REPAIR stuff or migration.but I could isolate one namespace with it, then add two network namespaces inside that, and connect them with veth pairs.Two pasta instances actually sounds like a better bet to me, because the two "hosts" will have the same address, which is what we'd expect for a "real" migration - and it kind of has to be the case for the host side connections to work afterwards. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Fixed, finally. Some answers: On Fri, 31 Jan 2025 17:14:18 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Fri, Jan 31, 2025 at 06:36:55AM +0100, Stefano Brivio wrote:It is.On Thu, 30 Jan 2025 09:32:36 +0100 Stefano Brivio <sbrivio(a)redhat.com> wrote:So, I was wondering if binding to 0.0.0.0 is sufficient for a repaired socket.I would like to quickly complete the whole flow first, because I think we can inform design and implementation decisions much better at that pointSo, there seems to be a problem with (testing?) this. I couldn't quite understand the root cause yet, and it doesn't happen with the reference source.c and target.c implementations I shared. Let's assume I have a connection in the source guest to 127.0.0.1:9091, from 127.0.0.1:56350. After the migration, in the target, I get: --- socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79 setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(79, {sa_family=AF_INET, sin_port=htons(56350), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1 recvfrom(72, "\1", 1, 0, NULL, NULL) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [1788468535], 4) = 0 write(2, "77.6923: ", 977.6923: ) = 9 write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 1788468535) = 51 write(2, "\n", 1 ) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [115288604], 4) = 0 write(2, "77.6924: ", 977.6924: ) = 9 write(2, "Set receive queue sequence for s"..., 53Set receive queue sequence for socket 79 to 115288604) = 53 write(2, "\n", 1 ) = 1 connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) --- EADDRNOTAVAIL, according to the documentation, which seems to be consistent with a glance at the implementation (that is, I must be missing some issue in the kernel), should be returned on connect() if: EADDRNOTAVAIL (Internet domain sockets) The socket referred to by sockfd had not previously been bound to an address and, upon attempting to bind it to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use. See the discussion of /proc/sys/net/ipv4/ip_local_port_range in ip(7). but well, of course it was bound. To a port, indeed, not a full address, that is, any (0.0.0.0) and address port, but I think for the purposes of this description that bind() call is enough.Usually, of course, that 0.0.0.0 would be resolved to a real address at connect() time. But TCP_REPAIR's version of connect() bypasses a bunch of the usual connect logic, so maybe we need an explicit address here.No need....but that doesn't explain the difference between passt and your test implementation.The difference that actually matters is that the test implementation terminates, and that has the equivalent effect of switching off repair mode for the closed sockets, which frees up all the associated context, including the port. Usually, there are no valid operations on closed sockets (not even close()). This is the first exception I ever met: you can set TCP_REPAIR_OFF. 
But there's a catch: you can't pass a closed socket in repair mode via SCM_RIGHTS (well, I'm fairly sure nobody approached this level of insanity before): you get EBADF (which is an understatement). And there's another catch: if you actually try to do that, even if it fails, that has the same effect of clearing the socket entirely: you free up the port. But we can't use this, unfortunately, because if we do, the peer will get a zero-length read (EOF). Now, I could reintroduce a "quit" command in passt-repair, and we would know that EOF doesn't actually mean completion, but it complicates things again. What works, though, is simply terminating. We can't do that before VHOST_USER_CHECK_DEVICE_STATE, but just after that. That's what I implemented at the moment (updated patches coming soon).Same issue in the sense that if I connect namespaces with pasta, I can't migrate a connection between them, because pasta can't migrate a connection. It would close it and try to reopen it.Is this related to SO_REUSEADDR? I need it (on both source and target) because, at least in my tests, source and target are on the same machine, in the same namespace. If I drop it:Again, I can think of various problems that not having the same address available on source and dest might have, but not any which explain the difference between passt and the experimental impl.--- bind(79, {sa_family=AF_INET, sin_port=htons(46280), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use) --- as expected. However, in my reference implementation, with a connection from 127.0.0.1:9998 to 127.0.0.1:9091, this is what the target does: --- socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(3, {sa_family=AF_INET, sin_port=htons(9998), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 socket(AF_UNIX, SOCK_STREAM, 0) = 4 unlink("/tmp/repair.sock") = 0 bind(4, {sa_family=AF_UNIX, sun_path="/tmp/repair.sock"}, 110) = 0 listen(4, 1) = 0 accept(4, NULL, NULL) = 5 sendmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[3]}], msg_controllen=24, msg_flags=0}, 0) = 1 recvfrom(5, "\1", 1, 0, NULL, NULL) = 1 setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [1612504019], 4) = 0 setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [1756508956], 4) = 0 connect(3, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 --- The only obvious difference is that, here, I'm not binding to an ephemeral port: the source port (in both source and target "guests") is 9998. 
Fine, so I tried forcing a lower port in passt (source) as well, and this is what I get in the target now: --- socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79 setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(79, {sa_family=AF_INET, sin_port=htons(9000), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1 recvfrom(72, "\1", 1, 0, NULL, NULL) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [-348109334], 4) = 0 write(2, "46.9751: ", 946.9751: ) = 9 write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 3946857962) = 51 write(2, "\n", 1 ) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [-1820322671], 4) = 0 write(2, "46.9752: ", 946.9752: ) = 9 write(2, "Set receive queue sequence for s"..., 54Set receive queue sequence for socket 79 to 2474644625) = 54 write(2, "\n", 1 ) = 1 connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) --- no obvious difference. I'll try binding to an explicit address, next, but I have no idea why 1. we get EADDRNOTAVAIL after a bind() and 2. it works with the reference implementation.I have no ideas yet :(.Yes, I explicitly close() the socket in the source passt now, but that doesn't change things. This is presumably just an issue with testing, because in real use cases source and target guests would be on different machines. Another idea could be separating the namespaces.Well, if that's relevant to the problem which isn't clear yet. I mean, I guess it's worth trying with source and dest in different namespaces.I can't just run source and target passt in two instances of pasta --config-net, because pasta would run into the same issue,Uh.. which same issue? pasta's not trying to do any TCP_REPAIR stuff or migration.Eh, yes, but we're back to the original problem. A veth interface wouldn't care, instead. Anyway, no need, it's finally working now. -- Stefanobut I could isolate one namespace with it, then add two network namespaces inside that, and connect them with veth pairs.Two pasta instances actually sounds like a better bet to me, because the two "hosts" will have the same address, which is what we'd expect for a "real" migration - and it kind of has to be the case for the host side connections to work afterwards.
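The helper behaviour discussed here (receive a socket via SCM_RIGHTS, toggle TCP_REPAIR on it, ack one byte, treat EOF on the control socket as completion) maps onto a loop roughly like the sketch below. This is not the actual passt-repair code, and the one-byte command semantics are an assumption:

---
/* Rough sketch of a passt-repair-like loop: one socket per message via
 * SCM_RIGHTS, the payload byte taken as the repair on/off command
 * (assumption), one byte sent back as ack. EOF on the control socket
 * means we are done; closing our copy of each received fd is what lets
 * a socket already close()d by passt really go away, releasing repair
 * mode and the bound port.
 */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

void repair_loop(int ctrl)		/* connected AF_UNIX socket */
{
	while (1) {
		union {
			char buf[CMSG_SPACE(sizeof(int))];
			struct cmsghdr align;
		} u;
		struct msghdr msg = { 0 };
		struct cmsghdr *cmsg;
		struct iovec iov = { 0 };
		char cmd = 0;
		int s, v;

		iov.iov_base = &cmd;
		iov.iov_len = sizeof(cmd);
		msg.msg_iov = &iov;
		msg.msg_iovlen = 1;
		msg.msg_control = u.buf;
		msg.msg_controllen = sizeof(u.buf);

		if (recvmsg(ctrl, &msg, 0) <= 0)
			return;			/* EOF (done) or error */

		cmsg = CMSG_FIRSTHDR(&msg);
		if (!cmsg || cmsg->cmsg_type != SCM_RIGHTS)
			continue;

		memcpy(&s, CMSG_DATA(cmsg), sizeof(s));

		v = cmd;			/* 1: repair on, 0: off */
		setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &v, sizeof(v));

		send(ctrl, &cmd, 1, 0);		/* ack */
		close(s);			/* drop our reference */
	}
}
---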
On Fri, Jan 31, 2025 at 10:09:19AM +0100, Stefano Brivio wrote:Fixed, finally. Some answers: On Fri, 31 Jan 2025 17:14:18 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Ok.On Fri, Jan 31, 2025 at 06:36:55AM +0100, Stefano Brivio wrote:It is.On Thu, 30 Jan 2025 09:32:36 +0100 Stefano Brivio <sbrivio(a)redhat.com> wrote:So, I was wondering if binding to 0.0.0.0 is sufficient for a repaired socket.I would like to quickly complete the whole flow first, because I think we can inform design and implementation decisions much better at that pointSo, there seems to be a problem with (testing?) this. I couldn't quite understand the root cause yet, and it doesn't happen with the reference source.c and target.c implementations I shared. Let's assume I have a connection in the source guest to 127.0.0.1:9091, from 127.0.0.1:56350. After the migration, in the target, I get: --- socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79 setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(79, {sa_family=AF_INET, sin_port=htons(56350), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1 recvfrom(72, "\1", 1, 0, NULL, NULL) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [1788468535], 4) = 0 write(2, "77.6923: ", 977.6923: ) = 9 write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 1788468535) = 51 write(2, "\n", 1 ) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [115288604], 4) = 0 write(2, "77.6924: ", 977.6924: ) = 9 write(2, "Set receive queue sequence for s"..., 53Set receive queue sequence for socket 79 to 115288604) = 53 write(2, "\n", 1 ) = 1 connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) --- EADDRNOTAVAIL, according to the documentation, which seems to be consistent with a glance at the implementation (that is, I must be missing some issue in the kernel), should be returned on connect() if: EADDRNOTAVAIL (Internet domain sockets) The socket referred to by sockfd had not previously been bound to an address and, upon attempting to bind it to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use. See the discussion of /proc/sys/net/ipv4/ip_local_port_range in ip(7). but well, of course it was bound. To a port, indeed, not a full address, that is, any (0.0.0.0) and address port, but I think for the purposes of this description that bind() call is enough.Usually, of course, that 0.0.0.0 would be resolved to a real address at connect() time. But TCP_REPAIR's version of connect() bypasses a bunch of the usual connect logic, so maybe we need an explicit address here.No need.I'm still confused by the specific sequence of events that's causing the problem. If a socket is closed with close(2) it should no longer exist, so I don't see how you could even attempt to do anything with it. Do you mean that the socket is shutdown(RD|WR)? Or that it's been closed by passt, but not by passt-repair? Or the other way around? 
I'd kind of assume that you _must_ close the socket while still in repair mode, since we want it to go away on the source without attempting to FIN or RST or anything....but that doesn't explain the difference between passt and your test implementation.The difference that actually matters is that the test implementation terminates, and that has the equivalent effect of switching off repair mode for the closed sockets, which frees up all the associated context, including the port. Usually, there are no valid operations on closed sockets (not even close()). This is the first exception I ever met: you can set TCP_REPAIR_OFF.But there's a catch: you can't pass a closed socket in repair mode via SCM_RIGHTS (well, I'm fairly sure nobody approached this level of insanity before): you get EBADF (which is an understatement). And there's another catch: if you actually try to do that, even if it fails, that has the same effect of clearing the socket entirely: you free up the port.!?! this is even more baffling. Passing what's now an unrelated, unassigned integer as an fd is having some effect on a socket that was around!? If so that's a horrifying kernel bug.But we can't use this, unfortunately, because if we do, the peer will get a zero-length read (EOF). Now, I could reintroduce a "quit" command in passt-repair, and we would know that EOF doesn't actually mean completion, but it complicates things again. What works, though, is simply terminating. We can't do that before VHOST_USER_CHECK_DEVICE_STATE, but just after that. That's what I implemented at the moment (updated patches coming soon).-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibsonSame issue in the sense that if I connect namespaces with pasta, I can't migrate a connection between them, because pasta can't migrate a connection. It would close it and try to reopen it.Is this related to SO_REUSEADDR? I need it (on both source and target) because, at least in my tests, source and target are on the same machine, in the same namespace. If I drop it:Again, I can think of various problems that not having the same address available on source and dest might have, but not any which explain the difference between passt and the experimental impl.--- bind(79, {sa_family=AF_INET, sin_port=htons(46280), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use) --- as expected. 
However, in my reference implementation, with a connection from 127.0.0.1:9998 to 127.0.0.1:9091, this is what the target does: --- socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(3, {sa_family=AF_INET, sin_port=htons(9998), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 socket(AF_UNIX, SOCK_STREAM, 0) = 4 unlink("/tmp/repair.sock") = 0 bind(4, {sa_family=AF_UNIX, sun_path="/tmp/repair.sock"}, 110) = 0 listen(4, 1) = 0 accept(4, NULL, NULL) = 5 sendmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[3]}], msg_controllen=24, msg_flags=0}, 0) = 1 recvfrom(5, "\1", 1, 0, NULL, NULL) = 1 setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [1612504019], 4) = 0 setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [1756508956], 4) = 0 connect(3, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 --- The only obvious difference is that, here, I'm not binding to an ephemeral port: the source port (in both source and target "guests") is 9998. Fine, so I tried forcing a lower port in passt (source) as well, and this is what I get in the target now: --- socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79 setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(79, {sa_family=AF_INET, sin_port=htons(9000), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1 recvfrom(72, "\1", 1, 0, NULL, NULL) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [-348109334], 4) = 0 write(2, "46.9751: ", 946.9751: ) = 9 write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 3946857962) = 51 write(2, "\n", 1 ) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [-1820322671], 4) = 0 write(2, "46.9752: ", 946.9752: ) = 9 write(2, "Set receive queue sequence for s"..., 54Set receive queue sequence for socket 79 to 2474644625) = 54 write(2, "\n", 1 ) = 1 connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) --- no obvious difference. I'll try binding to an explicit address, next, but I have no idea why 1. we get EADDRNOTAVAIL after a bind() and 2. it works with the reference implementation.I have no ideas yet :(.Yes, I explicitly close() the socket in the source passt now, but that doesn't change things. This is presumably just an issue with testing, because in real use cases source and target guests would be on different machines. Another idea could be separating the namespaces.Well, if that's relevant to the problem which isn't clear yet. I mean, I guess it's worth trying with source and dest in different namespaces.I can't just run source and target passt in two instances of pasta --config-net, because pasta would run into the same issue,Uh.. which same issue? pasta's not trying to do any TCP_REPAIR stuff or migration.Eh, yes, but we're back to the original problem. A veth interface wouldn't care, instead. 
Anyway, no need, it's finally working now.

but I could isolate one namespace with it, then add two network namespaces inside that, and connect them with veth pairs.

Two pasta instances actually sounds like a better bet to me, because the two "hosts" will have the same address, which is what we'd expect for a "real" migration - and it kind of has to be the case for the host side connections to work afterwards.
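A minimal sketch of the source-side teardown implied above: the socket is put (or kept) in repair mode before close(), so closing it discards the connection without a FIN or RST ever reaching the peer, which will be taken over by the target. In passt the TCP_REPAIR toggle itself is performed by passt-repair; the direct setsockopt() here is only for illustration.

---
/* Sketch: discard a migrated connection on the source without sending
 * FIN or RST, by closing the socket while it is in repair mode. As
 * discussed in this thread, the connection state (and the bound port)
 * only really goes away once the last fd referring to the socket is
 * closed, including any copy held by the helper.
 */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

void silent_close(int s)
{
	int one = 1;

	setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &one, sizeof(one));
	close(s);
}
---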
On Mon, 3 Feb 2025 11:46:13 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Fri, Jan 31, 2025 at 10:09:19AM +0100, Stefano Brivio wrote:While the explanation for the issue is what you gave as comment to 8/20 (I need to close() the socket from passt-repair), let me answer here: sure, I must close() it, and it was close()d by passt but not passt-repair.Fixed, finally. Some answers: On Fri, 31 Jan 2025 17:14:18 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Ok.On Fri, Jan 31, 2025 at 06:36:55AM +0100, Stefano Brivio wrote:It is.On Thu, 30 Jan 2025 09:32:36 +0100 Stefano Brivio <sbrivio(a)redhat.com> wrote: > I would like to quickly complete the whole flow first, because I think > we can inform design and implementation decisions much better at that > point So, there seems to be a problem with (testing?) this. I couldn't quite understand the root cause yet, and it doesn't happen with the reference source.c and target.c implementations I shared. Let's assume I have a connection in the source guest to 127.0.0.1:9091, from 127.0.0.1:56350. After the migration, in the target, I get: --- socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79 setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(79, {sa_family=AF_INET, sin_port=htons(56350), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1 recvfrom(72, "\1", 1, 0, NULL, NULL) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [1788468535], 4) = 0 write(2, "77.6923: ", 977.6923: ) = 9 write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 1788468535) = 51 write(2, "\n", 1 ) = 1 setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [115288604], 4) = 0 write(2, "77.6924: ", 977.6924: ) = 9 write(2, "Set receive queue sequence for s"..., 53Set receive queue sequence for socket 79 to 115288604) = 53 write(2, "\n", 1 ) = 1 connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) --- EADDRNOTAVAIL, according to the documentation, which seems to be consistent with a glance at the implementation (that is, I must be missing some issue in the kernel), should be returned on connect() if: EADDRNOTAVAIL (Internet domain sockets) The socket referred to by sockfd had not previously been bound to an address and, upon attempting to bind it to an ephemeral port, it was determined that all port numbers in the ephemeral port range are currently in use. See the discussion of /proc/sys/net/ipv4/ip_local_port_range in ip(7). but well, of course it was bound. To a port, indeed, not a full address, that is, any (0.0.0.0) and address port, but I think for the purposes of this description that bind() call is enough.So, I was wondering if binding to 0.0.0.0 is sufficient for a repaired socket.Usually, of course, that 0.0.0.0 would be resolved to a real address at connect() time. But TCP_REPAIR's version of connect() bypasses a bunch of the usual connect logic, so maybe we need an explicit address here.No need.I'm still confused by the specific sequence of events that's causing the problem. If a socket is closed with close(2) it should no longer exist, so I don't see how you could even attempt to do anything with it. 
Do you mean that the socket is shutdown(RD|WR)? Or that it's been closed by passt, but not by passt-repair? Or the other way around? I'd kind of assume that you _must_ close the socket while still in repair mode, since we want it to go away on the source without attempting to FIN or RST or anything....but that doesn't explain the difference between passt and your test implementation.The difference that actually matters is that the test implementation terminates, and that has the equivalent effect of switching off repair mode for the closed sockets, which frees up all the associated context, including the port. Usually, there are no valid operations on closed sockets (not even close()). This is the first exception I ever met: you can set TCP_REPAIR_OFF.Nah, most likely not. The EBADF on a close()d socket is a bit questionable (it should be EINVAL? Or a -1 socket in the recipient?), but other than that, the explanation is that passing that closed socket caused EOF in passt-repair, and passt-repair would quit, solving the issue. -- StefanoBut there's a catch: you can't pass a closed socket in repair mode via SCM_RIGHTS (well, I'm fairly sure nobody approached this level of insanity before): you get EBADF (which is an understatement). And there's another catch: if you actually try to do that, even if it fails, that has the same effect of clearing the socket entirely: you free up the port.!?! this is even more baffling. Passing what's now an unrelated, unassigned integer as an fd is having some effect on a socket that was around!? If so that's a horrifying kernel bug.
On Mon, Feb 03, 2025 at 07:09:28AM +0100, Stefano Brivio wrote:On Mon, 3 Feb 2025 11:46:13 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Right, I realised the problem with the missing close in passt-repair after I wrote this.On Fri, Jan 31, 2025 at 10:09:19AM +0100, Stefano Brivio wrote:While the explanation for the issue is what you gave as comment to 8/20 (I need to close() the socket from passt-repair), let me answer here: sure, I must close() it, and it was close()d by passt but not passt-repair.Fixed, finally. Some answers: On Fri, 31 Jan 2025 17:14:18 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Ok.On Fri, Jan 31, 2025 at 06:36:55AM +0100, Stefano Brivio wrote: > On Thu, 30 Jan 2025 09:32:36 +0100 > Stefano Brivio <sbrivio(a)redhat.com> wrote: > > > I would like to quickly complete the whole flow first, because I think > > we can inform design and implementation decisions much better at that > > point > > So, there seems to be a problem with (testing?) this. I couldn't quite > understand the root cause yet, and it doesn't happen with the reference > source.c and target.c implementations I shared. > > Let's assume I have a connection in the source guest to 127.0.0.1:9091, > from 127.0.0.1:56350. After the migration, in the target, I get: > > --- > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79 > setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 > bind(79, {sa_family=AF_INET, sin_port=htons(56350), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 > sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1 > recvfrom(72, "\1", 1, 0, NULL, NULL) = 1 > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [1788468535], 4) = 0 > write(2, "77.6923: ", 977.6923: ) = 9 > write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 1788468535) = 51 > write(2, "\n", 1 > ) = 1 > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [115288604], 4) = 0 > write(2, "77.6924: ", 977.6924: ) = 9 > write(2, "Set receive queue sequence for s"..., 53Set receive queue sequence for socket 79 to 115288604) = 53 > write(2, "\n", 1 > ) = 1 > connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) > --- > > EADDRNOTAVAIL, according to the documentation, which seems to be > consistent with a glance at the implementation (that is, I must be > missing some issue in the kernel), should be returned on connect() if: > > EADDRNOTAVAIL > (Internet domain sockets) The socket referred to by > sockfd had not previously been bound to an address > and, upon attempting to bind it to an ephemeral > port, it was determined that all port numbers in the > ephemeral port range are currently in use. See the > discussion of /proc/sys/net/ipv4/ip_local_port_range > in ip(7). > > but well, of course it was bound. > > To a port, indeed, not a full address, that is, any (0.0.0.0) and > address port, but I think for the purposes of this description that > bind() call is enough. So, I was wondering if binding to 0.0.0.0 is sufficient for a repaired socket.It is.Usually, of course, that 0.0.0.0 would be resolved to a real address at connect() time. 
But TCP_REPAIR's version of connect() bypasses a bunch of the usual connect logic, so maybe we need an explicit address here.No need.I'm still confused by the specific sequence of events that's causing the problem. If a socket is closed with close(2) it should no longer exist, so I don't see how you could even attempt to do anything with it. Do you mean that the socket is shutdown(RD|WR)? Or that it's been closed by passt, but not by passt-repair? Or the other way around? I'd kind of assume that you _must_ close the socket while still in repair mode, since we want it to go away on the source without attempting to FIN or RST or anything....but that doesn't explain the difference between passt and your test implementation.The difference that actually matters is that the test implementation terminates, and that has the equivalent effect of switching off repair mode for the closed sockets, which frees up all the associated context, including the port. Usually, there are no valid operations on closed sockets (not even close()). This is the first exception I ever met: you can set TCP_REPAIR_OFF.You're not "passing a closed socket", that's nonsensical. You're trying to pass a stale fd that's no longer refers to your socket. EBADF is _exactly_ what should happen, regardless of whether or not the underlying socket is really closed, or if it's held open by another fd somewhere (a dup() or something passed to another process like in this case).Nah, most likely not. The EBADF on a close()d socket is a bit questionable (it should be EINVAL? Or a -1 socket in the recipient?),But there's a catch: you can't pass a closed socket in repair mode via SCM_RIGHTS (well, I'm fairly sure nobody approached this level of insanity before): you get EBADF (which is an understatement). And there's another catch: if you actually try to do that, even if it fails, that has the same effect of clearing the socket entirely: you free up the port.!?! this is even more baffling. Passing what's now an unrelated, unassigned integer as an fd is having some effect on a socket that was around!? If so that's a horrifying kernel bug.but other than that, the explanation is that passing that closed socket caused EOF in passt-repair, and passt-repair would quit, solving the issue.Passing a bad fd caused an error on the sendmsg(), which caused an EOF on the other end. Which is a little odd, but again nothing to do with "passing a closed socket"; that's impossible - if the socket is closed there's no way to refer to it and so no way to even attempt sending it. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
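The point made above, that the fds named in SCM_RIGHTS are resolved when sendmsg() is called, so a stale descriptor makes the send itself fail and nothing reaches the receiver, can be reproduced with a small test along these lines (illustration only):

---
/* Tiny illustration: sendmsg() with a stale fd number in SCM_RIGHTS
 * fails with EBADF at send time; the receiver never sees the message.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	struct iovec iov;
	char byte = 1;
	int sv[2], stale;

	socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

	stale = dup(0);
	close(stale);			/* now just a stale number */

	iov.iov_base = &byte;
	iov.iov_len = 1;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &stale, sizeof(stale));

	if (sendmsg(sv[0], &msg, 0) < 0)
		printf("sendmsg: %s\n", strerror(errno));	/* EBADF */

	return 0;
}
---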
On Mon, 3 Feb 2025 20:06:28 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Mon, Feb 03, 2025 at 07:09:28AM +0100, Stefano Brivio wrote:Well, I'm just passing a number that doesn't happen to refer to a current socket, but:On Mon, 3 Feb 2025 11:46:13 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Right, I realised the problem with the missing close in passt-repair after I wrote this.On Fri, Jan 31, 2025 at 10:09:19AM +0100, Stefano Brivio wrote:While the explanation for the issue is what you gave as comment to 8/20 (I need to close() the socket from passt-repair), let me answer here: sure, I must close() it, and it was close()d by passt but not passt-repair.Fixed, finally. Some answers: On Fri, 31 Jan 2025 17:14:18 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote: > On Fri, Jan 31, 2025 at 06:36:55AM +0100, Stefano Brivio wrote: > > On Thu, 30 Jan 2025 09:32:36 +0100 > > Stefano Brivio <sbrivio(a)redhat.com> wrote: > > > > > I would like to quickly complete the whole flow first, because I think > > > we can inform design and implementation decisions much better at that > > > point > > > > So, there seems to be a problem with (testing?) this. I couldn't quite > > understand the root cause yet, and it doesn't happen with the reference > > source.c and target.c implementations I shared. > > > > Let's assume I have a connection in the source guest to 127.0.0.1:9091, > > from 127.0.0.1:56350. After the migration, in the target, I get: > > > > --- > > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79 > > setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 > > bind(79, {sa_family=AF_INET, sin_port=htons(56350), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 > > sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1 > > recvfrom(72, "\1", 1, 0, NULL, NULL) = 1 > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [1788468535], 4) = 0 > > write(2, "77.6923: ", 977.6923: ) = 9 > > write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 1788468535) = 51 > > write(2, "\n", 1 > > ) = 1 > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [115288604], 4) = 0 > > write(2, "77.6924: ", 977.6924: ) = 9 > > write(2, "Set receive queue sequence for s"..., 53Set receive queue sequence for socket 79 to 115288604) = 53 > > write(2, "\n", 1 > > ) = 1 > > connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) > > --- > > > > EADDRNOTAVAIL, according to the documentation, which seems to be > > consistent with a glance at the implementation (that is, I must be > > missing some issue in the kernel), should be returned on connect() if: > > > > EADDRNOTAVAIL > > (Internet domain sockets) The socket referred to by > > sockfd had not previously been bound to an address > > and, upon attempting to bind it to an ephemeral > > port, it was determined that all port numbers in the > > ephemeral port range are currently in use. See the > > discussion of /proc/sys/net/ipv4/ip_local_port_range > > in ip(7). > > > > but well, of course it was bound. > > > > To a port, indeed, not a full address, that is, any (0.0.0.0) and > > address port, but I think for the purposes of this description that > > bind() call is enough. 
> > So, I was wondering if binding to 0.0.0.0 is sufficient for a repaired > socket. It is. > Usually, of course, that 0.0.0.0 would be resolved to a real > address at connect() time. But TCP_REPAIR's version of connect() > bypasses a bunch of the usual connect logic, so maybe we need an > explicit address here. No need.Ok.> ...but that doesn't explain the difference between passt and your test > implementation. The difference that actually matters is that the test implementation terminates, and that has the equivalent effect of switching off repair mode for the closed sockets, which frees up all the associated context, including the port. Usually, there are no valid operations on closed sockets (not even close()). This is the first exception I ever met: you can set TCP_REPAIR_OFF.I'm still confused by the specific sequence of events that's causing the problem. If a socket is closed with close(2) it should no longer exist, so I don't see how you could even attempt to do anything with it. Do you mean that the socket is shutdown(RD|WR)? Or that it's been closed by passt, but not by passt-repair? Or the other way around? I'd kind of assume that you _must_ close the socket while still in repair mode, since we want it to go away on the source without attempting to FIN or RST or anything.You're not "passing a closed socket", that's nonsensical. You're trying to pass a stale fd that's no longer refers to your socket.Nah, most likely not. The EBADF on a close()d socket is a bit questionable (it should be EINVAL? Or a -1 socket in the recipient?),But there's a catch: you can't pass a closed socket in repair mode via SCM_RIGHTS (well, I'm fairly sure nobody approached this level of insanity before): you get EBADF (which is an understatement). And there's another catch: if you actually try to do that, even if it fails, that has the same effect of clearing the socket entirely: you free up the port.!?! this is even more baffling. Passing what's now an unrelated, unassigned integer as an fd is having some effect on a socket that was around!? If so that's a horrifying kernel bug.EBADF is _exactly_ what should happen, regardless of whether or not the underlying socket is really closed, or if it's held open by another fd somewhere (a dup() or something passed to another process like in this case)....EBADF on a sendmsg() means, in POSIX.1-2024: [EBADF] The socket argument is not a valid file descriptor. and nothing else. This matches GNU/Linux documentation by the way. The socket argument is *not* one of the file descriptors that you can pass via SCM_RIGHTS. I would argue that a more reasonable and less surprising behaviour would be signalling that there's no socket to send with a -1 in the receiver. Or omit it in ancillary data altogether. POSIX specifies SCM_RIGHTS, but doesn't mention any error for ancillary data.Right, so it shouldn't be sent, but the error doesn't match. -- Stefanobut other than that, the explanation is that passing that closed socket caused EOF in passt-repair, and passt-repair would quit, solving the issue.Passing a bad fd caused an error on the sendmsg(), which caused an EOF on the other end. Which is a little odd, but again nothing to do with "passing a closed socket"; that's impossible - if the socket is closed there's no way to refer to it and so no way to even attempt sending it.
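For reference, the strace above corresponds roughly to the restore sequence below. This is an illustrative sketch only, not the patch code: restore_conn() is a made-up name, the ports and sequence numbers are simply copied from the trace, error handling is omitted, and in passt the TCP_REPAIR toggle itself is done by passt-repair on the fd it receives.

---
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int restore_conn(void)
{
	struct sockaddr_in self = { .sin_family = AF_INET,
				    .sin_port = htons(56350) };
	struct sockaddr_in peer = { .sin_family = AF_INET,
				    .sin_port = htons(9091),
				    .sin_addr = { htonl(INADDR_LOOPBACK) } };
	int s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	int one = 1, qid, seq;

	setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
	bind(s, (struct sockaddr *)&self, sizeof(self)); /* 0.0.0.0 is enough */

	/* TCP_REPAIR is switched on at this point by passt-repair, on the
	 * copy of 's' it received via SCM_RIGHTS */

	qid = TCP_SEND_QUEUE;
	setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &qid, sizeof(qid));
	seq = 1788468535;
	setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq));

	qid = TCP_RECV_QUEUE;
	setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &qid, sizeof(qid));
	seq = 115288604;
	setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq));

	/* In repair mode, connect() doesn't send a SYN: it only fills in
	 * the peer address and moves the socket to ESTABLISHED */
	return connect(s, (struct sockaddr *)&peer, sizeof(peer));
}
---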
On Mon, Feb 03, 2025 at 10:45:05AM +0100, Stefano Brivio wrote:On Mon, 3 Feb 2025 20:06:28 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Eh, I guess so.On Mon, Feb 03, 2025 at 07:09:28AM +0100, Stefano Brivio wrote:Well, I'm just passing a number that doesn't happen to refer to a current socket, but:On Mon, 3 Feb 2025 11:46:13 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Right, I realised the problem with the missing close in passt-repair after I wrote this.On Fri, Jan 31, 2025 at 10:09:19AM +0100, Stefano Brivio wrote: > Fixed, finally. Some answers: > > On Fri, 31 Jan 2025 17:14:18 +1100 > David Gibson <david(a)gibson.dropbear.id.au> wrote: > > > On Fri, Jan 31, 2025 at 06:36:55AM +0100, Stefano Brivio wrote: > > > On Thu, 30 Jan 2025 09:32:36 +0100 > > > Stefano Brivio <sbrivio(a)redhat.com> wrote: > > > > > > > I would like to quickly complete the whole flow first, because I think > > > > we can inform design and implementation decisions much better at that > > > > point > > > > > > So, there seems to be a problem with (testing?) this. I couldn't quite > > > understand the root cause yet, and it doesn't happen with the reference > > > source.c and target.c implementations I shared. > > > > > > Let's assume I have a connection in the source guest to 127.0.0.1:9091, > > > from 127.0.0.1:56350. After the migration, in the target, I get: > > > > > > --- > > > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 79 > > > setsockopt(79, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 > > > bind(79, {sa_family=AF_INET, sin_port=htons(56350), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 > > > sendmsg(72, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1", iov_len=1}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[79]}], msg_controllen=24, msg_flags=0}, 0) = 1 > > > recvfrom(72, "\1", 1, 0, NULL, NULL) = 1 > > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [2], 4) = 0 > > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [1788468535], 4) = 0 > > > write(2, "77.6923: ", 977.6923: ) = 9 > > > write(2, "Set send queue sequence for sock"..., 51Set send queue sequence for socket 79 to 1788468535) = 51 > > > write(2, "\n", 1 > > > ) = 1 > > > setsockopt(79, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0 > > > setsockopt(79, SOL_TCP, TCP_QUEUE_SEQ, [115288604], 4) = 0 > > > write(2, "77.6924: ", 977.6924: ) = 9 > > > write(2, "Set receive queue sequence for s"..., 53Set receive queue sequence for socket 79 to 115288604) = 53 > > > write(2, "\n", 1 > > > ) = 1 > > > connect(79, {sa_family=AF_INET, sin_port=htons(9091), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EADDRNOTAVAIL (Cannot assign requested address) > > > --- > > > > > > EADDRNOTAVAIL, according to the documentation, which seems to be > > > consistent with a glance at the implementation (that is, I must be > > > missing some issue in the kernel), should be returned on connect() if: > > > > > > EADDRNOTAVAIL > > > (Internet domain sockets) The socket referred to by > > > sockfd had not previously been bound to an address > > > and, upon attempting to bind it to an ephemeral > > > port, it was determined that all port numbers in the > > > ephemeral port range are currently in use. See the > > > discussion of /proc/sys/net/ipv4/ip_local_port_range > > > in ip(7). > > > > > > but well, of course it was bound. > > > > > > To a port, indeed, not a full address, that is, any (0.0.0.0) and > > > address port, but I think for the purposes of this description that > > > bind() call is enough. 
> > > > So, I was wondering if binding to 0.0.0.0 is sufficient for a repaired > > socket. > > It is. > > > Usually, of course, that 0.0.0.0 would be resolved to a real > > address at connect() time. But TCP_REPAIR's version of connect() > > bypasses a bunch of the usual connect logic, so maybe we need an > > explicit address here. > > No need. Ok. > > ...but that doesn't explain the difference between passt and your test > > implementation. > > The difference that actually matters is that the test implementation > terminates, and that has the equivalent effect of switching off repair > mode for the closed sockets, which frees up all the associated context, > including the port. > > Usually, there are no valid operations on closed sockets (not even > close()). This is the first exception I ever met: you can set > TCP_REPAIR_OFF. I'm still confused by the specific sequence of events that's causing the problem. If a socket is closed with close(2) it should no longer exist, so I don't see how you could even attempt to do anything with it. Do you mean that the socket is shutdown(RD|WR)? Or that it's been closed by passt, but not by passt-repair? Or the other way around? I'd kind of assume that you _must_ close the socket while still in repair mode, since we want it to go away on the source without attempting to FIN or RST or anything.While the explanation for the issue is what you gave as comment to 8/20 (I need to close() the socket from passt-repair), let me answer here: sure, I must close() it, and it was close()d by passt but not passt-repair.You're not "passing a closed socket", that's nonsensical. You're trying to pass a stale fd that's no longer refers to your socket.> But there's a catch: you can't pass a closed socket in repair mode via > SCM_RIGHTS (well, I'm fairly sure nobody approached this level of > insanity before): you get EBADF (which is an understatement). > > And there's another catch: if you actually try to do that, even if it > fails, that has the same effect of clearing the socket entirely: you > free up the port. !?! this is even more baffling. Passing what's now an unrelated, unassigned integer as an fd is having some effect on a socket that was around!? If so that's a horrifying kernel bug.Nah, most likely not. The EBADF on a close()d socket is a bit questionable (it should be EINVAL? Or a -1 socket in the recipient?),EBADF is _exactly_ what should happen, regardless of whether or not the underlying socket is really closed, or if it's held open by another fd somewhere (a dup() or something passed to another process like in this case)....EBADF on a sendmsg() means, in POSIX.1-2024: [EBADF] The socket argument is not a valid file descriptor. and nothing else. This matches GNU/Linux documentation by the way. The socket argument is *not* one of the file descriptors that you can pass via SCM_RIGHTS.I would argue that a more reasonable and less surprising behaviour would be signalling that there's no socket to send with a -1 in the receiver. Or omit it in ancillary data altogether. POSIX specifies SCM_RIGHTS, but doesn't mention any error for ancillary data.I mean, regardless of the letter of the law, throwing an error on sendmsg() seems much more useful than the receiver having to process a bogus value. And given that, EBADF seems again sensible, regardless of the letter of the law. EINVAL can mean so many things its kind of useless. For the man page, I'd chalk it up to them just not considering all the things that could go wrong with SCM_RIGHTS. 
Not covering it in POSIX is a bit more surprising.

--
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

> Right, so it shouldn't be sent, but the error doesn't match.

> but other than that, the explanation is that passing that closed socket
> caused EOF in passt-repair, and passt-repair would quit, solving the
> issue.

> Passing a bad fd caused an error on the sendmsg(), which caused an EOF
> on the other end. Which is a little odd, but again nothing to do with
> "passing a closed socket"; that's impossible - if the socket is closed
> there's no way to refer to it and so no way to even attempt sending it.
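In code terms, the situation discussed above boils down to something like this on the helper side. It's a sketch under the assumption that fd is the copy passt-repair received earlier via SCM_RIGHTS and still holds after passt close()d its own descriptor; repair_off_and_drop() is not a function from the patches.

---
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* The helper's copy keeps the socket alive even though passt already
 * close()d its own descriptor, so repair mode can still be cleared on
 * it; clearing it, or simply exiting and letting the last reference go
 * away, releases the port bound on the source host. */
static void repair_off_and_drop(int fd)
{
	int cmd = TCP_REPAIR_OFF;

	if (setsockopt(fd, SOL_TCP, TCP_REPAIR, &cmd, sizeof(cmd)))
		perror("TCP_REPAIR");

	close(fd);
}
---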
A privileged helper to set/clear TCP_REPAIR on sockets on behalf of passt. Not used yet. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- Makefile | 10 +++-- passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 118 insertions(+), 3 deletions(-) create mode 100644 passt-repair.c diff --git a/Makefile b/Makefile index 1383875..1b71cb0 100644 --- a/Makefile +++ b/Makefile @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ vhost_user.c virtio.c vu_common.c QRAP_SRCS = qrap.c -SRCS = $(PASST_SRCS) $(QRAP_SRCS) +PASST_REPAIR_SRCS = passt-repair.c +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS) MANPAGES = passt.1 pasta.1 qrap.1 @@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man man1dir ?= $(mandir)/man1 ifeq ($(TARGET_ARCH),x86_64) -BIN := passt passt.avx2 pasta pasta.avx2 qrap +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair else -BIN := passt pasta qrap +BIN := passt pasta qrap passt-repair endif all: $(BIN) $(MANPAGES) docs @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt% qrap: $(QRAP_SRCS) passt.h $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS) +passt-repair: $(PASST_REPAIR_SRCS) + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS) + valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \ rt_sigreturn getpid gettid kill clock_gettime mmap \ mmap2 munmap open unlink gettimeofday futex statx \ diff --git a/passt-repair.c b/passt-repair.c new file mode 100644 index 0000000..e9b9609 --- /dev/null +++ b/passt-repair.c @@ -0,0 +1,111 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +/* PASST - Plug A Simple Socket Transport + * for qemu/UNIX domain socket mode + * + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets + * + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio <sbrivio(a)redhat.com> + * + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along + * with commands mapping to TCP_REPAIR values, and switch repair mode on or + * off. Reply by echoing the command. Exit if the command is INT_MAX. 
+ */ + +#include <sys/types.h> +#include <sys/socket.h> +#include <sys/un.h> +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <limits.h> +#include <unistd.h> +#include <netdb.h> + +#include <netinet/tcp.h> + +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */ + +int main(int argc, char **argv) +{ + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)] + __attribute__ ((aligned(__alignof__(struct cmsghdr)))); + struct sockaddr_un a = { AF_UNIX, "" }; + int cmd, fds[SCM_MAX_FD], s, ret, i; + struct cmsghdr *cmsg; + struct msghdr msg; + struct iovec iov; + + iov = (struct iovec){ &cmd, sizeof(cmd) }; + msg = (struct msghdr){ NULL, 0, &iov, 1, buf, sizeof(buf), 0 }; + cmsg = CMSG_FIRSTHDR(&msg); + + if (argc != 2) { + fprintf(stderr, "Usage: %s PATH\n", argv[0]); + return -1; + } + + ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]); + if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) { + fprintf(stderr, "Invalid socket path: %s\n", argv[1]); + return -1; + } + + if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) { + perror("Failed to create AF_UNIX socket"); + return -1; + } + + if (connect(s, (struct sockaddr *)&a, sizeof(a))) { + fprintf(stderr, "Failed to connect to %s: %s\n", argv[1], + strerror(errno)); + return -1; + } + + while (1) { + int n; + + if (recvmsg(s, &msg, 0) < 0) { + perror("Failed to receive message"); + return -1; + } + + if (!cmsg || + cmsg->cmsg_len < CMSG_LEN(sizeof(int)) || + cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) || + cmsg->cmsg_type != SCM_RIGHTS) + return -1; + + n = cmsg->cmsg_len / CMSG_LEN(sizeof(int)); + memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n); + + switch (cmd) { + case INT_MAX: + return 0; + case TCP_REPAIR_ON: + case TCP_REPAIR_OFF: + case TCP_REPAIR_OFF_NO_WP: + for (i = 0; i < n; i++) { + if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR, + &cmd, sizeof(int))) { + perror("Setting TCP_REPAIR"); + return -1; + } + } + + /* Confirm setting by echoing the command back */ + if (send(s, &cmd, sizeof(int), 0) < 0) { + fprintf(stderr, "Reply to command %i: %s\n", + cmd, strerror(errno)); + return -1; + } + + break; + default: + fprintf(stderr, "Unsupported command 0x%04x\n", cmd); + return -1; + } + } +} -- 2.43.0
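The passt side of this interface isn't part of this patch; it would look roughly like the mirror image of the helper loop above. A sketch follows (repair_cmd() is hypothetical, not necessarily what migrate.c in 6/7 does): batch up to SCM_MAX_FD sockets into one SCM_RIGHTS message, send the TCP_REPAIR_* command along with them, and wait for the echoed command as confirmation.

---
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define SCM_MAX_FD 253

/* 'conn' is the accepted connection from passt-repair, 'fds' holds 'n'
 * connection sockets, 'cmd' is a TCP_REPAIR_* value. Returns 0 once the
 * helper echoes the command back, negative on failure. */
static int repair_cmd(int conn, const int *fds, int n, int cmd)
{
	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
	struct iovec iov = { &cmd, sizeof(cmd) };
	struct msghdr msg = { NULL, 0, &iov, 1,
			      buf, CMSG_SPACE(sizeof(int) * n), 0 };
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
	int reply;

	if (n < 1 || n > SCM_MAX_FD)
		return -EINVAL;

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int) * n);
	memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * n);

	if (sendmsg(conn, &msg, 0) < 0)
		return -errno;

	/* The helper confirms by sending the same command back */
	if (recv(conn, &reply, sizeof(reply), 0) != sizeof(reply) ||
	    reply != cmd)
		return -1;

	return 0;
}
---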
On Tue, 28 Jan 2025 00:15:32 +0100
Stefano Brivio <sbrivio(a)redhat.com> wrote:

> A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
> passt. Not used yet.

...but tested against source.c / target.c, attached.

---
$ nc -l 9996
---
$ ./source 127.0.0.1 9996 9898 /tmp/repair.sock
sending sequence: 3244673313
receiving sequence: 2250449386
---
# ./passt-repair /tmp/repair.sock
---
$ strace ./target 127.0.0.1 9996 9898 /tmp/repair.sock 3244673313 2250449386
---
# ./passt-repair /tmp/repair.sock
---

--
Stefano
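A rough guess at how the source side produces the two sequence numbers printed above (dump_seqs() is not the actual source.c attachment: the connection socket is assumed to already be in repair mode, and error handling is omitted):

---
#include <netinet/tcp.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>

/* With TCP_REPAIR set on 's', select each queue in turn and read its
 * sequence number, to be handed to the target for TCP_QUEUE_SEQ there. */
static void dump_seqs(int s)
{
	uint32_t seq;
	socklen_t len = sizeof(seq);
	int qid;

	qid = TCP_SEND_QUEUE;
	setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &qid, sizeof(qid));
	getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, &len);
	printf("sending sequence: %u\n", seq);

	qid = TCP_RECV_QUEUE;
	setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &qid, sizeof(qid));
	getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, &len);
	printf("receiving sequence: %u\n", seq);
}
---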
On Tue, Jan 28, 2025 at 12:15:32AM +0100, Stefano Brivio wrote:A privileged helper to set/clear TCP_REPAIR on sockets on behalf of passt. Not used yet. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- Makefile | 10 +++-- passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 118 insertions(+), 3 deletions(-) create mode 100644 passt-repair.c diff --git a/Makefile b/Makefile index 1383875..1b71cb0 100644 --- a/Makefile +++ b/Makefile @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ vhost_user.c virtio.c vu_common.c QRAP_SRCS = qrap.c -SRCS = $(PASST_SRCS) $(QRAP_SRCS) +PASST_REPAIR_SRCS = passt-repair.c +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS) MANPAGES = passt.1 pasta.1 qrap.1 @@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man man1dir ?= $(mandir)/man1 ifeq ($(TARGET_ARCH),x86_64) -BIN := passt passt.avx2 pasta pasta.avx2 qrap +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair else -BIN := passt pasta qrap +BIN := passt pasta qrap passt-repair endif all: $(BIN) $(MANPAGES) docs @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt% qrap: $(QRAP_SRCS) passt.h $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS) +passt-repair: $(PASST_REPAIR_SRCS) + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS) + valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \ rt_sigreturn getpid gettid kill clock_gettime mmap \ mmap2 munmap open unlink gettimeofday futex statx \ diff --git a/passt-repair.c b/passt-repair.c new file mode 100644 index 0000000..e9b9609 --- /dev/null +++ b/passt-repair.c @@ -0,0 +1,111 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +/* PASST - Plug A Simple Socket Transport + * for qemu/UNIX domain socket mode + * + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets + * + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio <sbrivio(a)redhat.com> + * + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along + * with commands mapping to TCP_REPAIR values, and switch repair mode on or + * off. Reply by echoing the command. Exit if the command is INT_MAX. + */ + +#include <sys/types.h> +#include <sys/socket.h> +#include <sys/un.h> +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <limits.h> +#include <unistd.h> +#include <netdb.h> + +#include <netinet/tcp.h> + +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */ + +int main(int argc, char **argv) +{ + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)] + __attribute__ ((aligned(__alignof__(struct cmsghdr)))); + struct sockaddr_un a = { AF_UNIX, "" }; + int cmd, fds[SCM_MAX_FD], s, ret, i; + struct cmsghdr *cmsg; + struct msghdr msg; + struct iovec iov; + + iov = (struct iovec){ &cmd, sizeof(cmd) };I mean, local to local, it's *probably* fine, but still a network protocol not defined in terms of explicit width fields makes me nervous. I'd prefer to see the cmd being a packed structure with fixed width elements. I also think we should do some sort of basic magic / version exchange. I don't see any reason we'd need to extend the protocol, but I'd rather have the option if we have to. 
Plus checking a magic number should make things less damaging and more debuggable if you were to point the repair helper at an entirely unrelated unix socket instead of passt's repair socket.+ msg = (struct msghdr){ NULL, 0, &iov, 1, buf, sizeof(buf), 0 }; + cmsg = CMSG_FIRSTHDR(&msg); + + if (argc != 2) { + fprintf(stderr, "Usage: %s PATH\n", argv[0]); + return -1; + } + + ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]); + if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) { + fprintf(stderr, "Invalid socket path: %s\n", argv[1]); + return -1; + } + + if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {Hmm.. would a datagram socket better suit our needs here?+ perror("Failed to create AF_UNIX socket"); + return -1; + } + + if (connect(s, (struct sockaddr *)&a, sizeof(a))) { + fprintf(stderr, "Failed to connect to %s: %s\n", argv[1], + strerror(errno)); + return -1; + } + + while (1) { + int n; + + if (recvmsg(s, &msg, 0) < 0) { + perror("Failed to receive message"); + return -1; + } + + if (!cmsg || + cmsg->cmsg_len < CMSG_LEN(sizeof(int)) || + cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) || + cmsg->cmsg_type != SCM_RIGHTS) + return -1; + + n = cmsg->cmsg_len / CMSG_LEN(sizeof(int)); + memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n); + + switch (cmd) { + case INT_MAX: + return 0; + case TCP_REPAIR_ON: + case TCP_REPAIR_OFF: + case TCP_REPAIR_OFF_NO_WP: + for (i = 0; i < n; i++) { + if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR, + &cmd, sizeof(int))) { + perror("Setting TCP_REPAIR"); + return -1;We probably want this to report errors back to passt, rather than just dying in this case. That way if for some weird reason one socket can't be placed in repair mode, we can still migrate all the other connections.+ } + } + + /* Confirm setting by echoing the command back */ + if (send(s, &cmd, sizeof(int), 0) < 0) { + fprintf(stderr, "Reply to command %i: %s\n", + cmd, strerror(errno)); + return -1; + } + + break; + default: + fprintf(stderr, "Unsupported command 0x%04x\n", cmd); + return -1; + } + } +}-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
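One possible shape for the explicitly sized command David is asking for (purely hypothetical, not part of the posted patch or of any agreed protocol):

---
#include <stdint.h>

struct repair_cmd {
	uint32_t magic;		/* fixed constant checked by both sides */
	uint32_t version;	/* bumped if the protocol ever has to change */
	int32_t cmd;		/* TCP_REPAIR_* value, or a "quit" marker */
} __attribute__((packed));
---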
On Tue, 28 Jan 2025 12:51:59 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Tue, Jan 28, 2025 at 12:15:32AM +0100, Stefano Brivio wrote:It actually is, because: struct { int cmd; }; is a packet structure with fixed width elements. Any architecture we build for (at least the ones I'm aware of) has a 32-bit int. We can make it uint32_t if it makes you feel better.A privileged helper to set/clear TCP_REPAIR on sockets on behalf of passt. Not used yet. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- Makefile | 10 +++-- passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 118 insertions(+), 3 deletions(-) create mode 100644 passt-repair.c diff --git a/Makefile b/Makefile index 1383875..1b71cb0 100644 --- a/Makefile +++ b/Makefile @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ vhost_user.c virtio.c vu_common.c QRAP_SRCS = qrap.c -SRCS = $(PASST_SRCS) $(QRAP_SRCS) +PASST_REPAIR_SRCS = passt-repair.c +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS) MANPAGES = passt.1 pasta.1 qrap.1 @@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man man1dir ?= $(mandir)/man1 ifeq ($(TARGET_ARCH),x86_64) -BIN := passt passt.avx2 pasta pasta.avx2 qrap +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair else -BIN := passt pasta qrap +BIN := passt pasta qrap passt-repair endif all: $(BIN) $(MANPAGES) docs @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt% qrap: $(QRAP_SRCS) passt.h $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS) +passt-repair: $(PASST_REPAIR_SRCS) + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS) + valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \ rt_sigreturn getpid gettid kill clock_gettime mmap \ mmap2 munmap open unlink gettimeofday futex statx \ diff --git a/passt-repair.c b/passt-repair.c new file mode 100644 index 0000000..e9b9609 --- /dev/null +++ b/passt-repair.c @@ -0,0 +1,111 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +/* PASST - Plug A Simple Socket Transport + * for qemu/UNIX domain socket mode + * + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets + * + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio <sbrivio(a)redhat.com> + * + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along + * with commands mapping to TCP_REPAIR values, and switch repair mode on or + * off. Reply by echoing the command. Exit if the command is INT_MAX. + */ + +#include <sys/types.h> +#include <sys/socket.h> +#include <sys/un.h> +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <limits.h> +#include <unistd.h> +#include <netdb.h> + +#include <netinet/tcp.h> + +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */ + +int main(int argc, char **argv) +{ + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)] + __attribute__ ((aligned(__alignof__(struct cmsghdr)))); + struct sockaddr_un a = { AF_UNIX, "" }; + int cmd, fds[SCM_MAX_FD], s, ret, i; + struct cmsghdr *cmsg; + struct msghdr msg; + struct iovec iov; + + iov = (struct iovec){ &cmd, sizeof(cmd) };I mean, local to local, it's *probably* fine, but still a network protocol not defined in terms of explicit width fields makes me nervous. 
I'd prefer to see the cmd being a packed structure with fixed width elements.I also think we should do some sort of basic magic / version exchange. I don't see any reason we'd need to extend the protocol, but I'd rather have the option if we have to.passt-repair will be packaged and distributed together with passt, though. Versions must match. And latency here might matter more than in the rest of the migration process.Plus checking a magic number should make things less damaging and more debuggable if you were to point the repair helper at an entirely unrelated unix socket instead of passt's repair socket.Maybe, yes, even though I don't really see good chances for that mistake to happen. Feel free to post a proposal, of course.We need a connection though, so that passt knows when the helper is ready to get messages. It could be done with a synchronisation datagram but it looks more complicated to handle. By the way, with a connection, we could probably just close() the socket here instead of having a "quit" command. If you're referring to the fact we don't keep message boundaries, so we would in theory need to add short read handling to the recvmsg() below: I'd rather switch cmd to a single byte instead. You can't transfer less than that.+ msg = (struct msghdr){ NULL, 0, &iov, 1, buf, sizeof(buf), 0 }; + cmsg = CMSG_FIRSTHDR(&msg); + + if (argc != 2) { + fprintf(stderr, "Usage: %s PATH\n", argv[0]); + return -1; + } + + ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]); + if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) { + fprintf(stderr, "Invalid socket path: %s\n", argv[1]); + return -1; + } + + if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {Hmm.. would a datagram socket better suit our needs here?We implicitly report the error in the sense that we close the connection and passt will abort the migration. If you look at the handling of TCP_REPAIR in do_tcp_setsockopt(), you'll see that it either always fails (EPERM), or always succeeds. I mean, it's straightforward to implement, and we can just reply with a different command. But it's probably more meaningful and fitting to abort altogether. Besides, if we have to report exactly on which socket we failed, we won't be able to switch to a single-byte command protocol.+ perror("Failed to create AF_UNIX socket"); + return -1; + } + + if (connect(s, (struct sockaddr *)&a, sizeof(a))) { + fprintf(stderr, "Failed to connect to %s: %s\n", argv[1], + strerror(errno)); + return -1; + } + + while (1) { + int n; + + if (recvmsg(s, &msg, 0) < 0) { + perror("Failed to receive message"); + return -1; + } + + if (!cmsg || + cmsg->cmsg_len < CMSG_LEN(sizeof(int)) || + cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) || + cmsg->cmsg_type != SCM_RIGHTS) + return -1; + + n = cmsg->cmsg_len / CMSG_LEN(sizeof(int)); + memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n); + + switch (cmd) { + case INT_MAX: + return 0; + case TCP_REPAIR_ON: + case TCP_REPAIR_OFF: + case TCP_REPAIR_OFF_NO_WP: + for (i = 0; i < n; i++) { + if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR, + &cmd, sizeof(int))) { + perror("Setting TCP_REPAIR"); + return -1;We probably want this to report errors back to passt, rather than just dying in this case. 
> That way if for some weird reason one socket can't be placed in repair
> mode, we can still migrate all the other connections.

> +			}
> +		}
> +
> +		/* Confirm setting by echoing the command back */
> +		if (send(s, &cmd, sizeof(int), 0) < 0) {
> +			fprintf(stderr, "Reply to command %i: %s\n",
> +				cmd, strerror(errno));
> +			return -1;
> +		}
> +
> +		break;
> +	default:
> +		fprintf(stderr, "Unsupported command 0x%04x\n", cmd);
> +		return -1;
> +	}
> +	}
> +}

--
Stefano
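As a sketch of the single-byte variant mentioned above, assuming the command becomes an int8_t and EOF on the connection replaces the explicit quit command (recv_cmd() is hypothetical, and msg is assumed to be set up as in the patch, with its iovec pointing at the command byte):

---
#include <sys/socket.h>
#include <sys/types.h>

static int recv_cmd(int s, struct msghdr *msg)
{
	ssize_t n = recvmsg(s, msg, 0);

	if (n < 0)
		return -1;	/* error: give up */

	if (n == 0)
		return 0;	/* EOF: passt closed the connection, exit */

	/* n == 1 by construction: a one-byte command can't be split, so no
	 * short read handling is needed; any fds are in the ancillary data */
	return 1;
}
---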
On Tue, Jan 28, 2025 at 07:51:31AM +0100, Stefano Brivio wrote:On Tue, 28 Jan 2025 12:51:59 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Sorry, I should have said "*explicitly* fixed width fields". So, yes, uint32_t would make me feel better :)On Tue, Jan 28, 2025 at 12:15:32AM +0100, Stefano Brivio wrote:It actually is, because: struct { int cmd; }; is a packet structure with fixed width elements. Any architecture we build for (at least the ones I'm aware of) has a 32-bit int. We can make it uint32_t if it makes you feel better.A privileged helper to set/clear TCP_REPAIR on sockets on behalf of passt. Not used yet. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- Makefile | 10 +++-- passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 118 insertions(+), 3 deletions(-) create mode 100644 passt-repair.c diff --git a/Makefile b/Makefile index 1383875..1b71cb0 100644 --- a/Makefile +++ b/Makefile @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ vhost_user.c virtio.c vu_common.c QRAP_SRCS = qrap.c -SRCS = $(PASST_SRCS) $(QRAP_SRCS) +PASST_REPAIR_SRCS = passt-repair.c +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS) MANPAGES = passt.1 pasta.1 qrap.1 @@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man man1dir ?= $(mandir)/man1 ifeq ($(TARGET_ARCH),x86_64) -BIN := passt passt.avx2 pasta pasta.avx2 qrap +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair else -BIN := passt pasta qrap +BIN := passt pasta qrap passt-repair endif all: $(BIN) $(MANPAGES) docs @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt% qrap: $(QRAP_SRCS) passt.h $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS) +passt-repair: $(PASST_REPAIR_SRCS) + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS) + valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \ rt_sigreturn getpid gettid kill clock_gettime mmap \ mmap2 munmap open unlink gettimeofday futex statx \ diff --git a/passt-repair.c b/passt-repair.c new file mode 100644 index 0000000..e9b9609 --- /dev/null +++ b/passt-repair.c @@ -0,0 +1,111 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +/* PASST - Plug A Simple Socket Transport + * for qemu/UNIX domain socket mode + * + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets + * + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio <sbrivio(a)redhat.com> + * + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along + * with commands mapping to TCP_REPAIR values, and switch repair mode on or + * off. Reply by echoing the command. Exit if the command is INT_MAX. 
+ */ + +#include <sys/types.h> +#include <sys/socket.h> +#include <sys/un.h> +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <limits.h> +#include <unistd.h> +#include <netdb.h> + +#include <netinet/tcp.h> + +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */ + +int main(int argc, char **argv) +{ + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)] + __attribute__ ((aligned(__alignof__(struct cmsghdr)))); + struct sockaddr_un a = { AF_UNIX, "" }; + int cmd, fds[SCM_MAX_FD], s, ret, i; + struct cmsghdr *cmsg; + struct msghdr msg; + struct iovec iov; + + iov = (struct iovec){ &cmd, sizeof(cmd) };I mean, local to local, it's *probably* fine, but still a network protocol not defined in terms of explicit width fields makes me nervous. I'd prefer to see the cmd being a packed structure with fixed width elements.But nothing enforces that. AIUI KubeVirt will be running passt-repair in a different context. Which means it may well be deployed by a different path than the passt binary, which means however we distribute it's quite plausible that a downstream screwup could mismatch the versions. We should endeavour to have a reasonably graceful failure mode for that.I also think we should do some sort of basic magic / version exchange. I don't see any reason we'd need to extend the protocol, but I'd rather have the option if we have to.passt-repair will be packaged and distributed together with passt, though. Versions must match.And latency here might matter more than in the rest of the migration process.I disagree on the good chances for a mistake thing. In GSS I saw plenty of occasions where things that shouldn't be mismatched were due to some packaging or user screwup. And that's before even considering the way that KubeVirt deploys its various pieces seems to provide a number of opportunities to mess this up. So, I'll see what I can come up with. I'm fine with requiring matching versions if it's actually checked. Maybe a magic derived from our git hash, or even our build-id.Plus checking a magic number should make things less damaging and more debuggable if you were to point the repair helper at an entirely unrelated unix socket instead of passt's repair socket.Maybe, yes, even though I don't really see good chances for that mistake to happen. Feel free to post a proposal, of course.Good point.We need a connection though, so that passt knows when the helper is ready to get messages. It could be done with a synchronisation datagram but it looks more complicated to handle.+ msg = (struct msghdr){ NULL, 0, &iov, 1, buf, sizeof(buf), 0 }; + cmsg = CMSG_FIRSTHDR(&msg); + + if (argc != 2) { + fprintf(stderr, "Usage: %s PATH\n", argv[0]); + return -1; + } + + ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]); + if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) { + fprintf(stderr, "Invalid socket path: %s\n", argv[1]); + return -1; + } + + if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {Hmm.. would a datagram socket better suit our needs here?By the way, with a connection, we could probably just close() the socket here instead of having a "quit" command.True.If you're referring to the fact we don't keep message boundaries, so we would in theory need to add short read handling to the recvmsg() below: I'd rather switch cmd to a single byte instead. 
You can't transfer less than that.I was thinking that preserving message boundaries might allow extending the command format more easily, but you've convinced me it's not worth the trouble.Ah, right. That's probably good enough for now.We implicitly report the error in the sense that we close the connection and passt will abort the migration. If you look at the handling of TCP_REPAIR in do_tcp_setsockopt(), you'll see that it either always fails (EPERM), or always succeeds.+ perror("Failed to create AF_UNIX socket"); + return -1; + } + + if (connect(s, (struct sockaddr *)&a, sizeof(a))) { + fprintf(stderr, "Failed to connect to %s: %s\n", argv[1], + strerror(errno)); + return -1; + } + + while (1) { + int n; + + if (recvmsg(s, &msg, 0) < 0) { + perror("Failed to receive message"); + return -1; + } + + if (!cmsg || + cmsg->cmsg_len < CMSG_LEN(sizeof(int)) || + cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) || + cmsg->cmsg_type != SCM_RIGHTS) + return -1; + + n = cmsg->cmsg_len / CMSG_LEN(sizeof(int)); + memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n); + + switch (cmd) { + case INT_MAX: + return 0; + case TCP_REPAIR_ON: + case TCP_REPAIR_OFF: + case TCP_REPAIR_OFF_NO_WP: + for (i = 0; i < n; i++) { + if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR, + &cmd, sizeof(int))) { + perror("Setting TCP_REPAIR"); + return -1;We probably want this to report errors back to passt, rather than just dying in this case. That way if for some weird reason one socket can't be placed in repair mode, we can still migrate all the other connections.I mean, it's straightforward to implement, and we can just reply with a different command. But it's probably more meaningful and fitting to abort altogether.Right, best effort maintenance of connections can be a later feature, if anyone wants it.Besides, if we have to report exactly on which socket we failed, we won't be able to switch to a single-byte command protocol.-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson> + } > + } > + > + /* Confirm setting by echoing the command back */ > + if (send(s, &cmd, sizeof(int), 0) < 0) { > + fprintf(stderr, "Reply to command %i: %s\n", > + cmd, strerror(errno)); > + return -1; > + } > + > + break; > + default: > + fprintf(stderr, "Unsupported command 0x%04x\n", cmd); > + return -1; > + } > + } > +}
On Wed, 29 Jan 2025 12:29:27 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Tue, Jan 28, 2025 at 07:51:31AM +0100, Stefano Brivio wrote:Changed to int8_t anyway meanwhile. We don't need all those bits.On Tue, 28 Jan 2025 12:51:59 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Sorry, I should have said "*explicitly* fixed width fields". So, yes, uint32_t would make me feel better :)On Tue, Jan 28, 2025 at 12:15:32AM +0100, Stefano Brivio wrote:It actually is, because: struct { int cmd; }; is a packet structure with fixed width elements. Any architecture we build for (at least the ones I'm aware of) has a 32-bit int. We can make it uint32_t if it makes you feel better.A privileged helper to set/clear TCP_REPAIR on sockets on behalf of passt. Not used yet. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- Makefile | 10 +++-- passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 118 insertions(+), 3 deletions(-) create mode 100644 passt-repair.c diff --git a/Makefile b/Makefile index 1383875..1b71cb0 100644 --- a/Makefile +++ b/Makefile @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ vhost_user.c virtio.c vu_common.c QRAP_SRCS = qrap.c -SRCS = $(PASST_SRCS) $(QRAP_SRCS) +PASST_REPAIR_SRCS = passt-repair.c +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS) MANPAGES = passt.1 pasta.1 qrap.1 @@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man man1dir ?= $(mandir)/man1 ifeq ($(TARGET_ARCH),x86_64) -BIN := passt passt.avx2 pasta pasta.avx2 qrap +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair else -BIN := passt pasta qrap +BIN := passt pasta qrap passt-repair endif all: $(BIN) $(MANPAGES) docs @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt% qrap: $(QRAP_SRCS) passt.h $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS) +passt-repair: $(PASST_REPAIR_SRCS) + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS) + valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \ rt_sigreturn getpid gettid kill clock_gettime mmap \ mmap2 munmap open unlink gettimeofday futex statx \ diff --git a/passt-repair.c b/passt-repair.c new file mode 100644 index 0000000..e9b9609 --- /dev/null +++ b/passt-repair.c @@ -0,0 +1,111 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +/* PASST - Plug A Simple Socket Transport + * for qemu/UNIX domain socket mode + * + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets + * + * Copyright (c) 2025 Red Hat GmbH + * Author: Stefano Brivio <sbrivio(a)redhat.com> + * + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along + * with commands mapping to TCP_REPAIR values, and switch repair mode on or + * off. Reply by echoing the command. Exit if the command is INT_MAX. 
+ */ + +#include <sys/types.h> +#include <sys/socket.h> +#include <sys/un.h> +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <limits.h> +#include <unistd.h> +#include <netdb.h> + +#include <netinet/tcp.h> + +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */ + +int main(int argc, char **argv) +{ + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)] + __attribute__ ((aligned(__alignof__(struct cmsghdr)))); + struct sockaddr_un a = { AF_UNIX, "" }; + int cmd, fds[SCM_MAX_FD], s, ret, i; + struct cmsghdr *cmsg; + struct msghdr msg; + struct iovec iov; + + iov = (struct iovec){ &cmd, sizeof(cmd) };I mean, local to local, it's *probably* fine, but still a network protocol not defined in terms of explicit width fields makes me nervous. I'd prefer to see the cmd being a packed structure with fixed width elements.Distribution packages. If I run claws-mail with the wrong version of, say, libpixman, it won't start. If you don't use them, you're on your own.But nothing enforces that.I also think we should do some sort of basic magic / version exchange. I don't see any reason we'd need to extend the protocol, but I'd rather have the option if we have to.passt-repair will be packaged and distributed together with passt, though. Versions must match.AIUI KubeVirt will be running passt-repair in a different context. Which means it may well be deployed by a different path than the passt binaryNo, that's not the way it works. It needs to match, in the sense that 1. it's a KubeVirt requirement to have compatible packages between distribution and the "base container image" and 2. this would most likely be sourced from the "base container image" anyway. I maintain the packages for four distributions, plus AppArmor and SELinux policies upstream and downstream, and I take care of updating the package in KubeVirt as well, so I guess I have a vague idea of what's convenient, enforced, burdensome, and so on.which means however we distribute it's quite plausible that a downstream screwup could mismatch the versions. We should endeavour to have a reasonably graceful failure mode for that.Regardless of this, I think that *this one* is an interface (I wouldn't even call it a protocol) that needs to be set in stone, except for hypothetical (and highly unlikely) UAPI additions which we'll be anyway able to accommodate for easily. It's a single socket option with three possible values (for 13 years now), of which we plan to use two. If we want this interface to do anything else, it should be another interface. So there's really no problem with this. Besides, the helper runs with CAP_NET_ADMIN (even though CAP_NET_RAW should ideally suffice), so it needs to be extremely simple and auditable.Both would make things significantly less usable, because they would make different but compatible builds incompatible, and different implementations rather inconvenient. For example, it might be a practical solution to have a Go implementation of this in KubeVirt's virt-handler itself, but if it needs to extract information or strings from the binary, that becomes impractical. -- StefanoAnd latency here might matter more than in the rest of the migration process.I disagree on the good chances for a mistake thing. In GSS I saw plenty of occasions where things that shouldn't be mismatched were due to some packaging or user screwup. 
And that's before even considering the way that KubeVirt deploys its various pieces seems to provide a number of opportunities to mess this up. So, I'll see what I can come up with. I'm fine with requiring matching versions if it's actually checked. Maybe a magic derived from our git hash, or even our build-id.

Plus checking a magic number should make things less damaging and more debuggable if you were to point the repair helper at an entirely unrelated unix socket instead of passt's repair socket.

Maybe, yes, even though I don't really see good chances for that mistake to happen. Feel free to post a proposal, of course.
On Wed, Jan 29, 2025 at 08:04:28AM +0100, Stefano Brivio wrote:On Wed, 29 Jan 2025 12:29:27 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Works or me.On Tue, Jan 28, 2025 at 07:51:31AM +0100, Stefano Brivio wrote:Changed to int8_t anyway meanwhile. We don't need all those bits.On Tue, 28 Jan 2025 12:51:59 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Sorry, I should have said "*explicitly* fixed width fields". So, yes, uint32_t would make me feel better :)On Tue, Jan 28, 2025 at 12:15:32AM +0100, Stefano Brivio wrote: > A privileged helper to set/clear TCP_REPAIR on sockets on behalf of > passt. Not used yet. > > Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> > --- > Makefile | 10 +++-- > passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 118 insertions(+), 3 deletions(-) > create mode 100644 passt-repair.c > > diff --git a/Makefile b/Makefile > index 1383875..1b71cb0 100644 > --- a/Makefile > +++ b/Makefile > @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ > tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > vhost_user.c virtio.c vu_common.c > QRAP_SRCS = qrap.c > -SRCS = $(PASST_SRCS) $(QRAP_SRCS) > +PASST_REPAIR_SRCS = passt-repair.c > +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS) > > MANPAGES = passt.1 pasta.1 qrap.1 > > @@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man > man1dir ?= $(mandir)/man1 > > ifeq ($(TARGET_ARCH),x86_64) > -BIN := passt passt.avx2 pasta pasta.avx2 qrap > +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair > else > -BIN := passt pasta qrap > +BIN := passt pasta qrap passt-repair > endif > > all: $(BIN) $(MANPAGES) docs > @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt% > qrap: $(QRAP_SRCS) passt.h > $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS) > > +passt-repair: $(PASST_REPAIR_SRCS) > + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS) > + > valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \ > rt_sigreturn getpid gettid kill clock_gettime mmap \ > mmap2 munmap open unlink gettimeofday futex statx \ > diff --git a/passt-repair.c b/passt-repair.c > new file mode 100644 > index 0000000..e9b9609 > --- /dev/null > +++ b/passt-repair.c > @@ -0,0 +1,111 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > + > +/* PASST - Plug A Simple Socket Transport > + * for qemu/UNIX domain socket mode > + * > + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets > + * > + * Copyright (c) 2025 Red Hat GmbH > + * Author: Stefano Brivio <sbrivio(a)redhat.com> > + * > + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along > + * with commands mapping to TCP_REPAIR values, and switch repair mode on or > + * off. Reply by echoing the command. Exit if the command is INT_MAX. 
> + */ > + > +#include <sys/types.h> > +#include <sys/socket.h> > +#include <sys/un.h> > +#include <errno.h> > +#include <stdio.h> > +#include <stdlib.h> > +#include <string.h> > +#include <limits.h> > +#include <unistd.h> > +#include <netdb.h> > + > +#include <netinet/tcp.h> > + > +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */ > + > +int main(int argc, char **argv) > +{ > + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)] > + __attribute__ ((aligned(__alignof__(struct cmsghdr)))); > + struct sockaddr_un a = { AF_UNIX, "" }; > + int cmd, fds[SCM_MAX_FD], s, ret, i; > + struct cmsghdr *cmsg; > + struct msghdr msg; > + struct iovec iov; > + > + iov = (struct iovec){ &cmd, sizeof(cmd) }; I mean, local to local, it's *probably* fine, but still a network protocol not defined in terms of explicit width fields makes me nervous. I'd prefer to see the cmd being a packed structure with fixed width elements.It actually is, because: struct { int cmd; }; is a packet structure with fixed width elements. Any architecture we build for (at least the ones I'm aware of) has a 32-bit int. We can make it uint32_t if it makes you feel better.But shared libraries *do* have versioning checks: there are defined compatibility semantics for sonames, and there can be symbol versions as well.Distribution packages. If I run claws-mail with the wrong version of, say, libpixman, it won't start. If you don't use them, you're on your own.But nothing enforces that.I also think we should do some sort of basic magic / version exchange. I don't see any reason we'd need to extend the protocol, but I'd rather have the option if we have to.passt-repair will be packaged and distributed together with passt, though. Versions must match.Ok, I can buy that, but it's a contradictory position to "versions must match".AIUI KubeVirt will be running passt-repair in a different context. Which means it may well be deployed by a different path than the passt binaryNo, that's not the way it works. It needs to match, in the sense that 1. it's a KubeVirt requirement to have compatible packages between distribution and the "base container image" and 2. this would most likely be sourced from the "base container image" anyway. I maintain the packages for four distributions, plus AppArmor and SELinux policies upstream and downstream, and I take care of updating the package in KubeVirt as well, so I guess I have a vague idea of what's convenient, enforced, burdensome, and so on.which means however we distribute it's quite plausible that a downstream screwup could mismatch the versions. We should endeavour to have a reasonably graceful failure mode for that.Regardless of this, I think that *this one* is an interface (I wouldn't even call it a protocol) that needs to be set in stone, except for hypothetical (and highly unlikely) UAPI additions which we'll be anyway able to accommodate for easily.It's a single socket option with three possible values (for 13 years now), of which we plan to use two. If we want this interface to do anything else, it should be another interface. So there's really no problem with this. 
Besides, the helper runs with CAP_NET_ADMIN (even though CAP_NET_RAW should ideally suffice), so it needs to be extremely simple and auditable.Sending and checking a magic number is not a lot of complexity, even in something on this scale.Ok, so you're definitely now saying versions *don't* need to match.Both would make things significantly less usable, because they would make different but compatible builds incompatible, and different implementations rather inconvenient.And latency here might matter more than in the rest of the migration process.I disagree on the good chances for a mistake thing. In GSS I saw plenty of occasions where things that shouldn't be mismatched were due to some packaging or user screwup. And that's before even considering the way that KubeVirt deploys its various pieces seems to provide a number of opportunities to mess this up. So, I'll see what I can come up with. I'm fine with requiring matching versions if it's actually checked. Maybe a magic derived from our git hash, or even our build-id.Plus checking a magic number should make things less damaging and more debuggable if you were to point the repair helper at an entirely unrelated unix socket instead of passt's repair socket.Maybe, yes, even though I don't really see good chances for that mistake to happen. Feel free to post a proposal, of course.For example, it might be a practical solution to have a Go implementation of this in KubeVirt's virt-handler itself, but if it needs to extract information or strings from the binary, that becomes impractical.Ok... could we at least add just a magic number then. If we do ever need a new protocol we can change it, otherwise the protocol immutable for now. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, 30 Jan 2025 11:53:08 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Wed, Jan 29, 2025 at 08:04:28AM +0100, Stefano Brivio wrote:Note: "Regardless of this". It's *another* consideration *on top of that*. 1. Versions (builds) match. 2. And even if they didn't, it wouldn't be a problem, because this interface will not change.On Wed, 29 Jan 2025 12:29:27 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Works or me.On Tue, Jan 28, 2025 at 07:51:31AM +0100, Stefano Brivio wrote:Changed to int8_t anyway meanwhile. We don't need all those bits.On Tue, 28 Jan 2025 12:51:59 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote: > On Tue, Jan 28, 2025 at 12:15:32AM +0100, Stefano Brivio wrote: > > A privileged helper to set/clear TCP_REPAIR on sockets on behalf of > > passt. Not used yet. > > > > Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> > > --- > > Makefile | 10 +++-- > > passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 118 insertions(+), 3 deletions(-) > > create mode 100644 passt-repair.c > > > > diff --git a/Makefile b/Makefile > > index 1383875..1b71cb0 100644 > > --- a/Makefile > > +++ b/Makefile > > @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ > > tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > > vhost_user.c virtio.c vu_common.c > > QRAP_SRCS = qrap.c > > -SRCS = $(PASST_SRCS) $(QRAP_SRCS) > > +PASST_REPAIR_SRCS = passt-repair.c > > +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS) > > > > MANPAGES = passt.1 pasta.1 qrap.1 > > > > @@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man > > man1dir ?= $(mandir)/man1 > > > > ifeq ($(TARGET_ARCH),x86_64) > > -BIN := passt passt.avx2 pasta pasta.avx2 qrap > > +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair > > else > > -BIN := passt pasta qrap > > +BIN := passt pasta qrap passt-repair > > endif > > > > all: $(BIN) $(MANPAGES) docs > > @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt% > > qrap: $(QRAP_SRCS) passt.h > > $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS) > > > > +passt-repair: $(PASST_REPAIR_SRCS) > > + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS) > > + > > valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \ > > rt_sigreturn getpid gettid kill clock_gettime mmap \ > > mmap2 munmap open unlink gettimeofday futex statx \ > > diff --git a/passt-repair.c b/passt-repair.c > > new file mode 100644 > > index 0000000..e9b9609 > > --- /dev/null > > +++ b/passt-repair.c > > @@ -0,0 +1,111 @@ > > +// SPDX-License-Identifier: GPL-2.0-or-later > > + > > +/* PASST - Plug A Simple Socket Transport > > + * for qemu/UNIX domain socket mode > > + * > > + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets > > + * > > + * Copyright (c) 2025 Red Hat GmbH > > + * Author: Stefano Brivio <sbrivio(a)redhat.com> > > + * > > + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along > > + * with commands mapping to TCP_REPAIR values, and switch repair mode on or > > + * off. Reply by echoing the command. Exit if the command is INT_MAX. 
> > + */ > > + > > +#include <sys/types.h> > > +#include <sys/socket.h> > > +#include <sys/un.h> > > +#include <errno.h> > > +#include <stdio.h> > > +#include <stdlib.h> > > +#include <string.h> > > +#include <limits.h> > > +#include <unistd.h> > > +#include <netdb.h> > > + > > +#include <netinet/tcp.h> > > + > > +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */ > > + > > +int main(int argc, char **argv) > > +{ > > + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)] > > + __attribute__ ((aligned(__alignof__(struct cmsghdr)))); > > + struct sockaddr_un a = { AF_UNIX, "" }; > > + int cmd, fds[SCM_MAX_FD], s, ret, i; > > + struct cmsghdr *cmsg; > > + struct msghdr msg; > > + struct iovec iov; > > + > > + iov = (struct iovec){ &cmd, sizeof(cmd) }; > > I mean, local to local, it's *probably* fine, but still a network > protocol not defined in terms of explicit width fields makes me > nervous. I'd prefer to see the cmd being a packed structure with > fixed width elements. It actually is, because: struct { int cmd; }; is a packet structure with fixed width elements. Any architecture we build for (at least the ones I'm aware of) has a 32-bit int. We can make it uint32_t if it makes you feel better.Sorry, I should have said "*explicitly* fixed width fields". So, yes, uint32_t would make me feel better :)But shared libraries *do* have versioning checks: there are defined compatibility semantics for sonames, and there can be symbol versions as well.Distribution packages. If I run claws-mail with the wrong version of, say, libpixman, it won't start. If you don't use them, you're on your own.> I also think we should do some sort of basic magic / version exchange. > I don't see any reason we'd need to extend the protocol, but I'd > rather have the option if we have to. passt-repair will be packaged and distributed together with passt, though. Versions must match.But nothing enforces that.Ok, I can buy that, but it's a contradictory position to "versions must match".AIUI KubeVirt will be running passt-repair in a different context. Which means it may well be deployed by a different path than the passt binaryNo, that's not the way it works. It needs to match, in the sense that 1. it's a KubeVirt requirement to have compatible packages between distribution and the "base container image" and 2. this would most likely be sourced from the "base container image" anyway. I maintain the packages for four distributions, plus AppArmor and SELinux policies upstream and downstream, and I take care of updating the package in KubeVirt as well, so I guess I have a vague idea of what's convenient, enforced, burdensome, and so on.which means however we distribute it's quite plausible that a downstream screwup could mismatch the versions. We should endeavour to have a reasonably graceful failure mode for that.Regardless of this, I think that *this one* is an interface (I wouldn't even call it a protocol) that needs to be set in stone, except for hypothetical (and highly unlikely) UAPI additions which we'll be anyway able to accommodate for easily.If you want to have multiple bytes (because I'm forecasting that you won't be happy with 255 values), it's substantial complexity in comparison to the current implementation.It's a single socket option with three possible values (for 13 years now), of which we plan to use two. If we want this interface to do anything else, it should be another interface. So there's really no problem with this. 
> > Besides, the helper runs with CAP_NET_ADMIN (even though CAP_NET_RAW
> > should ideally suffice), so it needs to be extremely simple and
> > auditable.

> Sending and checking a magic number is not a lot of complexity, even
> in something on this scale.

> Ok, so you're definitely now saying versions *don't* need to match.

They don't need to, no. They will match, but they don't need to.

> > And latency here might matter more than in the rest of the migration
> > process.

> Plus checking a magic number should make things less damaging and
> more debuggable if you were to point the repair helper at an entirely
> unrelated unix socket instead of passt's repair socket.

> > Maybe, yes, even though I don't really see good chances for that
> > mistake to happen. Feel free to post a proposal, of course.

> I disagree on the good chances for a mistake thing. In GSS I saw
> plenty of occasions where things that shouldn't be mismatched were,
> due to some packaging or user screwup. And that's before even
> considering the way that KubeVirt deploys its various pieces seems to
> provide a number of opportunities to mess this up.
>
> So, I'll see what I can come up with. I'm fine with requiring
> matching versions if it's actually checked. Maybe a magic derived
> from our git hash, or even our build-id.

Both would make things significantly less usable, because they would
make different but compatible builds incompatible, and different
implementations rather inconvenient.

For example, it might be a practical solution to have a Go
implementation of this in KubeVirt's virt-handler itself, but if it
needs to extract information or strings from the binary, that becomes
impractical.

> Ok... could we at least add just a magic number then. If we do ever
> need a new protocol we can change it, otherwise the protocol is
> immutable for now.

Adding a non-byte magic number implies handling short reads and short
writes, which I think is absolutely unnecessary. Feel free to propose
an implementation, as usual.

If you are happy with a single byte magic number, then I suppose that,
given that we're just using three values, we could encode that
information using a combination of the remaining bits, which has the
advantage of not needing any specific implementation until it's
actually needed (never, I suppose), because passt-repair already
terminates on an unknown command value.

-- 
Stefano
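For illustration, the explicitly fixed-width, one-byte command being
discussed above could look like the sketch below. This is not the
actual passt-repair code: the constants mirror the TCP_REPAIR values
from the Linux UAPI (linux/tcp.h), and the type and helper names are
hypothetical.

/* Sketch only: a one-byte command reusing the TCP_REPAIR values.
 * Only three of the 256 possible values are meaningful, so an unknown
 * value can simply terminate the helper, leaving the rest of the
 * range free for hypothetical future use. */
#include <stdint.h>

#ifndef TCP_REPAIR_ON
#define TCP_REPAIR_ON		1
#define TCP_REPAIR_OFF		0
#define TCP_REPAIR_OFF_NO_WP	-1	/* Turn off without window probes */
#endif

typedef int8_t repair_cmd_t;		/* hypothetical name, not from passt */

static inline int repair_cmd_valid(repair_cmd_t cmd)
{
	return cmd == TCP_REPAIR_ON || cmd == TCP_REPAIR_OFF ||
	       cmd == TCP_REPAIR_OFF_NO_WP;
}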
On Thu, Jan 30, 2025 at 05:55:43AM +0100, Stefano Brivio wrote:
> On Thu, 30 Jan 2025 11:53:08 +1100
> David Gibson <david(a)gibson.dropbear.id.au> wrote:
>
> Note: "Regardless of this". It's *another* consideration *on top of
> that*.
>
> 1. Versions (builds) match.
>
> 2. And even if they didn't, it wouldn't be a problem, because this
>    interface will not change.

Hm, ok. I'm way less convinced on both of those points. Which means
I'd like to have a clear policy on whether we require versions to
match or not. Which we prioritise affects design choices.
> [...]
>
> Adding a non-byte magic number implies handling short reads and short
> writes, which I think is absolutely unnecessary. Feel free to propose
> an implementation, as usual.

Ok, it's on my list.

> If you are happy with a single byte magic number, then I suppose
> that, given that we're just using three values, we could encode that
> information using a combination of the remaining bits, which has the
> advantage of not needing any specific implementation until it's
> actually needed (never, I suppose), because passt-repair already
> terminates on an unknown command value.

Yeah, as you predicted, I'm not really happy with a 1 byte magic
number.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
				http://www.ozlabs.org/~dgibson
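As a side note on the trade-off being debated here, the kind of
handling that a multi-byte magic number forces on the receiving end
can be sketched as below. This is purely illustrative, with
hypothetical names, and is not proposed code: even a 4-byte value on a
SOCK_STREAM AF_UNIX socket needs a loop to cope with short reads,
EINTR and EOF, which is the extra complexity Stefano is pointing at.

/* Illustrative sketch, not passt code: receiving a hypothetical
 * 4-byte magic means looping until all bytes have arrived. */
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

static int recv_magic(int fd, uint32_t expect)
{
	uint32_t magic;
	char *p = (char *)&magic;
	size_t got = 0;

	while (got < sizeof(magic)) {
		ssize_t n = read(fd, p + got, sizeof(magic) - got);

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			return -1;	/* read error or unexpected EOF */

		got += n;
	}

	return magic == expect ? 0 : -1;
}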
On Thu, 30 Jan 2025 18:43:34 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> Hm, ok. I'm way less convinced on both of those points. Which means
> I'd like to have a clear policy on whether we require versions to
> match or not. Which we prioritise affects design choices.

No, in this case, we don't require versions to match. The protocol is
well-defined and won't change. Any change to it will require a
different interface.

The protocol is one byte, which can be TCP_REPAIR_ON, TCP_REPAIR_OFF,
or TCP_REPAIR_OFF_NO_WP, plus one to SCM_MAX_FD sockets as an
SCM_RIGHTS ancillary message, sent by the server. The client replies
with the same byte (and no ancillary message) to signal success, and
closes the connection on failure. The server closes the connection on
error or completion.

This is obviously enough for a privileged helper whose only function
is setting the TCP_REPAIR socket option to TCP_REPAIR_ON,
TCP_REPAIR_OFF, or TCP_REPAIR_OFF_NO_WP on a given set of sockets.

As a result, I think that any added complexity is plain wrong.

-- 
Stefano
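To make the exchange described above concrete, a rough sketch of one
client-side ("helper") iteration could look like this. It is
illustrative only, under the assumptions stated in the comments: the
function name is hypothetical, error handling is minimal, and this is
not the actual passt-repair implementation.

/* Sketch of one iteration of the exchange: read one command byte and
 * the SCM_RIGHTS batch, set TCP_REPAIR on every received socket, then
 * echo the byte back (with no ancillary data) to signal success. */
#include <stdint.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#define SCM_MAX_FD	253	/* From Linux kernel (include/net/scm.h) */

#ifndef TCP_REPAIR
#define TCP_REPAIR	19	/* From Linux UAPI, if the libc header lacks it */
#endif

static int handle_batch(int s)
{
	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	struct iovec iov;
	int *fds, n, i;
	int8_t cmd;

	iov = (struct iovec){ &cmd, sizeof(cmd) };
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = buf;
	msg.msg_controllen = sizeof(buf);

	if (recvmsg(s, &msg, 0) != sizeof(cmd))
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;

	fds = (int *)CMSG_DATA(cmsg);
	n = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);

	for (i = 0; i < n; i++) {
		int op = cmd;	/* TCP_REPAIR_ON, _OFF or _OFF_NO_WP */

		if (setsockopt(fds[i], IPPROTO_TCP, TCP_REPAIR,
			       &op, sizeof(op)))
			return -1;
	}

	/* Echo the command byte to signal success */
	return write(s, &cmd, sizeof(cmd)) == sizeof(cmd) ? 0 : -1;
}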