On Tue, Jan 28, 2025 at 07:51:31AM +0100, Stefano Brivio wrote:
On Tue, 28 Jan 2025 12:51:59 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:
On Tue, Jan 28, 2025 at 12:15:32AM +0100, Stefano
Brivio wrote:
A privileged helper to set/clear TCP_REPAIR on
sockets on behalf of
passt. Not used yet.
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
Makefile | 10 +++--
passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 118 insertions(+), 3 deletions(-)
create mode 100644 passt-repair.c
diff --git a/Makefile b/Makefile
index 1383875..1b71cb0 100644
--- a/Makefile
+++ b/Makefile
@@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c
fwd.c \
tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
vhost_user.c virtio.c vu_common.c
QRAP_SRCS = qrap.c
-SRCS = $(PASST_SRCS) $(QRAP_SRCS)
+PASST_REPAIR_SRCS = passt-repair.c
+SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
MANPAGES = passt.1 pasta.1 qrap.1
@@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man
man1dir ?= $(mandir)/man1
ifeq ($(TARGET_ARCH),x86_64)
-BIN := passt passt.avx2 pasta pasta.avx2 qrap
+BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair
else
-BIN := passt pasta qrap
+BIN := passt pasta qrap passt-repair
endif
all: $(BIN) $(MANPAGES) docs
@@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
qrap: $(QRAP_SRCS) passt.h
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS)
-o qrap $(LDFLAGS)
+passt-repair: $(PASST_REPAIR_SRCS)
+ $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS)
+
valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \
rt_sigreturn getpid gettid kill clock_gettime mmap \
mmap2 munmap open unlink gettimeofday futex statx \
diff --git a/passt-repair.c b/passt-repair.c
new file mode 100644
index 0000000..e9b9609
--- /dev/null
+++ b/passt-repair.c
@@ -0,0 +1,111 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ * for qemu/UNIX domain socket mode
+ *
+ * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio(a)redhat.com>
+ *
+ * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along
+ * with commands mapping to TCP_REPAIR values, and switch repair mode on or
+ * off. Reply by echoing the command. Exit if the command is INT_MAX.
+ */
+
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <limits.h>
+#include <unistd.h>
+#include <netdb.h>
+
+#include <netinet/tcp.h>
+
+#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
+
+int main(int argc, char **argv)
+{
+ char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
+ __attribute__ ((aligned(__alignof__(struct cmsghdr))));
+ struct sockaddr_un a = { AF_UNIX, "" };
+ int cmd, fds[SCM_MAX_FD], s, ret, i;
+ struct cmsghdr *cmsg;
+ struct msghdr msg;
+ struct iovec iov;
+
+ iov = (struct iovec){ &cmd, sizeof(cmd) };
I mean, local to local, it's *probably* fine, but still a network
protocol not defined in terms of explicit width fields makes me
nervous. I'd prefer to see the cmd being a packed structure with
fixed width elements.
It actually is, because:
struct {
int cmd;
};
is a packet structure with fixed width elements. Any architecture we
build for (at least the ones I'm aware of) has a 32-bit int. We can
make it uint32_t if it makes you feel better.
Sorry, I should have said "*explicitly* fixed width fields". So, yes,
uint32_t would make me feel better :)
I also think
we should do some sort of basic magic / version exchange.
I don't see any reason we'd need to extend the protocol, but I'd
rather have the option if we have to.
passt-repair will be packaged and distributed together with passt,
though. Versions must match.
But nothing enforces that. AIUI KubeVirt will be running passt-repair
in a different context. Which means it may well be deployed by a
different path than the passt binary, which means however we
distribute it's quite plausible that a downstream screwup could
mismatch the versions. We should endeavour to have a reasonably
graceful failure mode for that.
And latency here might matter more than in
the rest of the migration process.
Plus checking
a magic number
should make things less damaging and more debuggable if you were to
point the repair helper at an entirely unrelated unix socket instead
of passt's repair socket.
Maybe, yes, even though I don't really see good chances for that
mistake to happen. Feel free to post a proposal, of course.
I disagree on the good chances for a mistake thing. In GSS I saw
plenty of occasions where things that shouldn't be mismatched were due
to some packaging or user screwup. And that's before even considering
the way that KubeVirt deploys its various pieces seems to provide a
number of opportunities to mess this up.
So, I'll see what I can come up with. I'm fine with requiring
matching versions if it's actually checked. Maybe a magic derived
from our git hash, or even our build-id.
+ msg = (struct msghdr){ NULL, 0, &iov, 1, buf,
sizeof(buf), 0 };
+ cmsg = CMSG_FIRSTHDR(&msg);
+
+ if (argc != 2) {
+ fprintf(stderr, "Usage: %s PATH\n", argv[0]);
+ return -1;
+ }
+
+ ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]);
+ if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) {
+ fprintf(stderr, "Invalid socket path: %s\n", argv[1]);
+ return -1;
+ }
+
+ if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
Hmm.. would a datagram socket better suit our needs here?
We need a connection though, so that passt knows when the helper is
ready to get messages. It could be done with a synchronisation datagram
but it looks more complicated to handle.
Good point.
By the way, with a connection, we could probably just
close() the
socket here instead of having a "quit" command.
True.
If you're referring to the fact we don't
keep message boundaries, so we
would in theory need to add short read handling to the recvmsg() below:
I'd rather switch cmd to a single byte instead. You can't transfer less
than that.
I was thinking that preserving message boundaries might allow
extending the command format more easily, but you've convinced me it's
not worth the trouble.
+ perror("Failed to create AF_UNIX
socket");
+ return -1;
+ }
+
+ if (connect(s, (struct sockaddr *)&a, sizeof(a))) {
+ fprintf(stderr, "Failed to connect to %s: %s\n", argv[1],
+ strerror(errno));
+ return -1;
+ }
+
+ while (1) {
+ int n;
+
+ if (recvmsg(s, &msg, 0) < 0) {
+ perror("Failed to receive message");
+ return -1;
+ }
+
+ if (!cmsg ||
+ cmsg->cmsg_len < CMSG_LEN(sizeof(int)) ||
+ cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) ||
+ cmsg->cmsg_type != SCM_RIGHTS)
+ return -1;
+
+ n = cmsg->cmsg_len / CMSG_LEN(sizeof(int));
+ memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n);
+
+ switch (cmd) {
+ case INT_MAX:
+ return 0;
+ case TCP_REPAIR_ON:
+ case TCP_REPAIR_OFF:
+ case TCP_REPAIR_OFF_NO_WP:
+ for (i = 0; i < n; i++) {
+ if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR,
+ &cmd, sizeof(int))) {
+ perror("Setting TCP_REPAIR");
+ return -1;
We probably want this to report errors back to passt, rather than just
dying in this case. That way if for some weird reason one socket
can't be placed in repair mode, we can still migrate all the other
connections.
We implicitly report the error in the sense that we close the
connection and passt will abort the migration. If you look at the
handling of TCP_REPAIR in do_tcp_setsockopt(), you'll see that it
either always fails (EPERM), or always succeeds.
Ah, right. That's probably good enough for now.
I mean, it's straightforward to implement, and we
can just reply with a
different command. But it's probably more meaningful and fitting to
abort altogether.
Right, best effort maintenance of connections can be a later feature,
if anyone wants it.
Besides, if we have to report exactly on which socket
we failed, we
won't be able to switch to a single-byte command protocol.
> + }
> + }
> +
> + /* Confirm setting by echoing the command back */
> + if (send(s, &cmd, sizeof(int), 0) < 0) {
> + fprintf(stderr, "Reply to command %i: %s\n",
> + cmd, strerror(errno));
> + return -1;
> + }
> +
> + break;
> + default:
> + fprintf(stderr, "Unsupported command 0x%04x\n", cmd);
> + return -1;
> + }
> + }
> +}
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson