On Fri, Mar 07, 2025 at 07:14:45AM +0100, Stefano Brivio wrote:
On Fri, 7 Mar 2025 13:59:08 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:
On Thu, Mar 06, 2025 at 08:00:51PM +0100, Stefano Brivio wrote:
...and time out after that. This will be needed because of an upcoming
change to passt-repair enabling it to start before passt is started,
on both source and target, by means of an inotify watch.
Once the inotify watch triggers, passt-repair will connect right away,
but we have no guarantees that the connection completes before we
start the migration process, so wait for it (for a reasonable amount
of time).
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
Hmm. So I'm pretty confident this patch does what it sets out to do.
I'm increasingly dubious about this direction.
---
I'm still working on the passt-repair change mentioned above. It's
turning out to be a bit trickier than expected because to support
typical libvirt migration usage, on the target, we also need to be
able to wait for creation of a directory and, separately, for a socket
file being added inside it.
Oof. This significantly amplifies my concerns about the security
model: one of the most obvious ways to bound who could potentially
convince passt-repair to connect is to make sure the directory exists
beforehand: if you don't have permission to create the socket in
that directory, passt-repair won't connect.
If we're also waiting for directories to appear, it makes this even
harder to analyse in terms of whether someone who's not supposed to
could convince passt-repair to connect to them.
I don't think your concern is particularly well-founded, because in the
general case we have LSM policies preventing other processes from creating that
file in that directory, and in KubeVirt's case we simply have no other
processes that can access the pod mount namespace and do that.
I mean, LSM properties should generally be an additional layer of
protection; things should be basically okay with the vanilla Unix
permission model.
But that part of the change is not strictly necessary,
surely not for a
proper libvirt integration (libvirt knows the path) and also not for
the integration for KubeVirt (KubeVirt can build the path).
So, to dodge the obstructionism here, I'll happily drop that part. It
also saves me time.
Given the complexity you're hitting here,
I'm really starting to think
maybe we should reverse our model: have passt connect to passt-repair.
The security model is easier to understand: permission to the socket
and its directory is permission to use TCP_REPAIR. There's no awkward
waiting on either side - I think it might actually be less total code
churn.
Of course it's a bit less code, but it turns a potential concern into
the actually expected situation: you have something permanently around
that allows any process (because you have no practical way to restrict
A (semi-)permanently running daemon was not what I had in mind.
Instead, I was thinking that you start passt-repair shortly before the
migration (as we already planned), it does exactly one accept(), then
terminates once its single client is done.
it, not in distributions and not in KubeVirt) to
set/clear repair mode
on sockets. Unless you start introducing authentication, that is.
Well, the plan was that authentication would be via the Unix permissions
on the socket.
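Concretely, that gating works because connect() on a Unix socket needs write permission on the socket inode plus search permission on every directory leading to it. A minimal sketch of a listener restricted that way (function name and modes are illustrative, not from either codebase):

```c
/*
 * Sketch: restrict who can connect() by placing the socket in a
 * mode-0700 directory and giving the socket itself mode 0600.
 * connect() needs write permission on the socket and search
 * permission on the directory, so only the owner gets through.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

int restricted_listener(const char *dir, const char *name)
{
	struct sockaddr_un sa = { .sun_family = AF_UNIX };
	int s;

	/* Search permission on the directory gates connect() attempts */
	if (mkdir(dir, 0700) && errno != EEXIST)
		return -1;

	if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0)
		return -1;

	snprintf(sa.sun_path, sizeof(sa.sun_path), "%s/%s", dir, name);
	unlink(sa.sun_path);

	if (bind(s, (struct sockaddr *)&sa, sizeof(sa)) || listen(s, 1) ||
	    chmod(sa.sun_path, 0600)) {
		close(s);
		return -1;
	}

	return s;
}
```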
In the current model it's still a one-off
operation, even with an
additional wait. You start passt-repair once to do its job, and once it
does the job, it terminates.
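The one-off flow both of us are describing (start, serve exactly one client, exit) is small either way. A hedged sketch of the reversed model's helper side, with made-up names, just to make the shape concrete:

```c
/*
 * Illustrative only: a one-shot helper that listens, serves exactly
 * one client, then cleans up and is done.  The path and the function
 * name are invented for this sketch.
 */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int serve_one(const char *path)
{
	struct sockaddr_un sa = { .sun_family = AF_UNIX };
	int s, conn;

	if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0)
		return -1;

	strncpy(sa.sun_path, path, sizeof(sa.sun_path) - 1);
	unlink(path);
	if (bind(s, (struct sockaddr *)&sa, sizeof(sa)) || listen(s, 1)) {
		close(s);
		return -1;
	}

	/* Exactly one accept(): serve this client, then terminate */
	if ((conn = accept(s, NULL, NULL)) < 0) {
		close(s);
		return -1;
	}

	/* ... exchange TCP_REPAIR requests with the single client ... */

	close(conn);
	close(s);
	unlink(path);
	return 0;
}
```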
Right, also what I was intending... but of course, I realised a few
hours later that I'd forgotten that our filesystem isolation means we
can't connect() to a Unix socket in the wider world at migration time.
There are ways to work around that, of course, but I have yet to spot
one that seems better than the status quo. So,
Reviewed-by: David Gibson <david(a)gibson.dropbear.id.au>
I guess.
Judging by the early queries from the KubeVirt guys, it seems to be the
more obvious model people would expect.
Note that you talked to people taking care of (almost exclusively)
networking there. Knowing a fair bit of the code, architecture, and
history of security weaknesses and discussions around them myself, I'm
pretty sure you can't do the opposite.
The long-term solutions are anyway having libvirt provide a phase or
blocking event where this can be done, and, at least in KubeVirt's
case, eventually handle this with seitan.
Right. Probably not really valuable while it still needs
CAP_NET_ADMIN, but if we do manage to change that to CAP_NET_RAW, it
would make sense for passt to only use the helper if it can't do the
TCP_REPAIR itself. That makes setup simpler for environments that
don't have counterproductive "security" policies.
On the other
hand,
configuring the socket path is still an issue, and being able to tell
passt where to connect to through libvirt might become more important
than it is now.
However, this part looks solid and it should do
what we need. I
tested it by starting passt-repair after sleep(1) in migrate/* tests,
changing this timeout to a minute, on either source, target, or both.
flow.c | 20 ++++++++++++++++++++
repair.c | 31 +++++++++++++++++++++++++++++++
repair.h | 1 +
3 files changed, 52 insertions(+)
diff --git a/flow.c b/flow.c
index 749c498..5e64b79 100644
--- a/flow.c
+++ b/flow.c
@@ -911,6 +911,21 @@ static int flow_migrate_source_rollback(struct ctx *c, unsigned bound, int ret)
return ret;
}
+/**
+ * flow_migrate_need_repair() - Do we need to set repair mode for any flow?
+ *
+ * Return: true if repair mode is needed, false otherwise
+ */
+static bool flow_migrate_need_repair(void)
+{
+ union flow *flow;
+
+ foreach_established_tcp_flow(flow)
+ return true;
+
+ return false;
+}
+
/**
* flow_migrate_repair_all() - Turn repair mode on or off for all flows
* @c: Execution context
@@ -966,6 +981,9 @@ int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
(void)stage;
(void)fd;
+ if (flow_migrate_need_repair())
+ repair_wait(c);
+
if ((rc = flow_migrate_repair_all(c, true)))
return -rc;
@@ -1083,6 +1101,8 @@ int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
if (!count)
return 0;
+ repair_wait(c);
+
if ((rc = flow_migrate_repair_all(c, true)))
return -rc;
diff --git a/repair.c b/repair.c
index 3ee089f..ce5a4a4 100644
--- a/repair.c
+++ b/repair.c
@@ -27,6 +27,11 @@
#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
+/* Wait for a while for TCP_REPAIR helper to connect if it's not there yet */
+#define REPAIR_ACCEPT_TIMEOUT_MS 100
100ms is a pretty long time in the context of a stopped guest. I
guess it is a worst case.
It's actually not a case I can reproduce at all, as expected:
passt-repair connects quickly enough once the socket is available.
Good to know. It does mean we're adding 100ms of downtime for the
fallback case where there is no passt-repair, of course.
We could
possibly mitigate it by doing the
Mitigate what exactly? The expected mode of operation is with
passt-repair available. The failure handling is a fallback. I can turn
it into 1 ms, it just seems to increase risk for no good reason.
wait when memory change logging goes on (so migration is starting, but
the guest is not yet stopped). However, at that point we can't rely on
flow_migrate_need_repair(), because we might have no flows yet, while
one could still be opened before the guest is stopped.
Right, that complicates things quite a lot.
+#define REPAIR_ACCEPT_TIMEOUT_US (REPAIR_ACCEPT_TIMEOUT_MS * 1000)
+static_assert(REPAIR_ACCEPT_TIMEOUT_US < 1000 * 1000);
In this context the static_assert() seems like a pretty arbitrary
bound. To make the reason for it clearer, I'd suggest moving it to
where we construct the timeval.
Oh, I didn't think of that. Sure, I'll move it.
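Something like the following, perhaps (function name invented for the sketch; the constants mirror the patch): the assertion sits right next to the struct timeval it justifies, so the sub-second bound explains itself.

```c
/*
 * Sketch of keeping the assertion next to the timeval it justifies.
 * Constants mirror the patch; the helper name is not the final code.
 */
#include <assert.h>
#include <sys/time.h>

#define REPAIR_ACCEPT_TIMEOUT_MS	100
#define REPAIR_ACCEPT_TIMEOUT_US	(REPAIR_ACCEPT_TIMEOUT_MS * 1000)

static struct timeval repair_accept_tv(void)
{
	/* tv_usec must stay below one second, or the remainder would
	 * belong in tv_sec */
	static_assert(REPAIR_ACCEPT_TIMEOUT_US < 1000 * 1000,
		      "REPAIR_ACCEPT_TIMEOUT_US must fit in tv_usec");

	return (struct timeval){ .tv_sec = 0,
				 .tv_usec = REPAIR_ACCEPT_TIMEOUT_US };
}
```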
> +
> /* Pending file descriptors for next repair_flush() call, or command change */
> static int repair_fds[SCM_MAX_FD];
>
> @@ -138,6 +143,32 @@ void repair_handler(struct ctx *c, uint32_t events)
> repair_close(c);
> }
>
> +/**
> + * repair_wait() - Wait (with timeout) for TCP_REPAIR helper to connect
> + * @c: Execution context
> + */
> +void repair_wait(struct ctx *c)
> +{
> + struct timeval tv = { .tv_sec = 0,
> + .tv_usec = (long)(REPAIR_ACCEPT_TIMEOUT_US) };
> +
> + if (c->fd_repair >= 0 || c->fd_repair_listen == -1)
> + return;
> +
> + if (setsockopt(c->fd_repair_listen, SOL_SOCKET, SO_RCVTIMEO,
> + &tv, sizeof(tv))) {
> + err_perror("Set timeout on TCP_REPAIR listening socket");
> + return;
> + }
> +
> + repair_listen_handler(c, EPOLLIN);
> +
> + tv.tv_usec = 0;
> + if (setsockopt(c->fd_repair_listen, SOL_SOCKET, SO_RCVTIMEO,
> + &tv, sizeof(tv)))
> + err_perror("Clear timeout on TCP_REPAIR listening socket");
> +}
> +
> /**
> * repair_flush() - Flush current set of sockets to helper, with current command
> * @c: Execution context
> diff --git a/repair.h b/repair.h
> index de279d6..1d37922 100644
> --- a/repair.h
> +++ b/repair.h
> @@ -10,6 +10,7 @@ void repair_sock_init(const struct ctx *c);
> void repair_listen_handler(struct ctx *c, uint32_t events);
> void repair_handler(struct ctx *c, uint32_t events);
> void repair_close(struct ctx *c);
> +void repair_wait(struct ctx *c);
> int repair_flush(struct ctx *c);
> int repair_set(struct ctx *c, int s, int cmd);
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson