On Fri, Feb 21, 2025 at 06:59:12AM +0100, Stefano Brivio wrote:On Fri, 21 Feb 2025 13:40:12 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:Well.. probably. I suspect something sufficiently insane in an LSM could break that rule.On Thu, Feb 20, 2025 at 11:38:00AM +0100, Stefano Brivio wrote:By the way, after having another look at the kernel interface and implementation: this makes no sense. Either we're able to set repair mode for all the sockets, or for none of them (EPERM).On Thu, 20 Feb 2025 21:18:06 +1100 David Gibson <david(a)gibson.dropbear.id.au> wrote:It may have just been on IRC somewhere.This sort of thing is, incidentally, why I did way back suggest the possibility of passt-repair reporting failures per-fd, rather than just per-batch.Sorry, I somehow missed that proposal, and I can't find any trace of it.But anyway, the problem is that if we fail to read a batch for any reason (invalid ancillary data... maybe always implying a kernel issue, but I'm not sure), you can't _reliably_ report per-fd failures. *Usually*, you can. Worth it?Ah, I see. We could handle that by being able to report both per-fd and "whole batch" failure (equivalent to failure on every fd), but that would complexify the protocol, of course.And if there's any invalid file descriptor in the set, we'll get EBADF for the whole sendmsg(). The stuff I'm proposing below is well beyond my threshold of things that make no sense to implement, but at least it limits damage in terms of complexity (and hence of potential impact on the actual usage, because that's what we're talking about here: a die() that makes no sense but now proves to be actually harmful). I wouldn't go any further than that.Again, it shouldn't happen in practice, but you will get a mess if you ever managed to set repair mode, but failed to clear it. There's a similar question in the other direction, if passt is trying to REPAIR_OFF, and we fail part way through. Should passt-repair REPAIR_ON again? I'm not sure it has the information to make a sane choice here.Not for what I'm proposing: 1. passt sends 1 (TCP_REPAIR_ON) for sockets 2, 3 2. passt-repair sets repair mode for 2 3. passt-repair fails to set repair mode for 3 4. passt-repair clears repair mode for 2In any case, if it's simple, we can still do it, because passt and passt-repair are distributed together. You can't pass back the file descriptors via SCM_RIGHTS though, because we want to close() them before we reply. Another alternative could be that passt-repair reverts back the state of the file descriptors that were already switched, on failure.That might help a bit, we'd still need to rework the passt-side interface to know what needs reverting at the right stage.5. passt-repair closes the connection to signal failure 6. passt knows that repair mode is *not* set for any socket in the batch The interface remains the same (per-batch error only), but you can rely on the whole batch to have failed. You already can, even with no changes in passt-repair (see above), by the way.Well, I just finished implementing a simple way of reporting partial failures. I guess see what you think of it. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson