On Thu, 14 Aug 2025 07:12:55 +0200
Stefano Brivio
On Thu, 14 Aug 2025 14:10:20 +1000 David Gibson
wrote: On Wed, Aug 13, 2025 at 06:45:05PM +0200, Stefano Brivio wrote:
I didn't imagine that occasionally truncated pcap and log files, as a result of commit a9d63f91a59a ("passt-repair: use _exit() over return"), would be such a big deal, until I tried to debug TCP issues with this beauty:
I think the problems are more introduced by the previous patch d0006fa784a7 ("treewide: use _exit() over exit()").
Oops, right.
while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 OPEN:/tmp/large.rcv,trunc & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done
...flush files and pcap if we're using them. Ignore fsync() errors for the log file as we obviously can't reliably log them.
Signed-off-by: Stefano Brivio
Hmmmmm.
I mean, yes, AFAICT this patch will address the immediate problem. But between this and 081df67d1fb2 ("conf: flush stdout before early exit") it seems more and more to me that _exit() just isn't what we want. Basically the assertion in d0006fa784a7 that "no exit handlers are needed" doesn't really seem to be true.
It was a wrong assumption but just for a couple of cases, and mind that exit() doesn't give you any guarantee anyway. While glibc might guarantee those flushes, we're not just building against glibc.
So, strictly speaking, at least for correctness, we should actually keep those fflush() calls plus the ones I'm adding here, even with exit().
Ah, wait, sorry, I just double checked and starting from Issue 7 of POSIX (referring to _exit() and _Exit()): Austin Group Interpretation 1003.1-2001 #031 is applied, separating these functions from the exit() function. and: The exit() function shall then flush all open streams with unwritten buffered data and close all open streams. ...still, the rest of my reasoning applies.
Here we're adding a new syscall to work around the problems with _exit(). In which case, why don't we add futex() to the syscall list and go back to exit(3).
Because futex() just came up unexpectedly and Paul and myself had to spend hours figuring that out, and there are good chances we'll get something else like that from glibc in the future.
On top of that, see CVE-2014-3153 and CVE-2020-14381 about futex(). From a quick glance (and intuitively) fsync() is much simpler than that.
With Laurent working on multi-threading we might well want futexes anyhow.
True, but then I'd still prefer to allow futex() explicitly, rather than re-enabling exit handlers, because that's more predictable.
-- Stefano