On Mon, 7 Nov 2022 10:51:11 +0100
Stefano Brivio
On Mon, 7 Nov 2022 14:17:44 +1100, David Gibson wrote:
On Fri, Nov 04, 2022 at 02:53:28AM +0100, Stefano Brivio wrote:
This happens about every third time on the two_guests/basic test, and on that test only: we clone() twice, first to spawn a child, then to spawn a thread to check that we can enter the target network namespace.
In this thread, we open a file descriptor associated with the target namespace. It might happen that it doesn't exist yet: the kernel can legitimately take its time to create one after clone(). In this case, at least on a 5.15 Linux kernel, trying to open that file again always yields EACCES, and we get stuck there.
This only occurs if we spawn two instances of pasta very close together, as is done in the two_guests/basic case.
I haven't figured out what the race condition is yet, and in particular whether it's a kernel issue or something we're doing wrong. However, if we wait until the execvp() in the child is done, the issue disappears. I'm not sure yet whether this is just a matter of timing, hiding an unrelated race condition.
The workaround consists of checking /proc/PID/exe against our own. If it's different, that means execvp() already completed and we can proceed. It's rather ugly, but much better than the alternative. Leave a FIXME there for the time being.
Signed-off-by: Stefano Brivio
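For illustration only, the /proc/PID/exe comparison described above could look roughly like the sketch below. This is not the code from the patch: the helper names (exe_path(), child_exec_done()), the buffer size and the return convention are assumptions made up for this example. It relies only on readlink() on /proc/self/exe and /proc/PID/exe, which are symbolic links to the running executable on Linux.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define EXE_PATH_MAX 4096	/* assumed buffer size for this sketch */

/* Hypothetical helper: resolve a /proc/.../exe link into buf.
 * Returns 0 on success, -1 on failure. */
static int exe_path(const char *proc_entry, char *buf, size_t size)
{
	ssize_t n = readlink(proc_entry, buf, size - 1);

	if (n < 0)
		return -1;

	buf[n] = '\0';		/* readlink() doesn't terminate the string */
	return 0;
}

/* Hypothetical helper: has the child replaced our image with execvp()?
 * Returns 1 if its executable differs from ours, 0 if it's still the
 * same, -1 if we can't tell (e.g. the /proc entry isn't readable). */
static int child_exec_done(pid_t pid)
{
	char self[EXE_PATH_MAX], child[EXE_PATH_MAX], entry[64];

	snprintf(entry, sizeof(entry), "/proc/%ld/exe", (long)pid);

	if (exe_path("/proc/self/exe", self, sizeof(self)) ||
	    exe_path(entry, child, sizeof(child)))
		return -1;

	return strcmp(self, child) != 0;
}
```

A caller would presumably poll child_exec_done() on the child's PID until it returns 1 before trying to open the namespace file. Note that the comparison can't tell anything if the child executes the same binary as the parent.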
Weird and ugly, but seems like we need it.
Reviewed-by: David Gibson
Sorry, I forgot to mention: I didn't actually push this, because while repeating that test over and over again I hit the same failure at some point, even with this trick.
So this is probably just affecting the timing. I need to look into it again.
Tracked at https://bugs.passt.top/show_bug.cgi?id=36 for the time being.

--
Stefano