On Thu, 5 Sep 2024 10:35:14 +1000
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Wed, Sep 04, 2024 at 07:19:22PM +0200, Stefano Brivio wrote:
> > On Wed, 4 Sep 2024 13:17:53 +1000
> > David Gibson <david(a)gibson.dropbear.id.au> wrote:
> >
> > > On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:
> > > > On Tue, 3 Sep 2024 22:02:29 +1000
> > > > David Gibson <david(a)gibson.dropbear.id.au> wrote:
> > > >
> > > > > This is a draft patch working towards adding EPOLLOUT handling to the
> > > > > tap code, which could then be used to "unstick" flows which have
> > > > > unsent data from the socket side. For now that's just a stub, but
> > > > > makes what I think are some worthwhile cleanups to the tap side event
> > > > > handling in the meantime.
> > > >
> > > > Except for the issue in 3/6 and nits elsewhere, it all makes sense
> > > > and tap-side EPOLLOUT handling is definitely going to be an
> > > > improvement.
> > > >
> > > > I wonder if it's the right moment for this kind of series, though,
> > > > in terms of future bisections, as long as we're grappling with
> > > > https://github.com/containers/podman/issues/23686 and
> > > > https://bugs.passt.top/show_bug.cgi?id=94.
> > > >
> > > > Assuming, of course, that this series doesn't fix anything.
> > >
> > > I don't think this series will fix anything as it stands. It is,
> > > indirectly, aimed at addressing bug 94. I'm struggling to figure out
> > > what to do with bug 94, because I find it almost impossible to reason
> > > about the current event masks in TCP.
> > >
> > > I'd really like to simplify them so it's clearer what's correct and
> > > not and I think the most obvious path to doing so is using EPOLLET
> > > all the time. That requires some sort of kick when the tap is ready
> > > to accept more data, hence this series as a prerequisite.
> >
> > Sure, it's going to be simpler and more robust, but on the other hand
> > we wouldn't notice these kinds of issues.
>
> Uh.. I'm confused. In what way would we not notice issues, other than
> the issues not existing which.. would be good, right?

Right now, we have a condition where we fail to handle EPOLLRDHUP before
an outbound connection is established, see
https://github.com/containers/podman/issues/23686#issuecomment-233023742,
and we end up in a tight event processing loop.

I guess what we're missing in tcp_sock_handler() is clear (we should
close the connection in an orderly way), but the tight loop didn't happen
on 2024_06_24.1ee2eca (I'm bisecting right now) and we don't know why it
didn't.

If we set EPOLLET, we won't see that anymore, because the EPOLLRDHUP
event is reported just once, but that doesn't mean we solved this.

> > I don't see at the moment anything indicating TCP issues other than
> > the one you addressed with your tentative debug patch at:
> >
> >   https://passt.top/passt/commit/?h=podman23686&id=026fb71d1dde60135d9574…
> >
> > Given that, with that patch, we had at least another report of event
> > storms, this time on UDP, that is, the one from:
> >
> >   https://github.com/containers/podman/issues/23686#issuecomment-2324945010
> >
> > I shared this other one on top:
> >
> >   https://passt.top/passt/commit/?h=podman23686&id=0c6c20dee5c24bd324834a…
>
> Ah, nice.

-- 
Stefano
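
To illustrate the point about EPOLLRDHUP being reported just once with EPOLLET, here is a minimal, Linux-only sketch. It is not passt code: count_rdhup_reports() is a hypothetical helper, and an AF_UNIX socketpair() stands in for the TCP socket, on the assumption that a peer half-close raises EPOLLRDHUP in the same way there. The helper never handles the event, which loosely mimics the missing handling described above, and counts how often epoll_wait() reports it.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not part of passt.
 *
 * Registers one end of a socket pair for EPOLLIN | EPOLLRDHUP | extra,
 * half-closes the peer, then calls epoll_wait() 'rounds' times without
 * ever handling the event. Returns how many calls reported EPOLLRDHUP,
 * or -1 on setup failure.
 *
 * extra == 0:       level-triggered, the unhandled event keeps coming back
 * extra == EPOLLET: edge-triggered, the event is delivered a single time
 */
int count_rdhup_reports(uint32_t extra, int rounds)
{
	struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP | extra };
	struct epoll_event out;
	int sv[2], epfd, i, n = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return -1;

	epfd = epoll_create1(0);
	if (epfd < 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}

	ev.data.fd = sv[0];
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev)) {
		n = -1;
		goto out;
	}

	/* Peer shuts down its writing side: sv[0] now sees EPOLLRDHUP */
	shutdown(sv[1], SHUT_WR);

	for (i = 0; i < rounds; i++) {
		/* Timeout 0: just poll, the event is already pending */
		if (epoll_wait(epfd, &out, 1, 0) == 1 &&
		    (out.events & EPOLLRDHUP))
			n++;
	}

out:
	close(epfd);
	close(sv[0]);
	close(sv[1]);
	return n;
}
```

With extra set to 0, every round reports the event, which is the tight loop. With EPOLLET, only the first round does, so the loop disappears even though nothing ever closed the connection.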