On Wed, 25 Jan 2023 14:13:44 +1100
David Gibson <david(a)gibson.dropbear.id.au> wrote:

> On Tue, Jan 24, 2023 at 10:20:43PM +0100, Stefano Brivio wrote:
> > On Fri, 6 Jan 2023 11:43:04 +1100
> > David Gibson <david(a)gibson.dropbear.id.au> wrote:
> >
> > > Although we have an abstraction for the "slow path" (DHCP, NDP)
> > > guest bound packets, the TCP and UDP forwarding paths write
> > > directly to the tap fd. However, it turns out how they send
> > > frames to the tap device is more similar than it originally
> > > appears.
> > >
> > > This series unifies the low-level tap send functions for TCP and
> > > UDP, and makes some clean ups along the way.
> > >
> > > This is based on my earlier outstanding series.
> >
> > For some reason, performance tests consistently get stuck (both TCP
> > and UDP, sometimes throughput, sometimes latency tests) with this
> > series, and not without it, but I don't see any possible
> > relationship with that.
>
> Drat, I didn't encounter that. Any chance you could bisect to figure
> out which patch specifically seems to trigger it?

I couldn't do it conclusively, yet. :/

Before "tcp: Combine two parts of passt tap send path together", no
stalls at all.

After that, I'm routinely getting a stall on the perf/passt_udp test,
IPv4 host-to-guest with 256B MTU. I know, that test is probably
meaningless as a performance figure, but it helps find issues like
this, at least. :)

Yes, UDP -- the iperf3 client doesn't connect to the server, passt
doesn't crash, but it's gone (zombie) by the time I get to it. I think
it's the test scripts terminating it (even though I don't see anything
on the terminal), and script.log ends with:

2023/01/25 21:27:14 socat[3432381] E connect(5, AF=40 cid:94557 port:22, 16): Connection reset by peer
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
ssh-keygen: generating new host keys: RSA 2023/01/25 21:27:14 socat[3432390] E connect(5, AF=40 cid:94557 port:22, 16): Connection reset by peer
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
2023/01/25 21:27:14 socat[3432393] E connect(5, AF=40 cid:94557 port:22, 16): Connection reset by peer
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
2023/01/25 21:27:14 socat[3432396] E connect(5, AF=40 cid:94557 port:22, 16): Connection reset by peer
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
2023/01/25 21:27:14 socat[3432399] E connect(5, AF=40 cid:94557 port:22, 16): Connection reset by peer
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
DSA ECDSA ED25519
# Warning: Permanently added 'guest' (ED25519) to the list of known hosts.

which looks like fairly normal retries.

If I run the tests with DEBUG=1, they get stuck during UDP functional
testing, so I'm leaving that aside for a moment.

If I apply the whole series, other tests get stuck (including TCP
ones). There might be something going wrong with iperf3's (TCP) control
message exchange. I'm going to run this single test next, and add some
debugging prints here and there.

> I wonder if this could be related to the stalls I'm debugging,
> although those didn't appear on the perf tests and also occur on
> main. I have now discovered they seem to be masked by large socket
> buffer sizes - more info at https://bugs.passt.top/show_bug.cgi?id=41

Maybe the subsequent failures (or even this one) could actually be
related, and triggered somehow by some change in timing. I'm still
clueless at the moment.

-- 
Stefano