On Sat, 17 Sep 2022 20:19:21 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:On Sat, Sep 17, 2022 at 10:44:41AM +0200, Stefano Brivio wrote:Okay, thanks.On Sat, 17 Sep 2022 13:32:45 +1000 David Gibson <david(a)gibson.dropbear.id.au> wrote:Ah, right.On Sat, Sep 17, 2022 at 01:55:34AM +0200, Stefano Brivio wrote:Just mind POSIX only specifies integers as argument for sleep(1), a pattern I commonly use is: sleep 0.1 || sleep 1There are some 'sleep 1' commands between starting the socat server and its corresponding client to avoid races due to the server not being ready as we start sending data. However, those don't cover all the cases where we might need them, and in some cases the sleep command actually ended up being before the server even starts. This fixes occasional failures in TCP and UDP simple transfer tests, that became apparent with the new command dispatch mechanism. Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>Heh, I have a very similar patch in my upcoming series. I used sleep 0.1 though, to avoid taking so long, which seems to be sufficient in practice.Yeah, I suppose.I did look for a proper way to wait for the socat server to be ready, but I couldn't find anything workable.I guess we could enable logging and look for "starting accept loop", from _xioopen_listen() in xio-listen.c.Anyway, right now, I'm just trying to get the tests to complete with all your pending series, because it's been a while. This doesn't look too bad, we can try to improve it later.Yeah, fair enough. When I rebase I'll see if there's any refinements I can make relatively easily.I was watching the text execution, then as I saw the socat server process getting stuck, I just re-run the client with ssh -F ... with the original command, and that worked for me every time. So, the behaviour I'm inferring is: - server not listening, no retry option: connect() goes through, but anything else will get ECONNRESET (RST from passt), plus there should be EPOLLERR if socat checks that: client exits - server not listening, retry option: client exits without retry because it doesn't get ECONNREFUSED ...I guess a cleaner behaviour on passt's side would be to delay the accept() (that's in tcp_conn_from_sock()) until we get SYN, ACK from the guest. But: - we can't send a RST if we don't accept(), as we have no socket to close(). Maybe we could close and reopen the listening socket...? We need to be a bit careful about not turning that into a vector for DoS, though - we can't send ICMP/ICMPv6 messages ...so we risk a connect() timeout, which is even less desirable. In case the connection goes through, though... I actually tried in the past to wait before we accept(), and it was increasing latency (of course), so I discarded that approach, because I couldn't think of any practical case. But here we are, so perhaps this behaviour should be there, at least as an option (maybe even as default).I'm not entirely sure what you mean by manual retries. I was trying to use socat's retry option, but that only seems to fire on ECONNREFUSED.I thought I could retry on the client side, but that doesn't work (at least for host to guest connections) because passt is always listening on the forwarded ports, so the client won't see a connection refused from the actual server. I don't think there's any sane way to propagate connection refused in that way, although we could and probably should forward connection resets outwards.They should be already, and (manual) retries actually worked for me,I actually checked, it did for me, without SO_LINGER.but it looks like added complexity. Before even checking the connection state, tcp_tap_handler() checks for the RST flag: if (th->rst) { conn_event(c, conn, CLOSED); return p->count; } and this will abruptly close the originating socket, which implies a RST, on Linux.Right, I found that... but I'm not sure it is sending an RST. Setting SO_LINGER with zero timeout does send an RST, but I'm not sure it does so without it.Even if it does, it's not really helpful for this. I found that sending an RST doesn't reliably cause the client socat to exit with failure. I'm not 100% sure what's going on, but I think what can happen is that the client sends everything and completes before the RST arrives, because it all gets buffered before in passt (or at least in the kernel socket buffers associated with passt). In that case the client doesn't detect the error....right.Wow, okay, I had no idea.We don't send out ICMP or ICMPv6 messages, even if the guest additionally replied with one, because we can't ("ping" sockets are just for that -- echoes). For TCP, a simple RST is probably not as descriptive, but that's all we can do.Yeah, about 36 minutes to be precise, I think. There's a reason I haven't been attempting to run this until now.Btw, I've also been doing a bunch of work on the static checks - with some different options I've brought the runtime of make cppcheck from ~40 minutes down to ~40 seconds :).Whoa, 40 minutes? For me it was never more than a couple of minutes,Ah, yes, that might explain it.see https://passt.top/passt/about/#continuous-integration, "build/static_checkers". I guess it depends on system headers... but in any case that's currently taking way too long, also for myself. :)Huh, interesting. I wonder if it's because it simply isn't finding all the system headers for you. I see you have a missingIncludeSystem suppression in there, and when I removed that I found it complained about not finding the headers on Debian until I added -I/usr/include/x86_64-linux-gnu. Anyway, I have patches that fix this which I'll be sending soon. In any case, since the run time will be exponential in the number of config defines, it doesn't take a huge difference in the headers to make a vast difference to runtime.I also found that both clang-tidy and cppcheck fail for me, I have a bunch of fixes for this, but I'm not finished yet. There's a couple of tricky ones, including one with dueling errors - cppcheck says there's a redundant NULL check, but if I remove it clang-tidy says there's a NULL pointer dereference. Still working on it.I'm currently using Cppcheck 2.8 and LLVM 13.0.1, perhaps you have more recent versions, I'll update them as soon as I finally get the tests to go through. ;) Thanks in advance for fixing those. -- Stefano