On Thu, Apr 25, 2024 at 07:23:28PM -0400, Jon Maloy wrote:On 2024-04-23 20:44, David Gibson wrote:Ouch, that sounds bad. I'm pretty sure that means there's a bug on one side or the other.On Sat, Apr 20, 2024 at 03:19:19PM -0400, Jon Maloy wrote:[...]The kernel may support recvmsg(MSG_PEEK), starting reading data from aNot worth a respin on its own, but I think the comma above is misplaced, and for me makes the sentence much harder to read. > given offset set by the SO_PEEK_OFF socket option. This makes it > possible to avoid repeated reading of already read initial bytes of a > received message, hence saving read cycles when forwarding TCP messages > in the host->name space direction. > > In this commit, we add functionality to leverage this feature when available,In theory yes. I tried it for a while, using SEQ_GE(max_ack_seq, ack_seq) as criteria for retransmission. I observed some strange behavior, like retransmits that seemingly did not come from fast retransmit or timer retransmit, and that the kernel 'sk_peek_off' didn´t always have the expected value when comparing with 'already_sent´.@@ -2174,6 +2183,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) if (iov_rem) iov_sock[fill_bufs].iov_len = iov_rem; + if (peek_offset_cap) { + /* Don't use discard buffer */ + mh_sock.msg_iov = &iov_sock[1]; + mh_sock.msg_iovlen -= 1; + + /* Keep kernel sk_peek_off in synch */ + set_peek_offset(s, already_sent);I thought we didn't need to set SO_PEEK_OFF here - that it would track on its own, and we only needed to change it for retransmits. I don't think we even need to calculate 'already_sent' when we have SO_PEEK_OFF. In fact - if we set already_sent to 0 here, it might make things a bit cleaner than having to have special cases for adjusting the iov and sendlen.Since my focus was on the zero-window issue I decided to skip this for now and take the safe option. I may revisit this later.-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson> + } > + > /* Receive into buffers, don't dequeue until acknowledged by guest. */ > do > len = recvmsg(s, &mh_sock, MSG_PEEK); > @@ -2195,7 +2213,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) > return 0; > }[...]Made it a debug() as suggested by Stefano.+ peek_offset_cap = true; + } + close(s); + } + printf("SO_PEEK_OFF%ssupported\n", peek_offset_cap ? " " : " not ")Should be an info().