On Sat, 20 Apr 2024 15:19:19 -0400 Jon Maloy <jmaloy(a)redhat.com> wrote:The kernel may support recvmsg(MSG_PEEK), starting reading data from a given offset set by the SO_PEEK_OFF socket option.It would be practical to mention in commit message and in a code comment which kernel commit introduced the feature, that is, your commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option") -- even if it's on net-next, better than nothing (net-next might be rebased but it's quite rare). Also the (forecast, at this point) kernel version where this will be introduced would be nice to have.This makes it possible to avoid repeated reading of already read initial bytes of a received message, hence saving read cycles when forwarding TCP messages in the host->name space direction. In this commit, we add functionality to leverage this feature when available, while we fall back to the previous behavior when not. Measurements with iperf3 shows that throughput increases with 15-20 percent in the host->namespace direction when this feature is used. Signed-off-by: Jon Maloy <jmaloy(a)redhat.com> --- tcp.c | 37 +++++++++++++++++++++++++++++++++++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/tcp.c b/tcp.c index 905d26f..95d400a 100644 --- a/tcp.c +++ b/tcp.c @@ -505,6 +505,7 @@ static struct tcp_buf_seq_update tcp6_l2_buf_seq_update[TCP_FRAMES_MEM]; static unsigned int tcp6_l2_buf_used; /* recvmsg()/sendmsg() data for tap */ +static bool peek_offset_cap = false;No need to initialise, it's static. The comment just above referred to tcp_buf_discard and iov_sock ("data"), now you're adding a flag just under it, which is a bit confusing. I would rather leave the original comment for tcp_buf_discard and iov_sock, and add another one, say...:static char tcp_buf_discard [MAX_WINDOW]; static struct iovec iov_sock [TCP_FRAMES_MEM + 1];/* Does the kernel support TCP_PEEK_OFF? */ static bool peek_offset_cap;@@ -582,6 +583,14 @@ static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX, int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; +static void set_peek_offset(int s, int offset)A kerneldoc-style function comment would be nice, like all other functions in this file: /** * set_peek_offset() - Set SO_PEEK_OFF offset on a socket if supported * @s Socket * @offset Offset in bytes */+{ + if (!peek_offset_cap) + return; + if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset)))gcc ('make') says: In function ‘set_peek_offset’, inlined from ‘tcp_listen_handler’ at tcp.c:2819:2: tcp.c:592:13: warning: ‘s’ may be used uninitialized [-Wmaybe-uninitialized] 592 | if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset))) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ tcp.c: In function ‘tcp_listen_handler’: tcp.c:2815:13: note: ‘s’ was declared here 2815 | int s; | ^ clang-tidy ('make clang-tidy'): /home/sbrivio/passt/tcp.c:2819:2: error: 1st function call argument is an uninitialized value [clang-analyzer-core.CallAndMessage,-warnings-as-errors] set_peek_offset(s, 0); ^ ~ /home/sbrivio/passt/tcp.c:2815:2: note: 's' declared without an initial value int s; ^~~~~ /home/sbrivio/passt/tcp.c:2817:6: note: Assuming field 'no_tcp' is 0 if (c->no_tcp || !(flow = flow_alloc())) ^~~~~~~~~ /home/sbrivio/passt/tcp.c:2817:6: note: Left side of '||' is false /home/sbrivio/passt/tcp.c:2817:21: note: Assuming 'flow' is non-null if (c->no_tcp || !(flow = flow_alloc())) ^~~~ /home/sbrivio/passt/tcp.c:2817:2: note: Taking false branch if (c->no_tcp || !(flow = flow_alloc())) ^ /home/sbrivio/passt/tcp.c:2819:2: note: 1st function call argument is an uninitialized value set_peek_offset(s, 0); ^ ~ /home/sbrivio/passt/tcp.c:2819:18: error: variable 's' is uninitialized when used here [clang-diagnostic-uninitialized,-warnings-as-errors] set_peek_offset(s, 0); ^ /home/sbrivio/passt/tcp.c:2815:7: note: initialize the variable 's' to silence this warning int s; ^ = 0 and cppcheck ('make cppcheck'): tcp.c:2819:18: error: Uninitialized variable: s [uninitvar] set_peek_offset(s, 0); ^ tcp.c:2817:16: note: Assuming condition is false if (c->no_tcp || !(flow = flow_alloc())) ^ tcp.c:2819:18: note: Uninitialized variable: s set_peek_offset(s, 0); ^ make: *** [Makefile:296: cppcheck] Error 1+ perror("Failed to set SO_PEEK_OFF\n"); +} + /** * tcp_conn_epoll_events() - epoll events mask for given connection state * @events: Current connection events @@ -1951,7 +1960,7 @@ static void tcp_conn_from_tap(struct ctx *c, if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) goto cancel; } -Spurious change.+ set_peek_offset(s, 0);Do we really need to initialise it to zero on a new connection? Extra system calls on this path matter for latency of connection establishment.conn = &flow->tcp; conn->f.type = FLOW_TCP; conn->sock = s; @@ -2174,6 +2183,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) if (iov_rem) iov_sock[fill_bufs].iov_len = iov_rem; + if (peek_offset_cap) { + /* Don't use discard buffer */ + mh_sock.msg_iov = &iov_sock[1]; + mh_sock.msg_iovlen -= 1; + + /* Keep kernel sk_peek_off in synch */While Wiktionary lists "synch" as "Alternative spelling of sync", I would argue that "sync" is much more common and less surprising.+ set_peek_offset(s, already_sent);Note that there's an early return before this point where already_sent is set to 0. Don't you need to set_peek_offset() also in that case? I guess this would be easier to follow if both assignments of already_sent, earlier in this function, were followed by a set_peek_offset() call. Or do we risk calling it spuriously? > + } > + > /* Receive into buffers, don't dequeue until acknowledged by guest. */ > do > len = recvmsg(s, &mh_sock, MSG_PEEK); > @@ -2195,7 +2213,9 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) > return 0; > } > > - sendlen = len - already_sent; > + sendlen = len; > + if (!peek_offset_cap) > + sendlen -= already_sent; > if (sendlen <= 0) { > conn_flag(c, conn, STALLED); > return 0; > @@ -2718,6 +2738,7 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, > tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, > s, (struct sockaddr *)&sa)) > return;+ set_peek_offset(s, 0);Same here, do we really need to initialise it to zero? If yes, it would be nice to leave a blank line after that 'return' as it was before.tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, (struct sockaddr *)&sa, now); @@ -3042,6 +3063,7 @@ static void tcp_sock_refill_init(const struct ctx *c) int tcp_init(struct ctx *c) { unsigned b; + int s; for (b = 0; b < TCP_HASH_TABLE_SIZE; b++) tc_hash[b] = FLOW_SIDX_NONE; @@ -3065,6 +3087,17 @@ int tcp_init(struct ctx *c) NS_CALL(tcp_ns_socks_init, c); } + /* Probe for SO_PEEK_OFF support */ + s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);By default, we should always pass SOCK_CLOEXEC to socket(), just in case we miss closing one.+ if (s < 0) { + perror("Temporary tcp socket creation failed\n");This should be a warn() call, and s/tcp/TCP/.+ } else { + if (!setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &(int){0}, sizeof(int))) {No particular reason to exceed 80 columns here, and curly brackets aren't needed (coding style is the same as kernel). For consistency with the rest of the codebase: &((int) { 0 }).+ peek_offset_cap = true; + } + close(s); + } + printf("SO_PEEK_OFF%ssupported\n", peek_offset_cap ? " " : " not ");This should be a debug() call, a bit too much for info().return 0; }...I'm still reviewing 2/2, sorry for the delay. -- Stefano