Re: [PATCH v3] tcp: add support for SO_PEEK_OFF

16 Feb 2024

On 2024-02-16 04:21, Eric Dumazet wrote:
...
On Fri, Feb 16, 2024 at 10:14 AM Paolo Abeni  wrote:
...
On Thu, 2024-02-15 at 17:24 -0500, Jon Maloy wrote:
...
On 2024-02-15 12:46, Eric Dumazet wrote:
...
On Thu, Feb 15, 2024 at 6:41 PM Paolo Abeni  wrote:
...
Note: please send text-only email to netdev.
On Thu, 2024-02-15 at 10:11 -0500, Jon Maloy wrote:
...
I wonder if the following could be acceptable:
if (flags & MSG_PEEK)
          sk_peek_offset_fwd(sk, used);
   else if (peek_offset > 0)
         sk_peek_offset_bwd(sk, used);
peek_offset is already present in the data cache, and if it has the value
   zero it means either that that sk->sk_peek_off is unused (-1) or actually is zero.
   Either way, no rewind is needed in that case.
I agree the above should avoid touching cold cachelines in the
fastpath, and looks functionally correct to me.
The last word is up to Eric :)
An actual patch seems needed.
In the current form, local variable peek_offset is 0 when !MSG_PEEK.
So the "else if (peek_offset > 0)" would always be false.
Yes, of course. This wouldn't work unless we read sk->sk_peek_off at the
beginning of the function.
I will look at the other suggestions.
I *think* that moving sk_peek_off this way:
---

diff --git a/include/net/sock.h b/include/net/sock.h
index a9d99a9c583f..576a6a6abb03 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -413,7 +413,7 @@ struct sock {
         unsigned int            sk_napi_id;
  #endif
         int                     sk_rcvbuf;
-       int                     sk_disconnects;
+       int                     sk_peek_off;
struct sk_filter __rcu  *sk_filter;
         union {
@@ -439,7 +439,7 @@ struct sock {
                 struct rb_root  tcp_rtx_queue;
         };
         struct sk_buff_head     sk_write_queue;
-       __s32                   sk_peek_off;
+       int                     sk_disconnects;
         int                     sk_write_pending;
         __u32                   sk_dst_pending_confirm;
         u32                     sk_pacing_status; /* see enum sk_pacing */
---
should avoid problematic accesses,
The relevant cachelines layout is as follow:
/* --- cacheline 4 boundary (256 bytes) --- */
                 struct sk_buff *   tail;                 /*   256     8 */
         } sk_backlog;                                    /*   240    24 */
         int                        sk_forward_alloc;     /*   264     4 */
         u32                        sk_reserved_mem;      /*   268     4 */
         unsigned int               sk_ll_usec;           /*   272     4 */
         unsigned int               sk_napi_id;           /*   276     4 */
         int                        sk_rcvbuf;            /*   280     4 */
         int                        sk_disconnects;       /*   284     4 */
                                 // will become sk_peek_off
         struct sk_filter *         sk_filter;            /*   288     8 */
         union {
                 struct socket_wq * sk_wq;                /*   296     8 */
                 struct socket_wq * sk_wq_raw;            /*   296     8 */
         };                                               /*   296     8 */
         struct xfrm_policy *       sk_policy[2];         /*   304    16 */
         /* --- cacheline 5 boundary (320 bytes) --- */
//  ...
/* --- cacheline 6 boundary (384 bytes) --- */
         __s32                      sk_peek_off;          /*   384     4 */
                                 // will become sk_diconnects
         int                        sk_write_pending;     /*   388     4 */
         __u32                      sk_dst_pending_confirm; /*   392     4 */
         u32                        sk_pacing_status;     /*   396     4 */
         long int                   sk_sndtimeo;          /*   400     8 */
         struct timer_list          sk_timer;             /*   408    40 */
/* XXX last struct has 4 bytes of padding */
/* --- cacheline 7 boundary (448 bytes) --- */
sk_peek_off will be in the same cachline of sk_forward_alloc /
sk_reserved_mem / backlog tail, that are already touched by the
tcp_recvmsg_locked() main loop.
WDYT?
I was about to send a similar change, also moving sk_rcvtimeo, and
adding __cacheline_group_begin()/__cacheline_group_end
annotations.
I can finish this today.
There is also the following alternative:

if (flags & MSG_PEEK)
        sk_peek_offset_fwd(sk, used);
else if (flags & MSG_TRUNC)
        sk_peek_offset_bwd(sk, used);

This is the way we use it, and probably the typical usage.
It would force a user to drain the receive queue with MSG_TRUNC whenever 
he is using
MSG_PEEK_OFF, but I don't really see that as a limitation.

Anyway, if Paolo's suggestion solves the problem this shouldn't be 
necessary.

///jon