From: Jon Maloy <jmaloy(a)redhat.com>

Testing with iperf3 using the "pasta" protocol splicer has revealed a bug in the way TCP handles window advertising in extreme memory squeeze situations.

Under memory pressure, a socket endpoint may temporarily advertise a zero-sized window, but this is not stored as part of the socket data. The reasoning behind this is that it is considered a temporary setting which shouldn't influence any further calculations. However, if we happen to stall at an unfortunate value of the current window size, the algorithm selecting a new value will consistently fail to advertise a non-zero window once we have freed up enough memory. This means that this side's notion of the current window size differs from the one last advertised to the peer, causing the latter to not send any data to resolve the situation.

The problem occurs on the iperf3 server side, and the socket in question is a completely regular socket with the default settings for the fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.

The following excerpt of a logging session, with my own comments added, shows in more detail what is happening:

// tcp_v4_rcv(->)
// tcp_rcv_established(->)
[5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
[5201<->39222]: tcp_data_queue(->)
[5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
[5201<->39222]: tcp_data_queue(<-) OFO queue: gap: 65480, len: 0
[5201<->39222]: __tcp_transmit_skb(->)
[5201<->39222]: tcp_select_window(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
[5201<->39222]: tcp_select_window(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160, returning 0
[5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160
[5201<->39222]: __tcp_transmit_skb(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_rcv_established(<-)
[5201<->39222]: tcp_v4_rcv(<-)

// Receive queue is at 85 buffers and we are out of memory.
// We drop the incoming buffer, although it is in sequence, and decide
// to send an advertisement with a window of zero.
// We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
// we unconditionally shrink the window.

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 6104 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]

// After each read, the algorithm for calculating the new receive
// window in __tcp_cleanup_rbuf() finds it is too small to advertise
// or to update tp->rcv_wnd.
// Meanwhile, the peer thinks the window is zero, and will not send
// any more data to trigger an update from the interrupt mode side.
[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]

// The above pattern repeats again and again, since nothing changes
// between the reads.

[...]

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                [copied_seq 265469200->265545488 (76288), unread 54672, qlen 1, ofoq 0]

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]: NOT calling tcp_send_ack()
[5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]: tcp_recvmsg_locked(<-) returning 54672 bytes.
                [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]

// The receive queue is empty, but no new advertisement has been sent.
// The peer still thinks the receive window is zero, and sends nothing.
// We have ended up in a deadlock situation.

Furthermore, we have observed that in these situations this side may send out an updated 'th->ack_seq' which is not stored in tp->rcv_wup as it should be. Letting tp->rcv_wup lag behind the advertised ack_seq in this way seems to be harmless, but it is of course still wrong from a protocol viewpoint.

We fix this by setting tp->rcv_wnd and tp->rcv_wup even when a packet has been dropped because of memory exhaustion and we have to advertise a zero window. Further testing shows that the connection recovers neatly from the squeeze situation, and traffic can continue indefinitely.

Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory")
Reviewed-by: Stefano Brivio <sbrivio(a)redhat.com>
Signed-off-by: Jon Maloy <jmaloy(a)redhat.com>

---
v1: - Posted on Apr 6, 2024
v2: - Improved commit log to clarify how we end up in this situation.
    - After feedback from Eric Dumazet, removed references to use of SO_PEEK and SO_PEEK_OFF, which may lead to a misunderstanding about how this situation occurs. Those flags are used on the peer side's incoming connection, not on this one.
---
 net/ipv4/tcp_output.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0e5b9a654254..ba295f798e5e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -265,11 +265,13 @@ static u16 tcp_select_window(struct sock *sk)
 	u32 cur_win, new_win;
 
 	/* Make the window 0 if we failed to queue the data because we
-	 * are out of memory. The window is temporary, so we don't store
-	 * it on the socket.
+	 * are out of memory.
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		tp->rcv_wnd = 0;
+		tp->rcv_wup = tp->rcv_nxt;
 		return 0;
+	}
 
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);
-- 
2.48.0
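For readers trying to follow the numbers above, here is a minimal worked sketch of why the deadlock persists and how the fix breaks it. The helper is tcp_receive_window() as it appears in include/net/tcp.h; the figures are taken from the trace in the commit message, not from new measurements:

	static inline u32 tcp_receive_window(const struct tcp_sock *tp)
	{
		s32 win = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt;

		if (win < 0)
			win = 0;
		return (u32) win;
	}

	/* Without the fix, the zero-window ACK leaves the socket state untouched:
	 *   rcv_wup = 265469200, rcv_wnd = 262144, rcv_nxt = 265600160
	 *   win_now = 265469200 + 262144 - 265600160 = 131184
	 * so the test in __tcp_cleanup_rbuf(),
	 *   new_window && new_window >= 2 * rcv_window_now
	 * becomes 262144 >= 262368, which is false after every read: no ACK is
	 * sent, and the peer keeps believing the window is zero.
	 *
	 * With the fix, the zero-window ACK records rcv_wnd = 0 and
	 * rcv_wup = rcv_nxt, so win_now = 0 and the same test becomes
	 * 262144 >= 0: the next read sends the window update.
	 */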
On Fri, Jan 17, 2025 at 10:40 PM <jmaloy(a)redhat.com> wrote:From: Jon Maloy <jmaloy(a)redhat.com> Testing with iperf3 using the "pasta" protocol splicer has revealed a bug in the way tcp handles window advertising in extreme memory squeeze situations. Under memory pressure, a socket endpoint may temporarily advertise a zero-sized window, but this is not stored as part of the socket data. The reasoning behind this is that it is considered a temporary setting which shouldn't influence any further calculations. However, if we happen to stall at an unfortunate value of the current window size, the algorithm selecting a new value will consistently fail to advertise a non-zero window once we have freed up enough memory. This means that this side's notion of the current window size is different from the one last advertised to the peer, causing the latter to not send any data to resolve the sitution. The problem occurs on the iperf3 server side, and the socket in question is a completely regular socket with the default settings for the fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket. The following excerpt of a logging session, with own comments added, shows more in detail what is happening: // tcp_v4_rcv(->) // tcp_rcv_established(->) [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ==== [5201<->39222]: tcp_data_queue(->) [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0] [5201<->39222]: tcp_data_queue(<-) OFO queue: gap: 65480, len: 0 [5201<->39222]: __tcp_transmit_skb(->) [5201<->39222]: tcp_select_window(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE [5201<->39222]: tcp_select_window(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160, returning 0 [5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160 [5201<->39222]: __tcp_transmit_skb(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: tcp_rcv_established(<-) [5201<->39222]: tcp_v4_rcv(<-) // Receive queue is at 85 buffers and we are out of memory. // We drop the incoming buffer, although it is in sequence, and decide // to send an advertisement with a window of zero. // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means // we unconditionally shrink the window. [5201<->39222]: tcp_recvmsg_locked(->) [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368] [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] [5201<->39222]: NOT calling tcp_send_ack() [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: tcp_recvmsg_locked(<-) returning 6104 bytes. [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0] // After each read, the algorithm for calculating the new receive // window in __tcp_cleanup_rbuf() finds it is too small to advertise // or to update tp->rcv_wnd. // Meanwhile, the peer thinks the window is zero, and will not send // any more data to trigger an update from the interrupt mode side. 
[5201<->39222]: tcp_recvmsg_locked(->) [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] [5201<->39222]: NOT calling tcp_send_ack() [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes. [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0] // The above pattern repeats again and again, since nothing changes // between the reads. [...] [5201<->39222]: tcp_recvmsg_locked(->) [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] [5201<->39222]: NOT calling tcp_send_ack() [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes. [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] [copied_seq 265469200->265545488 (76288), unread 54672, qlen 1, ofoq 0] [5201<->39222]: tcp_recvmsg_locked(->) [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] [5201<->39222]: NOT calling tcp_send_ack() [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: tcp_recvmsg_locked(<-) returning 54672 bytes. [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0] // The receive queue is empty, but no new advertisement has been sent. // The peer still thinks the receive window is zero, and sends nothing. // We have ended up in a deadlock situation. Furthermore, we have observed that in these situations this side may send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup as it should be. Backing ack_seq seems to be harmless, but is of course still wrong from a protocol viewpoint. We fix this by setting tp->rcv_wnd and tp->rcv_wup even when a packet has been dropped because of memory exhaustion and we have to advertize a zero window. Further testing shows that the connection recovers neatly from the squeeze situation, and traffic can continue indefinitely. Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory") Reviewed-by: Stefano Brivio <sbrivio(a)redhat.com> Signed-off-by: Jon Maloy <jmaloy(a)redhat.com> --- v1: -Posted on Apr 6, 2024Could you post the link, this was a long time ago and I forgot the context.v2: -Improved commit log to clarify how we end up in this situation. -After feedback from Eric Dumazet, removed references to use of SO_PEEK and SO_PEEK_OFF which may lead to a misunderstanding about how this situation occurs. Those flags are used at the peer side's incoming connection, and not on this one. 
--- net/ipv4/tcp_output.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 0e5b9a654254..ba295f798e5e 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -265,11 +265,13 @@ static u16 tcp_select_window(struct sock *sk) u32 cur_win, new_win; /* Make the window 0 if we failed to queue the data because we - * are out of memory. The window is temporary, so we don't store - * it on the socket. + * are out of memory. */ - if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) + if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) { + tp->rcv_wnd = 0; + tp->rcv_wup = tp->rcv_nxt;I wonder if we should not clear tp->pred_flags here ? Also, any chance you could provide a packetdrill test ? Your changelog contains traces that are hard to follow. Thanks.
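For reference, a minimal sketch of what the hunk could look like with the tp->pred_flags suggestion folded in (illustrative only, not a posted revision of the patch):

	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
		tp->rcv_wnd = 0;
		tp->rcv_wup = tp->rcv_nxt;
		/* We are advertising a zero window, so disable header
		 * prediction and force the slow path, roughly mirroring
		 * what tcp_select_window() already does when the computed
		 * window comes out as zero.
		 */
		tp->pred_flags = 0;
		return 0;
	}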
[Fixed Cc: for Menglong Dong, this is a reply to:
https://lore.kernel.org/all/CANn89i+Ks52JVTBsMFQBM4CqUR4cegXhbSCH77aMCqFpd-…]

On Fri, 17 Jan 2025 23:09:27 +0100 Eric Dumazet <edumazet(a)google.com> wrote:
On Fri, Jan 17, 2025 at 10:40 PM <jmaloy(a)redhat.com> wrote:

https://lore.kernel.org/all/20240406182107.261472-1-jmaloy@redhat.com/#r

-- 
Stefano

v1: -Posted on Apr 6, 2024
Could you post the link, this was a long time ago and I forgot the context.
On Sat, Jan 18, 2025 at 5:41 AM <jmaloy(a)redhat.com> wrote:From: Jon Maloy <jmaloy(a)redhat.com> Testing with iperf3 using the "pasta" protocol splicer has revealed a bug in the way tcp handles window advertising in extreme memory squeeze situations. Under memory pressure, a socket endpoint may temporarily advertise a zero-sized window, but this is not stored as part of the socket data. The reasoning behind this is that it is considered a temporary setting which shouldn't influence any further calculations. However, if we happen to stall at an unfortunate value of the current window size, the algorithm selecting a new value will consistently fail to advertise a non-zero window once we have freed up enough memory. This means that this side's notion of the current window size is different from the one last advertised to the peer, causing the latter to not send any data to resolve the sitution. The problem occurs on the iperf3 server side, and the socket in question is a completely regular socket with the default settings for the fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket. The following excerpt of a logging session, with own comments added, shows more in detail what is happening: // tcp_v4_rcv(->) // tcp_rcv_established(->) [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ==== [5201<->39222]: tcp_data_queue(->) [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0] [5201<->39222]: tcp_data_queue(<-) OFO queue: gap: 65480, len: 0 [5201<->39222]: __tcp_transmit_skb(->) [5201<->39222]: tcp_select_window(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE [5201<->39222]: tcp_select_window(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160, returning 0 [5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160 [5201<->39222]: __tcp_transmit_skb(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: tcp_rcv_established(<-) [5201<->39222]: tcp_v4_rcv(<-) // Receive queue is at 85 buffers and we are out of memory. // We drop the incoming buffer, although it is in sequence, and decide // to send an advertisement with a window of zero. // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means // we unconditionally shrink the window. [5201<->39222]: tcp_recvmsg_locked(->) [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368] [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] [5201<->39222]: NOT calling tcp_send_ack() [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: tcp_recvmsg_locked(<-) returning 6104 bytes. [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0] // After each read, the algorithm for calculating the new receive // window in __tcp_cleanup_rbuf() finds it is too small to advertise // or to update tp->rcv_wnd. // Meanwhile, the peer thinks the window is zero, and will not send // any more data to trigger an update from the interrupt mode side. 
[5201<->39222]: tcp_recvmsg_locked(->) [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] [5201<->39222]: NOT calling tcp_send_ack() [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes. [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0] // The above pattern repeats again and again, since nothing changes // between the reads. [...] [5201<->39222]: tcp_recvmsg_locked(->) [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] [5201<->39222]: NOT calling tcp_send_ack() [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: tcp_recvmsg_locked(<-) returning 131072 bytes. [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] [copied_seq 265469200->265545488 (76288), unread 54672, qlen 1, ofoq 0] [5201<->39222]: tcp_recvmsg_locked(->) [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] [5201<->39222]: NOT calling tcp_send_ack() [5201<->39222]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: tcp_recvmsg_locked(<-) returning 54672 bytes. [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0] // The receive queue is empty, but no new advertisement has been sent. // The peer still thinks the receive window is zero, and sends nothing. // We have ended up in a deadlock situation. Furthermore, we have observed that in these situations this side may send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup as it should be. Backing ack_seq seems to be harmless, but is of course still wrong from a protocol viewpoint. We fix this by setting tp->rcv_wnd and tp->rcv_wup even when a packet has been dropped because of memory exhaustion and we have to advertize a zero window. Further testing shows that the connection recovers neatly from the squeeze situation, and traffic can continue indefinitely.Thanks for such a detailed explanation :) However, undeniably, I've read several times until I totally follow the issue you reported. I wonder if you can compact and simplify the commit message a little bit, say, remove the last two calltraces that can be replaced by a single sentence like you've already stated (which is "the queue is empty"). Let me rephrase the scenario to see if I'm understanding in the right way: 1. First time the receiver sends a skb with zero window because of shortage of memory. 2. Sometimes the user calls the tcp_recvmsg() to remove the data from kernel to user space, sending the ACK process should be triggered, especially when the queue is empty. 3. Then the client reading the zero window signal stops sending data. 
#1 fails to update rcv_wup and rcv_wnd, so from then on, as far as __tcp_cleanup_rbuf() is concerned, the rcv_window_now variable always stays the same in this case (please see the calculation in tcp_receive_window()). rcv_window_now, which I presume is the same as win_now in the above description, should normally vary when __tcp_cleanup_rbuf() is called. Because of that, #2 can never send an ACK.

My small question is: why is new_window always 262144, no matter what amount of data the user receives? Is it because in __tcp_select_window() there is no wscale option?

Thanks,
Jason
On Fri, Jan 17, 2025 at 4:41 PM <jmaloy(a)redhat.com> wrote:From: Jon Maloy <jmaloy(a)redhat.com> Testing with iperf3 using the "pasta" protocol splicer has revealed a bug in the way tcp handles window advertising in extreme memory squeeze situations. Under memory pressure, a socket endpoint may temporarily advertise a zero-sized window, but this is not stored as part of the socket data. The reasoning behind this is that it is considered a temporary setting which shouldn't influence any further calculations. However, if we happen to stall at an unfortunate value of the current window size, the algorithm selecting a new value will consistently fail to advertise a non-zero window once we have freed up enough memory.The "if we happen to stall at an unfortunate value of the current window size" phrase is a little vague... :-) Do you have a sense of what might count as "unfortunate" here? That might help in crafting a packetdrill test to reproduce this and have an automated regression test.This means that this side's notion of the current window size is different from the one last advertised to the peer, causing the latter to not send any data to resolve the sitution.Since the peer last saw a zero receive window at the time of the memory-pressure drop, shouldn't the peer be sending repeated zero window probes, and shouldn't the local host respond to a ZWP with an ACK with the correct non-zero window? Do you happen to have a tcpdump .pcap of one of these cases that you can share?The problem occurs on the iperf3 server side, and the socket in question is a completely regular socket with the default settings for the fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket. The following excerpt of a logging session, with own comments added, shows more in detail what is happening: // tcp_v4_rcv(->) // tcp_rcv_established(->) [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ==== [5201<->39222]: tcp_data_queue(->) [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]What is "win_now"? That doesn't seem to correspond to any variable name in the Linux source tree. Can this be renamed to the tcp_select_window() variable it is printing, like "cur_win" or "effective_win" or "new_win", etc? Or perhaps you can attach your debugging patch in some email thread? I agree with Eric that these debug dumps are a little hard to parse without seeing the patch that allows us to understand what some of these fields are... I agree with Eric that probably tp->pred_flags should be cleared, and a packetdrill test for this would be super-helpful. thanks, neal
On 2025-01-18 15:04, Neal Cardwell wrote:On Fri, Jan 17, 2025 at 4:41 PM <jmaloy(a)redhat.com> wrote:Obviously, it happens when the following code snippet in __tcp_cleanup_rbuf() { [....] if (copied > 0 && !time_to_ack && !(sk->sk_shutdown & RCV_SHUTDOWN)) { __u32 rcv_window_now = tcp_receive_window(tp); /* Optimize, __tcp_select_window() is not cheap. */ if (2*rcv_window_now <= tp->window_clamp) { __u32 new_window = __tcp_select_window(sk); /* Send ACK now, if this read freed lots of space * in our buffer. Certainly, new_window is new window. * We can advertise it now, if it is not less than * current one. * "Lots" means "at least twice" here. */ if (new_window && new_window >= 2 * rcv_window_now) time_to_ack = true; } } [....] } yields time_to_ack = false, i.e. __tcp_select_window(sk) returns a value new_window < (2 * tcp_receive_window(tp)). In my log I have for brevity used the following names: win_now: same as rcv_window_now (= tcp_receive_window(tp), = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt, = 265469200 + 262144 - 265600160, = 131184) new_win: same as new_window (= __tcp_select_window(sk), = 0 first time, later 262144 ) rcv_wnd: same as tp->rcv_wnd, (=262144) We see that although the last test actually is pretty close (262144 >= 262368 ? => false) it is not close enough. We also notice that (tp->rcv_nxt - tp->rcv_wup) = (265600160 - 265469200) = 130960. 130960 < tp->rcv_wnd / 2, so the last test in __tcp_cleanup_rbuf(): (new_window >= 2 * rcv_window_now) will always be false. Too me it looks like __tcp_select_window(sk) doesn't at all take the freed-up memory into account when calculating a new window. I haven't looked into why that is happening.From: Jon Maloy <jmaloy(a)redhat.com> Testing with iperf3 using the "pasta" protocol splicer has revealed a bug in the way tcp handles window advertising in extreme memory squeeze situations. Under memory pressure, a socket endpoint may temporarily advertise a zero-sized window, but this is not stored as part of the socket data. The reasoning behind this is that it is considered a temporary setting which shouldn't influence any further calculations. However, if we happen to stall at an unfortunate value of the current window size, the algorithm selecting a new value will consistently fail to advertise a non-zero window once we have freed up enough memory.The "if we happen to stall at an unfortunate value of the current window size" phrase is a little vague... :-) Do you have a sense of what might count as "unfortunate" here? That might help in crafting a packetdrill test to reproduce this and have an automated regression test.It should, but at the moment when I found this bug the peer stack was not the Linux kernel stack, but one we develop for our own purpose. We fixed that later, but it still means that traffic stops for a couple of seconds now and then before the timer restarts the flow. This happens too often for comfort in our usage scenarios. 
We can of course blame the the peer stack, but I still feel this is a bug, and that it could be handled better by the kernel stack.This means that this side's notion of the current window size is different from the one last advertised to the peer, causing the latter to not send any data to resolve the sitution.Since the peer last saw a zero receive window at the time of the memory-pressure drop, shouldn't the peer be sending repeated zero window probes, and shouldn't the local host respond to a ZWP with an ACK with the correct non-zero window?Do you happen to have a tcpdump .pcap of one of these cases that you can share?I had one, although not for this particular run, and I cannot find it right now. I will continue looking or make a new one. Is there some shared space I can put it?See above. Can this be renamed to theThe problem occurs on the iperf3 server side, and the socket in question is a completely regular socket with the default settings for the fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket. The following excerpt of a logging session, with own comments added, shows more in detail what is happening: // tcp_v4_rcv(->) // tcp_rcv_established(->) [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ==== [5201<->39222]: tcp_data_queue(->) [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]What is "win_now"? That doesn't seem to correspond to any variable name in the Linux source tree.tcp_select_window() variable it is printing, like "cur_win" or "effective_win" or "new_win", etc? Or perhaps you can attach your debugging patch in some email thread? I agree with Eric that these debug dumps are a little hard to parse without seeing the patch that allows us to understand what some of these fields are... I agree with Eric that probably tp->pred_flags should be cleared, and a packetdrill test for this would be super-helpful.I must admit I have never used packetdrill, but I can make an effort. ///jonthanks, neal
On 2025-01-20 00:03, Jon Maloy wrote:On 2025-01-18 15:04, Neal Cardwell wrote:Here it is. Look at frame #1067. https://passt.top/static/iperf3_jon_zero_window_cut.pcapOn Fri, Jan 17, 2025 at 4:41 PM <jmaloy(a)redhat.com> wrote:Obviously, it happens when the following code snippet in __tcp_cleanup_rbuf() { [....] if (copied > 0 && !time_to_ack && !(sk->sk_shutdown & RCV_SHUTDOWN)) { __u32 rcv_window_now = tcp_receive_window(tp); /* Optimize, __tcp_select_window() is not cheap. */ if (2*rcv_window_now <= tp->window_clamp) { __u32 new_window = __tcp_select_window(sk); /* Send ACK now, if this read freed lots of space * in our buffer. Certainly, new_window is new window. * We can advertise it now, if it is not less than * current one. * "Lots" means "at least twice" here. */ if (new_window && new_window >= 2 * rcv_window_now) time_to_ack = true; } } [....] } yields time_to_ack = false, i.e. __tcp_select_window(sk) returns a value new_window < (2 * tcp_receive_window(tp)). In my log I have for brevity used the following names: win_now: same as rcv_window_now (= tcp_receive_window(tp), = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt, = 265469200 + 262144 - 265600160, = 131184) new_win: same as new_window (= __tcp_select_window(sk), = 0 first time, later 262144 ) rcv_wnd: same as tp->rcv_wnd, (=262144) We see that although the last test actually is pretty close (262144 >= 262368 ? => false) it is not close enough. We also notice that (tp->rcv_nxt - tp->rcv_wup) = (265600160 - 265469200) = 130960. 130960 < tp->rcv_wnd / 2, so the last test in __tcp_cleanup_rbuf(): (new_window >= 2 * rcv_window_now) will always be false. Too me it looks like __tcp_select_window(sk) doesn't at all take the freed-up memory into account when calculating a new window. I haven't looked into why that is happening.From: Jon Maloy <jmaloy(a)redhat.com> Testing with iperf3 using the "pasta" protocol splicer has revealed a bug in the way tcp handles window advertising in extreme memory squeeze situations. Under memory pressure, a socket endpoint may temporarily advertise a zero-sized window, but this is not stored as part of the socket data. The reasoning behind this is that it is considered a temporary setting which shouldn't influence any further calculations. However, if we happen to stall at an unfortunate value of the current window size, the algorithm selecting a new value will consistently fail to advertise a non-zero window once we have freed up enough memory.The "if we happen to stall at an unfortunate value of the current window size" phrase is a little vague... :-) Do you have a sense of what might count as "unfortunate" here? That might help in crafting a packetdrill test to reproduce this and have an automated regression test.It should, but at the moment when I found this bug the peer stack was not the Linux kernel stack, but one we develop for our own purpose. We fixed that later, but it still means that traffic stops for a couple of seconds now and then before the timer restarts the flow. This happens too often for comfort in our usage scenarios. 
We can of course blame the the peer stack, but I still feel this is a bug, and that it could be handled better by the kernel stack.This means that this side's notion of the current window size is different from the one last advertised to the peer, causing the latter to not send any data to resolve the sitution.Since the peer last saw a zero receive window at the time of the memory-pressure drop, shouldn't the peer be sending repeated zero window probes, and shouldn't the local host respond to a ZWP with an ACK with the correct non-zero window?Do you happen to have a tcpdump .pcap of one of these cases that you can share?I had one, although not for this particular run, and I cannot find it right now. I will continue looking or make a new one. Is there some shared space I can put it?I hear from other sources that you cannot force a memory exhaustion with packetdrill anyway, so this sounds like a pointless exercise. ///jonSee above. Can this be renamed to theThe problem occurs on the iperf3 server side, and the socket in question is a completely regular socket with the default settings for the fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket. The following excerpt of a logging session, with own comments added, shows more in detail what is happening: // tcp_v4_rcv(->) // tcp_rcv_established(->) [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/ tcp_data_queue()/5257 ==== [5201<->39222]: tcp_data_queue(->) [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]What is "win_now"? That doesn't seem to correspond to any variable name in the Linux source tree.tcp_select_window() variable it is printing, like "cur_win" or "effective_win" or "new_win", etc? Or perhaps you can attach your debugging patch in some email thread? I agree with Eric that these debug dumps are a little hard to parse without seeing the patch that allows us to understand what some of these fields are... I agree with Eric that probably tp->pred_flags should be cleared, and a packetdrill test for this would be super-helpful.I must admit I have never used packetdrill, but I can make an effort.///jonthanks, neal
On Mon, Jan 20, 2025 at 5:10 PM Jon Maloy <jmaloy(a)redhat.com> wrote:On 2025-01-20 00:03, Jon Maloy wrote:We certainly can and should add a feature like that to packetdrill. Documentation/fault-injection/ has some relevant information. Even without this, tcp_try_rmem_schedule() is reading sk->sk_rcvbuf that could be lowered by a packetdrill script I think.On 2025-01-18 15:04, Neal Cardwell wrote:Here it is. Look at frame #1067. https://passt.top/static/iperf3_jon_zero_window_cut.pcapOn Fri, Jan 17, 2025 at 4:41 PM <jmaloy(a)redhat.com> wrote:Obviously, it happens when the following code snippet in __tcp_cleanup_rbuf() { [....] if (copied > 0 && !time_to_ack && !(sk->sk_shutdown & RCV_SHUTDOWN)) { __u32 rcv_window_now = tcp_receive_window(tp); /* Optimize, __tcp_select_window() is not cheap. */ if (2*rcv_window_now <= tp->window_clamp) { __u32 new_window = __tcp_select_window(sk); /* Send ACK now, if this read freed lots of space * in our buffer. Certainly, new_window is new window. * We can advertise it now, if it is not less than * current one. * "Lots" means "at least twice" here. */ if (new_window && new_window >= 2 * rcv_window_now) time_to_ack = true; } } [....] } yields time_to_ack = false, i.e. __tcp_select_window(sk) returns a value new_window < (2 * tcp_receive_window(tp)). In my log I have for brevity used the following names: win_now: same as rcv_window_now (= tcp_receive_window(tp), = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt, = 265469200 + 262144 - 265600160, = 131184) new_win: same as new_window (= __tcp_select_window(sk), = 0 first time, later 262144 ) rcv_wnd: same as tp->rcv_wnd, (=262144) We see that although the last test actually is pretty close (262144 >= 262368 ? => false) it is not close enough. We also notice that (tp->rcv_nxt - tp->rcv_wup) = (265600160 - 265469200) = 130960. 130960 < tp->rcv_wnd / 2, so the last test in __tcp_cleanup_rbuf(): (new_window >= 2 * rcv_window_now) will always be false. Too me it looks like __tcp_select_window(sk) doesn't at all take the freed-up memory into account when calculating a new window. I haven't looked into why that is happening.From: Jon Maloy <jmaloy(a)redhat.com> Testing with iperf3 using the "pasta" protocol splicer has revealed a bug in the way tcp handles window advertising in extreme memory squeeze situations. Under memory pressure, a socket endpoint may temporarily advertise a zero-sized window, but this is not stored as part of the socket data. The reasoning behind this is that it is considered a temporary setting which shouldn't influence any further calculations. However, if we happen to stall at an unfortunate value of the current window size, the algorithm selecting a new value will consistently fail to advertise a non-zero window once we have freed up enough memory.The "if we happen to stall at an unfortunate value of the current window size" phrase is a little vague... :-) Do you have a sense of what might count as "unfortunate" here? That might help in crafting a packetdrill test to reproduce this and have an automated regression test.It should, but at the moment when I found this bug the peer stack was not the Linux kernel stack, but one we develop for our own purpose. We fixed that later, but it still means that traffic stops for a couple of seconds now and then before the timer restarts the flow. This happens too often for comfort in our usage scenarios. 
We can of course blame the the peer stack, but I still feel this is a bug, and that it could be handled better by the kernel stack.This means that this side's notion of the current window size is different from the one last advertised to the peer, causing the latter to not send any data to resolve the sitution.Since the peer last saw a zero receive window at the time of the memory-pressure drop, shouldn't the peer be sending repeated zero window probes, and shouldn't the local host respond to a ZWP with an ACK with the correct non-zero window?Do you happen to have a tcpdump .pcap of one of these cases that you can share?I had one, although not for this particular run, and I cannot find it right now. I will continue looking or make a new one. Is there some shared space I can put it?I hear from other sources that you cannot force a memory exhaustion with packetdrill anyway, so this sounds like a pointless exercise.See above. Can this be renamed to theThe problem occurs on the iperf3 server side, and the socket in question is a completely regular socket with the default settings for the fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket. The following excerpt of a logging session, with own comments added, shows more in detail what is happening: // tcp_v4_rcv(->) // tcp_rcv_established(->) [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/ tcp_data_queue()/5257 ==== [5201<->39222]: tcp_data_queue(->) [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]What is "win_now"? That doesn't seem to correspond to any variable name in the Linux source tree.tcp_select_window() variable it is printing, like "cur_win" or "effective_win" or "new_win", etc? Or perhaps you can attach your debugging patch in some email thread? I agree with Eric that these debug dumps are a little hard to parse without seeing the patch that allows us to understand what some of these fields are... I agree with Eric that probably tp->pred_flags should be cleared, and a packetdrill test for this would be super-helpful.I must admit I have never used packetdrill, but I can make an effort.
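As a rough illustration of the sk->sk_rcvbuf idea mentioned above (an untested assumption, not a worked-out packetdrill script): shrinking the receive buffer on an established connection that already has data queued should make tcp_try_rmem_schedule() fail for the next in-sequence segment, which is the SKB_DROP_REASON_PROTO_MEM path shown in the log. From user space that could look roughly like:

	/* Hypothetical repro helper: force sk_rcvbuf below the amount of data
	 * already queued, so that the next segment is dropped and a
	 * zero-window ACK is sent. Note that SO_RCVBUF is doubled by the
	 * kernel and clamped to a minimum, so the effective value may differ.
	 */
	#include <sys/socket.h>

	static int shrink_rcvbuf(int fd)
	{
		int val = 2048;

		return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val));
	}

Whether this reproduces the squeeze reliably is left open in the thread (see Stefano's follow-up below).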
On 2025-01-20 11:22, Eric Dumazet wrote:
On Mon, Jan 20, 2025 at 5:10 PM Jon Maloy <jmaloy(a)redhat.com> wrote:
> > > > On 2025-01-20 00:03, Jon Maloy wrote:
>> >>
[...]

Neal, Eric,
How do you suggest we proceed with this?
I downloaded packetdrill and tried it a bit, but to understand it well enough to introduce a new feature would require more time than I am able to spend on this. Maybe Neal, who I see is one of the contributors to packetdrill, could help out?

I can certainly clear tp->pred_flags and post it again, maybe with an improved and shortened log. Would that be acceptable?

I also made a run where I looked into why __tcp_select_window() ignores all the space that has been freed up:

tcp_recvmsg_locked(->)
__tcp_cleanup_rbuf(->) (copied 131072)
    tp->rcv_wup: 1788299855, tp->rcv_wnd: 5812224, tp->rcv_nxt 1793800175
    __tcp_select_window(->)
        tcp_space(->)
        tcp_space(<-) returning 458163
        free_space = round_down(458163, 4096) = 454656
        (free_space > tp->rcv_ssthresh) --> free_space = tp->rcv_ssthresh = 261920
        window = ALIGN(261920, 4096) = 262144
    __tcp_select_window(<-) returning 262144
    [rcv_win_now 311904, 2 * rcv_win_now 623808, new_window 262144]
    (new_window >= (2 * rcv_win_now)) ? --> time_to_ack 0
    NOT calling tcp_send_ack()
__tcp_cleanup_rbuf(<-) [tp->rcv_wup 1788299855, tp->rcv_wnd 5812224, tp->rcv_nxt 1793800175]
tcp_recvmsg_locked(<-) returning 131072 bytes.
    [tp->rcv_nxt 1793800175, tp->rcv_wnd 5812224, tp->rcv_wup 1788299855, sk->last_ack 0, tcp_receive_win() 311904, copied_seq 1788299855->1788395953 (96098), unread 5404222, sk_rcv_qlen 83, ofo_qlen 0]

As we see, tp->rcv_ssthresh is the limiting factor, causing a consistent situation where (new_window < (rcv_win_now * 2)), and even (new_window < rcv_win_now). To me, it looks like tp->rcv_ssthresh should have a higher value in this situation, or maybe we should alter this test.

The combination of these two issues, not updating tp->rcv_wnd and __tcp_select_window() returning a wrong value, is what is causing this whole problem.

///jon

We certainly can and should add a feature like that to packetdrill. Documentation/fault-injection/ has some relevant information. Even without this, tcp_try_rmem_schedule() is reading sk->sk_rcvbuf that could be lowered by a packetdrill script I think.
I hear from other sources that you cannot force a memory exhaustion with packetdrill anyway, so this sounds like a pointless exercise.
I agree with Eric that probably tp->pred_flags should be cleared, and a packetdrill test for this would be super-helpful.
I must admit I have never used packetdrill, but I can make an effort.
On Fri, Jan 24, 2025 at 6:40 PM Jon Maloy <jmaloy(a)redhat.com> wrote:On 2025-01-20 11:22, Eric Dumazet wrote:I will spend some time this week preparing for some tests. I would prefer not merging new code without a clear understanding of the issue. Thanks.On Mon, Jan 20, 2025 at 5:10 PM Jon Maloy <jmaloy(a)redhat.com> wrote: > > > > On 2025-01-20 00:03, Jon Maloy wrote: >> >>[...]Neal, Eric, How do you suggest we proceed with this? I downloaded packetdrill and tried it a bit, but to understand it well enough to introduce a new feature would require more time than I am able to spend on this. Maybe Neal, who I see is one of the contributors to packetdrill could help out?We certainly can and should add a feature like that to packetdrill. Documentation/fault-injection/ has some relevant information. Even without this, tcp_try_rmem_schedule() is reading sk->sk_rcvbuf that could be lowered by a packetdrill script I think.> I agree with Eric that probably tp->pred_flags should be cleared, and > a packetdrill test for this would be super-helpful. I must admit I have never used packetdrill, but I can make an effort.I hear from other sources that you cannot force a memory exhaustion with packetdrill anyway, so this sounds like a pointless exercise.I can certainly clear tp->pred_flags and post it again, maybe with an improved and shortened log. Would that be acceptable? I also made a run where I looked into why __tcp_select_window() ignores all the space that has been freed up: tcp_recvmsg_locked(->) __tcp_cleanup_rbuf(->) (copied 131072) tp->rcv_wup: 1788299855, tp->rcv_wnd: 5812224, tp->rcv_nxt 1793800175 __tcp_select_window(->) tcp_space(->) tcp_space(<-) returning 458163 free_space = round_down(458163, 1 << 4096) = 454656 (free_space > tp->rcv_ssthresh) --> free_space = tp->rcv_ssthresh = 261920 window = ALIGN(261920, 4096) = 26144 __tcp_select_window(<-) returning 262144 [rcv_win_now 311904, 2 * rcv_win_now 623808, new_window 262144] (new_window >= (2 * rcv_win_now)) ? --> time_to_ack 0 NOT calling tcp_send_ack() __tcp_cleanup_rbuf(<-) [tp->rcv_wup 1788299855, tp->rcv_wnd 5812224, tp->rcv_nxt 1793800175] tcp_recvmsg_locked(<-) returning 131072 bytes. [tp->rcv_nxt 1793800175, tp->rcv_wnd 5812224, tp->rcv_wup 1788299855, sk->last_ack 0, tcp_receive_win() 311904, copied_seq 1788299855->1788395953 (96098), unread 5404222, sk_rcv_qlen 83, ofo_qlen 0] As we see tp->rcv_ssthresh is the limiting factor, causing a consistent situation where (new_window < (rcv_win_now * 2)), and even (new_window < rcv_win_now). To me, it looks like tp->ssthresh should have a higher value in this situation, or maybe we should alter this test. The combination of these two issues, -not updating tp->wnd and _tcp_select_window() returning a wrong value, is what is causing this whole problem. ///jon
On Fri, 24 Jan 2025 12:40:16 -0500 Jon Maloy <jmaloy(a)redhat.com> wrote:I can certainly clear tp->pred_flags and post it again, maybe with an improved and shortened log. Would that be acceptable?Talking about an improved log, what strikes me the most of the whole problem is: $ tshark -r iperf3_jon_zero_window.pcap -td -Y 'frame.number in { 1064 .. 1068 }' 1064 0.004416 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611679466 Ack=1 Win=36864 Len=65480 1065 0.007334 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611744946 Ack=1 Win=36864 Len=65480 1066 0.005104 192.168.122.1 → 192.168.122.198 TCP 56382 [TCP Window Full] 34482 → 5201 [ACK] Seq=1611810426 Ack=1 Win=36864 Len=56328 1067 0.015226 192.168.122.198 → 192.168.122.1 TCP 54 [TCP ZeroWindow] 5201 → 34482 [ACK] Seq=1 Ack=1611090146 Win=0 Len=0 1068 6.298138 fe80::44b3:f5ff:fe86:c529 → ff02::2 ICMPv6 70 Router Solicitation from 46:b3:f5:86:c5:29 ...and then the silence, 192.168.122.198 never announces that its window is not zero, so the peer gives up 15 seconds later: $ tshark -r iperf3_jon_zero_window_cut.pcap -td -Y 'frame.number in { 1069 .. 1070 }' 1069 8.709313 192.168.122.1 → 192.168.122.198 TCP 55 34466 → 5201 [ACK] Seq=166 Ack=5 Win=36864 Len=1 1070 0.008943 192.168.122.198 → 192.168.122.1 TCP 54 5201 → 34482 [FIN, ACK] Seq=1 Ack=1611090146 Win=778240 Len=0 Data in frame #1069 is iperf3 ending the test. This didn't happen before e2142825c120 ("net: tcp: send zero-window ACK when no memory") so it's a relatively recent (17 months) regression. It actually looks pretty simple (and rather serious) to me. -- Stefano
On Mon, Jan 27, 2025 at 11:01 AM Stefano Brivio <sbrivio(a)redhat.com> wrote:On Fri, 24 Jan 2025 12:40:16 -0500 Jon Maloy <jmaloy(a)redhat.com> wrote:With all that, it should be pretty easy to cook a packetdrill test, right ? packetdrill tests are part of tools/testing/selftests/net/ already, we are not asking for something unreasonable.I can certainly clear tp->pred_flags and post it again, maybe with an improved and shortened log. Would that be acceptable?Talking about an improved log, what strikes me the most of the whole problem is: $ tshark -r iperf3_jon_zero_window.pcap -td -Y 'frame.number in { 1064 .. 1068 }' 1064 0.004416 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611679466 Ack=1 Win=36864 Len=65480 1065 0.007334 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611744946 Ack=1 Win=36864 Len=65480 1066 0.005104 192.168.122.1 → 192.168.122.198 TCP 56382 [TCP Window Full] 34482 → 5201 [ACK] Seq=1611810426 Ack=1 Win=36864 Len=56328 1067 0.015226 192.168.122.198 → 192.168.122.1 TCP 54 [TCP ZeroWindow] 5201 → 34482 [ACK] Seq=1 Ack=1611090146 Win=0 Len=0 1068 6.298138 fe80::44b3:f5ff:fe86:c529 → ff02::2 ICMPv6 70 Router Solicitation from 46:b3:f5:86:c5:29 ...and then the silence, 192.168.122.198 never announces that its window is not zero, so the peer gives up 15 seconds later: $ tshark -r iperf3_jon_zero_window_cut.pcap -td -Y 'frame.number in { 1069 .. 1070 }' 1069 8.709313 192.168.122.1 → 192.168.122.198 TCP 55 34466 → 5201 [ACK] Seq=166 Ack=5 Win=36864 Len=1 1070 0.008943 192.168.122.198 → 192.168.122.1 TCP 54 5201 → 34482 [FIN, ACK] Seq=1 Ack=1611090146 Win=778240 Len=0 Data in frame #1069 is iperf3 ending the test. This didn't happen before e2142825c120 ("net: tcp: send zero-window ACK when no memory") so it's a relatively recent (17 months) regression. It actually looks pretty simple (and rather serious) to me.
On Mon, 27 Jan 2025 11:06:07 +0100 Eric Dumazet <edumazet(a)google.com> wrote:On Mon, Jan 27, 2025 at 11:01 AM Stefano Brivio <sbrivio(a)redhat.com> wrote:Not really :( because to reproduce this exact condition you need to somehow get the right amount of memory pressure so that you can actually establish a connection, start the transfer, and then exhaust the receive buffer at the right moment. And packetdrill doesn't do that. Sure, it would be great if it did, and it's probably a nice feature to implement... given enough time. Given less time, I guess fixing regressions has a higher priority. One could perhaps tweak sk->sk_rcvbuf as you suggested but that just artificially reproduces one part of it. It's not a really fitting test. For example: when would you increase it back?On Fri, 24 Jan 2025 12:40:16 -0500 Jon Maloy <jmaloy(a)redhat.com> wrote:With all that, it should be pretty easy to cook a packetdrill test, right ?I can certainly clear tp->pred_flags and post it again, maybe with an improved and shortened log. Would that be acceptable?Talking about an improved log, what strikes me the most of the whole problem is: $ tshark -r iperf3_jon_zero_window.pcap -td -Y 'frame.number in { 1064 .. 1068 }' 1064 0.004416 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611679466 Ack=1 Win=36864 Len=65480 1065 0.007334 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611744946 Ack=1 Win=36864 Len=65480 1066 0.005104 192.168.122.1 → 192.168.122.198 TCP 56382 [TCP Window Full] 34482 → 5201 [ACK] Seq=1611810426 Ack=1 Win=36864 Len=56328 1067 0.015226 192.168.122.198 → 192.168.122.1 TCP 54 [TCP ZeroWindow] 5201 → 34482 [ACK] Seq=1 Ack=1611090146 Win=0 Len=0 1068 6.298138 fe80::44b3:f5ff:fe86:c529 → ff02::2 ICMPv6 70 Router Solicitation from 46:b3:f5:86:c5:29 ...and then the silence, 192.168.122.198 never announces that its window is not zero, so the peer gives up 15 seconds later: $ tshark -r iperf3_jon_zero_window_cut.pcap -td -Y 'frame.number in { 1069 .. 1070 }' 1069 8.709313 192.168.122.1 → 192.168.122.198 TCP 55 34466 → 5201 [ACK] Seq=166 Ack=5 Win=36864 Len=1 1070 0.008943 192.168.122.198 → 192.168.122.1 TCP 54 5201 → 34482 [FIN, ACK] Seq=1 Ack=1611090146 Win=778240 Len=0 Data in frame #1069 is iperf3 ending the test. This didn't happen before e2142825c120 ("net: tcp: send zero-window ACK when no memory") so it's a relatively recent (17 months) regression. It actually looks pretty simple (and rather serious) to me.packetdrill tests are part of tools/testing/selftests/net/ already, we are not asking for something unreasonable.I would agree, in general, except that I don't see a way to craft a test like this with packetdrill. At least not trivially with the current feature set. On top of that, this is not a new feature, it's a fix for a regression (that was introduced without adding any test, of course). And the fix itself was definitely tested, just not with packetdrill. Requesting that tests are 1. automated and 2. written with a specific tool is something I can quite understand for general convenience, but I don't think it always makes sense. Especially as this fix has been blocked for about 9 months now because of the fact that automating a test for it is quite hard. -- Stefano
On Mon, Jan 27, 2025 at 6:01 PM Stefano Brivio <sbrivio(a)redhat.com> wrote:On Fri, 24 Jan 2025 12:40:16 -0500 Jon Maloy <jmaloy(a)redhat.com> wrote:I remembered last time it really also took me some time to totally follow. Packetdrill should be helpful :) As to the patch itself, I agreed with this fix last time while now I have to re-read that long analysis to recall as much as possible. I'm not that sure if it's a bug belonging to the Linux kernel. The other side not sending a window probe causes this issue...? The other part of me says we cannot break the user's behaviour. One way or another, I will also take a look at it again. Thanks, JasonI can certainly clear tp->pred_flags and post it again, maybe with an improved and shortened log. Would that be acceptable?Talking about an improved log, what strikes me the most of the whole problem is: $ tshark -r iperf3_jon_zero_window.pcap -td -Y 'frame.number in { 1064 .. 1068 }' 1064 0.004416 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611679466 Ack=1 Win=36864 Len=65480 1065 0.007334 192.168.122.1 → 192.168.122.198 TCP 65534 34482 → 5201 [ACK] Seq=1611744946 Ack=1 Win=36864 Len=65480 1066 0.005104 192.168.122.1 → 192.168.122.198 TCP 56382 [TCP Window Full] 34482 → 5201 [ACK] Seq=1611810426 Ack=1 Win=36864 Len=56328 1067 0.015226 192.168.122.198 → 192.168.122.1 TCP 54 [TCP ZeroWindow] 5201 → 34482 [ACK] Seq=1 Ack=1611090146 Win=0 Len=0 1068 6.298138 fe80::44b3:f5ff:fe86:c529 → ff02::2 ICMPv6 70 Router Solicitation from 46:b3:f5:86:c5:29 ...and then the silence, 192.168.122.198 never announces that its window is not zero, so the peer gives up 15 seconds later: $ tshark -r iperf3_jon_zero_window_cut.pcap -td -Y 'frame.number in { 1069 .. 1070 }' 1069 8.709313 192.168.122.1 → 192.168.122.198 TCP 55 34466 → 5201 [ACK] Seq=166 Ack=5 Win=36864 Len=1 1070 0.008943 192.168.122.198 → 192.168.122.1 TCP 54 5201 → 34482 [FIN, ACK] Seq=1 Ack=1611090146 Win=778240 Len=0 Data in frame #1069 is iperf3 ending the test. This didn't happen before e2142825c120 ("net: tcp: send zero-window ACK when no memory") so it's a relatively recent (17 months) regression. It actually looks pretty simple (and rather serious) to me.
On Mon, 27 Jan 2025 18:17:28 +0800 Jason Xing <kerneljasonxing(a)gmail.com> wrote:I'm not that sure if it's a bug belonging to the Linux kernel.It is, because for at least 20-25 years (before that it's a bit hard to understand from history) a non-zero window would be announced, as obviously expected, once there's again space in the receive window.The other side not sending a window probe causes this issue...?It doesn't cause this issue, but it triggers it.The other part of me says we cannot break the user's behaviour.This sounds quite relevant, yes. -- Stefano
On Mon, Jan 27, 2025 at 6:32 PM Stefano Brivio <sbrivio(a)redhat.com> wrote:On Mon, 27 Jan 2025 18:17:28 +0800 Jason Xing <kerneljasonxing(a)gmail.com> wrote:Sorry for the late reply. I think the key of this problem is what should we do when we receive a tcp packet and we are out of memory. The RFC doesn't define such a thing, so in the commit e2142825c120 ("net: tcp: send zero-window ACK when no memory"), I reply with a zero-window ACK to the peer. And the peer will keep probing the window by retransmitting the packet that we dropped if the peer is a LINUX SYSTEM. As I said, the RFC doesn't define such a case, so the behavior of the peer is undefined if it is not a LINUX SYSTEM. If the peer doesn't keep retransmitting the packet, it will hang the connection, just like the problem that described in this commit log. However, we can make some optimization to make it more adaptable. We can send a ACK with the right window to the peer when the memory is available, and __tcp_cleanup_rbuf() is a good choice. Generally speaking, I think this patch makes sense. However, I'm not sure if there is any other influence if we make "tp->rcv_wnd=0", but it can trigger a ACK in __tcp_cleanup_rbuf(). Following is the code that I thought before to optimize this case (the code is totally not tested): diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 3c82fad904d4..bedd78946762 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -116,7 +116,8 @@ struct inet_connection_sock { #define ATO_BITS 8 __u32 ato:ATO_BITS, /* Predicted tick of soft clock */ lrcv_flowlabel:20, /* last received ipv6 flowlabel */ - unused:4; + is_oom:1, + unused:3; unsigned long timeout; /* Currently scheduled timeout */ __u32 lrcvtime; /* timestamp of last received data packet */ __u16 last_seg_size; /* Size of last incoming segment */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 0d704bda6c41..6f3c85a1f4da 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1458,11 +1458,11 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len) */ void __tcp_cleanup_rbuf(struct sock *sk, int copied) { + struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); bool time_to_ack = false; if (inet_csk_ack_scheduled(sk)) { - const struct inet_connection_sock *icsk = inet_csk(sk); if (/* Once-per-two-segments ACK was not sent by tcp_input.c */ tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss || @@ -1502,6 +1502,11 @@ void __tcp_cleanup_rbuf(struct sock *sk, int copied) time_to_ack = true; } } + if (unlikely(icsk->icsk_ack.is_oom)) { + icsk->icsk_ack.is_oom = false; + time_to_ack = true; + } + if (time_to_ack) tcp_send_ack(sk); } diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 0e5b9a654254..e2d65213b3b7 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -268,9 +268,12 @@ static u16 tcp_select_window(struct sock *sk) * are out of memory. The window is temporary, so we don't store * it on the socket. 
 	 */
-	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+		inet_csk(sk)->icsk_ack.is_oom = true;
 		return 0;
+	}
 
+	inet_csk(sk)->icsk_ack.is_oom = false;
 	cur_win = tcp_receive_window(tp);
 	new_win = __tcp_select_window(sk);
 	if (new_win < cur_win) {
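A rough user-space model of the idea in the diff above, to show the intended control flow; the names below are illustrative only, not the kernel API: under memory pressure the window selection returns zero and latches a flag instead of recording anything, and the next cleanup pass forces an ACK that re-advertises whatever window is available by then:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-ins for the icsk/tp state touched by the diff. */
static bool ack_nomem;	/* models ICSK_ACK_NOMEM		*/
static bool is_oom;	/* models the proposed icsk_ack.is_oom	*/

static unsigned int select_window(unsigned int free_space)
{
	if (ack_nomem) {	/* out of memory: advertise zero ...	*/
		is_oom = true;	/* ... but remember that we did so	*/
		return 0;
	}
	is_oom = false;
	return free_space;
}

static void cleanup_rbuf(unsigned int free_space, bool time_to_ack)
{
	if (is_oom) {		/* memory pressure is over: force an ACK */
		is_oom = false;
		time_to_ack = true;
	}
	if (time_to_ack)
		printf("ACK, window %u\n", select_window(free_space));
}

int main(void)
{
	ack_nomem = true;
	printf("drop, advertise %u\n", select_window(0));	/* zero window */

	ack_nomem = false;		/* reader drained the queue */
	cleanup_rbuf(262144, false);	/* no ACK would normally be due here;
					 * is_oom forces one anyway */
	return 0;
}

The trade-off in this variant is that the on-socket window fields stay untouched and a spare bit in icsk_ack is spent to remember that a zero window went out.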
On Mon, 27 Jan 2025 21:37:23 +0800
Menglong Dong <menglong8.dong(a)gmail.com> wrote:

> On Mon, Jan 27, 2025 at 6:32 PM Stefano Brivio <sbrivio(a)redhat.com> wrote:
> > On Mon, 27 Jan 2025 18:17:28 +0800
> > Jason Xing <kerneljasonxing(a)gmail.com> wrote:
>
> Sorry for the late reply. I think the key of this problem is what we
> should do when we receive a tcp packet and we are out of memory. The
> RFC doesn't define such a thing,

Why not? RFC 9293, 3.8.6:

  There is an assumption that this is related to the data buffer space
  currently available for this connection.

That is, out-of-memory -> zero window.

> > > I'm not that sure if it's a bug belonging to the Linux kernel.
> >
> > It is, because for at least 20-25 years (before that it's a bit hard
> > to understand from history) a non-zero window would be announced, as
> > obviously expected, once there's again space in the receive window.
>
> so in the commit e2142825c120 ("net: tcp: send zero-window ACK when no
> memory"), I reply with a zero-window ACK to the peer.

Your patch is fundamentally correct, nobody is disputing that. The
problem is that it introduces a side effect, because it gets the notion
of "current window" out of sync by sending a one-off packet with a zero
window, without recording that.

> And the peer will keep probing the window by retransmitting the packet
> that we dropped if the peer is a LINUX SYSTEM. As I said, the RFC
> doesn't define such a case, so the behavior of the peer is undefined
> if it is not a LINUX SYSTEM. If the peer doesn't keep retransmitting
> the packet, it will hang the connection, just like the problem
> described in this commit log.

It's not undefined. RFC 9293 3.8.6.1 (just like RFC 1122 4.2.2.17,
RFC 793 3.7) requires zero-window probes. But keeping the window closed
indefinitely if there's no zero-window probe is a regression anyway:

- a retransmission timeout must elapse (RFC 9293 3.8.1) before the
  zero-window probe is sent, so relying on zero-window probes means
  introducing an unnecessary delay

- if the peer (as was the case here) fails to send a zero-window probe
  for whatever reason, things break. This is a userspace breakage,
  regardless of the fact that the peer should send a zero-window probe

> However, we can make some optimization to make it more adaptable. We
> can send an ACK with the right window to the peer when the memory is
> available, and __tcp_cleanup_rbuf() is a good choice. Generally
> speaking, I think this patch makes sense. However, I'm not sure if
> there is any other influence if we make "tp->rcv_wnd = 0", but it can
> trigger an ACK in __tcp_cleanup_rbuf().

I don't understand what your concern is with the patch that was
proposed (and tested quite thoroughly, by the way).

> Following is the code that I thought of before to optimize this case
> (the code is totally untested):
>
> [...]

--
Stefano
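To make the "recording" point concrete: since the receiver's own notion of the window is computed from rcv_wup and rcv_wnd, recording the one-off zero window means updating those two fields at the moment the zero-window ACK is generated, so tcp_receive_window() returns 0 from then on and the usual window-growth path can fire once memory is back. A minimal sketch of that direction inside tcp_select_window(), not the exact patch under discussion (clearing tp->pred_flags, as also discussed in the thread, would belong here too):

	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
		/* We are about to tell the peer "window 0"; record it so
		 * tcp_receive_window() stops returning the stale pre-squeeze
		 * value and the reader side can re-open the window later.
		 */
		tp->rcv_wnd = 0;
		tp->rcv_wup = tp->rcv_nxt;
		return 0;
	}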
On Fri, Jan 24, 2025 at 6:40 PM Jon Maloy <jmaloy(a)redhat.com> wrote:
>
> On 2025-01-20 11:22, Eric Dumazet wrote:
> > Yes.
> >
> > On Mon, Jan 20, 2025 at 5:10 PM Jon Maloy <jmaloy(a)redhat.com> wrote:
> > >
> > > On 2025-01-20 00:03, Jon Maloy wrote:
> > > >
> > > > [...]
>
> Neal, Eric,
>
> How do you suggest we proceed with this? I downloaded packetdrill and
> tried it a bit, but to understand it well enough to introduce a new
> feature would require more time than I am able to spend on this. Maybe
> Neal, who I see is one of the contributors to packetdrill, could help
> out?
>
> I can certainly clear tp->pred_flags and post it again, maybe with an
> improved and shortened log. Would that be acceptable?

We certainly can and should add a feature like that to packetdrill.
Documentation/fault-injection/ has some relevant information.

Even without this, tcp_try_rmem_schedule() is reading sk->sk_rcvbuf,
which could be lowered by a packetdrill script, I think.

> > I agree with Eric that probably tp->pred_flags should be cleared, and
> > a packetdrill test for this would be super-helpful.
>
> I must admit I have never used packetdrill, but I can make an effort.
> I hear from other sources that you cannot force a memory exhaustion
> with packetdrill anyway, so this sounds like a pointless exercise.
>
> I also made a run where I looked into why __tcp_select_window()
> ignores all the space that has been freed up:
>
> tcp_recvmsg_locked(->)
> __tcp_cleanup_rbuf(->) (copied 131072)
> tp->rcv_wup: 1788299855, tp->rcv_wnd: 5812224, tp->rcv_nxt 1793800175
> __tcp_select_window(->)
> tcp_space(->) tcp_space(<-) returning 458163
> free_space = round_down(458163, 1 << 4096) = 454656
> (free_space > tp->rcv_ssthresh) --> free_space = tp->rcv_ssthresh = 261920
> window = ALIGN(261920, 4096) = 262144
> __tcp_select_window(<-) returning 262144
> [rcv_win_now 311904, 2 * rcv_win_now 623808, new_window 262144]
> (new_window >= (2 * rcv_win_now)) ? --> time_to_ack 0
> NOT calling tcp_send_ack()
> __tcp_cleanup_rbuf(<-) [tp->rcv_wup 1788299855, tp->rcv_wnd 5812224, tp->rcv_nxt 1793800175]
> tcp_recvmsg_locked(<-) returning 131072 bytes.
> [tp->rcv_nxt 1793800175, tp->rcv_wnd 5812224, tp->rcv_wup 1788299855,
>  sk->last_ack 0, tcp_receive_win() 311904,
>  copied_seq 1788299855->1788395953 (96098), unread 5404222,
>  sk_rcv_qlen 83, ofo_qlen 0]
>
> As we see, tp->rcv_ssthresh is the limiting factor, causing a
> consistent situation where (new_window < (rcv_win_now * 2)), and even
> (new_window < rcv_win_now).

Your changelog could simply explain this, in one sentence, instead of
lengthy traces.
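Spelled out with the numbers from the quoted trace: the candidate window is clamped to tp->rcv_ssthresh and rounded to the 4096-byte granularity shown, while the stale tcp_receive_window() value is still 311904, so the "at least twice the current window" test seen in the trace can never succeed. A small standalone check, with the constants taken from the trace above:

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	unsigned int rcv_ssthresh = 261920;	/* caps the candidate window	  */
	unsigned int free_space   = 454656;	/* plenty of room after the reads */
	unsigned int rcv_win_now  = 311904;	/* stale tcp_receive_window()	  */

	/* Window selection as shown in the trace: clamp the free space to
	 * rcv_ssthresh, then round up to the 4096-byte granularity.
	 */
	unsigned int new_window = free_space;
	if (new_window > rcv_ssthresh)
		new_window = rcv_ssthresh;
	new_window = (new_window + 4095) & ~4095u;	/* -> 262144 */

	/* The read path only ACKs if the window at least doubles. */
	bool time_to_ack = new_window >= 2 * rcv_win_now;

	printf("new_window %u, rcv_win_now %u, time_to_ack %d\n",
	       new_window, rcv_win_now, time_to_ack);
	return 0;
}

Which is roughly the one-sentence summary being asked for: rcv_ssthresh keeps new_window below even the window the receiver still thinks it has advertised, so the read path never decides it is time to send an ACK.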