On Wed, 3 Apr 2024 18:58:33 -0400
Jon Maloy <jmaloy(a)redhat.com> wrote:
Testing of the previous commit ("tcp: add support for SO_PEEK_OFF")
in this series along with the pasta protocol splicer revealed a bug in
the way tcp handles window advertising during extreme memory squeeze
situations.
The following excerpt from a logging session shows what is happening:
[5201<->54494]: ==== Activating log @ tcp_select_window()/268 ====
[5201<->54494]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) -->
TRUE
[5201<->54494]: tcp_select_window(<-) tp->rcv_wup: 2812454294,
tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
[5201<->54494]: ADVERTISING WINDOW SIZE 0
[5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294,
tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294,
tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: (win_now: 250164, new_win: 262144 >= (2 * win_now):
500328))? --> time_to_ack: 0
[5201<->54494]: NOT calling tcp_send_ack()
[5201<->54494]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294,
tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now:
250164, qlen: 83
[...]
[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294,
tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: (win_now: 250164, new_win: 262144 >= (2 * win_now):
500328))? --> time_to_ack: 0
[5201<->54494]: NOT calling tcp_send_ack()
[5201<->54494]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294,
tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now:
250164, qlen: 1
[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294,
tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: (win_now: 250164, new_win: 262144 >= (2 * win_now):
500328))? --> time_to_ack: 0
[5201<->54494]: NOT calling tcp_send_ack()
[5201<->54494]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294,
tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now:
250164, qlen: 0
[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294,
tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: NOT calling tcp_send_ack()
[5201<->54494]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294,
tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164,
qlen: 0
We can see that although we are advertising a window size of zero,
tp->rcv_wnd is not updated accordingly. This leads to a discrepancy
between this side's and the peer's view of the current window size.
- The peer thinks the window is zero, and stops sending.
- This side ends up in a cycle where it repeatedly calculates a new
window size it finds too small to advertise.
Hence no messages are received and no acknowledgments are sent, and
the situation remains locked even after the last queued receive buffer
has been consumed.
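
The numbers in the log can be checked by hand. Below is a minimal sketch in plain C, not kernel code: struct rcv_state and the three helpers are hypothetical names made up for illustration, and the only logic modelled is the comparison printed in the log (the real __tcp_cleanup_rbuf() has further conditions). The window is assumed to be computed as rcv_wup + rcv_wnd - rcv_nxt, which reproduces the logged win_now of 250164.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the three struct tcp_sock fields in the log. */
struct rcv_state {
	uint32_t rcv_wup;	/* rcv_nxt at the time of the last window update */
	uint32_t rcv_wnd;	/* window this side believes it has advertised */
	uint32_t rcv_nxt;	/* next sequence number expected */
};

/* Window still open from this side's point of view, clamped at zero. */
static uint32_t receive_window(const struct rcv_state *s)
{
	int32_t win = (int32_t)(s->rcv_wup + s->rcv_wnd - s->rcv_nxt);

	return win < 0 ? 0 : (uint32_t)win;
}

/* The comparison shown in the log: ACK only if the window at least doubles. */
static bool time_to_ack(const struct rcv_state *s, uint32_t new_win)
{
	return new_win && new_win >= 2 * receive_window(s);
}

/* What the patch stores when a zero window is advertised under memory pressure. */
static void advertise_zero_window(struct rcv_state *s)
{
	s->rcv_wnd = 0;
	s->rcv_wup = s->rcv_nxt;
}
```

With the logged values (rcv_wup 2812454294, rcv_wnd 5812224, rcv_nxt 2818016354) receive_window() yields 250164, so a new window of 262144 never reaches twice that and time_to_ack() stays false. Once advertise_zero_window() has run, receive_window() yields 0 and the same new window qualifies for an ACK.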
We fix this by setting tp->rcv_wnd to 0 before we return from the
function tcp_select_window() in this particular case.
Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.
Signed-off-by: Jon Maloy <jmaloy(a)redhat.com>
---
net/ipv4/tcp_output.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e3167ad96567..5803fd402708 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -264,8 +264,11 @@ static u16 tcp_select_window(struct sock *sk)
* are out of memory. The window is temporary, so we don't store
* it on the socket.

One nit: now that you do store it on the socket, you should probably
change this comment as well.

*/
- if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+ if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+ tp->rcv_wnd = 0;
+ tp->rcv_wup = tp->rcv_nxt;

...I'm wondering if you should set 'pred_flags' to 0, as it's done at
the end of the function for other cases where the window is advertised
as zero.
At least according to the comment to tcp_rcv_established() it looks
like it's needed:
* - A zero window was announced from us - zero window probing
* is only handled properly in the slow path.
return 0;
+ }
cur_win = tcp_receive_window(tp);
new_win = __tcp_select_window(sk);
The rest, including 1/2, looks good to me.
Good points. I'll fix those and post the patches with your
"Reviewed-by:".
/thx
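
For illustration only, here is one way the out-of-memory branch could look with the pred_flags suggestion folded in. This is a sketch against a mock structure, not the patch that was eventually posted: struct mock_sock, MOCK_ACK_NOMEM (including its value) and mock_select_zero_window() are hypothetical names, and the field ack_pending merely models inet_csk(sk)->icsk_ack.pending.

```c
#include <stdint.h>

#define MOCK_ACK_NOMEM 0x20	/* hypothetical stand-in for ICSK_ACK_NOMEM */

/* Hypothetical, heavily reduced model of the socket state involved. */
struct mock_sock {
	uint8_t  ack_pending;	/* models inet_csk(sk)->icsk_ack.pending */
	uint32_t pred_flags;	/* non-zero means the fast path may be used */
	uint32_t rcv_wnd;
	uint32_t rcv_wup;
	uint32_t rcv_nxt;
};

/* Returns 1 if the out-of-memory branch was taken and a zero window stored. */
static int mock_select_zero_window(struct mock_sock *sk)
{
	if (sk->ack_pending & MOCK_ACK_NOMEM) {
		sk->pred_flags = 0;		/* zero window: slow path only */
		sk->rcv_wnd = 0;		/* match what the peer is told */
		sk->rcv_wup = sk->rcv_nxt;
		return 1;
	}
	return 0;
}
```

Clearing pred_flags first mirrors what the review describes for the other zero-window cases at the end of tcp_select_window(); whether the ordering matters in the real function is not something this sketch can show.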