A bug in kernel TCP may lead to a deadlock where a zero window is sent from the peer, while it is unable to send out window updates even after reads have freed up enough buffer space to permit a larger window. In this situation, new window advertisemnts from the peer can only be triggered by packets arriving from this side. However, such packets are never sent, because the zero-window condition currently prevents this side from sending out any packets whatsoever to the peer. We notice that the above bug is triggered *only* after the peer has dropped an arriving packet because of severe memory squeeze, and that we hence always enter a retransmission situation when this occurs. This also means that it goes against the RFC 9293 recommendation that a previously advertised window never should shrink. RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find the following statement: "A TCP receiver SHOULD NOT shrink the window, i.e., move the right window edge to the left (SHLD-14). However, a sending TCP peer MUST be robust against window shrinking, which may cause the "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34). If this happens, the sender SHOULD NOT send new data (SHLD-15), but SHOULD retransmit normally the old unacknowledged data between SND.UNA and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data beyond SND.UNA+SND.WND (MAY-7)" We never see the window become negative, but we interpret this as a recommendation to use the previously available window during retransmission even when the currently advertised window is zero. We use the above mechanism only for timer-induced retransmits, while the fast-retransmit mechanism won't trigger on this condition. It should be noted that although this solves the problem we have at hand, it is not a genuine solution to the kernel bug. There may well be TCP stacks around in other OS-es which don't do this, nor have keep-alive probing as an alternatve way to solve the situation. Signed-off-by: Jon Maloy <jmaloy(a)redhat.com> --- v2: - Using previously advertised window during retransmission, instead highest send sequencece number in the cycle. v3: - Rebased to newest code - Changes based on feedback from PASST team - Sending out empty probe message at timer expiration when we are not in retransmit situation. v4: - Some small changes based on feedback from PASST team. - Replaced fast retransmit with a one-time 'fast probe' when window is zero. v5: - Gave up on 'fast probing' for now. When I got the sequence numbers right in the flag message (after having emptied the tap queue), it turns out an empty message does *not* force a new peer window update as was my previous understanding of the code. - Added cppcheck suppression line (which I was unable to verify) as suggested by S. Brivio. - Removed sending of empty probe when window from tap side is zero. It looks pointless at the moment, at least for solving the above described situation. v6: - Ensure that arrival of new data doesn´t cause us to ignore a zero-window situation. - Removed the pointless probing referred to in v5 comment. --- tcp.c | 26 ++++++++++++++++++++------ tcp_conn.h | 2 ++ 2 files changed, 22 insertions(+), 6 deletions(-) diff --git a/tcp.c b/tcp.c index fa13292..38c3480 100644 --- a/tcp.c +++ b/tcp.c @@ -1764,9 +1764,17 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn, */ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) { + uint32_t wnd_edge; + wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap); + /* cppcheck-suppress [knownConditionTrueFalse, unmatchedSuppression] */ + conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX); + wnd_edge = conn->seq_ack_from_tap + wnd; + if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap)) + conn->seq_wnd_edge_from_tap = wnd_edge; + /* FIXME: reflect the tap-side receiver's window back to the sock-side * sender by adjusting SO_RCVBUF? */ } @@ -1799,6 +1807,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns; + conn->seq_wnd_edge_from_tap = conn->seq_to_tap; } /** @@ -2208,13 +2217,12 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, */ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) { - uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap; int fill_bufs, send_bufs = 0, last_len, iov_rem = 0; int sendlen, len, dlen, v4 = CONN_V4(conn); + uint32_t already_sent, max_send, seq; int s = conn->sock, i, ret = 0; struct msghdr mh_sock = { 0 }; uint16_t mss = MSS_GET(conn); - uint32_t already_sent, seq; struct iovec *iov; /* How much have we read/sent since last received ack ? */ @@ -2228,19 +2236,24 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) tcp_set_peek_offset(s, 0); } - if (!wnd_scaled || already_sent >= wnd_scaled) { + /* How much are we still allowed to send within current window ? */ + max_send = conn->seq_wnd_edge_from_tap - conn->seq_to_tap; + if (SEQ_LE(max_send, 0)) { + flow_trace(conn, "Window full: right edge: %u, sent: %u", + conn->seq_wnd_edge_from_tap, conn->seq_to_tap); + conn->seq_wnd_edge_from_tap = conn->seq_to_tap; conn_flag(c, conn, STALLED); conn_flag(c, conn, ACK_FROM_TAP_DUE); return 0; } /* Set up buffer descriptors we'll fill completely and partially. */ - fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss); + fill_bufs = DIV_ROUND_UP(max_send, mss); if (fill_bufs > TCP_FRAMES) { fill_bufs = TCP_FRAMES; iov_rem = 0; } else { - iov_rem = (wnd_scaled - already_sent) % mss; + iov_rem = max_send % mss; } /* Prepare iov according to kernel capability */ @@ -2347,6 +2360,7 @@ err: * * Return: count of consumed packets */ + static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn, const struct pool *p, int idx) { @@ -2950,7 +2964,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events) if (events & (EPOLLRDHUP | EPOLLHUP)) conn_event(c, conn, SOCK_FIN_RCVD); - if (events & EPOLLIN) + if (events & EPOLLIN && conn->wnd_from_tap) tcp_data_from_sock(c, conn); if (events & EPOLLOUT) diff --git a/tcp_conn.h b/tcp_conn.h index d280b22..5cbad2a 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -30,6 +30,7 @@ * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap * @seq_ack_from_tap: Last ACK number received from tap + * @seq_wnd_edge_from_tap: Right edge of last non-zero window from tap * @seq_from_tap: Next sequence for packets from tap (not actually sent) * @seq_ack_to_tap: Last ACK number sent to tap * @seq_init_from_tap: Initial sequence number from tap @@ -101,6 +102,7 @@ struct tcp_tap_conn { uint32_t seq_to_tap; uint32_t seq_ack_from_tap; + uint32_t seq_wnd_edge_from_tap; uint32_t seq_from_tap; uint32_t seq_ack_to_tap; uint32_t seq_init_from_tap; -- 2.42.0